## LangChain + ChromaDB: Q&A over multiple files

- Multiple Files
- ChromaDB
- gpt-3.5-turbo API

In [None]:
!pip -q install chromadb==0.4.15 langchain==0.0.330 openai==0.28.1 tiktoken

In [None]:
!pip show langchain

In [2]:
import os

In [3]:
os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder: supply your own key and never commit a real one
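
A key pasted into a cell is easy to leak when the notebook is shared. As a minimal sketch, assuming the key was exported in the shell before the notebook was started, the cell could check for it instead of assigning it. The helper name `require_api_key` is hypothetical and not part of LangChain or the OpenAI client.

```python
import os


def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the key stored under `name`, or fail loudly if it is missing.

    `env` is any mapping; it defaults to the process environment.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before starting the notebook")
    return key


# Usage with a stand-in mapping, so the example runs without a real key.
demo_env = {"OPENAI_API_KEY": "  sk-example  "}
print(require_api_key(demo_env))
```

Calling `require_api_key()` with no arguments reads the real environment, so a missing key surfaces here rather than as an authentication error at the first embedding call.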

In [4]:
from langchain.document_loaders import TextLoader
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings

from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma


In [5]:
!unzip -q techcrunch_articles.zip -d articles

In [6]:
# load every .txt article in the directory as a separate document

loader = DirectoryLoader("./articles/", glob="./*.txt", loader_cls=TextLoader)
documents = loader.load()

In [None]:
documents

In [7]:
# split the text into chunks

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)


In [None]:
texts

In [8]:
len(texts)

111
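
To see what `chunk_size=1000` and `chunk_overlap=200` mean, here is a deliberately simplified sketch that cuts fixed-size character windows. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and newlines, so its chunks are usually shorter than the limit); the function name `sliding_window_chunks` is made up for illustration.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut `text` into character windows of at most `chunk_size`,
    where each window starts `chunk_size - chunk_overlap` after the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# A 2500-character toy string stands in for one article.
sample = "".join(str(i % 10) for i in range(2500))
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])          # [1000, 1000, 900]
print(chunks[0][-200:] == chunks[1][:200])  # True: neighbouring chunks share 200 characters
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.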

In [13]:
# Create a ChromaDB

In [9]:
persist_directory = "db"

embedding = OpenAIEmbeddings()

In [10]:
vectordb = Chroma.from_documents(
    documents = texts,
    embedding = embedding,
    persist_directory = persist_directory
)

In [21]:
# persist the db to disk, then drop the in-memory handle
# so the next cell has to reload the index from persist_directory
vectordb.persist()
vectordb = None

In [22]:
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function = embedding
)

In [23]:
# Create a retriever

In [45]:
retriever = vectordb.as_retriever()

In [56]:
retriever = vectordb.as_retriever(search_kwargs={"k": 4})

In [47]:
retriever.search_type

'similarity'

In [48]:
retriever.search_kwargs

{'k': 4}
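
The `'similarity'` search type with a `k` value boils down to "embed the query, score it against every stored chunk, keep the best `k`". The sketch below shows that idea in plain Python with hand-written 3-dimensional vectors. Cosine similarity is used here only for illustration; the distance metric Chroma actually applies depends on how the collection is configured, and the toy vectors and ids are invented.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed_vecs, k=4):
    """Return the ids of the k stored vectors most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), doc_id)
        for doc_id, vec in indexed_vecs.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Invented embeddings: each id stands for one stored chunk.
toy_index = {
    "chunk-a": [0.9, 0.1, 0.0],
    "chunk-b": [0.1, 0.9, 0.0],
    "chunk-c": [0.0, 0.1, 0.9],
}
print(top_k([1.0, 0.0, 0.0], toy_index, k=2))  # ['chunk-a', 'chunk-b']
```

A real vector store replaces the linear scan with an approximate nearest-neighbour index, but the contract is the same: `k` controls how many chunks are handed to the LLM.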

In [49]:
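# gpt-3.5-turbo is a chat model, so use the chat wrapper rather than the completion-style OpenAI class
from langchain.chat_models import ChatOpenAI

turbo_llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo")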

In [57]:
qa_chain = RetrievalQA.from_chain_type(
    llm = turbo_llm,
    chain_type="stuff",
    retriever = retriever,
    return_source_documents=True
)
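
The `"stuff"` chain type places all retrieved chunks into a single prompt and makes one LLM call. A rough sketch of that assembly step follows; the prompt wording and the function name `build_stuff_prompt` are assumptions for illustration and not LangChain's actual template.

```python
def build_stuff_prompt(question, chunks):
    """Join every retrieved chunk into one context block, then append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder strings stand in for the page_content of retrieved chunks.
retrieved = ["<text of first chunk>", "<text of second chunk>"]
prompt = build_stuff_prompt("What is the news about Pando?", retrieved)
print(prompt)
```

Because everything goes into one prompt, the combined size of the `k` chunks plus the question has to fit in the model's context window, which is one reason to keep `chunk_size` and `k` modest.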

In [58]:
query = "What is the news about Pando?"
llm_response = qa_chain(query)
llm_response

{'query': 'What is the news about Pando?',
 'result': "The news about Pando is that it has raised $30 million in a Series B funding round, bringing its total raised to $45 million. The funding will be used to expand Pando's global sales, marketing, and delivery capabilities. Pando is a startup developing fulfillment management technologies for global logistics operations.",
 'source_documents': [Document(page_content='Pando was co-launched by Jayakrishnan and Abhijeet Manohar, who previously worked together at iDelivery, an India-based freight tech marketplace — and their first startup. The two saw firsthand manufacturers, distributors and retailers were struggling with legacy tech and point solutions to understand, optimize and manage their global logistics operations — or at least, that’s the story Jayakrishnan tells.\n\n“Supply chain leaders were trying to build their own tech and throwing people at the problem,” he said. “This caught our attention — we spent months talking to and b

In [59]:
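# helper function to display the answer followed by the source file of each retrieved chunk

def process_llm_response(llm_response):
    """Print the chain's answer, then one source path per retrieved chunk."""
    print(llm_response["result"])
    print('\n\nSources:')
    for source in llm_response["source_documents"]:
        print(source.metadata['source'])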

In [60]:
query = "What is the news about Pando?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

The news about Pando is that it has raised $30 million in a Series B funding round, bringing its total raised to $45 million. The funding will be used to expand Pando's global sales, marketing, and delivery capabilities. Pando is a startup developing fulfillment management technologies for global logistics operations.


Sources:
articles/05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt
articles/05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt
articles/05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt
articles/05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt
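
The same path appears four times because each of the four retrieved chunks came from the same article. If one line per file is preferred, the sources can be de-duplicated while keeping their order. This is a sketch: `FakeDocument` is a hypothetical stand-in that only mimics the `metadata` attribute of LangChain's `Document`, and the paths are invented.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Minimal stand-in for a retrieved document (hypothetical, for this example only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def unique_sources(source_documents):
    """Return each source path once, in order of first appearance."""
    # dict keys keep insertion order, so this removes repeats without sorting.
    return list(dict.fromkeys(doc.metadata["source"] for doc in source_documents))


docs = [
    FakeDocument("chunk 1", {"source": "articles/one.txt"}),
    FakeDocument("chunk 2", {"source": "articles/one.txt"}),
    FakeDocument("chunk 3", {"source": "articles/two.txt"}),
]
print(unique_sources(docs))  # ['articles/one.txt', 'articles/two.txt']
```

Inside the display helper, looping over `unique_sources(llm_response["source_documents"])` instead of the raw list would print each file once.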


In [61]:
query = "What is the news about databricks?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

The news about Databricks is that they have acquired Okera, a data governance platform with a focus on AI.


Sources:
articles/05-03-databricks-acquires-ai-centric-data-governance-platform-okera.txt
articles/05-03-databricks-acquires-ai-centric-data-governance-platform-okera.txt
articles/05-03-databricks-acquires-ai-centric-data-governance-platform-okera.txt
articles/05-03-databricks-acquires-ai-centric-data-governance-platform-okera.txt
