# Getting started with ChromaDB 

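Before re-running this notebook, delete the contents of the `db` folder (the `persist_directory` used below) and restart the kernel. Otherwise, the same chunks are embedded and inserted again, which duplicates rows in the database.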
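
The cleanup step can also be done from Python instead of by hand. The following is a minimal sketch, assuming the database lives in a local folder named `db`; the helper name `reset_persist_directory` is hypothetical and not part of Chroma or LangChain.

```python
import shutil
from pathlib import Path


def reset_persist_directory(path="db"):
    """Delete `path` and everything in it, then recreate it as an empty folder.

    Hypothetical helper for this notebook; not a Chroma or LangChain API.
    """
    target = Path(path)
    if target.exists():
        shutil.rmtree(target)  # removes the folder and all stored files
    target.mkdir(parents=True, exist_ok=True)
    return target


# Usage (run once before the first cell, then restart the kernel):
# reset_persist_directory("db")
```

The call is left commented out because it deletes data; uncomment it only when a fresh database is wanted.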

## References

- [YouTube Course](https://youtu.be/8KrTO9bS91s?si=rEKPcDYKbav56GQj)
- [GitHub Repo](https://github.com/entbappy/Complete-Generative-AI-Course-on-YouTube/blob/main/Vector%20Database/1.Chroma_DB_demo.ipynb)

In [None]:
import os
import dotenv

dotenv_path = dotenv.find_dotenv()
dotenv.load_dotenv(dotenv_path)

In [1]:
# Newer LangChain releases ship these integrations in separate packages
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_openai import OpenAI, OpenAIEmbeddings

In [2]:
# Load every .txt article in the data folder as a LangChain Document
lightning_path = '/teamspace/studios/this_studio/woodshed/ai/notebooks/ChromaDB/data/articles'
loader = DirectoryLoader(lightning_path, glob="./*.txt", loader_cls=TextLoader)
document = loader.load()

In [3]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split into chunks of up to 1000 characters, with 200 characters of overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
text = text_splitter.split_documents(document)
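
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-window chunker. It is only an illustration: `RecursiveCharacterTextSplitter` prefers to split on separators such as paragraph and line breaks, so its real chunks will differ. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut `text` into windows of `chunk_size` characters.

    Each window starts `chunk_size - chunk_overlap` characters after the
    previous one, so neighbouring chunks share `chunk_overlap` characters.
    Simplified illustration only, not LangChain's algorithm.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


demo = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo)
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk.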

In [4]:
# len(text)
text[1]

Document(page_content='But the Alliance of Motion Picture and Television Producers (AMPTP) refused to engage with that proposal, instead offering a yearly meeting to discuss “advances in technology.”\n\n“When we first put [the proposal] in, we thought we were covering our bases — you know, some of our members are worried about this, the area is moving quickly, we should get ahead of it,” Conover said. “We didn’t think it’d be a contentious issue because the fact of the matter is, the current state of the text-generation technology is completely incapable of writing any work that could be used in a production.”', metadata={'source': '/teamspace/studios/this_studio/woodshed/ai/notebooks/ChromaDB/data/articles/05-03-ai-replace-tv-writers-strike.txt'})

In [5]:
from langchain_openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

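# Folder where Chroma stores the collection on disk
persist_directory = 'db'

embedding = OpenAIEmbeddings()

# Embed the chunks and store them under persist_directory
vectordb = Chroma.from_documents(
    documents=text,
    embedding=embedding,
    persist_directory=persist_directory
)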

In [6]:
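# Persist the db to disk, then drop the in-memory handle.
# Note: recent Chroma versions persist automatically when a persist_directory
# is set, so persist() may emit a deprecation warning there.
vectordb.persist()
vectordb = None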


In [7]:

# Now we can load the persisted database from disk and use it as normal.
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)

In [8]:

retriever = vectordb.as_retriever()

docs = retriever.get_relevant_documents(
    "How much money did Microsoft raise?"
)



In [9]:

len(docs)

4

In [10]:
docs

[Document(page_content='April 28, 2023\n\nVC firms including Sequoia Capital, Andreessen Horowitz, Thrive and K2 Global are picking up new shares, according to documents seen by TechCrunch. A source tells us Founders Fund is also investing. Altogether the VCs have put in just over $300 million at a valuation of $27 billion to $29 billion. This is separate to a big investment from Microsoft announced earlier this year, a person familiar with the development told TechCrunch, which closed in January. The size of Microsoft’s investment is believed to be around $10 billion, a figure we confirmed with our source.\n\nApril 25, 2023\n\nCalled ChatGPT Business, OpenAI describes the forthcoming offering as “for professionals who need more control over their data as well as enterprises seeking to manage their end users.”', metadata={'source': '/teamspace/studios/this_studio/woodshed/ai/notebooks/ChromaDB/data/articles/05-03-chatgpt-everything-you-need-to-know-about-the-ai-powered-chatbot.txt'}),


In [11]:

retriever = vectordb.as_retriever(search_kwargs={"k": 2})

In [12]:

retriever.search_type


'similarity'

In [13]:

retriever.search_kwargs

{'k': 2}
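
The `'similarity'` search type with `k=2` returns the two stored chunks whose embeddings are closest to the query embedding. The sketch below shows the idea with tiny hand-written vectors and cosine similarity. It is a toy model: the vectors are made up, and Chroma's actual distance metric and index structure may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Made-up 2-d "embeddings" standing in for real chunk embeddings
doc_vecs = [[1, 0], [0, 1], [1, 1], [-1, 0]]
print(top_k([1, 0.1], doc_vecs, k=2))
```

A smaller `k` sends less context to the LLM, which lowers cost but can miss relevant chunks.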

## Make a chain

In [18]:
from langchain.chains import RetrievalQA

llm = OpenAI()

# Create the chain to answer questions, reusing the llm instance above
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)
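
The `"stuff"` chain type places all retrieved chunks into a single prompt together with the question. A rough sketch of that idea follows. The prompt wording and the function name `build_stuff_prompt` are assumptions for illustration, not LangChain's actual template.

```python
def build_stuff_prompt(question, chunks):
    """Join all retrieved chunks into one context block and append the question.

    Illustrative only; LangChain's real prompt template is worded differently.
    """
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


print(build_stuff_prompt(
    "How much money did Microsoft raise?",
    ["First retrieved chunk.", "Second retrieved chunk."],
))
```

Because every chunk goes into one prompt, this approach only works while `k` chunks of `chunk_size` characters fit in the model's context window.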

In [19]:

# Cite sources
def process_llm_response(llm_response):
    """Print the answer, then the source file of each retrieved chunk."""
    print(llm_response['result'])
    print('\n\nSources:')
    for source in llm_response["source_documents"]:
        print(source.metadata['source'])

In [20]:
# full example
query = "How much money did Microsoft raise?"
llm_response = qa_chain(query)
process_llm_response(llm_response)


The amount of money that Microsoft raised is believed to be around $10 billion.


Sources:
/teamspace/studios/this_studio/woodshed/ai/vectordb/chromadb/data/articles/05-03-chatgpt-everything-you-need-to-know-about-the-ai-powered-chatbot.txt
/teamspace/studios/this_studio/woodshed/ai/vectordb/chromadb/data/articles/05-03-checks-the-ai-powered-data-protection-project-incubated-in-area-120-officially-exits-to-google.txt


In [21]:
# second example: a different query against the same chain
query = "What is the news about Pando?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 Pando has raised $30 million in a Series B round, bringing its total raised to $45 million. The funding will be used to expand Pando's global sales, marketing, and delivery capabilities, and they are open to exploring strategic partnerships and acquisitions. Pando was co-launched by Nitin Jayakrishnan and Abhijeet Manohar to solve global logistics issues through a software-as-a-service platform.


Sources:
/teamspace/studios/this_studio/woodshed/ai/vectordb/chromadb/data/articles/05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt
/teamspace/studios/this_studio/woodshed/ai/vectordb/chromadb/data/articles/05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt
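
The same file is listed twice above because both retrieved chunks come from that one article. A small sketch for printing each source once, keeping the original order, is shown below. The helper name `unique_sources` is hypothetical, and `SimpleNamespace` objects stand in for LangChain `Document` objects so the example runs without an API key.

```python
from types import SimpleNamespace


def unique_sources(source_documents):
    """Return each document's metadata['source'] once, in first-seen order."""
    seen = set()
    ordered = []
    for doc in source_documents:
        source = doc.metadata.get("source")
        if source is not None and source not in seen:
            seen.add(source)
            ordered.append(source)
    return ordered


# Stand-ins for the Document objects in llm_response["source_documents"]
docs = [
    SimpleNamespace(metadata={"source": "pando.txt"}),
    SimpleNamespace(metadata={"source": "pando.txt"}),
    SimpleNamespace(metadata={"source": "other.txt"}),
]
print(unique_sources(docs))
```

In the notebook, this could replace the loop in `process_llm_response` by iterating over `unique_sources(llm_response["source_documents"])`.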
