# Upload your documents to a ChromaDB vector database and chat with them through a GPT-4 chatbot!

In this notebook, I will introduce you to vector databases. I will:
- Store a set of documents (here, a directory of PDFs) in a Chroma DB vector database
- Create a retriever to retrieve the desired information
- Create a Q&A chatbot with GPT-4
- Show how you can delete and reopen a vector database locally to save space
- Visualise your vector database (very cool, read till the end!)

This notebook accompanies a Medium article: [Medium articles](https://medium.com/@rubentak)

In [1]:
%%capture
!pip install langchain --upgrade
!pip install pypdf
!pip install openai
!pip install chromadb
!pip install tiktoken
!pip install unstructured
!pip install "unstructured[pdf]"
!pip install Cython

# Q&A bot with LangChain over a directory

In [3]:
# Import libraries
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chat_models import ChatOpenAI
import os
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import DirectoryLoader

In [4]:
# Read the OpenAI API key from the environment instead of hard-coding it in the notebook
# (set it beforehand, e.g. `export OPENAI_API_KEY=...` in the shell that launches Jupyter)
openai_api_key = os.environ["OPENAI_API_KEY"]
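
A key typed into a cell is easy to leak when the notebook is shared. As a minimal sketch, the helper below reads the key from the environment and only falls back to a hidden prompt when the variable is missing. The name `load_api_key` and its two parameters are hypothetical and exist only to make the helper easy to test.

```python
import os
from getpass import getpass


def load_api_key(env=os.environ, prompt=getpass):
    """Return the OpenAI key from the environment, prompting only if it is missing."""
    key = env.get("OPENAI_API_KEY")
    if not key:
        # getpass hides the typed characters, so the key never shows up in cell output
        key = prompt("OpenAI API key: ")
    return key

# In the notebook this would be used as:
#   openai_api_key = load_api_key()
```

With this approach the key never appears in the saved notebook, whether it comes from the environment or from the prompt.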

In [10]:
# Print number of txt files in directory
#loader = DirectoryLoader('iit_data') #, glob="**/*.pdf")
#docs = loader.load()
#len(docs)


KeyboardInterrupt



In [1]:
pwd

'/home/Chatbot trial-1/Langchain/notebooks'

In [69]:
from langchain.document_loaders import PyPDFDirectoryLoader
loader = PyPDFDirectoryLoader('iit_data')
docs = loader.load()

In [121]:
len(docs)

649

In [122]:
# Splitting the text into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000,
                                               chunk_overlap=200)
texts = text_splitter.split_documents(docs)
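
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch that cuts a string into fixed-size windows which share their edges. It is not how `RecursiveCharacterTextSplitter` works internally (that class first splits on separators such as paragraph and line breaks and then merges pieces up to the size limit). The function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into windows of chunk_size characters; neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.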

In [123]:
# Count the number of chunks
len(texts)

6775

In [124]:
# Print the first chunk
texts[0]

Document(page_content='Highly effective Mg 2Si1−xSnxthermoelectrics\nV . K. Zaitsev, M. I. Fedorov, *E. A. Gurieva, I. S. Eremin, P. P. Konstantinov, A. Yu. Samunin, and M. V . Vedernikov\nA. F . Ioffe Physico-Technical Institute of the Russian Academy of Sciences, Saint Petersburg, Russia\n/H20849Received 19 October 2005; revised manuscript received 8 June 2006; published 14 July 2006 /H20850\nResults of detailed investigations of Mg 2BIV/H20849BIV=Si, Ge, Sn /H20850compounds and their quasibinary alloys are', metadata={'source': 'iit_data/Zaitsev, Federov - Highly effective Mg2Si1−xSnx Thermoelectrics.pdf', 'page': 0})

# Database creation with ChromaDB

https://www.youtube.com/watch?v=3yPBVii7Ct0

In [125]:
# Embed and store the texts
# Supplying a persist_directory will store the embeddings on disk

# OpenAI embeddings

os.environ["OPENAI_API_KEY"] = openai_api_key
embedding = OpenAIEmbeddings()
persist_directory = 'db_dir'
vectordb = Chroma.from_documents(documents=texts,
                                 embedding=embedding,
                                 persist_directory=persist_directory)
# Persist the db to disk
vectordb.persist()

In [127]:
# Persist the db to disk
#vectordb.persist()
#vectordb = None

In [139]:
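# Now we can load a persisted database from disk and use it as normal.
# Note: this cell loads 'phd_db1' rather than the 'db_dir' written above,
# so that folder must already exist from an earlier run.
persist_directory = 'phd_db1'
# Reuse the embedding object created above; loading and querying must use the same embedding model
#embedding = OpenAIEmbeddings()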
vectordb = Chroma(persist_directory=persist_directory,
                  embedding_function=embedding)

# Create retriever

In [140]:
retriever = vectordb.as_retriever()

In [141]:
doc1 = retriever.get_relevant_documents("What is Seebeck Effect?")

In [151]:
len(doc1[0].page_content)

453

In [152]:
len(doc1)

4

In [165]:
retriever = vectordb.as_retriever(search_kwargs={"k": 2})

In [166]:
retriever.search_type

'similarity'

In [167]:
retriever.search_kwargs

{'k': 2}
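
With `search_type` set to `'similarity'` and `k` set to 2, the retriever returns the two stored chunks whose embeddings are closest to the query embedding. The sketch below shows that idea on toy vectors. Cosine similarity is an assumption made for illustration, since the metric Chroma actually uses depends on how the collection is configured, and the names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k vectors most similar to query_vec, best first."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy 3-dimensional "embeddings"; real embeddings have far more dimensions.
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [1.0, 0.05, 0.0]
print(top_k(query_vec, doc_vecs, k=2))
```

A vector database does the same ranking, but uses an index so that it does not have to compare the query against every stored vector.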

# Create a question answering chain

In [168]:
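# Create the chain to answer questions
# gpt-3.5-turbo is a chat model, so it is wrapped with ChatOpenAI (imported above)
# rather than the completion-style OpenAI class
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0.0, max_tokens=999),
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
    verbose=True)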
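
The `"stuff"` chain type puts every retrieved chunk into a single prompt and sends that prompt to the model once. The sketch below shows the idea only: the prompt wording and the function name `build_stuff_prompt` are hypothetical and are not LangChain's actual template.

```python
def build_stuff_prompt(question, chunks):
    """Join every retrieved chunk into one context block, followed by the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    "What is the Seebeck effect?",
    ["First retrieved chunk.", "Second retrieved chunk."],
)
print(prompt)
```

Because everything travels in one prompt, the number of retrieved chunks (`k`) times the chunk size has to fit inside the model's context window together with the answer.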

In [169]:
# Cite sources
def process_llm_response(llm_response):
    print(llm_response['result'])
    print('\n\nSources:')
    for source in llm_response["source_documents"]:
        print(source.metadata['source'])
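
When several retrieved chunks come from the same file, the loop above prints that file once per chunk. As a minimal sketch, the hypothetical helper `unique_sources` lists each source once, in the order it was first seen. The demo objects are stand-ins for LangChain `Document` objects, and their page numbers are made up for the example.

```python
from types import SimpleNamespace


def unique_sources(source_documents):
    """Return each source path once, in first-seen order."""
    # dict keys keep insertion order, so this removes duplicates without reordering
    return list(dict.fromkeys(doc.metadata["source"] for doc in source_documents))


# Stand-ins for Document objects: only the .metadata attribute is needed here.
docs_demo = [
    SimpleNamespace(metadata={"source": "data/jess107.pdf", "page": 3}),
    SimpleNamespace(metadata={"source": "data/jess107.pdf", "page": 4}),
]
print(unique_sources(docs_demo))
```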

In [173]:
query = "What is acoustic phonon deformation potential of a band?"
#query = "How is the band mass of an electron measured?"
#query = "Formula for acoustic phonon deformation potential"
#query = "Explain Pisarenko plot in detail."


llm_response = qa_chain(query)
process_llm_response(llm_response)



> Entering new RetrievalQA chain...

> Finished chain.
 Acoustic phonon deformation potential is the deformation potential that relates to changes in the electronic structure due to the presence of a phonon.


Sources:
iit_data/CRC_May_Snyder_Final_Handout_OneCol.pdf
iit_data/Bahk - Electron transport modeling and energy filtering for efficient thermoelectric.pdf


In [46]:
# Question
# Only the last assignment takes effect, so the alternative questions are commented out
#query = "Who were Marianne and Germania? What was the importance of the way in which they were portrayed?"
#query = "How are minerals formed in igneous and metamorphic rocks?"
#query = "Why do we need to conserve mineral resources?"
#query = "Describe the distribution of coal in India in about 120 words"

query = "Why are the means of transportation and communication called the lifelines of a nation and its economy? Answer in about 120 words."
llm_response = qa_chain(query)
process_llm_response(llm_response)



> Entering new RetrievalQA chain...

> Finished chain.
 
The means of transportation and communication are the lifelines of a nation and its economy because they provide the necessary infrastructure for efficient trade and commerce. Transportation and communication systems are essential for the smooth functioning of a nation and its economy. They allow goods and services to be moved from one place to another, and they ensure that information is quickly and accurately exchanged. Without them, trade would be slow and inefficient. They also enable the development of other industries such as tourism, banking, and finance. In addition, they provide access to markets, resources, and knowledge which are essential for economic development. Therefore, transportation and communication systems are crucial for a nation’s long-term economic growth.


Sources:
data/jess107.pdf
data/jess107.pdf


In [20]:
query = "Briefly trace the process of German unification."
llm_response = qa_chain(query)
process_llm_response(llm_response)



> Entering new RetrievalQA chain...

> Finished chain.
 The process of German unification began in the 19th century, with Prussia taking on the leadership of the movement. Otto von Bismarck, the chief minister of Prussia, used the Prussian army and bureaucracy to achieve unification. This was achieved through three wars over seven years - with Austria, Denmark and France - which Prussia was victorious in. In 1871, the Prussian king William I was proclaimed German Emperor at a ceremony held in Versailles. The process of unification was completed when the South German states joined with Prussia to form the German Empire in 1871, after Prussia won the Franco-Prussia War.


Sources:
jess301.pdf
jess301.pdf


# Deleting the DB

In [250]:
!zip -r db.zip ./db

updating: db/ (stored 0%)
updating: db/chroma-embeddings.parquet (deflated 29%)
updating: db/index/ (stored 0%)
updating: db/index/index_metadata_b9a5e02f-ebd0-4b13-8858-b30b211c4546.pkl (deflated 5%)
updating: db/index/id_to_uuid_b9a5e02f-ebd0-4b13-8858-b30b211c4546.pkl (deflated 37%)
updating: db/index/uuid_to_id_d80886e4-65e1-4231-8c73-99ff58d68061.pkl (deflated 39%)
updating: db/index/index_b9a5e02f-ebd0-4b13-8858-b30b211c4546.bin (deflated 17%)
updating: db/index/index_d80886e4-65e1-4231-8c73-99ff58d68061.bin (deflated 17%)
updating: db/index/uuid_to_id_b9a5e02f-ebd0-4b13-8858-b30b211c4546.pkl (deflated 41%)
updating: db/index/id_to_uuid_d80886e4-65e1-4231-8c73-99ff58d68061.pkl (deflated 32%)
updating: db/index/index_metadata_d80886e4-65e1-4231-8c73-99ff58d68061.pkl (deflated 5%)
updating: db/chroma-collections.parquet (deflated 50%)
updating: db/.DS_Store (deflated 96%)


In [251]:
# To clean up, you can delete the collection
vectordb.delete_collection()
vectordb.persist()

# Delete the directory
!rm -rf db/

# Starting again: loading the DB

In [57]:
!unzip db.zip

Archive:  db.zip
replace db/chroma-embeddings.parquet? [y]es, [n]o, [A]ll, [N]one, [r]ename: ^C
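
`unzip` stops and asks before overwriting existing files, which is awkward inside a notebook cell. As an alternative sketch, Python's standard `shutil` module can archive and restore a folder without prompting. The helper names `archive_dir` and `restore_dir` are hypothetical, and the demo runs on a throwaway folder that stands in for the persisted `db` directory.

```python
import os
import shutil
import tempfile


def archive_dir(directory, archive_base):
    """Zip directory into archive_base + '.zip' and return the archive path."""
    parent, name = os.path.split(os.path.abspath(directory))
    return shutil.make_archive(archive_base, "zip", root_dir=parent, base_dir=name)


def restore_dir(archive_path, destination):
    """Unpack the archive into destination, overwriting files that already exist."""
    shutil.unpack_archive(archive_path, destination)


# Demo on a temporary folder standing in for the persisted 'db' directory.
workdir = tempfile.mkdtemp()
db_path = os.path.join(workdir, "db")
os.makedirs(db_path)
with open(os.path.join(db_path, "placeholder.txt"), "w") as fh:
    fh.write("stand-in for the Chroma files")

zip_path = archive_dir(db_path, os.path.join(workdir, "db"))
shutil.rmtree(db_path)           # same effect as `rm -rf db/`
restore_dir(zip_path, workdir)   # recreates workdir/db without prompting
print(os.listdir(db_path))
```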


In [None]:
os.environ["OPENAI_API_KEY"] = "sk-..."

In [59]:
persist_directory = 'db'
embedding = OpenAIEmbeddings()

vectordb2 = Chroma(persist_directory=persist_directory,
                   embedding_function=embedding)

retriever = vectordb2.as_retriever(search_kwargs={"k": 2})

Using embedded DuckDB with persistence: data will be stored in: db
