# Document Question Answering with local persistence

This example uses Chroma and LangChain to answer questions over documents, backed by a locally persisted database.
Embeddings and documents are stored on disk, so they can be reloaded and reused in a later session.

In [6]:
!pip install --upgrade langchain
!pip install openai
!pip install chromadb
!pip install langchain-community

In [7]:
# Third-party integrations live in langchain_community in recent LangChain releases
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.llms import OpenAI
from langchain.chains import VectorDBQA
from langchain_community.document_loaders import TextLoader

## Load and process documents

Load the documents to run question answering over. To use your own documents, replace this section.

Next we split documents into small chunks. This is so we can find the most relevant chunks for a query and pass only those into the LLM.
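
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a minimal sketch of fixed-window chunking. The `split_text` helper is hypothetical and written only for illustration; `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraph breaks and spaces, so its chunks will generally differ from these.

```python
def split_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 0) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Hypothetical helper for illustration; not LangChain's implementation.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each new window starts chunk_size - chunk_overlap characters after the last
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghij" * 3  # 30 characters
chunks = split_text(sample, chunk_size=12, chunk_overlap=2)
print([len(c) for c in chunks])  # [12, 12, 10]
```

With a non-zero overlap, the last characters of one chunk reappear at the start of the next, which can help when an answer straddles a chunk boundary.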

In [20]:
import os
os.listdir()

['.git',
 '.gitignore',
 'env',
 'LICENSE',
 'persistent-qa.ipynb',
 'qa.ipynb',
 'README.md',
 'requirements.txt',
 'state_of_the_union.txt']

In [23]:
# Load and process the text
loader = TextLoader('state_of_the_union.txt', encoding='utf-8')
print(loader, 'loader')
documents = loader.load()
print(documents, 'documents')

# Split into chunks of at most 1000 characters, with no overlap between chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

<langchain_community.document_loaders.text.TextLoader object at 0x000001EE7EB09340> loader
[Document(metadata={'source': 'state_of_the_union.txt'}, page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFro

## Initialize Persisted ChromaDB

Create embeddings for each chunk and insert into the Chroma vector database. The `persist_directory` argument tells ChromaDB where to store the database when it's persisted. 
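
As a rough illustration of what a vector store does with those embeddings, the sketch below ranks chunks by cosine similarity to a query. It is a toy under stated assumptions: `embed`, `cosine`, and `top_k` are hypothetical helpers, the "embedding" is a plain word count rather than an OpenAI model, and the sample strings are invented for the example rather than taken from the speech.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


# Invented sample chunks, for illustration only
toy_chunks = [
    "the court nominee is a consensus builder",
    "we will invest in roads and bridges",
    "the nominee served as a federal judge",
]
print(top_k("who is the court nominee", toy_chunks, k=1))
```

A real vector database stores dense vectors produced by the embedding model and uses an index to avoid scanning every chunk, but the ranking idea is the same.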

In [104]:
!pip install -U langchain-openai
!pip install openai
from langchain_openai import OpenAIEmbeddings

Collecting openai<2.0.0,>=1.32.0 (from langchain-openai)
  Using cached openai-1.37.1-py3-none-any.whl.metadata (22 kB)
Using cached openai-1.37.1-py3-none-any.whl (337 kB)
Installing collected packages: openai
  Attempting uninstall: openai
    Found existing installation: openai 0.28.0
    Uninstalling openai-0.28.0:
      Successfully uninstalled openai-0.28.0
Successfully installed openai-1.37.1


In [105]:
# Embed and store the texts
# Supplying a persist_directory will store the embeddings on disk
import os
# os.environ["OPENAI_API_KEY"] = "<your-api-key>"  # placeholder; never commit a real key
persist_directory = 'db'

embedding = OpenAIEmbeddings()
vectordb = Chroma.from_documents(documents=texts, embedding=embedding, persist_directory=persist_directory)
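
Rather than pasting a key into the notebook, one option is to read it from the environment and fail early if it is missing. The sketch below assumes the key is exported as `OPENAI_API_KEY`, which is the variable the commented-out line above sets; `require_api_key` is a hypothetical helper, not part of LangChain or the OpenAI client.

```python
import os
from typing import Mapping


def require_api_key(env: Mapping[str, str] = os.environ, name: str = "OPENAI_API_KEY") -> str:
    """Return the API key from env, or raise a clear error if it is unset."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before starting the notebook")
    return key


# Demonstrated with a fake environment so the example runs without a real key
print(require_api_key({"OPENAI_API_KEY": "sk-example"})[:3])  # sk-
```

Calling `require_api_key()` with no arguments checks the real process environment.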

## Persist the Database
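In a notebook, call `persist()` to ensure the embeddings are written to disk.
This isn't necessary in a script, where the database is persisted automatically when the client object is destroyed. With Chroma 0.4 and later, data is written to disk automatically and the LangChain wrapper marks `persist()` as deprecated, so on newer versions this call may only produce a warning.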

In [106]:
vectordb.persist()
vectordb = None
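
A quick way to confirm that something was actually written is to list the files under the persistence directory. This is a minimal sketch: `persisted_files` is a hypothetical helper, and the exact files Chroma creates depend on the installed version, so the demo below uses a throwaway directory instead of assuming any particular layout.

```python
import tempfile
from pathlib import Path


def persisted_files(persist_directory: str) -> list[str]:
    """Return relative paths of all files under the directory (empty if it is missing)."""
    root = Path(persist_directory)
    if not root.is_dir():
        return []
    return sorted(str(p.relative_to(root)) for p in root.rglob("*") if p.is_file())


# Demonstrate on a temporary directory containing one placeholder file
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "example.bin").write_bytes(b"\x00")
    demo_listing = persisted_files(tmp)
print(demo_listing)  # ['example.bin']
```

In the notebook, `persisted_files(persist_directory)` should return a non-empty list once the database has been persisted.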

## Load the Database from disk, and create the chain
Be sure to pass the same `persist_directory` and `embedding_function` as you did when you instantiated the database. Initialize the chain we will use for question answering.

In [107]:
# Now we can load the persisted database from disk, and use it as normal. 
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)
qa = VectorDBQA.from_chain_type(llm=OpenAI(), chain_type="stuff", vectorstore=vectordb)
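
The `"stuff"` chain type places all retrieved chunks into a single prompt alongside the question. The sketch below shows the idea only; `build_stuff_prompt` is a hypothetical helper and the prompt wording is an assumption, not LangChain's actual template.

```python
def build_stuff_prompt(question: str, chunks: list[str]) -> str:
    """Concatenate retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Invented chunks, for illustration only
prompt = build_stuff_prompt(
    "What is chunk two about?",
    ["Chunk one talks about roads.", "Chunk two talks about judges."],
)
print(prompt)
```

Because every retrieved chunk goes into one prompt, this approach only works while the combined text fits within the model's context window, which is one reason the documents are split into small chunks first.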



## Ask questions!

Now we can use the chain to ask questions!

In [108]:
query = "What did the president say about Ketanji Brown Jackson"
qa.run(query)

' The president nominated Ketanji Brown Jackson for the United States Supreme Court and described her as a top legal mind and a consensus builder with a diverse background.'

## Cleanup

When you're done with the database, you can delete it from disk. You can delete the specific collection you're working with (if you have several), or delete the entire database by nuking the persistence directory.

In [10]:
# To cleanup, you can delete the collection
vectordb.delete_collection()
vectordb.persist()

# Or just nuke the persist directory
!rm -rf db/

Persisting DB to disk, putting it in the save folder db
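
The `rm -rf` shell command is not available in every environment (for example, a default Windows shell). A portable alternative is `shutil.rmtree` from the standard library. The sketch below wraps it in a hypothetical `remove_persist_directory` helper and demonstrates it on a scratch directory rather than a real database.

```python
import shutil
import tempfile
from pathlib import Path


def remove_persist_directory(persist_directory: str) -> bool:
    """Delete the directory tree if it exists; return True if something was removed."""
    path = Path(persist_directory)
    if not path.is_dir():
        return False
    shutil.rmtree(path)
    return True


# Demonstrate on a throwaway directory
scratch = tempfile.mkdtemp()
print(remove_persist_directory(scratch))  # True
print(remove_persist_directory(scratch))  # False
```

In the notebook, the equivalent of the shell command would be `remove_persist_directory('db')`.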
