# Document Question Answering with local persistence

An example of using Chroma DB and LangChain to answer questions over documents, backed by a locally persisted database.
Embeddings and documents are stored on disk, so a later session can reuse them without re-embedding.

In [2]:
from dotenv import load_dotenv 

# Load the environment variables from .env
load_dotenv()

True

In [3]:
# Text loader
from langchain.document_loaders import TextLoader

# Text splitters
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import CharacterTextSplitter


# Hugging Face embedding models
from langchain_huggingface.embeddings.huggingface import HuggingFaceEmbeddings  # runs locally
# from langchain_community.embeddings import HuggingFaceHubEmbeddings  # legacy
from langchain_huggingface.embeddings import HuggingFaceEndpointEmbeddings  # runs on Hugging Face servers

# Vector store
from langchain_chroma import Chroma



## Load and process documents

Load the documents to run question answering over. To use your own documents, replace this section.

Next, split the documents into small chunks so that only the chunks most relevant to a query are retrieved and passed to the LLM.
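The role of `chunk_size` and `chunk_overlap` can be illustrated with a minimal sketch. This is a simplified character-window version written only for illustration: the LangChain splitters used in this notebook cut on separators such as paragraph breaks rather than at fixed offsets, and the helper name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> list[str]:
    """Cut text into fixed-size windows; each window repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(pieces)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The overlap is why, in the output further down, one chunk ends with the same sentence that the next chunk starts with: a sentence near a boundary stays retrievable from either side.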

In [4]:
# Load and process the text
loader = TextLoader('state_of_the_union.txt', encoding='UTF-8')
documents = loader.load()

text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=100)
texts = text_splitter.split_documents(documents)
len(texts)

95

In [5]:
print("0", texts[0].page_content, len(texts[0].page_content))
print("1", texts[1].page_content, len(texts[1].page_content))
print(texts[2])

0 Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again. 

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 

With a duty to one another to the American people to the Constitution. 

And with an unwavering resolve that freedom will always triumph over tyranny. 490
1 And with an unwavering resolve that freedom will always triumph over tyranny. 

Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. 

He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. 

He met the Ukrainian people. 403
page_content='He met the Ukrainian people. 

From President Zelenskyy to every Ukrainian, the

In [6]:

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
texts = text_splitter.split_documents(documents)
len(texts)

96

In [7]:
for i in range(5):
    print(i, texts[i].page_content, len(texts[i].page_content))


0 Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again. 

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 

With a duty to one another to the American people to the Constitution. 

And with an unwavering resolve that freedom will always triumph over tyranny. 490
1 And with an unwavering resolve that freedom will always triumph over tyranny. 

Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. 

He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. 

He met the Ukrainian people. 403
2 He met the Ukrainian people. 

From President Zelenskyy to every Ukrainian, their fearlessn

## Initialize persisted ChromaDB

Create an embedding for each chunk and insert the chunks into the Chroma vector database. The `persist_directory` argument tells ChromaDB where on disk to store the database. The cell below embeds only the first 30 chunks.

In [8]:
# Embed and store the texts
# Supplying a persist_directory will store the embeddings on disk
persist_directory = 'ChromaDB'
model_name = "sentence-transformers/all-mpnet-base-v2"
embedding = HuggingFaceEndpointEmbeddings(model=model_name)
vectordb = Chroma.from_documents(documents=texts[0:30], embedding=embedding, persist_directory=persist_directory)
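Once the store has been written, a later session can reopen it from the same directory instead of embedding the documents again. The following is a minimal sketch, assuming the store was created with the default collection name as in the cell above; the helper name `reopen_vectordb` is hypothetical.

```python
def reopen_vectordb(embedding, persist_directory: str = "ChromaDB"):
    """Open an existing on-disk Chroma store without re-embedding any documents."""
    # Imported lazily so that defining this helper does not require the package.
    from langchain_chroma import Chroma

    # The embedding object should wrap the same model that built the store;
    # otherwise query vectors and stored vectors are not comparable.
    return Chroma(persist_directory=persist_directory, embedding_function=embedding)
```

In a new session this would be used as `vectordb = reopen_vectordb(embedding)`, after which the search calls shown in the next cells work against the stored chunks.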



In [9]:
vectordb.similarity_search("How can we react?")


[Document(metadata={'source': 'state_of_the_union.txt'}, page_content='America will lead that effort, releasing 30 Million barrels from our own Strategic Petroleum Reserve. And we stand ready to do more if necessary, unified with our allies.  \n\nThese steps will help blunt gas prices here at home. And I know the news about what’s happening can seem alarming. \n\nBut I want you to know that we are going to be okay. \n\nWhen the history of this era is written Putin’s war on Ukraine will have left Russia weaker and the rest of the world stronger.'),
 Document(metadata={'source': 'state_of_the_union.txt'}, page_content='America will lead that effort, releasing 30 Million barrels from our own Strategic Petroleum Reserve. And we stand ready to do more if necessary, unified with our allies.  \n\nThese steps will help blunt gas prices here at home. And I know the news about what’s happening can seem alarming. \n\nBut I want you to know that we are going to be okay. \n\nWhen the history of thi

In [10]:
vectordb.max_marginal_relevance_search("How can we react?")

[Document(metadata={'source': 'state_of_the_union.txt'}, page_content='America will lead that effort, releasing 30 Million barrels from our own Strategic Petroleum Reserve. And we stand ready to do more if necessary, unified with our allies.  \n\nThese steps will help blunt gas prices here at home. And I know the news about what’s happening can seem alarming. \n\nBut I want you to know that we are going to be okay. \n\nWhen the history of this era is written Putin’s war on Ukraine will have left Russia weaker and the rest of the world stronger.'),
 Document(metadata={'source': 'state_of_the_union.txt'}, page_content='Let each of us here tonight in this Chamber send an unmistakable signal to Ukraine and to the world. \n\nPlease rise if you are able and show that, Yes, we the United States of America stand with the Ukrainian people. \n\nThroughout our history we’ve learned this lesson when dictators do not pay a price for their aggression they cause more chaos.   \n\nThey keep moving.   
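The two searches above differ in how they treat near-duplicates: plain similarity search ranks by closeness to the query alone, while maximal marginal relevance (MMR) also penalizes candidates that resemble results already picked. The sketch below is a simplified pure-Python illustration of that greedy selection with toy 2-D vectors; it is not LangChain's implementation, and the names `cosine` and `mmr_select` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedy MMR: pick k document indices balancing relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))

    def score(i):
        relevance = cosine(query_vec, doc_vecs[i])
        # Similarity to the closest already-selected document (0 if none yet).
        redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
docs = [[2.0, 1.0], [2.0, 1.0], [1.0, -1.0]]  # documents 0 and 1 are identical

top_by_similarity = sorted(range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True)[:2]
top_by_mmr = mmr_select(query, docs, k=2)
print(top_by_similarity, top_by_mmr)  # [0, 1] [0, 2]
```

Similarity alone returns the duplicate pair, whereas MMR swaps the second copy for a less similar document, which matches the pattern in the two outputs above.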

In [11]:
# texts[0:30] are already stored, so continue from index 30 (31 would skip a chunk)
vectordb.add_documents(texts[30:40])

HfHubHTTPError: 500 Server Error: Internal Server Error for url: https://api-inference.huggingface.co/pipeline/feature-extraction/sentence-transformers/all-mpnet-base-v2 (Request ID: kCy64j5h3KWtFlJKKZRn8)

Model too busy, unable to get response in less than 60 second(s)
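The error above comes from the hosted inference endpoint being busy rather than from the notebook code, so one way to cope is to retry with an increasing delay. The sketch below is a generic helper under that assumption; `call_with_retries` and its parameters are hypothetical names, and the flaky function only simulates a busy server.

```python
import time


def call_with_retries(fn, max_attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt seconds and try again."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:  # in real use, catch only the transient error type
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** attempt)


# Stand-in for a flaky remote call: fails twice, then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("Model too busy")
    return "ok"


waits = []
result = call_with_retries(flaky, sleep=waits.append)  # record waits instead of sleeping
print(result, waits)  # ok [2.0, 4.0]
```

In the notebook, the `vectordb.add_documents(...)` call could be wrapped in a `lambda` and passed as `fn`. Another option is the local `HuggingFaceEmbeddings` class imported at the top, which does not depend on the hosted endpoint.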

### Maximum tokens per vector

In [None]:
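from sentence_transformers import SentenceTransformer

# Load the same model locally to inspect its input limits
embedding_model: SentenceTransformer = SentenceTransformer(model_name)


In [None]:
# Maximum number of tokens the model accepts; longer inputs are truncated
embedding_model.get_max_seq_length()

In [None]:
embedding_model.tokenizer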

In [None]:
from transformers import MPNetTokenizer

tokenizer = MPNetTokenizer.from_pretrained(model_name)

In [None]:
tokens = tokenizer.tokenize(texts[0].page_content)
print(len(tokens), tokens[:30])
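The point of counting tokens is to check whether any chunk is longer than the model's maximum sequence length, since the excess would be cut off before embedding. A minimal sketch of that check follows; `chunks_over_limit` is a hypothetical helper, and whitespace splitting stands in for a real tokenizer so the example runs without downloading a model.

```python
def chunks_over_limit(chunks, tokenize, max_tokens):
    """Return (index, token_count) for every chunk whose token count exceeds max_tokens."""
    over = []
    for index, chunk in enumerate(chunks):
        count = len(tokenize(chunk))
        if count > max_tokens:
            over.append((index, count))
    return over


# Whitespace splitting stands in for a real tokenizer here.
demo_chunks = ["one two three", "one two three four five six", "one"]
flagged = chunks_over_limit(demo_chunks, str.split, max_tokens=4)
print(flagged)  # [(1, 6)]
```

With the notebook's objects, the same function could be called with `[t.page_content for t in texts]`, `tokenizer.tokenize`, and the value returned by `get_max_seq_length()`. It is probably worth leaving a small margin, because the model adds special tokens that `tokenize` does not count.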