# Vector Store

Using Chroma, an open-source vector store that works well with LangChain, we can save our embedding vectors to disk and load them later.

Vector store attributes:
1) Can store large N-dimensional vectors.
2) Links each embedding vector directly to its original text document, so that after a cosine similarity search, for example, we can see which original string a stored vector corresponds to.
3) Can be queried, which allows a cosine similarity search between a new vector that is not in the database and the stored vectors.
4) Can easily add, update, or delete vectors.
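
To make these attributes concrete, here is a minimal sketch of an in-memory vector store in plain Python. `ToyVectorStore` and its methods are hypothetical names made up for illustration; this is not how Chroma is implemented, only the same idea at a tiny scale.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Hypothetical in-memory store mapping id -> (vector, original text)."""

    def __init__(self):
        self._rows = {}

    def upsert(self, doc_id, vector, text):
        # Adding and updating are the same operation: the last write wins.
        self._rows[doc_id] = (list(vector), text)

    def delete(self, doc_id):
        self._rows.pop(doc_id, None)

    def query(self, vector, k=4):
        # Score every stored vector, then return the k best as (score, text).
        scored = [
            (cosine_similarity(vector, stored), text)
            for stored, text in self._rows.values()
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


store = ToyVectorStore()
store.upsert("a", [1.0, 0.0, 0.0], "text about taxes")
store.upsert("b", [0.0, 1.0, 0.0], "text about food prices")
store.upsert("c", [0.7, 0.7, 0.0], "text about taxes and food prices")

# The query vector is not in the store; we get back the closest original texts.
results = store.query([0.1, 0.9, 0.0], k=2)
```

Because each vector is stored next to its text, the query returns readable strings rather than raw numbers, which is the point of attribute 2.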

In [3]:
import chromadb
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader

Operations:
1) Load the document and split it into chunks. It's recommended to use chunks smaller than the maximum allowed size, because smaller chunks can later be passed to a chat model call as extra context.
2) Using the embedding model, embed the chunks to obtain the embedding vectors.
3) Save the embedded chunks in the vector store, in this case ChromaDB.
4) Perform a similarity search on the vector store (ChromaDB).
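
The four operations can be sketched end to end without any external service. Everything here is a stand-in chosen for illustration: the sample text is made up, `split_into_chunks` splits on blank lines, and `embed` is a toy word-count vector, not a real embedding model.

```python
import math
import re
from collections import Counter


def split_into_chunks(text):
    # Step 1: toy splitter, one chunk per blank-line-separated paragraph.
    return [part.strip() for part in text.split("\n\n") if part.strip()]


def embed(text):
    # Step 2: toy "embedding", a sparse word-count vector (NOT a real model).
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_search(store, query, k=4):
    # Step 4: embed the query and rank the stored chunks against it.
    query_vector = embed(query)
    ranked = sorted(store, key=lambda row: cosine(query_vector, row[0]), reverse=True)
    return [text for _, text in ranked[:k]]


document = (
    "Cats sleep for most of the day.\n\n"
    "Bread dough needs time to rise before baking.\n\n"
    "Rain is expected across the coast tomorrow."
)

chunks = split_into_chunks(document)
# Step 3: the "vector store" is just a list of (vector, original text) pairs.
store = [(embed(chunk), chunk) for chunk in chunks]

top = similarity_search(store, "how long should bread dough rise", k=1)
```

In the rest of this notebook, LangChain's text splitter, OpenAI's embedding model, and Chroma take the place of these three toy pieces.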

In [14]:
# load the document
loader = TextLoader("FDR_State_of_Union_1944.txt")
documents = loader.load()

In [19]:
# split the document into chunks
# docs holds the list of chunks, each aiming for at most 500 tokens
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=500)
docs = text_splitter.split_documents(documents)

### Connect to OpenAI for Embeddings

In [16]:
import os
openai_api_key = os.getenv(key="OPENAI_API_KEY")

In [17]:
# OpenAIEmbeddings reads OPENAI_API_KEY from the environment by default
embedding_function = OpenAIEmbeddings()

### Pass Embeddings and Docs into Chroma and save them to disk

In [9]:
# load it into Chroma
db = Chroma.from_documents(docs, embedding_function, persist_directory='./speech_embedding_db')
db.persist()

This results in two Parquet files and an index folder:

1) chroma-collections: the original strings
2) chroma-embeddings: the vectors
3) index: connects the two above and enables lookups, similarity searches, etc.

### Load Embeddings from Disk

In [21]:
db_connection = Chroma(persist_directory='./speech_embedding_db/', embedding_function=embedding_function)

In [25]:
new_doc = "What did FDR say about the cost of food law?"

All Chroma does is vectorize the new text and return the documents in the vector store that are most similar to it. It does not use a chat model to interpret our question.

In this example, a string like "cost of food law, FDR" would probably work about as well as the original question, because we are only performing a similarity search.
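
A rough way to see why the phrasing matters so little: with a toy word-count "embedding", both the full question and the keyword string end up closest to the same chunk. This is only an analogy under loud assumptions. Real embedding models capture far more than shared words, `embed` and `best_match` are hypothetical helpers, and the three chunks are short phrases shortened from the speech.

```python
import math
import re
from collections import Counter


def embed(text):
    # Toy stand-in for an embedding model: a sparse word-count vector.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


chunks = [
    "That is the way to fight and win a war",
    "A realistic tax law which will tax all unreasonable profits",
    "A cost of food law which will enable the Government to place "
    "a reasonable floor under prices",
]


def best_match(query):
    # No interpretation of the question happens here, only vector comparison.
    query_vector = embed(query)
    return max(chunks, key=lambda chunk: cosine(query_vector, embed(chunk)))


as_question = best_match("What did FDR say about the cost of food law?")
as_keywords = best_match("cost of food law, FDR")
```

Both calls land on the "cost of food law" chunk, because nothing in the search step tries to understand what is being asked.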

In [26]:
docs = db_connection.similarity_search(new_doc)

In [29]:
# By default Chroma returns the 4 most similar documents, ordered by similarity, so the first document is the most similar
docs

[Document(page_content='That is the way to fight and win a war—all out—and not with half-an-eye on the battlefronts abroad and the other eye-and-a-half on personal, selfish, or political interests here at home.\n\nTherefore, in order to concentrate all our energies and resources on winning the war, and to maintain a fair and stable economy at home, I recommend that the Congress adopt:\n\n(1) A realistic tax law—which will tax all unreasonable profits, both individual and corporate, and reduce the ultimate cost of the war to our sons and daughters. The tax bill now under consideration by the Congress does not begin to meet this test.\n\n(2) A continuation of the law for the renegotiation of war contracts—which will prevent exorbitant profits and assure fair prices to the Government. For two long years I have pleaded with the Congress to take undue profits out of war.\n\n(3) A cost of food law—which will enable the Government (a) to place a reasonable floor under the prices the farmer ma

In [30]:
print(docs[0].page_content)

That is the way to fight and win a war—all out—and not with half-an-eye on the battlefronts abroad and the other eye-and-a-half on personal, selfish, or political interests here at home.

Therefore, in order to concentrate all our energies and resources on winning the war, and to maintain a fair and stable economy at home, I recommend that the Congress adopt:

(1) A realistic tax law—which will tax all unreasonable profits, both individual and corporate, and reduce the ultimate cost of the war to our sons and daughters. The tax bill now under consideration by the Congress does not begin to meet this test.

(2) A continuation of the law for the renegotiation of war contracts—which will prevent exorbitant profits and assure fair prices to the Government. For two long years I have pleaded with the Congress to take undue profits out of war.

(3) A cost of food law—which will enable the Government (a) to place a reasonable floor under the prices the farmer may expect for his production; and

This shows that smaller chunks are sometimes more effective than larger ones. In this case, returning only paragraph (3) would probably be enough.

## Add a New Document to Chroma

In [32]:
# load the document and split it into chunks
loader = TextLoader("Lincoln_State_of_Union_1862.txt")
documents = loader.load()

In [33]:
# split it into chunks
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=500)
docs = text_splitter.split_documents(documents)

Created a chunk of size 608, which is longer than the specified 500
Created a chunk of size 539, which is longer than the specified 500
Created a chunk of size 686, which is longer than the specified 500
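
The warnings above are probably explained by how the splitter works: as far as we understand, `CharacterTextSplitter` only cuts on its separator (a blank line by default), so a single paragraph longer than `chunk_size` cannot be broken up and is kept as one oversized chunk. A minimal sketch of that idea follows. `split_text` is a hypothetical helper, not LangChain code; it measures length in characters instead of tokens and ignores chunk overlap.

```python
def split_text(text, chunk_size, separator="\n\n"):
    """Greedy separator-based splitter, measured in characters for simplicity."""
    chunks = []
    oversized = []  # lengths of pieces that could not fit in one chunk
    current = ""
    for piece in text.split(separator):
        if len(piece) > chunk_size:
            # Nothing to cut on inside this piece, so it will exceed the limit.
            oversized.append(len(piece))
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Keep merging pieces while the running chunk still fits.
            current = candidate
        else:
            # The next piece does not fit: close the chunk and start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks, oversized


text = "\n\n".join(["a" * 10, "b" * 10, "c" * 40, "d" * 5])
chunks, oversized = split_text(text, chunk_size=25)
chunk_lengths = [len(chunk) for chunk in chunks]
```

Here the two short paragraphs are merged into one chunk, while the 40-character paragraph stays whole even though the limit is 25. If oversized chunks are a problem, a splitter that can fall back to finer separators is one option to look at.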


In [34]:
# load it into Chroma; reusing the same persist_directory adds these chunks to the existing store
db = Chroma.from_documents(docs, embedding_function, persist_directory='./speech_embedding_db')

In [35]:
docs = db.similarity_search('slavery')

In [36]:
docs[0].page_content

'As to the second article, I think it would be impracticable to return to bondage the class of persons therein contemplated. Some of them, doubtless, in the property sense belong to loyal owners, and hence provision is made in this article for compensating such. The third article relates to the future of the freed people. It does not oblige, but merely authorizes Congress to aid in colonizing such as may consent. This ought not to be regarded as objectionable on the one hand or on the other, insomuch as it comes to nothing unless by the mutual consent of the people to be deported and the American voters, through their representatives in Congress.\n\nI can not make it better known than it already is that I strongly favor colonization; and yet I wish to say there is an objection urged against free colored persons remaining in the country which is largely imaginary, if not sometimes malicious.\n\nIt is insisted that their presence would injure and displace white labor and white laborers. 

In [37]:
docs = db.similarity_search('FDR food law')

In [38]:
docs[0].page_content

'That is the way to fight and win a war—all out—and not with half-an-eye on the battlefronts abroad and the other eye-and-a-half on personal, selfish, or political interests here at home.\n\nTherefore, in order to concentrate all our energies and resources on winning the war, and to maintain a fair and stable economy at home, I recommend that the Congress adopt:\n\n(1) A realistic tax law—which will tax all unreasonable profits, both individual and corporate, and reduce the ultimate cost of the war to our sons and daughters. The tax bill now under consideration by the Congress does not begin to meet this test.\n\n(2) A continuation of the law for the renegotiation of war contracts—which will prevent exorbitant profits and assure fair prices to the Government. For two long years I have pleaded with the Congress to take undue profits out of war.\n\n(3) A cost of food law—which will enable the Government (a) to place a reasonable floor under the prices the farmer may expect for his produc

In [45]:
docs[0].metadata # this shows the source of the chunk

{'source': 'FDR_State_of_Union_1944.txt'}

## Retrievers

In [49]:
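# reconnect to the persisted store; docs now holds search results, so passing it to from_documents would add duplicates
db_new_connection = Chroma(persist_directory='./speech_embedding_db', embedding_function=embedding_function)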

In [52]:
retriever = db_new_connection.as_retriever()

In [53]:
retriever.get_relevant_documents("cost of food law")

[Document(page_content='That is the way to fight and win a war—all out—and not with half-an-eye on the battlefronts abroad and the other eye-and-a-half on personal, selfish, or political interests here at home.\n\nTherefore, in order to concentrate all our energies and resources on winning the war, and to maintain a fair and stable economy at home, I recommend that the Congress adopt:\n\n(1) A realistic tax law—which will tax all unreasonable profits, both individual and corporate, and reduce the ultimate cost of the war to our sons and daughters. The tax bill now under consideration by the Congress does not begin to meet this test.\n\n(2) A continuation of the law for the renegotiation of war contracts—which will prevent exorbitant profits and assure fair prices to the Government. For two long years I have pleaded with the Congress to take undue profits out of war.\n\n(3) A cost of food law—which will enable the Government (a) to place a reasonable floor under the prices the farmer ma

## Multi query retriever

Sometimes the documents in our vector store are so large that they contain phrasing we are not aware of, which makes it hard to come up with the best query string to search with.

The multi-query retriever helps with this by sending the query to an LLM and having the LLM generate multiple query variations that are similar to our original query. This way we do not have to find the single best query and can focus on the general idea of the query instead.
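
The idea can be sketched in plain Python: generate several rephrasings, run a retrieval for each one, and keep the unique union of the results. This is a sketch under stated assumptions, not LangChain's implementation. `generate_variations` is a hypothetical stub returning hand-written rephrasings where the real retriever would call an LLM, `retrieve` is a toy word-overlap search, and the corpus is made up.

```python
import re

CORPUS = [
    "The library opens at nine on weekdays.",
    "Opening hours are shorter during public holidays.",
    "The reading room closes early on Fridays.",
    "Parking is free behind the building.",
]


def words(text):
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query, k=1):
    # Toy retriever: rank documents by how many words they share with the query.
    scored = [(len(words(query) & words(doc)), doc) for doc in CORPUS]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


def generate_variations(question):
    # Hypothetical stand-in for the LLM call: hand-written rephrasings.
    return [
        "What time does the library open?",
        "What are the opening hours?",
        "What time can I get in on weekdays?",
    ]


def multi_query_retrieve(question, k=1):
    unique_docs = []
    for variation in generate_variations(question):
        for doc in retrieve(variation, k=k):
            # Keep each document once, in the order it was first found.
            if doc not in unique_docs:
                unique_docs.append(doc)
    return unique_docs


single = retrieve("When does the library open?")
combined = multi_query_retrieve("When does the library open?")
```

The single query only finds the first document, while the combined result also includes the one about opening hours, which the original wording did not match well.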



In [55]:
from langchain.document_loaders import WikipediaLoader
loader = WikipediaLoader(query='MKUltra')
documents = loader.load()

In [57]:
len(documents)

24

In [77]:
len(documents[0].page_content)

4000

We have 24 documents and each one is quite large, so let's split them into smaller chunks with the character text splitter.

In [62]:
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=500)
docs = text_splitter.split_documents(documents)

Created a chunk of size 504, which is longer than the specified 500
Created a chunk of size 528, which is longer than the specified 500


In [64]:
len(docs)

52

In [78]:
embedding_function = OpenAIEmbeddings()

In [79]:
# Now we can save the content downloaded from Wikipedia, split into chunks and vectorized with embedding_function, to ChromaDB
db = Chroma.from_documents(docs, embedding_function, persist_directory='./some_new_mk_ultra')
db.persist()

In [97]:
from langchain.chat_models import ChatOpenAI
from langchain.retrievers.multi_query import MultiQueryRetriever

In [98]:
question = "When was this declassified?"
llm = ChatOpenAI(temperature=0)
retriever_from_llm = MultiQueryRetriever.from_llm(retriever=db.as_retriever(), llm=llm)

In [92]:
retriever_from_llm.llm_chain.prompt.template

'You are an AI language model assistant. Your task is \n    to generate 3 different versions of the given user \n    question to retrieve relevant documents from a vector  database. \n    By generating multiple perspectives on the user question, \n    your goal is to help the user overcome some of the limitations \n    of distance-based similarity search. Provide these alternative \n    questions separated by newlines. Original question: {question}'

We can see that the prompt is asking to generate 3 different versions of the same question to retrieve documents from a vector database.

In [84]:
# Behind the scenes logging

import logging
logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

In [None]:
# This does not answer the query, it only returns the documents that are most similar/relevant
unique_docs = retriever_from_llm.get_relevant_documents(query=question)

INFO:langchain.retrievers.multi_query:Generated queries: ['1. What is the date of declassification for this information?', '2. Can you provide the declassification date for this?', '3. At what time was this information officially declassified?']


In [None]:
print(unique_docs[0].page_content)

The United States President's Commission on CIA Activities within the United States was ordained by President Gerald Ford in 1975 to investigate the activities of the Central Intelligence Agency and other intelligence agencies within the United States. The Presidential Commission was led by Vice President Nelson Rockefeller, from whom it gained the nickname the Rockefeller Commission.
The commission was created in response to a December 1974 report in The New York Times that the CIA had conducted illegal domestic activities, including experiments on US citizens, during the 1960s. The commission issued a single report in 1975, touching upon certain CIA abuses including mail opening and surveillance of domestic dissident groups. It also publicized Project MKUltra, a CIA mind control research program.
Several weeks later, committees were established in the House and Senate for a similar purpose. White House Personnel, including future Vice President Dick Cheney, edited the results, exclud