## Top-K Similarity Search 
Top-K Similarity Search is a retrieval method: it specifies how you retrieve documents from your knowledge base.

Top-K Similarity Search pulls the "K" documents from your knowledge base that are most similar to a query, according to your similarity metric.

Two concepts are important here:

- **Similarity Metric** - This is the equation or metric you'll use to compare the embedding of your query to the embeddings of your corpus of documents. The most common similarity metrics are cosine similarity, dot product, and Euclidean distance. For cosine similarity and dot product, a larger value means two embeddings are more similar; for Euclidean distance, a smaller value does.
- **Number of documents to retrieve** - This is the "K" in Top-K. It may seem like more is better, but it's important to keep the signal-to-noise ratio of your retrieval process high. The more documents you retrieve, the more likely it is you'll overwhelm your model.

You'll often see us refer to this method as "vanilla retrieval." It's a delicious example of the basic retrieval method that powers many chatbots and question/answer tools. It's also commonly referred to as the "Hello world" example of retrieval.
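To make the two concepts above concrete, here's a minimal pure-Python sketch of the three metrics and a Top-K selection. The 2-D vectors are made up for illustration (real embeddings have hundreds or thousands of dimensions), and `top_k` is a hypothetical helper, not something from LangChain or Chroma.

```python
import math

def dot_product(a, b):
    # Larger value = more similar (sensitive to vector length)
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Larger value = more similar; ranges from -1 to 1
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)

def euclidean_distance(a, b):
    # Smaller value = more similar (it is a distance, not a similarity)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k documents most similar to the query (cosine)."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,  # highest cosine similarity first
    )
    return ranked[:k]

# Toy "embeddings", invented for this example
query_vec = [1.0, 0.0]
doc_vecs = [
    [0.9, 0.1],   # points almost the same way as the query
    [0.0, 1.0],   # orthogonal to the query
    [-1.0, 0.0],  # opposite direction
    [0.5, 0.5],   # 45 degrees away
]

print(top_k(query_vec, doc_vecs, k=2))  # -> [0, 3]
```

Raising `k` here would start pulling in the orthogonal and opposite vectors, which is the signal-to-noise trade-off in miniature.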

Why is this helpful?
- **Simple** - This method has very few moving parts while taking advantage of powerful language models.
- **Foundational** - This method is the starting point for many advanced methods of retrieval.

In [10]:
%pip install openai python-dotenv --quiet

Note: you may need to restart the kernel to use updated packages.


In [1]:
# PDF Loaders. If unstructured gives you a hard time, try PyPDFLoader
from langchain.document_loaders import UnstructuredPDFLoader, OnlinePDFLoader, PyPDFLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from dotenv import load_dotenv
import os

load_dotenv()

True

**Load your data**

Next let's load up some data. I've included a few loaders below which read data from different sources. Feel free to use the one that suits you. The default one reads one of Paul Graham's essays as a simple example. This step only sets up the loader; it doesn't actually load the data yet.

In [2]:
loader = TextLoader(file_path="../data/vb.txt")

# Other options for loaders
# loader = PyPDFLoader("../data/field-guide-to-data-science.pdf")
# loader = UnstructuredPDFLoader("../data/field-guide-to-data-science.pdf")
# loader = OnlinePDFLoader("https://wolfpaulus.com/wp-content/uploads/2017/05/field-guide-to-data-science.pdf")

Then let's go ahead and actually load the data.

In [3]:
data = loader.load()

Then let's check out what's been loaded.

In [4]:
# Note: If you're using PyPDFLoader then it will split by page for you already
print(f'You have {len(data)} document(s) in your data')
print(f'There are {len(data[0].page_content)} characters in your sample document')
print(f'Here is a sample: {data[0].page_content[:200]}')

You have 1 document(s) in your data
There are 9160 characters in your sample document
Here is a sample: Life is Short

January 2016

Life is short, as everyone knows. When I was a kid I used to wonder about this. Is life actually short, or are we really complaining about its finiteness? Would we be just


**Chunk your data up into smaller documents**

While we could pass the entire essay to a model with a long context window, we want to be picky about which information we share with our model. The better our signal-to-noise ratio, the more likely we are to get the right answer.

The first thing we'll do is chunk up our document into smaller pieces. The goal will be to take only a few of those smaller pieces and pass them to the LLM.

In [5]:
# We'll split our data into chunks around 500 characters each with a 50 character overlap. These are relatively small.

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
texts = text_splitter.split_documents(data)

In [6]:
# Let's see how many small chunks we have
print(f'Now you have {len(texts)} documents')

Now you have 27 documents
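To make `chunk_size` and `chunk_overlap` concrete, here's a simplified sliding-window splitter. It's only a sketch: `chunk_text` is a hypothetical helper, and `RecursiveCharacterTextSplitter` itself prefers to break on separators such as paragraph breaks, newlines, and spaces, so its chunk boundaries (and chunk count) will differ from this naive version.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-size character windows that overlap their neighbors."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

# Tiny example so the overlap is easy to see
sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# -> ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one. That overlap is there so a sentence cut at a boundary still appears mostly intact in at least one chunk.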


**Create embeddings of your documents to get ready for semantic search**

Next up we need to prepare for similarity searches. The way we do this is through embedding our documents (getting a vector per document).

This will help us compare documents later on.

In [11]:
from langchain.vectorstores import Chroma
from langchain_openai import AzureOpenAIEmbeddings

Then we'll get our embeddings engine going. You can use whatever embeddings engine you would like. We'll use an Azure OpenAI deployment named `text-embedding-3-small` today.

In [13]:
embeddings = AzureOpenAIEmbeddings(
    azure_endpoint=os.getenv("AZURE_OPENAI_EMBEDDING_ENDPOINT"),
    openai_api_key=os.getenv("AZURE_OPENAI_EMBEDDING_API_KEY"),
    openai_api_version=os.getenv("AZURE_OPENAI_EMBEDDING_API_VERSION"),
    azure_deployment="text-embedding-3-small",
)

I like Chroma because it's local and easy to set up without an account.

First we'll pass our texts to Chroma via `.from_documents`. This will 1) embed the documents to get a vector for each one, then 2) add them to the vectorstore for retrieval later.
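Conceptually, those two steps look something like the sketch below. Everything in it is a made-up stand-in: `toy_embed` uses word counts instead of a real embedding model, and `ToyVectorStore` is a hypothetical in-memory class that only borrows the `similarity_search` method name for familiarity. It is not how Chroma is implemented.

```python
import math
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag of word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class ToyVectorStore:
    def __init__(self, embed):
        self.embed = embed
        self.rows = []  # list of (vector, text) pairs

    @classmethod
    def from_texts(cls, texts, embed):
        store = cls(embed)
        for text in texts:
            vector = embed(text)                # step 1: embed the document
            store.rows.append((vector, text))   # step 2: keep it for retrieval later
        return store

    def similarity_search(self, query, k=4):
        query_vec = self.embed(query)  # the query is embedded the same way
        ranked = sorted(self.rows, key=lambda row: cosine(query_vec, row[0]), reverse=True)
        return [text for _, text in ranked[:k]]

store = ToyVectorStore.from_texts(
    [
        "kids make you spend time on things that matter",
        "the stock market fell today",
        "time with small kids is short",
    ],
    toy_embed,
)
print(store.similarity_search("time with kids", k=2))
# -> ['time with small kids is short', 'kids make you spend time on things that matter']
```

The point to notice is that the query goes through the same embedding function as the documents, so the comparison happens between like and like.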

In [14]:
# load it into Chroma
vectorstore = Chroma.from_documents(texts, embeddings)

Let's test it out. I want to see which documents are most closely related to a query.

In [15]:
query = "What is great about having kids?"
docs = vectorstore.similarity_search(query)

Then we can check them out. In theory, the texts which are deemed most similar should hold the answer to our question. But keep in mind that our query just happens to be a question. It could be a random statement or sentence and it would still work.
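The call above doesn't set "K" explicitly. As far as I know, LangChain's `similarity_search` accepts a `k` argument and falls back to 4 when you leave it out. Here's a small sketch of making that choice visible; `retrieve_top_k` is a hypothetical wrapper of my own, and it assumes only that the store exposes `similarity_search(query, k=...)`.

```python
def retrieve_top_k(store, query, k=4):
    """Return the k documents most similar to `query` from a LangChain-style vector store.

    Assumes `store` has a `similarity_search(query, k=...)` method.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return store.similarity_search(query, k=k)
```

With the notebook's objects this would be called as `retrieve_top_k(vectorstore, query, k=2)`. A smaller `k` keeps the context tight, while a larger one improves the odds that the answer is somewhere in the results.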

In [16]:
# Print every document that was returned
for doc in docs:
    print(f"{doc.page_content}\n")

One great thing about having small children is that they make you spend time on things that matter: them. They grab your sleeve as you're staring at your phone and say "will you play with me?" And odds are that is in fact the bullshit-minimizing option.

It is possible to slow time somewhat. I've gotten better at it. Kids help. When you have small children, there are a lot of moments so perfect that you can't help noticing.

It does help too to feel that you've squeezed everything out of some experience. The reason I'm sad about my mother is not just that I miss her but that I think of all the things we could have done that we didn't. My oldest son will be 7 soon. And while I miss the 3 year old version of him, I at least don't have any regrets over what might have been. We had the best time a daddy and a 3 year old ever had.

Having kids showed me how to convert a continuous quantity, time, into discrete quantities. You only get 52 weekends with your 2 year old. If Christmas-as-magic 

**Query those docs to get your answer back**

Great, those are the docs which should hold our answer. Now we can pass them to a LangChain chain to query the LLM.

We could do this manually, but a chain is a convenient helper for us.
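For a sense of what "manually" would mean, here's a rough sketch of the "stuff" approach: put every retrieved chunk into a single prompt and ask the question at the end. `build_stuff_prompt` is a hypothetical helper and the prompt wording is my own illustration, not LangChain's exact template.

```python
def build_stuff_prompt(doc_texts, question):
    """Concatenate the retrieved chunks into one prompt, followed by the question."""
    context = "\n\n".join(doc_texts)
    return (
        "Use the following pieces of context to answer the question at the end. "
        "If you don't know the answer, say that you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )

# Made-up chunks, just to show the shape of the prompt
prompt = build_stuff_prompt(
    ["Kids make you spend time on things that matter.", "Kids help you slow time down."],
    "What is great about having kids?",
)
print(prompt)
```

With the notebook's objects, you'd pass `[d.page_content for d in docs]` as `doc_texts` and send the resulting string to the model, for example with `llm.invoke(prompt)` once an LLM is set up. The chain saves you from writing and maintaining that glue yourself.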

In [18]:
from langchain_openai import AzureChatOpenAI
from langchain.chains.question_answering import load_qa_chain

Here we'll initialize our LLM and our chain.

In [19]:
llm = AzureChatOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    openai_api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    openai_api_version=os.getenv("OPENAI_API_VERSION"),
    azure_deployment="gpt-4",
    temperature=0,
)

chain = load_qa_chain(llm, chain_type="stuff")

Then let's go ask the same question as before

In [20]:
query = "What is great about having kids?"
docs = vectorstore.similarity_search(query)

In [24]:
chain.invoke({"input_documents":docs, "question": query})

{'input_documents': [Document(page_content='One great thing about having small children is that they make you spend time on things that matter: them. They grab your sleeve as you\'re staring at your phone and say "will you play with me?" And odds are that is in fact the bullshit-minimizing option.', metadata={'source': '../data/vb.txt'}),
  Document(page_content="It is possible to slow time somewhat. I've gotten better at it. Kids help. When you have small children, there are a lot of moments so perfect that you can't help noticing.", metadata={'source': '../data/vb.txt'}),
  Document(page_content="It does help too to feel that you've squeezed everything out of some experience. The reason I'm sad about my mother is not just that I miss her but that I think of all the things we could have done that we didn't. My oldest son will be 7 soon. And while I miss the 3 year old version of him, I at least don't have any regrets over what might have been. We had the best time a daddy and a 3 year

Awesome! We just went and queried an external data source!