Now that we have the document loaded and parsed, the next step is to calculate an "embedding" of that text.
An embedding, in its simplest form, is a multi-dimensional mathematical vector that represents the meaning of a piece of text. Texts with similar topics and content have vectors that are close together, so the distance between two vectors tells us how similar the texts are.

Embeddings complement traditional search engines that rely on keywords or synonyms. Both approaches have their place, but in this example we'll use embeddings to calculate the proximity of texts.
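
To make "proximity" concrete, here is a minimal sketch of cosine similarity, one common way to compare two embedding vectors. The three-dimensional vectors are made up for illustration (real embeddings have far more dimensions), and the helper name `cosine_similarity` is our own, not part of any library used in this notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings", invented for this example only
greeting_1 = [0.9, 0.1, 0.0]
greeting_2 = [0.8, 0.2, 0.1]
question = [0.1, 0.2, 0.9]

sim_greetings = cosine_similarity(greeting_1, greeting_2)
sim_mixed = cosine_similarity(greeting_1, question)

print(f"greeting vs greeting: {sim_greetings:.3f}")
print(f"greeting vs question: {sim_mixed:.3f}")
```

The two greeting vectors point in nearly the same direction, so their score is close to 1, while the greeting and the question score much lower.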

Why similarity, you might ask? RAG allows us to give an LLM more context about a subject when we ask it a question. We build that context by looking up documents that are similar to the question and passing their content to the LLM. In this section we lay the groundwork for that end goal.

In [1]:
# Install the necessary packages - langchain-openai pulls in tiktoken, which is used for counting tokens
%pip install -q langchain langchain-openai

Note: you may need to restart the kernel to use updated packages.


In [2]:
%pip install -q python-dotenv
from dotenv import load_dotenv

# Load environment variables (for example OPENAI_API_KEY) from a local .env file
load_dotenv()

Note: you may need to restart the kernel to use updated packages.


True

In the following example we take a handful of texts and calculate an embedding for each of them.
There are many different flavors of embedding models. Some work better with long texts, others with multilingual texts. LLM providers often offer an API to calculate embeddings through their service, but those are not necessarily the best choice for your use case.

In [3]:
# https://python.langchain.com/docs/modules/data_connection/text_embedding/

from langchain_openai import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings()

embeddings = embeddings_model.embed_documents(
    [
        "Hi there!",
        "Oh, hello!",
        "What's your name?",
        "My friends call me World",
        "Hello World!"
    ]
)
print(f"Embeddings calculated: {len(embeddings)}")
print(f"Vectorsize of first embedding :{len(embeddings[0])}")


Embeddings calculated: 5
Vectorsize of first embedding :1536


Besides calculating embeddings for documents, we can also calculate the embedding of a question.
The assumption is that the documents whose embeddings are closest to the question's embedding are the ones best suited to answer it.
The query embedding has the same vector size as the document embeddings, in this case 1536.

In [4]:
# calculate the embedding of the query
embedded_query_results = embeddings_model.embed_query("What was the name mentioned in the conversation?")
print(f"Vectorsize of query text embedding :{len(embedded_query_results)}")

Vectorsize of query text embedding :1536
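
With both document and query embeddings available, finding the "best" documents comes down to ranking the document vectors by their similarity to the query vector. The sketch below shows one way to do that with cosine similarity. The helper names `cosine_similarity` and `rank_by_similarity` are hypothetical, and the two-dimensional vectors are toy stand-ins so the example runs without an API key.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vector, document_vectors):
    """Return (index, score) pairs, sorted from most to least similar to the query."""
    scores = [
        (index, cosine_similarity(query_vector, vector))
        for index, vector in enumerate(document_vectors)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy stand-ins for real embeddings (assumption: invented values for illustration)
toy_documents = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
toy_query = [0.1, 0.9]

ranking = rank_by_similarity(toy_query, toy_documents)
for index, score in ranking:
    print(f"document {index}: similarity {score:.3f}")
```

In this notebook the same function could be called as `rank_by_similarity(embedded_query_results, embeddings)`, where each returned index refers to a position in the list of five texts passed to `embed_documents`.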


In [5]:
print(embedded_query_results)

[0.005377274053424193, -0.0006527779663918576, 0.038980290283414216, -0.0029673974995148606, -0.008834564037682272, 0.021192398079370758, -0.017154494072746104, -0.001736892552805223, -0.0030053353938687753, -0.010418056571957662, 0.02239321199391067, 0.009157860593078297, 0.003971925383860513, -0.009322808081311315, -0.010154141149578431, 0.0028568830269881237, 0.03642031152452606, 0.004321612829568597, 0.02057219679021343, -0.032356015696266685, -0.003221415988781072, -0.005664282086903141, 0.0015241108191819336, 0.026233180132242895, -0.011434129597699844, 0.019648494208870115, 0.028159762622479025, -0.018421286610050155, -0.0025418342650989478, -0.016824599096280066, 0.011434129597699844, 0.0010465890933763265, -0.01475286531234362, 0.006822210455439956, -0.050909259551845124, -0.003925740161661081, 0.005565313687095596, -0.017867065806302297, 0.026681835512097192, 0.004661404040655658, 0.022775890613646148, 0.0017550366973484285, -0.005403665409397583, -0.017946239408561136, -0.01