# 02 Embeddings and Vectors

In this lab, we'll explore how we can bring our own data into the models used by Azure OpenAI.

We'll start, as usual, by defining our Azure OpenAI service API key and endpoint details and specifying the model deployment we want to use. Then we'll initiate a connection to the Azure OpenAI service.

In [None]:
import os
from langchain.llms import AzureOpenAI
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

DEPLOYMENT_ID = "text-davinci-003"  # Replace with the name of your own model deployment

openai_api_type = os.getenv("OPENAI_API_TYPE")
openai_api_key = os.getenv("OPENAI_API_KEY")
openai_api_base = os.getenv("OPENAI_API_BASE")
openai_api_version = os.getenv("OPENAI_API_VERSION")

# Create an instance of Azure OpenAI
llm = AzureOpenAI(
    openai_api_type = openai_api_type,
    openai_api_version = openai_api_version,
    openai_api_base = openai_api_base,
    openai_api_key = openai_api_key,
    deployment_name = DEPLOYMENT_ID
)

Now, let's ask the AI a question.

In [29]:
# Call the API
r = llm("Tell me about the latest Ant-Man movie. When was it released? What is it about?")

# Print the response
print(r)



Ant-Man and the Wasp is the latest movie in the Marvel Cinematic Universe and the sequel to 2015's Ant-Man. The film was released in the United States on July 6, 2018. The movie follows Scott Lang (Paul Rudd) as he balances his home life as a father with his responsibilities as Ant-Man. He is enlisted by Dr. Hank Pym (Michael Douglas) and his daughter Hope van Dyne (Evangeline Lilly) to help them with a secret mission. They must find Hope's mother, Janet van Dyne (Michelle Pfeiffer), who is lost in the Quantum Realm. In the process, they must deal with a new villain, Ghost (Hannah John-Kamen).


What do you notice about the response?

The AI thinks the latest "Ant-Man" movie was "Ant-Man and the Wasp" and it was released in July 2018. 

OpenAI models are trained on a large set of data, but that happened at a specific point in time depending on the model. So, many of the models have no information about events that took place in recent months or years.

To help the AI out, we can provide additional information. This is the same process you would follow if you wanted the AI to work with your own company data. The AI won't know about information that you don't make publicly available, so if you want the AI to work with that information, you'll need to get that information into the model.

The thing is, you can't actually do that. The models are pre-trained, so the only way to get more information in is to retrain the model, which is an expensive and time-consuming process.

However, there *are* ways to get the AI models to work with new data. The most popular of these methods is to use *embeddings*, which we'll explore in the next sections.


## Bring Your Own Data

Langchain provides a number of useful tools, including tools that simplify the process of working with external documents. Below, we'll use the `DirectoryLoader`, which can read multiple files from a directory, and the `UnstructuredMarkdownLoader`, which can process files in Markdown format. We'll use these to process a set of Markdown-formatted files that contain details of movies released in 2023.

In [None]:
from langchain.document_loaders import DirectoryLoader, UnstructuredMarkdownLoader

data_dir = "data/movies"

documents = DirectoryLoader(path=data_dir, glob="*.md", show_progress=True, loader_cls=UnstructuredMarkdownLoader).load()

We now have a `documents` object which contains all of the information from our documents about movies.

Let's use the `question_answering` chain to query our AI again.

In [30]:
# Question answering chain
from langchain.chains.question_answering import load_qa_chain

# Prepare the chain and the query
chain = load_qa_chain(llm)
query = "Tell me about the latest Ant Man movie. When was it released? What is it about?"

chain.run(input_documents=documents, question=query)

" The latest Ant Man movie is called Ant Man and the Wasp: Quantumania and it was released on 2023-02-15. It is about Super-Hero partners Scott Lang and Hope van Dyne, along with with Hope's parents Janet van Dyne and Hank Pym, and Scott's daughter Cassie Lang, finding themselves exploring the Quantum Realm, interacting with strange new creatures and embarking on an adventure that will push them beyond the limits of what they thought possible."

Great! The model now knows about the latest Ant-Man movie.

However, there's something lurking! Let's take a look at what happened behind the scenes.

We'll do two things here. First, we'll add the `verbose=True` parameter to the chain. Second, we'll wrap the chain execution in a callback, which allows us to capture the number of tokens consumed.

In [None]:
# Support for callbacks
from langchain.callbacks import get_openai_callback

# Prepare the chain and the query
chain = load_qa_chain(llm, verbose=True)
query = "Tell me about the latest Ant Man movie. When was it released? What is it about?"

# Run the chain, using the callback to capture the number of tokens used
with get_openai_callback() as callback:
    chain.run(input_documents=documents, question=query)
    total_tokens = callback.total_tokens

print(f"Total tokens used: {total_tokens}")

Wow! That request used around 2,900 tokens! That's a lot of tokens. Plus, with the verbose option enabled, you can see that a prompt was constructed which included all of the information from our documents in the prompt, which is why it used so many tokens.

As we've discussed previously, AI models have a maximum number of tokens you can use. The documents we're working with here are relatively small and there are only 20 of them, so clearly this approach is not going to scale when we want to work with larger documents and more of them.
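
To make the scaling problem concrete, here is a minimal sketch of how a prompt grows when every document is placed into it. The documents, the prompt wording and the `build_stuffed_prompt` and `estimate_tokens` helpers are all hypothetical, and the figure of roughly 4 characters per token is only a rule of thumb for English text, not an exact count.

```python
# Hypothetical stand-ins for the loaded movie documents.
docs = [f"Movie {i}: " + "plot summary text " * 30 for i in range(20)]


def build_stuffed_prompt(documents, question):
    """Concatenate every document into a single prompt (illustrative wording only)."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )


def estimate_tokens(text, chars_per_token=4):
    """Very rough estimate, assuming about 4 characters per token."""
    return len(text) // chars_per_token


question = "Tell me about the latest Ant Man movie."
for count in (5, 10, 20):
    prompt = build_stuffed_prompt(docs[:count], question)
    print(f"{count:>2} documents -> ~{estimate_tokens(prompt)} tokens")
```

The estimate grows roughly in proportion to the number of documents, so a larger collection would soon exceed a model's token limit.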

## Vectors

The solution to working with large amounts of external information is to use *Vectors*. In simple terms, vectors allow human readable information to be converted into a numeric format that allows computers to understand the meaning as well. We can convert data into vectors and store that vector information in a database. We can then run queries by converting our human language query into a vector and then attempting to match that vector with vectors in the database. If the vector that represents your query is similar to vectors in the database, then it's likely to be a good response to the query.
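
As a rough illustration of what "similar" means here, the sketch below compares a query vector with two stored vectors using cosine similarity, one common similarity measure. The three-dimensional vectors are hand-written for the example; real embeddings come from a model and are much longer (`text-embedding-ada-002` produces 1,536 numbers per text).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means very similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-written toy vectors (assumptions for illustration, not real embeddings).
stored = {
    "superhero movie": [0.9, 0.1, 0.0],
    "cooking recipe": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]

# Pick the stored item whose vector is most similar to the query vector.
best = max(stored, key=lambda name: cosine_similarity(stored[name], query_vector))
print(best)  # superhero movie
```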

To prevent overloading a prompt with a large number of tokens, we can perform a vector search first to narrow down to a set of interesting results, and then use that smaller subset of information as part of a prompt.

The process of creating embeddings and ... usually looks something like this:

1. Use an embeddings model to vectorise documents.
2. Save the vectors to a vector database
3. Use an embeddings model to vectorise a query you want to perform
4. Search the vector database using the vectorised query to find matches
5. Use the search results to pass to the AI and 
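
The steps above can be sketched end to end in plain Python. This is only a toy: the `embed` function counts words instead of calling an embeddings model, the "vector database" is a list in memory, and the documents are made up for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: word counts. A stand-in for a real embeddings model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical documents, invented for this sketch.
documents = [
    "Quantumania: Scott Lang and Hope van Dyne explore the Quantum Realm.",
    "A chef opens a restaurant in Paris and learns to cook.",
    "Two detectives investigate a robbery in Chicago.",
]

# Steps 1 and 2: vectorise the documents and save the vectors.
vector_store = [(embed(doc), doc) for doc in documents]

# Step 3: vectorise the query.
query = "Which movie explores the Quantum Realm?"
query_vector = embed(query)

# Step 4: search the store for the closest match.
ranked = sorted(vector_store, key=lambda item: similarity(item[0], query_vector), reverse=True)
top_matches = [doc for _, doc in ranked[:1]]

# Step 5: build a prompt from the search results only, not from every document.
prompt = "Context:\n" + "\n".join(top_matches) + f"\n\nQuestion: {query}"
print(prompt)
```

Only the best-matching document ends up in the prompt, which is what keeps the token count down.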

AI orchestration tools aim to simplify this process.

We'll use the `text-embedding-ada-002` embeddings model. If your deployment of this model has a different name, replace the value below as appropriate.

In [12]:
from langchain.embeddings import OpenAIEmbeddings
EMBEDDING_MODEL = "text-embedding-ada-002"

In [13]:
embeddings_model = OpenAIEmbeddings(
    openai_api_type = openai_api_type,
    openai_api_version = openai_api_version,
    openai_api_base = openai_api_base,
    openai_api_key = openai_api_key,
    deployment = EMBEDDING_MODEL,
    chunk_size = 1
)

**NOTE:** The `chunk_size = 1` parameter is used to work around a temporary limitation in the Azure OpenAI API, which only allows one embedding to be processed with each call to the API.

Now that we've initialised a model to create embeddings, let's go ahead and embed some documents.

As we did in the previous example, we'll use Langchain's built-in loaders to read the documents from a directory.

In [31]:
documents = DirectoryLoader(path=data_dir, glob="*.md", show_progress=True, loader_cls=UnstructuredMarkdownLoader).load()

100%|██████████| 20/20 [00:00<00:00, 152.52it/s]


The next step is to use a *splitter*. A splitter enables us to break up larger documents into chunks, so that we don't risk hitting the token limit when submitting our data to the embedding model.
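
To give a feel for what a splitter does, here is a simplified sketch: it splits the text on a separator and then merges the pieces back together into chunks that stay under a character limit. It is not Langchain's implementation, the `split_text` function is hypothetical, and it assumes a blank line (`"\n\n"`) as the separator.

```python
def split_text(text, chunk_size=1000, separator="\n\n"):
    """Split on the separator, then merge pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is kept whole in this sketch.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # The piece still fits (or it starts a new chunk), so keep accumulating.
            current = candidate
        else:
            # Adding the piece would exceed the limit: close the chunk and start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Ten short paragraphs of sample text, separated by blank lines.
sample = "\n\n".join(f"Paragraph {i}: " + "x" * 40 for i in range(10))
chunks = split_text(sample, chunk_size=120)
print(len(chunks), [len(chunk) for chunk in chunks])
```

Because the split happens on paragraph boundaries, each chunk contains whole paragraphs rather than text cut off mid-sentence.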

In [32]:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
document_chunks = text_splitter.split_documents(documents)

The next stage is to convert the chunks of split documents into vectors which we do by passing the data through an embedding model. The resultant vectors are then stored in a vector database. In this example, we're using the **Qdrant** (pronounced 'quadrant') database. We initialise it using the `location=":memory:"` option, so that the database will be stored in memory rather than persisted to disk.

In [33]:
from langchain.vectorstores import Qdrant

qdrant = Qdrant.from_documents(
    document_chunks,
    embeddings_model,
    location=":memory:",
    collection_name="movies",
)

100%|██████████| 1/1 [00:00<00:00,  5.33it/s]
100%|██████████| 20/20 [00:01<00:00, 18.17it/s]


In [34]:
retriever = qdrant.as_retriever()

In [35]:
from langchain.chains import RetrievalQA
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)

Now, we'll run our query again. However, we'll make one small change.

You may be thinking that it's not surprising that the AI now knows about the latest Ant-Man movie, because we told it about the latest Ant-Man movie! So, let's try to show that the AI is actually doing some work here.

If you're not a fan of these movies, Ant-Man originates from Marvel comic books. The collection of movies that originate from Marvel comic books is said to be part of the Marvel Cinematic Universe, sometimes referred to as the MCU. We haven't mentioned Marvel or the MCU in the data we've provided, so if we ask the AI about the MCU, let's see if it can use its models to figure out what we mean.

In [36]:
query = "Tell me about the latest MCU movie. When was it released? What is it about?"
qa.run(query)

" The latest MCU movie is Ant-Man and the Wasp: Quantumania. It was released on February 15, 2023. It follows the story of Scott Lang and Hope van Dyne, along with with Hope's parents Janet van Dyne and Hank Pym, and Scott's daughter Cassie Lang, who explore the Quantum Realm and interact with strange creatures. They must save the world from a new threat."

AI Orchestrators like Langchain and Semantic Kernel can help simplify the process of embedding, vectorization and search. In the code below, we use Langchain's document loader as we did previously to load and process our Markdown formatted documents. We also use a `VectorstoreIndexCreator` which you can see only requires a couple of parameters - the embedding model that we want to use and the source data (`loader`) to use. However, that simple code hides an awful lot of complexity.

Behind the scenes, the `VectorstoreIndexCreator` does several things.

- Documents are split into chunks. This ensures that large documents don't use more tokens than the models allow.
- Embeddings (vectors) are created for each chunk.
- The chunks and their embeddings are placed in a vector store database.
- A `retriever` is created that can be used for querying.

You can implement each of these steps yourself using Langchain, which will give you more control over the process. However, using the `VectorstoreIndexCreator` provides a quick solution for this walkthrough.

In [None]:
from langchain.indexes import VectorstoreIndexCreator

loader = DirectoryLoader(path=data_dir, glob="*.md", show_progress=True, loader_cls=UnstructuredMarkdownLoader)

index = VectorstoreIndexCreator(
    # Reuse the embeddings model configured earlier so that its deployment name is used
    embedding=embeddings_model
).from_loaders([loader])

**NOTE**: Depending on your configuration, you might hit a `RateLimitError` when running the above. However, you will notice that Langchain detects that you were rate limited and automatically retries the request after a few seconds. This is another of the advantages of using an AI Orchestrator!
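
To show the general idea behind this kind of automatic retry, here is a minimal sketch of retrying with exponential backoff. It is not Langchain's actual code: the `RateLimitError` class, the `call_with_retries` helper and the simulated `flaky_call` are all hypothetical stand-ins.

```python
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the error an API client raises when rate limited."""


def call_with_retries(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func, waiting longer after each rate-limit failure before trying again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except RateLimitError:
            if attempt == max_attempts:
                raise  # Give up after the final attempt.
            sleep(base_delay * 2 ** (attempt - 1))  # Wait 1s, 2s, 4s, ...


# Simulated API call that is rate limited twice and then succeeds.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimitError("rate limited")
    return "ok"


# Record the delays instead of actually sleeping, so the example runs instantly.
delays = []
result = call_with_retries(flaky_call, sleep=delays.append)
print(result, delays)  # ok [1.0, 2.0]
```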

Now, to run a query against our data, we specify the question and then call `query` on the index we created above, passing in the model (`llm`) we want to use.

In [None]:
query = "Tell me about the latest Ant Man movie. When was it released? What is it about?"
index.query(llm=llm, question=query)

The above looks the same as the result we had previously, as the AI has been able to return correct details about the very latest Ant-Man movie that was released in 2023. So, what's the difference?

In this case, the query that you pass in is vectorised and then the vector database that was created behind the scenes is searched. Any matches found in the database are then used when sending the complete prompt to the AI, rather than using all of the documents as we did before.

Like we did last time, let's use a callback so we can see how many tokens were used.

In [None]:
# Run the chain, using the callback to capture the number of tokens used
with get_openai_callback() as callback:
    index.query(llm=llm, question=query)
    total_tokens = callback.total_tokens

print(f"Total tokens used: {total_tokens}")

The exact number of tokens used may vary, but it should be clear that this query will have used far fewer tokens than our original query, typically around 2,000 fewer.