# Module 2 - Embedding, Vector Databases & Search

## Knowledge Databases

> **Introduction and Knowledge Bases**

Let's look in detail at one of the more common applications of LLMs:
**knowledge-based question answering**. This basically means getting an LLM to use
the information in a set of documents (a knowledge base) to answer questions and to do tasks
related to a specific area.

Knowledge base QA makes a lot of sense as a very common application because it really
helps people be more productive. You can use it for internal processes inside your company, and you
can use it for user-facing things like customer support. You can also update
the knowledge very quickly: just add some new documents and your application can
begin answering questions about them. It also works around many of the fundamental
challenges that a language model alone has. For example, a language model alone may
not accurately remember all knowledge, but if you put it into an application that
looks up some documents, cites them, and returns a source for everything it generates,
suddenly you've got something more reliable, where the user can look at the sources
and judge: *"yeah, this is accurate"* or *"it isn't"*.

We're using an LLM to understand and process
information in our knowledge base, but where do embeddings, vector databases and search come in?
These are the tools we use to help the LLM find the appropriate knowledge for a
particular task and then use it.

- **Embeddings** are a powerful kind of feature you get out of the way language models work: you can map documents and questions to mathematical vectors and find related things very easily.
- **Vector databases** are a type of search technology that helps you find relevant documents given a task or an input. You can use classical keyword-based search for this as well.

You can even train these tools to be good at that. Basically, we take some input to our LLM, search for relevant knowledge, and then feed that knowledge plus the original question into the LLM to get it to answer. We'll talk about the foundational
technologies required to do this and some of the algorithms in this space. This space is evolving fast, but we think this is one of the most common things you'll want to do when putting LLMs into practice, especially in a custom domain.

## Vector Databases

> **Vector Databases Use Cases**

Let's go through a few more example use cases for vector databases. With vector search
we can **calculate the similarity between vectors**. This is incredibly helpful for building
knowledge-based Q&A systems, and it also lets us find duplicate items.
In industry, people also use vector search and vector databases to
**build recommendation engines**. For example, Spotify published a blog post
about how they use vector databases to help build a recommendation
engine for podcast episodes based on user queries. Lastly, we can use vector
databases, or vector search in general, to **find anomalies and detect security threats**.

> **Search and Retrieval-Augmented Generation (RAG)**

RAG is a powerful technique for enhancing the performance of Large Language Models (LLMs) by leveraging external information retrieval. Here's a breakdown of the workflow:

1. Search:
    - A query is formed from the user's input; in some setups the LLM generates the query, or a partial output serves as the query.
    - This query is fed into an external search engine or knowledge base.
    - The search engine retrieves a set of relevant documents or information snippets.

2. Retrieval:
    - The retrieved information is filtered and ranked by its relevance to the query and context.
    - Techniques like keyword matching, semantic similarity scoring, or neural retrieval models are used.
    - The top-ranked information is selected for further processing.

3. Augmentation:
    - The retrieved information is added to the LLM's input context (the prompt).
    - This can be done by concatenating the information with the existing context, summarizing it, or extracting key facts and entities.
    - The augmented context gives the LLM more relevant material for the task or topic.

4. Generation:
    - The LLM uses the augmented context to generate a more accurate and informative output.
    - This could be an answer to or continuation of the initial query, a complete text document, or a creative format like a poem or script.

5. Benefits of RAG:
- Improved accuracy and relevance: External information helps the LLM avoid hallucinations and generate outputs more consistent with real-world knowledge.
- Enhanced factual grounding: The LLM can incorporate facts and evidence from retrieved sources, leading to more factual and verifiable outputs.
- Increased domain expertise: RAG enables LLMs to adapt to specific domains by utilizing relevant information from that domain.
- Reduced knowledge gap: RAG helps overcome the limitations of LLMs' internal knowledge by providing access to real-world information.

6. Challenges of RAG:
- Search engine dependence: The quality of retrieved information significantly impacts the LLM's performance.
- Information overload: Filtering and selecting the most relevant information from retrieved documents can be challenging.
- Integration challenges: Combining retrieved information with the LLM's existing context requires careful attention to coherence and consistency.

Overall, RAG is a promising technique for enhancing LLM capabilities. As research progresses and the challenges above are addressed, RAG is likely to play an increasingly important role in LLM applications.

7. Further exploration:
- Explore specific RAG architectures and their implementations.
- Dive deeper into the techniques used for information retrieval and ranking.
- Analyze the impact of RAG on different LLM tasks like question answering or text summarization.
- Consider the ethical implications of using external information sources with LLMs.
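
The four workflow steps above (search, retrieval, augmentation, generation) can be sketched in a few lines. This is only a toy illustration under stated assumptions: `embed` is a bag-of-words stand-in for a real embedding model, `generate` is a hypothetical placeholder for an LLM call, and the documents are made up.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, documents, k=2):
    """Search + retrieval: rank all documents by similarity, keep the top k."""
    q = embed(question)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(question, context_docs):
    """Augmentation: concatenate the retrieved text with the question."""
    context = "\n".join(f"- {d}" for d in context_docs)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )

def generate(prompt):
    """Generation: hypothetical placeholder where a real LLM would be called."""
    return f"[LLM answer based on a prompt of {len(prompt)} characters]"

# Made-up knowledge base for the example.
documents = [
    "Returns are accepted within 30 days of purchase.",
    "Shipping takes three to five business days.",
    "Gift cards never expire.",
]
question = "How many days do I have for returns?"
top_docs = retrieve(question, documents, k=1)
prompt = build_prompt(question, top_docs)
answer = generate(prompt)
```

In a real system the bag-of-words vectors would be replaced by embeddings from a trained model, and the sorted scan by a vector index, but the flow of data stays the same.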

## Search

> **Vector Search Strategies**

In vector search, there are two main strategies: 
- Exact Search: you use a brute-force method to find the nearest neighbors, comparing the query against every vector, so there is no room (or very little room) for error. This is what conventional k-nearest neighbors (KNN) generally does.
- Approximate Search: as the name implies, with approximate nearest neighbor (ANN) search you find less accurate nearest neighbors, but you gain speed.
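
As a minimal sketch of the exact strategy, the function below compares the query with every stored vector and keeps the `k` closest. The vectors and the function name `exact_knn` are made up for illustration.

```python
import math

def euclidean(a, b):
    """L2 distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_knn(query, vectors, k):
    """Brute force: score every vector, sort by distance, return the k nearest ids."""
    scored = sorted((euclidean(query, v), i) for i, v in enumerate(vectors))
    return [i for _, i in scored[:k]]

vectors = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.9, 1.2)]
neighbors = exact_knn((1.0, 1.0), vectors, k=2)
print(neighbors)  # [1, 3]
```

The cost grows linearly with the number of stored vectors, which is the reason approximate methods trade a little accuracy for speed.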


Here is a list of common indexing algorithms. We call them indexing algorithms because their output is a data structure called a vector index. As we mentioned in the earlier segment, a vector index holds all the information needed to conduct an efficient vector search. We will focus on FAISS and HNSW, two of the most popular options implemented by vector stores.

- Tree-based [Annoy by Spotify]
- Proximity graphs [HNSW]
- Clustering [FAISS by Facebook]
- Hashing [LSH]
- Vector Compression [ScaNN by Google]


> **Measure Vector Similarity**

How do we actually determine if two vectors are similar? The answer is to use distance or similarity metrics.
- For distance metrics, we commonly see L1 (Manhattan) distance or L2 (Euclidean) distance. Euclidean distance is often the more popular choice. The higher the distance, the less similar the vectors are.
- For similarity metrics, we commonly use cosine similarity. The higher the similarity, the more similar the vectors are.

It's also worth calling out that when you use either L2 distance or cosine similarity on normalized embeddings, they produce functionally equivalent rankings for your vectors.
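
The sketch below implements the three metrics and checks the equivalence on normalized vectors. The reason it holds is that for unit-length vectors the squared L2 distance equals 2 - 2 * (cosine similarity), so sorting by ascending distance and by descending similarity gives the same order. The example vectors are arbitrary.

```python
import math

def l1(a, b):
    """Manhattan distance."""
    return sum(abs(x - y) for x, y in zip(a, b))

def l2(a, b):
    """Euclidean distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def cosine_similarity(a, b):
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

def normalize(v):
    """Scale a vector to unit length."""
    n = norm(v)
    return [x / n for x in v]

query = normalize([1.0, 2.0, 3.0])
docs = [normalize(v) for v in ([1.0, 2.0, 2.5], [3.0, 1.0, 0.0], [-1.0, 0.5, 2.0])]

# Ascending distance vs. descending similarity.
by_l2 = sorted(range(len(docs)), key=lambda i: l2(query, docs[i]))
by_cos = sorted(range(len(docs)), key=lambda i: -cosine_similarity(query, docs[i]))
print(by_l2, by_cos)  # [0, 2, 1] [0, 2, 1]
```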


> **Compressing Vectors with Product Quantization**

Dense embedding vectors usually take up a lot of space. A common method to reduce memory usage is to compress the vectors using product quantization, abbreviated as PQ. PQ essentially reduces the number of bytes needed per vector. Quantization refers to representing the vectors using a smaller set of vectors. Very naively speaking, quantization is like rounding a number down or up to the nearest allowed value.

In the context of nearest neighbor search, we start with the original big vector and split it into segments, or subvectors.
Each subvector is then quantized independently by mapping it to its nearest centroid. Say the first subvector is closest to the first centroid, centroid 1. Then we replace that subvector with the value 1, the ID of the centroid. Now you can see how this reduces the number of bytes: instead of storing many floats per subvector, we store a single integer.
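
Here is a small sketch of that idea, assuming a deliberately simple k-means to learn the centroids (the codebook) for each segment. The data is random, the helper names are made up, and centroid IDs are 0-based here rather than starting at 1.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda c: squared_distance(point, centroids[c]))

def kmeans(points, k, iterations=10, seed=0):
    """Very small k-means, only meant to produce a codebook for this sketch."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids

def split(vector, num_segments):
    """Cut a vector into equally sized subvectors."""
    size = len(vector) // num_segments
    return [vector[i * size:(i + 1) * size] for i in range(num_segments)]

def train_pq(vectors, num_segments, k):
    """Learn one codebook (list of k centroids) per segment position."""
    segments = [split(v, num_segments) for v in vectors]
    return [kmeans([s[m] for s in segments], k) for m in range(num_segments)]

def encode(vector, codebooks):
    """Replace each subvector by the id of its nearest centroid."""
    subvectors = split(vector, len(codebooks))
    return [nearest(sub, codebooks[m]) for m, sub in enumerate(subvectors)]

def decode(codes, codebooks):
    """Lossy reconstruction: concatenate the centroids the codes point to."""
    return [x for m, c in enumerate(codes) for x in codebooks[m][c]]

rng = random.Random(42)
vectors = [[rng.random() for _ in range(8)] for _ in range(200)]
codebooks = train_pq(vectors, num_segments=4, k=16)
codes = encode(vectors[0], codebooks)   # 4 small integers instead of 8 floats
approx = decode(codes, codebooks)       # approximate version of vectors[0]
```

With 16 centroids per segment each code fits in 4 bits, so the trade-off is between memory saved and reconstruction error; more centroids or more segments reduce the error at the cost of larger codes.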


> **Indexing Algorithm - FAISS**

FAISS stands for Facebook AI Similarity Search. It is a library for similarity search, and the approach described here is clustering-based and uses L2 (Euclidean) distance. Naively, you would compute the distance between the query vector and all the other points, and as you can imagine, the computation time only increases as you have more and more vectors.
To optimize the search process, FAISS makes use of something called Voronoi cells. Instead of computing the distance between the query vector and every single vector in storage, FAISS first computes the distance between the query vector and the centroid of each cell. Once it identifies the centroid closest to the query vector, it looks for similar vectors only among those in the same Voronoi cell. This **works very well for dense vectors, but not so much for sparse vectors**.
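
The Voronoi-cell idea can be sketched without FAISS itself. The code below is plain Python, not the FAISS API: centroids are hand-picked instead of learned by clustering, the data is a made-up grid of points, and the parameter name `nprobe` (how many nearby cells to scan) is borrowed from the name FAISS uses for its inverted-file indexes.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda c: squared_distance(point, centroids[c]))

def build_cells(vectors, centroids):
    """Assign every vector id to the Voronoi cell of its nearest centroid."""
    cells = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(vectors):
        cells[nearest(v, centroids)].append(i)
    return cells

def cell_search(query, vectors, centroids, cells, k, nprobe=1):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: squared_distance(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in cells[c]]
    candidates.sort(key=lambda i: squared_distance(query, vectors[i]))
    return candidates[:k]

# Hand-picked centroids; a real index would learn them with clustering.
centroids = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
offsets = (-1.0, 0.0, 1.0)
vectors = [(cx + dx, cy + dy) for cx, cy in centroids for dx in offsets for dy in offsets]

cells = build_cells(vectors, centroids)
query = (9.6, 10.3)
result = cell_search(query, vectors, centroids, cells, k=3)  # scans 9 of 27 vectors
```

Because only one cell is scanned, a true neighbor that sits just across a cell boundary can be missed; raising `nprobe` scans more cells and trades speed for accuracy.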


> **Indexing Algorithm - HNSW**

HNSW stands for Hierarchical Navigable Small World. It also uses Euclidean distance as a metric, but instead of clustering,
it is a proximity graph-based approach. There are a lot of nitty-gritty details here, but we will focus on the main structural components that make up HNSW. The first is a skip list, a layered form of linked list. In a skip list,
as we go from layer 0 to layer 3, we skip more and more intermediate nodes or vertices.
We look for the nearest neighbor by traversing from left to right, and if we overshoot,
we move down to the layer below. But what if there are just way too many
nodes, requiring us to build many layers? The answer is to introduce hierarchy. We begin at a predefined entry point and then traverse the graph to find the local minimum, where the vector is closest to the query vector.
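
The layered descent can be illustrated with a heavily simplified sketch. It is not the full HNSW algorithm: the real method assigns nodes to layers at random and keeps a list of candidates during search, while this version uses two hand-built layers over a made-up grid of points and a plain greedy walk that stops at the first local minimum.

```python
import math

def build_layer(ids, points, m):
    """Connect each node to its m nearest neighbours within the layer (brute force)."""
    graph = {}
    for i in ids:
        others = sorted((j for j in ids if j != i), key=lambda j: math.dist(points[i], points[j]))
        graph[i] = others[:m]
    return graph

def greedy_search(graph, points, query, start):
    """Hop to the neighbour closest to the query until no neighbour is closer."""
    current = start
    while True:
        best = min(graph[current], key=lambda j: math.dist(points[j], query))
        if math.dist(points[best], query) >= math.dist(points[current], query):
            return current  # local minimum for this layer
        current = best

def layered_search(layers, points, query, entry):
    """Search the sparsest layer first and reuse its result as the next entry point."""
    current = entry
    for graph in layers:
        current = greedy_search(graph, points, query, current)
    return current

# A 6x6 grid of points; the index of (x, y) is x * 6 + y.
points = [(x, y) for x in range(6) for y in range(6)]
all_ids = list(range(len(points)))
sparse_ids = [i for i in all_ids if points[i][0] % 2 == 0 and points[i][1] % 2 == 0]

layers = [
    build_layer(sparse_ids, points, m=4),  # upper layer: few nodes, long hops
    build_layer(all_ids, points, m=4),     # layer 0: every node, short hops
]
query = (4.2, 2.9)
found = layered_search(layers, points, query, entry=0)
print(points[found])  # (4, 3)
```

The upper layer covers most of the distance in a couple of long hops, and layer 0 only has to refine the result locally, which is the benefit the hierarchy is meant to provide.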

## Filtering

> **Filtering Vector Databases**

Adding filtering to vector databases is actually quite hard, and different vector databases implement it differently. There are largely three categories of filtering strategies:

- **Post-Query**: Say you are trying to find the best Pixar movie that's similar to Frozen. After we identify the top-K nearest neighbors, we apply the Pixar studio filter to those results. The upside is that we get to leverage the speed of ANN. The downside is that the number of results is highly unpredictable, and maybe no movie in the top-K meets your requirements.

- **In-Query**: This is quite interesting because the algorithm does both ANN and filtering at the same time. For instance, when you search again for a movie similar to Frozen but produced by Pixar, all the movie data is converted into vectors, while the studio information is also stored in the system as a scalar field. During search, both vector similarity and metadata information need to be computed. This can put a rather high demand on system memory because the system needs to load both the vector data and the scalar data for filtering and vector search. So perhaps when you shop online and use a lot of filters at once, you may notice that the website takes more time to return results as you add more filters; a system may even hit out-of-memory issues. But this approach is quite suitable for row-based data because, in a row storage format, you need to read in all columns of a row at once, as opposed to a columnar storage format, which allows you to read in subsets of columns at a time.

- **Pre-Query**: This limits similarity search to a certain scope: which vectors can I even consider after I apply the filter? The downside of this approach is that it doesn't leverage the speed of ANN, and all data has to be filtered in a brute-force manner. So this is often not as performant as post-query or in-query filtering, because the other two methods can leverage the speed of ANN.

There are also vector databases that implement their own proprietary filtering algorithms, grounded in one of these strategies.
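
The difference between post-query and pre-query filtering shows up even in a toy example. In the sketch below the two-dimensional movie vectors are invented for illustration, and both strategies use an exact scan in place of a real ANN index, so it only demonstrates the result behavior, not the performance.

```python
import math

# Hypothetical catalog: the vectors are made up, not real embeddings.
movies = [
    {"title": "Frozen", "studio": "Disney", "vector": (0.9, 0.1)},
    {"title": "Tangled", "studio": "Disney", "vector": (0.85, 0.15)},
    {"title": "Moana", "studio": "Disney", "vector": (0.8, 0.2)},
    {"title": "Brave", "studio": "Pixar", "vector": (0.7, 0.3)},
    {"title": "Cars", "studio": "Pixar", "vector": (0.1, 0.9)},
]

def top_k(query, items, k):
    """Stand-in for a vector search: the k items closest to the query."""
    return sorted(items, key=lambda m: math.dist(query, m["vector"]))[:k]

def post_query_filter(query, items, k, studio):
    """Search first, then filter: may return fewer than k results, even none."""
    return [m for m in top_k(query, items, k) if m["studio"] == studio]

def pre_query_filter(query, items, k, studio):
    """Filter first, then search only among the items that passed the filter."""
    allowed = [m for m in items if m["studio"] == studio]
    return top_k(query, allowed, k)

query = movies[0]["vector"]  # "similar to Frozen"
post = post_query_filter(query, movies, k=3, studio="Pixar")
pre = pre_query_filter(query, movies, k=3, studio="Pixar")
print([m["title"] for m in post])  # []
print([m["title"] for m in pre])   # ['Brave', 'Cars']
```

Post-query filtering returns nothing here because the three nearest movies are all Disney titles, which is exactly the unpredictability described above.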

## Vector Stores

> **Vector Stores - Databases, Libraries & Plugins**


*How do we Interface with these Vectors?*
- The answer is using vector stores. Loosely speaking, the term vector store includes vector databases, vector libraries, and also plugins on top of existing regular databases.


*Why do I care about Vector Stores? Why can't I just use a Regular Database to Store Vectors?*
- Vector stores aren't actually too different from regular databases. Specifically, a vector database is just like a regular database: it inherits full-fledged database properties like CRUD, which stands for Create, Read, Update, and Delete. But a vector database is specialized to store unstructured data as vectors, and the differentiating capability of vector stores is providing search as a service. You don't have to implement your own search algorithm; vector stores provide search functionality out of the box.


*What about Vector Libraries?*
- Vector libraries create vector indexes for you. A vector index is a data structure that helps you conduct efficient vector search. So if you don't want to integrate with a new database system, it's completely fine to use a vector library that creates this vector index for you. Typically a vector index can contain three components. The first is an optional pre-processing step that users typically implement on their own, where you may want to normalize your embeddings or reduce the embedding dimensions. The primary step is where an indexing algorithm is involved, for example FAISS or HNSW. The last is an optional post-processing step, where you may want to further quantize or hash your vectors to optimize for search speed. A vector library like FAISS is often sufficient for small and static data, but vector libraries generally do not have database properties. You shouldn't expect a vector library to offer CRUD support, data replication, or storing the data on disk, and you will probably have to wait for the full import to complete before you can query. It also means that every time you make changes to the data, the vector index typically has to be rebuilt from scratch. So whether you use a vector database or a vector library really comes down to how often your data changes and whether you need the full-fledged database properties that come with a vector database.


*What about Vector Plugins?*
- There are also existing relational databases and search systems that provide vector search plugins. They typically offer fewer metrics or ANN choices, but I won't be surprised if we see a lot more vector search support in these plugins, even in the coming months.


*Do I need a Vector Database?*
- Best practice: Start without, and scale out as necessary.
- PROS
    - Scalability: Millions or billions of records.
    - Speed: Fast query time with low latency.
    - Full-fledged database properties:
        - If you use vector libraries, you need to come up with a way to store the objects and perform filtering.
        - If data changes frequently, it's cheaper than using an online model to compute embeddings dynamically.
- CONS
    - Adding a vector database to your architecture means paying for an additional service and having one more system to learn, integrate and maintain.

## Best Practices for Real World

> **Best Practices - Do I always Need a Vector Database?**

In the context of LLMs, whether or not you need a vector store (whether it is a vector
database, a library, or a plugin on top of your relational database) all comes down to whether you
need context augmentation. Vector stores extend LLMs with knowledge: they provide relevant
vector lookup and therefore extend the context. This can be really helpful for factual
recall, as we mentioned. It can also help with hallucination, which is
an LLM problem. Generally speaking, though, there are use cases that
probably do not need context augmentation to help with factual recall. For example,
**summarization**, **text classification (including sentiment analysis)**,
and **translation**. For these use cases, you should probably feel safe enough not to use vector stores.


> **Best Practices - How to Improve Retrieval Performance**

At a very high level, there are two different strategies. One concerns your embedding model
selection, and the second has to do with how you store your documents. Let's start with embeddings.

- Tip 1: choose your embedding model wisely. A proxy question you can ask yourself is: was your embedding model trained on data similar to yours? If the answer is yes, then good news, you can keep using the embedding model. If the answer is no, you have two options. The first is to look into using another pre-trained embedding model. The second is to either train your own embeddings or fine-tune your embeddings on your data. The latter approach has been around in the field of NLP for years. It is a very established approach, and we used to talk about fine-tuning BERT embeddings all the time before the hype around ChatGPT and chatbots surfaced.

- Tip 2: make sure that your embedding space actually captures all of your data, including your user queries. For example, if your data is about movies and you ask something about medicine, the retrieval system will perform badly. So always make sure the documents in your vector database contain information relevant to your queries. Similarly, use the same (or a compatible) model to index your documents and your queries so that they share the same embedding space. Sharing the same embedding space is really important if you want relevant results to be returned.


> **Best Practices - Chunking Strategy: Should I split my Documents?**

Now for document storage strategy. I'm going to preface all of this with the caveat that how
to best store your documents is still not very well defined, but I'll share some points
for your consideration. We have two choices:
we can either store a document as a whole, or we can store a
document in chunks. Chunking means splitting a document up into multiple chunks,
where each chunk could be a paragraph, a section, or anything else
that you arbitrarily define. It means that one document can produce many vectors.
Your chunking strategy may determine how relevant the returned chunks are to the query,
but you also need to think about how much context, or how many chunks, you can actually fit within
the model's token limit. Do you need to pass this output to the next LLM? Passing outputs to
another LLM is something that we haven't touched upon in this module, but we'll talk about it in
Module 3. As an example, if you have two thousand tokens in total and split them into four
chunks, each chunk has roughly 500 tokens when the split is even.
But know that chunking strategy is highly use-case specific. In machine learning, we talk about how
developing a model is usually an iterative process, and you should absolutely treat your chunking
strategy the same way. Experiment with different sizes and different approaches.
How long is your document? Is it a single sentence or many, many sentences?
If a chunk is only one sentence, then your embeddings will only focus on the specific meaning
of that particular sentence. But if your chunk captures multiple paragraphs,
then your embeddings will capture the broader themes of your text.
You can split by headers; you can split by sections; you can split by paragraphs.
But you should also consider user behavior. Can you anticipate how long the user queries
will be? If you have longer queries, then there is a higher chance for the query embeddings to
be aligned better with the chunks that are returned. But if you have shorter queries,
then they tend to be more precise, and maybe having shorter chunks would make sense.
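
Two of the splitting approaches mentioned above, fixed-size chunks and paragraph chunks, can be sketched as follows. Whitespace-separated words serve as a rough stand-in for model tokens here; a real pipeline would count tokens with the tokenizer of the model in use. The function names are made up for the example.

```python
def chunk_by_tokens(text, chunk_size):
    """Split text into chunks of at most chunk_size whitespace-separated tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    tokens = text.split()
    return [" ".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]

def chunk_by_paragraphs(text):
    """Split text on blank lines and drop empty pieces."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

# 2,000 tokens split evenly gives four chunks of 500 tokens each.
document = " ".join(f"word{i}" for i in range(2000))
token_chunks = chunk_by_tokens(document, chunk_size=500)
print(len(token_chunks))  # 4

paragraph_chunks = chunk_by_paragraphs("First paragraph.\n\nSecond paragraph.\n\n\nThird.")
print(paragraph_chunks)  # ['First paragraph.', 'Second paragraph.', 'Third.']
```

Each chunk would then be embedded separately, so `chunk_size` is one of the knobs to experiment with when iterating on retrieval quality.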


> **Best Practices - Preventing Silent Failures and Undesired Performance**

Now say that I chose the wrong embedding model and my chunking strategy was not good.
Can we add some guardrails to prevent silent failures or undesired performance?
For users, it is helpful to include explicit instructions
in the prompts. As we discussed in Module 1, you can tell the
model not to make things up if it doesn't know the answer. This helps you
know where the model's limitations are rather than relying on unreliable outputs.
For software engineers, there are a few things you can consider. The first is to
add failover logic. If the distance exceeds a threshold, then maybe you show a generic
list of responses rather than showing nothing. Going back to the Nike example, if no
Nike shoes are returned, then you can probably show a generic list of the most popular shoes that users
can buy. In terms of toxicity, discrimination, or exclusion, you can also add a basic toxicity
classification model on top to prevent users from submitting offensive inputs.
In 2016, Microsoft released a chatbot called Tay that became a
really racist chatbot because users started submitting racist remarks. Having a
guardrail model on top helps prevent a chatbot from behaving differently than you expect.
You can also choose to discard all the offensive content to avoid retraining
or fine-tuning on it. And lastly, you should also
consider configuring your vector database to time out if a query takes too long to
return a response. Maybe this indicates that no similar vectors were found.
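
The failover logic can be as small as a distance check. In this sketch the catalog, its two-dimensional vectors, the fallback list, and the threshold value are all invented for illustration; a suitable threshold depends on the embedding model and the distance metric.

```python
import math

def search_with_fallback(query_vector, catalog, fallback_items, max_distance, k=3):
    """Return the nearest items within max_distance, or a generic list if none qualify."""
    scored = sorted((math.dist(query_vector, vec), name) for name, vec in catalog.items())
    hits = [name for distance, name in scored[:k] if distance <= max_distance]
    return hits if hits else list(fallback_items)

# Hypothetical product vectors and a generic "most popular" list.
catalog = {
    "trail runner": (0.9, 0.1),
    "court sneaker": (0.8, 0.3),
    "leather boot": (0.1, 0.9),
}
most_popular = ["best-selling sneaker", "best-selling sandal"]

close = search_with_fallback((0.88, 0.15), catalog, most_popular, max_distance=0.2)
far = search_with_fallback((-5.0, -5.0), catalog, most_popular, max_distance=0.2)
print(close)  # ['trail runner', 'court sneaker']
print(far)    # ['best-selling sneaker', 'best-selling sandal']
```

Returning the fallback list makes the failure visible and still useful to the user, instead of silently showing irrelevant or empty results.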

## Summary

- Vector stores can include vector databases, vector libraries and also search plugins on top of relational databases. They're only useful when you need context augmentation and not all text use cases need that.

- Vector search is all about calculating vector similarities and distances. If you are trying to decide whether or not to use a database for vectors, just think of a vector database as a regular database with out-of-the-box search capability.

- Vector databases are only useful when you need database properties, have big data, or have very stringent serving latency requirements.

- For a knowledge-based search and retrieval system to work well, you need to select the right embedding model for your data, and you will probably need to iterate on your document splitting or chunking strategy.