# Text Embeddings & Semantic Search

This notebook focuses on embeddings and vector stores. We compare different embedding functions, then look at vector stores: building an index, inspecting similarity scores, and querying the database asynchronously.

<img src="images/sentiment_analysis.png" width="50%"/>

<a href="https://colab.research.google.com/github/miztiik/llm-bootcamp/blob/main/chapters/hf_transformers/hf_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%%capture
# Comment the above line to see the installation logs

# Install the dependencies
!pip install -qU python-dotenv
!pip install -qU langchain
!pip install -qU langchain-openai
!pip install -qU numpy
!pip install -qU seaborn
!pip install -qU scikit-learn
!pip install -qU pandas
!pip install -qU sentence-transformers
!pip install -qU matplotlib


In [None]:
# Load environment variables
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

In [None]:
# Not a good practice, but we ignore this warning in the notebook: PyTorch has deprecated TypedStorage and will remove it in a future version.
# https://github.com/pytorch/pytorch/issues/97207#issuecomment-1494781560
import warnings

warnings.filterwarnings(
    "ignore", category=UserWarning, message="TypedStorage is deprecated"
)

Update your `API_KEY` in the `.env` file. You can get the API keys from the links below.

If you are running in Google Colab, add the key to the secrets manager under the "🔑" icon in the left panel and name it `SVC_NAME_API_KEY`.

_Note: Some of the services may require an account, and some may charge you for usage._
- [OpenAI API Key](https://platform.openai.com/account/api-keys)
- [Hugging Face API Key](https://huggingface.co/settings/tokens)
- [Serper API Key](https://serper.dev/api-key)

In [None]:
from langchain_openai import OpenAI
from langchain_openai import ChatOpenAI

# To specify a particular model refer to the OpenAI documentation - https://platform.openai.com/docs/models
# Completions Model: https://platform.openai.com/docs/models/completions
# Chat Model: https://platform.openai.com/docs/models/completions

llm = OpenAI()
llm_chat = ChatOpenAI(model_name="gpt-3.5-turbo-0125", temperature=0.3)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

## Embeddings

In this section, we generate embeddings for a list of sentences using different methods.

In [None]:
from langchain.embeddings import HuggingFaceEmbeddings

hf_embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)

In [None]:
sentences = [
    "Best travel neck pillow for long flights",
    "Lightweight backpack for hiking and travel",
    "Waterproof duffel bag for outdoor adventures",
    "Stainless steel cookware set for induction cooktops",
    "High-quality chef's knife set",
    "High-performance stand mixer for baking",
    "New releases in fiction literature",
    "Inspirational biographies and memoirs",
    "Top self-help books for personal growth",
]

In [None]:
# Creating a list of embeddings
text_embedding_list = [hf_embeddings.embed_query(s) for s in sentences]

This is what our embedding looks like:

In [None]:
text_embedding_list[0][:10]

In [None]:
len(text_embedding_list[0]), len(text_embedding_list)

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

(
    cosine_similarity([text_embedding_list[0]], [text_embedding_list[1]]),
    cosine_similarity([text_embedding_list[0]], [text_embedding_list[-1]]),
)

As one can see, the cosine similarity between `sentences[0]` and `sentences[1]` (0.445) is much higher than between `sentences[0]` and `sentences[-1]` (0.08), indicating that the first pair is far more similar in meaning.

Indeed, the sentence _Best travel neck pillow for long flights_ has more in common with _Lightweight backpack for hiking and travel_ than with _Top self-help books for personal growth_.
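
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures the angle between them and ignores their magnitude. The sketch below computes it by hand; the three short vectors are made-up values used for illustration only, not real embeddings, and `cosine_sim` is a hypothetical helper name.

```python
import math


def cosine_sim(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical values, standing in for real embeddings)
pillow = [1.0, 2.0, 0.0]
backpack = [2.0, 3.0, 1.0]
books = [0.0, -1.0, 4.0]

sim_close = cosine_sim(pillow, backpack)  # vectors pointing in a similar direction
sim_far = cosine_sim(pillow, books)       # vectors pointing in different directions
print(round(sim_close, 3), round(sim_far, 3))
```

A value near 1 means the vectors point in almost the same direction, a value near 0 means they are roughly orthogonal, and a negative value means they point in opposing directions.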

In [None]:
import pandas as pd

# Calculate the number of sentences
num_sentences = len(sentences)

# Initialize a similarity matrix with zeros
similarity_matrix = np.zeros((num_sentences, num_sentences))

# Calculate the cosine similarity between sentence embeddings
# Iterate through the upper triangular part of the matrix
for i in range(num_sentences):
    for j in range(i + 1, num_sentences):
        # Retrieve embeddings for the current pair of sentences
        embedding_i, embedding_j = text_embedding_list[i], text_embedding_list[j]

        # cosine_similarity returns a (1, 1) array; take its single value
        similarity_matrix[i, j] = cosine_similarity(
            [embedding_i], [embedding_j])[0, 0]

# Copy the values from the upper triangular part to the lower triangular part
similarity_matrix += similarity_matrix.T

# Fill the diagonal of the similarity matrix with ones, indicating self-similarity
np.fill_diagonal(similarity_matrix, 1)

In [None]:
# Create a heatmap using Seaborn
plt.figure(figsize=(8, 6))
sns.heatmap(similarity_matrix, xticklabels=False, yticklabels=False)

# Add labels and title
plt.title("Cosine Similarity Heatmap")

# Show the plot
plt.show()
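
The nested loop above fills the similarity matrix one pair at a time. As a possible alternative, the same matrix can be obtained with a single matrix product after scaling each embedding to unit length. The sketch below assumes only NumPy; `pairwise_cosine` and the toy vectors are hypothetical and stand in for `text_embedding_list`. scikit-learn's `cosine_similarity` also returns the full pairwise matrix when called with a single 2-D array.

```python
import numpy as np


def pairwise_cosine(embeddings):
    """Return the (n, n) cosine similarity matrix for n row vectors."""
    X = np.asarray(embeddings, dtype=float)
    # Scale every row to unit length so the dot product equals the cosine
    X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return X_unit @ X_unit.T


# Toy stand-ins for sentence embeddings (hypothetical values)
toy_embeddings = [
    [1.0, 2.0, 0.0],
    [2.0, 3.0, 1.0],
    [0.0, -1.0, 4.0],
]
toy_similarity = pairwise_cosine(toy_embeddings)
print(toy_similarity.round(3))
```

The result is symmetric with ones on the diagonal, so no separate transpose or `fill_diagonal` step is needed.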
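
In [None]:
# Keep only the 75 embedding dimensions with the highest variance across sentences
EMBEDDING_SIZE = 75
CHUNK_SIZE = EMBEDDING_SIZE // 3
sorted_var_indexes = np.argsort(np.var(text_embedding_list, axis=0))
# Shape: (EMBEDDING_SIZE, num_sentences); rows are dimensions, columns are sentences
filtered_embedding_values = np.array(text_embedding_list).T[
    sorted_var_indexes[-EMBEDDING_SIZE:]
]
_df = pd.DataFrame(filtered_embedding_values)
_df.head(3)

In [None]:
# Order the dimensions by their mean value within each topic group
# (travel: columns 0-2, kitchen: columns 3-5, books: columns 6-8)
# so that dimensions with a low mean for the same group sit next to each other
sort_indexes = list(_df[[0, 1, 2]].mean(
    axis=1).sort_values().index)[:CHUNK_SIZE]
sort_indexes += [
    i for i in _df[[3, 4, 5]].mean(axis=1).sort_values().index if i not in sort_indexes
][:CHUNK_SIZE]
sort_indexes += [
    i for i in _df[[6, 7, 8]].mean(axis=1).sort_values().index if i not in sort_indexes
]
sorted_text_embedding_list = filtered_embedding_values[sort_indexes]

In [None]:
# Create a heatmap using Seaborn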
plt.figure(figsize=(8, 3))
sns.heatmap(sorted_text_embedding_list.T, xticklabels=False, yticklabels=False)

# Add labels and title
plt.title("Embeddings with 75 dimensions")

# Show the plot
plt.show()
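
The similarity scores computed earlier are also the basis of a simple semantic search: embed the query, score it against every sentence embedding, and return the highest-scoring sentences. The following is a minimal sketch under stated assumptions, not a vector store implementation. `top_k` is a hypothetical helper, and the toy vectors are made-up values standing in for real embeddings.

```python
import numpy as np


def top_k(query_vec, doc_vecs, k=2):
    """Return (index, score) pairs for the k rows of doc_vecs most similar to query_vec."""
    docs = np.asarray(doc_vecs, dtype=float)
    query = np.asarray(query_vec, dtype=float)
    # Cosine similarity between the query and every document vector
    scores = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
    # Indices of the k highest scores, best first
    best = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in best]


# Toy document vectors and query (hypothetical values)
toy_docs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [1.0, 0.2, 0.0]

results = top_k(toy_query, toy_docs, k=2)
for index, score in results:
    print(index, round(score, 3))
```

In this notebook, the query vector would come from `hf_embeddings.embed_query(...)` and the document vectors from `text_embedding_list`, with the returned indices mapping back into `sentences`.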

## Additional Reading

- [LLM Bootcamp](https://github.com/miztiik/llm-bootcamp)
- [Evaluating the Ideal Chunk Size for a RAG System using LlamaIndex](https://blog.llamaindex.ai/evaluating-the-ideal-chunk-size-for-a-rag-system-using-llamaindex-6207e5d3fec5) Colab: <a href="https://colab.research.google.com/drive/1LPvJyEON6btMpubYdwySfNs0FuNR9nza?usp=sharing" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
