# Embeddings and document Q&A

In [None]:
%pip install --quiet -U llama-index-vector-stores-chroma llama-index sentence-transformers sentencepiece InstructorEmbedding pydantic llama-index-embeddings-huggingface llama-index-embeddings-instructor

# Embeddings

Computers only know how to talk in numbers, so embeddings **convert text to numbers**. I've already written [a lot of words about embeddings](https://investigate.ai/text-analysis/word-embeddings/), which will hopefully help you understand them.

For example, let's see what "cat" turns into.

In [None]:
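from sentence_transformers import SentenceTransformer
sentences = ["cat"]

# all-MiniLM-L6-v2 turns each piece of text into a list of 384 numbers
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(sentences)
# Only show the first 25 numbers
print(embeddings[0][:25])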

Generating embeddings also works for **entire sentences** (or paragraphs, or books, or anything!).
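How does a whole sentence become a single list of numbers? According to its model card, all-MiniLM-L6-v2 uses "mean pooling": it averages the vectors for each token. The sketch below uses made-up 4-number token vectors (real models use hundreds of numbers per token) to show that the result is the same length no matter how long the sentence is. It's a toy illustration, not the code the library actually runs.

```python
import numpy as np

# Hypothetical token vectors: real models produce hundreds of numbers per token
toy_token_vectors = {
    "this": [0.1, 0.3, -0.2, 0.5],
    "is": [0.0, 0.1, 0.1, 0.2],
    "a": [0.2, -0.1, 0.0, 0.1],
    "cat": [0.9, 0.4, -0.3, 0.0],
}

def mean_pool(tokens):
    """Average the token vectors into one sentence vector."""
    vectors = np.array([toy_token_vectors[t] for t in tokens])
    return vectors.mean(axis=0)

short = mean_pool(["cat"])
longer = mean_pool(["this", "is", "a", "cat"])

# Both "sentences" end up as vectors of the same length
print(short.shape, longer.shape)  # (4,) (4,)
```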

In [None]:
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence"]

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(sentences)
print(embeddings[0][:50])

## Document similarity

Because everything is "just numbers," we can use those numbers to compare sentences. Below we'll use a small set of sentences, which lets us see how different embedding models lead to different results.
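The comparison below uses cosine similarity, which looks at the *angle* between two vectors: 1 means they point the same way, 0 means they're unrelated (perpendicular), and -1 means opposite. Here's a minimal numpy version of that math (dot product divided by the product of the lengths), using toy 3-number vectors rather than real embeddings. It should match what scikit-learn's `cosine_similarity` computes for each pair.

```python
import numpy as np

def cosine(a, b):
    """Dot product divided by the product of the vectors' lengths."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors, not real embeddings
fish = [1.0, 2.0, 0.0]
carp = [2.0, 4.0, 0.0]   # same direction as fish, just longer
house = [0.0, 0.0, 3.0]  # perpendicular to fish

print(cosine(fish, carp))   # about 1.0
print(cosine(fish, house))  # 0.0
```

Notice that `carp` is twice as long as `fish` but still scores about 1.0: only the direction matters, not the size.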

In [None]:
import pandas as pd

sentences = [
    "Molly ate a fish",
    "Jen consumed a carp",
    "I would like to sell you a house",
    "Я пытаюсь купить дачу",
    "J'aimerais vous louer un grand appartement",
    "This is a wonderful investment opportunity",
    "write some more sentences 1",
    "write some more sentences 2",
    "write some more sentences 3",
    "write some more sentences 4",
]

In [None]:
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(sentences)

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute similarities exactly the same as we did before!
similarities = cosine_similarity(embeddings)

# Turn into a dataframe
pd.DataFrame(similarities,
            index=sentences,
            columns=sentences) \
            .style \
            .background_gradient(axis=None)

In [None]:
model = SentenceTransformer('sentence-transformers/distiluse-base-multilingual-cased-v2')
embeddings = model.encode(sentences)

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute similarities exactly the same as we did before!
similarities = cosine_similarity(embeddings)

# Turn into a dataframe
pd.DataFrame(similarities,
            index=sentences,
            columns=sentences) \
            .style \
            .background_gradient(axis=None)

Why would you care about all of this? **Searching through your documents!** Sometimes you don't know the exact words you're looking for; you just want something that kind of captures a feeling.
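Searching with embeddings means embedding your search query the same way as your documents, then sorting the documents by how similar they are to the query. A minimal sketch, where the vectors are hypothetical stand-ins for what `model.encode(...)` would return:

```python
import numpy as np

# Hypothetical embeddings; in practice these would come from model.encode(...)
doc_texts = ["Molly ate a fish", "I would like to sell you a house", "Jen consumed a carp"]
doc_vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.2, 0.9],
    [0.7, 0.4, 0.1],
])
query_vector = np.array([0.85, 0.2, 0.05])  # pretend this is "someone eating seafood"

def rank_documents(query, docs):
    """Return document indices sorted from most to least similar, plus the scores."""
    # Scale every vector to length 1 so a dot product equals cosine similarity
    docs_norm = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    scores = docs_norm @ query_norm
    return np.argsort(-scores), scores

order, scores = rank_documents(query_vector, doc_vectors)
for i in order:
    print(f"{scores[i]:.3f}  {doc_texts[i]}")
```

Neither fish sentence shares a word with "someone eating seafood"; they rank highly only because their (pretend) vectors point the same way as the query's.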

That's how John Keefe and Jeremy Merrill helped ICIJ navigate over 300 GB of multilingual data in [the Luanda Leaks](https://qz.com/1786896/ai-for-investigations-sorting-through-the-luanda-leaks). Jeremy has also been building a tool for this kind of search; you can sneak a peek at it [over here](https://github.com/jeremybmerrill/meaningfully), or check out [Semantra](https://github.com/freedmand/semantra) by Dylan Freedman.

You can also use embeddings for [general similarity clustering](https://www.commons-project.com/dockets/FDA-2019-N-5959), too!

# Retrieval-augmented generation/document-based Q&A

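These days everyone loves to search across documents. The basic idea behind retrieval-augmented generation (RAG): split your documents into chunks, turn each chunk into an embedding, find the chunks most similar to your question, and hand those chunks to a language model to write an answer. Let's see how that works with embeddings!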
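In retrieval-augmented generation, the retrieved chunks get pasted into a prompt alongside your question before anything goes to the language model. A minimal sketch of that prompt-building step, assuming retrieval has already happened. `build_prompt` and its template wording are hypothetical (LlamaIndex uses its own prompt templates), the chunk text is placeholder, and no model is called.

```python
def build_prompt(question, retrieved_chunks):
    """Paste numbered retrieved chunks above the question (hypothetical template)."""
    context = "\n\n".join(
        f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Placeholders for whatever the embedding search actually returned
chunks = [
    "<text of the most similar chunk>",
    "<text of the second most similar chunk>",
]
prompt = build_prompt("Why was the red cow helping Ferko?", chunks)
print(prompt)
```

This is why the model can only answer from what retrieval found: if the right chunk isn't in the prompt, the model never sees it.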

In [None]:
import os
# Paste in your own OpenAI API key (and don't share notebooks that contain it!)
os.environ['OPENAI_API_KEY'] = 'XXXXXXXX'

In [None]:
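from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Read every file in the "documents" folder
documents = SimpleDirectoryReader("documents").load_data()
# Split the documents into chunks, embed each chunk (with OpenAI by default), and store the vectors
index = VectorStoreIndex.from_documents(documents)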
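Before embedding anything, `VectorStoreIndex.from_documents` splits each document into smaller chunks (LlamaIndex calls them "nodes"), so a search can point at a passage instead of a whole file. LlamaIndex uses its own sentence-aware splitter; the sketch below is a simplified stand-in that counts words instead of tokens, with made-up sizes, just to show the idea of overlapping chunks.

```python
def split_into_chunks(text, chunk_size=50, overlap=10):
    """Split text into chunks of up to chunk_size words, each sharing
    `overlap` words with the previous chunk.

    Simplified stand-in for LlamaIndex's splitter: counts words, not tokens.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once this chunk reaches the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks

sample = " ".join(f"word{i}" for i in range(120))
chunks = split_into_chunks(sample)
print(len(chunks))  # 3
```

The overlap means a sentence that straddles a chunk boundary still shows up whole in at least one chunk.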

In [None]:
query_engine = index.as_query_engine()
response = query_engine.query("Who was the red cow?")
print(response)

In [None]:
query_engine = index.as_query_engine()
response = query_engine.query("Why was the red cow helping Ferko?")
print(response)

What happens if we run it again?

In [None]:
query_engine = index.as_query_engine()
response = query_engine.query("Why was the red cow helping Ferko?")
print(response)

What if we want to provide more context? By default, LlamaIndex only sends the 2 most relevant chunks (not whole documents) to the language model; `similarity_top_k` lets us send more.
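`similarity_top_k` controls how many of the best-scoring chunks get passed along to the language model. The selection step itself is simple; here it is sketched in plain Python with hypothetical similarity scores (this isn't LlamaIndex's internal code).

```python
import heapq

# Hypothetical similarity scores for five chunks
chunk_scores = {"chunk_a": 0.82, "chunk_b": 0.41, "chunk_c": 0.77, "chunk_d": 0.90, "chunk_e": 0.35}

def top_k(scores, k=2):
    """Return the ids of the k highest-scoring chunks, best first."""
    return heapq.nlargest(k, scores, key=scores.get)

print(top_k(chunk_scores))       # ['chunk_d', 'chunk_a']
print(top_k(chunk_scores, k=5))  # ['chunk_d', 'chunk_a', 'chunk_c', 'chunk_b', 'chunk_e']
```

A bigger `k` gives the model more to work with, but also more marginally relevant text and a longer (pricier) prompt.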

In [None]:
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("Why was the red cow helping Ferko?")
print(response)

In [None]:
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("As a landlord, can I discriminate against poor people?")
print(response)

In [None]:
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("Can I take my employees tips? I run the restaurant, I deserve them.")
print(response)

**Easy citations**

In [None]:
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("""
Can I take my employees tips? I run the restaurant, I deserve them. 
Cite filenames and page numbers where you retrieved information.
""")
print(response)

**Sources**

In [None]:
for node in response.source_nodes:
    print("-----")
    text_fmt = node.node.get_content().strip().replace("\n", " ")[:1000]
    print(f"Text:\t {text_fmt} ...")
    print(f"Metadata:\t {node.node.metadata}")
    print(f"Score:\t {node.score:.3f}")
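Often you just want a tidy list of where an answer came from rather than the raw chunks. A sketch that collapses source metadata into unique citations, keeping each one's best score. It assumes metadata dicts with a `file_name` key and, for PDFs, a `page_label` key, which is what `SimpleDirectoryReader` typically attaches; check the `Metadata` printed above for what your files actually have. The sample data is hypothetical.

```python
def citation_list(sources):
    """Collapse (metadata, score) pairs into unique 'file, page' strings, best score first."""
    best = {}
    for metadata, score in sources:
        label = metadata.get("file_name", "unknown file")
        if "page_label" in metadata:
            label += f", p. {metadata['page_label']}"
        # Several chunks can come from the same page; keep the highest score
        best[label] = max(score, best.get(label, float("-inf")))
    return sorted(best, key=best.get, reverse=True)

# Hypothetical stand-ins for [(node.node.metadata, node.score) for node in response.source_nodes]
sample_sources = [
    ({"file_name": "tips.pdf", "page_label": "3"}, 0.81),
    ({"file_name": "tips.pdf", "page_label": "3"}, 0.79),
    ({"file_name": "housing.pdf", "page_label": "12"}, 0.74),
]
print(citation_list(sample_sources))  # ['tips.pdf, p. 3', 'housing.pdf, p. 12']
```

Unlike asking the model to cite its sources in the prompt, this list comes straight from what was actually retrieved, so it can't be made up.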


What happens if we run it again?

There are also plenty of [other customizations we can make](https://docs.llamaindex.ai/en/stable/getting_started/customization.html).

## Local embedding

By default, these embeddings come from OpenAI, which means every chunk has to be sent over the network and you pay for each one. They're pretty good, though! I could write plenty more about them, but the [general idea](https://openai.com/blog/new-embedding-models-and-api-updates) is:

> Both of our new embedding models were trained with a technique Matryoshka Representation Learning that allows developers to trade-off performance and cost of using embeddings. Specifically, developers can shorten embeddings (i.e. remove some numbers from the end of the sequence) without the embedding losing its concept-representing properties by passing in the dimensions API parameter. For example, on the MTEB benchmark, a text-embedding-3-large embedding can be shortened to a size of 256 while still outperforming an unshortened text-embedding-ada-002 embedding with a size of 1536.
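Mechanically, "shortening" just means keeping the first N numbers. If you shorten a vector yourself rather than using the `dimensions` parameter, it's common to rescale it back to length 1 so similarity scores stay comparable. A numpy sketch with a random stand-in vector (not a real embedding). Keep in mind that this only preserves meaning well for models trained the Matryoshka way; chopping the end off other models' embeddings may throw away much more.

```python
import numpy as np

def shorten_embedding(vector, dims):
    """Keep the first `dims` numbers, then rescale to unit length."""
    short = np.asarray(vector, dtype=float)[:dims]
    return short / np.linalg.norm(short)

rng = np.random.default_rng(0)
full = rng.normal(size=3072)  # random stand-in, not a real embedding
short = shorten_embedding(full, 256)
print(short.shape, round(float(np.linalg.norm(short)), 6))  # (256,) 1.0
```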

The "best" embeddings (and tools for using them) are constantly changing, and [the major leaderboard](https://huggingface.co/spaces/mteb/leaderboard) is always being upset.

In [None]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# loads BAAI/bge-small-en
# embed_model = HuggingFaceEmbedding()

# loads BAAI/bge-small-en-v1.5
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

In [None]:
embeddings = embed_model.get_text_embedding("Hello World!")
print(embeddings[:5])

In [None]:
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# loads BAAI/bge-small-en
# embed_model = HuggingFaceEmbedding()

# loads BAAI/bge-m3 (a multilingual model)
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-m3")
Settings.embed_model = embed_model

In [None]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("documents").load_data()
index = VectorStoreIndex.from_documents(documents)

In [None]:
query_engine = index.as_query_engine()
response = query_engine.query("Why was the red cow helping Ferko?")
print(response)

## Other things to research

RAG goes very deep, and is an incredibly active field of research. If you're interested in more recent developments, you might want to look at [RAGatouille](https://github.com/bclavie/RAGatouille) and [ColBERT embeddings](https://github.com/stanford-futuredata/ColBERT). Also look into **[reranking](https://docs.llamaindex.ai/en/stable/examples/node_postprocessor/CohereRerank.html)**, which reorders retrieved results so the most relevant ones come first.
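Reranking is a two-step search: a fast embedding search grabs a generous set of candidates, then a slower, more careful scorer (for example a cross-encoder model, or a service like Cohere's reranker) reorders them and keeps only the best few. The sketch shows that structure with a deliberately crude word-overlap scorer standing in for the careful model; `overlap_score` is hypothetical and is not how real rerankers score text.

```python
def overlap_score(query, text):
    """Crude stand-in for a reranking model: share of query words found in the text."""
    query_words = set(query.lower().split())
    if not query_words:
        return 0.0
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words)

def rerank(query, candidates, top_n=2):
    """Reorder first-pass candidates with the (slower) scorer and keep the top_n."""
    ranked = sorted(candidates, key=lambda text: overlap_score(query, text), reverse=True)
    return ranked[:top_n]

# Pretend these came back from an embedding search, in embedding-score order
candidates = [
    "restaurant owners and the law",
    "can managers keep employee tips",
    "tips for running a restaurant",
]
print(rerank("can owners keep employee tips", candidates))
# ['can managers keep employee tips', 'restaurant owners and the law']
```

The design choice: the expensive scorer only ever looks at a handful of candidates, so you get its accuracy without running it over every chunk in the index.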