# Embeddings and Vector Databases

This notebook explains how to create text embeddings from documents, store them in a vector database, then use them to create context for an LLM like ChatGPT. It is based on a tutorial at [RealPython](https://realpython.com/chromadb-vector-database) and Pere Martra's [LLM Course](https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/2-Vector%20Databases%20with%20LLMs/how-to-use-a-embedding-database-with-a-llm-from-hf.ipynb) ([article](https://pub.towardsai.net/harness-the-power-of-vector-databases-influencing-language-models-with-personalized-information-ab2f995f09ba)).

Embeddings are dense numeric vector encodings of unstructured objects like text, audio, and video. Vector databases like ChromaDB store embeddings. An application can embed a query, find the most relevant documents in the database, then include those documents as context in a message to the LLM. This is called retrieval-augmented generation (RAG).

## Word Embeddings

Word embeddings are the simplest application. A word embedding is a many-dimensional vector that encodes a word's semantic relationships with other words. There are static word embedding algorithms (Word2vec, GloVe), which assign each word one fixed vector, and dynamic algorithms (like those used in LLMs), whose vectors change based on context. Distance functions measure vector similarity. Cosine similarity is the one OpenAI recommends (see [OpenAI Q&A](https://platform.openai.com/docs/guides/embeddings/which-distance-function-should-i-use)).

In [1]:
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Compute the cosine similarity between two vectors"""
    
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
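
As a quick sanity check of the function above, the following sketch evaluates it on hand-made vectors whose angles are known in advance. The vectors are made up for illustration; the function is repeated so the block runs on its own.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between u and v."""
    return float((u @ v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 3.0])

# Same direction, different length: cosine similarity ignores magnitude.
same_direction = cosine_similarity(a, 10 * a)
# Perpendicular vectors share no direction at all.
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
# Opposite directions give the minimum value.
opposite = cosine_similarity(a, -a)

print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
# prints: 1.0 0.0 -1.0
```

The range from -1 to 1 is what makes the `dog/cat` and `dog/apple` scores below directly comparable.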


The next cell looks up the embeddings for a few words. The `en_core_web_md` model from the **spaCy** general-purpose NLP library contains 20k 300-dimension word embeddings. Dog and cat have a high similarity; dog and apple do not.

In [2]:
import spacy

# Instantiate an embedding object with the medium-sized English model.
nlp = spacy.load("en_core_web_md")

# Here are three embeddings from the model.
dog_embedding = nlp.vocab["dog"].vector
cat_embedding = nlp.vocab["cat"].vector
apple_embedding = nlp.vocab["apple"].vector

# Use cosine similarity to measure their similarity. Cat and dog are relatively similar compared to dog and apple.
print("dog/cat", cosine_similarity(dog_embedding, cat_embedding))
print("dog/apple", cosine_similarity(dog_embedding, apple_embedding))


dog/cat 0.8220817
dog/apple 0.22881007


## Text Embeddings

The logic of embeddings extends to sentences, documents, and even other data types such as audio and video. However, a simple model like `en_core_web_md`, which is a dictionary of pre-calculated word embeddings, cannot embed arbitrary text. Instead, use a pre-trained model that recognizes complex semantic relationships. The SentenceTransformers library works with multiple models, one of which is `all-MiniLM-L6-v2`. This model encodes texts up to 256 word pieces (tokens), truncating anything longer.
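
Because anything past the limit is silently dropped, longer documents are usually split before embedding. Here is a minimal sketch of one way to do that, assuming whitespace-separated words are an acceptable rough proxy for word pieces. The helper `chunk_words` and its defaults (200 words, 20-word overlap) are hypothetical choices for illustration, not part of SentenceTransformers.

```python
def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into overlapping windows of at most max_words whitespace-separated words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks

# A made-up 450-word document: w0 w1 ... w449
long_text = " ".join(f"w{i}" for i in range(450))
chunks = chunk_words(long_text)
print([len(c.split()) for c in chunks])
# prints: [200, 200, 90]
```

Each chunk can then be embedded separately. The news titles used below are short enough that no chunking is needed.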

In [35]:
import pandas as pd

# Got this from
# https://www.kaggle.com/code/kerneler/starter-topic-labeled-news-dataset-870843f1-3
news_data = pd.read_csv('./data/labelled_newscatcher_dataset.csv', sep=';')

# Just keep 1,000 rows for demo
news_subset = news_data.head(1000)

news_subset.head()

| | topic | link | domain | published_date | title | lang |
|---|---|---|---|---|---|---|
| 0 | SCIENCE | https://www.eurekalert.org/pub_releases/2020-0... | eurekalert.org | 2020-08-06 13:59:45 | A closer look at water-splitting's solar fuel ... | en |
| 1 | SCIENCE | https://www.pulse.ng/news/world/an-irresistibl... | pulse.ng | 2020-08-12 15:14:19 | An irresistible scent makes locusts swarm, stu... | en |
| 2 | SCIENCE | https://www.express.co.uk/news/science/1322607... | express.co.uk | 2020-08-13 21:01:00 | Artificial intelligence warning: AI will know ... | en |
| 3 | SCIENCE | https://www.ndtv.com/world-news/glaciers-could... | ndtv.com | 2020-08-03 22:18:26 | Glaciers Could Have Sculpted Mars Valleys: Study | en |
| 4 | SCIENCE | https://www.thesun.ie/tech/5742187/perseid-met... | thesun.ie | 2020-08-12 19:54:36 | Perseid meteor shower 2020: What time and how ... | en |


In [34]:
from sentence_transformers import SentenceTransformer

# cosine_similarity() is the function defined in the first cell.

model = SentenceTransformer("all-MiniLM-L6-v2")

# encode() creates a 384-dimension embedding for each title.
text_embeddings = model.encode(news_subset["title"])
print(text_embeddings.shape)

# Create a dictionary
text_embeddings_dict = dict(zip(news_subset["title"], list(text_embeddings)))

# Try it out! Use cosine similarity to measure their similarity.
# Even though "NASA" appears in both titles of the first pair, the second
# pair is actually more similar. The text embedding picks up on that.

test_1 = cosine_similarity(
    text_embeddings_dict[news_subset["title"][5]],
    text_embeddings_dict[news_subset["title"][6]]
)
print(f'{news_subset["title"][5]} \n{news_subset["title"][6]} \n {test_1} \n')

test_2 = cosine_similarity(
    text_embeddings_dict[news_subset["title"][6]],
    text_embeddings_dict[news_subset["title"][7]]
)
print(f'{news_subset["title"][6]} \n{news_subset["title"][7]} \n {test_2} \n')


(1000, 384)
NASA Releases In-Depth Map of Beirut Explosion Damage 
SpaceX, NASA Demo-2 Rocket Launch Set for Saturday: How to Watch 
 0.23386673629283905 

SpaceX, NASA Demo-2 Rocket Launch Set for Saturday: How to Watch 
Orbital space tourism set for rebirth in 2021 
 0.2754879593849182 
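
Comparing titles one pair at a time does not scale. A minimal sketch of computing every pairwise cosine similarity at once follows: normalize each row to unit length, then take a matrix product. The 3-dimensional vectors are made-up stand-ins for the real 384-dimension title embeddings, and `pairwise_cosine` is a hypothetical helper name.

```python
import numpy as np

def pairwise_cosine(embeddings: np.ndarray) -> np.ndarray:
    """Return the (n, n) matrix of cosine similarities between rows."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms   # each row now has length 1
    return unit @ unit.T        # dot product of unit vectors = cosine similarity

# Made-up vectors for illustration only.
toy = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
])
sims = pairwise_cosine(toy)
print(np.round(sims, 3))
```

With the notebook's data, `pairwise_cosine(text_embeddings)[5, 6]` should agree with `test_1` above up to floating-point rounding.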



## Vector Databases

So far we've calculated embeddings for a collection of unstructured objects, stored them in a dictionary, then compared their similarity. That's great, but what you really want to do is find relevant matches to a search string. Storing the vectors in a database like ChromaDB makes this possible. Let's create a database for the 1,000 news titles.

In [37]:
import chromadb
from chromadb.utils import embedding_functions

# Define the db storage structure and create a directory.
chroma_client = chromadb.PersistentClient(path="chroma_data/")

# Set the embedding function
embedding_func = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

# Instantiate the storage structure. A collection is basically a database table.
# Collection "news_collection" uses the all-MiniLM-L6-v2 embedding function and cosine 
# similarity.

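# Remove the existing one if present. Check every collection, not just the first.
# (Assumes list_collections() returns objects with a .name attribute, as the original check did.)
if "news_collection" in [c.name for c in chroma_client.list_collections()]:
    chroma_client.delete_collection(name="news_collection")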

chroma_collection = chroma_client.create_collection(
    name="news_collection",
    embedding_function=embedding_func,
    metadata={"hnsw:space": "cosine"},
)

Now we have a database. Let's load it with the 1,000 news titles, storing each article's topic as metadata.

In [38]:
chroma_collection.add(
    documents=news_subset["title"].tolist(),
    ids=[f"id{i}" for i in range(len(news_subset["title"]))],
    metadatas=[{"TOPIC": topic} for topic in news_subset["topic"].tolist()]
)

Let's try it out.

In [39]:
query_results = chroma_collection.query(
    query_texts=["I want to travel in space."],
    n_results=3,
)

query_results

{'ids': [['id7', 'id157', 'id122']],
 'distances': [[0.46669214963912964, 0.5274306535720825, 0.563733696937561]],
 'metadatas': [[{'TOPIC': 'SCIENCE'},
   {'TOPIC': 'SCIENCE'},
   {'TOPIC': 'SCIENCE'}]],
 'embeddings': None,
 'documents': [['Orbital space tourism set for rebirth in 2021',
   'NASA astronauts "This is an extraordinary day to be in space ..." shoot music videos in the orbit',
   'SpaceX brings NASA astronauts safely home']],
 'uris': None,
 'data': None}
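
The `distances` in the result are cosine distances, because the collection was created with `"hnsw:space": "cosine"`. Assuming the usual definition (distance = 1 - cosine similarity), the sketch below converts them back to similarities so they can be compared with the scores computed earlier. The dictionary is a subset of the output above, and `to_similarities` is a hypothetical helper.

```python
# Subset of the query output above (ids, distances, documents only).
query_results = {
    "ids": [["id7", "id157", "id122"]],
    "distances": [[0.46669214963912964, 0.5274306535720825, 0.563733696937561]],
    "documents": [[
        "Orbital space tourism set for rebirth in 2021",
        'NASA astronauts "This is an extraordinary day to be in space ..." shoot music videos in the orbit',
        "SpaceX brings NASA astronauts safely home",
    ]],
}

def to_similarities(distances: list[float]) -> list[float]:
    """Convert cosine distances to cosine similarities (similarity = 1 - distance)."""
    return [1.0 - d for d in distances]

# Index [0] selects the results for the first (and only) query text.
similarities = to_similarities(query_results["distances"][0])
for doc_id, sim, doc in zip(query_results["ids"][0], similarities, query_results["documents"][0]):
    print(f"{doc_id:>6}  {sim:.3f}  {doc}")
```

Smaller distance means higher similarity, so the results are already ordered best match first.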

You can also supply multiple query texts in one call.

In [42]:
query_results = chroma_collection.query(
    query_texts=["I want to travel in space.",
                 "Is AI a threat?"],
    include=["documents", "distances", "metadatas"],
    n_results=2
)

# query_results["documents"][0] holds the two results for the first query, and
# query_results["documents"][1] holds the two results for the second. Reference
# individual results as query_results["documents"][0][1], query_results["documents"][1][1], etc.
query_results

{'ids': [['id7', 'id157'], ['id710', 'id2']],
 'distances': [[0.46669209003448486, 0.5274306535720825],
  [0.4232991933822632, 0.4656652808189392]],
 'metadatas': [[{'TOPIC': 'SCIENCE'}, {'TOPIC': 'SCIENCE'}],
  [{'TOPIC': 'HEALTH'}, {'TOPIC': 'SCIENCE'}]],
 'embeddings': None,
 'documents': [['Orbital space tourism set for rebirth in 2021',
   'NASA astronauts "This is an extraordinary day to be in space ..." shoot music videos in the orbit'],
  ['The Ethics Of AI And Death',
 'uris': None,
 'data': None}
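
The nested lists can be awkward to work with. A minimal sketch of flattening them into one row per (query, hit) pair follows, using only the `ids`, `distances`, and `metadatas` fields copied from the output above. `flatten_results` is a hypothetical helper, not a ChromaDB function.

```python
query_texts = ["I want to travel in space.", "Is AI a threat?"]

# Subset of the result above: one inner list per query text.
query_results = {
    "ids": [["id7", "id157"], ["id710", "id2"]],
    "distances": [[0.46669209003448486, 0.5274306535720825],
                  [0.4232991933822632, 0.4656652808189392]],
    "metadatas": [[{"TOPIC": "SCIENCE"}, {"TOPIC": "SCIENCE"}],
                  [{"TOPIC": "HEALTH"}, {"TOPIC": "SCIENCE"}]],
}

def flatten_results(queries, results):
    """Yield one dict per (query, hit) pair, in ranked order within each query."""
    for q_idx, query in enumerate(queries):
        hits = zip(results["ids"][q_idx],
                   results["distances"][q_idx],
                   results["metadatas"][q_idx])
        for rank, (doc_id, distance, metadata) in enumerate(hits, start=1):
            yield {"query": query, "rank": rank, "id": doc_id,
                   "distance": distance, "topic": metadata["TOPIC"]}

rows = list(flatten_results(query_texts, query_results))
for row in rows:
    print(row)
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame(rows)` for inspection.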

Filter on the metadata so TOPIC is "SCIENCE". The "HEALTH" article is excluded now.

In [45]:
chroma_collection.query(
    query_texts=["Is AI a threat?"],
    where={"TOPIC": {"$eq": "SCIENCE"}},
    n_results=2,
)



{'ids': [['id2', 'id198']],
 'distances': [[0.46566540002822876, 0.5246715545654297]],
 'metadatas': [[{'TOPIC': 'SCIENCE'}, {'TOPIC': 'SCIENCE'}]],
 'embeddings': None,
   'Scientists Discover New Material That Could ‘Merge AI With Human Brain’']],
 'uris': None,
 'data': None}
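
Conceptually, a filtered query restricts the candidates by metadata and then ranks what is left by distance. The sketch below does this by exact brute force over made-up 2-dimensional vectors. It is an illustration only: ChromaDB uses an approximate HNSW index rather than a full scan, and the `where` argument here is simplified to plain key/value equality instead of ChromaDB's operator syntax.

```python
import numpy as np

def filtered_top_k(query_vec, doc_vecs, metadatas, where, k=2):
    """Exact cosine-distance search over documents whose metadata matches `where`."""
    # Keep only documents whose metadata equals every key/value pair in `where`.
    keep = [i for i, m in enumerate(metadatas)
            if all(m.get(key) == value for key, value in where.items())]
    if not keep:
        return []
    candidates = doc_vecs[keep]
    sims = candidates @ query_vec / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(query_vec)
    )
    distances = 1.0 - sims                 # cosine distance
    order = np.argsort(distances)[:k]      # smallest distance first
    return [(keep[i], float(distances[i])) for i in order]

# Made-up vectors and topics standing in for real embeddings and metadata.
doc_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]])
metadatas = [{"TOPIC": "HEALTH"}, {"TOPIC": "SCIENCE"},
             {"TOPIC": "SCIENCE"}, {"TOPIC": "SCIENCE"}]
query_vec = np.array([1.0, 0.0])

unfiltered = filtered_top_k(query_vec, doc_vecs, metadatas, where={}, k=2)
science_only = filtered_top_k(query_vec, doc_vecs, metadatas, where={"TOPIC": "SCIENCE"}, k=2)
print([i for i, _ in unfiltered])     # prints: [0, 1]
print([i for i, _ in science_only])   # prints: [1, 3]
```

As in the ChromaDB output above, the best unfiltered match (the "HEALTH" document) drops out once the filter is applied, and the next-best "SCIENCE" document takes its place.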

## RAG

Embeddings and vector databases are the foundation of retrieval-augmented generation (RAG) in LLMs. In RAG, documents are passed through an embedding function and loaded into the vector database. Each query to the LLM is first passed through the same embedding function and compared to the documents in the vector database. The most similar documents are sent as context to the LLM. RAG enables LLMs to make inferences using information that wasn't included in their training data.

Instead of sending our data to OpenAI, this time we will download the [dolly-v2-3b](https://huggingface.co/databricks/dolly-v2-3b) model from Hugging Face. dolly-v2-3b is a 2.8 billion parameter model. `pytorch_model.bin` takes up 5.68 GB on my system.
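
The two figures above are enough for a back-of-the-envelope check of how the weights are stored. The sketch below assumes the reported size means 5.68 × 10^9 bytes; the result is consistent with roughly two bytes (16 bits) per parameter. The sizes for other precisions are rough estimates of weights only, not measurements.

```python
params = 2.8e9         # parameter count stated above
file_bytes = 5.68e9    # size of pytorch_model.bin stated above (assumed decimal GB)

bytes_per_param = file_bytes / params
print(f"{bytes_per_param:.2f} bytes per parameter")
# prints: 2.03 bytes per parameter

# Rough weight sizes at common precisions (no activations or runtime overhead).
for name, width in [("float32", 4), ("16-bit", 2), ("int8", 1)]:
    print(f"{name:>8}: {params * width / 1e9:.1f} GB")
```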

In [46]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
model_id = "databricks/dolly-v2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
lm_model = AutoModelForCausalLM.from_pretrained(model_id)

In the pipeline call, `max_new_tokens` limits the response size to 256 tokens. `device_map="auto"` lets the library decide whether to use CPU or GPU for text generation.

In [47]:
pipe = pipeline(
    "text-generation",
    model=lm_model,
    tokenizer=tokenizer,
    max_new_tokens=256,
    device_map="auto",
)


Create the prompt.

In [49]:
# Let's get 10 article titles from our ChromaDB collection.
results = chroma_collection.query(query_texts=["laptop"], n_results=10)

# The prompt consists of the context and question separated by a couple of newlines.
# Each title is prefixed with "#" so the model can tell where one ends and the next begins.
context = " ".join([f"#{doc}" for doc in results["documents"][0]])
question = "Can I buy a Toshiba laptop?"

prompt_template = f"Relevant context: {context}\n\n The user's question: {question}"

print(prompt_template)

Relevant context: #The Legendary Toshiba is Officially Done With Making Laptops #3 gaming laptop deals you can’t afford to miss today #Lenovo and HP control half of the global laptop market #Asus ROG Zephyrus G14 gaming laptop announced in India #Acer Swift 3 featuring a 10th-generation Intel Ice Lake CPU, 2K screen, and more launched in India for INR 64999 (US$865) #Apple's Next MacBook Could Be the Cheapest in Company's History #Features of Huawei's Desktop Computer Revealed #Redmi to launch its first gaming laptop on August 14: Here are all the details #Toshiba shuts the lid on laptops after 35 years #This is the cheapest Windows PC by a mile and it even has a spare SSD slot

 The user's question: Can I buy a Toshiba laptop?
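
The prompt assembly above can be wrapped in a small function that also caps the size of the context, which matters once the retrieved documents are longer than headlines. This is a minimal sketch: `build_prompt` and its `max_context_chars` parameter are hypothetical, and a character budget is only a crude stand-in for a token budget.

```python
def build_prompt(documents: list[str], question: str, max_context_chars: int = 1000) -> str:
    """Join retrieved documents into a context string, dropping those that overflow the budget."""
    parts, used = [], 0
    for doc in documents:
        piece = f"#{doc}"
        extra = len(piece) + (1 if parts else 0)   # +1 for the joining space
        if used + extra > max_context_chars:
            break  # documents arrive most relevant first, so drop the tail
        parts.append(piece)
        used += extra
    context = " ".join(parts)
    return f"Relevant context: {context}\n\n The user's question: {question}"

# Three titles taken from the query output above.
docs = [
    "The Legendary Toshiba is Officially Done With Making Laptops",
    "Toshiba shuts the lid on laptops after 35 years",
    "Lenovo and HP control half of the global laptop market",
]
prompt = build_prompt(docs, "Can I buy a Toshiba laptop?", max_context_chars=120)
print(prompt)
```

With a 120-character budget only the first two titles fit, so the third is left out of the prompt.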


Pass the prompt into `pipe()`, wait about 90 seconds, and the result appears.

In [50]:
lm_response = pipe(prompt_template)
print(lm_response[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


Relevant context: #The Legendary Toshiba is Officially Done With Making Laptops #3 gaming laptop deals you can’t afford to miss today #Lenovo and HP control half of the global laptop market #Asus ROG Zephyrus G14 gaming laptop announced in India #Acer Swift 3 featuring a 10th-generation Intel Ice Lake CPU, 2K screen, and more launched in India for INR 64999 (US$865) #Apple's Next MacBook Could Be the Cheapest in Company's History #Features of Huawei's Desktop Computer Revealed #Redmi to launch its first gaming laptop on August 14: Here are all the details #Toshiba shuts the lid on laptops after 35 years #This is the cheapest Windows PC by a mile and it even has a spare SSD slot

 The user's question: Can I buy a Toshiba laptop?
The answer: No, Toshiba has decided to stop manufacturing laptops.
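
Note that the generated text repeats the whole prompt before the answer. A minimal sketch of keeping only the part the model added is shown below. `extract_completion` is a hypothetical helper, and the strings are shortened versions of the prompt and output above. The `transformers` text-generation pipeline also accepts `return_full_text=False`, which should have a similar effect.

```python
def extract_completion(generated_text: str, prompt: str) -> str:
    """Return only the text the model added after the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    return generated_text.strip()   # fall back to the full text if the prompt is not echoed

# Shortened stand-ins for prompt_template and lm_response[0]["generated_text"].
prompt = (
    "Relevant context: #Toshiba shuts the lid on laptops after 35 years"
    "\n\n The user's question: Can I buy a Toshiba laptop?"
)
generated = prompt + "\nThe answer: No, Toshiba has decided to stop manufacturing laptops."

answer = extract_completion(generated, prompt)
print(answer)
# prints: The answer: No, Toshiba has decided to stop manufacturing laptops.
```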




We could have sent the same context and question to OpenAI instead.

In [57]:
from openai import OpenAI

client = OpenAI()

chat_response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": context},
        {"role": "user", "content": question},
    ],
    temperature=0,
    n=1,
)

print(chat_response.choices[0].message.content)

Yes, you can still find Toshiba laptops available for purchase from various retailers. However, please note that Toshiba has officially exited the laptop market, so the availability of new models may be limited. It is recommended to check with local retailers or online marketplaces to see if they have any Toshiba laptops in stock.
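
The call above passes the retrieved titles as the system message without saying what they are for. One possible variation, sketched below, wraps the context in an explicit instruction. The helper `build_messages` and the instruction wording are assumptions for illustration, and no claim is made about how this changes the model's answer.

```python
def build_messages(context: str, question: str) -> list[dict]:
    """Build a chat-completions message list that tells the model how to use the context."""
    system = (
        "Answer the user's question using the article titles below as context. "
        "If they do not contain the answer, say so.\n\n"
        f"Context: {context}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]

# One title from the retrieved context above, used as a short example.
messages = build_messages(
    "#Toshiba shuts the lid on laptops after 35 years",
    "Can I buy a Toshiba laptop?",
)
for message in messages:
    print(message["role"], "->", message["content"])
```

The resulting list can be passed as the `messages` argument of `client.chat.completions.create()` in place of the literal list used above.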
