In [None]:
!pip install sentence_transformers plotly -Uq
!pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu122 -Uq
!wget https://tufts.box.com/shared/static/325sgkodnq30ez61ugazvctif6r24hsu.csv -O daf.csv


# Introducing Semantic Search

Information retrieval is a large and complicated field. In this notebook, we'll look at the steps involved in a specific information retrieval algorithm called "semantic search," which employs a language model to compare the similarity of a search query to chunks of original data. The steps are as follows:

* Load our model
* Read in and chunk our data
* Embed the chunks
* Take in and embed our user query
* Take the dot product between our user query and our document embeddings
* Align relevant indices with original chunked data
* Return chunks to the user or another process

At the end of the notebook, we'll pass the retrieved chunks to an LLM, completing a process called Retrieval Augmented Generation (RAG).

## Some key concepts in semantic search
**Masked Language Modeling**: The language modeling behind semantic search may seem confusing because it is unlike the modeling we have done in other notebooks, but it is more similar than it might seem. The models we use for this task take in a string (usually a sentence or paragraph) and output a vector of numbers. Unlike generative models, they do not produce more text or images; instead, they show us how they interpret language. The vectors and matrices these models produce (called embeddings) represent how the model understands the text we give it. In training, rather than predicting the next token, these models are given a full sentence in which a random selection of words has been replaced by a special mask token, and they have to guess the masked words. Because the model can use the context on both sides of each mask, this type of training tends to give it a sense of sentence-level meaning that is well suited to comparing texts. (In practice, embedding models are usually fine-tuned further after this pretraining so that similar texts receive similar vectors.)
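
To make the masking step concrete, here is a minimal sketch of how one training example might be built. It is a simplification under stated assumptions: whitespace-split words stand in for the subword tokens real models use, and `mask_tokens`, `mask_rate`, and the `[MASK]` string are illustrative names, not part of any library.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Return (masked_tokens, labels) for one toy masked-LM training example.

    labels[i] holds the original token wherever a mask was applied, else None.
    """
    rng = random.Random(seed)  # seeded so the example is repeatable
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            masked.append(mask_token)  # hide the word from the model
            labels.append(tok)         # the model is trained to recover it
        else:
            masked.append(tok)
            labels.append(None)        # nothing to predict at this position
    return masked, labels

sentence = "The Goths fled from the city".split()
masked, labels = mask_tokens(sentence, mask_rate=0.3)
print(masked)
print(labels)
```

The model only ever sees `masked`; `labels` is what its guesses are scored against during training.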

**Dot Product**: Once we have generated embeddings for our source documents and our query string, we need some way of comparing them. We would like a function that takes in a vector and a matrix of compatible sizes and returns how similar each row of the matrix is to the vector. Thankfully, linear algebra provides exactly this function: the "dot product" (we will be using the dot product of length-normalized vectors, also known as cosine similarity). Given a vector, $V$, of size $(1, N)$ and a matrix, $M$, of size $(K, N)$, $V \cdot M^{T}$ returns a row vector of size $(1, K)$. Because every embedding is normalized to unit length, each element of this new vector is a score from -1 to 1 that represents how similar $V$ is to the corresponding row of $M$, with higher scores meaning greater similarity. More details to follow.
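
A small worked example may help here. The sketch below uses hand-picked toy vectors (not real embeddings) and a helper named `l2_normalize`, which is our own illustrative function. It shows that the product of a query row vector with the transposed document matrix yields one score per document, and that after normalizing to unit length the scores fall between -1 and 1.

```python
import numpy as np

# Toy data: one query vector and three "document" vectors of length 4.
V = np.array([[1.0, 2.0, 0.0, 1.0]])   # shape (1, 4)
M = np.array([
    [2.0, 4.0, 0.0, 2.0],      # same direction as V
    [-1.0, -2.0, 0.0, -1.0],   # opposite direction
    [0.0, 0.0, 3.0, 0.0],      # orthogonal to V
])                                      # shape (3, 4)

def l2_normalize(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# One score per document row: approximately 1, -1, and 0 respectively.
scores = l2_normalize(V) @ l2_normalize(M).T
print(scores.shape)
print(scores.round(3))
```

Without the normalization step the scores would also grow with vector length, so a long vector could outscore a better-aligned short one.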


## Data and model prep

In [None]:
#imports
import pandas as pd
import torch
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
import re
import nltk
nltk.download('punkt')
import plotly.express as px
import plotly.graph_objects as go
from llama_cpp import Llama
from pprint import pprint



In [None]:
# loading our embedding model
model = SentenceTransformer('BAAI/bge-m3', trust_remote_code=True)


In [None]:
df = pd.read_csv('daf.csv')
df # our data

Unnamed: 0,title,text,footnotes
0,The Extent Of The Empire In The Age Of The Ant...,Introduction. The Extent And Military Fo...,"[('1', 'Dion Cassius, (l. liv. p. 736,) with t..."
1,The Extent Of The Empire In The Age Of The Ant...,"It was an ancient tradition, that when the Cap...","[('22', 'Ovid. Fast. l. ii. ver. 667. See Livy..."
2,The Extent Of The Empire In The Age Of The Ant...,The camp of a Roman legion presented the appea...,"[('60', 'Vegetius finishes his second book, an..."
3,The Internal Prosperity In The Age Of The Anto...,Of The Union And Internal Prosperity Of The Ro...,"[('1', 'They were erected about the midway bet..."
4,The Internal Prosperity In The Age Of The Anto...,Till the privileges of Romans had been progres...,"[('26', 'The senators were obliged to have one..."
...,...,...,...
291,Final Settlement Of The Ecclesiastical State.—...,Never perhaps has the energy and effect of a s...,"[('28', 'Fortifiocca, l. ii. c. 11. From the a..."
292,Final Settlement Of The Ecclesiastical State.—...,"Without drawing his sword, count Pepin restore...","[('50', 'The troubles of Rome, from the depart..."
293,Final Settlement Of The Ecclesiastical State.—...,"The royal prerogative of coining money, which ...","[('77', 'See the xxviith Dissertation of the A..."
294,Prospect Of The Ruins Of Rome In The Fifteenth...,Prospect Of The Ruins Of Rome In The Fifteenth...,"[('101', 'It should be Pope Martin the Fifth. ..."


In [None]:
df = df.drop('footnotes', axis=1)
df['sentences'] = df['text'].apply(nltk.sent_tokenize)
sentences = df.explode('sentences')
mask = sentences['sentences'].apply(lambda x: len(x) < 25) # flagging all short sentences (under 25 characters) for removal
sentences = sentences[~mask]

In [None]:
sentences

Unnamed: 0,title,text,sentences
0,The Extent Of The Empire In The Age Of The Ant...,Introduction. The Extent And Military Fo...,The Extent And Military Force Of The Empire In...
0,The Extent Of The Empire In The Age Of The Ant...,Introduction. The Extent And Military Fo...,"In the second century of the Christian Æra, th..."
0,The Extent Of The Empire In The Age Of The Ant...,Introduction. The Extent And Military Fo...,The frontiers of that extensive monarchy were ...
0,The Extent Of The Empire In The Age Of The Ant...,Introduction. The Extent And Military Fo...,The gentle but powerful influence of laws and ...
0,The Extent Of The Empire In The Age Of The Ant...,Introduction. The Extent And Military Fo...,Their peaceful inhabitants enjoyed and abused ...
...,...,...,...
295,Prospect Of The Ruins Of Rome In The Fifteenth...,These general observations may be separately a...,Those provinces and tributes had been lost in ...
295,Prospect Of The Ruins Of Rome In The Fifteenth...,These general observations may be separately a...,"The population of Rome, far below the measure ..."
295,Prospect Of The Ruins Of Rome In The Fifteenth...,These general observations may be separately a...,The various causes and progressive effects are...
295,Prospect Of The Ruins Of Rome In The Fifteenth...,These general observations may be separately a...,The historian may applaud the importance and v...


Below we begin a process called 'embedding', where we take our individual sub-documents (in this case each sentence from the *Decline and Fall*) and pass them through our embedding model. As mentioned above, this model is trained to output a representation of each given string as a vector in a high-dimensional space. When we give a model like this multiple sentences to embed, it outputs multiple vectors, all stacked on top of each other. This vertical arrangement of row vectors is a matrix, and in this case it has the shape: number of inputs x the model's hidden state dimension (this number is fixed by the model's architecture, so we have no control over it).

In [None]:
embeddings = model.encode(
    sentences.sentences.to_list(), # our sentences
    batch_size=64, # high batch size = faster embedding, more VRAM
    show_progress_bar=True,
    device='cuda',
    normalize_embeddings=True # scales each embedding to unit length (L2 norm of 1), so dot products equal cosine similarities
)


In [None]:
embeddings.shape # number of documents x the model's hidden state dimension.

(7880, 1024)

In [None]:
embeddings[0] # single vector representing the first sentence in our list

array([ 0.03671465,  0.025844  , -0.04396637, ...,  0.02418644,
       -0.03459973, -0.05511779], dtype=float32)

In [None]:
embeddings[0].shape # an embedding is a single vector of the size of the model's hidden state

(1024,)

## Digression: Visualizing Embeddings

To build a better intuition for what embeddings are and how they work, we will use some simple data visualization techniques to see what these embeddings tell us about the underlying data.

In [None]:
# using PCA to reduce our 1024-dimensional vectors to 2 dimensions
pca = PCA(n_components=2)
pca.fit(embeddings)
X = pca.transform(embeddings)
X.shape # 7880, 1024 -> 7880, 2

(7880, 2)

In [None]:
# making a dataframe to visualize the embeddings with the original sentences
plotting = pd.DataFrame({
    'x': X[:, 0],
    'y': X[:, 1],
    'title' : sentences.title,
    'sentence': sentences.sentences,
})
plotting['sentence'] = plotting['sentence'].str.wrap(100).apply(lambda x: x.replace('\n', '<br>'))

In [None]:
fig = px.scatter(plotting, x='x', y='y', hover_data=['sentence']) # hover_data takes a list of column names
fig.show()

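In the scatter plot above, each dot represents a single embedding, which in turn represents a single sentence. Because the model places sentences with similar meanings near each other, similar sentences tend (though not always) to be grouped together. This creates clusters and subclusters of similar sentences. Keep in mind that the plot keeps only two of the 1024 dimensions' worth of variation, so some sentences that look close here may be far apart in the full space. This internal structure of the embeddings is what will help us conduct information retrieval.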
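
How much a two-dimensional picture can be trusted depends on how much of the data's variance those two dimensions keep. The sketch below works through PCA by hand with NumPy on random stand-in data (not our real embeddings) to show what the projection does and how to measure what it retains. With scikit-learn's `PCA`, the same per-component fractions are available as `explained_variance_ratio_` after fitting.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for an embedding matrix: 200 "sentences", 16 dimensions.
# These are random numbers, used only to illustrate the mechanics.
E = rng.normal(size=(200, 16))

# PCA by hand: center the data, then take the top right-singular vectors.
E_centered = E - E.mean(axis=0)
U, S, Vt = np.linalg.svd(E_centered, full_matrices=False)
components = Vt[:2]                # (2, 16): directions of greatest variance
X2 = E_centered @ components.T     # (200, 2): coordinates we would plot

# Fraction of the total variance captured by each component.
explained = (S ** 2) / np.sum(S ** 2)
print(X2.shape)
print(explained[:2].sum())
```

A low retained fraction means the scatter plot is a rough sketch of the embedding space rather than a faithful map of it.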

## Query-based retrieval

Now that we have some intuition for how embeddings work, we can put them to the test with a sample query.

Below we will use an extra string called `retrieval_instruction`. Often when we take in a query from the user, it will be much shorter than the typical document in our sentence list. Prepending this extra string to the user query is meant to make the query more comparable to the documents in our embeddings. Whether an instruction helps depends on how the embedding model was trained, so it is worth checking the model card for the model you use.

In [None]:
retrieval_instruction = "Represent this sentence for searching relevant passages: "
query = 'Who were the Goths'
query_embedding = model.encode(
    retrieval_instruction+query,
    device='cuda',
    normalize_embeddings=True
)

In [None]:
query_embedding.shape # just like a single embedding from above

(1024,)

In [None]:
# relevancy measure: dot product
sim_vector = query_embedding @ embeddings.T # (n,) @ (n, m) = (m,), in our case: (1024,) @ (1024, 7880)
sim_vector.shape # (7880,): one similarity score per sentence in our original list, each measured against the query

(7880,)

In [None]:
# argsort returns the indices that would sort the array in ascending order (least similar first)
sim_vector.argsort()

array([1414,  404, 7862, ..., 1218, 4412, 4818])

In [None]:
sim_vector.argsort()[::-1] # reverses the order so the most similar sentences come first

array([4818, 4412, 1218, ..., 7862,  404, 1414])

In [None]:
k = 20
rel_idx = sim_vector.argsort()[::-1][:k] # selects top k indices from the array
rel_idx

array([4818, 4412, 1218, 4190, 1185, 3922, 1203, 4704, 4026, 4641, 3339,
       1421, 4088, 4617, 1267, 3611, 3418, 3400, 1272, 1198])

In [None]:
rel_chunks = sentences.sentences.iloc[rel_idx].to_list() # get back our sentences by position, building the list only once
rel_chunks # read through these to verify that we're on the right track

['The Goths fled from the city.',
 'For their subsistence, the Goths depended on the magazines of corn which was ground in portable mills by the hands of their women; on the milk and flesh of their flocks and herds; on the casual produce of the chase, and upon the contributions which they might impose on all who should presume to dispute the passage, or to refuse their friendly assistance.',
 'The Goths were now, on every side, surrounded and pursued by the Roman arms.',
 'The Goths were the foremost of these savage proselytes; and the nation was indebted for its conversion to a countryman, or, at least, to a subject, worthy to be ranked among the inventors of useful arts, who have deserved the remembrance and gratitude of posterity.',
 'In the beginning of the sixth century, and after the conquest of Italy, the Goths, in possession of present greatness, very naturally indulged themselves in the prospect of past and of future glory.',
 'The Western world was oppressed by the Goths and 
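
The retrieval steps above (score, sort, take the top k, map back to text) can be gathered into one helper. The sketch below is one possible way to do it, using toy two-dimensional unit vectors in place of real embeddings. The function name `top_k` is our own, and it swaps the full `argsort` for `np.argpartition`, a NumPy function that finds the k largest scores without sorting the whole array, which can matter for large collections.

```python
import numpy as np

def top_k(query_vec, doc_matrix, chunks, k=3):
    """Return the k (score, chunk) pairs with the highest dot-product score."""
    scores = doc_matrix @ query_vec              # shape (num_docs,)
    k = min(k, len(chunks))
    # argpartition places the k largest scores in the last k slots, unsorted...
    candidates = np.argpartition(scores, -k)[-k:]
    # ...so only those k candidates need sorting, highest score first.
    order = candidates[np.argsort(scores[candidates])[::-1]]
    return [(float(scores[i]), chunks[i]) for i in order]

# Toy unit-length vectors standing in for real embeddings.
docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [-1.0, 0.0]])
chunks = ["doc A", "doc B", "doc C", "doc D"]
query = np.array([0.8, 0.6])

for score, chunk in top_k(query, docs, chunks, k=2):
    print(f"{score:.2f}  {chunk}")
```

Returning the scores alongside the text makes it easier to spot when even the best match is only weakly related to the query.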

## Retrieval Augmented Generation (RAG)

Semantic search is interesting and useful by itself, but recently it has taken on a new importance. Users of modern AI systems are always seeking new ways to condition AI output on relevant data. Semantic search offers a good way of dealing with this problem and thus constitutes the first phase in a process called Retrieval Augmented Generation, or RAG: first we use semantic search to retrieve relevant documents, and then we pass those documents to an AI in a prompt. Below is a quick example of doing so.

In [None]:
# loading our LLM
llm = Llama.from_pretrained(
    repo_id="Qwen/Qwen2-7B-Instruct-GGUF",
    filename="*q4_0.gguf",
    verbose=True,
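    n_gpu_layers=-1, # offload all layers to the GPU (the parameter is n_gpu_layers; other keyword names are silently ignored)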
    n_ctx=3000
)


llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from /root/.cache/huggingface/hub/models--Qwen--Qwen2-7B-Instruct-GGUF/snapshots/7c1879f2983b48bb6a5609f7546299b833d25d13/./qwen2-7b-instruct-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.name str              = qwen2-7b-instruct
llama_model_loader: - kv   2:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   3:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv   4:                     qwen2.embedding_length u32              = 3584
llama_model_loader: - kv   5:                  qwen2.feed_forward_length u32              = 18944
llama_model_loader: - kv   6:                 qwen

In [None]:
# RAG prompt, feel free to change and see the differences
base_prompt = """
# Question answering task
You are a helpful AI assistant that is skilled at answering user questions based on a given context.

## User question
{question}

## Context
{context}
""".strip()

message = [{
    "role":"user",
    "content":base_prompt.format(
        question=query, # our query from above
        context='\n'.join(rel_chunks) # relevant chunks
    )
}]
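
Since the model's context window is limited (we set `n_ctx=3000` above), it can help to cap how much retrieved text goes into the prompt. The sketch below is one simple way to do that, not the only one: `build_rag_messages` and `max_context_chars` are hypothetical names, and a character count is only a rough proxy for the token count the model actually sees.

```python
def build_rag_messages(question, chunks, template, max_context_chars=4000):
    """Fill a prompt template with as many retrieved chunks as fit the budget.

    Chunks are assumed to be ordered most relevant first, so the least
    relevant ones are the ones dropped when the budget runs out.
    """
    kept, used = [], 0
    for chunk in chunks:
        cost = len(chunk) + 1  # +1 for the newline that joins chunks
        if used + cost > max_context_chars:
            break
        kept.append(chunk)
        used += cost
    content = template.format(question=question, context="\n".join(kept))
    return [{"role": "user", "content": content}]

# A shortened stand-in template with the same two placeholders as base_prompt.
template = "## User question\n{question}\n\n## Context\n{context}"
demo_chunks = ["The Goths fled from the city.", "x" * 50, "y" * 50]
messages = build_rag_messages(
    "Who were the Goths", demo_chunks, template, max_context_chars=100
)
print(messages[0]["content"])
```

With the tiny budget used here, the third chunk is left out; in the notebook you would pass `base_prompt` and `rel_chunks` with a budget chosen to suit the context size.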

In [None]:
# may take some time (~5-10 minutes)
text = llm.create_chat_completion(message, max_tokens=-1)


llama_print_timings:        load time =  162577.43 ms
llama_print_timings:      sample time =      58.39 ms /   419 runs   (    0.14 ms per token,  7175.76 tokens per second)
llama_print_timings: prompt eval time =  341553.70 ms /  1062 tokens (  321.61 ms per token,     3.11 tokens per second)
llama_print_timings:        eval time =  340485.96 ms /   418 runs   (  814.56 ms per token,     1.23 tokens per second)
llama_print_timings:       total time =  683118.74 ms /  1480 tokens


In [None]:
pprint(text['choices'][0]['message']['content']) #output

('The Goths were a Germanic tribe that originated from Scandinavia or Prussia. '
 'They are known for their migrations and invasions during the late Roman '
 'Empire period. The Goths were initially settled in the region around the '
 'mouth of the Borysthenes river (now the Dnieper river) in what is now '
 'Ukraine. They were known for their military prowess and were often involved '
 'in conflicts with the Roman Empire.\n'
 '\n'
 'The Goths were skilled in various arts and crafts, including the invention '
 'of useful tools and weapons, which contributed to their survival and '
 'success. In the beginning of the sixth century, the Goths were able to '
 'conquer Italy after the fall of the Western Roman Empire.\n'
 '\n'
 'The Goths were also known for their religious conversion, which was '
 'attributed to a countryman who was an inventor of useful arts. This '
 'conversion likely helped them integrate into the societies they encountered '
 'during their migrations.\n'
 '\n'
 'During 

## Conclusion

In this notebook, we have begun an exploration of embeddings, but there is much more to understand. In future lessons, we'll see other ways to use document-level embeddings and train our own embedding model for languages other than English. If you are interested in exploring more, I would check out the documentation of the package we used to load the embedding model: [sBERT](https://www.sbert.net/). They have a lot of good articles on semantic search and other applications.