## What is RAG?

RAG stands for Retrieval Augmented Generation.

It was introduced in the paper [*Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks*](https://arxiv.org/abs/2005.11401).

Each step can be roughly broken down to:

* **Retrieval** - Seeking relevant information from a source given a query. For example, getting relevant passages of Wikipedia text from a database given a question.
* **Augmented** - Using the relevant retrieved information to modify an input to a generative model (e.g. an LLM).
* **Generation** - Generating an output given an input. For example, in the case of an LLM, generating a passage of text given an input prompt.

## Why RAG?

The main goal of RAG is to improve the generation outptus of LLMs.

Two primary improvements can be seen as:
1. **Preventing hallucinations** - LLMs are incredible but they are prone to potential hallucination, as in, generating something that *looks* correct but isn't. RAG pipelines can help LLMs generate more factual outputs by providing them with factual (retrieved) inputs. And even if the generated answer from a RAG pipeline doesn't seem correct, because of retrieval, you also have access to the sources where it came from.
2. **Work with custom data** - Many base LLMs are trained with internet-scale text data. This means they have a great ability to model language, however, they often lack specific knowledge. RAG systems can provide LLMs with domain-specific data such as medical information or company documentation and thus customized their outputs to suit specific use cases.


RAG can also be a much quicker solution to implement than fine-tuning an LLM on specific data.



## What kind of problems can RAG be used for?

RAG can help anywhere there is a specific set of information that an LLM may not have in its training data (e.g. anything not publicly accessible on the internet).

For example you could use RAG for:
* **Customer support Q&A chat** - By treating your existing customer support documentation as a resource, when a customer asks a question, you could have a system retrieve relevant documentation snippets and then have an LLM craft those snippets into an answer. Think of this as a "chatbot for your documentation".
* **Email chain analysis** - Let's say you're an insurance company with long threads of emails between customers and insurance agents. Instead of searching through each individual email, you could retrieve relevant passages and have an LLM create strucutred outputs of insurance claims.
* **Company internal documentation chat** - If you've worked at a large company, you know how hard it can be to get an answer sometimes. Why not let a RAG system index your company information and have an LLM answer questions you may have? The benefit of RAG is that you will have references to resources to learn more if the LLM answer doesn't suffice.
* **Textbook Q&A** - Let's say you're studying for your exams and constantly flicking through a large textbook looking for answers to your quesitons. RAG can help provide answers as well as references to learn more.

All of these have the common theme of retrieving relevant resources and then presenting them in an understandable way using an LLM.



We'll write the code to:
1. Open a PDF document (you could use almost any PDF here).
2. Format the text of the PDF textbook ready for an embedding model (this process is known as text splitting/chunking).
3. Embed all of the chunks of text in the textbook and turn them into numerical representation which we can store for later.
4. Build a retrieval system that uses vector search to find relevant chunks of text based on a query.
5. Create a prompt that incorporates the retrieved pieces of text.
6. Generate an answer to a query based on passages from the textbook.


## 1. Document/Text Processing and Embedding Creation

Ingredients:
* PDF document of choice.
* Embedding model of choice.

Steps:
1. Import PDF document.
2. Process text for embedding (e.g. split into chunks of sentences).
3. Embed text chunks with embedding model.
4. Save embeddings to file for later use

In [1]:
import os
import requests

# Get PDF document path
pdf_path = "hehe.pdf"

# Download PDF
if not os.path.exists(pdf_path):
    print("[INFO] File doesn't exist, downloading...")

    # Enter the URL of the PDF
    url = "https://gcatnjust.github.io/ChenGong/paper/wei_tnnls19.pdf"

    # The local filename to save the downloaded file
    filename = pdf_path

    # Send a GET request to the URL
    response = requests.get(url)

    # Check if the request was successful
    if response.status_code == 200:
        # Open the file and save it
        with open(filename, "wb") as file:
            file.write(response.content)
        print(f"[INFO] The file has been download and saved as {filename}")
    else:
        print(f"[INFO] Failed to download the file. Status code: {reponse.status_code}")

else:
    print(f"File {pdf_path} exists.")

[INFO] File doesn't exist, downloading...
[INFO] The file has been download and saved as hehe.pdf


In [2]:
!pip install PyMuPDF
!pip install tqdm

Collecting PyMuPDF
  Downloading pymupdf-1.26.0-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.26.0-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (24.1 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m24.1/24.1 MB[0m [31m51.0 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: PyMuPDF
Successfully installed PyMuPDF-1.26.0


## 2. Text preprocessing

In [3]:
import fitz #for opening document
from tqdm.auto import tqdm

def text_formatter(text: str) -> str:
    """Performs minor formatting on text."""
    cleaned_text = text.replace("\n", " ").strip()
    return cleaned_text

def open_and_read_pdf(pdf_path: str) -> list[dict]:
    """Opens a PDF file, reads its text content page by page, and collects statistics."""
    doc = fitz.open(pdf_path)
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc)):
        text = page.get_text()
        text = text_formatter(text=text)
        pages_and_texts.append({"page_number": page_number - 0, # adjusted page numbers since our PDF starts on page 42
                                "page_char_count": len(text),
                                "page_word_count": len(text.split(" ")),
                                "page_sentence_count_raw": len(text.split(", ")),
                                "page_token_count": len(text) / 4, #1 token has approx 4 characters
                                "text": text})
    return pages_and_texts
pages_and_texts = open_and_read_pdf(pdf_path=pdf_path)
pages_and_texts[:2]

0it [00:00, ?it/s]

[{'page_number': 0,
  'page_char_count': 7049,
  'page_word_count': 1036,
  'page_sentence_count_raw': 88,
  'page_token_count': 1762.25,
  'text': 'This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 1 Harnessing Side Information for Classiﬁcation Under Label Noise Yang Wei , Chen Gong , Member, IEEE, Shuo Chen , Tongliang Liu , Member, IEEE, Jian Yang , Member, IEEE, and Dacheng Tao , Fellow, IEEE Abstract—Practical data sets often contain the label noise caused by various human factors or measurement errors, which means that a fraction of training examples might be mistakenly labeled. Such noisy labels will mislead the classiﬁer training and severely decrease the classiﬁcation performance. Existing approaches to handle this problem are usually developed through various surrogate loss functions under the framework of empiri- cal risk m

In [4]:
import random

random.sample(pages_and_texts, k=2)

[{'page_number': 6,
  'page_char_count': 4175,
  'page_word_count': 857,
  'page_sentence_count_raw': 45,
  'page_token_count': 1043.75,
  'text': 'This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination. WEI et al.: HARNESSING SIDE INFORMATION FOR CLASSIFICATION UNDER LABEL NOISE 7 Lemma 6 [40]: The function F : Rd×c →R deﬁned as F(W) = (1/2)∥W∥2 2,2 is (1/2)-strongly convex with respect to ∥· ∥2,2 over Rd×c, where ∥· ∥2,2 := ∥· ∥F. By combining Lemmas 5 and 6 with the bound given in Lemma 4, we obtain the following two corollaries. Corollary 7: Let W = {W : ∥W∥2,1 ≤W2,1} and A = {A ∈Rn×c : ∥A∥2,∞≤A2,∞}, and then the empirical Rademacher complexity of the function class with F(W) = (1/2)∥W∥2 2,q for q = (ln(c)/(ln(c) −1)) is bounded as Eσ \x13 sup f ∈F 1 nr nr \x02 α=1 σαtr(W⊤A(α)) \x14 ≤W2,1A2,∞ \x12 3 ln(c) nr (23) with the fact that the dual norm of ℓ2,1 is ℓ2,∞. Corollary 8: Let W = {W : ∥W∥F ≤

## 3. Making dataframe

Let's perform a rough exploratory data analysis (EDA) to get an idea of the size of the texts (e.g. character counts, word counts etc) we're working with.




In [5]:
import pandas as pd

df = pd.DataFrame(pages_and_texts)
df.head()

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,text
0,0,7049,1036,88,1762.25,This article has been accepted for inclusion i...
1,1,6177,969,39,1544.25,This article has been accepted for inclusion i...
2,2,6151,1044,56,1537.75,This article has been accepted for inclusion i...
3,3,4901,927,65,1225.25,This article has been accepted for inclusion i...
4,4,4164,833,44,1041.0,This article has been accepted for inclusion i...


In [6]:
df.tail()

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,text
10,10,4482,736,37,1120.5,This article has been accepted for inclusion i...
11,11,4223,688,37,1055.75,This article has been accepted for inclusion i...
12,12,3635,655,46,908.75,This article has been accepted for inclusion i...
13,13,7785,1298,240,1946.25,This article has been accepted for inclusion i...
14,14,6466,983,135,1616.5,This article has been accepted for inclusion i...


In [7]:
df.shape

(15, 6)

In [8]:
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count
count,15.0,15.0,15.0,15.0,15.0
mean,7.0,5167.93,888.2,70.73,1291.98
std,4.47,1312.87,186.89,53.88,328.22
min,0.0,3377.0,538.0,31.0,844.25
25%,3.5,4199.0,784.5,41.5,1049.75
50%,7.0,4839.0,927.0,51.0,1209.75
75%,10.5,6164.0,982.0,73.5,1541.0
max,14.0,7785.0,1298.0,240.0,1946.25


## 4. Further text processing (splitting pages into sentences)
We will to follow the workflow of:

`Ingest text -> split it into groups/chunks -> embed the groups/chunks -> use the embeddings`

Why split into sentences?

* Easier to handle than larger pages of text (especially if pages are densely filled with text).
* Can get specific and find out which group of sentences were used to help within a RAG pipeline.


We will use spaCy to break our text into sentences since it's likely a bit more robust than just using `text.split(". ")`.

In [9]:
from spacy.lang.en import English

nlp = English()

nlp.add_pipe("sentencizer")

for item in tqdm(pages_and_texts):
    item["sentences"] = list(nlp(item["text"]).sents)
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]

    item["page_sentence_count_spacy"] = len(item["sentences"])

  0%|          | 0/15 [00:00<?, ?it/s]

In [10]:
random.sample(pages_and_texts, k=1)

[{'page_number': 0,
  'page_char_count': 7049,
  'page_word_count': 1036,
  'page_sentence_count_raw': 88,
  'page_token_count': 1762.25,
  'text': 'This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 1 Harnessing Side Information for Classiﬁcation Under Label Noise Yang Wei , Chen Gong , Member, IEEE, Shuo Chen , Tongliang Liu , Member, IEEE, Jian Yang , Member, IEEE, and Dacheng Tao , Fellow, IEEE Abstract—Practical data sets often contain the label noise caused by various human factors or measurement errors, which means that a fraction of training examples might be mistakenly labeled. Such noisy labels will mislead the classiﬁer training and severely decrease the classiﬁcation performance. Existing approaches to handle this problem are usually developed through various surrogate loss functions under the framework of empiri- cal risk m

In [11]:
df = pd.DataFrame(pages_and_texts)
df

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,text,sentences,page_sentence_count_spacy
0,0,7049,1036,88,1762.25,This article has been accepted for inclusion i...,[This article has been accepted for inclusion ...,39
1,1,6177,969,39,1544.25,This article has been accepted for inclusion i...,[This article has been accepted for inclusion ...,56
2,2,6151,1044,56,1537.75,This article has been accepted for inclusion i...,[This article has been accepted for inclusion ...,53
3,3,4901,927,65,1225.25,This article has been accepted for inclusion i...,[This article has been accepted for inclusion ...,33
4,4,4164,833,44,1041.0,This article has been accepted for inclusion i...,[This article has been accepted for inclusion ...,30
5,5,4839,939,79,1209.75,This article has been accepted for inclusion i...,[This article has been accepted for inclusion ...,35
6,6,4175,857,45,1043.75,This article has been accepted for inclusion i...,[This article has been accepted for inclusion ...,20
7,7,4358,839,51,1089.5,This article has been accepted for inclusion i...,[This article has been accepted for inclusion ...,28
8,8,5737,981,68,1434.25,This article has been accepted for inclusion i...,[This article has been accepted for inclusion ...,55
9,9,3377,538,31,844.25,This article has been accepted for inclusion i...,[This article has been accepted for inclusion ...,32


In [12]:
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,page_sentence_count_spacy
count,15.0,15.0,15.0,15.0,15.0,15.0
mean,7.0,5167.93,888.2,70.73,1291.98,56.07
std,4.47,1312.87,186.89,53.88,328.22,60.67
min,0.0,3377.0,538.0,31.0,844.25,20.0
25%,3.5,4199.0,784.5,41.5,1049.75,32.5
50%,7.0,4839.0,927.0,51.0,1209.75,36.0
75%,10.5,6164.0,982.0,73.5,1541.0,54.0
max,14.0,7785.0,1298.0,240.0,1946.25,268.0


## 5. Chunking our sentences together
Why do we do this?

1. Easier to manage similar sized chunks of text.
2. Don't overload the embedding models capacity for tokens (e.g. if an embedding model has a capacity of 384 tokens, there could be information loss if you try to embed a sequence of 400+ tokens).
3. Our LLM context window (the amount of tokens an LLM can take in) may be limited and requires compute power so we want to make sure we're using it as well as possible.

In [13]:
chunk_size = 100
def split_list(input_list: list[str],
               slice_size: int=chunk_size) -> list[list[str]]:
    """
    Splits the input_list into sublists of size slice_size (or as close as possible).

    For example, a list of 17 sentences would be split into two lists of [[10], [7]]
    """
    return [input_list[i:i+slice_size] for i in range(0, len(input_list), slice_size)]

for item in tqdm(pages_and_texts):
    item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                         slice_size=chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])

  0%|          | 0/15 [00:00<?, ?it/s]

In [14]:
random.sample(pages_and_texts,k=1)

[{'page_number': 9,
  'page_char_count': 3377,
  'page_word_count': 538,
  'page_sentence_count_raw': 31,
  'page_token_count': 844.25,
  'text': 'This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination. 10 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS Fig. 3. Experimental results of the compared methods on ﬁve UCI benchmark data sets. (a)–(e) CNAE9, Wine, Breast Tissue, Pendigits, and Connect-4 data sets, respectively. TABLE II COMPARISON OF VARIOUS METHODS ON ISOLET DATA SET. THE CLASSIFICATION ACCURACIES (MEAN ± STD) UNDER DIFFERENT LEVELS OF LABEL NOISE ARE PRESENTED. •/◦INDICATES THAT LNSI IS SIGNIFICANTLY BETTER/WORSE THAN THE CORRESPONDING METHOD (PAIRED t-TEST AT 95% CONFIDENCE LEVEL). THE BEST ACCURACY UNDER EACH LABEL NOISE LEVEL IS MARKED IN BOLD TABLE III COMPARISON OF VARIOUS METHODS ON COIL20 DATA SET. THE CLASSIFICATION ACCURACIES (MEAN ± STD.) UNDER DIFFERENT LEVELS OF LA

In [15]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,page_sentence_count_spacy,num_chunks
count,15.0,15.0,15.0,15.0,15.0,15.0,15.0
mean,7.0,5167.93,888.2,70.73,1291.98,56.07,1.13
std,4.47,1312.87,186.89,53.88,328.22,60.67,0.52
min,0.0,3377.0,538.0,31.0,844.25,20.0,1.0
25%,3.5,4199.0,784.5,41.5,1049.75,32.5,1.0
50%,7.0,4839.0,927.0,51.0,1209.75,36.0,1.0
75%,10.5,6164.0,982.0,73.5,1541.0,54.0,1.0
max,14.0,7785.0,1298.0,240.0,1946.25,268.0,3.0


## 6. Splitting each chunk into its own item


In [16]:
import re

pages_and_chunks = []
for item in tqdm(pages_and_texts):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]

        # Join the sentences together into a paragraph-like structure, aka a chunk (so they are a single string)
        joined_sentence_chunk = "".join(sentence_chunk).replace("  "," ").strip()
        joined_sentence_chunk = re.sub(r'\.(A-Z)', r'. \1', joined_sentence_chunk) # convert ".A"to ". A"(only for capital letter)
        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len([word for word in joined_sentence_chunk.split(" ")])
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4

        pages_and_chunks.append(chunk_dict)

len(pages_and_chunks)

  0%|          | 0/15 [00:00<?, ?it/s]

17

In [17]:
random.sample(pages_and_chunks, k=1)

[{'page_number': 1,
  'sentence_chunk': 'This article has been accepted for inclusion in a future issue of this journal.Content is final as presented, with the exception of pagination.2 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS Fig.1.Motivation illustration. (a) Four examples from two classes, among which the label of the 3rd example is incorrect. (b) Y is the corrupted label matrix, of which the rows represent the label vectors of four examples displayed in (a).By taking the example feature matrix X as side information, the observed label matrix Y can be ideally decomposed as the sum of a low-rank recovered label matrix T = X Z∗and a row-sparse matrix E. Note that the nonzero row in E exactly corresponds to the 3rd example with noisy label.task, which has been widely used in many machine learning ﬁelds such as clustering [8] and multi-label learning [9].For example, Zhao et al. [8] propose the matrix completion-based approach for multi-view clustering and ﬁrst introduc

In [18]:
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)

Unnamed: 0,page_number,chunk_char_count,chunk_word_count,chunk_token_count
count,17.0,17.0,17.0,17.0
mean,7.71,4517.41,741.29,1129.35
std,4.63,1417.67,231.79,354.42
min,0.0,1922.0,253.0,480.5
25%,4.0,3607.0,627.0,901.75
50%,8.0,4333.0,814.0,1083.25
75%,12.0,5685.0,909.0,1421.25
max,14.0,7014.0,1001.0,1753.5


In [19]:
 df.head()

Unnamed: 0,page_number,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count
0,0,This article has been accepted for inclusion i...,7014,1001,1753.5
1,1,This article has been accepted for inclusion i...,6134,926,1533.5
2,2,This article has been accepted for inclusion i...,6107,1000,1526.75
3,3,This article has been accepted for inclusion i...,4871,897,1217.75
4,4,This article has been accepted for inclusion i...,4136,805,1034.0


In [20]:
min_token_length = 30
# for row in df[df["chunk_token_count"] <= min_token_length].sample(5).iterrows():
#     print(f'Chunk token count : {row[1]["chunk_token_count"]} | Text: {row[1]["sentence_chunk"]}')

In [21]:
#filtering rows with token under 30
pages_and_chunks_over_min_token_len = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")
pages_and_chunks_over_min_token_len[:2]

[{'page_number': 0,
  'sentence_chunk': 'This article has been accepted for inclusion in a future issue of this journal.Content is final as presented, with the exception of pagination.IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 1 Harnessing Side Information for Classiﬁcation Under Label Noise Yang Wei , Chen Gong , Member, IEEE, Shuo Chen , Tongliang Liu , Member, IEEE, Jian Yang , Member, IEEE, and Dacheng Tao , Fellow, IEEE Abstract—Practical data sets often contain the label noise caused by various human factors or measurement errors, which means that a fraction of training examples might be mistakenly labeled.Such noisy labels will mislead the classiﬁer training and severely decrease the classiﬁcation performance.Existing approaches to handle this problem are usually developed through various surrogate loss functions under the framework of empiri- cal risk minimization.However, they are only suitable for binary classiﬁcation and also require strong prior knowledge.The

In [22]:
random.sample(pages_and_chunks_over_min_token_len, k=1)

[{'page_number': 13,
  'sentence_chunk': 'This article has been accepted for inclusion in a future issue of this journal.Content is final as presented, with the exception of pagination.14 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS It can be easily veriﬁed that the above problem (47) can be represented in the form of (40) by setting P = \x18B Z \x19 , Q = ⎡ ⎣ E J K ⎤ ⎦ (49) and AP = ⎡ ⎢⎢⎣ I O I O I O O I ⎤ ⎥⎥⎦, BQ = ⎡ ⎢⎢⎣ I O O O −X O O O −I O −I O ⎤ ⎥⎥⎦, C = ⎡ ⎢⎢⎣ Y O O O ⎤ ⎥⎥⎦ (50) where I and O are the identity matrices and zero matrices with proper sizes, respectively.The functions f (P) and g( Q) in (40) can be, respectively, expressed as f (P) = ∥Z∥∗+ λ1∥Z∥2 F (51) g( Q) = λ2tr((X J)⊤L(X J)) + λ3∥E∥2,1 + IC(K). (52) The unaugmented Lagrangian is formulated as L0 = ∥Z∥∗+λ1∥Z∥2 F +λ2tr((X J)⊤L(X J))+λ3∥E∥2,1 + tr(M⊤ 1 (Y −B −E)) + tr\x03M⊤ 2 (B −X J)\x04 + tr \x03 M⊤ 3 (Z −J) \x04 . (53) Obviously, both f (P) and g( Q) are closed, proper, and convex, and the unaugment

## 7. Embedding our text chunks

Embeddings of text will mean that similar meaning texts have similar numerical representation.


Our goal is to turn each of our chunks into a numerical representation (an embedding vector, where a vector is a sequence of numbers arranged in order).

We'll use our computers to find patterns in the embeddings and then we can use their text mappings to further our understanding.

We'll use the [`sentence-transformers`](https://www.sbert.net/docs/installation.html) library which contains many pre-trained embedding models.

Specifically, we'll get the `all-mpnet-base-v2` model ( [Hugging Face model card](https://huggingface.co/sentence-transformers/all-mpnet-base-v2#intended-uses)).

In [23]:
!pip install sentence-transformers # for embedding models

Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch>=1.11.0->sentence-transformers)
 

In [24]:
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2",
                                      device="cuda")


for item in tqdm(pages_and_chunks_over_min_token_len):
    item["embedding"] = embedding_model.encode(item["sentence_chunk"])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.4k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

  0%|          | 0/17 [00:00<?, ?it/s]

## 8. Embeddings model

In [25]:
from sentence_transformers import SentenceTransformer
from tqdm import tqdm

# Initialize the model
embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2", device="cuda")

# Function to split a chunk into words and encode each word
def get_word_embeddings(text):
    words = text.split()  # Split text into individual words
    embeddings = embedding_model.encode(words)  # Get embedding for each word
    return words, embeddings

# Process each chunk in your dataset
for item in tqdm(pages_and_chunks_over_min_token_len):
    words, word_embeddings = get_word_embeddings(item["sentence_chunk"])
    item["words"] = words
    item["word_embeddings"] = word_embeddings

# Check shape of word embeddings for the first chunk
print(len(pages_and_chunks_over_min_token_len[0]["word_embeddings"]),
      pages_and_chunks_over_min_token_len[0]["word_embeddings"][0].shape)

100%|██████████| 17/17 [00:07<00:00,  2.42it/s]

1001 (768,)





### I. Individual word embeddings

In [26]:
pages_and_chunks_over_min_token_len[0]["word_embeddings"]

array([[-0.02399354,  0.02436408, -0.01130732, ...,  0.04851779,
        -0.03059575, -0.06718535],
       [ 0.02872143,  0.06235189, -0.00240356, ..., -0.02024206,
        -0.06902064,  0.01966141],
       [-0.02656304,  0.04583179, -0.00301339, ...,  0.03154208,
        -0.06035382, -0.00432453],
       ...,
       [-0.02972006, -0.00399144,  0.00966241, ...,  0.01339654,
        -0.02797396,  0.00976396],
       [ 0.05336843,  0.08734006, -0.00663722, ...,  0.05458362,
         0.01032661,  0.00404617],
       [ 0.00064343,  0.02722497, -0.03459514, ..., -0.03032272,
        -0.03990483, -0.01313225]], dtype=float32)

### II. Document embedding based on TF-IDF

In [27]:
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from tqdm import tqdm
import numpy as np

# Initialize the SentenceTransformer model
embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2", device="cuda")

# Extract all sentence chunks to build the TF-IDF vocabulary
corpus = [item["sentence_chunk"] for item in pages_and_chunks_over_min_token_len]

# Fit the TF-IDF vectorizer on the corpus
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)

# Function to get embeddings for each word in a chunk
def get_word_embeddings(text):
    words = text.split()  # Split text into individual words
    embeddings = embedding_model.encode(words)  # Get 768-dim embedding for each word
    return words, embeddings

# Function to compute the TF-IDF weighted document embedding
def compute_tfidf_weighted_embedding(words, word_embeddings):
    # Get TF-IDF scores for the words in the current chunk
    tfidf_scores = vectorizer.transform([" ".join(words)]).toarray()[0]
    word_to_tfidf = dict(zip(vectorizer.get_feature_names_out(), tfidf_scores))

    # Initialize the document embedding with zeros
    document_embedding = np.zeros(word_embeddings[0].shape)  # Shape: (768,)
    total_weight = 0  # To normalize by total weight

    # Compute the weighted sum of word embeddings
    for i, word in enumerate(words):
        if word.lower() in word_to_tfidf:  # Match word with TF-IDF vocab
            weight = word_to_tfidf[word.lower()]
            document_embedding += weight * word_embeddings[i]
            total_weight += weight

    # Normalize the document embedding by total weight
    if total_weight > 0:
        document_embedding /= total_weight

    return document_embedding

# Process each chunk in your dataset and store results back into the dictionaries
for item in tqdm(pages_and_chunks_over_min_token_len):
    # Get words and word embeddings
    words, word_embeddings = get_word_embeddings(item["sentence_chunk"])
    item["words"] = words
    item["word_embeddings"] = word_embeddings

    # Compute the TF-IDF weighted document embedding
    item["document_embedding"] = compute_tfidf_weighted_embedding(words, word_embeddings)

# Check the shape of word embeddings for the first chunk
first_chunk = pages_and_chunks_over_min_token_len[0]
print(f"Number of word embeddings: {len(first_chunk['word_embeddings'])}")
print(f"Shape of first word embedding: {first_chunk['word_embeddings'][0].shape}")
print(f"Shape of document embedding: {first_chunk['document_embedding'].shape}")

# Optional: Display a sample chunk's words and embeddings (first 5 words)
print(f"First 5 words: {first_chunk['words']}")
print(f"First word embedding (sample): {first_chunk['word_embeddings'][0]}")  # Print first 5 values for readability


100%|██████████| 17/17 [00:06<00:00,  2.70it/s]

Number of word embeddings: 1001
Shape of first word embedding: (768,)
Shape of document embedding: (768,)
First 5 words: ['This', 'article', 'has', 'been', 'accepted', 'for', 'inclusion', 'in', 'a', 'future', 'issue', 'of', 'this', 'journal.Content', 'is', 'final', 'as', 'presented,', 'with', 'the', 'exception', 'of', 'pagination.IEEE', 'TRANSACTIONS', 'ON', 'NEURAL', 'NETWORKS', 'AND', 'LEARNING', 'SYSTEMS', '1', 'Harnessing', 'Side', 'Information', 'for', 'Classiﬁcation', 'Under', 'Label', 'Noise', 'Yang', 'Wei', ',', 'Chen', 'Gong', ',', 'Member,', 'IEEE,', 'Shuo', 'Chen', ',', 'Tongliang', 'Liu', ',', 'Member,', 'IEEE,', 'Jian', 'Yang', ',', 'Member,', 'IEEE,', 'and', 'Dacheng', 'Tao', ',', 'Fellow,', 'IEEE', 'Abstract—Practical', 'data', 'sets', 'often', 'contain', 'the', 'label', 'noise', 'caused', 'by', 'various', 'human', 'factors', 'or', 'measurement', 'errors,', 'which', 'means', 'that', 'a', 'fraction', 'of', 'training', 'examples', 'might', 'be', 'mistakenly', 'labeled.Such




In [28]:
first_chunk['document_embedding'].shape

(768,)

In [29]:
pages_and_chunks_over_min_token_len[0]["embedding"].shape


(768,)

In [30]:
pages_and_chunks_over_min_token_len[0]["embedding"].shape

(768,)

Our embedding has a shape of `(768,)` meaning it's a vector of 768 numbers which represent our text in high-dimensional space.

In [31]:
# Turn text chunks into a single list
text_chunks = [item["sentence_chunk"] for item in pages_and_chunks_over_min_token_len]
text_chunks[9]

'This article has been accepted for inclusion in a future issue of this journal.Content is final as presented, with the exception of pagination.10 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS Fig.3.Experimental results of the compared methods on ﬁve UCI benchmark data sets. (a)–(e) CNAE9, Wine, Breast Tissue, Pendigits, and Connect-4 data sets, respectively.TABLE II COMPARISON OF VARIOUS METHODS ON ISOLET DATA SET.THE CLASSIFICATION ACCURACIES (MEAN ± STD) UNDER DIFFERENT LEVELS OF LABEL NOISE ARE PRESENTED. •/◦INDICATES THAT LNSI IS SIGNIFICANTLY BETTER/WORSE THAN THE CORRESPONDING METHOD (PAIRED t-TEST AT 95% CONFIDENCE LEVEL).THE BEST ACCURACY UNDER EACH LABEL NOISE LEVEL IS MARKED IN BOLD TABLE III COMPARISON OF VARIOUS METHODS ON COIL20 DATA SET.THE CLASSIFICATION ACCURACIES (MEAN ± STD.)UNDER DIFFERENT LEVELS OF LABEL NOISE ARE PRESENTED. •/◦INDICATES THAT LNSI IS SIGNIFICANTLY BETTER/WORSE THAN THE CORRESPONDING METHOD (PAIRED t-TEST AT 95% CONFIDENCE LEVEL).THE BES

In [32]:
len(text_chunks)

17

In [33]:
len(text_chunks)

17

In [34]:
text_chunk_embeddings = embedding_model.encode(text_chunks,
                                               batch_size=16, # Embed all texts in batches
                                               convert_to_tensor=True)
text_chunk_embeddings[0]

tensor([ 2.2124e-02,  3.8127e-02, -9.9315e-03,  6.4407e-02,  3.0248e-02,
         4.1535e-02,  2.1948e-02,  2.1881e-02,  1.9086e-02, -1.8840e-02,
        -3.5388e-02,  3.7817e-03,  1.2992e-02, -2.3954e-02,  5.4627e-02,
        -9.0447e-03,  1.8339e-02, -1.5943e-02,  1.2163e-02,  1.4307e-03,
        -4.9970e-03, -4.1084e-02, -1.6222e-02,  6.2714e-03, -7.0415e-02,
        -2.5558e-04,  7.2844e-03, -2.6608e-03,  6.3036e-03, -1.0456e-02,
         8.2130e-02,  5.9045e-03, -7.9108e-03,  3.5973e-02,  2.4091e-06,
        -7.1029e-03, -1.4681e-02, -5.4000e-03, -5.1102e-03,  3.7531e-03,
        -1.4898e-02,  1.9143e-02, -2.5063e-02, -1.1821e-02,  1.5855e-02,
        -6.5839e-02,  5.9678e-02,  1.0835e-01, -2.5654e-03,  1.7020e-02,
        -8.8616e-03, -1.1839e-02, -7.5419e-03, -1.8383e-02,  2.0384e-02,
        -7.2165e-02,  6.7357e-02, -7.8544e-02, -1.0627e-01, -3.1834e-03,
         2.2758e-02,  3.4900e-02,  2.0432e-02,  3.1416e-02,  1.5606e-02,
         4.6315e-02,  9.2130e-03, -1.2025e-02, -1.4

## 9. Df showing everything

In [35]:
#Saving embedding to file
text_chunks_and_embeddings_df = pd.DataFrame(pages_and_chunks_over_min_token_len)
save_path = "text_chunks_and_embeddings_df.csv"
text_chunks_and_embeddings_df.to_csv(save_path, index=False)

In [36]:
# Import saved file and view
text_chunks_and_embeddings_df_load = pd.read_csv(save_path)
text_chunks_and_embeddings_df_load.head()

Unnamed: 0,page_number,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count,embedding,words,word_embeddings,document_embedding
0,0,This article has been accepted for inclusion i...,7014,1001,1753.5,[ 2.21239850e-02 3.81273478e-02 -9.93154477e-...,"['This', 'article', 'has', 'been', 'accepted',...",[[-0.02399354 0.02436408 -0.01130732 ... 0.0...,[ 1.83159053e-02 3.61173917e-02 -2.62305447e-...
1,1,This article has been accepted for inclusion i...,6134,926,1533.5,[ 2.10550893e-02 1.89801585e-02 5.71878441e-...,"['This', 'article', 'has', 'been', 'accepted',...",[[-0.02399348 0.02436411 -0.01130732 ... 0.0...,[ 2.12759182e-02 3.14661646e-02 -2.46859062e-...
2,2,This article has been accepted for inclusion i...,6107,1000,1526.75,[ 2.32288763e-02 -3.22096795e-02 -1.31214494e-...,"['This', 'article', 'has', 'been', 'accepted',...",[[-0.02399354 0.02436408 -0.01130732 ... 0.0...,[ 1.69693984e-02 2.67011831e-02 -2.58474208e-...
3,3,This article has been accepted for inclusion i...,4871,897,1217.75,[ 1.74471941e-02 1.00928722e-02 -8.88381561e-...,"['This', 'article', 'has', 'been', 'accepted',...",[[-0.02399354 0.02436408 -0.01130732 ... 0.0...,[ 1.75494336e-02 2.44253294e-02 -2.38905113e-...
4,4,This article has been accepted for inclusion i...,4136,805,1034.0,[-1.78029444e-02 8.12479015e-03 -1.64016578e-...,"['This', 'article', 'has', 'been', 'accepted',...",[[-0.02399354 0.02436408 -0.01130732 ... 0.0...,[ 1.73777745e-02 2.62514099e-02 -2.57682228e-...


## 10. RAG - Search and Answer

### Similarity search
Similarity search or semantic search or vector search is the idea of searching on *semantic*.

With keyword search, you are trying to match the string "apple" with the string "apple".

Whereas with similarity/semantic search, you may want to search "macronutrients functions".
And get back results that don't necessarily contain the words "macronutrients functions" but get back pieces of text that match that meaning.


In [37]:
import numpy as np

In [38]:
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

text_chunks_and_embedding_df = pd.read_csv(save_path)
#convert embedding to array (it got converted to string when it saved)
text_chunks_and_embedding_df["embedding"] = text_chunks_and_embedding_df["embedding"].apply(lambda x: np.fromstring(x.strip("[]"), sep=" "))

#converting embedding into torch tensor
embeddings = torch.tensor(np.stack(text_chunks_and_embedding_df["embedding"].tolist(), axis=0), dtype=torch.float32).to(device)
# Convert texts and embedding df to list of dicts
pages_and_chunks = text = text_chunks_and_embedding_df.to_dict(orient="records")

text_chunks_and_embeddings_df

Unnamed: 0,page_number,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count,embedding,words,word_embeddings,document_embedding
0,0,This article has been accepted for inclusion i...,7014,1001,1753.5,"[0.022123985, 0.038127348, -0.009931545, 0.064...","[This, article, has, been, accepted, for, incl...","[[-0.02399354, 0.024364082, -0.011307317, -0.0...","[0.01831590532765253, 0.036117391744675084, -0..."
1,1,This article has been accepted for inclusion i...,6134,926,1533.5,"[0.02105509, 0.018980158, 0.0057187844, 0.0361...","[This, article, has, been, accepted, for, incl...","[[-0.023993477, 0.024364106, -0.011307317, -0....","[0.021275918194007582, 0.03146616457338843, -0..."
2,2,This article has been accepted for inclusion i...,6107,1000,1526.75,"[0.023228876, -0.03220968, -0.013121449, 0.025...","[This, article, has, been, accepted, for, incl...","[[-0.02399354, 0.024364082, -0.011307317, -0.0...","[0.016969398394478568, 0.026701183117873787, -..."
3,3,This article has been accepted for inclusion i...,4871,897,1217.75,"[0.017447194, 0.010092872, -0.00088838156, 0.0...","[This, article, has, been, accepted, for, incl...","[[-0.02399354, 0.024364082, -0.011307317, -0.0...","[0.01754943361979846, 0.0244253293871066, -0.0..."
4,4,This article has been accepted for inclusion i...,4136,805,1034.0,"[-0.017802944, 0.00812479, -0.016401658, 0.038...","[This, article, has, been, accepted, for, incl...","[[-0.02399354, 0.024364082, -0.011307317, -0.0...","[0.017377774461851764, 0.026251409941161014, -..."
5,5,This article has been accepted for inclusion i...,4805,905,1201.25,"[-0.005026503, -0.016304526, -0.013110023, 0.0...","[This, article, has, been, accepted, for, incl...","[[-0.023993539, 0.024364077, -0.0113073345, -0...","[0.020575316407115806, 0.02532501490578474, -0..."
6,6,This article has been accepted for inclusion i...,4160,842,1040.0,"[-0.030130707, 0.0073698177, -0.006446064, 0.0...","[This, article, has, been, accepted, for, incl...","[[-0.02399354, 0.024364082, -0.011307317, -0.0...","[0.011563947027782798, 0.012915680992572594, -..."
7,7,This article has been accepted for inclusion i...,4333,814,1083.25,"[-0.06060846, 0.030870685, -0.023983289, 0.024...","[This, article, has, been, accepted, for, incl...","[[-0.02399354, 0.024364082, -0.011307317, -0.0...","[0.012161504941071858, 0.0393414103283349, -0...."
8,8,This article has been accepted for inclusion i...,5685,929,1421.25,"[0.010136956, 0.0011052894, -0.008268423, 0.03...","[This, article, has, been, accepted, for, incl...","[[-0.02399354, 0.024364082, -0.011307317, -0.0...","[0.01592404765243526, 0.028752114606936725, -0..."
9,9,This article has been accepted for inclusion i...,3349,510,837.25,"[0.0014660318, 0.0242222, -0.009057149, 0.0415...","[This, article, has, been, accepted, for, incl...","[[-0.02399354, 0.024364082, -0.011307317, -0.0...","[0.014890766357942134, 0.03084731397283874, -0..."


In [39]:
import torch
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

text_chunks_and_embedding_df = pd.read_csv(save_path)
#convert embedding to array (it got converted to string when it saved)
text_chunks_and_embedding_df["embedding"] = text_chunks_and_embedding_df["embedding"].apply(lambda x: np.fromstring(x.strip("[]"), sep=" "))
text_chunks_and_embedding_df["document_embedding"] = text_chunks_and_embedding_df["document_embedding"].apply(lambda x: np.fromstring(x.strip("[]"), sep=" "))

embeddings = torch.tensor(np.stack(text_chunks_and_embedding_df["embedding"].tolist(), axis=0), dtype=torch.float32).to(device)
doc_embedings = torch.tensor(np.stack(text_chunks_and_embedding_df["document_embedding"].tolist(), axis=0), dtype=torch.float32).to(device)
# Convert texts and embedding df to list of dicts
pages_and_chunks = text = text_chunks_and_embedding_df.to_dict(orient="records")

text_chunks_and_embeddings_df



Unnamed: 0,page_number,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count,embedding,words,word_embeddings,document_embedding
0,0,This article has been accepted for inclusion i...,7014,1001,1753.5,"[0.022123985, 0.038127348, -0.009931545, 0.064...","[This, article, has, been, accepted, for, incl...","[[-0.02399354, 0.024364082, -0.011307317, -0.0...","[0.01831590532765253, 0.036117391744675084, -0..."
1,1,This article has been accepted for inclusion i...,6134,926,1533.5,"[0.02105509, 0.018980158, 0.0057187844, 0.0361...","[This, article, has, been, accepted, for, incl...","[[-0.023993477, 0.024364106, -0.011307317, -0....","[0.021275918194007582, 0.03146616457338843, -0..."
2,2,This article has been accepted for inclusion i...,6107,1000,1526.75,"[0.023228876, -0.03220968, -0.013121449, 0.025...","[This, article, has, been, accepted, for, incl...","[[-0.02399354, 0.024364082, -0.011307317, -0.0...","[0.016969398394478568, 0.026701183117873787, -..."
3,3,This article has been accepted for inclusion i...,4871,897,1217.75,"[0.017447194, 0.010092872, -0.00088838156, 0.0...","[This, article, has, been, accepted, for, incl...","[[-0.02399354, 0.024364082, -0.011307317, -0.0...","[0.01754943361979846, 0.0244253293871066, -0.0..."
4,4,This article has been accepted for inclusion i...,4136,805,1034.0,"[-0.017802944, 0.00812479, -0.016401658, 0.038...","[This, article, has, been, accepted, for, incl...","[[-0.02399354, 0.024364082, -0.011307317, -0.0...","[0.017377774461851764, 0.026251409941161014, -..."
5,5,This article has been accepted for inclusion i...,4805,905,1201.25,"[-0.005026503, -0.016304526, -0.013110023, 0.0...","[This, article, has, been, accepted, for, incl...","[[-0.023993539, 0.024364077, -0.0113073345, -0...","[0.020575316407115806, 0.02532501490578474, -0..."
6,6,This article has been accepted for inclusion i...,4160,842,1040.0,"[-0.030130707, 0.0073698177, -0.006446064, 0.0...","[This, article, has, been, accepted, for, incl...","[[-0.02399354, 0.024364082, -0.011307317, -0.0...","[0.011563947027782798, 0.012915680992572594, -..."
7,7,This article has been accepted for inclusion i...,4333,814,1083.25,"[-0.06060846, 0.030870685, -0.023983289, 0.024...","[This, article, has, been, accepted, for, incl...","[[-0.02399354, 0.024364082, -0.011307317, -0.0...","[0.012161504941071858, 0.0393414103283349, -0...."
8,8,This article has been accepted for inclusion i...,5685,929,1421.25,"[0.010136956, 0.0011052894, -0.008268423, 0.03...","[This, article, has, been, accepted, for, incl...","[[-0.02399354, 0.024364082, -0.011307317, -0.0...","[0.01592404765243526, 0.028752114606936725, -0..."
9,9,This article has been accepted for inclusion i...,3349,510,837.25,"[0.0014660318, 0.0242222, -0.009057149, 0.0415...","[This, article, has, been, accepted, for, incl...","[[-0.02399354, 0.024364082, -0.011307317, -0.0...","[0.014890766357942134, 0.03084731397283874, -0..."


In [40]:
embeddings.shape

torch.Size([17, 768])

Retrival is done by following steps:
1. Define a query string.
2. Turn the query string in an embedding with same model we used to embed our text chunks.
3. Perform a [dot product](https://pytorch.org/docs/stable/generated/torch.dot.html) or [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) function between the text embeddings and the query embedding to get similarity scores.
4. Sort the results from step 3 in descending order (a higher score means more similarity in the eyes of the model) and use these values to inspect the texts.

In [41]:
embeddings.shape

torch.Size([17, 768])

## 11. Similarity search

In [42]:
from sentence_transformers import util

query = "Tell me about the roles of E matrix in the whole optimization"
print(f"Query : {query}")

query_embedding = embedding_model.encode(query, convert_to_tensor=True).to("cuda")

dot_scores = util.dot_score(query_embedding, embeddings)[0]

top_results = torch.topk(dot_scores, k=17)
top_results

Query : Tell me about the roles of E matrix in the whole optimization


torch.return_types.topk(
values=tensor([0.4298, 0.3959, 0.3726, 0.3018, 0.2987, 0.2934, 0.2934, 0.2571, 0.2393,
        0.2341, 0.2130, 0.1988, 0.1617, 0.1565, 0.1109, 0.0991, 0.0902],
       device='cuda:0'),
indices=tensor([ 4,  5, 13, 16, 15,  7,  3, 11,  8,  2, 12,  1,  6,  0,  9, 10, 14],
       device='cuda:0'))

In [46]:
word_embeddings_tensor = torch.tensor(word_embeddings).to("cuda")

In [47]:
from sentence_transformers import util

query = "Tell me about the roles of E matrix in the whole optimization"
print(f"Query : {query}")

query_embedding = embedding_model.encode(query, convert_to_tensor=True).to("cuda")

dot_scores = util.dot_score(query_embedding, word_embeddings_tensor)[0]

top_results1 = torch.topk(dot_scores, k=17)
top_results1

Query : Tell me about the roles of E matrix in the whole optimization


torch.return_types.topk(
values=tensor([0.4697, 0.3270, 0.2985, 0.2804, 0.2782, 0.2579, 0.2574, 0.2476, 0.2425,
        0.2391, 0.2360, 0.2254, 0.2157, 0.2145, 0.2065, 0.2026, 0.1973],
       device='cuda:0'),
indices=tensor([209,  93,  76, 201,  75,  88,  95, 171, 221, 107,  50, 169, 750,  23,
         78, 630, 219], device='cuda:0'))

In [48]:
for score, idx in zip(top_results[0], top_results[1]):
    print(f"Score: {score:.4f}")
    print("Text")
    print(pages_and_chunks[idx]["sentence_chunk"])
    print("\n\n")


Score: 0.4298
Text
This article has been accepted for inclusion in a future issue of this journal.Content is final as presented, with the exception of pagination.WEI et al.:HARNESSING SIDE INFORMATION FOR CLASSIFICATION UNDER LABEL NOISE 5 According to [36], the closed-form solution to (9) can be expressed as Z = Udiag(max{ii −τ, 0})V ⊤ ∀i =1, 2, . . . ,min(d, c) (10) where U and V are obtained by conducting the singular value decomposition (SVD) on ˆT (i.e., ˆT = UV ⊤), and ii is the ith diagonal element of the singular value matrix . Update E: By dropping the unrelated terms to E in (8), the subproblem of E is min E λ3∥E∥2,1+tr  M⊤ 1 (Y −B−E)  + μ 2 ∥Y −B−E∥2 F ⇒min E λ3∥E∥2,1 −trM⊤ 1 E + μ 2 tr(E⊤E −2(Y −B)⊤E) ⇒min E λ3∥E∥2,1 + μ 2 tr  E⊤E −2  Y −B + 1 μ M1 ⊤ E 	 ⇒min E λ3 μ ∥E∥2,1 + 1 2 E −  Y −B + M1 μ  2 F ⇒min E η∥E∥2,1 + 1 2∥E − M∥2 F (11) where M = Y −B + (M1/μ) and η = (λ3/μ).Herein, the closed-form solution to the general optimization problem related to 

In [49]:
# Display top results with proper tensor conversion
print("Top results:")
for score, idx in zip(top_results.values, top_results.indices):
    # Convert tensor index to a standard integer
    idx = idx.item()  # Convert to integer

    # Access the correct sentence chunk from your DataFrame
    print(f"Score: {score:.4f}")
    print("Text:")
    print(df["sentence_chunk"].iloc[idx])  # Use .iloc to avoid potential index mismatches
    print("\n\n")


Top results:
Score: 0.4298
Text:
This article has been accepted for inclusion in a future issue of this journal.Content is final as presented, with the exception of pagination.WEI et al.:HARNESSING SIDE INFORMATION FOR CLASSIFICATION UNDER LABEL NOISE 5 According to [36], the closed-form solution to (9) can be expressed as Z = Udiag(max{ii −τ, 0})V ⊤ ∀i =1, 2, . . . ,min(d, c) (10) where U and V are obtained by conducting the singular value decomposition (SVD) on ˆT (i.e., ˆT = UV ⊤), and ii is the ith diagonal element of the singular value matrix . Update E: By dropping the unrelated terms to E in (8), the subproblem of E is min E λ3∥E∥2,1+tr  M⊤ 1 (Y −B−E)  + μ 2 ∥Y −B−E∥2 F ⇒min E λ3∥E∥2,1 −trM⊤ 1 E + μ 2 tr(E⊤E −2(Y −B)⊤E) ⇒min E λ3∥E∥2,1 + μ 2 tr  E⊤E −2  Y −B + 1 μ M1 ⊤ E 	 ⇒min E λ3 μ ∥E∥2,1 + 1 2 E −  Y −B + M1 μ  2 F ⇒min E η∥E∥2,1 + 1 2∥E − M∥2 F (11) where M = Y −B + (M1/μ) and η = (λ3/μ).Herein, the closed-form solution to the general optimization probl

## 12. FUnction for the same

In [57]:
def retrieve_relevant_resources1(query: str, n_resources_to_return: int=17):
    """
    Embeds a query with model and returns top k scores and indices from embeddings.
    """
    query_embedding = embedding_model.encode(query, convert_to_tensor=True).to("cuda")

    dot_scores = util.dot_score(query_embedding, embeddings)[0]

    scores, indices = torch.topk(dot_scores, k=n_resources_to_return)

    return scores, indices

def retrieve_relevant_resources2(query: str, n_resources_to_return: int=17):
    """
    Embeds a query with model and returns top k scores and indices from embeddings.
    """
    query_embedding = embedding_model.encode(query, convert_to_tensor=True).to("cuda")

    dot_scores = util.dot_score(query_embedding, word_embeddings_tensor)[0]

    scores, indices = torch.topk(dot_scores, k=n_resources_to_return)

    return scores, indices


In [58]:
scores,index_embeddings = retrieve_relevant_resources1(query)
scores,index_embeddings

(tensor([0.4298, 0.3959, 0.3726, 0.3018, 0.2987, 0.2934, 0.2934, 0.2571, 0.2393,
         0.2341, 0.2130, 0.1988, 0.1617, 0.1565, 0.1109, 0.0991, 0.0902],
        device='cuda:0'),
 tensor([ 4,  5, 13, 16, 15,  7,  3, 11,  8,  2, 12,  1,  6,  0,  9, 10, 14],
        device='cuda:0'))

### Clearly the vector at 4th index has the best retrievel but can we get the best output but there are other vectors containing some similarity score which might be of relevance to answer the query but if we take all this vectors then the context length of the LLM might be exhausted

***Thus can we do some weighted sum of all this document to make it happen***

In [59]:
scores1,index_embeddings1 = retrieve_relevant_resources2(query)
scores1,index_embeddings1

(tensor([0.4697, 0.3270, 0.2985, 0.2804, 0.2782, 0.2579, 0.2574, 0.2476, 0.2425,
         0.2391, 0.2360, 0.2254, 0.2157, 0.2145, 0.2065, 0.2026, 0.1973],
        device='cuda:0'),
 tensor([209,  93,  76, 201,  75,  88,  95, 171, 221, 107,  50, 169, 750,  23,
          78, 630, 219], device='cuda:0'))

In [60]:
def print_top_results_and_scores(query: str, n_resources_to_return: int=5):
    """
    Takes a query, retrieves most relevant resources and prints them out in descending order.
    """
    scores, indices = retrieve_relevant_resources(query, n_resources_to_return=n_resources_to_return)
    for score, idx in zip(scores, indices):
        print(f"Score: {score:.4f}")
        print("Text")
        print(pages_and_chunks[idx]["sentence_chunk"])
        print("\n\n")

In [62]:
# print_top_results_and_scores(query)

# Installing Gemma-2b
We will be using Gemma_instruct_2b for this.

Combined Embeddings