## Environment Preparation

In [1]:
# Install packages
!pip install requests pandas scikit-learn jupyter





In [1]:
import pandas as pd
import requests

In [2]:
docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'

docs_response = requests.get(docs_url)
docs_raw = docs_response.json()
# print(docs_raw)

In [3]:
documents = []
for course in docs_raw:
    for docs in course['documents']:
        # print(docs)
        docs['course'] = course['course']
        documents.append(docs)
print(documents[0])

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.", 'section': 'General course-related questions', 'question': 'Course - When will the course start?', 'course': 'data-engineering-zoomcamp'}


In [4]:
# Create a pandas df
import pandas as pd

df = pd.DataFrame(documents,columns=['course','section','question','text'])
df.head()

| | course | section | question | text |
|---|---|---|---|---|
| 0 | data-engineering-zoomcamp | General course-related questions | Course - When will the course start? | The purpose of this document is to capture fre... |
| 1 | data-engineering-zoomcamp | General course-related questions | Course - What are the prerequisites for this c... | GitHub - DataTalksClub data-engineering-zoomca... |
| 2 | data-engineering-zoomcamp | General course-related questions | Course - Can I still join the course after the... | Yes, even if you don't register, you're still ... |
| 3 | data-engineering-zoomcamp | General course-related questions | Course - I have registered for the Data Engine... | You don't need it. You're accepted. You can al... |
| 4 | data-engineering-zoomcamp | General course-related questions | Course - What can I do before the course starts? | You can start by installing and setting up all... |


In [5]:
df.course.unique()

array(['data-engineering-zoomcamp', 'machine-learning-zoomcamp',
       'mlops-zoomcamp'], dtype=object)

## Text Search 
- Text search is a form of information retrieval: finding the documents in a large dataset that are most relevant to a query.
- **vector space** - A mathematical representation where text is converted into vectors (points in space) allowing for quantitative comparison.
- **Bag of Words** - A simple text representation model treating each document as a collection of words disregarding grammar and word order but keeping multiplicity.
- **TF-IDF** (Term Frequency-Inverse Document Frequency) - A statistical measure used to evaluate how important a word is to a document in a collection or corpus. It increases with the number of times a word appears in the document but is offset by the frequency of the word in the corpus.
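
To make the vector space and Bag of Words ideas concrete, here is a minimal sketch in plain Python. The two toy documents and the helper name `to_vector` are made up for illustration; the notebook itself relies on scikit-learn for this step.

```python
from collections import Counter

docs = [
    "the course starts in january",
    "the course the homework the project",
]

# Shared vocabulary: every distinct word in the toy corpus, sorted for a stable order.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def to_vector(doc):
    """Map a document to a count vector aligned with `vocabulary`."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [to_vector(doc) for doc in docs]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each document becomes a point in a space with one dimension per vocabulary word. The repeated word "the" in the second document shows up as a count of 3, which is the multiplicity that Bag of Words keeps.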

In [6]:
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer


In [7]:
cv = CountVectorizer()
cv.fit(df.text)

In [10]:
cv.get_feature_names_out()

array(['00', '00000000e', '0002', ..., '要了解键盘快捷键', '要启用屏幕阅读器支持', '请按ctrl'],
      dtype=object)

In [11]:
doc_text_samples = [
    "Course starts on 15th Jan 2024",
    "Prerequisites listed on GitHub",
    "Submit homeworks after start date",
    "Registration not required for participation",
    "Setup Google Cloud and Python before course"
]

 - Creating the bag of words

#### CountVectorizer:
**What it does**: Converts text data into numerical vectors (matrices) based on word frequencies.

**How it works**:
- Builds a vocabulary from the entire corpus (collection of documents).
- Represents each document as a vector where each element corresponds to the count of a specific word in that document.
- Ignores the importance of words across the entire corpus.

**Use case**:
- Simple bag-of-words representation for text classification or clustering.
- Useful when word frequency matters more than context.

In [14]:
cv = CountVectorizer(stop_words='english')
x = cv.fit_transform(doc_text_samples)

names = cv.get_feature_names_out()

In [18]:
# # Investigating the vectorized words - Sparse Matrix
# x = cv.transform(doc_text_samples)
# x.todense()
# pd.DataFrame(x.todense(),columns = cv.get_feature_names_out()).T

In [18]:
cv = CountVectorizer(stop_words='english')
X = cv.fit_transform(doc_text_samples)

names = cv.get_feature_names_out()

df_docs = pd.DataFrame(X.toarray(), columns=names).T
df_docs

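| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 15th | 1 | 0 | 0 | 0 | 0 |
| 2024 | 1 | 0 | 0 | 0 | 0 |
| cloud | 0 | 0 | 0 | 0 | 1 |
| course | 1 | 0 | 0 | 0 | 1 |
| date | 0 | 0 | 1 | 0 | 0 |
| github | 0 | 1 | 0 | 0 | 0 |
| google | 0 | 0 | 0 | 0 | 1 |
| homeworks | 0 | 0 | 1 | 0 | 0 |
| jan | 1 | 0 | 0 | 0 | 0 |
| listed | 0 | 1 | 0 | 0 | 0 |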


#### TF-IDF Vectorizer (Term Frequency-Inverse Document Frequency):
**What it does**: Considers both term frequency and inverse document frequency to create vectors.

**How it works**:
- Computes term frequency (TF) for each word in a document (similar to CountVectorizer).
- Also calculates inverse document frequency (IDF) to penalize common words across all documents.
- Combines TF and IDF to create a weighted vector for each document.

**Use case**:
- Better captures the importance of words within a specific document.
- Useful for information retrieval, search engines, and text summarization.
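
The TF and IDF steps described above can be sketched by hand. This is a simplified textbook version, not scikit-learn's exact formula: by default `TfidfVectorizer` uses a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and L2-normalizes each row, so its numbers will differ. The toy documents and the helper name `tf_idf` are assumptions made for illustration.

```python
import math
from collections import Counter

docs = [
    "course starts in january",
    "course prerequisites on github",
    "course homework deadline",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each word appears.
df_counts = Counter(word for tokens in tokenized for word in set(tokens))

def tf_idf(tokens):
    """Textbook TF-IDF: (count / document length) * log(N / document frequency)."""
    counts = Counter(tokens)
    return {
        word: (count / len(tokens)) * math.log(n_docs / df_counts[word])
        for word, count in counts.items()
    }

weights = tf_idf(tokenized[0])
print(weights)
```

"course" appears in every document, so its IDF is `log(3 / 3) = 0` and it contributes nothing, while "january" appears in only one document and receives a positive weight.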

In [28]:
# TfidfVectorizer - weights each word by its TF-IDF score instead of a raw count

# Vectorizer
cv = TfidfVectorizer(stop_words='english')

# Document-term matrix (sparse): one row per document, one column per word
X = cv.fit_transform(df.text)

names = cv.get_feature_names_out()

df_docs = pd.DataFrame(X.toarray(), columns=names)
df_docs.round(2)

<948x6461 sparse matrix of type '<class 'numpy.float64'>'
	with 31723 stored elements in Compressed Sparse Row format>

#### Key Differences:

##### CountVectorizer:
- Simple word count representation.
- Ignores the overall corpus context.
- Each word is equally important.

##### TF-IDF Vectorizer:
- Considers both local (document-specific) and global (corpus-wide) context.
- Penalizes common words and emphasizes rare, important words.
- Provides more meaningful features for machine learning models.

## Query document similarity

In [46]:
query = "Do I need to know python to sign up for the January course?"

q = cv.transform([query])
q.toarray()[0]

array([0., 0., 0., ..., 0., 0., 0.])

In [47]:
# View the array

query_dict = dict(zip(names,q.toarray()[0]))
query_dict

doc_dict = dict(zip(names,X.toarray()[1]))
doc_dict

{'00': np.float64(0.0),
 '00000000e': np.float64(0.0),
 '0002': np.float64(0.0),
 '00021': np.float64(0.0),
 '001': np.float64(0.0),
 '009s': np.float64(0.0),
 '01': np.float64(0.0),
 '02': np.float64(0.0),
 '020': np.float64(0.0),
 '028879': np.float64(0.0),
 '02d': np.float64(0.0),
 '03': np.float64(0.0),
 '0315': np.float64(0.0),
 '04': np.float64(0.0),
 '04d': np.float64(0.0),
 '05': np.float64(0.0),
 '051': np.float64(0.0),
 '054': np.float64(0.0),
 '06': np.float64(0.0),
 '06_spark_sql': np.float64(0.0),
 '07': np.float64(0.0),
 '07cd': np.float64(0.0),
 '08': np.float64(0.0),
 '09': np.float64(0.0),
 '0ms': np.float64(0.0),
 '0x3c947bc5': np.float64(0.0),
 '0x7efe331cf790': np.float64(0.0),
 '0x7f797010a590': np.float64(0.0),
 '0x7fbaf2666280': np.float64(0.0),
 '0x800701bc': np.float64(0.0),
 '0xa0': np.float64(0.0),
 '0xff': np.float64(0.0),
 '0zw04wdetqo': np.float64(0.0),
 '10': np.float64(0.0),
 '100': np.float64(0.0),
 '1000': np.float64(0.0),
 '100000': np.float64(0.0),
 

In [48]:
df_qd = pd.DataFrame([query_dict, doc_dict], index=['query', 'doc']).T

(df_qd['query'] * df_qd['doc']).sum()

np.float64(0.0)

In [54]:
# Dot product between every document vector and the query vector.
# TfidfVectorizer L2-normalizes its rows by default, so this equals the cosine similarity.
X.dot(q.T).todense().T

matrix([[0.16865333, 0.        , 0.        , 0.03733748, 0.04085934,
         0.        , 0.        , 0.09807025, 0.        , 0.        ,
         0.        , 0.13021262, 0.        , 0.        , 0.        ,
         0.05696417, 0.        , 0.        , 0.04230602, 0.05451619,
         0.        , 0.04440487, 0.16373635, 0.07091479, 0.        ,
         0.0880943 , 0.        , 0.17574325, 0.0960077 , 0.        ,
         0.        , 0.        , 0.        , 0.05523241, 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.0075487 ,
         0.02332155, 0.0492298 , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.01964829, 0.04958753, 0.05643015,
         0.03890971, 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.02765177, 0.        , 0.        , 0.        , 0.00745111,
         0.        , 0.        , 0.01692108, 0.        , 0.        ,
         0.        , 0.        , 0

### Compute cosine similarity

In [51]:
from sklearn.metrics.pairwise import cosine_similarity
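
The dot product and `cosine_similarity` give the same numbers here because `TfidfVectorizer` L2-normalizes its rows by default. A small NumPy sketch of that relationship, using invented vectors and a hypothetical helper named `cosine`:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc = np.array([3.0, 0.0, 4.0])
query = np.array([1.0, 2.0, 2.0])

# Scale both vectors to unit length (L2 normalization).
doc_unit = doc / np.linalg.norm(doc)
query_unit = query / np.linalg.norm(query)

raw_dot = float(np.dot(doc, query))             # depends on the vector lengths
unit_dot = float(np.dot(doc_unit, query_unit))  # equals the cosine similarity

print(raw_dot, unit_dot, cosine(doc, query))
```

For vectors that are not normalized, the plain dot product and the cosine similarity differ, so the shortcut only applies when both sides have unit length.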

In [53]:
score = cosine_similarity(X,q).flatten()
score

array([0.16865333, 0.        , 0.        , 0.03733748, 0.04085934,
       0.        , 0.        , 0.09807025, 0.        , 0.        ,
       0.        , 0.13021262, 0.        , 0.        , 0.        ,
       0.05696417, 0.        , 0.        , 0.04230602, 0.05451619,
       0.        , 0.04440487, 0.16373635, 0.07091479, 0.        ,
       0.0880943 , 0.        , 0.17574325, 0.0960077 , 0.        ,
       0.        , 0.        , 0.        , 0.05523241, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.0075487 ,
       0.02332155, 0.0492298 , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.01964829, 0.04958753, 0.05643015,
       0.03890971, 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.02765177, 0.        , 0.        , 0.        , 0.00745111,
       0.        , 0.        , 0.01692108, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.     

In [55]:
# Sort the documents
import numpy as np
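
Ranking with `np.argsort` is easy to get wrong because the slice decides which end of the sorted order you keep. A small sketch with invented scores:

```python
import numpy as np

score = np.array([0.10, 0.70, 0.00, 0.40, 0.90, 0.20])

ascending = np.argsort(score)           # indices from lowest to highest score
top3_desc = np.argsort(-score)[:3]      # the 3 best documents, best first
top3_asc = np.argsort(score)[-3:]       # the same 3 documents, best last
all_but_top3 = np.argsort(score)[:-3]   # everything EXCEPT the 3 best

print(ascending, top3_desc, top3_asc, all_but_top3)
```

Note that `[:-k]` drops the top `k` results rather than selecting them; `np.argsort(-score)[:k]` is usually the most readable way to get the best matches first.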

In [74]:
# Indices of the 5 highest-scoring documents, best match first
indexes = np.argsort(-score)[:5]

In [76]:
df.iloc[indexes]


##### Search across fields

In [64]:
fields = ['section','question','text']

In [65]:
matrices = {}
vectorizers = {}

for f in fields:
    cv = TfidfVectorizer(stop_words='english',min_df=5)
    X = cv.fit_transform(df[f])
    matrices[f] = X
    vectorizers[f] = cv

In [66]:
matrices

{'section': <948x66 sparse matrix of type '<class 'numpy.float64'>'
 	with 3090 stored elements in Compressed Sparse Row format>,
 'question': <948x291 sparse matrix of type '<class 'numpy.float64'>'
 	with 3431 stored elements in Compressed Sparse Row format>,
 'text': <948x1333 sparse matrix of type '<class 'numpy.float64'>'
 	with 23808 stored elements in Compressed Sparse Row format>}

In [67]:
vectorizers

{'section': TfidfVectorizer(min_df=5, stop_words='english'),
 'question': TfidfVectorizer(min_df=5, stop_words='english'),
 'text': TfidfVectorizer(min_df=5, stop_words='english')}

In [68]:
df.head()

| | course | section | question | text |
|---|---|---|---|---|
| 0 | data-engineering-zoomcamp | General course-related questions | Course - When will the course start? | The purpose of this document is to capture fre... |
| 1 | data-engineering-zoomcamp | General course-related questions | Course - What are the prerequisites for this c... | GitHub - DataTalksClub data-engineering-zoomca... |
| 2 | data-engineering-zoomcamp | General course-related questions | Course - Can I still join the course after the... | Yes, even if you don't register, you're still ... |
| 3 | data-engineering-zoomcamp | General course-related questions | Course - I have registered for the Data Engine... | You don't need it. You're accepted. You can al... |
| 4 | data-engineering-zoomcamp | General course-related questions | Course - What can I do before the course starts? | You can start by installing and setting up all... |


In [69]:
#Score the query across all fields

n = len(df)

score = np.zeros(n)

query = "I just discovered the course, is it too late to join?"

#Add a boost 
boosts = {
    'question':3
}

for f in fields:
    q = vectorizers[f].transform([query])
    X = matrices[f]

    f_score = cosine_similarity(X,q).flatten()
    boost = boosts.get(f,1.0)
    score  = score + boost*f_score

In [70]:
score

array([3.52985023, 3.49512426, 2.70735166, 2.96614194, 3.49512426,
       3.49512426, 1.93689291, 3.67069698, 2.67242848, 3.49512426,
       3.10198469, 2.46096752, 0.49512426, 0.49512426, 0.49512426,
       0.59193348, 0.49512426, 2.63772182, 0.57041627, 0.49512426,
       0.49512426, 0.49512426, 0.79499188, 0.60033101, 0.49512426,
       0.49512426, 0.49512426, 0.76959902, 0.62340833, 0.49512426,
       0.49512426, 0.49512426, 0.49512426, 1.78972334, 3.49512426,
       1.72080809, 0.49512426, 0.49512426, 0.49512426, 0.52668735,
       0.54427244, 2.00115141, 0.49512426, 0.53842198, 0.        ,
       0.        , 0.        , 0.        , 0.02804374, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.06739038, 0.        , 0.00980845,
       0.        , 0.        , 0.        , 0.        , 0.05820102,
       0.        , 0.        , 0.        , 0.        , 0.     

##### Add a filter

In [71]:
# Keep only data engineering content

# Named `filters` to avoid shadowing Python's built-in filter()
filters = {
    'course': 'data-engineering-zoomcamp'
}

for field, value in filters.items():
    mask = (df[field] == value).astype(int).values
    score = score * mask
mask

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

In [72]:
score

array([3.52985023, 3.49512426, 2.70735166, 2.96614194, 3.49512426,
       3.49512426, 1.93689291, 3.67069698, 2.67242848, 3.49512426,
       3.10198469, 2.46096752, 0.49512426, 0.49512426, 0.49512426,
       0.59193348, 0.49512426, 2.63772182, 0.57041627, 0.49512426,
       0.49512426, 0.49512426, 0.79499188, 0.60033101, 0.49512426,
       0.49512426, 0.49512426, 0.76959902, 0.62340833, 0.49512426,
       0.49512426, 0.49512426, 0.49512426, 1.78972334, 3.49512426,
       1.72080809, 0.49512426, 0.49512426, 0.49512426, 0.52668735,
       0.54427244, 2.00115141, 0.49512426, 0.53842198, 0.        ,
       0.        , 0.        , 0.        , 0.02804374, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.06739038, 0.        , 0.00980845,
       0.        , 0.        , 0.        , 0.        , 0.05820102,
       0.        , 0.        , 0.        , 0.        , 0.     

In [77]:
idx = np.argsort(-score)[:5]

df.iloc[idx]

| | course | section | question | text |
|---|---|---|---|---|
| 7 | data-engineering-zoomcamp | General course-related questions | Course - Can I follow the course after it fini... | Yes, we will keep all the materials after the ... |
| 0 | data-engineering-zoomcamp | General course-related questions | Course - When will the course start? | The purpose of this document is to capture fre... |
| 34 | data-engineering-zoomcamp | General course-related questions | How can we contribute to the course? | Star the repo! Share it with friends if you fi... |
| 1 | data-engineering-zoomcamp | General course-related questions | Course - What are the prerequisites for this c... | GitHub - DataTalksClub data-engineering-zoomca... |
| 9 | data-engineering-zoomcamp | General course-related questions | Course - Which playlist on YouTube should I re... | All the main videos are stored in the Main “DA... |
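
The per-field scoring, boosting, and filtering steps can be collected into one function. The following is a sketch under assumptions, not the notebook's own code: the dataframe `toy_df`, its rows, and the function name `search` are invented for illustration, and the tiny corpus uses default `TfidfVectorizer` settings rather than `min_df=5`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for the FAQ dataframe; the rows are invented for illustration.
toy_df = pd.DataFrame({
    "course": ["de", "de", "ml"],
    "question": ["Can I join the course late?", "How do I install Docker?", "Can I join late?"],
    "text": ["Yes, you can still join.", "Follow the Docker docs.", "Yes, joining late is fine."],
})
text_fields = ["question", "text"]

# One vectorizer and one TF-IDF matrix per text field.
vectorizers = {f: TfidfVectorizer().fit(toy_df[f]) for f in text_fields}
matrices = {f: vectorizers[f].transform(toy_df[f]) for f in text_fields}

def search(query, boosts=None, filters=None, n_results=2):
    """Rank rows by boosted per-field cosine similarity, then apply exact-match filters."""
    boosts = boosts or {}
    filters = filters or {}
    score = np.zeros(len(toy_df))
    for f in text_fields:
        q = vectorizers[f].transform([query])
        score = score + boosts.get(f, 1.0) * cosine_similarity(matrices[f], q).flatten()
    for field, value in filters.items():
        mask = (toy_df[field] == value).to_numpy()
        score = score * mask  # rows that fail the filter get a score of 0
    idx = np.argsort(-score)[:n_results]
    return toy_df.iloc[idx]

results = search("join the course late", boosts={"question": 3}, filters={"course": "de"})
print(results)
```

One design caveat of masking by multiplication: filtered-out rows keep a score of 0 instead of being removed, so they can still appear in the results when fewer than `n_results` rows match the query.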


### Embeddings and Vector Search

**What are Embeddings?**
- **Conversion to Numbers:** Embeddings transform different words, sentences and documents into dense vectors (arrays with numbers).
- **Capturing Similarity:** They ensure similar items have similar numerical vectors, illustrating their closeness in terms of characteristics.
- **Dimensionality Reduction:** Embeddings compress complex, high-dimensional characteristics into compact vectors with far fewer dimensions.
- **Use in Machine Learning:** These numerical vectors are used in machine learning models for tasks such as recommendations, text analysis, and pattern recognition.

**SVD**
Singular Value Decomposition (SVD) is one of the simplest ways to turn a Bag-of-Words representation into embeddings.

This way we still don't preserve the word order (because it wasn't in the Bag-of-Words representation), but we reduce dimensionality and capture synonyms.

We won't go into the mathematics. It's sufficient to know that SVD "compresses" our input vectors in such a way that as much of the original information as possible is retained.

This is lossy compression: we won't be able to restore 100% of the original vector, but the result is close enough.
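
The "lossy compression" idea can be illustrated with plain NumPy. This sketch uses `np.linalg.svd` on an invented 4x4 matrix and a hypothetical helper `compress`; scikit-learn's `TruncatedSVD` computes a comparable rank-`k` factorization but with a different solver, so its embeddings may differ in sign or small numerical details.

```python
import numpy as np

# Toy document-term matrix (rows: documents, columns: terms); values are invented.
X_toy = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 3.0, 1.0],
])

U, S, Vt = np.linalg.svd(X_toy, full_matrices=False)

def compress(k):
    """Keep the k largest singular values; return the embeddings and the reconstruction."""
    embeddings = U[:, :k] * S[:k]         # shape (n_docs, k)
    reconstruction = embeddings @ Vt[:k]  # back to shape (n_docs, n_terms)
    return embeddings, reconstruction

errors = {}
for k in (1, 2, 4):
    _, X_approx = compress(k)
    errors[k] = float(np.linalg.norm(X_toy - X_approx))

print(errors)
```

The reconstruction error shrinks as `k` grows and reaches (numerically) zero only when all components are kept, which is the trade-off between embedding size and retained information.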

In [83]:
from sklearn.decomposition import TruncatedSVD
#Reduce the dimensions of the document
X = matrices['text']
cv = vectorizers['text']

In [84]:
cv

In [85]:
svd = TruncatedSVD(n_components=16)
X_emb = svd.fit_transform(X)

In [86]:
X_emb.shape

(948, 16)

In [87]:
X_emb[0]

array([ 0.09653448, -0.08232656, -0.10136608, -0.08019789,  0.0679888 ,
       -0.05854308,  0.0194666 , -0.16156203,  0.22591134,  0.28661481,
        0.06278776,  0.08776968, -0.10176167, -0.07817687,  0.08435481,
        0.00148719])

In [88]:
"""This captures semantic similarities in the text"""

query = "I just signed up. Is it too late to join the course?"

Q = cv.transform([query])

Q_emb = svd.transform(Q)
Q_emb[0]

array([ 0.057905  , -0.03869153, -0.05651919, -0.02854049,  0.04049481,
       -0.05961523,  0.01069476, -0.10641272,  0.15879875,  0.18519603,
        0.05198217,  0.08554754, -0.08697101, -0.03155278,  0.04496728,
        0.01208409])

In [89]:
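# Dot product between the first document embedding and the query embedding.
# This is an unnormalized similarity; cosine_similarity in the next cell also divides by the vector lengths.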
np.dot(X_emb[0],Q_emb[0])

np.float64(0.1552917640414362)

In [96]:
score = cosine_similarity(X_emb,Q_emb).flatten()
idx = np.argsort(-score)[:10]
list(df.iloc[idx].question)

['What If I submitted only two projects and failed to submit the third?',
 'The course has already started. Can I still join it?',
 'Course - When will the course start?',
 'Can I submit the homework after the due date?',
 'Course - Can I follow the course after it finishes?',
 'Is it going to be live? When?',
 'Certificate - Can I follow the course in a self-paced mode and get a certificate?',
 'When does the next iteration start?',
 'What if my answer is not exactly the same as the choices presented?',
 'Homework - Are late submissions of homework allowed?']

In [99]:
df.iloc[449]

course                              machine-learning-zoomcamp
section                      General course-related questions
question    The course has already started. Can I still jo...
text        Yes, you can. You won’t be able to submit some...
Name: 449, dtype: object

### Non Negative Matrix Factorization
SVD produces embeddings that contain negative values, which are difficult to interpret.

NMF (Non-Negative Matrix Factorization) is a similar concept, except that for non-negative input matrices it produces non-negative results.

We can interpret each column (feature) of the embeddings as a different topic/concept, and its value as the extent to which the document is about that concept.
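
A minimal sketch of the non-negativity property on an invented 4x4 matrix. The names `X_toy` and `nmf_toy`, and the choices `init="nndsvda"`, `random_state=0`, and `max_iter=1000`, are assumptions made here to keep the toy run reproducible; they are not settings taken from the notebook.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy non-negative document-term matrix; values are invented.
X_toy = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 3.0, 1.0],
])

nmf_toy = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=1000)
W = nmf_toy.fit_transform(X_toy)  # document-topic weights, shape (4, 2)
H = nmf_toy.components_           # topic-term weights, shape (2, 4)

print(W.round(2))
print(H.round(2))
```

Every entry of `W` and `H` is zero or positive, so a row of `W` can be read as "how much of each topic this document contains", and a row of `H` as "which terms make up this topic".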

Let's use it for the documents:

In [104]:
from sklearn.decomposition import NMF

nmf  = NMF(n_components=16)
X_emb = nmf.fit_transform(X)

X_emb[0]

array([0.13079639, 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        ])

In [105]:
Q = cv.transform([query])
Q_emb = nmf.transform(Q)

Q_emb[0]

array([0.08643139, 0.00224416, 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.00171504,
       0.        ])

In [107]:
score = cosine_similarity(X_emb,Q_emb).flatten()
idx = np.argsort(-score)[:10]
list(df.iloc[idx].question)

['Course - When will the course start?',
 'Course - Can I still join the course after the start date?',
 'The course has already started. Can I still join it?',
 'Is it going to be live? When?',
 'What if my answer is not exactly the same as the choices presented?',
 'What If I submitted only two projects and failed to submit the third?',
 'Can I submit the homework after the due date?',
 'What if I miss a session?',
 'When does the next iteration start?',
 'Certificate - Can I follow the course in a self-paced mode and get a certificate?']

### BERT
The problem with the approaches above is that they do not take word order into account, which is why they are called bag-of-words models.

**BERT** and other transformer models don't have this problem.
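
A quick illustration of the word-order limitation, using two invented sentences and plain Python:

```python
from collections import Counter

a = "the dog bites the man"
b = "the man bites the dog"

bag_a = Counter(a.split())
bag_b = Counter(b.split())

same_bag = bag_a == bag_b  # True: both sentences have identical word counts
same_sentence = a == b     # False: the order, and the meaning, differ

print(same_bag, same_sentence)
```

Any representation built only from word counts, including TF-IDF, SVD, and NMF on top of it, sees these two sentences as the same document.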

Let's create embeddings with BERT. We will use the Hugging Face `transformers` library for that.


In [None]:
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
