## Introduction to Embeddings and SVD

Traditional keyword search finds only exact term matches, so it misses synonyms and semantically related text. Embeddings address this by converting words, sentences, or documents into dense numerical vectors that capture their meaning and relationships.

- **Conversion to Numbers:** Embeddings transform text into arrays of numbers.
- **Capturing Similarity:** Similar items have similar vectors, reflecting their semantic closeness.
- **Dimensionality Reduction:** Embeddings reduce complex text data into lower-dimensional vectors.
- **Use in Machine Learning:** These vectors are used for recommendations, text analysis, and pattern recognition.
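
To make "similar items have similar vectors" concrete, here is a minimal sketch using cosine similarity. The three 4-dimensional vectors are hand-picked, hypothetical values chosen only for illustration; real embeddings come from a fitted model.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional embeddings, hand-picked for illustration only.
toy_embeddings = {
    "enroll":   np.array([0.9, 0.8, 0.1, 0.0]),
    "register": np.array([0.8, 0.9, 0.2, 0.1]),
    "docker":   np.array([0.0, 0.1, 0.9, 0.8]),
}

sim_synonyms = cosine(toy_embeddings["enroll"], toy_embeddings["register"])
sim_unrelated = cosine(toy_embeddings["enroll"], toy_embeddings["docker"])

print(f"enroll vs register: {sim_synonyms:.3f}")
print(f"enroll vs docker:   {sim_unrelated:.3f}")
```

The near-synonyms point in almost the same direction and score close to 1, while the unrelated pair scores close to 0.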



In [1]:
import pandas as pd
import numpy as np
import requests

# Load Data
   - Download FAQ documents from a remote JSON file.
   - Convert the data into a pandas DataFrame for easier processing.

In [2]:
url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
doc = requests.get(url)
doc.raise_for_status()  # fail early if the download did not succeed
raw_data = doc.json()

In [3]:
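# Build one DataFrame per course, then concatenate once
# (avoids repeatedly concatenating onto an empty DataFrame).
frames = []
for course in raw_data:
    tmp = pd.DataFrame(course['documents'])
    tmp['course'] = course['course']
    frames.append(tmp)

df = pd.concat(frames, axis=0)[['course', 'section', 'question', 'text']]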

df.sample(5)

| | course | section | question | text |
|---|---|---|---|---|
| 66 | mlops-zoomcamp | Module 2: Experiment tracking | TypeError: send_file() unexpected keyword 'max... | Problem: When I ran `$ mlflow ui` on a remote ... |
| 29 | mlops-zoomcamp | Module 1: Introduction | Distplot takes too long | First remove the outliers (trips with unusual ... |
| 123 | machine-learning-zoomcamp | 4. Evaluation Metrics for Classification | Using a variable to score | https://datatalks-club.slack.com/archives/C028... |
| 310 | machine-learning-zoomcamp | 10. Kubernetes and TensorFlow Serving | 'kind' is not recognized as an internal or ext... | Problem: I download kind from the next command... |
| 161 | data-engineering-zoomcamp | Module 2: Workflow Orchestration | Where are the FAQ questions from the previous ... | Prefect: https://docs.google.com/document/d/1K... |


---
# Embeddings using SVD Matrix Factorization
---

SVD is a simple way to turn Bag-of-Words representations into embeddings. While it does not preserve word order, it reduces dimensionality and helps capture synonyms. SVD "compresses" input vectors, retaining as much information as possible, though some loss is inevitable (lossy compression).
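
The lossy-compression idea can be sketched with plain NumPy on a toy matrix. The count matrix below is made up for illustration. As far as we know, `TruncatedSVD` returns document vectors of the same form (`U * s`, up to sign), but it is designed for large sparse inputs.

```python
import numpy as np

# Toy bag-of-words count matrix: 4 documents (rows) x 5 terms (columns).
# Values are made up for illustration.
X = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 1],
    [0, 0, 3, 1, 0],
    [0, 0, 1, 2, 1],
], dtype=float)

# Full SVD: X = U @ diag(s) @ Vt, singular values sorted in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

def truncate(k):
    """Return (k-dim document embeddings, rank-k reconstruction of X)."""
    emb = U[:, :k] * s[:k]      # one k-dimensional vector per document
    X_hat = emb @ Vt[:k, :]     # rank-k approximation of X
    return emb, X_hat

errors = {}
for k in range(1, len(s) + 1):
    emb, X_hat = truncate(k)
    errors[k] = np.linalg.norm(X - X_hat)   # Frobenius norm of what was lost
    print(f"k={k}: embedding shape {emb.shape}, reconstruction error {errors[k]:.3f}")
```

Fewer components give shorter embeddings but a larger reconstruction error. Keeping all components reproduces the matrix up to floating-point error.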


**Text Vectorization**
   - Use `TfidfVectorizer` from scikit-learn to convert text fields (`section`, `question`, `text`) into numerical vectors.
   - Store the vectorizers and transformed matrices for each field.

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer

fields = ['section', 'question', 'text']
vects = {}
transformed_matrices = {}

for field in fields:
    vc = TfidfVectorizer(stop_words='english', min_df=3)
    transformed_matrices[field] = vc.fit_transform(df[field])
    vects[field] = vc

**Embeddings**

Take the TF-IDF matrix of the `text` field and reduce it to 16-dimensional embeddings with `TruncatedSVD`.

In [17]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=16)

docs = transformed_matrices['text']
docs_emb = svd.fit_transform(docs)

docs_emb[0]

array([ 0.088003  , -0.07495318, -0.10098837,  0.05220785,  0.05234601,
       -0.06120342,  0.02666958,  0.05652742, -0.19840912, -0.32404082,
        0.10883151,  0.11826502, -0.10906581, -0.0439981 , -0.00641178,
        0.03880909])

**Similarity Search**
   - For a given query, transform it with the same fitted vectorizer and SVD model used for the documents.
   - Compute cosine similarity between the query embedding and each document embedding, then keep the top 10 matches.
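
The ranking step on its own can be sketched in NumPy. `top_k_cosine` is a hypothetical helper name, and the embeddings are made-up values. The notebook itself uses scikit-learn's `cosine_similarity`, which is the safer choice when some rows might be all zeros.

```python
import numpy as np

def top_k_cosine(docs_emb, query_emb, k=3):
    """Return (indices, scores) of the k rows of docs_emb most similar to query_emb."""
    docs_norm = docs_emb / np.linalg.norm(docs_emb, axis=1, keepdims=True)
    query_norm = query_emb / np.linalg.norm(query_emb)
    scores = docs_norm @ query_norm      # cosine similarity per document
    order = np.argsort(-scores)[:k]      # highest score first
    return order, scores[order]

# Made-up 3-dimensional embeddings for four documents and one query.
toy_docs_emb = np.array([
    [1.0, 0.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
toy_query_emb = np.array([0.9, 0.1, 0.0])

top_idx, top_scores = top_k_cosine(toy_docs_emb, toy_query_emb, k=2)
print("top indices:", top_idx)
print("top scores: ", top_scores.round(3))
```

This sketch assumes no embedding row has zero norm; a zero row would cause a division by zero.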

In [18]:
query = 'I just signed up. Is it too late to join the course?'

cv = vects['text']
transformed_query = cv.transform([query])
query_emb = svd.transform(transformed_query)
query_emb[0]

array([ 0.04353757, -0.03063924, -0.04385529,  0.01288144,  0.02678051,
       -0.05145252,  0.01510529,  0.03337912, -0.11551623, -0.16973149,
        0.06819878,  0.08097627, -0.07188326, -0.00684188, -0.01545067,
        0.03573956])

In [19]:
from sklearn.metrics.pairwise import cosine_similarity

score = cosine_similarity(docs_emb, query_emb).flatten()

idx = np.argsort(-score)[:10]
print("Text idx: ", idx)
list(df.iloc[idx].text)

Text idx:  [764 449 436   0   7  15  11 451 450 440]


['If you have submitted two projects (and peer-reviewed at least 3 course-mates’ projects for each submission), you will get the certificate for the course. According to the course coordinator, Alexey Grigorev, only two projects are needed to get the course certificate.\n(optional) David Odimegwu',
 'Yes, you can. You won’t be able to submit some of the homeworks, but you can still take part in the course.\nIn order to get a certificate, you need to submit 2 out of 3 course projects and review 3 peers’ Projects by the deadline. It means that if you join the course at the end of November and manage to work on two projects, you will still be eligible for a certificate.',
 'The course videos are pre-recorded, you can start watching the course right now.\nWe will also occasionally have office hours - live sessions where we will answer your questions. The office hours sessions are recorded too.\nYou can see the office hours as well as the pre-recorded course videos in the course playlist on

---
# Embeddings using Non-Negative Matrix Factorization
---

- SVD produces embeddings that contain negative values, which makes them difficult to interpret.
- NMF (Non-Negative Matrix Factorization) is a similar technique: given a non-negative input matrix (such as TF-IDF), it produces non-negative factors.

> Each column (feature) of the NMF embedding can be interpreted as a topic or concept, and its value as the extent to which the document is about that concept.


**Embeddings**

In [20]:
from sklearn.decomposition import NMF
nmf = NMF(n_components=16)
docs_emb = nmf.fit_transform(docs)
docs_emb[0]

array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.30711036,
       0.        , 0.0024522 , 0.        , 0.        , 0.        ,
       0.        ])

**Similarity Search**
   - For a given query, transform it with the same fitted vectorizer and NMF model used for the documents.
   - Compute cosine similarity between the query embedding and each document embedding, then keep the top 10 matches.

In [21]:
query = 'I just signed up. Is it too late to join the course?'

cv = vects['text']
transformed_query = cv.transform([query])
query_emb = nmf.transform(transformed_query)
query_emb[0]

array([0.        , 0.00120315, 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.17415437,
       0.        , 0.        , 0.        , 0.00079652, 0.        ,
       0.        ])

In [22]:
from sklearn.metrics.pairwise import cosine_similarity

score = cosine_similarity(docs_emb, query_emb).flatten()

idx = np.argsort(-score)[:10]
print("Text idx: ", idx)
list(df.iloc[idx].text)

Text idx:  [  2 449 814  11   0 451   7 436  15 764]


["Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
 'Yes, you can. You won’t be able to submit some of the homeworks, but you can still take part in the course.\nIn order to get a certificate, you need to submit 2 out of 3 course projects and review 3 peers’ Projects by the deadline. It means that if you join the course at the end of November and manage to work on two projects, you will still be eligible for a certificate.',
 'Please choose the closest one to your answer. Also do not post your answer in the course slack channel.',
 "No, you can only get a certificate if you finish the course with a “live” cohort. We don't award certificates for the self-paced mode. The reason is you need to peer-review capstone(s) after submitting a project. You can only peer-review projects at the time the course is running.",
 "The purpose

---
# Embedding Vectors using BERT
---

**Load the model from Hugging Face**

In [4]:
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()  # evaluation mode: disables dropout for inference

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

**Split text into batches**

In [5]:
def make_batches(seq, n):
    result = []
    for i in range(0, len(seq), n):
        batch = seq[i:i+n]
        result.append(batch)
    return result

texts = df['text'].tolist()
text_batches = make_batches(texts, 8)
text_batches[0]

["The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'GitHub - DataTalksClub data-engineering-zoomcamp#prerequisites',
 "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
 "You don't need it. You're accepted. You can also just start learning and submitting homework without registering. It is not checked against any registered list. Registration is just to gauge interest before the start date.",
 'You can start by ins

**Extract Embeddings**

In [6]:
from tqdm.auto import tqdm

all_embeddings = []

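for batch in tqdm(text_batches):
    # Pad to the longest text in the batch; truncate to the model's maximum length
    encoded_input = tokenizer(batch, padding=True, truncation=True, return_tensors='pt')

    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**encoded_input)
        hidden_states = outputs.last_hidden_state  # (batch, tokens, 768)

        # Mean over the token axis gives one vector per text
        # (padding positions are included in this average)
        batch_embeddings = hidden_states.mean(dim=1)
        batch_embeddings_np = batch_embeddings.cpu().numpy()
        all_embeddings.append(batch_embeddings_np)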

In [7]:
docs_emb = np.vstack(all_embeddings)
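
The extraction loop averages `last_hidden_state` over every token position, so padding tokens in shorter texts also contribute to the mean. A common alternative is to weight the average by the attention mask. The sketch below shows that arithmetic in NumPy on made-up values so it runs without downloading a model. `masked_mean_pool` is a hypothetical helper, and applying it to `outputs.last_hidden_state` with `encoded_input['attention_mask']` is a suggested variation, not what the notebook does.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    hidden_states:  (batch, seq_len, hidden) array of token vectors
    attention_mask: (batch, seq_len) array, 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid division by zero
    return summed / counts

# Made-up batch: 2 sequences, 3 token positions, hidden size 2.
# The second sequence has one padded position (mask value 0).
toy_hidden = np.array([
    [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]],
    [[2.0, 0.0], [4.0, 2.0], [9.0, 9.0]],
])
toy_mask = np.array([
    [1, 1, 1],
    [1, 1, 0],
])

plain = toy_hidden.mean(axis=1)                  # padding included
masked = masked_mean_pool(toy_hidden, toy_mask)  # padding excluded

print("plain mean: ", plain)
print("masked mean:", masked)
```

The two results agree for the unpadded sequence and differ for the padded one, which is where the plain mean gets pulled toward the padding vector.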

**Embed the Query**

To perform a semantic search, we first need to embed the query using the same BERT model and tokenizer as used for the documents. This ensures the query and document vectors are in the same space for similarity comparison.

In [8]:
query = 'I just signed up. Is it too late to join the course?'

encoded_input = tokenizer([query], padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    outputs = model(**encoded_input)
    hidden_states = outputs.last_hidden_state
    query_emb = hidden_states.mean(dim=1).cpu().numpy()

query_emb.shape

(1, 768)

In [10]:
from sklearn.metrics.pairwise import cosine_similarity

score = cosine_similarity(docs_emb, query_emb).flatten()

idx = np.argsort(-score)[:10]
print("Text idx: ", idx)
list(df.iloc[idx].text)

Text idx:  [  5 452 449 440 809 804 550   7 846 561]


["There are 3 Zoom Camps in a year, as of 2024. However, they are for separate courses:\nData-Engineering (Jan - Apr)\nMLOps (May - Aug)\nMachine Learning (Sep - Jan)\nThere's only one Data-Engineering Zoomcamp “live” cohort per year, for the certification. Same as for the other Zoomcamps.\nThey follow pretty much the same schedule for each cohort per zoomcamp. For Data-Engineering it is (generally) from Jan-Apr of the year. If you’re not interested in the Certificate, you can take any zoom camps at any time, at your own pace, out of sync with any “live” cohort.",
 'Welcome to the course! Go to the course page (http://mlzoomcamp.com/), scroll down and start going through the course materials. Then read everything in the cohort folder for your cohort’s year.\nClick on the links and start watching the videos. Also watch office hours from previous cohorts. Go to DTC youtube channel and click on Playlists and search for {course yyyy}. ML Zoomcamp was first launched in 2021.\nOr you can jus