# Installation/Setup
In this workshop we will build a class-based search algorithm that finds the sentences in a text that best match a query. Before we start, select a **T4 GPU** as the runtime and run the following cells to install the necessary packages and data.

In [3]:
from sentence_transformers import SentenceTransformer, losses, InputExample
from torch.utils.data import DataLoader
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import nltk
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
nltk.download("stopwords")
nltk.download("wordnet")
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

--2024-10-24 14:41:34--  https://docs.google.com/uc?export=download&id=1dSl2QJhVUr93yPUnakzXqmrK5ZKBTciz
Resolving docs.google.com (docs.google.com)... 74.125.195.100, 74.125.195.113, 74.125.195.138, ...
Connecting to docs.google.com (docs.google.com)|74.125.195.100|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://drive.usercontent.google.com/download?id=1dSl2QJhVUr93yPUnakzXqmrK5ZKBTciz&export=download [following]
--2024-10-24 14:41:34--  https://drive.usercontent.google.com/download?id=1dSl2QJhVUr93yPUnakzXqmrK5ZKBTciz&export=download
Resolving drive.usercontent.google.com (drive.usercontent.google.com)... 142.250.99.132, 2607:f8b0:400e:c07::84
Connecting to drive.usercontent.google.com (drive.usercontent.google.com)|142.250.99.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19338 (19K) [application/octet-stream]
Saving to: ‘cleaned-tcn-description.txt’


2024-10-24 14:41:36 (92.6 MB/s) - ‘cleaned-tcn-description

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# Concepts behind our searching algorithm

## Evaluating sentences
The idea behind the searching algorithm is that we want to take some query and find the **most similar** sentence to that query from a text. To do that, we need a way to compare our query to other sentences numerically, so that we can rank them by similarity.

One method of comparing sentences to our query is to convert them to vectors and calculate the **cosine similarity** between them. If you have taken Calculus 3, this may look familiar, as the cosine similarity between two vectors *u, v* is defined as:
$$
S_c(u,v)=\frac{u\cdot v}{\|u\| \|v\|}
$$

The output of the function is the cosine of the angle between the two vectors, so it always lies in $[-1, 1]$. If two vectors point in the same direction, then
$$
S_c(u,v)=\cos(0)=1
$$
and for vectors pointing in opposite directions
$$
S_c(u,v)=\cos(\pi)=-1
$$
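
Between these two extremes, perpendicular vectors have a similarity of zero. As a quick worked check, take $u=(1,0)$ and $v=(0,1)$ (values chosen here only for illustration):
$$
S_c(u,v)=\frac{1\cdot 0+0\cdot 1}{\sqrt{1^2+0^2}\,\sqrt{0^2+1^2}}=\frac{0}{1}=\cos\left(\frac{\pi}{2}\right)=0
$$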

We calculate the cosine similarity using the `cosine_similarity` function from `sklearn.metrics.pairwise`. The function takes two matrices whose rows are vectors and returns a matrix containing the cosine similarity of every row of the first matrix with every row of the second.

The key idea is that cosine similarity measures how closely two vectors point in the same direction, with values near 1 for vectors that are close in direction. The next step is to turn sentences into vectors in a way that captures their similarity.

In [None]:
i = np.array([1, 0])
j = np.array([0, 1])
# Calculates the cosine similarity of i with i, j, -i
cosine_similarity([i], [i, j, -i])

array([[ 1.,  0., -1.]])
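
To connect the formula to the library call, here is a minimal sketch that computes the same three similarities by hand in plain Python. The helper name `cosine_similarity_by_hand` is made up for this illustration, and the sketch assumes non-zero vectors of equal length.

```python
import math

def cosine_similarity_by_hand(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))       # u . v
    norm_u = math.sqrt(sum(a * a for a in u))    # ||u||
    norm_v = math.sqrt(sum(b * b for b in v))    # ||v||
    return dot / (norm_u * norm_v)

i = [1, 0]
j = [0, 1]
neg_i = [-1, 0]

# Same comparison as the cell above: i against i, j, and -i
scores = [cosine_similarity_by_hand(i, other) for other in (i, j, neg_i)]
print(scores)  # [1.0, 0.0, -1.0]
```

In practice the `sklearn` function is preferable because it handles whole matrices at once; the hand-written version is only meant to show that nothing more than a dot product and two norms is involved.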

## How to turn sentences into vectors
Now for the important part. The entire structure of the algorithm depends on how we turn the sentences into vectors. We need to do this in such a way that *similar* sentences correspond to vectors pointing in *similar* directions. There are multiple approaches to doing this, and each has its own use cases. For the purpose of this workshop, we will use a sentence transformer from Hugging Face to do the embedding for us.

The sentence transformer we are using is `paraphrase-MiniLM-L6-v2`, which converts each sentence into a vector in $\mathbb{R}^{384}$.

In [None]:
# Two sentences about machine learning
s1 = "Machine learning algorithms are designed to learn from data and improve their performance over time."
s2 = "Deep learning is a subfield of machine learning that utilizes artificial neural networks with multiple layers."
# A completely unrelated sentence
s3 = "The quick brown fox jumps over the lazy dog, showcasing a classic pangram for alphabet practice."

s1_vector = model.encode(s1)
s2_vector = model.encode(s2)
s3_vector = model.encode(s3)

In [None]:
print("s1 dimension:", s1_vector.shape)
print("Similarity between s1 and s2:", cosine_similarity([s1_vector], [s2_vector])[0])
print("Similarity between s1 and s3:", cosine_similarity([s1_vector], [s3_vector])[0])
print("Similarity between s1 and s1:", cosine_similarity([s1_vector], [s1_vector])[0])

s1 dimension: (384,)
Similarity between s1 and s2: [0.50385654]
Similarity between s1 and s3: [0.20394549]
Similarity between s1 and s1: [1.0000001]
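
For contrast, one of the simpler approaches to turning sentences into vectors is to count words. The sketch below is not what this workshop uses; it is a toy bag-of-words embedding (the function names and the three short sentences are invented for the example) that shows how shared words alone already push related sentences toward similar directions.

```python
import math
from collections import Counter

def bag_of_words_vectors(sentences):
    """Map each sentence to a word-count vector over a shared vocabulary."""
    tokenized = [s.lower().split() for s in sentences]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

def cosine(u, v):
    """Cosine similarity of two non-zero count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

sentences = [
    "machine learning learns from data",
    "deep learning learns from data too",
    "the fox jumps over the dog",
]
vocabulary, vectors = bag_of_words_vectors(sentences)
related = cosine(vectors[0], vectors[1])    # four shared words
unrelated = cosine(vectors[0], vectors[2])  # no shared words
print(f"related: {related:.3f}, unrelated: {unrelated:.3f}")
```

Word counts can only reward sentences that share exact words, so two paraphrases with different vocabulary would score 0. A learned sentence embedding does not have that restriction, which is a reasonable motivation for using the transformer here.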


## Using NLP Techniques
There are a few NLP techniques that we can use to help make the sentence transformer perform better.
### Removing Stopwords and Converting to Lowercase
We can remove stopwords from the sentences ("there", "is", "and") to leave only important keywords for the transformer to handle. We can also convert each sentence to be strictly lowercase.

In [None]:
stop_words = set(stopwords.words("english"))
s1 = ' '.join([word.lower() for word in s1.split() if word.lower() not in stop_words])
s2 = ' '.join([word.lower() for word in s2.split() if word.lower() not in stop_words])
s3 = ' '.join([word.lower() for word in s3.split() if word.lower() not in stop_words])
print(s1)
print(s2)
print(s3)

machine learning algorithms designed learn data improve performance time.
deep learning subfield machine learning utilizes artificial neural networks multiple layers.
quick brown fox jumps lazy dog, showcasing classic pangram alphabet practice.
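
In the output above, tokens such as `time.` and `dog,` still carry punctuation because `split()` only breaks on whitespace, so a stop word directly followed by punctuation would also slip past the filter. A minimal sketch of one way to handle this follows. `DEMO_STOP_WORDS` is a small stand-in for `set(stopwords.words("english"))` so that the example runs without the NLTK data, and `remove_stop_words` is a hypothetical helper name.

```python
import string

# Small stand-in for set(stopwords.words("english")); demo values only.
DEMO_STOP_WORDS = {"the", "over", "a", "for", "and", "are", "to", "from", "their"}

def remove_stop_words(sentence, stop_words):
    """Lowercase each token, strip surrounding punctuation, and drop stop words."""
    kept = []
    for raw in sentence.split():
        token = raw.lower().strip(string.punctuation)
        if token and token not in stop_words:
            kept.append(token)
    return " ".join(kept)

cleaned = remove_stop_words(
    "The quick brown fox jumps over the lazy dog, showcasing a classic pangram for alphabet practice.",
    DEMO_STOP_WORDS,
)
print(cleaned)
# quick brown fox jumps lazy dog showcasing classic pangram alphabet practice
```

Whether stripping punctuation improves the search results is worth checking on your own data rather than assuming.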


### Lemmatization
We can also apply a lemmatizer to the sentences, which changes each word to its root form. For example, "running" -> "run".

In [None]:
lemmatizer = WordNetLemmatizer()
s1 = ' '.join([lemmatizer.lemmatize(word) for word in s1.split()])
s2 = ' '.join([lemmatizer.lemmatize(word) for word in s2.split()])
s3 = ' '.join([lemmatizer.lemmatize(word) for word in s3.split()])
print(s1)
print(s2)
print(s3)

machine learning algorithm designed learn data improve performance time.
deep learning subfield machine learning utilizes artificial neural network multiple layers.
quick brown fox jump lazy dog, showcasing classic pangram alphabet practice.
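
Notice that `networks` became `network` in the output above, while `layers.` kept its plural form: the trailing period is part of the token, so the lemmatizer is handed `layers.` rather than `layers`. The sketch below shows one possible way around this by separating trailing punctuation before lemmatizing. The function accepts any callable, so in the workshop you could pass `lemmatizer.lemmatize`; here a toy lookup table (hypothetical, demo only) stands in for it so the example runs without the WordNet data.

```python
import string

def lemmatize_sentence(sentence, lemmatize):
    """Apply `lemmatize` to each word while preserving trailing punctuation."""
    result = []
    for token in sentence.split():
        core = token.rstrip(string.punctuation)  # word without trailing punctuation
        trailing = token[len(core):]             # the punctuation that was removed
        result.append(lemmatize(core) + trailing if core else token)
    return " ".join(result)

# Toy lookup standing in for WordNetLemmatizer().lemmatize (demo values only).
TOY_LEMMAS = {"networks": "network", "layers": "layer"}

def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)

print(lemmatize_sentence("neural networks multiple layers.", toy_lemmatize))
# neural network multiple layer.
```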


# Putting it all Together
Now that we have the foundation of our searching algorithm with sentence transformers and NLP techniques, we can construct a class for handling our data. Feel free to add any helper functions as you implement the code.

In [None]:
class SearchAlgorithm:

    def __init__(self, model_name="paraphrase-MiniLM-L6-v2", model=None) -> None:
        if model:
            self.model = model
        else:
            self.model = SentenceTransformer(model_name)
        self.stop_words = set(stopwords.words("english"))
        self.lemmatizer = WordNetLemmatizer()
        self.documents = []
        self.document_vectors = []
        self.query_vector = None

    def load_documents(self, documents: list[str]) -> None:
        """
        TODO:
        1.  Load in self.documents with the passed in documents
        2.  Convert each document into a vector with the model
            and store the vectors in self.document_vectors.
        3.  Utilize NLP techniques to preprocess the documents.
        """
        pass

    def search(self, query: str, N: int = 3) -> list[str]:
        """
        TODO:
        Write an algorithm to return the N most similar sentences.
        1.  Convert the query into a vector
        2.  Calculate the cosine similarity of the query with sentences
        3.  Order the sentences by similarity and return the top N (hint: np.argsort())
        """
        pass
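
The ranking step from the `search` hint can be tried out on its own before any model is involved. The following is a minimal sketch on made-up 2D vectors, not a full solution to the exercise: `top_n_indices` is a hypothetical helper, and it computes cosine similarity directly with NumPy instead of calling `cosine_similarity`.

```python
import numpy as np

def top_n_indices(query_vector, document_vectors, n=3):
    """Return the indices and scores of the n rows most similar to the query, best first."""
    docs = np.asarray(document_vectors, dtype=float)
    query = np.asarray(query_vector, dtype=float)
    # Cosine similarity of the query against every row at once.
    scores = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
    # argsort sorts ascending, so reverse it and keep the first n entries.
    order = np.argsort(scores)[::-1][:n]
    return order, scores[order]

# Toy "document vectors" chosen so that the ranking is easy to verify by eye.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
indices, scores = top_n_indices([1.0, 0.2], toy_vectors, n=2)
print(indices, scores)
```

Here the query points mostly along the first axis, so row 0 ranks first and the diagonal row 2 ranks second. Inside the class, the returned indices would be used to look up the matching entries of `self.documents`.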

# Testing the Implementation

In [None]:
# Example of documents related to machine learning
# Already parsed and cleaned/unique

documents = [
    "Artificial intelligence is transforming industries with its ability to mimic human decision-making.",
    "Machine learning enables computers to learn from data and improve over time without being explicitly programmed.",
    "Natural language processing (NLP) allows computers to understand, interpret, and respond to human language in meaningful ways.",
    "Deep learning, a subset of machine learning, utilizes neural networks with multiple layers to extract high-level features from data.",
    "Transformers, a type of deep learning model, have revolutionized natural language processing tasks with their attention mechanism.",
    "Self-supervised learning allows models to learn from raw data without requiring labeled examples.",
    "In reinforcement learning, agents learn by interacting with their environment and receiving rewards or penalties.",
    "Supervised learning requires labeled datasets to train models, while unsupervised learning identifies patterns in data without labels.",
    "Neural networks consist of layers of interconnected nodes that process information similarly to the human brain.",
    "Transfer learning leverages knowledge from one domain to improve performance in another domain.",
    "Text search algorithms help retrieve relevant information from vast datasets using techniques like keyword matching and semantic search.",
    "Inverted indices are commonly used in search engines to quickly find documents containing specific keywords.",
    "Embeddings are vector representations of words, phrases, or documents that capture their semantic meaning.",
    "Cosine similarity is a metric used to measure the similarity between two vectors by calculating the cosine of the angle between them.",
    "Document retrieval systems are designed to search, index, and retrieve documents relevant to user queries.",
    "The bag-of-words model represents text data as a collection of words without considering the order of the words.",
    "TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate the importance of a word in a document relative to a corpus.",
    "Clustering algorithms like K-Means group similar data points together based on features or distance metrics.",
    "Dimensionality reduction techniques like PCA (Principal Component Analysis) reduce the number of features while preserving important information.",
    "Data augmentation techniques like rotation, flipping, and scaling are used in image processing to improve model generalization.",
    "Regularization methods like L2 regularization help prevent overfitting in machine learning models by adding a penalty for large weights.",
    "Gradient descent is an optimization algorithm used to minimize the loss function in machine learning models.",
    "Random forests are ensemble learning methods that combine the outputs of multiple decision trees to make predictions.",
    "Support Vector Machines (SVM) classify data by finding the hyperplane that best separates different classes.",
    "Bayesian networks represent probabilistic relationships among variables and can be used for reasoning under uncertainty.",
    "Markov Chains model systems where the probability of transitioning from one state to another depends only on the current state.",
    "I am a random sentence about biology. I don't have anything to do with the query"
]


query = "What is machine learning?"

In [9]:
# Number of lines to combine
N = 3

# Example of documents copy and pasted from a machine learning paper
# Documents are partially parsed/cleaned, but are overall less organized
with open('cleaned-tcn-description.txt', 'r') as f:
    documents = f.readlines()
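# Split each line into sentences, skipping fragments of 5 characters or fewer
temp = []
for doc in documents:
    if len(doc) <= 5:
        continue
    for sentence in doc.split('.'):
        if len(sentence) <= 5:
            continue
        temp.append(sentence)
documents = temp
# Join every N consecutive sentences into a single document
new_documents = []
temp = []
for doc in documents:
    if len(temp) == N:
        new_documents.append(' '.join(temp))
        temp = []
    temp.append(doc)
# Note: sentences still in temp after the loop are not added to new_documents
tcn_documents = new_documents
tcn_original_documents = documents.copy()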

In [11]:
print("Length of TCN Documents:", len(tcn_documents))
print("Sample line:", tcn_documents[5])

Length of TCN Documents: 58
Sample line: Overview
 A TCN, short for Temporal Convolutional Network, consists of dilated, causal 1D convolutional layers with the same input and output lengths  The following sections go into detail about what these terms actually mean
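
The grouping loop in the cell above only stores a group when the next sentence arrives, so the final group is left out of `tcn_documents`. If you would rather keep every sentence, one alternative is to slice the list in steps of `N`. This is a sketch with a hypothetical helper name, and using it on the TCN text would likely change the document count printed above.

```python
def group_sentences(sentences, n):
    """Join every n consecutive sentences, keeping a shorter final group."""
    return [" ".join(sentences[start:start + n]) for start in range(0, len(sentences), n)]

demo = ["s1", "s2", "s3", "s4", "s5", "s6", "s7"]
print(group_sentences(demo, 3))
# ['s1 s2 s3', 's4 s5 s6', 's7']
```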


In [None]:
# TODO:
# Instantiate the class, load the documents, and search with the query
# Try with both the organized documents and the TCN Paper



# Training the sentence transformer on custom data

While the sentence transformer encodes the data into vectors, it might not perform as well on some sentences as on others. To help it perform better, we can train it on a custom dataset so that its similarity scores become more accurate for our data.

## Quora Dataset
For the purpose of this workshop, we will use a fairly general dataset; for specific use cases, it is better to train the transformer on sentences similar to the ones you will be encoding. The Quora dataset consists of pairs of questions and a label indicating whether they are essentially the same question. In other words, `is_duplicate=True` if the two questions have strong similarity.

In [None]:
from datasets import load_dataset
import os

os.environ["WANDB_DISABLED"] = "true"
dataset = load_dataset('quora', 'en', split='train')

In [None]:
for example in dataset.shuffle(seed=42).select(range(3)):
    print(example)

{'questions': {'id': [62616, 62617], 'text': ["How do I know if I'm good?", "How do I know what I'm good at?"]}, 'is_duplicate': False}
{'questions': {'id': [213590, 149174], 'text': ['How do I build my profile for top B-schools?', 'How do I build my profile for Harvard, Wharton, INSEAD etc.?']}, 'is_duplicate': True}
{'questions': {'id': [416022, 416023], 'text': ['How good is new Zealand for post graduation studies especially in management? And what are the job prospects in new zealand?', 'What are the job prospects for health care management in New Zealand?']}, 'is_duplicate': False}


In [None]:
train_examples = []
# Separates the two questions and provides a label for the training
# is_duplicate=True -> 1, is_duplicate=False -> 0
for example in dataset.shuffle(seed=42).select(range(10000)):
    question1 = example['questions']['text'][0]
    question2 = example['questions']['text'][1]
    label = float(example['is_duplicate'])
    train_examples.append(InputExample(texts=[question1, question2], label=label))

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)
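
One point to be aware of: as described in the sentence-transformers documentation, `MultipleNegativesRankingLoss` treats each input pair as an (anchor, positive) pair and uses the other pairs in the batch as negatives, so the `label` value is not what tells it which pairs are similar. Under that assumption, one option is to keep only the duplicate pairs before building the training examples. The sketch below shows just the filtering step on plain dictionaries shaped like the records printed earlier; `positive_pairs` is a hypothetical helper, and the `id` fields are omitted for brevity.

```python
def positive_pairs(records):
    """Keep only the question pairs labeled as duplicates."""
    pairs = []
    for record in records:
        if record["is_duplicate"]:
            first, second = record["questions"]["text"]
            pairs.append((first, second))
    return pairs

# Records shaped like the dataset samples shown earlier (ids omitted).
sample_records = [
    {"questions": {"text": ["How do I know if I'm good?",
                            "How do I know what I'm good at?"]},
     "is_duplicate": False},
    {"questions": {"text": ["How do I build my profile for top B-schools?",
                            "How do I build my profile for Harvard, Wharton, INSEAD etc.?"]},
     "is_duplicate": True},
]
print(positive_pairs(sample_records))
```

Each resulting tuple could then be wrapped as `InputExample(texts=[first, second])`. Alternatively, a loss that does use the 0/1 label could be swapped in and the full set of pairs kept.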

In [None]:
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=4,
    warmup_steps=100,
    output_path='./output/training_stsbenchmark_quora',
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


| Step | Training Loss |
|------|---------------|
| 500  | 0.3429 |
| 1000 | 0.2482 |
| 1500 | 0.1629 |
| 2000 | 0.1229 |
| 2500 | 0.0895 |
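
The final logged step of 2500 lines up with the training settings used above. As a quick check, assuming one optimizer step per batch (no gradient accumulation):

```python
import math

num_examples = 10_000   # rows selected from the Quora dataset
batch_size = 16         # batch size given to the DataLoader
epochs = 4              # epochs passed to model.fit

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 625 2500
```

This also puts `warmup_steps=100` in context: the warmup covers the first 100 of the 2500 steps.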

