<a href="https://colab.research.google.com/github/parthasarathydNU/gen-ai-coursework/blob/main/sentence-similarity/notebooks/sentence-similarity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task:

Derive a similarity score between two sentences using three different techniques.
- Reference Article: [A beginner’s guide to measuring sentence similarity](https://medium.com/@igniobydigitate/a-beginners-guide-to-measuring-sentence-similarity-f3c78b9da0bc)

## Sentence Embedding

A sentence embedding represents a sentence as a vector of numbers. In a word embedding, each dimension corresponds to a particular feature or aspect of the word. A sentence embedding is based on a similar concept: its dimensions collectively capture different aspects of the words used in the sentence, its grammatical structure, and possibly other underlying information.

There are various ways in which a sentence embedding can be created. Once we have each sentence represented as a vector of numbers, then the problem of finding sentence similarity translates to the problem of finding similarity between these numeric vectors.

In this notebook I will discuss a couple of statistical techniques to create numeric representations of sentences and briefly explore an idea of how one can utilize word embeddings for the same task. I will also discuss how similarity between sentence embeddings can be computed.

## Sample sentences

We take sentences from two unrelated movies to work with. The goal is to demonstrate how sentences turn out to be similar or dissimilar across these movies. I expect sentences from Spider-Man to show higher similarity with other sentences from the same movie and lower similarity with sentences from The Godfather.

In [None]:
import numpy as np
import pandas as pd

In [None]:
# Sample sentences from Spider-Man
spiderman_sentences = [
    "With great power comes great responsibility.",
    "I missed the part where that's my problem.",
    "You're not Superman, you know.",
    "Remember, with great power comes great responsibility.",
    "I'm just Peter Parker. I'm Spider-Man no more.",
    "Whatever life holds in store for me, I will never forget these words.",
    "The truth is, I am Spider-Man.",
    "This is my gift, my curse. Who am I? I'm Spider-Man.",
    "Sometimes, to do what's right, we have to be steady and give up the things we want the most.",
    "I want to tell you the truth... here it is: I'm Spider-Man."
]

# Sample sentences from The Godfather
godfather_sentences = [
    "I'm gonna make him an offer he can't refuse.",
    "Revenge is a dish best served cold.",
    "A man who doesn't spend time with his family can never be a real man.",
    "Leave the gun. Take the cannoli.",
    "The lawyer with the briefcase can steal more money than the man with the gun.",
    "It's not personal, Sonny. It's strictly business.",
    "Women and children can be careless, but not men.",
    "Power wears out those who do not have it.",
    "Friendship is everything. Friendship is more than talent. It is more than the government. It is almost the equal of family.",
    "Great men are not born great, they grow great."
]

all_sentences = spiderman_sentences + godfather_sentences

## Bag of words

The basic idea is to find out which words are present in a sentence and assess the importance of a word based on how many times it occurs in a sentence.

#### Creating a dictionary and removing stop words

Words such as is, are, a, an, the etc do not add much value in terms of providing context to a sentence. These are called stop words. So before we go ahead and count the frequency of words, we want to remove these stop words from the sentences.

In [None]:
# Define a list of stop words
stop_words = {
    'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves',
    'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their',
    'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was',
    'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the',
    'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against',
    'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
    'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how',
    'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own',
    'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now'
}

#### Removing punctuation
We also strip punctuation from each word, so that `men` and `men.` are not treated as different words. Ideally this happens before the stop word check, so that a token such as `me,` is recognized as a stop word. Note that `remove_stop_words` below checks the raw token against the stop word list first and strips punctuation afterwards, so `me,` is kept as `me`.

Explanation of code below:

- str.maketrans('', '', string.punctuation) creates a translation table that maps each character in string.punctuation to None.
- str.maketrans is a static method that returns a translation table usable for str.translate.
- The first two arguments are empty strings ('') because we are not replacing any characters, only removing.
- The third argument is string.punctuation, which contains all punctuation characters.
- word.translate(...) uses the translation table to remove all punctuation characters from the word.

In [None]:
import string
def remove_punctuation(word):
    return word.translate(str.maketrans('', '', string.punctuation))
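
To make the behaviour of the translation table concrete, here is a small usage sketch. The sample tokens are chosen only for illustration; the function is the same as `remove_punctuation` above, repeated so that the snippet runs on its own.

```python
import string

def remove_punctuation(word):
    # Delete every character that appears in string.punctuation
    return word.translate(str.maketrans('', '', string.punctuation))

# Illustrative tokens: trailing period, apostrophe, hyphen, punctuation only
examples = ["men.", "that's", "Spider-Man", "..."]
cleaned = [remove_punctuation(token) for token in examples]
print(cleaned)  # ['men', 'thats', 'SpiderMan', '']
```

A token made up entirely of punctuation becomes an empty string, which is why the later `.split()` call is useful: it drops those empty tokens.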

In [None]:
# Sample sentence
sentence = "This is an example sentence showing the removal of stop words."

def remove_stop_words(sentence):

    # Tokenize the sentence
    words = sentence.lower().split()

    # Remove stop words
    filtered_words = [remove_punctuation(word) for word in words if word not in stop_words]

    # Join the words back into a sentence
    filtered_sentence = ' '.join(filtered_words)

    return filtered_sentence

print("Original sentence:", sentence)
print("Filtered sentence:", remove_stop_words(sentence))

Original sentence: This is an example sentence showing the removal of stop words.
Filtered sentence: example sentence showing removal stop words
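
`remove_stop_words` checks each raw token against the stop word list before stripping punctuation, so a token like `me,` is not recognized as a stop word. The sketch below shows one possible variant that strips punctuation first. The function name `remove_stop_words_punct_first` and the small `STOP_WORDS` set are assumptions made for this example; the rest of the notebook keeps using the original function, so its printed outputs are unaffected.

```python
import string

# Small illustrative subset; in the notebook the full `stop_words` set would be passed in
STOP_WORDS = {"i", "me", "my", "the", "for", "in", "will", "these", "is"}

def remove_stop_words_punct_first(sentence, stop_words=STOP_WORDS):
    table = str.maketrans('', '', string.punctuation)
    # Strip punctuation from every token before the stop word check
    tokens = (word.translate(table) for word in sentence.lower().split())
    # Drop empty tokens (pure punctuation) and stop words
    return ' '.join(token for token in tokens if token and token not in stop_words)

result = remove_stop_words_punct_first(
    "Whatever life holds in store for me, I will never forget these words."
)
print(result)  # whatever life holds store never forget words
```

Contractions are still not handled: `I'm` becomes `im`, which is not in the stop word list either way.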


In [None]:
# Removing stopwords from all sentences in our database

spiderman_stop_removed = []
godfather_stop_removed = []

for sentence in spiderman_sentences:
    spiderman_stop_removed.append(remove_stop_words(sentence).split())

for sentence in godfather_sentences:
    godfather_stop_removed.append(remove_stop_words(sentence).split())

print("spiderman_original:")
print(spiderman_sentences[0])
print(spiderman_sentences[1])
print(spiderman_sentences[2])
print()
print("spiderman_stop_removed:")
print(spiderman_stop_removed[0])
print(spiderman_stop_removed[1])
print(spiderman_stop_removed[2])


spiderman_original:
With great power comes great responsibility.
I missed the part where that's my problem.
You're not Superman, you know.

spiderman_stop_removed:
['great', 'power', 'comes', 'great', 'responsibility']
['missed', 'part', 'thats', 'problem']
['youre', 'superman', 'know']


#### Creating a dictionary of words in all sentences

Now that we have removed the stop words, let's create a dictionary of all remaining words and build a dataset in which each sentence is a row of word frequencies.

In [None]:
# Creating a set of all unique words

# Let's combine all arrays into one
all_sentences_stop_removed = spiderman_stop_removed+ godfather_stop_removed

# get unique words from all sentences and put it into a set
unique_words = set()
for sentence in all_sentences_stop_removed:
    for word in sentence:
        unique_words.add(word)

print(f"Total unique words : {len(unique_words)}")

Total unique words : 82


#### Creating the dataframe that shows the word count of each sentence

In [None]:
from collections import Counter

In [None]:
# Create a frequency matrix
frequency_matrix = []

for sentence in all_sentences_stop_removed:
    word_count = Counter(sentence)
    frequency_matrix.append([word_count.get(word, 0) for word in unique_words])

# Create a DataFrame
df = pd.DataFrame(frequency_matrix, columns=list(unique_words))
print(df)

    remember  equal  know  superman  man  lawyer  money  real  cannoli  im  \
0          0      0     0         0    0       0      0     0        0   0   
1          0      0     0         0    0       0      0     0        0   0   
2          0      0     1         1    0       0      0     0        0   0   
3          1      0     0         0    0       0      0     0        0   0   
4          0      0     0         0    0       0      0     0        0   2   
5          0      0     0         0    0       0      0     0        0   0   
6          0      0     0         0    0       0      0     0        0   0   
7          0      0     0         0    0       0      0     0        0   1   
8          0      0     0         0    0       0      0     0        0   0   
9          0      0     0         0    0       0      0     0        0   1   
10         0      0     0         0    0       0      0     0        0   1   
11         0      0     0         0    0       0      0     0   

#### Using cosine similarity to calculate how similar the sentences are

Consider two n-dimensional arrays plotted as vectors in an n-dimensional space. Cosine similarity is the cosine of the angle between these two vectors and ranges from -1 to 1. Word-count vectors have no negative entries, so for them the value lies between 0 and 1. Mathematically, given two vectors A and B, cosine similarity is calculated as follows:

$$\text{cosine}(A, B) = \frac{A \cdot B}{|A| \, |B|}$$

where,

- A · B = Dot product of the two vectors. It is calculated by summing the products of corresponding vector values.
- |A|, |B| = Magnitude of each vector. It is the square root of the sum of squares of all the vector values.
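
As a worked example of the formula, the sketch below computes the score by hand for the word counts of "With great power comes great responsibility." and "Remember, with great power comes great responsibility." after stop word removal. The column order and the helper name `cosine_similarity` are choices made for this example; only the standard library is used.

```python
import math

def cosine_similarity(a, b):
    # Dot product: sum of the products of corresponding values
    dot = sum(x * y for x, y in zip(a, b))
    # Magnitude: square root of the sum of squares
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Columns: great, power, comes, responsibility, remember
a = [2, 1, 1, 1, 0]
b = [2, 1, 1, 1, 1]

# dot = 4 + 1 + 1 + 1 + 0 = 7, |a| = sqrt(7), |b| = sqrt(8)
score = cosine_similarity(a, b)
print(round(score, 4))  # 0.9354
```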

#### The cosine similarity function

In [None]:
def consine(vec1, vec2):
    # Compute the dot product
    dot_product = np.dot(vec1, vec2)

    # Compute the Euclidean norm (magnitude) of each vector
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)

    # Compute the cosine similarity
    if norm_vec1 == 0 or norm_vec2 == 0:
        return 0.0
    else:
        return dot_product / (norm_vec1 * norm_vec2)

In [None]:
# Testing this between two sentences within the spiderman movie
print(all_sentences)
print(len(all_sentences))

['With great power comes great responsibility.', "I missed the part where that's my problem.", "You're not Superman, you know.", 'Remember, with great power comes great responsibility.', "I'm just Peter Parker. I'm Spider-Man no more.", 'Whatever life holds in store for me, I will never forget these words.', 'The truth is, I am Spider-Man.', "This is my gift, my curse. Who am I? I'm Spider-Man.", "Sometimes, to do what's right, we have to be steady and give up the things we want the most.", "I want to tell you the truth... here it is: I'm Spider-Man.", "I'm gonna make him an offer he can't refuse.", 'Revenge is a dish best served cold.', "A man who doesn't spend time with his family can never be a real man.", 'Leave the gun. Take the cannoli.', 'The lawyer with the briefcase can steal more money than the man with the gun.', "It's not personal, Sonny. It's strictly business.", 'Women and children can be careless, but not men.', 'Power wears out those who do not have it.', 'Friendship is

In [None]:
# Let's try the first and the 4th statement
all_sentences[0]

'With great power comes great responsibility.'

In [None]:
all_sentences[3]

'Remember, with great power comes great responsibility.'

In [None]:
consine(df.loc[0], df.loc[3])

0.9354143466934852

In [None]:
# Let's pick a sentence that does not have power or responsibility in it
all_sentences[18]

'Friendship is everything. Friendship is more than talent. It is more than the government. It is almost the equal of family.'

In [None]:
consine(df.loc[0], df.loc[18])

0.0

These two sentences are not similar at all!

Let's write a function that, given two sentences, runs this process end to end.

In [None]:
def findSimilarityBagOfWords(sentence1, sentence2):
    # combining sentences to an array
    sentences = [sentence1, sentence2]
    stop_removed = []

    # removing stop words and punctuation from the words
    for sentence in sentences:
        stop_removed.append(remove_stop_words(sentence).split())

    # getting list of unique words
    unique_words = set()
    for sentence in stop_removed:
        for word in sentence:
            unique_words.add(word)

    # Create a frequency matrix
    frequency_matrix = []

    for sentence in stop_removed:
        word_count = Counter(sentence)
        frequency_matrix.append([word_count.get(word, 0) for word in unique_words])

    # Create a DataFrame
    df = pd.DataFrame(frequency_matrix, columns=list(unique_words))
    print(f"Similarity score: {consine(df.loc[0], df.loc[1])}")
    return df

In [None]:
findSimilarityBagOfWords("Who let the dogs out!", "Who let the cats out!")

Similarity score: 0.6666666666666667


   out  dogs  let  cats
0    1     1    1     0
1    1     0    1     1


In [None]:
findSimilarityBagOfWords("Mamma mia , here we go again ", "My my! how can I resist you ?")

Similarity score: 0.0


   mia  resist  mamma  go  my
0    1       0      1   1   0
1    0       1      0   0   1


In [None]:
paragraph1 = "In the bustling city of Metropolis, the skyline is dominated by towering skyscrapers that reach for the heavens. The streets below are a hive of activity, with people from all walks of life hurrying to and fro. The air is filled with the sounds of honking cars, distant sirens, and the constant chatter of passersby. Amidst the urban chaos, pockets of tranquility can be found in the form of small parks and green spaces, offering a brief respite from the hustle and bustle. At night, the city transforms into a sea of lights, with neon signs and street lamps illuminating the dark, creating a vibrant and lively atmosphere that never seems to sleep."
paragraph2 = "In the tranquil town of Riverview, life moves at a leisurely pace. The town is known for its picturesque scenery, with rolling hills and a serene river that winds its way through the heart of the community. The streets are lined with charming houses, each with well-tended gardens bursting with colorful flowers. The sound of birds singing fills the air, and the occasional laughter of children playing can be heard in the distance. Riverview's town square is a hub of local activity, where residents gather for farmers' markets, craft fairs, and community events. As the sun sets, the town is bathed in a golden glow, and the sky is painted with hues of pink and orange, bringing a peaceful end to another day in this idyllic setting."

print(paragraph1)
print()
print(paragraph2)

In the bustling city of Metropolis, the skyline is dominated by towering skyscrapers that reach for the heavens. The streets below are a hive of activity, with people from all walks of life hurrying to and fro. The air is filled with the sounds of honking cars, distant sirens, and the constant chatter of passersby. Amidst the urban chaos, pockets of tranquility can be found in the form of small parks and green spaces, offering a brief respite from the hustle and bustle. At night, the city transforms into a sea of lights, with neon signs and street lamps illuminating the dark, creating a vibrant and lively atmosphere that never seems to sleep.

In the tranquil town of Riverview, life moves at a leisurely pace. The town is known for its picturesque scenery, with rolling hills and a serene river that winds its way through the heart of the community. The streets are lined with charming houses, each with well-tended gardens bursting with colorful flowers. The sound of birds singing fills th

In [None]:
findSimilarityBagOfWords(paragraph1, paragraph2)

Similarity score: 0.05466133744605251


   streets  markets  hustle  occasional  bringing  neon  metropolis  hub  pink  orange  ...  distance  serene  bathed  community  form  scenery  activity  hues  residents  day
0        1        0       1           0         0     1           1    0     0       0  ...         0       0       0          0     1        0         1     0          0    0
1        1        1       0           1         1     0           0    1     1       1  ...         1       1       1          2     0        1         1     1          1    1


## TF-IDF

#### Explanation of TF-IDF

TF-IDF stands for Term Frequency - Inverse Document Frequency.

The bag of words approach gives equal weight to all words. TF-IDF is based on the rationale that the most common words are usually the least significant ones. While the bag of words approach relies on a hand-written stop word list, **TF-IDF automatically gives less weight to words that appear frequently across the whole corpus.**

Let's break it down further:

**Term Frequency**: How frequently a term appears within a given document. A document can be a sentence, a paragraph, or a whole text.

**Inverse Document Frequency**: How rare the word is across all the documents in the corpus.

IDF is calculated by taking the logarithm of the ratio of the total number of documents and the number of documents containing the word (document frequency). The more frequently the word appears across the corpus, the lower its inverse document frequency making it less important. Similarly, the rarer the word in the corpus, the higher its inverse document frequency.
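The calculation described above can be sketched in a few lines. This assumes the natural logarithm and no smoothing, which matches the `math.log(number_of_documents / ...)` expression used in this notebook; the helper name `inverse_document_frequency` is made up for the example. The corpus size of 20 corresponds to the 20 sample sentences.

```python
import math

def inverse_document_frequency(total_documents, documents_containing_word):
    # log(N / df): the more documents contain the word, the lower the value
    return math.log(total_documents / documents_containing_word)

# A corpus of 20 documents, with words found in 1, 2, 3 and all 20 of them
for document_frequency in (1, 2, 3, 20):
    value = inverse_document_frequency(20, document_frequency)
    print(document_frequency, round(value, 6))
```

A word that appears in a single document gets roughly 2.995732, while a word that appears in every document gets 0 and so contributes nothing to the vector.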

#### What makes TF-IDF better than bag of words?

Let's find out by trying it out.

**Step 1**: Create a corpus of sentences. We already did this for bag of words.

**Step 2**: Create a dictionary of words from the corpus. We already did this for bag of words (`unique_words`).

**Step 3**: Create the sentence vectors. This is where bag of words and TF-IDF differ. Instead of using the term frequency alone, we also calculate the inverse document frequency and multiply it by the TF to get the TF-IDF value for each word in the sentence.

In [None]:
import math

# Count how many documents (sentences) each word appears in.
# Despite its name, `idf` holds the raw document frequency; the logarithm is applied further down.
idf = {}

# all_sentences_stop_removed is a list of token lists, one per sentence
for words_arr in all_sentences_stop_removed:
    unique_words_in_sentence = set(words_arr)

    # Each word counts at most once per sentence
    for word in unique_words_in_sentence:
        if word in idf:
            idf[word] = idf[word] + 1
        else:
            idf[word] = 1

number_of_documents = len(all_sentences_stop_removed)

# Create the TF-IDF matrix
# Each row corresponds to one sentence
# Each column corresponds to one word in unique_words
frequency_matrix = []

for sentence in all_sentences_stop_removed:
    word_count = Counter(sentence)
    # TF-IDF = term frequency * log(number of documents / document frequency)
    frequency_matrix.append([word_count.get(word, 0) * math.log(number_of_documents / idf[word]) for word in unique_words])

# Create a DataFrame
df = pd.DataFrame(frequency_matrix, columns=list(unique_words))
print(df)

    remember     equal      know  superman       man    lawyer     money  \
0   0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
1   0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
2   0.000000  0.000000  2.995732  2.995732  0.000000  0.000000  0.000000   
3   2.995732  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
4   0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
5   0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
6   0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
7   0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
8   0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
9   0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
10  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
11  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
12  0.000000

We notice that the sentence vectors now hold weighted values instead of raw counts. However, the similarity calculation remains the same: we can apply cosine similarity to the rows of this dataframe.

In [None]:
print(all_sentences[0])
print(all_sentences[3])
consine(df.loc[0], df.loc[3])

With great power comes great responsibility.
Remember, with great power comes great responsibility.


0.8724395002378096

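With bag of words, we got 0.9354143466934852.

With TF-IDF, we get 0.8724395002378096. The score drops because `remember`, the one word the two sentences do not share, appears in only one of the 20 sentences and therefore gets a high IDF (log(20/1) ≈ 3.0), so the mismatch counts for more than it did with raw counts.

Let's wrap this in its own function so that we can compare the methods side by side.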

In [None]:
def findSimilarityTfIdf(sentence1, sentence2):
    # combining sentences to an array
    sentences = [sentence1, sentence2]
    stop_removed = []

    # removing stop words and punctuation from the words
    for sentence in sentences:
        stop_removed.append(remove_stop_words(sentence).split())

    # getting list of unique words
    unique_words = set()
    for sentence in stop_removed:
        for word in sentence:
            unique_words.add(word)

    # Let's go through all sentences and words and create a counter of how many times a word has appeared across documents
    idf = {}
    # We pick the array of all sentences that have stop words removed [][]
    for words_arr in stop_removed:
        unique_words_in_sentence = set(words_arr)

        # for each of these unique words, increment the value in the idf dictionary
        for word in unique_words_in_sentence:
            if word in idf :
                idf[word] =  idf[word] + 1
            else:
                idf[word] = 1

    number_of_documents = len(stop_removed)

    # Create the TF-IDF matrix
    # Each row corresponds to one sentence
    # Each column corresponds to one unique word across the two sentences
    frequency_matrix = []

    for sentence in stop_removed:
        sentence_word_counter = Counter(sentence)
        frequency_matrix.append([sentence_word_counter.get(word, 0) * (math.log(number_of_documents / idf[word]) ) for word in unique_words])

    # Create a DataFrame
    df = pd.DataFrame(frequency_matrix, columns=list(unique_words))
    print(f"Similarity score: {consine(df.loc[0], df.loc[1])}")
    return df

In [None]:
def compareMethods(sentence1, sentence2, showDf):
    print(f"Sentence 1: {sentence1}")
    print()
    print(f"Sentence 2: {sentence2}")
    print()
    print("Bag of words")
    bowDf = findSimilarityBagOfWords(sentence1, sentence2)
    if(showDf):
        print(bowDf)
    print()
    print("TF-IDF")
    tfIdfDf = findSimilarityTfIdf(sentence1, sentence2)
    if(showDf):
        print(tfIdfDf)

In [None]:
compareMethods("Who let the dogs out!", "Who let the cats out!", True)

Sentence 1: Who let the dogs out!
Sentence 2: Who let the cats out!

Bag of words
Similarity score: 0.6666666666666667
   out  dogs  let  cats
0    1     1    1     0
1    1     0    1     1

TF-IDF
Similarity score: 0.0
   out      dogs  let      cats
0  0.0  0.693147  0.0  0.000000
1  0.0  0.000000  0.0  0.693147


## Insights

We compared two sentences:
- Who let the dogs out!
- Who let the cats out!

Here are the vectors that the two methods produced:

TF-IDF:
```
   out      dogs  let      cats
0  0.0  0.693147  0.0  0.000000
1  0.0  0.000000  0.0  0.693147
```

Bag of words:
```
   out  dogs  let  cats
0    1     1    1     0
1    1     0    1     1
```

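TF-IDF assigns zero weight to `out` and `let`, the words that appear in both sentences, while bag of words gives them the same weight as the words that differ. This shows how TF-IDF shifts weight away from common words and toward rare ones. Note that the corpus here consists of only the two sentences being compared, so every shared word has an IDF of log(2/2) = 0 and the TF-IDF score is 0.0 however much the sentences overlap. IDF is more informative when it is computed over a larger corpus, as in the 20-sentence example earlier.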
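
The 0.0 score above depends on the corpus containing only the two sentences being compared. The sketch below shows one way to compute document frequencies over a separate, larger corpus instead. The function `tfidf_cosine` and the four-document toy corpus are assumptions made for this example, not part of the notebook's code, and the handling of words missing from the corpus is an arbitrary choice.

```python
import math
from collections import Counter

def tfidf_cosine(tokens_a, tokens_b, corpus):
    """Cosine similarity of two token lists, with IDF taken from `corpus`."""
    n_docs = len(corpus)
    # Number of corpus documents that contain each word
    doc_freq = Counter(word for doc in corpus for word in set(doc))
    vocab = sorted(set(tokens_a) | set(tokens_b))

    def vectorize(tokens):
        counts = Counter(tokens)
        # Assumption: a word absent from the corpus is treated as appearing in one document
        return [counts[w] * math.log(n_docs / doc_freq.get(w, 1)) for w in vocab]

    va, vb = vectorize(tokens_a), vectorize(tokens_b)
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(y * y for y in vb))
    return dot / norm if norm else 0.0

# Toy corpus of already-tokenized, stop-word-free sentences
corpus = [
    ["let", "dogs", "out"],
    ["let", "cats", "out"],
    ["great", "power", "comes", "great", "responsibility"],
    ["revenge", "dish", "best", "served", "cold"],
]

pair_only = tfidf_cosine(corpus[0], corpus[1], corpus[:2])  # IDF from the two sentences only
with_corpus = tfidf_cosine(corpus[0], corpus[1], corpus)    # IDF from all four documents
print(pair_only, round(with_corpus, 4))
```

With the two-sentence corpus the shared words vanish and the score is 0.0. With the four-document corpus, `let` and `out` keep a non-zero weight and the score rises to about 0.33.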

## Semantic understanding

While the above two methods are able to statistically capture the similarities amongst the words, they fail at instances where we need to capture the semantic understanding of the sentences and then find similar ones.

For example let's look at the following sentences where we praise two leaders of the world.

Barack Obama: "Barack Obama's eloquence and unwavering commitment to social justice and equality have inspired millions around the globe, making him a beacon of hope and progressive change."

Angela Merkel: "Angela Merkel's steadfast leadership and pragmatic approach to governance have earned her immense respect, as she navigated Germany through numerous crises with remarkable poise and integrity."

In [None]:
sentence1 = "Barack Obama's eloquence and unwavering commitment to social justice and equality have inspired millions around the globe, making him a beacon of hope and progressive change."
sentence2 = "Angela Merkel's steadfast leadership and pragmatic approach to governance have earned her immense respect, as she navigated Germany through numerous crises with remarkable poise and integrity."
compareMethods(sentence1, sentence2, showDf=False)

Sentence 1: Barack Obama's eloquence and unwavering commitment to social justice and equality have inspired millions around the globe, making him a beacon of hope and progressive change.

Sentence 2: Angela Merkel's steadfast leadership and pragmatic approach to governance have earned her immense respect, as she navigated Germany through numerous crises with remarkable poise and integrity.

Bag of words
Similarity score: 0.0

TF-IDF
Similarity score: 0.0


> This is crazy! Both sentences praise a world leader, yet they share no words once stop words are removed, so both methods score them at 0. This shows that bag of words and TF-IDF are fairly rudimentary techniques: they compare words, not meaning.

## Other Traditional Methods for Sentence Similarity

Before diving into neural network-based methods for calculating sentence similarity, it is beneficial to explore several traditional techniques. These methods are straightforward to implement and provide a solid foundation for understanding text similarity. Here are some notable traditional methods:

### 1. Jaccard Similarity
Jaccard similarity measures the similarity between two sets by comparing the size of their intersection to the size of their union. It is useful for comparing text based on the presence or absence of terms.

**Example:**
```python
def jaccard_similarity(set1, set2):
    intersection = set1.intersection(set2)
    union = set1.union(set2)
    return len(intersection) / len(union)
```

### 2. Levenshtein Distance (Edit Distance)
Levenshtein distance calculates the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another. The smaller the distance, the more similar the two strings.

**Example:**
```python
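# Third-party package, not part of the standard library (assumed installed, e.g. via `pip install Levenshtein`)
import Levenshtein

# sentence1 and sentence2 are the strings defined earlier in the notebook
levenshtein_distance = Levenshtein.distance(sentence1, sentence2)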
```
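
If installing a third-party package is not an option, the distance can also be computed with a short dynamic-programming routine. The following is a minimal sketch using only the standard library; the names `levenshtein` and `levenshtein_similarity` are made up for this example, and dividing by the longer length is just one common way to turn the distance into a 0 to 1 score.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]

def levenshtein_similarity(a, b):
    # Normalize by the longer string so that identical strings score 1.0
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest

print(levenshtein("kitten", "sitting"))  # 3
```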

### 3. Overlap Coefficient
The overlap coefficient measures the overlap between two sets relative to the smaller set. It is particularly useful when comparing sets of different sizes.

**Example:**
```python
def overlap_coefficient(set1, set2):
    intersection = set1.intersection(set2)
    return len(intersection) / min(len(set1), len(set2))
```

### 4. Dice Coefficient
The Dice coefficient measures the similarity between two sets based on the ratio of twice the size of the intersection to the sum of the sizes of the sets.

**Example:**
```python
def dice_coefficient(set1, set2):
    intersection = set1.intersection(set2)
    return 2 * len(intersection) / (len(set1) + len(set2))
```

### Summary
These traditional methods offer various ways to measure the similarity between sentences without requiring complex model training. Each method has its strengths and weaknesses, depending on the specific use case and the nature of the text data. Exploring these techniques provides a solid foundation before moving on to more advanced, neural network-based methods for calculating sentence similarity.

## Applying a third technique for assignment completion

In [None]:
def jaccard_similarity(set1, set2):
    intersection = set1.intersection(set2)
    union = set1.union(set2)
    return len(intersection) / len(union)

In [None]:
def overlap_coefficient(set1, set2):
    intersection = set1.intersection(set2)
    return len(intersection) / min(len(set1), len(set2))

In [None]:
def dice_coefficient(set1, set2):
    intersection = set1.intersection(set2)
    return 2 * len(intersection) / (len(set1) + len(set2))

In [None]:
# Comparing all traditional methods
def compare_other_traditional_methods(sentence1, sentence2):
    setSentence1 = set(sentence1.split())
    setSentence2 = set(sentence2.split())

    print(f"Jaccard similarity for sentence 1 and sentence 2 : {jaccard_similarity(setSentence1, setSentence2)}")
    print()
    print(f"Overlap Coefficient for sentence 1 and sentence 2 : {overlap_coefficient(setSentence1, setSentence2)}")
    print()
    print(f"Dice Coefficient for sentence 1 and sentence 2 : {dice_coefficient(setSentence1, setSentence2)}")

In [None]:
# Example 1:
sentence1 = "the cat is on the mat"
sentence2 = "the mat is under the cat"
# Example 2
sentence3 = "The quick brown fox jumped over the lazy dog"
sentence4 = "The fast brown fox jumped over the sleepy dog"

In [None]:
compare_other_traditional_methods(sentence1, sentence2)

Jaccard similarity for sentence 1 and sentence 2 : 0.6666666666666666

Overlap Coefficient for sentence 1 and sentence 2 : 0.8

Dice Coefficient for sentence 1 and sentence 2 : 0.8


# Further reading

We got a good gist of how similarity search works and how it can be implemented using traditional techniques. Next we will refer to the following resources and implement advanced techniques that help us to capture the theme, context and semantic meaning of the sentences while embedding them. These form the basis of applications like search engines, language translation and chatbots.

## Reference Links:
- [Word embeddings: Helping computers understand language semantics](https://medium.com/@igniobydigitate/word-embeddings-helping-computers-understand-language-semantics-dd3456b1f700)
- [word_vectors_game_of_thrones-LIVE- GitHub Notebook](https://github.com/llSourcell/word_vectors_game_of_thrones-LIVE/blob/master/Thrones2Vec.ipynb)

# How does ChatGPT understand words?

When dealing with advanced AI models like ChatGPT, it's important to understand that these models don't inherently grasp the concepts of meaning, grammar, or emotions. Instead, their functionality is rooted in recognizing and replicating patterns found within the vast amounts of text they were trained on during the model training phase. Essentially, these models operate as sophisticated pattern recognizers. For example, if ChatGPT were trained solely on texts with grammatical errors, it would likely produce outputs that mirror those same errors, because that's the pattern it recognizes as correct.

At the core of such models lies a mathematical framework—think of it as a complex equation with numerous variables, which we refer to as weights. During training, these weights are adjusted to capture the nuances and patterns of the language in the training corpus. However, to process language, which is inherently non-numerical, AI models like ChatGPT need to convert text into a form they can understand: numbers. This is done by transforming words into vectors, numerical representations that can be processed by the model. This transformation is where word embeddings come into play. Word embeddings are a powerful tool for translating the textual information into a numerical form that preserves the underlying meanings and relationships, setting the stage for how AI models process and generate language.

## What are Word Embeddings?

Word embeddings are a method for representing words in a numerical format, essentially providing a mathematical depiction of their meanings. These embeddings convert words into vectors of real-valued numbers. Conceptually, each dimension can be thought of as capturing some semantic feature or aspect of the word: one dimension might signify the word's gender association, while another could relate to its tense. In learned embeddings, individual dimensions are rarely this cleanly interpretable; the meaning is spread across many dimensions.

Think of word embeddings as a computer's dictionary. Just as we refer to a dictionary to understand the meanings of words, a computer utilizes word embeddings to retrieve the numerical vector representation of words. This allows the AI model to handle and process language data efficiently and meaningfully. Next, let’s explore how these word embeddings are calculated and how they contribute to the model's ability to interpret and generate human-like text.

### Example 1: Similarity
Word embeddings can capture semantic similarity by positioning similar words close together in the vector space. For instance:

- **Word:** "king"
- **Similar Words:** "queen", "monarch", "royalty"

The embeddings for these words will be closer in the vector space, indicating their relatedness in terms of meaning and usage.
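To make "closer in the vector space" concrete, here is a minimal sketch using cosine similarity, a common way to compare embeddings. The three-dimensional vectors in `toy_vectors` are invented for illustration and do not come from any trained model.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; values near 1.0 mean similar direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy vectors, made up for this example only.
toy_vectors = {
    "king":   np.array([0.9, 0.8, 0.1]),
    "queen":  np.array([0.9, 0.7, 0.2]),
    "banana": np.array([0.1, 0.0, 0.9]),
}

sim_king_queen = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_king_banana = cosine_similarity(toy_vectors["king"], toy_vectors["banana"])

print("king vs queen: ", round(sim_king_queen, 3))
print("king vs banana:", round(sim_king_banana, 3))
```

With these made-up numbers, "king" and "queen" score far higher than "king" and "banana", which is the behaviour a trained embedding is expected to show for related words.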

### Example 2: Relationships
Embeddings can also capture relationships between words, often exemplified by the famous example of vector arithmetic:

- **Calculation:** vector("king") - vector("man") + vector("woman")
- **Result:** vector close to "queen"

This demonstrates how embeddings can encode certain relationships and analogies, providing insights into linguistic structures.
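The arithmetic can be sketched with hand-built vectors. In the toy example below the three dimensions loosely stand for "royalty", "male" and "female"; both the vectors and the helper `analogy` are hypothetical and chosen so that the arithmetic works out exactly, which real embeddings only approximate.

```python
import numpy as np

# Hypothetical vectors: [royalty, male, female]. Invented for illustration.
toy_vectors = {
    "king":  np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "man":   np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "apple": np.array([0.0, 0.5, 0.5]),
}

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c, vectors):
    """Return the word closest to vector(a) - vector(b) + vector(c).

    The three input words are excluded from the candidates, as is usual
    when evaluating analogies.
    """
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))

result = analogy("king", "man", "woman", toy_vectors)
print(result)
```

On a trained gensim model the same kind of query is usually written as `model.wv.most_similar(positive=["king", "woman"], negative=["man"])`.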

### Example 3: Contextual Features
Different dimensions of a word's vector can represent various aspects of its meaning, such as tense or part of speech:

- **Word:** "run"
  - **Tense:** Past, Present, Future
  - **Form:** Verb, Noun (as in "a long run")

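Note that a static model such as Word2Vec stores a single vector per word form, so every use of "run" shares one embedding that blends these grammatical and semantic properties. Contextual models instead produce a different vector for each occurrence, depending on the surrounding words.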

### Example 4: Synonyms and Antonyms
Word embeddings can differentiate between synonyms and antonyms by their placement relative to one another:

- **Synonyms:** "happy", "joyful", "cheerful"
- **Antonyms:** "happy", "sad"

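The vectors for synonyms tend to lie near each other. Antonyms are ideally positioned further apart, although in practice they often occur in very similar contexts ("I am happy", "I am sad"), so a model trained purely on context can place them closer together than one might expect.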

These examples help illustrate how word embeddings provide a nuanced and multidimensional representation of language, enabling AI models to perform a variety of tasks that require a deep understanding of word meanings and relationships.


# Exploring this topic using text from the Game of Thrones book series "A Song of Ice and Fire"

Reference Notebook: https://github.com/llSourcell/word_vectors_game_of_thrones-LIVE/blob/master/Thrones2Vec.ipynb

You can refer to the above notebook for a quick rundown. Here I take a slower approach, adding more explanations and examples to illustrate each step of the process.

In [None]:
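import requests

url = "https://raw.githubusercontent.com/llSourcell/word_vectors_game_of_thrones-LIVE/master/data/got1.txt"

# Download the data and stop early if the request failed
response = requests.get(url)
response.raise_for_status()

# Save the data as a text file. This is saved in the local workspace as got1.txt.
# An explicit encoding keeps the curly quotes in the text intact on every platform.
with open("got1.txt", "w", encoding="utf-8") as text_file:
    text_file.write(response.text)

In [None]:
# Read the text file back with the same encoding
with open("got1.txt", "r", encoding="utf-8") as text_file:
    text = text_file.read()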

## Cleaning the raw data

Explanation of the Python script below:

- **Library Import and Model Download:** The script begins by importing the necessary modules from the `nltk` library, a standard toolkit for natural language processing. It then downloads the sentence tokenizer model (`punkt`) and the list of stop words.

- **Sentence Tokenization:** The text is split into individual sentences using `sent_tokenize`. This step is crucial for ensuring that contextual boundaries are respected during further processing.

- **Text Cleaning Function:** The `clean_text` function is defined to clean each sentence by:
  - **Removing Punctuation:** This reduces noise and prevents the model from treating punctuation as part of the word.
  - **Lowercasing Text:** Standardizes the text to avoid distinguishing the same words based on case.
  - **Removing Stop Words:** Filters out common but semantically weak words like "and", "the", etc., using NLTK's predefined list of stop words for English. This focuses the analysis on more meaningful content.

- **Applying Cleaning:** Each sentence in `sentences` is cleaned using the `clean_text` function, producing `cleaned_sentences`. This strips unnecessary characters and words as per the rules above.

- **Displaying Processed Text:** The cells that follow display the same slice of `sentences` and `cleaned_sentences` side by side. This provides a quick sample of the processed output and confirms the text is ready for further processing.

- **Use Case:** This preprocessing is foundational for more complex NLP tasks like sentiment analysis, topic modeling, or training machine learning models, where the quality of input data significantly impacts the output.

In [None]:
import nltk
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
import re

# Downloading the required models
nltk.download('punkt')
nltk.download('stopwords')

# Setting up stop words
stop_words = set(stopwords.words('english'))

# Function to clean text: remove punctuation, lowercase text, remove stop words
def clean_text(text):
    # Remove punctuation and convert text to lowercase
    text = re.sub(r'[^\w\s]', '', text).lower()
    # Split the text into words and remove stop words
    words = [word for word in text.split() if word not in stop_words]
    # Join words back into a single string
    return ' '.join(words)

# Splitting the text into sentences
sentences = sent_tokenize(text)

# Cleaning each sentence
cleaned_sentences = [clean_text(sentence) for sentence in sentences]

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### The original text

In [None]:
sentences[105:115]

['“The dragons cannot come to life.',
 'They are carved of stone, child.',
 'In olden days, our island was the westernmost outpost of the great Freehold of Valyria.',
 'It was the Valyrians who raised this citadel, and they had ways of shaping stone since lost to us.',
 'A castle must have towers wherever two walls meet at an angle, for defense.',
 'The Valyrians fashioned these towers in the shape of dragons to make their fortress seem more fearsome, just as they crowned their walls with a thousand gargoyles instead of simple crenellations.” He took her small pink hand in his own frail spotted one and gave it a gentle squeeze.',
 '“So you see, there is nothing to fear.”\n\nShireen was unconvinced.',
 '“What about the thing in the sky?',
 'Dalla and Matrice were talking by the well, and Dalla said she heard the red woman tell Mother that it was dragonsbreath.',
 'If the dragons are breathing, doesn’t that mean they are coming to life?”\n\nThe red woman, Maester Cressen thought sourly.']

### Cleaned Text

In [None]:
cleaned_sentences[105:115]

['dragons cannot come life',
 'carved stone child',
 'olden days island westernmost outpost great freehold valyria',
 'valyrians raised citadel ways shaping stone since lost us',
 'castle must towers wherever two walls meet angle defense',
 'valyrians fashioned towers shape dragons make fortress seem fearsome crowned walls thousand gargoyles instead simple crenellations took small pink hand frail spotted one gave gentle squeeze',
 'see nothing fear shireen unconvinced',
 'thing sky',
 'dalla matrice talking well dalla said heard red woman tell mother dragonsbreath',
 'dragons breathing doesnt mean coming life red woman maester cressen thought sourly']

## Understanding Word2Vec

Word2Vec is a popular machine learning technique in natural language processing (NLP) that transforms words into vectors. By representing words in a multi-dimensional vector space, Word2Vec captures the semantic relationships between words based on their contextual usage.

This allows the model to predict similar words, identify analogies, and even perform arithmetic operations with words (like "king" - "man" + "woman" ≈ "queen").

We use Word2Vec because it effectively captures word associations from a large corpus of text, which can significantly enhance various applications such as recommendation systems, search engines, and text analysis.

The gensim library's Word2Vec model is used for training.

In [16]:
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

In [18]:

# Tokenizing each sentence into words: This step converts each cleaned sentence
# into a list of words.
# Word2Vec requires input as a list of word lists.
tokenized_sentences = [word_tokenize(sentence) for sentence in cleaned_sentences]
tokenized_sentences[115]

['ill',
 'enough',
 'shes',
 'filled',
 'head',
 'mother',
 'madness',
 'must',
 'poison',
 'daughters',
 'dreams',
 'well']

## Initializing the Word2Vec model

In [19]:
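# This step sets up the model with the desired parameters.
# Because the corpus is passed to the constructor, gensim also builds the
# vocabulary and trains the model for the default number of epochs here.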

# vector_size: The number of "dimensions" of the embeddings.
#              Check the paragraph below to understand what dimensions mean
# window:      The maximum distance between a target word and words around the
#              target word.
# min_count:   The minimum occurrence of words to consider when training the model;
#              words with an occurrence less than this count will be ignored.
#              Here we say that the word has to appear at least once
# workers:     The number of worker threads to use in training the model for
#              parallelization.
model = Word2Vec(tokenized_sentences, vector_size=100, window=5, min_count=1, workers=4)
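To illustrate what the `window` parameter controls, the sketch below lists the (target, context) word pairs that a window of a given size produces for one of the cleaned sentences shown earlier. The helper `context_pairs` is hypothetical and deliberately simplified; real implementations add refinements, such as subsampling of frequent words, that are omitted here.

```python
def context_pairs(tokens, window):
    """Pair every target word with the words at most `window` positions away."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["dragons", "cannot", "come", "life"]
pairs = context_pairs(tokens, window=1)

for target, context in pairs:
    print(target, "->", context)
```

A larger window pairs each word with more distant neighbours, which tends to emphasise topical relatedness, while a smaller window focuses on the immediate neighbours.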

### Dimensions

In Word2Vec and similar word embedding models, "dimensions" refer to the number of features in each word vector. Here's a concise explanation:

1. **Semantic Features**: Each dimension represents different semantic or grammatical aspects of a word, although these specific features are not explicitly labeled.

2. **Vector Space**: Words with similar meanings are positioned closer together in this multi-dimensional space, facilitating tasks like synonym identification or analogy solving.

3. **Dimensionality Choice**: Common sizes for vectors are 100, 200, or 300 dimensions. More dimensions can capture finer semantic details but require more data and computational resources. Fewer dimensions may not adequately represent complex word relationships.

4. **Importance**: The number of dimensions determines the richness of the semantic information encoded. Optimal dimensionality balances detailed representation with computational efficiency and data availability.
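As a rough illustration of the cost side of this trade-off, the sketch below estimates the size of a single embedding matrix for different dimensionalities. The vocabulary size of 20,000 is a hypothetical figure, and 4 bytes per value assumes 32-bit floats; a trained model also holds additional internal weights, so treat this as a lower bound rather than an exact figure.

```python
def embedding_matrix_bytes(vocab_size, vector_size, bytes_per_value=4):
    """Memory for one vocab_size x vector_size matrix (32-bit floats assumed)."""
    return vocab_size * vector_size * bytes_per_value

vocab_size = 20_000  # hypothetical vocabulary size, for illustration only

for dims in (100, 200, 300):
    megabytes = embedding_matrix_bytes(vocab_size, dims) / 1e6
    print(f"{dims} dimensions: {megabytes:.0f} MB")
```

The memory grows linearly with the number of dimensions, and so does the work done per training update.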

## Training the model

In [20]:
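# Continuing training: the constructor call above already built the vocabulary
# and trained the model once, so this call runs model.epochs additional passes
# over the tokenized sentences.
# train() returns a tuple: (effective word count, raw word count).
model.train(tokenized_sentences, total_examples=model.corpus_count, epochs=model.epochs)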



(819115, 845200)

In the context of training the Word2Vec model, `corpus_count` and `epochs` are key parameters used during the model training process. Let’s delve into what each of these terms means and where they are defined:

### `corpus_count`
- **Definition**: `corpus_count` is a property of the Word2Vec model that indicates the total number of sentences or documents in the dataset used for training the model.
- **Where It’s Defined**: It is automatically set by the Word2Vec model when you first pass your training data (in this case, `tokenized_sentences`) to it. This occurs during the initialization or any preprocessing stage where the model is made aware of the data structure and size. For instance, when you create a new Word2Vec instance and provide your tokenized sentences, it calculates how many sentences (or documents) are in your data.

### `epochs`
- **Definition**: `epochs` refers to the number of iterations over the entire dataset that the model will perform during training. Each epoch involves passing through the entire dataset once.
- **Where It’s Defined**: The number of epochs can be set manually through the `epochs` argument when creating the model. In gensim's Word2Vec, if it is not specified, the default is 5. You can increase or decrease it based on how well your model is learning or to avoid overfitting.

### Example of Usage in Training
- **`total_examples=model.corpus_count`**: This tells the model how many training examples (in this case, sentences) there are in total. It uses this number to track training progress, in particular to decay the learning rate gradually over the course of training.
- **`epochs=model.epochs`**: This sets how many times the model will go through the entire dataset. Each pass allows the model to refine its understanding and adjustment of the word vectors.
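The role of these two numbers in the learning-rate schedule can be sketched as follows. This is a simplified illustration of a linear decay, not gensim's actual code; the function `linear_alpha` and the corpus size of 1,000 sentences are hypothetical, while 0.025 and 0.0001 are gensim's default `alpha` and `min_alpha`.

```python
def linear_alpha(progress, start_alpha=0.025, min_alpha=0.0001):
    """Learning rate after a fraction `progress` (0.0 to 1.0) of training."""
    alpha = start_alpha - (start_alpha - min_alpha) * progress
    return max(min_alpha, alpha)

total_examples = 1000  # hypothetical number of sentences
epochs = 5
total_steps = total_examples * epochs  # sentences processed over all epochs

for seen in (0, total_steps // 2, total_steps):
    progress = seen / total_steps
    print(f"after {seen} sentences: alpha = {linear_alpha(progress):.5f}")
```

If `total_examples` were too small, the learning rate would reach its minimum before training finished; if it were too large, the rate would never decay fully. That is why the notebook passes `model.corpus_count` rather than a guessed value.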

### Why They Matter
- **Optimizing Training**: Passing the correct `total_examples` (normally `model.corpus_count`) and choosing a suitable `epochs` value ensures that the model trains effectively, using all available data and iterating through it enough times to capture the patterns of the language in the corpus.
- **Balancing Efficiency and Performance**: Adjusting epochs and accurately defining the corpus count helps balance training efficiency (in terms of computational resources) and performance (in terms of model accuracy and quality of the embeddings).

These parameters are crucial for controlling the training process and directly impact the quality of the word vectors that the model produces. Adjusting them based on your specific dataset and needs can lead to better model performance.

In [None]:
# # Saving the trained model: This allows us to use the model later without retraining.
# model.save("word2vec.model")

# # Loading the model (optional, for demonstration here)
# model = Word2Vec.load("word2vec.model")

In [21]:
# Using the model to get the vector of a word (for example, 'dragons')
vector = model.wv['dragons']  # Get the vector for the word 'dragons'

# Printing the vector to see its dimensions and values
print(vector)

[-0.36631426  0.60955584  0.13336399  0.14445439  0.06204806 -0.7928069
  0.2086861   0.9930074  -0.09372328 -0.3835683  -0.18408199 -0.72677
 -0.07172154  0.14044607  0.16102912 -0.45973134  0.17065719 -0.5876251
  0.00873428 -0.89726406  0.33968502  0.11368651  0.45285192 -0.35902897
 -0.21950306  0.10265506 -0.40286672 -0.22197504 -0.3265237   0.19055262
  0.47440296  0.09198808  0.24462108 -0.42796713 -0.35232052  0.6048444
  0.03814029 -0.22557105 -0.13154188 -0.8310859   0.20507827 -0.48912492
 -0.07025751 -0.13140215  0.42509505 -0.23210727 -0.37323052  0.02947376
  0.28366733  0.35020363  0.33744666 -0.5979548  -0.41649908 -0.09604066
 -0.57285804  0.26571018  0.43274835  0.19915207 -0.4815969   0.19120543
  0.06091124  0.381497   -0.155061    0.18331501 -0.6704014   0.3859066
  0.08402716  0.29345518 -0.65951085  0.5110535  -0.21730019  0.06260721
  0.4299781  -0.4010384   0.5571703   0.24821563 -0.10244319  0.04410516
 -0.5278675   0.3311016  -0.11637681 -0.12379638 -0.621604

The vector obtained from the Word2Vec model for the word "dragons" is a numerical representation of the word in a 100-dimensional vector space. Each of the 100 values in this vector corresponds to a feature that captures some aspect of the semantic and contextual meaning of "dragons" based on how it appears in the training data. This vector encodes the relationships and nuances of "dragons" relative to other words in the dataset, allowing the model to recognize similarities, differences, and various linguistic patterns. By analyzing these values, algorithms can perform tasks like finding similar words, categorizing content, or even understanding complex language structures where "dragons" plays a role.