#Tokenizers
Copyright 2021-2023 Denis Rothman, MIT License

Reference 1 for word embedding:
https://www.geeksforgeeks.org/python-word-embedding-using-word2vec/

Reference 2 for cosine similarity:
[SciKit Learn cosine similarity documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html)

**June 2023: notebook and code updated for Gensim 4.0.0**


#Prerequisites

In [1]:
!pip install gensim
import nltk
nltk.download('punkt')

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [2]:
import math
import warnings
import numpy as np
import matplotlib.pyplot as plt
import gensim
from gensim.models import Word2Vec
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.metrics.pairwise import cosine_similarity
warnings.filterwarnings(action='ignore')

#Word2Vec Tokenization

Update: download from GitHub added

In [3]:
# Option 1: upload text.txt using the Colab file manager
# Option 2: download the file from GitHub
!curl -L https://raw.githubusercontent.com/Denis2054/Transformers-for-NLP-2nd-Edition/master/Chapter09/text.txt --output "text.txt"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 10.9M  100 10.9M    0     0  27.9M      0 --:--:-- --:--:-- --:--:-- 27.9M


Update: With Gensim 4.0.0, use the vector_size parameter instead of size when initializing the Word2Vec model.

In [6]:
# Read the 'text.txt' file; the with block closes the file after reading
with open("text.txt", "r") as sample:
    s = sample.read()
print("text: s:", s[:500])

# Replace newline characters with spaces
f = s.replace("\n", " ")
print("text: f:", f[:500])

text: s: 
December, 1971  [Etext #1]


The Project Gutenberg Etext of The Declaration of Independence.

All of the original Project Gutenberg Etexts from the
1970's were produced in ALL CAPS, no lower case.  The
computers we used then didn't have lower case at all.


This is a retranscription of one of the first Project
Gutenberg Etexts, officially dated December, 1971--
and now officially re-released on December 31, 1993--


The Project Gutenberg EBook of The Critique of Pure Reason, by Immanuel Kant

T
text: f:  December, 1971  [Etext #1]   The Project Gutenberg Etext of The Declaration of Independence.  All of the original Project Gutenberg Etexts from the 1970's were produced in ALL CAPS, no lower case.  The computers we used then didn't have lower case at all.   This is a retranscription of one of the first Project Gutenberg Etexts, officially dated December, 1971-- and now officially re-released on December 31, 1993--   The Project Gutenberg EBook of The Critique of Pure Reason, 

In [9]:
data = [] 
# Sentence parsing: sent_tokenize splits the text into sentences
# using punctuation and other language-specific rules
for i in sent_tokenize(f):  # each sentence becomes a list of tokens
	temp = [] 
	for j in word_tokenize(i): # tokenize each sentence content into tokens
		temp.append(j.lower())
	data.append(temp)
 
print("data[:100]: ", data[:100])

# Create the skip-gram model
model2 = gensim.models.Word2Vec(
    data,
    min_count = 1,  # minimum frequency for a word to be included; 1 keeps every word regardless of frequency
    vector_size = 512,  # each word is represented by a 512-dimensional vector; larger sizes can capture finer-grained relationships but increase memory use
    window = 5,  # maximum distance between the target word and a context word: up to 5 words on each side
    sg = 1)  # 1 selects skip-gram; 0 selects the other Word2Vec algorithm, continuous bag of words (CBOW)

print("model2:" , model2)

data[:100]:  [['december', ',', '1971', '[', 'etext', '#', '1', ']', 'the', 'project', 'gutenberg', 'etext', 'of', 'the', 'declaration', 'of', 'independence', '.'], ['all', 'of', 'the', 'original', 'project', 'gutenberg', 'etexts', 'from', 'the', '1970', "'s", 'were', 'produced', 'in', 'all', 'caps', ',', 'no', 'lower', 'case', '.'], ['the', 'computers', 'we', 'used', 'then', 'did', "n't", 'have', 'lower', 'case', 'at', 'all', '.'], ['this', 'is', 'a', 'retranscription', 'of', 'one', 'of', 'the', 'first', 'project', 'gutenberg', 'etexts', ',', 'officially', 'dated', 'december', ',', '1971', '--', 'and', 'now', 'officially', 're-released', 'on', 'december', '31', ',', '1993', '--', 'the', 'project', 'gutenberg', 'ebook', 'of', 'the', 'critique', 'of', 'pure', 'reason', ',', 'by', 'immanuel', 'kant', 'this', 'ebook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever', '.'], ['you', 'may', 'copy', 'it', ',

The Word2Vec model is trained and the resulting model contains word embeddings.
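
To make the `window` and `sg = 1` settings more concrete, here is a minimal sketch of how skip-gram training pairs could be formed from one tokenized sentence. The function name `skipgram_pairs` and the toy sentence are illustrative assumptions, not part of Gensim. Actual implementations add refinements (such as subsampling and negative sampling) that this sketch leaves out.

```python
def skipgram_pairs(tokens, window=5):
    """Return (target, context) pairs for one tokenized sentence.

    Hypothetical helper for illustration only; not a Gensim function.
    """
    pairs = []
    for i, target in enumerate(tokens):
        # Context positions lie at most `window` tokens to either side
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


sentence = ["the", "computers", "we", "used", "then"]
pairs = skipgram_pairs(sentence, window=1)
print(pairs)
```

With `window=1`, each word is paired only with its immediate neighbors. A larger window, such as the 5 used for `model2`, pairs each word with more distant context words.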

#Cosine Similarity

The similarity function below calculates the cosine similarity between the embeddings of two words, using the Word2Vec model trained above (model2).

Vocabulary:
The vocabulary refers to the set of unique words or tokens present in a given corpus or dataset. It represents the entire collection of words that the word representation model is designed to handle. The vocabulary typically includes all the distinct words encountered during the training phase of the model.
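
As a rough illustration of how a vocabulary relates to a minimum frequency threshold such as `min_count`, the sketch below counts tokens in a toy corpus and keeps those that occur often enough. The helper `build_vocab` and the toy data are assumptions made for this example; Gensim builds its vocabulary internally and this is not its implementation.

```python
from collections import Counter


def build_vocab(sentences, min_count=1):
    """Count tokens and keep those seen at least `min_count` times.

    Hypothetical helper for illustration only.
    """
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


toy_data = [
    ["the", "project", "gutenberg", "etext"],
    ["the", "declaration", "of", "independence"],
]

vocab_all = build_vocab(toy_data, min_count=1)       # keeps every distinct word
vocab_frequent = build_vocab(toy_data, min_count=2)  # keeps only repeated words
print(sorted(vocab_all))
print(sorted(vocab_frequent))
```

A word that falls below the threshold has no embedding, so looking it up raises a `KeyError`, which is the situation the similarity function in this notebook handles.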

Embeddings:
Embeddings, or word embeddings, are vector representations of words within a word representation model. They are numerical representations that capture the semantic and syntactic properties of words based on their contextual usage in the training data. Each word in the vocabulary has an associated embedding vector.

In the case of Word2Vec, the model learns word embeddings by considering the context of each word within the training data. It assigns a dense vector to each word in the vocabulary (512 dimensions in this notebook, set by vector_size). The dimensions are learned features; they are generally not interpretable one by one, but together they encode information about the relationships between words based on their co-occurrence patterns in the training data.
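
Cosine similarity compares two embedding vectors by the angle between them: the dot product divided by the product of the two vector norms. The following is a minimal sketch in plain Python with made-up three-dimensional vectors (`v1`, `v2`, `v3` are illustrative values, not real embeddings from `model2`).

```python
import math


def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]   # same direction as v1, twice the length
v3 = [-3.0, 0.0, 1.0]  # orthogonal to v1 (dot product is 0)

print(cosine(v1, v2))  # close to 1: same direction
print(cosine(v1, v3))  # close to 0: orthogonal
```

Because the norms are divided out, only the direction of the vectors matters, not their length. Values near 1 indicate similar directions, values near 0 indicate unrelated directions, and negative values indicate opposing directions.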

In [13]:
# The function takes two words as input
def similarity(word1,word2):
        cosine=False # default value; True only if both words are found in the vocabulary
        try:
                a=model2.wv[word1] # retrieve the word embedding for word1
                cosine=True
        except KeyError:     # raised if word1 is not in the vocabulary
                print(word1, ":[unk] key not found in dictionary")

        try:
                b=model2.wv[word2] # retrieve the word embedding for word2
        except KeyError:     # raised if word2 is not in the vocabulary
                cosine=False   # both words must be found
                print(word2, ":[unk] key not found in dictionary")

        if cosine:
                # manual cosine similarity, shown for comparison with
                # sklearn's cosine_similarity function used below
                dot = np.dot(a, b)
                norma = np.linalg.norm(a)
                normb = np.linalg.norm(b)
                cos = dot / (norma * normb)
                print("Manual calculation of cos value: ", cos)

                aa = a.reshape(1, -1)  # cosine_similarity expects 2D arrays
                ba = b.reshape(1, -1)
                print("Word1",aa)
                print("Word2",ba)
                cos_lib = cosine_similarity(aa, ba)
                print("cosine_similarity fun calculation of cos value: ", cos_lib)
                print(cos_lib,"word similarity")
        else:
                cos_lib = 0  # at least one word is not in the vocabulary
        return cos_lib

#Case 0: Words in the dataset and the dictionary

Update: In Gensim 4.0.0, indexing the model instance directly (like model[word]) is no longer supported. Use model.wv[word] to access the vector for a word.

In [15]:
def similarity(word1, word2):
    cosine = False  # default value
    try:
        a = model2.wv[word1]
        cosine = True
    except KeyError:     # The KeyError exception is raised
        print("The word ",word1," does not exist in the dictionary")
    try:
        b = model2.wv[word2]
    except KeyError:     # The KeyError exception is raised
        print("The word ",word2," does not exist in the dictionary")
        cosine = False  # reset to False if the second word doesn't exist
    if cosine: # if both words are in the vocabulary
        return cosine_similarity([a],[b]) # sklearn cosine_similarity requires 2D arrays
    else:
        return 0 # if either word is not in the vocabulary return similarity as 0

In [16]:
word1 = "freedom"
word2 = "liberty"
print("Similarity between", word1, "and", word2, "is", similarity(word1, word2))

Similarity between freedom and liberty is [[0.33216608]]


#Case 1: Words not in the dataset or the dictionary

In [17]:
word1="corporations";word2="rights"
print("Similarity",similarity(word1,word2),word1,word2)

The word  corporations  does not exist in the dictionary
Similarity 0 corporations rights


#Case 2: Noisy Relationship

In [18]:
word1="etext";word2="declaration"
print("Similarity",similarity(word1,word2),word1,word2)

Similarity [[0.5683219]] etext declaration


#Case 3: Words in the text but not in the dictionary

In [19]:
word1="pie";word2="logic"
print("Similarity",similarity(word1,word2),word1,word2)

The word  pie  does not exist in the dictionary
Similarity 0 pie logic


#Case 4: Rare words

In [20]:
word1="justiciar";word2="judgement"
print("Similarity",similarity(word1,word2),word1,word2)

Similarity [[0.24896161]] justiciar judgement


#Case 5: Replacing rare words

In [21]:
word1="judge";word2="judgement"
print("Similarity",similarity(word1,word2),word1,word2)

Similarity [[0.15906775]] judge judgement


#Case 6: Entailment

In [22]:
word1="pay";word2="debt"
print("Similarity",similarity(word1,word2),word1,word2)

Similarity [[0.51492745]] pay debt
