<a href="https://colab.research.google.com/github/igntrevor/Customs_Fraud_Detection_IB/blob/master/%5BPycon_Uganda_2023%5D_Word_Embeddings_A_Pythonic_Delight.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [Pycon Uganda 2023] Word Embeddings: A Pythonic Delight

## Introduction
Today, we have chatbots all over the internet that have simplified and automated processes like sales, onboarding, etc., for thousands of organizations.<br>
We have artificial intelligence–based question answering systems that are beating benchmark scores set by human beings.<br>
We have deep learning models that are capable of generating entire essays that are indistinguishable from human-written ones.<br>
All these tasks require large amounts of training data, but how do algorithms and machine learning models understand human-written text data? The answer is ***vectors***.

Vectors are the fundamental objects of linear algebra, and they underpin much of the rapid progress in natural language processing (NLP) over the last few decades. Any document, sentence, or even a single word in a dataset can be represented as a vector whose values depend on the other text in the dataset (its vocabulary).

## Natural Language Processing
According to [Wikipedia](https://en.wikipedia.org/wiki/Natural_language_processing), natural language processing is "a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data." NLP is a vast field with many tasks, features, and utilities, but we are only going to focus on the different processes used to encode text data and process it.

But we know it all starts with data, right?

## Text Cleaning

### Read Data
Starting from a Python interpreter we first need to import the urllib.request module with:

In [None]:
import urllib.request

We will use urllib.request to download the text of a famous book, Moby Dick, from Project Gutenberg with:

In [None]:
url = "https://www.gutenberg.org/files/2701/2701-0.txt"
file = urllib.request.urlopen(url)

Next we decode and combine the text with a list comprehension and join like so:

In [None]:
text = [line.decode('utf-8') for line in file]
text = ''.join(text)

Let's now print out some of the text:

In [None]:
text[7600:8000]

'poor devil of a Sub-Sub appears to have gone through the long\r\n  Vaticans and street-stalls of the earth, picking up whatever random\r\n  allusions to whales he could anyways find in any book whatsoever,\r\n  sacred or profane. Therefore you must not, in every case at least,\r\n  take the higgledy-piggledy whale statements, however authentic, in\r\n  these extracts, for veritable gospel cetology. Far from'

### Tokenize
The full text of Moby Dick is now stored in the text variable. Next we need to tokenize the document, that is, break it into words.

We can do that easily with:

In [None]:
tokens = text.split()

The split method splits the text on whitespace. We can display a range of words with:

In [None]:
tokens[200:222]

['13.',
 'Wheelbarrow.',
 'CHAPTER',
 '14.',
 'Nantucket.',
 'CHAPTER',
 '15.',
 'Chowder.',
 'CHAPTER',
 '16.',
 'The',
 'Ship.',
 'CHAPTER',
 '17.',
 'The',
 'Ramadan.',
 'CHAPTER',
 '18.',
 'His',
 'Mark.',
 'CHAPTER',
 '19.']

We often want to ignore capitalization of words. So we will first lowercase all the text and then tokenize like so:

In [None]:
tokens = text.lower().split()
tokens[200:222]

['13.',
 'wheelbarrow.',
 'chapter',
 '14.',
 'nantucket.',
 'chapter',
 '15.',
 'chowder.',
 'chapter',
 '16.',
 'the',
 'ship.',
 'chapter',
 '17.',
 'the',
 'ramadan.',
 'chapter',
 '18.',
 'his',
 'mark.',
 'chapter',
 '19.']

### Remove Punctuation
You may have noticed that many of our tokens still have punctuation. For basic NLP tasks we will often want to isolate or remove punctuation.

We can remove punctuation by first making a translation table with:

In [None]:
import string
table = str.maketrans('', '', string.punctuation)

The translation table lets us delete every punctuation character from each token using:

In [None]:
tokens = [w.translate(table) for w in tokens]
tokens[200:222]

['13',
 'wheelbarrow',
 'chapter',
 '14',
 'nantucket',
 'chapter',
 '15',
 'chowder',
 'chapter',
 '16',
 'the',
 'ship',
 'chapter',
 '17',
 'the',
 'ramadan',
 'chapter',
 '18',
 'his',
 'mark',
 'chapter',
 '19']
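As a small illustration of what the translation table does, the sketch below applies it to a handful of tokens. The sample list is made up for this example and is not taken from the book. It shows two side effects worth knowing about: hyphenated words are fused together, and tokens made only of punctuation become empty strings.

```python
import string

# Map every ASCII punctuation character to None, i.e. delete it.
table = str.maketrans('', '', string.punctuation)

# Hypothetical sample tokens, chosen to show the edge cases.
sample_tokens = ['ship.', 'sub-sub', "whale's", '--', '19.']
cleaned = [w.translate(table) for w in sample_tokens]
print(cleaned)
```

The empty strings left behind are one reason a later filtering step is useful.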

### Only Alphabetic
For most NLP tasks we want to stick with just language or alphabetic characters. That means removing all non-alphabetic characters.

We can drop every token that is not purely alphabetic with:

In [None]:
tokens = [word for word in tokens if word.isalpha()]

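The if word.isalpha() test keeps only tokens made up entirely of letters, so numbers, tokens that mix letters and digits, and the empty strings left behind by punctuation removal are all filtered out.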

Then we can look at the results with:

In [None]:
tokens[200:222]

['chapter',
 'going',
 'aboard',
 'chapter',
 'merry',
 'christmas',
 'chapter',
 'the',
 'lee',
 'shore',
 'chapter',
 'the',
 'advocate',
 'chapter',
 'postscript',
 'chapter',
 'knights',
 'and',
 'squires',
 'chapter',
 'knights',
 'and']
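The cleaning steps so far can be collected into one helper. This is a minimal sketch: the function name clean_tokens and the sample string are assumptions made for illustration, not part of the original notebook.

```python
import string

def clean_tokens(raw_text):
    """Lowercase, split on whitespace, strip ASCII punctuation, keep alphabetic tokens."""
    table = str.maketrans('', '', string.punctuation)
    words = raw_text.lower().split()
    words = [w.translate(table) for w in words]
    # isalpha() is False for empty strings and for any token containing a digit.
    return [w for w in words if w.isalpha()]

sample_text = "CHAPTER 16. The Ship. CHAPTER 17. The Ramadan."
print(clean_tokens(sample_text))
```

Wrapping the steps in a function makes it easy to apply the same cleaning to other documents later on.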

### What's Next
What we have covered above has allowed us to reduce our document to a set of tokens. This will work for most simple applications.

For complex applications we would need to reduce or translate the tokens further using the following concepts:

- Remove stop words: like 'the'
- Stemming of words: like 'like' from:
  - likes
  - liked
  - likely
  - liking
- Lemmatization of words:
  - rocks -> rock
  - corpora -> corpus
  - better -> good

Those tasks are better done with a full-featured library like NLTK.

Let's Dig Deeper.

## Introduction To NLTK

### Reading Our Data
We have already used urllib.request to download the text of a famous book, Moby Dick, from Project Gutenberg, so it is still available in the text variable.

Now let's go ahead and print an excerpt with:

In [None]:
text[7600:8000]

'poor devil of a Sub-Sub appears to have gone through the long\r\n  Vaticans and street-stalls of the earth, picking up whatever random\r\n  allusions to whales he could anyways find in any book whatsoever,\r\n  sacred or profane. Therefore you must not, in every case at least,\r\n  take the higgledy-piggledy whale statements, however authentic, in\r\n  these extracts, for veritable gospel cetology. Far from'

### Tokenization
With the text document loaded we can move on to tokenizing.

NLTK comes with a number of modules and corpora for performing NLP and learning about it. For our purposes we only need the standard tokenizer models called 'punkt' (recent NLTK releases may ask for 'punkt_tab' instead).

We can set up and download 'punkt' with the following code:

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

With NLTK we can tokenize the document into sentences with:

In [None]:
from nltk import sent_tokenize
sentences = sent_tokenize(text)

In [None]:
sentences[200]

'The Lamp.'

Or, tokenize the document into words with:

In [None]:
from nltk import word_tokenize
tokens = word_tokenize(text)

In [None]:
tokens[200:222]

['.',
 'CHAPTER',
 '3',
 '.',
 'The',
 'Spouter-Inn',
 '.',
 'CHAPTER',
 '4',
 '.',
 'The',
 'Counterpane',
 '.',
 'CHAPTER',
 '5',
 '.',
 'Breakfast',
 '.',
 'CHAPTER',
 '6',
 '.',
 'The']

### Clean Text
With the document tokenized we can return to cleaning the text.

We can keep only the purely alphabetic tokens, which drops the numbers and the standalone punctuation tokens, with:

In [None]:
tokens = [word for word in tokens if word.isalpha()]
tokens[200:222]

['Shore',
 'CHAPTER',
 'The',
 'Advocate',
 'CHAPTER',
 'Postscript',
 'CHAPTER',
 'Knights',
 'and',
 'Squires',
 'CHAPTER',
 'Knights',
 'and',
 'Squires',
 'CHAPTER',
 'Ahab',
 'CHAPTER',
 'Enter',
 'Ahab',
 'to',
 'Him',
 'Stubb']

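Remove any punctuation left inside the tokens with the following (after the isalpha() filter none remains, which is why the output is unchanged):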

In [None]:
import string
table = str.maketrans('', '', string.punctuation)
tokens = [w.translate(table) for w in tokens]
tokens[200:222]

['Shore',
 'CHAPTER',
 'The',
 'Advocate',
 'CHAPTER',
 'Postscript',
 'CHAPTER',
 'Knights',
 'and',
 'Squires',
 'CHAPTER',
 'Knights',
 'and',
 'Squires',
 'CHAPTER',
 'Ahab',
 'CHAPTER',
 'Enter',
 'Ahab',
 'to',
 'Him',
 'Stubb']

Finally, lowercase everything with:

In [None]:
tokens = [word.lower() for word in tokens]
tokens[200:222]

['shore',
 'chapter',
 'the',
 'advocate',
 'chapter',
 'postscript',
 'chapter',
 'knights',
 'and',
 'squires',
 'chapter',
 'knights',
 'and',
 'squires',
 'chapter',
 'ahab',
 'chapter',
 'enter',
 'ahab',
 'to',
 'him',
 'stubb']

### Stop Words
Stop words are very common words that carry little meaning on their own and are often better filtered out. Examples include 'you', 'the', and 'it'.

NLTK provides a preassembled list of stop words we can download like so:

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

Then we can remove all the stop words from the tokens with:

In [None]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
tokens = [w for w in tokens if w not in stop_words]
tokens[200:222]

['pictures',
 'whaling',
 'scenes',
 'chapter',
 'whales',
 'paint',
 'teeth',
 'wood',
 'stone',
 'mountains',
 'stars',
 'chapter',
 'brit',
 'chapter',
 'squid',
 'chapter',
 'line',
 'chapter',
 'stubb',
 'kills',
 'whale',
 'chapter']
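To see the filtering step in isolation, here is a minimal sketch with a tiny hand-written stop word set. Both the set and the token list are made up for illustration; NLTK's English list is much longer.

```python
# Hypothetical stop word set; a set gives fast membership checks.
toy_stop_words = {'the', 'of', 'a', 'and', 'it', 'you'}

toy_tokens = ['the', 'pictures', 'of', 'whaling', 'scenes', 'and', 'a', 'whale']
filtered = [w for w in toy_tokens if w not in toy_stop_words]
print(filtered)
```

Because the comparison is an exact string match, the tokens should be lowercased before filtering, as we did above.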

### Stemming
Stemming is the process of reducing a word to its root form by clipping off its endings, leaving the stem. Take for example word pairs like acquire/acquired, firm/firms, or product/production: each pair shares a root. Reducing a word to its root simplifies the NLP task with little loss of meaning. Note that a stem is not always a dictionary word: in the output below, 'pictures' becomes 'pictur'.

There are many algorithms for reducing a word to its root. Fortunately, NLTK provides several, and we will use PorterStemmer.

We can load the PorterStemmer and stem the words easily with:

In [None]:
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in tokens]
stemmed[200:222]

['pictur',
 'whale',
 'scene',
 'chapter',
 'whale',
 'paint',
 'teeth',
 'wood',
 'stone',
 'mountain',
 'star',
 'chapter',
 'brit',
 'chapter',
 'squid',
 'chapter',
 'line',
 'chapter',
 'stubb',
 'kill',
 'whale',
 'chapter']
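To build intuition for what a stemmer does, the sketch below strips a few common suffixes. It is a deliberately naive toy and not the Porter algorithm: the function name naive_stem, the suffix list, and the minimum stem length are all assumptions made for this example.

```python
def naive_stem(word, suffixes=('ing', 'ed', 's')):
    """Strip the first matching suffix, keeping at least three characters of stem."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

toy_words = ['kills', 'killed', 'killing', 'stars', 'sea']
toy_stems = [naive_stem(w) for w in toy_words]
print(toy_stems)
```

A real stemmer applies many more rules. For instance, this toy turns 'whaling' into 'whal', whereas the PorterStemmer output above shows 'whale'.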

With a clean set of tokens in hand we can move on to discovering more about words in the next section.

## Introduction To Bag Of Words

### Read Data

Since our data is already loaded in the notebook, we can proceed directly to the next steps.

### Tokenize and Clean
After the document is loaded we can proceed to tokenize and clean the document.

First we tokenize with:

In [None]:
import nltk
nltk.download('punkt')
from nltk import word_tokenize
tokens = word_tokenize(text)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Then clean with:

In [None]:
import string
tokens = [word for word in tokens if word.isalpha()]
table = str.maketrans('', '', string.punctuation)
tokens = [w.translate(table) for w in tokens]
tokens = [word.lower() for word in tokens]

Finally, remove stop words and stem with:

In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
tokens = [w for w in tokens if w not in stop_words]

from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
tokens = [porter.stem(word) for word in tokens]
tokens[200:222]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['pictur',
 'whale',
 'scene',
 'chapter',
 'whale',
 'paint',
 'teeth',
 'wood',
 'stone',
 'mountain',
 'star',
 'chapter',
 'brit',
 'chapter',
 'squid',
 'chapter',
 'line',
 'chapter',
 'stubb',
 'kill',
 'whale',
 'chapter']

### Vocabulary
With a clean set of tokens we can move on to understanding the vocabulary. The vocabulary of a document is the set of distinct words in that document, along with how often each one appears.

NLTK has the FreqDist class that can help us count the words in a document with:

In [None]:
from nltk.probability import FreqDist

word_counts = FreqDist(tokens)
word_counts

FreqDist({'whale': 1454, 'one': 920, 'like': 590, 'upon': 567, 'ship': 553, 'ye': 521, 'man': 496, 'ahab': 495, 'sea': 461, 'seem': 460, ...})

Now that all the words are counted we can extract a vocabulary. In many cases we only want to understand the most frequent words. We can take the top most frequent words like so:

In [None]:
top = 500
vocabulary = word_counts.most_common(top)

vocabulary[:10]

[('whale', 1454),
 ('one', 920),
 ('like', 590),
 ('upon', 567),
 ('ship', 553),
 ('ye', 521),
 ('man', 496),
 ('ahab', 495),
 ('sea', 461),
 ('seem', 460)]

For most NLP tasks we may want a larger vocabulary that uses 5000 words or more.
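FreqDist builds on Python's collections.Counter, so the same counting idea can be sketched with the standard library alone. The token list below is made up for illustration.

```python
from collections import Counter

toy_tokens = ['whale', 'ship', 'whale', 'sea', 'ship', 'whale']
toy_counts = Counter(toy_tokens)

# most_common(n) returns (word, count) pairs sorted by descending count.
toy_vocabulary = toy_counts.most_common(2)
print(toy_vocabulary)
```

Limiting the vocabulary to the top n words, as here, trades coverage of rare words for a smaller vector.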

### Count Vector
With a vocabulary established we can move on to scoring words. A simple way to score words is by frequency. We can then combine the frequency scores of words with the vocabulary and create a count vector or bag of words.

First we will import a helper library called NumPy, which provides efficient arrays for the numerical data used in machine learning.

In [None]:
import numpy as np

Then we can create a word vector of the words and counts like so:

In [None]:
voc_size = len(vocabulary)
doc_vector = np.zeros(voc_size)

word_vector = [(idx,word_counts[word[0]]) for idx, word in enumerate(vocabulary) if word[0] in word_counts.keys()]
word_vector[10]

(10, 443)

Then we can create a count vector for the document with:

In [None]:
for idx, count in word_vector:
  doc_vector[idx] = count

doc_vector

array([1454.,  920.,  590.,  567.,  553.,  521.,  496.,  495.,  461.,
        460.,  443.,  434.,  429.,  424.,  365.,  342.,  338.,  337.,
        331.,  322.,  317.,  315.,  312.,  312.,  311.,  308.,  307.,
        298.,  292.,  284.,  280.,  277.,  277.,  277.,  268.,  268.,
        266.,  256.,  255.,  251.,  249.,  247.,  243.,  241.,  240.,
        238.,  236.,  231.,  230.,  228.,  224.,  222.,  217.,  217.,
        215.,  211.,  211.,  205.,  204.,  204.,  203.,  203.,  201.,
        196.,  196.,  193.,  192.,  191.,  190.,  189.,  184.,  182.,
        182.,  180.,  179.,  178.,  176.,  175.,  171.,  171.,  168.,
        168.,  168.,  167.,  167.,  164.,  161.,  159.,  159.,  159.,
        153.,  153.,  153.,  152.,  148.,  143.,  142.,  140.,  139.,
        138.,  137.,  134.,  132.,  132.,  130.,  129.,  129.,  128.,
        128.,  128.,  127.,  126.,  126.,  125.,  125.,  124.,  123.,
        122.,  122.,  122.,  121.,  121.,  121.,  120.,  119.,  119.,
        119.,  118.,

### Bag of Words
We now have all the pieces to make a bag of words model for a set of documents.

For documents we will use a small collection of sentences from the Moby Dick text like so:

In [None]:
from nltk import sent_tokenize

docs = sent_tokenize(text)[703:706]
docs

['The more I pondered over this harpooneer, the more I abominated the\r\nthought of sleeping with him.',
 'It was fair to presume that being a\r\nharpooneer, his linen or woollen, as the case might be, would not be of\r\nthe tidiest, certainly none of the finest.',
 'I began to twitch all over.']

Then we want to import a helper class from scikit-learn called CountVectorizer with:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

The helper CountVectorizer can tokenize, clean, and count all the tokens in our documents with:

In [None]:
count_vectorizer=CountVectorizer(stop_words='english')

word_count_vector=count_vectorizer.fit_transform(docs)
word_count_vector.shape

(3, 15)

The shape of word_count_vector gives the number of documents (3) and the number of distinct words (15) that remain in those documents after stop words are removed.

The word_count_vector represents the documents' bag of words. We can view its contents with:

In [None]:
word_count_vector.toarray()

array([[1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0],
       [0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]])

Each row is the count vector for one document (here, a sentence), with one position for every word in the vocabulary. Each value is the number of times that word occurs in the document; in this small example no word occurs more than once, so the values are all 0 or 1.

We can view that list of words by querying the count_vectorizer with:

In [None]:
count_vectorizer.get_feature_names_out()

array(['abominated', 'began', 'case', 'certainly', 'fair', 'finest',
       'harpooneer', 'linen', 'pondered', 'presume', 'sleeping',
       'thought', 'tidiest', 'twitch', 'woollen'], dtype=object)
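To make the idea behind the document-term matrix concrete, here is a minimal pure-Python sketch. It only mimics the counting step on documents that are already tokenized (the toy documents are made up), and it leaves out the tokenizing and stop word handling that CountVectorizer also performs.

```python
from collections import Counter

# Hypothetical, already-tokenized documents.
toy_docs = [
    ['whale', 'ship', 'whale'],
    ['sea', 'ship'],
]

# Sorted vocabulary over all documents, mirroring the alphabetical feature order above.
toy_features = sorted({word for doc in toy_docs for word in doc})

# One row per document, one column per vocabulary word.
toy_matrix = []
for doc in toy_docs:
    counts = Counter(doc)
    toy_matrix.append([counts[word] for word in toy_features])

print(toy_features)
print(toy_matrix)
```

Note that 'whale' gets a value of 2 in the first row: the entries are counts, not just presence flags.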

## Let's Now Talk Word Embeddings

### Read Data
For this lab we are going to use a nursery rhyme as a set of sample or toy documents. Using toy documents will allow us to better understand how similarity works with documents.

We will use the first 4 verses from the nursery rhyme "The House that Jack Built" (1755 London):

In [None]:
docs = ["This is the house that Jack built. "
        "This is the cheese that lay in the house that Jack built. "
        "This is the rat that ate the cheese "
        "That lay in the house that Jack built. "
        "This is the cat that chased the rat "
        "That ate the cheese that lay in the house that Jack built. ",  #verse 1
        "This is the dog that worried the cat "
        "That chased the rat that ate the cheese "
        "That lay in the house that Jack built. "
        "This is the cow with the crumpled horn "
        "That tossed the dog that worried the cat "
        "That chased the rat that ate the cheese "
        "That lay in the house that Jack built. ", # verse 2
        "This is the maiden all forlorn "
        "That milked the cow with the crumpled horn "
        "That tossed the dog that worried the cat "
        "That chased the rat that ate the cheese "
        "That lay in the house that Jack built. ", # verse 3
        "This is the man all tattered and torn "
        "That kissed the maiden all forlorn "
        "That milked the cow with the crumpled horn "
        "That tossed the dog that worried the cat "
        "That chased the rat that ate the cheese "
        "That lay in the house that Jack built. " # verse 4
        ]

Notice that each document is written as several adjacent string literals, which Python joins into one string; the commas ',' separate one document from the next.

Then, to make sure everything is loaded correctly we output docs with:

In [None]:
docs

['This is the house that Jack built. This is the cheese that lay in the house that Jack built. This is the rat that ate the cheese That lay in the house that Jack built. This is the cat that chased the rat That ate the cheese that lay in the house that Jack built. ',
 'This is the dog that worried the cat That chased the rat that ate the cheese That lay in the house that Jack built. This is the cow with the crumpled horn That tossed the dog that worried the cat That chased the rat that ate the cheese That lay in the house that Jack built. ',
 'This is the maiden all forlorn That milked the cow with the crumpled horn That tossed the dog that worried the cat That chased the rat that ate the cheese That lay in the house that Jack built. ',
 'This is the man all tattered and torn That kissed the maiden all forlorn That milked the cow with the crumpled horn That tossed the dog that worried the cat That chased the rat that ate the cheese That lay in the house that Jack built. ']

### Bag of Words
With the documents loaded our first step is to construct a bag of words. We can do this using CountVectorizer from scikit-learn.

First we load the vectorizer and then instantiate it with:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer=CountVectorizer(stop_words='english')

Then we can construct the bag of words with:

In [None]:
bag_of_words=count_vectorizer.fit_transform(docs)
bag_of_words.shape

(4, 22)

We display the words in the vocabulary with:

In [None]:
count_vectorizer.get_feature_names_out()

array(['ate', 'built', 'cat', 'chased', 'cheese', 'cow', 'crumpled',
       'dog', 'forlorn', 'horn', 'house', 'jack', 'kissed', 'lay',
       'maiden', 'man', 'milked', 'rat', 'tattered', 'torn', 'tossed',
       'worried'], dtype=object)

Then look at each document vector with:

In [None]:
bag_of_words.toarray()

array([[2, 4, 1, 1, 3, 0, 0, 0, 0, 0, 4, 4, 0, 3, 0, 0, 0, 2, 0, 0, 0, 0],
       [2, 2, 2, 2, 2, 1, 1, 2, 0, 1, 2, 2, 0, 2, 0, 0, 0, 2, 0, 0, 1, 2],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])

### TF-IDF
Scoring documents by raw word frequency favors words that are common everywhere and underweights rarer words that may carry more meaning.

A better method of scoring documents is called Term Frequency-Inverse Document Frequency, or TF-IDF. This score multiplies the frequency of a word in a document by the inverse document frequency of that word, a weight that shrinks as the word appears in more of the documents.

Fortunately, scikit-learn has a transformer that can take the bag of words (count) vectors and transform them into TF-IDF vectors.

First let us import the TfidfTransformer with:

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

Next, we can calculate the TF-IDF scores using:

In [None]:
tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
tfidf_transformer.fit(bag_of_words)

Calling fit learns the IDF weights from the word counts; the transform step then converts the counts to TF-IDF scores.

Finally, we can review the scores using pandas and the following:

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer

# bag_of_words was computed earlier with count_vectorizer.fit_transform(docs)

# Create a TfidfTransformer
tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)

# Fit the transformer to the bag_of_words and transform it to TF-IDF scores
tfidf_scores = tfidf_transformer.fit_transform(bag_of_words)

# Convert the TF-IDF scores to a pandas DataFrame
tfidf_df = pd.DataFrame(tfidf_scores.toarray(), columns=count_vectorizer.get_feature_names_out())

# Print or view the TF-IDF scores
print(tfidf_df)


        ate     built       cat    chased    cheese       cow  crumpled  \
0  0.229416  0.458831  0.114708  0.114708  0.344124  0.000000  0.000000   
1  0.272284  0.272284  0.272284  0.272284  0.272284  0.166521  0.166521   
2  0.200707  0.200707  0.200707  0.200707  0.200707  0.245493  0.245493   
3  0.159085  0.159085  0.159085  0.159085  0.159085  0.194584  0.194584   

        dog   forlorn      horn  ...    kissed       lay    maiden       man  \
0  0.000000  0.000000  0.000000  ...  0.000000  0.344124  0.000000  0.000000   
1  0.333043  0.000000  0.166521  ...  0.000000  0.272284  0.000000  0.000000   
2  0.245493  0.303233  0.245493  ...  0.000000  0.200707  0.303233  0.000000   
3  0.194584  0.240350  0.194584  ...  0.304854  0.159085  0.240350  0.304854   

     milked       rat  tattered      torn    tossed   worried  
0  0.000000  0.229416  0.000000  0.000000  0.000000  0.000000  
1  0.000000  0.272284  0.000000  0.000000  0.166521  0.333043  
2  0.303233  0.200707  0.000000

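Look at the scores and notice that terms appearing in every verse, like 'ate' and 'jack', get low TF-IDF scores because their IDF weight is 1.0, the minimum possible. Words that appear in only one verse, like 'torn' and 'kissed', get a higher score. In the last verse, for example, 'ate' scores 0.159 while 'kissed' scores 0.305.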
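The scores in the last row of the table can be reproduced by hand. The sketch below assumes scikit-learn's documented smoothed formula, idf = ln((1 + n) / (1 + df)) + 1, followed by scaling each row to unit Euclidean length. The tally of terms per document frequency is read off the bag_of_words array shown earlier; the variable and function names are made up for this example.

```python
import math

n_docs = 4

# In the last verse every term occurs exactly once. Tallying the bag of words columns:
# 9 terms appear in all 4 verses, 6 in 3 verses, 3 in 2 verses, and 4 in 1 verse.
terms_per_df = {4: 9, 3: 6, 2: 3, 1: 4}

def smoothed_idf(df, n=n_docs):
    """Smoothed inverse document frequency (assumed scikit-learn formula, smooth_idf=True)."""
    return math.log((1 + n) / (1 + df)) + 1

# With all counts equal to 1, the unnormalized TF-IDF weight of a term is just its IDF.
row_norm = math.sqrt(sum(k * smoothed_idf(df) ** 2 for df, k in terms_per_df.items()))

score_ate = smoothed_idf(4) / row_norm     # 'ate' appears in all four verses
score_kissed = smoothed_idf(1) / row_norm  # 'kissed' appears only in the last verse
print(round(score_ate, 6), round(score_kissed, 6))
```

The two printed values should agree with the 'ate' and 'kissed' entries in row 3 of the table, which shows how the rarer word ends up with roughly twice the weight.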

### Cosine Similarity (Distance)
To measure the similarity between 2 documents, or in this case verses, we can compare their TF-IDF document vectors.

There are a number of ways to compare n-dimensional vectors. The most common is cosine similarity, the cosine of the angle between 2 vectors (v1 and v2):

similarity = dot(v1, v2) / (norm(v1) * norm(v2))

Here norm(v) is the Euclidean length of v, not its number of elements. For arbitrary vectors the similarity ranges from 1.0 (same direction) to -1.0 (opposite direction). TF-IDF vectors have no negative entries, so for them it ranges from 0.0 (no shared terms) to 1.0. Cosine distance is simply 1 - similarity.
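The cosine formula can be sketched in a few lines of plain Python. The function name and the sample vectors are made up for illustration, and the sketch assumes neither vector is all zeros.

```python
import math

def cosine_sim(v1, v2):
    """Dot product of v1 and v2 divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)  # assumes both norms are non-zero

print(cosine_sim([1, 2, 0], [2, 4, 0]))  # same direction
print(cosine_sim([1, 0], [0, 1]))        # orthogonal, no shared components
print(cosine_sim([1, 1], [-1, -1]))      # opposite direction
```

Because the vectors are divided by their lengths, a verse and a longer verse with the same word proportions count as identical.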

Let's calculate the similarity between the verses. First we will import TfidfVectorizer to create TF-IDF vectors from our documents in one step, like so:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words='english')
tfidf = vectorizer.fit_transform(docs)

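TfidfVectorizer scales each document vector to unit length by default, so the dot product of two rows is already their cosine similarity. We can then quickly measure the similarity between every pair of verses with: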

In [None]:
pairwise_similarity = tfidf @ tfidf.T  # matrix product of the unit-length TF-IDF rows
pairwise_similarity.toarray()

array([[1.        , 0.7495952 , 0.55254323, 0.43796031],
       [0.7495952 , 1.        , 0.81888181, 0.64906728],
       [0.55254323, 0.81888181, 1.        , 0.79262632],
       [0.43796031, 0.64906728, 0.79262632, 1.        ]])

Finally, we can format that view better with:

In [None]:
import pandas as pd
terms = pd.DataFrame(pairwise_similarity.toarray())
terms

          0         1         2         3
0  1.000000  0.749595  0.552543  0.437960
1  0.749595  1.000000  0.818882  0.649067
2  0.552543  0.818882  1.000000  0.792626
3  0.437960  0.649067  0.792626  1.000000


The diagonal compares each document with itself, hence the 1.0. Notice how dissimilar the 1st and 4th documents are (about 0.44) compared with neighboring verses.

### Embeddings
We can also measure word similarity in a similar manner by using a concept called word embeddings. Word embeddings are vector representations of words learned from either a statistical or deep learning model.

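One of the best-known embedding models is Word2Vec, which is implemented in a library called gensim. Tokenized sentences (lists of words) are fed into the model, and the model learns a vector representation of each word. We can then use those vectors to measure similarity between words in our corpus.

First, we want to tokenize the documents into sentences, and each sentence into a list of words, with the code below. Note that each verse already ends with '. ', so joining the verses with '.' leaves a stray period that surfaces as the token '.This' in the output; joining with an empty string instead would avoid it.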

In [None]:
from nltk.tokenize import word_tokenize, sent_tokenize
import nltk
nltk.download('punkt')

doc = '.'.join(docs)
# Split the text into sentences, then each sentence into a list of word tokens
data = [word_tokenize(sent) for sent in sent_tokenize(doc)]

data

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


[['This', 'is', 'the', 'house', 'that', 'Jack', 'built', '.'],
 ['This',
  'is',
  'the',
  'cheese',
  'that',
  'lay',
  'in',
  'the',
  'house',
  'that',
  'Jack',
  'built',
  '.'],
 ['This',
  'is',
  'the',
  'rat',
  'that',
  'ate',
  'the',
  'cheese',
  'That',
  'lay',
  'in',
  'the',
  'house',
  'that',
  'Jack',
  'built',
  '.'],
 ['This',
  'is',
  'the',
  'cat',
  'that',
  'chased',
  'the',
  'rat',
  'That',
  'ate',
  'the',
  'cheese',
  'that',
  'lay',
  'in',
  'the',
  'house',
  'that',
  'Jack',
  'built',
  '.'],
 ['.This',
  'is',
  'the',
  'dog',
  'that',
  'worried',
  'the',
  'cat',
  'That',
  'chased',
  'the',
  'rat',
  'that',
  'ate',
  'the',
  'cheese',
  'That',
  'lay',
  'in',
  'the',
  'house',
  'that',
  'Jack',
  'built',
  '.'],
 ['This',
  'is',
  'the',
  'cow',
  'with',
  'the',
  'crumpled',
  'horn',
  'That',
  'tossed',
  'the',
  'dog',
  'that',
  'worried',
  'the',
  'cat',
  'That',
  'chased',
  'the',
  'rat',
  't

Next we will import the Word2Vec model and create it with:

In [None]:
import gensim
from gensim.models import Word2Vec

model1 = gensim.models.Word2Vec(data, min_count = 1, vector_size = 10, window = 5)

We can look at the model's vocabulary with:

In [None]:
model1.wv.key_to_index

{'the': 0,
 'that': 1,
 'That': 2,
 'is': 3,
 'house': 4,
 'Jack': 5,
 'built': 6,
 '.': 7,
 'cheese': 8,
 'lay': 9,
 'in': 10,
 'ate': 11,
 'rat': 12,
 'This': 13,
 'cat': 14,
 'chased': 15,
 'worried': 16,
 'dog': 17,
 '.This': 18,
 'cow': 19,
 'with': 20,
 'crumpled': 21,
 'horn': 22,
 'tossed': 23,
 'all': 24,
 'maiden': 25,
 'forlorn': 26,
 'milked': 27,
 'torn': 28,
 'man': 29,
 'tattered': 30,
 'and': 31,
 'kissed': 32}

Then measure the similarity between 2 words with the following (with such a tiny corpus the value is not meaningful and may vary between runs):



In [None]:
model1.wv.similarity('Jack', 'rat')

-0.05713532

The model we created here uses the Continuous Bag of Words (CBOW) architecture, which is gensim's default. There are other variations of embedding models, such as skip-gram, which gensim's Word2Vec selects with sg=1.
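The difference between CBOW and skip-gram, the other Word2Vec architecture (selected in gensim with sg=1), comes down to how training examples are framed. The sketch below is a simplified illustration of that framing only, not of gensim's internals; the helper name training_pairs, the sentence, and the window size are assumptions made for this example.

```python
def training_pairs(sentence, window=2):
    """Yield (context_words, target_word) for each position in a tokenized sentence."""
    for i, target in enumerate(sentence):
        left = sentence[max(0, i - window):i]
        right = sentence[i + 1:i + 1 + window]
        yield left + right, target

toy_sentence = ['the', 'cat', 'chased', 'the', 'rat']

# CBOW: predict the target word from its surrounding context words.
cbow_examples = list(training_pairs(toy_sentence, window=1))

# Skip-gram: predict each context word from the target word.
skipgram_examples = [(target, ctx) for context, target in cbow_examples for ctx in context]

print(cbow_examples[:2])
print(skipgram_examples[:3])
```

CBOW produces one example per word position, while skip-gram produces one example per (word, neighbor) pair, so it sees more training examples from the same text.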