# ANLP - Lab 5: Textual Similarity

Today's goals:
Learn how to calculate semantic similarity using
- methods such as bag-of-words and tf-idf
- various measures, such as cosine distance and word mover's distance
- resources such as WordNet

---

## Task 1: Create a bag-of-words (BoW) representation


A bag-of-words representation for a document just lists the number of times each word occurs in the document. 

![](https://raw.githubusercontent.com/ancatmara/data-science-nlp/master/img/bow.png)

*Example:* The resulting bag of words model of the following three example sentences,

    "I like to play football"
    "Did you go outside to play tennis"
    "John and I play tennis"

looks like this:

<table>
  <tr>
    <th></th>
    <th>Play</th>
    <th>Tennis</th>
    <th>To</th>
    <th>I</th>
    <th>Football</th>
    <th>Did </th>
    <th>You</th>
    <th>go</th>
  </tr>
  <tr>
    <td>Sentence 1</td>
    <td>1</td>
    <td>0</td>
    <td>1</td>
    <td>1</td>
    <td>1</td>
    <td>0</td>
    <td>0</td>
    <td>0</td>
  </tr>
  <tr>
    <td>Sentence 2</td>
    <td>1</td>
    <td>1</td>
    <td>1</td>
    <td>0</td>
    <td>0</td>
    <td>1</td>
    <td>1</td>
    <td>1</td>
  </tr>
  <tr>
    <td>Sentence 3</td>
    <td>1</td>
    <td>1</td>
    <td>0</td>
    <td>1</td>
    <td>0</td>
    <td>0</td>
    <td>0</td>
    <td>0</td>
  </tr>
</table>

Following these steps, create a list of bags of words:

    Step 1: Tokenize the Sentences
    Step 2: Create a Dictionary of Word Frequency
    Step 3: Creating the Bag of Words Model
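
As a minimal sketch of these three steps, the snippet below builds count vectors for the three example sentences above. It assumes that lowercasing plus whitespace splitting is an adequate tokenizer for this toy input (the lab itself uses NLTK), and the names `sentences`, `vocabulary` and `bow_vectors` are illustrative, not part of the lab code.

```python
from collections import Counter

sentences = [
    "I like to play football",
    "Did you go outside to play tennis",
    "John and I play tennis",
]

# Step 1: tokenize (assumption: lowercase + whitespace split is enough here)
tokenized = [sentence.lower().split() for sentence in sentences]

# Step 2: count how often each word occurs in the whole corpus
word_freq = Counter(token for tokens in tokenized for token in tokens)

# Step 3: one count vector per sentence, over a fixed (alphabetical) vocabulary order
vocabulary = sorted(word_freq)
bow_vectors = [[tokens.count(word) for word in vocabulary] for tokens in tokenized]

print(vocabulary)
for vector in bow_vectors:
    print(vector)
```

Unlike the table above, this sketch keeps every word of the three sentences as a column, so each vector has one entry per distinct word.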
    
---

We scrape the Wikipedia article on [Natural Language Processing](https://en.wikipedia.org/wiki/Natural_language_processing). The `beautifulsoup4` library (imported as `bs4`) is used to parse the HTML from Wikipedia, and Python's regex library, `re`, is used for some preprocessing of the text.

In [None]:
import nltk
nltk.download("punkt")
nltk.download("wordnet")

import numpy as np  
import random  
import string
import bs4 as bs  
import urllib.request  
import re
import heapq
import pandas as pd

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
raw_html = urllib.request.urlopen('https://en.wikipedia.org/wiki/Natural_language_processing')  
raw_html = raw_html.read()

article_html = bs.BeautifulSoup(raw_html, 'lxml')
article_paragraphs = article_html.find_all('p')

article_text = ''

for para in article_paragraphs:  
    article_text += para.text
    
corpus = nltk.sent_tokenize(article_text)

Now, we iterate through each sentence in the corpus, convert the sentence to lower case, and then remove the punctuation and empty spaces from the text.

In [None]:
for i in range(len(corpus )):
    corpus[i] = corpus[i].lower()
    corpus[i] = re.sub(r'\W', ' ', corpus [i])
    corpus[i] = re.sub(r'\s+', ' ', corpus [i])

The next step is to tokenize the sentences in the corpus and create a dictionary that contains words and their corresponding frequencies in the corpus. 

In [None]:
wordfreq = {}
for sentence in corpus:
    tokens = nltk.word_tokenize(sentence)
    for token in tokens:
        if token not in wordfreq.keys():
            wordfreq[token] = 1
        else:
            wordfreq[token] += 1

# Filter down the vocabulary to the 200 most frequently occurring words.
most_freq = heapq.nlargest(200, wordfreq, key=wordfreq.get)
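
The frequency dictionary and the top-k filter can also be written with `collections.Counter`. The sketch below compares both approaches on a hypothetical toy token list (`toy_tokens` is invented for illustration); it is an alternative formulation, not a change to the lab code.

```python
import heapq
from collections import Counter

# Hypothetical toy tokens, used only to compare the two approaches.
toy_tokens = "the cat sat on the mat and the cat slept".split()

toy_freq = Counter(toy_tokens)  # word -> number of occurrences

# Two ways of selecting the 2 most frequent words.
top_two_counter = [word for word, _ in toy_freq.most_common(2)]
top_two_heapq = heapq.nlargest(2, toy_freq, key=toy_freq.get)

print(top_two_counter, top_two_heapq)
```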

In [None]:
sentence_vectors = []
for sentence in corpus:
    sentence_tokens = nltk.word_tokenize(sentence)
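    # Your code goes here
    # Binary presence vector: 1 if the word occurs in the sentence, else 0.
    # (Use sentence_tokens.count(token) instead to store the actual counts.)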
    sent_vec = []
    for token in most_freq:
        if token in sentence_tokens:
            sent_vec.append(1)
        else:
            sent_vec.append(0)
    sentence_vectors.append(sent_vec)

print(pd.DataFrame(data=sentence_vectors))

    0    1    2    3    4    5    6    ...  193  194  195  196  197  198  199
0     1    1    1    1    1    1    1  ...    0    0    0    0    0    0    0
1     1    1    1    0    0    0    1  ...    1    1    1    1    0    0    0
2     1    0    0    0    1    1    0  ...    0    0    0    0    1    1    1
3     0    0    1    0    1    1    0  ...    0    0    0    0    0    0    0
4     1    0    1    0    0    1    0  ...    0    0    0    0    0    0    0
5     1    1    1    0    1    1    1  ...    0    0    0    0    0    0    0
6     1    1    1    1    1    0    1  ...    0    0    0    0    0    0    0
7     1    1    1    1    0    0    0  ...    0    0    0    0    0    0    0
8     1    1    1    0    0    1    1  ...    0    0    0    0    0    0    0
9     1    1    0    1    1    1    0  ...    0    0    0    0    0    0    0
10    1    1    1    1    0    0    0  ...    0    0    0    0    0    0    0
11    1    1    1    1    1    1    1  ...    0    0    0    0  

---
## Task 2: Create a TF-IDF model

**Tf-idf** is a statistical measure used to evaluate how important a word is to a document in a collection or corpus; the importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.

#### What does it mean?

1. The idea behind the TF-IDF approach is that the words that are more common in one text/sentence and less common in other texts/sentences should be given high weights.
2. If a word occurs in all texts in a corpus, it is not characteristic of any of them, so it gets a weight of 0 (since log(1) = 0). That is the case for function words like “the”.
3. If a word occurs many times in (a) particular text(s) and doesn’t occur elsewhere in the corpus, it will get a very high tf-idf score for this/these text(s) and can be used as a distinctive feature.

* **Tf** = term frequency
* **Idf** = inverse document frequency
* **Term** ≈ word
* **Document** ≈ sentence/text

Implement TF-IDF following these steps:
* Once you have tokenized the sentences, the next step is to find the TF-IDF value for each word in the sentence.
* TF = (Frequency of the word in the sentence) / (Total number of words in the sentence)
* IDF = log((Total number of sentences)/(Number of sentences  containing the word))
* TF-IDF = TF * IDF
    
![TF-IDF formula](https://radimrehurek.com/gensim/_images/math/5332752ed4984e682c6a54406ac01d5dae47d6d1.png)
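
To make the formulas concrete, here is a small worked sketch in plain Python. The corpus `toy_docs` is hypothetical and already tokenized, and the helper names `tf`, `idf` and `tf_idf` are chosen for illustration; `idf` assumes the term occurs in at least one document.

```python
import math

# Hypothetical toy corpus, already tokenized; not taken from the Wikipedia article.
toy_docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred", "and", "the", "cat", "slept"],
]

def tf(term, doc):
    # frequency of the term in the document / number of words in the document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log(total number of documents / number of documents containing the term)
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("the", toy_docs[0], toy_docs))  # "the" occurs in every document
print(tf_idf("cat", toy_docs[2], toy_docs))  # frequent here, but also in another document
print(tf_idf("dog", toy_docs[1], toy_docs))  # occurs in one document only
```

With these plain formulas, a word that appears in every document gets an IDF of log(1) = 0, which matches point 2 above.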



In [None]:
# create the TF dictionary for each word
word_tf_values = {}
for token in most_freq:
    sent_tf_vector = []
    for document in corpus:
        doc_freq = 0
        for word in nltk.word_tokenize(document):
            if token == word:
                  doc_freq += 1
        word_tf = doc_freq/len(nltk.word_tokenize(document))
        sent_tf_vector.append(word_tf)
    word_tf_values[token] = sent_tf_vector
    
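# find the IDF values for the most frequently occurring words in the corpus
# (the extra 1 in the denominator is a smoothing term, so the values differ slightly
#  from the plain formula; a word that occurs in every sentence gets a small negative IDF)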
word_idf_values = {}
for token in most_freq:
    doc_containing_word = 0
    for document in corpus:
        if token in nltk.word_tokenize(document):
            doc_containing_word += 1
    word_idf_values[token] = np.log(len(corpus)/(1 + doc_containing_word))


# create tf-idf
tfidf_values = []
# Your code goes here.
for token in word_tf_values.keys():
    tfidf_sentences = []
    for tf_sentence in word_tf_values[token]:
        tf_idf_score = tf_sentence * word_idf_values[token]
        tfidf_sentences.append(tf_idf_score)
    tfidf_values.append(tfidf_sentences)

tf_idf_model = np.asarray(tfidf_values)
tf_idf_model = np.transpose(tf_idf_model)

print(pd.DataFrame(data=tf_idf_model))

         0         1         2         3    ...       196     197     198     199
0   0.008302  0.015118  0.058393  0.034466  ...  0.000000  0.0000  0.0000  0.0000
1   0.061674  0.042114  0.036148  0.000000  ...  0.150333  0.0000  0.0000  0.0000
2   0.044153  0.000000  0.000000  0.000000  ...  0.000000  0.1435  0.1435  0.1435
3   0.000000  0.000000  0.142332  0.000000  ...  0.000000  0.0000  0.0000  0.0000
4   0.035976  0.000000  0.084345  0.000000  ...  0.000000  0.0000  0.0000  0.0000
5   0.018680  0.011338  0.014598  0.000000  ...  0.000000  0.0000  0.0000  0.0000
6   0.019046  0.011561  0.014884  0.013178  ...  0.000000  0.0000  0.0000  0.0000
7   0.017988  0.016378  0.042173  0.037339  ...  0.000000  0.0000  0.0000  0.0000
8   0.026982  0.012283  0.063259  0.000000  ...  0.000000  0.0000  0.0000  0.0000
9   0.033495  0.030497  0.000000  0.023176  ...  0.000000  0.0000  0.0000  0.0000
10  0.032379  0.014740  0.037955  0.033605  ...  0.000000  0.0000  0.0000  0.0000
11  0.011165  0.

Both bag of words and tf-idf models are available in `sklearn.feature_extraction.text` as [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) respectively. Here are some important parameters:

* analyzer = 'word', 'char' or 'char_wb' (character n-grams taken only from text inside word boundaries)
* stop_words = 'english', a list of stop words, or None
* ngram_range = (min_n, max_n): lower and upper bound on the size of the n-grams to extract
* max_df = 1.0: ignore terms whose document frequency is above this threshold
* min_df = 1: ignore terms whose document frequency is below this threshold

---
## Task 3: Cosine similarity
**Question**: What are the most similar sentences in your `corpus` using the cosine similarity and TF-IDF?

In [None]:
def cos_sim(a,b):
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)
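
One caveat: a sentence that contains none of the 200 vocabulary words yields an all-zero vector, and `cos_sim` then divides by zero. A possible guard is sketched below; the name `safe_cos_sim` and the choice to return 0.0 for zero vectors are assumptions made for illustration, not part of the lab code.

```python
import numpy as np

def safe_cos_sim(a, b):
    """Cosine similarity that returns 0.0 when either vector has zero length."""
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat an empty vector as similar to nothing
    return float(np.dot(a, b) / (norm_a * norm_b))

print(safe_cos_sim(np.array([1, 0, 1]), np.array([1, 0, 1])))  # same direction
print(safe_cos_sim(np.array([1, 0, 0]), np.array([0, 1, 0])))  # orthogonal
print(safe_cos_sim(np.array([0, 0, 0]), np.array([0, 1, 0])))  # zero vector
```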

In [None]:
similarity_matrix = []
# Your code goes here
for sent_num_1, sent_vec_1 in enumerate(tf_idf_model):
    max_similarity, max_index = 0, 0
    for sent_num_2, sent_vec_2 in enumerate(tf_idf_model):
        if not np.array_equal(sent_vec_1, sent_vec_2):
            if cos_sim(sent_vec_1, sent_vec_2) > max_similarity:
                max_similarity = cos_sim(sent_vec_1, sent_vec_2)
                max_index = sent_num_2
    similarity_matrix.append((sent_num_1, max_index, max_similarity))

for i in similarity_matrix:
    print(corpus[i[0]])
    print(corpus[i[1]])
    print(i[2], end="\n\n")

natural language processing nlp is a subfield of linguistics computer science and artificial intelligence concerned with the interactions between computers and human language in particular how to program computers to process and analyze large amounts of natural language data 
already in 1950 alan turing published an article titled computing machinery and intelligence which proposed what is now called the turing test as a criterion of intelligence a task that involves the automated interpretation and generation of natural language but at the time not articulated as a problem separate from artificial intelligence 
0.2159668907867705

the result is a computer capable of understanding the contents of documents including the contextual nuances of the language within them 
the technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves 
0.14601408318066153

the technology can then accurately extract in




"Cosine similarity on bag-of-words vectors is known to do well in practice, but it inherently cannot capture when documents say the same thing in completely different words".

Take, for example, two headlines:

    Obama speaks to the media in Illinois
    The President greets the press in Chicago

These have no content words in common, so according to most bag-of-words-based metrics, their distance would be maximal.

**Question**: How can this be explained, and how can it be addressed?



 ([source](https://vene.ro/blog/word-movers-distance-in-python.html))

---
## Task 4: Word mover's distance (WMD) for document classification

WMD adapts the earth mover's distance to the space of documents: the distance between two texts is given by the total amount of "mass" needed to move the words from one side into the other, multiplied by the distance the words need to move. Read the original paper [here](https://mkusner.github.io/publications/WMD.pdf).

![WMD](https://vene.ro/images/wmd-obama.png)
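
To build some intuition, the sketch below computes a simplified variant, often called relaxed WMD, in which every word sends all of its mass to the nearest word of the other document. This is a lower bound on WMD, not the full optimal-transport problem that gensim solves. The 2-D vectors in `toy_vectors` are invented for illustration and are not real embeddings.

```python
import numpy as np

# Hypothetical 2-D "embeddings", invented for illustration only.
toy_vectors = {
    "obama": np.array([1.0, 0.0]),
    "president": np.array([0.9, 0.1]),
    "media": np.array([0.0, 1.0]),
    "press": np.array([0.1, 0.9]),
}

def one_sided_relaxed_wmd(doc_a, doc_b, vectors):
    """Each token of doc_a moves its mass (1/len(doc_a)) to the nearest token of doc_b."""
    mass = 1.0 / len(doc_a)
    total = 0.0
    for word_a in doc_a:
        nearest = min(np.linalg.norm(vectors[word_a] - vectors[word_b]) for word_b in doc_b)
        total += mass * nearest
    return total

def relaxed_wmd(doc_a, doc_b, vectors):
    """Take the larger of the two one-sided values (the tighter lower bound)."""
    return max(one_sided_relaxed_wmd(doc_a, doc_b, vectors),
               one_sided_relaxed_wmd(doc_b, doc_a, vectors))

doc_1 = ["obama", "media"]
doc_2 = ["president", "press"]
print(relaxed_wmd(doc_1, doc_2, toy_vectors))  # small: the words are close in the toy space
print(relaxed_wmd(doc_1, doc_1, toy_vectors))  # identical documents
```

The two toy documents share no words, yet the distance is small because each word has a close neighbour on the other side.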

In the following, we download the GloVe word embeddings and use them to compare a few words and documents.

In [None]:
import gensim.downloader as api
word_vectors = api.load("glove-wiki-gigaword-50")



In [None]:
from sklearn.feature_extraction.text import CountVectorizer
d1 = "Obama speaks to the media in Illinois"
d2 = "The President addresses the press in Chicago"

# Convert a collection of text documents to a matrix of token counts, except stop words
vect = CountVectorizer(stop_words="english").fit([d1, d2])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
print("Features:",  ", ".join(vect.get_feature_names_out()))

Features: addresses, chicago, illinois, media, obama, president, press, speaks



The two documents are completely orthogonal in terms of bag-of-words. What cosine similarity do you expect?


In [None]:
v_1, v_2 = vect.transform([d1, d2])
v_1 = v_1.toarray().ravel()
v_2 = v_2.toarray().ravel()
print(v_1, v_2)
print("Cosine similarity (doc_1, doc_2) = {:.2f}".format(cos_sim(v_1, v_2)))

[0 0 1 1 1 0 0 1] [1 1 0 0 0 1 1 0]
Cosine similarity (doc_1, doc_2) = 0.00


In [None]:
# wmdistance is a built-in method of gensim's KeyedVectors; it returns a distance (lower = more similar)
wmd = word_vectors.wmdistance(d1.lower().split(), d2.lower().split())
print("{:.4f}".format(wmd))
print(word_vectors.distance("media", "media"))
print(word_vectors.n_similarity(['sushi', 'shop'], ['japanese', 'restaurant']))

2.7885
0.0
0.74835527


Following Task 1, extract the sentences of the [Computational Linguistics Wikipedia Page](https://en.wikipedia.org/wiki/Computational_linguistics) and sort the most similar sentences using word embeddings and WMD.

In [None]:
raw_html = urllib.request.urlopen('https://en.wikipedia.org/wiki/Computational_linguistics')  
raw_html = raw_html.read()

article_html = bs.BeautifulSoup(raw_html, 'lxml')

article_paragraphs = article_html.find_all('p')

article_text = ''

for para in article_paragraphs:  
    article_text += para.text
    
CL_corpus = nltk.sent_tokenize(article_text)

for i in range(len(CL_corpus)):
    CL_corpus[i] = CL_corpus[i].lower()
    CL_corpus[i] = re.sub(r'\W',' ',CL_corpus [i])
    CL_corpus[i] = re.sub(r'\s+',' ',CL_corpus [i])

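# Your code goes here
# For each CL sentence, find the closest NLP sentence by WMD (lower distance = more similar).
similar_pairs = []
for cl_indx, cl_sent in enumerate(CL_corpus[0:50]):
    best_dist, best_indx = float("inf"), None  # reset for every CL sentence
    for nlp_indx, nlp_sent in enumerate(corpus):
        # wmdistance expects lists of tokens, not raw strings
        dist = word_vectors.wmdistance(nlp_sent.split(), cl_sent.split())
        if dist < best_dist:
            best_dist, best_indx = dist, nlp_indx
    if best_indx is not None:  # skip sentences without any in-vocabulary match
        similar_pairs.append((cl_indx, best_indx, best_dist))

# Sort so that the most similar pairs (smallest distance) come first.
similar_pairs.sort(key=lambda pair: pair[2])

for cl_indx, nlp_indx, dist in similar_pairs:
    print(CL_corpus[cl_indx])
    print(corpus[nlp_indx])
    print(dist, end="\n\n")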

computational linguistics is an interdisciplinary field concerned with the computational modelling of natural language as well as the study of appropriate computational approaches to linguistic questions 
the machine learning paradigm calls instead for using statistical inference to automatically learn such rules through the analysis of large corpora the plural form of corpus is a set of documents possibly with human or computer annotations of typical real world examples 
0.42612991350233587

computational linguistics is an interdisciplinary field concerned with the computational modelling of natural language as well as the study of appropriate computational approaches to linguistic questions 
the machine learning paradigm calls instead for using statistical inference to automatically learn such rules through the analysis of large corpora the plural form of corpus is a set of documents possibly with human or computer annotations of typical real world examples 
0.42612991350233587

trad

## Extra Task: Calculating WordNet Synset similarity (with an introduction to WordNet)

[WordNet](https://wordnet.princeton.edu/) is a large lexical database of English nouns, adjectives, adverbs and verbs, which can be accessed through Python's Natural Language Toolkit (NLTK). Words are grouped into sets of cognitive synonyms, called synsets.

You can get information about words as follows:

In [None]:
from nltk.corpus import wordnet   #Import wordnet from the NLTK
synset = wordnet.synsets("Travel")
print('Word and Type : ' + synset[0].name())
print('Synonym of Travel is: ' + synset[0].lemmas()[0].name())
print('The meaning of the word : ' + synset[0].definition())
print('Example of Travel : ' + str(synset[0].examples()))

Word and Type : travel.n.01
Synonym of Travel is: travel
The meaning of the word : the act of going from one place to another
Example of Travel : ['he enjoyed selling but he hated the travel']


Here we see how WordNet returns the synonyms and antonyms of a given word:

In [None]:
syn = list()
ant = list()
for synset in wordnet.synsets("Worse"):
    for lemma in synset.lemmas():
        syn.append(lemma.name())    # add the synonyms
        if lemma.antonyms():    # when antonyms are available, add them to the list
            ant.append(lemma.antonyms()[0].name())
print('Synonyms: ' + str(syn))
print('Antonyms: ' + str(ant))

Synonyms: ['worse', 'worse', 'worse', 'worsened', 'bad', 'bad', 'big', 'bad', 'tough', 'bad', 'spoiled', 'spoilt', 'regretful', 'sorry', 'bad', 'bad', 'uncollectible', 'bad', 'bad', 'bad', 'risky', 'high-risk', 'speculative', 'bad', 'unfit', 'unsound', 'bad', 'bad', 'bad', 'forged', 'bad', 'defective', 'worse']
Antonyms: ['better', 'better', 'good', 'unregretful']


In [None]:
first_word = wordnet.synset("Travel.v.01")
second_word = wordnet.synset("Walk.v.01")
print('WordNet similarity: ' + str(first_word.wup_similarity(second_word)))
first_word = wordnet.synset("Good.n.01")
second_word = wordnet.synset("zebra.n.01")
print('WordNet similarity: ' + str(first_word.wup_similarity(second_word)))

WordNet similarity: 0.6666666666666666
WordNet similarity: 0.09090909090909091
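
`wup_similarity` is the Wu-Palmer measure, which relates the depth of the two concepts' deepest common ancestor to their own depths: 2 * depth(common ancestor) / (depth(a) + depth(b)). The sketch below applies this formula to a hypothetical toy taxonomy (the `parent` dictionary is invented, not WordNet data). NLTK's implementation handles details that this sketch ignores, for example synsets with several hypernym paths.

```python
# Hypothetical toy taxonomy (child -> parent); not WordNet data.
parent = {
    "animal": "entity",
    "dog": "animal",
    "cat": "animal",
    "plant": "entity",
    "tree": "plant",
}

def path_to_root(node):
    """Return the chain node, parent, ..., root."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def depth(node):
    return len(path_to_root(node))  # the root has depth 1

def wu_palmer(a, b):
    ancestors_b = set(path_to_root(b))
    # first node on a's path that is also an ancestor of b = deepest common ancestor
    common = next(node for node in path_to_root(a) if node in ancestors_b)
    return 2 * depth(common) / (depth(a) + depth(b))

print(wu_palmer("dog", "cat"))   # share the ancestor "animal"
print(wu_palmer("dog", "tree"))  # share only the root "entity"
```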


---
This session was inspired by the following sources:
- [https://stackabuse.com/python-for-nlp-creating-tf-idf-model-from-scratch/](https://stackabuse.com/python-for-nlp-creating-tf-idf-model-from-scratch/)
- [https://vene.ro/blog/word-movers-distance-in-python.html](https://vene.ro/blog/word-movers-distance-in-python.html)
- [https://radimrehurek.com/gensim/models/tfidfmodel.html](https://radimrehurek.com/gensim/models/tfidfmodel.html)
- [https://radimrehurek.com/gensim/models/keyedvectors.html](https://radimrehurek.com/gensim/models/keyedvectors.html)