# Natural Language Understanding

This section covers several topics related to text understanding:

- Similarity measures
- Word Vectors
- Vector Space Model
- Types of vectorizers
- Building a vectorizer with TensorFlow

## Similarity measures

A single word can have several meanings, which makes comparison and analysis more complex. The example below lists the WordNet senses of the word *developer* together with their definitions.

In [1]:
# Requires the textblob package (pip install textblob) and its WordNet corpus
from textblob import Word

w = Word("developer")

# Print every sense (synset) of the word next to its definition
for synset, definition in zip(w.get_synsets(), w.define()):
    print(synset, definition)

There are many ways to measure the similarity of two strings. Examples using two popular Python libraries are shown below. We compare two strings: *training* and *trains*. The ``SequenceMatcher`` class from ``difflib`` lets us use the Gestalt pattern matching algorithm:

In [3]:
from difflib import SequenceMatcher
a = "training"
b = "trains"
print(len(a))
print(len(b))
ratio = SequenceMatcher(None, a, b).ratio()
print(ratio)

8
6
0.7142857142857143


The ratio is a normalized similarity value between 0 and 1, where 1 means the strings are identical.
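To see where the value 0.714 comes from, the sketch below recomputes it by hand. It relies on the documented definition of ``ratio()`` as `2.0 * M / T`, where `M` is the number of matched characters and `T` is the total length of both strings. The names `matched` and `manual_ratio` are only illustrative.

```python
from difflib import SequenceMatcher

a = "training"
b = "trains"

matcher = SequenceMatcher(None, a, b)

# Sum the sizes of all matching blocks (the last block is a zero-size sentinel)
matched = sum(block.size for block in matcher.get_matching_blocks())

# ratio() is defined as 2.0 * M / T
manual_ratio = 2.0 * matched / (len(a) + len(b))

print(matched)
print(manual_ratio)
```

Here the only matching block is the common prefix `train`, so `M` is 5 and `T` is 14.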

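A different approach is shown below, using the Jellyfish library. It offers several methods; one of them is the Levenshtein distance, which counts the single-character insertions, deletions and substitutions needed to turn one string into the other. Below, the distance is calculated and then normalized by the length of the longer string, so that it can be turned into a similarity value.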

In [4]:
import jellyfish
distance = jellyfish.levenshtein_distance(a,b)
print(distance)

normalized_distance = distance/max(len(a),len(b))
print(1.0-normalized_distance)

3
0.625
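For readers without Jellyfish installed, the same number can be reproduced with a short dynamic-programming routine. This is a minimal sketch of the classic row-by-row algorithm, not the library's implementation; the function name `levenshtein` is our own.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    # previous[j] holds the distance between the current prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into an empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


distance = levenshtein("training", "trains")
similarity = 1.0 - distance / max(len("training"), len("trains"))
print(distance, similarity)
```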


Some words are more similar to each other than others. To check this, we can build a similarity matrix in which 1 means identical and 0 means totally different.

In [6]:
import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()

tokens = nlp(u'king queen horse cat desk lamp')

for first_token in tokens:
    for second_token in tokens:
        print(first_token.text, second_token.text, first_token.similarity(second_token))

king king 1.0
king queen 0.6128482222557068
king horse 0.5876098871231079
king cat 0.6193754076957703
king desk 0.5271511077880859
king lamp 0.25223037600517273
queen king 0.6128482222557068
queen queen 1.0
queen horse 0.6164734363555908
queen cat 0.6143932342529297
queen desk 0.6345311403274536
queen lamp 0.3886454999446869
horse king 0.5876098871231079
horse queen 0.6164734363555908
horse horse 1.0
horse cat 0.6475561261177063
horse desk 0.6181738376617432
horse lamp 0.37150055170059204
cat king 0.6193754076957703
cat queen 0.6143932342529297
cat horse 0.6475561261177063
cat cat 1.0
cat desk 0.7207602262496948
cat lamp 0.4289367198944092
desk king 0.5271511077880859
desk queen 0.6345311403274536
desk horse 0.6181738376617432
desk cat 0.7207602262496948
desk desk 1.0
desk lamp 0.4279959797859192
lamp king 0.25223037600517273
lamp queen 0.3886454999446869
lamp horse 0.37150055170059204
lamp cat 0.4289367198944092
lamp desk 0.4279959797859192
lamp lamp 1.0
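spaCy documents its default similarity estimate as the cosine similarity of the underlying vectors. The sketch below computes such a matrix with NumPy alone. The vectors in `toy_vectors` are made up for illustration and are not taken from any spaCy model.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Return the matrix of pairwise cosine similarities between rows."""
    vectors = np.asarray(vectors, dtype=np.float64)
    # Scale every row to unit length, then a dot product gives the cosine
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms
    return unit @ unit.T


# Hypothetical 2-dimensional "word vectors"
toy_vectors = np.array([
    [1.0, 0.0],
    [1.0, 1.0],
    [0.0, 2.0],
])

sim = cosine_similarity_matrix(toy_vectors)
print(np.round(sim, 3))
```

As in the spaCy output, the diagonal is 1.0 and the matrix is symmetric, so only the upper triangle carries new information.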




We can also compare sentences:

In [7]:
doc1 = nlp(u"Warsaw is the largest city in Poland.")
doc2 = nlp(u"Crossaint is baked in France.")
doc3 = nlp(u"An emu is a large bird.")

for doc in [doc1, doc2, doc3]:
    for other_doc in [doc1, doc2, doc3]:
        print(doc.similarity(other_doc))

1.0
0.6390468803250852
0.6041416025106453
0.6390468803250852
1.0
0.5464452551815584
0.6041416025106453
0.5464452551815584
1.0




Rounded to two decimal places, the similarity matrix looks as follows:

|          | doc1 | doc2 | doc3 |
|----------|------|------|------|
| **doc1** | 1.0  | 0.64 | 0.60 |
| **doc2** | 0.64 | 1.0  | 0.55 |
| **doc3** | 0.60 | 0.55 | 1.0  |
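The nested loop prints the nine scores row by row, so they can be reshaped into a 3x3 matrix directly. The sketch below does this with the values printed above; the names `scores` and `matrix` are illustrative.

```python
import numpy as np

# The nine values printed by the nested loop, in order
scores = [
    1.0, 0.6390468803250852, 0.6041416025106453,
    0.6390468803250852, 1.0, 0.5464452551815584,
    0.6041416025106453, 0.5464452551815584, 1.0,
]

# Row i, column j holds the similarity of document i+1 to document j+1
matrix = np.array(scores).reshape(3, 3)

print(np.round(matrix, 2))
```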

## Word Vectors

spaCy models already come with vectors for words.

Let's take a look at the vectors that are available in spaCy using the previous example:

In [8]:
nlp = en_core_web_sm.load()

tokens = nlp(u'king queen horse cat desk lamp')

for token in tokens:
    print(str(token)+" "+str(token.vector))

king [-1.1133616  -1.0715381   0.17498347  0.2983135  -1.1198013   0.06686684
  0.83400404  0.8415991  -0.81147885 -0.04442363  2.0902855   0.61689997
 -1.6941863  -0.18409964 -0.57307726 -0.08472103 -1.5606233  -0.4961265
 -0.4470026  -0.53579384  0.23178513  1.6318842   1.3830286   0.55126584
  0.325065    0.3905236   0.66175556  1.4945935   0.01249412 -0.31671014
 -0.8269541  -0.94043803 -0.15464754  0.12148196 -0.736446   -0.11215541
  0.49824756 -0.5074818   0.1346393   0.01301625 -1.4823344   1.1512142
 -0.28982484  0.4669568  -0.2697815  -0.47252363 -0.10474992  0.5826209
  0.14604366  0.09867011 -0.5415132  -0.63924223  0.98515934 -0.7425058
 -0.8650961  -0.45056728  1.1102725  -0.31695974  0.66481125 -0.32102245
 -0.9101965  -0.43204626 -0.0340566  -1.2508934  -0.93322    -0.49311382
 -0.18716311  1.8433582   0.56230295 -0.705404    1.1738945  -0.29050007
  0.4054772  -0.8470016   0.29123953  0.13989909 -0.24977589  0.203008
  0.43926707 -0.86123663  0.06630252 -0.41542214 -0.

The vectors are quite long. It's easy to check the exact size of a vector:

In [9]:
len(tokens[1].vector)

96

You can experiment and check the vector values for other sentences. Let's take a look at the size of the sentence vector for one of our previous examples:

In [10]:
len(doc1.vector)

96
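The sentence vector has the same length as a word vector because spaCy documents ``Doc.vector`` as defaulting to the average of the token vectors. A minimal NumPy sketch of that averaging, using made-up 2-dimensional token vectors instead of real model output:

```python
import numpy as np

# Hypothetical token vectors for a three-token sentence (one row per token)
token_vectors = np.array([
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
])

# Averaging over the token axis keeps the vector dimensionality unchanged
doc_vector = token_vectors.mean(axis=0)

print(doc_vector, len(doc_vector))
```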

A nice interactive example of word vectorization was built by researchers at Warsaw University: [Word2Vec](https://lamyiowce.github.io/word2viz/).

## Negative sampling

Negative sampling is a simpler way to train word2vec. It is faster because each training iteration uses only a few randomly chosen terms (the negative samples) instead of the whole dataset, as in the previous example. This is why it is called negative sampling.

First of all, we define helper functions that are used later.

In [11]:
import numpy as np

def zeros(*dims):
    return np.zeros(shape=tuple(dims), dtype=np.float32)

def ones(*dims):
    return np.ones(shape=tuple(dims), dtype=np.float32)

def rand(*dims):
    return np.random.rand(*dims).astype(np.float32)

def randn(*dims):
    return np.random.randn(*dims).astype(np.float32)

def sigmoid(batch, stochastic=False):
    return 1.0 / (1.0 + np.exp(-batch))

def as_matrix(vector):
    return np.reshape(vector, (-1, 1))

We need to load the data again.

In [12]:
import nltk
import numpy as np
import pandas as pd
from collections import namedtuple

nltk.download('all')

from nltk.book import *

texts()

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


Three variables are important for the training: ``train_dict``, ``train_tokens`` and ``train_set``. The first one contains all unique words used in the corpus. The second is an array that holds, for each word in the raw text, its index in the dictionary.

In [13]:
#raw_set = nltk.corpus.treebank_raw.raw()[0:50000].replace('.START',' ').replace("\n","").replace("."," ").replace(","," ")
#tokens = [token for token in nltk.word_tokenize(raw_set) if token.isalpha()]
tokens = text6.tokens
train_dict = pd.Series(tokens).unique().tolist()
train_tokens = np.array([train_dict.index(token) for token in tokens])
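Calling ``list.index`` for every token scans the dictionary each time, which gets slow on larger corpora. One possible alternative is a word-to-index mapping. The sketch below uses a short made-up token list; `vocab`, `word_to_idx` and `token_ids` are illustrative names and do not replace the variables used in the rest of this section.

```python
# A tiny stand-in for the corpus tokens
tokens = ["the", "holy", "grail", "the", "knight"]

# dict.fromkeys keeps the first occurrence of each word, in order of appearance
vocab = list(dict.fromkeys(tokens))

# Constant-time lookup from word to its position in the vocabulary
word_to_idx = {word: idx for idx, word in enumerate(vocab)}

# Replace every token by its vocabulary index
token_ids = [word_to_idx[token] for token in tokens]

print(vocab)
print(token_ids)
```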

The last variable, ``train_set``, is a list of pairs of word indices: the index of the current word and the index of one of its neighbours. The window size decides how many neighbours are used. In this example the window size is set to 2, which means each word is paired with the two words before it and the two words after it to build the relations in the training data set.

In [14]:
train_set = []
for i in range(2,len(tokens)-2):
    train_set.append([train_dict.index(tokens[i]), train_dict.index(tokens[i-1])])
    train_set.append([train_dict.index(tokens[i]), train_dict.index(tokens[i-2])])
    train_set.append([train_dict.index(tokens[i]), train_dict.index(tokens[i+1])])
    train_set.append([train_dict.index(tokens[i]), train_dict.index(tokens[i+2])])

train_set = np.random.permutation(np.array(train_set))
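The four ``append`` calls hard-code a window size of 2. A sketch of the same pairing for an arbitrary window is shown below; `build_pairs` is a hypothetical helper, and it works on token indices that have already been looked up.

```python
def build_pairs(token_ids, window=2):
    """Return (word, context) index pairs for a symmetric context window."""
    pairs = []
    # Skip the first and last `window` positions, as in the loop above
    for i in range(window, len(token_ids) - window):
        for offset in range(1, window + 1):
            pairs.append((token_ids[i], token_ids[i - offset]))  # words before
        for offset in range(1, window + 1):
            pairs.append((token_ids[i], token_ids[i + offset]))  # words after
    return pairs


pairs = build_pairs([0, 1, 2, 3, 4], window=2)
print(pairs)
```

With five tokens and a window of 2, only the middle token has a full context, so it produces four pairs.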

The next step is to set the training configuration. We set the number of negative samples to 10 and the vector size to 100. The learning rate starts at 0.1 and is multiplied by a decay factor of 0.995 every 10,000 updates. Training runs for 8,000,000 updates, and the average cost is logged every 10,000 updates.

In [15]:
Config = namedtuple("Config", ["dict_size", "vect_size", "neg_samples", "updates", "learning_rate",
                               "learning_rate_decay", "decay_period", "log_period"])
conf = Config(
    dict_size=len(train_dict),
    vect_size=100,
    neg_samples=10,
    updates=8000000,
    learning_rate=0.1,
    learning_rate_decay=0.995,
    decay_period=10000,
    log_period=10000)
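To get a feeling for this schedule, the sketch below computes the learning rate left at the end of training. It assumes the decay is applied whenever the update counter is a multiple of the decay period, counting from zero. The function `final_learning_rate` is only an illustration.

```python
import math


def final_learning_rate(initial_rate, decay, decay_period, updates):
    """Apply the multiplicative decay once per decay period."""
    rate = initial_rate
    # Visits update indices 0, decay_period, 2 * decay_period, ...
    for _ in range(0, updates, decay_period):
        rate *= decay
    return rate


final_rate = final_learning_rate(0.1, 0.995, 10000, 8000000)
print(final_rate)
```

With these settings the decay is applied 800 times, so the rate ends up close to two percent of its initial value.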

We loop over ``updates`` and take a word and its context from the training set on each iteration. The negative context words are chosen randomly from the corpus tokens, so frequent words are drawn more often. We then look up the word, context and negative sample vectors. In the next step we calculate the cost and the corresponding gradients.
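Written out, the cost and gradients used in the code take the following form. The notation is our own: $v_w$ is the row of ``Vp`` for the word, $u_c$ the row of ``Vo`` for the context, $u_k$ the rows of ``Vo`` for the $K$ negative samples, and $\sigma$ the sigmoid function.

$$
J = -\log \sigma\left(v_w^\top u_c\right) - \sum_{k=1}^{K} \log \sigma\left(-v_w^\top u_k\right)
$$

$$
\frac{\partial J}{\partial v_w} = -\left(1 - \sigma\left(v_w^\top u_c\right)\right) u_c + \sum_{k=1}^{K} \sigma\left(v_w^\top u_k\right) u_k
$$

$$
\frac{\partial J}{\partial u_c} = -\left(1 - \sigma\left(v_w^\top u_c\right)\right) v_w,
\qquad
\frac{\partial J}{\partial u_k} = \sigma\left(v_w^\top u_k\right) v_w
$$

Each update subtracts the learning rate times the gradient from the corresponding vector.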

In [16]:
def neg_sample(conf, train_set, train_tokens):
    Vp = randn(conf.dict_size, conf.vect_size)
    Vo = randn(conf.dict_size, conf.vect_size)

    J = 0.0
    learning_rate = conf.learning_rate
    for i in range(conf.updates):
        idx = i % len(train_set)

        word = train_set[idx, 0]
        context = train_set[idx, 1]

        neg_context = np.random.randint(0, len(train_tokens), conf.neg_samples)
        neg_context = train_tokens[neg_context]

        word_vect = Vp[word, :]  # word vector
        context_vect = Vo[context, :]  # context vector
        negative_vects = Vo[neg_context, :]  # sampled negative vectors

        # Cost and gradient calculation starts here
        score_pos = word_vect @ context_vect.T
        score_neg = word_vect @ negative_vects.T

        J -= np.log(sigmoid(score_pos)) + np.sum(np.log(sigmoid(-score_neg)))
        if (i + 1) % conf.log_period == 0:
            print('Update {0}\tcost: {1:>2.2f}'.format(i + 1, J / conf.log_period))
            final_cost = J / conf.log_period
            J = 0.0

        pos_g = 1.0 - sigmoid(score_pos)
        neg_g = sigmoid(score_neg)

        word_grad = -pos_g * context_vect + np.sum(as_matrix(neg_g) * negative_vects, axis=0)
        context_grad = -pos_g * word_vect
        neg_context_grad = as_matrix(neg_g) * as_matrix(word_vect).T

        Vp[word, :] -= learning_rate * word_grad
        Vo[context, :] -= learning_rate * context_grad
        Vo[neg_context, :] -= learning_rate * neg_context_grad

        if i % conf.decay_period == 0:
            learning_rate = learning_rate * conf.learning_rate_decay

    return Vp, Vo, final_cost
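A quick way to gain confidence in hand-derived gradients like the ones above is a numerical gradient check. The sketch below repeats the cost and the word-vector gradient on small random vectors and compares the analytic gradient with central finite differences. All function names here are illustrative, and float64 is used instead of float32 to keep the comparison precise.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def cost(word_vect, context_vect, negative_vects):
    """Negative sampling cost for one (word, context) pair."""
    score_pos = word_vect @ context_vect
    score_neg = negative_vects @ word_vect
    return -np.log(sigmoid(score_pos)) - np.sum(np.log(sigmoid(-score_neg)))


def analytic_word_grad(word_vect, context_vect, negative_vects):
    """Gradient of the cost with respect to the word vector."""
    pos_g = 1.0 - sigmoid(word_vect @ context_vect)
    neg_g = sigmoid(negative_vects @ word_vect)
    return -pos_g * context_vect + neg_g @ negative_vects


def numeric_word_grad(word_vect, context_vect, negative_vects, eps=1e-6):
    """Central finite-difference approximation of the same gradient."""
    grad = np.zeros_like(word_vect)
    for d in range(word_vect.size):
        step = np.zeros_like(word_vect)
        step[d] = eps
        upper = cost(word_vect + step, context_vect, negative_vects)
        lower = cost(word_vect - step, context_vect, negative_vects)
        grad[d] = (upper - lower) / (2.0 * eps)
    return grad


rng = np.random.default_rng(0)
word_vect = rng.normal(size=5)
context_vect = rng.normal(size=5)
negative_vects = rng.normal(size=(3, 5))

analytic = analytic_word_grad(word_vect, context_vect, negative_vects)
numeric = numeric_word_grad(word_vect, context_vect, negative_vects)
max_diff = np.max(np.abs(analytic - numeric))
print(max_diff)
```

If the two gradients disagree by more than a small tolerance, the derivation or the code most likely contains a sign or indexing error.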

Next, we run the training:

In [17]:
Vp, Vo, J = neg_sample(conf, train_set, train_tokens)

Update 10000	cost: 18.59
Update 20000	cost: 10.84
Update 30000	cost: 8.49
Update 40000	cost: 7.39
Update 50000	cost: 6.57
Update 60000	cost: 6.09
Update 70000	cost: 5.59
Update 80000	cost: 4.86
Update 90000	cost: 4.63
Update 100000	cost: 4.45
Update 110000	cost: 4.30
Update 120000	cost: 4.22
Update 130000	cost: 4.17
Update 140000	cost: 4.01
Update 150000	cost: 3.79
Update 160000	cost: 3.74
Update 170000	cost: 3.69
Update 180000	cost: 3.63
Update 190000	cost: 3.61
Update 200000	cost: 3.58
Update 210000	cost: 3.50
Update 220000	cost: 3.39
Update 230000	cost: 3.40
Update 240000	cost: 3.33
Update 250000	cost: 3.31
Update 260000	cost: 3.31
Update 270000	cost: 3.27
Update 280000	cost: 3.18
Update 290000	cost: 3.17
Update 300000	cost: 3.13
Update 310000	cost: 3.09
Update 320000	cost: 3.09
Update 330000	cost: 3.12
Update 340000	cost: 3.05
Update 350000	cost: 3.00
Update 360000	cost: 2.99
Update 370000	cost: 2.96
Update 380000	cost: 2.96
Update 390000	cost: 2.94
Update 400000	cost: 2.94
Update 

Update 3210000	cost: 2.05
Update 3220000	cost: 2.05
Update 3230000	cost: 2.07
Update 3240000	cost: 2.06
Update 3250000	cost: 2.05
Update 3260000	cost: 2.05
Update 3270000	cost: 2.03
Update 3280000	cost: 2.04
Update 3290000	cost: 2.06
Update 3300000	cost: 2.03
Update 3310000	cost: 2.08
Update 3320000	cost: 2.04
Update 3330000	cost: 2.05
Update 3340000	cost: 2.01
Update 3350000	cost: 2.05
Update 3360000	cost: 2.05
Update 3370000	cost: 2.04
Update 3380000	cost: 2.07
Update 3390000	cost: 2.05
Update 3400000	cost: 2.02
Update 3410000	cost: 2.02
Update 3420000	cost: 2.04
Update 3430000	cost: 2.05
Update 3440000	cost: 2.05
Update 3450000	cost: 2.05
Update 3460000	cost: 2.03
Update 3470000	cost: 2.03
Update 3480000	cost: 2.01
Update 3490000	cost: 2.04
Update 3500000	cost: 2.03
Update 3510000	cost: 2.05
Update 3520000	cost: 2.04
Update 3530000	cost: 2.03
Update 3540000	cost: 2.02
Update 3550000	cost: 2.02
Update 3560000	cost: 2.04
Update 3570000	cost: 2.03
Update 3580000	cost: 2.03
Update 35900

Update 6370000	cost: 1.96
Update 6380000	cost: 1.95
Update 6390000	cost: 1.93
Update 6400000	cost: 1.92
Update 6410000	cost: 1.94
Update 6420000	cost: 1.95
Update 6430000	cost: 1.95
Update 6440000	cost: 1.95
Update 6450000	cost: 1.94
Update 6460000	cost: 1.92
Update 6470000	cost: 1.94
Update 6480000	cost: 1.95
Update 6490000	cost: 1.95
Update 6500000	cost: 1.94
Update 6510000	cost: 1.93
Update 6520000	cost: 1.92
Update 6530000	cost: 1.93
Update 6540000	cost: 1.94
Update 6550000	cost: 1.95
Update 6560000	cost: 1.94
Update 6570000	cost: 1.94
Update 6580000	cost: 1.95
Update 6590000	cost: 1.93
Update 6600000	cost: 1.93
Update 6610000	cost: 1.94
Update 6620000	cost: 1.94
Update 6630000	cost: 1.94
Update 6640000	cost: 1.95
Update 6650000	cost: 1.94
Update 6660000	cost: 1.93
Update 6670000	cost: 1.93
Update 6680000	cost: 1.94
Update 6690000	cost: 1.94
Update 6700000	cost: 1.94
Update 6710000	cost: 1.95
Update 6720000	cost: 1.93
Update 6730000	cost: 1.91
Update 6740000	cost: 1.92
Update 67500

The ``similar_words`` function can be used to find the words most closely related to a given ``word``.

In [18]:
def lookup_word_idx(word, word_dict):
    try:
        return np.argwhere(np.array(word_dict) == word)[0][0]
    except IndexError:
        # np.argwhere returned no match, so the word is not in the dictionary
        raise ValueError("No such word in dict: {}".format(word))

def similar_words(embeddings, word, word_dict, hits):
    word_idx = lookup_word_idx(word, word_dict)
    similarity_scores = embeddings @ embeddings[word_idx]
    similar_word_idxs = np.argsort(-similarity_scores)    
    return [word_dict[i] for i in similar_word_idxs[:hits]]

In [19]:
print('\n\nTraining cost: {0:>2.2f}\n\n'.format(J))

sample_words = ['knight', 'holy', 'grail']

Vp_norm = Vp / as_matrix(np.linalg.norm(Vp , axis=1))
for w in sample_words:
    similar = similar_words(Vp_norm, w, train_dict, 5)
    print('Words similar to {}: {}'.format(w, ", ".join(similar)))



Training cost: 1.92


Words similar to knight: knight, mumble, stone, VILLAGER, squeak
Words similar to holy: holy, pussy, j, clank, Bedwere
Words similar to grail: grail, Grenade, rabbit, dona, expensive
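In the output above, the query word always comes first because its similarity to itself is the highest possible. The sketch below is a variant that skips the query word. It runs on a tiny made-up embedding matrix; `top_k_similar`, `toy_vocab` and `toy_embeddings` are hypothetical names and values, not trained results.

```python
import numpy as np


def top_k_similar(embeddings, word, vocab, k):
    """Return the k words with the highest cosine similarity to `word`."""
    # Normalize rows so that the dot product equals the cosine similarity
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = normed @ normed[vocab.index(word)]
    order = np.argsort(-scores)  # indices sorted by descending similarity
    return [vocab[i] for i in order if vocab[i] != word][:k]


# Made-up 2-dimensional embeddings for three words
toy_vocab = ["knight", "sword", "rabbit"]
toy_embeddings = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
])

nearest = top_k_similar(toy_embeddings, "knight", toy_vocab, 1)
print(nearest)
```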

