# 16 Vector models of text

Part of ["Introduction to Data Science" course](https://github.com/kupav/data-sc-intro) by Pavel Kuptsov, [kupav@mail.ru](mailto:kupav@mail.ru)

Recommended reading for this section:

1. A. Kedia and M. Rasu (2020) Hands-On Python Natural Language Processing. Packt Publishing
1. T. Srivastava. NLP: A quick guide to Stemming. https://medium.com/@tusharsri/nlp-a-quick-guide-to-stemming-60f1ca5db49e
1. J. Brownlee. How to Clean Text for Machine Learning with Python. https://machinelearningmastery.com/clean-text-machine-learning-python/
1. S. Prabhakaran. Gensim Tutorial - A Complete Beginners Guide. https://www.machinelearningplus.com/nlp/gensim-tutorial/#14howtotrainword2vecmodelusinggensim
1. K. Ganesan. Gensim Word2Vec Tutorial - Full Working Example. https://kavita-ganesan.com/gensim-word2vec-tutorial-starter-code/


The following Python modules will be required. Make sure that you have them installed.
- `spacy`
- `gensim`
- `pyemd` (required by `gensim` to compute word mover’s distance)
- `pot` (required by `gensim` to compute word mover’s distance)
- `multiprocessing` (standard library)
- `re` (standard library)
- `nltk`
- `sklearn` (installed as `scikit-learn`)
- `numpy`
- `collections` (standard library)

To install spacy and its English model, uncomment and execute the following command.

In [None]:
#pip install spacy && python -m spacy download en_core_web_sm

In this lecture we will closely follow the book [1].

## Lesson 1

### Preliminary text processing: from natural language to mathematical object

Each natural language processing (NLP) problem requires formalizing the text under consideration.

Each text consists of parts: chapters, paragraphs, sentences, words. 

We can easily reveal the parts manually just by observing the text. 

But for a computer this is not such a simple problem. 

The text parts have to be found and isolated so that each one becomes an element of a data structure, e.g., of a list.

Parts of each text have complicated semantic and grammatical connections. Some of them have to be revealed and preserved, while others can be omitted (this depends on the final goal of the text processing).

Each text has ambiguities that have to be resolved. 

The result of this work is a vocabulary of the text. 

Then each vocabulary element has to be represented as a mathematical object, usually a vector.

Finally, using the vocabulary and the corresponding vectors, the text as a whole is represented as a mathematical object, a set of vectors.
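The chain "text, vocabulary, vectors" can be illustrated with a minimal sketch. It assumes whitespace tokenization and one-hot vectors, which is only one of many possible vector representations and is chosen here purely for simplicity; the names `one_hot` and `text_vectors` are illustrative.

```python
# A toy text; splitting on whitespace is a deliberate simplification.
sentence = "the cat sat on the mat"
tokens = sentence.split()

# The vocabulary: every distinct token, in sorted order.
vocabulary = sorted(set(tokens))

def one_hot(token):
    """Represent a vocabulary element as a vector with a single 1."""
    return [1 if token == v else 0 for v in vocabulary]

# The text as a whole becomes a collection of vectors, one per token.
text_vectors = [one_hot(tok) for tok in tokens]

print(vocabulary)
for tok, vec in zip(tokens, text_vectors):
    print(tok, vec)
```

Each vector has as many components as the vocabulary has elements, so real vocabularies lead to very long and sparse vectors.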

### Sentence splitting

When we start working with a text, the first step can be extracting its sentences and collecting them, e.g., as list elements.

The splitting can be done using the Python built-in method `.split()`. We only need to specify the separator, which is "." (dot).

In [67]:
text = """"Oh, God", he thought, "what a strenuous career it is 
that I've chosen! Travelling day in and day out.  Doing business 
like this takes much more effort than doing your own business at home.  
It can all go to Hell!"  He felt a slight itch up on his belly; 
pushed himself slowly up on his back towards the headboard so that 
he could lift his head better. """
sentences = text.split(".")
for s in sentences:
    # This line is required to show all newline symbols as '\n'
    a = s.encode('unicode-escape').decode().replace('\\\\', '\\')    
    print(f"=*={a}=*=")

=*="Oh, God", he thought, "what a strenuous career it is \nthat I've chosen! Travelling day in and day out=*=
=*=  Doing business \nlike this takes much more effort than doing your own business at home=*=
=*=  \nIt can all go to Hell!"  He felt a slight itch up on his belly; \npushed himself slowly up on his back towards the headboard so that \nhe could lift his head better=*=
=*= =*=


Observe, however, that other sentence separators, like the exclamation point, are ignored. We could write some code to take them into account.
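Such hand-written code could look like the sketch below. It assumes that a sentence ends with ".", "!" or "?" followed by whitespace, which is a simplification, and it uses a short sample text of its own.

```python
import re

sample = "What a career! Travelling day in and day out. Is it worth it? It can all go to Hell!"

# Split on whitespace that follows '.', '!' or '?'.
# The lookbehind keeps the separator attached to its sentence.
parts = re.split(r"(?<=[.!?])\s+", sample)

for p in parts:
    print(f"=*={p}=*=")
```

This handles more separators than `.split(".")`, but it still fails on abbreviations and quotation marks.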

But it is much more convenient to use the `nltk` library, the Natural Language Toolkit. It provides a lot of powerful tools for NLP.

The function that splits a text into sentences is called `sent_tokenize()`.

In [68]:
from nltk import sent_tokenize
sentences = sent_tokenize(text)
for s in sentences:
    # This line is required to show all newline symbols as '\n'
    a = s.encode('unicode-escape').decode().replace('\\\\', '\\')    
    print(f"=*={a}=*=")

=*="Oh, God", he thought, "what a strenuous career it is \nthat I've chosen!=*=
=*=Travelling day in and day out.=*=
=*=Doing business \nlike this takes much more effort than doing your own business at home.=*=
=*=It can all go to Hell!"=*=
=*=He felt a slight itch up on his belly; \npushed himself slowly up on his back towards the headboard so that \nhe could lift his head better.=*=


The NLTK splitter correctly takes into account all sentence separators and also automatically trims white spaces at sentence ends.

Notice that splitting a text into sentences is only required if the sentences are planned to be analyzed individually as special parts of text structure. Otherwise we can just skip this step.

### Text cleaning 

A text contains a lot of non-textual symbols.

First of all, these are newline symbols "\n". There can also be tabs "\t" and some non-textual characters like the asterisk "*".

Punctuation marks are also typically not needed for further text processing.

The cleaning can be done using Python regular expressions.

The function `text_clean` converts all symbols that are not alphanumeric or apostrophes to spaces.

In [69]:
import re

def text_clean(text):
    new_text = []
    for s in text:
        s1 = re.sub(pattern=r"[^\w']", repl=' ', string=s)
        new_text.append(s1)
    return new_text

print(text_clean(sentences))

[" Oh  God   he thought   what a strenuous career it is  that I've chosen ", 'Travelling day in and day out ', 'Doing business  like this takes much more effort than doing your own business at home ', 'It can all go to Hell  ', 'He felt a slight itch up on his belly   pushed himself slowly up on his back towards the headboard so that  he could lift his head better ']


Notice, however, that the cleaning can also be done later, after splitting sentences into separate tokens. 
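Replacing symbols with spaces leaves runs of spaces in the cleaned sentences. If this matters, they can be collapsed with one more regular expression. The helper `squeeze_spaces` below is a hypothetical addition, not part of the original cleaning function.

```python
import re

def squeeze_spaces(sentence):
    """Collapse runs of whitespace into one space and trim both ends."""
    return re.sub(r"\s+", " ", sentence).strip()

cleaned = [" Oh  God   he thought ", "It can all go to Hell  "]
squeezed = [squeeze_spaces(s) for s in cleaned]
print(squeezed)
```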

### Text tokenization

In order to build up a vocabulary, the first thing to do is to break the text into chunks called tokens. 

Each token carries a semantic meaning associated with it. 

Tokenization is one of the fundamental things to do in any text-processing activity.

Tokenization can be thought of as a segmentation technique wherein you are trying to
break down larger pieces of text chunks into smaller meaningful ones. 

Tokens generally comprise words and numbers, but they can be extended to include punctuation marks,
symbols, and, at times, understandable emoticons.

The simplest way to tokenize is provided by the Python built-in method `.split()`.

In [70]:
sentence = "The capital of China is Beijing"
print(sentence.split())

['The', 'capital', 'of', 'China', 'is', 'Beijing']


However, this simple way cannot deal correctly with many complicated cases:

In [71]:
sentence = "Beijing is where we'll go"
print(sentence.split())

['Beijing', 'is', 'where', "we'll", 'go']


The problematic token here is "we'll". 

Actually there are two tokens here: one is the pronoun "we" and the other is the contracted verb "will".

If a grammatical analysis is planned later, these tokens must be separated. 
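A rough way to separate such contractions is to treat the apostrophe as the start of a new token. The sketch below is a simplification under that assumption: it drops punctuation, and it would split "don't" into "don" and "'t", which is usually not what is wanted.

```python
import re

def split_contractions(sentence):
    """Return word tokens, starting a new token at each apostrophe."""
    return re.findall(r"\w+|'\w+", sentence)

contraction_tokens = split_contractions("Beijing is where we'll go")
print(contraction_tokens)
```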

Another complicated case:

In [72]:
sentence = "Let's travel to Hong Kong from Beijing"
print(sentence.split())

["Let's", 'travel', 'to', 'Hong', 'Kong', 'from', 'Beijing']


Obviously, here "Kong" must be attached to "Hong", and both must be considered a single token.

But in the sentence below "Kong" is a standalone token.

In [73]:
sentence = "The name of the King is Kong"
print(sentence.split())

['The', 'name', 'of', 'the', 'King', 'is', 'Kong']
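One simple way to handle both cases is to merge only those neighbouring tokens that form a known multi-word expression. The sketch below assumes a hand-made set of two-word phrases; the function `merge_multiword` and the set `PHRASES` are illustrative names, and a real list would have to come from a dictionary or from corpus statistics.

```python
# Hypothetical list of two-word expressions that must stay together.
PHRASES = {("Hong", "Kong")}

def merge_multiword(tokens, phrases=PHRASES):
    """Join neighbouring tokens that form a known two-word phrase."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in phrases:
            merged.append(" ".join(pair))
            i += 2  # skip both parts of the phrase
        else:
            merged.append(tokens[i])
            i += 1
    return merged

print(merge_multiword("Let's travel to Hong Kong from Beijing".split()))
print(merge_multiword("The name of the King is Kong".split()))
```

In the second sentence "Kong" is not preceded by "Hong", so it stays a separate token.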


There is also a problem with sentence boundary detection.

In the example below the period between M and S is actually indicative of an abbreviation, but not the sentence boundary.

In [74]:
sentence = "A friend is pursuing his M.S from Beijing. He really likes it."
print(sentence.split('.'))

['A friend is pursuing his M', 'S from Beijing', ' He really likes it', '']
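A slightly smarter heuristic splits only where a sentence-ending mark is followed by whitespace and a capital letter. The sketch below is an assumption-laden illustration, not a robust sentence splitter: it handles "M.S" correctly but is fooled by abbreviations such as "Dr." that are followed by a capitalized word.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' when whitespace and a capital letter follow."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

good = split_sentences("A friend is pursuing his M.S from Beijing. He really likes it.")
bad = split_sentences("Dr. Alfred Lanning spoke.")
print(good)
print(bad)  # the abbreviation "Dr." is wrongly treated as a sentence end
```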


The problem of tokenization has no single solution that is perfect for all cases. 

There are many tokenization methods that work better for different applications.

Basically, tokenizers use regular expressions, possibly with some additional, more or less smart processing.

The basic tokenizer in NLTK is `word_tokenize()`. It returns a tokenized copy of the passed text using NLTK's recommended word tokenizer.

In [75]:
import nltk
sentence = "A Rolex watch costs in the range of $3000.0 - $8000.0 in USA."
tokens = nltk.word_tokenize(sentence)
print(tokens)

['A', 'Rolex', 'watch', 'costs', 'in', 'the', 'range', 'of', '$', '3000.0', '-', '$', '8000.0', 'in', 'USA', '.']


This tokenizer works well in most cases; however, here it splits the dollar sign `$` from the amount of money. 

In situations like this we need more control over the tokenization by specifying a regular expression explicitly.

This can be done with the class `RegexpTokenizer`. It works like this:

In [76]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
tokens = tokenizer.tokenize(sentence)
print(tokens)

['A', 'Rolex', 'watch', 'costs', 'in', 'the', 'range', 'of', '$3000.0', '-', '$8000.0', 'in', 'USA', '.']


This regular expression works as follows. 

The `\w+|\$[\d\.]+|\S+` regular expression allows three alternative patterns:

- First alternative: `\w+`. Here, `\w` matches any word character (for ASCII text, equal to `[a-zA-Z0-9_]`). 
The `+` is a quantifier that matches one or more times, as many
times as possible.
- Second alternative: `\$[\d\.]+`. Here, `\$` matches the character `$`, `\d` matches a
digit between 0 and 9, `\.` matches the character `.` (period), and `+` again acts as a
quantifier matching between one and unlimited times.
- Third alternative: `\S+`. Here, `\S` accepts any non-whitespace character and `+`
again acts the same way as in the preceding two alternatives.
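The behaviour of the three alternatives can be checked with the standard `re` module alone. The sketch below applies the same pattern with `re.findall`; it is meant only to show that the alternatives are tried from left to right at each position.

```python
import re

pattern = r"\w+|\$[\d\.]+|\S+"

# The '$' is not a word character, so the second alternative takes the whole amount.
with_dollar = re.findall(pattern, "costs $3000.0")

# Without '$' the first alternative wins and stops at the period,
# so the rest of the number falls to the third alternative.
without_dollar = re.findall(pattern, "costs 3000.0")

print(with_dollar)
print(without_dollar)
```

The second result shows that this particular pattern keeps an amount together only when it starts with a dollar sign.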


Text in social media is much less formal and requires specific tokenizers. 

People tag each other using their social media handles and use a lot of emoticons, hashtags,
and abbreviated text to express themselves.

For this purpose NLTK provides `TweetTokenizer`.

In [77]:
from nltk.tokenize import TweetTokenizer
sentence = """@amankedia I'm going to buy a Rolexxxxxxxx watch!!! :-D #happiness #rolex <3"""
tokenizer = TweetTokenizer()
tokens = tokenizer.tokenize(sentence)
print(tokens)

['@amankedia', "I'm", 'going', 'to', 'buy', 'a', 'Rolexxxxxxxx', 'watch', '!', '!', '!', ':-D', '#happiness', '#rolex', '<3']


Another common thing with social media writing is the use of expressions such
as `Rolexxxxxxxx`. 

Here, a lot of x's are present in addition to the normal one; it is a very
common trend and should be addressed to bring it to a form as close to normal as possible.

The `TweetTokenizer` provides two additional parameters. 

The first one `reduce_len` tries to reduce the excessive characters in a token. 

The word Rolexxxxxxxx is actually tokenized as Rolexxx in an attempt to reduce the number of x's present.

The parameter `strip_handles`, when set to True, removes the handles mentioned in a
post/tweet. 

As can be seen in the output below, `@amankedia` is stripped, since it is a handle.

In [78]:
from nltk.tokenize import TweetTokenizer
sentence = """@amankedia I'm going to buy a Rolexxxxxxxx watch!!! :-D #happiness #rolex <3"""
tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)
tokens = tokenizer.tokenize(sentence)
print(tokens)

["I'm", 'going', 'to', 'buy', 'a', 'Rolexxx', 'watch', '!', '!', '!', ':-D', '#happiness', '#rolex', '<3']
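The effect of the two parameters can be imitated with plain regular expressions. The sketch below is only an illustration of the idea and makes no claim about how `TweetTokenizer` is implemented; `reduce_length` and `strip_handles` are hypothetical helper names, and the handle pattern is deliberately simplistic.

```python
import re

def reduce_length(token):
    """Shorten any run of three or more identical characters to exactly three."""
    return re.sub(r"(.)\1{2,}", r"\1\1\1", token)

def strip_handles(text):
    """Remove '@name' mentions (a simplified pattern)."""
    return re.sub(r"@\w+", "", text).strip()

print(reduce_length("Rolexxxxxxxx"))
print(strip_handles("@amankedia I'm going to buy a watch"))
```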


### Text cleaning after tokenization

Previously we considered text cleaning done before splitting a text into tokens.

NLTK tokenizers extract non-alphanumeric symbols into separate tokens.

All newline symbols are treated as token separators and thus are removed.

We just need to run along the token list and filter out non-alphabetic tokens.

For this purpose the method `.isalpha()` can be used. It returns `True` if all characters in the string are alphabetic; otherwise, it returns `False`.

In addition we convert all tokens to lower case.

Note that before checking `.isalpha()` we test whether the first symbol is an apostrophe. This is required to avoid removing
contracted verbs like "'ve".

In [79]:
text = """"Oh, God", he thought, "what a strenuous career it is 
that I've chosen! Travelling day in and day out.  Doing business 
like this takes much more effort than doing your own business at home.  
It can all go to Hell!"  He felt a slight itch up on his belly; 
pushed himself slowly up on his back towards the headboard so that 
he could lift his head better. """

import nltk
raw_tokens = nltk.word_tokenize(text)
tokens = []
for tok in raw_tokens:
    t1 = tok[1:] if tok[0] == "'" else tok  # we do not want to remove tokens like "'ve" 
    if t1.isalpha():
        tokens.append(tok.lower())
        
print(raw_tokens)
print()
print(tokens)

['``', 'Oh', ',', 'God', "''", ',', 'he', 'thought', ',', '``', 'what', 'a', 'strenuous', 'career', 'it', 'is', 'that', 'I', "'ve", 'chosen', '!', 'Travelling', 'day', 'in', 'and', 'day', 'out', '.', 'Doing', 'business', 'like', 'this', 'takes', 'much', 'more', 'effort', 'than', 'doing', 'your', 'own', 'business', 'at', 'home', '.', 'It', 'can', 'all', 'go', 'to', 'Hell', '!', "''", 'He', 'felt', 'a', 'slight', 'itch', 'up', 'on', 'his', 'belly', ';', 'pushed', 'himself', 'slowly', 'up', 'on', 'his', 'back', 'towards', 'the', 'headboard', 'so', 'that', 'he', 'could', 'lift', 'his', 'head', 'better', '.']

['oh', 'god', 'he', 'thought', 'what', 'a', 'strenuous', 'career', 'it', 'is', 'that', 'i', "'ve", 'chosen', 'travelling', 'day', 'in', 'and', 'day', 'out', 'doing', 'business', 'like', 'this', 'takes', 'much', 'more', 'effort', 'than', 'doing', 'your', 'own', 'business', 'at', 'home', 'it', 'can', 'all', 'go', 'to', 'hell', 'he', 'felt', 'a', 'slight', 'itch', 'up', 'on', 'his', 'bel

### Word normalization

Most of the time, we do not want to have every individual word fragment that we have ever encountered in our vocabulary. 

Probably we will want to bring words to their root form in the dictionary. 

For instance, am, are, and is can be identified by their root form, be. 

Also we can remove inflections from words to bring them down to the same form: Words car, cars, and car's can all be identified as car.

Also, common words that occur very frequently and do not convey much meaning, such as
the articles a, an, and the, can be removed. 

However, all these highly depend on the use cases. 

Wh- words, such as when, why, where, and who, do not carry much information in most contexts and are removed as part of a technique called stopword removal.

However, in situations such as question classification and question answering, these words become very important and should not be removed. 

The word normalization process includes the following procedures:
- Case folding: converting all letters in the text corpus into lowercase.
- Stopword removal: removing words such as a, an, the, in, at, and so on that occur frequently in text corpora
and do not carry a lot of information in most contexts.
- Stemming: removing all inflections to bring words to their basic form.
- Lemmatization: using the context to convert a word to its meaningful base form. 
- N-grams: grouping multiple tokens.

These steps should be performed as part of preprocessing the text corpora
before applying any algorithms to the data. 

However, which steps to apply and which to ignore depend on the use case.
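Of the procedures listed above, N-grams are the simplest to show in code. The sketch below builds all groups of `n` consecutive tokens; the function name `ngrams` is illustrative and the tokenization by whitespace is an assumption.

```python
def ngrams(tokens, n):
    """Return all groups of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "let's travel to hong kong".split()
bigrams = ngrams(words, 2)
print(bigrams)
```

Bigrams such as `('hong', 'kong')` keep together tokens that lose their meaning when treated separately.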

### Case folding

Usually the first step in word normalization is case folding. 

As part of case folding, all the letters in the text corpus are converted to lowercase. 

"The" and "the" will be treated the same in a scenario of case folding, whereas they would be treated differently in
a non-case folding scenario. 

This technique helps systems that deal with information retrieval, such as search engines.

Lamborghini, which is a proper noun, will be treated as lamborghini; whether the user typed
Lamborghini or lamborghini would not make a difference, and the same results would be
returned.

However, in situations where proper nouns are derived from common noun terms, case
folding will become a bottleneck as case-based distinction becomes an important feature
here. 

For instance, General Motors is composed of common noun terms but is itself a proper
noun. 

Performing case folding here might cause issues. 

Another problem is when acronyms are converted to lowercase. 

There is a high chance that they will map to common nouns. 

An example widely used here is CAT which stands for Common Admission Test in India getting converted to cat.

A potential solution to this is to build machine learning models that can use features from a
sentence to determine which words or tokens in the sentence should be lowercase and
which shouldn't be; however, this approach doesn't always help when users mostly type in
lowercase. 

As a result, lowercasing everything becomes a wise solution.

The language here is a major feature; in some languages, such as English, capitalization
from point to point in a text carries a lot of information, whereas in some other languages,
cases might not be as important.

The following code shows a very straightforward approach that converts all
letters in a sentence to lowercase, making use of the `.lower()` method available in Python:

In [80]:
s = """At the age of twenty, Susan Calvin had been part of the particular 
Psycho-Math seminar at which Dr. Alfred Lanning of U. S. Robots had demonstrated 
the first mobile robot to be equipped with a voice."""
s = s.lower()
print(s)

at the age of twenty, susan calvin had been part of the particular 
psycho-math seminar at which dr. alfred lanning of u. s. robots had demonstrated 
the first mobile robot to be equipped with a voice.
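The acronym problem mentioned above can be softened with a simple heuristic: leave tokens that are written entirely in capitals untouched. The sketch below is a hypothetical rule of thumb, not a recommended general solution; it fails, for example, on texts typed fully in capitals.

```python
def fold_case(tokens, keep_acronyms=True):
    """Lowercase tokens, optionally keeping all-capital tokens longer than one letter."""
    folded = []
    for tok in tokens:
        if keep_acronyms and len(tok) > 1 and tok.isupper():
            folded.append(tok)  # looks like an acronym, e.g. CAT
        else:
            folded.append(tok.lower())
    return folded

folded = fold_case("She passed the CAT exam with her cat".split())
print(folded)
```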


### Stopword removal

Stopwords are words such as a, an, the, in, at, and so on that occur frequently in text corpora
and do not carry a lot of information in most contexts. 

These words, in general, are required for the completion of sentences and making them grammatically sound. 

They are often the most common words in a language and can be filtered out in most NLP tasks, and
consequently help in reducing the vocabulary or search space. 

There is no single list of stopwords that is available universally, and they vary mostly based on use cases.

However, a certain list of words is maintained for languages that can be treated as stopwords specific
to that language, but they should be modified based on the problem that is being solved.
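That stopwords tend to be the most frequent words can be checked on any text by counting tokens. The sketch below uses `collections.Counter` on a made-up sentence, so the numbers only illustrate the idea.

```python
from collections import Counter

toy_text = "the cat sat on the mat and the dog sat on the rug"
freq = Counter(toy_text.split())

# The most frequent tokens are candidates for a stopword list.
top = freq.most_common(1)
print(top)
```

On a real corpus the head of such a frequency list is a natural starting point for a task-specific stopword list.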

Let’s look at the stopwords available for English in the nltk:

In [81]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')
", ".join(stop)

[nltk_data] Downloading package stopwords to /home/pavel/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


"i, me, my, myself, we, our, ours, ourselves, you, you're, you've, you'll, you'd, your, yours, yourself, yourselves, he, him, his, himself, she, she's, her, hers, herself, it, it's, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, that'll, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, s, t, can, will, just, don, don't, should, should've, now, d, ll, m, o, re, ve, y, ain, aren, aren't, couldn, couldn't, didn, didn't, doesn, doesn't, hadn, hadn't, hasn, hasn't, haven, haven't, isn, isn't, ma, mightn, mightn't, mustn, mus

If you look closely, you'll notice that Wh- words such as who, what, when, why, how, which,
where, and whom are part of this list of stopwords.

However, previously it was mentioned that these words are very significant in use cases such as question
answering and question classification. 

Measures should be taken to ensure that these words are not filtered out when the text corpus undergoes 
stopword removal. 

Let's learn how this can be achieved:

In [82]:
wh_words = set(['who', 'what', 'when', 'why', 'how', 'which', 'where', 'whom'])
stop = set(stopwords.words('english'))

sentence = "how are we putting in efforts to enhance our understanding of NLP"

# Before removing stop words, we exclude the wh-words from the stop list
clean_stop = stop - wh_words
    
sentence_after_stopword_removal = [token for token in sentence.split() if token not in clean_stop]
" ".join(sentence_after_stopword_removal)

'how putting efforts enhance understanding NLP'

The code above shows that the sentence "how are we putting in efforts
to enhance our understanding of NLP" gets modified to "how putting
efforts enhance understanding NLP". 

Observe that the words "are", "we", "in", "to", "our", "of" were removed from the sentence. 

In some cases removing verbs may be undesirable. In that case they must be protected in the same way as the "wh-" words.

Stopword removal is generally the first step that is taken after tokenization while building a vocabulary or preprocessing text data.
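One practical detail is that stopword lists are usually stored in lowercase, so the comparison should be made on lowercased tokens. The sketch below uses a tiny hypothetical stop list to show the difference.

```python
# A tiny hypothetical stop list, all in lowercase.
toy_stop = {"the", "is", "of", "a"}

phrase = "The capital of China is Beijing"

# Comparing tokens as they are misses the capitalized "The".
case_sensitive = [t for t in phrase.split() if t not in toy_stop]

# Comparing lowercased tokens removes it while keeping the original spelling.
case_folded = [t for t in phrase.split() if t.lower() not in toy_stop]

print(case_sensitive)
print(case_folded)
```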

### Stemming

Stemming brings words like computer, computerization, and computerize to one word, compute. 

Stemming removes the inflectional forms of a word and brings them to a base form called the stem. 

The chopped-off pieces are referred to as affixes. 

In the preceding example, compute is the base form and the affixes are r, rization, and rize, respectively. 

One thing to keep in mind is that the stem need not be a valid word as we know it. 

For example, the word traditional would get stemmed to tradit, which is not a valid word in the English dictionary.
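The idea of chopping off affixes can be shown with a toy suffix stripper. The sketch below is not the Porter or Snowball algorithm; the suffix list and the minimum stem length are arbitrary assumptions made for illustration.

```python
# Suffixes are tried in this order; the list is an arbitrary illustration.
SUFFIXES = ("ization", "ing", "ed", "es", "s")

def toy_stem(word, min_stem=3):
    """Remove the first matching suffix if at least min_stem letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

sample_words = ["itemization", "meeting", "owned", "caresses", "mules"]
toy_stems = [toy_stem(w) for w in sample_words]
print(toy_stems)
```

The last word is cut down to `mul`, which shows how easily crude rules produce stems that are not valid words.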

The two most common methods employed for stemming include the Porter stemmer and the Snowball stemmer. 

The Porter stemmer supports the English language, whereas the Snowball stemmer, which is an improvement on the Porter stemmer, supports multiple languages:

In [83]:
from nltk.stem.snowball import SnowballStemmer
print(SnowballStemmer.languages)

('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')


Let's now first apply the Porter stemmer to words and see its effects in the following code block:

In [84]:
plurals = ['caresses', 'flies', 'dies', 'mules', 'died', 'agreed', 'owned', 'humbled', 'sized', 'meeting', 'stating',
    'siezing', 'itemization', 'traditional', 'reference', 'colonizer', 'plotted', 'having', 'generously']
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
singles = [stemmer.stem(plural) for plural in plurals]
print(' '.join(singles))

caress fli die mule die agre own humbl size meet state siez item tradit refer colon plot have gener


Next, let's see how the Snowball stemmer would do on the same text:

In [85]:
stemmer2 = SnowballStemmer(language='english')
singles = [stemmer2.stem(plural) for plural in plurals]
print(' '.join(singles))

caress fli die mule die agre own humbl size meet state siez item tradit refer colon plot have generous


As can be seen above, the Snowball stemmer requires the specification of a language parameter. 

In most cases, its output is similar to that of the Porter stemmer, except for generously, 
where the Porter stemmer outputs gener and the Snowball stemmer outputs generous. 

The example shows how the Snowball stemmer makes minor changes to the Porter algorithm, achieving improvements in some cases.

Potential problems with stemming arise in the form of over-stemming and under-stemming. 

A situation may arise when words that are stemmed to the same root should have been stemmed to different roots. 

This problem is referred to as __over-stemming__. This is also known as a false positive.

In [86]:
stemmer2 = SnowballStemmer(language='english')
words = ['universal', 'university', 'universe']
stemmed = [stemmer2.stem(s) for s in words]
print(' '.join(stemmed))

univers univers univers


All three words above are stemmed to univers, which is undesirable behavior.

Though these three words are etymologically related, their modern meanings are in widely different domains, so treating them as synonyms in NLP will likely reduce the relevance of the results.

In contrast, another problem occurs when words that should have been stemmed to the same
root aren't stemmed to it. This situation is referred to as __under-stemming__. This is also known as a false negative.

In [87]:
stemmer2 = SnowballStemmer(language='english')
words = ['alumnus', 'alumni', 'alumnae']
stemmed = [stemmer2.stem(s) for s in words]
print(' '.join(stemmed))

alumnus alumni alumna


In addition to Porter and Snowball there are more algorithms for stemming. 

Other stemmers include the Lancaster, Dawson, Krovetz, and Lovins stemmers, among others. 

Each stemmer can both under- and over-stem. 

The most suitable one should be chosen according to the goals of the study.

### Lemmatization

Unlike stemming, wherein a few characters are removed from words using crude methods,
lemmatization is a process wherein the context is used to convert a word to its meaningful
base form. 

It helps in grouping together words that have a common base form and so can
be identified as a single item. 

The base form is referred to as the lemma of the word and is also sometimes known as the dictionary form.

Lemmatization algorithms try to identify the lemma form of a word by taking into account
the neighborhood context of the word, part-of-speech (POS) tags, the meaning of a word,
and so on. 

The neighborhood can span across words in the vicinity, sentences, or even documents.

Also, the same words can have different lemmas depending on the context. 

A lemmatizer would try and identify the part-of-speech tags based on the context to identify the
appropriate lemma. 
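Why the part of speech matters can be seen from a toy lookup table. The sketch below is a hypothetical dictionary-based lemmatizer with a handful of hand-written entries; real lemmatizers rely on large lexical resources instead.

```python
# Hand-written (word, part of speech) -> lemma entries, for illustration only.
LEMMAS = {
    ("saw", "v"): "see",   # "I saw him"
    ("saw", "n"): "saw",   # "a rusty saw"
    ("are", "v"): "be",
}

def toy_lemmatize(word, pos):
    """Look up the lemma for a word with a given POS; fall back to the word itself."""
    return LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("saw", "v"))
print(toy_lemmatize("saw", "n"))
```

The same surface form receives different lemmas depending on the tag, which a stemmer cannot do.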

Stemming and lemmatization obviously try to do the same job. But stemming does it in a simpler way and is thus faster. 

Sometimes stemming is enough, and in more complicated cases lemmatization results in better performance.

The most commonly used lemmatizer is the WordNet lemmatizer.

Other lemmatizers include the Spacy lemmatizer, TextBlob lemmatizer, and Gensim lemmatizer. 

Below we will explore the WordNet and Spacy lemmatizers.

### WordNet lemmatizer

WordNet is a lexical database of English that is freely and publicly available. 

As part of WordNet, nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive
synonyms (synsets), each expressing distinct concepts. 

These synsets are interlinked using lexical and conceptual semantic relationships. 

It can be easily downloaded, and the nltk library offers an interface to it that enables you to perform lemmatization.

Let's try and lemmatize the following sentence using the WordNet lemmatizer:

`We are putting in efforts to enhance our understanding of Lemmatization`

In [88]:
import nltk
nltk.download('omw-1.4')  # download dependency
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
s = "We are putting in efforts to enhance our understanding of Lemmatization"
token_list = s.split()  # we do not need a more sophisticated tokenizer here
sl = ' '.join([lemmatizer.lemmatize(token) for token in token_list])
print(s)
print(sl)

We are putting in efforts to enhance our understanding of Lemmatization
We are putting in effort to enhance our understanding of Lemmatization


[nltk_data] Downloading package omw-1.4 to /home/pavel/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to /home/pavel/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


As can be seen, the WordNet lemmatizer did not do much here. 

What are we lacking here?

The WordNet lemmatizer works well if the POS tags are also provided as inputs.

It is practically impossible to manually annotate each word with its POS tag in a text corpus.

Now, how do we solve this problem and provide the part-of-speech tags for individual
words as input to the WordNet lemmatizer?

Fortunately, the nltk library provides a method for finding POS tags for a list of words
using an averaged perceptron tagger (a linear classifier trained with the perceptron algorithm).

The POS tags for the sentence "We are putting in efforts to enhance our understanding of Lemmatization" provided by 
the POS tagging method can be found in the following code:

In [89]:
nltk.download('averaged_perceptron_tagger')
pos_tags = nltk.pos_tag(token_list)
pos_tags

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/pavel/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


[('We', 'PRP'),
 ('are', 'VBP'),
 ('putting', 'VBG'),
 ('in', 'IN'),
 ('efforts', 'NNS'),
 ('to', 'TO'),
 ('enhance', 'VB'),
 ('our', 'PRP$'),
 ('understanding', 'NN'),
 ('of', 'IN'),
 ('Lemmatization', 'NN')]

Here the abbreviations have the following meaning:
- PRP: personal pronoun
- VBP: verb, non-3rd person singular present
- VBG: verb, gerund or present participle
- IN: preposition/subordinating conjunction 
- NNS: noun, plural
- TO: infinitive marker
- VB: verb, base form
- PRP$: possessive pronoun
- NN: noun, singular or mass

Now, the POS tags need to be converted to a form that can be understood by the
WordNet lemmatizer and sent in as input along with the tokens.

The code below does what is needed: it extracts only the first letter from the detailed nltk POS 
tags (like VBG -> V) and
then maps it to one of the wordnet built-in POS tags.

In [90]:
from nltk.corpus import wordnet

# This is a common method which is widely used across the NLP community of practitioners and readers
def get_part_of_speech_tags(token):
    """
    Get POS tag for the token and then extract the first letter of the tag:
      nltk.pos_tag(['Lemmatization']) -> [('Lemmatization', 'NN')]
      [('Lemmatization', 'NN')][0] -> ('Lemmatization', 'NN')
      ('Lemmatization', 'NN')[1] -> 'NN'
      'NN'[0] -> 'N'
    """
    tag = nltk.pos_tag([token])[0][1][0].upper()
    # Map the first letter of the POS tag to a constant that lemmatize() accepts.
    # We are focusing on verbs, nouns, adjectives and adverbs here.
    # If an unknown letter appears in the tag, wordnet.NOUN is assumed by default.
    tag_dict = {
        "J": wordnet.ADJ,
        "N": wordnet.NOUN,
        "V": wordnet.VERB,
        "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)
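The function above tags each token on its own, without its neighbours. An alternative is to reuse the tags obtained for the whole sentence and only convert them. The sketch below shows that conversion without calling nltk: it copies a few pairs from the tagger output shown earlier and assumes that the WordNet constants are the one-letter strings `'a'`, `'n'`, `'v'` and `'r'`.

```python
# Pairs copied from the tagger output shown earlier.
sentence_tags = [("We", "PRP"), ("are", "VBP"), ("putting", "VBG"),
                 ("in", "IN"), ("efforts", "NNS")]

# First letter of the tag -> one-letter WordNet POS (assumed values).
TAG_MAP = {"J": "a", "N": "n", "V": "v", "R": "r"}

def to_wordnet_pos(penn_tag):
    """Map a detailed POS tag to a WordNet POS, defaulting to noun."""
    return TAG_MAP.get(penn_tag[0].upper(), "n")

mapped = [(tok, to_wordnet_pos(tag)) for tok, tag in sentence_tags]
print(mapped)
```

Each `(token, pos)` pair could then be passed to `lemmatizer.lemmatize(token, pos)`.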

Now, let’s see how the WordNet lemmatizer performs when the POS tags are also provided as inputs:

In [91]:
lemmatized_output_with_POS_information = [lemmatizer.lemmatize(token, get_part_of_speech_tags(token)) for token in token_list]
sl=' '.join(lemmatized_output_with_POS_information)
print(s) 
print(sl)

We are putting in efforts to enhance our understanding of Lemmatization
We be put in effort to enhance our understand of Lemmatization


The following conversions happened:
- are -> be
- putting -> put
- efforts -> effort
- understanding -> understand

Let’s compare this with the Snowball stemmer:

In [92]:
from nltk.stem.snowball import SnowballStemmer
stemmer2 = SnowballStemmer(language='english')
stemmed_sentence = [stemmer2.stem(token) for token in token_list]
print(' '.join(stemmed_sentence))

we are put in effort to enhanc our understand of lemmat


As can be seen, the WordNet lemmatizer, guided by the POS tags, converts each
token into its dictionary base form, unlike the stemmer, which merely chops the affixes off the
token.

Let us consider one more example of lemmatization:

In [93]:
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
s = "Susan Calvin had been born in the year 1982, they said, which made her seventy-five now."
token_list = s.split()  # a more sophisticated tokenizer is not needed here
print(token_list)

['Susan', 'Calvin', 'had', 'been', 'born', 'in', 'the', 'year', '1982,', 'they', 'said,', 'which', 'made', 'her', 'seventy-five', 'now.']


[nltk_data] Downloading package wordnet to /home/pavel/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [94]:
nltk.download('averaged_perceptron_tagger')
pos_tags = nltk.pos_tag(token_list)
pos_tags

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/pavel/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


[('Susan', 'NNP'),
 ('Calvin', 'NNP'),
 ('had', 'VBD'),
 ('been', 'VBN'),
 ('born', 'VBN'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('year', 'NN'),
 ('1982,', 'CD'),
 ('they', 'PRP'),
 ('said,', 'VBD'),
 ('which', 'WDT'),
 ('made', 'VBD'),
 ('her', 'PRP$'),
 ('seventy-five', 'JJ'),
 ('now.', 'NN')]

Here the abbreviations have the following meaning:
- NNP: proper noun, singular
- VBD: verb past tense
- VBN: verb past participle
- IN: preposition/subordinating conjunction 
- DT: determiner
- NN: noun, singular
- CD: cardinal number
- PRP: personal pronoun
- WDT: wh-determiner
- PRP$: possessive pronoun
- JJ: adjective

In [95]:
lemmatized_output_with_POS_information = [lemmatizer.lemmatize(token, get_part_of_speech_tags(token)) for token in token_list]
sl= ' '.join(lemmatized_output_with_POS_information)
print(s)  # the original sentence
print(sl)

Susan Calvin had been born in the year 1982, they said, which made her seventy-five now.
Susan Calvin have be born in the year 1982, they said, which make her seventy-five now.


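Observe that "born" and "said," have been left unchanged. The token "said," still carries its trailing comma because we split the sentence on whitespace only, so the lemmatizer cannot find it in WordNet. In addition, `get_part_of_speech_tags` tags every token in isolation, without its context, which can lead to a wrong POS tag for words such as "born".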

### Spacy lemmatizer

The Spacy lemmatizer comes with pretrained models that can parse text and figure out the various properties of the text, such as POS tags, named-entity tags, and so on, with a simple function call. 

The prebuilt models identify the POS tags and assign a lemma to each token,
unlike the WordNet lemmatizer, where the POS tags need to be explicitly provided.

If spaCy and its English model are not installed yet, install them as described at the very top of this document.

In [96]:
import spacy

s = "Susan Calvin had been born in the year 1982, they said, which made her seventy-five now."
nlp = spacy.load('en_core_web_sm')

doc = nlp(s)
sl = " ".join([token.lemma_ for token in doc])
print(s)
print(sl)

Susan Calvin had been born in the year 1982, they said, which made her seventy-five now.
Susan Calvin have be bear in the year 1982 , they say , which make she seventy - five now .


The spacy lemmatizer performed a decent job without the input information of the POS
tags. 

The advantage here is that there is no need to look out for external dependencies for
fetching POS tags as the information is built into the pretrained model.

Notice that tokenization has also been done before the lemmatization: the object `doc` is iterable and yields token objects. Each token object in turn has an attribute `.lemma_`:

In [97]:
for token in doc:
    print(f"{token} -> {token.lemma_}")

Susan -> Susan
Calvin -> Calvin
had -> have
been -> be
born -> bear
in -> in
the -> the
year -> year
1982 -> 1982
, -> ,
they -> they
said -> say
, -> ,
which -> which
made -> make
her -> she
seventy -> seventy
- -> -
five -> five
now -> now
. -> .


### N-grams

Until now, we have focused on tokens of size 1, which means only one word. 

Sentences generally contain names of people and places and other open compound terms, such as
_living room_ and _coffee mug_. 

These phrases convey a specific meaning when two or more words are used together. 

When used individually, they carry a different meaning altogether and the inherent meaning 
behind the compound terms is somewhat lost. 

The usage of multiple tokens to represent such inherent meaning can be highly beneficial for the
NLP tasks being performed. 

Even though such occurrences are rare, they still carry a lot of
information. 

Techniques should be employed to make sense of these as well.

In general, these are grouped under the umbrella term of n-grams. 

When n is equal to 1, these are termed as unigrams. 

Bigrams, or 2-grams, refer to pairs of words, such as _dinner table_. 

Phrases such as the _United Arab Emirates_ comprising three words are termed as
trigrams or 3-grams. 

This naming system can be extended to larger n-grams, but most NLP
tasks use only trigrams or lower.

Let’s understand how this works for the following sentence:

`Natural Language Processing is the way to go`

The phrase _Natural Language Processing_ carries an inherent meaning that would be
lost if each of the words in the phrase is processed individually.

However, when we use trigrams, these phrases can be extracted together and the meaning gets captured. 

Many NLP tasks make use of unigrams, bigrams, and trigrams together to capture as much of
this information as possible.

The following code illustrates an example of capturing bigrams:


In [98]:
from nltk.util import ngrams
s = "Natural Language Processing is the way to go"
tokens = s.split()
bigrams = list(ngrams(tokens, 2))
print([" ".join(token) for token in bigrams])

['Natural Language', 'Language Processing', 'Processing is', 'is the', 'the way', 'way to', 'to go']


Let's try and capture trigrams from the same sentence using the following code:

In [99]:
s = "Natural Language Processing is the way to go"
tokens = s.split()
trigrams = list(ngrams(tokens, 3))
print([" ".join(token) for token in trigrams])

['Natural Language Processing', 'Language Processing is', 'Processing is the', 'is the way', 'the way to', 'way to go']


Usually stemming or lemmatization is done before composing n-grams.
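
The sliding-window idea behind `ngrams` can also be sketched in plain Python. This is only an illustration: the helper name `make_ngrams` is our own and is not part of NLTK.

```python
def make_ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    # zip over n shifted copies of the list; zip stops at the shortest copy
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "Natural Language Processing is the way to go".split()
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)
print([" ".join(gram) for gram in bigrams])
print([" ".join(gram) for gram in trigrams])
```

A sentence of N tokens yields N - n + 1 n-grams when N is at least n, so the eight tokens above give seven bigrams and six trigrams.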

### Exercises

1\. Describe in writing why stopword removal is required when a vectorized model of a text is prepared.

2\. Describe in writing what stemming and lemmatization are. For what purpose are they used?

3\. Describe in writing what n-grams are and why using them may improve a text model.

## Lesson 2

### Transforming text into data structures

Once tokenization and word normalization are done, we can convert the collection of prepared tokens into a vocabulary.

This is a list of unique tokens that can be found in the text.

The next step is to transform it into a data structure.

There are multiple approaches for it. Some are very simple and others are very sophisticated even involving intermediate data processing with neural networks.

The final goal is to have vectors representing text in the most appropriate way.

### Bag of words

A very intuitive approach to representing a document is to use the frequency of the words
in that particular document. 

This is exactly what is done as part of the bag of words (BoW) approach.

The vocabulary-building step comes as a prerequisite to the BoW methodology. 

Once the vocabulary is available, each sentence can be represented as a vector. 

The length of this vector would be equal to the size of the vocabulary. 

Each entry in the vector would correspond to a term in the vocabulary, and the
number in that particular entry would be the frequency of the term in the sentence under
consideration. 

The lower limit for this number would be 0, indicating that the vocabulary
term does not occur in the sentence concerned.

The upper limit is the total number of occurrences of the word in the whole text corpus.

It is reached only if all occurrences of the word fall into a single sentence.

However, this is an extremely rare situation.

Let us look at a simple example of what a BoW can look like. We use the `sklearn` module for it. Its usage will be explained in detail a bit later.

In [100]:
from sklearn.feature_extraction.text import CountVectorizer

X = ["Computers can analyze text and understand text", 
     "They do it using vectors and matrices", 
     "Computers can process massive amounts of text data"]

vectorizer = CountVectorizer(stop_words='english')
X_vec = vectorizer.fit_transform(X)
# All tokens are stored as a dictionary {'token': position}. We print it sorted by positions
print(sorted(vectorizer.vocabulary_.items(), key=lambda x: x[1])   )
# Here each vector represents one sentence
print(X_vec.toarray())

[('amounts', 0), ('analyze', 1), ('computers', 2), ('data', 3), ('massive', 4), ('matrices', 5), ('process', 6), ('text', 7), ('understand', 8), ('using', 9), ('vectors', 10)]
[[0 1 1 0 0 0 0 2 1 0 0]
 [0 0 0 0 0 1 0 0 0 1 1]
 [1 0 1 1 1 0 1 1 0 0 0]]


The first vector corresponds to the sentence "Computers can analyze text and understand text". 

The number 1 appears at positions 1, 2 and 8. They correspond to the tokens "analyze", "computers" and "understand". The token "text" appears in the sentence twice, so we have 2 at position 7.

The words "can" and "and" have been omitted as stopwords.

### One-hot representation of tokens

A BoW vector can be considered as a sum of one-hot vectors corresponding to tokens.

In the course of text preprocessing a list of unique tokens is created. This is called a vocabulary.

In `CountVectorizer` the vocabulary is represented as a Python dictionary with tokens as keys and their positions in the vocabulary as values:

In [101]:
vocab = vectorizer.vocabulary_
print(vocab)

{'computers': 2, 'analyze': 1, 'text': 7, 'understand': 8, 'using': 9, 'vectors': 10, 'matrices': 5, 'process': 6, 'massive': 4, 'amounts': 0, 'data': 3}


The list of tokens can be treated as a set of allowed values of a categorical variable. 

Recall that in machine learning categorical variables are preferably represented in one-hot form.

The length of a one-hot vector equals the number of all allowed values. The vector has zeros everywhere except for a single position that holds a one.

This position corresponds to a particular value of the categorical variable.

Let us create one-hot vectors for the tokens from the previous example.

In [102]:
import numpy as np

size = len(vocab)

one_hot = {}

for token in vocab:
    position = vocab[token]
    vector = np.zeros(size, dtype=int)
    vector[position] = 1
    one_hot[token] = vector
    print(f"{token:15} {position:2}, {vector}")

computers        2, [0 0 1 0 0 0 0 0 0 0 0]
analyze          1, [0 1 0 0 0 0 0 0 0 0 0]
text             7, [0 0 0 0 0 0 0 1 0 0 0]
understand       8, [0 0 0 0 0 0 0 0 1 0 0]
using            9, [0 0 0 0 0 0 0 0 0 1 0]
vectors         10, [0 0 0 0 0 0 0 0 0 0 1]
matrices         5, [0 0 0 0 0 1 0 0 0 0 0]
process          6, [0 0 0 0 0 0 1 0 0 0 0]
massive          4, [0 0 0 0 1 0 0 0 0 0 0]
amounts          0, [1 0 0 0 0 0 0 0 0 0 0]
data             3, [0 0 0 1 0 0 0 0 0 0 0]


As a result we have a dictionary that maps tokens to their one-hot vectors:

In [103]:
print(one_hot)

{'computers': array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]), 'analyze': array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]), 'text': array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]), 'understand': array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]), 'using': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]), 'vectors': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]), 'matrices': array([0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]), 'process': array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]), 'massive': array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]), 'amounts': array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), 'data': array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0])}


Now to create a BoW vector for a sentence we have to add one-hot vectors of its tokens taking into account their numbers of repetitions.

For the sentence "computers can analyze text and understand text" we have:

$$
\text{computers} + \text{analyze} + 2\times \text{text} + \text{understand}
$$

Token "text" appears twice so we multiply its vector by 2:

In [104]:
bow_vector = one_hot["computers"] + one_hot["analyze"] + 2 * one_hot["text"] + one_hot["understand"]
print(bow_vector)

[0 1 1 0 0 0 0 2 1 0 0]
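
The same summation can be written as a small function that works for any token list. The sketch below is only illustrative: the names `demo_vocab`, `one_hot_vector` and `bow_from_one_hot` are our own, and the vocabulary is copied by hand from the `CountVectorizer` example above.

```python
import numpy as np

# Vocabulary copied from the CountVectorizer example: token -> position
demo_vocab = {'amounts': 0, 'analyze': 1, 'computers': 2, 'data': 3,
              'massive': 4, 'matrices': 5, 'process': 6, 'text': 7,
              'understand': 8, 'using': 9, 'vectors': 10}

def one_hot_vector(token, vocab):
    """One-hot vector with a single 1 at the token's vocabulary position."""
    vector = np.zeros(len(vocab), dtype=int)
    vector[vocab[token]] = 1
    return vector

def bow_from_one_hot(tokens, vocab):
    """Sum the one-hot vectors of all tokens that are in the vocabulary."""
    bow = np.zeros(len(vocab), dtype=int)
    for token in tokens:
        if token in vocab:  # stopwords such as "can" and "and" are skipped
            bow += one_hot_vector(token, vocab)
    return bow

sentence = "computers can analyze text and understand text"
demo_bow = bow_from_one_hot(sentence.split(), demo_vocab)
print(demo_bow)
```

A repeated token simply contributes its one-hot vector several times, so no explicit multiplication by the number of repetitions is needed.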


### BoW from scratch

Let us now consider the full process of building a BoW. We start from a raw text and perform its tokenization and cleaning. Then we remove stopwords and find stems. Finally we create the BoW.

In [105]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
import numpy as np

def tokenize_and_clean(sentence):
    """Tokenize sentence and clean it.
    """
    raw_tokens = nltk.word_tokenize(sentence)
    tokens = []
    for tok in raw_tokens:
        t1 = tok[1:] if tok[0] == "'" else tok  # we do not want to remove tokens like "'ve" 
        if t1.isalpha():
            tokens.append(tok.lower())
    return tokens

def stopwords_removal(tokens):
    wh_words = set(['who', 'what', 'when', 'why', 'how', 'which', 'where', 'whom'])
    stop = set(stopwords.words('english')) - wh_words
    return [tok for tok in tokens if tok not in stop]

def stemming(tokens):
    stemmer = SnowballStemmer(language = 'english')
    return [stemmer.stem(tok) for tok in tokens]

[nltk_data] Downloading package stopwords to /home/pavel/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


This is the text to analyze.

In [106]:
text = """
We are reading about Natural Language Processing Here. What an interesting topic of data science! 
Natural Language Processing makes computers to comprehend language data. The field 
of Natural Language Processing is evolving everyday."""

Splitting into sentences

In [107]:
sentences = nltk.sent_tokenize(text)
print(sentences)

['\nWe are reading about Natural Language Processing Here.', 'What an interesting topic of data science!', 'Natural Language Processing makes computers to comprehend language data.', 'The field \nof Natural Language Processing is evolving everyday.']


Tokenization and cleaning

In [108]:
raw_tokens_list = [tokenize_and_clean(sentence) for sentence in sentences]
print(raw_tokens_list)

[['we', 'are', 'reading', 'about', 'natural', 'language', 'processing', 'here'], ['what', 'an', 'interesting', 'topic', 'of', 'data', 'science'], ['natural', 'language', 'processing', 'makes', 'computers', 'to', 'comprehend', 'language', 'data'], ['the', 'field', 'of', 'natural', 'language', 'processing', 'is', 'evolving', 'everyday']]


Removing stopwords

In [109]:
tokens_list = [stopwords_removal(tokens) for tokens in raw_tokens_list]
print(tokens_list)

[['reading', 'natural', 'language', 'processing'], ['what', 'interesting', 'topic', 'data', 'science'], ['natural', 'language', 'processing', 'makes', 'computers', 'comprehend', 'language', 'data'], ['field', 'natural', 'language', 'processing', 'evolving', 'everyday']]


Stemming

In [110]:
stem_tokens_list = [stemming(tokens) for tokens in tokens_list]
print(stem_tokens_list)

[['read', 'natur', 'languag', 'process'], ['what', 'interest', 'topic', 'data', 'scienc'], ['natur', 'languag', 'process', 'make', 'comput', 'comprehend', 'languag', 'data'], ['field', 'natur', 'languag', 'process', 'evolv', 'everyday']]


Now we have fully prepared tokens and can create the vocabulary. We sort it to compare later with the vocabulary created by `CountVectorizer` from `sklearn`.

In [111]:
set_of_words = set()
for tokens in stem_tokens_list:
    for tok in tokens:
        set_of_words.add(tok)
vocab = sorted(list(set_of_words))
print(vocab)

['comprehend', 'comput', 'data', 'everyday', 'evolv', 'field', 'interest', 'languag', 'make', 'natur', 'process', 'read', 'scienc', 'topic', 'what']


Fetching the position of each word in the vocabulary. We use `OrderedDict` to preserve the ordering of the vocabulary.

In [112]:
from collections import OrderedDict

position = OrderedDict()
for i, token in enumerate(vocab):
    position[token] = i
print(position)

OrderedDict([('comprehend', 0), ('comput', 1), ('data', 2), ('everyday', 3), ('evolv', 4), ('field', 5), ('interest', 6), ('languag', 7), ('make', 8), ('natur', 9), ('process', 10), ('read', 11), ('scienc', 12), ('topic', 13), ('what', 14)])


Finally, we create the BoW matrix:

In [114]:
bow_matrix = np.zeros((len(stem_tokens_list), len(vocab)), dtype=int)

for i, tokens in enumerate(stem_tokens_list):
    for token in tokens:
        bow_matrix[i][position[token]] += 1

Here is our text (we clean it a little to make it look better). 

It has 4 sentences.

In [115]:
for sentence in sentences:
    print(sentence.replace('\n', ' ').strip())

We are reading about Natural Language Processing Here.
What an interesting topic of data science!
Natural Language Processing makes computers to comprehend language data.
The field  of Natural Language Processing is evolving everyday.


And this is its BoW representation. Each sentence is mapped to a vector.

In [116]:
print(bow_matrix)

[[0 0 0 0 0 0 0 1 0 1 1 1 0 0 0]
 [0 0 1 0 0 0 1 0 0 0 0 0 1 1 1]
 [1 1 1 0 0 0 0 2 1 1 1 0 0 0 0]
 [0 0 0 1 1 1 0 1 0 1 1 0 0 0 0]]


The size of each vector equals the length of the vocabulary.

Each vector element corresponds to a single token (word) of the vocabulary. Its value shows how many times this token appears in the sentence.
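
As a cross-check, the same matrix can be built more compactly with `collections.Counter`. This is a sketch of an alternative, not the method used above: the variable names `stems`, `stem_vocab` and `bow_counter` are our own, and the stemmed tokens are copied by hand from the output above.

```python
from collections import Counter
import numpy as np

# Stemmed tokens copied from the output of the stemming step
stems = [
    ['read', 'natur', 'languag', 'process'],
    ['what', 'interest', 'topic', 'data', 'scienc'],
    ['natur', 'languag', 'process', 'make', 'comput', 'comprehend', 'languag', 'data'],
    ['field', 'natur', 'languag', 'process', 'evolv', 'everyday'],
]

# Sorted vocabulary of unique stems
stem_vocab = sorted({tok for tokens in stems for tok in tokens})

# One Counter per sentence; a Counter returns 0 for a missing token
counts = [Counter(tokens) for tokens in stems]
bow_counter = np.array([[c[tok] for tok in stem_vocab] for c in counts])
print(bow_counter)
```

Each row is again one sentence and each column one vocabulary entry, so the result should coincide with `bow_matrix`.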

### Using `CountVectorizer` to build BoW

`CountVectorizer` is a tool provided by the `sklearn` or scikit-learn library in Python
that saves all the effort performed above and provides application
programming interfaces (APIs) that would conveniently help in building a BoW model.

It converts a list of text documents into a matrix such that each entry in the matrix would
correspond to the count of a particular token in the respective sentences. 

Let us look at how to instantiate `CountVectorizer` and fit data to it.

First we try the simplest way: we feed the `CountVectorizer` with already stemmed tokens in `stem_tokens_list`.

For this purpose the list of lists must be converted into sentences again, i.e., we need to join the tokens with spaces.

This is because `CountVectorizer` accepts raw sentences.

In [117]:
stem_sentences = [' '.join(tokens) for tokens in stem_tokens_list]
print(stem_sentences)

['read natur languag process', 'what interest topic data scienc', 'natur languag process make comput comprehend languag data', 'field natur languag process evolv everyday']


Now the vectorization is done:

In [118]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
bow_matrix2 = vectorizer.fit_transform(stem_sentences)

print(vectorizer.get_feature_names_out())
print(bow_matrix2.toarray())

['comprehend' 'comput' 'data' 'everyday' 'evolv' 'field' 'interest'
 'languag' 'make' 'natur' 'process' 'read' 'scienc' 'topic' 'what']
[[0 0 0 0 0 0 0 1 0 1 1 1 0 0 0]
 [0 0 1 0 0 0 1 0 0 0 0 0 1 1 1]
 [1 1 1 0 0 0 0 2 1 1 1 0 0 0 0]
 [0 0 0 1 1 1 0 1 0 1 1 0 0 0 0]]


We can compare this matrix with the one obtained manually:

In [119]:
print(bow_matrix2.toarray() - bow_matrix)

[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]


Now we will wrap the whole pipeline of data preparation into `CountVectorizer`.

We are going to feed it with the list of raw sentences and obtain BoW as a result.

Let us remember that the sentences read:

In [120]:
print(sentences)

['\nWe are reading about Natural Language Processing Here.', 'What an interesting topic of data science!', 'Natural Language Processing makes computers to comprehend language data.', 'The field \nof Natural Language Processing is evolving everyday.']


The code below accepts this list of sentences and returns the BoW.

Observe that `CountVectorizer` has a built-in tokenizer.

We fetch it with the method `.build_tokenizer()` and extend its functionality: stopword removal and stemming are added.

Of course, we could also use another tokenizer here, for example one from NLTK.

We do not use the built-in stopword removal since it is applied after our custom tokenizer and stemmer, i.e., it would be applied to stems rather than to the original words.
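
To get a feeling for what the built-in tokenizer does, its default behavior can be imitated with the standard `re` module. The sketch assumes the default `token_pattern` documented for `CountVectorizer`, namely `r"(?u)\b\w\w+\b"`. It does not call scikit-learn itself, and the example sentence is our own.

```python
import re

# Default token_pattern documented for CountVectorizer:
# a token is a run of two or more word characters
token_pattern = re.compile(r"(?u)\b\w\w+\b")

doc = "This is a test, isn't it?"
tokens = token_pattern.findall(doc)
print(tokens)

# Lowercasing is a separate preprocessing step, shown here explicitly
print([tok.lower() for tok in tokens])
```

Punctuation and one-character tokens such as "a" and the "t" of "isn't" are dropped. The tokenizer itself does not change the case; with the default `lowercase=True`, `CountVectorizer` lowercases the document before it is passed to the tokenizer, including a custom one.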

In [121]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer(language = 'english')
tokenizer = CountVectorizer().build_tokenizer()

wh_words = set(['who', 'what', 'when', 'why', 'how', 'which', 'where', 'whom'])
stop = set(stopwords.words('english')) - wh_words

def stemmed_tokenizer(doc):
    tokens = [tok for tok in tokenizer(doc) if tok not in stop]
    stem_tokens = [stemmer.stem(tok) for tok in tokens]
    return stem_tokens

stem_vectorizer = CountVectorizer(tokenizer=stemmed_tokenizer)
bow_matrix3 = stem_vectorizer.fit_transform(sentences);

print(stem_vectorizer.get_feature_names_out())
print(bow_matrix3.toarray())

['comprehend' 'comput' 'data' 'everyday' 'evolv' 'field' 'interest'
 'languag' 'make' 'natur' 'process' 'read' 'scienc' 'topic' 'what']
[[0 0 0 0 0 0 0 1 0 1 1 1 0 0 0]
 [0 0 1 0 0 0 1 0 0 0 0 0 1 1 1]
 [1 1 1 0 0 0 0 2 1 1 1 0 0 0 0]
 [0 0 0 1 1 1 0 1 0 1 1 0 0 0 0]]


In [122]:
print(bow_matrix3.toarray() - bow_matrix)

[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]


### Getting N-grams with `CountVectorizer`

Simple BoW totally ignores the order of words, but we know that this is important.

N-grams allow us to take local word ordering into account.

`CountVectorizer` has an option `ngram_range` for this.

This option takes a tuple `(min_n, max_n)` that specifies the smallest and the largest n of the n-grams to create.

In [123]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_ngram = CountVectorizer(ngram_range=(1, 3), stop_words='english')
bow_matrix_ngram = vectorizer_ngram.fit_transform(sentences);

print(vectorizer_ngram.get_feature_names_out())
print(bow_matrix_ngram.toarray())

['comprehend' 'comprehend language' 'comprehend language data' 'computers'
 'computers comprehend' 'computers comprehend language' 'data'
 'data science' 'everyday' 'evolving' 'evolving everyday' 'field'
 'field natural' 'field natural language' 'interesting'
 'interesting topic' 'interesting topic data' 'language' 'language data'
 'language processing' 'language processing evolving'
 'language processing makes' 'makes' 'makes computers'
 'makes computers comprehend' 'natural' 'natural language'
 'natural language processing' 'processing' 'processing evolving'
 'processing evolving everyday' 'processing makes'
 'processing makes computers' 'reading' 'reading natural'
 'reading natural language' 'science' 'topic' 'topic data'
 'topic data science']
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1
  0 0 0 0]
 [0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  1 1 1 1]
 [1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 2 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 0 0 

### Limiting vocabulary size

The standard problem of the BoW model is a vocabulary that is too large. For a large text the vocabulary can be really huge.

The size of the vocabulary equals the size of the vectors that are used to represent each sentence.

And these vectors are usually very sparse: only a few entries are nonzero.

Subsequent processing of such a model will be inefficient and memory consuming.

`CountVectorizer` provides a parameter called `max_features` that limits the vocabulary
to at most `max_features` terms, chosen as the most frequent ones
in the corpus:

In [124]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_max_features = CountVectorizer(ngram_range=(1, 3), stop_words='english', max_features=6)
bow_matrix_max_features = vectorizer_max_features.fit_transform(sentences);

print(vectorizer_max_features.get_feature_names_out())
print(bow_matrix_max_features.toarray())

['language' 'language processing' 'natural' 'natural language'
 'natural language processing' 'processing']
[[1 1 1 1 1 1]
 [0 0 0 0 0 0]
 [2 1 1 1 1 1]
 [1 1 1 1 1 1]]


This example illustrates that only the six most frequently occurring n-grams among
unigrams, bigrams, and trigrams in the corpus were selected since the value of the
`max_features` parameter was set to 6.

Observe that 6 is too small for our text: the second vector has no nonzero features at all.
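
The selection can be imitated by hand: count every n-gram over the whole corpus and keep the most frequent ones. The sketch below is an illustration only, not the scikit-learn implementation. The token lists are copied by hand from the features shown earlier (lowercased, English stopwords removed), and the helper `all_ngrams` is our own.

```python
from collections import Counter

# Sentences after lowercasing and stopword removal, copied by hand
docs = [
    ['reading', 'natural', 'language', 'processing'],
    ['interesting', 'topic', 'data', 'science'],
    ['natural', 'language', 'processing', 'makes', 'computers',
     'comprehend', 'language', 'data'],
    ['field', 'natural', 'language', 'processing', 'evolving', 'everyday'],
]

def all_ngrams(tokens, min_n, max_n):
    """Yield every n-gram with min_n <= n <= max_n as a space-joined string."""
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

# Total number of occurrences of each n-gram in the corpus
corpus_counts = Counter()
for tokens in docs:
    corpus_counts.update(all_ngrams(tokens, 1, 3))

max_features = 6
top = sorted(term for term, _ in corpus_counts.most_common(max_features))
print(top)
```

Here "language" occurs four times and five other n-grams occur three times each, so the cut at six features is unambiguous. When several terms tie at the cut-off, which of them survive may depend on implementation details.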

### Min_df and Max_df thresholds

Now that we are clear on how `max_features` helps by limiting the vocabulary size, we also
need to understand that at the top of this limited vocabulary would be terms or phrases
that have occurred very frequently in the text corpus under consideration. 

These phrases might occur very frequently in an individual document or may be present in almost all
documents in the corpus, and may not carry any pattern. 

One approach we have discussed so far to remove such terms is the removal of stopwords.
(Too frequent terms are removed because being everywhere in the text they do not carry much 
of information. Not so frequent terms are expected to be more informative.)

Another convenient technique that comes along with `CountVectorizer` is `max_df`, which
ignores terms whose document frequency, i.e., the number of documents containing the term,
is higher than the threshold given by the `max_df` parameter. 

Similarly, the `min_df` parameter removes rare terms whose document frequency is lower than a given 
threshold. An integer threshold is an absolute number of documents, while a float between 0 and 1 is a proportion of documents.

This can potentially have issues as these rarely occurring terms might be very
significant for certain documents in the text corpus. 

We will look into how to capture such information in the TF-IDF vectors section below.

The following example illustrates how `max_df` and `min_df` can be put into action and
consequently provide minimum and maximum thresholds toward the occurrence of a
phrase in a corpus:

In [125]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_minmax_df = CountVectorizer(ngram_range=(1, 3), stop_words='english', max_df = 3, min_df = 2)
bow_matrix_minmax_df = vectorizer_minmax_df.fit_transform(sentences);

print(vectorizer_minmax_df.get_feature_names_out())
print(bow_matrix_minmax_df.toarray())

['data' 'language' 'language processing' 'natural' 'natural language'
 'natural language processing' 'processing']
[[0 1 1 1 1 1 1]
 [1 0 0 0 0 0 0]
 [1 2 1 1 1 1 1]
 [0 1 1 1 1 1 1]]
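
To see why these features survive, the document frequencies can be computed by hand. The sketch below is restricted to unigrams for brevity. The token lists are copied by hand (lowercased, English stopwords removed), and the names `doc_freq` and `kept` are our own.

```python
from collections import Counter

# Sentences after lowercasing and stopword removal, copied by hand
docs = [
    ['reading', 'natural', 'language', 'processing'],
    ['interesting', 'topic', 'data', 'science'],
    ['natural', 'language', 'processing', 'makes', 'computers',
     'comprehend', 'language', 'data'],
    ['field', 'natural', 'language', 'processing', 'evolving', 'everyday'],
]

# Document frequency: in how many documents does each term appear?
doc_freq = Counter()
for tokens in docs:
    doc_freq.update(set(tokens))  # set() counts a term once per document

min_df, max_df = 2, 3  # integer thresholds: absolute numbers of documents
kept = sorted(term for term, df in doc_freq.items() if min_df <= df <= max_df)
print(kept)
```

Note that "language" occurs twice in the third sentence but still has a document frequency of 3, because only the number of documents containing it matters.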


### Limitations of the BoW representation

The BoW model provides a mechanism for representing text data using numbers. 

However, there are certain limitations to it. 

The model only relies on the count of terms in a document. 

This might work well for certain tasks or use cases with a limited vocabulary,
but it would not scale to large vocabularies efficiently.

The BoW model also tends to eliminate or reduce the
significance of tokens or phrases that occur very rarely. 

These phrases might be present in a
very small number of documents, but they can be very important in the representation of
those documents. 

The BoW model has no mechanism to give such rare but informative terms a higher weight.

The BoW vocabulary can also become extremely large for a large text
corpus. 

This leads to huge vectors representing every document, which might
degrade the model's performance.

### TF-IDF vectors

In BoW models the frequency of words within a document is the only criterion for building document vectors. 

Words that occur rarely are either removed or get weights that are too low compared to words that occur
very frequently. 

With this approach, the information carried by terms that are rare but highly informative for a document,
or that form an evident pattern across similar documents, is lost. 

The TF-IDF approach to weighting terms in a text corpus helps mitigate this issue.

The TF-IDF approach is by far the most commonly used approach to weighting terms. 

It is found in many applications. 

The definition below refers to a text corpus that consists of documents.

In our examples the text corpus corresponds to the variables `text` or `sentences` used above, and
one document is one sentence.

TF-IDF is a composite of two terms, which are described as follows:

- TF (term frequency) is similar to what `CountVectorizer` computes. 
It takes into account how frequently
a term occurs in a document. Since the documents in a text corpus usually have
different lengths, it is very likely that a term appears more frequently in
longer documents than in shorter ones. This calls for normalizing the
frequency of the term by dividing it by the total number of terms in the document.
There are multiple variations to calculate TF, but the following is the most
common representation:

$$
\mathrm{TF}(w, d)=\frac{\text{Number of times the word $w$ occurs in a document $d$}}
{\text{Total number of words in the document $d$}}
$$

- IDF (inverse document frequency) takes into account those terms that occur 
not so frequently across documents but might be more meaningful in 
representing the document. It measures how informative a term is across the corpus. 
Using TF alone would give more weight to terms that 
occur very frequently. IDF does just the opposite:
the weights of frequently occurring terms are suppressed and the
weights of possibly more meaningful but less frequently occurring terms are
scaled up. Similar to TF, there are multiple ways to measure IDF, but the
following is the most common representation:

$$
\mathrm{IDF}(w)=\log\frac{\text{Total number of documents}}
{\text{Number of documents containing word $w$}}
$$

The final weight of word $w$ in document $d$ is given by the following TF-IDF weighting:

$$
\mathrm{Weight}(w,d)=\mathrm{TF}(w, d)\times \mathrm{IDF}(w)
$$

As can be seen, the weight of word $w$ in document $d$ is a product of the TF of word $w$ in
document $d$ and the IDF of word $w$ across the text corpus.
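
The three formulas can be evaluated by hand on our stemmed sentences. This is a minimal sketch under two assumptions: the logarithm is taken to be natural (the formula above does not fix the base), and the helper names `tf`, `idf` and `tfidf` are our own.

```python
import math

# Stemmed tokens of the four sentences (documents), copied by hand
docs = [
    ['read', 'natur', 'languag', 'process'],
    ['what', 'interest', 'topic', 'data', 'scienc'],
    ['natur', 'languag', 'process', 'make', 'comput', 'comprehend', 'languag', 'data'],
    ['field', 'natur', 'languag', 'process', 'evolv', 'everyday'],
]

def tf(word, doc):
    """Occurrences of the word divided by the document length."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """Log of (number of documents / number of documents containing the word).
    Assumes the word occurs in at least one document."""
    n_containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / n_containing)

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

doc = docs[2]  # the third sentence
for word in ['languag', 'data', 'comput']:
    print(f"{word:8} TF={tf(word, doc):.3f}  IDF={idf(word, docs):.3f}  "
          f"weight={tfidf(word, doc, docs):.3f}")
```

Although "languag" is the most frequent token of the third sentence, it appears in three of the four documents, so its IDF is small and its final weight is lower than that of "comput", which occurs in this sentence only.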

Let us understand how all this pans out in action. 

We will take the same corpus as the one taken for the `CountVectorizer` model for this example 
to see the differences. 

The data undergoes the same preprocessing pipeline here as well.

In [126]:
text = """
We are reading about Natural Language Processing Here. What an interesting topic of data science! 
Natural Language Processing makes computers to comprehend language data. The field 
of Natural Language Processing is evolving everyday."""

sentences = nltk.sent_tokenize(text)
print(sentences)

['\nWe are reading about Natural Language Processing Here.', 'What an interesting topic of data science!', 'Natural Language Processing makes computers to comprehend language data.', 'The field \nof Natural Language Processing is evolving everyday.']


In [128]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer(language = 'english')
tokenizer = TfidfVectorizer().build_tokenizer()

wh_words = set(['who', 'what', 'when', 'why', 'how', 'which', 'where', 'whom'])
stop = set(stopwords.words('english')) - wh_words

def stemmed_tokenizer(doc):
    tokens = [tok for tok in tokenizer(doc) if tok not in stop]
    stem_tokens = [stemmer.stem(tok) for tok in tokens]
    return stem_tokens

tfidf_vectorizer = TfidfVectorizer(tokenizer=stemmed_tokenizer)
tfidf_matrix = tfidf_vectorizer.fit_transform(sentences);

The results on the preprocessed corpus after TF-IDF vectorization are shown below. 

The vocabulary is the same as that of `CountVectorizer`; however, the
weights of the terms across the documents are completely different:

In [129]:
print(tfidf_vectorizer.get_feature_names_out(), "\n")
print(tfidf_matrix.toarray())

['comprehend' 'comput' 'data' 'everyday' 'evolv' 'field' 'interest'
 'languag' 'make' 'natur' 'process' 'read' 'scienc' 'topic' 'what'] 

[[0.         0.         0.         0.         0.         0.
  0.         0.42817512 0.         0.42817512 0.42817512 0.67081906
  0.         0.         0.        ]
 [0.         0.         0.36673901 0.         0.         0.
  0.46516193 0.         0.         0.         0.         0.
  0.46516193 0.46516193 0.46516193]
 [0.40601945 0.40601945 0.3201104  0.         0.         0.
  0.         0.51831391 0.40601945 0.25915695 0.25915695 0.
  0.         0.         0.        ]
 [0.         0.         0.         0.48666375 0.48666375 0.48666375
  0.         0.31063117 0.         0.31063117 0.31063117 0.
  0.         0.         0.        ]]


Additionally, each vector, i.e., each row of this matrix, has been normalized: after computation each vector element 
has been divided by the vector's Euclidean norm (also called the L2 norm). 

The resulting vectors have unit Euclidean lengths.

This is needed for the following computation of their distances.

This normalization can be switched off by setting `norm=None` parameter of `TfidfVectorizer`.
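
To make the weighting and the normalization concrete, the sketch below recomputes the matrix printed above with plain `numpy`. It assumes the default settings of `TfidfVectorizer` (raw term counts, smoothed IDF $\ln\frac{1+n}{1+\mathrm{df}}+1$, L2 normalization). The count table `rows` is typed in by hand from the stemmed sentences, and the names `counts` and `tfidf_manual` are introduced only for this illustration.

```python
import numpy as np

# Hand-typed term counts for the four stemmed sentences over the
# 15-term vocabulary printed above (same column order).
vocab = ['comprehend', 'comput', 'data', 'everyday', 'evolv', 'field', 'interest',
         'languag', 'make', 'natur', 'process', 'read', 'scienc', 'topic', 'what']
rows = [
    {'languag': 1, 'natur': 1, 'process': 1, 'read': 1},
    {'data': 1, 'interest': 1, 'scienc': 1, 'topic': 1, 'what': 1},
    {'comprehend': 1, 'comput': 1, 'data': 1, 'languag': 2, 'make': 1,
     'natur': 1, 'process': 1},
    {'everyday': 1, 'evolv': 1, 'field': 1, 'languag': 1, 'natur': 1, 'process': 1},
]
counts = np.zeros((len(rows), len(vocab)))
for i, row in enumerate(rows):
    for term, c in row.items():
        counts[i, vocab.index(term)] = c

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                # document frequency of each term
idf = np.log((1 + n_docs) / (1 + df)) + 1    # smoothed IDF (assumed default)
tfidf_raw = counts * idf                     # TF times IDF, not yet normalized
norms = np.linalg.norm(tfidf_raw, axis=1, keepdims=True)
tfidf_manual = tfidf_raw / norms             # every row now has unit L2 norm

print(np.round(tfidf_manual[0], 4))
print(np.linalg.norm(tfidf_manual, axis=1))
```

Skipping the division by `norms` gives the unnormalized weights, which is what `norm=None` is expected to return under the same assumptions.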

### N-grams and maximum features in the TF-IDF vectorizer

Similar to `CountVectorizer`, the TF-IDF vectorizer offers the capability of using n-grams
and max_features to limit our vocabulary:

In [131]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 3), stop_words='english', max_features=6)
tfidf_matrix = tfidf_vectorizer.fit_transform(sentences);

print(tfidf_vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray())

['language' 'language processing' 'natural' 'natural language'
 'natural language processing' 'processing']
[[0.40824829 0.40824829 0.40824829 0.40824829 0.40824829 0.40824829]
 [0.         0.         0.         0.         0.         0.        ]
 [0.66666667 0.33333333 0.33333333 0.33333333 0.33333333 0.33333333]
 [0.40824829 0.40824829 0.40824829 0.40824829 0.40824829 0.40824829]]


Here, we took the top six features among unigrams, bigrams, and trigrams, and used them
to represent the TF-IDF vectors. 

The TF-IDF vectorizer provides the `min_df` and `max_df` parameters as well, and the usage 
is exactly the same as `CountVectorizer`. 

### Limitations of the TF-IDF vectorizer's representation

The TF-IDF vectorizer improves on `CountVectorizer` by using the IDF component to scale the term weights:
terms that occur in many documents are down-weighted relative to rarer terms. 

It is also computationally fast. 

However, it still relies on lexical analysis and does not take into
account things such as semantics, the context associated with
terms, and the position of a term in a document. 

It is dependent on the vocabulary size, like `CountVectorizer`, and will get really slow with large vocabulary sizes.

### Cosine similarity

TF-IDF vectors can be compared using the so-called cosine similarity.

Let us first recall:

The dot product is the sum of the componentwise products of two vectors $x$ and $y$:

$$
(x, y) = \sum_{i=0}^{N-1} x_i y_i
$$

Using the dot product we can compute angles between vectors due to the following property:

$$
(x, y) = |x|_2 |y|_2 \cos(\alpha)
$$

where $|x|_2$ and $|y|_2$ are Euclidean norms and $\alpha$ is the angle between $x$ and $y$.

Since $\cos 90^\circ=\cos \pi/2=0$ the dot product of two orthogonal vectors is always zero.

The cosine of the angle between two vectors can be used as a measure of their similarity. This is called the cosine similarity. 

The cosine similarity is highest and equals 1 when the angle between two vectors is zero. For nonzero angles the similarity is less than 1.

Since by default `TfidfVectorizer` produces TF-IDF vectors that are already normalized, i.e., $|x|_2=|y|_2=1$, the dot product of two such vectors 
is automatically their cosine similarity.
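
As a small check of this property, the sketch below compares the general formula $(x, y) / (|x|_2 |y|_2)$ with the plain dot product of the normalized vectors. The two vectors are arbitrary illustrative numbers, not TF-IDF output, and the helper `cosine_similarity` is defined here only for this example.

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two nonzero vectors."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([3.0, 4.0, 0.0])
y = np.array([4.0, 3.0, 0.0])
z = np.array([0.0, 0.0, 2.0])   # orthogonal to both x and y

# Normalize the vectors to unit Euclidean length.
x_unit = x / np.linalg.norm(x)
y_unit = y / np.linalg.norm(y)

sim_general = cosine_similarity(x, y)   # formula for arbitrary nonzero vectors
sim_unit = np.dot(x_unit, y_unit)       # plain dot product of unit vectors
sim_orthogonal = cosine_similarity(x, z)
print(sim_general, sim_unit, sim_orthogonal)
```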

Cosine similarity is used to reveal sentences with similar meanings.

Let us check how it works.

Consider the following sentences. Sentences 1 and 2, as well as sentences 3 and 4, are similar, while these 
two pairs do not resemble each other.

Sentence 5 bears a slight similarity to all the others: it mentions rain and snowing as well as the Moon and Earth.

In [132]:
sentences2 = [
    "He likes snowing", 
    "Besides of snowing he likes rain", 
    "Earth has a satellite", 
    "Moon rotates around Earth as its satellite",
    "Unlike Earth on the Moon there is no rain or snowing"
]

This function takes a matrix whose rows are TF-IDF vectors and computes all pairwise cosine similarities.

In [133]:
def cosine_sim(mat):
    res = []
    N = len(mat)
    for i in range(N):
        for j in range(i+1, N):
            res.append((i+1, j+1, np.dot(mat[i], mat[j])))
    return res

Let us compute TF-IDF matrix for the above sentences and check their similarities.

In [134]:
from sklearn.feature_extraction.text import TfidfVectorizer

mat = TfidfVectorizer().fit_transform(sentences2).toarray()
cosine_sim(mat)

[(1, 2, 0.6306281134574364),
 (1, 3, 0.0),
 (1, 4, 0.0),
 (1, 5, 0.11177554172274358),
 (2, 3, 0.0),
 (2, 4, 0.0),
 (2, 5, 0.1727873301513294),
 (3, 4, 0.3164243062309872),
 (3, 5, 0.102060582469462),
 (4, 5, 0.1511657903269393)]

Analyzing the numbers, we see that pairs 1-2 and 3-4 are indeed similar, that there is no similarity between these two pairs, and that sentence 5 resembles all the others a little.
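
The double loop used in `cosine_sim` can also be written as a single matrix product, which may be convenient for larger matrices. The sketch below is one possible variant, shown on three hand-made unit-length rows rather than on real TF-IDF output. The names `toy`, `sim_matrix` and `pairs` are illustrative.

```python
import numpy as np

# Three illustrative unit-length rows (not real TF-IDF output).
toy = np.array([
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
])

# Entry (i, j) of the product is the dot product of rows i and j.
sim_matrix = toy @ toy.T

# Keep only the pairs with i < j, numbered from 1 as in cosine_sim.
i_idx, j_idx = np.triu_indices(len(toy), k=1)
pairs = [(int(i) + 1, int(j) + 1, float(sim_matrix[i, j]))
         for i, j in zip(i_idx, j_idx)]
print(pairs)
```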

### Word embedding

BoW and TF-IDF vector models of text have two common disadvantages:
- Sparsity of vectors. For a large text the vocabulary is large, and the length of each vector representing a sentence equals the size of the vocabulary. But most entries of this vector are zeros because each sentence contains only a few words from the whole vocabulary.
- Ignoring word context. Information about the neighborhood of a word is not taken into account, although the neighborhood
carries important information about the context in which the word is used in a sentence. 

The approach free from these disadvantages is called word embedding. 

Within this approach, words are represented as vectors of relatively low dimension in such a way that words with similar meanings are represented by close vectors. 

This is achieved by taking into account the co-occurrence of words in the sentences of a text corpus.

Word embedding is computed using neural networks.

### Word2vec model

Word2vec model is one of the widely used models of word embedding.

Let us see how it works. We will use `gensim` library for it.

As a first example we will download the text corpus `text8`, which is available through the `gensim` downloader.

This is nothing but the "First 100,000,000 bytes of plain text from Wikipedia".

In [135]:
import gensim.downloader as api
api.info('text8')

{'num_records': 1701,
 'record_format': 'list of str (tokens)',
 'file_size': 33182058,
 'reader_code': 'https://github.com/RaRe-Technologies/gensim-data/releases/download/text8/__init__.py',
 'license': 'not found',
 'description': 'First 100,000,000 bytes of plain text from Wikipedia. Used for testing purposes; see wiki-english-* for proper full Wikipedia datasets.',
 'checksum': '68799af40b6bda07dfa47a32612e5364',
 'file_name': 'text8.gz',
 'read_more': ['http://mattmahoney.net/dc/textdata.html'],
 'parts': 1}

In [136]:
corpus = api.load('text8')
print(type(corpus))

<class 'text8.Dataset'>


Let us see what we have. Variable `corpus` is an instance of a gensim class `Dataset`. 

It can be converted to a plain list like this:

In [137]:
data = [d for d in corpus]
print(len(data))

1701


We have a list of records. Each record is just a part of a Wikipedia article, already tokenized.

In [138]:
print(data[0][:25])
print(data[1][:25])

['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution', 'and', 'the', 'sans', 'culottes']
['reciprocity', 'qualitative', 'impairments', 'in', 'communication', 'as', 'manifested', 'by', 'at', 'least', 'one', 'of', 'the', 'following', 'delay', 'in', 'or', 'total', 'lack', 'of', 'the', 'development', 'of', 'spoken', 'language']


The total number of unique tokens is about 254 thousand:

In [139]:
tokens = set()
for record in data:
    tokens.update(set(record))
print(len(tokens))

253854


We split the dataset into two parts to show that the Word2vec model can be updated with newly arrived data.

In [140]:
data_part1 = data[:1000]
data_part2 = data[1000:]

Let us create Word2vec model for this corpus. 

This is done using the class `Word2Vec`. It can be fed an instance of `Dataset`, that is our `corpus` itself, or a list of lists of tokens.

Parameter `workers` specifies how many CPU cores will be used for training. To employ all available CPU cores we use the function `cpu_count()` from the `multiprocessing` library.

Parameter `vector_size` specifies the size of the resulting vectors representing words. Typical values are between 100 and 300; the default value is 100. 

Notice that a vector size of 100 is actually small. Above we found that the number of unique tokens in this dataset is about 254 thousand. This is the vocabulary size. If we were building BoW or TF-IDF models, we would use vectors of this size.

Parameter `min_count` restricts the vocabulary so that word vectors are only built for words that occur at least `min_count` times in the corpus. The default value of `min_count` is 5, so when training a model on a small corpus it is reasonable to override the default and set `min_count=1`.

Parameter `window` controls how many neighboring words are taken into account: it is the maximum distance between the current word and a context word. `window=5` means that up to 5 neighbors on each side of the current word are considered.
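
To illustrate what a context window is, the sketch below lists (center, context) pairs for a short token sequence. It is a simplified illustration with a fixed window and a hypothetical helper `context_pairs`; it is not the code `gensim` runs internally, and real implementations add refinements that are not shown here.

```python
def context_pairs(tokens, window):
    """Return (center, context) pairs with contexts at most `window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                 # leftmost context position
        hi = min(len(tokens), i + window + 1)   # one past the rightmost position
        for j in range(lo, hi):
            if j != i:                          # the center word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse']
pairs_w2 = context_pairs(tokens, window=2)

# Context words of 'a' with window=2: up to two neighbors on each side.
context_of_a = [ctx for center, ctx in pairs_w2 if center == 'a']
print(context_of_a)
```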

In [141]:
from gensim.models.word2vec import Word2Vec
from multiprocessing import cpu_count

model = Word2Vec(data_part1, workers=cpu_count(), vector_size=100, window=5, min_count=1)

The model is ready. That is how we can get a vector representing a token:

In [142]:
print(model.wv['science'])

[ 1.0988156   1.1851287  -0.1812801   1.9452833  -0.03114034  0.3643762
 -1.3559775   0.5539245  -0.09251617 -0.05850023 -2.0460358  -0.14969215
 -0.29326993  0.40652508  1.1700534   1.6707319  -0.17668715 -1.2871346
 -0.7278299   1.3315673  -1.3031261  -0.6099126  -2.1808357   0.08573119
 -0.9529415  -1.6262963   1.0095385   3.145084    1.653135    0.3627537
 -1.6213607   1.38337    -1.032838   -2.8465905  -0.23281428  1.8973178
 -2.209518    0.66805017 -0.92841035  1.4418032   0.6985215  -1.119551
 -1.3932337  -0.11649393  1.1610394   0.381421    2.686128    1.202984
 -2.4556758   0.7286778  -0.48711845  1.9086686   3.1575143  -2.2418313
 -2.8425183   1.5637796   0.03529016 -0.34123522 -3.1144638  -0.46221352
  1.515962    0.75416195  3.0945842   0.24025121 -1.7028453   1.6288635
  4.261755   -1.6062495  -2.1672902   0.0628279   0.64470625  1.1294228
  0.6317704   0.6122948  -0.5962357   1.2249781   0.61357033 -0.10868856
 -2.139901    0.24257016  0.8102151  -1.692973    0.6539546   

Only vectors for words from a vocabulary can be obtained. The following code raises an exception due to a rare word that is not found in the vocabulary:

In [143]:
try:
    print(model.wv['agastopia'])
except KeyError as e:
    print(e)

"Key 'agastopia' not present"


Now assume that some new data has arrived. We can use it to improve the model and extend its vocabulary.

Given a trained model, one needs to call the `.build_vocab()` method on the new dataset and then call the `.train()` method. 

Method `.build_vocab()` is called first because the model has to be apprised of what new words to expect in the incoming corpus.

Method `.train()` updates the model using new corpus. 

Parameter `total_examples` specifies the number of new data samples. Its appropriate value is stored in `model.corpus_count` after calling `.build_vocab()`. 

Parameter `epochs` tells how many epochs to train. We will train for the same number of epochs as before.

In [144]:
model.build_vocab(data_part2, update=True)
model.train(data_part2, total_examples=model.corpus_count, epochs=model.epochs);

Now let us have a look  at a few examples to understand what relationships and analogies can be captured by a Word2vec model. 

A very frequently used example deals with the embedding
of King, Man, Queen, and Woman. 

Once a Word2vec model is built properly and the embedding from it is obtained for these words, the following relationship is frequently obtained, provided that these words are actually a part of the vocabulary:

$$
\mathrm{vector}(\text{Man}) - \mathrm{vector}(\text{King}) + \mathrm{vector}(\text{Queen}) = \mathrm{vector}(\text{Woman})
$$

This equation boils down to the following relationship:

$$
\mathrm{vector}(\text{Man}) + \mathrm{vector}(\text{Queen}) = 
\mathrm{vector}(\text{Woman}) + \mathrm{vector}(\text{King})
$$

The thought process here is that the relationship of Man:King is the same as Woman:Queen.

The Word2vec algorithm is able to capture these semantic relationships when it devises an embedding for each of these words.
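
The arithmetic behind this relationship can be tried on a toy example. The sketch below uses hypothetical hand-made 3-dimensional vectors whose coordinates loosely stand for "royalty", "male" and "female". They are not produced by any trained model, and the search is a simplified version of what an analogy query does.

```python
import numpy as np

# Hypothetical embeddings with coordinates (royalty, male, female).
toy_vectors = {
    'king':  np.array([1.0, 1.0, 0.0]),
    'queen': np.array([1.0, 0.0, 1.0]),
    'man':   np.array([0.0, 1.0, 0.0]),
    'woman': np.array([0.0, 0.0, 1.0]),
    'apple': np.array([0.5, 0.5, 0.5]),
}

def cosine(x, y):
    """Cosine similarity of two nonzero vectors."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# vector(Man) - vector(King) + vector(Queen)
target = toy_vectors['man'] - toy_vectors['king'] + toy_vectors['queen']

# Rank all words except the three query words by similarity to the target.
candidates = [w for w in toy_vectors if w not in ('man', 'king', 'queen')]
best = max(candidates, key=lambda w: cosine(toy_vectors[w], target))
print(best)
```

In a trained model the equality holds only approximately, so the result is the word whose vector is closest to the computed target rather than an exact match.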

Let us check that it really works. We use the `.most_similar()` method for it.

Its parameter `positive` specifies the words that the sought word has to be most similar to, and `negative` specifies the words it has to be least similar to.

In [145]:
model.wv.most_similar(positive=["man", "queen"], negative=["king"])

[('woman', 0.6735669374465942),
 ('girl', 0.6734843850135803),
 ('blonde', 0.5943896770477295),
 ('lady', 0.5814957618713379),
 ('bride', 0.5650728940963745),
 ('bird', 0.5568346977233887),
 ('creature', 0.5520504713058472),
 ('cow', 0.5500223636627197),
 ('dog', 0.5497346520423889),
 ('eriboea', 0.5417327284812927)]

One more example: find something like "vatican" in "italy" but for "england". 

Among the top results the model returns "westminster", probably having Westminster Abbey in mind.

In [146]:
model.wv.most_similar(positive=["england", "vatican"], negative=["italy"])

[('harmar', 0.5710040330886841),
 ('westminster', 0.5597139000892639),
 ('wales', 0.5442865490913391),
 ('canterbury', 0.5321040153503418),
 ('episcopal', 0.5305808782577515),
 ('diocese', 0.5183550119400024),
 ('hampshire', 0.5180380940437317),
 ('parish', 0.5175967812538147),
 ('privy', 0.516585648059845),
 ('lambeth', 0.5126064419746399)]

### Word mover’s distance

In the previous section we discussed that measuring word similarity is one of the major use cases of Word2vec. 

Think of a problem statement, such as one where we are
building an engine that can rank resumes based on their relevance to a job description.

Here, we ideally need to figure out the distance between the job description and the set of resumes. 

The smaller the distance between the resume and the job description, the higher the relevance of the resume to the job description.

One measure discussed above is cosine similarity, which tells how close or how far apart text documents are. 

Now we will discuss another measure, Word Mover's Distance (WMD), which is more relevant than cosine similarity, especially when
we base the distance measure for documents on word embeddings.

According to the WMD idea, the dissimilarity between two text
documents can be measured as the minimum amount of distance that the embedded words of 
one document need to travel to reach the embedded words of another document.

Let us look at a standard example that is most often used to illustrate the idea. 

Sentence 1: "Obama speaks to the media in Illinois."

Sentence 2: "President greets the press in Chicago."

Based on the Word2vec model, the embedding for Obama would be very close to
President.  Similarly, speaks would be pretty close to greets, media would be pretty
close to press, and Illinois would map pretty closely to Chicago.

Let us take a look at a third sentence:

"Apple is my favorite company."

Now, this is likely to be more distant to sentence 1 than sentence 2 is. 

This is because there is not much of a semantic relationship between the words in the first and third sentences.

WMD computes the pairwise Euclidean distance between words across the sentences and it
defines the distance between two documents as the minimum cumulative cost in terms of
the Euclidean distance required to move all the words from the first sentence to the second
sentence.
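
The minimization can be made explicit on a toy example. The sketch below covers only a special case: two documents with the same number of distinct words and uniform word weights. In that case an optimal transport plan can be taken to be a one-to-one matching of words, so all matchings can simply be enumerated. The 2-dimensional embeddings `toy_emb` are hypothetical hand-made values, and `toy_wmd` is an illustrative helper, not the general algorithm.

```python
import math
from itertools import permutations

# Hypothetical 2-dimensional embeddings, chosen by hand for illustration.
toy_emb = {
    'obama': (1.0, 0.0), 'president': (0.9, 0.1),
    'speaks': (0.0, 1.0), 'greets': (0.1, 0.9),
    'apple': (-1.0, 0.0), 'company': (-0.9, -0.1),
}

def toy_wmd(doc_a, doc_b):
    """WMD for two documents with the same number of distinct words.

    Every word carries the weight 1/n, so the cheapest one-to-one
    matching of words gives the minimum cumulative travel cost.
    """
    if len(doc_a) != len(doc_b):
        raise ValueError("this sketch handles equal-length documents only")
    n = len(doc_a)
    best_total = min(
        sum(math.dist(toy_emb[a], toy_emb[b]) for a, b in zip(doc_a, perm))
        for perm in permutations(doc_b)
    )
    return best_total / n

wmd_ab = toy_wmd(['obama', 'speaks'], ['president', 'greets'])
wmd_ac = toy_wmd(['obama', 'speaks'], ['apple', 'company'])
print(wmd_ab, wmd_ac)
```

Documents of different lengths or with repeated words require a general optimal transport solver, which is why a library implementation is normally used.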

Let's see how we implement this using gensim.

We have four sentences. The first two are similar and the last two are also similar, while these two pairs are very different from each other.

In [None]:
sentence_1 = "Obama speaks to the media in Illinois"
sentence_2 = "President greets the press in Chicago"
sentence_3 = "Apple is my favorite company"
sentence_4 = "I like smartphones and laptops produced by Apple"

It is very important to remove stopwords from the compared sentences. Otherwise they will strongly influence the result: all sentences will be similar due to them.

In [None]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')

def preprocess(sentence):
    return [w for w in sentence.lower().split() if w not in stop]

sentence_1s = preprocess(sentence_1)
sentence_2s = preprocess(sentence_2)
sentence_3s = preprocess(sentence_3)
sentence_4s = preprocess(sentence_4)

Let us now compute the distances:

In [None]:
wmd_12 = model.wv.wmdistance(sentence_1s, sentence_2s)
print(wmd_12)

In [None]:
wmd_34 = model.wv.wmdistance(sentence_3s, sentence_4s)
print(wmd_34)

Within each pair the two sentences are indeed close to each other.

Let us compare pairs:

In [None]:
wmd_13 = model.wv.wmdistance(sentence_1s, sentence_3s)
wmd_14 = model.wv.wmdistance(sentence_1s, sentence_4s)
print(wmd_13, wmd_14)

In [None]:
wmd_23 = model.wv.wmdistance(sentence_2s, sentence_3s)
wmd_24 = model.wv.wmdistance(sentence_2s, sentence_4s)
print(wmd_23, wmd_24)

We observe that the distances between the pairs are higher than the distances within the pairs.

### Using Word2vec models

A Word2vec model can easily be built from plain text like the one below:

In [None]:
text = """He was professor at the Johannaeum, and was delivering a series of 
lectures on mineralogy, in the course of every one of which he broke 
into a passion once or twice at least. Not at all that he was over-anxious 
about the improvement of his class, or about the degree of attention 
with which they listened to him, or the success which might eventually 
crown his labours. Such little matters of detail never troubled him much. 
His teaching was as the German philosophy calls it, 'subjective'; 
it was to benefit himself, not others. He was a learned egotist. He was 
a well of science, and the pulleys worked uneasily when you wanted to draw 
anything out of it. In a word, he was a learned miser."""

Here we split the text into sentences using `nltk` and then tokenize each sentence with the help of
the `gensim` utility `simple_preprocess`. 

This utility converts a document into a list of lowercase tokens, ignoring tokens that are too short or too long.

Of course, the `nltk` tokenizer could also be used. 

In [None]:
import nltk
import gensim
corpus = []
for sentence in nltk.sent_tokenize(text):
    corpus.append(gensim.utils.simple_preprocess(sentence))
print(corpus)

Now the corpus is ready and we can train the model. Observe that since our corpus is small we use `min_count=1`. Otherwise most of the tokens would be ignored.
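
The effect of `min_count` on a small corpus can be seen with a simple frequency count. The sketch below is only an illustration of the filtering rule on a hypothetical three-sentence corpus; `kept_vocabulary` is a helper written for this example and is not part of `gensim`.

```python
from collections import Counter

# A tiny hypothetical corpus: a list of tokenized sentences.
toy_corpus = [
    ['he', 'was', 'a', 'learned', 'egotist'],
    ['he', 'was', 'a', 'well', 'of', 'science'],
    ['he', 'was', 'a', 'learned', 'miser'],
]

def kept_vocabulary(sentences, min_count):
    """Return the tokens that occur at least `min_count` times in the corpus."""
    freq = Counter(tok for sent in sentences for tok in sent)
    return sorted(tok for tok, c in freq.items() if c >= min_count)

vocab_1 = kept_vocabulary(toy_corpus, min_count=1)   # every token is kept
vocab_3 = kept_vocabulary(toy_corpus, min_count=3)   # only frequent tokens survive
print(len(vocab_1), vocab_3)
```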

In [None]:
from gensim.models.word2vec import Word2Vec
from multiprocessing import cpu_count

model = Word2Vec(corpus, workers=cpu_count(), min_count=1)

When we train a model from text, all Word2vec features are accessed via the attribute `.wv`.

For example, this is the array of all trained vectors (each one corresponds to a single token):

In [None]:
print(model.wv.vectors)
print(model.wv.vectors.shape)

We see that the model has kept 78 tokens and created a 100-dimensional vector for each.

This is an example of a vector:

In [None]:
print(model.wv['mineralogy'])

However, a model trained on such a small corpus will probably be useless. 

These are the most similar words for "mineralogy":

In [None]:
model.wv.most_similar('mineralogy')

A really huge corpus is required to obtain a good model.

Usually we do not need to train our own Word2vec model. A better solution is to use a pretrained model that was trained on a problem-specific corpus or at least on general-purpose texts.

`gensim` provides a collection of corpora and pretrained models.

Here are their lists:

In [None]:
import gensim.downloader as api
info = api.info()

print(info['corpora'].keys())

In [None]:
print(info['models'].keys())

Above we have considered how to use downloaded corpus to create a model. Now let us see how to download a pretrained model.

In [None]:
info['models']['glove-wiki-gigaword-50']

In [None]:
model = api.load("glove-wiki-gigaword-50")

Now we have a model. 

It cannot be trained further, and thus we have access to its word-vector facilities directly, without using the attribute `.wv`.

This model also knows the word "mineralogy", and it knows it better than our simple model above:

In [None]:
model.most_similar('mineralogy')

The standard large Word2vec model suitable for many purposes is "word2vec-google-news-300". 

This model is trained on part of the Google News dataset, covering approximately 3 million words and phrases. 

Such a model can take hours to train, but since it is already available, downloading and loading it with `gensim` takes minutes.

The model is approximately 2GB, so downloading it requires a decent network connection. Once downloaded, it is cached locally and no further downloads are required.

### Limitations of Word2vec


Word2vec is a great tool for capturing semantic information from text, and we have seen how well it captures information. 

However, the Word2vec model has some limitations.

Let's take the following two sentences:

"I am eating an apple."

"I am using an apple desktop."

"apple" in the first sentence signifies the fruit and, in the second sentence, it signifies the company. 

However, the word vector generated for apple would be the same for both the
company and the fruit. 

In other words, a static embedding is created for each word
after the training, so the Word2vec model cannot generate an embedding on the fly based on the context of a word's specific usage.

### Other embedding models

Previously we looked at how information related to the ordering of words, along with their semantics, can be taken into account when building embeddings to represent words. 

The idea of building embeddings can be extended. 

In the `Word2Vec` approach each word in the vocabulary had a vector representation. 

`Word2Vec` relies heavily on the vocabulary it has been trained to represent. It is unable to properly handle words not found in the vocabulary. 

The ideas of `Word2Vec` have been extended in the `fastText` model. 

Each word is represented as a combination of character n-grams. 

Each of these n-grams has a vector representation. 

Word representations are actually a result of the summation of their character n-grams.

When certain words are missing from the training vocabulary we can still have a representation for them if their n-grams are present as part of other words.
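
The decomposition into character n-grams can be sketched as follows. The boundary markers `<` and `>` and the n-gram lengths from 3 to 6 are assumptions of this illustration, and `char_ngrams` is a hypothetical helper. A real implementation also stores a vector per n-gram and sums them, which is not shown here.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

trigrams = char_ngrams("where", n_min=3, n_max=3)
print(trigrams)

# A word missing from the vocabulary still shares n-grams with known words.
shared = set(char_ngrams("mineralogy")) & set(char_ngrams("mineralogist"))
print(len(shared) > 0)
```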

Often instead of vectors for individual words we need vectors for whole sentences. 

The trivial solution is to average all word vectors included in a sentence.
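
A minimal sketch of this averaging is given below. The 3-dimensional vectors in `toy_wv` are hypothetical, and skipping tokens that have no vector is a choice made for this example only.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors.
toy_wv = {
    'he':      np.array([0.2, 0.0, 0.1]),
    'likes':   np.array([0.0, 0.6, 0.3]),
    'snowing': np.array([0.7, 0.3, 0.2]),
}

def sentence_vector(tokens, vectors):
    """Average the vectors of the tokens that are present in `vectors`."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no token of the sentence has a vector")
    return np.mean(known, axis=0)

# 'very' has no vector and is skipped.
sent_vec = sentence_vector(['he', 'likes', 'snowing', 'very'], toy_wv)
print(sent_vec)
```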

But there are techniques that build embeddings for documents and sentences directly.

An algorithm called `Doc2Vec` provides sentence- or document-level contextual embeddings. 

Another technique `Sent2Vec` is focused on obtaining embeddings for sentences based on word n-grams. 

Research has shown that `Sent2Vec` outperforms `Doc2Vec` in the majority of
the tasks it undertakes and that it is a better representation method for sentences or
documents. 

One more approach is called the Universal Sentence Encoder (`USE`). This is a model for fetching embeddings at the sentence level. 

Several models built using USE are reported to have outperformed previous state-of-the-art results. 

### Exercises

4\. Describe in writing the key differences between BoW and TF-IDF models of text.

5\. Describe in writing the idea of word embedding. What are its advantages in comparison with other vectorization techniques?

6\. Come up with two sentences with high cosine similarity and two whose similarity is exactly zero. Compute these similarities using the code that has been used above. 

7\. Compute word mover's distances for the sentences from the previous exercise. Use the Word2vec model trained on the `text8` corpus or download the pretrained model `glove-wiki-gigaword-50`. Compare the distances with cosine similarity. Which method produces more reasonable results?

8\. Below you will find a piece of text. Split it into sentences and create a BoW model. Above we used stemming for the analogous model. Use lemmatization instead. Do not forget that lemmatization may require whole sentences to identify parts of speech. This means that stopword removal must be done after lemmatization.

In [None]:
text = """The skull and the upper bones lay beside it in the thick dust, 
and in one place, where rain-water had dropped through a leak in the 
roof, the thing itself had been worn away. Further in the gallery was 
the huge skeleton barrel of a Brontosaurus. My museum hypothesis was 
confirmed. Going towards the side I found what appeared to be sloping 
shelves, and clearing away the thick dust, I found the old familiar 
glass cases of our own time. But they must have been air-tight to 
judge from the fair preservation of some of their contents."""

9\. Below you will find a list of tweets. Create a TF-IDF model for them. For tokenization use `TweetTokenizer` provided by NLTK. Using cosine similarity, find the two most similar tweets. 

In [None]:
tweets = [
"@Tatiana_K nope they didn't have it ",
"@twittera que me muera ? ",
"spring break in plain city... it's snowing ))) ",
"I just re-pierced my ears ",
"@caregiving I couldn't bear to watch it.  And I thought the UA losssssss was embarrassing . . . . .",
"@octolinz16 It it counts, idk why I did either. you never talk to me anymore ",
"@smarrison i would've been the first, but i didn't have a gun.    not really though, zac snyder's just a doucheclown.",
"@iamjazzyfizzle I wish I got to watch it with you!! I miss you and @iamlilnicki  how was the premiere?!",
"Hollis' death scene will hurt me severely to watch on film  wry is directors cut not out now?",
"about to file taxes ",
"@LettyA ahh ive always wanted to see rent  love the soundtrack!!",
"@FakerPattyPattz Oh dear. Were you drinking out of the forgotten table drinks? ",
"@alydesigns i was out most of the day so didn't get much done ;) ",
"one of my friend called me, and asked to meet with her at Mid Valley today...but i've no time *sigh* "]