In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
from sklearn import model_selection
from sklearn import feature_extraction
from sklearn import linear_model
from sklearn import svm
from sklearn import metrics

In [3]:
np.random.seed(7)

# NLTK library

Various Python libraries are available for natural language processing. Some of the most popular are NLTK, spaCy and fastText.

[NLTK](https://www.nltk.org/) (Natural Language Toolkit) is the library with the longest history and the one that offers the most functionality for working with text. It is used by linguists, researchers in the digital humanities, and researchers in the field of natural language processing. The library supports a large number of languages. The examples below show some of its most important and commonly used functions.

In [4]:
import nltk

The text we will use for the demonstration is taken from a Wikipedia article about Nikola Tesla in English.

In [5]:
text = "Born and raised in the Austrian Empire, Tesla studied engineering and physics in the 1870s without receiving a degree, gaining practical experience in the early 1880s working in telephony and at Continental Edison in the new electric power industry. In 1884 he emigrated to the United States, here he became a naturalized citizen. He worked for a short time at the Edison Machine Works in New York City before he struck out on his own. With the help of partners to finance and market his ideas, Tesla set up laboratories and companies in New York to develop a range of electrical and mechanical devices. His alternating current (AC) induction motor and related polyphase AC patents, licensed by Westinghouse Electric in 1888, earned him a considerable amount of money and became the cornerstone of the polyphase system which that company eventually marketed." 

In [6]:
text

'Born and raised in the Austrian Empire, Tesla studied engineering and physics in the 1870s without receiving a degree, gaining practical experience in the early 1880s working in telephony and at Continental Edison in the new electric power industry. In 1884 he emigrated to the United States, here he became a naturalized citizen. He worked for a short time at the Edison Machine Works in New York City before he struck out on his own. With the help of partners to finance and market his ideas, Tesla set up laboratories and companies in New York to develop a range of electrical and mechanical devices. His alternating current (AC) induction motor and related polyphase AC patents, licensed by Westinghouse Electric in 1888, earned him a considerable amount of money and became the cornerstone of the polyphase system which that company eventually marketed.'

## Splitting text into sentences

Sentence tokenization is provided by the <code>sent_tokenize()</code> function from the <code>tokenize</code> module. Punctuation marks such as periods, question marks and exclamation marks serve as separators. The function can recognize when these characters occur in other contexts and treat them accordingly. For example, a period that is part of the abbreviation U.S.A. or of the date 20.02.2020. should not cause the sentence to be split in the wrong place.

<div><br><span style="font-size:25px">&#9888;</span> $\hspace{0.1cm}$ The <code>sent_tokenize()</code> function internally uses the pre-trained <code>PunktSentenceTokenizer</code> model to detect sentence boundaries. The model is stored in the NLTK data module <code>punkt</code>, which must be downloaded before the function can be used.</div>
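To see why a trained model is useful here, the sketch below applies a naive rule instead: split wherever whitespace follows a period, question mark or exclamation mark. The sample text and the regular expression are illustrative assumptions and are not part of NLTK.

```python
import re

# Hypothetical sample text: the abbreviation contains periods that do not end a sentence
sample = "Tesla moved to the U.S.A. in 1884. He worked in New York."

# Naive rule: split at whitespace that directly follows '.', '!' or '?'
naive_sentences = re.split(r"(?<=[.!?])\s+", sample)

print(len(naive_sentences))
for sentence in naive_sentences:
    print(repr(sentence))
```

The naive rule cuts the text after the abbreviation and produces three fragments instead of two sentences, which is the kind of mistake a sentence tokenizer is designed to avoid.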

In [7]:
from nltk import tokenize

In [8]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/nevena/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [9]:
sentences = tokenize.sent_tokenize(text)

In [10]:
sentences

['Born and raised in the Austrian Empire, Tesla studied engineering and physics in the 1870s without receiving a degree, gaining practical experience in the early 1880s working in telephony and at Continental Edison in the new electric power industry.',
 'In 1884 he emigrated to the United States, here he became a naturalized citizen.',
 'He worked for a short time at the Edison Machine Works in New York City before he struck out on his own.',
 'With the help of partners to finance and market his ideas, Tesla set up laboratories and companies in New York to develop a range of electrical and mechanical devices.',
 'His alternating current (AC) induction motor and related polyphase AC patents, licensed by Westinghouse Electric in 1888, earned him a considerable amount of money and became the cornerstone of the polyphase system which that company eventually marketed.']

## Splitting sentences into words

Sentences can be further divided into words (tokens). Word tokenization is provided by the <code>word_tokenize()</code> function, also from the <code>tokenize</code> module.

In [11]:
tokens = nltk.tokenize.word_tokenize(sentences[0])

In [12]:
tokens

['Born',
 'and',
 'raised',
 'in',
 'the',
 'Austrian',
 'Empire',
 ',',
 'Tesla',
 'studied',
 'engineering',
 'and',
 'physics',
 'in',
 'the',
 '1870s',
 'without',
 'receiving',
 'a',
 'degree',
 ',',
 'gaining',
 'practical',
 'experience',
 'in',
 'the',
 'early',
 '1880s',
 'working',
 'in',
 'telephony',
 'and',
 'at',
 'Continental',
 'Edison',
 'in',
 'the',
 'new',
 'electric',
 'power',
 'industry',
 '.']

In [13]:
len(tokens)

42
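For comparison, here is a minimal sketch of two simpler ways to tokenize, using only the standard library. The shortened sentence and the regular expression are assumptions made for illustration; this is not how <code>word_tokenize()</code> works internally.

```python
import re

# Shortened version of the first sentence, used only for illustration
sentence = ("Tesla studied engineering and physics in the 1870s "
            "without receiving a degree, gaining practical experience.")

# Splitting on whitespace leaves punctuation attached to the neighbouring word
whitespace_tokens = sentence.split()

# A simple pattern: runs of word characters, or single non-space symbols
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(whitespace_tokens)
print(regex_tokens)
```

Whitespace splitting yields tokens such as 'degree,' with the comma attached, while the regular expression separates the comma and the final period into tokens of their own. A pattern this simple would also break contractions such as "don't" into three pieces, which is one reason to prefer a dedicated tokenizer.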

## Token filtering

Some of the extracted tokens should clearly be filtered out. To start with, punctuation marks can be excluded. The <code>string</code> library defines the constant <code>punctuation</code>, which contains all ASCII punctuation characters and can be used to identify tokens that represent punctuation marks.

In [14]:
import string

In [15]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [16]:
tokens_without_punctuation = [token for token in tokens if token not in string.punctuation]

In [17]:
tokens_without_punctuation

['Born',
 'and',
 'raised',
 'in',
 'the',
 'Austrian',
 'Empire',
 'Tesla',
 'studied',
 'engineering',
 'and',
 'physics',
 'in',
 'the',
 '1870s',
 'without',
 'receiving',
 'a',
 'degree',
 'gaining',
 'practical',
 'experience',
 'in',
 'the',
 'early',
 '1880s',
 'working',
 'in',
 'telephony',
 'and',
 'at',
 'Continental',
 'Edison',
 'in',
 'the',
 'new',
 'electric',
 'power',
 'industry']

In [18]:
len(tokens_without_punctuation)

39
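Because <code>string.punctuation</code> is a string, the expression <code>token not in string.punctuation</code> is a substring test. It works for single-character tokens, but multi-character punctuation tokens (for example an ellipsis, or the paired quote tokens that some tokenizers emit) can slip through. The sketch below shows a character-wise check instead; the helper name <code>is_punctuation</code> and the example tokens are hypothetical.

```python
import string

PUNCTUATION = set(string.punctuation)

def is_punctuation(token):
    """Return True if the token is non-empty and made up only of punctuation characters."""
    return bool(token) and all(char in PUNCTUATION for char in token)

# Hypothetical tokens, including multi-character punctuation
example_tokens = ['He', 'said', '``', 'hello', "''", '...', '--', 'and', 'left', '.']

# Substring test, as used in the cell above
substring_filtered = [t for t in example_tokens if t not in string.punctuation]

# Character-wise test
charwise_filtered = [t for t in example_tokens if not is_punctuation(t)]

print(substring_filtered)
print(charwise_filtered)
```

For the sentence used in this notebook both approaches give the same 39 tokens, since it only contains single-character punctuation.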

Further, words that appear very often, and therefore have little discriminative value, can be excluded from the vocabulary. These words are called **stop words**. In English, for example, they include the articles 'a', 'an' and 'the', and prepositions such as 'in', 'at' and 'on'. The filtering criteria depend primarily on the specific problem being solved and cannot easily be generalized. A common strategy for building a stop list is to sort the words by frequency and then to take the most frequent terms as the stop list.
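The frequency-based strategy can be sketched with the standard library alone. The toy corpus and the cut-off <code>top_k</code> are assumptions chosen for illustration; in practice the counts would come from a much larger collection and the cut-off would be tuned to the task.

```python
from collections import Counter

# Toy corpus, already lowercased and split into tokens
corpus_tokens = "the cat sat on the mat and the dog sat on the rug".split()

# Count how often each token occurs
frequencies = Counter(corpus_tokens)

# Take the top_k most frequent tokens as the stop list (top_k is an arbitrary choice here)
top_k = 3
stop_list = [word for word, _ in frequencies.most_common(top_k)]

print(frequencies.most_common(top_k))
print(stop_list)
```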

Common stop words for different languages are available in the NLTK library through the <code>stopwords</code> submodule of the <code>corpus</code> module. The list of stop words for a specific language can be obtained with the <code>words()</code> function.

<div><br><span style="font-size:25px">&#9888;</span> $\hspace{0.1cm}$ To use the <code>words()</code> function, the stop word corpora must first be downloaded from the NLTK data module <code>stopwords</code>!</div>

In [19]:
from nltk.corpus import stopwords

In [20]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/nevena/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [21]:
stopwords_list = stopwords.words('english')

In [22]:
stopwords_list

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [23]:
tokens_without_stopwords = [token for token in tokens_without_punctuation if token not in stopwords_list]

In [24]:
tokens_without_stopwords

['Born',
 'raised',
 'Austrian',
 'Empire',
 'Tesla',
 'studied',
 'engineering',
 'physics',
 '1870s',
 'without',
 'receiving',
 'degree',
 'gaining',
 'practical',
 'experience',
 'early',
 '1880s',
 'working',
 'telephony',
 'Continental',
 'Edison',
 'new',
 'electric',
 'power',
 'industry']

In [25]:
len(tokens_without_stopwords)

25
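The stop words in the list are all lowercase, so the membership test above is case-sensitive: a capitalized stop word at the start of a sentence is not removed. A minimal sketch of a case-insensitive variant follows; the small stop set is a hand-picked subset of the list shown above, used so that the example runs without downloading any data.

```python
# Hand-picked subset of the English stop list, for illustration only
small_stop_set = {'in', 'he', 'to', 'the'}

# Tokens from the beginning of the second sentence of the text
example_tokens = ['In', '1884', 'he', 'emigrated', 'to', 'the', 'United', 'States']

# Case-sensitive test: the capitalized 'In' survives
case_sensitive = [t for t in example_tokens if t not in small_stop_set]

# Case-insensitive test: compare the lowercased token, keep the original spelling
case_insensitive = [t for t in example_tokens if t.lower() not in small_stop_set]

print(case_sensitive)
print(case_insensitive)
```

Using a set rather than a list also makes each membership test faster, which matters for longer texts.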

## Lemmatization and stemming

After filtering, the tokens can be processed further. For instance, the tokens 'works', 'worked' and 'working' could be reduced to the same form 'work' (the infinitive of the verb), and the tokens 'ideas' and 'devices' to 'idea' and 'device' (singular nouns). Depending on whether this reduction follows linguistic rules or heuristic rules, it is called **lemmatization** or **stemming**, respectively.

### Lemmatization

Lemmatization assigns a so-called **lemma** (grammatical root, i.e. the dictionary form of the word) to each token. For example, verbs are assigned their infinitive form, and nouns their singular form.

The NLTK library supports lemmatization through the <code>WordNetLemmatizer</code> class, which relies on the manually created lexical database **WordNet**, in which nouns, verbs, adjectives and adverbs are grouped into sets of synonyms (synsets).

In [26]:
lemmatizer = nltk.stem.WordNetLemmatizer()

Lemmatization is performed with the <code>lemmatize()</code> method. Along with the token, the method accepts a so-called POS tag (part-of-speech tag), which specifies the word class and serves as an approximation of the token's context. If no tag is given, the token is treated as a noun. Possible values for the <code>pos</code> parameter are:
- <code>'n'</code> for nouns
- <code>'v'</code> for verbs
- <code>'a'</code> for adjectives
- <code>'r'</code> for adverbs
- <code>'s'</code> for so-called satellite adjectives: adjectives whose meaning depends on the noun they modify. For example, in 'atomic bomb' the adjective 'atomic' means that the bomb is based on the release of energy from the fission of atoms, while in 'atomic adjustments' it means that the change is extremely small, practically immeasurable.

<div><br><span style="font-size:25px">&#9888;</span> $\hspace{0.1cm}$ To use the <code>lemmatize()</code> method, the WordNet database must first be downloaded from the NLTK data module <code>wordnet</code>!</div>

In [27]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /home/nevena/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [28]:
lemmatizer.lemmatize('working', pos='v')

'work'

In [29]:
lemmatizer.lemmatize('working', pos='a')       #e.g. 'working class'

'working'

Note that the assigned lemma depends on whether the same token is treated as a verb or as an adjective.

In [30]:
lemmatizer.lemmatize('ideas', pos='n')

'idea'

In [31]:
lemmatizer.lemmatize('atomic', pos='s')

'atomic'

In [32]:
lemmatizer.lemmatize('arid', pos='s')          #I: lacking water, dry, II: lacking vitality or spirit, lifeless

'arid'

Satellite adjectives are usually left unchanged, since their meaning depends on the context.

Automatic assignment of POS tags to tokens can be performed using so-called **POS taggers**. These are usually sequence labeling classifiers trained on a large collection of manually annotated texts.

Automatic assignment of POS tags is provided by the <code>pos_tag()</code> function. By default, it uses the Penn Treebank tagset, in which, for example, the tag for proper nouns is 'NNP', for nouns 'NN', for adjectives 'JJ' and for prepositions 'IN'. The complete Penn Treebank tagset can be found [here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).

<div><br><span style="font-size:25px">&#9888;</span> $\hspace{0.1cm}$ By default, the <code>pos_tag()</code> function
uses the <code>AveragedPerceptron</code> POS tagger (NLTK version 3.6.1), which is stored in the NLTK data module <code>averaged_perceptron_tagger</code> and must be downloaded before the function can be used.</div>

In [33]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/nevena/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [34]:
nltk.pos_tag(tokens_without_stopwords)

[('Born', 'NNP'),
 ('raised', 'VBD'),
 ('Austrian', 'JJ'),
 ('Empire', 'NNP'),
 ('Tesla', 'NNP'),
 ('studied', 'VBD'),
 ('engineering', 'NN'),
 ('physics', 'NNS'),
 ('1870s', 'CD'),
 ('without', 'IN'),
 ('receiving', 'VBG'),
 ('degree', 'NN'),
 ('gaining', 'VBG'),
 ('practical', 'JJ'),
 ('experience', 'NN'),
 ('early', 'RB'),
 ('1880s', 'CD'),
 ('working', 'VBG'),
 ('telephony', 'NN'),
 ('Continental', 'NNP'),
 ('Edison', 'NNP'),
 ('new', 'JJ'),
 ('electric', 'JJ'),
 ('power', 'NN'),
 ('industry', 'NN')]

To use a POS tagger for lemmatization, the tagger's tags must be mapped to the tags that the lemmatizer understands. The WordNet lemmatizer uses its own, smaller POS tagset: <code>'n'</code>, <code>'v'</code>, <code>'a'</code> and <code>'r'</code> correspond, in that order, to the Penn Treebank tag lists \['NN', 'NNS', 'NNP', 'NNPS'\], \['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'\], \['JJ', 'JJR', 'JJS'\] and \['RB', 'RBR', 'RBS'\]. Satellite adjectives are treated the same as other adjectives, since the Penn Treebank tagset has no special tag for them.

In [35]:
def get_wordnet_pos_tag(token):
    # Map the first character of the Penn Treebank POS tag to the corresponding WordNet POS tag
    pos_tag_dict = {
        'N': 'n',
        'V': 'v',
        'J': 'a',
        'R': 'r'
    }

    # The token is tagged on its own, so the tagger sees no surrounding context
    penn_pos_tag = nltk.pos_tag([token])[0][1][0]

    # Fall back to the noun tag when there is no corresponding WordNet POS tag
    return pos_tag_dict.get(penn_pos_tag, 'n')

In [36]:
lemmatized_tokens = []
for token in tokens_without_stopwords:
    pos_tag = get_wordnet_pos_tag(token)
    lemmatized_tokens.append(lemmatizer.lemmatize(token, pos_tag))

In [37]:
lemmatized_tokens

['Born',
 'raise',
 'Austrian',
 'Empire',
 'Tesla',
 'study',
 'engineering',
 'physic',
 '1870s',
 'without',
 'receive',
 'degree',
 'gain',
 'practical',
 'experience',
 'early',
 '1880s',
 'work',
 'telephony',
 'Continental',
 'Edison',
 'new',
 'electric',
 'power',
 'industry']
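The helper above calls the tagger once per token, so each token is tagged without its neighbours. An alternative is to tag the whole token sequence in one call and then map each resulting tag. The sketch below shows only the mapping step on a few (token, tag) pairs copied from the <code>pos_tag()</code> output shown earlier; the function name <code>penn_to_wordnet</code> is hypothetical.

```python
def penn_to_wordnet(penn_tag):
    """Map a Penn Treebank tag to a WordNet POS tag, defaulting to noun."""
    mapping = {'N': 'n', 'V': 'v', 'J': 'a', 'R': 'r'}
    # Only the first character of the Penn Treebank tag decides the word class
    return mapping.get(penn_tag[:1], 'n')

# A few (token, tag) pairs copied from the pos_tag() output shown earlier
tagged_tokens = [('raised', 'VBD'), ('Austrian', 'JJ'), ('physics', 'NNS'),
                 ('1870s', 'CD'), ('early', 'RB')]

# Pair every token with the WordNet tag that the lemmatizer expects
wordnet_tagged = [(token, penn_to_wordnet(tag)) for token, tag in tagged_tokens]

print(wordnet_tagged)
```

With NLTK available, the pairs could come from a single <code>nltk.pos_tag(tokens)</code> call, and each pair could then be passed to <code>lemmatizer.lemmatize(token, penn_to_wordnet(tag))</code>. Tags assigned in context may differ from those assigned to isolated tokens, so the lemmas are not guaranteed to match the output above.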

### Stemming

Stemming assigns a so-called **stem** (an artificial root, the part of a word responsible for its lexical meaning) to each token. For example, some common stemming rules are truncating the suffix 'ed' (e.g. 'played' $\rightarrow$ 'play'), replacing the suffix 'ational' with the suffix 'ate' (e.g. 'relational' $\rightarrow$ 'relate') or replacing the suffix 'ization' with the suffix 'ize' (e.g. 'organization' $\rightarrow$ 'organize'). The stem does not need to be identical to the grammatical root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root.
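The three rules just mentioned can be sketched as a toy suffix-rewriting function. This is only an illustration of the idea, not Porter's algorithm: real stemmers apply many more rules in several passes and put conditions on the part of the word that remains. The names <code>SUFFIX_RULES</code> and <code>toy_stem</code> are hypothetical.

```python
# (suffix, replacement) pairs, tried in order; longer suffixes come first
SUFFIX_RULES = [
    ('ational', 'ate'),
    ('ization', 'ize'),
    ('ed', ''),
]

def toy_stem(word):
    """Apply the first matching suffix rule, or return the word unchanged."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            return word[:len(word) - len(suffix)] + replacement
    return word

for word in ['played', 'relational', 'organization', 'power']:
    print(word, '->', toy_stem(word))
```

Without extra conditions such rules overreach easily: the toy function would also turn 'red' into 'r', which is why practical stemmers check the length and shape of the remaining stem before applying a rule.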

The NLTK library supports stemming through the <code>PorterStemmer</code> class, which uses Porter's stemming algorithm, one of the best-known stemming algorithms for English. Other available stemmers include <code>SnowballStemmer</code>, <code>LancasterStemmer</code> and <code>RegexpStemmer</code>.

In [38]:
stemmer = nltk.stem.PorterStemmer()

In [39]:
stemmer.stem('working')

'work'

In [40]:
stemmer.stem('ideas')

'idea'

In [41]:
stemmer.stem('atomic')

'atom'

In [42]:
stemmer.stem('arid')

'arid'
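The other stemmers can be tried on the same words. The sketch below assumes that NLTK is installed; the stemmers used here are rule-based and should not need any additional NLTK data. The outputs of the Snowball and Lancaster stemmers are not listed, since they may differ from Porter's.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# Three of the stemmers offered by NLTK; SnowballStemmer needs a language name
stemmers = {
    'porter': PorterStemmer(),
    'snowball': SnowballStemmer('english'),
    'lancaster': LancasterStemmer(),
}

words = ['working', 'ideas', 'atomic', 'arid']

# Stem every word with every stemmer so the results can be compared side by side
stems = {name: [stemmer.stem(word) for word in words]
         for name, stemmer in stemmers.items()}

for name, result in stems.items():
    print(name, result)
```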

Whether to use lemmatization or stemming depends on the specific task and on the available resources. Lemmatization requires a lexical database similar to WordNet and can significantly slow down the program, while stemming is faster but gives less accurate results.

Lemmatization and stemming are text normalization techniques (sometimes called word normalization) that are used to prepare text for further processing. Depending on the type of textual content, other normalization techniques may be needed as well. For instance, language on social networks is often very specific, rich in abbreviations and words that deviate from standard spelling ('u2' is short for 'you too', 'tmrw' for 'tomorrow', and 'cooool' for 'cool').
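A minimal sketch of such normalization for the three examples above combines a lookup table with a rule that shortens stretched-out letters. Both the table <code>ABBREVIATIONS</code> and the function <code>normalize_token</code> are hypothetical, and the heuristic of collapsing repeated characters to two is an assumption that will not suit every word.

```python
import re

# Hypothetical lookup table; a real one would be much larger
ABBREVIATIONS = {'u2': 'you too', 'tmrw': 'tomorrow'}

def normalize_token(token):
    """Expand known abbreviations and shorten runs of repeated characters."""
    lowered = token.lower()
    if lowered in ABBREVIATIONS:
        return ABBREVIATIONS[lowered]
    # Collapse any character repeated three or more times down to two
    return re.sub(r'(.)\1{2,}', r'\1\1', lowered)

for token in ['u2', 'tmrw', 'cooool']:
    print(token, '->', normalize_token(token))
```

Note that the function lowercases its input and that collapsing to two characters is only a heuristic: 'sooo' would become 'soo' rather than 'so', so a dictionary check would be needed for reliable results.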

The techniques described above preprocess the text and thereby implicitly build the vocabulary. Based on this vocabulary, we then define the text representation that machine learning algorithms will use.