# Natural Language Processing

*Natural Language Processing* (NLP) is an interdisciplinary area of research bringing together insights from fields such as artificial intelligence, computational linguistics, statistics and computer science. The ultimate aim of NLP is to enable computers to understand and to process the natural languages that are spoken and written by human beings. Researchers in the field of NLP have developed sophisticated tools and algorithms for the recognition of grammatical and syntactic categories, or for the conversion of inflected word forms into their dictionary forms, among many other purposes. More advanced current research focuses on the development of software for tasks such as named entity recognition, machine translation, or unsupervised summarisation. Tasks such as these all demand an understanding not only of the grammar and the syntax, but also of the logical structure and the semantic contents of the text.


## NLTK

One of the ways in which you can analyse natural languages in Python is by making use of `nltk`, the *Natural Language Toolkit*. This library can be imported as follows:


In [None]:
import nltk

Having imported nltk, you can also import specific methods from this library. 

The methods that are discussed in this tutorial make use of a number of additional resources which are not installed by default. If you have never used `nltk` before, you need to run the code below to install the relevant components.

In [None]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('tagsets')
nltk.download('wordnet')

## Tokenisation

As was explained earlier, tokenisation is a process in which a full linear text is broken down into smaller linguistic units, such as words, sentences or paragraphs. In `nltk`, you can use the method `word_tokenize()` to tokenise a text into words. The method `sent_tokenize()` can divide a text into sentences. The cell below contains an illustration of how these methods can be used.

In [None]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

quote = '''
In the late summer of that year we lived in a house in a village that looked across the river and the plain to the mountains. In the bed of the river there were pebbles and boulders, dry and white in the sun, and the water was clear and swiftly moving and blue in the channels. Troops went by the house and down the road and the dust they raised powdered the leaves of the trees. The trunks of the trees too were dusty and the leaves fell early that year and we saw the troops marching along the road and the dust rising and leaves, stirred by the breeze, falling and the soldiers marching and afterward the road bare and white except for the leaves.
'''

words = word_tokenize(quote)
sentences = sent_tokenize(quote)

print( f'The fragment contains { len(words) } words, and { len(sentences) } sentences.\n' )


for w in words:
    print(w)


for s in sentences:
    print(s)
  


The method `word_tokenize()` divides the string into words, largely on the basis of spaces and punctuation. When you simply count the number of tokens found using this method, the results may sometimes be misleading. Punctuation marks are viewed as separate tokens as well and, if you don't take any measures to exclude them, they are included in the counts. Most of the following examples actually use the method `word_tokenise()` from the `tdm` module instead.
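
The effect of punctuation on token counts can be illustrated without any external library. The sketch below is not the actual implementation of `tdm.word_tokenise()`, which is not shown in this tutorial. The functions `tokenise_with_punctuation()` and `tokenise_words()` are hypothetical helpers that only approximate the two behaviours with a regular expression.

```python
import re

def tokenise_with_punctuation(text):
    # Hypothetical helper: words and individual punctuation marks both count as tokens
    return re.findall(r"\w+|[^\w\s]", text)

def tokenise_words(text):
    # Hypothetical helper: keep only runs of word characters, dropping punctuation
    return re.findall(r"\w+", text)

sample = "Troops went by the house, and down the road."
all_tokens = tokenise_with_punctuation(sample)
word_tokens = tokenise_words(sample)

print(f'With punctuation: {len(all_tokens)} tokens')
print(f'Words only: {len(word_tokens)} tokens')
```

Note that such a simple pattern splits contractions such as *don't* into two tokens, so it should be seen as an illustration of the counting problem rather than as a full tokeniser.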


## Part of speech tagging

Part of speech (POS) taggers are applications which can produce data about the syntactic categories of words. Once you have imported the nltk library, you can generate such POS tags by making use of the `pos_tag()` method. This method demands a list of words as a parameter. 

`pos_tag()` is typically used in combination with a word tokenisation method. The output of this latter function can then be used as input to the `pos_tag()` method.

In [None]:
import nltk
import tdm

quote = '''
The studio was filled with the rich odour of roses, and when the light summer wind stirred amidst the trees of the garden, there came through the open door the heavy scent of the lilac, or the more delicate perfume of the pink-flowering thorn.
'''

words = tdm.word_tokenise(quote)
pos = nltk.pos_tag(words)

for p in pos:
    print(p[0] + ' => ' + p[1] )
  


The `pos_tag()` method returns a list in which each item is a composite variable with two values. More specifically, each item is a data structure that is called a *tuple*. The first value is the word that was tagged and the second value is the POS tag that was assigned to this word. You can access these values individually using square brackets. 
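
As a minimal sketch of how such tuples can be handled, the example below uses a short hand-typed list named `tagged` as a stand-in for the output of `pos_tag()`. The tags were typed in for illustration and were not produced by the tagger.

```python
# Hand-written stand-in for the kind of list that pos_tag() returns
tagged = [('The', 'DT'), ('studio', 'NN'), ('was', 'VBD'), ('filled', 'VBN')]

# Square brackets give access to the individual values of a tuple
first = tagged[0]
print(first[0])   # the word
print(first[1])   # the POS tag

# Tuple unpacking gives both values a name in a single step
for word, tag in tagged:
    print(f'{word} => {tag}')

# Select the words whose tag starts with 'NN' (the noun tags)
nouns = [word for word, tag in tagged if tag.startswith('NN')]
print(nouns)
```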

The meaning of all of the POS tags can be displayed by calling the `nltk.help.upenn_tagset()` method.


In [None]:
# upenn_tagset() prints the tag descriptions itself, so print() is not needed
nltk.help.upenn_tagset()

The meaning of these POS codes can also be [found online](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).

## Lemmatisation 

English verbs can be used in the past tense, in the present tense, in the continuous form or in the perfect form, and these different forms can make it more difficult to search systematically for occurrences of a specific verb. In this context, lemmatisation can offer a solution.

Lemmatisation is a process in which the inflected forms of the words that are found in a text are converted into their base dictionary form. This base form is referred to as the lemma. 
In many cases, the manner in which words are to be lemmatised depends on their contexts. Certain homonyms may either be verbs or nouns, for instance, and, depending on their usage, they should be lemmatised to different forms. 

The `lemmatize()` method of the `WordNetLemmatizer` class in the `nltk` library can be used in combination with POS tags. The first parameter of this method needs to be the word to be lemmatised. To improve the results, you can optionally supply the POS tag as a second parameter. Unlike `pos_tag()`, however, the `lemmatize()` method does not make use of the Penn Treebank codes but of the POS codes that have been defined for `wordnet`. 

The code below first tokenises the words in a given sentence using `word_tokenise`. Next, the code generates the POS codes using `pos_tag`. The Penn Treebank codes produced by this method are converted to `wordnet` codes using a new function named `ptb_to_wordnet()`.


In [None]:
import nltk
from nltk.stem import WordNetLemmatizer
from tdm import word_tokenise
import re


def ptb_to_wordnet(PTT):

    if PTT.startswith('J'):
        ## Adjective
        return 'a'
    elif PTT.startswith('V'):
        ## Verb
        return 'v'
    elif PTT.startswith('N'):
        ## Noun
        return 'n'
    elif PTT.startswith('R'):
        ## Adverb
        return 'r'
    else:
        return ''


lemmatiser = WordNetLemmatizer()

quote = "It was the best of times, it was the worst of times"
words = word_tokenise(quote)
pos = nltk.pos_tag(words)

for i in range( 0 , len(words) ):
    word_found = words[i]
    posTag = ptb_to_wordnet( pos[i][1] )
    
    if re.search( r'\w+' , posTag , re.IGNORECASE ):
        lemma = lemmatiser.lemmatize( words[i] , posTag )
        print( f'{word_found} => {lemma}' )
    else:
        lemma = lemmatiser.lemmatize( words[i] )
        print( f'{word_found} => {lemma}' )
        


# Textual analysis using nltk

As explained, the `sent_tokenize()` method from the `nltk` library can be used to divide a text into its separate sentences. If we divide the total number of words by the total number of sentences, we get a number which indicates the average length of the sentences, or the average number of words per sentence. 

In [None]:
from os.path import join
from nltk.tokenize import sent_tokenize
from tdm import word_tokenise

dir = 'Corpus'
file = 'HeartofDarkness.txt'
path = join( dir , file )


# The with statement closes the file automatically after reading
with open( path , encoding = 'utf-8' ) as file_handler:
    full_text = file_handler.read()
s = sent_tokenize(full_text)
w = word_tokenise(full_text)
words_per_sentence = len(w) / len(s)

print( f'{file} contains { len(s) } sentences and { len(w) } words.' )
print( f'The sentences contain {words_per_sentence} words on average.' )


Using the POS tagger in `nltk`, you can also develop a specific type of frequency list: one that only includes the words in a specific grammatical category. 

The code below tries to find all the adjectives that are used in a given text. In the Penn Treebank tag set, adjectives can be tagged as 'JJ' (regular adjectives), 'JJR' (adjectives in the comparative form) or 'JJS' (superlatives). 

The code creates a list named `tags` which specifies the categories of words that need to be counted. In the lines that follow, the text is processed sentence by sentence, using the `sent_tokenize()` method. All the words which, according to the POS tagger, belong to one of the categories that have been listed in `tags` are counted in a dictionary named `freq`. By the end of this code, `freq` contains all the adjectives that were found in this way, together with their frequencies. 

Using a similar approach, you could make a list of all the adverbs (labelled 'RB', 'RBR' or 'RBS') in the text, or all the nouns in the text (labelled 'NN', 'NNS', 'NNP' or 'NNPS'). To do this, you obviously need to change the tags listed in the list `tags`.

In [None]:
import re
from os.path import join
import nltk
from nltk.tokenize import sent_tokenize
from tdm import word_tokenise

## tags used for adjectives
tags = [ 'JJ' , 'JJR' , 'JJS' ]

freq = dict()

dir = 'Corpus'
file = 'HeartofDarkness.txt'
path = join( dir , file )


with open( path , encoding = 'utf-8' ) as file_handler:
    full_text = file_handler.read()

print('Analysing POS tags ... \n')

sentences = sent_tokenize(full_text)
for s in sentences:
    words = word_tokenise(s.lower() )
    pos = nltk.pos_tag(words)

    for w in pos:
        token = w[0]
        tag = w[1]
        if tag in tags:
            if len(token) > 1:
                freq[token] = freq.get( token , 0 ) + 1 
            
def sortedByValue( counts ):      
    ## returns the keys, ordered from the lowest to the highest count
    return sorted( counts , key=lambda x: counts[x])             
            

for word in reversed( sortedByValue( freq ) ):
    print( f'{word} => {freq[word]} ' )
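
The same kind of frequency list can also be built with `collections.Counter` from the standard library, which takes care of the sorting as well. The sketch below is only an illustration under stated assumptions: `tagged` is a hand-typed list standing in for the output of `pos_tag()`, and `count_by_tags()` is a hypothetical helper that is not part of `nltk` or `tdm`.

```python
from collections import Counter

def count_by_tags(tagged_tokens, tags):
    # Count the tokens whose POS tag occurs in the given collection of tags,
    # skipping tokens of a single character
    wanted = set(tags)
    return Counter(token for token, tag in tagged_tokens
                   if tag in wanted and len(token) > 1)

# Hand-typed (word, tag) pairs; the tags are illustrative, not tagger output
tagged = [('dark', 'JJ'), ('river', 'NN'), ('darker', 'JJR'),
          ('dark', 'JJ'), ('slowly', 'RB'), ('darkest', 'JJS')]

adjectives = count_by_tags(tagged, ['JJ', 'JJR', 'JJS'])

# most_common() lists the words from the highest to the lowest count
for word, count in adjectives.most_common():
    print(f'{word} => {count}')
```

Passing `['RB', 'RBR', 'RBS']` as the second argument would produce the corresponding list of adverbs.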


## Readability 

Python's `nltk` library does not include a method that can count the number of syllables in a word, surprisingly. To address this lacuna, you can make use of the method `countSyllables()` that is given below. As its only parameter, the function demands a single English word. The code aims to count the number of syllables in a given word, on the basis of a regular expression.

A demonstration is given below.

In [None]:
import re

def countSyllables( word ):
    pattern = "e?[aiou]+e*|e(?!d$|ly).|[td]ed|le$|ble$|a$|y$"
    syllables = re.findall( pattern , word )
    return len(syllables)


print( countSyllables("beauty") )
print( countSyllables("believe") )
print( countSyllables("university") )

When we divide the total number of syllables in a text by the total number of tokens, the number that results gives an indication of the average length of the words (if measured in the number of syllables).

In [None]:
import re
from os.path import join
import nltk
from tdm import word_tokenise , countSyllables

nr_tokens = 0 
nr_syllables = 0 

dir = 'Corpus'
file = 'HeartofDarkness.txt'
path = join( dir , file )

with open(path , encoding = 'utf-8') as file_handler:
    full_text = file_handler.read()

word = word_tokenise(full_text)
for w in word:
    if re.search( r'\w' , w):
        nr_tokens += 1
        nr_syllables += countSyllables( w )
    
print( f'{file}:\n\nTotal number of words: {nr_tokens}' )
print( f'Total number of syllables: {nr_syllables}' )
print( f'Average number of syllables: {nr_syllables / nr_tokens}' )

Data on the average sentence lengths and on the average word lengths are commonly used in formulae developed to assess the overall complexity of a text. The Flesch–Kincaid Grade Level Formula, for instance, uses these two numbers in a formula which can ultimately be used to assess the number of years of education that is required to comprehend the text. 

The Flesch-Kincaid equation is as follows:

`fk = (0.39 * asl) + (11.8 * asw) - 15.59`

In this equation, 'asl' stands for the average sentence length, which can in turn be calculated by dividing the total number of words by the number of sentences. 

'asw' stands for the average number of syllables per word. We can calculate this number by dividing the total number of syllables by the total number of words.  

For more information, see, for instance: https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests. 
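
As a worked example, the formula can be wrapped in a small function and applied to made-up values. The function below is a sketch of what a helper such as `flesch_kincaid()` might look like. It is an assumption, and not necessarily identical to the function of the same name in the `tdm` module. The rounding to one decimal is also a choice made for this sketch.

```python
def flesch_kincaid(asl, asw):
    # asl: average sentence length, in words
    # asw: average number of syllables per word
    return round((0.39 * asl) + (11.8 * asw) - 15.59, 1)

# Made-up values: 20 words per sentence and 1.5 syllables per word
grade = flesch_kincaid(20, 1.5)
print(grade)
```

With these values, the three terms of the equation are 7.8, 17.7 and 15.59, which gives a grade level of roughly 9.9.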

In [None]:
from os.path import join
from tdm import word_tokenise , countSyllables
from nltk import sent_tokenize
import re

dir = 'Corpus'
file = 'ARoomWithAView.txt'
path = join( dir , file )

with open( path , encoding = 'utf-8' ) as file_handler:
    full_text = file_handler.read()

s = sent_tokenize(full_text)
w = word_tokenise(full_text)

asl = len(w) / len(s)

nr_syllables = 0
for word in w:
    if re.search( r'\w' , word):
        nr_syllables += countSyllables( word )

asw = nr_syllables / len(w)

print( f'Average number of words per sentence: {asl}')
print( f'Average number of syllables per word: {asw}')


fk = 0.39 * ( asl )
fk = fk + 11.8 * ( asw )
fk = fk - 15.59
fk = round(fk,1)

print( f'To fully understand {file}, you need to have followed {fk} years of formal education. ')


The examples that have been given so far have all collected data about a single text. It is evidently possible to calculate the numbers for all the texts in a corpus, and to compare the texts on the basis of these numbers. 

The code below brings together all the methods that have been discussed to this point. For all the texts in the folder named 'Corpus', it creates data about the total number of tokens, the number of sentences, the number of adjectives, adverbs and personal pronouns, the total number of syllables and the result of the Flesch–Kincaid Grade Level Formula. 

The results are saved in a new data file named `nlp.csv`. 

If your texts are long or if there are many texts in your corpus, running the code may also take quite a while. The various print statements have been added to give you updates on the progress during the data creation process.

In [None]:
import os
from os.path import join
import re
import tdm
from tdm import word_tokenise, flesch_kincaid, countSyllables
from nltk.tokenize import sent_tokenize 
from nltk import pos_tag

dir = 'Corpus'
texts = []
pos = []

out = open( 'nlp.csv' , 'w' ,  encoding= 'utf-8' )

out.write( 'title,tokens,sentences,adjectives,adverbs,persPronouns,syllables,fk\n' )

for file in os.listdir(dir):
    if re.search( 'txt$' , file ):
        print( '\n\nCollecting data for ' + file + ' ... ')
        out.write( f'{ tdm.removeExtension(file) },' )
        
        path = join( dir , file )
        with open( path , encoding = 'utf-8' ) as file_handler:
            full_text = file_handler.read()
            sentences = sent_tokenize(full_text)
            words = word_tokenise(full_text)
        
        print("Counting number of tokens ...  ")
        out.write( f'{ len(words) },' )
        print("Calculating number of sentences ...  ") 
        out.write( f'{ len(sentences) },' )
        
        print("Adding POS tags ...  ")
        pos = pos_tag(words)

        pos_tags = dict()
        for p in pos:
            pos_tags[p[1]] = pos_tags.get( p[1] , 0 ) + 1
          
        print("Counting number of adjectives ...  ")
        
        adjectives = [ 'JJ' , 'JJR' , 'JJS' ]
        
        count = 0 
        for a in adjectives:
            # .get() avoids a KeyError when a tag does not occur in the text
            count += pos_tags.get( a , 0 )
        
        out.write( f'{count},' )
        
        print("Counting number of adverbs ...  ")
        adverbs = [ 'RB' , 'RBR' , 'RBS' ]
        count = 0 
        for a in adverbs:
            count += pos_tags.get( a , 0 )
        
        out.write( f'{count},' )
        
        print("Counting number of personal pronouns ...  ")
        personalPronouns = [ 'PRP' ]
        count = 0 
        for a in personalPronouns:
            count += pos_tags.get( a , 0 )
        
        out.write( f'{count},' )
        
        print("Calculating Flesch-Kincaid formula ...  ")
        
        asl = len(words) / len(sentences)
        
        nr_syllables = 0 
        for word in words:
            if re.search( r'\w' , word):
                nr_syllables += countSyllables( word )
        asw = nr_syllables / len(words)
        out.write( f'{nr_syllables},' )

        fk = flesch_kincaid( asl , asw )
        
        out.write( f'{fk}' )


        out.write( '\n' )
    
out.close()    

print("\n\nDone!")
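
Writing the comma-separated values by hand works as long as none of the titles contains a comma. As a hedged alternative, the `csv` module from the standard library quotes such values automatically. The sketch below writes to an in-memory buffer rather than to `nlp.csv`, and both the titles and the numbers in `rows` are made up for illustration.

```python
import csv
import io

header = ['title', 'tokens', 'sentences']
rows = [
    ['HeartofDarkness', 1000, 50],       # made-up numbers
    ['Title, with a comma', 2000, 80],   # hypothetical title
]

# The buffer stands in for open('nlp.csv', 'w', newline='', encoding='utf-8')
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back shows that the comma inside the title has survived
read_back = list(csv.reader(io.StringIO(csv_text)))
print(read_back[2][0])
```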


Using the CSV that was just created, you can create a number of visualisations illuminating the differences and the similarities between the texts in your corpus. 

The CSV contains the following variables:

* title
* tokens
* sentences
* adjectives
* adverbs
* persPronouns
* syllables
* fk

These variables can all be used to make such visualisations. The code in the cell below creates a bar chart, for instance. The X-axis lists the various titles in your corpus, and the Y-axis displays the average number of words per sentence. 

The data about the average number of words per sentence is not in the CSV originally. The CSV file contains the raw counts. To calculate the average sentence length, you need to divide the values in the column `tokens` by the values in the column `sentences`. The code below adds a new column named `sentenceLength`. 

The variables `x_axis` and `y_axis` specify the values to be used in this visualisation.  

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import pandas as pd

data = pd.read_csv( 'nlp.csv' )



data['sentenceLength'] = data['tokens'] / data['sentences'] 

y_axis = 'sentenceLength'
x_axis = 'title'

fig = plt.figure( figsize=( 12 ,6 ) )
ax = plt.axes()


bar_width = 0.6
opacity = 0.8

ax.bar( list(data[x_axis]) , list(data[y_axis]) , width = bar_width, alpha = opacity , color = '#781926')

plt.xticks(rotation= 90)

ax.set_xlabel('Texts' , fontsize= 12)
ax.set_ylabel('Average sentence length' , fontsize = 12 )


plt.show()
#plt.savefig('barchart.jpg')

The code below visualises two sets of values simultaneously in a scatter plot. 

As was mentioned, the CSV file contains raw counts. To ensure that the texts can be compared fairly, we need to normalise these raw counts. We can do this by dividing the absolute number by the number of tokens. The numbers that result from such a division represent proportions. If we divide the total number of adverbs by the total number of words, this is similar to calculating the percentage of adverbs in the full text. 

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import pandas as pd
import matplotlib.patches as mpatches

data = pd.read_csv( 'nlp.csv' )

data['adverbs_normalised'] = data['adverbs'] / data['tokens'] 
data['adjectives_normalised'] = data['adjectives'] / data['tokens'] 

x_axis = 'adverbs_normalised' 
y_axis = 'adjectives_normalised' 


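# Matplotlib 3.6 and later name this style 'seaborn-v0_8-whitegrid'
new_style = 'seaborn-v0_8-whitegrid'
plt.style.use( new_style if new_style in plt.style.available else 'seaborn-whitegrid' )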


fig = plt.figure( figsize = ( 8,8 ))
ax = plt.axes()


ax.scatter( list(data[x_axis])  , list(data[y_axis]) , alpha=0.8,  s=90 )


for index, row in data.iterrows():
    plt.text( row[x_axis] , row[y_axis] , row['title'] , fontsize=14)
    

ax.set_xlabel( 'Proportion of adverbs' , fontsize = 16 )
ax.set_ylabel( 'Proportion of adjectives'  , fontsize = 16 )

plt.show()
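
The reasoning behind the normalisation can be checked by hand on a small example. The sketch below uses plain Python and made-up counts for two hypothetical texts named `ShortText` and `LongText`. It is meant only to show why proportions are a fairer basis for comparison than raw counts.

```python
# Made-up counts for two hypothetical texts of different lengths
counts = {
    'ShortText': {'tokens': 1000, 'adverbs': 50},
    'LongText':  {'tokens': 10000, 'adverbs': 300},
}

proportions = {}
for title, c in counts.items():
    # Dividing by the number of tokens turns the raw count into a proportion
    proportions[title] = c['adverbs'] / c['tokens']
    print(f"{title}: {c['adverbs']} adverbs, {proportions[title]:.1%} of all tokens")
```

In absolute terms the longer text contains more adverbs, but in relative terms the shorter text uses them more often.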

## NLP for other languages

Many of the default methods in the `NLTK` library, such as `pos_tag`, were trained on texts in modern English. If you want to work with other languages, you need to change the model underlying these methods. 

For texts in the Dutch language, for instance, you can make use of the model trained on the [Alpino](https://www.let.rug.nl/~vannoord/trees/) corpus. You can do this as follows:



In [None]:
import nltk
nltk.download('alpino')

from nltk.corpus import alpino as alp
from nltk.tag import UnigramTagger, BigramTagger
training_corpus = alp.tagged_sents()
unitagger = UnigramTagger(training_corpus)
bitagger = BigramTagger(training_corpus, backoff=unitagger)
pos_tag = bitagger.tag

In [None]:
from tdm import word_tokenise

sentence = 'Het was nog donker, toen in de vroege morgen van de tweeëntwintigste december 1946 in onze stad, op de eerste verdieping van het huis Schilderskade 66, de held van deze geschiedenis, Frits van Egters, ontwaakte.'

words = word_tokenise(sentence)
pos_tag(words)

Another option is to make use of another NLP library named [spaCy](https://spacy.io/). This NLP library offers support for [a wide range of languages](https://spacy.io/usage/models). These language models can all be downloaded from the spaCy website. 

The spaCy library is not part of the Anaconda distribution of Python, so if you have never worked with spaCy before, the library needs to be installed first.

In [None]:
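import sys
# sys.executable makes sure spaCy is installed for the Python used by this notebook
!{sys.executable} -m pip install spacy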


There are a number of [models for the Dutch language](https://spacy.io/models/nl), for instance. You can use the code below to download the model named `nl_core_news_lg`. 


In [None]:
import sys
!{sys.executable} -m spacy download nl_core_news_lg

After the model has been downloaded, it needs to be loaded into your code, so that you can start to work with it. The `load()` method in `spaCy` creates a new object which can be used to add linguistic and semantic annotations. In the cell below, this object is given the name `nlp`.

In [None]:
import spacy
nlp = spacy.load("nl_core_news_lg")

This newly created `nlp` object can be given a string as input. Its output will be a tagged text giving information about a number of grammatical and morphological aspects of this string, including the parts of speech, the sentence boundaries and the lemmatised form. 

In the code below, the output of the `nlp()` method is assigned to a variable named `tagged_text`. The annotations can be accessed by navigating through the tagged text token by token.

In [None]:
lemmatizer = nlp.get_pipe("lemmatizer")
tagged_text = nlp("'Het is gezien', mompelde hij, 'het is niet onopgemerkt gebleven.'")

for w in tagged_text:
    print( f'{w.text} (pos: {w.pos_} ; lemma: {w.lemma_})' )
    

The code below aims to use `spaCy` to produce data about the number of words, sentences, adverbs, pronouns, adjectives and conjunctions for all the texts in a folder named 'Corpus'. The process of adding linguistic annotations may demand some time, unfortunately. The code below uses the `timeit` library to track how long this process actually takes. With longer texts, this process may take more than a minute. 

In [None]:
import timeit
from tdm import removeExtension
import spacy
import os
import re

dir = 'Corpus'

out = open( 'nlp.csv' , 'w' ,  encoding = 'utf-8')

# CSV header
out.write( 'title,tokens,sentences,' )
out.write(  'adverbs,verbs,pronouns,nouns,adjectives,conjunctions,aux-verbs\n')


for file in os.listdir(dir):
    if re.search( r'\.txt$' , file ):
        print( f'Adding annotations for {file} ... ')
        out.write( removeExtension(file) )
        path = os.path.join(dir,file)
        with open(path , encoding = 'utf-8') as file_handler:
            full_text = file_handler.read()
        start_time = timeit.default_timer()
        annotated_text = nlp(full_text)
        end_time = timeit.default_timer()
        print( f'Done! The annotation process took {end_time-start_time} seconds.')
        nr_words = len(annotated_text) 
        nr_sentences = len(list( annotated_text.sents ))
        out.write( f',{nr_words},{nr_sentences}')

        # Start with an empty dictionary of POS counts for every text
        pos = dict()
        for w in annotated_text:
            pos[ w.pos_ ] = pos.get( w.pos_ , 0 ) + 1
            
        out.write( f",{pos.get('ADV',0)}" )
        out.write( f",{pos.get('VERB',0)}" )
        out.write( f",{pos.get('PRON',0)}" )
        out.write( f",{pos.get('NOUN',0)}" )
        out.write( f",{pos.get('ADJ',0)}" )
        out.write( f",{pos.get('SCONJ',0)+pos.get('CCONJ',0)}" )
        out.write( f",{pos.get('AUX',0)}" )
        out.write( '\n')
            
out.close()