# 6. NLP

## Introduction

Natural Language Processing (NLP) is an interdisciplinary area of research that brings together insights from fields such as artificial intelligence, computational linguistics, statistics and computer science. The aim of NLP is to enable computers to understand and process the natural languages that are spoken and written by human beings. Researchers in the field have developed sophisticated tools and algorithms for tasks such as machine translation, document summarisation and sentiment analysis. This notebook explains two specific preprocessing tasks in NLP: part of speech tagging and lemmatisation.

 

## Part of speech tagging

Part of speech (POS) taggers are applications which determine the lexical (or syntactic) categories of words. Using such POS taggers, we can extract words in specific lexical categories, such as nouns, verbs, adjectives, and adverbs.

Once you have imported the `nltk` library, you can generate such POS tags by making use of the `pos_tag()` method. This method expects a list of words as a parameter.

`pos_tag()` is typically used in combination with a word tokenisation method such as `word_tokenize`. The output of this latter function can then be used as input to the `pos_tag()` method.

In [None]:
import nltk
from nltk import word_tokenize,pos_tag
from tdmh import *

quote = '''All the world is a stage, 
and all the men and women merely players'''

words = word_tokenize(quote)
words = remove_punctuation(words)
pos = pos_tag(words)

for p in pos:
    print(p[0] + ' => ' + p[1] )
  

The `pos_tag()` method returns a list in which each item is a composite variable with two values. More specifically, each item is a data structure that is called a *tuple*. The first value is the word that was tagged and the second value is the POS tag that was assigned to this word. You can access these values individually using square brackets.
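
As a minimal sketch of this structure, the cell below works with a small hand-written list of (word, tag) tuples. The list `tagged` is an assumption made for illustration: it has the same shape as the output of `pos_tag()`, but it was not produced by running the tagger.

```python
# Hand-written example in the same shape as the output of pos_tag():
# a list of (word, tag) tuples. These tags are illustrative only.
tagged = [('All', 'PDT'), ('the', 'DT'), ('world', 'NN')]

# Access the two values of the first tuple using square brackets
first = tagged[0]
print(first[0])   # the word
print(first[1])   # the POS tag

# Tuples can also be unpacked directly in the for loop
for word, tag in tagged:
    print(word + ' => ' + tag)
```

Unpacking the tuple in the `for` statement avoids the numeric indices and makes it clearer which value is the word and which is the tag.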

The meaning of all of the POS tags can be displayed by calling the `nltk.help.upenn_tagset()` method, which prints the full list of tags itself.


In [None]:
nltk.help.upenn_tagset()

The list of codes and their meanings can [also be found online](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).
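
Once words have been tagged, extracting a particular lexical category comes down to filtering on the tag. The sketch below assumes a hand-written list of (word, tag) tuples rather than real tagger output, and the variable names are hypothetical. It selects the nouns by checking whether the tag starts with 'NN'.

```python
# Hand-written (word, tag) tuples; illustrative, not actual tagger output
tagged = [('world', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('stage', 'NN'),
          ('men', 'NNS'), ('merely', 'RB'), ('players', 'NNS')]

# All Penn Treebank noun tags (NN, NNS, NNP, NNPS) start with 'NN'
nouns = [word for word, tag in tagged if tag.startswith('NN')]
print(nouns)
```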

## Lemmatisation 

English verbs can be used in the past tense, in the present tense, in the continuous form or in the perfect form, and these different forms can make it more difficult to search systematically for occurrences of a specific verb. The same can be the case for nouns, which can be used in the singular and in the plural form. In some situations, we simply want to find all occurrences of a word, regardless of its inflected form. In this context, lemmatisation can offer a solution.

Lemmatisation is a process in which the inflected forms of the words that are found in a text are converted into their base dictionary form. This base form is referred to as the lemma.

You can lemmatise texts using the `lemmatize()` method, which is part of the `WordNetLemmatizer` class of the `nltk` library. This method needs to be applied to individual words.

In [None]:
from nltk.stem import WordNetLemmatizer
lemmatiser = WordNetLemmatizer()

print( lemmatiser.lemmatize( 'books' ) )
## prints 'book'

print( lemmatiser.lemmatize( 'reads' ) )
## prints 'read'

In some cases, it can be unclear precisely how words ought to be lemmatised. Certain homonyms may either be verbs or nouns, for instance, and, depending on their usage, they should be lemmatised to different forms. To help the lemmatiser to make such distinctions, we can add a second parameter to indicate the lexical category of the word to be lemmatised. The first statement in the code below lemmatises the word 'recording' as a verb, and the second statement as a noun.

In [None]:
print( lemmatiser.lemmatize( 'recording' , 'v') )
## 'record'

print( lemmatiser.lemmatize( 'recording' , 'n' ) )
## 'recording'

As you can see, the `lemmatize()` method does not use the Penn Treebank codes but the POS codes that have been defined for `wordnet`. It uses 'a' for adjectives, 'v' for verbs, 'n' for nouns and 'r' for adverbs.

The code below shows how you can lemmatise a whole sentence. The code first tokenises the words in the given sentence using `word_tokenize`. Next, it generates the POS codes (from the Penn Treebank) using `pos_tag`. These codes are then converted into `wordnet` codes using a new function named `ptb_to_wordnet()`.

In [None]:
import nltk
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from tdmh import *   ## provides remove_punctuation()
import re

quote = '''We are such stuff as dreams are made on, 
and our little life is rounded with a sleep'''


def ptb_to_wordnet(PTT):

    if PTT.startswith('J'):
        ## Adjective
        return 'a'
    elif PTT.startswith('V'):
        ## Verb
        return 'v'
    elif PTT.startswith('N'):
        ## Noun
        return 'n'
    elif PTT.startswith('R'):
        ## Adverb
        return 'r'
    else:
        return ''

    
lemmatiser = WordNetLemmatizer()

words = word_tokenize(quote)
words = remove_punctuation(words)

pos = nltk.pos_tag(words)

## pos_tag() returns one (word, tag) tuple for each word
for word, tag in pos:
    word = word.lower()
    posTag = ptb_to_wordnet(tag)

    ## Only lemmatise words whose tag has a wordnet equivalent
    if posTag:
        lemma = lemmatiser.lemmatize( word , posTag )
        print( f'{word} => {lemma}' )
    else:
        print( f'{word} => {word}' )
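
The chain of `if`/`elif` tests in `ptb_to_wordnet()` can also be written as a lookup table. The variant below is a sketch of one possible alternative, and nothing else in this notebook depends on it. The names `PTB_PREFIX_TO_WORDNET` and `ptb_to_wordnet_lookup()` are hypothetical. Like the original function, it looks only at the first letter of the Penn Treebank tag.

```python
# Hypothetical lookup table: first letter of a Penn Treebank tag
# mapped to the corresponding wordnet POS code
PTB_PREFIX_TO_WORDNET = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}


def ptb_to_wordnet_lookup(tag):
    # tag[:1] is the first character, or '' when the tag is empty;
    # tags without a wordnet equivalent give an empty string
    return PTB_PREFIX_TO_WORDNET.get(tag[:1], '')


for tag in ['JJS', 'VBD', 'NNS', 'RB', 'DT']:
    print(tag, '=>', repr(ptb_to_wordnet_lookup(tag)))
```

A table of this kind keeps the mapping in one place, which may make it easier to extend if further categories need to be handled.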
        

# Exercises

## Exercise 6.1

Create a list containing the unique adjectives that occur in *Pride and Prejudice*.

## Exercise 6.2

Stephen King is [reputed to have said](https://www.goodreads.com/quotes/430289-i-believe-the-road-to-hell-is-paved-with-adverbs) that “the road to hell is paved with adverbs”, and many style guides similarly advise writers to avoid adverbs, especially those ending in '-ly'.

Can you calculate, for each text in the corpus, the number of adverbs ending in '-ly', measured as a percentage of the total number of words?

## Exercise 6.3

Which text in the corpus has the highest number of modal verbs? The Penn Treebank code for 'modal auxiliaries' is MD.

## Exercise 6.4

Extract all the sentences from *HeartOfDarkness.txt* that contain an adjective in the superlative form. Write these sentences into a file named 'sentences.txt'. The code for the words in this category is 'JJS'.

## Exercise 6.5

Extract all the sentences from *ARoomWithaView.txt* containing a form of the verb 'to see', in any tense or conjugation, except the infinitive form. In other words, extract sentences containing forms such as 'seen', 'saw' or 'seeing', but not 'see'.


## Exercise 6.6

From *HeartOfDarkness.txt*, extract all sentences containing the following combination of categories:

* Article - adverb - adjective - noun 

These categories are identified by the following codes:

* Article: DT
* Adverb: RB, RBR or RBS
* Adjective: JJ, JJR or JJS
* Noun: NN, NNP, NNPS or NNS

