<div class="pull-right"><img src=KEY-logo.png></div>

# Natural Language Processing
### Exploring the NLP pipeline in NLTK

CSI4106 Artificial Intelligence  
Fall 2018  
Caroline Barrière

***

This notebook is split into two parts.  In **Part A**, you will explore the different steps of the NLP pipeline, and in **Part B**, you will perform a few statistical analyses of a corpus.  We will work with the *nltk* package, which is very useful for NLP analysis.  You will need to install it before you start.  Information about NLTK is available here: http://www.nltk.org/.

***HOMEWORK***:  
Go through the notebook by running each cell, one at a time. Look for (**TO DO**) for the tasks that you need to perform.  
Make sure you *sign* (type your name) the notebook at the end. Once you're done, submit your notebook.

***

In [1]:
# first step, import the package and download the tokenizer models
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\nick\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

**PART A - NATURAL LANGUAGE PROCESSING PIPELINE**  
  
In this part, we will use the modules from *nltk* to perform the different steps of the pipeline.  
We first define a small sample text below.

In [2]:
sampleText = "On September 5th, we started our A.I. course. The course number is CSI4106. "\
             "In that course, we have studied many artificial intelligence concepts.  "\
             "We have looked at intricate algorithms, such as the ARC-3 algorithm and neural networks learning algorithms."

#### Step 1 - Tokenization

In [3]:
from nltk import word_tokenize

tokens = word_tokenize(sampleText)
# number of tokens
len(tokens)

47

In [4]:
# Showing the tokens
print(tokens)

['On', 'September', '5th', ',', 'we', 'started', 'our', 'A.I', '.', 'course', '.', 'The', 'course', 'number', 'is', 'CSI4106', '.', 'In', 'that', 'course', ',', 'we', 'have', 'studied', 'many', 'artificial', 'intelligence', 'concepts', '.', 'We', 'have', 'looked', 'at', 'intricate', 'algorithms', ',', 'such', 'as', 'the', 'ARC-3', 'algorithm', 'and', 'neural', 'networks', 'learning', 'algorithms', '.']
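
To see why `A.I.` comes out as `'A.I'` followed by a separate `'.'`, here is a minimal regular-expression tokenizer. It is only a sketch: `simple_tokenize` and `TOKEN_PATTERN` are hypothetical names, and this is not how `word_tokenize` is implemented.

```python
import re

# Hypothetical helper (not NLTK's implementation): a word is a run of word
# characters, optionally joined by internal hyphens or periods; any other
# non-space character becomes a token of its own.
TOKEN_PATTERN = re.compile(r"\w+(?:[-.]\w+)*|[^\w\s]")

def simple_tokenize(text):
    """Return the list of tokens found in text."""
    return TOKEN_PATTERN.findall(text)

print(simple_tokenize("We started our A.I. course on the ARC-3 algorithm."))
```

The final period of `A.I.` is not followed by a word character, so the pattern cannot absorb it into the word and it is emitted as a separate token.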


#### Step 2a - Stemming (Porter Stemmer)
See [here](http://snowballstem.org/algorithms/) for a reference on the stemming algorithms.

In [5]:
# nltk contains different stemmers, and we try the Porter Stemmer here
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
singles = [stemmer.stem(t) for t in tokens]
print(singles)

['On', 'septemb', '5th', ',', 'we', 'start', 'our', 'a.i', '.', 'cours', '.', 'the', 'cours', 'number', 'is', 'csi4106', '.', 'In', 'that', 'cours', ',', 'we', 'have', 'studi', 'mani', 'artifici', 'intellig', 'concept', '.', 'We', 'have', 'look', 'at', 'intric', 'algorithm', ',', 'such', 'as', 'the', 'arc-3', 'algorithm', 'and', 'neural', 'network', 'learn', 'algorithm', '.']
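
Stems such as `'cours'` or `'studi'` are not dictionary words because a stemmer only applies string rules. The toy function below illustrates the idea with naive suffix stripping. It is an assumption made for illustration, far simpler than the Porter algorithm, and `naive_stem` and `SUFFIXES` are hypothetical names.

```python
# Toy suffix-stripping stemmer (illustration only; the real Porter algorithm
# applies several ordered rule phases with additional conditions).
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word, min_stem_len=3):
    """Lowercase the word and strip the first matching suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip if at least min_stem_len characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in ["started", "courses", "learning", "is"]])
```

No lexicon is consulted, so nothing guarantees that the result is a real word, which is the main contrast with lemmatization.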


#### Step 2b - Lemmatization
Lemmatization relies on a resource called WordNet (https://wordnet.princeton.edu/), in which lemmas are defined.

In [6]:
# Download the wordnet resource
nltk.download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\nick\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


In [7]:
wnl = WordNetLemmatizer()
lemmas = [wnl.lemmatize(t) for t in tokens]
print(lemmas)

['On', 'September', '5th', ',', 'we', 'started', 'our', 'A.I', '.', 'course', '.', 'The', 'course', 'number', 'is', 'CSI4106', '.', 'In', 'that', 'course', ',', 'we', 'have', 'studied', 'many', 'artificial', 'intelligence', 'concept', '.', 'We', 'have', 'looked', 'at', 'intricate', 'algorithm', ',', 'such', 'a', 'the', 'ARC-3', 'algorithm', 'and', 'neural', 'network', 'learning', 'algorithm', '.']


**(TO-DO : Q1)** Describe in your own words the difference between lemmatisation and stemming.  Use examples from above to show the difference.

*Q1 ANSWER:* Stemming strips affixes to keep the non-changing portion of a word (the stem), which need not be a real word, while lemmatisation maps each word to its base dictionary form, or lemma. In the output above, stemming changed 'September' and 'course' to 'septemb' and 'cours', whereas lemmatisation left them as 'September' and 'course'.

#### Step 3 - Part-Of-Speech tagging  (POS tagging)
As we've seen in class, sentence splitting can be learned through a supervised model.  POS tagging can also be learned through a supervised model.  Here, we will use a perceptron model pre-trained in NLTK.  See http://www.nltk.org/_modules/nltk/tag/perceptron.html to understand the model.  
  
The full set of tags is available [here](https://www.clips.uantwerpen.be/pages/mbsp-tags).

In [8]:
# nltk contains a method to obtain the part-of-speech of each token

nltk.download('averaged_perceptron_tagger')
posTokens = nltk.pos_tag(tokens)
print(posTokens)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\nick\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.
[('On', 'IN'), ('September', 'NNP'), ('5th', 'CD'), (',', ','), ('we', 'PRP'), ('started', 'VBD'), ('our', 'PRP$'), ('A.I', 'NNP'), ('.', '.'), ('course', 'NN'), ('.', '.'), ('The', 'DT'), ('course', 'NN'), ('number', 'NN'), ('is', 'VBZ'), ('CSI4106', 'NNP'), ('.', '.'), ('In', 'IN'), ('that', 'DT'), ('course', 'NN'), (',', ','), ('we', 'PRP'), ('have', 'VBP'), ('studied', 'VBN'), ('many', 'JJ'), ('artificial', 'JJ'), ('intelligence', 'NN'), ('concepts', 'NNS'), ('.', '.'), ('We', 'PRP'), ('have', 'VBP'), ('looked', 'VBN'), ('at', 'IN'), ('intricate', 'JJ'), ('algorithms', 'NNS'), (',', ','), ('such', 'JJ'), ('as', 'IN'), ('the', 'DT'), ('ARC-3', 'NNP'), ('algorithm', 'NN'), ('and', 'CC'), ('neural', 'JJ'), ('networks', 'NNS'), ('learning', 'VBG'), ('algorithms', 'NN'), ('.', '.')]


In [9]:
# If we just want to see one tag in particular

print(posTokens[1])  # it's a tuple
print(posTokens[1][1])  # second part of the tuple is the tag

('September', 'NNP')
NNP
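
Because each entry is a `(token, tag)` tuple, selecting words by part of speech is a simple filter. The sketch below uses a few pairs copied from the tagger output above; `tokens_with_tag_prefix` is a hypothetical helper name.

```python
# A few (token, tag) pairs copied from the tagger output above.
tagged = [("we", "PRP"), ("started", "VBD"), ("our", "PRP$"), ("course", "NN"),
          ("algorithms", "NNS"), ("CSI4106", "NNP")]

def tokens_with_tag_prefix(tagged_tokens, prefix):
    """Return the tokens whose tag starts with the given prefix."""
    return [token for token, tag in tagged_tokens if tag.startswith(prefix)]

nouns = tokens_with_tag_prefix(tagged, "NN")  # NN, NNS, NNP all start with NN
verbs = tokens_with_tag_prefix(tagged, "VB")  # VBD, VBP, VBN, ... start with VB
print(nouns, verbs)
```

Matching on a tag prefix works here because the tags for one word class share their first letters.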


#### Back to Step 2 

The lemmatizer we use relies on WordNet (a lexical resource commonly used in NLP) to provide a set of lemmas.  As many words are ambiguous and can appear in sentences as verbs or nouns (remember examples such as *Will's will will be achieved*), the lemmatizer can benefit from knowledge of POS.  Small problem... POS tags in WordNet are not the same as in the Penn Treebank tagset.  WordNet defines only 4 POS: n (noun), v (verb), a (adjective) and r (adverb). The small function below provides a partial mapping between the two tagsets.

In [10]:
from nltk.corpus import wordnet

# try to lemmatize, this time knowing the POS
# tagsets are often different... here we map the treebank tagset (default in pos_tag) 
# to the wordnet tagset -- 
# We will learn more about wordnet when we discuss resources in the Knowledge Representation module of this course

def get_wordnet_pos(treebank_tag):

    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.ADV  # just use as default, for ADV the lemmatizer doesn't change anything 


In [11]:
# Transform the tags into wordnet tags
wordnet_tags = [get_wordnet_pos(p[1]) for p in posTokens]
print(wordnet_tags)

['r', 'n', 'r', 'r', 'r', 'v', 'r', 'n', 'r', 'n', 'r', 'r', 'n', 'n', 'v', 'n', 'r', 'r', 'r', 'n', 'r', 'r', 'v', 'v', 'a', 'a', 'n', 'n', 'r', 'r', 'v', 'v', 'r', 'a', 'n', 'r', 'a', 'r', 'r', 'n', 'n', 'r', 'a', 'n', 'v', 'n', 'r']


In [12]:
# Now, let's try to lemmatize again, but we tell the lemmatizer what the POS is.

posLemmas = [wnl.lemmatize(t,w) for t,w in zip(tokens,wordnet_tags)]
print(posLemmas)

['On', 'September', '5th', ',', 'we', 'start', 'our', 'A.I', '.', 'course', '.', 'The', 'course', 'number', 'be', 'CSI4106', '.', 'In', 'that', 'course', ',', 'we', 'have', 'study', 'many', 'artificial', 'intelligence', 'concept', '.', 'We', 'have', 'look', 'at', 'intricate', 'algorithm', ',', 'such', 'as', 'the', 'ARC-3', 'algorithm', 'and', 'neural', 'network', 'learn', 'algorithm', '.']


**(TO-DO : Q2)** Which words are lemmatized differently when provided with the additional knowledge of POS? Look at the variables *lemmas* and *posLemmas*.

In [15]:
for i in range(len(posLemmas)):
    if lemmas[i] != posLemmas[i]:
        print(i, ' ', lemmas[i], ' ', posLemmas[i])

5   started   start
14   is   be
23   studied   study
31   looked   look
37   a   as
44   learning   learn


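*Q2-ANSWER:* From the code above, the words lemmatized differently are: started (start), is (be), studied (study), looked (look), as (wrongly reduced to 'a' without the POS, kept as 'as' with it) and learning (learn).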


#### Step 4 - Sentence splitting

In [17]:
# Sentence splitting can be done before tokenizing and POS tagging if we wish

sentences = nltk.sent_tokenize(sampleText)
print(sentences)
print(len(sentences))

['On September 5th, we started our A.I.', 'course.', 'The course number is CSI4106.', 'In that course, we have studied many artificial intelligence concepts.', 'We have looked at intricate algorithms, such as the ARC-3 algorithm and neural networks learning algorithms.']
5


**(TO-DO - Q3)** How many sentences are generated?  If you try to "reverse engineer"... what could be the cause of the additional split?

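*Q3-ANSWER:* There are 5 sentences generated. The additional split comes after 'A.I.': the splitter appears to treat the period that ends the abbreviation as a sentence boundary, which leaves 'course.' as a sentence of its own.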
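
The effect of abbreviations on sentence splitting can be reproduced with a naive rule that splits after every `.`, `!` or `?` followed by whitespace. This is only a sketch under that assumption: NLTK's `sent_tokenize` relies on a pre-trained model rather than this rule, and `naive_sent_split` is a hypothetical name.

```python
import re

def naive_sent_split(text):
    """Split on whitespace that follows '.', '!' or '?'."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

sample = "On September 5th, we started our A.I. course. The course number is CSI4106."
for sentence in naive_sent_split(sample):
    print(repr(sentence))
```

The rule cannot tell an abbreviation period from a sentence-final one, so `'course.'` ends up as a separate sentence here too.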

**(TO-DO - Q4)** Create another sample of text containing a few sentences.  You can copy sentences from a book or webpage, or just make them up.  Then, copy the pipeline steps we performed above to analyze your sentences.  Your pipeline should include: tokenization, POS tagging, POS-based lemmatization, and sentence splitting.

In [None]:
# Q4 ANSWER

# Rerun the pipeline on YOUR text
# mySampleText = "......

#### Step 5 - Parsing
The parser in NLTK is actually a wrapper around a Java parser (from Stanford NLP).  It's a bit complex to install, so instead, we will use the online Stanford parser to test a few sentences: http://nlp.stanford.edu:8080/parser/.

**(TO-DO : Q5)** Run YOUR sentences in the parser.  Show the results of the dependency and constituency parsing (copy and paste the relevant portion of the screen here).

*Q5-ANSWER:*

**PART B - STATISTICAL ANALYSIS OF TEXTS**

We now look at a few statistics that help understand what a text is about and how it is structured. A quick "summary" of a text is obtained by looking at its important words.  Importance can be approximated by frequency.  

Below we work with a small text copied from the page on Vitamin C in Wikipedia.  

NLTK has many built-in methods to explore text. A cheat sheet is a handy way to see the set of methods; [here](https://github.com/michellejm/GCDRB_Text_Analysis/blob/master/Text-Analysis-with-NLTK-Cheatsheet.pdf) is one example.

In [18]:
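# Note: the first string fragment has no trailing space after "food", so "food" and "and"
# are joined and show up as the single token 'foodand' in the outputs below.
vitaminText =   "Vitamin C, also known as ascorbic acid and L-ascorbic acid, is a vitamin found in food"\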
                "and used as a dietary supplement.[1] The disease scurvy is prevented and treated with "\
                "vitamin C-containing foods or dietary supplements.[1] Evidence does not support use in "\
                "the general population for the prevention of the common cold.[2][3] There is, however, "\
                "some evidence that regular use may shorten the length of colds.[4] It is unclear if "\
                "supplementation affects the risk of cancer, cardiovascular disease, or dementia.[5][6] "\
                "It may be taken by mouth or by injection. Vitamin C is generally well tolerated.[1] Large doses "\
                "may cause gastrointestinal discomfort, headache, trouble sleeping, and flushing of the skin.[1][3] "\
                "Normal doses are safe during pregnancy.[7] The United States Institute of Medicine recommends "\
                "against taking large doses.[8] Vitamin C is an essential nutrient involved in the repair of "\
                "tissue and the enzymatic production of certain neurotransmitters.[1][8] It is required for the "\
                "functioning of several enzymes and is important for immune system function.[8][9] It also functions "\
                "as an antioxidant.[2] Foods containing vitamin C include citrus fruits, broccoli, Brussels sprouts, "\
                "raw bell peppers, and strawberries.[2] Prolonged storage or cooking may reduce vitamin C content "\
                " in foods.[2] Vitamin C was discovered in 1912, isolated in 1928, and in 1933 was the first vitamin "\
                "to be chemically produced.[10] It is on the World Health Organization Model List of Essential "\
                "Medicines, the most effective and safe medicines needed in a health system.[11] Vitamin C is available "\
                "as a generic medication and over-the-counter drug.[1] In 2015, the wholesale cost in the developing "\
                "world was less than US$0.01 per tablet.[12] Partly for its discovery, Albert Szent-Györgyi and "\
                "Walter Norman Haworth were awarded 1937 Nobel Prizes in Physiology and Medicine and Chemistry, "\
                "respectively.[13][14]"

**1. Gather the tokens**

In [19]:
# Tokenize the lowercased text
vTokens = word_tokenize(vitaminText.lower())
print(vTokens)

['vitamin', 'c', ',', 'also', 'known', 'as', 'ascorbic', 'acid', 'and', 'l-ascorbic', 'acid', ',', 'is', 'a', 'vitamin', 'found', 'in', 'foodand', 'used', 'as', 'a', 'dietary', 'supplement', '.', '[', '1', ']', 'the', 'disease', 'scurvy', 'is', 'prevented', 'and', 'treated', 'with', 'vitamin', 'c-containing', 'foods', 'or', 'dietary', 'supplements', '.', '[', '1', ']', 'evidence', 'does', 'not', 'support', 'use', 'in', 'the', 'general', 'population', 'for', 'the', 'prevention', 'of', 'the', 'common', 'cold', '.', '[', '2', ']', '[', '3', ']', 'there', 'is', ',', 'however', ',', 'some', 'evidence', 'that', 'regular', 'use', 'may', 'shorten', 'the', 'length', 'of', 'colds', '.', '[', '4', ']', 'it', 'is', 'unclear', 'if', 'supplementation', 'affects', 'the', 'risk', 'of', 'cancer', ',', 'cardiovascular', 'disease', ',', 'or', 'dementia', '.', '[', '5', ']', '[', '6', ']', 'it', 'may', 'be', 'taken', 'by', 'mouth', 'or', 'by', 'injection', '.', 'vitamin', 'c', 'is', 'generally', 'well', '

NLTK provides a class Text, which is instantiated using tokens. The class Text provides methods to easily build a frequency distribution of tokens, which shows the content of the text.
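
As a rough plain-Python analogue of what a frequency distribution computes, the sketch below counts tokens with `collections.Counter`, the standard-library class that `nltk.FreqDist` builds on. The short token list is made up for illustration.

```python
from collections import Counter

# Made-up token list for illustration.
tokens = ["vitamin", "c", "is", "a", "vitamin", "found", "in", "food", "vitamin", "c"]

counts = Counter(tokens)        # maps each token to its number of occurrences
print(counts.most_common(2))    # the two most frequent tokens with their counts
```

`most_common(n)` returns `(token, count)` pairs sorted by decreasing count, which is the same shape as the output of the cell below.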

In [20]:
# Wrap the tokens within a "Text" object on which we can apply methods.
vText = nltk.Text(vTokens)
# Number of tokens
len(vText)
# Build frequency distribution 
fdist = nltk.FreqDist(vText)
fdist.most_common(50)

[('[', 25),
 (']', 25),
 ('.', 20),
 (',', 19),
 ('the', 16),
 ('and', 12),
 ('in', 11),
 ('vitamin', 10),
 ('is', 10),
 ('of', 9),
 ('c', 7),
 ('1', 6),
 ('it', 5),
 ('as', 4),
 ('a', 4),
 ('or', 4),
 ('for', 4),
 ('2', 4),
 ('may', 4),
 ('foods', 3),
 ('doses', 3),
 ('8', 3),
 ('was', 3),
 ('also', 2),
 ('acid', 2),
 ('dietary', 2),
 ('disease', 2),
 ('evidence', 2),
 ('use', 2),
 ('3', 2),
 ('be', 2),
 ('by', 2),
 ('large', 2),
 ('safe', 2),
 ('medicine', 2),
 ('an', 2),
 ('essential', 2),
 ('system', 2),
 ('world', 2),
 ('health', 2),
 ('medicines', 2),
 ('known', 1),
 ('ascorbic', 1),
 ('l-ascorbic', 1),
 ('found', 1),
 ('foodand', 1),
 ('used', 1),
 ('supplement', 1),
 ('scurvy', 1),
 ('prevented', 1)]

Many tokens shown above are not helpful in determining what the text is about.  We can use regular expressions to limit the desired tokens.

In [21]:
# import the module for regular expressions
import re

In [22]:
# this regular expression keeps only tokens made up entirely of one or more letters.
alphaTokens = [t for t in vTokens if re.match("^[a-zA-Z]+$",t)]

# generate another text, only with the alpha tokens
alphaText = nltk.Text(alphaTokens)
# build the frequency distribution again
fdistAlpha = nltk.FreqDist(alphaText)
fdistAlpha.most_common(50)

[('the', 16),
 ('and', 12),
 ('in', 11),
 ('vitamin', 10),
 ('is', 10),
 ('of', 9),
 ('c', 7),
 ('it', 5),
 ('as', 4),
 ('a', 4),
 ('or', 4),
 ('for', 4),
 ('may', 4),
 ('foods', 3),
 ('doses', 3),
 ('was', 3),
 ('also', 2),
 ('acid', 2),
 ('dietary', 2),
 ('disease', 2),
 ('evidence', 2),
 ('use', 2),
 ('be', 2),
 ('by', 2),
 ('large', 2),
 ('safe', 2),
 ('medicine', 2),
 ('an', 2),
 ('essential', 2),
 ('system', 2),
 ('world', 2),
 ('health', 2),
 ('medicines', 2),
 ('known', 1),
 ('ascorbic', 1),
 ('found', 1),
 ('foodand', 1),
 ('used', 1),
 ('supplement', 1),
 ('scurvy', 1),
 ('prevented', 1),
 ('treated', 1),
 ('with', 1),
 ('supplements', 1),
 ('does', 1),
 ('not', 1),
 ('support', 1),
 ('general', 1),
 ('population', 1),
 ('prevention', 1)]
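
Note that the letters-only pattern discards more than punctuation and digits. The short check below, using tokens taken from the tokenizer output above, shows that hyphenated terms are dropped as well; `ALPHA_ONLY`, `kept` and `dropped` are names chosen for this sketch.

```python
import re

ALPHA_ONLY = re.compile(r"^[a-zA-Z]+$")  # same pattern as in the cell above

# Tokens taken from the tokenizer output earlier in this part.
candidates = ["vitamin", "l-ascorbic", "c-containing", "1", "["]

kept = [t for t in candidates if ALPHA_ONLY.match(t)]
dropped = [t for t in candidates if not ALPHA_ONLY.match(t)]
print("kept:", kept)
print("dropped:", dropped)
```

If hyphenated terms matter for the analysis, the pattern would need to allow internal hyphens.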

**(TO-DO : Q6)** We've removed the non-alphabetic tokens.  Now continue the code below to remove both non-alphabetic tokens and *stopwords* from the set of tokens.  First explore what *stopwords* are by uncommenting the code below.  Then continue the code to show the top 50 tokens once the filtering is done.

In [25]:
# Q6 - ANSWER

nltk.download('stopwords')
# import the stopwords
from nltk.corpus import stopwords

setStopWords = set(stopwords.words('english'))
print(setStopWords)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\nick\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
{'be', 'its', "weren't", 'nor', 'her', 're', "it's", "wasn't", 'itself', "hasn't", 'mustn', 'the', 'can', 'as', 'weren', 'wouldn', 'll', 'about', 'does', 'while', 'so', 'but', 'we', 'against', 'i', 'to', 'any', 'been', "couldn't", "hadn't", 'should', "doesn't", 'haven', 'myself', 'hers', 'herself', 'them', 'than', 'he', 'my', 'a', 'such', "that'll", "you've", 'other', "mightn't", 'when', 'shan', "you're", 'through', 'don', 'in', 'shouldn', 'hasn', 'of', 'having', "mustn't", 'himself', 'yours', 'mightn', 'how', 'doing', 'until', 'below', 'by', 'these', 'from', 'some', "didn't", "you'll", 'up', 'is', 'were', "wouldn't", 'who', 'they', 'ours', 'isn', 'this', 'whom', 'his', 'further', 'didn', 'themselves', 'did', 'which', "haven't", 'why', 'because', 'hadn', "aren't", 'over', 'no', 'd', 'it', 'once', 'yourself', 'there', "shouldn't",

In [31]:
# Q6 - ANSWER (continue)

# build a filtered list of tokens: alphabetic tokens that are not stopwords
filtered_tokens = [t for t in alphaTokens if t not in setStopWords]
# we can reconstruct a Text object using the filtered list
filteredText = nltk.Text(filtered_tokens)

# build the frequency distribution again
fdistFiltered = nltk.FreqDist(filteredText)
# show top 50
fdistFiltered.most_common(50)

**(TO-DO : Q7)** Download a page and gather statistics about that page's content.  Write the code to show the top 50 tokens, after **lemmatizing, removing non-alphanumeric characters and removing stop words**.  You have all the coding elements in the previous questions, but you need to put them together here. You can use the *Beethoven* page I used, or change it to a page of your choice.  I'm using a package called *BeautifulSoup* to help pre-process HTML pages and get to the raw text.  

In [None]:
# This code will download a page from Wikipedia 

from urllib import request
url = "https://en.wikipedia.org/wiki/Ludwig_van_Beethoven"
response = request.urlopen(url)
raw = response.read().decode('utf8')
# print(raw[1000:2000])

In [None]:
# import from http://www.crummy.com/software/BeautifulSoup/

from bs4 import BeautifulSoup
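# name the parser explicitly to avoid a warning about the default choice
cleanRaw = BeautifulSoup(raw, "html.parser").get_text()
# show the first 1000 characters of the extracted text
print(cleanRaw[:1000])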
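
One possible way to chain the steps asked for in Q7 is sketched below in plain Python. It is an assumption about how the pieces might fit together, not a reference solution: `top_tokens` is a hypothetical helper, the lemmatizer is passed in as a function (for example the `lemmatize` method of a `WordNetLemmatizer`), and the demo tokens and stopword set are made up.

```python
import re
from collections import Counter

def top_tokens(tokens, stop_words, lemmatize=lambda t: t, n=50):
    """Lowercase and lemmatize tokens, keep alphabetic non-stopwords,
    and return the n most frequent (token, count) pairs."""
    alpha = re.compile(r"^[a-z]+$")
    lemmas = (lemmatize(t.lower()) for t in tokens)
    kept = [t for t in lemmas if alpha.match(t) and t not in stop_words]
    return Counter(kept).most_common(n)

# Made-up demo data; the default identity function stands in for a real lemmatizer.
demo_tokens = ["Beethoven", "was", "a", "composer", ".", "Beethoven", "wrote", "symphonies", "."]
demo_stop_words = {"was", "a"}  # tiny stand-in for stopwords.words('english')
print(top_tokens(demo_tokens, demo_stop_words, n=2))
```

Passing the lemmatizer as a parameter keeps the sketch independent of any downloaded NLTK resource; in the notebook you would supply the tokens obtained from `cleanRaw` and the NLTK stopword set.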

In [None]:
# Q7 - ANSWER
# Continue here... 



Signature
I, -------YOUR NAME--------------, declare that the answers provided in this notebook are my own.