# Natural Language Processing with `nltk`

`nltk` is the most popular Python package for Natural Language Processing. It provides algorithms for importing, cleaning, and pre-processing text data in human language, and for applying computational linguistics algorithms such as sentiment analysis.

## Inspect the Movie Reviews Dataset

`nltk` also includes many easy-to-use datasets in the `nltk.corpus` package. For example, we can download the `movie_reviews` dataset using the `nltk.download` function:

In [1]:
# Uncomment the below line and run this cell if you need to install nltk
#pip install nltk

In [2]:
#Run this cell for all the imports
import pandas as pd
import numpy as np
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

In [3]:
#Run this cell to download the dataset
nltk.download("movie_reviews")

[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\mary0\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


True

You can also list and download other datasets interactively by typing:

`nltk.download()`
    
in the Jupyter Notebook.

In [4]:
#Run this cell to import the dataset
from nltk.corpus import movie_reviews

In [5]:
#Run this cell for later use in tokenization
nltk.download('vader_lexicon')  # for sentiment analysis
nltk.download('punkt')  # for tokenizing

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\mary0\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mary0\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## 1. Tokenize Text in Words

In [6]:
#Run this cell
romeo_text = """Why then, O brawling love! O loving hate!
O any thing, of nothing first create!
O heavy lightness, serious vanity,
Misshapen chaos of well-seeming forms,
Feather of lead, bright smoke, cold fire, sick health,
Still-waking sleep, that is not what it is!
This love feel I, that feel no love in this."""

The first step in Natural Language Processing is generally to split the text into words. This process might appear simple, but handling all the corner cases is tedious. See, for example, all the issues with punctuation that we have to solve if we just start with a split on whitespace.

1.1 **Split `romeo_text` by spaces and print the resultant list of words** [0.5 pt]

In [7]:
romeo_text.split()

['Why',
 'then,',
 'O',
 'brawling',
 'love!',
 'O',
 'loving',
 'hate!',
 'O',
 'any',
 'thing,',
 'of',
 'nothing',
 'first',
 'create!',
 'O',
 'heavy',
 'lightness,',
 'serious',
 'vanity,',
 'Misshapen',
 'chaos',
 'of',
 'well-seeming',
 'forms,',
 'Feather',
 'of',
 'lead,',
 'bright',
 'smoke,',
 'cold',
 'fire,',
 'sick',
 'health,',
 'Still-waking',
 'sleep,',
 'that',
 'is',
 'not',
 'what',
 'it',
 'is!',
 'This',
 'love',
 'feel',
 'I,',
 'that',
 'feel',
 'no',
 'love',
 'in',
 'this.']

`nltk` provides a more sophisticated tokenizer. It relies on the `punkt` models for English, which we downloaded earlier in the notebook.

1.2 **Use the `nltk.word_tokenize(text)` function to properly tokenize `romeo_text` and store the result as `romeo_words`. Print the resultant list.** Compare it to the whitespace splitting we used above and describe the difference. [0.5 pt]

In [8]:
romeo_words = nltk.word_tokenize(romeo_text)
print(romeo_words)

['Why', 'then', ',', 'O', 'brawling', 'love', '!', 'O', 'loving', 'hate', '!', 'O', 'any', 'thing', ',', 'of', 'nothing', 'first', 'create', '!', 'O', 'heavy', 'lightness', ',', 'serious', 'vanity', ',', 'Misshapen', 'chaos', 'of', 'well-seeming', 'forms', ',', 'Feather', 'of', 'lead', ',', 'bright', 'smoke', ',', 'cold', 'fire', ',', 'sick', 'health', ',', 'Still-waking', 'sleep', ',', 'that', 'is', 'not', 'what', 'it', 'is', '!', 'This', 'love', 'feel', 'I', ',', 'that', 'feel', 'no', 'love', 'in', 'this', '.']


The difference is that `nltk.word_tokenize` separates punctuation marks such as exclamation marks and commas into their own tokens, while `split()` only breaks the text on whitespace, so punctuation stays attached to the words.
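
To make the comparison concrete without any downloads, the sketch below approximates the punctuation handling with a regular expression. The helper `simple_tokenize` is hypothetical and is not how `nltk.word_tokenize` works internally; it only illustrates why punctuation needs its own rule.

```python
import re

# Hypothetical helper: a rough regex approximation, NOT nltk's tokenizer.
def simple_tokenize(text):
    # Match a word (optionally hyphenated, e.g. "Still-waking"),
    # or any single character that is neither a word character nor whitespace.
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)

line = "Still-waking sleep, that is not what it is!"
whitespace_tokens = line.split()
regex_tokens = simple_tokenize(line)

print(whitespace_tokens)  # punctuation stays glued to 'sleep,' and 'is!'
print(regex_tokens)       # ',' and '!' become separate tokens
```

A pattern like this breaks down quickly on contractions, abbreviations, and numbers, which is the kind of corner case a dedicated tokenizer is meant to handle.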

## 2. Build a bag-of-words model

The simplest model for analyzing text treats it as an unordered collection of words (a bag of words). This is often enough to infer the category, the topic, or the sentiment of the text.

From the bag-of-words model we can build features to be used by a classifier. Here we assume that each word is a feature that can be either `True` or `False`.
We implement this in Python as a dictionary that maps each word in a sentence to `True`.

2.1 **Write a function `build_bag_of_words(words)` that returns such a dictionary with {word : True} format given a set of words. Call the function with `romeo_words` and print the resultant dictionary.** [1 pt]

In [9]:
def build_bag_of_words(words):
    # Map every word in the input to True; repeated words collapse into one key
    return {word: True for word in words}

In [10]:
build_bag_of_words(romeo_words)

{'Why': True,
 'then': True,
 ',': True,
 'O': True,
 'brawling': True,
 'love': True,
 '!': True,
 'loving': True,
 'hate': True,
 'any': True,
 'thing': True,
 'of': True,
 'nothing': True,
 'first': True,
 'create': True,
 'heavy': True,
 'lightness': True,
 'serious': True,
 'vanity': True,
 'Misshapen': True,
 'chaos': True,
 'well-seeming': True,
 'forms': True,
 'Feather': True,
 'lead': True,
 'bright': True,
 'smoke': True,
 'cold': True,
 'fire': True,
 'sick': True,
 'health': True,
 'Still-waking': True,
 'sleep': True,
 'that': True,
 'is': True,
 'not': True,
 'what': True,
 'it': True,
 'This': True,
 'feel': True,
 'I': True,
 'no': True,
 'in': True,
 'this': True,
 '.': True}

This is what we wanted, but we notice that punctuation like "!" and words that are useless for classification purposes, like "of" or "that", are also included.
Such words are called "stopwords", and `nltk` has a convenient corpus of them that we can download:

In [11]:
#Run this cell
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mary0\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [12]:
#Run this cell
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

Using the Python `string.punctuation` string and the English stopwords, we can build better features by filtering out the words that would not help in the classification.

2.2 **Create a list `useless_words` that combines the English stopwords and the punctuation characters.** [0.5 pt]

In [19]:
useless_words = nltk.corpus.stopwords.words("english") + list(string.punctuation)
#print(useless_words)

2.3 **Write a function `build_bag_of_words_features_filtered(words)` that returns a filtered bag of words: a dictionary with only the useful words as keys and 1 as the value. Call this function with `romeo_words` and print the resultant dictionary.** [1 pt]

In [14]:
def build_bag_of_words_features_filtered(words):
    # Keep only the words that are neither stopwords nor punctuation
    return {word: 1 for word in words if word not in useless_words}

In [21]:
build_bag_of_words_features_filtered(romeo_words)  # stopwords such as "is" and the punctuation are gone

{'Why': 1,
 'O': 1,
 'brawling': 1,
 'love': 1,
 'loving': 1,
 'hate': 1,
 'thing': 1,
 'nothing': 1,
 'first': 1,
 'create': 1,
 'heavy': 1,
 'lightness': 1,
 'serious': 1,
 'vanity': 1,
 'Misshapen': 1,
 'chaos': 1,
 'well-seeming': 1,
 'forms': 1,
 'Feather': 1,
 'lead': 1,
 'bright': 1,
 'smoke': 1,
 'cold': 1,
 'fire': 1,
 'sick': 1,
 'health': 1,
 'Still-waking': 1,
 'sleep': 1,
 'This': 1,
 'feel': 1,
 'I': 1}
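
Tokens such as `'This'` and `'I'` survive in the output above, most likely because the membership test is case-sensitive while the `nltk` English stopwords are lowercase. The following is a minimal sketch of a case-insensitive variant. The function `build_bag_of_words_lowercased` and the small `toy_stopwords` set are assumptions made for illustration, standing in for the full stopword list.

```python
import string

# Assumption: a tiny hand-written stopword set standing in for
# nltk.corpus.stopwords.words("english"), whose entries are lowercase.
toy_stopwords = {"this", "that", "i", "no", "in", "why"}
toy_useless = toy_stopwords | set(string.punctuation)

# Hypothetical variant of the filtered bag-of-words function.
def build_bag_of_words_lowercased(words, useless):
    # Lowercase each token before the membership test and use the
    # lowercased form as the key, so "This" and "this" become one feature.
    return {w.lower(): 1 for w in words if w.lower() not in useless}

tokens = ["This", "love", "feel", "I", ",", "that", "feel", "no", "love", "in", "this", "."]
features = build_bag_of_words_lowercased(tokens, toy_useless)
print(features)
```

Whether lowercasing helps depends on the task, since capitalization can itself carry information (for example, proper nouns).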

## 3. Frequencies of Words

It is common to explore a dataset before starting the analysis. In this section we will find the most common words and their frequencies.

3.1. Calling `movie_reviews.words()` (from the nltk corpus we imported previously) with no argument extracts the words of the entire dataset. Store them as `all_words` and check that there are about 1.6 million. [0.5 pt]

In [24]:
all_words = movie_reviews.words()
print(len(all_words))

1583820


3.2. Filter out the `useless_words` defined in the previous section and create a new list, `filtered_words`. This will reduce the length of the dataset by more than a factor of 2. (Hint: use a Python list comprehension.) [0.5 pt]

In [37]:
filtered_words = [w for w in all_words if w not in useless_words]
print(len(filtered_words))

710579
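
Because `useless_words` is a list, each `in` test scans it element by element, and that cost is paid once per token of the corpus. A set gives the same result with hash-based lookups. The sketch below shows the equivalence on toy data; the names `toy_useless_list`, `toy_useless_set`, and `toy_corpus` are placeholders, not part of the notebook.

```python
import string

# Assumption: toy data standing in for useless_words and movie_reviews.words().
toy_useless_list = ["the", "is", "a"] + list(string.punctuation)
toy_useless_set = set(toy_useless_list)  # same members, hash-based lookup
toy_corpus = ["the", "film", "is", "a", "classic", "."]

# Both comprehensions keep exactly the same words, in the same order.
filtered_with_list = [w for w in toy_corpus if w not in toy_useless_list]
filtered_with_set = [w for w in toy_corpus if w not in toy_useless_set]

print(filtered_with_list)
print(filtered_with_set)
```

On a corpus of about 1.6 million tokens the set version should be noticeably faster, although the exact speed-up depends on the machine.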


The `collections` module of the standard library contains a `Counter` class that is handy for counting the frequencies of the words in our list:

In [38]:
#Run this cell
from collections import Counter
word_counter = Counter(filtered_words)

Use the [most_common()](https://pythontic.com/containers/counter/most_common) method of `word_counter` and print the 10 most frequent words in the corpus as a list called `most_common_words`. [0.5 pt]

In [40]:
most_common_words = word_counter.most_common(10)
print(most_common_words)

[('film', 9517), ('one', 5852), ('movie', 5771), ('like', 3690), ('even', 2565), ('good', 2411), ('time', 2411), ('story', 2169), ('would', 2109), ('much', 2049)]
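
Raw counts are easier to compare once they are expressed as a share of all tokens. As a small sketch on made-up data (`toy_words` is a placeholder list, not the corpus), `most_common` and a relative-frequency calculation can be combined like this:

```python
from collections import Counter

# Assumption: a toy word list standing in for filtered_words.
toy_words = ["film", "one", "film", "movie", "film", "one"]
toy_counter = Counter(toy_words)

# most_common(n) returns (word, count) pairs sorted by descending count.
top_two = toy_counter.most_common(2)
print(top_two)

# Convert raw counts to relative frequencies (share of all tokens).
total = sum(toy_counter.values())
relative = {word: count / total for word, count in top_two}
print(relative)
```

Applied to the real output above, "film" accounts for 9517 of the 710579 filtered tokens, or roughly 1.3%.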


## 4. Sentiment Analysis [2 pt]

Using the sentiment intensity analyzer, loop over `list_sentences` and print the polarity scores of each sentence. (Hint: refer to the lecture notebook.)

In [41]:
#Run this cell
list_sentences = ["Hello, how are you?", "Today is a nice day", "I don't like the food at the cafe", "This is the worst pizza I have ever had.", "The orange juice is delicious!", "I am late to class." ]

In [51]:
sia = SentimentIntensityAnalyzer()            

In [64]:
for sentence in list_sentences:
    scores = sia.polarity_scores(sentence)
    print(scores)

{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
{'neg': 0.0, 'neu': 0.517, 'pos': 0.483, 'compound': 0.4215}
{'neg': 0.26, 'neu': 0.74, 'pos': 0.0, 'compound': -0.2755}
{'neg': 0.369, 'neu': 0.631, 'pos': 0.0, 'compound': -0.6249}
{'neg': 0.0, 'neu': 0.501, 'pos': 0.499, 'compound': 0.6114}
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
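
The `compound` value summarizes each sentence in a single number between -1 and 1. One way to turn it into a label is sketched below. The helper `label_from_compound` is hypothetical, and the ±0.05 cutoffs are a commonly suggested convention for VADER scores, not something defined in this notebook.

```python
# Assumption: the +/-0.05 cutoffs are a common convention for VADER's
# compound score; adjust them to suit the task.
def label_from_compound(compound, threshold=0.05):
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores copied from the output above, in sentence order.
compounds = [0.0, 0.4215, -0.2755, -0.6249, 0.6114, 0.0]
labels = [label_from_compound(c) for c in compounds]
print(labels)
```

With these cutoffs, "I am late to class." comes out as neutral, which shows that a lexicon-based score only reflects the words it recognizes as carrying sentiment.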


## 5. Neural Networks

### 5.1 Experiment [2.5 pt]

For this section, we will use the same notebook from the lecture videos and perform tests on it. We learned that neural networks consist of layers of interconnected neurons/nodes that make up the hidden layers with weights that work on incoming information using the activation function. 

Make a copy and run (run all) this [notebook](https://drive.google.com/file/d/1JtkYzdEHl1ijLvralyla_2bNgaPODCM0/view?usp=sharing). USE COLAB ONLY.

Perform the following experiments and report the test accuracy and time taken (sum of time taken across 10 epochs while fitting). 

1. Example (provided case) <br>
1st hidden layer - 128 nodes.<br>
Accuracy: 0.9781<br>
Time: 43s

For the subsequent experiments, you need to change cell [15] like this:

```
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),  # given hidden layer 1
    tf.keras.layers.Dense(128, activation='relu'),  # added this layer for Experiment 2
    tf.keras.layers.Dense(10)])
```

2. Experiment 2 <br>
1st hidden layer - 128 nodes.<br>
2nd hidden layer - 128 nodes.<br>

3. Experiment 3 <br>
1st hidden layer - 256 nodes.<br>


4. Experiment 4 <br>
1st hidden layer - 256 nodes.<br>
2nd hidden layer - 256 nodes.<br>


5. Experiment 5 <br>
1st hidden layer - 128 nodes.<br>
2nd hidden layer - 128 nodes.<br>
3rd hidden layer - 128 nodes.<br>
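
When comparing the experiments above, it can help to know how many trainable parameters each architecture has, since that is one rough indicator of computational cost. The sketch below counts them for fully connected layers, assuming the 28×28 input is flattened to 784 values and the output layer has 10 units, as in the model definition shown earlier. The helper `dense_param_count` is a hypothetical name, not a Keras function.

```python
# Hypothetical helper: counts weights and biases of a stack of Dense layers.
def dense_param_count(layer_sizes):
    # layer_sizes lists the width of every layer, input first.
    # Each Dense layer has (inputs * outputs) weights plus one bias per output.
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Input (784), hidden layers, output (10) for each experiment.
experiments = {
    "Experiment 1": [784, 128, 10],
    "Experiment 2": [784, 128, 128, 10],
    "Experiment 3": [784, 256, 10],
    "Experiment 4": [784, 256, 256, 10],
    "Experiment 5": [784, 128, 128, 128, 10],
}

param_counts = {name: dense_param_count(sizes)
                for name, sizes in experiments.items()}
for name, count in param_counts.items():
    print(f"{name}: {count} parameters")
```

Parameter count is only a proxy: measured training time also depends on the hardware, the batch size, and how well Colab's accelerator is utilized.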

  

In [None]:
import tensorflow as tf
# Experiment 1:
# Accuracy:      Time:    

### 5.2 Inference [0.5 pt]
What can you infer from the above experiments about accuracy and computational cost (time) when changing the number of layers or the number of nodes in each layer? Did the accuracy decrease unexpectedly in any of the experiments? If so, what could be the reason? [0.5 pt]

In [None]:
#ToDo