### Natural Language Toolkit (NLTK)

This notebook introduces the basic text mining capabilities of the NLTK library, then combines them to process movie-review text and build a sentiment analysis model.

* [Text Mining Capabilities in NLTK](#first-bullet)
* [Sentiment Analysis with NLTK](#second-bullet)

In [None]:
!pip install nltk

In [None]:
import nltk
import random, re, os
from nltk.classify.scikitlearn import SklearnClassifier
from nltk.classify import ClassifierI
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from statistics import mode

In [None]:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

### Text Mining Capabilities in NLTK <a class="anchor" id="first-bullet"></a>

#### Sentence Tokenization
A sentence tokenizer splits a paragraph of text into sentences.

In [None]:
from nltk.tokenize import sent_tokenize

#text = 'An excellent documentry. I personally remember this growing up in NYC in the early 80\'s. This movie is for anyone that wasn\'t around during that time period.'
#text = 'A solid, if unremarkable film. Matthau, as Einstein, was wonderful. My favorite part, and the only thing that would make me go out of my way to see this again, was the wonderful scene with the physicists playing badmitton, I loved the sweaters and the conversation while they waited for Robbins to retrieve the birdie.'
text='FORBIDDEN PLANET is the best SF film from the golden age of SF cinema and what makes it a great film is its sense of wonder . As soon as the spaceship lands the audience - via the ships human crew - travels through an intelligent and sometimes terrifying adventure'
tokenized_text=sent_tokenize(text)

print(tokenized_text)

#### Word Tokenization
A word tokenizer splits a paragraph of text into individual words and punctuation tokens.

In [None]:
from nltk.tokenize import word_tokenize
tokenized_word=word_tokenize(text)
print(tokenized_word)

#### Stopwords
Stopwords are treated as noise in text analysis. They are common words such as *is*, *am*, *are*, *this*, *a*, *an*, and *the* that carry little meaning on their own, so we remove them before analysis.

To add a stopword to the `stop_words` set built in the next cell, use `stop_words.add('newWord')`.

In [None]:
#List of default stopwords in the NLTK corpus
from nltk.corpus import stopwords
stop_words=set(stopwords.words("english"))
print(stop_words)

In [None]:
# Remove stop words from tokenized text
tokenized_filtered_sent=[]
for w in tokenized_word:
    if w not in stop_words:
        tokenized_filtered_sent.append(w)
print("Tokenized Sentence:",tokenized_word)
print("-----------------------------------------------------------------------------------------------------------------")
print("Filtered Sentence:",tokenized_filtered_sent)
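The membership test above is case-sensitive, so capitalized tokens such as "As" are not matched against a lowercase stopword list. The sketch below shows one way to extend a stopword set and filter case-insensitively. The small hand-written set is an assumption that stands in for `stopwords.words("english")`, and the token list is made up for illustration.

```python
# A tiny hand-written stopword set standing in for NLTK's English list (assumption)
stop_words = {"is", "the", "a", "an", "and", "as", "it", "its", "of"}

# Sets grow with add() for one word or update() for several words
stop_words.add("film")
stop_words.update({"sf", "via"})

# Illustrative tokens, not real tokenizer output
tokens = ["As", "soon", "as", "the", "spaceship", "lands", "the", "SF", "film", "begins"]

# Lowercase each token before the lookup so "As" and "SF" are filtered too
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)
```

Whether to lowercase before filtering is a design choice: it removes more noise, but it also discards capitalization that later steps might use.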

#### Frequency Distribution

Plot the distribution of the most frequently occurring words in the text.

In [None]:
from nltk.probability import FreqDist
fdist = FreqDist(tokenized_filtered_sent)

In [None]:
# Frequency Distribution Plot
import matplotlib.pyplot as plt
%matplotlib inline
fdist.plot(30,cumulative=False)
plt.show()
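`FreqDist` is a subclass of Python's `collections.Counter`, so the counting behind the plot can be sketched with `Counter` alone. The token list below is made up for illustration; in the notebook the input would be `tokenized_filtered_sent`.

```python
from collections import Counter

# Illustrative tokens, not taken from the review text
tokens = ["great", "film", "great", "sense", "wonder", "film", "great"]

# Count how often each token occurs
counts = Counter(tokens)

# most_common(n) returns the n highest counts as (word, count) pairs
top_two = counts.most_common(2)
print(top_two)

# Relative frequency of each of the top words
total = sum(counts.values())
relative = {word: count / total for word, count in top_two}
print(relative)
```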

#### Lexicon Normalization
Lexicon normalization reduces related words to a common root word.  For example, the words "*connection*", "*connected*", "*connecting*" are reduced to a common root word "connect".

The two techniques for lexicon normalization are **Stemming** and **Lemmatization**.

#### Stemming
Stemming is a form of linguistic normalization that reduces a word to its root by chopping off derivational affixes. The resulting stem is not always a dictionary word.

Search engines use this technique when indexing pages: people write many different forms of the same word, and stemming maps all of them to the same root.

In [None]:
# Stemming
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

ps = PorterStemmer()

stemmed_words=[]
for w in tokenized_filtered_sent:
    stemmed_words.append(ps.stem(w))

print("Filtered Sentence:",tokenized_filtered_sent)
print("-----------------------------------------------------------------------------------------------------------------")
print("Stemmed Sentence:",stemmed_words)
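To make the idea of chopping off affixes concrete, here is a deliberately naive suffix stripper. It is not the Porter algorithm, which applies a much longer sequence of rules. The function name `naive_stem` and the suffix list are hypothetical and chosen only for this illustration.

```python
# Suffixes are checked in this order; the list is an illustrative assumption
SUFFIXES = ("ing", "ion", "ed", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

words = ["connection", "connected", "connecting", "connects"]
stems = [naive_stem(w) for w in words]
print(stems)
```

All four forms collapse to the same stem, which is the property a search index relies on. A real stemmer handles far more cases, so use `PorterStemmer` for actual work.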

#### Lemmatization
Lemmatization reduces a word to its base form, the lemma, which is a linguistically valid word. It relies on a vocabulary and morphological analysis, so it is usually more sophisticated than stemming, which works on an individual word without knowledge of its context. For example, the word "better" has "good" as its lemma. A stemmer misses this because finding it requires a dictionary look-up.

In [None]:
from nltk.stem.wordnet import WordNetLemmatizer

lem = WordNetLemmatizer()

from nltk.stem.porter import PorterStemmer
stem = PorterStemmer()

word = "flying"

# The default part of speech is noun; pass pos to lemmatize the word as a verb ("v"), noun ("n"), adjective ("a"), or adverb ("r"):

print("Lemmatized Word(verb): ",lem.lemmatize(word, pos="v"))

print("Lemmatized Word(noun): ", lem.lemmatize(word, pos="n"))

print("Lemmatized Word(adjective): ",lem.lemmatize(word, pos="a"))

print("Lemmatized Word(adverb): ", lem.lemmatize(word, pos="r"))

print("Stemmed Word:",stem.stem(word))
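The dictionary look-up that separates lemmatization from stemming can be sketched with a toy table keyed by word and part of speech. The table `LEMMAS` and the function `toy_lemmatize` are hypothetical. WordNet's lemmatizer is far more complete and this sketch does not reproduce its internals.

```python
# Hypothetical miniature lemma dictionary keyed by (word, part of speech)
LEMMAS = {
    ("better", "a"): "good",
    ("flying", "v"): "fly",
    ("geese", "n"): "goose",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma if the (word, pos) pair is known, else the word unchanged."""
    return LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("better", pos="a"))  # looked up as an adjective
print(toy_lemmatize("better"))           # default pos is noun, so no match
print(toy_lemmatize("flying", pos="v"))  # looked up as a verb
```

The same surface word maps to different lemmas depending on the part of speech, which is why the `pos` argument matters in the cell above.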

#### Synonyms
WordNet is a lexical database of English created at Princeton University and available as an NLTK corpus. You can use it with NLTK to look up word meanings, synonyms, antonyms, and more.

In [None]:
# List the synsets (sets of synonyms) for a specific word

from nltk.corpus import wordnet
syns = wordnet.synsets("delighted")
print(syns)

#### POS Tagging
The goal of part-of-speech (POS) tagging is to identify the grammatical category of each word (noun, pronoun, adjective, verb, adverb, and so on) based on its context. A POS tagger looks at relationships within the sentence and assigns a corresponding tag to each word.

See this [reference for interpreting POS tags](https://www.guru99.com/pos-tagging-chunking-nltk.html).

In [None]:
sent = "wonderful scene with the physicists playing badmitton, I loved the sweaters and the conversation"

In [None]:
tokens=nltk.word_tokenize(sent)
print(tokens)

In [None]:
nltk.pos_tag(tokens)
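`nltk.pos_tag` returns a list of `(word, tag)` tuples, and the Penn Treebank adjective tags (`JJ`, `JJR`, `JJS`) all start with `J`. The sketch below filters adjectives from such a list. The tagged pairs are written by hand for illustration and are not actual tagger output.

```python
# Hand-written (word, tag) pairs in the same shape that nltk.pos_tag returns
tagged = [
    ("wonderful", "JJ"),
    ("scene", "NN"),
    ("with", "IN"),
    ("the", "DT"),
    ("physicists", "NNS"),
    ("playing", "VBG"),
    ("better", "JJR"),
]

# Keep only words whose tag starts with an allowed prefix
allowed_prefixes = ("J",)
adjectives = [word for word, tag in tagged if tag.startswith(allowed_prefixes)]
print(adjectives)
```

Adding `"V"` or `"N"` to `allowed_prefixes` would keep verbs or nouns as well.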

### Sentiment Analysis with NLTK <a class="anchor" id="second-bullet"></a>

In this section we apply the NLTK text mining capabilities to extract a bag of words (BOW) from a set of movie reviews labeled positive or negative. The BOW serves as the input features for a model that predicts the sentiment of future reviews.

The labeled movie reviews are in the *pos_reviews.txt* and *neg_reviews.txt* files. The reviews are from 25000 IMDb reviews found [here](http://ai.stanford.edu/~amaas/data/sentiment/).

<font color='red'>Action Required:</font> Add the *pos_reviews.txt* and *neg_reviews.txt* files to this project before you execute the code cells below.


In [None]:
import numpy as np
import pandas as pd
from project_lib import Project
project = Project()

In [None]:
# run this code to read data from the project in WSD
pos_reviews=pd.read_csv(project.get_file('pos_reviews.txt'), sep='\t', names=['text'], header=None)
neg_reviews=pd.read_csv(project.get_file('neg_reviews.txt'), sep='\t', names=['text'], header=None)

In [None]:
pos_reviews.shape

In [None]:
neg_reviews.shape

In [None]:
# Subset the reviews to speed up processing. Lower numbers run faster but reduce accuracy; it is a tradeoff.
pos_reviews=pos_reviews[0:8000]
neg_reviews=neg_reviews[0:8000]

In [None]:
pos_reviews.shape

In [None]:
neg_reviews.shape

In [None]:
neg_reviews=neg_reviews.values.tolist()
pos_reviews=pos_reviews.values.tolist()

In [None]:
pos_reviews[0:3]

In [None]:
# Apply NLTK text mining capabilities to extract BOW from positive reviews

all_words = []
documents = []

from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))  # a set gives fast membership tests
ps = PorterStemmer()

# for the scope of our analysis, we will only extract adjectives
allowed_word_types = ["J"]

for p in  pos_reviews:
    # create a list of tuples where the first element is a review and the second element is the label, "pos"
    documents.append( (p[0], "pos") )
    
    # tokenize 
    tokenized = word_tokenize(p[0])
    
    # remove stopwords 
    stopped=[]
    for w in tokenized:
        if w not in stop_words:
            stopped.append(w)
    
    # normalize words
    stemmed_words=[]
    for k in stopped:
        stemmed_words.append(ps.stem(k))
    
    # parts of speech tagging for each word 
    pos = nltk.pos_tag(stemmed_words)
    
    # make a list of  all adjectives identified by the allowed word types list above
    for w in pos:
        if w[1][0] in allowed_word_types:
            all_words.append(w[0].lower())


In [None]:
all_words[0:5]

In [None]:
documents[0:2]

In [None]:
# Apply NLTK text mining capabilities to extract BOW from negative reviews

for n in neg_reviews:
    # create a list of tuples where the first element is a review and the second element is the label, "neg"
    documents.append( (n[0], "neg") )
    
    # tokenize 
    tokenized = word_tokenize(n[0])
    
    # remove stopwords 
    stopped=[]
    for w in tokenized:
        if w not in stop_words:
            stopped.append(w)
            
    # normalize words
    stemmed_words=[]
    for k in stopped:
        stemmed_words.append(ps.stem(k))
    
    # parts of speech tagging for each word 
    neg = nltk.pos_tag(stemmed_words)
    
    # make a list of  all adjectives identified by the allowed word types list above
    for w in neg:
        if w[1][0] in allowed_word_types:
            all_words.append(w[0].lower())

In [None]:
len(all_words)

In [None]:
# Create a frequency distribution of the adjectives
all_words = nltk.FreqDist(all_words)

In [None]:
# Frequency Distribution Plot
import matplotlib.pyplot as plt
%matplotlib inline
all_words.plot(30,cumulative=False)
plt.show()

In [None]:
# Select the 500 most frequent words. Lower numbers run faster but reduce model accuracy; it is a tradeoff.
# most_common() sorts by count; taking the first 500 keys would only return the first 500 words seen.
word_features = [word for word, count in all_words.most_common(500)]

In [None]:
# Create a dictionary of features for a single review.
# The keys are the words in word_features.
# Each value is True or False, depending on whether that word appears in the review.
# Tokens are stemmed and lowercased first so they match the stemmed words in word_features.

def find_features(document):
    words = {ps.stem(w).lower() for w in word_tokenize(document)}
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features


For each review, create a tuple. The first element is a dictionary whose keys are the 500 words from the BOW and whose values are True if the word appears in the review and False otherwise. The second element is the label: 'pos' for positive reviews and 'neg' for negative reviews.

An example of a tuple feature set for a given review:
`({'great': True, 'bad': False, 'horrible': False}, 'pos')`
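A tuple of this shape can be built with a few lines of plain Python. In the sketch below, `str.split` stands in for `word_tokenize`, the three-word feature list is made up, and `toy_find_features` is a hypothetical name used only here.

```python
# Illustrative feature vocabulary; the notebook uses its own word_features list
word_features = ["great", "bad", "horrible"]

def toy_find_features(document):
    """Map each feature word to True or False depending on its presence."""
    # str.split is a simplified stand-in for word_tokenize (assumption)
    words = set(document.lower().split())
    return {w: (w in words) for w in word_features}

# Pair the feature dictionary with the review's label
feature_tuple = (toy_find_features("A great film with a great cast"), "pos")
print(feature_tuple)
```

Converting the tokens to a set keeps each membership test fast, which matters when the loop runs over hundreds of feature words for thousands of reviews.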

In [None]:
# Creating features for each review
featuresets = [(find_features(rev), category) for (rev, category) in documents]

In [None]:
# Shuffle the feature sets so positive and negative reviews are mixed
random.shuffle(featuresets)

In [None]:
len(documents)

In [None]:
len(featuresets)

In [None]:
# Split featuresets into a training set and a testing set. Adjust the index based on the size of featuresets.
# The two slices must not overlap, otherwise the model is tested on reviews it was trained on.
training_set = featuresets[:12000]
testing_set = featuresets[12000:]
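One way to avoid hard-coding the index is to compute it from the length of the list. The sketch below uses a small synthetic list in place of the real feature sets and assumes a 75/25 split. Both the data and the ratio are illustrative choices.

```python
import random

# Synthetic stand-in for the real featuresets list (assumption)
featuresets = [({"great": i % 2 == 0}, "pos" if i % 2 == 0 else "neg") for i in range(16)]

# Fix the seed so the shuffle is repeatable, then mix the two classes
random.seed(0)
random.shuffle(featuresets)

# Use the first 75% for training and the remainder for testing
split = int(len(featuresets) * 0.75)
training_set = featuresets[:split]
testing_set = featuresets[split:]

print(len(training_set), len(testing_set))
```

Slicing at a single index keeps the two sets disjoint, so the reported accuracy reflects reviews the classifier has not seen.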

In [None]:
# train the model
classifier = nltk.NaiveBayesClassifier.train(training_set)

In [None]:

print("Classifier accuracy percent:",(nltk.classify.accuracy(classifier, testing_set))*100)


**List the most informative features**

The ratio shown next to each feature indicates how much more often the word appears in one class than in the other. These ratios are known as likelihood ratios. For example, the word 'flawless' is 16 times more likely to occur in a positive review than in a negative review, and the word 'horrid' is 12 times more likely to occur in a negative review than in a positive review.

In [None]:
# List the top 15 most informative features
classifier.show_most_informative_features(15)
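A likelihood ratio of this kind can be worked out by hand from review counts. The counts below are made up, and the add-one smoothing is an assumption chosen for simplicity. NLTK's classifier uses its own probability estimator, so its numbers will differ.

```python
# Made-up counts: how many reviews in each class contain the word
pos_with_word, pos_total = 31, 1000
neg_with_word, neg_total = 1, 1000

# Add-one smoothing over the two outcomes (word present or absent)
# keeps a probability from being exactly zero
p_word_given_pos = (pos_with_word + 1) / (pos_total + 2)
p_word_given_neg = (neg_with_word + 1) / (neg_total + 2)

# The likelihood ratio compares the two conditional probabilities
ratio = p_word_given_pos / p_word_given_neg
print(f"pos : neg = {ratio:.1f} : 1.0")
```

A ratio far from 1 in either direction marks a word that strongly separates the two classes.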

**Additional References**:
1. [NLTK Reference](https://www.nltk.org/book/)
2. [NLP Tutorial using NLTK](https://likegeeks.com/nlp-tutorial-using-python-nltk/amp/#)
3. [Basic Sentiment Analysis using NLTK](https://towardsdatascience.com/basic-binary-sentiment-analysis-using-nltk-c94ba17ae386)

**Author**: Sidney Phoon <br/>
**Date**: Jan 22, 2020