<a href="https://colab.research.google.com/github/mkane968/Text-Mining-Experiments/blob/main/NLTK/Tutorial%205-1%3A%20Intro%20to%20Sentiment%20Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial 5-1: Intro to Sentiment Analysis

Sentiment analysis is the practice of using algorithms to classify various samples of related text into overall positive and negative categories. With NLTK, you can employ these algorithms through powerful built-in machine learning operations to obtain insights from linguistic data.

Based on [Exercise B: Sentiment Analysis in Natural Language Processing with Python/NLTK by Luciano M. Guasco](https://github.com/luchux/ipython-notebook-nltk/blob/master/NLP%20-%20MelbDjango.ipynb)

### ***Step 1: Explore the movie_reviews corpus*** 

Import `movie_reviews` from NLTK, then display its README with the whitespace and quote characters cleaned up.

In [2]:
import nltk
nltk.download('movie_reviews')
from nltk.corpus import movie_reviews # These are movie reviews already separated as positive and negative.
movie_reviews.readme().replace('\n', ' ').replace('\t', '').replace('``', '"').replace("''", '"').replace('`', "'")

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.




If you want, you can print the file ids from `movie_reviews`. The list is very long, but it shows the structure of the ids and how each one begins with its label, "pos" or "neg".

In [None]:
#movie_reviews.fileids()

To determine how many movie reviews are in the corpus, print the length of the list of file ids

In [7]:
len(movie_reviews.fileids())

2000

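Here's an additional cleaning trick to get rid of `\'` in displayed text. When a string contains both `'` and `"`, the notebook shows every single quote escaped as `\'`. The chained `replace` calls below leave only single quotes in the text, so nothing needs escaping. Note that any genuine double quotes in the review are rewritten as single quotes too. See how it works with just one file.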

In [None]:
movie_reviews.raw("neg/cv000_29416.txt").replace("\n", "").replace("'", '"').replace('"', "'") 
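To see why the replacement changes how the string is displayed, here is a minimal sketch on a made-up string (it is not taken from the corpus). Python's `repr` only falls back to backslash escapes when a string contains both kinds of quote, so converting every double quote to a single quote leaves nothing to escape.

```python
# A made-up sample (not from the corpus) containing both quote characters.
sample = 'the "hero" can\'t act'

# With both ' and " present, repr() wraps the text in single quotes
# and escapes each inner single quote with a backslash.
print(repr(sample))

# After converting " to ', only single quotes remain, so repr() wraps
# the text in double quotes and has nothing to escape.
cleaned = sample.replace('"', "'")
print(repr(cleaned))
```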

### ***Step 2: Building and testing the classifier*** 

Before building the classifier, generate a list of stop words, which will NOT be considered when building the positive and negative word features. We'll import the English stop words from NLTK into `stops`, then use `stops.extend` to add punctuation and contraction fragments that we also want to exclude from classification. To check the full list, print `stops`.

In [10]:
import nltk
nltk.download('stopwords')  
from nltk.corpus import stopwords

stops = stopwords.words('english')
stops.extend('.,[,],(,),;,/,-,\',?,",:,<,>,n\'t,|,#,\'s,\",\'re,\'ve,\'ll,\'d,\'re'.split(','))
stops.extend(',')
#stops

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
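The two `extend` calls above rely on a couple of Python behaviours that are easy to miss. The sketch below illustrates them on a short made-up list (`demo_stops` is a hypothetical name, not the real stop word list).

```python
# demo_stops is a hypothetical stand-in for the real stop word list.
demo_stops = ['the', 'a']

# Splitting on ',' turns one string into many tokens. It is compact,
# but it means the comma itself can never be one of the tokens.
demo_stops.extend(".,;,n't,'s".split(','))

# extend() iterates over its argument, so a one-character string adds
# exactly that character. This is how the comma gets into the list.
demo_stops.extend(',')

print(demo_stops)
```

Calling `append(',')` instead of `extend(',')` would have the same effect and states the intent more directly.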


Import the NaiveBayes Classifier. Learn more about Naive Bayes [here](https://www.analyticsvidhya.com/blog/2021/01/a-guide-to-the-naive-bayes-algorithm/). 

In [14]:
from nltk.classify import NaiveBayesClassifier
import nltk.classify.util # Utility functions and classes for classifiers. Contains functions such as accuracy(classifier, gold)

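Define a function which, given a list of words, returns a dict of the form `{word: True}` with one entry for each word that is alphabetic and not a stop word. These dicts are the features we pass to the classifier.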

In [15]:
def word_feats(words):
    # Map each kept word to True; stop words and non-alphabetic tokens are dropped.
    return {word: True for word in words if word not in stops and word.isalpha()}
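As a quick illustration of what this function produces, here is a minimal sketch that applies the same logic to a handful of made-up tokens. The names `demo_stops` and `word_feats_demo` are hypothetical, and the stop list is passed in explicitly so the snippet runs on its own.

```python
# Hypothetical miniature stop list so the sketch runs on its own.
demo_stops = ['the', 'was', 'and', '.', ',']

def word_feats_demo(words, stop_list):
    """Same logic as word_feats, with the stop list passed in explicitly."""
    return {word: True for word in words
            if word not in stop_list and word.isalpha()}

tokens = ['the', 'plot', 'was', 'magnificent', ',', 'and', '10', 'plot', '.']
print(word_feats_demo(tokens, demo_stops))
# {'plot': True, 'magnificent': True}
```

Because the result is a dict, a repeated word such as `'plot'` appears only once: the features record whether a word is present, not how often it occurs.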

Create new variables holding the file ids of all positive and all negative movie reviews, and check their combined length (it should equal the length of the original file ids list).

In [16]:
pos_ids = movie_reviews.fileids('pos')
neg_ids = movie_reviews.fileids('neg')

len(pos_ids) + len(neg_ids) 

2000

For each review, we take its words, turn them into a feature dict with `word_feats`, and pair that dict with the review's label. The results are stored in a positive and a negative features list. You can print `pos_feats` to check that the words have loaded correctly, but the output is VERY long, since it includes the words from every positive review.


In [17]:
pos_feats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in pos_ids]
neg_feats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in neg_ids]

#pos_feats

Work out how many reviews make up 3/4 of each class; these will be used for training the classifier. Check the number of positive training reviews.

In [18]:
pos_len_train = int(len(pos_feats) * 3 / 4)
neg_len_train = int(len(neg_feats) * 3 / 4)

pos_len_train

750

Combine the first 3/4 of the negative and positive features into one training set and put the rest in the test set.

In [20]:
train_feats = neg_feats[:neg_len_train] + pos_feats[:pos_len_train]
test_feats = neg_feats[neg_len_train:] + pos_feats[pos_len_train:]
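The slicing keeps both sets balanced between the two classes. The sketch below reproduces the arithmetic on placeholder data, assuming 1000 reviews per class (consistent with the 2000 file ids and the 750 training reviews counted above). `demo_pos`, `demo_neg`, and the other names are hypothetical.

```python
# Placeholder (features, label) pairs standing in for pos_feats and neg_feats.
demo_pos = [({'review': i}, 'pos') for i in range(1000)]
demo_neg = [({'review': i}, 'neg') for i in range(1000)]

cut = int(len(demo_pos) * 3 / 4)  # 750 items per class go to training

demo_train = demo_neg[:cut] + demo_pos[:cut]
demo_test = demo_neg[cut:] + demo_pos[cut:]

print(len(demo_train), len(demo_test))  # 1500 500
print(sum(1 for _, label in demo_test if label == 'pos'))  # 250
```

Note that this split takes the first 3/4 of each class in file order rather than sampling at random; shuffling each list before slicing is a common alternative.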

Train a NaiveBayesClassifier with our training feature words.

In [21]:
classifier = NaiveBayesClassifier.train(train_feats)

Get accuracy of the classifier we have just trained.

In [22]:
print('Accuracy: ', nltk.classify.util.accuracy(classifier, test_feats))

Accuracy:  0.712
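Accuracy here is the share of test reviews whose predicted label matches the true label, so 0.712 corresponds to 356 of the 500 test reviews. A minimal sketch of that calculation follows; `accuracy_sketch` and the rule-based `toy_classify` are hypothetical stand-ins, not NLTK functions.

```python
def accuracy_sketch(classify, gold):
    """Fraction of (features, label) pairs whose predicted label matches."""
    correct = sum(1 for feats, label in gold if classify(feats) == label)
    return correct / len(gold)

# Hypothetical rule-based stand-in for classifier.classify.
def toy_classify(feats):
    return 'neg' if 'terrible' in feats else 'pos'

toy_gold = [
    ({'terrible': True}, 'neg'),
    ({'magnificent': True}, 'pos'),
    ({'dull': True}, 'neg'),  # the toy rule gets this one wrong
    ({'outstanding': True}, 'pos'),
]

print(accuracy_sketch(toy_classify, toy_gold))  # 0.75
```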


We can see which words fit best in each class by getting the classifier's most informative features. 

In [23]:
classifier.show_most_informative_features()

Most Informative Features
             magnificent = True              pos : neg    =     15.0 : 1.0
             outstanding = True              pos : neg    =     13.6 : 1.0
               insulting = True              neg : pos    =     13.0 : 1.0
              vulnerable = True              pos : neg    =     12.3 : 1.0
               ludicrous = True              neg : pos    =     11.8 : 1.0
                  avoids = True              pos : neg    =     11.7 : 1.0
             uninvolving = True              neg : pos    =     11.7 : 1.0
              astounding = True              pos : neg    =     10.3 : 1.0
             fascination = True              pos : neg    =     10.3 : 1.0
                 idiotic = True              neg : pos    =      9.8 : 1.0
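Each ratio compares how likely a word is to appear in a review of one class versus the other. As a rough sketch of where a figure like `15.0 : 1.0` can come from, the code below assumes the expected-likelihood smoothing that NLTK's Naive Bayes trainer uses by default (0.5 added to every count). The document counts are made up; the real counts behind the table are not shown in this tutorial.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """Expected-likelihood estimate: add gamma to each of `bins` outcomes."""
    return (count + gamma) / (total + gamma * bins)

# Hypothetical counts: training reviews in each class that contain some word.
n_train_per_class = 750
pos_with_word = 22
neg_with_word = 1

p_pos = smoothed_prob(pos_with_word, n_train_per_class)
p_neg = smoothed_prob(neg_with_word, n_train_per_class)

print(f"pos : neg = {p_pos / p_neg:.1f} : 1.0")
```

Words with the highest ratios tend to be rare ones that appear almost exclusively in one class, which is why the list is dominated by strong but uncommon adjectives.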


### ***Step 3: Classifying new data***

Tokenize a new sentence to test our classifier, keeping only the tokens that are NOT in the "stops" list we defined above.

In [25]:
import nltk
nltk.download('punkt')
from nltk import word_tokenize, pos_tag

sentence = "I feel so miserable, it makes me amazing"
tokens = [word for word in word_tokenize(sentence) if word not in stops]
tokens

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['I', 'feel', 'miserable', 'makes', 'amazing']
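Notice that `'I'` survives the filter. NLTK's English stop word list is all lowercase, so a capitalized token never matches it. Below is a minimal sketch of one way to handle this, using a small hypothetical stop list and tokens written out by hand so that it runs without any downloads.

```python
# Hypothetical excerpt of a lowercase stop list.
demo_stops = ['i', 'so', 'it', 'me', ',']

# Tokens for the sentence above, written out by hand.
raw_tokens = ['I', 'feel', 'so', 'miserable', ',', 'it', 'makes', 'me', 'amazing']

# Case-sensitive filter: 'I' is kept because only 'i' is in the list.
kept_as_is = [t for t in raw_tokens if t not in demo_stops]

# Lowercase first, then filter: 'i' is now removed as intended.
kept_lowered = [t.lower() for t in raw_tokens if t.lower() not in demo_stops]

print(kept_as_is)    # ['I', 'feel', 'miserable', 'makes', 'amazing']
print(kept_lowered)  # ['feel', 'miserable', 'makes', 'amazing']
```

Lowercasing new text is also likely to help matching in general, since the reviews in this corpus are stored in lowercase and a capitalized word would not line up with any feature seen during training.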

Turn the tokens into features using the `word_feats` function defined above.

In [27]:
feats = word_feats(tokens)
feats

{'I': True, 'amazing': True, 'feel': True, 'makes': True, 'miserable': True}

Use classifier to classify new sentence as either positive or negative. The result may not be what you expect!

In [28]:
classifier.classify(feats)

'pos'

Try classifying another sentence - go through the same tokenizing process.

In [30]:
sentence2 = "You are a pathetic fool, a terrible excuse for a human being."
tokens2 = [word for word in word_tokenize(sentence2) if word not in stops]
tokens2

['You', 'pathetic', 'fool', 'terrible', 'excuse', 'human']

Tag each token with its part of speech and, instead of retaining all tokens, keep only the adjectives by filtering with `if pos[1] == 'JJ'`.

In [32]:
import nltk
nltk.download('averaged_perceptron_tagger')
pos_tags2 = [pos for pos in pos_tag(tokens2) if pos[1] == 'JJ']
pos_tags2

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


[('pathetic', 'JJ'), ('terrible', 'JJ')]
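The filter works because `pos_tag` returns `(word, tag)` pairs, so index `1` of each pair is the tag. The sketch below shows the same filter on hand-written pairs. Only the two `'JJ'` tags are taken from the output above; the other tags are illustrative guesses, not real tagger output.

```python
# Hand-written (word, tag) pairs. Only the two 'JJ' tags come from the
# output above; the remaining tags are illustrative guesses.
demo_tagged = [
    ('You', 'PRP'), ('pathetic', 'JJ'), ('fool', 'NN'),
    ('terrible', 'JJ'), ('excuse', 'NN'), ('human', 'NN'),
]

# Exact match keeps plain adjectives only.
adjectives = [word for word, tag in demo_tagged if tag == 'JJ']

# A prefix match would also keep comparatives (JJR) and superlatives (JJS).
all_adjectives = [word for word, tag in demo_tagged if tag.startswith('JJ')]

print(adjectives)  # ['pathetic', 'terrible']
```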

Turn the reduced list of tokens into features for classification.

In [33]:
feats2 = word_feats([word for (word,_) in pos_tags2])
feats2

{'pathetic': True, 'terrible': True}

Use classifier to classify new sentence as either positive or negative.

In [34]:
classifier.classify(feats2)

'neg'

### ***Step 4: Incorporating bigram features***
To improve the classifier, you can add bigram features (pairs of adjacent words), which `nltk.util.ngrams` can generate. Single-word features lose this context: for instance, 'not funny' means something very different from 'funny'.
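A minimal sketch of what such a feature function might look like is shown here. `bigram_word_feats` is a hypothetical name, and the pairs are built with plain `zip` so the snippet runs without NLTK; `nltk.util.ngrams(words, 2)` yields the same pairs. Stop word filtering is deliberately left out, because NLTK's English stop word list contains 'not' and removing it first would destroy the very bigram we want to capture.

```python
def bigram_word_feats(words):
    """Unigram features plus one feature per pair of adjacent words."""
    words = list(words)
    feats = {word: True for word in words}
    # zip(words, words[1:]) pairs each word with its successor.
    feats.update({pair: True for pair in zip(words, words[1:])})
    return feats

demo_feats = bigram_word_feats(['not', 'funny', 'at', 'all'])
print(demo_feats)
```

The resulting dict has the same shape as the output of `word_feats`, with tuples such as `('not', 'funny')` as extra keys, so it could be paired with labels and passed to `NaiveBayesClassifier.train` in the same way.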