# NLTK Sentiment Analysis

https://www.nltk.org/api/nltk.sentiment.html


Sentiment analysis determines whether a piece of text expresses positive or negative sentiment.

In this notebook I decompose the sample code from the NLTK documentation so we can understand what it is doing:

https://www.nltk.org/howto/sentiment.html


In [2]:
# Import the sentiment analyser class
from nltk.sentiment import SentimentAnalyzer

We can always get help on any Python object like this:

In [3]:
help(SentimentAnalyzer)

Help on class SentimentAnalyzer in module nltk.sentiment.sentiment_analyzer:

class SentimentAnalyzer(builtins.object)
 |  SentimentAnalyzer(classifier=None)
 |  
 |  A Sentiment Analysis tool based on machine learning approaches.
 |  
 |  Methods defined here:
 |  
 |  __init__(self, classifier=None)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  add_feat_extractor(self, function, **kwargs)
 |      Add a new function to extract features from a document. This function will
 |      be used in extract_features().
 |      Important: in this step our kwargs are only representing additional parameters,
 |      and NOT the document we have to parse. The document will always be the first
 |      parameter in the parameter list, and it will be added in the extract_features()
 |      function.
 |      
 |      :param function: the extractor function to add to the list of feature extractors.
 |      :param kwargs: additional parameters required by the `function` 
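
To make the note about `kwargs` concrete, here is a minimal sketch with a made-up extractor. `extract_length_feats` and its `threshold` parameter are hypothetical names invented for this illustration; only `SentimentAnalyzer`, `add_feat_extractor` and `extract_features` come from NLTK.

```python
from nltk.sentiment import SentimentAnalyzer


def extract_length_feats(document, threshold=10):
    """Hypothetical extractor: the document (a list of tokens) is always the first argument."""
    return {"is_long": len(document) > threshold}


analyzer = SentimentAnalyzer()

# Only the extra parameter is registered here, not the document itself
analyzer.add_feat_extractor(extract_length_feats, threshold=3)

# extract_features() passes the document in as the first argument
features = analyzer.extract_features(["a", "small", "gem", "."])
print(features)
```

The four-token document is longer than the threshold of 3, so the single feature comes back as `True`.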

In [4]:
# Import the Naive Bayes Classification algorithm
from nltk.classify import NaiveBayesClassifier

# Import the subjectivity test corpus
from nltk.corpus import subjectivity

# Import the sentiment analysis libraries
from nltk.sentiment import SentimentAnalyzer

# Import the two utility functions used below (explicit names instead of a wildcard import)
from nltk.sentiment.util import extract_unigram_feats, mark_negation

## Grab our data set
The NLTK subjectivity corpus contains sample texts pre-classified as subjective or objective.

In [5]:
# Download subjectivity corpus
import nltk
nltk.download('subjectivity')

[nltk_data] Downloading package subjectivity to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\subjectivity.zip.


True

In [6]:
help(subjectivity)

Help on LazyCorpusLoader in module nltk.corpus.util object:

subjectivity = class CategorizedSentencesCorpusReader(nltk.corpus.reader.api.CategorizedCorpusReader, nltk.corpus.reader.api.CorpusReader)
 |  subjectivity(root, fileids, word_tokenizer=WhitespaceTokenizer(pattern='\\s+', gaps=True, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL), sent_tokenizer=None, encoding='utf8', **kwargs)
 |  
 |  A reader for corpora in which each row represents a single instance, mainly
 |  a sentence. Istances are divided into categories based on their file identifiers
 |  (see CategorizedCorpusReader).
 |  Since many corpora allow rows that contain more than one sentence, it is
 |  possible to specify a sentence tokenizer to retrieve all sentences instead
 |  than all rows.
 |  
 |  Examples using the Subjectivity Dataset:
 |  
 |  >>> from nltk.corpus import subjectivity
 |  >>> subjectivity.sents()[23]
 |  ['television', 'made', 'him', 'famous', ',', 'but', 'his', 'biggest', 'hits',
 

Have a look at the sentences in the corpus:

In [7]:
subjectivity.sents()

[['the', 'movie', 'begins', 'in', 'the', 'past', 'where', 'a', 'young', 'boy', 'named', 'sam', 'attempts', 'to', 'save', 'celebi', 'from', 'a', 'hunter', '.'], ['emerging', 'from', 'the', 'human', 'psyche', 'and', 'showing', 'characteristics', 'of', 'abstract', 'expressionism', ',', 'minimalism', 'and', 'russian', 'constructivism', ',', 'graffiti', 'removal', 'has', 'secured', 'its', 'place', 'in', 'the', 'history', 'of', 'modern', 'art', 'while', 'being', 'created', 'by', 'artists', 'who', 'are', 'unconscious', 'of', 'their', 'artistic', 'achievements', '.'], ...]

We can split the corpus into subjective and objective sentences. Subjective sentences generally express personal opinion, emotion or judgment, whereas objective sentences state factual information.

In [8]:
subjectivity.sents(categories='subj')

[['smart', 'and', 'alert', ',', 'thirteen', 'conversations', 'about', 'one', 'thing', 'is', 'a', 'small', 'gem', '.'], ['color', ',', 'musical', 'bounce', 'and', 'warm', 'seas', 'lapping', 'on', 'island', 'shores', '.', 'and', 'just', 'enough', 'science', 'to', 'send', 'you', 'home', 'thinking', '.'], ...]

In [9]:
subjectivity.sents(categories='obj')

[['the', 'movie', 'begins', 'in', 'the', 'past', 'where', 'a', 'young', 'boy', 'named', 'sam', 'attempts', 'to', 'save', 'celebi', 'from', 'a', 'hunter', '.'], ['emerging', 'from', 'the', 'human', 'psyche', 'and', 'showing', 'characteristics', 'of', 'abstract', 'expressionism', ',', 'minimalism', 'and', 'russian', 'constructivism', ',', 'graffiti', 'removal', 'has', 'secured', 'its', 'place', 'in', 'the', 'history', 'of', 'modern', 'art', 'while', 'being', 'created', 'by', 'artists', 'who', 'are', 'unconscious', 'of', 'their', 'artistic', 'achievements', '.'], ...]

In [10]:
# Get 100 of each (subjective and objective)
n_instances = 100
subj_docs = [(sent, 'subj') for sent in subjectivity.sents(categories='subj')[:n_instances]]
obj_docs = [(sent, 'obj') for sent in subjectivity.sents(categories='obj')[:n_instances]]
len(subj_docs), len(obj_docs)


(100, 100)

## Create training and test sets

In order to be scientific about the process, we split the documents into a training set, used to build the model, and a test set, used only for testing the model:

In [11]:
# Create a training and test set for subjective docs.  80% go in training, save 20% for testing
train_subj_docs = subj_docs[:80]
test_subj_docs = subj_docs[80:100]

# Create a training and test set for objective docs.  80% go in training, save 20% for testing
train_obj_docs = obj_docs[:80]
test_obj_docs = obj_docs[80:100]

# Merge the training sets and the test sets
training_docs = train_subj_docs+train_obj_docs
testing_docs = test_subj_docs+test_obj_docs



In [12]:
len(training_docs),len(testing_docs)

(160, 40)
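
The slices above always take the first 80 documents of each class for training. If the corpus order matters, a shuffled split is a common alternative. The sketch below is only an illustration: `split_docs` is a hypothetical helper, and the stand-in `docs` list just imitates the `(tokens, label)` shape of `subj_docs`.

```python
import random


def split_docs(docs, train_fraction=0.8, seed=0):
    """Hypothetical helper: shuffle a copy of docs, then cut it into train and test parts."""
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in data shaped like subj_docs: (tokens, label) tuples
docs = [(["token%d" % i], "subj") for i in range(100)]
train_part, test_part = split_docs(docs)
print(len(train_part), len(test_part))
```

Applying the same helper to `subj_docs` and `obj_docs` separately would keep the 80/20 balance per class, as the slicing does.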

## Prepare the sentiment analyser

In [13]:
# Create a sentiment analyser
sentim_analyzer = SentimentAnalyzer()

# Append _NEG to the words that follow a negation in each training doc, then collect all the words into one list
all_words_neg = sentim_analyzer.all_words([mark_negation(doc) for doc in training_docs])

When you see a line like the above, it's worth breaking it apart to see what it does:

In [14]:
# CHECK: to see what mark_negation does by isolating one document
mark_negation(training_docs[2])

(['it',
  'is',
  'not',
  'a_NEG',
  'mass-market_NEG',
  'entertainment_NEG',
  'but_NEG',
  'an_NEG',
  'uncompromising_NEG',
  'attempt_NEG',
  'by_NEG',
  'one_NEG',
  'artist_NEG',
  'to_NEG',
  'think_NEG',
  'about_NEG',
  'another_NEG',
  '.'],
 'subj')
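
In the output above the `_NEG` suffix runs from the word after `not` up to the final full stop. A smaller example on a plain list of tokens shows that the marking stops at clause-ending punctuation and that the input list is left untouched. The example sentence is my own choice for illustration.

```python
from nltk.sentiment.util import mark_negation

tokens = "I didn't like this movie . It was bad .".split()

# mark_negation works on a copy by default, so tokens itself is not modified
marked = mark_negation(tokens)
print(marked)
# ['I', "didn't", 'like_NEG', 'this_NEG', 'movie_NEG', '.', 'It', 'was', 'bad', '.']
```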

In [15]:
# Get help on all_words
help(sentim_analyzer.all_words)

Help on method all_words in module nltk.sentiment.sentiment_analyzer:

all_words(documents, labeled=None) method of nltk.sentiment.sentiment_analyzer.SentimentAnalyzer instance
    Return all words/tokens from the documents (with duplicates).
    
    :param documents: a list of (words, label) tuples.
    :param labeled: if `True`, assume that each document is represented by a
        (words, label) tuple: (list(str), str). If `False`, each document is
        considered as being a simple list of strings: list(str).
    :rtype: list(str)
    :return: A list of all words/tokens in `documents`.



In [16]:
# CHECK: to see what all_words does by isolating one document.
# all_words expects a list of documents, so wrap the single document in a list;
# passing the bare (words, label) tuple would also split the label into characters.
sentim_analyzer.all_words([mark_negation(training_docs[2])])

['it',
 'is',
 'not',
 'a_NEG',
 'mass-market_NEG',
 'entertainment_NEG',
 'but_NEG',
 'an_NEG',
 'uncompromising_NEG',
 'attempt_NEG',
 'by_NEG',
 'one_NEG',
 'artist_NEG',
 'to_NEG',
 'think_NEG',
 'about_NEG',
 'another_NEG',
 '.']

In [17]:
# CHECK: What is in all_words_neg now?
all_words_neg[:50]

['smart',
 'and',
 'alert',
 ',',
 'thirteen',
 'conversations',
 'about',
 'one',
 'thing',
 'is',
 'a',
 'small',
 'gem',
 '.',
 'color',
 ',',
 'musical',
 'bounce',
 'and',
 'warm',
 'seas',
 'lapping',
 'on',
 'island',
 'shores',
 '.',
 'and',
 'just',
 'enough',
 'science',
 'to',
 'send',
 'you',
 'home',
 'thinking',
 '.',
 'it',
 'is',
 'not',
 'a_NEG',
 'mass-market_NEG',
 'entertainment_NEG',
 'but_NEG',
 'an_NEG',
 'uncompromising_NEG',
 'attempt_NEG',
 'by_NEG',
 'one_NEG',
 'artist_NEG',
 'to_NEG']

In [18]:
len(all_words_neg)

3799

Next, remove infrequent words:

In [19]:
# Keep only words that occur more than min_freq (here 4) times, most frequent first
unigram_feats = sentim_analyzer.unigram_word_feats(all_words_neg, min_freq=4)
len(unigram_feats)


83

In [20]:
# Have a look at these words
unigram_feats[:50]

['.',
 'the',
 ',',
 'a',
 'and',
 'of',
 'to',
 'is',
 'in',
 'with',
 'it',
 'that',
 'his',
 'on',
 'for',
 'an',
 'who',
 'by',
 'he',
 'from',
 'her',
 '"',
 'film',
 'as',
 'this',
 'movie',
 'their',
 'but',
 'one',
 'at',
 'about',
 'the_NEG',
 'a_NEG',
 'to_NEG',
 'are',
 "there's",
 '(',
 'story',
 'when',
 'so',
 'be',
 ',_NEG',
 ')',
 'they',
 'you',
 'not',
 'have',
 'like',
 'will',
 'all']
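
To see the filtering idea in isolation, here is a rough stand-in written with `collections.Counter`. It is not NLTK's code: `frequent_words` is a hypothetical function, and the strict `>` comparison reflects my reading of how `min_freq` is applied, so treat it as an assumption.

```python
from collections import Counter


def frequent_words(words, min_freq=0):
    """Hypothetical stand-in: words occurring more than min_freq times, most frequent first."""
    counts = Counter(words)
    return [word for word, count in counts.most_common() if count > min_freq]


words = "the film is the film that the critics like".split()
print(frequent_words(words, min_freq=1))
# ['the', 'film']
```

Notice that nothing removes stopwords or punctuation, which is why tokens like `.` and `the` top the list above.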

In [21]:
# Use the extract_unigram_feats feature extractor when processing the doc
sentim_analyzer.add_feat_extractor(extract_unigram_feats, unigrams=unigram_feats)

## Apply the feature extractor to the training and test docs

In [22]:
# Apply the feature extractor to the docs
training_set = sentim_analyzer.apply_features(training_docs)
test_set = sentim_analyzer.apply_features(testing_docs)

In [23]:
# Each doc is now a dict of contains(word) flags, one per frequent word, saying whether that word occurs in the doc
training_set

[({'contains(.)': True, 'contains(the)': False, 'contains(,)': True, 'contains(a)': True, 'contains(and)': True, 'contains(of)': False, 'contains(to)': False, 'contains(is)': True, 'contains(in)': False, 'contains(with)': False, 'contains(it)': False, 'contains(that)': False, 'contains(his)': False, 'contains(on)': False, 'contains(for)': False, 'contains(an)': False, 'contains(who)': False, 'contains(by)': False, 'contains(he)': False, 'contains(from)': False, 'contains(her)': False, 'contains(")': False, 'contains(film)': False, 'contains(as)': False, 'contains(this)': False, 'contains(movie)': False, 'contains(their)': False, 'contains(but)': False, 'contains(one)': True, 'contains(at)': False, 'contains(about)': True, 'contains(the_NEG)': False, 'contains(a_NEG)': False, 'contains(to_NEG)': False, 'contains(are)': False, "contains(there's)": False, 'contains(()': False, 'contains(story)': False, 'contains(when)': False, 'contains(so)': False, 'contains(be)': False, 'contains(,_NEG)
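
A two-word vocabulary makes the `contains(...)` flags easier to read. The sketch below also shows the `handle_negation` option of `extract_unigram_feats`. As far as I can tell, the features in this notebook are extracted from the raw documents, so a `_NEG` feature can only become `True` when negation is marked at extraction time as well. The sentence and vocabulary are made up for illustration.

```python
from nltk.sentiment.util import extract_unigram_feats

document = "this is not good .".split()
unigrams = ["good", "good_NEG"]

# Default: the document is used as-is, so no token carries the _NEG suffix
plain = extract_unigram_feats(document, unigrams)

# handle_negation=True runs mark_negation on the document before checking membership
negated = extract_unigram_feats(document, unigrams, handle_negation=True)

print(plain)
print(negated)
```

The option could be passed through the analyser in the same way as `unigrams`, for example `add_feat_extractor(extract_unigram_feats, unigrams=unigram_feats, handle_negation=True)`.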

## Build the model

In [24]:
# Train the classifier using the training data
trainer = NaiveBayesClassifier.train
classifier = sentim_analyzer.train(trainer, training_set)

Training classifier
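
To get a feel for what the trainer consumes, here is a minimal sketch that calls `NaiveBayesClassifier.train` directly on four hand-made feature sets. The `toy_` names and the data are invented for illustration; the input has the same `(feature dict, label)` shape as `training_set`.

```python
from nltk.classify import NaiveBayesClassifier

# Same shape as training_set: a list of (feature dict, label) pairs
toy_training_set = [
    ({"contains(gem)": True}, "subj"),
    ({"contains(gem)": True}, "subj"),
    ({"contains(gem)": False}, "obj"),
    ({"contains(gem)": False}, "obj"),
]
toy_classifier = NaiveBayesClassifier.train(toy_training_set)

# classify() returns the most probable label for a feature dict
toy_label = toy_classifier.classify({"contains(gem)": True})
print(toy_label)

# prob_classify() exposes the probability assigned to each label
toy_dist = toy_classifier.prob_classify({"contains(gem)": True})
print(round(toy_dist.prob("subj"), 2))
```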


## Test the model using the test set

In [25]:
# Print out the evaluation metrics
for key, value in sorted(sentim_analyzer.evaluate(test_set).items()):
    print('{0}: {1}'.format(key, value))

Evaluating NaiveBayesClassifier results...
Accuracy: 0.8
F-measure [obj]: 0.8
F-measure [subj]: 0.8
Precision [obj]: 0.8
Precision [subj]: 0.8
Recall [obj]: 0.8
Recall [subj]: 0.8


## Examine the model
We can take a look at the features that the model has homed in on.  These features are the most important ones for distinguishing between subjective and objective documents.

In [26]:
classifier.show_most_informative_features(15)

Most Informative Features
           contains(her) = True              obj : subj   =      5.4 : 1.0
           contains(its) = True             subj : obj    =      5.0 : 1.0
          contains(more) = True             subj : obj    =      4.3 : 1.0
            contains(if) = True             subj : obj    =      3.7 : 1.0
            contains(it) = True             subj : obj    =      3.1 : 1.0
        contains(begins) = True              obj : subj   =      3.0 : 1.0
          contains(both) = True             subj : obj    =      3.0 : 1.0
           contains(him) = True              obj : subj   =      3.0 : 1.0
           contains(his) = True              obj : subj   =      3.0 : 1.0
          contains(life) = True              obj : subj   =      3.0 : 1.0
          contains(make) = True             subj : obj    =      3.0 : 1.0
            contains(on) = True              obj : subj   =      3.0 : 1.0
           contains(she) = True              obj : subj   =      3.0 : 1.0

In [27]:
# We can check the classification of each test doc
for doc in test_set:
    print(classifier.classify(doc[0]))

subj
subj
subj
obj
obj
subj
subj
obj
subj
subj
subj
subj
subj
subj
subj
subj
subj
obj
subj
subj
obj
obj
subj
obj
obj
obj
obj
subj
obj
obj
obj
subj
obj
obj
obj
obj
obj
obj
obj
subj
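
Because the test set was built as 20 subjective documents followed by 20 objective ones, the predictions above can be tallied by hand to check the evaluation metrics. The sketch below copies the 40 predicted labels from the output and recomputes the scores in plain Python. `scores` is a hypothetical helper, not an NLTK function.

```python
# Known labels: testing_docs = test_subj_docs + test_obj_docs
gold = ["subj"] * 20 + ["obj"] * 20

# Predictions copied from the output above, in the same order
predicted = (
    ["subj"] * 3 + ["obj"] * 2 + ["subj"] * 2 + ["obj"] + ["subj"] * 9 + ["obj"] + ["subj"] * 2
    + ["obj"] * 2 + ["subj"] + ["obj"] * 4 + ["subj"] + ["obj"] * 3 + ["subj"] + ["obj"] * 7 + ["subj"]
)


def scores(gold, predicted, label):
    """Hypothetical helper: precision, recall and F-measure for one label."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == label and p == label for g, p in pairs)  # correctly predicted as label
    fp = sum(g != label and p == label for g, p in pairs)  # wrongly predicted as label
    fn = sum(g == label and p != label for g, p in pairs)  # label missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
print("Accuracy:", round(accuracy, 2))
for label in ("obj", "subj"):
    print(label, [round(value, 2) for value in scores(gold, predicted, label)])
```

Each class has 16 correct and 4 incorrect predictions, which is where the uniform 0.8 values in the evaluation come from.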


We can also classify some "unseen" documents, from the subjectivity corpus documents we did not use in the model building:

In [28]:
# Grab some unseen documents
subj_docs_unseen = [(sent, 'subj') for sent in subjectivity.sents(categories='subj')[n_instances:n_instances+5]]
obj_docs_unseen = [(sent, 'obj') for sent in subjectivity.sents(categories='obj')[n_instances:n_instances+5]]

In [29]:
subj_docs_unseen[0]

(["it's",
  'a',
  'frightful',
  'vanity',
  'film',
  'that',
  ',',
  'no',
  'doubt',
  ',',
  'pays',
  'off',
  'what',
  'debt',
  'miramax',
  'felt',
  'they',
  'owed',
  'to',
  'benigni',
  '.'],
 'subj')

In [30]:
# Apply the same feature extractor to the unseen documents
subj_docs_unseen = sentim_analyzer.apply_features(subj_docs_unseen)
obj_docs_unseen = sentim_analyzer.apply_features(obj_docs_unseen)

In [31]:
# Classify them
print("Should be Subj:")
for doc in subj_docs_unseen:
    print(classifier.classify(doc[0]))
    
print("Should be Obj:")
for doc in obj_docs_unseen:
    print(classifier.classify(doc[0]))

Should be Subj:
subj
obj
subj
subj
subj
Should be Obj:
obj
obj
obj
subj
obj
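
The analyser can also classify a single tokenised sentence directly through its `classify` method, which applies the stored feature extractors for us. The sketch below runs the whole pipeline on four tiny made-up documents so that it works without downloading the corpus. The `toy_` names and sentences are hypothetical, and a model trained on so little data is only useful as a demonstration.

```python
from nltk.classify import NaiveBayesClassifier
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import extract_unigram_feats

# Made-up (tokens, label) documents in the same shape as training_docs
toy_docs = [
    (["a", "small", "gem"], "subj"),
    (["a", "warm", "and", "smart", "film"], "subj"),
    (["the", "movie", "begins", "in", "the", "past"], "obj"),
    (["a", "boy", "attempts", "to", "save", "her"], "obj"),
]

toy_analyzer = SentimentAnalyzer()

# Use every word as a feature (default min_freq of 0)
toy_unigrams = toy_analyzer.unigram_word_feats(toy_analyzer.all_words(toy_docs))
toy_analyzer.add_feat_extractor(extract_unigram_feats, unigrams=toy_unigrams)

# Train on the extracted features; the classifier is stored on the analyser
toy_analyzer.train(NaiveBayesClassifier.train, toy_analyzer.apply_features(toy_docs))

# classify() takes one unlabeled document (a list of tokens)
toy_prediction = toy_analyzer.classify("a smart small film".split())
print(toy_prediction)
```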


## Self-Contained example
https://data-and-design.readthedocs.io/en/latest/09-Machine-Learning-Intro.html

In [32]:

import nltk
from nltk.tokenize import word_tokenize

# Note: word_tokenize needs NLTK's Punkt tokenizer data to be installed (via nltk.download)

train = [("Great place to be when you are in Bangalore.", "pos"),
  ("The place was being renovated when I visited so the seating was limited.", "neg"),
  ("Loved the ambience, loved the food", "pos"),
  ("The food is delicious but not over the top.", "neg"),
  ("Service - Little slow, probably because too many people.", "neg"),
  ("The place is not easy to locate", "neg"),
  ("Mushroom fried rice was spicy", "pos"),
]

def tokens(text):
    # Lower-case every token so that membership tests match the lower-cased vocabulary
    return {token.lower() for token in word_tokenize(text)}

# Vocabulary: every lower-cased token in the training sentences
dictionary = set(word for passage in train for word in tokens(passage[0]))

# One feature dict per sentence: is each vocabulary word present?
t = [({word: (word in tokens(x[0])) for word in dictionary}, x[1]) for x in train]
classifier = nltk.NaiveBayesClassifier.train(t)

test_data = "Manchurian was hot and spicy"
test_data_features = {word: (word in tokens(test_data)) for word in dictionary}
print(classifier.classify(test_data_features))

pos
