# Dictionary-Based Sentiment Analyzer

* Word tokenization
* Sentence tokenization
* Scoring of the reviews
* Comparison of the scores with the reviews in plots
* Measuring the distribution
* Handling negation
* Adjusting your dictionary-based sentiment analyzer
* Checking your results

In [1]:
# all imports and related

%matplotlib inline

import pandas as pd
import numpy as np
import altair as alt

from nltk import download as nltk_download
from nltk.tokenize import word_tokenize, sent_tokenize

nltk_download('punkt')  # required by word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

### load the small_corpus CSV

Run the process in
[create_dataset.ipynb](https://github.com/oonid/growth-hacking-with-nlp-sentiment-analysis/blob/master/create_dataset.ipynb)
to generate the corpus.

Then copy the file **small_corpus.csv** into this Google Colab session's Files (via file upload or a mounted drive).


In [2]:
df = pd.read_csv('small_corpus.csv')
df

| | ratings | reviews |
|---|---|---|
| 0 | 1 | Recently UBISOFT had to settle a huge class-ac... |
| 1 | 1 | code didn't work, got me a refund. |
| 2 | 1 | these do not work at all, all i get is static ... |
| 3 | 1 | well let me start by saying that when i first ... |
| 4 | 1 | Dont waste your money, you will just end up us... |
| ... | ... | ... |
| 4495 | 5 | Nice long micro USB cable, battery lasts a lon... |
| 4496 | 5 | I've been having a great time with this game. ... |
| 4497 | 5 | d |
| 4498 | 5 | Really pretty, funny, interesting game. Works ... |


In [3]:
# check whether any column contains null values (the reviews column does)
df.isnull().any()

ratings    False
reviews     True
dtype: bool

In [4]:
# replace null values in the reviews column with an empty string ''
df.reviews = df.reviews.fillna('')

# test again
df.isnull().any()

ratings    False
reviews    False
dtype: bool

In [5]:
rating_list = list(df['ratings'])
review_list = list(df['reviews'])

print(rating_list[:5])
for r in review_list[:5]:
    print('--\n{}'.format(r))

[1, 1, 1, 1, 1]
--
Recently UBISOFT had to settle a huge class-action suit brought against the company for bundling (the notoriously harmful) StarFORCE DRM with its released games. So what the geniuses at the helm do next? They decide to make the same mistake yet again - by choosing the same DRM scheme that made BIOSHOCK, MASS EFFECT and SPORE infamous: SecuROM 7.xx with LIMITED ACTIVATIONS!

MASS EFFECT can be found in clearance bins only months after its release; SPORE not only undersold miserably but also made history as the boiling point of gamers lashing back, fed up with idiotic DRM schemes. And the clueless MBAs that run an art-form as any other commodity business decided that, "hey, why not jump into THAT mud-pond ourselves?"

The original FAR CRY was such a GREAT game that any sequel of it would have to fight an uphill battle to begin with (especially without its original developing team). Now imagine shooting this sequel on the foot with a well known, much hated and totally u

### tokenize the sentences and words of the reviews

In [6]:
word_tokenized = df['reviews'].apply(word_tokenize)
word_tokenized

0       [Recently, UBISOFT, had, to, settle, a, huge, ...
1        [code, did, n't, work, ,, got, me, a, refund, .]
2       [these, do, not, work, at, all, ,, all, i, get...
3       [well, let, me, start, by, saying, that, when,...
4       [Dont, waste, your, money, ,, you, will, just,...
                              ...                        
4495    [Nice, long, micro, USB, cable, ,, battery, la...
4496    [I, 've, been, having, a, great, time, with, t...
4497                                                  [d]
4498    [Really, pretty, ,, funny, ,, interesting, gam...
4499    [i, had, a, lot, of, fun, playing, this, game,...
Name: reviews, Length: 4500, dtype: object

In [7]:
sent_tokenized = df['reviews'].apply(sent_tokenize)
sent_tokenized

0       [Recently UBISOFT had to settle a huge class-a...
1                    [code didn't work, got me a refund.]
2       [these do not work at all, all i get is static...
3       [well let me start by saying that when i first...
4       [Dont waste your money, you will just end up u...
                              ...                        
4495    [Nice long micro USB cable, battery lasts a lo...
4496    [I've been having a great time with this game....
4497                                                  [d]
4498    [Really pretty, funny, interesting game., Work...
4499    [i had a lot of fun playing this game, if your...
Name: reviews, Length: 4500, dtype: object

### download the opinion lexicon of NLTK

Use it with reference to its source:

https://www.nltk.org/_modules/nltk/corpus/reader/opinion_lexicon.html



In [8]:
# imports and related

nltk_download('opinion_lexicon')

from nltk.corpus import opinion_lexicon

[nltk_data] Downloading package opinion_lexicon to /root/nltk_data...
[nltk_data]   Unzipping corpora/opinion_lexicon.zip.


In [9]:
print('total lexicon words: {}'.format(len(opinion_lexicon.words())))
print('total lexicon negatives: {}'.format(len(opinion_lexicon.negative())))
print('total lexicon positives: {}'.format(len(opinion_lexicon.positive())))
print('sample of lexicon words (first 10, by id):')
print(opinion_lexicon.words()[:10])  # print first 10 sorted by file id
print('sample of lexicon words (first 10, by alphabet):')
print(sorted(opinion_lexicon.words())[:10])  # first 10 in alphabetical order

positive_set = set(opinion_lexicon.positive())
negative_set = set(opinion_lexicon.negative())
print(len(positive_set))
print(len(negative_set))


total lexicon words: 6789
total lexicon negatives: 4783
total lexicon positives: 2006
sample of lexicon words (first 10, by id):
['2-faced', '2-faces', 'abnormal', 'abolish', 'abominable', 'abominably', 'abominate', 'abomination', 'abort', 'aborted']
sample of lexicon words (first 10, by alphabet):
['2-faced', '2-faces', 'a+', 'abnormal', 'abolish', 'abominable', 'abominably', 'abominate', 'abomination', 'abort']
2006
4783


In [10]:
def simple_opinion_test(word):
    """Print whether a single word is negative, positive, or not in the lexicon."""
    # use positive_set and negative_set (built in the previous cell) so each
    # lookup is a set membership test instead of a scan of the corpus files
    if word not in positive_set and word not in negative_set:
        print('{} not covered on opinion_lexicon'.format(word))
    else:
        if word in negative_set:
            print('{} is negative'.format(word))
        if word in positive_set:
            print('{} is positive'.format(word))

simple_opinion_test('awful')
simple_opinion_test('beautiful')
simple_opinion_test('useless')
simple_opinion_test('Great')  # not found: lexicon entries are lower case
simple_opinion_test('warming')


awful is negative
beautiful is positive
useless is negative
Great not covered on opinion_lexicon
warming not covered on opinion_lexicon
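
The `Great` result above comes from case sensitivity, not from a gap in the lexicon. Below is a minimal sketch of a case-insensitive lookup. It uses two small hand-written sets (`toy_positive`, `toy_negative`) as stand-ins for the NLTK lexicon, so the entries and the helper name `classify_word` are assumptions made for illustration only.

```python
# Toy stand-ins for the NLTK opinion lexicon (assumed entries, illustration only).
toy_positive = {"beautiful", "great"}
toy_negative = {"awful", "useless"}


def classify_word(word):
    """Return 'positive', 'negative', or 'unknown' for a single word."""
    w = word.lower()  # normalize case before the lookup
    if w in toy_positive:
        return "positive"
    if w in toy_negative:
        return "negative"
    return "unknown"


for word in ["awful", "beautiful", "Great", "warming"]:
    print(word, "->", classify_word(word))
```

With the real lexicon, the same idea would apply to `positive_set` and `negative_set`: lower-case the word first, then test membership.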


### classify each review on a scale of -1 to +1

In [0]:
# how a review is scored:
# * split the review (possibly several sentences) into sentences
# * score each sentence from its words
# * average the sentence scores

def score_sentence(sentence):
    """Score one sentence, given as a list of word tokens.

    Returns a value between -1 and 1:
    above 0 when positive words outnumber negative words,
    below 0 when negative words outnumber positive words,
    and 0 when they balance or no token qualifies.
    """
    # lexicon entries are lower case; keep only alphanumeric tokens, which
    # drops punctuation but also tokens such as "n't" and hyphenated words
    selective_words = [w.lower() for w in sentence if w.isalnum()]
    total_selective_words = len(selective_words)
    # count total words that categorized as positive from opinion lexicon
    total_positive = len([w for w in selective_words if w in positive_set])
    # count total words that categorized as negative from opinion lexicon
    total_negative = len([w for w in selective_words if w in negative_set])

    if total_selective_words > 0:  # has at least 1 word to categorize
        return (total_positive - total_negative) / total_selective_words
    else:  # no selective words
        return 0

def score_review(review):
    """Score a single review, which may contain several sentences.

    The review is tokenized into sentences and each sentence into words.
    Each sentence is scored and the scores are collected in a list.
    score of review = sum of all sentence scores / number of sentences
    """
    sentiment_scores = []
    sentences = sent_tokenize(review)
    # process per sentence
    for sentence in sentences:
        # tokenize the sentence into words
        words = word_tokenize(sentence)
        # calculate score per sentence, passing tokenized words as input
        sentence_score = score_sentence(words)
        # add to list of sentiment scores
        sentiment_scores.append(sentence_score)
    # mean value = sum of all sentence scores / number of sentence scores
    if sentiment_scores:  # has at least 1 sentence score
        return sum(sentiment_scores) / len(sentiment_scores)
    else:  # return 0 if no sentiment_scores, avoid division by zero
        return 0
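
To make the arithmetic concrete, here is a small worked sketch that applies the same formula to the tokens of the second review ("code didn't work, got me a refund."), as shown in the `word_tokenize` output earlier. It does not load NLTK: the sets `toy_positive` and `toy_negative` are assumptions, chosen so that two words match as positive, and `score_tokens` is a hypothetical helper that mirrors `score_sentence`.

```python
# Assumed lexicon matches for this one sentence (not read from NLTK).
toy_positive = {"work", "refund"}
toy_negative = set()


def score_tokens(tokens):
    """(positive count - negative count) / number of alphanumeric tokens."""
    words = [t.lower() for t in tokens if t.isalnum()]
    if not words:
        return 0.0
    pos = sum(w in toy_positive for w in words)
    neg = sum(w in toy_negative for w in words)
    return (pos - neg) / len(words)


# Tokens as printed by word_tokenize for review 1.
tokens = ["code", "did", "n't", "work", ",", "got", "me", "a", "refund", "."]

kept = [t for t in tokens if t.isalnum()]  # "n't", "," and "." are dropped
score = score_tokens(tokens)

# The review has a single sentence, so the review score is the mean of one value.
sentence_scores = [score]
review_score = sum(sentence_scores) / len(sentence_scores)

print(kept)
print(len(kept), score, review_score)
```

Seven tokens survive the filter and two of them count as positive, giving 2 / 7. The `n't` token is removed by `isalnum()`, so the negation leaves no trace in the score.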


In [12]:
review_sentiments = [score_review(r) for r in review_list]
print(review_sentiments[:5])

[-0.013158071747989037, 0.2857142857142857, 0.0, -0.02052123414370017, 0.0]


In [13]:
print(rating_list[:5])
print(review_sentiments[:5])
for r in review_list[:5]:
    print('--\n{}'.format(r))

[1, 1, 1, 1, 1]
[-0.013158071747989037, 0.2857142857142857, 0.0, -0.02052123414370017, 0.0]
--
Recently UBISOFT had to settle a huge class-action suit brought against the company for bundling (the notoriously harmful) StarFORCE DRM with its released games. So what the geniuses at the helm do next? They decide to make the same mistake yet again - by choosing the same DRM scheme that made BIOSHOCK, MASS EFFECT and SPORE infamous: SecuROM 7.xx with LIMITED ACTIVATIONS!

MASS EFFECT can be found in clearance bins only months after its release; SPORE not only undersold miserably but also made history as the boiling point of gamers lashing back, fed up with idiotic DRM schemes. And the clueless MBAs that run an art-form as any other commodity business decided that, "hey, why not jump into THAT mud-pond ourselves?"

The original FAR CRY was such a GREAT game that any sequel of it would have to fight an uphill battle to begin with (especially without its original developing team). Now imagine 
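
The second review ("code didn't work, got me a refund.") is a 1-star review yet scores about 0.29, which points at the negation problem listed in the outline. One possible way to handle it, sketched below under stated assumptions, is to flip the polarity of a lexicon word that directly follows a negator. The negator list `NEGATORS`, the toy lexicon sets, and the helper `score_with_negation` are all hypothetical and are not part of the notebook or of NLTK.

```python
# Assumed negators and toy lexicon entries (illustration only).
NEGATORS = {"not", "no", "never", "n't"}
toy_positive = {"work", "refund"}
toy_negative = {"waste"}


def score_with_negation(tokens):
    """Score tokens, flipping the polarity of a word right after a negator."""
    total = 0
    counted = 0
    negate = False
    for tok in tokens:
        t = tok.lower()
        if t.isalnum():
            counted += 1  # same denominator as the plain scorer
        if t in NEGATORS:
            negate = True
            continue
        if not t.isalnum():
            negate = False  # punctuation ends the negation scope
            continue
        polarity = (t in toy_positive) - (t in toy_negative)
        total += -polarity if negate else polarity
        negate = False  # a negator only affects the next word
    return total / counted if counted else 0.0


tokens = ["code", "did", "n't", "work", ",", "got", "me", "a", "refund", "."]
print(score_with_negation(tokens))
print(score_with_negation(["do", "not", "work"]))
```

In this sketch "work" after "n't" counts as -1 and "refund" as +1, so the sentence nets out to zero instead of a positive score. A one-word scope is a deliberate simplification; longer scopes would need their own rules.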

In [14]:
df = pd.DataFrame({
    "rating": rating_list,
    "review": review_list,
    "review dictionary based sentiment": review_sentiments,
})
df

| | rating | review | review dictionary based sentiment |
|---|---|---|---|
| 0 | 1 | Recently UBISOFT had to settle a huge class-ac... | -0.013158 |
| 1 | 1 | code didn't work, got me a refund. | 0.285714 |
| 2 | 1 | these do not work at all, all i get is static ... | 0.000000 |
| 3 | 1 | well let me start by saying that when i first ... | -0.020521 |
| 4 | 1 | Dont waste your money, you will just end up us... | 0.000000 |
| ... | ... | ... | ... |
| 4495 | 5 | Nice long micro USB cable, battery lasts a lon... | 0.058824 |
| 4496 | 5 | I've been having a great time with this game. ... | 0.222222 |
| 4497 | 5 | d | 0.000000 |
| 4498 | 5 | Really pretty, funny, interesting game. Works ... | 0.455556 |
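
As a first step toward comparing scores with ratings, the sketch below groups sentiment scores by rating and takes the mean of each group. It uses only the nine rows visible in the table above, copied by hand, so the numbers describe that sample and not the full 4500-review corpus. The names `rows`, `by_rating`, and `mean_by_rating` are assumptions introduced for this example.

```python
from collections import defaultdict
from statistics import mean

# (rating, sentiment score) pairs copied from the rows displayed above.
rows = [
    (1, -0.013158), (1, 0.285714), (1, 0.0), (1, -0.020521), (1, 0.0),
    (5, 0.058824), (5, 0.222222), (5, 0.0), (5, 0.455556),
]

# Collect the scores that belong to each rating.
by_rating = defaultdict(list)
for rating, score in rows:
    by_rating[rating].append(score)

# Mean sentiment per rating, in ascending rating order.
mean_by_rating = {r: mean(s) for r, s in sorted(by_rating.items())}

for r, m in mean_by_rating.items():
    print(f"rating {r}: mean sentiment {m:.4f} over {len(by_rating[r])} reviews")
```

On the full DataFrame, a pandas equivalent would be a `groupby` on the `rating` column followed by `mean()` of the sentiment column.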


In [0]:
df.to_csv('dictionary_based_sentiment.csv', index=False)