# Dictionary-Based Sentiment Analyzer

* Word tokenization
* Sentence tokenization
* Scoring of the reviews
* Comparison of the scores with the reviews in plots
* Measuring the distribution
* Handling negation
* Adjusting your dictionary-based sentiment analyzer
* Checking your results

In [1]:
# all imports and related

%matplotlib inline

import pandas as pd
import numpy as np
import altair as alt

from nltk import download as nltk_download
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.sentiment.util import mark_negation

nltk_download('punkt')  # required by word_tokenize

from collections import Counter




[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


### load the small_corpus CSV

Run the process in
[create_dataset.ipynb](https://github.com/oonid/growth-hacking-with-nlp-sentiment-analysis/blob/master/create_dataset.ipynb)
to generate the corpus.

Then copy the file **small_corpus.csv** into this Google Colab session's Files (via file upload or a mounted drive).


In [2]:
df = pd.read_csv('small_corpus.csv')
df

Unnamed: 0,ratings,reviews
0,1,Recently UBISOFT had to settle a huge class-ac...
1,1,"code didn't work, got me a refund."
2,1,"these do not work at all, all i get is static ..."
3,1,well let me start by saying that when i first ...
4,1,"Dont waste your money, you will just end up us..."
...,...,...
4495,5,"Nice long micro USB cable, battery lasts a lon..."
4496,5,I've been having a great time with this game. ...
4497,5,d
4498,5,"Really pretty, funny, interesting game. Works ..."


In [3]:
# check whether any column contains nulls; the reviews column does
df.isnull().any()

ratings    False
reviews     True
dtype: bool

In [4]:
# replace nulls in the reviews column with the empty string ''
df.reviews = df.reviews.fillna('')

# test again
df.isnull().any()

ratings    False
reviews    False
dtype: bool

In [5]:
rating_list = list(df['ratings'])
review_list = list(df['reviews'])

print(rating_list[:5])
for r in review_list[:5]:
    print('--\n{}'.format(r))

[1, 1, 1, 1, 1]
--
Recently UBISOFT had to settle a huge class-action suit brought against the company for bundling (the notoriously harmful) StarFORCE DRM with its released games. So what the geniuses at the helm do next? They decide to make the same mistake yet again - by choosing the same DRM scheme that made BIOSHOCK, MASS EFFECT and SPORE infamous: SecuROM 7.xx with LIMITED ACTIVATIONS!

MASS EFFECT can be found in clearance bins only months after its release; SPORE not only undersold miserably but also made history as the boiling point of gamers lashing back, fed up with idiotic DRM schemes. And the clueless MBAs that run an art-form as any other commodity business decided that, "hey, why not jump into THAT mud-pond ourselves?"

The original FAR CRY was such a GREAT game that any sequel of it would have to fight an uphill battle to begin with (especially without its original developing team). Now imagine shooting this sequel on the foot with a well known, much hated and totally u

### tokenize the sentences and words of the reviews

In [6]:
word_tokenized = df['reviews'].apply(word_tokenize)
word_tokenized

0       [Recently, UBISOFT, had, to, settle, a, huge, ...
1        [code, did, n't, work, ,, got, me, a, refund, .]
2       [these, do, not, work, at, all, ,, all, i, get...
3       [well, let, me, start, by, saying, that, when,...
4       [Dont, waste, your, money, ,, you, will, just,...
                              ...                        
4495    [Nice, long, micro, USB, cable, ,, battery, la...
4496    [I, 've, been, having, a, great, time, with, t...
4497                                                  [d]
4498    [Really, pretty, ,, funny, ,, interesting, gam...
4499    [i, had, a, lot, of, fun, playing, this, game,...
Name: reviews, Length: 4500, dtype: object

In [7]:
sent_tokenized = df['reviews'].apply(sent_tokenize)
sent_tokenized

0       [Recently UBISOFT had to settle a huge class-a...
1                    [code didn't work, got me a refund.]
2       [these do not work at all, all i get is static...
3       [well let me start by saying that when i first...
4       [Dont waste your money, you will just end up u...
                              ...                        
4495    [Nice long micro USB cable, battery lasts a lo...
4496    [I've been having a great time with this game....
4497                                                  [d]
4498    [Really pretty, funny, interesting game., Work...
4499    [i had a lot of fun playing this game, if your...
Name: reviews, Length: 4500, dtype: object

### download the opinion lexicon of NLTK

Use it as described in its source:

https://www.nltk.org/_modules/nltk/corpus/reader/opinion_lexicon.html



In [8]:
# imports and related

nltk_download('opinion_lexicon')

from nltk.corpus import opinion_lexicon

[nltk_data] Downloading package opinion_lexicon to /root/nltk_data...
[nltk_data]   Unzipping corpora/opinion_lexicon.zip.


In [9]:
print('total lexicon words: {}'.format(len(opinion_lexicon.words())))
print('total lexicon negatives: {}'.format(len(opinion_lexicon.negative())))
print('total lexicon positives: {}'.format(len(opinion_lexicon.positive())))
print('sample of lexicon words (first 10, by id):')
print(opinion_lexicon.words()[:10])  # print first 10 sorted by file id
print('sample of lexicon words (first 10, by alphabet):')
print(sorted(opinion_lexicon.words())[:10])  # print first 10 sorted alphabet

positive_set = set(opinion_lexicon.positive())
negative_set = set(opinion_lexicon.negative())
print(len(positive_set))
print(len(negative_set))


total lexicon words: 6789
total lexicon negatives: 4783
total lexicon positives: 2006
sample of lexicon words (first 10, by id):
['2-faced', '2-faces', 'abnormal', 'abolish', 'abominable', 'abominably', 'abominate', 'abomination', 'abort', 'aborted']
sample of lexicon words (first 10, by alphabet):
['2-faced', '2-faces', 'a+', 'abnormal', 'abolish', 'abominable', 'abominably', 'abominate', 'abomination', 'abort']
2006
4783


In [10]:
def simple_opinion_test(words):
    if words not in opinion_lexicon.words():
        print('{} not covered on opinion_lexicon'.format(words))
    else:
        if words in opinion_lexicon.negative():
            print('{} is negative'.format(words))
        if words in opinion_lexicon.positive():
            print('{} is positive'.format(words))

simple_opinion_test('awful')
simple_opinion_test('beautiful')
simple_opinion_test('useless')
simple_opinion_test('Great')  # not found: lexicon entries are lowercase
simple_opinion_test('warming')


awful is negative
beautiful is positive
useless is negative
Great not covered on opinion_lexicon
warming not covered on opinion_lexicon

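The `Great` miss above is a matter of case. A minimal sketch of a case-insensitive lookup follows. The two small word sets stand in for the real lexicon and the helper name `lookup_polarity` is hypothetical; neither is part of NLTK.

```python
# Toy stand-in for opinion_lexicon.positive() / .negative(); the real
# lexicon is far larger. These small sets are for illustration only.
toy_positive = {"great", "beautiful"}
toy_negative = {"awful", "useless"}

def lookup_polarity(word, positive=toy_positive, negative=toy_negative):
    """Return 'positive', 'negative', or None after lowercasing the word."""
    w = word.lower()
    if w in positive:
        return "positive"
    if w in negative:
        return "negative"
    return None

for word in ["Great", "AWFUL", "warming"]:
    print(word, "->", lookup_polarity(word))
```

Building the sets once and testing membership against them also avoids rescanning the full word list on every call.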

### classify each review in a scale of -1 to +1

In [0]:
# how a review is scored:
# * split the review (possibly several sentences) into sentences
# * score each sentence from its words

def score_sentence(sentence):
    """sentence (input) is a list of word tokens from one sentence.
    return a score between -1 and 1:
    0 to 1 if positive words outnumber negative words,
    -1 to 0 if negative words outnumber positive words.
    """
    # keep only alphanumeric tokens (this drops punctuation) and lowercase
    # them, because the lexicon entries are lowercase
    selective_words = [w.lower() for w in sentence if w.isalnum()]
    total_selective_words = len(selective_words)
    # count the words listed as positive in the opinion lexicon
    total_positive = len([w for w in selective_words if w in positive_set])
    # count the words listed as negative in the opinion lexicon
    total_negative = len([w for w in selective_words if w in negative_set])

    if total_selective_words > 0:  # has at least 1 word to categorize
        return (total_positive - total_negative) / total_selective_words
    else:  # no selective words
        return 0

def score_review(review):
    """review (input) is a single review, possibly several sentences.
    split the review into sentences.
    split each sentence into words.
    collect the sentence scores in a list (sentiment_scores).
    score of review = sum of sentence scores / number of sentence scores
    return the score of the review
    """
    sentiment_scores = []
    sentences = sent_tokenize(review)
    # process each sentence
    for sentence in sentences:
        # split the sentence into words
        words = word_tokenize(sentence)
        # score the sentence, passing its tokenized words as input
        sentence_score = score_sentence(words)
        # add to the list of sentiment scores
        sentiment_scores.append(sentence_score)
    # mean value = sum of sentiment scores / number of sentiment scores
    if sentiment_scores:  # has at least 1 sentence score
        return sum(sentiment_scores) / len(sentiment_scores)
    else:  # return 0 if there are no sentence scores, avoiding division by zero
        return 0


In [12]:
review_sentiments = [score_review(r) for r in review_list]
print(review_sentiments[:5])

[-0.013158071747989037, 0.2857142857142857, 0.0, -0.02052123414370017, 0.0]

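The second score above (about 0.2857) can be reproduced by hand. The sketch below repeats the arithmetic of `score_sentence` on the tokens of that review. It assumes that `work` and `refund` are the two tokens matching the positive list, which is consistent with the printed score; the toy sets are not the real lexicon.

```python
# Tokens as produced by word_tokenize for the second review
# ("code didn't work, got me a refund."), copied from the output above.
tokens = ["code", "did", "n't", "work", ",", "got", "me", "a", "refund", "."]

# Assumption: toy lexicon holding only the entries relevant to this review.
toy_positive = {"work", "refund"}
toy_negative = set()

# Same filtering as score_sentence: keep alphanumeric tokens, lowercased.
kept = [t.lower() for t in tokens if t.isalnum()]
n_pos = sum(t in toy_positive for t in kept)
n_neg = sum(t in toy_negative for t in kept)
score = (n_pos - n_neg) / len(kept)

print(kept)                      # "n't", "," and "." are dropped
print(n_pos, n_neg, len(kept))   # 2 positive, 0 negative, 7 words
print(round(score, 4))           # 2 / 7
```

The filter discards `n't`, so the negation is lost and a 1-star review ends up with a positive score.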

In [13]:
print(rating_list[:5])
print(review_sentiments[:5])
for r in review_list[:5]:
    print('--\n{}'.format(r))

[1, 1, 1, 1, 1]
[-0.013158071747989037, 0.2857142857142857, 0.0, -0.02052123414370017, 0.0]
--
Recently UBISOFT had to settle a huge class-action suit brought against the company for bundling (the notoriously harmful) StarFORCE DRM with its released games. So what the geniuses at the helm do next? They decide to make the same mistake yet again - by choosing the same DRM scheme that made BIOSHOCK, MASS EFFECT and SPORE infamous: SecuROM 7.xx with LIMITED ACTIVATIONS!

MASS EFFECT can be found in clearance bins only months after its release; SPORE not only undersold miserably but also made history as the boiling point of gamers lashing back, fed up with idiotic DRM schemes. And the clueless MBAs that run an art-form as any other commodity business decided that, "hey, why not jump into THAT mud-pond ourselves?"

The original FAR CRY was such a GREAT game that any sequel of it would have to fight an uphill battle to begin with (especially without its original developing team). Now imagine 

In [14]:
df = pd.DataFrame({
    "rating": rating_list,
    "review": review_list,
    "review dictionary based sentiment": review_sentiments,
})
df

Unnamed: 0,rating,review,review dictionary based sentiment
0,1,Recently UBISOFT had to settle a huge class-ac...,-0.013158
1,1,"code didn't work, got me a refund.",0.285714
2,1,"these do not work at all, all i get is static ...",0.000000
3,1,well let me start by saying that when i first ...,-0.020521
4,1,"Dont waste your money, you will just end up us...",0.000000
...,...,...,...
4495,5,"Nice long micro USB cable, battery lasts a lon...",0.058824
4496,5,I've been having a great time with this game. ...,0.222222
4497,5,d,0.000000
4498,5,"Really pretty, funny, interesting game. Works ...",0.455556


In [0]:
df.to_csv('dictionary_based_sentiment.csv', index=False)

# Compare the scores of the product reviews with the product ratings using a plot

In [16]:
rating_counts = Counter(rating_list)
print('distribution of rating as dictionary: {}'.format(rating_counts))

distribution of rating as dictionary: Counter({1: 1500, 5: 1500, 2: 500, 3: 500, 4: 500})


### a plot of the distribution of the ratings

In [17]:
# cast the int keys to str so ratings are handled as category labels, not numbers
dfrc = pd.DataFrame({
    "ratings": [str(k) for k in rating_counts.keys()],
    "counts": list(rating_counts.values())
})
dfrc

Unnamed: 0,ratings,counts
0,1,1500
1,2,500
2,3,500
3,4,500
4,5,1500

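The `Counter` printed earlier lists the ratings by count, while `keys()` follows the order in which values were first seen. A small sketch with made-up ratings (not the real corpus) shows the difference. It also sorts the keys explicitly, so the table does not depend on the input order.

```python
from collections import Counter

# Toy ratings, deliberately unordered (assumption: not the real corpus).
toy_ratings = [5, 1, 5, 3, 1, 5]
toy_counts = Counter(toy_ratings)

print(toy_counts)               # repr orders entries by count, highest first
print(list(toy_counts.keys()))  # keys() keeps first-seen order

# Sort the keys explicitly before building the ratings/counts columns.
ordered = sorted(toy_counts.items())
ratings = [str(k) for k, _ in ordered]
counts = [v for _, v in ordered]
print(ratings, counts)
```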

In [18]:
rating_counts_chart = alt.Chart(dfrc).mark_bar().encode(x="ratings", y="counts")
rating_counts_chart

### a plot of the distribution of the sentiment scores

In [19]:
# compute the histogram; with density=True each value is the
# probability density in its bin, normalized so that the integral
# over the range is 1
hist, bin_edges = np.histogram(review_sentiments, density=True)
print('histogram value: {}'.format(hist))
print('bin_edges value: {}'.format(bin_edges))  # from -1 to 1
print()
labels = [(str(l[0]), str(l[1])) for l in zip(bin_edges, bin_edges[1:])]
print('labels: {}'.format(labels))
labels = [" ".join(label) for label in labels]
print('labels: {}'.format(labels))


histogram value: [1.44444444e-02 1.11111111e-03 2.11111111e-02 6.55555556e-02
 1.38555556e+00 2.96777778e+00 2.66666667e-01 1.57777778e-01
 1.11111111e-02 1.08888889e-01]
bin_edges value: [-1.  -0.8 -0.6 -0.4 -0.2  0.   0.2  0.4  0.6  0.8  1. ]

labels: [('-1.0', '-0.8'), ('-0.8', '-0.6'), ('-0.6', '-0.3999999999999999'), ('-0.3999999999999999', '-0.19999999999999996'), ('-0.19999999999999996', '0.0'), ('0.0', '0.20000000000000018'), ('0.20000000000000018', '0.40000000000000013'), ('0.40000000000000013', '0.6000000000000001'), ('0.6000000000000001', '0.8'), ('0.8', '1.0')]
labels: ['-1.0 -0.8', '-0.8 -0.6', '-0.6 -0.3999999999999999', '-0.3999999999999999 -0.19999999999999996', '-0.19999999999999996 0.0', '0.0 0.20000000000000018', '0.20000000000000018 0.40000000000000013', '0.40000000000000013 0.6000000000000001', '0.6000000000000001 0.8', '0.8 1.0']

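The labels above carry floating-point noise such as `-0.3999999999999999`. One possible fix, sketched here on a hard-coded copy of the bin edges, is to format each edge to one decimal place. The `to` separator is an arbitrary choice.

```python
# Bin edges as returned by np.histogram above, including float noise.
bin_edges = [-1.0, -0.8, -0.6, -0.3999999999999999, -0.19999999999999996,
             0.0, 0.20000000000000018, 0.40000000000000013,
             0.6000000000000001, 0.8, 1.0]

# Pair each edge with its successor and format both to one decimal place.
labels = ["{:.1f} to {:.1f}".format(lo, hi)
          for lo, hi in zip(bin_edges, bin_edges[1:])]
print(labels)
```

The formatted strings no longer sort in numeric order, so the chart still needs an explicit sort order.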

In [20]:
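# note: hist holds densities (density=True), not raw counts,
# even though the column is named "counts"
dfsc = pd.DataFrame({
    "sentiment scores": labels,
    "counts": hist,
})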
dfsc

Unnamed: 0,sentiment scores,counts
0,-1.0 -0.8,0.014444
1,-0.8 -0.6,0.001111
2,-0.6 -0.3999999999999999,0.021111
3,-0.3999999999999999 -0.19999999999999996,0.065556
4,-0.19999999999999996 0.0,1.385556
5,0.0 0.20000000000000018,2.967778
6,0.20000000000000018 0.40000000000000013,0.266667
7,0.40000000000000013 0.6000000000000001,0.157778
8,0.6000000000000001 0.8,0.011111
9,0.8 1.0,0.108889


In [21]:
# pass sort=labels to keep the bins in numeric order; the default sort
# of the string labels would not follow the bin order
sentiment_counts_chart = alt.Chart(dfsc).mark_bar() \
    .encode(x=alt.X("sentiment scores", sort=labels), y="counts")
sentiment_counts_chart


### a plot about the relation of the sentiment scores and product ratings

In [22]:
# explore if there's relationship between ratings and sentiments
dfrs = pd.DataFrame({
    "ratings": [str(r) for r in rating_list],
    "sentiments": review_sentiments,
})
dfrs

Unnamed: 0,ratings,sentiments
0,1,-0.013158
1,1,0.285714
2,1,0.000000
3,1,-0.020521
4,1,0.000000
...,...,...
4495,5,0.058824
4496,5,0.222222
4497,5,0.000000
4498,5,0.455556


In [23]:
rating_sentiments_chart = alt.Chart(dfrs).mark_bar()\
    .encode(x="ratings", y="sentiments", color="ratings",
            tooltip=["ratings", "sentiments"])\
    .interactive()
rating_sentiments_chart

# Measure the correlation of the sentiment scores and product ratings

An article from [machinelearningmastery](https://machinelearningmastery.com/how-to-use-correlation-to-understand-the-relationship-between-variables/) explains how to use correlation to understand the relationship between variables:

* Covariance. Measures how two variables vary together; its sign gives the direction of a linear relationship, but its magnitude depends on the units of the variables.
* Pearson's Correlation. The Pearson correlation coefficient summarizes the strength of the linear relationship between two data samples.
* Spearman's Correlation. Works on ranks, so it can capture a monotonic relationship even when it is non-linear, for example when the relationship is stronger or weaker across the distribution of the variables.

Import `pearsonr` and `spearmanr` from the `scipy.stats` package.


In [24]:
from scipy.stats import pearsonr, spearmanr

pearson_correlation, _ = pearsonr(rating_list, review_sentiments)
print('pearson correlation: {}'.format(pearson_correlation))

spearman_correlation, _ = spearmanr(rating_list, review_sentiments)
print('spearman correlation: {}'.format(spearman_correlation))

# the Spearman rank correlation (about 0.59) indicates a moderate positive
# correlation between rating and review score (sentiment)


pearson correlation: 0.4339913860749739
spearman correlation: 0.5860296165999471


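To make the difference between the two coefficients concrete, here is a pure-Python sketch on toy data. It is not how `scipy.stats` implements them. It computes Pearson's coefficient directly, and Spearman's as Pearson's coefficient on average ranks. The helper names are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Toy data: monotonic but non-linear (y = x cubed).
xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]
print(pearson(xs, ys))    # below 1: the relationship is not linear
print(spearman(xs, ys))   # 1 up to rounding: the ranks agree perfectly
```

Tie handling matters here because the ratings take only five distinct values, so most of them are tied.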
# Improve your sentiment analyzer in order to reduce contradictory cases

### handle negation, since most contradictory cases contain a negation in the sentence (e.g., no problem)

In [25]:
for r, s, review in zip(rating_list, review_sentiments, review_list):
    if r == 5 and s < -0.2:
        # rated 5 but sentiment below -0.2
        print('({}, {}): {}'.format(r, s, review))
    if r == 1 and s > 0.3:
        # rated 1 but sentiment above 0.3
        print('({}, {}): {}'.format(r, s, review))


(1, 0.6666666666666666): Never worked right
(1, 0.5): Not Good
(1, 0.3333333333333333): not kid appropriate
(1, 0.5): doesn't work
(1, 0.5): Never worked.
(1, 0.3333333333333333): Not well made
(1, 0.5): Doesn't work
(1, 0.3333333333333333): Did not work.
(1, 0.3333333333333333): returning don't work
(1, 0.3333333333333333): Does not work correctly with xbox
(1, 0.5): Doesn't work
(1, 0.3333333333333333): Does not work
(1, 0.5): doesn't work
(5, -0.2857142857142857): Killing zombies, how can you go wrong!
(5, -0.25): Would buy again. No problems.


### use the mark_negation function to handle negation

In [26]:
test_sentence = 'Does not work correctly with xbox'
print(mark_negation(test_sentence.split()))

# "No problems." is not marked here: the capitalized "No" is apparently
# not recognized as a negation word
test_sentence = 'Would buy again. No problems.'
print(mark_negation(test_sentence.split()))

# a sentence from the sample solution, where lowercase "no" is detected
test_sentence = "I received these on time and no problems. No damages battlfield never fails"
print(mark_negation(test_sentence.split()))

['Does', 'not', 'work_NEG', 'correctly_NEG', 'with_NEG', 'xbox_NEG']
['Would', 'buy', 'again.', 'No', 'problems.']
['I', 'received', 'these', 'on', 'time', 'and', 'no', 'problems._NEG', 'No_NEG', 'damages_NEG', 'battlfield_NEG', 'never_NEG', 'fails_NEG']
