## Workflow

1. Load in the small_corpus .csv file you created in the previous milestone.

2. Tokenize the sentences and words of the reviews with the tokenize module of NLTK.
   - Use the word_tokenize and sent_tokenize functions of the nltk.tokenize module.
3. Download the opinion lexicon of NLTK by using the following command: nltk.download('opinion_lexicon'). Before you classify each word of the reviews, experiment with words and find out whether they are labeled as positive or negative.

   - Note that the dictionary contains various word-forms, not only stems.
4. Classify each review on a scale of –1 to +1. The higher the score, the more positive the review.

   - It is recommended to score the reviews in two steps. First, score each sentence of a review from –1 to +1 based on the counts of positive and negative words it contains. Then compute the review's sentiment score from the scores of its sentences.
   - Don’t forget that the NLTK opinion lexicon contains neither uppercase words nor punctuation marks.
5. Compare the scores of the product reviews with the product ratings using a plot. In this step, you need to accomplish three sub-tasks:

   - Create a plot of the distribution of the ratings. Explore which is the most common rating.

      - You can use Altair to create the plot.
   - Create a plot of the distribution of the sentiment scores. Explore which is the most common.

      - Note that the scores are not discrete numbers.
      - It is recommended to use a NumPy histogram to put the sentiment scores into bins.
   - Create a plot of the relationship between the sentiment scores and the product ratings. What is your impression? Do they correlate?

6. Measure the correlation of the sentiment scores and product ratings. Try more than one method. Study the contradictions, namely those cases where the rating is high but the score is low, or the other way around.
   - Choose the most effective correlation measure.
7. Improve your sentiment analyzer in order to reduce contradictory cases. Handle negation, since most contradictory cases involve negation in the sentence (“no problem,” for example).

   - It is recommended to use the mark_negation function of the nltk.sentiment.util module.
   - Don’t forget to extend your vocabulary with negated words.
8. Finalize your own dictionary-based sentiment analyzer. Score the reviews again on the basis of your new vocabulary.

9. Check whether your results improved. Compare the scores of the product reviews with the product ratings again.

10. Export your results to your data file (small_corpus) and add a column to the table that contains the sentiment scores.


## Deliverable

The deliverable for this milestone is a Jupyter Notebook that documents your workflow with the following items:

- word tokenization
- sentence tokenization
- scoring of the reviews
- comparison of the scores with the ratings in plots
- measuring the correlation
- handling negation
- adjusting your dictionary-based sentiment analyzer
- checking your results

In [1]:
import pandas as pd

In [9]:
df = pd.read_csv("..\\data\\raw\\review_corpus.tsv", sep="\t")
df

Unnamed: 0,rating,review
0,1.0,Yet another garbage CoD game. Zombies is unpla...
1,1.0,$80? .... No way. This is NOT worth $80. $80?....
2,1.0,One of the worst games ever. I bought and down...
3,1.0,I did a lot of homework before I decided to by...
4,1.0,"I am really into RPG games, I loved Skyrim, Bo..."
...,...,...
7495,5.0,"Worked good, girlfriend loves this game. Five ..."
7496,5.0,This is my 3rd Mystery PI game and I've enjoye...
7497,5.0,work like brand new wont brake any time soon
7498,5.0,This remote works fantastic. I love it. Five S...


In [12]:
ratings = list(df["rating"])
reviews = list(df["review"])

In [13]:
###############################################################################
#####                   Dictionary based sentiment analysis               #####
###############################################################################
from nltk.corpus import opinion_lexicon
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

positive_wds = set(opinion_lexicon.positive())
negative_wds = set(opinion_lexicon.negative())
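# requires nltk.download('opinion_lexicon') to have been run once
# the lexicon lists surface word forms (it is NOT lemmatized), so we only
# have to tokenize the text and count positive and negative words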

In [22]:
# positive_wds

In [80]:
def score_sent(sent):
    """Return (positives - negatives) / token count, a score between -1 and 1."""
    # Keep lowercased alphanumeric tokens only: the lexicon contains neither
    # uppercase words nor punctuation. isalnum() also drops tokens like "n't".
    tokens = [e.lower() for e in word_tokenize(sent) if e.isalnum()]
    total = len(tokens)
    pos = len([e for e in tokens if e in positive_wds])
    neg = len([e for e in tokens if e in negative_wds])
    if total > 0:
        return (pos - neg) / total
    else:
        return 0

In [81]:
def score_review(review):
    """Return the mean sentence score of a review (0 if it has no sentences)."""
    sentiment_scores = [score_sent(sent) for sent in sent_tokenize(review)]
    # an empty review yields no sentences; avoid dividing by zero
    if not sentiment_scores:
        return 0
    return sum(sentiment_scores) / len(sentiment_scores)
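
The arithmetic behind the two-step scoring is easy to check by hand on a toy example. The sketch below is not the notebook's code: the two tiny lexicons and the regex-based splitters are hypothetical stand-ins for NLTK's opinion_lexicon, sent_tokenize and word_tokenize, chosen so that it runs without any downloads.

```python
import re

# Hypothetical stand-in lexicons; the notebook uses NLTK's opinion_lexicon sets.
TOY_POSITIVE = {"love", "great", "fantastic"}
TOY_NEGATIVE = {"worst", "broken", "garbage"}


def toy_sentences(text):
    """Naive sentence splitter, a stand-in for nltk's sent_tokenize."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def toy_score_sentence(sentence):
    """(positives - negatives) / word count, so the result lies in [-1, 1]."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    if not words:
        return 0.0
    pos = sum(w in TOY_POSITIVE for w in words)
    neg = sum(w in TOY_NEGATIVE for w in words)
    return (pos - neg) / len(words)


def toy_score_review(review):
    """Mean of the sentence scores; an empty review scores 0."""
    scores = [toy_score_sentence(s) for s in toy_sentences(review)]
    return sum(scores) / len(scores) if scores else 0.0


review = "I love it. The worst garbage ever."
# Sentence 1: 1 positive word out of 3 -> 1/3
# Sentence 2: 2 negative words out of 4 -> -1/2
print([toy_score_sentence(s) for s in toy_sentences(review)])
# Review score: mean of the two sentence scores -> -1/12
print(toy_score_review(review))
```

Because every sentence score is divided by the sentence length, long sentences with a single opinion word end up close to 0, which is one reason most review scores cluster around the neutral range.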

In [82]:
# reviews

In [83]:
review_sentiments = [score_review(e) for e in reviews]
# review_sentiments

In [85]:


df = pd.DataFrame(
    {
        "rating": ratings,
        "review": reviews,
        "review dictionary based sentiment": review_sentiments,
    }
)

In [86]:
df

Unnamed: 0,rating,review,review dictionary based sentiment
0,1.0,Yet another garbage CoD game. Zombies is unpla...,-0.008081
1,1.0,$80? .... No way. This is NOT worth $80. $80?....,0.033333
2,1.0,One of the worst games ever. I bought and down...,-0.154938
3,1.0,I did a lot of homework before I decided to by...,0.039000
4,1.0,"I am really into RPG games, I loved Skyrim, Bo...",-0.161413
...,...,...,...
7495,5.0,"Worked good, girlfriend loves this game. Five ...",0.250000
7496,5.0,This is my 3rd Mystery PI game and I've enjoye...,0.077381
7497,5.0,work like brand new wont brake any time soon,0.222222
7498,5.0,This remote works fantastic. I love it. Five S...,0.277778


In [87]:
# let pandas write the file directly; this avoids doubled line endings on Windows
df.to_csv("..\\data\\processed\\dictionary_based_sentiment.tsv", index=False, sep="\t")

In [88]:
###############################################################################
#####                       Exploratory Data Analysis                     #####
###############################################################################
# plot ratings vs. dictionary-based sentiment scores
from collections import Counter

import altair as alt
import numpy as np
import pandas as pd

In [91]:
# let's look at the distributions

# the distribution of the ratings
rating_counts = Counter(ratings)
data1 = pd.DataFrame(
    {
        "ratings": [str(e) for e in list(rating_counts.keys())],
        "counts": list(rating_counts.values()),
    }
)
data1

Unnamed: 0,ratings,counts
0,1.0,1500
1,2.0,1500
2,3.0,1500
3,4.0,1500
4,5.0,1500


In [94]:
chart1 = alt.Chart(data1).mark_bar().encode(x="ratings", y="counts")
chart1.save(".\\plots\\01\\rating_counts.html")
chart1


In [100]:
# the ratings are balanced (1500 reviews per rating), so there is no majority class
# the distribution of sentiment scores
hist, bin_edges = np.histogram(review_sentiments, density=True)
# density : bool, optional
# If False, the result contains the number of samples in each bin. If True,
# the result is the value of the probability *density* function at the bin,
# normalized such that the *integral* over the range is 1. The values sum to 1
# only for bins of unit width; this is not a probability *mass* function.
# The "counts" column below therefore holds densities, not sample counts.

labels = list(zip(bin_edges, bin_edges[1:]))
labels = [(str(e[0]), str(e[1])) for e in labels]
labels = [" ".join(e) for e in labels]


data2 = pd.DataFrame({"sentiment scores": labels, "counts": hist})
data2

Unnamed: 0,sentiment scores,counts
0,-1.0 -0.8,0.001333
1,-0.8 -0.6,0.003333
2,-0.6 -0.3999999999999999,0.016
3,-0.3999999999999999 -0.19999999999999996,0.095333
4,-0.19999999999999996 0.0,1.446
5,0.0 0.20000000000000018,2.951333
6,0.20000000000000018 0.40000000000000013,0.408
7,0.40000000000000013 0.6000000000000001,0.065333
8,0.6000000000000001 0.8,0.008667
9,0.8 1.0,0.004667
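
The values in the table above are densities, so they can be converted back to sample counts as density × number of samples × bin width. Assuming 7,500 reviews (as the rating counts suggest) and a bin width of 0.2, the 2.951333 density of the (0.0, 0.2) bin corresponds to roughly 4,427 reviews. The following pure-Python sketch illustrates that relationship on toy values; the `histogram` helper and the label format are illustrative assumptions, not NumPy's implementation.

```python
def histogram(values, n_bins=10):
    """Equal-width histogram; returns (counts, densities, edges).

    Assumes max(values) > min(values).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins + 1)]
    counts = [0] * n_bins
    for v in values:
        # The last bin is closed on the right, so the maximum lands in it.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    # density = count / (number of samples * bin width)
    densities = [c / (len(values) * width) for c in counts]
    return counts, densities, edges


toy_scores = [-1.0, -0.5, 0.0, 0.1, 0.1, 0.2, 1.0]
counts, densities, edges = histogram(toy_scores, n_bins=4)

# Rounded labels are easier to read on a chart axis than raw float edges.
labels = [f"[{a:.1f}, {b:.1f})" for a, b in zip(edges, edges[1:])]

for label, count, density in zip(labels, counts, densities):
    print(label, count, round(density, 3))

# Densities integrate to 1 over the range: sum(density * width) == 1.
width = edges[1] - edges[0]
print(sum(d * width for d in densities))
```

Plotting the counts rather than the densities keeps the y-axis in units of reviews, which is usually easier to interpret for this kind of chart.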


In [101]:
chart2 = (
    alt.Chart(data2)
    .mark_bar()
    .encode(x=alt.X("sentiment scores", sort=labels), y="counts")
)
chart2.save(".\\plots\\01\\review_sentiments.html")
chart2
# the (0.0, 0.2) bin dominates -> most reviews score neutral to slightly positive

In [102]:
# is there any relationship between the ratings and the sentiment scores?
source = pd.DataFrame(
    {"ratings": [str(e) for e in ratings], "sentiments": review_sentiments}
)
source

Unnamed: 0,ratings,sentiments
0,1.0,-0.008081
1,1.0,0.033333
2,1.0,-0.154938
3,1.0,0.039000
4,1.0,-0.161413
...,...,...
7495,5.0,0.250000
7496,5.0,0.077381
7497,5.0,0.222222
7498,5.0,0.277778


In [104]:
# https://altair-viz.github.io/user_guide/large_datasets.html
# MaxRowsError: The number of rows in your dataset is greater than the maximum allowed (5000)
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

In [105]:
# 5
chart4 = (
    alt.Chart(source)
    .mark_circle(size=60)
    .encode(
        x="ratings", y="sentiments", color="ratings", tooltip=["ratings", "sentiments"]
    )
    .interactive()
)
chart4.save(".\\plots\\01\\reviews_ratings_vs_sentiment.html")
chart4 

In [114]:
# 6: Correlation

# test correlation
from scipy.stats import pearsonr, spearmanr

corr1, _ = pearsonr(ratings, review_sentiments)
print("Pearson Correlation = {}".format(corr1))

# Spearman rank correlation indicates a moderate positive correlation
# between rating and sentiment
scor1, _ = spearmanr(ratings, review_sentiments)

print("Spearman Rank Correlation = {}".format(scor1))

# The ratings are ordinal and the plotted sentiment scores are not normally
# distributed, so Spearman is the better fit here: Pearson measures linear
# association, and its significance test assumes normality.

# https://statistics.laerd.com/statistical-guides/spearmans-rank-order-correlation-statistical-guide.php
# you could also consult a Manning stats book

Pearson Correlation = 0.5027590735150967
Spearman Rank Correlation = 0.5674703525244491
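
Spearman's coefficient is Pearson's coefficient computed on ranks, with tied values sharing the average of their rank positions. That matters here because the ratings take only five distinct values, so ties are everywhere. The sketch below is a hand-rolled illustration on made-up numbers, not SciPy's implementation; the helper names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with values[order[i]]
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson applied to average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up data: sentiment rises with rating, but not linearly.
toy_ratings = [1, 1, 3, 5, 5]
toy_sentiments = [-0.2, 0.0, 0.05, 0.1, 0.9]

print(average_ranks(toy_ratings))
print("Pearson  =", pearson(toy_ratings, toy_sentiments))
print("Spearman =", spearman(toy_ratings, toy_sentiments))
```

On this toy data the rank-based measure comes out higher than the linear one, because the relationship is monotonic but one outlying score stretches the linear fit.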


In [112]:
###############################################################################
######                          Let's see the data                       ######
###############################################################################
for rating, sentiment, text in zip(ratings, review_sentiments, reviews):
    # contradictions: a top rating with a clearly negative score,
    # or the lowest rating with a clearly positive score
    if (rating == 5 and sentiment < -0.2) or (rating == 1 and sentiment > 0.3):
        print(text)

### Problem with reviews like
# no issues
# no complaints
# Doesn't work.
# Didn't like it.

didn't work didn't work
Nice Love it
Excellent.  What I was expecting. The right item
good One Star
The Villainous! Five Stars
Goon Five Stars
Addicting. Now I'm also addicted to Pokemon Shuffle. Halps.
No issues. No issues.
Does what is advertised. Nothing lost as far as performance. no problems


In [113]:
from nltk.sentiment.util import mark_negation


t = "I received these on time and no problems. No damages battlfield never fails"
print(mark_negation(t.split()))

['I', 'received', 'these', 'on', 'time', 'and', 'no', 'problems._NEG', 'No_NEG', 'damages_NEG', 'battlfield_NEG', 'never_NEG', 'fails_NEG']
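
The `_NEG` tags only help if the scorer's vocabulary knows about them. One way to do that, in line with the task's hint to extend the vocabulary with negated words, is to add every negative word plus `_NEG` to the positive set and every positive word plus `_NEG` to the negative set. Note also that in the output above the negation scope runs to the end of the text, apparently because `t.split()` leaves the period attached to "problems."; tokenizing so that punctuation is a separate token lets the scope end at the sentence boundary. The sketch below uses toy lexicons and a simplified marker as stand-ins for opinion_lexicon and mark_negation, so every name in it is an assumption rather than NLTK's API.

```python
import re

# Hypothetical stand-in lexicons; the notebook uses NLTK's opinion_lexicon sets.
TOY_POSITIVE = {"good", "love", "works"}
TOY_NEGATIVE = {"problems", "issues", "fails"}
NEGATORS = {"no", "not", "never"}
CLAUSE_END = {".", ":", ";", "!", "?"}

# Extended vocabulary: a negated negative word counts as positive and vice versa.
POS_WITH_NEG = TOY_POSITIVE | {w + "_NEG" for w in TOY_NEGATIVE}
NEG_WITH_NEG = TOY_NEGATIVE | {w + "_NEG" for w in TOY_POSITIVE}


def tokenize(text):
    """Lowercased words and clause punctuation as separate tokens."""
    return re.findall(r"[a-z']+|[.:;!?]", text.lower())


def simple_mark_negation(tokens):
    """Append _NEG to tokens between a negator and the next clause end.

    A simplified stand-in for nltk.sentiment.util.mark_negation.
    """
    marked, in_scope = [], False
    for tok in tokens:
        if tok in CLAUSE_END:
            in_scope = False
            marked.append(tok)
        elif in_scope:
            marked.append(tok + "_NEG")
        else:
            marked.append(tok)
            in_scope = tok in NEGATORS or tok.endswith("n't")
    return marked


def score_with_negation(text):
    """(positives - negatives) / word count over negation-marked tokens."""
    words = [t for t in simple_mark_negation(tokenize(text)) if t not in CLAUSE_END]
    if not words:
        return 0.0
    pos = sum(w in POS_WITH_NEG for w in words)
    neg = sum(w in NEG_WITH_NEG for w in words)
    return (pos - neg) / len(words)


text = "No problems. Works good."
print(simple_mark_negation(tokenize(text)))
# "problems_NEG" now counts as positive: (3 - 0) / 4 words
print(score_with_negation(text))
```

Without the negation step the same text would score (2 - 1) / 4, so handling "no problems" moves the review in the direction its rating suggests.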


In [115]:
def score_sent_w_negation(sent):
    """Returns a score between -1 and 1"""
    # NOTE: this body is identical to score_sent. mark_negation is never
    # called and the lexicon sets contain no "_NEG" entries, so negation is
    # not handled yet and the scores below match the first run.
    tokens = [e.lower() for e in word_tokenize(sent) if e.isalnum()]
    total = len(tokens)
    pos = len([e for e in tokens if e in positive_wds])
    neg = len([e for e in tokens if e in negative_wds])
    if total > 0:
        return (pos - neg) / total
    else:
        return 0

In [116]:
def score_review_w_negation(review):
    """Return the mean sentence score of a review (0 if it has no sentences)."""
    sentiment_scores = [score_sent_w_negation(sent) for sent in sent_tokenize(review)]
    # an empty review yields no sentences; avoid dividing by zero
    if not sentiment_scores:
        return 0
    return sum(sentiment_scores) / len(sentiment_scores)

In [117]:
review_sentiments = [score_review_w_negation(e) for e in reviews]
# review_sentiments

In [118]:
df = pd.DataFrame(
    {
        "rating": ratings,
        "review": reviews,
        "review dictionary based sentiment": review_sentiments,
    }
)

In [119]:
df

Unnamed: 0,rating,review,review dictionary based sentiment
0,1.0,Yet another garbage CoD game. Zombies is unpla...,-0.008081
1,1.0,$80? .... No way. This is NOT worth $80. $80?....,0.033333
2,1.0,One of the worst games ever. I bought and down...,-0.154938
3,1.0,I did a lot of homework before I decided to by...,0.039000
4,1.0,"I am really into RPG games, I loved Skyrim, Bo...",-0.161413
...,...,...,...
7495,5.0,"Worked good, girlfriend loves this game. Five ...",0.250000
7496,5.0,This is my 3rd Mystery PI game and I've enjoye...,0.077381
7497,5.0,work like brand new wont brake any time soon,0.222222
7498,5.0,This remote works fantastic. I love it. Five S...,0.277778


In [120]:
# let pandas write the file directly; this avoids doubled line endings on Windows
df.to_csv("..\\data\\processed\\dictionary_based_sentiment_with_negation.tsv", index=False, sep="\t")

In [121]:
# the ratings are balanced (1500 reviews per rating), so there is no majority class
# the distribution of sentiment scores
hist, bin_edges = np.histogram(review_sentiments, density=True)
# density : bool, optional
# If False, the result contains the number of samples in each bin. If True,
# the result is the value of the probability *density* function at the bin,
# normalized such that the *integral* over the range is 1. The values sum to 1
# only for bins of unit width; this is not a probability *mass* function.
# The "counts" column below therefore holds densities, not sample counts.

labels = list(zip(bin_edges, bin_edges[1:]))
labels = [(str(e[0]), str(e[1])) for e in labels]
labels = [" ".join(e) for e in labels]


data3 = pd.DataFrame({"sentiment scores": labels, "counts": hist})
data3

Unnamed: 0,sentiment scores,counts
0,-1.0 -0.8,0.001333
1,-0.8 -0.6,0.003333
2,-0.6 -0.3999999999999999,0.016
3,-0.3999999999999999 -0.19999999999999996,0.095333
4,-0.19999999999999996 0.0,1.446
5,0.0 0.20000000000000018,2.951333
6,0.20000000000000018 0.40000000000000013,0.408
7,0.40000000000000013 0.6000000000000001,0.065333
8,0.6000000000000001 0.8,0.008667
9,0.8 1.0,0.004667


In [123]:
chart5 = (
    alt.Chart(data3)
    .mark_bar()
    .encode(x=alt.X("sentiment scores", sort=labels), y="counts")
)
chart5.save(".\\plots\\01\\review_sentiments_with_negation.html")
chart5
# the (0.0, 0.2) bin dominates -> most reviews score neutral to slightly positive

In [125]:
# is there any relationship between the ratings and the sentiment scores?
source = pd.DataFrame(
    {"ratings": [str(e) for e in ratings], "sentiments": review_sentiments})

In [126]:
# 5
chart6 = (
    alt.Chart(source)
    .mark_circle(size=60)
    .encode(
        x="ratings", y="sentiments", color="ratings", tooltip=["ratings", "sentiments"]
    )
    .interactive()
)
chart6.save(".\\plots\\01\\reviews_ratings_vs_sentiment_w_negation.html")
chart6 

In [127]:
# 6: Correlation

# test correlation
from scipy.stats import pearsonr, spearmanr

corr1, _ = pearsonr(ratings, review_sentiments)
print("Pearson Correlation = {}".format(corr1))

# Spearman rank correlation indicates a moderate positive correlation
# between rating and sentiment
scor1, _ = spearmanr(ratings, review_sentiments)

print("Spearman Rank Correlation = {}".format(scor1))

# The ratings are ordinal and the plotted sentiment scores are not normally
# distributed, so Spearman is the better fit here: Pearson measures linear
# association, and its significance test assumes normality.

# https://statistics.laerd.com/statistical-guides/spearmans-rank-order-correlation-statistical-guide.php
# you could also consult a Manning stats book

Pearson Correlation = 0.5027590735150967
Spearman Rank Correlation = 0.5674703525244491
