# Creating a Dictionary-based Sentiment Analyzer

In [1]:
import pandas as pd
import nltk
from IPython.display import display
pd.set_option('display.max_columns', None)

### Step 1: Loading the small_corpus.csv file created in the "creating_dataset" milestone

In [2]:
reviews = pd.read_csv("../data/small_corpus.csv")

In [3]:
reviews.head()

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,vote,style,image
0,1.0,True,"11 30, 2015",A3AC92K59QLYR8,B00503E8S2,ben,Game freezes over and over its unplayable,it just doesn't work,1448841600,,{'Format:': ' Video Game'},
1,1.0,False,"05 19, 2012",A334LHR8DWARY8,B00178630A,Xenocide,I have no problem with needing to be online to...,The only real way to show Blizzard our feeling...,1337385600,23.0,{'Format:': ' Computer Game'},
2,1.0,True,"10 19, 2014",A28982ODE7ZGVP,B001AWIP7M,Eric Frykberg,NOT GOOD,One Star,1413676800,,{'Format:': ' Video Game'},
3,1.0,True,"09 6, 2015",A19E85RLQCAMI1,B00NASF4MS,Joe,Really not worth the money to buy this game on...,Really not worth the money to buy this game on...,1441497600,2.0,{'Format:': ' Video Game'},
4,1.0,False,"05 28, 2008",AEMQKS13WC4D2,B00140P9BA,Craig,They need to eliminate the Securom. I purchase...,Securom can ruin a great game,1211932800,55.0,{'Format:': ' DVD-ROM'},


### Step 2: Tokenizing the sentences and words of the reviews
Here we test two word tokenizers on the reviews and compare their output to decide which one is better suited to this corpus.

### Treebank Word Tokenizer

In [4]:
from nltk.tokenize import TreebankWordTokenizer
from string import punctuation

In [5]:
tb_tokenizer = TreebankWordTokenizer()

In [6]:
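# Replace HTML line breaks before stripping punctuation; otherwise "<br />" has already become "br".
reviews["rev_text_lower"] = reviews['reviewText'].apply(lambda rev: str(rev)\
                                                        .replace("<br />", " ")\
                                                        .translate(str.maketrans('', '', punctuation))\
                                                        .lower())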
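
The order of these cleaning steps matters. `punctuation` contains `<`, `/` and `>`, so if punctuation is stripped before the `<br />` replacement runs, the tag has already become `br` and the replacement no longer matches anything. A minimal sketch of the two orderings on a made-up review string (the helper names `strip_then_replace` and `replace_then_strip` are illustrative, not part of the notebook):

```python
from string import punctuation

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', punctuation)

def strip_then_replace(text):
    # Punctuation is removed first, so "<br />" has already become "br "
    # and the replace() call finds nothing to replace.
    return text.translate(_PUNCT_TABLE).replace("<br />", " ").lower()

def replace_then_strip(text):
    # The HTML line break is turned into a space before punctuation is removed.
    return text.replace("<br />", " ").translate(_PUNCT_TABLE).lower()

sample = "Great game!<br />Would buy again."  # hypothetical review text
print(strip_then_replace(sample))
print(replace_then_strip(sample))
```

With the first ordering the leftover `br` is glued onto the preceding word, which produces a token that no sentiment dictionary will recognise.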

In [7]:
reviews[['reviewText','rev_text_lower']].sample(2)

Unnamed: 0,reviewText,rev_text_lower
2434,Super Ghouls 'N Ghosts is impossible. Nah it'...,super ghouls n ghosts is impossible nah its n...
3007,I was originally going to buy this game on ste...,i was originally going to buy this game on ste...


In [8]:
reviews["tb_tokens"] = reviews['rev_text_lower'].apply(lambda rev: tb_tokenizer.tokenize(str(rev)))

In [9]:
pd.set_option('display.max_colwidth', None)

In [10]:
reviews[['reviewText','tb_tokens']].sample(3)

Unnamed: 0,reviewText,tb_tokens
2395,"I bought this game expecting DLC But then, all i got is cheap crap dlc, but i got mass effect so overall im happy with this product! MASS EFFECT 3 SUCKS! ITS SO TERRIBLE!","[i, bought, this, game, expecting, dlc, but, then, all, i, got, is, cheap, crap, dlc, but, i, got, mass, effect, so, overall, im, happy, with, this, product, mass, effect, 3, sucks, its, so, terrible]"
3368,"Yeah, Doom 3 is great. There's tension, darkness, a forboding feeling, and some awesome looking demons coming to tear you apart and drag you to hell.\n\nI'm sure someone's already said that the fact that you can't equip the flashlight at the same time as a gun really makes things a little scarier.\n\nThe game is good. It starts to feel like it's getting a bit long, but then you get attacked by a new enemy or you get rocketted into Hell or something like that and you're suddenly loving it again.\n\nThe game can get to be pretty hard, but it's nothing you can't get through after a couple tries.\n\nAs for the Collector's Edition, I'd say only get it if you're a big fan of the Doom series. Of course. You've got some interviews and such, some concept art, and the original Dooms. There's really no reason to get it for Doom and Doom 2 unless you HAVE to have them on xbox for some reason. It is fun to get some friends together and play some deathmatches on the XBOX, but beyond that, there's really not much use to them when you could easily just get them for your computer and have the real experience.\n\nI really only bought the XBOX version of Doom 3 because my computer would never run it. So yeah. If you have a computer capable of running Doom 3 you shouldn't even be thinking about the XBOX version.","[yeah, doom, 3, is, great, theres, tension, darkness, a, forboding, feeling, and, some, awesome, looking, demons, coming, to, tear, you, apart, and, drag, you, to, hell, im, sure, someones, already, said, that, the, fact, that, you, cant, equip, the, flashlight, at, the, same, time, as, a, gun, really, makes, things, a, little, scarier, the, game, is, good, it, starts, to, feel, like, its, getting, a, bit, long, but, then, you, get, attacked, by, a, new, enemy, or, you, get, rocketted, into, hell, or, something, like, that, and, youre, suddenly, loving, it, again, the, game, can, get, to, be, pretty, hard, ...]"
3326,"PDP has impressed me lately with their line of Xbox One & PC accessories, and this wired controller is no exception. PDP's controllers feel almost as good as the Microsoft variety, and they're just as responsive in games. On top of that, the PDP controller has an extra button that, when used in conjunction with the D-Pad, gives you audio controls for a headset via the built-in headphone jack.\n\nButtons are sufficiently clicky and triggers have nice resistance. The analog sticks aren't too stiff or too loose, and they have a nice divot in them to make your thumbs sit solidly in place when gaming. The dark camo design looks good, and since it's a wired controller, this feels a bit lighter than an wireless Xbox One controller with batteries. PDP also includes a really nice, long USB cord so you aren't tethered too close to your PC or console.\n\nIf you're looking for a nice alternative to the first-party Microsoft controller, you'll be pleased with PDP's offering. The price is decent, the build is nice, and you won't miss a step in your gaming. Five stars, all the way.","[pdp, has, impressed, me, lately, with, their, line, of, xbox, one, pc, accessories, and, this, wired, controller, is, no, exception, pdps, controllers, feel, almost, as, good, as, the, microsoft, variety, and, theyre, just, as, responsive, in, games, on, top, of, that, the, pdp, controller, has, an, extra, button, that, when, used, in, conjunction, with, the, dpad, gives, you, audio, controls, for, a, headset, via, the, builtin, headphone, jack, buttons, are, sufficiently, clicky, and, triggers, have, nice, resistance, the, analog, sticks, arent, too, stiff, or, too, loose, and, they, have, a, nice, divot, in, them, to, make, your, thumbs, sit, solidly, ...]"


### Casual Tokenizer

In [11]:
from nltk.tokenize.casual import casual_tokenize

In [12]:
reviews['casual_tokens'] = reviews['rev_text_lower'].apply(lambda rev: casual_tokenize(str(rev)))

In [13]:
reviews[['reviewText','casual_tokens','tb_tokens']].sample(3)

Unnamed: 0,reviewText,casual_tokens,tb_tokens
3880,"A story of love, hatred, betrayal, mystery, and action. The next WB movie? No, even better: Sonic adventure 2: Battle. In SA2:BSB, you'll play as Sonic, Tails & Knuckles (Hero side), or Dr. Eggman and newcomers Shadow the Hedgehog & Rouge the Bat. I'm going to give you sneaky story previews right now:\n*HERO SIDE STORY*\nSonic has been framed for stealing none other than the Chaos Emerald! The military team G.U.N. won't believe him though, and Sonic is captured and taken to Prison Island. Tails and Amy come to rescue him, though, but before he leaves, he battles a strange black hedgehog. Who is he, and what does he want with -- the Chaos Emerald?\n*DARK SIDE STORY*\nEggman has received traces of a project called Project Shadow, an attempt to create the World's Ultimate Lifeform, a.k.a. Shadow. Eggman finds Shadow, and sets him free of his 50-year prison in an attempt to take over the world. THey team up with the sneaky Rouge the Bat. But what is Shadow's past, and true purpose?\nYou didn't really think I'd give you the rest, did you? Nope! You'll have to play it yourself to find out what happens. I'll give you a hint though: Both sides will join forces in the final battle against evil! Nothing more.\nP.S. I heard the original SA for Dreamcast had a much better storyline. 
How Sonic Team could cook up a plot better than this one is beyond me.....","[a, story, of, love, hatred, betrayal, mystery, and, action, the, next, wb, movie, no, even, better, sonic, adventure, 2, battle, in, sa2bsb, youll, play, as, sonic, tails, knuckles, hero, side, or, dr, eggman, and, newcomers, shadow, the, hedgehog, rouge, the, bat, im, going, to, give, you, sneaky, story, previews, right, now, hero, side, story, sonic, has, been, framed, for, stealing, none, other, than, the, chaos, emerald, the, military, team, gun, wont, believe, him, though, and, sonic, is, captured, and, taken, to, prison, island, tails, and, amy, come, to, rescue, him, though, but, before, he, leaves, he, battles, a, strange, black, ...]","[a, story, of, love, hatred, betrayal, mystery, and, action, the, next, wb, movie, no, even, better, sonic, adventure, 2, battle, in, sa2bsb, youll, play, as, sonic, tails, knuckles, hero, side, or, dr, eggman, and, newcomers, shadow, the, hedgehog, rouge, the, bat, im, going, to, give, you, sneaky, story, previews, right, now, hero, side, story, sonic, has, been, framed, for, stealing, none, other, than, the, chaos, emerald, the, military, team, gun, wont, believe, him, though, and, sonic, is, captured, and, taken, to, prison, island, tails, and, amy, come, to, rescue, him, though, but, before, he, leaves, he, battles, a, strange, black, ...]"
843,"poor quality, breaks after two weeks","[poor, quality, breaks, after, two, weeks]","[poor, quality, breaks, after, two, weeks]"
3042,love it,"[love, it]","[love, it]"
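
The two token columns above are identical largely because punctuation was stripped before tokenizing, which removes most of the cases where the tokenizers disagree. A small sketch on a raw, made-up sentence shows where they differ: `casual_tokenize` is designed for informal text and keeps emoticons and contractions intact, while the Treebank tokenizer splits contractions and punctuation. The sample sentence is an assumption for illustration only.

```python
from nltk.tokenize import TreebankWordTokenizer
from nltk.tokenize.casual import casual_tokenize

raw = "great game :-) can't wait for part 2"  # hypothetical raw review text

tb_raw_tokens = TreebankWordTokenizer().tokenize(raw)
casual_raw_tokens = casual_tokenize(raw)

print("treebank:", tb_raw_tokens)
print("casual:  ", casual_raw_tokens)
```

If emoticons or contractions are expected to carry sentiment, this difference is a reason to tokenize the raw text rather than the punctuation-free version.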


### Removing Punctuation and Stop Words

In [14]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/koosha.tahmasebipour/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [46]:
stop_words = nltk.corpus.stopwords.words('english')

In [47]:
stop_words.remove("no")

In [48]:
stop_words.remove("not")

In [49]:
stop_words[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [51]:
"not" in stop_words

False

In [17]:
len(stop_words)

179

In [18]:
from string import punctuation
print(punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [19]:
reviews['tokens_nosw'] = reviews['tb_tokens'].\
    apply(lambda words: [w for w in words if w not in stop_words and w not in punctuation and w != ""])

In [20]:
reviews[['tb_tokens','tokens_nosw']].sample(3)

Unnamed: 0,tb_tokens,tokens_nosw
2929,"[i, have, to, admit, the, thing, looked, really, suspect, when, it, showed, up, but, it, works, as, well, as, a, branded, cable, no, complaints]","[admit, thing, looked, really, suspect, showed, works, well, branded, cable, complaints]"
493,"[total, junk, waste, of, money, should, have, brought, a, gears, 4, xbox, one, or, ps4, pro, instead, u, cant, wear, this, thing, for, hours, or, even, a, hour, for, that, matter, without, feeling, sick, and, the, games, dont, even, look, clear]","[total, junk, waste, money, brought, gears, 4, xbox, one, ps4, pro, instead, u, cant, wear, thing, hours, even, hour, matter, without, feeling, sick, games, dont, even, look, clear]"
319,"[good, game, but, i, must, stress, my, disappointment, in, the, lack, of, information, given, apparently, the, online, servers, for, this, game, are, no, longer, active, amazon, shouldve, noted, this, in, the, games, description, me, and, a, couple, friends, bought, this, game, knowing, that, it, did, originally, have, online, play, capability, when, it, was, released]","[good, game, must, stress, disappointment, lack, information, given, apparently, online, servers, game, longer, active, amazon, shouldve, noted, games, description, couple, friends, bought, game, knowing, originally, online, play, capability, released]"
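
Removing "no" and "not" from the stop-word list matters for sentiment, because dropping them can flip the apparent polarity of a phrase. Row 2929 above illustrates the risk: with "no" filtered out, "no complaints" is reduced to "complaints". The sketch below uses a tiny hand-written stop-word list (an assumption, not the NLTK list) to compare filtering with and without the negation words. It also stores the stop words and punctuation in sets, so each membership test is an exact match rather than a list scan or a substring check against the `punctuation` string.

```python
from string import punctuation

# Tiny illustrative stop-word list; the notebook uses NLTK's English list.
toy_stop_words = {"this", "is", "a", "no", "not", "it", "the"}
negations = {"no", "not"}
punct_set = set(punctuation)

def remove_stop_words(tokens, stop_words):
    # Keep tokens that are neither stop words nor single punctuation marks.
    return [w for w in tokens if w not in stop_words and w not in punct_set and w != ""]

tokens = ["this", "is", "not", "a", "good", "game"]

with_negation_dropped = remove_stop_words(tokens, toy_stop_words)
with_negation_kept = remove_stop_words(tokens, toy_stop_words - negations)

print(with_negation_dropped)
print(with_negation_kept)
```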


### Stemming

In [21]:
from nltk.stem.porter import PorterStemmer

In [22]:
stemmer = PorterStemmer()

In [23]:
reviews['tokens_stemmed'] = reviews['tokens_nosw'].apply(lambda words: [stemmer.stem(w) for w in words])

In [24]:
reviews[['tokens_nosw','tokens_stemmed']].sample(3)

Unnamed: 0,tokens_nosw,tokens_stemmed
962,"[one, regretted, game, ever, bought, life, game, simply, horrible, graphic, look, like, made, 10, years, ago, problems, supporting, graphic, cardsthe, audio, sounds, like, separate, game, main, character, talking, sounds, like, hes, recording, room, game, targeting, somehow, broken, put, whole, clip, bullets, someone, perfectly, standing, ai, super, stupid, nothing, good, game, id, say, id, rather, give, money, sister, let, buy, clothes]","[one, regret, game, ever, bought, life, game, simpli, horribl, graphic, look, like, made, 10, year, ago, problem, support, graphic, cardsth, audio, sound, like, separ, game, main, charact, talk, sound, like, he, record, room, game, target, somehow, broken, put, whole, clip, bullet, someon, perfectli, stand, ai, super, stupid, noth, good, game, id, say, id, rather, give, money, sister, let, buy, cloth]"
2312,"[game, probably, good, played, internet, linking, systems, would, probably, lot, fun, playing, death, matches, developers, didnt, think, include, little, feature, ive, played, single, player, game, thats, fairly, rate, id, probably, give, 2, half, dont, xbox, live, cant, rate, aspect, game, three, friends, three, xboxes, cant, accurately, review, linking, feature, review, single, player, mode, ok]","[game, probabl, good, play, internet, link, system, would, probabl, lot, fun, play, death, match, develop, didnt, think, includ, littl, featur, ive, play, singl, player, game, that, fairli, rate, id, probabl, give, 2, half, dont, xbox, live, cant, rate, aspect, game, three, friend, three, xbox, cant, accur, review, link, featur, review, singl, player, mode, ok]"
4353,"[fun, game, beat, ita, good, sequel, reboot]","[fun, game, beat, ita, good, sequel, reboot]"
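
As the sample shows, Porter stems are not always dictionary words ("horrible" becomes "horribl", "simply" becomes "simpli"), which may matter if the stems are later looked up in a sentiment lexicon. A quick check on a few of the words from row 962:

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Words taken from the sampled review above.
words = ["regretted", "simply", "horrible", "years", "problems"]
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word:>10} -> {stem}")
```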


### Lemmatisation

In [25]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn
from nltk.corpus import sentiwordnet as swn
from nltk import sent_tokenize, word_tokenize, pos_tag

In [26]:
def penn_to_wn(tag):
    """
        Convert a Penn Treebank POS tag to the corresponding WordNet POS constant
    """
    if tag.startswith('J'):
        return wn.ADJ
    elif tag.startswith('N'):
        return wn.NOUN
    elif tag.startswith('R'):
        return wn.ADV
    elif tag.startswith('V'):
        return wn.VERB
    return None

In [27]:
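lemmatizer = WordNetLemmatizer()
def get_lemas(tokens):
    """Lemmatize tokens, keeping only those tagged as adjective, noun, adverb or verb."""
    lemmas = []
    for token in tokens:
        # Each token is tagged on its own, so the tagger sees no sentence context.
        pos = penn_to_wn(pos_tag([token])[0][1])
        # Tokens whose tag has no WordNet equivalent (pos is None) are dropped.
        if pos:
            lemma = lemmatizer.lemmatize(token, pos)
            if lemma:
                lemmas.append(lemma)
    return lemmas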

In [28]:
reviews['lemmas'] = reviews['tokens_nosw'].apply(lambda tokens: get_lemas(tokens))

In [36]:
reviews[['reviewText','tokens_stemmed','lemmas']].sample(3)

Unnamed: 0,reviewText,tokens_stemmed,lemmas
3825,It was great,[great],[great]
3326,"PDP has impressed me lately with their line of Xbox One & PC accessories, and this wired controller is no exception. PDP's controllers feel almost as good as the Microsoft variety, and they're just as responsive in games. On top of that, the PDP controller has an extra button that, when used in conjunction with the D-Pad, gives you audio controls for a headset via the built-in headphone jack.\n\nButtons are sufficiently clicky and triggers have nice resistance. The analog sticks aren't too stiff or too loose, and they have a nice divot in them to make your thumbs sit solidly in place when gaming. The dark camo design looks good, and since it's a wired controller, this feels a bit lighter than an wireless Xbox One controller with batteries. PDP also includes a really nice, long USB cord so you aren't tethered too close to your PC or console.\n\nIf you're looking for a nice alternative to the first-party Microsoft controller, you'll be pleased with PDP's offering. The price is decent, the build is nice, and you won't miss a step in your gaming. 
Five stars, all the way.","[pdp, impress, late, line, xbox, one, pc, accessori, wire, control, except, pdp, control, feel, almost, good, microsoft, varieti, theyr, respons, game, top, pdp, control, extra, button, use, conjunct, dpad, give, audio, control, headset, via, builtin, headphon, jack, button, suffici, clicki, trigger, nice, resist, analog, stick, arent, stiff, loos, nice, divot, make, thumb, sit, solidli, place, game, dark, camo, design, look, good, sinc, wire, control, feel, bit, lighter, wireless, xbox, one, control, batteri, pdp, also, includ, realli, nice, long, usb, cord, arent, tether, close, pc, consol, your, look, nice, altern, firstparti, microsoft, control, youll, pleas, pdp, offer, price, decent, build, nice, ...]","[pdp, impressed, lately, line, xbox, pc, accessory, wire, controller, exception, pdps, controller, feel, almost, good, microsoft, variety, theyre, responsive, game, top, pdp, controller, extra, button, use, conjunction, dpad, give, audio, control, headset, builtin, headphone, jack, button, sufficiently, clicky, trigger, nice, resistance, analog, stick, arent, stiff, loose, nice, divot, make, thumb, sit, solidly, place, game, dark, camo, design, look, good, wire, controller, feel, bit, lighter, wireless, xbox, controller, battery, pdp, also, include, really, nice, long, usb, cord, arent, tether, close, pc, console, youre, look, nice, alternative, firstparty, microsoft, controller, youll, pleased, pdps, offering, price, decent, build, nice, wont, miss, step, game, ...]"
2812,"Yes there are some things missing. No training camp for instance. There are some other things missing. However. There is no better football game out on the XBOX 360. IN fact there is no other football game. So this is it. Will it do ok? Yes. It is still a pretty darn fine little football game. Was everyone expecting more? You bet. But let's not completely discount that this is still Madden and still holds most of the charm of the former entries in the series.\n\nBottom Line: If you are a football nut will you like this game? Yes. Will you friends like it? They may be miffed at some exclusions and changes, but they will be in awe of it's beauty. Good game. Get it if you need a football game.","[ye, thing, miss, train, camp, instanc, thing, miss, howev, better, footbal, game, xbox, 360, fact, footbal, game, ok, ye, still, pretti, darn, fine, littl, footbal, game, everyon, expect, bet, let, complet, discount, still, madden, still, hold, charm, former, entri, seri, bottom, line, footbal, nut, like, game, ye, friend, like, may, mif, exclus, chang, awe, beauti, good, game, get, need, footbal, game]","[yes, thing, miss, training, camp, instance, thing, miss, however, well, football, game, xbox, fact, football, game, ok, yes, still, pretty, darn, fine, little, football, game, everyone, expect, bet, let, completely, discount, still, madden, still, hold, charm, former, entry, series, bottom, line, football, nut, game, yes, friend, miffed, exclusion, change, awe, beauty, good, game, get, need, football, game]"


### Sentiment Predictor Baseline Model

In [30]:
from nltk import pos_tag

In [31]:
pos_tag("Hi how are you exactly my dear".split())

[('Hi', 'NNP'),
 ('how', 'WRB'),
 ('are', 'VBP'),
 ('you', 'PRP'),
 ('exactly', 'RB'),
 ('my', 'PRP$'),
 ('dear', 'NN')]
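
These Penn Treebank tags are what `penn_to_wn` consumes. The sketch below re-implements the same first-letter mapping with plain strings, assuming WordNet's single-letter POS codes (`'a'`, `'n'`, `'r'`, `'v'`, the values of `wn.ADJ`, `wn.NOUN`, `wn.ADV` and `wn.VERB`), and applies it to the tagged sentence above to show which tokens would survive the `if pos:` filter in `get_lemas`. The name `penn_to_wn_code` is illustrative.

```python
# Tagged output copied from the pos_tag call above.
tagged = [('Hi', 'NNP'), ('how', 'WRB'), ('are', 'VBP'), ('you', 'PRP'),
          ('exactly', 'RB'), ('my', 'PRP$'), ('dear', 'NN')]

def penn_to_wn_code(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None if unmapped."""
    first_letter_map = {'J': 'a', 'N': 'n', 'R': 'r', 'V': 'v'}
    return first_letter_map.get(tag[:1])

kept = [(word, penn_to_wn_code(tag)) for word, tag in tagged if penn_to_wn_code(tag)]
dropped = [word for word, tag in tagged if penn_to_wn_code(tag) is None]

print("kept:   ", kept)
print("dropped:", dropped)
```

Pronouns and wh-words fall outside the four mapped categories, so they never reach the lemmatizer or the sentiment lookup.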

In [32]:
nltk.download('sentiwordnet')
from nltk.corpus import sentiwordnet as swn

[nltk_data] Downloading package sentiwordnet to
[nltk_data]     /Users/koosha.tahmasebipour/nltk_data...
[nltk_data]   Package sentiwordnet is already up-to-date!


In [33]:
list(swn.senti_synsets("happy"))

[SentiSynset('happy.a.01'),
 SentiSynset('felicitous.s.02'),
 SentiSynset('glad.s.02'),
 SentiSynset('happy.s.04')]

In [34]:
list(swn.senti_synsets("sad"))[0]

SentiSynset('sad.a.01')

In [35]:
joy1 = swn.senti_synset('joy.n.01')
joy2 = swn.senti_synset('joy.n.02')
 
trouble1 = swn.senti_synset('trouble.n.03')
trouble2 = swn.senti_synset('trouble.n.04')
 
 
categories = ["Joy1", "Joy2", "Trouble1", "Trouble2"]
rows = []
rows.append(["List", "Positive score", "Negative Score"])
accs = {}
accs["Joy1"] = [joy1.pos_score(), joy1.neg_score()]
accs["Joy2"] = [joy2.pos_score(), joy2.neg_score()]
accs["Trouble1"] = [trouble1.pos_score(), trouble1.neg_score()]
accs["Trouble2"] = [trouble2.pos_score(), trouble2.neg_score()]
for cat in categories:
    rows.append([cat, f"{accs.get(cat)[0]:.3f}",
                f"{accs.get(cat)[1]:.3f}"])
 
columns = zip(*rows)
column_widths = [max(len(item) for item in col) for col in columns]
for row in rows:
    print(''.join(' {:{width}} '.format(row[i], width=column_widths[i])
                  for i in range(0, len(row))))

 List      Positive score  Negative Score 
 Joy1      0.500           0.250          
 Joy2      0.375           0.000          
 Trouble1  0.000           0.625          
 Trouble2  0.000           0.500          
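
Each SentiSynset carries a positive and a negative score, and SentiWordNet defines the objectivity score as whatever remains, `1 - (pos + neg)`. One simple way to collapse these into a single polarity value, offered here as an assumption rather than the method this notebook goes on to use, is the difference `pos - neg`. The sketch hard-codes the four score pairs printed above so that it runs without NLTK data:

```python
# (positive, negative) scores copied from the table above.
scores = {
    "Joy1": (0.500, 0.250),
    "Joy2": (0.375, 0.000),
    "Trouble1": (0.000, 0.625),
    "Trouble2": (0.000, 0.500),
}

def net_polarity(pos, neg):
    # A positive result leans positive, a negative result leans negative.
    return pos - neg

def objectivity(pos, neg):
    # SentiWordNet scores for a synset sum to 1 across pos, neg and objective.
    return 1.0 - (pos + neg)

summary = {name: (net_polarity(p, n), objectivity(p, n)) for name, (p, n) in scores.items()}
for name, (net, obj) in summary.items():
    print(f"{name:<9} net={net:+.3f}  objective={obj:.3f}")
```

Note that the first sense of "joy" still carries a negative score of 0.250, so the choice of synset affects the net value noticeably.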
