# Sentiment analysis
<img src="./screencast.gif"/>

In this sample, we will build a sentiment annotator for the [Movie Review](http://www.cs.cornell.edu/people/pabo/movie-review-data/) dataset from Cornell.

In [1]:
import json
import tarfile

with tarfile.open('data.json.tgz') as tar:
    file = tar.extractfile('data.json')
    data = file.read().decode('utf8')

In [2]:
data[:2000]

'{"0": {"text": "in my review of \\" the spy who shagged me , \\" i postulated an unbreakable law of film physics : every time a sequel is as good as or better than the previous film in the series , it is followed by a third movie that is a bore . \\nthe cause is probably complacency ; a studio sighs with relief when part 2 lives up to expectations and figures part 3 is a sure thing . \\n \\" scream 3 \\" provides the latest proof of this rule . \\nin los angeles production has begun on \\" stab 3 : return to woodsboro , \\" the most recent installment in the series of movies inspired by the murders surrounding sidney prescott ( neve campbell ) . \\nhowever , life soon starts imitating art , and \\" stab \\" cast members turn up stabbed . \\nsmelling yet another book deal , gale weathers ( courteney cox arquette ) comes to the set to investigate and finds her ex-boyfriend dewey riley ( david arquette ) acting as a technical consultant and getting chummy with jennifer ( parker posey ) ,

In [3]:
class Record(object):
    def __init__(self, id, text, cornell, vader=None, my=None):
        self.id = id
        self.text = text
        self.cornell = cornell
        self.vader = vader
        self.my = my
        
    def __repr__(self):
        return 'Record(id={self.id!r}, text={self.text!r}, cornell={self.cornell!r}, vader={self.vader!r}, my={self.my!r})'.format(self=self)


def parse(data):
    data = json.loads(data)
    for id in data:
        item = data[id]
        yield Record(
            id=id,
            text=item['text'],
            cornell=item['sent'],
        )
        
        
records = list(parse(data))

We will write a simple display formatter to make the output easier to read:

In [4]:
from IPython.display import display, HTML


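RED = 'red'
GREEN = 'green'
GRAY = 'gray'


def format_color(value, color):
    return '<span style="color:{color};">{value}</span>'.format(
        color=color,
        value=value
    )


def score_color(value):
    # Negative scores are red, positive ones green, and exactly 0 (neutral) gray
    if value < 0:
        return RED
    if value > 0:
        return GREEN
    return GRAY


def display_record(record):
    # Gold label from the Cornell dataset: 'neg' or 'pos'
    value = record.cornell
    if value == 'neg':
        color = RED
    elif value == 'pos':
        color = GREEN
    else:
        raise ValueError(value)
    display(HTML('cornell: ' + format_color(value, color)))

    # VADER compound score, shown only once it has been computed
    value = record.vader
    if value is not None:
        display(HTML('vader: ' + format_color(value, score_color(value))))

    # Manual label (1, 0 or -1), shown only once the annotator has set it
    value = record.my
    if value is not None:
        display(HTML('my: ' + format_color(value, score_color(value))))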
    
    print(record.text)

    
display_record(records[0])

the yet-to-be-released krippendorf's tribe is being marketed as a family comedy , but buyer beware . 
this movie can't make up its mind . 
is it a family comedy with vulgar references to both the male and female bodies , menstruation , circumcision , and sex that would make any parents squirm at the thought of having their child next to them ? 
or is it an adult comedy approached with such immaturity that only adolescents will appreciate the effort ? 
either way , " unbalanced " is the word to stamp on this hit and miss and miss and miss effort . 
the premise is catchy - widowed anthropology professor james krippendorf ( richard dreyfuss ) has spent the past two years " getting over " the death of his wife , neglecting key research and squandering grant money on personal living expenses . 
now it's time to show what he's achieved in those two years , and he has absolutely nothing to show for it . 
with a fabricated tale of studying a previously undiscovered tribe in new guinea , krippe

In [5]:
len(records)

2000
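
Before scoring the reviews, it can help to confirm that every parsed label is one that `display_record` accepts. The following is a minimal sketch, not part of the original notebook: `summarize_labels`, `KNOWN_LABELS` and the sample labels are hypothetical names and data used only for illustration.

```python
from collections import Counter

# Labels that display_record knows how to colour
KNOWN_LABELS = {'pos', 'neg'}


def summarize_labels(labels):
    """Count each label and list any value display_record would reject."""
    counts = Counter(labels)
    unexpected = sorted(set(counts) - KNOWN_LABELS)
    return counts, unexpected


# Hypothetical labels standing in for `record.cornell` values
sample_labels = ['pos', 'neg', 'neg', 'pos', 'neg']
counts, unexpected = summarize_labels(sample_labels)
print(counts['pos'], counts['neg'], unexpected)
```

With the real data, the same check would be `summarize_labels(record.cornell for record in records)`.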

In [6]:
import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/alexkuk/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [7]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from tqdm.notebook import tqdm as log_progress  # tqdm_notebook is deprecated in recent tqdm releases

vader = SentimentIntensityAnalyzer()


for record in log_progress(records):
    # polarity_scores returns a dict such as
    # {'compound': 0.6156, 'neg': 0.074, 'pos': 0.085, 'neu': 0.842}
    score = vader.polarity_scores(record.text)
    # Keep only the normalized overall score, which lies in [-1, 1]
    record.vader = score['compound']

In [8]:
display_record(records[0])

the yet-to-be-released krippendorf's tribe is being marketed as a family comedy , but buyer beware . 
this movie can't make up its mind . 
is it a family comedy with vulgar references to both the male and female bodies , menstruation , circumcision , and sex that would make any parents squirm at the thought of having their child next to them ? 
or is it an adult comedy approached with such immaturity that only adolescents will appreciate the effort ? 
either way , " unbalanced " is the word to stamp on this hit and miss and miss and miss effort . 
the premise is catchy - widowed anthropology professor james krippendorf ( richard dreyfuss ) has spent the past two years " getting over " the death of his wife , neglecting key research and squandering grant money on personal living expenses . 
now it's time to show what he's achieved in those two years , and he has absolutely nothing to show for it . 
with a fabricated tale of studying a previously undiscovered tribe in new guinea , krippe
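
To get a rough sense of how often VADER agrees with the Cornell labels, the compound score can be mapped onto the same `+1`, `0`, `-1` scale used by the annotator buttons. The sketch below is an illustration under stated assumptions, not the notebook's own method: `compound_to_label`, `agreement` and the sample pairs are hypothetical, and the `0.05` cut-off is the conventional VADER threshold rather than anything this notebook sets.

```python
# Map Cornell's string labels onto the numeric scale used by the buttons
CORNELL_TO_LABEL = {'pos': 1, 'neg': -1}


def compound_to_label(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to -1, 0 or +1."""
    if compound >= threshold:
        return 1
    if compound <= -threshold:
        return -1
    return 0


def agreement(pairs, threshold=0.05):
    """Share of (cornell, compound) pairs where VADER matches the Cornell label."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(
        CORNELL_TO_LABEL[cornell] == compound_to_label(compound, threshold)
        for cornell, compound in pairs
    )
    return hits / len(pairs)


# Hypothetical (cornell, vader) pairs, for illustration only
sample = [('pos', 0.9), ('neg', -0.4), ('neg', 0.6), ('pos', 0.01)]
print(agreement(sample))
```

On the scored records this would be called as `agreement((r.cornell, r.vader) for r in records)`. Reviews where the two disagree are the ones most worth checking by hand.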

## Assemble our annotator
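Now we can assemble our annotator using `ipyannotate`. For each review, the canvas shows the Cornell label and the VADER score, and the user can override them with the `+1`, `0` and `-1` buttons (keyboard shortcuts `1`, `2` and `3`). Each click stores the button's value in the `my` field of the current task's record; `j` and `k` move to the previous and next task.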

In [9]:
from ipyannotate.buttons import ValueButton as Button, NextButton, BackButton
from ipyannotate.toolbar import Toolbar
from ipyannotate.tasks import Task, Tasks
from ipyannotate.canvas import OutputCanvas
from ipyannotate.annotation import Annotation


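def callback(button):
    # `annotation` is the widget created at the end of this cell; the name is
    # resolved when a button is clicked, so defining it later is fine.
    annotation.tasks.current.output.my = button.value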


tasks = Tasks(Task(_) for _ in records[:100])

pos = Button(1, shortcut='1', color='green')
neu = Button(0, shortcut='2', color='gray')
neg = Button(-1, shortcut='3', color='red')

for button in [pos, neu, neg]:
    button.on_click(callback)

buttons = [pos, neu, neg, BackButton(shortcut='j'), NextButton(shortcut='k')]
toolbar = Toolbar(buttons)

canvas = OutputCanvas(display=display_record)

annotation = Annotation(toolbar, tasks, canvas=canvas)
annotation

# annotation.tasks
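
Once some reviews have been labelled, the manual labels live in the `my` field of each record and can be collected for later use. This is a minimal sketch assuming a record shaped like the `Record` class above. `collect_annotations` and `dump_annotations` are hypothetical helpers, the stand-in class exists only to keep the example self-contained, and the labels in `demo` are made up.

```python
import json


class Record(object):
    # Minimal stand-in for the notebook's Record class
    def __init__(self, id, text, cornell, vader=None, my=None):
        self.id = id
        self.text = text
        self.cornell = cornell
        self.vader = vader
        self.my = my


def collect_annotations(records):
    """Return {id: my} for every record that has a manual label.

    A label of 0 (neutral) is kept; only unlabelled records (None) are skipped.
    """
    return {r.id: r.my for r in records if r.my is not None}


def dump_annotations(records):
    """Serialize the manual labels to a JSON string."""
    return json.dumps(collect_annotations(records), indent=2, sort_keys=True)


# Hypothetical labels, for illustration only
demo = [
    Record('961', 'first review', 'neg', my=1),
    Record('581', 'second review', 'pos'),        # not labelled yet
    Record('417', 'third review', 'pos', my=0),   # neutral is a real label
]
print(dump_annotations(demo))
```

In the notebook, `dump_annotations(records[:100])` would cover the records passed to `Tasks`.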

In [11]:
annotation.tasks[:10]

[Task(output=Record(id='961', text='the yet-to-be-released krippendorf\'s tribe is being mark..., value=1),
 Task(output=Record(id='581', text='mpaa : not rated ( though i feel it would likely be pg , ..., value=0),
 Task(output=Record(id='417', text="would you believe -- in real life , i mean -- that if you..., value=-1),
 Task(output=Record(id='1790', text='it seemed like the perfect concept . \nwhat better for t..., value=1),
 Task(output=Record(id='395', text='phaedra cinema , the distributor of such never-heard-of c..., value=0),
 Task(output=Record(id='725', text="synopsis : a man whose lover , paris , was murdered agree..., value=-1),
 Task(output=Record(id='718', text='luckily , some people got starship troopers . \nsome peop..., value=1),
 Task(output=Record(id='498', text='vampire lore and legend has always been a popular fantasy..., value=0),
 Task(output=Record(id='1910', text='though it is a fine piece of filmmaking , there\'s somet..., value=-1),
 Task(output=Record(id='1