# Sentiment Analysis

After considering many alternatives, for the sentiment analysis I decided to test one of the very few algorithms that go a long way towards addressing compositionality, which algorithms based on bag-of-words models cannot address: simply attributing a polarity to the single words does not always lead to the right result.
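
To make the compositionality problem concrete, here is a minimal sketch of a bag-of-words scorer. The lexicon and the function name `bag_of_words_polarity` are made up for illustration and are not part of any system tested here.

```python
# Toy polarity lexicon (hypothetical values, for illustration only).
LEXICON = {'good': 1, 'great': 1, 'bad': -1, 'terrible': -1}

def bag_of_words_polarity(text):
    """Sum the polarity of each known word, ignoring word order."""
    return sum(LEXICON.get(word, 0) for word in text.lower().split())

# Both sentences get the same score, because the negation is invisible
# to a model that only looks at individual words.
print(bag_of_words_polarity('the flight was good'))
print(bag_of_words_polarity('the flight was not good'))
```

A compositional model, by contrast, scores the phrase "not good" as a unit built from its parts.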

I tested Stanford's reimplementation of the [Recursive Neural Tensor Networks by Socher et al.](http://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf).

They applied their system to a corpus of ~11k movie reviews from Rotten Tomatoes, and made the trained model available through the Stanford CoreNLP system.

Although I considered reimplementing their system, I started out by verifying the possibility of domain transfer. Kaggle has made available a labeled [dataset of airline tweets](https://www.kaggle.com/crowdflower/twitter-airline-sentiment).

The problem is complicated by the different domain, but also by the fact that Twitter texts are short, difficult to parse, and difficult to interpret even for humans.

The results surprised me. The following describes how I tested the accuracy of the resulting system.

## Input and test files

Since my purpose is not to train and test a model, instead of dividing the labeled input into three parts as usual, I used all ~14.5k items in the Kaggle dataset to evaluate the already-trained model provided by the Stanford software.

* An input JSON file (`tweets.json`) is prepared from the dataset;
* that file is processed by the Scala code (`AccuracyTest.process`);
* the output, `analyzed.json`, is evaluated.

In [3]:
import sqlite3, codecs
from json import dumps

# This is the database containing the labeled Airline data
conn = sqlite3.connect('../data/airline-twitter-sentiment/database.sqlite')

In [4]:
# Turn positive -> +, negative -> -, neutral -> =.
def to_symbol(s):
    if s.lower().startswith('pos'): return '+'
    elif s.lower().startswith('neg'): return '-'
    else: return '='

sql = 'select airline_sentiment, text from tweets'

js_docs = (dumps({'polarity': to_symbol(polarity), 
                  'text': text}) 
           for (polarity, text) in conn.execute(sql))

# `out` rather than `os`, to avoid shadowing the standard `os` module name.
with codecs.open('../data/airline-twitter-sentiment/tweets.json', mode='w+', encoding='utf-8') as out:
    out.write(u'\n'.join(js_docs))

* Input file:

In [5]:
!head -n 5 ../data/airline-twitter-sentiment/tweets.json

{"polarity": "=", "text": "@JetBlue's new CEO seeks the right balance to please passengers and Wall ... - Greenfield Daily Reporter http://t.co/LM3opxkxch"}
{"polarity": "-", "text": "@JetBlue is REALLY getting on my nerves !! \ud83d\ude21\ud83d\ude21 #nothappy"}
{"polarity": "-", "text": "@united yes. We waited in line for almost an hour to do so. Some passengers just left not wanting to wait past 1am."}
{"polarity": "-", "text": "@united the we got into the gate at IAH on time and have given our seats and closed the flight. If you know people is arriving, have to wait"}
{"polarity": "-", "text": "@SouthwestAir its cool that my bags take a bit longer, dont give me baggage blue balls-turn the carousel on, tell me it's coming, then not."}


* The following file is the output of `AccuracyTest.process` (in `src/main/scala/BasicTask.scala`) with the above file as input.

In [6]:
!wc -l ../data/airline-twitter-sentiment/analyzed.json

   14485 ../data/airline-twitter-sentiment/analyzed.json


In [7]:
!head -n 5 ../data/airline-twitter-sentiment/analyzed.json

{"text":"@JetBlue's new CEO seeks the right balance to please passengers and Wall ... - Greenfield Daily Reporter http://t.co/LM3opxkxch","polarity":"=","sentiment":3.0}
{"text":"@JetBlue is REALLY getting on my nerves !! \ud83d\ude21\ud83d\ude21 #nothappy","polarity":"-","sentiment":1.5}
{"text":"@united yes. We waited in line for almost an hour to do so. Some passengers just left not wanting to wait past 1am.","polarity":"-","sentiment":1.6666666666666667}
{"text":"@united the we got into the gate at IAH on time and have given our seats and closed the flight. If you know people is arriving, have to wait","polarity":"-","sentiment":2.5}
{"text":"@SouthwestAir its cool that my bags take a bit longer, dont give me baggage blue balls-turn the carousel on, tell me it's coming, then not.","polarity":"-","sentiment":1.0}


* The file is read in to be evaluated.

_Small note_: in both Python 2 and 3, splitting Unicode text into lines (for example with `splitlines()`) breaks on every Unicode line-boundary character, not just `\n`, which is unfortunate, since some of those can show up in the tweets. That's why I read the entire file into memory and split it on the ASCII newline.
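
A minimal sketch of the issue, using a made-up two-record string rather than the real file: the first record contains U+2028 (LINE SEPARATOR) inside its text, so `str.splitlines()` cuts it in half, while splitting on `'\n'` keeps one record per line.

```python
from json import loads

# Two JSON records separated by an ASCII newline; the first one has a
# U+2028 character inside the tweet text (hypothetical data).
data = '{"text": "first\u2028half"}\n{"text": "second"}'

print(len(data.splitlines()))  # the first record is broken in two
print(len(data.split('\n')))   # one piece per record

# Only the pieces split on '\n' are valid JSON documents.
records = [loads(line) for line in data.split('\n') if line]
print(records[0]['text'] == 'first\u2028half')
```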

In [8]:
from json import loads
with open('../data/airline-twitter-sentiment/analyzed.json', encoding='utf-8') as lines:
    s = lines.read()

In [9]:
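# json.loads no longer accepts `encoding` (removed in Python 3.9);
# the text was already decoded when the file was opened.
jss = [loads(l) for l in s.split('\n') if l]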
len(jss)

14485

In [10]:
import pandas as pd

In [11]:
jss_df = pd.DataFrame(jss)
N = len(jss_df); N

14485

* A small sample of the test Tweets:

In [12]:
jss_df.sample(15)

| | polarity | sentiment | text |
|---|---|---|---|
| 1619 | - | 1.0 | @USAirways has an SPF record error that is cau... |
| 1252 | - | 1.5 | @united once again my bag is lost when I trave... |
| 11169 | = | 1.0 | @jetblue it's time for a direct flight from #J... |
| 6551 | + | 2.25 | @JetBlue ok!!! That's super helpful. Thank you... |
| 5779 | - | 1.5 | @SouthwestAir been on hold to rebook a flight ... |
| 1797 | = | 2.0 | “@SouthwestAir: @saysorrychris Can you follow ... |
| 7729 | - | 2.0 | @USAirways using all of my monthly minute on h... |
| 6373 | + | 2.5 | @VirginAmerica Thanks for a great flight from ... |
| 8708 | + | 3.0 | @united Thank y'all for being an amazing airli... |
| 5882 | + | 2.5 | @VirginAmerica thanks guys! Sweet route over t... |


* Sanity check: we want to get an idea of how the system has labeled the data.

In [13]:
def sentiment(pol):
    return jss_df[jss_df['polarity'] == pol]['sentiment']
    
def mean_std(df):
    m, s = df.mean(), df.std()
    return m - s, m, m + s

desc = [mean_std(s) for s in map(sentiment, ('-', '=', '+'))]

The Stanford sentiment analyzer has five labels (from 0 to 4). These are the $(\bar{x}_i - \sigma_i, \bar{x}_i, \bar{x}_i + \sigma_i)$ tuples for the negative, neutral and positive polarity items. There is quite a bit of overlap:

In [14]:
desc

[(0.95509820978792059, 1.4426183135663428, 1.9301384173447651),
 (1.0485336944830861, 1.6355839501000793, 2.2226342057170725),
 (1.5285735260682785, 2.2019647447668018, 2.8753559634653252)]
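
To put a rough number on that overlap, this small sketch measures how much each pair of one-sigma intervals shares. The helper `interval_overlap` is my own, and the intervals are the values printed above rounded to three decimals.

```python
# (mean - std, mean, mean + std) per class, rounded from the output above.
intervals = {
    '-': (0.955, 1.443, 1.930),
    '=': (1.049, 1.636, 2.223),
    '+': (1.529, 2.202, 2.875),
}

def interval_overlap(a, b):
    """Length of the intersection of two (low, mean, high) intervals."""
    return max(0.0, min(a[2], b[2]) - max(a[0], b[0]))

for left, right in (('-', '='), ('=', '+'), ('-', '+')):
    shared = interval_overlap(intervals[left], intervals[right])
    print(left, right, round(shared, 3))
```

Even the negative and positive intervals share roughly 0.4 of a label, so no single threshold separates the classes cleanly.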

## Accuracy

Here we compute the score according to the guidelines for [SemEval 2015, task 10](http://alt.qcri.org/semeval2015/task10/), subtask B:

> **Subtask B**: Message Polarity Classification: Given a message, classify whether the message is of positive, negative, or neutral sentiment. For messages conveying both a positive and negative sentiment, whichever is the stronger sentiment should be chosen.

The score is $$ \frac{F_1^{pos} + F_1^{neg}}{2}. $$
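
For reference, each $F_1$ term is the usual harmonic mean of precision and recall for that class. The notation below is mine, not the task's: $TP_c$, $FP_c$ and $FN_c$ are the true positives, false positives and false negatives for class $c \in \{pos, neg\}$.

$$ P_c = \frac{TP_c}{TP_c + FP_c}, \qquad R_c = \frac{TP_c}{TP_c + FN_c}, \qquad F_1^{c} = \frac{2 \, P_c \, R_c}{P_c + R_c}. $$

Neutral items still matter: a neutral tweet predicted as positive or negative counts as a false positive for that class and lowers its precision.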

Since the Stanford sentiment analyzer returns five labels from 0 to 4 (strongly negative, negative, neutral, positive, strongly positive) but seems more than a bit biased towards the lower end of the spectrum, I'm using a rather arbitrary 2.0 as the cutoff between the "positive" and "negative" classes.

In [15]:
def prec_rec(true_p, predict_p):
    """Takes two predicates that evaluate a true and predicted
    result, and returns a tuple with precision and recall.
    """
    true_count, predict_count, correct_count = 0, 0, 0
    for js in jss:
        if true_p(js):
            true_count += 1
        if predict_p(js):
            predict_count += 1
        if true_p(js) and predict_p(js):
            correct_count += 1
#     print(true_count, predict_count, correct_count)
    return correct_count / predict_count, correct_count / true_count

cutoff = 2.0
true_pos = lambda js: js['polarity'] == '+'
pred_pos = lambda js: js['sentiment'] > cutoff

true_neg = lambda js: js['polarity'] == '-'
pred_neg = lambda js: js['sentiment'] <= cutoff

def F1(true_p, pred_p):
    p, r = prec_rec(true_p, pred_p) 
    return 2 * p * r / (p + r)

F1_pos = F1(true_pos, pred_pos)
F1_neg = F1(true_neg, pred_neg)

score = (F1_pos + F1_neg) / 2
print('Score Semeval 2015: %.2f%%' % (score * 100))

Score Semeval 2015: 66.31%
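
Since the 2.0 cutoff is arbitrary, it is worth checking how sensitive the score is to it. The following is only a sketch: `toy` is a handful of made-up records, and `f1` and `semeval_score` are hypothetical helpers that mirror the computation above rather than reuse it. Pointing `semeval_score` at the real list of records would give the actual sweep.

```python
def f1(records, true_label, predicted):
    """F1 for one class; `predicted` maps a record to True or False."""
    true_count = sum(1 for rec in records if rec['polarity'] == true_label)
    pred_count = sum(1 for rec in records if predicted(rec))
    correct = sum(1 for rec in records
                  if rec['polarity'] == true_label and predicted(rec))
    if correct == 0:
        return 0.0  # also avoids dividing by zero when nothing is predicted
    precision, recall = correct / pred_count, correct / true_count
    return 2 * precision * recall / (precision + recall)

def semeval_score(records, cutoff):
    """Average of the positive and negative F1 for a given cutoff."""
    f1_pos = f1(records, '+', lambda rec: rec['sentiment'] > cutoff)
    f1_neg = f1(records, '-', lambda rec: rec['sentiment'] <= cutoff)
    return (f1_pos + f1_neg) / 2

# Made-up records, only to exercise the functions.
toy = [
    {'polarity': '-', 'sentiment': 1.0},
    {'polarity': '-', 'sentiment': 1.5},
    {'polarity': '=', 'sentiment': 2.0},
    {'polarity': '+', 'sentiment': 2.5},
    {'polarity': '+', 'sentiment': 1.5},
]

for cutoff in (1.0, 1.5, 2.0, 2.5):
    print('cutoff %.1f: %.2f%%' % (cutoff, 100 * semeval_score(toy, cutoff)))
```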


## To note

Such a score is, to me at least, rather surprising: it's actually slightly **better** than the ones reported as the [official results](https://docs.google.com/document/d/1WV-XTvQDpuH_IfKrjzeZ361s1ykcskDNNuOV3oI39_c/edit) for the task (they range between 60.77 and 64.84). In fairness, it would certainly need to be tested on the SemEval dataset, but the evidence is that it performs well on the present one.

The result is surprising for two reasons:

* the model was trained on movie reviews, yet it seems to generalize over certain language patterns that might be common to the two domains;
* despite there being no provision in the code for dealing with the quirkiness of Twitter texts, the score is still comparable to the output of systems specifically tailored to the task.

With respect to the latter, the `CoreNLP` code uses a rule-based tokenizer and lexer, which are tailored to the well-behaved texts the model was trained on.

Trying the sentiment analyzer on the command line, it's easy to get a sense of its sensitivity to spelling mistakes, the presence of hashtags and emoticons, and the very specific language and abbreviations that people use on Twitter. Integrating some Twitter-specific tools for parsing (like [Noah Smith's group's](http://www.cs.cmu.edu/~ark/TweetNLP/)) should help further improve the score.
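
Short of a full Twitter-aware parser, even a light normalization pass before the analyzer might help. The sketch below is a regex-based stand-in, not TweetNLP: the function `normalize_tweet` and its rules are my own assumptions, and I have not measured their effect on the score.

```python
import re

def normalize_tweet(text):
    """Rewrite Twitter-specific tokens into plainer text (illustrative rules)."""
    text = re.sub(r'https?://\S+', '', text)   # drop URLs
    text = re.sub(r'@(\w+)', r'\1', text)      # '@JetBlue' -> 'JetBlue'
    text = re.sub(r'#(\w+)', r'\1', text)      # '#nothappy' -> 'nothappy'
    return re.sub(r'\s+', ' ', text).strip()   # collapse leftover whitespace

print(normalize_tweet(
    '@JetBlue is REALLY getting on my nerves !! #nothappy http://t.co/x'))
```

Emoticons, abbreviations and spelling mistakes are left untouched here; those are where a dedicated tokenizer and tagger would be needed.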