## Extract Twitter Handles (@-strings) from training data
We want to use them to select tweets.

Prepare a test file for measuring the accuracy of the Sentiment Analyzer (in the Scala code). The output is one JSON object per line.

In [2]:
import sqlite3, codecs
from json import dumps

conn = sqlite3.connect('../data/airline-twitter-sentiment/database.sqlite')

In [59]:
# Turn positive -> +, negative -> -, neutral -> =.
def to_symbol(s):
    if s.lower().startswith('pos'): return '+'
    elif s.lower().startswith('neg'): return '-'
    else: return '='

sql = 'select airline_sentiment, text from tweets'

js_docs = (dumps({'polarity': to_symbol(polarity), 
                  'text': text}) 
           for (polarity, text) in conn.execute(sql))

# `out` rather than `os`, to avoid shadowing the name of the standard `os` module.
with codecs.open('../data/airline-twitter-sentiment/tweets.json', mode='w', encoding='utf-8') as out:
    out.write(u'\n'.join(js_docs))

In [135]:
!head -n 5 ../data/airline-twitter-sentiment/tweets.json

{"polarity": "=", "text": "@JetBlue's new CEO seeks the right balance to please passengers and Wall ... - Greenfield Daily Reporter http://t.co/LM3opxkxch"}
{"polarity": "-", "text": "@JetBlue is REALLY getting on my nerves !! \ud83d\ude21\ud83d\ude21 #nothappy"}
{"polarity": "-", "text": "@united yes. We waited in line for almost an hour to do so. Some passengers just left not wanting to wait past 1am."}
{"polarity": "-", "text": "@united the we got into the gate at IAH on time and have given our seats and closed the flight. If you know people is arriving, have to wait"}
{"polarity": "-", "text": "@SouthwestAir its cool that my bags take a bit longer, dont give me baggage blue balls-turn the carousel on, tell me it's coming, then not."}
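Since the file holds one JSON object per line, it can be read back one line at a time. The following is a minimal sketch, assuming an in-memory stand-in for `tweets.json`; the helper `read_json_lines` and the sample tweets are made up for illustration.

```python
import io
import json
from collections import Counter

# Hypothetical in-memory stand-in for tweets.json: one JSON object per line.
sample = io.StringIO(
    '{"polarity": "=", "text": "@JetBlue any news?"}\n'
    '{"polarity": "-", "text": "@united delayed again"}\n'
    '\n'
    '{"polarity": "-", "text": "@united lost my bag"}\n'
)

def read_json_lines(handle):
    """Yield one parsed object per non-blank line."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)

docs = list(read_json_lines(sample))

# Count how many documents carry each polarity symbol.
polarity_counts = Counter(doc['polarity'] for doc in docs)
print(polarity_counts)
```

Counting the symbols this way is a quick check that `to_symbol` produced only `+`, `-` and `=`.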


The following file is the output of `ErrorTest.process` (in `src/main/scala/BasicTask.scala`) with the above file as input.

In [16]:
!wc -l ../data/airline-twitter-sentiment/analyzed.json

   14485 ../data/airline-twitter-sentiment/analyzed.json


In [136]:
from json import loads
with open('../data/airline-twitter-sentiment/analyzed.json', encoding='utf-8') as lines:
    s = lines.read()

In [137]:
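# json.loads no longer accepts an `encoding` argument (removed in Python 3.9);
# the file was already decoded as UTF-8 when it was read.
jss = [loads(l) for l in s.split('\n') if l]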
len(jss)

14485

In [126]:
import pandas as pd

In [153]:
jss_df = pd.DataFrame(jss)
N = len(jss_df); N

14485

A small sample of the test Tweets:

In [134]:
jss_df.sample(15)

| | polarity | sentiment | text |
|---|---|---|---|
| 1016 | = | 2.0 | @USAirways and if the flight is full? |
| 10823 | - | 1.8 | @united. epic fail. @reagan. no jetway. been h... |
| 9262 | - | 1.5 | @USAirways thanks for three Cancelled Flightle... |
| 5597 | = | 1.5 | @JetBlue If you'd love to see more girls be in... |
| 1587 | - | 1.0 | @USAirways No reFlight Booking Problems necess... |
| 11279 | - | 1.0 | @united Awesome flight crew on UA1589, re the ... |
| 7903 | - | 2.0 | @united direct messages not going through plea... |
| 13349 | - | 2.0 | @united according to your DMs, I'm not owed a ... |
| 13935 | + | 3.0 | @AmericanAir if I could fly an md80/dc10 I wou... |
| 6055 | - | 1.0 | @united U BUMS HOW DO U LOSE THE BIGGEST BAG O... |


A small sanity check: we want to get an idea of how the system has labeled the data.

In [132]:
def sentiment(pol):
    return jss_df[jss_df['polarity'] == pol]['sentiment']
    
def mean_std(df):
    m, s = df.mean(), df.std()
    return m - s, m, m + s

desc = [mean_std(s) for s in map(sentiment, ('-', '=', '+'))]

The Stanford sentiment analyzer has five labels (from 0 to 4). These are the $(\bar{x}_i - \sigma_i, \bar{x}_i, \bar{x}_i + \sigma_i)$ tuples for the negative, neutral and positive polarity items:

In [133]:
desc

[(0.95509820978792059, 1.4426183135663428, 1.9301384173447651),
 (1.0485336944830861, 1.6355839501000793, 2.2226342057170725),
 (1.5285735260682785, 2.2019647447668018, 2.8753559634653252)]
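The same kind of band can be reproduced without pandas. This is a minimal sketch on made-up `(polarity, sentiment)` pairs; `rows` and `band` are hypothetical names. It relies on `statistics.stdev` being the sample standard deviation, which is also what pandas' `std()` computes by default (`ddof=1`).

```python
from statistics import mean, stdev

# Hypothetical (polarity, sentiment) pairs standing in for the rows of jss_df.
rows = [('-', 1.0), ('-', 2.0), ('-', 1.5),
        ('=', 1.0), ('=', 2.0),
        ('+', 2.0), ('+', 3.0), ('+', 2.5)]

def band(values):
    """Return (mean - std, mean, mean + std) using the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return m - s, m, m + s

# One band per polarity symbol, in the same order as the notebook cell.
bands = {pol: band([score for p, score in rows if p == pol])
         for pol in ('-', '=', '+')}

for pol, (low, mid, high) in bands.items():
    print('%s: %.2f  %.2f  %.2f' % (pol, low, mid, high))
```

Overlapping bands, as in the real output above, suggest that the raw sentiment value alone does not separate the classes cleanly.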

## Accuracy

Here we compute the score according to the guidelines for SemEval 2015, Task 10, Subtask B:

> **Subtask B**: Message Polarity Classification: Given a message, classify whether the message is of positive, negative, or neutral sentiment. For messages conveying both a positive and negative sentiment, whichever is the stronger sentiment should be chosen.

The score is $$ \frac{F_1^{pos} + F_1^{neg}}{2}. $$
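
For reference, each $F_1$ term is the harmonic mean of precision $P$ and recall $R$ for that class. The counts $TP$, $FP$ and $FN$ (true positives, false positives, false negatives) are notation introduced here, not taken from the task description:

$$ P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2PR}{P + R}. $$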

The Stanford sentiment analyzer returns five labels from 0 to 4 (strongly negative, negative, neutral, positive, strongly positive), but it seems more than a bit biased towards the lower end of the spectrum, so I'm using a rather arbitrary 2.0 as the cutoff between the two classes.

In [152]:
def prec_rec(true_p, predict_p):
    """Takes two predicates that evaluate a true and a predicted
    result, and returns a tuple with precision and recall.
    """
    true_count, predict_count, correct_count = 0, 0, 0
    for js in jss:
        if true_p(js):
            true_count += 1
        if predict_p(js):
            predict_count += 1
        if true_p(js) and predict_p(js):
            correct_count += 1
#     print(true_count, predict_count, correct_count)
    return correct_count / predict_count, correct_count / true_count

cutoff = 2.0
true_pos = lambda js: js['polarity'] == '+'
pred_pos = lambda js: js['sentiment'] > cutoff

true_neg = lambda js: js['polarity'] == '-'
pred_neg = lambda js: js['sentiment'] <= cutoff

def F1(true_p, pred_p):
    p, r = prec_rec(true_p, pred_p) 
    return 2 * p * r / (p + r)

F1_pos = F1(true_pos, pred_pos)
F1_neg = F1(true_neg, pred_neg)

score = (F1_pos + F1_neg) / 2
print('Score Semeval 2015: %.2f%%' % (score * 100))

Score Semeval 2015: 66.31%
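
Because the 2.0 cutoff is arbitrary, it may be worth checking how sensitive the score is to it. The following is a minimal sketch on made-up data, not the notebook's actual evaluation: `toy`, `f1` and `semeval_score` are hypothetical names, and unlike `prec_rec` above, `f1` returns 0.0 instead of dividing by zero when a class has no predictions or no hits.

```python
def f1(pairs, is_true, is_pred):
    """F1 for one class; 0.0 when precision or recall would be undefined or zero."""
    true_n = sum(1 for p in pairs if is_true(p))
    pred_n = sum(1 for p in pairs if is_pred(p))
    hit_n = sum(1 for p in pairs if is_true(p) and is_pred(p))
    if not (true_n and pred_n and hit_n):
        return 0.0
    prec, rec = hit_n / pred_n, hit_n / true_n
    return 2 * prec * rec / (prec + rec)

def semeval_score(pairs, cutoff):
    """Average of the positive and negative F1, as in the formula above."""
    pos = f1(pairs, lambda p: p[0] == '+', lambda p: p[1] > cutoff)
    neg = f1(pairs, lambda p: p[0] == '-', lambda p: p[1] <= cutoff)
    return (pos + neg) / 2

# Hypothetical (polarity, sentiment) pairs, not taken from analyzed.json.
toy = [('-', 1.0), ('-', 1.5), ('-', 2.0), ('=', 2.0), ('+', 2.0), ('+', 3.0)]

for cutoff in (1.0, 1.5, 2.0, 2.5):
    print('cutoff %.1f -> %.3f' % (cutoff, semeval_score(toy, cutoff)))
```

Running the same loop over `jss` would show whether 2.0 happens to sit near a maximum of the score.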


This score is actually *better* than those reported in the [official results](https://docs.google.com/document/d/1WV-XTvQDpuH_IfKrjzeZ361s1ykcskDNNuOV3oI39_c/edit) for the task, which range between 60.77 and 64.84.

In [None]:
import re
from itertools import chain

def flatmap(f, sequence):
    "Apply a function that returns a sequence, concatenating the results."
    return chain.from_iterable(map(f, sequence))

handle_pat = re.compile(r'@\w+')
# Assumption: `messages` (undefined in the original cell) holds the tweet texts.
messages = (js['text'] for js in jss)
handles = set(flatmap(handle_pat.findall, messages))

In [None]:
handles
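
To illustrate what the `@\w+` pattern picks up, here is a minimal sketch on a few made-up messages. The `messages` list is a hypothetical stand-in for the tweet texts.

```python
import re
from itertools import chain

handle_pat = re.compile(r'@\w+')

# Hypothetical sample; in the notebook this would be the tweet texts.
messages = [
    "@united yes. We waited in line for almost an hour to do so.",
    "Hey .@AmericanAir thanks, cc @united",
    "no handle here",
]

# findall returns one list per message; chain.from_iterable flattens them,
# and the set removes repeated handles.
handles = set(chain.from_iterable(handle_pat.findall(m) for m in messages))
print(sorted(handles))
```

Note that the pattern also matches the `@example` part of an email address such as `user@example.com`, and that it is case-sensitive, so lower-casing the matches before building the set may be preferable.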

In [34]:
import re

def lines(it):
    "Return the non-empty lines of `it`, stripped of trailing whitespace."
    return filter(None, (item.rstrip() for item in it))

# Parse `key = value` pairs; `lines` already drops blank lines.
conf = dict(re.split(r'\s+=\s+', l, maxsplit=1)
            for l in lines(open('../etc/twitter.conf')))
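
The parsing step can be tried without the real credentials file. This is a minimal sketch, assuming the file contains plain `key = value` lines; `parse_conf` and the sample values are hypothetical. Unlike the cell above, it uses `\s*=\s*`, so it also accepts `key=value` without surrounding spaces, and it splits only on the first `=` so that values containing `=` survive.

```python
import re

def parse_conf(raw_lines):
    """Parse `key = value` lines into a dict, skipping blank lines."""
    conf = {}
    for line in raw_lines:
        line = line.strip()
        if not line:
            continue
        # maxsplit=1: only the first '=' separates key from value.
        key, value = re.split(r'\s*=\s*', line, maxsplit=1)
        conf[key] = value
    return conf

# Hypothetical file contents; the values are placeholders, not real credentials.
sample = [
    'twitter4j.oauth.consumerKey = KEY123\n',
    '\n',
    'twitter4j.oauth.consumerSecret = abc=def\n',
]
conf = parse_conf(sample)
print(sorted(conf))
```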

In [35]:
import tweepy

auth = tweepy.OAuthHandler(conf['twitter4j.oauth.consumerKey'], conf['twitter4j.oauth.consumerSecret'])
auth.set_access_token(conf['twitter4j.oauth.accessToken'], conf['twitter4j.oauth.accessTokenSecret'])

# Construct the API instance
api = tweepy.API(auth)

In [39]:
# Each line of the handles file has two comma-separated fields; keep the second, lower-cased.
handles = [v.lower() for _, v in map(lambda l: l.split(','), lines(open('../data/airline-twitter-sentiment/airline-handles')))]

In [43]:
class MyStreamListener(tweepy.StreamListener):
    """Override tweepy.StreamListener to add logic.
    """    
    def on_status(self, status):
        print (status.text, status.entities)
    
    def on_error(self, status_code):
        if status_code == 420:
            # 420 means we are being rate limited;
            # returning False from on_error disconnects the stream
            return False

In [45]:
myStreamListener = MyStreamListener()
myStream = tweepy.Stream(auth = api.auth, listener=myStreamListener)

In [46]:
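# `async` is a reserved word from Python 3.7 on; tweepy 3.7+ calls this keyword `is_async`
# (older tweepy versions on Python 2 use `async=True`).
myStream.filter(track=handles, languages=['en'], is_async=True)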

In [55]:
myStream.disconnect()

In [56]:
myStream.running

False

In [None]:
myStream.listener.on_status(None)

In [48]:
open('../var/tweets')

(u'RT @big_yummy: Hey .@AmericanAir thanks to the gross incompetence of you ODR staff, I just missed my flight!  Thanks a lot! #nexttimeSWA .@\u2026', {u'user_mentions': [{u'id': 382126710, u'indices': [3, 13], u'id_str': u'382126710', u'screen_name': u'big_yummy', u'name': u'Lincoln Lobley'}, {u'id': 22536055, u'indices': [20, 32], u'id_str': u'22536055', u'screen_name': u'AmericanAir', u'name': u'American Airlines'}, {u'id': 7212562, u'indices': [139, 140], u'id_str': u'7212562', u'screen_name': u'SouthwestAir', u'name': u'Southwest Airlines'}], u'symbols': [], u'hashtags': [{u'indices': [124, 136], u'text': u'nexttimeSWA'}], u'urls': []})
(u'@AmericanAir Any reason why customer relations would just...stop responding?  Because this whole process has been a disaster.', {u'user_mentions': [{u'id': 22536055, u'indices': [0, 12], u'id_str': u'22536055', u'screen_name': u'AmericanAir', u'name': u'American Airlines'}], u'symbols': [], u'hashtags': [], u'urls': []})
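
The `entities` dictionary printed with each status already lists the mentioned accounts, so the handles can be read from it directly instead of being parsed out of the text. This is a minimal sketch using the entities of the second status above, rewritten as Python 3 literals; `mentioned_handles` is a hypothetical helper.

```python
# Entities dict copied from the second status printed above (Python 3 literals).
entities = {
    'user_mentions': [
        {'id': 22536055, 'indices': [0, 12], 'id_str': '22536055',
         'screen_name': 'AmericanAir', 'name': 'American Airlines'},
    ],
    'symbols': [], 'hashtags': [], 'urls': [],
}

def mentioned_handles(entities):
    """Return lower-cased @-handles listed under `user_mentions`."""
    return ['@' + m['screen_name'].lower()
            for m in entities.get('user_mentions', [])]

mentions = mentioned_handles(entities)
print(mentions)
```

Lower-casing matches how the tracked `handles` list is built, which makes the two easy to compare.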
