Bing Liu Lexicon Features
------
**What it does**: Counts the number of positive and negative words in a tweet, using the Bing Liu opinion lexicon.  
Source: https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon

**Strengths**: Conveys sentiment directly and yields separate positive and negative count features.

**Weaknesses**: Counts are raw totals, not normalized by tweet length, so longer tweets tend to score higher on both features.

**Hyperparameters**:  None
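
One possible way to reduce the length bias noted under Weaknesses is to divide each count by the number of tokens in the tweet. The sketch below is illustrative only and is not part of this notebook's pipeline; `normalize_counts` and its arguments are hypothetical names.

```python
def normalize_counts(positive, negative, n_tokens):
    """Return lexicon hits as a fraction of the tweet's tokens."""
    if n_tokens == 0:
        # An empty tweet has no hits; avoid dividing by zero
        return {'positive': 0.0, 'negative': 0.0}
    return {'positive': positive / n_tokens,
            'negative': negative / n_tokens}

# A short tweet with one hit versus a long tweet with two hits
short = normalize_counts(positive=1, negative=0, n_tokens=4)
long_ = normalize_counts(positive=2, negative=0, n_tokens=20)
print(short['positive'], long_['positive'])
```

On raw counts the longer tweet looks more positive (2 versus 1); after normalization the shorter tweet has the higher positive share.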

In [50]:
from collections import OrderedDict, defaultdict, Counter
import pandas as pd
import csv
from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import CountVectorizer

In [51]:
sts_gold = pd.read_csv('../data/sts_gold_v03/sts_gold_tweet.csv', index_col='id', sep=';')

In [52]:
sts_gold.head()

id,polarity,tweet
1467933112,0,the angel is going to miss the athlete this we...
2323395086,0,It looks as though Shaq is getting traded to C...
1467968979,0,@clarianne APRIL 9TH ISN'T COMING SOON ENOUGH
1990283756,0,drinking a McDonalds coffee and not understand...
1988884918,0,So dissapointed Taylor Swift doesnt have a Twi...


In [53]:
tweets = sts_gold['tweet']

In [54]:
negList = []
posList = []
wordDict = defaultdict(list)

with open('../lexicons/bing_liu_lexicon/positive-words.txt', 'r') as f:
    reader = csv.reader(f)
    # Skip the 35 header rows at the top of the lexicon file
    for _ in range(35):
        next(reader)
    for word in reader:
        # append, not extend: extend would add each character of the word separately
        posList.append(word[0])
        wordDict[word[0]].append('positive')

# FYI: I had to edit the word 'inimically' in the original file because it contained a non-UTF-8 character
with open('../lexicons/bing_liu_lexicon/negative-words.txt', 'r') as f:
    reader = csv.reader(f)
    # Skip the 35 header rows at the top of the lexicon file
    for _ in range(35):
        next(reader)
    for word in reader:
        # append, not extend: extend would add each character of the word separately
        negList.append(word[0])
        wordDict[word[0]].append('negative')
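
The cells above skip a fixed number of header rows. As an alternative, the sketch below filters the header by content instead, assuming the header lines begin with `;` (check this against your copy of the files). `load_lexicon` is a hypothetical helper, and the sample text stands in for a real lexicon file.

```python
import io

def load_lexicon(handle):
    """Read one word per line, skipping blank lines and ';' comment lines."""
    words = []
    for line in handle:
        word = line.strip()
        if not word or word.startswith(';'):
            continue
        words.append(word)
    return words

# Stand-in for a lexicon file; these contents are made up for illustration
sample = io.StringIO("; header comment\n;\n\nabound\nadmire\n")
print(load_lexicon(sample))
```

With a real file, the handle would come from `open(path)`. Passing an explicit `encoding` argument to `open` might also remove the need to hand-edit the file, though which encoding the lexicon files use is not confirmed here.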

In [55]:
tt = TweetTokenizer()

In [56]:
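def generate_emotion_count(string, tokenizer):
    """Count positive and negative lexicon hits in one tweet."""
    emoCount = Counter()
    # Use the tokenizer passed in rather than the global tt
    for token in tokenizer.tokenize(string):
        token = token.lower()
        # .get avoids adding every unseen token to the defaultdict
        emoCount += Counter(wordDict.get(token, []))
    return emoCount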
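
To make the counting step concrete, here is a self-contained sketch of the same idea on a toy lexicon. It swaps `TweetTokenizer` for a simple regular expression so it runs without NLTK; `toy_lexicon` and `count_polarity` are hypothetical names, and the entries are invented for illustration.

```python
import re
from collections import Counter

# Toy lexicon for illustration only; not taken from the Bing Liu files
toy_lexicon = {
    'love': ['positive'],
    'awful': ['negative'],
    'envious': ['positive', 'negative'],  # hypothetical dual-listed word
}

def count_polarity(text, lexicon):
    """Tally the polarity labels of every lexicon word found in text."""
    counts = Counter()
    # Lowercase first, then pull out runs of letters and apostrophes
    for token in re.findall(r"[a-z']+", text.lower()):
        counts.update(lexicon.get(token, []))
    return counts

print(count_polarity("I LOVE this but the ending was awful, awful", toy_lexicon))
```

Because each word maps to a list of labels, a word that appears in both lexicon files would add one to each count, as the `envious` entry shows.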

In [57]:
emotionCounts = [generate_emotion_count(tweet, tt) for tweet in tweets]

In [58]:
emotion_df = pd.DataFrame(emotionCounts, index=tweets.index)
emotion_df = emotion_df.fillna(0)

In [59]:
emotion_df.describe()

,negative,positive
count,2034.0,2034.0
mean,0.728614,0.632743
std,0.924125,0.841221
min,0.0,0.0
25%,0.0,0.0
50%,0.0,0.0
75%,1.0,1.0
max,6.0,5.0


### Feature Evaluation

In [60]:
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.svm import SVC
from sklearn.preprocessing import Binarizer, StandardScaler
from sklearn.dummy import DummyClassifier

In [61]:
models = [('DUMMY', DummyClassifier(strategy='most_frequent')),
          ('mNB' , MultinomialNB()),
          ('bNB' , BernoulliNB()),
          ('svc' , SVC())
         ]

In [62]:
print('{0}\t{1:<1}\t{2:<4}\t{3:<4}'.format("MODEL", "MEAN CV", "MIN CV", "MAX CV"))

for name, model in models:    
    X, Y = emotion_df, (sts_gold['polarity'] == 4).to_numpy()
    
    if name == 'bNB':
        binarize = Binarizer()
        X = binarize.fit_transform(X)
    elif name == 'svc':
        ss = StandardScaler()
        X = X.to_numpy()
        X = ss.fit_transform(X)
        
    cv = cross_val_score(model, X, Y, cv=5, scoring='accuracy')
    
    print('{0}\t{1:<3}\t{2:<4}\t{3:<4}'.format(name, round(cv.mean(), 4), round(cv.min(), 4), round(cv.max(), 4)))

MODEL	MEAN CV	MIN CV	MAX CV
DUMMY	0.6893	0.6887	0.6897
mNB	0.8053	0.7586	0.848
bNB	0.8033	0.7611	0.8456
svc	0.8023	0.7586	0.848
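
The DUMMY row is a majority-class baseline: with `strategy='most_frequent'`, the classifier always predicts the most common training label, so its accuracy is roughly the share of that class. A minimal sketch of that calculation, using toy labels rather than the STS-Gold data (`majority_baseline_accuracy` is a hypothetical helper):

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# Toy labels, not the STS-Gold data: 7 negative (False), 3 positive (True)
toy_labels = [False] * 7 + [True] * 3
print(majority_baseline_accuracy(toy_labels))
```

Read this way, the DUMMY score of about 0.69 suggests that roughly 69% of the tweets share one polarity, and the three lexicon-based models improve on that baseline by about 11 to 12 points of mean accuracy.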
