Word n-grams (Bag of Words - BOW)
------
**What it does**: Runs two bag of words models -- one on words from the first half of the tweet, one from the second half. From the paper linked below.

_we also hypothesize that the words located towards the end of a tweet are more important than other words, because people usually summarize or highlight their points in the end. For example, “I hate it when stuff like that happens,.. ;/ thank god it worked out. #thankful.”. Although “hate” appears in the first half of the tweet, the overall emotion is dominated by “thank” in the latter half. We encoded the position information into a feature by attaching a number (i.e, 1 or 2) to each n-gram to indicate whether it is in the first half or the second half of the tweet_

source: http://knoesis.wright.edu/library/download/wenbo_socialcom_2012.pdf

Technically it loads the entire twitter vocabulary from a `CountVectorizer` first, so you change any settings in the model run on the entire tweet corpus. Then it applies that vocabulary to the two separate corpuses generated by the first and second half of the string.

**Strengths**: May capture additional positional information.

**Weaknesses**: Worse performance on STS-Gold compared to a traditional BOW.

**Hyperparameters**:
- `CountVectorizer`:
  - `ngram_range`: the window length of words to look at -- `(min, max)`. In this notebook, we look at unigrams and bigrams
  - `min_df`, `max_df`: The minimum and maximum document freqency for an n-gram, respectively. Can be a count (`3`) or a percent (`0.95`)
  - `stop_words`: Whether to remove stopwords based on the `english` word list. Can input another stopword list.
  - `binary`: Whether to convert to a binary (yes/no) occurence. Can also just apply later in pipeline using `Binarizer`

In [1]:
from collections import OrderedDict
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
sts_gold = pd.read_csv('../data/sts_gold_v03/sts_gold_tweet.csv', index_col='id', sep=';')

In [3]:
sts_gold.head()

Unnamed: 0_level_0,polarity,tweet
id,Unnamed: 1_level_1,Unnamed: 2_level_1
1467933112,0,the angel is going to miss the athlete this we...
2323395086,0,It looks as though Shaq is getting traded to C...
1467968979,0,@clarianne APRIL 9TH ISN'T COMING SOON ENOUGH
1990283756,0,drinking a McDonalds coffee and not understand...
1988884918,0,So dissapointed Taylor Swift doesnt have a Twi...


In [4]:
tweets = sts_gold['tweet']

In [5]:
test_str = 'this is a sentence I am about to split in half'

In [6]:
def split_tweet(string, half='first'):
    split_str = string.split()
    halfway = int(len(split_str) / 2)
    if half == 'first':
        return ' '.join(split_str[:halfway])
    if half == 'second':
        return ' '.join(split_str[halfway:])


In [7]:
tweets_firsthalf = tweets.apply(lambda x: split_tweet(x))
tweets_secondhalf = tweets.apply(lambda x: split_tweet(x, half='second'))

In [8]:
cv = CountVectorizer(ngram_range=(1,2), min_df=3, max_df=.95, stop_words='english')
cv.fit_transform(tweets)
all_vocabulary = cv.vocabulary_

In [9]:
cv_first = CountVectorizer(vocabulary=all_vocabulary)
bow_first = cv_first.fit_transform(tweets_firsthalf)
colnames_1 = [i + "_1" for i in cv_first.get_feature_names()]
bow_first_df = pd.DataFrame(bow_first.toarray(), index=tweets_firsthalf.index, columns=colnames_1)

cv_second = CountVectorizer(vocabulary=all_vocabulary)
bow_second = cv_second.fit_transform(tweets_secondhalf)
colnames_2 = [i + "_2" for i in cv_second.get_feature_names()]

bow_second_df = pd.DataFrame(bow_second.toarray(), index=tweets_secondhalf.index, columns=colnames_2)

bow_combined = pd.concat([bow_first_df, bow_second_df], axis=1)

In [10]:
bow_combined.head()

Unnamed: 0_level_0,10_1,100_1,101_1,12_1,13_1,14_1,15_1,1st_1,20_1,24_1,...,yay_2,yea_2,yeah_2,year_2,years_2,yep_2,yes_2,yesterday_2,youtube_2,youtube channel_2
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1467933112,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2323395086,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1467968979,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1990283756,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1988884918,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Feature Evaluation

In [11]:
from sklearn.dummy import DummyClassifier
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import Binarizer, StandardScaler
from sklearn.ensemble import VotingClassifier

from sklearn.cross_validation import cross_val_score

In [15]:
models = [('DUMMY', DummyClassifier(strategy='most_frequent')),
          ('mNB' , MultinomialNB()),
          ('bNB' , BernoulliNB()),
          ('svc' , SVC(probability=True)),
          ('rf' , RandomForestClassifier()),
          ('lr' , LogisticRegressionCV())
         ]
models.append(('eclf', VotingClassifier(estimators=[models[i] for i in [1, 3, 4, 5]], voting='soft')))

In [17]:
print('{0}\t{1:<1}\t{2:<4}\t{3:<4}'.format("MODEL", "MEAN CV", "MIN CV", "MAX CV"))

for name, model in models:    
    X, Y = bow_combined, (sts_gold['polarity'] == 4).ravel()
    
    if name == 'bNB':
        binarize = Binarizer()
        X = binarize.fit_transform(X)
    elif name == 'svc':
        ss = StandardScaler()
        X = X.values
        X = ss.fit_transform(X)
        
    cv = cross_val_score(model, X, Y, cv=5, scoring='accuracy')
    
    print('{0}\t{1:<3}\t{2:<4}\t{3:<4}'.format(name, round(cv.mean(), 4), round(cv.min(), 4), round(cv.max(), 4)))

MODEL	MEAN CV	MIN CV	MAX CV
DUMMY	0.6893	0.6887	0.6897
mNB	0.8137	0.8079	0.8162
bNB	0.7886	0.7696	0.803
svc	0.7183	0.7069	0.7291
rf	0.7567	0.7328	0.7956




lr	0.7916	0.777	0.8054
eclf	0.8107	0.7931	0.8276


