Hashtag Counts
------
**What it does**: Finds all hashtags in a tweet and counts the occurrences of each hashtag per tweet.

**Strengths**: 

**Weaknesses**: Some models I have seen simply use a binary hashtag indicator rather than counts of specific hashtags.

**Hyperparameters**:  None
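
The plain hashtag indicator mentioned under Weaknesses could look roughly like the sketch below. It is only an illustration: the helper names `hashtag_indicator` and `hashtag_total` and the sample tweets are made up here and are not part of this notebook or the STS-Gold data.

```python
import re

# Equivalent to the hashtag pattern used in this notebook
HASHTAG_RE = re.compile(r'#\w+')

def hashtag_indicator(tweet):
    """Return 1 if the tweet contains at least one hashtag, else 0."""
    return int(HASHTAG_RE.search(tweet) is not None)

def hashtag_total(tweet):
    """Return the number of hashtags in the tweet, regardless of which ones."""
    return len(HASHTAG_RE.findall(tweet))

# Made-up example tweets (assumption, not taken from the dataset)
sample_tweets = [
    "loving the new album #music #nowplaying",
    "stuck in traffic again",
    "#fail my phone died",
]

indicators = [hashtag_indicator(t) for t in sample_tweets]
totals = [hashtag_total(t) for t in sample_tweets]

print(indicators)  # [1, 0, 1]
print(totals)      # [2, 0, 1]
```

Either list gives a single dense column, in contrast to the one-column-per-hashtag matrix built in this notebook.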

In [18]:
from collections import OrderedDict
import pandas as pd
import nltk
import re

In [19]:
sts_gold = pd.read_csv('../data/sts_gold_v03/sts_gold_tweet.csv', index_col='id', sep=';')

In [20]:
sts_gold.head()

id,polarity,tweet
1467933112,0,the angel is going to miss the athlete this we...
2323395086,0,It looks as though Shaq is getting traded to C...
1467968979,0,@clarianne APRIL 9TH ISN'T COMING SOON ENOUGH
1990283756,0,drinking a McDonalds coffee and not understand...
1988884918,0,So dissapointed Taylor Swift doesnt have a Twi...


In [21]:
#Combine all tweets into one string so the hashtag vocabulary can be extracted with a single regex
alltext = ' '.join([i for i in sts_gold['tweet']])

#remove hashtags
alltext_nohash = re.sub(r'\#\w+','', alltext)

#remove mentions
alltext_nohash_nomentions = re.sub(r'\@\w+','', alltext_nohash)

In [22]:
hashtag_vocabulary = list(set(re.findall(r'\#\w+', alltext)))

hash_df = pd.DataFrame()
for hashtag in hashtag_vocabulary:
    hash_df[hashtag] = sts_gold['tweet'].str.count(hashtag)
    
print(hash_df.shape, hash_df.values.mean(), hash_df.values.max())

(2034, 65) 0.00064291657212 1
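
One caveat with counting by raw pattern: each hashtag is used as a regular expression, so a short tag such as `#sad` also matches inside a longer one such as `#saddest`. A possible stricter variant is sketched below; the helper `count_hashtag` and the example tweets are hypothetical, and whether the stricter matching would change the numbers above was not checked.

```python
import re

def count_hashtag(tweet, hashtag):
    """Count occurrences of `hashtag` that are not followed by another word character."""
    # re.escape guards against regex metacharacters; (?!\w) rejects longer tags
    pattern = re.escape(hashtag) + r'(?!\w)'
    return len(re.findall(pattern, tweet))

# Made-up example tweets (assumption, not taken from the dataset)
tweets = ["#sad day", "#saddest day ever #sad"]

# Naive substring-style matching, as in a plain pattern count
naive = [len(re.findall('#sad', t)) for t in tweets]

# Boundary-aware matching
strict = [count_hashtag(t, '#sad') for t in tweets]

print(naive)   # [1, 2]
print(strict)  # [1, 1]
```

In the notebook, the same idea could be applied by passing the escaped, boundary-aware pattern to `str.count` instead of the raw hashtag.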


### Feature Evaluation

In [26]:
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.svm import SVC
from sklearn.preprocessing import Binarizer, StandardScaler
from sklearn.dummy import DummyClassifier

In [31]:
models = [('DUMMY', DummyClassifier(strategy='most_frequent')),
          ('mNB' , MultinomialNB()),
          ('bNB' , BernoulliNB()),
          ('svc' , SVC())]

In [32]:
print('{0}\t{1:<1}\t{2:<4}\t{3:<4}'.format("MODEL", "MEAN CV", "MIN CV", "MAX CV"))

for name, model in models:    
    X, Y = hash_df, (sts_gold['polarity'] == 4).values
    
    if name == 'bNB':
        binarize = Binarizer()
        X = binarize.fit_transform(X)
    elif name == 'svc':
        ss = StandardScaler()
        #X = X.toarray()
        X = ss.fit_transform(X)
        
    cv = cross_val_score(model, X, Y, cv=5, scoring='accuracy')
    
    print('{0}\t{1:<3}\t{2:<4}\t{3:<4}'.format(name, round(cv.mean(), 4), round(cv.min(), 4), round(cv.max(), 4)))

MODEL	MEAN CV	MIN CV	MAX CV
DUMMY	0.6893	0.6887	0.6897
mNB	0.6893	0.6872	0.6912
bNB	0.6878	0.6823	0.6936
svc	0.6893	0.6887	0.6897
