Character *n*-grams
------
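**What it does**: Represents each document as a bag of character *n*-grams, i.e. counts of every contiguous sequence of *n* characters that occurs in it.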

**Strengths**:  
- Character n-grams are able to capture information on various levels: lexical (|the\_|, |free|), word-class (|ed\_|, |ing\_|), punctuation mark usage (|!!!|, |f.r.|), etc. 
- They are robust to grammatical and spelling errors (e.g., the word-tokens ‘assignment’ and ‘asignment’ share the majority of their character n-grams) and to unusual use of abbreviations, punctuation marks, etc.
- The bag of character n-grams representation is language-independent and does not require any text preprocessing (tokenizer, lemmatizer, or other ‘deep’ NLP tools). 

*source: http://www.icsd.aegean.gr/lecturers/stamatatos/papers/IJAIT-spam.pdf*
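The robustness claim above can be checked directly. The following is a minimal sketch in plain Python, assuming character 3-grams and a hypothetical helper `char_ngrams` that is not part of the notebook. It compares the two spellings from the example:

```python
def char_ngrams(text, n):
    """Return the set of distinct character n-grams in `text` (hypothetical helper)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

correct = char_ngrams("assignment", 3)
typo = char_ngrams("asignment", 3)

# N-grams that survive the misspelling
shared = correct & typo
# Jaccard similarity: shared n-grams divided by all distinct n-grams
jaccard = len(shared) / len(correct | typo)

print(sorted(shared))
print(f"{len(shared)} of {len(correct)} trigrams of 'assignment' also occur in 'asignment'")
print(f"Jaccard similarity: {jaccard:.2f}")
```

As whole-word tokens the two spellings would have nothing in common, whereas here six of the eight trigrams of the correct spelling survive the typo.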


**Weaknesses**:
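- Sparse and high-dimensional: the vocabulary of distinct n-grams is large and each document contains only a small fraction of it, so most entries of the feature matrix are zero.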
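To make the sparsity concrete, the sketch below builds a tiny bag of character 3-grams by hand and measures the fraction of non-zero cells. The three documents and the helper name are made up for illustration. On a real corpus the vocabulary is typically far larger, so the density is lower still.

```python
from collections import Counter

def char_ngram_counts(text, n=3):
    """Count the character n-grams in `text` (illustrative helper)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

docs = ["good morning", "bad coffee", "so disappointed"]  # toy documents
counts = [char_ngram_counts(d) for d in docs]

# Vocabulary: every distinct n-gram seen in any document
vocab = sorted(set().union(*counts))

# Document-term matrix as nested lists: one row per document, one column per n-gram
matrix = [[c[g] for g in vocab] for c in counts]

# Density: share of cells holding a non-zero count
nonzero = sum(1 for row in matrix for cell in row if cell > 0)
density = nonzero / (len(docs) * len(vocab))
print(f"{len(vocab)} features, density {density:.2f}")
```

This is why `CountVectorizer` returns a SciPy sparse matrix: only the non-zero counts are stored, and calling `toarray()` materializes every zero as well.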

**Hyperparameters**:  
- `CountVectorizer`:
  - `ngram_range`: the `(min_n, max_n)` range of n-gram lengths, in characters, to extract. In this notebook, we use 3- to 5-grams.
  - `analyzer`: `char` extracts n-grams from the whole text, including those that span the spaces between words. `char_wb` extracts n-grams only from text inside word boundaries; n-grams at the edges of words are padded with a space.
  - `min_df`, `max_df`: The minimum and maximum document frequency for an n-gram, respectively. Can be an absolute count (`3`) or a proportion of documents (`0.95`).
  - `lowercase`: Whether to lowercase all characters before extracting n-grams; default `True`. Keeping case (`False`) may give better results when capitalization carries signal.
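The difference between the two analyzers, and the effect of `min_df`, is easier to see on a toy input. The functions below are a simplified plain-Python approximation written for this note, not scikit-learn's implementation. Their names are hypothetical, and edge cases such as words shorter than *n* are ignored.

```python
from collections import Counter

def char_ngrams(text, n):
    """Approximates analyzer='char': slide over the whole string, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def char_wb_ngrams(text, n):
    """Approximates analyzer='char_wb': pad each word with a space, never cross words."""
    grams = []
    for word in text.split():
        padded = f" {word} "
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams

print(char_ngrams("so sad", 3))     # includes 'o s', which spans two words
print(char_wb_ngrams("so sad", 3))  # every n-gram stays inside one padded word

def document_frequency(docs, n):
    """Number of documents in which each n-gram appears at least once."""
    df = Counter()
    for doc in docs:
        df.update(set(char_ngrams(doc, n)))
    return df

docs = ["so sad", "so mad", "not bad"]  # toy corpus
df = document_frequency(docs, 3)

# Analogous to min_df=2: keep n-grams found in at least two documents
kept = sorted(g for g, count in df.items() if count >= 2)
print(kept)
```

To inspect what scikit-learn itself produces for a given configuration, `CountVectorizer(...).build_analyzer()` returns a callable that can be applied to a single string.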

In [1]:
from collections import OrderedDict
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
sts_gold = pd.read_csv('../data/sts_gold_v03/sts_gold_tweet.csv', index_col='id', sep=';')

In [3]:
sts_gold.head()

| id | polarity | tweet |
|---|---|---|
| 1467933112 | 0 | the angel is going to miss the athlete this we... |
| 2323395086 | 0 | It looks as though Shaq is getting traded to C... |
| 1467968979 | 0 | @clarianne APRIL 9TH ISN'T COMING SOON ENOUGH |
| 1990283756 | 0 | drinking a McDonalds coffee and not understand... |
| 1988884918 | 0 | So dissapointed Taylor Swift doesnt have a Twi... |


In [4]:
tweets = sts_gold['tweet']

In [5]:
cv = CountVectorizer(analyzer='char', ngram_range=(3,5), min_df=3, max_df=.95, lowercase=True)
boc = cv.fit_transform(tweets)

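# Dense DataFrame view, for inspection only: toarray() materializes every zero.
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2.
boc_df = pd.DataFrame(boc.toarray(), index=tweets.index, columns=cv.get_feature_names_out())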

In [6]:
boc_df.head()

| id | ! | ! l | ! lo | !! | !! | #b | #c | #e | #f | #l | ... | zil.. | zili | zilia | zill | zin | zine | zing | zing | zing! | zy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1467933112 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2323395086 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1467968979 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1990283756 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1988884918 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


### Feature Evaluation

In [7]:
from sklearn.dummy import DummyClassifier
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import Binarizer, StandardScaler
from sklearn.ensemble import VotingClassifier

# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import cross_val_score

In [8]:
models = [('DUMMY', DummyClassifier(strategy='most_frequent')),
          ('mNB' , MultinomialNB()),
          ('bNB' , BernoulliNB()),
          ('svc' , SVC(probability=False)),
          ('rf' , RandomForestClassifier()),
          ('lr' , LogisticRegressionCV())
         ]
models.append(('eclf', VotingClassifier(estimators=[models[i] for i in [1, 3, 4, 5]], voting='hard')))
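With `voting='hard'`, the ensemble predicts the class that receives the most votes among its member classifiers. Below is a minimal sketch of that rule in plain Python, using made-up predictions and a hypothetical `majority_vote` helper. How ties are broken here is an assumption of the sketch and may differ from scikit-learn's behavior.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among one sample's predictions."""
    return Counter(predictions).most_common(1)[0][0]

# Made-up predictions for three tweets from four classifiers
votes = {
    "mNB": [True, False, True],
    "svc": [False, False, True],
    "rf":  [True, False, False],
    "lr":  [True, True, True],
}

# Transpose so each tuple holds all classifiers' votes for one tweet
per_tweet = list(zip(*votes.values()))
ensemble = [majority_vote(p) for p in per_tweet]
print(ensemble)
```

With an even number of members, as in the four-model ensemble defined here, tied votes are possible, so an odd number of voters is a common way to avoid depending on the tie-breaking rule.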

In [9]:
print('{0}\t{1:<1}\t{2:<4}\t{3:<4}'.format("MODEL", "MEAN CV", "MIN CV", "MAX CV"))

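# Binary target: True where polarity == 4
Y = (sts_gold['polarity'] == 4).to_numpy()

for name, model in models:
    X = boc

    if name == 'bNB':
        # BernoulliNB models presence/absence, so reduce counts to 0/1
        X = Binarizer().fit_transform(X)
    elif name == 'svc':
        # Standardize features for the SVM; this requires a dense array
        X = StandardScaler().fit_transform(X.toarray())

    # Named `scores` so the CountVectorizer stored in `cv` is not overwritten
    scores = cross_val_score(model, X, Y, cv=5, scoring='accuracy')

    print('{0}\t{1:<3}\t{2:<4}\t{3:<4}'.format(name, round(scores.mean(), 4), round(scores.min(), 4), round(scores.max(), 4)))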

MODEL	MEAN CV	MIN CV	MAX CV
DUMMY	0.6893	0.6887	0.6897
mNB	0.852	0.835	0.8725
bNB	0.85	0.8325	0.875
svc	0.706	0.6961	0.7143
rf	0.7955	0.7635	0.8251
lr	0.8525	0.8407	0.867
eclf	0.8107	0.8039	0.8227
