# Naive Bayes Exercise

## Concat and frequency count
Load both files of the Twitter dataset, concatenate them, and count the
frequency of each class.

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import FunctionTransformer
from sklearn import metrics
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np

colnames = ['label', 'id', 'date', 'query', 'user', 'text']
df1 = pd.read_csv('data/training.1600000.processed.noemoticon.csv',
                      header=None, names=colnames, encoding='windows-1252')
df2 = pd.read_csv('data/testdata.manual.2009.06.14.csv',
                      header=None, names=colnames, encoding='windows-1252')

df = pd.concat([df1, df2])  # concatenate the two sets
df['label'].value_counts()  # rows per class; label 2 shows up alongside 0 and 4

4    800182
0    800177
2       139
Name: label, dtype: int64
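
The counts above show a small third class (label 2) next to 0 and 4. If a strictly binary problem is wanted, one option is to filter those rows out before modelling. The sketch below does this on a tiny made-up frame (`toy` and its texts are hypothetical stand-ins for `df`), assuming that 0 and 4 are the two sentiment classes of interest.

```python
import pandas as pd

# Hypothetical stand-in for the concatenated frame; the rows are made up.
toy = pd.DataFrame({
    'label': [0, 4, 2, 4, 0, 2],
    'text': ['bad day', 'great day', 'it is a day', 'love it', 'hate it', 'meh'],
})

# Keep only rows labelled 0 or 4, then recount the classes.
binary = toy[toy['label'].isin([0, 4])].reset_index(drop=True)
counts = binary['label'].value_counts()
print(counts)
```

The same `isin` filter could be applied to `df` itself; whether dropping label 2 is appropriate depends on how the exercise is meant to treat those rows.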

## Simple pipeline

Implement a simple pipeline with a Naïve Bayes classifier that uses words
(unigrams) as features.
- Using `training.1600000.processed.noemoticon.csv`, evaluate your model
  with an 80% training / 20% test split
- hint: look at the `train_test_split()` function
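
As a minimal sketch of the hint, the snippet below splits a small made-up frame 80/20. The `random_state` and `stratify` arguments are optional additions that are not part of the exercise: the first makes the split repeatable, the second keeps the class proportions the same in both parts. The frame `toy` and its contents are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 10 rows per class.
toy = pd.DataFrame({
    'label': [0, 4] * 10,
    'text': [f'tweet {i}' for i in range(20)],
})

# 80/20 split; random_state fixes the shuffle, stratify preserves class balance.
train_toy, test_toy = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy['label'])
print(train_toy.shape, test_toy.shape)
```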

In [4]:
# 80/20 split of the training file only (df1), as the exercise asks
train, test = train_test_split(df1, test_size=0.2, train_size=0.8)
print(train.shape, test.shape)

# Bag-of-words unigram counts feeding a multinomial Naive Bayes classifier
bow_vector = CountVectorizer()
mnb_classifier = MultinomialNB()
pipeline = Pipeline([('bow', bow_vector), ('mnb', mnb_classifier)])
pipeline.fit(train['text'], train['label'])
prediction = pipeline.predict(test['text'])

print(metrics.classification_report(test['label'], prediction))
print(accuracy_score(test['label'], prediction))

((1280000, 6), (320000, 6))
             precision    recall  f1-score   support

          0       0.76      0.82      0.79    159819
          4       0.80      0.74      0.77    160181

avg / total       0.78      0.78      0.78    320000

0.780265625


## Enhancement and experiment with bigrams and trigrams
Enhance the Naïve Bayes model by removing stop words and experimenting with bigrams and trigrams.
- hint: look at the various parameters for `CountVectorizer()`

```
CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)
```
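
To see what `stop_words` and `ngram_range` do to the features, it can help to inspect the fitted vocabulary on a couple of sentences. The sketch below uses two made-up documents (`docs` is hypothetical) and compares three vectorizer settings through the `vocabulary_` attribute.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences that differ only in "not" versus "very".
docs = ['I am not happy today', 'I am very happy today']

# Default settings: lowercase unigrams of two or more characters.
unigrams = CountVectorizer().fit(docs)
# Bigrams only, built after English stop words are removed.
bigrams_stop = CountVectorizer(stop_words='english', ngram_range=(2, 2)).fit(docs)
# Unigrams, bigrams and trigrams together, stop words kept.
uni_to_tri = CountVectorizer(ngram_range=(1, 3)).fit(docs)

print(sorted(unigrams.vocabulary_))
print(sorted(bigrams_stop.vocabulary_))
print(sorted(uni_to_tri.vocabulary_))
```

With scikit-learn's built-in English list, `not` is treated as a stop word, so the two sentences can end up with identical features once stop words are removed. That is worth keeping in mind for sentiment data, where negation carries meaning.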

In [7]:
# Bigrams only, with scikit-learn's built-in English stop words removed
stop_bigram = CountVectorizer(stop_words='english', ngram_range=(2, 2))
pipeline = Pipeline([('stop_bigram', stop_bigram), ('mnb', mnb_classifier)])
pipeline.fit(train['text'], train['label'])
prediction = pipeline.predict(test['text'])

print(metrics.classification_report(test['label'], prediction))
print(accuracy_score(test['label'], prediction))

             precision    recall  f1-score   support

          0       0.69      0.79      0.73    159819
          4       0.75      0.64      0.69    160181

avg / total       0.72      0.71      0.71    320000

0.714575
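
The bigram-only run with stop words removed scored lower than the unigram baseline (0.715 versus 0.780 accuracy), and trigrams have not been tried yet. One way to continue the experiment is to loop over several settings and collect the accuracy of each. The helper `compare_ngram_settings` below is a hypothetical sketch, not part of the original notebook, and the toy texts are made up purely so the snippet runs on its own.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def compare_ngram_settings(train_text, train_y, test_text, test_y):
    """Fit one pipeline per (stop_words, ngram_range) pair and return accuracies."""
    results = {}
    for stop_words in (None, 'english'):
        for ngram_range in ((1, 1), (1, 2), (1, 3), (2, 2)):
            pipe = Pipeline([
                ('bow', CountVectorizer(stop_words=stop_words,
                                        ngram_range=ngram_range)),
                ('mnb', MultinomialNB()),  # fresh classifier for every setting
            ])
            pipe.fit(train_text, train_y)
            results[(stop_words, ngram_range)] = accuracy_score(
                test_y, pipe.predict(test_text))
    return results


# Made-up toy data, only to show the call; the scores here mean nothing.
toy_train = ['love this sunny morning', 'hate this rainy morning',
             'great happy sunny day', 'awful sad rainy day']
toy_train_y = [4, 0, 4, 0]
toy_test = ['sunny happy day', 'rainy sad day']
toy_test_y = [4, 0]

results = compare_ngram_settings(toy_train, toy_train_y, toy_test, toy_test_y)
for setting, acc in results.items():
    print(setting, acc)
```

On the real data the call would be `compare_ngram_settings(train['text'], train['label'], test['text'], test['label'])`. Wider n-gram ranges grow the vocabulary considerably, so memory use and run time on 1.28 million tweets may be substantial.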
