# Project 3 - Web APIs & Classification

## Part 3a - Modeling: Naive-Bayes

In this part, the corpus is classified using a **Naive-Bayes** model (`MultinomialNB`). I used two types of vectorization, **CountVectorizer** and **TFIDFVectorizer**. Both vectorizers and their hyperparameters were evaluated through **Pipeline** and **GridSearchCV**.

An extra section, **3.5 - Understand Naive Bayes**, is included in this part for educational purposes. It reproduces the model's predicted probabilities by hand for one selected document in order to illustrate the concept behind the **Naive-Bayes** model.

### Result Summary

>**Accuracy & Misclassification**

|         Metric         | Baseline | CountVectorizer | TFIDFVectorizer |
|:----------------------:|:--------:|:---------------:|:---------------:|
| Accuracy Train         |   0.52   |      0.996      |      0.999      |
| Accuracy Test          |     -    |      0.996      |      0.996      |
| Misclassified (Test)   |     -    |        2        |        2        |
>**Note:** both misclassified documents can be correctly classified by a human reading through them.

>**Best Model Parameters**

|  Parameter   | CountVectorizer | TFIDFVectorizer |
|:------------:|:---------------:|:---------------:|
| Tokenizer    |     default     |     default     |
| Processing   |  Lemmatization  |  Lemmatization  |
| min_df       |        2        |        2        |
| max_df       |       0.9       |       0.9       |
| max_features |       1000      |       1000      |
| ngram_range  |      (1, 1)     |      (1, 1)     |
| stop_words   |     english     |     english     |

### Table of Contents

- [3.0-Import Libraries](#3.0---Import-Libraries)
- [3.1-Load Data](#3.1---Load-Data)
- [3.2-Model Preparation](#3.2---Model-Preparation)
- [3.3-Fit & Run Model](#3.3---Fit-&-Run-Model)
- [3.4-Results](#3.4---Results)
- [3.5-Understand-Naive-Bayes](#3.5---Understand-Naive-Bayes)

### 3.0 - Import Libraries

In [72]:
import numpy as np
import pandas as pd
from nltk import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix

### 3.1 - Load Data

In [73]:
%store -r df_to_preprocess
df = df_to_preprocess
df.head()

Unnamed: 0,post_title,post_content,title_and_content,class
0,Just found out invasive breast cancer level II,I'm alone and bawling. My 10 year old is retur...,Just found out invasive breast cancer level II...,0
1,New Kickstarter project for a colourful and we...,,New Kickstarter project for a colourful and we...,0
2,I hate cancer.,I don’t have anything interesting to say. Just...,I hate cancer. I don’t have anything interesti...,0
3,Having a stereotactic biopsy on Friday,This is my first post ever on Reddit. Hope I'm...,Having a stereotactic biopsy on Friday This is...,0
4,Advice on beast cancer UK,You can write something like my sister is suff...,Advice on beast cancer UK You can write someth...,0


In [381]:
# Baseline accuracy: the share of the majority class
df['class'].value_counts(normalize=True)

0    0.520933
1    0.479067
Name: class, dtype: float64
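
The baseline is the accuracy obtained by always predicting the majority class, so it equals the larger of the two class shares (0.52 here). A minimal sketch of that idea, using a made-up `labels` list rather than the project data:

```python
from collections import Counter

# Hypothetical labels for illustration only (0 = breast cancer, 1 = air quality)
labels = [0, 0, 0, 1, 1]

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy of a model that always predicts the majority class
baseline_accuracy = majority_count / len(labels)
print(majority_class, baseline_accuracy)
```

Any trained model should beat this number to be considered useful.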

### 3.2 - Model Preparation

**3.2.1 - Set X and y**

In [74]:
# Set X and y
X = df['post_title']
y = df['class']

**3.2.2 - Train/Test Split**

In [75]:
# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    random_state=42, 
                                                    stratify=y)

**3.2.3 - LemmaTokenizer**

In [167]:
# Custom tokenizer that lemmatizes each token with WordNet
class LemmaTokenizer:
    def __init__(self):
        self.wnl = WordNetLemmatizer()
        # Same token pattern as the vectorizer default: two or more word characters
        self.tokenizer = RegexpTokenizer(r'(?u)\b\w\w+\b')

    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in self.tokenizer.tokenize(doc)]
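
To see what the token pattern does on its own, here is a small sketch using only the standard `re` module. The lemmatization step is left out because it needs the WordNet data, and the lowercasing is done by hand here on the assumption that the vectorizer lowercases text before calling the tokenizer (`lowercase=True` in the repr below):

```python
import re

# Same pattern as in LemmaTokenizer: runs of two or more word characters
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

title = "29/M lump in breast"          # one of the titles in the test set
tokens = re.findall(TOKEN_PATTERN, title.lower())
print(tokens)  # single-character tokens such as "m" are dropped
```

Punctuation acts as a separator, and one-letter tokens never reach the vocabulary.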

### 3.3 - Fit & Run Model

**3.3.1 - CountVectorizer**

In [77]:
# Instantiate Pipeline
pipe_cv = Pipeline([('cvec', CountVectorizer(tokenizer=LemmaTokenizer())),
                    ('mnb', MultinomialNB())
                   ])

# Pipeline_parameter CountVectorizer
pipe_params_cv = {
    'cvec__max_features': [100, 500, 1000],
    'cvec__stop_words': [None, 'english'],
    'cvec__ngram_range':[(1,1),(1,2)],
    'cvec__min_df': [2, 3, 4],
    'cvec__max_df': [.9, .95, .98]
}

In [24]:
# GridSearch
gs_cv = GridSearchCV(pipe_cv, 
                     param_grid=pipe_params_cv, 
                     verbose=1,
                     cv=3,
                     n_jobs=4
                    )
gs_cv.fit(X_train, y_train)

Fitting 3 folds for each of 108 candidates, totalling 324 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    3.3s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:   14.1s
[Parallel(n_jobs=4)]: Done 324 out of 324 | elapsed:   24.9s finished


GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('cvec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...d0>,
        vocabulary=None)), ('mnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
       fit_params=None, iid='warn', n_jobs=4,
       param_grid={'cvec__max_features': [100, 500, 1000], 'cvec__stop_words': [None, 'english'], 'cvec__ngram_range': [(1, 1), (1, 2)], 'cvec__min_df': [2, 3, 4], 'cvec__max_df': [0.9, 0.95, 0.98]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)
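
The "108 candidates, totalling 324 fits" line in the log follows directly from the grid: the number of candidates is the product of the option counts, and each candidate is fitted once per fold. A quick check of that arithmetic with the grid from this section:

```python
from math import prod

# Grid copied from pipe_params_cv above
param_grid = {
    "cvec__max_features": [100, 500, 1000],
    "cvec__stop_words": [None, "english"],
    "cvec__ngram_range": [(1, 1), (1, 2)],
    "cvec__min_df": [2, 3, 4],
    "cvec__max_df": [0.9, 0.95, 0.98],
}
n_folds = 3  # cv=3

n_candidates = prod(len(options) for options in param_grid.values())
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)  # 108 324
```

Adding one more option to any parameter multiplies the total, which is why grids grow expensive quickly.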

**3.3.2 - TFIDFVectorizer**

In [25]:
# Instantiate
pipe_tv = Pipeline([('tvec', TfidfVectorizer(tokenizer=LemmaTokenizer())),
                    ('mnb', MultinomialNB())
                   ])

# Pipeline_parameter TFIDFVectorizer
pipe_params_tv = {
    'tvec__max_features': [100, 500, 1000],
    'tvec__stop_words': [None, 'english'],
    'tvec__ngram_range':[(1,1),(1,2)],
    'tvec__min_df': [2, 3, 4],
    'tvec__max_df': [.9, .95, .98]
}

In [26]:
gs_tv = GridSearchCV(pipe_tv, 
                     param_grid=pipe_params_tv, 
                     verbose=1,
                     cv=3,
                     n_jobs=4
                    )
gs_tv.fit(X_train, y_train)

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.


Fitting 3 folds for each of 108 candidates, totalling 324 fits


[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    3.9s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:   15.0s
[Parallel(n_jobs=4)]: Done 324 out of 324 | elapsed:   25.8s finished


GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('tvec', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
...se_idf=True, vocabulary=None)), ('mnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
       fit_params=None, iid='warn', n_jobs=4,
       param_grid={'tvec__max_features': [100, 500, 1000], 'tvec__stop_words': [None, 'english'], 'tvec__ngram_range': [(1, 1), (1, 2)], 'tvec__min_df': [2, 3, 4], 'tvec__max_df': [0.9, 0.95, 0.98]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)
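
For intuition on how TF-IDF weights differ from raw counts, the sketch below computes them by hand for a toy corpus. It assumes the scikit-learn defaults visible in the repr above (`smooth_idf=True`, `norm='l2'`), for which the documented formula is idf = ln((1 + n) / (1 + df)) + 1. The toy documents and the helper name `tfidf` are illustrative only:

```python
import math
from collections import Counter

# Hypothetical, already-tokenized toy corpus
docs = [
    ["air", "quality", "index"],
    ["breast", "cancer"],
    ["air", "pollution", "air"],
]
n_docs = len(docs)

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """TF-IDF weights for one document: smoothed idf, then L2 normalization."""
    term_counts = Counter(doc)
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in term_counts.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

weights = tfidf(docs[2])
print(weights)
```

Terms that appear in fewer documents get a larger idf, while repeated terms within a document still gain weight through the term count.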

### 3.4 - Results

**3.4.1 - Accuracy**

In [363]:
# Train and test accuracy for both pipelines
nb_cv_train = gs_cv.score(X_train, y_train)
nb_cv_test = gs_cv.score(X_test, y_test)
nb_tv_train = gs_tv.score(X_train, y_train)
nb_tv_test = gs_tv.score(X_test, y_test)

pd.DataFrame({'CV_MNB': [nb_cv_train, nb_cv_test], 'TV_MNB': [nb_tv_train, nb_tv_test]}, index=['train','test'])

Unnamed: 0,CV_MNB,TV_MNB
train,0.996466,0.998587
test,0.995763,0.995763


**3.4.2 - Hyperparameters**

In [28]:
print(gs_cv.best_params_)
print()
print(gs_tv.best_params_)

{'cvec__max_df': 0.9, 'cvec__max_features': 1000, 'cvec__min_df': 2, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': 'english'}

{'tvec__max_df': 0.9, 'tvec__max_features': 1000, 'tvec__min_df': 2, 'tvec__ngram_range': (1, 1), 'tvec__stop_words': 'english'}


**3.4.3 - Confusion Matrix**

In [29]:
y_pred_cv = gs_cv.predict(X_test)
y_pred_tv = gs_tv.predict(X_test)

In [361]:
pd.DataFrame(confusion_matrix(y_test, y_pred_cv), 
             columns=['pred_bc', 'pred_aq'], 
             index=['actual_bc', 'actual_aq'])

Unnamed: 0,pred_bc,pred_aq
actual_bc,244,2
actual_aq,0,226


In [362]:
pd.DataFrame(confusion_matrix(y_test, y_pred_tv),
             columns=['pred_bc', 'pred_aq'], 
             index=['actual_bc', 'actual_aq'])

Unnamed: 0,pred_bc,pred_aq
actual_bc,244,2
actual_aq,0,226
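
The test accuracy reported in 3.4.1 can be recovered from the confusion matrix: correct predictions sit on the diagonal, and everything off the diagonal is a misclassification. A short check with the counts copied from the table above (the variable names are illustrative):

```python
# Counts copied from the confusion matrix (rows = actual, columns = predicted)
bc_as_bc, bc_as_aq = 244, 2
aq_as_bc, aq_as_aq = 0, 226

total = bc_as_bc + bc_as_aq + aq_as_bc + aq_as_aq
correct = bc_as_bc + aq_as_aq
misclassified = total - correct

accuracy = correct / total
print(total, misclassified, round(accuracy, 6))  # 472 2 0.995763
```

Both misclassified titles are breast-cancer posts predicted as air-quality posts.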


### 3.5 - Understand Naive Bayes

> For educational purposes, this section reproduces the MNB model results manually in order to understand what is behind the code.

**3.5.1 - Vectorize**

In [372]:
# Instantiate the best model
cvec = CountVectorizer(max_df=0.9, 
                       max_features=1000, 
                       min_df=2,
                       ngram_range=(1, 1), 
                       stop_words='english',
                       tokenizer=LemmaTokenizer()
                      )

In [373]:
# Fit CountVectorizer
X_train_cv = cvec.fit_transform(X_train)
X_test_cv = cvec.transform(X_test)



In [374]:
# Check the first 5 rows of the training document-term matrix
X_train_cv_df = pd.DataFrame(X_train_cv.toarray(), columns = cvec.get_feature_names())
X_train_cv_df.head()

Unnamed: 0,000,10,100,12,13th,15,17,1700ppb,173,20,...,xinhua,xpost,xw,year,yes,yesterday,york,young,youtube,yuan
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
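
The matrix above is mostly zeros because each title uses only a handful of the vocabulary terms. The same structure can be built by hand for a toy corpus, as a rough sketch of what a count vectorizer produces (the three titles are made up, and stop words, `min_df`, and lemmatization are ignored):

```python
from collections import Counter

# Hypothetical titles for illustration only
docs = ["high co2 levels", "air quality index", "co2 levels rising"]
tokenized = [doc.split() for doc in docs]

# Vocabulary: every distinct token, in sorted order (one column per term)
vocab = sorted({token for doc in tokenized for token in doc})

# Document-term matrix: one row per document, one count per vocabulary term
matrix = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

print(vocab)
for row in matrix:
    print(row)
```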


**3.5.2 - Multinomial Naive Bayes Model**

In [375]:
# Fit Estimator
mnb = MultinomialNB()
mnb.fit(X_train_cv, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [376]:
# Model Test Scores
print(f'The accuracy for training data is {mnb.score(X_train_cv, y_train)}')
print(f'The accuracy for testing data is {mnb.score(X_test_cv, y_test)}')

The accuracy for training data is 0.9964664310954063
The accuracy for testing data is 0.9957627118644068


In [377]:
# Predict y using X_test
y_pred = mnb.predict(X_test_cv)

In [378]:
# Summarize Output Results
df_y_pred = pd.DataFrame(y_pred, columns=['class_pred'])
df_y_test = pd.DataFrame(y_test).reset_index(drop=True)
df_X_test = pd.DataFrame(X_test).reset_index(drop=False)
df_y_prob = pd.DataFrame(mnb.predict_proba(X_test_cv), columns=['bc_prob','aq_prob'])
df_y = pd.concat([df_X_test, df_y_prob, df_y_pred, df_y_test], axis=1)
df_y.head(13)

Unnamed: 0,index,post_title,bc_prob,aq_prob,class_pred,class
0,1514,With a peak over 300 AQI EPA this weekend in B...,0.007485148,0.992515,1,1
1,1642,A Beijing artist wore a face mask wedding dres...,2.415901e-06,0.999998,1,1
2,476,Have you been diagnosed with cancer? We need y...,0.9999949,5e-06,0,0
3,1007,Is my home air making me sick?,1.342636e-05,0.999987,1,1
4,639,29/M lump in breast,0.9999955,5e-06,0,0
5,1088,Question about Air Quality Index,2.192041e-05,0.999978,1,1
6,1622,Fewer children visited ER for asthma problems ...,1.327595e-06,0.999999,1,1
7,1100,China Renewable Energy Growth Soars &amp; Coal...,5.570501e-06,0.999994,1,1
8,1200,A Chinese company is offering free training fo...,2.480461e-07,1.0,1,1
9,397,Rare phyllodes tumour,0.8768979,0.123102,0,0


In [379]:
# Misclassified documents: predicted class differs from the actual class
misclassify = df_y['class_pred'] != df_y['class']
df_misclassified = df_y[misclassify]  # keep df_y intact; it is indexed again in 3.5.3
df_misclassified

Unnamed: 0,index,post_title,bc_prob,aq_prob,class_pred,class
306,691,Board Members You Night Events New Orleans Lou...,0.382495,0.617505,1,0
354,725,Axillary web syndrome anyone??,0.202809,0.797191,1,0


Both misclassified items can be correctly classified by a human reading through the document.

**3.5.3 - Reproduce Model Results**
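
The cells in this subsection follow the standard multinomial Naive Bayes formulation, stated here for reference. The symbols are notation introduced for this note: $c$ is a class, $d$ a document with words $w_i$, $N_{cw}$ the count of word $w$ in class $c$, $N_c$ the total word count in class $c$, and $|V|$ the vocabulary size.

$$
P(c \mid d) = \frac{P(c)\,\prod_{i} P(w_i \mid c)}{\sum_{c'} P(c')\,\prod_{i} P(w_i \mid c')}
$$

$$
\hat{P}(w \mid c) = \frac{N_{cw} + \alpha}{N_c + \alpha\,|V|}
$$

The first line is Bayes' rule under the "naive" assumption that words are independent given the class. The second is the smoothed word probability, with $\alpha = 1$ in this model (`alpha=1.0` in the `MultinomialNB` repr). The prior $P(c)$ is estimated from class frequencies in the training data.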


In [365]:
# Prior probability of each class (BC and AQ), estimated from the training data
prio_p_class = pd.DataFrame(mnb.class_count_, columns=['count'], index=['bc_prob','aq_prob'])
prio_p_class['prio_prob'] = prio_p_class['count']/prio_p_class['count'].sum()
prio_p_class

Unnamed: 0,count,prio_prob
bc_prob,737.0,0.520848
aq_prob,678.0,0.479152


In [350]:
# Pick the test document at position 12 and split its lowercased title into words
select_title = df_y.iloc[12, 1].lower().split(' ')
select_title

['high', 'co2', 'levels']

In [351]:
# Extract feature probability
feature_prob = pd.DataFrame(np.exp(mnb.feature_log_prob_), columns=cvec.get_feature_names()).T

In [352]:
# Extract feature probability for selected features
feat_prob_select = pd.DataFrame([feature_prob.loc[select_title, 0], feature_prob.loc[select_title, 1]],
                                index=['bc_prob','aq_prob'])
feat_prob_select

Unnamed: 0,high,co2,levels
bc_prob,0.001292,0.000258,0.000258
aq_prob,0.005821,0.006269,0.002537
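
These per-class word probabilities come from smoothed counts, so a word that never appears in a class still gets a small non-zero probability. The identical 0.000258 values for `co2` and `levels` in the `bc_prob` row would be consistent with that, although the raw counts are not shown here. A minimal sketch of additive (Laplace) smoothing on made-up counts, with a hypothetical helper `smoothed_probs`:

```python
def smoothed_probs(counts, alpha=1.0):
    """Additive smoothing: (count + alpha) / (total + alpha * vocabulary size)."""
    total = sum(counts.values())
    vocab_size = len(counts)
    return {
        word: (count + alpha) / (total + alpha * vocab_size)
        for word, count in counts.items()
    }

# Hypothetical word counts for one class (not the project data)
toy_counts = {"high": 4, "co2": 0, "levels": 0, "cancer": 6}

probs = smoothed_probs(toy_counts)
print(probs)
```

Without smoothing, a single unseen word would force the whole product for that class to zero.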


In [353]:
# Add the class prior probability to the DataFrame
mnb_prob = pd.concat([feat_prob_select, prio_p_class], axis=1)
mnb_prob.drop(columns='count', inplace=True) # the raw class counts are no longer needed
mnb_prob

Unnamed: 0,high,co2,levels,prio_prob
bc_prob,0.001292,0.000258,0.000258,0.520848
aq_prob,0.005821,0.006269,0.002537,0.479152


In [354]:
# Calculate P('high'|class) * P('co2'|class) * P('levels'|class) * P(class) for each class
mnb_prob['p_high_co2_levels'] = mnb_prob.product(axis=1)
mnb_prob

Unnamed: 0,high,co2,levels,prio_prob,p_high_co2_levels
bc_prob,0.001292,0.000258,0.000258,0.520848,4.493121e-11
aq_prob,0.005821,0.006269,0.002537,0.479152,4.436206e-08


In [355]:
# Normalize so that the two class scores sum to 1 (the posterior probabilities)
mnb_prob['p_high_co2_levels_norm'] = mnb_prob['p_high_co2_levels']/mnb_prob['p_high_co2_levels'].sum()
mnb_prob

Unnamed: 0,high,co2,levels,prio_prob,p_high_co2_levels,p_high_co2_levels_norm
bc_prob,0.001292,0.000258,0.000258,0.520848,4.493121e-11,0.001012
aq_prob,0.005821,0.006269,0.002537,0.479152,4.436206e-08,0.998988
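
The same arithmetic can be checked without pandas. The sketch below uses the rounded values printed in the table, so the result agrees with the table only to about five decimal places:

```python
# Rounded values copied from the table above
likelihoods = {
    "bc": [0.001292, 0.000258, 0.000258],  # P(high|bc), P(co2|bc), P(levels|bc)
    "aq": [0.005821, 0.006269, 0.002537],  # P(high|aq), P(co2|aq), P(levels|aq)
}
priors = {"bc": 0.520848, "aq": 0.479152}

# Unnormalized score per class: prior times the product of word probabilities
joint = {}
for label, prior in priors.items():
    score = prior
    for p in likelihoods[label]:
        score *= p
    joint[label] = score

# Divide by the sum so the two scores become probabilities
evidence = sum(joint.values())
posterior = {label: score / evidence for label, score in joint.items()}
print(posterior)
```

The air-quality class wins by roughly three orders of magnitude, driven mostly by `co2` and `levels`.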


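**3.5.4 - Result Comparison**

> Results match: the manually computed probabilities for "High CO2 Levels" (0.001012 for BC, 0.998988 for AQ) agree with the `predict_proba` output for the same document (0.00101181 and 0.998988).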

In [356]:
mnb_prob

Unnamed: 0,high,co2,levels,prio_prob,p_high_co2_levels,p_high_co2_levels_norm
bc_prob,0.001292,0.000258,0.000258,0.520848,4.493121e-11,0.001012
aq_prob,0.005821,0.006269,0.002537,0.479152,4.436206e-08,0.998988


In [357]:
df_y.iloc[12]

index                    1150
post_title    High CO2 Levels
bc_prob            0.00101181
aq_prob              0.998988
class_pred                  1
class                       1
Name: 12, dtype: object
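
The model exposes `feature_log_prob_` as logarithms, which is why the cell above applies `np.exp` before multiplying. For longer documents, multiplying many small probabilities can underflow, so a common alternative is to add logs and normalize with the log-sum-exp trick. The sketch below applies that to the same rounded numbers; it is an illustration of the technique, not a copy of scikit-learn's internals:

```python
import math

# Rounded values copied from the tables above
priors = {"bc": 0.520848, "aq": 0.479152}
likelihoods = {
    "bc": [0.001292, 0.000258, 0.000258],
    "aq": [0.005821, 0.006269, 0.002537],
}

# Sum of logs instead of a product of probabilities
log_scores = {
    label: math.log(priors[label]) + sum(math.log(p) for p in likelihoods[label])
    for label in priors
}

# Log-sum-exp: subtract the max before exponentiating for numerical stability
max_score = max(log_scores.values())
log_evidence = max_score + math.log(
    sum(math.exp(s - max_score) for s in log_scores.values())
)
log_posterior = {label: math.exp(s - log_evidence) for label, s in log_scores.items()}
print(log_posterior)
```

The result is the same posterior as the direct product, up to floating-point rounding.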