# Project 3 - Web APIs & Classification

## Part 3b - Modeling: Logistic Regression


In this part, the corpus is classified with a **logistic regression** model. I used two vectorizers, **CountVectorizer** and **TfidfVectorizer**. Each vectorizer and its hyperparameters were tuned by wrapping a **Pipeline** in **GridSearchCV**.

### Result Summary

>**Accuracy & Misclassification**

|          Metric           | Baseline | CountVectorizer | TfidfVectorizer |
|:-------------------------:|:--------:|:---------------:|:---------------:|
| Accuracy (train)          |   0.52   |       1.0       |      0.993      |
| Accuracy (test)           |     -    |       1.0       |      0.994      |
| Misclassifications (test) |     -    |        0        |        3        |
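
The notebook does not show how the baseline was computed. A common convention, assumed here, is the accuracy of always predicting the most frequent class. The sketch below applies that convention to the test-set class counts that appear in the confusion matrices of section 3.4.3; the helper name `majority_baseline` is hypothetical.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Test-set class counts read off the confusion matrices in section 3.4.3:
# 226 posts of class 0 and 246 posts of class 1.
test_labels = [0] * 226 + [1] * 246

baseline = majority_baseline(test_labels)
print(round(baseline, 2))
```

Under this assumption the test-set baseline rounds to the same 0.52 reported for the training set, which is what a stratified split should produce.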

>**Best Model Parameters**

|   Parameter    | CountVectorizer | TfidfVectorizer |
|:--------------:|:---------------:|:---------------:|
| Tokenizer      |     default     |     default     |
| Preprocessing  |  Lemmatization  |  Lemmatization  |
| Regularization |   Lasso (L1)    |   Lasso (L1)    |
| min_df         |        2        |        3        |
| max_df         |      0.98       |       0.9       |
| max_features   |      1000       |       500       |
| ngram_range    |     (1, 1)      |     (1, 1)      |
| stop_words     |     english     |     english     |

### Table of Contents

- [3.0-Import Libraries](#3.0---Import-Libraries)
- [3.1-Load Data](#3.1---Load-Data)
- [3.2-Model Preparation](#3.2---Model-Preparation)
- [3.3-Fit & Run Model](#3.3---Fit-&-Run-Model)
- [3.4-Results](#3.4---Results)

### 3.0 - Import Libraries

In [115]:
import numpy as np
import pandas as pd
from nltk import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression 
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix

### 3.1 - Load Data

In [116]:
%store -r df_to_preprocess
df = df_to_preprocess
df.head()

|   | post_title | post_content | title_and_content | class |
|:-:|:-----------|:-------------|:------------------|:-----:|
| 0 | What could cause such high HCHO readings in my... |  | What could cause such high HCHO readings in my... | 0 |
| 1 | Is my air full of cancer or is this normal for... |  | Is my air full of cancer or is this normal for... | 0 |
| 2 | We are air quality experts, we are www.particu... |  | We are air quality experts, we are www.particu... | 0 |
| 3 | I bought several air purifiers | I bought several air purifiers and the results... | I bought several air purifiers I bought severa... | 0 |
| 4 | Air Quality in Cars: Pollutants and Challenges | Vehicle interior air quality has been a topic... | Air Quality in Cars: Pollutants and Challenges... | 0 |


### 3.2 - Model Preparation

**3.2.1 - Set X and y**

In [117]:
# Set X and y
X = df['post_title']
y = df['class']

**3.2.2 - Train/Test Split**

In [118]:
# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    random_state=42, 
                                                    stratify=y)
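
Passing `stratify=y` asks `train_test_split` to keep the class mix the same in both parts. The sketch below is a hand-rolled illustration of that idea in plain Python, not scikit-learn's implementation; the function `stratified_split` and the toy labels are hypothetical, and the 25% test share mirrors the `train_test_split` default.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.25, seed=42):
    """Return (train_idx, test_idx) with roughly the same class mix in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_test = round(len(indices) * test_size)  # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Hypothetical toy labels: 48 posts of class 0 and 52 of class 1
labels = [0] * 48 + [1] * 52
train_idx, test_idx = stratified_split(labels)

train_share = sum(labels[i] for i in train_idx) / len(train_idx)
test_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), train_share, test_share)
```

Both parts end up with the same share of class 1, so train and test accuracy are measured against the same baseline.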

**3.2.3 - LemmaTokenizer**

In [119]:
# Custom tokenizer for the vectorizers: regex tokenization followed by lemmatization
class LemmaTokenizer:
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        # Same pattern as sklearn's default token_pattern: two or more word characters
        tokenizer = RegexpTokenizer(r'(?u)\b\w\w+\b')
        return [self.wnl.lemmatize(t) for t in tokenizer.tokenize(doc)]
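
To see which tokens the regular expression in `LemmaTokenizer` keeps, the sketch below applies the same pattern with the standard `re` module. It deliberately leaves out the lemmatization step (which needs the NLTK WordNet data), and the helper name `regex_tokens` is hypothetical.

```python
import re

# Same pattern as in LemmaTokenizer: runs of two or more word characters
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def regex_tokens(doc):
    """Split a document into tokens, dropping single-character tokens."""
    return re.findall(TOKEN_PATTERN, doc)

tokens = regex_tokens("I bought 2 air purifiers")
print(tokens)  # ['bought', 'air', 'purifiers']
```

Single-character tokens such as "I" and "2" never reach the lemmatizer, so they cannot become features.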

### 3.3 - Fit & Run Model

**3.3.1 - CountVectorizer**

In [120]:
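# Instantiate Pipeline
# The L1 (lasso) penalty is passed by keyword. On scikit-learn >= 0.22 the default
# solver (lbfgs) does not support L1, so add solver='liblinear' or 'saga' there.
pipe_cv = Pipeline([('cvec', CountVectorizer(tokenizer=LemmaTokenizer())),
                    ('lr', LogisticRegression(penalty='l1'))
                   ])

# Pipeline parameters for CountVectorizer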
pipe_params_cv = {
    'cvec__max_features': [100, 500, 1000],
    'cvec__stop_words': [None, 'english'],
    'cvec__ngram_range':[(1,1),(1,2)],
    'cvec__min_df': [2, 3, 4],
    'cvec__max_df': [.9, .95, .98]
}
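
The `min_df` and `max_df` entries in the grid prune the vocabulary by document frequency: an integer `min_df` is a minimum number of documents, and a float `max_df` is a maximum share of documents. The sketch below illustrates that filtering rule in plain Python on a hypothetical toy corpus; `filter_vocabulary` is an illustrative helper, not part of scikit-learn.

```python
from collections import Counter

def filter_vocabulary(tokenized_docs, min_df=2, max_df=0.9):
    """Keep terms that appear in at least min_df documents (absolute count)
    and in at most max_df of all documents (proportion)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    max_count = max_df * n_docs
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= max_count)

# Hypothetical, already tokenized toy corpus
docs = [
    ["air", "quality", "sensor"],
    ["air", "purifier", "filter"],
    ["air", "quality", "filter"],
    ["air", "humidity"],
]

vocab = filter_vocabulary(docs, min_df=2, max_df=0.9)
print(vocab)  # ['filter', 'quality']
```

Here "air" is dropped because it appears in every document, and the terms seen only once are dropped by `min_df`. In the vectorizers, `max_features` then caps the number of remaining terms.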

In [121]:
# GridSearch
gs_cv = GridSearchCV(pipe_cv, 
                     param_grid=pipe_params_cv, 
                     verbose=1,
                     cv=3,
                     n_jobs=4
                    )
gs_cv.fit(X_train, y_train)

Fitting 3 folds for each of 108 candidates, totalling 324 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    7.9s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:   20.4s
[Parallel(n_jobs=4)]: Done 324 out of 324 | elapsed:   30.5s finished


GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('cvec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...penalty='l1', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=4,
       param_grid={'cvec__max_features': [100, 500, 1000], 'cvec__stop_words': [None, 'english'], 'cvec__ngram_range': [(1, 1), (1, 2)], 'cvec__min_df': [2, 3, 4], 'cvec__max_df': [0.9, 0.95, 0.98]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)
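
The "108 candidates, totalling 324 fits" line in the log follows directly from the grid: every combination of parameter values is one candidate, and each candidate is fitted once per fold. A short check of that arithmetic, using only the standard library:

```python
from itertools import product

# Same grid as pipe_params_cv
param_grid = {
    'cvec__max_features': [100, 500, 1000],
    'cvec__stop_words': [None, 'english'],
    'cvec__ngram_range': [(1, 1), (1, 2)],
    'cvec__min_df': [2, 3, 4],
    'cvec__max_df': [.9, .95, .98],
}
n_folds = 3  # cv=3 in GridSearchCV

# One candidate per combination of values: 3 * 2 * 2 * 3 * 3
candidates = list(product(*param_grid.values()))
n_candidates = len(candidates)
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)  # 108 324
```

With `refit=True`, GridSearchCV additionally refits the best candidate once on the full training set; that final fit is not part of the 324.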

**3.3.2 - TfidfVectorizer**

In [122]:
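# Instantiate Pipeline
# The L1 (lasso) penalty is passed by keyword. On scikit-learn >= 0.22 the default
# solver (lbfgs) does not support L1, so add solver='liblinear' or 'saga' there.
pipe_tv = Pipeline([('tvec', TfidfVectorizer(tokenizer=LemmaTokenizer())),
                    ('lr', LogisticRegression(penalty='l1'))
                   ])

# Pipeline parameters for TfidfVectorizer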
pipe_params_tv = {
    'tvec__max_features': [100, 500, 1000],
    'tvec__stop_words': [None, 'english'],
    'tvec__ngram_range':[(1,1),(1,2)],
    'tvec__min_df': [2, 3, 4],
    'tvec__max_df': [.9, .95, .98]
}

In [123]:
gs_tv = GridSearchCV(pipe_tv, 
                     param_grid=pipe_params_tv, 
                     verbose=1,
                     cv=3,
                     n_jobs=4
                    )
gs_tv.fit(X_train, y_train)

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.


Fitting 3 folds for each of 108 candidates, totalling 324 fits


[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    3.3s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:   15.9s
[Parallel(n_jobs=4)]: Done 324 out of 324 | elapsed:   26.2s finished


GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('tvec', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
...penalty='l1', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=4,
       param_grid={'tvec__max_features': [100, 500, 1000], 'tvec__stop_words': [None, 'english'], 'tvec__ngram_range': [(1, 1), (1, 2)], 'tvec__min_df': [2, 3, 4], 'tvec__max_df': [0.9, 0.95, 0.98]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)
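
The estimator printout shows the TF-IDF settings `smooth_idf=True` and `norm='l2'`. As a rough illustration of what those two settings do to a single document, the sketch below computes the smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, multiplies it by the raw term counts, and rescales the row to unit length. The helper names and the toy counts are hypothetical; this is a plain-Python sketch, not scikit-learn's code.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def tfidf_row(term_counts, doc_freqs, n_docs):
    """TF-IDF weights for one document, scaled to unit L2 norm."""
    weights = {t: c * smooth_idf(n_docs, doc_freqs[t]) for t, c in term_counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

# Hypothetical toy numbers: 4 documents, "air" occurs in all of them, "filter" in 2
n_docs = 4
doc_freqs = {"air": 4, "filter": 2}
row = tfidf_row({"air": 1, "filter": 1}, doc_freqs, n_docs)

print(row)
```

Both terms occur once in the document, yet the rarer term "filter" receives the larger weight. CountVectorizer would give the two terms the same value.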

### 3.4 - Results

**3.4.1 - Accuracy**

In [124]:
# Train and test accuracy for both pipelines
lr_cv_train = gs_cv.score(X_train, y_train)
lr_cv_test = gs_cv.score(X_test, y_test)
lr_tv_train = gs_tv.score(X_train, y_train)
lr_tv_test = gs_tv.score(X_test, y_test)

pd.DataFrame({'CV': [lr_cv_train, lr_cv_test], 'TV': [lr_tv_train, lr_tv_test]}, index=['train','test'])

|       | CV  | TV       |
|:-----:|:---:|:--------:|
| train | 1.0 | 0.992933 |
| test  | 1.0 | 0.993644 |


**3.4.2 - Hyperparameters**

In [125]:
print(gs_cv.best_params_)
print()
print(gs_tv.best_params_)

{'cvec__max_df': 0.98, 'cvec__max_features': 1000, 'cvec__min_df': 2, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': 'english'}

{'tvec__max_df': 0.9, 'tvec__max_features': 500, 'tvec__min_df': 3, 'tvec__ngram_range': (1, 1), 'tvec__stop_words': 'english'}


**3.4.3 - Confusion Matrix**

In [126]:
y_pred_cv = gs_cv.predict(X_test)
y_pred_tv = gs_tv.predict(X_test)

In [127]:
pd.DataFrame(confusion_matrix(y_test, y_pred_cv))

| True \ Predicted |  0  |  1  |
|:----------------:|:---:|:---:|
|        0         | 226 |  0  |
|        1         |  0  | 246 |


In [128]:
pd.DataFrame(confusion_matrix(y_test, y_pred_tv))

| True \ Predicted |  0  |  1  |
|:----------------:|:---:|:---:|
|        0         | 223 |  3  |
|        1         |  0  | 246 |
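
The accuracy and misclassification figures in the result summary can be recovered from these two matrices. In the `confusion_matrix` output, rows are true classes and columns are predicted classes, so the diagonal holds the correct predictions. A small check of the arithmetic, with a hypothetical helper `accuracy_and_errors`:

```python
def accuracy_and_errors(cm):
    """Return (accuracy, number of misclassified samples) for a square confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total, total - correct

# Values copied from the two confusion matrices above
cm_cv = [[226, 0], [0, 246]]
cm_tv = [[223, 3], [0, 246]]

acc_cv, errors_cv = accuracy_and_errors(cm_cv)
acc_tv, errors_tv = accuracy_and_errors(cm_tv)

print(acc_cv, errors_cv)            # 1.0 0
print(round(acc_tv, 6), errors_tv)  # 0.993644 3
```

All three TF-IDF errors sit in the top-right cell: posts whose true class is 0 were predicted as class 1.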
