# 4.0 Modeling 

Contents

4.1 [Introduction](#4.1)

  * [4.1.1 Problem Recap](#4.1.1)
  * [4.1.2 Notebook Goals](#4.1.2)
 
4.2 [Load the data](#4.2)

  * [4.2.1 Imports](#4.2.1)
  * [4.2.2 Load the data](#4.2.2)

4.3 [Examine Class Split](#4.3)

4.4 [Pre-processing](#4.4)

  * [4.4.1 Set Random Seed for Reproducability](#4.4.1)
  * [4.4.2 Train/test Split](#4.4.2)
  * [4.4.4 Examine Class Split for Train/Test Data](#4.4.4)
  

4.5 [Setting Up Pipelines](#4.5)
  * 4.5.1 [Previous Best Model: Logistic Regression with Count Vectorization](#4.5.1)
<br/><br/>
    * [4.5.1.1 Training and Fitting the Model](#4.5.1.1)
    * [4.5.1.2 Evaluating the Model](#4.5.1.2)
<br/><br/>
 

## 4.1 Introduction <a name="4.1"></a>

### 4.1.1 Problem Recap <a name="4.1.1"><a/>

Using customer text data about amazon products, we will build, evaluate and compare models to estimate the probability that a given text review can be classified as “positive” or “negative”.

Our goal is to build a text classifier using Amazon product review data which can be used to analyze customer sentiment which does not have accompanying numeric data. The metric we will be primarily interested in will be Recall on the positive class. This is the proportion of the positive class (negative reviews coded as "1" in the data) we correctly predict.

### 4.1.2 Notebook Goals <a name="4.1.2"></a>

1. In our previous notebook our best results came from Term-Frequency Inverse-Document Frequency vectorization and a Logistic Regression Model.

2. We had slightly worse results from a Naive Bayes and Random Forest model. The Naive Bayes model incorrectly predicted a higher proportion of the negative class and the Random Forest model appeared to strongly overfit the training data with a very poor Recall on the test set.

3. Try over-sampling the minority class that we are trying to predict (encoded as "1"s) and/or under-sampling the majority class.

4. Test some other models such as gradient boosted trees (LightGBM/XGBoost) 

5. Examine how well our models will generalize with K-fold cross validation.

6. Tune hyper-parameters with grid-search or bayesian search optimization.

## 4.2 Load the data <a name="4.2"><a/>

### 4.2.1 Imports <a name="4.2.1"><a/>

In [1]:
from random import seed

#reading/processing data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pyarrow.parquet as pq


#splitting the dataset
from sklearn.model_selection import train_test_split

#scaling/vectorization
from sklearn.feature_extraction.text import TfidfVectorizer

# models
from sklearn.linear_model import LogisticRegression
import xgboost as xgb
from imblearn.pipeline import Pipeline
import lightgbm as lgb

#metrics
from sklearn.metrics import recall_score

#dealing with class imbalance
from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler

#hyper-parameter tuning
import optuna


  from pandas import MultiIndex, Int64Index
  from .autonotebook import tqdm as notebook_tqdm


### 4.2.2 Load the data <a name="4.2.2"><a/>

In [2]:
data = pq.read_table("../data/edited/fashion.parquet")
fashion = data.to_pandas()

In [3]:
def objective(trial, sub_sample_prop):

    # select vectorization parameters
    


    #sampler
    sampler_type = trial.suggest_categorical('sampler', [None, 'ros', 'rus', 'smote', 'ada'])

    if sampler_type == 'ros':
        sampler = RandomOverSampler(random_state=0)
    
    elif sampler_type == 'smote':
        k_neighbors = trial.suggest_int('k_neighbors', 2,10)
        sampler = SMOTE(random_state=0, k_neighbors=k_neighbors)
    
    elif sampler_type == 'rus':
        sampler = RandomUnderSampler(random_state=0)
    
    elif sampler_type == 'ada':
        n_neighbors = trial.suggest_int('n_neighbors', 2,10)
        sampler = ADASYN(n_neighbors=n_neighbors)
    
    else:
        sampler = None


    model_type = trial.suggest_categorical('classifier', ['XGBClassifier'])

    if model_type == 'LogisticRegression':
        #optimize params
        C = trial.suggest_categorical('C', [1.0, 0.1, 0.01]) #note: models with larger values for C failed to converge
        
        #model
        model = LogisticRegression(solver = "lbfgs", n_jobs=-1, max_iter=1000, C=C)

    elif model_type == 'XGBClassifier':
        #optimize params
        learning_rate = trial.suggest_categorical('learning_rate', [0.2, 0.1, 0.01, .001, .0001])
        max_depth = trial.suggest_int('max_depth', 3, 20)
        n_estimators = trial.suggest_categorical('n_estimators', [200,500,1000, 2000, 4000])

        #model
        model = xgb.XGBClassifier(n_estimators=n_estimators, max_depth=max_depth, learning_rate=learning_rate, n_jobs=-1, verbosity=0, use_label_encoder=False)

    else:
        #optimize params
        learning_rate = trial.suggest_categorical('learning_rate', [0.2, 0.1, 0.01, .001, .0001])
        max_depth = trial.suggest_int('max_depth', 3, 20)
        n_estimators = trial.suggest_categorical('n_estimators', [200,500,1000, 2000,4000])

        #model
        model = lgb.LGBMClassifier(max_depth = max_depth, n_estimators=n_estimators)
    
    pipeline = Pipeline([('sampler', sampler), ('model',model)])
    
    X_train_sample, _, y_train_sample, _ = train_test_split(X_train, y_train, train_size=sub_sample_prop)

    pipeline.fit(X_train_sample, y_train_sample)
        
    y_preds = pipeline.predict(X_test)

    return recall_score(y_preds, y_test)

In [4]:
tfidf = TfidfVectorizer(ngram_range=(1,2), min_df = 5, max_df=0.95)

X_train, X_test, y_train, y_test = train_test_split(fashion["review"].values, fashion["neg_sentiment"], test_size = .1, random_state=1)

y_train, y_test = np.ravel(y_train), np.ravel(y_test)

y_train, y_test = y_train.astype(int), y_test.astype(int)

tfidf.fit(X_train)
X_train = tfidf.transform(X_train)
X_test = tfidf.transform(X_test)

study = optuna.create_study(direction='maximize')

study.optimize(objective, n_trials=10)

[32m[I 2022-07-12 10:49:41,395][0m A new study created in memory with name: no-name-ceef3b97-1ccf-458e-b75b-abdba51e7b5f[0m
[32m[I 2022-07-12 10:58:15,653][0m Trial 0 finished with value: 0.7586171381488602 and parameters: {'sampler': 'smote', 'k_neighbors': 4, 'classifier': 'XGBClassifier', 'learning_rate': 0.01, 'max_depth': 11, 'n_estimators': 4000}. Best is trial 0 with value: 0.7586171381488602.[0m
[32m[I 2022-07-12 10:59:38,000][0m Trial 1 finished with value: 0.752669944058315 and parameters: {'sampler': 'ada', 'n_neighbors': 10, 'classifier': 'XGBClassifier', 'learning_rate': 0.1, 'max_depth': 9, 'n_estimators': 500}. Best is trial 0 with value: 0.7586171381488602.[0m
[32m[I 2022-07-12 11:00:12,195][0m Trial 2 finished with value: 0.7298604825369082 and parameters: {'sampler': 'rus', 'classifier': 'XGBClassifier', 'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 2000}. Best is trial 0 with value: 0.7586171381488602.[0m
[32m[I 2022-07-12 11:01:39,835][0m Trial

## Timing comparison of various under/oversampling techniques

In data sets with an imbalanced split between the classes we are trying to predict, there are few possible approaches to try to improve the target metric our model (classifier) is optimizing for.

1. Over-sampling - if we train with a higher proportion of the class we are trying to predict using resampling, we may be able to improve the result for our classifier.

2. Under-sampling - by the same logic, we can under-sample the majority classes we are NOT trying to predict.

Both of these can introduce worse outcome metrics for our alternate classes, which may be an issue depending on the specific problem.


3. Synthesize data: we can generate artificial data using the minority class we are trying to predict. ADASYN and SMOTE both use Nearest Neigbhors algorithms to generate artificial points that are located "close" in the n-dimensional feature space of the target class to the actual data points. Conceptually, we can think of it as if we gathered MORE data, and are assuming it looks similar to the current data we have. It will be unlikely to have strong outliers due to the nature of the algorithm and will be more "clumped" together than if we gathered more "real" data.


## Graphical output of the various types of sampling/explanations

## Timing comparison of various models

## Accuracy comparison of various models