# 4.0 Modeling 

Contents

4.1 [Introduction](#4.1)

  * [4.1.1 Problem Recap](#4.1.1)
  * [4.1.2 Notebook Goals](#4.1.2)
 
4.2 [Load the data](#4.2)

  * [4.2.1 Imports](#4.2.1)
  * [4.2.2 Load the data](#4.2.2)

4.3 [Examine Class Split](#4.3)

4.4 [Pre-processing](#4.4)

  * [4.4.1 Set Random Seed for Reproducability](#4.4.1)
  * [4.4.2 Train/test Split](#4.4.2)
  * [4.4.4 Examine Class Split for Train/Test Data](#4.4.4)
  

4.5 [Setting Up Pipelines](#4.5)
  * 4.5.1 [Previous Best Model: Logistic Regression with Count Vectorization](#4.5.1)
<br/><br/>
    * [4.5.1.1 Training and Fitting the Model](#4.5.1.1)
    * [4.5.1.2 Evaluating the Model](#4.5.1.2)
<br/><br/>
 

## 4.1 Introduction <a name="4.1"></a>

### 4.1.1 Problem Recap <a name="4.1.1"><a/>

Using customer text data about amazon products, we will build, evaluate and compare models to estimate the probability that a given text review can be classified as “positive” or “negative”.

Our goal is to build a text classifier using Amazon product review data which can be used to analyze customer sentiment which does not have accompanying numeric data. The metric we will be primarily interested in will be Recall on the positive class. This is the proportion of the positive class (negative reviews coded as "1" in the data) we correctly predict.

### 4.1.2 Notebook Goals <a name="4.1.2"></a>

1. In our previous notebook our best results came from Term-Frequency Inverse-Document Frequency vectorization and a Logistic Regression Model.

2. We had slightly worse results from a Naive Bayes and Random Forest model. The Naive Bayes model incorrectly predicted a higher proportion of the negative class and the Random Forest model appeared to strongly overfit the training data with a very poor Recall on the test set.

3. Try over-sampling the minority class that we are trying to predict (encoded as "1"s) and/or under-sampling the majority class.

4. Test some other models such as gradient boosted trees (LightGBM/XGBoost) 

5. Examine how well our models will generalize with K-fold cross validation.

6. Tune hyper-parameters with grid-search or bayesian search optimization.

## 4.2 Load the data <a name="4.2"><a/>

### 4.2.1 Imports <a name="4.2.1"><a/>

In [1]:
from random import seed

#reading/processing data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pyarrow.parquet as pq
import plotly.express

#splitting the dataset
from sklearn.model_selection import train_test_split

#scaling/vectorization
from sklearn.feature_extraction.text import TfidfVectorizer

# models
from sklearn.linear_model import LogisticRegression
import xgboost as xgb
from imblearn.pipeline import Pipeline
import lightgbm as lgb

#metrics
from sklearn.metrics import recall_score

#dealing with class imbalance
from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler

#hyper-parameter tuning
import optuna


  from pandas import MultiIndex, Int64Index


### 4.2.2 Load the data <a name="4.2.2"><a/>

In [2]:
data = pq.read_table("../data/edited/fashion.parquet")
fashion = data.to_pandas()

## Vectorize and split the data into train/test sets

In [3]:
#Vectorizing and splitting the data into train and test sets

tfidf = TfidfVectorizer(ngram_range=(1,2), min_df = 5, max_df=0.95)

X_train, X_test, y_train, y_test = train_test_split(fashion["review"].values, fashion["neg_sentiment"], test_size = .1, random_state=1)

#convert to 1d arrays
y_train, y_test = np.ravel(y_train), np.ravel(y_test)

#ensure our 1/0 values are integers for the pipeline model we will use
y_train, y_test = y_train.astype(int), y_test.astype(int)

#fit on ONLY the training data
tfidf.fit(X_train)

#transform both train and test data
X_train = tfidf.transform(X_train)
X_test = tfidf.transform(X_test)

## Timing comparison of various under/oversampling techniques

In data sets with an imbalanced split between the classes we are trying to predict, there are few possible approaches to try to improve the target metric our model (classifier) is optimizing for.

1. Over-sampling - if we train with a higher proportion of the class we are trying to predict using resampling, we may be able to improve the result for our classifier.

2. Under-sampling - by the same logic, we can under-sample the majority classes we are NOT trying to predict.

Both of these can introduce worse outcome metrics for our alternate classes, which may be an issue depending on the specific problem.


3. Synthesize data: we can generate artificial data using the minority class we are trying to predict. ADASYN and SMOTE both use Nearest Neigbhors algorithms to generate artificial points that are located "close" in the n-dimensional feature space of the target class to the actual data points. Conceptually, we can think of it as if we gathered MORE data, and are assuming it looks similar to the current data we have. It will be unlikely to have strong outliers due to the nature of the algorithm and will be more "clumped" together than if we gathered more "real" data.


## Graphical output of the various types of sampling/explanations

## Timing comparison of various models


In [4]:
smote = SMOTE()

ada = ADASYN()

ros = RandomOverSampler()

rus = RandomUnderSampler()

In [5]:
samplers = {"smote":smote, "ada":ada, "ros":ros, "rus":rus}

timing_dict = {sampler:{k:None for k in range(1,25)} for sampler in samplers.keys()}

for sampler in samplers.keys():
    for i in range(1,25):
    
        n_rows = i*5000
    
        time_var = %timeit -n1 -o samplers[sampler].fit_resample(X_train[0:n_rows,:], y_train[0:n_rows])

        timing_dict[sampler][i] = np.mean(time_var.all_runs)

35.1 ms ± 3.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
143 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
317 ms ± 831 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
571 ms ± 1.96 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
895 ms ± 8.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.3 s ± 11.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.76 s ± 39.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.18 s ± 53.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.7 s ± 6.92 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.53 s ± 3.34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4.12 s ± 15.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4.88 s ± 23.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
5.87 s ± 13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
6.84 s ± 30.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
7.78 s ± 51.4 ms per loop (mean ± std. dev. of 7 run

KeyboardInterrupt: 

In [None]:
dfs = []

for sampler_type in samplers.keys():

    dfs.append(pd.DataFrame([(k,v) for k,v in timing_dict[sampler_type].items()], columns=["iter", "time_in_seconds"]))

In [None]:
for sampler_type, df in zip(samplers.keys(), dfs):
    df["sampler_type"] = sampler_type

In [None]:
combined_df = pd.concat(dfs, axis=0)

In [None]:
plotly.express.line(data_frame=combined_df, x="iter", y="time_in_seconds", color='sampler_type')

### As shown above, the synthetic data generation algorithms (SMOTE and ADASYN) which rely on K-Nearest Neighbors algorithms to generate the new data, run slower and slower on larger proportions of our data.

The KNN worst case time complexity is O(N * kD) (where N = number of data points/row, K is the number of neighbors used, and D is the number of features/columns). Because we have vectorized the reviews with TFIDF and uni AND bigrams, the width of the data is roughly 2 million columns.

We will use Optuna to do bayesian optimization over the various sampling methods, a few possible classification models (Logistic Regression, XGBoost, and LGBM (another boosted tree model)), and a number of hyperparameters.

## Accuracy comparison of various models

In [34]:
def objective(trial, sub_sample_prop, model_choices):
    """
    
    
    
    """

    #sampler
    sampler_type = trial.suggest_categorical('sampler', [None, 'ros', 'rus', 'smote', 'ada'])

    if sampler_type == 'ros':
        sampler = RandomOverSampler(random_state=0)
    
    elif sampler_type == 'smote':
        k_neighbors = trial.suggest_int('k_neighbors', 2,10)
        sampler = SMOTE(random_state=0, k_neighbors=k_neighbors)
    
    elif sampler_type == 'rus':
        sampler = RandomUnderSampler(random_state=0)
    
    elif sampler_type == 'ada':
        n_neighbors = trial.suggest_int('n_neighbors', 2,10)
        sampler = ADASYN(n_neighbors=n_neighbors)
    
    else:
        sampler = None


    model_type = trial.suggest_categorical('classifier', model_choices)

    if model_type == 'logreg':
        #optimize params
        C = trial.suggest_categorical('C', [1.0, 0.1, 0.01]) #note: models with larger values for C failed to converge
        
        #model
        model = LogisticRegression(solver = "lbfgs", n_jobs=-1, max_iter=1000, C=C)

    elif model_type == 'xgboost':
        #optimize params
        learning_rate = trial.suggest_categorical('learning_rate', [0.2, 0.1, 0.01, .001, .0001])
        max_depth = trial.suggest_int('max_depth', 3, 20)
        n_estimators = trial.suggest_categorical('n_estimators', [200,500,1000, 2000, 4000])

        #model
        model = xgb.XGBClassifier(n_estimators=n_estimators, max_depth=max_depth, learning_rate=learning_rate, n_jobs=-1, verbosity=0, use_label_encoder=False)

    elif model_type == "lgbm":
        #optimize params
        learning_rate = trial.suggest_categorical('learning_rate', [0.2, 0.1, 0.01, .001, .0001])
        max_depth = trial.suggest_int('max_depth', 3, 20)
        n_estimators = trial.suggest_categorical('n_estimators', [200,500,1000, 2000,4000])

        #model
        model = lgb.LGBMClassifier(max_depth = max_depth, n_estimators=n_estimators)
    
    pipeline = Pipeline([('sampler', sampler), ('model',model)])
    
    X_train_sample, _, y_train_sample, _ = train_test_split(X_train, y_train, train_size=sub_sample_prop)

    pipeline.fit(X_train_sample, y_train_sample)
        
    y_preds = pipeline.predict(X_test)

    return recall_score(y_preds, y_test)

In [None]:
func = lambda trial: objective(trial, .01, ["xgboost"])

study = optuna.create_study(direction='maximize')

study.optimize(func, n_trials=10)

[32m[I 2022-07-12 12:14:51,893][0m A new study created in memory with name: no-name-98004783-b204-40b2-970b-7230d61f9899[0m
[32m[I 2022-07-12 12:18:01,816][0m Trial 0 finished with value: 0.5415803203549747 and parameters: {'sampler': 'ada', 'n_neighbors': 2, 'classifier': 'xgboost', 'learning_rate': 0.0001, 'max_depth': 17, 'n_estimators': 2000}. Best is trial 0 with value: 0.5415803203549747.[0m
[32m[I 2022-07-12 12:18:54,835][0m Trial 1 finished with value: 0.5441885107349194 and parameters: {'sampler': 'smote', 'k_neighbors': 9, 'classifier': 'xgboost', 'learning_rate': 0.0001, 'max_depth': 20, 'n_estimators': 500}. Best is trial 1 with value: 0.5441885107349194.[0m
[32m[I 2022-07-12 12:19:26,027][0m Trial 2 finished with value: 0.7542236373852781 and parameters: {'sampler': 'ros', 'classifier': 'xgboost', 'learning_rate': 0.1, 'max_depth': 12, 'n_estimators': 1000}. Best is trial 2 with value: 0.7542236373852781.[0m
[32m[I 2022-07-12 12:19:37,298][0m Trial 3 finished