# 4.0 Modeling 

Contents

4.1 [Introduction](#4.1)

  * [4.1.1 Problem Recap](#4.1.1)
  * [4.1.2 Notebook Goals](#4.1.2)
 
4.2 [Load the data](#4.2)

  * [4.2.1 Imports](#4.2.1)
  * [4.2.2 Load the data](#4.2.2)

4.3 [Examine Class Split](#4.3)

4.4 [Pre-processing](#4.4)

  * [4.4.1 Set Random Seed for Reproducability](#4.4.1)
  * [4.4.2 Train/test Split](#4.4.2)
  * [4.4.4 Examine Class Split for Train/Test Data](#4.4.4)
  

4.5 [Setting Up Pipelines](#4.5)
  * 4.5.1 [Previous Best Model: Logistic Regression with Count Vectorization](#4.5.1)
<br/><br/>
    * [4.5.1.1 Training and Fitting the Model](#4.5.1.1)
    * [4.5.1.2 Evaluating the Model](#4.5.1.2)
<br/><br/>
 

## 4.1 Introduction <a name="4.1"></a>

### 4.1.1 Problem Recap <a name="4.1.1"><a/>

Using customer text data about amazon products, we will build, evaluate and compare models to estimate the probability that a given text review can be classified as “positive” or “negative”.

Our goal is to build a text classifier using Amazon product review data which can be used to analyze customer sentiment which does not have accompanying numeric data. The metric we will be primarily interested in will be Recall on the positive class. This is the proportion of the positive class (negative reviews coded as "1" in the data) we correctly predict.

### 4.1.2 Notebook Goals <a name="4.1.2"></a>

1. In our previous notebook our best results came from Term-Frequency Inverse-Document Frequency vectorization and a Logistic Regression Model.

2. We had slightly worse results from a Naive Bayes and Random Forest model. The Naive Bayes model incorrectly predicted a higher proportion of the negative class and the Random Forest model appeared to strongly overfit the training data with a very poor Recall on the test set.

3. Try over-sampling the minority class that we are trying to predict (encoded as "1"s) and/or under-sampling the majority class.

4. Test some other models such as gradient boosted trees (LightGBM/XGBoost) 

5. Examine how well our models will generalize with K-fold cross validation.

6. Tune hyper-parameters with grid-search or bayesian search optimization.

## 4.2 Load the data <a name="4.2"><a/>

### 4.2.1 Imports <a name="4.2.1"><a/>

In [3]:
from random import seed

#reading/processing data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pyarrow.parquet as pq


#splitting the dataset
from sklearn.model_selection import train_test_split
import imblearn as im

#scaling/vectorization
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import word2vec, FastText

# models
from sklearn.linear_model import LogisticRegression
import xgboost as xgb
from imblearn.pipeline import Pipeline


#metrics
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
from sklearn.metrics import confusion_matrix, RocCurveDisplay, recall_score
from sklearn.model_selection import cross_validate

#dealing with class imbalance
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

#hyperparameter tuning
import optuna


  from pandas import MultiIndex, Int64Index


### 4.2.2 Load the data <a name="4.2.2"><a/>

In [4]:
data = pq.read_table("../data/edited/fashion.parquet")
fashion = data.to_pandas()

In [14]:
def objective(trial):

    # select vectorization parameters
    

    tfidf = TfidfVectorizer(ngram_range=(1,2), min_df = 5, max_df=0.95)

    X_train, X_test, y_train, y_test = train_test_split(fashion["review"].values, fashion["neg_sentiment"], test_size = .1)

    y_train, y_test = np.ravel(y_train), np.ravel(y_test)
    y_train, y_test = y_train.astype(int), y_test.astype(int)

    #sampler
    sampler_type = trial.suggest_categorical('sampler', ['ros', 'rus', None, 'smote'])

    if sampler_type == 'ros':
        sampler = RandomOverSampler(random_state=0)
    
    elif sampler_type == 'smote':
        k_neighbors = trial.suggest_int('k_neighbors', 2,5)
        sampler = SMOTE(random_state=0, k_neighbors=k_neighbors)
    
    elif sampler_type == 'rus':
        sampler = RandomUnderSampler(random_state=0)
    else:
        sampler = None

    model_type = trial.suggest_categorical('classifier', ['XGBClassifier']) #'LGBMClassifier', 'LogisticRegression'

    if model_type == 'LogisticRegression':

        C = trial.suggest_categorical('C', [3, 1.0, 0.1, 0.01]) #note: models with larger values for C failed to converge
        model = LogisticRegression(solver = "lbfgs", n_jobs=-1, max_iter=1000, C=C)

    elif model_type == 'XGBClassifier':
        
        learning_rate = trial.suggest_float('learning_rate', 0.0001, 0.1)
        max_depth = trial.suggest_int('max_depth', 3, 10)
        n_estimators = trial.suggest_int('n_estimators', 2,10)

        model = xgb.XGBClassifier(n_estimators=n_estimators, max_depth=max_depth, learning_rate=learning_rate, n_jobs=-1, random_state=0, verbosity=0, use_label_encoder=False)

    else:
              
        model = LightGBM
    
    pipeline = Pipeline([('tfidf', tfidf), ('sampler', sampler), ('model',model)])
    
    pipeline.fit(X_train, y_train)
    
    print("Fit worked")

    y_preds = pipeline.predict(X_test)

    return recall_score(y_preds, y_test)

In [15]:
study = optuna.create_study(direction='maximize')

study.optimize(objective, n_trials=10)

[32m[I 2022-07-07 13:59:26,764][0m A new study created in memory with name: no-name-d8130abc-6609-4a85-b0a3-c3793ea4018e[0m


Fit worked


[32m[I 2022-07-07 13:59:44,070][0m Trial 0 finished with value: 0.6794405948960622 and parameters: {'sampler': 'ros', 'classifier': 'XGBClassifier', 'learning_rate': 0.0968750508029098, 'max_depth': 6, 'n_estimators': 8}. Best is trial 0 with value: 0.6794405948960622.[0m


Fit worked


[32m[I 2022-07-07 14:16:26,312][0m Trial 1 finished with value: 0.6441725373872204 and parameters: {'sampler': 'smote', 'k_neighbors': 4, 'classifier': 'XGBClassifier', 'learning_rate': 0.055186723334697305, 'max_depth': 4, 'n_estimators': 7}. Best is trial 0 with value: 0.6794405948960622.[0m


Fit worked


[32m[I 2022-07-07 14:16:44,221][0m Trial 2 finished with value: 0.6814635800898838 and parameters: {'sampler': 'ros', 'classifier': 'XGBClassifier', 'learning_rate': 0.04403205019923962, 'max_depth': 7, 'n_estimators': 8}. Best is trial 2 with value: 0.6814635800898838.[0m


Fit worked


[32m[I 2022-07-07 14:16:59,692][0m Trial 3 finished with value: 0.7415875754961173 and parameters: {'sampler': None, 'classifier': 'XGBClassifier', 'learning_rate': 0.06040277859460945, 'max_depth': 9, 'n_estimators': 4}. Best is trial 3 with value: 0.7415875754961173.[0m


Fit worked


[32m[I 2022-07-07 14:17:19,040][0m Trial 4 finished with value: 0.7608858297171929 and parameters: {'sampler': None, 'classifier': 'XGBClassifier', 'learning_rate': 0.08180149542123777, 'max_depth': 10, 'n_estimators': 9}. Best is trial 4 with value: 0.7608858297171929.[0m


Fit worked


[32m[I 2022-07-07 14:17:34,812][0m Trial 5 finished with value: 0.7103896711418815 and parameters: {'sampler': None, 'classifier': 'XGBClassifier', 'learning_rate': 0.06390275060274116, 'max_depth': 6, 'n_estimators': 7}. Best is trial 4 with value: 0.7608858297171929.[0m


Fit worked


[32m[I 2022-07-07 14:17:49,434][0m Trial 6 finished with value: 0.654938610958402 and parameters: {'sampler': 'ros', 'classifier': 'XGBClassifier', 'learning_rate': 0.051699603709427805, 'max_depth': 4, 'n_estimators': 4}. Best is trial 4 with value: 0.7608858297171929.[0m


Fit worked


[32m[I 2022-07-07 14:18:06,227][0m Trial 7 finished with value: 0.7006222441999423 and parameters: {'sampler': 'rus', 'classifier': 'XGBClassifier', 'learning_rate': 0.04545761035075771, 'max_depth': 10, 'n_estimators': 8}. Best is trial 4 with value: 0.7608858297171929.[0m


Fit worked


[32m[I 2022-07-07 14:18:22,910][0m Trial 8 finished with value: 0.6650706998389118 and parameters: {'sampler': 'ros', 'classifier': 'XGBClassifier', 'learning_rate': 0.03094175344296864, 'max_depth': 4, 'n_estimators': 9}. Best is trial 4 with value: 0.7608858297171929.[0m
