#### Tasks
1. Create the most accurate classifier (evaluation metric: accuracy rate) for the data, as measured by the
test data.
2. Write a 8-12 page slides summarizing your approach to:  
    (a) cleaning and preparing the data for modeling - Assumption: Missing dates implying no delivery  
    (b) formulating the model design matrix - Definition of features  
    (c) building the model and tuning parameters - Different models tested and describe the tuning process  
    (d) validating the model by training & validation sets, or other approaches - 5-fold Cross-Validation   
    (e) comparing results from all attempts  
    (f) findings from the data and challenges from this contest.  

### **Imports**
---

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import cross_val_score, train_test_split, cross_validate
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import OrdinalEncoder, StandardScaler, LabelEncoder, MinMaxScaler, Normalizer
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import plot_tree

from src.datasets import get_training
from src.prep import DataPrep
from src.utils import model_evaluation_cv, make_predictions

### **Get and Prepare Training Data**
---

*Read training X and y frames*

In [2]:
X, y = get_training()
X_prepped = DataPrep().run(X)

*Filter training data, removing customers with age > 100*

In [3]:
# Get relevent Indexes to keep
keep_idx = X_prepped[(X_prepped['customer_age_at_order'] < 100) | (X_prepped['customer_age_at_order'].isna())].index

# Filter our data in X and y
X_prepped = X_prepped.loc[keep_idx]
y = X_prepped.join(y)['return']

# Reset index from 0-n
X_prepped.reset_index(drop=True, inplace=True)
y.reset_index(drop=True, inplace=True)

### **Preprocessing**
---

*Preprocessing Steps*
1. Encode the categorical variables
    - OneHotEncoder for non-tree models
    - OrdinalEncoder for tree-based models
2. Impute missing values on numeric fields
    - SimpleImputer [mean, median, most_frequent]
3. Scale numerical values
4. [optional] normalize features

*Preprocessing Notes*
- For tree-based models, do not one-hot encode, instead use ordinal encoding. Tree-based models can basically learn the same information from an ordinal encoded feature as from a one-hot encoded feature, even if the features themselves are unordered.
- Cross-Validation on the entire pipeline
    - Data is split and then applies the pipeline steps (good) instead of preprocessing the data and then do cross-validation on just the model (bad - Data Leakage)
    - Preprocessing before splitting the data does not properly simulate reality
    - Splitting and then preprocessing does simulate reality, which is the entire purpose of cross-validation
    

*Create Preprocessors*

In [7]:
# Create lists of numerical and categorical columns in X data
numeric_cols = X_prepped.select_dtypes(include=np.number).columns
categorical_cols = X_prepped.select_dtypes(exclude=np.number).columns

# Create a preprocessor for tree-based models
treebased_preprocessor = ColumnTransformer([
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1))
        ]),categorical_cols),
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', MinMaxScaler())
        ]), numeric_cols)
    ])

# Create a generic preprocessor
generic_preprocessor = ColumnTransformer([
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(handle_unknown='ignore'))
        ]),categorical_cols),
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', MinMaxScaler())
        ]), numeric_cols)
    ])

### **Model Development and Testing**
---

*Models to Test*
- Naive Bayes
- Logistic Regression
- K-Nearest Neighbors
- SVC
- Decision Tree
- Bagging Decision Tree
- Boosted Decision Tree
- Random Forest Classifier
- Voting Classifier

#### **Model 1: Decision Tree Classifier**

In [5]:
from sklearn.tree import DecisionTreeClassifier

*Performance on Base Classifier*

In [5]:
# Initialize Base Decision Tree Classifier
clf = DecisionTreeClassifier()

# Initialize a Base ML Pipeline
pipeline = Pipeline([
    ('preprocessor', treebased_preprocessor),
    ('model', clf)])

# Evaluate performance on base model
model_evaluation_cv(pipeline, X_prepped, y, cv=5, scoring='accuracy', return_train_score=True)

fit_time [1.48026037 1.54868555 1.3789115  1.31208324 1.30005002]
score_time [0.11382413 0.09935117 0.0985651  0.09071136 0.10803843]
test_score [0.65700209 0.66681046 0.65784216 0.65638907 0.67066276]
train_score [0.97544487 0.97531999 0.97638144 0.97743153 0.97623399]
-----
Average cross-validation test score: 0.6617413071599373
Average cross-validation train score: 0.9761623655637937


*Hyperparameter Tuning*

In [7]:
# Create base ML Pipeline using Decision Tree
clf = DecisionTreeClassifier()
pipeline = Pipeline([
    ('preprocessor', treebased_preprocessor),
    ('model', clf)])

# Create the parameter grid for hyper-parameter tuning
param_grid = {
    'model__criterion': ['gini', 'entropy'],
    'model__max_depth': [2, 4, 6, 8, 10],
    'model__min_samples_split': [2, 4, 6, 8, 10],
    'model__splitter': ['best', 'random']
}

# Test 10 random hyperparameter combinations with 5-fold CV
search = RandomizedSearchCV(
    pipeline, param_grid, n_iter=20, cv=5, scoring='accuracy',
    verbose=1, random_state=42, n_jobs=-1)

search.fit(X_prepped, y)

# Evaluate results of the search
search_df = pd.DataFrame(search.cv_results_)
print(f'Best Accuracy: {search.best_score_}')
search.best_params_

Fitting 5 folds for each of 20 candidates, totalling 100 fits
Best Accuracy: 0.7669682858975142


{'model__splitter': 'random',
 'model__min_samples_split': 4,
 'model__max_depth': 8,
 'model__criterion': 'gini'}

*Evaluate the Tuned Decision Tree Model*

In [9]:
# Initilize Classifier with the best parameters
clf = DecisionTreeClassifier(
    min_samples_split=4,
    max_depth=8,
    splitter='random',
    criterion='gini'
)

# Create a ML Pipeline Instance with the Tuned Classifier
pipeline = Pipeline([
    ('preprocessor', treebased_preprocessor),
    ('model', clf)])

model_evaluation_cv(pipeline, X_prepped, y, cv=5, scoring='accuracy', return_train_score=True)

fit_time [0.59907317 0.61742568 0.58986616 0.58256817 0.56883979]
score_time [0.06925344 0.10107183 0.08158183 0.08467603 0.08393431]
test_score [0.76139769 0.77940242 0.76507583 0.75685678 0.75689667]
train_score [0.769553   0.76614162 0.7709834  0.77363417 0.77343111]
-----
Average cross-validation test score: 0.7639258781975714
Average cross-validation train score: 0.7707486607848851


#### **Model 2: SGDClassifier**

In [12]:
from sklearn.linear_model import SGDClassifier

*Step 1: Evaluate Performance on Base Classifier*

In [9]:
# Initialize Base Classifier
clf = SGDClassifier()

# Initialize ML Pipeline
pipeline = Pipeline([
    ('preprocessor', generic_preprocessor),
    ('model', clf)])

# Evaluate Base Model Performance
model_evaluation_cv(pipeline, X_prepped, y, cv=5, scoring='accuracy', return_train_score=True)

fit_time [1.06982088 1.0507679  1.02597451 1.03102064 1.0332787 ]
score_time [0.10078168 0.10587502 0.10735512 0.09990263 0.10965347]
test_score [0.76632458 0.76444011 0.76875397 0.75635728 0.76100629]
train_score [0.76243508 0.76275294 0.76167447 0.76612459 0.76496231]
-----
Average cross-validation test score: 0.7633764463095509
Average cross-validation train score: 0.7635898794240836


*Step 2: Evaluate Hyperparameter-tuning*

In [26]:
# Create base ML Pipeline using SGDClassifier
clf = SGDClassifier()
pipeline = Pipeline([
    ('preprocessor', generic_preprocessor),
    ('model', clf)])

# Create the parameter grid for hyper-parameter tuning
param_grid = {
    'model__loss': ['hinge', 'log', 'modified_huber'],
    'model__penalty': ['l2', 'l1', 'elasticnet'],
    'model__alpha': [0.0001, 0.001, 0.01, 0.1, 1],
    'model__max_iter': [1000, 5000, 10000]}

# Test 10 random hyperparameter combinations with 5-fold CV
search = RandomizedSearchCV(pipeline, param_grid, n_iter=10, cv=5, scoring='accuracy', return_train_score=False, random_state=42, n_jobs=-1)
search.fit(X_prepped, y)

# Evaluate results of the search
search_df = pd.DataFrame(search.cv_results_)
print(f'Best Accuracy: {search.best_score_}')
search.best_params_

Best Accuracy: 0.7584540882000183


{'model__penalty': 'l1',
 'model__max_iter': 1000,
 'model__loss': 'modified_huber',
 'model__alpha': 0.0001}

In [25]:
# Initialize Tuned Classifier
clf = SGDClassifier(
    penalty='l1',
    max_iter=1000,
    loss='modified_huber',
    alpha=0.0001
)

# Initialize ML Pipeline with Tuned Classifier
pipeline = Pipeline([
    ('preprocessor', generic_preprocessor),
    ('model', clf)])

# Evaluate Base Model Performance
model_evaluation_cv(pipeline, X_prepped, y, cv=5, scoring='accuracy', return_train_score=True)

fit_time [1.30916095 1.39068055 1.20214367 1.1642735  1.30064988]
score_time [0.09161973 0.09718871 0.09606147 0.09177518 0.09867525]
test_score [0.77027518 0.76419036 0.74477795 0.75692489 0.75823627]
train_score [0.7645977  0.7668852  0.76970058 0.77236838 0.77172827]
-----
Average cross-validation test score: 0.7588809292356239
Average cross-validation train score: 0.769056027554112


### Model 3: Random Forest Classifier

Create Model Pipeline

In [15]:
model = model=RandomForestClassifier()

random_forest_pipeline = Pipeline([
    ('preprocessor', treebased_preprocessor),
    ('model', model)])

model_evaluation(random_forest_pipeline, X.head(25000), y.head(25000), cv=5, scoring='accuracy', return_train_score=True)

fit_time [3.36990643 2.58722329 2.31067395 2.3189652  2.31309247]
score_time [0.15515924 0.11544895 0.11325622 0.11164474 0.11375284]
test_score [0.759  0.7584 0.7496 0.7706 0.7528]
train_score [0.98315 0.9837  0.98365 0.9839  0.9836 ]
-----
Average cross-validation test score: 0.75808
Average cross-validation train score: 0.9836


Hyperparameter Tuning

In [20]:
model = model=RandomForestClassifier()

random_forest_pipeline = Pipeline([
    ('preprocessor', treebased_preprocessor),
    ('model', model)])

random_forest_param_grid = {
    'model__bootstrap': [True, False],
    'model__max_depth': [5, 7, 10, 12, 15],
    'model__max_features': [None, 'sqrt', 'log2'],
    'model__min_samples_leaf': [1, 2, 3],
    'model__min_samples_split': [2, 4, 6, 8, 10, 12],
    'model__n_estimators': [100, 500, 1000]
}

rf_random = RandomizedSearchCV(
    estimator=random_forest_pipeline,
    param_distributions=random_forest_param_grid,
    n_iter=10,
    cv=5,
    random_state=42,
    n_jobs=-1).fit(X.head(25000), y.head(25000))

print(rf_random.best_score_)
rf_random.best_params_

0.77132


{'model__n_estimators': 500,
 'model__min_samples_split': 12,
 'model__min_samples_leaf': 2,
 'model__max_features': 'log2',
 'model__max_depth': 10,
 'model__bootstrap': True}

In [19]:
tuned_model = RandomForestClassifier(
    bootstrap=False,
    max_depth=10,
    max_features='sqrt',
    min_samples_leaf=2,
    min_samples_split=6,
    n_estimators=500
)

tuned_pipeline = Pipeline([
    ('preprocessor', treebased_preprocessor),
    ('model', tuned_model)])

# Model Evaluation
model_evaluation(tuned_pipeline, X.head(25000), y.head(25000), cv=5, scoring='accuracy', return_train_score=True)

fit_time [6.54380774 6.54193783 6.28002405 6.67028141 6.45264959]
score_time [0.29910493 0.2912569  0.28822875 0.30168414 0.29417658]
test_score [0.7644 0.7798 0.768  0.7846 0.7632]
train_score [0.7903  0.7912  0.79395 0.7886  0.7916 ]
-----
Average cross-validation test score: 0.772
Average cross-validation train score: 0.7911300000000001


### Model 4: Support Vector Classifier

In [6]:
from sklearn.svm import SVC

In [12]:
sampled_df = X_prepped.sample(25000).join(y)
X_prepped_sample = sampled_df.drop(columns=['return'])
y_sample = sampled_df['return']

*Step 1: Evaluate Performance on Base Classifier*

In [7]:
# Initialize Base SVC Model
clf = SVC()

# Initialize base ML Pipeline
pipeline = Pipeline([
    ('preprocessor', generic_preprocessor),
    ('model', clf)])

# Evaluate Base Model Performance
cross_validate(estimator=pipeline, X=X_prepped.head(25000), y=y.head(25000), cv=5, scoring='accuracy', return_train_score=False)

{'fit_time': array([23.12288785, 25.16459703, 22.36177158, 22.81326747, 21.2776823 ]),
 'score_time': array([3.48576641, 2.9429872 , 2.89863014, 2.67603254, 2.62749577]),
 'test_score': array([0.7772, 0.7844, 0.7684, 0.7874, 0.7604])}

*Step 2: Evaluate hyperparameter tuning*

In [15]:
# Initialize a base pipelipe
clf = SVC()
pipeline = Pipeline([
    ('preprocessor', generic_preprocessor),
    ('model', clf)])

# Create Parameter Grid for RandomizedSearch
param_grid = {
    'model__C': [0.1, 1, 10],
    'model__kernel': ['linear', 'rbf', 'poly'],
    'model__gamma': ['scale', 'auto', 0.1, 0.01]
}

# Runing Hyperparameter Tuning Search Procedure
search = RandomizedSearchCV(
    estimator=pipeline,
    param_distributions=param_grid,
    n_iter=10,
    cv=5,
    random_state=42,
    n_jobs=-1).fit(X_prepped_sample, y_sample)

# Evaluate best model parameters
print(search.best_score_)
search.best_params_

0.77056


{'model__kernel': 'rbf', 'model__gamma': 'scale', 'model__C': 1}

*Step 3: Evaluate performace of tuned model*

In [None]:
# Initialize Tuned SVC Model
clf = SVC(
    kernel='rbf',
    gammer='scale',
    C=1
)

# Initialize Tuned ML Pipeline
pipeline = Pipeline([
    ('preprocessor', generic_preprocessor),
    ('model', clf)])

# Evaluate Tuned Model Performance
cross_validate(estimator=pipeline, X=X_prepped_sample, y=y_sample, cv=5, scoring='accuracy', return_train_score=False)

### Model 5: Bagging Classifier

In [5]:
sampled_df = X_prepped.sample(25000).join(y)
X_prepped_sample = sampled_df.drop(columns=['return'])
y_sample = sampled_df['return']

In [8]:
from sklearn.ensemble import BaggingClassifier

# Initialize a decision tree classifier
tree = DecisionTreeClassifier(
    min_samples_split=8,
    max_depth=6
)

# Initialize a bagging classifier with 10 decision trees
bagging = BaggingClassifier(tree, n_estimators=500)

bagging_pipeline = Pipeline([
    ('preprocessor', treebased_preprocessor),
    ('model', bagging)])

model_evaluation_cv(bagging_pipeline, X_prepped_sample, y_sample, cv=5, scoring='accuracy', return_train_score=True)

fit_time [10.23965383 10.5209434  10.18956566 10.10906339 10.14805746]
score_time [0.32194686 0.36301517 0.29652238 0.29341698 0.31455374]
test_score [0.772  0.7706 0.7562 0.7728 0.773 ]
train_score [0.76915 0.7692  0.77275 0.7692  0.7693 ]
-----
Average cross-validation test score: 0.76892
Average cross-validation train score: 0.7699199999999999


---

## Fit Model and Predict Test Set        

In [8]:
numeric_cols = X_prepped.select_dtypes(include=np.number).columns
categorical_cols = X_prepped.select_dtypes(exclude=np.number).columns

# Create a generic preprocessor
generic_preprocessor = ColumnTransformer([
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(handle_unknown='ignore'))
        ]),categorical_cols),
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', MinMaxScaler())
        ]), numeric_cols)
    ])

# Initialize Tuned Classifier
clf = SVC(
    kernel='rbf',
    gamma= 'scale',
    C= 1)


# Initialize ML Pipeline with Tuned Classifier
pipeline = Pipeline([
    ('preprocessor', generic_preprocessor),
    ('model', clf)])

pipeline.fit(X_prepped.head(25000), y.head(25000))

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('cat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  Index(['color', 'salutation', 'state', 'order_month',
       'customer_return_behavior', 'item_return_behavior',
       'manufacturer_return_behavior'],
      dtype='object')),
                                                 ('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer()),
                                                                  ('scaler',
           

In [11]:
make_predictions(pipeline, 'submission3_svc_sampled.csv')

---

### Train test split eval

In [72]:
X_train, X_val, y_train, y_val = train_test_split(X_train_prepped, y, test_size=.25, random_state=42)

clf = DecisionTreeClassifier(
    criterion='entropy',
    min_samples_split=2,
    max_depth=6
)

pipeline = Pipeline([
    ('preprocessor', treebased_preprocessor),
    ('model', tuned_clf)])

pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_val)

In [74]:
from sklearn.metrics import accuracy_score

accuracy_score(y_val, y_pred)

0.7658924205378973