#### Tasks
1. Create the most accurate classifier for the data, evaluated by accuracy on the test data.
2. Prepare an 8-12 slide deck summarizing your approach to:  
    (a) cleaning and preparing the data for modeling - Assumption: a missing delivery date implies the item was not delivered  
    (b) formulating the model design matrix - Definition of features  
    (c) building the model and tuning parameters - Models tested and a description of the tuning process  
    (d) validating the model with training and validation sets, or other approaches - 5-fold cross-validation  
    (e) comparing results from all attempts  
    (f) findings from the data and challenges from this contest.  

### **Imports**
---

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import (
    GridSearchCV, RandomizedSearchCV, cross_val_score, cross_validate, train_test_split
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import (
    LabelEncoder, MinMaxScaler, Normalizer, OneHotEncoder, OrdinalEncoder, StandardScaler
)
from sklearn.tree import DecisionTreeClassifier, plot_tree

from src.datasets import get_training
from src.prep import DataPrep
from src.utils import make_predictions, evaluate_model

### **Get and Prepare Training Data**
---

*Read training X and y frames*

In [2]:
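X, y = get_training()
xy = X.join(y)
# Drop rows whose deliveryDate falls in 1990; unparseable or missing dates become NaT and are kept
xy = xy[pd.to_datetime(xy['deliveryDate'], errors='coerce').dt.year != 1990]

# Split the filtered frame back into features and target
X = xy.drop(columns='return')
y = xy['return']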

In [16]:
# Reload the raw frames (this replaces the filtered X and y from the previous cell)
X, y = get_training()

# Training set: orders placed before 2022-02-01
train_set = X[pd.to_datetime(X['orderDate']) < pd.to_datetime('2022-02-01')].join(y)
X_train = train_set.drop(columns='return')
y_train = train_set['return']

# Validation set: orders placed on or after 2022-02-01
validation_set = X[pd.to_datetime(X['orderDate']) >= pd.to_datetime('2022-02-01')].join(y)
X_val = validation_set.drop(columns='return')
y_val = validation_set['return']

# Prepped and cleaned sets (full data, training split, validation split)
X_prep = DataPrep().run(X)
X_train_prep = DataPrep().run(X_train)
X_val_prep = DataPrep().run(X_val)

In [17]:
X_train_prep.drop(columns=['size'], inplace=True)
X_val_prep.drop(columns=['size'], inplace=True)

In [18]:
X_train_prep

| | color | price | salutation | state | customer_age_at_order | account_age_months | order_month | is_delivered | customer_return_rate | customer_order_count | item_return_rate | manufacturer_return_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | magenta | 79.90 | Mrs | Hesse | NaN | 0.0 | 9 | 1 | 0.0000 | 4 | 0.4324 | 0.5438 |
| 1 | blue | 89.90 | Mrs | Berlin | 61.0 | 18.0 | 9 | 1 | 0.4545 | 22 | 0.7931 | 0.5183 |
| 2 | grey | 39.90 | Mrs | Lower Saxony | 42.0 | 0.0 | 9 | 1 | 0.4545 | 11 | 0.5000 | 0.5516 |
| 3 | brown | 139.90 | Mrs | North Rhine-Westphalia | 40.0 | 18.0 | 9 | 1 | 0.6667 | 15 | 0.4000 | 0.5438 |
| 4 | brown | 54.95 | Mr | Lower Saxony | 53.0 | 0.0 | 9 | 1 | 0.0000 | 2 | 0.1538 | 0.2957 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 183903 | black | 59.95 | Mr | Hamburg | 26.0 | 0.0 | 1 | 0 | 0.0000 | 6 | 0.0000 | 0.3973 |
| 183904 | black | 69.95 | Mr | Hamburg | 26.0 | 0.0 | 1 | 0 | 0.0000 | 6 | 0.0000 | 0.3973 |
| 183905 | black | 59.90 | Mrs | Bavaria | 49.0 | 1.0 | 1 | 1 | 0.3333 | 3 | 0.5714 | 0.4507 |
| 183906 | black | 59.90 | Mrs | Rhineland-Palatinate | NaN | 0.0 | 1 | 1 | 0.4000 | 5 | 0.5714 | 0.4507 |


### **Preprocessing**
---

*Preprocessing Steps*
1. Encode the categorical variables
    - OneHotEncoder for non-tree models
    - OrdinalEncoder for tree-based models
2. Impute missing values on numeric fields
    - SimpleImputer [mean, median, most_frequent]
3. Scale numerical values
4. [optional] normalize features
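
The first two steps can be illustrated on a toy frame. This is a minimal sketch: the `toy` data and its values are made up for illustration and are not taken from the contest set.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data, not taken from the contest set
toy = pd.DataFrame({
    "color": ["blue", "black", "blue", "grey"],
    "price": [89.9, np.nan, 39.9, 59.9],
})

# Ordinal encoding: a single integer column, categories sorted alphabetically
ordinal = OrdinalEncoder().fit_transform(toy[["color"]])

# One-hot encoding: one 0/1 column per category (sparse by default, so densify)
onehot = OneHotEncoder(handle_unknown="ignore").fit_transform(toy[["color"]]).toarray()

# Mean imputation replaces the missing price with the mean of the observed prices
price = SimpleImputer(strategy="mean").fit_transform(toy[["price"]])

print(ordinal.ravel())
print(onehot.shape)
print(price.ravel())
```

The ordinal version keeps the feature count unchanged, while the one-hot version adds one column per distinct category, which matters for high-cardinality fields such as `state`.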

*Preprocessing Notes*
- For tree-based models, use ordinal encoding instead of one-hot encoding. A tree can split repeatedly on the integer codes, so it can recover largely the same information from an ordinal-encoded feature as from a one-hot-encoded one, even when the categories have no natural order.
- Cross-validate the entire pipeline
    - Splitting the data first and then applying the pipeline steps within each fold is the right order. Preprocessing the full data set and then cross-validating only the model leaks information from the validation folds into training (data leakage).
    - Preprocessing before splitting the data does not properly simulate reality.
    - Splitting and then preprocessing does simulate reality, which is the entire purpose of cross-validation.
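
As a rough sketch of the leakage-free ordering described above, the whole pipeline can be passed to `cross_val_score`. The synthetic data, the `demo_pipeline` name, and the choice of `StandardScaler` are assumptions made only for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared training data (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", DecisionTreeClassifier(max_depth=4, random_state=0)),
])

# The whole pipeline is refit inside each fold, so the scaler only ever
# sees that fold's training rows
scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5, scoring="accuracy")
print(scores.mean())
```

Fitting the scaler on all rows before calling `cross_val_score` would instead let statistics from the held-out fold influence training.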
    

*Create Preprocessors*

In [65]:
# Create lists of numerical and categorical columns in X data
numeric_cols = X_train_prep.select_dtypes(include=np.number).columns
categorical_cols = X_train_prep.select_dtypes(exclude=np.number).columns

# Create a preprocessor for tree-based models
treebased_preprocessor = ColumnTransformer([
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1))
        ]), categorical_cols),
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        # Note: Normalizer rescales each row (sample) by its largest absolute value, not each column
        ('normalize', Normalizer(norm='max'))
        ]), numeric_cols)
    ])

# Create a generic preprocessor
generic_preprocessor = ColumnTransformer([
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(handle_unknown='ignore'))
        ]), categorical_cols),
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        ('normalize', Normalizer(norm='max'))
        ]), numeric_cols)
    ])
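
One detail of the preprocessors above is worth seeing on a small example: `Normalizer` works row by row, whereas a scaler such as `MinMaxScaler` works column by column. The sketch below uses two made-up rows; whether a column-wise scaler would change the scores reported later is not tested here.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer

# Hypothetical rows: [price, customer_order_count]
demo = np.array([[80.0, 4.0],
                 [40.0, 20.0]])

# Normalizer(norm="max") divides every row by that row's largest absolute value
row_scaled = Normalizer(norm="max").fit_transform(demo)

# MinMaxScaler maps every column to the range [0, 1]
col_scaled = MinMaxScaler().fit_transform(demo)

print(row_scaled)
print(col_scaled)
```

With row-wise normalization, a feature's transformed value depends on the other features in the same row, so a large `price` shrinks every other numeric field for that order.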

### **Model Development and Testing**
---

*Models to Test*
- Naive Bayes
- Logistic Regression
- K-Nearest Neighbors
- SVC
- Decision Tree
- Bagging Decision Tree
- Boosted Decision Tree
- Random Forest Classifier
- Voting Classifier

#### **Model 1: Decision Tree Classifier**

In [77]:
from sklearn.tree import DecisionTreeClassifier

# Initialize the classifier with the best parameters found during tuning
clf = DecisionTreeClassifier(
    min_samples_split=4,
    max_depth=11,
    splitter='best',
    criterion='gini'
)

# Create a ML Pipeline Instance with the Tuned Classifier
pipeline = Pipeline([
    ('preprocessor', treebased_preprocessor),
    ('model', clf)])

evaluate_model(pipeline, X_train_prep, X_val_prep, y_train, y_val)

Training Score: 0.7871109467777367
Validation Score: 0.7732338153734515
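
`evaluate_model` comes from `src.utils`, which is not shown in this notebook. Judging only from the printed output, it may behave roughly like the function below. The name `evaluate_model_sketch` and its body are assumptions, not the author's implementation, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def evaluate_model_sketch(model, X_train, X_val, y_train, y_val):
    """Hypothetical stand-in for src.utils.evaluate_model (assumed behavior)."""
    model.fit(X_train, y_train)
    train_score = model.score(X_train, y_train)  # mean accuracy for classifiers
    val_score = model.score(X_val, y_val)
    print(f"Training Score: {train_score}")
    print(f"Validation Score: {val_score}")
    return train_score, val_score


# Synthetic data only, to show the call shape
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
train_acc, val_acc = evaluate_model_sketch(
    DecisionTreeClassifier(max_depth=3, random_state=0), X_tr, X_va, y_tr, y_va
)
```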


#### **Model 2: XGBoost**

In [66]:
from xgboost import XGBClassifier

clf = XGBClassifier(n_estimators=220, objective='binary:logistic', tree_method='hist', learning_rate=0.10, max_depth=4)
pipeline = Pipeline([
    ('preprocessor', generic_preprocessor),
    ('model', clf)])

evaluate_model(pipeline, X_train_prep, X_val_prep, y_train, y_val)

Training Score: 0.7850555712638928
Validation Score: 0.783133779090862


#### **Model 3: Gradient Boosting**

In [64]:
from sklearn.ensemble import GradientBoostingClassifier

# Initialize Base Classifier
clf = GradientBoostingClassifier(
    min_samples_split=10,
    max_depth=8,
    n_estimators=25,
    max_features=9
)

# Initialize ML Pipeline
pipeline = Pipeline([
    ('preprocessor', treebased_preprocessor),
    ('model', clf)])

evaluate_model(pipeline, X_train_prep, X_val_prep, y_train, y_val)

Training Score: 0.7875405093851273
Validation Score: 0.782408127300057


#### **Model 4: SGDClassifier**

In [62]:
from sklearn.linear_model import SGDClassifier

# Initialize Tuned Classifier
clf = SGDClassifier(
    penalty='l1',
    max_iter=1000,
    loss='modified_huber',
    alpha=0.0001
)

# Initialize ML Pipeline with Tuned Classifier
pipeline = Pipeline([
    ('preprocessor', generic_preprocessor),
    ('model', clf)])

evaluate_model(pipeline, X_train_prep, X_val_prep, y_train, y_val)

Training Score: 0.7439698109924527
Validation Score: 0.7448556471259006
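
The search that produced the tuned parameters is not included in this notebook. A grid search over a pipeline could look roughly like this sketch; the grid values, the `search_pipeline` name, and the synthetic data are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared training data
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)

search_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", SGDClassifier(max_iter=1000, random_state=0)),
])

# Hypothetical grid; the "model__" prefix routes each parameter to the model step
param_grid = {
    "model__penalty": ["l1", "l2"],
    "model__loss": ["hinge", "modified_huber"],
    "model__alpha": [0.0001, 0.001],
}

# Each candidate is scored with 5-fold cross-validation on the whole pipeline
search = GridSearchCV(search_pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)
print(search.best_params_)
```

Searching over the pipeline rather than the bare classifier keeps the preprocessing inside each cross-validation fold.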


#### **Model 5: Multi-Layer Perceptron Classifier**

In [78]:
from sklearn.neural_network import MLPClassifier

clf = MLPClassifier()
pipeline = Pipeline([
    ('preprocessor', generic_preprocessor),
    ('model', clf)])

# Fit base pipeline to training data
pipeline.fit(X_train_prep, y_train)

evaluate_model(pipeline, X_train_prep, X_val_prep, y_train, y_val)



Training Score: 0.7881386345346586
Validation Score: 0.7710568600010367


#### **Model 6: Random Forest Classifier**

In [79]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    bootstrap=True,
    max_depth=10,
    max_features='log2',
    min_samples_leaf=2,
    min_samples_split=12,
    n_estimators=500
)

pipeline = Pipeline([
    ('preprocessor', treebased_preprocessor),
    ('model', clf)])

evaluate_model(pipeline, X_train_prep, X_val_prep, y_train, y_val)

Training Score: 0.781581007895252
Validation Score: 0.7723267506349453
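
To compare the six attempts side by side, the scores printed above can be collected into one frame. This is a small convenience sketch: the numbers are copied from the cell outputs and rounded to four decimals, and the `gap` column is an added, rough overfitting indicator.

```python
import pandas as pd

# Scores copied from the cell outputs above, rounded to four decimals
results = pd.DataFrame(
    [
        ("Decision Tree", 0.7871, 0.7732),
        ("XGBoost", 0.7851, 0.7831),
        ("Gradient Boosting", 0.7875, 0.7824),
        ("SGDClassifier", 0.7440, 0.7449),
        ("MLPClassifier", 0.7881, 0.7711),
        ("Random Forest", 0.7816, 0.7723),
    ],
    columns=["model", "train_accuracy", "val_accuracy"],
)

# Gap between training and validation accuracy as a rough overfitting signal
results["gap"] = results["train_accuracy"] - results["val_accuracy"]
results = results.sort_values("val_accuracy", ascending=False).reset_index(drop=True)
print(results)
```

Sorted this way, XGBoost has the highest validation accuracy and SGDClassifier the lowest.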


## Fit Model and Predict Test Set        

---

### Fit model and make predictions

In [11]:
from xgboost import XGBClassifier

# Same XGBoost configuration as Model 2
clf = XGBClassifier(n_estimators=220, objective='binary:logistic', tree_method='hist', learning_rate=0.1, max_depth=4)
pipeline = Pipeline([
    ('preprocessor', generic_preprocessor),
    ('model', clf)])

# Fit the pipeline on the full prepared training data
pipeline.fit(X_prep, y)

make_predictions(pipeline, 'submission9_xgboost.csv')