#### Tasks
1. Create the most accurate classifier (evaluation metric: accuracy rate) for the data, as measured by the
test data.
2. Prepare an 8-12 page slide deck summarizing your approach to:  
    (a) cleaning and preparing the data for modeling - Assumption: missing dates imply no delivery  
    (b) formulating the model design matrix - Definition of features  
    (c) building the model and tuning parameters - Models tested and a description of the tuning process  
    (d) validating the model with training & validation sets, or other approaches - 5-fold Cross-Validation  
    (e) comparing results from all attempts  
    (f) findings from the data and challenges from this contest.  

In [15]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Data Preparation

In [2]:
from src.datasets import get_training
from src.prep import DataPrep

In [70]:
X, y = get_training()
X_train_prepped = DataPrep().run(X)

In [None]:
# Keep rows with age below 100 or missing age (drops outliers with age >= 100)
age = X_train_prepped['customer_age_at_order']
X_train_prepped = X_train_prepped[(age < 100) | age.isna()]
# Align the labels with the surviving rows before resetting either index
y_train = y.loc[X_train_prepped.index].reset_index(drop=True)
X_train_prepped = X_train_prepped.reset_index(drop=True)
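
The row filter has to be applied to the features and the labels together, otherwise the two drift out of alignment. A minimal sketch on a toy frame (the values and the names `X_toy`, `y_toy`, `keep` are made up for illustration; only the column name comes from the cell above):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one implausible age (130) and one missing age
X_toy = pd.DataFrame({'customer_age_at_order': [25.0, 130.0, np.nan, 40.0]})
y_toy = pd.Series([0, 1, 1, 0])

# Boolean mask: keep plausible ages and rows where the age is missing
age = X_toy['customer_age_at_order']
keep = (age < 100) | age.isna()

# Apply the same mask to features and labels, then reset both indexes
X_kept = X_toy[keep].reset_index(drop=True)
y_kept = y_toy[keep].reset_index(drop=True)

print(X_kept)
print(y_kept.tolist())
```

Missing ages are kept on purpose so that the imputer in the pipeline can handle them later.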

### Preprocessing

1. Encode the categorical variables
    - OneHotEncoder for non-tree models
    - OrdinalEncoder for tree-based models
2. Impute missing values on numeric fields
    - SimpleImputer [mean, median, most_frequent]
3. Scale numerical values
4. [optional] normalize features
5. Fit Model
    - Naive Bayes
    - Logistic Regression
    - K-Nearest Neighbors
    - SVC
    - Decision Tree
    - Bagging Decision Tree
    - Boosted Decision Tree
    - Random Forest Classifier
    - Voting Classifier

Transformation Notes
- For tree-based models, use ordinal encoding instead of one-hot encoding. A tree can recover essentially the same information from an ordinal-encoded feature as from a one-hot encoded one, even when the categories are unordered, because it can isolate individual categories with successive splits on the integer codes.
- Cross-validate the entire pipeline
    - Good: split the data first, then apply the pipeline steps within each fold. Bad: preprocess the full dataset and then cross-validate only the model, which leaks information from the validation folds into the preprocessing (data leakage).
    - Preprocessing before splitting the data does not properly simulate reality, because new data is never available when the preprocessing steps are fitted
    - Splitting and then preprocessing does simulate reality, which is the entire purpose of cross-validation
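
The two cross-validation setups described above can be sketched as follows. This is an illustration on synthetic data, not the contest setup: the dataset, `StandardScaler`, and `LogisticRegression` are assumptions chosen to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the contest data (assumption for illustration only)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# Leaky: the scaler is fitted on every row, including rows later used for validation
X_scaled_all = StandardScaler().fit_transform(X_demo)
leaky_scores = cross_val_score(LogisticRegression(max_iter=1000), X_scaled_all, y_demo, cv=5)

# Leak-free: the scaler is refitted on the training part of each fold only
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clean_scores = cross_val_score(pipe, X_demo, y_demo, cv=5)

print(f'Preprocess, then CV: {leaky_scores.mean():.4f}')
print(f'CV on the pipeline:  {clean_scores.mean():.4f}')
```

With a simple scaler the two numbers are usually close. The gap tends to matter more for steps that learn more from the data, such as imputation or feature selection.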
    

In [22]:
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_val_score, cross_validate, train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, Normalizer, OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier, plot_tree

In [23]:
def model_evaluation_cv(estimator, X, y, cv=5, scoring='accuracy', return_train_score=False):
    """Run k-fold cross-validation and print the per-fold and average scores."""
    cv_results = cross_validate(estimator=estimator, X=X, y=y, cv=cv, scoring=scoring, return_train_score=return_train_score)

    # Print every array returned by cross_validate (fit_time, score_time, test_score, ...)
    for key, values in cv_results.items():
        print(key, values)
    print('-----')
    print(f"Average cross-validation test score: {cv_results['test_score'].mean()}")
    # 'train_score' is only present when return_train_score=True
    if return_train_score:
        print(f"Average cross-validation train score: {cv_results['train_score'].mean()}")

Pipelines to use in model testing
- treebased_preprocessor is used for tree-based models (decision trees, random forests, etc.) because it encodes categorical features with an OrdinalEncoder; generic_preprocessor uses a OneHotEncoder and is meant for all other models
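
To make the difference between the two encoders concrete, here is a small sketch using the same `handle_unknown` settings as the preprocessors in the next cell. The colour values are hypothetical and not taken from the contest data.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy categorical column (hypothetical values)
train = np.array([['red'], ['green'], ['blue']], dtype=object)
new = np.array([['green'], ['purple']], dtype=object)  # 'purple' was never seen in training

ordinal = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1).fit(train)
onehot = OneHotEncoder(handle_unknown='ignore').fit(train)

ordinal_codes = ordinal.transform(new)
onehot_rows = onehot.transform(new).toarray()

print(ordinal_codes)  # a single integer column; the unseen category is encoded as -1
print(onehot_rows)    # one column per training category; the unseen category is all zeros
```

The ordinal encoding keeps one column per feature, which keeps the design matrix small when a categorical feature has many levels.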

In [34]:
numeric_cols = X_train_prepped.select_dtypes(include=np.number).columns
categorical_cols = X_train_prepped.select_dtypes(exclude=np.number).columns

# Create a preprocessor for tree-based models
treebased_preprocessor = ColumnTransformer([
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1))
        ]),categorical_cols),
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', MinMaxScaler())
        ]), numeric_cols)
    ])

# Create a generic preprocessor
generic_preprocessor = ColumnTransformer([
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(handle_unknown='ignore'))
        ]),categorical_cols),
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', MinMaxScaler())
        ]), numeric_cols)
    ])

### Model 1: Decision Tree Classifier

##### Performance on Base Classifier

In [35]:
base_clf = DecisionTreeClassifier()

decision_tree_pipeline = Pipeline([
    ('preprocessor', treebased_preprocessor),
    ('model', base_clf)])

model_evaluation_cv(decision_tree_pipeline, X_train_prepped, y_train, cv=5, scoring='accuracy', return_train_score=True)

fit_time [1.66813493 1.46023226 1.3129797  1.33512473 1.36519361]
score_time [0.09393167 0.10915351 0.100178   0.09088707 0.10616064]
test_score [0.65138542 0.66259916 0.65902605 0.65446415 0.67715403]
train_score [0.97542628 0.97531953 0.97638136 0.97746004 0.97622418]
-----
Average cross-validation test score: 0.6609257608217236
Average cross-validation train score: 0.9761622784637941


##### Hyperparameter Tuning

In [38]:
base_clf = DecisionTreeClassifier()

decision_tree_pipeline = Pipeline([
    ('preprocessor', treebased_preprocessor),
    ('model', base_clf)])

param_grid = {
    'model__criterion': ['gini', 'entropy', 'log_loss'],
    'model__max_depth': [2, 4, 6, 8, 10],
    'model__min_samples_split': [2, 4, 6, 8, 10]
}

search = RandomizedSearchCV(decision_tree_pipeline, param_grid, n_iter=20, cv=5, scoring='accuracy', return_train_score=False, random_state=42, n_jobs=-1)
search.fit(X_train_prepped, y_train)
search_df = pd.DataFrame(search.cv_results_)
print(f'Best Accuracy: {search.best_score_}')
search.best_params_

Best Accuracy: 0.7668296144526551


{'model__min_samples_split': 2,
 'model__max_depth': 6,
 'model__criterion': 'entropy'}
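
The `cv_results_` table stored in `search_df` shows more than the single best setting, for example how close the runner-up candidates were. A minimal sketch on synthetic data (the dataset, the reduced grid, and the names `demo_search`, `demo_df`, `top` are assumptions for illustration):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the contest data
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

demo_search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    {'max_depth': [2, 4, 6, 8, 10], 'min_samples_split': [2, 4, 6, 8, 10]},
    n_iter=8, cv=5, scoring='accuracy', random_state=42,
).fit(X_demo, y_demo)

# One row per sampled candidate; rank 1 is the best mean validation score
demo_df = pd.DataFrame(demo_search.cv_results_)
cols = ['param_max_depth', 'param_min_samples_split',
        'mean_test_score', 'std_test_score', 'rank_test_score']
top = demo_df.sort_values('rank_test_score')[cols].head(5)
print(top)
```

Comparing `mean_test_score` differences against `std_test_score` gives a rough sense of whether the best candidate is meaningfully better than the next ones.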

##### Model Evaluation after Tuning

In [42]:
tuned_clf = DecisionTreeClassifier(
    criterion='entropy',
    min_samples_split=2,
    max_depth=6
)

decision_tree_pipeline = Pipeline([
    ('preprocessor', treebased_preprocessor),
    ('model', tuned_clf)])

model_evaluation_cv(decision_tree_pipeline, X_train_prepped, y_train, cv=5, scoring='accuracy', return_train_score=True)

fit_time [0.81263232 0.77864075 0.77703094 0.75275064 0.74283242]
score_time [0.08037639 0.09634995 0.09412789 0.08908486 0.09631419]
test_score [0.76893863 0.77880402 0.77583766 0.75550012 0.75506764]
train_score [0.76677435 0.76427428 0.76501025 0.7701059  0.77028697]
-----
Average cross-validation test score: 0.7668296144526551
Average cross-validation train score: 0.7672903505301016
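
Limiting the depth closed the gap between train and validation accuracy (0.976 vs. 0.661 for the base tree, 0.767 vs. 0.767 after tuning). The same effect can be traced across depths with `validation_curve`. This is a sketch on noisy synthetic data, so the numbers will not match the contest data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data (flip_y adds label noise), an assumption for illustration
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.2, random_state=0)

depths = [2, 4, 6, 10, 30]
train_scores, test_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo,
    param_name='max_depth', param_range=depths, cv=5, scoring='accuracy',
)

# Average over the 5 folds for each depth
train_mean = train_scores.mean(axis=1)
test_mean = test_scores.mean(axis=1)
for depth, tr, te in zip(depths, train_mean, test_mean):
    print(f'max_depth={depth}: train={tr:.3f}, validation={te:.3f}, gap={tr - te:.3f}')
```

A growing gap at larger depths is the usual sign that the tree is memorizing the training folds.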


### Model 2: SGDClassifier

In [138]:
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()

sgd_classifier_pipeline = Pipeline([
    ('preprocessor', generic_preprocessor),
    ('model', model)])

model_evaluation_cv(sgd_classifier_pipeline, X, y, cv=5, scoring='accuracy', return_train_score=True)

fit_time [1.07739186 1.01104188 1.05359101 0.99112201 1.03263474]
score_time [0.0988276  0.1035502  0.11779785 0.09919524 0.10978985]
test_score [0.76814095 0.76630188 0.77218236 0.75751521 0.762414  ]
train_score [0.76441039 0.76487016 0.76340003 0.76706684 0.76786282]
-----
Average cross-validation test score: 0.7653108819090205
Average cross-validation train score: 0.7655220458489354


In [141]:
parameters = {
    'model__loss': ['hinge', 'log_loss', 'modified_huber'],  # 'log' was renamed to 'log_loss' in scikit-learn 1.1
    'model__penalty': ['l2', 'l1', 'elasticnet'],
    'model__alpha': [0.0001, 0.001, 0.01, 0.1, 1],
    'model__max_iter': [1000, 5000, 10000]
}

search = RandomizedSearchCV(sgd_classifier_pipeline, parameters, n_iter=15, cv=5, scoring='accuracy', return_train_score=False, random_state=42,n_jobs=-1).fit(X, y)
search_df = pd.DataFrame(search.cv_results_)
print(f'Best Accuracy: {search.best_score_}')
search.best_params_

Best Accuracy: 0.7642255790127808


{'model__penalty': 'l2',
 'model__max_iter': 1000,
 'model__loss': 'hinge',
 'model__alpha': 0.001}

### Model 3: Random Forest Classifier

##### Create Model Pipeline

In [15]:
model = RandomForestClassifier()

random_forest_pipeline = Pipeline([
    ('preprocessor', treebased_preprocessor),
    ('model', model)])

model_evaluation_cv(random_forest_pipeline, X.head(25000), y.head(25000), cv=5, scoring='accuracy', return_train_score=True)

fit_time [3.36990643 2.58722329 2.31067395 2.3189652  2.31309247]
score_time [0.15515924 0.11544895 0.11325622 0.11164474 0.11375284]
test_score [0.759  0.7584 0.7496 0.7706 0.7528]
train_score [0.98315 0.9837  0.98365 0.9839  0.9836 ]
-----
Average cross-validation test score: 0.75808
Average cross-validation train score: 0.9836


##### Hyperparameter Tuning

In [20]:
model = RandomForestClassifier()

random_forest_pipeline = Pipeline([
    ('preprocessor', treebased_preprocessor),
    ('model', model)])

random_forest_param_grid = {
    'model__bootstrap': [True, False],
    'model__max_depth': [5, 7, 10, 12, 15],
    'model__max_features': [None, 'sqrt', 'log2'],
    'model__min_samples_leaf': [1, 2, 3],
    'model__min_samples_split': [2, 4, 6, 8, 10, 12],
    'model__n_estimators': [100, 500, 1000]
}

rf_random = RandomizedSearchCV(
    estimator=random_forest_pipeline,
    param_distributions=random_forest_param_grid,
    n_iter=10,
    cv=5,
    random_state=42,
    n_jobs=-1).fit(X.head(25000), y.head(25000))

print(rf_random.best_score_)
rf_random.best_params_

0.77132


{'model__n_estimators': 500,
 'model__min_samples_split': 12,
 'model__min_samples_leaf': 2,
 'model__max_features': 'log2',
 'model__max_depth': 10,
 'model__bootstrap': True}

In [19]:
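# Note: this cell was run before the search cell above (In [19] vs. In [20]), so these
# values presumably come from an earlier search run and differ from the best_params_ printed there
tuned_model = RandomForestClassifier(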
    bootstrap=False,
    max_depth=10,
    max_features='sqrt',
    min_samples_leaf=2,
    min_samples_split=6,
    n_estimators=500
)

tuned_pipeline = Pipeline([
    ('preprocessor', treebased_preprocessor),
    ('model', tuned_model)])

# Model Evaluation
model_evaluation_cv(tuned_pipeline, X.head(25000), y.head(25000), cv=5, scoring='accuracy', return_train_score=True)

fit_time [6.54380774 6.54193783 6.28002405 6.67028141 6.45264959]
score_time [0.29910493 0.2912569  0.28822875 0.30168414 0.29417658]
test_score [0.7644 0.7798 0.768  0.7846 0.7632]
train_score [0.7903  0.7912  0.79395 0.7886  0.7916 ]
-----
Average cross-validation test score: 0.772
Average cross-validation train score: 0.7911300000000001


### Model 4: Support Vector Classifier

In [21]:
from sklearn.svm import SVC

base_clf = SVC()

svc_pipeline = Pipeline([
    ('preprocessor', generic_preprocessor),
    ('model', base_clf)])

model_evaluation_cv(svc_pipeline, X.head(25000), y.head(25000), cv=5, scoring='accuracy', return_train_score=True)

fit_time [16.38579321 17.35776877 17.79804277 17.28743696 17.19053364]
score_time [8.40359044 9.33917785 8.46818519 8.68982649 9.16167474]
test_score [0.4878 0.4962 0.5358 0.4814 0.5278]
train_score [0.5074  0.506   0.5093  0.49995 0.51055]
-----
Average cross-validation test score: 0.5058
Average cross-validation train score: 0.50664


In [29]:
base_clf = SVC()

svc_pipeline = Pipeline([
    ('preprocessor', generic_preprocessor),
    ('model', base_clf)])

param_grid = {
    'model__C': [0.1, 1, 10],
    'model__kernel': ['linear', 'rbf', 'poly'],
    'model__gamma': ['scale', 'auto', 0.1, 0.01]
}

search = RandomizedSearchCV(
    estimator=svc_pipeline,
    param_distributions=param_grid,
    n_iter=10,
    cv=5,
    random_state=42,
    n_jobs=-1).fit(X.head(25000), y.head(25000))

print(search.best_score_)
search.best_params_

0.7776000000000001


{'model__kernel': 'rbf', 'model__gamma': 'scale', 'model__C': 1}

### Model 5: Bagging Decision Tree

In [32]:
from sklearn.ensemble import BaggingClassifier

# Initialize a decision tree classifier
tree = DecisionTreeClassifier(
    min_samples_split=8,
    max_depth=6
)

# Initialize a bagging classifier with 500 decision trees
bagging = BaggingClassifier(tree, n_estimators=500)

bagging_pipeline = Pipeline([
    ('preprocessor', generic_preprocessor),
    ('model', bagging)])

model_evaluation_cv(bagging_pipeline, X.head(25000), y.head(25000), cv=5, scoring='accuracy', return_train_score=True)

fit_time [27.53406644 26.26263356 26.94116974 26.22585106 27.41593909]
score_time [0.49966478 0.49866152 0.4976685  0.48872352 0.48968434]
test_score [0.7752 0.7836 0.772  0.7906 0.767 ]
train_score [0.7794  0.7795  0.787   0.77925 0.7858 ]
-----
Average cross-validation test score: 0.7776799999999999
Average cross-validation train score: 0.7821899999999999
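
To compare the attempts side by side, the average 5-fold CV accuracies printed above can be collected into one table. The scores below are copied from the outputs in this notebook and rounded to four decimals; the `subset` column is a reminder that some models were scored on the first 25,000 rows only, so the ranking is indicative rather than strict.

```python
import pandas as pd

# Average 5-fold CV test accuracies copied from the cell outputs above
results = pd.DataFrame([
    {'model': 'Decision Tree (base)',         'cv_accuracy': 0.6609, 'subset': 'full'},
    {'model': 'Decision Tree (tuned)',        'cv_accuracy': 0.7668, 'subset': 'full'},
    {'model': 'SGDClassifier (base)',         'cv_accuracy': 0.7653, 'subset': 'full'},
    {'model': 'SGDClassifier (best search)',  'cv_accuracy': 0.7642, 'subset': 'full'},
    {'model': 'Random Forest (base)',         'cv_accuracy': 0.7581, 'subset': 'first 25,000'},
    {'model': 'Random Forest (tuned)',        'cv_accuracy': 0.7720, 'subset': 'first 25,000'},
    {'model': 'SVC (base)',                   'cv_accuracy': 0.5058, 'subset': 'first 25,000'},
    {'model': 'SVC (best search)',            'cv_accuracy': 0.7776, 'subset': 'first 25,000'},
    {'model': 'Bagging Decision Tree',        'cv_accuracy': 0.7777, 'subset': 'first 25,000'},
])

# Highest accuracy first
ranked = results.sort_values('cv_accuracy', ascending=False).reset_index(drop=True)
print(ranked)
```

The top few models are within about one percentage point of each other, which is close to the fold-to-fold spread visible in the individual `test_score` arrays.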


---

## Fit Model and Predict Test Set        

In [75]:
def make_predictions(fitted_pipeline, submission_name):
    # Read kaggle test data from disk
    X_test = pd.read_csv('data/test.csv')
    
    # Prep test data
    X_test_prepped = X_test.drop(columns='id')
    X_test_prepped = DataPrep().run(X_test_prepped)

    # Use fitted model pipeline to predict values on test data
    submission = pd.DataFrame({
        'id': list(X_test['id']),
        'return': fitted_pipeline.predict(X_test_prepped)
    })

    # Write results to csv file
    submission.to_csv(f'./submissions/{submission_name}', index=False)

In [43]:
numeric_cols = X_train_prepped.select_dtypes(include=np.number).columns
categorical_cols = X_train_prepped.select_dtypes(exclude=np.number).columns

# Create a preprocessor for tree-based models
treebased_preprocessor = ColumnTransformer([
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1))
        ]),categorical_cols),
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', MinMaxScaler())
        ]), numeric_cols)
    ])


clf = DecisionTreeClassifier(
    criterion='entropy',
    min_samples_split=2,
    max_depth=6
)

pipeline = Pipeline([
    ('preprocessor', treebased_preprocessor),
    ('model', clf)])

pipeline.fit(X_train_prepped, y_train)

In [72]:
# Hold out 25% of the prepared training data for validation
X_tr, X_val, y_tr, y_val = train_test_split(X_train_prepped, y_train, test_size=.25, random_state=42)

clf = DecisionTreeClassifier(
    criterion='entropy',
    min_samples_split=2,
    max_depth=6
)

pipeline = Pipeline([
    ('preprocessor', treebased_preprocessor),
    ('model', clf)])

pipeline.fit(X_tr, y_tr)

y_pred = pipeline.predict(X_val)

In [74]:
from sklearn.metrics import accuracy_score

accuracy_score(y_val, y_pred)

0.7658924205378973
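
Accuracy alone does not show which class the errors fall in. A confusion matrix on the same hold-out split would. The sketch below uses synthetic data and hypothetical names (`X_demo`, `demo_tree`, `cm`), with the tree settings taken from the tuned classifier above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for the contest data
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

demo_tree = DecisionTreeClassifier(criterion='entropy', max_depth=6, random_state=0).fit(X_tr, y_tr)
y_hat = demo_tree.predict(X_va)

cm = confusion_matrix(y_va, y_hat)  # rows: true class, columns: predicted class
acc = accuracy_score(y_va, y_hat)   # equals the diagonal of cm divided by the total count
print(cm)
print(f'accuracy = {acc:.4f}')
```

The off-diagonal cells separate the two error types, which is useful when the classes are imbalanced or the errors have different costs.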