# Table of Contents
1. [Brief Answers To Requested Questions](#Brief-Answers-To-Requested-Questions)
2. [Methodology](#Methodology)


## Brief Answers To Requested Questions

### Describe your model and why did you choose this model over other types of models?

Predictions on `marketing_test.csv` were generated from a stack of three base estimators: an sklearn RandomForestClassifier with n_estimators=350, a K-neighbors classifier, and a default LightGBM classifier, combined by a logistic regression meta-estimator. I chose this stack because it generally gave a higher Cohen's kappa score on the cross-validation folds. Cohen's kappa is a more robust statistic than accuracy on imbalanced sets because it corrects for the agreement expected by chance. The difference between the optimal RandomForestClassifier and the stack is negligible, so one might question the need for a stack in the first place, but the stack adds little risk of overfitting and only takes a bit more compute time. I also oversampled the minority class using the SMOTE algorithm. That choice was based largely on the plots I generated in Figure 1 and Figure 2, where the best kappa generally occurred with 15 < K < 25.

![Figure 1: Cohen's kappa as a function of SMOTE K choice](images/RandomForestClassifier_oversampled.png)

![Figure 2: Cohen's kappa as a function of SMOTE K choice](images/StackingCV_oversampled.png)



### Describe any other models you have tried and why do you think this model performs better?

I fit and tested several models:

* Logistic regression

* Random forest classifier

* Gradient boosting classifier (sample weights uniform and information-entropy weighted)

* Catboost x5 with averaging

The ROC curves of the above estimators for an 80/20 train/test split of the training samples can be seen in Figure 3.

![Figure 3: ROC curves on tested models](roc_curves.png)

I was also informed by the classification reports and the Cohen's kappa scores produced by these models, which are listed below for reference.

##### LogisticRegression

cohen kappa: 0.328912364912

                    precision    recall  f1-score   support

              0       0.98      0.91      0.95      5659
              1       0.26      0.64      0.37       272

    avg / total       0.95      0.90      0.92      5931


##### RandomForestClassifier

cohen kappa: 0.34041814301

                    precision    recall  f1-score   support

              0       0.97      0.92      0.94      5585
              1       0.30      0.57      0.39       346

    avg / total       0.93      0.90      0.91      5931


##### GradientBoostingClassifier (sample weighted)

cohen kappa: 0.260324076594

                    precision    recall  f1-score   support

              0       0.81      0.94      0.87      4519
              1       0.59      0.27      0.37      1412

    avg / total       0.75      0.78      0.75      5931

##### GradientBoostingClassifier

cohen kappa: 0.314803968308

                    precision    recall  f1-score   support

              0       0.96      0.92      0.94      5487
              1       0.32      0.47      0.38       444
    avg / total       0.91      0.88      0.89      5931

##### Catboost x5

cohen kappa: 0.315548974743

                    precision    recall  f1-score   support

              0       0.98      0.91      0.94      5647
              1       0.26      0.61      0.36       284
    avg / total       0.94      0.90      0.92      5931



This is the report for the model from which I generated the submission:


##### StackingCV

cohen kappa: 0.373040070916

                    precision    recall  f1-score   support

              0       0.95      0.93      0.94      5394
              1       0.40      0.48      0.44       537

    avg / total       0.90      0.89      0.89      5931
    

My choice of this kind of model is guided by empirical support for the performance of stacked ensemble methods, the simplicity of constructing them, and a generous hyper-parameter space on which to perform a grid search.


### How did you handle missing data?

Missing data were filled with a dummy value (-999). I also incorporated a VarianceThreshold from sklearn.feature_selection, which removes columns that have zero variance, i.e. columns that hold the same value in every row.
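A minimal sketch of these two steps, using a small made-up frame (the column names and values are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical frame: "b" has a missing value, "c" is constant.
df = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0],
        "b": [0.5, np.nan, 1.5, 2.5],
        "c": [7.0, 7.0, 7.0, 7.0],
    }
)

# Replace every missing entry with the dummy value.
filled = df.fillna(-999)

# The default threshold of 0.0 drops columns whose values are all identical.
selector = VarianceThreshold()
reduced = selector.fit_transform(filled)
kept_columns = list(filled.columns[selector.get_support()])
print(kept_columns)  # ['a', 'b']
```

Filling before the variance check matters here: the selector does not accept NaN values in this sketch's input, and a sentinel such as -999 is mainly suited to tree-based models, which can split it off from the real values.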

### How did you handle categorical (string) data?

I encoded categorical data with sklearn.preprocessing.LabelEncoder, which simply maps categories to integers. Some classifiers, notably catboost.CatBoostClassifier, have special treatment for categorical variables, which requires that the indices of the categorical features be preserved.
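As a rough sketch of the encoding step, the snippet below label-encodes every non-numeric column and records the positional indices of those columns. The frame and its column names are hypothetical, and the index list is simply what one would hand to a classifier that accepts categorical feature positions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical frame with two string columns and one numeric column.
df = pd.DataFrame(
    {
        "job": ["admin", "technician", "admin", "retired"],
        "age": [31, 45, 28, 63],
        "marital": ["single", "married", "married", "single"],
    }
)

# Treat every non-numeric column as categorical.
categorical_columns = [
    c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])
]

# Keep one fitted encoder per column so the mapping can be reused on test data.
encoders = {}
for col in categorical_columns:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

# Positional indices of the categorical columns, preserved for classifiers
# that treat categorical features specially.
cat_feature_indices = [df.columns.get_loc(c) for c in categorical_columns]
print(cat_feature_indices)  # [0, 2]
```

LabelEncoder assigns integers in sorted order of the category names, so the codes carry no meaningful ordering. Tree ensembles tolerate this reasonably well, while linear models may read a spurious order into the integers.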

### How did you handle unbalanced data?

There are several strategies for dealing with imbalanced data. I explored 3 methods for handling imbalance in the target:

#### 1. Random oversampling

The excellent library imbalanced-learn provides an sklearn-style interface around several popular resampling techniques. One of them is RandomOverSampler, which oversamples by sampling with replacement. Its 'minority' strategy balances the classes by oversampling only the minority class. This generally trains estimators that are better able to identify the minority class, at the cost of some added risk of overfitting to the repeated training samples. It is critical that the train/test split is performed BEFORE the oversampling is done; otherwise copies of the test samples from a cross-validation fold will very likely bleed into the training samples.
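The split-then-oversample order can be illustrated with a short sketch. To keep it dependent on sklearn only, it uses `sklearn.utils.resample` as a stand-in for imbalanced-learn's RandomOverSampler, and the data are synthetic; both are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data; class 1 is the minority (assumption).
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=0
)

# 1. Split first, so no held-out row can be duplicated into the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 2. Oversample the minority class of the training portion only,
#    sampling with replacement until it matches the majority count.
minority = y_train == 1
n_majority = int((~minority).sum())
X_min_up, y_min_up = resample(
    X_train[minority],
    y_train[minority],
    replace=True,
    n_samples=n_majority,
    random_state=0,
)
X_balanced = np.vstack([X_train[~minority], X_min_up])
y_balanced = np.concatenate([y_train[~minority], y_min_up])

# The test set keeps its original class ratio and is never resampled.
print(np.bincount(y_train), np.bincount(y_balanced), np.bincount(y_test))
```

The same ordering applies inside cross-validation: resampling has to happen within each training fold, not once on the full training set.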

#### 2. Interpolated oversampling

The SMOTE algorithm synthesizes new minority-class points by interpolating between a minority sample and one of its K nearest minority-class neighbors, where K can vary. I plotted Cohen's kappa as a function of K (Figures 1 and 2) to choose a value.
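The interpolation idea can be shown in a few lines of NumPy. This is a simplified sketch of the core step, not the imbalanced-learn implementation; the function name `smote_sketch` and its parameters are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_sketch(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each on the segment between a
    minority sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbors because each point is its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbor_idx = nn.kneighbors(X_minority)

    base = rng.integers(0, len(X_minority), size=n_new)  # starting samples
    pick = rng.integers(1, k + 1, size=n_new)            # columns 1..k skip "self"
    neighbors = X_minority[neighbor_idx[base, pick]]
    gap = rng.random((n_new, 1))                         # position along the segment
    return X_minority[base] + gap * (neighbors - X_minority[base])


# Toy minority class: 30 points in two dimensions.
rng = np.random.default_rng(1)
X_minority = rng.normal(size=(30, 2))
synthetic = smote_sketch(X_minority, n_new=50, k=5)
print(synthetic.shape)  # (50, 2)
```

A larger K lets synthetic points span more distant neighbors, which spreads the minority class out further; this is the parameter varied on the x-axis of Figures 1 and 2.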

#### 3. Information-entropy weighted boosting

It is also possible to weight the samples by the information entropy of their class (strictly, the self-information of the class) when the classifier supports sample weighting. The negative log of the probability mass for a class gives the relative weight of each sample in that class. Classes that have a higher frequency have a relatively low information content and vice versa. While sample weighting generally doesn't increase the overall F1 or Cohen's kappa, it generally provides better positive precision on the minority class.
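A minimal sketch of this weighting, assuming the weight of class c is -log(p_c) with p_c estimated from the class frequencies in the training labels. The data are synthetic and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic imbalanced data standing in for the training set (assumption).
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.9, 0.1], random_state=0
)

# Empirical probability mass of each class.
class_prob = np.bincount(y) / len(y)

# Negative log probability: rare classes receive larger weights.
class_weight = -np.log(class_prob)

# One weight per sample, looked up by its class label.
sample_weight = class_weight[y]

clf = GradientBoostingClassifier(random_state=0)
clf.fit(X, y, sample_weight=sample_weight)
print(dict(enumerate(class_weight.round(3))))
```

With a roughly 90/10 split, the minority weight is about twenty times the majority weight, since -log(0.1) is about 2.30 and -log(0.9) is about 0.105.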

### How did you test your model?

I split the training samples into train/test sets, generally either 50/50 or 80/20 train/test depending on the oversampling used. I measured Cohen's kappa and also generated a classification report, which includes the precision, recall, and F1 scores of the classifier.
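The evaluation loop looks roughly like the sketch below, shown here with a logistic regression on synthetic data (both assumptions). It also scores a trivial predictor that always outputs the majority class, which shows why kappa is more informative than accuracy on imbalanced data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, cohen_kappa_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the training samples (assumption).
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

kappa = cohen_kappa_score(y_test, y_pred)
report = classification_report(y_test, y_pred, zero_division=0)
print(f"cohen kappa: {kappa}")
print(report)

# Trivial baseline: always predict the majority class (0).
baseline_pred = np.zeros_like(y_test)
baseline_accuracy = accuracy_score(y_test, baseline_pred)
baseline_kappa = cohen_kappa_score(y_test, baseline_pred)
print(f"baseline accuracy: {baseline_accuracy:.3f}, baseline kappa: {baseline_kappa:.3f}")
```

The baseline reaches high accuracy without learning anything, while its kappa is zero because all of its agreement with the labels is what chance alone would produce.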

## Methodology

For reference, here is the general outline of my thought process before I began writing the code for the solution.

1. Load and label encode categorical data
2. Standardize data or ensure standardized prior to fit, clean: fillna with -999 or some value
3. Sanity-check, ensure variance of each column is non-zero, one-way ANOVA between train and test data;
   print box-plots or swarm/violin plots on offending columns (F-statistic p < 0.05, i.e. train and test differ significantly)
4. Use Cohen's kappa because accuracy is pretty useless with extremely imbalanced classes
5. Establish baseline with linear model, i.e. logistic regression with default parameters, plot decision boundaries
6. Upsample minority class using a couple of methods: random oversampling and interpolation (SMOTE or ADASYN)
7. Build default gradient boosting classifier and compare, plot decision boundaries
8. Estimate feature importance with gradient boosting classifier, do PCA if necessary
9. Build knn classifier and compare, plot decision boundaries
10. Build random forest ensemble
11. If decision boundary correlation is low between the previous models, meaning they are good candidates for stacking, build a stacked classifier with them
12. Pipeline grid search cross-validation for optimal parameters
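
The final step of the outline, a pipelined grid search scored by Cohen's kappa, might look like the following sketch. The parameter grid, the choice of a random forest as the estimator, and the synthetic data are all assumptions made to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic imbalanced data standing in for the training set (assumption).
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Preprocessing and the classifier are fit together inside each CV fold.
pipeline = Pipeline(
    [
        ("variance", VarianceThreshold()),
        ("clf", RandomForestClassifier(random_state=0)),
    ]
)

# Hypothetical grid; step-name prefixes route parameters to pipeline steps.
param_grid = {
    "clf__n_estimators": [50, 100],
    "clf__max_depth": [None, 5],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    scoring=make_scorer(cohen_kappa_score),  # select on kappa, not accuracy
    cv=3,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

If oversampling is part of the search, it needs to run inside each training fold as well; imbalanced-learn ships its own Pipeline class that supports resampling steps for that purpose.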