# QTW Lab 15

## David Josephs, Andy Heroy, Carson Drake, Che' Cobb

### April 10, 2020

## Introduction

For the final case study for DS7333, we were given an unlabeled dataset and tasked with providing insight into possible cost savings for a business partner. The dataset contains 160,000 records and 51 columns: 50 anonymized features (x0-x49) and a binary target ('y') that takes the value 0 or 1. Our only hint as to its domain is that it originates in the insurance industry. Our job is to apply machine learning techniques that minimize the cost our business partner incurs from incorrect classifications of the target variable.

Because the target variable is binary, we need classification models. We evaluate the following:

- Random Forest
- Random Forest with Permutation Importance
- Random Forest with PCA
- Logistic Regression
- Extra Trees Classifier

## Background

Put simply, a classification model is trained on a sample of labeled training data and then used to predict the class of unseen test data. The standard metrics for classification models are accuracy, precision, recall, and F1-score. The main goal of this paper, however, is to minimize the cost of false positives and false negatives, because our business partner has kindly given us a cost for each kind of error:

- False positive = $ 10
- False negative = $ 500
- True pos/neg = $ 0

When we evaluate our models, we use a custom scoring function that encodes our business partner's costs. Scikit-learn provides a handy function called `make_scorer` that lets us monitor each model's performance with those costs in mind. Table 1 below gives a basic overview of the other metrics that are standard in classification.





| Metric | Description | Equation |
|:---------:|:---------:|:---------:|
| Accuracy | The number of correct predictions divided by the total number of predictions. | (True Positives + True Negatives) / Total number of samples |
| Precision | The ratio of correctly predicted positives to the total number of predicted positives. | True Positives / (True Positives + False Positives) |
| Recall | The ratio of correctly predicted positives to all observations that are actually in the positive class. | True Positives / (True Positives + False Negatives) |
| F1-Score | The harmonic mean of precision and recall. This metric is most useful when the classes are imbalanced, but it is useful in general because it accounts for both false positives and false negatives. | 2 x (Recall x Precision) / (Recall + Precision) |

**Table 1:** Overview of standard classification metrics

The same metrics can be written in terms of the confusion matrix $M$, where $M_{ij}$ is the number of samples of true class $i$ that were predicted as class $j$. Precision, recall, and F1 are computed per class $k$:

\begin{align}
\mathrm{acc} &= \frac{\sum_{k} M_{kk}}{\sum_{i}\sum_{j} M_{ij}}\\
\mathrm{precision}_k &= \frac{M_{kk}}{\sum_{i} M_{ik}}\\
\mathrm{recall}_k &= \frac{M_{kk}}{\sum_{j} M_{kj}}\\
\mathrm{f1}_k &= \frac{2\,\mathrm{precision}_k\,\mathrm{recall}_k}{\mathrm{precision}_k + \mathrm{recall}_k}
\end{align}
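As a sanity check on these definitions, the following sketch applies them to the baseline Random Forest confusion matrix reported in the Modeling section. It assumes the scikit-learn convention that rows are actual classes and columns are predicted classes; the variable names are ours.

```python
import numpy as np

# Baseline Random Forest confusion matrix from the Modeling section.
# Assumed layout: rows are actual classes (0, 1), columns are predicted classes.
M = np.array([[18374, 787],
              [1617, 11222]])

accuracy = np.trace(M) / M.sum()
precision = np.diag(M) / M.sum(axis=0)  # column sums: all samples predicted as each class
recall = np.diag(M) / M.sum(axis=1)     # row sums: all samples actually in each class
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy  = {accuracy:.4f}")
print("precision =", precision.round(2))
print("recall    =", recall.round(2))
print("f1        =", f1.round(2))
```

Rounded to two decimals, these values agree with the accuracy and classification report listed for that model.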

## Data Preparation
The data is completely unlabeled, so we had to work out for ourselves what each variable meant. The target variable, `y`, is categorical with two levels. There are also 45 continuous features and 5 categorical features. Two pairs of features, `x2` and `x6`, and `x37` and `x41`, are perfectly correlated, so one feature from each pair was removed. The weekly and monthly categorical features were encoded as sine and cosine pairs with a weekly and monthly period, respectively, and an amplitude of two. The percentage variable, `x32`, was encoded as a categorical variable because it has only 5 unique levels.
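A check for perfectly correlated features can be sketched as follows. The data here is synthetic, and the helper `perfectly_correlated_pairs` and its tolerance are hypothetical; only the idea of scanning the correlation matrix for pairs with an absolute correlation of 1 comes from the text above.

```python
import numpy as np
import pandas as pd


def perfectly_correlated_pairs(df, tol=1e-9):
    """Return column pairs whose absolute Pearson correlation is 1 (within tol)."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, left in enumerate(cols):
        for right in cols[i + 1:]:
            if corr.loc[left, right] >= 1 - tol:
                pairs.append((left, right))
    return pairs


# Synthetic stand-in: x6 is an exact linear function of x2, x7 is independent.
rng = np.random.default_rng(0)
x2 = rng.normal(size=200)
demo = pd.DataFrame({"x2": x2, "x6": 3.0 * x2 + 1.0, "x7": rng.normal(size=200)})

pairs = perfectly_correlated_pairs(demo)
# Keep the first column of each pair and drop its partner.
reduced = demo.drop(columns=[right for _, right in pairs])
print(pairs, list(reduced.columns))
```

Which member of a pair is dropped is arbitrary, since both carry the same information.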



Because the data itself was unlabeled, we began by profiling and cleaning it to prepare it for modeling. Our findings on the properties of the dataset are listed below.

1.  Datatypes:
    - Numerical: 45 features
    - Categorical: 5 features
    - Boolean: 1 feature
2.  Correlation:
    - Two pairs of columns are perfectly correlated:
        - x2 and x6
        - x37 and x41
    - Result: because of the perfect correlation, we decided to drop x2 and x41.
3.  Data Cleaning
    - x24 contains country data.  "euorpe" was changed to "europe"
    - x29 contains monthly data.
        - "Dev" was changed to "Dec"
        - "sept." was changed to "Sep"
    - x30 contains day of the week.  "thurday" changed to "thursday"
    - x32 contains a percentage amount. All % signs were removed and datatype changed to float64
    - x37 contains a dollar amount.  All $ were removed and the datatype was changed to float64
4.  Missing Values
    - Each column has between 21 and 47 missing values. At most, this is about 0.03% of the rows in any one column (47 of 160,000), which is a small amount compared to the full dataset. Instead of dropping those rows, we imputed each numeric column with its median.
5.  Distribution
    - After checking the histogram of each column, we determined that the data was normally distributed with no skew. This leads us to believe that the dataset could have been generated with scikit-learn's `make_classification` function. The distributions are shown in Figure 1 below.
6.  Scaling
    - To make sure the data was scaled consistently for classification, we applied scikit-learn's `StandardScaler` to the numerical columns.
7.  Categorical features
    - The continent variable was encoded in two ways, as a binary China-or-not-China indicator and as one-hot columns, neither of which panned out. All categorical variables were mode imputed, and all continuous variables were mean imputed. Mean imputation is appropriate because all the variables followed a normal distribution, as seen in the figure below:
    
![Histograms](../plots/Histograms.png)

**Figure 1:** Histograms of the numerical data

Continuous distributions were tested both with and without scaling.
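The cleaning, imputation, and scaling steps listed in this section can be sketched as follows. The four-row frame is made-up toy data and the helper `clean` is hypothetical; the replacements themselves follow the list above. Median imputation is used for the numeric columns, as in item 4.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy rows that imitate the quirks described above (values are made up).
raw = pd.DataFrame({
    "x24": ["euorpe", "asia", "asia", None],
    "x29": ["Dev", "sept.", "Jun", "Jun"],
    "x30": ["thurday", "monday", "friday", "monday"],
    "x32": ["0.01%", "-0.02%", None, "0.0%"],
    "x37": ["$1313.96", "$-200.5", "$45.1", None],
})


def clean(df):
    """Fix the misspelled categories and convert % and $ strings to floats."""
    out = df.copy()
    out["x24"] = out["x24"].replace({"euorpe": "europe"})
    out["x29"] = out["x29"].replace({"Dev": "Dec", "sept.": "Sep"})
    out["x30"] = out["x30"].replace({"thurday": "thursday"})
    out["x32"] = pd.to_numeric(out["x32"].str.replace("%", "", regex=False))
    out["x37"] = pd.to_numeric(out["x37"].str.replace("$", "", regex=False))
    return out


cleaned = clean(raw)

# Mode imputation for a categorical column, median imputation for numeric ones.
cleaned["x24"] = cleaned["x24"].fillna(cleaned["x24"].mode().iloc[0])
numeric_cols = ["x32", "x37"]
imputed = SimpleImputer(strategy="median").fit_transform(cleaned[numeric_cols])

# Standardize each numeric column to zero mean and unit variance.
scaled = StandardScaler().fit_transform(imputed)
print(cleaned)
print(scaled.round(3))
```

In a real pipeline the imputer and scaler would be fit on the training split only and then applied to the test split, to avoid leaking test statistics into training.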


## Feature Engineering

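Exploratory data analysis is difficult when the columns have no meaningful labels, so our feature engineering was limited. The main engineered features were the cyclic encodings introduced in the Data Preparation section: the weekly and monthly categorical features were each mapped to a sine and cosine pair with a weekly or monthly period.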
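The cyclic encoding of the weekly and monthly features mentioned in the Data Preparation section can be sketched as follows. The helper `cyclic_encode`, the seven-day ordering, and the exact angle formula are our assumptions for illustration; the amplitude of two is taken from the text.

```python
import numpy as np
import pandas as pd

# Assumed ordering of the categories around the cycle.
WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def cyclic_encode(values, order, amplitude=2.0):
    """Map ordered categories onto a circle and return sine/cosine columns."""
    position = {name: k for k, name in enumerate(order)}
    angle = 2.0 * np.pi * values.map(position) / len(order)
    return pd.DataFrame(
        {"sin": amplitude * np.sin(angle), "cos": amplitude * np.cos(angle)},
        index=values.index,
    )


days = pd.Series(["monday", "thursday", "sunday"])
encoded = cyclic_encode(days, WEEKDAYS)
print(encoded.round(3))
```

Because the encoding is circular, Sunday ends up close to Monday, which a plain integer encoding (6 versus 0) would not capture. The same helper would apply to months with a twelve-element ordering.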


## Feature Selection
Feature selection took our group quite some time, because unlabeled columns make it difficult to judge which variables are important. In this situation we rely on model-based methods to rank the columns, then go back and forth between dropping columns and re-running our models.

Our first attempt at this process used Random Forest, which provides feature importances. On the first run, the importance scores were fairly low across the dataset. We therefore added a column of random values to the analysis to check whether random data would rank above the real features. Luckily, as Figure 2 below shows, the random column ranks fairly low in importance, so we can rest a little easier knowing that our data contains useful information.

### Random Forest Feature Importance

![FeatImportance](../plots/BaselineRF_Feature_Imp.png "Baseline Random Forest Feature Importance") **Figure 2:** Feature importance of baseline Random Forest
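The random-column check can be sketched as follows. This is a minimal illustration on synthetic data from `make_classification`, not the code behind Figure 2; the column name `random` and all parameter values are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: with shuffle=False, x0-x2 are informative and x3-x5 are noise.
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
features = pd.DataFrame(X, columns=[f"x{i}" for i in range(6)])

# Add a column of pure noise to act as a reference point.
rng = np.random.default_rng(0)
features["random"] = rng.normal(size=len(features))

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(features, y)

importances = pd.Series(
    forest.feature_importances_, index=features.columns
).sort_values(ascending=False)
print(importances)

# Features that score above the random column are candidates to keep.
keep = importances[importances > importances["random"]].index.tolist()
print(keep)
```

A feature that cannot beat pure noise is unlikely to carry signal, which makes the random column a convenient threshold when the columns have no labels.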


The features with a Random Forest importance greater than 0.02 were:

- x23
- x20
- x48
- x49
- x42
- x12
- x28
- x27
- x40
- x37
- x7
- x46
- x41
- x38
- x2
- x6
- x32

This leaves the remaining 33 variables as candidates for dropping when re-running the random forest. We didn't immediately drop these columns because we first wanted to run Principal Component Analysis (PCA) on the full dataset to find out how many components we need to retain at least 95% of the variance. This kind of dimensionality reduction is very useful when the data is unlabeled, as it is here.

### Principal Component Analysis

The main benefit of PCA is that it reduces the number of features when the computational cost of running a model becomes too cumbersome. Luckily this dataset has only 50 features, but some datasets have hundreds. Figure 3 below shows that we need 36 principal components to retain 95% of the variance. This is about twice as many dimensions as the 17 features selected by the random forest, but to stay conservative we set n_components=36 for the next random forest on the reduced dataset.

![PCA](../plots/PCA.png "Principal Component Analysis") **Figure 3:** PCA feature importance with respect to variation.
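Choosing the number of components from the cumulative explained variance can be sketched as follows. The data is synthetic, so the resulting count will not match the 36 components found on the real dataset; the 0.95 threshold is the one used in the text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 50 features.
X, _ = make_classification(
    n_samples=2000, n_features=50, n_informative=10, n_redundant=10,
    random_state=0,
)
X_scaled = StandardScaler().fit_transform(X)

# Fit a full PCA and accumulate the explained variance ratios.
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 95%.
n_components = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_components, round(float(cumulative[n_components - 1]), 4))

# Project the data onto that many components.
X_reduced = PCA(n_components=n_components).fit_transform(X_scaled)
print(X_reduced.shape)
```

Scikit-learn can also pick the count itself when `n_components` is given as a fraction, for example `PCA(n_components=0.95)`.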

## Modeling
Before delving into the classification models, it's important to describe the loss function for this exercise, as stipulated by our business partner.

#### Custom "Slater-loss" Function

The loss function is calculated as follows, given a confusion matrix $C$ and a cost matrix $W$, where $\odot$ denotes element-wise multiplication:
$$
\mathbb{L}_\mathrm{slater} = \frac{\sum_i \sum_j (C \odot W)_{ij}}{\sum_i \sum_j C_{ij}}, \qquad W = \begin{bmatrix}
0 & 10  \\
500 & 0  \\
\end{bmatrix}
$$
This represents the average cost in dollars per prediction: each false positive costs \$10 and each false negative costs \$500. The business must earn more than this amount per prediction for the model to be profitable. We evaluate the cost function using scikit-learn's `make_scorer` ([make_scorer documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html#sklearn.metrics.make_scorer)). To use it correctly, we multiply each confusion matrix element-wise by the weights we were given by Dr. Slater. Below is the code for how this was performed.

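```python
import numpy as np
from sklearn.metrics import confusion_matrix


def custom_loss(y_true, y_pred):
    # Rows are true classes, columns are predicted classes
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    # Dollar cost per cell: false positive = $10, false negative = $500
    weight = np.array([[0, 10], [500, 0]])
    return (cm * weight).sum() / cm.sum()
```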

This function takes the true and predicted values of the target column, builds the confusion matrix, and multiplies the false positive and false negative counts by their dollar penalties. It then divides the total by the sum of the confusion matrix, which gives the average dollar cost per prediction.
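One way to plug this loss into cross-validation with `make_scorer` is sketched below. The data is synthetic and the model settings are placeholders. We assume `greater_is_better=False`, in which case scikit-learn negates the loss so that higher scores are still better; whether the original runs used this flag is not stated.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import cross_val_score


def custom_loss(y_true, y_pred):
    # Average dollar cost per prediction: $10 per FP, $500 per FN.
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    weight = np.array([[0, 10], [500, 0]])
    return (cm * weight).sum() / cm.sum()


# Hand check: one false positive and one false negative in four predictions.
toy_loss = custom_loss([0, 0, 1, 1], [0, 1, 0, 1])
print(toy_loss)  # (10 + 500) / 4 = 127.5

# Wrap the loss as a scorer; the sign is flipped because lower cost is better.
slater_scorer = make_scorer(custom_loss, greater_is_better=False)

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring=slater_scorer)

# Undo the sign flip to report a positive dollar cost per prediction.
cost_per_prediction = -scores.mean()
print(scores, cost_per_prediction)
```

With this setup the fold scores are zero or negative, so the average cost is the negated mean.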


### Baseline Random Forest
As introduced earlier, our baseline random forest was run on the dataset after it had been cleaned and prepared as described in the Data Preparation section. We used a stratified 80-20 train-test split on the target variable and 5-fold cross-validation. The baseline Random Forest model built 300 trees and yielded quite good results, as shown below.

Accuracy of baseline Random Forest: 92.49%

Confusion matrix (rows are actual classes, columns are predicted classes):

| | Predicted 0 | Predicted 1 |
|----------|-------|-------|
| Actual 0 | 18374 | 787 |
| Actual 1 | 1617 | 11222 |

Custom cross-validation scores:

    (22.96359944 24.6265625  20.7203125  20.00625    23.20518831)

Classification report:

                  precision    recall  f1-score   support

               0       0.92      0.96      0.94     19161
               1       0.93      0.87      0.90     12839

        accuracy                           0.92     32000
       macro avg       0.93      0.92      0.92     32000
    weighted avg       0.93      0.92      0.92     32000

As the custom cross-validation scores show, the model costs our business partner about $22.30 per prediction on average across the five folds.
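The baseline setup described in this section (stratified 80-20 split, 300 trees) can be sketched as follows. The data is a synthetic stand-in and the random seeds are arbitrary, so the printed numbers will not match the results above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned dataset, with a 60/40 class balance.
X, y = make_classification(
    n_samples=1000, n_features=20, weights=[0.6, 0.4], random_state=0
)

# 80-20 split, stratified on the target so both parts keep the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Baseline forest with 300 trees and otherwise default settings.
forest = RandomForestClassifier(n_estimators=300, random_state=0, n_jobs=-1)
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)

acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(f"accuracy: {acc:.4f}")
print(cm)
print(classification_report(y_test, y_pred))
```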


### Random Forest with Permutation Importance

Next, we used permutation importance to determine which features are important for generalization. Permutation importance measures how much a model's score degrades when the values of a single feature are randomly shuffled. An excellent discussion of permutation importance and other importance tools can be found in [this article on explained.ai](https://explained.ai/rf-importance/) and [this blog post by the authors](https://josephsdavid.github.io/iml.html).
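A minimal sketch of permutation importance is shown below, using scikit-learn's `permutation_importance` on synthetic data. This is an assumption for illustration: the text does not say which implementation was used for the results in this section.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# shuffle=False keeps the 2 informative features in columns 0 and 1.
X, y = make_classification(
    n_samples=1000, n_features=5, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Shuffle each feature 10 times on held-out data and record the score drop.
result = permutation_importance(
    forest, X_val, y_val, n_repeats=10, random_state=0
)

ranking = result.importances_mean.argsort()[::-1]
for idx in ranking:
    mean = result.importances_mean[idx]
    std = result.importances_std[idx]
    print(f"feature {idx}: {mean:.4f} +/- {std:.4f}")
```

Measuring the drop on held-out data, rather than on the training set, is what ties the ranking to generalization.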

Accuracy of Random Forest with permutation importance: 93.28%

Confusion matrix (rows are actual classes, columns are predicted classes):

| | Predicted 0 | Predicted 1 |
|----------|-------|-------|
| Actual 0 | 18355 | 806 |
| Actual 1 | 1345 | 11494 |

Custom cross-validation scores:

    (22.55929058 21.17351849 -19.50295614 -20.18493057 -20.40770024)

Classification report:

                  precision    recall  f1-score   support

               0       0.93      0.96      0.94     19161
               1       0.93      0.89      0.91     12839

        accuracy                           0.93     32000
       macro avg       0.93      0.93      0.92     32000
    weighted avg       0.93      0.93      0.93     32000

This model did very well: averaging the magnitudes of the five fold scores gives a cost to our business partner of about $20.77 per prediction.



### Random Forest with Principal Component Analysis

We introduced PCA in the Feature Selection section of this case study. We now fit another random forest on the reduced data with n_components = 36, since our earlier chart showed that this many components are needed to retain 95% of the variance.

Accuracy of Random Forest with PCA: 83.25%

Confusion matrix (rows are actual classes, columns are predicted classes):

| | Predicted 0 | Predicted 1 |
|----------|-------|------|
| Actual 0 | 17496 | 1665 |
| Actual 1 | 3695 | 9144 |

Custom cross-validation scores:

    (33.12294954 33.0703125  28.484375   34.0953125  32.66760431)

Classification report:

                  precision    recall  f1-score   support

               0       0.83      0.91      0.87     19161
               1       0.85      0.71      0.77     12839

        accuracy                           0.83     32000
       macro avg       0.84      0.81      0.82     32000
    weighted avg       0.83      0.83      0.83     32000

Here the classification accuracy is about 9 percentage points lower, and the cost function averages \$32.29 per prediction. Since our business partner wants to save money rather than spend more, paying roughly \$10 more per prediction is not acceptable, so we do not recommend dimensionality reduction as a way to optimize cost savings.


Next, we implement a logistic regression model, as it is another popular model for classification.




### Logistic Regression

Now that we've seen a few models perform, let's turn our attention to logistic regression. Our business partner predicted that this model might achieve excellent accuracy, so we're interested in how it performs.


Accuracy of baseline Logistic Regression: 70.44%

Confusion matrix (rows are actual classes, columns are predicted classes):

| | Predicted 0 | Predicted 1 |
|----------|-------|------|
| Actual 0 | 15901 | 3260 |
| Actual 1 | 6200 | 6639 |

Custom cross-validation scores:

    (97.47509863 97.85672435 98.18398437 96.64205633 99.56365483)

Classification report:

                  precision    recall  f1-score   support

               0       0.72      0.83      0.77     19161
               1       0.67      0.52      0.58     12839

        accuracy                           0.70     32000
       macro avg       0.70      0.67      0.68     32000
    weighted avg       0.70      0.70      0.70     32000


Sadly, logistic regression did not live up to our initial expectations on this dataset. Accuracy dropped to 70.44%, and on average our business partner loses almost \$100 per prediction. With that being said, it's probably safe to say that tuning the logistic regression won't recover the 22 percentage points of accuracy needed to catch up to Random Forest.


### Extra Trees Classifier

Lastly, we try an Extra Trees Classifier to see if we can improve upon our baseline random forest. Extra Trees fits a number of randomized decision trees on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting.  [Extra Trees docs](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html)

Accuracy of baseline Extra Trees: 92.38%

Confusion matrix (rows are actual classes, columns are predicted classes):

| | Predicted 0 | Predicted 1 |
|----------|-------|-------|
| Actual 0 | 18473 | 688 |
| Actual 1 | 1750 | 11089 |

Custom cross-validation scores:

    (30.3953125 32.35      25.8765625 27.7890625 29.534375)

Classification report:

                  precision    recall  f1-score   support

               0       0.91      0.96      0.94     19161
               1       0.94      0.86      0.90     12839

        accuracy                           0.92     32000
       macro avg       0.93      0.91      0.92     32000
    weighted avg       0.92      0.92      0.92     32000
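The comparison across model families in this Modeling section can be sketched as a single loop. Everything here is illustrative: the data is synthetic, the hyperparameters are placeholders, and the helper `cost_per_prediction` is our name for the cost function defined earlier. Out-of-fold predictions from `cross_val_predict` are used so that every sample is scored exactly once.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_predict

# Cost matrix from the business partner: $10 per FP, $500 per FN.
COSTS = np.array([[0, 10], [500, 0]])


def cost_per_prediction(y_true, y_pred):
    """Average dollar cost per prediction under the cost matrix."""
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    return (cm * COSTS).sum() / cm.sum()


X, y = make_classification(n_samples=600, n_features=10, random_state=0)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Extra Trees": ExtraTreesClassifier(n_estimators=50, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

summary = {}
for name, model in models.items():
    # Out-of-fold predictions from 5-fold cross-validation.
    y_pred = cross_val_predict(model, X, y, cv=5)
    summary[name] = (accuracy_score(y, y_pred), cost_per_prediction(y, y_pred))

# Report the cheapest model first.
for name, (acc, cost) in sorted(summary.items(), key=lambda item: item[1][1]):
    print(f"{name:<20} accuracy={acc:.4f} cost=${cost:.2f}")
```

Ranking by cost rather than accuracy matters here because a false negative is fifty times as expensive as a false positive.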




## Results

Of all the models we ran, the Random Forest with permutation importance performed best on both accuracy and cost. The untuned baseline random forest came next, beating the Extra Trees Classifier by only 0.11 percentage points in accuracy but by several dollars per prediction in cost. Dimensionality reduction with PCA followed by another Random Forest didn't pan out for our purposes, and Logistic Regression came in last in both accuracy and cost savings. Our suggestion to our business partner is therefore to stick with Random Forest for their classification needs.

| Model                                     	| Accuracy 	| Custom Scoring Loss 	|
|-------------------------------------------	|----------	|---------------------	|
| Random Forest with Permutation Importance 	| 93.28%   	| $20.77              	|
| Random Forest                             	| 92.49%   	| $22.30              	|
| Extra Trees                               	| 92.38%   	| $29.19              	|
| Random Forest with PCA                    	| 83.25%   	| $32.29              	|
| Logistic Regression                       	| 70.44%   	| $97.94              	|

## Code Appendix
