# QTW Lab 15

## David Josephs, Andy Heroy, Carson Drake, Che' Cobb

### April 10, 2020

## Introduction

For the final case study for DS7333, we were given an unlabeled dataset and tasked with providing insight as to possible cost savings for a business partner. The dataset itself was 160,000 records with 51 unlabeled features.   Our only hint as to its domain is that its origins lie in the insurance industry.  The features themselves are labeled (x0-x49) with a binary target category ('y') of 0 or 1.  Our job is to apply machine learning modeling techniques to show a cost savings for our business partner for every incorrect classification of the target variable.  

Seeing as we're dealing with a target variable that is binary, our job will be to select classification models that can best analyze the data.  The classification models we will attempt are as follows.  

- Random Forrest
- Random Forest with Permutation Importance
- Random Forrest with PCA
- Logistic Regression
- ExtraDecisionTree's

## Background

Put simply, a classification model trains a model on a sample of training data to then predict the class of unseen test data.  The metrics for classification models are accuracy, precision, recall and F1-Score.  The main goal of this paper is to minimize false positives and false negatives.  As our business partner has kindly given us a "cost" function as to what false positives and negatives cost our company. 

- False positive = $ 10
- False negative = $ 500
- True pos/neg = $ 0

When we evaluate our models, we implement a custom function to incorporate our business partners requirements in a cost function in the scoring section of the algorithm.  Sklearn comes with a handy function called make_scorer that allows us to monitor our models performance with our partners cost savings in mind.  For a basic overview of other metrics that are standard in classification.  See Table 1 below





| Metric | Description | Equation |
|:---------:|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------------------------------------------------------:|
| Accuracy | Accuracy is defined as the number of correct predictions divided  by the total number of predictions. | (True Positive + False Negatives)/Total amount of samples |
| Precision | Precision is defined as the ratio of correctly predicted positives  to the total number of predicted positive observations. | True Positive/(True Positives + False Positives) |
| Recall | Recall is defined as the ratio of correct positives to all the  observations in the class | True Positives/(True Positives + False Positives) |
| F1-Score | F1 score is defined as the harmonic average of Precision and Recall.  This metric is more useful when you have uneven class balances but it  sometimes useful as it includes false postives and false negatives | 2x(Recall x Precision)/(Recall + Position) |



## Data Preparation/Engineering


Seeing as the data itself was unlabeled, we now begin the process of cleaning it in order to prepare it for modeling.  First, the target variable, `y`, is a categorical variable with two levels. Below are the steps and features engineered for this dataset.

1.  Datatypes:
    - Numerical: 45 features
    - Categorical: 5 features
    - Boolean: 1 feature
2.  Correlation:
    - Two sets of columns show direct correlation.  (Figure 2)
    - x2 and x6
    - x37 and x41
    - Result: Due to perfect correlation the columns we decided to drop x2 and x41
3.  Data Cleaning
    - x24 contains country data.  "euorpe" was changed to "europe."  We also one hot encoded it
    - x29 contains monthly data.  We one hot encoded it.
        - "Dev" was changed to "Dec"
        - "sept." was changed to "Sep"
    - x30 contains day of the week.  "thurday" changed to "thursday".  We also one hot encoded it
    - x32 contains a percentage amount. All % signs were removed and datatype changed to float64
        - x32 was encoded as a categorical variable because it only had 5 unique levels.
    - x37 contains a dollar amount.  All $ were removed and the datatype was changed to float64
4.  Missing Value's 
    - Each column look to have anywhere from 21 - 47 missing values.  At most, this comprises about 2.9% of the data.  Which is a small amount when compared to the full dataset.  Instead of dropping those rows, we imputed with the mean for each numeric column.
5.  Distribution
    - After checking the histograms for each column, it was determined the data was normally distributed with no skew.  This leads us to believe that this dataset could have been generated with sklearns make_classification function.  To see the distributions check Figure 1 below. 
6.  Scaling
    - In order to make sure our data was scaled correctly for classification, we implemented sklearns StandardScaler function over the numerical columns. 
7.  Categorical features
    - The weekly and monthly categorical features were encoded as a sine and cosine with a weekly and monthly period, and an amplitude of two.   
    - The continent variable was encoded in two ways, neither of which panned out, China or not china, and one hot encoded. All categorical variables were mode imputed, and all continuous variables were mean imputed. Mean imputation is appropriate because all the variables followed a normal distribution, as seen in the figure below:
    
![Histograms](../plots/Histograms.png)
**Figure 1:** histograms of the numerical data
Continuous distributions were tested both with and without scaling.

![CorrelationPlot](../plots/Correlation_plot.png)
**Figure 2:** Correlation plot of numerical categories



## Feature Selection
Feature selection took our group quite some time as having unlabeled columns makes it difficult to quantify what variables are important.  When this is the case, we rely on other methods to determine what columns are important for analysis and going back and forth of dropping them and running our analysis on various models. 



Our first attempt at this process was using Random Forest which comes with feature importance.  Upon the first run, we saw a fairly low variable scoring of importance within the dataset. Due to this, we decided to implement a "Random" column in to the analysis to see if random data was indeed more useful than the data.  Luckily, as you can see in the Figure 3 below, the random column ranks fairly low in the importance so we can rest a little easier knowing that our data contains useful information.  

### Random Forest Feature Importance

![FeatImportance](../plots/BaselineRF_Feature_Imp.png "Baseline Random Forest Feature Importance") **Figure 3:** Feature importance of baseline Random Forest



The most important features according to Random Forest > 0.02 were as follows. 

    - x23
    - x20
    - x48
    - x49
    - x42
    - x12
    - x28
    - x27
    - x40
    - x37
    - x7 
    - x46 
    - x41
    - x38
    - x2 
    - x6 
    - x32

Leaving the remaining 33 variables available for dropping in re-running a random forest.  We didn't immediately drop these columns because we wanted to run Principal Component Analysis (PCA) on the full dataset to discover how many columns we needed to keep to maintain at least 95% variance. This type of dimensionality technique is very useful when you have unlabeled data such as this.  

### Principal Component Analysis

The main beneft of PCA is to reduce the number of features when computational cost becomes too cumbersome when running a model.  Luckily, there's only 50 features for this dataset, but some can number in the 100's.  Below in Figure 4, we can see that in order to maintain 95% variance, we can select up to 36 different features to maintain our desired variance.  This is about twice as many features as with the random forest, but maintaining a conservative approach, we will use this limit to set our n_components=36 for the next random forest on the reduced dataset.  

![PCA](../plots/PCA.png "Principal Component Analysis") **Figure 4:** PCA feature importance with respect to variation.

## Modeling
Before we begin delving into classification models, its important to describe our loss function for this exercise as stipulated by our business partner.  

#### Custom "Slater-loss" Function

We now proceed with basic modeling of the data. The loss function is calculated as follows (given a confusion matrix $C$):
$$
\mathbb{L}_\mathrm{slater} =\frac {C * \begin{bmatrix}
0 & 10  \\
500 & 0  \\
\end{bmatrix}}  {\sum_i \sum_j C}
$$
Representing the cost in dollars per prediction. We want a dollar amount loss for every false positive/negative prediction. The business must earn more than that per prediction for this to be profitable.  We've come up with a way to evaluate the cost function using sklearns make_scorer [Make scorer documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html#sklearn.metrics.make_scorer).  To use this correctly, we will need to multiply the resulting confusion matrices by the weights we were given by Dr. Slater.  Below is the code for how this was performed.

In [2]:
def custom_loss(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    weight = np.array([[0, 10], [500, 0]])
    out = cm * weight
    return out.sum()/cm.sum()

This function takes in the target columns true and predicted values, then multiplys the corresponding false positive/negatives by the dollar amount penalty for each.  It then divides the outcome by the sum of the confusion matrix to give us a dollar amount lost per incorrect prediction.  


### Baseline Random Forest
As was introduced earlier, our baseline random forest was run on our dataset after it had been cleaned and prepped for analysis in the data cleaning section. We used an 80-20 test split and stratified the target variable, using a 5 fold cross validation.  Our base prediction of the Random Forest Model built 300 tree's yielding quite good results as shown below.

Baseline Random Forrest:
Accuracy of Baseline RF: 92.49 %

Confusion Matrix:
 
| 18374 | 787 |
|-------|-------|
| 1617 | 11222 |

Custom Cross Validation Score:

(22.96359944 24.6265625  20.7203125  20.00625    23.20518831)

 Classification Report
             
                precision    recall  f1-score   support

           0       0.92      0.96      0.94     19161
           1       0.93      0.87      0.90     12839

    accuracy                           0.92     32000
    macro avg      0.93      0.92      0.92     32000
    weighted avg   0.93      0.92      0.92     32000

As you can see from the custom cross validation scores, on average a false positive/negative only costs our business partner around $23.30 for each incorrect prediction.   


### Random Forest with Permutation Importance

Next, we used permutation importance in order to determine features which are important for generalization. An excellent discussion of permutation importance and other importance tools can be seen in [this blog post by fast.ai](https://explained.ai/rf-importance/) and [this blog post by the authors](https://josephsdavid.github.io/iml.html.  The basic mechanism is this.  The Random Forest will record a baseline accuracy by sending a validation set through model.  Then it permutes a single column and passes that column back to the test set.  Recomputes the accuracy and takes the difference of the two.  While expensive computationally, it can lead to an interesting insight as to which features are important in a classifier.

Accuracy of RF w permutation importance: 93.28 %

Confusion Matrix:

| 18355 | 806 |
|-------|-------|
| 1345 | 11494 |

Custom Cross Validation Score:

(22.55929058 21.17351849 -19.50295614 -20.18493057 -20.40770024)

Classification Report

               precision    recall  f1-score   support

           0       0.93      0.96      0.94     19161
           1       0.93      0.89      0.91     12839

    accuracy                           0.93     32000
    macro avg      0.93      0.93      0.92     32000
    weighted avg   0.93      0.93      0.93     32000

This model did very well showing our business partner a cost of \$20.77 for each incorrect prediction. So far this model has the lead with both savings and accuracy.


### Random Forest with Prinicipal Component Analysis

We introduced PCA in the feature engineering section of this case study but will now implement it. Another random forest was run with reduced dimensions set at n_components = 36.  As our earlier chart showed that would be a proper amount of features to select to maintain our goal of 95% variance.

Accuracy of RF w PCA: 83.25 %
Confusion Matrix:

| 17496 | 1665 |
|-------|------|
| 3695 | 9144 |

Custom Cross Validation Score:

(33.12294954 33.0703125  28.484375   34.0953125  32.66760431)

Classification Report

               precision    recall  f1-score   support

           0       0.83      0.91      0.87     19161
           1       0.85      0.71      0.77     12839

    accuracy                           0.83     32000
    macro avg      0.84      0.81      0.82     32000
    weighted avg   0.83      0.83      0.83     32000

Here we see the classification accuracy lower 9% points lower and the cost function yielding an average of \$32.29 per wrong prediction.  Seeing as our business partner wants to save money rather than spend more, we don't want him/her to want to pay \$10 more per wrong prediction.  We don't suggest dimensionality reduction to optimize cost savings.  


Next, we implement a Logistic Regression Model as its another popular model for classification.  


### Logistic Regression

Now that we've seen a few models perform, lets turn our attention to logistic regression.  Our business partner has predicted that one might be able to achieve excellent accuracies with this particular model so we're interested in its implementation.  


Baseline Logistic Regression:
Accuracy of Logistic Regression: 70.44 %
Confusion Matrix:

| 1590 | 3260 |
|------|------|
| 6200 | 6639 |

Custom Cross Validation Score:

(97.47509863 97.85672435 98.18398437 96.64205633 99.56365483)

Classification Report

              precision    recall  f1-score   support
           0       0.72      0.83      0.77     19161
           1       0.67      0.52      0.58     12839

    accuracy                           0.70     32000
    macro avg      0.70      0.67      0.68     32000
    weighted avg   0.70      0.70      0.70     32000


Sadly our initial thoughts on this being an overly successful algorithm with the dataset is not working as expected.  Accuracy has dropped to 70.34% and on average, our business partner is losing almost \$100 for every wrong prediction.  With that being said, Its probably safe to say that tuning a logistic regression at this point isn't going to recover 22% points in accuracy in order to catch up to Random Forest.  


### Extra Tree's Classifier

Lastly, we'll try implementing a Extra Tree's Classifer see if we can improve upon our baseline random forest.  The advantage of Extra Tree's is that this algorithm fits a number of randomized tree's on sub-samples of the dataset.  Using averaging to improve predictive accuracy and prevent over-fitting.  [Extra Tree's Docs](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html)

Baseline Extra Random Forest:
Accuracy of Baseline ERF: 92.38 %

Confusion Matrix: 

| 18473 | 688 |
|-------|-------|
| 1750 | 11089 |

Custom Cross Validation Score:

(30.3953125 32.35      25.8765625 27.7890625 29.534375)

Classification Report

                precision    recall  f1-score   support
           0       0.91      0.96      0.94     19161
           1       0.94      0.86      0.90     12839
    accuracy                           0.92     32000
    macro avg      0.93      0.91      0.92     32000
    weighted avg   0.92      0.92      0.92     32000

The Extra Tree's algorithm performed very well showing just 0.01% less accuracy than the base random forest.  


## Results

Between all the models we've run, it Random Forest with permutation importance performed best barely beating out Extra Tree's Classifier by 0.8%.  Dimensionality reduction with PCA and another Random Forrest didn't pan out well for our purposes, while Logistic Regression came in last place in terms of accuracy and cost savings.  Therefore, our suggestion to our business partner is stick with Random Forest with Permutation Importance for their classification needs in cost savings.  

| Model | Accuracy | Custom Scoring Loss |
|-------------------------------------------|----------|---------------------|
| Random Forest with Permutation Importance | 93.28% | $20.77 |
| Random Forest | 92.49% | $23.30 |
| Extra Tree's | 92.48% | $29.19 |
| Random Forest with PCA | 83.25% | $32.29 |
| Logistic Regression | 70.44% | $97.94 |





## Code Appendix


In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import csv
from typing import Dict, Any
from collections import Counter
from sklearn.metrics import make_scorer
from sklearn.metrics import confusion_matrix


def custom_loss(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    weight = np.array([[0, 10], [500, 0]])
    out = cm * weight
    return out.sum() / cm.sum()  # y_true.shape[0]


slater_loss = make_scorer(custom_loss, greater_is_better=False)

"""
data loading
First we define some helper functions
"""


def load_data(path):
    """
    read a csv in from a dict
    """
    result = {}
    reader = csv.DictReader(open(path))
    for row in reader:
        for k, v in row.items():
            result.setdefault(k, []).append(v)
    return result


def _test_value(x: Any) -> bool:
    """
    test if a column is continuous or not
    """
    try:
        float(x)
        return True
    except ValueError:
        return False


def cleanup(d: Dict[str, str]) -> Dict[str, np.ndarray]:
    """
    turn default continuous columns into floats, replace non alphanumeric with np.nan
    """
    res = {}
    for k, v in d.items():
        if _test_value(v[0]):
            res[k] = np.array([float(x) if _test_value(x) else np.nan for x in v])
        else:
            res[k] = np.array(v)
    return res


def convert_dollars_percs(x: np.ndarray) -> np.ndarray:
    """replace dollar signs and percentages, as well as rogue negative signs"""
    out = [
        x[i].replace("$", "").replace("%", "").replace("-", "")
        for i in range(x.shape[0])
    ]
    out = np.array([float(z) if _test_value(z) else np.nan for z in out])
    return out


def impute_cats(
    d: Dict[str, np.ndarray], c: Dict[str, Counter]
) -> Dict[str, np.ndarray]:
    """
    mode impute categorical variables, given a dictionary of counts and a
    dictionary of data
    """
    for k in c.keys():
        d[k][np.isnan(d[k])] = c[k].most_common(1)[0][0]
    return d


def load_dataset(path: str) -> Dict[str, np.ndarray]:
    data = load_data(path)
    data = cleanup(data)
    for k in ["x32", "x37"]:
        data[k] = convert_dollars_percs(data[k])
    # cats are continent, month, day, and percentage. We have those enumerated
    # and mode imputed
    cats = [k for k, v in data.items() if not _test_value(v[0])]
    # percentage is also a categorical variable
    cats.append("x32")
    # give it some nans
    cont_dict = {"": np.nan}
    for idx, k in enumerate(list(set(data["x24"]))[1:]):
        cont_dict[k] = idx
    data["x24"] = np.array([cont_dict[v] for v in data["x24"]])
    day_dict = dict(
        zip(["monday", "tuesday", "wednesday", "thurday", "friday"], range(0, 5))
    )
    day_dict[""] = np.nan
    data["x30"] = np.array([day_dict[v] for v in data["x30"]])
    months = [
        "January",
        "Feb",
        "Mar",
        "Apr",
        "May",
        "Jun",
        "July",
        "Aug",
        "sept.",
        "Oct",
        "Nov",
        "Dev",
    ]
    month_dict = dict(zip(months, range(0, 12)))
    month_dict[""] = np.nan
    data["x29"] = np.array([month_dict[v] for v in data["x29"]])
    cat_dict = {k: Counter(data[k]) for k in cats}
    data = impute_cats(data, cat_dict)
    return data


def cyclical(x, period):
    # http://blog.davidkaleko.com/feature-engineering-cyclical-features.html
    """
    sine cosine transformation for days and months
    """
    s = np.sin(x * (2.0 * np.pi / period))
    c = np.cos(x * (2.0 * np.pi / period))
    return s, c


x = load_dataset("../data/final_project.csv")
sc = [cyclical(m, 5) for m in x["x30"]]
x["x30s"] = np.stack(sc, axis=0)[:, 0]
x["x30c"] = np.stack(sc, axis=0)[:, 1]
sc = [cyclical(m, 12) for m in x["x29"]]
x["x29s"] = np.stack(sc, axis=0)[:, 0]
x["x29c"] = np.stack(sc, axis=0)[:, 1]

# y variable
y = x.pop("y")


# categorical variables with continent asia or not 
def categorize_with_asia(d: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
    d = d.copy()
    continent = "x24"
    asia = Counter(d[continent]).most_common(1)[0][0]
    d[continent] = np.array([1.0 if x == asia else 0.0 for x in d[continent]])
    perc = "x32"
    n_percs = np.unique(d[perc]).shape[0]
    enum_dict = dict(zip(np.unique(d[perc]), range(np.unique(d[perc]).shape[0])))
    enum_percs = np.array([enum_dict[x] for x in d[perc]])
    d[perc] = np.eye(n_percs)[enum_percs]
    to_drop = ["x2", "x41", "x29", "x30"]
    for k in to_drop:
        d.pop(k, None)
    return d


x_encoded = categorize_with_asia(x)
X = np.column_stack(list(x_encoded.values()))


from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier


X = SimpleImputer().fit_transform(X)

#Baseline Random Forest
rfc_1 = RandomForestClassifier(n_estimators=300, n_jobs=-1, verbose=2)
rfc_1_score = cross_val_score(
    rfc_1, X, y, cv=5, scoring=slater_loss, n_jobs=-1, verbose=1
)
print(rfc_1_score)


# happy matrix
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
rf = RandomForestClassifier(n_estimators=300, n_jobs=-1, verbose=2)
rf.fit(X_train, y_train)
ppp = rf.predict(X_test)
print(confusion_matrix(y_test, ppp))


def permutation_importances(rf, X_train, y_train, metric):
    baseline = metric(rf, X_train, y_train)
    imp = []
    for i in range(X_train.shape[1]):
        save = X_train[:, i].copy()
        X_train[:, i] = np.random.permutation(X_train[:, i])
        m = metric(rf, X_train, y_train)
        X_train[:, i] = save
        imp.append(baseline - m)
    return np.array(imp)


imps = permutation_importances(rf, X_train, y_train, slater_loss)

for i in range(imps.shape[0]):
    plt.barh(i, imps[i])
plt.axvline(imps.mean())
plt.show()

keep_vars = [i for i in range(imps.shape[0]) if imps[i] > imps.mean()]


X_small = X[:, keep_vars].copy()

# Random forest with permutation importance:

rfc_s = RandomForestClassifier(n_estimators=300, n_jobs=-1, verbose=2)
rfc_s_score = cross_val_score(
    rfc_s, X_small, y, cv=5, scoring=slater_loss, n_jobs=-1, verbose=1
)
print(rfc_s_score)


# Random Forest scaled:

rfc_s_scaled = RandomForestClassifier(n_estimators=300, n_jobs=-1, verbose=2)
rfc_ss_score = cross_val_score(
    rfc_s_scaled,
    StandardScaler().fit_transform(X_small),
    y,
    cv=5,
    scoring=slater_loss,
    n_jobs=-1,
    verbose=1,
)
print(rfc_ss_score)


erfc_s = ExtraTreesClassifier(n_estimators=500, n_jobs=-1, verbose=2)
erfc_s_score = cross_val_score(
    erfc_s, X_small, y, cv=5, scoring=slater_loss, n_jobs=-1, verbose=1
)
print(erfc_s_score)










# logreg lives here
from sklearn.metrics import mean_squared_error, r2_score, recall_score, confusion_matrix, make_scorer,classification_report
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.decomposition import PCA
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import time

%time df = pd.read_csv("../data/final_project.csv", sep=",", header=0)

#Check ze rows
print(len(df))
print(df.shape)

# %%
df.info()
df.describe()

#%%
#Looking at nulls
df.isnull().sum()

#%%
#  Looking at missing data
t_nulls = df.isnull().sum().sort_values(ascending=False)
perc = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)

missing_vals = pd.concat([t_nulls, perc], axis=1, keys=["Total", "Missing Percent"])
missing_vals["Missing Percent"] = missing_vals['Missing Percent'].apply(lambda x: x *100)

print(missing_vals)

#%%

#Looking at categorical.
df_cat = df.describe(include=['object'])
print(df_cat.T)

#renaming columns
df.rename(columns={'x24':'continent', 'x29':'month', 'x30':'day'},inplace=True)

#%%
# Data Cleaning.

df['x37'] = df['x37'].str.replace('$', '').astype(float)
df['x32'] = df['x32'].str.replace('%', '').astype(float)


df['continent'] = df['continent'].str.replace('euorpe','europe')
df['month'] = df['month'].str.replace('Dev','Dec')
df['month'] = df['month'].str.replace('sept.','Sep')
df['day'] = df['day'].str.replace('thurday','thursday')

#Fill NA's with the median
for col in df.select_dtypes(include=['float64']).columns:
    df[col] = df[col].fillna(df[col].mean())

#%%
# EDA

table=pd.crosstab(df['day'],df['y'])
table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)
plt.title('Frequency of Target vs day of the week')
plt.xlabel('Day')
plt.ylabel('Frequency')

table=pd.crosstab(df['month'],df['y'])
table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)
plt.title('Frequency of Target vs month')
plt.xlabel('Month')
plt.ylabel('Frequency')



# Check numerical histograms of data
df.hist(bins=50, figsize = (20,15))


#%%
# heatmap
plt.figure(figsize=(20,10))
sns.heatmap(df.corr().round(1),vmax=1, annot=True, cmap = 'YlGnBu',annot_kws={"fontsize":10})

#%%
X = df.drop('y', axis = 1)
y = df['y']

# Adding in a random noise componenet to test feature importance
#X['Random'] = np.random.random(size=len(X))

# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y, random_state = 42)

print("\nChecking shape of test/train data")
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)



#%%

#Scaling Data and prepping for RF and PCA
X1 = X_train.copy()
y1 = y_train.copy()

# Drop vars
drop_col = ['day','month','continent', 'x2','x41']

# Dropping from xtrain and xtest
X1 = X1.drop(drop_col, axis=1)
X_test_sc = X_test.drop(drop_col, axis=1)

scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X1)
X_test_sc = scaler.transform(X_test_sc)
y1 = np.array(y1)

#Setting up loss function
def custom_loss(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    weight = np.array([[0, 10], [500, 0]])
    out = cm * weight
    return out.sum()/cm.sum()


#%%

# Baseline RF

#Fit the model
slater_loss = make_scorer(custom_loss, greater_is_better=True)

#%%
# PCA with no components
pca = PCA().fit(X_train_sc)
pca_trans = pca.transform(X_train_sc)

print("\nThe components are as follows:\n {}".format(pca.components_))
print("\nThe explained variance is :\n {}".format(pca.explained_variance_))


plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.title("PCA Analysis")
plt.xlabel('number of components',fontsize=15)
plt.ylabel('cumulative explained variance', fontsize=15)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.axvline(linewidth=3, color='r', linestyle='--', x = 36, ymin=0)
plt.axhline(y=0.95, xmin=0, color='r', linestyle='--')
plt.show()

#%%
# PCA with 36 components.  Which retains 95% of the variation
pca = PCA(n_components=36).fit(X_train_sc)
X_train_sc_pca = pca.transform(X_train_sc)
X_test_sc_pca = pca.transform(X_test_sc)

# Now rfc on the reduced data
rfc_2 = RandomForestClassifier(n_estimators=300, n_jobs=-1, verbose=2)
%time rfc_2.fit(X_train_sc_pca, y1)

y_pred_pca = rfc_2.predict(X_test_sc_pca)

# Custom Loss function
slater_loss= make_scorer(custom_loss, greater_is_better=True)

rfc_2_cf = confusion_matrix(y_test, y_pred_pca)
rfc_2_score = cross_val_score(rfc_2, X_test_sc_pca, y_pred_pca, cv=5, scoring=slater_loss)

print("\nRandom Forest w PCA:")
print('Accuracy of RF w PCA: {:.2f}'.format(rfc_2.score(X_test_sc_pca, y_test)*100),'%')
print("Confusion Matrix:\n",rfc_2_cf )
print("Custom Cross Validation Score:\n", rfc_2_score)
print("Classification Report", classification_report(y_test, y_pred_pca))


# %%
#Log reg

#Reimporting the dat
X = df.drop('y', axis = 1)
y = df['y']

# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y, random_state = 42)

X1_train = X_train.copy()
y1_train = y_train.copy()


# Drop vars
drop_col = ['day','month','continent', 'x2','x41']

# Dropping from xtrain and xtest
X1_train = X1_train.drop(drop_col, axis=1)
X1_test = X_test.drop(drop_col, axis=1)

#Scaling
scaler = StandardScaler()
X1_train_sc = scaler.fit_transform(X1_train)
X1_test_sc = scaler.transform(X1_test)
y1_train = np.array(y1_train)

def custom_loss(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    weight = np.array([[0, 10], [500, 0]])
    out = cm * weight
    return out.sum()/cm.sum()

# Logistic Regression
lr_1 = LogisticRegression(penalty='l2')
lr_1.fit(X1_train_sc, y1_train)
y_pred = lr_1.predict(X1_test_sc)

slater_loss = make_scorer(custom_loss, greater_is_better=True)
lr_1_score = cross_val_score(lr_1,
							StandardScaler().fit_transform(X1),
							y1,
							cv=5,
							scoring = slater_loss,
							n_jobs=-1,
							verbose=1)

lr_confusion = confusion_matrix(y_test, y_pred)


print("Baseline Logistic Regression:")
print('Accuracy of Logistic Regression: {:.2f}'.format(lr_1.score(X1_test_sc, y_test)*100),'%')
print("Confusion Matrix:\n", lr_confusion)
print("Custom Cross Validation Score:\n", lr_1_score)
print("Classification Report", classification_report(y_test, y_pred))

