## Training - Guideline

This notebook guides you through the development of your first Training Job.

The melitk.analytics library (which comes pre-installed) provides many functions and methods that do not appear in this demo. The goal of this demo is for you to understand how Fury Data Apps expects your training process to be structured, so that it runs successfully and generates the model (and, eventually, the metrics) you expect.

### Example

Here we read the dataset generated in the demo etl/etl.ipynb and train a scikit-learn random forest model (`RandomForestRegressor`) to predict the probability that a user recharges their cellphone.

## Imports

In [None]:
import pickle
import numpy as np

from melitk.fda import workspace
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
from shared.settings import DATASET_FILENAME, DS_SPLIT_SIZE, DS_SPLIT_SEED

import matplotlib.pyplot as plt
import seaborn as sns

plt.rc("font", size=14)
# A single call is enough: a later sns.set(style=...) overrides an earlier one
sns.set(style="whitegrid", color_codes=True)

## Initial Setup

## Reading the dataset

Load the dataset (that was saved in the ETL script) and split it into train and test sets.

In [None]:
serialized_dataset = workspace.load_etl_file(DATASET_FILENAME)

In [None]:
dataset = pickle.loads(serialized_dataset)  # Because in the ETL we pickled the pandas dataframe
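
The load above returns raw bytes, which is why an explicit `pickle.loads` is needed. As a minimal sketch of that round trip, assuming the ETL step serialized the DataFrame with `pickle.dumps` (the small `example_df` below is a made-up stand-in, not the real dataset):

```python
import pickle

import pandas as pd

# Hypothetical stand-in for the DataFrame produced by the ETL step
example_df = pd.DataFrame({
    "PAYMENTS_M1": [3, 0, 5],
    "TARGET": [1, 0, 1],
})

# What the ETL side is assumed to do: turn the DataFrame into bytes
serialized = pickle.dumps(example_df)

# What this notebook does after loading those bytes from the workspace
restored_df = pickle.loads(serialized)

# The round trip preserves values, column names, and dtypes
assert restored_df.equals(example_df)
```

Keep in mind that unpickling executes code embedded in the payload, so only load bytes that come from a source you trust.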

## Explore Data

In [None]:
dataset.head(2)

In [None]:
# Target Distribution
dataset.TARGET.value_counts(normalize=True)

In [None]:
# Boxplot
bp = sns.boxplot(data=dataset[['PAYMENTS_M3','PAYMENTS_M2','PAYMENTS_M1']], 
                 orient="h", 
                 palette="Set2")

## Split Data: Train - Test

In [None]:
# `test_size` and `random_state` are fixed in the settings so that this training is reproducible
train_df, test_df = train_test_split(dataset, test_size=DS_SPLIT_SIZE, random_state=DS_SPLIT_SEED)

# The target is assumed to be the last column of the dataset; every other column is a feature
y_train = train_df.iloc[:, -1]
X_train = train_df.iloc[:, :-1]

y_test = test_df.iloc[:, -1]
X_test = test_df.iloc[:, :-1]

# Keep the feature names: scaling turns the DataFrames into plain NumPy arrays
columns_name = X_train.columns
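
The positional split above relies on the target being the last column. As a sketch of a more defensive variant, assuming the target column is named `TARGET` (as in the exploration cells) and using made-up data plus hypothetical stand-ins for `DS_SPLIT_SIZE` and `DS_SPLIT_SEED`, the target can be selected by name:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data; the column names mirror the ones explored earlier
toy_df = pd.DataFrame({
    "PAYMENTS_M2": range(10),
    "TARGET": [0, 1] * 5,
    "PAYMENTS_M1": range(10, 20),
})

SPLIT_SIZE = 0.2  # hypothetical stand-in for DS_SPLIT_SIZE
SPLIT_SEED = 42   # hypothetical stand-in for DS_SPLIT_SEED

toy_train, toy_test = train_test_split(toy_df, test_size=SPLIT_SIZE, random_state=SPLIT_SEED)

# Selecting by name works wherever the target column sits in the frame
toy_y_train = toy_train["TARGET"]
toy_X_train = toy_train.drop(columns=["TARGET"])

# The same seed yields the same split on every run
toy_train_again, _ = train_test_split(toy_df, test_size=SPLIT_SIZE, random_state=SPLIT_SEED)
assert toy_train_again.index.equals(toy_train.index)
```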

## Initialize and train the model

In [None]:
# Fit the scaler on the training set only, then apply the same transformation to the test set
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

r_forest = RandomForestRegressor(n_estimators=20, random_state=0)
r_forest.fit(X_train, y_train)

## Saving the trained model

This part is essential. If we do not save the trained model, we will not be able to use it later or retrieve its results.

In [None]:
serialized_model = pickle.dumps(r_forest)
workspace.save_raw_model(serialized_model)
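
Note that pickling only `r_forest` leaves the fitted `StandardScaler` behind, so whoever loads the model must scale its inputs in the same way. One possible approach, sketched here on synthetic data and not something the Fury workflow is known to require, is to bundle both steps in a scikit-learn `Pipeline` and pickle that single object:

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

# The scaler and the model travel together as one estimator
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("forest", RandomForestRegressor(n_estimators=20, random_state=0)),
])
pipeline.fit(X_demo, y_demo)

# These bytes play the same role as `serialized_model` above
pipeline_bytes = pickle.dumps(pipeline)

# After unpickling, raw (unscaled) inputs can be fed directly
restored_pipeline = pickle.loads(pipeline_bytes)
assert np.allclose(restored_pipeline.predict(X_demo), pipeline.predict(X_demo))
```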

## Saving some model metrics

Although you can obtain a trained model without metrics, generating them lets you compare trained models and their performance from the Train front-end in your Fury Data App. We strongly encourage you to store some key metrics that measure your model's performance, as they give you an idea of how your model will perform on more realistic data.

In [None]:
y_pred = r_forest.predict(X_test)

# Compute the MSE once and reuse it for the RMSE
mse = metrics.mean_squared_error(y_test, y_pred)
metrics_dict = {
    'mean_absolute_error': metrics.mean_absolute_error(y_test, y_pred),
    'mean_squared_error': mse,
    'root_mean_squared_error': np.sqrt(mse)
}

In [None]:
workspace.save_metrics(metrics_dict)
loaded_metrics = workspace.load_metrics()

In [None]:
loaded_model_b = workspace.load_raw_model()
loaded_model = pickle.loads(loaded_model_b)

In [None]:
# Just for validation purposes
assert loaded_model.__class__ == r_forest.__class__
assert loaded_metrics == metrics_dict

## Model Metrics

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
import pandas as pd

### Confusion Matrix

In [None]:
y_pred = r_forest.predict(X_test)

In [None]:
# Binarize the regressor's scores: values above 0.5 become class 1
y_pred = np.where(y_pred <= 0.5, 0, 1)

In [None]:
confusion_matrix(y_test, y_pred)
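
To read the matrix above: scikit-learn puts the true classes on the rows and the predicted classes on the columns, so for a binary problem `ravel()` yields the counts in the order TN, FP, FN, TP. A small hand-made example (the labels below are illustrative, not taken from the dataset):

```python
from sklearn.metrics import confusion_matrix

# Tiny hand-made example
toy_true = [0, 0, 1, 1, 1]
toy_pred = [0, 1, 1, 1, 0]

cm = confusion_matrix(toy_true, toy_pred)

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 1 1 1 2
```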

In [None]:
# Confusion Matrix - Custom Plot
from confusion_matrix_plot import confusion_matrix_analysis
plt.rc("font", size=14)
sns.set(style="whitegrid", color_codes=True)

confusion_matrix_analysis(y_test, y_pred, [0,1], ymap=None, figsize=(4,4))

### Accuracy

In [None]:
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

### K-S

In [None]:
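# Score once and reuse the raw (unthresholded) predictions: AUC and the ROC curve rank by score
y_score = r_forest.predict(X_test)
roc_auc = roc_auc_score(y_test, y_score)
fpr, tpr, thresholds = roc_curve(y_test, y_score)

# K-S is the largest gap between the true positive rate and the false positive rate
print("K-S:", max(tpr - fpr))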
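
The K-S (Kolmogorov-Smirnov) statistic measures the largest vertical distance between the cumulative score distributions of the negative and positive classes, which is why it can be read off the ROC curve as the maximum of `tpr - fpr`. The sketch below checks that equivalence on synthetic scores (the data and variable names are illustrative assumptions):

```python
import numpy as np
from sklearn.metrics import roc_curve

# Synthetic scores: positives tend to score higher than negatives
rng = np.random.default_rng(0)
ks_true = np.array([0] * 200 + [1] * 200)
ks_score = np.concatenate([rng.normal(0.3, 0.1, 200), rng.normal(0.6, 0.1, 200)])

# K-S from the ROC curve, as in the cell above
ks_fpr, ks_tpr, _ = roc_curve(ks_true, ks_score)
ks_from_roc = np.max(ks_tpr - ks_fpr)

# K-S from the empirical CDFs of each class, evaluated at every observed score
grid = np.sort(ks_score)
cdf_neg = np.searchsorted(np.sort(ks_score[ks_true == 0]), grid, side="right") / 200
cdf_pos = np.searchsorted(np.sort(ks_score[ks_true == 1]), grid, side="right") / 200
ks_from_cdfs = np.max(cdf_neg - cdf_pos)

assert np.isclose(ks_from_roc, ks_from_cdfs)
```

A value near 1 means the two classes are almost perfectly separated by score, while a value near 0 means the score distributions overlap almost entirely.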

### Performance Report

In [None]:
print(classification_report(y_test, y_pred))

### ROC Curve

In [None]:
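y_score = r_forest.predict(X_test)
fpr, tpr, thresholds = roc_curve(y_test, y_score)
roc_auc = roc_auc_score(y_test, y_score)  # area under the curve, computed from the raw scores
plt.figure()
plt.plot(fpr, tpr, label='Model AUC (area = %0.2f)' % roc_auc)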
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()
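
When reporting an AUC alongside a ROC curve, it matters whether it is computed from the raw scores or from labels already cut at a threshold: the ROC curve ranks observations by score, and thresholding discards that ranking. A hand-made example (values chosen for illustration only) where the two numbers differ:

```python
from sklearn.metrics import roc_auc_score

# Hand-made example where every score falls below the 0.5 cut-off
auc_true = [0, 0, 1, 1]
auc_score = [0.1, 0.3, 0.2, 0.4]

# Ranking quality of the raw scores
auc_from_scores = roc_auc_score(auc_true, auc_score)

# Thresholding first collapses every prediction to class 0
auc_labels = [1 if s > 0.5 else 0 for s in auc_score]
auc_from_labels = roc_auc_score(auc_true, auc_labels)

print(auc_from_scores, auc_from_labels)  # 0.75 0.5
```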

## Feature Importances

In [None]:
feature_importances = pd.DataFrame(r_forest.feature_importances_,
                                   index=columns_name,
                                   columns=['importance']).sort_values('importance', ascending=False)

In [None]:
feature_importances