# Your Task

__FROM:__ Guido Rossum<br>
__Subject:__ Building the Models

Hi,

Now that you have successfully imported, prepared and explored the data, you are ready to start exploring some possible tools for your analysis. When you worked in R you used the caret package for many of your machine learning and data mining tasks. Python has a similar library called Sci-Kit Learn that the client has specifically asked us to use because it is likely to be compatible with a custom software solution they plan to deploy.

In this task you’ll build your models just as you have done previously, but with a different set of tools. As you progress, remember the following:

1. Let the data tell the story – don't make any assumptions.
2. It is often best to build three or more models and compare the results.
3. Make sure you have chosen the correct tools for the type of data you have.

I suggest you start this task with a quick orientation on Sci-Kit Learn to become familiar with the benefits of using it and how to use it effectively for this project. 

GR 

Guido Rossum
Senior Data Scientist
Credit One
www.creditonellc.com

# Introduction

Now that you have properly prepared and thoroughly explored the data, it's time to begin the modeling process. Throughout this task you will examine feature selection and model building through the use of the Python module called Sci-Kit Learn. It is very important for you to understand that this task uses the CreditOne data in a regression-type problem, but your final analysis will be centered on classification. The steps will be very similar, but you will need to replicate them in a different way and on different features and variables. Let's get started with an introduction to Sci-Kit Learn and how it differs from what you've already done with caret and R.

# Import Libs

In [1]:
import sys, os, warnings, importlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import scipy
import sklearn, sklearn.preprocessing, sklearn.linear_model, sklearn.datasets, sklearn.tree, sklearn.svm, sklearn.metrics, sklearn.ensemble, sklearn.model_selection

In [2]:
pd.options.display.max_columns = 1000
mpl.rcParams['font.size'] = 14

# Selecting and Dividing the Data

## Introduction to Sci-Kit Learn

You have already installed the Sci-Kit Learn library on your machine, so take a few minutes to come up to speed on how to use it. As you'll see, it is not very different from using caret in R, but there are a few key differences, specifically:

1. It is much faster than caret in R
2. The pipeline is easier to work with
3. Models can easily be serialized for deployment 
4. More metrics are available
5. Feature and variable selection is in the form of indices
6. Data is stored and accessed in arrays consisting of samples and features

## Data Structure

Import the 'default of credit card clients.csv' data

In [3]:
path_data_file = '../C05T01_Get_Started_With_Data_Science_and_Python/default of credit card clients.csv'
path_report_folder = './outputs'

df = pd.read_csv(path_data_file, header = 1)

display(df.head())
display(df.info())


Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
0,1,20000,2,2,1,24,2,2,-1,-1,-2,-2,3913,3102,689,0,0,0,0,689,0,0,0,0,1
1,2,120000,2,2,2,26,-1,2,0,0,0,2,2682,1725,2682,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,3,90000,2,2,2,34,0,0,0,0,0,0,29239,14027,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,4,50000,2,2,1,37,0,0,0,0,0,0,46990,48233,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,5,50000,1,2,1,57,-1,0,-1,0,0,0,8617,5670,35835,20940,19146,19131,2000,36681,10000,9000,689,679,0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 25 columns):
ID                            30000 non-null int64
LIMIT_BAL                     30000 non-null int64
SEX                           30000 non-null int64
EDUCATION                     30000 non-null int64
MARRIAGE                      30000 non-null int64
AGE                           30000 non-null int64
PAY_0                         30000 non-null int64
PAY_2                         30000 non-null int64
PAY_3                         30000 non-null int64
PAY_4                         30000 non-null int64
PAY_5                         30000 non-null int64
PAY_6                         30000 non-null int64
BILL_AMT1                     30000 non-null int64
BILL_AMT2                     30000 non-null int64
BILL_AMT3                     30000 non-null int64
BILL_AMT4                     30000 non-null int64
BILL_AMT5                     30000 non-null int64
BILL_AMT6               

None

## Selecting the Data

Select the features

In [4]:
headers_dict = {}
headers_dict['features'] = list(df.iloc[:,12:23].columns)
headers_dict['labels'] = ['PAY_AMT6']

display(headers_dict)

{'features': ['BILL_AMT1',
  'BILL_AMT2',
  'BILL_AMT3',
  'BILL_AMT4',
  'BILL_AMT5',
  'BILL_AMT6',
  'PAY_AMT1',
  'PAY_AMT2',
  'PAY_AMT3',
  'PAY_AMT4',
  'PAY_AMT5'],
 'labels': ['PAY_AMT6']}
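
The positional slice used above (`df.iloc[:,12:23]`) excludes its end index, which is easy to get wrong by one column. The sketch below shows the same idea on a tiny made-up frame and compares it with a label-based slice, which includes both endpoints. The frame `demo` and its values are hypothetical; only the column names echo the dataset.

```python
import pandas as pd

# Tiny hypothetical frame; values are made up for illustration
demo = pd.DataFrame({
    "ID": [1, 2],
    "BILL_AMT1": [100, 200],
    "PAY_AMT1": [10, 20],
    "PAY_AMT6": [5, 15],
})

# Positional selection: columns 1 and 2 (the end index 3 is exclusive)
by_position = list(demo.iloc[:, 1:3].columns)

# Label-based selection: both endpoints are inclusive, and no integer
# positions need to be known in advance
by_name = list(demo.loc[:, "BILL_AMT1":"PAY_AMT1"].columns)
```

Both selections return the same two columns here; the label-based form may be easier to audit when the column layout changes.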

In [5]:
X = df[headers_dict['features']]
y = df[headers_dict['labels']]

In [6]:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X,y)
print('X_train.shape',X_train.shape)
print('y_train.shape',y_train.shape)

X_train.shape (22500, 11)
y_train.shape (22500, 1)
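
The 22,500 training rows follow from the default `test_size` of 0.25 applied to 30,000 samples. Because the call above passes no `random_state`, each run produces a different split. A minimal sketch on made-up arrays, showing how fixing `random_state` makes the split repeatable (the arrays and the seed value 0 are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 samples x 2 features
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# Same seed twice, so both calls should return identical splits
split_a = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
split_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

X_tr, X_te, y_tr, y_te = split_a
```

A fixed seed makes it easier to compare models fairly, since every model then sees exactly the same training and test rows.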


## Building the Models and Choosing the Right Model

<font color='red'> Since I am already very familiar with sklearn, I will not follow the plan of attack step by step. Instead, I will use the custom Python package I typically use for ML: [JLpy_utils_package](https://github.com/jlnerd/JLpy_utils_package) </font>

### Import ML_models sub-package from JLpy_utils_package

In [7]:
path_desktop = '/Users/johnleonard/Desktop/'
sys.path.append(path_desktop)

from JLpy_utils_package import ML_models as ML

JLpy_utils_package mounted (repo: https://github.com/jlnerd/JLpy_utils_package.git)


### Fetch Models Dict

In [8]:
n_features = len(headers_dict['features'])
n_labels = len(headers_dict['labels'])

models_dict = ML.model_selection.models_dict.fetch.regression(n_features, n_labels, NeuralNets=True)

for key in models_dict.keys():
    print(key,' model_dict:\t',list(models_dict[key].keys()))

Linear  model_dict:	 ['model', 'param_grid']
DecisionTree  model_dict:	 ['model', 'param_grid']
RandomForest  model_dict:	 ['model', 'param_grid']
GradBoost  model_dict:	 ['model', 'param_grid']
SVM  model_dict:	 ['model', 'param_grid']
KNN  model_dict:	 ['model', 'param_grid']
DenseNet  model_dict:	 ['compiler', 'model', 'param_grid']


<font color='red'> Above, we see all the models we will evaluate against each other. Each type of model is defined by a model_dict, which contains the model object itself (e.g. sklearn.ensemble.RandomForestRegressor()) and a param_grid, which defines some default/typical hyperparameters we will run a grid search across to find the best configuration for that model. </font>
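
For readers without the custom package, the same idea can be sketched with plain scikit-learn. The dictionary below only mirrors the `'model'` / `'param_grid'` keys printed above; its contents, the name `demo_models`, and the synthetic data are assumptions for illustration and are not the package's actual defaults or implementation.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumption for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo[:, 0] * 3.0 + rng.normal(scale=0.1, size=120)

# Hypothetical structure: one entry per model, each with an estimator
# and the hyperparameter grid to search over
demo_models = {
    "Ridge": {"model": Ridge(), "param_grid": {"alpha": [0.1, 1.0, 10.0]}},
    "DecisionTree": {
        "model": DecisionTreeRegressor(random_state=0),
        "param_grid": {"max_depth": [2, 4, 8]},
    },
}

# Run a 5-fold grid search per model and keep the best configuration
best = {}
for name, entry in demo_models.items():
    search = GridSearchCV(entry["model"], entry["param_grid"], cv=5)
    search.fit(X_demo, y_demo)
    best[name] = (search.best_params_, search.best_score_)
```

For regressors, `best_score_` is the mean cross-validated R² unless a different `scoring` argument is passed, so higher is better.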

### Run Grid Search CV

In [14]:
importlib.reload(ML)
importlib.reload(ML.model_selection)
importlib.reload(ML.model_selection.GridSearchCV)

<module 'JL_ML_GridSearchCV' from '/Users/johnleonard/Desktop/JLpy_utils_package/JL_ML_models/JL_ML_model_selection/JL_ML_GridSearchCV.py'>

In [23]:
metrics = {'MSE':sklearn.metrics.mean_squared_error}

ML.model_selection.GridSearchCV.multi_model(
    models_dict,
    X_train,
    y_train,
    X_test,
    y_test,
    cv=5,
    retrain=False,
    metrics= metrics,
    path_root_dir='./outputs/GridSearchCV',
    n_jobs=-1,
)


---- Linear ----
path_model_dir: ./outputs/GridSearchCV/Linear
	best_csv_score: 0.12409304759168244
	best_pred_score: 0.10644756577942893
	 MSE : 272896365.2351728

---- DecisionTree ----
path_model_dir: ./outputs/GridSearchCV/DecisionTree
	best_csv_score: 0.060144389620854194
	best_pred_score: 0.04623765903314914
	 MSE : 291284838.1137017

---- RandomForest ----
path_model_dir: ./outputs/GridSearchCV/RandomForest
Fitting 5 folds for each of 540 candidates, totalling 2700 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  2.7min
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed: 14.5min
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed: 27.9min
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed: 50.3min
[Parallel(n_jobs=-1)]: Done 1234 tasks      | elapsed: 87.9min
[Parallel(n_jobs=-1)]: Done 1784 tasks      | elapsed: 123.9min


KeyboardInterrupt: 
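
The log line "5 folds for each of 540 candidates, totalling 2700 fits" reflects how grid search scales: the number of candidates is the product of the number of values per hyperparameter, and each candidate is fitted once per fold. The grid below is hypothetical and is NOT the package's actual RandomForest `param_grid`; it was chosen only so that the sizes multiply to the same count.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid (not the package's real one): 5 * 4 * 3 * 3 * 3 = 540
demo_grid = {
    "n_estimators": [10, 50, 100, 200, 500],
    "max_depth": [None, 5, 10, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": [1.0, "sqrt", "log2"],
}

# ParameterGrid enumerates every combination of the listed values
n_candidates = len(ParameterGrid(demo_grid))
cv_folds = 5
n_fits = n_candidates * cv_folds
```

Counting the fits before launching a search gives a quick estimate of the runtime; trimming one hyperparameter list from three values to two would cut the total by a third.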

# Making Predictions and Evaluating the Results