### Hello everyone to the Session 9 about Hyperparameter Tuning and Pipelines. 

# Hyperparameter Tuning

## What are these hyperparameters?
  * Parameters are the components that are learned by the model during the modelling process. 
  * Hyperparameters are NOT learned during the process, but instead set before the modelling process. 
    * Some are not improving our model performance, while some are. 
    * Take for example n_jobs in *RandomForestClassifier*, that simply describes how many jobs to run in parallel. 
    * Whereas, *n_estimators* describes the count of trees in the forest or *max_features* on how many features to split on. 

## Exercise 1 - Hyperparameter Tuning

### Exercise 1.0 - Import the telco dataset, split it to features and labels and training and test sets. 

* Import the telco churn dataset from Session 7. 
* Split the dataset between target and features. 
* Use 80/20 train_test_split with random_seed of 99.

The data is also available below.

In [1]:
import pandas as pd

# Import the data
churn_df = pd.read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vT7M4x2vLWiCYW9YrLrLxPQWj0XAG8h71lGMLfUJvzH1qsVR-fqpGYl53Luvi_B1xBe8qw1mKos-tFw/pub?output=csv")

In [30]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline
import pandas_profiling
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score,recall_score,precision_score,f1_score
from collections import Counter
from sklearn.metrics import roc_auc_score, accuracy_score
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier,GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

#### Exercise 1.0.1 - Create the target and features. 

In [None]:
X = churn_df.drop("Churn", axis=1)
y=churn_df["Churn"]



In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42, stratify=y)

#### Exercise 1.0.2 - Create test train split with 80/20 split, random seed of 99 and use stratification.

#### Exericse 1.0.3 - Import the StandardScaler from preprocessing library and instantiate the StandardScaler object.

In [13]:
scl = StandardScaler()

#### Exercise 1.0.4 - Using StandardScaler, fit-transform the features training set and transform the test set. 

In [14]:
X_train = scl.fit_transform(X_train)
X_test = scl.fit_transform(X_test)

### Exercise 1.1 - Hyperparameter tuning with for loop for KNeighborsClassifier. 
* Build estimators for 10, 20 and 30 K-neighbors. [Docs](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier.kneighbors)
* Fit each of the three estimators to training data.
* Create predictions of the fitted data. 
* Get accuracy score for each of the models.
- PS! Use loop. 

In [24]:
for n in range(10,31,10):
    knn = KNeighborsClassifier(n_neighbors=n)
    knn_fit= knn.fit(X_train, y_train)
    knn_pred= knn_fit.predict(X_test)
    knn_acc = accuracy_score(y_test,knn_pred)
    print("neighbours = ", n, "acc = ", knn_acc)
    
    

neighbours =  10 acc =  0.9523809523809523
neighbours =  20 acc =  0.9349206349206349
neighbours =  30 acc =  0.9301587301587302


#### 1.1.1 - The programmatic way. 

#### 1.2 Now repeat the process using RandomForestClassifier. [Docs](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)
- Use a combination of n_estimators of 300, 500, 800 and max_depth of 8, 10, 12, 14.
- How many models did we create? Count each of the iteration when using a for loop and print the total count. 
- Assign these values into a dataframe of *df_rf_accs*. Use columns *n_estimators, max_depth, accuracy*.
- What are the top three combinations from *df_rf_accs*?

In [28]:
rf_accs=[]
c=0

for n_estimator in [300,500,800]:
    for depth in range(8,15,2):
        rf = RandomForestClassifier(n_estimators = n_estimator, max_depth = depth, random_state= 97)
        rf_fit = rf.fit(X_train, y_train)
        rf_pred = rf_fit.predict(X_test)
        rf_accs.append([n_estimator,depth,accuracy_score(y_test,rf_pred)])
        c+=1
        
print(c)

12


### 1.3 - Now repeat the process using GradientBoostingClassifier [Docs](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier)
- Use 5 learning rates evenly spread between 0.01 and 1. 
- Use crossvalidation with 5 folds. 
- Round these values to 4th decimal. 
- Print out each learning rate with it's current iteration number, accuracies, mean accuracy and standard deviation.

#### How many times did we run the model?

In [34]:
from sklearn.model_selection import cross_val_score
for lr in np.linspace(.01,1,5):
    rounded_lr = lr.round(4)
    gbc = GradientBoostingClassifier(learning_rate = rounded_lr, random_state =97)
    gbc_cv_score = cross_val_score(gbc, X_train, y_train, cv=5, scoring ="accuracy")
    
    print( "gbc_score = ",gbc_cv_score)

gbc_score =  [0.89880952 0.9047619  0.91269841 0.89880952 0.8968254 ]
gbc_score =  [0.95238095 0.95039683 0.96031746 0.94444444 0.93452381]
gbc_score =  [0.94642857 0.95039683 0.96031746 0.95238095 0.94047619]
gbc_score =  [0.93849206 0.9484127  0.95833333 0.94246032 0.93253968]
gbc_score =  [0.95833333 0.95039683 0.93849206 0.93452381 0.92857143]


In [41]:

gbc_base = GradientBoostingClassifier(random_state =97)

### 1.4 - Now repeat the previous process using GradientBoostingClassifier together with Sklearns Grid Search.
* Use 1/2 of your available cores to parallelize the execution. 
* Use GradientBoostingClassifier as the estimator with random state of 77. 
* Use the following hyperparameters (params): 
    * Five learning rates between 0.2575 and 0.7525. 
    * Max depth of 2, 4, 6.
    * n-estimators of 120, 200, 280.
* Use 5 k-folds for cross-validation. 
* Use accuracy for scoring. 
* Refit the best hyperparameters.

PS. If you need help then refer back to your 10th Python class session

#### Use GridSearch properties to return it's log statistics. 
* What are the `best params`? 
* What is the `best score`? 
* What is the log statistics of the top five performers?
* How long does it take for the Grid Search?
* Given the range we used, did we find the global optimal hyperparameters? 

In [36]:
params = {
    'learning_rate' : [*np.linspace(.2575, .7525, 5).round(4)],
    'max_depth' : [*range(2,7,2)],
    'n_estimators' : list(range(150,451,150))
}

In [42]:
import os
from sklearn.model_selection import GridSearchCV
grid_s = GridSearchCV(
    estimator = gbc_base,
    param_grid = params,
    scoring = "accuracy",
    n_jobs = int((os.cpu_count()/2)),
    cv=5,
    refit=True)

In [43]:
grid_s.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=GradientBoostingClassifier(random_state=97),
             n_jobs=6,
             param_grid={'learning_rate': [0.2575, 0.3812, 0.505, 0.6287,
                                           0.7525],
                         'max_depth': [2, 4, 6],
                         'n_estimators': [150, 300, 450]},
             scoring='accuracy')

In [52]:
grid_s.best_params_

AttributeError: 'DataFrame' object has no attribute 'best_params_'

In [53]:
grid_s.best_score_

AttributeError: 'DataFrame' object has no attribute 'best_score_'

#### 1.4.1 - Use the `os` library to find your cpu count. 

In [51]:
grid_s = pd.DataFrame(grid_s.cv_results_)
grid_s.sort_values(by="rank_test_score")

AttributeError: 'DataFrame' object has no attribute 'cv_results_'

#### 1.4.2 - Create GradientBoostingClassifier base model with set parameters of random state of 77.

#### 1.4.3 - Create your param grid. Try avoiding hardcoding the values, if you can.

#### 1.4.4 - Create the GridSearchCV object and pay attention to it's parameters also besides param_grid.

#### 1.4.5 - Fit the GridSearch object to our training features and target.

#### 1.4.6 - Print the best parameters and the their best score. 

#### 1.4.7 - Create a dataframe of the fitted results of the Grid Search object. Return the 5 highest ranked cross-validation mean test scores.

# 2.1 Column Transformers

#### Let's first create our dataframe that we will use. 

In [55]:
df_exercise = pd.DataFrame({
    'voted': [1, 0, 1, 0, 1],
    'city': ['Madrid', 'Madrid', 'Barcelona', 'Valencia', ''],
    'age': [23, 44, 88, 63, 31],
    'income': [20000, 40000, None, 35000, 30000],
    'gender': ['M', 'F', 'F', 'M', 'M']
})

df_exercise

Unnamed: 0,voted,city,age,income,gender
0,1,Madrid,23,20000.0,M
1,0,Madrid,44,40000.0,F
2,1,Barcelona,88,,F
3,0,Valencia,63,35000.0,M
4,1,,31,30000.0,M


Let's also split it into features and target.

In [56]:
features, target = df_exercise.drop("voted", axis=1), df_exercise["voted"]

#### 2.1.1 - Use Pandas to figure out what columns are NA. 
- What can you spot?
- Why is it so?

In [57]:
df_exercise.isnull().sum()

voted     0
city      0
age       0
income    1
gender    0
dtype: int64

#### 2.1.2 - Create a missing data transformer using ColumnTransformer [Docs](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) and SimpleImputer [Docs](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html). 
- Fill the `city` empty values with a constant 'Other'.
- Fill the `income` NaN's with mean.
- Fit transform the dataset using the `features`

In [58]:
features.isna()

Unnamed: 0,city,age,income,gender
0,False,False,False,False
1,False,False,False,False
2,False,False,True,False
3,False,False,False,False
4,False,False,False,False


#### 2.1.3 - Use OneHotEncoder to transform `city` and `gender` columns. 

#### 2.1.4 - Use StandardScaler to transform `income` and `age` columns. 

## 2.2 Pipelines
* Now that we better understand the intrinsics of Column Transformers, let's use `churn` dataset to apply Pipelines on it.

In this section, we will create exactly the same model but use pipelines.

In [9]:
from sklearn.pipeline import Pipeline

In [10]:
X, y = churn_df.drop("Churn", axis=1), churn_df["Churn"]

#### Let's create the train_test_split again. 

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42, stratify=y)

#### Creating a pipeline

Pipeline objects have the following syntax

```
pipeline = Pipeline(
    steps=[
      ('step_name1', function_to_execute_1()),
      ('step_name2', function_to_execute_2()),
      ('step_name3', function_to_execute_3())
      ]
    )
```
Create a pipeline that has the same steps as in the first model.

#### Exercise 2.2.1 - Create pipeline for including a StandardScaler and RandomForestClassifier.

#### Exercise 2.2.2 - Fit the pipeline to the training data set. 

#### Exercise 2.2.3 - Define param_grid.
* Use a combination of n_estimators of 300, 500, 800 and max_depth of 4, 6, 8.

Parameter grids are dictionaries where the parameter we want to search over is the key, and the parameter values are a list of values.

```
param_grid = {
    'stepname__parameter_one':[0, 1, 2, 3, 4],
    'stepname__parameter_two': [2, 3, 5, 7, 11, 13]
}
```


#### Exercise 2.2.4 - Define a GridSearch Object
You can leveage [this](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) documenation. The estimator will be our pipeline and the paramater_grid the parameter grid. There are more optional arguments you can pass in.
* Use crossvalidation with 5 folds. 
* Use accuracy for scoring.

#### Exercise 2.2.5 Fit the Grid Search Model on the training data

#### Exercise 2.2.6 Get the best score and best params.