## Advanced tuning of parameters

In this tutorial, we apply the skills from previous tutorials to build a classifier using the `Pipeline` and `FeatureUnion` classes from scikit-learn.

In [72]:
# IMPORT PACKAGES
import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

### Data

We will build a binary classifier that predicts whether a person has diabetes, using information on the patient's health.

The data can be found [here](https://drive.google.com/file/d/1TvCKlmH3Z32XAKk-VUcZyYu95Ccyw3PO/view?usp=sharing). 


In [73]:
col_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
feature_names = col_names[:-1]
target_name = col_names[-1]
data = pd.read_csv('pima-indians-diabetes.csv', sep=';')
data.head(3)

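| Unnamed: 0 | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |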


In [74]:
X = data[feature_names]
y = data[target_name]

### Task

Build a classifier that predicts the target variable `class` from the remaining attributes. The model should be fitted using a pipeline that contains:
- PCA method
- SelectKBest method
- FeatureUnion
- Random Forest

Choose the best set of parameters using `Pipeline` and `GridSearchCV`.
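
To make the role of `FeatureUnion` concrete, here is a minimal sketch of how it concatenates the outputs of PCA and `SelectKBest` column-wise. It runs on synthetic data from `make_classification` as a stand-in for the diabetes table; the names `X_demo`, `y_demo`, `union`, and `X_union` are illustrative and not part of the tutorial's notebook.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

# Synthetic stand-in for the diabetes data (assumption): 200 rows, 8 numeric features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)

# Two transformers run side by side on the same input.
union = FeatureUnion([
    ("pca", PCA(n_components=2)),        # contributes 2 columns
    ("univ_select", SelectKBest(k=3)),   # contributes 3 columns
])

# y is passed through because SelectKBest scores features against the target.
X_union = union.fit_transform(X_demo, y_demo)
print(X_union.shape)  # (200, 5): 2 PCA components + 3 selected features
```

The classifier at the end of the pipeline therefore sees `n_components + k` columns, which is why both values are worth tuning together.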

> #### Note
> **This exercise focuses on implementing the pipeline. With only 8 feature columns in the dataset (9 including the target), PCA is probably not the best data-preparation technique from a methodological point of view.**

In [79]:
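# Feature extraction: 2 principal components plus the 3 best univariate features
pca = PCA(n_components=2)

selection = SelectKBest(k=3)

# FeatureUnion concatenates the outputs of both transformers column-wise
combined_features = FeatureUnion([('pca', pca), ('univ_select', selection)])

# Baseline: fit the forest on the raw features (combined_features is not used yet)
clf = RandomForestClassifier()
clf.fit(X, y)
y_pred = clf.predict(X)

# Accuracy on the training data itself, so it overstates generalization
accuracy_score(y, y_pred)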

1.0
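
The 1.0 above is measured on the same rows the forest was trained on, so it says little about performance on unseen patients. A minimal sketch of a held-out check follows; it uses synthetic data with some label noise (an assumption, not the diabetes file), and the names `X_demo`, `forest`, `train_acc`, and `test_acc` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption); flip_y adds label noise so the task is not trivial.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4, flip_y=0.2, random_state=0
)

# Hold out 25% of the rows; stratify keeps the class balance similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

forest = RandomForestClassifier(random_state=0)
forest.fit(X_train, y_train)

# Compare accuracy on rows the model has seen with rows it has not.
train_acc = accuracy_score(y_train, forest.predict(X_train))
test_acc = accuracy_score(y_test, forest.predict(X_test))
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

The gap between the two numbers is the reason the grid search in the next cell scores each candidate with cross-validation rather than training accuracy.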

In [76]:
pipeline = Pipeline([('features', combined_features), ('rand_forest', clf)])

params = {'features__pca__n_components': [1, 2, 3],
            'features__univ_select__k': [1, 2, 3],
            'rand_forest__max_depth': [1, 2, 3],
            'rand_forest__min_samples_split': [2, 3, 4],
            'rand_forest__min_samples_leaf': [1, 2, 3]
}

# create gridsearch object
grid = GridSearchCV(pipeline, params, verbose=10, refit=True)

# fit the model and tune params
grid.fit(X,y)

Fitting 5 folds for each of 243 candidates, totalling 1215 fits
[CV 1/5; 1/243] START features__pca__n_components=1, features__univ_select__k=1, rand_forest__max_depth=1, rand_forest__min_samples_leaf=1, rand_forest__min_samples_split=2
[CV 1/5; 1/243] END features__pca__n_components=1, features__univ_select__k=1, rand_forest__max_depth=1, rand_forest__min_samples_leaf=1, rand_forest__min_samples_split=2;, score=0.708 total time=   0.0s
[CV 2/5; 1/243] START features__pca__n_components=1, features__univ_select__k=1, rand_forest__max_depth=1, rand_forest__min_samples_leaf=1, rand_forest__min_samples_split=2
[CV 2/5; 1/243] END features__pca__n_components=1, features__univ_select__k=1, rand_forest__max_depth=1, rand_forest__min_samples_leaf=1, rand_forest__min_samples_split=2;, score=0.708 total time=   0.0s
[CV 3/5; 1/243] START features__pca__n_components=1, features__univ_select__k=1, rand_forest__max_depth=1, rand_forest__min_samples_leaf=1, rand_forest__min_samples_split=2
[CV 3/5; 
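
The first log line follows directly from the grid: five parameters with three values each give 3^5 = 243 candidates, and each candidate is fitted once per fold. The sketch below reproduces that arithmetic with `ParameterGrid` and checks that every key uses the `step__parameter` naming the pipeline expects. The fold count of 5 is taken from the log above; `n_candidates`, `n_folds`, and `missing` are illustrative names.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import FeatureUnion, Pipeline

# Same structure and step names as the tutorial's pipeline.
pipeline = Pipeline([
    ("features", FeatureUnion([("pca", PCA()), ("univ_select", SelectKBest())])),
    ("rand_forest", RandomForestClassifier()),
])

params = {
    "features__pca__n_components": [1, 2, 3],
    "features__univ_select__k": [1, 2, 3],
    "rand_forest__max_depth": [1, 2, 3],
    "rand_forest__min_samples_split": [2, 3, 4],
    "rand_forest__min_samples_leaf": [1, 2, 3],
}

# Every grid key must be a parameter name the pipeline actually exposes.
missing = set(params) - set(pipeline.get_params())
print(missing)  # set()

# Number of parameter combinations, and total fits with 5 folds (as in the log).
n_candidates = len(ParameterGrid(params))
n_folds = 5
print(n_candidates, n_candidates * n_folds)  # 243 1215
```

Counting candidates this way before calling `fit` is a cheap sanity check on how long a search is likely to take.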

In [77]:
print(grid.best_params_)
print(grid.best_estimator_)

{'features__pca__n_components': 2, 'features__univ_select__k': 3, 'rand_forest__max_depth': 3, 'rand_forest__min_samples_leaf': 1, 'rand_forest__min_samples_split': 4}
Pipeline(steps=[('features',
                 FeatureUnion(transformer_list=[('pca', PCA(n_components=2)),
                                                ('univ_select',
                                                 SelectKBest(k=3))])),
                ('rand_forest',
                 RandomForestClassifier(max_depth=3, min_samples_split=4))])


In [78]:
print(grid.best_score_)

0.7656735421441304
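
`best_score_` is the mean cross-validated score of the winning candidate only. To see how close the other candidates came, one option is to load `cv_results_` into a DataFrame. The sketch below does this for a deliberately small grid on synthetic data (both are assumptions chosen to keep the example fast); `small_grid` and `results` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption), not the diabetes file.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)

# A tiny grid: 2 x 2 = 4 candidates, 3 folds each.
small_grid = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    {"max_depth": [1, 3], "min_samples_leaf": [1, 3]},
    cv=3,
)
small_grid.fit(X_demo, y_demo)

# One row per candidate; rank 1 is the candidate reported by best_params_.
results = pd.DataFrame(small_grid.cv_results_)
cols = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score").to_string(index=False))
```

Comparing `mean_test_score` against `std_test_score` helps judge whether the top candidate is meaningfully better than the runners-up or within fold-to-fold noise.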
