## Advanced tuning of parameters

In this tutorial, we apply the skills from previous tutorials to build a classifier using the `Pipeline` and `FeatureUnion` classes from sklearn.

In [1]:
# IMPORT PACKAGES

In [2]:
from sklearn.svm import SVC
import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion

from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

### Data

We will build a binary classifier that predicts whether a person has diabetes, using information on the patient's health.

The data can be found [here](https://drive.google.com/file/d/1TvCKlmH3Z32XAKk-VUcZyYu95Ccyw3PO/view?usp=sharing). 


In [3]:
col_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

In [4]:
df = pd.read_csv("pima-indians-diabetes.csv",sep=";")

In [5]:
df.head()

Unnamed: 0,preg,plas,pres,skin,test,mass,pedi,age,class
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [6]:
X, y = df.loc[ : , df.columns != 'class'], df['class']
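
The `df.head()` output above shows an `Unnamed: 0` column, which suggests that the CSV stores a saved row index, and the split above keeps that column in `X` as if it were a feature. The sketch below shows one way to read the index as an index instead. It assumes the file layout is `;`-separated with an empty first header (an assumption based on the `read_csv` call and the `head()` output), and the names `sample_csv`, `sample_df`, `X_sample` and `y_sample` are illustrative. The results recorded later in this notebook were obtained with the original split, so they would likely change.

```python
import io

import pandas as pd

# Five rows copied from the df.head() output, written in the ';'-separated
# layout assumed above. The empty first header is what pandas would
# otherwise name "Unnamed: 0".
sample_csv = """;preg;plas;pres;skin;test;mass;pedi;age;class
0;6;148;72;35;0;33.6;0.627;50;1
1;1;85;66;29;0;26.6;0.351;31;0
2;8;183;64;0;0;23.3;0.672;32;1
3;1;89;66;23;94;28.1;0.167;21;0
4;0;137;40;35;168;43.1;2.288;33;1
"""

# index_col=0 treats the first column as the row index, not as a feature.
sample_df = pd.read_csv(io.StringIO(sample_csv), sep=";", index_col=0)

X_sample = sample_df.drop(columns="class")
y_sample = sample_df["class"]
print(list(X_sample.columns))
```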

### Task

Build a classifier that predicts the target variable `class` using the rest of the attributes. The model should be fitted using a pipeline that contains:
- PCA method
- SelectKBest method
- FeatureUnion
- Random Forest

Choose the best set of parameters using `Pipeline` and `GridSearchCV`.

> #### Note
> **In this exercise, we focus on implementing the pipeline. Since our dataset has only 9 columns, PCA is probably not the best data-preparation technique from a methodology point of view.**

In [7]:
pca = PCA(n_components=2)
selection = SelectKBest(k=3)
combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])
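
`FeatureUnion` fits each transformer on the same input and concatenates their outputs column-wise, so the union above yields 2 PCA components plus 3 selected columns. A minimal sketch of that behaviour, using synthetic data from `make_classification` as a stand-in for the diabetes table (the `X_demo`, `y_demo` and `union` names are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

# Synthetic stand-in data (assumption): 100 rows, 8 numeric features.
X_demo, y_demo = make_classification(
    n_samples=100, n_features=8, n_informative=4, random_state=0
)

union = FeatureUnion([
    ("pca", PCA(n_components=2)),        # contributes 2 columns
    ("univ_select", SelectKBest(k=3)),   # contributes 3 columns
])

# SelectKBest is supervised, so y must be passed when fitting the union.
X_combined = union.fit_transform(X_demo, y_demo)
print(X_combined.shape)  # (100, 5)
```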

In [8]:
svm = SVC(kernel='linear')
pipeline = Pipeline([("features", combined_features), ("svm", svm)])
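
The task list names a Random Forest, while the cell above uses a linear `SVC` as the final step. As a hedged sketch of the Random Forest variant, the final step can be swapped for `RandomForestClassifier`. The step name `rf`, the grid values, and the synthetic data are illustrative assumptions and not part of the original solution.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline

# Synthetic stand-in data (assumption): 200 rows, 8 numeric features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)

features = FeatureUnion([
    ("pca", PCA(n_components=2)),
    ("univ_select", SelectKBest(k=3)),
])

# "rf" is an illustrative step name; it becomes the prefix of the grid keys.
rf_pipeline = Pipeline([
    ("features", features),
    ("rf", RandomForestClassifier(random_state=0)),
])

# Illustrative grid values, kept small so the search finishes quickly.
rf_param_grid = {
    "features__pca__n_components": [1, 2],
    "features__univ_select__k": [1, 3],
    "rf__n_estimators": [10, 50],
}

rf_search = GridSearchCV(rf_pipeline, rf_param_grid, cv=3)
rf_search.fit(X_demo, y_demo)
print(rf_search.best_params_)
```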

In [9]:
param_grid = {"features__pca__n_components": [1, 2, 3],
              "features__univ_select__k": [1, 2, 3],
              "svm__C": [0.1, 1, 10]}

In [10]:
# This does not hang, it is just slow: in the verbose log below, fits with svm__C=10 take roughly 9-19 s each, versus under 1 s at svm__C=0.1.
grid_search = GridSearchCV(pipeline, param_grid, verbose=10, refit=True) 
grid_search.fit(X,y)

Fitting 5 folds for each of 27 candidates, totalling 135 fits
[CV 1/5; 1/27] START features__pca__n_components=1, features__univ_select__k=1, svm__C=0.1
[CV 1/5; 1/27] END features__pca__n_components=1, features__univ_select__k=1, svm__C=0.1;, score=0.721 total time=   0.2s
[CV 2/5; 1/27] START features__pca__n_components=1, features__univ_select__k=1, svm__C=0.1
[CV 2/5; 1/27] END features__pca__n_components=1, features__univ_select__k=1, svm__C=0.1;, score=0.708 total time=   0.2s
[CV 3/5; 1/27] START features__pca__n_components=1, features__univ_select__k=1, svm__C=0.1
[CV 3/5; 1/27] END features__pca__n_components=1, features__univ_select__k=1, svm__C=0.1;, score=0.753 total time=   0.1s
[CV 4/5; 1/27] START features__pca__n_components=1, features__univ_select__k=1, svm__C=0.1
[CV 4/5; 1/27] END features__pca__n_components=1, features__univ_select__k=1, svm__C=0.1;, score=0.784 total time=   0.4s
[CV 5/5; 1/27] START features__pca__n_components=1, features__univ_select__k=1, svm__C

[CV 5/5; 8/27] END features__pca__n_components=1, features__univ_select__k=3, svm__C=1;, score=0.765 total time=   2.3s
[CV 1/5; 9/27] START features__pca__n_components=1, features__univ_select__k=3, svm__C=10
[CV 1/5; 9/27] END features__pca__n_components=1, features__univ_select__k=3, svm__C=10;, score=0.753 total time=  18.8s
[CV 2/5; 9/27] START features__pca__n_components=1, features__univ_select__k=3, svm__C=10
[CV 2/5; 9/27] END features__pca__n_components=1, features__univ_select__k=3, svm__C=10;, score=0.734 total time=  11.2s
[CV 3/5; 9/27] START features__pca__n_components=1, features__univ_select__k=3, svm__C=10
[CV 3/5; 9/27] END features__pca__n_components=1, features__univ_select__k=3, svm__C=10;, score=0.747 total time=  18.1s
[CV 4/5; 9/27] START features__pca__n_components=1, features__univ_select__k=3, svm__C=10
[CV 4/5; 9/27] END features__pca__n_components=1, features__univ_select__k=3, svm__C=10;, score=0.784 total time=   8.9s
[CV 5/5; 9/27] START features__pca__

[CV 4/5; 16/27] END features__pca__n_components=2, features__univ_select__k=3, svm__C=0.1;, score=0.791 total time=   0.3s
[CV 5/5; 16/27] START features__pca__n_components=2, features__univ_select__k=3, svm__C=0.1
[CV 5/5; 16/27] END features__pca__n_components=2, features__univ_select__k=3, svm__C=0.1;, score=0.784 total time=   0.2s
[CV 1/5; 17/27] START features__pca__n_components=2, features__univ_select__k=3, svm__C=1
[CV 1/5; 17/27] END features__pca__n_components=2, features__univ_select__k=3, svm__C=1;, score=0.760 total time=   1.8s
[CV 2/5; 17/27] START features__pca__n_components=2, features__univ_select__k=3, svm__C=1
[CV 2/5; 17/27] END features__pca__n_components=2, features__univ_select__k=3, svm__C=1;, score=0.753 total time=   1.1s
[CV 3/5; 17/27] START features__pca__n_components=2, features__univ_select__k=3, svm__C=1
[CV 3/5; 17/27] END features__pca__n_components=2, features__univ_select__k=3, svm__C=1;, score=0.727 total time=   2.3s
[CV 4/5; 17/27] START feature

[CV 3/5; 24/27] END features__pca__n_components=3, features__univ_select__k=2, svm__C=10;, score=0.747 total time=  18.5s
[CV 4/5; 24/27] START features__pca__n_components=3, features__univ_select__k=2, svm__C=10
[CV 4/5; 24/27] END features__pca__n_components=3, features__univ_select__k=2, svm__C=10;, score=0.791 total time=  10.3s
[CV 5/5; 24/27] START features__pca__n_components=3, features__univ_select__k=2, svm__C=10
[CV 5/5; 24/27] END features__pca__n_components=3, features__univ_select__k=2, svm__C=10;, score=0.739 total time=  12.5s
[CV 1/5; 25/27] START features__pca__n_components=3, features__univ_select__k=3, svm__C=0.1
[CV 1/5; 25/27] END features__pca__n_components=3, features__univ_select__k=3, svm__C=0.1;, score=0.753 total time=   0.3s
[CV 2/5; 25/27] START features__pca__n_components=3, features__univ_select__k=3, svm__C=0.1
[CV 2/5; 25/27] END features__pca__n_components=3, features__univ_select__k=3, svm__C=0.1;, score=0.760 total time=   0.2s
[CV 3/5; 25/27] START 

GridSearchCV(estimator=Pipeline(steps=[('features',
                                        FeatureUnion(transformer_list=[('pca',
                                                                        PCA(n_components=2)),
                                                                       ('univ_select',
                                                                        SelectKBest(k=3))])),
                                       ('svm', SVC(kernel='linear'))]),
             param_grid={'features__pca__n_components': [1, 2, 3],
                         'features__univ_select__k': [1, 2, 3],
                         'svm__C': [0.1, 1, 10]},
             verbose=10)
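
The `param_grid` keys echoed in the repr above follow sklearn's `<step>__<parameter>` naming, with one `__` per level of nesting: `features__pca__n_components` reaches the `n_components` argument of the `pca` transformer inside the `features` step. A small sketch of how to check which names a pipeline accepts, rebuilding the same pipeline under the illustrative name `pipe`:

```python
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

pipe = Pipeline([
    ("features", FeatureUnion([
        ("pca", PCA(n_components=2)),
        ("univ_select", SelectKBest(k=3)),
    ])),
    ("svm", SVC(kernel="linear")),
])

# get_params() lists every tunable name, including the nested ones.
params = pipe.get_params()
for key in ["features__pca__n_components", "features__univ_select__k", "svm__C"]:
    print(key, "=", params[key])

# set_params() accepts the same names; GridSearchCV relies on this
# mechanism to apply each candidate from the grid.
pipe.set_params(features__pca__n_components=3, svm__C=0.1)
```

A misspelled key in `param_grid` raises an error when the search is fitted, so checking names against `get_params()` first can save a long wait.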

In [11]:
print(grid_search.best_params_)

{'features__pca__n_components': 3, 'features__univ_select__k': 3, 'svm__C': 0.1}
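
Beyond `best_params_`, a fitted `GridSearchCV` exposes `best_score_` and the full `cv_results_` table, which helps judge how much the winning combination actually differs from the others. A minimal sketch on synthetic stand-in data with a reduced grid (the data, the grid values, and the `search`, `results` and `summary` names are assumptions for illustration):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 200 rows, 8 numeric features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)

pipe = Pipeline([
    ("features", FeatureUnion([
        ("pca", PCA(n_components=2)),
        ("univ_select", SelectKBest(k=3)),
    ])),
    ("svm", SVC(kernel="linear")),
])

# Reduced, illustrative grid: 2 x 2 = 4 candidates.
search = GridSearchCV(
    pipe,
    {"features__pca__n_components": [1, 2], "svm__C": [0.1, 1]},
    cv=3,
)
search.fit(X_demo, y_demo)

# One row per candidate; each grid key appears as a "param_<key>" column.
results = pd.DataFrame(search.cv_results_)
columns = [
    "param_features__pca__n_components",
    "param_svm__C",
    "mean_test_score",
    "std_test_score",
    "rank_test_score",
]
summary = results[columns].sort_values("rank_test_score")
print(summary)
print("best mean CV accuracy:", search.best_score_)
```

Comparing `mean_test_score` against `std_test_score` shows whether the top candidates are separated by more than fold-to-fold noise.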
