## Advanced tuning of parameters

In this tutorial, we will apply the skills from previous tutorials and build a classifier using the `Pipeline` and `FeatureUnion` classes from sklearn.

In [12]:
# IMPORT PACKAGES
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt # For plotting data
import seaborn as sns # For plotting data

from sklearn.model_selection import train_test_split # For train/test splits
from sklearn.neighbors import KNeighborsClassifier # The k-nearest neighbor classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold # Feature selector
from sklearn.pipeline import Pipeline, FeatureUnion # For setting up pipeline
from sklearn.svm import SVC

# Various pre-processing steps
from sklearn.preprocessing import Normalizer, StandardScaler, MinMaxScaler, PowerTransformer, MaxAbsScaler, LabelEncoder
from sklearn.model_selection import GridSearchCV # For optimization
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

### Data

We will be building a binary classifier that predicts whether a person has diabetes, using information on the patient's health.

The data can be found [here](https://drive.google.com/file/d/1TvCKlmH3Z32XAKk-VUcZyYu95Ccyw3PO/view?usp=sharing). 


In [1]:
col_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

In [5]:
data = pd.read_csv('data/pima-indians-diabetes.csv', delimiter=';')

In [6]:
y = data['class']
X = data.drop('class', axis=1)

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25)
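The split above is random on every run. As a minimal sketch, assuming reproducibility and a preserved class ratio are wanted, `train_test_split` also accepts `random_state` and `stratify`. The arrays `X_demo` and `y_demo` are synthetic stand-ins, not the diabetes data, and the seed value is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 200 rows, 8 features, imbalanced labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 8))
y_demo = np.array([0] * 130 + [1] * 70)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

print(Xd_train.shape, Xd_test.shape)
print(yd_train.mean(), yd_test.mean())  # share of class 1 in each part
```

With stratification, the share of positive labels in the train and test parts stays close to the share in the full dataset.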

### Task

Build a classifier that predicts the target variable `class` from the remaining attributes. The model should be fitted using a pipeline that contains:
- PCA method
- SelectKBest method
- FeatureUnion
- Random Forest

Choose the best set of parameters using `Pipeline` and `GridSearchCV`.

> #### Note
> **In this exercise, we are focusing on the implementation of the pipeline. Since our dataset has only 9 columns (8 features plus the target), PCA is probably not the best technique for data preparation from a methodological point of view.**

In [15]:
pca = PCA(n_components=2)

selection = SelectKBest(k=3)

In [16]:
combined_features = FeatureUnion([('pca',pca),('univ_select', selection)])
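`FeatureUnion` fits each transformer on the same input and concatenates their outputs column-wise. The sketch below illustrates this on synthetic data (`X_demo` and `y_demo` are made-up stand-ins, not the diabetes data): two PCA components plus three selected features give five columns.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

# Synthetic stand-in data: 100 rows, 8 features, binary labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 8))
y_demo = rng.integers(0, 2, size=100)

union_demo = FeatureUnion([
    ("pca", PCA(n_components=2)),       # contributes 2 columns
    ("univ_select", SelectKBest(k=3)),  # contributes 3 columns
])

# y is needed because SelectKBest scores each feature against the labels
X_combined = union_demo.fit_transform(X_demo, y_demo)
print(X_combined.shape)  # (100, 5)
```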

In [19]:
rfc = RandomForestClassifier()

In [20]:
pipeline = Pipeline([('features', combined_features), ('rfc', rfc)])

In [23]:
param_grid = {
    "features__pca__n_components": [1,2,3],
    "features__univ_select__k": [1,2,3],
    "rfc__n_estimators": [25,100,200]
}

# create a Grid Search object
grid_search = GridSearchCV(pipeline, param_grid, verbose=10,refit=True)
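The keys in `param_grid` follow the pattern `<step name>__<parameter>`, with one extra level for transformers nested inside the `FeatureUnion`. A quick way to check which names are valid is `get_params()`. The sketch below rebuilds an equivalent pipeline under the hypothetical name `demo_pipeline` so that it runs on its own.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion, Pipeline

# Same structure and step names as the pipeline in this tutorial
demo_pipeline = Pipeline([
    ("features", FeatureUnion([
        ("pca", PCA(n_components=2)),
        ("univ_select", SelectKBest(k=3)),
    ])),
    ("rfc", RandomForestClassifier()),
])

# get_params() returns every tunable parameter keyed by its full path
available = demo_pipeline.get_params()
for name in ["features__pca__n_components",
             "features__univ_select__k",
             "rfc__n_estimators"]:
    print(name, "->", available[name])
```

A misspelled key in `param_grid` makes `GridSearchCV` raise an error at fit time, so checking against `get_params()` first can save a failed run.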

In [24]:
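# Note: this fits on the full dataset (X, y), so X_test/y_test are not held out
# from the search; pass X_train, y_train instead to keep a clean test set.
grid_search.fit(X,y)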

Fitting 5 folds for each of 27 candidates, totalling 135 fits
[CV 1/5; 1/27] START features__pca__n_components=1, features__univ_select__k=1, rfc__n_estimators=25
[CV 1/5; 1/27] END features__pca__n_components=1, features__univ_select__k=1, rfc__n_estimators=25;, score=0.688 total time=   0.1s
[CV 2/5; 1/27] START features__pca__n_components=1, features__univ_select__k=1, rfc__n_estimators=25
[CV 2/5; 1/27] END features__pca__n_components=1, features__univ_select__k=1, rfc__n_estimators=25;, score=0.656 total time=   0.1s
[CV 3/5; 1/27] START features__pca__n_components=1, features__univ_select__k=1, rfc__n_estimators=25
[CV 3/5; 1/27] END features__pca__n_components=1, features__univ_select__k=1, rfc__n_estimators=25;, score=0.727 total time=   0.1s
[CV 4/5; 1/27] START features__pca__n_components=1, features__univ_select__k=1, rfc__n_estimators=25
[CV 4/5; 1/27] END features__pca__n_components=1, features__univ_select__k=1, rfc__n_estimators=25;, score=0.732 total time=   0.0s
[CV 5/