## Advanced tuning of parameters

In this tutorial, we will apply skills from previous tutorials to build a classifier using `Pipeline` and `FeatureUnion`.

In [1]:
# IMPORT PACKAGES

In [2]:
import pandas as pd
import numpy as np

### Data
We will use a diabetes dataset and build a classifier that predicts whether a person has diabetes from information about their health. The dataset can be found [here](https://drive.google.com/file/d/1TvCKlmH3Z32XAKk-VUcZyYu95Ccyw3PO/view?usp=sharing).

In [25]:
col_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

In [3]:
filepath = 'C:/Users/Tim/Desktop/lighthouse/w7/d1/'
df = pd.read_csv(filepath+'pima-indians-diabetes.csv',sep=';')
df.head()

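|   | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|------|------|------|------|------|------|-------|-----|-------|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |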


### Task

Build a classifier that predicts the target variable `class` from the remaining attributes. The model should be fitted using a pipeline that contains:
- PCA method
- SelectKBest method
- FeatureUnion
- Random Forest

Choose the best set of parameters using `Pipeline` and `GridSearchCV`.

> #### Note
> In this exercise, we focus on implementing the pipeline. Since our dataset has only 9 columns, PCA is probably not the best data-preparation technique from a methodological point of view.
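
One way to judge whether PCA is worthwhile on a narrow dataset is to look at how much variance each component explains. The sketch below is only an illustration: it uses synthetic data from `make_classification` as a stand-in for the diabetes file (an assumption, so the numbers say nothing about the real dataset), and the names `X_demo`, `X_scaled`, and `cumulative` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 8 features (assumption: NOT the diabetes data).
X_demo, _ = make_classification(n_samples=300, n_features=8,
                                n_informative=4, random_state=0)

# PCA is scale-sensitive, so standardize first, as the pipeline does.
X_scaled = StandardScaler().fit_transform(X_demo)

# Fit PCA with all components and accumulate the explained variance ratios.
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

for n, share in enumerate(cumulative, start=1):
    print(f'{n} components: {share:.2f} of variance')
```

If most of the variance is only reached with nearly all components, PCA compresses little, which is the situation the note above hints at.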

In [6]:
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [10]:
X = df.drop(columns='class')
y = df['class']

In [13]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=27, stratify=y)

In [14]:
feature_union = FeatureUnion([('pca', PCA(n_components=3)), 
                              ('select_best', SelectKBest(k=6))])

pipeline = Pipeline(steps=[('scaling', StandardScaler()),
                           ('features', feature_union),
                           ('classifier', RandomForestClassifier(n_jobs=-1))])

pipeline.fit(x_train, y_train)

y_pred = pipeline.predict(x_test)
acc = accuracy_score(y_test, y_pred)
print(f'Test set accuracy: {acc}')

Test set accuracy: 0.7337662337662337
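
To see what `FeatureUnion` hands to the classifier, it may help to check the shape of its output: the transformers run side by side and their outputs are concatenated column-wise. The following is a minimal sketch on synthetic data (an assumption, standing in for the diabetes file); `X_demo`, `y_demo`, `demo_union`, and `combined` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

# Synthetic stand-in with 8 input features (assumption: NOT the diabetes data).
X_demo, y_demo = make_classification(n_samples=100, n_features=8,
                                     n_informative=4, random_state=27)

# Same union as in the cell above: 3 principal components + 6 selected features.
demo_union = FeatureUnion([('pca', PCA(n_components=3)),
                           ('select_best', SelectKBest(k=6))])

# SelectKBest is supervised, so y must be passed when fitting.
combined = demo_union.fit_transform(X_demo, y_demo)

print(combined.shape)  # (100, 9): 3 PCA columns followed by 6 selected columns
```

Note that the union can output more columns than the input has, because the selected original features and the components derived from them are both kept.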


In [15]:
from sklearn import set_config
set_config(display='diagram')

pipeline
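
Parameters of nested steps are addressed with the `step__parameter` naming convention, chained once per nesting level. A quick way to list the valid names is `get_params()`. The sketch below rebuilds an equivalent pipeline under the hypothetical name `demo_pipeline` so that it runs on its own; no data is needed because nothing is fitted.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

# Same structure as the pipeline above, rebuilt so this cell is self-contained.
demo_pipeline = Pipeline(steps=[
    ('scaling', StandardScaler()),
    ('features', FeatureUnion([('pca', PCA(n_components=3)),
                               ('select_best', SelectKBest(k=6))])),
    ('classifier', RandomForestClassifier()),
])

# get_params() returns every settable parameter, including nested ones.
param_names = sorted(demo_pipeline.get_params().keys())

# Keep only the names that belong to the union members and the classifier.
prefixes = ('features__pca__', 'features__select_best__', 'classifier__')
tunable = [name for name in param_names if name.startswith(prefixes)]

for name in tunable:
    print(name)
```

Any name printed here can be used as a key in a parameter grid, for example `features__pca__n_components`.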

In [17]:
from sklearn.model_selection import GridSearchCV

In [23]:
param_grid = {'classifier__n_estimators': [50, 100, 200], 
              'features__pca__n_components': [3, 5, 7],
              'features__select_best__k': [1, 3, 6]}

grid = GridSearchCV(pipeline, param_grid=param_grid, cv=5, n_jobs=-1, verbose=1)
grid.fit(x_train, y_train)

best_model = grid.best_estimator_
best_hyperparams = grid.best_params_

Fitting 5 folds for each of 27 candidates, totalling 135 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  76 tasks      | elapsed:    4.1s
[Parallel(n_jobs=-1)]: Done 135 out of 135 | elapsed:   11.4s finished


In [24]:
# Evaluate the refitted best estimator on the held-out test split
best_acc = grid.score(x_test, y_test)
print(f'Best test set accuracy: {best_acc}\nAchieved with hyperparameters: {best_hyperparams}')

Best test set accuracy: 0.7792207792207793
Achieved with hyperparameters: {'classifier__n_estimators': 100, 'features__pca__n_components': 7, 'features__select_best__k': 3}
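
Beyond the single best combination, `cv_results_` holds the cross-validated score of every candidate, which helps to see how close the alternatives were. The following sketch assumes synthetic data and a deliberately small grid so that it runs quickly on its own; `X_demo`, `y_demo`, `demo_pipeline`, `demo_grid`, `results`, and `summary` are hypothetical names, and the scores it prints are unrelated to the diabetes results above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 8 features (assumption: NOT the diabetes data).
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     n_informative=4, random_state=27)

demo_pipeline = Pipeline(steps=[
    ('scaling', StandardScaler()),
    ('features', FeatureUnion([('pca', PCA()),
                               ('select_best', SelectKBest())])),
    ('classifier', RandomForestClassifier(n_estimators=20, random_state=27)),
])

# Small grid: 2 x 2 = 4 candidates, each evaluated with 3-fold CV.
demo_grid = GridSearchCV(
    demo_pipeline,
    param_grid={'features__pca__n_components': [2, 4],
                'features__select_best__k': [1, 3]},
    cv=3,
)
demo_grid.fit(X_demo, y_demo)

# One row per candidate; sort so the best-ranked candidate comes first.
results = pd.DataFrame(demo_grid.cv_results_)
summary = results[['params', 'mean_test_score', 'std_test_score',
                   'rank_test_score']].sort_values('rank_test_score')

print(summary.to_string(index=False))
```

Comparing `mean_test_score` against `std_test_score` gives a rough sense of whether the winning combination is meaningfully better than the runners-up or within fold-to-fold noise.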
