# Model evaluation and hyperparameter tuning
In this notebook we will see how to build [pipelines](https://scikit-learn.org/stable/modules/compose.html#combining-estimators) of data transformers and estimators with Scikit-Learn and how to assess a model's performance.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
print("numpy version: %s"%np.__version__)
print("pandas version: %s"%pd.__version__)

numpy version: 1.23.1
pandas version: 1.4.3


## The Breast Cancer Wisconsin Data Set
In this notebook we will use the [Breast Cancer Wisconsin](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)) data set from the UCI website. The data set contains 569 records, each derived from a fine needle aspirate (FNA) of a breast mass, with 30 features that describe the characteristics of the cell nuclei. In the raw file the first two columns hold the unique sample ID and the diagnosis (M = malignant, B = benign), so the table has 32 columns in total.

In [2]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data'
bcw_df = pd.read_csv(url, header=None)
bcw_df.shape

(569, 32)

In [7]:
bcw_df.head(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758


We change the values of the diagnosis: from malignant (M) to 1, and from benign (B) to 0.

In [8]:
from sklearn.preprocessing import LabelEncoder

X = bcw_df.loc[:, 2:].values
y = bcw_df.loc[:, 1].values
le = LabelEncoder()
y = le.fit_transform(y)
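
As a quick sanity check on the encoding, the sketch below fits a LabelEncoder on a few toy labels (not the real data; the names `labels` and `le_demo` are made up for this illustration). LabelEncoder assigns integer codes in the sorted order of the class names, which is why B maps to 0 and M maps to 1.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy labels for illustration only, not taken from the data set
labels = np.array(['M', 'B', 'B', 'M'])

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)

# classes_ lists the original labels; a label's position is its integer code
print(le_demo.classes_)
print(encoded)

# transform() encodes new labels, inverse_transform() recovers the originals
print(le_demo.transform(['M', 'B']))
print(le_demo.inverse_transform([0, 1]))
```

On the notebook's own encoder, inspecting `le.classes_` in the same way confirms which diagnosis received which code.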

### Data partition
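We split the data set into a training set (80%) and a held-out test set (20%). Passing stratify=y keeps the proportion of malignant and benign samples the same in both subsets.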

In [10]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y, random_state=1)
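
The effect of stratification can be checked by comparing class proportions before and after the split. The sketch below uses synthetic labels with the class counts documented for this data set (357 benign, 212 malignant) and a placeholder feature column; `X_demo`, `y_demo` and the other names are assumptions made for this illustration, so that the example runs without downloading the file.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the documented class counts: 357 benign (0), 212 malignant (1)
y_demo = np.array([0] * 357 + [1] * 212)
# Placeholder feature matrix: one column holding the row index
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=1)

# With stratification the fraction of each class is nearly identical in every subset
for name, labels in [('all', y_demo), ('train', y_tr), ('test', y_te)]:
    fractions = np.bincount(labels) / len(labels)
    print(f'{name:>5}: n={len(labels)}, class fractions={np.round(fractions, 3)}')
```

The same loop applied to y, y_train and y_test from the cell above shows the proportions for the real split.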

## Scikit-Learn transformers, estimators and pipelines
Scikit-Learn provides several transformers, such as StandardScaler to standardize the input data and PCA for dimensionality reduction. Classifiers and regressors, such as Logistic Regression or Support Vector Machines, are classes that implement algorithms for classification and regression tasks, and are collectively called estimators. Transformer objects implement the fit() and transform() methods, while estimator objects implement the fit() and predict() methods. A pipeline chains one or more transformers with a final estimator, so that the whole sequence can be fitted and applied as a single object. In the example below we set up a pipeline with two transformers and a final estimator.

In [11]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pipe_lr = make_pipeline(StandardScaler(),
                        PCA(n_components=2),
                        LogisticRegression())

pipe_lr.fit(X_train, y_train)
y_pred = pipe_lr.predict(X_test)
test_acc = pipe_lr.score(X_test, y_test)
print(f'Test accuracy: {test_acc:.3f}')

Test accuracy: 0.956
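
To make explicit what the pipeline does internally, the sketch below repeats the same three steps by hand and compares the predictions. It runs on synthetic data from make_classification rather than the breast cancer data, and all variable names (`X_demo`, `pipe`, `scaler` and so on) are assumptions made for this illustration. The key point is that each transformer is fitted on the training data only, and the test data is then passed through transform() with the parameters learned during training.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data, used only for this illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=1)

# Pipeline version: one fit() call and one predict() call
pipe = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression())
pipe.fit(X_tr, y_tr)
pipe_pred = pipe.predict(X_te)

# Manual version: fit each transformer on the training data only...
scaler = StandardScaler()
pca = PCA(n_components=2)
clf = LogisticRegression()
Z_tr = pca.fit_transform(scaler.fit_transform(X_tr))
clf.fit(Z_tr, y_tr)

# ...then reuse the fitted transformers on the test data (transform, never fit)
Z_te = pca.transform(scaler.transform(X_te))
manual_pred = clf.predict(Z_te)

print('Same predictions:', np.array_equal(pipe_pred, manual_pred))
# For a classifier, score() reports the fraction of correct predictions
print('Accuracy:', pipe.score(X_te, y_te), np.mean(manual_pred == y_te))
```

Keeping the steps inside a pipeline removes the risk of accidentally fitting a transformer on the test data, which matters even more once the model is evaluated with cross-validation.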


## Model performance assessment