# Preprocessor Tuning

## (0) The `tumors` Dataset

* 👩🏻‍⚕️ The following dataset describes tumors that are either <font color=red>malignant</font> or <font color=green>benign</font>. 
* 🎯 The task is to detect as many malignant tumors as possible.
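
"Detecting as many malignant tumors as possible" corresponds to maximizing recall on the malignant class. The sketch below uses made-up labels (not taken from the tumors dataset) to show how recall can differ from accuracy when malignant cases are missed.

```python
from sklearn.metrics import accuracy_score, recall_score

# Toy labels for illustration only: 1 = malignant, 0 = benign
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # two malignant tumors are missed

accuracy = accuracy_score(y_true, y_pred)  # share of all predictions that are correct
recall = recall_score(y_true, y_pred)      # share of malignant tumors that were detected

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

Here 8 of 10 predictions are correct, yet only 2 of the 4 malignant tumors are caught, which is why recall is the more relevant metric for this task.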

In [90]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import RobustScaler, StandardScaler

pd.set_option('display.max_columns', None)

url = "https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/08-Workflow/tumors_dataset.csv"
data = pd.read_csv(url)

data.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,malignant
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,1
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,1
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,1
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,1
4,20.29,14.34,,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,,0.205,0.4,0.1625,0.2364,0.07678,1


In [91]:
data["malignant"].value_counts(normalize=True).round(2)

0    0.63
1    0.37
Name: malignant, dtype: float64

## (1) Building a Pipeline

❓ **Question: Building a Pipeline** ❓

Combine the following steps in a **`Pipeline`** object named `pipeline`:

1. Impute missing values with a **`KNNImputer`**
2. Scale all the (numerical) features with a **`MinMaxScaler`**
3. Model a **`LogisticRegression`** with default parameters
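
The three steps above can be sketched as a single flat `Pipeline`. This is a minimal illustration on synthetic data, not the tumors dataset; the names `X_demo`, `y_demo`, and `demo_pipeline` are assumptions made for the example.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the tumors data: 100 rows, 4 numerical features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan  # blank out roughly 5% of the values

# Each step is a (name, estimator) pair; the last step is the model
demo_pipeline = Pipeline([
    ("imputer", KNNImputer()),
    ("scaler", MinMaxScaler()),
    ("model", LogisticRegression()),
])

demo_pipeline.fit(X_demo, y_demo)
demo_predictions = demo_pipeline.predict(X_demo)
```

Calling `fit` runs imputation and scaling before training the model, and `predict` applies the same fitted preprocessing to whatever data it receives.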

In [92]:
X = data.drop(columns="malignant")
y = data["malignant"]


In [93]:
# Preprocessing steps: impute missing values, then scale all features
from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

pipeline = Pipeline([
    ('imputer', KNNImputer()),
    ('scaler', MinMaxScaler())
])

pipeline.fit(X)
pipeline.transform(X)

array([[0.52103744, 0.0226581 , 0.54598853, ..., 0.91202749, 0.59846245,
        0.41886396],
       [0.64314449, 0.27257355, 0.61578329, ..., 0.63917526, 0.23358959,
        0.22287813],
       [0.60149557, 0.3902604 , 0.59574321, ..., 0.83505155, 0.40370589,
        0.21343303],
       ...,
       [0.45525108, 0.62123774, 0.44578813, ..., 0.48728522, 0.12872068,
        0.1519087 ],
       [0.64456434, 0.66351031, 0.66553797, ..., 0.91065292, 0.49714173,
        0.45231536],
       [0.03686876, 0.50152181, 0.02853984, ..., 0.        , 0.25744136,
        0.10068215]])

In [94]:
from sklearn import set_config

set_config(display='diagram')  # render pipelines as HTML diagrams

# Chain the preprocessing pipeline with the model
pipeline_log = make_pipeline(pipeline, LogisticRegression())
pipeline_log.fit(X,y)

## (2) Optimizing a pipelined model

❓ **Question (GridSearching a Pipeline)** ❓

* What is the optimal number of neighbors for the KNN imputer: 2, 5, or 10?
    * Perform a GridSearch on your pipeline and save your answer under a variable called `n_best`.
    * _Be careful: Use a scoring metric that is relevant for the task in your Grid Search, just saying... :)_
* Feel free to GridSearch on the whole dataset instead of using a train/test split in this challenge. Here, the goal is just to become familiar with Pipelines :)
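
As a hedged sketch of what such a search could look like, the example below grid-searches `n_neighbors` on a flat pipeline with `scoring="recall"`, using synthetic data rather than the tumors dataset. The step names (`imputer`, `scaler`, `model`) and the `demo_*` variables are assumptions for illustration; with a nested pipeline the parameter key gains an extra prefix, as `get_params()` shows.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic binary classification data with about 10% missing values
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] - X_demo[:, 2] > 0).astype(int)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

demo_pipeline = Pipeline([
    ("imputer", KNNImputer()),
    ("scaler", MinMaxScaler()),
    ("model", LogisticRegression()),
])

# Keys follow the "<step name>__<parameter name>" convention
demo_search = GridSearchCV(
    demo_pipeline,
    param_grid={"imputer__n_neighbors": [2, 5, 10]},
    cv=5,
    scoring="recall",  # reward detecting the positive class
)
demo_search.fit(X_demo, y_demo)

demo_n_best = demo_search.best_params_["imputer__n_neighbors"]
print(demo_n_best, demo_search.best_score_)
```

Reading the winner from `best_params_` avoids hard-coding the answer by hand.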



In [95]:
pipeline_log.get_params()

{'memory': None,
 'steps': [('pipeline',
   Pipeline(steps=[('imputer', KNNImputer()), ('scaler', MinMaxScaler())])),
  ('logisticregression', LogisticRegression())],
 'verbose': False,
 'pipeline': Pipeline(steps=[('imputer', KNNImputer()), ('scaler', MinMaxScaler())]),
 'logisticregression': LogisticRegression(),
 'pipeline__memory': None,
 'pipeline__steps': [('imputer', KNNImputer()), ('scaler', MinMaxScaler())],
 'pipeline__verbose': False,
 'pipeline__imputer': KNNImputer(),
 'pipeline__scaler': MinMaxScaler(),
 'pipeline__imputer__add_indicator': False,
 'pipeline__imputer__copy': True,
 'pipeline__imputer__metric': 'nan_euclidean',
 'pipeline__imputer__missing_values': nan,
 'pipeline__imputer__n_neighbors': 5,
 'pipeline__imputer__weights': 'uniform',
 'pipeline__scaler__clip': False,
 'pipeline__scaler__copy': True,
 'pipeline__scaler__feature_range': (0, 1),
 'logisticregression__C': 1.0,
 'logisticregression__class_weight': None,
 'logisticregression__dual': False,
 'logist

In [100]:
from sklearn.model_selection import GridSearchCV

# Parameter names follow the `<step>__<parameter>` convention
# listed by pipeline_log.get_params()

# Instantiate Grid Search
grid_search = GridSearchCV(
    pipeline_log,
    param_grid={
        'pipeline__imputer__n_neighbors': [2, 5, 10],
        'pipeline__scaler': [MinMaxScaler(), StandardScaler(), RobustScaler()]
    },
    cv=5,
    # NOTE: "r2" is a regression metric. For the goal of detecting as many
    # malignant tumors as possible, "recall" is the relevant choice.
    # The recorded output was produced with "r2".
    scoring="r2")

grid_search.fit(X, y)

grid_search.best_params_


{'pipeline__imputer__n_neighbors': 10, 'pipeline__scaler': StandardScaler()}

In [66]:
n_best = 5

In [67]:
n_best

5

In [101]:
pipe_tuned = grid_search.best_estimator_
pipe_tuned

In [102]:
from sklearn.model_selection import cross_val_score

# Recorded with scoring='r2'; scoring='recall' would match the detection goal
cv_score = cross_val_score(pipe_tuned, X, y, cv=5, scoring='r2').mean()
cv_score

0.9248505243346383

## (3) Evaluating a pipeline

❓ **Question: What is the performance of the optimal pipeline?** ❓

- Make sure you cross-validate your optimal pipeline! 
- Store your result as a `float` number in a variable named `cv_score`
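
A minimal sketch of cross-validating a pipeline and storing the mean score as a plain `float`, assuming recall as the metric. The data is synthetic and the `demo_*` names are illustrative, not part of the challenge.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data with a few missing values
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(150, 4))
y_demo = (X_demo[:, 1] > 0).astype(int)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

demo_pipeline = make_pipeline(
    KNNImputer(n_neighbors=5),
    StandardScaler(),
    LogisticRegression(),
)

# One recall score per fold; the pipeline is refitted on each training split,
# so imputation and scaling never see the held-out fold
fold_scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5, scoring="recall")
demo_cv_score = float(fold_scores.mean())
print(demo_cv_score)
```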

In [103]:
from nbresult import ChallengeResult

result = ChallengeResult(
    'solution', 
    n_best = n_best,
    cv_score=cv_score
)

result.write()
print(result.check())

platform darwin -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /Users/guilhermecavalcantidesabarreto/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/guilhermecavalcantidesabarreto/code/guipyc/data-preprocessor-tuning
plugins: anyio-3.4.0
collecting ... collected 2 items

tests/test_solution.py::TestSolution::test_n_neighbours PASSED           [ 50%]
tests/test_solution.py::TestSolution::test_score_good_enough PASSED      [100%]



💯 You can commit your code:

git add tests/solution.pickle

git commit -m 'Completed solution step'

git push origin master


In [104]:
!git add tests/solution.pickle

!git commit -m 'Completed solution step'

!git push origin master

[master e5230c7] Completed solution step
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 tests/solution.pickle
Enumerating objects: 13, done.
Counting objects: 100% (13/13), done.
Delta compression using up to 8 threads
Compressing objects: 100% (12/12), done.
Writing objects: 100% (13/13), 3.12 KiB | 3.12 MiB/s, done.
Total 13 (delta 1), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (1/1), done.
To github.com:guipyc/data-preprocessor-tuning.git
 * [new branch]      master -> master


## (4) Predicting using a fitted and pipelined model

👇 Here is a new tumor.

In [105]:
new_url = "https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/08-Workflow/new_tumor.csv"

new_data = pd.read_csv(new_url)
new_data

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902


❓ **Question: Using your optimal pipeline, predict whether the new tumor is malignant or not** ❓

In [None]:
# YOUR CODE HERE
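
One way to approach this, sketched on synthetic data: a fitted pipeline accepts a one-row `DataFrame` with the same feature columns it was trained on and applies the fitted imputer and scaler before predicting. The column names `f1`, `f2`, `f3` and the `demo_*` variables are hypothetical; in the notebook the analogous call would use the tuned pipeline and `new_data`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic training data with hypothetical feature names
rng = np.random.default_rng(1)
feature_names = ["f1", "f2", "f3"]
X_demo = pd.DataFrame(rng.normal(size=(120, 3)), columns=feature_names)
y_demo = (X_demo["f1"] > 0).astype(int)

demo_pipeline = make_pipeline(KNNImputer(), MinMaxScaler(), LogisticRegression())
demo_pipeline.fit(X_demo, y_demo)

# A single new observation, with the same columns in the same order
new_row = pd.DataFrame([[2.5, 0.1, -0.3]], columns=feature_names)

demo_prediction = demo_pipeline.predict(new_row)[0]       # predicted class label
demo_proba = demo_pipeline.predict_proba(new_row)[0, 1]   # estimated probability of class 1
print(demo_prediction, demo_proba)
```

`predict_proba` is useful alongside `predict` because it shows how confident the model is, not just which class it picked.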

🏁 Congratulations! You are now an expert at pipelining!

💾 Don't forget to git add/commit/push your notebook...

🚀 ... and move on to the next challenge!