<div style="line-height:0.5">
<h1 style="color:#B0EE8F ">  Common practices in Machine Learning 3 </h1>
</div>
<div style="line-height:1.2">
<h4>  11 examples based on Scikit-learn for classification and regression. <br> 
Focus on Imputers, ColumnTransformers, CountVectorizers, Pipeline. </h4>
</div>
<div style="margin-top: -15px;">
<span style="display: inline-block;">
    <h3 style="color: lightblue; display: inline;">Keywords:</h3> fetch_openml + enable_iterative_imputer + pandas isna() + Pipeline / make_pipeline + statistics_
</span>
</div>

In [74]:
import pandas as pd
import numpy as np
from copy import copy 
import matplotlib.pyplot as plt

import sklearn
from sklearn.metrics import accuracy_score
from sklearn.datasets import fetch_openml, load_iris
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.compose import ColumnTransformer, make_column_transformer, make_column_selector

In [40]:
""" IterativeImputer.
It is experimental, the API might change without any deprecation cycle. 
N.B.
To use it, you need to explicitly import enable_iterative_imputer.
"""
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer      
# from sklearn.experimental import enable_hist_gradient_boosting        # no longer needed: the HistGradientBoosting estimators are stable since scikit-learn 1.0
from sklearn.ensemble import HistGradientBoostingClassifier

<div style="line-height:0.5">
<h2 style="color:#B0EE8F ">  # 1) ColumnTransformer </h2>
</div>
Use ColumnTransformer to apply different preprocessing to different columns: <br>
    - select from DataFrame columns by name <br>
    - passthrough or drop unspecified columns <br>

In [41]:
fare = np.random.uniform(10.0, 100.0, size=40)  
embarked = np.random.choice(['S', 'C', 'Q'], size=40)
sex = np.random.choice(['male', 'female'], size=40)
age = np.random.randint(18, 80, size=40)

# Create the dataframe object
df = pd.DataFrame({'Fare': fare, 'Embarked': embarked, 'Sex': sex, 'Age': age})
df['Fare'] = df['Fare'].round(2)
# Print the dataframe
print(df)


     Fare Embarked     Sex  Age
0   28.84        S    male   73
1   39.76        C  female   45
2   90.45        C  female   46
3   91.01        S  female   79
4   64.45        C  female   37
5   96.66        Q  female   53
6   73.52        S  female   79
7   75.98        C    male   26
8   89.20        S  female   23
9   21.29        C    male   78
10  66.98        S    male   56
11  66.84        C    male   50
12  25.16        Q  female   67
13  64.86        C    male   65
14  72.04        Q    male   23
15  38.14        Q  female   24
16  89.00        C  female   46
17  47.72        C    male   62
18  15.85        Q  female   26
19  85.47        S    male   30
20  12.38        S    male   24
21  85.27        S  female   35
22  20.22        C    male   76
23  85.76        Q    male   22
24  69.79        C  female   54
25  75.60        S    male   47
26  23.75        C  female   67
27  43.07        C  female   74
28  93.95        S    male   37
29  23.31        S  female   67
30  37.2

In [42]:
""" Create a ColumnTransformer with scikit-learn's make_column_transformer() function
and apply it to the pandas DataFrame df.

1. One-hot encode the 'Embarked' and 'Sex' columns and fill missing values in the 'Age' column
with the mean of the non-missing values.
2. Append the untransformed remaining columns (remainder='passthrough') after the transformed ones
and return the resulting numpy array.

N.B.
a) OneHotEncoder encodes categorical features as one-hot numeric arrays.
b) SimpleImputer fills in missing (NaN) values in the input data.
When the strategy parameter is not specified, the default strategy 'mean' is used.
c) ColumnTransformer applies different transformations to different columns of the input data.
    Two transformers are used here: ohe and imp.
    ohe is applied to the 'Embarked' and 'Sex' columns, imp to the 'Age' column.
    remainder='passthrough' means that columns not listed in any transformer ('Fare' here)
    are passed through unchanged and concatenated after the output of the transformers.

d) fit_transform() fits each transformer on its columns of df and returns the transformed data.
    The output is a numpy array whose columns follow the transformer order:
    3 'Embarked' columns, 2 'Sex' columns, 'Age', then 'Fare'.
"""
ohe = OneHotEncoder()
imp = SimpleImputer()
# imp = SimpleImputer(strategy='mean') # fill in missing values with the mean of each column

ct = make_column_transformer((ohe, ['Embarked', 'Sex']), (imp, ['Age']), remainder='passthrough') 
ct.fit_transform(df)

array([[ 0.  ,  0.  ,  1.  ,  0.  ,  1.  , 73.  , 28.84],
       [ 1.  ,  0.  ,  0.  ,  1.  ,  0.  , 45.  , 39.76],
       [ 1.  ,  0.  ,  0.  ,  1.  ,  0.  , 46.  , 90.45],
       [ 0.  ,  0.  ,  1.  ,  1.  ,  0.  , 79.  , 91.01],
       [ 1.  ,  0.  ,  0.  ,  1.  ,  0.  , 37.  , 64.45],
       [ 0.  ,  1.  ,  0.  ,  1.  ,  0.  , 53.  , 96.66],
       [ 0.  ,  0.  ,  1.  ,  1.  ,  0.  , 79.  , 73.52],
       [ 1.  ,  0.  ,  0.  ,  0.  ,  1.  , 26.  , 75.98],
       [ 0.  ,  0.  ,  1.  ,  1.  ,  0.  , 23.  , 89.2 ],
       [ 1.  ,  0.  ,  0.  ,  0.  ,  1.  , 78.  , 21.29],
       [ 0.  ,  0.  ,  1.  ,  0.  ,  1.  , 56.  , 66.98],
       [ 1.  ,  0.  ,  0.  ,  0.  ,  1.  , 50.  , 66.84],
       [ 0.  ,  1.  ,  0.  ,  1.  ,  0.  , 67.  , 25.16],
       [ 1.  ,  0.  ,  0.  ,  0.  ,  1.  , 65.  , 64.86],
       [ 0.  ,  1.  ,  0.  ,  0.  ,  1.  , 23.  , 72.04],
       [ 0.  ,  1.  ,  0.  ,  1.  ,  0.  , 24.  , 38.14],
       [ 1.  ,  0.  ,  0.  ,  1.  ,  0.  , 46.  , 89.  ],
       [ 1.  ,

<div style="line-height:0.5">
<h2 style="color:#B0EE8F ">  # 2) Seven ways to select columns using ColumnTransformer  </h2>
</div>

- column name 
- integer position
- slice
- boolean mask
- regex pattern
- dtypes to include
- dtypes to exclude

In [43]:
#2 
""" Seven ways to select columns using ColumnTransformer 

"""
ct = make_column_transformer((ohe, ['Embarked', 'Sex']))
ct1 = make_column_transformer((ohe, [1, 2]))
ct2 = make_column_transformer((ohe, slice(1,3)))
ct3 = make_column_transformer((ohe, [False, True, True, False]))
ct4 = make_column_transformer((ohe, make_column_selector(pattern='E|S')))
ct5 = make_column_transformer((ohe, make_column_selector(dtype_include=object)))
ct6 = make_column_transformer((ohe, make_column_selector(dtype_exclude='number')))

print(ct)
print(ct1)
print(ct2)
print(ct3)
print(ct4)
print(ct5)
print(ct6)
# every transformer above selects 'Embarked' and 'Sex', so each call returns the same result
# ct.fit_transform(df)
# ct1.fit_transform(df)
# ct2.fit_transform(df)
# ct3.fit_transform(df)
# ct4.fit_transform(df)
# ct5.fit_transform(df)
ct6.fit_transform(df)

ColumnTransformer(transformers=[('onehotencoder', OneHotEncoder(),
                                 ['Embarked', 'Sex'])])
ColumnTransformer(transformers=[('onehotencoder', OneHotEncoder(), [1, 2])])
ColumnTransformer(transformers=[('onehotencoder', OneHotEncoder(),
                                 slice(1, 3, None))])
ColumnTransformer(transformers=[('onehotencoder', OneHotEncoder(),
                                 [False, True, True, False])])
ColumnTransformer(transformers=[('onehotencoder', OneHotEncoder(),
                                 <sklearn.compose._column_transformer.make_column_selector object at 0x7f04a7517a30>)])
ColumnTransformer(transformers=[('onehotencoder', OneHotEncoder(),
                                 <sklearn.compose._column_transformer.make_column_selector object at 0x7f04a5fdd9f0>)])
ColumnTransformer(transformers=[('onehotencoder', OneHotEncoder(),
                                 <sklearn.compose._column_transformer.make_column_selector object at 0x7f04a

array([[0., 0., 1., 0., 1.],
       [1., 0., 0., 1., 0.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 1., 0.],
       [1., 0., 0., 1., 0.],
       [0., 1., 0., 1., 0.],
       [0., 0., 1., 1., 0.],
       [1., 0., 0., 0., 1.],
       [0., 0., 1., 1., 0.],
       [1., 0., 0., 0., 1.],
       [0., 0., 1., 0., 1.],
       [1., 0., 0., 0., 1.],
       [0., 1., 0., 1., 0.],
       [1., 0., 0., 0., 1.],
       [0., 1., 0., 0., 1.],
       [0., 1., 0., 1., 0.],
       [1., 0., 0., 1., 0.],
       [1., 0., 0., 0., 1.],
       [0., 1., 0., 1., 0.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 1., 0.],
       [1., 0., 0., 0., 1.],
       [0., 1., 0., 0., 1.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 0., 1.],
       [1., 0., 0., 1., 0.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 1., 0.],
       [0., 0., 1., 1., 0.],
       [0., 1., 0., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 1., 0., 1., 0.],
       [0., 1.
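
A make_column_selector object is a callable: given a DataFrame, it returns the list of matching column names. The sketch below, on a hypothetical two-row frame `toy` (illustrative names only), shows which columns a regex pattern and a dtype filter resolve to:

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical frame with the same column names as df
toy = pd.DataFrame({
    "Fare": [28.84, 39.76],
    "Embarked": ["S", "C"],
    "Sex": ["male", "female"],
    "Age": [73, 45],
})

# The regex is case-sensitive: 'E' matches 'Embarked', 'S' matches 'Sex'
by_pattern = make_column_selector(pattern="E|S")(toy)
# dtype filters are forwarded to DataFrame.select_dtypes
non_numeric = make_column_selector(dtype_exclude="number")(toy)
numeric = make_column_selector(dtype_include="number")(toy)

print(by_pattern)
print(non_numeric)
print(numeric)
```

Calling the selector directly like this is a quick way to check a pattern before wiring it into a ColumnTransformer.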

<div style="line-height:0.5">
<h2 style="color:#B0EE8F ">  # 3) sklearn Transformers methods </h2>
</div>
Difference between "models" and "transformers" in sklearn: <br>
Models (predictors) expose fit and predict, while transformers expose fit and transform: <br>
&emsp; - fit = the transformer learns something from the data <br>
&emsp; - transform = uses what was learned to transform the data <br>
- CountVectorizer <br>
&emsp; - fit = learns the vocabulary <br>
&emsp; - transform = creates a document-term matrix using the vocabulary <br>
- SimpleImputer <br>
&emsp; - fit = learns the value to impute <br>
&emsp; - transform = fills in missing entries using the imputation value <br>
- StandardScaler <br>
&emsp; - fit = learns the mean and scale of each feature <br>
&emsp; - transform = standardizes the features using the mean and scale <br>
- HashingVectorizer <br>
&emsp; - fit = does nothing (stateless transformer) <br>
&emsp; - transform = creates the document-term matrix using a hash of each token <br>
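
The fit / transform split matters most when training and test data are separate: statistics are learned on the training set only and then reused. A minimal sketch on hypothetical one-column arrays (`X_train_demo` and `X_test_demo` are illustrative names):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X_train_demo = np.array([[1.0], [3.0], [np.nan], [5.0]])
X_test_demo = np.array([[np.nan], [100.0]])

# fit: learn the column mean from the training data (NaNs are ignored)
imp_demo = SimpleImputer(strategy="mean").fit(X_train_demo)
print(imp_demo.statistics_)

# transform: the test NaN is filled with the *training* mean, not the test mean
X_test_filled = imp_demo.transform(X_test_demo)
print(X_test_filled)

# Same pattern for StandardScaler: mean_ and scale_ come from the training data
scaler_demo = StandardScaler().fit(imp_demo.transform(X_train_demo))
print(scaler_demo.mean_, scaler_demo.scale_)
```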


### => CountVectorizer

In [44]:
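# Dummy list of sentences
# N.B. there is no comma after the second string, so Python joins it with the third one:
# the list holds three documents, not four (hence the three rows in the outputs).
documents = [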
    "This is the first document.",
    "Is this the first document?"
    "This document is the second document.",
    "And this is the third one.",
]
# Create an instance of CountVectorizer
vectorizer = CountVectorizer()

# Fit the vectorizer on the documents and transform them into a document-term matrix.
# X stores the word counts as a sparse matrix: each row corresponds to a document
# and each column corresponds to a word in the vocabulary.
X = vectorizer.fit_transform(documents)

# Convert the matrix to a dense array
print(X.toarray())

# Get the feature names (words in the vocabulary)
feature_names = vectorizer.get_feature_names_out()
print("Feature names are:\n", feature_names)

[[0 1 1 1 0 0 1 0 1]
 [0 3 1 2 0 1 2 0 2]
 [1 0 0 1 1 0 1 1 1]]
Feature names are:
 ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']


### => StandardScaler

In [45]:
data = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0]
])

# Create an instance of StandardScaler
scaler = StandardScaler()
# Fit the scaler and standardize the data in one step
scaled_data = scaler.fit_transform(data)

print("Standardized data:\n", scaled_data)
print("\nMean of each feature:")
print(scaler.mean_)
print("\nStandard deviation of each feature:")
print(scaler.scale_)

Standardized data:
 [[-1.22474487 -1.22474487 -1.22474487]
 [ 0.          0.          0.        ]
 [ 1.22474487  1.22474487  1.22474487]]

Mean of each feature:
[4. 5. 6.]

Standard deviation of each feature:
[2.44948974 2.44948974 2.44948974]


### => HashingVectorizer

In [46]:
# Create an instance of HashingVectorizer
vectorizer = HashingVectorizer(n_features=10)
# Transform the documents into a document-term matrix
X = vectorizer.transform(documents)
print(X.toarray())

[[ 0.          0.          0.          0.          0.          0.
  -0.57735027  0.57735027 -0.57735027  0.        ]
 [ 0.          0.         -0.30151134  0.          0.          0.30151134
  -0.30151134  0.60302269 -0.60302269  0.        ]
 [ 0.          0.40824829  0.40824829  0.         -0.40824829 -0.40824829
   0.          0.40824829 -0.40824829  0.        ]]
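
"Stateless" can be checked directly: HashingVectorizer transforms text without any call to fit, and two independent instances map the same document to the same row. CountVectorizer, by contrast, refuses to transform before it has learned a vocabulary. A small sketch with hypothetical documents (`docs_demo` and the other names are illustrative):

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

docs_demo = ["This is the first document.", "And this is the third one."]

# Two separate, never-fitted instances
hv_a = HashingVectorizer(n_features=16)
hv_b = HashingVectorizer(n_features=16)
Xa = hv_a.transform(docs_demo)
Xb = hv_b.transform(docs_demo[::-1])      # same documents, reversed order

# Each document gets the same row regardless of instance or batch order
same_rows = (Xa.toarray() == Xb.toarray()[::-1]).all()
print(same_rows)

# CountVectorizer is stateful: transform() before fit() raises NotFittedError
try:
    CountVectorizer().transform(docs_demo)
    needs_fit = False
except NotFittedError:
    needs_fit = True
print(needs_fit)
```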


<h2 style="color:#B0EE8F ">  # 4) Encoding </h2>

In [47]:
data = np.array([
    ['circle', 'first', 'S'],
    ['oval', 'second', 'M'],
    ['square', 'third', 'L'],
    ['triangle', 'fourth', 'XL']])

df1 = pd.DataFrame(data, columns=['Shape', 'Class', 'Size'])
df1.head()

Unnamed: 0,Shape,Class,Size
0,circle,first,S
1,oval,second,M
2,square,third,L
3,triangle,fourth,XL


In [48]:
""" 4.1 Dummy (one-hot) encoding -> for categories with no order relation between them.
N.B.
Using "ohe1 = OneHotEncoder(sparse=False)" leads to: "FutureWarning: `sparse` was renamed to `sparse_output` in newer versions"
With handle_unknown='ignore', categories that are unknown (not present in the training set) are encoded as all zeros.
"""
ohe1 = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
ohe1.fit_transform(df1[['Shape']])

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])

In [49]:
""" 4.2 Use OrdinalEncoder when the order of the categories matters!
The categories of each column are listed in increasing order, so that 'first' < 'second' < ... and 'S' < 'M' < ...
"""
catego = [['first', 'second', 'third', 'fourth'], ['S','M','L','XL']]
ore1 = OrdinalEncoder(categories=catego)
ore1.fit_transform(df1[['Class', 'Size']])

array([[0., 0.],
       [1., 1.],
       [2., 2.],
       [3., 3.]])

<h2 style="color:#B0EE8F ">  # 5) Imputing </h2>

In [50]:
data = np.array([
    [10.0, 20.0, np.nan, 3.2, 'A'],
    [1.4, 32.9, 5.0, np.nan, 'B'],
    [np.nan, np.nan, np.nan, np.nan, 'C'],])

data_test = np.array([
    [10.0, 20.0, np.nan, 3.2, np.nan, 91.2],
    [6.7, 3.1, 7.9, 21.1, np.nan, 8.8],])

data_88 = np.array([
    [10.0, np.nan, 20.0, 2.1, 6.8, 9.99, np.nan, 3.2, np.nan, 91.2],
    [32.3, 12.3, 1.5, np.nan, np.nan, 12.1, 18.8, 74.2, 43.1, np.nan],])

train_df = pd.DataFrame(data, columns=['feat1', 'feat2', 'feat3', 'feat4', 'label'])
test_df = pd.DataFrame(data_test, columns=['feat1', 'feat2', 'feat3', 'feat4', 'feat5', 'feat6'])

df_88 = pd.DataFrame(data_88, columns=['feat1', 'feat2', 'feat3', 'feat4', 'feat5', 'feat6', 'feat7', 'feat8', 'feat9', 'feat10']) 

print("train_df\n {}".format(train_df))
print("test_df\n {}".format(test_df))
print()
print("df_88\n {}".format(df_88))

imp_me = SimpleImputer(strategy='median')
imp_mo = SimpleImputer(strategy='most_frequent')
imp_co = SimpleImputer(strategy='constant', add_indicator=True)     # add_indicator appends columns marking where values were imputed

#### Fit and transform with each imputer
a = imp_me.fit_transform(df_88)
b = imp_mo.fit_transform(df_88)
c = imp_co.fit_transform(df_88)
print(a)
print()
print(b)
print()
print(c)

train_df
   feat1 feat2 feat3 feat4 label
0  10.0  20.0   nan   3.2     A
1   1.4  32.9   5.0   nan     B
2   nan   nan   nan   nan     C
test_df
    feat1  feat2  feat3  feat4  feat5  feat6
0   10.0   20.0    NaN    3.2    NaN   91.2
1    6.7    3.1    7.9   21.1    NaN    8.8

df_88
    feat1  feat2  feat3  feat4  feat5  feat6  feat7  feat8  feat9  feat10
0   10.0    NaN   20.0    2.1    6.8   9.99    NaN    3.2    NaN    91.2
1   32.3   12.3    1.5    NaN    NaN  12.10   18.8   74.2   43.1     NaN
[[10.   12.3  20.    2.1   6.8   9.99 18.8   3.2  43.1  91.2 ]
 [32.3  12.3   1.5   2.1   6.8  12.1  18.8  74.2  43.1  91.2 ]]

[[10.   12.3  20.    2.1   6.8   9.99 18.8   3.2  43.1  91.2 ]
 [32.3  12.3   1.5   2.1   6.8  12.1  18.8  74.2  43.1  91.2 ]]

[[10.    0.   20.    2.1   6.8   9.99  0.    3.2   0.   91.2   1.    0.
   0.    1.    1.    0.  ]
 [32.3  12.3   1.5   0.    0.   12.1  18.8  74.2  43.1   0.    0.    1.
   1.    0.    0.    1.  ]]


In [51]:
imputer = SimpleImputer()
clf = LogisticRegression()
# Create a 2-step pipeline that imputes missing values before fitting the classifier
pipe = make_pipeline(imputer, clf) 

# Fit the pipeline to the training data
pipe.fit(train_df[['feat1', 'feat2', 'feat3', 'feat4']], train_df['label'])

# Use the pipeline to make predictions on the test data
preds = pipe.predict(test_df[['feat1', 'feat2', 'feat3', 'feat4']])
preds

array(['A', 'A'], dtype=object)

<h2 style="color:#B0EE8F ">  # 6) Pipeline </h2>

<div style="line-height:0.1">
<h4>Pipeline or make_pipeline? </h4>
</div>
<div style="line-height:1.6">

Pipeline is a class that chains a list of transformers and a final estimator, applied sequentially. <br>
It takes a list of tuples, where each tuple contains the name you want to give to a step and the corresponding transformer or estimator object. <br>
Every step except the last must be a transformer (it must implement fit and transform); the last step can be any estimator, such as a classifier or a regressor. <br>
make_pipeline is a function that builds the same kind of Pipeline from a sequence of transformers and a final estimator. <br>
Unlike the Pipeline constructor, make_pipeline names each step automatically, using the lowercased class name of the transformer or estimator. <br>
This is convenient if you do not want to name each step manually. 
</div>
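
A short sketch of the naming difference, using estimators already imported in this notebook. The step names 'impute', 'scale' and 'model' are arbitrary choices for illustration:

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Pipeline: step names are chosen by hand
explicit = Pipeline([
    ("impute", SimpleImputer()),
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])

# make_pipeline: step names are derived from the class names
auto = make_pipeline(SimpleImputer(), StandardScaler(), LogisticRegression())

explicit_names = list(explicit.named_steps)
auto_names = list(auto.named_steps)
print(explicit_names)
print(auto_names)

# Step names address hyperparameters as <step>__<parameter>
auto.set_params(logisticregression__C=0.5)
```

Explicit names tend to pay off when the same class appears twice or when parameter grids are written by hand.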

In [52]:
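# dtype=object keeps 83.1, 3.2, ... as floats; without it NumPy would turn every entry (np.nan included) into a string
data = np.array([
    ['circle', 'first', 'S', 83.1],
    ['oval', 'second', 'M', 3.2],
    ['square', 'third', 'L', 45.1],
    ['triangle', 'fourth', 'XL', np.nan]
], dtype=object)

df6 = pd.DataFrame(data, columns=['Shape', 'Class', 'Size', 'Val'])

ohe6 = OneHotEncoder()
sim6 = SimpleImputer()
clf6 = LogisticRegression()

# One-hot encode 'Shape' and 'Class', impute 'Val' with sim6; 'Size' is passed through unchanged
# (it is still a string column, so it must be encoded or dropped before pipe6 can be fitted)
ct6 = ColumnTransformer([('encoder', ohe6, ['Shape', 'Class']), ('imputer', sim6, ['Val'])], remainder= 'passthrough')
pipe6 = Pipeline([('preprocessor', ct6), ('classifier', clf6)])
pipe6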

In [53]:
# Load the Boston Housing dataset from OpenML (several versions share this name, so pin version 1)
boston = fetch_openml(name='boston', version=1)
# Get the feature matrix X and target vector y
X = boston.data
y = boston.target
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
X_train[:10]



Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
13,0.62976,0.0,8.14,0,0.538,5.949,61.8,4.7075,4,307.0,21.0,396.9,8.26
61,0.17171,25.0,5.13,0,0.453,5.966,93.4,6.8185,8,284.0,19.7,378.08,14.44
377,9.82349,0.0,18.1,0,0.671,6.794,98.8,1.358,24,666.0,20.2,396.9,21.24
39,0.02763,75.0,2.95,0,0.428,6.595,21.8,5.4011,3,252.0,18.3,395.63,4.32
365,4.55587,0.0,18.1,0,0.718,3.561,87.9,1.6132,24,666.0,20.2,354.7,7.12
272,0.1146,20.0,6.96,0,0.464,6.538,58.7,3.9175,3,223.0,18.6,394.96,7.73
208,0.13587,0.0,10.59,1,0.489,6.064,59.1,4.2392,4,277.0,18.6,381.32,14.66
236,0.52058,0.0,6.2,1,0.507,6.631,76.5,4.148,8,307.0,17.4,388.45,9.54
98,0.08187,0.0,2.89,0,0.445,7.82,36.9,3.4952,2,276.0,18.0,393.53,3.57
364,3.47428,0.0,18.1,1,0.718,8.78,82.9,1.9047,24,666.0,20.2,354.55,5.29


<h2 style="color:#B0EE8F ">  # 7) train_test_split </h2>

In [54]:
## Convert the pandas DataFrame / Series returned by fetch_openml to NumPy arrays
data = boston.data.to_numpy()
target = boston.target.to_numpy()
# Get the feature matrix X and target vector y using slicing (here all rows and all columns)
X = data[:, :]
y = target[:]
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # fixing random_state makes the split reproducible
X_train[:10]

array([[0.62976, 0.0, 8.14, '0', 0.538, 5.949, 61.8, 4.7075, '4', 307.0,
        21.0, 396.9, 8.26],
       [0.17171, 25.0, 5.13, '0', 0.453, 5.966, 93.4, 6.8185, '8', 284.0,
        19.7, 378.08, 14.44],
       [9.82349, 0.0, 18.1, '0', 0.671, 6.794, 98.8, 1.358, '24', 666.0,
        20.2, 396.9, 21.24],
       [0.02763, 75.0, 2.95, '0', 0.428, 6.595, 21.8, 5.4011, '3', 252.0,
        18.3, 395.63, 4.32],
       [4.55587, 0.0, 18.1, '0', 0.718, 3.561, 87.9, 1.6132, '24', 666.0,
        20.2, 354.7, 7.12],
       [0.1146, 20.0, 6.96, '0', 0.464, 6.538, 58.7, 3.9175, '3', 223.0,
        18.6, 394.96, 7.73],
       [0.13587, 0.0, 10.59, '1', 0.489, 6.064, 59.1, 4.2392, '4', 277.0,
        18.6, 381.32, 14.66],
       [0.52058, 0.0, 6.2, '1', 0.507, 6.631, 76.5, 4.148, '8', 307.0,
        17.4, 388.45, 9.54],
       [0.08187, 0.0, 2.89, '0', 0.445, 7.82, 36.9, 3.4952, '2', 276.0,
        18.0, 393.53, 3.57],
       [3.47428, 0.0, 18.1, '1', 0.718, 8.78, 82.9, 1.9047, '24', 666.0,
        

Changing random_state from 1 to 2 uses a different random seed to split the data into training and test sets. <br>
The resulting training and test sets differ, which may affect the measured performance of the model. <br>
The impact of the seed is usually small when the dataset is large enough, and more noticeable on small datasets. <br>
In practice, it is common to evaluate over several random seeds (or with cross-validation) and report the average score, rather than picking the single seed that happens to score best.<br>
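
A minimal sketch of evaluating over several seeds. It uses a synthetic dataset from make_regression rather than the Boston data, so the numbers only illustrate the spread between splits; all names (`X_demo`, `y_demo`, `scores`) are illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression problem (assumption: stands in for a real dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=20.0, random_state=0)

scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.3, random_state=seed
    )
    # R^2 on the held-out part of this particular split
    scores.append(LinearRegression().fit(X_tr, y_tr).score(X_te, y_te))

scores = np.array(scores)
print("R^2 per seed:", scores.round(3))
print("mean = {:.3f}, std = {:.3f}".format(scores.mean(), scores.std()))
```

Reporting the mean together with the standard deviation shows how much of a score difference is just split-to-split noise.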

In [55]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2) 
X_train

array([[3.67367, 0.0, 18.1, ..., 20.2, 388.62, 10.58],
       [0.09604, 40.0, 6.41, ..., 17.6, 396.9, 2.98],
       [3.53501, 0.0, 19.58, ..., 14.7, 88.01, 15.02],
       ...,
       [0.17331, 0.0, 9.69, ..., 19.2, 396.9, 12.01],
       [0.62739, 0.0, 8.14, ..., 21.0, 395.62, 8.47],
       [2.3004, 0.0, 19.58, ..., 14.7, 297.09, 11.1]], dtype=object)

<div style="line-height:0.5">
<h2 style="color:#B0EE8F ">  # 8) IterativeImputer </h2>
</div>
IterativeImputer fits a regression model for each feature with missing values and uses its predictions as the imputed values. <br>
Each feature with missing values is modeled as a function of the other features, <br>
and the features are processed in a round-robin fashion.

In [56]:
####### Create a Pandas DataFrame with 5 columns and 50 rows
df8 = pd.DataFrame({
    'A': np.random.randint(low=0, high=100, size=50),
    'B': np.random.randint(low=0, high=100, size=50),
    'C': np.random.randint(low=0, high=100, size=50),
    'D': np.random.randint(low=0, high=100, size=50),
    'E': np.random.randint(low=0, high=100, size=50),
})

#### Introduce some missing values into the DataFrame
df8.iloc[2:5, 0] = np.nan
df8.iloc[7:9, 2] = np.nan
df8.iloc[12, 3] = np.nan
df8.iloc[15, 4] = np.nan

impute_it = IterativeImputer() 
impute_it.fit_transform(df8)

array([[17.        , 80.        , 21.        , 39.        , 82.        ],
       [88.        , 31.        , 35.        , 63.        , 34.        ],
       [52.65692187, 90.        , 59.        ,  2.        , 85.        ],
       [52.37222056, 91.        , 34.        , 34.        , 10.        ],
       [52.60022313, 62.        ,  4.        ,  8.        , 24.        ],
       [26.        , 19.        , 24.        , 99.        , 78.        ],
       [45.        , 80.        ,  0.        , 15.        , 97.        ],
       [26.        , 37.        , 45.17234036, 53.        , 85.        ],
       [76.        , 93.        , 44.74688926, 60.        , 79.        ],
       [29.        , 63.        , 84.        , 79.        , 26.        ],
       [28.        , 56.        , 72.        , 63.        , 62.        ],
       [ 6.        , 66.        , 75.        , 95.        , 89.        ],
       [69.        , 30.        , 95.        , 55.5177167 , 39.        ],
       [52.        , 72.        , 24. 
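
The columns of df8 are independent random integers, so there is little for the regression models to exploit. The sketch below uses hypothetical data in which the second column is roughly twice the first, and compares IterativeImputer with plain mean imputation on values that were hidden on purpose. All names are illustrative and the data is synthetic:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
# Second column is strongly correlated with the first: y ~ 2x + small noise
full = np.column_stack([x, 2 * x + rng.normal(0, 0.1, size=200)])

# Hide every 10th value of the second column
holey = full.copy()
missing_rows = np.arange(0, 200, 10)
holey[missing_rows, 1] = np.nan

mean_filled = SimpleImputer().fit_transform(holey)
iter_filled = IterativeImputer(random_state=0).fit_transform(holey)

# Mean absolute error against the values that were hidden
mean_err = np.abs(mean_filled[missing_rows, 1] - full[missing_rows, 1]).mean()
iter_err = np.abs(iter_filled[missing_rows, 1] - full[missing_rows, 1]).mean()
print("mean imputation error:     ", round(mean_err, 3))
print("iterative imputation error:", round(iter_err, 3))
```

When features are related, the regression-based estimate can use the observed column to place the missing value, which a single column mean cannot do.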

<div style="line-height:0.5">
<h2 style="color:#B0EE8F ">  # 9) KNNImputer </h2>
</div>
Imputation for completing missing values using k-Nearest Neighbors. <br>
For each row with a missing value, find the most similar rows (n_neighbors, 5 by default) that have a value in that column, and average those values. <br>
All features that are passed to the imputer are taken into account when measuring similarity. <br> 
Each sample's missing values are imputed using the mean value from the `n_neighbors` nearest neighbors found in the training set.  <br>
Two samples are close if the features that neither is missing are close.
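
A tiny worked example makes the averaging visible. With n_neighbors=2 on the hypothetical 4x3 array below, each missing entry becomes the mean of that column over the two closest rows that have a value there:

```python
import numpy as np
from sklearn.impute import KNNImputer

X_knn = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

knn_filled = KNNImputer(n_neighbors=2).fit_transform(X_knn)
print(knn_filled)
# Row 0, column 2: nearest rows are 1 and 2 -> (3 + 5) / 2 = 4.0
# Row 2, column 0: nearest rows are 1 and 3 -> (3 + 8) / 2 = 5.5
```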

In [None]:
""" Create KNNImputer object """
impute_knn = KNNImputer()

# ...with specific settings
imputer1 = KNNImputer(n_neighbors=3, weights='distance')

In [None]:
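# ...with a custom distance metric
# A callable metric receives two rows (1-D arrays) plus the keyword missing_values and returns a scalar distance
def my_distance(x, y, *, missing_values=np.nan):
    both = ~(np.isnan(x) | np.isnan(y))      # coordinates present in both rows
    if not both.any():
        return np.nan                         # no shared coordinates: distance undefined
    return np.sum(np.abs(x[both] - y[both]))  # Manhattan distance on shared coordinates
imputer2 = KNNImputer(metric=my_distance)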

In [None]:
# ...with an indicator for missing values
imputer3 = KNNImputer(add_indicator=True)

# ...that keeps empty features
imputer4 = KNNImputer(keep_empty_features=True)

In [57]:
impute_knn.fit_transform(df8)

array([[17. , 80. , 21. , 39. , 82. ],
       [88. , 31. , 35. , 63. , 34. ],
       [60.6, 90. , 59. ,  2. , 85. ],
       [82.2, 91. , 34. , 34. , 10. ],
       [66.4, 62. ,  4. ,  8. , 24. ],
       [26. , 19. , 24. , 99. , 78. ],
       [45. , 80. ,  0. , 15. , 97. ],
       [26. , 37. , 42.6, 53. , 85. ],
       [76. , 93. , 38. , 60. , 79. ],
       [29. , 63. , 84. , 79. , 26. ],
       [28. , 56. , 72. , 63. , 62. ],
       [ 6. , 66. , 75. , 95. , 89. ],
       [69. , 30. , 95. , 59.8, 39. ],
       [52. , 72. , 24. , 99. , 26. ],
       [83. , 93. ,  2. , 52. , 37. ],
       [20. , 43. , 35. , 55. , 74.2],
       [31. , 16. , 77. , 48. , 49. ],
       [49. , 21. , 15. , 46. , 75. ],
       [91. , 94. , 22. ,  0. , 28. ],
       [ 8. , 72. , 66. , 75. , 50. ],
       [24. ,  1. , 48. , 49. , 95. ],
       [76. , 28. , 30. ,  3. , 73. ],
       [75. , 24. , 69. , 39. , 72. ],
       [53. , 48. ,  6. ,  9. , 58. ],
       [ 9. ,  0. , 94. , 45. , 11. ],
       [59. ,  3. , 33. ,

<div style="line-height:0.5">
<h2 style="color:#B0EE8F ">  # 10) Pipeline </h2>
</div>
Examine the intermediate steps in a Pipeline using the "named_steps" attribute.

In [65]:
data10 = {
    'age': np.random.randint(18, 65, size=30),
    'income': np.random.uniform(1000, 50000, size=30),
    'weight': np.random.normal(70, 10, size=30),
    'height': np.random.normal(170, 10, size=30)    
}
# Create a dataFrame from the dictionary
df_df10 = pd.DataFrame(data10)
## Add some missing values to the DataFrame
df_df10.iloc[[1, 3, 5, 7, 9], [0, 2]] = np.nan
df_df10.iloc[[0, 2, 4, 6, 8], [1, 2]] = np.nan
print(df_df10)

# N.B. using LogisticRegression (a classifier) to predict a continuous target ('height')
# would raise an error, so a LinearRegression is used instead.
pipeline10 = make_pipeline(SimpleImputer(strategy='mean'), LinearRegression())

# Fit the pipeline on the dataframe
X = df_df10[['age', 'income', 'weight']]
y = df_df10[['height']]
y = y.values.ravel()
pipeline10.fit(X, y)
pipeline10.named_steps.simpleimputer.statistics_

     age        income     weight      height
0   27.0           NaN        NaN  160.369850
1    NaN  46098.954586        NaN  182.485959
2   33.0           NaN        NaN  165.390121
3    NaN  29154.694513        NaN  169.951719
4   24.0           NaN        NaN  174.382466
5    NaN  14139.835887        NaN  177.142351
6   21.0           NaN        NaN  173.677084
7    NaN  16951.347248        NaN  171.302470
8   64.0           NaN        NaN  172.566214
9    NaN   4399.612106        NaN  188.329464
10  31.0   7099.221189  62.144848  177.983105
11  20.0  36317.705028  77.394001  174.893749
12  45.0  16892.663894  67.837276  162.693641
13  24.0  25513.308948  56.349276  167.499316
14  47.0  10806.506602  76.132502  155.175132
15  60.0  13545.618444  53.723602  156.728152
16  43.0  34168.227303  81.336970  164.925205
17  33.0  39296.855218  66.351949  161.619488
18  44.0  30385.773809  57.882477  171.160851
19  48.0  29830.510104  72.507152  173.497910
20  37.0  35359.521792  74.861651 

array([   39.48      , 24341.95417275,    69.41410532])

The error "ValueError: Unknown label type: 'continuous'" occurs when a classification algorithm is fitted on a regression target.    
In other words, it occurs when trying to predict a continuous value (the height in this case) with a classifier (LogisticRegression).
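
The error is easy to reproduce on hypothetical data: a continuous target similar to 'height' is rejected by LogisticRegression and accepted by LinearRegression. The names `X_h`, `y_h` and `message` are illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X_h = rng.normal(size=(20, 2))
y_h = rng.normal(170, 10, size=20)   # continuous target, like 'height'

# A classifier checks its targets and refuses continuous values
try:
    LogisticRegression().fit(X_h, y_h)
    message = None
except ValueError as err:
    message = str(err)
print(message)

# A regressor accepts the same target without complaint
reg_h = LinearRegression().fit(X_h, y_h)
print(reg_h.coef_)
```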

In [66]:
data10_1 = {
    'age': np.random.randint(18, 65, size=30),
    'income': np.random.uniform(1000, 50000, size=30),
    'weight': np.random.normal(70, 10, size=30),
    'height': np.random.normal(170, 10, size=30)    
}
# DataFrame from the dictionary
df_df10_1 = pd.DataFrame(data10_1)
# Add some missing values to the DataFrame
df_df10_1.iloc[[1, 3, 5, 7, 9], [0, 2]] = np.nan
df_df10_1.iloc[[0, 2, 4, 6, 8], [1, 3]] = np.nan
# Drop the rows whose target ('height') is NaN, since a model cannot be trained on missing targets
df_df10_1.dropna(subset=['height'], inplace=True) 
print(df_df10_1)

pipeline10_1 = make_pipeline(SimpleImputer(strategy='mean'), LinearRegression())
# Fit the pipeline on the dataframe
X = df_df10_1[['age', 'income', 'weight']]
y = df_df10_1['height']
pipeline10_1.fit(X, y)
pipeline10_1.named_steps.simpleimputer.statistics_

     age        income     weight      height
1    NaN  11054.445086        NaN  172.639935
3    NaN  36508.536719        NaN  188.096091
5    NaN   4776.843955        NaN  153.047794
7    NaN   8100.747119        NaN  174.241175
9    NaN  35064.573432        NaN  143.435412
10  51.0  13174.220919  50.638642  177.626253
11  25.0  48637.415259  81.489223  174.199149
12  28.0  37645.152788  76.323688  177.732540
13  54.0  11317.419565  62.753748  164.563532
14  32.0  21117.539019  55.345778  167.980545
15  40.0  31380.635226  65.353420  176.902690
16  32.0  28190.970242  72.387077  172.067400
17  28.0  22987.606034  89.239745  177.966825
18  35.0  28471.498554  73.346421  165.829286
19  60.0  44531.739827  56.089202  175.550565
20  18.0  38073.433008  54.920984  162.340931
21  27.0  29122.933297  68.919706  174.474787
22  59.0  38783.650454  69.170823  157.333739
23  26.0  32295.215340  71.356488  168.477760
24  52.0   1671.559132  95.750649  178.362821
25  63.0  10298.318636  59.068668 

array([   39.        , 24181.493246  ,    69.67469663])
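
Besides named_steps, a fitted Pipeline can be indexed by step name or position and sliced, which gives access to the data as it looks after the preprocessing steps. A minimal sketch on a hypothetical 4x2 array (`X_p`, `y_p` and `pipe_p` are illustrative names):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

X_p = np.array([[1.0, np.nan], [3.0, 4.0], [np.nan, 8.0], [5.0, 6.0]])
y_p = np.array([1.0, 2.0, 3.0, 4.0])

pipe_p = make_pipeline(SimpleImputer(strategy="mean"), LinearRegression()).fit(X_p, y_p)

# Three equivalent ways to reach the fitted imputer
by_attr = pipe_p.named_steps.simpleimputer.statistics_
by_key = pipe_p["simpleimputer"].statistics_
by_pos = pipe_p[0].statistics_
print(by_attr)

# pipe_p[:-1] is a sub-pipeline holding every step except the final estimator
X_after_impute = pipe_p[:-1].transform(X_p)
print(X_after_impute)

# The last step is the fitted regressor
print(pipe_p[-1].coef_)
```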

In [67]:
df_df10_1

Unnamed: 0,age,income,weight,height
1,,11054.445086,,172.639935
3,,36508.536719,,188.096091
5,,4776.843955,,153.047794
7,,8100.747119,,174.241175
9,,35064.573432,,143.435412
10,51.0,13174.220919,50.638642,177.626253
11,25.0,48637.415259,81.489223,174.199149
12,28.0,37645.152788,76.323688,177.73254
13,54.0,11317.419565,62.753748,164.563532
14,32.0,21117.539019,55.345778,167.980545


<div style="line-height:0.5">
<h2 style="color:#B0EE8F ">  # 11) Handling missing values </h2>
</div>

In [68]:
df_11 = copy(df_df10_1)
df_11 = df_11.apply(lambda x: x * np.random.rand())

print(df_df10_1.head(3))
print(df_11.head(3))
print()
# Count the NaNs in the whole dataframe
print(df_11.isna().sum().sum())  
# Count the NaNs in each column
print(df_11.isna().sum())        

# df_11_1 is another name for df_11 (not a copy), so pop() below also removes 'height' from df_11
df_11_1 = df_11
label = df_11_1.pop('height')
print(label)
print()
print(df_11_1.head())

   age        income  weight      height
1  NaN  11054.445086     NaN  172.639935
3  NaN  36508.536719     NaN  188.096091
5  NaN   4776.843955     NaN  153.047794
   age       income  weight     height
1  NaN  1776.319014     NaN  69.438018
3  NaN  5866.491484     NaN  75.654684
5  NaN   767.582514     NaN  61.557805

10
age       5
income    0
weight    5
height    0
dtype: int64
1     69.438018
3     75.654684
5     61.557805
7     70.082057
9     57.691581
10    71.443580
11    70.065154
12    71.486330
13    66.189584
14    67.563951
15    71.152553
16    69.207737
17    71.580562
18    66.698687
19    70.608711
20    65.295625
21    70.176019
22    63.281668
23    67.763937
24    71.739837
25    71.136732
26    67.797308
27    64.824495
28    73.272476
29    71.368767
Name: height, dtype: float64

   age       income  weight
1  NaN  1776.319014     NaN
3  NaN  5866.491484     NaN
5  NaN   767.582514     NaN
7  NaN  1301.694570     NaN
9  NaN  5634.463606     NaN


In [71]:
df_11.head()

Unnamed: 0,age,income,weight
1,,1776.319014,
3,,5866.491484,
5,,767.582514,
7,,1301.69457,
9,,5634.463606,


In [70]:
## Row slicing: all rows except the last one (X1) and all rows except the last two (X2); all columns are kept
X1 = df_11[:-1] 
X2 = df_11[:-2]

print(X1)
print(X2)

          age       income    weight
1         NaN  1776.319014       NaN
3         NaN  5866.491484       NaN
5         NaN   767.582514       NaN
7         NaN  1301.694570       NaN
9         NaN  5634.463606       NaN
10  20.327706  2116.942003  1.196770
11   9.964562  7815.459289  1.925878
12  11.160309  6049.132288  1.803799
13  21.523453  1818.575913  1.483093
14  12.754639  3393.339585  1.308016
15  15.943299  5042.498163  1.544532
16  12.754639  4529.956600  1.710762
17  11.160309  3693.837309  2.109051
18  13.950386  4575.034193  1.733435
19  23.914948  7155.725645  1.325586
20   7.174484  6117.951871  1.297977
21  10.761727  4679.712077  1.628816
22  23.516366  6232.075442  1.634751
23  10.363144  5189.460406  1.686406
24  20.726288   268.599848  2.262926
25  25.110695  1654.818407  1.396001
26  18.733376  3785.477219  1.671650
27  16.341881  4637.638737  1.654835
28  14.348969  1858.356842  1.715714
          age       income    weight
1         NaN  1776.319014       NaN
3

In [75]:
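""" 
Four ways to handle missing values:
1) drop rows containing NaNs
2) drop columns containing NaNs 
3) fill NaNs with imputed values
4) use a model that natively handles NaNs
"""
X = df_11
# Simplest form of option 3: replace every NaN with the constant 2
X = X.fillna(2)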
print(X)
print(label)
# Split the data; fixing random_state makes the split reproducible across runs
X_train, X_test, y_train, y_test = train_test_split(X, label, test_size=0.3, random_state=1) 

clf_11 = GradientBoostingRegressor()
clf_11.fit(X_train, y_train)
y_pred = clf_11.predict(X_test)

          age       income    weight
1    2.000000  1776.319014  2.000000
3    2.000000  5866.491484  2.000000
5    2.000000   767.582514  2.000000
7    2.000000  1301.694570  2.000000
9    2.000000  5634.463606  2.000000
10  20.327706  2116.942003  1.196770
11   9.964562  7815.459289  1.925878
12  11.160309  6049.132288  1.803799
13  21.523453  1818.575913  1.483093
14  12.754639  3393.339585  1.308016
15  15.943299  5042.498163  1.544532
16  12.754639  4529.956600  1.710762
17  11.160309  3693.837309  2.109051
18  13.950386  4575.034193  1.733435
19  23.914948  7155.725645  1.325586
20   7.174484  6117.951871  1.297977
21  10.761727  4679.712077  1.628816
22  23.516366  6232.075442  1.634751
23  10.363144  5189.460406  1.686406
24  20.726288   268.599848  2.262926
25  25.110695  1654.818407  1.396001
26  18.733376  3785.477219  1.671650
27  16.341881  4637.638737  1.654835
28  14.348969  1858.356842  1.715714
29  10.363144  1180.879743  1.843187
1     69.438018
3     75.654684
5     
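
The first three strategies listed in the cell above can be sketched side by side. This is only an illustration: the frame `toy` is hypothetical, and the median is just one possible choice of imputation strategy.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame (values are made up)
toy = pd.DataFrame({
    "age": [np.nan, 51.0, 25.0, 28.0],
    "income": [11054.4, 13174.2, 48637.4, 37645.2],
    "weight": [np.nan, 50.6, 81.5, 76.3],
})

# 1) Drop rows containing NaNs
rows_dropped = toy.dropna(axis=0)

# 2) Drop columns containing NaNs
cols_dropped = toy.dropna(axis=1)

# 3) Fill NaNs with imputed values (here: the median of each column)
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns, index=toy.index)

print(rows_dropped.shape, list(cols_dropped.columns))
print(imputed)
```

In a real workflow the imputer would be fitted on the training split only and then applied to the test split, so that no information from the test data leaks into the fill values.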

<div style="line-height:0.5">
<h2 style="color:#B0EE8F ">  # 12) HistGradientBoostingClassifier </h2>
</div>

In [76]:
iris = load_iris()
## Split the iris data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)

# Create an instance of the classifier
clf = HistGradientBoostingClassifier()      
# Fit the classifier to the training data
clf.fit(X_train, y_train)

# Use the classifier to make predictions on the testing data
y_pred = clf.predict(X_test)

# Calculate the accuracy of the classifier
accuracy = accuracy_score(y_test, y_pred)
accuracy

1.0

N.B. <br>
The make_pipeline function from the sklearn.pipeline module accepts a sequence of transformer and estimator objects <br> and names each step automatically after its lower-cased class name; the Pipeline class instead takes a list of (name, transformer/estimator) tuples. <br> 
In the next cell, SimpleImputer() and LogisticRegression() are the transformer and the estimator, respectively.
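
A short sketch of how step names differ between `make_pipeline` and `Pipeline`. The step names `"imputer"` and `"clf"` are arbitrary labels chosen here for illustration.

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline

# make_pipeline: pass the objects only, names are generated from the class names
auto_named = make_pipeline(SimpleImputer(), LogisticRegression())
auto_names = [name for name, _ in auto_named.steps]

# Pipeline: pass (name, object) tuples, names are chosen by the user
explicit = Pipeline(steps=[("imputer", SimpleImputer()), ("clf", LogisticRegression())])
explicit_names = [name for name, _ in explicit.steps]

print(auto_names)
print(explicit_names)
```

Explicit names are convenient when the same class appears twice in a pipeline, or when the names are used in a parameter grid such as `clf__C`.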

In [77]:
# Simple dataframe
df = pd.DataFrame({'Age': [22, 21, 29, 19, 31, 23, 44, np.nan, 21, 42], 
                'Grade': [2, 1, 3, 4, 5, 4, 1, 5, 4, 0], 
                'Passed': ['no', 'no', 'yes', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'no']})
print(df)

X = df[['Age', 'Grade']]
y = df['Passed']

pipe = make_pipeline(SimpleImputer(), LogisticRegression())
# Jupyter trick: end the line with ; to suppress the printed output, e.g. pipe.fit(X, y);
pipe.fit(X, y)

    Age  Grade Passed
0  22.0      2     no
1  21.0      1     no
2  29.0      3    yes
3  19.0      4    yes
4  31.0      5    yes
5  23.0      4    yes
6  44.0      1     no
7   NaN      5    yes
8  21.0      4    yes
9  42.0      0     no


In [78]:
""" Inspect the imputer: statistics_ holds the fill value computed for each column (the mean, by default). 
Access each step through the lower-case version of its class name (in our case simpleimputer for SimpleImputer()).
"""
print(pipe.named_steps.simpleimputer.statistics_)
print(pipe.named_steps.logisticregression.coef_)

[28.   2.9]
[[-0.02176605  1.35158464]]
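
As a check on the values printed above, the sketch below rebuilds the same pipeline and compares `statistics_` with the column means computed by pandas, which skips NaNs by default. It also shows that a step can be reached by name or by position.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

df = pd.DataFrame({'Age': [22, 21, 29, 19, 31, 23, 44, np.nan, 21, 42],
                   'Grade': [2, 1, 3, 4, 5, 4, 1, 5, 4, 0],
                   'Passed': ['no', 'no', 'yes', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'no']})
X = df[['Age', 'Grade']]
y = df['Passed']

pipe = make_pipeline(SimpleImputer(), LogisticRegression())
pipe.fit(X, y)

# The same step, reached by name and by position
imputer = pipe.named_steps["simpleimputer"]
same_step = imputer is pipe[0]

# Column means with NaN ignored: Age uses 9 values, Grade uses 10
column_means = X.mean().to_numpy()

print(imputer.statistics_)
print(column_means)
```

The mean of `Age` is taken over the nine observed values only, and that mean is the value written into row 7 before the logistic regression sees the data.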
