# Pipelines

The problems in this notebook correspond to the concepts covered in:
- `Lectures/Cleaning/2. Basic Pipelines` and
- `Lectures/Cleaning/5. More Advanced Pipelines`.

Note: you should wait to solve these problems until we have covered the first few `Classification` lecture notebooks.

In [1]:
import pandas as pd
import numpy as np

##### 1. `iris` pipeline

Load the usual iris data set from `sklearn`. Build a pipeline that scales these data and then fits a $k$-nearest neighbors model to predict the iris type.

In [2]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

In [3]:
data = load_iris()

data.keys()

X = data['data']
y = data['target']

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                            test_size=.2,
                                            shuffle=True,
                                            random_state=40301)

##### Sample Solution

In [5]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

In [6]:
pipe = Pipeline([('scale', StandardScaler()),
                    ('knn', KNeighborsClassifier(5))])


pipe.fit(X_train, y_train)

In [7]:
accuracy_score(y_train, pipe.predict(X_train))

0.9666666666666667

In [8]:
confusion_matrix(y_train, pipe.predict(X_train))

array([[40,  0,  0],
       [ 0, 39,  2],
       [ 0,  2, 37]])
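
The scores above are computed on the training split. As a minimal sketch, the same pipeline could also be checked on the held-out split created earlier in this problem. The names `test_pred`, `test_acc` and `test_cm` are introduced here for illustration only, and no output is shown because this check was not part of the original run.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# same data and split settings as the cells above
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                            test_size=.2,
                                            shuffle=True,
                                            random_state=40301)

# same pipeline as the sample solution
pipe = Pipeline([('scale', StandardScaler()),
                    ('knn', KNeighborsClassifier(5))])
pipe.fit(X_train, y_train)

# the scaler statistics come from X_train only;
# predict applies them to X_test before the kNN step
test_pred = pipe.predict(X_test)

test_acc = accuracy_score(y_test, test_pred)
test_cm = confusion_matrix(y_test, test_pred, labels=[0, 1, 2])
```

Because the scaler lives inside the pipeline, the test observations never influence the means and standard deviations used for scaling.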

##### 2. `iris` with NAs pipeline - `SimpleImputer`

Now load the adjusted iris data set which has some missing values.

Build a pipeline that imputes the missing values with `SimpleImputer` using the median, scales the data, and then fits $k$NN with $k=5$.
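
As a quick illustration of what median imputation does, here is a small sketch on a made-up array. `X_toy` is hypothetical data, not the iris file.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# hypothetical toy data with one missing value in each column
X_toy = np.array([[1.0, 10.0],
                  [2.0, np.nan],
                  [np.nan, 30.0],
                  [4.0, 40.0]])

imputer = SimpleImputer(strategy='median')

# fit learns one median per column, ignoring the missing entries;
# transform fills each missing entry with its column's median
X_filled = imputer.fit_transform(X_toy)

# column medians learned during fit: 2.0 and 30.0
medians = imputer.statistics_
```

Inside a `Pipeline`, the medians are learned when the pipeline is fit and then reused unchanged when it predicts.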

In [9]:
iris = pd.read_csv("../../data/iris_w_nas.csv")

iris_train, iris_test = train_test_split(iris.copy(),
                                            shuffle=True,
                                            random_state=233,
                                            stratify = iris['iris_class'])

##### Sample Solution

In [10]:
from sklearn.impute import SimpleImputer

In [11]:
pipe = Pipeline([('impute', SimpleImputer(strategy='median')),
                    ('scale', StandardScaler()),
                    ('knn', KNeighborsClassifier(5))])

In [12]:
pipe.fit(iris_train[['sepal_length',
                        'sepal_width',
                        'petal_length',
                        'petal_width']],
            iris_train.iris_class)

pred = pipe.predict(iris_train[['sepal_length',
                        'sepal_width',
                        'petal_length',
                        'petal_width']])

# display the predictions computed above rather than predicting a second time
pred

array([0, 1, 1, 0, 0, 2, 2, 1, 2, 2, 2, 1, 2, 0, 1, 2, 1, 2, 1, 0, 1, 2,
       2, 0, 2, 1, 2, 0, 0, 2, 2, 0, 2, 0, 2, 0, 0, 0, 1, 0, 0, 2, 2, 1,
       1, 1, 1, 2, 1, 2, 1, 1, 0, 0, 0, 1, 2, 1, 2, 2, 1, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 1, 1, 0, 2, 1, 2, 2, 2, 2, 1, 1, 0, 0, 2, 2, 0,
       2, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 2, 1, 1, 2, 0, 1, 0, 1, 1,
       2, 0])

In [13]:
accuracy_score(iris_train.iris_class, pred)

0.9375

In [14]:
confusion_matrix(iris_train.iris_class, pred)

array([[38,  0,  0],
       [ 0, 35,  2],
       [ 0,  5, 32]])

##### 3. `iris` with NAs pipeline - `KNNImputer`

Rebuild the pipeline from problem 2, but this time use <a href="https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html#sklearn.impute.KNNImputer">`KNNImputer`</a> with $k=5$ instead of `SimpleImputer`.
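
To see how `KNNImputer` fills a gap, here is a small sketch on a made-up array. `X_toy` is hypothetical data, and it uses `n_neighbors=2` only because the array has just four rows; the exercise itself asks for $k=5$.

```python
import numpy as np
from sklearn.impute import KNNImputer

# hypothetical toy data with two missing entries
X_toy = np.array([[1.0, 2.0, np.nan],
                  [3.0, 4.0, 3.0],
                  [np.nan, 6.0, 5.0],
                  [8.0, 8.0, 7.0]])

# each missing entry is replaced by the mean of that feature over the
# 2 nearest rows that have it, with distances computed on the
# features that both rows have observed
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X_toy)

# row 0, column 2: nearest rows are 1 and 2, so (3 + 5) / 2 = 4.0
# row 2, column 0: nearest rows are 1 and 3, so (3 + 8) / 2 = 5.5
```

Unlike the median strategy, the filled value depends on which rows look most similar to the row with the gap.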

##### Sample Solution

In [15]:
from sklearn.impute import KNNImputer

In [16]:
pipe = Pipeline([('impute', KNNImputer(n_neighbors=5)),
                    ('scale', StandardScaler()),
                    ('knn', KNeighborsClassifier(5))])

In [17]:
pipe.fit(iris_train[['sepal_length',
                        'sepal_width',
                        'petal_length',
                        'petal_width']],
            iris_train.iris_class)

pred = pipe.predict(iris_train[['sepal_length',
                        'sepal_width',
                        'petal_length',
                        'petal_width']])

# display the predictions computed above rather than predicting a second time
pred

array([0, 1, 1, 0, 0, 2, 2, 1, 2, 2, 2, 1, 2, 0, 1, 2, 1, 2, 2, 0, 1, 2,
       2, 0, 2, 1, 2, 0, 0, 2, 2, 0, 2, 0, 2, 0, 0, 0, 1, 0, 0, 2, 2, 1,
       1, 1, 1, 2, 1, 2, 1, 1, 0, 0, 0, 1, 2, 1, 2, 2, 1, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 1, 1, 0, 2, 1, 2, 2, 2, 2, 1, 1, 0, 0, 2, 2, 0,
       2, 1, 1, 0, 1, 2, 1, 1, 0, 0, 0, 0, 1, 2, 1, 1, 2, 0, 1, 0, 2, 1,
       2, 0])

In [18]:
accuracy_score(iris_train.iris_class, pred)

0.9642857142857143

In [19]:
confusion_matrix(iris_train.iris_class, pred)

array([[38,  0,  0],
       [ 0, 35,  2],
       [ 0,  2, 35]])

##### 4. `iris` with NAs pipeline - Custom Imputer 

<i>You may want to go through `Lectures/Cleaning/5. More Advanced Pipelines` prior to attempting this exercise.</i>

Create a custom imputer object to impute the missing values of `petal_width` by regressing onto the other three features.

Use that imputer as the first step of the pipeline you built in problems 2 and 3 above.
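
If the custom transformer interface is unfamiliar, the following is a minimal sketch of the pattern using a simpler, hypothetical imputer. `ColumnMeanImputer` and the `toy` data frame are made up for illustration and are not the regression imputer this problem asks for.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnMeanImputer(BaseEstimator, TransformerMixin):
    """Hypothetical example: fill one column's NaNs with its mean."""
    def __init__(self, column):
        # constructor arguments are stored unchanged
        self.column = column

    def fit(self, X, y=None):
        # learn the fill value from the data passed to fit
        self.mean_ = X[self.column].mean()
        return self

    def transform(self, X):
        # return a filled copy, leaving the input untouched
        X_out = X.copy()
        X_out[self.column] = X_out[self.column].fillna(self.mean_)
        return X_out

# hypothetical toy data
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0],
                    'b': [10.0, np.nan, 30.0]})

# fit_transform is supplied by TransformerMixin
filled = ColumnMeanImputer(column='b').fit_transform(toy)
```

Any object with `fit` and `transform` methods like these can be used as a non-final step of a `Pipeline`.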

##### Sample Solution

In [20]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression

In [21]:
class reg_impute(BaseEstimator, TransformerMixin):
    # Class constructor
    # This runs when you instantiate the class by calling
    # reg_impute()
    def __init__(self):
        # Each object starts with
        # an unfitted LinearRegression object
        self.LinearRegression = LinearRegression()

    # fitting the LinearRegression
    def fit(self, X, y = None ):
        ## first get where there is complete data to fit the model
        X_no_nas = X.loc[~X.petal_width.isna()].copy()
        
        ## Then fit the linear regression with that data
        self.LinearRegression.fit(X_no_nas[['sepal_length',
                                        'sepal_width',
                                        'petal_length']],
                                     X_no_nas['petal_width'])
        return self
    
    # transform should return the data with imputed values
    def transform(self, X, y = None):
        ## first I copy X
        copy_X = X.copy()
        
        ## boolean mask of the rows missing `petal_width`
        missing = copy_X.petal_width.isna()
        
        ## This replaces the missing values of `petal_width`
        ## with regressed values; skip when nothing is missing,
        ## since predict raises an error on an empty input
        if missing.any():
            copy_X.loc[missing, 'petal_width'] = self.LinearRegression.predict(
                copy_X.loc[missing, ['sepal_length',
                                     'sepal_width',
                                     'petal_length']])
        return copy_X

In [22]:
pipe = Pipeline([('impute', reg_impute()),
                    ('scale', StandardScaler()),
                    ('knn', KNeighborsClassifier(5))])

In [23]:
pipe.fit(iris_train[['sepal_length',
                        'sepal_width',
                        'petal_length',
                        'petal_width']],
            iris_train.iris_class)

pred = pipe.predict(iris_train[['sepal_length',
                        'sepal_width',
                        'petal_length',
                        'petal_width']])

# display the predictions computed above rather than predicting a second time
pred

array([0, 1, 1, 0, 0, 1, 2, 1, 2, 2, 2, 1, 2, 0, 1, 2, 1, 2, 2, 0, 1, 2,
       2, 0, 2, 1, 2, 0, 0, 2, 2, 0, 2, 0, 2, 0, 0, 0, 1, 0, 0, 2, 2, 1,
       1, 1, 1, 2, 1, 2, 1, 1, 0, 0, 0, 1, 2, 1, 2, 2, 1, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 1, 1, 0, 2, 1, 2, 2, 2, 2, 1, 1, 0, 0, 2, 2, 0,
       2, 1, 1, 0, 1, 2, 1, 1, 0, 0, 0, 0, 1, 2, 1, 1, 2, 0, 1, 0, 2, 1,
       2, 0])

In [24]:
accuracy_score(iris_train.iris_class, pred)

0.9732142857142857

In [25]:
confusion_matrix(iris_train.iris_class, pred)

array([[38,  0,  0],
       [ 0, 36,  1],
       [ 0,  2, 35]])

--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2023.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md).