# Summary

This week, we will cover chapters 1 and 2 of the book:
* The **first chapter** (*The Machine Learning Landscape*) is an introduction to the topic of machine learning
* The **second chapter** (*End-to-End Machine Learning Project*) goes through an example machine learning project from end to end

There are two key concepts in the material of the second chapter which we will review:
1. The design principles of `sklearn`
2. The notion of pipelines in `sklearn`

# Design principles of `sklearn`

The principles behind `sklearn` are laid out in the paper [API design for machine learning software: experiences from the scikit-learn project](https://arxiv.org/abs/1309.0238).

### Estimators

Any object that can estimate some parameters based on a dataset is called an **estimator** (e.g., an imputer is an estimator). The estimation itself is performed by the `fit()` method, and it takes only a dataset as a parameter (or two for supervised learning algorithms; the second dataset contains the labels). Any other parameter needed to guide the estimation process is considered a **hyperparameter** (such as an imputer’s strategy), and it must be set as an instance variable (generally via a constructor parameter).

### Transformers

Some estimators (such as an imputer) can also transform a dataset; these are called **transformers**. Once again, the API is simple: the transformation is performed by the `transform()` method with the dataset to transform as a parameter. It returns the transformed dataset. This transformation generally relies on the learned parameters, as is the case for an imputer. All transformers also have a convenience method called `fit_transform()` that is equivalent to calling `fit()` and then `transform()` (but sometimes `fit_transform()` is optimized and runs much faster).
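
As a minimal sketch of the estimator and transformer interfaces, the following uses `SimpleImputer` on a small made-up array (the data and variable names are assumptions for illustration only):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data with one missing value per column (made up for illustration).
X_toy = np.array([[1.0, 10.0],
                  [np.nan, 20.0],
                  [3.0, np.nan]])

# The hyperparameter `strategy` is set through the constructor.
imputer = SimpleImputer(strategy="median")

# fit() learns the per-column medians; transform() fills the gaps with them.
imputer.fit(X_toy)
X_filled = imputer.transform(X_toy)

# fit_transform() performs both steps in a single call.
X_filled_again = SimpleImputer(strategy="median").fit_transform(X_toy)
```

Both calls should give the same filled array, with the missing entries replaced by the column medians.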

### Predictors

Some estimators, given a dataset, are capable of making predictions; they are called **predictors**. A predictor has a `predict()` method that takes a dataset of new instances and returns a dataset of corresponding predictions. It also has a `score()` method that measures the quality of the predictions, given a test set (and the corresponding labels, in the case of supervised learning algorithms).
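
A short sketch of the predictor interface, assuming a `LinearRegression` model and made-up data that follows `y = 2x + 1` exactly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Training data on the line y = 2x + 1 (made up for illustration).
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([1.0, 3.0, 5.0, 7.0])

model = LinearRegression()
model.fit(X_train, y_train)

# predict() returns one prediction per new instance.
X_new = np.array([[4.0], [5.0]])
y_new = np.array([9.0, 11.0])
y_pred = model.predict(X_new)

# score() measures prediction quality; for regressors it is the R^2 value.
r2 = model.score(X_new, y_new)
```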

### Inspection

All the estimator’s **hyperparameters** are accessible directly via public instance variables (e.g., `imputer.strategy`), and all the estimator’s **learned parameters** are accessible via public instance variables with an underscore suffix (e.g. `imputer.statistics_`).
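
For illustration, hyperparameters and learned parameters can be inspected like this (a sketch on made-up data):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_toy = np.array([[1.0], [np.nan], [5.0]])
imputer = SimpleImputer(strategy="mean").fit(X_toy)

strategy = imputer.strategy      # hyperparameter, set in the constructor
learned = imputer.statistics_    # learned parameter, note the trailing underscore
params = imputer.get_params()    # all hyperparameters as a dict
```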

### Nonproliferation of classes

Datasets are represented as `NumPy` arrays or `SciPy` sparse matrices, instead of homemade classes. Hyperparameters are just regular Python strings or numbers.

### Composition

Existing building blocks are reused as much as possible. For example, it is easy to create a `Pipeline` estimator from an arbitrary sequence of transformers followed by a final estimator.
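
A minimal sketch of composition, assuming a scaler followed by a logistic regression on made-up, well-separated data (the step names are arbitrary):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Two clearly separated groups (made up for illustration).
X_toy = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A transformer followed by a final estimator behaves like one estimator.
clf = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])
clf.fit(X_toy, y_toy)
pred = clf.predict(np.array([[0.5], [11.5]]))
```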

### Sensible defaults

Scikit-Learn provides reasonable default values for most parameters, making it easy to quickly create a baseline working system.

# Pipelines in sklearn

Pipelines are the best-practice way to encapsulate all the data processing required to get from the raw data (as read from a file or database table) to the inputs of a model. A pipeline chains multiple transformers that each perform a specific task. The last element of a pipeline can be an estimator; in that case, the pipeline combines data processing and model estimation in a single object.

The main challenge with pipelines is that `sklearn` and `pandas` do not always interact seamlessly.
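
One place where this friction shows up, sketched on a made-up `DataFrame`: with default settings, a transformer accepts a `DataFrame` but returns a plain NumPy array, so column names have to be re-attached by hand.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up data for illustration.
df = pd.DataFrame({"age": [20.0, 30.0, 40.0], "fare": [5.0, 10.0, 15.0]})

# With default settings the result is a NumPy array without column names.
scaled = StandardScaler().fit_transform(df)

# Re-attach the labels manually to get a DataFrame back.
scaled_df = pd.DataFrame(scaled, columns=df.columns, index=df.index)
```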

## Example: Building a regression model for the Titanic data

In [47]:
import numpy as np
import pandas as pd

In [60]:
cols = ['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
col_types = {'Sex': 'category', 'Embarked': 'category'}
train = pd.read_csv('data/train.csv', usecols=cols, dtype=col_types)

In [61]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Sex         891 non-null category
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Fare        891 non-null float64
Embarked    889 non-null category
dtypes: category(2), float64(2), int64(4)
memory usage: 43.8 KB


In [62]:
train.head()

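| | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S |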


In [63]:
y = train['Survived'].astype('float32').values

In [64]:
X = train.drop('Survived', axis=1)

We need to get this `DataFrame` into a shape that `sklearn` estimators can work with, i.e. a `numpy` array with dtype float. We will do this with a pipeline transformer that contains all necessary steps.

## Transformers: Building blocks of pipelines

Each class you intend to put in a pipeline or feature union should inherit from `BaseEstimator` and `TransformerMixin`.
* The former provides `get_params()` and `set_params()`, which make your class usable with hyperparameter search tools such as `GridSearchCV`
* The latter provides a default `fit_transform()` that calls `fit()` and then `transform()`.

With these base classes, the only mandatory methods are `fit` and `transform`. Additional useful methods for transformers are:
* `__init__` lets you give your transformer some configuration at initialization
* `inverse_transform` enables the inverse transform call on the full pipeline if each element supports it.
* `get_feature_names` is crucial when you want the names of the transformed features in a feature union to be accessible.
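
To make the role of the two base classes concrete, here is a minimal sketch with a hypothetical `AddConstant` transformer (the class and its `constant` parameter are invented for illustration). It lists the mixin first, which recent scikit-learn documentation recommends; the methods used here behave the same with either order.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class AddConstant(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that adds a constant to every value."""

    def __init__(self, constant=1.0):
        # Stored under the same name as the argument so get_params() works.
        self.constant = constant

    def fit(self, X, y=None):
        # Nothing to learn for this transformer.
        return self

    def transform(self, X):
        return np.asarray(X) + self.constant

    def inverse_transform(self, X):
        return np.asarray(X) - self.constant


adder = AddConstant(constant=2.0)
X_toy = np.array([[1.0], [2.0]])

shifted = adder.fit_transform(X_toy)         # provided by TransformerMixin
params = adder.get_params()                  # provided by BaseEstimator
restored = adder.inverse_transform(shifted)  # undoes transform()
```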

An example of a transformer that applies a vectorized function `trans_func` and inverse function `untrans_func` to the attributes specified in `columns` in a `DataFrame` could look like this:

In [3]:
from sklearn.base import BaseEstimator, TransformerMixin

In [5]:
class SimpleTransformer(BaseEstimator, TransformerMixin):
    """Apply given transformation."""
    def __init__(self, trans_func, untrans_func, columns):
        # Store each argument under its own name so that BaseEstimator's
        # get_params() and clone() can find it.
        self.trans_func = trans_func
        self.untrans_func = untrans_func
        self.columns = columns

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        x = self._get_selection(x)
        return self.trans_func(x) if callable(self.trans_func) else x

    def inverse_transform(self, x):
        return self.untrans_func(x) \
            if callable(self.untrans_func) else x

    def _get_selection(self, df):
        assert isinstance(df, pd.DataFrame), "df is not a pandas DataFrame"
        return df[self.columns]

    def get_feature_names(self):
        return self.columns

### Back to the example

`sklearn` provides the `ColumnTransformer` class which selects columns by name. The selected subset of columns can then be further processed. Typically, columns with the same type need to be processed in the same manner.
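
As a rough sketch of the `ColumnTransformer` approach, assuming a small made-up frame with Titanic-like column names (the step names `num` and `cat` are arbitrary):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows for illustration.
df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.28, 7.93],
    "Sex": ["male", "female", "female"],
})

# Each entry is (name, transformer, list of column names to select).
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Age", "Fare"]),
    ("cat", OneHotEncoder(), ["Sex"]),
])

out = preprocess.fit_transform(df)
# The result may be sparse or dense depending on its density.
out = out.toarray() if hasattr(out, "toarray") else out
```

The output has two scaled numeric columns followed by two one-hot columns for `Sex`.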

Here we instead select columns by dtype with a small custom transformer. A possible implementation of such a `TypeSelector` is as follows:

In [67]:
from sklearn.base import BaseEstimator, TransformerMixin

In [68]:
class TypeSelector(BaseEstimator, TransformerMixin):
    def __init__(self, dtype):
        self.dtype = dtype
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        assert isinstance(X, pd.DataFrame), "X is not a DataFrame"
        return X.select_dtypes(include=[self.dtype])

For the data, we need two pipelines
1. Categorical data: Categories need to be encoded.
2. Numeric data

**Categorical data**

Recently, `pandas` has introduced the `categorical` data type, which handles most of the work related to categorical data. However, most machine learning models do not accept categoricals directly; they first need to be encoded, for example as dummy variables. In addition, the `pandas` `categorical` type does not include `NaN` values in its list of categories and represents them with the code -1. Many encoder classes in `sklearn`, including `OneHotEncoder`, expect non-negative values. This mismatch between the two packages needs to be addressed explicitly, in one of two ways:
1. Make the missing values an explicit category by changing the index
2. Impute the missing values
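
The -1 code is easy to see on a small made-up series; the sketch below also shows one way to turn missing values into an explicit category (the label `"missing"` is an arbitrary choice):

```python
import pandas as pd

# Made-up values with one missing entry.
s = pd.Series(["S", "C", None, "S"], dtype="category")

categories = list(s.cat.categories)  # NaN is not among the categories
codes = s.cat.codes.tolist()         # the missing value gets code -1

# Option 1: make "missing" an explicit category, then fill with it.
s_explicit = s.cat.add_categories("missing").fillna("missing")
explicit_codes = s_explicit.cat.codes.tolist()
```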

In [None]:
class MissingCategory(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        assert isinstance(X, pd.DataFrame), "X is not a DataFrame"
        assert X.apply(lambda c: c.dtype).eq('category').all(), "Not all columns of X have category dtype"
        return X.apply(lambda s: s.cat.codes.replace(
            {-1: len(s.cat.categories)}
        ))

In [71]:
from sklearn.impute import SimpleImputer

In [74]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

In [75]:
pipeline_cat = Pipeline([
    ('selector', TypeSelector('category')),
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder())
])

In [78]:
X_cat = pipeline_cat.fit_transform(X)

In [81]:
print(X_cat[:4,].todense())

[[0. 1. 0. 0. 1.]
 [1. 0. 1. 0. 0.]
 [1. 0. 0. 0. 1.]
 [1. 0. 0. 0. 1.]]


**Numeric data**

For numeric data, we need to distinguish between integer-valued data (which are really categorical variables) and true numeric data. For integer data, we apply the same pipeline as for categorical variables.

In [87]:
pipeline_int = Pipeline([
    ('selector', TypeSelector(int)),
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(categories='auto'))
])

In [89]:
X_int = pipeline_int.fit_transform(X)
print(X_int[:4,].todense())

[[0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]]


For true numeric data, the steps are imputation and then scaling, as many ML algorithms require that features have a similar value range. We will use standardization here.
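
Standardization subtracts the column mean and divides by the column standard deviation. A quick sketch on made-up data, comparing the manual computation with `StandardScaler`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0]])

# Manual standardization; NumPy's default std is the population std (ddof=0),
# which is also what StandardScaler uses.
manual = (x - x.mean(axis=0)) / x.std(axis=0)

scaled = StandardScaler().fit_transform(x)
```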

In [82]:
from sklearn.preprocessing import StandardScaler

In [90]:
pipeline_num = Pipeline([
    ('selector', TypeSelector('float')),
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

In [92]:
X_num = pipeline_num.fit_transform(X)
print(X_num[:4,])

[[-0.56573646 -0.50244517]
 [ 0.66386103  0.78684529]
 [-0.25833709 -0.48885426]
 [ 0.4333115   0.42073024]]


**Combining multiple parallel pipelines**: FeatureUnion

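The `FeatureUnion` class concatenates the results of multiple pipelines column-wise into a single numpy array (or a SciPy sparse matrix when one of the pipelines, such as one ending in a `OneHotEncoder`, returns sparse output).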
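
A small sketch of the concatenation, assuming two standard scalers applied to the same made-up input (the names `standard` and `minmax` are arbitrary):

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# Every transformer sees the full input; the outputs are stacked side by side.
union = FeatureUnion([
    ("standard", StandardScaler()),
    ("minmax", MinMaxScaler()),
])
combined = union.fit_transform(X_toy)
```

The first two columns of `combined` come from `StandardScaler`, the last two from `MinMaxScaler`.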

In [93]:
from sklearn.pipeline import FeatureUnion

In [94]:
full_pipeline = FeatureUnion(transformer_list=[
        ("numeric_features", pipeline_num),
        ("categorical_features", pipeline_cat),
        ("integer_features", pipeline_int)
])

In [98]:
X_processed = full_pipeline.fit_transform(X)
print(X_processed[:4,].todense())

[[-0.56573646 -0.50244517  0.          1.          0.          0.
   1.          0.          0.          1.          0.          1.
   0.          0.          0.          0.          0.          1.
   0.          0.          0.          0.          0.          0.        ]
 [ 0.66386103  0.78684529  1.          0.          1.          0.
   0.          1.          0.          0.          0.          1.
   0.          0.          0.          0.          0.          1.
   0.          0.          0.          0.          0.          0.        ]
 [-0.25833709 -0.48885426  1.          0.          0.          0.
   1.          0.          0.          1.          1.          0.
   0.          0.          0.          0.          0.          1.
   0.          0.          0.          0.          0.          0.        ]
 [ 0.4333115   0.42073024  1.          0.          0.          0.
   1.          1.          0.          0.          0.          1.
   0.          0.          0.          0.        

**Make a model pipeline**: Add an estimator as the final pipeline step

When expanding a dataset, it is often necessary to keep the original columns in the feature union, which does not retain them by default. We can work around this with an identity transformer. We can even use our new class `SimpleTransformer` for that: just pass `None` for `trans_func` and `untrans_func`.

A very basic feature union that expects the pandas dataframe format can look like this:

In [None]:
from sklearn.pipeline import FeatureUnion
import numpy as np

In [None]:
all_feature_names = ['Age', 'Gender', 'Height', 'Weight', 'y1', 'y2']

simple_union = FeatureUnion([('simple_trans_y',
                               SimpleTransformer(np.sqrt, np.square,
                                                 ['y1', 'y2'])
                              ),
                              ('identity',
                               SimpleTransformer(None, None,
                                ['Age', 'Gender', 'Height', 'Weight'])
                              )
                             ])


The `FeatureUnion` class just takes as its argument a list of tuples, where each tuple consists of a name and a transformer. Here, the dataset is simply split into subsets of `all_feature_names`, and the columns `y1` and `y2` are replaced by their square roots (assume they contain floating point numbers). The output of this union will still be a numpy matrix, but hold on, the magic is still to come. So far we have seen how to apply transformations to specific columns, but the `DataFrame` format is not retained yet.

### Scaling Transformations

Scaling your dataset with a `StandardScaler` or `MinMaxScaler` is what data scientists do for a living, since a lot of linear models rely on this preprocessing in order to learn the latent patterns. But what if we want only specific columns to be scaled? What if I need separate scalers for my independent and dependent features, because I later want to invert the scaling of my predictions, which naturally cover only the target variables?
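
The reason for a separate target scaler can be sketched in a few lines. The data is made up, and the "predictions" are simply the scaled targets, standing in for the output of a model trained in scaled space:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

y = np.array([[10.0], [20.0], [30.0]])

# A scaler fitted on the targets only.
y_scaler = StandardScaler().fit(y)
y_scaled = y_scaler.transform(y)

# Stand-in for model predictions made in the scaled space.
y_pred_scaled = y_scaled.copy()

# Map the predictions back to the original units.
y_pred = y_scaler.inverse_transform(y_pred_scaled)
```

If features and targets shared one scaler, `inverse_transform` would expect the feature columns as well, which are not available at prediction time.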

In [None]:
class Scaler(BaseEstimator, TransformerMixin):
    """scales selected columns only with given scaler"""
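    def __init__(self, scaler, cols):
        # The argument is named like the attribute it is stored in, as
        # BaseEstimator's get_params() and clone() require.
        self.scaler = scaler
        self.cols = cols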

    def fit(self, X, y=None):
        X = self._get_selection(X)
        self.scaler.fit(X, y)
        return self

    def transform(self, X):
        X = self._get_selection(X)
        return self.scaler.transform(X)

    def inverse_transform(self, X):
        return self.scaler.inverse_transform(X)

    def _get_selection(self, df):
        assert isinstance(df, pd.DataFrame)
        return df[self.cols]

    def get_feature_names(self):
        return self.cols


This scaler class is pretty straightforward: each of its methods calls the corresponding method of the scaler that was passed in at initialization. The difference from the `SimpleTransformer` is that we now also do something during fitting, but there are no surprises here. Build the feature union like this:

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import FeatureUnion

scaling_union = FeatureUnion([('scaler_x',
                               Scaler(StandardScaler(),
                                      ['Age', 'Gender', 'Height', 'Weight'])),
                              ('scaler_y',
                               Scaler(StandardScaler(),
                                      ['y1', 'y2']))
                             ])


### Rolling Transformations

Rolling transformations are common when working with time series data and are easily implemented by relying on pandas' functionality.
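
For illustration, a rolling window of 3 on a made-up column behaves as follows. The first `window - 1` rows have no complete window, so they come out as `NaN`:

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Each statistic is computed over the current row and the two rows before it.
roll = df["x"].rolling(3)
rolling_mean = roll.mean()
rolling_sum = roll.sum()

# Number of leading rows without a complete window.
n_missing = int(rolling_mean.isna().sum())
```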

In [None]:
class RollingFeatures(BaseEstimator, TransformerMixin):
    """This Transformer adds rolling statistics"""
    def __init__(self, cols, lookback=10):
        # Argument names match the attribute names, as BaseEstimator's
        # get_params() and clone() require.
        self.lookback = lookback
        self.cols = cols
        self.transformed_cols = None

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = self._get_selection(X)
        feat_d = {'std': X.rolling(self.lookback).std(),
                  'mean': X.rolling(self.lookback).mean(),
                  'sum': X.rolling(self.lookback).sum()
                  }
        for k in feat_d:
            feat_d[k].columns = \
                ['{}_rolling{}_{}'.format(c, self.lookback, k) for
                 c in X.columns]
        df = pd.concat(list(feat_d.values()), axis=1)
        self.transformed_cols = list(df.columns)
        return df

    def _get_selection(self, df):
        assert isinstance(df, pd.DataFrame)
        return df[self.cols]

    def get_feature_names(self):
        return self.transformed_cols


Here, for the first time, we come up with completely new feature names. Note that we have a new instance attribute, `transformed_cols`, to keep track of the columns that were generated here.

### Cleaning the DataFrame

A class to perform some simple cleaning of a `DataFrame` could look like this:

In [15]:
class DFCleaner(BaseEstimator, TransformerMixin):

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        # Return a cleaned copy instead of mutating the caller's DataFrame.
        return X.dropna().reset_index(drop=True)


### Retain DataFrame as output format

For this challenge we can exploit the following simple trick. The `FeatureUnion` class has a method called `get_feature_names` that exposes the feature names of each transformer, even though their output is a numpy matrix. To work around the numpy output, we can make each feature union a two-step pipeline in which the union is the first step and a transformer that fetches the actual feature names is the second. Sounds crazy? Check this out:

In [None]:
class FeatureUnionReframer(BaseEstimator, TransformerMixin):
    """Transforms preceding FeatureUnion's output back into Dataframe"""
    def __init__(self, feat_union, cutoff_transformer_name=True):
        self.union = feat_union
        self.cutoff_transformer_name = cutoff_transformer_name

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, np.ndarray)
        if self.cutoff_transformer_name:
            cols = [c.split('__')[1] for c in self.union.get_feature_names()]
        else:
            cols = self.union.get_feature_names()
        df = pd.DataFrame(data=X, columns=cols)
        return df

    @classmethod
    def make_df_retaining(cls, feature_union):
        """With this method a feature union will be returned as a pipeline
        where the first step is the union and the second is a transformer that
        re-applies the columns to the union's output"""
        return Pipeline([('union', feature_union),
                         ('reframe', cls(feature_union))])


This class does the job. The optional bool argument gives you the freedom to keep the default prefix in feature union chains, which is the name of the transformer. The class method `make_df_retaining` makes things even more comfortable, as we don't need to instantiate the class explicitly.
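
Stripped of the pipeline machinery, the core of the trick is just re-attaching names to an array. A sketch with a made-up array and hypothetical prefixed names in the `<transformer name>__<feature name>` form that the class above splits on:

```python
import numpy as np
import pandas as pd

# Stand-in for the numpy output of a feature union.
arr = np.array([[0.1, 0.2, 0.3],
                [0.4, 0.5, 0.6]])

# Hypothetical prefixed feature names.
prefixed = ["scaler_x__Age", "scaler_x__Height", "scaler_y__y1"]

# Cut off the transformer name and rebuild a DataFrame.
cols = [name.split("__")[1] for name in prefixed]
df = pd.DataFrame(data=arr, columns=cols)
```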

### Putting it all together

In [None]:
from sklearn.pipeline import make_union, make_pipeline
from sklearn.preprocessing import StandardScaler
import numpy as np

x_feats = ['Age', 'Gender', 'Height', 'Weight']
y_feats = ['y1', 'y2']

featurize_union = make_union(SimpleTransformer(np.sqrt, np.square, y_feats),
                             SimpleTransformer(None, None, x_feats),
                             RollingFeatures(x_feats, lookback=10)
                             )
scaling_union = make_union(Scaler(StandardScaler(), x_feats),
                           Scaler(StandardScaler(), y_feats)
                          )

featurize_pipe = FeatureUnionReframer.make_df_retaining(featurize_union)
scaling_pipe = FeatureUnionReframer.make_df_retaining(scaling_union)

pipe = make_pipeline(featurize_pipe,
                     DFCleaner(),
                     scaling_pipe)


## Example

Let's build an example pipeline.

Every blue segment is a standard scikit-learn `Transformer`. The yellow segments are custom-made.

## Selectors

### ColumnSelector

### TypeSelector

Often data is loaded into a pandas `DataFrame`, which can combine several types of variables. Sklearn, however, works on numpy arrays with a floating point data type (the default is `float64`, but `float32` can also be specified). Therefore, most pipelines perform a type selection as their first step, and all variables of the same type are processed in the same manner. If necessary, of course, additional selection steps can be performed so that specific groups of variables are processed differently.

`sklearn` does not provide a type selector class out of the box, presumably as sklearn is built around `numpy` and the "problem" is introduced by the pandas package.
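
The selection itself can lean on `DataFrame.select_dtypes`. A sketch on a made-up frame with one column of each kind:

```python
import pandas as pd

df = pd.DataFrame({
    "Sex": pd.Series(["male", "female"], dtype="category"),
    "Age": [22.0, 38.0],
    "Pclass": [3, 1],
})

# Select columns by dtype rather than by name.
cat_cols = list(df.select_dtypes(include=["category"]).columns)
float_cols = list(df.select_dtypes(include=["float"]).columns)
num_cols = list(df.select_dtypes(include=["number"]).columns)
```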

### Category Encoder

The only problem that needs to be explicitly addressed is that the `pandas` `categorical` type does not include `NaN` values in its list of categories and represents them with the code -1. Many encoder classes in `sklearn`, including `OneHotEncoder`, expect non-negative values, so this mismatch between the two packages needs to be handled explicitly.

### Full code

In [None]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import OneHotEncoder, StandardScaler

In [None]:
transformer = Pipeline([
    ('features', FeatureUnion(n_jobs=1, transformer_list=[
        # Part 1
        ('boolean', Pipeline([
            ('selector', TypeSelector('bool')),
        ])),  # booleans close
        
        ('numericals', Pipeline([
            ('selector', TypeSelector(np.number)),
            ('scaler', StandardScaler()),
        ])),  # numericals close
        
        # Part 2
        ('categoricals', Pipeline([
            ('selector', TypeSelector('category')),
            ('labeler', StringIndexer()),
            ('encoder', OneHotEncoder(handle_unknown='ignore')),
        ]))  # categoricals close
    ])),  # features close
])  # pipeline close

## Example

As an example, we will be using the `churn` dataset, fetched with the `pmlb` package.

### Implementation without a Pipeline

In [1]:
import random

import numpy as np
import pandas as pd
import pmlb
from sklearn.model_selection import train_test_split

df = pmlb.fetch_data('churn', return_X_y=False)

# Remove the target column and the phone number
x_cols = [c for c in df if c not in ["target", "phone number"]]

binary_features = ["international plan", "voice mail plan"]
categorical_features = ["state", "area code"]

# Column types are defaulted to floats
X = (
    df
    .drop(["target"], axis=1)
    .astype(float)
)
X[binary_features] = X[binary_features].astype("bool")

# Categorical features can't be set all at once
for f in categorical_features:
    X[f] = X[f].astype("category")

y = df.target

# Randomly set 500 items as missing values
random.seed(42)
num_missing = 500
indices = [(row, col) for row in range(X.shape[0]) for col in range(X.shape[1])]
for row, col in random.sample(indices, num_missing):
    X.iat[row, col] = np.nan

# Partition data set into training/test split (2 to 1 ratio)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3., random_state=42)


In [None]:
class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)

        try:
            return X[self.columns]
        except KeyError:
            cols_error = list(set(self.columns) - set(X.columns))
            raise KeyError("The DataFrame does not include the columns: %s" % cols_error)
            
cs = ColumnSelector(columns=["state", "account length", "area code"])
cs.fit_transform(df).head()

In [None]:
class TypeSelector(BaseEstimator, TransformerMixin):
    def __init__(self, dtype):
        self.dtype = dtype

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        return X.select_dtypes(include=[self.dtype])

In [None]:
ts = TypeSelector("category")
ts.fit_transform(X).head()

### Preprocessing Pipeline

1. Select only the relevant feature columns (omits the phone number column)
2. Impute and standardize the numeric features
3. Impute and one-hot encode the categorical features
4. Impute the boolean features
5. Apply a FeatureUnion to join the transformed features into a single data set

In [None]:
preprocess_pipeline = make_pipeline(
    ColumnSelector(columns=x_cols),
    FeatureUnion(transformer_list=[
        ("numeric_features", make_pipeline(
            TypeSelector(np.number),
            # SimpleImputer (sklearn.impute) replaces the removed Imputer class
            SimpleImputer(strategy="median"),
            StandardScaler()
        )),
        ("categorical_features", make_pipeline(
            TypeSelector("category"),
            SimpleImputer(strategy="most_frequent"),
            OneHotEncoder()
        )),
        ("boolean_features", make_pipeline(
            TypeSelector("bool"),
            SimpleImputer(strategy="most_frequent")
        ))
    ])
)

In [None]:
preprocess_pipeline.fit(X_train)

X_test_transformed = preprocess_pipeline.transform(X_test)
X_test_transformed

In [None]:
from sklearn.svm import SVC

classifier_pipeline = make_pipeline(
    preprocess_pipeline,
    SVC(kernel="rbf", random_state=42)
)

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    "svc__gamma": [0.1 * x for x in range(1, 6)]
}

classifier_model = GridSearchCV(classifier_pipeline, param_grid, cv=10)
classifier_model.fit(X_train, y_train)
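
The key `svc__gamma` follows the `<step name>__<parameter>` convention. `make_pipeline` names each step after its lowercased class name, which the following sketch illustrates on a smaller, hypothetical pipeline:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Step names are derived from the class names.
step_names = [name for name, _ in pipe.steps]

# Nested parameters are addressed as <step name>__<parameter>.
pipe.set_params(svc__gamma=0.1)
gamma = pipe.get_params()["svc__gamma"]
```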

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

y_score = classifier_model.decision_function(X_test)

fpr, tpr, thresholds = roc_curve(y_test, y_score)
roc_auc = roc_auc_score(y_test, y_score)

# Plot ROC curve
plt.figure(figsize=(16, 12))
plt.plot(fpr, tpr, label='ROC curve (area = %0.3f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate (1 - Specificity)', size=16)
plt.ylabel('True Positive Rate (Sensitivity)', size=16)
plt.title('ROC Curve', size=20)
plt.legend(fontsize=14);
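
To see what `roc_curve` and `roc_auc_score` compute without running the full model, here is a sketch on four made-up labels and scores. Three of the four positive/negative pairs are ranked correctly, so the area under the curve is 0.75:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up labels and decision scores for illustration.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# One (fpr, tpr) point per threshold, from (0, 0) to (1, 1).
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Probability that a random positive is scored above a random negative.
auc = roc_auc_score(y_true, y_score)
```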