# ML Pipelines

In this series of exercises, you will learn how to build a robust ML pipeline using [Sklearn Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).

Pre-processing is also an important part of an ML pipeline. For this, you will learn how to master Sklearn encoders and transformers from the [Preprocessing Sklearn module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing).


In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import warnings
warnings.filterwarnings('ignore')

## Preprocessing Data

1. [Scaling with StandardScaler](#exo1)
2. [Encoding Categorical Variables](#exo2)
3. [Dealing with Missing Data](#exo3)
4. [Custom Transformers and Encoders](#exo4)

### 1. Scaling with StandardScaler <a id='exo1'/>

Standardizing features by removing the mean and scaling to unit variance is a common pre-processing step that helps many machine learning algorithms behave well.

[Sklearn StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) can do the scaling transformation for you.
The standard score of a sample `x` is calculated as:

z = (x - u) / s

where `u` is the mean of the training samples and `s` is their standard deviation.

As you know, there are 2 main methods for any encoder/transformer:
- `fit`, which computes the mean and std to be used for later scaling.
- `transform`, which performs standardization by centering and scaling.
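
To make the `fit` / `transform` split concrete, here is a minimal sketch using Sklearn's own `StandardScaler` on toy values. The arrays `X_train` and `X_new` are made up for illustration and are not the exercise data. The point is that the statistics are learned once in `fit` and then reused, unchanged, in `transform`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): one feature, three training samples
X_train = np.array([[0.0], [2.0], [4.0]])
X_new = np.array([[6.0]])

scaler = StandardScaler()
scaler.fit(X_train)  # learns the mean and std from X_train only

print(scaler.mean_)   # per-feature mean learned during fit
print(scaler.scale_)  # per-feature std learned during fit (population std, ddof=0)

# transform reuses the statistics learned on X_train; it does not recompute them
X_new_scaled = scaler.transform(X_new)
print(X_new_scaled)
```

Note that `StandardScaler` uses the population standard deviation (equivalent to `np.std(x, ddof=0)`), which matters when you compare a custom implementation against it.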

🎯 The goal of this exercise is to understand the concepts behind `fit()` and `transform()` by re-implementing the standard scaler.

#### Exercise
- Given the NumPy arrays `data` and `test_data`, write a simple custom implementation of a standard scaler. To test it, fit the scaler on `data` and transform `test_data` with it.
- Compare your results with `StandardScaler`.
- Re-write your custom implementation as a Python class.

In [None]:
import numpy as np
data = np.array([[1, 10], [2, -1], [0, 22], [3, 15]])
test_data = np.array([[2, 1], [5, 1], [3, 55], [3, 1]])

In [None]:
# Implement fit() and transform() functions here

def fit(X):
    """implement fit method"""
    

def transform(X, params):
    """implement transformation method"""

In [None]:
params = fit(data)
transformed_test_data = transform(test_data, params)
transformed_test_data

In [None]:
## Use Sklearn StandardScaler and compare results



In [None]:
## Custom Implementation with a Class 

class Scaler(object):
    def __init__(self):
        """define parameters useful for fit and transform"""

    def fit(self, X):
        """compute and store the scaling parameters"""

    def transform(self, X):
        """apply the stored scaling parameters to X"""



In [None]:
# Compare results here


### 2. Encoding Categorical Variables <a id='exo2'/>

Features are often categorical rather than continuous. However, most machine learning algorithms only accept numerical inputs, so categorical variables must be encoded before being passed to ML estimators.

One encoder that is commonly used for categorical variables is [`OneHotEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).
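
As a quick illustration of what one-hot encoding produces, the sketch below encodes a made-up `colors` array (hypothetical, not the exercise data). Each distinct category becomes one binary column, and the learned categories are stored on the fitted encoder.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy categorical feature (hypothetical values)
colors = np.array([['red'], ['green'], ['red'], ['blue']])

color_encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense
encoded = color_encoder.fit_transform(colors).toarray()

print(color_encoder.categories_)  # categories learned during fit, sorted
print(encoded)                    # one binary column per category
```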

🎯 The goal of this exercise is to explore a particular yet common case.

#### Exercise

Given `data` and `test_data`, use Sklearn `OneHotEncoder` on these 2 sets, applying the concepts from the previous exercise.

In [None]:
import numpy as np
data = np.array([['France'], ['USA'], ['Italy'], ['Japan'], ['UK'], ['Germany'], ['USA'], ['Japan']])
test_data = np.array([['China'], ['USA'], ['Italy']])

In [None]:
## use Sklearn OneHotEncoder and compare results on test_data
from sklearn.preprocessing import OneHotEncoder

In [None]:
ohe = OneHotEncoder()
ohe.fit(data)

In [None]:
ohe.transform(data).toarray()

In [None]:
ohe.transform(test_data).toarray()

🤔 Why do you think you get this error?  
Solve it using the `handle_unknown` parameter of `OneHotEncoder()`.

In [None]:
# Solve error here


### 3. Dealing with Missing Data <a id='exo3'/>

For various reasons, many real world datasets contain missing values, often encoded as blanks, NaNs or other placeholders. Such datasets however are incompatible with scikit-learn estimators which assume that all values in an array are numerical, and that all have and hold meaning.

For this, Sklearn offers multiple ways to impute missing data with the [Impute module](https://scikit-learn.org/stable/modules/impute.html#).
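
The following minimal sketch shows the imputation pattern on toy arrays (`X_train` and `X_new` are hypothetical, not the exercise data). As with scaling, the fill values are computed in `fit` and reused in `transform`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (hypothetical): each column has one missing value
X_train = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])
X_new = np.array([[np.nan, np.nan]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train)  # computes each column's mean, ignoring NaNs

print(imputer.statistics_)  # fill value learned for each column

# Missing values in new data are filled with the training statistics
X_new_filled = imputer.transform(X_new)
print(X_new_filled)
```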

#### Exercise
- Re-implement the [`SimpleImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer) transformer using the `mean` strategy.
- Test your implementation with `data` and `test_data`.
- Compare with the data transformed by Sklearn's `SimpleImputer`.
- Bonus: Implement for all 4 strategies (`mean`, `median`, `most_frequent` and `constant`)

In [None]:
import numpy as np
data = np.array([[1, 3, 3], [2, np.nan, 6], [3, 9, 9]])
test_data = np.array([[1, 1, 1], [1, np.nan, 1], [1, 1, 1]])
data[np.isnan(data)]

In [None]:
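class CustomSimpleInputer(object):
    """Re-implement SimpleImputer"""

    def __init__(self, strategy="mean"):
        """store the imputation strategy"""

    def fit(self, X, **kwargs):
        """compute the fill value of each column"""

    def transform(self, X, **kwargs):
        """replace missing values with the fill values computed in fit"""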

In [None]:
csi = CustomSimpleInputer()
csi.fit(data)
csi.transform(test_data)

In [None]:
## Use Sklearn SimpleImputer and compare its output with your custom implementation


### 4. Custom Transformers and Encoders <a id='exo4'/>

Sklearn provides a large collection of transformers and encoders, but you might need to implement your own encoder to fit the needs of your data and problem.

For this, there are two very useful Sklearn classes:  
👉 [BaseEstimator](https://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html) and [TransformerMixin](https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html) are base classes you can inherit from to implement completely new custom encoders  
👉 Refer to this morning's slides to check how to implement a custom transformer
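
As a rough sketch of the pattern, here is a stateless transformer built on these two base classes. The class `ColumnDifference` and the DataFrame `toy` are hypothetical names invented for this illustration. `TransformerMixin` supplies `fit_transform` once `fit` and `transform` are defined, and `BaseEstimator` supplies `get_params` / `set_params` based on the `__init__` arguments.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnDifference(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: returns the difference between two columns."""

    def __init__(self, col_a, col_b):
        # Store constructor arguments under the same names so get_params works
        self.col_a = col_a
        self.col_b = col_b

    def fit(self, X, y=None):
        # Nothing to learn here, so fit simply returns self
        return self

    def transform(self, X, y=None):
        diff = X[self.col_a] - X[self.col_b]
        return pd.DataFrame({"difference": diff})

toy = pd.DataFrame({"a": [5, 7], "b": [2, 3]})

# fit_transform is inherited from TransformerMixin
result = ColumnDifference("a", "b").fit_transform(toy)
# get_params is inherited from BaseEstimator
params = ColumnDifference("a", "b").get_params()
print(result)
print(params)
```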

#### Exercise
With the Taxi Fare Prediction Challenge data, using `BaseEstimator` and `TransformerMixin`, implement:

- a transformer that computes haversine distance between pickup and dropoff location
- a custom encoder that extracts time features from `pickup_datetime`
- Use these two new encoders to fit and transform the training data

In [None]:
# Get data (reading from an s3:// URL with pandas requires the s3fs package)
import pandas as pd
df = pd.read_csv('s3://wagon-public-datasets/taxi-fare-train.csv', nrows=1000)

In [None]:
# Clean the data using the function from the first exercise
def clean_df(df, test=False):
    df = df.dropna(how='any', axis='rows')
    df = df[(df.dropoff_latitude != 0) | (df.dropoff_longitude != 0)]
    df = df[(df.pickup_latitude != 0) | (df.pickup_longitude != 0)]
    if "fare_amount" in list(df):
        df = df[df.fare_amount.between(0, 4000)]
    df = df[df.passenger_count < 8]
    df = df[df.passenger_count >= 0]
    df = df[df["pickup_latitude"].between(left = 40, right = 42 )]
    df = df[df["pickup_longitude"].between(left = -74.3, right = -72.9 )]
    df = df[df["dropoff_latitude"].between(left = 40, right = 42 )]
    df = df[df["dropoff_longitude"].between(left = -74, right = -72.9 )]
    return df

In [None]:
df = clean_df(df)

In [None]:
# define vectorized haversine function here
def haversine_vectorized(df, 
    start_lat="start_lat", 
    start_lon="start_lon", 
    end_lat="end_lat", 
    end_lon="end_lon"):

    """ 
        Calculate the great circle distance between two points 
        on the earth (specified in decimal degrees).
        Vectorized version of the haversine distance for pandas df
        Computes distance in kms
    """

    lat_1_rad, lon_1_rad = np.radians(df[start_lat].astype(float)), np.radians(df[start_lon].astype(float))
    lat_2_rad, lon_2_rad = np.radians(df[end_lat].astype(float)), np.radians(df[end_lon].astype(float))
    dlon = lon_2_rad - lon_1_rad
    dlat = lat_2_rad - lat_1_rad

    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat_1_rad) * np.cos(lat_2_rad) * np.sin(dlon / 2.0) ** 2
    c = 2 * np.arcsin(np.sqrt(a))
    return 6371 * c

In [None]:
# test the function here on df


In [None]:
# Implement Custom Transformer inheriting from BaseEstimator and TransformerMixin
from sklearn.base import BaseEstimator, TransformerMixin

class DistanceTransformer(BaseEstimator, TransformerMixin):
    """implement class here"""


In [None]:
custom_transfo = DistanceTransformer()
hav_dist = custom_transfo.transform(df)

In [None]:
hav_dist.head()

In [None]:
assert list(df.hav_dist.values) == list(hav_dist.distance.values)

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd

class TimeFeaturesEncoder(BaseEstimator, TransformerMixin):
    """Extract day of week, hour, month and year from a datetime column."""

    def __init__(self, time_column, time_zone_name='America/New_York'):
        self.time_column = time_column
        self.time_zone_name = time_zone_name

    def fit(self, X, y=None):
        # Nothing to learn: this encoder is stateless
        return self

    def transform(self, X, y=None):
        assert isinstance(X, pd.DataFrame)
        # Work on a copy so the caller's DataFrame is left untouched
        X_ = X.copy()
        X_.index = pd.to_datetime(X_[self.time_column])
        # tz_convert assumes the timestamps are timezone-aware (e.g. suffixed with "UTC")
        X_.index = X_.index.tz_convert(self.time_zone_name)
        X_["dow"] = X_.index.weekday
        X_["hour"] = X_.index.hour
        X_["month"] = X_.index.month
        X_["year"] = X_.index.year
        return X_[["dow", "hour", "month", "year"]].reset_index(drop=True)


In [None]:
tf = TimeFeaturesEncoder("pickup_datetime")
tf.transform(df).head()

## Putting It All Together as a Pipeline

A Pipeline is a very useful concept. In Machine Learning, you often need to apply a sequence of different transformations (scaling, filling missing values, transforming, encoding) to a raw dataset before applying a final estimator.

A Pipeline gives you a simple interface for all these transformation steps and the resulting estimator. This makes it easier to iterate on and improve models, because you can easily add, remove or re-order steps. Changing one or several parameters is also very straightforward and does not require a lot of code refactoring.

For this, you will learn how to use 2 Sklearn modules:
1. [ColumnTransformer](#exo11)
2. [Pipeline](#exo12)
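
To illustrate how step parameters can be changed without refactoring, here is a small sketch. The pipeline `pipe_demo` and its step names are made up for this example. Parameters of a step are addressed with the `<step name>__<parameter>` syntax.

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical two-step pipeline: a scaler followed by a ridge regression
pipe_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("model", Ridge(alpha=1.0)),
])

# Change parameters of individual steps without rebuilding the pipeline
pipe_demo.set_params(model__alpha=0.1, scaler__with_mean=False)

print(pipe_demo.get_params()["model__alpha"])
print(pipe_demo.named_steps["scaler"].with_mean)
```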

### 1. Column Transformer <a id="exo11" />

Before building your pipeline, let's use a very useful Sklearn class called [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html).

This estimator allows different columns or column subsets of the input to be transformed separately and the features generated by each transformer will be concatenated to form a single feature space.

It is especially useful when your input data is a pandas DataFrame, as you can select columns by name.
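
A minimal sketch of the idea, on a made-up DataFrame `toy` (hypothetical columns, deliberately different from the exercise): one transformer is applied to a categorical column, another to a numerical column, and the outputs are concatenated side by side. Setting `sparse_threshold=0` is a choice made here to always get a dense array back.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy DataFrame (hypothetical)
toy = pd.DataFrame({
    "city": ["Paris", "Rome", "Paris"],
    "temperature": [10.0, 20.0, 30.0],
})

# Each entry is (name, transformer, list of columns it applies to)
preprocessor = ColumnTransformer(
    [
        ("city_ohe", OneHotEncoder(), ["city"]),
        ("temp_scaler", StandardScaler(), ["temperature"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Output columns: city_Paris, city_Rome, scaled temperature
features = preprocessor.fit_transform(toy)
print(features)
```

Columns that are not listed in any transformer are dropped by default (`remainder='drop'`).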

#### Exercise

You are given a small dataset containing weights and heights for a few individuals.

In [None]:
import pandas as pd
data = pd.DataFrame(
    [
        {'gender': 'Male', 'height': 180, 'weight': 82},
        {'gender': 'Female', 'height': np.nan, 'weight': 72},
        {'gender': 'Male', 'height': 175, 'weight': 75},
        {'gender': 'Female', 'height': 175, 'weight': 60},
        {'gender': 'Male', 'height': 170, 'weight': 76},
    ])

test_data = pd.DataFrame(
    [
        {'gender': 'Male', 'height': 170, 'weight': 72},
        {'gender': 'Female', 'height': np.nan, 'weight': 60}
    ]
)

data

With `ColumnTransformer`, build a single `encoder` that applies these transformations:
- encode `gender` with OneHot
- fill missing values for height

In [None]:
# Implement your encoder with ColumnTransformer here


In [None]:
encoder.fit_transform(data)

In [None]:
encoder.transform(test_data)

### 2. Pipeline <a id="exo12" />

Now it is time to use a Sklearn [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)

#### Exercise
With the weight/height dataset, build a pipeline to predict the weight of individuals in the test set.

This pipeline should have:
- a `OneHotEncoder` for `gender`
- an imputer to fill missing values for height
- a scaler for height
- a simple estimator like a linear regression

**Tip** You can also use [make_pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html), a shorthand for the `Pipeline` constructor that names the steps automatically, so you do not have to name the transformers yourself.
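
The difference between the two constructors can be sketched on toy numerical data (the arrays `X` and `y` and the step names below are made up for illustration). Both pipelines behave identically; only the step names differ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy regression data (hypothetical): y = 2 * x
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

# Pipeline: you choose the step names
named = Pipeline([("scaler", StandardScaler()), ("model", LinearRegression())])
# make_pipeline: names are derived from the lowercased class names
auto = make_pipeline(StandardScaler(), LinearRegression())

named.fit(X, y)
auto.fit(X, y)

print(list(named.named_steps))
print(list(auto.named_steps))
print(named.predict(np.array([[5.0]])))
```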

In [None]:
# Implement Pipeline here
pipe = None  # TODO: replace None with your pipeline

In [None]:
pipe.fit(data, data.weight)

In [None]:
pipe.predict(test_data)