# Feature Engineering II: Column Transformers and Pipelines
<img src = "./images/lego.webp" width = "450">

<a href = "https://www.highsnobiety.com/p/lego-transformers-optimus-prime/">Image Source</a>

This notebook builds on the [feature engineering introduction notebook](1.1_intro_to_fe.ipynb) to automate the transformation process, simplifying our workflow and making fuller use of the sklearn library.
<hr style="border:2px solid black">

## Penguin Dataset

We will use the Palmer Penguin Dataset.

### Business Goal
> Predict penguin body mass from the input features `flipper_length_mm`, `bill_length_mm`, `species` and `sex`

#### Load Packages

In [None]:
# data analysis stack
import numpy as np
import pandas as pd

# data visualization stack
import matplotlib.pyplot as plt

import seaborn as sns
sns.set_style('whitegrid')

# machine-learning stack
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import (
    OneHotEncoder,
    StandardScaler,
    RobustScaler,
    MinMaxScaler,
    KBinsDiscretizer,
    PolynomialFeatures
)

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# miscellaneous
import warnings
warnings.filterwarnings("ignore")

#### Load Data

In [None]:
df = pd.read_csv('./data/penguins.csv')
df.head()

#### Features and Target

In [None]:
numerical_features = [
    'flipper_length_mm',
    'bill_length_mm'
]

categorical_features = [
    'species',
    'sex'
]

features = numerical_features + categorical_features

target_variable = 'body_mass_g'

#### Feature-Target separation

In [None]:
# Feature matrix 
X = df[features]

# Target column
y = df[target_variable]

#### Train-Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,
    random_state=88,
    shuffle=True,
    stratify=X['species']  # keep the species proportions equal in both splits
)

print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)

## Exploratory Data Analysis
Check the data for common issues:
+ Which variables have missing values?
+ Which variables are binary, categorical, or metric?
+ Do the categorical variables have non-numeric values?
+ Do the metric features vary on different scales?
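
The checks in this list can be sketched with a few pandas calls. The small `toy` frame below is a made-up stand-in for `X_train` (its values are hypothetical, not taken from the penguin data), so treat this as an illustration of the idea rather than the notebook's actual EDA.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for X_train (values are made up)
toy = pd.DataFrame({
    "flipper_length_mm": [181.0, np.nan, 195.0, 210.0],
    "bill_length_mm": [39.1, 39.5, 40.3, 46.5],
    "species": ["Adelie", "Adelie", "Chinstrap", "Gentoo"],
    "sex": ["male", np.nan, "female", "female"],
})

# 1. Missing values per column
missing_counts = toy.isna().sum()

# 2. Split the columns into numeric and non-numeric
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

# 3. Distinct non-missing levels per non-numeric column (2 levels suggests a binary variable)
n_levels = toy[non_numeric_cols].nunique()

# 4. Range and spread of the numeric features, to compare their scales
scales = toy[numeric_cols].agg(["min", "max", "std"])

print(missing_counts, n_levels, scales, sep="\n\n")
```

The same calls should work unchanged on `df_train` once it has been built.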


In [None]:
# Assuming X_train is a DataFrame and y_train is a Series
df_train = pd.concat([X_train, y_train], axis=1)

print("Combined train data shape:", df_train.shape)

In [None]:
df_train.isna().sum()

<hr style="border:2px solid black">

## Feature Engineering
We have a pair of tools, `ColumnTransformer()` and `Pipeline()`, which can dramatically simplify and automate feature engineering.


### ColumnTransformer()
<a href="https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html">`ColumnTransformer`</a> allows us to specify which columns receive which transformations, and then concatenates the transformed columns back into a single feature matrix.

Parameters:
 * `transformers` - list of tuples `(name, transformer, columns)`

 * `remainder` - what to do with any columns not listed in `transformers`. Choose either `'drop'` (the default) or `'passthrough'`
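
As a minimal sketch of the `remainder` parameter, the snippet below fits the same transformer twice on a made-up two-column frame (the `toy` data and the helper `transformed_shape` are hypothetical names used only for illustration) and compares the output shapes.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: only flipper_length_mm is listed in the transformer
toy = pd.DataFrame({
    "flipper_length_mm": [180.0, 190.0, 200.0],
    "bill_length_mm": [39.0, 41.0, 46.0],
})

def transformed_shape(remainder):
    """Fit a one-transformer ColumnTransformer and return the output shape."""
    ct = ColumnTransformer(
        transformers=[("scaler", StandardScaler(), ["flipper_length_mm"])],
        remainder=remainder,
    )
    return ct.fit_transform(toy).shape

shape_drop = transformed_shape("drop")         # unlisted column is discarded
shape_pass = transformed_shape("passthrough")  # unlisted column is kept as-is
print(shape_drop, shape_pass)
```

With `'drop'` the untouched `bill_length_mm` column disappears from the output; with `'passthrough'` it is appended unchanged after the transformed columns.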
  
>**Note** that `ColumnTransformer()` runs all transformers in parallel, not sequentially, so if a column is transformed more than once, the version generated by each of these transformations will be included.

#### Building our first transformer


In [None]:
# define our transformers - name, transformer, columns
transformers = [('ohe', OneHotEncoder(drop = 'first',sparse_output=False), ['species', 'sex']),
                # the columns must be a flat list of names, not a nested list
                ('num_scaler', RobustScaler(), ['bill_length_mm', 'flipper_length_mm']),
               ]

In [None]:
# now we instantiate our ColumnTransformer() object
column_transformer = ColumnTransformer(transformers,
                                       remainder = 'drop')
column_transformer

We still need to impute missing values in `sex` and `flipper_length_mm`. If we simply added an imputer to this transformer, it would run in parallel with the encoder and the scaler: the output would contain an imputed copy of each column next to an encoded or scaled copy that was still computed from the un-imputed data.

What we need here is a way to sequentially apply transformations, which leads us nicely into...
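
The problem can be reproduced on a made-up column. In this sketch (toy values, arbitrary transformer names) an imputer and a scaler are both pointed at the same column inside one `ColumnTransformer`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler

# Hypothetical toy column with one missing value
toy = pd.DataFrame({"flipper_length_mm": [180.0, np.nan, 200.0, 220.0]})

# Imputer and scaler run side by side, each on the ORIGINAL column
ct = ColumnTransformer([
    ("imputer", SimpleImputer(strategy="median"), ["flipper_length_mm"]),
    ("scaler", RobustScaler(), ["flipper_length_mm"]),
])
out = ct.fit_transform(toy)

# Column 0: imputed but not scaled; column 1: scaled but still missing
imputed_has_nan = np.isnan(out[:, 0]).any()
scaled_has_nan = np.isnan(out[:, 1]).any()
print(imputed_has_nan, scaled_has_nan)
```

Neither output column is both imputed and scaled, which is why the two steps have to be chained rather than listed side by side.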

### Pipeline()

<a href="https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html">`Pipeline()`</a> allows us to apply multiple transformers to the same column(s) sequentially, with the output of each step passed on as the input of the next.


Parameters:
 * `steps` - list of tuples `(name, transformer)`
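
A quick way to see what "sequential" means is to compare a pipeline with the same steps applied by hand. The sketch below uses a made-up column; the step names `imputer` and `scaler` are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Hypothetical toy column with one missing value
toy = pd.DataFrame({"flipper_length_mm": [180.0, np.nan, 200.0, 220.0]})

# Pipeline: impute first, then scale the imputed values
pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", RobustScaler()),
])
piped = pipe.fit_transform(toy)

# The same two steps applied by hand, in the same order
imputed = SimpleImputer(strategy="median").fit_transform(toy)
manual = RobustScaler().fit_transform(imputed)

same_result = np.allclose(piped, manual)
print(same_result)
```

The pipeline gives the same numbers as the manual chain, but bundles both fitted steps into a single object that can be reused on new data.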

#### Build a pipeline and integrate it into our transformer

In [None]:
# Let's define the steps to impute and transform sex
sex_steps = [('imputer', SimpleImputer(strategy = 'most_frequent')),
             ('sex_ohe', OneHotEncoder(drop = 'first',sparse_output=False))
             ]

In [None]:
# Let's instantiate the sex pipeline
sex_pipeline = Pipeline(steps=sex_steps)
sex_pipeline

In [None]:
# Let's define the steps to impute and transform flipper_length_mm
flipper_steps = [('imputer', SimpleImputer(strategy = 'median')),
                 ('flipper_scaler', RobustScaler())
                 ]

In [None]:
# Let's instantiate the flipper pipeline
flipper_pipeline = Pipeline(steps=flipper_steps)
flipper_pipeline

In [None]:
# Let's build a new transformer to include this pipeline
transformers_2 = [('sex_pipeline', sex_pipeline, ['sex']),
                  ('ohe', OneHotEncoder(drop = 'first',sparse_output=False), ['species']),
                  ('flipper_pipeline', flipper_pipeline, ['flipper_length_mm']),
                  ('scaler', RobustScaler(), ['bill_length_mm'])
                 ]

column_transformer_2 = ColumnTransformer(transformers=transformers_2,
                                         remainder = 'drop').set_output(transform='pandas')
column_transformer_2      

#### Let's Try It Out!

In [None]:
# Fit the column transformer object ONLY using train data
column_transformer_2.fit(X_train)

In [None]:
X_train.isna().sum()

In [None]:
# Transform the data
X_train_fe = column_transformer_2.transform(X_train)
X_train_fe

As you can see, the variables have been transformed according to the strategies defined in our pipelines and column transformer.

## Model Building

### Nesting Pipelines

We've already seen that a step in a pipeline or column transformer can be any named transformer object, including another pipeline. The final trick we'll explore is nesting: placing the complete column transformer, which itself contains pipelines, inside one outer pipeline that ends with a model.
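
One practical consequence of nesting is that the inner steps stay reachable from the outermost object. The sketch below is an illustration on made-up data, with hypothetical step names (`features`, `flipper`, `imputer`, `regressor`): it changes an inner parameter using scikit-learn's double-underscore naming and then looks up the fitted inner step.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Hypothetical toy data
toy_X = pd.DataFrame({"flipper_length_mm": [180.0, np.nan, 200.0, 220.0]})
toy_y = pd.Series([3500.0, 3800.0, 4200.0, 5000.0])

# Three layers: pipeline -> column transformer -> pipeline
inner = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", RobustScaler()),
])
ct = ColumnTransformer([("flipper", inner, ["flipper_length_mm"])])
model = Pipeline([("features", ct), ("regressor", LinearRegression())])

# Reach into the innermost step: <step>__<transformer>__<step>__<parameter>
model.set_params(features__flipper__imputer__strategy="mean")
model.fit(toy_X, toy_y)

# Look up the fitted inner imputer through the named layers
fitted_imputer = (
    model.named_steps["features"]
    .named_transformers_["flipper"]
    .named_steps["imputer"]
)
print(fitted_imputer.strategy, fitted_imputer.statistics_)
```

The same double-underscore names are what a parameter search would use to tune preprocessing choices together with the model.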

In [None]:
# build a pipeline containing our complete transformer and then a linear regression model
model_steps = [('feature_engineering', column_transformer_2),
               ('linear_regression', LinearRegression())]
linear_model = Pipeline(steps = model_steps)
linear_model

**train model**

In [None]:
linear_model.fit(X_train,y_train)

In [None]:
training_score = linear_model.score(X_train,y_train)
print(f"training r2 score: {round(training_score, 6)}")

### Model Evaluation

**Model Weights**

In [None]:
column_step = linear_model.steps[0][1]
column_step

In [None]:
model_step = linear_model.steps[1][1]
model_step

In [None]:
# one row of weights, labelled with the feature names produced by the column transformer
coef_model = pd.DataFrame(
    data=model_step.coef_.reshape(1,-1),
    columns=column_step.get_feature_names_out(),
    index=['weight']
)

coef_model['intercept'] = model_step.intercept_
coef_model

**Model Prediction**

In [None]:
y_pred_test = linear_model.predict(X_test)
y_pred_test

**Model Performance**

In [None]:
test_score = linear_model.score(X_test,y_test)
print(f"test r2 score: {round(test_score, 6)}")

>By applying the fitted pipeline to our test data, we ensure that the test data is treated in exactly the same way as the data the model was trained on: every imputation value, category and scaling statistic comes from the training data.
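
This can be checked on a tiny made-up example. In the sketch below (hypothetical train and test frames, arbitrary step names) the pipeline is fitted on training rows only, and a missing test value is then filled with the median learned from the training data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data: the test set contains a missing value
train_X = pd.DataFrame({"flipper_length_mm": [180.0, 190.0, 200.0]})
train_y = pd.Series([3500.0, 3800.0, 4100.0])
test_X = pd.DataFrame({"flipper_length_mm": [np.nan, 230.0]})

model = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("regressor", LinearRegression()),
])
model.fit(train_X, train_y)  # statistics are learned from the training rows only

# The median stored at fit time is reused when the test data is transformed
learned_median = model.named_steps["imputer"].statistics_[0]
filled_test = model.named_steps["imputer"].transform(test_X)
pred = model.predict(test_X)

print(learned_median, filled_test[0, 0])
```

The missing test value is replaced by the training median, not by anything computed from the test set, so no information from the test data leaks into preprocessing.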

<hr style="border:2px solid black">