<table border="0" style="width:100%">
 <tr>
    <td>
        <img src="https://static-frm.ie.edu/university/wp-content/uploads/sites/6/2022/06/IE-University-logo.png" width=150>
     </td>
    <td><div style="font-family:'Courier New'">
            <div style="font-size:25px">
                <div style="text-align: right"> 
                    <b> MASTER IN BIG DATA</b>
                    <br>
                    Python for Data Analysis II
                    <br><br>
                    <em> Daniel Sierra Ramos </em>
                </div>
            </div>
        </div>
    </td>
 </tr>
</table>

# **S15: PIPELINES**

One problem that often arises when we build machine learning models with `scikit-learn` is that we have to chain many operations. For example:
 - one-hot-encoding for categorical variables
 - imputation of missing values just for some variables
 - standard scaling of continuous features
 - training a model on the resulting data
 
Each of the operations above has its own `fit`/`transform` step. We have to apply all of them, in order, to the training set, and then replicate exactly the same operations (reusing the already fitted transformers) to get results on the test set.

## Composite transformers and estimators

In `scikit-learn` we have some tools that avoid replicating these structures and calling `fit` or `transform` separately for every operation in the data preprocessing and training steps.

 - `Pipeline` - Chains several operations that run consecutively. For example, we can build a pipeline that first fits a `StandardScaler` and then a `LogisticRegression`.
 - `ColumnTransformer` - Applies specific transformations to specific columns of the DataFrame independently. For example, we can apply a `OneHotEncoder` to the categorical variables and a `StandardScaler` to the numerical variables, and automatically join the results of both operations in a new table.
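As a rough illustration of how the two tools fit together, here is a minimal sketch on a toy DataFrame. The column names and values are hypothetical and are not part of the Adults dataset; `sparse_threshold=0` is only used so that the transformed output is a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# toy data (hypothetical values, not the Adults dataset)
toy = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue", "green"],
    "size": [1.0, 2.5, 0.5, 3.0, 2.0, 3.5],
})
target = [0, 1, 0, 1, 1, 1]

# ColumnTransformer: a different transformer for each group of columns
toy_preprocessing = ColumnTransformer(
    [
        ("ohe", OneHotEncoder(), ["color"]),
        ("scaler", StandardScaler(), ["size"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Pipeline: preprocessing followed by the model, fitted with a single call
toy_model = Pipeline(
    steps=[
        ("preprocessing", toy_preprocessing),
        ("classifier", LogisticRegression()),
    ]
)
toy_model.fit(toy, target)

# the fitted preprocessing step can be reused on its own
toy_features = toy_model.named_steps["preprocessing"].transform(toy)
print(toy_features.shape)  # three one-hot columns plus one scaled column
print(toy_model.predict(toy))
```

The same `toy_model` object can then be applied to new rows with `predict`, without repeating any of the preprocessing calls by hand.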

## **Example: The Adults dataset**

In [10]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

In [9]:
import numpy as np
import pandas as pd

import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots


from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

from sklearn.metrics import make_scorer, precision_score, recall_score, f1_score, roc_auc_score, roc_curve

from itertools import product

### **Load Data**

In [2]:
data = pd.read_csv("data/adult.csv.zip")

In [4]:
data.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              48842 non-null  int64 
 1   workclass        48842 non-null  object
 2   fnlwgt           48842 non-null  int64 
 3   education        48842 non-null  object
 4   educational-num  48842 non-null  int64 
 5   marital-status   48842 non-null  object
 6   occupation       48842 non-null  object
 7   relationship     48842 non-null  object
 8   race             48842 non-null  object
 9   gender           48842 non-null  object
 10  capital-gain     48842 non-null  int64 
 11  capital-loss     48842 non-null  int64 
 12  hours-per-week   48842 non-null  int64 
 13  native-country   48842 non-null  object
 14  income           48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


In [5]:
data.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


### 1. Build a `ColumnTransformer` 

## **Data Quality**

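In this part, let's check two things:
 - Missing values -> `info()` reports no null values in this dataset (although, as the preview above shows, some entries use `?` as a placeholder)
 - Data types -> Some columns are cast to `float` so that the continuous features can later be selected by dtype
    - `fnlwgt` -> `float`
    - `capital-gain` -> `float`
    - `capital-loss` -> `float`
    - `hours-per-week` -> `float`
    - `age` -> `float`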

In [6]:
data = data.astype({
    "fnlwgt": float,
    "capital-gain": float,
    "capital-loss": float,
    "hours-per-week": float,
    "age": float
})

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   age              48842 non-null  float64
 1   workclass        48842 non-null  object 
 2   fnlwgt           48842 non-null  float64
 3   education        48842 non-null  object 
 4   educational-num  48842 non-null  int64  
 5   marital-status   48842 non-null  object 
 6   occupation       48842 non-null  object 
 7   relationship     48842 non-null  object 
 8   race             48842 non-null  object 
 9   gender           48842 non-null  object 
 10  capital-gain     48842 non-null  float64
 11  capital-loss     48842 non-null  float64
 12  hours-per-week   48842 non-null  float64
 13  native-country   48842 non-null  object 
 14  income           48842 non-null  object 
dtypes: float64(5), int64(1), object(9)
memory usage: 5.6+ MB
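`info()` only counts real nulls, so string placeholders such as the `?` visible in the `head()` output go unnoticed. The sketch below, on a small hypothetical frame, shows one possible way to count them and, if desired, turn them into proper missing values. Whether to do so for the Adults data is a modelling decision that this notebook does not take.

```python
import numpy as np
import pandas as pd

# toy frame that mimics the "?" placeholder seen in the head() output
toy = pd.DataFrame({
    "workclass": ["Private", "?", "Local-gov", "?"],
    "occupation": ["Sales", "?", "Tech-support", "Sales"],
    "age": [25, 18, 38, 44],
})

# isna() finds nothing, because "?" is an ordinary string
n_null = toy.isna().sum().sum()

# count the placeholder explicitly, column by column
placeholder_counts = (toy == "?").sum()

# optionally convert the placeholders into real missing values
toy_clean = toy.replace("?", np.nan)

print(n_null)
print(placeholder_counts)
print(toy_clean.isna().sum())
```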


## **Build a model** [without Pipelines]

In this case, we're building a model with the following steps:
   1. Preprocessing
      1. Transform categorical data with OHE
      2. Transform numerical data with StandardScaler
   2. Train a model

### **Prepare data**

In [8]:
X = data.drop(columns=["income"])
y = data["income"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=99)

### **Preprocessing**

**Apply One-Hot-Encoding** to categorical data

In [13]:
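# find categorical columns: object dtype plus the integer column `educational-num`
X_train_cat = X_train.select_dtypes(["O","int"])

# apply OHE to categorical columns
# (in scikit-learn >= 1.2 the `sparse` argument is named `sparse_output`)
ohe = OneHotEncoder(sparse=False)
X_train_cat = ohe.fit_transform(X_train_cat)

# the encoder returns a NumPy array, so rebuild a DataFrame with the new column names
X_train_cat = pd.DataFrame(X_train_cat, columns=ohe.get_feature_names_out())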

**Apply standardization** to numerical data

In [21]:
X_train_num = X_train.select_dtypes(["float"])

scaler = StandardScaler()
X_train_num = scaler.fit_transform(X_train_num)

X_train_num = pd.DataFrame(X_train_num, columns=scaler.get_feature_names_out())

**Join both results in one table**

In [22]:
X_train_full = X_train_cat.join(X_train_num)
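`join` aligns rows on the index. It works here because both frames were rebuilt from NumPy arrays and therefore share a fresh `0..n-1` index. The sketch below uses hypothetical values to show what can go wrong when only one side keeps a shuffled index, and how passing `index=` keeps the rows aligned.

```python
import numpy as np
import pandas as pd

# a frame with a shuffled index, like X_train after train_test_split
left = pd.DataFrame({"a": [1.0, 2.0, 3.0]}, index=[7, 12, 5])

# a transformer returns a plain array, so a new DataFrame gets index 0, 1, 2
right_values = np.array([[10.0], [20.0], [30.0]])
right_fresh = pd.DataFrame(right_values, columns=["b"])

# the indices do not overlap, so column "b" contains only NaN after the join
misaligned = left.join(right_fresh)

# reusing the original index keeps the rows aligned
right_aligned = pd.DataFrame(right_values, columns=["b"], index=left.index)
aligned = left.join(right_aligned)

print(misaligned)
print(aligned)
```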

### **Build the model**

In [23]:
clf = LogisticRegression(solver='liblinear', max_iter=1000)

In [24]:
%%time

clf.fit(X_train_full, y_train)

CPU times: total: 438 ms
Wall time: 526 ms


LogisticRegression(max_iter=1000, solver='liblinear')

### **Evaluate the model**

#### **Evaluate on training**

In [25]:
pred = clf.predict(X_train_full)
probas = clf.predict_proba(X_train_full)

In [26]:
precision_train = precision_score(y_train, pred, pos_label=">50K")
recall_train = recall_score(y_train, pred, pos_label=">50K")
f1_train = f1_score(y_train, pred, pos_label=">50K")
# probas[:, 1] is the probability of clf.classes_[1], here ">50K"
roc_auc_train = roc_auc_score(y_train, probas[:,1])

In [27]:
print(f"Train Precision: {round(precision_train,3)}")
print(f"Train Recall: {round(recall_train,3)}")
print(f"Train F1: {round(f1_train,3)}")
print(f"Train ROC_AUC: {round(roc_auc_train,3)}")

Train Precision: 0.738
Train Recall: 0.607
Train F1: 0.666
Train ROC_AUC: 0.909


#### **Evaluate on test**

In [28]:
# apply OHE to categorical columns
X_test_cat = X_test.select_dtypes(["O","int"])
X_test_cat = ohe.transform(X_test_cat)
X_test_cat = pd.DataFrame(X_test_cat, columns=ohe.get_feature_names_out())

# apply standardization to the test set too (transform only, no refitting)
X_test_num = X_test.select_dtypes(["float"])
X_test_num = scaler.transform(X_test_num)
X_test_num = pd.DataFrame(X_test_num, columns=scaler.get_feature_names_out())

# join both tables
X_test_full = X_test_cat.join(X_test_num)
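Calling `transform` on the test set reuses the categories learned during `fit`. The sketch below uses hypothetical values to show what may happen when the test set contains a category that was never seen in training: with the default settings the encoder raises an error, while `handle_unknown="ignore"` encodes the unseen value as all zeros.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# hypothetical train/test split where the test set holds an unseen category
toy_train = pd.DataFrame({"native-country": ["United-States", "Thailand", "United-States"]})
toy_test = pd.DataFrame({"native-country": ["Vietnam"]})

# default behaviour: unknown categories raise a ValueError at transform time
strict = OneHotEncoder().fit(toy_train)
try:
    strict.transform(toy_test)
    strict_failed = False
except ValueError:
    strict_failed = True

# handle_unknown="ignore": the unseen category becomes a row of zeros
tolerant = OneHotEncoder(handle_unknown="ignore").fit(toy_train)
encoded = tolerant.transform(toy_test).toarray()

print(strict_failed)
print(encoded)
```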

In [29]:
pred = clf.predict(X_test_full)
probas = clf.predict_proba(X_test_full)

In [30]:
precision_test = precision_score(y_test, pred, pos_label=">50K")
recall_test = recall_score(y_test, pred, pos_label=">50K")
f1_test = f1_score(y_test, pred, pos_label=">50K")
roc_auc_test = roc_auc_score(y_test, probas[:,1])

In [31]:
print(f"Test Precision: {round(precision_test,3)}")
print(f"Test Recall: {round(recall_test,3)}")
print(f"Test F1: {round(f1_test,3)}")
print(f"Test ROC_AUC: {round(roc_auc_test,3)}")

Test Precision: 0.721
Test Recall: 0.599
Test F1: 0.654
Test ROC_AUC: 0.905
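The four metric calls are repeated for every evaluation, so they could be wrapped in a small helper. The function name `binary_metrics` and the toy labels below are hypothetical; this is only a sketch of one way to reduce the repetition.

```python
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score


def binary_metrics(y_true, pred, proba_pos, pos_label):
    """Return precision, recall, F1 and ROC AUC for a binary problem."""
    return {
        "precision": precision_score(y_true, pred, pos_label=pos_label),
        "recall": recall_score(y_true, pred, pos_label=pos_label),
        "f1": f1_score(y_true, pred, pos_label=pos_label),
        # proba_pos must hold the scores of the positive (greater) label
        "roc_auc": roc_auc_score(y_true, proba_pos),
    }


# tiny hand-made example (hypothetical labels and scores)
toy_true = ["<=50K", ">50K", ">50K", "<=50K"]
toy_pred = ["<=50K", ">50K", "<=50K", "<=50K"]
toy_proba = [0.1, 0.9, 0.4, 0.2]

toy_metrics = binary_metrics(toy_true, toy_pred, toy_proba, pos_label=">50K")
for name, value in toy_metrics.items():
    print(f"{name}: {round(value, 3)}")
```

With the objects from this notebook, a call could look like `binary_metrics(y_test, pred, probas[:, 1], pos_label=">50K")`.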


## **Build a model** [with Pipelines]

### **Prepare data**

In [32]:
X = data.drop(columns=["income"])
y = data["income"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=99)

### **Preprocessing**

Build a **ColumnTransformer** with all the preprocessing

In [33]:
categorical_columns = X.select_dtypes(["O","int"]).columns
numerical_columns = X.select_dtypes(["float"]).columns

In [39]:
preprocessing = ColumnTransformer(
    [
        (
            "ohe",
            OneHotEncoder(sparse=False),
            categorical_columns
        ),
        (
            "scaler",
            StandardScaler(),
            numerical_columns
        )
    ],
    remainder="drop"  # this can be "drop", "passthrough", or another Estimator
)

In [40]:
# we can directly apply the OHE + scaler to original data
X_train_full = preprocessing.fit_transform(X_train)

In [41]:
# transform to DataFrame
X_train_full = pd.DataFrame(X_train_full, columns=preprocessing.get_feature_names_out())

In [42]:
X_train_full.head()

Unnamed: 0,ohe__workclass_?,ohe__workclass_Federal-gov,ohe__workclass_Local-gov,ohe__workclass_Never-worked,ohe__workclass_Private,ohe__workclass_Self-emp-inc,ohe__workclass_Self-emp-not-inc,ohe__workclass_State-gov,ohe__workclass_Without-pay,ohe__education_10th,...,ohe__native-country_Thailand,ohe__native-country_Trinadad&Tobago,ohe__native-country_United-States,ohe__native-country_Vietnam,ohe__native-country_Yugoslavia,scaler__age,scaler__fnlwgt,scaler__capital-gain,scaler__capital-loss,scaler__hours-per-week
0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,1.487854,-1.453557,1.876761,-0.216957,1.584532
1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,-0.044861,0.005824,-0.146436,-0.216957,0.774555
2,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.174099,-0.652381,0.271428,-0.216957,-0.035421
3,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.466044,-1.525541,-0.146436,3.482491,2.394508
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,-0.628752,-0.76767,-0.146436,-0.216957,-0.035421


### **Build the whole process**

Now that we have defined the preprocessing step, we can use a `Pipeline` to chain the preprocessing and the model training into a single estimator.

In [43]:
clf = LogisticRegression(solver='liblinear', max_iter=1000)

In [44]:
model = Pipeline(
    steps=[
        ("preprocessing", preprocessing),
        ("classifier", clf)
    ]
)

### **Apply the whole process to original data**

In [45]:
%%time

model.fit(X_train, y_train)

CPU times: total: 422 ms
Wall time: 502 ms


Pipeline(steps=[('preprocessing',
                 ColumnTransformer(transformers=[('ohe',
                                                  OneHotEncoder(sparse=False),
                                                  Index(['workclass', 'education', 'educational-num', 'marital-status',
       'occupation', 'relationship', 'race', 'gender', 'native-country'],
      dtype='object')),
                                                 ('scaler', StandardScaler(),
                                                  Index(['age', 'fnlwgt', 'capital-gain', 'capital-loss', 'hours-per-week'], dtype='object'))])),
                ('classifier',
                 LogisticRegression(max_iter=1000, solver='liblinear'))])

### **Evaluate the model**

#### **Evaluate on training**

In [46]:
pred = model.predict(X_train)
probas = model.predict_proba(X_train)

In [47]:
precision_train = precision_score(y_train, pred, pos_label=">50K")
recall_train = recall_score(y_train, pred, pos_label=">50K")
f1_train = f1_score(y_train, pred, pos_label=">50K")
roc_auc_train = roc_auc_score(y_train, probas[:,1])

In [48]:
print(f"Train Precision: {round(precision_train,3)}")
print(f"Train Recall: {round(recall_train,3)}")
print(f"Train F1: {round(f1_train,3)}")
print(f"Train ROC_AUC: {round(roc_auc_train,3)}")

Train Precision: 0.738
Train Recall: 0.607
Train F1: 0.666
Train ROC_AUC: 0.909


#### **Evaluate on test**

In [49]:
pred = model.predict(X_test)
probas = model.predict_proba(X_test)

In [50]:
precision_test = precision_score(y_test, pred, pos_label=">50K")
recall_test = recall_score(y_test, pred, pos_label=">50K")
f1_test = f1_score(y_test, pred, pos_label=">50K")
roc_auc_test = roc_auc_score(y_test, probas[:,1])

In [51]:
print(f"Test Precision: {round(precision_test,3)}")
print(f"Test Recall: {round(recall_test,3)}")
print(f"Test F1: {round(f1_test,3)}")
print(f"Test ROC_AUC: {round(roc_auc_test,3)}")

Test Precision: 0.721
Test Recall: 0.599
Test F1: 0.654
Test ROC_AUC: 0.905
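Because the pipeline bundles preprocessing and model into one estimator, it can be passed directly to utilities such as `cross_val_score`. Each fold then refits the encoder and the scaler on its own training part, which avoids leaking information from the validation part. The sketch below uses synthetic data generated only for illustration, not the Adults dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# synthetic data (hypothetical, generated only for this sketch)
rng = np.random.default_rng(0)
n = 200
toy_X = pd.DataFrame({
    "group": rng.choice(["a", "b", "c"], size=n),
    "value": rng.normal(size=n),
})
toy_y = np.where(toy_X["value"] + (toy_X["group"] == "a") > 0.3, ">50K", "<=50K")

toy_model = Pipeline(
    steps=[
        ("preprocessing", ColumnTransformer(
            [
                ("ohe", OneHotEncoder(handle_unknown="ignore"), ["group"]),
                ("scaler", StandardScaler(), ["value"]),
            ]
        )),
        ("classifier", LogisticRegression(max_iter=1000)),
    ]
)

# every fold fits encoder, scaler and classifier on its own training part
cv_scores = cross_val_score(toy_model, toy_X, toy_y, cv=5, scoring="roc_auc")
print(cv_scores.mean())
```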


### **Exercise:** Build the following pipeline

1. Preprocessing
   1. OHE to all categorical columns except `workclass`
   2. OrdinalEncoder for `workclass`
   3. StandardScaler for all resulting columns + numerical ones
2. Feature Selection technique (SelectKBest)
3. Model training
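Once feature selection and the model live in the same pipeline, their hyperparameters can be tuned together, because every parameter is addressable as `<step name>__<parameter>`. The sketch below uses synthetic numeric data and hypothetical step names and search ranges. It is not a solution to the exercise, only an illustration of the naming convention with `RandomizedSearchCV`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic numeric data (hypothetical stand-in for already encoded features)
toy_X, toy_y = make_classification(
    n_samples=300, n_features=12, n_informative=4, random_state=0
)

toy_pipeline = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("feature_selection", SelectKBest(score_func=f_classif, k=10)),
        ("model", LogisticRegression(max_iter=1000)),
    ]
)

# parameter names follow the "<step name>__<parameter>" convention
toy_search_space = {
    "feature_selection__k": [4, 8, 12],
    "model__C": np.logspace(-3, 3, 7),  # assumed range, for illustration only
}

toy_search = RandomizedSearchCV(
    toy_pipeline,
    param_distributions=toy_search_space,
    n_iter=5,
    cv=3,
    scoring="roc_auc",
    random_state=0,
)
toy_search.fit(toy_X, toy_y)
print(toy_search.best_params_)
```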

In [11]:
X = data.drop(["income"], axis=1)
y = data["income"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=99)

Sc = StandardScaler()
ohe = OneHotEncoder(handle_unknown='ignore')



In [19]:
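# relies on the float cast from the Data Quality section: if the numeric columns
# are still int64, "int" selects all of them as categorical too
cat_features = X.select_dtypes(["int","O"]).columns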

In [20]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              48842 non-null  int64 
 1   workclass        48842 non-null  object
 2   fnlwgt           48842 non-null  int64 
 3   education        48842 non-null  object
 4   educational-num  48842 non-null  int64 
 5   marital-status   48842 non-null  object
 6   occupation       48842 non-null  object
 7   relationship     48842 non-null  object
 8   race             48842 non-null  object
 9   gender           48842 non-null  object
 10  capital-gain     48842 non-null  int64 
 11  capital-loss     48842 non-null  int64 
 12  hours-per-week   48842 non-null  int64 
 13  native-country   48842 non-null  object
dtypes: int64(6), object(8)
memory usage: 5.2+ MB


In [24]:
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

cat_preprocessing = ColumnTransformer(
    [
        # one-hot encode every categorical column except `workclass`
        ("ohe", OneHotEncoder(sparse=False), cat_features.drop("workclass")),
        # ordinal-encode `workclass`, as the exercise asks
        ("ordinal", OrdinalEncoder(), ["workclass"])
    ],
    remainder="passthrough"  # columns not listed above are passed through unchanged
)

In [25]:
preprocessing = Pipeline(
    [
        ("cat_preprocessing", cat_preprocessing),
        ("scaler", StandardScaler())
    ]
)

In [27]:
steps = []

In [28]:
steps.append(("preprocessing", preprocessing))

In [31]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
fs = SelectKBest(score_func=f_classif, k=10)

In [32]:
steps.append(("feature_selection", fs))

In [33]:
lr = LogisticRegression()

# grid for a later hyperparameter search; the range is an assumed placeholder,
# and the key uses the `<step>__<parameter>` form needed inside a Pipeline
search_space = {
    "model__C": np.logspace(-3, 3, 7)
}

In [34]:
steps.append(("model", lr))

In [35]:
final = Pipeline(steps=steps)

In [39]:
final

Pipeline(steps=[('preprocessing',
                 Pipeline(steps=[('cat_preprocessing',
                                  ColumnTransformer(remainder='passthrough',
                                                    transformers=[('ohe',
                                                                   OneHotEncoder(sparse=False),
                                                                   Index(['age', 'fnlwgt', 'education', 'educational-num', 'marital-status',
       'occupation', 'relationship', 'race', 'gender', 'capital-gain',
       'capital-loss', 'hours-per-week', 'native-country'],
      dtype='object')),
                                                                  ('ordinal',
                                                                   OrdinalEncoder(),
                                                                   ['workclass'])])),
                                 ('scaler', StandardScaler())])),
                ('feature_selection', SelectKBest()),
                ('model', LogisticRegression())])


In [40]:

final.get_feature_names_out()

NotFittedError: This ColumnTransformer instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.