# Sklearn pipelines


### Learning Objectives
After completing this lesson, you will:
- understand the need for preprocessing data in general
- recognize pitfalls and "danger" in the process
- appreciate the structured and systematic approach offered by **pipelines**
  

## 1. MLR on housing data

We will use the **California Housing Dataset** from `sklearn.datasets`.  
It contains data on housing prices and district-level demographics in California.


In [1]:
pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [2]:
from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np

data = fetch_california_housing(as_frame=True)
df = data.frame
df.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


The variable **`MedHouseVal`** is the **median house value** in $100,000s.  
We will predict it using a few numeric predictors.


#### Select a Subset of Variables

Use only the following features:

| Feature | Description |
|----------|--------------|
| MedInc | Median income in block group |
| AveRooms | Average number of rooms per household |
| AveBedrms | Average number of bedrooms per household |
| Population | Block group population |
| HouseAge | Median age of houses in the block group |


In [3]:
cols = ['MedHouseVal', 'MedInc', 'AveRooms', 'AveBedrms', 'Population', 'HouseAge']
df_sub = df[cols].copy()  # explicit copy, so later assignments to df_sub do not trigger a SettingWithCopyWarning
df_sub.head()

Unnamed: 0,MedHouseVal,MedInc,AveRooms,AveBedrms,Population,HouseAge
0,4.526,8.3252,6.984127,1.02381,322.0,41.0
1,3.585,8.3014,6.238137,0.97188,2401.0,21.0
2,3.521,7.2574,8.288136,1.073446,496.0,52.0
3,3.413,5.6431,5.817352,1.073059,558.0,52.0
4,3.422,3.8462,6.281853,1.081081,565.0,52.0


#### Multiple Regression using **statsmodels**
Use `statsmodels.api.OLS` to fit a multiple regression model.


In [4]:
import statsmodels.api as sm

X = df_sub[['MedInc', 'AveRooms', 'AveBedrms', 'Population', 'HouseAge']]
y = df_sub['MedHouseVal']

X = sm.add_constant(X)
model_sm = sm.OLS(y, X).fit()
print(model_sm.summary())

                            OLS Regression Results                            
Dep. Variable:            MedHouseVal   R-squared:                       0.538
Model:                            OLS   Adj. R-squared:                  0.538
Method:                 Least Squares   F-statistic:                     4801.
Date:                Wed, 19 Nov 2025   Prob (F-statistic):               0.00
Time:                        12:35:47   Log-Likelihood:                -24278.
No. Observations:               20640   AIC:                         4.857e+04
Df Residuals:                   20634   BIC:                         4.862e+04
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.4407      0.028    -15.945      0.0

In [5]:
X

Unnamed: 0,const,MedInc,AveRooms,AveBedrms,Population,HouseAge
0,1.0,8.3252,6.984127,1.023810,322.0,41.0
1,1.0,8.3014,6.238137,0.971880,2401.0,21.0
2,1.0,7.2574,8.288136,1.073446,496.0,52.0
3,1.0,5.6431,5.817352,1.073059,558.0,52.0
4,1.0,3.8462,6.281853,1.081081,565.0,52.0
...,...,...,...,...,...,...
20635,1.0,1.5603,5.045455,1.133333,845.0,25.0
20636,1.0,2.5568,6.114035,1.315789,356.0,18.0
20637,1.0,1.7000,5.205543,1.120092,1007.0,17.0
20638,1.0,1.8672,5.329513,1.171920,741.0,18.0


# R-squared: together, our 5 predictors explain 53.8% of the variance in `MedHouseVal`

## Predictions

In [6]:
# New data: a hypothetical "average" block group (the column means of X)
X_new = X.mean()
print(pd.DataFrame(X_new))

# Predict
pred = model_sm.predict(X_new)
print(pred)
beta = model_sm.params
print(X_new @ beta)

                      0
const          1.000000
MedInc         3.870671
AveRooms       5.429000
AveBedrms      1.096675
Population  1425.476744
HouseAge      28.639486
None    2.068558
dtype: float64
2.068558169089288


In [7]:
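def print_dot_product(beta, x, names, intercept_name="Intercept"):
    # Print the prediction as an explicit sum of coefficient*value terms.
    # `names` and `intercept_name` are accepted but not used in the printed output.
    terms = []
    for b, v, n in zip(beta, x, names):
        terms.append(f"{b:.3f}*{v:.3f}")
    equation = " + ".join(terms)
    print(equation)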

print_dot_product(beta, X_new, names = beta.index.tolist())

-0.441*1.000 + 0.536*3.871 + -0.211*5.429 + 0.991*1.097 + 0.000*1425.477 + 0.016*28.639


## scikit-learn

From now on we will switch almost entirely to the **sklearn** library!


Let us fit the same model using `LinearRegression`.


In [8]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X = df_sub[['MedInc', 'AveRooms', 'AveBedrms', 'Population', 'HouseAge']]
y = df_sub['MedHouseVal']

model_sk = LinearRegression()
model_sk.fit(X, y)

print("Intercept:", model_sk.intercept_)
print("Coefficients:", model_sk.coef_)

Intercept: -0.4407217437469684
Coefficients: [ 5.36014757e-01 -2.11185756e-01  9.90813314e-01  1.84789639e-05
  1.63455751e-02]


In [9]:
# Predictions
X_new_sk = X.mean().to_frame().T ## convert Series → DataFrame with one row
#print(X_new_sk)
pred = model_sk.predict(X_new_sk)
print(pred)

[2.06855817]


In [10]:
# Compute R²
y_pred = model_sk.predict(X)
r2_score(y, y_pred)

0.5377839208402416

## Data Scaling

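Many ML methods (for example regularized regression, k-nearest neighbors, or anything fitted by gradient descent) are sensitive to the scale of the features, so the data often has to be scaled first. The summary below shows how different the ranges are: `Population` runs into the tens of thousands while `AveBedrms` is mostly close to 1.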


In [11]:
df_sub.describe()

Unnamed: 0,MedHouseVal,MedInc,AveRooms,AveBedrms,Population,HouseAge
count,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0
mean,2.068558,3.870671,5.429,1.096675,1425.476744,28.639486
std,1.153956,1.899822,2.474173,0.473911,1132.462122,12.585558
min,0.14999,0.4999,0.846154,0.333333,3.0,1.0
25%,1.196,2.5634,4.440716,1.006079,787.0,18.0
50%,1.797,3.5348,5.229129,1.04878,1166.0,29.0
75%,2.64725,4.74325,6.052381,1.099526,1725.0,37.0
max,5.00001,15.0001,141.909091,34.066667,35682.0,52.0


In [12]:
from sklearn.preprocessing import StandardScaler # MinMaxScaler 

#Manual:
# Initialize the scaler
scaler = StandardScaler()

# Fit + transform
X_scaled = scaler.fit_transform(X) # 1 Fit Step , 2 Transform

# Convert back to a DataFrame with the same column names
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)

X_scaled.describe()

Unnamed: 0,MedInc,AveRooms,AveBedrms,Population,HouseAge
count,20640.0,20640.0,20640.0,20640.0,20640.0
mean,6.6097e-17,6.6097e-17,-1.060306e-16,-1.101617e-17,5.508083e-18
std,1.000024,1.000024,1.000024,1.000024,1.000024
min,-1.774299,-1.852319,-1.610768,-1.256123,-2.19618
25%,-0.6881186,-0.3994496,-0.1911716,-0.5638089,-0.8453931
50%,-0.1767951,-0.08078489,-0.101065,-0.2291318,0.02864572
75%,0.4593063,0.2519615,0.006015869,0.2644949,0.6643103
max,5.858286,55.16324,69.57171,30.25033,1.856182


# How do I 'scale' my data?

This is the fit:
calculate the mean and standard deviation of each column.


This is the transformation:
x_scaled = (x - mean)/std

This always works, regardless of whether the data is normally distributed or not.
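
The two steps above can be sketched by hand and compared with `StandardScaler`. This is a minimal illustration on a small made-up matrix (`X_demo`, `scaler_demo` and the other names are hypothetical, not part of the lesson data).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small made-up feature matrix: two columns on very different scales.
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0],
                   [6.0, 900.0]])

# "Fit": compute the column means and standard deviations.
# StandardScaler uses the population standard deviation (ddof=0).
mean = X_demo.mean(axis=0)
std = X_demo.std(axis=0, ddof=0)

# "Transform": subtract the mean and divide by the standard deviation.
X_manual = (X_demo - mean) / std

# The same two steps done by scikit-learn.
scaler_demo = StandardScaler().fit(X_demo)
X_sklearn = scaler_demo.transform(X_demo)

print(np.allclose(X_manual, X_sklearn))
print(scaler_demo.mean_, scaler_demo.scale_)
```

The `ddof=0` detail likely explains the `std` row of `1.000024` in the `describe()` output above: pandas reports the sample standard deviation (`ddof=1`), which is slightly larger than the population value the scaler divides by.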

### For anything statistical (such as inference and testing), you should check normality.

In [13]:
model_sk = LinearRegression()
model_sk.fit(X_scaled, y)

print("Intercept:", model_sk.intercept_)
print("Coefficients:", model_sk.coef_)

Intercept: 2.068558169089147
Coefficients: [ 1.01830781 -0.52249747  0.46954581  0.02092622  0.20571319]
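
The output above has a simple reading: after standardizing, the intercept is the mean of `y`, and each coefficient is the raw coefficient multiplied by that feature's standard deviation. The sketch below checks this relationship on synthetic data (the data and all names are assumptions for illustration, not the housing data).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data with two features on different scales.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[5.0, 1000.0], scale=[2.0, 300.0], size=(500, 2))
y_demo = 1.0 + 0.5 * X_demo[:, 0] + 0.001 * X_demo[:, 1] + rng.normal(scale=0.1, size=500)

# Fit once on the raw features and once on the standardized features.
raw = LinearRegression().fit(X_demo, y_demo)
scaler_demo = StandardScaler().fit(X_demo)
scaled = LinearRegression().fit(scaler_demo.transform(X_demo), y_demo)

# Coefficient on scaled data = raw coefficient * feature standard deviation.
print(np.allclose(scaled.coef_, raw.coef_ * scaler_demo.scale_))
# Intercept on scaled data = mean of y (all scaled features have mean zero).
print(np.isclose(scaled.intercept_, y_demo.mean()))
```

Scaling therefore does not change the predictions of a plain linear regression; it only changes the units in which the coefficients are expressed, which makes them easier to compare across features.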


## Train Test Splits

In the absence of "new" data we can simulate the process by splitting the data set ourselves and calling one part "training" and the other "test" data.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X.shape)
print(X_train.shape)
print(X_test.shape)

# Pipelines

## Putting it "all" together

We should apply most preprocessing steps to both training and test data.
That is easier said than done, because

(i) we need to apply the identical transformation to both parts, and
(ii) we need to avoid **data leakage**: every parameter (means, standard deviations, imputation values) must be learned from the training data only.

Imagine the bookkeeping overhead of manually storing all the parameters from scaling, mean/median imputation, and so on.

That is where **pipelines** come in and make life much easier.
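
To make the bookkeeping concrete, here is a hedged sketch that does the scaling by hand (fit on the training part, reuse the stored parameters on the test part) and then lets a pipeline do the same thing. It uses synthetic stand-in data; `X_demo`, `scaler_demo`, `pipe_demo` and the other names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the housing data).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = X_demo @ np.array([0.5, -1.0, 2.0]) + rng.normal(size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Manual route: learn the scaling parameters from the training part only ...
scaler_demo = StandardScaler().fit(X_tr)
model_demo = LinearRegression().fit(scaler_demo.transform(X_tr), y_tr)
# ... and reuse those stored parameters on the test part (no refitting).
manual_score = model_demo.score(scaler_demo.transform(X_te), y_te)

# Pipeline route: the same bookkeeping is handled internally.
pipe_demo = Pipeline([("scale", StandardScaler()), ("model", LinearRegression())])
pipe_demo.fit(X_tr, y_tr)
pipe_score = pipe_demo.score(X_te, y_te)

print(manual_score, pipe_score)
```

Both routes should give the same test score. The leaky variant to avoid would be calling `fit` (or `fit_transform`) on the full data or on the test part.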



In [15]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

pipe = Pipeline([
    ("scale", StandardScaler()),
    ('model', LinearRegression())
])

pipe.fit(X_train, y_train)

### Score the Pipeline

The code below looks simple and innocent, but pause and think about everything that is going on here!

In [16]:
test_score = pipe.score(X_test, y_test)
test_score # default r_squared

0.5089947802907766
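
One way to see what a single `score` call does is to unpack it into its steps. The sketch below does this on synthetic data (all names and the data are assumptions for illustration): the test data is scaled with the parameters stored during `fit`, the regression predicts, and the predictions are compared with the truth via R².

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration).
rng = np.random.default_rng(1)
X_tr = rng.normal(size=(100, 2))
y_tr = 3.0 * X_tr[:, 0] - X_tr[:, 1] + rng.normal(scale=0.5, size=100)
X_te = rng.normal(size=(30, 2))
y_te = 3.0 * X_te[:, 0] - X_te[:, 1] + rng.normal(scale=0.5, size=30)

pipe_demo = Pipeline([("scale", StandardScaler()), ("model", LinearRegression())])
pipe_demo.fit(X_tr, y_tr)

# Step 1: scale the test data with the mean/std stored during fit.
X_te_scaled = pipe_demo.named_steps["scale"].transform(X_te)
# Step 2: predict with the fitted regression.
y_hat = pipe_demo.named_steps["model"].predict(X_te_scaled)
# Step 3: compare predictions with the true values via R^2.
manual_r2 = r2_score(y_te, y_hat)

print(np.isclose(manual_r2, pipe_demo.score(X_te, y_te)))
```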

## Missing values


In [18]:
# Set specific rows to missing

#Example: First 50 rows:

df_sub.loc[:49, "HouseAge"] = np.nan # make first fifty rows nan

X = df_sub[['MedInc', 'AveRooms', 'AveBedrms', 'Population', 'HouseAge']]
y = df_sub['MedHouseVal']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model_sk = LinearRegression()
model_sk.fit(X_train, y_train)

ValueError: Input X contains NaN.
LinearRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

## Full Pipeline

Including an imputer!

In [19]:
from sklearn.impute import SimpleImputer

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")), # impute = replace 
    ("scale", StandardScaler()),
    ("model", LinearRegression())
])

pipe.fit(X_train, y_train)

In [20]:
test_score = pipe.score(X_test, y_test)
test_score

0.5089722962899943

In [21]:
train_score = pipe.score(X_train, y_train)
train_score

0.5436563358368143

## OneHot Encoder

In [84]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.preprocessing import OneHotEncoder

penguins = sns.load_dataset("penguins").dropna()
penguins 

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,Male
...,...,...,...,...,...,...,...
338,Gentoo,Biscoe,47.2,13.7,214.0,4925.0,Female
340,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,Female
341,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,Male
342,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,Female


In [64]:
#train/test split 80/20

from sklearn.model_selection import train_test_split

#penguins = penguins.dropna()

# Features: everything except the target
X = penguins.drop(columns="body_mass_g")
# Target
y = penguins.body_mass_g

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,sex
230,Gentoo,Biscoe,40.9,13.7,214.0,Female
84,Adelie,Dream,37.3,17.8,191.0,Female
303,Gentoo,Biscoe,50.0,15.9,224.0,Male
22,Adelie,Biscoe,35.9,19.2,189.0,Female
29,Adelie,Biscoe,40.5,18.9,180.0,Male
...,...,...,...,...,...,...
194,Chinstrap,Dream,50.9,19.1,196.0,Male
77,Adelie,Torgersen,37.2,19.4,184.0,Male
112,Adelie,Biscoe,39.7,17.7,193.0,Female
277,Gentoo,Biscoe,45.5,15.0,220.0,Male


In [65]:
from sklearn.compose import ColumnTransformer

categorical = X_train.select_dtypes(include=['object']).columns
numeric = X_train.select_dtypes(include=['number']).columns

categorical
numeric

Index(['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm'], dtype='object')

In [90]:
preprocessor = ColumnTransformer([
    # drop='first': the first category of each categorical variable is dropped after one-hot encoding
    ('cat', OneHotEncoder(handle_unknown='ignore', drop='first'), categorical), 
    ('num', StandardScaler(), numeric)
])

preprocessor

In [91]:
pipe = Pipeline([
    #("impute", SimpleImputer(strategy="median")), # impute = replace 
    ('preprocess', preprocessor),
    ('model', LinearRegression())
])

pipe

How do you judge the quality of a regression?

In [92]:
pipe.fit(X_train, y_train)

In [77]:
pipe.score(X_test, y_test)

0.8961688345769455

# The predictors explain 89.6% of the variation of body mass around its mean (on the test set)

In [93]:
pipe = Pipeline([
    #("impute", SimpleImputer(strategy="median")),
    ('preprocess', preprocessor),
    ('model', LinearRegression())
])

X_trans = pipe[:-1].transform(X_train)
pd.DataFrame(X_trans)

Unnamed: 0,0,1,2,3,4,5,6,7
0,0.0,1.0,0.0,0.0,0.0,-0.593727,-1.750939,0.935943
1,0.0,0.0,1.0,0.0,0.0,-1.261043,0.323107,-0.719956
2,0.0,1.0,0.0,0.0,1.0,1.093099,-0.638036,1.655899
3,0.0,0.0,0.0,0.0,0.0,-1.520555,1.031318,-0.863947
4,0.0,0.0,0.0,0.0,1.0,-0.667873,0.879558,-1.511908
...,...,...,...,...,...,...,...,...
261,1.0,0.0,1.0,0.0,1.0,1.259928,0.980731,-0.359978
262,0.0,0.0,0.0,1.0,1.0,-1.279579,1.132491,-1.223925
263,0.0,0.0,0.0,0.0,0.0,-0.816166,0.272520,-0.575965
264,0.0,1.0,0.0,0.0,1.0,0.258954,-1.093315,1.367916


In [70]:

pipe = Pipeline([
    #("impute", SimpleImputer(strategy="median")),
    ('preprocess', preprocessor),
    ('model', LinearRegression())
])

X_trans = pipe[:-1].transform(X_train)
pd.DataFrame(X_trans)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,-0.593727,-1.750939,0.935943
1,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,-1.261043,0.323107,-0.719956
2,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.093099,-0.638036,1.655899
3,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,-1.520555,1.031318,-0.863947
4,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,-0.667873,0.879558,-1.511908
...,...,...,...,...,...,...,...,...,...,...,...
261,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.259928,0.980731,-0.359978
262,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,-1.279579,1.132491,-1.223925
263,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,-0.816166,0.272520,-0.575965
264,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.258954,-1.093315,1.367916


# Task: fit a linear regression without a pipe

In [87]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X_train_oh = pd.get_dummies(X_train)
X_train_oh



Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,species_Adelie,species_Chinstrap,species_Gentoo,island_Biscoe,island_Dream,island_Torgersen,sex_Female,sex_Male
230,40.9,13.7,214.0,False,False,True,True,False,False,True,False
84,37.3,17.8,191.0,True,False,False,False,True,False,True,False
303,50.0,15.9,224.0,False,False,True,True,False,False,False,True
22,35.9,19.2,189.0,True,False,False,True,False,False,True,False
29,40.5,18.9,180.0,True,False,False,True,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...
194,50.9,19.1,196.0,False,True,False,False,True,False,False,True
77,37.2,19.4,184.0,True,False,False,False,False,True,False,True
112,39.7,17.7,193.0,True,False,False,True,False,False,True,False
277,45.5,15.0,220.0,False,False,True,True,False,False,False,True


In [88]:
model_sk = LinearRegression()

model_sk.fit(X_train_oh, y_train)

print("Intercept:", model_sk.intercept_)
print("Coefficients:", model_sk.coef_)

Intercept: -846.0265146981046
Coefficients: [  17.14973198   66.91629434   15.30729934 -266.29400401 -514.5295567
  780.82356071   13.14977171   26.0672393   -39.21701101 -195.50683155
  195.50683155]


In [106]:
import statsmodels.api as sm
import statsmodels.formula.api as smf



0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-1500.0291,575.822,-2.605,0.010,-2632.852,-367.207
species[T.Chinstrap],-260.3063,88.551,-2.940,0.004,-434.513,-86.100
species[T.Gentoo],987.7614,137.238,7.197,0.000,717.771,1257.752
island[T.Dream],-13.1031,58.541,-0.224,0.823,-128.271,102.065
island[T.Torgersen],-48.0636,60.922,-0.789,0.431,-167.915,71.788
sex[T.Male],387.2243,48.138,8.044,0.000,292.521,481.927
bill_length_mm,18.1893,7.136,2.549,0.011,4.150,32.229
bill_depth_mm,67.5754,19.821,3.409,0.001,28.581,106.570
flipper_length_mm,16.2385,2.939,5.524,0.000,10.456,22.021


# Diamonds 
1. Fit a linear regression (price as outcome) including all columns!
2. Get the R^2 on the test data and compare it to the training R^2.
3. Which features seem important?
4. What about x, y, z? Does it make sense to include them in a linear fashion?

In [94]:
import statsmodels.api as sm
df = sm.datasets.get_rdataset("diamonds", "ggplot2").data
df

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.20,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75
...,...,...,...,...,...,...,...,...,...,...
53935,0.72,Ideal,D,SI1,60.8,57.0,2757,5.75,5.76,3.50
53936,0.72,Good,D,SI1,63.1,55.0,2757,5.69,5.75,3.61
53937,0.70,Very Good,D,SI1,62.8,60.0,2757,5.66,5.68,3.56
53938,0.86,Premium,H,SI2,61.0,58.0,2757,6.15,6.12,3.74


In [119]:
from sklearn.model_selection import train_test_split

df = pd.get_dummies(df)  # one-hot encode cut, color, clarity

# Features: everything except the target
X = df.drop(columns="price")
# Target
y = df.price


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [120]:
df.columns

Index(['carat', 'depth', 'table', 'price', 'x', 'y', 'z', 'cut_Fair',
       'cut_Good', 'cut_Ideal', 'cut_Premium', 'cut_Very Good', 'color_D',
       'color_E', 'color_F', 'color_G', 'color_H', 'color_I', 'color_J',
       'clarity_I1', 'clarity_IF', 'clarity_SI1', 'clarity_SI2', 'clarity_VS1',
       'clarity_VS2', 'clarity_VVS1', 'clarity_VVS2'],
      dtype='object')

In [112]:
from sklearn.compose import ColumnTransformer

categorical = X_train.select_dtypes(include=['object']).columns
numeric = X_train.select_dtypes(include=['number']).columns

In [113]:
preprocessor = ColumnTransformer([
    # drop='first': the first category of each categorical variable is dropped after one-hot encoding
    ('cat', OneHotEncoder(handle_unknown='ignore', drop='first'), categorical), 
    ('num', StandardScaler(), numeric)
])

In [121]:
# Instantiate and fit the model
model_sk = LinearRegression()
model_sk.fit(X_train, y_train)  # fit on the training data only (dummy-encoded via pd.get_dummies above)

# Generate predictions
y_pred = model_sk.predict(X_test)

# Evaluate
print("R² Score: ",r2_score(y_test, y_pred))

pipe = Pipeline([
    #("impute", SimpleImputer(strategy="median")), # impute = replace 
    ('preprocess', preprocessor),
    ('model', LinearRegression())
])

R² Score:  0.9189331350419373
