# Sklearn pipelines


### Learning Objectives
After completing this lesson, you will:
- understand the need for preprocessing data in general
- recognize pitfalls and "danger" in the process
- appreciate the structured and systematic approach offered by **pipelines**
  

## 1. Multiple linear regression (MLR) on housing data

We will use the **California Housing Dataset** from `sklearn.datasets`.  
It contains data on housing prices and district-level demographics in California.


In [5]:
pip install scikit-learn

Collecting scikit-learn
  Downloading scikit_learn-1.6.1-cp39-cp39-macosx_12_0_arm64.whl.metadata (31 kB)
Collecting joblib>=1.2.0 (from scikit-learn)
  Downloading joblib-1.5.2-py3-none-any.whl.metadata (5.6 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn)
  Downloading threadpoolctl-3.6.0-py3-none-any.whl.metadata (13 kB)
Downloading scikit_learn-1.6.1-cp39-cp39-macosx_12_0_arm64.whl (11.1 MB)
Downloading joblib-1.5.2-py3-none-any.whl (308 kB)
Downloading threadpoolctl-3.6.0-py3-none-any.whl (18 kB)
Installing collected packages: threadpoolctl, joblib, scikit-learn
Successfully installed joblib-1.5.2 scikit-learn-1.6.1 threadpoolctl-3.6.0
Note: you may need to restart the kernel to use updated packages.


In [9]:
from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np

data = fetch_california_housing(as_frame=True)
df = data.frame
df.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


The variable **`MedHouseVal`** is the **median house value** in $100,000s.  
We will predict it using a few numeric predictors.


#### Select a Subset of Variables

Use only the following features:

| Feature | Description |
|----------|--------------|
| MedInc | Median income in block group |
| AveRooms | Average number of rooms per household |
| AveBedrms | Average number of bedrooms per household |
| Population | Block group population |
| HouseAge | Median age of houses in the block group |


In [10]:
cols = ['MedHouseVal', 'MedInc', 'AveRooms', 'AveBedrms', 'Population', 'HouseAge']
df_sub = df[cols].copy()  # explicit copy, so later in-place edits do not trigger SettingWithCopyWarning
df_sub.head()

Unnamed: 0,MedHouseVal,MedInc,AveRooms,AveBedrms,Population,HouseAge
0,4.526,8.3252,6.984127,1.02381,322.0,41.0
1,3.585,8.3014,6.238137,0.97188,2401.0,21.0
2,3.521,7.2574,8.288136,1.073446,496.0,52.0
3,3.413,5.6431,5.817352,1.073059,558.0,52.0
4,3.422,3.8462,6.281853,1.081081,565.0,52.0


#### Multiple Regression using **statsmodels**
Use `statsmodels.api.OLS` to fit a multiple regression model.


In [11]:
import statsmodels.api as sm

X = df_sub[['MedInc', 'AveRooms', 'AveBedrms', 'Population', 'HouseAge']]
y = df_sub['MedHouseVal']

X = sm.add_constant(X)
model_sm = sm.OLS(y, X).fit()
print(model_sm.summary())

                            OLS Regression Results                            
Dep. Variable:            MedHouseVal   R-squared:                       0.538
Model:                            OLS   Adj. R-squared:                  0.538
Method:                 Least Squares   F-statistic:                     4801.
Date:                Mon, 17 Nov 2025   Prob (F-statistic):               0.00
Time:                        17:08:23   Log-Likelihood:                -24278.
No. Observations:               20640   AIC:                         4.857e+04
Df Residuals:                   20634   BIC:                         4.862e+04
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.4407      0.028    -15.945      0.0

**R-squared:** taken together, our 5 predictors explain 53.8% of the variance in `MedHouseVal`.

## Predictions

In [12]:
# New data: an "average" block group (every predictor set to its mean)
X_new = X.mean()
print(pd.DataFrame(X_new))

# Predict
pred = model_sm.predict(X_new)
print(pred)
beta = model_sm.params
print(X_new @ beta)

                      0
const          1.000000
MedInc         3.870671
AveRooms       5.429000
AveBedrms      1.096675
Population  1425.476744
HouseAge      28.639486
None    2.068558
dtype: float64
2.068558169089079


In [13]:
def print_dot_product(beta, x):
    """Print the prediction as an explicit sum of coefficient*value terms."""
    terms = [f"{b:.3f}*{v:.3f}" for b, v in zip(beta, x)]
    print(" + ".join(terms))

print_dot_product(beta, X_new)

-0.441*1.000 + 0.536*3.871 + -0.211*5.429 + 0.991*1.097 + 0.000*1425.477 + 0.016*28.639


## scikit-learn

From now on we will switch almost entirely to the **sklearn** library!


Let us fit the same model using `LinearRegression`.


In [14]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X = df_sub[['MedInc', 'AveRooms', 'AveBedrms', 'Population', 'HouseAge']]
y = df_sub['MedHouseVal']

model_sk = LinearRegression()
model_sk.fit(X, y)

print("Intercept:", model_sk.intercept_)
print("Coefficients:", model_sk.coef_)

Intercept: -0.4407217437469195
Coefficients: [ 5.36014757e-01 -2.11185756e-01  9.90813314e-01  1.84789639e-05
  1.63455751e-02]


In [15]:
# Predictions
X_new_sk = X.mean().to_frame().T ## convert Series → DataFrame with one row
#print(X_new_sk)
pred = model_sk.predict(X_new_sk)
print(pred)

[2.06855817]


In [16]:
# Compute R²
y_pred = model_sk.predict(X)
r2_score(y, y_pred)

0.5377839208402417
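
As a sanity check, R² can be reproduced by hand from its definition, 1 - SS_res / SS_tot. The sketch below uses a small synthetic data set (the names `X_demo`, `y_demo`, and `model_demo` are made up for illustration and are not the housing data) and compares the manual value with `r2_score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data for illustration only (not the housing data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(size=200)

model_demo = LinearRegression().fit(X_demo, y_demo)
y_hat = model_demo.predict(X_demo)

ss_res = np.sum((y_demo - y_hat) ** 2)          # variation left unexplained by the model
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

# Both numbers should agree up to floating-point error
print(r2_manual, r2_score(y_demo, y_hat))
```
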

## Data Scaling

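Many ML methods (for example regularized regression, k-nearest neighbors, and anything trained by gradient descent) are sensitive to the scale of the features. The summary below shows how different the scales are: compare `Population` with `AveBedrms`.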


In [17]:
df_sub.describe()

Unnamed: 0,MedHouseVal,MedInc,AveRooms,AveBedrms,Population,HouseAge
count,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0
mean,2.068558,3.870671,5.429,1.096675,1425.476744,28.639486
std,1.153956,1.899822,2.474173,0.473911,1132.462122,12.585558
min,0.14999,0.4999,0.846154,0.333333,3.0,1.0
25%,1.196,2.5634,4.440716,1.006079,787.0,18.0
50%,1.797,3.5348,5.229129,1.04878,1166.0,29.0
75%,2.64725,4.74325,6.052381,1.099526,1725.0,37.0
max,5.00001,15.0001,141.909091,34.066667,35682.0,52.0


In [18]:
from sklearn.preprocessing import StandardScaler  # MinMaxScaler is an alternative

# Manual scaling, without a pipeline
# Initialize the scaler
scaler = StandardScaler()

# fit_transform does two steps: (1) fit = learn mean/std, (2) transform = apply them
X_scaled = scaler.fit_transform(X)

# Convert back to a DataFrame with the same column names
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)

X_scaled.describe()

Unnamed: 0,MedInc,AveRooms,AveBedrms,Population,HouseAge
count,20640.0,20640.0,20640.0,20640.0,20640.0
mean,6.6097e-17,6.6097e-17,-1.060306e-16,-1.101617e-17,5.508083e-18
std,1.000024,1.000024,1.000024,1.000024,1.000024
min,-1.774299,-1.852319,-1.610768,-1.256123,-2.19618
25%,-0.6881186,-0.3994496,-0.1911716,-0.5638089,-0.8453931
50%,-0.1767951,-0.08078489,-0.101065,-0.2291318,0.02864572
75%,0.4593063,0.2519615,0.006015869,0.2644949,0.6643103
max,5.858286,55.16324,69.57171,30.25033,1.856182


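### How do I "scale" my data?

This is the fit: calculate the mean and standard deviation of each column.

This is the transformation: `x_scaled = (x - mean) / std`

This always works, regardless of whether my data is normally distributed or not. Standardizing only shifts and rescales each column; it does not change the shape of its distribution.

For anything statistical (such as inference and testing), you should still check normality.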
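
The fit and transform steps described above can be reproduced by hand. This is a minimal sketch on a small made-up table (`X_demo` is hypothetical, not the housing data); it assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Small made-up table for illustration (not the housing data)
X_demo = pd.DataFrame({"a": [1.0, 2.0, 3.0, 6.0],
                       "b": [100.0, 300.0, 500.0, 700.0]})

# "Fit": estimate mean and standard deviation per column.
# StandardScaler uses the population formula (ddof=0).
mean = X_demo.mean()
std = X_demo.std(ddof=0)

# "Transform": centre and rescale each column
X_manual = (X_demo - mean) / std

# The same two steps via scikit-learn
scaler_demo = StandardScaler().fit(X_demo)
X_sklearn = scaler_demo.transform(X_demo)

print(np.allclose(X_manual.to_numpy(), X_sklearn))  # manual and sklearn results agree
print(scaler_demo.mean_, scaler_demo.scale_)        # the stored "fit" parameters
```

The `ddof` detail also explains the `std` row of 1.000024 in the `describe()` output above: pandas reports the sample standard deviation (`ddof=1`), which for 20640 rows is larger than the population value by a factor of sqrt(20640/20639) ≈ 1.000024.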

In [21]:
model_sk = LinearRegression()
model_sk.fit(X_scaled, y)

print("Intercept:", model_sk.intercept_)
print("Coefficients:", model_sk.coef_)

Intercept: 2.068558169089147
Coefficients: [ 1.01830781 -0.52249747  0.46954581  0.02092622  0.20571319]
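
Scaling changes the coefficients but not the fitted model: each scaled coefficient is the raw coefficient multiplied by that column's standard deviation, and the intercept becomes the mean of `y` (2.068558 above, the mean of `MedHouseVal`). The sketch below checks this relationship on synthetic data; all names ending in `_demo` and the models `raw` and `scaled` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data with two features on very different scales
rng = np.random.default_rng(1)
X_demo = rng.normal(loc=[5.0, 1000.0], scale=[2.0, 300.0], size=(500, 2))
y_demo = 3.0 + 0.8 * X_demo[:, 0] + 0.002 * X_demo[:, 1] + rng.normal(size=500)

scaler_demo = StandardScaler().fit(X_demo)
X_demo_scaled = scaler_demo.transform(X_demo)

raw = LinearRegression().fit(X_demo, y_demo)            # unscaled features
scaled = LinearRegression().fit(X_demo_scaled, y_demo)  # standardized features

# Scaled coefficient = raw coefficient * column standard deviation
print(np.allclose(scaled.coef_, raw.coef_ * scaler_demo.scale_))
# With centred features, the intercept is the mean of y
print(np.isclose(scaled.intercept_, y_demo.mean()))
# Predictions (and therefore R²) are unchanged
print(np.allclose(raw.predict(X_demo), scaled.predict(X_demo_scaled)))
```

A scaled coefficient can be read as the change in the prediction per one standard deviation of the feature, which makes the coefficients comparable across features.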


## Train Test Splits

In the absence of "new" data we can simulate the process by splitting the data set ourselves and calling one part "training" and the other "test" data.

In [22]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X.shape)
print(X_train.shape)
print(X_test.shape)

(20640, 5)
(16512, 5)
(4128, 5)


# Pipelines

## Putting it "all" together

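We should apply most preprocessing steps to both training and test data.
That is easier said than done, because

(i) we need to apply the identical transformation to both parts, and  
(ii) we need to avoid **data leakage**: every parameter of a preprocessing step (means, standard deviations, imputation values, ...) must be estimated from the training data only and then reused unchanged on the test data.

Imagine the bookkeeping overhead of manually having to store all the parameters from scaling, mean/median imputation, and so on.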

That is where **pipelines** come in and make life much easier.
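
To make the leakage point concrete, here is a small sketch on synthetic data (all names such as `X_demo`, `scaler_leaky`, and `pipe_demo` are hypothetical). A scaler fitted on the full data set has already "seen" the test rows, whereas a pipeline fitted on the training split stores training-only statistics.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only
rng = np.random.default_rng(2)
X_demo = rng.normal(loc=10.0, scale=3.0, size=(300, 2))
y_demo = 1.5 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(size=300)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

scaler_leaky = StandardScaler().fit(X_demo)  # uses test rows too: data leakage
scaler_clean = StandardScaler().fit(Xtr)     # training rows only

pipe_demo = Pipeline([
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])
pipe_demo.fit(Xtr, ytr)  # every step is fitted on the training split only

pipe_means = pipe_demo.named_steps["scale"].mean_
print("pipeline:", pipe_means)
print("clean:   ", scaler_clean.mean_)   # identical to the pipeline's means
print("leaky:   ", scaler_leaky.mean_)   # slightly different, contaminated by test rows
```
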



In [23]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LinearRegression())
])

pipe.fit(X_train, y_train)

### Score the Pipeline

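The code below looks simple and innocent, but pause and think about everything that is going on here: the test data is scaled using the mean and standard deviation learned from the training data, the fitted model predicts on the scaled test data, and R² is computed from those predictions.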

In [24]:
test_score = pipe.score(X_test, y_test)
test_score  # score() returns R² by default for regressors

0.5089947802907764
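
The steps hidden inside `score` can be spelled out explicitly. The following sketch, on synthetic data with hypothetical names (`X_fit_demo`, `X_new_demo`, `pipe_demo`), unpacks a fitted pipeline via `named_steps` and reproduces the score by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic "training" and "new" data for illustration only
rng = np.random.default_rng(3)
true_beta = np.array([1.0, -2.0, 0.5])
X_fit_demo = rng.normal(size=(200, 3))
y_fit_demo = X_fit_demo @ true_beta + rng.normal(size=200)
X_new_demo = rng.normal(size=(50, 3))
y_new_demo = X_new_demo @ true_beta + rng.normal(size=50)

pipe_demo = Pipeline([
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])
pipe_demo.fit(X_fit_demo, y_fit_demo)

# What pipe_demo.score(X_new_demo, y_new_demo) does, step by step:
X_new_scaled = pipe_demo.named_steps["scale"].transform(X_new_demo)  # transform only, no refit
y_new_pred = pipe_demo.named_steps["model"].predict(X_new_scaled)    # predict with the fitted model
step_by_step = r2_score(y_new_demo, y_new_pred)                      # compute R²

print(step_by_step, pipe_demo.score(X_new_demo, y_new_demo))
```
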

## Missing values


In [None]:
# Set specific rows to missing.
# Example: the first 50 rows (.loc slicing is label-based, so the end label 49 is included)
df_sub.loc[:49, "HouseAge"] = np.nan

X = df_sub[['MedInc', 'AveRooms', 'AveBedrms', 'Population', 'HouseAge']]
y = df_sub['MedHouseVal']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model_sk = LinearRegression()
model_sk.fit(X_train, y_train)  # raises ValueError: LinearRegression cannot handle NaN

ValueError: Input X contains NaN.
LinearRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

## Full Pipeline

Including an imputer!

In [25]:
from sklearn.impute import SimpleImputer

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # impute = replace missing values, here with the column median
    ("scale", StandardScaler()),
    ("model", LinearRegression())
])

pipe.fit(X_train, y_train)
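
The imputer follows the same fit/transform logic as the scaler: the medians are learned from the training data and reused on any later data. A minimal sketch with tiny made-up arrays (`X_tr_demo` and `X_te_demo` are hypothetical):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Tiny made-up training and test arrays with missing values
X_tr_demo = np.array([[1.0, 10.0],
                      [2.0, np.nan],
                      [3.0, 30.0],
                      [100.0, 40.0]])
X_te_demo = np.array([[np.nan, 20.0],
                      [5.0, np.nan]])

# Fit: learn one median per column from the training data (NaNs are ignored)
imputer_demo = SimpleImputer(strategy="median").fit(X_tr_demo)
print(imputer_demo.statistics_)  # column medians: 2.5 and 30.0

# Transform: fill gaps in the test data with the *training* medians
X_te_filled = imputer_demo.transform(X_te_demo)
print(X_te_filled)  # the NaNs become 2.5 (first column) and 30.0 (second column)
```

The median is robust to the outlier 100.0 in the first column; the mean (26.5) would not be.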

In [26]:
test_score = pipe.score(X_test, y_test)
test_score

0.5089947802907764

In [27]:
train_score = pipe.score(X_train, y_train)
train_score

0.5437938436867445