![](../img/330-banner.png)

# Tutorial 3

UBC 2024-25

## Outline

During this tutorial, we will focus on preprocessing: the steps needed to make the data meaningful for a learning algorithm.

All questions can be discussed with your classmates and the TAs - this is not a graded exercise!

In [1]:
import os
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from IPython.display import HTML

sys.path.append("../code/.")
from plotting_functions import *
from utils import *

pd.set_option("display.max_colwidth", 200)

from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

## `ColumnTransformer` on the California housing dataset 

In this notebook, you will practice feature preprocessing using the [California housing dataset](https://www.kaggle.com/datasets/camnugent/california-housing-prices).

Let's start by loading the dataset (this is done for you):

In [2]:
housing_df = pd.read_csv("../data/housing.csv")
train_df, test_df = train_test_split(housing_df, test_size=0.1, random_state=123)

train_df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
6051,-117.75,34.04,22.0,2948.0,636.0,2600.0,602.0,3.125,113600.0,INLAND
20113,-119.57,37.94,17.0,346.0,130.0,51.0,20.0,3.4861,137500.0,INLAND
14289,-117.13,32.74,46.0,3355.0,768.0,1457.0,708.0,2.6604,170100.0,NEAR OCEAN
13665,-117.31,34.02,18.0,1634.0,274.0,899.0,285.0,5.2139,129300.0,INLAND
14471,-117.23,32.88,18.0,5566.0,1465.0,6303.0,1458.0,1.858,205000.0,NEAR OCEAN


Let's also add some new features that may help us with the prediction:

In [3]:
train_df = train_df.assign(
    rooms_per_household=train_df["total_rooms"] / train_df["households"]
)
test_df = test_df.assign(
    rooms_per_household=test_df["total_rooms"] / test_df["households"]
)

train_df = train_df.assign(
    bedrooms_per_household=train_df["total_bedrooms"] / train_df["households"]
)
test_df = test_df.assign(
    bedrooms_per_household=test_df["total_bedrooms"] / test_df["households"]
)

train_df = train_df.assign(
    population_per_household=train_df["population"] / train_df["households"]
)
test_df = test_df.assign(
    population_per_household=test_df["population"] / test_df["households"]
)
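
The three pairs of `assign` calls above repeat the same pattern for each split. Here is a minimal sketch of one way to factor that out. It assumes a hypothetical helper named `add_per_household_features` (not part of the tutorial code) and a tiny made-up dataframe in place of the housing data:

```python
import pandas as pd


def add_per_household_features(df):
    """Hypothetical helper: return a copy of df with three per-household ratios added."""
    return df.assign(
        rooms_per_household=df["total_rooms"] / df["households"],
        bedrooms_per_household=df["total_bedrooms"] / df["households"],
        population_per_household=df["population"] / df["households"],
    )


# Made-up rows for illustration only; one total_bedrooms value is missing.
toy_df = pd.DataFrame(
    {
        "total_rooms": [100.0, 300.0],
        "total_bedrooms": [20.0, None],
        "population": [50.0, 120.0],
        "households": [10.0, 30.0],
    }
)

toy_with_ratios = add_per_household_features(toy_df)
print(toy_with_ratios)
```

Each ratio is computed row by row, so applying the same function to the train and test splits shares no information between them. A missing `total_bedrooms` value simply propagates as a missing `bedrooms_per_household` value, which is left for the imputer to handle.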

Finally, we separate the target from the features for you:

In [8]:
# Let's keep both numeric and categorical columns in the data.
X_train = train_df.drop(columns=["median_house_value"])
y_train = train_df["median_house_value"]

X_test = test_df.drop(columns=["median_house_value"])
y_test = test_df["median_house_value"]

X_train.info()
X_train.columns

<class 'pandas.core.frame.DataFrame'>
Index: 18576 entries, 6051 to 19966
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   longitude                 18576 non-null  float64
 1   latitude                  18576 non-null  float64
 2   housing_median_age        18576 non-null  float64
 3   total_rooms               18576 non-null  float64
 4   total_bedrooms            18391 non-null  float64
 5   population                18576 non-null  float64
 6   households                18576 non-null  float64
 7   median_income             18576 non-null  float64
 8   ocean_proximity           18576 non-null  object 
 9   rooms_per_household       18576 non-null  float64
 10  bedrooms_per_household    18391 non-null  float64
 11  population_per_household  18576 non-null  float64
dtypes: float64(11), object(1)
memory usage: 1.8+ MB


Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'ocean_proximity', 'rooms_per_household', 'bedrooms_per_household',
       'population_per_household'],
      dtype='object')

## Step 1

Your turn now! Start by importing `ColumnTransformer` and `make_column_transformer`.

In [10]:
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_transformer



## Step 2

Next, group features by type (numerical or categorical). You may also want to save the target separately. 

In [11]:
numerical_feats = ['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income', 'rooms_per_household', 'bedrooms_per_household',
       'population_per_household']
categorical_feats = ['ocean_proximity']
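
Typing the column lists by hand works well for a dataset this small. As an alternative, here is a minimal sketch that groups columns by dtype using pandas' `select_dtypes`. The dataframe `toy_X` is made up for illustration and only mimics a few of the housing columns:

```python
import pandas as pd

# Made-up rows that mimic a few of the housing columns.
toy_X = pd.DataFrame(
    {
        "median_income": [3.1, 5.2, 1.9],
        "households": [602.0, 285.0, 1458.0],
        "ocean_proximity": ["INLAND", "INLAND", "NEAR OCEAN"],
    }
)

# Numeric dtypes go in one group, everything else in the other.
numeric_cols = toy_X.select_dtypes(include="number").columns.tolist()
categorical_cols = toy_X.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

Selecting by dtype is only a starting point: a categorical feature stored as integer codes would land in the numeric group, so it is still worth checking the lists by eye.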

## Step 3

Create a `ColumnTransformer` for your features. The transformer should include imputation and scaling for the numeric features, and encoding for the categorical features (which type of encoding?).

In [15]:
numerical_transformer = make_pipeline(
    SimpleImputer(strategy = 'median'),
    StandardScaler()
)
categorical_transformer = make_pipeline(
    SimpleImputer(strategy = 'most_frequent'),
    OneHotEncoder(handle_unknown = 'ignore'),
)

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_feats),
        ('cat', categorical_transformer, categorical_feats)
    ]
)
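
Step 1 also imported `make_column_transformer`, which builds the same kind of object without explicit step names. Below is a minimal sketch on a made-up two-column dataframe; `toy_X` and `toy_preprocessor` are illustrative names, not part of the tutorial code:

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data: one numeric column with a missing value, one categorical column.
toy_X = pd.DataFrame(
    {
        "median_income": [3.1, 5.2, np.nan, 2.7],
        "ocean_proximity": ["INLAND", "NEAR OCEAN", "INLAND", "NEAR BAY"],
    }
)

# Each tuple is (transformer, columns); step names are generated automatically.
toy_preprocessor = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy="median"), StandardScaler()), ["median_income"]),
    (OneHotEncoder(handle_unknown="ignore"), ["ocean_proximity"]),
)

toy_transformed = toy_preprocessor.fit_transform(toy_X)
step_names = list(toy_preprocessor.named_transformers_.keys())

print(step_names)
print(toy_transformed.shape)
```

The generated names are the lowercased class names of the transformers, so with this approach the one-hot encoder is looked up under `"onehotencoder"` rather than a custom name such as `"cat"`.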
                  

## Step 4

Visualize the transformed training set as a dataframe

In [23]:
X_train_transformed = preprocessor.fit_transform(X_train)
X_test_transformed = preprocessor.transform(X_test)
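# Why not fit_transform on X_test? Fitting learns statistics from the data (imputation medians,
# scaling means and standard deviations), much like training. Fitting on the test set would leak
# test information into preprocessing, so we only call transform on it.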

column_names = (
    numerical_feats   
    + preprocessor.named_transformers_["cat"].get_feature_names_out().tolist()
)

pd.DataFrame(X_train_transformed, columns=column_names)
pd.DataFrame(X_test_transformed, columns=column_names)

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,rooms_per_household,bedrooms_per_household,population_per_household,ocean_proximity_<1H OCEAN,ocean_proximity_INLAND,ocean_proximity_ISLAND,ocean_proximity_NEAR BAY,ocean_proximity_NEAR OCEAN
0,-1.537388,1.223664,0.903859,-0.303314,-0.302996,-0.341246,-0.266454,-0.037439,-0.185598,-0.138226,-0.044089,1.0,0.0,0.0,0.0,0.0
1,0.258000,0.216450,-1.558810,0.303085,0.073362,0.280669,0.177370,0.084097,0.150666,-0.200540,0.002682,0.0,1.0,0.0,0.0,0.0
2,1.293223,-1.301398,-1.320487,0.428215,0.418753,1.657643,0.433222,0.119237,-0.026154,-0.054688,0.193016,1.0,0.0,0.0,0.0,0.0
3,0.573068,-0.668962,-0.128873,0.259083,1.030930,1.729470,1.169449,-0.858602,-0.815796,-0.152577,0.053621,1.0,0.0,0.0,0.0,0.0
4,0.548062,-0.757971,0.983300,0.031283,0.047160,-0.334239,-0.057596,0.295147,0.091884,0.141037,-0.086020,1.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2059,1.258216,-1.352930,0.427213,-0.209810,-0.195805,-0.206352,-0.141139,-0.364953,-0.216005,-0.158834,-0.037669,0.0,0.0,0.0,0.0,1.0
2060,0.458043,-0.664277,-1.876574,1.280291,0.304417,0.776450,0.488048,3.056345,0.985733,-0.266725,0.031672,1.0,0.0,0.0,0.0,0.0
2061,-1.377354,0.909788,1.857150,-0.979383,-0.927083,-0.972797,-0.926970,-0.459276,-0.790004,-0.157780,-0.085748,0.0,0.0,0.0,0.0,1.0
2062,0.983156,-0.772025,-0.208314,-0.646619,-0.805600,-0.623298,-0.772937,-0.599307,0.230405,-0.251196,0.046141,0.0,1.0,0.0,0.0,0.0
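
To make the `fit_transform` versus `transform` distinction concrete, here is a minimal sketch with a single made-up column. The arrays `toy_train` and `toy_test` are illustrative and unrelated to the housing data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up one-column data sets, each containing a missing value.
toy_train = np.array([[1.0], [2.0], [3.0], [np.nan]])
toy_test = np.array([[np.nan], [100.0]])

imputer = SimpleImputer(strategy="median")
imputer.fit(toy_train)  # the median is learned from the training rows only
learned_median = imputer.statistics_[0]

# transform reuses the training median; the test value 100.0 has no influence on it.
toy_test_imputed = imputer.transform(toy_test)

print(learned_median)
print(toy_test_imputed)
```

The missing test value is filled with the training median, which is the behaviour we want: the test set is treated as unseen data at every stage, including preprocessing.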


## Step 5

Finally, let's train a model (or even better, for practice, a baseline and another model). Since the target `median_house_value` is continuous, this is a regression problem:
- create a pipeline with the preprocessor and a regressor of your choice.
- use the pipeline to perform cross-validation.

In [30]:


from sklearn.tree import DecisionTreeRegressor  # not imported at the top of the notebook

pipe = make_pipeline(preprocessor, DecisionTreeRegressor(max_depth=7, random_state=123))
scores = cross_validate(pipe, X_train, y_train, return_train_score=True)
pd.DataFrame(scores)

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.093672,0.002757,0.708943,0.75247
1,0.070735,0.002163,0.729328,0.744497
2,0.06987,0.002084,0.723684,0.747225
3,0.070127,0.002187,0.68677,0.756781
4,0.069681,0.002332,0.694079,0.753856
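
The step above also suggests trying a baseline. Here is a minimal sketch of cross-validating a `DummyRegressor` on synthetic data. The arrays `toy_X` and `toy_y` are randomly generated for illustration and are not the housing data:

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_validate

# Synthetic regression data: the target depends linearly on the first feature.
rng = np.random.default_rng(123)
toy_X = rng.normal(size=(100, 2))
toy_y = 3.0 * toy_X[:, 0] + rng.normal(scale=0.1, size=100)

# The baseline always predicts the mean of the training targets.
baseline = DummyRegressor(strategy="mean")
baseline_scores = pd.DataFrame(
    cross_validate(baseline, toy_X, toy_y, cv=5, return_train_score=True)
)

print(baseline_scores[["test_score", "train_score"]])
```

For regressors, the default score reported by `cross_validate` is R². A mean-predicting baseline scores 0 on the data it was fit on and at most 0 on held-out folds, which gives a reference point for reading the decision tree scores above.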


## <font color='red'>Recap/comprehension questions</font>

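- If we only plan to use a decision tree as our model, do we still need to scale the numerical features?
  > No. A decision tree splits on one feature at a time using thresholds, so its predictions do not depend on the scale of the features.
- If the dataset included an ordinal feature "Neighbourhood desirability", with numerical labels 1 (poor), 2 (good) and 3 (excellent), would we need to apply an ordinal encoder to it?
  > No. The labels are already numeric and their order (1 < 2 < 3) matches the desirability ranking, so the column can be treated as a numeric feature. An ordinal encoder, with the category order specified explicitly, is needed when the levels are stored as strings such as "poor", "good" and "excellent".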
- Why do we add the argument `drop="if_binary"` to `OneHotEncoder` when dealing with categorical features with only two possible values? What would be the disadvantages of not doing so?
  > 
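
As a starting point for the last question, here is a minimal sketch comparing the default `OneHotEncoder` with `drop="if_binary"` on a made-up two-valued feature. The array `toy_binary` is illustrative only:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up categorical feature with exactly two possible values.
toy_binary = np.array([["yes"], ["no"], ["yes"]])

# Default behaviour: one column per category.
default_encoded = OneHotEncoder().fit_transform(toy_binary).toarray()

# drop="if_binary": a single column for features with exactly two categories.
binary_encoded = OneHotEncoder(drop="if_binary").fit_transform(toy_binary).toarray()

print(default_encoded)
print(binary_encoded)
```

With the default settings, the two columns always sum to 1, so the second column carries no information beyond the first. Dropping one of them keeps the same information in a smaller feature matrix.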
