# Week 3-I: Data Preparation on the Kings County Housing Dataset

## Setup

In [1]:
# Python ≥3.9 is required
import sys
assert sys.version_info >= (3, 9)

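# Scikit-Learn ≥1.2 is required (this notebook uses `set_output` and `OneHotEncoder(sparse_output=...)`)
# Compare parsed versions: a plain string comparison would rank "1.10" below "1.2"
from packaging import version
import sklearn
assert version.parse(sklearn.__version__) >= version.parse("1.2")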

# Common imports
import numpy as np
import pandas as pd
import os

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Precision options
np.set_printoptions(precision=2)
pd.options.display.float_format = '{:.3f}'.format

# Ignore useless warnings (see SciPy issue #5998)
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

## 1. Get the Data + Train/test split (again)

In [7]:
housing = pd.read_csv(
    "kings_county_house_data.csv",
    dtype={'zipcode': str}   # US ZIP codes look like numbers but we want to treat them like strings
)

Perform a stratified split with respect to `sqft_living`:

In [8]:
from sklearn.model_selection import StratifiedShuffleSplit

housing["sqft_living_cat"] = pd.cut(
    housing.sqft_living, 
    bins=[0., 1000., 2000., 3000., 4000., np.inf],
    labels=[1, 2, 3, 4, 5]
)
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in splitter.split(housing, housing.sqft_living_cat):
    train_set = housing.loc[train_index]
    test_set = housing.loc[test_index]
# delete the "sqft_living_cat" column
for set_ in (train_set, test_set):
    set_.drop("sqft_living_cat", axis=1, inplace=True)
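
To see what the stratified split buys us, the sketch below repeats the same steps on a synthetic dataset (the values are made up, not taken from the housing data) and compares the share of each `sqft_living` stratum in the full data, the training set and the test set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the housing data (values are made up for illustration)
rng = np.random.default_rng(42)
toy = pd.DataFrame({"sqft_living": rng.uniform(300, 6000, size=1000)})
toy["sqft_living_cat"] = pd.cut(
    toy["sqft_living"],
    bins=[0., 1000., 2000., 3000., 4000., np.inf],
    labels=[1, 2, 3, 4, 5],
)

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
# split() yields positional indices, so we select rows with .iloc
train_idx, test_idx = next(splitter.split(toy, toy["sqft_living_cat"]))
toy_train, toy_test = toy.iloc[train_idx], toy.iloc[test_idx]

def stratum_shares(df):
    """Share of rows falling in each sqft_living stratum, ordered by stratum."""
    return df["sqft_living_cat"].value_counts(normalize=True).sort_index()

proportions = pd.DataFrame({
    "overall": stratum_shares(toy),
    "train": stratum_shares(toy_train),
    "test": stratum_shares(toy_test),
})
print(proportions)
```

The three columns should be nearly identical, which is exactly the property a purely random split does not guarantee on small strata.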

## 2. Data Preparation Plan

From the EDA we have defined the following data preparation plan:
    
* We have identified a cutoff at latitude ~47.5 between more expensive houses and cheaper houses. We will create a binary engineered feature to capture this. We will also create a cutoff at longitude -121.6 to separate the urban west from the rural east of the county. We will then remove `lat` and `long`.
* We have decided to discard `sqft_living15` and `sqft_above` in favour of `sqft_living`.
* We have decided to add a binary engineered feature that indicates whether a house has a basement or not. We will then remove the continuous variable `sqft_basement`.
* We will create a binary `renovated` flag. If a house is older than 25 years (relative to the most recent year in the dataset) and has not been renovated, we will set `renovated` to 0, otherwise to 1. We will then remove the continuous variables `yr_built` and `yr_renovated`.
* We have decided to collapse the 70 zipcodes into 9 zipcode groups based on average house prices in the zipcodes. This will be performed by the `make_zipcode_groups()` function; you do not have to worry about its implementation details.
* Some houses report 0 bathrooms. We need to replace those values with more meaningful estimates.
* One house has 33 bedrooms. We will replace that value with 3, as it looks like a reporting mistake.   
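
As a small illustration of the binary flags in the plan, here is a sketch on four made-up rows. The column names follow the dataset, but the values are invented for the example.

```python
import numpy as np
import pandas as pd

# Made-up rows; column names follow the dataset, values are illustrative only
toy = pd.DataFrame({
    "yr_built":      [1950, 2000, 1980, 2014],
    "yr_renovated":  [0,    0,    2005, 0],
    "sqft_basement": [0,    400,  0,    120],
})

# Most recent year in the data, used as the reference point for "25 years old"
max_year = max(toy["yr_built"].max(), toy["yr_renovated"].max())

# 1 if built within the last 25 years OR renovated at some point, else 0
toy["renovated"] = np.where(
    (toy["yr_built"] > max_year - 25) | (toy["yr_renovated"] > 0), 1, 0
)
# 1 if the house has any basement area, else 0
toy["basement"] = np.where(toy["sqft_basement"] > 0, 1, 0)
print(toy[["renovated", "basement"]])
```

Only the first house (built in 1950, never renovated) ends up with `renovated = 0`.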

## 3 Data preparation: check there are no missing values

We can use `pd.DataFrame.isna()` or its alias `pd.DataFrame.isnull()` to look for null or missing values in any of our variables/features.

NOTE: `any(axis=1)` reduces across the columns, so it returns one boolean per row (`True` if that row has at least one missing value).

In [9]:
## Look for rows with incomplete values
incomplete_rows = train_set[train_set.isna().any(axis=1)]
incomplete_rows

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15


There are no missing values in our dataset.
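
The effect of the `axis` argument is easier to see on a tiny frame. The sketch below uses three made-up rows and contrasts the per-row check used above with a per-column check and a per-column count.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with two missing cells
toy = pd.DataFrame({
    "bedrooms":  [3, np.nan, 2],
    "bathrooms": [1.0, 2.0, np.nan],
    "floors":    [1, 2, 1],
})

# axis=1: reduce across the columns -> one boolean per row
rows_with_nan = toy.isna().any(axis=1)
# axis=0 (the default): reduce down the rows -> one boolean per column
cols_with_nan = toy.isna().any(axis=0)
# Number of missing cells in each column
nan_counts = toy.isna().sum()

print(toy[rows_with_nan])
print(cols_with_nan)
print(nan_counts)
```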

## Feature Engineering

**Exercise:** fill in the missing feature engineering steps in the `engineer_features()` function.

In [15]:
from kings_county_utils import make_zipcode_groups, assign_to_zipcode_group

def engineer_features(data: pd.DataFrame) -> pd.DataFrame:
    # let's make a copy of the original dataset
    engineered_data = data.copy()    
    # create a variable "north_loc", 1 for latitudes greater than 47.5
    engineered_data["north_loc"] = np.where(engineered_data["lat"] > 47.5, 1, 0)
    # create a variable "rural_east", 1 for longitudes greater than -121.6
    engineered_data["rural_east"] = np.where(engineered_data["long"] > -121.6, 1, 0)
    # drop "lat" and "long"
    engineered_data = engineered_data.drop(columns=["lat", "long"])
    # drop "sqft_living15" and "sqft_above"
    engineered_data = engineered_data.drop(columns=["sqft_living15", "sqft_above"])
    # add a binary variable for the presence of a  basement
    engineered_data["basement"] = np.where(engineered_data["sqft_basement"] > 0.0, 1, 0)
    # drop "sqft_basement"
    engineered_data = engineered_data.drop(columns=["sqft_basement"])
    # create a "renovated" binary variable
    max_year = max(engineered_data["yr_built"].max(), engineered_data["yr_renovated"].max())
    engineered_data["renovated"] = np.where(
        (engineered_data["yr_built"] > max_year - 25) | (engineered_data["yr_renovated"] > 0), 1, 0
    )
    # drop "yr_built" and "yr_renovated"
    engineered_data = engineered_data.drop(columns=["yr_built", "yr_renovated"])
    # group zipcodes into groups (as we did last week)
    zipcode_groups = make_zipcode_groups(engineered_data)
    engineered_data["zipcode_group"] = engineered_data["zipcode"].apply(assign_to_zipcode_group, zipcode_groups=zipcode_groups)
    # drop "zipcode"
    engineered_data = engineered_data.drop(columns=["zipcode"])
    # replace 0 bathrooms with NaN (we will fill it later on)
    engineered_data.loc[engineered_data['bathrooms'] == 0, 'bathrooms'] = np.nan
    # replace 33 bedrooms with NaN (we will fill it later on)
    engineered_data.loc[engineered_data['bedrooms'] == 33, 'bedrooms'] = np.nan
    # drop "id" and "date" (as we won't use them) and "price" (the label, kept separately)
    engineered_data = engineered_data.drop(columns=["id", "date", "price"])
    return engineered_data

In [16]:
train_data = engineer_features(train_set)
train_labels = train_set["price"].copy()
train_data.sample(10, random_state=42)

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_lot15,north_loc,rural_east,basement,renovated,zipcode_group
17465,4.0,2.5,2620,19864,2.0,0,0,4,8,13285,1,0,0,0,zg_0
14541,4.0,2.5,2540,4241,2.0,0,0,3,8,4929,1,0,0,1,zg_0
11107,4.0,3.0,1940,8170,1.0,0,0,4,7,8169,1,0,0,0,zg_3
12147,3.0,1.75,1660,11500,1.0,0,0,3,7,11000,0,0,1,0,zg_3
20312,3.0,2.25,1360,1041,2.0,0,0,3,8,1382,1,0,1,1,zg_0
12907,6.0,4.0,3120,4240,2.0,0,2,4,7,4240,1,0,1,1,zg_1
18345,3.0,2.0,1640,13249,1.0,0,0,3,7,9240,0,0,0,1,zg_1
13041,2.0,1.5,1220,5000,1.0,0,2,4,7,3850,1,0,0,0,zg_0
139,3.0,2.25,1170,1249,3.0,0,0,3,8,1310,1,0,0,1,zg_0
11086,3.0,2.0,1500,7828,1.0,0,0,4,7,7700,0,0,0,0,zg_1


In [17]:
len(train_data.columns)

15

We now have 15 features in our dataset but, as we will see when preparing the data for regression models, one-hot encoding will add a few extra "dummy" features.

## 3 Check there are no missing values in the engineered dataset

In [19]:
## Look for rows with incomplete values
incomplete_rows = train_data[train_data.isna().any(axis=1)]
incomplete_rows

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_lot15,north_loc,rural_east,basement,renovated,zipcode_group
15870,,1.75,1620,6000,1.0,0,0,5,7,4700,1,0,1,0,zg_0
6994,0.0,,4810,28008,2.0,0,0,3,12,35061,1,0,0,0,zg_4
1149,1.0,,670,43377,1.0,0,0,3,3,42882,0,0,0,0,zg_1
9854,0.0,,1470,4800,2.0,0,0,3,7,7200,1,0,0,1,zg_0
9773,0.0,,2460,8049,2.0,0,0,3,8,8050,0,0,0,0,zg_1
14423,0.0,,844,4269,1.0,0,0,4,7,9600,0,0,0,0,zg_1
3119,0.0,,1470,979,3.0,0,2,3,8,1399,1,0,0,1,zg_1
875,0.0,,3064,4764,3.5,0,2,3,7,4000,1,0,0,0,zg_5


As we can see, we now have 1 sample with missing bedrooms info and 7 samples with missing bathrooms info. We need to deal with these missing values. The process of filling in missing values is called imputation.

### 3.1 Imputation

In statistics, imputation is the process of replacing missing data with substituted values.
In scikit-learn we can use the `SimpleImputer` class to perform univariate imputation of missing values. Generally we will want to replace missing numeric (quantitative) and ordinal values with the median value of that feature. For categorical features we may want to either use a "missing"/"unknown" category, use the mode, or drop the samples with missing values.
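
Our dataset has no missing categorical values, so the three categorical options are not exercised in this notebook. The sketch below shows how they might look on a made-up column; the data and the `"unknown"` label are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up categorical column with two missing entries
toy = pd.DataFrame(
    {"zipcode_group": ["zg_0", "zg_1", np.nan, "zg_0", np.nan]},
    dtype=object,
)

# Option 1: replace missing values with the mode (most frequent category)
mode_imputer = SimpleImputer(strategy="most_frequent")
filled_mode = mode_imputer.fit_transform(toy)

# Option 2: replace missing values with an explicit "unknown" category
const_imputer = SimpleImputer(strategy="constant", fill_value="unknown")
filled_const = const_imputer.fit_transform(toy)

# Option 3: drop the samples with missing values
dropped = toy.dropna()

print(filled_mode.ravel())
print(filled_const.ravel())
print(len(dropped))
```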

Let's fill in the missing ordinal values of our dataset:

In [20]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")
train_data_num = train_data.select_dtypes(include=[np.number])
train_data_num

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_lot15,north_loc,rural_east,basement,renovated
20474,3.000,3.250,1380,1234,3.000,0,0,3,8,1282,1,0,0,1
3840,2.000,1.000,820,10450,1.000,0,0,4,7,11200,0,0,0,0
7426,3.000,3.500,4240,21578,2.000,0,0,3,10,16440,1,0,1,1
4038,4.000,1.000,1140,6250,1.500,0,0,3,6,1370,1,0,0,0
11420,3.000,2.500,1600,3172,2.000,0,0,3,7,3698,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14469,4.000,1.750,2000,5100,1.000,0,0,4,7,5100,1,0,1,0
8505,3.000,1.750,1370,10866,1.000,0,0,4,6,14250,1,0,0,0
549,2.000,1.000,1320,8865,1.000,0,0,4,6,6490,0,0,0,0
4482,3.000,1.000,1140,4560,1.000,0,0,4,6,3980,1,0,1,0


Basically all of our fields, with the exception of `zipcode_group`, are quantitative or ordinal (or binary variables):

In [21]:
imputer.fit(train_data_num)
imputer.statistics_

array([3.00e+00, 2.25e+00, 1.91e+03, 7.61e+03, 1.50e+00, 0.00e+00,
       0.00e+00, 3.00e+00, 7.00e+00, 7.62e+03, 1.00e+00, 0.00e+00,
       0.00e+00, 0.00e+00])

The `SimpleImputer.statistics_` property is just the median value for each column:

In [22]:
train_data_num.median()

bedrooms         3.000
bathrooms        2.250
sqft_living   1910.000
sqft_lot      7610.000
floors           1.500
waterfront       0.000
view             0.000
condition        3.000
grade            7.000
sqft_lot15    7620.000
north_loc        1.000
rural_east       0.000
basement         0.000
renovated        0.000
dtype: float64

We can now fill the missing values by applying the `transform()` method of the imputer to `train_data_num`:

In [23]:
train_data_arr = imputer.transform(train_data_num)
# imputer.transform returns a NumPy array; we need to wrap it back into a DataFrame with column names and index
train_data_num_filled = pd.DataFrame(
    train_data_arr,
    columns=train_data_num.columns,
    index=train_data_num.index
)

We can now verify that there are no more rows with missing values in `train_data_num_filled`:

In [24]:
train_data_num_filled[train_data_num_filled.isna().any(axis=1)]

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_lot15,north_loc,rural_east,basement,renovated


### 3.2 Handling Categorical Attributes

In general we can consider three types of text features: categorical, ordinal, and unstructured.
Unstructured text is the subject of Natural Language Processing, hence we will not consider its processing/encoding at this stage (and we have no unstructured data). Ordinal data are text categories that imply an intrinsic order, such as the set ("BAD", "AVERAGE", "GOOD", "VERY GOOD", "EXCELLENT").
These are generally encoded as integers ("BAD" => 0, "AVERAGE" => 1, "GOOD" => 2, "VERY GOOD" => 3, "EXCELLENT" => 4). These transformations can be handled with custom functions as above or using `sklearn.preprocessing.OrdinalEncoder`. In our case all the ordinal features are already expressed as numbers so we don't need to do anything with them.
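
Since our data needs no ordinal encoding, here is a minimal sketch on a made-up `quality` column (a hypothetical feature, not part of the housing dataset). The order of the categories is passed explicitly, because otherwise `OrdinalEncoder` sorts them alphabetically.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# The intended order, from worst to best
quality_levels = ["BAD", "AVERAGE", "GOOD", "VERY GOOD", "EXCELLENT"]
toy = pd.DataFrame({"quality": ["GOOD", "BAD", "EXCELLENT", "AVERAGE"]})

# One list of categories per encoded column
encoder = OrdinalEncoder(categories=[quality_levels])
encoded = encoder.fit_transform(toy[["quality"]])
print(encoded.ravel())
```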

To handle Categorical Attributes that are not ordinal, a common solution is to create one binary attribute per category. This is called one-hot encoding, because only one attribute will be equal to 1 (hot), while the others will be 0 (cold). The new attributes are sometimes called *dummy attributes*. Scikit-Learn provides a `sklearn.preprocessing.OneHotEncoder` class to convert categorical values into one-hot vectors.

In our case we have the `zipcode_group` attribute, which can be considered categorical. Each `zipcode_group` category should become a mutually exclusive dummy attribute.

<b>Exercise:</b> Use the `OneHotEncoder` class to encode each zipcode group as a separate category. Check the documentation for appropriate use of the `OneHotEncoder` transformer. What kind of output do you get?

In [25]:
## Write your solution here
from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder(categories='auto')
train_data_cat = train_data[["zipcode_group"]]
train_data_cat_1hot = cat_encoder.fit_transform(train_data_cat)
train_data_cat_1hot

<17290x9 sparse matrix of type '<class 'numpy.float64'>'
	with 17290 stored elements in Compressed Sparse Row format>

Notice that the output is a SciPy sparse matrix, instead of a NumPy array. This is very useful when you have categorical attributes with thousands of categories. After one-hot encoding, we get a matrix with thousands of columns, and the matrix is full of 0s except for a single 1 per row. We can get a dense array out of sparse matrix by calling the `.toarray()` method.

In [26]:
train_data_cat_1hot.toarray()

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 1., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [27]:
train_data_cat_1hot.toarray().shape

(17290, 9)

We can get the "column" names by checking the `OneHotEncoder.categories_` attribute:

In [28]:
cat_encoder.categories_

[array(['zg_0', 'zg_1', 'zg_2', 'zg_3', 'zg_4', 'zg_5', 'zg_6', 'zg_7',
        'zg_8'], dtype=object)]

### 3.3 Custom Transformers

You can define your own transformers by creating a class that implements `.fit()` and `.transform()` and inherits from both `BaseEstimator` (which provides `get_params()` and `set_params()`, needed for hyperparameter tuning) and the mixin class `TransformerMixin` (which provides `.fit_transform()` for free).
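
Before tackling the exercise, a much smaller example may help. The `ThresholdFlagger` class below is hypothetical (it is not part of the course utilities): it adds a 0/1 flag for values above a threshold, and shows which methods we write ourselves and which ones come from the two base classes.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ThresholdFlagger(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: adds a 0/1 flag for values above a threshold."""

    def __init__(self, column="lat", threshold=47.5):
        # Hyperparameters are stored as-is, under the same names as the arguments
        self.column = column
        self.threshold = threshold

    def fit(self, X, y=None):
        # Nothing to learn here, but fit() must still return self
        return self

    def transform(self, X):
        X = X.copy()
        X[f"{self.column}_flag"] = np.where(X[self.column] > self.threshold, 1, 0)
        return X

toy = pd.DataFrame({"lat": [47.3, 47.6, 47.5]})
flagger = ThresholdFlagger(threshold=47.5)
flagged = flagger.fit_transform(toy)   # fit_transform() comes from TransformerMixin
params = flagger.get_params()          # get_params() comes from BaseEstimator
print(flagged)
print(params)
```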


**Exercise:** convert `engineer_features()` into a scikit-learn transformer by subclassing `BaseEstimator` and `TransformerMixin`. The backbone of the class and the `.fit()` method are already provided below:

NB: `data` won't have the "price" column as it is passed as `labels`

In [29]:
from sklearn.base import BaseEstimator, TransformerMixin

class FeatureEngineeringTransformer(BaseEstimator, TransformerMixin):
    def fit(self, data, labels=None) -> "FeatureEngineeringTransformer":
        self.zipcode_groups = make_zipcode_groups(pd.concat([data["zipcode"], labels], axis=1))
        return self

    def transform(self, data, labels=None) -> pd.DataFrame:
        # let's make a copy of the original dataset
        engineered_data = data.copy()    
        # create a variable "north_loc", 1 for latitudes greater than 47.5
        engineered_data["north_loc"] = np.where(engineered_data["lat"] > 47.5, 1, 0)
        # create a variable "rural_east", 1 for longitudes greater than -121.6
        engineered_data["rural_east"] = np.where(engineered_data["long"] > -121.6, 1, 0)
        # drop "lat" and "long"
        engineered_data = engineered_data.drop(columns=["lat", "long"])
        # drop "sqft_living15" and "sqft_above"
        engineered_data = engineered_data.drop(columns=["sqft_living15", "sqft_above"])
        # add a binary variable for the presence of a  basement
        engineered_data["basement"] = np.where(engineered_data["sqft_basement"] > 0.0, 1, 0)
        # drop "sqft_basement"
        engineered_data = engineered_data.drop(columns=["sqft_basement"])
        # create a "renovated" binary variable
        max_year = max(engineered_data["yr_built"].max(), engineered_data["yr_renovated"].max())
        engineered_data["renovated"] = np.where(
            (engineered_data["yr_built"] > max_year - 25) | (engineered_data["yr_renovated"] > 0), 1, 0
        )
        # drop "yr_built" and "yr_renovated"
        engineered_data = engineered_data.drop(columns=["yr_built", "yr_renovated"])
        # group zipcodes into groups (as we did last week)
        engineered_data["zipcode_group"] = engineered_data["zipcode"].apply(assign_to_zipcode_group, zipcode_groups=self.zipcode_groups)
        # drop "zipcode"
        engineered_data = engineered_data.drop(columns=["zipcode"])
        # replace 0 bathrooms with NaN (we will fill it later on)
        engineered_data.loc[engineered_data['bathrooms'] == 0, 'bathrooms'] = np.nan
        # replace 33 bedrooms with NaN (we will fill it later on)
        engineered_data.loc[engineered_data['bedrooms'] == 33, 'bedrooms'] = np.nan
        # drop id and date (as we won't use them)
        engineered_data = engineered_data.drop(columns=["id", "date"])
        return engineered_data


If your implementation is correct the cell below will run successfully:

In [30]:
fe_trf = FeatureEngineeringTransformer()
fe_trf.fit(
    train_set.drop(columns=["price"]),  # training set data
    train_set["price"]                  # training set labels
)
fe_trf.transform(train_set.drop(columns=["price"]))

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_lot15,north_loc,rural_east,basement,renovated,zipcode_group
20474,3.000,3.250,1380,1234,3.000,0,0,3,8,1282,1,0,0,1,zg_0
3840,2.000,1.000,820,10450,1.000,0,0,4,7,11200,0,0,0,0,zg_1
7426,3.000,3.500,4240,21578,2.000,0,0,3,10,16440,1,0,1,1,zg_2
4038,4.000,1.000,1140,6250,1.500,0,0,3,6,1370,1,0,0,0,zg_3
11420,3.000,2.500,1600,3172,2.000,0,0,3,7,3698,1,0,0,1,zg_4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14469,4.000,1.750,2000,5100,1.000,0,0,4,7,5100,1,0,1,0,zg_0
8505,3.000,1.750,1370,10866,1.000,0,0,4,6,14250,1,0,0,0,zg_0
549,2.000,1.000,1320,8865,1.000,0,0,4,6,6490,0,0,0,0,zg_1
4482,3.000,1.000,1140,4560,1.000,0,0,4,6,3980,1,0,1,0,zg_0


## 4. Feature Scaling

One of the most important transformations you need to apply to your data is feature scaling. In the great majority of cases, Machine Learning algorithms will not perform well when the input numerical attributes have very different scales.

There are two common ways to get all attributes to have the same scale:
* min-max scaling: rescales each feature to a fixed range, typically [0, 1] or [−1, 1] (using scikit-learn's `MinMaxScaler`)
* standardization: rescales each feature to have zero mean and unit variance (using scikit-learn's `StandardScaler`).

We will see an example of feature scaling below, when we'll show how all the preprocessing can be performed together building a pipeline.

In [12]:
### Example courtesy of ChatGPT
from sklearn.preprocessing import MinMaxScaler

ages = [[30], [40], [50], [60]]
scaler = MinMaxScaler()
scaled_ages = scaler.fit_transform(ages)

print("Original ages:")
print(ages)
print("\nScaled ages (min-max scaled):")
print(scaled_ages)

Original ages:
[[30], [40], [50], [60]]

Scaled ages (min-max scaled):
[[0.  ]
 [0.33]
 [0.67]
 [1.  ]]


In [13]:
### Example courtesy of ChatGPT
from sklearn.preprocessing import StandardScaler

ages = [[30], [40], [50], [60]]
scaler = StandardScaler()
standardized_ages = scaler.fit_transform(ages)

print("Original ages:")
print(ages)
print("\nStandardized ages:")
print(standardized_ages)

Original ages:
[[30], [40], [50], [60]]

Standardized ages:
[[-1.34]
 [-0.45]
 [ 0.45]
 [ 1.34]]
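
The two outputs above can be reproduced by hand, which makes the formulas behind the scalers explicit. This is a sketch with plain NumPy on the same four ages; note that `StandardScaler` divides by the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np

ages = np.array([30., 40., 50., 60.])

# Min-max scaling: (x - min) / (max - min)
minmax = (ages - ages.min()) / (ages.max() - ages.min())

# Standardization: (x - mean) / std, with the population std (ddof=0)
standardized = (ages - ages.mean()) / ages.std()

print(minmax.round(2))
print(standardized.round(2))
```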


## 5. Transformation Pipelines

Pipeline objects are useful to chain transformations (and potentially estimators) together, ensuring clean code and reproducibility.

We are going to use a `StandardScaler` directly after the `SimpleImputer`, only for non-binary numerical features, using a scikit-learn transformation pipeline.

In [31]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

cat_feats = ["zipcode_group"]
binary_feats = ["waterfront", "north_loc", "rural_east", "basement", "renovated"] 
num_feats = [
    el for el in list(
        train_data.select_dtypes(include=[np.number])
    ) if el not in binary_feats
]

In [32]:
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler())
])
train_data_num_scaled = num_pipeline.fit_transform(train_data[num_feats])

In [33]:
train_data_num_scaled[:5]

array([[-0.41,  1.48, -0.77, -0.33,  2.79, -0.31, -0.63,  0.29, -0.43],
       [-1.51, -1.46, -1.38, -0.11, -0.91, -0.31,  0.91, -0.56, -0.05],
       [-0.41,  1.81,  2.37,  0.16,  0.94, -0.31, -0.63,  1.98,  0.14],
       [ 0.69, -1.46, -1.03, -0.21,  0.01, -0.31, -0.63, -1.4 , -0.42],
       [-0.41,  0.5 , -0.53, -0.28,  0.94, -0.31, -0.63, -0.56, -0.34]])

As you can see the data has all been scaled to be zero-centred and with variance = 1.
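
A fitted pipeline keeps each step accessible by name, which is handy for checking what was learned. The sketch below uses a made-up single-column frame rather than the housing data and inspects the median learned by the imputer and the mean learned by the scaler.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up column with one missing value
toy = pd.DataFrame({"bedrooms": [2.0, 3.0, np.nan, 4.0, 3.0]})

toy_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
scaled = toy_pipeline.fit_transform(toy)

# Each fitted step can be inspected through its name
learned_median = toy_pipeline.named_steps["imputer"].statistics_
learned_mean = toy_pipeline.named_steps["std_scaler"].mean_

print(learned_median, learned_mean)
print(scaled.ravel().round(2))
```

The scaler's mean is computed after imputation, so the order of the steps matters.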

Scikit-learn's built-in transformers output NumPy arrays by default. If we want to get a pandas data frame out, we can use the `set_output` API:

In [34]:
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median").set_output(transform="pandas")),
    ('std_scaler', StandardScaler().set_output(transform="pandas"))
])
train_data_num_scaled = num_pipeline.fit_transform(train_data[num_feats])

In [35]:
train_data_num_scaled.head()

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,view,condition,grade,sqft_lot15
20474,-0.409,1.479,-0.767,-0.33,2.794,-0.305,-0.629,0.29,-0.427
3840,-1.509,-1.455,-1.38,-0.108,-0.915,-0.305,0.91,-0.557,-0.054
7426,-0.409,1.805,2.367,0.159,0.94,-0.305,-0.629,1.985,0.143
4038,0.691,-1.455,-1.03,-0.209,0.013,-0.305,-0.629,-1.405,-0.424
11420,-0.409,0.501,-0.526,-0.283,0.94,-0.305,-0.629,-0.557,-0.336


In [60]:
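import yaml

def read_one_block_of_yaml_data(filename):
    # Load a single YAML document from "<filename>.yaml" into a Python dict
    with open(f'{filename}.yaml','r') as f:
        my_dict = yaml.safe_load(f)
    return my_dict

# Read the imputation strategy from a config file: "imputer_type.yaml" is expected
# to contain an "imputer_type" key (e.g. imputer_type: median)
a = read_one_block_of_yaml_data('imputer_type')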

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy=a['imputer_type']).set_output(transform="pandas")),
    ('std_scaler', StandardScaler().set_output(transform="pandas"))
])
train_data_num_scaled = num_pipeline.fit_transform(train_data[num_feats])
train_data_num_scaled.head(3)

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,view,condition,grade,sqft_lot15
20474,-0.409,1.479,-0.767,-0.33,2.794,-0.305,-0.629,0.29,-0.427
3840,-1.509,-1.455,-1.38,-0.108,-0.915,-0.305,0.91,-0.557,-0.054
7426,-0.409,1.805,2.367,0.159,0.94,-0.305,-0.629,1.985,0.143


## 6. Chaining all together - `ColumnTransformer` and `Pipeline`

So far we have transformed the data step by step, applying different transformations to different features in our dataset. Ideally, we would like to chain all the data preparation steps into a single operation. This would simplify applying the transformations to different datasets (i.e. training and test set) and ensure the reproducibility of our data manipulation pipeline (all the transformations would be executed as one unit).

Until now, we have handled the categorical, binary and numerical columns separately. It would be more convenient to have a single transformer capable of handling all columns, applying the appropriate transformations to each. Solution: we can use scikit-learn's `ColumnTransformer`!

In [61]:
from sklearn.compose import ColumnTransformer

column_transformer = ColumnTransformer(
    (
        ("numerical", num_pipeline, num_feats),
        ("categorical", OneHotEncoder(categories='auto', sparse_output=False).set_output(transform="pandas"), cat_feats),
    ),
    remainder="passthrough",
    verbose_feature_names_out=False,
).set_output(transform="pandas")

column_transformer

In [62]:
train_data_prepared = column_transformer.fit_transform(train_data)
train_data_prepared

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,view,condition,grade,sqft_lot15,zipcode_group_zg_0,...,zipcode_group_zg_4,zipcode_group_zg_5,zipcode_group_zg_6,zipcode_group_zg_7,zipcode_group_zg_8,waterfront,north_loc,rural_east,basement,renovated
20474,-0.409,1.479,-0.767,-0.330,2.794,-0.305,-0.629,0.290,-0.427,1.000,...,0.000,0.000,0.000,0.000,0.000,0,1,0,0,1
3840,-1.509,-1.455,-1.380,-0.108,-0.915,-0.305,0.910,-0.557,-0.054,0.000,...,0.000,0.000,0.000,0.000,0.000,0,0,0,0,0
7426,-0.409,1.805,2.367,0.159,0.940,-0.305,-0.629,1.985,0.143,0.000,...,0.000,0.000,0.000,0.000,0.000,0,1,0,1,1
4038,0.691,-1.455,-1.030,-0.209,0.013,-0.305,-0.629,-1.405,-0.424,0.000,...,0.000,0.000,0.000,0.000,0.000,0,1,0,0,0
11420,-0.409,0.501,-0.526,-0.283,0.940,-0.305,-0.629,-0.557,-0.336,0.000,...,1.000,0.000,0.000,0.000,0.000,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14469,0.691,-0.477,-0.087,-0.237,-0.915,-0.305,0.910,-0.557,-0.284,1.000,...,0.000,0.000,0.000,0.000,0.000,0,1,0,1,0
8505,-0.409,-0.477,-0.778,-0.098,-0.915,-0.305,0.910,-1.405,0.060,1.000,...,0.000,0.000,0.000,0.000,0.000,0,1,0,0,0
549,-1.509,-1.455,-0.832,-0.146,-0.915,-0.305,0.910,-1.405,-0.231,0.000,...,0.000,0.000,0.000,0.000,0.000,0,0,0,0,0
4482,-0.409,-1.455,-1.030,-0.250,-0.915,-0.305,0.910,-1.405,-0.326,1.000,...,0.000,0.000,0.000,0.000,0.000,0,1,0,1,0


Most of the steps in the pipeline are now chained together. We are just missing the `engineer_features()` transformation step. A stateless function could be wrapped with [`FunctionTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html#sklearn.preprocessing.FunctionTransformer), but our feature engineering has to learn the zipcode groups from the training set, so we will use the `FeatureEngineeringTransformer` defined above instead.

**Exercise:** create a new pipeline chaining the `FeatureEngineeringTransformer` with the `column_transformer` we have defined above and apply the transformation to the original "raw" `train_set` dataset.

Hint: you can also use the `make_pipeline()` utility function to easily create a pipeline.

In [63]:
# write your solution here:
from sklearn.pipeline import make_pipeline

full_pipeline = make_pipeline(FeatureEngineeringTransformer(), column_transformer)
full_pipeline

In [64]:
print(train_set.columns)
full_pipeline.fit(train_set.drop(columns=["price"]), train_set["price"])

Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15'],
      dtype='object')


In [65]:
train_data_prepared =  full_pipeline.transform(train_set.drop(columns=["price"]))
train_data_prepared

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,view,condition,grade,sqft_lot15,zipcode_group_zg_0,...,zipcode_group_zg_4,zipcode_group_zg_5,zipcode_group_zg_6,zipcode_group_zg_7,zipcode_group_zg_8,waterfront,north_loc,rural_east,basement,renovated
20474,-0.409,1.479,-0.767,-0.330,2.794,-0.305,-0.629,0.290,-0.427,1.000,...,0.000,0.000,0.000,0.000,0.000,0,1,0,0,1
3840,-1.509,-1.455,-1.380,-0.108,-0.915,-0.305,0.910,-0.557,-0.054,0.000,...,0.000,0.000,0.000,0.000,0.000,0,0,0,0,0
7426,-0.409,1.805,2.367,0.159,0.940,-0.305,-0.629,1.985,0.143,0.000,...,0.000,0.000,0.000,0.000,0.000,0,1,0,1,1
4038,0.691,-1.455,-1.030,-0.209,0.013,-0.305,-0.629,-1.405,-0.424,0.000,...,0.000,0.000,0.000,0.000,0.000,0,1,0,0,0
11420,-0.409,0.501,-0.526,-0.283,0.940,-0.305,-0.629,-0.557,-0.336,0.000,...,1.000,0.000,0.000,0.000,0.000,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14469,0.691,-0.477,-0.087,-0.237,-0.915,-0.305,0.910,-0.557,-0.284,1.000,...,0.000,0.000,0.000,0.000,0.000,0,1,0,1,0
8505,-0.409,-0.477,-0.778,-0.098,-0.915,-0.305,0.910,-1.405,0.060,1.000,...,0.000,0.000,0.000,0.000,0.000,0,1,0,0,0
549,-1.509,-1.455,-0.832,-0.146,-0.915,-0.305,0.910,-1.405,-0.231,0.000,...,0.000,0.000,0.000,0.000,0.000,0,0,0,0,0
4482,-0.409,-1.455,-1.030,-0.250,-0.915,-0.305,0.910,-1.405,-0.326,1.000,...,0.000,0.000,0.000,0.000,0.000,0,1,0,1,0


Great job: our whole data preparation pipeline is now encapsulated in the `full_pipeline` object.

Now we can prepare the test set for the final evaluation as well.

*Please note:* you must never fit an estimator (predictor or transformer) to the test set. This would mean leaking your test data into the training phase and would invalidate any conclusions on generalisation you may draw from evaluation on the test set. Hence, we must only use the `full_pipeline.transform()` method with our test set.
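
The sketch below illustrates the rule with a single `StandardScaler` on synthetic data (made-up numbers, not the housing set): the statistics are learned from the training set only and then reused unchanged on the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Made-up feature: train and test drawn from the same distribution
X_train = rng.normal(loc=2000., scale=500., size=(800, 1))
X_test = rng.normal(loc=2000., scale=500., size=(200, 1))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics learned from the training set only
X_test_scaled = scaler.transform(X_test)        # same statistics reused, nothing is re-fitted

print(X_train_scaled.mean(), X_test_scaled.mean())
```

The scaled test set will generally not have a mean of exactly zero. That is expected: it was scaled with the training-set mean and standard deviation, just as unseen data would be in production.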

In [30]:
test_data_prepared = full_pipeline.transform(test_set.drop(columns=["price"]))

In [31]:
test_data_prepared.head()

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,view,condition,grade,sqft_lot15,zipcode_group_zg_0,...,zipcode_group_zg_4,zipcode_group_zg_5,zipcode_group_zg_6,zipcode_group_zg_7,zipcode_group_zg_8,waterfront,north_loc,rural_east,basement,renovated
2620,1.791,3.109,3.342,0.676,0.94,-0.305,-0.629,2.832,1.162,0.0,...,0.0,1.0,0.0,0.0,0.0,0,1,0,0,1
12950,-0.409,-0.477,-0.495,6.515,-0.915,-0.305,-0.629,-0.557,7.657,0.0,...,0.0,0.0,0.0,0.0,0.0,0,0,0,0,1
8075,0.691,0.501,0.274,-0.249,0.94,-0.305,-0.629,0.29,-0.258,0.0,...,0.0,0.0,0.0,0.0,0.0,0,0,0,0,1
18162,0.691,0.501,0.406,-0.193,0.94,-0.305,-0.629,-0.557,-0.215,0.0,...,0.0,0.0,0.0,0.0,0.0,0,0,0,0,1
19739,-0.409,0.501,0.219,-0.25,0.94,-0.305,-0.629,0.29,-0.302,0.0,...,0.0,0.0,0.0,0.0,0.0,0,1,0,0,1


## 7. Save all the prepared data - train and test

Let's add back the "price" column to our prepared datasets.

In [32]:
train_set_prepared = pd.concat([train_data_prepared, train_set["price"]], axis=1)
train_set_prepared.sample(5, random_state=77)

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,view,condition,grade,sqft_lot15,zipcode_group_zg_0,...,zipcode_group_zg_5,zipcode_group_zg_6,zipcode_group_zg_7,zipcode_group_zg_8,waterfront,north_loc,rural_east,basement,renovated,price
1916,-0.409,-0.477,-0.602,-0.147,-0.915,-0.305,2.449,0.29,-0.129,0.0,...,1.0,0.0,0.0,0.0,0,1,0,0,0,500000.0
13881,-0.409,0.501,0.625,-0.302,0.94,-0.305,-0.629,0.29,-0.392,0.0,...,1.0,0.0,0.0,0.0,0,1,0,1,0,850000.0
4125,1.791,0.501,-0.065,-0.161,-0.915,-0.305,0.91,0.29,-0.139,0.0,...,0.0,0.0,0.0,0.0,0,0,0,1,0,346500.0
6521,-0.409,0.175,1.03,-0.215,2.794,2.316,-0.629,0.29,-0.25,0.0,...,0.0,0.0,0.0,0.0,0,1,0,1,1,439000.0
18632,0.691,0.827,0.417,-0.273,0.94,-0.305,-0.629,0.29,-0.34,1.0,...,0.0,0.0,0.0,0.0,0,1,0,1,1,710000.0


In [33]:
test_set_prepared = pd.concat([test_data_prepared, test_set["price"]], axis=1)
test_set_prepared.sample(5, random_state=77)

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,view,condition,grade,sqft_lot15,zipcode_group_zg_0,...,zipcode_group_zg_5,zipcode_group_zg_6,zipcode_group_zg_7,zipcode_group_zg_8,waterfront,north_loc,rural_east,basement,renovated,price
13781,2.891,1.153,1.961,0.692,0.94,-0.305,-0.629,0.29,0.855,1.0,...,0.0,0.0,0.0,0.0,0,1,0,1,0,585000.0
16377,0.691,1.153,0.077,-0.164,0.94,-0.305,-0.629,0.29,-0.17,0.0,...,0.0,0.0,0.0,1.0,0,1,0,0,1,1200000.0
3901,-0.409,-0.477,-0.153,-0.158,-0.915,-0.305,0.91,0.29,-0.149,0.0,...,1.0,0.0,0.0,0.0,0,1,0,1,0,550000.0
5867,-1.509,-0.477,-1.128,-0.124,0.013,-0.305,0.91,-1.405,0.003,0.0,...,0.0,0.0,0.0,0.0,0,1,1,0,0,175000.0
2304,-1.509,-1.455,-0.81,-0.123,-0.915,-0.305,-0.629,-0.557,-0.139,0.0,...,0.0,0.0,0.0,0.0,0,1,0,0,0,290000.0


Let's save the datasets as CSV files in a newly created `prepared` subdirectory of `datasets`:

In [34]:
os.makedirs("../datasets/prepared/", exist_ok=True)
train_set_prepared.to_csv("../datasets/prepared/kd-housing-train.csv", index=False)
test_set_prepared.to_csv("../datasets/prepared/kd-housing-test.csv", index=False)
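
One consequence of `index=False` is worth keeping in mind: the original row labels are not written to the file, so a reloaded dataset gets a fresh 0-based index. The round trip below is a sketch on a made-up two-row frame, written to a temporary directory (the file name is hypothetical).

```python
import os
import tempfile
import pandas as pd

# Made-up frame whose index mimics the original row labels
toy = pd.DataFrame(
    {"sqft_living": [-0.77, 1.25], "price": [500000.0, 850000.0]},
    index=[20474, 3840],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "prepared")
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, "toy-train.csv")

    toy.to_csv(path, index=False)   # the index is not written to the file
    reloaded = pd.read_csv(path)

print(list(reloaded.index))
print(list(reloaded.columns))
```

If the original row labels are needed later (for example to trace a prediction back to a house), drop `index=False` and read the file back with `index_col=0`.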

In [35]:
train_set_prepared.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17290 entries, 20474 to 1941
Data columns (total 24 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   bedrooms            17290 non-null  float64
 1   bathrooms           17290 non-null  float64
 2   sqft_living         17290 non-null  float64
 3   sqft_lot            17290 non-null  float64
 4   floors              17290 non-null  float64
 5   view                17290 non-null  float64
 6   condition           17290 non-null  float64
 7   grade               17290 non-null  float64
 8   sqft_lot15          17290 non-null  float64
 9   zipcode_group_zg_0  17290 non-null  float64
 10  zipcode_group_zg_1  17290 non-null  float64
 11  zipcode_group_zg_2  17290 non-null  float64
 12  zipcode_group_zg_3  17290 non-null  float64
 13  zipcode_group_zg_4  17290 non-null  float64
 14  zipcode_group_zg_5  17290 non-null  float64
 15  zipcode_group_zg_6  17290 non-null  float64
 16  z