In [27]:
import os
import pandas as pd
import numpy as np
import csv

HOUSING_PATH = os.path.join("datasets", "housing")

housing = pd.read_csv(os.path.join(HOUSING_PATH, "strat_train_set.csv"))

In [28]:
housing_labels = housing["median_house_value"].copy()
housing = housing.drop("median_house_value", axis = 1)

The total_bedrooms attribute has some missing values. There are three options: drop the rows with missing values, drop the whole attribute, or fill the missing values with some value (zero, the mean, the median, etc.).
If you fill with a computed value such as the median or mean, save it: you will need the same value to fill missing entries in the test set.

In [29]:
# housing.dropna(subset=["total_bedrooms"])    # option 1: drop the rows
# housing.drop("total_bedrooms", axis=1)       # option 2: drop the attribute
# median = housing["total_bedrooms"].median()  # option 3: fill with the median
# housing["total_bedrooms"] = housing["total_bedrooms"].fillna(median)
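Option 3, with the median saved and reused on a test set, might look like the sketch below. The two small DataFrames and their values are made up for illustration; they only stand in for the real training and test sets.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the training and test sets (hypothetical values).
train = pd.DataFrame({"total_bedrooms": [100.0, np.nan, 300.0, 500.0]})
test = pd.DataFrame({"total_bedrooms": [np.nan, 250.0]})

# Compute the median on the training set only, and keep it around.
median = train["total_bedrooms"].median()

# Fill both sets with the *training* median. Assigning the result back
# avoids relying on inplace=True on a column selection.
train["total_bedrooms"] = train["total_bedrooms"].fillna(median)
test["total_bedrooms"] = test["total_bedrooms"].fillna(median)

print(median)
print(test)
```

The point of the design is that the test set never contributes to the fill value.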

In [30]:
# Substituting with median using sklearn
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")
housing_numeric = housing.select_dtypes(include=[np.number])
# housing_numeric = housing.drop("ocean_proximity", axis = 1)

# Only total_bedrooms has missing values right now, but new data may have
# missing values in other columns, so fit the imputer on all numeric attributes.
imputer.fit(housing_numeric)

In [31]:
housing_numeric.median().values

array([-118.51   ,   34.26   ,   29.     , 2119.     ,  433.     ,
       1164.     ,  408.     ,    3.54155])
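The fitted imputer keeps the learned medians in its statistics_ attribute, so they can be compared with the medians computed by pandas. A minimal check on a made-up two-column DataFrame (the column names a and b are placeholders):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [10.0, np.nan, 30.0, 50.0],
})

toy_imputer = SimpleImputer(strategy="median")
toy_imputer.fit(toy)

# statistics_ holds one learned fill value per column, in column order.
print(toy_imputer.statistics_)
# pandas skips NaN by default, so the two should agree.
print(toy.median().values)
```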

Now you can use this “trained” imputer to transform the training set, replacing missing values with the learned medians:

X = imputer.transform(housing_numeric)

The result is a plain NumPy array containing the transformed features. If you want to put it back into a Pandas DataFrame, it’s simple:

housing_tr = pd.DataFrame(X, columns=housing_numeric.columns)
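One detail worth keeping in mind: the NumPy array has no row labels, so if the original DataFrame has a non-default index it is probably worth passing it along when rebuilding the DataFrame. A small sketch on made-up data (the index values 10, 20, 30 are arbitrary):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with a non-default index.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0]}, index=[10, 20, 30])

# transform()/fit_transform() return a plain ndarray: column names and index are lost.
X_toy = SimpleImputer(strategy="median").fit_transform(toy)

# Restore both the column names and the original row labels.
toy_tr = pd.DataFrame(X_toy, columns=toy.columns, index=toy.index)
print(toy_tr)
```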

In [32]:
# Transforming ocean_proximity to numerical values
from sklearn.preprocessing import OrdinalEncoder
housing_cat = housing[["ocean_proximity"]]
ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)


In [33]:
housing_cat_encoded[:10]

array([[1.],
       [4.],
       [1.],
       [4.],
       [0.],
       [3.],
       [0.],
       [0.],
       [0.],
       [0.]])

In [34]:
ordinal_encoder.categories_

[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]

This encoding does not work well here because a model will assume that two nearby values are more similar than two distant ones, e.g. that categories 0 and 1 are more similar than categories 0 and 4, which is not the case for ocean_proximity. It would be fine for an ordered categorical variable such as bad, average, good, excellent. Here we need dummy variables (one-hot encoding) instead.
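For the ordered case, OrdinalEncoder accepts an explicit category order through its categories parameter; otherwise it sorts the categories, which would put "average" before "bad". A sketch on a made-up quality column (the column name and values are illustrative only):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered categorical data.
ratings = pd.DataFrame({"quality": ["good", "bad", "excellent", "average"]})

# One list per column, giving the order from lowest to highest.
ordered = OrdinalEncoder(categories=[["bad", "average", "good", "excellent"]])
encoded = ordered.fit_transform(ratings)

print(encoded.ravel())
```

With this order, the distance between codes reflects the distance between ratings, which is the situation where ordinal encoding makes sense.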

In [35]:
from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot

<16512x5 sparse matrix of type '<class 'numpy.float64'>'
	with 16512 stored elements in Compressed Sparse Row format>

The result is stored as a SciPy sparse matrix: rather than wasting memory on the zeros, it stores only the locations of the non-zero elements.
Let's convert it to a regular (dense) NumPy array:

In [36]:
housing_cat_1hot.toarray()

array([[0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0.],
       ...,
       [1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.]])
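To see which column corresponds to which category, and how little the sparse matrix actually stores, here is a small check on four made-up rows (a toy stand-in for housing_cat):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with three distinct categories.
toy_cat = pd.DataFrame(
    {"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "<1H OCEAN"]}
)

toy_encoder = OneHotEncoder()
toy_1hot = toy_encoder.fit_transform(toy_cat)

# One column per category, and one stored (non-zero) element per row.
print(toy_1hot.shape, toy_1hot.nnz)

# Columns follow the order of categories_ (sorted by default).
print(toy_encoder.categories_[0])

dense = toy_1hot.toarray()
print(dense)
```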

Custom Transformers

Although Scikit-Learn provides many useful transformers, you may need to write your own for tasks such as custom cleanup operations or combining specific attributes. You can make your transformer work seamlessly with Scikit-Learn functionalities (such as pipelines), and since Scikit-Learn relies on duck typing (not inheritance), all you need is to create a class and implement three methods:

	•	fit() (which returns self),
	•	transform(),
	•	fit_transform().

You can get the last one (fit_transform()) for free by simply adding TransformerMixin as a base class. Additionally, if you add BaseEstimator as a base class (and avoid using *args and **kwargs in your constructor), you’ll get two extra methods: get_params() and set_params(), which are helpful for automatic hyperparameter tuning.

For example, here’s a small transformer class that adds combined attributes (as discussed earlier):

In [39]:
from sklearn.base import BaseEstimator, TransformerMixin

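# Positional indices of the columns used below, looked up in housing.
# Note: they are reused on housing_numeric later, which assumes the
# non-numeric column (ocean_proximity) comes after them in housing.
rooms_ix, bedrooms_ix, population_ix, households_ix = [
    housing.columns.get_loc(c)
    for c in ("total_rooms", "total_bedrooms", "population", "households")
]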

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True): # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

In [40]:
attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)
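The methods inherited from the two base classes can be checked on a much smaller transformer. ColumnRatioAdder below is a hypothetical class written only for this illustration (it is not part of Scikit-Learn or of the housing code); it appends the ratio of two columns.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnRatioAdder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: appends column num_ix / column den_ix."""
    def __init__(self, num_ix=0, den_ix=1):  # explicit keyword args only
        self.num_ix = num_ix
        self.den_ix = den_ix
    def fit(self, X, y=None):
        return self  # nothing to learn
    def transform(self, X):
        ratio = X[:, self.num_ix] / X[:, self.den_ix]
        return np.c_[X, ratio]

adder = ColumnRatioAdder()

# get_params() / set_params() come from BaseEstimator.
params = adder.get_params()
adder.set_params(num_ix=1, den_ix=0)

# fit_transform() comes from TransformerMixin.
X_toy = np.array([[2.0, 4.0], [3.0, 6.0]])
out = adder.fit_transform(X_toy)

print(params)
print(out)
```

Because the constructor stores each argument under the same name, get_params() can discover the hyperparameters by inspecting the signature, which is why *args and **kwargs should be avoided.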

Transformation Pipelines

As you can see, there are many data transformation steps that need to be executed in the right order. Fortunately, Scikit-Learn provides the Pipeline class to help with such sequences of transformations. Here is a small pipeline for the numerical attributes:

In [41]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])

housing_num_tr = num_pipeline.fit_transform(housing_numeric)
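Calling fit_transform() on a pipeline is roughly equivalent to calling fit_transform() on each step in turn and feeding the output to the next step. A sketch on a made-up array, leaving out the custom attribute adder to keep it self-contained:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with one missing value.
X_toy = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, 60.0]])

toy_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
piped = toy_pipeline.fit_transform(X_toy)

# The same two steps applied by hand, in the same order.
imputed = SimpleImputer(strategy="median").fit_transform(X_toy)
manual = StandardScaler().fit_transform(imputed)

print(np.allclose(piped, manual))

# Fitted steps remain accessible by name.
print(toy_pipeline.named_steps["imputer"].statistics_)
```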

So far, we have handled the categorical columns and the numerical columns separately. It would be more convenient to have a single transformer able to handle all columns, applying the appropriate transformations to each column. In version 0.20, Scikit-Learn introduced the ColumnTransformer for this purpose, and the good news is that it works great with Pandas DataFrames. Let’s use it to apply all the transformations to the housing data:

In [42]:
from sklearn.compose import ColumnTransformer

num_attribs = list(housing_numeric)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared

array([[-0.94135046,  1.34743822,  0.02756357, ...,  0.        ,
         0.        ,  0.        ],
       [ 1.17178212, -1.19243966, -1.72201763, ...,  0.        ,
         0.        ,  1.        ],
       [ 0.26758118, -0.1259716 ,  1.22045984, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [-1.5707942 ,  1.31001828,  1.53856552, ...,  0.        ,
         0.        ,  0.        ],
       [-1.56080303,  1.2492109 , -1.1653327 , ...,  0.        ,
         0.        ,  0.        ],
       [-1.28105026,  2.02567448, -0.13148926, ...,  0.        ,
         0.        ,  0.        ]])
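Once fitted, the same transformer can be applied to new rows with transform(), which reuses the medians, scaling statistics and categories learned from the training data. A reduced sketch on a made-up three-column DataFrame (the rooms and income columns and all values are placeholders, and the custom attribute adder is omitted):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy training data.
toy = pd.DataFrame({
    "rooms": [4.0, np.nan, 8.0, 6.0],
    "income": [2.0, 3.0, 5.0, 4.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"],
})

toy_num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

toy_transformer = ColumnTransformer(
    [
        ("num", toy_num_pipeline, ["rooms", "income"]),
        ("cat", OneHotEncoder(), ["ocean_proximity"]),
    ],
    sparse_threshold=0,  # always return a dense array in this sketch
)

toy_prepared = toy_transformer.fit_transform(toy)
print(toy_prepared.shape)  # 2 scaled numeric columns + 3 one-hot columns

# New data: only transform(), never fit, so training statistics are reused.
new_rows = pd.DataFrame(
    {"rooms": [np.nan], "income": [5.0], "ocean_proximity": ["INLAND"]}
)
new_prepared = toy_transformer.transform(new_rows)
print(new_prepared)
```

The missing rooms value in the new row is filled with the training median, which also happens to be the training mean after imputation here, so it scales to 0.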