CPTR 435 Machine Learning 



Name: Kaleb Tsegaye

This activity is adapted from the notebook provided for chapter 2 of *Hands-On Machine Learning with Scikit-Learn & TensorFlow* by Geron (2017).

For the original notebook and all other code/data from the book, see:
https://github.com/ageron/handson-ml


!["Is the pipeline literally running from your laptop?" "Don't be silly, my laptop disconnects far too often to host a service we rely on. It's running on my phone."](https://imgs.xkcd.com/comics/data_pipeline.png)

*"Is the pipeline literally running from your laptop?" "Don't be silly, my laptop disconnects far too often to host a service we rely on. It's running on my phone."*

https://xkcd.com/2054/

# End-to-end Machine Learning project (Part II: Preparing the data)

The purpose of this activity is to understand the workflow of a machine learning project from start to finish. The specific task and ML algorithms we see in this notebook are not as important as understanding the process that we go through to approach the problem. 

## Problem: Predict house prices

Suppose you are a data scientist working for a real estate company. Your task is to predict median house values in Californian districts, given a number of features from these districts.

The data set is based on the 1990 California census data. For the purpose of the example, the book author (Geron) added a categorical attribute and removed some features. 

An *input* instance in this problem is a *block group* (referred to as a *district* in the book). A block group has a population of 600 to 3000 people. The *output* is the *median house price* for the *block group* (district).

**Note (from Geron)**: You may find little differences between the code outputs in the book and in these Jupyter notebooks: these slight differences are mostly due to the random nature of many training algorithms: although I have tried to make these notebooks' outputs as constant as possible, it is impossible to guarantee that they will produce the exact same output on every platform. Also, some data structures (such as dictionaries) do not preserve the item order. Finally, I fixed a few minor bugs (I added notes next to the concerned cells) which lead to slightly different results, without changing the ideas presented in the book.

# Setup

In [1]:
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

# Where to save the figures
PROJECT_ROOT_DIR = "."

# Ignore useless warnings (see SciPy issue #5998)
import warnings
warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")

#
# load our housing data
#
from six.moves import urllib

# URL for data file
DOWNLOAD_URL = "https://raw.githubusercontent.com/ackleywill/CPTR435/main/housing.csv"

# local path where data file will be stored on computer (or in virtual environment)
HOUSING_PATH = os.path.join("datasets", "housing")

def fetch_housing_data(housing_url=DOWNLOAD_URL, housing_path=HOUSING_PATH):
    # create local directories for storing data files (if necessary)
    # NOTE: if running this in Colaboratory, these directories will not be
    # created on your computer, but in the virtual environment for the notebook
    # in colaboratory. It will only be available to this notebook, not others.
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)

    # build local path for data file
    csv_path = os.path.join(housing_path, "housing.csv")
    # download data file if not already downloaded
    if not os.path.isfile(csv_path):
        urllib.request.urlretrieve(housing_url, csv_path)

fetch_housing_data()

# 
# Read housing data from file and store in pandas dataframe
#
import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

housing = load_housing_data()

#
# perform stratified sampling
#
from sklearn.model_selection import StratifiedShuffleSplit

# Divide by 1.5 to limit the number of income categories
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
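# Label those above 5 as 5 (assign the result instead of using inplace=True,
# which is deprecated for chained operations in recent pandas versions)
housing["income_cat"] = housing["income_cat"].where(housing["income_cat"] < 5, 5.0)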

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    print('Training samples: {}, testing samples: {}'.format(train_index.shape, test_index.shape))
    
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

# we only used income category for a stratified split of the data
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

Training samples: (16512,), testing samples: (4128,)


# Prepare the data for Machine Learning algorithms

After exploring our data set to gain insights, we want to develop an ML system to work with it. However, before we feed our data into an ML system, we often need to prepare and clean the data first. 

First, we start with a clean copy of our training set. *The one without the combined attributes.* 

In [2]:
strat_train_set.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
12655,-121.46,38.52,29.0,3873.0,797.0,2237.0,706.0,2.1736,72100.0,INLAND
15502,-117.23,33.09,7.0,5320.0,855.0,2015.0,768.0,6.3373,279600.0,NEAR OCEAN
2908,-119.04,35.37,44.0,1618.0,310.0,667.0,300.0,2.875,82700.0,INLAND
14053,-117.13,32.75,24.0,1877.0,519.0,898.0,483.0,2.2264,112500.0,NEAR OCEAN
20496,-118.7,34.28,27.0,3536.0,646.0,1837.0,580.0,4.4964,238300.0,<1H OCEAN


We will also separate the *input attributes* from the *output value* (median house price). 

This will make it easier to feed the input attributes into our ML system. It will eliminate the chance that we accidentally use the output value as a feature in the system.

In [3]:
housing = strat_train_set.drop("median_house_value", axis=1) # drop labels for training set
housing_labels = strat_train_set["median_house_value"].copy()

housing.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity
12655,-121.46,38.52,29.0,3873.0,797.0,2237.0,706.0,2.1736,INLAND
15502,-117.23,33.09,7.0,5320.0,855.0,2015.0,768.0,6.3373,NEAR OCEAN
2908,-119.04,35.37,44.0,1618.0,310.0,667.0,300.0,2.875,INLAND
14053,-117.13,32.75,24.0,1877.0,519.0,898.0,483.0,2.2264,NEAR OCEAN
20496,-118.7,34.28,27.0,3536.0,646.0,1837.0,580.0,4.4964,<1H OCEAN


In [4]:
housing_labels.head()

12655     72100.0
15502    279600.0
2908      82700.0
14053    112500.0
20496    238300.0
Name: median_house_value, dtype: float64

## Data cleaning

With some data sets there are instances that are missing or have corrupt attributes. *Data cleaning* is the process of "fixing" the data set by correcting or removing these incomplete/erroneous instances.

For example, if we examine the data set, we will discover that "total bedrooms" has some missing values. There are several ways we can correct for this:
1. Remove the districts with missing "total bedrooms"
2. Remove the "total bedrooms" attribute from all districts
3. Set the missing value to some default value (zero, the average, the median, etc)

### Option 1: Remove districts with missing "total bedrooms"

In [5]:
sample_incomplete_rows = housing[housing.isnull().any(axis=1)].head()
sample_incomplete_rows

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity
1606,-122.08,37.88,26.0,2947.0,,825.0,626.0,2.933,NEAR BAY
10915,-117.87,33.73,45.0,2264.0,,1970.0,499.0,3.4193,<1H OCEAN
19150,-122.7,38.35,14.0,2313.0,,954.0,397.0,3.7813,<1H OCEAN
4186,-118.23,34.13,48.0,1308.0,,835.0,294.0,4.2891,<1H OCEAN
16885,-122.4,37.58,26.0,3281.0,,1145.0,480.0,6.358,NEAR OCEAN


In [6]:
sample_incomplete_rows.dropna(subset=["total_bedrooms"])    # option 1

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity


Note that `dropna` returns a modified copy of the dataframe. It does not modify the original dataframe. Here the result is empty because every row in the sample is missing `total_bedrooms`.

Also, this method does not remove the `total_bedrooms` column. It only removes the rows with no entry for it.

In [7]:
sample_incomplete_rows

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity
1606,-122.08,37.88,26.0,2947.0,,825.0,626.0,2.933,NEAR BAY
10915,-117.87,33.73,45.0,2264.0,,1970.0,499.0,3.4193,<1H OCEAN
19150,-122.7,38.35,14.0,2313.0,,954.0,397.0,3.7813,<1H OCEAN
4186,-118.23,34.13,48.0,1308.0,,835.0,294.0,4.2891,<1H OCEAN
16885,-122.4,37.58,26.0,3281.0,,1145.0,480.0,6.358,NEAR OCEAN


### Option 2: Remove the "total bedrooms" attribute from all districts

The following will remove the `total_bedrooms` column altogether.

In [8]:
sample_incomplete_rows.drop("total_bedrooms", axis=1)       # option 2

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,population,households,median_income,ocean_proximity
1606,-122.08,37.88,26.0,2947.0,825.0,626.0,2.933,NEAR BAY
10915,-117.87,33.73,45.0,2264.0,1970.0,499.0,3.4193,<1H OCEAN
19150,-122.7,38.35,14.0,2313.0,954.0,397.0,3.7813,<1H OCEAN
4186,-118.23,34.13,48.0,1308.0,835.0,294.0,4.2891,<1H OCEAN
16885,-122.4,37.58,26.0,3281.0,1145.0,480.0,6.358,NEAR OCEAN


### Option 3: Set the missing value to some default value (zero, the average, the median, etc.)

In [9]:
median = housing["total_bedrooms"].median()
print('median total bedrooms: {}'.format(median))

sample_incomplete_rows["total_bedrooms"].fillna(median, inplace=True) # option 3
sample_incomplete_rows

median total bedrooms: 433.0


Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity
1606,-122.08,37.88,26.0,2947.0,433.0,825.0,626.0,2.933,NEAR BAY
10915,-117.87,33.73,45.0,2264.0,433.0,1970.0,499.0,3.4193,<1H OCEAN
19150,-122.7,38.35,14.0,2313.0,433.0,954.0,397.0,3.7813,<1H OCEAN
4186,-118.23,34.13,48.0,1308.0,433.0,835.0,294.0,4.2891,<1H OCEAN
16885,-122.4,37.58,26.0,3281.0,433.0,1145.0,480.0,6.358,NEAR OCEAN


### ``SimpleImputer`` class
Scikit-Learn provides the class ``SimpleImputer`` that will automatically fix missing values in a data set.




In [10]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")

This object will replace missing attributes with the median of that attribute. It will do this to all attributes in the set, not just one particular attribute such as "total bedrooms".

Since the median may only be calculated for *numeric* attributes, we need to remove the "ocean proximity" attribute (a text attribute).
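
As a minimal sketch of what the imputer does, the toy example below fits a median imputer on a small made-up table. The DataFrame `toy` and its column names are hypothetical and are not part of the housing data. Each numeric column gets its own median.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: two numeric columns, each with one missing value
toy = pd.DataFrame({
    "rooms":    [2.0, 4.0, np.nan, 8.0],
    "bedrooms": [1.0, np.nan, 3.0, 5.0],
})

toy_imputer = SimpleImputer(strategy="median")
toy_filled = toy_imputer.fit_transform(toy)  # returns a NumPy array by default

print(toy_imputer.statistics_)  # per-column medians learned by fit()
print(toy_filled)               # missing entries replaced by those medians
```

The medians are computed from the non-missing values only (4.0 for `rooms`, 3.0 for `bedrooms`).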

In [11]:
housing_num = housing.drop('ocean_proximity', axis=1)
# alternatively: housing_num = housing.select_dtypes(include=[np.number])

Calculate the median for each attribute and store the result in ``statistics_``.

In [12]:
imputer.fit(housing_num)

In [13]:
imputer.statistics_

array([-118.51   ,   34.26   ,   29.     , 2119.     ,  433.     ,
       1164.     ,  408.     ,    3.54155])

Check that this is the same as manually computing the median of each attribute:

In [14]:
housing_num.median().values

array([-118.51   ,   34.26   ,   29.     , 2119.     ,  433.     ,
       1164.     ,  408.     ,    3.54155])

Transform the training set and replace missing values with the median:

In [15]:
X = imputer.transform(housing_num)

The output ``X`` is an ``ndarray``. We need to convert it back to a ``DataFrame``.

So that we can easily find the instances that were missing attributes, we assign the *original indices* to the districts.

In [16]:
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index = list(housing.index.values))

In [17]:
print(housing.index.values)

[12655 15502  2908 ... 19263 19140 19773]


If we look at the entries that were missing "total bedrooms" values, we see the median value for that attribute (433). Since we've assigned the original indices to each district, we can now list just those that were missing attributes.

In [18]:
housing_tr.loc[sample_incomplete_rows.index.values]

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income
1606,-122.08,37.88,26.0,2947.0,433.0,825.0,626.0,2.933
10915,-117.87,33.73,45.0,2264.0,433.0,1970.0,499.0,3.4193
19150,-122.7,38.35,14.0,2313.0,433.0,954.0,397.0,3.7813
4186,-118.23,34.13,48.0,1308.0,433.0,835.0,294.0,4.2891
16885,-122.4,37.58,26.0,3281.0,433.0,1145.0,480.0,6.358


In [19]:
imputer.strategy

'median'

Now that we've checked the instances with missing values, we can reconvert ``X`` to a ``DataFrame``. This time, we will let the instances use default indices (i.e. the first row has index 0 instead of 12655).

In [20]:
housing_tr = pd.DataFrame(X, columns=housing_num.columns)
housing_tr.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income
0,-121.46,38.52,29.0,3873.0,797.0,2237.0,706.0,2.1736
1,-117.23,33.09,7.0,5320.0,855.0,2015.0,768.0,6.3373
2,-119.04,35.37,44.0,1618.0,310.0,667.0,300.0,2.875
3,-117.13,32.75,24.0,1877.0,519.0,898.0,483.0,2.2264
4,-118.7,34.28,27.0,3536.0,646.0,1837.0,580.0,4.4964


## Handling text and categorical attributes

Now let's preprocess the *categorical* input feature, `ocean_proximity`. 

All of the attributes (or features) we've examined so far are *numeric* features. Their values are real numbers over various ranges. 

With *categorical* features, the attribute can take on *one of several discrete values*. There is no particular ordering to the values. The "ocean proximity" attribute has 5 possible values.

In [21]:
housing["ocean_proximity"].value_counts()

ocean_proximity
<1H OCEAN     7277
INLAND        5262
NEAR OCEAN    2124
NEAR BAY      1847
ISLAND           2
Name: count, dtype: int64

In [22]:
housing_cat = housing[['ocean_proximity']]
housing_cat.head(10)

Unnamed: 0,ocean_proximity
12655,INLAND
15502,NEAR OCEAN
2908,INLAND
14053,NEAR OCEAN
20496,<1H OCEAN
1481,NEAR BAY
18125,<1H OCEAN
5830,<1H OCEAN
17989,<1H OCEAN
4861,<1H OCEAN


### Encoding categorical attributes
Most ML algorithms work exclusively with *numeric* input values. For an algorithm to use a *categorical* input feature, the value must be encoded as a numeric value. There are various approaches for encoding categorical attributes.

#### Label encoding
One approach is to assign numbers to each possible value (or *label*) that the categorical feature may have. For example, we may assign 0 to `<1H OCEAN`, 1 to `INLAND`, 2 to `ISLAND`, and so on. 

While this is simple to implement, it has a drawback. The ML algorithm may figure that labels with close encoding values (e.g. 0 and 1) are more similar than those with encoding values that are farther apart (e.g. 0 and 4). In many cases this may not be true and could lead to poor performance.

       

In [23]:
from sklearn.preprocessing import LabelEncoder

In [24]:
encoder = LabelEncoder()
housing_cat = housing["ocean_proximity"]
housing_cat.head()

12655        INLAND
15502    NEAR OCEAN
2908         INLAND
14053    NEAR OCEAN
20496     <1H OCEAN
Name: ocean_proximity, dtype: object

In [25]:
housing_cat_encoded = encoder.fit_transform(housing_cat)
housing_cat_encoded[:5]

array([1, 4, 1, 4, 0])

In [26]:
encoder.classes_

array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
      dtype=object)
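
The drawback of label encoding described above can be made concrete with a short sketch. It encodes the five category labels on their own (the names `toy_encoder` and `code_of` are made up for this illustration) and prints the numeric "distance" between some pairs of codes.

```python
from sklearn.preprocessing import LabelEncoder

# The five category labels that appear in this data set
labels = ["<1H OCEAN", "INLAND", "ISLAND", "NEAR BAY", "NEAR OCEAN"]

toy_encoder = LabelEncoder()
codes = toy_encoder.fit_transform(labels)
code_of = dict(zip(labels, codes))

# Numeric "distance" implied by the codes. It reflects the sorted order of
# the label strings only, not any real similarity between the categories.
print(abs(code_of["<1H OCEAN"] - code_of["INLAND"]))      # 1
print(abs(code_of["<1H OCEAN"] - code_of["NEAR OCEAN"]))  # 4
```

An algorithm that treats these codes as ordinary numbers would see `<1H OCEAN` as closer to `INLAND` than to `NEAR OCEAN`, which is an artifact of the encoding.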

#### One-hot encoding
Another approach is to represent a categorical value with a 1D array (or *vector*). The array has an entry for each possible label the feature may have. If the feature has a particular label, then this is represented by an array with a 1 in the location for that label, and 0 for all other locations in the array. It's called *one-hot* since only one entry will be 1 (hot), the rest will be 0 (cold).

Example: Consider "ocean proximity". Suppose our one-hot encoding vector uses the following order:
    
    ['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN']    
    
For the value ``'<1H OCEAN'``, it is encoded with the vector:

    [1, 0, 0, 0, 0]
    
For the value ``'NEAR BAY'``, it is encoded with the vector:

    [0, 0, 0, 1, 0]
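
The two vectors above can be reproduced by hand with a few lines of NumPy. This is only a sketch of the idea, with a hypothetical helper `one_hot`; it is not how Scikit-Learn implements its encoder.

```python
import numpy as np

# Category order assumed in the example above
categories = ["<1H OCEAN", "INLAND", "ISLAND", "NEAR BAY", "NEAR OCEAN"]

def one_hot(label, categories=categories):
    """Return a vector with a 1 at the label's position and 0 elsewhere."""
    vector = np.zeros(len(categories), dtype=int)
    vector[categories.index(label)] = 1
    return vector

print(one_hot("<1H OCEAN"))  # [1 0 0 0 0]
print(one_hot("NEAR BAY"))   # [0 0 0 1 0]
```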

In [27]:
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
housing_cat = housing[["ocean_proximity"]]
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
print(housing_cat_1hot)

  (0, 1)	1.0
  (1, 4)	1.0
  (2, 1)	1.0
  (3, 4)	1.0
  (4, 0)	1.0
  (5, 3)	1.0
  (6, 0)	1.0
  (7, 0)	1.0
  (8, 0)	1.0
  (9, 0)	1.0
  (10, 1)	1.0
  (11, 0)	1.0
  (12, 1)	1.0
  (13, 1)	1.0
  (14, 4)	1.0
  (15, 0)	1.0
  (16, 0)	1.0
  (17, 0)	1.0
  (18, 3)	1.0
  (19, 0)	1.0
  (20, 1)	1.0
  (21, 3)	1.0
  (22, 1)	1.0
  (23, 0)	1.0
  (24, 1)	1.0
  :	:
  (16487, 1)	1.0
  (16488, 0)	1.0
  (16489, 4)	1.0
  (16490, 4)	1.0
  (16491, 1)	1.0
  (16492, 1)	1.0
  (16493, 0)	1.0
  (16494, 0)	1.0
  (16495, 0)	1.0
  (16496, 1)	1.0
  (16497, 0)	1.0
  (16498, 4)	1.0
  (16499, 0)	1.0
  (16500, 0)	1.0
  (16501, 1)	1.0
  (16502, 1)	1.0
  (16503, 1)	1.0
  (16504, 1)	1.0
  (16505, 0)	1.0
  (16506, 0)	1.0
  (16507, 0)	1.0
  (16508, 1)	1.0
  (16509, 0)	1.0
  (16510, 0)	1.0
  (16511, 1)	1.0


By default, the `OneHotEncoder` class returns a *sparse matrix*, which stores only the non-zero entries, but we can convert it to a dense array if needed by calling the `toarray()` method:

In [28]:
housing_cat_1hot.toarray()

array([[0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0.],
       ...,
       [1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.]])
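
To see why a sparse result is a reasonable default for one-hot data, the sketch below builds a made-up one-hot matrix (1000 rows, 5 categories, randomly chosen; not the housing data) and compares how many values the dense and sparse forms store. It uses SciPy's `csr_matrix` directly as an assumption for illustration.

```python
import numpy as np
from scipy import sparse

# Hypothetical one-hot matrix: 1000 rows, 5 categories, exactly one 1 per row
rng = np.random.default_rng(42)
rows = np.arange(1000)
cols = rng.integers(0, 5, size=1000)
dense = np.zeros((1000, 5))
dense[rows, cols] = 1.0

compact = sparse.csr_matrix(dense)

print(dense.size)   # 5000 values stored in the dense array
print(compact.nnz)  # 1000 non-zero values stored in the sparse matrix
print(np.array_equal(compact.toarray(), dense))  # True
```

With more categories the gap grows, since a one-hot row always contains a single non-zero entry.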

Alternatively, you can set `sparse=False` when creating the `OneHotEncoder` (in scikit-learn 1.2 and later this parameter is named `sparse_output`):

In [29]:
cat_encoder = OneHotEncoder(sparse=False)
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot



array([[0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0.],
       ...,
       [1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.]])

In [30]:
cat_encoder.categories_

[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]

## Custom transformer

Transformer objects in Scikit-Learn perform some transformation on a data set. Examples we've seen so far are `SimpleImputer` and the encoder classes. 

We can create our own custom transformer that works with Scikit-Learn's API. We will create a custom transformer to automatically add the combined attributes we experimented with earlier.

When creating a custom transformer, the class must provide the following methods:
 - `fit()`
 - `transform()`
 - `fit_transform()` (inherited automatically when the class derives from `TransformerMixin`)
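
As a minimal sketch of these requirements, the hypothetical transformer below (`ValueDoubler` is a made-up name, not part of Scikit-Learn) defines only `fit()` and `transform()`. It still supports `fit_transform()` because it derives from `TransformerMixin`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ValueDoubler(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that multiplies every value by 2."""

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn from the data

    def transform(self, X):
        return np.asarray(X) * 2

doubler = ValueDoubler()

# fit_transform() is not defined in the class; TransformerMixin supplies it
doubled = doubler.fit_transform(np.array([[1.0, 2.0], [3.0, 4.0]]))
print(doubled)

# get_params() comes from BaseEstimator (empty here: no constructor arguments)
print(doubler.get_params())
```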

In [31]:
from sklearn.base import BaseEstimator, TransformerMixin

# column index
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True): # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
        
    def fit(self, X, y=None):
        return self  # nothing else to do
    
    def transform(self, X, y=None):
        """ Add attributes: rooms_per_household, bedrooms_per_room, population_per_household. """
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]


# create instance of our new transformer (skip bedrooms per room for now)
attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)

# add new attributes (but not bedrooms per room)
housing_extra_attribs = attr_adder.transform(housing.values)

In [32]:
housing_extra_attribs = pd.DataFrame(
    housing_extra_attribs,
    columns=list(housing.columns)+["rooms_per_household", "population_per_household"])
housing_extra_attribs.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity,rooms_per_household,population_per_household
0,-121.46,38.52,29.0,3873.0,797.0,2237.0,706.0,2.1736,INLAND,5.485836,3.168555
1,-117.23,33.09,7.0,5320.0,855.0,2015.0,768.0,6.3373,NEAR OCEAN,6.927083,2.623698
2,-119.04,35.37,44.0,1618.0,310.0,667.0,300.0,2.875,INLAND,5.393333,2.223333
3,-117.13,32.75,24.0,1877.0,519.0,898.0,483.0,2.2264,NEAR OCEAN,3.886128,1.859213
4,-118.7,34.28,27.0,3536.0,646.0,1837.0,580.0,4.4964,<1H OCEAN,6.096552,3.167241


## Creating a transformation pipeline

Machine learning systems often have a *pipeline* architecture. The system consists of stages where the output of one stage becomes the input to the next stage. Each stage is responsible for performing some type of processing on the data. 

Scikit-Learn provides a `Pipeline` class for convenient creation and execution of a pipeline.

Now let's build a pipeline for preprocessing the numerical attributes.

In [33]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

housing_num_tr = num_pipeline.fit_transform(housing_num)

A `Pipeline` consists of a list of name/estimator pairs. When `fit_transform` is called, the estimator objects are applied to the data in the order they appear in the pipeline list. Here our pipeline:
1. Fills in missing attribute values.
2. Adds the combined attributes.
3. Scales the values for each of the numeric attributes.

For this example, we use the `StandardScaler` transformer to scale the attribute values. It subtracts the column mean and divides by the column standard deviation, so each attribute column ends up with a *mean* of 0 and a *variance* of 1.

Try different attribute columns. For each, the mean should be 0 and variance 1.

In [34]:
print(housing_num_tr.shape)
print('Feature mean: {}'.format(np.mean(housing_num_tr[:,2])))
print('Feature variance: {}'.format(np.var(housing_num_tr[:,2])))

(16512, 11)
Feature mean: 8.778507636571005e-17
Feature variance: 1.0000000000000002
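
The claim that a pipeline simply applies its stages in order can be checked on a small example. The sketch below uses a made-up one-column array (`X_toy` is hypothetical) and compares the pipeline's output with the same two stages applied by hand.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with one missing value
X_toy = np.array([[1.0], [np.nan], [3.0], [5.0]])

toy_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
via_pipeline = toy_pipeline.fit_transform(X_toy)

# The same two stages applied by hand, in the same order
filled = SimpleImputer(strategy="median").fit_transform(X_toy)
by_hand = StandardScaler().fit_transform(filled)

print(np.allclose(via_pipeline, by_hand))  # True
```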


We also need a transformer that selects a subset of the Pandas DataFrame columns. We will use it later to select the categorical features.

In [35]:
from sklearn.base import BaseEstimator, TransformerMixin

# Create a class to select numerical or categorical columns 
# since Scikit-Learn doesn't handle DataFrames yet
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

Now let's join all these components into a big pipeline that will preprocess both the numerical and the categorical features:

In [36]:
print(list(housing_num))


['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income']


In [37]:
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

# pipeline for cleaning and preparing numerical features
num_pipeline = Pipeline([
        ('selector', DataFrameSelector(num_attribs)),
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])


# pipeline for preparing categorical features
cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('cat_encoder', OneHotEncoder(sparse=False)),
    ])

`FeatureUnion` is a way to concatenate pipelines: it runs each pipeline on the same input and joins their outputs side by side.

In [38]:
from sklearn.pipeline import FeatureUnion

full_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline),
    ])


In [39]:
housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared[0]



array([-0.94135046,  1.34743822,  0.02756357,  0.58477745,  0.64037127,
        0.73260236,  0.55628602, -0.8936472 ,  0.01739526,  0.00622264,
       -0.12112176,  0.        ,  1.        ,  0.        ,  0.        ,
        0.        ])

In [40]:
housing_prepared.shape

(16512, 16)
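
The 16 columns are the 11 numeric features followed by the 5 one-hot columns. The column-wise concatenation can be seen on a toy input; in the sketch below, `X_toy` and the two `FunctionTransformer` steps are made up for illustration and stand in for the real pipelines.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

X_toy = np.array([[1.0, 2.0], [3.0, 4.0]])

toy_union = FeatureUnion(transformer_list=[
    ("identity", FunctionTransformer()),          # passes the 2 columns through
    ("squares", FunctionTransformer(np.square)),  # 2 more columns: squared values
])
combined = toy_union.fit_transform(X_toy)

print(combined.shape)  # (2, 4)
print(combined)
```

Each transformer receives the same input, and the outputs are stacked horizontally in the order the transformers are listed.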

# Questions
1.	What options are used in the data cleaning step in relation to the “total bedrooms” attribute?

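1. Three options are shown: remove the districts with missing "total bedrooms", remove the "total bedrooms" attribute from all districts, or set the missing values to some default value (zero, the average, the median, etc.).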

2.	What does Scikit-Learn's SimpleImputer class accomplish?

2. It replaces missing attribute values automatically. With `strategy="median"`, as used here, each missing value is replaced with the median of that attribute.

3.	Why is it important to transform categorical data into numeric data?

3. Because most ML algorithms work exclusively with numeric input values

4.	What is the difference between Label encoding and One-hot encoding approaches?

4. Label encoding assigns numbers to each possible value (or label) that the categorical feature may have. One-hot encoding represents a categorical value with a 1D array (or vector).

5.	What is the advantage of using One-hot encoding instead of Label encoding?

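5. One-hot encoding does not impose an ordering on the categories. With label encoding, the ML algorithm may figure that labels with close encoding values (e.g. 0 and 1) are more similar than those farther apart (e.g. 0 and 4), which is often not true and could lead to poor performance.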

6.	What does the custom transformer (CombinedAttributesAdder) created in the example do?

6. It adds the combined attributes we experimented with earlier.

7.	What is the advantage of using a transformation pipeline?

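7. It chains the preprocessing stages so that the output of one stage becomes the input to the next, and all of them are applied in a fixed order with a single call to `fit_transform`.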

8.	What steps were included in the final pipelines?

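8. The numerical pipeline selects the numeric columns, fills in missing attribute values, adds the combined attributes, and scales the values. The categorical pipeline selects `ocean_proximity` and one-hot encodes it. A `FeatureUnion` joins the two pipelines.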