CPTR 435 Machine Learning 


Name: Kaleb Tsegaye

This activity is adapted from the notebook provided for chapter 2 of *Hands-On Machine Learning with Scikit-Learn & TensorFlow* by Geron (2017).

For the original notebook and all other code/data from the book, see:
https://github.com/ageron/handson-ml


# End-to-end Machine Learning project (Part III: Training the model)

The purpose of this activity is to understand the workflow of a machine learning project from start to finish. The specific task and ML algorithms we see in this notebook are not as important as understanding the process that we go through to approach the problem.

## Problem: Predict house prices

Suppose you are a data scientist working for a real estate company. Your task is to predict median house values in Californian districts, given a number of features from these districts.

The main steps you will go through are:
1. Look at the big picture.
2. Get the data.
3. Discover and visualize the data to gain insights.
4. Prepare the data for Machine Learning algorithms.
5. Select a model and train it.
6. Fine-tune your model.
7. Present your solution.
8. Launch, monitor, and maintain your system.

The data set is based on the 1990 California census data. For the purposes of this example, the book's author (Geron) added a categorical attribute and removed some features.

An *input* instance in this problem is a *block group* (referred to as a *district* in the book). A block group typically has a population of 600 to 3000 people. The *output* is the *median house price* for the *block group* (district).

**Note (from Geron)**: You may find little differences between the code outputs in the book and in these Jupyter notebooks: these slight differences are mostly due to the random nature of many training algorithms: although I have tried to make these notebooks' outputs as constant as possible, it is impossible to guarantee that they will produce the exact same output on every platform. Also, some data structures (such as dictionaries) do not preserve the item order. Finally, I fixed a few minor bugs (I added notes next to the concerned cells) which lead to slightly different results, without changing the ideas presented in the book.

# Setup

In [1]:
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

# Where to save the figures
PROJECT_ROOT_DIR = "."

# Ignore useless warnings (see SciPy issue #5998)
import warnings
warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")

#
# load our housing data
#
from six.moves import urllib

# URL for data file
DOWNLOAD_URL = "https://raw.githubusercontent.com/ackleywill/CPTR435/main/housing.csv"
# local path where data file will be stored on computer (or in virtual environment)
HOUSING_PATH = os.path.join("datasets", "housing")

def fetch_housing_data(housing_url=DOWNLOAD_URL, housing_path=HOUSING_PATH):
    # create local directories for storing data files (if necessary)
    # NOTE: if running this in Colaboratory, these directories will not be
    # created on your computer, but in the virtual environment for the notebook
    # in colaboratory. It will only be available to this notebook, not others.
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)

    # build local path for data file
    csv_path = os.path.join(housing_path, "housing.csv")
    # download the data file only if it is not already present locally
    if not os.path.isfile(csv_path):
        urllib.request.urlretrieve(housing_url, csv_path)

fetch_housing_data()

# 
# Read housing data from file and store in pandas dataframe
#
import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

housing = load_housing_data()

#
# perform stratified sampling
#
from sklearn.model_selection import StratifiedShuffleSplit

housing["income_cat"] = pd.cut(housing['median_income'], 
                               bins=[0, 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])


split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    print('Training samples: {}, testing samples: {}'.format(train_index.shape, test_index.shape))
    
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

# we only used income category for a stratified split of the data
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)
    
housing = strat_train_set.drop("median_house_value", axis=1) # drop labels for training set
housing_labels = strat_train_set["median_house_value"].copy()

#
# prepare data
#
from sklearn.base import BaseEstimator, TransformerMixin

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True): # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
        
    def fit(self, X, y=None):
        return self  # nothing else to do
    
    def transform(self, X, y=None):
        """ Add attributes: rooms_per_household, bedrooms_per_room, population_per_household. """
        # column index
        rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

housing_num = housing.drop('ocean_proximity', axis=1)
housing_num_tr = num_pipeline.fit_transform(housing_num)

# Create a class to select numerical or categorical columns 
# since Scikit-Learn doesn't handle DataFrames yet
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
        ('selector', DataFrameSelector(num_attribs)),
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('cat_encoder', OneHotEncoder(sparse=False)),  # scikit-learn >= 1.2: use sparse_output=False instead
    ])

from sklearn.pipeline import FeatureUnion

full_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline),
    ])

housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared.shape

Training samples: (16512,), testing samples: (4128,)




(16512, 16)

# Select and train a model 

So we are finally ready to train a machine learning model to predict median house prices. To recap, we acquired the data, explored it, created train/test sets, and cleaned the data. While it may feel like we got stuck in the weeds for a while, data exploration and preparation are important and time-consuming parts of real-world ML applications.

## Training and evaluating on the training set

If you recall, we want to predict median house prices for given districts in California. This is an example of a *regression* task. There are various approaches we can use for this problem, including developing complex neural net models. However, an important principle with ML is to start simple and add complexity as the problem demands it.

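For regression, a good algorithm to start with is *linear regression*. This algorithm fits a linear function to the training data (a straight line when there is a single input feature, a hyperplane when there are several, as here) and uses the resulting linear equation to predict values for new inputs.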

*Is it a good model for our problem?* 

We don't know until we try. But linear regression can work reasonably well in many situations and it is easy to compute (included in many libraries), so it's a good starting point.
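To make the idea concrete, here is a minimal sketch on synthetic data (not the housing set). The names `X_toy`, `y_toy`, and `toy_reg`, and the "true" coefficients 3, -2, and 5, are made up for illustration. The only point is that `LinearRegression` recovers an approximately linear relationship when one exists.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y = 3*x0 - 2*x1 + 5 + small noise
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 2))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + 5.0 + rng.normal(0, 0.1, size=200)

# Fit the linear model and inspect the learned equation
toy_reg = LinearRegression().fit(X_toy, y_toy)
print("coefficients:", toy_reg.coef_)
print("intercept:", toy_reg.intercept_)
```

The learned coefficients and intercept should land close to the values used to generate the data. On real data such as ours, the relationship is rarely this clean, which is why we need to measure the error.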



In [2]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()

# train the model
# housing_prepared is the cleaned training data (input)
# housing_labels is the array of median house values (output)
lin_reg.fit(housing_prepared, housing_labels)

Before we go nuts on the whole training set, let's just make a few predictions with instances from our training set and see what we are getting. This is just a *sanity check*. It is not proper evaluation of our model.

In [3]:
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]

In [4]:
some_data.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|
| 12655 | -121.46 | 38.52 | 29.0 | 3873.0 | 797.0 | 2237.0 | 706.0 | 2.1736 | INLAND |
| 15502 | -117.23 | 33.09 | 7.0 | 5320.0 | 855.0 | 2015.0 | 768.0 | 6.3373 | NEAR OCEAN |
| 2908 | -119.04 | 35.37 | 44.0 | 1618.0 | 310.0 | 667.0 | 300.0 | 2.875 | INLAND |
| 14053 | -117.13 | 32.75 | 24.0 | 1877.0 | 519.0 | 898.0 | 483.0 | 2.2264 | NEAR OCEAN |
| 20496 | -118.7 | 34.28 | 27.0 | 3536.0 | 646.0 | 1837.0 | 580.0 | 4.4964 | <1H OCEAN |


In [5]:
# let's try the full pipeline on a few training instances
some_data_prepared = full_pipeline.transform(some_data)


In [6]:
print("Predictions:", lin_reg.predict(some_data_prepared))

Predictions: [ 85657.90192014 305492.60737488 152056.46122456 186095.70946094
 244550.67966089]


Compare against the actual values:

In [7]:
print("Labels:", list(some_labels))

Labels: [72100.0, 279600.0, 82700.0, 112500.0, 238300.0]


*Well, it looks like the predictions are roughly the same scale*.

*Is this good?*

*How can we tell?*

## Measuring error

Since this is a *regression* task, we will use *Root Mean Square Error* (RMSE) to evaluate our models.

$$ \text{error} = \sqrt{\frac{1}{m}\sum\limits_{i=1}^{m}{(\text{predicted}_i - \text{correct}_i)^2}} $$


Scikit-Learn has a function for calculating *mean squared error* (MSE). To get RMSE, we just compute the square root of the value returned by this function.
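As a quick check of the formula, the sketch below computes RMSE by hand with NumPy and compares it with the square root of Scikit-Learn's `mean_squared_error`. The five label/prediction pairs are the (rounded) values from the sanity check above. The variable names `sample_labels` and `sample_preds` are introduced here only for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Labels and (rounded) linear regression predictions from the sanity check above
sample_labels = np.array([72100.0, 279600.0, 82700.0, 112500.0, 238300.0])
sample_preds = np.array([85657.9, 305492.6, 152056.5, 186095.7, 244550.7])

# RMSE written out exactly as in the formula: square, average, square root
rmse_manual = np.sqrt(np.mean((sample_preds - sample_labels) ** 2))

# The same quantity via Scikit-Learn's MSE function
rmse_sklearn = np.sqrt(mean_squared_error(sample_labels, sample_preds))

print(rmse_manual, rmse_sklearn)
```

Both numbers should agree, which confirms that taking the square root of the MSE gives the RMSE defined above.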



In [8]:
# predict values for each sample in training set
housing_predictions = lin_reg.predict(housing_prepared)

In [9]:
from sklearn.metrics import mean_squared_error

# compute RMSE for training set (training error)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

68627.87390018745

Our error on the training set is almost \$70K. Since this is the same set that the model was trained on, we don't get too excited one way or the other.

Of course we would like the error to be close to zero. Minimizing training error is, in fact, the goal of training. However, in practice, too much focus on minimizing training error often leads to *overfitting* and, counterintuitively, poor generalization to new instances. This unintuitive tradeoff creates a challenge when tuning the system for optimal performance. 

If you're curious (of course you are), we can also compute the *mean absolute error* measure that we first developed.

In [10]:
from sklearn.metrics import mean_absolute_error

lin_mae = mean_absolute_error(housing_labels, housing_predictions)
lin_mae

49438.66860915801

Notice that this error is lower. It is always the case that MAE $\le$ RMSE. RMSE squares each error before averaging, so it is more sensitive to large prediction errors. For some tasks, avoiding occasional large errors is important; for others, it may not be.

If outliers get exponentially rarer as they get more extreme (think of tail of bell shaped curve), then RMSE tends to be preferred over MAE. Small errors are less of a problem than bigger errors. Selecting a model with smaller RMSE tends to lead to models with fewer extreme prediction errors.
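The difference between the two measures is easiest to see on a tiny made-up example. In the sketch below, the two error vectors (`errors_even` and `errors_outlier`, both hypothetical) have the same MAE, but one of them concentrates the error in a single large miss.

```python
import numpy as np

def mae(errors):
    """Mean absolute error of a vector of prediction errors."""
    return np.mean(np.abs(errors))

def rmse(errors):
    """Root mean square error of a vector of prediction errors."""
    return np.sqrt(np.mean(errors ** 2))

# Two hypothetical sets of prediction errors with the same total absolute error
errors_even = np.array([10.0, 10.0, 10.0, 10.0])    # every prediction off by 10
errors_outlier = np.array([1.0, 1.0, 1.0, 37.0])    # three good, one very bad

for name, e in [("even", errors_even), ("outlier", errors_outlier)]:
    print(name, "MAE:", mae(e), "RMSE:", rmse(e))
```

MAE cannot tell the two cases apart, while RMSE is clearly larger for the case with the single large miss. This is why choosing a model by RMSE tends to penalize extreme prediction errors.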


### Decision tree regression

There are other regression algorithms besides linear regression. For instance, we can perform regression using *decision trees*.

http://www.saedsayad.com/decision_tree_reg.htm

With *classification* problems, the leaves hold the label predictions. With *regression*, each leaf holds a numerical prediction.

In [11]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing_prepared, housing_labels)

In [12]:
housing_predictions = tree_reg.predict(housing_prepared)

In [13]:
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

0.0

*Zero error?* 

*That's good right?*

Sadly, it is probably an indication of *overfitting*. We won't know for sure until we are ready to deploy the system and apply our models to our test set.

# Fine-tune your model
As the decision tree regression experiment shows, there is a limit to how much we can tune our model with just a training set. 

It is dangerous to tune using the final held out test set. Iteratively training and testing on the test set reduces the usefulness of the test set. We start *teaching to the test*. 

Our goal is not an *A in the class*. 

Our goal is success in the real world.

Fooling ourselves by teaching to the test and receiving a high score will only lead to disappointment and poor performance when the ML system is deployed in the real world.

*What do we do?*

We have a few options. 

We could split our *training set* into a *smaller training set* and a *validation set*. The *validation set* is our "test set" for the purpose of tuning our model (and even selecting a different one such as a decision tree). 
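A minimal sketch of this first option, assuming Scikit-Learn's `train_test_split` and a placeholder data set (`X_demo`, `y_demo` are made up here; in this notebook you would pass `housing_prepared` and `housing_labels` instead):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder "training set" of 100 instances with 2 features each
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# Hold out 20% of the training set as a validation set
X_small_train, X_val, y_small_train, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

print(X_small_train.shape, X_val.shape)
```

The model is then trained on the smaller training set and tuned against the validation set, while the real test set stays untouched.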

Another option is to use *K-fold cross-validation*. We split the training data into $K$ equal subsets. Train on the collection of $(K-1)$ sets and test on the remaining set. Then we rotate the training sets. This train/test process is repeated $K$ times and the error or accuracy measure is averaged over the $K$ runs. Typically $K=10$, so the training set is split into $10$ subsets. The model is trained on 9 of the sets and tested on the other one. Then the process is repeated 9 more times, each time the test set is swapped with another from the training set until the system has been tested on all 10 sets individually. 

K-fold cross-validation is particularly useful when the number of training examples is small and splitting the set into train/validation leaves too few samples in the training set for successful training.
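The rotation described above can be sketched with Scikit-Learn's `KFold` on a tiny placeholder data set. Here $K=5$ and the 10-instance array `X_demo` are choices made only to keep the printout short.

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder data: 10 instances, 2 features each
X_demo = np.arange(20).reshape(10, 2)

kf = KFold(n_splits=5)
held_out = []      # every index that was used for testing in some fold
fold_sizes = []    # size of the held-out subset in each fold

for fold, (train_idx, val_idx) in enumerate(kf.split(X_demo)):
    print("fold", fold, "train on", train_idx, "test on", val_idx)
    held_out.extend(val_idx.tolist())
    fold_sizes.append(len(val_idx))
```

Each instance is held out exactly once, so after $K$ rounds the model has been tested on every instance without ever being tested on data it was trained on in that round.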

## Comparing models
Let's compare performance between linear regression and decision tree regression using 10-fold cross-validation.

First we compute error for *decision tree regression*.

In [14]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)

In [15]:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

display_scores(tree_rmse_scores)

Scores: [72831.45749112 69973.18438322 69528.56551415 72517.78229792
 69145.50006909 79094.74123727 68960.045444   73344.50225684
 69826.02473916 71077.09753998]
Mean: 71629.89009727491
Standard deviation: 2914.035468468928


Now compute error using *linear regression*.

In [16]:
lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)

Scores: [71762.76364394 64114.99166359 67771.17124356 68635.19072082
 66846.14089488 72528.03725385 73997.08050233 68802.33629334
 66443.28836884 70139.79923956]
Mean: 69104.07998247063
Standard deviation: 2880.3282098180653


Interesting.

Recall that the *training error* was about \$68.6K for *linear regression* and \$0 for the *decision tree*.

However, when *cross-validation* is used, so that the models are evaluated on instances they have not seen during training (a more realistic test), *linear regression* actually performs a bit better (mean RMSE of about \$69.1K versus \$71.6K).

*Why?*

The decision tree is a more complex (and more powerful) model than linear regression. It was able to perfectly learn the training set. However, it was *overfitting*. This is like memorizing the answers to last year's physics exam without understanding the reasoning behind the answers. Last year's answers do not help on this year's physics exam when the questions change slightly.
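The same effect can be reproduced on a small synthetic problem. The sketch below is an illustration under assumptions, not the housing experiment: the data (`X_noisy`, `y_noisy`) are a noisy straight line invented for this example, and the tree is left unrestricted as in the cell above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a straight line plus substantial noise (standard deviation 3)
rng = np.random.default_rng(0)
X_noisy = rng.permutation(np.linspace(0, 10, 300)).reshape(-1, 1)
y_noisy = 2.0 * X_noisy[:, 0] + rng.normal(0, 3.0, size=300)

# An unrestricted tree can give every training instance its own leaf
demo_tree = DecisionTreeRegressor(random_state=42)
demo_tree.fit(X_noisy, y_noisy)
train_rmse = np.sqrt(mean_squared_error(y_noisy, demo_tree.predict(X_noisy)))

# Cross-validation evaluates on instances the tree has not seen
cv_mse = -cross_val_score(demo_tree, X_noisy, y_noisy,
                          scoring="neg_mean_squared_error", cv=10)
cv_rmse = np.sqrt(cv_mse).mean()

print("training RMSE:", train_rmse)
print("cross-validation RMSE:", cv_rmse)
```

The tree memorizes the noise in the training set, so its training error says nothing about how it will do on new instances.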


### Random forest regressor

In practice, decision trees are often used with an *ensemble learning* approach. Multiple decision trees are trained on random subsets of the training set. Decision trees are very sensitive to changes in the training set, so adding or removing a few instances can lead to very different trees. This collection of trees (a *forest*) is treated like a committee.

When making a prediction, each tree makes its own prediction and the results are combined: averaged for regression, or decided by majority vote for classification. The resulting committee decision is often more accurate than that of any individual tree.
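For regression, the committee idea can be checked directly: a fitted `RandomForestRegressor` exposes its individual trees in `estimators_`, and averaging their predictions reproduces the forest's prediction. The data below (`X_rf`, `y_rf`) are synthetic and exist only for this sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, made up for illustration
rng = np.random.default_rng(0)
X_rf = rng.uniform(-3, 3, size=(300, 2))
y_rf = np.sin(X_rf[:, 0]) + 0.5 * X_rf[:, 1] + rng.normal(0, 0.1, size=300)

toy_forest = RandomForestRegressor(n_estimators=10, random_state=42)
toy_forest.fit(X_rf, y_rf)

# Ask each tree in the committee for its own prediction on 5 instances
per_tree = np.array([tree.predict(X_rf[:5]) for tree in toy_forest.estimators_])

# Average the committee's answers and compare with the forest's prediction
committee_average = per_tree.mean(axis=0)
forest_prediction = toy_forest.predict(X_rf[:5])

print(per_tree.shape)
print(committee_average)
print(forest_prediction)
```

The individual trees disagree with each other, but their average matches what the forest returns.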

In [17]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor(random_state=42, n_estimators=10)
forest_reg.fit(housing_prepared, housing_labels)

First we compute *training error*.

In [18]:
housing_predictions = forest_reg.predict(housing_prepared)

In [19]:
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

22413.454658589766

Now we compute error using 10-fold cross-validation. This may take a little while since the train/test cycle is repeated 10 times.

In [20]:
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                                scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)

Scores: [53519.05518628 50467.33817051 48924.16513902 53771.72056856
 50810.90996358 54876.09682033 56012.79985518 52256.88927227
 51527.73185039 55762.56008531]
Mean: 52792.92669114079
Standard deviation: 2262.8151900582


Much better than either linear regression or an individual decision tree. All we had to do was add more trees into the mix.

However, notice that the training error (about \$22.4K) is still much lower than the cross-validation error (about \$52.8K). The gap is not as extreme as with the single decision tree, but it is still there. This indicates that our model is still overfitting somewhat.

### Cross-validation statistics

With 10-fold cross-validation, we run the train/test process 10 times. Each run leads to a slightly different prediction error. By looking at statistics from these runs, we get some idea of what to expect in the future.

In [21]:
scores = cross_val_score(lin_reg, housing_prepared, housing_labels, 
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)

# print list of RMSE values
print(rmse_scores)

# load scores into a pandas Series to quickly calculate stats
pd.Series(rmse_scores).describe()

[71762.76364394 64114.99166359 67771.17124356 68635.19072082
 66846.14089488 72528.03725385 73997.08050233 68802.33629334
 66443.28836884 70139.79923956]


count       10.000000
mean     69104.079982
std       3036.132517
min      64114.991664
25%      67077.398482
50%      68718.763507
75%      71357.022543
max      73997.080502
dtype: float64
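You may notice that the standard deviation reported here (about 3036) differs from the one printed earlier by `display_scores` for the same linear regression scores (about 2880). The scores are identical; the difference comes from the default degrees of freedom. NumPy's `std()` divides by $n$ (`ddof=0`), while pandas divides by $n-1$ (`ddof=1`). The sketch below reuses the ten RMSE values printed above; `fold_rmse` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# The ten linear regression RMSE scores printed above
fold_rmse = np.array([71762.76364394, 64114.99166359, 67771.17124356,
                      68635.19072082, 66846.14089488, 72528.03725385,
                      73997.08050233, 68802.33629334, 66443.28836884,
                      70139.79923956])

numpy_std = fold_rmse.std()              # population std: divides by n
pandas_std = pd.Series(fold_rmse).std()  # sample std: divides by n - 1
numpy_sample_std = fold_rmse.std(ddof=1) # NumPy, matching the pandas default

print(numpy_std, pandas_std, numpy_sample_std)
```

With only 10 folds the two conventions differ by about 5%, so it is worth stating which one is being reported.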

## Tuning hyperparameters

Most ML algorithms have *hyperparameters*: settings chosen before training that may be adjusted to improve performance. For example, KNN has the distance measure and $K$, the number of nearest training instances used to label an input instance.

We can manually adjust hyperparameters. But this can be a tedious process where we change a hyperparameter, evaluate, change hyperparameter again, evaluate, ...

Fortunately, this process can be automated.

### GridSearchCV

This algorithm will try all possible combinations of predefined hyperparameter values. It will perform cross-validation on each one. As you may expect, this can take some time. We are rerunning train/test for each parameter combination and for each fold in our cross-validation. To speed things up, we may decide to use 5-fold cross-validation instead of 10-fold.

*Bootstrap aggregation (or Bagging)*: The following code also compares the effect of turning *bootstrap aggregation* off for generating random forests (it's on by default). Bootstrapping is a strategy for building different datasets from an original dataset. A sample set is created by randomly selecting (with replacement) instances from the original set. It's possible that some instances may appear multiple times in a set and some may be left out. 

Here we look at performance with `bootstrap=False` when generating our forest of decision trees. With bootstrapping turned off, the entire data set is used for training each tree, and different trees are then obtained only through randomness when selecting features for each split. With `bootstrap=True`, samples are drawn from the original training set *with replacement*, so the trees in our forest differ because each is trained on a different random resample of the original training set.
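Sampling with replacement can be sketched in a few lines of NumPy. This is only an illustration of the idea, not how Scikit-Learn implements it internally. The set size `n = 10_000` is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000  # size of the hypothetical original training set

# A bootstrap sample: draw n instance indices with replacement
boot_idx = rng.integers(0, n, size=n)

# Some instances appear several times, others are left out entirely
unique_fraction = np.unique(boot_idx).size / n
print("fraction of distinct instances in the bootstrap sample:", unique_fraction)
```

For large sets, a bootstrap sample contains roughly 63% of the distinct original instances (the limit is $1 - 1/e$), so each tree sees a noticeably different view of the data.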

For more on *Bootstrapping*, see:
https://machinelearningmastery.com/bagging-and-random-forest-ensemble-algorithms-for-machine-learning/
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html


*Note: The following may take a little while to run.*

In [22]:
from sklearn.model_selection import GridSearchCV

param_grid = [
    # try 12 (3×4) combinations of hyperparameters
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    # then try 6 (2×3) combinations with bootstrap set as False
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
  ]

In [23]:
forest_reg = RandomForestRegressor(random_state=42)

In [24]:
# train across 5 folds, that's a total of (12+6)*5=90 rounds of training 
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error', return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)
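The count of 90 training rounds can be verified with Scikit-Learn's `ParameterGrid`, which enumerates the combinations a grid search will try. The grid below is a copy of `param_grid` under the hypothetical name `demo_grid`, so that the sketch runs on its own.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as param_grid above, repeated so this sketch is self-contained
demo_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

n_candidates = len(ParameterGrid(demo_grid))  # 3*4 + 1*2*3 combinations
n_folds = 5
n_fits = n_candidates * n_folds

print(n_candidates, "candidates,", n_fits, "rounds of training")
```

By default `GridSearchCV` also refits the best combination once on the whole training set at the end, which is what makes `best_estimator_` available.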

The best hyperparameter combination found:

In [25]:
grid_search.best_params_

{'max_features': 8, 'n_estimators': 30}

In [26]:
grid_search.best_estimator_

Let's look at the score of each hyperparameter combination tested during the grid search:

In [27]:
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

63895.161577951665 {'max_features': 2, 'n_estimators': 3}
54916.32386349543 {'max_features': 2, 'n_estimators': 10}
52885.86715332332 {'max_features': 2, 'n_estimators': 30}
60075.3680329983 {'max_features': 4, 'n_estimators': 3}
52495.01284985185 {'max_features': 4, 'n_estimators': 10}
50187.24324926565 {'max_features': 4, 'n_estimators': 30}
58064.73529982314 {'max_features': 6, 'n_estimators': 3}
51519.32062366315 {'max_features': 6, 'n_estimators': 10}
49969.80441627874 {'max_features': 6, 'n_estimators': 30}
58895.824998155826 {'max_features': 8, 'n_estimators': 3}
52459.79624724529 {'max_features': 8, 'n_estimators': 10}
49898.98913455217 {'max_features': 8, 'n_estimators': 30}
62381.765106921855 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
54476.57050944266 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
59974.60028085155 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
52754.5632813202 {'bootstrap': False, 'max_features': 3, 'n_estimators': 1

In [28]:
pd.DataFrame(grid_search.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_features,param_n_estimators,param_bootstrap,params,split0_test_score,split1_test_score,...,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
0,0.250573,0.062541,0.004841,0.003954,2,3,,"{'max_features': 2, 'n_estimators': 3}",-4119912000.0,-3723465000.0,...,-4082592000.0,186737500.0,18,-1155630000.0,-1089726000.0,-1153843000.0,-1118149000.0,-1093446000.0,-1122159000.0,28342880.0
1,0.676676,0.107596,0.008127,0.000141,2,10,,"{'max_features': 2, 'n_estimators': 10}",-2973521000.0,-2810319000.0,...,-3015803000.0,113980800.0,11,-598294700.0,-590478100.0,-612385000.0,-572768100.0,-590521000.0,-592889400.0,12849780.0
2,2.079492,0.411783,0.026354,0.003298,2,30,,"{'max_features': 2, 'n_estimators': 30}",-2801229000.0,-2671474000.0,...,-2796915000.0,79808920.0,9,-441256700.0,-432639800.0,-455372200.0,-432074600.0,-431160600.0,-438500800.0,9184397.0
3,0.358277,0.100257,0.003203,0.003923,4,3,,"{'max_features': 4, 'n_estimators': 3}",-3528743000.0,-3490303000.0,...,-3609050000.0,137568300.0,16,-978236800.0,-980645500.0,-1003780000.0,-1016515000.0,-1011270000.0,-998089600.0,15773720.0
4,1.022382,0.079587,0.00973,0.003138,4,10,,"{'max_features': 4, 'n_estimators': 10}",-2742620000.0,-2609311000.0,...,-2755726000.0,118260400.0,7,-506321500.0,-525798300.0,-508198400.0,-517440500.0,-528206600.0,-517193100.0,8882622.0
5,3.182836,0.108004,0.027778,0.004325,4,30,,"{'max_features': 4, 'n_estimators': 30}",-2522176000.0,-2440241000.0,...,-2518759000.0,84880840.0,3,-377656800.0,-390210600.0,-388504200.0,-383086600.0,-389477900.0,-385787200.0,4774229.0
6,0.437753,0.10662,0.004883,0.003989,6,3,,"{'max_features': 6, 'n_estimators': 3}",-3362127000.0,-3311863000.0,...,-3371513000.0,137808600.0,13,-890939700.0,-958373300.0,-900020100.0,-896473100.0,-915192700.0,-912199800.0,24448370.0
7,1.461802,0.263948,0.00971,0.003269,6,10,,"{'max_features': 6, 'n_estimators': 10}",-2622099000.0,-2669655000.0,...,-2654240000.0,69679780.0,5,-493990600.0,-514599600.0,-502351200.0,-495946700.0,-514708700.0,-504319400.0,8880106.0
8,4.440078,0.231982,0.02761,0.003949,6,30,,"{'max_features': 6, 'n_estimators': 30}",-2446142000.0,-2446594000.0,...,-2496981000.0,73570460.0,2,-376096800.0,-387663600.0,-387530700.0,-376093800.0,-386105600.0,-382698100.0,5418747.0
9,0.550068,0.035476,0.003319,0.004065,8,3,,"{'max_features': 8, 'n_estimators': 3}",-3590333000.0,-3232664000.0,...,-3468718000.0,129375800.0,14,-950501200.0,-916611900.0,-903391000.0,-907064200.0,-945938600.0,-924701400.0,19734710.0


### Randomized search

Grid search works fine when the number of hyperparameter combinations is relatively small. However, when there are a lot of possible hyperparameter combinations, it may take too long to be practical.

A better approach when the hyperparameter search space is large is *randomized search*. With this, we choose the number of iterations (hyperparameter tests) we want to run. For each iteration, the method selects a random value for each hyperparameter. The resulting model is evaluated using cross-validation, just like with grid search.

*Note: The following may also take a little while to run.*

In [29]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
        'n_estimators': randint(low=1, high=200),
        'max_features': randint(low=1, high=8),
    }

forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring='neg_mean_squared_error', 
                                random_state=42)
rnd_search.fit(housing_prepared, housing_labels)
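One detail worth knowing about `scipy.stats.randint`: the upper bound `high` is exclusive. With `randint(low=1, high=8)`, `max_features` is therefore drawn from 1 through 7 and never takes the value 8. The quick check below uses an arbitrary number of draws (`size=10_000`).

```python
from scipy.stats import randint

# Draw many samples from the same distribution used for max_features above
draws = randint(low=1, high=8).rvs(size=10_000, random_state=42)

print("smallest value drawn:", draws.min())
print("largest value drawn:", draws.max())
print("distinct values:", sorted(set(draws.tolist())))
```

If 8 should be a candidate, the distribution would need `high=9`; whether that is desirable here is a tuning decision left to the reader.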

In [30]:
rnd_search.best_params_

{'max_features': 7, 'n_estimators': 180}

In [31]:
cvres = rnd_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

49117.55344336652 {'max_features': 7, 'n_estimators': 180}
51450.63202856348 {'max_features': 5, 'n_estimators': 15}
50692.53588182537 {'max_features': 3, 'n_estimators': 72}
50783.614493515 {'max_features': 5, 'n_estimators': 21}
49162.89877456354 {'max_features': 7, 'n_estimators': 122}
50655.798471042704 {'max_features': 3, 'n_estimators': 75}
50513.856319990606 {'max_features': 3, 'n_estimators': 88}
49521.17201976928 {'max_features': 5, 'n_estimators': 100}
50302.90440763418 {'max_features': 3, 'n_estimators': 150}
65167.02018649492 {'max_features': 5, 'n_estimators': 2}


## Analyzing the best models and their errors
With decision trees it's possible to look at the importance of each attribute for predicting values. 
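As a small illustration of what these importances mean, the sketch below trains a forest on synthetic data where one feature drives the output, one contributes a little, and one is pure noise. The feature names (`strong`, `weak`, `noise`) and the data are invented for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: y depends heavily on feature 0, slightly on feature 1,
# and not at all on feature 2
rng = np.random.default_rng(0)
X_imp = rng.normal(size=(500, 3))
y_imp = 5.0 * X_imp[:, 0] + 0.5 * X_imp[:, 1] + rng.normal(0, 0.1, size=500)
names = ["strong", "weak", "noise"]

imp_forest = RandomForestRegressor(n_estimators=30, random_state=42)
imp_forest.fit(X_imp, y_imp)

# Importances are normalized to sum to 1; pair them with names and sort
importances = imp_forest.feature_importances_
ranked = sorted(zip(importances, names), reverse=True)
print(ranked)
print("sum of importances:", importances.sum())
```

The importances are relative shares, so they are useful for ranking features against each other rather than as absolute measures.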

In [32]:
feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances

array([6.96542523e-02, 6.04213840e-02, 4.21882202e-02, 1.52450557e-02,
       1.55545295e-02, 1.58491147e-02, 1.49346552e-02, 3.79009225e-01,
       5.47789150e-02, 1.07031322e-01, 4.82031213e-02, 6.79266007e-03,
       1.65706303e-01, 7.83480660e-05, 1.52473276e-03, 3.02816106e-03])

*You know which features are which?* 

Me neither.

Let's display the features and their importance together.

In [33]:
extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
cat_encoder = cat_pipeline.named_steps["cat_encoder"]
cat_one_hot_attribs = list(cat_encoder.categories_[0])
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)

[(0.3790092248170967, 'median_income'),
 (0.16570630316895876, 'INLAND'),
 (0.10703132208204355, 'pop_per_hhold'),
 (0.06965425227942929, 'longitude'),
 (0.0604213840080722, 'latitude'),
 (0.054778915018283726, 'rooms_per_hhold'),
 (0.048203121338269206, 'bedrooms_per_room'),
 (0.04218822024391753, 'housing_median_age'),
 (0.015849114744428634, 'population'),
 (0.015554529490469328, 'total_bedrooms'),
 (0.01524505568840977, 'total_rooms'),
 (0.014934655161887772, 'households'),
 (0.006792660074259966, '<1H OCEAN'),
 (0.0030281610628962747, 'NEAR OCEAN'),
 (0.0015247327555504937, 'NEAR BAY'),
 (7.834806602687504e-05, 'ISLAND')]

*Which are the most important features?*

*Which ones are not that helpful?*

# Evaluating on the test set

Once we find a good model, we evaluate it on the *test* set.

In [34]:
final_model = grid_search.best_estimator_

In [35]:
X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

In [36]:
X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)

In [37]:
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)

In [38]:
final_rmse

47873.26095812988

*Is this good?*

*How do we know?*

When evaluating performance on a test set, we like to have a *baseline* to compare with. The baseline is usually the *state-of-the-art* approach if it is available. If not, it is a "reasonable" approach for the problem. For a regression problem, *linear regression* is a common approach, so we will use that as our baseline. This will put our random forest performance in perspective.

In [39]:
lin_reg = LinearRegression()

# train the model on entire training set
lin_reg.fit(housing_prepared, housing_labels)

# predict values for test set
lin_reg_predictions = lin_reg.predict(X_test_prepared)
lin_reg_mse = mean_squared_error(y_test, lin_reg_predictions)
lin_reg_rmse = np.sqrt(lin_reg_mse)

print(lin_reg_rmse)

66913.4419132093


*How does the tuned random forest approach compare with linear regression?*

## Impact of extra attributes
So what is the effect of the extra attributes that we added? For comparison, we will use the same prediction algorithm, but *without* the extra attributes added to the training and testing data.

In [40]:
num_pipeline_no_extra_attr = Pipeline([
        ('selector', DataFrameSelector(num_attribs)),
        ('imputer', SimpleImputer(strategy="median")),
        ('std_scaler', StandardScaler()),
    ])

full_pipeline_no_extra_attr = FeatureUnion(transformer_list=[
        ("num_pipeline_no_extra_attr", num_pipeline_no_extra_attr),
        ("cat_pipeline", cat_pipeline),
    ])

housing_prepared_no_extra_attr = full_pipeline_no_extra_attr.fit_transform(housing)
housing_prepared_no_extra_attr.shape



(16512, 13)

In [41]:
forest_reg = RandomForestRegressor(random_state=42)
grid_search_no_extra_attr = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error', return_train_score=True)
grid_search_no_extra_attr.fit(housing_prepared_no_extra_attr, housing_labels)

In [42]:
X_test_prepared.shape

(4128, 16)

In [43]:
X_test_prepared_no_extra_attr = full_pipeline_no_extra_attr.transform(X_test)

In [44]:
X_test_prepared_no_extra_attr.shape

(4128, 13)

In [45]:
final_model_no_extra_attr = grid_search_no_extra_attr.best_estimator_

In [46]:
final_predictions_no_extra_attr = final_model_no_extra_attr.predict(X_test_prepared_no_extra_attr)

In [47]:
final_mse_no_extra_attr = mean_squared_error(y_test, final_predictions_no_extra_attr)
final_rmse_no_extra_attr = np.sqrt(final_mse_no_extra_attr)
final_rmse_no_extra_attr

47859.544319108194

*Do the extra attributes help?*

# Summary

The purpose of this task was to learn the process of applying machine learning to a problem. The specifics of the problem (regression or classification, supervised or unsupervised) and the choice of ML algorithm (linear regression, decision trees, neural nets) do not change the overall approach that we saw here.

When faced with a new ML task:
1. Learn about the problem
2. Get the data (and examine its structure)
3. Create a test set
4. Explore and visualize the training set
5. Prepare and clean the data sets
6. Select/develop an ML algorithm
7. Tune the approach
8. Evaluate the final model
9. Deploy the resulting ML system

Differences in tasks, data formats, ML algorithms, and evaluation metrics affect some of the details of these steps, but the steps themselves remain the same. This gives us a high-level algorithm that we can follow whenever we apply ML to a new task.
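The steps listed in this summary can be sketched compactly with scikit-learn. This is a minimal illustration on synthetic data, not the housing pipeline from this notebook: the dataset, the parameter grid, and the step names `std_scaler` and `forest` are assumptions chosen to keep the example short.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Get the data (synthetic here; an assumption for illustration).
X, y = make_regression(n_samples=400, n_features=6, noise=10.0,
                       random_state=42)

# Create a test set and put it aside before exploring or tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Prepare the data and select a model, bundled in one pipeline.
pipeline = Pipeline([
    ("std_scaler", StandardScaler()),
    ("forest", RandomForestRegressor(random_state=42)),
])

# Tune the approach with cross-validated grid search on the training set only.
param_grid = {
    "forest__n_estimators": [10, 30],
    "forest__max_features": [2, 4],
}
grid_search = GridSearchCV(pipeline, param_grid, cv=3,
                           scoring="neg_mean_squared_error")
grid_search.fit(X_train, y_train)

# Evaluate the final model once on the held-out test set.
final_model = grid_search.best_estimator_
test_rmse = np.sqrt(mean_squared_error(y_test, final_model.predict(X_test)))
print(grid_search.best_params_, test_rmse)
```

Bundling preparation and model in a single pipeline means the scaler is refit on each cross-validation training fold, so no information from the validation fold leaks into the preparation step.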

# Questions

1.	In the example using DecisionTreeRegressor we observed that the RMSE was equal to zero. What kind of problem could this indicate?
1.  An RMSE of zero on the training set most likely means that the model is badly overfitting the training data: it has memorized the training instances and will probably not generalize well to new ones.
2.	Briefly explain what K-fold cross validation is.
2. K-fold cross validation is a technique that splits the training set into K roughly equal subsets (folds). The model is trained on K-1 folds and evaluated on the remaining fold. This process is repeated K times, each time holding out a different fold, so that every fold is used for evaluation exactly once. The K scores are then averaged.
3.	What are hyperparameters? Which of the algorithms described can help us adjust hyperparameters more automatically?
3. Hyperparameters are parameters that are not learned from the data by the estimator; they are set before training. GridSearchCV and RandomizedSearchCV can help us adjust hyperparameters more automatically. The former is best suited to cases where the number of combinations is small.
4.	In one of the steps of the data preparation process, a new attribute was created by combining other attributes. In practice, according to the results of the experiment, did this new attribute have a positive impact?
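4. No, not in this experiment. The test RMSE was about 47,860 without the extra attributes and about 47,873 with them, so the extra attributes did not improve the result. The difference of roughly 14 is negligible.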
5.	Some techniques and concepts used in this project (including the three parts) were taught in the classroom and others will still be taught. Describe in a few words what you thought of the study carried out with this project. In your opinion, was it possible to have a broader view of the process of applying ML to a real problem?
5. Yes, I was able to get a broader view of the process of applying ML to a real problem. It was a good experience to see how the process works and how to apply it to a real problem. I think it would also be nice to see how to apply it to a problem that would be even more complex than this and how the approach might change.
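As an illustration of answers 1 and 2, the sketch below contrasts the training-set RMSE of an unconstrained decision tree with its K-fold cross-validated RMSE. It uses synthetic data rather than the housing data, and the choice of K = 5 is an assumption made for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumption; stands in for the prepared housing data).
X, y = make_regression(n_samples=300, n_features=5, noise=20.0,
                       random_state=42)

# Train and evaluate on the same data: the tree can memorize every instance.
tree = DecisionTreeRegressor(random_state=42)
tree.fit(X, y)
train_rmse = np.sqrt(mean_squared_error(y, tree.predict(X)))

# K-fold cross validation: each fold is held out exactly once.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
fold_rmses = []
for train_idx, val_idx in kfold.split(X):
    model = DecisionTreeRegressor(random_state=42)
    model.fit(X[train_idx], y[train_idx])
    predictions = model.predict(X[val_idx])
    fold_rmses.append(np.sqrt(mean_squared_error(y[val_idx], predictions)))

cv_rmse = np.mean(fold_rmses)
print(f"training RMSE: {train_rmse:.3f}, cross-validated RMSE: {cv_rmse:.3f}")
```

A large gap between the two numbers is the usual sign of overfitting: the model looks perfect on data it has seen and much worse on data it has not.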