<img src="https://nyp-aicourse.s3.ap-southeast-1.amazonaws.com/agods/nyp_ago_logo.png" width='400'/>

# End-to-end Machine Learning Process

In this lab, we will go through the end-to-end process of building a regression model to predict housing prices.

At the end of the session, you will learn how to:


1. Perform exploratory data analysis
2. Perform data preparation
3. Train and validate model
4. Fine Tune Model
5. Test the model
6. Package the model for deployment

## Import required libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

%matplotlib inline

## Getting the data

We will be using the California Housing Prices dataset. This dataset is based on data from the 1990 California census. You can see a description of the data here:
https://www.kaggle.com/datasets/camnugent/california-housing-prices


In [None]:
import pandas as pd

df = pd.read_csv('data/housing.csv')

## Understanding the data

As in any machine learning project, it is important to have a good understanding of your data. We will do some exploratory data analysis as our next step. But before we delve further into it, let's take a quick look at our data. We can first examine some samples using `DataFrame.head()`.

In [None]:
df.head()

The `info()` method is useful to get a quick description of the data, in particular the total number of rows, each attribute’s type, and the number of non-null values.

In [None]:
df.info()

**Question**

1. How many samples do we have?
2. Do we have any missing values? Which feature(s) have missing values?
3. Which feature(s) are categorical?

<details><summary>Click here for answer</summary>

1. 20640 samples in total
2. Yes, we have missing values: 207 samples are missing the `total_bedrooms` value.
3. `ocean_proximity` is a categorical feature; its data type is `object`.

</details>

The `describe()` method shows a summary of the numerical attributes.

In [None]:
df.describe()

Another quick way to get a feel of the type of data you are dealing with is to plot a histogram for each numerical attribute.

In [None]:
sns.set_theme(style='whitegrid')
df.hist(bins=50, figsize=(12, 8))
plt.show()

You may notice that some attributes have a right-skewed distribution, so you may want to transform them (e.g., by computing their logarithm) when preparing the data later on.

**Question**

What do you notice about the median house value in the histogram plot? Is there a potential problem?

<details><summary>Click here for answer</summary>
    
The histogram shows a large count of houses at the maximum price. This is due to how the data was collected: housing prices are capped at a maximum value (e.g. 500,001).

This may be a serious problem since it is your target attribute (your labels). Your machine learning algorithms may learn that prices never go beyond that limit. If you need to predict values beyond $500,001, then you should collect proper labels for the districts whose labels are capped. Or you can remove those districts from the training and test set.
    
</details>

## Splitting Data into Train and Test Set
Before we proceed with more data exploration, it is good practice to first set aside part of our dataset as a test set. This prevents us from snooping on information or patterns in the test set and 'overfitting' ourselves (and eventually our model) to it.

One option is to randomly shuffle the data and split it into train and test sets using scikit-learn's `train_test_split()` function, e.g.

```
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)
```

This is generally fine if your dataset is large enough (especially relative to the number of attributes), but if it is not, you run the risk of introducing a significant sampling bias: your training set may not be representative of your eventual test set or real-world data.

Suppose domain experts tell us that median income is a key predictor of housing prices. We then want the train and test sets to have the same distribution of income categories as the full dataset. This can be done by stratified sampling.

Before that, let's take a closer look at the income distribution using a histogram.


In [None]:
# plt.hist(df.median_income)
sns.histplot(data=df, x='median_income')

**Creating income categories**


Most median income values are clustered around 1.5 to 6 (i.e.,  \\$15,000 to \\$60,000), but some
median incomes go far beyond 6. It is important to have a sufficient number
of instances in your dataset for each stratum, or else the estimate of a
stratum’s importance may be biased. This means that you should not have too
many strata, and each stratum should be large enough.

We can use [`pd.cut()`](https://pandas.pydata.org/docs/reference/api/pandas.cut.html) to bin the median income into 5 categories: $(0, 1.5]$, $(1.5, 3.0]$, $(3.0, 4.5]$, $(4.5, 6]$ and $(6, \infty)$. Note that `pd.cut()` uses right-closed intervals by default.

In [None]:
df["income_cat"] = pd.cut(df["median_income"],
                          bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                          labels=[1, 2, 3, 4, 5])
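To see how values on the bin edges are assigned, here is a small sketch on made-up income values (the `toy_income` series is hypothetical and not part of the housing data). It uses the same bins and labels as the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical median incomes, chosen to sit on and around the bin edges
toy_income = pd.Series([0.5, 1.5, 1.6, 3.0, 4.5, 6.0, 6.1, 15.0])

toy_cat = pd.cut(toy_income,
                 bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                 labels=[1, 2, 3, 4, 5])

# Bins are right-closed by default, so 1.5 lands in category 1 and 6.0 in category 4
print(toy_cat.tolist())
```

A value exactly equal to a bin edge goes into the lower category, because each interval includes its right edge but not its left one.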

Let us find out the number of samples in each of the categories 1, 2, ..., 5.

In [None]:
df["income_cat"].value_counts().sort_index()

In [None]:
sns.set_theme()
df["income_cat"].value_counts().sort_index().plot.bar()

### Using Stratified Sampling to Split data

Stratified random sampling is a method of sampling that involves the division of a population into smaller sub-groups known as strata. In stratified random sampling, the strata are formed based on members' shared attributes or characteristics such as income or educational attainment.  The following code shows you how we can use Stratified Sampling to split the data into training and testing set.

In [None]:
from sklearn.model_selection import train_test_split

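# test_size=0.2 keeps this split comparable with the random split used later
# (without it, train_test_split defaults to a 25% test set)
strat_train_set, strat_test_set = train_test_split(df, test_size=0.2, shuffle=True,
                                                   stratify=df['income_cat'],
                                                   random_state=42)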
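The effect of the `stratify` argument is easier to see on a tiny synthetic dataset. The sketch below uses a made-up DataFrame `toy` with three imbalanced categories (an assumption for illustration only) and checks that the test set keeps the same category proportions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical data: 1000 rows with categories in a 10% / 30% / 60% mix
toy = pd.DataFrame({
    "value": rng.normal(size=1000),
    "cat": np.repeat([1, 2, 3], [100, 300, 600]),
})

toy_train, toy_test = train_test_split(toy, test_size=0.2,
                                       stratify=toy["cat"], random_state=42)

# Proportion of each category in the stratified test set
test_props = toy_test["cat"].value_counts(normalize=True).sort_index()
print(test_props)
```

With stratification, the test set reproduces the 10% / 30% / 60% mix of the full data, whereas a purely random split would only match it approximately.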

In the code cell below, we compute and display the percentage of each income category for the 'Stratified' and 'Random' splits.

In [None]:
def income_cat_props(data):
    # compute the percentage of data across different categories
    return data['income_cat'].value_counts()/len(data)

rand_train_set, rand_test_set = train_test_split(df, test_size=0.2, shuffle=True, random_state=42)

compare_props = pd.DataFrame({
    'Overall': income_cat_props(df),
    'Stratified': income_cat_props(strat_test_set),
    'Random': income_cat_props(rand_test_set)
}).sort_index()

compare_props['Rand. %error'] = 100 * compare_props['Random'] / compare_props['Overall'] - 100
compare_props['Strat. %error'] = 100 * compare_props['Stratified'] / compare_props['Overall'] - 100

compare_props

## Exploratory Data Analysis

We shall further explore our training dataset to gain more insights. Let's create a copy of the stratified training set we created earlier, using the `copy()` method, so that we can experiment with it without affecting the training set.

In [None]:
housing = strat_train_set.copy()

### Visualize geographical data

Because the dataset includes geographical information (latitude and longitude), it is a good idea to create a scatterplot of all the districts to visualize the data.

In [None]:
housing.plot(kind='scatter',  x='longitude', y='latitude', alpha=0.4)

Now, let's get more insight into how population and median house value relate to location. We can use the marker size to represent the population and the marker color to represent the median house value. We choose the predefined colormap 'jet', which ranges from blue (low values) to red (high values).

In [None]:
housing.plot(kind='scatter', x='longitude', y='latitude', alpha=0.4,
             s=housing['population']/100, label='population',
             c='median_house_value', cmap=plt.get_cmap('jet'), colorbar=True)
plt.legend()

**Question**

What can you conclude from this scatterplot?

<details><summary>Click here for answer</summary>
    
This plot tells you that the housing prices are very much related to the location (e.g., close to the ocean) and to the population density.

</details>

### Looking for Correlations

We can compute the standard correlation coefficient (Pearson's r) between every pair of attributes using the `corr()` method, and examine how much each attribute correlates with the median house value.

In [None]:
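# numeric_only=True skips non-numeric columns such as ocean_proximity,
# which would otherwise raise an error in recent pandas versions
corr_matrix = housing.corr(numeric_only=True)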
corr_matrix['median_house_value'].sort_values(ascending=False)

#### Question 1

Which variable(s) have a high positive correlation with `median_house_value`?

<details><summary>Click Here for Answer</summary>

The correlation coefficient ranges from –1 to 1. When it is close to 1, it means that there is a strong positive correlation. In our case, median house value and median income have a strong positive correlation: when median income goes up, median house value tends to go up as well. When the coefficient is close to –1, it means that there is a strong negative correlation; you can see a small negative correlation between the latitude and the median house value (i.e., prices have a slight tendency to go down when you go north). Finally, coefficients close to 0 mean that there is no linear correlation.

</details>
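Note that a coefficient near 0 only rules out a *linear* relationship. The following sketch, on made-up data, shows a perfectly deterministic but nonlinear relationship whose Pearson coefficient is essentially zero, next to a linear one whose coefficient is 1.

```python
import numpy as np

x = np.linspace(-1, 1, 201)

# y depends entirely on x, but the relationship is symmetric rather than linear
r_nonlinear = np.corrcoef(x, x ** 2)[0, 1]

# A straight-line relationship gives a coefficient of 1
r_linear = np.corrcoef(x, 3 * x + 1)[0, 1]

print("nonlinear:", round(r_nonlinear, 6))
print("linear:", round(r_linear, 6))
```

This is one reason to look at scatterplots as well as the correlation matrix.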
    


Another way to check for correlation between attributes is to use the `sns.pairplot()` function, which plots every numerical attribute against every other numerical attribute. The most promising attribute for predicting the median house value seems to be the median income.

In [None]:
attributes = ['median_house_value', 'median_income', 'total_rooms', 'housing_median_age']
# sns.pairplot creates its own figure, so a separate plt.figure() call is not needed
sns.pairplot(housing[attributes])
plt.show()

Looking at the correlation scatterplots, it seems like the most promising attribute to predict the median house value is the median income, so we zoom in on their scatterplot.

In [None]:
sns.scatterplot(data=housing, x='median_income', y='median_house_value', alpha=0.2)

#### Question 2

Do you notice something peculiar about the scatter plot?

<details><summary>Click here for answer</summary>

The price cap we noticed earlier is clearly visible as a horizontal line at \\$500,000. But the plot also reveals other, less obvious straight lines: a horizontal line around \\$450,000 and another around \\$350,000.

You may want to try removing the corresponding districts to prevent your algorithms from learning these data quirks.

</details>

In [None]:
# strat_train_set = strat_train_set.loc[(strat_train_set.median_house_value != 500001.0) & (strat_train_set.median_house_value != 350000.0)]
# strat_test_set = strat_test_set.loc[(strat_test_set.median_house_value != 500001.0) & (strat_test_set.median_house_value != 350000.0)]
# housing = strat_train_set.copy()

### Experimenting with Attribute Combinations

One last thing you may want to do before preparing the data for machine learning algorithms is to try out various attribute combinations. For example, the total number of rooms in a district is not very useful if you don’t know how many households there are. What you really want is the number of rooms per household. Similarly, the total number of bedrooms by itself is not very useful: you probably want to compare it to the number of rooms. And the population per household also seems like an interesting attribute combination to look at. You create these new attributes as follows:

In [None]:
housing['rooms_per_household'] = housing['total_rooms']/housing['households']
housing['bedrooms_per_room'] = housing['total_bedrooms']/housing['total_rooms']
housing['population_per_household'] = housing['population']/housing['households']

Now we look at the correlation matrix again:

In [None]:
# numeric_only=True skips non-numeric columns such as ocean_proximity
corr_matrix = housing.corr(numeric_only=True)
corr_matrix['median_house_value'].sort_values(ascending=False)

It looks like the new `bedrooms_per_room` attribute is much more correlated with the median house value than the total number of rooms or bedrooms. Apparently, houses with a lower bedroom/room ratio tend to be more expensive. The number of rooms per household is also more informative than the total number of rooms in a district.

## Data Preparation

After gaining some understanding of our data, we are ready to prepare our data for machine learning. We will revert to our clean dataset, and separate our features (predictors) and labels (target values), i.e. the median house value.

### Separate features and labels

In [None]:
# Separate the target values from predictors
housing = strat_train_set.drop('median_house_value', axis=1)
housing_labels=strat_train_set['median_house_value'].copy()

### Clean the data

We observed earlier that the *total_bedrooms* feature has some missing values. We can either:
1. Get rid of the rows that have missing values for *total_bedrooms*
2. Get rid of the feature totally
3. Set the missing values to some value, which can be zero, the mean, the median, etc. This is called imputation.
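
The three options can be sketched with plain pandas on a tiny made-up DataFrame (`toy` is hypothetical and only mimics two of the housing columns):

```python
import numpy as np
import pandas as pd

# Hypothetical mini dataset with one missing total_bedrooms value
toy = pd.DataFrame({
    "total_rooms": [10., 20., 30., 40.],
    "total_bedrooms": [2., np.nan, 6., 8.],
})

# Option 1: drop the rows with a missing value
option1 = toy.dropna(subset=["total_bedrooms"])

# Option 2: drop the whole feature
option2 = toy.drop("total_bedrooms", axis=1)

# Option 3: impute, here with the median of the observed values
median = toy["total_bedrooms"].median()
option3 = toy.fillna({"total_bedrooms": median})

print(option3)
```

Whichever option you pick, the statistic used for imputation should be computed on the training set only and then reused for the test set and for new data.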

Scikit-Learn provides a handy class to fill in the missing values: `SimpleImputer`.

Let us use the median as the replacement value. As the median only makes sense for numerical values, we will separate the numerical features from the categorical features.

In [None]:
housing_num = housing.drop('ocean_proximity', axis=1)
housing_cat =  housing['ocean_proximity']

We will then create an instance of imputer, specifying median as our replacement values, and fit (train) the imputer on our training data to learn the statistics, i.e the median.

In [None]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')
imputer.fit(housing_num)
imputer.statistics_

Now you can use this “trained” imputer to transform the training set by replacing missing values with the learned medians:

In [None]:
X = imputer.transform(housing_num)
print(X)

Notice that after the transformation, the result is no longer a DataFrame but a NumPy array. So let us convert `X` back to a DataFrame.

In [None]:
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index = housing_num.index)
housing_tr.info()

### Handling Text and Categorical Data

In this dataset, there is just one text attribute: `ocean_proximity`. Let's see what different values this attribute takes.

In [None]:
housing['ocean_proximity'].value_counts()

We can see that there is only a limited number of possible values, which means this is a categorical attribute. Most machine learning algorithms prefer to work with numbers, so let’s convert these categories from text to numbers. One way to do this is to assign a number to each category, e.g. `<1H OCEAN = 0, INLAND = 1, NEAR OCEAN = 2`, etc. This can be done using `OrdinalEncoder()` in scikit-learn. However, one issue with this representation is that ML algorithms will assume that two nearby values are more similar than two distant values, which may not be a valid assumption. A better encoding for this kind of categorical data is one-hot encoding, using the `OneHotEncoder()` class in scikit-learn.
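
As a quick comparison of the two encodings, the sketch below applies both encoders to four made-up category values (the `toy_cats` array is hypothetical). Categories are numbered in sorted order.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical column of categories, shaped (n_samples, 1) as the encoders expect
toy_cats = np.array([["INLAND"], ["NEAR OCEAN"], ["<1H OCEAN"], ["INLAND"]])

# Ordinal encoding: one integer code per category
ordinal = OrdinalEncoder().fit_transform(toy_cats)

# One-hot encoding: one binary column per category (sparse by default)
onehot = OneHotEncoder().fit_transform(toy_cats).toarray()

print(ordinal.ravel())
print(onehot)
```

In the ordinal version, `NEAR OCEAN` (2) looks "further" from `<1H OCEAN` (0) than `INLAND` (1) does, which is an artefact of the numbering. The one-hot version avoids implying any such ordering.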
                                        

In [None]:
housing_cat

In [None]:
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder(drop="first")
housing_cat_1hot = cat_encoder.fit_transform(housing_cat.values.reshape(-1, 1))
housing_cat_1hot.toarray()

We can get the list of categories using the encoder’s categories_ instance variable:

In [None]:
cat_encoder.categories_

### Feature Scaling and Transformation

#### Scaling

One of the most important transformations you need to apply to your data is feature scaling. With few exceptions, machine learning algorithms don’t perform well when the input numerical attributes have very different scales. This is the case for the housing data: the total number of rooms ranges from about 6 to 39,320, while the median incomes only range from 0 to 15. Without any scaling, most models will be biased toward ignoring the median income and focusing more on the number of rooms.

There are two common ways to get all attributes to have the same scale: *min-max scaling* and *standardization*.

In [None]:
from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler()
housing_num_std_scaled = std_scaler.fit_transform(housing_num)
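
The cell above uses standardization. For comparison, here is a small sketch of both scaling methods on a made-up single-feature array: min-max scaling maps values to the range 0 to 1, while standardization subtracts the mean and divides by the standard deviation.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature with five values
X_toy = np.array([[1.], [2.], [3.], [4.], [5.]])

# Min-max scaling: (x - min) / (max - min)
X_minmax = MinMaxScaler().fit_transform(X_toy)

# Standardization: (x - mean) / std
X_std = StandardScaler().fit_transform(X_toy)

print(X_minmax.ravel())
print(X_std.ravel().round(3))
```

Standardization does not bound values to a fixed range, but it is generally less affected by outliers than min-max scaling.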

When a feature’s distribution has a heavy tail, both min-max scaling and standardization will squash most values into a small range. Machine
learning models generally don’t like this at all. So before you scale the feature, you should first transform it to shrink the heavy tail, and if possible to make the distribution roughly symmetrical. For example, a common way to do this is to replace the feature with its logarithm.

For example, the *population* feature has a long tail. After we apply a log transform, its distribution is closer to a Gaussian.

In [None]:
_, ax = plt.subplots(1,2)
sns.histplot(data=housing_tr['population'], ax=ax[0])
sns.histplot(data=housing_tr['population'].apply(np.log), ax=ax[1])
ax[0].set_xlabel('population')
ax[1].set_xlabel('log of population')
ax[0].set_ylabel("number of districts")
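
The effect of the log transform can also be checked numerically. The sketch below uses synthetic log-normal data as a stand-in for a heavy-tailed feature (an assumption, not the real *population* column) and compares the skewness before and after the transform.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic heavy-tailed feature, standing in for something like population
toy_population = pd.Series(rng.lognormal(mean=7.0, sigma=0.8, size=5000))

skew_before = toy_population.skew()
skew_after = np.log(toy_population).skew()

print("skewness before log:", round(skew_before, 2))
print("skewness after log:", round(skew_after, 2))
```

A skewness close to 0 indicates a roughly symmetric distribution, which is what we want before scaling.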

#### Custom Transformer

Although Scikit-Learn provides many useful transformers, you will need to write your own for tasks such as custom transformations, cleanup operations, or combining specific attributes.

For transformations that don’t require any training, you can just write a function that takes a NumPy array as input and outputs the transformed
array. For example, we can implement the log transform in the above cell as a FunctionTransformer.

In [None]:
from sklearn.preprocessing import FunctionTransformer

log_transformer = FunctionTransformer(np.log)
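
`FunctionTransformer` also accepts an optional `inverse_func`, which is handy when you want to map transformed values back to the original scale. A minimal sketch on made-up values:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Log transform with its inverse (exp) so the transform can be undone
log_tf = FunctionTransformer(np.log, inverse_func=np.exp)

X_toy = np.array([[1.0], [10.0], [100.0]])
X_log = log_tf.fit_transform(X_toy)
X_back = log_tf.inverse_transform(X_log)

print(X_log.ravel())
print(X_back.ravel())
```

Keep in mind that `np.log` is only defined for strictly positive inputs, so this transform assumes the feature has no zeros or negative values.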

Custom transformers are also useful to combine features. For example, here’s a FunctionTransformer that computes the ratio between the input features 0 and 1.

In [None]:
ratio_transformer = FunctionTransformer(lambda X: X[:, [0]] / X[:, [1]])
ratio_transformer.transform(np.array([ [1., 2.], [3., 4.]]))

Previously, we showed that some derived features such as bedroom ratio (total_bedrooms/total_rooms) are more informative than total_bedrooms alone. Below we show how we can create a transformer for this:

In [None]:
def column_ratio(X):
    return X[:, [0]] / X[:, [1]]

def ratio_name(function_transformer, feature_names_in):
    return ["ratio"] # feature names out

ratio_transformer = FunctionTransformer(column_ratio, feature_names_out=ratio_name)

In [None]:
housing_rooms = housing[['total_bedrooms', 'total_rooms']]
ratio_transformer.fit_transform(housing_rooms.values)

#### Transformation Pipeline

As you can see, there are many data transformation steps that need to be executed in the right order. Fortunately, Scikit-Learn provides `make_pipeline()` to help with such sequences of transformations. Here is a small pipeline for numerical attributes, which will first impute then
scale the input features:

In [None]:
import sklearn
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline

sklearn.set_config(display='diagram')  # display pipeline as diagram

## Alternate way of creating a named pipeline
# num_pipeline = Pipeline([
#     ("impute", SimpleImputer(strategy="median")),
#     ("standardize", StandardScaler()),
# ])
num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
num_pipeline

You can now use the pipeline to transform your housing_num.

In [None]:
housing_num_prepared = num_pipeline.fit_transform(housing_num)
housing_num_prepared[:2]

When you call the pipeline’s `fit()` method, it calls `fit_transform()` sequentially on all the transformers, passing the
output of each call as the parameter to the next call until it reaches the final estimator, for which it just calls the `fit()` method.
The pipeline exposes the same methods as the final estimator. In this example the last estimator is a `StandardScaler`, which is a transformer, so the pipeline also acts like a transformer. If you call the pipeline’s `transform()` method, it will sequentially apply all the transformations to the data.
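
To make this concrete, the sketch below checks on a small made-up array that calling `fit_transform()` on the pipeline gives the same result as running the imputer and the scaler by hand, one after the other.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data with a missing value in each column
X_toy = np.array([[1., 10.], [np.nan, 20.], [3., 30.], [5., np.nan]])

# Via the pipeline
toy_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
via_pipeline = toy_pipeline.fit_transform(X_toy)

# By hand: the output of the imputer is the input of the scaler
imputed = SimpleImputer(strategy="median").fit_transform(X_toy)
by_hand = StandardScaler().fit_transform(imputed)

print(np.allclose(via_pipeline, by_hand))
```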

If you want to recover a nice DataFrame, you can use the pipeline’s `get_feature_names_out()` method:

In [None]:
df_housing_num_prepared = pd.DataFrame(housing_num_prepared,
                                       columns=num_pipeline.get_feature_names_out(),
                                       index=housing_num.index)

In [None]:
df_housing_num_prepared

Similarly, we can define a pipeline for the categorical feature:

In [None]:
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder(handle_unknown='ignore'))
cat_pipeline

#### Column Transformer

So far, we have handled the categorical columns and the numerical columns separately. It would be more convenient to have a single
transformer capable of handling all columns, applying the appropriate transformations to each column. For this, you can use a ColumnTransformer. For example, the following ColumnTransformer will apply `num_pipeline` to numerical attributes and `cat_pipeline` to categorical attribute.

In [None]:
from sklearn.compose import make_column_transformer
from sklearn.compose import ColumnTransformer

num_attribs = ["longitude", "latitude", "housing_median_age", "total_rooms",
               "total_bedrooms", "population", "households", "median_income"]
cat_attribs = ["ocean_proximity"]

preprocessing = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", cat_pipeline, cat_attribs),
])


If we don't care about naming the individual transformer (pipeline), we can use `make_column_transformer()`:

In [None]:
preprocessing = make_column_transformer(
                    (num_pipeline, num_attribs),
                    (cat_pipeline, cat_attribs))

Scikit-Learn provides a `make_column_selector()` function that returns a selector function you can use to automatically select all the features of a given type, such as numerical or categorical. You can pass this selector function to the ColumnTransformer instead of column names or indices.

In [None]:
from sklearn.compose import make_column_selector

preprocessing = make_column_transformer(
    (num_pipeline, make_column_selector(dtype_include=np.number)),
    (cat_pipeline, make_column_selector(dtype_include=object)),
)

preprocessing
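
A selector is simply a callable that takes a DataFrame and returns the matching column names, so you can call it yourself to check what it picks. A small sketch on a made-up DataFrame (the `toy` frame is hypothetical; the text column is created with `dtype=object` explicitly):

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical frame with two numeric columns and one object (text) column
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0],
    "median_income": [8.3, 7.2],
    "ocean_proximity": pd.Series(["NEAR BAY", "INLAND"], dtype=object),
})

num_cols = make_column_selector(dtype_include=np.number)(toy)
cat_cols = make_column_selector(dtype_include=object)(toy)

print(num_cols)
print(cat_cols)
```

Columns whose dtype matches neither selector (for example a pandas `category` column) are not selected by either one.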

#### Integrating all the different transformation pipelines

Now let us apply all the different transformations we have experimented with earlier, and put them into a single ColumnTransformer.

In [None]:
def column_ratio(X):
    return X[:, [0]] / X[:, [1]]

def ratio_name(function_transformer, feature_names_in):
    return ["ratio"] # feature names out

ratio_pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    FunctionTransformer(column_ratio, feature_names_out=ratio_name),
    StandardScaler())

log_pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    FunctionTransformer(np.log, feature_names_out="one-to-one"),
    StandardScaler())

default_num_pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler())

cat_pipeline = make_pipeline(
    SimpleImputer(strategy='most_frequent'),
    OneHotEncoder(handle_unknown='ignore'))

preprocessing = ColumnTransformer([
    ("bedrooms", ratio_pipeline, ["total_bedrooms", "total_rooms"]),
    ("rooms_per_house", ratio_pipeline, ["total_rooms", "households"]),
    ("people_per_house", ratio_pipeline, ["population", "households"]),
    ("log", log_pipeline, ["total_bedrooms", "population", "households", "median_income"]),
    ("cat", cat_pipeline, make_column_selector(dtype_include=object)),],
    remainder=default_num_pipeline)

preprocessing

In [None]:
housing_prepared = preprocessing.fit_transform(housing)
housing_prepared.shape

In [None]:
preprocessing.get_feature_names_out()

In [None]:
df_housing_prepared = pd.DataFrame(housing_prepared,
                                   columns=preprocessing.get_feature_names_out(),
                                   index=housing.index)

In [None]:
df_housing_prepared

## Select and Train a Model

We are now ready to select and train a machine learning model. Let's train a very basic linear regression model to get started.


In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

lin_reg = make_pipeline(preprocessing, LinearRegression())
lin_reg.fit(housing, housing_labels)

Try it out on a few instances from the training set.  It works, but the predictions are not great!

In [None]:
housing_predictions = lin_reg.predict(housing)
print("predicted:", housing_predictions[:5].round(-2)) # -2 = rounded to the nearest hundred

print("actual:", housing_labels.iloc[:5].values)

print("diff:", housing_predictions[:5] - housing_labels.iloc[:5].values)

Let us measure the RMSE (root mean squared error) of this regression model, using scikit-learn's `mean_squared_error()`.

In [None]:
from sklearn.metrics import mean_squared_error

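# Take the square root of the MSE to get the RMSE
# (the squared=False argument is deprecated in recent scikit-learn releases)
lin_rmse = np.sqrt(mean_squared_error(housing_labels, housing_predictions))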
print(lin_rmse)
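
For reference, the RMSE is the square root of the average squared difference between the labels and the predictions. The sketch below computes it by hand on three made-up values and checks it against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical labels and predictions
y_true = np.array([100., 200., 300.])
y_pred = np.array([110., 190., 330.])

# RMSE by hand: sqrt(mean((y_true - y_pred)^2))
rmse_by_hand = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Same quantity via scikit-learn's MSE
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(round(rmse_by_hand, 3))
```

Because the errors are squared before averaging, the single large error (30) dominates the result.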

In [None]:
# housing_labels.describe()

Clearly not a great score: the `median_house_value` of most districts ranges between \\$120,000 and \\$265,000, so a typical prediction error of \\$70,495 is really not very satisfying.

### Evaluation using Cross Validation

Note that so far we have only evaluated the model on the training set. How do we estimate its performance on unseen data without touching the test set? One way is to use a validation set: we can use the `train_test_split()` function to split the training set into a smaller training set and a validation set, then train our models on the smaller training set and evaluate them on the validation set.

A great alternative is to use Scikit-Learn’s *k-fold* cross-validation feature. The following code splits the training set into 5 nonoverlapping subsets called folds, then trains and evaluates the linear regression model 5 times, picking a different fold for evaluation every time and using the other 4 folds for training. The result is a dictionary containing arrays of the 5 training and validation scores.

*Note*: 5 is a common choice for the number of folds.
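
To see how k-fold cross-validation partitions the data, here is a toy sketch using `KFold` directly on 10 made-up samples with 5 folds. Each sample lands in the validation fold exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten hypothetical samples, identified by their index
X_toy = np.arange(10).reshape(-1, 1)

kf = KFold(n_splits=5)
folds = list(kf.split(X_toy))

for i, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {i}: train={train_idx.tolist()} val={val_idx.tolist()}")
```

Functions such as `cross_validate()` perform this kind of split internally when you pass an integer `cv` for a regressor, then fit and score the model once per fold.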

In [None]:
from sklearn.model_selection import cross_validate

lin_reg = make_pipeline(preprocessing, LinearRegression())
linreg_rmses = cross_validate(lin_reg,
                              housing,
                              housing_labels,
                              scoring="neg_root_mean_squared_error",
                              return_train_score=True,
                              cv=5)

print("rmses (train): ", -linreg_rmses['train_score'])
print("average train rmse: ", -linreg_rmses['train_score'].mean())
print("rmses (val):", -linreg_rmses['test_score'])
print("average val rmse:", -linreg_rmses['test_score'].mean())

We can try polynomial regression by adding powers of each feature and fitting a linear model on these extended features.

$$y = \beta_0+\beta_1x+\beta_2x^2+\beta_3x^3+\ldots+\beta_nx^n$$

In [None]:
from sklearn.preprocessing import PolynomialFeatures

log_pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    PolynomialFeatures(degree=2),
    FunctionTransformer(np.log, feature_names_out="one-to-one"),
    StandardScaler())

default_num_pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    PolynomialFeatures(degree=2),
    StandardScaler())

cat_pipeline = make_pipeline(
    SimpleImputer(strategy='most_frequent'),
    OneHotEncoder(handle_unknown='ignore'))

preprocessing_poly = ColumnTransformer([
    ("bedrooms", ratio_pipeline, ["total_bedrooms", "total_rooms"]),
    ("rooms_per_house", ratio_pipeline, ["total_rooms", "households"]),
    ("people_per_house", ratio_pipeline, ["population", "households"]),
    ("log", log_pipeline, ["total_bedrooms", "population", "households", "median_income"]),
    ("cat", cat_pipeline, make_column_selector(dtype_include=object)),],
    remainder=default_num_pipeline)

preprocessing_poly

In [None]:
from sklearn.preprocessing import PolynomialFeatures

poly_reg = make_pipeline(preprocessing_poly, LinearRegression())
poly_rmses = cross_validate(poly_reg,
                            housing,
                            housing_labels,
                            scoring="neg_root_mean_squared_error",
                            return_train_score=True,
                            cv=5)

print("rmses (train): ", -poly_rmses['train_score'])
print("average train rmse: ", -poly_rmses['train_score'].mean())
print("rmses (val):", -poly_rmses['test_score'])
print("average val rmse:", -poly_rmses['test_score'].mean())

print('diff:', abs(poly_rmses['test_score'].mean()-poly_rmses['train_score'].mean()))

We notice that the mean training RMSE has improved, but the validation RMSE is still much worse than the training RMSE. We may want to try a regularized linear regressor, `Ridge`.

In [None]:
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge

ridge_reg = make_pipeline(preprocessing_poly, Ridge())

ridge_rmses = cross_validate(ridge_reg,
                             housing,
                             housing_labels,
                             scoring="neg_root_mean_squared_error",
                             return_train_score=True,
                             cv=5)

print("rmses (train): ", -ridge_rmses['train_score'])
print("average train rmse: ", -ridge_rmses['train_score'].mean())
print("rmses (val):", -ridge_rmses['test_score'])
print("average val rmse:", -ridge_rmses['test_score'].mean())
print('diff:', abs(ridge_rmses['test_score'].mean()-ridge_rmses['train_score'].mean()))

With regularization, our model's bias has increased but the variance has decreased slightly. Our validation rmse is now closer to our train rmse.
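
To illustrate what the regularization strength `alpha` does, the sketch below fits `Ridge` on synthetic data (made up for illustration, not the housing data) with increasing `alpha` and prints the size of the learned coefficient vector.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Synthetic regression problem with known coefficients plus a little noise
X_toy = rng.normal(size=(50, 5))
y_toy = X_toy @ np.array([3., -2., 0., 1., 0.5]) + rng.normal(scale=0.1, size=50)

# L2 norm of the fitted coefficients for each regularization strength
coef_norms = {
    alpha: np.linalg.norm(Ridge(alpha=alpha).fit(X_toy, y_toy).coef_)
    for alpha in [0.01, 1, 100]
}

for alpha, norm in coef_norms.items():
    print(f"alpha={alpha}: coefficient norm = {norm:.3f}")
```

Larger values of `alpha` shrink the coefficients more strongly, trading a little extra bias for lower variance.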

We can also try other more sophisticated algorithms such as RandomForestRegressor.

In [None]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = make_pipeline(preprocessing, RandomForestRegressor())

forest_rmses = cross_validate(forest_reg,
                             housing,
                             housing_labels,
                             scoring="neg_root_mean_squared_error",
                             return_train_score=True,
                             cv=3)

print("rmses (train): ", -forest_rmses['train_score'])
print("average train rmse: ", -forest_rmses['train_score'].mean())
print("rmses (val):", -forest_rmses['test_score'])
print("average val rmse:", -forest_rmses['test_score'].mean())

The training rmse is a big improvement from Linear Regression model, however, we see that there is quite a fair bit of overfitting here. Some regularization will be helpful here.

## Model Fine Tuning

Now that we have our first model, we can try to improve it by adjusting some of the hyperparameters. This process is called fine-tuning. One way to fine-tune the model is to use Scikit-Learn's `GridSearchCV` to evaluate all the possible combinations of the hyperparameter values that you want it to try.

In [None]:
from sklearn.model_selection import GridSearchCV

full_pipeline = Pipeline([
    ("preprocessing", preprocessing_poly),
    ("ridge", Ridge()),
])


param_grid = [{'ridge__alpha': [0.0001, 0.001, 0.01, 1, 10]}]

grid_search = GridSearchCV(full_pipeline,
                           param_grid,
                           cv=5,
                           scoring='neg_root_mean_squared_error',
                           return_train_score=True)

grid_search.fit(housing, housing_labels)


In [None]:
print('best params:', grid_search.best_params_)
print('best score:', -(grid_search.best_score_))

In [None]:
cv_res = pd.DataFrame(grid_search.cv_results_)

cv_res.sort_values(by="mean_test_score", ascending=False, inplace=True)

# # extra code – these few lines of code just make the DataFrame look nicer
cv_res = cv_res[["param_ridge__alpha",
                 "split0_test_score",
                 "split1_test_score",
                 "split2_test_score",
                 "split3_test_score",
                 "split4_test_score",
                 "mean_test_score"]]
score_cols = ["split0", "split1", "split2", "split3", "split4", "mean_test_rmse"]
cv_res.columns = ["alpha"] + score_cols
cv_res[score_cols] = -cv_res[score_cols].round().astype(np.int64)

cv_res

In [None]:
ridge_reg = make_pipeline(preprocessing_poly, Ridge(alpha=0.0001))
final_model = ridge_reg.fit(housing, housing_labels)

In [None]:
housing_predictions = ridge_reg.predict(housing)
# RMSE on the training set (square root of the MSE)
rmse = np.sqrt(mean_squared_error(housing_labels, housing_predictions))
print(rmse)

### Evaluate System on Test Set

After tweaking your models for a while, you eventually have a system that performs sufficiently well. You are ready to evaluate the final model on
the test set.

In [None]:
X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

final_predictions = final_model.predict(X_test)
final_rmse = np.sqrt(mean_squared_error(y_test, final_predictions))  # RMSE on the test set
print(final_rmse)

## Deploy your model

We now need to get our model ready for deployment to a production environment. The most basic way to do this is to save the best model you trained, transfer the file to your production environment, and load it. To save the model, you can use the joblib library like this:

In [None]:
import joblib

joblib.dump(final_model, "my_california_housing_model.pkl")

Once your model is transferred to production, we can load it and use it. For this we must first import any custom classes and functions the model relies on (which means transferring the code to production), then load the model using joblib and use it to make predictions:

In [None]:
def column_ratio(X):
    return X[:, [0]] / X[:, [1]]

def ratio_name(function_transformer, feature_names_in):
    return ["ratio"] # feature names out

final_model_reloaded = joblib.load("my_california_housing_model.pkl")
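
A quick way to gain confidence in the save and load step is a round-trip check: the reloaded model should give exactly the same predictions as the original. The sketch below does this for a small stand-in model trained on made-up data, writing to a temporary directory (the file name `toy_model.pkl` is hypothetical).

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in model trained on a tiny made-up dataset (y = 2x + 1)
X_toy = np.array([[0.], [1.], [2.], [3.]])
y_toy = np.array([1., 3., 5., 7.])
toy_model = LinearRegression().fit(X_toy, y_toy)

# Save to a temporary file, then load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    joblib.dump(toy_model, path)
    toy_model_reloaded = joblib.load(path)

same_predictions = np.allclose(toy_model.predict(X_toy),
                               toy_model_reloaded.predict(X_toy))
print(same_predictions)
```

Only load model files from sources you trust, and try to use the same scikit-learn version for saving and loading.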


Typically, we would have some kind of web service (e.g. Flask) to serve the model. Here, for simplicity, we just try out our model on some sample test data in the code cell below.

In [None]:
new_data = X_test.iloc[-5:]  # pretend these are new districts
predictions = np.round(final_model_reloaded.predict(new_data))
actual = y_test.iloc[-5:].values
print(predictions)
print(actual)

## Exercise

In an earlier [section](#Question-2), we noticed some data quirks at median house values of \\$500,001, \\$450,000 and \\$350,000. Try removing these data quirks to see whether you can produce a more accurate model.