# Chapter 2: End-to-End Machine Learning Project

This chapter works through an example project end to end, using data about California housing prices. The main steps are:

1) Look at the big picture.
2) Get the data.
3) Explore and visualize the data to gain insight.
4) Prepare the data for ML algorithms.
5) Select a model and train it.
6) Fine-tune your model.
7) Present your solution.
8) Launch, monitor, and maintain your system. 

## Look at the Big Picture

In this chapter, we use data from the California census. Our task is to use this data to build a model of housing prices in the state. The data includes metrics such as population, median income, and median housing price for each block group in California ("districts").

The first thing we will want to do is look at Appendix A, our machine learning project checklist. It can be found in the `Appendix A - Machine Learning Checklist.ipynb` notebook.

#### Frame the Problem

The first question is to ask the boss *what exactly the business objective is*. Building a model is probably not the end goal. How does the company expect to use and benefit from this model? 

The next question is to ask the boss *what current solutions look like (if any)*. This can give a reference for performance, as well as insights on how to solve the problem. 

Once these questions are asked and answered, we can figure out the following:
* Are we looking for a supervised, unsupervised, or other type of system?
* Is it a classification task or regression? What's the response?

In this chapter's example, we will build a supervised regression system, since we have labeled examples and our target, median housing price, is a continuous variable. We could bin these values and make it a classification task, but we will treat it as regression.

#### Select a Performance Measure

By selecting a performance measure, we can determine how well our system generates predictions. 

For regression, common metrics are:
* Mean Absolute Error (MAE)
* Mean Squared Error (MSE)
* Root Mean Squared Error (RMSE)
* R-Squared (Coefficient of Determination)
* Adjusted R-Squared
* Mean Absolute Percentage Error (MAPE)
* AIC/BIC

For classification, common metrics are:
* Accuracy
* Precision
* Recall
* F1 Score
* Specificity
* Sensitivity
* ROC/AUC

A more in-depth explanation of these can be found in the `docs/performance_measures.docx` file. In this problem, we elect to use RMSE.

Metrics like RMSE and MAE are both ways to measure the distance between two vectors (the vector of predictions and the vector of actual targets). The RMSE corresponds to the Euclidean norm (l2 norm), while the MAE corresponds to the Manhattan norm (l1 norm).

The higher the norm index, the more it focuses on large values and neglects small ones. This is why the RMSE is more sensitive to outliers than the MAE. When outliers are exponentially rare (as in a bell-shaped curve), the RMSE performs very well and is generally preferred.
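
To make the difference concrete, here is a minimal sketch that computes both metrics on a small, made-up error vector. The numbers are hypothetical (not from the housing data); the last prediction is deliberately a large miss so the effect of an outlier is visible.

```python
import numpy as np

# Hypothetical targets and predictions; the last prediction is a large miss.
y_true = np.array([200.0, 150.0, 300.0, 250.0, 400.0])
y_pred = np.array([210.0, 140.0, 310.0, 240.0, 100.0])

errors = y_pred - y_true

# RMSE: Euclidean (l2) norm of the error vector, divided by sqrt(m)
rmse = np.sqrt(np.mean(errors ** 2))

# MAE: Manhattan (l1) norm of the error vector, divided by m
mae = np.mean(np.abs(errors))

print(f"RMSE = {rmse:.2f}, MAE = {mae:.2f}")
```

The single large error dominates the RMSE far more than the MAE, which is the sensitivity to outliers described above.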

#### Check the Assumptions

Before fully diving in and writing code, it's a good practice to verify the assumptions that have been made so far. This helps catch serious issues early on. 

## Get the Data

Usually we will load in the data from a database - which involves a much more complicated process. In this example, we will simply load in the data from a CSV file.

In [None]:
from pathlib import Path
import tarfile
import urllib.request

import pandas as pd

# download data (create the target directory first, otherwise urlretrieve fails)
data_path = "https://github.com/ageron/data/raw/main/housing.tgz"
data_dir = Path("../../data/Hands-On Machine Learning with Scikit, Keras, Tensorflow")
data_dir.mkdir(parents = True, exist_ok = True)
data_save_path = data_dir / "housing_data.tgz"
urllib.request.urlretrieve(data_path, data_save_path)

# extract data
with tarfile.open(data_save_path) as housing_tarball_file:
    housing_tarball_file.extractall(path = data_dir)

# load data
housing_data = pd.read_csv(data_dir / "housing" / "housing.csv")

## Explore + Visualize Data to Gain Insights

Each row represents one district. There are 10 attributes in total (including the target, median_house_value). They are:
* longitude
* latitude
* housing_median_age
* total_rooms
* total_bedrooms
* population
* households
* median_income
* median_house_value
* ocean_proximity

There are 20,640 instances in the dataset. All attributes are numerical except for ocean_proximity.

In [None]:
housing_data.head(5)

In [None]:
housing_data.info()

In [None]:
housing_data.describe()

In [None]:
import matplotlib.pyplot as plt

housing_data.hist(bins = 50, figsize = (12, 8))
plt.show()

The book recommends creating a test set at this stage, before exploring the data any further. The goal is to avoid ***data snooping bias***: if we look at the test set, we may spot patterns in it that steer our choice of model, which makes our estimate of the generalization error too optimistic.

The book points out that the median income feature is likely to be very important for predicting housing prices. We want the train and test sets to be representative of the full dataset, with a sufficient number of instances from each stratum of median income. To do this, the book converts median income into an ordinal variable, and then performs a stratified split based on this variable.

In [None]:
from sklearn.model_selection import train_test_split
import numpy as np

# purely random split (shown for comparison; replaced by the stratified split below)
train_data, test_data = train_test_split(housing_data, test_size = 0.2, random_state = 1230)

# make median income into ordinal variable
housing_data["income_category"] = pd.cut(
    housing_data["median_income"], 
    bins = [0, 1.5, 3, 4.5, 6, np.inf], 
    labels = [1, 2, 3, 4, 5]
)

# stratified split
train_data, test_data = train_test_split(housing_data, test_size = 0.2, stratify = housing_data["income_category"], random_state = 1230)
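
To see what stratification buys us, the sketch below compares the income category proportions of a random test set and a stratified test set against the full dataset. It uses synthetic lognormal incomes as a stand-in for the real data (an assumption, so that the example is self-contained); `demo` and `category_proportions` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (assumption: not the real dataset)
rng = np.random.default_rng(42)
demo = pd.DataFrame({"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=5000)})
demo["income_category"] = pd.cut(
    demo["median_income"],
    bins=[0, 1.5, 3, 4.5, 6, np.inf],
    labels=[1, 2, 3, 4, 5],
)

# One purely random split and one split stratified on the income category
_, random_test = train_test_split(demo, test_size=0.2, random_state=42)
_, strat_test = train_test_split(
    demo, test_size=0.2, stratify=demo["income_category"], random_state=42
)

def category_proportions(frame):
    """Share of rows falling in each income category."""
    return frame["income_category"].value_counts(normalize=True).sort_index()

comparison = pd.DataFrame({
    "overall": category_proportions(demo),
    "random": category_proportions(random_test),
    "stratified": category_proportions(strat_test),
})
print(comparison.round(3))
```

The stratified column should track the overall proportions almost exactly, while the random column can drift, especially for the rarer categories.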

The book then:
* Plots and visualizes features
* Analyzes correlations - using the pandas *corr()* method and the *scatter_matrix()* plotting function
* Plays with some of the features and attempts to make new features that are more insightful
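
The correlation and feature-combination steps can be sketched as follows. The data here is synthetic (an assumption, generated so that income drives the target), and `rooms_per_household` is just an illustrative combined feature.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: the target depends linearly on income plus noise
rng = np.random.default_rng(0)
income = rng.uniform(1, 10, size=500)
demo = pd.DataFrame({
    "median_income": income,
    "median_house_value": 40_000 * income + rng.normal(0, 30_000, size=500),
    "total_rooms": rng.uniform(100, 5000, size=500),
    "households": rng.uniform(50, 1500, size=500),
})

# Combine two raw attributes into a (hopefully) more informative one
demo["rooms_per_household"] = demo["total_rooms"] / demo["households"]

# Pearson correlation of every attribute with the target, strongest first
corr_with_target = demo.corr()["median_house_value"].sort_values(ascending=False)
print(corr_with_target)
```

Keep in mind that the correlation coefficient only measures linear relationships, so a value near zero does not rule out a nonlinear dependency.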

## Prepare the Data for Machine Learning Algorithms

The book recommends writing functions for each of the steps so that we can re-use our code (on new projects, a live deployment, and on the test set). 

***Clean the Data***

Most ML algorithms cannot work with missing features. The book shows how to impute with the median:

In [None]:
from sklearn.impute import SimpleImputer

# instantiate imputer
imputer = SimpleImputer(strategy = "median")

# fit (only on numerical)
housing_numerical_features = housing_data.select_dtypes(include = [np.number])
imputer.fit(housing_numerical_features)
X = imputer.transform(housing_numerical_features)

***Scikit-Learn Design***

All objects in Scikit-Learn share a consistent and similar interface.

1) ***Estimators*** - Any object that can estimate some parameters based on a dataset is called an *estimator*. The estimation itself is performed by the *.fit()* method, which takes the dataset as an argument (plus the labels, for supervised learning).
2) ***Transformers*** - Some estimators can also transform a dataset. The transformation is performed by the *.transform()* method, which returns the transformed dataset. All transformers have a convenience method called *.fit_transform()*, which is equivalent to calling *.fit()* and then *.transform()*.
3) ***Predictors*** - Some estimators, given a dataset, are capable of making predictions. These have the *.predict()* method.

Note: By default, Scikit-Learn transformers output NumPy arrays (or sometimes SciPy sparse matrices), even when they are fed Pandas DataFrames.
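
The three roles can be seen in a few lines. This is a minimal sketch on a tiny made-up dataset (the column names and values are hypothetical), using `SimpleImputer` as the estimator/transformer and `LinearRegression` as the predictor.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# Tiny made-up dataset with one missing value per column
X = pd.DataFrame({"rooms": [3.0, np.nan, 5.0, 4.0], "age": [10.0, 20.0, np.nan, 40.0]})
y = np.array([1.0, 2.0, 3.0, 4.0])

# Estimator: fit() learns parameters (here, the per-column medians)
imputer = SimpleImputer(strategy="median")
imputer.fit(X)
print(imputer.statistics_)

# Transformer: transform() applies what was learned; the output is a NumPy array
X_filled = imputer.transform(X)
print(type(X_filled))

# Predictor: fit() on features and labels, then predict()
model = LinearRegression().fit(X_filled, y)
predictions = model.predict(X_filled)
```

Learned parameters are exposed as attributes with a trailing underscore (such as `statistics_`), which is a Scikit-Learn naming convention.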

***One-Hot Encoding***

The book transforms a categorical variable, ocean proximity, with One-Hot Encoding. One-Hot Encoding makes a unique column for each category in a feature. It will return a binary value, 1 if that category exists in that observation and 0 if not.

The result is a SciPy *sparse matrix*, an efficient representation for matrices that contain mostly zeros: it stores only the nonzero values and their positions. Be cautious of performing One-Hot Encoding when there are many categories in a feature, since it adds one column per category. A sparse matrix can be converted to a regular (dense) NumPy array by calling *.toarray()*; alternatively, set *sparse_output=False* when creating the encoder (this parameter was named *sparse* before Scikit-Learn 1.2).

Pandas also has a function called *get_dummies()* which converts each categorical feature into a one-hot representation. The Scikit-Learn encoder is recommended because it remembers the categories it was fitted on: it always outputs the same columns, and by default it raises an exception when it is given a category it has not seen before (set *handle_unknown="ignore"* to encode unknown categories as all zeros instead).

***Note***: When you fit any Scikit-Learn estimator using a DataFrame, it stores the column names in the *feature_names_in_* attribute. Transformers also provide a *get_feature_names_out()* method that can be used to build a DataFrame around the output.

In [None]:
from sklearn.preprocessing import OneHotEncoder

original_encoder = OneHotEncoder()
housing_cat_encoded = original_encoder.fit_transform(housing_data.loc[:, ["ocean_proximity"]])

In [None]:
original_encoder.feature_names_in_

***Feature Scaling and Transformation***

ML algorithms often don't perform very well when the numerical attributes have different scales. There are two common ways to get all attributes to have the same scale:
* Min-Max Scaling
* Standardization

Min-Max Scaling is the simplest: for each attribute, the values are shifted and rescaled so that they end up ranging from 0 to 1. This is performed by subtracting the min value and dividing by the difference between the min and the max. *MinMaxScaler* does this. It has a *feature_range* hyperparameter which lets you change the range if you'd like (e.g., some neural networks prefer values between -1 and 1).

Standardization subtracts the mean value (so standardized values have a zero mean), then divides by the standard deviation (so they have a standard deviation of 1). Unlike min-max scaling, it does not restrict values to a specific range, but it is much less affected by outliers. We can perform this with *StandardScaler*.

In [None]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# min-max scaling
min_max_scaler = MinMaxScaler()
housing_numerical_min_max_scaled = min_max_scaler.fit_transform(housing_numerical_features)


# standardization
std_scaler = StandardScaler()
housing_numerical_std_scaled = std_scaler.fit_transform(housing_numerical_features)
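
As a sanity check on the two formulas, the sketch below applies them by hand with NumPy and compares the results with the Scikit-Learn scalers. The input is a tiny made-up column with one outlier.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up single feature with one outlier (100.0)
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Min-max scaling: (x - min) / (max - min)
manual_min_max = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization: (x - mean) / std, using the population standard deviation
manual_standardized = (X - X.mean(axis=0)) / X.std(axis=0)

sk_min_max = MinMaxScaler().fit_transform(X)
sk_standardized = StandardScaler().fit_transform(X)

print(manual_min_max.ravel().round(3))
print(manual_standardized.ravel().round(3))
```

Note how the outlier pushes the four ordinary values into a tiny slice of the 0 to 1 range under min-max scaling.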

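When a feature's distribution has a *heavy tail* (i.e., values far from the mean are not exponentially rare), both min-max scaling and standardization will squash most values into a small range. Most ML models do not like this. So before scaling the feature, we will want to shrink the heavy tail and, if possible, make the distribution roughly symmetrical.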

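For positive features with a heavy tail to the right, a common fix is to replace the feature with its *square root* (or raise it to a power between 0 and 1). If the tail is really long and heavy, as in a power law distribution, replacing the feature with its *logarithm* may work better.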
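
The effect of these transformations can be checked with the sample skewness. This sketch uses a synthetic lognormal feature as a stand-in for a right-skewed attribute such as population (an assumption; the real column is not used here).

```python
import numpy as np
import pandas as pd

# Synthetic positive feature with a long right tail
rng = np.random.default_rng(0)
population = pd.Series(rng.lognormal(mean=7.0, sigma=1.0, size=10_000))

raw_skew = population.skew()
sqrt_skew = np.sqrt(population).skew()
log_skew = np.log(population).skew()

print(f"raw: {raw_skew:.2f}, sqrt: {sqrt_skew:.2f}, log: {log_skew:.2f}")
```

The square root reduces the skew, and the logarithm brings it close to zero for this particular distribution.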

Another approach to handling heavy-tailed features is ***bucketizing*** the feature. This means chopping its distribution into roughly equal-size buckets, and replacing each feature value with the index of the bucket it belongs to. For example, we could replace each value with its percentile. This results in a feature with an almost uniform distribution, so there's no need for further scaling.
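
Bucketizing into equal-size buckets can be sketched with the pandas `qcut` function. The feature is again synthetic (an assumption), and ten buckets is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic heavy-tailed feature
rng = np.random.default_rng(0)
population = pd.Series(rng.lognormal(mean=7.0, sigma=1.0, size=1_000))

# Ten quantile-based buckets; labels=False returns the bucket index (0 to 9)
bucket_index = pd.qcut(population, q=10, labels=False)

bucket_counts = bucket_index.value_counts().sort_index()
print(bucket_counts)
```

Each bucket holds about the same number of instances, so the bucket index is roughly uniformly distributed.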

When a feature has a multimodal distribution (i.e., two or more clear peaks), it can also be helpful to bucketize it. In this case, it may be beneficial to treat the bucket IDs as categories rather than as numerical values. This means we can one-hot encode the bucket indices.

Another approach for handling multimodal distributions is to add a feature for each of the modes, representing the similarity between the feature value and that particular mode. This similarity measure is typically computed using a ***radial basis function (RBF)***, which is any function that depends only on the distance between the input value and a fixed point. The most commonly used RBF is the ***Gaussian RBF***, whose output value decays exponentially as the input value moves away from the fixed point. We can compute it with the *rbf_kernel* function. Its *gamma* hyperparameter controls how quickly the similarity measure decays as x moves away from the fixed point.

In [None]:
from sklearn.metrics.pairwise import rbf_kernel

# Gaussian RBF
age_simil_35 = rbf_kernel(housing_data[["housing_median_age"]], [[35]], gamma = 0.1)
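
For a single feature, the Gaussian RBF similarity to the fixed point 35 is exp(-gamma * (x - 35)^2). The sketch below checks this formula against *rbf_kernel* on a few made-up ages.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# A few made-up housing ages
ages = np.array([[5.0], [25.0], [35.0], [45.0]])
gamma = 0.1

# Similarity to age 35 computed by scikit-learn
sk_similarity = rbf_kernel(ages, [[35.0]], gamma=gamma)

# Same quantity written out by hand: exp(-gamma * squared distance)
manual_similarity = np.exp(-gamma * (ages - 35.0) ** 2)

print(sk_similarity.ravel())
```

The similarity is 1 at the fixed point and symmetric around it; a larger `gamma` makes it drop off faster.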

So far, we've looked at how to transform input features. However, we need to be conscious that the target values may need to be transformed as well. ***If you transform the target, you must apply the inverse transformation to the model's predictions to get values on the original scale***. Thankfully, most Scikit-Learn transformers have an *.inverse_transform()* method.

In [None]:
from sklearn.linear_model import LinearRegression

# target
housing_labels = housing_data.loc[:, "median_house_value"]

# scale target
target_scaler = StandardScaler()
scaled_labels = target_scaler.fit_transform(housing_labels.to_frame())

# fit linear reg
model = LinearRegression()
model.fit(housing_data[["median_income"]], scaled_labels)

# predictions + scale back
# NOTE: this code will not run unless you define variable `some_new_data`
scaled_predictions = model.predict(some_new_data)
prediction = target_scaler.inverse_transform(scaled_predictions)
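
Scikit-Learn also provides *TransformedTargetRegressor*, which handles the scale-then-invert bookkeeping automatically. The following is a minimal sketch on synthetic data (the data and `some_new_data` values are made up for illustration, not taken from the housing set).

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: the target is roughly 40,000 times the single feature
rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=(200, 1))
y = 40_000 * X[:, 0] + rng.normal(0, 5_000, size=200)

# The wrapper scales y before fitting and inverts the scaling in predict()
model = TransformedTargetRegressor(
    regressor=LinearRegression(), transformer=StandardScaler()
)
model.fit(X, y)

some_new_data = np.array([[2.0], [5.0], [8.0]])  # hypothetical new inputs
predictions = model.predict(some_new_data)
print(predictions)
```

The predictions come back on the original scale of the target, so no manual call to *inverse_transform()* is needed.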

### Custom Scikit-Learn Transformers

Custom transformers can be useful to combine features (the first example below computes the ratio between two variables).

We can also pass an *inverse_func* argument, which is used by *.inverse_transform()* to undo the transformation (the second example below takes the log and uses the exponential as its inverse).

In [None]:
from sklearn.preprocessing import FunctionTransformer

# combine features with FunctionTransformer
ratio_transformer = FunctionTransformer(lambda X: X[:, [0]] / X[:, [1]])
ratio_transformer.transform(np.array([[1., 2.], [3., 4.]]))

# take log
log_transformer = FunctionTransformer(np.log, inverse_func = np.exp)
log_pop = log_transformer.transform(housing_data[["population"]])
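
To confirm that *inverse_func* really undoes the transformation, here is a short round-trip sketch on a few made-up population values.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

log_transformer = FunctionTransformer(np.log, inverse_func=np.exp)

# Made-up population counts
population = np.array([[100.0], [1_000.0], [10_000.0]])

# Forward transform, then back again
log_population = log_transformer.transform(population)
recovered = log_transformer.inverse_transform(log_population)

print(log_population.ravel())
print(recovered.ravel())
```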

### Transformation Pipelines

As we begin to increase the number of transformations, it becomes important to chain and organize them. Scikit-Learn offers the Pipeline class to help with such sequences.

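The *Pipeline* constructor takes a list of (name, estimator) tuples, which define the sequence of steps. The names can be anything, as long as they are unique and do not contain double underscores (*dunders*). All estimators except the last one must be transformers (i.e., they must have a *fit_transform()* method); the last one can be anything: a transformer, a predictor, or any other type of estimator.

When we call *fit()* on the pipeline, it calls *fit_transform()* sequentially on all the transformers, passing the output of each call as the input to the next, until it reaches the final estimator, for which it just calls *fit()*.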

In [None]:
from sklearn.pipeline import Pipeline

# transform numerical values
numerical_pipeline = Pipeline(
    [("impute", SimpleImputer(strategy = "mean")),
     ("standardize", StandardScaler()),
    ])
housing_numerical_prepared = numerical_pipeline.fit_transform(housing_numerical_features)

# cast to dataframe
housing_numerical_prepared_data = pd.DataFrame(
    housing_numerical_prepared,
    columns = numerical_pipeline.get_feature_names_out(),
    index = housing_numerical_features.index
)
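
The chaining behavior described above can be verified on a small example: running the pipeline should give the same result as calling the two transformers by hand, one after the other. The input array is made up, and the median strategy is used here only for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data with one missing value per column
X = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan], [5.0, 40.0]])

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("standardize", StandardScaler()),
])
pipeline_output = pipeline.fit_transform(X)

# The same two steps chained manually
imputed = SimpleImputer(strategy="median").fit_transform(X)
manual_output = StandardScaler().fit_transform(imputed)

# Individual fitted steps can be looked up by name
impute_medians = pipeline.named_steps["impute"].statistics_
print(impute_medians)
```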

Up to this point, we've handled the numerical and categorical attributes separately. It would be much more convenient to have a single transformer capable of handling all columns, applying the appropriate transformations to each one. We can use *ColumnTransformer* for this.

In [None]:
from sklearn.compose import ColumnTransformer

# define attributes
numerical_attributes = ["longitude", "latitude", "housing_median_age"]
categorical_attributes = ["ocean_proximity"]

# make pipeline
categorical_pipeline = Pipeline(
    [("simple_impute", SimpleImputer(strategy = "most_frequent")),
     ("one_hot", OneHotEncoder(handle_unknown = "ignore")),
    ])

preprocessing = ColumnTransformer([
    ("num", numerical_pipeline, numerical_attributes),
    ("cat", categorical_pipeline, categorical_attributes)
])

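Since listing all the column names is not very convenient, Scikit-Learn provides *make_column_selector()*, which selects columns automatically (for example, by dtype), and *make_column_transformer()*, which names the transformers for you: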

In [None]:
from sklearn.compose import make_column_selector, make_column_transformer

# define transformer
preprocessing = make_column_transformer(
    (numerical_pipeline, make_column_selector(dtype_include = np.number)),
    (categorical_pipeline, make_column_selector(dtype_include = object))
)

# apply transformer
housing_prepared = preprocessing.fit_transform(housing_data)

## Select and Train a Model

***Ensemble Methods***

One way to fine tune your system is to try to combine the models that perform best. The group ("ensemble") will often perform better than the best individual model - just like random forests often perform better than the individual decision trees they rely on. This is especially the case if the individual models make very different types of errors.

For example, we can train and fine-tune a k-nearest neighbors model, then create an ensemble model that simply predicts the mean of the random forest's prediction and that model's prediction.
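
A minimal sketch of this averaging idea, on synthetic regression data from `make_regression` (an assumption; the hyperparameters are arbitrary and not tuned):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression problem standing in for the housing data
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)
knn = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)

forest_pred = forest.predict(X_test)
knn_pred = knn.predict(X_test)

# The ensemble simply averages the two models' predictions
ensemble_pred = (forest_pred + knn_pred) / 2

def rmse(y_true, y_pred):
    """Root mean squared error between two vectors."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

for name, pred in [("forest", forest_pred), ("knn", knn_pred), ("ensemble", ensemble_pred)]:
    print(f"{name}: RMSE = {rmse(y_test, pred):.2f}")
```

The averaged prediction can never be worse than the weaker of the two models in terms of RMSE, and it tends to help most when the models' errors are not strongly correlated. Scikit-Learn's `VotingRegressor` packages the same averaging into a single estimator.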

***Analyzing the Best Models and Their Errors***

You can often gain good insights on the problem by inspecting the best models. For example, a random forest can indicate the relative importance of each attribute for making accurate predictions through its *.feature_importances_* attribute.

We will also want to look at the specific errors our system makes, then try to understand why it makes them and what could fix the problem. Could we add extra features, get rid of uninformative ones, or clean up outliers?
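
As an illustration, the sketch below fits a random forest on synthetic data in which the first feature matters most and the third is pure noise, then ranks the features by importance. The feature names are borrowed from the housing data only as labels; the data itself is made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: strong effect, weak effect, and a pure-noise column
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)
feature_names = ["median_income", "housing_median_age", "noise"]

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Pair each importance score with its feature name, highest first
ranked = sorted(zip(forest.feature_importances_, feature_names), reverse=True)
for importance, name in ranked:
    print(f"{name}: {importance:.3f}")
```

The importance scores sum to 1, so they are best read as relative shares rather than absolute effect sizes.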

## Launch, Monitor, and Maintain Your System

It is very important to write monitoring code to check our system's live performance at regular intervals, and trigger alerts when it drops. 
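
A monitoring check can be as simple as recomputing the performance measure on recently labeled data and comparing it with a baseline. The sketch below is one hypothetical way to do it: the function name `check_live_performance`, the 10% tolerance, and the sample values are all assumptions for illustration.

```python
import numpy as np

def check_live_performance(y_true, y_pred, baseline_rmse, tolerance=0.10):
    """Return (rmse, alert): alert is True when the RMSE on the recent batch
    exceeds the baseline by more than the given relative tolerance."""
    errors = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    rmse = float(np.sqrt(np.mean(errors ** 2)))
    alert = rmse > baseline_rmse * (1 + tolerance)
    return rmse, alert

# Hypothetical batches of recent labels and model predictions
healthy_rmse, healthy_alert = check_live_performance(
    [100, 200, 300], [110, 190, 305], baseline_rmse=10.0
)
degraded_rmse, degraded_alert = check_live_performance(
    [100, 200, 300], [150, 150, 350], baseline_rmse=10.0
)
print(healthy_rmse, healthy_alert)
print(degraded_rmse, degraded_alert)
```

In practice the alert would feed whatever notification system is in place, and the baseline would come from the model's validation performance.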

If the data keeps evolving, we will need to update our datasets and retrain our models regularly.

We will also want to write scripts to check our system's input data quality.
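
An input quality check might verify that the expected columns are present and that not too many values are missing. The following is a hypothetical sketch: the function name `check_input_quality`, the 5% threshold, and the sample batch are assumptions, and a real check would likely also cover value ranges and category sets.

```python
import numpy as np
import pandas as pd

def check_input_quality(batch, expected_columns, max_missing_fraction=0.05):
    """Return a list of human-readable problems found in an input batch."""
    problems = []
    for column in expected_columns:
        if column not in batch.columns:
            problems.append(f"{column}: column is missing")
            continue
        missing_fraction = batch[column].isna().mean()
        if missing_fraction > max_missing_fraction:
            problems.append(f"{column}: {missing_fraction:.0%} missing values")
    return problems

# Hypothetical incoming batch: one column absent, one with many missing values
batch = pd.DataFrame({
    "median_income": [2.5, np.nan, np.nan, 4.0],
    "housing_median_age": [10.0, 20.0, 30.0, 40.0],
})
problems = check_input_quality(
    batch, ["population", "median_income", "housing_median_age"]
)
print(problems)
```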

We will want to make sure we version our models and datasets.