# Ch4 Training Models

This chapter aims to build a better understanding of how to choose:

- an appropriate **model**

- the right **training algorithm**

- a good set of _hyperparameters_

linear regression model, which can be trained in two ways:
- analytically, i.e. with a direct "closed-form" equation (the Normal Equation) that computes the model parameters that best fit the model to the training set
- iteratively, i.e. with Gradient Descent (GD), which gradually tweaks the model parameters to minimize the cost function over the training set

polynomial regression
- how to detect overfitting
- regularization techniques to reduce overfitting

logistic regression 

softmax regression

need to know _vectors_ and _matrices_

## Setup

```python
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "training_linear_models"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

# Ignore useless warnings (see SciPy issue #5998)
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")
```

## Linear Regression

A **linear model** makes a prediction by simply computing a **_weighted sum_** of the **_input features_**, plus a **_constant_** called the **_bias term (intercept)_**.

$\hat{y} = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n$

...in which 
- $\hat{y}$ : predicted value
- $n$ : number of features
- $x_i$ : the $i^{th}$ feature value
- $\theta_j$ : the $j^{th}$ model parameter, including the bias term $\theta_0$ and the feature weights $\theta_1, \theta_2, ..., \theta_n$
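
As a minimal sketch of the weighted sum above, the following plain-Python function computes $\hat{y}$ for a single instance. The function name `predict` and the parameter values are hypothetical and chosen only for illustration; they do not come from a trained model.

```python
def predict(theta, features):
    """Return theta_0 + theta_1*x_1 + ... + theta_n*x_n for one instance.

    theta    : sequence [theta_0, theta_1, ..., theta_n] (bias term first)
    features : sequence [x_1, ..., x_n]
    """
    bias, weights = theta[0], theta[1:]
    if len(weights) != len(features):
        raise ValueError("expected exactly one weight per feature")
    # Weighted sum of the input features, plus the bias term
    return bias + sum(w * x for w, x in zip(weights, features))


# Hypothetical parameters: bias 4.0, feature weights 3.0 and -2.0
theta = [4.0, 3.0, -2.0]
x = [1.5, 0.5]

y_hat = predict(theta, x)  # 4.0 + 3.0*1.5 + (-2.0)*0.5
print(y_hat)
```

With these illustrative values the bias contributes 4.0 and the two features contribute 4.5 and -1.0 respectively.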

... which can be vectorized as:

$\hat{y} = h_\theta(x) = \theta \cdot x$

...in which

- **$\theta$** is the model's **_parameter vector_**, including the bias term $\theta_0$ and the feature weights $\theta_1$ to $\theta_n$

- **x** is the instance's **_feature vector_**, including $x_0$ to $x_n$, with $x_0 = 1$

- **$\theta \cdot x$** is the **_dot product_** of the two vectors (NOT their element-wise product!), i.e., $\theta \cdot x = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$. If both are treated as column vectors, this is the same as $\theta^T x$, i.e. the matrix multiplication of the transpose of $\theta$ with $x$

- **$h_\theta$** is the **_hypothesis function_**, using the model parameters **$\theta$**
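
The vectorized form can be sketched with NumPy as follows. The values of `theta` and `features` are hypothetical; the point is only to show that prepending $x_0 = 1$ lets a single dot product cover the bias term, that $\theta \cdot x$ agrees with $\theta^T x$ for column vectors, and that the element-wise product is a different thing.

```python
import numpy as np

theta = np.array([4.0, 3.0, -2.0])   # hypothetical [theta_0, theta_1, theta_2]
features = np.array([1.5, 0.5])      # x_1, x_2 for one instance

# Prepend x_0 = 1 so the bias term is handled by the dot product
x = np.concatenate(([1.0], features))

# Dot product of the two 1-D vectors: a single number
y_hat_dot = np.dot(theta, x)

# Same computation as a matrix product of column vectors: theta^T x
theta_col = theta.reshape(-1, 1)
x_col = x.reshape(-1, 1)
y_hat_matmul = (theta_col.T @ x_col).item()

# The element-wise product is a vector, not a prediction
elementwise = theta * x

print(y_hat_dot, y_hat_matmul, elementwise)
```

Summing the entries of `elementwise` recovers the dot product, which is one way to remember how the two operations relate.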

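Training a linear regression model means finding the value of $\theta$ (the parameter vector) that minimizes the **_Root Mean Squared Error (RMSE)_**. In practice it is simpler to minimize the Mean Squared Error (MSE): the square root is monotonically increasing, so the $\theta$ that minimizes the MSE also minimizes the RMSE.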

Mean Squared Error (MSE) cost function: MSE(**X**, $h_\theta$), or more simply MSE($\theta$)
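
To make the cost function concrete, here is a small sketch that evaluates it for two candidate parameter vectors. It assumes the standard definition of the MSE as the mean of the squared prediction errors over the $m$ training instances. The synthetic data, the helper name `mse`, and both candidate vectors are hypothetical and exist only for illustration.

```python
import numpy as np

def mse(theta, X_b, y):
    """Mean of squared errors of the linear model theta on (X_b, y).

    X_b must already contain the column x_0 = 1 for the bias term.
    """
    errors = X_b @ theta - y
    return np.mean(errors ** 2)


# Synthetic data (an assumption for this sketch): y = 4 + 3x + Gaussian noise
rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=m)
X_b = np.c_[np.ones((m, 1)), X]      # add x_0 = 1 to every instance

theta_a = np.array([4.0, 3.0])       # close to the generating parameters
theta_b = np.array([0.0, 0.0])       # a deliberately poor candidate

mse_a, mse_b = mse(theta_a, X_b, y), mse(theta_b, X_b, y)
rmse_a, rmse_b = np.sqrt(mse_a), np.sqrt(mse_b)

print("MSE :", mse_a, mse_b)
print("RMSE:", rmse_a, rmse_b)
```

Whichever candidate has the lower MSE also has the lower RMSE, which is why the two cost functions lead to the same choice of $\theta$.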