# Google ML Course Notebook

## Framing

#### Terms

**Label**: The true thing that we're predicting, the `y` variable in a linear regression

**Features**: The input variables, the `x` variables in a linear regression. There can be many: `x₁, x₂, ..., xₙ`

**Example**: One instance of data, **x**. The boldface **x** means that this instance is a vector. Examples can be categorized as labeled or unlabeled

**Labeled Example**: `{features, label}: (x, y)` used to train the model


| housingMedianAge(feature) | totalRooms(feature) | totalBedrooms(feature) | medianHouseValue(label) |
|:------------------------- |:------------------- |:---------------------- |:----------------------- |
| 15                        | 5612                | 1283                   | 66900                   |
| 19                        | 7650                | 1901                   | 80100                   |
| 17                        | 720                 | 174                    | 85700                   |
| 14                        | 1501                | 337                    | 73400                   |
| 20                        | 1454                | 326                    | 65500                   |

**Unlabeled Example**: `{features, ?}: (x, ?)` used for making predictions on new data


| housingMedianAge(feature) | totalRooms(feature) | totalBedrooms(feature) | 
|:-------------------------:|:-------------------:|:----------------------:|
| 42                        | 1686                | 361                    |
| 34                        | 1226                | 180                    |
| 33                        | 1077                | 271                    |

**Models**: A model maps examples to predicted labels

Models use **training** and **inference**

Model **training** means creating or learning the model: showing the model labeled examples and enabling it to gradually learn the relationship between features and label

Model **inference** means applying the trained model to unlabeled examples. In the example above the model could predict `medianHouseValue`
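As a rough illustration of the two phases, the sketch below trains on the five labeled rows from the table above and then runs inference on the three unlabeled rows. It assumes an ordinary least-squares fit via NumPy, which is a choice made for this example and not something the course prescribes here. The names `add_bias_column`, `params`, and `predictions` are hypothetical, and five rows are far too few for the predictions to be meaningful.

```python
import numpy as np

# Labeled examples: housingMedianAge, totalRooms, totalBedrooms -> medianHouseValue
X_train = np.array([
    [15, 5612, 1283],
    [19, 7650, 1901],
    [17, 720, 174],
    [14, 1501, 337],
    [20, 1454, 326],
], dtype=float)
y_train = np.array([66900, 80100, 85700, 73400, 65500], dtype=float)

# Unlabeled examples: features only, no label
X_new = np.array([
    [42, 1686, 361],
    [34, 1226, 180],
    [33, 1077, 271],
], dtype=float)


def add_bias_column(X):
    """Prepend a column of ones so the first parameter acts as the bias."""
    return np.hstack([np.ones((X.shape[0], 1)), X])


# Training: learn one bias and three weights from the labeled examples
# (least squares is an assumption made for this sketch).
params, *_ = np.linalg.lstsq(add_bias_column(X_train), y_train, rcond=None)

# Inference: apply the learned parameters to the unlabeled examples.
predictions = add_bias_column(X_new) @ params
```

`predictions` holds one predicted `medianHouseValue` per unlabeled row.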

**Regression vs. Classification**

A **regression** model predicts continuous values, think a linear regression or how much / how little

A **classification** model predicts a discrete value, think a boolean value (spam/not spam) or what animal is this





## Descending into ML

### Linear Regression

**Linear Regression** is a best fit line

$y' = w_1 x_1 + b$
 - $w_1$ is the weight of feature 1 (the slope of the line)
 - $b$ is the bias or y-intercept, and can also be referred to as $w_0$
 - $x_1$ is the feature (a known input)
 - $y'$ is the predicted label (the desired output)

**Loss**: The distance of any given data point from the linear regression line, that is, how far the prediction is from the label. Loss is on an absolute scale, so the sign of the error does not matter

**Classifying Loss**
Squared Error ($L_2$ Loss): calculated as the summation of the square of the difference between the model prediction and the true value

$\Sigma$: Summing over all the examples in the training set

$D$: The training set. Dividing the sum by the number of examples in $D$ gives the average loss over all examples

A more sophisticated model with three features would be written as $y' = b + w_1x_1 + w_2x_2 + w_3x_3$
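The one-feature and three-feature equations are the same computation with a different number of terms. A minimal sketch, assuming the features and weights are stored as NumPy arrays; the function name `predict` and the numeric values are made up for illustration.

```python
import numpy as np


def predict(x, w, b):
    """Return y' = b + w_1*x_1 + ... + w_n*x_n for a single example."""
    return b + np.dot(w, x)


# One feature: y' = b + w_1*x_1
y_one = predict(np.array([2.0]), np.array([3.0]), 1.0)

# Three features: y' = b + w_1*x_1 + w_2*x_2 + w_3*x_3
y_three = predict(np.array([1.0, 2.0, 3.0]), np.array([0.5, -1.0, 2.0]), 4.0)
```

Here `y_one` is `1 + 3*2` and `y_three` is `4 + 0.5 - 2 + 6`.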

### Training and Loss

**Training** 

Training refers to the act of learning good values for all the weights and the bias from labeled examples.

**Empirical Risk Minimization**: in supervised learning, build a model by finding the weights and bias that minimize loss over the training examples

The better a predictive model, the lower the overall average loss will be

**Squared Loss ($L_2$ loss)** is the square of the difference between observation and prediction  
= $(observation - prediction(x))^2$

= $(y - y')^2$


**Mean Square Error (MSE)** is the average squared loss per example over the whole dataset
$MSE = \frac{1}{N} \sum_{(x,y)\in D} (y - prediction(x))^2$

$(x, y)$ is an example in which $x$ is the set of features the model uses to make predictions and $y$ is the example's label

$prediction(x)$ is a function of the weights and bias in combination with the set of features $x$

$D$ is a data set containing many labeled examples, which are made up of $(x, y)$ pairs

$N$ is the number of examples in $D$
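The MSE formula translates almost term by term into code. A small sketch in plain Python, assuming `D` is a list of `(x, y)` pairs with a single numeric feature; the data set and the `prediction` function below are invented for the example.

```python
def mean_squared_error(examples, prediction):
    """Average of (y - prediction(x))^2 over all (x, y) pairs in examples."""
    squared_losses = [(y - prediction(x)) ** 2 for x, y in examples]
    return sum(squared_losses) / len(squared_losses)


# Hypothetical data set D with N = 3 labeled examples.
D = [(1.0, 3.0), (2.0, 5.0), (3.0, 8.0)]


def prediction(x):
    """Hypothetical model with weight 2 and bias 1."""
    return 2.0 * x + 1.0


mse = mean_squared_error(D, prediction)
```

The model is exact on the first two examples and off by 1 on the third, so the squared losses are 0, 0, and 1, and `mse` comes out to one third.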


## Reducing Loss

### An Iterative Approach

The "model" takes one or more features as input and returns one prediction $(y')$ as output

$b$ and $w_1$ can be arbitrary values, $(0, 0)$ is a simple place to start

$y'$ is the model's prediction for $x$

$y$ is the correct label for $x$; the loss is computed by comparing $y'$ against it

Iterate until overall loss stops changing or at least changes extremely slowly. When that happens, we say that the model has **converged**.
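The iterative loop can be sketched as follows for a one-feature model. This assumes MSE as the loss and a plain gradient step as the parameter update, which is only one possible update rule. The function name, `tolerance`, `max_steps`, and the toy data are all hypothetical choices for illustration.

```python
import numpy as np


def train_until_converged(x, y, learning_rate=0.05, tolerance=1e-9, max_steps=10_000):
    """Repeat predict -> compute loss -> update until the loss stops changing."""
    w1, b = 0.0, 0.0                       # arbitrary starting point (0, 0)
    previous_loss = float("inf")
    for step in range(max_steps):
        error = (w1 * x + b) - y           # y' - y for every example
        loss = np.mean(error ** 2)         # mean squared error
        if abs(previous_loss - loss) < tolerance:
            return w1, b, loss, step       # loss has (almost) stopped changing
        # Parameter update (assumed here: one gradient step on the MSE).
        w1 -= learning_rate * 2.0 * np.mean(error * x)
        b -= learning_rate * 2.0 * np.mean(error)
        previous_loss = loss
    return w1, b, loss, max_steps


# Toy data generated from y = 2x + 1, so the loop should approach w1 = 2, b = 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0
w1, b, loss, steps = train_until_converged(x, y)
```

The stopping test compares consecutive loss values, which matches the "changes extremely slowly" criterion; it does not guarantee that the parameters are exactly optimal.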

### Gradient Descent

For linear regression with squared loss, the plot of Loss vs. Weight for $w_1$ is always convex (bowl-shaped). Because of this, these problems have only one minimum, the point where the slope is 0

**Gradient of Loss** is the derivative (slope) of the loss curve at any given point. When there is more than one weight, the gradient is a vector of partial derivatives with respect to the weights.

Gradient, because it's a vector, has both a direction and a magnitude.
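To make the "direction and magnitude" point concrete, the sketch below computes the gradient of the MSE for a one-feature model $y' = w_1 x_1 + b$. The partial derivatives are written out by hand under that assumption; the function name `mse_gradient` and the data are illustrative only.

```python
import numpy as np


def mse_gradient(w1, b, x, y):
    """Partial derivatives of the MSE with respect to w1 and b."""
    error = (w1 * x + b) - y
    d_w1 = 2.0 * np.mean(error * x)    # d(MSE)/d(w1)
    d_b = 2.0 * np.mean(error)         # d(MSE)/d(b)
    return np.array([d_w1, d_b])


# Toy data that lies exactly on y = 2x.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

gradient = mse_gradient(0.0, 0.0, x, y)
magnitude = np.linalg.norm(gradient)        # how steep the loss surface is here
descent_direction = -gradient / magnitude   # unit vector pointing downhill
```

Starting from $(0, 0)$, both components of `descent_direction` are positive, meaning the loss decreases if both `w1` and `b` increase. At the true parameters `(2, 0)` the gradient is the zero vector.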

### Learning Rate

Gradient descent algorithms multiply the gradient by a scalar known as the **learning rate** (step size). For example, if the gradient magnitude is 2.5 and the learning rate is 0.01, the next point will be 0.025 away. For a one-dimensional function, the ideal learning rate is $\frac{1}{f''(x)}$, the inverse of the second derivative of $f(x)$ at $x$. For functions of two or more dimensions, the ideal learning rate is given by the inverse of the [Hessian Matrix](https://en.wikipedia.org/wiki/Hessian_matrix)
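Both claims can be checked with a few lines of arithmetic. The sketch below assumes a hypothetical one-dimensional loss $f(w) = 3(w - 4)^2$, whose second derivative is the constant 6; for a quadratic like this, a learning rate of $\frac{1}{f''(w)}$ lands on the minimum in a single step. The function names are made up for the example.

```python
def gradient_step(w, gradient, learning_rate):
    """Move against the gradient by learning_rate * gradient."""
    return w - learning_rate * gradient


# Gradient magnitude 2.5 with learning rate 0.01 moves the point by about 0.025.
step_size = abs(gradient_step(1.0, 2.5, 0.01) - 1.0)


# Hypothetical loss f(w) = 3 * (w - 4)**2, minimized at w = 4.
def f_prime(w):
    return 6.0 * (w - 4.0)


f_second = 6.0                      # second derivative, constant for a quadratic

w_start = 10.0
w_next = gradient_step(w_start, f_prime(w_start), 1.0 / f_second)
```

`w_next` ends up at the minimum, `w = 4`, up to floating-point rounding. For non-quadratic losses the second derivative changes from point to point, so this is an idealized case.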

### Stochastic Gradient Descent

**Batch** is the total number of examples you use to calculate the gradient in a single iteration

**Stochastic Gradient Descent (SGD)** chooses examples at random, 1 example at a time (batch size of 1) per iteration. Given enough iterations SGD works, but it is very noisy

**Mini-Batch Stochastic Gradient Descent (Mini-Batch SGD)** is a compromise between full-batch iteration and SGD. Batch sizes typically range between 10 and 1,000 examples, chosen at random
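The only difference between SGD and mini-batch SGD in code is how many examples are sampled per iteration. A minimal sketch for a one-feature linear model with MSE loss, assuming NumPy's random generator for sampling; the function name, hyperparameters, and the noise-free toy data are all assumptions made for illustration.

```python
import numpy as np


def minibatch_sgd(x, y, batch_size, learning_rate=0.1, steps=2000, seed=0):
    """Gradient descent where each step uses a random batch of examples."""
    rng = np.random.default_rng(seed)
    w1, b = 0.0, 0.0
    for _ in range(steps):
        # Pick batch_size distinct examples at random for this iteration.
        idx = rng.choice(len(x), size=batch_size, replace=False)
        xb, yb = x[idx], y[idx]
        error = (w1 * xb + b) - yb
        # Gradient of the MSE estimated from the batch only.
        w1 -= learning_rate * 2.0 * np.mean(error * xb)
        b -= learning_rate * 2.0 * np.mean(error)
    return w1, b


# Noise-free toy data from y = 3x + 0.5 (an assumption that keeps the demo simple).
x = np.linspace(0.0, 1.0, 100)
y = 3.0 * x + 0.5

w1_sgd, b_sgd = minibatch_sgd(x, y, batch_size=1)     # SGD: one example per step
w1_mini, b_mini = minibatch_sgd(x, y, batch_size=10)  # mini-batch of 10
```

Setting `batch_size=len(x)` would turn the same function into full-batch gradient descent. With real, noisy data the batch-size-1 run would jitter around the minimum much more than the mini-batch run.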

### Learning Rate and Convergence

