# Linear Regression

Suppose we have a dataset and plot it like so:

![title](img/cricketpoints.svg)

In this particular case, the x variable is the number of cricket chirps per minute, and the y variable is the temperature in degrees Celsius.

We can draw a line that approximately passes "through" the data points:

![title](img/CricketLine.svg)

This shows us that the relationship between chirps per minute and the temperature in Celsius is approximately linear. From algebra, we know that the equation for a line is
> y = mx + b
- y is the temperature in Celsius
- x represents the cricket chirps per minute
- b represents the y intercept
- m is the slope of the line

In ML, the same equation is written with slightly different notation:
> y' = b + w<sub>1</sub>x<sub>1</sub>
- y' is the predicted label
- b is the bias (the y-intercept)
- w<sub>1</sub> is the weight of feature 1. A weight is the same concept as the slope m
- x<sub>1</sub> is the feature (a known input)

In this particular case, our model has only one feature, but real-world models can have many features (even millions), each with its own weight. A model with three features, for example, is written as
> y' = b + w<sub>1</sub>x<sub>1</sub> + w<sub>2</sub>x<sub>2</sub> + w<sub>3</sub>x<sub>3</sub>
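
As a minimal sketch of the equation above, the prediction is just the bias plus the sum of each weight times its feature. The function name `predict` and all of the numbers below are made up for illustration; they do not come from the cricket dataset.

```python
def predict(features, weights, bias):
    """Return y' = b + w1*x1 + w2*x2 + ... for one example."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return bias + sum(w * x for w, x in zip(weights, features))


# One feature: y' = b + w1*x1 (hypothetical values).
single = predict([20.0], weights=[0.5], bias=3.0)

# Three features: y' = b + w1*x1 + w2*x2 + w3*x3 (hypothetical values).
triple = predict([1.0, 2.0, 3.0], weights=[0.5, -1.0, 2.0], bias=4.0)
```

The one-feature call is the same function with lists of length one, which is why the single-feature model can be seen as a special case of the general one.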

## Training and Loss

**Training a model simply means learning (determining) good values for all the weights and the bias from labeled examples**. In supervised learning, an ML algorithm builds a model by examining many examples and attempting to find a model that minimizes **loss**. This process is called empirical risk minimization. 

Loss is the penalty for a bad prediction. In other words, loss tells us how bad the model's prediction was for a single example. A perfect prediction has a loss of zero; otherwise, the loss is greater than zero.

**The overall goal of training a model is to find a set of weights and biases that have low loss, on average, across all examples.**

For example, see the charts below:

![Loss](img/LossSidebySide.png)

**In the charts above, the model on the left has high loss, while the model on the right has low loss.**

We can create a mathematical function that aggregates the individual losses in a meaningful manner. In other words, we can summarize how badly the model does across all examples as a single number that training can then try to reduce, and that is where our **loss function** comes into play.

### Squared Loss: A Popular Loss Function

The linear regression models examined in these notes use a loss function called **squared loss** (L2 loss). The squared loss for any single example is as follows:

= The square of the difference between the label and the prediction  
= (observation - prediction(x))<sup>2</sup>  
= **(y - y')<sup>2</sup>**
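
The squared loss for one example can be sketched in a couple of lines. The function name `squared_loss` and the sample values are only illustrative.

```python
def squared_loss(label, prediction):
    """Squared (L2) loss for a single example: (y - y')^2."""
    return (label - prediction) ** 2


perfect = squared_loss(3.0, 3.0)      # prediction equals the label
off_by_two = squared_loss(5.0, 3.0)   # prediction is 2 below the label
mirrored = squared_loss(3.0, 5.0)     # prediction is 2 above the label
```

Because the difference is squared, overshooting and undershooting by the same amount give the same loss, and larger errors are penalized much more heavily than small ones.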


### Mean Square Error (MSE)

MSE is the average squared loss per example in a dataset. To calculate MSE, sum up all of the squared losses for individual examples and then divide by the number of examples:

![mse](img/mse.png)

**In this formula, the sigma (Σ) means 'sum'. Therefore, it could be said that MSE is equal to 1/N times the sum of (y - prediction(x))<sup>2</sup> for every pair (x, y) in the set D.**

- (x, y) is an example in which 
    - x is the set of features that the model uses to make predictions
    - y is the example's label
- prediction(x) is a function of the weights and bias in combination with the set of features x
- D is a data set of many labeled examples, which are (x, y) pairs
- N is the number of examples in D

Although MSE is commonly used in ML, it is neither the only loss function available nor the best one for every situation.
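
The MSE formula can be sketched directly from its definition. Here `mean_squared_error`, `toy_model`, and the data set `D` are all hypothetical names and values chosen for illustration, and `toy_model` stands in for prediction(x) with made-up parameters.

```python
def mean_squared_error(examples, predict):
    """MSE = (1/N) * sum of (y - prediction(x))^2 over all (x, y) in D."""
    if not examples:
        raise ValueError("need at least one example")
    total = sum((y - predict(x)) ** 2 for x, y in examples)
    return total / len(examples)


def toy_model(x):
    """Hypothetical single-feature model y' = b + w1*x1 with b = 1, w1 = 2."""
    return 1.0 + 2.0 * x


# Made-up data set D of (x, y) pairs.
D = [(0.0, 1.0), (1.0, 4.0), (2.0, 3.0)]
mse = mean_squared_error(D, toy_model)
```

For this data the individual squared losses are 0, 1, and 4, so the MSE is 5/3.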

## Reducing Loss

One of the most important parts of training an ML model is reducing loss. There are many different ways to reduce loss, but here we will look at one approach.

### Iterative Approach
This approach can best be compared to a giant game of "hot and cold". In essence, the model starts with a random weight and bias, makes predictions with them, and plugs those predictions into a loss function. Based on the loss it computes, it adjusts the weight and bias to try to reduce the loss. It does this over and over again (hence "iterative") until the loss reaches 0 or stops changing. See this diagram:

![image](img/IterativeDiagram.svg)

The "compute loss" part of the diagram does just that: it uses a loss function to calculate the loss for the current values of b and w<sub>1</sub>. The loss is then fed to another function that computes new values for b and w<sub>1</sub>.
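
The loop in the diagram can be sketched as follows. These notes do not say how the new values are computed, so this sketch assumes one common choice: a small step against the gradient of the MSE (gradient descent). The function names, the `learning_rate` and `steps` values, the zero starting point (used instead of a random one so the result is repeatable), and the data are all assumptions for illustration.

```python
def compute_loss(examples, w1, b):
    """Mean squared error of y' = b + w1*x1 over (x, y) pairs."""
    return sum((y - (b + w1 * x)) ** 2 for x, y in examples) / len(examples)


def compute_updates(examples, w1, b, learning_rate):
    """Return new (w1, b) after one gradient step on the MSE (assumed rule)."""
    n = len(examples)
    # Partial derivatives of the MSE with respect to w1 and b.
    grad_w1 = sum(-2 * x * (y - (b + w1 * x)) for x, y in examples) / n
    grad_b = sum(-2 * (y - (b + w1 * x)) for x, y in examples) / n
    return w1 - learning_rate * grad_w1, b - learning_rate * grad_b


def train(examples, steps=1000, learning_rate=0.05, w1=0.0, b=0.0):
    """Repeat 'compute loss' and 'compute parameter updates' for a fixed number of steps."""
    losses = []
    for _ in range(steps):
        losses.append(compute_loss(examples, w1, b))
        w1, b = compute_updates(examples, w1, b, learning_rate)
    return w1, b, losses


# Made-up points that lie exactly on the line y = 1 + 2x.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w1, b, losses = train(data)
```

On this toy data the loop should recover values close to w<sub>1</sub> = 2 and b = 1, and the recorded losses shrink toward zero as the steps go on.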

Once the loss reaches zero, stops changing, or changes extremely slowly, we say that the model has **converged**.
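
One simple way to test for convergence, sketched here under assumptions, is to compare the two most recent loss values. The function name `has_converged` and the `tolerance` threshold are hypothetical; what counts as "extremely slowly" depends on the problem.

```python
def has_converged(losses, tolerance=1e-9):
    """Return True if the latest loss is zero or barely differs from the previous one."""
    if not losses:
        return False
    if losses[-1] == 0:
        return True
    if len(losses) < 2:
        return False
    # "Stops changing or changes extremely slowly": the last change is tiny.
    return abs(losses[-2] - losses[-1]) < tolerance


still_improving = has_converged([4.0, 2.0, 1.0])
flat = has_converged([4.0, 0.5, 0.5])
reached_zero = has_converged([4.0, 0.0])
```

A training loop could call a check like this after each iteration and stop early instead of always running a fixed number of steps.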