## Linear Regression
Linear regression is a method for finding the straight line (or, with more than one feature, the hyperplane) that best fits a set of points, such as a simple set of (x, y) pairs.

The problem: given a linear model y = wx + b and data points (x, y), how do we find the values of (w, b)?
If there are only two points, we can calculate w and b exactly by solving two linear equations. With thousands of (x, y) points, no single line passes through all of them, so we need a way to decide which line fits best.
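
To make the two-point case concrete, here is a minimal sketch. The helper name `line_through` and the sample points are made up for illustration. It solves the two linear equations directly, then shows that a third point generally does not land on the same line.

```python
def line_through(p1, p2):
    """Return (w, b) of the line y = w*x + b passing through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("x values must differ for a non-vertical line")
    w = (y2 - y1) / (x2 - x1)  # slope, from subtracting the two equations
    b = y1 - w * x1            # bias, from substituting w back into one equation
    return w, b


w, b = line_through((1.0, 3.0), (3.0, 7.0))
print(w, b)  # 2.0 1.0

# A third (illustrative) point that is off this line:
x3, y3 = 2.0, 6.0
print(w * x3 + b == y3)  # False
```

With many noisy points, an exact solution no longer exists, so the rest of these notes look for the line with the lowest loss instead.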

NOTE: a house price prediction example helps build intuition, see below:
- The house price linear model finds the line that best fits a set of (square footage, price) points.

<img src="./images/LinearRegressionHousePb.png" style="width:400px">

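As in algebra, we define the model as:

- y' = w_1x_1 + b
- y': the predicted value (here, the price)
- x_1: the input feature (here, the square footage)
- b: bias (the y-intercept)
- w_1: weight (the slope of the line)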

## Loss Function
How do we evaluate whether a line is a good fit?
We use a loss function.
<img src="./images/LossHousePb.png" style="width:400px">

That is, loss is a number indicating how bad the model's prediction was on a single example. If the model's prediction is perfect, the loss is zero; otherwise, the loss is greater. The goal of training a model is to find a set of weights and biases that have low loss, on average, across all examples.

We care about minimizing loss across the entire data set, not just on a single example.

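Popular loss functions: squared loss (L2 loss) and mean squared error (MSE)
- Squared error: the L2 loss for a single example is also called squared error.
    - = the square of the difference between the label and the prediction
    - = (observation - prediction)^2
    - = (y - y')^2
- Over a whole data set D, the L2 loss sums the squared error of every example:
    - L_2Loss = \sum_{(x,y)\in D} (y - prediction(x))^2
- Mean squared error (MSE): the average squared error per example, that is, the L2 loss divided by the number of examples N:
    - MSE = \frac{1}{N} \sum_{(x,y)\in D} (y - prediction(x))^2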
<img src="./images/SquareErrorHousePb.png" style="width:300px">
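
The two definitions can be sketched in a few lines of plain Python. The function names and the sample numbers below are illustrative, not taken from any library.

```python
def squared_error(y, y_pred):
    """L2 loss for a single example: (y - y')^2."""
    return (y - y_pred) ** 2


def mean_squared_error(labels, predictions):
    """Average of the squared errors over all examples."""
    if not labels or len(labels) != len(predictions):
        raise ValueError("labels and predictions must be non-empty and equal in length")
    total = sum(squared_error(y, p) for y, p in zip(labels, predictions))
    return total / len(labels)


labels = [3.0, 5.0, 7.0]
predictions = [2.0, 5.0, 9.0]
print(mean_squared_error(labels, predictions))  # (1 + 0 + 4) / 3
```

A perfect prediction contributes zero, and large misses are penalized more heavily than small ones because the error is squared.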


Real-world models are usually more sophisticated and rely on several features, each with its own weight:

y' = b + w_1x_1 + w_2x_2 + w_3x_3
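
As a rough sketch, the multi-feature model is just a weighted sum plus the bias. The `predict` helper and the house features, weights, and bias below are invented for illustration only.

```python
def predict(features, weights, bias):
    """Compute y' = b + w_1*x_1 + ... + w_n*x_n for one example."""
    if len(features) != len(weights):
        raise ValueError("need exactly one weight per feature")
    return bias + sum(w * x for w, x in zip(weights, features))


# Hypothetical house: [square footage, bedrooms, age in years]
weights = [100.0, 5000.0, -200.0]
bias = 20000.0
print(predict([1000.0, 3.0, 10.0], weights, bias))  # 133000.0
```

With a single feature, this reduces to the y = wx + b line from the first section.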


## Training
Training a model simply means learning (determining) good values for all the weights and the bias from labeled examples. In supervised learning, a machine learning algorithm builds a model by examining many examples and attempting to find a model that minimizes loss.

This process is called empirical risk minimization:
- Find good values for the weights and bias
- Minimize the loss

Reducing loss:
- Convex problem: the loss surface is bowl-shaped with a single minimum (linear regression with squared loss is convex).
- Non-convex problem: the loss surface looks like an egg crate, with many local minima.

Reducing loss is an iterative approach, like the children's game "Hot and Cold":
- Initial value: You start with a wild guess ("The value of w1 is 0.") and wait for the system to tell you what the loss is.
- Compute parameter update: Then you try another guess ("The value of w1 is 0.5.") and see what the loss is.

<img src='images/Optimization.gif' style='width:400px'>


- Gradient descent
    - For the starting point, many algorithms simply set w1 to 0 or pick a random value.
    - The gradient descent algorithm then calculates the gradient of the loss curve at the starting point. Here in Figure 3, the gradient of the loss is equal to the derivative (slope) of the curve, and tells you which way is "warmer" or "colder." When there are multiple weights, the gradient is a vector of partial derivatives with respect to the weights.
    - The gradient is a vector, so it has both of the following characteristics:
        - a direction
        - a magnitude
    - The gradient points in the direction of steepest increase of the loss, so gradient descent steps in the direction of the negative gradient to reduce the loss.

<img src='images/GradientDescent.png' style='width:400px'>

- Learning rate
    - As noted, the gradient vector has both a direction and a magnitude. Gradient descent algorithms multiply the gradient by a scalar known as the learning rate (also sometimes called step size) to determine the next point.
        - For example, if the gradient magnitude is 2.5 and the learning rate is 0.01, then the gradient descent algorithm will pick the next point 0.025 away from the previous point.

<img src='images/LearningRateHousePb.png' style='width:400px'>
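
Putting the gradient and the learning rate together, here is a minimal sketch of full-batch gradient descent for the one-feature model y' = wx + b with mean squared error. The function name `gradient_step`, the toy data, the learning rate, and the step count are all assumptions chosen for illustration.

```python
def gradient_step(w, b, xs, ys, learning_rate):
    """Take one gradient descent step on the mean squared error of y' = w*x + b."""
    n = len(xs)
    errors = [(w * x + b) - y for x, y in zip(xs, ys)]
    grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))  # dMSE/dw
    grad_b = 2.0 / n * sum(errors)                             # dMSE/db
    # Step against the gradient, scaled by the learning rate.
    return w - learning_rate * grad_w, b - learning_rate * grad_b


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1, no noise

w, b = 0.0, 0.0            # initial guess
for _ in range(2000):
    w, b = gradient_step(w, b, xs, ys, learning_rate=0.05)

print(round(w, 3), round(b, 3))  # 2.0 1.0
```

On this toy data a much larger learning rate makes the updates overshoot and diverge, while a much smaller one needs many more steps to get close to (2, 1).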

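- Stochastic gradient descent
    - Computing the gradient over the full data set at every iteration is expensive. A batch is the set of examples used to compute the gradient in one iteration, and it does not have to be the whole data set. Stochastic gradient descent (SGD) takes this idea to the extreme: it uses only a single example (a batch size of 1) per iteration.
- Mini-batch stochastic gradient descent
    - Mini-batch SGD is a compromise between full-batch iteration and SGD. A mini-batch is typically between 10 and 1,000 examples, chosen at random; the number of examples is the batch size.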


Usually, you iterate until the overall loss stops changing or at least changes extremely slowly. When that happens, we say that the model has converged. (That minimum is where the loss function converges.)

Many of the coding exercises contain the following hyperparameters:
- steps: the total number of training iterations. One step calculates the loss from one batch and uses that value to modify the model's weights once.
- batch size: the number of examples (chosen at random) for a single step. For example, the batch size for SGD is 1.
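
To show how `steps` and `batch_size` fit into a training loop, here is a small mini-batch SGD sketch in plain Python. The function `train_minibatch`, the toy data, and the hyperparameter values are assumptions for illustration and are not the code used in the exercises.

```python
import random


def train_minibatch(xs, ys, steps, batch_size, learning_rate, seed=0):
    """Fit y' = w*x + b with mini-batch SGD on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = range(len(xs))
    for _ in range(steps):
        # One step: pick a random batch, compute its gradient, update once.
        batch = rng.sample(indices, batch_size)
        grad_w = grad_b = 0.0
        for i in batch:
            error = (w * xs[i] + b) - ys[i]
            grad_w += 2.0 * error * xs[i] / batch_size
            grad_b += 2.0 * error / batch_size
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]  # generated from y = 2x + 1, no noise

w, b = train_minibatch(xs, ys, steps=10000, batch_size=4, learning_rate=0.01)
print(round(w, 3), round(b, 3))   # should be close to 2.0 and 1.0
```

Setting `batch_size=1` turns this into plain SGD, and setting it to `len(xs)` gives full-batch gradient descent.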


The trick is to make this process as efficient as possible.

Again, the key point: a machine learning model is trained by starting with an initial guess for the weights and bias and iteratively adjusting those guesses until it has learned the weights and bias with the lowest possible loss.


### Generalization
The question: will our model do well on a new sample of data?

- Theoretically:
    - Interesting field: generalization theory
    - Based on ideas of measuring model simplicity / complexity
- Intuition: formalization of Occam's Razor principle
    - The less complex a model is, the more likely that a good empirical result is not just due to the peculiarities of our sample

How do we know our model is good?
Generalization refers to your model's ability to adapt properly to new, previously unseen data, drawn from the same distribution as the one used to create the model. To assess it, we need to:
- Develop intuition about overfitting.
- Determine whether a model is good or not.
- Divide a data set into a training set and a test set.

Basic assumptions:
- We draw examples independently and identically distributed (i.i.d.) at random from the distribution
- The distribution is stationary: it doesn't change over time
- We always pull from the same distribution, including for the training, validation, and test sets

In practice, we sometimes violate these assumptions. For example:
- Consider a model that chooses ads to display. The i.i.d. assumption would be violated if the model bases its choice of ads, in part, on what ads the user has previously seen.
- Consider a data set that contains retail sales information for a year. Users' purchases change seasonally, which would violate stationarity.


### Overfitting

<img src='images/Overfitting.png' style='width:300px'>

The model shown in Figures 2 and 3 overfits the peculiarities of the data it trained on. 
An overfit model gets a low loss during training but does a poor job predicting new data.

Overfitting is caused by making a model more complex than necessary. 
The fundamental tension of machine learning is between fitting our data well and fitting the data as simply as possible.

Occam's razor: William of Occam, a 14th-century friar and philosopher, loved simplicity. He believed that scientists should prefer simpler formulas or theories over more complex ones. To put Occam's razor in machine learning terms: the less complex an ML model, the more likely that a good empirical result is not just due to the peculiarities of the sample.
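
A toy experiment can make this tangible. The sketch below is not from the original material: it compares a simple least-squares line with a deliberately over-complex "memorizer" that copies the label of the nearest training point (a stand-in for an overfit model, chosen because it needs no libraries). The data generator, noise level, and all names are assumptions.

```python
import random


def fit_line(xs, ys):
    """Closed-form least-squares fit of y' = w*x + b for one feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = cov_xy / var_x
    return w, mean_y - w * mean_x


def memorizer(train_xs, train_ys):
    """Over-complex model: return the label of the closest training example."""
    def predict(x):
        nearest = min(range(len(train_xs)), key=lambda i: abs(train_xs[i] - x))
        return train_ys[nearest]
    return predict


def mse(labels, predictions):
    return sum((y - p) ** 2 for y, p in zip(labels, predictions)) / len(labels)


def make_data(rng, n):
    """Noisy samples around the (made-up) true line y = 2x + 1."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, 2.0) for x in xs]
    return xs, ys


rng = random.Random(0)
train_xs, train_ys = make_data(rng, 30)
test_xs, test_ys = make_data(rng, 30)

w, b = fit_line(train_xs, train_ys)
memo = memorizer(train_xs, train_ys)

line_train = mse(train_ys, [w * x + b for x in train_xs])
line_test = mse(test_ys, [w * x + b for x in test_xs])
memo_train = mse(train_ys, [memo(x) for x in train_xs])
memo_test = mse(test_ys, [memo(x) for x in test_xs])

print("line:      train", round(line_train, 2), "test", round(line_test, 2))
print("memorizer: train", round(memo_train, 2), "test", round(memo_test, 2))
```

The memorizer reaches zero training loss because it reproduces the noise in the training labels. Its test loss is typically worse than the line's, which is the overfitting pattern described above.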

### Training / Test set 
One way to find out whether a model generalizes is to divide your data set into two subsets:
- training set: a subset used to train the model.
- test set: a subset used to test the trained model.

Good performance on the test set is a useful indicator of good performance on new data in general, assuming that:
- The test set is large enough.
- You don't cheat by using the same test set over and over.
- The test set is representative of the data set as a whole. In other words, don't pick a test set with different characteristics than the training set.
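
A random split is a simple way to aim for a representative test set. The helper below is a minimal sketch with assumed names and an assumed 80/20 ratio. Libraries such as scikit-learn provide their own version, which is not what is shown here.

```python
import random


def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and split it into (train, test)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # shuffle so both subsets look alike
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


examples = [(float(x), 2.0 * x + 1.0) for x in range(100)]
train, test = train_test_split(examples)
print(len(train), len(test))  # 80 20
```

Every example ends up in exactly one of the two subsets, so the model is never evaluated on data it was trained on.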

### Training experiences
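Never train on test data.

Wrong approach: tweak the model according to results on the test set. Repeating this leads to overfitting to the test set.

Better approach: tweak the model according to results on a separate validation set, and use the test set only to confirm the final result. This is better because it creates fewer exposures to the test set.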
<img src='images/Validation.png' style='width:400px'>
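
The workflow with a validation set might look like the sketch below: candidate learning rates are compared on the validation set, and the test set is used once at the end. The split ratios, candidate values, toy data, and function names are all assumptions for illustration.

```python
import random


def split_three(examples, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle a copy of the examples and split into (train, validation, test)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    n_test = round(len(shuffled) * test_fraction)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test


def mse(data, w, b):
    return sum((y - (w * x + b)) ** 2 for x, y in data) / len(data)


def train(data, learning_rate, steps=200):
    """Full-batch gradient descent for y' = w*x + b."""
    w = b = 0.0
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2.0 * ((w * x + b) - y) * x for x, y in data) / n
        grad_b = sum(2.0 * ((w * x + b) - y) for x, y in data) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


rng = random.Random(1)
xs = [rng.uniform(0.0, 1.0) for _ in range(200)]
examples = [(x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.1)) for x in xs]
train_set, val_set, test_set = split_three(examples)

# Tweak the hyperparameter using the validation set only.
candidates = [0.001, 0.01, 0.1, 0.5]
best_lr = min(candidates, key=lambda lr: mse(val_set, *train(train_set, lr)))

# Touch the test set once, at the end, to confirm the result.
w, b = train(train_set, best_lr)
print("best learning rate:", best_lr)
print("test MSE:", mse(test_set, w, b))
```

Because the learning rate was never chosen by looking at the test set, the final test loss stays a fair estimate of performance on new data.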

NOTE:
Test sets and validation sets "wear out" with repeated use. That is, the more you use the same data to make decisions about hyperparameter settings or other model improvements, the less confidence you'll have that these results actually generalize to new, unseen data. Note that validation sets typically wear out more slowly than test sets.

If possible, it's a good idea to collect more data to "refresh" the test set and validation set. Starting anew is a great reset.
