# Lesson 2: Machine Learning Basics

## Generalization

In machine learning, one of the major goals is to uncover patterns hidden inside large datasets. For a pattern to be useful, it must be generalizable, meaning that it applies to data beyond the dataset we gave the algorithm for training.
When we build AI systems, we want them to capture the larger overall patterns rather than memorize individual examples, so it is important to check that a learning algorithm has actually discovered a generalizable pattern.

The phenomenon of fitting closer to our training data than to the underlying distribution is called _overfitting_, and techniques for combatting overfitting are called _regularization_ methods.

### Training Error and Generalization Error

Normally, in supervised learning, we assume the training data and the test data are drawn independently from the same distribution (the IID assumption). Without this assumption, we have no guarantee that a model will perform well on new data: if the two distributions differed, why would we believe that data sampled from the training distribution tells us anything about data sampled from the test distribution?

**Training error** $R_{emp}$ is a statistic calculated on the training dataset. **Generalization error** $R$ is an expectation taken with respect to the underlying distribution. Generalization error can be thought of as what you would see if you applied your model to an infinite stream of additional data examples drawn from the same underlying data distribution. Formally, we can express training error as a sum:

$$
R_{emp}[X,y,f] = \frac{1}{n} \sum_{i=1}^{n}l(x^{(i)}, y^{(i)}, f(x^{(i)}))
$$

The generalization error can be expressed as an integral:

$$ 
R[p,f] = E_{(x,y) \sim p}[l(x,y,f(x))] = \int \int l(x, y, f(x)) \, p(x,y) \, dx \, dy
$$

We can never calculate the generalization error exactly, since we don't know the precise form of the density function $p(x,y)$ and we cannot sample an infinite stream of data points. Instead, we estimate the generalization error by applying our model to an independent test set of randomly selected examples $X'$ and labels $y'$ that were withheld from our training set. This uses the same formula as the empirical training error, but applied to the test set $(X', y')$.
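As a minimal sketch of this estimate, the snippet below applies the same averaging formula to a training split and a withheld test split. The synthetic data, the squared loss, and the helper names (`empirical_risk`, `squared_loss`, `predict`) are assumptions made for illustration and are not part of the lesson.

```python
import numpy as np

def empirical_risk(X, y, f, loss):
    """Average loss l(x, y, f(x)) over the examples in (X, y)."""
    return float(np.mean(loss(X, y, f(X))))

def squared_loss(x, y, prediction):
    # Assumed loss for this sketch; x is unused but kept to mirror l(x, y, f(x)).
    return (y - prediction) ** 2

rng = np.random.default_rng(0)

# Hypothetical data: y = 2x plus Gaussian noise.
X_all = rng.uniform(-1.0, 1.0, size=200)
y_all = 2.0 * X_all + rng.normal(0.0, 0.1, size=200)

# Withhold the last 50 examples as the test set (X', y').
X_train, y_train = X_all[:150], y_all[:150]
X_test, y_test = X_all[150:], y_all[150:]

# Fit a one-parameter linear model (no intercept) by least squares on the training set only.
slope = float(X_train @ y_train / (X_train @ X_train))

def predict(X):
    return slope * X

train_error = empirical_risk(X_train, y_train, predict, squared_loss)
test_error = empirical_risk(X_test, y_test, predict, squared_loss)
print(f"training error: {train_error:.4f}, test error: {test_error:.4f}")
```

Because the test examples played no part in fitting `slope`, `test_error` serves as the estimate of the generalization error, while `train_error` is the empirical risk $R_{emp}$.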

#### Model Complexity

When we have simple models and abundant data, the training and generalization errors tend to be close. When we work with more complex models and/or fewer examples, we expect the training error to go down but the generalization gap to grow.

Note that low training error does not necessarily imply low generalization error. To estimate the true error, we evaluate the model on a validation set of examples held out from training and obtain a validation error.

When it comes to training and validation errors, we need to watch for two scenarios:

- The first is when our training error and validation error are both substantial but there is little gap between them. If the model is unable to reduce the training error, it could mean the model isn't expressive enough to capture the pattern (i.e., it's too simple). This is called **underfitting**. Since the generalization gap ($R - R_{emp}$) is small, we can likely afford a more complex model.

- The second is when our training error is significantly lower than our validation error, which indicates that our model is **overfitting**. The model has fit patterns, including noise, that are specific to the training data, so it generalizes poorly to new data.
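The two scenarios can be sketched by fitting polynomials of increasing degree to a small noisy dataset and comparing training and validation error. The target function, sample sizes, and degrees below are assumptions chosen for illustration; the exact numbers depend on the random seed.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical ground truth: a sine curve observed with Gaussian noise.
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(0.0, 0.1, size=n)
    return x, y

x_train, y_train = make_data(15)   # few training examples
x_val, y_val = make_data(200)      # held-out validation examples

def mse(coeffs, x, y):
    """Mean squared error of the polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_val, y_val))

for degree, (train_err, val_err) in results.items():
    print(f"degree {degree}: train={train_err:.4f}  validation={val_err:.4f}")
```

A straight line (degree 1) is too simple for this curve, so both errors stay high with a small gap between them, which matches the underfitting case. Training error can only fall as the degree grows, so a gap that opens up at degree 9, where validation error stays well above training error, would be the overfitting signature.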

## No Free Lunch Theorem

The **no free lunch theorem** states that, averaged over all possible data-generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points. In other words, no machine learning algorithm is universally better than any other. An analogous result says that, under certain conditions, all optimization algorithms perform identically when averaged over all possible problems. In practice, this means we must design our machine learning algorithms to perform well on specific tasks rather than look for a universally best one.
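A toy illustration of the idea (not a proof of the theorem): if every possible binary labeling of a handful of unseen points is treated as equally likely, any fixed set of predictions has the same average error. The function name and the four-point setup are assumptions made for this sketch.

```python
import itertools
import numpy as np

def average_error_over_all_labelings(predictions):
    """Average 0-1 error of fixed predictions over every possible binary labeling."""
    predictions = np.asarray(predictions)
    k = len(predictions)
    errors = [
        np.mean(predictions != np.array(labels))
        for labels in itertools.product([0, 1], repeat=k)  # all 2**k labelings
    ]
    return float(np.mean(errors))

# Two different "classifiers", described only by what they predict on 4 unseen points.
always_zero = [0, 0, 0, 0]
alternating = [0, 1, 0, 1]

print(average_error_over_all_labelings(always_zero))
print(average_error_over_all_labelings(alternating))
```

Both calls give the same average because, for each point, exactly half of the labelings disagree with any given prediction. An algorithm can only do better than this baseline when the real task rules out most of those labelings, which is why task-specific assumptions matter.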

## Regularization

Regularization is an important part of tuning a model and helps to reduce overfitting. There are several regularization techniques, including the familiar L1 and L2 regularization.

- **L1 regularization**: adds a penalty equal to the sum of the absolute values of the coefficients, which restricts the size of the coefficients (e.g., Lasso regression)
- **L2 regularization**: adds a penalty equal to the sum of the squared coefficients (e.g., Ridge regression, SVM)
- **Elastic Net**: combines the L1 and L2 penalties, adding a hyperparameter that controls the mix between them
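The three penalties can be written as small functions of a weight vector. This is a sketch under assumed conventions: `lam` is the regularization strength and `alpha` is a hypothetical mixing parameter in $[0, 1]$ for the elastic net. Libraries scale these terms in different ways, so this should not be read as any particular library's definition.

```python
import numpy as np

def l1_penalty(w, lam):
    """lam times the sum of absolute values of the weights."""
    return float(lam * np.sum(np.abs(w)))

def l2_penalty(w, lam):
    """lam times the sum of squared weights."""
    return float(lam * np.sum(w ** 2))

def elastic_net_penalty(w, lam, alpha):
    """Convex mix of the two penalties; alpha=1 is pure L1, alpha=0 is pure L2."""
    return alpha * l1_penalty(w, lam) + (1.0 - alpha) * l2_penalty(w, lam)

w = np.array([0.5, -2.0, 0.0, 1.5])  # example weights, chosen arbitrarily
print(l1_penalty(w, lam=0.1))
print(l2_penalty(w, lam=0.1))
print(elastic_net_penalty(w, lam=0.1, alpha=0.5))
```

The squared term makes the L2 penalty grow faster for large weights such as `-2.0`, while the L1 penalty charges every unit of magnitude at the same rate.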

#### L1 Regularization

L1 regularization drives some coefficients to exactly zero, meaning the model ignores those features. This helps to emphasize the model's essential features.

$$
L1 = \text{Loss function} + \lambda \sum_{j=1}^{m} |w_j|
$$

Here $\lambda$ controls the strength of regularization and $w_j$ represents the model's weights (coefficients).
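To see the zeroing effect in practice, here is a minimal sketch that minimizes an L1-regularized least-squares objective with proximal gradient descent (often called ISTA). The choice of optimizer, the $\frac{1}{2n}$ scaling of the squared loss, the synthetic data, and the names `soft_threshold` and `lasso_ista` are all assumptions for illustration; the lesson does not prescribe a solver.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink each entry of v toward zero by t, setting small entries to exactly zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iters=500):
    """Minimize (1/(2n)) * ||Xw - y||^2 + lam * sum_j |w_j| by proximal gradient descent."""
    n, m = X.shape
    w = np.zeros(m)
    # Step size 1/L, where L is the largest eigenvalue of X^T X / n.
    step = n / (np.linalg.norm(X, 2) ** 2)
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y) / n                       # gradient of the squared-loss term
        w = soft_threshold(w - step * grad, step * lam)    # proximal step for the L1 term
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_w = np.array([3.0, -2.0, 0.0, 0.0, 0.0])  # only the first two features matter
y = X @ true_w + rng.normal(0.0, 0.1, size=100)

w_lasso = lasso_ista(X, y, lam=0.5)
print(w_lasso)
```

With this setup, the weights on the three irrelevant features should end up at exactly zero, while the two informative weights survive but are shrunk toward zero. A larger $\lambda$ removes more features, and $\lambda = 0$ recovers ordinary least squares.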

## Weight Decay

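**Weight decay** is a regularization technique. It works by restricting the values that the parameters can take, most commonly by adding an L2 penalty on the weights to the loss, which shrinks them toward zero at every update.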
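One common way to realize this is to add a term proportional to the weights to each gradient step, which for plain gradient descent corresponds to an L2 penalty. The sketch below assumes a linear least-squares model and hypothetical names (`gradient_step`, `fit`, `weight_decay`); it is meant as an illustration, not a reference implementation.

```python
import numpy as np

def gradient_step(w, grad, lr, weight_decay=0.0):
    """One update: follow the loss gradient and pull the weights toward zero."""
    return w - lr * (grad + weight_decay * w)

def fit(X, y, lr=0.1, weight_decay=0.0, n_iters=500):
    """Gradient descent on the mean squared error of a linear model."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y) / len(y)
        w = gradient_step(w, grad, lr, weight_decay)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, 0.1, size=100)

w_plain = fit(X, y)                      # no regularization
w_decayed = fit(X, y, weight_decay=0.5)  # with weight decay

print(np.linalg.norm(w_plain), np.linalg.norm(w_decayed))
```

The decayed weights have a smaller norm than the unregularized ones, which is the restriction on parameter values that weight decay is meant to impose.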