# **Regularization**

## **Example**

Let’s say we want to build a linear regression model to predict house prices from features such as the number of rooms, the size of the house, and the location. We have a training dataset of 100 houses with their features and prices, and we want to use gradient descent to find the optimal weights for our model.

However, we notice that our model is overfitting the training data, meaning that it has a very low training error but a high test error when we evaluate it on new data. This means that our model is too complex and it learns the noise in the training data instead of the general patterns.

To prevent overfitting, we can use regularization techniques such as L2 or L1 regularization. These techniques add a penalty term to the loss function that depends on the magnitude of the weights. The penalty term makes the model prefer smaller weights and reduces the complexity of the model.

For example, if we use L2 regularization, our loss function becomes:

$$
L = \frac{1}{2n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^m w_j^2
$$

where:

- $L$ is the loss function with L2 regularization.
- $n$ is the number of training examples.
- $y_i$ is the true price of the i-th house.
- $\hat{y}_i$ is the predicted price of the i-th house.
- $\lambda$ is the regularization parameter that controls the strength of the regularization.
- $m$ is the number of features.
- $w_j$ is the weight of the j-th feature.
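As a rough illustration, the loss above can be computed with NumPy. This is a minimal sketch: the function name `l2_regularized_loss`, the toy data, and the choice to predict with $\hat{y} = Xw$ (no intercept) are assumptions made for the example, not part of the original text.

```python
import numpy as np

def l2_regularized_loss(X, y, w, lam):
    """Half mean squared error plus lam * sum(w_j^2).

    Assumes predictions are y_hat = X @ w (no intercept term).
    """
    n = X.shape[0]
    y_hat = X @ w                                   # predicted prices
    data_term = np.sum((y - y_hat) ** 2) / (2 * n)  # first term of the loss
    penalty = lam * np.sum(w ** 2)                  # L2 regularization term
    return data_term + penalty

# Made-up data: 3 houses, 2 features (rooms, size in units of 100 m^2).
X = np.array([[3.0, 1.2], [4.0, 1.5], [2.0, 0.8]])
y = np.array([300.0, 400.0, 200.0])
w = np.array([100.0, 0.0])  # fits this toy data exactly

loss_plain = l2_regularized_loss(X, y, w, lam=0.0)
loss_reg = l2_regularized_loss(X, y, w, lam=0.01)
print(loss_plain, loss_reg)
```

Here the data term is zero because the weights fit the toy data exactly, so the regularized loss consists only of the penalty $0.01 \cdot 100^2$.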

The first term in the loss function is the mean squared error (MSE), scaled by $\frac{1}{2}$ for convenience when taking gradients; it measures how well our model fits the data. The second term is the L2 regularization term, which penalizes large weights and shrinks them towards zero. The regularization parameter λ determines how strongly we regularize the model. A larger λ means stronger regularization: less overfitting (lower variance) at the cost of more bias, and if λ is too large the model underfits. A smaller λ means weaker regularization: less bias but more variance, and therefore a higher risk of overfitting.
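The shrinking effect of λ can be seen in a small gradient descent experiment. The sketch below assumes synthetic data, predictions $\hat{y} = Xw$ without an intercept, and hypothetical settings for the learning rate and step count. The gradient used is $\frac{1}{n} X^\top (Xw - y) + 2 \lambda w$, which follows from differentiating the loss above.

```python
import numpy as np

def fit_ridge_gd(X, y, lam, lr=0.1, steps=500):
    """Gradient descent on (1/2n) * sum((y - Xw)^2) + lam * sum(w^2)."""
    n, m = X.shape
    w = np.zeros(m)
    for _ in range(steps):
        residual = X @ w - y
        grad = X.T @ residual / n + 2 * lam * w  # data gradient + penalty gradient
        w -= lr * grad
    return w

# Synthetic data (an assumption for illustration): 100 examples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([3.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.5, size=100)

# Norm of the learned weight vector for increasing regularization strength.
norms = {lam: np.linalg.norm(fit_ridge_gd(X, y, lam)) for lam in (0.0, 0.1, 1.0)}
for lam, norm in norms.items():
    print(f"lambda={lam}: ||w|| = {norm:.3f}")
```

The weight norm gets smaller as λ grows, which is the shrinkage described above.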

By using regularization with a suitably chosen λ, we can often improve the performance of our model on new data and reduce overfitting. We can also use other regularization techniques, such as L1 or elastic net, for different effects and trade-offs.
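For comparison, here is a sketch of the three penalty terms. The L1 penalty is $\lambda \sum_j |w_j|$. The elastic net is written in one common form, as a weighted mix of the L1 and L2 penalties. The mixing parameter `alpha` and the function names are assumptions for this example, and libraries may parametrize the mix differently.

```python
import numpy as np

def l2_penalty(w, lam):
    """lam * sum of squared weights."""
    return lam * np.sum(w ** 2)

def l1_penalty(w, lam):
    """lam * sum of absolute weights."""
    return lam * np.sum(np.abs(w))

def elastic_net_penalty(w, lam, alpha=0.5):
    """Weighted mix: alpha * L1 + (1 - alpha) * L2 (alpha is hypothetical)."""
    return alpha * l1_penalty(w, lam) + (1 - alpha) * l2_penalty(w, lam)

w = np.array([0.5, -2.0, 0.0])
print(l2_penalty(w, 1.0), l1_penalty(w, 1.0), elastic_net_penalty(w, 1.0))
```

Because the square grows faster than the absolute value, the L2 penalty is dominated by the large weight (-2.0), while the L1 penalty charges every unit of weight the same amount.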

## **Formulas for Some Regression Models**

**Logistic Regression**

Logistic regression is a binary classification model that predicts the probability of an example belonging to the positive class. The output function of logistic regression is:

$$
\hat{y} = \sigma(Wx + b)
$$

where:

- $\hat{y}$ is the predicted probability of the positive class.
- $\sigma$ is the sigmoid function that maps any real number to a value between 0 and 1.
- $W$ is the weight vector of the model, with one weight $w_j$ per feature.
- $x$ is the input vector of features.
- $b$ is the bias term of the model (a scalar).
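A minimal sketch of this output function in NumPy follows. The names `sigmoid` and `predict_proba` and the toy inputs are assumptions for illustration. Each row of `X` is one example, `w` holds one weight per feature, and `b` is a scalar.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """Predicted probability of the positive class for each row of X."""
    return sigmoid(X @ w + b)

X = np.array([[0.0, 0.0], [2.0, 1.0]])
w = np.array([1.0, -1.0])
b = 0.0

p = predict_proba(X, w, b)
print(p)  # the first example has z = 0, so its probability is 0.5
```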

The loss function of logistic regression without regularization is:

$$
L = - \frac{1}{n} \sum_{i=1}^n [y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)]
$$

where:

- $L$ is the loss function without regularization.
- $n$ is the number of training examples.
- $y_i$ is the true label of the i-th example (0 or 1).
- $\hat{y}_i$ is the predicted probability of the i-th example.
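To get a feel for this loss, the sketch below evaluates it on a few made-up predictions. The function name and the `eps` clipping (which guards against `log(0)`) are assumptions added for the example.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1.0, 0.0, 1.0])
confident_right = np.array([0.9, 0.1, 0.9])  # high probability on the true class
confident_wrong = np.array([0.1, 0.9, 0.1])  # high probability on the wrong class

loss_right = binary_cross_entropy(y_true, confident_right)
loss_wrong = binary_cross_entropy(y_true, confident_wrong)
print(loss_right, loss_wrong)
```

Confident correct predictions give a loss of about $-\log(0.9)$, while confident wrong predictions give about $-\log(0.1)$, so the loss punishes confident mistakes heavily.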

The loss function of logistic regression with L2 regularization is:

$$
L = - \frac{1}{n} \sum_{i=1}^n [y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)] + \lambda \sum_{j=1}^m w_j^2
$$

where:

- $L$ is the loss function with L2 regularization.
- $\lambda$ is the regularization parameter that controls the strength of the regularization.
- $m$ is the number of features.
- $w_j$ is the weight of the j-th feature.
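The regularized loss can be minimized with gradient descent. The sketch below assumes synthetic data and hypothetical hyperparameters (λ = 0.1, learning rate 0.5, 200 steps). The gradient with respect to the weights is $\frac{1}{n} X^\top (\hat{y} - y) + 2 \lambda w$. The bias is left unpenalized, matching the formula above, where the penalty sums only over the weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grad(X, y, w, b, lam):
    """L2-regularized logistic loss and its gradients w.r.t. w and b."""
    n = X.shape[0]
    p = sigmoid(X @ w + b)
    data_loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    loss = data_loss + lam * np.sum(w ** 2)
    grad_w = X.T @ (p - y) / n + 2 * lam * w  # penalty adds 2 * lam * w
    grad_b = np.mean(p - y)                   # bias is not penalized
    return loss, grad_w, grad_b

# Synthetic, made-up data: the label depends on a linear rule of two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

w, b = np.zeros(2), 0.0
for _ in range(200):
    loss, grad_w, grad_b = loss_and_grad(X, y, w, b, lam=0.1)
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b

print(loss, w, b)
```

At the starting point `w = 0`, `b = 0` every prediction is 0.5 and the loss is $\log 2$, so a final loss below that value indicates the updates are making progress.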


**Softmax Regression**

Softmax regression is a multiclass classification model that predicts the probability of an example belonging to each of $K$ classes. The output function of softmax regression is:

$$
\hat{y}_k = \frac{\exp(W_k x + b_k)}{\sum_{j=1}^K \exp(W_j x + b_j)}
$$

where:

- $\hat{y}_k$ is the predicted probability of the k-th class.
- $W_k$ is the weight vector of the k-th class.
- $x$ is the input vector of features.
- $b_k$ is the bias term of the k-th class.
- $K$ is the number of classes.
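A minimal NumPy sketch of this output function follows. The function name, array shapes, and toy values are assumptions for illustration. Subtracting the row-wise maximum before exponentiating is a common numerical-stability trick that leaves the probabilities unchanged.

```python
import numpy as np

def softmax_predict(X, W, b):
    """X: (n, m) inputs, W: (K, m) weights, b: (K,) biases -> (n, K) probabilities."""
    logits = X @ W.T + b
    logits = logits - logits.max(axis=1, keepdims=True)  # stability shift
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

X = np.array([[1.0, 2.0], [0.0, 0.0]])
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # K = 3 classes, m = 2 features
b = np.zeros(3)

probs = softmax_predict(X, W, b)
print(probs)  # each row sums to 1; the all-zero input gives uniform probabilities
```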

The loss function of softmax regression without regularization is:

$$
L = - \frac{1}{n} \sum_{i=1}^n \sum_{k=1}^K y_{ik} \log(\hat{y}_{ik})
$$

where:

- $L$ is the loss function without regularization.
- $n$ is the number of training examples.
- $y_{ik}$ is the true label of the i-th example for the k-th class (0 or 1).
- $\hat{y}_{ik}$ is the predicted probability of the i-th example for the k-th class.

The loss function of softmax regression with L2 regularization is:

$$
L = - \frac{1}{n} \sum_{i=1}^n \sum_{k=1}^K y_{ik} \log(\hat{y}_{ik}) + \lambda \sum_{k=1}^K \sum_{j=1}^m w_{kj}^2
$$

where:

- $L$ is the loss function with L2 regularization.
- $\lambda$ is the regularization parameter that controls the strength of the regularization.
- $m$ is the number of features.
- $w_{kj}$ is the weight of the j-th feature for the k-th class.
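Putting the pieces together, the sketch below computes the L2-regularized softmax loss for one-hot labels. The function name, shapes, and toy data are assumptions for the example. The log-probabilities are computed directly from shifted logits to avoid taking the log of very small numbers.

```python
import numpy as np

def softmax_l2_loss(X, Y, W, b, lam):
    """Cross-entropy over one-hot labels Y (n, K) plus lam * sum of squared weights."""
    logits = X @ W.T + b
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    data_loss = -np.mean(np.sum(Y * log_probs, axis=1))
    return data_loss + lam * np.sum(W ** 2)  # biases are not penalized

X = np.array([[1.0, 0.0], [0.0, 1.0]])
Y = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # one-hot labels, K = 3
b = np.zeros(3)

W_zero = np.zeros((3, 2))
W_big = np.array([[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]])

loss_zero = softmax_l2_loss(X, Y, W_zero, b, lam=0.1)     # uniform predictions
loss_big_plain = softmax_l2_loss(X, Y, W_big, b, lam=0.0)
loss_big_reg = softmax_l2_loss(X, Y, W_big, b, lam=0.1)
print(loss_zero, loss_big_plain, loss_big_reg)
```

With zero weights the predictions are uniform and the loss is $\log 3$. The larger weights fit the two examples better, but the penalty adds $0.1 \cdot 8 = 0.8$ to the loss, which is the trade-off that λ controls.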