# Cost Functions

Cost functions are a fundamental concept in machine learning, statistics, and optimization. They measure the difference between a model's predicted values and the actual values in the data. Many learning algorithms work by minimizing the cost function, which improves the model's fit to the data.

## Key Concepts of Cost Functions:

1. **Definition**:
    - A cost function, also known as a loss function or error function, quantifies the error between predicted outputs and the actual outputs.
    - It provides a measure of how well the model is performing.

2. **Common Types of Cost Functions**:

### Mean Squared Error (MSE)

#### Formula:
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
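
As a minimal sketch of the formula, assuming the targets and predictions are plain Python lists of equal length (the function name `mse` is illustrative, not from any library):

```python
def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(y_true)
    # Average of the squared residuals (y_i - y_hat_i)^2.
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n


y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
# Squared residuals are 1, 0 and 4, so the result is 5/3.
print(mse(y_true, y_pred))
```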

#### Derivative:
To find the derivative of MSE with respect to the model parameters $\theta$, assume a simple linear model with one input feature:

$$\hat{y}_i = \theta_0 + \theta_1 x_i$$

The MSE cost function becomes:
$$J(\theta) = \frac{1}{n} \sum_{i=1}^{n} (y_i - (\theta_0 + \theta_1 x_i))^2$$

1. Take the partial derivative with respect to $\theta_0$:
$$\frac{\partial J}{\partial \theta_0} = \frac{\partial}{\partial \theta_0} \left( \frac{1}{n} \sum_{i=1}^{n} (y_i - (\theta_0 + \theta_1 x_i))^2 \right)$$

Using the chain rule:
$$\frac{\partial J}{\partial \theta_0} = \frac{2}{n} \sum_{i=1}^{n} (y_i - (\theta_0 + \theta_1 x_i)) (-1)$$
$$\frac{\partial J}{\partial \theta_0} = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \theta_0 - \theta_1 x_i)$$

2. Take the partial derivative with respect to $\theta_1$:
$$\frac{\partial J}{\partial \theta_1} = \frac{\partial}{\partial \theta_1} \left( \frac{1}{n} \sum_{i=1}^{n} (y_i - (\theta_0 + \theta_1 x_i))^2 \right)$$

Using the chain rule:
$$\frac{\partial J}{\partial \theta_1} = \frac{2}{n} \sum_{i=1}^{n} (y_i - (\theta_0 + \theta_1 x_i)) (-x_i)$$
$$\frac{\partial J}{\partial \theta_1} = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \theta_0 - \theta_1 x_i) x_i$$
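
The two partial derivatives can be checked numerically. The sketch below assumes the one-feature linear model used above and compares the analytic gradient with a central finite difference; the names `mse_cost` and `mse_gradient` and the toy data are hypothetical.

```python
def mse_cost(theta0, theta1, xs, ys):
    """J(theta) = (1/n) * sum (y_i - (theta0 + theta1 * x_i))^2."""
    n = len(xs)
    return sum((y - (theta0 + theta1 * x)) ** 2 for x, y in zip(xs, ys)) / n


def mse_gradient(theta0, theta1, xs, ys):
    """Analytic partial derivatives of J with respect to theta0 and theta1."""
    n = len(xs)
    residuals = [y - (theta0 + theta1 * x) for x, y in zip(xs, ys)]
    d_theta0 = -2.0 / n * sum(residuals)
    d_theta1 = -2.0 / n * sum(r * x for r, x in zip(residuals, xs))
    return d_theta0, d_theta1


# Toy data generated from y = 1 + 2x (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

g0, g1 = mse_gradient(0.0, 0.0, xs, ys)

# Central finite differences as an independent check of the formulas.
eps = 1e-6
num_g0 = (mse_cost(eps, 0.0, xs, ys) - mse_cost(-eps, 0.0, xs, ys)) / (2 * eps)
num_g1 = (mse_cost(0.0, eps, xs, ys) - mse_cost(0.0, -eps, xs, ys)) / (2 * eps)

print(g0, num_g0)
print(g1, num_g1)
```

At the parameters that generated the data, $\theta_0 = 1$ and $\theta_1 = 2$, every residual is zero and so are both partial derivatives.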

#### Key Properties:
- **Advantages**:
  - Differentiable everywhere and, for models that are linear in the parameters (such as the one above), convex, so any local minimum is a global minimum.
  - Penalizes larger errors more significantly, which can be beneficial in some contexts.
- **Disadvantages**:
  - Sensitive to outliers, which can disproportionately influence the model.

### Mean Absolute Error (MAE)

#### Formula:
$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$

#### Derivative:
For the MAE cost function, the absolute value is not differentiable where a residual is exactly zero. Wherever the residuals are nonzero, the gradient depends only on the sign of each residual, not on its size:

$$\frac{\partial J}{\partial \theta_j} = \frac{1}{n} \sum_{i=1}^{n} \text{sign}(y_i - \hat{y}_i) \cdot \frac{\partial (y_i - \hat{y}_i)}{\partial \theta_j}$$

For linear regression:
$$\frac{\partial (y_i - \hat{y}_i)}{\partial \theta_0} = -1$$
$$\frac{\partial (y_i - \hat{y}_i)}{\partial \theta_1} = -x_i$$

So:
$$\frac{\partial J}{\partial \theta_0} = -\frac{1}{n} \sum_{i=1}^{n} \text{sign}(y_i - \hat{y}_i)$$
$$\frac{\partial J}{\partial \theta_1} = -\frac{1}{n} \sum_{i=1}^{n} \text{sign}(y_i - \hat{y}_i) x_i$$
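
A small sketch of these expressions, assuming the same one-feature linear model. The helper `sign` returns 0 for a zero residual, which is one valid subgradient choice at the non-differentiable point; all names and data here are illustrative.

```python
def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def mae_subgradient(theta0, theta1, xs, ys):
    """Subgradient of the MAE for y_hat = theta0 + theta1 * x."""
    n = len(xs)
    signs = [sign(y - (theta0 + theta1 * x)) for x, y in zip(xs, ys)]
    d_theta0 = -sum(signs) / n
    d_theta1 = -sum(s * x for s, x in zip(signs, xs)) / n
    return d_theta0, d_theta1


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 100.0]  # the last target is a deliberate outlier

# Every residual is positive at theta = (0, 0), so each point contributes
# a sign of +1 regardless of how large its residual is.
print(mae_subgradient(0.0, 0.0, xs, ys))
```

Because only the sign enters, the outlier at $y = 100$ pulls on the parameters no harder than any other point.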

#### Key Properties:
- **Advantages**:
  - Less sensitive to outliers compared to MSE.
  - Its minimizer corresponds to the median of the targets rather than the mean, which makes it a more robust measure of central tendency.
- **Disadvantages**:
  - Not differentiable where a residual is exactly zero, complicating gradient-based optimization methods.
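
The robustness difference can be illustrated with the simplest possible model, a single constant prediction $c$. The sketch below uses a made-up data set with one outlier and relies on the standard facts that the mean minimizes the squared error and the median minimizes the absolute error; the function names are hypothetical.

```python
import statistics


def mse_const(c, ys):
    """MSE when every prediction is the constant c."""
    return sum((y - c) ** 2 for y in ys) / len(ys)


def mae_const(c, ys):
    """MAE when every prediction is the constant c."""
    return sum(abs(y - c) for y in ys) / len(ys)


ys = [1.0, 2.0, 3.0, 4.0, 100.0]  # 100.0 is the outlier

mean_y = statistics.mean(ys)      # best constant under MSE
median_y = statistics.median(ys)  # best constant under MAE

print(mean_y, median_y)
# Moving away from each minimizer in either direction increases its cost.
print(mse_const(mean_y, ys), mse_const(mean_y - 0.5, ys), mse_const(mean_y + 0.5, ys))
print(mae_const(median_y, ys), mae_const(median_y - 0.5, ys), mae_const(median_y + 0.5, ys))
```

The outlier drags the MSE-optimal constant far from the bulk of the data, while the MAE-optimal constant stays at the middle value.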

### Cross-Entropy Loss

#### Formula:
For binary classification:
$$\text{Cross-Entropy Loss} = - \frac{1}{n} \sum_{i=1}^{n} [y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i)]$$
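
A minimal sketch of the binary formula, assuming labels in $\{0, 1\}$ and predicted probabilities in $[0, 1]$. Clipping the probabilities with a small `eps` is an added safeguard (not part of the formula itself) so that $\log(0)$ is never evaluated; the function name and default value are illustrative.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Average binary cross-entropy for labels in {0, 1}."""
    n = len(y_true)
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Keep p strictly inside (0, 1) so both logarithms are finite.
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / n


# Two fairly confident, correct predictions give a small loss.
print(binary_cross_entropy([1, 0], [0.9, 0.1]))
# A confident, wrong prediction gives a large (but finite, due to clipping) loss.
print(binary_cross_entropy([1], [0.0]))
```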

#### Derivative:
The derivative of the Cross-Entropy Loss $L$ with respect to a single prediction $\hat{y}_i$ (only the $i$-th term of the sum depends on it):

$$\frac{\partial L}{\partial \hat{y}_i} = -\frac{1}{n}\left(\frac{y_i}{\hat{y}_i} - \frac{1 - y_i}{1 - \hat{y}_i}\right)$$

For logistic regression, where $\hat{y}_i = \sigma(z_i)$ and $z_i = \theta^T x_i$:
$$\frac{\partial \sigma(z_i)}{\partial z_i} = \sigma(z_i)(1 - \sigma(z_i))$$

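Combining these by the chain rule, with $\partial z_i / \partial \theta = x_i$, the factor $\hat{y}_i (1 - \hat{y}_i)$ cancels and the gradient simplifies to: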
$$\frac{\partial L}{\partial \theta} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) x_i$$
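
The final gradient expression can be sketched as follows, assuming each $x_i$ is a list of features whose first entry is a constant 1 for the intercept (an assumption made for this example). The names `sigmoid` and `logistic_gradient` are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def logistic_gradient(theta, X, y):
    """Gradient (1/n) * sum (y_hat_i - y_i) * x_i of the cross-entropy loss."""
    n = len(X)
    grad = [0.0] * len(theta)
    for x_i, y_i in zip(X, y):
        z_i = sum(t * x for t, x in zip(theta, x_i))
        error = sigmoid(z_i) - y_i
        for j, x_ij in enumerate(x_i):
            grad[j] += error * x_ij / n
    return grad


X = [[1.0, 0.0], [1.0, 2.0]]  # first column is the intercept feature
y = [0, 1]

# At theta = 0 every prediction is 0.5, so the errors are +0.5 and -0.5.
print(logistic_gradient([0.0, 0.0], X, y))
```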

#### Key Properties:
- **Advantages**:
  - Highly suitable for binary and multiclass classification.
  - Provides probabilistic interpretations.
- **Disadvantages**:
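  - Unbounded for predictions that are very confident but incorrect (for example $\hat{y}_i \to 0$ when $y_i = 1$), so it is sensitive to mislabeled examples and needs a numerically careful evaluation of the logarithm.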

### Hinge Loss

#### Formula:
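$$\text{Hinge Loss} = \frac{1}{n} \sum_{i=1}^{n} \max(0, 1 - y_i \hat{y}_i)$$

Here the labels are encoded as $y_i \in \{-1, +1\}$ and $\hat{y}_i$ is the raw (unthresholded) score of the classifier, not a probability.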

#### Derivative:
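The hinge loss is piecewise linear, so it is simplest to differentiate the per-example term $\ell_i = \max(0, 1 - y_i \hat{y}_i)$; the gradient of the averaged loss is the mean of these per-example terms. The derivative depends on whether $1 - y_i \hat{y}_i$ is greater than zero:

1. If $1 - y_i \hat{y}_i > 0$:
   $$\frac{\partial \ell_i}{\partial \hat{y}_i} = -y_i$$
2. Otherwise:
   $$\frac{\partial \ell_i}{\partial \hat{y}_i} = 0$$

At the kink $1 - y_i \hat{y}_i = 0$ the loss is not differentiable; taking 0 there is a valid subgradient.

For a linear classifier $\hat{y}_i = \theta^T x_i$:
$$\frac{\partial \ell_i}{\partial \theta} = \begin{cases}
-y_i x_i & \text{if } 1 - y_i \theta^T x_i > 0 \\
0 & \text{otherwise}
\end{cases}$$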
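
A sketch of the loss and its subgradient for a linear classifier, assuming labels in $\{-1, +1\}$ and averaging the per-example cases over the data set. The function names and the tiny data set are hypothetical.

```python
def hinge_loss(theta, X, y):
    """Average hinge loss for scores theta^T x_i and labels in {-1, +1}."""
    n = len(X)
    total = 0.0
    for x_i, y_i in zip(X, y):
        score = sum(t * x for t, x in zip(theta, x_i))
        total += max(0.0, 1.0 - y_i * score)
    return total / n


def hinge_subgradient(theta, X, y):
    """Average of the per-example subgradients with respect to theta."""
    n = len(X)
    grad = [0.0] * len(theta)
    for x_i, y_i in zip(X, y):
        score = sum(t * x for t, x in zip(theta, x_i))
        # Only examples inside the margin (or misclassified) contribute.
        if 1.0 - y_i * score > 0:
            for j, x_ij in enumerate(x_i):
                grad[j] -= y_i * x_ij / n
    return grad


X = [[1.0, 2.0], [1.0, -3.0]]
y = [1, -1]

# With theta = 0 both examples violate the margin.
print(hinge_loss([0.0, 0.0], X, y), hinge_subgradient([0.0, 0.0], X, y))
# With theta = (0, 1) both margins exceed 1, so loss and subgradient vanish.
print(hinge_loss([0.0, 1.0], X, y), hinge_subgradient([0.0, 1.0], X, y))
```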

#### Key Properties:
- **Advantages**:
  - Particularly effective for training support vector machines.
  - Emphasizes correct classification margins.
- **Disadvantages**:
  - Not suitable for probabilistic interpretations.
  - Only useful for classification problems.

## Regularization

To prevent overfitting, regularization terms are often added to the cost function. Examples include L1 (Lasso) and L2 (Ridge) regularization:

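$$\text{Regularized Cost} = \text{Cost Function} + \lambda R(\theta)$$

where $\lambda \ge 0$ is the regularization parameter, $R(\theta) = \left\| \theta \right\|_1 = \sum_j |\theta_j|$ for L1 regularization, and $R(\theta) = \left\| \theta \right\|_2^2 = \sum_j \theta_j^2$ for L2 regularization. Note that Ridge penalizes the squared L2 norm rather than the norm itself.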
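
A minimal sketch of adding a penalty to an already computed cost. It assumes the common conventions that L1 uses the sum of absolute values and L2 uses the sum of squares (the squared norm); the function names are illustrative, and whether to exclude the intercept from `theta` is left to the caller.

```python
def l1_penalty(theta, lam):
    """lam * sum |theta_j| (Lasso-style penalty)."""
    return lam * sum(abs(t) for t in theta)


def l2_penalty(theta, lam):
    """lam * sum theta_j^2 (Ridge-style penalty, squared L2 norm)."""
    return lam * sum(t * t for t in theta)


def regularized_cost(base_cost, theta, lam, p):
    """Add an L1 (p=1) or L2 (p=2) penalty to an unregularized cost value."""
    if p == 1:
        return base_cost + l1_penalty(theta, lam)
    if p == 2:
        return base_cost + l2_penalty(theta, lam)
    raise ValueError("p must be 1 or 2")


theta = [3.0, -4.0]
# |3| + |-4| = 7 and 3^2 + 4^2 = 25, so the penalties are 0.7 and 2.5.
print(regularized_cost(2.0, theta, 0.1, 1))
print(regularized_cost(2.0, theta, 0.1, 2))
```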

## Choosing a Cost Function

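- The choice depends on the specific problem and the type of data.
- For regression tasks, MSE or MAE are commonly used; MAE is the more robust option when outliers are expected.
- For classification tasks, Cross-Entropy Loss is widely used, while Hinge Loss is the usual choice for support vector machines.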

## Example: Linear Regression

For linear regression, the model predicts $\hat{y} = \theta_0 + \theta_1 x$. The cost function typically used is the Mean Squared Error (MSE), conventionally scaled by $\frac{1}{2}$:

$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2$$

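where $m$ is the number of training examples (written $n$ in the sections above), $y_i$ is the actual value, and $\hat{y}_i$ is the predicted value. The factor $\frac{1}{2}$ is a convenience: it cancels the 2 produced by differentiating the square and does not change which parameters minimize the cost.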

## Optimization Process:

1. **Initialize** parameters $\theta_0, \theta_1$.
2. **Compute** the cost function $J(\theta_0, \theta_1)$.
3. **Update** the parameters to minimize the cost function, typically using Gradient Descent:
    $$\theta_j := \theta_j - \alpha \frac{\partial J(\theta_0, \theta_1)}{\partial \theta_j}$$
    where $\alpha$ is the learning rate.
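
The three steps can be sketched for the linear regression example, using the $\frac{1}{2m}$ cost so that the partial derivatives are $\frac{1}{m}\sum_i (\hat{y}_i - y_i)$ and $\frac{1}{m}\sum_i (\hat{y}_i - y_i) x_i$. The learning rate, iteration count, starting point and data below are assumptions chosen so that this toy problem converges, not recommended defaults.

```python
def cost(theta0, theta1, xs, ys):
    """J(theta0, theta1) = 1/(2m) * sum (y_hat_i - y_i)^2."""
    m = len(xs)
    return sum(((theta0 + theta1 * x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient_descent(xs, ys, alpha=0.1, iterations=1000):
    """Batch gradient descent for y_hat = theta0 + theta1 * x."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0  # step 1: initialize
    for _ in range(iterations):
        errors = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
        # Compute both partial derivatives before changing either parameter,
        # so the update is simultaneous.
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        theta0 -= alpha * grad0  # step 3: update
        theta1 -= alpha * grad1
    return theta0, theta1


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 1 + 2x

theta0, theta1 = gradient_descent(xs, ys)
print(theta0, theta1)               # close to 1 and 2
print(cost(theta0, theta1, xs, ys))  # step 2: cost at the result, close to 0
```

A learning rate that is too large for the data makes the updates diverge instead, so in practice the cost is usually monitored across iterations.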

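By iteratively updating the parameters, and provided the learning rate $\alpha$ is small enough, the algorithm converges toward values that minimize the cost function. For a convex cost such as MSE with a linear model, this is the global minimum.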

Understanding and correctly choosing cost functions is crucial for building effective machine learning models. They directly impact how well the model learns from the data and generalizes to unseen data.
