# Assignment 4
This notebook explains the mathematical, algorithmic, and logical concepts used throughout this series of notebooks. If you are working through the series, review this notebook first: the concepts presented here lay the foundation for the subsequent assignments.

## Loss function vs. Cost function
In machine learning, 'loss' refers to the difference between the predicted value and the actual value. The 'loss function' is a mathematical function used during training to quantify this difference as a single real number. Loss functions are used in supervised learning algorithms that rely on optimization techniques, such as linear regression and logistic regression. The terms 'cost function' and 'loss function' are often used interchangeably because they convey similar concepts. However, a distinction is usually drawn between the two:

Loss function: Used when we refer to the error for a single training example.

Cost function: Used to refer to an average of the loss functions over an entire training dataset.
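
The distinction can be illustrated with a small sketch. It assumes squared error as the per-example loss (the choice this notebook makes later on); the arrays `y_true` and `y_pred` are hypothetical toy values, not data from the assignment.

```python
import numpy as np

# Hypothetical toy data: actual values and model predictions.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.0, 9.0])

# Loss: one squared-error value per training example.
per_example_loss = (y_pred - y_true) ** 2

# Cost: the average of the per-example losses over the whole dataset.
cost = per_example_loss.mean()

print("loss per example:", per_example_loss)
print("cost over dataset:", cost)
```

The loss is a vector with one entry per example, whereas the cost collapses it into a single number that describes the whole dataset.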

The cost function plays a crucial role in achieving the optimal solution. It serves as a metric for evaluating the performance of our algorithm or model. By considering both the predicted outputs and the actual outputs, the cost function quantifies the extent to which the model's predictions deviate from the actual values. A higher value indicates a larger discrepancy between the predictions and the actual values. As we fine-tune our model to enhance its predictions, the cost function serves as a gauge of the model's improvement. In essence, this process involves an optimization problem where the objective is to minimize the cost function through various optimization strategies.

### Types of the cost function
There are many cost functions in machine learning, and each has its use cases depending on whether the task is a regression problem or a classification problem. They can be grouped as follows:

1. Regression cost functions
2. Binary classification cost functions
3. Multi-class classification cost functions

This notebook focuses on regression tasks, and specifically on a cost function known as Mean Squared Error (MSE), so only that case is explained here. For those interested in the other cost functions, the material provided in this <a href = 'https://www.analyticsvidhya.com/blog/2021/02/cost-function-is-no-rocket-science/'>link</a> is recommended for further study.

### Regression cost function
Regression models are used to predict continuous values, such as an employee's salary, the price of a car, or a loan amount. For regression problems, the corresponding cost function is referred to as a "regression cost function". These cost functions are based on the distance between the predicted and the actual value, where the error is defined as follows:

$$
\text{Error} = y' - y
$$

where $y$ is the actual output and $y'$ is the predicted output. The Mean Squared Error (MSE) is explained next.

### MSE (Mean Squared Error)
This method has the following characteristics:

1. To eliminate the possibility of negative errors, the square of the difference between the actual and predicted values is computed. This ensures that all errors are positive and allows for a consistent measure of the discrepancy between the predicted and actual values.

2. It is also known as L2 loss.

3. Because each error is squared, MSE penalizes large deviations much more heavily than Mean Absolute Error (MAE) does, while errors with a magnitude below 1 become smaller. Consequently, if the dataset contains outliers that produce large prediction errors, squaring amplifies those errors and they can dominate the overall MSE value.

4. Hence, MSE is less robust to outliers than MAE.
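
The outlier sensitivity described in the list above can be checked numerically. The following is a minimal sketch with hypothetical toy values; the helper names `mse` and `mae` are chosen for illustration only.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: average of the squared differences."""
    return np.mean((y_pred - y_true) ** 2)

def mae(y_true, y_pred):
    """Mean Absolute Error: average of the absolute differences."""
    return np.mean(np.abs(y_pred - y_true))

y_true = np.array([10.0, 12.0, 11.0, 13.0])
y_pred_clean = np.array([11.0, 11.0, 12.0, 12.0])    # every error has magnitude 1
y_pred_outlier = np.array([11.0, 11.0, 12.0, 23.0])  # last error has magnitude 10

mse_clean, mae_clean = mse(y_true, y_pred_clean), mae(y_true, y_pred_clean)
mse_out, mae_out = mse(y_true, y_pred_outlier), mae(y_true, y_pred_outlier)

print("clean:   MSE =", mse_clean, " MAE =", mae_clean)
print("outlier: MSE =", mse_out, " MAE =", mae_out)
```

With a single large error, MAE grows from 1 to 3.25 while MSE grows from 1 to 25.75, because the outlier contributes $10^2 = 100$ to the sum of squared errors.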

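Formally, the MSE-based cost function used in this notebook is defined as follows, where $m$ is the number of training examples, $h_\theta(x^{(i)})$ is the model's prediction for the $i$-th example, and $y^{(i)}$ is the corresponding actual value. The extra factor $\frac{1}{2}$ is a common convention that cancels when the function is differentiated; it does not change the location of the minimum: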

$$
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} ( h_\theta(x^{(i)}) - y^{(i)} ) ^2 
$$
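
As a sketch, $J(\theta)$ can be computed with NumPy as shown below. The code assumes a linear hypothesis $h_\theta(x) = \theta^T x$ with a leading column of ones in `X` for the intercept; this assumption, the function name `compute_cost`, and the toy data are illustrative and not prescribed by the formula itself.

```python
import numpy as np

def compute_cost(X, y, theta):
    """Return J(theta) = 1/(2m) * sum((h_theta(x_i) - y_i)^2).

    Assumes a linear hypothesis h_theta(x) = X @ theta (an assumption
    made for this sketch), with X of shape (m, n) and theta of shape (n,).
    """
    m = len(y)
    residuals = X @ theta - y  # h_theta(x^(i)) - y^(i) for every i
    return (residuals @ residuals) / (2 * m)

# Hypothetical toy data generated from y = 2x; first column is the intercept term.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

cost_perfect = compute_cost(X, y, np.array([0.0, 2.0]))  # exact fit
cost_zero = compute_cost(X, y, np.array([0.0, 0.0]))     # predicts 0 everywhere

print(cost_perfect, cost_zero)
```

For the exact fit the cost is 0; for $\theta = (0, 0)$ it is $(4 + 16 + 36) / (2 \cdot 3) = 56/6$.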

## Gradient Descent
Based on Dave Langers' numerical analysis notebook.

[Gradient descent](https://en.wikipedia.org/wiki/Gradient_descent) is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. To find a local minimum of a function using gradient descent, we take steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point. (If we take steps proportional to the positive of the gradient instead, we approach a local maximum of that function; the procedure is then known as *gradient ascent*). Gradient descent was originally proposed by Cauchy in 1847.

The intuition behind gradient descent is simple: if the derivative $f'(x)$ is positive, then the function increases, so $x$ needs to be decreased in order to arrive at a minimum. Conversely, if $f'(x) < 0$, then the function decreases, and $x$ needs to be increased. This behaviour can be summarized by the update rule:

$$
x \leftarrow x - \gamma \cdot f'(x)
$$

This formula also has the advantage that when the method approaches a solution, the derivative becomes small, and therefore also the updates become smaller. However, it requires the derivative $f'(x)$ to be known, or to be calculable (e.g. using numeric differentiation).
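
The following sketch applies the update rule with a numerically estimated derivative. The test function $f(x) = (x - 3)^2$, the central-difference helper `numerical_derivative`, and the values of `gamma` and `h` are assumptions chosen for illustration.

```python
def numerical_derivative(f, x, h=1e-6):
    """Estimate f'(x) with a central difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

def f(x):
    # Hypothetical example function with its minimum at x = 3.
    return (x - 3.0) ** 2

x = 0.0      # starting point
gamma = 0.1  # learning rate
steps = []   # magnitude of each update, to show that updates shrink

for _ in range(100):
    update = gamma * numerical_derivative(f, x)
    x = x - update
    steps.append(abs(update))

print("x after 100 steps:", x)
print("first step:", steps[0], "last step:", steps[-1])
```

The first update has a magnitude of about 0.6, and the updates shrink as $x$ approaches 3, because the derivative tends to zero near the minimum.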

The parameter $\gamma$ is called the *learning rate* and determines the size of the adjustments that are made. The magnitude of this parameter is critical: if its value is too low, then the solution converges only very slowly, which may make the computation intractable; if its value is too high, then the solution may overshoot and not converge at all.
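
The effect of the learning rate can be seen in a small experiment. This is a sketch on the hypothetical function $f(x) = x^2$ (so $f'(x) = 2x$, with the minimum at $x = 0$); the helper name `gradient_descent_1d` and the three learning rates are chosen for illustration.

```python
def gradient_descent_1d(f_prime, x0, gamma, n_steps):
    """Apply x <- x - gamma * f'(x) for a fixed number of steps."""
    x = x0
    for _ in range(n_steps):
        x = x - gamma * f_prime(x)
    return x

def f_prime(x):
    # Derivative of the example function f(x) = x**2.
    return 2.0 * x

x_slow = gradient_descent_1d(f_prime, x0=5.0, gamma=0.001, n_steps=50)  # too low
x_good = gradient_descent_1d(f_prime, x0=5.0, gamma=0.1, n_steps=50)    # reasonable
x_big = gradient_descent_1d(f_prime, x0=5.0, gamma=1.1, n_steps=50)     # too high

print("gamma = 0.001:", x_slow)
print("gamma = 0.1:  ", x_good)
print("gamma = 1.1:  ", x_big)
```

For this function each step multiplies $x$ by $1 - 2\gamma$. With $\gamma = 0.001$ the factor is 0.998, so $x$ has barely moved after 50 steps; with $\gamma = 0.1$ it is 0.8, so $x$ converges quickly; with $\gamma = 1.1$ it is $-1.2$, so $x$ overshoots further on every step and diverges.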

Gradient descent is an important algorithm that is widely used in deep learning, for instance. There are many extensions to this algorithm. For example, there are methods that adaptively choose a suitable learning rate, that avoid getting stuck in local minima, or that can handle complications that arise when functions depend on multiple variables $x_1, x_2, \ldots$.

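In this notebook, gradient descent is applied to the MSE-based cost function $J(\theta)$. Each parameter $\theta_j$ is updated as follows, where $\alpha$ is the learning rate (the role played by $\gamma$ in the update rule above) and $x^{(i)}_j$ is the $j$-th feature of the $i$-th training example: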

$$
\theta_j := \theta_j - \alpha \frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x^{(i)}_j
$$
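
A vectorized version of this update can be sketched as follows. It assumes a linear hypothesis $h_\theta(x) = \theta^T x$ with a leading column of ones in `X` for the intercept; the function name `gradient_descent`, the toy data, and the values of `alpha` and `n_iters` are hypothetical and chosen for illustration.

```python
import numpy as np

def gradient_descent(X, y, theta, alpha, n_iters):
    """Batch gradient descent for the MSE-based cost J(theta).

    Assumes h_theta(x) = X @ theta (a linear hypothesis, chosen for this sketch).
    """
    m = len(y)
    theta = theta.copy()
    for _ in range(n_iters):
        residuals = X @ theta - y         # h_theta(x^(i)) - y^(i) for every i
        gradient = (X.T @ residuals) / m  # one entry per parameter theta_j
        theta = theta - alpha * gradient  # all theta_j are updated simultaneously
    return theta

# Hypothetical toy data generated from y = 1 + 2x; first column is the intercept term.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

theta = gradient_descent(X, y, np.zeros(2), alpha=0.1, n_iters=5000)
print(theta)  # approaches [1, 2]
```

Computing the gradient for all parameters from the same residuals before updating `theta` ensures that every $\theta_j$ is updated with values from the same iteration.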

For further study, see the following <a href = 'https://www.ibm.com/topics/gradient-descent'>link</a>.