# Week 1: Introduction to Machine Learning

## Supervised learning

**Regression:** Predicting a number out of infinitely many possible outputs

**Classification:** Predicting a category out of a small number of possible outputs

## Unsupervised learning

Find something interesting (*e.g.* patterns) in <u>unlabeled</u> data.

Three common types of unsupervised learning:
- **Clustering:** group similar data together (e.g. similar news articles, similar customers based on their shopping habits, similar mutants based on their gene expression profiles (microarray data))
- **Anomaly detection:** find unusual data points (e.g. outliers)
- **Dimensionality reduction:** compress data using fewer numbers (e.g. data compression)

## Linear regression

**Notation:**

Here is a summary of some of the notation you will encounter.  

|Notation |Description|Python (if applicable)|
|---------|-----------|----------------------|
| $a$  | scalar, non bold ||
| $\mathbf{a}$ | vector, bold||
| **Regression** |  |  |
| $\mathbf{x}$ | Training example feature values (in this lab: size in 1000 sqft) | `x_train` |
| $\mathbf{y}$ | Training example targets (in this lab: price in 1000s of dollars) | `y_train` |
| $x^{(i)}$, $y^{(i)}$ | $i^{th}$ training example | `x_i`, `y_i` |
| $m$ | Number of training examples | `m` |
|  $w$  |  parameter: weight                                 | `w`    |
|  $b$           |  parameter: bias                                           | `b`    |     
| $f_{w,b}(x^{(i)})$ | The result of the model evaluation at $x^{(i)}$ parameterized by $w,b$: $f_{w,b}(x^{(i)}) = wx^{(i)}+b$  | `f_wb` |
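
To make the notation concrete, here is a minimal sketch of how these names might map to NumPy code. The data values, the parameter values, and the helper name `compute_model_output` are assumptions chosen for illustration; they are not taken from the lab.

```python
import numpy as np

# Hypothetical training data: size in 1000 sqft, price in 1000s of dollars.
x_train = np.array([1.0, 2.0])
y_train = np.array([300.0, 500.0])
m = x_train.shape[0]  # number of training examples


def compute_model_output(x, w, b):
    """Evaluate the linear model f_wb = w * x + b for every example in x."""
    return w * x + b


# Illustrative parameter values, picked by hand rather than fitted.
w, b = 200.0, 100.0
f_wb = compute_model_output(x_train, w, b)

# The i-th training example (0-indexed in Python).
i = 0
x_i, y_i = x_train[i], y_train[i]
```

With these hand-picked parameters the model happens to pass through both hypothetical data points, so `f_wb` equals `y_train`.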

## Cost function

A cost function measures how well a model fits the training data. One of the most common cost functions is the mean squared error (MSE):

$$
J(w, b) = \frac{1}{2m}\sum_{i=1}^m (\hat y^{(i)} - y^{(i)})^2
$$

where $\hat y^{(i)} = f_{w,b}(x^{(i)}) = wx^{(i)} + b$. The factor of $2$ in the denominator is a convention that makes the subsequent calculations neater (it cancels the $2$ produced by differentiating the square), and it does not change which $w$ and $b$ minimize the cost. The goal is to find the $w$ and $b$ that minimize $J(w, b)$.
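
The cost formula translates almost directly into code. The following is a minimal sketch, assuming NumPy arrays for the data; the function name `compute_cost` and the sample values are illustrative assumptions.

```python
import numpy as np


def compute_cost(x, y, w, b):
    """Return J(w, b) = (1 / (2m)) * sum((y_hat - y)^2) for a linear model."""
    m = x.shape[0]
    y_hat = w * x + b  # predictions for all m examples
    return np.sum((y_hat - y) ** 2) / (2 * m)


# Hypothetical data: size in 1000 sqft, price in 1000s of dollars.
x_train = np.array([1.0, 2.0])
y_train = np.array([300.0, 500.0])

cost_perfect = compute_cost(x_train, y_train, w=200.0, b=100.0)  # line fits both points
cost_off = compute_cost(x_train, y_train, w=100.0, b=100.0)      # slope is too small
```

Here `cost_perfect` is zero because the line passes through both points, while `cost_off` is positive because the predictions miss the targets.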

## Gradient Descent

Gradient descent is a powerful iterative method for finding a local minimum of a function.

For a cost function, the goal is to find the values of the model parameters ($w$ and $b$) that give the lowest possible cost; this is known as minimizing the cost function. Gradient descent does this by repeatedly updating $w$ and $b$ in the direction that decreases $J(w,b)$:

$$
\begin{aligned}
w &= w - \alpha \frac{\partial}{\partial w}J(w,b) \\
b &= b - \alpha \frac{\partial}{\partial b}J(w,b)
\end{aligned}
\label{eq: gradient_descent}
$$


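where $\alpha$ is the learning rate in Equation {eq}`eq: gradient_descent`. In this algorithm, both parameters must be updated simultaneously (both partial derivatives are computed from the current $w$ and $b$ before either parameter is changed), and the updates are repeated until convergence is achieved.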

Picking an appropriate learning rate is key for the success of the algorithm:
- **Too small:** Convergence is very slow
- **Too large:** Updates overshoot the minimum, so the algorithm may never converge (or may even diverge)
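
The effect of the learning rate can be seen on a toy problem. The sketch below is not the linear regression cost; it assumes the one-parameter cost $J(w) = w^2$ (derivative $2w$, minimum at $w = 0$) purely for illustration, and the function name and the three values of `alpha` are hypothetical choices.

```python
def run_gradient_descent(alpha, w_init=1.0, num_steps=20):
    """Minimize the toy cost J(w) = w**2, whose derivative is 2 * w."""
    w = w_init
    for _ in range(num_steps):
        w = w - alpha * 2 * w  # gradient descent update
    return w


w_small = run_gradient_descent(alpha=0.01)  # still far from 0 after 20 steps
w_good = run_gradient_descent(alpha=0.4)    # very close to 0
w_large = run_gradient_descent(alpha=1.1)   # magnitude grows each step
```

Each step multiplies $w$ by $(1 - 2\alpha)$, so a small $\alpha$ shrinks $w$ only slightly per step, while an $\alpha$ above $1$ makes the magnitude of this factor exceed $1$ and the iterates move away from the minimum.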

Now, let's formulate gradient descent for minimizing the MSE cost function of a linear regression model. The partial derivatives are:

$$
\begin{aligned}
\frac{\partial}{\partial w} J(w,b) &= \frac{1}{m} \sum_{i=1}^m (f_{w,b}(x^{(i)}) - y^{(i)})x^{(i)} \\
\frac{\partial}{\partial b} J(w,b) &= \frac{1}{m} \sum_{i=1}^m (f_{w,b}(x^{(i)}) - y^{(i)})
\end{aligned}
\label{eq: gd_lreg}
$$
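
These two partial derivatives can be computed in a few lines. This is a minimal sketch assuming NumPy arrays; the function name `compute_gradient` and the sample data are illustrative assumptions.

```python
import numpy as np


def compute_gradient(x, y, w, b):
    """Return (dJ/dw, dJ/db) of the MSE cost for the model f(x) = w * x + b."""
    m = x.shape[0]
    err = (w * x + b) - y          # f_wb(x^(i)) - y^(i) for every example
    dj_dw = np.sum(err * x) / m
    dj_db = np.sum(err) / m
    return dj_dw, dj_db


# Hypothetical data: size in 1000 sqft, price in 1000s of dollars.
x_train = np.array([1.0, 2.0])
y_train = np.array([300.0, 500.0])

dj_dw, dj_db = compute_gradient(x_train, y_train, w=0.0, b=0.0)
```

Starting from $w = 0$, $b = 0$, both derivatives are negative for this data, so the update rule increases both parameters.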

The MSE cost function for linear regression is convex (bowl-shaped), meaning that any local minimum is also the global minimum. So, we won't get trapped in a poor local minimum when applying the algorithm. Therefore, if we pick an appropriate learning rate, we are guaranteed to converge to the minimum.

**"Batch"** gradient descent: Each step of gradient descent uses all the training examples.
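
Putting the pieces together, batch gradient descent for linear regression might look like the sketch below. The function name, the data, the learning rate, and the fixed iteration count are all assumptions for illustration; a practical implementation would usually also monitor the cost to decide when to stop.

```python
import numpy as np


def gradient_descent(x, y, w_init, b_init, alpha, num_iters):
    """Run batch gradient descent on the MSE cost of f(x) = w * x + b."""
    m = x.shape[0]
    w, b = w_init, b_init
    for _ in range(num_iters):
        # "Batch": every step uses all m training examples.
        err = (w * x + b) - y
        dj_dw = np.sum(err * x) / m
        dj_db = np.sum(err) / m
        # Simultaneous update: both gradients were computed from the old w and b.
        w, b = w - alpha * dj_dw, b - alpha * dj_db
    return w, b


# Hypothetical data: size in 1000 sqft, price in 1000s of dollars.
x_train = np.array([1.0, 2.0])
y_train = np.array([300.0, 500.0])

w_final, b_final = gradient_descent(
    x_train, y_train, w_init=0.0, b_init=0.0, alpha=0.1, num_iters=10000
)
```

For this data the exact minimizer is $w = 200$, $b = 100$, and with the settings above `w_final` and `b_final` should end up very close to those values.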