# 1.1 Introduction

### Definitions

1. "Field of study that gives computers the ability to learn without being explicitly programmed" (Arthur Samuel, 1959) [_Older, more informal_]
2. "A computer is said to _learn_ from experience E with respect to some task T and some performance measure P, if its performance on T, measured by P, improves with experience E" (Tom Mitchell, 1998)

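**Example**: in spam filtering using ML, the task (T) is classifying whether an email is spam or not, the experience (E) is a set of emails already labelled as spam or not spam, and the performance measure (P) is the fraction of emails classified correctly.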

### Supervised Learning

The algorithm learns from a dataset in which every example comes with the "right answer".

**Example 1**: in **regression** problems, we want to predict a continuous valued output.  
**Example 2**: in **classification** problems, we want to predict a discrete valued output (_label_).

### Unsupervised Learning

The algorithm works on a dataset without existing labels and tries to understand the underlying structure of the data. 

**Example 1**: in **clustering** problems, we want to group our data into different clusters.

# 1.2 Linear Regression

Linear regression predicts a real-valued output based on an input value.

### Notation

$m$ = Number of training examples  
$x$ = Input variable / features  
$y$ = Output variable / target

$(x,y)$ = Training example

$(x_i, y_i)$ = $i^{th}$ training example

$h$ = Learning (hypothesis) function 

In this particular case, $h$ is a linear function, and since we are working with only one input variable $x$ we will call it **univariate linear regression**.  

The hypothesis function will therefore look like this:

$h_{\theta}(x) = h(x) = \theta_0 + \theta_1 x$

$\theta_i$ = Parameters
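
As a minimal sketch, the hypothesis can be written as a plain Python function. The function name `hypothesis` and the parameter values below are illustrative assumptions, not part of the notes.

```python
def hypothesis(theta0, theta1, x):
    """Univariate linear hypothesis: h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


# Hypothetical parameter values, chosen only for illustration.
print(hypothesis(1.0, 2.0, 3.0))  # 1 + 2 * 3 = 7.0
```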

### Linear Regression - Intuition

We choose $\theta_0$ and $\theta_1$ so that $h_{\theta}(x)$ is close to $y$ for our training examples $(x,y)$.  

This is a **minimization problem**, and here is the **cost function** we want to minimize:

$\displaystyle \frac{1}{2m} \sum_{i=1}^m (h_{\theta}(x_i) - y_i)^2$  

From now on, we will call this cost function $J(\theta_0, \theta_1)$. It is known as the _squared error_ or _mean squared error_ function.

**Note**: The mean is halved $(\frac{1}{2})$ as a convenience for the computation of gradient descent: the factor of 2 produced by differentiating the squared term cancels out the $\frac{1}{2}$.
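
The cost function can be sketched in a few lines of Python. The function name `cost` and the tiny dataset are assumptions made up for this example. The data lies exactly on $y = 2x$, so the cost should vanish at $\theta_0 = 0, \theta_1 = 2$.

```python
def cost(theta0, theta1, xs, ys):
    """Squared error cost: J = 1/(2m) * sum((h(x_i) - y_i)^2)."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        error = (theta0 + theta1 * x) - y  # h_theta(x_i) - y_i
        total += error ** 2
    return total / (2 * m)


# Hypothetical training set lying exactly on the line y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

print(cost(0.0, 2.0, xs, ys))  # perfect fit, so the cost is 0.0
print(cost(0.0, 0.0, xs, ys))  # (4 + 16 + 36) / 6, about 9.33
```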

# 1.3 Gradient Descent

Gradient descent is an algorithm for minimizing a cost function $J$. We apply it here to linear regression, but it is useful to know that gradient descent is a _general_ algorithm that is used to minimize many other functions in Machine Learning.

It works pretty much like this:

* Start with some $\theta_0, \theta_1$ (_Tip_: $(0,0)$ could be a good choice)
* Keep changing $\theta_0, \theta_1$ to reduce $J(\theta_0, \theta_1)$ until we (hopefully) end up in a minimum 

**Metaphor warning**: _Gradient descent can be compared to starting at a random point on a hilly surface and wanting to reach the lowest point. At each step, we look around, find the direction in which the ground descends most steeply, and take a small step that way._

### Definition

To get slightly more formal, we can define gradient descent as:

_Repeat until convergence (:= denotes **assignment**):_

$\displaystyle \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$

where:

$\alpha$ = learning rate (it controls the size of each step)

**Note 1**: here we want to **simultaneously** update $\theta_0$ and $\theta_1$, i.e. compute both new values from the old $\theta_0, \theta_1$ before assigning either of them.  
**Note 2**: if $\alpha$ is too small, GD can be very slow. If it's too large, it may fail to converge (or even diverge).

Furthermore, it is worth noticing how GD will "slow down" by itself as we approach a local minimum, even with a constant $\alpha$, since the derivative part of the equation (representing the slope of the tangent line) will get smaller and smaller.
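
The update rule and the effect of $\alpha$ can be sketched on a toy problem. The cost $J(\theta_0, \theta_1) = (\theta_0 - 3)^2 + (\theta_1 + 1)^2$ is an assumption chosen only because its minimum, $(3, -1)$, is easy to see. The names `grad_J` and `gradient_descent` are likewise illustrative.

```python
def grad_J(theta0, theta1):
    """Partial derivatives of the toy cost J = (theta0 - 3)^2 + (theta1 + 1)^2."""
    return 2 * (theta0 - 3), 2 * (theta1 + 1)


def gradient_descent(alpha, steps, theta0=0.0, theta1=0.0):
    for _ in range(steps):
        d0, d1 = grad_J(theta0, theta1)
        # Simultaneous update: both derivatives were computed from the old
        # values before either parameter is overwritten.
        temp0 = theta0 - alpha * d0
        temp1 = theta1 - alpha * d1
        theta0, theta1 = temp0, temp1
    return theta0, theta1


small = gradient_descent(alpha=0.1, steps=100)
large = gradient_descent(alpha=1.1, steps=100)
print(small)  # close to the minimum at (3, -1)
print(large)  # far from the minimum: the steps overshoot and diverge
```

With $\alpha = 0.1$ each step shrinks the distance to the minimum, while with $\alpha = 1.1$ each step overshoots by more than it corrects, so the iterates move further away.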

### Gradient Descent for Linear Regression

$\displaystyle \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum_{i=1}^m (\theta_0 + \theta_1 x_i - y_i)^2$  

Let's look at $\theta_0$ and $\theta_1$ separately:

for $j = 0 : \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^m (h_{\theta}(x_i) - y_i)$  

for $j = 1 : \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^m (h_{\theta}(x_i) - y_i) \cdot x_i$  
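
The step from the squared sum to the expressions for $j = 0$ and $j = 1$ is an application of the chain rule. As a sketch, for a single training example and $j = 1$:

$\displaystyle \frac{\partial}{\partial \theta_1} \frac{1}{2} (\theta_0 + \theta_1 x_i - y_i)^2 = \frac{1}{2} \cdot 2 (\theta_0 + \theta_1 x_i - y_i) \cdot \frac{\partial}{\partial \theta_1} (\theta_0 + \theta_1 x_i - y_i) = (h_{\theta}(x_i) - y_i) \cdot x_i$

For $j = 0$ the inner derivative is $1$ instead of $x_i$, which is why the trailing factor disappears. Averaging over the $m$ training examples gives the $\frac{1}{m}$ sums.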

Therefore, here is our final gradient descent linear regression algorithm:

---

_Repeat until convergence:_

$\displaystyle \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^m (h_{\theta}(x_i) - y_i)$  

$\displaystyle \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^m (h_{\theta}(x_i) - y_i) \cdot x_i$  

---
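
The algorithm above can be sketched in plain Python as follows. The function name `fit_linear_regression`, the default values for `alpha` and `iterations`, and the dataset are assumptions made for this example. A fixed iteration count stands in for "repeat until convergence".

```python
def fit_linear_regression(xs, ys, alpha=0.1, iterations=2000):
    """Gradient descent for univariate linear regression."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0  # start at (0, 0)
    for _ in range(iterations):
        # h_theta(x_i) - y_i for every training example, using the old thetas.
        errors = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update of both parameters.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Hypothetical training set generated from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

theta0, theta1 = fit_linear_regression(xs, ys)
print(theta0, theta1)  # close to 1.0 and 2.0
```

In practice one would usually stop when the change in $J$ between iterations falls below a small threshold rather than after a fixed number of steps.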

**Note**: the cost function is always going to be a **convex function** for linear regression, so it has no local optima other than the global optimum.