## 1. Linear Regression using the Gradient Descent algorithm

In linear regression there are $n$ input features and $1$ output feature that we want to predict.
Visually, we want to find a line (in general, an $n$-dimensional hyperplane) that "fits" the data points, or more precisely, is as close as possible to all of them.

The description above gives us an idea of what we're looking for, but it cannot be considered a problem statement.



### 1.1 What is the problem?
Making a clear problem statement is as important as solving the problem.
To that end, note that we can model the input features of each sample as a vector $X^{(i)}$ in a vector space $V$, and the output as a scalar $y^{(i)}$ in the field $F$ over which the vector space is defined. Here $V = \mathbb{R}^n$ and $F = \mathbb{R}$, and the superscript $(i)$ indicates the sample index.

Now, let's assume that there is a function $h^*: V \rightarrow F$ that maps these points to scalars and exactly "fits" all data points, so $h^*(X^{(i)}) = y^{(i)}$ for every sample. It is not necessarily linear, nor of any other particular form.

We define $h$ as the *hypothesis* (an estimate of $h^*$), under the constraint that $h$ is a linear function.

Since every linear function can be represented by a vector of coefficients, the problem of finding $h$ is equivalent to finding its vector of coefficients,
which we denote by $\Theta = [\theta_{1},\dots, \theta_{n}]$. So it's convenient to write $h_{\Theta}$ instead of just $h$.

Now we're ready to write the problem statement.

#### statement 1:
> Given the value of $h^*$ for $m$ points/vectors $X^{(1)}, \dots,X^{(m)}$, find a linear function $h_{\Theta}$ which estimates $h^*$.

The phrases "as close as possible" (section 1) and "estimates" (statement 1) do not say precisely what we should do. To measure how good a given function $h_{\Theta}$ is, we define a cost function $J : V \rightarrow F$ as:
$$ J(\Theta) = \frac{1}{2m} \sum_{i=1}^m (h_{\Theta}(X^{(i)})-y^{(i)})^2 $$
So we want to find some $h_{\Theta}$ which minimizes the value of $J$.

Now we can update the problem statement as:

#### statement 2:
> Given the value of $h^*$ for $m$ points/vectors $X^{(1)}, \dots,X^{(m)}$, find a linear function $h_{\Theta}$ for which $J(\Theta)$ is minimized.


In [None]:
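# functions defined above
import numpy as np

m, n = 100, 3                    # number of samples and features (n = 3 is a placeholder)
rng = np.random.default_rng(0)
X = rng.normal(size=(m, n))      # placeholder inputs, one sample per row
y = rng.normal(size=m)           # placeholder targets
theta = np.zeros(n)              # coefficient vector, one entry per feature

def h(theta, x):
    # linear hypothesis: dot product of the coefficients with one sample
    return theta.dot(x)

def J(theta):
    # sum of squared errors over all m samples, divided by 2m
    res = 0
    for i in range(m):
        res += (h(theta, X[i]) - y[i])**2
    res /= 2*m
    return res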
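
The loop above can also be written without an explicit Python loop. The following is a minimal sketch, assuming the samples are stacked as rows of a NumPy array; the name `cost` and the three-point dataset are made up for illustration and are not part of the original notebook.

```python
import numpy as np

def cost(theta, X, y):
    """Vectorized J(theta): sum of squared residuals divided by 2m."""
    residuals = X @ theta - y          # h_theta(X^(i)) - y^(i) for every sample i
    return residuals @ residuals / (2 * len(y))

# Tiny hand-checkable dataset (hypothetical): y = 2 * x exactly.
X_demo = np.array([[1.0], [2.0], [3.0]])
y_demo = np.array([2.0, 4.0, 6.0])

print(cost(np.array([2.0]), X_demo, y_demo))  # perfect fit, so every residual is zero
print(cost(np.array([0.0]), X_demo, y_demo))  # residuals are -2, -4, -6
```

For the second call the residuals are $-2, -4, -6$, so the cost is $(4 + 16 + 36) / (2 \cdot 3) = 56/6$.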

### 1.2 Solution
We use the gradient descent algorithm to find some $\Theta$ that is a local minimum of $J$.

This is an overview of how the algorithm works:

Let's assume we have $\Theta_{1}$ as the first hypothesis. We can initialize it to some random vector.

We claim that $J(\Theta_{2}) \leq J(\Theta_{1})$ for $\Theta_{2} = \Theta_{1} - \eta \nabla J(\Theta_{1})$, provided the step size (learning rate) $\eta > 0$ is small enough.
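
To make the role of $\eta$ concrete, here is a small sketch of a single update step on a made-up one-feature dataset. The helper names `cost`, `grad` and `step`, the data, and the two values of $\eta$ are assumptions for illustration only. The gradient used is the standard one for the squared-error cost of a linear hypothesis, $\frac{1}{m} X^{T}(X\Theta - y)$, with the samples stacked as rows of $X$.

```python
import numpy as np

# Hypothetical toy data: y = 2 * x exactly, so the best coefficient is 2.
X_demo = np.array([[1.0], [2.0], [3.0]])
y_demo = np.array([2.0, 4.0, 6.0])

def cost(theta):
    r = X_demo @ theta - y_demo
    return r @ r / (2 * len(y_demo))

def grad(theta):
    # gradient of the squared-error cost for a linear hypothesis
    return X_demo.T @ (X_demo @ theta - y_demo) / len(y_demo)

def step(theta, eta):
    # one update: theta_2 = theta_1 - eta * grad J(theta_1)
    return theta - eta * grad(theta)

theta_1 = np.array([0.0])
j_start = cost(theta_1)
j_small = cost(step(theta_1, eta=0.1))   # small step: the cost goes down
j_large = cost(step(theta_1, eta=1.0))   # too large a step: overshoots, the cost goes up

print(j_start, j_small, j_large)
```

On this data the cost is $J(\theta) = \frac{14}{6}(\theta - 2)^2$, so a single step lowers the cost only when $\eta$ is below roughly $0.43$; the step with $\eta = 1$ jumps far past the minimum.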

Doing this $p-1$ times, we end up with a sequence $\Theta_{1}, \dots, \Theta_{p}$ in which each element is at least as good an estimate as the previous one.

Although $\Theta_{p}$ might not be exactly a local minimum, in many applications it is close enough.
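
As a rough check of "close enough", the sketch below runs the update on synthetic data and compares the result with the least-squares solution returned by `np.linalg.lstsq`. Everything specific here is an assumption for illustration: the data generator, `true_theta`, the noise level, and the values of `eta` and `p`.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 2
X = rng.normal(size=(m, n))                      # synthetic inputs, one sample per row
true_theta = np.array([1.5, -0.5])               # hypothetical coefficients
y = X @ true_theta + 0.1 * rng.normal(size=m)    # targets with a little noise

def cost(theta):
    r = X @ theta - y
    return r @ r / (2 * m)

def grad(theta):
    return X.T @ (X @ theta - y) / m

eta, p = 0.1, 100
theta = np.zeros(n)                              # Theta_1
costs = [cost(theta)]
for _ in range(p - 1):                           # p - 1 updates give Theta_1 .. Theta_p
    theta = theta - eta * grad(theta)
    costs.append(cost(theta))

# Reference: the exact minimizer of J for this data.
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

print("cost at Theta_1 and Theta_p:", costs[0], costs[-1])
print("largest coefficient gap to least squares:", np.abs(theta - theta_ls).max())
```

Because $J$ is convex for a linear hypothesis, the local minimum reached here is also the global one, which is why the least-squares solution is a fair reference.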



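Here $\nabla J(\Theta)$ is the gradient of $J$, the vector of its partial derivatives with respect to the coefficients:
$$\nabla J(\Theta) = \left[\frac{\partial J}{\partial \theta_{1}}, \dots, \frac{\partial J}{\partial \theta_{n}}\right]$$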
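
For the linear hypothesis, each partial derivative can be worked out explicitly. The derivation below is a sketch that assumes $h_{\Theta}(X) = \sum_{k=1}^{n} \theta_{k} X_{k}$ (no separate intercept term) and writes $X^{(i)}_{j}$ for the $j$-th feature of sample $i$, a notation not used elsewhere in this section. Applying the chain rule to the cost function:

$$\frac{\partial J}{\partial \theta_{j}} = \frac{1}{2m} \sum_{i=1}^m 2\,(h_{\Theta}(X^{(i)})-y^{(i)}) \, \frac{\partial h_{\Theta}(X^{(i)})}{\partial \theta_{j}} = \frac{1}{m} \sum_{i=1}^m (h_{\Theta}(X^{(i)})-y^{(i)}) \, X^{(i)}_{j}$$

The factor $\frac{1}{2}$ in $J$ cancels the $2$ produced by differentiating the square, which is the usual reason for including it.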

In [None]:
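# gradient descent
import numpy as np

p = 100      # number of iterations
eta = 0.1    # learning rate (step size); 0.1 is an illustrative value

def gradient(theta):
    # dJ/dtheta_j = (1/m) * sum_i (h(theta, X[i]) - y[i]) * X[i][j]
    # uses h, X, y and m from the cell above
    gradient_vector = np.zeros(len(theta))
    for i in range(m):
        gradient_vector += (h(theta, X[i]) - y[i]) * np.asarray(X[i])
    return gradient_vector / m

def gradient_descent(theta):
    # repeat the update theta <- theta - eta * gradient(theta) p times
    for _ in range(p):
        theta = theta - eta * gradient(theta)
    return theta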