# Univariate Linear Regression model

#### 1.0 Problem statement

Suppose you are the CEO of a restaurant franchise and are considering different cities for opening a new outlet.
- You would like to expand your business to cities that may give your restaurant higher profits.
- The chain already has restaurants in various cities and you have data for profits and populations from the cities.
- You also have data on cities that are candidates for a new restaurant. 
    - For these cities, you have the city population.
    
Can you use the data to help you identify which cities may potentially give your business higher profits?

> Note:
  - `X` is the population of a city
  - `y` is the profit of a restaurant in that city. A negative value for profit indicates a loss.   
    - Both `X` and `y` are arrays.


| Population of a city (x 10,000) as $x$ | Profit of a restaurant (x \$10,000) as $y$ |
| -------------------| ------------------------ |
| 6.1101             | 17.592                   |
| 5.5277             | 9.1302                   |
| 8.5186             | 13.662                   |
| 7.0032             | 11.854                   |
| 5.8598             | 6.8233                   |

Number of training examples: $m$

In this case $m = 5$ 

#### 2.0 Model Function

The model function for linear regression (a function that maps from `x` to `y`) is represented as

$$ f_{w,b}(x^{(i)}) = w*x^{(i)} + b \tag{1}$$
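Equation (1) can be sketched in a few lines of Python. The function name `predict` and the parameter values below are illustrative assumptions, not fitted values.

```python
def predict(x, w, b):
    """Return the model output f_wb(x) = w * x + b for a single input x."""
    return w * x + b

# Hypothetical parameters, chosen only to show the call (not fitted to the data).
w_example, b_example = 2.0, 1.0

# First population value from the table, in units of 10,000 people.
print(predict(6.1101, w_example, b_example))
```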

#### 3.0 Compute Cost
Cost is a measure of how well the model predicts the target output; in this case the target output is the profit of a restaurant.

Gradient descent involves repeated steps to adjust the values of `w` and `b` to get a smaller and smaller cost, $J(w,b)$.

The equation for cost with one variable is:
$$J(w,b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2 \tag{2}$$ 

where 
  $$f_{w,b}(x^{(i)}) = w*x^{(i)} + b \tag{1}$$
  
- $f_{w,b}(x^{(i)})$ is our prediction for example $i$ using parameters $w,b$.  
- $(f_{w,b}(x^{(i)}) -y^{(i)})^2$ is the squared difference between the target value and the prediction.   
- These differences are summed over all the $m$ examples and divided by `2*m` to produce the cost, $J(w,b)$.  
> Note:
- Summation ranges are typically written from 1 to m, while code indexes from 0 to m-1. The equations here use the 0 to m-1 convention to match the code.
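As a minimal sketch of equation (2), the cost can be computed with a plain loop over the five training examples from the table. The names `x_train`, `y_train`, and `compute_cost` are assumptions made for this example.

```python
# The five training examples from the table (population and profit, both in 10,000s).
x_train = [6.1101, 5.5277, 8.5186, 7.0032, 5.8598]
y_train = [17.592, 9.1302, 13.662, 11.854, 6.8233]

def compute_cost(x, y, w, b):
    """Return J(w, b): the sum of squared errors divided by 2 * m."""
    m = len(x)
    total = 0.0
    for i in range(m):            # code indexes examples from 0 to m-1
        f_wb = w * x[i] + b       # prediction for example i
        total += (f_wb - y[i]) ** 2
    return total / (2 * m)

# Cost of the trivial model w = 0, b = 0, which predicts zero profit everywhere.
print(compute_cost(x_train, y_train, 0.0, 0.0))
```

With `w = 0` and `b = 0` the cost reduces to the sum of the squared targets divided by `2*m`, roughly 76.66 for these five rows.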


_**3.1 Cost versus iterations of gradient descent**_

A plot of cost versus iterations is a useful measure of progress in gradient descent. Cost should always decrease in successful runs. The change in cost is so rapid initially that it is useful to plot the initial descent on a different scale than the final descent. In the plots below, note the scale of cost on the axes and the iteration step.

![Cost vs Iteration](images/cost.png)


_**3.2 Plot of cost J(w,b) vs w,b with path of gradient descent**_

The plot shows the cost $J(w,b)$ over a range of $w$ and $b$. Cost levels are represented by the rings. Overlaid, using red arrows, is the path of gradient descent. Here are some things to note:
- The path makes steady (monotonic) progress toward its goal.
- Initial steps are much larger than the steps near the goal.

![Contour plot of cost with the path of gradient descent](images/counter_cost.png)

_**3.3 Convex Cost surface**_

Because the cost function squares the errors of a linear model, $\sum_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2$, the 'error surface' is convex, like a soup bowl. It always has a minimum that can be reached by following the gradient in all dimensions.

#### 4.0 Gradient Descent

In linear regression, we use the training data to fit the parameters $w$, $b$ by minimizing a measure of the error between our predictions $f_{w,b}(x^{(i)})$ and the actual data $y^{(i)}$. The measure is called the cost, $J(w,b)$. In training, we measure the cost over all of our training samples $(x^{(i)}, y^{(i)})$. Gradient descent then repeats the following updates:

$$\begin{align*} \text{repeat}&\text{ until convergence:} \; \lbrace \newline
\;  w &= w -  \alpha \frac{\partial J(w,b)}{\partial w} \tag{3}  \; \newline 
 b &= b -  \alpha \frac{\partial J(w,b)}{\partial b}  \newline \rbrace
\end{align*}$$
where, parameters $w$, $b$ are updated simultaneously.  
The gradient is defined as:
$$
\begin{align}
\frac{\partial J(w,b)}{\partial w}  &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})*x^{(i)} \tag{4}\\
  \frac{\partial J(w,b)}{\partial b}  &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)}) \tag{5}\\
\end{align}
$$

Here *simultaneously* means that you calculate the partial derivatives for all the parameters before updating any of the parameters.
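Equations (3) to (5) can be sketched as follows. The function names `compute_gradient` and `gradient_step` are assumptions for this example. Both partial derivatives are computed from the current `w` and `b` before either parameter changes, which is what makes the update simultaneous.

```python
def compute_gradient(x, y, w, b):
    """Return (dJ/dw, dJ/db) for the squared-error cost, per equations (4) and (5)."""
    m = len(x)
    dj_dw = 0.0
    dj_db = 0.0
    for i in range(m):
        err = (w * x[i] + b) - y[i]   # prediction minus target for example i
        dj_dw += err * x[i]
        dj_db += err
    return dj_dw / m, dj_db / m

def gradient_step(x, y, w, b, alpha):
    """Apply one simultaneous update of w and b, per equation (3)."""
    # Both derivatives use the old w and b; only then are the parameters replaced.
    dj_dw, dj_db = compute_gradient(x, y, w, b)
    return w - alpha * dj_dw, b - alpha * dj_db

# Tiny made-up dataset that lies exactly on y = 2x + 1 (not the restaurant data).
x_demo, y_demo = [1.0, 2.0], [3.0, 5.0]
print(compute_gradient(x_demo, y_demo, 2.0, 1.0))   # gradient at the exact fit
print(gradient_step(x_demo, y_demo, 0.0, 0.0, 0.1))  # one step starting from w = 0, b = 0
```

At the exact fit every error term is zero, so both derivatives vanish and a step leaves the parameters unchanged.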


_**4.1 Cost vs w, with gradients, b set to 100.**_

The plot shows $\frac{\partial J(w,b)}{\partial w}$, the slope of the cost curve relative to $w$, at three points. To the left of the minimum the derivative is negative, and to the right it is positive. Due to the 'bowl shape', the derivatives will always lead gradient descent toward the bottom, where the gradient is zero.

![gradient](images/grads.png)

#### 5.0 Evaluating our model

To evaluate the fitted model, we use the coefficient of determination, $R^2$, which is given by the following formula:

$$
R^2 = 1 - \frac{\text{Residual sum of squares}}{\text{Total sum of squares}}
$$


$$
R^2 = 1 - \frac{\sum\limits_{i=0}^{m-1}(f_{w,b}(x^{(i)}) - y^{(i)})^2}{\sum\limits_{i=0}^{m-1}(y^{(i)} - \bar{y})^2}
$$

where $\bar{y} = \frac{1}{m}\sum\limits_{i=0}^{m-1} y^{(i)}$ is the mean of the target values.
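A minimal sketch of the $R^2$ computation, assuming the standard definition in which the total sum of squares is taken around the mean of the observed targets $y$. The function name `r_squared` and the small dataset are made up for this example.

```python
def r_squared(x, y, w, b):
    """Return R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    m = len(y)
    y_mean = sum(y) / m
    ss_res = sum(((w * x[i] + b) - y[i]) ** 2 for i in range(m))  # prediction errors
    ss_tot = sum((y[i] - y_mean) ** 2 for i in range(m))          # spread of targets
    return 1.0 - ss_res / ss_tot

# Made-up data lying exactly on y = x.
x_demo = [1.0, 2.0, 3.0]
y_demo = [1.0, 2.0, 3.0]
print(r_squared(x_demo, y_demo, 1.0, 0.0))  # perfect fit
print(r_squared(x_demo, y_demo, 0.0, 2.0))  # always predicts the mean of y
```

A perfect fit gives $R^2 = 1$, while a model that always predicts the mean of $y$ gives $R^2 = 0$.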

#### 6.0 Learning parameters using batch gradient descent 

You will now find the optimal parameters of a linear regression model by using batch gradient descent. Recall that *batch* means every iteration uses all the training examples.
- You don't need to implement anything for this part. Simply run the cells below. 

- A good way to verify that gradient descent is working correctly is to look
at the value of $J(w,b)$ and check that it is decreasing with each step. 

- Assuming you have implemented the gradient and computed the cost correctly and you have an appropriate value for the learning rate alpha, $J(w,b)$ should never increase and should converge to a steady value by the end of the algorithm.
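The check described above can be sketched as a batch gradient descent loop that records the cost after every iteration. This is a sketch under stated assumptions: the learning rate `alpha=0.01`, the 1500 iterations, the zero initial parameters, and all function names are choices made for this example, and it runs only on the five rows from the table, so its result is not meant to reproduce any particular expected output.

```python
# The five training examples from the table (both columns in units of 10,000).
x_train = [6.1101, 5.5277, 8.5186, 7.0032, 5.8598]
y_train = [17.592, 9.1302, 13.662, 11.854, 6.8233]

def compute_cost(x, y, w, b):
    m = len(x)
    return sum((w * x[i] + b - y[i]) ** 2 for i in range(m)) / (2 * m)

def compute_gradient(x, y, w, b):
    m = len(x)
    dj_dw = sum((w * x[i] + b - y[i]) * x[i] for i in range(m)) / m
    dj_db = sum((w * x[i] + b - y[i]) for i in range(m)) / m
    return dj_dw, dj_db

def gradient_descent(x, y, w, b, alpha, num_iters):
    """Run batch gradient descent; every iteration uses all m examples."""
    cost_history = [compute_cost(x, y, w, b)]
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradient(x, y, w, b)
        w, b = w - alpha * dj_dw, b - alpha * dj_db   # simultaneous update
        cost_history.append(compute_cost(x, y, w, b))
    return w, b, cost_history

w, b, cost_history = gradient_descent(x_train, y_train, 0.0, 0.0,
                                      alpha=0.01, num_iters=1500)

# Verify that J(w, b) went down (or stayed flat) at every step.
never_increased = all(later <= earlier
                      for earlier, later in zip(cost_history, cost_history[1:]))
print(f"w = {w:.4f}, b = {b:.4f}, final cost = {cost_history[-1]:.4f}")
print("cost never increased:", never_increased)
```

If the cost grows from one iteration to the next, the usual suspects are a learning rate that is too large or a sign error in the gradient.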

_**6.1 Expected Output**_

Optimal w, b found by gradient descent 

| w                  | b                        |
| -------------------| ------------------------ |
|1.492054            |-3.216610                 |

We will now use our final parameters w, b to find our prediction for a single example.

Recall:

$$
f_{w,b}(x^{(i)}) = w * x^{(i)} + b
$$

Let's predict what the profit will be for areas of 35,000 and 70,000 people.

- The model takes in population of a city in 10,000s as input. 

- Therefore, 35,000 people can be translated into an input to the model as `input[] = {3.5}`

- Similarly, 70,000 people can be translated into an input to the model as `input[] = {7.0}`
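A short sketch of the two predictions, using the `w` and `b` values from the expected output table. The function name `predict` is an assumption for this example. Populations are divided by 10,000 before they go into the model, and the model output is multiplied by 10,000 to get dollars.

```python
# Parameters taken from the expected output table.
w, b = 1.492054, -3.216610

def predict(x, w, b):
    """Return f_wb(x) = w * x + b, with x and the result both in units of 10,000."""
    return w * x + b

for population in (35_000, 70_000):
    x = population / 10_000               # 35,000 -> 3.5 and 70,000 -> 7.0
    profit = predict(x, w, b) * 10_000    # convert back to dollars
    print(f"Population {population:,}: predicted profit ${profit:,.2f}")
```

With these parameters the model predicts a profit of about \$20,055.79 for 35,000 people and about \$72,277.68 for 70,000 people.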