# Choosing a best-fit line

### Introduction

So far we have seen that our regression lines estimate our values of $y$.  They are useful because they let us estimate an output given an input.  In our example, we can estimate a movie's expected revenue given its budget.  Our regression lines are described by two parameters: $m$, which represents the slope of the line, and $b$, which is the value of $y$ when $x$ is zero.

So far we have been rather fast and loose with choosing our regression line.  Well that ends today.  In this lesson, we'll begin to evaluate the accuracy of a regression line, and how to use a technique to choose a regression line.  

### Determining Quality

To measure the accuracy of a regression line, we see how closely our regression line matches the data we have.  Let's see what we mean.

In [1]:
first_show = {'x': 0, 'y': 100}
second_show = {'x': 100, 'y': 150}
third_show = {'x': 200, 'y': 600}
fourth_show = {'x': 400, 'y': 700}

shows = [first_show, second_show, third_show, fourth_show]

For now, let's set a roughshod regression line simply by drawing a line between our first and last points.  Our regression line will be changing, so it's ok that we don't use a rigorous technique right now.

In [6]:
# We write a method to calculate the slope between two points, and then apply this to our first and last points.  

def slope_between_two_points(first_point, second_point):
    return (second_point['y'] - first_point['y'])/(second_point['x'] - first_point['x'])

slope_between_two_points(shows[0], shows[3]) # 1.5

def y_intercept(points):
    # use the points passed in (not the global shows) and find the one where x is 0
    point_at_zero = list(filter(lambda point: point['x'] == 0, points))[0]
    return point_at_zero['y']

y_intercept(shows) # b = 100 
slope_between_two_points(shows[0], shows[3]) # m = 1.5

# plugging m and b into our formula y = mx + b we get the following sample_regression_formula
def sample_regression_formula(x):
    return 1.5*x + 100


Ok, so now that we have our sample regression formula, we can see it in action.  We draw a chart that displays our sample regression line (made by simply drawing a line between the data points with the lowest and highest x values) and then show with red lines where our regression line does not match up with the data.

![](./regression-scatter.png)

Take a look at that first red line.  It shows that our regression formula does not perfectly predict our data.  More concretely, at that spot, x = 100, and our regression line predicts that the value of y will be 250.  Even if we did not have our plot, we could see this by using our formula for the regression line, y = 1.5x + 100: at x = 100, y = 1.5 * 100 + 100 = 250.

However, the point below our regression line shows the actual value of y when x = 100, which is 150.  In other words, our regression line does not perfectly predict our data, and each point where it misses in its prediction is called an error.  So that is what the red lines are indicating: the distance between the predicted data and the actual data.  Or in other words, the red lines visually display the size of each error.

Now let's measure the size of that error.  We can say that our error, at point x = 100, is the actual_y - expected_y = (150) - (250) = -100. 
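As a quick check, the same error can be computed in code.  This is only a minimal sketch: the helper names `predict` and `error_at` are illustrative and are not part of the lesson's own functions.

```python
# Sample regression line from above: y = 1.5x + 100
def predict(x):
    return 1.5 * x + 100

# Illustrative helper: error = actual y - predicted y
def error_at(point):
    return point['y'] - predict(point['x'])

second_show = {'x': 100, 'y': 150}

predict(second_show['x'])  # 250.0
error_at(second_show)      # -100.0
```

The negative sign tells us that the actual value sits below the line at that point.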

### Refining our Terms

One thing that's difficult is that we are really now talking about two things, our predicted y values and our actual y values.  Let's spend some time on notations.  We spend time on notations because (1) other resources you look at may use them and we don't want you to be scared off by them and (2) once you understand the notations, they allow us to speak with more clarity. 

Now so far we have defined our regression function as y = mx + b.  Where for a given value of x, we can calculate the value of y.  However, this value of y is not the actual value of y, but just an estimation.  So let's indicate this, by changing our function to look like the following.  

$\hat{y} = \hat{m}x + \hat{b}$ 

Those marks over y, m and b are called hats, so this is read as y hat equals m hat times x plus b hat.  These hats indicate that this formula does not give us the actual value of y, but simply our estimated value of y, and that this predicted value of y is based on our estimated values of m and b.  
> Note that x is not a predicted value.  Why is this?  Well, we are providing a value x, which in this case represents our movie budget, not predicting it.  So we are providing a value of x and asking it to predict a value of y.  

Now how do we represent our actual values of x and y, that is, each of the individual points in our scatter plot?  We can just represent each actual y value as $y$.  No special mark needed.

Ok, so now we know that to indicate an estimated value, we use the hat symbol, as in $\hat{y}$.  So let's try to apply this to our formula for an error, which we said is actual value - estimated value.

In math terms, this is error = $y - \hat{y}$.

And while we're at it, let's use the Greek letter epsilon, $\varepsilon$, to indicate error, so now our formula is:

$\varepsilon = y - \hat{y}$

Now, to indicate that we are representing something at a specific point, we can use a subscript.  For example, $\hat{y}_{x=100}$ means the estimated value of y when x = 100.  And to indicate the error at the point where x = 100, we write:

$\varepsilon_{x=100} = y_{x=100} - \hat{y}_{x=100}$ 

or 

$\varepsilon _{x=100} = 150 - 250 = -100$

Now, we wrote the general formula for error as $\varepsilon = y - \hat{y}$, but we can be a little more precise by saying that this is the error at any specific point, where $y$ and $\hat{y}$ are taken at that same point.  This is written as: 

$\varepsilon_{i} = y_{i} - \hat{y}_{i}$

### Calculating and representing total error

So far, we saw that we can calculate the error at a given value of x by using the formula $\varepsilon = y - \hat{y}$.  In other words, the error at a given value of x is the actual value minus the expected value.  Now, we want to see how well our regression line describes the relation between x and y in general, not just at a given point, so let's move beyond calculating the error at a given point to describing the total error of the regression line from the actual data.  

An initial idea is simply to calculate the total error by summing the errors, $y - \hat{y}$, for every point in our dataset.  

However let's take another look at our data.

![](./regression-scatter.png)

Notice that the error at x = 100 is 150 - 250 = -100, and the error at x = 200 is 600 - 400 = 200.  So adding these errors, -100 + 200 = 100, would begin to cancel them out.  To avoid that effect, we can simply square the errors, so that we never sum a negative number.

$error^2 = (y - \hat{y})^2$
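To see the cancelling effect concretely, here is a small sketch that sums the raw errors and the squared errors for our four points.  The helper name `predict` is illustrative, and the sample line y = 1.5x + 100 is the one chosen earlier.

```python
shows = [
    {'x': 0, 'y': 100},
    {'x': 100, 'y': 150},
    {'x': 200, 'y': 600},
    {'x': 400, 'y': 700},
]

# Sample regression line from above: y = 1.5x + 100
def predict(x):
    return 1.5 * x + 100

# Error at each point: actual y - predicted y
errors = [show['y'] - predict(show['x']) for show in shows]  # [0.0, -100.0, 200.0, 0.0]

sum_of_errors = sum(errors)                                  # 100.0, the -100 hides half of the 200
sum_of_squared_errors = sum(error ** 2 for error in errors)  # 50000.0, nothing cancels
```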

Now, given a list of points with coordinates (x, y), we can calculate the squared error of each of the points.  We have actual x and y values at x = 0, 100, 200 and 400.  So we have:

sum of squared error = $(100 - 100)^2 + (150 - 250)^2 + (600 - 400)^2 + (700 - 700)^2$

so 

sum of squared error  = $(-100)^2 + 200^2 = 50,000$

Now there is one thing a little off with our measure of error: our regression line's total error will tend to increase with each added piece of data.  For example, assume that we add to our dataset the actual data x = 300, y = 500.  Our regression line did a relatively good job of estimating this point, as $\varepsilon = 500 - 550 = -50$, or $\varepsilon^2 = (500 - 550)^2 = 2,500$.  But because we are now including an extra error in our sum, it looks like our line did worse. 

sum of squared error = $(-100)^2 + 200^2 + (-50)^2 = 52,500$ 


To fix this, we can change our metric to use the squared error per point, or in other words, the average squared error.  The average of something is simply the sum divided by the number of things being summed.  Instead of the word average, we use the word mean, as it sounds fancier: 

mean squared error = (sum of $(y - \hat{y})^2$ for each data point, (x, y)) $\div$ (number of data points)

So changing our sum of squared error (with the original data) to a mean squared error, we have:

Mean Squared Error = 50,000/4 = 12500
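The sketch below compares the two metrics with and without the extra point (x = 300, y = 500) mentioned above.  The function names here are illustrative, not the lesson's own.

```python
# Sample regression line from above: y = 1.5x + 100
def predict(x):
    return 1.5 * x + 100

def sum_of_squared_errors(points):
    return sum((point['y'] - predict(point['x'])) ** 2 for point in points)

def mean_squared_error(points):
    return sum_of_squared_errors(points) / len(points)

original_points = [
    {'x': 0, 'y': 100},
    {'x': 100, 'y': 150},
    {'x': 200, 'y': 600},
    {'x': 400, 'y': 700},
]
# Same data plus the well-predicted extra point
extended_points = original_points + [{'x': 300, 'y': 500}]

sum_of_squared_errors(original_points)  # 50000.0
sum_of_squared_errors(extended_points)  # 52500.0, the sum can only grow
mean_squared_error(original_points)     # 12500.0
mean_squared_error(extended_points)     # 10500.0, the mean rewards the good prediction
```

The sum goes up when the new point is added, while the mean goes down, which is why the mean is the fairer measure here.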

Ok, this is great.  So now we have one number that does a good job of representing how well our regression line fits the data.  We got there by calculating the errors, then squaring the errors so that our errors are always positive, and finally using the average error as it better represents the accuracy of our line.  We could stop here, and be fine.  However, statisticians like to remove the effect of us squaring these errors.  So instead of just calculating the mean squared error, we calculate the Root Mean Squared Error.  This is just taking the square root at the very end.  So 

root mean squared error = the square root of ( (sum of $(y - \hat{y})^2$ for each data point, (x, y)) $\div$ (number of data points) ) 

or with our data it is: 

Root Mean Squared Error = $\sqrt{50,000/4} = \sqrt{12500} \approx 111.8$
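In code, the last step is a single call to `math.sqrt`.  This is a minimal sketch that starts from the four squared errors calculated above; the variable names are illustrative.

```python
import math

# Squared errors of our four points against the line y = 1.5x + 100
squared_errors = [0, 10000, 40000, 0]

mean_squared_error = sum(squared_errors) / len(squared_errors)  # 12500.0
root_mean_squared_error = math.sqrt(mean_squared_error)

round(root_mean_squared_error, 1)  # 111.8
```

Because of the square root, this number is back in the same units as y, so it can be read as a typical size of miss.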

### Summary 

Previous to this lesson, we have simply assumed that our regression lines make "good" predictions of values of $y$ for given values of $x$.  In this lesson, we aimed to find a metric to tell us how well our regression line fits our actual data.  To do this, we started by saying an error at a given point, is the difference of the actual value of y minus the expected value of y from our regression line.  Then we said we describe how well our regression line describes the entire dataset by squaring the errors at each point (to eliminate negative errors), adding these errors, and then dividing by the number of datapoints so as to describe how off our regression line is on average.  This is called the Mean Squared Error.  Finally, to reduce the effect of squaring each of our errors, we take the square root of the Mean Squared Error to arrive at the Root Mean Squared Error.  This is our metric for describing how well our regression line "fits" our data. 

### Formulas

In [16]:
def sample_regression_formula(x):
    return 100 + 1.5*x

def squared_error(point):
    y_hat = sample_regression_formula(point['x'])
    return (point['y'] - y_hat)**2

squared_error(shows[0]) # 0
squared_error(shows[1]) # 10000
squared_error(shows[2]) # 40000
squared_error(shows[3]) # 0

def squared_errors(points):
    return list(map(lambda point: squared_error(point), points))

squared_errors(shows) # [0.0, 10000.0, 40000.0, 0.0]

def average_squared_error(points):
    return sum(squared_errors(points))/len(points)

average_squared_error(shows) # 12500.0

12500.0

### Comparing between lines

Now that we have a number which we can use to evaluate the goodness of fit of our regression line, to find the "best fit" regression line we do the following:
* Adjust $b$ and $m$, as these are the only things that can vary in our regression line.
* After each adjustment calculate the average squared error 
* The regression line (that is, the values of $b$ and $m$) with our smallest average squared error is our best fit line 
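The steps above can be sketched as a brute-force search over candidate lines.  This is only an illustration under stated assumptions: the candidate ranges for $m$ and $b$ are arbitrary choices made for this example, and `mean_squared_error` is a hypothetical helper that takes $m$ and $b$ as arguments.

```python
shows = [
    {'x': 0, 'y': 100},
    {'x': 100, 'y': 150},
    {'x': 200, 'y': 600},
    {'x': 400, 'y': 700},
]

def mean_squared_error(points, m, b):
    return sum((point['y'] - (m * point['x'] + b)) ** 2 for point in points) / len(points)

# Assumed candidate grids, chosen only for illustration
m_candidates = [k / 100 for k in range(100, 255, 5)]  # 1.00, 1.05, ..., 2.50
b_candidates = range(0, 210, 10)                      # 0, 10, ..., 200

# Step 1: adjust m and b.  Step 2: calculate the error for each pair.
candidates = [(m, b) for m in m_candidates for b in b_candidates]

# Step 3: keep the pair with the smallest average squared error.
best_m, best_b = min(
    candidates,
    key=lambda line: mean_squared_error(shows, line[0], line[1]),
)
best_error = mean_squared_error(shows, best_m, best_b)
```

A search like this only finds the best line among the candidates it tries, so a finer grid may find a slightly better fit.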

Let's see this technique in action.  For this example, let's imagine that our data does not include the point when x = 0, so now we have the following.

In [9]:
first_show = {'x': 100, 'y': 150}
second_show = {'x': 200, 'y': 600}
third_show = {'x': 400, 'y': 700}

updated_shows = [first_show, second_show, third_show]

We again take an initial guess at the slope by drawing a line between the first and last points.  And then let's just start by setting $b$ = 100.

In [10]:
def slope_between_two_points(first_point, second_point):
    return (second_point['y'] - first_point['y'])/(second_point['x'] - first_point['x'])

# use the first and last points of the updated data
slope_between_two_points(updated_shows[0], updated_shows[2]) # 1.833

def sample_regression_formula(x):
    # change the number 100 to different numbers, to see what happens
    return 1.83*x + 100

In [11]:
average_squared_error(updated_shows) # 17689.67

17689.666666666668

Now we don't know if $b$ is any good, so let's plug in different numbers.

| b   | average squared error |
| --- |:---------------------:|
| 100 | 17689 |
| 110 | 18663 |
| 90  | 16916 |
| 80  | 16343 |
| 70  | 15969 |
| 60  | 15796 |
| 50  | 15823 |

Now notice that simply by trying different values of $b$, we can find a value that produces a smaller mean squared error (MSE), given our value of $m$ at 1.83.  Setting $b$ to 110 produced a higher error than at 100, so we tried moving in the other direction.  We kept moving in that direction until we set $b$ = 50, at which point our error increased from the value at 60.  So, we know that a value of $b$ between 50 and 60 produces the smallest average squared error when $m$ = 1.83. 
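To narrow this down, we could repeat the search in steps of 1 between 50 and 60.  This is a sketch only: `mse_for_line` is a hypothetical helper introduced for this example, and the step size of 1 is an arbitrary choice.

```python
updated_shows = [
    {'x': 100, 'y': 150},
    {'x': 200, 'y': 600},
    {'x': 400, 'y': 700},
]

# Hypothetical helper: average squared error for the line y = m*x + b
def mse_for_line(points, m, b):
    return sum((point['y'] - (m * point['x'] + b)) ** 2 for point in points) / len(points)

# Try every whole number from 50 to 60, keeping m fixed at 1.83
errors_by_b = {b: mse_for_line(updated_shows, 1.83, b) for b in range(50, 61)}

best_b = min(errors_by_b, key=errors_by_b.get)  # 56
```

With finer steps the best value keeps settling around 56, which sits inside the range of 50 to 60 found by hand.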

Let's plot our table, using plotly.  First, because we will be changing $b$, and eventually $m$, let's change our regression formula so that we can change any of those values.


In [12]:
def regression_formula_variable(x, m, b):
    return m*x + b

Now we update our functions that calculate the error to use our new function, and to allow us to pass through the values of $b$ and $m$.

In [116]:
def squared_error_variable(point, m, b):
    y_hat = regression_formula_variable(point['x'], m, b)
    return (point['y'] - y_hat)**2

def squared_errors_variable(points, m, b):
    return list(map(lambda point: squared_error_variable(point, m, b), points))

def average_squared_error_variable(points, m, b):
    return sum(squared_errors_variable(points, m, b))/len(points)

b_values = list(range(10, 120, 10)) # [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110]
errors = list(map(lambda b_value: average_squared_error_variable(updated_shows, 1.83, b_value), b_values))
error_chart = list(zip(b_values, errors))
error_chart

[(10, 17929.666666666668),
 (20, 17103.0),
 (30, 16476.333333333332),
 (40, 16049.666666666666),
 (50, 15823.0),
 (60, 15796.333333333334),
 (70, 15969.666666666666),
 (80, 16343.0),
 (90, 16916.333333333332),
 (100, 17689.666666666668),
 (110, 18663.0)]

Above is our error chart.  Note that it matches the data in the table above.  If we plot the b-values and the corresponding average squared errors, we see that the data makes a curve.

In [121]:
# imports needed for plotting in the notebook
import plotly
from plotly import graph_objs
plotly.offline.init_notebook_mode(connected=True)

cost_function_trace = graph_objs.Scatter(
    x=list(map(lambda error: error[0], error_chart)),
    y=list(map(lambda error: error[1], error_chart)),
)

layout = dict(title = 'Cost Function',
              yaxis = dict(zeroline = False, title= 'Average Squared Error'),
              xaxis = dict(zeroline = False, title= 'B value')
             )
plotly.offline.iplot(dict(data=[cost_function_trace], layout=layout))

That smiley-face shape above is called the **cost curve**.  It shows the error for different values of $b$.  We want to reduce the error, so we need to find the value of $b$ at which the average squared error is lowest.  Among the values we tried, that appears to be when $b$ is 60.  So our y-intercept, when $m$ is 1.83, should be about 60.

If we show the regression line side by side with the cost curve, you can see how the two relate.

> Don't stress about the below code.  It's not important -- it's just used to generate lines in our plots.  

In [106]:
def generate_regression_line(ending_x, m, b):
    y_hat = m*ending_x + b
    return {
    'type':'line',
    'x0': 0,
    'y0': b,
    'x1': ending_x,
    'y1': y_hat,
    'xref': 'x1',
    'yref': 'y1',
    'line': {
        'color': 'rgb(55, 128, 191)',
        'width': 3,
        }
    }
line = generate_regression_line(400, 1.8, 500)

def generate_cost_line(errors, b):
    return {
    'type':'line',
    'x0': b,
    'y0': 0,
    'x1': b,
    'y1': max(errors),
    'xref': 'x2',
    'yref': 'y2',
    'line': {
        'color': 'rgb(55, 128, 191)',
        'width': 3,
        }
    }


> The code below still doesn't need to be understood.  But do change the value of b, and see how the plots below adjust.

In [105]:
import plotly
from plotly import graph_objs, tools
plotly.offline.init_notebook_mode(connected=True)

fig = tools.make_subplots(rows=1, cols=2)



cost_function_trace = graph_objs.Scatter(
    x=list(map(lambda error: error[0], error_chart)),
    y=list(map(lambda error: error[1], error_chart)),
)
fig.append_trace(cost_function_trace, 1, 2)

scatter_trace = graph_objs.Scatter(
    x=list(map(lambda show: show['x'], updated_shows)),
    y=list(map(lambda show: show['y'], updated_shows)),
    mode="markers"
)


##############

### CHANGE THIS VALUE OF B

b = 100
##############

cost_line = generate_cost_line(errors, b)
regression_line = generate_regression_line(400, 1.8, b)

fig.append_trace(scatter_trace, 1, 1)

fig['layout'].update(shapes=[regression_line, cost_line])
fig['layout']['yaxis1'].update(range=[0, 1000])

plotly.offline.iplot(fig)

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



## Using the fitted line

Once we have fit a line, we can use it.  Before fitting, the model is written in terms of unknown parameters:

$y_i = w_0 + w_1 x_i + \varepsilon_i$

After fitting, the estimated parameters $\hat{w}_0$ and $\hat{w}_1$ take on actual values.

These estimated parameters define a specific line.  To predict the value of a house, we plug the house into our fitted line, and that gives us our estimated value of the house.  So $\hat{y}$ is our predicted value of the house.

The actual value is equally likely to be above or below the prediction.  Since we are unsure whether the error is above or below, our best guess is to put the prediction on the line.

So now we have a fitted regression line, and we can use it to predict the value of a house that has $x$ square feet.

### Interpreting the Coefficients of the Line

Let's just take a look at the first point, towards the bottom left.  That point represents the movie "21 & Over", with 13 million dollars being spent and 25.6 million earned domestically.

In [42]:
parsed_movies[0]

{'budget': 13000000, 'domgross': 25682380.0, 'title': '21 &amp; Over'}

What plotting this data shows us is that as the movie budget increases, represented by the points plotted further to the right, the movie revenue increases.  So, at least we now know something.

Ok, now imagine your movie executive friend told you that the budget that came across his desk was $30 million.  Based on the data we graphed, how much money do you think the movie would bring in?

### Drawing a line

Ok, so how are we going to do something like this?  Well, we could draw a single straight line that approximates the relationship between a movie's budget and revenue.  Below, we draw a line.  We'll worry about how well a line like the one below models the relationship between two different variables later.  For now, let's use this.   

![](./plot-intersect.png)

Well, one of the benefits of using a line is that we can see how much money will be brought in for any point on this line.  Spend 50 million, and expect to bring in about 67 million.  Spend 10 million, and expect to bring in about 13 million.  This approach of using a line to model the relationship between a variable and the output it explains is called **linear regression**. 

Let's see if we can translate this line into a formula that will tell us the y value that corresponds to any given value of x along that line.

Let's take an initial (wrong) guess as to how to make this a formula.  And then we'll take another one.  This is our first guess.

$y = x$

Here is how we write it as a function.

In [20]:
def y(x):
    return x

y(0)

0

In [21]:
y(10000000)

10000000

What the formula is saying is that for every value of $x$ that I input to the function, I will get back an equal value $y$.  So according to the function, if the movie has a budget of 30 million, it will earn 30 million.  

Of course, this does not match the line in our chart.  The line says that spending 30 million brings predicted earnings of 40 million.  So how do we change our function?  Well, looking at the line in our chart, we can examine the x and y values at three different points.

| X        | Y           | 
| ------------- |:-------------:| 
| 0      |0 | 
| 30 million      |40 million | 
| 60 million      |80 million | 

What equation will allow us to input 0 and get back 0, input 30 million and get back 40 million, and input 60 million and get back 80 million?

Well it's $y = 4/3*x$

* 0 = 4/3 * 0
* 40 million =  4/3 * 30 million 
* 80 million = 4/3 * 60 million 

Let's see it in the code, and then in the next section we'll show how to figure what to multiply $x$ by. 

Ok, this is what this formula looks like in code.

In [16]:
def y(x):
    return 4/3*x

y(30000000)

40000000.0

In [17]:
y(0)

0.0

Progress! By multiplying $x$ by a value, we can describe the line in our chart with a function that, given a value of $x$, returns the corresponding value of $y$ along that line.  

In statistics, you will see this formula described as 

$y = mx$ 

With the variables standing for the following: 

* $y$: the value that is returned, also called the **response variable**, as it responds to values of $x$
* $x$: the input variable, also called the **explanatory variable**, as it explains the value of $y$
* $m$: the **slope variable**, determines how vertical or horizontal the line will be

In our movie example, these terms make sense.  The $y$ value is our money earned from the movie, which we say is in response to how much we spend.  Our explanatory variable of $x$ explains the value of $y$, and the $m$ corresponds to our value of 1.33, which determines the slope of the line.

### Calculating the slope variable 

This is our mechanism for calculating the slope $m$.  Take any two points along the straight line, then $m$ is **the ratio of the vertical distance travelled to the horizontal distance travelled**.  Or, in math, it's:

$m = \Delta y \div \Delta x $
> The $\Delta$ is the Greek letter Delta.  In math, Delta means change.  So you can read the above formula as $m$ equals change in y divided by change in x.

For example, let's take another look at our graph, and our line.  Let's travel the distance from x being equal to zero to 10 million.  Plugging the numbers into our formula, we see that for that segment:

* $\Delta x$ = 10 million
* $\Delta y$ = 13.3 million

Notice that another way to word change in x is our ending x value, 10 million, minus our starting x value, 0.  And change in y means our ending y value, 13.3 million, minus our initial y value, 0.  

So this means: 

* $\Delta y = y_1-  y_0$
* $\Delta x = x_1 - x_0$

And therefore we can say $m$ is the following: given a beginning point (x0, y0) and an ending point (x1, y1) along any segment of a straight line, the slope of that line $m$ equals the following:  

$m = (y_1 - y_0) \div (x_1 - x_0)$

Ok, let's apply this formula to our line.  We can choose any two points for the formula, so let's have a starting point of (30 million, 40 million) and an ending point of (60 million, 80 million). Then plugging these coordinates into our formula, we have the following:

* $m = (y_1 - y_0)\div(x_1 - x_0) = (80,000,000 - 40,000,000) \div (60,000,000 - 30,000,000) = 4/3 \approx 1.33$

![](./m-calc.png)

So that is how we calculate the slope of a line: take any two points along that line and divide the distance travelled vertically by the distance travelled horizontally.
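The slope formula translates directly into a small function.  This is a minimal sketch: the name `slope` and the use of (x, y) tuples for points are choices made for this example.

```python
# Slope between a starting point (x0, y0) and an ending point (x1, y1)
def slope(start, end):
    x0, y0 = start
    x1, y1 = end
    return (y1 - y0) / (x1 - x0)

# The two points used above
slope((30_000_000, 40_000_000), (60_000_000, 80_000_000))  # about 1.33

# Any other two points on the same line give the same slope
slope((0, 0), (30_000_000, 40_000_000))                    # about 1.33
```

Note that the function divides by the change in x, so it cannot be used with two points that share the same x value.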

### The y intercept

Ok, there is just one more thing that we need to be able to learn before being able to describe every straight line in a two dimensional world.  That is the y-intercept.

The y-intercept is the y value of the line when it intersects the y-axis.  Or to put it another way, the y-intercept is the value of y when x equals zero. 

![](plot-add.png)

So looking at the graph, what is the y-intercept of the blue line?  Well, it's the value of y when the blue line crosses the y-axis.  The value is zero.  Now you can imagine shifting the entire line up, so that the y-intercept increases to 20 million, and for every value of x, the corresponding value of y increases by 20 million.  So our formula is no longer y = 4/3 x.  It is y = 4/3 x + 20 million. 

In statistics, you will see this as $y = mx + b$ where b is the y-intercept.  Taking a look at our chart of points on the line, we can see that 20 million is our y-intercept.

| X        | Y           | 
| ------------- |:-------------:| 
| 0      |20 million | 
| 30 million      |60 million | 
| 60 million      |100 million | 

And translating our formula into a function, we have:

In [19]:
def y(x):
    return 4/3*x + 20000000

In [20]:
y(30000000)

60000000.0

In [21]:
y(60000000)

100000000.0

The formula $y = mx + b$ can describe any non-vertical line in a two dimensional space.  The $m$ value changes how flat or steep the line is, and the $b$ value shifts the whole line up or down by changing where it crosses the y-axis. 
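One way to make this concrete is a function that builds a line from any $m$ and $b$.  This is just a sketch, and `make_line` is a hypothetical helper introduced for this example.

```python
# Build a function for the line y = m*x + b
def make_line(m, b):
    def y(x):
        return m * x + b
    return y

original_line = make_line(4/3, 0)          # y = 4/3 x
shifted_line = make_line(4/3, 20_000_000)  # y = 4/3 x + 20 million

original_line(30_000_000)  # 40000000.0
shifted_line(30_000_000)   # 60000000.0
shifted_line(0)            # 20000000.0, the y-intercept
```

Changing only `b` moves every output by the same amount, while changing only `m` changes how quickly the outputs grow as x increases.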

### Summary