# Choosing a best-fit line

### Introduction

So far we have seen that our regression lines estimate our values of $y$.  They are useful because they let us estimate an output given an input.  In our example, we can estimate a movie's expected revenue given its budget.  Our regression lines are described by two variables: $m$, which represents the slope of the line, and $b$, which is the value of $y$ when $x$ is zero.

So far we have been rather fast and loose with choosing our regression line.  Well, that ends today.  In this lesson, we'll begin to evaluate the accuracy of a regression line, and see how that measure of accuracy can help us choose a regression line.  

### Determining Quality

To measure the accuracy of a regression line, we see how closely our regression line matches the data we have.  Let's see what we mean.

In [1]:
first_show = {'x': 0, 'y': 100}
second_show = {'x': 100, 'y': 150}
third_show = {'x': 200, 'y': 600}
fourth_show = {'x': 400, 'y': 700}

shows = [first_show, second_show, third_show, fourth_show]

For now, let's set a roughshod regression line simply by drawing a line between our first and last points.  Our regression line will be changing, so it's ok that we don't use a rigorous technique right now.

In [6]:
# We write a function to calculate the slope between two points, and then apply it to our first and last points.

def slope_between_two_points(first_point, second_point):
    return (second_point['y'] - first_point['y'])/(second_point['x'] - first_point['x'])

def y_intercept(points):
    # the y-intercept is the y value of the point where x is zero
    point_at_zero = list(filter(lambda point: point['x'] == 0, points))[0]
    return point_at_zero['y']

y_intercept(shows) # b = 100
slope_between_two_points(shows[0], shows[3]) # m = 1.5

# plugging m and b into our formula y = mx + b we get the following sample_regression_formula
def sample_regression_formula(x):
    return 1.5 * x + 100


Ok, so now that we have our sample regression formula we can see it in action.  We draw a chart that displays our sample regression line (by simply drawing a line between the data points with the lowest and highest x values) and then show with red lines where our regression line does not match up to the data.

![](./regression-scatter.png)

Take a look at that first red line.  It's showing that our regression formula does not perfectly predict our data.  More concretely, at that spot, x = 100, and our regression line is predicting that the value of y will be 250.  Even if we did not have our plot, we could see this by using our formula for the regression line, y = 1.5x + 100: at x = 100, y = 1.5 * 100 + 100 = 250.

However, the point below our regression line shows the actual value of y when x = 100, which is 150.  In other words, our regression line does not perfectly predict our data, and each point where it misses in its prediction is called an error.  So that is what the red lines are indicating: the distance between the predicted data and the actual data.  Or in other words, the red lines are visually displaying the size of each error.

Now let's measure the size of that error.  We can say that our error, at point x = 100, is the actual_y - expected_y = (150) - (250) = -100. 
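To make that calculation concrete, here is a minimal sketch in Python.  The data and the line y = 1.5x + 100 come from the lesson, but the helper names `predicted_y` and `error_at` are made up here for illustration.

```python
# Data from the lesson; the helper names below are illustrative assumptions.
shows = [
    {'x': 0, 'y': 100},
    {'x': 100, 'y': 150},
    {'x': 200, 'y': 600},
    {'x': 400, 'y': 700},
]

def predicted_y(x, m=1.5, b=100):
    """Value of y that the sample regression line y = mx + b predicts at x."""
    return m * x + b

def error_at(point, m=1.5, b=100):
    """Actual y minus predicted y for a single data point."""
    return point['y'] - predicted_y(point['x'], m, b)

second_show = shows[1]                  # the point where x = 100
print(predicted_y(second_show['x']))    # 250.0
print(error_at(second_show))            # -100.0
```

The negative sign tells us the actual value sits below the line at that point.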

### Refining our Terms

One thing that's difficult is that we are now really talking about two things: our predicted y values and our actual y values.  Let's spend some time on notation.  We do this because (1) other resources you look at may use it and we don't want you to be scared off by it, and (2) once you understand the notation, it allows us to speak with more clarity. 

So far we have defined our regression function as y = mx + b, where for a given value of x we can calculate the value of y.  However, this value of y is not the actual value of y, but just an estimation.  So let's indicate this by changing our function to look like the following.  

$\hat{y} = \hat{m}x + \hat{b}$ 

Those marks over y, m and b are called hats, so this is read as y hat equals m hat times x plus b hat.  These hats indicate that this formula does not give us the actual value of y, but simply our estimated value of y, and that this predicted value of y is based on our predicted values of m and b.  
> Note that x is not a predicted value.  Why is this?  Well, we are providing a value x, which in this case represents our movie budget, not predicting it.  So we are providing a value of x and asking it to predict a value of y.  

Now, how do we represent our actual values of x and y, that is, each of the individual points in our scatter plot?  We can just represent each actual y value as $y$.  No special mark needed.     

Ok, so now we know that to indicate an estimated value, we use the hat symbol, as in $\hat{y}$.  Let's try to apply this to our formula for an error, which we said is actual value - estimated value.

In math terms, this is error = $y$ - $\hat{y}$.

And while we're at it, let's use the Greek letter epsilon, $\varepsilon$, to indicate error, so now our formula is:

$\varepsilon$ = $y$ - $\hat{y}$

Now, to indicate that we are representing something at a specific point, we can use a subscript.  For example, $\hat{y}_{x=100}$ means the estimated value of y when x = 100.  And to indicate the error at the point where x = 100, we write:

$\varepsilon _{x=100}$ = $y_{x=100}$ - $\hat{y}_{x=100}$ 

or 

$\varepsilon _{x=100} = 150 - 250 = -100$

Now, we wrote the general formula for error as $\varepsilon$ = $y$ - $\hat{y}$, but we can be a little more precise by saying that this is the error at any specific point, where $y$ and $\hat{y}$ are both taken at that same point.  This is written as: 

$\varepsilon _{i}$ = $y_{i}$ - $\hat{y}_{i}$
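As a rough sketch, the subscript $i$ maps naturally onto a loop over our data points.  The snippet below restates the lesson's data and sample line so that it runs on its own; the variable name `errors` is just an illustrative choice.

```python
shows = [
    {'x': 0, 'y': 100},
    {'x': 100, 'y': 150},
    {'x': 200, 'y': 600},
    {'x': 400, 'y': 700},
]

def sample_regression_formula(x):
    # y hat = 1.5x + 100, the sample line from the lesson
    return 1.5 * x + 100

# error_i = y_i - y_hat_i, computed once for each point i in the dataset
errors = [show['y'] - sample_regression_formula(show['x']) for show in shows]
print(errors)  # [0.0, -100.0, 200.0, 0.0]
```

The first and last errors are zero because we drew our line through exactly those two points.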

### Calculating and representing total error

So far, we saw that we can calculate the error at a given value of x by using the formula $\varepsilon$ = $y$ - $\hat{y}$.  In other words, the error at a given value of x is the actual value minus the expected value.  Now, we want to see how well our regression line describes the relation between x and y in general, not just at a given point, so let's move beyond calculating the error at a given point to describing the total error of the regression line from the actual data.  

An initial idea is simply to calculate the total error by summing the errors, $y$ - $\hat{y}$, for every point in our dataset.  

However let's take another look at our data.

![](./regression-scatter.png)

Notice that the error at x = 100 is 150 - 250 = -100, and the error at x = 200 is 600 - 400 = 200.  So adding these errors, -100 + 200 = 100, would begin to cancel them out.  To avoid that effect, we can simply square the errors, so that we are always summing non-negative numbers.
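Here is a small sketch of that cancelling effect, using the four errors of our sample line at x = 0, 100, 200 and 400.  The variable names are illustrative.

```python
# Errors (actual y - predicted y) at x = 0, 100, 200 and 400 for the line y = 1.5x + 100
errors = [0, -100, 200, 0]

# Summing the raw errors lets the negative and positive misses offset each other
total_error = sum(errors)
print(total_error)  # 100

# Squaring first makes every term non-negative, so nothing can cancel
squared_errors = [error ** 2 for error in errors]
print(squared_errors)  # [0, 10000, 40000, 0]
```

The raw total of 100 understates how far off the line is, since the individual misses were 100 and 200.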

$\varepsilon^2 = (y - \hat{y})^2$

Now, given a list of points with coordinates (x, y), we can calculate the squared error of each of the points.  We have actual x and y values at x = 0, 100, 200 and 400.  So we have:

sum of squared error = $(100 - 100)^2 + (150 - 250)^2 + (600 - 400)^2 + (700 - 700)^2$

so 

sum of squared error = $(-100)^2 + 200^2 = 10,000 + 40,000 = 50,000$
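The same sum can be sketched as a small Python function.  The name `sum_of_squared_errors` and its parameters are assumptions made for this example, not something defined earlier in the lesson.

```python
shows = [
    {'x': 0, 'y': 100},
    {'x': 100, 'y': 150},
    {'x': 200, 'y': 600},
    {'x': 400, 'y': 700},
]

def sum_of_squared_errors(points, m, b):
    """Add up (actual y - predicted y)^2 over every point, for the line y = mx + b."""
    return sum((point['y'] - (m * point['x'] + b)) ** 2 for point in points)

print(sum_of_squared_errors(shows, 1.5, 100))  # 50000.0
```

Passing `m` and `b` as arguments means we could score a different candidate line against the same data just by changing those two numbers.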

Now there is one thing a little off with our measure of error.  It's that our regression line's total error will tend to increase with each added piece of data we have.  For example, assume that we add to our dataset the actual data x = 300, y = 500.  Our regression line did a relatively good job of estimating this point, as $\varepsilon = 500 - 550 = -50$, or $\varepsilon^2 = (500 - 550)^2 = 2,500$.  But because we are now including an extra error, we add that to our sum, and it looks like we did worse. 

sum of squared error = $(-100)^2 + 200^2 + (-50)^2 = 52,500$ 
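A quick sketch of that effect, starting from the squared errors we already worked out for the original four points.  The variable names are illustrative.

```python
# Squared errors of the original four points against the line y = 1.5x + 100
squared_errors = [0, 10_000, 40_000, 0]
original_sse = sum(squared_errors)

# The added point (x = 300, y = 500) is predicted fairly well by the line...
extra_squared_error = (500 - (1.5 * 300 + 100)) ** 2
print(extra_squared_error)  # 2500.0

# ...yet including it can only push the running total up
new_sse = original_sse + extra_squared_error
print(original_sse, new_sse)  # 50000 52500.0
```

Since a squared error is never negative, the sum can never shrink as we add data, no matter how good the prediction is.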


To fix this, we can change our metric to use the squared error per point, or in other words, the average squared error.  The average of something is simply the sum divided by the number of things being summed.  Instead of the word average, we use the word mean, as it sounds fancier: 

mean squared error = (sum of $(y - \hat{y})^2$ for each data point, (x, y)) $\div$ (number of data points)

So changing our sum of squared error (with the original data) to a mean squared error, we have:

Mean Squared Error = 50,000/4 = 12,500
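As a sketch, here is the mean squared error in Python, computed for the original data and again with the extra point x = 300, y = 500 from above.  The function name `mean_squared_error` and the `predict` parameter are assumptions for this example.

```python
shows = [
    {'x': 0, 'y': 100},
    {'x': 100, 'y': 150},
    {'x': 200, 'y': 600},
    {'x': 400, 'y': 700},
]
shows_with_extra = shows + [{'x': 300, 'y': 500}]

def sample_regression_formula(x):
    return 1.5 * x + 100

def mean_squared_error(points, predict=sample_regression_formula):
    """Average of (actual y - predicted y)^2 over all of the points."""
    squared_errors = [(point['y'] - predict(point['x'])) ** 2 for point in points]
    return sum(squared_errors) / len(squared_errors)

print(mean_squared_error(shows))             # 12500.0
print(mean_squared_error(shows_with_extra))  # 10500.0
```

Notice that the well-predicted extra point now lowers our metric (52,500 / 5 = 10,500) instead of raising it, which matches our intuition better than the raw sum did.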

Ok, this is great.  So now we have one number that does a good job of representing how well our regression line fits the data.  We got there by calculating the errors, then squaring the errors so that they are never negative, and finally averaging the squared errors so that the metric does not grow simply because we added data.  We could stop here, and be fine.  However, statisticians like to undo the effect of squaring these errors, which also brings the metric back to the same units as y.  So instead of just calculating the mean squared error, we calculate the Root Mean Squared Error.  This is just taking the square root at the very end.  So 

root mean squared error = the square root of ( (sum of $(y - \hat{y})^2$ for each data point, (x, y)) $\div$ (number of data points) ) 

or with our data it is: 

Root Mean Squared Error = $\sqrt{50,000/4}$ = $\sqrt{12,500}$ $\approx$ 111.8
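Putting the steps together, a minimal sketch of the whole calculation might look like the following.  It uses only the standard library's `math.sqrt`; the variable names are illustrative.

```python
import math

# Actual data from the lesson, and the predictions of the line y = 1.5x + 100
xs = [0, 100, 200, 400]
actual_ys = [100, 150, 600, 700]
predicted_ys = [1.5 * x + 100 for x in xs]

# 1. square each error, 2. average the squares, 3. take the square root
squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(actual_ys, predicted_ys)]
mse = sum(squared_errors) / len(squared_errors)
rmse = math.sqrt(mse)

print(mse)             # 12500.0
print(round(rmse, 1))  # 111.8
```

Because of the square root, 111.8 is in the same units as our y values, so we can read it as a typical size of a miss.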

### Summary 

Before this lesson, we simply assumed that our regression lines make "good" predictions of values of $y$ for given values of $x$.  In this lesson, we aimed to find a metric to tell us how well our regression line fits our actual data.  To do this, we started by saying that the error at a given point is the actual value of y minus the expected value of y from our regression line.  Then we measured how well our regression line describes the entire dataset by squaring the errors at each point (to eliminate negative errors), adding these squared errors, and then dividing by the number of data points so as to describe how far off our regression line is on average.  This is called the Mean Squared Error.  Finally, to undo the effect of squaring each of our errors, we take the square root of the Mean Squared Error to arrive at the Root Mean Squared Error.  This is our metric for describing how well our regression line "fits" our data. 