# Evaluating regression lines

### Introduction

So far we have seen that a regression line estimates our values of $y$.  That is useful because it lets us estimate an output given an input.  In our example, we can estimate a movie's expected revenue given its budget.  A regression line is described by two parameters: $m$, which represents the slope of the line, and $b$, which is the value of $y$ when $x$ is zero.

So far we have been rather fast and loose with choosing our regression line.  We have either used data where our regression line perfectly matched our data, or have simply used the eyeball test to see that our regression line made sense.  Well today we're going further.  In this lesson, we'll learn how to evaluate the accuracy of a regression line, and how to use a technique to choose a better, or even "best fit" regression line.

### Determining Quality

To measure the accuracy of a regression line, we see how closely our regression line matches the data we have.  Let's find out what this means.

In [2]:
first_show = {'x': 0, 'y': 100}
second_show = {'x': 100, 'y': 150}
third_show = {'x': 200, 'y': 600}
fourth_show = {'x': 400, 'y': 700}

shows = [first_show, second_show, third_show, fourth_show]

For now, let's set a roughshod regression line simply by drawing a line between our first and last points.  Eventually, we'll improve this regression line, so it's ok that we don't use a rigorous technique right now.

In [3]:
# 1. We write a function to calculate the slope between two points

def slope_between_two_points(first_point, second_point):
    return (second_point['y'] - first_point['y'])/(second_point['x'] - first_point['x'])

# 2. Then apply this to our first and last points.  
slope_between_two_points(shows[0], shows[3]) # 1.5
    # So we'll use m = 1.5

# 3. we have a point for where x = 0, so that can be our y-intercept 
def y_intercept(points):
    # search the `points` argument (not the global `shows`) for the point where x is 0
    point_at_zero = list(filter(lambda point: point['x'] == 0, points))[0]
    return point_at_zero['y']

y_intercept(shows) # b = 100 


# plugging m and b into our formula y = mx + b we get the following sample_regression_formula
# m = 1.5
# b = 100
def sample_regression_formula(x):
    return 1.5 * x + 100

Ok, so now that we have our sample regression formula, we can see it in action.  We draw a chart that displays our sample regression line (by simply drawing a blue line between the data points with the lowest and highest $x$ values) and then show with red lines where our regression line does not match up to the data.

![](./regression-scatter.png)

Take a look at that first red line.  It's showing that our regression formula does not perfectly predict our data.  More concretely, at that spot, $x = 100$, our regression line is predicting that the value of $y$ will be 250.  So our regression line is off.  Even if we did not have our graph, we could see this by using our formula for the regression line, $y = 1.5x + 100$.  Setting $x$ equal to 100, we see that $y = 1.5 \cdot 100 + 100 = 250$.

However, the point below our regression line shows the actual value of $y$ when $x = 100$, which is 150.  In other words, our regression line does not perfectly predict our data.

Each point where our regression line misses in its prediction is called an error.  So that is what the red lines are indicating: the distance between the prediction made by our regression line and the actual data.  In other words, the red lines visually display the size of each error.

Now let's measure the size of that error.  We can say that our error, at point $x = 100$, is the actual $y$ minus expected $y$, which translates to $150 - 250 = -100$. 
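As a rough sketch of that calculation in code, assuming our sample line $y = 1.5x + 100$ (the helper names `predicted_y` and `error_at` are made up here for illustration):

```python
# Sketch only: `predicted_y` and `error_at` are hypothetical helper names,
# and the defaults m = 1.5, b = 100 come from our sample regression line.
def predicted_y(x, m=1.5, b=100):
    """Return the regression line's prediction for a given x."""
    return m * x + b

def error_at(point, m=1.5, b=100):
    """Return actual y minus predicted y for a single data point."""
    return point['y'] - predicted_y(point['x'], m, b)

second_show = {'x': 100, 'y': 150}

print(predicted_y(100))       # 250.0
print(error_at(second_show))  # -100.0
```

The negative sign tells us the actual value sits below the line at that point.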

### Refining our Terms

We used a lot of words in the section above.  Hopefully, by now you are beginning to believe that mathematical notation can help us speak about concepts with precision and clarity.

One thing that's confusing about the above section is that we are really now talking about two things: our predicted $y$ values and our actual $y$ values.  So far we have defined our regression function as $y = mx + b$, where for a given value of $x$ we can calculate the value of $y$.  However, this value of $y$ is not the actual value of $y$, but just an estimation.  So let's indicate this by changing our function to look like the following:

$\overline{y} = \overline{m}x + \overline{b}$ 

Those little dashes over the $y$, $m$ and $b$ are called hats, so this is read as y-hat equals m-hat times x plus b-hat.  These hats indicate that this formula does not give us the actual value of $y$, but simply our estimated value of $y$.  And that this predicted value of $y$ is based on our predicted values of $m$ and $b$. 
> Note that $x$ is not a predicted value.  Why is this?  Well, we are *providing* a value $x$, which in this case represents our movie budget, not predicting it.  So we are *providing* a value of $x$ and asking it to *predict* a value of $y$.  

Now remember that we were given some real data as well.  This means that we have actual points for $x$ and $y$, which look like the following.

In [5]:
first_show = {'x': 0, 'y': 100}
second_show = {'x': 100, 'y': 150}
third_show = {'x': 200, 'y': 600}
fourth_show = {'x': 400, 'y': 700}

shows = [first_show, second_show, third_show, fourth_show]

So how do we represent our actual values of $y$? Here's how: $y$.  No extra ink is needed.

Ok, so now we know the following:  
 * **$y$**: actual y  
 * **$\overline{y}$**: estimated y
 
Now, using the Greek letter $\varepsilon$, epsilon, to indicate error, we can say $\varepsilon = y - \overline{y}$.  That is the general formula for error.  However, we can be a little more precise by saying we are talking about the error at a specific point, where $y$ and $\overline{y}$ are at that same point.  This is written as: 

$\varepsilon _{i}$ = $y_{i}$ - $\overline{y}_{i}$

So given our dataset and our regression line, we can represent our error when $x = 0$ as $\varepsilon_{x=0} = y_{x=0} - \overline{y}_{x=0} = 100 - 100 = 0$.
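The same formula can be applied at every point in our dataset.  Here is a minimal sketch, assuming the line $1.5x + 100$ from earlier; the function name `y_hat` is just an illustrative choice:

```python
shows = [
    {'x': 0, 'y': 100},
    {'x': 100, 'y': 150},
    {'x': 200, 'y': 600},
    {'x': 400, 'y': 700},
]

# Hypothetical helper: the estimated y for a given x, using our sample m and b
def y_hat(x, m=1.5, b=100):
    return m * x + b

# error_i = actual y_i minus estimated y_i, one entry per show
errors = [show['y'] - y_hat(show['x']) for show in shows]

print(errors)  # [0.0, -100.0, 200.0, 0.0]
```

The first and last errors are zero because we drew our line through exactly those two points.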

### Calculating and representing total error

So far, we saw that we can calculate the error at a given value of $x$, $x_i$, by using the formula $\varepsilon_i = y_i - \overline{y_i}$.  This is helpful for describing how well our regression line predicts the value of $y$ at a specific point.  

However, we want to see how well our regression line describes the relation between $x$ and $y$ in general, not just at a given point.  So let's move beyond calculating the error at a given point to describing the total error of the regression line from the actual data.  As an initial approach, we simply calculate the total error by summing the errors, $y - \overline{y}$, for every point in our dataset.  

Total Error = $\sum_{i=1}^{n} (y_i - \overline{y_i})$

However let's take another look at our data.

![](./regression-scatter.png)

Take a look at what happens if we add the errors at $x = 100$ and $x = 200$. 

* $\varepsilon_{x=100}= 150 - 250 = -100$
* $\varepsilon_{x=200} = 600 - 400 = 200$  
* $\varepsilon_{x=100} + \varepsilon_{x=200} = -100 + 200 = 100$

So because $\varepsilon_{x=100}$ is negative while $\varepsilon_{x=200}$ is positive, adding the two errors begins to cancel them out.  That's not what we want.  To represent our total error better, we can simply square the errors, so that we are always summing non-negative numbers.

${\varepsilon_i^2}$ = $({y_i - \overline{y_i}})^2$
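To see the cancellation concretely, here is a small sketch that compares the plain sum of the errors with the squared errors.  It assumes the four errors we get from the line $1.5x + 100$ on our shows data:

```python
# Errors (actual y minus estimated y) at x = 0, 100, 200, 400,
# assuming the sample regression line 1.5x + 100
errors = [0, -100, 200, 0]

# Summing signed errors lets the negative and positive misses offset each other
total_error = sum(errors)

# Squaring first makes every term non-negative, so nothing cancels
squared_errors = [error ** 2 for error in errors]

print(total_error)     # 100
print(squared_errors)  # [0, 10000, 40000, 0]
```

The line misses by 100 and by 200, yet the signed total is only 100, which understates how far off the line is.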

Now given a list of points with coordinates $(x, y)$, we can calculate the squared error of each point and sum them up.  This is called our **residual sum of squares** (RSS).  Using our sigma notation, our formula for RSS looks like: 

RSS $ = \sum_{i = 1}^n ({y_i - \overline{y_i}})^2$



So let's apply this to our example.  In our example, we have actual $x$ and $y$ values at the following points: $(0, 100), (100, 150), (200, 600), (400, 700)$.  And we can calculate the values of $\overline{y}$ as $\overline{y} = 1.5x + 100$ for each of those four points.  So this gives us:

RSS = $(100 - 100)^2 + (150 - 250)^2 + (600 - 400)^2 + (700 - 700)^2$ 

which reduces to  

$(-100)^2 + 200^2 = 10,000 + 40,000 = 50,000$
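We can check this arithmetic with a short function.  This is only a sketch: the name `rss` and its signature are our own choice here, and it assumes points stored as dictionaries with `'x'` and `'y'` keys, like our shows.

```python
shows = [
    {'x': 0, 'y': 100},
    {'x': 100, 'y': 150},
    {'x': 200, 'y': 600},
    {'x': 400, 'y': 700},
]

# Hypothetical helper: residual sum of squares for the line y = m*x + b
def rss(points, m, b):
    # square each error (actual y minus estimated y), then add them all up
    return sum((point['y'] - (m * point['x'] + b)) ** 2 for point in points)

print(rss(shows, 1.5, 100))  # 50000.0
```

Because `m` and `b` are arguments, the same function can score any other candidate line against the same data.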

Ok, this is great.  So now we have one number, RSS, that does a good job of representing how well our regression line fits the data.  We got there by calculating the errors at each of our provided points, and then squaring the errors so that our errors are always positive.

### Summary 

Before this lesson, we simply assumed that our regression lines make "good" predictions of $y$ for given values of $x$.  In this lesson, we aimed to find a metric to tell us how well our regression line fits our actual data.  To do this, we started by looking at the error at a given point, and describing that as the actual value of $y$ minus the value of $y$ expected from our regression line.  Then we saw how well our regression line describes the entire dataset by squaring the errors at each point (to eliminate negative errors), and adding these squared errors.  This is called the Residual Sum of Squares (RSS).  This is our metric for describing how well our regression line "fits" our data. 