# Evaluating Regression Lines Lab

### Introduction

In the previous lesson, we learned to evaluate how well a regression line estimated our actual data.  In this lab, we will turn these formulas into code.

### Determining Quality

In the file `movie_data.py`, you will find movie data written as a Python list of dictionaries, with each dictionary representing a movie.  The movies are derived from the first 30 entries of the FiveThirtyEight dataset [provided here](https://raw.githubusercontent.com/fivethirtyeight/data/master/bechdel/movies.csv).

In [2]:
from movie_data import movies 
len(movies)

30

In [3]:
movies[0]

{'budget': 13000000, 'domgross': 25682380.0, 'title': '21 & Over'}

Note that, as in previous lessons, we are still using budget as our explanatory value (`x_values`) and revenue as our output.  Here revenue is stored under the key `domgross`.  First, collect the budget values into `x_values` and the `domgross` values into `y_values`.

In [5]:
x_values = list(map(lambda movie: movie['budget'], movies))
y_values = list(map(lambda movie: movie['domgross'], movies))

In [6]:
x_values[0] # 13000000

13000000

In [7]:
y_values[0]

25682380.0
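The same extraction can also be written with list comprehensions instead of `map` and `lambda`. The sketch below uses a small hypothetical list, `sample_movies`, so that it runs without `movie_data.py`; the names `sample_x` and `sample_y` are made up for this example.

```python
# Hypothetical sample in the same shape as `movies`; only the first entry
# comes from the lab output above, the second is made up for illustration.
sample_movies = [
    {'budget': 13000000, 'domgross': 25682380.0, 'title': '21 & Over'},
    {'budget': 1000000, 'domgross': 2500000.0, 'title': 'Made-Up Movie'},
]

# One comprehension per key, keeping the original order of the movies
sample_x = [movie['budget'] for movie in sample_movies]
sample_y = [movie['domgross'] for movie in sample_movies]

print(sample_x)  # [13000000, 1000000]
print(sample_y)  # [25682380.0, 2500000.0]
```

Both forms produce the same lists, so either can be used in the rest of the lab.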

### Calculating Errors of a Regression Line

For the dataset, let's use the following regression formula: $\overline{y} = \overline{m} x + \overline{b}$, with $\overline{m} = 1.7$ and $\overline{b} = 100,000$.  Write a function called `regression_formula` that calculates our $\overline{y}$ for any provided value of $x$.

In [4]:
def regression_formula(x):
    return 100000 + 1.7*x

In [5]:
regression_formula(100000) # 270000.0
regression_formula(250000) # 525000.0

525000.0

Now that we have our regression formula, we can move toward calculating the error.  Let's start by writing a function named `y_actual` that, given a value $x$ and lists of `x_values` and `y_values` representing our dataset, returns the actual value of $y$ at $x$.

In [14]:
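def y_actual(x, x_values, y_values):
    # Pair each x with its y, then keep only the points whose x matches
    combined_values = list(zip(x_values, y_values))
    matching_points = list(filter(lambda point: point[0] == x, combined_values))
    # If several points share this x, the first one is returned
    return matching_points[0][1]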

In [16]:
y_actual(13000000, x_values, y_values) # 25682380.0

25682380.0

Now write a function called `error` that, given a list of `x_values`, a list of `y_values`, the values `m` and `b` of a regression line, and a value of `x`, returns the error at that x value.  Remember ${\varepsilon_i} =  y_i - \overline{y}_i$.

In [18]:
def error(x_values, y_values, m, b, x):
    expected = (m*x + b)
    return (y_actual(x, x_values, y_values) - expected)

In [19]:
error(x_values, y_values, 1.7, 100000, 13000000) # 3482380.0

3482380.0
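As a check on the output above, the arithmetic for the first movie (budget 13,000,000 and `domgross` 25,682,380) works out as follows, with $\overline{m} = 1.7$ and $\overline{b} = 100,000$:

$\overline{y} = 1.7 \cdot 13,000,000 + 100,000 = 22,200,000$

$\varepsilon = 25,682,380 - 22,200,000 = 3,482,380$

The error is positive, so this movie's actual revenue sits above the regression line.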


Now write a function called `squared_error` that, given the same arguments as `error`, returns the squared error at that x value.

${\varepsilon_i}^2 =  (y_i - \overline{y}_i)^2$

In [30]:
def squared_error(x_values, y_values, m, b, x):
    return error(x_values, y_values, m, b, x)**2

In [33]:
squared_error(x_values, y_values, 1.7, 100000, x_values[0]) # 12126970464400.0

12126970464400.0
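To see why the errors are squared before they are totaled, consider a small sketch with two hypothetical errors (these values are made up and do not come from the movie data): one point above the line and one the same distance below it.

```python
# Hypothetical errors for two points: one above the line, one below it
errors = [3.0, -3.0]

# Raw errors cancel each other out, hiding how far off the line is
raw_total = sum(errors)

# Squared errors are never negative, so they accumulate instead
squared_total = sum(e ** 2 for e in errors)

print(raw_total)      # 0.0
print(squared_total)  # 18.0
```

Squaring keeps positive and negative errors from offsetting each other when they are summed.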

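Next, write a function called `residual_sum_squares` that, given lists of `x_values` and `y_values` and the values `m` and `b` of a regression line, returns the total squared error for the movies in our dataset.  It helps to first write a helper function, `squared_errors`, that returns the list of squared errors, one per movie.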

In [34]:
def squared_errors(x_values, y_values, m, b):
    return list(map(lambda x: squared_error(x_values, y_values, m, b, x), x_values))

In [37]:
squared_errors(x_values, y_values, 1.7, 100000)

[12126970464400.0,
 4135150451673360.5,
 361267379491225.0,
 794537411251600.0,
 724697867965369.0,
 1.1849947361812563e+17,
 7947865497243204.0,
 26791793814241.0,
 12126970464400.0,
 2.578526293187741e+16,
 724697867965369.0,
 28038359355876.0,
 4309641785171044.0,
 7000432263889.0,
 183234585197889.0,
 250695953891161.0,
 170556704389696.0,
 5.700890907419822e+16,
 225831887511616.0,
 1.2332076514313688e+16,
 1.5715833880370482e+16,
 3916421362617124.0,
 724697867965369.0,
 8814749428288609.0,
 637050330900736.0,
 1116906426022500.0,
 1.9030233952612996e+16,
 1.33580290597636e+16,
 3147108684215409.0,
 250695953891161.0]

In [36]:
def residual_sum_squares(x_values, y_values, m, b):
    return sum(squared_errors(x_values, y_values, m, b))

residual_sum_squares(x_values, y_values, 1.7, 100000)

3.0025171100327725e+17
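One caveat: because `y_actual` looks up a point by its x value and returns the first match, two movies that share a budget would both be scored against the first movie's revenue. The following is a minimal alternative sketch, not part of the lab's solution, that pairs each x with its own y by position using `zip`. The function name `residual_sum_squares_paired` and the toy data are hypothetical.

```python
def residual_sum_squares_paired(x_values, y_values, m, b):
    """Sum of squared errors, pairing each x with its own y by position."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(x_values, y_values))

# Hypothetical toy data: two points share x = 2 but have different y values
toy_x = [1, 2, 2]
toy_y = [4, 5, 7]

# The line y = 2x + 1 predicts 3, 5, 5, so the errors are 1, 0, 2
toy_rss = residual_sum_squares_paired(toy_x, toy_y, 2, 1)
print(toy_rss)  # 5
```

With a lookup by x value, the third toy point would be compared against y = 5 rather than its own y = 7, so its error would be missed.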