# Evaluating Regression Lines Lab

### Introduction

In the previous lab, we learned to evaluate how well a regression line estimated our actual data.  In this lab, we will turn these formulas into code.

### Determining Quality

In the file `movie_data.py`, you will find movie data written as a Python list of dictionaries, with each dictionary representing a movie. The movies are the first 30 entries of the FiveThirtyEight dataset [provided here](https://raw.githubusercontent.com/fivethirtyeight/data/master/bechdel/movies.csv).

In [2]:
from movie_data import movies 
len(movies)

30

In [3]:
movies[0]

{'budget': 13000000, 'domgross': 25682380.0, 'title': '21 & Over'}

### Calculating errors of a regression line

For the data set, let's use the following regression formula: $\overline{y} = \overline{m} x + \overline{b}$, with $\overline{m} = 1.7$ and $\overline{b} = 100,000$. Write a function called `regression_formula` that calculates $\overline{y}$ for any provided value of $x$.

In [4]:
def regression_formula(x):
    return 100000 + 1.7*x

In [5]:
regression_formula(100000) # 270000.0
regression_formula(250000) # 525000.0

525000.0
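As a minimal sketch, the same formula can be written with the slope and intercept as parameters, which makes it easier to try other lines later. The helper name `make_regression_formula` is hypothetical and not part of the lab.

```python
def make_regression_formula(m, b):
    """Return a function that computes m * x + b (hypothetical helper)."""
    def formula(x):
        return m * x + b
    return formula

# Same slope and intercept as the lab's regression_formula.
line = make_regression_formula(1.7, 100000)
print(line(0))        # the intercept alone
print(line(1000000))  # slope times x, plus the intercept
```

With `x = 0` the function returns only the intercept, which is a quick way to check that the slope and intercept have not been swapped.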

Now that we have our regression formula, let's begin writing a function that, given a list of points and a specific x value, calculates the error at that point. We can start with a function `y` that, given `x` and a list of points, returns the value of `domgross` for the point whose `budget` equals `x`.

In [6]:
def y(x, points):
    # Search the list passed in as `points`, not the global `movies`
    point_at_x = list(filter(lambda point: point['budget'] == x, points))[0]
    return point_at_x['domgross']

In [7]:
y(13000000, movies) # 25682380.0

25682380.0
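One thing to keep in mind is that a lookup by budget returns the first matching point, so two movies with the same budget are indistinguishable, and a budget that is absent from the list has no match at all. The sketch below illustrates this on made-up data; `sample_points` and `y_first_match` are hypothetical names, and returning `None` for a missing budget is an assumption rather than the lab's behavior.

```python
# Hypothetical sample data, in the same shape as the lab's `movies` list.
sample_points = [
    {'title': 'A', 'budget': 10, 'domgross': 25.0},
    {'title': 'B', 'budget': 20, 'domgross': 30.0},
    {'title': 'C', 'budget': 10, 'domgross': 99.0},  # same budget as 'A'
]

def y_first_match(x, points):
    """Return domgross of the first point whose budget equals x, or None."""
    match = next((point for point in points if point['budget'] == x), None)
    return match['domgross'] if match is not None else None

print(y_first_match(10, sample_points))   # 25.0 (movie 'A', not 'C')
print(y_first_match(20, sample_points))   # 30.0
print(y_first_match(999, sample_points))  # None (no such budget)
```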

Now write a function called `error` that, given a value of x and a list of movies, returns the error at that x value. Remember that $\varepsilon_i = y_i - \overline{y}_i$.

In [8]:
def error(x, movies):
    return (y(x, movies) - regression_formula(x))

In [9]:
error(13000000, movies) # 3482380.0

3482380.0
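To see where this number comes from, the arithmetic can be written out by hand, using the first movie's budget of 13,000,000 and domestic gross of 25,682,380:

$$\overline{y} = 1.7 \cdot 13{,}000{,}000 + 100{,}000 = 22{,}200{,}000$$

$$\varepsilon = y - \overline{y} = 25{,}682{,}380 - 22{,}200{,}000 = 3{,}482{,}380$$

The error is positive, so the movie earned more than the line predicts.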

Now write a function called `squared_error` that, given a value of x and a list of movies, returns the squared error at that x value:

${\varepsilon_i}^2 =  (y_i - \overline{y}_i)^2$

In [10]:
def squared_error(x, movies):
    return (y(x, movies) - regression_formula(x))**2

In [11]:
squared_error(movies[0]['budget'], movies) # 12126970464400.0

12126970464400.0

Next, write a function called `average_squared_error` that, provided a list of points, returns the average squared error per movie in our dataset.

In [12]:
def squared_errors(points):
    return list(map(lambda point: squared_error(point['budget'], points), points))

squared_errors(movies) 

def average_squared_error(points):
    return sum(squared_errors(points))/len(points)

In [15]:
average_squared_error(movies) # 1.0008390366775908e+16

1.0008390366775908e+16

Let's finish with the root mean squared error, which takes the square root of the average squared error and so brings the result back to the same units as `domgross`. Write a function that, given a list of movies, returns the root mean squared error.

In [16]:
import math
def root_mean_squared_error(movies):
    return math.sqrt(average_squared_error(movies))

In [17]:
root_mean_squared_error(movies) # 100041943.03778745

100041943.03778745
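As a rough end-to-end check, the whole calculation can be run on a tiny made-up dataset where the answer is easy to verify by hand. Everything in this sketch (`toy_points`, `toy_formula`, `rmse`, and the line y = 2x + 1) is hypothetical and chosen only so the arithmetic comes out cleanly.

```python
import math

# Hypothetical toy data; the line y = 2x + 1 predicts 3, 5, 7, 9.
toy_points = [
    {'budget': 1, 'domgross': 4.0},   # error  1
    {'budget': 2, 'domgross': 4.0},   # error -1
    {'budget': 3, 'domgross': 14.0},  # error  7
    {'budget': 4, 'domgross': 2.0},   # error -7
]

def toy_formula(x):
    return 2 * x + 1

def errors(points, formula):
    """Actual value minus predicted value, for each point."""
    return [p['domgross'] - formula(p['budget']) for p in points]

def rmse(points, formula):
    """Square root of the mean of the squared errors."""
    squared = [e ** 2 for e in errors(points, formula)]
    return math.sqrt(sum(squared) / len(squared))

toy_errors = errors(toy_points, toy_formula)
print(sum(toy_errors) / len(toy_errors))  # 0.0, positive and negative errors cancel
print(rmse(toy_points, toy_formula))      # 5.0, sqrt((1 + 1 + 49 + 49) / 4)
```

The plain average of the errors is zero even though the line misses every point, which is why the errors are squared before averaging. Taking the square root afterwards returns the result to the units of the original values.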