# Evaluating Regression Lines Lab

### Introduction

In the previous lesson, we learned to evaluate how well a regression line estimated our actual data.  In this lab, we will turn these formulas into code.  In doing so, we'll build lots of useful functions for both calculating and displaying our errors for a given regression line and dataset.

### Determining Quality

In the file `movie_data.py`, you will find movie data written as a Python list of dictionaries, with each dictionary representing a movie.  The movies are the first 30 entries of the 538 dataset [provided here](https://raw.githubusercontent.com/fivethirtyeight/data/master/bechdel/movies.csv).

In [1]:
from movie_data import movies 
len(movies)

30

> Press shift + enter

In [2]:
movies[0]

{'budget': 13000000, 'domgross': 25682380.0, 'title': '21 & Over'}

Note that, as in previous lessons, the budget is our explanatory variable and the revenue is our dependent variable.  Here revenue is represented by the key `domgross`.

#### Plotting our data

Let's write the code to plot this data set.

As a first task, convert the budget values to `x_values`, and convert the domgross values to `y_values`.

In [5]:
x_values = list(map(lambda movie: movie['budget'], movies))
y_values = list(map(lambda movie: movie['domgross'], movies))

In [6]:
x_values[0] # 13000000

13000000

In [7]:
y_values[0]

25682380.0

Assign a variable called `titles` equal to the titles of the movies.

In [8]:
titles = list(map(lambda movie: movie['title'], movies))

In [9]:
titles[0]

'21 & Over'

Great! Now we have the data necessary to make a trace of our data.

In [12]:
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode(connected=True)
from graph import trace_values, plot

movies_trace = trace_values(x_values, y_values, text=titles, name='movie data')

plot([movies_trace])

#### Plotting a regression line

Now let's add a regression line to make a prediction of output (revenue) based on an input (the budget).  We'll use the following regression formula:

* $\overline{y} = \overline{m} x + \overline{b}$, with $\overline{m} = 1.7$, and $\overline{b} = 100,000$. 


* $\overline{y} = 1.7x + 100,000$

Write a function called `regression_formula` that calculates our $\overline{y}$ for any provided value of $x$. 

In [13]:
def regression_formula(x):
    return 100000 + 1.7*x

Check to see that the regression formula generates the correct outputs.

In [15]:
regression_formula(100000) # 270000.0
regression_formula(250000) # 525000.0

525000.0

Let's plot the data as well as the regression line to get a sense of what we are looking at.

In [15]:
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode(connected=True)

from graph import trace_values, m_b_trace, plot
movies_trace = trace_values(x_values, y_values, text=titles, name='movie data')

regression_trace = m_b_trace(1.7, 100000, x_values, name='estimated revenue')
plot([movies_trace, regression_trace])

### Calculating errors of a regression line

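Now that we have our regression formula, we can move towards calculating the error.  We start with a helper function, `y_actual`, which returns the actual y value recorded in the dataset for a given x value.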

In [21]:
def y_actual(x, x_values, y_values):
    combined_values = list(zip(x_values, y_values))
    point_at_x = list(filter(lambda point: point[0] == x,combined_values))[0]
    return point_at_x[1]

In [30]:
y_actual(13000000, x_values, y_values) # 25682380.0

25682380.0
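
One caveat worth keeping in mind: `y_actual` looks a point up by its x value and returns the first match, so if two movies share a budget, the second movie's revenue is never returned.  The sketch below illustrates this on a small made-up dataset (the numbers are hypothetical, not taken from `movie_data.py`), and it redefines `y_actual` only so that the block runs on its own.

```python
# Toy data (hypothetical): two points share the x value 10.
toy_x = [10, 20, 10]
toy_y = [100.0, 250.0, 130.0]

def y_actual(x, x_values, y_values):
    # Same logic as the lab's helper: return the y of the FIRST point whose x matches.
    combined_values = list(zip(x_values, y_values))
    point_at_x = list(filter(lambda point: point[0] == x, combined_values))[0]
    return point_at_x[1]

first_match = y_actual(10, toy_x, toy_y)  # the point (10, 130.0) is never reached
only_match = y_actual(20, toy_x, toy_y)
print(first_match, only_match)            # 100.0 250.0
```

Any error computed through this lookup is therefore the error of the first point at that x value.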

Write a function called `error` that, given a list of `x_values`, a list of `y_values`, the values `m` and `b` of a regression line, and a value of `x`, returns the error at that x value.  Remember ${\varepsilon_i} =  y_i - \overline{y}_i$.

In [31]:
def error(x_values, y_values, m, b, x):
    expected = (m*x + b)
    return y_actual(x, x_values, y_values) - expected

In [32]:
error(x_values, y_values, 1.7, 100000, 13000000) # 3482380.0

3482380.0
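
As a quick check of the arithmetic, the same error can be worked out by hand for the first movie, using only values already shown above (budget 13000000, domgross 25682380.0, $m = 1.7$, $b = 100000$).  The variable names here are illustrative.

```python
budget = 13000000
actual_revenue = 25682380.0

# Expected revenue from the regression line y = 1.7x + 100000.
expected_revenue = 1.7 * budget + 100000

# Error = actual value minus expected value.
error_at_budget = actual_revenue - expected_revenue
print(error_at_budget)  # 3482380.0
```

A positive error means the movie earned more than the regression line predicted.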

Now that we have a function to calculate our errors, write a function that returns a trace of an error at a given point.  The function is called `error_line_trace`; it takes our dataset of `x_values` as the first argument and `y_values` as the second argument.  It also takes in values of $m$ and $b$ as the next two arguments, to represent the regression line it calculates errors from.  Finally, its last argument is the value $x$ it is drawing an error for.

The return value is a dictionary that represents a trace, and looks like the following:  

```python
{'marker': {'color': 'red'},
 'mode': 'line',
 'name': 'error at 120000000',
 'x': [120000000, 120000000],
 'y': [93050117.0, 204100000.0]}
```

The trace draws a vertical line at the x value of the error.  Both x values of the trace equal the value of x where the error is evaluated.  The y values of the trace are the actual value and the expected value.  The mode of the trace should equal `'line'`.

In [33]:
def error_line_trace(x_values, y_values, m, b, x):
    y_hat = m*x + b
    y = y_actual(x, x_values, y_values)
    name = 'error at ' + str(x)
    return {'x': [x, x], 'y': [y, y_hat], 'mode': 'line', 'marker': {'color': 'red'}, 'name': name}

In [34]:
error_at_120m = error_line_trace(x_values, y_values, 1.7, 100000, 120000000)

# {'marker': {'color': 'red'},
#  'mode': 'line',
#  'name': 'error at 120000000',
#  'x': [120000000, 120000000],
#  'y': [93050117.0, 204100000.0]}
error_at_120m

{'marker': {'color': 'red'},
 'mode': 'line',
 'name': 'error at 120000000',
 'x': [120000000, 120000000],
 'y': [93050117.0, 204100000.0]}

We just ran our function to draw a trace of the error for the movie Elysium.  Let's see how it looks.

In [22]:
movies[19]

{'budget': 120000000, 'domgross': 93050117.0, 'title': 'Elysium'}

In [35]:
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode(connected=True)

from graph import trace_values, m_b_trace, plot
movies_trace = trace_values(x_values, y_values, text=titles, name='movie data')
regression_trace = m_b_trace(1.7, 100000, x_values, name='estimated revenue')
plot([movies_trace, regression_trace, error_at_120m])

From there, we can write a function called `error_line_traces` that takes in a list of `x_values`, a list of `y_values`, and the values `m` and `b` of a regression line, and returns a list of error traces, one for every x value provided.

In [36]:
def error_line_traces(x_values, y_values, m, b):
    return list(map(lambda x_value: error_line_trace(x_values, y_values, m, b, x_value), x_values))

In [37]:
errors_for_regression = error_line_traces(x_values, y_values, 1.7, 100000)

In [40]:
len(errors_for_regression) # 30

30

In [41]:
errors_for_regression[-1]

# {'marker': {'color': 'red'},
#  'mode': 'line',
#  'name': 'error at 30000000',
#  'x': [30000000, 30000000],
#  'y': [35266619.0, 51100000.0]}

{'marker': {'color': 'red'},
 'mode': 'line',
 'name': 'error at 30000000',
 'x': [30000000, 30000000],
 'y': [35266619.0, 51100000.0]}

In [32]:
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode(connected=True)

from graph import trace_values, m_b_trace, plot
movies_trace = trace_values(x_values, y_values, text=titles, name='movie data')
regression_trace = m_b_trace(1.7, 100000, x_values, name='estimated revenue')
plot([movies_trace, regression_trace, *errors_for_regression])

### Calculating RSS

Now write a function called `squared_error` that, given a value of x, returns the squared error at that x value.

${\varepsilon_i}^2 =  (y_i - \overline{y}_i)^2$

In [45]:
def squared_error(x_values, y_values, m, b, x):
    return error(x_values, y_values, m, b, x)**2

In [46]:
squared_error(x_values, y_values, 1.7, 100000, x_values[0]) # 12126970464400.0

12126970464400.0

Now write a function that will iterate through the x and y values to create a list of squared errors at each point, $(x_i, y_i)$ of the dataset.

In [47]:
def squared_errors(x_values, y_values, m, b):
    return list(map(lambda x: squared_error(x_values, y_values, m, b, x), x_values))

In [48]:
squared_errors(x_values, y_values, 1.7, 100000)

[12126970464400.0,
 4135150451673360.5,
 361267379491225.0,
 794537411251600.0,
 724697867965369.0,
 1.1849947361812563e+17,
 7947865497243204.0,
 26791793814241.0,
 12126970464400.0,
 2.578526293187741e+16,
 724697867965369.0,
 28038359355876.0,
 4309641785171044.0,
 7000432263889.0,
 183234585197889.0,
 250695953891161.0,
 170556704389696.0,
 5.700890907419822e+16,
 225831887511616.0,
 1.2332076514313688e+16,
 1.5715833880370482e+16,
 3916421362617124.0,
 724697867965369.0,
 8814749428288609.0,
 637050330900736.0,
 1116906426022500.0,
 1.9030233952612996e+16,
 1.33580290597636e+16,
 3147108684215409.0,
 250695953891161.0]

Next, write a function called `residual_sum_squares` that, provided a list of x_values, a list of y_values, and the values m and b of a regression line, returns the sum of the squared errors for the movies in our dataset.

In [51]:
def residual_sum_squares(x_values, y_values, m, b):
    return sum(squared_errors(x_values, y_values, m, b))

In [52]:
residual_sum_squares(x_values, y_values, 1.7, 100000) # 3.0025171100327725e+17

3.0025171100327725e+17
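
The same sum can also be computed by walking through the $(x_i, y_i)$ pairs directly instead of looking each y value up by its x value.  This is only a sketch on a tiny hypothetical dataset; `rss_pairwise` is a name introduced here for illustration, not a function from the lab.

```python
def rss_pairwise(x_values, y_values, m, b):
    # Sum of squared residuals (y - (m*x + b))**2 over the paired points.
    return sum((y - (m * x + b)) ** 2 for x, y in zip(x_values, y_values))

# Hypothetical toy data, evaluated against the line y = 2x + 1.
toy_x = [1, 2, 3]
toy_y = [3.0, 6.0, 6.0]

toy_rss = rss_pairwise(toy_x, toy_y, 2, 1)
print(toy_rss)  # residuals are 0, 1 and -1, so the sum of squares is 2.0
```

Because each y value is paired with its own x value, points that share an x value each contribute their own residual.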

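Finally, write a function called `root_mean_squared_error` that calculates the RMSE for the movies in the dataset, provided the same parameters as RSS.  Note that the solution below divides $\sqrt{RSS}$ by the number of points $n$, and the outputs that follow reflect that; the conventional RMSE, $\sqrt{RSS / n}$, is larger by a factor of $\sqrt{n}$.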

In [66]:
import math
def root_mean_squared_error(x_values, y_values, m, b):
    return math.sqrt(sum(squared_errors(x_values, y_values, m, b)))/len(x_values)

In [67]:
root_mean_squared_error(x_values, y_values, 1.7, 100000)

18265076.29948103
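
A minimal sketch of the conventional form on hypothetical toy data, where the residuals are easy to check by hand.  `rmse_conventional` is an illustrative name introduced here, not a function from the lab; it takes the square root of the mean squared error, $\sqrt{RSS / n}$, rather than dividing $\sqrt{RSS}$ by $n$ as the function above does.

```python
import math

def rmse_conventional(x_values, y_values, m, b):
    # Square root of the mean squared residual: sqrt(RSS / n).
    squared = [(y - (m * x + b)) ** 2 for x, y in zip(x_values, y_values)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical toy data: residuals against y = 2x + 1 are 3, -3, 3, -3.
toy_x = [0, 1, 2, 3]
toy_y = [4.0, 0.0, 8.0, 4.0]

toy_rmse = rmse_conventional(toy_x, toy_y, 2, 1)
print(toy_rmse)  # RSS is 36 and n is 4, so sqrt(36 / 4) = 3.0
```

On this toy data, dividing $\sqrt{RSS}$ by $n$ would give $6 / 4 = 1.5$ instead, so the two forms differ by a factor of $\sqrt{n} = 2$.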

#### Some functions for your understanding

Now we'll provide a couple of functions for you.  Note that we can represent multiple regression lines by a list of (m, b) pairs:

In [68]:
regression_lines = [(1.7, 100000), (1.9, 200000)]

We can then return a list of the regression lines along with the associated RMSE of each.

In [74]:
def root_mean_squared_errors(x_values, y_values, regression_lines):
    errors = []
    for regression_line in regression_lines:
        error = root_mean_squared_error(x_values, y_values, regression_line[0], regression_line[1])
        errors.append([regression_line[0], regression_line[1], round(error, 0)])
    return errors

Now let's generate the RMSE values for each of these lines.

In [75]:
root_mean_squared_errors(x_values, y_values, regression_lines)

[[1.7, 100000, 18265076.0], [1.9, 200000, 20121368.0]]
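
Since each entry has the form `[m, b, error]`, the line with the smallest error can be picked out with `min` and a key function.  The sketch below reuses the two results shown above as literal values; `results` and `best_line` are illustrative names.

```python
# The two [m, b, error] entries returned above, copied as literals.
results = [[1.7, 100000, 18265076.0], [1.9, 200000, 20121368.0]]

# Choose the entry whose last element (the error) is smallest.
best_line = min(results, key=lambda entry: entry[-1])
print(best_line)  # [1.7, 100000, 18265076.0]
```

Of the two candidate lines, the one with $m = 1.7$ and $b = 100000$ has the lower error on this dataset.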

Next, we'll provide a couple of plotting functions for you:
* a function called `trace_rmse` that builds a bar chart trace displaying the RMSE of each regression line.  The return value is a dictionary with keys of `x` and `y`, both of which point to lists.  The `x` key points to a list of labels, one per regression line, each giving that line's m and b values.  The `y` key points to a list of the corresponding RMSE values.

In [81]:
import plotly.graph_objs as go

def trace_rmse(x_values, y_values, regression_lines):
    errors = root_mean_squared_errors(x_values, y_values, regression_lines)
    x_values_bar = list(map(lambda error: 'm: ' + str(error[0]) + ' b: ' + str(error[1]),errors))
    y_values_bar = list(map(lambda error: error[-1], errors))
    return dict(
        x=x_values_bar,
        y=y_values_bar,
        type='bar'
    )

trace_rmse(x_values, y_values, regression_lines)

{'type': 'bar',
 'x': ['m: 1.7 b: 100000', 'm: 1.9 b: 200000'],
 'y': [18265076.0, 20121368.0]}

Once this is built, we can build a subplot showing the regression lines, as well as the related RMSE for each regression line.

In [86]:
import plotly
from plotly import tools
import plotly.graph_objs as go

def plot_regression_and_rss(scatter_trace, regression_traces, rss_calc_trace):
    fig = tools.make_subplots(rows=1, cols=2)
    for reg_trace in regression_traces:
        fig.append_trace(reg_trace, 1, 1)
    fig.append_trace(scatter_trace, 1, 1)
    fig.append_trace(rss_calc_trace, 1, 2)
    plotly.offline.iplot(fig)

In [94]:

### add more regression lines here, by adding new elements to the list
regression_lines = [(1.7, 100000), (1.9, 200000)]

regression_traces = list(map(lambda line: m_b_trace(line[0], line[1], x_values, name='m: ' + str(line[0]) + ' b: ' + str(line[1])), regression_lines))

scatter_trace = trace_values(x_values, y_values, text=titles, name='movie data')
rmse_calc_trace = trace_rmse(x_values, y_values, regression_lines)

plot_regression_and_rss(scatter_trace, regression_traces, rmse_calc_trace)

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]

