# Evaluating Regression Lines Lab

### Introduction

In this lesson, let's put our knowledge about data science so far to the test.  And we'll do so with our real live movie data.

### Determining Quality

First, let's get some movies from the 538 dataset [provided here](https://raw.githubusercontent.com/fivethirtyeight/data/master/bechdel/movies.csv).

In [1]:
import pandas

def parse_file(fileName):
    movies_df = pandas.read_csv(fileName)
    return movies_df.to_dict('records')

movies = parse_file('https://raw.githubusercontent.com/fivethirtyeight/data/master/bechdel/movies.csv')


In [2]:
len(movies)

1794

And let's start looking at our data by examining the first entry in our dataset.

In [3]:
movies[0]

{'binary': 'FAIL',
 'budget': 13000000,
 'budget_2013$': 13000000,
 'clean_test': 'notalk',
 'code': '2013FAIL',
 'decade code': 1.0,
 'domgross': 25682380.0,
 'domgross_2013$': 25682380.0,
 'imdb': 'tt1711425',
 'intgross': 42195766.0,
 'intgross_2013$': 42195766.0,
 'period code': 1.0,
 'test': 'notalk',
 'title': '21 &amp; Over',
 'year': 2013}

So there, we can see what data is available about the movies in the dataset.  The `budget_2013$` seems to be the budget adjusted for inflation to be in 2013 dollars, and `domgross_2013$` seems to be the equivalent for domestic revenue, as `intgross_2013$` is the equivalent for 2013 international revenue.

Let's get started by simply plotting our dataset using Plotly to see how much money a movie makes domestically in 2013 dollars given a budget in 2013 dollars.  Set the text of the trace equal to a list of the movie titles, so that we can see which movie is associated with each point.

In [4]:
budgets = list(map(lambda movie: movie['budget_2013$'], movies))
domestic_revenues = list(map(lambda movie: movie['domgross_2013$'], movies))
titles = list(map(lambda movie: movie['title'], movies))

In [5]:
from graph import trace_values, plot
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode(connected=True)
revenues_per_budgets = trace_values(budgets, domestic_revenues, text = titles)
plot([revenues_per_budgets])

Look at that datapoint was at the top well over the 1.5 billion mark.  What movie is that?

Write a function called highest_domestic_gross, that finds the highest grossing movie given a list of movies.

In [6]:
def highest_domestic_gross(movies):
    return max(movies, key=lambda movie: movie['domgross_2013$'])

In [7]:
highest_domestic_gross(movies)

{'binary': 'FAIL',
 'budget': 11000000,
 'budget_2013$': 42274609,
 'clean_test': 'notalk',
 'code': '1977FAIL',
 'decade code': nan,
 'domgross': 460998007.0,
 'domgross_2013$': 1771682790.0,
 'imdb': 'tt0076759',
 'intgross': 797900000.0,
 'intgross_2013$': 3066446442.0,
 'period code': nan,
 'test': 'notalk',
 'title': 'Star Wars',
 'year': 1977}

Huh, well we should've known.  Let's now zoom in on our dataset so that our plot no longer expands for just a few of the outliers.  Set the x-axis of our plot to go from zero to 300 million dollars, and the y-axis of our plot to go from zero to one billion dollars.

In [167]:
from graph import build_layout
revenues_per_budgets = trace_values(budgets, domestic_revenues, text = titles)
revenues_layout = build_layout(x_range = [0, 300000000], y_range = [0, 1000000000])
plot([revenues_per_budgets], revenues_layout)

Ok, well at least we now have a closer look at our data.  And we're still seeing Titanic up in the top right there. 

Ok, now imagine that we hired an outside consultant to predict revenue for us.  And the consultant told us the following: $R(x) = 3*budget - 10000$ where $x$ is a movie's budget in 2013 dollars, and R(x) is the expected revenue in 2013 dollars.  Write a function called `outside_consultant_predicted_revenue` that provided a budget returns the expected revenue according to the outside consultant's formula. 

In [75]:
def outside_consultant_predicted_revenue(budget):
    return 3*budget - 1000000

Let's plot the consultant estimated revenue to see visually if his estimates line up.

In [168]:
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode(connected=True)

from graph import trace_values, m_b_trace, plot


consultant_estimated_revenues = list(map(lambda budget: outside_consultant_predicted_revenue(budget),budgets))
movies_trace = trace_values(budgets, consultant_estimated_revenues, mode='line', name = 'consultant estimate')
plot([revenues_per_budgets, movies_trace], revenues_layout)

Overall, they don't look so bad.  Of course, we can calculate the RSS to place a number to how accurate his model really is.  Let's write a method called `error_for_consultant_model` which takes in a bugdet of a movie in our dataset, and returns the difference between the movie's gross domestic revenue in 2013 dollars, and the prediction by the consultant's model.

In [77]:
def error_for_consultant_model(movie):
    expected = outside_consultant_predicted_revenue(movie['budget_2013$'])
    return movie['budget_2013$'] - expected

In [78]:
american_hustle = movies[10]
error_for_consultant_model(american_hustle)

-79000000

Once haven written a formula that calculates the error for the consultant's model provided a budget, now write a method that calculates the RSS for the consulant's model.  When we move onto compare our consultant's model with others, we'll then have a metric to compare.

In [79]:
def rss_consultant(movies):
    return sum(list(map(lambda movie: error_for_consultant_model(movie)**2, movies)))

In [80]:
rss_consultant(movies)

43310629791499954384

Ok, we'll find out if this number is any good later, but for right now let's just say that our RSS is good enough.  Use the derivative to write a function that provided a budget, returns the expected revenue increase of spending a little more than that budget.  Remember that our consultant's model is $R(x) = 3x - 1000000$ where $x$ is a budget, and $R(x)$ is an expected revenue.

In [81]:
def consultant_spend_a_little_more(budget):
    return 3

In [82]:
consultant_spend_a_little_more(100000000)

3

In [84]:
movies[0]

{'binary': 'FAIL',
 'budget': 13000000,
 'budget_2013$': 13000000,
 'clean_test': 'notalk',
 'code': '2013FAIL',
 'decade code': 1.0,
 'domgross': 25682380.0,
 'domgross_2013$': 25682380.0,
 'imdb': 'tt1711425',
 'intgross': 42195766.0,
 'intgross_2013$': 42195766.0,
 'period code': 1.0,
 'test': 'notalk',
 'title': '21 &amp; Over',
 'year': 2013}

### A new model

A burgeoning data scientist in your company wants to take a crack at his own model for predicting a movie's revenue.  He notices, that in general, movies tend to make more money per year.

In [92]:
from graph import build_layout
years = list(map(lambda movie: movie['year'],movies))
years_and_revenues = trace_values(years, domestic_revenues, text = titles)
years_layout = build_layout(y_range = [0, 1000000000])
plot([years_and_revenues], years_layout)

So he comes up with a new model, to indicate that on average movies make 2 million more per year.  

In [195]:
def revenue_with_year(budget, year):
    return 1.4*budget + 1000000*(year - 1965)

Let's plot these estimated budgets side by side of the actual revenues and budgets.  

In [196]:
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode(connected=True)

from graph import trace_values, m_b_trace, plot

budgets = list(map(lambda movie: movie['budget_2013$'], movies))
domestic_revenues = list(map(lambda movie: movie['domgross_2013$'], movies))
titles = list(map(lambda movie: movie['title'], movies))

consultant_estimated_revenues = list(map(lambda movie: revenue_with_year(movie['domgross_2013$'], movie['year']),movies))
movies_trace = trace_values(budgets, consultant_estimated_revenues, mode='markers', name = 'consultant estimate')
plot([revenues_per_budgets, movies_trace], revenues_layout)

As you can see, there is significant overlap with our actual data and the movies.

Once again, let's try putting some numbers to this data.  We can start with writing a function called `error_revenue_with_year` that accepts as an argument a movie and returns the difference in revenue predicted by the internal consultant's model, and the actual revenue.

In [220]:
def error_revenue_with_year(movie):
    return movie['domgross_2013$'] - revenue_with_year(movie['budget'], movie['year'])

In [221]:
error_revenue_with_year(movie) # -17863583.0

-17863583.0

This is down from the error of $-79000000$ previously for American Hustle.  So at least we're predicting that movie better.  Let's see how our model matches up in general. 

In [231]:
def rss_revenue_with_year(movies):
    n = len(movies)
    rss_list =  list(map(lambda movie: (error_revenue_with_year(movie)**2), movies))
    return sum(rss_list)

In [232]:
rss_revenue_with_year(movies)

nan

Hmm, that's not great.  Seems like our RSS is too large here to even return a number.  Still, let's keep going with seeing what these numbers indicate.

* Write predicted output
* Write error at
* Write RSS
* Write expected per increase

Now write a function called `error`, that given a list of `x_values`, and a list of `y_values`, the values `m` and `b` of a regression line, and a value of `x`, returns the error at that x value.  Remember ${\varepsilon_i} =  y_i - \overline{y}_i$.  

In [None]:
def error(x_values, y_values, m, b, x):
    expected = (m*x + b)
    return (y_actual(x, x_values, y_values) - expected)

In [None]:
error(x_values, y_values, 1.7, 100000, 13000000) # 3482380.0

Now that we have a formula to calculate our errors, write a function that returns a trace of an error at a given point.  The function is called `error_line` and it takes our dataset of `x_values` as the first argument, and `y_values` as the second argument.  It also takes in values of $m$ and $b$ as the next two arguments, to represent the regression line it is calculating errors from.  Finally, it's last argument is the value $x$ it is drawing an error for.

The return value is a dictionary that represents a trace.  The x values of the trace are two x values, and the y values of the trace are the actual value and the expected value.  The mode of the trace should equal `line`.

In [None]:
def error_line_trace(x_values, y_values, m, b, x):
    y_hat = m*x + b
    y = y_actual(x, x_values, y_values)
    name = 'error at ' + str(x)
    return {'x': [x, x], 'y': [y, y_hat], 'mode': 'line', 'marker': {'color': 'red'}, 'name': name}

In [None]:
error_at_120m = error_line_trace(x_values, y_values, 1.7, 10000, 120000000)

# {'marker': {'color': 'red'},
#  'mode': 'line',
#  'name': 'error at 120000000',
#  'x': [120000000, 120000000],
#  'y': [93050117.0, 204010000.0]}
error_at_120m

We just ran the our function to draw a trace of the error for the movie Elysium.  Let's see how it looks.

In [None]:
movies[19]

In [None]:
from linear_equations import expected_value_for_line
y_actual(13000000, x_values, y_values) # 25682380.0
expected_value_for_line(1.7, 10000, 13000000) # 22110000.0

In [None]:
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode(connected=True)

from graph import trace_values, m_b_trace, plot
movies_trace = trace_values(x_values, y_values, text=titles, name='movie data')
regression_trace = m_b_trace(1.7, 100000, x_values, name='estimated revenue')
plot([movies_trace, regression_trace, error_at_120m])

From there, we can write a function called `error_lines`, that takes in a list of `x_values` as an argument, `y_values` as an argument, and returns a list of traces for every x_value provided.

In [None]:
def error_line_traces(x_values, y_values, m, b):
    return list(map(lambda x_value: error_line_trace(x_values, y_values, m, b, x_value), x_values))

In [None]:
errors_for_regression = error_line_traces(x_values[0:5], y_values[0:5], 1.7, 100000)

In [None]:
len(errors_for_regression) # 5
errors_for_regression[-1]

# {'marker': {'color': 'red'},
#  'mode': 'line',
#  'name': 'error at 40000000',
#  'x': [40000000, 40000000],
#  'y': [95020213.0, 68100000.0]}

In [None]:
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode(connected=True)

from graph import trace_values, m_b_trace, plot
movies_trace = trace_values(x_values, y_values, text=titles, name='movie data')
regression_trace = m_b_trace(1.7, 100000, x_values, name='estimated revenue')
plot([movies_trace, regression_trace, *errors_for_regression])

### Calculating RSS

Now write a function called `squared_error`, that given a value of x, returns the squared error at that x value.

${\varepsilon_i}^2 =  (y_i - \overline{y}_i)^2$

In [None]:
def squared_error(x_values, y_values, m, b, x):
    return error(x_values, y_values, m, b, x)**2

In [None]:
squared_error(x_values, y_values, 1.7, 100000, x_values[0]) # 12126970464400.0

Next, write a function called `residual_sum_squares` that provided a list of points returns the total squared error for the movies in our dataset.

In [None]:
def squared_errors(x_values, y_values, m, b):
    return list(map(lambda x: squared_error(x_values, y_values, m, b, x), x_values))

In [None]:
squared_errors(x_values, y_values, 1.7, 100000)

In [None]:
def residual_sum_squares(x_values, y_values, m, b):
    return sum(squared_errors(x_values, y_values, m, b))

residual_sum_squares(x_values, y_values, 1.7, 100000)

Now we'll provide a couple functions for you. a function called `trace_rss`, that build a bar chart displaying the value of the RSS.  The return value is a dictionary with keys of `x` and `y`, both which point to lists.  The $x$ key should point to a list with one element, the string 'RSS'.  The $y$ list should be a list which points to the value of the RSS for the line.

In [None]:
import plotly.graph_objs as go

def trace_rss(x_values, y_values, m, b):
    rss_calc = residual_sum_squares(x_values, y_values, m, b)
    return dict(
        x=['RSS'],
        y=[rss_calc],
        type='bar'
    )

trace_rss(x_values, y_values, 1.7, 100000)

Once this is built, we can turn build a subplot showing the regression line, as well as the related RSS for the regression line.

In [None]:
import plotly
from plotly import tools
import plotly.graph_objs as go

def plot_regression_and_rss(scatter_trace, regression_trace, rss_calc_trace):
    fig = tools.make_subplots(rows=1, cols=2)
    fig.append_trace(scatter_trace, 1, 1)
    fig.append_trace(regression_trace, 1, 1)
    fig.append_trace(rss_calc_trace, 1, 2)
    plotly.offline.iplot(fig)

In [None]:
m = 1.7
b = 100000


scatter_trace = trace_values(x_values, y_values, text=titles, name='movie data')
regression_trace = m_b_trace(m, b, x_values, name='estimated revenue')
rss_calc_trace = trace_rss(x_values, y_values, m, b)

plot_regression_and_rss(scatter_trace, regression_trace, rss_calc_trace)