# Evaluating Regression Lines Lab

### Introduction

In this lesson, let's put our knowledge about data science so far to the test.  And we'll do so with our real live movie data.

### Determining Quality

First, let's get some movies from the 538 dataset [provided here](https://raw.githubusercontent.com/fivethirtyeight/data/master/bechdel/movies.csv).

In [10]:
import pandas

def parse_file(fileName):
    movies_df = pandas.read_csv(fileName)
    return movies_df.to_dict('records')

movies = parse_file('https://raw.githubusercontent.com/fivethirtyeight/data/master/bechdel/movies.csv')


In [11]:
len(movies)

1794

And let's start looking at our data by examining the first entry in our dataset.

In [12]:
movies[0]

{'binary': 'FAIL',
 'budget': 13000000,
 'budget_2013$': 13000000,
 'clean_test': 'notalk',
 'code': '2013FAIL',
 'decade code': 1.0,
 'domgross': 25682380.0,
 'domgross_2013$': 25682380.0,
 'imdb': 'tt1711425',
 'intgross': 42195766.0,
 'intgross_2013$': 42195766.0,
 'period code': 1.0,
 'test': 'notalk',
 'title': '21 &amp; Over',
 'year': 2013}

So there, we can see what data is available about the movies in the dataset.  The `budget_2013$` seems to be the budget adjusted for inflation to be in 2013 dollars, and `domgross_2013$` seems to be the equivalent for domestic revenue, as `intgross_2013$` is the equivalent for 2013 international revenue.

Now, that first movie looks good, but there are some others that will not be so fun to play with.  Let's remove the values that have `nan`, which stands for not a number for their `domgross_2013`.  This is missing data, and there are only a few pieces of missing data here, so these are safe to remove without causing too much damage.  

In [13]:
import math
list(filter(lambda movie: math.isnan(movie['domgross_2013$']),movies))

[{'binary': 'FAIL',
  'budget': 19200000,
  'budget_2013$': 19200000,
  'clean_test': 'nowomen',
  'code': '2013FAIL',
  'decade code': 1.0,
  'domgross': nan,
  'domgross_2013$': nan,
  'imdb': 'tt2005374',
  'intgross': nan,
  'intgross_2013$': nan,
  'period code': 1.0,
  'test': 'nowomen-disagree',
  'title': 'The Frozen Ground',
  'year': 2013},
 {'binary': 'PASS',
  'budget': 4000000,
  'budget_2013$': 4142763,
  'clean_test': 'ok',
  'code': '2011PASS',
  'decade code': 1.0,
  'domgross': nan,
  'domgross_2013$': nan,
  'imdb': 'tt1422136',
  'intgross': 442550.0,
  'intgross_2013$': 458345.0,
  'period code': 1.0,
  'test': 'ok',
  'title': 'A Lonely Place to Die',
  'year': 2011},
 {'binary': 'PASS',
  'budget': 10000000,
  'budget_2013$': 10356908,
  'clean_test': 'ok',
  'code': '2011PASS',
  'decade code': 1.0,
  'domgross': nan,
  'domgross_2013$': nan,
  'imdb': 'tt1701990',
  'intgross': nan,
  'intgross_2013$': nan,
  'period code': 1.0,
  'test': 'ok',
  'title': 'Dete

Write a function called `remove_movies_missing_data` that returns the subset of movies that do not have `nan`.  To do so, you can import the math library and make use of the `math.isnan` method.

In [14]:
import math

def remove_movies_missing_data(movies):
    return [movie for movie in movies if not math.isnan(movie['domgross_2013$'])]

In [15]:
parsed_movies = remove_movies_missing_data(movies)
len(parsed_movies) # 1776
list(filter(lambda movie: math.isnan(movie['domgross_2013$']),parsed_movies)) # []

[]

To avoid working with such large numbers, let's divide both our budget and revenue numbers for each movie by 1 million.  It will make some of our future calculations easier.  It seems like the attributes that we can scale down are `budget`, `budget_2013$`, `domgross`, `domgross_2013$`, `intgross`, and `intgross_2013$`.

Write a function called `scale_down_movie` that can take an element from our movies list and returns a movie with all of the elements the same, except the budget `budget`, `budget_2013$`, `domgross`, `domgross_2013$`, `intgross`, and `intgross_2013$` numbers divided by 1 million and rounded to two decimal places.

In [71]:
def scale_down_movie(movie):
    movie_copy = dict(movie)
    movie_copy.update({'budget': round(movie['budget']/1000000, 2), 'budget_2013$': round(movie['budget_2013$']/1000000, 2), 
                       'intgross': round(movie['intgross']/1000000, 2), 'intgross_2013$': round(movie['intgross_2013$']/1000000, 2),
                       'domgross': round(movie['domgross']/1000000, 2), 'domgross_2013$': round(movie['domgross_2013$']/1000000, 2)
                      })
    return movie_copy

In [74]:
movies[0]

{'binary': 'FAIL',
 'budget': 13000000,
 'budget_2013$': 13000000,
 'clean_test': 'notalk',
 'code': '2013FAIL',
 'decade code': 1.0,
 'domgross': 25682380.0,
 'domgross_2013$': 25682380.0,
 'imdb': 'tt1711425',
 'intgross': 42195766.0,
 'intgross_2013$': 42195766.0,
 'period code': 1.0,
 'test': 'notalk',
 'title': '21 &amp; Over',
 'year': 2013}

In [73]:
scale_down_movie(movies[0])

{'binary': 'FAIL',
 'budget': 13.0,
 'budget_2013$': 13.0,
 'clean_test': 'notalk',
 'code': '2013FAIL',
 'decade code': 1.0,
 'domgross': 25.68,
 'domgross_2013$': 25.68,
 'imdb': 'tt1711425',
 'intgross': 42.2,
 'intgross_2013$': 42.2,
 'period code': 1.0,
 'test': 'notalk',
 'title': '21 &amp; Over',
 'year': 2013}

Ok, now that we have a function to scale down our movies, lets `map` through all of our `parsed_movies` to return a list of `scaled_movies`. 

In [86]:
def scale_down_movies(movies):
    return list(map(lambda movie: scale_down_movie(movie), movies))

In [89]:
first_ten_movies = parsed_movies[0:10]
first_ten_scaled = scale_down_movies(first_ten_movies)
first_ten_scaled[-2:]
# [{'binary': 'PASS', 'budget': 13.0,
#   'budget_2013$': 13.0,
#   'clean_test': 'ok',
#   'code': '2013PASS',
#   'decade code': 1.0,
#   'domgross': 18.01,
#   'domgross_2013$': 18.01,
#   'imdb': 'tt1814621',
#   'intgross': 18.01,
#   'intgross_2013$': 18.01,
#   'period code': 1.0,
#   'test': 'ok',
#   'title': 'Admission',
#   'year': 2013},
#  {'binary': 'FAIL',
#   'budget': 130.0,
#   'budget_2013$': 130.0,
#   'clean_test': 'notalk',
#   'code': '2013FAIL',
#   'decade code': 1.0,
#   'domgross': 60.52,
#   'domgross_2013$': 60.52,
#   'imdb': 'tt1815862',
#   'intgross': 244.37,
#   'intgross_2013$': 244.37,
#   'period code': 1.0,
#   'test': 'notalk',
#   'title': 'After Earth',
#   'year': 2013}]

[{'binary': 'PASS',
  'budget': 13.0,
  'budget_2013$': 13.0,
  'clean_test': 'ok',
  'code': '2013PASS',
  'decade code': 1.0,
  'domgross': 18.01,
  'domgross_2013$': 18.01,
  'imdb': 'tt1814621',
  'intgross': 18.01,
  'intgross_2013$': 18.01,
  'period code': 1.0,
  'test': 'ok',
  'title': 'Admission',
  'year': 2013},
 {'binary': 'FAIL',
  'budget': 130.0,
  'budget_2013$': 130.0,
  'clean_test': 'notalk',
  'code': '2013FAIL',
  'decade code': 1.0,
  'domgross': 60.52,
  'domgross_2013$': 60.52,
  'imdb': 'tt1815862',
  'intgross': 244.37,
  'intgross_2013$': 244.37,
  'period code': 1.0,
  'test': 'notalk',
  'title': 'After Earth',
  'year': 2013}]

In [79]:
scaled_movies = scale_down_movies(parsed_movies)

Ok, now let's simply plot our dataset using Plotly to see how much money a movie makes domestically in 2013 dollars given a budget in 2013 dollars.  Create a trace called `revenues_per_budgets_trace`, that plots this data. Set `budget_2013$` as the `x` values, and the `domgross_2013$` as the $y$ values.  Set the text of the trace equal to a list of the movie titles, so that we can see which movie is associated with each point.

In [110]:
budgets = list(map(lambda movie: movie['budget_2013$'], scaled_movies))
domestic_revenues = list(map(lambda movie: movie['domgross_2013$'], scaled_movies))
titles = list(map(lambda movie: movie['title'], scaled_movies))

In [111]:
from graph import trace_values
revenues_per_budgets_trace = trace_values(budgets, domestic_revenues, text = titles)

In [112]:
revenues_per_budgets_trace['x'][0:10] # [13.0, 45.66, 20.0, 61.0, 40.0, 225.0, 92.0, 12.0, 13.0, 130.0]

[13.0, 45.66, 20.0, 61.0, 40.0, 225.0, 92.0, 12.0, 13.0, 130.0]

In [113]:
revenues_per_budgets_trace['y'][0:10] # [25.68, 13.61, 53.11, 75.61, 95.02, 38.36, 67.35, 15.32, 18.01, 60.52]

[25.68, 13.61, 53.11, 75.61, 95.02, 38.36, 67.35, 15.32, 18.01, 60.52]

In [115]:
revenues_per_budgets_trace['text'][0:10] 
# ['21 &amp; Over',  'Dredd 3D', '12 Years a Slave', '2 Guns', '42', '47 Ronin',  'A Good Day to Die Hard',
# 'About Time',  'Admission',  'After Earth']

['21 &amp; Over',
 'Dredd 3D',
 '12 Years a Slave',
 '2 Guns',
 '42',
 '47 Ronin',
 'A Good Day to Die Hard',
 'About Time',
 'Admission',
 'After Earth']

Now it's time to plot this data.

In [106]:
from graph import trace_values, plot
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode(connected=True)

plot([revenues_per_budgets_trace])

Look at that datapoint was at the top well over the 1.5 billion mark.  What movie is that?

Write a function called highest_domestic_gross, that finds the highest grossing movie given a list of movies.

In [19]:
def highest_domestic_gross(movies):
    return max(movies, key=lambda movie: movie['domgross_2013$'])

In [117]:
highest_domestic_gross(scaled_movies)['title'] # 'Star Wars'

'Star Wars'

Huh, well we should've known.  Let's now zoom in on our dataset so that our plot no longer expands for just a few of the outliers.  Set the x-axis of our plot to go from zero to 300 million dollars, and the y-axis of our plot to go from zero to one billion dollars.

In [107]:
from graph import build_layout
revenues_per_budgets_trace = trace_values(budgets, domestic_revenues, text = titles)
revenues_layout = build_layout(x_range = [0, 300], y_range = [0, 1000])
plot([revenues_per_budgets_trace], revenues_layout)

Ok, well at least we now have a closer look at our data.  And we're still seeing Titanic up in the top right there. For a next step,

Ok, now imagine that we hired an outside consultant to predict revenue for us.  And the consultant told us the following: $R(x) = 1.5*budget + 10$ where $x$ is a movie's budget in 2013 dollars, and R(x) is the expected revenue in 2013 dollars.  Write a function called `outside_consultant_predicted_revenue` that provided a budget returns the expected revenue according to the outside consultant's formula. 

In [137]:
def outside_consultant_predicted_revenue(budget):
    return 1.5*budget + 10

Let's plot the consultant estimated revenue to see visually if his estimates line up.

In [138]:
budgets = list(map(lambda movie: movie['budget_2013$'], scaled_movies))
domestic_revenues = list(map(lambda movie: movie['domgross_2013$'], scaled_movies))
titles = list(map(lambda movie: movie['title'], scaled_movies))

consultant_estimated_revenues = list(map(lambda budget: outside_consultant_predicted_revenue(budget),budgets))
consultant_estimated_revenues_trace = trace_values(budgets, consultant_estimated_revenues, mode='line', name = 'consultant estimate')

In [134]:
consultant_estimated_revenues_trace['x'][0:10] # [13.0, 45.66, 20.0, 61.0, 40.0, 225.0, 92.0, 12.0, 13.0, 130.0]
consultant_estimated_revenues_trace['y'][0:10] # [76.248, 61.971000000000004, 106.421, 131.171, 152.522, 90.196, 122.085, 64.852,  67.811,  114.572]
consultant_estimated_revenues_trace['mode'] # 'line'
consultant_estimated_revenues_trace['name'] # 'consultant estimate'

'consultant estimate'

In [136]:
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode(connected=True)

from graph import trace_values, m_b_trace, plot

plot([revenues_per_budgets_trace, consultant_estimated_revenues_trace], revenues_layout)

Overall, they don't look so bad.  Of course, we can calculate the RSS to place a number to how accurate his model really is.  Let's write a method called `error_for_consultant_model` which takes in a bugdet of a movie in our dataset, and returns the difference between the movie's gross domestic revenue in 2013 dollars, and the prediction by the consultant's model.

In [24]:
def error_for_consultant_model(movie):
    expected = outside_consultant_predicted_revenue(movie['budget_2013$'])
    return movie['budget_2013$'] - expected

In [25]:
american_hustle = scaled_movies[10]
error_for_consultant_model(american_hustle) # -30.0

-30.0

Once haven written a formula that calculates the error for the consultant's model provided a budget, now write a method that calculates the RSS for the consulant's model.  When we move onto compare our consultant's model with others, we'll then have a metric to compare.

In [26]:
def rss_consultant(movies):
    return round(sum(list(map(lambda movie: error_for_consultant_model(movie)**2, movies))), 2)

In [27]:
rss_consultant(scaled_movies)

3900419.65

Ok, we'll find out if this number is any good later, but for right now let's just say that our RSS is good enough.  Use the derivative to write a function that provided a budget, returns the $\frac{\Delta R}{\Delta x}$ according to the consultant's model.  Remember that our consultant's model is $R(x) = 1.5x - 1000$ where $x$ is a budget, and $R(x)$ is an expected revenue.

In [28]:
def consultant_spend_a_little_more(budget):
    return 3

In [29]:
consultant_spend_a_little_more(100000000)

3

### A new model

Now imagine a data scientist in your company wants to take a crack at his own model for predicting a movie's revenue.  The data scientist notices, that in general, movies tend to make more money per year.

In [30]:
from graph import build_layout
years = list(map(lambda movie: movie['year'],movies))
years_and_revenues = trace_values(years, domestic_revenues, text = titles)
years_layout = build_layout(y_range = [0, 550])
plot([years_and_revenues], years_layout)

So the data scientist comes up with a new model, to indicate a movie's expected revenue is 1.5 million for every year after 1965 plus $1.1$ times the movie's budget.  Write a function called `revenue_with_year` that takes as arguments `budget` and `year` and returns expected revenue.  

In [139]:
def revenue_with_year(budget, year):
    return 1.1*budget + 1*(year - 1965)

In [141]:
revenue_with_year(25, 1997) # 59.5
revenue_with_year(40, 1983) # 62.0

62.0

Let's plot these estimated budgets side by side of the actual revenues and budgets.  

In [148]:
budgets = list(map(lambda movie: movie['budget_2013$'], scaled_movies))
domestic_revenues = list(map(lambda movie: movie['domgross_2013$'], scaled_movies))
titles = list(map(lambda movie: movie['title'], scaled_movies))

internal_consultant_estimated_revenues = list(map(lambda movie: revenue_with_year(movie['domgross_2013$'], movie['year']),scaled_movies))
internal_consultant_estimated_trace = trace_values(budgets, consultant_estimated_revenues, mode='line', name = 'consultant estimate')

In [149]:
internal_consultant_estimated_trace['x'][0:10] # [13.0, 45.66, 20.0, 61.0, 40.0, 225.0, 92.0, 12.0, 13.0, 130.0]
internal_consultant_estimated_trace['y'][0:10] # [29.5, 78.49, 40.0, 101.5, 70.0, 347.5, 148.0, 28.0, 29.5, 205.0]
internal_consultant_estimated_trace['mode'] # 'line'

'line'

In [150]:
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode(connected=True)

from graph import trace_values, m_b_trace, plot
plot([revenues_per_budgets_trace, internal_consultant_estimated_trace], revenues_layout)

As you can see, visually the model does fairly well.  Once again, let's find out how well.  Write a function called `rss_revenue_with_year` that returns the Residual Sum of Squares associated with the `revenue_with_year` model for the `scaled_movies` dataset.

In [193]:
def squared_error_revenue_with_year(movie):
    actual = movie['domgross_2013$']
    expected = revenue_with_year(movie['domgross_2013$'], movie['year'])
    return (actual - expected)**2

def rss_revenue_with_year(movies):
    squared_errors = list(map(lambda movie: squared_error_revenue_with_year(movie), movies))
    return round(sum(squared_errors), 2)

In [190]:
rss_revenue_with_year(scaled_movies) # 4241833.98

4241833.98

So the RSS here is $4,241,833.98$ as opposed to $3,900,419.65$ with our previous model.  So according to RSS it isn't better than the previous model.  Still, it's significantly worse to ignore completely.  

So using our knowledge of partial derivatives, let's write a function that returns the model's predicted increase in revenue from adding a movie debuting a year later. 

In [35]:
def years_and_revenue_increase_year():
    return 1

And using our knowledge of partial derivatives, let's write a function that returns the model's predicted increase of increasing the budget.

In [36]:
def years_and_revenue_increase_budget():
    return 1.1

### Our initial regression line, and improving upon it

Ok, so now that you have evaluated the model's of an outside consultant and an internal consultant, it's time to see if you can do any better.  Here we go.

We have our dataset, and should begin with an initial regression line by using our `build_regression_line` from earlier. 

In [185]:
from linear_equations import build_regression_line
budgets = list(map(lambda movie: movie['budget_2013$'], scaled_movies))
domestic_revenues = list(map(lambda movie: movie['domgross_2013$'], scaled_movies))
unrounded_initial_regression_line = build_regression_line(budgets, domestic_revenues)

initial_regression_line = {'b': round(unrounded_initial_regression_line['b'], 2), 'm':  round(unrounded_initial_regression_line['m'], 2)}
initial_regression_line

{'b': 0.5, 'm': 1.79}

Using our `build_regression_line` method from earlier, we can get write `expected_revenue_per_budget`.  

In [186]:
def expected_revenue_per_budget(budget):
    return round(budget*initial_regression_line['m'] + initial_regression_line['b'], 3)

In [187]:
budget = american_hustle['budget_2013$'] # 40
expected_revenue_per_budget(budget) # 72.035

72.1

Now remember that our `build_regression_line` formula is not very sophisticated.  It essentially draws a line between the point with the lowest $x$ value and the highest $x$.  Let's plot our line along our dataset to get a sense of how good a job this does.

In [159]:
budgets = list(map(lambda movie: movie['budget_2013$'], scaled_movies))
estimated_revenues = list(map(lambda budget: expected_revenue_per_budget(budget), budgets))
initial_regression_trace = trace_values(budgets, estimated_revenues, mode = 'line', name = 'initial regression trace')

In [208]:
initial_regression_trace['x'][0:10] # [13.0, 45.66, 20.0, 61.0, 40.0, 225.0, 92.0, 12.0, 13.0, 130.0]
initial_regression_trace['y'][0:10] # [23.75, 82.16, 36.27, 109.59, 72.04, 402.88, 165.03, 21.96, 23.75, 232.99]

[23.75, 82.16, 36.27, 109.59, 72.04, 402.88, 165.03, 21.96, 23.75, 232.99]

In [158]:
plot([revenues_per_budgets_trace, initial_regression_trace], revenues_layout)

By now you should be able to guess our next step: put a number to how well this line matches our data.  We'll write a function called `regression_revenue_error` that provided a movie and an `m` and `b` value of a regression line, returns the difference between our `initial_regression_lines`'s expected revenue and the actual revenue error.

In [41]:
def regression_revenue_error(m, b, movie):
    expected = (m*movie['budget_2013$'] + b)
    actual = movie['domgross_2013$']
    return actual - expected

In [42]:
regression_revenue_error(initial_regression_line['m'], initial_regression_line['b'], movies[10]) # 76897630.51112476

76897630.51112476

Ok, now plot the cost curve from changing values of $m$ from $1.0$ to $2.0$.  We don't ask you to write a function for calculating the RSS, as you already wrote one in the `error` library which is available to you.

In [43]:
from error import residual_sum_squares
residual_sum_squares(budgets, domestic_revenues, initial_regression_line['m'], initial_regression_line['b'])

24163798.64890295

In [44]:
from graph import trace_values
large_m_range = list(range(10, 20))
m_range = list(map(lambda m_value: m_value/10,large_m_range))
cost_values = list(map(lambda m_value: residual_sum_squares(budgets, domestic_revenues, m_value, initial_regression_line['b']),m_range))
rss_trace = trace_values(x_values=m_range, y_values=cost_values, mode = 'line')

In [45]:
plot([rss_trace])

Ok, so based on this, it appears that with our $b = 0.5021166807532609$, the slope of our regression line that produces the lowest error is between $1.3$ and $1.4$.  In fact if we replace our initial line value of $m$ to $1.3$ we see that our RSS does in fact from a previous value of $24,163,798$.

In [46]:
residual_sum_squares(budgets, domestic_revenues, 1.3, initial_regression_line['b'])

22065937.75716766

### Changing multiple variables

Ok, now it's time for us to not just alter a variable like the slope of our regression, but to find the 'best fit regression line'.  As you know, the technique for that is to use gradient descent.

Remember that we derived our gradient formulas equal from our cost curve: 

$$J(m,b) = \sum_{i = 1}^n(y_i - (mx_i + b))^2 $$

We derived the gradient of our cost function as the following: 

$$ \frac{dJ}{db}J(m,b) = -2\sum_{i = 1}^n(y_i - (mx_i + b)) = -2\sum_{i = 1}^n \epsilon_i $$
$$ \frac{dJ}{dm}J(m,b) = -2\sum_{i = 1}^n x(y_i - (mx_i + b)) = -2\sum_{i = 1}^n x_i*\epsilon_i$$


#### remove the 2, to simplify

Now looking at our top function $\frac{dJ}{dm}$, we see that it equals to negative 2, multiplied by the sum of the errors for a a provided $m$ and $b$ values and a dataset.  And lucky, for us we already have a function called `regression_revenue_error` that returns the error at a given point, provided our $m$ and $b$ values.

Your task is two write a function called `b_gradient` that takes in values of $m$, $b$ and our (scaled) movies, and returns the $b gradient$, -2 time the sum of the errors for the dataset.

In [47]:
def b_gradient(m, b, movies):
    n = len(movies)
    errors = list(map(lambda movie: regression_revenue_error(m, b, movie), movies))
    return -1 * sum(errors)/n

In [48]:
b_gradient(1.79, 0.50, scaled_movies)

5.367987612612615

In [49]:
def m_gradient(m, b, movies):
    n = len(movies)
    errors_times_x = list(map(lambda movie: regression_revenue_error(m, b, movie)*movie['budget_2013$'], movies))
    return -1 * sum(errors_times_x)/n

In [50]:
m_gradient(1.79, 0.50, scaled_movies)

2520.578267701572

Ok, now we just wrote two functions that tell us how to update the corresponding values of $m$ and $b$.  Our next step is to write a function called `step_gradient` that will use these functions to take the step down along our cost curve.

Remember that with each step we want to move our `current_b` value in the negative direction of calculated `b_gradient`, and want to move our `current_m` value in the negative direction of calculated `m_gradient`.  

`current_m` = `old_m` $ -  \alpha(-2*\sum_{i=1}^n x_i*\epsilon_i )$

`current_b` =  `old_b` $ - \alpha( -2*\sum_{i=1}^n \epsilon_i )$

The `step_gradient` function would take as arguments the `b_current`, `m_current`, and list of scaled movies and a learning rate, and return a new calculated `b_current` and `m_current` with a dictionary of keys `b` and `m` that point to the current values.   

In [51]:
def step_gradient(b_current, m_current, movies, learning_rate):
    b_change = b_gradient(m_current, b_current, movies)
    m_change = m_gradient(m_current, b_current, movies) 
    new_b = b_current - (learning_rate * b_change)
    new_m = m_current - (learning_rate * m_change)
    return {'b': new_b, 'm': new_m}

Now let's plot the first 10 steps of gradient descent.

In [52]:
initial_regression_line # {'b': 0.5021166807532609, 'm': 1.788331924668964}

{'b': 0.5021166807532609, 'm': 1.788331924668964}

Then let's see how our formula changes over time using gradient descent.

In [53]:
step_gradient(initial_regression_line['b'], initial_regression_line['m'], scaled_movies, .0001)

{'b': 0.5015889931744079, 'm': 1.537287989729067}

Now write a function that returns a set of 10 iterations.

In [68]:
# set our initial step with m and b values, and the corresponding error.
def generate_steps(m, b, number_of_steps, movies, learning_rate):
    iterations = []
    for i in range(number_of_steps):
        iteration = step_gradient(b, m, movies, learning_rate)
        # {'b': value, 'm': value}
        b = iteration['b']
        m = iteration['m']
        # update values of b and m
        iterations.append(iteration)
    return iterations

In [69]:
iterations = generate_steps(0, 0, 100, scaled_movies, .0001)

And we can see how this changes over time.

In [70]:
def to_line(m, b):
    initial_x = 0
    ending_x = 500
    initial_y = m*initial_x + b
    ending_y = m*ending_x + b
    return {'data': [{'x': [initial_x, ending_x], 'y': [initial_y, ending_y]}]}

frames = list(map(lambda iteration: to_line(iteration['m'], iteration['b']),iterations))
frames[0:10]

[{'data': [{'x': [0, 500], 'y': [0.009517487049549562, 425.7253743733105]}]},
 {'data': [{'x': [0, 500], 'y': [0.014275382007953506, 589.640993030583]}]},
 {'data': [{'x': [0, 500], 'y': [0.017200610005283723, 652.7504928140912]}]},
 {'data': [{'x': [0, 500], 'y': [0.019420141895015803, 677.0460318882198]}]},
 {'data': [{'x': [0, 500], 'y': [0.021367901762291967, 686.3968081745156]}]},
 {'data': [{'x': [0, 500], 'y': [0.023210965900984027, 689.9933165079358]}]},
 {'data': [{'x': [0, 500], 'y': [0.025013664662035303, 691.3742260799991]}]},
 {'data': [{'x': [0, 500], 'y': [0.02680076753385386, 691.9020510830119]}]},
 {'data': [{'x': [0, 500], 'y': [0.0285818116579547, 692.1014082086579]}]},
 {'data': [{'x': [0, 500], 'y': [0.030360469177543772, 692.1742936451152]}]}]

In [160]:
from plotly.offline import init_notebook_mode, iplot
from IPython.display import display, HTML

init_notebook_mode(connected=True)

budgets = list(map(lambda movie: movie['budget_2013$'], scaled_movies))
domestic_revenues = list(map(lambda movie: movie['domgross_2013$'], scaled_movies))

figure = {'data': [{'x': [0], 'y': [0]}, {'x': budgets, 'y': domestic_revenues, 'mode': 'markers'}],
          'layout': {'title': 'Regression Line',
                     'updatemenus': [{'type': 'buttons',
                                      'buttons': [{'label': 'Play',
                                                   'method': 'animate',
                                                   'args': [None]}]}]
                    },
          'frames': frames}
iplot(figure)

Finally, let's calculate the RSS associated with our formula as opposed to the other.

In [58]:
iterations[-1] # {'b': 0.5509918331403825, 'm': 1.3796546010160367}

{'b': 0.6733901508952449, 'm': 1.378542026072047}

In [59]:
residual_sum_squares(budgets, domestic_revenues, iterations[-1]['m'], iterations[-1]['b'])

22043348.930944957

Using this last iteration, we have an RSS $22043348$, better than all previous models - and we have the data, and knowledge to prove it.  Nice work!