### A not so simple problem

In trying to predict whether or not a movie will make money, screenwriter William Goldman famously said, "nobody knows anything."  Let's see if we can gather any insight, and start at the beginning.

### The benefit of a buck

Imagine a movie executive receives a budget proposal and wants to see how much money the movie might make.  We can help him by looking at the relationship between money spent on a movie and money made.

Let's get some movie data.

The website FiveThirtyEight has provided a list of just that information for movies made (mainly) between 1990 and 2013.  Click here to [take a look](https://github.com/fivethirtyeight/data/blob/master/bechdel/movies.csv).  The data may look like it's in a table, but really it's stored in a CSV (comma-separated values) file, which is just a text file with the values on each line separated by commas.  [See it here](https://raw.githubusercontent.com/fivethirtyeight/data/master/bechdel/movies.csv).
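
To make the "separated by commas" idea concrete, here is a minimal sketch that parses a few lines of CSV text with Python's built-in `csv` module.  The `sample_csv` string is a hand-trimmed excerpt (only four of the file's columns and three of its rows) used purely for illustration; it is not the full file.

```python
import csv
import io

# A hand-trimmed excerpt of the movie data: a header row, then one movie per line.
sample_csv = """year,title,budget,domgross
2012,Dredd 3D,45000000,13414714.0
2013,12 Years a Slave,20000000,53107035.0
2013,2 Guns,61000000,75612460.0
"""

# DictReader uses the header row as keys, so each line becomes a dictionary.
rows = list(csv.DictReader(io.StringIO(sample_csv)))

print(rows[0]['title'])   # Dredd 3D
print(rows[0]['budget'])  # 45000000 (still a string; csv does not convert types)
```

Notice that every value comes back as a string.  Converting columns to numbers is one of the chores a library like Pandas handles for us.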

Pandas, a popular data science library, makes it relatively easy to read data from a CSV file.  So let's import the CSV data using Pandas, and then convert the data to a list of dictionaries in pure Python, which we all know and love.

In [2]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/bechdel/movies.csv')

When we use pandas, the data gets turned into a dataframe.  A dataframe is essentially a table.

In [4]:
df

Unnamed: 0,year,imdb,title,test,clean_test,binary,budget,domgross,intgross,code,budget_2013$,domgross_2013$,intgross_2013$,period code,decade code
0,2013,tt1711425,21 &amp; Over,notalk,notalk,FAIL,13000000,25682380.0,4.219577e+07,2013FAIL,13000000,2.568238e+07,4.219577e+07,1.0,1.0
1,2012,tt1343727,Dredd 3D,ok-disagree,ok,PASS,45000000,13414714.0,4.086899e+07,2012PASS,45658735,1.361109e+07,4.146726e+07,1.0,1.0
2,2013,tt2024544,12 Years a Slave,notalk-disagree,notalk,FAIL,20000000,53107035.0,1.586070e+08,2013FAIL,20000000,5.310704e+07,1.586070e+08,1.0,1.0
3,2013,tt1272878,2 Guns,notalk,notalk,FAIL,61000000,75612460.0,1.324930e+08,2013FAIL,61000000,7.561246e+07,1.324930e+08,1.0,1.0
4,2013,tt0453562,42,men,men,FAIL,40000000,95020213.0,9.502021e+07,2013FAIL,40000000,9.502021e+07,9.502021e+07,1.0,1.0
5,2013,tt1335975,47 Ronin,men,men,FAIL,225000000,38362475.0,1.458038e+08,2013FAIL,225000000,3.836248e+07,1.458038e+08,1.0,1.0
6,2013,tt1606378,A Good Day to Die Hard,notalk,notalk,FAIL,92000000,67349198.0,3.042492e+08,2013FAIL,92000000,6.734920e+07,3.042492e+08,1.0,1.0
7,2013,tt2194499,About Time,ok-disagree,ok,PASS,12000000,15323921.0,8.732475e+07,2013PASS,12000000,1.532392e+07,8.732475e+07,1.0,1.0
8,2013,tt1814621,Admission,ok,ok,PASS,13000000,18007317.0,1.800732e+07,2013PASS,13000000,1.800732e+07,1.800732e+07,1.0,1.0
9,2013,tt1815862,After Earth,notalk,notalk,FAIL,130000000,60522097.0,2.443732e+08,2013FAIL,130000000,6.052210e+07,2.443732e+08,1.0,1.0


From there, it's easy to convert our data into a list of movies, with one dictionary per movie.

In [7]:
movies = df.to_dict('records')
type(movies)

list

In [8]:
first_movie = movies[0]
first_movie

{'binary': 'FAIL',
 'budget': 13000000,
 'budget_2013$': 13000000,
 'clean_test': 'notalk',
 'code': '2013FAIL',
 'decade code': 1.0,
 'domgross': 25682380.0,
 'domgross_2013$': 25682380.0,
 'imdb': 'tt1711425',
 'intgross': 42195766.0,
 'intgross_2013$': 42195766.0,
 'period code': 1.0,
 'test': 'notalk',
 'title': '21 &amp; Over',
 'year': 2013}

Now, there are a lot of keys and values that we don't really need.  Let's scope this data down to only include a movie's title, its budget (adjusted to 2013 dollars), and how much money it earned in the United States.

In [12]:
# Keep each movie's title, 2013-dollar budget, and domestic gross
parsed_movies = [
    {'title': movie['title'], 'budget': movie['budget_2013$'], 'domgross': movie['domgross']}
    for movie in movies
]

In [13]:
first_five_movies = parsed_movies[0:5]
first_five_movies

[{'budget': 13000000, 'domgross': 25682380.0, 'title': '21 &amp; Over'},
 {'budget': 45658735, 'domgross': 13414714.0, 'title': 'Dredd 3D'},
 {'budget': 20000000, 'domgross': 53107035.0, 'title': '12 Years a Slave'},
 {'budget': 61000000, 'domgross': 75612460.0, 'title': '2 Guns'},
 {'budget': 40000000, 'domgross': 95020213.0, 'title': '42'}]

Ok, much better.

### Plotting the Data

Remember that when we want to plot data, we translate the values to X and Y values.  We'll have `budget` as the x value and the y value as `domgross`.  Let's just plot a few movies to get started.

> We'll show the code for plotting the data below, but you don't need to understand it at this point.

In [14]:
import plotly
from plotly import graph_objs

plotly.offline.init_notebook_mode(connected=True)

trace0 = graph_objs.Scatter(
    text=list(map(lambda movie: movie['title'],first_five_movies)),
    x=list(map(lambda movie: movie['budget'],first_five_movies)),
    y=list(map(lambda movie: movie['domgross'],first_five_movies)),
    mode="markers",
)

# Chart title, axis titles, and axis ranges
layout = graph_objs.Layout(
    title= 'Movie Spending and Revenue',
    xaxis= dict(
        title= 'Movie Spend',
        zeroline = True,
        range=[0, 65000000]
    ),
    yaxis=dict(
        title= 'Movie Revenue',
        zeroline = True,
        range=[0, 100000000]
    ),
    showlegend= False
)
plotly.offline.iplot(dict(data=[trace0], layout=layout))


Let's take a look at the first point, towards the bottom left.  That point represents the movie "21 & Over", with 13 million dollars spent and about 25.7 million earned domestically.

In [42]:
parsed_movies[0]

{'budget': 13000000, 'domgross': 25682380.0, 'title': '21 &amp; Over'}

What plotting this data shows us is that as the movie budget increases, represented by the points plotted further to the right, the movie revenue tends to increase.  So, at least we now know something.
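
One way to check this reading of the chart without plotting is to list the same five movies from smallest to largest budget.  The sketch below copies the values from the `first_five_movies` output shown earlier; the name `by_budget` is just an illustrative choice.

```python
# The five movies plotted above, copied from the output of first_five_movies.
first_five_movies = [
    {'budget': 13000000, 'domgross': 25682380.0, 'title': '21 &amp; Over'},
    {'budget': 45658735, 'domgross': 13414714.0, 'title': 'Dredd 3D'},
    {'budget': 20000000, 'domgross': 53107035.0, 'title': '12 Years a Slave'},
    {'budget': 61000000, 'domgross': 75612460.0, 'title': '2 Guns'},
    {'budget': 40000000, 'domgross': 95020213.0, 'title': '42'},
]

# Sort from smallest to largest budget, i.e. left to right on the chart.
by_budget = sorted(first_five_movies, key=lambda movie: movie['budget'])

for movie in by_budget:
    print(movie['title'], movie['budget'], movie['domgross'])
```

The first three rows rise together, while Dredd 3D and 2 Guns earn less than 42 despite their larger budgets.  With only five movies, the trend is loose rather than exact.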

Ok, now imagine your movie executive friend told you that the budget that came across his desk was $30 million.  Based on the data we graphed, how much money do you think the movie would bring in?

### Drawing a line

Ok, so how are we going to do something like this?  Well, we could draw a single straight line that approximates the relationship between a movie's budget and revenue.  Below, we draw such a line.  We'll worry later about how well a line like this models the relationship between two variables.  For now, let's use this.

![](./plot-intersect.png)

One of the benefits of using a line is that we can read off how much money we expect to bring in at any point along it.  Spend 50 million, and expect to bring in about 67 million.  Spend 10 million, and expect to bring in about 13 million.  This approach of using a line to model the relationship between an explanatory variable and an output is called **linear regression**.

Let's see if we can translate this line into a formula that will tell us the y value that corresponds to any given value of x along that line.

Let's take an initial (wrong) guess as to how to make this a formula.  And then we'll take another one.  This is our first guess.

$y = x$

Here is how we write it as a function.

In [20]:
def y(x):
    return x

y(0)

0

In [21]:
y(10000000)

10000000

What the formula is saying is that for every value of $x$ that I input to the function, I will get back an equal value $y$.  So according to the function, if the movie has a budget of 30 million, it will earn 30 million.  

Of course, this does not match the line in our chart.  The line says that spending 30 million brings predicted earnings of 40 million.  So how do we change our function?  Looking at the line in our chart, we can examine the x and y values at three different points:

| X        | Y           | 
| ------------- |:-------------:| 
| 0      |0 | 
| 30 million      |40 million | 
| 60 million      |80 million | 

What equation will allow us to input 0 and get back 0, input 30 million and get back 40 million, and input 60 million and get back 80 million?

Well, it's $y = \frac{4}{3}x$:

* 0 = 4/3 * 0
* 40 million =  4/3 * 30 million 
* 80 million = 4/3 * 60 million 

Let's see what this formula looks like in code.  In the next section we'll show how to figure out what to multiply $x$ by.

In [16]:
def y(x):
    return 4/3*x

y(30000000)

40000000.0

In [17]:
y(0)

0.0

Progress! By multiplying $x$ by a value, we can describe the line in our chart with a function that, given a value of $x$, returns the corresponding value of $y$ along that line.

In statistics, you will see this formula described as 

$y = mx$ 

With the variables standing for the following: 

* $y$: the value that is returned, also called the **response variable**, as it responds to values of $x$
* $x$: the input variable, also called the **explanatory variable**, as it explains the value of $y$
* $m$: the **slope variable**, which determines how steep or flat the line will be

In our movie example, these terms make sense.  The $y$ value is the money earned from the movie, which we say is in response to how much we spend.  Our explanatory variable $x$, the budget, explains the value of $y$, and $m$ corresponds to our value of 4/3 (about 1.33), which determines the slope of the line.
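
The same relationship can be written once with the slope as a parameter.  This is only a sketch: the name `predicted_revenue` and its `slope` argument are illustrative choices, not anything defined earlier in the lesson.

```python
def predicted_revenue(budget, slope=4/3):
    """Return y = m * x, where x is the budget and m is the slope."""
    return slope * budget

print(predicted_revenue(30000000))           # 40000000.0
print(predicted_revenue(30000000, slope=1))  # 30000000
```

With the default slope this reproduces the line in our chart, while `slope=1` gives back our first (wrong) guess of $y = x$.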

### Calculating the slope variable 

This is our mechanism for calculating the slope $m$.  Take any two points along the straight line, then $m$ is **the ratio of the vertical distance travelled to the horizontal distance travelled**.  Or, in math, it's:

$m = \Delta y \div \Delta x $
> The $\Delta$ is the Greek letter Delta.  In math, Delta means change.  So you can read the above formula as $m$ equals change in y divided by change in x.

For example, let's take another look at our graph and our line.  Let's travel the distance from x being equal to zero to x being equal to 10 million.  Plugging the numbers into our formula, we see that for that segment:

* $\Delta x$ = 10 million
* $\Delta y$ = 13.3 million

Notice that another way to word change in x is our ending x value, 10 million, minus our starting x value, 0.  Likewise, change in y is our ending y value, 13.3 million, minus our starting y value, 0.

So this means: 

* $\Delta y = y_1-  y_0$
* $\Delta x = x_1 - x_0$

And therefore we can say the following: given a beginning point $(x_0, y_0)$ and an ending point $(x_1, y_1)$ along any segment of a straight line, the slope $m$ of that line is:

$m = (y_1 - y_0) \div (x_1 - x_0)$

Ok, let's apply this formula to our line.  We can choose any two points for the formula, so let's have a starting point of (30 million, 40 million) and an ending point of (60 million, 80 million). Then plugging these coordinates into our formula, we have the following:

* $m = (y_1 - y_0) \div (x_1 - x_0) = (80{,}000{,}000 - 40{,}000{,}000) \div (60{,}000{,}000 - 30{,}000{,}000) = 4/3 \approx 1.33$

![](./m-calc.png)

So that is how we calculate the slope of a line: take any two points along that line and divide the distance travelled vertically by the distance travelled horizontally.
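
The two-point formula translates directly into a small function.  This is a sketch under the assumption that each point is passed as an `(x, y)` tuple; the function name `slope` is our own choice.

```python
def slope(start, end):
    """Slope m of the line through two (x, y) points: change in y over change in x."""
    x0, y0 = start
    x1, y1 = end
    return (y1 - y0) / (x1 - x0)

# The same two points used in the calculation above.
m = slope((30000000, 40000000), (60000000, 80000000))
print(m)  # 1.3333333333333333
```

Any other pair of points on the same line, such as (0, 0) and (30 million, 40 million), gives the same result.  Two points with the same x value would raise a `ZeroDivisionError`, because a vertical line has no defined slope.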

### The y intercept

Ok, there is just one more thing we need to learn before we can describe straight lines in a two-dimensional world: the y-intercept.

The y-intercept is the y value of the line when it intersects the y-axis.  Or to put it another way, the y-intercept is the value of y when x equals zero. 

![](plot-add.png)

So looking at the graph, what is the y-intercept of the blue line?  It's the value of y where the blue line crosses the y-axis, which is zero.  Now imagine shifting the entire line up so that the y-intercept increases to 20 million, and for every value of x, the corresponding value of y increases by 20 million.  Our formula is then no longer y = 4/3 x.  It is y = 4/3 x + 20 million.

In statistics, you will see this as $y = mx + b$, where $b$ is the y-intercept.  Taking a look at the table of points on our shifted line, we can see that 20 million is our y-intercept.

| X        | Y           | 
| ------------- |:-------------:| 
| 0      |20 million | 
| 30 million      |60 million | 
| 60 million      |100 million | 

And translating our formula into a function, we have:

In [19]:
def y(x):
    return 4/3*x + 20000000

In [20]:
y(30000000)

60000000.0

In [21]:
y(60000000)

100000000.0

The formula $y = mx + b$ can describe any non-vertical line in a two-dimensional space.  The $m$ value changes how steep or flat the line is, and the $b$ value changes where the line crosses the y-axis.
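
To see $m$ and $b$ working together, here is a minimal sketch of a helper that builds a line function from a slope and a y-intercept.  The names `make_line`, `through_origin`, and `shifted_up` are hypothetical and chosen only for this illustration.

```python
def make_line(m, b):
    """Return a function that computes y = m * x + b."""
    def y(x):
        return m * x + b
    return y

through_origin = make_line(4/3, 0)     # our first line, y = 4/3 x
shifted_up = make_line(4/3, 20000000)  # the same slope, shifted up 20 million

print(through_origin(30000000))  # 40000000.0
print(shifted_up(30000000))      # 60000000.0
print(shifted_up(0))             # 20000000.0
```

Both lines share the same slope, so they rise at the same rate.  Only the value at x = 0, the y-intercept, differs.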

### Summary