### Choosing a best-fit line

So far we have been rather fast and loose with choosing our regression line.  Well, that ends today.  In this lesson, we'll begin to evaluate the accuracy of a regression line and learn a technique for choosing one.

### Determining Quality

To get a sense of the accuracy of a regression line, we can see how closely it matches the data we have.  Let's see what we mean.

In [10]:
first_show = {'budget': 0, 'attendance': 100}
second_show = {'budget': 100, 'attendance': 150}
third_show = {'budget': 200, 'attendance': 600}
fourth_show = {'budget': 400, 'attendance': 700}

shows = [first_show, second_show, third_show, fourth_show]

In [17]:
def sample_regression_formula(x):
    return 100 + 1.5 * x

We draw a chart that displays our sample regression line (by simply drawing a line between the data points with the lowest and highest x values) and then show with red lines where our regression line does not match up with the data.

In [16]:
import plotly
from plotly import graph_objs
plotly.offline.init_notebook_mode(connected=True)

trace = graph_objs.Scatter(
    x=list(map(lambda show: show['budget'], shows)),
    y=list(map(lambda show: show['attendance'], shows)),
    mode="markers"
)

layout= graph_objs.Layout(
    xaxis= dict(
        title= 'Budget',
        zeroline = True
    ),
    yaxis=dict(
        title= 'Attendance',
        zeroline = True
    ),
    shapes=[
        {
            'type': 'line',
            'x0': 0,
            'y0': 100,
            'x1': 400,
            'y1': 700,
            'line': {
                'color': 'rgb(55, 128, 191)',
                'width': 3,
            },
        },
        {
            'type': 'line',
            'x0': 200,
            'y0': 600,
            'x1': 200,
            'y1': 400,
            'line': {
                'color': 'rgb(178,34,34)',
                'width': 3,
            },
        },
        {
            'type': 'line',
            'x0': 100,
            'y0': 150,
            'x1': 100,
            'y1': 250,
            'line': {
                'color': 'rgb(178,34,34)',
                'width': 3,
            },
        }
    ],
    showlegend= False
)

plotly.offline.iplot(dict(data=[trace], layout=layout))
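The red error lines in the layout above are typed in by hand. As a rough sketch, they could instead be generated from the data: for each show, draw a vertical segment from the actual attendance to the value the line predicts. The helper name `error_line_shapes` is our own, and the data and formula are repeated only so that the cell runs on its own.

```python
# Data and formula repeated so this cell is self-contained.
shows = [
    {'budget': 0, 'attendance': 100},
    {'budget': 100, 'attendance': 150},
    {'budget': 200, 'attendance': 600},
    {'budget': 400, 'attendance': 700},
]

def sample_regression_formula(x):
    return 100 + 1.5 * x

def error_line_shapes(points, formula):
    """Hypothetical helper: one vertical red segment per point off the line."""
    shapes = []
    for point in points:
        x = point['budget']
        actual = point['attendance']
        predicted = formula(x)
        if actual == predicted:
            continue  # the point sits on the line, so there is nothing to draw
        shapes.append({
            'type': 'line',
            'x0': x, 'y0': actual,
            'x1': x, 'y1': predicted,
            'line': {'color': 'rgb(178,34,34)', 'width': 3},
        })
    return shapes

error_shapes = error_line_shapes(shows, sample_regression_formula)
```

The resulting list of dictionaries has the same structure as the hand-written entries, so it could be appended to the `shapes` list passed to the layout.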

In the above chart, you can see that at the point where x = 100, our regression line predicts a y value of 250.  Our predicted y is written $\hat{y}$, pronounced y hat.  However, the actual value of y when x = 100 is 150.  So our formula has an error.

We can say that our error at point x = 100 is actual_y - expected_y or, in math terms, $\varepsilon = y - \hat{y}$.

Now, an initial idea is to simply evaluate lines by comparing their total errors, summing $y - \hat{y}$ over every point in our dataset.  However, if we tried that on our dataset, notice that at point x = 100, $\varepsilon$ = 150 - 250 = -100, and at x = 200, $\varepsilon$ = 600 - 400 = 200.  So adding these errors would begin to cancel them out.  To avoid that effect, we can simply square the errors, so that we are always summing non-negative numbers, and add these numbers together.

$\varepsilon^2 = (y - \hat{y})^2$
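To make the cancellation concrete, here is a small sketch that computes the signed error at each of our four shows and compares the plain sum with the sum of squares. The names `error`, `total_error` and `total_squared_error` are ours, and the data and formula are repeated so the cell runs on its own.

```python
# Data and formula repeated so this cell is self-contained.
shows = [
    {'budget': 0, 'attendance': 100},
    {'budget': 100, 'attendance': 150},
    {'budget': 200, 'attendance': 600},
    {'budget': 400, 'attendance': 700},
]

def sample_regression_formula(x):
    return 100 + 1.5 * x

def error(point):
    # actual y minus predicted y for a single show
    return point['attendance'] - sample_regression_formula(point['budget'])

errors = [error(show) for show in shows]
# errors is [0.0, -100.0, 200.0, 0.0]

total_error = sum(errors)                           # 100.0: the -100 offsets half of the 200
total_squared_error = sum(e ** 2 for e in errors)   # 50000.0: nothing cancels
```

The plain sum suggests the line is off by only 100 in total, while the squared version counts both misses in full.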

Now, given a list of points with coordinates (x, y), we can calculate the squared error of each of the points.

In [None]:
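def sample_regression_formula(x):
    return 100 + 1.5 * x

def squared_error(point):
    # assumes each point is a dict with 'budget' (x) and 'attendance' (y) keys
    predicted = sample_regression_formula(point['budget'])
    return (point['attendance'] - predicted) ** 2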

### Comparing between lines

A different line gives us a different total squared error, which we can treat as the cost of that line.

To estimate the best line, we search over the different candidate lines and try to find the one that results in the smallest residual sum of squares.
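As a minimal sketch of that search, the code below scores two lines by their residual sum of squares and then tries every line on a coarse grid of intercepts and slopes, keeping the cheapest one. The function name, the alternative line and the grid ranges are all assumptions chosen for illustration, and a brute-force grid is only one simple way to carry out the search.

```python
# Data repeated so this cell is self-contained.
shows = [
    {'budget': 0, 'attendance': 100},
    {'budget': 100, 'attendance': 150},
    {'budget': 200, 'attendance': 600},
    {'budget': 400, 'attendance': 700},
]

def residual_sum_of_squares(points, intercept, slope):
    """Sum of squared errors of the line y = intercept + slope * x."""
    return sum(
        (p['attendance'] - (intercept + slope * p['budget'])) ** 2
        for p in points
    )

rss_sample = residual_sum_of_squares(shows, 100, 1.5)   # our sample line
rss_flatter = residual_sum_of_squares(shows, 100, 1.0)  # a hypothetical alternative

# Candidate lines: intercepts 0, 10, ..., 200 and slopes 0.0, 0.1, ..., 3.0
# (an arbitrary grid, picked only to keep the example small).
candidates = [(b, m / 10) for b in range(0, 201, 10) for m in range(0, 31)]
best_intercept, best_slope = min(
    candidates,
    key=lambda line: residual_sum_of_squares(shows, line[0], line[1]),
)
best_rss = residual_sum_of_squares(shows, best_intercept, best_slope)
```

Because the sample line is itself one of the candidates, `best_rss` can never be larger than `rss_sample`.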

### Using the fitted line

Once we have a fitted line, we can use it.  Before fitting, the model is written in terms of unknown parameters:

$y_i = w_0 + w_1 x_i + \varepsilon_i$

After fitting, the estimated parameters $\hat{w}_0$ and $\hat{w}_1$ take on actual values.

These estimated parameters define a specific line.  So we can predict the value of a house by plugging its size into our fitted line, which gives us our estimated value of the house.  So $\hat{y}$ is our predicted value of the house.

The actual value is equally likely to fall above or below the prediction, and we are unsure which way the error goes, so our best guess is to put the prediction on the line.

So now that we have a fitted regression line, we can predict the value of a house with x square feet.
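Here is a hedged sketch of fitting and then using a line. It uses the standard closed-form least-squares formulas for a single input rather than a search, and the house sizes and prices are made-up numbers for illustration only. The names `fit_line` and `predict` are ours.

```python
def fit_line(xs, ys):
    """Closed-form least-squares estimates (w0_hat, w1_hat) for y = w0 + w1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    w1_hat = sxy / sxx
    w0_hat = y_mean - w1_hat * x_mean
    return w0_hat, w1_hat

def predict(x, w0_hat, w1_hat):
    # y_hat: the point on the fitted line at input x
    return w0_hat + w1_hat * x

# Hypothetical data: size in square feet and sale price in dollars.
square_feet = [1000, 1500, 2000, 2500]
prices = [200000, 290000, 410000, 500000]

w0_hat, w1_hat = fit_line(square_feet, prices)
predicted_price = predict(1800, w0_hat, w1_hat)
```

Once `w0_hat` and `w1_hat` are fixed numbers, predicting for any new house is a single multiplication and addition.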

### Interpreting the Coefficients of the Line

Let's just take a look at the first point, towards the bottom left.  That point represents the movie "21 & Over", with 13 million dollars being spent and 25.6 million earned domestically.

In [42]:
parsed_movies[0]

{'budget': 13000000, 'domgross': 25682380.0, 'title': '21 &amp; Over'}

What plotting this data shows us is that as the movie budget increases, represented by the points plotted further to the right, the movie revenue increases.  So, at least we now know something.

Ok, now imagine your movie executive friend told you that the budget that came across his desk was $30 million.  Based on the data we graphed, how much money do you think the movie would bring in?

### Drawing a line

Ok, so how are we going to do something like this?  Well, we could draw a single straight line that approximates the relationship between a movie's budget and revenue.  Below, we draw a line.  We'll worry later about how well a line like the one below models the relationship between two different variables.  For now, let's use this.

![](./plot-intersect.png)

Well, one of the benefits of using a line is that we can see how much money will be brought in for any point on this line.  Spend 50 million, and expect to bring in about 67 million.  Spend 10 million, and expect to bring in about 13 million.  This approach of modeling the relationship between an output and a variable that explains it by using a line is called **linear regression**.

Let's see if we can translate this line into a formula that will tell us the y value that corresponds to any given value of x along that line.

Let's take an initial (wrong) guess as to how to make this a formula.  And then we'll take another one.  This is our first guess.

$y = x$

Here is how we write it as a function.

In [20]:
def y(x):
    return x

y(0)

0

In [21]:
y(10000000)

10000000

What the formula is saying is that for every value of $x$ that I input to the function, I will get back an equal value $y$.  So according to the function, if the movie has a budget of 30 million, it will earn 30 million.  

Of course, this does not match the line in our chart.  The line says that spending 30 million brings predicted earnings of 40 million.  So how do we change our function?  Well, looking at the line in our chart, we can examine the x and y values at three different points.

| X        | Y           | 
| ------------- |:-------------:| 
| 0      |0 | 
| 30 million      |40 million | 
| 60 million      |80 million | 

What equation will allow us to input 0 and get back 0, input 30 million and get back 40 million, and input 60 million and get back 80 million?

Well it's $y = 4/3*x$

* 0 = 4/3 * 0
* 40 million =  4/3 * 30 million 
* 80 million = 4/3 * 60 million 

Let's see what this formula looks like in code.  Then, in the next section, we'll show how to figure out what to multiply $x$ by.

In [16]:
def y(x):
    return 4/3*x

y(30000000)

40000000.0

In [17]:
y(0)

0.0

Progress! By multiplying $x$ by a value, we can describe the line in our chart with a function that, given a value of $x$, returns the corresponding value of $y$ along that line.

In statistics, you will see this formula described as 

$y = mx$ 

With the variables standing for the following: 

* $y$: the value that is returned, also called the **response variable**, as it responds to values of $x$
* $x$: the input variable, also called the **explanatory variable**, as it explains the value of $y$
* $m$: the **slope variable**, determines how vertical or horizontal the line will be

In our movie example, these terms make sense.  The $y$ value is our money earned from the movie, which we say is in response to how much we spend.  Our explanatory variable of $x$ explains the value of $y$, and the $m$ corresponds to our value of 1.33, which determines the slope of the line.

### Calculating the slope variable 

This is our mechanism for calculating the slope $m$.  Take any two points along the straight line, then $m$ is **the ratio of the vertical distance travelled to the horizontal distance travelled**.  Or, in math, it's:

$m = \Delta y \div \Delta x $
> The $\Delta$ is the Greek letter Delta.  In math, Delta means change.  So you can read the above formula as $m$ equals change in y divided by change in x.

For example, let's take another look at our graph and our line.  Let's travel the distance from x being equal to zero to x being equal to 10 million.  Plugging the numbers into our formula, we see that for that segment:

* $\Delta x$ = 10 million
* $\Delta y$ = 13.3 million

Notice that another way to word change in x is our ending x value, 10 million, minus our starting x value, 0.  And change in y likewise means our ending y value, 13.3 million, minus our starting y value, 0.

So this means: 

* $\Delta y = y_1-  y_0$
* $\Delta x = x_1 - x_0$

And therefore we can say $m$ is the following: given a beginning point (x0, y0) and an ending point (x1, y1) along any segment of a straight line, the slope of that line $m$ equals the following:  

$m = (y_1 - y_0) \div (x_1 - x_0)$

Ok, let's apply this formula to our line.  We can choose any two points for the formula, so let's have a starting point of (30 million, 40 million) and an ending point of (60 million, 80 million). Then plugging these coordinates into our formula, we have the following:

* $m = (y_1 - y_0)\div(x_1 - x_0) = (80,000,000 - 40,000,000) \div (60,000,000 - 30,000,000) = 4/3 \approx 1.33$

![](./m-calc.png)

So that is how we calculate the slope of a line: take any two points along that line and divide the distance travelled vertically by the distance travelled horizontally.
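The slope calculation translates directly into a short function. This is a minimal sketch: the name `slope` and the choice to represent each point as an `(x, y)` tuple are ours.

```python
def slope(start, end):
    """Slope m of the straight line through two (x, y) points."""
    x0, y0 = start
    x1, y1 = end
    if x1 == x0:
        # No horizontal distance travelled, so the ratio is undefined.
        raise ValueError("slope is undefined for a vertical line")
    return (y1 - y0) / (x1 - x0)

# The two points used in the worked example above.
m = slope((30_000_000, 40_000_000), (60_000_000, 80_000_000))

# Any other pair of points on the same line gives the same slope.
m_again = slope((0, 0), (30_000_000, 40_000_000))
```

Both calls return approximately 1.33, matching the value of 4/3 worked out by hand.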

### The y intercept

Ok, there is just one more thing that we need to learn before we can describe every straight line in a two-dimensional world.  That is the y-intercept.

The y-intercept is the y value of the line when it intersects the y-axis.  Or to put it another way, the y-intercept is the value of y when x equals zero. 

![](plot-add.png)

So looking at the graph, what is the y-intercept of the blue line?  Well, it's the value of y where the blue line crosses the y-axis.  The value is zero.  Now you can imagine shifting the entire line up, so that the y-intercept increases to 20 million, and for every value of x, the corresponding value of y increases by 20 million.  So our formula is no longer y = 4/3 x.  It is y = 4/3 x + 20 million.

In statistics, you will see this as $y = mx + b$ where b is the y-intercept.  Taking a look at our chart of points on the line, we can see that 20 million is our y-intercept.

| X        | Y           | 
| ------------- |:-------------:| 
| 0      |20 million | 
| 30 million      |60 million | 
| 60 million      |100 million | 

And translating our formula into a function, we have:

In [19]:
def y(x):
    return 4/3*x + 20000000

In [20]:
y(30000000)

60000000.0

In [21]:
y(60000000)

100000000.0

The formula $y = mx + b$ can describe any line in a two dimensional space.  The $m$ value will change how flat or vertical the line is, and the $b$ value changes the starting point of the line. 
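To illustrate how $m$ and $b$ together pin down a line, here is a small sketch that builds a function for any slope and intercept. The helper name `make_line` is our own; the two lines it builds are the ones from this lesson.

```python
def make_line(m, b):
    """Return a function computing y = m * x + b for the given slope and intercept."""
    def y(x):
        return m * x + b
    return y

original = make_line(4 / 3, 0)           # the line through the origin
shifted = make_line(4 / 3, 20_000_000)   # the same slope, moved up by 20 million

at_zero = shifted(0)                     # the y-intercept: value of y when x is 0
at_thirty_million = shifted(30_000_000)  # about 60 million, as in the table above
gap = shifted(60_000_000) - original(60_000_000)  # the two lines stay 20 million apart
```

Changing only `b` moves the line up or down without changing its steepness, while changing only `m` tilts it around the same y-intercept.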

### Summary