# Introduction to Simple Linear Regression

### Introduction

In the last lesson, we took a whirlwind tour of linear regression.  We saw that by passing data to our linear regression model, we were able to make predictions.  Moreover, we were able to use the model to detect a general pattern between our input and our output.  Let's dig deeper into understanding the predictions that our linear regression model makes.

### Back to SciKit Learn

Let's return to our problem of predicting T-shirt sales based on different advertising budgets.

| ad spending | t-shirts |
| ------------- |:-------------:|
| 800 | 330 |
| 1500 | 780 |
| 2000 | 1130 |
| 3500 | 1310 |
| 4000 | 1780 |

And let's once again use this data to make predictions.

In [6]:
inputs = [800, 1500, 2000, 3500, 4000]
outcomes = [330, 780, 1130, 1310, 1780]

In [3]:
# scikit-learn expects a 2D input: one row per observation, one column per feature
sklearn_inputs = [
    [800], 
    [1500],
    [2000],
    [3500],
    [4000]
]

In [8]:
from sklearn.linear_model import LinearRegression

# create the initial model
regression = LinearRegression()
# fit the model to our ad spending (inputs) and T-shirt sales (outcomes)
regression.fit(sklearn_inputs, outcomes)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Now after we pass our data into this model, the key information that we learn is our coefficient and our intercept.

In [9]:
regression.coef_

array([0.38675261])

In [10]:
regression.intercept_

153.26385079539216
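As a rough cross-check that does not depend on scikit-learn, the textbook least-squares formulas for a single input can be sketched in plain Python. The function name `fit_simple_line` is hypothetical, and this is not necessarily how `LinearRegression` computes its result internally; it is only meant to show that the same two numbers fall out of the data.

```python
# Plain-Python sketch of ordinary least squares with one input.
# fit_simple_line is a hypothetical helper, not part of scikit-learn.
inputs = [800, 1500, 2000, 3500, 4000]
outcomes = [330, 780, 1130, 1310, 1780]

def fit_simple_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies on its own.
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    slope = covariation / variation_x
    # Intercept: chosen so the line passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_simple_line(inputs, outcomes)
print(round(slope, 4), round(intercept, 2))  # 0.3868 153.26
```

The rounded values agree with `regression.coef_` and `regression.intercept_` above.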

Let's learn why these numbers are so important.

### The simple linear regression model

To understand why these numbers are so important, we first have to understand the general form of linear regression, or specifically simple linear regression.  We have an example of simple linear regression whenever we have a model with a single input in the following form.

$$y = mx + b$$

These `coef_` and `intercept_` numbers show that our model is of that form.  Specifically, we can plug these numbers in like so (truncating each to two decimal places).

$$tshirt\_sales = .38*ad\_spend + 153.26$$

Lining up our simple linear regression formula with our example, we see that:

* $y$ corresponds to `tshirt_sales`
* $x$ corresponds to `ad_spend`
* $m$ is .38
* $b$ is 153.26
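To make the mapping concrete, here is a minimal sketch that writes the fitted model out by hand as $y = mx + b$. The function name `predict_tshirt_sales` is made up for illustration; the two constants are copied from the `coef_` and `intercept_` output shown earlier.

```python
# Coefficient and intercept copied from the fitted model's printed output.
m = 0.38675261
b = 153.26385079539216

def predict_tshirt_sales(ad_spend):
    """Hypothetical helper: apply y = m * x + b to one ad budget."""
    return m * ad_spend + b

# Predicted T-shirt sales for each ad budget in our data.
for ad_spend in [800, 1500, 2000, 3500, 4000]:
    print(ad_spend, round(predict_tshirt_sales(ad_spend), 2))
```

Calling `regression.predict(sklearn_inputs)` on the fitted scikit-learn model should give essentially the same numbers, since it applies the same formula.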

Now let's better understand the $y$, $m$ and $x$ components of simple linear regression, as these are the most important parts.  We'll discuss the $b$ component later.

### Dependent and Independent Variables

The way to understand $y = mx$ is to think of $x$ as the input and $y$ as the output.  So in our T-shirt example, we input advertising dollars spent, and the output is the T-shirt sales.

In linear regression, $y$ is called our **dependent variable**, as the output changes *depending* on the input.  Here our dependent variable is T-shirt sales, as it *depends* on how much we spend on advertising.  

In linear regression, $x$ is called the **independent variable**, as it does not depend on anything in our model.  We can plug in any value for $x$ to get an output of $y$.  In our example, advertising spending is our independent variable.  

### Understanding the coefficient

Now let's take a look at that $m$ in $y = mx + b$.  This is the number .38 in our T-shirt sales formula.  The $m$ is called a coefficient.  In math, a coefficient just means a number that is multiplied by a variable -- here our independent variable $x$.  So whatever the value of $x$, we multiply it by our coefficient $m$.

When we plot the predictions of our linear regression model as a line, our coefficient determines the slope of the line.  The larger the coefficient, the steeper our line.  Let's see this below where we plot the two linear models:

$$tshirt\_sales = .38*ad\_spend$$  

and

$$tshirt\_sales = .78*ad\_spend$$  




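In [23]:
# graph and library.hypothesis are assumed to be local helper modules shipped with this lesson
from graph import plot, trace_values
# plotly 4 and later moved this module to chart_studio.plotly
import plotly.plotly as py
from library.hypothesis import Hypothesis

# scatter trace of the observed data
data_trace = trace_values(inputs, outcomes)
# two candidate lines with slopes .38 and .78, both with an intercept of 0
first_hyp = Hypothesis(.38, 0, inputs)
second_hyp = Hypothesis(.78, 0, inputs)
py.plot([data_trace, first_hyp.trace(mode = 'lines'), second_hyp.trace(mode = 'lines')], auto_open=True)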

'https://plot.ly/~JeffKatzy/144'

If we think about it, it makes sense that the coefficient determines the steepness of our line.  After all, if we look at our formulas $tshirt\_sales = .38*ad\_spend$ and $tshirt\_sales = .78*ad\_spend$, and we plug in ad spends of 1000 and 1100, we predict the following:


| sample inputs | model 1 prediction | model 2 prediction |
| ------------- |:-------------:| :-------------:|
| 1000 | $1000*.38 = 380$ | $1000*.78 = 780$ |
| 1100 | $1100*.38 = 418$ | $1100*.78 = 858$ |

* So in the first model, for a one hundred unit increase in $x$, the predicted $y$ increases by 38.  
* And in the second model, for a one hundred unit increase in $x$, the predicted $y$ increases by 78.  

So the larger our coefficient, the larger the $y$ our model predicts for a given positive value of $x$, and accordingly the steeper the slope.
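The arithmetic in the table can be reproduced with a short sketch. The `predict` function here is a stand-in written for this example (it is not the `Hypothesis` class used for plotting), and it assumes a model with no intercept.

```python
def predict(coefficient, ad_spend):
    """Prediction from a model with no intercept: y = m * x."""
    return coefficient * ad_spend

for coefficient in [.38, .78]:
    at_1000 = predict(coefficient, 1000)
    at_1100 = predict(coefficient, 1100)
    # The change in y over a 100-unit change in x is 100 * m.
    change = at_1100 - at_1000
    print(coefficient, round(at_1000, 2), round(at_1100, 2), round(change, 2))
# 0.38 380.0 418.0 38.0
# 0.78 780.0 858.0 78.0
```

The last column is the rise over a run of 100, which is why a larger coefficient shows up as a steeper line.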

Also, notice that the coefficient is the model making a claim about the real world.  The larger the coefficient in a model, the more impact our independent variable has.  So here model 2, as opposed to model 1, is saying that advertising has more of an impact: model 2 predicts that each extra dollar spent on advertising will increase expected sales by .78, while model 1 predicts an increase of .38.

### Understanding the Y intercept

There's one more component to our line.  This is that value $b$ in our formula 

$$y = mx + b$$

The y-intercept is simply the $y$ value when our independent variable, $x$, is zero.  So in our example, it represents expected sales when our advertising budget is 0.  Let's examine this with respect to our first model.

In our first model above, $tshirt\_sales = .38*ad\_spend$.  If we let our independent variable $ad\_spend = 0$, we calculate $tshirt\_sales = .38*0 = 0$.  But in the real world, this probably isn't true.  Even with no advertising at all, the company can still get a few purchases -- even if the CEO has to beg some friends and family.  So, let's say that in reality, when there is no ad budget, T-shirt sales will equal 100.  Here's how we can represent this.

We can change our model to be the following: $tshirt\_sales = .38*ad\_spend + 100$.  So now when the advertising spend is zero, our updated model predicts $tshirt\_sales = .38*0 + 100 = 100$.

Notice that this also affects every other prediction.  Every previous prediction our model made also increases by 100.  So when we spend 1000 on advertising, we now predict sales of 480 instead of 380.  And when we spend 1100, we predict sales of 518 instead of 418.  
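A quick way to see this uniform shift is to compute both versions of the model side by side. This is only a sketch: the `predict` function and its default arguments are illustrative choices, not part of the lesson's helper code.

```python
def predict(ad_spend, coefficient=.38, intercept=0):
    """y = m * x + b, with b defaulting to 0."""
    return coefficient * ad_spend + intercept

for ad_spend in [0, 1000, 1100]:
    without_intercept = predict(ad_spend)
    with_intercept = predict(ad_spend, intercept=100)
    print(ad_spend, round(without_intercept, 2), round(with_intercept, 2))
# 0 0.0 100.0
# 1000 380.0 480.0
# 1100 418.0 518.0
```

Whatever the ad spend, the gap between the two predictions stays at 100, the value of the intercept.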

Now let's plot the models $tshirt\_sales = .38*ad\_spend$ and $tshirt\_sales = .38*ad\_spend + 100$ side by side.

In [24]:
# same slope of .38 as first_hyp, but with an intercept of 100
third_hyp = Hypothesis(.38, 100, inputs)
py.plot([data_trace, first_hyp.trace(mode = 'lines'), third_hyp.trace(mode = 'lines')], auto_open = True)

'https://plot.ly/~JeffKatzy/146'

So notice that this matches what we said above.  We said that including the value of $b$ increases every predicted number of T-shirt sales by the same amount, the value of $b$, here 100.  And that's what we see.

This is different from changing our slope $m$, which changes the steepness of the line.  

Try changing the slopes and intercepts in the code above, to see how the line changes, or feel free to add in another line with different slopes and intercepts.

### Summary

In this lesson, we learned about the simple linear regression model.  A simple linear model has one input and one output.  The input is called the independent variable, $x$, and the output is called the dependent variable, $y$.  In our example above, T-shirt sales is the dependent variable and ad spending is the independent variable.  This makes sense, as the number of T-shirt sales predicted by our model *depends* on the ad spending.  The ad spending does not depend on anything in our model, so it's independent.

We also discussed our coefficient, which is the number we multiply our independent variable by.  It is represented by $m$ in the formula $y = mx + b$.  We can interpret our coefficient as the impact that our independent variable $x$ has on our dependent variable $y$.  So in the model $tshirt\_sales = .38*ad\_spend$, our model predicts that a one dollar increase in spending increases the number of sales by .38.  The value of $m$ also determines the slope of our line -- the further $m$ is from zero, the steeper the slope.

Finally, we saw another component of our line, the intercept.  The intercept is the predicted output when our independent variable is zero.  So in our model of $tshirt\_sales = .38*ad\_spend + 153.26$, we say that even when the ad spending is zero, we expect sales of about 153.  We also noticed that the intercept increases our predicted output by that value (here about 153) for every input value.  So where we previously predicted an output of 380 with spending of 1000 dollars, the updated formula predicts 380 + 153 = 533.  And where we previously predicted sales of 418 with spending of 1100, the updated formula predicts 418 + 153 = 571.
