### Comedy Show Lab

Imagine that you are the producer of a comedy show at your school.  Use your knowledge of linear regression to predict how successful the show will be.

### Working through a linear regression 

The comedy show is trying to figure out how much money to spend on advertising in the student newspaper.  The newspaper tells the show that:
 * For every two dollars spent on advertising, three students attend the show.  
 * If no money is spent on advertising, no one will attend the show.  

Write a linear regression function called `attendance` that shows the relationship between advertising and attendance expressed by the newspaper.  

In [24]:
def attendance(advertising):
    return (3/2)*advertising

In [25]:
attendance(100) # 150

150.0

In [26]:
attendance(50) # 75

75.0

Despite what the student newspaper says, the comedy show knows from experience that they'll still have a crowd even without an advertising budget.  Some of the comedians in the show have friends (believe it or not), and twenty of those friends will show up.  Write a function called `attendance_with_friends` that models the following: 

 *  When the advertising budget is zero, 20 friends still attend
 * For every two dollars spent on advertising, three additional people attend the show. 

In [27]:
def attendance_with_friends(advertising):
    return (3/2)*advertising + 20

In [28]:
attendance_with_friends(100) # 170

170.0

In [29]:
attendance_with_friends(50) # 95

95.0

Let's plot this line so you can get a sense of what your $m$ (slope) and $b$ (y-intercept) values look like in graph form.

Our x-values can be a list of `initial_sample_budgets`.  We can use the outputs of our `attendance_with_friends` function to determine the list of `attendance_values`, the attendance at each of those x-values.

In [30]:
initial_sample_budgets = [0, 50, 100]
attendance_values = [20, 95, 170]
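
Rather than typing the attendance values by hand, the same list can be derived from the function.  The following is a minimal sketch that restates `attendance_with_friends` so it runs on its own; the name `computed_attendance_values` is introduced here for illustration only.

```python
def attendance_with_friends(advertising):
    # 3 attendees for every 2 dollars, plus the 20 friends who always come
    return (3/2)*advertising + 20

initial_sample_budgets = [0, 50, 100]

# Hypothetical name: one attendance value per budget, computed rather than hardcoded
computed_attendance_values = [attendance_with_friends(budget) for budget in initial_sample_budgets]
print(computed_attendance_values)  # [20.0, 95.0, 170.0]
```

The computed values are floats because `3/2` is a float in Python 3, but they are numerically equal to the hardcoded list.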

First we import the `plotly` library and its `graph_objs` module, and set up `plotly` to be used offline so that our plots are not uploaded to its website.

Then we plot our regression line.  The `initial_sample_budgets` are our x-values.  For our y-values, we use `attendance_values`, the output of our `attendance_with_friends` function for each of those x-values.

In [31]:
import plotly
from plotly import graph_objs
plotly.offline.init_notebook_mode(connected=True)

trace_of_attendance_with_friends = graph_objs.Scatter(
    x=initial_sample_budgets,
    y=attendance_values,
)

plotly.offline.iplot([trace_of_attendance_with_friends])

In [32]:
trace_of_attendance_with_friends

{'type': 'scatter', 'x': [0, 50, 100], 'y': [20, 95, 170]}

Now let's write a couple of functions that we can use going forward.  We'll write a function called `m_b_data` that, given the slope of a line, $m$, a y-intercept, $b$, and a list of `x_values`, returns a dictionary with a key of `x` pointing to the list of `x_values` and a key of `y` pointing to a list of `y_values`.  Each $y$ value should be the output of the regression line $y = mx + b$ for the corresponding entry in `x_values`.

In [33]:
def m_b_data(m, b, x_values):
    y_values = list(map(lambda x: m*x + b, x_values))
    return {'x': x_values, 'y': y_values}

In [34]:
m_b_data(1.5, 20, [0, 50, 100]) # {'x': [0, 50, 100], 'y': [20.0, 95.0, 170.0]}

{'x': [0, 50, 100], 'y': [20.0, 95.0, 170.0]}

Now let's write a function called `m_b_trace` that uses our `m_b_data` function to return a dictionary that includes keys of `name` and `mode` in addition to `x` and `y`.  The values of `mode` and `name` are provided as arguments.  When the `mode` argument is not provided, it has a default value of `line` and when `name` is not provided, it has a default value of `line function`.

In [35]:
def m_b_trace(m, b, x_values, mode = 'line', name = 'line function'):
    values = m_b_data(m, b, x_values)
    values.update({'mode': mode, 'name': name})
    return values

In [36]:
m_b_trace(1.5, 20, [0, 50, 100]) 
# {'mode': 'line', 'name': 'line function', 'x': [0, 50, 100], 'y': [20.0, 95.0, 170.0]}

{'mode': 'line',
 'name': 'line function',
 'x': [0, 50, 100],
 'y': [20.0, 95.0, 170.0]}
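
The call above relies on both defaults.  As a small sketch of overriding them, the snippet below restates the two functions so it runs on its own and passes `mode` and `name` as keyword arguments; the values `'markers'` and `'sample shows'` are chosen purely for illustration.

```python
def m_b_data(m, b, x_values):
    # One y-value per x-value on the line y = m*x + b
    y_values = list(map(lambda x: m*x + b, x_values))
    return {'x': x_values, 'y': y_values}

def m_b_trace(m, b, x_values, mode = 'line', name = 'line function'):
    values = m_b_data(m, b, x_values)
    values.update({'mode': mode, 'name': name})
    return values

# Illustrative argument values, not part of the lab's specification
markers_trace = m_b_trace(1.5, 20, [0, 50, 100], mode='markers', name='sample shows')
print(markers_trace['mode'], markers_trace['name'])  # markers sample shows
```

The `x` and `y` entries are unchanged by these arguments; only the two descriptive keys differ.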

### Calculating lines

The comedy show decides to use advertising with two different shows.  The attendance looks like the following.

| Budgets (dollars)        | Attendance           | 
| ------------- |:-------------:| 
| 200       |400 | 
| 400       |700 | 

In code, we represent the shows as the following:

In [91]:
first_show = {'budget': 200, 'attendance': 400}
second_show = {'budget': 400, 'attendance': 700}

Write a function called `marginal_return_on_budget` that, given these two shows, returns the expected increase in attendance for every additional dollar spent on advertising.  The function should use the formula for calculating the slope of a line.

In [92]:
def marginal_return_on_budget(first_show, second_show):
    return (second_show['attendance'] - first_show['attendance'])/(second_show['budget'] - first_show['budget'])

In [96]:
marginal_return_on_budget(first_show, second_show) # 1.5

1.5
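
The value of 1.5 follows from the slope formula, taking budget as $x$ and attendance as $y$ for the two shows:

$$ m = \frac{y_2 - y_1}{x_2 - x_1} = \frac{700 - 400}{400 - 200} = \frac{300}{200} = 1.5 $$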

In [95]:
first_show

{'attendance': 400, 'budget': 200}

Let's make sure that our function properly calculates the slope of the line with different data.

In [97]:
imaginary_third_show = {'budget': 300, 'attendance': 500}
imaginary_fourth_show = {'budget': 600, 'attendance': 900}
marginal_return_on_budget(imaginary_third_show, imaginary_fourth_show) # 1.3333333333333333

1.3333333333333333

Now we'll begin to write functions that we can use going forward.  The functions will calculate attributes of lines in general, as well as be applied to predicting the attendance of the comedy show.

Take the following data.  The comedy show spends zero dollars on advertising for the next show.  Now the attendance chart looks like the following:

| Budgets (dollars)        | Attendance           | 
| ------------- |:-------------:| 
| 0       |100 | 
| 200       |400 | 
| 400       |700 | 

In [88]:
budgets = [0, 200, 400]
attendance_numbers = [100, 400, 700]

Write a function called `y_intercept_provided`.  Given a list of `x_values` and a list of `y_values`, it should find the point with an x-value of zero and return the corresponding y-value.  If there is no x-value equal to zero, it returns `False`.

To get you started, we'll provide a function called `sorted_points` that returns the points sorted by their x-values.  The return value is a sorted list of `(x, y)` tuples.

In [84]:
def sorted_points(x_values, y_values):
    values = list(zip(x_values, y_values))
    sorted_values = sorted(values, key=lambda value: value[0])
    return sorted_values

In [85]:
sorted_points([4, 1, 6], [4, 6, 7])

[(1, 6), (4, 4), (6, 7)]

In [81]:
def y_intercept_provided(x_values, y_values):
    # Return the y-value of the first point whose x-value is zero
    for x, y in sorted_points(x_values, y_values):
        if x == 0:
            return y
    # No point sits on the y-axis
    return False

In [82]:
y_intercept_provided([200, 400], [400, 700]) # False

False

In [87]:
y_intercept_provided(budgets, attendance_numbers) # 100

100
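
One thing to watch for with this return convention: an intercept of `0` is a legitimate answer, but `0` is falsy in Python, so a plain truthiness check cannot tell it apart from `False`.  The sketch below restates the function in condensed form so it runs on its own, and compares with `is False` instead; the sample data is made up for illustration.

```python
def y_intercept_provided(x_values, y_values):
    # Condensed restatement: y-value of the first point with x == 0, else False
    for x, y in sorted(zip(x_values, y_values), key=lambda point: point[0]):
        if x == 0:
            return y
    return False

# Illustrative data where the line passes through the origin
through_origin = y_intercept_provided([0, 100], [0, 150])
missing = y_intercept_provided([200, 400], [400, 700])

print(through_origin is False)  # False: an intercept was found, and it is 0
print(missing is False)         # True: no point has an x-value of zero
```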

Now write a function called `slope` that, given a list of `x_values` and a list of `y_values`, uses the points with the lowest and highest x-values to calculate the slope of a line.

In [98]:
def slope(x_values, y_values):
    sorted_values = sorted_points(x_values, y_values)
    x1 = sorted_values[0][0]
    y1 = sorted_values[0][1]
    x2 = sorted_values[-1][0]
    y2 = sorted_values[-1][1]
    m = (y2 - y1)/(x2 - x1)
    return m

In [99]:
slope([200, 400], [400, 700]) # 1.5

1.5
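
Because the points are sorted before the endpoints are picked, the order in which they are passed in should not matter.  A condensed, self-contained restatement of `sorted_points` and `slope` illustrates this; the unsorted input is made up for the example.

```python
def sorted_points(x_values, y_values):
    # Pair up the values and sort the pairs by x-value
    return sorted(zip(x_values, y_values), key=lambda point: point[0])

def slope(x_values, y_values):
    points = sorted_points(x_values, y_values)
    (x1, y1), (x2, y2) = points[0], points[-1]  # lowest and highest x-values
    return (y2 - y1)/(x2 - x1)

print(slope([400, 200], [700, 400]))            # 1.5, same as the sorted input
print(slope([0, 200, 400], [100, 400, 700]))    # 1.5, the middle point is ignored
```

Note that if every x-value is identical, `x2 - x1` is zero and the division raises a `ZeroDivisionError`.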

Now write a function called `y_intercept`.  Its arguments are a list of x-values, a list of y-values, and an optional slope `m`; when `m` is not provided, it is calculated with the `slope` function.  The function calculates the y-intercept by extending a line with that slope back from the point with the highest x-value, and returns the resulting value of $b$.

If the provided values include a point at $x = 0$ that lies on this line, the result matches that point's y-value.

In [65]:
def y_intercept(x_values, y_values, m = None):
    sorted_values = sorted_points(x_values, y_values)
    highest = sorted_values[-1]
    # Fall back to the endpoint slope when none is supplied
    if m is None:
        m = slope(x_values, y_values)
    # Rearranging y = m*x + b gives b = y - m*x
    offset = highest[1] - m*highest[0]
    return offset

In [101]:
y_intercept([200, 400], [400, 700]) # 100

100.0

In [103]:
y_intercept([0, 200, 400], [10, 400, 700]) # 10

10.0
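
Both results come from rearranging $y = mx + b$ and substituting the point with the highest x-value.  For the first call, where $m = 1.5$:

$$ b = y - mx = 700 - 1.5 \cdot 400 = 100 $$

For the second call, the slope between the endpoints is $m = \frac{700 - 10}{400 - 0} = 1.725$, so:

$$ b = 700 - 1.725 \cdot 400 = 10 $$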

Now write a function called `build_regression_line` that given a list of `x_values` and a list of `y_values` returns a dictionary with a key of `m` and a key of `b` to return the `m` and `b` values of the calculated regression line.  Use the  `slope` and `y_intercept` functions to calculate the line.

In [68]:
def build_regression_line(x_values, y_values):
    m = slope(x_values, y_values)
    b = y_intercept(x_values, y_values, m)
    return {'m': m, 'b': b}

In [105]:
build_regression_line([0, 200, 400], [10, 400, 700]) # {'b': 10.0, 'm': 1.725}

{'b': 10.0, 'm': 1.725}
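
It is worth noting that this line is drawn through the two endpoints only, so it is not a least-squares fit and interior points may sit off the line.  The sketch below is a condensed, self-contained restatement of `build_regression_line`; the `residuals` list is introduced here for illustration and measures how far each actual attendance is from the line.

```python
def build_regression_line(x_values, y_values):
    # Condensed restatement: line through the lowest-x and highest-x points
    points = sorted(zip(x_values, y_values), key=lambda point: point[0])
    (x1, y1), (x2, y2) = points[0], points[-1]
    m = (y2 - y1)/(x2 - x1)
    b = y2 - m*x2
    return {'m': m, 'b': b}

x_values = [0, 200, 400]
y_values = [10, 400, 700]
line = build_regression_line(x_values, y_values)

# Hypothetical helper list: actual y minus the y predicted by the line
residuals = [y - (line['m']*x + line['b']) for x, y in zip(x_values, y_values)]
print(residuals)  # approximately [0, 45, 0]
```

The endpoints have residuals of about zero by construction, while the middle show at a budget of 200 drew roughly 45 more people than the line predicts.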

Now, let's write a function that will return the expected attendance given previous shows and a budget, even when the previous shows do not include a value for when x is zero.  The function will be called `regression_line_two_points` and take as inputs a list of two shows and a budget.

In [46]:
first_show = {'budget': 300, 'attendance': 700}
second_show = {'budget': 400, 'attendance': 900}

shows = [first_show, second_show]

In [110]:
def expected_value_for_line(m, b, x_value):
    return m*x_value + b

In [111]:
expected_value_for_line(1.5, 100, 100)

250.0
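
One possible shape for `regression_line_two_points` is sketched below.  It assumes the signature described above (a list of two show dictionaries with `budget` and `attendance` keys, followed by a budget) and computes the slope and intercept inline so that it runs on its own; the budget of 500 is an illustrative input, not data from the lab.

```python
def regression_line_two_points(shows, budget):
    # Assumed signature: shows is a list of exactly two show dictionaries
    first, second = shows
    # Slope between the two shows
    m = (second['attendance'] - first['attendance'])/(second['budget'] - first['budget'])
    # Intercept from rearranging y = m*x + b at the second show
    b = second['attendance'] - m*second['budget']
    return m*budget + b

first_show = {'budget': 300, 'attendance': 700}
second_show = {'budget': 400, 'attendance': 900}
shows = [first_show, second_show]

print(regression_line_two_points(shows, 500))  # 1100.0
```

With these two shows the slope is 2.0 and the intercept is 100.0, so neither show needs to have a budget of zero for the prediction to work.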