Reggie is a mad scientist who has been hired by the local fast food joint to build their newest ball pit in the play area. To optimize the pit, he is researching the bounciness of different balls. He is running an experiment in which he bounces bouncy balls of different sizes and then fits lines to the data points he records. He has heard of linear regression, but needs your help to implement a version of it in Python.

Linear Regression is when you have a group of points on a graph, and you find a line that approximately resembles that group of points. A good Linear Regression algorithm minimizes the error, or the distance from each point to the line. A line with the least error is the line that fits the data the best. We call this a line of best fit.

In this project, you’ll combine your knowledge of loops, lists, and arithmetic to create a function that will find a line of best fit when given a set of data.


The line we will end up with will have a formula that looks like:

y = m*x + b
where m is the slope of the line and b is the intercept, where the line crosses the y-axis.

Create a function called get_y() that takes in m, b, and x. It should return what the y value would be for that x on that line!

Once you have defined get_y(), test it out by uncommenting the print() statements and checking if the expected values display in the terminal.

In [2]:
def get_y(m, b, x):
    # Given a slope m, a y-intercept b, and an x coordinate,
    # return the y value of the line y = m*x + b at that x.
    return m * x + b

# Uncomment each print() statement to check your work. Each of the following should print True.
#print(get_y(1, 0, 7) == 7)
#print(get_y(5, 10, 3) == 25)
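
The == checks above work because the inputs are integers. With fractional slopes and intercepts, floating-point rounding can make an exact comparison fail (0.1 + 0.2 is not exactly 0.3 in binary floating point), so a tolerance-based check is safer. A minimal sketch, assuming the same get_y() as above; the use of math.isclose() is an addition here and not part of the original checks:

```python
import math

def get_y(m, b, x):
    # y value at x on the line y = m*x + b
    return m * x + b

# Integer inputs give exact results, so == is fine here.
print(get_y(1, 0, 7) == 7)
print(get_y(5, 10, 3) == 25)

# With float inputs, compare with a tolerance instead of ==.
y = get_y(0.1, 0.2, 1)
print(math.isclose(y, 0.3))
```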

Reggie wants to try a bunch of different m values and b values and see which line produces the least error. To calculate the error between a point and a line, he wants a function called calculate_error(), which will take in m, b, and an [x, y] point called point and return the vertical distance between the line and the point, that is, the absolute difference between the point's y value and the line's y value at the same x.

In [3]:
def calculate_error(m, b, point):
    # point is an (x, y) pair; the error is the vertical distance
    # between the point and the line y = m*x + b.
    x_point, y_point = point
    y_line = get_y(m, b, x_point)
    return abs(y_line - y_point)

# Check the output of each print() statement against the expected result.

# The line with m = 1 and b = 0 is y = x, so (3, 3) lies on it and the error should be 0:
print(calculate_error(1, 0, (3, 3)))

# the point (3, 4) should be 1 unit away from the line y = x:
print(calculate_error(1, 0, (3, 4)))

# the point (3, 3) should be 1 unit away from the line y = x - 1:
print(calculate_error(1, -1, (3, 3)))

# the point (3, 3) should be 5 units away from the line y = -x + 1:
print(calculate_error(-1, 1, (3, 3)))

0
1
1
5
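
Note that calculate_error() measures the vertical gap between the point and the line, not the shortest (perpendicular) distance. The sketch below contrasts the two for the point (3, 4) and the line y = x. The helper perpendicular_distance() is hypothetical and added only for illustration; it uses the standard point-to-line distance formula |m*x - y + b| / sqrt(m^2 + 1).

```python
import math

def get_y(m, b, x):
    return m * x + b

def calculate_error(m, b, point):
    # Vertical distance between the point and the line y = m*x + b
    x_point, y_point = point
    return abs(get_y(m, b, x_point) - y_point)

def perpendicular_distance(m, b, point):
    # Hypothetical helper: shortest distance from the point to the line,
    # using |m*x - y + b| / sqrt(m**2 + 1)
    x_point, y_point = point
    return abs(m * x_point - y_point + b) / math.sqrt(m ** 2 + 1)

point = (3, 4)
print(calculate_error(1, 0, point))         # vertical distance: 1
print(perpendicular_distance(1, 0, point))  # about 0.707, i.e. 1 / sqrt(2)
```

The two measures agree only for horizontal lines (m = 0). The rest of this project uses the vertical version.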


Great! Reggie’s datasets will be sets of points. For example, he ran an experiment comparing the width of bouncy balls to how high they bounce:

datapoints = [(1, 2), (2, 0), (3, 4), (4, 4), (5, 3)]
The first datapoint, (1, 2), means that his 1 cm bouncy ball bounced 2 meters. The 4 cm bouncy ball bounced 4 meters.

As we try to fit a line to this data, we will need a function called calculate_all_error(), which takes m and b that describe a line, and points, a set of data like the example above.

calculate_all_error() should iterate through each point in points and calculate the error from that point to the line (using calculate_error()). It should keep a running total of the error, and then return that total after the loop.

In [4]:
def calculate_all_error(m, b, points):
    # Add up the error of every point in points against the line y = m*x + b.
    total_error = 0
    for point in points:
        total_error += calculate_error(m, b, point)
    return total_error


# Check the output of each print() statement against the expected result.
datapoints = [(1, 1), (3, 3), (5, 5), (-1, -1)]

# every point in this dataset lies upon y=x, so the total error should be zero:
print(calculate_all_error(1, 0, datapoints))

# every point in this dataset is 1 unit away from y = x + 1, so the total error should be 4:
print(calculate_all_error(1, 1, datapoints))

# every point in this dataset is 1 unit away from y = x - 1, so the total error should be 4:
print(calculate_all_error(1, -1, datapoints))

# the points in this dataset are 1, 5, 9, and 3 units away from y = -x + 1, respectively, so total error should be:
# 1 + 5 + 9 + 3 = 18
print(calculate_all_error(-1, 1, datapoints))

0
4
4
18
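
The running-total loop can also be written with Python's built-in sum() and a generator expression. This is an equivalent sketch rather than the version the project asks for; the name calculate_all_error_compact() is made up for this example.

```python
def get_y(m, b, x):
    return m * x + b

def calculate_error(m, b, point):
    x_point, y_point = point
    return abs(get_y(m, b, x_point) - y_point)

def calculate_all_error_compact(m, b, points):
    # Hypothetical variant: add up the error of every point with sum()
    return sum(calculate_error(m, b, point) for point in points)

datapoints = [(1, 1), (3, 3), (5, 5), (-1, -1)]
print(calculate_all_error_compact(1, 0, datapoints))   # 0
print(calculate_all_error_compact(-1, 1, datapoints))  # 18
```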


Our next step is to find the m and b that minimize this error, and thus fit the data best!

The way Reggie wants to find a line of best fit is by trial and error. He wants to try a bunch of different slopes (m values) and a bunch of different intercepts (b values) and see which one produces the smallest error value for his dataset.

Using a list comprehension, let’s create a list of possible m values to try. Make the list possible_ms that goes from -10 to 10 inclusive, in increments of 0.1.

In [None]:
possible_ms = [i / 10 for i in range(-100, 101)]
#print(possible_ms)
# range() only accepts integers, so we loop over the integers from -100 to 100 and divide each one by 10 inside the comprehension. Dividing each element, rather than the range object itself, gives the values from -10 to 10 in steps of 0.1.

Now, let’s make a list called possible_bs that holds the values from -20 to 20 inclusive, in steps of 0.1.

In [None]:
possible_bs = [i/10 for i in range(-200, 201, 1)]
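
Dividing integers by 10 avoids the rounding error that builds up when 0.1 is added repeatedly, because 0.1 has no exact binary floating-point representation. A small sketch of the idea, where make_grid() is a hypothetical helper (not part of the project) that can build either list:

```python
def make_grid(low, high, steps_per_unit=10):
    # Hypothetical helper: values from low to high inclusive,
    # spaced 1 / steps_per_unit apart. low and high must be integers.
    return [i / steps_per_unit
            for i in range(low * steps_per_unit, high * steps_per_unit + 1)]

possible_ms = make_grid(-10, 10)
possible_bs = make_grid(-20, 20)

print(len(possible_ms), possible_ms[0], possible_ms[-1])
print(len(possible_bs), possible_bs[0], possible_bs[-1])

# For contrast: reaching 1.0 by adding 0.1 ten times picks up rounding error.
total = 0.0
for _ in range(10):
    total += 0.1
print(total == 1.0)    # False
print(10 / 10 == 1.0)  # True
```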

We are going to find the smallest error. First, we will make every possible y = m*x + b line by pairing all of the possible ms with all of the possible bs. Then, we will see which y = m*x + b line produces the smallest total error with the set of data stored in datapoints.

First, create the variables that we will be optimizing:

 - smallest_error: this should start at infinity (float("inf")) so that any error we get at first will be smaller than our value of smallest_error
 - best_m: we can start this at 0
 - best_b: we can start this at 0

We want to:

 - Iterate through each element m in possible_ms
 - For every m value, take every b value in possible_bs
 - If the value returned from calculate_all_error() on this m value, this b value, and datapoints is less than our current smallest_error, set best_m and best_b to be these values, and set smallest_error to this error

In [None]:
# Reggie's bouncy ball data: (width in cm, bounce height in meters).
# datapoints was overwritten by the test data earlier, so we set it again here.
datapoints = [(1, 2), (2, 0), (3, 4), (4, 4), (5, 3)]

# Start with an infinite error so that the first line we try is always an improvement.
smallest_error = float("inf")
best_m = 0
best_b = 0

# Try every (m, b) pair and keep the one with the smallest total error.
for m in possible_ms:
    for b in possible_bs:
        error = calculate_all_error(m, b, datapoints)
        if error < smallest_error:
            best_m = m
            best_b = b
            smallest_error = error

print(best_m)
print(best_b)
print(smallest_error)
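
The trial-and-error search can also be wrapped in a reusable function so it can be rerun on other datasets. The following is one possible sketch, not the project's required solution: find_best_line() is a hypothetical name, and it relies on itertools.product() and min() from the standard library in place of explicit nested loops.

```python
from itertools import product

def get_y(m, b, x):
    return m * x + b

def calculate_error(m, b, point):
    x_point, y_point = point
    return abs(get_y(m, b, x_point) - y_point)

def calculate_all_error(m, b, points):
    return sum(calculate_error(m, b, point) for point in points)

def find_best_line(points, possible_ms, possible_bs):
    # Hypothetical helper: try every (m, b) pair and return the pair with
    # the smallest total error, together with that error.
    best_m, best_b = min(
        product(possible_ms, possible_bs),
        key=lambda pair: calculate_all_error(pair[0], pair[1], points),
    )
    return best_m, best_b, calculate_all_error(best_m, best_b, points)

possible_ms = [i / 10 for i in range(-100, 101)]
possible_bs = [i / 10 for i in range(-200, 201)]

# Every point here lies on y = x, so the search should recover m = 1, b = 0.
m, b, error = find_best_line([(1, 1), (3, 3), (5, 5), (-1, -1)],
                             possible_ms, possible_bs)
print(m, b, error)
```

When several (m, b) pairs tie for the smallest error, which can happen with absolute error, min() keeps the first pair it encounters. That matches the strict < comparison used in a nested-loop search that visits the pairs in the same order.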



Now we have seen that for this set of observations on the bouncy balls, the line that fits the data best has an m of 0.4 and a b of 1.6:

y = 0.4x + 1.6
This line produced a total error of 5.
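
To see where the total error of 5 comes from, we can list each point's error against the line y = 0.4x + 1.6: the individual errors are 0, 2.4, 1.2, 0.8, and 0.6. A short sketch, assuming the bouncy ball dataset from earlier and the same get_y() function (repeated here so the snippet runs on its own); values are rounded for display because of floating-point arithmetic:

```python
def get_y(m, b, x):
    return m * x + b

datapoints = [(1, 2), (2, 0), (3, 4), (4, 4), (5, 3)]
m, b = 0.4, 1.6

total_error = 0
for x, y in datapoints:
    predicted = get_y(m, b, x)
    error = abs(predicted - y)
    total_error += error
    print(f"width {x}: predicted {predicted:.1f}, observed {y}, error {error:.1f}")

print(f"total error: {total_error:.1f}")
```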

Using this m and this b, what does your line predict the bounce height of a ball with a width of 6 to be? In other words, what is the output of get_y() when we call it with:

 - m = 0.4
 - b = 1.6
 - x = 6

In [None]:
new_ball = get_y(0.4, 1.6, 6)
print(new_ball)