# Linear Least Squares and the expansion of the universe!

*************
<img src="https://cdn.theatlantic.com/assets/media/img/2018/03/lead_large-1/lead_720_405.jpg?mod=1533692228" width="100" />Einstein's self-labelled "biggest blunder" came when he added a fudge factor, which he called the *cosmological constant*, to his equations. He added it because, according to his own theory of General Relativity, the universe could not stand still: it had to be expanding or contracting... but that couldn't be! So he made the problem go away with this magical constant (OOPS!).

Enter astronomer Edwin Hubble: <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/15/Studio_portrait_photograph_of_Edwin_Powell_Hubble_%28cropped%29.JPG/220px-Studio_portrait_photograph_of_Edwin_Powell_Hubble_%28cropped%29.JPG" width="70" />. He observed that, on average, distant galaxies were moving away from us! In fact, the farther away a galaxy was, the faster it was moving away from us! Einstein realized his original work was right all along, and felt silly for faking math.

Hubble saw a *linear* relationship between an object's distance $d$ and its recession velocity $v$, so Hubble's law is written as: $$v = H_0 d$$
In this activity we will measure Hubble's constant, $H_0$, using his **actual data** (so cool, right?).
************

**tl;dr**
We are doing a linear least squares regression. The first column in the file `hubble.csv` contains the *inputs* to our linear model (distances) and the second contains the *outputs* (velocities).

*Note: We will use `pandas` for easy data management, `numpy` for vectorized mathematics, and `matplotlib` for plotting.*

In [None]:
# Package imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


In [None]:
# Read in the data from the file hubble.csv

# Check out the documentation for the read_csv() function:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

# Don't get overwhelmed by the number of possible parameters; every parameter that has an "="
# sign next to it is optional. If you leave it unspecified, it will default to some sensible
# value. At least for this activity, you can leave them all as is, and just specify your
# file name (i.e., call read_csv with just one argument)

# YOUR CODE HERE



In [None]:
# To verify that you successfully read in the data, let's print out a small
# section of it: the pandas head() and tail() functions are handy for this

# YOUR CODE HERE

In [None]:
# The column names are a little unwieldy: modify them to something that's a little
# easier to work with (but still descriptive!). We'll let you Google this one

# YOUR CODE HERE

In [None]:
# Let's plot the raw data -- d along the x-axis and v along the y-axis.
# Here's a quick start guide: https://matplotlib.org/tutorials/introductory/pyplot.html
# Don't forget to label your axes and title the plot!

# YOUR CODE HERE

In [None]:
# Let's define our hypothesis function -- no need to modify
def y_pred(theta, D):
    """
    We assume theta is an array of two parameters like [22.0, 3.5]
        - theta[0] is the intercept term
        - theta[1] is the slope
    D is a vector of all the distance values from our dataset

    This function will return a vector of all our predictions on the data,
    i.e., a vector of (predicted) velocities.
    """
    # the magic of vectorized code: no loops!
    # if how this works doesn't make sense, talk to us!
    return theta[0] + theta[1] * D
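
To see why the hypothesis function needs no loop, here is a minimal sketch that compares it with an explicit per-element computation. The arrays `theta_demo` and `D_demo` are made-up values for illustration only; they are not taken from `hubble.csv`.

```python
import numpy as np

def y_pred(theta, D):
    # Same hypothesis function as above, repeated so this sketch runs on its own
    return theta[0] + theta[1] * D

# Made-up values for illustration only (not from hubble.csv)
theta_demo = np.array([1.0, 2.0])        # intercept 1, slope 2
D_demo = np.array([0.0, 0.5, 1.0, 2.0])  # four toy "distances"

# NumPy broadcasts the scalar slope and intercept across every entry of D_demo
vectorized = y_pred(theta_demo, D_demo)

# The same computation written out one element at a time
looped = np.array([theta_demo[0] + theta_demo[1] * d for d in D_demo])

print(vectorized)                       # [1. 2. 3. 5.]
print(np.allclose(vectorized, looped))  # True
```

Both versions produce the same predictions; the vectorized one simply leaves the iteration to NumPy.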

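As a reminder, gradient descent is an optimization algorithm that finds the parameters $w$ that minimize a function by repeatedly applying the update:

$$w_j \leftarrow w_j - \eta \frac{\partial J}{\partial w_j}$$

where $\eta$ is the learning rate (step size). In this problem, the parameters $w$ are the entries of `theta` (intercept and slope), and the function we want to minimize, $J(\vec w)$, is the mean squared error:

$$J(\vec w) = \frac{1}{N}\sum_{i=1}^{N} (y_{pred,i} - y_{true,i})^2$$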
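
For a straight-line model, where the prediction for the $i$-th data point is $w_0 + w_1 d_i$, differentiating the mean squared error gives $\frac{\partial J}{\partial w_0} = \frac{2}{N}\sum_i (y_{pred,i} - y_{true,i})$ and $\frac{\partial J}{\partial w_1} = \frac{2}{N}\sum_i (y_{pred,i} - y_{true,i})\, d_i$. The sketch below checks these expressions against a finite-difference estimate and then takes a single update step. It runs on synthetic data, not on Hubble's measurements; the names `mse`, `mse_gradient`, `numerical_gradient`, and the step size `eta_demo` are assumptions made for illustration.

```python
import numpy as np

def mse(w, d, y_true):
    """Mean squared error J(w) for the line w[0] + w[1] * d."""
    residual = (w[0] + w[1] * d) - y_true
    return np.mean(residual ** 2)

def mse_gradient(w, d, y_true):
    """Analytic gradient of J with respect to [w[0], w[1]]."""
    residual = (w[0] + w[1] * d) - y_true
    return np.array([2.0 * np.mean(residual), 2.0 * np.mean(residual * d)])

def numerical_gradient(w, d, y_true, eps=1e-6):
    """Central finite-difference estimate of the same gradient (sanity check only)."""
    grad = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (mse(w + step, d, y_true) - mse(w - step, d, y_true)) / (2 * eps)
    return grad

# Synthetic data for illustration only: a noisy line with intercept 3 and slope 4
rng = np.random.default_rng(0)
d_toy = np.linspace(0.0, 2.0, 20)
y_toy = 3.0 + 4.0 * d_toy + rng.normal(scale=0.1, size=d_toy.size)

w_toy = np.array([0.5, -1.0])  # arbitrary starting guess
analytic = mse_gradient(w_toy, d_toy, y_toy)
numeric = numerical_gradient(w_toy, d_toy, y_toy)
print(np.allclose(analytic, numeric, atol=1e-5))  # the two gradients should agree

# One gradient descent update with an assumed step size
eta_demo = 0.1
w_next = w_toy - eta_demo * analytic
print(mse(w_toy, d_toy, y_toy), mse(w_next, d_toy, y_toy))  # the error should decrease
```

A gradient check like this is a handy way to catch sign or scaling mistakes before wrapping the update in a loop. The step size that works on this toy data will not necessarily suit the real dataset.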

In [None]:
# Now for the main act: implement gradient descent to fit a line through the
# data. This should be a 2-parameter fit, with a slope and intercept term.
# Since the data is noisy, we'll declare convergence when the change in your
# parameters from one step to the next drops below 1e-3.

# One implementation tip: use vectorized code to compute the update equation.
# You can wrap that in a loop to control how many gradient descent steps you
# take. In other words: your solution must ultimately only contain one loop,
# not a nested loop!

# YOUR CODE HERE

In [None]:
# Print out your final fitted parameters theta



In [None]:
# Plot your line of best fit and your data.



### In the cell below, comment on your results.

Does your line appear to fit the data well? Hubble calculated an expansion rate of $H_0 \approx 500$ km/s/Mpc. Do your results agree?


<_Your response goes here._>

### A note on interpreting regression results

Consider the following regression problem: predicting the price of a house ($\hat{y}$) based on the number of bedrooms ($x_1$, an integer) and the total area of a house ($x_2$, measured in square feet). In other words, we are interested in fitting the following function:
$$ \hat{y} = \theta_0 + \theta_1 \cdot x_1 + \theta_2 \cdot x_2 $$

Once gradient descent converges, we discover that $\theta_1 = 780.0$ and $\theta_2 =  2.5$. On this basis, can we conclude that the variable $x_1$ (number of bedrooms) has a _far_ greater impact on a house's price than $x_2$ (area), since it has a much larger coefficient in our model? Why or why not?
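
One way to probe this question is to refit the same model after changing the units of one feature and compare the coefficients. The sketch below does this on synthetic data generated from a made-up price formula, and it uses `np.linalg.lstsq` rather than gradient descent to keep the example short. Every name and number in it is an assumption for illustration, not real housing data.

```python
import numpy as np

# Synthetic data for illustration only (hypothetical price formula)
rng = np.random.default_rng(0)
n = 200
bedrooms = rng.integers(1, 6, size=n).astype(float)  # 1 to 5 bedrooms
area_sqft = rng.uniform(500.0, 3500.0, size=n)       # area in square feet
price = 50_000.0 + 780.0 * bedrooms + 2.5 * area_sqft + rng.normal(scale=100.0, size=n)

def fit(x1, x2, y):
    """Least-squares fit of y = theta0 + theta1 * x1 + theta2 * x2."""
    X = np.column_stack([np.ones_like(x1), x1, x2])
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta, X @ theta

# Fit once with area in square feet, once with area in thousands of square feet
theta_sqft, pred_sqft = fit(bedrooms, area_sqft, price)
area_ksqft = area_sqft / 1000.0
theta_ksqft, pred_ksqft = fit(bedrooms, area_ksqft, price)

print("coefficients (bedrooms, area in sq ft):     ", theta_sqft[1:])
print("coefficients (bedrooms, area in 1000 sq ft):", theta_ksqft[1:])
print("same predictions:", np.allclose(pred_sqft, pred_ksqft))
```

Compare the two printed coefficient pairs and consider what changed, and what did not, when only the units of $x_2$ changed.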

<i><p style='text-align: right;'> <b>Authors:</b> Michelle Kuchera, Raghuram Ramanujan </p></i>