# Expansion velocity of the universe

In 1929, Edwin Hubble published a paper in which he compared the radial velocities of galaxies with their distances. The former can be measured quite precisely with spectroscopy; the latter are much more uncertain. His original data are here.

He saw that the velocity increases with distance and speculated that this could be the sign of a cosmological expansion. Let's find out what he did.

Load the data into an array with `numpy.genfromtxt`. You will find 6 columns:
   * `CAT`, `NUMBER`:  These two combined give you the name of the galaxy.
   * `R`: distance in Mpc
   * `V`: radial velocity in km/s
   * `RA`, `DEC`: equatorial coordinates of the galaxy 

In [None]:
import numpy as np
data = np.genfromtxt("table1.txt", names=True, dtype=None)
N = len(data)

Make a scatter plot of `V` vs `R`. Don't forget labels and units...

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt  # numpy was already imported as np in the first cell

fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(data['R'], data['V'], c='#ff9900', marker='o')
ax.set_xlim(xmin=0, xmax=2.5)
ax.set_xlabel('Distance [Mpc]')
ax.set_ylabel('Velocity [km/s]')

Use `np.linalg.lstsq` to fit a linear regression and determine the slope $H_0$ of the line $V=H_0 R$. To do so, reshape $R$ as an $N\times1$ matrix (the design matrix) and solve for one unknown parameter. Add the best-fit line to the plot.

In [None]:
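# Design matrix with a single column: the model is V = H0 * R
A = data['R'].reshape(N,1)
# rcond=None uses the machine-precision cutoff and avoids the FutureWarning of NumPy 1.x
params, _, _, _ = np.linalg.lstsq(A, data['V'], rcond=None)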
print(params)
H0 = params[0]

R = np.linspace(0,2.5,100)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(data['R'], data['V'], c='#ff9900', marker='o')
ax.plot(R, H0*R, 'k--')
ax.set_xlim(xmin=0, xmax=2.5)
ax.set_xlabel('Distance [Mpc]')
ax.set_ylabel('Velocity [km/s]')
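For a single-column design matrix the least-squares solution has a closed form, $H_0 = \sum_i R_i V_i / \sum_i R_i^2$, which can serve as a sanity check on `np.linalg.lstsq`. The sketch below uses made-up distances and velocities (not Hubble's data), and `slope_through_origin` is a hypothetical helper name.

```python
import numpy as np

def slope_through_origin(r, v):
    """Closed-form least-squares slope for the model v = h * r (no intercept)."""
    r = np.asarray(r, dtype=float)
    v = np.asarray(v, dtype=float)
    return np.sum(r * v) / np.sum(r * r)

# Made-up example values, for illustration only (not Hubble's measurements)
r_demo = np.array([0.5, 1.0, 1.5, 2.0])
v_demo = np.array([200.0, 550.0, 700.0, 1050.0])

h_closed = slope_through_origin(r_demo, v_demo)
h_lstsq = np.linalg.lstsq(r_demo.reshape(-1, 1), v_demo, rcond=None)[0][0]
print(h_closed, h_lstsq)  # the two values agree to floating-point precision
```

If the two numbers differ for the real data, the design matrix or the data vector was probably built incorrectly.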

Why is there scatter with respect to the best-fit curve? Is it fair to only fit for the slope and not also for the intercept? How would $H_0$ change if you include an intercept in the fit?

In [None]:
Ac = np.empty((N,2))
Ac[:,0] = data['R']
Ac[:,1] = 1
# Column 0 fits the slope, column 1 (all ones) fits the intercept
params_c, _, _, _ = np.linalg.lstsq(Ac, data['V'], rcond=None)
print(params_c)
H0_c = params_c[0]
intercept = params_c[1]

fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(data['R'], data['V'], c='#ff9900', marker='o')
ax.plot(R, H0*R, 'k--')
ax.plot(R, intercept + H0_c*R, 'k-.')
ax.set_xlim(xmin=0, xmax=2.5)
ax.set_xlabel('Distance [Mpc]')
ax.set_ylabel('Velocity [km/s]')

$V$ as given in the table is a combination of any assumed cosmic expansion and the motion of the Sun with respect to that cosmic frame. Generalize the model to $V=H_0 R + V_s$, where the solar velocity is given by $V_s = X \cos(RA)\cos(DEC) + Y\sin(RA)\cos(DEC)+Z\sin(DEC)$. Construct a new $N\times4$ design matrix for the four unknown parameters $H_0$, $X$, $Y$, $Z$ to account for the solar motion. Use `astropy` to convert the coordinate strings `RA` and `DEC` to floating-point coordinates in degrees. The resulting $H_0$ is Hubble's own version of the "Hubble constant". What do you get?

In [None]:
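# Ra2Deg and Dec2Deg come from a local helper module named `coordinates`, not from astropy
from coordinates import Ra2Deg, Dec2Deg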
ra = Ra2Deg(data['RA'])
dec = Dec2Deg(data['DEC'])

Ah = np.empty((N,4))
Ah[:,0] = data['R']
Ah[:,1] = np.cos(ra*np.pi/180)*np.cos(dec*np.pi/180)
Ah[:,2] = np.sin(ra*np.pi/180)*np.cos(dec*np.pi/180)
Ah[:,3] = np.sin(dec*np.pi/180)
params_h, _, _, _ = np.linalg.lstsq(Ah, data['V'], rcond=None)
print(params_h)
H0 = params_h[0]
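One way to gain confidence in the four-column design matrix is to build it for synthetic coordinates and check that known parameters are recovered. The following is a minimal sketch under stated assumptions: `solar_motion_design_matrix` is a hypothetical helper, the coordinates are random numbers in degrees, and the "true" parameters are made up.

```python
import numpy as np

def solar_motion_design_matrix(r, ra_deg, dec_deg):
    """Columns: R, cos(RA)cos(DEC), sin(RA)cos(DEC), sin(DEC)."""
    ra = np.deg2rad(np.asarray(ra_deg, dtype=float))
    dec = np.deg2rad(np.asarray(dec_deg, dtype=float))
    return np.column_stack([
        np.asarray(r, dtype=float),
        np.cos(ra) * np.cos(dec),
        np.sin(ra) * np.cos(dec),
        np.sin(dec),
    ])

# Synthetic, noise-free test case with made-up "true" parameters (H0, X, Y, Z)
rng = np.random.default_rng(0)
n_demo = 30
r_demo = rng.uniform(0.1, 2.0, n_demo)
ra_demo = rng.uniform(0.0, 360.0, n_demo)
dec_demo = rng.uniform(-90.0, 90.0, n_demo)
true_params = np.array([500.0, -70.0, 240.0, -200.0])

A_demo = solar_motion_design_matrix(r_demo, ra_demo, dec_demo)
v_demo = A_demo @ true_params
fit_params = np.linalg.lstsq(A_demo, v_demo, rcond=None)[0]
print(fit_params)  # close to true_params, since no noise was added
```

Because the synthetic velocities contain no noise, any mismatch between `fit_params` and `true_params` would point to an error in the matrix construction, for example forgetting the conversion from degrees to radians.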

Make a scatter plot of $V-V_s$ vs $R$. Add the best-fit linear regression line.

In [None]:
VS = params_h[1]*Ah[:,1] + params_h[2]*Ah[:,2] + params_h[3]*Ah[:,3]
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(data['R'], data['V'] - VS, c='#ff9900', marker='o')
ax.plot(R, H0*R, 'k-')
ax.set_xlim(xmin=0, xmax=2.5)
ax.set_xlabel('Distance [Mpc]')
ax.set_ylabel('Velocity [km/s]')

Using `astropy.units`, can you estimate the age of the universe from $H_0$? Does it make sense?

In [None]:
Mpc = 3.0857e19 # in km
Year = 60.*60.*24.*365 # in seconds
age = (H0 / Mpc * Year)**-1
age / 1e9 # in billion years
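The unit conversion behind this estimate can be wrapped in a small function and tried on round numbers. This is a sketch in plain Python rather than `astropy.units`; `hubble_time_gyr` is a hypothetical helper, a Julian year of 365.25 days is assumed, and the inputs 500 and 70 km/s/Mpc are illustrative values (roughly the order of Hubble's result and of modern measurements, respectively), not outputs of the fit above.

```python
KM_PER_MPC = 3.0857e19                  # kilometres in one megaparsec
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # Julian year (an assumption of this sketch)

def hubble_time_gyr(h0_km_s_mpc):
    """Return 1/H0 in billions of years for H0 given in km/s/Mpc."""
    h0_per_second = h0_km_s_mpc / KM_PER_MPC
    return 1.0 / h0_per_second / SECONDS_PER_YEAR / 1e9

print(hubble_time_gyr(500.0))  # roughly 2 billion years
print(hubble_time_gyr(70.0))   # roughly 14 billion years
```

Note that $1/H_0$ is only a rough age scale: it assumes the expansion rate has been constant over time.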

## Deconstructing lstsq

So far we have not incorporated any measurement uncertainties. Can you guess or estimate them from the scatter with respect to the best-fit line? You may want to look at the residuals returned by `np.linalg.lstsq`...

In [None]:
scatter = data['V'] - VS - H0*data['R']
fig = plt.figure()
ax = fig.add_subplot(111)
ax.hist(scatter, 10)
ax.set_xlabel(r'$\Delta$V [km/s]')  # raw string so that \D is not treated as an escape sequence
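The second return value of `np.linalg.lstsq` is the sum of squared residuals, which gives a quick noise estimate without building the residual vector by hand. The sketch below demonstrates this on synthetic data with a made-up slope and noise level; dividing by the number of data points minus the number of fitted parameters is one common convention, assumed here.

```python
import numpy as np

rng = np.random.default_rng(1)
n_demo, sigma_true = 200, 150.0   # made-up sample size and noise level
r_demo = rng.uniform(0.1, 2.0, n_demo)
v_demo = 500.0 * r_demo + rng.normal(0.0, sigma_true, n_demo)

A_demo = r_demo.reshape(-1, 1)
params_demo, res, rank, _ = np.linalg.lstsq(A_demo, v_demo, rcond=None)

# `res` holds the sum of squared residuals (one entry for a 1-D data vector)
manual = np.sum((v_demo - A_demo @ params_demo) ** 2)
# Noise estimate: divide by the degrees of freedom (data points minus parameters)
sigma_est = np.sqrt(res[0] / (n_demo - A_demo.shape[1]))
print(sigma_est)
```

The estimate should land near the injected noise level, which is the same logic as reading $\sigma$ off the histogram above.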

Let's see how adopting a suitable value $\sigma$ for those uncertainties would affect the estimate of $H_0$.

The problem you have solved so far is $Ax=b$, which contains no measurement errors. With errors, the corresponding equation becomes $A^\top \Sigma^{-1} Ax=A^\top \Sigma^{-1}b$, where in this case the covariance matrix is $\Sigma=\sigma^2\mathbf{1}$. This problem can still be solved with `np.linalg.lstsq`.
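For this particular choice of $\Sigma$, a short derivation using only the symbols above suggests what to expect from the exercise:

$$A^\top \Sigma^{-1} A x = A^\top \Sigma^{-1} b \;\Longrightarrow\; \frac{1}{\sigma^2} A^\top A x = \frac{1}{\sigma^2} A^\top b \;\Longrightarrow\; A^\top A x = A^\top b$$

The factor $1/\sigma^2$ cancels, leaving the normal equations of the unweighted problem. A single common $\sigma$ therefore should not shift the best-fit parameters, although it does set the scale of their uncertainties. This cancellation would not hold if the data points had different uncertainties.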

Construct the modified design matrix and data vector and get a new estimate of $H_0$. Has it changed? Use `np.dot`, `np.transpose`, and `np.linalg.inv` (or their shorthands).

In [None]:
error = scatter.std()
Sigma = error**2*np.eye(N)
Sigma_inv = np.linalg.inv(Sigma)  # invert once and reuse
Ae = np.dot(Ah.T, np.dot(Sigma_inv, Ah))
be = np.dot(Ah.T, np.dot(Sigma_inv, data['V']))
params_e, _, _, _ = np.linalg.lstsq(Ae, be, rcond=None)
print(params_e)

Compute the parameter covariance matrix $S=(A^\top \Sigma^{-1} A)^{-1}$ and read off the variance of $H_0$. Update your plot to illustrate that uncertainty.

In [None]:
S = np.linalg.inv(Ae)
dH0 = np.sqrt(S[0,0])
print(dH0)
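With $\Sigma=\sigma^2\mathbf{1}$ the parameter covariance reduces to $S=\sigma^2 (A^\top A)^{-1}$, and for a one-column design matrix the slope error becomes $\sigma/\sqrt{\sum_i R_i^2}$. The sketch below checks this on made-up numbers; `parameter_covariance` is a hypothetical helper, not part of NumPy.

```python
import numpy as np

def parameter_covariance(A, sigma):
    """S = (A^T Sigma^-1 A)^-1 for Sigma = sigma^2 * identity."""
    A = np.asarray(A, dtype=float)
    return sigma**2 * np.linalg.inv(A.T @ A)

# Made-up distances and noise level, for illustration only
r_demo = np.array([0.5, 1.0, 1.5, 2.0])
sigma_demo = 100.0

S_demo = parameter_covariance(r_demo.reshape(-1, 1), sigma_demo)
slope_error = np.sqrt(S_demo[0, 0])
print(slope_error)  # equals sigma / sqrt(sum(R**2)) for a one-column design matrix
```

The error scales linearly with $\sigma$ and shrinks as more, or more distant, galaxies are added to the sample.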

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(data['R'], data['V'] - VS, c='#ff9900', marker='o')
ax.plot(R, H0*R, 'k-')
ax.plot(R, (H0-dH0)*R, 'k--')
ax.plot(R, (H0+dH0)*R, 'k--')
ax.set_xlim(xmin=0, xmax=2.5)
ax.set_xlabel('Distance [Mpc]')
ax.set_ylabel('Velocity [km/s]')

How large is the relative error? Would that help with the problematic age estimate above?

In [None]:
age = ((H0 - dH0)/ Mpc * Year)**-1
age / 1e9 # in billion years

Compare the noise-free result from above (Hubble's result) with $SA^\top \Sigma^{-1}b$. Did adopting errors change the result?

In [None]:
params_h, _, _, _ = np.linalg.lstsq(Ah, data['V'], rcond=None)
print(params_h)
print(np.dot(S, be))