# Problem 1

### Getting familiar with loading, visualizing and operating on multivariate data

A Hess diagram is a 2D plot coded by a third quantity.

Here's an example from the [astroML documentation](https://www.astroml.org/examples/datasets/plot_SDSS_SSPP.html) showing the temperature and surface gravity (how strongly the star would pull on you if you could stand on its surface, including the effect of rotation), color coded on the left by density, and on the right by metallicity (measured here as the ratio of iron to hydrogen) with density contours.


<img src="https://www.astroml.org/_images/plot_SDSS_SSPP_1.png">

From http://das.sdss.org/va/stripe_82_variability/SDSS_82_public/, download one of the HLC\*fits.gz files and load it with `astropy.io.fits`.

Each HLC file covers a different strip of the sky. The `MEAN_PSFMAG` field has five columns, one per photometric band: u, g, r, i, z.

You should separate stars and galaxies by using the `MEAN_OBJECT_TYPE` column and making a cut with value > 5 to find stars (and similarly <=5 to find galaxies).

For this problem, make a 3-panel plot for the stars and another for the galaxies.

In the left plot show `r` vs `g-i` color coded by density (I would suggest numpy.histogram2d and matplotlib.pyplot.pcolormesh, but you can use whatever). Plot the contours on top.
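As a rough illustration of the suggested `numpy.histogram2d` plus `pcolormesh` route, here is a minimal sketch. It uses synthetic Gaussian data as a stand-in for the catalog, so the arrays `color` and `mag`, the bin counts, and the contour levels are all assumptions chosen for illustration, not values from the HLC files.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in data (NOT the HLC catalog).
rng = np.random.default_rng(0)
color = rng.normal(1.0, 0.5, 20000)  # plays the role of g - i
mag = rng.normal(20.0, 1.0, 20000)   # plays the role of r

# Count objects in each (color, magnitude) bin.
counts, xedges, yedges = np.histogram2d(color, mag, bins=(60, 60))

fig, ax = plt.subplots()
# histogram2d returns counts[x_bin, y_bin]; pcolormesh wants [y, x], so transpose.
mesh = ax.pcolormesh(xedges, yedges, counts.T, cmap="viridis")

# Contours are evaluated at the bin centers rather than the bin edges.
xcen = 0.5 * (xedges[:-1] + xedges[1:])
ycen = 0.5 * (yedges[:-1] + yedges[1:])
ax.contour(xcen, ycen, counts.T, levels=5, colors="white", linewidths=0.8)

ax.invert_yaxis()  # smaller magnitudes (brighter objects) at the top
ax.set_xlabel("g - i")
ax.set_ylabel("r")
fig.colorbar(mesh, ax=ax, label="objects per bin")
```

The transpose is the step that is easiest to get wrong: without it the two axes are silently swapped.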

In the middle and right panel, again plot `r` vs `g-i`, but now color-coded by proper motion in RA, and Dec.
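One way to color-code by a third quantity is to plot its mean in each bin. The sketch below does this with two calls to `numpy.histogram2d`: one weighted by the quantity and one unweighted. The data are synthetic, and the array `pm` is a hypothetical stand-in for a proper-motion column, whose real name you will need to look up in the FITS header.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in data (NOT the HLC catalog).
rng = np.random.default_rng(1)
color = rng.normal(1.0, 0.5, 20000)  # plays the role of g - i
mag = rng.normal(20.0, 1.0, 20000)   # plays the role of r
pm = rng.normal(0.0, 5.0, 20000)     # hypothetical proper-motion values

# Unweighted histogram gives the counts; weighted one gives the sum of pm per bin.
counts, xedges, yedges = np.histogram2d(color, mag, bins=(40, 40))
pm_sum, _, _ = np.histogram2d(color, mag, bins=[xedges, yedges], weights=pm)

# Mean per bin; empty bins stay NaN so they are left blank in the plot.
mean_pm = np.full(counts.shape, np.nan)
filled = counts > 0
mean_pm[filled] = pm_sum[filled] / counts[filled]

fig, ax = plt.subplots()
mesh = ax.pcolormesh(xedges, yedges, mean_pm.T, cmap="coolwarm", vmin=-5, vmax=5)
ax.invert_yaxis()
ax.set_xlabel("g - i")
ax.set_ylabel("r")
fig.colorbar(mesh, ax=ax, label="mean proper motion per bin")
```

A diverging colormap with symmetric limits keeps zero proper motion at the neutral color, which makes systematic offsets easier to spot.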

Comment on the difference in structure in stars vs galaxies.

```python
from astropy.io import fits
import numpy as np

hdul = fits.open('data/HLC.RA_03_to_04.fits.gz')
#hdul.info()
#hdul[1].header

# The catalog table is in extension 1; MEAN_PSFMAG has one column per band (u, g, r, i, z).
ugrizData = hdul[1].data['MEAN_PSFMAG']
objectType = hdul[1].data['MEAN_OBJECT_TYPE']
print(ugrizData.shape)
#print(objectType.shape)
#print(objectType)

# Stars have MEAN_OBJECT_TYPE > 5; everything else is treated as a galaxy.
starIndeces = np.nonzero(objectType > 5)
galaxyIndeces = np.nonzero(objectType <= 5)
starData = ugrizData[starIndeces]
galaxyData = ugrizData[galaxyIndeces]
print(starData.shape)
print(galaxyData.shape)
```

```
(128785, 5)
(51133, 5)
(77652, 5)
```


# Problem 2

### Comparing distributions to a standard normal distribution

Load the `IntroStat_demo.csv` file in the data directory (use `pandas` or `astropy` or whatever you like).

Estimate the sample mean and variance of the suspiciously named `mag.outlier` column.

Make a Q-Q plot (I suggest `statsmodels.graphics.gofplots.qqplot`) of the `mag.outlier` column and overplot a line with `Y = Mean + Sigma*X` on it.
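To show what a Q-Q plot actually compares, here is a sketch that builds one by hand instead of calling `qqplot`. It uses synthetic data (a Gaussian with a small clump of outliers), not the CSV file. The plotting positions `(i - 0.5) / n` are one common convention and an assumption here; a library routine may use slightly different ones.

```python
from statistics import NormalDist

import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for a magnitude column (NOT IntroStat_demo.csv).
rng = np.random.default_rng(2)
data = np.concatenate([rng.normal(18.5, 0.1, 500), rng.normal(17.5, 0.1, 10)])

mean = data.mean()
sigma = data.std(ddof=1)  # sample standard deviation; the sample variance is sigma**2

# Theoretical standard-normal quantiles at plotting positions (i - 0.5) / n.
n = data.size
probs = (np.arange(1, n + 1) - 0.5) / n
theo = np.array([NormalDist().inv_cdf(p) for p in probs])
sample = np.sort(data)  # sample quantiles are just the sorted data

fig, ax = plt.subplots()
ax.plot(theo, sample, ".", label="sample quantiles")
ax.plot(theo, mean + sigma * theo, label="Y = Mean + Sigma*X")
ax.set_xlabel("standard normal quantile")
ax.set_ylabel("sample quantile")
ax.legend()
```

Using `ddof=1` gives the unbiased sample variance. NumPy's default `ddof=0` divides by `n` instead of `n - 1`.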

Calculate the values of the first and third quartiles, and use some linear algebra to work out the equation of the line passing through them (search for the "two-point form" of a line if you need a refresher).

Overplot that line passing through the data.
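The two-point construction can be sketched as follows, again on synthetic data rather than the CSV file. The x-coordinates are the standard-normal quartiles and the y-coordinates are the sample quartiles. The variable names are illustrative only.

```python
from statistics import NormalDist

import numpy as np

# Synthetic stand-in data (NOT IntroStat_demo.csv): Gaussian core plus outliers.
rng = np.random.default_rng(3)
data = np.concatenate([rng.normal(18.5, 0.1, 5000), rng.normal(17.5, 0.1, 100)])

# Sample quartiles (y) and the matching standard-normal quartiles (x).
y1, y3 = np.percentile(data, [25, 75])
x1 = NormalDist().inv_cdf(0.25)
x3 = NormalDist().inv_cdf(0.75)

# Two-point form: y - y1 = slope * (x - x1).
slope = (y3 - y1) / (x3 - x1)
intercept = y1 - slope * x1

print(f"quartile line:   Y = {intercept:.3f} + {slope:.3f} * X")
print(f"mean/sigma line: Y = {data.mean():.3f} + {data.std(ddof=1):.3f} * X")
```

Because quartiles depend only on the central half of the data, this slope is much less sensitive to a few extreme points than the sample standard deviation is.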

Now try the same thing with `mag.het`. Describe what you find.

What happens if you rescale the data? Subtract off the sample mean of `mag.het` and divide by `mag.het.error`. Now repeat the Q-Q plot with this quantity.
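The rescaling itself is one line. The sketch below applies it to synthetic data in which each point is drawn with its own uncertainty. That is an assumption about how such a column might be generated, not a statement about the actual file. The helper `excess_kurtosis` is defined here for illustration.

```python
import numpy as np

# Synthetic stand-in data (NOT IntroStat_demo.csv).
rng = np.random.default_rng(4)
n = 5000
err = rng.uniform(0.02, 0.5, n)             # hypothetical per-point uncertainties
mag = 18.5 + rng.normal(0.0, 1.0, n) * err  # each point scattered by its own error

# Rescale: subtract the sample mean, divide by the per-point error.
z = (mag - mag.mean()) / err


def excess_kurtosis(x):
    """Fourth standardized moment minus 3 (zero for a Gaussian)."""
    d = x - x.mean()
    return np.mean(d**4) / np.mean(d**2) ** 2 - 3.0


print("std of rescaled data:", z.std(ddof=1))
print("excess kurtosis, raw:", excess_kurtosis(mag))
print("excess kurtosis, rescaled:", excess_kurtosis(z))
```

Comparing the tail statistic before and after rescaling is a numerical counterpart to comparing the two Q-Q plots by eye.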

Describe what's going on.

# Problem 3

The demo data set for this part is the Wesenheit index of the OGLE-III fundamental-mode and first overtone classical Cepheids. 

These stars are awesome because you can use them to measure distances. Here's a nice [youtube video](https://www.youtube.com/watch?v=iyisAjHdhas) on these stars.

You'll try to estimate their period-luminosity relationship. 

The Wesenheit index is defined as `W = I - 1.55(V - I)`, and its main advantage over using simply the I or V photometry is that it is insensitive to extinction. It is denoted by 'W' among the data columns. 
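The extinction insensitivity can be checked with a little arithmetic. The sketch below assumes that the extinction in I is 1.55 times the reddening E(V-I), which is the ratio built into the coefficient of the definition. The magnitudes and reddening values are made up for illustration.

```python
import numpy as np

R = 1.55  # coefficient from the definition W = I - 1.55 (V - I)


def wesenheit(i_mag, v_mag):
    """Wesenheit index W = I - R * (V - I)."""
    return i_mag - R * (v_mag - i_mag)


# Hypothetical unreddened magnitudes of one star.
i_true, v_true = 15.0, 15.8

# Apply increasing amounts of reddening E(V-I).
e_vi = np.array([0.0, 0.1, 0.3])
a_i = R * e_vi     # assumed extinction in I: A_I = 1.55 * E(V-I)
a_v = a_i + e_vi   # since E(V-I) = A_V - A_I

w_obs = wesenheit(i_true + a_i, v_true + a_v)
print(w_obs)  # same value for every reddening
```

The observed I and V both get fainter, but the extra term `A_I - 1.55 * E(V-I)` cancels, so `W` is unchanged under this assumed extinction law.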

Other columns are 'name', the identifier of the star; 'RA0' (in decimal hours) and 'Decl0' (in decimal degrees), celestial coordinates; 'Mode', the mode of the Cepheid ('F' indicates fundamental-mode, '1' indicates first overtone star); 'Cloud', indicating which Magellanic Cloud the star belongs to; 'logP1', the base-10 logarithm of the period in days; 'VI', the colour V-I.

Split the data into LMC and SMC, and then again by mode F and 1, and plot `W` on the y-axis vs `logP1` on the x-axis.
Fit or estimate a straight line for each of the four samples (you can use `statsmodels`, `astropy`, `scipy`, `numpy`, ...).
(Yes, we've not covered fitting straight lines. That's OK.)
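A minimal sketch of the split-and-fit step using boolean masks and `numpy.polyfit`. The data are synthetic: the labels `"LMC"` and `"SMC"` in the `cloud` array, the slope, and the offsets are assumptions made up for the example, so check the actual values in the file.

```python
import numpy as np

# Synthetic stand-in data (NOT the OGLE-III table).
rng = np.random.default_rng(5)
n = 400
cloud = rng.choice(["LMC", "SMC"], n)  # assumed labels for the 'Cloud' column
mode = rng.choice(["F", "1"], n)       # 'F' fundamental, '1' first overtone
logP1 = rng.uniform(0.0, 1.5, n)

# Hypothetical linear relation with a different zero point per sample.
true_slope, true_zero = -3.3, 16.0
W = (true_zero + true_slope * logP1
     + 0.5 * (cloud == "SMC") - 0.5 * (mode == "1")
     + rng.normal(0.0, 0.08, n))

# Fit W = slope * logP1 + intercept separately in each of the four samples.
fits_by_sample = {}
for c in ("LMC", "SMC"):
    for m in ("F", "1"):
        sel = (cloud == c) & (mode == m)
        slope, intercept = np.polyfit(logP1[sel], W[sel], deg=1)
        fits_by_sample[(c, m)] = (slope, intercept)
        print(f"{c} mode {m}: W = {intercept:.3f} + {slope:.3f} * logP1")
```

`np.polyfit` returns coefficients from the highest power down, so for `deg=1` the slope comes first and the intercept second.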

Compute the residuals of each sample to its respective line. Do these residuals look like a normal distribution? If not, speculate on why (WATCH THE YOUTUBE VIDEO!)
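Besides a histogram or Q-Q plot, a few summary numbers give a quick check of normality. The sketch below computes residuals for one synthetic sample with purely Gaussian scatter, so it shows what the diagnostics look like when the answer is "yes". The real data may behave differently.

```python
import numpy as np

# Synthetic stand-in for one sample (NOT the OGLE-III table).
rng = np.random.default_rng(6)
logP1 = rng.uniform(0.0, 1.5, 1000)
W = 16.0 - 3.3 * logP1 + rng.normal(0.0, 0.1, 1000)

# Residuals about the fitted line.
slope, intercept = np.polyfit(logP1, W, deg=1)
resid = W - (intercept + slope * logP1)

# Standardize, then compare simple moments with their Gaussian expectations.
z = (resid - resid.mean()) / resid.std(ddof=1)
skewness = np.mean(z**3)                # 0 for a Gaussian
excess_kurtosis = np.mean(z**4) - 3.0   # 0 for a Gaussian
frac_1sigma = np.mean(np.abs(z) < 1.0)  # about 0.683 for a Gaussian

print("skewness:", skewness)
print("excess kurtosis:", excess_kurtosis)
print("fraction within 1 sigma:", frac_1sigma)
```

Large departures from these reference values in any of the four samples are a hint that something other than measurement noise contributes to the scatter.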

Plot the positions of the stars in RA and Dec, color coded by whether their residuals are positive or negative (similar in spirit to the Hess diagram in Problem 1). What do you see?
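A sketch of the sign-coded sky plot, using random positions and residuals as placeholders. The coordinate ranges below are invented for the example and are not the actual footprint of either Cloud.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in data (NOT the OGLE-III table).
rng = np.random.default_rng(7)
ra = rng.uniform(4.5, 6.0, 500)       # hypothetical RA in decimal hours
dec = rng.uniform(-72.0, -66.0, 500)  # hypothetical Dec in decimal degrees
resid = rng.normal(0.0, 0.1, 500)     # placeholder residuals

# Split by the sign of the residual and plot each group in its own color.
positive = resid > 0
fig, ax = plt.subplots()
ax.scatter(ra[positive], dec[positive], s=8, color="tab:red", label="residual > 0")
ax.scatter(ra[~positive], dec[~positive], s=8, color="tab:blue", label="residual <= 0")
ax.invert_xaxis()  # sky convention: RA increases to the left
ax.set_xlabel("RA (hours)")
ax.set_ylabel("Dec (degrees)")
ax.legend()
```

With random residuals the two colors are mixed uniformly. Any spatial pattern in the real data is therefore worth a comment.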