# ASTR 596: FDS Homework 6+7: Gaussian Processes (200 pts)

### This is a double HW set so you get extra time - until reading day (May 4th, 2023) at noon to do it. 
### After that, it's finals time. 


# P1. Gaussian Processes

### Last HW, you worked on finding periodic planet signals in the light curve of Kepler-90, a star that is photometrically stable. The periodogram worked nicely because 

### a) we cleaned the light curve to squelch red noise
### b) the signals really were periodic and we could implicitly make a strong assumption about the covariance between points.

### Life gets harder when the star itself has quasi-periodic variations because it has a magnetic field and is rotating (ruh oh...) 

In [1]:
%matplotlib notebook
%pylab

from astropy.table import Table
import scipy.stats as st
import sklearn
import sklearn.ensemble

Using matplotlib backend: nbAgg
Populating the interactive namespace from numpy and matplotlib


In [2]:
tab = Table.read('KIC2157356.txt',format='ascii')
tab['quarter'] = tab['quarter'].astype('int')
tab

time,flux,error,quarter
float64,float64,float64,int64
539.4710179205795,5221.16455078125,4.891251087188721,6
539.4914521464307,5216.07958984375,4.892819404602051,6
539.5118864718097,5225.43359375,4.891888618469238,6
539.5323205971945,5233.111328125,4.892074108123779,6
539.5527548221144,5221.17333984375,4.891964435577393,6
539.573189147035,5220.09521484375,4.891523361206055,6
539.5936232714812,5222.14208984375,4.893854141235352,6
539.6140574957026,5224.57958984375,4.893648147583008,6
539.6344918194518,5223.78564453125,4.894421100616455,6
539.6549259432068,5231.61669921875,4.894259929656982,6


In [3]:
qs = sorted(np.unique(tab['quarter']))
plt.figure()
means = []
cycle_map = {}
for i, q in enumerate(qs):
    ind = tab['quarter']==q
    t = tab[ind]
    plt.errorbar(t['time'],t['flux'], yerr=t['error'], marker='.', linestyle='None', alpha=0.01)
    meanflux = np.mean(t['flux'])
    cycle_map[q] = ind
    means.append(meanflux)
    if i == 0:
        plt.axhline(meanflux, label='m', color='grey', ls=":")
    else:

        vmin = means[0]
        vmax = meanflux

        plt.plot((t['time'][0], t['time'][0]), (vmin, vmax), label=rf'$c_{i}$', color=f'C{i}', ls='--') 
    
plt.xlabel('Time')
plt.ylabel('Flux')
plt.legend(frameon=False);



### As you can see there is some kind of periodic signal, but it's not perfectly regular. There are also the usual offsets between Kepler photometry in different cycles.

### You'll need four constants ($m, c_1, c_2, c_3$) to renormalize the flux to the first cycle, as illustrated in the figure above. 
### $m$ specifies the mean of the Gaussian process, while $c_1, c_2, c_3$ are nuisance parameters. 

### You know how to implement a model with one common zeropoint and multiple offsets - this was what you did on your midterm.
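
### A minimal sketch of that mean model, assuming the quarter label of every point is known. The function name `offset_mean` and the toy quarter labels and values are made up for illustration; they are not part of the assignment.

```python
import numpy as np

def offset_mean(quarter, m, offsets):
    """Piecewise-constant mean: m in the first quarter, m + c_k in later ones."""
    quarter = np.asarray(quarter)
    qs = np.unique(quarter)                  # sorted unique quarter labels
    c = np.concatenate(([0.0], offsets))     # the first quarter defines the zeropoint
    if len(c) != len(qs):
        raise ValueError("need one offset per quarter after the first")
    # searchsorted maps each quarter label to its position in qs
    return m + c[np.searchsorted(qs, quarter)]

# Toy example: four quarters, three offsets (illustrative numbers only)
quarter = np.array([6, 6, 7, 7, 8, 9])
mean = offset_mean(quarter, m=5200.0, offsets=[430.0, 700.0, 650.0])
print(mean)
```

### Subtracting this mean from the flux puts every quarter on the zeropoint of the first one.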


### You'll also need some model to describe the quasi-periodic oscillations. There's no good way to write down a model for these in real space because stellar magnetic fields are incredibly complicated. 

### Instead we'll write down a model for the covariance between the observations and use a Gaussian process to model the star. You can model quasi-periodic correlation structure as something periodic + something that varies the periodicity smoothly:

## $$k(t_i, t_j) = A\cdot \exp\left(-\Gamma_1\cdot \sin^2\left(\frac{\pi}{P}|t_i - t_j|\right) -  \frac{|t_i-t_j|^2}{\lambda} \right) $$

### This is another 4 parameters ($A, \Gamma_1, P, \lambda$), for a total of 8: ($m, c_1, c_2, c_3, A, \Gamma_1, P, \lambda$)
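
### To see what this covariance looks like, here is a plain-NumPy sketch of the kernel exactly as written above. The function name and the parameter values are illustrative assumptions, and `george` parametrizes its kernels differently (if I read its docs right, `ExpSquaredKernel` is $\exp(-r^2/2)$, so its metric would play the role of $\lambda/2$; double-check before comparing numbers).

```python
import numpy as np

def quasi_periodic_kernel(t1, t2, A, gamma1, P, lam):
    """k(t_i, t_j) = A * exp(-gamma1 * sin^2(pi |dt| / P) - dt^2 / lam)."""
    dt = np.abs(np.subtract.outer(t1, t2))           # |t_i - t_j| for every pair
    periodic = -gamma1 * np.sin(np.pi * dt / P) ** 2
    decay = -dt ** 2 / lam                           # lets the periodicity drift
    return A * np.exp(periodic + decay)

# Illustrative values only: amplitude 100, period 10 d, decay scale lam = 400 d^2
t = np.linspace(0.0, 30.0, 61)
K = quasi_periodic_kernel(t, t, A=100.0, gamma1=2.0, P=10.0, lam=400.0)
print(K.shape, K[0, 0], K[0, 20])
```

### Points separated by exactly one period are still strongly correlated, but less than perfectly, because the squared-exponential term decays with the lag.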


<hr>

### Q1: To implement the GP correlations, use the `george` package to construct this quasi-periodic kernel
https://george.readthedocs.io/en/latest/user/kernels/


### In particular, you should be able to combine `ExpSine2Kernel` and `ExpSquaredKernel` to get a model for the quasi-periodic oscillations. (20 pts)

In [272]:
import numpy as np
import matplotlib.pyplot as plt
from george import kernels
from george.modeling import Model
import george

In [292]:
def kernel_func(log_const, gamma, logP, lam, bounds_arr):
    kernel_const = kernels.ConstantKernel(log_const, bounds=[bounds_arr[0]])
    kernel_sine2 = kernels.ExpSine2Kernel(gamma=gamma, log_period=logP, bounds=[bounds_arr[1], bounds_arr[2]])
    kernel_sq = kernels.ExpSquaredKernel(lam, bounds=[bounds_arr[3]])
    
    kernel = kernels.Product(kernel_const, kernels.Product(kernel_sine2, kernel_sq))
    return kernel

### Q2: To implement the full model, read how to use `george`'s modeling protocol: (20 pts)
https://george.readthedocs.io/en/latest/tutorials/model/

In [354]:
class QuasiPeriodic_mean(Model):
    parameter_names = ("m", "c1", "c2", "c3")
    
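    def get_value(self, x):
        # NOTE: the quarter labels come from the global `tab`, thinned the same way
        # as the data ([::10]), so `x` must be that thinned time grid.
        c_arr = np.array([0, self.c1, self.c2, self.c3])  # first quarter is the zeropoint
        mean_arr = np.zeros(len(x))
        q_tmp = np.array(tab['quarter'])[::10]
        for i, q in enumerate(qs):
            ind = q_tmp==q
            mean_arr[ind] = self.m + c_arr[i]
        return mean_arr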


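def QuasiPeriodic_GP(m, c1, c2, c3, log_const, gamma, logP, lam, bounds_arr):
    kernel = kernel_func(log_const, gamma, logP, lam, bounds_arr)
    # bounds_arr[4:] holds the (m, c1, c2, c3) ranges; passing them to the mean model
    # makes gp.log_prior() enforce the priors on the mean parameters as well
    mean_bounds = dict(zip(QuasiPeriodic_mean.parameter_names, map(tuple, bounds_arr[4:])))
    mean_model = QuasiPeriodic_mean(m, c1, c2, c3, bounds=mean_bounds)
    gp = george.GP(kernel, mean=mean_model, fit_mean=True)
    return gp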

### Q3: With your model and likelihood constructed, write down priors on the parameters (you should be able to estimate from the plots) (20 pts)
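
### One simple choice is a flat (top-hat) prior inside a plausible range for each parameter. A minimal sketch follows; the function name `log_flat_prior` and the ranges are illustrative assumptions, standing in for values you would read off the plot.

```python
import numpy as np

def log_flat_prior(theta, bounds):
    """Log of a top-hat prior: 0 inside every (low, high) range, -inf otherwise."""
    theta = np.asarray(theta, dtype=float)
    lo, hi = np.asarray(bounds, dtype=float).T
    inside = np.all((theta > lo) & (theta < hi))
    return 0.0 if inside else -np.inf

# Illustrative ranges for (m, c1, c2, c3)
bounds = [(5100, 5300), (380, 480), (650, 750), (600, 700)]
inside = log_flat_prior([5200, 430, 700, 650], bounds)
outside = log_flat_prior([5200, 500, 700, 650], bounds)   # c1 is out of range
print(inside, outside)
```

### Adding this to the log-likelihood gives the log-posterior up to a constant, and walkers that step outside the box are rejected.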

In [389]:
bounds_log_const = (5, 10) # log(A)
bounds_logP = (0, 3)
bounds_gamma = (1, 3)
bounds_lam = (0, 4)
bounds_m = (5100, 5300)
bounds_c1 = (380, 480)
bounds_c2 = (650, 750)
bounds_c3 = (600, 700)

### Q4: Use `emcee` to optimize the model parameters and hyper-parameters, **using only every 10th sample in time**
### (Don't go overboard with the number of walkers or steps) (20 pts)
https://george.readthedocs.io/en/latest/tutorials/hyper/ may help 

In [390]:
import emcee
import corner
from multiprocessing import Pool

In [391]:
bounds_arr = np.array([bounds_log_const, bounds_gamma, bounds_logP, bounds_lam, \
                       bounds_m, bounds_c1, bounds_c2, bounds_c3])

gp = QuasiPeriodic_GP(5200., 430., 700., 650., 7., 2., 2., 10., bounds_arr)
gp.compute(np.array(tab['time'])[::10],  yerr=np.array(tab['error'])[::10])

def lnprob(p):
    gp.set_parameter_vector(p)
    return gp.log_likelihood(np.array(tab['flux'])[::10], quiet=True) + gp.log_prior()

In [392]:
import time

initial = gp.get_parameter_vector()
ndim, nwalkers = len(initial), 16
# start the walkers in a tiny ball around the initial guess
p0 = initial + (1e-8)*np.random.randn(nwalkers, ndim)

with Pool() as pool:
    sampler = emcee.EnsembleSampler(nwalkers, ndim, lnprob, pool=pool)
    start = time.time()
    nburn = 100
    state = sampler.run_mcmc(p0, nburn, progress=True)
    sampler.reset()  # discard the burn-in samples
    nstep = 1000
    # continue from the end of the burn-in instead of restarting at p0
    _ = sampler.run_mcmc(state, nstep, progress=True)
    end = time.time()
    print("Multiprocessing took %.1fseconds"%(end - start))

100%|██████████| 100/100 [04:09<00:00,  2.50s/it]
100%|██████████| 1000/1000 [28:59<00:00,  1.74s/it]

Multiprocessing took 1993.1seconds





In [393]:
samples = sampler.flatchain

In [394]:
labels = [r'$m$', r'$c_1$', r'$c_2$', r'$c_3$', r'$\log\,A$', r'$\Gamma$', r'$\log\,P$', r'$\lambda$']

fig = corner.corner(samples, labels=labels, show_titles=True, smooth=1, label_kwargs={"fontsize": 12})


### Q5: Plot your posterior model over the data after correcting for the offsets, showing the points you used to condition the GP in red, and the remaining data in black.  (20 pts)
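
### For intuition, here is what "conditioning the GP" on every 10th point amounts to, written out in NumPy. This is a sketch under simplifying assumptions: a zero mean, a plain squared-exponential kernel standing in for the quasi-periodic one, and synthetic data. The names `sq_exp` and `gp_predict` are made up here and are not `george` functions.

```python
import numpy as np

def sq_exp(t1, t2, amp, ell):
    """Squared-exponential covariance, amp * exp(-dt^2 / (2 ell^2))."""
    dt = np.subtract.outer(t1, t2)
    return amp * np.exp(-0.5 * dt**2 / ell**2)

def gp_predict(t_train, y_train, yerr, t_test, amp, ell):
    """Predictive mean and covariance of a zero-mean GP at t_test."""
    K = sq_exp(t_train, t_train, amp, ell) + np.diag(yerr**2)   # data covariance
    Ks = sq_exp(t_test, t_train, amp, ell)                      # test-train covariance
    Kss = sq_exp(t_test, t_test, amp, ell)                      # test-test covariance
    mean = Ks @ np.linalg.solve(K, y_train)
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
    return mean, cov

# Synthetic light curve (illustrative only)
rng = np.random.default_rng(0)
t_all = np.linspace(0.0, 20.0, 200)
y_all = np.sin(2 * np.pi * t_all / 7.0) + 0.05 * rng.standard_normal(t_all.size)
yerr = np.full(t_all.size, 0.05)

cond = slice(None, None, 10)   # every 10th point conditions the GP
mean, cov = gp_predict(t_all[cond], y_all[cond], yerr[cond], t_all, amp=1.0, ell=2.0)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
print(np.sqrt(np.mean((mean - y_all) ** 2)))
```

### The held-out points (the other 90%) are the honest check: if the kernel is sensible, the predictive mean should track them even though they were never used.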

In [396]:
fig, axes = plt.subplots(1, 1, figsize=(15, 5))


for s in samples[np.random.randint(len(samples), size=24)]:
    gp.set_parameter_vector(s)
    gp_line = gp.sample_conditional(np.array(tab['flux'][::10]), np.array(tab['time'][::10]))
    quarter_tmp = np.array(tab['quarter'])[::10]
    gp_line -= s[0]
    for i, q in enumerate(qs):
        ind = quarter_tmp==q
        if i !=0 :
            gp_line[ind] -= s[i]
    axes.plot(tab['time'][::10], gp_line, color='g', alpha=.1, zorder=1)

params_median = np.median(samples, axis=0)
normalized_flux = np.array(tab['flux'])-params_median[0]
for i, q in enumerate(qs):
    if i != 0:  
        ind = tab['quarter']==q
        normalized_flux[ind] -= params_median[i]
axes.errorbar(tab['time'], normalized_flux, yerr=tab['error'], \
              fmt='.', ls='', c='k', alpha=1, zorder=2)
axes.errorbar(tab['time'][::10], normalized_flux[::10], yerr=tab['error'][::10], \
              fmt='.', ls='', c='r', alpha=1, zorder=3)
    
axes.set_xlabel(r'Time')
axes.set_ylabel(r'Flux')

plt.show()


# P2. Random Forests

For this work, we'll use the datasets produced by [Dey et al. (2022)](https://ui.adsabs.harvard.edu/abs/2022MNRAS.515.5285D/abstract), who trained a deep capsule network on postage stamps of SDSS galaxies to predict photometric redshifts. 

We're not going to use a deep capsule network on postage stamps, but we can use tabular data. This won't be as performant, but it's still instructive to see how well we can do with a simple random forest. Dey et al. have done an excellent job making their data available - http://d-scholarship.pitt.edu/42023/ (all of it)

You will need the [training set](http://d-scholarship.pitt.edu/42023/9/cat_train.csv) and the [test set](http://d-scholarship.pitt.edu/42023/8/cat_test.csv).

I suggest reading through Sec. 2 of the paper to get some sense of what the data is. Importantly, the data includes columns for photometric redshift already. You can't use these to train your random forest (duh.). I've limited the number of columns you can use to a set defined below. If you use more than these (e.g. the GalaxyZoo parameters) you might get better performance at the cost of a smaller training sample because you've also got to filter missing data. 

In [15]:
import pandas as pd

In [212]:
train_data = pd.read_csv('./cat_train.csv')
test_data = pd.read_csv('./cat_test.csv')

In [213]:
train_cols = ['dered_petro_u', 'dered_petro_g', 'dered_petro_r', 'dered_petro_i', 'dered_petro_z',\
       'petroMagErr_u', 'petroMagErr_g', 'petroMagErr_r', 'petroMagErr_i', 'petroMagErr_z',\
       'v_disp', 'sersicN_r', 'petroR90_r']
pred_cols  = ['bestObjID', 'z', 'zErr', 'zphot', 'dzphot']

### Q6. Pre-process the data

All ML work involves some amount of cleaning and pre-processing the data.
Keep only the rows that have `zphot_reliable` == `True` and redshifts and photo-zs >= 0. 
Next, remove any row in which a `train_cols` value deviates by more than 5 $\times$ the normally-scaled Median Absolute Deviation (as described in Sec 2.3) (`scipy.stats.median_abs_deviation` is your friend). 
Your pre-processed training data should have 357397 entries.
Make a hexbin plot of `zphot` vs `z` for the training data (to avoid plotting that many points) but replicate Fig. 3 in Dey et al. 
(35 pts)
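
A minimal NumPy sketch of this kind of clipping, assuming the deviation is measured from each column's median and that the "normal" scaling is the usual factor of 1.4826. The function name `nmad_clip_mask` and the toy data are made up for illustration; check Sec 2.3 for the exact recipe used in the paper.

```python
import numpy as np

def nmad_clip_mask(X, nsigma=5.0):
    """True for rows where every column is within nsigma NMADs of its median."""
    X = np.asarray(X, dtype=float)
    med = np.nanmedian(X, axis=0)
    # 1.4826 * MAD estimates the standard deviation for Gaussian data
    nmad = 1.4826 * np.nanmedian(np.abs(X - med), axis=0)
    with np.errstate(invalid="ignore"):
        ok = np.abs(X - med) <= nsigma * nmad   # NaN compares False, so it is dropped
    return ok.all(axis=1)

# Toy table: 1000 rows, 3 columns, one gross outlier and one missing value
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
X[0, 1] = 50.0
X[1, 2] = np.nan
keep = nmad_clip_mask(X)
print(keep.sum(), "of", len(keep), "rows kept")
```

Building one boolean mask over all columns at once avoids re-indexing after every column, which is where off-by-one mistakes tend to creep in.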

In [214]:
from scipy.stats import median_abs_deviation

ind_train_prep = np.where((train_data.zphot_reliable==True)&(train_data.z>=0)&(train_data.zphot>=0))[0]

for i, col_name in enumerate(train_cols):
    col_data = train_data[col_name][ind_train_prep]
    ind_isnan = np.isnan(col_data)
    col_data = col_data[~ind_isnan]
    # keep values within 5 normally-scaled MADs of the column median
    nmad = median_abs_deviation(col_data, scale='normal')
    ind_train_mad = np.where(np.abs(col_data-np.median(col_data))<=5*nmad)[0]
    ind_train_prep = ind_train_prep[~ind_isnan][ind_train_mad]

In [228]:
fig, axes = plt.subplots(1, 1, figsize=(7, 5))

# signed normalized residual, Delta z / (1 + z_spec)
delta_z = (train_data.zphot[ind_train_prep]-train_data.z[ind_train_prep]) / (1 + train_data.z[ind_train_prep])
# normally-scaled MAD of the normalized residuals
sigma_nmad = st.median_abs_deviation(delta_z, scale='normal')
f_out = len(np.where(np.abs(delta_z)>0.05)[0])/len(delta_z)

pl = axes.hexbin(train_data.z[ind_train_prep], train_data.zphot[ind_train_prep], gridsize=200, mincnt=1, cmap='viridis')
fig.colorbar(pl, ax=axes, label='Number of galaxies per pixel')
axes.text(0.02, 0.95, r'$\sigma_{NMAD}$=%.5f'%sigma_nmad, transform=axes.transAxes, fontsize=12)
axes.text(0.02, 0.89, r'$f_{\rm outliers}$=%.3f'%(f_out*100)+r'%', transform=axes.transAxes, fontsize=12)
axes.text(0.02, 0.82, r'$\frac{\Delta z}{1+z_{\rm spec}}$=%.5f'%np.mean(delta_z), transform=axes.transAxes, fontsize=12)
axes.plot([0, .5], [0, .5], c='silver', ls='-')
axes.plot([0., .5], [0.05, .55], c='silver', ls='--')
axes.plot([0.05, .55], [0., .5], c='silver', ls='--')
axes.set_xlabel(r'$z$')
axes.set_ylabel(r'$z_{\rm phot}$')
axes.set_xlim(0, 0.35)
axes.set_ylim(0, 0.35)
plt.show()


### Q7. Train the forest(s)

Using `n_estimators` (i.e. number of trees) in (5, 20, 50, 200, 500), train a random forest. You can use all the cores your CPU has with `n_jobs=-1`. Limit the maximum number of features at each branch with `sqrt`. Use the inverse variance of the redshifts as your sample weights. Plot the `oob_score` vs the number of trees. For each of the forests you trained, make a plot of the feature importances. (35 pts)
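
Why the out-of-bag (OOB) score works as a built-in validation estimate: each tree is trained on a bootstrap resample, so roughly a fraction $e^{-1} \approx 0.37$ of the training rows are never seen by that tree and can be used to score it. The short simulation below illustrates only that bootstrap property; it does not train any trees, and the sample and tree counts are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_trees = 2000, 200

oob_fractions = []
for _ in range(n_trees):
    # bootstrap draw: n_samples row indices sampled with replacement
    idx = rng.integers(0, n_samples, size=n_samples)
    in_bag = np.zeros(n_samples, dtype=bool)
    in_bag[idx] = True
    oob_fractions.append(1.0 - in_bag.mean())   # rows this "tree" never saw

mean_oob = np.mean(oob_fractions)
print(mean_oob, np.exp(-1))
```

With only a handful of trees, some rows are out-of-bag for very few of them, so the OOB score is noisy; that is one reason to expect it to improve and then flatten as the number of trees grows.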

In [229]:
from sklearn.ensemble import RandomForestRegressor

Xtrain = np.array([train_data[colname][ind_train_prep] for colname in train_cols]).T
ytrain = train_data.z[ind_train_prep]

In [230]:
ntree_arr = np.array([5, 20, 50, 200, 500])
oob_score_arr = np.zeros(len(ntree_arr))
forest_arr = []

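# inverse-variance sample weights from the spectroscopic redshift errors
# (assumes zErr > 0 for every retained row)
wtrain = 1.0 / np.asarray(train_data.zErr[ind_train_prep])**2

for i,ntree in enumerate(ntree_arr):
    rf_reg = RandomForestRegressor(n_estimators=ntree, random_state=0, n_jobs=-1, max_features='sqrt', oob_score=True)
    rf_reg.fit(Xtrain, ytrain, sample_weight=wtrain)
    oob_score_arr[i] = rf_reg.oob_score_
    forest_arr.append(rf_reg)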



In [234]:
fig, axes = plt.subplots(1, 1, figsize=(5,4))
axes.plot(ntree_arr, oob_score_arr, marker='o', ls='-', c='k')
axes.set_xlabel(r'$n_{\rm tree}$')
axes.set_ylabel('OOB score')
plt.show()


In [243]:
fig, axes = plt.subplots(2, 3, figsize=(15,10))

for i, ax in enumerate(axes.flat[:len(ntree_arr)]):
    ax.bar(train_cols, forest_arr[i].feature_importances_)
    ax.set_title(r'$n_{\rm tree}$=%d'%ntree_arr[i])
    # restyle the existing tick labels rather than resetting them
    plt.setp(ax.get_xticklabels(), rotation=45, ha='right', fontsize=6)
    ax.set_ylabel('feature importance')
axes.flat[-1].axis('off')  # 2x3 grid, but only five forests



### Q8. Test.

Pick your best performing forest from Q7. Load the test data (remember to apply any cuts you did to the training data). Use your random forest to predict the photo-z. Replicate Fig. 3 and Fig. 4 with your photo-z prediction *and* the photo-z prediction from SDSS included in the file. (30 pts: 10 pts for the prediction, 10 for each of the two figures)
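
The numbers printed on these figures can be wrapped in one helper so both predictions are scored the same way. This is a sketch assuming the conventional definitions: the normalized residual $\Delta z/(1+z_{\rm spec})$, its mean as the bias, 1.4826 times its MAD as $\sigma_{\rm NMAD}$, and an outlier cut at 0.05. The function name `photoz_metrics` and the synthetic redshifts are made up for illustration; check the paper for its exact definitions.

```python
import numpy as np

def photoz_metrics(z_spec, z_phot, outlier_cut=0.05):
    """Bias, sigma_NMAD, and outlier fraction of dz = (z_phot - z_spec) / (1 + z_spec)."""
    z_spec = np.asarray(z_spec, dtype=float)
    z_phot = np.asarray(z_phot, dtype=float)
    dz = (z_phot - z_spec) / (1.0 + z_spec)
    sigma_nmad = 1.4826 * np.median(np.abs(dz - np.median(dz)))
    return {
        "bias": np.mean(dz),
        "sigma_nmad": sigma_nmad,
        "f_outlier": np.mean(np.abs(dz) > outlier_cut),
    }

# Synthetic check: Gaussian scatter of 0.01 in dz should be recovered as sigma_NMAD
rng = np.random.default_rng(3)
z_spec = rng.uniform(0.0, 0.3, size=5000)
z_phot = z_spec + 0.01 * (1.0 + z_spec) * rng.standard_normal(z_spec.size)
metrics = photoz_metrics(z_spec, z_phot)
print(metrics)
```

Calling it once with your forest's prediction and once with the catalog `zphot` column gives a like-for-like comparison.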

In [244]:
ind_test_prep = np.where((test_data.zphot_reliable==True)&(test_data.z>=0)&(test_data.zphot>=0))[0]

for i, col_name in enumerate(train_cols):
    col_data = test_data[col_name][ind_test_prep]
    ind_isnan = np.isnan(col_data)
    col_data = col_data[~ind_isnan]
    # keep values within 5 normally-scaled MADs of the column median
    nmad = median_abs_deviation(col_data, scale='normal')
    ind_test_mad = np.where(np.abs(col_data-np.median(col_data))<=5*nmad)[0]
    ind_test_prep = ind_test_prep[~ind_isnan][ind_test_mad]

In [245]:
Xtest = np.array([test_data[colname][ind_test_prep] for colname in train_cols]).T
ytest = test_data.z[ind_test_prep]

# use the forest with the highest out-of-bag score from Q7
ypred = forest_arr[np.argmax(oob_score_arr)].predict(Xtest)

In [270]:
from scipy.stats import norm

# signed normalized residual, Delta z / (1 + z_spec)
delta_z = ((ypred-ytest) / (1 + ytest))
# normally-scaled MAD of the normalized residuals
sigma_nmad = st.median_abs_deviation(delta_z, scale='normal')
f_out = len(np.where(abs(delta_z)>0.05)[0])/len(delta_z)

fig, axes = plt.subplots(1, 2, figsize=(14,5))
pl = axes[0].hexbin(ytest, ypred, gridsize=200, mincnt=1, cmap='viridis')
fig.colorbar(pl, ax=axes[0], label='Number of galaxies per pixel')
axes[0].text(0.02, 0.95, r'$\sigma_{NMAD}$=%.5f'%sigma_nmad, transform=axes[0].transAxes, fontsize=12)
axes[0].text(0.02, 0.89, r'$f_{\rm outliers}$=%.3f'%(f_out*100)+r'%', transform=axes[0].transAxes, fontsize=12)
axes[0].text(0.02, 0.82, r'$\frac{\Delta z}{1+z_{\rm spec}}$=%.5f'%np.mean(delta_z), transform=axes[0].transAxes, fontsize=12)

axes[0].plot([0, .5], [0, .5], c='silver', ls='-')
axes[0].plot([0., .5], [0.05, .55], c='silver', ls='--')
axes[0].plot([0.05, .55], [0., .5], c='silver', ls='--')
axes[0].set_xlabel(r'$z_{\rm spec}$ ')
axes[0].set_ylabel(r'$z_{\rm phot}$ from my prediction')
axes[0].set_xlim(ytest.min(), ytest.max())
axes[0].set_ylim(ypred.min(), ypred.max())


axes[1].hist(delta_z, bins=50, range=(-.1, .1), edgecolor='k', facecolor='w', density=True)
mu, std = norm.fit(delta_z)
x_arr = np.linspace(-.1, .1, 100)
axes[1].plot(x_arr, norm.pdf(x_arr, mu, std))
axes[1].set_xlabel(r'$\frac{\Delta z}{1+z}$')
axes[1].set_ylabel(r'relative frequency')
axes[1].set_xlim(-.1, .1)
plt.show()


In [271]:
from scipy.stats import norm

# signed normalized residual, Delta z / (1 + z_spec), for the catalog photo-z
delta_z = ((test_data.zphot[ind_test_prep]-ytest) / (1 + ytest))
# normally-scaled MAD of the normalized residuals
sigma_nmad = st.median_abs_deviation(delta_z, scale='normal')
f_out = len(np.where(abs(delta_z)>0.05)[0])/len(delta_z)

fig, axes = plt.subplots(1, 2, figsize=(14,5))
pl = axes[0].hexbin(ytest, test_data.zphot[ind_test_prep], gridsize=200, mincnt=1, cmap='viridis')
fig.colorbar(pl, ax=axes[0], label='Number of galaxies per pixel')
axes[0].text(0.02, 0.95, r'$\sigma_{NMAD}$=%.5f'%sigma_nmad, transform=axes[0].transAxes, fontsize=12)
axes[0].text(0.02, 0.89, r'$f_{\rm outliers}$=%.3f'%(f_out*100)+r'%', transform=axes[0].transAxes, fontsize=12)
axes[0].text(0.02, 0.82, r'$\frac{\Delta z}{1+z_{\rm spec}}$=%.5f'%np.mean(delta_z), transform=axes[0].transAxes, fontsize=12)

axes[0].plot([0, .5], [0, .5], c='silver', ls='-')
axes[0].plot([0., .5], [0.05, .55], c='silver', ls='--')
axes[0].plot([0.05, .55], [0., .5], c='silver', ls='--')
axes[0].set_xlabel(r'$z_{\rm spec}$ ')
axes[0].set_ylabel(r'$z_{\rm phot}$ from SDSS (catalog zphot)')
axes[0].set_xlim(ytest.min(), ytest.max())
axes[0].set_ylim(test_data.zphot[ind_test_prep].min(), test_data.zphot[ind_test_prep].max())


axes[1].hist(delta_z, bins=50, range=(-.1, .1), edgecolor='k', facecolor='w', density=True)
mu, std = norm.fit(delta_z)
x_arr = np.linspace(-.1, .1, 100)
axes[1].plot(x_arr, norm.pdf(x_arr, mu, std))
axes[1].set_xlabel(r'$\frac{\Delta z}{1+z}$')
axes[1].set_ylabel(r'relative frequency')
axes[1].set_xlim(-.1, .1)
plt.show()
