### *Name: (Your name here)*

## Lab 10 - Application of Bayesian Statistics to TESS and K2

*Written by ASTR 200 alum William Balmer for their Independent Study on Bayesian Analysis Techniques.*

Please also watch the accompanying video introduction (posted to Slack).

### Import Statements

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# begin by importing necessary modules
import numpy as np
import pandas as pd
from sklearn.neighbors import KernelDensity 
from scipy.stats import gaussian_kde
from astropy import units as u

# plotting imports
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline

# import two real datasets: exoplanet properties from the TESS satellite and from Kepler

TESS = pd.read_csv('/content/drive/Shareddrives/ASTR 200 S21/Labs/Lab10/TESS_planets.csv', skiprows=80)
KEPL = pd.read_csv('/content/drive/Shareddrives/ASTR 200 S21/Labs/Lab10/Kepler_planets.csv', skiprows=54)

### Background

According to Wikipedia, "the Kepler telescope is a retired space telescope launched by NASA to discover Earth-size planets orbiting other stars... After nine years of operation, the telescope's reaction control system fuel was depleted, and NASA announced its retirement on October 30, 2018." The telescope helped astronomers and a community of scientists confirm over 2000 planet detections. You can read about the Kepler mission [here](https://www.nasa.gov/mission_pages/kepler/overview/index.html).

Kepler's spiritual successor, TESS, is the "Transiting Exoplanet Survey Satellite... a space telescope... designed to search for exoplanets using the transit method in an area 400 times larger than that covered by the Kepler mission." Data from TESS is still being beamed down to Earth, but a large number of exoplanets have already been confirmed using TESS. You can read more about the TESS mission [here](https://www.nasa.gov/content/about-tess).
    
Both Kepler and TESS discover planets using the transit method: they measure how the light from a star dims as a planet orbiting that star passes in front of it (transits). Bigger planets closer to their host stars will therefore be easier to detect; the survey is biased towards these detections. We might expect, then, that the data we collect from Kepler or TESS might not be representative of the total exoplanet population.
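
The detection bias described above can be made concrete with two standard back-of-the-envelope relations, neither of which appears in the lab itself: the transit depth is roughly (R_planet / R_star)^2, and the chance that a randomly oriented circular orbit is seen to transit is roughly R_star / a. The sketch below assumes a Sun-like host star, circular orbits, and rounded radii; the function names are made up for illustration.

```python
# Rounded reference values in kilometres (assumptions for this sketch).
R_SUN_KM = 695_700.0
R_JUPITER_KM = 71_492.0
R_EARTH_KM = 6_378.0
AU_KM = 149_597_870.7


def transit_depth(planet_radius_km, star_radius_km=R_SUN_KM):
    """Fractional dimming while the planet's disk covers part of the star."""
    return (planet_radius_km / star_radius_km) ** 2


def transit_probability(semi_major_axis_au, star_radius_km=R_SUN_KM):
    """Approximate chance that a randomly oriented circular orbit transits."""
    return star_radius_km / (semi_major_axis_au * AU_KM)


for name, radius in [("Jupiter-size", R_JUPITER_KM), ("Earth-size", R_EARTH_KM)]:
    print(f"{name} planet: transit depth = {transit_depth(radius):.2e}")

for a_au in (0.05, 1.0):
    print(f"a = {a_au} au: transit probability = {transit_probability(a_au):.4f}")
```

A larger planet gives a deeper dip and a closer orbit gives a higher chance of alignment, which is why both surveys preferentially find big, close-in planets.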

### Background

Let's use Bayesian statistics to see how our understanding of the properties of the exoplanet population has changed, using Kepler data as an informed prior and TESS data as our new observations. First, pick one of the following four properties of an exoplanet: radius, orbital semi-major axis (the distance between it and the star), orbital period (how long it takes to orbit the star), or temperature. Check the units of your chosen parameter in both datasets. In a comment, state which parameter you chose and what units it has in each dataset.</div>


In [None]:
TESS.columns

In [None]:
KEPL.columns

In [None]:
#let's compare semi-major axis
kep_a = KEPL["koi_sma"].values
kep_a

In [None]:
tess_a = TESS["pl_orbsmax"].values

In [None]:
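# print both means (a notebook cell only displays its last bare expression)
print('Kepler mean:', np.nanmean(kep_a))
print('TESS mean:', np.nanmean(tess_a))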

### The prior

<div class="sidebar">
A prior is a continuous distribution, but we have a set of discrete data points. What we'll need to do next is construct a prior for our chosen parameter using our Kepler data. Luckily, you learned to do this in Lab 9. Set a variable we'll call "prior_data" equal to an array containing our data, and then construct a continuous distribution modeled on your data using a KDE, just as you did in Lab 9.
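
As a reminder of what a KDE does, here is a minimal from-scratch sketch: it places a Gaussian of width `bandwidth` on every data point and averages them. The sample values and the helper names (`gaussian_kde_pdf`, `count_modes`) are invented for illustration and are not the lab's data or the Lab 9 code. It uses only the standard library, so it is slower than the `KernelDensity` class imported above.

```python
import math


def gaussian_kde_pdf(samples, grid, bandwidth):
    """Evaluate a Gaussian KDE of `samples` at each point of `grid`."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        for x in grid
    ]


def count_modes(pdf):
    """Count local maxima of a gridded PDF (a flat-topped peak counts once)."""
    return sum(
        1 for i in range(1, len(pdf) - 1)
        if pdf[i - 1] < pdf[i] >= pdf[i + 1]
    )


# Made-up sample with two clumps, standing in for log10 values of a parameter.
samples = [-1.3, -1.1, -1.0, -0.9, -0.7, 0.7, 0.9, 1.0, 1.1, 1.3]
step = 0.01
grid = [-4.0 + step * i for i in range(801)]

narrow = gaussian_kde_pdf(samples, grid, bandwidth=0.2)
wide = gaussian_kde_pdf(samples, grid, bandwidth=1.5)

print("modes with bandwidth 0.2:", count_modes(narrow))
print("modes with bandwidth 1.5:", count_modes(wide))
print("area under the narrow KDE:", round(sum(narrow) * step, 3))
```

The same data can look bimodal or unimodal depending on the bandwidth, so it is worth trying more than one value before reading structure into a KDE.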

In [None]:
# assign your exoplanet property Kepler data to a variable here
# you will want to use the pandas function "to_numpy()" so that you can
# easily feed your variable to your kernel density estimator and do your
# further manipulations
kep_a_nonan = kep_a[~np.isnan(kep_a)]  # keep only the non-NaN entries

#print max and min
print('min', np.min(kep_a_nonan))
print('max', np.max(kep_a_nonan))

In [None]:
# the min and max suggest we should work in log space
kep_a_nonan_log = np.log10(kep_a_nonan)

#print max and min
print('min', np.min(kep_a_nonan_log))
print('max', np.max(kep_a_nonan_log))

In [None]:
# create a PDF using a KDE here, and plot it
x_d = np.arange(-3, 3, 0.01)  # grid of log10(semi-major axis) values
kde = KernelDensity(bandwidth=0.2, kernel='gaussian')
kde.fit(kep_a_nonan_log[:, None])
logkepprob = kde.score_samples(x_d[:, None])  # score_samples returns the log-density
kep_prob = np.exp(logkepprob)
plt.fill_between(x_d, kep_prob, alpha=0.5)
plt.xlabel('log(semi-major axis)')
plt.ylabel('probability density')

### The likelihood

Now we have a PDF (probability density function) for our prior. We know that our posterior will be proportional to our prior multiplied by the likelihood of our data. Now, let's construct a likelihood function for our data.
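
Written out in symbols, this proportionality is Bayes' theorem. The notation is an assumption introduced here rather than taken from the lab: $\theta$ stands for the value of the chosen parameter and $D$ for the new observations.

$$
P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)} \;\propto\; P(D \mid \theta)\, P(\theta)
$$

Here $P(\theta)$ is the prior (the Kepler KDE), $P(D \mid \theta)$ is the likelihood (in this lab, the KDE built from the TESS data stands in for it), and $P(D)$ is a constant that only rescales the result so that it integrates to 1.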

<div class="hw">
(a) Like you did for your prior data, assign the variable "data" to the TESS data for your given parameter. Make sure you transform the units of this data into the same units as the prior. Then, create a PDF for this data using a KDE, and check that the resulting distribution looks correct.

In [None]:
# assign your exoplanet property TESS data a variable here
tess_a_nonan = tess_a[~np.isnan(tess_a)]  # keep only the non-NaN entries
tess_a_nonan_log = np.log10(tess_a_nonan)

#print max and min
print('min', np.min(tess_a_nonan_log))
print('max', np.max(tess_a_nonan_log))

In [None]:
# create a PDF using a KDE here, and plot it
x_d = np.arange(-3, 3, 0.01)  # same grid as the prior so the curves can be multiplied
kde = KernelDensity(bandwidth=0.15, kernel='gaussian')
kde.fit(tess_a_nonan_log[:, None])
logtessprob = kde.score_samples(x_d[:, None])  # score_samples returns the log-density
tess_prob = np.exp(logtessprob)
plt.fill_between(x_d, tess_prob, alpha=0.5, color='m')
plt.xlabel('log(semi-major axis)')
plt.ylabel('probability density')

<div class="hw">
(b) Describe what is qualitatively different about your TESS data and the prior Kepler data. Why might the distribution have changed between the two series of observations? Is this distribution unimodal like the Kepler data? How might your classification of a distribution as unimodal, bimodal, or multimodal depend on your KDE parameters, like bandwidth?</div>
    
**Answer:**

### The posterior

Now, your task is to construct a posterior distribution that is the product of your prior and your likelihood.
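
A minimal sketch of this multiplication on a grid is shown below. It uses two made-up Gaussian curves in place of the Kepler and TESS KDEs, and every name in it is illustrative. Gaussians are convenient because the answer can be checked by hand: for a prior of N(0, 1) and a likelihood of N(1, 0.5), the posterior mean should be 0.8.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


dx = 0.01
grid = [-5.0 + (i + 0.5) * dx for i in range(1000)]  # cell midpoints on [-5, 5]

# Stand-in curves (assumptions): a broad prior and a narrower likelihood.
prior = [normal_pdf(x, 0.0, 1.0) for x in grid]
likelihood = [normal_pdf(x, 1.0, 0.5) for x in grid]

# Multiply pointwise, then rescale so the posterior integrates to 1.
unnormalized = [p * l for p, l in zip(prior, likelihood)]
evidence = sum(unnormalized) * dx
posterior = [u / evidence for u in unnormalized]

posterior_area = sum(posterior) * dx
posterior_mean = sum(x * p for x, p in zip(grid, posterior)) * dx
print(f"posterior area = {posterior_area:.3f}")
print(f"posterior mean = {posterior_mean:.3f}")
```

Dividing by the plain sum instead of `sum * dx` makes the grid values sum to 1 rather than integrate to 1. Either choice works for comparing shapes, as long as all three curves are normalized the same way.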

<div class="hw">
(a) Using the cell below, multiply the likelihood by the prior to produce a posterior. Then, normalize all three PDFs so that they can be plotted on the same axis.

In [None]:
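# construct your posterior and normalize your pdfs here
# dividing by the sum makes each curve sum to 1 over the grid x_d
prior = kep_prob / np.sum(kep_prob)
likelihood = tess_prob / np.sum(tess_prob)
post = kep_prob * tess_prob  # unnormalized posterior: prior times likelihood
post_norm = post / np.sum(post)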


<div class="hw">
(b) Plot your posterior individually and then on the same axis as your prior and likelihood.

In [None]:
# individual plot here
plt.fill_between(x_d, post_norm, alpha=0.5, color='y')
plt.xlabel('log(semi-major axis)')
plt.ylabel('probability')



In [None]:
# all three plotted here
plt.fill_between(x_d, prior, alpha=0.5, label='prior')
plt.fill_between(x_d, likelihood, alpha=0.5, color='m', label='likelihood')
plt.fill_between(x_d, post_norm, alpha=0.5, color='y', label='posterior')
plt.xlabel('log(semi-major axis)')
plt.ylabel('probability')
plt.legend()



### Exercise 1

(a) Describe how the TESS observations have affected the posterior. Does the posterior more closely resemble the prior or the data? Why? How is it different from either one?

**Answer:**

(b) Look back at the exoplanet semi-major axis vs. mass plot that we have examined many times in this class. Does it give you any insight into what is going on? Why or why not?

### Exercise 2: Hypothesis testing

Remember that a hypothesis test in the Bayesian framework is as simple as comparing two likelihoods via a Bayes factor. See [this table](https://en.wikipedia.org/wiki/Bayes_factor#Interpretation) if you need a reminder of how to interpret a Bayes factor.
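
One possible way to get a Bayes factor out of gridded curves is to use the identity that the Bayes factor equals the posterior odds divided by the prior odds. The sketch below is an illustration under stated assumptions, not the intended solution: it uses made-up Gaussian curves, and the two hypotheses ("the parameter is above 0" versus "it is at or below 0") and the helper `mass` are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


dx = 0.01
grid = [-5.0 + (i + 0.5) * dx for i in range(1000)]  # cell midpoints on [-5, 5]

# Stand-in curves (assumptions), not the Kepler or TESS KDEs.
prior = [normal_pdf(x, 0.0, 1.0) for x in grid]
likelihood = [normal_pdf(x, 1.0, 0.5) for x in grid]
posterior = [p * l for p, l in zip(prior, likelihood)]  # unnormalized is fine here


def mass(weights, condition):
    """Fraction of the total weight on grid points where `condition(x)` is true."""
    total = sum(weights)
    return sum(w for x, w in zip(grid, weights) if condition(x)) / total


def h1(x):
    return x > 0.0  # hypothetical H1: parameter above 0


def h2(x):
    return x <= 0.0  # hypothetical H2: parameter at or below 0


prior_odds = mass(prior, h1) / mass(prior, h2)
posterior_odds = mass(posterior, h1) / mass(posterior, h2)
bayes_factor = posterior_odds / prior_odds

print(f"prior odds     = {prior_odds:.3f}")
print(f"posterior odds = {posterior_odds:.3f}")
print(f"Bayes factor   = {bayes_factor:.3f}")
```

Dividing by the prior odds matters: without it, a hypothesis that the prior already favored would look well supported even if the new data added nothing.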

<div class="hw">
(a) Formulate a hypothesis to test on your posterior distribution. Write down this hypothesis here.
    
**Answer:**

<div class="hw">
(b) In the cell below, calculate the associated Bayes Factor.

In [None]:
# hypothesis test here





(c) Evaluate the result of your hypothesis test. If it was inconclusive, reflect on why. If it was decisive, explain what it says about your data.

**Answer:**

(d) Reflect more broadly on this style of hypothesis testing. What are its pros and cons? Where can it go wrong?

**Answer:**