### ***Names: [Insert Your Names Here]***

# Lab 6 - Testing Differences Between Data Subsets

In [None]:
import scipy.stats as st
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sb
import pandas as pd

In [None]:
# these set the pandas defaults so that it will print ALL values, even for very long lists and large dataframes
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [None]:
#read in the data, skipping the first 76 rows of ancillary information
data=pd.read_csv('planets_030220.csv', skiprows=76)
print(data.shape)

In [None]:
#this truncates to only planet detection methods with >30 successful detections (skip if you want all of them)
methods,methods_inds,methods_counts = np.unique(data['pl_discmethod'],return_index=True,return_counts=True)
methods = methods[methods_counts> 30]
print("I am keeping only the following discovery methods: ", methods)

#find the indices of all entries where pl_discmethod is one of the methods kept above
#and the planet mass (pl_bmassj) is below 13 Jupiter masses
inds = [j for j in range(len(data)) if data['pl_discmethod'][j] in methods and data['pl_bmassj'][j] < 13.]

#write a new dataframe with just these entries
data2 = data.loc[inds]

#note the table is much smaller than it once was
print("My shape is now: ", data2.shape)
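If the `np.unique` call above is unfamiliar, the toy example below may help. It is only a sketch: the array `toy_methods` and its counts are made up for illustration and are not taken from the planet table, and the threshold of 2 stands in for the 30 used in the lab.

```python
import numpy as np

# Toy stand-in for data['pl_discmethod']; labels and counts are invented for illustration
toy_methods = np.array(['Transit'] * 5 + ['Radial Velocity'] * 3 + ['Imaging'] * 1)

# np.unique returns the sorted unique labels and, with return_counts=True,
# the number of times each label occurs
labels, counts = np.unique(toy_methods, return_counts=True)
print(labels)
print(counts)

# A boolean mask on the counts keeps only the labels above a threshold
# (2 here, 30 in the lab cell above)
kept = labels[counts > 2]
print(kept)
```

Here the single 'Imaging' entry falls below the threshold, so only the other two labels survive in `kept`.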

<div class=hw>
    
### Exercise 1
---------------

(a) Sit down with your lab partner and compare your answers to exercise 2 on the prelab. Discuss any differences in your approaches, and pros and cons of each. Decide on a single "filter" function, add a docstring and comments, and paste it in the cell below. 

(b) Use your filter function to create four dataframes: one consisting only of planets detected with the Radial Velocity method, one with only the Transit method, one with only the Microlensing method, and one with only the Direct Imaging method. 

(c) Create two different types of graphics to compare the distributions of a single planetary property (e.g. mass) across these 4 subsets of the database. Your first graphic should be a box plot. Your second should be one or more overlapping histograms.  

*Hint 1 - As you probably noted in the prelab, the default pandas plotting functions do not make particularly nice graphics. I suggest using either matplotlib or seaborn.*  
*Hint 2 - For overlapping histograms, you will want to make the fill partially transparent. The transparency is called the "alpha" value and is set with the keyword alpha in many plotting functions.*
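As a rough illustration of Hint 2, the sketch below overlays two histograms drawn from synthetic random numbers (not planet data). The names `group_a`, `group_b`, and `bins` are placeholders invented for this example. Using the same bin edges for both groups is an assumption that tends to make the comparison easier to read; it is not a requirement.

```python
import numpy as np
import matplotlib.pyplot as plt

# Two made-up samples standing in for a property measured in two subsets
rng = np.random.default_rng(1)
group_a = rng.normal(0.0, 1.0, 300)
group_b = rng.normal(1.5, 1.0, 300)

# Shared bin edges so the bars of the two histograms line up (20 bins)
bins = np.linspace(-4, 6, 21)

fig, ax = plt.subplots()
# alpha=0.5 makes each fill half transparent, so the overlap stays visible
ax.hist(group_a, bins=bins, alpha=0.5, label='group A')
ax.hist(group_b, bins=bins, alpha=0.5, label='group B')
ax.set_xlabel('value')
ax.set_ylabel('count')
ax.legend()
```

Lower alpha values make each histogram fainter but keep more of the overlap visible. Values around 0.3 to 0.6 are a common starting point.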

In [None]:
# your commented filter function goes here

In [None]:
# define your four different filtered dataframes using your filter function here

In [None]:
# boxplot here

In [None]:
# overlapping histograms here

## Testing Differences Between Datasets 

### Computing Confidence Intervals

Now that we have a mechanism for filtering the dataset, we can test differences between groups with confidence intervals. The syntax for computing the confidence interval on a mean for a given variable is as follows. 

variable1 = st.t.interval(conf_level, n, loc=np.nanmean(variable2), scale=st.sem(variable2, nan_policy='omit'))

where conf_level is the confidence level you wish to calculate (e.g. 0.95 is 95% confidence, 0.98 is 98%, etc.) and n is the number of degrees of freedom, which should generally be set to the number of valid (non-NaN) entries in variable2 minus 1. The result, variable1, is a (low, high) pair of values bounding the mean.

An example can be found below (if your filter function is working as specified).

In [None]:
## apply filter to select only transiting planets from data2, and pull the radii for this group into a variable
df2=yourfilterfunc(data2,'pl_discmethod','Transit')
transit_radii=df2['pl_radj']
#print mean
print(np.nanmean(transit_radii))

In [None]:
#compute 95% confidence intervals on the mean (low and high)
#degrees of freedom = number of valid (non-NaN) entries - 1; .count() skips NaN values
transitradii_conf=st.t.interval(0.95, transit_radii.count()-1, loc=np.nanmean(transit_radii), 
                                scale=st.sem(transit_radii, nan_policy='omit'))
transitradii_conf
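The same calculation can be tried without the planet table or a filter function. The sketch below uses synthetic numbers with a few missing values; `values` and the other names are made up for illustration. It computes both a 95% and a 98% interval, using the number of valid entries minus 1 as the degrees of freedom.

```python
import numpy as np
import scipy.stats as st

# Synthetic measurements (invented numbers, not from the planet table)
rng = np.random.default_rng(0)
values = rng.normal(loc=1.0, scale=0.3, size=50)
values[::10] = np.nan   # blank out 5 of the 50 entries to mimic missing data

# Count only the valid (non-NaN) entries
n_valid = np.count_nonzero(~np.isnan(values))

# Mean and standard error of the mean, both ignoring NaN values
mean = np.nanmean(values)
sem = st.sem(values, nan_policy='omit')

# Confidence intervals on the mean with n_valid - 1 degrees of freedom
low95, high95 = st.t.interval(0.95, n_valid - 1, loc=mean, scale=sem)
low98, high98 = st.t.interval(0.98, n_valid - 1, loc=mean, scale=sem)
print("95% interval:", low95, high95)
print("98% interval:", low98, high98)
```

The 98% interval is wider than the 95% one, since a higher confidence level requires a larger range around the same mean.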

<div class=hw>
    
### Exercise 2
---------------

Choose a planet property that you find interesting and compare the mean (or median) for that property across the four discovery methods. Then write a paragraph describing the results. Are the differences between the groups significant according to your data? Would they still be significant if you were to compute the 98% (roughly 2.3-sigma) confidence intervals?

In [None]:
# calculations for entire population

In [None]:
# calculations for RV planets

In [None]:
# calculations for transiting planets

In [None]:
# calculations for microlensing planets

In [None]:
# calculations for directly imaged planets

***Explanations of which groups deviate at 95/98% confidence from one another here***

<div class=hw>
    
### Exercise 3
---------------

Now let's do a hypothesis test about the meaningfulness of these differences using a statistic called Student's t-test. This test is commonly used to test differences between the means of two samples. Read a little about the test [here](https://en.wikipedia.org/wiki/Student%27s_t-test).

(a) Choose one of the differences that you noted in Exercise 2 to test and formulate it as a null vs. alternative hypothesis appropriate for a t-test.

(b) The relevant python function is called ttest_ind in the scipy stats library (imported as st above). Figure out how to use it and compute the test statistic. 

(c) Interpret the output of the function in a single sentence. What does it mean?

***hypothesis description here***

In [None]:
#t-test here

***test result explanation here***

<div class=hw>
    
### Exercise 4
---------------

Bring your results from Exercises 1-3 together into a single integrated explanation of what you learned in ~2-3 paragraphs (+ figures). 

***explanation here***

In [None]:
from IPython.core.display import HTML
def css_styling():
    # read the custom stylesheet and close the file when done
    with open("../../custom.css", "r") as f:
        styles = f.read()
    return HTML(styles)
css_styling()