### ***Names: [Insert Your Name Here]***

# Prelab 6

##  Prelab 6 Contents

1. Creating Statistical Graphics from Pandas DataFrames
2. Filtering/Selecting a Subset of Data

In [None]:
#various things that we will need
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import scipy.stats as st
import seaborn as sb # a new plotting library

In [None]:
# these set the pandas defaults so that it will print ALL values, even for very long lists and large dataframes
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

Read in the exoplanet database data as a pandas dataframe called "data".

In [None]:
#read in the data, skipping the first 76 rows of ancillary information
data=pd.read_csv('planets_030220.csv', skiprows=76)
print(data.shape)
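As a minimal sketch of what `skiprows` does, the cell below reads a tiny made-up file from a string instead of from disk. The file contents and the name `toy` are hypothetical and only stand in for the real exoplanet table; the point is that the first line after the skipped rows becomes the column header.

```python
import io
import pandas as pd

# Hypothetical miniature file: two ancillary lines, then the real header and rows.
raw = (
    "# This file was produced by a hypothetical archive\n"
    "# COLUMN pl_name: Planet Name\n"
    "pl_name,pl_discmethod\n"
    "toy b,Transit\n"
    "toy c,Radial Velocity\n"
)

# skiprows=2 discards the first two lines, so the third line supplies the column names.
toy = pd.read_csv(io.StringIO(raw), skiprows=2)
print(toy.shape)           # (number of rows, number of columns)
print(list(toy.columns))
```

If the shape or the column names look wrong after reading a real file, the number passed to `skiprows` is the first thing to check.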

In [None]:
data.columns

To make the dataset a bit more manageable for plotting, we'll truncate it to include only planet discovery methods that have found more than 30 planets, and only objects that are legitimately classified as planets (masses < 13 Jupiter masses). You don't have to understand everything that's going on in the cell below. However, some of the techniques employed may be useful to you later, so I recommend you spend a few minutes trying to understand what's going on.

In [None]:
#this truncates to only planet detection methods with >30 successful detections (skip if you want all of them)
methods,methods_inds,methods_counts = np.unique(data['pl_discmethod'],return_index=True,return_counts=True)
methods = methods[methods_counts> 30]
print("I am keeping only the following discovery methods: ", methods)

#find the indices of all entries where pl_discmethod is one of the retained methods and the planet mass is below 13 Jupiter masses
inds = [j for j in range(len(data)) if data['pl_discmethod'][j] in methods and data['pl_bmassj'][j] < 13.]

#write a new dataframe with just these entries
data2 = data.loc[inds]

#note the table is much smaller than it once was
print("My shape is now: ", data2.shape)
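The same kind of selection can also be written without an explicit loop, using a boolean mask. The sketch below is an alternative to the cell above, not a replacement for it. It runs on a small made-up dataframe (`toy`, with hypothetical values and a lowered count threshold) so that the result is easy to check by eye.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table standing in for the exoplanet dataframe.
toy = pd.DataFrame({
    "pl_discmethod": ["Transit", "Transit", "Imaging", "Transit", "Imaging"],
    "pl_bmassj": [0.5, 20.0, 3.0, np.nan, 1.0],
})

# Count detections per method and keep methods with more than 2
# (threshold lowered from 30 because the toy table is tiny).
counts = toy["pl_discmethod"].value_counts()
keep_methods = counts[counts > 2].index

# Boolean mask: True only where both conditions hold.
# A missing mass (NaN) compared with < gives False, so that row is dropped.
mask = toy["pl_discmethod"].isin(keep_methods) & (toy["pl_bmassj"] < 13.0)
toy2 = toy[mask]
print(toy2)
```

Note the parentheses around the mass comparison: `&` binds more tightly than `<`, so leaving them out changes the meaning of the expression.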

## 1. Creating Statistical Graphics from Pandas DataFrames

<div class=hw>
    
# Exercise 1 - Statistical Plots for Pandas dataframes

### 1a - Histogram
The built-in syntax for creating a histogram for a pandas dataframe column is: 

dataframe["Column Name"].hist(bins=nbins)
    
*HOWEVER*, the matplotlib built-in functionality is far more versatile and so I would like you to use it. To read in a pandas column as an array, follow this convention. 

myarray = dataframe["Column Name"].values
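Here is a minimal sketch of that convention feeding a matplotlib histogram. The dataframe `toy` and its column `mass` are made up for illustration; substitute your own dataframe and column. Missing values are dropped first so that it is explicit which entries are being counted.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; only needed outside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing entries.
toy = pd.DataFrame({"mass": [0.3, 1.1, np.nan, 2.5, 0.9, 1.4, np.nan, 3.2]})

# Drop missing values, then pull the column out as a NumPy array.
values = toy["mass"].dropna().values

fig, ax = plt.subplots()
# hist returns the count in each bin and the bin edges, which are handy for checking.
counts, bin_edges, _ = ax.hist(values, bins=4)
ax.set_xlabel("mass (toy units)")
ax.set_ylabel("number of entries")
print(counts, bin_edges)
```

Labeling the axes is part of the plot: a histogram without an x-axis label does not say what was counted.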

(i) Play around with inputs (e.g. column name) until you find a case (dataframe column) where you think the histogram tells you something important or interesting (you should play around with the number of bins as well). Display your histogram and then write a 2-3 sentence explanation of what it shows.    
    
(ii) Play around some more until you find an example where the histogram is not a particularly good way to represent the data. Explain why in 2-3 sentences.   
    
(iii) Now think more broadly. When are histograms useful and when aren't they? How many and what types of variables are they good at representing? What do they show that other types of plots don't and what don't they show that other types of plots do? 

In [None]:
#informative histogram here

*informative histogram description here*

In [None]:
#uninformative histogram here

*uninformative histogram description here*

*Part iii explanation here*

<div class=hw>
    
### 1b - Box plots

The syntax for creating a box plot for a pair of pandas dataframe columns is: 

dataframe.boxplot(column="column name 1", by="column name 2")
    
Use of this built-in pandas functionality may be nice for quick exploration, ***HOWEVER*** here again there are other libraries with much more versatile boxplot functionalities. I recommend using the plotting library seaborn and what is called a "notched" boxplot. Here is a simple schematic describing what it shows. 
    
<img src="boxplot.png" width="50%">
    
Here is the basic syntax to make a boxplot with seaborn (imported with the shorthand sb in the first cell of this notebook). Note that you will need to turn the pandas dataframe columns that you want into numpy arrays ("var1" and "var2" below), as you did for exercise 1a. 

sb.boxplot(x=var1,y=var2,notch=True)
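To make the pieces of a notched boxplot concrete, the sketch below computes them by hand with numpy for a small made-up array. It assumes matplotlib's default conventions (whiskers reaching the most extreme points within 1.5 times the interquartile range of the box, and a notch of median ± 1.57 × IQR / √n), which may differ in detail from the schematic above.

```python
import numpy as np

# Hypothetical sample with one obvious outlier.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 30.0])

# The box spans the first to third quartile; the line inside it is the median.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Whiskers extend to the most extreme data points within 1.5*IQR of the box.
lower_whisker = values[values >= q1 - 1.5 * iqr].min()
upper_whisker = values[values <= q3 + 1.5 * iqr].max()

# Anything beyond the whiskers is drawn as an individual outlier point.
outliers = values[(values < lower_whisker) | (values > upper_whisker)]

# The notch marks an approximate confidence interval on the median.
half_width = 1.57 * iqr / np.sqrt(len(values))
notch = (median - half_width, median + half_width)

print("box:", q1, median, q3)
print("whiskers:", lower_whisker, upper_whisker)
print("outliers:", outliers)
print("notch:", notch)
```

When the notches of two boxes do not overlap, that is usually read as a sign that the two medians differ.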

(i) Play around with the x and y variables and refer to the docstring as needed until you understand thoroughly what is being shown. Play around with inputs (e.g. column name) until you find a case (dataframe column) where you think the boxplot tells you something important or interesting. Display your boxplot and then write a 2-3 sentence explanation of what it shows.  
    
(ii) Play around some more until you find an example where the boxplot is not a particularly good way to represent the data. Explain why in 2-3 sentences. 
    
(iii) Now think more broadly. When are boxplots useful and when aren't they? How many and what types of variables are they good at representing? What do they show that other types of plots (e.g. histogram) don't and what don't they show that other types of plots do? 

In [None]:
#informative boxplot code here

*Informative boxplot explanation here*

In [None]:
#uninformative boxplot code here

*Uninformative boxplot explanation here*

*Part iii explanation here*

<div class=hw>
    
### 1c - Scatter Plots
The syntax for creating a scatter plot in pandas is: 

dataframe.plot.scatter(x='column name',y='column name')

But here again, the matplotlib.pyplot version is much more versatile. This time, YOU should figure out how to make a matplotlib scatterplot that is interesting or informative from two pandas dataframe columns. 
    
(i) Play around with scatterplot syntax until you understand thoroughly what the options are. Play around with inputs (e.g. column name) until you find a case (dataframe column) where you think the scatterplot tells you something important or interesting. Display your scatterplot and then write a 2-3 sentence explanation of what it shows.  
    
(ii) Play around some more until you find an example where the scatterplot is not a particularly good way to represent the data. Explain why in 2-3 sentences. 
    
(iii) Now think more broadly. When are scatterplots useful and when aren't they? How many and what types of variables are they good at representing? What do they show that other types of plots (e.g. histogram, boxplot) don't and what don't they show that other types of plots do? 

In [None]:
#informative scatter plot here

*Informative scatter plot explanation here*

In [None]:
#uninformative scatter plot here

*Uninformative scatter plot explanation here*

*Part iii explanation here*

## 2. Filtering/Selecting a Subset of Data

You will find it quite useful for the rest of this class to be able to select subsets from larger tabular datasets/pandas dataframes. One basic form of filtering employs conditionals inside of square brackets. For example:

In [None]:
x = np.arange(10)
print(x)
y=x[x > 3]
print(y)
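It can help to look at the conditional on its own. The sketch below shows that `x > 3` is itself an array of True/False values (a "mask"), and that masks can be combined; the variable names `mask` and `both` are only illustrative.

```python
import numpy as np

x = np.arange(10)

# The comparison produces a boolean array with one entry per element of x.
mask = x > 3
print(mask)
print(x[mask])   # identical to x[x > 3]

# Masks combine element by element with & (and), | (or) and ~ (not).
# Each comparison needs its own parentheses.
both = (x > 3) & (x % 2 == 0)
print(x[both])
```

The same square-bracket pattern works when the mask is built from a pandas dataframe column.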

<div class=hw>
    
### Exercise 2
--------------

Write a function called "filter" that takes a dataframe, a column name, and a value for that column as input and returns a new dataframe containing only those rows where the entry in that column equals the given value. For example, filter(data, "PRE_GENDER", 1) should return a dataframe about half the size of the original dataframe where all values in the PRE_GENDER column are 1. (Note that defining a function with this name shadows Python's built-in filter function within this notebook.)

In [None]:
#your function here

In [None]:
#your tests here

In [1]:
from IPython.display import HTML
def css_styling():
    #read the course stylesheet and hand it to the notebook as HTML
    with open("../../custom.css", "r") as f:
        styles = f.read()
    return HTML(styles)
css_styling()