### *** Names: [Insert Your Name Here]***

# Prelab 6 - Visualizing and Filtering Tabular Data

##  Prelab 6 Contents

1. Introduction to the NASA Exoplanet Archive Dataset
2. Creating Statistical Graphics from Pandas DataFrames
3. Filtering/Selecting a Subset of Data

In [None]:
#various things that we will need
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import scipy.stats as st
import seaborn as sb # a new plotting library

In [None]:
# these set the pandas defaults so that it will print ALL values, even for very long lists and large dataframes
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## 1. Introduction to the NASA Exoplanet Archive Dataset

Most of the rest of this unit (and your second project) will revolve around a single dataset - the [NASA Exoplanet Archive](https://exoplanetarchive.ipac.caltech.edu/index.html).  

We will explore this dataset in great detail and apply to it many of the statistical principles that you have learned and will be learning. For this prelab, you will begin just by exploring it. At a minimum, you should complete each of the following, but it may behoove you to do a little more exploring as well. For each item in the list below, you should include one or more well-commented code cells and/or a markdown cell with explanations.

---
### Exercise 1
---
a) Figure out how to read in the data from the file planets_032821.csv into a dataframe called ```data```, then print its shape. (Hint: you will need to tell Pandas how many rows to skip before the real table begins.) You will probably want to avoid displaying the dataframe itself in this notebook for now, as it is very large and will slow down your notebook. 

b) Use the exploratory pandas functions that you already know to find out basic information about the table and the types of entries in it, and write a 1 paragraph description of what the exoplanet archive is/contains based on these results. It may behoove you to do some googling of any new terms, or to refer to [this Astronomy jargon dictionary](https://docs.google.com/document/d/1sLHNH8eOdbiF976ITBlYeP2iv0-voBHO4pVBTXUDlHc/edit?usp=sharing).  

c) Choose a single column that interests you and compute at least three descriptive statistics for it. Write a paragraph describing that variable in words. For example, a good explanation might look like the paragraph below (from a different dataset):  

>QuaRCS score can take discrete integer values between 0 and 25. The minimum score for this dataset is 1 and the maximum is 25. There are 2,777 valid entries for score in this QuaRCS dataset, for which the mean is 13.9 and the median is 14 (both 56% of the maximum score). These are very close together, suggesting a reasonably centrally-concentrated score distribution, and the low skewness value of 0.1 supports this. The kurtosis of the distribution is negative (platykurtic), which tells us that the distribution of scores is flat rather than peaky. The most common score ("mode") is 10, with 197 (~7%) of participants getting this score; however, all score values from 7-21 have counts of greater than 100, supporting the flat nature of the distribution suggested by the negative kurtosis. The interquartile range (25-75 percentiles) is 8 points, and the standard deviation is 5.3. These represent a large fraction (32% and 21%, respectively) of the entire available score range, making the distribution quite wide.
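
As a rough illustration of how statistics like those in the paragraph above can be computed, here is a minimal sketch. The `scores` Series is made-up toy data (an assumption for illustration, not the QuaRCS or exoplanet data); the calls themselves are standard pandas Series methods.

```python
import pandas as pd

# toy data, invented for illustration only
scores = pd.Series([3, 7, 10, 10, 12, 14, 14, 14, 18, 25], name="score")

n_valid = scores.count()        # number of non-missing entries
mean = scores.mean()
median = scores.median()
mode = scores.mode().iloc[0]    # most common value (first one listed if there is a tie)
std = scores.std()              # sample standard deviation
skew = scores.skew()            # > 0 means a longer tail toward high values
kurt = scores.kurt()            # excess kurtosis; < 0 means flatter than a normal distribution
iqr = scores.quantile(0.75) - scores.quantile(0.25)   # interquartile range

print(n_valid, mean, median, mode, iqr)
```

Comparing the mean to the median, and the IQR or standard deviation to the full range, is what turns these numbers into a description in words.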

In [None]:
##insert code and markdown cells here as needed to answer Exercise 1

---

Now, to make the dataset a bit more manageable for plotting, we'll truncate it to include only planet discovery methods that have found more than 30 planets and also only things that are legitimately classified as planets (masses < 13 Jupiter masses). You don't have to understand everything that's going on in the cell below; however, some of the techniques employed may be useful to you later, so I recommend you spend a few minutes trying to understand what's going on. (***Note: as instructed above, your dataframe needs to be named*** ```data``` ***for this cell to work***)
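
The cell below relies on two building blocks: counting how often each category occurs, and keeping only the rows that pass a cut. Here is a minimal sketch of those two steps on a made-up table. The `toy` dataframe, its values, and the threshold of 1 are invented for illustration; only the column names mirror the ones used below. The row selection here is done with a boolean mask, which is one of several ways to do it.

```python
import numpy as np
import pandas as pd

# toy table, invented for illustration; values are not real archive entries
toy = pd.DataFrame({
    "discoverymethod": ["Transit", "Transit", "Transit", "Radial Velocity",
                        "Radial Velocity", "Imaging"],
    "pl_bmassj": [0.5, 1.2, 20.0, 3.0, np.nan, 8.0],
})

# unique category labels and the number of rows carrying each label
labels, counts = np.unique(toy["discoverymethod"], return_counts=True)

# keep labels that occur more than once (the cell below uses a threshold of 30)
common = labels[counts > 1]

# boolean mask: row has a common label AND a mass below 13 (NaN compares as False)
mask = toy["discoverymethod"].isin(common) & (toy["pl_bmassj"] < 13.0)
toy_small = toy[mask]

print(common)
print(toy_small.shape)
```

Note that rows with a missing mass are dropped by the cut, because a comparison against NaN is never True.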

In [None]:
#this truncates to only planet detection methods with >30 successful detections (skip if you want all of them)
methods, methods_counts = np.unique(data['discoverymethod'], return_counts=True)
methods = methods[methods_counts > 30]
print("I am keeping only the following discovery methods: ", methods)

#build a boolean mask that is True where discoverymethod is one of these five and the planet is really a planet (mass < 13*mass of jupiter)
#rows with a missing mass compare as False and are dropped
keep = data['discoverymethod'].isin(methods) & (data['pl_bmassj'] < 13.)

#write a new dataframe with just these entries
data2 = data[keep]

#note the table is much smaller than it once was
print("My shape is now: ", data2.shape)

## 2. Creating Statistical Graphics from Pandas DataFrames

The exercises in this lab have many parallels with the exercises you did for the American Institute of Physics data that you looked at last week in Prelab/Lab 5. In this case, you will be introduced a bit more systematically to how to make a range of visualizations with Pandas dataframes and will use them to explore the dataset that you'll use for Project 2 in this course. ***As you work through this prelab and the remaining labs and prelabs in this unit, you should remain on the lookout for an interesting pattern or phenomenon that you want to explore more deeply in your project.***

## Statistical Plots for Pandas dataframes
---
### Exercise 2a - Histogram
---
The built-in syntax for creating a histogram for a pandas dataframe column is: 

```dataframe["Column Name"].hist(bins=nbins)```
    
*HOWEVER*, the matplotlib built-in functionality is far more versatile and so I would like you to use it. To read in a pandas column as an array, follow this convention. 

```myarray = dataframe["Column Name"].values```
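
A minimal sketch of that convention, using made-up numbers in place of a real column. The `toy` dataframe, the column name `value`, and the choice of 20 bins are all assumptions for illustration. Dropping missing values before plotting is a cautious choice here, not a requirement stated in this prelab.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# toy data, invented for illustration only
rng = np.random.default_rng(0)
toy = pd.DataFrame({"value": rng.normal(loc=10.0, scale=2.0, size=500)})
toy.loc[::50, "value"] = np.nan   # sprinkle in a few missing entries

# pull the column out as a numpy array, dropping missing values first
myarray = toy["value"].dropna().values

fig, ax = plt.subplots()
counts, edges, patches = ax.hist(myarray, bins=20)   # counts per bin and the bin edges
ax.set_xlabel("value (toy units)")
ax.set_ylabel("number of entries")
```

The returned `counts` and `edges` arrays are handy if you want to report, for example, which bin is the most populated.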

(i) Play around with inputs (e.g. column name) until you find a case (dataframe column) where you think the histogram tells you something important or interesting (you should play around with the number of bins as well). Display your histogram and then write a 2-3 sentence explanation of what it shows.    
    
(ii) Play around some more until you find an example where the histogram is not a particularly good way to represent the data. Explain why in 2-3 sentences.   
    
(iii) Now think more broadly. When are histograms useful and when aren't they? How many and what types of variables are they good at representing? What do they show that other types of plots don't and what don't they show that other types of plots do? 

In [None]:
#informative histogram here

*informative histogram description here*

In [None]:
#uninformative histogram here

*uninformative histogram description here*

*Part iii explanation here*

--- 
### Exercise 2b - Box plots
---
You made notched boxplots with the seaborn plotting library in Lab 5, and I suggest you refer to your work there to remind yourself of all of the subtleties. As you may recall, there are various ways to make a seaborn boxplot, and in this case, you will probably want:

```sb.boxplot(x=var1,y=var2,notch=True)```

where column (x) contains a categorical variable and column (y) contains a numerical variable and you want to compare the distributions of variable y according to the categories stored in column x. Remember that you will need to turn these columns into numpy arrays, as you did in the histogram exercise above. 
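
To make concrete what a notched boxplot summarizes for each category, the following sketch computes the same ingredients by hand with pandas: the median, the quartiles, and an approximate notch extent. It is not a replacement for `sb.boxplot`. The `toy` dataframe and its column names are invented, and the notch half-width of 1.57 × IQR / √n is the convention used by matplotlib's boxplot routines, which seaborn is assumed here to follow.

```python
import numpy as np
import pandas as pd

# toy data, invented for illustration only
toy = pd.DataFrame({
    "method": ["A"] * 5 + ["B"] * 5,
    "mass": [1.0, 2.0, 3.0, 4.0, 5.0, 2.0, 4.0, 6.0, 8.0, 10.0],
})

# one row of summary statistics per category
grouped = toy.groupby("method")["mass"]
summary = pd.DataFrame({
    "n": grouped.count(),
    "q1": grouped.quantile(0.25),
    "median": grouped.median(),
    "q3": grouped.quantile(0.75),
})
summary["iqr"] = summary["q3"] - summary["q1"]   # height of the box

# approximate notch extent around the median (assumed convention, see above)
half = 1.57 * summary["iqr"] / np.sqrt(summary["n"])
summary["notch_lo"] = summary["median"] - half
summary["notch_hi"] = summary["median"] + half

print(summary)
```

When the notches of two categories do not overlap, that is often read as rough evidence that their medians differ.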

(i) Play around with inputs (e.g. column name) until you find a case (dataframe column) where you think the boxplot tells you something important or interesting. Display your boxplot and then write a 2-3 sentence explanation of what it shows.  
    
(ii) Play around some more until you find an example where the boxplot is not a particularly good way to represent the data. Explain why in 2-3 sentences. 
    
(iii) Now think more broadly. When are boxplots useful and when aren't they? How many and what types of variables are they good at representing? What do they show that other types of plots (e.g. histogram) don't and what don't they show that other types of plots do? 

In [None]:
#informative boxplot code here

*Informative boxplot explanation here*

In [None]:
#uninformative boxplot code here

*Uninformative boxplot explanation here*

*Part iii explanation here*

--- 
### Exercise 2c - Scatter Plots
---
The syntax for creating a scatter plot in pandas is: 

```dataframe.plot.scatter(x='column name',y='column name')```

But here again, the matplotlib.pyplot version is much more versatile. This time, YOU should figure out how to make a matplotlib scatterplot that is interesting or informative from two pandas dataframe columns. 
    
(i) Play around with scatterplot syntax until you understand thoroughly what the options are. Play around with inputs (e.g. column names) until you find a case (pair of dataframe columns) where you think the scatterplot tells you something important or interesting. Display your scatterplot and then write a 2-3 sentence explanation of what it shows.  
    
(ii) Play around some more until you find an example where the scatterplot is not a particularly good way to represent the data. Explain why in 2-3 sentences. 
    
(iii) Now think more broadly. When are scatterplots useful and when aren't they? How many and what types of variables are they good at representing? What do they show that other types of plots (e.g. histogram, boxplot) don't and what don't they show that other types of plots do? 

In [None]:
#informative scatter plot here

*Informative scatter plot explanation here*

In [None]:
#uninformative scatter plot here

*Uninformative scatter plot explanation here*

*Part iii explanation here*

---
## 3. Filtering/Selecting a Subset of Data

You will find it quite useful for the rest of this class to be able to select subsets from larger tabular datasets/pandas dataframes. One basic form of filtering employs conditionals inside of square brackets. For example:

In [None]:
x = np.arange(10)  # np.arange already returns a numpy array
print(x)
y=x[x > 3]
print(y)

For filtering a dataframe, the basic pattern is:

```df_filtered = df[df[colname]==value]```

Let's break this down a little. The basic syntax is 

```df_filtered = df[conditional]```

Which basically just means that the new dataframe ```df_filtered``` is a subset of dataframe ```df``` that meets the conditional inside the square brackets. 
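
As a small sketch of what the conditional actually is, the example below builds it as a separate step before using it. The dataframe, its column names, and its values are made up for illustration.

```python
import pandas as pd

# toy dataframe, invented for illustration
df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "year": [2010, 2014, 2014, 2020],
})

conditional = df["year"] == 2014   # boolean Series: one True/False per row
print(conditional.tolist())

df_filtered = df[conditional]      # keeps only the rows where the Series is True
print(df_filtered)
```

The filtered dataframe keeps the original row labels (here 1 and 2), which is worth remembering if you later index it by position.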

The conditional inside the square brackets can be very versatile. In this case it's saying find only places where a single column ```df[colname]``` has some value ```value```. A simple modification would be to swap ```==``` for any other elementwise comparison (e.g. ```>=```, ```!=```, ```<```). Note that the Python keywords ```in```, ```not```, ```and```, and ```or``` do not work elementwise on pandas columns; use the ```.isin()``` method and the operators ```~```, ```&```, and ```|``` instead, with each individual condition wrapped in parentheses. The whole conditional statement could also be compound or more complicated than a simple single value, for example

```df_filtered = df[(df[colname]==value) & (df[colname2]==value2)]```

or 

```df_filtered = df[df[colname].isin([value1, value2, value3])]```

In my opinion, this form of "pythonic" data filtering is one of the most powerful things about python as a computing language, and it will benefit you greatly in this class to develop comfort with this style of syntax, so challenge yourself to always filter data this way rather than using loops through the data. 
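
Here is a sketch of compound filtering on a made-up dataframe (column names and values are invented for illustration). pandas combines elementwise conditions with `&` (and), `|` (or), and `~` (not), and tests membership with the `.isin()` method. The last lines check that the one-line filter selects the same rows as an explicit loop.

```python
import pandas as pd

# toy dataframe, invented for illustration
df = pd.DataFrame({
    "method": ["Transit", "Imaging", "Transit", "Microlensing", "Radial Velocity"],
    "year": [2009, 2015, 2018, 2018, 1995],
})

# both conditions must hold; parentheses are required around each comparison
both = df[(df["method"] == "Transit") & (df["year"] > 2010)]

# either condition may hold
either = df[(df["method"] == "Imaging") | (df["year"] < 2000)]

# membership in a list of allowed values, and its negation
in_list = df[df["method"].isin(["Transit", "Imaging"])]
not_in_list = df[~df["method"].isin(["Transit", "Imaging"])]

# the same selection as `both`, written as an explicit loop for comparison
loop_rows = [i for i in range(len(df))
             if df["method"][i] == "Transit" and df["year"][i] > 2010]
assert list(both.index) == loop_rows
```

Inside the loop, plain `and` is fine because each comparison yields a single True/False; it is only on whole columns that `&` is needed.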

Although most filtering can be accomplished in one line of code, in my experience it takes a while to master, so let's begin practicing with Exercise 3 below. 

--- 
### Exercise 3
---

Write a function called "filter" that takes a dataframe, column name, and value for that column as input and returns a new dataframe containing only those rows where column name = value. For example 
```filter(df, "COLNAME", 1)``` 
should return a dataframe where all values in the COLNAME column are 1. I recommend printing the dataframe to a file or to the terminal to verify that your function is working. 

In [None]:
#your function here

In [None]:
#your tests here

---
# Submitting Prelabs and Labs for Grading

Before submitting any Google Colab notebook for grading, please complete the following steps:

**1) Try running everything in one go (Runtime menu --> Restart and run all)**

Make sure the entire notebook runs from start to finish. If necessary, comment out any un-executable cells from the instructions portion of the lab so the whole notebook will execute in one go. 

**2) Restart the kernel (Runtime menu --> Restart Runtime).**

**3) Clear all output (Edit --> clear all outputs).**

**4) Make sure the names of all group members are in a markdown cell at the top of the file and submit the notebook through the Moodle link for this Lab**