# Data with Pandas

Since Pandas is a third-party Python library (not part of the Python standard library), we need to **import** it before we can use it.


### <font color='red'>***You must run this next cell in order for any of the pandas steps to work!***</font>

In [0]:
import pandas as pd

### The data we're using today

For this lesson, we will be using the Portal Teaching data (https://figshare.com/articles/Portal_Project_Teaching_Database/1314459), a subset of the data from Ernest et al., *Long-term monitoring and experimental manipulation of a Chihuahuan Desert ecosystem near Portal, Arizona, USA* (http://onlinelibrary.wiley.com/doi/10.1890/15-2115.1/abstract).

This section will use the **surveys.csv** file that you downloaded along with this notebook (in the **python-intro-workshop** folder), so it should be ready to go.

(If needed, it can also be downloaded from here:  https://ndownloader.figshare.com/files/2292172)

Each row records the species and weight of one animal caught in a plot in the study area.

The columns represent:

| Column | Description |
| --- | --- |
| record_id | unique ID for the observation |
| month | month of observation |
| day | day of observation |
| year | year of observation |
| plot_id | ID of a particular plot |
| species_id | 2-letter code |
| sex | sex of animal ("M", "F") |
| hindfoot_length | length of the hindfoot in mm |
| weight | weight of the animal in grams |

Each time we call a function that's in a library, we use the syntax *LibraryName.FunctionName*. Adding the library name with a `.` before the function name tells Python where to find the function. In the example above, we have imported Pandas as `pd`. This means we don't have to type out `pandas` each time we call a Pandas function.

Let's use Pandas' built-in function that reads in a CSV file:

In [0]:
# To download the file from the internet:
pd.read_csv("https://raw.githubusercontent.com/gwu-libraries/gwlibraries-workshops/master/python-programming/surveys.csv")

# Or if you're able to download the file locally:
# pd.read_csv("surveys.csv")

(Often, though, you'll have your data file either on your computer or perhaps on your Google Drive.  You can borrow some code from https://colab.research.google.com/notebooks/io.ipynb to upload files from (or download files to) your computer or your Google Drive.)

So, ***pd.read_csv*** read our CSV file, but we'd like to store it as an **object**.  So we'll create a variable for it, called `surveys_df`.  This is just like how we used a variable above to store an integer, or a string, or a list, or a dictionary.  We're just storing a Pandas DataFrame object instead.

Make sure to run the cell below:

In [0]:
# To download the file from the internet:
surveys_df = pd.read_csv("https://raw.githubusercontent.com/gwu-libraries/gwlibraries-workshops/master/python-programming/surveys.csv")

# Or if you're able to download the file locally:
# surveys_df = pd.read_csv("surveys.csv")

Try evaluating `surveys_df`:

In [0]:
surveys_df 

How would you now check what **class** (type) of object `surveys_df` is?

`surveys_df` is a Pandas **DataFrame**.   A DataFrame is a 2-dimensional structure that can store data in rows and columns - similar to a spreadsheet or a table, but with some other nice features.  (Yes, it's very similar to a `data.frame` in R.)

Just like the Pandas *library* has functions, *objects* can have **functions** (which may take arguments) and **attributes** (which don't).
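
To make the difference between functions and attributes concrete, here is a minimal sketch on a small made-up DataFrame (the name `toy_df` and its values are only for illustration, not part of the survey data):

```python
import pandas as pd

# Made-up toy data, not the real survey file
toy_df = pd.DataFrame({
    "species_id": ["DM", "PF", "DM"],
    "weight": [40.0, 8.0, 44.0],
})

# Attribute: no parentheses, it just looks up a stored property
toy_shape = toy_df.shape       # (3, 2): 3 rows, 2 columns

# Function: parentheses, and it may take arguments
first_two = toy_df.head(2)     # a DataFrame with the first 2 rows

print(toy_shape)
print(first_two)
```

If you leave the parentheses off a function (for example `toy_df.head`), Python shows you the function object itself instead of calling it.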

A Pandas DataFrame object has an attribute called `dtypes` which lists out the type of each column:

In [0]:
surveys_df.dtypes
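
Missing values can influence the types you see here. The sketch below reads a tiny made-up CSV (held in a string rather than a file, so `toy_csv` and its contents are assumptions for illustration) in which one weight is blank:

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file; the second row has no weight
toy_csv = io.StringIO(
    "record_id,species_id,weight\n"
    "1,DM,40\n"
    "2,PF,\n"
)
toy_df = pd.read_csv(toy_csv)

# record_id has no gaps and stays an integer column;
# weight has a gap, so Pandas stores it as floating point with NaN
print(toy_df.dtypes)
```

This is one likely reason a column of whole numbers shows up as `float64` rather than `int64`.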

Try these to see what they do:

    surveys_df.columns
    surveys_df.head()

Also, what does `surveys_df.head(15)` do, versus `surveys_df.head(4)`?

    surveys_df.tail()
    surveys_df.shape

Take note of the output of the `shape` attribute. In what format does it return the shape of the DataFrame?


In [0]:
surveys_df.shape

What if we want to isolate just one column?  There are two ways we can do this:

In [0]:
surveys_df['species_id']

or

In [0]:
surveys_df.species_id

Let's see what type `surveys_df['species_id']` is.  Try it.

In [0]:
type(surveys_df['species_id'])

You can think of a Pandas **Series** as a series of observations of one variable.  It behaves a lot like a Python list, with extra features such as a label for each value.
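
To make the comparison with a list concrete, here is a small sketch on made-up weights (the names `weights` and `weights_kg` are only for illustration):

```python
import pandas as pd

# Made-up weights in grams
weights = pd.Series([40.0, 8.0, 44.0], name="weight")

# Like a list: it has a length and you can pick items by position
n = len(weights)
first = weights.iloc[0]

# Unlike a list: arithmetic applies to every element at once
weights_kg = weights / 1000

# Unlike a list: every value has a label, stored in the index
labels = list(weights.index)

print(n, first, labels)
print(weights_kg)
```

With a plain list, `[40.0, 8.0, 44.0] / 1000` would raise a `TypeError`; you would need a loop or a list comprehension instead.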

We can also slice and dice -- similar to how we selected parts of list objects above.  What does the next line do?

In [0]:
surveys_df[3:10]

Another way to select is with `.loc`, which selects based on *labels* (as opposed to `.iloc` which selects using *numerical indices*).  Try this:

In [0]:
surveys_df.loc[[3, 10, 12], ['day', 'year', 'species_id']]
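
In `surveys_df` the row labels happen to be 0, 1, 2, ... so labels and positions look the same. The sketch below uses a made-up DataFrame (`toy_df`, with an index chosen on purpose to differ from the positions) to show where `.loc` and `.iloc` part ways:

```python
import pandas as pd

toy_df = pd.DataFrame(
    {"day": [16, 17, 18], "year": [1977, 1977, 1977]},
    index=[10, 20, 30],   # labels deliberately differ from positions 0, 1, 2
)

# .loc uses labels: the row labelled 20, the column named 'day'
by_label = toy_df.loc[20, "day"]

# .iloc uses positions: the second row, the first column
by_position = toy_df.iloc[1, 0]

print(by_label, by_position)   # both refer to the same cell
```

Here `toy_df.loc[1, "day"]` would raise a `KeyError`, because no row carries the label 1.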

We can also use `.query` to select only rows matching certain conditions.  Note that the query expression is in single quotes.

In [0]:
surveys_df.query('hindfoot_length < 10')

There are other ways to do this.  Try this expression:

In [0]:
surveys_df['hindfoot_length'] < 10

How can you use this to get back a data frame with only the rows in `surveys_df` where `hindfoot_length < 10`?

**Challenge**:  How might you query to get back only rows with `hindfoot_length < 10` **and** `weight > 10` in ONE expression?  (There is more than one way to accomplish this!)

Pandas has a handy function (well, it has many handy functions!) to get all the unique elements in the column:

In [0]:
unique_species = pd.unique(surveys_df['species_id'])

Try evaluating the **`.size`** attribute on the above result to see how many unique species there are in the data set.

In [0]:
unique_species.size
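
One thing to keep in mind: if a column contains missing values, `pd.unique` reports the missing value as one of the unique elements. A small sketch on a made-up Series (`species` is a hypothetical stand-in, not the real column):

```python
import pandas as pd

# Made-up species codes with one missing entry
species = pd.Series(["DM", "PF", None, "DM"])

# pd.unique keeps the missing value as its own entry
n_with_missing = len(pd.unique(species))

# Series.nunique() leaves missing values out by default
n_without_missing = species.nunique()

print(n_with_missing, n_without_missing)
```

So the count from `.size` may be one higher than the number of actual species codes.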

We see from above that we can also isolate just the data in one column.  Let's try isolating the `weight` column, and calling the **`describe()`** function to get some statistics on it.

In [0]:
surveys_df['weight'].describe()

We can also get a quick pairwise correlation between every pair of numerical variables:

In [0]:
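# numeric_only=True skips text columns such as species_id and sex;
# recent Pandas versions raise an error on them otherwise (needs Pandas 1.5 or later)
surveys_df.corr(numeric_only=True)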

Pandas can also sort and group data based on the values in a column:

In [0]:
grouped_by_species = surveys_df.groupby('species_id')

Try running **`describe()`** on `grouped_by_species`:

In [0]:
grouped_by_species.describe()

Now we're going to create some series with:

* The number of animals observed per species
* The mean weight of all animals observed in each species

In [0]:
# a series with the number of samples by species
species_counts = grouped_by_species.size()
# a series with the mean weight by species
species_mean_weights = grouped_by_species['weight'].mean()

Let's look at each - notice that each is a Series:
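
As an illustration of what to expect, here is the same pattern on a tiny made-up table (all `toy_` names and values are assumptions for the example, not the survey data):

```python
import pandas as pd

toy_df = pd.DataFrame({
    "species_id": ["DM", "DM", "PF", "PF", "PF"],
    "weight": [40.0, 44.0, 8.0, None, 7.0],
})
toy_grouped = toy_df.groupby("species_id")

# Number of rows per species (rows with a missing weight still count)
toy_counts = toy_grouped.size()

# Mean weight per species (missing weights are skipped)
toy_mean_weights = toy_grouped["weight"].mean()

print(type(toy_counts))
print(toy_counts)
print(toy_mean_weights)
```

Both results are Series whose index holds the species codes, which is why you can look up a single species with, say, `toy_counts["PF"]`.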

Let's try creating some quick charts.  First we need to import the plotting library and make sure figures appear inline in the notebook:

In [0]:
import matplotlib.pyplot as plt

%matplotlib inline

And now we'll create some quick charts:

In [0]:
plt.scatter(x=surveys_df.weight, y=surveys_df.hindfoot_length)

plt.show()

In [0]:
plt.hist(surveys_df.hindfoot_length[surveys_df.hindfoot_length.notnull()])
plt.title('Distribution of hindfoot length values')
plt.show()

We can also call `.plot()` on a Series:

In [0]:
species_counts.plot(kind='bar')

In [0]:
species_mean_weights.plot(kind='bar')

And now for a scatter plot, where we specify which variables are the x and y:

In [0]:
surveys_df.plot(kind='scatter', y='hindfoot_length', x='weight')

See if you can look up another plot type and get it to work!

## Nicer plotting, with `ggplot`

Let's try a different plotting library, called `ggplot` (from the `plotnine` package), that thinks about plotting data in a different way, in terms of "adding" *data* plus *aesthetics* (colors, shapes, etc.) plus *layers* (which add to or modify the plot).

First we need to import the ggplot library, from the `plotnine` package:

In [0]:
# This step seems to be needed to run in Google Colaboratory -- it might not be needed in Anaconda
# Install the plotnine Python library
!pip install plotnine

In [0]:
from plotnine import *

The first thing we'll do is create a ggplot object and give it:
- Our data frame (`surveys_df`)
- Aesthetics information, such as which variables in our data frame will be used as the independent (x) and dependent (y) variables

In [0]:
ggplot(surveys_df) + aes(x = 'weight', y = 'hindfoot_length')

Hmmm, we get a canvas but it's mysteriously empty!   We need to add a ***layer***, using `+`.  We'll add a layer with points:

In [0]:
ggplot(surveys_df) + aes(x = 'weight', y = 'hindfoot_length') + geom_point()

Not bad!  Let's see if we can perhaps color-code the points based on the species_id.

In [0]:
ggplot(surveys_df) + aes(x = 'weight', y = 'hindfoot_length', color='species_id') + geom_point()

Notice that we get an error.  I'll help you out.  The problem is that we have too many species_id values for a meaningful color layer.   Let's create a subset using just the top 10 species.

Remember earlier we created `species_counts`?  We'll use it and do a bit of sorting to get just the top ten species as a list:

In [0]:
species_counts.sort_values(ascending=False)

Now let's chop it down to just the first 10:

In [0]:
species_counts.sort_values(ascending=False)[:10]

But we simply want the species ids, not any of the other stuff.  Similar to a Python dictionary, we can use `.keys()` to get that:

In [0]:
species_counts.sort_values(ascending=False)[:10].keys()

And let's use a variable to store our index of top species so we can reuse it:

In [0]:
top_species = species_counts.sort_values(ascending=False)[:10].keys()
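
The chain of steps above (sort, slice, take the labels) can be sketched on a tiny made-up data set, keeping the top 2 instead of the top 10. The `toy_` names are only for illustration:

```python
import pandas as pd

toy_df = pd.DataFrame({"species_id": ["DM", "PF", "DM", "OT", "DM", "PF"]})
toy_counts = toy_df.groupby("species_id").size()

# Sort from most to least common
toy_sorted = toy_counts.sort_values(ascending=False)

# Keep the first 2 entries, then take only their labels
toy_top = toy_sorted[:2].keys()

print(list(toy_top))   # ['DM', 'PF']
```

For a Series, `.keys()` returns the same thing as the `.index` attribute, so `toy_sorted[:2].index` would work equally well.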

Now that we have a simple list of the top 10 species IDs, we can use `.isin()` to create a subset data frame containing just the rows with those species_id values:

In [0]:
subset_df = surveys_df[surveys_df.species_id.isin(top_species)]

Now that our new data frame has a smaller variety of species, let's try it again with species_id as the key for the color:

In [0]:
surveys_plot = ggplot(subset_df) + aes(x = 'weight', y = 'hindfoot_length', color='species_id') + geom_point()

And now to render the plot:

In [0]:
surveys_plot

Here are some other layers we can try adding.  See what they do:

In [0]:
surveys_plot + stat_smooth() + theme_xkcd()

In [0]:
surveys_plot + facet_wrap('~species_id')

# Challenge

* Load the species.csv data from https://ndownloader.figshare.com/files/3299483 into a Pandas DataFrame.
* ***Join*** the species data onto the surveys data, and recreate the visualizations so that the key shows each species by its scientific name (genus + species), rather than by its two-letter code.  Hint:  Use [DataFrame.join](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.join.html)

