# Data with Pandas

Since Pandas is a third-party Python library (not part of the standard Python libraries), we need to **import** it.


### <font color='red'>***You must run this next cell in order for any of the pandas steps to work!***</font>

In [None]:
import pandas as pd

### The data we're using today

For this lesson, we will be using the Portal Teaching data (https://figshare.com/articles/Portal_Project_Teaching_Database/1314459), a subset of the data from Ernest et al., "Long-term monitoring and experimental manipulation of a Chihuahuan Desert ecosystem near Portal, Arizona, USA" (https://doi.org/10.1890/15-2115.1).

This section will use a data file called **surveys.csv** which we will download from the Internet shortly.

(If needed, it can also be downloaded from here:  https://ndownloader.figshare.com/files/2292172)

Each row in the data file records the species and weight of each animal caught in plots in the study area.

The columns represent:

| Column 	| Description |
| --- | --- |
| record_id |	Unique id for the observation|
| month |	month of observation |
|day 	|day of observation|
|year |	year of observation|
|plot_id 	|ID of a particular plot|
|species_id |	2-letter code|
|sex |	sex of animal ("M", "F")|
|hindfoot_length |	length of the hindfoot in mm|
|weight |	weight of the animal in grams|

Each time we call a function that's in a library, we use the syntax *`LibraryName.FunctionName`*. Adding the library name with a `.` before the function name tells Python where to find the function. In the example above, we have imported Pandas as `pd`. This means we don't have to type out `pandas` each time we call a Pandas function.

## Loading data

Let's use Pandas' built-in function, `read_csv`, that reads in a CSV file:

In [None]:
# To download the file from the internet:
# pd.read_csv("https://raw.githubusercontent.com/gwu-libraries/gwlibraries-workshops/master/python-programming/surveys.csv")

pd.read_csv('https://go.gwu.edu/pywdata1')
# Or if you're able to download the file locally:
# pd.read_csv("surveys.csv")

# If you're inside Google Colaboratory, you can borrow code from
# https://colab.research.google.com/notebooks/io.ipynb?authuser=1 to load files from your computer
# (Just paste in the code under "Uploading files from your local file system")

That read our CSV file, but we'd like to store it as an **object**.  So we'll create a variable for it, called `surveys_df`.  This is just like how we used a variable above to store an integer, or a string, or a list, or a dictionary.  We're just storing a Pandas DataFrame object instead.

Make sure to run the cell below:

In [None]:
# To download the file from the internet:
surveys_df = pd.read_csv("https://go.gwu.edu/pywdata1")

# Or if you're able to download the file locally:
#surveys_df = pd.read_csv("surveys.csv")

Try evaluating `surveys_df`:

## Properties of a Pandas `DataFrame`

How would you now check what **class** (type) of object `surveys_df` is?

`surveys_df` is a Pandas **DataFrame**.   A DataFrame is a 2-dimensional structure that can store data in rows and columns - similar to a spreadsheet or a table, but with some other nice features.  (Yes, it's very similar to a `data.frame` in R.)

Just like the Pandas *library* has functions, *objects* can have **methods** (functions attached to the object, which are called with parentheses and may take arguments) and **attributes** (which are accessed without parentheses).

A Pandas DataFrame object has an attribute called `dtypes` which lists out the type of each column.  Try it.

Try these to see what they do:

    surveys_df.columns
    surveys_df.head()
Also, what does `surveys_df.head(15)` do, versus `surveys_df.head(4)`?

    surveys_df.tail()
    surveys_df.shape


Take note of the output of the `shape` attribute. What format does it return the shape of the DataFrame in?
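As a rough illustration of the properties above, here is a minimal sketch on a tiny made-up DataFrame. The name `toy_df` and all of its values are assumptions for illustration only; they are not rows from `surveys.csv`.

```python
import pandas as pd

# Made-up rows (not real survey data) that reuse a few surveys.csv column names
toy_df = pd.DataFrame({
    "record_id": [1, 2, 3, 4],
    "species_id": ["NL", "DM", "DM", "PF"],
    "hindfoot_length": [32.0, 37.0, 36.0, 14.0],
    "weight": [190.0, 40.0, 48.0, 7.0],
})

print(type(toy_df))     # the class of the object
print(toy_df.dtypes)    # attribute: no parentheses
print(toy_df.head(2))   # method: parentheses, with an optional argument

# shape is a (number of rows, number of columns) tuple
n_rows, n_columns = toy_df.shape
print(n_rows, n_columns)
```

The same calls should work unchanged on `surveys_df`, only with many more rows.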

## Slicing and Dicing (Subsetting)

What if we want to isolate just one column?  There are (at least) two ways we can do this:

- We can use bracket notation, like this:

    `mydataframe['myvariable']`

- We can treat the variable like an attribute, like this:

    `mydataframe.myvariable`


Try it for `surveys_df` and the `species_id` variable:

Let's see what type `surveys_df['species_id']` is, using the `type()` function that we used this morning.  Try it.

You can think of a Pandas **Series** as a series of observations of one variable.  In many ways it behaves like a Python list: it has a length, and you can index and slice it.
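A minimal sketch of the two column-selection styles, using a made-up `toy_df` (a hypothetical stand-in for `surveys_df`):

```python
import pandas as pd

# Made-up data for illustration only
toy_df = pd.DataFrame({
    "species_id": ["NL", "DM", "PF"],
    "weight": [190.0, 40.0, 7.0],
})

by_bracket = toy_df["species_id"]   # bracket notation
by_attribute = toy_df.species_id    # attribute notation

print(type(by_bracket))                  # a pandas Series
print(by_bracket.equals(by_attribute))   # both styles give the same Series
print(len(by_bracket))                   # like a list, a Series has a length
```

Bracket notation is the more general of the two: it also works for column names that contain spaces or that clash with an existing DataFrame attribute.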

We can also slice and dice -- similar to how we selected parts of list objects above.  Try getting back the 3rd through 10th rows of `surveys_df` using this structure:

`df[2:5]` # returns the rows at positions 2 through 4 of `df` (counting from 0, so the 3rd through 5th rows)

Or, we can select just certain columns with a structure like this:

`df[['variableA', 'variableB']]` # returns a DataFrame with `variableA` and `variableB` from `df`

Use this structure to get back a `DataFrame` containing just the `species_id` and `hindfoot_length` columns from `surveys_df`:
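The two structures above can be sketched on a small made-up DataFrame. Here `toy_df` and its values are assumptions for illustration; on the real data you would use `surveys_df` and, for the 3rd through 10th rows, the slice `[2:10]`.

```python
import pandas as pd

# Made-up rows for illustration only
toy_df = pd.DataFrame({
    "record_id": [1, 2, 3, 4, 5, 6],
    "species_id": ["NL", "NL", "DM", "DM", "PF", "PE"],
    "hindfoot_length": [32.0, 33.0, 37.0, 36.0, 14.0, 20.0],
})

# Row slice: positions 2, 3 and 4 (the end position 5 is excluded)
some_rows = toy_df[2:5]

# Column selection: note the double brackets (a list inside the brackets)
some_columns = toy_df[["species_id", "hindfoot_length"]]

print(some_rows)
print(some_columns)
```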

What if we want to get a subset on both columns _and_ rows?

Another way to select is with `.loc`, which selects based on *labels* (as opposed to `.iloc` which selects using *numerical indices*).  Try this:
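One possible illustration of the difference, on a made-up `toy_df` (an assumed stand-in for `surveys_df`). Note that a `.loc` slice includes its end label, while an `.iloc` slice excludes its end position.

```python
import pandas as pd

# Made-up rows for illustration only
toy_df = pd.DataFrame({
    "record_id": [1, 2, 3, 4],
    "species_id": ["NL", "DM", "DM", "PF"],
    "hindfoot_length": [32.0, 37.0, 36.0, 14.0],
    "weight": [190.0, 40.0, 48.0, 7.0],
})

# .loc uses labels: row labels 0 through 2 (inclusive) and two named columns
by_label = toy_df.loc[0:2, ["species_id", "weight"]]

# .iloc uses positions: row positions 0 and 1, column positions 1 and 3
by_position = toy_df.iloc[0:2, [1, 3]]

print(by_label)
print(by_position)
```

With the default index the row labels happen to be the same numbers as the row positions, which is why the inclusive versus exclusive end is the easiest difference to spot here.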

We can also use `.query` to select only rows matching certain conditions.  Note that the query expression is in single quotes.
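A minimal sketch of `.query` on made-up data (the name `toy_df` and the threshold of 20 are assumptions chosen for illustration):

```python
import pandas as pd

# Made-up rows for illustration only; None becomes a missing value (NaN)
toy_df = pd.DataFrame({
    "record_id": [1, 2, 3, 4],
    "species_id": ["NL", "DM", "PF", "PE"],
    "hindfoot_length": [32.0, 37.0, 14.0, None],
})

# The whole condition is written as one string, using column names directly
small_feet = toy_df.query('hindfoot_length < 20')
print(small_feet)
```

Rows with a missing `hindfoot_length` do not satisfy the condition, so they are left out of the result.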

There are other ways to do this.  Try this expression:

How can I tally up the number of False and True values in the series above?

How can you use this to get back a data frame with only the rows in `surveys_df` where `hindfoot_length < 10`?

**Challenge**:  How might you query to get back only rows with `hindfoot_length < 10` **and** `weight > 10` in ONE expression?  (There is more than one way to accomplish this!)
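One possible approach to the steps above, sketched on made-up data (`toy_df` and its values are assumptions, not rows from the survey): build a Series of True/False values, tally it, use it as a filter, and then combine two conditions.

```python
import pandas as pd

# Made-up rows for illustration only
toy_df = pd.DataFrame({
    "record_id": [1, 2, 3, 4, 5, 6],
    "hindfoot_length": [8.0, 9.0, 37.0, 36.0, 14.0, None],
    "weight": [12.0, 6.0, 40.0, 48.0, 7.0, 22.0],
})

# A comparison on a column gives back a Series of True/False values
is_short = toy_df["hindfoot_length"] < 10
print(is_short.value_counts())   # tally of False and True

# Using that Series inside brackets keeps only the True rows
short_df = toy_df[is_short]

# Two conditions at once: wrap each in parentheses and join with &
both_mask = toy_df[(toy_df["hindfoot_length"] < 10) & (toy_df["weight"] > 10)]

# The same selection written with .query
both_query = toy_df.query('hindfoot_length < 10 and weight > 10')
print(both_mask)
```

The parentheses around each condition matter, because `&` binds more tightly than `<` and `>` in Python.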

Another handy operator is `~` (tilde), which gives us the "opposite" of a result.  Let's say we want to get all of the `AB` species observations, but only where the `hindfoot_length` variable has a value:
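A sketch of how `~` might be combined with `.isnull()` for this, using a made-up `toy_df` (values are assumptions for illustration):

```python
import pandas as pd

# Made-up rows for illustration only; None becomes a missing value (NaN)
toy_df = pd.DataFrame({
    "record_id": [1, 2, 3, 4],
    "species_id": ["AB", "AB", "DM", "AB"],
    "hindfoot_length": [None, 12.0, 36.0, 13.0],
})

# .isnull() is True where the value is missing; ~ flips True and False
has_length = ~toy_df["hindfoot_length"].isnull()

ab_with_length = toy_df[(toy_df["species_id"] == "AB") & has_length]
print(ab_with_length)
```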

## Basic descriptive statistics

### Numerical data

We saw above that we can isolate the data in just one column. Let's try isolating the `weight` column and calling its `describe()` method to get some statistics on it.

We can also ask for descriptive statistics across all numerical variables:

Using `corr()`, we can also get the correlation between every pair of numerical variables (in recent versions of Pandas, pass `numeric_only=True` if the data frame also contains text columns):
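A minimal sketch of these three calls on made-up data. The values in `toy_df` are assumptions, chosen so that `weight` is exactly twice `hindfoot_length`; the `numeric_only` argument assumes Pandas 1.5 or later.

```python
import pandas as pd

# Made-up rows for illustration only (weight is exactly 2 x hindfoot_length)
toy_df = pd.DataFrame({
    "species_id": ["NL", "DM", "DM", "PF"],
    "hindfoot_length": [10.0, 20.0, 30.0, 40.0],
    "weight": [20.0, 40.0, 60.0, 80.0],
})

# Statistics for a single column
weight_stats = toy_df["weight"].describe()
print(weight_stats)

# Statistics for every numerical column at once
print(toy_df.describe())

# Pairwise correlations, skipping the text column
correlations = toy_df.corr(numeric_only=True)
print(correlations)
```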

### Categorical Data

We can use `.describe()` on a `Series` containing text to get some basic descriptive information.

For categorical data, we can also find out how many there are of each unique value, using `.value_counts()`

Pandas has a handy `unique()` function (well, it has many handy functions!) to get all the unique elements in a `Series`:

Try evaluating the **`.size`** attribute on the above result to see how many unique species there are in the data set.
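The categorical tools above could be tried out like this on a made-up Series (the name `species` and its values are assumptions for illustration):

```python
import pandas as pd

# Made-up species codes for illustration only
species = pd.Series(["DM", "DM", "NL", "PF", "DM", "NL"], name="species_id")

# For text data, describe() reports count, unique, top and freq
summary = species.describe()
print(summary)

# How many observations there are of each value, most common first
counts = species.value_counts()
print(counts)

# The distinct values, and how many of them there are
unique_species = species.unique()
print(unique_species, unique_species.size)
```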

## Grouping

Pandas can also sort and group data based on the values in a column:

In [None]:
grouped_by_species = surveys_df.groupby('species_id')

`.groupby` doesn't appear to do anything until we perform subsequent steps.

But now try running **`describe()`** on `grouped_by_species`:

Try grouping by multiple variables.  You will need to pass a **list** of variables to group on.
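A sketch of grouping by one variable and then by a list of variables, on made-up data (`toy_df` and its values are assumptions for illustration):

```python
import pandas as pd

# Made-up rows for illustration only
toy_df = pd.DataFrame({
    "species_id": ["DM", "DM", "DM", "NL", "NL"],
    "sex": ["F", "M", "M", "F", "F"],
    "weight": [40.0, 44.0, 48.0, 180.0, 200.0],
})

# Group on one column, then summarize the weight within each group
by_species = toy_df.groupby("species_id")
weight_stats = by_species["weight"].describe()
print(weight_stats)

# Group on two columns by passing a list
by_species_and_sex = toy_df.groupby(["species_id", "sex"])
mean_weights = by_species_and_sex["weight"].mean()
print(mean_weights)
```

The result of grouping on two columns is indexed by pairs of values, such as `("DM", "M")`.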

Now we're going to create some series with:

* The number of animals observed per species
* The mean weight of all animals observed in each species

In [None]:
# a series with the mean weight by species
species_mean_weights = grouped_by_species['weight'].mean()
species_mean_weights

We can sort to find the species with the highest mean weights.  Remember that `species_mean_weights` is a Series, so we can use its `.sort_values()` method.  Additionally, we can use the optional `ascending` parameter to control whether it sorts in ascending or descending order.

We can also get the tally of species observations by using `Series.value_counts()`.  Notice that by default, `Series.value_counts` sorts in descending order.
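A minimal sketch of both steps on made-up data. The variable name `species_counts` matches the one this lesson refers to later, but how it was originally computed is an assumption, as are `toy_df` and its values.

```python
import pandas as pd

# Made-up rows for illustration only
toy_df = pd.DataFrame({
    "species_id": ["DM", "DM", "DM", "NL", "NL", "PF"],
    "weight": [40.0, 44.0, 48.0, 180.0, 200.0, 7.0],
})

species_mean_weights = toy_df.groupby("species_id")["weight"].mean()

# ascending=False puts the heaviest species first
heaviest_first = species_mean_weights.sort_values(ascending=False)
print(heaviest_first)

# Tally of observations per species, most frequent first by default
species_counts = toy_df["species_id"].value_counts()
print(species_counts)
```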

(For future reference:  `.agg()` or `.aggregate()` can be a useful way to work with "groupby" objects)
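As a rough idea of what `.agg()` offers, the sketch below computes several summaries in one call on made-up data (`toy_df` and its values are assumptions for illustration):

```python
import pandas as pd

# Made-up rows for illustration only
toy_df = pd.DataFrame({
    "species_id": ["DM", "DM", "DM", "NL", "NL"],
    "weight": [40.0, 44.0, 48.0, 180.0, 200.0],
})

# One row per species, one column per requested summary
weight_summary = toy_df.groupby("species_id")["weight"].agg(["count", "mean", "max"])
print(weight_summary)
```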

## Merging (Joining) data frames

Let's read in a second data frame from `https://go.gwu.edu/pywdata2`, and call it `species_df`

In [None]:
species_df = pd.read_csv('https://go.gwu.edu/pywdata2')
species_df

We can "calculate" a new text variable (let's call it `genus_species`) by concatenating the values of `genus` and `species`:
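One way this concatenation might look, sketched on a made-up two-row table (`toy_species_df` is a hypothetical stand-in for `species_df`):

```python
import pandas as pd

# Made-up stand-in for species_df, for illustration only
toy_species_df = pd.DataFrame({
    "species_id": ["DM", "NL"],
    "genus": ["Dipodomys", "Neotoma"],
    "species": ["merriami", "albigula"],
    "taxa": ["Rodent", "Rodent"],
})

# Adding text columns with + joins them row by row; " " puts a space between
toy_species_df["genus_species"] = toy_species_df["genus"] + " " + toy_species_df["species"]
print(toy_species_df)
```

Assigning to a column name that does not exist yet creates the new column.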

Let's now "join" the species name onto the existing `surveys_df` data frame.

A brief summary of ways we can join table A and table B:
- **left join** keeps all rows in table A, adds on columns from table B but only for rows where the "key" matches.  Where there's no matching row in table B, the result is "NA" (missing data) in the variables from table B.
- **right join** keeps all rows in table B, adds on columns from table A but only for rows where the "key" matches.  Where there's no matching row in table A, the result is "NA" (missing data) in the variables from table A.
- **inner join** keeps ONLY the rows where there is a match between table A and table B (based on the "key" variable in common)
- **outer join** keeps ALL rows in BOTH table A and table B.  For rows where the key's value is only found in table A, the variables from table B will have missing data; for rows where the key's value is only found in table B, the variables from table A will have missing data.

In our case, we will do a left join, because we want to keep all the original data in `surveys_df`, but we wish to enhance it with the name for the species, where it is available in `species_df`.

We will use the `.merge()` method available on the `surveys_df` DataFrame.  We need to give it a few things:
* The second data frame that we wish to merge onto this one.
* The variable(s) that we want to merge on.  In other words, the variable(s) that will need to match between the two data frames.  This parameter is called `on`.
* The method of merging (see above).  This parameter is called `how`.

In [None]:
surveys_df = surveys_df.merge(species_df[['species_id', 'genus_species', 'taxa']],
                          on='species_id', how='left')
surveys_df
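To see how the `how` parameter changes the result, here is a small sketch with two made-up tables. The names `table_a` and `table_b` and all values (including the species code `ZZ`) are assumptions for illustration.

```python
import pandas as pd

# Made-up tables for illustration only: "ZZ" appears only in A, "PF" only in B
table_a = pd.DataFrame({"species_id": ["DM", "NL", "ZZ"], "weight": [40, 180, 15]})
table_b = pd.DataFrame({"species_id": ["DM", "NL", "PF"], "taxa": ["Rodent", "Rodent", "Rodent"]})

results = {}
for how in ["left", "right", "inner", "outer"]:
    results[how] = table_a.merge(table_b, on="species_id", how=how)
    print(how, "join gives", len(results[how]), "rows")

# In the left join, the unmatched "ZZ" row is kept with a missing taxa value
print(results["left"])
```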

## Plotting with Matplotlib

Let's try creating some quick charts.  First we need to import Matplotlib and make sure figures appear inline in the notebook:

In [None]:
import matplotlib.pyplot as plt

%matplotlib inline

And now we'll create some quick charts using `plt.scatter()` followed by `plt.show()`:

In [None]:
plt.scatter(x = surveys_df.weight, y=surveys_df.hindfoot_length)

plt.show()

We can also use `plt.hist()` to create a histogram.  Create a histogram of the values of `surveys_df.hindfoot_length`.

In [None]:
plt.hist(surveys_df.hindfoot_length)
plt.title('Distribution of hindfoot length values')
plt.show()

We can also call `.plot()` on a Series.  We can specify which type of plot we want by populating the `kind` parameter.  For a bar plot, we can specify `kind = 'bar'`.

And we can control the size of the figure by adding a `figsize` parameter to `plot()`.  For example, we can add `figsize=(15,5)` where 15 is the width and 5 is the height.
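A minimal sketch of a bar plot from a Series, assuming Matplotlib is installed. The Series `counts` and its values are made up for illustration; in the lesson you might plot a tally of species instead.

```python
import pandas as pd

# Made-up tallies for illustration only
counts = pd.Series({"DM": 3, "NL": 2, "PF": 1})

# kind picks the plot type; figsize is (width, height) in inches
ax = counts.plot(kind="bar", figsize=(15, 5))
ax.set_ylabel("number of observations")
```

`plot()` gives back a Matplotlib axes object, which is why further calls such as `set_ylabel()` can be used to adjust the figure.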

And now for a scatter plot (`kind = 'scatter'`), where we specify the names of the `x` and `y` variables:

But let's look up what more we can do with this plot.  Try looking up help on `matplotlib.pyplot.scatter`

Let's try using a few of these parameters, like `marker`, `alpha` and `c` (color)
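For instance, a scatter plot using these three parameters might look like the sketch below. The data are made up, and the particular marker, transparency and color are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows for illustration only
toy_df = pd.DataFrame({
    "weight": [40.0, 48.0, 180.0, 7.0],
    "hindfoot_length": [37.0, 36.0, 32.0, 14.0],
})

points = plt.scatter(
    x=toy_df.weight,
    y=toy_df.hindfoot_length,
    marker="x",     # shape of each point
    alpha=0.5,      # transparency, from 0 (invisible) to 1 (solid)
    c="green",      # color of the points
)
plt.xlabel("weight")
plt.ylabel("hindfoot_length")
```

In a notebook, `plt.show()` after these lines would render the figure.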

See if you can look up another plot type and get it to work!

## Nicer plotting, with `ggplot`

Let's try a different plotting library, called `ggplot` (from the `plotnine` package), that thinks about plotting data in a different way, in terms of "adding" *data* plus *aesthetics* (colors, shapes, etc.) plus *layers* (which add to or modify the plot)

In Python, the `ggplot` library is available as part of the `plotnine` package.

In [None]:
from plotnine import *

The first thing we'll do is create a ggplot object using `ggplot()` and give it:
- Our data frame (`surveys_df`)
- Aesthetics information, such as which variables in our data frame will be used as the independent (x) and dependent (y) variables.  We'll add (using `+`) `aes(x, y)`

In [None]:
ggplot(surveys_df) + aes(x = 'weight', y = 'hindfoot_length')

Hmmm, we get a canvas but it's mysteriously empty!   We need to add a ***layer***, using `+`.  We'll add a layer with points, using `geom_point()`.

In [None]:
ggplot(surveys_df) + aes(x = 'weight', y = 'hindfoot_length') + geom_point()

Not bad!  Let's see if we can perhaps color-code the points based on the species_id, by adding a `color` parameter to the `aes()` component.

In [None]:
ggplot(surveys_df) + aes(x = 'weight', y = 'hindfoot_length', color='species_id') + geom_point()

This is great, but with so many `species_id` values it becomes hard to tell the colors apart, so the color layer is not very meaningful.  Let's create a subset using just the top 10 species.

Remember earlier we created `species_counts`?  We'll use it here to get just the top ten species as a list:

In [None]:
species_counts

How would you get just the first 10 values?

But we simply want the species ids, not any of the other stuff.  Similar to a Python dictionary, we can use `.keys()` to get that:

And let's use a variable to store our index of top species so we can reuse it:
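The three steps above can be sketched on made-up data. This toy example keeps the top 2 species rather than the top 10, and it assumes `species_counts` was created with `value_counts()`; `toy_df` and its values are also assumptions.

```python
import pandas as pd

# Made-up rows for illustration only
toy_df = pd.DataFrame({
    "species_id": ["DM", "DM", "DM", "NL", "NL", "PF"],
})

species_counts = toy_df["species_id"].value_counts()

# value_counts() sorts from most to least frequent, so head() gives the top ones
top_counts = species_counts.head(2)

# .keys() gives back just the labels (the species ids), without the counts
top_species = top_counts.keys()
print(list(top_species))
```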

Now that we have a simple list of the top 10 species IDs, we can use `.isin()` to create a subset data frame containing just the rows with those species_id values.  Let's call this new data frame `subset_df`:

In [None]:
subset_df = surveys_df[surveys_df.species_id.isin(top_species)]
subset_df

Now that our new data frame has a smaller variety of species, let's try it again with `genus_species` as the key for the color:

In [None]:
surveys_plot = ggplot(subset_df) + \
  aes(x = 'weight', y = 'hindfoot_length', color='genus_species') + \
  geom_point(alpha=0.2)

And now to render the plot:

In [None]:
surveys_plot

Here are some other layers we can try adding.  See what the `stat_smooth()` and `theme_xkcd()` layers do:

In [None]:
surveys_plot + stat_smooth() + theme_xkcd()

We can also try `facet_wrap('~variablename')` and pass it the name of the variable we'd like to use as our facet.  Let's facet on `species_id` so we can look at the hindfoot_length vs. weight for each species separately:

In [None]:
surveys_plot + facet_wrap('~species_id')

## Linear Regression

Let's try a simple linear regression on one of the species' length/weight data.

In [None]:
ds_data = surveys_df[surveys_df.species_id=='DS'][['hindfoot_length', 'weight', 'sex']]

In [None]:
ds_data.head()

In [None]:
ds_data = ds_data[~ds_data.hindfoot_length.isnull() & ~ds_data.weight.isnull() & ~ds_data.sex.isnull()]

In [None]:
ds_data.head()

In [None]:
ds_data[ds_data.sex.isnull()]
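As an aside, the same filtering of missing values could probably be written with `dropna()`. The sketch below compares the two approaches on made-up data (`toy_ds` is a hypothetical stand-in for `ds_data`):

```python
import pandas as pd

# Made-up rows for illustration only; None marks a missing value
toy_ds = pd.DataFrame({
    "hindfoot_length": [50.0, None, 49.0, 51.0],
    "weight": [120.0, 115.0, None, 130.0],
    "sex": ["M", "F", "F", None],
})

# The approach used above: keep rows where none of the three values is null
cleaned_with_mask = toy_ds[
    ~toy_ds.hindfoot_length.isnull() & ~toy_ds.weight.isnull() & ~toy_ds.sex.isnull()
]

# An alternative: drop any row with a missing value in the listed columns
cleaned_with_dropna = toy_ds.dropna(subset=["hindfoot_length", "weight", "sex"])

print(cleaned_with_mask.equals(cleaned_with_dropna))
```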

In [None]:
import math
import numpy as np
from sklearn.linear_model import LinearRegression

In [None]:
ds_data.hist(bins=20, figsize=(20, 5)) # Creates one histogram per numeric column

In [None]:
100*ds_data['weight']

In [None]:
plt.scatter(ds_data.hindfoot_length, ds_data.weight)

In [None]:
#ds_data[['hindfoot_length']]

regr = LinearRegression()
regr.fit(X = ds_data[['hindfoot_length']], y = ds_data[['weight']])

In [None]:
regr.coef_

In [None]:
regr.intercept_

In [None]:
# R-squared

regr.score(X = ds_data[['hindfoot_length']], y = ds_data[['weight']])
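To help interpret `coef_`, `intercept_` and `score()`, here is a sketch on made-up data where the answer is known in advance: every `weight` is exactly `2 * hindfoot_length + 1`. The data are not from the survey, and the example assumes scikit-learn is installed. Unlike the cells above, `y` is passed as a single column (a Series), so `coef_` has one entry and `intercept_` is a single number.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data that lies exactly on the line weight = 2 * hindfoot_length + 1
toy_ds = pd.DataFrame({
    "hindfoot_length": [10.0, 20.0, 30.0, 40.0],
    "weight": [21.0, 41.0, 61.0, 81.0],
})

regr = LinearRegression()
regr.fit(X=toy_ds[["hindfoot_length"]], y=toy_ds["weight"])

slope = regr.coef_[0]          # change in weight per unit of hindfoot_length
intercept = regr.intercept_    # predicted weight when hindfoot_length is 0
r_squared = regr.score(X=toy_ds[["hindfoot_length"]], y=toy_ds["weight"])

# Predict the weight for a new, made-up hindfoot length
new_data = pd.DataFrame({"hindfoot_length": [25.0]})
predicted = regr.predict(new_data)

print(slope, intercept, r_squared, predicted[0])
```

On real data the points will not fall exactly on a line, so the R-squared value will be below 1.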