# QBio Python Workshop Week 5

## Pandas and Seaborn, Pt 1

Prepared and presented by Gloria Ha (gloriaha@g.harvard.edu) with reference to notes by Mary Richardson for MCB112.

### Outline
- Basic Pandas
    - Importing data
    - Cleaning dataframe
    - Referencing columns
    - Creating new columns
    - Sorting dataframe
- Basic Seaborn
    - Plotting data from a dataframe
- More Pandas
    - Grouping data
    - Creating a new dataframe

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
%matplotlib inline

### Importing data

**Pandas**, or Python Data Analysis Library, is a package that makes it easy to import, view, clean up, and analyze data.  We will focus on a Pandas data structure called a **DataFrame**, which is a 2D table with labeled rows and columns.

For today's lesson, we'll be using toy data from my own research.  I'm interested in quantifying chromosome segregation errors in cell division through live-cell imaging.  In normal cell division, one cell aligns the chromosomes at the center of the cell and divides an equal number of chromatids (1 from each chromosome) into two daughter cells.  In my experiments, I force cells to undergo anaphase (chromatid separation) at different times using a drug.  I then observe (1) when the cells go into anaphase and (2) how many chromatids are in each daughter cell after division.  


Let's go ahead and take a look at the data, which I've provided in a comma-separated values (.csv) format.  We can preview a file from within the notebook by running the shell command `head` with a leading `!`.

In [None]:
!head kt_counts.csv

Let's load the data.  Since it's in a CSV (comma-separated values) format, we can use `pd.read_csv()`.

Since the CSV file already has the headers we need, we don't need to specify the column names.  I'll provide the code to import the data with specified column names in the comments.
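To see what the two import styles do without touching the real file, here is a minimal sketch that reads a small made-up CSV from a string.  The values and the replacement column names (`id`, `drug_time`, `count_1`) are hypothetical and only there for illustration.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for a real file (values are made up)
csv_text = "sample,forced_time,N1\n1,10,46\n2,20,45\n"

# Default: the first line of the file is used as the header
df_header = pd.read_csv(io.StringIO(csv_text))

# Custom names: skip the original header row so it is not read in as data
df_named = pd.read_csv(io.StringIO(csv_text),
                       names=['id', 'drug_time', 'count_1'],
                       skiprows=1)

print(df_header.columns.tolist())
print(df_named.columns.tolist())
```

Both dataframes hold the same two rows of data; only the column labels differ.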

In [None]:
# import data
imported_df = pd.read_csv('kt_counts.csv')

# import data with specified column names (have to get rid of the first row since it's redundant now)
# imported_df = pd.read_csv('kt_counts.csv',names=['sample','forced_time','position','N1','N2','anaphase_time','congression'],skiprows=1)

# take a look at the dataframe (default is the first five rows)
imported_df.head()

This has the same information as the preview from the command line, but looks a lot nicer and spaced out.  

There are 7 columns here.
- **sample**: ID number of sample
- **forced_time**: time of drug addition (minutes)
- **position**: ID number of cell within the sample
- **N1**: number of chromatids in cell 1
- **N2**: number of chromatids in cell 2
- **anaphase_time**: time of chromatid separation (minutes)
- **congression**: whether or not the chromosomes aligned at the center of the cell before anaphase (Y/N)

This data is **tidy**.

Following the tidyr documentation (https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html), that means:
- Every column is a variable.

- Every row is an observation.

- Every cell is a single value.

Today we're starting with a tidy dataframe.  Mary will go more into the process of tidying data next week!

### Cleaning dataframe

Next we need to clean the data.  There are some cells that I wasn't able to analyze, so there is either no data or text in the `N1` and `N2` columns.  There are also some where I couldn't get the anaphase times.  Let's use the Pandas function `isna()` to check for empty cells.

In [None]:
# store information on empty status of each entry in the dataframe
isna = imported_df.isna()
isna

It seems like most of our data is fine.  If we want to isolate and display the rows that have empty cells, we can first check which rows have any empty cells.

In [None]:
# check for rows with any empty cells
empty_rows = imported_df.isna().any(axis=1)
empty_rows

Now let's display those rows.

In [None]:
# display empty rows
imported_df[empty_rows]

We only have two cells with missing data, so now we can get rid of them using the `dropna()` function.  We can either specify which columns to search for empty cells in, or get rid of all rows with any empty cells at all.  I'll show you both ways below.  We'll store the new, clean dataframe in a new dataframe called `cleaned_df`.

In [None]:
# commented out: how to specify which columns to search for empty cells in
# cleaned_df = imported_df.dropna(subset=['N1','N2','anaphase_time'])

# drop rows with any empty cells!
cleaned_df = imported_df.dropna()
cleaned_df.head()
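The difference between the two `dropna()` calls is easier to see on a small table.  Here is a sketch on a made-up dataframe (`toy_df` and its values are hypothetical, not taken from `kt_counts.csv`).

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with one missing count and one missing note
toy_df = pd.DataFrame({'N1': [46, np.nan, 45],
                       'N2': [46, 44, 47],
                       'note': ['ok', 'ok', None]})

# True for every row that has at least one empty cell
has_missing = toy_df.isna().any(axis=1)

# Drop rows with an empty cell in ANY column (keeps only row 0)
drop_any = toy_df.dropna()

# Drop rows with an empty cell in N1 or N2 only (keeps rows 0 and 2)
drop_counts = toy_df.dropna(subset=['N1', 'N2'])

print(has_missing.tolist())
print(drop_any.index.tolist())
print(drop_counts.index.tolist())
```

Using `subset` is handy when a column you don't care about (like `note` here) has gaps that shouldn't cost you the whole row.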

To make sure that we got rid of the rows with empty cells, we can check how many rows the dataframe has now.  There should be 87 rows now since there were 89 before.

In [None]:
cleaned_df

Great!  You'll notice that even though there are 87 rows, the indices on the left of the dataframe still go to 88.  We can reset the index of the dataframe.

In [None]:
cleaned_df = cleaned_df.reset_index(drop=True) 
cleaned_df

You'll notice that I keep re-defining `cleaned_df`.  Another way to accomplish the same thing is to pass `inplace=True`, which modifies the dataframe directly instead of returning a new one.  For example, another way of writing the above command is:

In [None]:
cleaned_df.reset_index(drop=True, inplace=True) 
cleaned_df
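The two styles behave differently in one important way: with `inplace=True` the method returns `None`.  A minimal sketch on a made-up dataframe (`toy_df` is hypothetical):

```python
import pandas as pd

# Hypothetical dataframe whose index has gaps, as if rows had been dropped
toy_df = pd.DataFrame({'value': [10, 20, 30]}, index=[0, 2, 5])

# Style 1: reset_index returns a new dataframe, which we assign to a name
reassigned = toy_df.reset_index(drop=True)

# Style 2: inplace=True modifies toy_df itself and returns None
result = toy_df.reset_index(drop=True, inplace=True)

print(reassigned.index.tolist())
print(toy_df.index.tolist())
print(result)
```

So pick one style per line: writing `df = df.reset_index(drop=True, inplace=True)` would replace your dataframe with `None`.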

### Referencing columns

You can reference columns of dataframes using brackets.  For example, if I wanted to print out the drug addition times for each cell, I could write `cleaned_df['forced_time']`.  Our column names are strings, so they go in quotes.  Let's check out the anaphase times.

In [None]:
cleaned_df['anaphase_time']

Now we can calculate things from our dataframe.  **Try calculating the mean `anaphase_time` for all of the cells.**  There are multiple ways to do this, including with numpy (`np.mean()`).  You can also take the series and directly call the function `mean()`: `df['column name'].mean()`.  Similarly, you can use `sum()`, `max()`, `min()`, and many other functions!

In [None]:
# calculate the mean anaphase time


What if we wanted to calculate the mean anaphase time just for cells that were forced into anaphase at 20 minutes?  We can call a subset of a dataframe that satisfies some condition like this:

`df[df['column_name']==VAL]`
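What happens under the hood is that the comparison produces a column of `True`/`False` values, and the outer brackets keep only the `True` rows.  Here is a sketch on a made-up dataframe (`toy_df`, `group`, and `value` are hypothetical names):

```python
import pandas as pd

# Hypothetical dataframe with two groups
toy_df = pd.DataFrame({'group': ['a', 'b', 'a', 'b'],
                       'value': [1.0, 2.0, 3.0, 6.0]})

# The comparison gives one True/False entry per row
mask = toy_df['group'] == 'b'

# Indexing with the mask keeps only the rows where it is True
subset = toy_df[mask]

# Any column of the subset can then be summarized as usual
subset_mean = subset['value'].mean()

print(mask.tolist())
print(subset_mean)
```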

**Try showing the subset of the dataframe for cells that were forced into anaphase at 20 minutes**

In [None]:
# show the part of the dataframe with cells forced into anaphase at 20 minutes


Now try **calculating the mean anaphase time for cells that were forced into anaphase at 20 minutes.**

In [None]:
# calculate the mean anaphase time for cells forced into anaphase at 20 minutes


### Creating new columns

You can also create new columns in a dataframe.  For example, if I wanted to have a new column called 'force_to_anaphase' specifying how long after the drug addition time the cell went into anaphase, I would do:

`cleaned_df['force_to_anaphase'] = cleaned_df['anaphase_time']-cleaned_df['forced_time']`.
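Column arithmetic like this works row by row.  As a minimal sketch on a made-up dataframe (`toy_df`, `start`, and `end` are hypothetical), note that a series also has an `abs()` method for absolute values:

```python
import pandas as pd

# Hypothetical start and end times
toy_df = pd.DataFrame({'start': [2, 5, 9],
                       'end': [7, 3, 9]})

# Subtraction is applied to each row separately
toy_df['elapsed'] = toy_df['end'] - toy_df['start']

# abs() takes the absolute value of every entry in the column
toy_df['abs_elapsed'] = toy_df['elapsed'].abs()

print(toy_df)
```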

What we're actually interested in looking at today is the difference in the number of chromatids between each pair of daughter cells ($\Delta N = |N_1-N_2|$).  We can define a new column `dN` with this metric.  Try defining a new column in our dataframe and then taking a look at the resulting dataframe.

In [None]:
# make a new column dN =|N1-N2|

# take a look at the resulting dataframe


All of our important numerical data is already in float format, but there are times when the data is not, so we need to convert everything to floats (or whatever data type you desire). Here's how to do it.

In [None]:
# convert dN and anaphase time data to floats
# (astype returns a new dataframe, so we assign it back to cleaned_df)
cleaned_df = cleaned_df.astype({'dN': 'float', 'anaphase_time': 'float'})
cleaned_df.head()
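One thing to watch out for is that `astype()` returns a new dataframe and leaves the original untouched, so the result needs to be stored somewhere.  A small sketch on made-up data (`toy_df` and its columns are hypothetical):

```python
import pandas as pd

# Hypothetical dataframe: integer counts and times stored as text
toy_df = pd.DataFrame({'count': [1, 2, 3],
                       'time': ['4.5', '6', '7.25']})

# astype returns a converted copy; toy_df itself is not changed
converted = toy_df.astype({'count': 'float', 'time': 'float'})

print(toy_df.dtypes)
print(converted.dtypes)
```

Checking `.dtypes` before and after is a quick way to confirm the conversion did what you expected.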

### Sorting dataframe

There are various reasons why you might want to sort your dataframe.  It might be for visualization purposes or to make downstream calculations easier.  Let's try sorting the dataframe by drug addition time.

The general command for sorting a dataframe is
`df = df.sort_values(by=['column_name'])`

You can also sort in descending order by specifying `ascending=False`. You can sort by column A and then sort any ties by column B by passing `['column_A','column_B']`.
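Here is a sketch of both options on a made-up dataframe (`toy_df`, `time`, and `score` are hypothetical names):

```python
import pandas as pd

# Hypothetical dataframe with repeated times
toy_df = pd.DataFrame({'time': [20, 10, 20, 10],
                       'score': [3, 1, 2, 4]})

# Sort by time first, then break ties using score
by_time_then_score = toy_df.sort_values(by=['time', 'score'])

# Sort by score from largest to smallest
by_score_desc = toy_df.sort_values(by=['score'], ascending=False)

print(by_time_then_score)
print(by_score_desc)
```

The original row labels travel with the rows, which is why the index looks shuffled after sorting.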

Try sorting the dataframe by `'forced_time'` in ascending order and taking a look at the resulting dataframe!

In [None]:
# sort dataframe by forced_time column

# take a look at the resulting dataframe


You'll notice that the indices are all shuffled again.  Though it doesn't matter for our purposes today, you can reindex the dataframe again using the same command as before.

In [None]:
cleaned_df.reset_index(drop=True, inplace=True) 
cleaned_df

### Plotting data from a dataframe

Now that we have our nice and clean dataframe we can start looking at the data!  You can always use standard Matplotlib functionality to plot data from Pandas dataframes.  Let's plot a histogram of the $\Delta N$ values for cells that were forced into anaphase at 10 minutes using the standard `plt.hist()` command.

In [None]:
# plot a histogram of dN values for cells forced into anaphase at 10 minutes
plt.hist(cleaned_df[cleaned_df['forced_time']==10]['dN']);
plt.xlabel('dN');
plt.ylabel('number of cells');

The nice thing about **Seaborn** is that we can plot directly from the dataframe and display things according to our desired attributes.  For most Seaborn plotting functions, the arguments include:
- **data**: dataframe or subset of dataframe to plot
- **x**: column to plot data from (string)
- **hue**: column to color data based on (string)
- **y**: column to plot data from (string, for 2D plots)

**Try using `sns.histplot()` with the `data`, `x`, and `hue` arguments for plotting `dN` for the cells forced into anaphase at 10 minutes, colored by chromosome congression.**

In [None]:
# plot a histogram of dN values for cells forced into anaphase at 10 minutes using sns.histplot()


As you can see, Seaborn automatically adds in x and y axis labels for your plot and provides a nice legend for your coloring.

What's a good way to look at all of the data at once?  We could try plotting all of the `dN` data as a histogram.

In [None]:
sns.histplot(data=cleaned_df, x='dN',hue='forced_time');

That's not very nice.  It would be better to separate out the data.  One option is `sns.swarmplot()`.  **Try plotting `forced_time` on the x-axis and `dN` on the y-axis, and coloring by chromosome congression.**

In [None]:
# plot dN vs forced_time using sns.swarmplot()
plt.figure(figsize=(15,8))


**Try doing the same thing but for anaphase onset times (you should only have to change one argument).**

In [None]:
plt.figure(figsize=(15,8))


### Grouping data

So far we've tried displaying all of the data, and there seems to be a time-dependent trend!  To quantify that trend, I want to condense the data into some statistical measurements.  I'm interested in the mean and standard error of the mean of $\Delta N$ for each timepoint.  There's a handy Pandas function `groupby()` that groups the rows of a dataframe by the values in some column.  In my case, I want to group rows by `forced_time`.  Let's see what happens if I display the means of each group.

In [None]:
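# numeric_only=True skips text columns such as congression (recent Pandas versions raise an error otherwise)
cleaned_df.groupby('forced_time').mean(numeric_only=True)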

Some of these columns don't make much sense (average position or sample number mean nothing to me!), but the average anaphase time and $\Delta N$ are things I am interested in!
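To see what `groupby()` does with a mix of numeric and text columns, here is a sketch on a made-up dataframe (`toy_df`, `group`, `value`, and `label` are hypothetical names).  Selecting the column before calling `mean()` avoids computing averages you don't need.

```python
import pandas as pd

# Hypothetical dataframe with one numeric and one text column
toy_df = pd.DataFrame({'group': ['a', 'a', 'b', 'b'],
                       'value': [1.0, 3.0, 10.0, 20.0],
                       'label': ['Y', 'N', 'Y', 'Y']})

# Mean of every numeric column per group; the text column is skipped
group_means = toy_df.groupby('group').mean(numeric_only=True)

# Mean of a single column per group (returns a series indexed by group)
value_means = toy_df.groupby('group')['value'].mean()

print(group_means)
print(value_means)
```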

I can take a look at just the mean `dN` values by group:

In [None]:
cleaned_df.groupby('forced_time')['dN'].mean()

I can use `.sem()` similarly to calculate the standard error of the mean of `dN` for each group, and `.count()` to calculate how many cells are in each group.

**Try making three new variables, `counts`, `means`, and `sems`, that contain this information for `dN`.**

In [None]:
# store statistical measurements in new variables
counts = 
means = cleaned_df.groupby('forced_time')['dN'].mean()
sems = 

### Creating a new dataframe

Let's get some practice making a new dataframe.  This is the dataframe I want:

| forced_time | count | mean | sem |
| --- | --- | --- | --- |
| 10 |  |  |  |
| 20 |  |  |  |
| 30 |  |  |  |

Where count is the number of cells in each group, and mean and sem are measurements of `dN`.

First, let's make a dictionary containing this information.  You can plug in the variables you just made (`counts`, `means`, and `sems`).

In [None]:
stat_dict = {'count': counts,
             'mean': means,
             'sem': sems}

Making a dataframe from this is super easy now!  The dictionary keys become the columns and the values become the cells.
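When the dictionary values are series, as they are after a `groupby()`, their shared index becomes the index of the new dataframe.  Here is a sketch with made-up numbers (the names `toy_counts`, `toy_means`, and the index label `time` are hypothetical):

```python
import pandas as pd

# Hypothetical per-group statistics, both indexed by the same group labels
group_index = pd.Index([10, 20], name='time')
toy_counts = pd.Series([2, 3], index=group_index)
toy_means = pd.Series([1.5, 4.0], index=group_index)

# Dictionary keys become column names; the shared index labels the rows
toy_stats = pd.DataFrame({'count': toy_counts, 'mean': toy_means})

# reset_index moves the group labels out of the index into a regular column
toy_stats_flat = toy_stats.reset_index()

print(toy_stats)
print(toy_stats_flat)
```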

In [None]:
# set up dataframe for statistics
stat_df = pd.DataFrame(stat_dict)

# move forced_time out of the index and into a regular column
stat_df = stat_df.reset_index()
stat_df

Now we can plot how the mean `dN` changes over time and plot error bars as well!

In [None]:
plt.errorbar(stat_df['forced_time'],stat_df['mean'],yerr=stat_df['sem'],fmt='o-')
plt.xlabel('forced_time (min)')
plt.ylabel('<dN>');
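As an alternative to assembling the dictionary by hand, the `agg()` method can compute several statistics per group in one step.  This is a sketch on made-up data, not the approach used above (`toy_df`, `group`, and `value` are hypothetical names):

```python
import pandas as pd

# Hypothetical measurements at two timepoints
toy_df = pd.DataFrame({'group': [10, 10, 20, 20],
                       'value': [1.0, 3.0, 10.0, 20.0]})

# One column per statistic, one row per group
toy_stats = (toy_df.groupby('group')['value']
                   .agg(['count', 'mean', 'sem'])
                   .reset_index())

print(toy_stats)
```

The resulting columns can be passed to `plt.errorbar()` in the same way as the columns of `stat_df`.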