# Introduction to Pandas 2 | Narrow versus Wide and more

-------


> This is an aggregated tutorial relying on material from the following fantastic sources:
- [Justin Bois](http://justinbois.github.io/bootcamp/2020/index.html). This version uses modified training datasets and adapts the content to the Colab environment.
- [BIOS821 course at Duke](https://people.duke.edu/~ccc14/bios-821-2017/index.html)
- [Pandas documentation](https://pandas.pydata.org/docs/user_guide/index.html/) 

In [None]:
import pandas as pd

<hr>

In the last lesson, we learned about Pandas and dipped our toe in to see its power. In this lesson, we will continue to harness the power of Pandas to pull out subsets of data we are interested in.

## Tidy data

[Hadley Wickham](https://en.wikipedia.org/wiki/Hadley_Wickham) wrote a [great article](http://dx.doi.org/10.18637/jss.v059.i10) in favor of "tidy data." Tidy data frames follow the rules:

1. Each variable is a column.
2. Each observation is a row.
3. Each type of observation has its own separate data frame.

This is less pretty to visualize as a table, but we rarely look at data in tables. Indeed, the representation of data which is convenient for visualization is different from that which is convenient for analysis. A tidy data frame is almost always **much** easier to work with than non-tidy formats.

Also, let's take a look at this [article](https://dtkaplan.github.io/DataComputingEbook/chap-tidy-data.html#chap:tidy-data). 

## The data set

The dataset we will be using is a list of all SARS-CoV-2 datasets in the [Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra) as of January 20, 2021.

It was obtained by going to https://www.ncbi.nlm.nih.gov/sra and performing a query with the following search term: `txid2697049[Organism:noexp]`.

The results were downloaded using the `Send to:` menu, selecting `File` and then `RunInfo`. Let's get these results into this notebook:

In [None]:
df = pd.read_csv('https://github.com/nekrut/BMMB554/raw/master/2021/data/sra_ncov_bmmb554.csv.gz')
df = df[df['size_MB']> 0].reset_index(drop=True)

# Take a look
df

This data set is in tidy format. Each row represents a single SRA dataset. The properties of each run are given in each column. We already saw the power of having the data in this format when we did Boolean indexing in the last lesson. 

## Finding unique values and counts

How many unique sequencing platforms do we have?



In [None]:
df['Platform'].unique()

In [None]:
df['Platform'].value_counts()
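
To make the difference between these calls concrete, here is a minimal sketch on a hypothetical toy `Series` (the values below are made up for illustration and are not taken from the SRA table). It also shows `nunique()`, which answers the "how many" question directly.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'Platform' column
platforms = pd.Series(
    ["ILLUMINA", "OXFORD_NANOPORE", "ILLUMINA", "PACBIO_SMRT", "ILLUMINA"],
    name="Platform",
)

# Distinct values, in order of first appearance
distinct = platforms.unique()

# Number of distinct values (missing values are excluded by default)
n_distinct = platforms.nunique()

# Number of rows per value, sorted from most to least frequent
counts = platforms.value_counts()

print(distinct)
print(n_distinct)
print(counts)
```

`unique()` tells you which values occur, `nunique()` how many there are, and `value_counts()` how often each one occurs.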

## Sorting

(and axes!)

Let's start by sorting on index:

In [None]:
df_subset = df.sample(n=10)

In [None]:
df_subset

In [None]:
df_subset.sort_index()

In [None]:
df_subset.sort_index(axis = 1)

Now let's try sorting by values:

In [None]:
df_subset.sort_values(by=['LibraryLayout'])

In [None]:
df_subset.sort_values(by=['LibraryLayout','size_MB'])

In [None]:
df_subset.sort_values(by=['LibraryLayout','size_MB'],ascending=[True,False])
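
Because `df.sample()` returns different rows on every run, it can be easier to see what each sorting call does on a small fixed table. The sketch below uses a hypothetical toy data frame (made-up values, with column names borrowed from the SRA table).

```python
import pandas as pd

# Hypothetical toy data with a deliberately shuffled row index
toy = pd.DataFrame(
    {
        "size_MB": [30, 10, 20, 40],
        "LibraryLayout": ["SINGLE", "PAIRED", "PAIRED", "SINGLE"],
    },
    index=[3, 0, 2, 1],
)

# axis=0 (the default): reorder rows by their index labels
by_rows = toy.sort_index()

# axis=1: reorder columns by their names
by_cols = toy.sort_index(axis=1)

# Layout in ascending order and, within each layout, the largest run first
by_values = toy.sort_values(
    by=["LibraryLayout", "size_MB"], ascending=[True, False]
)

print(by_rows)
print(by_cols)
print(by_values)
```

None of these calls modifies `toy`; each returns a new, sorted data frame.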

## Split-apply-combine

The general idea of "Split-Apply-Combine" is shown in this figure:

<img src="https://camo.githubusercontent.com/60a1e7e95eaef8f9a99f43335368915eafedda3e/687474703a2f2f7777772e686f66726f652e6e65742f737461743537392f736c696465732f73706c69742d6170706c792d636f6d62696e652e706e67" alt="Drawing" style="width: 400px;"/>

> Image from [BIOS821](https://people.duke.edu/~ccc14/bios-821-2017/index.html#)

Let's say we want to compute the total size of SRA runs for each `BioProject`. Ignoring for a second the mechanics of how we would do this with Pandas, let's think about it in English. What do we need to do?

1. **Split** the data set up according to the `'BioProject'` field, i.e., split it up so we have a separate data set for each BioProject ID.
2. **Apply** a sum function to the `size_MB` values in these split data sets.
3. **Combine** the results of these sums into a new, summary data set that contains one row for each BioProject and the total size for each.
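
The three steps above can be written out by hand for the stated goal (total `size_MB` per `BioProject`). This is only a sketch on a hypothetical toy table with made-up project IDs and sizes; the last line shows that a pandas `groupby()` call arrives at the same numbers.

```python
import pandas as pd

# Hypothetical toy data: project IDs and sizes are made up
toy = pd.DataFrame(
    {
        "BioProject": ["PRJ_A", "PRJ_B", "PRJ_A", "PRJ_B", "PRJ_C"],
        "size_MB": [10, 5, 20, 15, 7],
    }
)

totals = {}
for project in toy["BioProject"].unique():
    # Split: pull out the rows that belong to this project
    subset = toy[toy["BioProject"] == project]
    # Apply: sum the run sizes within the subset
    totals[project] = subset["size_MB"].sum()

# Combine: collect the per-project results into one summary object
manual = pd.Series(totals, name="size_MB")

# The same computation expressed with groupby()
with_groupby = toy.groupby("BioProject")["size_MB"].sum()

print(manual)
print(with_groupby)
```

The explicit loop is useful for understanding the idea, but it gets verbose quickly once there are several grouping columns or several summary functions.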

We see that the strategy we want is a **split-apply-combine** strategy. This idea was put forward by Hadley Wickham in [this paper](http://dx.doi.org/10.18637/jss.v040.i01). It turns out that this is a strategy we want to use *very* often. Split the data in terms of some criterion. Apply some function to the split-up data. Combine the results into a new data frame.

Note that if the data are tidy, this procedure makes a lot of sense. Choose the column you want to use to split by. All rows with like entries in the splitting column are then grouped into a new data set. You can then apply any function you want to these new data sets. You can then combine the results into a new data frame.

Pandas's split-apply-combine operations are achieved using the `groupby()` method. You can think of `groupby()` as the splitting part. You can then apply functions to the resulting `DataFrameGroupBy` object. The [Pandas documentation on split-apply-combine](http://pandas.pydata.org/pandas-docs/stable/groupby.html) is excellent and worth reading through. It is extensive though, so don't let yourself get intimidated by it.

### Aggregation

Let's go ahead and do our first split-apply-combine on this tidy data set. First, we will split the data set up by `BioProject`.

In [None]:
grouped = df.groupby(['BioProject'])

# Take a look
grouped

There is not much to see in the `DataFrameGroupBy` object that resulted. But there is a lot we can do with this object. Typing `grouped.` and hitting tab will show you the many possibilities. For most of these possibilities, the apply and combine steps happen together and a new data frame is returned. The `grouped.sum()` method is exactly what we want.

In [None]:
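# numeric_only=True restricts the sum to numerical columns
# (since pandas 2.0, non-numeric columns are no longer dropped automatically)
df_sum = grouped.sum(numeric_only=True)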

# Take a look
df_sum

In [None]:
df_sum = pd.DataFrame(grouped['size_MB'].sum())
df_sum

The resulting data frame contains sums of the numerical columns only, of which we have just one: `size_MB`. Note that this data frame has `BioProject` as the name of the row index. If we want to instead keep `BioProject` (which, remember, is what we used to split up the data set before we computed the summary statistics) as a column, we can use the `reset_index()` method.

In [None]:
df_sum.reset_index()

Note, though, that this was not done in-place. If you want to update your data frame, you have to explicitly do so.

In [None]:
df_sum = df_sum.reset_index()
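
As an aside, `groupby()` also accepts an `as_index=False` argument that keeps the grouping column as a regular column from the start. The sketch below compares both routes on a hypothetical toy table (made-up values).

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {
        "BioProject": ["PRJ_A", "PRJ_B", "PRJ_A"],
        "size_MB": [10, 5, 20],
    }
)

# Default: the grouping column becomes the row index of the result
indexed = toy.groupby("BioProject")[["size_MB"]].sum()

# reset_index() returns a new data frame; `indexed` itself is unchanged
flat = indexed.reset_index()

# as_index=False keeps the grouping column as an ordinary column
flat_direct = toy.groupby("BioProject", as_index=False)[["size_MB"]].sum()

print(indexed)
print(flat)
print(flat_direct)
```

Which form to use is mostly a matter of taste; `reset_index()` is handy when you already have an indexed result in hand.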

We can also use multiple columns in our `groupby()` operation. To do this, we simply pass in a list of columns into `df.groupby()`. We will **chain the methods**, performing a groupby, applying a sum, and then resetting the index of the result, all in one line.

In [None]:
df.groupby(['BioProject', 'Platform']).sum(numeric_only=True).reset_index()

This type of operation is called an **aggregation**. That is, we split the data set up into groups, and then computed a summary statistic for each group, in this case the sum.

You now have tremendous power in your hands. When your data are tidy, you can rapidly accelerate the ubiquitous split-apply-combine methods. Importantly, you now have a logical framework to think about how you slice and dice your data. As a final, simple example, I will show you how to go start to finish after loading the data set into a data frame, splitting by `BioProject` and `Platform`, and then getting the quartiles and extrema, in addition to the mean and standard deviation.

In [None]:
df.groupby(['BioProject', 'Platform'])['size_MB'].describe()

In [None]:
df.groupby(['BioProject', 'Platform'])['size_MB'].describe().reset_index()

In [None]:
import numpy as np
df.groupby(['BioProject', 'Platform']).agg({'size_MB':np.mean, 'Run':'nunique'})

Yes, that's right. One single, clean, easy to read line of code. In coming tutorials, we will see how to use tidy data to quickly generate plots.

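Why is `np.mean` written without quotes while `nunique` is quoted? A quoted name refers to one of pandas' own aggregation methods, whereas an unquoted name is a function object passed in directly. See [here](https://stackoverflow.com/questions/66443260/why-are-some-pandas-aggregation-functions-in-quotes-and-others-not) for details.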
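
For comparison, here is a sketch of the same kind of aggregation written with string names only, on a hypothetical toy table (made-up values). Recent pandas releases emit a `FutureWarning` when a NumPy function such as `np.mean` is passed to `agg()` and suggest the string `'mean'` instead, so the string form may be the safer habit. The second call uses named aggregation, which lets you choose the output column names; `mean_size_MB` and `n_runs` are names invented for this example.

```python
import pandas as pd

# Hypothetical toy data; note that run 'R3' appears twice in PRJ_B
toy = pd.DataFrame(
    {
        "BioProject": ["PRJ_A", "PRJ_A", "PRJ_B", "PRJ_B"],
        "Platform": ["ILLUMINA", "ILLUMINA", "ILLUMINA", "ILLUMINA"],
        "Run": ["R1", "R2", "R3", "R3"],
        "size_MB": [10.0, 20.0, 5.0, 15.0],
    }
)

# String names are looked up among pandas' own groupby methods
summary = toy.groupby(["BioProject", "Platform"]).agg(
    {"size_MB": "mean", "Run": "nunique"}
)

# Named aggregation: output_column=(input_column, function)
named = toy.groupby(["BioProject", "Platform"]).agg(
    mean_size_MB=("size_MB", "mean"),
    n_runs=("Run", "nunique"),
)

print(summary)
print(named)
```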

## Tidying a data set

You should always organize your data sets in a tidy format. However, this is sometimes just not possible, since your data sets can come from instruments that do not output the data in tidy format (though most do, at least in my experience), and you often have collaborators that send data in untidy formats.

The most useful function for tidying data is `pd.melt()`. To demonstrate this we will use a dataset describing read coverage across SARS-CoV-2 genomes for a number of samples.

In [None]:
df = pd.read_csv('https://github.com/nekrut/BMMB554/raw/master/2021/data/coverage.tsv',sep='\t')

df.head()

Clearly these data are not tidy. When we melt the data frame, the data within it, called **values**, become a single column. The headers, called **variables**, also become new columns. So, to melt it, we need to specify what we want to call the values and what we want to call the variable. [`pd.melt()`](https://pandas.pydata.org/docs/reference/api/pandas.melt.html#pandas.melt) does the rest!
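
Before melting the real coverage table, it may help to see the operation on something small enough to check by eye. The sketch below uses a hypothetical two-sample table; the column names `sample_1` and `sample_2` and all values are made up.

```python
import pandas as pd

# Hypothetical wide table: one coverage column per sample
wide = pd.DataFrame(
    {
        "start": [0, 100],
        "end": [100, 200],
        "sample_1": [12, 30],
        "sample_2": [7, 25],
    }
)

tidy = pd.melt(
    wide,
    id_vars=["start", "end"],  # columns repeated on every output row
    var_name="sample",         # name for the column holding the old headers
    value_name="coverage",     # name for the column holding the cell values
)

print(tidy)
```

When `value_vars` is omitted, as here, every column not listed in `id_vars` is melted. The two rows and two sample columns of `wide` become four rows in `tidy`, one per (interval, sample) observation.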

![](https://pandas.pydata.org/docs/_images/07_melt.svg)

> Image from [Pandas Docs](https://pandas.pydata.org/docs/getting_started/intro_tutorials/07_reshape_table_layout.html#wide-to-long-format).



In [None]:
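# id_vars stay as identifier columns; value_vars are stacked into one column.
# var_name must be a single string here (newer pandas rejects a list).
melted = pd.melt(df, id_vars=['start','end'], value_vars=df.columns[3:], var_name='sample', value_name='coverage')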

melted.head()

...and we are good to go with a tidy DataFrame! This allows us to easily plot coverage and then take a look at the summary:

In [None]:
import seaborn as sns
sns.relplot(data=melted, x='start',y='coverage',kind='line')

In [None]:
sns.relplot(data=melted, x='start',y='coverage',kind='line',hue='sample')

In [None]:
melted.groupby(['sample']).describe()

In [None]:
melted.groupby(['sample'])['coverage'].describe()

To get back from melted (narrow) format to wide format we can use the [`pivot()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html#pandas.DataFrame.pivot) method.

![](https://pandas.pydata.org/docs/_images/07_pivot.svg)

> Image from [Pandas Docs](https://pandas.pydata.org/docs/getting_started/intro_tutorials/07_reshape_table_layout.html#long-to-wide-table-format).





In [None]:
melted.pivot(index=['start','end'],columns='sample',values='coverage')
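
The result of `pivot()` carries the `index` columns as a row index and a `sample` label on the column axis. If a plain wide table is preferred, both can be flattened. The sketch below does this on a hypothetical toy table (made-up values), not on the real coverage data.

```python
import pandas as pd

# Hypothetical tidy (narrow) table
tidy = pd.DataFrame(
    {
        "start": [0, 100, 0, 100],
        "end": [100, 200, 100, 200],
        "sample": ["sample_1", "sample_1", "sample_2", "sample_2"],
        "coverage": [12, 30, 7, 25],
    }
)

# Narrow to wide: one row per (start, end), one column per sample
wide = tidy.pivot(index=["start", "end"], columns="sample", values="coverage")

# Turn the (start, end) index back into columns and drop the leftover
# 'sample' label on the column axis
wide_flat = wide.reset_index().rename_axis(columns=None)

print(wide)
print(wide_flat)
```

Note that `pivot()` raises an error if the same combination of index and column values occurs more than once; in that case `pivot_table()`, which aggregates duplicates, is the usual alternative.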

## Working with multiple tables

Working with multiple tables often involves joining them on a common key:

![](https://pandas.pydata.org/docs/_images/08_merge_left.svg)

In fact, this can be done in several different ways described below. But first let's define two simple dataframes:



In [None]:
!pip install --upgrade pandasql

In [None]:
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())

In [None]:
df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": np.random.randn(4)})
df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": np.random.randn(4)})

In [None]:
df1

In [None]:
df2

### Inner join

![](https://upload.wikimedia.org/wikipedia/commons/thumb/1/18/SQL_Join_-_07_A_Inner_Join_B.svg/234px-SQL_Join_-_07_A_Inner_Join_B.svg.png?20170204165143)

> Figure from Wikimedia Commons

Using pandas `merge`:

In [None]:
pd.merge(df1, df2, on="key")

Using `pysqldf`:

In [None]:
pysqldf('select * from df1 join df2 on df1.key=df2.key')
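
If `pandasql` is not available, the same comparison between SQL and `merge` can be sketched with Python's built-in `sqlite3` module. This is an alternative route, not what `pysqldf` is shown doing above, and it uses fixed made-up values instead of `np.random.randn` so the two results can be compared.

```python
import sqlite3
import pandas as pd

# Same keys as df1 and df2 above, with fixed (made-up) values
df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": [1.0, 2.0, 3.0, 4.0]})
df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": [5.0, 6.0, 7.0, 8.0]})

# Copy both data frames into an in-memory SQLite database
con = sqlite3.connect(":memory:")
df1.to_sql("df1", con, index=False)
df2.to_sql("df2", con, index=False)

# Inner join in SQL, naming the value columns the way merge() does
query = """
select df1.key, df1.value as value_x, df2.value as value_y
from df1 join df2 on df1.key = df2.key
"""
via_sql = pd.read_sql_query(query, con)
con.close()

# Inner join in pandas
via_merge = pd.merge(df1, df2, on="key")

print(via_sql)
print(via_merge)
```

Both routes return three rows: one for key `B` and two for key `D`, because `D` appears twice in `df2`.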

### Left join

![](https://upload.wikimedia.org/wikipedia/commons/thumb/d/dc/SQL_Join_-_01b_A_Left_Join_B.svg/234px-SQL_Join_-_01b_A_Left_Join_B.svg.png?20170204144906)

> Figure from Wikimedia Commons

Using pandas `merge`:

In [None]:
pd.merge(df1, df2, on="key", how="left").fillna('.')

Using `pysqldf`:

In [None]:
pysqldf('select * from df1 left join df2 on df1.key=df2.key').fillna('.')

In [None]:
pysqldf('select df1.key, df1.value as value_x, df2.value as value_y from df1 left join df2 on df1.key=df2.key').fillna('.')

### Right join

![](https://upload.wikimedia.org/wikipedia/commons/thumb/5/5f/SQL_Join_-_03_A_Right_Join_B.svg/234px-SQL_Join_-_03_A_Right_Join_B.svg.png?20170130230641)

> Figure from Wikimedia Commons

Using pandas `merge`:

In [None]:
pd.merge(df1, df2, on="key", how="right").fillna('.')

### Full join

![](https://upload.wikimedia.org/wikipedia/commons/thumb/6/61/SQL_Join_-_05_A_Full_Join_B.svg/234px-SQL_Join_-_05_A_Full_Join_B.svg.png?20170130230643)

> Figure from Wikimedia Commons

Using pandas `merge`:

In [None]:
pd.merge(df1, df2, on="key", how="outer").fillna('.')
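
To compare the four join types side by side, the sketch below counts the rows each one produces and uses the `indicator` and `suffixes` arguments of `pd.merge` to show where each row came from. The values are fixed and made up (the data frames above use random numbers); the suffixes `_df1` and `_df2` are chosen just for this example.

```python
import pandas as pd

# Same keys as df1 and df2 above, with fixed (made-up) values
df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": [1, 2, 3, 4]})
df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": [5, 6, 7, 8]})

# Number of rows produced by each join type
row_counts = {
    how: len(pd.merge(df1, df2, on="key", how=how))
    for how in ["inner", "left", "right", "outer"]
}

outer = pd.merge(
    df1,
    df2,
    on="key",
    how="outer",
    suffixes=("_df1", "_df2"),  # replaces the default _x / _y suffixes
    indicator=True,             # adds a _merge column: left_only, right_only, both
)

print(row_counts)
print(outer)
```

Key `D` occurs twice in `df2`, so it contributes two rows to every join that includes it. That is why the left join has more rows than `df1` itself.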

## Computing environment

In [None]:
!pip install watermark

In [None]:
%load_ext watermark
%watermark -v -p numpy,pandas