# Introduction to Pandas : Part 2
------
This tutorial is heavily based on [Pandas in 10 min](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#). The original material was modified by adding TnSeq data as examples.

In [None]:
import pandas as pd
import numpy as np
%matplotlib inline  

## Get datasets to play with

In [None]:
%%bash
wget https://nekrut.github.io/BMMB554/tnseq_untreated.txt.gz
wget https://nekrut.github.io/BMMB554/ta_gc.txt

In [None]:
data_file = 'tnseq_untreated.txt.gz'

In [None]:
# The gene field begins with one of just two values
!gunzip -c {data_file} | cut -f 8 | cut -f 1 -d '=' | sort | uniq -c

In [None]:
# Process tnseq_untreated.txt.gz to correctly parse gene names

import gzip

# Keep the first seven columns and replace the eighth (annotation) column with
# either 'intergenic' or the gene ID taken from the leading 'ID=' attribute
with gzip.open(data_file, 'rt') as stream, open('data.txt', 'w') as f:
  for line in stream:
    fields = line.rstrip('\n').split('\t')
    if fields[7].startswith('.'):
      gene = 'intergenic'
    elif fields[7].startswith('ID'):
      gene = fields[7].split(';')[0][3:]
    else:
      continue  # lines with any other annotation are skipped
    f.write('{}\t{}\n'.format('\t'.join(fields[:7]), gene))
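
The branching above can be isolated into a small helper to see what it does to a single annotation field. This is only a sketch: `gene_label` is a hypothetical name, and the sample fields are made up rather than taken from the real file.

```python
def gene_label(field):
    """Map one annotation field to a gene label.

    Fields starting with '.' become 'intergenic'; fields starting with 'ID'
    yield the text after 'ID=' up to the first ';'. Anything else gives None.
    """
    field = field.rstrip('\n')
    if field.startswith('.'):
        return 'intergenic'
    if field.startswith('ID'):
        # 'ID=geneA;Name=foo' -> 'ID=geneA' -> 'geneA'
        return field.split(';')[0][3:]
    return None

# Made-up example fields (not from tnseq_untreated.txt.gz)
for sample in ['.', 'ID=geneA;Name=foo\n', 'ID=geneB\n']:
    print(repr(sample), '->', gene_label(sample))
```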

In [None]:
# Read from the file

tnseq = pd.read_table('data.txt', header=None, names=['pos','blunt','cap','dual','erm','pen','tuf','gene'])

In [None]:
tnseq.head()

In [None]:
# Set position as index

tnseq = tnseq.set_index('pos')

In [None]:
tnseq.head()

In [None]:
# Reading GC content data

gc = pd.read_table('ta_gc.txt', header=None, names=['pos','gc'])

In [None]:
gc.head()

In [None]:
# Set position as index as well

gc = gc.set_index('pos')
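
As an aside, reading and indexing can be combined by passing `index_col` to `pd.read_table`. A minimal sketch on made-up, in-memory data (the two rows below are illustrative and are not from `ta_gc.txt`; `toy_text` and `toy_gc` are hypothetical names):

```python
import io
import pandas as pd

# Two made-up tab-separated rows standing in for the real file
toy_text = "10\t0.31\n20\t0.42\n"

# index_col='pos' makes the named column the index while reading
toy_gc = pd.read_table(io.StringIO(toy_text), header=None,
                       names=['pos', 'gc'], index_col='pos')
print(toy_gc)
```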

In [None]:
gc.head()

## Joins of all sorts

![](http://kirillpavlov.com/images/join-types.png)

Image from Kirill Pavlov's [blog](http://kirillpavlov.com/blog/2016/04/23/beyond-traditional-join-with-apache-spark/)

### Prepare sample data

To make things more digestible we will create two dataframes, `df1` and `df2`, that are small subsets of the `tnseq` and `gc` tables. In making them we will make sure that they mostly overlap but also contain a few rows with indexes not present in the other dataframe.

In [None]:
# Let's create a small subset of tnseq data:

df1 = tnseq[( tnseq['gene'] != 'intergenic' ) & ( tnseq['blunt']>100 ) ].head(10)

In [None]:
# Create a numpy array containing index values from df1 (all but the first)

i = np.array(df1.index[1:])

In [None]:
i

In [None]:
# Append a few gc index values to i that are not present in df1

i = np.append(i,[2410079,2405277,2405301])

In [None]:
i

In [None]:
# ... and a subset of gc data

df2 = gc.loc[i]
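
Note that in recent pandas versions `.loc` with a list of labels raises a `KeyError` if any label is missing from the index. If you are unsure whether every position is present, one option is to keep only the labels that exist. A sketch on made-up data (`toy_gc` and `wanted` are illustrative names and values):

```python
import numpy as np
import pandas as pd

toy_gc = pd.DataFrame({'gc': [0.31, 0.42, 0.55]},
                      index=pd.Index([20, 30, 40], name='pos'))
wanted = np.array([20, 40, 99])  # 99 is deliberately absent from the index

# Keep only the requested labels that actually occur in the index
present = toy_gc.index.intersection(wanted)
subset = toy_gc.loc[present]
print(subset)
```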

In [None]:
# This is what we have in df1

df1

In [None]:
# ... and this is the content of df2
df2


### Inner join

![](https://upload.wikimedia.org/wikipedia/commons/thumb/1/18/SQL_Join_-_07_A_Inner_Join_B.svg/220px-SQL_Join_-_07_A_Inner_Join_B.svg.png)

Here **A** is `df1` and **B** is `df2`.

Image from [Wikipedia](https://en.wikipedia.org/wiki/Join_(SQL)).

In [None]:
df1.join(df2, how = 'inner')

In [None]:
pd.merge(df1, df2, left_index=True, right_index=True, how = 'inner')
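
To see which rows survive an inner join, here is a sketch on two made-up frames (`left` and `right` are illustrative stand-ins for `df1` and `df2`; their values are invented). Only positions present in both indexes are kept:

```python
import pandas as pd

left = pd.DataFrame({'blunt': [120, 340, 560]},
                    index=pd.Index([10, 20, 30], name='pos'))
right = pd.DataFrame({'gc': [0.31, 0.42, 0.55]},
                     index=pd.Index([20, 30, 40], name='pos'))

# Positions 10 (left only) and 40 (right only) are dropped
inner = left.join(right, how='inner')
print(inner)
```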

### Left join

![](https://upload.wikimedia.org/wikipedia/commons/thumb/f/f6/SQL_Join_-_01_A_Left_Join_B.svg/220px-SQL_Join_-_01_A_Left_Join_B.svg.png)

Here **A** is `df1` and **B** is `df2`.

Image from [Wikipedia](https://en.wikipedia.org/wiki/Join_(SQL)).

In [None]:
df1.join(df2, how = 'left')

In [None]:
pd.merge(df1, df2, left_index=True, right_index=True, how = 'left')
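
In a left join every row of the left frame is kept, and columns coming from the right frame are filled with `NaN` where no match exists. A sketch on made-up frames (`left`, `right`, and their values are invented for illustration):

```python
import pandas as pd

left = pd.DataFrame({'blunt': [120, 340, 560]},
                    index=pd.Index([10, 20, 30], name='pos'))
right = pd.DataFrame({'gc': [0.31, 0.42, 0.55]},
                     index=pd.Index([20, 30, 40], name='pos'))

left_joined = left.join(right, how='left')

# Count rows of the left frame that found no partner in the right frame
n_unmatched = left_joined['gc'].isna().sum()
print(left_joined)
print('rows without a match:', n_unmatched)
```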

### Right join

![](https://upload.wikimedia.org/wikipedia/commons/thumb/5/5f/SQL_Join_-_03_A_Right_Join_B.svg/220px-SQL_Join_-_03_A_Right_Join_B.svg.png)

Here **A** is `df1` and **B** is `df2`.

Image from [Wikipedia](https://en.wikipedia.org/wiki/Join_(SQL)).

In [None]:
df1.join(df2, how = 'right')

In [None]:
pd.merge(df1, df2, left_index=True, right_index=True, how = 'right')
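
Put differently, a right join is a left join with the roles of the two frames swapped. The sketch below compares the two on made-up data (`left`, `right`, and their values are invented; only the column order differs before it is aligned):

```python
import pandas as pd

left = pd.DataFrame({'blunt': [120, 340, 560]},
                    index=pd.Index([10, 20, 30], name='pos'))
right = pd.DataFrame({'gc': [0.31, 0.42, 0.55]},
                     index=pd.Index([20, 30, 40], name='pos'))

a = left.join(right, how='right')
# Same join written from the other side, with columns put in the same order
b = right.join(left, how='left')[a.columns]

print(a)
print('identical:', a.equals(b))
```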

### Full join

![](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3d/SQL_Join_-_05b_A_Full_Join_B.svg/220px-SQL_Join_-_05b_A_Full_Join_B.svg.png)

Here **A** is `df1` and **B** is `df2`.

Image from [Wikipedia](https://en.wikipedia.org/wiki/Join_(SQL)).

In [None]:
df1.join(df2, how = 'outer')

In [None]:
pd.merge(df1, df2, left_index=True, right_index=True, how = 'outer')
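
When inspecting a full join it can help to know where each row came from. `pd.merge` accepts `indicator=True`, which adds a `_merge` column with the values `left_only`, `right_only`, or `both`. A sketch on made-up frames (`left`, `right`, and their values are invented):

```python
import pandas as pd

left = pd.DataFrame({'blunt': [120, 340, 560]},
                    index=pd.Index([10, 20, 30], name='pos'))
right = pd.DataFrame({'gc': [0.31, 0.42, 0.55]},
                     index=pd.Index([20, 30, 40], name='pos'))

outer = pd.merge(left, right, left_index=True, right_index=True,
                 how='outer', indicator=True)

# The _merge column records the origin of every row
print(outer)
```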

## Grouping

-------

By “group by” we are referring to a process involving one or more of the following steps:

 - Splitting the data into groups based on some criteria
 - Applying a function to each group independently
 - Combining the results into a data structure

See the [Grouping section](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#groupby) of the pandas user guide.

In [None]:
df1.groupby(['gene']).sum()

In [None]:
df1.groupby(['gene']).max()
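
The split, apply, and combine steps can also be sketched with named aggregation, which (in pandas 0.25 and later) allows a different function per column. The frame `toy`, its gene names, and its values are made up for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    'gene': ['geneA', 'geneA', 'geneB'],
    'blunt': [120, 340, 560],
    'cap': [5, 15, 25],
})

# Split by gene, apply one function per column, combine into one row per gene
summary = toy.groupby('gene').agg(
    blunt_total=('blunt', 'sum'),
    cap_max=('cap', 'max'),
    n_sites=('blunt', 'count'),
)
print(summary)
```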

## Actually using SQL

[`pandasql`](https://github.com/yhat/pandasql) lets you query Pandas dataframes with SQL (SQLite syntax):

In [None]:
!pip install -U pandasql

In [None]:
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())

In [None]:
# Aggregating

pysqldf("select gene, sum(blunt) as bl from df1 group by gene")

In [None]:
# Joining (left join)

pysqldf("select * from df1 left join df2 on df1.pos = df2.pos")

In [None]:
# Joining (inner join)

pysqldf("select * from df1 join df2 on df1.pos = df2.pos")
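
If `pandasql` is not available, a similar left join can be sketched with the standard-library `sqlite3` module together with `DataFrame.to_sql` and `pd.read_sql_query`. This is an alternative route, not a description of how `pandasql` works; the table names (`left_tbl`, `right_tbl`) and the data are made up for illustration:

```python
import sqlite3
import pandas as pd

left = pd.DataFrame({'blunt': [120, 340, 560]},
                    index=pd.Index([10, 20, 30], name='pos'))
right = pd.DataFrame({'gc': [0.31, 0.42, 0.55]},
                     index=pd.Index([20, 30, 40], name='pos'))

con = sqlite3.connect(':memory:')
# to_sql writes the named index as an ordinary 'pos' column by default
left.to_sql('left_tbl', con)
right.to_sql('right_tbl', con)

query = """
select l.pos, l.blunt, r.gc
from left_tbl l left join right_tbl r on l.pos = r.pos
order by l.pos
"""
result = pd.read_sql_query(query, con)
con.close()
print(result)
```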