<a href="https://colab.research.google.com/github/kdidi99/Python_for_Biochemists/blob/main/notebooks/day_3_data_analysis_4_merging.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Day 3: Data analysis with Pandas, part 4

# 4. Sorting & combining

So far we have looked at how to manipulate DataFrames read from a single csv or txt file. In practice, though, we often need to collect data from multiple sources into one DataFrame. To do this, we need methods that combine or merge these datasets, and then sort or reshape the result into a useful representation of our data.

So, first we will deal with combining data, and then we will look at how to sort and reshape the combined DataFrame.

## Combining Data

To combine our data, pandas offers three main tools:

- concat(): combines DataFrames row-wise or column-wise without matching values against a key
- merge(): combines data based on shared columns or indices (database-like joining, the most flexible of the three)
- .join(): combines data based on indices
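To make the difference between row-wise and column-wise concatenation concrete, here is a minimal sketch. The two "plate" DataFrames and their values are made up for illustration and are not part of the course data.

```python
import pandas as pd

# Hypothetical activity measurements from two screening plates (illustrative values)
plate_1 = pd.DataFrame({"Name": ["Mut1", "Mut2"], "Activity": [10, 1333]})
plate_2 = pd.DataFrame({"Name": ["Mut3", "Mut4"], "Activity": [1104, 57]})

# Row-wise (axis=0, the default): stack the rows of both frames.
# ignore_index=True renumbers the rows 0..n-1 instead of keeping 0, 1, 0, 1.
stacked = pd.concat([plate_1, plate_2], ignore_index=True)

# Column-wise (axis=1): place the frames side by side, aligned on the row index only.
# The values in "Name" are never compared.
side_by_side = pd.concat([plate_1, plate_2], axis=1)

print(stacked)
print("---")
print(side_by_side)
```

Note that the column-wise result contains two columns called Name and two called Activity, since concat() does not try to match or rename anything.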

In [2]:
import pandas as pd

#mutations of proteins
df_mutation = pd.DataFrame({
    'Mutation': ['No', 'H86R', 'C142G'],
    'Name': ['WT', 'Mut3', 'Mut1']
})

#activity of proteins from screening experiment
df_activity = pd.DataFrame({
    'Name': ['Mut1', 'Mut2', 'Mut3'],
    'Activity': [10, 1333, 1104]
})

print(df_mutation)
print("---")
print(df_activity)

  Mutation  Name
0       No    WT
1     H86R  Mut3
2    C142G  Mut1
---
   Name  Activity
0  Mut1        10
1  Mut2      1333
2  Mut3      1104


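The most versatile of these is merge(), so we will focus on it first. With merge(), you can perform all the join operations known from a "classic" relational database, e.g. when working with SQL. There are four main types of join: inner, left, right and outer (full). An inner join keeps only the keys found in both tables, a left join keeps all keys of the left table, a right join keeps all keys of the right table, and an outer join keeps the keys of both. Entries without a match are filled with NaN.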

<div>
<img src="https://github.com/kdidi99/Python_for_Biochemists/blob/main/images/inner_join.PNG?raw=1" width=500>
</div>

In [8]:
# The on keyword is optional: by default merge() joins on all columns the two frames have in common.
# Specifying it explicitly is still the safer choice.
df_inner = pd.merge(df_mutation, df_activity, on="Name", how='inner')
df_inner

|   | Mutation | Name | Activity |
|---|----------|------|----------|
| 0 | H86R     | Mut3 | 1104     |
| 1 | C142G    | Mut1 | 10       |
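The on keyword only works when the key column has the same name in both frames. When it does not, merge() accepts left_on and right_on instead. The sketch below assumes a hypothetical version of the activity table whose key column is called Variant; that frame and its name are invented for illustration.

```python
import pandas as pd

df_mutation = pd.DataFrame({
    'Mutation': ['No', 'H86R', 'C142G'],
    'Name': ['WT', 'Mut3', 'Mut1']
})

# Hypothetical variant of df_activity whose key column is named "Variant"
df_activity_renamed = pd.DataFrame({
    'Variant': ['Mut1', 'Mut2', 'Mut3'],
    'Activity': [10, 1333, 1104]
})

# Match the "Name" column of the left frame against the "Variant" column of the right frame
df_inner_renamed = pd.merge(
    df_mutation, df_activity_renamed,
    left_on='Name', right_on='Variant', how='inner'
)

# Both key columns are kept in the result, so drop the redundant one
df_inner_renamed = df_inner_renamed.drop(columns='Variant')
print(df_inner_renamed)
```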


<div>
<img src="https://github.com/kdidi99/Python_for_Biochemists/blob/main/images/left_join.PNG?raw=1" width=500>
</div>

In [9]:
df_left = pd.merge(df_mutation, df_activity, on="Name", how='left')
df_left

|   | Mutation | Name | Activity |
|---|----------|------|----------|
| 0 | No       | WT   | NaN      |
| 1 | H86R     | Mut3 | 1104.0   |
| 2 | C142G    | Mut1 | 10.0     |


<div>
<img src="https://github.com/kdidi99/Python_for_Biochemists/blob/main/images/right_join.PNG?raw=1" width=500>
</div>

In [10]:
df_right = pd.merge(df_mutation, df_activity, on="Name", how='right')
df_right

|   | Mutation | Name | Activity |
|---|----------|------|----------|
| 0 | C142G    | Mut1 | 10       |
| 1 | NaN      | Mut2 | 1333     |
| 2 | H86R     | Mut3 | 1104     |


<div>
<img src="https://github.com/kdidi99/Python_for_Biochemists/blob/main/images/outer_join.PNG?raw=1" width=500>
</div>

In [12]:
df_outer = pd.merge(df_mutation, df_activity, on="Name", how='outer')
df_outer

|   | Mutation | Name | Activity |
|---|----------|------|----------|
| 0 | No       | WT   | NaN      |
| 1 | H86R     | Mut3 | 1104.0   |
| 2 | C142G    | Mut1 | 10.0     |
| 3 | NaN      | Mut2 | 1333.0   |
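After an outer join it is often useful to know which table each row came from. One way to check this is the indicator argument of merge(), which adds a _merge column with the values left_only, right_only or both. The following is a small sketch using the same two frames; the variable names df_outer_flagged and unmatched are only illustrative.

```python
import pandas as pd

df_mutation = pd.DataFrame({
    'Mutation': ['No', 'H86R', 'C142G'],
    'Name': ['WT', 'Mut3', 'Mut1']
})
df_activity = pd.DataFrame({
    'Name': ['Mut1', 'Mut2', 'Mut3'],
    'Activity': [10, 1333, 1104]
})

# indicator=True adds a "_merge" column recording the origin of each row
df_outer_flagged = pd.merge(df_mutation, df_activity, on='Name', how='outer', indicator=True)
print(df_outer_flagged)

# Keep only the rows that were found in just one of the two tables
unmatched = df_outer_flagged[df_outer_flagged['_merge'] != 'both']
print(unmatched)
```

Here WT is flagged as left_only (no activity was measured) and Mut2 as right_only (no mutation is recorded), which makes gaps in the combined data easy to spot.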


The other two tools, .join() and concat(), cover more specialised cases: .join() is a shortcut for merging on indices, while concat() simply stacks DataFrames along an axis without matching keys.

.join(), for example, is an *instance method* called on a DataFrame (concat() exists only as a *module function*, while merge() is available as both). By default it performs a left join on the indices.

In [15]:
# df_join = df_mutation.join(df_activity) raises a ValueError because both frames contain a "Name" column;
# lsuffix and rsuffix give the overlapping columns distinct names.
df_join = df_mutation.join(df_activity, lsuffix="_l", rsuffix="_r")
df_join

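|   | Mutation | Name_l | Name_r | Activity |
|---|----------|--------|--------|----------|
| 0 | No       | WT     | Mut1   | 10       |
| 1 | H86R     | Mut3   | Mut2   | 1333     |
| 2 | C142G    | Mut1   | Mut3   | 1104     |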

In this case, .join() does not make sense: the rows are matched on the default integer index (0, 1, 2), which says nothing about which protein a row describes. As a result, WT ends up next to the activity of Mut1, and calling .join() gives us a mess.
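For .join() to give a meaningful result, the index has to identify the observation. A minimal sketch, assuming the values in Name are unique in both frames: move Name into the index with set_index() first, then join. The names mutation_by_name, activity_by_name and df_join_by_name are only illustrative.

```python
import pandas as pd

df_mutation = pd.DataFrame({
    'Mutation': ['No', 'H86R', 'C142G'],
    'Name': ['WT', 'Mut3', 'Mut1']
})
df_activity = pd.DataFrame({
    'Name': ['Mut1', 'Mut2', 'Mut3'],
    'Activity': [10, 1333, 1104]
})

# Use the protein name as the index of both frames
mutation_by_name = df_mutation.set_index('Name')
activity_by_name = df_activity.set_index('Name')

# Left join on the index: keeps every protein of the left frame
df_join_by_name = mutation_by_name.join(activity_by_name)
print(df_join_by_name)

# Optional: turn the index back into an ordinary column
print(df_join_by_name.reset_index())
```

With Name as the index, the result contains the same pairings as the left merge on Name shown earlier, and no suffixes are needed because the two frames no longer share a column name.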