# Merging datasets together

Turbopanda also provides robust and flexible column-wise merging of $k$ datasets, joining them seamlessly on the best-matching index.

In [1]:
import sys
import numpy as np
import pandas as pd
sys.path.insert(0,"../")
# our main import
import turbopanda as turb

import matplotlib.pyplot as plt
%matplotlib inline

print("turbopanda: %s" % turb.__version__)


turbopanda: 0.2.4


## Use cases

The table below compares how `pd.merge` and `turb.merge` respond to the same input:

| Use case input | `pandas` response | `turbopanda` response |
| --------------------- | ----------------- | --------------- |
| DataFrame $X$ with itself | Joins on label with suffixes (Must specify `on` label) | Concatenates the DataFrames together along the columns (using `pd.concat`) |
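The difference in the table can be illustrated with plain `pandas`. This is a minimal sketch on a toy frame (the column names `gene` and `value` are made up for illustration); it only reproduces the two behaviours described, and does not call `turb.merge` itself.

```python
import pandas as pd

# Toy frame; the column names are illustrative only.
X = pd.DataFrame({"gene": ["A", "B", "C"], "value": [1.0, 2.0, 3.0]})

# pandas behaviour: join X with itself on a label.
# The non-key column appears twice, so it receives suffixes.
merged = pd.merge(X, X, on="gene", suffixes=("_x", "_y"))

# Column-wise concatenation via pd.concat, as the table describes
# for the turbopanda response. Duplicate column names are kept as-is.
concatenated = pd.concat([X, X], axis=1)

print(list(merged.columns))        # ['gene', 'value_x', 'value_y']
print(list(concatenated.columns))  # ['gene', 'value', 'gene', 'value']
```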

## Example: Merging together RNA and protein

I work with biological datasets, so most of my examples use them.

In [10]:
hgnc = turb.read("../data/hgnc.csv")
rna = turb.read("../data/rna.csv")
prot = turb.read("../data/prot.csv")
print(hgnc, rna, prot)

MetaPanda(hgnc(n=200, p=15, mem=0.046MB, options=[])) MetaPanda(rna(n=100, p=117, mem=0.060MB, options=[])) MetaPanda(prot(n=100, p=78, mem=0.035MB, options=[]))


### Joining together loaded datasets

Datasets are joined in a chain-like fashion, beginning with the first one. By default, only the rows common to all datasets (the intersection) are kept; non-overlapping rows are dropped.

In [8]:
hgnc

MetaPanda(hgnc(n=200, p=15, mem=0.046MB, options=[]))

In [6]:
m1 = turb.merge([hgnc, rna, prot])
m1

MetaPanda(hgnc__rna__prot(n=100, p=207, mem=0.115MB, options=[]))

Note in the above example that the columns on which the merge occurs are not required as arguments: the algorithm automatically determines which two columns overlap best. An error is raised if none of the columns overlap with each other.
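To make the idea of picking the best-overlapping columns concrete, here is a rough sketch in plain `pandas`. The helper `best_overlap_columns` is hypothetical and is not turbopanda's actual implementation; it simply scores every pair of columns by the number of unique values they share and raises an error when nothing overlaps. The toy frames and their column names are also invented.

```python
import pandas as pd


def best_overlap_columns(left: pd.DataFrame, right: pd.DataFrame):
    """Hypothetical helper: return the (left_col, right_col) pair
    that shares the most unique non-missing values."""
    best_pair, best_score = None, 0
    for lcol in left.columns:
        lvals = set(left[lcol].dropna())
        for rcol in right.columns:
            score = len(lvals & set(right[rcol].dropna()))
            if score > best_score:
                best_pair, best_score = (lcol, rcol), score
    if best_pair is None:
        raise ValueError("no columns overlap between the two datasets")
    return best_pair


# Toy data; names and values are illustrative only.
genes = pd.DataFrame({"symbol": ["TP53", "BRCA1", "EGFR"],
                      "chrom": ["17", "17", "7"]})
expr = pd.DataFrame({"gene_id": ["BRCA1", "EGFR", "MYC"],
                     "tpm": [5.1, 8.3, 2.2]})

lcol, rcol = best_overlap_columns(genes, expr)
merged = genes.merge(expr, left_on=lcol, right_on=rcol, how="inner")
print(lcol, rcol, len(merged))  # symbol gene_id 2
```

An exhaustive pairwise comparison like this is quadratic in the number of columns, which is acceptable for datasets of the size shown here but may need pruning (for example, by dtype) on wider tables.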

As we can see, the three datasets are merged together, and their names are glued together, separated by `"__"`. The name can be overridden with a custom one if so desired:

In [12]:
m2 = turb.merge([hgnc, rna, prot], name='combined_DNA')
m2

MetaPanda(combined_DNA(n=100, p=207, mem=0.115MB, options=[]))

Furthermore, we can change *how* the datasets are joined, for example using `outer`:

In [14]:
m3 = turb.merge([hgnc, rna, prot], name='combined_DNA', how='outer')
m3

MetaPanda(combined_DNA(n=200, p=207, mem=0.660MB, options=[]))

In this case, rows that do not overlap are kept rather than dropped.
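The effect of the `how` argument on a chained merge can be sketched with plain `pandas` and `functools.reduce`. The helper `chain_merge`, the key column `id`, and the toy frames are assumptions made for illustration; this is not how `turb.merge` is implemented, only the same inner-versus-outer idea on a small scale.

```python
from functools import reduce

import pandas as pd

# Three toy frames sharing an "id" key; values are illustrative only.
a = pd.DataFrame({"id": [1, 2, 3, 4], "a": [10, 20, 30, 40]})
b = pd.DataFrame({"id": [2, 3], "b": [0.2, 0.3]})
c = pd.DataFrame({"id": [3, 4], "c": ["x", "y"]})


def chain_merge(frames, how="inner"):
    """Hypothetical helper: merge frames left to right on 'id'."""
    return reduce(lambda left, right: left.merge(right, on="id", how=how), frames)


inner = chain_merge([a, b, c])               # only id 3 is in all three
outer = chain_merge([a, b, c], how="outer")  # every id is retained

print(len(inner), len(outer))        # 1 4
print(int(outer.isna().sum().sum())) # 4 missing cells fill the gaps
```

With `outer`, the cells that have no counterpart in another frame are filled with missing values, which is why the merged object grows in both rows and memory.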

## Load and merge

Often we'd like to import a file and merge it straight into an existing dataset. In this case, we simply pass the file path as a `str`, or a list of such strings, to `merge` rather than the loaded object.

In [17]:
m4 = turb.merge(["../data/hgnc.csv", "../data/rna.csv", "../data/prot.csv"])
m4

MetaPanda(hgnc__rna__prot(n=100, p=207, mem=0.115MB, options=[]))

In addition, this method draws on `turb.read` and therefore supports glob-like patterns, so that similarly named files can be imported in alphabetical order and merged automatically from a single string input.
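The glob-then-merge pattern described above can be approximated with the standard library and `pandas`. This is a sketch under stated assumptions: `read_and_merge` is a hypothetical helper, the files are written to a temporary directory purely so the example runs on its own, and the join key `id` is given explicitly rather than inferred as turbopanda does.

```python
import glob
import os
import tempfile
from functools import reduce

import pandas as pd


def read_and_merge(pattern, on, how="inner"):
    """Hypothetical helper: read every CSV matching a glob pattern,
    in alphabetical order, and merge them left to right."""
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files match {pattern!r}")
    frames = [pd.read_csv(p) for p in paths]
    return reduce(lambda left, right: left.merge(right, on=on, how=how), frames)


with tempfile.TemporaryDirectory() as tmp:
    # Write two small, similarly named files to merge (illustrative data).
    pd.DataFrame({"id": [1, 2, 3], "x": [1.0, 2.0, 3.0]}).to_csv(
        os.path.join(tmp, "part_a.csv"), index=False)
    pd.DataFrame({"id": [2, 3, 4], "y": [5, 6, 7]}).to_csv(
        os.path.join(tmp, "part_b.csv"), index=False)

    merged = read_and_merge(os.path.join(tmp, "part_*.csv"), on="id")

print(list(merged.columns), len(merged))  # ['id', 'x', 'y'] 2
```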