# Notebook 2: Going Deeper

**TurboPanda** offers functionality for manipulating columns, and groups of columns, that `pandas` often lacks.

In [1]:
import sys
import numpy as np
import pandas as pd

# add the parent directory to the import path so the package can be found
sys.path.insert(0, "../")
# our main import
import turbopanda as turb

## Reading in our dataset

In [11]:
g = turb.read("../../Research Data/datasets/aviner_translation.csv", name="Translation")
g.drop("Majority")
g

MetaPanda(Translation(n=5216, p=15, mem=0.626MB), mode='instant')

## Accessing with `head`

Normally, copying features over from `pandas.DataFrame` is forbidden fruit, but `head` is so useful that we decided to break with tradition in this case:

In [13]:
g.head(3)

counter,Protein_IDs,Protein_names,Gene_names,Intensity_G1_1,Intensity_G1_2,Intensity_G1_3,Intensity_G2M_1,Intensity_G2M_2,Intensity_G2M_3,Intensity_MG1_1,Intensity_MG1_2,Intensity_MG1_3,Intensity_S_1,Intensity_S_2,Intensity_S_3
0,Q96IC2;Q96IC2-2;H3BM72;H3BV93;H3BSC5,Putative RNA exonuclease NEF-sp,44M2.3,21.26058,20.47467,20.48905,21.01794,20.14569,22.29011,21.11775,20.71892,20.25788,20.58628,20.27662,21.05682
1,H0YGH4;P01023;H0YGH6;F8W7L3,Alpha-2-macroglobulin,A2M,22.62015,22.26825,23.11786,24.94606,24.21645,25.26399,23.56139,23.46051,22.21951,22.87688,23.35703,23.25724
2,A8K2U0;F5H2W3;H0YGG5;F5H2Z2;F5GXP1,Alpha-2-macroglobulin-like protein 1,A2ML1,,,,,,25.11629,,,,,,


## Generic `apply` to the underlying DataFrame

There are many cases where we simply want to apply a `pandas.DataFrame.*` function to the underlying dataset while keeping the `df_` and `meta_` attributes consistent. Our solution is an `apply` function, which operates on the entire dataset.

This is similar to the `transform` function; however, `apply` only looks for functions within the `pandas.DataFrame.*` API, and it does not allow selecting a subset beforehand (via the `selector` parameter).

For instance, let's say we want to fill all the `NaN` values in the dataset with a single value. This is easy with `pandas.DataFrame.fillna`, but with `MetaPanda` we would otherwise have to call the `transform` function with a custom lambda. With `apply`, it is a single call:

In [14]:
g.apply("fillna", 0)

MetaPanda(Translation(n=5216, p=15, mem=0.627MB), mode='instant')

It's that easy.

In [15]:
g.df_.head(3)

counter,Protein_IDs,Protein_names,Gene_names,Intensity_G1_1,Intensity_G1_2,Intensity_G1_3,Intensity_G2M_1,Intensity_G2M_2,Intensity_G2M_3,Intensity_MG1_1,Intensity_MG1_2,Intensity_MG1_3,Intensity_S_1,Intensity_S_2,Intensity_S_3
0,Q96IC2;Q96IC2-2;H3BM72;H3BV93;H3BSC5,Putative RNA exonuclease NEF-sp,44M2.3,21.26058,20.47467,20.48905,21.01794,20.14569,22.29011,21.11775,20.71892,20.25788,20.58628,20.27662,21.05682
1,H0YGH4;P01023;H0YGH6;F8W7L3,Alpha-2-macroglobulin,A2M,22.62015,22.26825,23.11786,24.94606,24.21645,25.26399,23.56139,23.46051,22.21951,22.87688,23.35703,23.25724
2,A8K2U0;F5H2W3;H0YGG5;F5H2Z2;F5GXP1,Alpha-2-macroglobulin-like protein 1,A2ML1,0.0,0.0,0.0,0.0,0.0,25.11629,0.0,0.0,0.0,0.0,0.0,0.0
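
To make the idea of a name-based `apply` concrete, here is a minimal sketch of how a string such as `"fillna"` could be dispatched to the wrapped DataFrame while a metadata table is kept in step. The `MiniPanda` class, its `_rebuild_meta` helper and the `dtype` / `n_missing` metadata columns are hypothetical illustrations, not TurboPanda's actual implementation.

```python
import pandas as pd


class MiniPanda:
    """Hypothetical stand-in for a DataFrame wrapper; not TurboPanda's MetaPanda."""

    def __init__(self, df):
        self.df_ = df
        self._rebuild_meta()

    def _rebuild_meta(self):
        # One metadata row per column, recomputed whenever df_ changes.
        self.meta_ = pd.DataFrame(
            {
                "dtype": self.df_.dtypes.astype(str),
                "n_missing": self.df_.isna().sum(),
            }
        )

    def apply(self, f_name, *args, **kwargs):
        # Look the method up by name on pandas.DataFrame and fail early if absent.
        if not hasattr(pd.DataFrame, f_name):
            raise ValueError(f"'{f_name}' is not a pandas.DataFrame method")
        result = getattr(self.df_, f_name)(*args, **kwargs)
        # Only adopt the result if the method returned a new DataFrame.
        if isinstance(result, pd.DataFrame):
            self.df_ = result
            self._rebuild_meta()
        return self


demo = MiniPanda(pd.DataFrame({"a": [1.0, None], "b": [None, 2.0]}))
demo.apply("fillna", 0)
print(demo.df_)
print(demo.meta_)
```

Returning `self` from `apply` mirrors the way the cells above echo the object after each call, and rebuilding the metadata in one place keeps the two tables from drifting apart.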


## String manipulation of ID columns

Datasets often couple numerical/quantitative data with IDs that help to identify rows within your problem domain. In this case, we have some identifiers from **UniProt**, one of the main databases managing protein sequences.

There is a slight problem, though: each row of expression values has a *stack of protein IDs* associated with it. We therefore have to untangle these by **expanding** the ID columns into something we can then perform **set theory** operations on.

`MetaPanda` comes with an `expand` function which does exactly this:

In [16]:
g.expand("Protein_IDs", sep=";")

MetaPanda(Translation(n=26318, p=15, mem=3.370MB), mode='instant')

We can see in this instance that $n$ has increased significantly (from 5216 to 26318 rows), as has the memory usage (from 0.627MB to 3.370MB).
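
The same row-multiplying effect can be reproduced in plain `pandas`, which may help in understanding what an expansion of this kind does. The sketch below is not TurboPanda's implementation: it uses a two-row toy frame whose values are borrowed (with shortened ID lists) from the rows shown earlier, and relies on `Series.str.split` followed by `DataFrame.explode`.

```python
import pandas as pd

# Toy frame: values borrowed from the first two rows above, ID lists shortened.
df = pd.DataFrame(
    {
        "Protein_IDs": ["Q96IC2;Q96IC2-2;H3BM72", "H0YGH4;P01023"],
        "Gene_names": ["44M2.3", "A2M"],
        "Intensity_G1_1": [21.26058, 22.62015],
    }
)

# Split each stacked ID string into a list, then give every list element its own row.
expanded = df.assign(Protein_IDs=df["Protein_IDs"].str.split(";")).explode("Protein_IDs")

print(len(df), "->", len(expanded))
print(expanded)
```

Every other column is duplicated for each ID, and the original row label is repeated, which is why both the row count and the memory footprint grow.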

In [17]:
g.df_.head()

counter,Protein_IDs,Protein_names,Gene_names,Intensity_G1_1,Intensity_G1_2,Intensity_G1_3,Intensity_G2M_1,Intensity_G2M_2,Intensity_G2M_3,Intensity_MG1_1,Intensity_MG1_2,Intensity_MG1_3,Intensity_S_1,Intensity_S_2,Intensity_S_3
0,Q96IC2,Putative RNA exonuclease NEF-sp,44M2.3,21.26058,20.47467,20.48905,21.01794,20.14569,22.29011,21.11775,20.71892,20.25788,20.58628,20.27662,21.05682
0,Q96IC2-2,Putative RNA exonuclease NEF-sp,44M2.3,21.26058,20.47467,20.48905,21.01794,20.14569,22.29011,21.11775,20.71892,20.25788,20.58628,20.27662,21.05682
0,H3BM72,Putative RNA exonuclease NEF-sp,44M2.3,21.26058,20.47467,20.48905,21.01794,20.14569,22.29011,21.11775,20.71892,20.25788,20.58628,20.27662,21.05682
0,H3BV93,Putative RNA exonuclease NEF-sp,44M2.3,21.26058,20.47467,20.48905,21.01794,20.14569,22.29011,21.11775,20.71892,20.25788,20.58628,20.27662,21.05682
0,H3BSC5,Putative RNA exonuclease NEF-sp,44M2.3,21.26058,20.47467,20.48905,21.01794,20.14569,22.29011,21.11775,20.71892,20.25788,20.58628,20.27662,21.05682
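
With one ID per row, the set theory operations mentioned earlier become straightforward. The following is a minimal sketch in plain `pandas` and Python sets. The `other_ids` collection is hypothetical (including the made-up ID `P99999`) and stands in for IDs from some second source.

```python
import pandas as pd

# One protein ID per row, as in the expanded output above.
expanded = pd.DataFrame(
    {"Protein_IDs": ["Q96IC2", "Q96IC2-2", "H3BM72", "H0YGH4", "P01023"]}
)

# Hypothetical IDs from a second source, for illustration only.
other_ids = {"P01023", "H3BM72", "P99999"}

ours = set(expanded["Protein_IDs"])
shared = ours & other_ids      # IDs present in both collections
only_ours = ours - other_ids   # IDs found only in this dataset

# isin gives the row-wise equivalent of the intersection.
matching_rows = expanded[expanded["Protein_IDs"].isin(other_ids)]

print(sorted(shared))
print(sorted(only_ours))
print(matching_rows)
```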
