# Notebook 2: Going Deeper

**TurboPanda** adds functionality for manipulating columns, and groups of columns, that `pandas` often lacks.

In [1]:
import sys
import numpy as np
import pandas as pd
sys.path.insert(0,"../")
# our main import
import turbopanda as turb

## Reading in our dataset

In [2]:
g = turb.read("translation.csv", name="Translation")
g

MetaPanda(Translation(n=5216, p=14, mem=0.585MB), mode='instant')

## Accessing with `head`

Normally we avoid copying features over from `pandas.DataFrame`, but `head` is so useful that we decided to break with tradition in this case:

In [3]:
g.head(3)

counter,prot_IDs,prot_names,Gene_names,translation_G1_1,translation_G1_2,translation_G1_3,translation_G2M_1,translation_G2M_2,translation_G2M_3,translation_MG1_1,translation_MG1_2,translation_MG1_3,translation_S_1,translation_S_2
0,Q96IC2;Q96IC2-2;H3BM72;H3BV93;H3BSC5,Putative RNA exonuclease NEF-sp,44M2.3,21.26058,20.47467,20.48905,21.01794,20.14569,22.29011,21.11775,20.71892,20.25788,20.58628,20.27662
1,H0YGH4;P01023;H0YGH6;F8W7L3,Alpha-2-macroglobulin,A2M,22.62015,22.26825,23.11786,24.94606,24.21645,25.26399,23.56139,23.46051,22.21951,22.87688,23.35703
2,A8K2U0;F5H2W3;H0YGG5;F5H2Z2;F5GXP1,Alpha-2-macroglobulin-like protein 1,A2ML1,,,,,,25.11629,,,,,


## Generic `apply` to the underlying DataFrame

We often want to apply a `pandas.DataFrame.*` function to the underlying dataset while keeping the `df_` and `meta_` attributes consistent. For this, `MetaPanda` provides an `apply` function, which operates on the entire dataset.

This is similar to the `transform` function. However, `apply` only looks for functions within the `pandas.DataFrame.*` API, and it does not allow a subset of columns to be selected beforehand (using the `selector` parameter).

For instance, let's say we want to fill all the `NaN` values in the dataset with a value. This is easy with `pandas.DataFrame.fillna`, but without `apply` we would have to call the `transform` function on the `MetaPanda` with a custom lambda. With `apply`, it is a single call:

In [4]:
g.apply("fillna", 0)

MetaPanda(Translation(n=5216, p=14, mem=0.585MB), mode='instant')

It's that easy.

In [5]:
g.df_.head(3)

counter,prot_IDs,prot_names,Gene_names,translation_G1_1,translation_G1_2,translation_G1_3,translation_G2M_1,translation_G2M_2,translation_G2M_3,translation_MG1_1,translation_MG1_2,translation_MG1_3,translation_S_1,translation_S_2
0,Q96IC2;Q96IC2-2;H3BM72;H3BV93;H3BSC5,Putative RNA exonuclease NEF-sp,44M2.3,21.26058,20.47467,20.48905,21.01794,20.14569,22.29011,21.11775,20.71892,20.25788,20.58628,20.27662
1,H0YGH4;P01023;H0YGH6;F8W7L3,Alpha-2-macroglobulin,A2M,22.62015,22.26825,23.11786,24.94606,24.21645,25.26399,23.56139,23.46051,22.21951,22.87688,23.35703
2,A8K2U0;F5H2W3;H0YGG5;F5H2Z2;F5GXP1,Alpha-2-macroglobulin-like protein 1,A2ML1,0.0,0.0,0.0,0.0,0.0,25.11629,0.0,0.0,0.0,0.0,0.0
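
As a rough illustration of what calling a DataFrame method by name involves, here is a minimal sketch in plain pandas. The helper `apply_by_name` and the toy data are assumptions made for this example; this is not how `MetaPanda.apply` is necessarily implemented, and it ignores the `meta_` bookkeeping entirely.

```python
import numpy as np
import pandas as pd


def apply_by_name(df, method_name, *args, **kwargs):
    """Look up a DataFrame method by its name and call it (hypothetical helper)."""
    method = getattr(pd.DataFrame, method_name, None)
    if not callable(method):
        raise AttributeError(f"pandas.DataFrame has no method {method_name!r}")
    # Call the unbound method with the frame as its first argument.
    return method(df, *args, **kwargs)


# Toy data with missing values; not the Translation dataset.
toy = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 4.0]})
filled = apply_by_name(toy, "fillna", 0)
```

Here `filled` has every `NaN` replaced by `0`, while `toy` itself is left unchanged because `fillna` returns a new DataFrame by default.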


## String manipulation of ID columns

Often datasets can have numerical/quantitative data coupled with IDs that help to identify rows based on your problem domain. In this case, we have some identifiers from **Uniprot**, one of the main databases managing protein sequences.

There is a slight problem, though: each row of expression values has a *stack of protein IDs* associated with it. We need to untangle this by **expanding** the ID column so that each row holds a single ID, which we can then perform **set theory** operations on.

`MetaPanda` comes with an `expand` function which does exactly this:

In [6]:
g.expand("prot_IDs", sep=";")

MetaPanda(Translation(n=26318, p=14, mem=3.159MB), mode='instant')

Here $n$ has grown from 5216 to 26318 rows, and the memory usage from 0.585MB to 3.159MB.

In [7]:
g.df_.head()

counter,prot_IDs,prot_names,Gene_names,translation_G1_1,translation_G1_2,translation_G1_3,translation_G2M_1,translation_G2M_2,translation_G2M_3,translation_MG1_1,translation_MG1_2,translation_MG1_3,translation_S_1,translation_S_2
0,Q96IC2,Putative RNA exonuclease NEF-sp,44M2.3,21.26058,20.47467,20.48905,21.01794,20.14569,22.29011,21.11775,20.71892,20.25788,20.58628,20.27662
0,Q96IC2-2,Putative RNA exonuclease NEF-sp,44M2.3,21.26058,20.47467,20.48905,21.01794,20.14569,22.29011,21.11775,20.71892,20.25788,20.58628,20.27662
0,H3BM72,Putative RNA exonuclease NEF-sp,44M2.3,21.26058,20.47467,20.48905,21.01794,20.14569,22.29011,21.11775,20.71892,20.25788,20.58628,20.27662
0,H3BV93,Putative RNA exonuclease NEF-sp,44M2.3,21.26058,20.47467,20.48905,21.01794,20.14569,22.29011,21.11775,20.71892,20.25788,20.58628,20.27662
0,H3BSC5,Putative RNA exonuclease NEF-sp,44M2.3,21.26058,20.47467,20.48905,21.01794,20.14569,22.29011,21.11775,20.71892,20.25788,20.58628,20.27662
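
For comparison, a similar row expansion can be sketched in plain pandas with `str.split` followed by `DataFrame.explode`. The toy frame below is made up for the example, and whether `expand` works this way internally is an assumption.

```python
import pandas as pd

# Toy data: one row carries several ';'-separated IDs.
toy = pd.DataFrame({
    "prot_IDs": ["Q96IC2;Q96IC2-2;H3BM72", "P01023"],
    "value": [21.3, 22.6],
})

# Split each string into a list, then give every list element its own row.
# The other columns and the index label are repeated for each new row.
expanded = toy.assign(prot_IDs=toy["prot_IDs"].str.split(";")).explode("prot_IDs")
```

As in the output above, the index label is repeated for every ID that came from the same original row.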


## Applying a transformation to eliminate duplicates

Some labels are *sort of* duplicated: they are the same ID with a `-[0-9]` suffix appended (for example `Q96IC2` and `Q96IC2-2`). We will split these off and keep the left-hand part, then apply a function which drops duplicates.

In [8]:
g.transform(lambda x: x.str.split("-",expand=True)[0], "^prot_ID")

MetaPanda(Translation(n=26318, p=14, mem=3.159MB), mode='instant')

Now use `pandas.DataFrame.drop_duplicates`, with `subset` restricted to the protein IDs:

In [9]:
g.apply("drop_duplicates", subset=["prot_IDs"])

MetaPanda(Translation(n=21050, p=14, mem=2.527MB), mode='instant')

In [10]:
g.df_.head(3)

counter,prot_IDs,prot_names,Gene_names,translation_G1_1,translation_G1_2,translation_G1_3,translation_G2M_1,translation_G2M_2,translation_G2M_3,translation_MG1_1,translation_MG1_2,translation_MG1_3,translation_S_1,translation_S_2
0,Q96IC2,Putative RNA exonuclease NEF-sp,44M2.3,21.26058,20.47467,20.48905,21.01794,20.14569,22.29011,21.11775,20.71892,20.25788,20.58628,20.27662
0,H3BM72,Putative RNA exonuclease NEF-sp,44M2.3,21.26058,20.47467,20.48905,21.01794,20.14569,22.29011,21.11775,20.71892,20.25788,20.58628,20.27662
0,H3BV93,Putative RNA exonuclease NEF-sp,44M2.3,21.26058,20.47467,20.48905,21.01794,20.14569,22.29011,21.11775,20.71892,20.25788,20.58628,20.27662
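
The two steps above (strip the suffix from columns matched by a regex, then drop duplicate IDs) can be sketched in plain pandas as follows. The toy frame is made up, and selecting columns with `str.contains` only approximates what the selector argument does.

```python
import pandas as pd

# Toy data: the second ID is the first one with a "-2" suffix.
toy = pd.DataFrame({
    "prot_IDs": ["Q96IC2", "Q96IC2-2", "H3BM72"],
    "value": [21.3, 21.3, 21.3],
})

# Select the columns whose name matches the regex.
id_cols = toy.columns[toy.columns.str.contains("^prot_ID")]

cleaned = toy.copy()
for col in id_cols:
    # Keep only the text before the first "-".
    cleaned[col] = cleaned[col].str.split("-", expand=True)[0]

# Keep the first row for each distinct protein ID.
deduped = cleaned.drop_duplicates(subset=list(id_cols))
```

After the split, the first two rows share the ID `Q96IC2`, so `drop_duplicates` keeps only the first of them.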


## Using `mode=delay`: Delaying actions to make a task graph

You may have noticed that the representation of the object shows `mode='instant'`.

Using the current approach, all of our operations are applied instantly. This is convenient, but if you make a mistake it can be difficult to backtrack.

Fortunately, `MetaPanda` has a `mode` parameter at initialization and a `mode_` attribute that can be set at any point. When it is set to `delay`, many functions that would otherwise change the data immediately are instead *cached* in a `pipe_` attribute.

When `compute()` is called, all of the operations in `pipe_` are executed in order with their parameters, and the attribute is then emptied.
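
To make the idea concrete, here is a minimal sketch of a wrapper that queues method calls and replays them later. The class `DelayedFrame` is hypothetical and only mimics the `mode_`, `pipe_` and `compute()` behaviour described here; it is not the `MetaPanda` implementation.

```python
import pandas as pd


class DelayedFrame:
    """Toy wrapper that can queue DataFrame method calls (hypothetical)."""

    def __init__(self, df, mode="instant"):
        self.df_ = df
        self.mode_ = mode
        self.pipe_ = []

    def apply(self, name, *args, **kwargs):
        if self.mode_ == "delay":
            # Record the call instead of running it.
            self.pipe_.append((name, args, kwargs))
        else:
            self.df_ = getattr(self.df_, name)(*args, **kwargs)
        return self

    def compute(self):
        # Replay the queued calls in order, then empty the queue.
        for name, args, kwargs in self.pipe_:
            self.df_ = getattr(self.df_, name)(*args, **kwargs)
        self.pipe_ = []
        self.mode_ = "instant"
        return self


toy = DelayedFrame(pd.DataFrame({"a": [1.0, None, 3.0]}), mode="delay")
toy.apply("fillna", 0).apply("mul", 2)
queued = len(toy.pipe_)  # both calls are recorded; the data is untouched
toy.compute()
```

Storing each step as a `(name, args, kwargs)` tuple keeps the queue easy to inspect before anything is run.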

### Creating a series of operations to perform...

In [11]:
g.mode_="delay"

In [12]:
g.drop("prot_name")
g.apply("groupby", ["counter","Gene_names"])
g.apply("mean")
g.rename([("translation","trans")])

We can see that no changes have been made to the dataset.

In [13]:
g

MetaPanda(Translation(n=21050, p=14, mem=2.527MB), mode='delay')

But if we check the `pipe_` attribute, all of the operations have been saved:

In [14]:
g.pipe_

[('drop', ('prot_name',), {}),
 ('apply', ('groupby', ['counter', 'Gene_names']), {}),
 ('apply', ('mean',), {}),
 ('rename', ([('translation', 'trans')],), {})]

### Now a call to `compute()`:

Note that `compute()` will empty the `pipe_` attribute by default to prevent repeat calls.

`compute()` also takes the optional parameters `pipe` and `inplace`. If `pipe` is passed, this external pipe is used rather than the internal `pipe_`. If `inplace` is set to `False`, a copy of the `MetaPanda` is returned instead of modifying it in place.

In [15]:
g.compute()

MetaPanda(Translation(n=5190, p=11, mem=0.561MB), mode='instant')

In [16]:
g.df_.head()

counter,Gene_names,trans_G1_1,trans_G1_2,trans_G1_3,trans_G2M_1,trans_G2M_2,trans_G2M_3,trans_MG1_1,trans_MG1_2,trans_MG1_3,trans_S_1,trans_S_2
0,44M2.3,21.26058,20.47467,20.48905,21.01794,20.14569,22.29011,21.11775,20.71892,20.25788,20.58628,20.27662
1,A2M,22.62015,22.26825,23.11786,24.94606,24.21645,25.26399,23.56139,23.46051,22.21951,22.87688,23.35703
2,A2ML1,0.0,0.0,0.0,0.0,0.0,25.11629,0.0,0.0,0.0,0.0,0.0
3,AAAS,25.48382,24.42746,25.22645,24.44556,23.93706,25.30966,25.61462,25.45923,24.48253,24.31645,23.92143
4,AACS,24.18177,24.51533,24.32766,24.15993,24.05001,24.95797,24.11656,24.22523,23.96446,23.8944,23.78107
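
For readers more familiar with pandas, the four queued steps correspond roughly to the method chain below. This is a sketch on made-up data: it names the dropped column explicitly instead of using a selector, and passing `numeric_only=True` is an assumption needed on recent pandas versions so that the remaining text column is left out of the mean.

```python
import pandas as pd

# Toy data with an index named "counter"; not the Translation dataset.
toy = pd.DataFrame(
    {
        "prot_IDs": ["A", "B", "C"],
        "prot_names": ["x", "x", "y"],
        "Gene_names": ["G1", "G1", "G2"],
        "translation_S_1": [1.0, 3.0, 5.0],
    },
    index=pd.Index([0, 0, 1], name="counter"),
)

result = (
    toy.drop(columns=["prot_names"])
    # Group on the index level "counter" together with the "Gene_names" column.
    .groupby(["counter", "Gene_names"])
    # Average the numeric columns within each group.
    .mean(numeric_only=True)
    # Shorten the "translation" prefix in the column names.
    .rename(columns=lambda c: c.replace("translation", "trans"))
)
```

The result is indexed by `counter` and `Gene_names`, which is why those two appear as index columns in the output above.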


In [17]:
g.pipe_

[]

### Functions that are affected by `delay` (accessed in `turb.metapanda.__delay_functions__`) include:

- `add_prefix`
- `add_suffix`
- `apply`
- `drop`
- `expand`
- `melt`
- `meta_map`
- `multi_transform`
- `rename`
- `shrink`
- `sort_columns`
- `split_categories`
- `transform`

## Computing external pipelines

Often we want standardized pipelines that we can apply to many different `pandas.DataFrame` objects. For example, we may use the same steps for **standardizing a dataset** in preparation for machine learning algorithms.

`turbopanda` provides one of these presets, called `ml_pipe`:

In [18]:
turb.ml_pipe

<function turbopanda.pipes.ml_pipe(mp, X_s, y_s, preprocessor='scale')>

This function takes a `MetaPanda`, a selector for the input columns (`X_s`), a selector for the output column(s) (`y_s`), and optional arguments. It returns a pipeline list which can be passed directly into `compute()`:

In [29]:
npipe = turb.ml_pipe(g, "trans_G1_[1-3]", "trans_G2M_[1-3]")
npipe

[('drop', (object, '_id$', '_ID$'), {}),
 ('apply',
  ('dropna',),
  {'subset': Index(['trans_G2M_1', 'trans_G2M_2', 'trans_G2M_3'], dtype='object', name='colnames')}),
 ('transform',
  (<function turbopanda.pipes.ml_pipe.<locals>.<lambda>(x)>, 'trans_G1_[1-3]'),
  {}),
 ('transform',
  (<function sklearn.preprocessing.data.scale(X, axis=0, with_mean=True, with_std=True, copy=True)>,),
  {'selector': 'trans_G1_[1-3]', 'whole': True}),
 ('transform',
  (<function turbopanda.pipes.ml_pipe.<locals>.<lambda>(y)>,
   'trans_G2M_[1-3]'),
  {})]

In [30]:
ng = g.compute(npipe, inplace=False)
ng



MetaPanda(Translation(n=5190, p=11, mem=0.725MB), mode='instant')
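
The idea of building a reusable pipeline as a list and replaying it on a copy can be sketched in plain pandas. The helpers `standardize_pipe` and `run_pipe` are hypothetical and cover only two of the steps shown above (dropping rows with missing outputs and scaling the inputs); they are not the `ml_pipe` implementation.

```python
import pandas as pd


def standardize_pipe(df, x_regex, y_regex):
    """Build a list of (method, args, kwargs) steps (hypothetical helper)."""
    x_cols = list(df.columns[df.columns.str.contains(x_regex)])
    y_cols = list(df.columns[df.columns.str.contains(y_regex)])

    def zscore(frame):
        # Scale the input columns to zero mean and unit (population) variance.
        out = frame.copy()
        out[x_cols] = (out[x_cols] - out[x_cols].mean()) / out[x_cols].std(ddof=0)
        return out

    return [
        ("dropna", (), {"subset": y_cols}),
        ("pipe", (zscore,), {}),
    ]


def run_pipe(df, pipe):
    """Replay the steps on a DataFrame and return a new DataFrame."""
    for name, args, kwargs in pipe:
        df = getattr(df, name)(*args, **kwargs)
    return df


toy = pd.DataFrame({"x_1": [1.0, 2.0, 3.0, 4.0], "y_1": [0.5, None, 1.5, 2.5]})
steps = standardize_pipe(toy, "^x_", "^y_")
scaled = run_pipe(toy, steps)
```

Because every step returns a new DataFrame, `toy` is left untouched, which loosely mirrors calling `compute` with `inplace=False`.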