# Notebook 2: Going Deeper

**TurboPanda** provides functionality for manipulating columns, and groups of columns, that `pandas` often lacks.

In [1]:
import sys
import numpy as np
import pandas as pd
sys.path.insert(0,"../")
# our main import
import turbopanda as turb

## Reading in our dataset

In [2]:
g = turb.read("../data/translation.csv", name="Translation")
g

MetaPanda(Translation(n=5216, p=14, mem=1.169MB, options=[]))

## Accessing with `head`

Copying features over from `pandas.DataFrame` is normally something we avoid, but `head` is so useful that we decided to break with tradition:

In [3]:
g.head(3)

| counter | prot_IDs | prot_names | Gene_names | translation_G1_1 | translation_G1_2 | translation_G1_3 | translation_G2M_1 | translation_G2M_2 | translation_G2M_3 | translation_MG1_1 | translation_MG1_2 | translation_MG1_3 | translation_S_1 | translation_S_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Q96IC2;Q96IC2-2;H3BM72;H3BV93;H3BSC5 | Putative RNA exonuclease NEF-sp | 44M2.3 | 21.26058 | 20.47467 | 20.48905 | 21.01794 | 20.14569 | 22.29011 | 21.11775 | 20.71892 | 20.25788 | 20.58628 | 20.27662 |
| 1 | H0YGH4;P01023;H0YGH6;F8W7L3 | Alpha-2-macroglobulin | A2M | 22.62015 | 22.26825 | 23.11786 | 24.94606 | 24.21645 | 25.26399 | 23.56139 | 23.46051 | 22.21951 | 22.87688 | 23.35703 |
| 2 | A8K2U0;F5H2W3;H0YGG5;F5H2Z2;F5GXP1 | Alpha-2-macroglobulin-like protein 1 | A2ML1 | NaN | NaN | NaN | NaN | NaN | 25.11629 | NaN | NaN | NaN | NaN | NaN |


## Generic `apply` to the underlying DataFrame

In the many cases where we simply want to apply a `pandas.DataFrame.*` function to the underlying dataset while keeping the `df_` and `meta_` attributes consistent, `MetaPanda` provides an `apply` function, which is performed on the entire dataset.

This is similar to the `transform` function; however, `apply` only looks for functions within the `pandas.DataFrame.*` API, and it does not allow a subset of columns to be selected beforehand (using the `selector` parameter).

For instance, let's say we want to fill all the `NaN` values in the dataset with a value. We could do this easily with `pandas.DataFrame.fillna`, but with `MetaPanda` we would otherwise have to call the `transform` function with a custom lambda.

In [4]:
g.apply("fillna", 0)

MetaPanda(Translation(n=5216, p=14, mem=1.169MB, options=[]))

It's that easy.
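
As a rough illustration of the idea of looking a method up by name, the sketch below dispatches a string such as `"fillna"` to a plain `pandas.DataFrame` using `getattr`. The helper `apply_by_name` and the toy frame are hypothetical; this is an assumption about the general pattern, not TurboPanda's actual implementation, and it ignores the `meta_` bookkeeping entirely.

```python
import numpy as np
import pandas as pd


def apply_by_name(df, func_name, *args, **kwargs):
    """Call the DataFrame method named `func_name` and return its result.

    Hypothetical helper: illustrates name-based dispatch only.
    """
    method = getattr(df, func_name)  # AttributeError if the name is unknown
    return method(*args, **kwargs)


# toy data with two missing values
toy = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 4.0]})

# same call shape as g.apply("fillna", 0), but on a plain DataFrame
filled = apply_by_name(toy, "fillna", 0)
```

Here `filled` is a new frame with the missing values replaced, while `toy` itself is left untouched.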

`apply` also works on basic `.str` accessor methods, in addition to those found in the `pandas.DataFrame.*` API.

In [5]:
g.head(3)

| counter | prot_IDs | prot_names | Gene_names | translation_G1_1 | translation_G1_2 | translation_G1_3 | translation_G2M_1 | translation_G2M_2 | translation_G2M_3 | translation_MG1_1 | translation_MG1_2 | translation_MG1_3 | translation_S_1 | translation_S_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Q96IC2;Q96IC2-2;H3BM72;H3BV93;H3BSC5 | Putative RNA exonuclease NEF-sp | 44M2.3 | 21.26058 | 20.47467 | 20.48905 | 21.01794 | 20.14569 | 22.29011 | 21.11775 | 20.71892 | 20.25788 | 20.58628 | 20.27662 |
| 1 | H0YGH4;P01023;H0YGH6;F8W7L3 | Alpha-2-macroglobulin | A2M | 22.62015 | 22.26825 | 23.11786 | 24.94606 | 24.21645 | 25.26399 | 23.56139 | 23.46051 | 22.21951 | 22.87688 | 23.35703 |
| 2 | A8K2U0;F5H2W3;H0YGG5;F5H2Z2;F5GXP1 | Alpha-2-macroglobulin-like protein 1 | A2ML1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 25.11629 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


## String manipulation of ID columns

Datasets often couple numerical/quantitative data with IDs that help to identify rows within your problem domain. In this case, we have identifiers from **UniProt**, one of the main databases managing protein sequences.

There is a slight problem, though: each of these expression values has a *stack of protein IDs* associated with it. We therefore need to untangle this by **expanding** the ID column into something we can then perform **set theory** operations on.

`MetaPanda` comes with an `expand` function which does exactly this:

In [6]:
g.expand("prot_IDs", sep=";")

MetaPanda(Translation(n=26318, p=14, mem=6.316MB, options=[]))

We can see in this instance that $n$ has increased significantly (from 5216 to 26318 rows), as has the memory usage (from 1.169MB to 6.316MB).
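
For readers who want to see what this kind of expansion looks like in plain `pandas`, here is a minimal sketch using `Series.str.split` followed by `DataFrame.explode`. The two-row toy frame (with a shortened list of IDs and a made-up `value` column) is purely illustrative, and this is not necessarily how `expand` is implemented internally.

```python
import pandas as pd

# toy frame: each row carries a ';'-separated stack of IDs
toy = pd.DataFrame({
    "prot_IDs": ["Q96IC2;Q96IC2-2;H3BM72", "H0YGH4;P01023"],
    "value": [21.26, 22.62],
})

# split each string into a list, then give every list element its own row;
# the original index label is repeated for rows that came from the same stack
expanded = (
    toy.assign(prot_IDs=toy["prot_IDs"].str.split(";"))
       .explode("prot_IDs")
)
```

The repeated index labels mirror the repeated `counter` values in the expanded `MetaPanda`: rows that came from the same stack still share their original row label.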

In [7]:
g.head(2)

| counter | prot_IDs | prot_names | Gene_names | translation_G1_1 | translation_G1_2 | translation_G1_3 | translation_G2M_1 | translation_G2M_2 | translation_G2M_3 | translation_MG1_1 | translation_MG1_2 | translation_MG1_3 | translation_S_1 | translation_S_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Q96IC2 | Putative RNA exonuclease NEF-sp | 44M2.3 | 21.26058 | 20.47467 | 20.48905 | 21.01794 | 20.14569 | 22.29011 | 21.11775 | 20.71892 | 20.25788 | 20.58628 | 20.27662 |
| 0 | Q96IC2-2 | Putative RNA exonuclease NEF-sp | 44M2.3 | 21.26058 | 20.47467 | 20.48905 | 21.01794 | 20.14569 | 22.29011 | 21.11775 | 20.71892 | 20.25788 | 20.58628 | 20.27662 |


## Applying a transformation to eliminate duplicates

Here we see that some labels are *sort of* duplicated, in the sense that they differ only by a `-[0-9]` suffix. We will split these suffixes off and keep the left-hand part, then apply a function which drops duplicates.

In [8]:
g.transform(lambda x: x.str.split("-",expand=True)[0], "prot_IDs")

MetaPanda(Translation(n=26318, p=14, mem=6.316MB, options=[]))

Now use `pandas.DataFrame.drop_duplicates`, with the subset restricted to the protein IDs:

In [9]:
g.apply("drop_duplicates", subset=["prot_IDs"])

MetaPanda(Translation(n=21050, p=14, mem=5.052MB, options=[]))
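
The same two steps can be sketched on a plain `pandas.DataFrame`. The toy frame and its `value` column are made up for illustration; the `str.split` and `drop_duplicates` calls are the standard `pandas` ones used in the cells above.

```python
import pandas as pd

# toy frame: "Q96IC2-2" differs from "Q96IC2" only by its suffix
toy = pd.DataFrame({
    "prot_IDs": ["Q96IC2", "Q96IC2-2", "H3BM72"],
    "value": [1.0, 1.0, 1.0],
})

# keep only the part before the first "-" (column 0 of the expanded split)
toy["prot_IDs"] = toy["prot_IDs"].str.split("-", expand=True)[0]

# drop rows whose ID has already been seen; the first occurrence is kept
deduped = toy.drop_duplicates(subset=["prot_IDs"])
```

Because `drop_duplicates` keeps the first occurrence by default, the row that originally held `Q96IC2` survives and the one that held `Q96IC2-2` is removed.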

In [10]:
g.head(3)

| counter | prot_IDs | prot_names | Gene_names | translation_G1_1 | translation_G1_2 | translation_G1_3 | translation_G2M_1 | translation_G2M_2 | translation_G2M_3 | translation_MG1_1 | translation_MG1_2 | translation_MG1_3 | translation_S_1 | translation_S_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Q96IC2 | Putative RNA exonuclease NEF-sp | 44M2.3 | 21.26058 | 20.47467 | 20.48905 | 21.01794 | 20.14569 | 22.29011 | 21.11775 | 20.71892 | 20.25788 | 20.58628 | 20.27662 |
| 0 | H3BM72 | Putative RNA exonuclease NEF-sp | 44M2.3 | 21.26058 | 20.47467 | 20.48905 | 21.01794 | 20.14569 | 22.29011 | 21.11775 | 20.71892 | 20.25788 | 20.58628 | 20.27662 |
| 0 | H3BV93 | Putative RNA exonuclease NEF-sp | 44M2.3 | 21.26058 | 20.47467 | 20.48905 | 21.01794 | 20.14569 | 22.29011 | 21.11775 | 20.71892 | 20.25788 | 20.58628 | 20.27662 |


## Pipelines

Pipes in TurboPanda allow you to chain together operations which are performed on the raw dataset and potentially on the metadata as well.

Pipe elements can come in the following two formats:

    1. (<function name in MetaPanda>, <function arguments>, <function keyword arguments>) 
    2. (<function name in MetaPanda>, *<arguments>)

For instance, the earlier step of dropping duplicates found in the `prot_IDs` column becomes:

```python
('apply', ('drop_duplicates',), {'subset':'prot_IDs'})
```

Alternatively, the *clean* version allows us to pass the arguments as a flat list, with keywords written as strings containing an `=`:

```python
('apply', 'drop_duplicates', 'subset=prot_IDs')
```
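
To make the relationship between the two formats concrete, here is a small sketch of how a *clean* element could be turned into the three-part form. The function `to_raw_step` is hypothetical and written only for illustration; TurboPanda's own conversion may differ, for example in how it converts value types or handles strings that happen to contain `=`.

```python
def to_raw_step(step):
    """Convert ('name', arg, ..., 'key=value') into ('name', args, kwargs).

    Hypothetical helper: any string containing '=' is treated as a keyword.
    """
    name, *rest = step
    args, kwargs = [], {}
    for item in rest:
        if isinstance(item, str) and "=" in item:
            key, value = item.split("=", 1)
            kwargs[key] = value  # the value is left as a string here
        else:
            args.append(item)
    return (name, tuple(args), kwargs)


raw = to_raw_step(("apply", "drop_duplicates", "subset=prot_IDs"))
```

With these inputs, `raw` has the same shape as the three-part element shown above: the function name, a tuple of positional arguments, and a dictionary of keywords.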

### Defining a pipe

We have a `Pipe` class for defining pipes, which by default accepts the *clean* form of the arguments.

In [12]:
# tells us the number of pipe elements inside.
turb.Pipe()

Pipe(n_elements=0)

### Adding pipe steps

Pipe operations can be chained one after another, provided each step returns a `pandas.DataFrame` of some kind; the main exception to this is `pandas.DataFrame.groupby`.

Checking the `p` attribute within the `Pipe` reveals the *raw* or ugly version of the pipeline 
that `MetaPanda` uses when it is processing the commands:

In [13]:
turb.Pipe(['apply', 'drop_duplicates', 'subset=prot_IDs']).p

(('apply', ('drop_duplicates',), {'subset': 'prot_IDs'}),)

To add multiple steps, each pipe element must be encapsulated in a tuple or list:

In [15]:
cleaner = turb.Pipe(['apply', 'drop_duplicates', 'subset=prot_IDs'],
                    ['apply', 'fillna', 0])

### Computing changes using a Pipe

To use these constructed pipes on the `MetaPanda` object, we use the `compute` function.

`compute()` accepts `pipe` and `inplace` as optional arguments. If `pipe is None`, it executes whatever is in the `pipe_` attribute and then empties it. If `inplace` is `False`, a copy of the `MetaPanda` is returned instead of modifying the object in place.

Note that `inplace=False` is the default, so it must be set to `True` to update the dataframe in place.

There is also a `compute_k()` function which allows users to join together
multiple pipelines in order.
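
The general idea of running a sequence of steps and returning a new object can be sketched with plain `pandas`. The helper `run_steps` below is hypothetical and simplifies things: each step names a `DataFrame` method directly rather than a `MetaPanda` function, and no metadata is tracked. It is an assumption about the pattern, not a description of how `compute` works internally.

```python
import numpy as np
import pandas as pd


def run_steps(df, steps):
    """Apply (method_name, args, kwargs) steps to a DataFrame in order.

    Hypothetical helper: each step must return a DataFrame so that the
    next step has something to act on.
    """
    result = df
    for method_name, args, kwargs in steps:
        result = getattr(result, method_name)(*args, **kwargs)
    return result


steps = (
    ("drop_duplicates", (), {"subset": ["id"]}),
    ("fillna", (0,), {}),
)

toy = pd.DataFrame({"id": ["a", "a", "b"], "x": [1.0, 2.0, np.nan]})

# analogous to inplace=False: `toy` is left as it was, `out` is the result
out = run_steps(toy, steps)
```

Since both `drop_duplicates` and `fillna` return new frames by default, the original `toy` keeps its three rows and its missing value.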

In [17]:
# firstly saving the pipe to MetaPanda
g.cache_pipe('clean', cleaner)
g.pipe_

{'current': [],
 'clean': (('apply', ('drop_duplicates',), {'subset': 'prot_IDs'}),
  ('apply', ('fillna', 0), {}))}

We could have passed the pipe directly to `compute`, but now that it is cached we can simply refer to it by the name we gave it:

In [18]:
g.compute('clean')


MetaPanda(Translation(n=21050, p=14, mem=5.052MB, options=[P]))

## Computing external pipelines

Often we want standardized pipelines that we can apply to many different `pandas.DataFrame` objects. For example, we may have a common procedure for **standardizing a dataset** in preparation for machine learning algorithms.

`turbopanda` provides one of these presets, called `ml_regression`:

In [21]:
turb.Pipe.ml_regression

<bound method Pipe.ml_regression of <class 'turbopanda._pipe.Pipe'>>

`MetaPanda` also uses the `clean` Pipe function, when enabled, to sanitise `pandas.DataFrame` objects that are passed to it.

The `ml_regression` function takes a `MetaPanda`, a selector for the input columns and a selector for the output column(s), plus optional arguments, and returns a pipeline list which can be passed directly into `compute()`:

In [1]:
npipe = turb.Pipe.ml_regression(g, "trans_G1_[1-3]", "trans_G2M_[1-3]")
npipe


By default `inplace=False`, so `compute` returns a copy.

In [19]:
ng = g.compute(npipe)
ng

MetaPanda(Translation(n=5201, p=11, mem=0.726MB, key='None'), mode='instant')

## Creating your own pipeline

Creating your own pipelines is incredibly useful for performing coherent 
stage-wise actions on a DataFrame, such as cleaning all of the column 
names, or preparing a dataset for Machine-Learning applications, or 
doing extensive preprocessing on a subset of columns.

Remember the three key parts of a pipeline argument:

    (<function name>, <function args>, <function kwargs>)

In [20]:
(
    # operation one, lower strings in df.columns
    ("apply_columns", ("lower",), {}),
    # operation two, lower strings in df.index, axis is redundant
    ("apply_index", ("lower",), {"axis": 0})
)

(('apply_columns', ('lower',), {}), ('apply_index', ('lower',), {'axis': 0}))

Instead of writing out all of that verbose code, you can use the `turb.pipe` shorthand.

This lets you write all of the *arguments* in a single list; the hard work of converting it into the normal *pipeline* format is then done for you.

In [21]:
turb.pipe([["apply_columns","lower"],["apply_index","lower"]])

[('apply_columns', ('lower',), {}), ('apply_index', ('lower',), {})]

Keywords can be written as strings, which are then automatically converted into objects/types downstream. For example, in the keyword string `"axis=0"`, the `"0"` is converted into the integer `0`.

Note that this shorthand does NOT accept non-basic arguments as keywords. For example, `lambda` expressions, functions and `dtypes` cannot be passed as keyword arguments. However, strings, ints, booleans and floats are accepted.

In [22]:
turb.pipe([["apply_columns", "lower", "axis=0"]])

[('apply_columns', ('lower',), {'axis': 0})]
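
One way such a string-to-type conversion could work is sketched below using the standard library's `ast.literal_eval`, falling back to the raw string when the value is not a Python literal. The function `parse_keyword` is hypothetical; the actual rules used by `turb.pipe` are not shown in this notebook and may differ.

```python
import ast


def parse_keyword(text):
    """Split 'key=value' and convert the value to a basic Python type.

    Hypothetical helper: ints, floats and booleans are recognised via
    ast.literal_eval (which also accepts other literals); anything it
    cannot parse is kept as a plain string.
    """
    key, raw_value = text.split("=", 1)
    try:
        value = ast.literal_eval(raw_value)
    except (ValueError, SyntaxError):
        value = raw_value
    return key, value


axis_kw = parse_keyword("axis=0")             # value becomes the int 0
subset_kw = parse_keyword("subset=prot_IDs")  # value stays a string
sep_kw = parse_keyword("sep=;")               # value stays a string
```

This also shows why callables such as `lambda` expressions do not fit the shorthand: a string like `"func=lambda x: x"` is not a literal, so this sketch would leave it as text rather than produce a function.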

### Example: Using the operations above

Rather than setting `mode=delay`, we can simply create a pipe of the operations and then call `compute()` to either create a copy or operate in place.

Here we will create several pipelines:

1. Pipeline one: expand the protein IDs and drop duplicates, in order to *clean the labels*.
2. Pipeline two: clean, group by counters and gene names, and rename the columns ready for display.

In [85]:
g2 = turb.read("translation.csv", name="Translation")
g2

MetaPanda(Translation(n=5216, p=14, mem=0.585MB, key='None'), mode='instant')

#### Pipe 1

- Uses the MetaPanda `expand` function to split on `;` characters
- Transforms that column to drop the `-[0-9]` suffix
- Filters rows to keep only non-duplicated IDs

#### Pipe 2

- Drops the protein name
- Groups by the counter and `Gene_names`, and calculates the mean
- Renames the columns

Note that normally `groupby` and `mean` are separate functions in `pandas`, but in TurboPanda you can specify a built-in aggregator (such as `sum`, `mean`, `std`, `min`, `max`) by separating `groupby` with a double-underscore `__`:

Examples:

* `"groupby__mean"`, `"groupby__std"`, `"groupby__sum"`, `"groupby__count"`
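
In plain `pandas`, a name such as `"groupby__mean"` corresponds to calling `groupby` and then the aggregator on the result. The sketch below shows one way the double-underscore convention could be resolved; the helper `apply_with_aggregate` and the toy frame are hypothetical, and this is not taken from TurboPanda's source.

```python
import pandas as pd


def apply_with_aggregate(df, spec, by):
    """Resolve a 'groupby__<aggregator>' string against a DataFrame.

    Hypothetical helper: the part before '__' names the DataFrame method,
    the part after it names the aggregator called on the grouped object.
    """
    func_name, _, agg_name = spec.partition("__")
    grouped = getattr(df, func_name)(by)
    return getattr(grouped, agg_name)() if agg_name else grouped


toy = pd.DataFrame({"gene": ["A2M", "A2M", "AAAS"], "x": [1.0, 3.0, 5.0]})

# equivalent to toy.groupby(["gene"]).mean()
means = apply_with_aggregate(toy, "groupby__mean", ["gene"])
```

Folding the aggregation into the same step means the step returns a `DataFrame` rather than a grouped object, which is what a chained pipeline needs.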

In [90]:
pipe1 = turb.pipe([
    # expand on protein IDs
    ["expand", "prot_IDs", "sep=;"],
    # split off the -2, -3 etc. suffixes to expose duplicates
    ["transform", lambda x: x.str.split("-",expand=True)[0], "^prot_ID"],
    # filter out duplicates
    ["filter_rows", lambda x: ~x.duplicated(), "^prot_ID"],
])

pipe2 = turb.pipe([
    # drop protein name
    ["drop", "prot_name"],
    # groupby counter,gene names and then aggregate on mean
    ["apply", "groupby__mean", ["counter","Gene_names"]],
    # rename
    ["rename", [("translation","trans")]]
])
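
The `filter_rows` step in `pipe1` uses a boolean mask built from `duplicated()` rather than calling `drop_duplicates`. On a plain `pandas.DataFrame` the two approaches select the same rows, as this small sketch on made-up data suggests; whether `filter_rows` applies the mask in exactly this way is an assumption.

```python
import pandas as pd

toy = pd.DataFrame({
    "prot_IDs": ["Q96IC2", "Q96IC2", "H3BM72"],
    "value": [1.0, 2.0, 3.0],
})

# duplicated() marks every repeat after the first occurrence as True,
# so inverting it gives a mask of the rows to keep
keep = ~toy["prot_IDs"].duplicated()
by_mask = toy[keep]

# the same selection expressed with drop_duplicates
by_drop = toy.drop_duplicates(subset=["prot_IDs"])
```

Both results keep the first row for `Q96IC2` and the single row for `H3BM72`.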

### Using `compute_k()`

This provides yet another shorthand for executing multiple pipelines in a row.

In [94]:
g2_refreshed = g2.compute_k([pipe1, pipe2])
g2_refreshed

MetaPanda(Translation(n=5190, p=11, mem=0.561MB, key='None'), mode='instant')

In [95]:
g2_refreshed.head()


| counter | Gene_names | trans_G1_1 | trans_G1_2 | trans_G1_3 | trans_G2M_1 | trans_G2M_2 | trans_G2M_3 | trans_MG1_1 | trans_MG1_2 | trans_MG1_3 | trans_S_1 | trans_S_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 44M2.3 | 21.26058 | 20.47467 | 20.48905 | 21.01794 | 20.14569 | 22.29011 | 21.11775 | 20.71892 | 20.25788 | 20.58628 | 20.27662 |
| 1 | A2M | 22.62015 | 22.26825 | 23.11786 | 24.94606 | 24.21645 | 25.26399 | 23.56139 | 23.46051 | 22.21951 | 22.87688 | 23.35703 |
| 2 | A2ML1 | NaN | NaN | NaN | NaN | NaN | 25.11629 | NaN | NaN | NaN | NaN | NaN |
| 3 | AAAS | 25.48382 | 24.42746 | 25.22645 | 24.44556 | 23.93706 | 25.30966 | 25.61462 | 25.45923 | 24.48253 | 24.31645 | 23.92143 |
| 4 | AACS | 24.18177 | 24.51533 | 24.32766 | 24.15993 | 24.05001 | 24.95797 | 24.11656 | 24.22523 | 23.96446 | 23.8944 | 23.78107 |
