# Notebook 1: Basic Example

This notebook walks through a basic example of how **TurboPanda** works and how it can benefit you.

### Requirements:

- `numpy`
- `pandas`
- `matplotlib`
- `seaborn`
- `scipy.stats`
- `sklearn`

In [1]:
import sys
import pandas as pd
# map path back to the directory above turbopanda/
sys.path.insert(0,"../")
# our main import
import turbopanda as trb

## Create a `MetaPanda` object

A `MetaPanda` object wraps a single `pandas.DataFrame`. Its `__repr__` summarizes the dataset by its dimensions and memory usage. Future versions aim to encapsulate multiple `pandas.DataFrame` objects.

In [2]:
# load a dataset from a local CSV file
dd = pd.read_csv("../../Research Data/datasets/aviner_translation.csv")
g = trb.MetaPanda(dd, name="translation")
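As a rough illustration of what such a summary involves, here is a minimal sketch in plain `pandas`. The helper name `summarize` and the use of `memory_usage(deep=True)` divided by 1e6 are assumptions for illustration; the output format only mimics the `MetaPanda(...)` strings shown in this notebook and is not taken from the TurboPanda source.

```python
import pandas as pd


def summarize(df: pd.DataFrame, name: str) -> str:
    """Build a one-line summary: rows (n), columns (p) and memory in MB."""
    n, p = df.shape
    # deep=True also counts the contents of object (string) columns
    mem_mb = df.memory_usage(deep=True).sum() / 1e6
    return f"MetaPanda({name}(n={n}, p={p}, mem={mem_mb:.3f}MB))"


toy = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
summary = summarize(toy, "toy")
print(summary)
```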

## Viewing the Dataset

We can access the pandas object using the `df_` attribute:

In [3]:
g.df_.head()

|  | Protein_IDs | Majority_protein_IDs | Protein_names | Gene_names | Intensity_G1_1 | Intensity_G1_2 | Intensity_G1_3 | Intensity_G2M_1 | Intensity_G2M_2 | Intensity_G2M_3 | Intensity_MG1_1 | Intensity_MG1_2 | Intensity_MG1_3 | Intensity_S_1 | Intensity_S_2 | Intensity_S_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Q96IC2;Q96IC2-2;H3BM72;H3BV93;H3BSC5 | Q96IC2;Q96IC2-2;H3BM72;H3BV93;H3BSC5 | Putative RNA exonuclease NEF-sp | 44M2.3 | 21.26058 | 20.47467 | 20.48905 | 21.01794 | 20.14569 | 22.29011 | 21.11775 | 20.71892 | 20.25788 | 20.58628 | 20.27662 | 21.05682 |
| 1 | H0YGH4;P01023;H0YGH6;F8W7L3 | H0YGH4;P01023 | Alpha-2-macroglobulin | A2M | 22.62015 | 22.26825 | 23.11786 | 24.94606 | 24.21645 | 25.26399 | 23.56139 | 23.46051 | 22.21951 | 22.87688 | 23.35703 | 23.25724 |
| 2 | A8K2U0;F5H2W3;H0YGG5;F5H2Z2;F5GXP1 | A8K2U0;F5H2W3;H0YGG5;F5H2Z2 | Alpha-2-macroglobulin-like protein 1 | A2ML1 | NaN | NaN | NaN | NaN | NaN | 25.11629 | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | Q9NRG9;Q9NRG9-2;F8VZ44;H3BU82;F8VUB6 | Q9NRG9;Q9NRG9-2;F8VZ44;H3BU82 | Aladin | AAAS | 25.48382 | 24.42746 | 25.22645 | 24.44556 | 23.93706 | 25.30966 | 25.61462 | 25.45923 | 24.48253 | 24.31645 | 23.92143 | 25.21838 |
| 4 | Q86V21;Q86V21-2;E7EW25;F5H790;F8W8B5;Q86V21-3;... | Q86V21;Q86V21-2;E7EW25;F5H790 | Acetoacetyl-CoA synthetase | AACS | 24.18177 | 24.51533 | 24.32766 | 24.15993 | 24.05001 | 24.95797 | 24.11656 | 24.22523 | 23.96446 | 23.8944 | 23.78107 | 24.32629 |


### Some important modifications...

`MetaPanda` does **not** accept a `MultiIndex` for columns; the levels will be concatenated together. It also cleans your column names, removing spaces, tabs and similar characters, so that they are easier to use in code.
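To make the idea concrete, here is a minimal sketch of this kind of flattening and cleaning in plain `pandas`. The function name `flatten_columns`, the `_` separator and the exact cleaning rule are assumptions for illustration, not TurboPanda's actual behavior.

```python
import re

import pandas as pd


def flatten_columns(df: pd.DataFrame, sep: str = "_") -> pd.DataFrame:
    """Join MultiIndex column levels and replace whitespace in the names."""
    out = df.copy()
    if isinstance(out.columns, pd.MultiIndex):
        # join the levels of each column key into a single string
        out.columns = [sep.join(str(level) for level in key) for key in out.columns]
    # collapse runs of whitespace (spaces, tabs, newlines) into the separator
    out.columns = [re.sub(r"\s+", sep, str(col).strip()) for col in out.columns]
    return out


cols = pd.MultiIndex.from_tuples([("Intensity", "G1 1"), ("Intensity", "S\t1")])
toy = pd.DataFrame([[1.0, 2.0]], columns=cols)
flat = flatten_columns(toy)
print(list(flat.columns))
```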

Categorization assigns each data column its most suitable type. We spend some time working out whether a column should be stored as a `category`, `bool`, `int` or `float` dtype for maximum efficiency.
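A simplified version of such type inference might look like the sketch below. The function name `categorize`, the `max_unique_ratio` threshold and the individual rules are hypothetical choices for illustration; the heuristics TurboPanda really applies are not described here.

```python
import pandas as pd
from pandas.api import types as ptypes


def categorize(df: pd.DataFrame, max_unique_ratio: float = 0.5) -> pd.DataFrame:
    """Return a copy of df with columns cast to more compact dtypes."""
    out = df.copy()
    for col in out.columns:
        s = out[col]
        if ptypes.is_object_dtype(s) or ptypes.is_string_dtype(s):
            # few distinct values relative to the length: store as category
            if s.nunique() <= max_unique_ratio * len(s):
                out[col] = s.astype("category")
        elif ptypes.is_integer_dtype(s):
            if set(s.unique()) <= {0, 1}:
                # only zeros and ones: treat as a boolean flag
                out[col] = s.astype(bool)
            else:
                # otherwise pick the smallest integer type that fits
                out[col] = pd.to_numeric(s, downcast="integer")
    return out


toy = pd.DataFrame({
    "phase": ["G1", "S", "G1", "S"],
    "flag": [0, 1, 1, 0],
    "count": [10, 200, 30, 40],
    "intensity": [21.3, 22.6, 25.5, 24.2],
})
typed = categorize(toy)
print(typed.dtypes)
```

Float columns are left untouched in this sketch, since downcasting them would trade precision for memory.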

## Meta-information on the columns

This can be accessed with the `meta_` attribute:

In [4]:
g.meta_.head()

|  | is_unique | potential_id | potential_stacker | is_norm |
|---|---|---|---|---|
| Protein_IDs | True | True | True | False |
| Majority_protein_IDs | True | True | True | False |
| Protein_names | False | True | True | False |
| Gene_names | False | True | True | False |
| Intensity_G1_1 | False | False | False | True |


## Accessing column subsets using a vast variety of methods

In plain `pandas`, selecting subsets of DataFrame columns can be cumbersome. Here we allow the use of `regex` **and** types (such as `float`) to select the columns whose names match a pattern or whose data has a given type, for example:

In [5]:
g["Intensity_[MG12S]*_1"].head()

|  | Intensity_G1_1 | Intensity_G2M_1 | Intensity_MG1_1 | Intensity_S_1 |
|---|---|---|---|---|
| 0 | 21.26058 | 21.01794 | 21.11775 | 20.58628 |
| 1 | 22.62015 | 24.94606 | 23.56139 | 22.87688 |
| 2 | NaN | NaN | NaN | NaN |
| 3 | 25.48382 | 24.44556 | 25.61462 | 24.31645 |
| 4 | 24.18177 | 24.15993 | 24.11656 | 23.8944 |


Or using type:

In [6]:
g[object].head()

|  | Protein_IDs | Majority_protein_IDs | Protein_names | Gene_names |
|---|---|---|---|---|
| 0 | Q96IC2;Q96IC2-2;H3BM72;H3BV93;H3BSC5 | Q96IC2;Q96IC2-2;H3BM72;H3BV93;H3BSC5 | Putative RNA exonuclease NEF-sp | 44M2.3 |
| 1 | H0YGH4;P01023;H0YGH6;F8W7L3 | H0YGH4;P01023 | Alpha-2-macroglobulin | A2M |
| 2 | A8K2U0;F5H2W3;H0YGG5;F5H2Z2;F5GXP1 | A8K2U0;F5H2W3;H0YGG5;F5H2Z2 | Alpha-2-macroglobulin-like protein 1 | A2ML1 |
| 3 | Q9NRG9;Q9NRG9-2;F8VZ44;H3BU82;F8VUB6 | Q9NRG9;Q9NRG9-2;F8VZ44;H3BU82 | Aladin | AAAS |
| 4 | Q86V21;Q86V21-2;E7EW25;F5H790;F8W8B5;Q86V21-3;... | Q86V21;Q86V21-2;E7EW25;F5H790 | Acetoacetyl-CoA synthetase | AACS |


Or using the `meta_` attribute columns as a selector:

In [7]:
g["is_norm"].head()

|  | Intensity_G1_1 | Intensity_G1_2 | Intensity_G1_3 | Intensity_G2M_1 | Intensity_G2M_2 | Intensity_G2M_3 | Intensity_MG1_1 | Intensity_MG1_2 | Intensity_MG1_3 | Intensity_S_1 | Intensity_S_2 | Intensity_S_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 21.26058 | 20.47467 | 20.48905 | 21.01794 | 20.14569 | 22.29011 | 21.11775 | 20.71892 | 20.25788 | 20.58628 | 20.27662 | 21.05682 |
| 1 | 22.62015 | 22.26825 | 23.11786 | 24.94606 | 24.21645 | 25.26399 | 23.56139 | 23.46051 | 22.21951 | 22.87688 | 23.35703 | 23.25724 |
| 2 | NaN | NaN | NaN | NaN | NaN | 25.11629 | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 25.48382 | 24.42746 | 25.22645 | 24.44556 | 23.93706 | 25.30966 | 25.61462 | 25.45923 | 24.48253 | 24.31645 | 23.92143 | 25.21838 |
| 4 | 24.18177 | 24.51533 | 24.32766 | 24.15993 | 24.05001 | 24.95797 | 24.11656 | 24.22523 | 23.96446 | 23.8944 | 23.78107 | 24.32629 |


A key is resolved in the following priority order:

1. If the key is a type, select by type
2. Else if the key is a column name in `meta_`, use that meta column
3. Else if the key is not found in the `df_` columns, treat it as a regex
4. Else, pass it to pandas as a direct column name
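The priority order above can be sketched as a small resolver in plain `pandas`. The function `resolve` and the toy `df` and `meta` frames are hypothetical stand-ins for `df_` and `meta_`; this only illustrates the decision order and is not TurboPanda's implementation.

```python
import pandas as pd


def resolve(df: pd.DataFrame, meta: pd.DataFrame, key) -> pd.Index:
    """Return the column names matched by key, checking each rule in turn."""
    if isinstance(key, type):
        # 1. a Python type selects columns by dtype
        return df.select_dtypes(include=[key]).columns
    if key in meta.columns:
        # 2. a boolean meta column selects the column names marked True
        return meta.index[meta[key].to_numpy()]
    if key not in df.columns:
        # 3. anything else that is not a column name is treated as a regex
        return df.columns[df.columns.str.contains(key, regex=True)]
    # 4. fall back to the exact column name
    return pd.Index([key])


df = pd.DataFrame({
    "Protein_IDs": ["Q96IC2", "H0YGH4"],
    "Intensity_G1_1": [21.26058, 22.62015],
    "Intensity_G1_2": [20.47467, 22.26825],
    "Intensity_S_1": [20.58628, 22.87688],
})
meta = pd.DataFrame({"is_norm": [False, True, True, True]}, index=df.columns)

for key in (float, "is_norm", "_1$", "Protein_IDs"):
    print(key, "->", list(resolve(df, meta, key)))
```

One consequence of this ordering is that a meta column name shadows any regex with the same text, so meta columns and regex patterns are best kept visually distinct.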

## Viewing selections with the `view` function

Above we used a *selector* to extract a subgroup of columns. What if we want to see which column names match before we do anything with them?

Here we *view* by a meta-data column:

In [8]:
g.view("is_norm")

Index(['Intensity_G1_1', 'Intensity_G1_2', 'Intensity_G1_3', 'Intensity_G2M_1',
       'Intensity_G2M_2', 'Intensity_G2M_3', 'Intensity_MG1_1',
       'Intensity_MG1_2', 'Intensity_MG1_3', 'Intensity_S_1', 'Intensity_S_2',
       'Intensity_S_3'],
      dtype='object')

Or by a regular expression (regex):

In [9]:
g.view("Intensity_[G1SM2]+_1")

Index(['Intensity_G1_1', 'Intensity_G2M_1', 'Intensity_MG1_1',
       'Intensity_S_1'],
      dtype='object')

Or by a data type selection:

In [10]:
g.view(object)

Index(['Protein_IDs', 'Majority_protein_IDs', 'Protein_names', 'Gene_names'], dtype='object')

Or by a custom function that takes the whole dataframe and returns a boolean selection, for instance based on a threshold on the variance or the sample size:

In [11]:
g.view(lambda df: df.var().gt(5.3))

Index(['Protein_IDs', 'Majority_protein_IDs', 'Protein_names', 'Gene_names',
       'Intensity_G1_3', 'Intensity_MG1_3'],
      dtype='object')

## Viewing *not* selected columns

We can find which columns *remain* using `view_not` with our selection. This can be very useful if we wish to isolate some specific group or trait.

In [12]:
g.view_not(object)

Index(['Intensity_G1_1', 'Intensity_G1_2', 'Intensity_G1_3', 'Intensity_G2M_1',
       'Intensity_G2M_2', 'Intensity_G2M_3', 'Intensity_MG1_1',
       'Intensity_MG1_2', 'Intensity_MG1_3', 'Intensity_S_1', 'Intensity_S_2',
       'Intensity_S_3'],
      dtype='object')
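Conceptually, this is the complement of a selection with respect to all columns. A minimal sketch in plain `pandas`, using a shortened, hand-written column list as an assumption:

```python
import pandas as pd

columns = pd.Index(["Protein_IDs", "Gene_names", "Intensity_G1_1", "Intensity_S_1"])
selected = pd.Index(["Protein_IDs", "Gene_names"])

# keep every column that is not in the selection, preserving the original order
remaining = columns[~columns.isin(selected)]
print(list(remaining))
```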

## Creating multi-views

When multiple selection criteria are given, `turbopanda` by default keeps the **union** of the terms provided:

$$
S=\bigcup_i t_i
$$

This means that if you select for `object` and for "Intensity", you will get all of the column names that are of type `object` **OR** contain the string "Intensity".

This is in contrast to an **intersection** of terms, where you would get only the column names that are of type `object` **AND** contain the string "Intensity".

In [13]:
g.view(float, "_1", "G1")

Index(['Intensity_G1_1', 'Intensity_G1_2', 'Intensity_G1_3', 'Intensity_G2M_1',
       'Intensity_G2M_2', 'Intensity_G2M_3', 'Intensity_MG1_1',
       'Intensity_MG1_2', 'Intensity_MG1_3', 'Intensity_S_1', 'Intensity_S_2',
       'Intensity_S_3'],
      dtype='object')

In [14]:
g.view(float), g.view("_1"), g.view("G1")

(Index(['Intensity_G1_1', 'Intensity_G1_2', 'Intensity_G1_3', 'Intensity_G2M_1',
        'Intensity_G2M_2', 'Intensity_G2M_3', 'Intensity_MG1_1',
        'Intensity_MG1_2', 'Intensity_MG1_3', 'Intensity_S_1', 'Intensity_S_2',
        'Intensity_S_3'],
       dtype='object'),
 Index(['Intensity_G1_1', 'Intensity_G2M_1', 'Intensity_MG1_1',
        'Intensity_S_1'],
       dtype='object'),
 Index(['Intensity_G1_1', 'Intensity_G1_2', 'Intensity_G1_3', 'Intensity_MG1_1',
        'Intensity_MG1_2', 'Intensity_MG1_3'],
       dtype='object'))
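The difference between the two set operations can be reproduced with plain `pandas.Index` methods. This sketch rebuilds the twelve intensity column names by hand and stands in for the three selections above; it assumes, as in this dataset, that every intensity column is a `float` column.

```python
import pandas as pd

cols = pd.Index([f"Intensity_{phase}_{rep}"
                 for phase in ("G1", "G2M", "MG1", "S") for rep in (1, 2, 3)])

by_float = cols                              # stands in for view(float)
by_suffix = cols[cols.str.contains("_1")]    # stands in for view("_1")
by_phase = cols[cols.str.contains("G1")]     # stands in for view("G1")

# union: columns matched by ANY of the terms
union = by_float.union(by_suffix, sort=False).union(by_phase, sort=False)
# intersection: columns matched by ALL of the terms
intersection = by_float.intersection(by_suffix, sort=False).intersection(by_phase, sort=False)

print(len(union), sorted(intersection))
```

Because the `float` selection already covers all twelve columns, the union cannot be narrower than it, whereas the intersection shrinks to the columns that satisfy all three terms.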

## Renaming columns using rules

Often we want to chain together several changes to our column names that either make them shorter or make the dataframe *pretty* in preparation for graphs.

A `MetaPanda` object can chain together a series of *string replacements* to proactively apply to the column names to aid this process.

In [15]:
g.rename([("Protein|protein","prot"),("Intensity","translation"),("Gene","gene"),
          ("Majority","maj"), ("IDs","ids")], apply=False)

Index(['prot_ids', 'maj_prot_ids', 'prot_names', 'gene_names',
       'translation_G1_1', 'translation_G1_2', 'translation_G1_3',
       'translation_G2M_1', 'translation_G2M_2', 'translation_G2M_3',
       'translation_MG1_1', 'translation_MG1_2', 'translation_MG1_3',
       'translation_S_1', 'translation_S_2', 'translation_S_3'],
      dtype='object')

Normally we set `apply=True` to make the changes immediately, but you can preview how the columns will look using `apply=False`, then switch to `True` when you are happy:

In [16]:
g.rename([("Protein|protein","prot"),("Intensity","translation"),("Gene","gene"),
          ("Majority","maj"), ("IDs","ids")], apply=True)

MetaPanda(translation(n=5216, p=16, mem=0.668MB))

In [17]:
g.meta_.index

Index(['prot_ids', 'maj_prot_ids', 'prot_names', 'gene_names',
       'translation_G1_1', 'translation_G1_2', 'translation_G1_3',
       'translation_G2M_1', 'translation_G2M_2', 'translation_G2M_3',
       'translation_MG1_1', 'translation_MG1_2', 'translation_MG1_3',
       'translation_S_1', 'translation_S_2', 'translation_S_3'],
      dtype='object')
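The effect of chaining replacement rules can be imitated with the standard `re` module. This sketch assumes each rule is a `(regex pattern, replacement)` pair applied in order to every name, which is consistent with the outputs above but is not confirmed against the TurboPanda source; `apply_rules` is a hypothetical helper.

```python
import re

rules = [("Protein|protein", "prot"), ("Intensity", "translation"),
         ("Gene", "gene"), ("Majority", "maj"), ("IDs", "ids")]


def apply_rules(names, rules):
    """Apply every (pattern, replacement) rule, in order, to each name."""
    renamed = []
    for name in names:
        for pattern, replacement in rules:
            name = re.sub(pattern, replacement, name)
        renamed.append(name)
    return renamed


names = ["Protein_IDs", "Majority_protein_IDs", "Protein_names",
         "Gene_names", "Intensity_G1_1"]
preview = apply_rules(names, rules)
print(preview)
# ['prot_ids', 'maj_prot_ids', 'prot_names', 'gene_names', 'translation_G1_1']
```

Because the rules run in sequence, a later rule sees the output of earlier ones, so their order can matter when patterns overlap.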

## Caching selections using `cache`

We may wish to save our 'selected columns' using the `cache` function, particularly if it is a complicated or long selection criterion.

This also allows us to reference this cached selection using a *meaningful name* further down the line.

**NOTE**: Selections are *not* pre-computed. The selection itself is cached and **executed at runtime**, so if different columns are present further down the line, a *different result* will emerge.

In [18]:
g.cache("ids", object)

MetaPanda(translation(n=5216, p=16, mem=0.668MB))

Our cached selections now sit in a hidden attribute called `_select`:

In [19]:
g._select

{'ids': (object,)}

They can now be retrieved using the name we passed to `cache`:

In [20]:
g.view("ids")

Index(['prot_ids', 'maj_prot_ids', 'prot_names', 'gene_names'], dtype='object')
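The practical consequence of caching the selector rather than its result is shown in the sketch below. `SelectionCache` is a hypothetical, regex-only stand-in written for illustration; it borrows the `_select` name and tuple storage seen above but is not TurboPanda's class.

```python
import re


class SelectionCache:
    """Store selectors by name and evaluate them only when viewed."""

    def __init__(self):
        self._select = {}

    def cache(self, name, *selectors):
        # only the selectors are stored, never the matching columns
        self._select[name] = selectors
        return self

    def view(self, name, columns):
        # the match is computed against whatever columns exist right now
        patterns = self._select[name]
        return [c for c in columns if any(re.search(p, c) for p in patterns)]


store = SelectionCache().cache("ids", "_ids$")
before = store.view("ids", ["prot_ids", "maj_prot_ids", "gene_names"])
# after a column has been dropped, the same cached name gives a new result
after = store.view("ids", ["prot_ids", "gene_names"])
print(before, after)
```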

### Multi-cache

This is an extension to `cache` that caches multiple selections at once:

In [21]:
g.multi_cache(hello="_s$", hello2=lambda x:x)

MetaPanda(translation(n=5216, p=16, mem=0.668MB))

In [22]:
g._select

{'ids': (object,),
 'hello': ('_s$',),
 'hello2': (<function __main__.<lambda>(x)>,)}

## Performing basic meta-analysis

Meta-analysis is condensed into a single `analyze` function, whose subfunctions can be chosen by name. For now, we just have the `agglomerate` function, which performs edit-distance analysis on the column names to determine how similar they might be.

In [23]:
g.analyze(functions=["agglomerate"])

MetaPanda(translation(n=5216, p=16, mem=0.668MB))

In [24]:
g.meta_.head()

|  | is_unique | potential_id | potential_stacker | is_norm | agglomerate |
|---|---|---|---|---|---|
| prot_ids | True | True | True | False | 0 |
| maj_prot_ids | True | True | True | False | 0 |
| prot_names | False | True | True | False | 0 |
| gene_names | False | True | True | False | 0 |
| translation_G1_1 | False | False | False | True | 1 |
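To give a feel for what an edit distance between column names measures, here is a plain-Python Levenshtein distance. Which metric and which grouping step `agglomerate` actually uses is not stated in this notebook, so both the choice of Levenshtein and the helper name `edit_distance` are assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character insertions,
    deletions or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


names = ["prot_ids", "maj_prot_ids", "translation_G1_1", "translation_G1_2"]
distances = {(x, y): edit_distance(x, y) for x in names for y in names if x < y}
for pair, dist in sorted(distances.items(), key=lambda item: item[1]):
    print(pair, dist)
```

Names that differ by only a few edits, such as replicate columns, end up close together, which is the kind of signal a grouping step could then use to assign labels.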


## Dropping columns through `del` or using the `drop` function

Using the powerful selection methods for columns above, we can also remove or drop columns we aren't interested in:

In [28]:
del g["translation"]

In [29]:
g.drop(object)

MetaPanda(translation(n=5216, p=0, mem=0.000MB))

In [30]:
g

MetaPanda(translation(n=5216, p=0, mem=0.000MB))

### But this leaves us with an interesting question...

Can I 'rollback' changes I made to a dataframe, or follow step-by-step what's actually happening to it?

This means we need to create something like a **task graph** as we go along and perform **meta-changes** to the DataFrame.