# Notebook 1: Basic Example

This notebook introduces a basic example of how **TurboPanda** works and how it can benefit you.

### Requirements:

- `numpy`
- `pandas`
- `scipy` (for `scipy.stats`)
- `matplotlib` (for `matplotlib.pyplot`)

In [8]:
import sys
import pandas as pd
sys.path.insert(0,"../")
# our main import
import turbopanda as turb

## Reading a `MetaPanda` object

The `__repr__` of a `MetaPanda` object summarises the dataset in terms of its dimensions and memory usage. Future versions will aim to encapsulate multiple `pandas.DataFrame` objects.

By default, if there is a **metadata** file also present, this will be read in as well.

A `MetaPanda` can be given a name; otherwise it adopts the name of the file.

In [12]:
g = turb.read("translation.csv")

By default, data types are automatically downcast to the smallest integer type where possible. Errors are ignored.
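As a rough illustration of what this kind of downcasting can look like in plain `pandas` (a sketch only, not TurboPanda's actual implementation; the helper name `downcast_integers` is hypothetical), `pandas.to_numeric` with `downcast="integer"` shrinks integer columns to the smallest integer type that holds their values and leaves other columns as they are:

```python
import pandas as pd


def downcast_integers(df):
    """Hypothetical helper: shrink numeric columns to the smallest integer dtype that fits."""
    out = df.copy()
    for col in out.columns:
        # Only touch numeric columns; text columns are left as they are.
        if pd.api.types.is_numeric_dtype(out[col]):
            # Columns that cannot be held as integers (e.g. 0.5) keep their dtype.
            out[col] = pd.to_numeric(out[col], downcast="integer")
    return out


example = pd.DataFrame({
    "count": [1, 2, 3],          # platform integer, usually int64
    "ratio": [0.5, 1.5, 2.5],    # float64, not integer-valued
})
small = downcast_integers(example)
print(small.dtypes)
```

Here `count` ends up as `int8`, while `ratio` stays `float64` because its values are not whole numbers.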

Here are the fields shown in the `__repr__` output:

1. **MetaPanda**: this tells you it's a MetaPanda object
2. *translation*: the name of the dataset
3. $n$, $p$ and *mem*: the number of samples, dimensions and memory usage in megabytes, respectively
4. *mode*: either 'instant' or 'delay'; we'll cover this later

In [14]:
g

MetaPanda(translation(n=5216, p=14, mem=0.585MB), mode='instant')

## Viewing the Dataset

**NOTE**: The names `colnames` and `counter` are reserved for the column index and the row index respectively, and `MetaPanda` maintains them.

We can access the pandas object using the `df_` attribute:

In [18]:
g.df_.head()

counter,prot_IDs,prot_names,Gene_names,translation_G1_1,translation_G1_2,translation_G1_3,translation_G2M_1,translation_G2M_2,translation_G2M_3,translation_MG1_1,translation_MG1_2,translation_MG1_3,translation_S_1,translation_S_2
0,Q96IC2;Q96IC2-2;H3BM72;H3BV93;H3BSC5,Putative RNA exonuclease NEF-sp,44M2.3,21.26058,20.47467,20.48905,21.01794,20.14569,22.29011,21.11775,20.71892,20.25788,20.58628,20.27662
1,H0YGH4;P01023;H0YGH6;F8W7L3,Alpha-2-macroglobulin,A2M,22.62015,22.26825,23.11786,24.94606,24.21645,25.26399,23.56139,23.46051,22.21951,22.87688,23.35703
2,A8K2U0;F5H2W3;H0YGG5;F5H2Z2;F5GXP1,Alpha-2-macroglobulin-like protein 1,A2ML1,,,,,,25.11629,,,,,
3,Q9NRG9;Q9NRG9-2;F8VZ44;H3BU82;F8VUB6,Aladin,AAAS,25.48382,24.42746,25.22645,24.44556,23.93706,25.30966,25.61462,25.45923,24.48253,24.31645,23.92143
4,Q86V21;Q86V21-2;E7EW25;F5H790;F8W8B5;Q86V21-3;...,Acetoacetyl-CoA synthetase,AACS,24.18177,24.51533,24.32766,24.15993,24.05001,24.95797,24.11656,24.22523,23.96446,23.8944,23.78107


### Some important modifications...

`MetaPanda` does **not** accept a `MultiIndex` for columns; multi-level column names are concatenated together. It also cleans your column names, removing spaces, tabs and similar characters, to make them easier to use in code.

Categorization is the step in which data columns are assigned their correct type. We spend some time determining whether a column should be a `category`, `bool`, `int` or `float` for maximum efficiency.

## Meta-information on the columns

This can be accessed with the `meta_` attribute:

In [17]:
g.meta_.head()

colnames,mtypes,is_unique,potential_id,potential_stacker
prot_IDs,object,True,True,True
prot_names,object,False,True,True
Gene_names,object,False,True,True
translation_G1_1,float64,False,False,False
translation_G1_2,float64,False,False,False


## Accessing column subsets using a vast variety of methods

Selecting subsets of columns in plain `pandas` can be cumbersome. `MetaPanda` accepts a regular expression **or** a type (such as `float`) as a selector, and returns the columns whose names match the pattern or whose data type matches.

**NOTE**: Indexing a `MetaPanda` with `[]` (its `__getitem__` method) **does not alter the underlying `DataFrame`**! The same object remains, so you can very quickly view subsets of the dataframe using a selection method of your choice.

For example:

In [22]:
g["translation_[MG12S]*_1"].head()

counter,translation_G1_1,translation_G2M_1,translation_MG1_1,translation_S_1
0,21.26058,21.01794,21.11775,20.58628
1,22.62015,24.94606,23.56139,22.87688
2,,,,
3,25.48382,24.44556,25.61462,24.31645
4,24.18177,24.15993,24.11656,23.8944


Or using type:

In [23]:
g[object].head()

counter,prot_IDs,prot_names,Gene_names
0,Q96IC2;Q96IC2-2;H3BM72;H3BV93;H3BSC5,Putative RNA exonuclease NEF-sp,44M2.3
1,H0YGH4;P01023;H0YGH6;F8W7L3,Alpha-2-macroglobulin,A2M
2,A8K2U0;F5H2W3;H0YGG5;F5H2Z2;F5GXP1,Alpha-2-macroglobulin-like protein 1,A2ML1
3,Q9NRG9;Q9NRG9-2;F8VZ44;H3BU82;F8VUB6,Aladin,AAAS
4,Q86V21;Q86V21-2;E7EW25;F5H790;F8W8B5;Q86V21-3;...,Acetoacetyl-CoA synthetase,AACS


Or using the `meta_` attribute columns as a selector:

In [24]:
g["is_unique"]

counter
0                    Q96IC2;Q96IC2-2;H3BM72;H3BV93;H3BSC5
1                             H0YGH4;P01023;H0YGH6;F8W7L3
2                      A8K2U0;F5H2W3;H0YGG5;F5H2Z2;F5GXP1
3                    Q9NRG9;Q9NRG9-2;F8VZ44;H3BU82;F8VUB6
4       Q86V21;Q86V21-2;E7EW25;F5H790;F8W8B5;Q86V21-3;...
                              ...                        
5211                                             Q15149-3
5212    Q15149-4;Q15149-2;Q15149-6;Q15149-5;Q15149-9;Q...
5213                                    Q9BXS6-5;Q9BXS6-3
5214                                             Q9BXS6-4
5215                                    Q9UHX1-6;Q9UHX1-5
Name: prot_IDs, Length: 5216, dtype: object

A selector key is resolved with the following priority:

1. If key belongs to a type, use the type
2. Elif key belongs to a column name in `meta_`, use this
3. Elif key not found in `df_` columns, use regex
4. Else, map to pandas to use a direct column name
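The priority order above can be sketched in plain `pandas`. This is a minimal sketch under assumptions, not TurboPanda's own code: the function name `resolve_selector` is hypothetical, and it assumes the `meta_` column used as a selector holds boolean flags.

```python
import re

import pandas as pd


def resolve_selector(df, meta, key):
    """Hypothetical sketch: return the column names of df matched by key."""
    # 1. A Python type selects columns by data type.
    if isinstance(key, type):
        return df.select_dtypes(include=[key]).columns
    # 2. The name of a (boolean) meta column keeps the columns flagged True.
    if key in meta.columns:
        return meta.index[meta[key].to_numpy(dtype=bool)]
    # 3. Anything that is not an exact column name is treated as a regex.
    if key not in df.columns:
        return pd.Index([c for c in df.columns if re.search(key, c)])
    # 4. Otherwise the key is a direct column name.
    return pd.Index([key])


df = pd.DataFrame({
    "prot_IDs": pd.Series(["Q96IC2", "H0YGH4"], dtype=object),
    "translation_G1_1": [21.26, 22.62],
    "translation_S_1": [20.59, 22.88],
})
meta = pd.DataFrame({"is_unique": [True, False, False]}, index=df.columns)

print(resolve_selector(df, meta, object))
print(resolve_selector(df, meta, "is_unique"))
print(resolve_selector(df, meta, "translation_[GS1]+_1"))
print(resolve_selector(df, meta, "translation_S_1"))
```

Checking the type first means a key such as `object` is never mistaken for a name, and checking exact column names last means a plain name still works even though it would also be a valid regex.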

## Viewing selections by `view` function

Whereas above we use a *selector* to get a subgroup of columns, what if we want to view those column names for ourselves before we do anything?

Here we *view* by a meta-data column:

In [25]:
g.view("is_unique")

Index(['prot_IDs'], dtype='object', name='colnames')

Or by a regular expression (regex):

In [29]:
g.view("translation_[G1SM2]+_1")

Index(['translation_G1_1', 'translation_G2M_1', 'translation_MG1_1',
       'translation_S_1'],
      dtype='object', name='colnames')

Or by a data type selection:

In [30]:
g.view(object)

Index(['prot_IDs', 'prot_names', 'Gene_names'], dtype='object', name='colnames')

Or by a custom function, which takes the whole dataframe and creates a boolean selection over its columns, for instance based on a threshold on the variance or the sample size.

This should allow us to use `pandas.DataFrame.aggregate` for better performance.

In [31]:
g.view(lambda x: x.count()==x.shape[0])

Index(['prot_IDs'], dtype='object', name='colnames')
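For comparison, the same kind of function-based selection can be written directly with `pandas.DataFrame.aggregate`. This is only a sketch on a small made-up frame (the values are illustrative, not taken from the dataset): the function is applied to each column, and the resulting booleans pick out the column names.

```python
import pandas as pd

df = pd.DataFrame({
    "prot_IDs": pd.Series(["Q1", "Q2", "Q3"], dtype=object),
    "translation_G1_1": [21.3, None, 25.5],   # one missing value
    "translation_S_1": [20.6, 22.9, None],    # one missing value
})

# aggregate calls the function once per column and returns one boolean per column.
complete = df.aggregate(lambda col: col.count() == col.shape[0])

# Keep only the names of the columns without any missing values.
selected = df.columns[complete.to_numpy(dtype=bool)]
print(selected)
```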

## Viewing *not* selected columns

We can find which columns *remain* using `view_not` with our selection. This can be very useful if we wish to isolate some specific group or trait.

In [32]:
g.view_not(object)

Index(['translation_G1_1', 'translation_G1_2', 'translation_G1_3',
       'translation_G2M_1', 'translation_G2M_2', 'translation_G2M_3',
       'translation_MG1_1', 'translation_MG1_2', 'translation_MG1_3',
       'translation_S_1', 'translation_S_2'],
      dtype='object', name='colnames')

## Creating multi-views

When multiple selection criteria are given, `turbopanda` by default keeps the **union** of the terms provided:

$$
S=\bigcup_i t_i
$$

This means that if you select for `object` and for "Intensity", you will get all of the column names that are of type `object` **OR** contain the string "Intensity".

This is in contrast to an **intersection** of terms, where you would get only the column names that are of type `object` **AND** contain the string "Intensity".

In [33]:
g.view(float, "_1", "G1")

Index(['translation_G1_1', 'translation_G1_2', 'translation_G1_3',
       'translation_G2M_1', 'translation_G2M_2', 'translation_G2M_3',
       'translation_MG1_1', 'translation_MG1_2', 'translation_MG1_3',
       'translation_S_1', 'translation_S_2'],
      dtype='object', name='colnames')

In [34]:
g.view(float), g.view("_1"), g.view("G1")

(Index(['translation_G1_1', 'translation_G1_2', 'translation_G1_3',
        'translation_G2M_1', 'translation_G2M_2', 'translation_G2M_3',
        'translation_MG1_1', 'translation_MG1_2', 'translation_MG1_3',
        'translation_S_1', 'translation_S_2'],
       dtype='object', name='colnames'),
 Index(['translation_G1_1', 'translation_G2M_1', 'translation_MG1_1',
        'translation_S_1'],
       dtype='object', name='colnames'),
 Index(['translation_G1_1', 'translation_G1_2', 'translation_G1_3',
        'translation_MG1_1', 'translation_MG1_2', 'translation_MG1_3'],
       dtype='object', name='colnames'))
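The difference between a union and an intersection of selections can be seen with plain `pandas.Index` objects. This is a small sketch with a hand-written list of column names, independent of `turbopanda`:

```python
import pandas as pd

cols = pd.Index(["prot_IDs", "translation_G1_1", "translation_G1_2", "translation_S_1"])

# Two separate selections: names ending in "_1" and names containing "G1".
by_suffix = pd.Index([c for c in cols if c.endswith("_1")])
by_phase = pd.Index([c for c in cols if "G1" in c])

# Union: a column is kept if it matches EITHER selection.
union = by_suffix.union(by_phase, sort=False)
# Intersection: a column is kept only if it matches BOTH selections.
intersection = by_suffix.intersection(by_phase, sort=False)

print(union)
print(intersection)
```

The union holds three names, whereas the intersection holds only `translation_G1_1`, the one name that satisfies both criteria.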

## Renaming columns using rules

Often we want to chain together several changes to our column names, either to make them shorter or to make the dataframe *pretty* in preparation for graphs.

A `MetaPanda` object can chain together a series of *string replacements* to proactively apply to the column names to aid this process.

In [35]:
g.rename([("Protein|protein","prot"),("Intensity","translation"),("Gene","gene"),
          ("IDs","ids")])

MetaPanda(translation(n=5216, p=14, mem=0.585MB), mode='instant')

In [36]:
g.df_.columns

Index(['prot_ids', 'prot_names', 'gene_names', 'translation_G1_1',
       'translation_G1_2', 'translation_G1_3', 'translation_G2M_1',
       'translation_G2M_2', 'translation_G2M_3', 'translation_MG1_1',
       'translation_MG1_2', 'translation_MG1_3', 'translation_S_1',
       'translation_S_2'],
      dtype='object', name='colnames')

The renaming process can also be restricted by passing a selector, which reduces the search space to the selected columns.
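To make the idea of chained string replacements concrete, here is a minimal sketch in plain Python and `pandas`. The helper name `rename_columns` and the raw column names are hypothetical; each rule is treated as a regular expression and applied in order, so later rules see the output of earlier ones.

```python
import re

import pandas as pd


def rename_columns(df, rules):
    """Hypothetical helper: apply (pattern, replacement) pairs to every column name, in order."""
    new_names = list(df.columns)
    for pattern, replacement in rules:
        new_names = [re.sub(pattern, replacement, name) for name in new_names]
    out = df.copy()
    out.columns = new_names
    return out


raw = pd.DataFrame(columns=["Protein_IDs", "Gene_names", "Intensity_G1_1"])
rules = [("Protein|protein", "prot"), ("Intensity", "translation"),
         ("Gene", "gene"), ("IDs", "ids")]
renamed = rename_columns(raw, rules)
print(list(renamed.columns))
```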

## Caching selections using `cache`

We may wish to save our 'selected columns' using the `cache` function, particularly if it is a complicated or long selection criterion.

This also allows us to reference this cached selection using a *meaningful name* further down the line.

**NOTE**: Selections are *not* pre-computed. The selector itself is cached and **executed at runtime**, so if different columns are present further down the line, a *different result* will emerge.

In [37]:
g.cache("ids", object)

MetaPanda(translation(n=5216, p=14, mem=0.585MB), mode='instant')

Our cached selections now sit in a private attribute called `_select`:

In [38]:
g._select

{'ids': (object,)}

They can now be retrieved using the name we gave when caching:

In [39]:
g.view("ids")

Index(['prot_ids', 'prot_names', 'gene_names'], dtype='object', name='colnames')
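The consequence of caching the selector rather than its result can be illustrated with a small stand-in class. This is a sketch under assumptions: `SelectorCache` is hypothetical and only handles types and regex strings, but it stores selectors in a `_select` dictionary and evaluates them at view time in the same spirit.

```python
import re

import pandas as pd


class SelectorCache:
    """Hypothetical stand-in: stores selectors by name and resolves them lazily."""

    def __init__(self):
        self._select = {}

    def cache(self, name, *selectors):
        # Only the selectors are stored, never the matching column names.
        self._select[name] = selectors

    def view(self, df, name):
        matched = []
        for sel in self._select[name]:
            if isinstance(sel, type):
                hits = df.select_dtypes(include=[sel]).columns
            else:
                hits = [c for c in df.columns if re.search(sel, c)]
            # Union of all selectors, keeping first-seen order.
            matched.extend(c for c in hits if c not in matched)
        return pd.Index(matched)


store = SelectorCache()
store.cache("numeric", float)

frame = pd.DataFrame({"a": [1.0], "b": [2.0], "c": pd.Series(["x"], dtype=object)})
before = store.view(frame, "numeric")
after = store.view(frame.drop(columns=["b"]), "numeric")
print(before, after)
```

The second call returns fewer columns because the selector is re-evaluated against whatever columns exist at that moment.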

### Multi-cache

This is an extension to `cache`, where multiple selections can be cached at once:

In [40]:
g.multi_cache(hello="_s$", hello2=lambda x:x)

MetaPanda(translation(n=5216, p=14, mem=0.585MB), mode='instant')

In [41]:
g._select

{'ids': (object,),
 'hello': ('_s$',),
 'hello2': (<function __main__.<lambda>(x)>,)}

## Performing basic meta-analysis

Meta-analysis is condensed into a single `analyze` function, in which sub-functions can be selected. For now, there is only the `agglomerate` function, which performs edit-distance analysis on the column names to determine how similar they are.

In [42]:
g.analyze(functions=["agglomerate"])

MetaPanda(translation(n=5216, p=14, mem=0.585MB), mode='instant')

In [43]:
g.meta_.head()

colnames,mtypes,is_unique,potential_id,potential_stacker,agglomerate
prot_ids,object,True,True,True,0
prot_names,object,False,True,True,0
gene_names,object,False,True,True,0
translation_G1_1,float64,False,False,False,1
translation_G1_2,float64,False,False,False,1
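The notebook does not show how `agglomerate` is implemented, so the following is only a stand-in to illustrate the idea of grouping column names by edit distance. It uses a plain Levenshtein distance and a greedy grouping with a hypothetical threshold `max_distance`; the labels it produces need not match the `agglomerate` column above.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed one row of the dynamic programme at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def group_by_similarity(names, max_distance=5):
    """Greedy grouping: a name joins the first group whose first member is close enough."""
    representatives = []
    labels = {}
    for name in names:
        for label, rep in enumerate(representatives):
            if edit_distance(name, rep) <= max_distance:
                labels[name] = label
                break
        else:
            # No existing group is close enough, so start a new one.
            labels[name] = len(representatives)
            representatives.append(name)
    return labels


names = ["prot_ids", "prot_names", "gene_names", "translation_G1_1", "translation_G1_2"]
labels = group_by_similarity(names)
print(labels)
```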


## Mapping meta-information to column groups

One of the easiest ways to do this is to **cache** the groups and then create a `meta_map` from the cached elements.

In [44]:
g.multi_cache(numerical_f="translation", identifs=("ids?$","_names$"))

MetaPanda(translation(n=5216, p=14, mem=0.585MB), mode='instant')

With `meta_map` we specify the name of the new meta column, and then give the selectors that identify each subgroup. In this case we reference the cached selections we are interested in by name, and those names become the labels in the new column.

In [45]:
g.meta_map("feature_types", ["numerical_f","identifs"])

MetaPanda(translation(n=5216, p=14, mem=0.585MB), mode='instant')

Note that duplicate column names **cannot** occur in different subgroups as we are trying to *uniquely* label each feature type.

In [46]:
g.meta_map("identifiers", ["identifs","identifs"])

ValueError: shared terms: Index(['gene_names', 'prot_ids', 'prot_names'], dtype='object', name='colnames') discovered for meta_map.

The `feature_types` column now appears in `g.meta_`:

In [49]:
g.meta_.head()

colnames,mtypes,is_unique,potential_id,potential_stacker,agglomerate,feature_types
prot_ids,object,True,True,True,0,identifs
prot_names,object,False,True,True,0,identifs
gene_names,object,False,True,True,0,identifs
translation_G1_1,float64,False,False,False,1,numerical_f
translation_G1_2,float64,False,False,False,1,numerical_f
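The labelling step, together with the uniqueness check, can be sketched in plain `pandas`. This is an illustrative sketch rather than the library's implementation: the function below is a hypothetical stand-alone version that takes already-resolved groups of column names instead of cached selector names.

```python
import pandas as pd


def meta_map(meta, name, groups):
    """Hypothetical sketch: label each column with the group it belongs to.

    groups maps a label to the list of column names in that group.
    """
    labels = pd.Series(index=meta.index, dtype=object)
    seen = set()
    for label, columns in groups.items():
        # A column may only carry one label, so overlapping groups are rejected.
        shared = seen.intersection(columns)
        if shared:
            raise ValueError(f"shared terms: {sorted(shared)} discovered for meta_map.")
        seen.update(columns)
        labels.loc[list(columns)] = label
    out = meta.copy()
    out[name] = labels
    return out


meta = pd.DataFrame(
    {"mtypes": ["object", "float64", "float64"]},
    index=["prot_ids", "translation_G1_1", "translation_G1_2"],
)
mapped = meta_map(meta, "feature_types", {
    "identifs": ["prot_ids"],
    "numerical_f": ["translation_G1_1", "translation_G1_2"],
})
print(mapped)
```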


## Applying transformations to selector data

With these selector groups, we can apply a function to the columns of this data using `g.transform`:

In [50]:
g.transform(lambda x:x**2, "numerical_f")

MetaPanda(translation(n=5216, p=14, mem=0.585MB), mode='instant')

Note that if the `selector` parameter is empty, it will attempt to transform *every column* in the dataset. `pandas.DataFrame.transform` is used, so aggregations are not permitted. 

In [51]:
g["numerical_f"].head()

counter,translation_G1_1,translation_G1_2,translation_G1_3,translation_G2M_1,translation_G2M_2,translation_G2M_3,translation_MG1_1,translation_MG1_2,translation_MG1_3,translation_S_1,translation_S_2
0,452.012262,419.212112,419.80117,441.753802,405.848826,496.849004,445.959365,429.273646,410.381702,423.794924,411.141319
1,511.671186,495.874958,534.435451,622.30591,586.436451,638.269191,555.139099,550.395529,493.706625,523.351639,545.55085
2,,,,,,630.828023,,,,,
3,649.425082,596.700802,636.37378,597.585404,572.982841,640.578889,656.108758,648.172392,599.394275,591.289741,572.234813
4,584.758,601.001405,591.835041,583.702218,578.402981,622.900267,581.608466,586.861769,574.295343,570.942351,565.53929
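For reference, the equivalent operation in plain `pandas` looks roughly like this. It is a sketch on a small made-up frame, with a simple name filter standing in for the cached selector; `pandas.DataFrame.transform` requires the function to return output of the same shape as its input.

```python
import pandas as pd

df = pd.DataFrame({
    "gene_names": pd.Series(["A2M", "AAAS"], dtype=object),
    "translation_G1_1": [2.0, 3.0],
    "translation_G1_2": [4.0, 5.0],
})

# Stand-in for a cached selector: every column whose name starts with "translation".
selected = [c for c in df.columns if c.startswith("translation")]

# Square the selected columns only; the remaining columns are left untouched.
squared = df.copy()
squared[selected] = df[selected].transform(lambda x: x ** 2)
print(squared)
```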


## Dropping columns through `del` or using the `drop` function

Using the powerful selection methods for columns above, we can also remove or drop columns we aren't interested in:

In [52]:
g.drop(object)

MetaPanda(translation(n=5216, p=11, mem=0.460MB), mode='instant')

## Writing files

We can write our `MetaPanda` object to file with or without the associated metadata:

In [53]:
# g.write("translation2.csv", with_meta=False)

### But this leaves us with an interesting question...

Can I 'rollback' changes I made to a dataframe, or follow step-by-step what's actually happening to it?

This means we need to create something like a **task graph** as we go along and perform **meta-changes** to the DataFrame.