# Notebook 1: Basic Example

This notebook introduces a basic example of how **TurboPanda** works and how it can benefit you.

### Requirements:

- `numpy`
- `pandas`
- `scipy.stats`
- `matplotlib.pyplot`
- `jupyter`

See `environment.yml` file for Python requirements.

In [9]:
import sys
import numpy as np
import pandas as pd
sys.path.insert(0,"../")
# our main import
import turbopanda as turb

### Version last run:

In [10]:
turb.__version__

'0.2.2'

# The bedrock of `turbopanda`: The `MetaPanda` object.

You can think of a `MetaPanda` as an object that sits on top of the raw dataset which is itself a `pandas.DataFrame` object, in addition to certain meta information associated to the columns.

<img src="../extras/readme.svg" width ="500" height=500> </img>

where `df_` is the raw dataset and `meta_` is a meta information accessor.
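To make the idea concrete, here is a minimal sketch in plain `pandas` of an object that holds a raw dataset alongside one row of meta-information per column. The class name `TinyMeta` and the way `is_unique_id` is computed are assumptions made for illustration; this is not how `MetaPanda` is implemented.

```python
import pandas as pd


class TinyMeta:
    """Hypothetical stand-in for illustration only, NOT turbopanda's MetaPanda."""

    def __init__(self, df: pd.DataFrame):
        # the raw dataset
        self.df_ = df
        # one row of meta-information per column of the raw dataset
        self.meta_ = pd.DataFrame(
            {
                "true_type": df.dtypes.astype(str),
                # assumption: a column is a unique id if no value repeats
                "is_unique_id": df.nunique() == len(df),
            }
        )


raw = pd.DataFrame({"a": [1, 2, 3], "b": ["Ha", "Ho", "Ha"]})
tiny = TinyMeta(raw)
print(tiny.meta_)
```

The point of the design is that `meta_` is indexed by the column names of `df_`, so anything known about a column can be looked up by name.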

## Creating a `MetaPanda` object

A `pandas.DataFrame` must be passed to the MetaPanda constructor. 

In [11]:
f1 = turb.MetaPanda(
    pd.DataFrame({
        "a": [1, 2, 3],
        "b": ['Ha', 'Ho', 'He'],
        "c": [True, False, True],
        "d": np.random.rand(3),
    })
)

### Printed output

The printed output shows the name of the `MetaPanda` (here `DataSet`, since none was given), along with `n` (the number of rows), `p` (the number of columns), the memory usage, and some additional boolean flags listed under `options`.

In [12]:
f1

MetaPanda(DataSet(n=3, p=4, mem=0.000MB, options=[]))

## Reading a `MetaPanda` object

The `__repr__` output summarises the dataset in terms of its dimensions and memory usage. Future versions aim to encapsulate multiple `pandas.DataFrame` objects.

By default, if a **metadata** file is also present, it will be read in as well.

A `MetaPanda` can be given a name; otherwise it adopts the name of the file.

In [13]:
g = turb.read("../data/translation.csv", name="trl")
g

MetaPanda(trl(n=5216, p=14, mem=1.169MB, options=[]))

By default, data types are automatically downcast to the smallest integer type where possible. Errors are ignored.
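A rough idea of what such downcasting looks like in plain `pandas` is sketched below. The helper name `downcast_ints` is hypothetical, and the sketch only assumes that the behaviour resembles `pandas.to_numeric` with `downcast="integer"`; it does not show turbopanda's own code.

```python
import pandas as pd


def downcast_ints(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: shrink integer columns, leave the rest alone."""
    out = frame.copy()
    for col in out.columns:
        try:
            out[col] = pd.to_numeric(out[col], downcast="integer")
        except (ValueError, TypeError):
            # non-numeric columns cannot be converted, so they are left untouched
            pass
    return out


df = pd.DataFrame({"small": [1, 2, 3], "label": ["x", "y", "z"]})
slim = downcast_ints(df)
print(slim.dtypes)
```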

Here are the elements shown in the `__repr__` output:

1. **MetaPanda**: this tells you it's a MetaPanda object
2. *trl*: the name of the dataset
3. $n$, $p$ and *mem*: the number of samples, dimensions and memory usage in megabytes, respectively
4. *options*: additional boolean flags, which we'll cover later

In [14]:
g

MetaPanda(trl(n=5216, p=14, mem=1.169MB, options=[]))

## Viewing the Dataset

**NOTE**: The names `colnames` and `counter` are reserved for the column and index references respectively, and `MetaPanda` maintains them for you.

The underlying pandas object is available through the `df_` attribute, and `head()` gives a quick preview:

In [15]:
g.head()

| counter | prot_IDs | prot_names | Gene_names | translation_G1_1 | translation_G1_2 | translation_G1_3 | translation_G2M_1 | translation_G2M_2 | translation_G2M_3 | translation_MG1_1 | translation_MG1_2 | translation_MG1_3 | translation_S_1 | translation_S_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Q96IC2;Q96IC2-2;H3BM72;H3BV93;H3BSC5 | Putative RNA exonuclease NEF-sp | 44M2.3 | 21.26058 | 20.47467 | 20.48905 | 21.01794 | 20.14569 | 22.29011 | 21.11775 | 20.71892 | 20.25788 | 20.58628 | 20.27662 |
| 1 | H0YGH4;P01023;H0YGH6;F8W7L3 | Alpha-2-macroglobulin | A2M | 22.62015 | 22.26825 | 23.11786 | 24.94606 | 24.21645 | 25.26399 | 23.56139 | 23.46051 | 22.21951 | 22.87688 | 23.35703 |
| 2 | A8K2U0;F5H2W3;H0YGG5;F5H2Z2;F5GXP1 | Alpha-2-macroglobulin-like protein 1 | A2ML1 | NaN | NaN | NaN | NaN | NaN | 25.11629 | NaN | NaN | NaN | NaN | NaN |
| 3 | Q9NRG9;Q9NRG9-2;F8VZ44;H3BU82;F8VUB6 | Aladin | AAAS | 25.48382 | 24.42746 | 25.22645 | 24.44556 | 23.93706 | 25.30966 | 25.61462 | 25.45923 | 24.48253 | 24.31645 | 23.92143 |
| 4 | Q86V21;Q86V21-2;E7EW25;F5H790;F8W8B5;Q86V21-3;... | Acetoacetyl-CoA synthetase | AACS | 24.18177 | 24.51533 | 24.32766 | 24.15993 | 24.05001 | 24.95797 | 24.11656 | 24.22523 | 23.96446 | 23.8944 | 23.78107 |


### Some important modifications...

`MetaPanda` does **not** accept a `MultiIndex` for columns; multi-level column labels are concatenated together. It also cleans up your column names, removing spaces, tabs and similar characters so that they are easier to use in code.
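The following sketch shows one way to flatten a `MultiIndex` and tidy up whitespace using plain `pandas`. The separator `"__"` and the replacement of whitespace with `"_"` are assumptions chosen for the example; the exact rules that `MetaPanda` applies may differ.

```python
import pandas as pd

cols = pd.MultiIndex.from_tuples([("Protein IDs", "raw"), ("Gene names", "raw")])
df = pd.DataFrame([[1, 2]], columns=cols)

# join the levels of each column label into one string (separator is an assumption)
flat = ["__".join(map(str, levels)) for levels in df.columns]

# strip surrounding whitespace and replace inner runs of whitespace with "_"
df.columns = pd.Index(flat).str.strip().str.replace(r"\s+", "_", regex=True)
print(list(df.columns))
```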

Categorization is the process of assigning each data column its correct type. We spend some time working out whether a column should be stored as a pandas `category`, `bool`, `int` or `float` for maximum efficiency.
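As a small illustration of why this matters, the sketch below compares the memory footprint of a repetitive string column before and after conversion to the pandas `category` dtype. The data are made up for the example, and the sketch says nothing about the heuristics `MetaPanda` uses to decide when to convert.

```python
import pandas as pd

# a repetitive string column with only three distinct values
phase = pd.Series(["G1", "S", "G1", "G2M"] * 250)
as_cat = phase.astype("category")

print(phase.memory_usage(deep=True), "bytes as strings")
print(as_cat.memory_usage(deep=True), "bytes as category")
```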

## Meta-information on the columns

This can be accessed with the `meta_` attribute:

In [20]:
g.meta_.head()

| | true_type | is_mixed_type | is_unique_id |
|---|---|---|---|
| prot_IDs | object | False | False |
| prot_names | object | True | False |
| Gene_names | object | True | False |
| translation_G1_1 | float64 | False | False |
| translation_G1_2 | float64 | False | False |


## MetaPanda properties

`MetaPanda` makes extensive use of `@property` attributes to give an interface to the object. All properties in TurboPanda end with an underscore (`_`). Note that some of these properties *can be modified*, if done so carefully, whilst others are only for viewing and not modifiable.

We have already covered the two most important properties:

* `df_` : accessing the raw DataFrame
* `meta_` : accessing meta-information of the dataset

In addition to this, we have quick-and-easy ways of assessing the size of the dataset, in `n_` (the number of rows, samples) and `p_` (the number of columns, dimensions) following machine-learning nomenclature:

In [16]:
g.n_

5216

In [17]:
g.p_

14
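For comparison, the same quantities can be read off a plain `pandas.DataFrame` as sketched below. The division by `1e6` to obtain megabytes is an assumption; the notebook does not say how `mem` is computed.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": ["Ha", "Ho", "He"],
    "c": [True, False, True],
    "d": [0.1, 0.2, 0.3],
})

# n is the number of rows (samples), p the number of columns (dimensions)
n, p = df.shape
# deep=True also counts the memory held by Python string objects
mem_mb = df.memory_usage(deep=True).sum() / 1e6
print(f"n={n}, p={p}, mem={mem_mb:.3f}MB")
```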

Other important properties (which we explore later) are the `selectors_` and `pipe_` attributes:

In [18]:
g.selectors_

{}

In [19]:
g.pipe_

{'current': []}

## Selectors

In traditional `pandas` it can be awkward to access subsets of a DataFrame. Here we allow the use of `regex` **and** typing (such as `float`) to specify subgroups of columns that match a pattern or have a given data type.

**NOTE**: Using the `__getitem__` method of `MetaPanda` **does not alter the underlying `DataFrame`**! The same super-object remains, allowing you to very quickly view DataFrame subsets using a selection method of your choice.

The **order of selection** is:

1. selection is `None`: return `[]`
2. selection is of type `pandas.Index`: return those columns
3. selection is an accepted `dtype`: return columns of that dtype
4. selection is callable (i.e. a function): return the columns for which the resulting boolean series is `True`
5. selection is of type `str`:
    1. selection is found as a `meta_` column and that column is of type `bool`
    2. selection is found in `selectors_`
    3. selection is not in the `df_` column names: use regular expressions (regex)
    4. otherwise the selector is a column name: return a single `Series`
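The order above can be sketched as a small dispatcher in plain `pandas`. The function `select_columns` and its arguments are hypothetical names used for illustration; this is a sketch of the decision order described above, not turbopanda's actual implementation, and it returns column names rather than data.

```python
import re
import pandas as pd


def select_columns(df, meta, selectors, sel):
    """Hypothetical dispatcher that mirrors the order of selection listed above."""
    if sel is None:
        return pd.Index([])
    if isinstance(sel, pd.Index):
        return df.columns[df.columns.isin(sel)]
    # types are themselves callable, so the dtype check must come before callable()
    if isinstance(sel, type):
        return df.select_dtypes(include=[sel]).columns
    if callable(sel):
        # the function receives one column and returns a single boolean
        return pd.Index([c for c in df.columns if bool(sel(df[c]))])
    if isinstance(sel, str):
        if sel in meta.columns and meta[sel].dtype == bool:
            return pd.Index([c for c in df.columns if meta.loc[c, sel]])
        if sel in selectors:
            return select_columns(df, meta, selectors, selectors[sel])
        if sel not in df.columns:
            return pd.Index([c for c in df.columns if re.search(sel, c)])
        return pd.Index([sel])
    raise TypeError(f"unsupported selector: {sel!r}")


df = pd.DataFrame({"id": ["a", "b"], "x_1": [1.0, 2.0], "x_2": [3.0, None]})
meta = pd.DataFrame({"is_id": [True, False, False]}, index=df.columns)
selectors = {"first": "_1$"}

by_type = list(select_columns(df, meta, selectors, float))
by_meta = list(select_columns(df, meta, selectors, "is_id"))
by_cache = list(select_columns(df, meta, selectors, "first"))
by_func = list(select_columns(df, meta, selectors, lambda s: s.notna().all()))
print(by_type, by_meta, by_cache, by_func)
```

Checking for a type before checking `callable` matters here, because Python types such as `float` are callable and would otherwise be treated as functions.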
    
### Using regular expressions

In [21]:
g["translation_[MG12S]*_1"].head()

| counter | translation_G1_1 | translation_G2M_1 | translation_MG1_1 | translation_S_1 |
|---|---|---|---|---|
| 0 | 21.26058 | 21.01794 | 21.11775 | 20.58628 |
| 1 | 22.62015 | 24.94606 | 23.56139 | 22.87688 |
| 2 | NaN | NaN | NaN | NaN |
| 3 | 25.48382 | 24.44556 | 25.61462 | 24.31645 |
| 4 | 24.18177 | 24.15993 | 24.11656 | 23.8944 |


### Using data types

Or using type:

In [27]:
g[object].head()

| counter | prot_IDs | prot_names | Gene_names |
|---|---|---|---|
| 0 | Q96IC2;Q96IC2-2;H3BM72;H3BV93;H3BSC5 | Putative RNA exonuclease NEF-sp | 44M2.3 |
| 1 | H0YGH4;P01023;H0YGH6;F8W7L3 | Alpha-2-macroglobulin | A2M |
| 2 | A8K2U0;F5H2W3;H0YGG5;F5H2Z2;F5GXP1 | Alpha-2-macroglobulin-like protein 1 | A2ML1 |
| 3 | Q9NRG9;Q9NRG9-2;F8VZ44;H3BU82;F8VUB6 | Aladin | AAAS |
| 4 | Q86V21;Q86V21-2;E7EW25;F5H790;F8W8B5;Q86V21-3;... | Acetoacetyl-CoA synthetase | AACS |


### Using meta columns

Or using the `meta_` attribute columns as a selector:

In [22]:
g["is_mixed_type"].head()

| counter | prot_names | Gene_names |
|---|---|---|
| 0 | Putative RNA exonuclease NEF-sp | 44M2.3 |
| 1 | Alpha-2-macroglobulin | A2M |
| 2 | Alpha-2-macroglobulin-like protein 1 | A2ML1 |
| 3 | Aladin | AAAS |
| 4 | Acetoacetyl-CoA synthetase | AACS |


## Viewing selections

Whereas above we use a *selector* to get a subgroup of columns, what if we want to view those column names for ourselves before we do anything?

Here we `view` by a meta-data column:

In [29]:
g.view("is_mixed_type")

Index(['prot_names', 'Gene_names'], dtype='object', name='colnames')

Or by using the direct column name (as pandas does):

In [30]:
g.view("prot_IDs")

Index(['prot_IDs'], dtype='object', name='colnames')

Or viewing by common regular expression, regex:

In [31]:
g.view("translation_[G1SM2]+_1")

Index(['translation_G1_1', 'translation_G2M_1', 'translation_MG1_1',
       'translation_S_1'],
      dtype='object', name='colnames')

Or by a data type selection:

In [32]:
g.view(object)

Index(['prot_IDs', 'prot_names', 'Gene_names'], dtype='object', name='colnames')

Or by a custom function, which takes the whole DataFrame and creates a boolean selection based on, for instance, a variance threshold or the sample size.

This should allow us to use `pandas.DataFrame.aggregate` for better performance.

In [33]:
g.view(lambda x: x.count()==x.shape[0])

Index(['prot_IDs'], dtype='object', name='colnames')
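The same kind of function-based selection can be reproduced in plain `pandas`, as sketched below with made-up data. Each function is applied column by column and yields one boolean per column; the threshold of `1.0` is arbitrary and chosen only for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "flat": [1.0, 1.0, 1.0],
    "wide": [0.0, 5.0, 10.0],
    "gappy": [1.0, None, 3.0],
})

# one boolean per column: is the sample variance above the threshold?
high_var = df.apply(lambda col: col.var() > 1.0)
# one boolean per column: does the column have no missing values?
complete = df.apply(lambda col: col.count() == col.shape[0])

high_var_cols = list(df.columns[high_var.to_numpy()])
complete_cols = list(df.columns[complete.to_numpy()])
print(high_var_cols, complete_cols)
```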

Note that `view` and `view_not` functions return the selected columns **in the order they appear in the DataFrame**. This is important if you wish to retain a particular sorting of the data.

## Viewing *not* selected columns

We can find which columns *remain* using `view_not` with our selection. This can be very useful if we wish to isolate some specific group or trait.

In [34]:
g.view_not(object)


Index(['translation_G1_1', 'translation_G1_2', 'translation_G1_3',
       'translation_G2M_1', 'translation_G2M_2', 'translation_G2M_3',
       'translation_MG1_1', 'translation_MG1_2', 'translation_MG1_3',
       'translation_S_1', 'translation_S_2'],
      dtype='object', name='colnames')

## Creating multi-views

When multiple selection criteria are given, `turbopanda` by default keeps the **union** of the terms provided:

$$
S=\bigcup_i t_i
$$

This means that if you select for `object` and for "Intensity", you will get all of the column names of type `object` **OR** containing the string "Intensity" within it.

This is in contrast to an **intersection** of terms, where you would get the column names that are of type `object` **AND** contain the string "Intensity".

In addition, *the order of the elements is maintained*, even across multiple selectors, such that any sorting/order is preserved in future operations.

In [35]:
g.view(float, "_1", "G1")

Index(['translation_G1_1', 'translation_G1_2', 'translation_G1_3',
       'translation_G2M_1', 'translation_G2M_2', 'translation_G2M_3',
       'translation_MG1_1', 'translation_MG1_2', 'translation_MG1_3',
       'translation_S_1', 'translation_S_2'],
      dtype='object', name='colnames')

In [36]:
g.view(float), g.view("_1"), g.view("G1")

(Index(['translation_G1_1', 'translation_G1_2', 'translation_G1_3',
        'translation_G2M_1', 'translation_G2M_2', 'translation_G2M_3',
        'translation_MG1_1', 'translation_MG1_2', 'translation_MG1_3',
        'translation_S_1', 'translation_S_2'],
       dtype='object', name='colnames'),
 Index(['translation_G1_1', 'translation_G2M_1', 'translation_MG1_1',
        'translation_S_1'],
       dtype='object', name='colnames'),
 Index(['translation_G1_1', 'translation_G1_2', 'translation_G1_3',
        'translation_MG1_1', 'translation_MG1_2', 'translation_MG1_3'],
       dtype='object', name='colnames'))

### Using `search` to get intersection of terms

In [24]:
# finds all terms of type float AND contain '_1' AND contain 'G1'
g.search(float, "_1", "G1")

Index(['translation_G1_1', 'translation_MG1_1'], dtype='object', name='colnames')
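The difference between the two behaviours can be sketched with the standard library alone. The column list below is a shortened, illustrative subset, and the code only mimics the union and intersection semantics described above while keeping the original column order; it is not the code behind `view` or `search`.

```python
import re

columns = ["prot_IDs", "translation_G1_1", "translation_G1_2", "translation_S_1"]
terms = ["_1", "G1"]

# the set of matching columns for each term
hits = [{c for c in columns if re.search(t, c)} for t in terms]

# filter the original list so that the DataFrame ordering is preserved
union = [c for c in columns if c in set.union(*hits)]
intersection = [c for c in columns if c in set.intersection(*hits)]

print(union)
print(intersection)
```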

## Renaming columns using rules

Often we want to chain together a bunch of changes to our naming of columns that either increase brevity, or make the dataframe *pretty* in preparation for graphs.

A `MetaPanda` object can chain together a series of *string replacements* to proactively apply to the column names to aid this process.

* Note that from version 0.2.2 onwards, columns are renamed using `rename_axis` instead of `rename`.


In [25]:
g.rename_axis([("Protein|protein","prot"),("Intensity","translation"),("Gene","gene"),
          ("IDs","ids")])

MetaPanda(trl(n=5216, p=14, mem=1.169MB, options=[]))

In [26]:
g.columns

Index(['prot_ids', 'prot_names', 'gene_names', 'translation_G1_1',
       'translation_G1_2', 'translation_G1_3', 'translation_G2M_1',
       'translation_G2M_2', 'translation_G2M_3', 'translation_MG1_1',
       'translation_MG1_2', 'translation_MG1_3', 'translation_S_1',
       'translation_S_2'],
      dtype='object', name='colnames')

The renaming process can also be narrowed down by using a selector to reduce the search space.

In [34]:
g.rename_axis([('prot_', 'prot')], selector=object)

MetaPanda(trl(n=5216, p=14, mem=1.169MB, options=[]))
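A chain of replacement rules like this can be sketched with the standard library's `re.sub`. The helper `apply_rules`, the starting column names and the `only` subset are hypothetical, and treating each rule as a regular expression is an assumption based on the `"Protein|protein"` pattern used above.

```python
import re

rules = [("Protein|protein", "prot"), ("Gene", "gene"), ("IDs", "ids")]
columns = ["Protein_IDs", "Gene_names", "Intensity_G1_1"]


def apply_rules(name, rules):
    """Apply each (pattern, replacement) pair in order to one column name."""
    for pattern, replacement in rules:
        name = re.sub(pattern, replacement, name)
    return name


renamed = [apply_rules(c, rules) for c in columns]

# restricting the rules to a subset of columns, as a selector would
only = {"Gene_names"}
partly = [apply_rules(c, rules) if c in only else c for c in columns]

print(renamed)
print(partly)
```

Because the rules are applied in order, an earlier replacement can change what a later pattern sees, so the order of the list matters.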

## Caching selections using `cache`

We may wish to save our 'selected columns' using the `cache` function, particularly if it is a complicated or long selection criterion.

This also allows us to reference this cached selection using a *meaningful name* further down the line.

**NOTE**: Selections are *not* pre-computed, the selection itself is cached and **executed at runtime**. This means that if you have different columns present further down the line, a *different result* will emerge.

In [35]:
g.cache("ids", object)

MetaPanda(trl(n=5216, p=14, mem=1.169MB, options=[S]))

Our cached selection is now stored in the `selectors_` property:

In [36]:
g.selectors_

{'ids': ['object']}

They can now be summoned by using `view`, `view_not`, or any of the other inspection functions that use *selectors*:

In [None]:
g.view("ids")
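The consequence of caching the selection rather than its result can be sketched with the standard library. The functions `cache` and `view` below are simplified, hypothetical stand-ins that only handle regex patterns; they illustrate the "executed at runtime" behaviour and are not turbopanda's methods.

```python
import re

selectors = {}  # name -> stored pattern, NOT the matched columns


def cache(name, pattern):
    """Store the selection criterion under a meaningful name."""
    selectors[name] = pattern


def view(name, columns):
    """Evaluate the stored criterion against the columns present right now."""
    return [c for c in columns if re.search(selectors[name], c)]


cache("ids", "_ids?$")
cols = ["prot_ids", "gene_names"]
before = view("ids", cols)

# a new column appears later on; the cached selection picks it up
cols.append("sample_id")
after = view("ids", cols)

print(before, after)
```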

### Multi-cache

This is an extension to `cache`, where multiple things can be cached at once:

In [41]:
import numpy as np
g.cache_k(hello="_s$", hello2=np.square)

MetaPanda(trl(n=5216, p=14, mem=1.169MB, options=[S]))

In [42]:
g.selectors_

{'ids': ['object'], 'hello': ['_s$'], 'hello2': [<ufunc 'square'>]}

## Mapping meta-information to column groups

One of the easiest ways to do this is to **cache** the groups and then create a `meta_map` from the cached elements.

In [43]:
g.cache_k(numerical_f="translation", identifs=("ids?$","_names$"))

MetaPanda(trl(n=5216, p=14, mem=1.169MB, options=[S]))

With `meta_map` we specify the name of the new meta column and then give the selectors that identify each subgroup. In this case we reference the cached selectors we are interested in, and the name each was cached under becomes the label of its subgroup.

In [44]:
g.meta_map("feature_types", ["numerical_f","identifs"])



MetaPanda(trl(n=5216, p=14, mem=1.169MB, options=[SM]))

Note that duplicate column names **cannot** occur in different subgroups as we are trying to *uniquely* label each feature type.

In [45]:
import pytest

with pytest.raises(ValueError):
    g.meta_map("identifiers", ["identifs","identifs"])
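The uniqueness requirement can be sketched with plain dictionaries. The function `build_meta_map` and the group contents are hypothetical; the sketch only illustrates why a column that falls into two subgroups has to be rejected, and it does not reproduce turbopanda's internals.

```python
def build_meta_map(groups):
    """Map each column to the label of its group, rejecting overlaps.

    `groups` maps a label to the list of columns selected for that label.
    """
    labels = {}
    for label, cols in groups.items():
        for col in cols:
            if col in labels:
                raise ValueError(
                    f"{col!r} is in both {labels[col]!r} and {label!r}"
                )
            labels[col] = label
    return labels


labels = build_meta_map({
    "numerical_f": ["translation_G1_1", "translation_G1_2"],
    "identifs": ["prot_ids", "gene_names"],
})
print(labels)
```

Each column ends up with exactly one label, which is what allows the result to be stored as a single column of `meta_`.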

These columns now appear in `meta_`:

In [46]:
g.meta_.head()

| | true_type | is_mixed_type | is_unique_id |
|---|---|---|---|
| protids | object | False | False |
| protnames | object | True | False |
| gene_names | object | True | False |
| translation_G1_1 | float64 | False | False |
| translation_G1_2 | float64 | False | False |


## Applying transformations to selector data

With these selector groups, we can apply a function to the columns of this data using `g.transform`.

Transformations happen in place and thus will change the underlying DataFrame:

In [47]:
g.transform(lambda x:x**2, "numerical_f")

MetaPanda(trl(n=5216, p=14, mem=1.169MB, options=[SM]))

Note that if the `selector` parameter is empty, it will attempt to transform *every column* in the dataset. `pandas.DataFrame.transform` is used, so aggregations are not permitted. 

In [48]:
g.view("numerical_f")

Index(['translation_G1_1', 'translation_G1_2', 'translation_G1_3',
       'translation_G2M_1', 'translation_G2M_2', 'translation_G2M_3',
       'translation_MG1_1', 'translation_MG1_2', 'translation_MG1_3',
       'translation_S_1', 'translation_S_2'],
      dtype='object', name='colnames')
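In plain `pandas`, transforming only a selected group of columns can be sketched as follows. The data and the `x_` prefix are made up for the example; the sketch uses `pandas.DataFrame.transform`, which the text above names as the underlying call.

```python
import pandas as pd

df = pd.DataFrame({"id": ["a", "b"], "x_1": [2.0, 3.0], "x_2": [4.0, None]})

# the selected group of columns (here chosen by a simple prefix)
cols = [c for c in df.columns if c.startswith("x_")]

# square the selected columns and write the result back into the DataFrame
df[cols] = df[cols].transform(lambda x: x ** 2)
print(df)
```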

## Dropping columns through `del` or using the `drop` function

Using the powerful selection methods for columns above, we can also remove or drop columns we aren't interested in:

In [49]:
g.drop(object)

MetaPanda(trl(n=5216, p=11, mem=0.918MB, options=[SM]))

We could also select columns that we want to keep using the `keep` method.
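The plain `pandas` equivalent of dropping or keeping a group of columns by data type is sketched below with made-up data. It illustrates the idea only and is not how `drop` and `keep` are implemented.

```python
import pandas as pd

df = pd.DataFrame({"id": ["a", "b"], "x_1": [1.0, 2.0], "x_2": [3.0, 4.0]})

# select the group of columns by data type
float_cols = df.select_dtypes(include="float64").columns

dropped = df.drop(columns=float_cols)  # everything except the selection
kept = df[float_cols]                  # only the selection

print(list(dropped.columns), list(kept.columns))
```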

## Writing files

We can write our `MetaPanda` object to file with or without the associated metadata.

Note that from version 0.1.6, the default save type is `JSON`, as this allows us to store the metainformation *with* the raw dataset, plus any selectors and pipes.

At the current patch, we can handle `csv`, `xls` and `json` files, with plans to extend to `hdf` formats also.

In [54]:
# g.write("translation2.json")
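To show why JSON suits this purpose, the sketch below bundles a dataset, its meta-information and its selectors into a single JSON string and reads them back. The layout with the keys `"data"`, `"meta"` and `"selectors"` is invented for this example and should not be taken as the file format that `write` produces.

```python
import json
import pandas as pd

df = pd.DataFrame({"id": ["a", "b"], "x": [1.5, 2.5]})
meta = pd.DataFrame({"is_id": [True, False]}, index=df.columns)
selectors = {"ids": ["object"]}

# hypothetical layout: one JSON document holding all three pieces
payload = json.dumps({
    "data": json.loads(df.to_json(orient="split")),
    "meta": json.loads(meta.to_json(orient="index")),
    "selectors": selectors,
})

# reading it back restores each piece from the same document
restored = json.loads(payload)
df2 = pd.DataFrame(**restored["data"])
meta2 = pd.DataFrame.from_dict(restored["meta"], orient="index")
print(df2)
print(meta2)
```

A flat `csv` file has no natural place for the `meta` and `selectors` entries, which is the advantage the text above refers to.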

### But this leaves us with an interesting question...

Can I 'rollback' changes I made to a dataframe, or follow step-by-step what's actually happening to it?

This means we need to create something like a **task graph** as we go along and perform **meta-changes** to the DataFrame.