# Basics of the `MetaPanda` object

Here we introduce a basic example of how **TurboPanda** works and how it can benefit you.

### Requirements:

- `numpy`
- `pandas`
- `scipy.stats`
- `matplotlib.pyplot`
- `jupyter`

See the `environment.yml` file for Python requirements.

In [84]:
import sys
import numpy as np
import pandas as pd
sys.path.insert(0,"../")
# our main import
import turbopanda as turb

### Version last run:

In [1]:
print("turbopanda: %s" % turb.__version__)


# The bedrock of `turbopanda`: The MetaPanda object.

You can think of a `MetaPanda` as an object that wraps the raw dataset,
 which is itself a `pandas.DataFrame`, together with meta-information
  associated with its columns.

<img src="../extras/readme.svg" width ="500" height=500> </img>

where `df_` is the raw dataset and `meta_` is a meta information accessor.

## Creating a `MetaPanda` object

A `pandas.DataFrame` must be passed to the MetaPanda constructor. 

In [86]:
f1 = turb.MetaPanda(
    pd.DataFrame({
        "a": [1, 2, 3],
        "b": ['Ha', 'Ho', 'He'],
        "c": [True, False, True],
        "d": np.random.rand(3),
    })
)

### Printed output

We see the `name` of the MetaPanda, along with `n`: the number of rows, and `p`: the number of columns, memory usage, and some additional boolean flags denoted as `options`.

In [87]:
f1

MetaPanda(DataSet(n=3, p=4, mem=0.000MB, options=[]))

## Reading a `MetaPanda` object

Additionally, the `__repr__` output summarises the dataset in terms of its dimensions and memory usage. Future versions will aim to encapsulate multiple `pandas.DataFrame` objects.

By default, if there is a **metadata** file also present, this will be read in.

A `MetaPanda` can be given a name; otherwise it adopts the name of the file.

In [88]:
g = turb.read("../data/SDF.json", name="trl")
g

MetaPanda(trl(n=5216, p=11, mem=0.918MB, options=[]))

By default, data types are automatically downcast to the smallest suitable integer type where possible. Errors are ignored.
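As a rough illustration of this kind of type reduction, the sketch below uses plain pandas (`pd.to_numeric` with `downcast="integer"`). The helper name `downcast_integers` is hypothetical, and this is not necessarily how TurboPanda does it internally.

```python
import pandas as pd


def downcast_integers(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which integer columns use the smallest integer dtype that fits."""
    out = df.copy()
    for col in out.columns:
        # Only integer columns are touched; floats, bools and strings are left alone.
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
    return out


raw = pd.DataFrame({
    "small": [1, 2, 3],
    "large": [1, 2, 100_000],
    "ratio": [0.1, 0.2, 0.3],
})
tuned = downcast_integers(raw)
print(tuned.dtypes)
```

Columns whose values fit in a narrower type shrink, while the float column keeps its original dtype.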

Here are the fields shown in the `__repr__` output:

1. **MetaPanda**: this tells you it's a MetaPanda object
2. *trl*: The name of the dataset
3. $n$, $p$ and *mem*: the number of samples, dimensions and memory usage in megabytes, respectively
4. *options*: Additional information about variables stored internally
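For intuition, the same three quantities can be computed directly from a `pandas.DataFrame`. This is only a sketch: the variable names are ours, and the choice of dividing bytes by `1e6` (rather than `1024 ** 2`) is an assumption about how the megabyte figure is derived.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": ["Ha", "Ho", "He"],
    "c": [True, False, True],
    "d": np.random.rand(3),
})

n, p = df.shape                                    # rows (samples), columns (dimensions)
mem_mb = df.memory_usage(deep=False).sum() / 1e6   # bytes -> megabytes (assumed convention)
summary = f"n={n}, p={p}, mem={mem_mb:.3f}MB"
print(summary)
```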

## Viewing the Dataset

**NOTE**: The column names `colnames` and `counter` are reserved for the column/index reference and this is maintained in `MetaPanda`.

We can access the pandas object using the `df_` attribute, and preview the first rows with `head`:

In [90]:
g.head(2)

| counter | prot_IDs | prot_names | Gene_names | translation_G1_1 | translation_G1_2 | translation_G2M_1 | translation_G2M_2 | translation_MG1_1 | translation_MG1_2 | translation_S_1 | translation_S_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Q96IC2;Q96IC2-2;H3BM72;H3BV93;H3BSC5 | Putative RNA exonuclease NEF-sp | 44M2.3 | 21.26058 | 20.47467 | 21.01794 | 20.14569 | 21.11775 | 20.71892 | 20.58628 | 20.27662 |
| 1 | H0YGH4;P01023;H0YGH6;F8W7L3 | Alpha-2-macroglobulin | A2M | 22.62015 | 22.26825 | 24.94606 | 24.21645 | 23.56139 | 23.46051 | 22.87688 | 23.35703 |


### Some important modifications...

`MetaPanda` does **not** accept a MultiIndex for columns. This is primarily because many complex pandas operations
do not work properly on multi-indexed datasets, and keeping track of all these states would make the project
unviable. It also cleans your column names, removing spaces, tabs and similar characters,
so that they are easier to use in code.
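A minimal pandas sketch of this sort of column-name cleaning is shown below. Replacing inner whitespace with an underscore is our assumption for illustration; the exact rules TurboPanda applies are not spelled out here.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=[" prot IDs", "Gene\tnames ", "value"])

# Trim both ends, then collapse any run of whitespace into a single underscore.
df.columns = df.columns.str.strip().str.replace(r"\s+", "_", regex=True)
print(list(df.columns))
```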

Categorization is the process of assigning each data column its correct type.
We spend some time determining whether a column should be a `pd.Categorical`,
 `bool`, `int` or `float` for maximum efficiency.
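To give a flavour of what such a decision might look like, here is a small pandas sketch that converts string columns with few distinct values to the `category` dtype. The function `categorize` and its `max_unique_ratio` threshold are hypothetical and only illustrate the idea.

```python
import pandas as pd


def categorize(df: pd.DataFrame, max_unique_ratio: float = 0.5) -> pd.DataFrame:
    """Return a copy where low-cardinality string columns become categorical."""
    out = df.copy()
    for col in out.columns:
        s = out[col]
        if len(s) == 0 or not pd.api.types.is_string_dtype(s):
            continue
        # Few distinct values relative to the number of rows: a category is cheaper.
        if s.nunique() / len(s) <= max_unique_ratio:
            out[col] = s.astype("category")
    return out


raw = pd.DataFrame({
    "phase": ["G1", "S", "G1", "S"],
    "gene": ["A2M", "NAT1", "NAT2", "AACS"],
    "value": [1.0, 2.0, 3.0, 4.0],
})
typed = categorize(raw)
print(typed.dtypes)
```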

## Meta-information on the columns

This can be accessed with the `meta_` attribute:

In [91]:
g.meta_.head()

| | true_type | is_mixed_type | is_unique_id |
|---|---|---|---|
| prot_IDs | object | False | False |
| prot_names | object | True | False |
| Gene_names | object | True | False |
| translation_G1_1 | float64 | False | False |
| translation_G1_2 | float64 | False | False |
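A per-column summary of this shape can be approximated in plain pandas. The sketch below reuses the column names from the table, but the definitions of `is_mixed_type` and `is_unique_id` are our guesses for illustration, not TurboPanda's actual rules.

```python
import pandas as pd


def column_meta(df: pd.DataFrame) -> pd.DataFrame:
    """Build one row of meta-information per column of df."""
    rows = {}
    for col in df.columns:
        s = df[col]
        # Names of the Python types found among the non-missing values.
        value_types = {type(v).__name__ for v in s.dropna()}
        rows[col] = {
            "true_type": str(s.dtype),
            "is_mixed_type": len(value_types) > 1,
            "is_unique_id": bool(s.is_unique),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


df = pd.DataFrame({
    "ids": ["p1", "p2", "p3"],
    "names": ["alpha", 2, "gamma"],
    "score": [0.5, 0.5, 1.5],
})
meta = column_meta(df)
print(meta)
```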


## MetaPanda properties

`MetaPanda` makes extensive use of `@property` attributes to give an interface
 to the object. Nearly all properties in TurboPanda end with an underscore
 (`_`). Note that some of these properties *can be modified*, with care,
  whilst others are read-only.

We have already covered the two most important properties:

* `df_` : accessing the raw DataFrame
* `meta_` : accessing meta-information of the dataset

In addition to this, we have quick-and-easy ways of assessing the size of
 the dataset, in `n` (the number of rows, samples) and `p` (the number of columns,
 dimensions) following machine-learning nomenclature:

In [92]:
g.n

5216

In [None]:
g.p

Other important properties (which we explore later) are the `selectors_` and `pipe_` attributes:

NOTE: `pipe_` is deprecated and will be removed in v0.3.

In [94]:
g.selectors_

{}

## Renaming columns using rules

Often we want to chain together several changes to our column names that either improve brevity or make the DataFrame *pretty* in preparation for graphs.

A `MetaPanda` object can chain together a series of *string replacements* to proactively apply to the column names to aid this process.

* Note that from version 0.2.2 onwards, columns are renamed using `rename_axis` instead of `rename`.


In [96]:
g.rename_axis([("Protein|protein","prot"),("Intensity","translation"),("Gene","gene"),
          ("IDs","ids")])

MetaPanda(trl(n=5216, p=11, mem=0.918MB, options=[]))

In [97]:
g.columns

Index(['prot_ids', 'prot_names', 'gene_names', 'translation_G1_1',
       'translation_G1_2', 'translation_G2M_1', 'translation_G2M_2',
       'translation_MG1_1', 'translation_MG1_2', 'translation_S_1',
       'translation_S_2'],
      dtype='object', name='colnames')
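The effect of chaining such rules can be reproduced with the standard `re` module. This sketch assumes each rule is a `(pattern, replacement)` pair treated as a regular expression and applied in order, which the `Protein|protein` rule suggests; `apply_rules` is a hypothetical helper, not part of TurboPanda.

```python
import re


def apply_rules(names, rules):
    """Apply every (pattern, replacement) rule, in order, to every name."""
    renamed = []
    for name in names:
        for pattern, replacement in rules:
            name = re.sub(pattern, replacement, name)
        renamed.append(name)
    return renamed


rules = [
    ("Protein|protein", "prot"),
    ("Intensity", "translation"),
    ("Gene", "gene"),
    ("IDs", "ids"),
]
columns = ["prot_IDs", "prot_names", "Gene_names", "translation_G1_1"]
renamed = apply_rules(columns, rules)
print(renamed)
```

Rules that match nothing simply leave a name unchanged, so a single rule list can be reused across datasets with slightly different headers.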

The renaming process can be narrowed further by using a selector to reduce the search space.

In [98]:
g.rename_axis([('prot_', 'prot')], selector=object)

MetaPanda(trl(n=5216, p=11, mem=0.918MB, options=[]))

## Caching selections using `cache`

We may wish to save our 'selected columns' using the `cache` function, particularly if it is a complicated or long selection criterion.

This also allows us to reference this cached selection using a *meaningful name* further down the line.

**NOTE**: Selections are *not* pre-computed; the selection itself is cached and **executed at runtime**. This means that if different columns are present further down the line, a *different result* will emerge.

In [99]:
g.cache("ids", object)

MetaPanda(trl(n=5216, p=11, mem=0.918MB, options=[S]))

Our cached selection is now stored in `selectors_`:

In [100]:
g.selectors_

{'ids': ['object']}

They can now be summoned by using `view`, `view_not`, or any
 of the other inspection functions that use *selectors*:

In [101]:
g.view("ids")

Index(['protids', 'protnames', 'gene_names'], dtype='object', name='colnames')
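The note above about selections being executed at runtime can be illustrated with a toy class. `SelectorCache` is a hypothetical stand-in that only handles regex selectors; it is not TurboPanda's implementation.

```python
import re


class SelectorCache:
    """Hypothetical stand-in: stores selectors by name and resolves them on demand."""

    def __init__(self):
        self.selectors = {}

    def cache(self, name, pattern):
        # Only the pattern is stored; no column names are computed here.
        self.selectors[name] = pattern

    def view(self, name, columns):
        # The pattern is matched against whatever columns exist right now.
        pattern = self.selectors[name]
        return [col for col in columns if re.search(pattern, col)]


store = SelectorCache()
store.cache("ids", r"ids?$")

before = store.view("ids", ["prot_ids", "gene_names", "value"])
after = store.view("ids", ["prot_ids", "gene_names", "value", "sample_id"])
print(before, after)
```

The same cached name yields a different result once a new matching column appears, which is the behaviour to keep in mind when reusing selectors late in an analysis.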

### Multi-cache

This is an extension to `cache`, where multiple things can be cached at once:

In [102]:
import numpy as np
g.cache_k(hello="_s$", hello2=np.square)

MetaPanda(trl(n=5216, p=11, mem=0.918MB, options=[S]))

In [103]:
g.selectors_

{'ids': ['object'], 'hello': ['_s$'], 'hello2': [<ufunc 'square'>]}

## Mapping meta-information to column groups

One of the easiest ways to do this is to **cache** the groups and then create a `meta_map`
from the cached elements.

In [104]:
g.cache_k(numerical_f="translation", identifs=("ids?$","_names$"))

MetaPanda(trl(n=5216, p=11, mem=0.918MB, options=[S]))

With `meta_map` we specify the name of the meta column, and then give selectors to identify each subgroup. In this case we reference the names of the cached elements we are interested in, and each subgroup is labelled with the name under which it was cached.

In [105]:
g.meta_map("feature_types", ["numerical_f","identifs"])

MetaPanda(trl(n=5216, p=11, mem=0.918MB, options=[SM]))

Note that duplicate column names **cannot** occur in different subgroups as we are trying to *uniquely* label each feature type.

In [106]:
import pytest

with pytest.raises(ValueError):
    g.meta_map("identifiers", ["identifs","identifs"])

These columns now appear in `meta_`:

In [107]:
g.meta_.head()

| | true_type | is_mixed_type | is_unique_id | feature_types |
|---|---|---|---|---|
| protids | object | False | False | identifs |
| protnames | object | True | False | |
| gene_names | object | True | False | identifs |
| translation_G1_1 | float64 | False | False | numerical_f |
| translation_G1_2 | float64 | False | False | numerical_f |
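The labelling logic, including the uniqueness check, can be sketched in a few lines of standard Python. `map_groups` is a hypothetical helper that treats every selector as a regex; it mirrors the behaviour described above rather than TurboPanda's own code.

```python
import re


def map_groups(columns, groups):
    """Label each column with the one group whose patterns match it."""
    labels = {}
    for label, patterns in groups.items():
        for col in columns:
            if any(re.search(p, col) for p in patterns):
                if col in labels:
                    # A column may belong to at most one subgroup.
                    raise ValueError(
                        f"column {col!r} matches both {labels[col]!r} and {label!r}"
                    )
                labels[col] = label
    return labels


columns = ["protids", "protnames", "gene_names", "translation_G1_1"]
groups = {"numerical_f": ["translation"], "identifs": ["ids?$", "_names$"]}
labels = map_groups(columns, groups)
print(labels)
```

Columns that match no group, such as `protnames` here, are simply left without a label.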


## Applying transformations to selector data

With these selector groups, we can apply a function to the columns of this data using `g.transform`.

Transformations happen in place and thus change the underlying DataFrame:

In [108]:
g.transform(lambda x:x**2, "numerical_f")

MetaPanda(trl(n=5216, p=11, mem=0.918MB, options=[SM]))

Note that if the `selector` parameter is empty, it will attempt to transform *every column* in the dataset. `pandas.DataFrame.transform` is used, so aggregations are not permitted. 

In [109]:
g.view("numerical_f")

Index(['translation_G1_1', 'translation_G1_2', 'translation_G2M_1',
       'translation_G2M_2', 'translation_MG1_1', 'translation_MG1_2',
       'translation_S_1', 'translation_S_2'],
      dtype='object', name='colnames')
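The restriction on aggregations comes from `pandas.DataFrame.transform` itself, as the plain-pandas sketch below shows. The column names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0],
    "y": [4.0, 5.0, 6.0],
    "label": ["a", "b", "c"],
})
numeric_cols = ["x", "y"]

# Element-wise functions keep the shape, so the result can be written straight back.
df[numeric_cols] = df[numeric_cols].transform(lambda col: col ** 2)

# A reduction collapses each column to one value, which transform rejects.
try:
    df[numeric_cols].transform(lambda col: col.sum())
    rejected = False
except ValueError:
    rejected = True

print(df)
print("aggregation rejected:", rejected)
```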

## Dropping columns through `del` or using the `drop` function

Using the powerful selection methods for columns above, we can also remove or drop columns we aren't interested in:

In [110]:
g.drop(object)

MetaPanda(trl(n=5216, p=8, mem=0.668MB, options=[SM]))

We could also select columns that we want to keep using the `keep` method.

## Writing files

We can write our `MetaPanda` object to file with or without the associated metadata.

Note that from version 0.1.6, the default save type is `JSON`, as
 this allows us to store the meta-information *with* the raw dataset, plus any
  selectors and pipes.

Currently we can handle `csv`, `xls` and `json` files, with plans
to extend to `hdf` formats also.

In [111]:
# g.write("translation2.json")
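To see why JSON suits this purpose, the sketch below bundles a dataset, its meta-information and its selectors into one JSON document and reads them back. The layout (`data`, `meta`, `selectors` keys) is purely hypothetical and is not the file format that `write` produces.

```python
import json

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["Ha", "Ho", "He"]})
meta = pd.DataFrame({"true_type": ["int64", "object"]}, index=["a", "b"])
selectors = {"ids": ["object"]}

# One document holds the raw data alongside everything that describes it.
payload = {
    "data": df.to_dict(orient="list"),
    "meta": meta.to_dict(orient="index"),
    "selectors": selectors,
}
text = json.dumps(payload)

# Reading it back restores all three pieces from the single string.
restored = json.loads(text)
df_back = pd.DataFrame(restored["data"])
meta_back = pd.DataFrame.from_dict(restored["meta"], orient="index")
print(df_back)
print(meta_back)
```

A flat format such as CSV has no natural place for the `meta` and `selectors` parts, which would need separate files.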

### But this leaves us with an interesting question...

Can I 'rollback' changes I made to a dataframe, or follow step-by-step what's actually happening to it?

This means we need to create something like a **task graph** as we go along and perform **meta-changes** to the DataFrame.