# Selecting Columns with `MetaPanda`

In [4]:
import sys
import numpy as np
import pandas as pd
sys.path.insert(0,"../")
# our main import
import turbopanda as turb

print(f"turbopanda version: {turb.__version__}")

turbopanda version: 0.2.9


## Importing some data

In [3]:
g = turb.read("../data/rna.csv", name="mRNA")
g

MetaPanda(mRNA(n=100, p=117, mem=0.187MB, options=[]))

## Selectors

Unlike plain `pandas`, where accessing subsets of a DataFrame's columns can be cumbersome, `MetaPanda` accepts `regex` patterns **and** data types (such as `float`) as selectors, returning the columns whose names match the pattern or whose values have that data type.

**NOTE**: Selecting with the `__getitem__` method of `MetaPanda` **does not alter the underlying `DataFrame`**! The same super-object remains, so you can very quickly view DataFrame subsets using a selection method of your choice.

The **order of selection** is:

1. selection is `None`: return `[]`
2. selection is of type `pandas.Index`: return those columns
3. selection is an accepted `dtype`: return columns of that dtype
4. selection is callable (i.e. a function): return the columns for which the resulting boolean series is `True`
5. selection is of type `str`:
    1. selection is found as `meta_` column and column is of type `bool`
    2. selection is found in `selectors_`
    3. not in `df` column names: use regular expressions (regex)
    4. otherwise selector is column name: return single `Series`
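The dispatch order above can be approximated in plain `pandas`. This is only a rough sketch, not `turbopanda`'s implementation: `resolve_selector` is a hypothetical helper, it skips the `meta_` and `selectors_` lookups (which depend on `MetaPanda` state), and it always returns column names rather than a `Series`.

```python
import pandas as pd


def resolve_selector(df, selector):
    """Hypothetical helper: map one selector to an Index of column names."""
    if selector is None:
        return pd.Index([])
    if isinstance(selector, pd.Index):
        # keep only the names that exist, in the frame's column order
        return df.columns[df.columns.isin(selector)]
    if isinstance(selector, type):
        # a Python type such as float; checked before callable(),
        # because types are themselves callable
        return df.select_dtypes(include=[selector]).columns
    if callable(selector):
        # the function is applied per column and must return a boolean
        mask = df.apply(selector)
        return df.columns[mask.to_numpy(dtype=bool)]
    if isinstance(selector, str):
        if selector in df.columns:
            return pd.Index([selector])
        # not an exact column name: treat the string as a regex
        return df.columns[df.columns.str.contains(selector, regex=True)]
    raise TypeError(f"unsupported selector: {selector!r}")


# small made-up frame, not the mRNA dataset
demo = pd.DataFrame({"G_mrna": [1, 2], "MFE": [0.1, None], "refseq_id": ["a", "b"]})
print(list(resolve_selector(demo, float)))     # ['MFE']
print(list(resolve_selector(demo, "_mrna$")))  # ['G_mrna']
```

The type check sits before the callable check on purpose: `callable(float)` is `True`, so the reverse order would try to apply `float` to every column.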

## Viewing

Being able to list the column names that a selector matches is a useful first step before selecting the full columns you want.

This is achieved using the `view` method:

In [5]:
g.view("float")

Index(['GC_content_mrna', 'length_prop_cds', 'length_prop_utr5', 'MFE',
       'MFE_win10', 'MFE_win20', 'MFE_win30', 'MFE_win40', 'MFE_win60',
       'MFE_win80', 'length_prop_utr3', 'signal_polya', 'CAI', 'tAI', 'RCBS',
       'RCBS_PC'],
      dtype='object', name='colnames')

Note that the example above selects by data type: in this instance, every column with a float `dtype` or, more precisely in NumPy terms, with `dtype.kind == 'f'`.
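For comparison, the same `dtype.kind` check can be written against a plain `pandas` DataFrame. The frame below is made up for illustration (it is not the mRNA dataset), and the snippet is a sketch of the idea rather than what `view` does internally.

```python
import pandas as pd

# made-up demo frame with integer, float and string columns
demo = pd.DataFrame({
    "counter": [1, 2, 3],
    "MFE": [-3.2, -1.5, -0.7],
    "CAI": [0.61, 0.58, 0.70],
    "refseq_id": ["id_1", "id_2", "id_3"],
})

# keep the names of columns whose dtype kind is 'f' (floating point)
float_cols = pd.Index([name for name, dt in demo.dtypes.items() if dt.kind == "f"])
print(list(float_cols))  # ['MFE', 'CAI']

# pandas' built-in route to the same columns
same_cols = demo.select_dtypes(include="float").columns
```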

We can also directly select a column of interest by name:

In [6]:
g.columns

Index(['counter', 'refseq_id', 'G_mrna', 'A_mrna', 'C_mrna', 'T_mrna',
       'length_mrna', 'GC_content_mrna', 'length_cds', 'length_prop_cds',
       ...
       'ncRNA_fc', 'precursor_RNA_bs', 'misc_feature_fc', 'sig_peptide_fc',
       'STS_fc', 'regulatory_fc', 'mat_peptide_fc', 'exon_fc', 'proprotein_bs',
       'transit_peptide_fc'],
      dtype='object', name='colnames', length=117)

In [7]:
g.view("G_mrna")

Index(['G_mrna'], dtype='object', name='colnames')

`turbopanda` also supports regex pattern matching: when the string is not an exact column name, it is treated as a regular expression and matched against the column names:

In [11]:
g.view(".*_mrna$")

Index(['G_mrna', 'A_mrna', 'C_mrna', 'T_mrna', 'length_mrna',
       'GC_content_mrna', 'AA_mrna', 'AC_mrna', 'AG_mrna', 'AT_mrna',
       'CA_mrna', 'CC_mrna', 'CG_mrna', 'CT_mrna', 'GA_mrna', 'GC_mrna',
       'GG_mrna', 'GT_mrna', 'TA_mrna', 'TC_mrna', 'TG_mrna', 'TT_mrna'],
      dtype='object', name='colnames')

In [12]:
g.view("_bs$")


Index(['precursor_RNA_bs', 'proprotein_bs'], dtype='object', name='colnames')
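The outputs above are consistent with "search" semantics, where a pattern may match anywhere in a name unless it is anchored with `^` or `$`. The sketch below reproduces that behaviour with the standard `re` module on a hand-picked list of names; `match_columns` is a hypothetical helper and not part of `turbopanda`.

```python
import re

# a few column names taken from the outputs above
names = ["G_mrna", "length_mrna", "length_cds",
         "precursor_RNA_bs", "proprotein_bs", "exon_fc"]


def match_columns(pattern, names):
    """Return the names in which the regex matches anywhere (re.search)."""
    rx = re.compile(pattern)
    return [n for n in names if rx.search(n)]


print(match_columns("_bs$", names))    # ['precursor_RNA_bs', 'proprotein_bs']
print(match_columns("length", names))  # ['length_mrna', 'length_cds']
```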

NOTE: `g.view` ALWAYS returns a `pandas.Index` object, whether empty or full, so you can safely chain it with other pandas operations if you wish.

Passing `None` returns an empty `Index`:

In [15]:
g.view(None)

Index([], dtype='object', name='colnames')

Boolean-type columns in the meta information can also act as selectors:

In [18]:
g.view("is_unique_id")

Index(['counter', 'length_mrna'], dtype='object', name='colnames')

Every option available as a coded selector is listed in the `options_` attribute:

In [19]:
g.options_

('is_mixed_type', 'is_unique_id')

And finally a more complex example using regex to get exactly the combination of columns you want:

In [20]:
g.view("[GC]{1,2}_mrna")

Index(['G_mrna', 'C_mrna', 'AC_mrna', 'AG_mrna', 'CC_mrna', 'CG_mrna',
       'GC_mrna', 'GG_mrna', 'TC_mrna', 'TG_mrna'],
      dtype='object', name='colnames')

Alternatively, you can view by a custom function, which is passed to `DataFrame.apply` under the hood.

Here we show every column whose number of non-missing values equals its full length, i.e. the column has no missing values.

In [21]:
g.view(lambda x: x.count()==x.shape[0])

Index(['counter', 'refseq_id', 'G_mrna', 'A_mrna', 'C_mrna', 'T_mrna',
       'length_mrna', 'GC_content_mrna', 'length_cds', 'length_prop_cds',
       ...
       'ncRNA_fc', 'precursor_RNA_bs', 'misc_feature_fc', 'sig_peptide_fc',
       'STS_fc', 'regulatory_fc', 'mat_peptide_fc', 'exon_fc', 'proprotein_bs',
       'transit_peptide_fc'],
      dtype='object', name='colnames', length=115)
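The same test can be run directly with `DataFrame.apply` on a plain `pandas` frame. This is a minimal sketch on made-up data, showing how a per-column function that returns a boolean becomes a list of column names.

```python
import pandas as pd

# made-up demo frame; 'MFE' contains one missing value
demo = pd.DataFrame({
    "counter": [1, 2, 3],
    "MFE": [-3.2, float("nan"), -0.7],
    "CAI": [0.61, 0.58, 0.70],
})

# apply() calls the function once per column and collects the results
# into a boolean Series indexed by column name
is_complete = demo.apply(lambda col: col.count() == col.shape[0])

# keep only the names where the function returned True
complete_cols = is_complete[is_complete].index
print(list(complete_cols))  # ['counter', 'CAI']
```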

### Inverse viewing

Similarly, the `view_not` function returns the columns that are NOT selected, i.e. all columns that do not satisfy some criterion:

In [22]:
g.view_not(float)

Index(['counter', 'refseq_id', 'G_mrna', 'A_mrna', 'C_mrna', 'T_mrna',
       'length_mrna', 'length_cds', 'A_cds', 'C_cds',
       ...
       'ncRNA_fc', 'precursor_RNA_bs', 'misc_feature_fc', 'sig_peptide_fc',
       'STS_fc', 'regulatory_fc', 'mat_peptide_fc', 'exon_fc', 'proprotein_bs',
       'transit_peptide_fc'],
      dtype='object', name='colnames', length=101)

The same kinds of input are accepted here as well.
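Conceptually, an inverse view is the complement of a view with respect to all column names. A plain `pandas` sketch of that idea, on a made-up `Index`, could look like this; it is an illustration, not the library's code.

```python
import pandas as pd

# made-up column names and a made-up selection
columns = pd.Index(["counter", "MFE", "CAI", "refseq_id"], name="colnames")
selected = pd.Index(["MFE", "CAI"])

# a boolean mask keeps the original column order, unlike a set difference
not_selected = columns[~columns.isin(selected)]
print(list(not_selected))  # ['counter', 'refseq_id']
```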

## Creating multi-views

When given multiple selection criteria, `view` by default keeps the **union** of the terms provided:

\begin{align}
S=\bigcup_i t_i
\end{align}

This means that if you select for `object` and for "Intensity", you will get every column name that is of type `object` **OR** contains the string "Intensity".

This contrasts with an **intersection** of terms, where you would get only the column names that are of type `object` **AND** contain the string "Intensity".

In addition, *the order of the elements is maintained*, even across multiple selectors, such that any sorting/order is preserved in future operations.

In [24]:
g.view(float, "mrna")

Index(['G_mrna', 'A_mrna', 'C_mrna', 'T_mrna', 'length_mrna',
       'GC_content_mrna', 'length_prop_cds', 'length_prop_utr5', 'MFE',
       'MFE_win10', 'MFE_win20', 'MFE_win30', 'MFE_win40', 'MFE_win60',
       'MFE_win80', 'length_prop_utr3', 'AA_mrna', 'AC_mrna', 'AG_mrna',
       'AT_mrna', 'CA_mrna', 'CC_mrna', 'CG_mrna', 'CT_mrna', 'GA_mrna',
       'GC_mrna', 'GG_mrna', 'GT_mrna', 'TA_mrna', 'TC_mrna', 'TG_mrna',
       'TT_mrna', 'signal_polya', 'CAI', 'tAI', 'RCBS', 'RCBS_PC'],
      dtype='object', name='colnames')
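One way to obtain a union that follows the DataFrame's own column order is to combine boolean masks over the full column `Index`. The names below are made up and the snippet is only a sketch of the idea in plain `pandas`.

```python
import pandas as pd

# made-up column order
columns = pd.Index(["G_mrna", "length_cds", "MFE", "AA_mrna", "CAI"])

# two separate selections, e.g. "float columns" and "names containing mrna"
float_like = pd.Index(["MFE", "CAI"])
mrna_like = pd.Index([c for c in columns if "mrna" in c])

# OR-ing the masks gives the union in the original column order
union = columns[columns.isin(float_like) | columns.isin(mrna_like)]
print(list(union))  # ['G_mrna', 'MFE', 'AA_mrna', 'CAI']
```

The result is ordered by position in `columns`, not by the order in which the selectors were given.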

### Using the intersection

To find the intersection rather than the union of terms, use `select`, which is described in the next section.

## Using `eval`-like string operations

Any selector from the previous calls to `view` and `view_not` that can be written as a string can be
stacked into a single string and parsed, much like the `pd.eval` function.
This is done with the `select` function.

This operation allows you to combine the operations:

* intersection: $\&$
* union: $|$

into a single string. Note that whitespace is removed before parsing, so column names
that contain whitespace may behave unexpectedly. For instance, let's say we wanted to select all feature
counts OR binary selectors:

In [28]:
g.select("_bs | _fc")

Index(['STS_fc', 'exon_fc', 'mat_peptide_fc', 'misc_RNA_fc', 'misc_feature_fc',
       'ncRNA_fc', 'precursor_RNA_bs', 'proprotein_bs', 'regulatory_fc',
       'sig_peptide_fc', 'transit_peptide_fc', 'variation_fc'],
      dtype='object', name='colnames')

Alternatively we can quickly write out some defining features of a small subgroup that we want
using the intersection operators:

In [33]:
g.select("_mrna & T")

Index(['T_mrna', 'AT_mrna', 'CT_mrna', 'GT_mrna', 'TA_mrna', 'TC_mrna',
       'TG_mrna', 'TT_mrna'],
      dtype='object', name='colnames')

This should make it substantially easier to iterate through subgroups. We can also use
the NOT operator (`~`) to reject certain selections within a regex chain, for instance selecting
all mRNA features except the mRNA length and GC content:

In [37]:
g.select("_mrna & ~length & ~content")

Index(['G_mrna', 'A_mrna', 'C_mrna', 'T_mrna', 'AA_mrna', 'AC_mrna', 'AG_mrna',
       'AT_mrna', 'CA_mrna', 'CC_mrna', 'CG_mrna', 'CT_mrna', 'GA_mrna',
       'GC_mrna', 'GG_mrna', 'GT_mrna', 'TA_mrna', 'TC_mrna', 'TG_mrna',
       'TT_mrna'],
      dtype='object', name='colnames')
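To make the semantics of such strings concrete, here is a small stand-alone sketch of a parser for `&`, `|` and `~` over a list of names, using only the standard `re` module. `select_names` is a hypothetical helper, and the precedence (`&` binding tighter than `|`) is an assumption of this sketch; how `turbopanda` parses these strings internally is not shown in this notebook.

```python
import re


def select_names(expr, names):
    """Hypothetical mini-parser: '|' is union, '&' is intersection, '~' negates a term."""
    expr = re.sub(r"\s+", "", expr)       # whitespace is dropped, as noted above
    keep = set()
    for clause in expr.split("|"):        # assumption: '&' binds tighter than '|'
        matched = set(names)
        for term in clause.split("&"):
            negate = term.startswith("~")
            rx = re.compile(term[1:] if negate else term)
            hits = {n for n in names if rx.search(n)}
            matched &= (set(names) - hits) if negate else hits
        keep |= matched
    # report the result in the order of the input names
    return [n for n in names if n in keep]


names = ["G_mrna", "T_mrna", "AT_mrna", "length_mrna",
         "GC_content_mrna", "length_cds", "exon_fc", "proprotein_bs"]

print(select_names("_mrna & T", names))
# ['T_mrna', 'AT_mrna']
print(select_names("_mrna & ~length & ~content", names))
# ['G_mrna', 'T_mrna', 'AT_mrna']
```

Because the sketch splits on `|` before compiling any regex, a term cannot itself use regex alternation; that limitation belongs to this sketch and may not apply to `select`.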