# Working with data matrices and analysis results

This notebook shows the most suitable commands for retrieving data from a grandPy object in different scenarios.

Throughout this notebook, we will use the GRAND-SLAM processed SLAM-seq data set from Finkel et al. 2021 [[3]](https://www.nature.com/articles/s41586-021-03610-3). The data set contains time series (progressive labeling) samples from a human epithelial cell line (Calu3 cells); half of the samples were infected with SARS-CoV-2 for different periods of time. For more on these initial commands, see the [Loading data](--) notebook.

In [66]:
import warnings
import pandas as pd

import grandpy as gp
from grandpy import ModeSlot

warnings.filterwarnings("ignore", category=UserWarning)

df = lambda x: x
sars = gp.read_grand(
    "https://zenodo.org/record/5834034/files/sars.tsv.gz",
    design=("Condition", "dur.4sU", "Replicate"),
    classify_genes_func=lambda df: gp.classify_genes(df, cg_name="viral"),
)

Detected URL -> downloading to temp file
Detected dense format -> using dense reader
Temporary file sars.tsv.gz was deleted after loading.


In [67]:
sars = sars.filter_genes()

# Data slots
Data is organized in slots within a grandPy object:

In [68]:
sars.slots

['ntr', 'alpha', 'beta', 'count']

To learn more about metadata, see the [loading data notebook](). After loading GRAND-SLAM analysis results, the default slots are "count" (read counts), "ntr" (the new-to-total RNA ratio), and "alpha" and "beta" (the parameters of the Beta approximation of the NTR posterior distribution). Each of these slots contains a genes x columns matrix of numeric values, where the columns are either samples or cells, depending on whether your data is bulk or single cell data.
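To make the role of "alpha" and "beta" more concrete: a Beta(α, β) distribution has mean α/(α+β) and variance αβ/((α+β)²(α+β+1)). The sketch below uses made-up parameter values (not taken from this data set) and does not call grandPy; it only illustrates how the two parameters jointly describe an NTR estimate and its uncertainty.

```python
import math


def beta_mean(alpha: float, beta: float) -> float:
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


def beta_sd(alpha: float, beta: float) -> float:
    """Standard deviation of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    return math.sqrt(alpha * beta / (total ** 2 * (total + 1)))


# Hypothetical parameters for two genes with the same expected NTR.
# The second gene is supported by ten times more evidence,
# so its posterior distribution is narrower.
low_evidence = (2.0, 8.0)
high_evidence = (20.0, 80.0)

for alpha, beta in (low_evidence, high_evidence):
    print(
        f"alpha={alpha}, beta={beta}: "
        f"mean={beta_mean(alpha, beta):.3f}, sd={beta_sd(alpha, beta):.3f}"
    )
```

Both genes have the same mean, but the larger parameters give a smaller standard deviation, which is why the two values are stored in addition to the point estimate in "ntr".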

There is also a default slot, which many functions use as their default parameter.

In [69]:
sars.default_slot

'count'

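New slots are added by specific grandPy functions such as `normalize()` or `normalize_tpm()`, which, by default, also change the default slot. The default slot can also be set manually with `with_default_slot()`; as the cells below show, this returns a new object rather than modifying `sars` in place, so the result must be assigned for the change to persist.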

In [70]:
sars = sars.normalize()
sars.default_slot

'norm'

In [71]:
sars.with_default_slot("count")
sars.default_slot

'norm'
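The default slot is still 'norm' in the cell above because the result of `with_default_slot("count")` was not assigned back to `sars`. The following is a minimal sketch of this "return a modified copy" pattern using a hypothetical `ToyData` class; it is not grandPy's implementation and only mirrors the behaviour observed above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyData:
    """Hypothetical stand-in for a data container; not the grandPy class."""

    slots: tuple
    default_slot: str

    def with_default_slot(self, name: str) -> "ToyData":
        if name not in self.slots:
            raise ValueError(f"Slot '{name}' not found in data slots.")
        # Return a modified copy instead of mutating self.
        return replace(self, default_slot=name)


data = ToyData(slots=("count", "norm"), default_slot="norm")

data.with_default_slot("count")  # result discarded, data is unchanged
print(data.default_slot)

data = data.with_default_slot("count")  # reassign to keep the change
print(data.default_slot)
```

With this pattern, every call that should have a lasting effect follows the `sars = sars.some_method(...)` form used elsewhere in this notebook.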

In [72]:
sars = sars.normalize_tpm(set_to_default = False)
sars.default_slot

'norm'

In [73]:
sars.with_default_slot("norm")


<GrandPy object: 9162 genes x 12 samples>

There are also other grandPy functions that add additional slots, but do not update the `default_slot` automatically:

In [74]:
sars = sars.compute_ntr_ci()
sars.default_slot

'norm'

In [75]:
sars.slots

['ntr', 'alpha', 'beta', 'count', 'norm', 'tpm', 'lower', 'upper']
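The names of the new `lower` and `upper` slots suggest the bounds of a credible interval for the NTR. As a rough sketch of what such bounds mean, the code below approximates an equal-tailed interval of a Beta(α, β) distribution by sampling with the standard library. The parameter values, the 95% level, and the sampling approach are assumptions for illustration; an exact computation would use the Beta quantile function (for example `scipy.stats.beta.ppf`), and grandPy may compute its interval differently.

```python
import random
import statistics


def beta_interval(alpha, beta, level=0.95, n_draws=20000, seed=0):
    """Approximate an equal-tailed credible interval of Beta(alpha, beta) by sampling."""
    rng = random.Random(seed)
    draws = [rng.betavariate(alpha, beta) for _ in range(n_draws)]
    # With n=1000, quantiles() returns 999 cut points in steps of 0.1%;
    # cuts[i] is the (i + 1) / 1000 quantile.
    cuts = statistics.quantiles(draws, n=1000)
    tail = (1.0 - level) / 2.0
    lower = cuts[round(tail * 1000) - 1]
    upper = cuts[round((1.0 - tail) * 1000) - 1]
    return lower, upper


# Hypothetical posterior parameters for a single gene in a single sample.
lower, upper = beta_interval(20.0, 80.0)
print(f"approximate 95% interval: [{lower:.3f}, {upper:.3f}]")
```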

# Analysis
In addition to data slots, a grandPy object can hold a second kind of data: analyses.

In [76]:
sars.get_analyses()

[]

After loading data, there are no analyses, but they are added, e.g., by modeling progressive labeling time courses or analyzing gene expression (see the notebooks [Kinetic modeling]() and [Differential expression]() for more on these).

In [77]:
sars = sars.fit_kinetics(name_prefix = "kinetics", steady_state = {"Mock": True, "SARS": False})
sars = sars.compute_lfc(contrasts = sars.get_contrasts(contrast = ["duration.4sU.original", "no4sU"], group = "Condition", no4su = True))
sars.analyses

Fitting Mock: 100%|██████████| 9162/9162 [00:11<00:00, 832.62it/s] 
Fitting SARS: 100%|██████████| 9162/9162 [00:10<00:00, 903.65it/s] 


NameError: name 'make_name' is not defined

Both analysis methods, `fit_kinetics()` and `compute_lfc()`, added multiple analyses: `fit_kinetics()` added an analysis for each `Condition`, whereas `compute_lfc()` added an analysis for each of the pairwise comparisons defined by `get_contrasts()` (see [Differential expression]() for details).

What data slots and analyses have in common is that both are tables with as many rows as there are genes. The difference is that the columns of data slots always correspond to the samples or cells (depending on whether the data are bulk or single cell data), whereas the columns of analysis tables are arbitrary and depend on the kind of analysis performed.

Analysis columns can be retrieved by setting the `description` parameter to `True` in `get_analyses()`:

In [None]:
sars.get_analyses(description = True)

We see that the `fit_kinetics()` function by default creates tables with two columns (`synthesis` and `half-life`) corresponding to the synthesis rate and RNA half-life for each gene, and the `compute_lfc()` function creates a single column called `lfc` corresponding to the log2 fold changes for each gene.
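As a reminder of how these quantities relate to each other, the sketch below assumes the standard model of constant synthesis and first-order degradation, in which the half-life is ln(2) divided by the degradation rate and the steady-state level is the synthesis rate divided by the degradation rate. The function names and all numbers are hypothetical. The fold change here is a plain log2 ratio; `compute_lfc()` may well apply additional regularization, so this is not a description of its internals.

```python
import math


def half_life_from_degradation(d: float) -> float:
    """Half-life (in the time unit of d) for a first-order degradation rate d."""
    return math.log(2) / d


def steady_state_level(synthesis: float, d: float) -> float:
    """Expression level at which synthesis and degradation balance (s / d)."""
    return synthesis / d


def log2_fold_change(a: float, b: float) -> float:
    """Plain log2 ratio of two positive expression values."""
    return math.log2(a / b)


# Hypothetical gene: degradation rate of 0.1 per hour, synthesis of 50 units per hour.
d = 0.1
s = 50.0
print(f"half-life: {half_life_from_degradation(d):.2f} h")
print(f"steady-state level: {steady_state_level(s, d):.1f}")
print(f"log2 fold change of 400 vs 100: {log2_fold_change(400.0, 100.0):.1f}")
```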

# Retrieving data from slots or analyses
There are essentially three functions you can use for retrieving data from slots or analyses:
- `get_table()`: The Swiss army knife; returns a data frame with genes as rows and columns made from potentially several slots and/or analyses; usually for all or at least a lot of genes
- `get_data()`: Returns a data frame with the samples or cells as rows and slot data for particular genes in columns; usually for a single or at most very few genes
- `get_analysis_table()`: Returns a data frame with genes as rows and columns made from potentially several analyses; usually for all or at least a lot of genes; there is (almost) no need to call this function (see below for exceptions)
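The main difference between the first two functions is the orientation of the result. The sketch below illustrates this with plain pandas on a toy genes x samples matrix (values copied from the count table shown in this notebook); it does not call grandPy, and the real `get_data()` output may contain additional columns.

```python
import pandas as pd

# Toy genes x samples matrix with a few values from the count slot.
counts = pd.DataFrame(
    {
        "Mock.no4sU.A": [14.0, 133.6667],
        "Mock.1h.A": [420.0, 1202.0],
        "Mock.2h.A": [372.0, 942.5],
    },
    index=pd.Index(["MIB2", "OSBPL9"], name="Symbol"),
)

# Gene-centric view (the orientation of get_table()): one row per gene.
gene_table = counts

# Sample-centric view (the orientation of get_data()): one row per sample,
# one column per requested gene.
sample_table = counts.loc[["MIB2"]].T
sample_table.columns.name = None

print(gene_table.shape)
print(sample_table.shape)
```

The gene-centric layout suits genome-wide summaries, whereas the sample-centric layout is convenient for plotting one gene against sample metadata.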

# get_table
Without any other parameters, `get_table()` returns data for all genes from the default slot. Here, the `count` slot is requested explicitly via `mode_slot`:

In [40]:
sars.get_table(mode_slot = "count")

Symbol,Mock.no4sU.A,Mock.1h.A,Mock.2h.A,Mock.2h.B,Mock.3h.A,Mock.4h.A,SARS.no4sU.A,SARS.1h.A,SARS.2h.A,SARS.2h.B,SARS.3h.A,SARS.4h.A
MIB2,14.0000,420.0000,372.0000,230.0000,456.50,419.5,14.0000,180.0000,2.650000e+02,2.130000e+02,1.000000e+02,1.020000e+02
OSBPL9,133.6667,1202.0000,942.5000,1071.5000,1688.00,1385.5,38.0000,348.0000,3.840000e+02,4.885000e+02,1.470000e+02,1.850000e+02
BTF3L4,160.3333,1208.1667,738.7500,1251.4167,2071.75,1371.0,40.0000,369.1667,2.596667e+02,5.014167e+02,1.650000e+02,2.120000e+02
ZFYVE9,51.5000,486.0000,383.3333,364.5000,503.50,433.5,19.0000,165.0000,1.620000e+02,2.365000e+02,8.300000e+01,1.080000e+02
PRPF38A,99.5000,938.0000,818.0000,837.0000,1296.00,1191.0,40.0000,238.0000,2.980000e+02,4.105000e+02,1.360000e+02,1.665000e+02
...,...,...,...,...,...,...,...,...,...,...,...,...
MECP2,187.0000,1925.5000,1624.5000,1318.5000,2492.00,2283.5,79.0000,618.0000,8.215000e+02,8.080000e+02,3.420000e+02,5.145000e+02
FLNA,5257.5000,67906.3333,64259.3500,45429.1667,82645.60,73480.4,1079.5000,13502.5000,1.668567e+04,1.389320e+04,7.156667e+03,9.362000e+03
DNASE1L1,67.0000,758.5000,683.5000,584.5000,1063.25,993.5,16.0000,213.0000,2.410000e+02,2.267500e+02,9.300000e+01,1.300000e+02
ORF1ab,459.0000,4421.0000,4567.0000,3378.0000,5459.50,4486.0,395071.6429,536923.5833,1.351206e+06,2.337993e+06,1.243669e+06,1.775411e+06


You can also retrieve several slots at once by passing a list to the `mode_slot` parameter; the slot name is then appended to each column name:

In [42]:
sars.get_table(mode_slot = ["norm", "count"]).columns

Index(['Mock.no4sU.A_norm', 'Mock.1h.A_norm', 'Mock.2h.A_norm',
       'Mock.2h.B_norm', 'Mock.3h.A_norm', 'Mock.4h.A_norm',
       'SARS.no4sU.A_norm', 'SARS.1h.A_norm', 'SARS.2h.A_norm',
       'SARS.2h.B_norm', 'SARS.3h.A_norm', 'SARS.4h.A_norm',
       'Mock.no4sU.A_count', 'Mock.1h.A_count', 'Mock.2h.A_count',
       'Mock.2h.B_count', 'Mock.3h.A_count', 'Mock.4h.A_count',
       'SARS.no4sU.A_count', 'SARS.1h.A_count', 'SARS.2h.A_count',
       'SARS.2h.B_count', 'SARS.3h.A_count', 'SARS.4h.A_count'],
      dtype='object')
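The column names above follow a `<sample>_<slot>` pattern. A comparable layout can be produced in plain pandas by suffixing each matrix with its slot name and concatenating column-wise. This is only a sketch of the naming scheme, not grandPy's implementation; the count values are taken from the table above, and the norm values are made up.

```python
import pandas as pd

samples = ["Mock.1h.A", "SARS.1h.A"]
genes = pd.Index(["MIB2", "OSBPL9"], name="Symbol")

# Toy slot matrices (genes x samples).
slots = {
    "norm": pd.DataFrame([[100.0, 110.0], [300.0, 210.0]], index=genes, columns=samples),
    "count": pd.DataFrame([[420.0, 180.0], [1202.0, 348.0]], index=genes, columns=samples),
}

# Suffix each column with its slot name, then join the matrices side by side.
combined = pd.concat(
    [matrix.add_suffix(f"_{name}") for name, matrix in slots.items()],
    axis=1,
)
print(list(combined.columns))
```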

By using the `mode_slot` syntax (with mode being one of `total`, `new`, or `old`), you can also retrieve new RNA counts or new RNA normalized values:

In [52]:
sars.get_table(mode_slot= gp.ModeSlot("new", "norm"), ntr_nan = True)

Symbol,Mock.no4sU.A,Mock.1h.A,Mock.2h.A,Mock.2h.B,Mock.3h.A,Mock.4h.A,SARS.no4sU.A,SARS.1h.A,SARS.2h.A,SARS.2h.B,SARS.3h.A,SARS.4h.A
MIB2,,2.334512,32.111821,16.474117,40.421769,52.710641,,10.323628,68.111814,3.900246e+01,1.107407e+02,1.165965e+02
OSBPL9,,17.339177,55.206225,62.407300,82.942040,109.840247,,49.245814,154.332663,1.816013e+02,1.692269e+02,2.246368e+02
BTF3L4,,17.308215,69.380310,120.949589,177.434594,223.605445,,40.315118,101.825258,1.775885e+02,1.949572e+02,2.326411e+02
ZFYVE9,,2.267216,27.897675,25.821868,39.302175,68.879675,,3.283195,56.454819,6.681548e+01,1.009159e+02,1.443122e+02
PRPF38A,,28.737657,122.185053,133.987687,160.389371,252.489814,,82.290788,223.122015,2.359033e+02,2.318516e+02,2.852863e+02
...,...,...,...,...,...,...,...,...,...,...,...,...
MECP2,,50.773766,321.335986,236.242873,398.240228,552.906767,,108.141762,621.657322,4.684457e+02,7.809246e+02,9.330807e+02
FLNA,,199.957557,2542.705633,1631.919535,2865.773129,3817.188192,,284.479165,5490.437208,3.155187e+03,7.761168e+03,9.774823e+03
DNASE1L1,,6.399312,53.361958,27.137565,52.135871,80.851752,,6.232803,55.731343,4.752973e+01,4.805717e+01,7.580657e+01
ORF1ab,,1125.408603,1223.190926,957.804640,906.522581,844.139520,,224548.040766,995399.866618,1.458746e+06,2.341523e+06,2.919400e+06


Note that the no4sU columns contain only NaN values. You can change this behaviour by setting the `ntr_nan` parameter to `False`, which returns 0 for these columns instead:

In [78]:
sars.get_table(mode_slot= gp.ModeSlot("new", "norm"), ntr_nan = False)

Symbol,Mock.no4sU.A,Mock.1h.A,Mock.2h.A,Mock.2h.B,Mock.3h.A,Mock.4h.A,SARS.no4sU.A,SARS.1h.A,SARS.2h.A,SARS.2h.B,SARS.3h.A,SARS.4h.A
MIB2,0.0,2.334512,32.111821,16.474117,40.421769,52.710641,0.0,10.323628,68.111814,3.900246e+01,1.107407e+02,1.165965e+02
OSBPL9,0.0,17.339177,55.206225,62.407300,82.942040,109.840247,0.0,49.245814,154.332663,1.816013e+02,1.692269e+02,2.246368e+02
BTF3L4,0.0,17.308215,69.380310,120.949589,177.434594,223.605445,0.0,40.315118,101.825258,1.775885e+02,1.949572e+02,2.326411e+02
ZFYVE9,0.0,2.267216,27.897675,25.821868,39.302175,68.879675,0.0,3.283195,56.454819,6.681548e+01,1.009159e+02,1.443122e+02
PRPF38A,0.0,28.737657,122.185053,133.987687,160.389371,252.489814,0.0,82.290788,223.122015,2.359033e+02,2.318516e+02,2.852863e+02
...,...,...,...,...,...,...,...,...,...,...,...,...
MECP2,0.0,50.773766,321.335986,236.242873,398.240228,552.906767,0.0,108.141762,621.657322,4.684457e+02,7.809246e+02,9.330807e+02
FLNA,0.0,199.957557,2542.705633,1631.919535,2865.773129,3817.188192,0.0,284.479165,5490.437208,3.155187e+03,7.761168e+03,9.774823e+03
DNASE1L1,0.0,6.399312,53.361958,27.137565,52.135871,80.851752,0.0,6.232803,55.731343,4.752973e+01,4.805717e+01,7.580657e+01
ORF1ab,0.0,1125.408603,1223.190926,957.804640,906.522581,844.139520,0.0,224548.040766,995399.866618,1.458746e+06,2.341523e+06,2.919400e+06
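Since the NTR is the new-to-total RNA ratio, new RNA can be thought of as the total value multiplied by the NTR. The sketch below reproduces the two behaviours shown above (NaN versus 0 in the no4sU columns) with plain pandas. The gene names, totals, and NTRs are made up, and the helper `new_rna()` is hypothetical; it is not part of grandPy.

```python
import numpy as np
import pandas as pd

genes = pd.Index(["geneA", "geneB"], name="Symbol")
samples = ["Mock.no4sU.A", "Mock.1h.A", "Mock.2h.A"]

# Made-up total values and NTRs, for illustration only.
total = pd.DataFrame([[100.0, 120.0, 110.0], [50.0, 40.0, 60.0]], index=genes, columns=samples)
ntr = pd.DataFrame([[0.0, 0.1, 0.2], [0.0, 0.25, 0.5]], index=genes, columns=samples)

# Samples without 4sU labeling cannot contain labeled (new) RNA.
no4su = [s for s in samples if ".no4sU." in s]


def new_rna(total, ntr, no4su_columns, ntr_nan):
    """New RNA = total * NTR; unlabeled samples become NaN or 0."""
    new = total * ntr  # creates a new data frame, inputs are left untouched
    new[no4su_columns] = np.nan if ntr_nan else 0.0
    return new


print(new_rna(total, ntr, no4su, ntr_nan=True))
print(new_rna(total, ntr, no4su, ntr_nan=False))
```

NaN makes it explicit that new RNA is undefined without labeling, whereas 0 is more convenient when the values feed into sums or plots.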


`get_table()` can also be used to retrieve analysis results:

In [79]:
sars.get_table(mode_slot= "kinetics")

ValueError: Slot 'kinetics' not found in data slots.

Note that you can also specify the full name (it is actually a regular expression that is matched against each analysis name).
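To illustrate what matching a regular expression against analysis names means, here is a small sketch using Python's `re` module. The analysis names are hypothetical (the real names depend on the analyses that were run), and whether grandPy matches anywhere in the name, as `re.search` does, is an assumption.

```python
import re

# Hypothetical analysis names, for illustration only.
analysis_names = ["kinetics.Mock", "kinetics.SARS", "total.SARS vs Mock"]


def match_analyses(pattern, names):
    """Return the names in which the regular expression matches anywhere."""
    return [name for name in names if re.search(pattern, name)]


# A short pattern selects every analysis whose name contains it.
print(match_analyses("kinetics", analysis_names))

# An anchored pattern with an escaped dot selects exactly one analysis.
print(match_analyses(r"^kinetics\.Mock$", analysis_names))
```

Because the pattern is a regular expression, characters such as `.` have a special meaning and should be escaped when an exact name is intended.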

It is also easy to retrieve data for specific columns only (i.e., samples or cells) by using the `columns` parameter. Note that you can use names from the `coldata` table to construct a logical vector over the columns; using a character vector (to specify names) or a numeric vector (to specify positions) also works: