# Working with data matrices and analysis results

This notebook shows the most suitable commands for retrieving data from a GrandPy object in different scenarios.

Throughout this notebook, we will be using the GRAND-SLAM processed SLAM-seq data set from [Finkel et al. (2021)](https://www.nature.com/articles/s41586-021-03610-3). The data set contains time series (progressive labeling) samples from a human epithelial cell line (Calu3 cells); half of the samples were infected with SARS-CoV-2 for different periods of time. For more on these initial commands, see the [loading data notebook](../notebook_03_loading_data_and_working_with_grandpy_objects).

In [None]:
import warnings
import pandas as pd

import grandpy as gp
from grandpy import ModeSlot

warnings.filterwarnings("ignore", category=UserWarning)

sars = gp.read_grand("https://zenodo.org/record/5834034/files/sars.tsv.gz", design=("Condition", "dur.4sU", "Replicate"), classify_genes_func=lambda df: gp.classify_genes(df, cg_name="viral"))

In [None]:
sars = sars.filter_genes()

# Data slots
Data is organized in `slots`:

In [None]:
sars.slots

To learn more about metadata, see the [loading data notebook](../notebook_03_loading_data_and_working_with_grandpy_objects). After loading GRAND-SLAM analysis results, the standard slots are "count" (read counts), "ntr" (the new-to-total RNA ratio), and "alpha" and "beta" (the parameters of the Beta approximation of the NTR posterior distribution). Each of these slots contains a genes x columns matrix of numeric values, where the columns are samples for bulk data and cells for single-cell data.
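To make the role of the "alpha" and "beta" slots more concrete, the following minimal sketch computes the mean and standard deviation of a Beta distribution from its two parameters. The parameter values are hypothetical and not taken from the data set, and the helper `beta_summary` is illustrative only; it is not part of GrandPy.

```python
import math


def beta_summary(alpha: float, beta: float) -> tuple[float, float]:
    """Return the mean and standard deviation of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total ** 2 * (total + 1))
    return mean, math.sqrt(variance)


# Hypothetical parameters for one gene in one sample (not from the data set)
mean, sd = beta_summary(alpha=20.0, beta=60.0)
print(f"mean NTR = {mean:.3f}, sd = {sd:.3f}")
```

Larger values of alpha + beta give a narrower distribution, i.e. a more certain NTR estimate.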

There is also a `default_slot`, which is used by many functions as a default parameter.

In [None]:
sars.default_slot

New slots are added by specific GrandPy methods such as `normalize()` or `normalize_tpm()`, which by default also change the `default_slot`. The default slot can also be set manually with `with_default_slot()`.

In [None]:
sars = sars.normalize()
sars.default_slot

In [None]:
sars = sars.with_default_slot("count")
sars.default_slot

In [None]:
sars = sars.normalize_tpm(set_to_default = False)
sars.default_slot

In [None]:
sars = sars.with_default_slot("norm")
sars.default_slot

There are also other GrandPy functions that add additional slots, but do not update the `default_slot` automatically:

In [None]:
sars = sars.compute_ntr_ci()
sars.default_slot

In [None]:
sars.slots

# Analysis
In addition to data slots, a GrandPy object can hold a second kind of data: analyses.

In [None]:
sars.analyses

After loading data, there are no analyses. They are added, for example, by modeling progressive labeling time courses or by analyzing gene expression (see the notebooks [kinetic modeling](../notebook_02_kinetic_modeling) and [differential expression](../notebook_01_differential_expression) for more on these).

In [None]:
# sars = sars.fit_kinetics(steady_state = {"Mock": True, "SARS": False})
# sars = sars.compute_lfc(contrasts = sars.get_contrasts(contrast = ["duration.4sU.original", "no4sU"], group = "Condition", no4su = True))
sars.analyses

Both analysis methods, `fit_kinetics()` and `compute_lfc()`, added multiple analyses: `fit_kinetics()` added an analysis for each `condition`, whereas `compute_lfc()` added an analysis for each pairwise comparison defined by `get_contrasts()` (see [differential expression](../notebook_01_differential_expression) for details).

What data slots and analyses have in common is that both are tables with as many rows as there are genes. The main difference is their columns. In `slots`, the columns correspond to the samples or cells (depending on whether the data are bulk or single-cell data). In `analyses`, the columns are arbitrary and depend only on the kind of analysis performed.

Analysis columns can be retrieved by setting `description = True` in `get_analyses()`:

In [None]:
sars.get_analyses(description = True)

We see that `fit_kinetics()` by default creates tables with two columns (*Synthesis* and *Half-life*) corresponding to the synthesis rate and RNA half-life for each gene. `compute_lfc()` creates a single column called *lfc* corresponding to the log2 fold changes for each gene.

# Retrieving data from slots or analyses
There are three main methods you can use for retrieving data from `slots` or `analyses`:
- `get_table()`: Returns a DataFrame with genes as rows and columns made from potentially several slots
- `get_data()`: Returns a DataFrame with the samples/cells as rows and slot data for particular genes in columns
- `get_analysis_table()`: Returns a DataFrame with genes as rows and columns made from potentially several analyses

# get_table
Without any other parameters `get_table()` returns the data from `default_slot` for all genes:

In [None]:
sars.get_table().head()

You can change the slot by specifying the `mode_slot`.

In [None]:
sars.get_table(mode_slot = "count").head()

Multiple slots can be used at once:

In [None]:
sars.get_table(mode_slot = ["norm", "count"]).columns

By using the `mode_slot` syntax (mode being one of *total*, *new*, or *old*), you can also retrieve new RNA counts or new RNA normalized values:

In [None]:
sars.get_table(mode_slot="new_norm").head()

A mode slot can also be specified using a `ModeSlot` object, as you will see in the following example.

Note that the no4sU columns contain only NaN values. You can change this behavior by specifying `ntr_nan`:

In [None]:
sars.get_table(mode_slot= gp.ModeSlot("new", "norm"), ntr_nan = False).head()
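The relationship between total, new, and old values can be sketched with plain pandas. This sketch assumes that new RNA is the total value multiplied by the NTR and old RNA is the remainder, and that `ntr_nan = False` corresponds to treating the missing NTR of no4sU samples as 0; both are assumptions about GrandPy's behavior, and all numbers are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical normalized totals and NTRs for two genes in three samples.
# The no4sU sample has no NTR estimate, hence NaN.
total = pd.DataFrame(
    {"Mock.no4sU.A": [100.0, 50.0], "Mock.1h.A": [110.0, 40.0], "Mock.2h.A": [90.0, 60.0]},
    index=["geneA", "geneB"],
)
ntr = pd.DataFrame(
    {"Mock.no4sU.A": [np.nan, np.nan], "Mock.1h.A": [0.2, 0.5], "Mock.2h.A": [0.4, 0.75]},
    index=["geneA", "geneB"],
)

# Assumption: new = total * NTR, so a NaN NTR propagates into the no4sU column
new_with_nan = total * ntr

# Assumption: replacing the missing NTR by 0 means "no new RNA" in no4sU samples
new_zero = total * ntr.fillna(0.0)
old_zero = total * (1.0 - ntr.fillna(0.0))

print(new_with_nan)
print(new_zero)
```

With the NTR set to 0, new and old values always add up to the total, which is convenient when plotting all three together.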

It is also possible to only retrieve data for specific `columns` (i.e., samples or cells) by using the `columns` parameter. Here you can use column names, their index, or a boolean mask:

In [None]:
sars.get_table(columns = ["Mock.no4sU.A", "SARS.no4sU.A"]).columns

In [None]:
sars.get_table(columns = sars.columns[3:6]).columns

In [None]:
sars.get_table(columns = (sars.coldata["duration.4sU"] >= 2) & (sars.coldata["Condition"] == "Mock")).columns
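The boolean mask used above is an ordinary pandas expression. The following sketch reproduces it on a small hypothetical column metadata table (the sample names and values are made up and only mimic the structure of `sars.coldata`):

```python
import pandas as pd

# Hypothetical column metadata with one row per sample
coldata = pd.DataFrame(
    {
        "Condition": ["Mock", "Mock", "Mock", "SARS", "SARS"],
        "duration.4sU": [0, 2, 4, 2, 4],
    },
    index=["Mock.no4sU.A", "Mock.2h.A", "Mock.4h.A", "SARS.2h.A", "SARS.4h.A"],
)

# Each comparison yields a boolean Series; `&` combines them element-wise.
# The parentheses are required because `&` binds more tightly than `>=` and `==`.
mask = (coldata["duration.4sU"] >= 2) & (coldata["Condition"] == "Mock")

selected = coldata.index[mask].tolist()
print(selected)
```

Only the samples for which both conditions hold remain, here the two Mock samples labeled for at least 2 hours.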

It is furthermore possible to fetch data only for specific genes (e.g. viral genes) using the `genes` parameter. It accepts gene names/symbols, their index, or a boolean mask:

In [None]:
sars.get_table(genes = "MYC").index

In [None]:
sars.get_table(genes = range(0,3)).index

In [None]:
sars.get_table(genes = (sars.gene_info["Type"] == "viral")).index

Sometimes, it makes sense to add the `gene_info` to the DataFrame (for more on gene metadata, see the [loading data notebook](../notebook_03_loading_data_and_working_with_grandPy_objects)):

In [None]:
df = sars.get_table(mode_slot = "norm", with_gene_info = True)
df.head()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

sns.scatterplot(
    data=df,
    x='SARS.4h.A',
    y='SARS.no4sU.A',
    hue='Type'
)

plt.xscale('log')
plt.yscale('log')

lims = [
    np.min([df['SARS.4h.A'].min(), df['SARS.no4sU.A'].min()]),
    np.max([df['SARS.4h.A'].max(), df['SARS.no4sU.A'].max()]),
]
plt.plot(lims, lims, 'k-')
plt.xlim(lims)
plt.ylim(lims)
plt.xlabel('SARS.4h.A')
plt.ylabel('SARS.no4sU.A')
plt.title('Log-Log Scatter Plot')
plt.legend(title='Type')
plt.tight_layout()
plt.show()

highlight = sars.get_classified_genes("viral")
# gp.plot_scatter(sars, x = "SARS.4h.A",y = "SARS.no4sU.A",log = True, remove_outlier=False, diagonal=True, color="blue", highlight=highlight)
# gp.plot_scatter(sars, x="SARS.4h.A", y="SARS.no4sU.A", remove_outlier=False, log_x=True, log_y=True, diagonal=True)

Finally, values can be summarized across samples or cells from the same `condition`:

In [None]:
sars.get_table(summarize = sars.get_summarize_matrix()).head()

This is achieved by a summarize matrix:

In [None]:
sars.get_summarize_matrix()

For summarization, the raw matrix is matrix-multiplied with the summarize matrix. `get_summarize_matrix()` generates a matrix with a column for each `condition`:

In [None]:
print(sars.condition)

By default, no4sU columns are removed (i.e. zero in the matrix), but the `no4su` parameter can change this:

In [None]:
sars.get_summarize_matrix(no4su = True)

It is also possible to only focus on specific columns:

In [None]:
sars.get_summarize_matrix(columns =sars.coldata["duration.4sU"] < 4)

The default behavior is to compute the average, which can be changed to computing sums:

In [None]:
sars.get_summarize_matrix(average = False)
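The idea behind the summarize matrix can be sketched with pandas. The helper below is a hypothetical re-implementation for illustration, not GrandPy's code; it assumes the matrix has one row per sample and one column per condition, and that the data matrix is multiplied from the left. All values are made up.

```python
import pandas as pd


def summarize_matrix(conditions, is_no4su, average=True, no4su=False):
    """Build a samples x conditions weight matrix (illustrative only)."""
    cond_names = list(dict.fromkeys(conditions))  # unique conditions, order kept
    mat = pd.DataFrame(0.0, index=conditions.index, columns=cond_names)
    for col in conditions.index:
        if is_no4su[col] and not no4su:
            continue  # no4sU samples keep weight zero unless requested
        mat.loc[col, conditions[col]] = 1.0
    if average:
        counts = mat.sum(axis=0)
        mat = mat / counts.where(counts > 0, 1.0)  # avoid division by zero
    return mat


samples = ["Mock.no4sU.A", "Mock.2h.A", "Mock.2h.B", "SARS.2h.A", "SARS.2h.B"]
conditions = pd.Series(["Mock", "Mock", "Mock", "SARS", "SARS"], index=samples)
is_no4su = pd.Series([True, False, False, False, False], index=samples)

# Hypothetical genes x samples data matrix with a single gene
values = pd.DataFrame([[10.0, 20.0, 30.0, 40.0, 60.0]], index=["geneA"], columns=samples)

averaged = values @ summarize_matrix(conditions, is_no4su)
summed = values @ summarize_matrix(conditions, is_no4su, average=False)
with_no4su = values @ summarize_matrix(conditions, is_no4su, no4su=True)
print(averaged)
```

Each column of the weight matrix selects the samples of one condition, so a single matrix product yields per-condition averages or sums for all genes at once.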

As a final example, let's get the averaged normalized values for only the 2h timepoint:

In [None]:
sars.get_table(summarize=sars.get_summarize_matrix(columns=sars.coldata["duration.4sU"] == 2))

# get_data
`get_data()` is very similar to `get_table()`. The most important difference is its shape: the output is transposed relative to `get_table()`.

In [None]:
sars.get_data(genes = "MYC")

Note that `get_data()` adds the `coldata` to the output by default (for more on column metadata, see the [loading data notebook](../notebook_03_loading_data_and_working_with_grandPy_objects)). This can be turned off with the `with_coldata` parameter:

In [None]:
sars.get_data(genes = "MYC", with_coldata = False)

The `genes`, `columns` and `mode_slot` parameters work just like they do in `get_table()`.

In [None]:
sars.get_data(mode_slot = ["count", "norm"], genes = ["MYC", "SRSF6"], columns = sars.coldata["Condition"] == "Mock", with_coldata = False)

Finally, it is also possible to append multiple genes (and/or slots) not as columns, but as additional rows:

In [None]:
sars.get_data(genes = ["MYC", "SRSF6"], columns = sars.coldata["duration.4sU"] < 2, by_rows = True)
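Appending genes as rows corresponds to what is usually called a long (tidy) format. The sketch below builds such a table with `pandas.DataFrame.melt` from a small hypothetical wide table; the column names `Name` and `Gene` are assumptions for illustration, and the values are made up.

```python
import pandas as pd

# Hypothetical wide table: one row per sample, one column per gene
wide = pd.DataFrame(
    {
        "Name": ["Mock.no4sU.A", "Mock.1h.A", "SARS.no4sU.A", "SARS.1h.A"],
        "Condition": ["Mock", "Mock", "SARS", "SARS"],
        "duration.4sU": [0.0, 1.0, 0.0, 1.0],
        "MYC": [120.0, 115.0, 80.0, 60.0],
        "SRSF6": [300.0, 310.0, 290.0, 250.0],
    }
)

# Stack the gene columns on top of each other: one row per (sample, gene) pair
long = wide.melt(
    id_vars=["Name", "Condition", "duration.4sU"],
    var_name="Gene",
    value_name="Value",
)
print(long)
```

In the long table, the gene name is an ordinary column, so plotting libraries such as seaborn can map it directly to color or line style.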

This can be quite helpful, as in the following example: we retrieve total, old, and new RNA for SRSF6 (replicate A only), by rows. This way, the data can be passed directly to matplotlib/seaborn to plot the progressive labeling time course (note the much shorter half-life, which is the time at which the new and old lines cross, for SARS as compared to Mock):

In [None]:
df = sars.get_data(mode_slot = ["old_norm", "new_norm", "total_norm"], genes = "SRSF6",
                   columns = sars.coldata["Replicate"] == "A", by_rows = True, ntr_nan=True)

fig, axes = plt.subplots(1, 2, figsize=(12, 6), sharey=True)

sns.lineplot(ax=axes[0], data=df[df['Condition'] == 'Mock'], x='duration.4sU', y='Value',
             hue='Slot', style='Slot', markers=True, dashes=False)

axes[0].set_title('Mock')
axes[0].set_xlabel('duration.4sU')
axes[0].set_ylabel('Value')

sns.lineplot(ax=axes[1], data=df[df['Condition'] == 'SARS'], x='duration.4sU', y='Value',
             hue='Slot', style='Slot', markers=True, dashes=False)

axes[1].set_title('SARS')
axes[1].set_xlabel('duration.4sU')

plt.tight_layout()
plt.show()
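Why the crossing point of the new and old lines marks the half-life can be checked numerically. The sketch below assumes a steady state with constant total RNA and exponential decay of old RNA, which is a simplification and need not hold for the infected samples; the function names and numbers are hypothetical.

```python
import math


def old_new(total: float, half_life: float, t: float) -> tuple[float, float]:
    """Old and new RNA after labeling time t, assuming steady state."""
    decay_rate = math.log(2) / half_life
    old = total * math.exp(-decay_rate * t)  # pre-existing RNA decays exponentially
    new = total - old                        # steady state: total stays constant
    return old, new


def crossing_time(total: float, half_life: float, hi: float = 100.0, tol: float = 1e-9) -> float:
    """Find the time at which new == old by bisection on [0, hi]."""
    lo = 0.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        old, new = old_new(total, half_life, mid)
        if new < old:
            lo = mid  # crossing lies later
        else:
            hi = mid  # crossing lies earlier
    return (lo + hi) / 2


t_cross = crossing_time(total=100.0, half_life=3.0)
print(f"new and old cross at t = {t_cross:.3f} h")
```

At the crossing point, old RNA has dropped to half of the total, which is the definition of the half-life under these assumptions.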

# get_analysis_table
`get_analysis_table` 