# Loading data and working with grandPy objects

<span style="color:green">Note: everything shown in red has not been implemented yet or still needs to be revised (links etc.).</span>

GrandPy is a Python package for the analysis of RNA-seq experiments involving metabolic RNA labeling with nucleotide conversion, such as SLAM-seq experiments [[1]](https://www.nature.com/articles/nmeth.4435). In such experiments, nucleoside analogs such as 4sU are added to living cells, which take them up and incorporate them into newly synthesized RNA. Before sequencing, 4sU is converted into a cytosine analog. Reads covering 4sU sites therefore have characteristic T-to-C mismatches after read mapping, in principle providing the opportunity to differentiate newly synthesized (during the time of labeling) from preexisting RNA.

Confounders such as sequencing errors or reads that originate from newly synthesized RNA but, by chance, do not cover sites of 4sU incorporation (usually 20-80% of all "new reads") can be handled using specialized methods such as GRAND-SLAM [[2]](https://academic.oup.com/bioinformatics/article/34/13/i218/5045735?login=true).

# Reading in the data

Throughout this vignette, we will be using the GRAND-SLAM processed SLAM-seq data set from Finkel et al. 2021 [[3]](https://academic.oup.com/bioinformatics/article/34/13/i218/5045735). The data set contains time series (progressive labeling) samples from a human epithelial cell line (Calu3 cells); half of the samples were infected with SARS-CoV-2 for different periods of time.

The output of GRAND-SLAM is a tsv file in which rows are genes and columns are read counts and other statistics (e.g., the new-to-total RNA ratio) for all samples. The data set is available on Zenodo ("https://zenodo.org/record/5834034/files/sars.tsv.gz"). We start by reading this file into Python:

In [1]:
# Import the grandPy loading functions
from Py.load import *
sars = read_grand("../data/sars.tsv", design=("Condition", "Time", "Replicate"))
print(sars.columns)

Detected dense format -> using dense reader
['Mock.no4sU.A', 'Mock.1h.A', 'Mock.2h.A', 'Mock.2h.B', 'Mock.3h.A', 'Mock.4h.A', 'SARS.no4sU.A', 'SARS.1h.A', 'SARS.2h.A', 'SARS.2h.B', 'SARS.3h.A', 'SARS.4h.A']




When reading in the file, we have to define the ``design`` vector. This is used to infer metadata automatically from sample names. Here, sample names consist of three parts separated by dots as shown above (`columns` returns the sample names, or the cell ids when analyzing a single cell data set). Each part of the sample name represents an aspect of the design. For example, the sample named Mock.2h.A is a sample from the mock condition (i.e. not infected by SARS-CoV-2), subjected to metabolic labeling for 2 hours, and is the first replicate (i.e. replicate "A"). This sample name is consistent with the three-element design vector used above. It is possible to specify other design elements (of course, the samples would have to be named accordingly). A list of reasonable options is predefined in the dictionary `DESIGN_KEYS`.
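To make the idea of a design vector concrete, here is a minimal sketch in plain pandas of how dot-separated sample names could be split into one metadata column per design element. The helper `names_to_coldata` is hypothetical and only illustrates the principle; it is not grandPy's implementation.

```python
import pandas as pd

def names_to_coldata(names, design, sep="."):
    """Hypothetical helper: split sample names into one column per design element."""
    rows = []
    for name in names:
        parts = name.split(sep)
        if len(parts) != len(design):
            raise ValueError(f"{name!r} has {len(parts)} parts, expected {len(design)}")
        rows.append(dict(zip(design, parts)))
    # One row per sample, indexed by the sample name
    return pd.DataFrame(rows, index=pd.Index(names, name="Name"))

names = ["Mock.no4sU.A", "Mock.2h.A", "SARS.2h.B"]
coldata = names_to_coldata(names, design=("Condition", "Time", "Replicate"))
print(coldata)
```

A name whose number of parts does not match the design raises an error, which mirrors the requirement that sample names and design vector must be consistent.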

Some names (i.e. the entries you specify in the design vector) carry additional semantics. For example, for the name `duration.4sU` the values are interpreted as follows: 4h is converted into the number 4, 30min into 0.5, and no4sU into 0. For more information, see [below](#column-metadata). The design vector is mandatory. Attempting to read in the data without it results in an error:

In [2]:
sars_wrong = read_grand("../data/sars.tsv") # the error message could still be shortened

Detected dense format -> using dense reader




ValueError: Design must be specified.

Alternatively, a table containing the metadata can be specified. Make sure that it contains a `Name` column matching the names in the GRAND-SLAM output table:

In [None]:
import pandas as pd
name = ["Mock.no4sU.A", "Mock.1h.A", "Mock.2h.A", "Mock.2h.B", "Mock.3h.A", "Mock.4h.A", "SARS.no4sU.A", "SARS.1h.A", "SARS.2h.A", "SARS.2h.B", "SARS.3h.A", "SARS.4h.A"]

conditions = ["Mock"] * 6 + ["SARS"] * 6

design_df = pd.DataFrame({"Name": name,
                          "Condition": conditions})

sars_meta = read_grand("../data/sars.tsv", design = design_df)

# What is the grandPy object

`read_grand` returns a grandPy object, which contains

1. metadata for genes
2. metadata for samples/cells (as inferred from the sample names by the design parameter)
3. all data matrices (counts, normalized counts, NTRs, etc.; these types of data are called "slots")
4. analysis results

Metadata (1. and 2.) are described below. How to work with the data matrices and analysis results is described in a separate [vignette](working_with_data_matrices_and_analysis_results.ipynb).

# Working with grandPy objects

Here we will see how to work with grandPy objects in general. A short summary is displayed when `print`ing the object, and there are several functions to retrieve general information about the object:

In [3]:
print(sars) # link missing

GrandPy:
Read from sars
19659 genes, 12 samples/cells
Available data slots: ['count', 'ntr', 'alpha', 'beta']
Available analyses: []
Available plots: {}
Default data slot: count



In [4]:
print(sars.title)

sars


In [5]:
print(len(sars.genes)) # like this?

19659


In [6]:
print(len(sars.coldata)) # like this?

12


It is straightforward to filter genes:

In [7]:
# sars = sars.filter_genes # missing from processing.py
# print(len(sars.genes))

By default, genes are retained if they have 100 read counts in at least half of the samples (or cells). There are many options for filtering genes (note that `filter_genes` returns a new grandPy object, and below we directly call `len()` on the genes of this new object to check how many genes are retained by filtering):

In [8]:
# print("Genes with at least 1000 read counts in half of the columns: \n", len(sars.genes)) # filter_genes is missing from processing.py

In [9]:
# print("Genes with at least 1000 read counts in half of the columns (retain two genes that are otherwise filtered): \n", ) # filter_genes is missing from processing.py

In [10]:
# print("Keep only these two genes: \n", ...) # filter_genes is missing from processing.py

In [11]:
# sars = sars.normalizeTPM # normalizeTPM() is missing from processing.py
# print("Genes with at least 10 TPM in half of the columns: \n", ...)
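Since the filtering cells above are still placeholders, the default rule described earlier (keep genes with 100 read counts in at least half of the columns) can be sketched with plain pandas on a toy count matrix. The function `filter_by_counts` and its parameter names are hypothetical, and the sketch assumes that "100 read counts" means "at least 100"; it does not reproduce grandPy's `filter_genes`.

```python
import pandas as pd

def filter_by_counts(counts, min_count=100, min_fraction=0.5):
    """Hypothetical sketch: keep rows reaching min_count in at least min_fraction of columns."""
    # Number of columns in which each gene reaches the threshold
    n_passing = (counts >= min_count).sum(axis=1)
    keep = n_passing >= min_fraction * counts.shape[1]
    return counts.loc[keep]

# Toy matrix: rows are genes, columns are samples
counts = pd.DataFrame(
    {
        "s1": [500, 10, 150, 0],
        "s2": [300, 120, 90, 5],
        "s3": [250, 20, 100, 0],
        "s4": [90, 30, 200, 1],
    },
    index=["geneA", "geneB", "geneC", "geneD"],
)

filtered = filter_by_counts(counts)
print(len(filtered))
stricter = filter_by_counts(counts, min_count=200)
print(len(stricter))
```

Like the behavior described for `filter_genes`, the sketch returns a new object and leaves the input matrix unchanged.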

`filter_genes()` essentially removes rows from the data slots. It is also possible to remove columns (i.e. samples or cells). This is done by subsetting the object:

In [12]:
mock = sars[:, sars.coldata["Condition"] == "Mock"]
print(mock) # filter_genes() is still missing, so the number of genes is not yet correct, but the number of samples/cells is

GrandPy:
Read from sars
19659 genes, 6 samples/cells
Available data slots: ['count', 'ntr', 'alpha', 'beta']
Available analyses: []
Available plots: {}
Default data slot: count



The new grandPy object now has only 6 columns. The column selector must be a boolean vector, and it can be built from the columns of the column metadata table (see below); here it is a boolean vector that is `True` for all samples where the `Condition` column is equal to "Mock".
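The mechanics of such a boolean column selection can be illustrated with pandas alone. The toy tables below are made up for illustration and only mimic the relationship between a column metadata table and a data matrix; they do not use grandPy itself.

```python
import pandas as pd

# Toy column metadata: one row per sample
coldata = pd.DataFrame(
    {"Condition": ["Mock", "Mock", "SARS", "SARS"]},
    index=["Mock.1h.A", "Mock.2h.A", "SARS.1h.A", "SARS.2h.A"],
)
# Toy data matrix: rows are genes, columns are the samples above
counts = pd.DataFrame(
    [[10, 12, 30, 28], [0, 1, 5, 7]],
    index=["geneA", "geneB"],
    columns=coldata.index,
)

# Boolean vector with one entry per sample
mask = coldata["Condition"] == "Mock"

# Apply the same mask to the matrix columns and to the metadata rows
mock_counts = counts.loc[:, mask.to_numpy()]
mock_coldata = coldata.loc[mask]
print(mock_counts)
```

Applying one mask to both tables keeps data and metadata in sync, which is what a subsetting operation on the whole object has to guarantee.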

A closely related function is `split`, which returns a list of several grandPy objects, each composed of samples having the same `Condition`.

In [13]:
# .split() is missing from grandpy.py

In [14]:
# find a Python equivalent of lapply

The inverse of `split` is `merge`:

In [15]:
#

Note that we merged such that now we have first the SARS samples and then the Mock samples. We can reorder by slightly abusing `subset` (note that we actually do not omit any columns, but just define a different order):

In [16]:
#
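As the cells for splitting, merging, and reordering are still empty, the three operations can be illustrated on a toy pandas matrix. This is only a sketch of the concepts under the assumption that columns are grouped by `Condition`; it does not show the grandPy API.

```python
import pandas as pd

coldata = pd.DataFrame(
    {"Condition": ["Mock", "Mock", "SARS", "SARS"]},
    index=["Mock.1h.A", "Mock.2h.A", "SARS.1h.A", "SARS.2h.A"],
)
counts = pd.DataFrame(
    [[10, 12, 30, 28], [0, 1, 5, 7]],
    index=["geneA", "geneB"],
    columns=coldata.index,
)

# Split: one sub-matrix per condition
groups = coldata.groupby("Condition").groups
parts = {condition: counts.loc[:, samples] for condition, samples in groups.items()}

# Merge: concatenate the columns again, here deliberately with SARS first
merged = pd.concat([parts["SARS"], parts["Mock"]], axis=1)
print(list(merged.columns))

# Reorder: select all columns, but in the original order
reordered = merged.loc[:, coldata.index]
print(list(reordered.columns))
```

Reordering is just a subset that keeps every column, which is why it can be expressed with the same selection mechanism.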

# Gene metadata

Here we see how to work with metadata for genes. The gene metadata is essentially a table that can be retrieved using the `gene_info` attribute:

In [17]:
print(sars.gene_info.head(10))

                  Symbol             Gene  Length      Type
Symbol                                                     
HNRNPCL3        HNRNPCL3  ENSG00000277058    1148  Cellular
AL137802.1    AL137802.1  ENSG00000224174     327  Cellular
AL137798.2    AL137798.2  ENSG00000282143     817  Cellular
PLA2G2F          PLA2G2F  ENSG00000158786    4129  Cellular
Z98257.1        Z98257.1  ENSG00000227066    2493  Cellular
UBXN10            UBXN10  ENSG00000162543    5571  Cellular
Z98257.2        Z98257.2  ENSG00000284710    1548  Cellular
ACTL8              ACTL8  ENSG00000117148    1861  Cellular
MIR1302-2HG  MIR1302-2HG  ENSG00000243485     712  Cellular
NBPF3              NBPF3  ENSG00000142794    4401  Cellular


Each gene has associated gene ids and symbols. Gene ids and symbols as well as the transcript length are part of the GRAND-SLAM output. The `Type` column is inferred automatically (see below).

Genes can be retrieved via the `genes` attribute:

In [18]:
print(sars.genes[:20])

['HNRNPCL3', 'AL137802.1', 'AL137798.2', 'PLA2G2F', 'Z98257.1', 'UBXN10', 'Z98257.2', 'ACTL8', 'MIR1302-2HG', 'NBPF3', 'ALPL', 'LDLRAD2', 'DISP3', 'FBXO44', 'FBXO6', 'DRAXIN', 'PTPRU', 'CDA', 'PINK1', 'SMIM1']


In [19]:
# head(Genes(sars,use.symbols = FALSE), n=20)      # the first 20 genes, but now use the ids
# print(sars.genes[:20])

In [20]:
# Genes(sars,genes = c("MYC","ORF1ab"),use.symbols = FALSE)    # convert to ids

In [21]:
# Genes(sars,genes = "YC", regex = TRUE)           # retrieve all genes matching to the regular expression YC
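Until the lookups sketched in the commented R calls are available, the same ideas (returning ids instead of symbols, converting symbols to ids, and regular expression matching) can be shown on a small pandas table. The table reuses a few rows from the gene metadata printed above; the variable names are illustrative and this is not the grandPy interface.

```python
import pandas as pd

# A few rows of the gene metadata shown above
gene_info = pd.DataFrame(
    {
        "Symbol": ["HNRNPCL3", "AL137802.1", "AL137798.2", "PLA2G2F"],
        "Gene": ["ENSG00000277058", "ENSG00000224174", "ENSG00000282143", "ENSG00000158786"],
    }
)

# The first genes, but as ids instead of symbols
first_ids = gene_info["Gene"].tolist()[:2]

# Convert selected symbols to ids via a lookup Series
symbol_to_id = gene_info.set_index("Symbol")["Gene"]
converted = symbol_to_id.loc[["PLA2G2F", "HNRNPCL3"]].tolist()

# All symbols matching a regular expression
pattern = r"^AL137"
matches = gene_info.loc[gene_info["Symbol"].str.contains(pattern, regex=True), "Symbol"].tolist()

print(first_ids, converted, matches)
```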

While the data are read into Python using `read_grand`, the `Type` column is inferred using the `classify_genes` function. By default, this will recognize mitochondrial genes (MT prefix of the gene symbol), ERCC spike-ins, and Ensembl gene identifiers (which it will call "Cellular"). Here we also have the viral genes, which are not properly recognized:

In [22]:
print(sars.gene_info["Type"].value_counts()) # Table?

Type
Cellular    19648
Unknown        11
Name: count, dtype: int64


If you want to define your own types, you can do this easily by specifying the `classification_genes` parameter when reading in your data:

In [23]:
viral_genes = ['ORF3a','E','M','ORF6','ORF7a','ORF7b','ORF8','N','ORF10','ORF1ab','S']
sars = read_grand("../data/sars.tsv", design=("Condition", "Time", "Replicate"), classification_genes=viral_genes)

print(sars.gene_info["Type"].value_counts())

Detected dense format -> using dense reader
Type
Cellular    19648
Viral          11
Name: count, dtype: int64




Note that each parameter to `classify_genes` must be named (`viral`) and must be a function that takes the gene metadata table and returns a logical vector.
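The principle of named classifier functions that each take the gene metadata table and return a boolean vector can be sketched as follows. The function `classify`, its `unknown_label` parameter, and the toy table (including the made-up last row) are assumptions for illustration and not grandPy's `classify_genes`.

```python
import pandas as pd

def classify(gene_info, classifiers, unknown_label="Unknown"):
    """Hypothetical sketch: assign the label of the first classifier that matches each gene."""
    types = pd.Series(unknown_label, index=gene_info.index, dtype=object)
    for label, func in classifiers.items():
        mask = func(gene_info).astype(bool)
        unassigned = types == unknown_label
        types.loc[mask & unassigned] = label
    return types

# Toy gene table; the last row is invented to show the fallback label
gene_info = pd.DataFrame(
    {
        "Symbol": ["HNRNPCL3", "ORF1ab", "N", "mystery"],
        "Gene": ["ENSG00000277058", "ORF1ab", "N", "mystery"],
    }
)
viral_genes = ["ORF3a", "E", "M", "ORF6", "ORF7a", "ORF7b", "ORF8", "N", "ORF10", "ORF1ab", "S"]

# Each entry is named and maps the gene table to a boolean vector
classifiers = {
    "Viral": lambda table: table["Symbol"].isin(viral_genes),
    "Cellular": lambda table: table["Gene"].str.startswith("ENSG"),
}

types = classify(gene_info, classifiers)
print(types.value_counts())
```

Genes that no classifier claims keep the fallback label, which corresponds to the "Unknown" type in the output above.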

The `classify_genes` function has one additional important parameter, which defines how "Unknown" types are supposed to be called. For this data set, a similar behavior as above can be accomplished by:

In [24]:
sars = read_grand("../data/sars.tsv", design=("Condition", "Time", "Replicate"), classification_genes=None, classification_genes_label="Unknown",
                 classify_genes_func=None)

print(sars.gene_info["Type"].value_counts())

# this is not quite right here yet

Detected dense format -> using dense reader
Type
Cellular    19648
Unknown        11
Name: count, dtype: int64




It is also straightforward to add additional gene metadata:

In [25]:
# GeneInfo(sars,"length.category") <- cut(GeneInfo(sars,"Length"),
#                                         breaks=c(0,2000,5000,Inf),
#                                         labels = c("Short","Medium","Long"))
# table(GeneInfo(sars,"length.category"))
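The binning performed by `cut` in the commented R code has a direct pandas counterpart, `pd.cut`. The sketch below applies it to a toy table that reuses the first six `Length` values shown above; how the resulting column would be written back into a grandPy object is not covered here.

```python
import numpy as np
import pandas as pd

# Lengths of the first six genes from the gene metadata shown above
gene_info = pd.DataFrame(
    {"Length": [1148, 327, 817, 4129, 2493, 5571]},
    index=["HNRNPCL3", "AL137802.1", "AL137798.2", "PLA2G2F", "Z98257.1", "UBXN10"],
)

# Right-closed bins (0, 2000], (2000, 5000], (5000, inf], as in R's cut() default
gene_info["length.category"] = pd.cut(
    gene_info["Length"],
    bins=[0, 2000, 5000, np.inf],
    labels=["Short", "Medium", "Long"],
)

# Counterpart of R's table()
print(gene_info["length.category"].value_counts())
```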

# Column metadata

Samples for bulk experiments and cells in single cell experiments are jointly called "columns" in grandPy. The metadata for columns is a table that describes the experimental design we specified when reading in the data. It can be accessed via the `coldata` attribute. We can also see that the duration of 4sU labeling has been interpreted and converted to a numeric value (compare "Time" with "Time.original").

In [26]:
print(sars.coldata)

                      Name Condition  Time Replicate Time.original  no4sU
Name                                                                     
Mock.no4sU.A  Mock.no4sU.A      Mock   0.0         A         no4sU   True
Mock.1h.A        Mock.1h.A      Mock   1.0         A            1h  False
Mock.2h.A        Mock.2h.A      Mock   2.0         A            2h  False
Mock.2h.B        Mock.2h.B      Mock   2.0         B            2h  False
Mock.3h.A        Mock.3h.A      Mock   3.0         A            3h  False
Mock.4h.A        Mock.4h.A      Mock   4.0         A            4h  False
SARS.no4sU.A  SARS.no4sU.A      SARS   0.0         A         no4sU   True
SARS.1h.A        SARS.1h.A      SARS   1.0         A            1h  False
SARS.2h.A        SARS.2h.A      SARS   2.0         A            2h  False
SARS.2h.B        SARS.2h.B      SARS   2.0         B            2h  False
SARS.3h.A        SARS.3h.A      SARS   3.0         A            3h  False
SARS.4h.A        SARS.4h.A      SARS  

Additional semantics can also be defined via the function `apply_design_semantics`, which generates a list for the `semantics` parameter of the function `build_coldata`, which in turn is used to infer metadata from sample names. We briefly explain these mechanisms with an example, but note that in most cases the desired metadata can be added after reading the data, as shown further below.

<span style="color:red">First, it is important to have a function that takes two parameters (a specific column of the original column metadata table and the name of this column) and returns a DataFrame that is then concatenated column-wise with the original column metadata table. There is one such predefined function in grandPy, which parses labeling durations:</span>

In [27]:
from pandas import Series
Series(["5h", "30min", "no4sU"]).map(parse_time_string)

0    5.0
1    0.5
2    0.0
dtype: float64
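For readers curious how such a conversion might work, here is a minimal sketch of a duration parser with the behavior shown above. The function `parse_duration` is hypothetical, handles only hours, minutes, and `no4sU`, and is not grandPy's `parse_time_string`.

```python
import re

def parse_duration(value):
    """Hypothetical sketch: convert '5h', '30min' or 'no4sU' into hours as a float."""
    if value == "no4sU":
        return 0.0
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(h|min)", value)
    if match is None:
        raise ValueError(f"Cannot parse labeling duration: {value!r}")
    number, unit = float(match.group(1)), match.group(2)
    # Minutes are converted to fractions of an hour
    return number if unit == "h" else number / 60.0

durations = [parse_duration(v) for v in ["5h", "30min", "no4sU"]]
print(durations)
```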

We can easily define our own function like this:

In [28]:
def my_semantics_time(values, name):
    df = pd.DataFrame({name: values})
    df[name + "_hr"] = df[name].map(parse_time_string)
    df["hpi"] = df[name + "_hr"].apply(lambda x: f"{x + 3}hpi")
    df.attrs["_semantics"] = {name: "time"}
    return df

my_semantics_time(["5h", "30min", "no4sU"], "Test")

    Test  Test_hr     hpi
0     5h      5.0  8.0hpi
1  30min      0.5  3.5hpi
2  no4sU      0.0  3.0hpi


<span style="color:red">Here, it is important to mention that at 3h post infection, 4sU was added to the cells for 1, 2, 3 or 4h. The two no4sU samples are also 3h post infection. This function can now be used as the `semantics` parameter for `read_grand` like this:</span>

In [29]:
# sars.meta <- ReadGRAND(system.file("extdata", "sars.tsv.gz", package = "grandR"),
#                    design=function(names)
#                      MakeColdata(names,
#                                  c("Cell",Design$dur.4sU,Design$Replicate),
#                                  semantics=DesignSemantics(duration.4sU=my.semantics.time)
#                                  ),
#                  verbose=TRUE)

As mentioned above, it is in most cases easier to add additional metadata after loading. The infection time point can also be added by:

In [30]:
sars.coldata["hpi"] = sars.coldata["Time"].apply(lambda x: f"{x + 3}hpi")
print(sars.coldata) # since it is immutable, I cannot append columns ...

                      Name Condition  Time Replicate Time.original  no4sU
Name                                                                     
Mock.no4sU.A  Mock.no4sU.A      Mock   0.0         A         no4sU   True
Mock.1h.A        Mock.1h.A      Mock   1.0         A            1h  False
Mock.2h.A        Mock.2h.A      Mock   2.0         A            2h  False
Mock.2h.B        Mock.2h.B      Mock   2.0         B            2h  False
Mock.3h.A        Mock.3h.A      Mock   3.0         A            3h  False
Mock.4h.A        Mock.4h.A      Mock   4.0         A            4h  False
SARS.no4sU.A  SARS.no4sU.A      SARS   0.0         A         no4sU   True
SARS.1h.A        SARS.1h.A      SARS   1.0         A            1h  False
SARS.2h.A        SARS.2h.A      SARS   2.0         A            2h  False
SARS.2h.B        SARS.2h.B      SARS   2.0         B            2h  False
SARS.3h.A        SARS.3h.A      SARS   3.0         A            3h  False
SARS.4h.A        SARS.4h.A      SARS  

<span style="color:red">There are also some built-in grandPy functions that add metadata, such as `compute_expression_percentage`:</span>

In [31]:
# missing from processing.py
# sars <- ComputeExpressionPercentage(sars,name = "viral_percentage",
#                                     genes = GeneInfo(sars,"Type")=="viral")
# ggplot(Coldata(sars),aes(Name,viral_percentage))+
#   geom_bar(stat="identity")+
#   RotatateAxisLabels()+
#   xlab(NULL)

Interestingly the 4sU-naive sample shows more viral gene expression, suggesting that 4sU had an effect on viral gene expression.

Since this is such an important control, there is also a specialized plotting function built into grandPy for that:

In [32]:
# plot_type_distribution is still missing from plot.py
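The idea behind an expression percentage (the share of each column's expression that comes from a given gene set) can be sketched with pandas on invented numbers. The sketch assumes that raw counts are summed per column; which data slot the grandPy function would use is not specified here, and the toy values carry no biological meaning.

```python
import pandas as pd

# Toy count matrix: two cellular genes and one viral gene (N)
counts = pd.DataFrame(
    {"Mock.1h.A": [90, 10, 0], "SARS.1h.A": [60, 20, 20]},
    index=["geneA", "geneB", "N"],
)
# Boolean vector marking the gene set of interest
is_viral = pd.Series([False, False, True], index=counts.index)

# Percentage of each column's total counts that stems from the marked genes
viral_percentage = 100 * counts.loc[is_viral].sum(axis=0) / counts.sum(axis=0)
print(viral_percentage)
```

The result has one value per column, so it fits naturally as an additional column of the column metadata table.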

There is a column in the `coldata` metadata table that has a special meaning: `Condition`. It is used by many functions as a default, e.g. to assign colors in the PCA plot or to model kinetics per condition. It can be accessed via its own attribute:

In [33]:
sars.condition # output levels?

['Mock',
 'Mock',
 'Mock',
 'Mock',
 'Mock',
 'Mock',
 'SARS',
 'SARS',
 'SARS',
 'SARS',
 'SARS',
 'SARS']

and it can be set either directly:

In [37]:
# new = sars.coldata["saved"] = sars.coldata["Condition"]
# new.coldata["Condition"] = ["control"] * 6 + ["infected"] * 6
# print(sars.coldata)

# the groundwork from calling compute_expression_percentage is still missing
# also still wrong, because immutability does not let us add columns

or from one or several columns of the metadata (here this is not really reasonable, but there are situations where combining more than one metadata column makes sense):

In [38]:
# Condition(sars)<-c("saved","Replicate")   # set it by combining two other columns from the Coldata
# Condition(sars)

# Condition(sars)<-"saved"                  # set it to one other column from the Coldata
# Condition(sars)
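What these assignments do to the metadata table can be shown with pandas on a toy table. The sketch uses `assign`, which returns a new table instead of modifying the existing one, and it assumes that combined conditions are joined with a dot; neither detail is confirmed grandPy behavior.

```python
import pandas as pd

coldata = pd.DataFrame(
    {
        "Condition": ["Mock", "Mock", "SARS", "SARS"],
        "Replicate": ["A", "B", "A", "B"],
    },
    index=["Mock.2h.A", "Mock.2h.B", "SARS.2h.A", "SARS.2h.B"],
)

# Save the current condition, then set a new one directly
saved = coldata.assign(saved=coldata["Condition"])
direct = saved.assign(Condition=["control", "control", "infected", "infected"])

# Set the condition by combining two metadata columns (dot separator is an assumption)
combined_values = direct["saved"].astype(str) + "." + direct["Replicate"].astype(str)
combined = direct.assign(Condition=combined_values)

# Set the condition back from a single metadata column
restored = combined.assign(Condition=combined["saved"])
print(combined["Condition"].tolist())
```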