[![Jupyter Notebook](https://img.shields.io/badge/Source%20on%20GitHub-orange)](https://github.com/laminlabs/lamin-usecases/blob/main/docs/bulkrna.ipynb)
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/laminlabs/lamin-usecases/main?labpath=lamin-usecases%2Fdocs%2Fbulkrna.ipynb)

# Bulk RNA-seq

```{note}

More comprehensive examples are provided for these data types:

- {doc}`scrna`
- {doc}`facs`

```

## Setup

In [None]:
!lamin init --storage test-bulkrna --schema bionty

In [None]:
import lamindb as ln
from pathlib import Path
import bionty as bt
import pandas as pd
import anndata as ad

## Ingest data

### Access ![](https://img.shields.io/badge/Access-10b981)

We start by simulating an [nf-core RNA-seq](https://nf-co.re/rnaseq) run, which yields a count matrix artifact.

(See {doc}`docs:nextflow` for running this with Nextflow.)

In [None]:
# pretend we're running a bulk RNA-seq pipeline
ln.track(transform=ln.Transform(name="nf-core RNA-seq", reference="https://nf-co.re/rnaseq"))
# create a directory for its output
Path("./test-bulkrna/output_dir").mkdir(exist_ok=True)
# get the count matrix
path = ln.dev.datasets.file_tsv_rnaseq_nfcore_salmon_merged_gene_counts(
    populate_registries=True
)
# move it into the output directory
path = path.rename(f"./test-bulkrna/output_dir/{path.name}")
# register it
ln.Artifact(path, description="Merged Bulk RNA counts").save()

### Transform ![](https://img.shields.io/badge/Transform-10b981)

In [None]:
ln.transform.stem_uid = "s5V0dNMVwL9i"
ln.transform.version = "0"
ln.track()

Let's query the artifact:

In [None]:
artifact = ln.Artifact.filter(description="Merged Bulk RNA counts").one()

In [None]:
df = artifact.load()

If we look at it, we realize that it deviates substantially from the _tidy data_ standard [Wickham14](https://www.jstatsoft.org/article/view/v059i10), from the conventions of statistics & machine learning [Hastie09](https://link.springer.com/book/10.1007/978-0-387-84858-7), [Murphy12](https://probml.github.io/pml-book/book0.html), and from the major Python & R data packages.

Variables are not in columns and observations are not in rows:

In [None]:
df

Let's change that and move observations into rows:

In [None]:
df = df.T

df

Now it's clear that the first two rows are in fact not observations, but descriptions of the variables (or features) themselves.

Let's create an AnnData object to model this. First, create a dataframe for the variables:

In [None]:
var = pd.DataFrame({"gene_name": df.loc["gene_name"].values}, index=df.loc["gene_id"])

In [None]:
var.head()

Now, let's create an AnnData:

In [None]:
# we're also fixing the datatype here, which was string in the tsv
adata = ad.AnnData(df.iloc[2:].astype("float32"), var=var)

adata

The AnnData object is in tidy form and complies with conventions of statistics and machine learning:

In [None]:
adata.to_df()
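
The reshaping above can be traced on a toy table using pandas alone. The following is a minimal sketch, not part of the original pipeline: the gene IDs, gene names, sample names, and counts are made up, and it assumes the loaded table has `gene_id` and `gene_name` columns followed by one column per sample, as the code above implies.

```python
import pandas as pd

# hypothetical toy table: one row per gene, one column per sample (all values made up)
toy = pd.DataFrame(
    {
        "gene_id": ["G1", "G2", "G3"],
        "gene_name": ["geneA", "geneB", "geneC"],
        "sample_1": [10, 0, 3],
        "sample_2": [7, 2, 5],
    }
)

# transpose so that samples (observations) become rows
toy_t = toy.T

# the first two rows describe the variables, so they move into a separate table
toy_var = pd.DataFrame(
    {"gene_name": toy_t.loc["gene_name"].values}, index=toy_t.loc["gene_id"]
)

# the remaining rows are the observations; cast them to a numeric dtype
toy_counts = toy_t.iloc[2:].astype("float32")
# label the columns with the gene IDs
toy_counts.columns = toy_var.index

print(toy_counts)
```

Here `toy_counts` plays the role of the matrix passed to `ad.AnnData`, and `toy_var` the role of `var`.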

### Validate ![](https://img.shields.io/badge/Validate-10b981) 

Let's create an Artifact object from this AnnData.

Almost all gene IDs are validated:

In [None]:
genes = bt.Gene.from_values(
    adata.var.index,
    bt.Gene.stable_id,
    organism="saccharomyces cerevisiae",  # or set globally with bt.settings.organism
)

In [None]:
# also register the 2 non-validated genes obtained from Bionty
ln.save(genes)
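
Conceptually, validation splits the identifiers into those found in a reference registry and those that are not. The sketch below illustrates only this idea in plain Python; the function `split_validated` and the identifiers are hypothetical, and this is not how Bionty implements validation.

```python
def split_validated(values, reference):
    """Split values into those present in the reference set and those missing from it."""
    reference = set(reference)
    validated = [v for v in values if v in reference]
    non_validated = [v for v in values if v not in reference]
    return validated, non_validated


# made-up identifiers, for illustration only
reference_ids = {"ID1", "ID2", "ID3"}
observed_ids = ["ID1", "ID3", "ID9"]

validated, non_validated = split_validated(observed_ids, reference_ids)
print(f"{len(validated)} validated, {len(non_validated)} not validated: {non_validated}")
```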

### Register ![](https://img.shields.io/badge/Register-10b981)

In [None]:
efs = bt.ExperimentalFactor.lookup()
organism = bt.Organism.lookup()
features = ln.Feature.lookup()

In [None]:
curated_file = ln.Artifact.from_anndata(
    adata,
    description="Curated bulk RNA counts"
)

Let's save this artifact:

In [None]:
curated_file.save()

Link to validated metadata records:

In [None]:
curated_file.features.add_from_anndata(var_field=bt.Gene.stable_id, organism="saccharomyces cerevisiae")

In [None]:
curated_file.labels.add(efs.rna_seq, features.assay)
curated_file.labels.add(organism.saccharomyces_cerevisiae, features.organism)

In [None]:
curated_file.describe()

## Query data

We have two artifacts in the artifact registry:

In [None]:
ln.Artifact.df()

In [None]:
curated_file.view_lineage()

Let's query by gene:

In [None]:
genes = bt.Gene.lookup()

In [None]:
genes.spo7

In [None]:
# a gene set containing SPO7
feature_set = ln.FeatureSet.filter(genes=genes.spo7).first()

In [None]:
# artifacts that link to this feature set
ln.Artifact.filter(feature_sets=feature_set).df()
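
The two-step query above first finds a feature set that contains the gene and then finds the artifacts linked to that feature set. A minimal sketch of this lookup with plain dictionaries follows; all names and the data layout are hypothetical and only mirror the logic, not LaminDB's actual registries.

```python
# hypothetical in-memory stand-ins for the registries
feature_sets = {
    "fs1": {"SPO7", "GENE_A", "GENE_B"},
    "fs2": {"GENE_C"},
}
artifact_links = {
    "Curated bulk RNA counts": ["fs1"],
    "Some other artifact": ["fs2"],
}

# step 1: the first feature set that contains the gene (None if there is no match)
gene = "SPO7"
matching_set = next(
    (name for name, genes in feature_sets.items() if gene in genes), None
)

# step 2: artifacts that link to this feature set
matching_artifacts = [
    artifact for artifact, linked in artifact_links.items() if matching_set in linked
]
print(matching_set, matching_artifacts)
```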

In [None]:
# clean up test instance
!lamin delete --force test-bulkrna
!rm -r test-bulkrna