# TDC 201: TDC-2 Resource and Multimodal Single-Cell Retrieval API

TDC-2 introduces a **multimodal dataset retrieval API providing access to >1,000 datasets**, powered by the TDC-2 resource model. TDC-2 enhances the **contextualization** (i.e., single-cell integration) and **multimodality** (i.e., sequence data and knowledge graphs) of datasets in The Commons across ML tasks. Most notably, TDC-2 introduces a multimodal healthy-diseased retrieval API exposing a corpus of nearly 50 million cells across 700 datasets, integrates various external APIs, and adds a retrieval API for a precision-medicine-oriented knowledge graph. This tutorial covers the CellXGene retrieval API introduced as part of the resource model.

In [None]:
!pip install PyTDC==0.4.14




**The CellXGene Census Discovery Resource**

TDC-2 leverages the SOMA (Stack of Matrices, Annotated) API, adopts TileDB-SOMA for modeling sets of 2D annotated matrices with measurements of features across observations, and enables memory-efficient querying of multiple distinct single-cell modalities (i.e., scRNA-seq, snRNA-seq), across healthy and diseased samples, with tabular annotations of cells, samples, and patients the samples come from.

We provide two APIs.

First, a standard resource-based querying API for simple cross-compatibility with the [CellXGene Census Discover Python API](https://chanzuckerberg.github.io/cellxgene-census/python-api.html).

Second, we've developed a TDC Dataloader over the Census corpus for ease of building large cell atlases based on a reference dataset.

Below is a first example of using the Census Resource.
For more information on the data schema and TileDB-SOMA, see the [CellXGene Census Discovery Data Schema Documentation](https://chanzuckerberg.github.io/cellxgene-census/cellxgene_census_docsite_schema.html#).

In [None]:
from pandas import DataFrame
from tdc.resource import cellxgene_census

In [None]:
# initialize Census Resource and query filters
resource = cellxgene_census.CensusResource()
gene_value_filter = "feature_id in ['ENSG00000161798', 'ENSG00000188229']"
gene_column_names = ["feature_name", "feature_length"]
cell_value_filter = "tissue == 'brain' and sex == 'male'"
cell_column_names = ["assay", "cell_type", "tissue"]
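
The filter strings above are plain expressions over column names. When filters are assembled from user input, a small helper can keep the quoting consistent. The sketch below is illustrative only: `build_value_filter` is a hypothetical helper, not part of PyTDC or cellxgene-census, and it covers only equality and membership tests joined by `and`.

```python
def build_value_filter(conditions):
    """Join column conditions into a value_filter string.

    Hypothetical helper (not part of PyTDC): a scalar value becomes an
    equality test, a list or tuple becomes a membership test, and all
    clauses are joined with 'and'. Values containing quotes are not handled.
    """
    clauses = []
    for column, value in conditions.items():
        if isinstance(value, (list, tuple)):
            clauses.append(f"{column} in {list(value)!r}")
        else:
            clauses.append(f"{column} == {value!r}")
    return " and ".join(clauses)


cell_filter = build_value_filter({"tissue": "brain", "sex": "male"})
gene_filter = build_value_filter(
    {"feature_id": ["ENSG00000161798", "ENSG00000188229"]})
print(cell_filter)
print(gene_filter)
```

The two printed strings match the `cell_value_filter` and `gene_value_filter` literals defined in the cell above.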

Below is an example of using the Census Resource to obtain **cell metadata from the obs** matrix.

In [None]:
# Obtaining cell metadata from the cellxgene census in pandas format
obsdf = resource.get_cell_metadata(
    value_filter=cell_value_filter,
    column_names=cell_column_names,
    fmt="pandas")
assert isinstance(obsdf, DataFrame)
obsdf.head()

The "stable" release is currently 2023-12-15. Specify 'census_version="2023-12-15"' in future calls to open_soma() to ensure data consistency.


| | assay | cell_type | tissue | sex |
|---|---|---|---|---|
| 0 | 10x 3' v2 | classical monocyte | brain | male |
| 1 | 10x 3' v2 | CD4-positive, alpha-beta T cell | brain | male |
| 2 | 10x 3' v2 | malignant cell | brain | male |
| 3 | 10x 3' v2 | macrophage | brain | male |
| 4 | 10x 3' v2 | B cell | brain | male |


Below is an example of using the Census Resource to obtain **gene metadata from the var** matrix.

In [None]:
import pyarrow as pa

# Obtaining gene metadata from the cellxgene census in pyarrow format
varpyarrow = resource.get_gene_metadata(
    value_filter=gene_value_filter,
    column_names=gene_column_names,
    fmt="pyarrow",
    measurement_name="RNA")
assert isinstance(varpyarrow, pa.Table)
varpyarrow.slice(0,5)



pyarrow.Table
feature_name: large_string
feature_length: int64
feature_id: large_string
----
feature_name: [["AQP5","TUBB4B"]]
feature_length: [[1884,2037]]
feature_id: [["ENSG00000161798","ENSG00000188229"]]

Below is an example of using the Census Resource to retrieve **raw RNA data for a specific tissue and sex from the counts (X)** matrix. To keep querying memory-efficient, this returns a Python generator by default. For details on converting to other data formats, see the [resource source code](https://github.com/mims-harvard/TDC/blob/main/tdc/resource/cellxgene_census.py#L86).

In [None]:
# Query the raw counts of the RNA measurement matrix.
# The call returns a Python generator; each next() yields one chunk.
X = resource.query_measurement_matrix(
    measurement_name="RNA",
    fmt="pyarrow",
    value_adjustment="raw",
    value_filter="tissue == 'brain' and sex == 'male'")
next(X).slice(0,3)



pyarrow.Table
soma_dim_0: int64
soma_dim_1: int64
soma_data: float
----
soma_dim_0: [[33636989,33636989,33636989]]
soma_dim_1: [[111,123,128]]
soma_data: [[1,2,1]]
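
Because the query yields one chunk at a time, aggregates can be computed without holding the whole matrix in memory. The sketch below illustrates the pattern on a stand-in generator. `fake_chunks` and `total_counts_per_cell` are hypothetical names; the stand-in yields plain dicts shaped like the `soma_dim_0` / `soma_dim_1` / `soma_data` columns above. Only the first chunk reuses values from the output shown; the second chunk is invented for illustration.

```python
from collections import defaultdict
from itertools import islice


def fake_chunks():
    """Stand-in for the generator returned by query_measurement_matrix."""
    # First chunk: values taken from the output shown above.
    yield {"soma_dim_0": [33636989, 33636989, 33636989],
           "soma_dim_1": [111, 123, 128],
           "soma_data": [1.0, 2.0, 1.0]}
    # Second chunk: invented values, for illustration only.
    yield {"soma_dim_0": [33636989, 33636990],
           "soma_dim_1": [140, 111],
           "soma_data": [4.0, 5.0]}


def total_counts_per_cell(chunks, max_chunks=None):
    """Sum counts per cell index, reading at most max_chunks chunks."""
    totals = defaultdict(float)
    # islice stops early, so unread chunks are never materialized.
    for chunk in islice(chunks, max_chunks):
        for cell, value in zip(chunk["soma_dim_0"], chunk["soma_data"]):
            totals[cell] += value
    return dict(totals)


totals = total_counts_per_cell(fake_chunks())
print(totals)
```

With the real generator, each chunk would be a `pyarrow.Table`, so a column such as `chunk["soma_dim_0"]` would first need converting, for example with `to_pylist()`.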

Below is an example of querying the **feature presence** matrix.

In [None]:
# Get a slice of the feature_presence matrix
sparse_tensor = resource.get_feature_dataset_presence_matrix(
    upper=5,
    lower=0,
    measurement_name="RNA",
    fmt="pyarrow",
    todense=False)

n = 3
sparse_tensor.to_numpy()[:n]



(array([[1],
        [1],
        [1],
        ...,
        [1],
        [1],
        [1]], dtype=uint8),
 array([[    0,     0],
        [    0,     1],
        [    0,     2],
        ...,
        [    5, 31279],
        [    5, 31307],
        [    5, 31309]]))
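
The tuple above is a coordinate (COO) encoding: the first array holds the non-zero values and the second holds one `[row, column]` pair per value. Assuming that rows index datasets and columns index features (our reading of the presence matrix, which the output itself does not state), the pairs can be grouped into one set of features per dataset. The sketch below uses only the entries visible in the truncated output; `features_by_dataset` is a hypothetical helper.

```python
from collections import defaultdict


def features_by_dataset(values, coords):
    """Group column indices by row index for every non-zero COO entry."""
    present = defaultdict(set)
    for value, (row, col) in zip(values, coords):
        if value:
            present[row].add(col)
    return dict(present)


# Entries visible in the truncated output above.
values = [1, 1, 1, 1, 1, 1]
coords = [(0, 0), (0, 1), (0, 2), (5, 31279), (5, 31307), (5, 31309)]

present = features_by_dataset(values, coords)
print(present[0])
print(31307 in present[5])
```

For the real result, the two NumPy arrays could be converted first, for example with `values.ravel().tolist()` and `coords.tolist()`.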

**The CellXGene Census Discovery DataLoader**

The CellXGene DataLoader API allows you to build **large-scale cell atlases based on a reference dataset**. It retrieves all cells containing non-zero counts for the genes present in the reference dataset. The loader returns a Python generator that yields cell and gene indices, which you can use to retrieve any desired metadata, together with the expression count for each cell-gene pair. To maintain consistency with the standard TDC-2 dataloader API, each yielded item is a pandas DataFrame.

In [None]:
# How to use the CellXGene Dataloader
from tdc.multi_pred.single_cell import CellXGene
from pandas import DataFrame
dataloader = CellXGene(name="Tabula Sapiens - Blood")
gen = dataloader.get_data(
    value_filter="cell_type == 'T cell' and sex == 'male'")
df = next(gen)
assert isinstance(df, DataFrame)
assert len(df) > 0
print(df.head())


   cell_idx  gene_idx  expression
0   3029780       143         1.0
1   3029780       145         2.0
2   3029780       219         3.0
3   3029780       254         1.0
4   3029780       267         1.0
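
Each yielded frame is in long format, with one row per cell-gene pair. Many downstream tools expect a cell-by-gene matrix instead. A minimal sketch of that conversion with pandas follows. The first five rows reproduce the output above, while the last two rows (cell 3029781) are invented so that the result has more than one row.

```python
import pandas as pd

long_df = pd.DataFrame({
    # Five rows from the output above, then two invented rows.
    "cell_idx": [3029780] * 5 + [3029781] * 2,
    "gene_idx": [143, 145, 219, 254, 267, 143, 300],
    "expression": [1.0, 2.0, 3.0, 1.0, 1.0, 4.0, 2.0],
})

# Rows become cells, columns become genes; unobserved pairs are filled with 0.
wide = long_df.pivot_table(index="cell_idx", columns="gene_idx",
                           values="expression", aggfunc="sum", fill_value=0.0)
print(wide)
```

A dense wide table grows quickly with the number of genes, so for atlas-scale data a sparse representation is likely the better choice.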
