# Running Rank 1 models

In this notebook, we run the rank 1 model on another dataset. In the 2022 competition, the rank 1 model excelled in the Multiome prediction task, so we focus on it. The CITE-seq model shares much of the same code and has a very similar architecture, so it should be possible to run it as well: substitute CITE-seq data for the Multiome data in this tutorial and change task_type to "cite" in the scripts.

In [1]:
import pandas as pd
import numpy as np
import scanpy as sc
from scanpy.preprocessing._utils import _get_mean_var
import muon


Download the 2021 competition data from GEO: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE194122

In [2]:
adata = sc.read_h5ad("../data/GSE194122_openproblems_neurips2021_multiome_BMMC_processed.h5ad")

The dataset stores both modalities, ATAC-seq and RNA, in one AnnData object. Let's split them.

In [3]:
adata

AnnData object with n_obs × n_vars = 69249 × 129921
    obs: 'GEX_pct_counts_mt', 'GEX_n_counts', 'GEX_n_genes', 'GEX_size_factors', 'GEX_phase', 'ATAC_nCount_peaks', 'ATAC_atac_fragments', 'ATAC_reads_in_peaks_frac', 'ATAC_blacklist_fraction', 'ATAC_nucleosome_signal', 'cell_type', 'batch', 'ATAC_pseudotime_order', 'GEX_pseudotime_order', 'Samplename', 'Site', 'DonorNumber', 'Modality', 'VendorLot', 'DonorID', 'DonorAge', 'DonorBMI', 'DonorBloodType', 'DonorRace', 'Ethnicity', 'DonorGender', 'QCMeds', 'DonorSmoker'
    var: 'feature_types', 'gene_id'
    uns: 'ATAC_gene_activity_var_names', 'dataset_id', 'genome', 'organism'
    obsm: 'ATAC_gene_activity', 'ATAC_lsi_full', 'ATAC_lsi_red', 'ATAC_umap', 'GEX_X_pca', 'GEX_X_umap'
    layers: 'counts'

In [4]:
adata.var["feature_types"].value_counts()

feature_types
ATAC    116490
GEX      13431
Name: count, dtype: int64

In this example notebook, we'll subset the data to 500 randomly chosen cells for faster testing.

In [5]:
random_cells = np.random.choice(adata.n_obs, size=500, replace=False)

In [6]:
adata = adata[random_cells, :].copy()
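
The subsample above changes on every run, because `np.random.choice` draws from NumPy's unseeded global state. A minimal sketch of a reproducible draw, assuming a fixed seed is acceptable for testing; the seed value and the helper name `sample_indices` are illustrative and not part of the original notebook:

```python
import numpy as np


def sample_indices(n_total, size, seed=0):
    """Return `size` distinct indices in [0, n_total), reproducibly."""
    rng = np.random.default_rng(seed)  # seed=0 is an arbitrary, hypothetical choice
    return rng.choice(n_total, size=size, replace=False)


# 69249 is the number of cells reported for the full dataset above
random_cells = sample_indices(69249, size=500)
```

The same helper could be reused for the feature subsampling further down, so that the RNA and ATAC subsets are reproducible as well.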

In [7]:
adata_rna = adata[:, adata.var["feature_types"] == "GEX"]
adata_rna

View of AnnData object with n_obs × n_vars = 500 × 13431
    obs: 'GEX_pct_counts_mt', 'GEX_n_counts', 'GEX_n_genes', 'GEX_size_factors', 'GEX_phase', 'ATAC_nCount_peaks', 'ATAC_atac_fragments', 'ATAC_reads_in_peaks_frac', 'ATAC_blacklist_fraction', 'ATAC_nucleosome_signal', 'cell_type', 'batch', 'ATAC_pseudotime_order', 'GEX_pseudotime_order', 'Samplename', 'Site', 'DonorNumber', 'Modality', 'VendorLot', 'DonorID', 'DonorAge', 'DonorBMI', 'DonorBloodType', 'DonorRace', 'Ethnicity', 'DonorGender', 'QCMeds', 'DonorSmoker'
    var: 'feature_types', 'gene_id'
    uns: 'ATAC_gene_activity_var_names', 'dataset_id', 'genome', 'organism'
    obsm: 'ATAC_gene_activity', 'ATAC_lsi_full', 'ATAC_lsi_red', 'ATAC_umap', 'GEX_X_pca', 'GEX_X_umap'
    layers: 'counts'

In [8]:
adata_atac = adata[:, adata.var["feature_types"] == "ATAC"]
adata_atac

View of AnnData object with n_obs × n_vars = 500 × 116490
    obs: 'GEX_pct_counts_mt', 'GEX_n_counts', 'GEX_n_genes', 'GEX_size_factors', 'GEX_phase', 'ATAC_nCount_peaks', 'ATAC_atac_fragments', 'ATAC_reads_in_peaks_frac', 'ATAC_blacklist_fraction', 'ATAC_nucleosome_signal', 'cell_type', 'batch', 'ATAC_pseudotime_order', 'GEX_pseudotime_order', 'Samplename', 'Site', 'DonorNumber', 'Modality', 'VendorLot', 'DonorID', 'DonorAge', 'DonorBMI', 'DonorBloodType', 'DonorRace', 'Ethnicity', 'DonorGender', 'QCMeds', 'DonorSmoker'
    var: 'feature_types', 'gene_id'
    uns: 'ATAC_gene_activity_var_names', 'dataset_id', 'genome', 'organism'
    obsm: 'ATAC_gene_activity', 'ATAC_lsi_full', 'ATAC_lsi_red', 'ATAC_umap', 'GEX_X_pca', 'GEX_X_umap'
    layers: 'counts'

Let's further reduce the number of features to 500 for RNA and 1000 for ATAC for faster testing.

In [9]:
adata_rna = adata_rna[:, np.random.choice(adata_rna.var_names, size=500, replace=False)]
adata_rna

View of AnnData object with n_obs × n_vars = 500 × 500
    obs: 'GEX_pct_counts_mt', 'GEX_n_counts', 'GEX_n_genes', 'GEX_size_factors', 'GEX_phase', 'ATAC_nCount_peaks', 'ATAC_atac_fragments', 'ATAC_reads_in_peaks_frac', 'ATAC_blacklist_fraction', 'ATAC_nucleosome_signal', 'cell_type', 'batch', 'ATAC_pseudotime_order', 'GEX_pseudotime_order', 'Samplename', 'Site', 'DonorNumber', 'Modality', 'VendorLot', 'DonorID', 'DonorAge', 'DonorBMI', 'DonorBloodType', 'DonorRace', 'Ethnicity', 'DonorGender', 'QCMeds', 'DonorSmoker'
    var: 'feature_types', 'gene_id'
    uns: 'ATAC_gene_activity_var_names', 'dataset_id', 'genome', 'organism'
    obsm: 'ATAC_gene_activity', 'ATAC_lsi_full', 'ATAC_lsi_red', 'ATAC_umap', 'GEX_X_pca', 'GEX_X_umap'
    layers: 'counts'

In [10]:
adata_atac = adata_atac[:, np.random.choice(adata_atac.var_names, size=1000, replace=False)]
adata_atac

View of AnnData object with n_obs × n_vars = 500 × 1000
    obs: 'GEX_pct_counts_mt', 'GEX_n_counts', 'GEX_n_genes', 'GEX_size_factors', 'GEX_phase', 'ATAC_nCount_peaks', 'ATAC_atac_fragments', 'ATAC_reads_in_peaks_frac', 'ATAC_blacklist_fraction', 'ATAC_nucleosome_signal', 'cell_type', 'batch', 'ATAC_pseudotime_order', 'GEX_pseudotime_order', 'Samplename', 'Site', 'DonorNumber', 'Modality', 'VendorLot', 'DonorID', 'DonorAge', 'DonorBMI', 'DonorBloodType', 'DonorRace', 'Ethnicity', 'DonorGender', 'QCMeds', 'DonorSmoker'
    var: 'feature_types', 'gene_id'
    uns: 'ATAC_gene_activity_var_names', 'dataset_id', 'genome', 'organism'
    obsm: 'ATAC_gene_activity', 'ATAC_lsi_full', 'ATAC_lsi_red', 'ATAC_umap', 'GEX_X_pca', 'GEX_X_umap'
    layers: 'counts'

Before modality prediction, apply QC to your dataset; otherwise the code will likely fail. Here, we simply remove cells with zero RNA counts or no open chromatin regions, and filter out constant features.

In [11]:
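# Flatten to 1-D: summing a sparse matrix along axis 1 returns an (n, 1) np.matrix
non_empty_cells = np.asarray((adata_rna.X.sum(axis=1) != 0) & (adata_atac.X.sum(axis=1) != 0)).ravel()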

In [12]:
adata_rna = adata_rna[non_empty_cells, :]
adata_atac = adata_atac[non_empty_cells, :]

In [13]:
rna_means, rna_vars = _get_mean_var(adata_rna.X)
adata_rna = adata_rna[:, rna_vars != 0]
adata_rna.shape

(499, 491)

In [14]:
atac_means, atac_vars = _get_mean_var(adata_atac.X)
adata_atac = adata_atac[:, atac_vars != 0]
adata_atac.shape

(499, 964)
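
The two filters above can also be written without scanpy's private `_get_mean_var` helper. The following is a sketch on a toy SciPy sparse matrix rather than the notebook's data; `nonempty_rows` and `nonconstant_cols` are hypothetical helper names. A column is treated as constant when its minimum equals its maximum, which is equivalent to zero variance:

```python
import numpy as np
from scipy import sparse


def nonempty_rows(X):
    """Boolean mask of rows (cells) whose entries do not sum to zero."""
    return np.asarray(X.sum(axis=1)).ravel() != 0


def nonconstant_cols(X):
    """Boolean mask of columns (features) that take more than one value."""
    X = sparse.csc_matrix(X)
    col_max = X.max(axis=0).toarray().ravel()  # implicit zeros are included
    col_min = X.min(axis=0).toarray().ravel()
    return col_max != col_min


# Toy matrix: the second cell is empty, the second feature is constant
toy = sparse.csr_matrix(np.array([
    [1.0, 0.0, 2.0],
    [0.0, 0.0, 0.0],
    [3.0, 0.0, 2.0],
]))
row_mask = nonempty_rows(toy)
col_mask = nonconstant_cols(toy)
```

Both masks are plain 1-D boolean arrays, so they can be used directly for indexing, as in `adata[row_mask, :]` or `adata[:, col_mask]`.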

Normalize the ATAC data with TF-IDF. Note that the RNA data must also be normalized (library-size normalization followed by log1p).

In [15]:
muon.atac.pp.tfidf(adata_atac)
adata_atac.X[:10, :10].toarray()


array([[ 0.      ,  0.      ,  0.      ,  0.      ,  0.      ,  0.      ,
         0.      ,  0.      ,  0.      ,  0.      ],
       [ 0.      ,  0.      ,  0.      ,  0.      ,  0.      ,  0.      ,
        16.419798,  0.      ,  0.      ,  0.      ],
       [ 0.      ,  0.      ,  0.      ,  0.      ,  0.      ,  0.      ,
        20.047827,  0.      ,  0.      ,  0.      ],
       [ 0.      ,  0.      ,  0.      ,  0.      ,  0.      ,  0.      ,
         0.      ,  0.      ,  0.      ,  0.      ],
       [ 0.      ,  0.      ,  0.      ,  0.      ,  0.      ,  0.      ,
         0.      ,  0.      ,  0.      ,  0.      ],
       [ 0.      ,  0.      ,  0.      ,  0.      ,  0.      ,  0.      ,
         0.      ,  0.      ,  0.      ,  0.      ],
       [ 0.      ,  0.      ,  0.      ,  0.      ,  0.      ,  0.      ,
         0.      ,  0.      ,  0.      ,  0.      ],
       [ 0.      ,  0.      ,  0.      ,  0.      ,  0.      ,  0.      ,
         0.      ,  0.      ,  0.    

In [16]:
adata_rna.write_h5ad("../data/adata_rna_subset.h5ad")
adata_atac.write_h5ad("../data/adata_atac_subset.h5ad")
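
If your own RNA matrix holds raw counts, it would need library-size and log1p normalization before the save step above. In scanpy this is usually done with `sc.pp.normalize_total` followed by `sc.pp.log1p`. The NumPy sketch below only illustrates the arithmetic; the function name `normalize_log1p` and the target of 10,000 counts per cell are assumptions, not values taken from the rank 1 code:

```python
import numpy as np


def normalize_log1p(counts, target_sum=1e4):
    """Scale each cell to `target_sum` total counts, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # keep all-zero cells at zero instead of dividing by 0
    return np.log1p(counts / totals * target_sum)


# Toy count matrix: 3 cells x 2 genes, the second cell is empty
raw = np.array([[1, 3], [0, 0], [5, 5]])
normalized = normalize_log1p(raw)
```

After this transformation, cells with different sequencing depths are on a comparable scale, which is what the note about library-size normalization refers to.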

# Format the data correctly

To run the model, we need to save our data to the following files:
- test_multi_inputs.h5
- train_multi_inputs.h5
- train_multi_targets.h5

If you want to use the CITE-seq model too, then you'll additionally need:
- test_cite_inputs.h5
- train_cite_inputs.h5
- train_cite_targets.h5

Additionally, we must save the test set labels to the "evaluation_ids.csv" file, in a "cell_id" column.
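
The notebook does not show this file being written, and the script output further down reports it as missing. A minimal sketch of one way to write it with pandas, assuming a single `cell_id` column is sufficient (the scripts may expect additional columns, which is not confirmed here); the cell IDs, the helper name `write_evaluation_ids`, and the temporary output directory are placeholders:

```python
import tempfile
from pathlib import Path

import pandas as pd


def write_evaluation_ids(cell_ids, out_dir):
    """Write the given cell IDs to evaluation_ids.csv with a `cell_id` column."""
    out_path = Path(out_dir) / "evaluation_ids.csv"
    pd.DataFrame({"cell_id": list(cell_ids)}).to_csv(out_path, index=False)
    return out_path


# Placeholder IDs; in the notebook these would be the obs_names of the test cells
example_ids = ["cell_a", "cell_b", "cell_c"]
out_path = write_evaluation_ids(example_ids, tempfile.mkdtemp())
```

In practice, `out_dir` would be the same data directory that holds the `.h5` files listed above.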

In [17]:
adata_rna.obs["DonorID"].value_counts()

DonorID
15078    123
19593     85
18303     62
12710     48
16710     45
28483     43
10886     42
13272     34
11466     11
28045      6
Name: count, dtype: int64

Let's use donor 15078 as the test set.

In [18]:
adata_rna.obs["split"] = "train"
adata_rna.obs.loc[adata_rna.obs["DonorID"] == 15078, "split"] = "test"
adata_atac.obs["split"] = adata_rna.obs["split"]


In [19]:
adata_rna[adata_rna.obs["split"] == "train"].write_h5ad("../data/train_multi_targets.h5")
adata_atac[adata_atac.obs["split"] == "train"].write_h5ad("../data/train_multi_inputs.h5")
adata_atac[adata_atac.obs["split"] == "test"].write_h5ad("../data/test_multi_inputs.h5")



We also need to save metadata to a file. Note that in the current implementation, the metadata columns **must** be named "day", "donor", "technology", "cell_type", and "cell_id", and "day" must be numeric.

In [20]:
adata_rna.obs.columns

Index(['GEX_pct_counts_mt', 'GEX_n_counts', 'GEX_n_genes', 'GEX_size_factors',
       'GEX_phase', 'ATAC_nCount_peaks', 'ATAC_atac_fragments',
       'ATAC_reads_in_peaks_frac', 'ATAC_blacklist_fraction',
       'ATAC_nucleosome_signal', 'cell_type', 'batch', 'ATAC_pseudotime_order',
       'GEX_pseudotime_order', 'Samplename', 'Site', 'DonorNumber', 'Modality',
       'VendorLot', 'DonorID', 'DonorAge', 'DonorBMI', 'DonorBloodType',
       'DonorRace', 'Ethnicity', 'DonorGender', 'QCMeds', 'DonorSmoker',
       'split'],
      dtype='object')

In [21]:
adata_rna.obs["day"] = adata_rna.obs["batch"].str[3:]
adata_rna.obs["donor"] = adata_rna.obs["DonorID"]
adata_rna.obs["technology"] = adata_rna.obs["Modality"]
adata.obs.index.name = "cell_id"

In [22]:
adata_rna.obs[["day", "donor", "technology", "cell_type"]].reset_index(names="cell_id").to_csv("../data/metadata.csv")
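
Because the column names and the numeric `day` requirement are easy to get wrong, a quick check before writing the file can help. This is a sketch only; `check_metadata` is a hypothetical helper and the toy table below is invented for illustration, not taken from the dataset:

```python
import pandas as pd

REQUIRED_COLUMNS = ["cell_id", "day", "donor", "technology", "cell_type"]


def check_metadata(df):
    """Raise ValueError if a required column is missing or `day` is not numeric."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing metadata columns: {missing}")
    # errors="raise" makes non-numeric day labels fail loudly
    pd.to_numeric(df["day"], errors="raise")
    return True


# Invented example rows; "day" holds digit strings, which still parse as numbers
toy = pd.DataFrame({
    "cell_id": ["c1", "c2"],
    "day": ["1", "2"],
    "donor": [15078, 19593],
    "technology": ["multiome", "multiome"],
    "cell_type": ["type_a", "type_b"],
})
ok = check_metadata(toy)
```

The check could be applied to the table produced by `reset_index(names="cell_id")` before calling `to_csv`.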

Small function to save the data in the appropriate format:

In [23]:
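def save_X_to_h5(adata, path):
    """Write adata.X as a dense DataFrame (cells x features) to an HDF5 file."""
    from pathlib import Path
    path = Path(path)
    # Make dense if sparse; .toarray() avoids the .A shortcut, deprecated in recent SciPy
    X = adata.X.toarray() if hasattr(adata.X, "toarray") else adata.X
    df = pd.DataFrame(X, index=adata.obs_names, columns=adata.var_names)
    df.to_hdf(
        path,
        key=path.name,  # file name including extension, e.g. "train_multi_targets.h5"
        mode="w",
        format="fixed",
    )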

save_X_to_h5(adata_rna[adata_rna.obs["split"] == "train"], "../data/train_multi_targets.h5")
save_X_to_h5(adata_atac[adata_atac.obs["split"] == "train"], "../data/train_multi_inputs.h5")
save_X_to_h5(adata_atac[adata_atac.obs["split"] == "test"], "../data/test_multi_inputs.h5")



In [24]:
%%bash

export DATA_DIR=/Users/vladimir.shitov/Documents/programming/OpenProblems2022Analysis/data/

cd rank1/open-problems-multimodal/
python3 script/make_compressed_dataset.py --data_dir ${DATA_DIR}
python3 script/make_additional_files.py --data_dir ${DATA_DIR}
python3 script/make_compressed_dataset.py --data_dir ${DATA_DIR}
python3 script/train_model.py --data_dir ${DATA_DIR} --task_type multi 

File /Users/vladimir.shitov/Documents/programming/OpenProblems2022Analysis/data/evaluation_ids.csv does not exist
File /Users/vladimir.shitov/Documents/programming/OpenProblems2022Analysis/data/sample_submission.csv does not exist
123
376
376
File /Users/vladimir.shitov/Documents/programming/OpenProblems2022Analysis/data/test_cite_inputs.h5 does not exist
File /Users/vladimir.shitov/Documents/programming/OpenProblems2022Analysis/data/train_cite_inputs.h5 does not exist
File /Users/vladimir.shitov/Documents/programming/OpenProblems2022Analysis/data/train_cite_targets.h5 does not exist
Some citeseq files don't exist, not making citeseq cell statistics
File /Users/vladimir.shitov/Documents/programming/OpenProblems2022Analysis/data/evaluation_ids.csv does not exist
File /Users/vladimir.shitov/Documents/programming/OpenProblems2022Analysis/data/sample_submission.csv does not exist
123
376
376
File /Users/vladimir.shitov/Documents/programming/OpenProblems2022Analysis/data/test_cite_inputs.h5

The results are saved as a pickled NumPy array. Let's read them and compute the correlation score against the ground truth. We can import the metric function directly from the competitor's code:

In [25]:
import sys
sys.path.append("rank1/open-problems-multimodal")

from ss_opm.metric.correlation_score import correlation_score

In [26]:
import pickle

with open("rank1/open-problems-multimodal/result/multimodal_pred.pickle", "rb") as f:
    y_pred = pickle.load(f)

In [27]:
y_pred.shape

(123, 491)

In [28]:
y_pred

array([[ 0.1396377 , -0.40570018,  2.36897   , ..., -0.5240119 ,
        -0.24955416, -0.23399031],
       [ 0.34854636, -0.43715882,  2.7376328 , ..., -0.46387258,
        -0.51144576,  0.36972368],
       [ 0.32831118, -0.50022084,  2.745265  , ..., -0.53843397,
        -0.413774  ,  0.25159332],
       ...,
       [ 0.08649594, -0.5026955 ,  2.5488021 , ..., -0.40476736,
        -0.3691034 ,  0.6495405 ],
       [ 0.17613435, -0.5445092 ,  2.459973  , ..., -0.53740096,
        -0.46087798,  0.42784047],
       [-0.06657705, -0.5384305 ,  2.7005804 , ..., -0.5858595 ,
        -0.36725417,  0.16582374]], dtype=float32)

In [29]:
y_true = adata_rna[adata_rna.obs["split"] == "test"].X
correlation_score(y_true, y_pred)

0.3452738604287309
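
To make the number above easier to interpret, here is a sketch of what such a score computes, assuming it is the Pearson correlation between true and predicted values within each cell, averaged over cells, as in the 2022 competition's evaluation. This is not the code in `ss_opm.metric.correlation_score`, whose handling of edge cases may differ, and `mean_rowwise_pearson` is a hypothetical name. It expects dense arrays, so a sparse `adata.X` would need `.toarray()` first:

```python
import numpy as np


def mean_rowwise_pearson(y_true, y_pred):
    """Mean over rows (cells) of the Pearson correlation between truth and prediction."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    t = y_true - y_true.mean(axis=1, keepdims=True)
    p = y_pred - y_pred.mean(axis=1, keepdims=True)
    denom = np.sqrt((t ** 2).sum(axis=1) * (p ** 2).sum(axis=1))
    # Rows with zero variance have no defined correlation; this sketch scores them as 0
    corr = np.divide((t * p).sum(axis=1), denom, out=np.zeros_like(denom), where=denom > 0)
    return float(corr.mean())


# Toy example: the first row is perfectly correlated, the second perfectly anti-correlated
truth = np.array([[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]])
pred = np.array([[2.0, 4.0, 6.0], [3.0, 2.0, 1.0]])
score = mean_rowwise_pearson(truth, pred)
```

Because the correlation is computed per cell, the score rewards getting the relative ordering of genes right within each cell rather than matching absolute expression values.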