# Preprocess `Sci-Plex 3` data for a `PerturbNet` model


The data was generated by Srivatsan et al. [[1]](https://www.science.org/doi/10.1126/science.aax6234).
The dataset consists of three cancer cell lines (A549, MCF7, K562), each treated with 188 compounds that have different mechanisms of action. Each compound is applied at 4 dosages (10, 100, 1000, and 10000 nM), alongside a vehicle control.

The data was downloaded from the PerturbNet example data [[2]](https://doi.org/10.1101/2022.07.20.500854), file [sciPlex3_whole_filtered_NormBYHighGenes_processed.h5ad](https://www.dropbox.com/sh/3nk6qk1653h2y1v/AAAnPOX8jTw440MucpNFODDsa/perturbnet_sciPlex_example/data/sciPlex3_whole_filtered_NormBYHighGenes_processed.h5ad), and adapted to the split defined in the biolord analysis.

PerturbNet model training and evaluation follow the guidelines provided in the package implementation, [PerturbNet](https://github.com/welch-lab/PerturbNet).



[[1] Srivatsan, S. R., McFaline-Figueroa, J. L., Ramani, V., Saunders, L., Cao, J., Packer, J., ... & Trapnell, C. (2020). Massively multiplex chemical transcriptomics at single-cell resolution. Science, 367(6473), 45-51.](https://www.science.org/doi/10.1126/science.aax6234)

[[2] Yu, H. and Welch, J.D., 2022. PerturbNet predicts single-cell responses to unseen chemical and genetic perturbations. bioRxiv, pp.2022-07.](https://doi.org/10.1101/2022.07.20.500854)


In [1]:
import sys
import os

import scanpy as sc
import pandas as pd
import numpy as np

In [2]:
sys.path.append("../../../")
sys.path.append("../../../utils/")
from paths import DATA_DIR, FIG_DIR

## Set parameters

In [3]:
DATA_DIR_LCL = str(DATA_DIR) + "/perturbations/sciplex3/"

## Load reference files

In [4]:
adata_ref = sc.read(
    DATA_DIR_LCL + "sciplex3_biolord.h5ad",
    backup_url="https://figshare.com/ndownloader/files/39324305",
)

In [5]:
ref_obs = pd.Index([obs.split("-")[0] for obs in adata_ref.obs_names])
adata_ref.obs_names = ref_obs
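
The cell above drops everything after the first `-` in each observation name so that the reference names can be matched against the original data. A minimal sketch of that step on made-up barcodes (the names below are hypothetical; the real `obs_names` may look different), with a uniqueness check added as a precaution because label-based lookups later assume one row per name:

```python
import pandas as pd

# Hypothetical observation names with a "-<suffix>" appended, for illustration only.
obs_names = pd.Index(["A01_E09_RT_BC_1-0", "A01_E09_RT_BC_2-0", "B02_F10_RT_BC_7-1"])

# Same operation as in the notebook: keep the part before the first "-".
stripped = pd.Index([name.split("-")[0] for name in obs_names])

# Stripping a suffix can merge names that were distinct before, so check uniqueness.
if not stripped.is_unique:
    raise ValueError("Stripped observation names are not unique")

print(list(stripped))
```

Note that `split("-")[0]` would also truncate any name that contains a hyphen in the barcode itself, so this only works when the hyphen appears solely as the suffix separator.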

## Load data

In [6]:
adata_orig = sc.read(
    os.path.join(DATA_DIR_LCL, "sciPlex3_whole_filtered_NormBYHighGenes_processed.h5ad")
)
# Keep only the cells that are also present in the reference (biolord) data.
adata = adata_orig[adata_orig.obs_names.isin(ref_obs)].copy()
# Copy the split and condition labels from the reference, aligned by cell name.
adata.obs["split_ood"] = adata_ref.obs.loc[adata.obs_names, "split_ood"]
adata.obs["cov_drug_dose_name"] = adata_ref.obs.loc[adata.obs_names, "cov_drug_dose_name"]
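
The label transfer above relies on pandas aligning rows by index label rather than by position. A small sketch with toy tables (all cell names and values here are invented for illustration) shows that the reference rows are picked up by name even when the order differs:

```python
import pandas as pd

# Toy stand-in for adata_ref.obs: split labels indexed by cell name.
ref_obs_df = pd.DataFrame(
    {"split_ood": ["train", "ood", "test"]},
    index=["cell_1", "cell_2", "cell_3"],
)

# Toy stand-in for adata.obs: a subset of cells, in a different order.
obs_df = pd.DataFrame({"treatment": ["S_a", "S_b"]}, index=["cell_3", "cell_1"])

# .loc selects reference rows by label, in the order given by obs_df.index,
# and the assignment then aligns on the shared index.
obs_df["split_ood"] = ref_obs_df.loc[obs_df.index, "split_ood"]

print(obs_df)
```

If a cell name were missing from the reference, `.loc` would raise a `KeyError`, which is why the data is first restricted to names present in `ref_obs`.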


In [8]:
# Remove the 9 unseen (OOD) drugs, identified via the reference adata, and cells with treatment "S0000"
input_ltpm_label = adata.obs.copy()
kept_indices = list(np.where((input_ltpm_label["split_ood"] != "ood") & (input_ltpm_label["treatment"] != "S0000"))[0])
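
The filter above combines two boolean conditions and converts the resulting mask into positional indices. A minimal sketch on a toy table (the treatment IDs other than `S0000` are hypothetical placeholders) illustrates which rows survive:

```python
import numpy as np
import pandas as pd

# Toy stand-in for adata.obs; values are invented for illustration.
obs_df = pd.DataFrame(
    {
        "split_ood": ["train", "ood", "test", "train"],
        "treatment": ["S_a", "S_b", "S0000", "S_c"],
    }
)

# A row is kept only if it is not in the OOD split and its treatment is not "S0000".
mask = (obs_df["split_ood"] != "ood") & (obs_df["treatment"] != "S0000")

# np.where on a 1-D boolean mask returns a tuple; element 0 holds the positions.
kept_indices = list(np.where(mask)[0])

print(kept_indices)
```

Positional indices are what `adata[kept_indices, :]` expects; passing the boolean mask directly (`adata[mask.to_numpy(), :]`) would be an equivalent alternative.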


In [9]:
adata_train = adata[kept_indices, :].copy()
# Write the training subset to disk in .h5ad format.
adata_train.write_h5ad(DATA_DIR_LCL + "adata_train.h5ad")