## (1) Load Processed Data from the Paper

Currently, the Norman, Adamson, and Dixit datasets are supported.

In [None]:
import sys
sys.path.append('../../')

from omnicell.models.gears.pertdata import PertData
from omnicell.data.catalogue import Catalogue


## (2) Create your own Perturb-Seq data
Prepare a scanpy `adata` object such that:
1. `adata.obs` has `condition` and `cell_type` columns, where `condition` is the perturbation name for each cell. Control cells are labeled `ctrl`, single perturbations are labeled `A+ctrl` or `ctrl+A`, and combination perturbations are labeled `A+B`.
2. `adata.var` has a `gene_name` column containing the gene symbol of each gene.
3. `adata.X` stores the post-perturbation gene expression.
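
The label convention in item 1 can be sketched with two small helpers. This is an illustration only: `make_condition` and `parse_condition` are hypothetical names, not part of GEARS or omnicell, and the sketch assumes that gene symbols never contain a `+` character.

```python
CTRL = "ctrl"  # label used for control cells


def make_condition(genes):
    """Build a condition label from the genes perturbed in one cell."""
    genes = [g for g in genes if g != CTRL]
    if not genes:
        return CTRL                   # control cell
    if len(genes) == 1:
        return f"{genes[0]}+{CTRL}"   # single perturbation
    if len(genes) == 2:
        return "+".join(genes)        # combination perturbation
    raise ValueError("at most two perturbed genes are expected per cell")


def parse_condition(condition):
    """Return the list of perturbed genes encoded in a condition label."""
    return [g for g in condition.split("+") if g != CTRL]


print(make_condition(["KLF1"]))      # single perturbation label
print(parse_condition("ctrl+KLF1"))  # genes recovered from a label
```

Both `A+ctrl` and `ctrl+A` parse to the same gene list, which is why either order is acceptable for single perturbations.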

Here is an example using the `repogle_k562_essential_raw` dataset from the catalogue.

In [None]:
import scanpy as sc
dd = Catalogue.get_dataset_details('repogle_k562_essential_raw')
adata = sc.read(dd.path)
adata

In [5]:
adata.obs["condition"] = adata.obs["gene"]


In [6]:
# Relabel the control condition (dd.control) as "ctrl" and every other perturbation p as "p+ctrl"

perts = [p for p in adata.obs["condition"].unique() if p != dd.control]
adata.obs["condition"] = adata.obs["condition"].replace({dd.control:"ctrl"})
adata.obs["condition"] = adata.obs["condition"].replace({p:p+"+ctrl" for p in perts})
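
After relabeling, it can be worth checking that every label follows the expected format before handing the data to GEARS. The following is a minimal sketch in plain Python; `invalid_conditions` is a hypothetical helper, not a GEARS or omnicell function, and it assumes gene symbols contain no `+`.

```python
CTRL = "ctrl"  # label used for control cells


def invalid_conditions(conditions):
    """Return the labels that are not 'ctrl', 'A+ctrl', 'ctrl+A', or 'A+B'."""
    bad = []
    for cond in conditions:
        parts = str(cond).split("+")
        if len(parts) == 1:
            ok = parts[0] == CTRL                          # only 'ctrl' may stand alone
        elif len(parts) == 2:
            ok = all(parts) and parts != [CTRL, CTRL]      # no empty halves, not 'ctrl+ctrl'
        else:
            ok = False                                     # more than two parts
        if not ok:
            bad.append(cond)
    return bad


print(invalid_conditions(["ctrl", "KLF1+ctrl", "KLF1", "A+B+C"]))
```

In the notebook this could be called as `invalid_conditions(adata.obs["condition"].unique())`; an empty list means every label matches the format.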



In [None]:
adata.obs

### Suggested normalization

For raw count data, we recommend the following normalization, followed by subsetting to the 5000 most variable genes:

In [8]:
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata,n_top_genes=5000, subset=True)
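
To make the first two steps concrete, here is a rough plain-Python sketch of what library-size normalization followed by `log1p` does to a small count matrix. It is not scanpy's implementation: it assumes the target total is the median of the nonzero per-cell totals (our reading of the `normalize_total` default when no `target_sum` is given), and `normalize_and_log` is a hypothetical name.

```python
import math
from statistics import median


def normalize_and_log(counts):
    """Scale each cell (row) to a common total, then apply log(1 + x)."""
    totals = [sum(row) for row in counts]
    # Assumed target: median of per-cell totals, ignoring empty cells
    target = median(t for t in totals if t > 0)
    result = []
    for row, total in zip(counts, totals):
        if total == 0:
            result.append([0.0 for _ in row])  # leave empty cells at zero
        else:
            result.append([math.log1p(v * target / total) for v in row])
    return result


# Three cells with the same composition but different sequencing depth
counts = [[1, 3], [2, 6], [10, 30]]
for row in normalize_and_log(counts):
    print(row)
```

Cells with the same relative composition end up with the same normalized values, regardless of their original depth.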

### Create dataloader

GEARS will take it from here. Processing a new dataset takes around 15 minutes for 5K genes and 100K cells.

In [None]:


from omnicell.models.gears.pertdata import PertData

pert_data = PertData('./data')  # folder where the processed data is saved
# Pass the dataset name and the adata object prepared above
pert_data.new_data_process(dataset_name='repogle', adata=adata, skip_calc_de=True)
print(f"Data processed and saved in {pert_data.data_path}")



In [1]:
# Requires `pert_data` from the previous cell; run that cell first
pert_data.load(data_path='./data/repogle')  # path is saved folder + dataset_name
print(f"Data loaded from {pert_data.data_path}")
pert_data.prepare_split(split='simulation', seed=1)  # get data split with seed
print("Data split with seed 1")
pert_data.get_dataloader(batch_size=32, test_batch_size=128)  # prepare data loader
print("Data loader prepared")
