Much of the code for this notebook is sourced from scCRAFT, the underlying base model. Reusing it keeps the original results reproducible for comparison with the new approach. The GitHub repository is available here: https://github.com/ch2343/scCRAFT/tree/main, and the publication is available here: https://www.nature.com/articles/s42003-025-07988-y.

Data is available here: "https://figshare.com/articles/dataset/Benchmarking_atlas-level_data_integration_in_single-cell_genomics_-_integration_task_datasets_Immune_and_pancreas_/12420968"

We use the human_pancreas_norm_complexBatch, Immune_ALL_human, and Lung_atlas_public datasets.

Command to download the data:
```bash
curl -L 'https://figshare.com/ndownloader/articles/12420968/versions/8' \
  -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36' \
  -o data_download.zip
  ```
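
The downloaded archive still has to be unpacked before the notebook can read the `.h5ad` files. The following is a minimal sketch using only the standard library. The helper name `extract_h5ad` is hypothetical, the archive name `data_download.zip` comes from the command above, and `/workspaces/data` is taken from the paths used later in this notebook. It assumes the archive stores the `.h5ad` files as ordinary zip members.

```python
import zipfile
from pathlib import Path


def extract_h5ad(zip_path, dest_dir):
    """Extract every .h5ad member of zip_path into dest_dir and return the paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            # Skip anything that is not an AnnData file.
            if name.endswith(".h5ad"):
                archive.extract(name, dest)
                extracted.append(dest / name)
    return extracted


# Assumed locations: the curl output name and the data directory used below.
if Path("data_download.zip").exists():
    for path in extract_h5ad("data_download.zip", "/workspaces/data"):
        print(path)
```

The guard on the last lines keeps the cell from failing when the archive has not been downloaded yet.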

In [None]:
import warnings
warnings.filterwarnings('ignore')

from wcd_vae.scCRAFT.utils import set_seed
from wcd_vae.data import prep_data
from wcd_vae.hyperparameter import nested_cv_hyperparameter_tuning
from wcd_vae.plot import create_paper_assets

In [None]:
# set the torch random seed
set_seed(42)
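
For readers without access to `wcd_vae`, a seeding helper of this kind usually seeds every random number generator the pipeline touches. The sketch below is an assumption about what such a helper typically does, not the actual body of `set_seed`. The name `seed_everything` is hypothetical, and NumPy and PyTorch are seeded only if they are installed.

```python
import random


def seed_everything(seed):
    """Seed Python's RNG, plus NumPy and PyTorch when they are installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass


# Re-seeding with the same value reproduces the same draw.
seed_everything(42)
first_draw = random.random()
seed_everything(42)
second_draw = random.random()
print(first_draw == second_draw)
```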

# Pancreas

In [None]:
pancrease_adata = prep_data(
    "/workspaces/data/human_pancreas_norm_complexBatch.h5ad",
    batch_key="tech",
    celltype_key="celltype",
    batch_count=2,
    min_genes=300,
    min_cells=5,
    norm_val=1e4,
    n_top_genes=2000,
    balance=False,
)
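
The internals of `prep_data` are not shown in this notebook. Its parameter names suggest the conventional single-cell preprocessing steps: drop cells expressing fewer than `min_genes` genes, drop genes detected in fewer than `min_cells` cells, scale each cell to `norm_val` total counts, and log-transform. The toy sketch below illustrates that reading on a plain list-of-lists matrix. The function `filter_and_normalize` is hypothetical, the step order is an assumption, and selecting the `n_top_genes` highly variable genes is left out.

```python
import math


def filter_and_normalize(counts, min_genes, min_cells, norm_val):
    """counts: list of rows (cells), each a list of per-gene counts."""
    # Keep cells that express at least min_genes genes.
    cells = [row for row in counts if sum(c > 0 for c in row) >= min_genes]
    if not cells:
        return []
    # Keep genes detected in at least min_cells of the remaining cells.
    n_genes = len(cells[0])
    keep = [g for g in range(n_genes)
            if sum(row[g] > 0 for row in cells) >= min_cells]
    cells = [[row[g] for g in keep] for row in cells]
    # Scale each cell to norm_val total counts, then apply log1p.
    normalized = []
    for row in cells:
        total = sum(row)
        normalized.append(
            [math.log1p(c * norm_val / total) if total else 0.0 for c in row]
        )
    return normalized


# Toy matrix: 3 cells by 3 genes, with thresholds scaled down to match.
toy = [[5, 0, 3], [0, 0, 1], [2, 0, 2]]
result = filter_and_normalize(toy, min_genes=2, min_cells=2, norm_val=10)
print(len(result), len(result[0]))
```

In the toy matrix the second cell and the second gene fall below the thresholds, so a 2 by 2 matrix remains.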

In [None]:
results_df, outer_fold_results = nested_cv_hyperparameter_tuning(
    pancrease_adata,
    batch_key="tech",
    celltype_key="celltype",
    reference_batch=0,
    epochs=50,
    n_outer_folds=3,
    n_inner_folds=5,
    output_dir="results",
    output_prefix="pancrease_binary_balanced"
)
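
Nested cross-validation keeps hyperparameter selection separate from performance estimation: each outer fold holds out a test portion, and the inner folds split only the remaining training portion. The sketch below shows that partitioning generically, with 3 outer and 5 inner folds as in the call above. It is not the implementation of `nested_cv_hyperparameter_tuning`, and the helper names `k_fold_indices` and `nested_cv_splits` are hypothetical.

```python
import random


def k_fold_indices(indices, n_folds, rng):
    """Shuffle indices and yield (train, held_out) pairs for each fold."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    for i, held_out in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, held_out


def nested_cv_splits(n_samples, n_outer_folds=3, n_inner_folds=5, seed=42):
    """Return a list of (outer_train, outer_test, inner_splits) tuples."""
    rng = random.Random(seed)
    splits = []
    for outer_train, outer_test in k_fold_indices(
        range(n_samples), n_outer_folds, rng
    ):
        # Inner folds only ever see the outer training portion.
        inner = list(k_fold_indices(outer_train, n_inner_folds, rng))
        splits.append((outer_train, outer_test, inner))
    return splits


splits = nested_cv_splits(n_samples=30)
for outer_train, outer_test, inner in splits:
    print(len(outer_train), len(outer_test), len(inner))
```

With 30 samples, each outer fold trains on 20 indices, tests on 10, and carries 5 inner splits. Hyperparameters would be chosen on the inner splits and scored once on the outer test portion.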

In [None]:
create_paper_assets(results_df, outer_fold_results, output_dir="results", output_prefix="pancrease_binary")

# Immune

In [None]:
immune_adata = prep_data(
    "/workspaces/data/Immune_ALL_human.h5ad",
    batch_key="batch",
    celltype_key="final_annotation",
    batch_count=2,
    min_genes=300,
    min_cells=5,
    norm_val=1e4,
    n_top_genes=2000,
    balance=False,
)

In [None]:
results_df, outer_fold_results = nested_cv_hyperparameter_tuning(
    immune_adata,
    n_outer_folds=3,
    n_inner_folds=5,
    epochs=50,
    batch_key="batch",
    celltype_key="final_annotation"
)

In [None]:
create_paper_assets(results_df, outer_fold_results, output_dir="results", output_prefix="Immune_binary")

# Lung

In [None]:
lung_adata = prep_data(
    "/workspaces/data/Lung_atlas_public.h5ad",
    batch_key="batch",
    celltype_key="cell_type",
    batch_count=2,
    min_genes=300,
    min_cells=5,
    norm_val=1e4,
    n_top_genes=2000,
    balance=False,
)

In [None]:
results_df, outer_fold_results = nested_cv_hyperparameter_tuning(
    lung_adata,
    reference_batch=0,
    n_outer_folds=3,
    n_inner_folds=5,
    epochs=50,
    batch_key="batch",
    celltype_key="cell_type"
)

In [None]:
create_paper_assets(results_df, outer_fold_results, output_dir="results", output_prefix="Lung_binary")