In [1]:
import os
import zipfile
from pathlib import Path

Dataset:

- Manuscript: https://www.nature.com/articles/s41588-025-02193-3#data-availability
- 10x: https://www.10xgenomics.com/platforms/visium/product-family/dataset-human-crc
- 10x (alt): https://www.10xgenomics.com/datasets/visium-hd-cytassist-gene-expression-libraries-of-human-crc
- GEO: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE280318
- SRA: https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA1177833&o=acc_s%3Aa

## Extract data

Unzip `data_demo/images.zip` so that the two `.tif` files sit directly under `data_demo`. The first cell below moves to the parent directory so that the relative paths resolve:

In [2]:
os.chdir("../")

In [3]:
data_dir = Path("data_demo")

with zipfile.ZipFile(data_dir / "images.zip", 'r') as zip_ref:
    zip_ref.extractall(data_dir)
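As a quick sanity check after extraction, something like the following could confirm that the `.tif` files ended up directly under `data_demo`. This is only a sketch: `find_tifs` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def find_tifs(folder):
    """Return the sorted names of .tif files directly inside `folder`."""
    folder = Path(folder)
    if not folder.is_dir():
        return []
    return sorted(p.name for p in folder.glob("*.tif"))


# The tutorial expects two .tif files here after unzipping.
tifs = find_tifs("data_demo")
print(f"{len(tifs)} .tif file(s) found: {tifs}")
```

If the count is zero, the archive may have extracted into a subfolder, in which case the files need to be moved up into `data_demo`.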

## Config file

Parameters are defined in a config file (``./configs/config_demo.json`` for this demo). Important parameters include:

- ``comps``: set ``avgexp``, ``celltype``, or ``neighb`` to ``false`` if the corresponding information is not available. Because ``neighb`` depends on ``celltype``, setting ``celltype`` to ``false`` also disables ``neighb``.
- ``cell_types``: list of cell types in your data. Ignored if ``celltype`` is ``false``.
- ``data_sources_train_val``: locations of data for training and validation.
- ``regions_val.divisions``: the start and end of the partition used for validation in each fold. The image is divided along the y (vertical) axis at the specified intervals; portion k is used for validation in fold k, and the remainder is used for training. To use the whole image for training in a given fold, set that fold's entry to ``[0.0, 0.0]``.
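To make the ``regions_val.divisions`` idea concrete, here is a minimal sketch of how one fold's ``[start, end]`` pair might map to pixel rows. It assumes the values are fractions of the image height, which this tutorial does not state explicitly. The helper `validation_rows` and the example division `[0.0, 0.2]` are hypothetical; the image height of 5120 is taken from the demo log.

```python
def validation_rows(image_height, division):
    """Convert a fractional [start, end] division into pixel rows along y.

    Assumption: `division` holds fractions of the image height in [0, 1].
    """
    start, end = division
    if not 0.0 <= start <= end <= 1.0:
        raise ValueError(f"invalid division: {division}")
    return round(start * image_height), round(end * image_height)


# Hypothetical fold: the top 20% of a 5120-pixel-high image is held out.
print(validation_rows(5120, [0.0, 0.2]))  # (0, 1024)

# [0.0, 0.0] gives an empty validation strip, so the whole image is used for training.
print(validation_rows(5120, [0.0, 0.0]))  # (0, 0)
```

Under this reading, rows inside the returned range would be used for validation and all other rows for training.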

## Training

```sh
python train.py --config_file configs/FILENAME.json --resume_epoch EPOCH --fold_id FOLD --gpu_id GPU_NUM
```

- ``--config_file``: path to the config file.
- ``--resume_epoch``: whether to train from scratch or resume from a checkpoint. Set to ``0`` to train from scratch, or, e.g., ``--resume_epoch 10`` to resume from the checkpoint saved at epoch 10. Training ends when the epoch number reaches `total_epochs` specified in the config file.
- ``--fold_id``: the cross-validation fold (1, 2, 3...).
- ``--gpu_id``: which GPU to use (0, 1, 2...).
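When several folds need to be trained, the command can be assembled programmatically. The sketch below only builds and prints the command lines using the flags listed above. `train_command` is a hypothetical helper, and the three folds are an assumption for illustration, not the number defined in the demo config.

```python
import shlex


def train_command(config_file, fold_id, resume_epoch=0, gpu_id=0):
    """Build the argument list for one train.py run."""
    return [
        "python", "train.py",
        "--config_file", str(config_file),
        "--resume_epoch", str(resume_epoch),
        "--fold_id", str(fold_id),
        "--gpu_id", str(gpu_id),
    ]


# Assumed fold IDs; use the folds actually defined in your config.
for fold in (1, 2, 3):
    print(shlex.join(train_command("configs/config_demo.json", fold)))
```

Each list can be passed to `subprocess.run(cmd, check=True)` to launch the runs one after another.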

In [5]:
!python train.py --config_file configs/config_demo.json --resume_epoch 0 --fold_id 1 --gpu_id 0

Using GPUs: 0
2025-09-18 12:33:55,821 INFO Initialising model
['B', 'Myeloid', 'Endothelial', 'Fibroblast', 'Macrophage', 'Malignant', 'Epithelial', 'Plasma', 'T']
Num cell types 9
280 genes
Avgexp shape  (63, 280)
2025-09-18 12:33:56,683 INFO Preparing data
Cell type data shape, (5508, 1)
Histology image (5120, 5120, 3), Nuclei (5120, 5120)
4152 cells
Patches min/max coords 1024 5120
Getting valid patches
100%|████████████████████████████████████████| 320/320 [00:00<00:00, 536.21it/s]
Standardisation
100%|████████████████████████████████████████| 320/320 [00:01<00:00, 231.02it/s]
2025-09-18 12:33:59,753 INFO Total number of training batches: 40
2025-09-18 12:33:59,801 INFO Begin training
Epoch: 1
loss: 10.0839: 100%|████████████████████████████| 40/40 [00:56<00:00,  1.41s/it]
Epoch[1/3], Loss:744.6718
2025-09-18 12:34:56,220 INFO Model saved: experiments/demo/models/epoch_1_model.pth
2025-09-18 12:34:56,528 INFO Optimiser saved: experiments/demo/models/epoch_1_optim.pth
Epoch: 2
loss:

Training creates the folder `./experiments/demo`, where model checkpoints and outputs are saved.

## Validation

Validation mode assumes that ground truth is available. To run prediction on data without ground truth, see [Prediction on data without ground truth](./3_prediction.ipynb).

```sh
python inference.py --config_file configs/FILENAME.json --epoch EPOCH --mode val --fold_id FOLD --gpu_id GPU_NUM
```

- ``--epoch``: which epoch to evaluate, e.g., ``10`` for the model saved at epoch 10, `last` for the most recent checkpoint, or `all` for every saved epoch.

In [7]:
!python inference.py --config_file configs/config_demo.json --epoch last --mode val --fold_id 1 --gpu_id 0

Using GPUs: 0
['B', 'Myeloid', 'Endothelial', 'Fibroblast', 'Macrophage', 'Malignant', 'Epithelial', 'Plasma', 'T']
Num cell types 9
280 genes
Avgexp shape  (63, 280)
Cell type data shape, (5508, 1)
Histology image (5120, 5120, 3), Nuclei (5120, 5120)
1072 cells
Patches min/max coords 0 1024
Getting valid patches
100%|████████████████████████████████████████| 115/115 [00:00<00:00, 602.44it/s]
Standardisation
Predict using experiments/demo/models/epoch_3_model.pth
100%|███████████████████████████████████████████| 15/15 [00:04<00:00,  3.74it/s]
Saved predicted expressions of 1072 cells to experiments/demo/val_output//epoch_3_expr.csv
***expression correlation***
PCC mean: 0.24223653899061412
***best epoch***
PCC best epoch 3 mean 0.24223653899061412


In this tutorial we trained for only 3 epochs to keep the demo short. In practice, the model would typically be trained on more data and for more epochs to achieve better performance.

## Outputs

The predictions are saved to ``experiments/demo/val_output/``. Each CSV file contains the predicted gene expression for every cell: the index is the cell ID, which matches the IDs in the nuclei segmentation image, and the columns are the genes. ``example_output.csv`` is provided as an example of the format.
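A prediction file in this layout can be read with the standard library alone. The sketch below assumes that the first column holds the cell ID, that the header row holds the gene names, and that all values are numeric. `load_expression_csv` is a hypothetical helper, and the file name is the one shown in the validation log above.

```python
import csv
from pathlib import Path


def load_expression_csv(path):
    """Read a predictions CSV into (genes, {cell_id: [expression values]}).

    Assumption: first column is the cell ID, remaining columns are genes.
    """
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        genes = header[1:]  # skip the index column
        expr = {row[0]: [float(v) for v in row[1:]] for row in reader if row}
    return genes, expr


path = Path("experiments/demo/val_output/epoch_3_expr.csv")
if path.exists():
    genes, expr = load_expression_csv(path)
    print(f"{len(expr)} cells x {len(genes)} genes")
```

Cell IDs are kept as strings here; convert them to integers if they need to be matched against labels in the nuclei segmentation image.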