# Setup and Train the DOLPHIN Model

This tutorial provides a step-by-step guide on configuring the model architecture, setting hyperparameters, and visualizing cell embedding clusters using DOLPHIN.


In [None]:
from DOLPHIN.model import run_DOLPHIN
import numpy as np

## Load Processed Dataset

Specify the graph data input and the highly variable gene (HVG)-filtered feature matrix obtained from the preprocessing step.

In [None]:
# Load datasets: replace <sample_name> with the name used during preprocessing
graph_data = "model_<sample_name>.pt"
feature_data = "FeatureCompHvg_<sample_name>.h5ad"
# Directory where the output AnnData file is saved (defaults to the current folder)
output_path = './'
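
The `<sample_name>` placeholder has to be filled in before the paths are usable. A minimal sketch of one way to do that is shown below; `sample_name`, its value `"my_sample"`, and `data_dir` are hypothetical names introduced for illustration, and the existence check is an optional convenience rather than part of DOLPHIN.

```python
from pathlib import Path

# Hypothetical sample name; use the one chosen during preprocessing.
sample_name = "my_sample"
# Hypothetical folder holding the preprocessing outputs.
data_dir = Path(".")

# Build the two input paths following the naming pattern used in this tutorial.
graph_data = str(data_dir / f"model_{sample_name}.pt")
feature_data = str(data_dir / f"FeatureCompHvg_{sample_name}.h5ad")
output_path = "./"

# Warn early if an input is missing instead of failing later during training.
for path in (graph_data, feature_data):
    if not Path(path).is_file():
        print(f"warning: {path} not found; check the preprocessing output")
```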

## Set Hyperparameters and Train the Model

The function `run_DOLPHIN` is used to configure hyperparameters and train the model. Below is a detailed explanation of its parameters:

---

#### **Function Definition**
```python
run_DOLPHIN(data_type, graph_in, fea_in, current_out_path='./', params=None, device='cuda:0', seed_num=0)
```

### Parameters

##### 1. `data_type`
Specifies the type of input single-cell RNA-seq data:
- `"full-length"`: For full-length RNA-seq data.
- `"10x"`: For 10x Genomics RNA-seq data.

##### 2. `graph_in`
The input graph dataset (in this tutorial, the `.pt` file assigned to `graph_data`).

##### 3. `fea_in`
The input feature matrix, an AnnData object (`adata`) stored as an `.h5ad` file (in this tutorial, the file assigned to `feature_data`).

##### 4. `current_out_path`
Specifies the output directory where the resulting cell embeddings (`X_z`) will be saved. The output file will be named `DOLPHIN_Z.h5ad`.

##### 5. `params`
Model hyperparameters.  
Each `data_type` comes with **default hyperparameters**, which are used when `params` is left as `None`. To override them, provide your own values as a dictionary.  
The customizable hyperparameters are listed below:

| Parameter             | Description                                                              |
|-----------------------|--------------------------------------------------------------------------|
| `"gat_channel"`       | Number of features per node after the GAT layer.                        |
| `"nhead"`             | Number of attention heads in the graph attention layer.                 |
| `"gat_dropout"`       | Dropout rate for the GAT layer.                                         |
| `"list_gra_enc_hid"`  | Neuron sizes for each fully connected layer of the encoder.              |
| `"gra_p_dropout"`     | Dropout rate for the encoder.                                           |
| `"z_dim"`             | Dimensionality of the latent Z space.                                   |
| `"list_fea_dec_hid"`  | Neuron sizes for each fully connected layer of the feature decoder.      |
| `"list_adj_dec_hid"`  | Neuron sizes for each fully connected layer of the adjacency decoder.    |
| `"lr"`                | Learning rate for optimization.                                         |
| `"batch"`             | Mini-batch size.                                                       |
| `"epochs"`            | Number of training epochs.                                              |
| `"kl_beta"`           | KL divergence weight.                                                  |
| `"fea_lambda"`        | Feature matrix reconstruction loss weight.                              |
| `"adj_lambda"`        | Adjacency matrix reconstruction loss weight.                            |

##### 6. `device`
Specifies the device for training. Default: `"cuda:0"`, which trains on the first GPU (highly recommended).

##### 7. `seed_num`
Sets the random seed for reproducibility. Default: `0`.



In [None]:
run_DOLPHIN("full-length", graph_data, feature_data, output_path)
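
The call above relies on the default hyperparameters for `"full-length"` data. As a rough sketch of how custom hyperparameters might be supplied, the block below builds a dictionary with every key from the hyperparameter table and checks it for typos before training. The values are placeholders chosen for illustration and are not the DOLPHIN defaults; the value types (for example, whether `"gat_channel"` expects an integer or a list) are assumptions, as is the helper `check_param_keys`, which is not part of DOLPHIN. Whether a partial dictionary is merged with the defaults is not stated here, so the sketch supplies all keys.

```python
# Keys documented in the hyperparameter table of this tutorial.
DOCUMENTED_KEYS = {
    "gat_channel", "nhead", "gat_dropout", "list_gra_enc_hid",
    "gra_p_dropout", "z_dim", "list_fea_dec_hid", "list_adj_dec_hid",
    "lr", "batch", "epochs", "kl_beta", "fea_lambda", "adj_lambda",
}

# Placeholder values for illustration only; NOT the DOLPHIN defaults.
# Value types are assumptions and may need adjusting.
custom_params = {
    "gat_channel": 2,            # features per node after the GAT layer
    "nhead": 4,                  # attention heads
    "gat_dropout": 0.3,          # dropout in the GAT layer
    "list_gra_enc_hid": [256],   # encoder layer sizes
    "gra_p_dropout": 0.2,        # dropout in the encoder
    "z_dim": 50,                 # latent dimensionality
    "list_fea_dec_hid": [256],   # feature decoder layer sizes
    "list_adj_dec_hid": [256],   # adjacency decoder layer sizes
    "lr": 1e-3,                  # learning rate
    "batch": 32,                 # mini-batch size
    "epochs": 100,               # training epochs
    "kl_beta": 0.5,              # KL divergence weight
    "fea_lambda": 0.5,           # feature reconstruction loss weight
    "adj_lambda": 0.5,           # adjacency reconstruction loss weight
}


def check_param_keys(params):
    """Hypothetical helper: report misspelled and missing hyperparameter keys."""
    unknown = sorted(set(params) - DOCUMENTED_KEYS)
    missing = sorted(DOCUMENTED_KEYS - set(params))
    return unknown, missing


unknown, missing = check_param_keys(custom_params)
print("unknown keys:", unknown)
print("missing keys:", missing)

# With DOLPHIN installed, the dictionary would then be passed via `params`:
# run_DOLPHIN("full-length", graph_data, feature_data, output_path,
#             params=custom_params, device="cuda:0", seed_num=0)
```

Checking the keys up front is a small safeguard: a misspelled key in a plain dictionary would otherwise only surface, if at all, once training has started.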

## Cluster Cell Embeddings Using `X_z`

The cell embedding matrix `X_z` represents the low-dimensional latent space learned by the DOLPHIN model.  
This matrix can be used to visualize cell clusters and analyze their relationships in the latent space.

In [19]:
import scanpy as sc
from sklearn.metrics import adjusted_rand_score

In [20]:
adata = sc.read_h5ad("./DOLPHIN_Z.h5ad")

In [None]:
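# Build the neighbor graph on the DOLPHIN embedding, then embed and cluster
sc.pp.neighbors(adata, use_rep="X_z")
sc.tl.umap(adata)
sc.tl.leiden(adata)
print("Number of Leiden clusters:", len(set(adata.obs["leiden"])))
# Compare Leiden clusters with the annotated cell types (requires adata.obs["celltype"])
print("ARI:", adjusted_rand_score(adata.obs["celltype"], adata.obs["leiden"]))
sc.pl.umap(adata, color=['leiden', "celltype"], wspace=0.5)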
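
The cell above scores the Leiden clusters against the annotated cell types with `adjusted_rand_score`. To make that number easier to interpret, here is a small pure-Python sketch of the pair-counting formula behind the adjusted Rand index (ARI). It is a didactic re-implementation written for this tutorial, not the scikit-learn code, and the function name `adjusted_rand_index` and the toy labels are illustrative.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index computed from pair counts (didactic sketch)."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: treated as perfect agreement

    # Pairs of cells grouped together in both labelings, and in each one alone.
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    pairs_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = pairs_true * pairs_pred / total_pairs  # agreement expected by chance
    maximum = (pairs_true + pairs_pred) / 2           # best achievable agreement
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (pairs_both - expected) / (maximum - expected)


# Cluster names do not matter, only the grouping does.
same_grouping = adjusted_rand_index(["T", "T", "B", "B"], ["1", "1", "0", "0"])
# Splitting one cell type into two clusters lowers the score.
split_cluster = adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2])
print(same_grouping, split_cluster)
```

An ARI of 1 means the clustering reproduces the cell-type grouping exactly, values near 0 are what random assignment would give, and the score is unaffected by how the clusters happen to be numbered.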