# Dataset preparation

This example is a step-by-step guide to preparing single-cell datasets for training and downstream analysis with UNAGI. UNAGI assumes that the dataset is time-series single-cell data and that the time point of each cell is known. To use UNAGI, you must therefore annotate every cell with its time point.

This example shows how to add a time-point attribute to the AnnData object. Time points should be numbered sequentially as [0, 1, 2, ..., n], from the earliest time point (0) to the latest (n).

In [None]:
# import the dependencies and load the data
import warnings
import scanpy as sc
import os
warnings.filterwarnings('ignore')
PATH_TO_YOUR_DATA = 'your_data.h5ad'
adata = sc.read(PATH_TO_YOUR_DATA)

The following code assigns a stage label to each batch according to its time point. The example assumes that the time-series dataset has 3 batches, each coming from a different time point.

**Note:** UNAGI requires at least two time points ($\geq 2$).

In [None]:
# Using 'stage' as the key for the stage information in the adata.obs
stage_key = 'stage' # change this to whatever you want
adata.obs[stage_key] = None

sc.tl.pca(adata)

# Assume the batch information is in adata.obs['batch'] and the batch names are batch1, batch2, batch3, ...
# Change the following code according to your data
adata.obs.loc[adata.obs['batch'] == 'batch1', stage_key] = '0'
adata.obs.loc[adata.obs['batch'] == 'batch2', stage_key] = '1'
adata.obs.loc[adata.obs['batch'] == 'batch3', stage_key] = '2'
#....
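
As the number of batches grows, the per-batch assignments above can be expressed as a single mapping, and the result can be checked against the two requirements stated earlier (labels numbered 0..n without gaps, and at least two time points). The following is a minimal sketch, not part of UNAGI: `assign_stages`, `validate_stage_labels`, and `batch_to_stage` are hypothetical names, and the plain Python list stands in for `adata.obs['batch']`.

```python
def assign_stages(batches, batch_to_stage):
    """Return one stage label per cell, failing loudly on unmapped batches."""
    missing = sorted(set(batches) - set(batch_to_stage))
    if missing:
        raise KeyError(f"no stage defined for batches: {missing}")
    return [batch_to_stage[batch] for batch in batches]


def validate_stage_labels(labels):
    """Check that the labels cover 0..n without gaps and that n >= 1."""
    stages = sorted({int(label) for label in labels})
    if len(stages) < 2:
        raise ValueError("at least two time points are required")
    if stages != list(range(len(stages))):
        raise ValueError(f"stages must be 0..n without gaps, got {stages}")
    return stages


# Hypothetical mapping; adapt the batch names to your data
batch_to_stage = {'batch1': '0', 'batch2': '1', 'batch3': '2'}

# Stand-in for adata.obs['batch']
batches = ['batch1', 'batch1', 'batch2', 'batch3']

stages = assign_stages(batches, batch_to_stage)
print(stages)
print(validate_stage_labels(stages))
```

With the objects from this notebook, the same helpers could be applied as `adata.obs[stage_key] = assign_stages(adata.obs['batch'], batch_to_stage)` followed by `validate_stage_labels(adata.obs[stage_key])`.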

After adding the time-point information, you can either write the whole dataset to disk or divide it into individual stages and write each stage to disk.

In [None]:
# Option 1: Save the whole dataset to disk (this overwrites the input file)
adata.write(PATH_TO_YOUR_DATA, compression='gzip', compression_opts=9)

# Option 2: Separate the data into different stages and save each one
import os
dir_name = os.path.dirname(PATH_TO_YOUR_DATA)

for each in adata.obs[stage_key].unique():
    # copy the view so that a standalone AnnData object is written
    stage_adata = adata[adata.obs[stage_key] == each].copy()
    # os.path.join also works when dir_name is empty (file in the current directory)
    stage_adata.write(os.path.join(dir_name, f'{each}.h5ad'), compression='gzip', compression_opts=9)
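
The per-stage files are named after the stage label and placed next to the input file. One detail is worth checking: `os.path.dirname` returns an empty string for a bare file name such as `'your_data.h5ad'`, so the output paths are best built with `os.path.join` rather than by concatenating a `/`. The sketch below previews the paths without writing anything; `stage_output_paths` is a hypothetical helper, not part of UNAGI or scanpy.

```python
import os


def stage_output_paths(data_path, stages):
    """Map each stage label to the .h5ad path it would be written to."""
    out_dir = os.path.dirname(data_path)  # '' when data_path has no directory part
    return {stage: os.path.join(out_dir, f'{stage}.h5ad') for stage in stages}


# A bare file name resolves to files in the current working directory
paths = stage_output_paths('your_data.h5ad', ['0', '1', '2'])
print(paths)
```

Printing the mapping before running the save loop is a cheap way to confirm that no file lands in an unexpected directory.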