# How to train an ontology-based variational autoencoder (Ontix)

In standard autoencoders, latent dimensions are not explainable by design. To gain explainability and incorporate biological information, a popular approach is to restrict the connectivity of the decoder so that it mirrors a feature hierarchy such as an ontology. 
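
To make the idea of a restricted decoder concrete, here is a minimal NumPy sketch of a single linear decoder layer whose weights are multiplied by a binary ontology mask. It is an illustration only, not the autoencodix implementation: the toy ontology, the weights, and the function name `masked_decode` are all assumptions made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy ontology: 2 latent dimensions (pathways), 4 output features (genes).
# mask[i, j] = 1 means latent dimension i is allowed to influence gene j.
mask = np.array(
    [
        [1, 1, 0, 0],  # pathway A -> gene 1, gene 2
        [0, 0, 1, 1],  # pathway B -> gene 3, gene 4
    ],
    dtype=float,
)

weights = rng.normal(size=mask.shape)  # dense decoder weights (random for the demo)
bias = np.zeros(mask.shape[1])


def masked_decode(z: np.ndarray) -> np.ndarray:
    """Linear decoder whose weights are zeroed wherever the ontology has no edge."""
    return z @ (weights * mask) + bias


z = np.array([[1.0, 0.0]])  # activate only pathway A
x_hat = masked_decode(z)
print(x_hat)
```

Because the mask removes every connection from pathway A to genes 3 and 4, activating only pathway A leaves those two outputs at zero. This is what makes each latent dimension interpretable as one ontology term.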

**IMPORTANT**

> This tutorial only covers the specifics of the Ontix pipeline. If you're unfamiliar with the general concepts,  
> we recommend following the `Getting Started - Vanillix` tutorial first.

## What You'll Learn

In this notebook we will show two types of ontologies and how they can be used to train an explainable variational autoencoder `ontix`. 
The first is based on biological pathways of the Reactome database (left) and the second uses chromosomal location of genes (right) as a showcase. 

<img src="https://raw.githubusercontent.com/jan-forest/autoencodix/5dabc4a697cbba74d3f6144dc4b6d0fd6df2b624/images/ontix_scheme.svg" alt="ontix-ontologies" width="1200"/>

You'll learn how to:

1. **Initialize** the pipeline and the ontologies, then run the pipeline. <br> <br>
2. Understand the Ontix-specific **pipeline steps**. <br><br>
3. Access the Ontix-specific **results** (mus, sigmas, KL/MMD losses). <br><br>
4. **Visualize** outputs effectively. <br><br>
5. Apply **custom parameters**. <br><br>
6. **Save, load, and reuse** a trained pipeline. <br><br>



## 01 Set-up Ontology and Initialize Pipeline

The only thing you need to train an `ontix` model is a text file for each of up to two ontology levels. The first level maps your features (e.g. gene IDs) to ontology terms such as subpathways or chromosomal cytobands. The second level is optional but recommended; it maps the terms of the first level to a higher level such as top-level pathways or chromosomes. 

The mapping should have the format:  
Gene 1 `separator` Pathway1  
Gene 2 `separator` Pathway1  
Gene 3 `separator` Pathway2  <br><br>
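
As a quick illustration of this format, the sketch below parses a small tab-separated mapping with pandas and groups the features by ontology term. The file content and the column names `child` and `parent` are hypothetical and chosen only for this example; a real file would be read from its path instead of from a string.

```python
import io

import pandas as pd

# Hypothetical file content in the "feature <separator> term" format (tab-separated).
raw = "Gene1\tPathway1\nGene2\tPathway1\nGene3\tPathway2\n"

mapping = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    header=None,  # the mapping files carry no header row
    names=["child", "parent"],
    dtype=str,  # keep numeric gene IDs as strings
)

# Collect the features that belong to each ontology term.
members = mapping.groupby("parent")["child"].apply(list).to_dict()
print(members)
```

Reading the IDs as strings avoids accidental type mismatches when numeric identifiers (e.g. NCBI gene IDs) are later compared with feature names.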

**Example 1: Set-up chromosomal ontology**:  

From Ensembl via Biomart or any other sequence database you can get cytoband (karyotype) and chromosomal information for human genes like this:

In [None]:
import pandas as pd

df_genes = pd.read_csv("genes_chromosomes.txt", sep="\t")
df_genes.head()

Unnamed: 0,Gene stable ID,Gene stable ID version,Karyotype band,Chromosome/scaffold name,Gene start (bp),Gene end (bp),HGNC symbol,NCBI gene (formerly Entrezgene) ID
0,ENSG00000198888,ENSG00000198888.2,,MT,3307,4262,MT-ND1,4535
1,ENSG00000198763,ENSG00000198763.3,,MT,4470,5511,MT-ND2,4536
2,ENSG00000198804,ENSG00000198804.2,,MT,5904,7445,MT-CO1,4512
3,ENSG00000210151,ENSG00000210151.2,,MT,7446,7514,MT-TS1,113219467
4,ENSG00000198712,ENSG00000198712.1,,MT,7586,8269,MT-CO2,4513


We have to solve two issues before we can use this as an ontology.  
(1) We only want chromosomes, not scaffolds.  
(2) Karyotype/cytoband names must include the chromosome they belong to, so that each name is a unique identifier.

In [None]:
df_genes = df_genes.loc[
    df_genes["Chromosome/scaffold name"].str.len() < 3
]  ## get rid of scaffolds and keep only chromosomes
df_genes.loc[df_genes["Chromosome/scaffold name"] == "MT", "Karyotype band"] = (
    "MT"  ## create missing karyotype for mito genes
)
print("This will be our chromosomes and latent dimensions in Ontix:")
print(df_genes["Chromosome/scaffold name"].unique())
print(f"Latent dimension: {len(df_genes['Chromosome/scaffold name'].unique())}")

This will be our chromosomes and latent dimensions in Ontix:
['MT' 'Y' '21' '13' '18' '22' '20' 'X' '15' '14' '10' '9' '8' '16' '4' '5'
 '7' '6' '19' '12' '11' '3' '17' '2' '1']
Latent dimension: 25


In [None]:
# Combine Chromosome name and cytoband
df_genes = df_genes.copy()
df_genes.loc[:, "Chr_and_karyotype"] = df_genes.loc[
    :, ["Chromosome/scaffold name", "Karyotype band"]
].apply(lambda x: ":".join(x.values.tolist()), axis=1)
print("This will be our hidden layer in the sparse decoder:")
print(df_genes["Chr_and_karyotype"].unique()[0:20])
print(f"Hidden layer dim: {len(df_genes['Chr_and_karyotype'].unique())}")

This will be our hidden layer in the sparse decoder:
['MT:MT' 'Y:p11.2' 'Y:q11.223' 'Y:q11.221' 'Y:q11.222' 'Y:q11.23'
 'Y:p11.31' 'Y:p11.32' 'Y:q12' '21:p12' '21:q21.1' '21:q21.2' '21:p11.2'
 '13:q12.12' '21:q21.3' '21:q22.11' '13:q12.3' '13:q14.12' '13:q14.2'
 '21:q22.12']
Hidden layer dim: 817
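
To see how the two levels translate into layer sizes, the following sketch builds binary connectivity masks from a tiny, made-up gene table using `pd.crosstab`. It only illustrates the principle; how autoencodix constructs its masks internally is not shown here, and the table and variable names are assumptions.

```python
import pandas as pd

# Hypothetical miniature version of the gene table used above.
toy = pd.DataFrame(
    {
        "gene": ["g1", "g2", "g3", "g4"],
        "band": ["1:p36", "1:p36", "1:q21", "X:q28"],
        "chrom": ["1", "1", "1", "X"],
    }
)

lvl1 = toy[["gene", "band"]].drop_duplicates()  # feature -> hidden layer
lvl2 = toy[["band", "chrom"]].drop_duplicates()  # hidden layer -> latent dimension

# Rows are the upstream layer, columns the downstream layer; 1 marks an allowed connection.
mask_hidden_to_gene = pd.crosstab(lvl1["band"], lvl1["gene"]).clip(upper=1)
mask_latent_to_hidden = pd.crosstab(lvl2["chrom"], lvl2["band"]).clip(upper=1)

print(mask_latent_to_hidden.shape, mask_hidden_to_gene.shape)
```

In this toy case the latent layer has one unit per chromosome and the hidden layer one unit per cytoband, mirroring the 25 and 817 dimensions reported above for the full table.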


Now we can save this as files in the correct format for the two levels:

In [8]:
import os

p = os.getcwd()
d = "autoencodix_package"
if d not in p:
    raise FileNotFoundError(f"'{d}' not found in path: {p}")
os.chdir(os.sep.join(p.split(os.sep)[: p.split(os.sep).index(d) + 1]))
print(f"Changed to: {os.getcwd()}")
# ---------------------------------------------------------------------
# Paths
# ---------------------------------------------------------------------
data_root = "data/raw"
rna_file = "combined_rnaseq_formatted.parquet"
meth_file = "combined_meth_formatted.parquet"
clin_file = "combined_clin_formatted.parquet"
# Full paths, without trailing commas (a trailing comma would turn each value into a tuple).
# The same variables are reused later when initializing Ontix.
ont_file1 = os.path.join(data_root, "chromosome_ont_lvl1_ncbi.txt")
ont_file2 = os.path.join(data_root, "chromosome_ont_lvl2.txt")


# Level 1
df_genes[
    [
        "NCBI gene (formerly Entrezgene) ID",
        "Chr_and_karyotype",
    ]  # Level 1: Feature (gene) to hidden layer (cytoband)
].drop_duplicates(  # Chromosomal ontology must be unique
).to_csv(
    ont_file1,
    sep="\t",
    header=False,
    index=False,
)

# Level 2
df_genes[
    [
        "Chr_and_karyotype",
        "Chromosome/scaffold name",
    ]  # Level 2: hidden layer (cytoband) to latent dimension (chromosome)
].drop_duplicates(  # Chromosomal ontology must be unique
).to_csv(
    ont_file2,
    sep="\t",
    header=False,
    index=False,
)

Changed to: /Users/maximilianjoas/development/autoencodix_package
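
Since the chromosomal ontology is meant to be unique, it can be worth checking that no term maps to more than one parent. The helper below is a hypothetical sketch (the function name and the `child`/`parent` column names are assumptions, not part of autoencodix); it also shows why cytoband names need the chromosome prefix added earlier.

```python
import pandas as pd


def non_unique_children(mapping: pd.DataFrame) -> pd.Series:
    """Return the children that map to more than one parent (empty if none do)."""
    counts = mapping.groupby("child")["parent"].nunique()
    return counts[counts > 1]


# With the chromosome prefix, every cytoband has exactly one chromosome.
prefixed = pd.DataFrame(
    {"child": ["1:p36", "1:q21", "2:p36"], "parent": ["1", "1", "2"]}
)

# Without the prefix, the band name "p36" occurs on two chromosomes.
unprefixed = pd.DataFrame({"child": ["p36", "q21", "p36"], "parent": ["1", "1", "2"]})

print(non_unique_children(prefixed))
print(non_unique_children(unprefixed))
```

The same check can be applied to the level 1 mapping (gene to cytoband) after reading the written file back with `header=None`.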

In [12]:
import os
import autoencodix as acx
from autoencodix.configs.default_config import DataConfig, DataInfo, DataCase
from autoencodix.configs import OntixConfig

# ---------------------------------------------------------------------
# Paths
# ---------------------------------------------------------------------
data_root = "data/raw"
rna_file = "combined_rnaseq_formatted.parquet"
meth_file = "combined_meth_formatted.parquet"
clin_file = "combined_clin_formatted.parquet"

# ---------------------------------------------------------------------
# Define individual data modalities
# ---------------------------------------------------------------------
rna_info = DataInfo(
    file_path=os.path.join(data_root, rna_file),
    data_type="NUMERIC",
    filtering="VAR",
    scaling="MINMAX",  # OntixConfig only permits "MINMAX" or "NONE" per modality
)

meth_info = DataInfo(
    file_path=os.path.join(data_root, meth_file),
    data_type="NUMERIC",
    filtering="VAR",
    scaling="MINMAX",  # OntixConfig only permits "MINMAX" or "NONE" per modality
)

anno_info = DataInfo(
    file_path=os.path.join(data_root, clin_file),
    data_type="ANNOTATION",
)

# ---------------------------------------------------------------------
# Combine into DataConfig
# ---------------------------------------------------------------------
data_config = DataConfig(
    data_info={
        "RNA": rna_info,
        "METH": meth_info,
        "ANNO": anno_info,
    },
    annotation_columns=[
        "CANCER_TYPE",
        "CANCER_TYPE_ACRONYM",
        "TMB_NONSYNONYMOUS",
        "AGE",
        "OS_STATUS",
        "GRADE",
        "SEX",
    ],
)

# ---------------------------------------------------------------------
# Define the full OntixConfig (roughly equivalent to old cfg)
# ---------------------------------------------------------------------
ontix_config = OntixConfig(
    data_config=data_config,
    reproducible=True,
    global_seed=42,
    epochs=500,
    learning_rate=0.0005,
    batch_size=128,
    drop_p=0.3,
    k_filter=1000,
    latent_dim=6, 
    reconstruction_loss="mse",
    default_vae_loss="kl",
    beta=0.5,
    save_memory=True,
    scaling="MINMAX",
    device="auto",
    train_ratio=0.7,
    test_ratio=0.2,
    valid_ratio=0.1,
)

# ---------------------------------------------------------------------
# Now pass into your Ontix object
# ---------------------------------------------------------------------

ont_files = [ont_file1, ont_file2]
ontix = acx.Ontix(
    ontologies=ont_files,
    config=ontix_config,
)