# Setting up AUTOENCODIX
<img src="https://raw.githubusercontent.com/jan-forest/autoencodix/5dabc4a697cbba74d3f6144dc4b6d0fd6df2b624/images/autoencodix_logo.svg" alt="AUTOENCODIX-Logo" width="300"/>


## Get the code and create environment

In [None]:
git clone https://git.informatik.uni-leipzig.de/joas/autoencoder.git # Clone the repo
cd ./autoencoder # Enter Repo folder
make create_environment # Create environment
source venv-gallia/bin/activate # Activate environment

## Install requirements

In [None]:
make requirements

# Basic usage of AUTOENCODIX

## Input data and supported format

### (1) Get your input data in the shape samples x features
Each data modality should be provided as a data matrix of shape samples x features, with index names (one per sample) and column headers (one per feature).

The data can be provided as text files (csv, tsv, txt) or as parquet files.
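As a minimal sketch of this layout, a modality can be held in a pandas DataFrame with sample IDs as the row index and feature IDs as column headers, then written as tab-separated text. The sample IDs, feature IDs, and values below are made up for illustration only.

```python
import io

import pandas as pd

# Hypothetical toy modality: 3 samples (rows) x 2 features (columns).
df_toy = pd.DataFrame(
    {"10357": [532.0, 242.4, 17.2], "26823": [2.7, 1.0, 0.1]},
    index=pd.Index(["sample_A", "sample_B", "sample_C"], name="sample_id"),
)

# Write as tab-separated text, keeping the index (sample IDs)
# and the header (feature IDs).
buffer = io.StringIO()
df_toy.to_csv(buffer, sep="\t", index=True)

# Read it back to confirm the layout survives the round trip.
buffer.seek(0)
df_back = pd.read_csv(buffer, sep="\t", index_col="sample_id")

print(df_back.shape)  # (n_samples, n_features)
```

An in-memory buffer is used here only to keep the sketch self-contained; with real data you would pass a file path instead.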

Let's have a look at an example:

In [5]:
## Download Stomach Adenocarcinoma (STAD) from TCGA via cbioportal
!wget https://cbioportal-datahub.s3.amazonaws.com/stad_tcga_pan_can_atlas_2018.tar.gz
!tar -xzf stad_tcga_pan_can_atlas_2018.tar.gz


--2024-08-02 15:28:58--  https://cbioportal-datahub.s3.amazonaws.com/stad_tcga_pan_can_atlas_2018.tar.gz
Resolving cbioportal-datahub.s3.amazonaws.com (cbioportal-datahub.s3.amazonaws.com)... 52.217.123.233, 52.217.123.33, 52.216.214.217, ...
Connecting to cbioportal-datahub.s3.amazonaws.com (cbioportal-datahub.s3.amazonaws.com)|52.217.123.233|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 270698379 (258M) [application/x-tar]
Saving to: ‘stad_tcga_pan_can_atlas_2018.tar.gz’

2024-08-02 15:29:15 (15.7 MB/s) - ‘stad_tcga_pan_can_atlas_2018.tar.gz’ saved [270698379/270698379]



In [8]:
## Assume we want to integrate RNAseq data and methylation data with Autoencodix
## Let's have a look at the format
!echo "RNASeq data"
!head ./stad_tcga_pan_can_atlas_2018/data_mrna_seq_v2_rsem.txt | cut -d$'\t' -f1-5
!echo "" 
!echo "Methylation data"
!head ./stad_tcga_pan_can_atlas_2018/data_methylation_hm27_hm450_merged.txt | cut -d$'\t' -f1-5 

RNASeq data
Hugo_Symbol	Entrez_Gene_Id	TCGA-3M-AB46-01	TCGA-3M-AB47-01	TCGA-B7-5816-01
	100130426	NA	NA	NA
	100133144	9.03123265102881	11.3578399208674	2.78670231680473
UBE2Q2P2	100134869	9.33091037876535	4.04540928371471	6.52782278727798
HMGB1P1	10357	532.008646595265	242.424264157504	17.1946772404742
	10431	2798.81395589004	1139.17798133097	970.855139799279
	136542	NA	NA	NA
	155060	201.067799452434	161.403860554953	42.1178187296961
RNU12-2P	26823	2.67435202229722	0.980275237559254	0.0902798202715707
SSX9P	280660	NA	NA	NA

Methylation data
ENTITY_STABLE_ID	NAME	DESCRIPTION	TRANSCRIPT_ID	TCGA-3M-AB46-01
cg00000292	ATP2A1	1stExon	NM_173201;NM_004320	0.352150562578197
cg00003994	MEOX2	1stExon	NM_005924	0.644610852345857
cg00005847	HOXD3	5'UTR	NM_006898	0.764708219270656
cg00007981	PANX1	1stExon	NM_015368	0.0263955618283324
cg00008493	KIAA1409;COX8C	Body;5'UTR	NM_020818;NM_182971	0.941936515760024
cg00008713	IMPA2	TSS1500	NM_014214	0.062956253756997
cg00009407	TTC8	TSS200	NM_144596;NM_198

For use with AUTOENCODIX we need to address the following format issues:

- The standard format from cbioportal is transposed (features x samples).
- Methylation data is not per gene (Entrez Gene ID), but per probe. This works with `varix` and other autoencoders, but for the ontology-based `ontix` it is better to aggregate methylation data per gene so that it integrates with the RNA data.
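Both issues can be illustrated on a toy table before touching the real files. The sketch below uses made-up probe IDs, gene names, and values; it averages probes per gene and then transposes to samples x features.

```python
import pandas as pd

# Hypothetical toy methylation table in the downloaded orientation:
# one row per probe, one column per sample, plus a gene-name column.
df_probe = pd.DataFrame(
    {
        "NAME": ["GENE_A", "GENE_A", "GENE_B"],
        "sample_1": [0.2, 0.4, 0.9],
        "sample_2": [0.6, 0.8, 0.1],
    },
    index=pd.Index(["cg_001", "cg_002", "cg_003"], name="probe_id"),
)

# Aggregate probes per gene (mean over probes of the same gene) ...
df_gene = df_probe.groupby("NAME").mean(numeric_only=True)

# ... then transpose so that rows are samples and columns are features.
df_samples = df_gene.T

print(df_samples)
```

The mean is one possible aggregation; it is the one used for the real data in this tutorial.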

Let's reformat the data

In [9]:
import pandas as pd

df_rna = pd.read_csv(
	"./stad_tcga_pan_can_atlas_2018/data_mrna_seq_v2_rsem.txt",
	delimiter="\t",
	index_col=["Entrez_Gene_Id"],
	dtype= {"Entrez_Gene_Id": str},
) # We only need Entrez ID
map_hugo_entrez = df_rna["Hugo_Symbol"]
df_rna = df_rna.drop(columns=["Hugo_Symbol"], errors="ignore")
df_rna = df_rna.loc[~df_rna.index.isnull(), :]  # Filter Genes without Entrez ID
df_rna = df_rna.T.dropna(axis=1) # Swap rows and columns + drop features with NA values
df_rna = df_rna.loc[:,~df_rna.columns.duplicated()] # Remove duplicated features

print("Shape of RNASeq data")
print(df_rna.shape)

df_rna.to_parquet(
	"stad_rnaseq_formatted.parquet",
	index=True,
)


## For Methylation data as well

df_meth = pd.read_csv(
	"./stad_tcga_pan_can_atlas_2018/data_methylation_hm27_hm450_merged.txt",
	delimiter="\t",
	index_col=["ENTITY_STABLE_ID"],
	dtype= {"ENTITY_STABLE_ID": str},
)
df_meth = df_meth.merge(
            map_hugo_entrez.reset_index(), # Get the Entrez ID from RNA data
            left_on="NAME",
            right_on="Hugo_Symbol",
        )
df_meth = df_meth.drop(columns=["ENTITY_STABLE_ID", "NAME", "DESCRIPTION", "TRANSCRIPT_ID"], errors="ignore") #Dropping not needed columns
df_meth = df_meth.groupby(["Entrez_Gene_Id"]).mean(numeric_only=True) # We will aggregate over multiple measurements per gene to match RNA data

df_meth = df_meth.loc[~df_meth.index.isnull(), :]  # Filter Genes without Entrez ID
df_meth = df_meth.T.dropna(axis=1) # Swap rows and columns + drop features with NA values
df_meth = df_meth.loc[:,~df_meth.columns.duplicated()] # Remove duplicated features

print("Shape of Methylation data")
print(df_meth.shape)

df_meth.to_parquet(
	"stad_meth_formatted.parquet",
	index=True,
)

Shape of RNASeq data
(412, 16747)
Shape of Methylation data
(440, 11055)
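The two matrices have different numbers of samples (412 vs. 440). If you want to see which samples are present in both modalities, a small helper like the following can be used. This is only a sketch: `shared_samples` is a hypothetical helper name, and the toy frames stand in for `df_rna` and `df_meth`.

```python
import pandas as pd


def shared_samples(df_a: pd.DataFrame, df_b: pd.DataFrame) -> pd.Index:
    """Return the sample IDs (row index values) present in both matrices."""
    return df_a.index.intersection(df_b.index)


# Toy stand-ins for two formatted modalities (samples x features).
df_a = pd.DataFrame({"f1": [1.0, 2.0, 3.0]}, index=["s1", "s2", "s3"])
df_b = pd.DataFrame({"g1": [0.1, 0.2]}, index=["s2", "s4"])

common = shared_samples(df_a, df_b)
print(len(common), "shared samples:", list(common))
```

With the real data, `shared_samples(df_rna, df_meth)` would list the sample IDs available for both RNASeq and methylation.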


A file with clinical variables is also required; it provides the sample annotations used in the figures.

Let's check the clinical files from TCGA:

In [9]:
!echo "First file:"
!head ./stad_tcga_pan_can_atlas_2018/data_clinical_patient.txt | cut -d$'\t' -f1-5
## Information in two files
!echo ""
!echo "Second file:"
!head ./stad_tcga_pan_can_atlas_2018/data_clinical_sample.txt | cut -d$'\t' -f1-5

First file:


#Patient Identifier	Subtype	TCGA PanCanAtlas Cancer Type Acronym	Other Patient ID	Diagnosis Age
#Identifier to uniquely specify a patient.	Subtype	Text field to hold cancer type acronym used by TCGA PanCanAtlas.	Legacy DMP patient identifier (DMPnnnn)	Age at which a condition or disease was first diagnosed.
#STRING	STRING	STRING	STRING	NUMBER
#1	1	1	1	1
PATIENT_ID	SUBTYPE	CANCER_TYPE_ACRONYM	OTHER_PATIENT_ID	AGE
TCGA-3M-AB46	STAD_CIN	STAD	BE6531B2-D1F3-44AB-9C02-1CEAE51EF2BB	70
TCGA-3M-AB47	STAD_GS	STAD	85C11B74-9E50-4DA1-8C0B-D5677CC801B1	51
TCGA-B7-5816	STAD_MSI	STAD	f07070c0-fd0a-4c19-ba1e-5f06b933cd7c	51
TCGA-B7-5818	STAD_EBV	STAD	6e03b415-84a1-4b91-8717-1a41edd4a255	62
TCGA-B7-A5TI	STAD_MSI	STAD	4310A287-5F01-4E0D-94E3-96C5379C3245	52

Second file:
#Patient Identifier	Sample Identifier	Oncotree Code	Cancer Type	Cancer Type Detailed
#Identifier to uniquely specify a patient.	A unique sample identifier.	Oncotree Code	Cancer Type	Cancer Type Detailed
#STRING	STRING	STRING	STRING	STRI

The shape is already correct for AUTOENCODIX (samples x features).

However, we need to skip the pre-header rows and join the two files on PATIENT_ID, then use SAMPLE_ID as the index.

In [10]:
# Both files start with four "#"-prefixed pre-header rows.
# Skip them so that the next row (PATIENT_ID, ...) is used as the header.
df_clin_sample = pd.read_csv(
	"./stad_tcga_pan_can_atlas_2018/data_clinical_sample.txt",
	index_col=["PATIENT_ID"],
	skiprows=4,
	delimiter="\t",
)
df_clin_patient = pd.read_csv(
	"./stad_tcga_pan_can_atlas_2018/data_clinical_patient.txt",
	index_col=["PATIENT_ID"],
	skiprows=4,
	delimiter="\t",
)

df_clin = df_clin_sample.merge(
	df_clin_patient, left_on="PATIENT_ID", right_on="PATIENT_ID"
)
df_clin = df_clin.set_index("SAMPLE_ID")

for col in df_clin: 
	dt = df_clin[col].dtype
	if dt == object or dt == str:
		df_clin[col] = df_clin[col].fillna("unknown") ## We must fill missing information in annotation files

print("Clinical variables we can use later for visualization:")
print(df_clin.columns)

df_clin.to_parquet("./stad_clin_formatted.parquet")

Clinical variables we can use later for visualization:
Index(['ONCOTREE_CODE', 'CANCER_TYPE', 'CANCER_TYPE_DETAILED', 'TUMOR_TYPE',
       'GRADE', 'TISSUE_PROSPECTIVE_COLLECTION_INDICATOR',
       'TISSUE_RETROSPECTIVE_COLLECTION_INDICATOR', 'TISSUE_SOURCE_SITE_CODE',
       'TUMOR_TISSUE_SITE', 'ANEUPLOIDY_SCORE', 'SAMPLE_TYPE',
       'MSI_SCORE_MANTIS', 'MSI_SENSOR_SCORE', 'SOMATIC_STATUS',
       'TMB_NONSYNONYMOUS', 'TISSUE_SOURCE_SITE', 'SUBTYPE',
       'CANCER_TYPE_ACRONYM', 'OTHER_PATIENT_ID', 'AGE', 'SEX',
       'AJCC_PATHOLOGIC_TUMOR_STAGE', 'AJCC_STAGING_EDITION',
       'DAYS_LAST_FOLLOWUP', 'DAYS_TO_BIRTH',
       'DAYS_TO_INITIAL_PATHOLOGIC_DIAGNOSIS', 'ETHNICITY',
       'FORM_COMPLETION_DATE', 'HISTORY_NEOADJUVANT_TRTYN', 'ICD_10',
       'ICD_O_3_HISTOLOGY', 'ICD_O_3_SITE', 'INFORMED_CONSENT_VERIFIED',
       'NEW_TUMOR_EVENT_AFTER_INITIAL_TREATMENT', 'PATH_M_STAGE',
       'PATH_N_STAGE', 'PATH_T_STAGE', 'PERSON_NEOPLASM_CANCER_STATUS',
       'PRIMARY_LYMPH_NODE_P
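
The pre-header handling and the join can be checked on a small in-memory example. The two miniature tables below are hypothetical and only mimic the layout of the clinical files shown earlier (four `#`-prefixed rows followed by the real header); the IDs and values are made up.

```python
import io

import pandas as pd

# Hypothetical miniature files: four '#'-prefixed pre-header rows,
# then the real header row, then the data rows.
patient_txt = (
    "#Patient Identifier\tSubtype\n"
    "#Identifier to uniquely specify a patient.\tSubtype\n"
    "#STRING\tSTRING\n"
    "#1\t1\n"
    "PATIENT_ID\tSUBTYPE\n"
    "P-01\tTYPE_X\n"
    "P-02\t\n"
)
sample_txt = (
    "#Patient Identifier\tSample Identifier\n"
    "#Identifier to uniquely specify a patient.\tA unique sample identifier.\n"
    "#STRING\tSTRING\n"
    "#1\t1\n"
    "PATIENT_ID\tSAMPLE_ID\n"
    "P-01\tP-01-01\n"
    "P-02\tP-02-01\n"
)

# skiprows=4 drops the pre-header rows, so the next line becomes the header.
df_patient = pd.read_csv(io.StringIO(patient_txt), sep="\t", skiprows=4)
df_sample = pd.read_csv(io.StringIO(sample_txt), sep="\t", skiprows=4)

# Join on the patient ID, then index by sample ID.
df_toy_clin = df_sample.merge(df_patient, on="PATIENT_ID").set_index("SAMPLE_ID")

# Fill missing text annotations, as done for the real data.
df_toy_clin["SUBTYPE"] = df_toy_clin["SUBTYPE"].fillna("unknown")

print(df_toy_clin)
```

Indexing by SAMPLE_ID matters because the omics matrices use sample IDs (for example `TCGA-3M-AB46-01`), not patient IDs, as their row index.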

## Copy formatted data to root data directory
The standard directory for your final input data is `data/raw`.

In [13]:
!cp ./stad_*.parquet ../../data/raw/
!ls ../../data/raw/stad*.parquet

../../data/raw/stad_clin_formatted.parquet
../../data/raw/stad_meth_formatted.parquet
../../data/raw/stad_rnaseq_formatted.parquet
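
Before training, it can be useful to re-check a formatted matrix against the cleaning steps applied in this tutorial (unique sample IDs, unique feature names, no missing values). The helper below is a hypothetical sketch, not part of AUTOENCODIX, and the checks mirror this tutorial rather than an official list of requirements.

```python
import pandas as pd


def check_matrix(df: pd.DataFrame) -> list[str]:
    """Return a list of format problems found in a samples x features matrix."""
    problems = []
    if not df.index.is_unique:
        problems.append("duplicated sample IDs in index")
    if not df.columns.is_unique:
        problems.append("duplicated feature names in columns")
    if df.isna().any().any():
        problems.append("missing values present")
    return problems


# Toy example with one missing value; an empty list means no problems found.
df_toy = pd.DataFrame({"f1": [1.0, None], "f2": [0.5, 0.7]}, index=["s1", "s2"])
print(check_matrix(df_toy))  # ['missing values present']
```

With the real files, you could call `check_matrix(pd.read_parquet("../../data/raw/stad_rnaseq_formatted.parquet"))` and expect an empty list.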


Now we are ready to train autoencoders!  
To do this, see the other tutorials `Basics_Autoencodix.ipynb` or `Advanced_Ontix.ipynb`.