# Setting up AUTOENCODIX
<img src="https://raw.githubusercontent.com/jan-forest/autoencodix/5dabc4a697cbba74d3f6144dc4b6d0fd6df2b624/images/autoencodix_logo.svg" alt="AUTOENCODIX-Logo" width="300"/>

This tutorial has been tested with bash on Linux and macOS. On Windows you may need to adapt the commands for PowerShell.

## Get the code and create environment
You have probably already cloned the repo and created the environment.
If not, run the following in your terminal:
```bash
# Check if running on macOS and if the repo exists
if [[ $(uname) == "Darwin" ]]; then echo "Running on macOS"; IS_MACOS=true; else echo "Not running on macOS"; IS_MACOS=false; fi
if [ -d "autoencodix" ]; then echo "Repository already exists, skipping clone..."; else echo "Cloning repository..."; git clone https://github.com/jan-forest/autoencodix.git; fi

# Change to the repo directory
cd autoencodix

# Copy macOS Makefile if on macOS
if [[ $(uname) == "Darwin" ]]; then if [ -f "Makefile_macos" ]; then echo "Copying macOS Makefile..."; cp Makefile_macos Makefile; echo "Copied successfully"; else echo "Warning: Makefile_macos not found"; fi; else echo "Not on macOS, keeping default Makefile"; fi

# Create environment
make create_environment

```

## Install requirements

In [None]:
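# Each `!` command runs in its own subshell, so activate and install on one line
# (`.` is the POSIX equivalent of `source`)
!. venv-gallia/bin/activate && make requirements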

# Basic usage of AUTOENCODIX

## Input data and supported format

### (1) Get your input data in the shape samples x features
Each data modality should be provided as a data matrix of shape samples x features, with sample IDs as the row index and feature names as column headers.

The data can be provided as text files (csv, tsv, txt) or as parquet files.
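As a minimal sketch of what this layout looks like, the snippet below builds a toy samples x features matrix and runs a few sanity checks on it. The helper `check_samples_by_features` is hypothetical and not part of AUTOENCODIX; the checks (unique IDs, numeric values, no missing values) are assumptions that mirror the cleaning steps used later in this tutorial.

```python
import pandas as pd

def check_samples_by_features(df: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means the layout looks fine."""
    problems = []
    if not df.index.is_unique:
        problems.append("duplicated sample IDs in the index")
    if not df.columns.is_unique:
        problems.append("duplicated feature names in the columns")
    non_numeric = df.select_dtypes(exclude="number").columns.tolist()
    if non_numeric:
        problems.append(f"non-numeric feature columns: {non_numeric}")
    if df.isna().any().any():
        problems.append("missing values present")
    return problems

# Toy matrix: 3 samples (rows) x 2 features (columns), values for illustration only
toy = pd.DataFrame(
    {"10357": [52.2, 69.8, 154.3], "26823": [0.0, 1.1, 0.0]},
    index=pd.Index(
        ["TCGA-3C-AAAU-01", "TCGA-3C-AALI-01", "TCGA-3C-AALJ-01"], name="sample_id"
    ),
)

print(toy.shape)                       # rows are samples, columns are features
print(check_samples_by_features(toy))  # list of detected problems
```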

Let's have a look at an example:
- Combine seven cancer types from TCGA
- Prepare two data modalities: gene expression (RNA) and methylation data (METH)

In [None]:
import urllib.request
import tarfile

def download_and_extract(url, filename):
	print(f"Downloading {filename}...")
	urllib.request.urlretrieve(url, filename)
	print(f"Extracting {filename}...")
	with tarfile.open(filename, 'r:gz') as tar:
		tar.extractall()

cancer_types = ["brca", "luad", "lusc", "ov", "coadread", "ucec", "ucs"]

for cancer in cancer_types:
	# Download TCGA data via cBioPortal
	url = f"https://cbioportal-datahub.s3.amazonaws.com/{cancer}_tcga_pan_can_atlas_2018.tar.gz"
	filename = f"{cancer}_tcga_pan_can_atlas_2018.tar.gz"
	download_and_extract(url, filename)

print("All downloads and extractions completed!")


Downloading brca_tcga_pan_can_atlas_2018.tar.gz...
Extracting brca_tcga_pan_can_atlas_2018.tar.gz...
Downloading luad_tcga_pan_can_atlas_2018.tar.gz...
Extracting luad_tcga_pan_can_atlas_2018.tar.gz...
Downloading lusc_tcga_pan_can_atlas_2018.tar.gz...
Extracting lusc_tcga_pan_can_atlas_2018.tar.gz...
Downloading ov_tcga_pan_can_atlas_2018.tar.gz...
Extracting ov_tcga_pan_can_atlas_2018.tar.gz...
Downloading coadread_tcga_pan_can_atlas_2018.tar.gz...
Extracting coadread_tcga_pan_can_atlas_2018.tar.gz...
Downloading ucec_tcga_pan_can_atlas_2018.tar.gz...
Extracting ucec_tcga_pan_can_atlas_2018.tar.gz...
Downloading ucs_tcga_pan_can_atlas_2018.tar.gz...
Extracting ucs_tcga_pan_can_atlas_2018.tar.gz...
All downloads and extractions completed!


In [3]:
## Assume we want to integrate RNAseq data and methylation data with Autoencodix
## Let's have a look at the format
!echo "RNASeq data"
!head ./brca_tcga_pan_can_atlas_2018/data_mrna_seq_v2_rsem.txt | cut -d$'\t' -f1-5
!echo "" 
!echo "Methylation data"
!head ./brca_tcga_pan_can_atlas_2018/data_methylation_hm27_hm450_merged.txt | cut -d$'\t' -f1-5 

RNASeq data
Hugo_Symbol	Entrez_Gene_Id	TCGA-3C-AAAU-01	TCGA-3C-AALI-01	TCGA-3C-AALJ-01
	100130426	0	0	0.9066
	100133144	16.3644	9.2659	11.6228
UBE2Q2P2	100134869	12.9316	17.379	9.2294
HMGB1P1	10357	52.1503	69.7553	154.297
	10431	408.076	563.893	1360.83
	136542	0	0	0
	155060	1187.01	516.041	592.022
RNU12-2P	26823	0	1.0875	0
SSX9P	280660	0	0.5438	0

Methylation data
ENTITY_STABLE_ID	NAME	DESCRIPTION	TRANSCRIPT_ID	TCGA-3C-AAAU-01
cg00000292	ATP2A1	1stExon	NM_173201;NM_004320	0.67848346283127
cg00003994	MEOX2	1stExon	NM_005924	0.100005173216671
cg00005847	HOXD3	5'UTR	NM_006898	0.875122134700595
cg00007981	PANX1	1stExon	NM_015368	0.0286584491585683
cg00008493	KIAA1409;COX8C	Body;5'UTR	NM_020818;NM_182971	0.954225470776835
cg00008713	IMPA2	TSS1500	NM_014214	0.0849249456209205
cg00009407	TTC8	TSS200	NM_144596;NM_198310;NM_198309	0.0311837978543924
cg00011459	PMM2;TMEM186	Body;TSS1500	NM_000303;NM_015421	0.94043753639814
cg00012199	ANG;RNASE4	TSS1500	NM_002937;NM_001145	0.0473055017612671


For use with AUTOENCODIX we need to address the following format issues:

- The standard format from cBioPortal is flipped (features x samples)
- Methylation data is not per gene (Entrez Gene ID), but per probe. This works with `varix` and other autoencoders, but for the ontology-based `ontix` it is better to aggregate methylation data per gene so that it integrates with the RNA data.  
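Both fixes can be sketched on a toy table before touching the real files. The probe IDs, sample names, and values below are made up, and the toy groups by gene symbol for brevity, whereas the tutorial code groups by Entrez Gene ID after mapping.

```python
import pandas as pd

# Toy probe-level table in the cBioPortal layout (features x samples)
probes = pd.DataFrame(
    {
        "NAME": ["ATP2A1", "ATP2A1", "MEOX2"],  # gene symbol per probe
        "sample_A": [0.2, 0.4, 0.9],
        "sample_B": [0.6, 0.8, 0.1],
    },
    index=pd.Index(["cg01", "cg02", "cg03"], name="ENTITY_STABLE_ID"),
)

# Issue 2: aggregate several probes per gene by taking the mean
per_gene = probes.groupby("NAME").mean(numeric_only=True)

# Issue 1: flip features x samples into samples x features
samples_by_features = per_gene.T

print(samples_by_features)
```

Averaging is only one possible aggregation; it is used here because the tutorial's own formatting function does the same.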

Let's reformat the data

In [4]:
import pandas as pd

## Function to format RNASeq data
def format_rna_data(cancer_types):
	combined_rna = []
	for cancer_type in cancer_types:
		df_rna = pd.read_csv(
			f"./{cancer_type}_tcga_pan_can_atlas_2018/data_mrna_seq_v2_rsem.txt",
			delimiter="\t",
			index_col=["Entrez_Gene_Id"],
			dtype={"Entrez_Gene_Id": str},
		)  # We only need Entrez ID
		map_hugo_entrez = df_rna["Hugo_Symbol"]
		df_rna = df_rna.drop(columns=["Hugo_Symbol"], errors="ignore")
		df_rna = df_rna.loc[~df_rna.index.isnull(), :]  # Filter Genes without Entrez ID
		df_rna = df_rna.T.dropna(axis=1)  # Swap rows and columns + drop features with NA values
		df_rna = df_rna.loc[:, ~df_rna.columns.duplicated()]  # Remove duplicated features
		combined_rna.append(df_rna)

	combined_rna_df = pd.concat(combined_rna, axis=0)
	print("Shape of combined RNASeq data")
	print(combined_rna_df.shape)

	combined_rna_df.to_parquet(
		"combined_rnaseq_formatted.parquet",
		index=True,
	)
	return map_hugo_entrez

## Function to format methylation data
def format_meth_data(cancer_types, map_hugo_entrez):
	combined_meth = []
	for cancer_type in cancer_types:
		df_meth = pd.read_csv(
			f"./{cancer_type}_tcga_pan_can_atlas_2018/data_methylation_hm27_hm450_merged.txt",
			delimiter="\t",
			index_col=["ENTITY_STABLE_ID"],
			dtype={"ENTITY_STABLE_ID": str},
		)
		df_meth = df_meth.merge(
			map_hugo_entrez.reset_index(),  # Get the Entrez ID from RNA data
			left_on="NAME",
			right_on="Hugo_Symbol",
		)
		df_meth = df_meth.drop(
			columns=["ENTITY_STABLE_ID", "NAME", "DESCRIPTION", "TRANSCRIPT_ID"], errors="ignore")  # Dropping not needed columns
		df_meth = df_meth.groupby(["Entrez_Gene_Id"]).mean(numeric_only=True)  # Aggregate over multiple measurements per gene to match RNA data

		df_meth = df_meth.loc[~df_meth.index.isnull(), :]  # Filter Genes without Entrez ID
		df_meth = df_meth.T.dropna(axis=1)  # Swap rows and columns + drop features with NA values
		df_meth = df_meth.loc[:, ~df_meth.columns.duplicated()]  # Remove duplicated features
		combined_meth.append(df_meth)

	combined_meth_df = pd.concat(combined_meth, axis=0)
	print("Shape of combined Methylation data")
	print(combined_meth_df.shape)

	combined_meth_df.to_parquet(
		"combined_meth_formatted.parquet",
		index=True,
	)


# List of cancer subtypes
cancer_types = ["brca", "lusc", "luad", "ov", "coadread", "ucec", "ucs"]

# Format data for all cancer subtypes
print("Processing RNASeq data...")
map_hugo_entrez = format_rna_data(cancer_types)
print("Processing Methylation data...")
format_meth_data(cancer_types, map_hugo_entrez)

Processing RNASeq data...
Shape of combined RNASeq data
(3552, 20506)
Processing Methylation data...
Shape of combined Methylation data
(3875, 11285)
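The two modalities end up with different numbers of samples, so it can be useful to check how many samples they share. The sketch below uses a hypothetical helper, `sample_overlap`, on toy data; whether AUTOENCODIX aligns samples across modalities on its own is not covered here, so treat this purely as a diagnostic.

```python
import pandas as pd

def sample_overlap(a: pd.DataFrame, b: pd.DataFrame) -> dict:
    """Count samples (index entries) shared by two modalities or unique to one."""
    return {
        "shared": len(a.index.intersection(b.index)),
        "only_first": len(a.index.difference(b.index)),
        "only_second": len(b.index.difference(a.index)),
    }

# Toy modalities with made-up sample IDs and values
rna = pd.DataFrame({"g1": [1.0, 2.0, 3.0]}, index=["s1", "s2", "s3"])
meth = pd.DataFrame({"g1": [0.1, 0.2, 0.3]}, index=["s2", "s3", "s4"])

overlap = sample_overlap(rna, meth)
print(overlap)
```

To apply it to the files written above, load them with `pd.read_parquet("combined_rnaseq_formatted.parquet")` and `pd.read_parquet("combined_meth_formatted.parquet")` and pass the two frames to the helper.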


A file with clinical variables is also required to annotate the figures.

Let's check the files from TCGA:

In [5]:
!echo "First file:"
!head ./brca_tcga_pan_can_atlas_2018/data_clinical_patient.txt | cut -d$'\t' -f1-5
## Information in two files
!echo ""
!echo "Second file:"
!head ./brca_tcga_pan_can_atlas_2018/data_clinical_sample.txt | cut -d$'\t' -f1-5

First file:
#Patient Identifier	Subtype	TCGA PanCanAtlas Cancer Type Acronym	Other Patient ID	Diagnosis Age
#Identifier to uniquely specify a patient.	Subtype	Text field to hold cancer type acronym used by TCGA PanCanAtlas.	Legacy DMP patient identifier (DMPnnnn)	Age at which a condition or disease was first diagnosed.
#STRING	STRING	STRING	STRING	NUMBER
#1	1	1	1	1
PATIENT_ID	SUBTYPE	CANCER_TYPE_ACRONYM	OTHER_PATIENT_ID	AGE
TCGA-3C-AAAU	BRCA_LumA	BRCA	6E7D5EC6-A469-467C-B748-237353C23416	55
TCGA-3C-AALI	BRCA_Her2	BRCA	55262FCB-1B01-4480-B322-36570430C917	50
TCGA-3C-AALJ	BRCA_LumB	BRCA	427D0648-3F77-4FFC-B52C-89855426D647	62
TCGA-3C-AALK	BRCA_LumA	BRCA	C31900A4-5DCD-4022-97AC-638E86E889E4	52
TCGA-4H-AAAK	BRCA_LumA	BRCA	6623FC5E-00BE-4476-967A-CBD55F676EA6	50

Second file:
#Patient Identifier	Sample Identifier	Oncotree Code	Cancer Type	Cancer Type Detailed
#Identifier to uniquely specify a patient.	A unique sample identifier.	Oncotree Code	Cancer Type	Cancer Type Detailed
#STRING	STRING	

The shape is already correct for AUTOENCODIX (samples x features).

However, we need to remove the pre-header rows, join the two files on PATIENT_ID, and index the result by SAMPLE_ID.
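The two steps can be sketched on toy data first. Instead of hard-coding the number of pre-header rows, the hypothetical helper `count_preheader_lines` counts the leading lines that start with `#`; this assumes, as in the files shown above, that all pre-header rows begin with `#` and the real header does not. All values below are made up.

```python
import io
import pandas as pd

def count_preheader_lines(text: str, marker: str = "#") -> int:
    """Count consecutive lines at the top of the text that start with `marker`."""
    count = 0
    for line in text.splitlines():
        if not line.startswith(marker):
            break
        count += 1
    return count

def read_clinical(text: str) -> pd.DataFrame:
    """Read a tab-separated clinical table, skipping its pre-header rows."""
    return pd.read_csv(
        io.StringIO(text), sep="\t", skiprows=count_preheader_lines(text)
    )

# Toy stand-ins for the patient and sample files
patient_txt = (
    "#Patient Identifier\tSubtype\n"
    "#STRING\tSTRING\n"
    "PATIENT_ID\tSUBTYPE\n"
    "P1\tLumA\n"
    "P2\tHer2\n"
)
sample_txt = (
    "#Patient Identifier\tSample Identifier\n"
    "#STRING\tSTRING\n"
    "PATIENT_ID\tSAMPLE_ID\n"
    "P1\tP1-01\n"
    "P2\tP2-01\n"
)

patients = read_clinical(patient_txt)
samples = read_clinical(sample_txt)

# Join on the patient key, then index the result by sample
clin = samples.merge(patients, on="PATIENT_ID").set_index("SAMPLE_ID")
print(clin)
```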

In [6]:
combined_clin = []

for cancer_type in cancer_types:
	df_clin_sample = pd.read_csv(
		f"./{cancer_type}_tcga_pan_can_atlas_2018/data_clinical_sample.txt",
		index_col=["PATIENT_ID"],
		skiprows=3,
		header=1,
		delimiter="\t",
		dtype={"GRADE": str} # Fix inconsistent data types
		)
	df_clin_patient = pd.read_csv(
		f"./{cancer_type}_tcga_pan_can_atlas_2018/data_clinical_patient.txt",
		index_col=["PATIENT_ID"],
		skiprows=3,
		header=1,
		delimiter="\t",
		dtype={"AJCC_PATHOLOGIC_TUMOR_STAGE":str, "AJCC_STAGING_EDITION": str, "PATH_N_STAGE":str} # Fix inconsistent data types
	)

	# Clean the DAYS_LAST_FOLLOWUP column
	df_clin_patient["DAYS_LAST_FOLLOWUP"] = pd.to_numeric(
		df_clin_patient["DAYS_LAST_FOLLOWUP"], errors="coerce"
	)

	df_clin = df_clin_sample.merge(
		df_clin_patient, left_on="PATIENT_ID", right_on="PATIENT_ID"
	)
	df_clin = df_clin.set_index("SAMPLE_ID")

	for col in df_clin:
		dt = df_clin[col].dtype
		if dt.name == 'object' or dt.name == 'string':
			df_clin[col] = df_clin[col].fillna("unknown")  # Fill missing information in annotation files
			df_clin[col] = df_clin[col].astype(str)  # Ensure consistent string type

	combined_clin.append(df_clin)

# Combine all clinical data into a single DataFrame
combined_clin_df = pd.concat(combined_clin, axis=0)

print("Clinical variables we can use later for visualization:")
print(combined_clin_df.columns)

# Save the combined clinical data to a parquet file
combined_clin_df.to_parquet("./combined_clin_formatted.parquet")

Clinical variables we can use later for visualization:
Index(['ONCOTREE_CODE', 'CANCER_TYPE', 'CANCER_TYPE_DETAILED', 'TUMOR_TYPE',
       'GRADE', 'TISSUE_PROSPECTIVE_COLLECTION_INDICATOR',
       'TISSUE_RETROSPECTIVE_COLLECTION_INDICATOR', 'TISSUE_SOURCE_SITE_CODE',
       'TUMOR_TISSUE_SITE', 'ANEUPLOIDY_SCORE', 'SAMPLE_TYPE',
       'MSI_SCORE_MANTIS', 'MSI_SENSOR_SCORE', 'SOMATIC_STATUS',
       'TMB_NONSYNONYMOUS', 'TISSUE_SOURCE_SITE', 'TBL_SCORE', 'SUBTYPE',
       'CANCER_TYPE_ACRONYM', 'OTHER_PATIENT_ID', 'AGE', 'SEX',
       'AJCC_PATHOLOGIC_TUMOR_STAGE', 'AJCC_STAGING_EDITION',
       'DAYS_LAST_FOLLOWUP', 'DAYS_TO_BIRTH',
       'DAYS_TO_INITIAL_PATHOLOGIC_DIAGNOSIS', 'ETHNICITY',
       'FORM_COMPLETION_DATE', 'HISTORY_NEOADJUVANT_TRTYN', 'ICD_10',
       'ICD_O_3_HISTOLOGY', 'ICD_O_3_SITE', 'INFORMED_CONSENT_VERIFIED',
       'NEW_TUMOR_EVENT_AFTER_INITIAL_TREATMENT', 'PATH_M_STAGE',
       'PATH_N_STAGE', 'PATH_T_STAGE', 'PERSON_NEOPLASM_CANCER_STATUS',
       'PRIMARY

## Copy formatted data to root data directory
The standard directory for your final input data is `data/raw`.

In [7]:
!mkdir -p ../data/raw/
!cp ./combined_*.parquet ../data/raw/
!ls ../data/raw/combined*.parquet

../data/raw/combined_clin_formatted.parquet
../data/raw/combined_meth_formatted.parquet
../data/raw/combined_rnaseq_formatted.parquet
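If the shell commands are not available (for example on Windows), the same copy step can be done in Python with the standard library. The helper `copy_formatted_files` is a hypothetical name for this sketch; the demo runs in a temporary directory with an empty placeholder file so it has no side effects.

```python
import shutil
import tempfile
from pathlib import Path

def copy_formatted_files(src_dir, dst_dir, pattern="combined_*.parquet"):
    """Copy files matching `pattern` from src_dir to dst_dir and return the new paths."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    copied = []
    for src in sorted(Path(src_dir).glob(pattern)):
        copied.append(Path(shutil.copy2(src, dst / src.name)))
    return copied

# Demo in a temporary directory with an empty placeholder file
with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp) / "work"
    work.mkdir()
    (work / "combined_demo.parquet").write_bytes(b"")
    copied = copy_formatted_files(work, Path(tmp) / "data" / "raw")
    copied_names = [p.name for p in copied]
    print(copied_names)
```

In the notebook, the equivalent of the cell above would be `copy_formatted_files(".", "../data/raw")`.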


Now we are ready to train autoencoders!  
To do this, see the other tutorials `Basics_Autoencodix.ipynb` or `Advanced_Ontix.ipynb`.