## 1. Download the datasets from the UCSC Xena website

### 1.1 Download the multi-omics data

* Download the data for the TCGA Pan-Cancer (PANCAN) cohort from the UCSC Xena website:
https://xenabrowser.net/datapages/?cohort=TCGA%20Pan-Cancer%20(PANCAN)&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443

* Copy number (gene-level) - gene-level copy number (gistic2_thresholded)
    * Dataset: https://xenabrowser.net/datapages/?dataset=TCGA.PANCAN.sampleMap%2FGistic2_CopyNumber_Gistic2_all_thresholded.by_genes&host=https%3A%2F%2Ftcga.xenahubs.net&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443

* DNA methylation (Methylation450K)
    * Dataset: https://xenabrowser.net/datapages/?dataset=jhu-usc.edu_PANCAN_HumanMethylation450.betaValue_whitelisted.tsv.synapse_download_5096262.xena&host=https%3A%2F%2Fpancanatlas.xenahubs.net&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443
    * ID Map: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL16304

* Gene expression RNAseq - TOIL RSEM fpkm
    * Dataset: https://xenabrowser.net/datapages/?dataset=tcga_RSEM_gene_fpkm&host=https%3A%2F%2Ftoil.xenahubs.net&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443
    
* Protein expression - RPPA
    * Dataset: https://xenabrowser.net/datapages/?dataset=TCGA-RPPA-pancan-clean.xena&host=https%3A%2F%2Fpancanatlas.xenahubs.net&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443

### 1.2 Download the clinical data

* Phenotype - Curated clinical data
    * Dataset: https://xenabrowser.net/datapages/?dataset=Survival_SupplementalTable_S1_20171025_xena_sp&host=https%3A%2F%2Fpancanatlas.xenahubs.net&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443

* Phenotype - Immune subtype
    * Dataset: https://xenabrowser.net/datapages/?dataset=Subtype_Immune_Model_Based.txt&host=https%3A%2F%2Fpancanatlas.xenahubs.net&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443

* Phenotype - Molecular subtype
    * Dataset: https://xenabrowser.net/datapages/?dataset=TCGASubtype.20170308.tsv&host=https%3A%2F%2Fpancanatlas.xenahubs.net&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443

* Phenotype - sample type and primary disease
    * Dataset: https://xenabrowser.net/datapages/?dataset=TCGA_phenotype_denseDataOnlyDownload.tsv&host=https%3A%2F%2Fpancanatlas.xenahubs.net&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443
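
Before reading anything, it can help to confirm that every download landed in the expected folder. The sketch below is only a convenience check: the folder `./UCSC-raw` and the file names are taken from the read cells in section 2, it assumes the downloads were decompressed and kept under those names, and `find_missing_files` is a hypothetical helper name.

```python
from pathlib import Path

# File names as they appear in the read cells of section 2.
EXPECTED_FILES = [
    "jhu-usc.edu_PANCAN_HumanMethylation450.betaValue_whitelisted.tsv.synapse_download_5096262.xena",
    "GPL16304-47833.txt",
    "Gistic2_CopyNumber_Gistic2_all_thresholded.by_genes",
    "tcga_RSEM_gene_fpkm",
    "probeMap_gencode.v23.annotation.gene.probemap",
    "Survival_SupplementalTable_S1_20171025_xena_sp",
    "Subtype_Immune_Model_Based.txt",
    "TCGA-RPPA-pancan-clean.xena",
    "TCGASubtype.20170308.tsv",
    "TCGA_phenotype_denseDataOnlyDownload.tsv",
]


def find_missing_files(raw_dir="./UCSC-raw", expected=EXPECTED_FILES):
    """Return the expected file names that are not present in raw_dir."""
    raw_path = Path(raw_dir)
    return [name for name in expected if not (raw_path / name).is_file()]


missing = find_missing_files()
if missing:
    print("Missing raw files:", missing)
else:
    print("All expected raw files are present.")
```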


## 2. Read the files

### 2.1 Read DNA methylation data

In [None]:
import pandas as pd
methylation_value = pd.read_csv('./UCSC-raw/jhu-usc.edu_PANCAN_HumanMethylation450.betaValue_whitelisted.tsv.synapse_download_5096262.xena', delimiter='\t')

In [None]:
methylation_value

### 2.2 Read platform annotations for 450K methylation

In [None]:
import pandas as pd
# Read the platform annotation and keep probes that map to a single closest TSS
annotation = pd.read_table('./UCSC-raw/GPL16304-47833.txt', delimiter='\t')
try:
    annotation['Distance_closest_TSS'] = annotation['Distance_closest_TSS'].astype(int)
    annotation = annotation[~annotation['Closest_TSS'].apply(lambda x: len(str(x).split(';')) > 1)]
except ValueError as e:
    print(f"Unable to convert 'Distance_closest_TSS' column to integer: {e}")
    # Negative distances are valid, so detect bad values with to_numeric rather than str.isnumeric()
    problematic_rows = pd.to_numeric(annotation['Distance_closest_TSS'], errors='coerce').isna()
    print("Problematic rows:")
    print(annotation.loc[problematic_rows])

annotation

In [None]:
map_annotation = annotation[['ID', 'Closest_TSS', 'Closest_TSS_gene_name', 'Distance_closest_TSS']]
map_annotation

### 2.3 Read copy number data

In [None]:
copynumber = pd.read_csv('./UCSC-raw/Gistic2_CopyNumber_Gistic2_all_thresholded.by_genes',sep='\t')
copynumber

### 2.4 Read gene expression data and mapping information, and substitute the gene name

In [None]:
gene_expression = pd.read_csv('./UCSC-raw/tcga_RSEM_gene_fpkm', sep='\t')
gene_expression

In [None]:
gene_expression_map = pd.read_csv('./UCSC-raw/probeMap_gencode.v23.annotation.gene.probemap', sep='\t')

In [None]:
gene_expression_map

In [None]:
expression_merged = pd.merge(gene_expression, gene_expression_map, left_on='sample', right_on='id', how='left')

In [None]:
expression_merged.drop(columns=['sample','chrom', 'chromStart','chromEnd','strand','id'], inplace=True)

In [None]:
# set gene to the first column
cols = ['gene'] + [col for col in expression_merged if col != 'gene']
gene_expression = expression_merged[cols]

In [None]:
gene_expression 
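
The left merge above keeps every expression row, so identifiers absent from the probe map end up with a missing `gene` value. The sketch below shows one way to surface them on toy data; the identifiers, sample name, and values are made up for illustration, and only `pd.merge` with `indicator=True` is relied on.

```python
import pandas as pd

# Toy stand-ins for the expression matrix and the probe map (values are made up).
toy_expression = pd.DataFrame({
    "sample": ["ENSG_A.1", "ENSG_B.2", "ENSG_C.3"],
    "TCGA-XX-0001-01": [1.0, 2.0, 3.0],
})
toy_map = pd.DataFrame({
    "id": ["ENSG_A.1", "ENSG_B.2"],
    "gene": ["GENE1", "GENE2"],
})

# indicator=True adds a '_merge' column telling where each row came from.
merged = pd.merge(toy_expression, toy_map, left_on="sample", right_on="id",
                  how="left", indicator=True)

# Rows marked 'left_only' had no entry in the probe map.
unmapped = merged.loc[merged["_merge"] == "left_only", "sample"].tolist()
print("Unmapped identifiers:", unmapped)

# Several identifiers can share one gene name; count the repeats.
duplicated_genes = int(merged["gene"].dropna().duplicated().sum())
print("Repeated gene names:", duplicated_genes)
```

Repeated gene names are not an error here: section 4.1 averages rows that share a gene name.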

### 2.5 Read clinical data

In [None]:
survival = pd.read_csv('./UCSC-raw/Survival_SupplementalTable_S1_20171025_xena_sp', sep='\t')

In [None]:
survival

### 2.6 Read immune subtype data

In [None]:
immune_subtype = pd.read_csv('./UCSC-raw/Subtype_Immune_Model_Based.txt',sep='\t')

In [None]:
immune_subtype

### 2.7 Read proteomics data

In [None]:
protein = pd.read_csv('./UCSC-raw/TCGA-RPPA-pancan-clean.xena',sep='\t')

In [None]:
protein

### 2.8 Read molecular subtype

In [None]:
cellsub = pd.read_csv('./UCSC-raw/TCGASubtype.20170308.tsv', sep='\t')

In [None]:
cellsub

### 2.9 Read sample type and primary disease

In [None]:
dense = pd.read_csv('./UCSC-raw/TCGA_phenotype_denseDataOnlyDownload.tsv', sep='\t')

In [None]:
dense

## 3. Methylation data process

### 3.1 Define methylation region

In [None]:
import pandas as pd
import numpy as np

# Define a vectorized function that assigns each probe to a region based on its distance to the closest TSS
def vectorized_determine_region(distances):
    regions = ['Upstream', 'Distal Promoter', 'Proximal Promoter', 'Core Promoter', 'Downstream']
    conditions = [
        (-6000 <= distances) & (distances < -3000),
        (-3000 <= distances) & (distances < -250),
        (-250 <= distances) & (distances < -50),
        (-50 <= distances) & (distances <= 0),
        (0 < distances) & (distances <= 3000)
    ]
    return np.select(conditions, regions, default=None)
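
To make the bin edges concrete, the sketch below re-declares the same conditions under the hypothetical name `determine_region` and evaluates them at the boundary distances. The distances are illustrative inputs, not values from the annotation file.

```python
import numpy as np


def determine_region(distances):
    """Same bins as the function above, repeated here so the example is self-contained."""
    regions = ['Upstream', 'Distal Promoter', 'Proximal Promoter', 'Core Promoter', 'Downstream']
    conditions = [
        (-6000 <= distances) & (distances < -3000),
        (-3000 <= distances) & (distances < -250),
        (-250 <= distances) & (distances < -50),
        (-50 <= distances) & (distances <= 0),
        (0 < distances) & (distances <= 3000),
    ]
    # Distances outside [-6000, 3000] match no condition and get None.
    return np.select(conditions, regions, default=None)


boundary_distances = np.array([-6001, -6000, -3000, -250, -50, 0, 1, 3000, 3001])
for distance, region in zip(boundary_distances, determine_region(boundary_distances)):
    print(distance, region)
```

Each lower bound is inclusive and each upper bound exclusive, except that a distance of 0 belongs to the core promoter and 3000 still counts as downstream.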

### 3.2 Merge the annotation files to the methylation data and apply region function

In [None]:
# Merging basic data and methylation data
methylation_merged_df = pd.merge(map_annotation, methylation_value, left_on='ID', right_on='sample', how='right')

# Determining the region for each row outside the loop
methylation_merged_df['Region'] = vectorized_determine_region(methylation_merged_df['Distance_closest_TSS'])

methylation_merged_df = methylation_merged_df.dropna(subset=['Region'])  # Remove rows without a region

# Initializing a dictionary to store data for each region
regions_data = {region: pd.DataFrame() for region in ["Upstream", "Distal Promoter", "Proximal Promoter", "Core Promoter", "Downstream"]}

In [None]:
methylation_merged_df

In [None]:
# Delete the 'sample' column
methylation_merged_df = methylation_merged_df.drop('sample', axis=1)
# Delete the 'ID' column
methylation_merged_df = methylation_merged_df.drop('ID', axis=1)
# Delete the 'Distance_closest_TSS' column
methylation_merged_df = methylation_merged_df.drop('Distance_closest_TSS', axis=1)


In [None]:
methylation_merged_df

In [None]:
methylation_merged_df['Closest_TSS'] = methylation_merged_df['Closest_TSS'].astype(int)
methylation_merged_df['Closest_TSS_gene_name'] = methylation_merged_df['Closest_TSS_gene_name'].astype(str)
methylation_merged_df['Region'] = methylation_merged_df['Region'].astype(str)

In [None]:
print(methylation_merged_df[['Closest_TSS', 'Closest_TSS_gene_name', 'Region']].dtypes)

### 3.3 Calculate the average methylation value of five regions

In [None]:
# find columns that do not start with 'TCGA'
non_tcga_columns = methylation_merged_df.filter(regex='^(?!TCGA)').columns

print("Columns not starting with TCGA:")
print(non_tcga_columns)

In [None]:
# obtain all regions
regions = methylation_merged_df['Region'].unique()
regions

In [None]:
import pandas as pd
# Initialize empty DataFrames for each region
Upstream_df = pd.DataFrame()
Distal_Promoter_df = pd.DataFrame()
Proximal_Promoter_df = pd.DataFrame()
Core_Promoter_df = pd.DataFrame()
Downstream_df = pd.DataFrame()

# Operate on each region
for region in regions:
    # Get all data for this region
    region_data = methylation_merged_df[methylation_merged_df['Region'] == region]
    
    # Group by gene name (and region) and average the probes of each gene
    grouped = region_data.groupby(['Closest_TSS_gene_name', 'Region'], as_index=False).mean()
    
    # Since we split the data into different files based on Region, we can delete this column
    grouped = grouped.drop(columns=['Region'])
    
    # Print the shape of the grouped data
    print(f"Shape of {region}: {grouped.shape}")
    
    # Assign the grouped data to the respective DataFrame
    if region == 'Upstream':
        Upstream_df = grouped
    elif region == 'Distal Promoter':
        Distal_Promoter_df = grouped
    elif region == 'Proximal Promoter':
        Proximal_Promoter_df = grouped
    elif region == 'Core Promoter':
        Core_Promoter_df = grouped
    elif region == 'Downstream':
        Downstream_df = grouped

    # Optionally, save the data for this region to a new csv file
    # grouped.to_csv(f"{region}_averaged_tss_data.csv", index=False)
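
As an alternative to the `if`/`elif` chain, the per-region tables could be collected in a dictionary keyed by region name. This is a sketch on toy data, not the notebook's own approach; the gene names, sample names, and beta values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "Closest_TSS_gene_name": ["GENE1", "GENE1", "GENE2", "GENE1"],
    "Region": ["Core Promoter", "Core Promoter", "Core Promoter", "Upstream"],
    "TCGA-XX-0001-01": [0.25, 0.75, 0.5, 1.0],
    "TCGA-XX-0002-01": [0.0, 0.5, 1.0, 0.5],
})
sample_cols = [c for c in toy.columns if c.startswith("TCGA")]

# One averaged table per region: probes of the same gene collapse to their mean.
region_tables = {
    region: group.groupby("Closest_TSS_gene_name", as_index=False)[sample_cols].mean()
    for region, group in toy.groupby("Region")
}

for region, table in region_tables.items():
    print(f"Shape of {region}: {table.shape}")
```

Selecting `sample_cols` before calling `mean()` also keeps non-sample columns such as `Closest_TSS` out of the average.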


### 3.4 Unify gene and TSS for five methylation value files

In [None]:
import pandas as pd

# Region-level DataFrames built in section 3.3
dfs = [Upstream_df, Distal_Promoter_df, Proximal_Promoter_df, Core_Promoter_df, Downstream_df]

# Collect every gene name that appears in at least one region
all_genes_tss = pd.concat(dfs)['Closest_TSS_gene_name'].drop_duplicates()

In [None]:
all_genes_tss

In [None]:
# Merge unique combinations back into each DataFrame and fill NaN values with 0
# Upstream
Upstream_df = pd.merge(all_genes_tss, Upstream_df, on=['Closest_TSS_gene_name'], how='outer').fillna(0)
print(f"Shape of Upstream_df: {Upstream_df.shape}")

# Distal Promoter
Distal_Promoter_df = pd.merge(all_genes_tss, Distal_Promoter_df, on=['Closest_TSS_gene_name'], how='outer').fillna(0)
print(f"Shape of Distal_Promoter_df: {Distal_Promoter_df.shape}")

# Proximal Promoter
Proximal_Promoter_df = pd.merge(all_genes_tss, Proximal_Promoter_df, on=['Closest_TSS_gene_name'], how='outer').fillna(0)
print(f"Shape of Proximal_Promoter_df: {Proximal_Promoter_df.shape}")

# Core Promoter
Core_Promoter_df = pd.merge(all_genes_tss, Core_Promoter_df, on=['Closest_TSS_gene_name'], how='outer').fillna(0)
print(f"Shape of Core_Promoter_df: {Core_Promoter_df.shape}")

# Downstream
Downstream_df = pd.merge(all_genes_tss, Downstream_df, on=['Closest_TSS_gene_name'], how='outer').fillna(0)
print(f"Shape of Downstream_df: {Downstream_df.shape}")
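
The five near-identical merges above can also be written as one loop over a dictionary. The sketch below uses two toy region tables with made-up values; the dictionary layout is an assumption for illustration.

```python
import pandas as pd

upstream = pd.DataFrame({"Closest_TSS_gene_name": ["GENE1", "GENE2"],
                         "TCGA-XX-0001-01": [0.5, 0.25]})
core = pd.DataFrame({"Closest_TSS_gene_name": ["GENE2", "GENE3"],
                     "TCGA-XX-0001-01": [0.75, 1.0]})
tables = {"Upstream": upstream, "Core Promoter": core}

# Union of gene names across all region tables.
all_genes = pd.concat(list(tables.values()))["Closest_TSS_gene_name"].drop_duplicates()

# Give every table one row per gene; genes without probes in a region become 0.
aligned = {
    name: pd.merge(all_genes, df, on="Closest_TSS_gene_name", how="left").fillna(0)
    for name, df in tables.items()
}

for name, df in aligned.items():
    print(f"Shape of {name}: {df.shape}")
```

Note that after `fillna(0)` a gene with no probes in a region is indistinguishable from a gene whose measured beta value is 0, which may matter for downstream interpretation.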

## 4. Unify genes and patient samples within datasets

### 4.1 Unify gene and TSS for methylation, copy number, and gene expression data

In [None]:
copynumber

In [None]:
gene_expression

In [None]:
Upstream_df

In [None]:
import pandas as pd

# Convert the gene name columns from each DataFrame to sets
copynumber_genes = set(copynumber['Sample'])
gene_expression_genes = set(gene_expression['gene'])
methylation_genes = set(Upstream_df['Closest_TSS_gene_name'])

# Find the intersection of the three sets
common_genes = copynumber_genes & gene_expression_genes & methylation_genes

# Convert the intersection back to a list, if needed
common_genes_list = list(common_genes)

# Print the number of common genes
print(f"Number of common genes: {len(common_genes)}")

#### 4.1.1 Intersecting genes with various databases

In [None]:
import pandas as pd
# Add the gene names from databases like [KEGG / BioGRID] to intersect with the common genes
# KEGG
kegg_pathway_df = pd.read_csv('./Regulatory-network-data/KEGG/full_kegg_pathway_list.csv')
kegg_pathway_df = kegg_pathway_df[['source', 'target', 'pathway_name']]
kegg_df = kegg_pathway_df[kegg_pathway_df['pathway_name'].str.contains('signaling pathway|signaling pathways', case=False)]
print(kegg_df['pathway_name'].value_counts())
kegg_df = kegg_df.rename(columns={'source': 'src', 'target': 'dest'})
src_list = list(kegg_df['src'])
dest_list = list(kegg_df['dest'])
path_list = list(kegg_df['pathway_name'])
# ADJUST ALL GENES TO UPPERCASE
up_src_list = []
for src in src_list:
    up_src = src.upper()
    up_src_list.append(up_src)
up_dest_list = []
for dest in dest_list:
    up_dest = dest.upper()
    up_dest_list.append(up_dest)
up_kegg_conn_dict = {'src': up_src_list, 'dest': up_dest_list}
up_kegg_df = pd.DataFrame(up_kegg_conn_dict)
up_kegg_df = up_kegg_df.drop_duplicates()
up_kegg_df.to_csv('./Regulatory-network-data/KEGG/up_kegg.csv', index=False, header=True)
kegg_gene_list = list(set(list(up_kegg_df['src']) + list(up_kegg_df['dest'])))
print('----- NUMBER OF GENES IN KEGG: ' + str(len(kegg_gene_list)) + ' -----')
print(up_kegg_df.shape)

up_kegg_path_conn_dict = {'src': up_src_list, 'dest': up_dest_list, 'path': path_list}
up_kegg_path_df = pd.DataFrame(up_kegg_path_conn_dict)
up_kegg_path_df = up_kegg_path_df.drop_duplicates()
up_kegg_path_df.to_csv('./Regulatory-network-data/KEGG/up_kegg_path.csv', index=False, header=True)
kegg_path_gene_list = list(set(list(up_kegg_path_df['src']) + list(up_kegg_path_df['dest'])))
print('----- NUMBER OF GENES IN KEGG PATH: ' + str(len(kegg_path_gene_list)) + ' -----')
print(up_kegg_path_df.shape)
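
The uppercase loops above can be expressed with pandas string methods. The sketch below applies the same normalization to a toy edge list; the gene pairs and pathway names are made up, and only the `src`, `dest`, and `pathway_name` column names follow the cell above.

```python
import pandas as pd

toy_kegg = pd.DataFrame({
    "src": ["Akt1", "akt1", "Tp53"],
    "dest": ["mtor", "MTOR", "Mdm2"],
    "pathway_name": ["PI3K-Akt signaling pathway",
                     "PI3K-Akt signaling pathway",
                     "p53 signaling pathway"],
})

# Uppercase both endpoint columns, then drop edges that became identical.
edges = (
    toy_kegg[["src", "dest"]]
    .apply(lambda col: col.str.upper())
    .drop_duplicates()
    .reset_index(drop=True)
)

# Every gene that appears as either endpoint.
genes = sorted(set(edges["src"]) | set(edges["dest"]))
print(edges)
print("Number of genes:", len(genes))
```

Unlike a plain `.upper()` call in a loop, `Series.str.upper()` passes missing values through as NaN instead of raising an error.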

In [None]:
# BioGRID
biogrid_df = pd.read_table('./Regulatory-network-data/BioGrid/BIOGRID-ALL-3.5.174.mitab.Symbol.txt', delimiter = '\t')
eh_list = list(biogrid_df['e_h'])
et_list = list(biogrid_df['e_t'])
# ADJUST ALL GENES TO UPPERCASE
up_eh_list = []
for eh in eh_list:
    up_eh = eh.upper()
    up_eh_list.append(up_eh)
up_et_list = []
for et in et_list:
    up_et = et.upper()
    up_et_list.append(up_et)
up_biogrid_conn_dict = {'e_h': up_eh_list, 'e_t': up_et_list}
up_biogrid_df = pd.DataFrame(up_biogrid_conn_dict)
print(up_biogrid_df)
print(up_biogrid_df.shape)
up_biogrid_df.to_csv('./Regulatory-network-data/BioGrid/up_biogrid.csv', index = False, header = True)
up_biogrid_gene_list = list(set(list(up_biogrid_df['e_h']) + list(up_biogrid_df['e_t'])))
print('----- NUMBER OF GENES IN BioGRID: ' + str(len(up_biogrid_gene_list)) + ' -----')

In [None]:
# STRING
string_df = pd.read_csv('./Regulatory-network-data/STRING/9606.protein.links.detailed.v11.0_sym.csv', low_memory=False)
src_list = list(string_df['Source'])
tar_list = list(string_df['Target'])
# ADJUST ALL GENES TO UPPERCASE
up_src_list = []
for src in src_list:
    up_src = src.upper()
    up_src_list.append(up_src)
up_tar_list = []
for tar in tar_list:
    up_tar = tar.upper()
    up_tar_list.append(up_tar)
up_string_conn_dict = {'Source': up_src_list, 'Target': up_tar_list}
up_string_df = pd.DataFrame(up_string_conn_dict)
print(up_string_df)
up_string_df.to_csv('./Regulatory-network-data/STRING/up_string.csv', index = False, header = True)
up_string_gene_list = list(set(list(up_string_df['Source']) + list(up_string_df['Target'])))
print('----- NUMBER OF GENES IN STRING: ' + str(len(up_string_gene_list)) + ' -----')

In [None]:
# intersect the [common genes] with the genes in the different databases [KEGG / BioGRID / STRING]
selected_database = 'KEGG'
# selected_database = 'BioGRID'
# selected_database = 'STRING'
if selected_database == 'KEGG':
    common_genes = list(set(common_genes) & set(kegg_gene_list))
    print('----- NUMBER OF INTERSECTED GENES IN KEGG: ' + str(len(common_genes)) + ' -----')
elif selected_database == 'BioGRID':
    common_genes = list(set(common_genes) & set(up_biogrid_gene_list))
    print('----- NUMBER OF INTERSECTED GENES IN BioGRID: ' + str(len(common_genes)) + ' -----')
elif selected_database == 'STRING':
    common_genes = list(set(common_genes) & set(up_string_gene_list))
    print('----- NUMBER OF INTERSECTED GENES IN STRING: ' + str(len(common_genes)) + ' -----')

# filter the genes in the different databases [KEGG / BioGRID / STRING] with the [common genes]
if selected_database == 'KEGG':
    filtered_up_kegg_df = up_kegg_df[up_kegg_df['src'].isin(common_genes) & up_kegg_df['dest'].isin(common_genes)]
    filtered_up_kegg_df = filtered_up_kegg_df.drop_duplicates()
    filtered_up_kegg_df = filtered_up_kegg_df.sort_values(by=['src', 'dest']).reset_index(drop=True)
    print('----- NEW KEGG EDGE CONNECTIONS: ' + str(len(filtered_up_kegg_df)) + ' -----')
    filtered_up_kegg_path_df = up_kegg_path_df[up_kegg_path_df['src'].isin(common_genes) & up_kegg_path_df['dest'].isin(common_genes)]
    filtered_up_kegg_path_df = filtered_up_kegg_path_df.drop_duplicates()
    filtered_up_kegg_path_df = filtered_up_kegg_path_df.sort_values(by=['src', 'dest']).reset_index(drop=True)
    print('----- NEW KEGG PATHWAY CONNECTIONS: ' + str(len(filtered_up_kegg_path_df)) + ' -----')

In [None]:
display(filtered_up_kegg_df)
display(filtered_up_kegg_path_df)
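
The edge filter keeps an edge only when both endpoints are in the common gene set, so some common genes can end up without any edge. A toy sketch of that effect, with made-up genes and edges:

```python
import pandas as pd

edges = pd.DataFrame({"src": ["AKT1", "TP53", "EGFR"],
                      "dest": ["MTOR", "MDM2", "AKT1"]})
common = {"AKT1", "MTOR", "TP53"}

# Keep edges whose source AND destination are both common genes.
keep = edges["src"].isin(common) & edges["dest"].isin(common)
filtered = edges[keep].sort_values(by=["src", "dest"]).reset_index(drop=True)

# Common genes that no longer touch any edge after filtering.
nodes_without_edges = sorted(common - set(filtered["src"]) - set(filtered["dest"]))
print(filtered)
print("Genes without edges:", nodes_without_edges)
```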

In [None]:
# select common genes in copynumber data
copynumber_filtered = copynumber.loc[copynumber['Sample'].isin(common_genes)]

# select common genes in gene expression data
gene_expression_filtered = gene_expression.loc[gene_expression['gene'].isin(common_genes)]

In [None]:
copynumber_filtered

In [None]:
gene_expression_filtered

In [None]:
gene_expression_filtered['gene'].nunique()

In [None]:
gene_expression_filtered = gene_expression_filtered.groupby('gene', as_index=False).mean()

In [None]:
gene_expression_filtered

In [None]:
# select common genes in methylation data
Upstream_df_filtered = Upstream_df.loc[Upstream_df['Closest_TSS_gene_name'].isin(common_genes)]
Distal_Promoter_df_filtered = Distal_Promoter_df.loc[Distal_Promoter_df['Closest_TSS_gene_name'].isin(common_genes)]
Proximal_Promoter_df_filtered = Proximal_Promoter_df.loc[Proximal_Promoter_df['Closest_TSS_gene_name'].isin(common_genes)]
Core_Promoter_df_filtered = Core_Promoter_df.loc[Core_Promoter_df['Closest_TSS_gene_name'].isin(common_genes)]
Downstream_df_filtered = Downstream_df.loc[Downstream_df['Closest_TSS_gene_name'].isin(common_genes)]

In [None]:
Upstream_df_filtered

### 4.2 Unify patient samples within methylation, copy number, gene expression, clinical, proteomics, molecular subtype, and sample type/primary disease datasets

In [None]:
# clinical data
survival

In [None]:
#immune subtype
immune_subtype

In [None]:
#proteomics data
protein

In [None]:
#molecular subtype
cellsub

In [None]:
#sample type and primary disease
dense

In [None]:
# Upstream_df_filtered, Distal_Promoter_df_filtered, Proximal_Promoter_df_filtered, Core_Promoter_df_filtered, Downstream_df_filtered
# copynumber_filtered, gene_expression_filtered
# survival, protein, cellsub, dense

# Extract column names starting with 'TCGA' from methylation datasets
tcga_columns_upstream = [col for col in Upstream_df_filtered.columns if col.startswith('TCGA')]
tcga_columns_distal = [col for col in Distal_Promoter_df_filtered.columns if col.startswith('TCGA')]
tcga_columns_proximal = [col for col in Proximal_Promoter_df_filtered.columns if col.startswith('TCGA')]
tcga_columns_core = [col for col in Core_Promoter_df_filtered.columns if col.startswith('TCGA')]
tcga_columns_downstream = [col for col in Downstream_df_filtered.columns if col.startswith('TCGA')]

# Extract 'TCGA' columns from other datasets
tcga_columns_copynumber = [col for col in copynumber_filtered.columns if col.startswith('TCGA')]
tcga_columns_gene_expression = [col for col in gene_expression_filtered.columns if col.startswith('TCGA')]
tcga_columns_survival = [col for col in survival['sample'] if col.startswith('TCGA')]
tcga_columns_protein = [col for col in protein.columns if col.startswith('TCGA')]
tcga_columns_cellsub = [col for col in cellsub['sampleID'] if col.startswith('TCGA')]
tcga_columns_dense = [col for col in dense['sample'] if col.startswith('TCGA')]
tcga_columns_immune_subtype = [col for col in immune_subtype['sample'] if col.startswith('TCGA')]
# Find the intersection of TCGA column names across all DataFrames
common_tcga_columns = set(tcga_columns_upstream) & set(tcga_columns_distal) & set(tcga_columns_proximal) & set(tcga_columns_core) & set(tcga_columns_downstream) & set(tcga_columns_copynumber) & set(tcga_columns_gene_expression) & set(tcga_columns_survival) & set(tcga_columns_protein) & set(tcga_columns_cellsub) & set(tcga_columns_dense) & set(tcga_columns_immune_subtype)

# Convert the intersection back to a list, if needed
common_tcga_columns_list = list(common_tcga_columns)

# Print the number of common TCGA columns
print(f"Number of common TCGA columns: {len(common_tcga_columns)}")
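
The long chain of `&` operators can be replaced by a reduction over a list of identifier collections. This is a sketch with made-up sample IDs; `common_samples` is a hypothetical helper, not part of the notebook.

```python
from functools import reduce


def common_samples(id_collections):
    """Return the sorted TCGA identifiers shared by every collection."""
    sets = [
        {sample for sample in ids if str(sample).startswith("TCGA")}
        for ids in id_collections
    ]
    if not sets:
        return []
    return sorted(reduce(set.intersection, sets))


# Column names of a matrix-style table and sample IDs of two row-style tables.
matrix_columns = ["gene", "TCGA-XX-0001-01", "TCGA-XX-0002-01", "TCGA-XX-0003-01"]
survival_ids = ["TCGA-XX-0002-01", "TCGA-XX-0003-01", "TCGA-XX-0004-01"]
subtype_ids = ["TCGA-XX-0003-01", "TCGA-XX-0002-01"]

shared = common_samples([matrix_columns, survival_ids, subtype_ids])
print(f"Number of common TCGA samples: {len(shared)}")
```

Returning a sorted list also makes the sample order reproducible across runs, which a bare `set` does not guarantee.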

In [None]:
# Define columns to keep along with common TCGA columns
additional_cols_methylation = ['Closest_TSS_gene_name']

# Filter each methylation DataFrame
Upstream_df_filtered = Upstream_df_filtered[additional_cols_methylation + common_tcga_columns_list]
Distal_Promoter_df_filtered = Distal_Promoter_df_filtered[additional_cols_methylation + common_tcga_columns_list]
Proximal_Promoter_df_filtered = Proximal_Promoter_df_filtered[additional_cols_methylation + common_tcga_columns_list]
Core_Promoter_df_filtered = Core_Promoter_df_filtered[additional_cols_methylation + common_tcga_columns_list]
Downstream_df_filtered = Downstream_df_filtered[additional_cols_methylation + common_tcga_columns_list]


In [None]:
Core_Promoter_df_filtered

In [None]:
#sort by gene name and reset index
Upstream_df_filtered = Upstream_df_filtered.sort_values(by='Closest_TSS_gene_name').reset_index(drop=True)
Distal_Promoter_df_filtered = Distal_Promoter_df_filtered.sort_values(by='Closest_TSS_gene_name').reset_index(drop=True)
Proximal_Promoter_df_filtered = Proximal_Promoter_df_filtered.sort_values(by='Closest_TSS_gene_name').reset_index(drop=True)
Core_Promoter_df_filtered = Core_Promoter_df_filtered.sort_values(by='Closest_TSS_gene_name').reset_index(drop=True)
Downstream_df_filtered = Downstream_df_filtered.sort_values(by='Closest_TSS_gene_name').reset_index(drop=True)


In [None]:
Upstream_df_filtered

In [None]:
copynumber_filtered

In [None]:
# Define columns to keep along with common TCGA columns
additional_cols_copynumber = ['Sample']

# Filter the copynumber DataFrame
copynumber_filtered = copynumber_filtered[additional_cols_copynumber + common_tcga_columns_list]
copynumber_filtered = copynumber_filtered.rename(columns={'Sample': 'gene'})
copynumber_filtered = copynumber_filtered.sort_values(by='gene')
copynumber_filtered

In [None]:
# Define columns to keep along with common TCGA columns
additional_cols_gene_expression = ['gene']

# Filter the gene expression DataFrame
gene_expression_filtered = gene_expression_filtered[additional_cols_gene_expression + common_tcga_columns_list]
gene_expression_filtered

In [None]:
# Define columns to keep along with common TCGA columns
additional_cols_protein = ['SampleID']

# Filter the protein DataFrame
protein_filtered = protein[additional_cols_protein + common_tcga_columns_list]
protein_filtered

In [None]:
# Filter rows based on common TCGA identifiers
immune_subtype_filtered = immune_subtype[immune_subtype['sample'].isin(common_tcga_columns_list)]
survival_filtered = survival[survival['sample'].isin(common_tcga_columns_list)]
cellsub_filtered = cellsub[cellsub['sampleID'].isin(common_tcga_columns_list)]
dense_filtered = dense[dense['sample'].isin(common_tcga_columns_list)]

In [None]:
immune_subtype_filtered

In [None]:
survival_filtered

In [None]:
cellsub_filtered

In [None]:
dense_filtered

### 4.3 Proteomics missing values and intersection


In [None]:
protein_filtered

In [None]:
#calculate the NaN proportion of each row
nan_proportions = protein_filtered.isna().mean(axis=1)
# Display the results
print(nan_proportions)

In [None]:
protein_filtered = protein_filtered[nan_proportions <= 1/3]

# Fill NaN values with 0 in the remaining rows
protein_filtered = protein_filtered.fillna(0)
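
A toy illustration of the row filter: proteins missing in more than one third of the samples are dropped, and the remaining gaps are filled with 0. The values are made up. Unlike the cell above, this sketch computes the NaN fraction over the sample columns only, which is an assumption about the intended denominator.

```python
import numpy as np
import pandas as pd

toy_protein = pd.DataFrame({
    "SampleID": ["P1", "P2", "P3"],
    "TCGA-XX-0001-01": [1.0, np.nan, 0.5],
    "TCGA-XX-0002-01": [2.0, np.nan, np.nan],
    "TCGA-XX-0003-01": [3.0, 0.1, 0.7],
    "TCGA-XX-0004-01": [4.0, 0.2, 0.9],
})
sample_cols = [c for c in toy_protein.columns if c.startswith("TCGA")]

# Fraction of missing samples per protein (row-wise).
nan_fraction = toy_protein[sample_cols].isna().mean(axis=1)

# Keep proteins with at most one third missing, then fill remaining gaps with 0.
kept = toy_protein[nan_fraction <= 1/3].fillna(0).reset_index(drop=True)
print(kept)
```

In the cell above, the identifier column is part of the row, so it enters the denominator and slightly lowers each proportion.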

In [None]:
gene_list = gene_expression_filtered['gene']
protein_list = protein_filtered['SampleID'].tolist()
protein_intersection = list(set(gene_list) & set(protein_list))
len(protein_intersection)

In [None]:
protein_filtered

In [None]:
# keep only the proteins that are also present in the gene expression data
protein_filtered = protein_filtered.loc[protein_filtered['SampleID'].isin(protein_intersection)].reset_index(drop=True)
protein_filtered

## 5. Gene name, patient sample, and phenotype lists

### 5.1 Gene name and patient sample lists

In [None]:
gene_list = gene_expression_filtered['gene']
gene_list

In [None]:
protein_list = protein_filtered['SampleID'].tolist()

In [None]:
print(len(protein_list))

In [None]:
intersection = list(set(gene_list) & set(protein_list))
len(intersection)

In [None]:
patient_sample_list = pd.DataFrame(sorted(common_tcga_columns), columns=['sample'])
patient_sample_list

### 5.2 Phenotype lists

In [None]:
immune_subtype_filtered

In [None]:
survival_filtered

In [None]:
survival_nan_column_proportions = survival_filtered.isna().mean()

# Display the results
print(survival_nan_column_proportions)

In [None]:
# Calculate the proportion of NaN values in each column
survival_nan_column_proportions = survival_filtered.isna().mean()

# Identify columns to be dropped (where proportion of NaN values is greater than 1/3)
columns_to_drop = survival_nan_column_proportions[survival_nan_column_proportions > 1/3].index.tolist()

# Drop these columns from the DataFrame
survival_filtered = survival_filtered.drop(columns=columns_to_drop)

# List of columns that were dropped
print("Columns dropped:", columns_to_drop)
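
The same column filter is applied to the molecular subtype table below, so it could be factored into a small helper. A sketch, where `drop_sparse_columns` is a hypothetical name and the toy table is made up:

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_nan_fraction=1/3):
    """Drop columns whose NaN fraction exceeds max_nan_fraction.

    Returns the filtered DataFrame and the list of dropped column names.
    """
    nan_fraction = df.isna().mean()
    to_drop = nan_fraction[nan_fraction > max_nan_fraction].index.tolist()
    return df.drop(columns=to_drop), to_drop


toy_clinical = pd.DataFrame({
    "sample": ["TCGA-XX-0001-01", "TCGA-XX-0002-01", "TCGA-XX-0003-01", "TCGA-XX-0004-01"],
    "OS": [1, 0, 1, 0],
    "mostly_missing": [np.nan, np.nan, np.nan, 2.0],
    "one_missing": [np.nan, 1.0, 2.0, 3.0],
})

toy_filtered, dropped = drop_sparse_columns(toy_clinical)
print("Columns dropped:", dropped)
```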

In [None]:
survival_filtered

In [None]:
dense_filtered

In [None]:
cellsub_filtered

In [None]:
cellsub_nan_column_proportions = cellsub_filtered.isna().mean()

# Display the results
print(cellsub_nan_column_proportions)

In [None]:
# Calculate the proportion of NaN values in each column
cellsub_nan_column_proportions = cellsub_filtered.isna().mean()

# Identify columns to be dropped (where proportion of NaN values is greater than 1/3)
columns_to_drop = cellsub_nan_column_proportions[cellsub_nan_column_proportions > 1/3].index.tolist()

# Drop these columns from the DataFrame
cellsub_filtered = cellsub_filtered.drop(columns=columns_to_drop)

# List of columns that were dropped
print("Columns dropped:", columns_to_drop)

In [None]:
cellsub_filtered

In [None]:
import pandas as pd

# extract phenotype names
immune_phenotypes = immune_subtype_filtered.columns[1:].tolist()
survival_phenotypes = survival_filtered.columns[2:].tolist() # _PATIENT info is not needed (sample id)
dense_phenotypes = dense_filtered.columns[2:].tolist() # sample_type_id info is not needed (all = 1)
cellsub_phenotypes = cellsub_filtered.columns[1:].tolist()

# create (phenotype name, source) pairs
phenotype_list = []
phenotype_list.extend([(p, 'immunesub') for p in immune_phenotypes])
phenotype_list.extend([(p, 'survival') for p in survival_phenotypes])
phenotype_list.extend([(p, 'dense') for p in dense_phenotypes])
phenotype_list.extend([(p, 'cellsub') for p in cellsub_phenotypes])

# list DataFrame
phenotype_lists = pd.DataFrame(phenotype_list, columns=['Phenotype_Name', 'Phenotype_Source'])
phenotype_lists

## 6. Save processed datasets

### 6.1 Keep the consistency for dataframes on genes and samples

In [None]:
# [gene_list]
# gene
sorted_gene_list = gene_list.sort_values()
sorted_gene = sorted_gene_list.tolist()
sorted_gene_df = pd.DataFrame(sorted_gene, columns=['Gene'])
display(sorted_gene_df)
# gene-meth
sorted_gene_methy = [gene + '-METH' for gene in sorted_gene]
sorted_gene_methy_df = pd.DataFrame(sorted_gene_methy, columns=['Gene'])
display(sorted_gene_methy_df)
# gene-protein
sorted_intersection = sorted(intersection)
sorted_gene_protein = [gene + '-PROT' for gene in sorted_intersection]
sorted_gene_protein_df = pd.DataFrame(sorted_gene_protein, columns=['Gene'])
display(sorted_gene_protein_df)
# all-gene
sorted_gene_all = sorted_gene + sorted_gene_methy + sorted_gene_protein
sorted_all_gene_df = pd.DataFrame(sorted_gene_all, columns=['Gene'])
display(sorted_all_gene_df)

In [None]:
# [patient-sample-list]
sorted_patient_sample_list = patient_sample_list.sort_values(by='sample')['sample'].tolist()
print(sorted_patient_sample_list)
sorted_patient_sample_df = patient_sample_list.sort_values(by='sample').reset_index(drop=True)
display(sorted_patient_sample_df)

In [None]:
Upstream_df_filtered = Upstream_df_filtered[['Closest_TSS_gene_name'] + sorted_patient_sample_list]
Upstream_df_filtered

In [None]:
Distal_Promoter_df_filtered = Distal_Promoter_df_filtered[['Closest_TSS_gene_name'] + sorted_patient_sample_list]
Distal_Promoter_df_filtered

In [None]:
Proximal_Promoter_df_filtered = Proximal_Promoter_df_filtered[['Closest_TSS_gene_name'] + sorted_patient_sample_list]
Proximal_Promoter_df_filtered

In [None]:
Core_Promoter_df_filtered = Core_Promoter_df_filtered[['Closest_TSS_gene_name'] + sorted_patient_sample_list]
Core_Promoter_df_filtered

In [None]:
Downstream_df_filtered = Downstream_df_filtered[['Closest_TSS_gene_name'] + sorted_patient_sample_list]
Downstream_df_filtered

In [None]:
copynumber_filtered = copynumber_filtered[['gene'] + sorted_patient_sample_list].sort_values(by='gene').reset_index(drop=True)
copynumber_filtered

In [None]:
gene_expression_filtered = gene_expression_filtered[['gene'] + sorted_patient_sample_list].sort_values(by='gene').reset_index(drop=True)
gene_expression_filtered

In [None]:
protein_filtered = protein_filtered[['SampleID'] + sorted_patient_sample_list].sort_values(by='SampleID').reset_index(drop=True)
protein_filtered

In [None]:
immune_subtype_filtered = immune_subtype_filtered.sort_values(by='sample').reset_index(drop=True)
immune_subtype_filtered

In [None]:
survival_filtered = survival_filtered.sort_values(by='sample').reset_index(drop=True)
survival_filtered

In [None]:
dense_filtered = dense_filtered.sort_values(by='sample').reset_index(drop=True)
dense_filtered

In [None]:
cellsub_filtered = cellsub_filtered.sort_values(by='sampleID').reset_index(drop=True)
cellsub_filtered

### 6.2 Create output folder and save processed datasets

In [None]:
import os

# output folder name
output_folder = 'UCSC-process'
# create the folder if it does not exist
os.makedirs(output_folder, exist_ok=True)

In [None]:
# DataFrames to be saved, keyed by output file name
dataframes = {
    'gene-list.csv': sorted_gene_df,
    'gene-methy-list.csv': sorted_gene_methy_df,
    'gene-protein-list.csv': sorted_gene_protein_df,
    'gene-all-list.csv': sorted_all_gene_df,
    'gene-kegg-edge-list.csv': filtered_up_kegg_df,
    'gene-kegg-path-edge-list.csv': filtered_up_kegg_path_df,
    'patient-sample-list.csv': sorted_patient_sample_df,
    'phenotype-lists.csv': phenotype_lists,
    'processed-genotype-methy-Upstream.csv': Upstream_df_filtered,
    'processed-genotype-methy-Distal-Promoter.csv': Distal_Promoter_df_filtered,
    'processed-genotype-methy-Proximal-Promoter.csv': Proximal_Promoter_df_filtered,
    'processed-genotype-methy-Core-Promoter.csv': Core_Promoter_df_filtered,
    'processed-genotype-methy-Downstream.csv': Downstream_df_filtered,
    'processed-genotype-cnv.csv': copynumber_filtered,
    'processed-genotype-gene-expression.csv': gene_expression_filtered,
    'processed-genotype-proteomics.csv': protein_filtered,
    'processed-phenotype-immune-subtype-transposed.csv': immune_subtype_filtered,
    'processed-phenotype-survival-transposed.csv': survival_filtered,
    'processed-phenotype-dense-transposed.csv': dense_filtered,
    'processed-phenotype-cellsub-transposed.csv': cellsub_filtered
}

# save to output folder
for file_name, df in dataframes.items():
    df.to_csv(os.path.join(output_folder, file_name), index=False)
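
Since the saved matrices are meant to share one gene order and one sample order, a quick check before or after saving can catch misalignment. The sketch below runs on toy tables; `check_alignment` is a hypothetical helper. In this notebook it would most naturally apply to the methylation, copy number, and gene expression tables, because the proteomics table covers fewer genes.

```python
import pandas as pd


def check_alignment(matrices, id_columns, expected_genes, expected_samples):
    """Return a list of mismatch messages; an empty list means all tables agree."""
    problems = []
    for name, df in matrices.items():
        id_col = id_columns[name]
        # Row order: the identifier column must match the expected gene list.
        if df[id_col].tolist() != list(expected_genes):
            problems.append(f"{name}: gene order differs")
        # Column order: everything except the identifier must match the samples.
        if [c for c in df.columns if c != id_col] != list(expected_samples):
            problems.append(f"{name}: sample order differs")
    return problems


genes = ["AKT1", "TP53"]
samples = ["TCGA-XX-0001-01", "TCGA-XX-0002-01"]
toy_cnv = pd.DataFrame({"gene": genes, samples[0]: [0, 1], samples[1]: [-1, 0]})
# Deliberately reversed gene order to trigger a message.
toy_expr = pd.DataFrame({"gene": genes[::-1], samples[0]: [2.5, 3.5], samples[1]: [1.5, 0.5]})

problems = check_alignment(
    {"cnv": toy_cnv, "expression": toy_expr},
    {"cnv": "gene", "expression": "gene"},
    genes,
    samples,
)
print(problems)
```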

## 7. Convert the processed data into node dictionary

In [1]:
# load processed data
import pandas as pd
import os

# read the file names under the folder
# Define the path to the output folder where CSV files are stored
output_folder = 'UCSC-process'

# List of file names you saved earlier
file_names = [
    'gene-list', 'gene-methy-list', 'gene-protein-list', 'gene-all-list', 
    'gene-kegg-edge-list', 'gene-kegg-path-edge-list', 'patient-sample-list', 
    'phenotype-lists', 'processed-genotype-methy-Upstream', 
    'processed-genotype-methy-Distal-Promoter', 
    'processed-genotype-methy-Proximal-Promoter', 
    'processed-genotype-methy-Core-Promoter', 'processed-genotype-methy-Downstream', 
    'processed-genotype-cnv', 'processed-genotype-gene-expression', 
    'processed-genotype-proteomics', 'processed-phenotype-immune-subtype-transposed', 
    'processed-phenotype-survival-transposed', 'processed-phenotype-dense-transposed', 
    'processed-phenotype-cellsub-transposed'
]

# Dictionary to hold the dataframes
dataframes = {}

# Read each file and assign to a dataframe
for file_name in file_names:
    full_path = os.path.join(output_folder, file_name + '.csv')
    dataframes[file_name] = pd.read_csv(full_path)

In [2]:
# Assign each dataframe to a variable
sorted_gene_df = dataframes['gene-list']
sorted_gene_methy_df = dataframes['gene-methy-list']
sorted_gene_protein_df = dataframes['gene-protein-list']
sorted_all_gene_df = dataframes['gene-all-list']
filtered_up_kegg_df = dataframes['gene-kegg-edge-list']
filtered_up_kegg_path_df = dataframes['gene-kegg-path-edge-list']
sorted_patient_sample_df = dataframes['patient-sample-list']
phenotype_lists = dataframes['phenotype-lists']
Upstream_df_filtered = dataframes['processed-genotype-methy-Upstream']
Distal_Promoter_df_filtered = dataframes['processed-genotype-methy-Distal-Promoter']
Proximal_Promoter_df_filtered = dataframes['processed-genotype-methy-Proximal-Promoter']
Core_Promoter_df_filtered = dataframes['processed-genotype-methy-Core-Promoter']
Downstream_df_filtered = dataframes['processed-genotype-methy-Downstream']
copynumber_filtered = dataframes['processed-genotype-cnv']
gene_expression_filtered = dataframes['processed-genotype-gene-expression']
protein_filtered = dataframes['processed-genotype-proteomics']
immune_subtype_filtered = dataframes['processed-phenotype-immune-subtype-transposed']
survival_filtered = dataframes['processed-phenotype-survival-transposed']
dense_filtered = dataframes['processed-phenotype-dense-transposed']
cellsub_filtered = dataframes['processed-phenotype-cellsub-transposed']

In [3]:
# output folder name
graph_output_folder = 'graph-data'
# create the folder if it does not exist
os.makedirs(graph_output_folder, exist_ok=True)

### 7.1 Make nodes dictionary

In [4]:
sorted_all_gene_dict = sorted_all_gene_df['Gene'].to_dict()
sorted_all_gene_name_dict = {value: key for key, value in sorted_all_gene_dict.items()}
num_gene = sorted_gene_df.shape[0]
num_gene_protein = sorted_gene_protein_df.shape[0]
nodetype_list = ['Gene'] * num_gene + ['Gene-METH'] * num_gene + ['Gene-PROT'] * num_gene_protein
map_all_gene_df = pd.DataFrame({'Gene_num': sorted_all_gene_dict.keys(), 'Gene_name': sorted_all_gene_dict.values(), 'NodeType': nodetype_list})
display(map_all_gene_df)
map_all_gene_df.to_csv(os.path.join(graph_output_folder, 'map-all-gene.csv'), index=False)

Unnamed: 0,Gene_num,Gene_name,NodeType
0,0,ABL1,Gene
1,1,ABL2,Gene
2,2,ACAA1,Gene
3,3,ACACA,Gene
4,4,ACACB,Gene
...,...,...,...
4154,4154,SMAD4-PROT,Gene-PROT
4155,4155,SRC-PROT,Gene-PROT
4156,4156,SYK-PROT,Gene-PROT
4157,4157,TFRC-PROT,Gene-PROT
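
The node table above depends on the row order of the combined gene list: plain genes first, then the `-METH` nodes, then the `-PROT` nodes. A minimal sketch on toy data shows how the index-to-name dictionary and its inverse relate. The names and the `toy_*` variables are illustrative only and are not taken from the processed files.

```python
import pandas as pd

# Toy stand-in for sorted_all_gene_df (hypothetical values)
toy_all_gene_df = pd.DataFrame(
    {'Gene': ['ABL1', 'BAX', 'ABL1-METH', 'BAX-METH', 'BAX-PROT']}
)

# index -> name, as produced by Series.to_dict()
toy_num_to_name = toy_all_gene_df['Gene'].to_dict()
# name -> index, the inverse mapping used later to encode edges
toy_name_to_num = {name: num for num, name in toy_num_to_name.items()}

# Inverting the dictionary is only safe when node names are unique
assert toy_all_gene_df['Gene'].is_unique

print(toy_name_to_num['ABL1-METH'])  # 2
```

If a name appeared twice, the inverted dictionary would keep only the last index, so the uniqueness check is worth running on the real list as well.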


### 7.2 Create the edges connecting genes to their promoter-methylation and protein nodes

In [5]:
# [Gene-METH - Gene]
sorted_gene_methy = sorted_gene_methy_df['Gene'].tolist()
sorted_gene_list = sorted_gene_df['Gene'].tolist()
sorted_gene_protein = sorted_gene_protein_df['Gene'].tolist()
sorted_intersection = [gene_protein.replace('-PROT', '') for gene_protein in sorted_gene_protein]
gene_meth_edge_df = pd.DataFrame({'src': sorted_gene_methy, 'dest': sorted_gene_list})
display(gene_meth_edge_df)
# [Gene - Gene-PROT]
gene_protein_edge_df = pd.DataFrame({'src': sorted_intersection, 'dest': sorted_gene_protein})
display(gene_protein_edge_df)

Unnamed: 0,src,dest
0,ABL1-METH,ABL1
1,ABL2-METH,ABL2
2,ACAA1-METH,ACAA1
3,ACACA-METH,ACACA
4,ACACB-METH,ACACB
...,...,...
2056,ZFYVE16-METH,ZFYVE16
2057,ZFYVE9-METH,ZFYVE9
2058,ZMAT3-METH,ZMAT3
2059,ZNF274-METH,ZNF274


Unnamed: 0,src,dest
0,ARAF,ARAF-PROT
1,ATM,ATM-PROT
2,BAX,BAX-PROT
3,BCL2,BCL2-PROT
4,BCL2A1,BCL2A1-PROT
5,BID,BID-PROT
6,BRAF,BRAF-PROT
7,CDK1,CDK1-PROT
8,DUSP4,DUSP4-PROT
9,DVL3,DVL3-PROT


In [6]:
sorted_all_gene_name_dict['ABL1-METH']

2061

In [8]:
# replace gene name with gene number
gene_meth_num_edge_df = gene_meth_edge_df.copy()
gene_meth_num_edge_df['src'] = gene_meth_edge_df['src'].map(sorted_all_gene_name_dict)
gene_meth_num_edge_df['dest'] = gene_meth_edge_df['dest'].map(sorted_all_gene_name_dict)
display(gene_meth_num_edge_df)
gene_protein_num_edge_df = gene_protein_edge_df.copy()
gene_protein_num_edge_df['src'] = gene_protein_edge_df['src'].map(sorted_all_gene_name_dict)
gene_protein_num_edge_df['dest'] = gene_protein_edge_df['dest'].map(sorted_all_gene_name_dict)
display(gene_protein_num_edge_df)

Unnamed: 0,src,dest
0,2061,0
1,2062,1
2,2063,2
3,2064,3
4,2065,4
...,...,...
2056,4117,2056
2057,4118,2057
2058,4119,2058
2059,4120,2059


Unnamed: 0,src,dest
0,92,4122
1,110,4123
2,152,4124
3,156,4125
4,157,4126
5,166,4127
6,183,4128
7,331,4129
8,525,4130
9,533,4131
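
`Series.map` with a dictionary returns `NaN` for any name that is missing from the dictionary, which would silently turn the integer node columns into floats. A small guard like the one below can catch that before the edges are saved. It is shown on toy data, and `encode_edges` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def encode_edges(edge_df, name_to_num):
    """Map 'src'/'dest' names to node numbers; fail if a name is unknown."""
    encoded = edge_df.copy()
    for col in ('src', 'dest'):
        encoded[col] = edge_df[col].map(name_to_num)
        # Names that did not map show up as NaN
        missing = edge_df.loc[encoded[col].isna(), col].unique()
        if len(missing) > 0:
            raise KeyError(f"Unmapped names in '{col}': {list(missing)}")
        encoded[col] = encoded[col].astype(int)
    return encoded

# Toy mapping and edge list (illustrative values)
toy_name_to_num = {'ABL1': 0, 'BAX': 1, 'ABL1-METH': 2, 'BAX-METH': 3}
toy_edges = pd.DataFrame({'src': ['ABL1-METH', 'BAX-METH'],
                          'dest': ['ABL1', 'BAX']})
toy_encoded = encode_edges(toy_edges, toy_name_to_num)
print(toy_encoded)
```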


### 7.3 Concatenate all of the edges

In [27]:
filtered_up_kegg_num_df = filtered_up_kegg_df.copy()
filtered_up_kegg_num_df['src'] = filtered_up_kegg_num_df['src'].map(sorted_all_gene_name_dict)
filtered_up_kegg_num_df['dest'] = filtered_up_kegg_num_df['dest'].map(sorted_all_gene_name_dict)
display(filtered_up_kegg_num_df)
all_gene_edge_num_df = pd.concat([filtered_up_kegg_num_df, gene_meth_num_edge_df, gene_protein_num_edge_df])
display(all_gene_edge_num_df)

num_gene_edge = filtered_up_kegg_num_df.shape[0]
num_gene_meth_edge = gene_meth_num_edge_df.shape[0]
num_gene_protein_edge = gene_protein_num_edge_df.shape[0]
edgetype_list = ['Gene-Gene'] * num_gene_edge + ['Gene-Gene-METH'] * num_gene_meth_edge + ['Gene-Gene-PROT'] * num_gene_protein_edge
all_gene_edge_num_df['EdgeType'] = edgetype_list
all_gene_edge_num_df = all_gene_edge_num_df.sort_values(by=['src', 'dest']).reset_index(drop=True)
display(all_gene_edge_num_df)
all_gene_edge_num_df.to_csv(os.path.join(graph_output_folder, 'all-gene-edge-num.csv'), index=False)

Unnamed: 0,src,dest
0,18,1255
1,18,1256
2,18,1257
3,18,1857
4,19,1255
...,...,...
18226,2060,679
18227,2060,680
18228,2060,681
18229,2060,682


Unnamed: 0,src,dest
0,18,1255
1,18,1256
2,18,1257
3,18,1857
4,19,1255
...,...,...
32,1802,4154
33,1836,4155
34,1864,4156
35,1902,4157


Unnamed: 0,src,dest,EdgeType
0,18,1255,Gene-Gene
1,18,1256,Gene-Gene
2,18,1257,Gene-Gene
3,18,1857,Gene-Gene
4,19,1255,Gene-Gene
...,...,...,...
32,1802,4154,Gene-Gene-PROT
33,1836,4155,Gene-Gene-PROT
34,1864,4156,Gene-Gene-PROT
35,1902,4157,Gene-Gene-PROT
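
The cell above assigns `EdgeType` from a list whose order has to match the concatenation order. An alternative sketch tags each frame before concatenating, so the labels cannot drift out of alignment. This is an equivalent formulation on toy frames with made-up node numbers, not the notebook's own code.

```python
import pandas as pd

# Toy edge frames (illustrative node numbers)
toy_gene_gene = pd.DataFrame({'src': [0, 1], 'dest': [1, 0]})
toy_gene_meth = pd.DataFrame({'src': [2, 3], 'dest': [0, 1]})
toy_gene_prot = pd.DataFrame({'src': [1], 'dest': [4]})

# Label each frame first, then concatenate and sort
toy_all_edges = (
    pd.concat(
        [
            toy_gene_gene.assign(EdgeType='Gene-Gene'),
            toy_gene_meth.assign(EdgeType='Gene-Gene-METH'),
            toy_gene_prot.assign(EdgeType='Gene-Gene-PROT'),
        ],
        ignore_index=True,
    )
    .sort_values(by=['src', 'dest'])
    .reset_index(drop=True)
)
print(toy_all_edges)
```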


In [28]:
# gene edge interactions with node numbers mapped back to gene names
all_gene_edge_df = all_gene_edge_num_df.copy()
all_gene_edge_df = all_gene_edge_df.replace(sorted_all_gene_dict)

num_gene_edge = filtered_up_kegg_num_df.shape[0]
num_gene_meth_edge = gene_meth_edge_df.shape[0]
num_gene_protein_edge = gene_protein_edge_df.shape[0]
# all_gene_edge_df = all_gene_edge_df.sort_values(by=['src', 'dest']).reset_index(drop=True)
all_gene_edge_df.to_csv(os.path.join(graph_output_folder, 'all-gene-edge.csv'), index=False)
display(all_gene_edge_df)

Unnamed: 0,src,dest,EdgeType
0,ACTB,MYL6,Gene-Gene
1,ACTB,MYL6B,Gene-Gene
2,ACTB,MYL9,Gene-Gene
3,ACTB,STK3,Gene-Gene
4,ACTG1,MYL6,Gene-Gene
...,...,...,...
32,SMAD4,SMAD4-PROT,Gene-Gene-PROT
33,SRC,SRC-PROT,Gene-Gene-PROT
34,SYK,SYK-PROT,Gene-Gene-PROT
35,TFRC,TFRC-PROT,Gene-Gene-PROT


## 8. Load data into graph format

### 8.1 Form up the input samples

The curated clinical data resource recommends using the endpoints OS, PFI, DFI, and DSS for each TCGA cancer type:

* OS: overall survival
* PFI: progression-free interval
* DSS: disease-specific survival
* DFI: disease-free interval

In [20]:
survival_filtered

Unnamed: 0,sample,_PATIENT,cancer type abbreviation,age_at_initial_pathologic_diagnosis,gender,race,ajcc_pathologic_tumor_stage,histological_type,initial_pathologic_dx_year,birth_days_to,vital_status,tumor_status,last_contact_days_to,OS,OS.time,DSS,DSS.time,PFI,PFI.time
0,TCGA-05-4384-01,TCGA-05-4384,LUAD,66.0,MALE,,Stage IIIA,Lung Adenocarcinoma,2009.0,-24411.0,Alive,WITH TUMOR,426.0,0.0,426.0,0.0,426.0,1.0,183.0
1,TCGA-05-4396-01,TCGA-05-4396,LUAD,76.0,MALE,,Stage IIIB,Lung Adenocarcinoma,2006.0,-28094.0,Dead,,,1.0,303.0,,303.0,0.0,303.0
2,TCGA-05-4405-01,TCGA-05-4405,LUAD,74.0,FEMALE,,Stage IB,Lung Adenocarcinoma,2006.0,-27241.0,Alive,TUMOR FREE,610.0,0.0,610.0,0.0,610.0,0.0,610.0
3,TCGA-05-4410-01,TCGA-05-4410,LUAD,62.0,MALE,,Stage IB,Lung Adenocarcinoma,2007.0,-22888.0,Alive,,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,TCGA-05-4417-01,TCGA-05-4417,LUAD,51.0,FEMALE,,Stage IB,Lung Adenocarcinoma,2008.0,-18780.0,Alive,TUMOR FREE,455.0,0.0,455.0,0.0,455.0,0.0,455.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3657,TCGA-ZA-A8F6-01,TCGA-ZA-A8F6,STAD,71.0,MALE,WHITE,Stage IB,"Stomach, Intestinal Adenocarcinoma, Not Otherw...",2013.0,-26122.0,Alive,TUMOR FREE,525.0,0.0,525.0,0.0,525.0,0.0,525.0
3658,TCGA-ZG-A8QW-01,TCGA-ZG-A8QW,PRAD,72.0,MALE,,,Prostate Adenocarcinoma Acinar Type,2013.0,-26632.0,Alive,TUMOR FREE,94.0,0.0,94.0,0.0,94.0,0.0,94.0
3659,TCGA-ZG-A8QX-01,TCGA-ZG-A8QX,PRAD,56.0,MALE,,,Prostate Adenocarcinoma Acinar Type,2013.0,-20683.0,Alive,TUMOR FREE,442.0,0.0,442.0,0.0,442.0,0.0,442.0
3660,TCGA-ZG-A8QY-01,TCGA-ZG-A8QY,PRAD,67.0,MALE,,,Prostate Adenocarcinoma Acinar Type,2013.0,-24731.0,Alive,WITH TUMOR,404.0,0.0,404.0,0.0,404.0,0.0,404.0


In [21]:
survival_filtered_feature_df = survival_filtered.copy()
survival_filtered_feature_df = survival_filtered_feature_df[['sample', 'cancer type abbreviation', 'OS', 'vital_status']]
display(survival_filtered_feature_df)

nan_counts = survival_filtered_feature_df.isna().sum()  # or df.isnull()
print(nan_counts)

# Convert 'Alive' to 0.0 and 'Dead' to 1.0
survival_filtered_feature_df['vital_status'] = survival_filtered_feature_df['vital_status'].map({'Alive': 0.0, 'Dead': 1.0})
display(survival_filtered_feature_df)
survival_filtered_feature_df['OS'] == survival_filtered_feature_df['vital_status']


Unnamed: 0,sample,cancer type abbreviation,OS,vital_status
0,TCGA-05-4384-01,LUAD,0.0,Alive
1,TCGA-05-4396-01,LUAD,1.0,Dead
2,TCGA-05-4405-01,LUAD,0.0,Alive
3,TCGA-05-4410-01,LUAD,0.0,Alive
4,TCGA-05-4417-01,LUAD,0.0,Alive
...,...,...,...,...
3657,TCGA-ZA-A8F6-01,STAD,0.0,Alive
3658,TCGA-ZG-A8QW-01,PRAD,0.0,Alive
3659,TCGA-ZG-A8QX-01,PRAD,0.0,Alive
3660,TCGA-ZG-A8QY-01,PRAD,0.0,Alive


sample                      0
cancer type abbreviation    0
OS                          0
vital_status                0
dtype: int64


Unnamed: 0,sample,cancer type abbreviation,OS,vital_status
0,TCGA-05-4384-01,LUAD,0.0,0.0
1,TCGA-05-4396-01,LUAD,1.0,1.0
2,TCGA-05-4405-01,LUAD,0.0,0.0
3,TCGA-05-4410-01,LUAD,0.0,0.0
4,TCGA-05-4417-01,LUAD,0.0,0.0
...,...,...,...,...
3657,TCGA-ZA-A8F6-01,STAD,0.0,0.0
3658,TCGA-ZG-A8QW-01,PRAD,0.0,0.0
3659,TCGA-ZG-A8QX-01,PRAD,0.0,0.0
3660,TCGA-ZG-A8QY-01,PRAD,0.0,0.0


0       True
1       True
2       True
3       True
4       True
        ... 
3657    True
3658    True
3659    True
3660    True
3661    True
Length: 3662, dtype: bool

In [22]:
# Check whether 'OS' and 'vital_status' hold the same value in every row
rows_same = (survival_filtered_feature_df['OS'] == survival_filtered_feature_df['vital_status']).all()
print("All rows have the same value in column 'OS' and column 'vital_status' :", rows_same)

All rows have the same value in column 'OS' and column 'vital_status' : True


In [23]:
survival_filtered_feature_df = survival_filtered_feature_df[['sample', 'OS', 'cancer type abbreviation']]
display(survival_filtered_feature_df)
survival_filtered_feature_df.to_csv(os.path.join(graph_output_folder, 'survival-label.csv'), index=False)

Unnamed: 0,sample,OS,cancer type abbreviation
0,TCGA-05-4384-01,0.0,LUAD
1,TCGA-05-4396-01,1.0,LUAD
2,TCGA-05-4405-01,0.0,LUAD
3,TCGA-05-4410-01,0.0,LUAD
4,TCGA-05-4417-01,0.0,LUAD
...,...,...,...
3657,TCGA-ZA-A8F6-01,0.0,STAD
3658,TCGA-ZG-A8QW-01,0.0,PRAD
3659,TCGA-ZG-A8QX-01,0.0,PRAD
3660,TCGA-ZG-A8QY-01,0.0,PRAD


### 8.2 Randomize the input label

In [24]:
# Randomize the survival label
def input_random(randomized, graph_output_folder):
    # randomized=True: shuffle the labels and save them
    # randomized=False: reload the previously saved shuffle
    if randomized:
        random_survival_filtered_feature_df = survival_filtered_feature_df.sample(frac=1).reset_index(drop=True)
        random_survival_filtered_feature_df.to_csv(os.path.join(graph_output_folder, 'random-survival-label.csv'), index=False)
    else:
        random_survival_filtered_feature_df = pd.read_csv(os.path.join(graph_output_folder, 'random-survival-label.csv'))
    display(random_survival_filtered_feature_df)
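
`DataFrame.sample(frac=1)` draws a new permutation on every call unless a seed is supplied. If a reproducible shuffle is wanted, passing `random_state` is one option. The seed value and the toy labels below are arbitrary choices for illustration.

```python
import pandas as pd

# Toy label table (illustrative values)
toy_label_df = pd.DataFrame({
    'sample': ['S1', 'S2', 'S3', 'S4', 'S5'],
    'OS': [0.0, 1.0, 0.0, 0.0, 1.0],
})

# The same seed gives the same row order on every run
shuffled_a = toy_label_df.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled_b = toy_label_df.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled_a.equals(shuffled_b))  # True
```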

### 8.3 Split the randomized input into 5-fold

In [25]:
# Split the shuffled labels into k folds; each fold is saved as the test part of one train-test split
def split_k_fold(k, graph_output_folder):
    random_survival_filtered_feature_df = pd.read_csv(os.path.join(graph_output_folder, 'random-survival-label.csv'))
    num_points = random_survival_filtered_feature_df.shape[0]
    num_div = int(num_points / k)
    num_div_list = [i * num_div for i in range(0, k)]
    num_div_list.append(num_points)
    # Split [random_survival_filtered_feature_df] into [k] folds
    for place_num in range(k):
        low_idx = num_div_list[place_num]
        high_idx = num_div_list[place_num + 1]
        print('\n--------TRAIN-TEST SPLIT WITH TEST FROM ' + str(low_idx) + ' TO ' + str(high_idx) + '--------')
        split_input_df = random_survival_filtered_feature_df[low_idx : high_idx]
        split_input_df.to_csv(os.path.join(graph_output_folder, 'split-random-survival-label-' + str(place_num + 1) + '.csv'), index=False)
        print(split_input_df.shape)
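
The fold boundaries computed in `split_k_fold` put `num_points // k` rows in each of the first `k - 1` folds and all remaining rows in the last one. The sketch below reproduces only that arithmetic in a hypothetical helper, `fold_bounds`, using the 3662 labelled samples shown earlier and k = 5.

```python
def fold_bounds(num_points, k):
    """Return (low, high) row ranges; the last fold absorbs the remainder."""
    num_div = num_points // k
    cuts = [i * num_div for i in range(k)] + [num_points]
    return [(cuts[i], cuts[i + 1]) for i in range(k)]

bounds = fold_bounds(3662, 5)
print(bounds)
# [(0, 732), (732, 1464), (1464, 2196), (2196, 2928), (2928, 3662)]
```

With these numbers the first four folds hold 732 rows each and the last fold holds 734.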

### 8.4 Reprocess the edge_index file after loading

In [26]:
import os
import numpy as np
import pandas as pd

graph_output_folder = 'graph-data'
form_data_path = os.path.join(graph_output_folder, 'form_data')
edge_index = np.load(os.path.join(form_data_path, 'edge_index.npy'))
# Convert the 2D array into a DataFrame
edge_index_df = pd.DataFrame(edge_index.T, columns=['src', 'dest'])

gene_edge_num_df = pd.read_csv(os.path.join(graph_output_folder, 'all-gene-edge-num.csv'))
src_gene_list = list(gene_edge_num_df['src'])
dest_gene_list = list(gene_edge_num_df['dest'])
edgetype_list = list(gene_edge_num_df['EdgeType'])
gene_edge_num_reverse_df = pd.DataFrame({'src': dest_gene_list, 'dest': src_gene_list, 'EdgeType': edgetype_list})
gene_edge_num_all_df = pd.concat([gene_edge_num_df, gene_edge_num_reverse_df]).drop_duplicates().sort_values(by=['src', 'dest']).reset_index(drop=True)

display(edge_index_df)
display(gene_edge_num_all_df)
merged_gene_edge_num_all_df = pd.merge(gene_edge_num_all_df, edge_index_df, on=['src', 'dest'], how='inner')
display(merged_gene_edge_num_all_df)
merged_gene_edge_num_all_df.to_csv(os.path.join(graph_output_folder, 'merged-gene-edge-num-all.csv'), index=False)

merged_gene_edge_name_all_df = merged_gene_edge_num_all_df.replace(sorted_all_gene_dict)
display(merged_gene_edge_name_all_df)
merged_gene_edge_name_all_df.to_csv(os.path.join(graph_output_folder, 'merged-gene-edge-name-all.csv'), index=False)

Unnamed: 0,src,dest
0,0,405
1,0,406
2,0,1331
3,0,1332
4,0,1333
...,...,...
39975,4154,1802
39976,4155,1836
39977,4156,1864
39978,4157,1902


Unnamed: 0,src,dest,EdgeType
0,0,405,Gene-Gene
1,0,406,Gene-Gene
2,0,1331,Gene-Gene
3,0,1332,Gene-Gene
4,0,1333,Gene-Gene
...,...,...,...
39975,4154,1802,Gene-Gene-PROT
39976,4155,1836,Gene-Gene-PROT
39977,4156,1864,Gene-Gene-PROT
39978,4157,1902,Gene-Gene-PROT


Unnamed: 0,src,dest,EdgeType
0,0,405,Gene-Gene
1,0,406,Gene-Gene
2,0,1331,Gene-Gene
3,0,1332,Gene-Gene
4,0,1333,Gene-Gene
...,...,...,...
39975,4154,1802,Gene-Gene-PROT
39976,4155,1836,Gene-Gene-PROT
39977,4156,1864,Gene-Gene-PROT
39978,4157,1902,Gene-Gene-PROT


Unnamed: 0,src,dest,EdgeType
0,ABL1,CRK,Gene-Gene
1,ABL1,CRKL,Gene-Gene
2,ABL1,NTRK1,Gene-Gene
3,ABL1,NTRK2,Gene-Gene
4,ABL1,NTRK3,Gene-Gene
...,...,...,...
39975,SMAD4-PROT,SMAD4,Gene-Gene-PROT
39976,SRC-PROT,SRC,Gene-Gene-PROT
39977,SYK-PROT,SYK,Gene-Gene-PROT
39978,TFRC-PROT,TFRC,Gene-Gene-PROT
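
The cell above makes the edge list symmetric by appending every edge with `src` and `dest` swapped, then keeps only the pairs that also occur in `edge_index.npy`. A toy version of the same steps follows, with a made-up 2 x N array standing in for the saved `edge_index`. All values are illustrative.

```python
import numpy as np
import pandas as pd

# Toy edge_index with shape (2, num_edges): row 0 = src, row 1 = dest
toy_edge_index = np.array([[0, 1, 2, 0],
                           [1, 0, 0, 2]])
toy_edge_index_df = pd.DataFrame(toy_edge_index.T, columns=['src', 'dest'])

# Toy directed edge list with edge types
toy_edges = pd.DataFrame({'src': [0, 2], 'dest': [1, 0],
                          'EdgeType': ['Gene-Gene', 'Gene-Gene-METH']})

# Swap src and dest to get the reverse edges, keeping the edge type
toy_reverse = toy_edges.rename(columns={'src': 'dest', 'dest': 'src'})
toy_both = (pd.concat([toy_edges, toy_reverse])
            .drop_duplicates()
            .sort_values(by=['src', 'dest'])
            .reset_index(drop=True))

# Keep only the edges that are present in edge_index
toy_merged = pd.merge(toy_both, toy_edge_index_df, on=['src', 'dest'], how='inner')
print(toy_merged)
```

An inner merge drops any edge that is missing from either side, so comparing `len(toy_merged)` with `len(toy_edge_index_df)` is a quick way to see whether the two sources agree.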
