This script concatenates the Reactome pathway definitions and the single-omics datasets for metabolomics and proteomics to form an integrated Reactome pathway definition and an integrated multi-omics dataset.

### Create multi-omics pathways

In [None]:
#Load libraries 
import pandas as pd
import sspa

The multi-omics pathways can be downloaded directly with the sspa package. However, that download returns the latest pathway version (v85), whereas I used v84 for the single-omics metabolomics and proteomics datasets. To keep the versions consistent, I have to manually concatenate the v84 pathway files for metabolomics and proteomics.

Method 1: Downloading from sspa package

In [None]:
#Download Reactome latest multi-omics for mice
reactome_mouse_latest_mo = sspa.process_reactome("Mus musculus", download_latest=True, filepath=".", omics_type='multiomics')

In [None]:
#Download Reactome latest multi-omics for human
reactome_human_latest_mo = sspa.process_reactome("Homo sapiens", download_latest=True, filepath=".", omics_type='multiomics')

Method 2: Manually concatenating specific version files

In [None]:
#Metabolomic pathways (ChEBI IDs)
metabolomic_reactome_pathways = sspa.process_gmt("../Data/Reactome_Homo_sapiens_pathways_compounds_R84.gmt")

In [None]:
#Proteomic pathways  (UniProt IDs)
proteomic_reactome_pathways = sspa.process_reactome('Homo sapiens', infile = '../Data/UniProt2Reactome_All_Levels_ver84.txt', download_latest = False, filepath = None)

In [None]:
proteomic_reactome_pathways = proteomic_reactome_pathways.rename_axis('Pathway_ID')
proteomic_reactome_pathways

Currently the only index is the Pathway ID. I create a multi-index so that the pathway name becomes part of the index as well.

In [None]:
metabolomic_reactome_pathways.set_index([metabolomic_reactome_pathways.index, metabolomic_reactome_pathways['Pathway_name']], inplace=True)
metabolomic_reactome_pathways.drop(['Pathway_name'], axis=1, inplace=True)

In [None]:
proteomic_reactome_pathways.set_index([proteomic_reactome_pathways.index, proteomic_reactome_pathways['Pathway_name']], inplace=True)
proteomic_reactome_pathways.drop(['Pathway_name'], axis=1, inplace=True)
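
The same two-level index can probably be built in one call with `set_index(..., append=True)`, which appends the column to the existing index and removes it from the columns. A minimal sketch on a toy table; the pathway IDs, names, and member IDs below are made up for illustration and are not taken from the Reactome files:

```python
import pandas as pd

# Toy stand-in for a pathway table (hypothetical IDs):
# Pathway ID as index, a name column, and member IDs in integer-labelled columns
toy_pathways = pd.DataFrame(
    {
        "Pathway_name": ["Toy pathway A", "Toy pathway B"],
        0: ["15377", "16761"],
        1: ["30616", None],
    },
    index=pd.Index(["R-TOY-1", "R-TOY-2"], name="Pathway_ID"),
)

# append=True keeps Pathway_ID as level 0 and adds Pathway_name as level 1;
# the Pathway_name column is removed from the frame in the same step
toy_multi = toy_pathways.set_index("Pathway_name", append=True)

print(toy_multi.index.names)
print(list(toy_multi.columns))
```

This avoids the separate `drop` call and the `inplace=True` mutation, at the cost of returning a new DataFrame.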

In [None]:
display(metabolomic_reactome_pathways)
display(proteomic_reactome_pathways)

In [None]:
#Merge pathways on the shared (Pathway ID, Pathway name) multi-index; the outer join keeps pathways found in either table
reactome_mo = metabolomic_reactome_pathways.merge(proteomic_reactome_pathways, how='outer', left_index=True, right_index=True)
reactome_mo
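
To illustrate what the outer merge on the multi-index does, here is a small sketch with two toy tables (all IDs and names are hypothetical). A pathway present in both tables ends up as a single row carrying both sets of member columns, and pathways present in only one table are kept with missing values on the other side. The explicit `suffixes` argument is my addition for readability; the cell above relies on the pandas defaults.

```python
import pandas as pd

index_names = ["Pathway_ID", "Pathway_name"]

# Toy metabolite table (hypothetical ChEBI-style IDs), integer column labels
toy_met = pd.DataFrame(
    [["15377", "30616"], ["16761", None]],
    index=pd.MultiIndex.from_tuples(
        [("R-TOY-1", "Toy pathway A"), ("R-TOY-2", "Toy pathway B")],
        names=index_names,
    ),
)

# Toy protein table (hypothetical UniProt-style IDs): shares R-TOY-2, adds R-TOY-3
toy_prot = pd.DataFrame(
    [["P00001", "P00002"], ["P00003", None]],
    index=pd.MultiIndex.from_tuples(
        [("R-TOY-2", "Toy pathway B"), ("R-TOY-3", "Toy pathway C")],
        names=index_names,
    ),
)

# Outer join on both index levels; overlapping column labels get a suffix
toy_mo = toy_met.merge(
    toy_prot,
    how="outer",
    left_index=True,
    right_index=True,
    suffixes=("_chebi", "_uniprot"),
)

print(toy_mo)
```

Both index levels carry the same names in the two tables, which is why the proteomic index was renamed to `Pathway_ID` before merging.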


In [None]:
#Move Pathway_name from the index back to a regular column
reactome_mo = reactome_mo.reset_index(level=[1])
reactome_mo.index


In [None]:
reactome_mo.to_csv("../Data/Reactome_multi_omics_ChEBI_Uniprot.csv")

In [None]:
#Read in file to check
#dtype="str" avoids a DtypeWarning: some columns mix string and integer values
mo_pathways = pd.read_csv("../Data/Reactome_multi_omics_ChEBI_Uniprot.csv", index_col=0, dtype="str")
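
A small sketch of why `dtype="str"` matters when reading the pathway file back in. The CSV content below is a made-up toy example: a column that holds only numeric-looking IDs is parsed as numbers when pandas infers the type, but stays as text when `dtype="str"` is given.

```python
import io

import pandas as pd

# Toy CSV (hypothetical IDs): column "0" holds only numeric-looking identifiers
toy_csv = io.StringIO(
    "Pathway_ID,Pathway_name,0,1\n"
    "R-TOY-1,Toy pathway A,15377,P00001\n"
    "R-TOY-2,Toy pathway B,16761,\n"
)

# Let pandas infer the column types
inferred = pd.read_csv(toy_csv, index_col=0)

# Rewind the buffer and read again, forcing every column to text
toy_csv.seek(0)
as_text = pd.read_csv(toy_csv, index_col=0, dtype="str")

print(inferred["0"].dtype)     # inferred as a numeric dtype
print(as_text["0"].tolist())   # identifiers kept as strings
```

Empty cells are still read as missing values in both cases, so pathways with fewer members keep their padding.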

In [None]:
mo_pathways

### Create multi-omics dataset

In [None]:
#Load datasets
metabolomic_df = pd.read_csv('../Data/Su_COVID_metabolomics_processed_commoncases.csv', index_col=0)
proteomic_df = pd.read_csv('../Data/Su_COVID_proteomics_processed_commoncases.csv', index_col=0)

In [None]:
#Filter to common samples: take the sample IDs present in both datasets
#(using sets also removes duplicate IDs) and keep only the INCOV samples
intersection = set(metabolomic_df.index) & set(proteomic_df.index)
intersection = [sample for sample in intersection if sample.startswith("INCOV")]

metabolomic_df = metabolomic_df[metabolomic_df.index.isin(intersection)]
proteomic_df = proteomic_df[proteomic_df.index.isin(intersection)]
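
An alternative sketch of the same filtering step using `Index.intersection` and `.loc`, shown on toy data. The sample IDs and values are hypothetical; only the `INCOV` prefix comes from the cell above. Selecting both tables with the same `common` index also puts their rows in the same order.

```python
import pandas as pd

# Toy tables (hypothetical sample IDs and values); rows are in different orders
toy_met = pd.DataFrame(
    {"m1": [1.0, 2.0, 3.0, 4.0]},
    index=["INCOV001", "INCOV002", "CTRL01", "INCOV003"],
)
toy_prot = pd.DataFrame(
    {"p1": [0.1, 0.2, 0.3]},
    index=["INCOV002", "INCOV001", "CTRL01"],
)

# Sample IDs present in both tables, then restricted to the INCOV prefix
common = toy_met.index.intersection(toy_prot.index)
common = common[common.str.startswith("INCOV")]

# Selecting with the same index gives both tables identical row labels and order
toy_met_common = toy_met.loc[common]
toy_prot_common = toy_prot.loc[common]

print(toy_met_common)
print(toy_prot_common)
```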


In [None]:
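#Join the two omics tables column-wise (rows are aligned on the sample index)
#iloc[:, :-2] drops the last two metabolomics columns, presumably metadata that the proteomics table also carries
concat_omics = pd.concat([metabolomic_df.iloc[:,:-2], proteomic_df], axis=1)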
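
A minimal sketch of the column-wise concatenation on toy data, assuming that the last two columns of each table are metadata (the column names `meta_a` and `meta_b`, the sample IDs, and the values are all hypothetical). It shows that `pd.concat(..., axis=1)` aligns rows by index label rather than by position, and adds a check for duplicated column names.

```python
import pandas as pd

# Toy tables (hypothetical values); the last two columns stand in for metadata
toy_met = pd.DataFrame(
    {"15377": [0.5, -0.2], "meta_a": ["x", "y"], "meta_b": ["u", "v"]},
    index=["INCOV001", "INCOV002"],
)
toy_prot = pd.DataFrame(
    {"P00001": [1.1, 0.9], "meta_a": ["y", "x"], "meta_b": ["v", "u"]},
    index=["INCOV002", "INCOV001"],  # different row order on purpose
)

# Drop the two metadata columns from the first table, then join column-wise;
# rows are matched on the sample ID, not on their position
toy_concat = pd.concat([toy_met.iloc[:, :-2], toy_prot], axis=1)

# Guard against the same column appearing twice in the integrated table
assert not toy_concat.columns.duplicated().any()

print(toy_concat)
```

Without the `iloc[:, :-2]` slice, the metadata columns would appear twice in the result and the duplicate check would fail.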

In [None]:
concat_omics.to_csv("../Data/Su_integrated_data.csv")

In [None]:
#Read in file to check
multiomic_df = pd.read_csv("../Data/Su_integrated_data.csv", index_col=0)

In [None]:
multiomic_df.dtypes

In [None]:
#Test kPCA on whole dataset
kpca_scores_all = sspa.sspa_kpca(multiomic_df, mo_pathways)