# Integrating gene regulatory networks with multi-omics data
Author: Romana T. Pop^1^

1. Centre for Molecular Medicine Norway (NCMM), Faculty of Medicine, University of Oslo, Oslo, Norway

## Introduction
In this notebook, we reproduce the analysis presented in *paper link*.

## Necessary data and software
Before starting the analysis, make sure that all the necessary data are downloaded and the software is installed. For exact reproducibility of the results presented in *paper*, we provide a container with the environment used for the analysis. The data are available on Zenodo at *link*.

If you are not using the container provided, we recommend cloning this repository and then downloading the data from Zenodo into the cloned repository directory.

In [None]:
# ensure environment is clean
rm(list=ls())

# install MARMOT
library(devtools)
install_github("rtpop/MARMOT")

# load libraries
library(MARMOT)
library(tidyverse)
library(preprocessCore)
library(RColorBrewer)
library(msigdbr)
library(gridExtra)

## Setting parameters
Here we set some parameters that will be used throughout the notebook.

In [None]:
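# setting working directory
# an absolute path is stored so that file.path(wd, ...) still resolves
# correctly after setwd()
wd <- normalizePath("results", mustWork = TRUE)
setwd(wd)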

# specify where plots should be saved
figure_dir <- "figures"

# specify data directories
# the clinical data are expected in these directories as well
data_tcga <- "data/TCGA"
data_gep <- "data/GEP"

# specify directory for logs to be saved
log_dir <- "logs"

# defining vector of cancer names for which to do the analysis
cancers_tcga <- c("aml", "breast", "colon", "gbm", "kidney", "liver", "lung",
          "melanoma", "ovarian", "sarcoma")
cancers_gep <- "liver"

# defining names for the JDR models that we will run
model <- c("nonet", "indeg", "out", "both")

# define vector of omic names that will be used
omics_tcga <- c("expression", "methylation", "miRNA", "indegree", "outdegree")
omics_gep <- c("expression", "indegree", "outdegree")

# some intermediate files are provided for ease, set this parameter to FALSE
# if you do not wish to use them and wish to compute them again instead
precomputed <- TRUE
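
The output directories named above need to exist before anything is written to them. As a minimal sketch, assuming they should simply be created when missing, the hypothetical helper `ensure_dirs()` below (not part of MARMOT) does this with base R only. It is demonstrated on a temporary directory so that it can be run without touching the project folders.

```r
# Hypothetical helper (not part of MARMOT): create any missing directories.
ensure_dirs <- function(dirs) {
    for (d in dirs) {
        if (!dir.exists(d)) {
            dir.create(d, recursive = TRUE)
        }
    }
    # return, invisibly, whether each directory now exists
    invisible(dir.exists(dirs))
}

# Demonstration in a temporary location
base_dir <- tempdir()
created <- ensure_dirs(file.path(base_dir, c("figures", "logs")))
```

In the notebook itself, the equivalent call would be `ensure_dirs(c(figure_dir, log_dir))`.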

## Preparing the data
**This entire section can be skipped if `precomputed = TRUE`.**

We reformat the data and prepare it for downstream analysis. The joint dimensionality reduction (JDR) tools used take a list of matrices as input. Here, we create a list of matrices for each cancer type and quantile normalise the indegrees. We also perform PCA on the omics and create a separate list of matrices for the PCA data.
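
To make the quantile normalisation step concrete, here is a minimal base-R sketch of the idea. It is not the implementation used by `preprocessCore::normalize.quantiles()`. The helper `quantile_normalise()` and the toy matrix are invented for illustration, and ties are broken by order of appearance, which may differ from how `preprocessCore` treats them.

```r
# Illustrative quantile normalisation (hypothetical helper, base R only).
# After normalisation every column shares the same set of values.
quantile_normalise <- function(x) {
    x <- as.matrix(x)
    # rank of each value within its column (ties broken by order of appearance)
    ranks <- apply(x, 2, rank, ties.method = "first")
    # reference distribution: mean of the k-th smallest values across columns
    ref <- rowMeans(apply(x, 2, sort))
    # replace each value by the reference value at its rank
    out <- apply(ranks, 2, function(r) ref[r])
    # restore the original row and column names
    dimnames(out) <- dimnames(x)
    out
}

# Toy matrix: 4 genes (rows) by 3 samples (columns), made-up values
toy <- matrix(c(5, 2, 3, 4,
                4, 1, 4, 2,
                3, 4, 6, 8),
              nrow = 4,
              dimnames = list(paste0("gene", 1:4), paste0("s", 1:3)))
toy_qn <- quantile_normalise(toy)
```

The sketch keeps row and column names explicitly, which matters downstream because samples and genes are matched by name.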

Since we are working with several datasets, the metadata can be messy and inconsistent. Here, we re-format the survival data to ensure the labels are uniform across the cancer types.
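
The kind of harmonisation meant here can be sketched in base R. This is an assumption-laden illustration, not how MARMOT's `prepare_surv()` is implemented: the helper `harmonise_surv()`, the toy clinical table, and its sample IDs and status labels are all invented. It assumes that the time to event is taken from the first non-missing of the listed time columns.

```r
# Toy clinical table with made-up values and column names
clinical <- data.frame(
    sampleID = c("PT-01-A", "PT-02-A", "PT-03-A"),
    vital_status = c("DECEASED", "LIVING", "LIVING"),
    days_to_death = c(420, NA, NA),
    days_to_last_followup = c(NA, 1310, 655),
    stringsAsFactors = FALSE
)

# Hypothetical helper: map dataset-specific columns to uniform labels
harmonise_surv <- function(clinical, feature_names) {
    time_cols <- clinical[, feature_names$time_to_event, drop = FALSE]
    # assumption: use the first non-missing time column for each sample
    time_to_event <- apply(time_cols, 1, function(r) {
        r <- r[!is.na(r)]
        if (length(r) > 0) r[[1]] else NA_real_
    })
    data.frame(
        # replace "-" with "." so IDs match the omics column names
        sample_id = gsub("-", ".", clinical[[feature_names$sample_id]],
                         fixed = TRUE),
        vital_status = clinical[[feature_names$vital_status]],
        time_to_event = as.numeric(time_to_event),
        stringsAsFactors = FALSE
    )
}

surv_toy <- harmonise_surv(
    clinical,
    feature_names = list(
        sample_id = "sampleID",
        vital_status = "vital_status",
        time_to_event = c("days_to_death", "days_to_last_followup")
    )
)
```

Whatever the source column names are, the result always has the columns `sample_id`, `vital_status` and `time_to_event`.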

See the documentation of [MARMOT](https://github.com/rtpop/MARMOT) for complete details of the formatting and processing done below. 

Since we are applying this processing to many datasets, we first create a wrapper function for processing the omics data and one for processing the survival data. Please note that the paths and filenames below assume the use of the data and directory structure provided on Zenodo. If using this for your own data, you may need to change them accordingly. 

In [None]:
if (!precomputed) {
    # function for omic processing
    prepare_cancer_data <- function(cancer, data_dir, omic_names, log_dir, wd) {
        print(cancer)
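        # Map each omic name to its file and keep only the requested omics
        # (the mapping assumes the file names provided on Zenodo)
        omic_files <- c(expression = "log_exp", methylation = "methy",
                        miRNA = "log_mirna",
                        indegree = "indegree_quant.RData",
                        outdegree = "outdegree.RData")
        files <- file.path(data_dir, cancer, unname(omic_files[omic_names]))

        # Quantile normalise indegrees (only if not already done)
        indegree_quant_file <- file.path(data_dir, cancer,
                                         "indegree_quant.RData")
        if (!file.exists(indegree_quant_file)) {
            indegree_file <- file.path(data_dir, cancer, "indegree.RData")
            # the file is expected to contain an object named `indegree`
            load(indegree_file)
            indegree_mat <- as.matrix(indegree)
            # normalize.quantiles() drops dimnames, so restore them afterwards
            indegree <- normalize.quantiles(indegree_mat)
            dimnames(indegree) <- dimnames(indegree_mat)
            save(indegree, file = indegree_quant_file)
        }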

        # Prepare data without PCA
        omics <- prepare_data(omics = files, names = omic_names, pca = FALSE,
                              logs = TRUE,
                              log_name = file.path(log_dir, 
                                                   "prep_data_no_pca_log.txt"))
        save(omics, file = file.path(wd, paste0(cancer, "_omics_no_pca.Rda")))

        # Prepare data with PCA
        omics <- prepare_data(omics = files, names = omic_names, pca = TRUE,
                              logs = TRUE, 
                              log_name = file.path(log_dir, 
                                                   "prep_data_pca_log.txt"),
                              file_name = paste0(cancer, 
                                                 "_omics_pca_results.Rda"))
        save(omics, file = file.path(wd, paste0(cancer, "_omics_pca.Rda")))
    }

    # Function to prepare survival data for a given cancer type
    prepare_survival_data <- function(cancer, data_dir, wd) {
        print(cancer)

        # Define the clinical data path
        clin <- file.path(data_dir, cancer)
        
        # Special handling for 'kidney' cancer type
        if (cancer == "kidney") {
            feature_names <- list(
                sample_id = "submitter_id.samples",
                vital_status = "vital_status.diagnoses",
                time_to_event = c("days_to_death.diagnoses",
                                  "days_to_last_follow_up.diagnoses")
            )
            surv <- prepare_surv(clinical = clin, feature_names = feature_names)
            surv$sample_id <- str_sub(surv$sample_id, end = -2)
        } else {
            feature_names <- list(
                sample_id = "sampleID",
                vital_status = "vital_status",
                time_to_event = c("days_to_death", "days_to_last_followup")
            )
            surv <- prepare_surv(clinical = clin, feature_names = feature_names)
        }

        # Standardize sample IDs
        surv$sample_id <- gsub("-", "\\.", surv$sample_id)

        # Save the survival data
        save(surv, file = file.path(wd, paste0(cancer, "_surv.Rda")))
    }
}

Now we apply these functions to the TCGA and GEP data.

In [None]:
if (!precomputed) {
    # Call the preparation function for each cancer type
    # for tcga
    # omics
    invisible(lapply(cancers_tcga, prepare_cancer_data, data_dir = data_tcga,
                     omic_names = omics_tcga, log_dir = log_dir, wd = wd))
    # survival
    invisible(lapply(cancers_tcga, prepare_survival_data, data_dir = data_tcga,
                     wd = wd))

    # for gep
    # omics
    prepare_cancer_data(cancer = cancers_gep, data_dir = data_gep,
                        omic_names = omics_gep, log_dir = log_dir, wd = wd)
    # survival
    prepare_survival_data(cancer = cancers_gep, data_dir = data_gep, wd = wd)
}
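
Each prepared object is saved under a fixed name (`omics` or `surv`), so loading several files into the global environment would overwrite them one after another. A minimal sketch of one way around this, assuming the file naming used above, is to load each file into its own environment. The helper `load_object()` is hypothetical and the demonstration uses a throwaway file in a temporary directory.

```r
# Hypothetical helper: load one named object from an .Rda file
# without touching the global environment.
load_object <- function(path, name) {
    env <- new.env()
    load(path, envir = env)
    get(name, envir = env)
}

# Demonstration with a toy file that mimics the naming used above
tmp_file <- file.path(tempdir(), "toy_omics_pca.Rda")
omics <- list(expression = matrix(1:4, nrow = 2))
save(omics, file = tmp_file)
rm(omics)

reloaded <- load_object(tmp_file, "omics")

# In the notebook, one list entry per cancer type could be built like this:
# omics_pca <- lapply(setNames(cancers_tcga, cancers_tcga), function(cn) {
#     load_object(file.path(wd, paste0(cn, "_omics_pca.Rda")), "omics")
# })
```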

## Comparing JDR models with and without PCA
We compare the models of four JDR tools with and without PCA.