# Process MSigDB v7

---

This example notebook generates the standard pathway table expected by the first step of the pipeline, using canonical pathways from MSigDB.


#### Table format

The standard table must have exactly two columns, `set_name` and `genes`, holding the gene set (or pathway) name and an Entrez gene ID, respectively. Each row pairs one gene with one gene set: a gene set with 10 genes occupies 10 rows, all with the same `set_name`. Genes MUST be represented by their Entrez IDs.
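As an illustration of this layout, here is a small hand-written table in the expected format. The pathway names and gene IDs are made up for the example and do not come from MSigDB.

```r
# Made-up example of the expected long format: one row per (gene set, gene) pair
example_table <- data.frame(
    set_name = c("TOY_PATHWAY_A", "TOY_PATHWAY_A", "TOY_PATHWAY_A",
                 "TOY_PATHWAY_B", "TOY_PATHWAY_B"),
    genes    = c("101", "102", "103", "102", "204"),
    stringsAsFactors = FALSE
)

# Number of rows per gene set equals the number of genes in that set
rows_per_set <- table(example_table$set_name)
print(rows_per_set)
```

Note that a gene shared by two sets (here `102`) simply appears once under each set name.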



In [1]:
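suppressPackageStartupMessages({
    library(plyr)       # load before dplyr so that dplyr's verbs (e.g. mutate) are not masked
    library(dplyr)
    library(biomaRt)
    library(data.table)
})
# purrr and tidyr are called below via `::`, so they must be installed as well
options(stringsAsFactors=FALSE)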

### Input settings

In [2]:
# Unprocessed gene set file
infile <- '~/projects/pdxn_2.0/data/gene_sets/MSigDB_v7/MSigDBV7_Canonical.RDS'
# Destination of the standard pathway table
outfile <- '../../input/std_gene_tables/MSigDB_v7_pathway_table.csv'

### Load raw data

In [3]:
gs<-readRDS(infile)
head(gs,1)
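
The processing step indexes `gs` by name, so the loaded object is presumably a named list with one vector of gene IDs per gene set. A minimal sketch of a structure check along those lines, run on a made-up stand-in list (`toy_gs` and `is_gene_set_list` are hypothetical names, not part of the pipeline):

```r
# Hypothetical stand-in for the object returned by readRDS(); names and IDs are made up
toy_gs <- list(
    Pathway.TOY_A = c("101", "102"),
    Pathway.TOY_B = c("103")
)

# TRUE when x is a list, every element has a non-empty name,
# and every element is an atomic vector
is_gene_set_list <- function(x) {
    is.list(x) &&
        !is.null(names(x)) &&
        all(nzchar(names(x))) &&
        all(vapply(x, is.atomic, logical(1)))
}

print(is_gene_set_list(toy_gs))
```

Calling `is_gene_set_list(gs)` on the real object would flag an unexpected structure before the flattening step runs.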

### Processing

In [4]:
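# Flatten the named list of gene sets into one row per (set_name, gene) pair
gs_df <- data.frame(set_name=names(gs)) %>%
         mutate(genes=purrr::map(set_name,function(pn){gs[[pn]]})) %>% # list-column of gene IDs per set
         tidyr::unnest(genes) %>% # expand to one row per gene
         filter(!grepl('Static_Module',set_name)) %>% # Remove static module pathways
         filter(!grepl('L1000',set_name)) # Remove LINCS pathways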
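
To make the transformation concrete, the sketch below reproduces the same filter-and-flatten logic in base R on a made-up list. It is an alternative formulation for illustration, not the notebook's own code; `toy_gs`, `kept`, and `toy_df` are hypothetical names, and the sets are filtered before flattening rather than after.

```r
# Made-up gene sets; only the first should survive the name filters
toy_gs <- list(
    Pathway.TOY_A               = c("101", "102"),
    Pathway.TOY_Static_Module_1 = c("103"),
    Pathway.TOY_L1000_X         = c("104", "105")
)

# Drop static module and LINCS (L1000) sets by name
kept <- toy_gs[!grepl("Static_Module|L1000", names(toy_gs))]

# Repeat each set name once per gene and line the genes up alongside
toy_df <- data.frame(
    set_name = rep(names(kept), lengths(kept)),
    genes    = unlist(kept, use.names = FALSE),
    stringsAsFactors = FALSE
)
print(toy_df)
```

Filtering on the list names first avoids expanding sets that are discarded anyway; the resulting rows are the same either way.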

### Visualize standard table

In [5]:
head(gs_df)

| set_name | genes |
| --- | --- |
| `<chr>` | `<chr>` |
| Pathway.KEGG_GLYCOLYSIS_GLUCONEOGENESIS | 55902 |
| Pathway.KEGG_GLYCOLYSIS_GLUCONEOGENESIS | 2645 |
| Pathway.KEGG_GLYCOLYSIS_GLUCONEOGENESIS | 5232 |
| Pathway.KEGG_GLYCOLYSIS_GLUCONEOGENESIS | 5230 |
| Pathway.KEGG_GLYCOLYSIS_GLUCONEOGENESIS | 5162 |
| Pathway.KEGG_GLYCOLYSIS_GLUCONEOGENESIS | 5160 |


In [6]:
nrow(gs_df)

In [7]:
message('Total number of pathways = ',length(unique(gs_df$set_name)))
message('Total number of unique genes = ',length(unique(gs_df$genes)))

Total number of pathways = 2199

Total number of unique genes = 11763
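
Beyond these counts, the format rules from the "Table format" section can be checked programmatically. The helper below is a minimal sketch: `check_pathway_table` is a hypothetical name, and the digits-only test relies on Entrez gene IDs being positive integers. It is demonstrated on made-up tables.

```r
# Stops with an error if the table violates the expected format
check_pathway_table <- function(tbl) {
    stopifnot(
        is.data.frame(tbl),
        identical(colnames(tbl), c("set_name", "genes")),  # exactly these two columns
        !anyNA(tbl$set_name),
        !anyNA(tbl$genes),
        all(grepl("^[0-9]+$", as.character(tbl$genes)))    # IDs must be digits only
    )
    invisible(TRUE)
}

# Made-up tables: one valid, one using gene symbols instead of Entrez IDs
good_table <- data.frame(set_name = c("TOY_A", "TOY_A"),
                         genes    = c("101", "102"),
                         stringsAsFactors = FALSE)
bad_table  <- data.frame(set_name = c("TOY_A", "TOY_A"),
                         genes    = c("GENE1", "GENE2"),
                         stringsAsFactors = FALSE)

good_ok <- check_pathway_table(good_table)
bad_result <- try(check_pathway_table(bad_table), silent = TRUE)
```

In the notebook this would be called as `check_pathway_table(gs_df)` before writing the file.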



### Write pathway table

In [8]:
write.table(gs_df,
            file = outfile,
            quote = FALSE,
            sep = ",",
            row.names = FALSE)
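
A quick way to confirm the file was written as intended is to read it back and compare. The sketch below does this round trip on a made-up table and a temporary file (`toy_table`, `tmp`, and `read_back` are hypothetical names); `colClasses = "character"` keeps the gene IDs as text, matching the `<chr>` columns shown above.

```r
# Made-up table in the standard format
toy_table <- data.frame(
    set_name = c("TOY_PATHWAY_A", "TOY_PATHWAY_A", "TOY_PATHWAY_B"),
    genes    = c("101", "102", "103"),
    stringsAsFactors = FALSE
)

# Write with the same settings as the cell above, to a temporary file
tmp <- tempfile(fileext = ".csv")
write.table(toy_table, file = tmp, quote = FALSE, sep = ",", row.names = FALSE)

# Read back, forcing both columns to character
read_back <- read.csv(tmp, colClasses = "character")
unlink(tmp)

print(read_back)
```

Because the file is written with `quote = FALSE`, this only round-trips cleanly when set names contain no commas.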