# Network file creation
In this notebook we create the files used by the Cytoscape STRING automation with the RCy3 package. These network files consist of an edge table and a node table, both based on the pathways that were selected in a previous Jupyter notebook.

In [1]:
# check wd
getwd()

In [2]:
# load libraries
library(limma)
library(qusage)
library(plyr)
library(dplyr)
library(tidyr)
library(biomaRt)

"package 'qusage' was built under R version 3.5.2"
Attaching package: 'dplyr'

The following objects are masked from 'package:plyr':

    arrange, count, desc, failwith, id, mutate, rename, summarise,
    summarize

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

"package 'tidyr' was built under R version 3.5.2"

First, we load the same .gmt files for all three databases (KEGG, Reactome and WikiPathways) and combine them into a single table.

In [3]:
# KEGG: read the gene sets and reshape them into a long (pathway, entrezgene) table
kegg <- read.gmt(file.path(getwd(), "data-input", "c2.cp.kegg.v6.2.entrez.gmt"))
kegg <- ldply(kegg, data.frame)
colnames(kegg)[c(1,2)] <- c("pathway", "entrezgene")

# Reactome
reactome <- read.gmt(file.path(getwd(), "data-input", "c2.cp.reactome.v6.2.entrez.gmt"))
reactome <- ldply(reactome, data.frame)
colnames(reactome)[c(1,2)] <- c("pathway", "entrezgene")

# WikiPathways
wp <- read.gmt(file.path(getwd(), "data-input", "wikipathways-20190410-gmt-Homo_sapiens.gmt"))
wp <- ldply(wp, data.frame)
colnames(wp)[c(1,2)] <- c("pathway", "entrezgene")

# combine the three databases into one long table
allDatabases <- rbind(kegg, reactome, wp)
head(allDatabases)

"incomplete final line found on 'C:/Users/Laurent/Documents/GitHub/inflammation_networks2/data-input/wikipathways-20190410-gmt-Homo_sapiens.gmt'"

pathway,entrezgene
KEGG_GLYCOLYSIS_GLUCONEOGENESIS,55902
KEGG_GLYCOLYSIS_GLUCONEOGENESIS,2645
KEGG_GLYCOLYSIS_GLUCONEOGENESIS,5232
KEGG_GLYCOLYSIS_GLUCONEOGENESIS,5230
KEGG_GLYCOLYSIS_GLUCONEOGENESIS,5162
KEGG_GLYCOLYSIS_GLUCONEOGENESIS,5160
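
For readers unfamiliar with the GMT format, the reshaping done above can be pictured with a small base-R sketch. This is not the `read.gmt` implementation from qusage: the helper `read_gmt_long` and the two toy gene sets are made up for illustration, and the sketch assumes each line holds a set name, a description field and then tab-separated gene IDs.

```r
# Toy GMT content (invented): name, description, then gene IDs, all tab-separated
gmt_lines <- c(
  "PATHWAY_A\thttp://example.org/a\t101\t102\t103",
  "PATHWAY_B\thttp://example.org/b\t102\t204"
)
tmp <- tempfile(fileext = ".gmt")
writeLines(gmt_lines, tmp)

# Hypothetical helper: one output row per (pathway, gene) pair
read_gmt_long <- function(path) {
  fields <- strsplit(readLines(path), "\t", fixed = TRUE)
  do.call(rbind, lapply(fields, function(f) {
    # field 1 = set name, field 2 = description, remaining fields = gene IDs
    data.frame(pathway = f[1], entrezgene = f[-(1:2)], stringsAsFactors = FALSE)
  }))
}

toy_long <- read_gmt_long(tmp)
print(toy_long)
```

The "incomplete final line" warning shown for the WikiPathways file usually only indicates that the file does not end with a newline; the gene sets are still read.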


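Load the DisGeNET gene-disease association file for inflammation (C0021368). It is used later to mark inflammation-associated genes in the node table.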

In [4]:
# DisGeNET
# load the gene-disease association summary for inflammation (C0021368)
disgenet <- read.delim(file.path(getwd(), "data-input", "C0021368_disease_gda_summary.tsv"))

# check head of disgenet file
head(disgenet)

# keep the gene ID, symbol and score columns and rename them to match the other tables
disgenet <- disgenet[,c("geneid", "symbol", "score")]
colnames(disgenet)[c(1,2,3)] <- c("entrezgene", "hgnc_symbol", "dis_score")

# keep only genes with a disease score above 0.01
infl <- disgenet[disgenet$dis_score > 0.01,]
head(infl)
paste0("The number of genes is ", nrow(infl), ".")

diseaseid,disease_name,geneid,symbol,uniprot,protein_class,gene_name,dsi,dpi,pli,score,el,ei,npmids,nsnps,year_initial,year_final
C0021368,Inflammation,3576,CXCL8,P10145,signaling molecule,C-X-C motif chemokine ligand 8,0.342,0.862,,0.4,no reported evidence,,13,0,2003,2015
C0021368,Inflammation,1401,CRP,P02741,,C-reactive protein,0.399,0.862,0.0037357,0.4,no reported evidence,,18,0,2001,2011
C0021368,Inflammation,7124,TNF,P01375,signaling molecule,tumor necrosis factor,0.263,0.966,0.8046,0.4,no reported evidence,,26,0,2001,2014
C0021368,Inflammation,7040,TGFB1,P01137,signaling molecule,transforming growth factor beta 1,0.336,0.931,0.17182,0.38,no reported evidence,,10,0,2002,2010
C0021368,Inflammation,3569,IL6,P05231,,interleukin 6,0.287,0.966,0.33873,0.37,no reported evidence,,18,0,2001,2016
C0021368,Inflammation,3553,IL1B,P01584,,interleukin 1 beta,0.312,0.931,0.12568,0.37,no reported evidence,,16,0,1999,2017


entrezgene,hgnc_symbol,dis_score
3576,CXCL8,0.4
1401,CRP,0.4
7124,TNF,0.4
7040,TGFB1,0.38
3569,IL6,0.37
3553,IL1B,0.37


Next, we load the file that contains the selected pathways. This file has an extra column with the process cluster each pathway belongs to. Note that this clustering was done manually, based on literature and prior knowledge. Additionally, pathways that describe disease processes were removed.

In [5]:
# load in selected pathways
selectedpws <- read.table(file.path(getwd(), "data-output", "20190509selectedpws.txt"), header = TRUE, sep = "\t")
head(selectedpws)

pathway,process
ACE Inhibitor Pathway%WikiPathways_20190410%WP554%Homo sapiens,Angiogenesis
Angiogenesis%WikiPathways_20190410%WP1539%Homo sapiens,Angiogenesis
REACTOME_CREATION_OF_C4_AND_C2_ACTIVATORS,Complement
Cytokines and Inflammatory Response%WikiPathways_20190410%WP530%Homo sapiens,Cytokines
IL-10 Anti-inflammatory Signaling Pathway %WikiPathways_20190410%WP4495%Homo sapiens,Cytokines
Platelet-mediated interactions with vascular and circulating cells%WikiPathways_20190410%WP4462%Homo sapiens,Cytokines


First, let's create the edge table.

In [6]:
# edge table: keep only the rows that belong to a selected pathway
edge_table <- as.data.frame(allDatabases[allDatabases$pathway %in% selectedpws$pathway,])
head(edge_table)
tail(edge_table)

Unnamed: 0,pathway,entrezgene
7430,KEGG_NOD_LIKE_RECEPTOR_SIGNALING_PATHWAY,7205
7431,KEGG_NOD_LIKE_RECEPTOR_SIGNALING_PATHWAY,841
7432,KEGG_NOD_LIKE_RECEPTOR_SIGNALING_PATHWAY,1147
7433,KEGG_NOD_LIKE_RECEPTOR_SIGNALING_PATHWAY,257397
7434,KEGG_NOD_LIKE_RECEPTOR_SIGNALING_PATHWAY,2919
7435,KEGG_NOD_LIKE_RECEPTOR_SIGNALING_PATHWAY,8767


Unnamed: 0,pathway,entrezgene
72283,Platelet-mediated interactions with vascular and circulating cells%WikiPathways_20190410%WP4462%Homo sapiens,3553
72284,Platelet-mediated interactions with vascular and circulating cells%WikiPathways_20190410%WP4462%Homo sapiens,6403
72285,Platelet-mediated interactions with vascular and circulating cells%WikiPathways_20190410%WP4462%Homo sapiens,6347
72286,Platelet-mediated interactions with vascular and circulating cells%WikiPathways_20190410%WP4462%Homo sapiens,6404
72287,Platelet-mediated interactions with vascular and circulating cells%WikiPathways_20190410%WP4462%Homo sapiens,958
72288,Platelet-mediated interactions with vascular and circulating cells%WikiPathways_20190410%WP4462%Homo sapiens,959
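
The `%in%` filter above keeps only rows whose pathway name matches a selected name exactly, so a name that differs by a single character is dropped without any message. A minimal sketch of a sanity check, using invented pathway names (`db_pathways` and `selected` are hypothetical stand-ins for `allDatabases$pathway` and `selectedpws$pathway`):

```r
# Toy data: names are made up; note the trailing space in the second selected name
db_pathways <- c("PW_ONE", "PW_TWO", "PW_THREE")
selected <- c("PW_ONE", "PW_TWO ", "PW_FOUR")

# Selected names without an exact match in the combined databases
unmatched <- setdiff(selected, db_pathways)
print(unmatched)

# Trimming surrounding whitespace on both sides rescues names that differ only by spacing
still_unmatched <- setdiff(trimws(selected), trimws(db_pathways))
print(still_unmatched)
```

If the first result is empty for the real data, every selected pathway contributed rows to the edge table.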


Change pathway names to process names based on the manual clustering.

In [7]:
# change names to process names
edge_table$pathway <- selectedpws$process[match(edge_table$pathway, selectedpws$pathway)]
head(edge_table)
tail(edge_table)

Unnamed: 0,pathway,entrezgene
7430,NFkB,7205
7431,NFkB,841
7432,NFkB,1147
7433,NFkB,257397
7434,NFkB,2919
7435,NFkB,8767


Unnamed: 0,pathway,entrezgene
72283,Cytokines,3553
72284,Cytokines,6403
72285,Cytokines,6347
72286,Cytokines,6404
72287,Cytokines,958
72288,Cytokines,959
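
The renaming step relies on `match()` as a lookup: for every edge it returns the row of `selectedpws` that holds the same pathway name, and that row index is used to pick the process. A small sketch with invented data (`lookup` and `edges` are hypothetical stand-ins for `selectedpws` and `edge_table`) also shows why duplicate rows can appear afterwards:

```r
# Invented lookup table: two pathways map to the same process
lookup <- data.frame(
  pathway = c("PW_ONE", "PW_TWO", "PW_THREE"),
  process = c("Cytokines", "Cytokines", "NFkB"),
  stringsAsFactors = FALSE
)
edges <- data.frame(
  pathway = c("PW_THREE", "PW_ONE", "PW_ONE", "PW_TWO"),
  entrezgene = c("841", "3553", "6347", "3553"),
  stringsAsFactors = FALSE
)

# match() gives, for each edge, the row index of its pathway in the lookup
idx <- match(edges$pathway, lookup$pathway)
edges$pathway <- lookup$process[idx]

# Gene 3553 occurs in two pathways of the same process, so it is now duplicated
print(edges)
print(unique(edges))
```

A pathway name that is absent from the lookup would yield `NA` as its process, so checking `anyNA(idx)` is a cheap safeguard.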


In [8]:
# only unique rows
edge_table <- unique(edge_table)

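# map entrezgene IDs to hgnc symbols
# note: newer Ensembl releases name this attribute and filter 'entrezgene_id' instead of 'entrezgene'
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")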

genes <- getBM(
  attributes = c('hgnc_symbol', 'entrezgene'), 
  filters = 'entrezgene',
  values = edge_table$entrezgene,
  mart = ensembl
)

head(genes)
dim(genes)

hgnc_symbol,entrezgene
AKT3,10000
CDH3,1001
NAMPT,10135
ATP6AP2,10159
TLR6,10333
NOD1,10392


In [9]:
# add the hgnc symbol that belongs to each entrezgene ID
# remove rows without a symbol (NA); these are pseudogenes, microRNAs or discontinued genes
edge_table$hgnc_symbol <- genes$hgnc_symbol[match(edge_table$entrezgene, genes$entrezgene)]
edge_table <- edge_table[!is.na(edge_table$hgnc_symbol),]
head(edge_table)
tail(edge_table)

# save edge table
write.table(edge_table, file.path(getwd(), "data-output", "edges.txt"), col.names = TRUE, row.names = FALSE, sep = "\t", quote = FALSE)

Unnamed: 0,pathway,entrezgene,hgnc_symbol
7430,NFkB,7205,TRIP6
7431,NFkB,841,CASP8
7432,NFkB,1147,CHUK
7433,NFkB,257397,TAB3
7434,NFkB,2919,CXCL1
7435,NFkB,8767,RIPK2


Unnamed: 0,pathway,entrezgene,hgnc_symbol
72282,Cytokines,51284,TLR7
72284,Cytokines,6403,SELP
72285,Cytokines,6347,CCL2
72286,Cytokines,6404,SELPLG
72287,Cytokines,958,CD40
72288,Cytokines,959,CD40LG
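
Before dropping unmapped genes it can be useful to see which IDs are lost. The sketch below uses invented stand-ins (`toy_edges`, `toy_genes`) for `edge_table` and the `getBM()` result. It also treats empty symbols as missing, which is an assumption about what the mapping may return and not something shown in the output above.

```r
# Invented edge table and ID-to-symbol mapping
toy_edges <- data.frame(
  pathway = c("NFkB", "NFkB", "Cytokines", "Cytokines"),
  entrezgene = c("841", "99999", "3553", "88888"),
  stringsAsFactors = FALSE
)
toy_genes <- data.frame(
  hgnc_symbol = c("CASP8", "IL1B", ""),
  entrezgene = c("841", "3553", "88888"),
  stringsAsFactors = FALSE
)

# Same lookup pattern as in the notebook
toy_edges$hgnc_symbol <- toy_genes$hgnc_symbol[match(toy_edges$entrezgene, toy_genes$entrezgene)]

# Assumption: an empty symbol should be handled like a missing one
toy_edges$hgnc_symbol[toy_edges$hgnc_symbol %in% ""] <- NA

# Report the IDs that will be removed, then remove them
dropped <- toy_edges$entrezgene[is.na(toy_edges$hgnc_symbol)]
print(dropped)

toy_edges <- toy_edges[!is.na(toy_edges$hgnc_symbol), ]
print(toy_edges)
```

Keeping the `dropped` vector makes it possible to verify afterwards that the removed IDs really are pseudogenes, microRNAs or discontinued genes.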


Now we only have to make the node table. The edge and node tables can then be used in the Cytoscape STRING automation notebook.

In [10]:
# node table
# split the edge table into two separate data frames, rbind, unique, add type.
pathwayNodes <- as.data.frame(unique(edge_table$pathway))
colnames(pathwayNodes)[1] <- "nodes"

geneNodes_entrezgene <- as.data.frame(unique(edge_table$entrezgene))
colnames(geneNodes_entrezgene)[1] <- "nodes"

geneNodes_hgnc <- as.data.frame(unique(edge_table$hgnc_symbol))
colnames(geneNodes_hgnc)[1] <- "nodes"

node_table_entrezgene <- as.data.frame(rbind(pathwayNodes, geneNodes_entrezgene))
node_table_hgnc <- as.data.frame(rbind(pathwayNodes, geneNodes_hgnc))

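# add type: every node starts as "Gene", genes in the filtered DisGeNET list become "InflGene",
# and process nodes are labelled "Process" last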
node_table_entrezgene$type <- "Gene"
node_table_entrezgene$type[node_table_entrezgene$nodes %in% infl$entrezgene] <- "InflGene"
node_table_entrezgene$type[node_table_entrezgene$nodes %in% edge_table$pathway] <- "Process"

node_table_hgnc$type <- "Gene"
node_table_hgnc$type[node_table_hgnc$nodes %in% infl$hgnc_symbol] <- "InflGene"
node_table_hgnc$type[node_table_hgnc$nodes %in% edge_table$pathway] <- "Process"

# save node tables (one with entrezgene IDs, one with hgnc symbols)
write.table(node_table_entrezgene, file.path(getwd(), "data-output", "nodes_entrezgene.txt"), col.names = TRUE, row.names = FALSE, sep = "\t", quote = FALSE)
write.table(node_table_hgnc, file.path(getwd(), "data-output", "nodes_hgnc.txt"), col.names = TRUE, row.names = FALSE, sep = "\t", quote = FALSE)
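
The typing logic of the node table can be checked on a small example. The sketch below uses invented data (`toy_edges` and `toy_infl` are hypothetical stand-ins for `edge_table` and `infl$hgnc_symbol`) and builds the node list in one step rather than via separate data frames. The order of the assignments matters, because a later assignment overwrites an earlier one.

```r
# Invented edges between processes and genes
toy_edges <- data.frame(
  pathway = c("NFkB", "NFkB", "Cytokines"),
  hgnc_symbol = c("CASP8", "TNF", "IL6"),
  stringsAsFactors = FALSE
)
toy_infl <- c("TNF", "IL6")  # hypothetical disease-associated symbols

# One row per unique node: processes first, then genes
nodes <- unique(c(toy_edges$pathway, toy_edges$hgnc_symbol))
node_table <- data.frame(nodes = nodes, type = "Gene", stringsAsFactors = FALSE)

# Later assignments take precedence over earlier ones
node_table$type[node_table$nodes %in% toy_infl] <- "InflGene"
node_table$type[node_table$nodes %in% toy_edges$pathway] <- "Process"
print(node_table)

# Sanity check: every edge endpoint should appear as a node
stopifnot(all(c(toy_edges$pathway, toy_edges$hgnc_symbol) %in% node_table$nodes))
```

The same `stopifnot()` check can be run against the real edge and node tables before they are imported into Cytoscape.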

In [11]:
# information about this session
sessionInfo()

R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)

Matrix products: default

locale:
[1] LC_COLLATE=Dutch_Netherlands.1252  LC_CTYPE=Dutch_Netherlands.1252   
[3] LC_MONETARY=Dutch_Netherlands.1252 LC_NUMERIC=C                      
[5] LC_TIME=Dutch_Netherlands.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] biomaRt_2.38.0       tidyr_0.8.2          dplyr_0.7.8         
[4] plyr_1.8.4           qusage_2.16.1        limma_3.38.3        
[7] RevoUtils_11.0.1     RevoUtilsMath_11.0.0

loaded via a namespace (and not attached):
 [1] pbdZMQ_0.3-3         progress_1.2.0       tidyselect_0.2.5    
 [4] repr_0.19.1          purrr_0.3.0          lattice_0.20-35     
 [7] htmltools_0.3.6      stats4_3.5.1         base64enc_0.1-3     
[10] blob_1.1.1           XML_3.98-1.19        rlang_0.3.1         
[13] pillar_1.3.1         glue_1.3.0      