# Network extension

In this notebook we extend the network with genes that were excluded by the criteria used to select pathways. These genes occur in pathways from all three databases, but not in any pathway that was selected to build the network. Using the STRING app in Cytoscape, we will try to add these genes to the network through protein-protein interactions (PPIs).

In [1]:
# check working directory
getwd()

In [2]:
# load libraries
library(RCy3)
library(RNeo4j)
library(stringr)
library(dplyr)

"changing locked binding for 'length.path' in 'httr' whilst loading 'RNeo4j'"
Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union



Open Cytoscape on your computer and install the STRING app.

In [3]:
# check if Cytoscape is connected
cytoscapePing()
cytoscapeVersionInfo()

# install STRING app
installApp('STRINGapp') 
if("string" %in% commandsHelp("")) print("Success: the STRING app is installed") else print("Warning: STRING app is not installed. Please install the STRING app before proceeding.")

[1] "Available namespaces:"
[1] "Success: the STRING app is installed"


To begin, we will create the network in Cytoscape and then build a PPI network with the STRING app.

In [4]:
# load in network and data files
nodes <- read.table(file.path(getwd(), "data-output", "nodes_hgnc.txt"), header = T, sep = "\t")
edges <- read.table(file.path(getwd(), "data-output", "edges.txt"), header = T, sep = "\t")
disgenet <- read.delim(file.path(getwd(), "data-input", "C0021368_disease_gda_summary.tsv"))

head(nodes)
head(edges)
head(disgenet)

nodes,type
NFkB,Process
Inflammation,Process
Cytokines,Process
Complement,Process
Vitamin B12,Process
Immune cells,Process


pathway,entrezgene,hgnc_symbol
NFkB,7205,TRIP6
NFkB,841,CASP8
NFkB,1147,CHUK
NFkB,257397,TAB3
NFkB,2919,CXCL1
NFkB,8767,RIPK2


diseaseid,disease_name,geneid,symbol,uniprot,protein_class,gene_name,dsi,dpi,pli,score,el,ei,npmids,nsnps,year_initial,year_final
C0021368,Inflammation,3576,CXCL8,P10145,signaling molecule,C-X-C motif chemokine ligand 8,0.342,0.862,,0.4,no reported evidence,,13,0,2003,2015
C0021368,Inflammation,1401,CRP,P02741,,C-reactive protein,0.399,0.862,0.0037357,0.4,no reported evidence,,18,0,2001,2011
C0021368,Inflammation,7124,TNF,P01375,signaling molecule,tumor necrosis factor,0.263,0.966,0.8046,0.4,no reported evidence,,26,0,2001,2014
C0021368,Inflammation,7040,TGFB1,P01137,signaling molecule,transforming growth factor beta 1,0.336,0.931,0.17182,0.38,no reported evidence,,10,0,2002,2010
C0021368,Inflammation,3569,IL6,P05231,,interleukin 6,0.287,0.966,0.33873,0.37,no reported evidence,,18,0,2001,2016
C0021368,Inflammation,3553,IL1B,P01584,,interleukin 1 beta,0.312,0.931,0.12568,0.37,no reported evidence,,16,0,1999,2017


In [5]:
# clean up network and data files
# nodes
colnames(nodes)[1] <- "id"
nodes$id <- as.character(nodes$id)

# edges
edges <- edges[,c(-2)]
colnames(edges)[c(1,2)] <- c("source", "target") 
edges$interaction <- "interacts"
edges$source <- as.character(edges$source)
edges$target <- as.character(edges$target)

# disgenet
disgenet <- disgenet[,c("geneid", "symbol", "score")]
colnames(disgenet)[c(1,2,3)] <- c("entrezgene", "hgnc_symbol", "dis_score")
# keep only genes with a DisGeNET score above 0.01
infl <- disgenet[disgenet$dis_score > 0.01,]

head(nodes)
head(edges)
head(infl)
dim(infl)

id,type
NFkB,Process
Inflammation,Process
Cytokines,Process
Complement,Process
Vitamin B12,Process
Immune cells,Process


source,target,interaction
NFkB,TRIP6,interacts
NFkB,CASP8,interacts
NFkB,CHUK,interacts
NFkB,TAB3,interacts
NFkB,CXCL1,interacts
NFkB,RIPK2,interacts


entrezgene,hgnc_symbol,dis_score
3576,CXCL8,0.4
1401,CRP,0.4
7124,TNF,0.4
7040,TGFB1,0.38
3569,IL6,0.37
3553,IL1B,0.37


In [6]:
# create network from files
createNetworkFromDataFrames(nodes, edges, title = "MyNetwork", collection = "MyCollection")

Loading data...
Applying default style...
Applying preferred layout...


The network is now loaded, and everything is installed and running in Cytoscape. We will use all genes in the network, together with the genes associated with inflammation, to create a PPI network via the STRING app.

In [7]:
# get all genes from network and combine them with inflammation genes
networkGenes <- nodes[nodes$type != "Process",]
networkGenes <- networkGenes[-2]
colnames(networkGenes)[1] <- "hgnc_symbol"

inflGenes <- as.data.frame(infl$hgnc_symbol)
colnames(inflGenes)[1] <- "hgnc_symbol"

networkGenes <- rbind(networkGenes, inflGenes)
networkGenes <- unique(networkGenes)

head(networkGenes)
paste0("The number of genes is ", nrow(networkGenes), ".")

Unnamed: 0,hgnc_symbol
10,TRIP6
11,CASP8
12,CHUK
13,TAB3
14,CXCL1
15,RIPK2


Let's create the PPI network!
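Before sending the query to Cytoscape, it can help to look at the command string itself. The following is a minimal sketch with a toy gene vector; `toy_genes` and `toy_cmd` are names made up for this illustration, and the three symbols are simply the first rows of the DisGeNET table shown earlier.

```r
# toy gene list standing in for networkGenes$hgnc_symbol
toy_genes <- c("CXCL8", "TNF", "IL6")

# same pattern as the real command: fixed prefix, comma-separated symbols, closing quote
toy_cmd <- paste0('string protein query taxonID=9606 cutoff=0.9 query="',
                  paste(toy_genes, collapse = ","),
                  '"')

print(toy_cmd)
```

In this command, `taxonID=9606` is the NCBI taxonomy identifier for human and `cutoff=0.9` asks STRING for high-confidence interactions only.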

In [8]:
# create STRING app API command and create ppi network
string_cmd <- paste('string protein query taxonID=9606 cutoff=0.9 query="',paste(networkGenes$hgnc_symbol, collapse=","),'"',sep="")
commandsGET(string_cmd)

setVisualStyle("default")
setNodeLabelMapping(table.column = "display name")

### Now comes a manual step: merging the networks. In Cytoscape, go to 'Tools' -> 'Merge' -> 'Networks...'. Select the two networks and click 'Advanced options'. For the STRING network, choose 'query term' as the matching column!
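Why the matching column matters can be illustrated outside Cytoscape. The sketch below is a toy example, not the merge itself: the STRING identifiers are placeholders invented for illustration, and the assumption is that the STRING node table stores its own identifier in `name` while `query term` holds the gene symbol that was submitted.

```r
# node ids as used in our own network (gene symbols and process names)
network_nodes <- data.frame(id = c("Cytokines", "TNF", "IL6"),
                            stringsAsFactors = FALSE)

# toy STRING node table; the values in `name` are placeholders, not real STRING ids
string_nodes <- data.frame(name = c("string_id_1", "string_id_2", "string_id_3"),
                           "query term" = c("TNF", "IL6", "CXCL8"),
                           stringsAsFactors = FALSE,
                           check.names = FALSE)

# matching on `name` finds nothing, matching on `query term` finds the shared genes
matched_on_name  <- intersect(network_nodes$id, string_nodes[["name"]])
matched_on_query <- intersect(network_nodes$id, string_nodes[["query term"]])

print(matched_on_name)
print(matched_on_query)
```

Under these assumptions, only the `query term` column lets the merge recognize that a gene node in our network and a STRING node refer to the same gene.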

Now that we have the merged network, we will extract the edge table and split it into two separate files for the Neo4j part.
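The edge names in Cytoscape follow the pattern `source (interaction) target`. As a rough sketch of how such names can be taken apart, the toy example below uses one regular expression in base R; the variable names are made up for illustration, and the `Vitamin B12` to `MTR` edge is a hypothetical row added only to show that names containing spaces survive.

```r
# toy edge names in the "source (interaction) target" format
shared_name <- c("Cytokines (interacts) JAK1",
                 "Vitamin B12 (interacts) MTR",   # hypothetical edge, source contains a space
                 "GNG2 (pp) CCL4")

# three capture groups: source, interaction type, target
pattern <- "^(.*) \\(([^)]*)\\) (.*)$"

parsed <- data.frame(source      = sub(pattern, "\\1", shared_name),
                     interaction = sub(pattern, "\\2", shared_name),
                     target      = sub(pattern, "\\3", shared_name),
                     stringsAsFactors = FALSE)

print(parsed)
```

Splitting on the whole ` (interaction) ` separator, rather than on single spaces, keeps multi-word process names in one piece.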

In [9]:
# get table columns we need from edge table
table <- getTableColumns(table = "edge", columns = c("shared name", "interaction"))

head(table)

Unnamed: 0,shared name,interaction
16384,Cytokines (interacts) JAK1,interacts
16385,Cytokines (interacts) IL6ST,interacts
16386,Cytokines (interacts) IL6R,interacts
16387,Cytokines (interacts) IL6,interacts
16388,Inflammation (interacts) MYD88,interacts
16389,Inflammation (interacts) IRAK1,interacts


In [10]:
table_process <- table[table$interaction == "interacts",]
table_pp <- table[table$interaction == "pp",]

head(table_process)
head(table_pp)

Unnamed: 0,shared name,interaction
16384,Cytokines (interacts) JAK1,interacts
16385,Cytokines (interacts) IL6ST,interacts
16386,Cytokines (interacts) IL6R,interacts
16387,Cytokines (interacts) IL6,interacts
16388,Inflammation (interacts) MYD88,interacts
16389,Inflammation (interacts) IRAK1,interacts


Unnamed: 0,shared name,interaction
11797,GNG2 (pp) CCL4,pp
11798,GNG2 (pp) GNB1,pp
11799,GNG2 (pp) CCL5,pp
11800,GNG2 (pp) AKT1,pp
11801,GNG2 (pp) BDKRB2,pp
11802,GNG2 (pp) ADM,pp


In [11]:
# clean process table: split "source (interacts) target" on the literal separator,
# so that process names containing spaces (e.g. "Vitamin B12") stay intact
table_process <- as.data.frame(str_split_fixed(table_process[[1]], fixed(" (interacts) "), n = 2),
                               stringsAsFactors = FALSE)
colnames(table_process)[c(1,2)] <- c("source", "target")
                                     
head(table_process)
                                     
# save table
write.table(table_process, file.path(getwd(), "data-output", "cat_gene_table.txt"), col.names = T, row.names = F, sep = "\t", quote = F)

source,target
Cytokines,JAK1
Cytokines,IL6ST
Cytokines,IL6R
Cytokines,IL6
Inflammation,MYD88
Inflammation,IRAK1


In [12]:
# clean pp table: split "source (pp) target" on the literal separator,
# so that a "pp" inside a node name is never removed by accident
table_pp <- as.data.frame(str_split_fixed(table_pp[[1]], fixed(" (pp) "), n = 2),
                          stringsAsFactors = FALSE)
colnames(table_pp)[c(1,2)] <- c("source", "target")
                             
head(table_pp)

# save table
write.table(table_pp, file.path(getwd(), "data-output", "ppi_table.txt"), col.names = T, row.names = F, sep = "\t", quote = F)

source,target
GNG2,CCL4
GNG2,GNB1
GNG2,CCL5
GNG2,AKT1
GNG2,BDKRB2
GNG2,ADM


Now that the edge table is split up, we can count the shared neighbors between process nodes and added gene nodes. We will use the RNeo4j package for this purpose.
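To make the idea of a shared neighbor concrete, here is a small base R sketch on toy data. It is not the Neo4j workflow used in this notebook; it only illustrates the counting, under the assumption that a shared neighbor is a gene that belongs to a process and also has a PPI edge to the candidate gene. All table and column names are made up for this illustration.

```r
# toy process-to-gene edges (which genes belong to which process)
process_gene <- data.frame(process = c("Cytokines", "Cytokines", "Cytokines", "Inflammation"),
                           gene    = c("JAK1", "IL6ST", "IL6R", "MYD88"),
                           stringsAsFactors = FALSE)

# toy PPI edges from a network gene to a candidate gene
ppi <- data.frame(gene      = c("JAK1", "IL6ST", "IL6R", "MYD88", "JAK1"),
                  candidate = c("PTPN11", "PTPN11", "PTPN11", "IRAK4", "SHC1"),
                  stringsAsFactors = FALSE)

# every process -> gene -> candidate path
paths <- merge(process_gene, ppi, by = "gene")

# number of paths per (process, candidate) pair = number of shared neighbors
counts <- aggregate(gene ~ process + candidate, data = paths, FUN = length)
names(counts)[3] <- "common_neighbors"

print(counts)
```

In this toy data, `PTPN11` is connected to three genes of the `Cytokines` process, so it has three shared neighbors with that process node.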

In [13]:
# first make a connection with Neo4j. Start Neo4j and open the URL in a web browser. Create your own username and password
graph = startGraph("http://localhost:7474/db/data/", username = "neo4j", password = "123")

In [14]:
# read both tables and load them into Neo4j
data = data.frame(read.table(file.path(getwd(), "data-output", "cat_gene_table.txt"), header = T, sep = "\t"))
data <- unique(data)

query = "
MERGE (source:Category {id:{Category}})
MERGE (target:Gene {id:{Gene}})
CREATE (source)<-[:pathway]-(target)
"

t = newTransaction(graph)

for (i in 1:nrow(data)) {
  Category = data[i, ]$source
  Gene = data[i, ]$target
  
  appendCypher(t, 
               query, 
               Category = Category, 
               Gene = Gene 
               )
}

commit(t)

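data1 = data.frame(read.table(file.path(getwd(), "data-output", "ppi_table.txt"), header = T, sep = "\t"))

# genes in the target column of the ppi table get their own label (Gene1);
# the shared-neighbor query matches on this label
query = "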
MERGE (source:Gene {id:{Gene}})
MERGE (target:Gene1 {id:{Gene1}})
CREATE (source)<-[:ppi]-(target)
"

y = newTransaction(graph)

for (i in 1:nrow(data1)) {
  Gene = data1[i, ]$source
  Gene1 = data1[i, ]$target
  
  appendCypher(y, 
               query, 
               Gene = Gene, 
               Gene1 = Gene1 
  )
}

commit(y)

In [15]:
# perform Neo4J query
shared_neighbors <- cypher(graph, "MATCH(source:Category)-[:pathway]-(neighbor:Gene)-[:ppi]-(target:Gene1)
WHERE NOT (source) = (target)
RETURN DISTINCT source.id AS source_id, target.id AS target_id, count(neighbor) AS common_neighbors")

# keep only the 55 genes that were added via STRING
genes_added <- read.table(file.path(getwd(), "data-input", "genes_added.txt"), header = T, sep = "\t")

colnames(genes_added)[1] <- "gene"

# retrieve rows if one of the added genes is in that row
shared_neighbors_genes_added <- shared_neighbors[shared_neighbors$target_id %in% genes_added$gene,]

head(shared_neighbors_genes_added)
dim(shared_neighbors_genes_added)

Unnamed: 0,source_id,target_id,common_neighbors
2,Cytokines,GNB1,12
4,Cytokines,PTPN11,11
8,Cytokines,SHC1,18
11,Cytokines,PIK3R1,21
20,Cytokines,UBC,15
22,Cytokines,UBA52,12


In [16]:
# save shared neighbors table
write.table(shared_neighbors, file.path(getwd(), "data-output", "shared_neighbors.txt"), row.names = F, sep = "\t", quote = F)
write.table(shared_neighbors_genes_added, file.path(getwd(), "data-output", "shared_neighbors55.txt"), row.names = F, sep = "\t", quote = F)

With these shared neighbors we can decide which genes to include. We opt to include genes that have at least 4 shared neighbors with a process node.
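The cutoff of 4 is a choice, so it can be useful to see how many candidate edges survive at different cutoffs. The sketch below does this on a toy table; the first three rows reuse counts shown earlier in this notebook, while `GENE_X` and `GENE_Y` are hypothetical low-count rows added for illustration.

```r
# toy shared-neighbor table
toy_shared <- data.frame(source = rep("Cytokines", 5),
                         target = c("GNB1", "PTPN11", "SHC1", "GENE_X", "GENE_Y"),
                         common_neighbors = c(12, 11, 18, 3, 1),
                         stringsAsFactors = FALSE)

# candidate cutoffs to compare
thresholds <- c(1, 2, 4, 8, 16)

# number of rows kept at each cutoff
n_kept <- sapply(thresholds, function(k) sum(toy_shared$common_neighbors >= k))

print(data.frame(threshold = thresholds, n_kept = n_kept))
```

Running the same comparison on the real `shared_neighbors_genes_added` table would show how sensitive the final gene set is to the chosen cutoff.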

In [17]:
# 4 or more shared neighbors
shared_neighbors_genes_added <- shared_neighbors_genes_added[shared_neighbors_genes_added$common_neighbors >= 4,]
head(shared_neighbors_genes_added)
dim(shared_neighbors_genes_added)

Unnamed: 0,source_id,target_id,common_neighbors
2,Cytokines,GNB1,12
4,Cytokines,PTPN11,11
8,Cytokines,SHC1,18
11,Cytokines,PIK3R1,21
20,Cytokines,UBC,15
22,Cytokines,UBA52,12


In [18]:
# add these to the edge table and node table
# edge table
colnames(shared_neighbors_genes_added)[c(1,2)] <- c("source", "target") 
shared_neighbors_genes_added <- shared_neighbors_genes_added[-3]
edges1 <- edges[-3]
edge_table <- rbind(shared_neighbors_genes_added, edges1)
edge_table <- unique(edge_table)

head(edge_table)
dim(edge_table)

# save edge table
write.table(edge_table, file.path(getwd(), "data-output", "edge_table_final.txt"), row.names = F, sep = "\t", quote = F)

Unnamed: 0,source,target
2,Cytokines,GNB1
4,Cytokines,PTPN11
8,Cytokines,SHC1
11,Cytokines,PIK3R1
20,Cytokines,UBC
22,Cytokines,UBA52


In [19]:
# node table
source <- as.data.frame(edge_table[,"source"])
target <- as.data.frame(edge_table[,"target"])

colnames(source)[1] <- "id"
colnames(target)[1] <- "id"

node_table <- rbind(source, target)
node_table <- unique(node_table)

# add node typing to table
node_table$type <- "Gene"
node_table$type[node_table$id %in% edge_table$source] <- "Process"
node_table$type[node_table$id %in% infl$hgnc_symbol] <- "InflGene"

head(node_table)
dim(node_table)

# save node_table
write.table(node_table, file.path(getwd(), "data-output", "node_table_final.txt"), row.names = F, sep = "\t", quote = F)

Unnamed: 0,id,type
1,Cytokines,Process
18,Inflammation,Process
25,NFkB,Process
49,Angiogenesis,Process
68,Metabolism,Process
194,Complement,Process


We can use these files to create the network and integrate the gene expression data into this network for analysis.
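Before loading the final files elsewhere, a quick consistency check can catch mismatches between the node and edge tables. The following is a minimal sketch on toy tables with the same column layout as `node_table` and `edge_table`; the toy tables and variable names are assumptions made for illustration, and the missing `CHUK` node is deliberate.

```r
# toy edge and node tables in the same layout as the saved files
toy_edges <- data.frame(source = c("Cytokines", "Cytokines", "NFkB"),
                        target = c("IL6", "GNB1", "CHUK"),
                        stringsAsFactors = FALSE)
toy_nodes <- data.frame(id   = c("Cytokines", "NFkB", "IL6", "GNB1"),
                        type = c("Process", "Process", "InflGene", "Gene"),
                        stringsAsFactors = FALSE)

# every edge endpoint should appear in the node table
endpoints     <- unique(c(toy_edges$source, toy_edges$target))
missing_nodes <- setdiff(endpoints, toy_nodes$id)

# node ids should be unique
duplicated_ids <- toy_nodes$id[duplicated(toy_nodes$id)]

print(missing_nodes)
print(duplicated_ids)
```

An empty result for both checks on the real tables suggests that the node and edge files describe the same network.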

In [20]:
# information about the session
sessionInfo()

R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)

Matrix products: default

locale:
[1] LC_COLLATE=Dutch_Netherlands.1252  LC_CTYPE=Dutch_Netherlands.1252   
[3] LC_MONETARY=Dutch_Netherlands.1252 LC_NUMERIC=C                      
[5] LC_TIME=Dutch_Netherlands.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.7.8          stringr_1.3.1        RNeo4j_1.7.0        
[4] RCy3_2.2.6           RevoUtils_11.0.1     RevoUtilsMath_11.0.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0          pillar_1.3.1        compiler_3.5.1     
 [4] bindr_0.1.1         R.methodsS3_1.7.1   R.utils_2.7.0      
 [7] base64enc_0.1-3     tools_3.5.1         digest_0.6.18      
[10] uuid_0.1-2          tibble_2.0.1        jsonlite_1.6       
[13] evaluate_0.12       pkgconfig_2.0.2     rlang_0.3.1        
[16] graph_1.60.0        igraph_1.2