# Merging gene expression data
In this R notebook we merge the gene expression data files that were obtained from GEO and TCGA and then pre-processed. The pre-processing notebooks can be found on [GitHub](https://github.com/macsbio/inflammation_networks/tree/master/Jupyter-DataPreProcessing). The gene expression datasets were cleaned in a separate notebook, which is also in the repository.

In [1]:
# check wd
getwd()

In [2]:
# load libraries
library(dplyr)
library(biomaRt)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union



In [3]:
# load data files (tab-separated, with a header row)
data1 <- read.table(file.path(getwd(), "Datasets", "Clean", "breast-cancer_stats_clean.txt"), header = TRUE, sep = "\t")
data2 <- read.table(file.path(getwd(), "Datasets", "Clean", "lung-cancer-stats_clean.txt"), header = TRUE, sep = "\t")
data3 <- read.table(file.path(getwd(), "Datasets", "Clean", "MUO-Lean_stats_clean.txt"), header = TRUE, sep = "\t")
data4 <- read.table(file.path(getwd(), "Datasets", "Clean", "RA-control_GSE55235_stats_clean.txt"), header = TRUE, sep = "\t")
data5 <- read.table(file.path(getwd(), "Datasets", "Clean", "DCM_clean.txt"), header = TRUE, sep = "\t")
# NOTE: data6 (apparently the ICM statistics, judging by the logFC_ICM/PValue_ICM
# columns further down) is used in later cells but is not loaded here;
# it has to be read in the same way before those cells are run.

head(data1)

ensembl_gene_id,logFC_BC,PValue_BC,entrezgene
ENSG00000000003,-0.19677472,0.733608386,7105
ENSG00000000005,-8.22242726,1.14e-05,64102
ENSG00000000419,0.06848394,0.833040252,8813
ENSG00000000457,0.88768039,0.055529182,57147
ENSG00000000460,1.2717589,0.019866978,55732
ENSG00000000938,-3.09305336,0.000177035,2268


Now that all cleaned data files have been read in, we create a data frame that contains every unique entrezgene ID found in any of them. Using this data frame as the basis for the merge ensures that no gene is lost, even if it occurs in only one dataset. In R, a data frame that is reduced to a single column by subsetting collapses into a plain vector, so we also add a placeholder column called "Nonsense" that holds the value 1 in every row. With two columns the object keeps two dimensions and remains a data frame.
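To illustrate the idea before applying it to the real files, here is a minimal sketch on toy data. The objects `toy1`, `toy2`, `ids` and `toyStacked`, as well as the values in them, are made up for this example and are not part of the original analysis.

```r
# Toy stand-ins for two cleaned datasets of different length (hypothetical values)
toy1 <- data.frame(entrezgene = c(7105, 64102, NA),
                   logFC_A = c(-0.2, -8.2, 0.5))
toy2 <- data.frame(entrezgene = c(64102, 8813, 57147, 55732),
                   logFC_B = c(1.6, 0.5, 0.1, 1.4))

# c() concatenates vectors of unequal length without recycling values
ids <- c(toy1$entrezgene, toy2$entrezgene)

# drop missing IDs, then keep each ID once (order of first appearance is kept)
ids <- unique(ids[!is.na(ids)])

# placeholder column so the object has two columns
toyStacked <- data.frame(entrezgene = ids, Nonsense = 1)

# dropping the second column without drop = FALSE returns a plain vector
is.data.frame(toyStacked[, -2])
is.data.frame(toyStacked[, -2, drop = FALSE])
```

An alternative to the placeholder column is to subset with `drop = FALSE`, as the last line shows. The placeholder is kept in this notebook because later cells remove columns by position.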

In [8]:
# get all unique entrezgene IDs from all datasets
# (c() concatenates vectors of different lengths without recycling, unlike cbind())
allIds <- c(data1$entrezgene, data2$entrezgene, data3$entrezgene, data4$entrezgene,
            data5$entrezgene, data6$entrezgene)
allIds <- unique(allIds[!is.na(allIds)])

# placeholder column so that dataStacked has two columns and stays a data frame
dataStacked <- data.frame(entrezgene = allIds, Nonsense = 1)

head(dataStacked)
print(paste0("The number of unique entrezgene IDs is ", nrow(dataStacked)))

entrezgene,Nonsense
7105,1
64102,1
8813,1
57147,1
55732,1
2268,1


In [9]:
# merge tables based on dataStacked, with all entrezgene IDs from all datasets
dataTotal <- dataStacked %>% left_join(data1, by = "entrezgene")
dataTotal <- dataTotal[,c(-2,-3)]
dataTotal <- dataTotal %>% left_join(data2, by = "entrezgene")
dataTotal <- dataTotal[,c(-4,-5)]
dataTotal <- dataTotal %>% left_join(data3, by = "entrezgene")
dataTotal <- dataTotal[,-6]
dataTotal <- dataTotal %>% left_join(data4, by = "entrezgene")
dataTotal <- dataTotal[,-8]
dataTotal <- dataTotal %>% left_join(data5, by = "entrezgene")
dataTotal <- dataTotal[,c(-10,-13)]
dataTotal <- dataTotal %>% left_join(data6, by = "entrezgene")
dataTotal <- dataTotal[,c(-12,-15)]


head(dataTotal)

entrezgene,logFC_BC,PValue_BC,logFC_LC,PValue_LC,logFC_MUO,PValue_MUO,logFC_RA,PValue_RA,logFC_DCM,PValue_DCM,logFC_ICM,PValue_ICM
7105,-0.19677472,0.733608386,1.2109203,0.030285815,-0.19539969,0.075489177,-0.30046001,0.01865526,-0.05590022,0.7977584,-0.30891354,0.23532135
64102,-8.22242726,1.14e-05,1.6015344,0.171398244,1.80412948,1.62e-06,-0.04860816,0.8196471,2.92402119,9.558365e-05,1.84791635,0.01961827
8813,0.06848394,0.833040252,0.4674649,0.294138,-0.40356445,0.00015691,-0.34473391,0.02328536,-0.09280415,0.7007694,-0.07775583,0.73349588
57147,0.88768039,0.055529182,0.0544462,0.891635248,-0.15488979,0.170040143,0.08311677,0.4803042,0.12344168,0.6039372,0.36591905,0.14264643
55732,1.2717589,0.019866978,1.4376887,0.004173094,0.08653913,0.013563482,-0.21473569,0.03695494,-0.07692294,0.8106303,0.14191353,0.64189891
2268,-3.09305336,0.000177035,-2.9959429,4.94e-06,0.55657959,9.11e-08,1.34143362,6.749178e-06,-0.14915484,0.7410703,-0.19886117,0.59028843
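The merge above drops unwanted columns by position, which silently breaks if an input file gains or loses a column. As a hedged alternative, the sketch below keeps columns by name and joins all datasets in one pass. It uses base R `merge(..., all.x = TRUE)`, which performs the same kind of left join as `left_join`, so that it runs without extra packages. The helper `keepStats`, the toy objects and their values are assumptions made for this illustration.

```r
# Toy stand-ins for the ID list and two cleaned datasets (hypothetical values)
toyIds <- data.frame(entrezgene = c(7105, 64102, 8813))
toyA <- data.frame(ensembl_gene_id = c("ENSG00000000003", "ENSG00000000005"),
                   logFC_A = c(-0.2, -8.2),
                   PValue_A = c(0.73, 1e-05),
                   entrezgene = c(7105, 64102))
toyB <- data.frame(logFC_B = c(1.6, 0.47),
                   PValue_B = c(0.17, 0.29),
                   entrezgene = c(64102, 8813))

# keep only the join key and the statistics columns, selected by name
keepStats <- function(df) {
  statCols <- grep("^(logFC|PValue)_", names(df), value = TRUE)
  df[, c("entrezgene", statCols), drop = FALSE]
}

# left-join every dataset onto the full ID list, one after the other
toyTotal <- Reduce(
  function(acc, df) merge(acc, keepStats(df), by = "entrezgene", all.x = TRUE),
  list(toyA, toyB),
  toyIds
)

# merge() does not promise the input row order, so sort explicitly
toyTotal <- toyTotal[order(toyTotal$entrezgene), ]
toyTotal
```

Genes that are missing from a dataset get `NA` in that dataset's columns, exactly as with the positional version, but no column indices need to be maintained.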


We have merged the data files! All that remains is to add HGNC symbols wherever possible and save the file.

In [14]:
# add hgnc symbol to dataTotal
# (with later Ensembl releases this attribute/filter may be named 'entrezgene_id' instead)
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

# query only the vector of entrezgene IDs, not the whole data frame
my.genes1 <- getBM(
  attributes = c('entrezgene', 'hgnc_symbol'),
  filters = 'entrezgene',
  values = dataTotal$entrezgene,
  mart = ensembl
)

# clean merged data file: move hgnc_symbol (column 14) next to entrezgene
dataTotal <- dataTotal %>% left_join(my.genes1, by = "entrezgene")
dataTotal <- dataTotal[,c(1,14,2:13)]
dataTotal <- unique(dataTotal)

head(dataTotal)


entrezgene,hgnc_symbol,logFC_BC,PValue_BC,logFC_LC,PValue_LC,logFC_MUO,PValue_MUO,logFC_RA,PValue_RA,logFC_DCM,PValue_DCM,logFC_ICM,PValue_ICM
7105,TSPAN6,-0.19677472,0.733608386,1.2109203,0.030285815,-0.19539969,0.075489177,-0.30046001,0.01865526,-0.05590022,0.7977584,-0.30891354,0.23532135
64102,TNMD,-8.22242726,1.14e-05,1.6015344,0.171398244,1.80412948,1.62e-06,-0.04860816,0.8196471,2.92402119,9.558365e-05,1.84791635,0.01961827
8813,DPM1,0.06848394,0.833040252,0.4674649,0.294138,-0.40356445,0.00015691,-0.34473391,0.02328536,-0.09280415,0.7007694,-0.07775583,0.73349588
57147,SCYL3,0.88768039,0.055529182,0.0544462,0.891635248,-0.15488979,0.170040143,0.08311677,0.4803042,0.12344168,0.6039372,0.36591905,0.14264643
55732,C1orf112,1.2717589,0.019866978,1.4376887,0.004173094,0.08653913,0.013563482,-0.21473569,0.03695494,-0.07692294,0.8106303,0.14191353,0.64189891
2268,FGR,-3.09305336,0.000177035,-2.9959429,4.94e-06,0.55657959,9.11e-08,1.34143362,6.749178e-06,-0.14915484,0.7410703,-0.19886117,0.59028843
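One point to check after this step: `unique()` only removes rows that are identical in every column, so an entrezgene ID that is returned with more than one symbol would still occupy several rows. The sketch below shows one possible way to detect and collapse such cases on toy data. The mapping table `toyMap` and the symbol "TNMD-ALT" are invented for the example and do not come from the real query.

```r
# Toy statistics table and a hypothetical annotation table in which
# one ID maps to two symbols ("TNMD-ALT" is a made-up symbol)
toyStats <- data.frame(entrezgene = c(7105, 64102, 8813),
                       logFC_A = c(-0.2, -8.2, 0.07))
toyMap <- data.frame(entrezgene = c(7105, 64102, 64102),
                     hgnc_symbol = c("TSPAN6", "TNMD", "TNMD-ALT"))

# a non-zero result means at least one ID occurs more than once in the mapping
anyDuplicated(toyMap$entrezgene)

# collapse multiple symbols per ID into one semicolon-separated string
collapsed <- aggregate(hgnc_symbol ~ entrezgene, data = toyMap,
                       FUN = function(s) paste(unique(s), collapse = ";"))

# left join: every gene keeps exactly one row, unmapped genes get NA
toyAnnotated <- merge(toyStats, collapsed, by = "entrezgene", all.x = TRUE)
toyAnnotated
```

Whether to collapse symbols, keep the first one, or keep all rows depends on what the downstream analysis expects; this is only one option.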


In [15]:
# save file
write.table(dataTotal, file.path(getwd(), "data-output", "merged_data_final.txt"), row.names = FALSE, col.names = TRUE, sep = "\t", quote = FALSE)
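`write.table` does not create missing directories, so the call fails if the output folder does not exist yet. A minimal sketch of a safer save with a read-back check follows. It writes to a temporary directory, and the names `outDir`, `outFile`, `toy` and `check` are assumptions for this example only.

```r
# hypothetical output location inside the session's temporary directory
outDir <- file.path(tempdir(), "data-output")
dir.create(outDir, showWarnings = FALSE, recursive = TRUE)
outFile <- file.path(outDir, "merged_toy.txt")

# toy table standing in for the merged data (hypothetical values)
toy <- data.frame(entrezgene = c(7105L, 64102L),
                  hgnc_symbol = c("TSPAN6", "TNMD"),
                  logFC_A = c(-0.2, -8.2),
                  stringsAsFactors = FALSE)

# same settings as the save step: tab-separated, header, no row names, no quotes
write.table(toy, outFile, row.names = FALSE, col.names = TRUE, sep = "\t", quote = FALSE)

# read the file back to confirm it round-trips
check <- read.table(outFile, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
all.equal(toy, check)
```

Because the file is written with `quote = FALSE`, this round trip relies on no field containing a tab character.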

In [16]:
# information about session
sessionInfo()

R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)

Matrix products: default

locale:
[1] LC_COLLATE=Dutch_Netherlands.1252  LC_CTYPE=Dutch_Netherlands.1252   
[3] LC_MONETARY=Dutch_Netherlands.1252 LC_NUMERIC=C                      
[5] LC_TIME=Dutch_Netherlands.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] biomaRt_2.38.0       dplyr_0.7.8          RevoUtils_11.0.1    
[4] RevoUtilsMath_11.0.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0           pillar_1.3.1         compiler_3.5.1      
 [4] bindr_0.1.1          prettyunits_1.0.2    progress_1.2.0      
 [7] base64enc_0.1-3      bitops_1.0-6         tools_3.5.1         
[10] digest_0.6.18        uuid_0.1-2           bit_1.1-14          
[13] jsonlite_1.6         evaluate_0.12        RSQLite_2.1.1       
[16] memoise_1.1.0        tibble_2.0.1         pkgconfig_2.0.2