# Cleaning gene expression data
In this R notebook we clean the pre-processed gene expression data files that were obtained from GEO and TCGA. The notebooks used for pre-processing can be found on [GitHub](https://github.com/macsbio/inflammation_networks/tree/master/Jupyter-DataPreProcessing). A separate notebook in the same repository merges the cleaned gene expression datasets.

In [16]:
# check wd
getwd()

In [17]:
# load libraries
library(dplyr)
library(biomaRt)

First, we read the gene expression data files for all diseases. In this case we have 8 data files; the cell below loads data1 to data5. These data files were pre-processed as mentioned above, using R or [ArrayAnalysis](http://www.arrayanalysis.org/).

In [20]:
# load data
data1 <- read.table(file.path(getwd(), "Datasets", "breast-cancer_stats60.txt"), header = TRUE, sep = "\t")
data2 <- read.table(file.path(getwd(), "Datasets", "NAFLD_stats.txt"), header = TRUE, sep = "\t")
data3 <- read.table(file.path(getwd(), "Datasets", "MUO-Lean_stats.txt"), header = TRUE, sep = "\t")
data4 <- read.table(file.path(getwd(), "Datasets", "RA-control_GSE55235_stats.txt"), header = TRUE, sep = "\t")
data5 <- read.table(file.path(getwd(), "Datasets", "DCM_stats.txt"), header = TRUE, sep = "\t")

# view data1 as an example of what the data looks like
head(data1)

ENSG_ID,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj,hgnc_symbol
ENSG00000000003,3568.1353,-0.6346625,0.13853996,-4.581079,4.625831e-06,1.171792e-05,TSPAN6
ENSG00000000005,484.2763,-4.0222475,0.3503673,-11.480088,1.6611089999999998e-30,3.748896e-29,TNMD
ENSG00000000419,1938.3714,0.3567976,0.08902614,4.007784,6.129101e-05,0.0001351941,DPM1
ENSG00000000457,1670.1705,0.303383,0.07960962,3.810884,0.0001384706,0.0002921716,SCYL3
ENSG00000000460,519.0613,1.0983553,0.09921183,11.07081,1.738228e-28,3.342777e-27,C1orf112
ENSG00000000938,751.0182,-0.3356203,0.1928038,-1.740735,0.08173009,0.1107454,FGR


The column names of data1 are not the ones we want, and some columns will not be used (baseMean, lfcSE, stat and padj), so we remove those and rename the rest.
We apply the same kind of clean-up to all datasets.

In [21]:
# clean up data
data1 <- data1[,c(-2,-4,-5,-7)] 
colnames(data1)[c(1,2,3)] <- c("ensembl_gene_id", "logFC_BC", "PValue_BC")
data1 <- data1[,c(1,4,2,3)]

data2 <- data2[,c(-1, -3, -4, -5, -7, -8)]
data2 <- data2[,c(3,4,1,2)]
colnames(data2)[c(1:4)] <- c("hgnc_symbol", "entrezgene", "logFC_NAFLD", "PValue_NAFLD")
data2 <- data2[!grepl("///", data2$hgnc_symbol),]

data3 <- data3[,c(-1,-3,-5)]
data3 <- data3[,c(3,1,2)]
data3 <- data3[!(data3$hgnc_symbol == "---"),]
colnames(data3)[c(2,3)] <- c("logFC_MUO", "PValue_MUO")

data4 <- data4[,c(-3,-4,-5,-7,-8)]
colnames(data4)[c(1,2,3)] <- c("ensembl_gene_id", "logFC_RA","PValue_RA")

data5 <- data5[,c(-3,-4,-6)]
colnames(data5)[c(1,2,3,4)] <- c("ensembl_gene_id", "logFC_DCM", "PValue_DCM", "hgnc_symbol")

# view data1 as an example of what the cleaned data looks like
head(data1)

ensembl_gene_id,hgnc_symbol,logFC_BC,PValue_BC
ENSG00000000003,TSPAN6,-0.6346625,4.625831e-06
ENSG00000000005,TNMD,-4.0222475,1.6611089999999998e-30
ENSG00000000419,DPM1,0.3567976,6.129101e-05
ENSG00000000457,SCYL3,0.303383,0.0001384706
ENSG00000000460,C1orf112,1.0983553,1.738228e-28
ENSG00000000938,FGR,-0.3356203,0.08173009
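
The clean-up above selects columns by position, which silently picks the wrong columns if an input file changes its layout. As a minimal sketch, the same result for data1 could be obtained by selecting columns by name. The data frame `toy` below is a hypothetical two-row stand-in built from the values printed earlier, not the real file.

```r
# Toy stand-in for the first two rows of data1 (values copied from the output above)
toy <- data.frame(
  ENSG_ID        = c("ENSG00000000003", "ENSG00000000005"),
  baseMean       = c(3568.1353, 484.2763),
  log2FoldChange = c(-0.6346625, -4.0222475),
  lfcSE          = c(0.13853996, 0.3503673),
  stat           = c(-4.581079, -11.480088),
  pvalue         = c(4.625831e-06, 1.661109e-30),
  padj           = c(1.171792e-05, 3.748896e-29),
  hgnc_symbol    = c("TSPAN6", "TNMD"),
  stringsAsFactors = FALSE
)

# Keep and order the wanted columns by name instead of by position
keep <- c("ENSG_ID", "hgnc_symbol", "log2FoldChange", "pvalue")
toy_clean <- toy[, keep]

# Rename to the names used in the rest of this notebook
colnames(toy_clean) <- c("ensembl_gene_id", "hgnc_symbol", "logFC_BC", "PValue_BC")
toy_clean
```

Selecting by name fails loudly with an "undefined columns selected" error when a column is missing, which is usually easier to debug than a shifted index.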


For example, data1 contains Ensembl gene IDs, while we would like to have entrezgene IDs. We therefore map the Ensembl IDs to entrezgene IDs.

## Datasets 1, 4 and 5 have Ensembl IDs, dataset 3 has HGNC symbols, and dataset 2 already has entrezgene IDs

In [22]:
# entrezgene IDs from ensembl gene IDs
# this chunk is repeated for every dataset that has ensembl IDs
# note: newer Ensembl releases name this attribute 'entrezgene_id'
ensembl <- useEnsembl("ensembl", dataset = "hsapiens_gene_ensembl", mirror = "useast")
mygenes <- data1

my.genes1 <- getBM(
  attributes = c('entrezgene', 'ensembl_gene_id'),
  filters = 'ensembl_gene_id',
  values = mygenes$ensembl_gene_id,  # pass the ID column, not the whole data frame
  mart = ensembl
)

data1 <- data1 %>% left_join(my.genes1, by = "ensembl_gene_id")

head(data1)

"Column `ensembl_gene_id` joining factor and character vector, coercing into character vector"

ensembl_gene_id,hgnc_symbol,logFC_BC,PValue_BC,entrezgene
ENSG00000000003,TSPAN6,-0.6346625,4.625831e-06,7105
ENSG00000000005,TNMD,-4.0222475,1.6611089999999998e-30,64102
ENSG00000000419,DPM1,0.3567976,6.129101e-05,8813
ENSG00000000457,SCYL3,0.303383,0.0001384706,57147
ENSG00000000460,C1orf112,1.0983553,1.738228e-28,55732
ENSG00000000938,FGR,-0.3356203,0.08173009,2268
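
The warning printed above appears because the ID column read by `read.table()` is a factor (the default before R 4.0), while the mapping table has a character column. A minimal sketch of avoiding it is to convert the key to character before joining. The example uses base R `merge()` on toy data so it runs without a BioMart connection; `expr` and `mapping` are hypothetical stand-ins for data1 and the `getBM()` result.

```r
# Toy expression table: the key column is a factor, as read.table() produced by default in R < 4.0
expr <- data.frame(
  ensembl_gene_id = factor(c("ENSG00000000003", "ENSG00000000005")),
  logFC_BC        = c(-0.6346625, -4.0222475)
)

# Toy mapping table standing in for the getBM() result (character key)
mapping <- data.frame(
  ensembl_gene_id = c("ENSG00000000003", "ENSG00000000005"),
  entrezgene      = c(7105L, 64102L),
  stringsAsFactors = FALSE
)

# Convert the key explicitly so both sides of the join have the same type
expr$ensembl_gene_id <- as.character(expr$ensembl_gene_id)

# Left join: keep every row of expr, add entrezgene where a match exists
joined <- merge(expr, mapping, by = "ensembl_gene_id", all.x = TRUE, sort = FALSE)
joined
```

Reading the files with `stringsAsFactors = FALSE` in `read.table()` would have the same effect for every text column at once.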


In [7]:
# entrezgene IDs from ensembl gene IDs
# same mapping as for data1, now applied to data4
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
mygenes <- data4

my.genes1 <- getBM(
  attributes = c('entrezgene', 'ensembl_gene_id'),
  filters = 'ensembl_gene_id',
  values = mygenes$ensembl_gene_id,  # pass the ID column, not the whole data frame
  mart = ensembl
)

data4 <- data4 %>% left_join(my.genes1, by = "ensembl_gene_id")

head(data4)

"Column `ensembl_gene_id` joining factor and character vector, coercing into character vector"

ensembl_gene_id,logFC_RA,PValue_RA,entrezgene
ENSG00000223865,2.432572,9.646176000000001e-17,3115.0
ENSG00000211952,7.955072,1.875752e-15,
ENSG00000132465,8.390554,1.98938e-15,3512.0
ENSG00000111801,2.761847,2.238005e-15,10384.0
ENSG00000110777,6.009361,2.591924e-15,5450.0
ENSG00000242574,3.658038,3.447606e-15,3109.0
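
The output above shows that not every Ensembl ID has an entrezgene ID: the second row is empty after the left join. A small sketch of how the unmapped rows could be counted, and optionally dropped, is given below. Whether to drop them is an assumption made for illustration; the notebook itself keeps these rows. `toy4` is a hypothetical stand-in using the first three rows printed above.

```r
# Toy stand-in for the first rows of data4 after the join (values from the output above)
toy4 <- data.frame(
  ensembl_gene_id = c("ENSG00000223865", "ENSG00000211952", "ENSG00000132465"),
  logFC_RA        = c(2.432572, 7.955072, 8.390554),
  entrezgene      = c(3115, NA, 3512),
  stringsAsFactors = FALSE
)

# Count rows for which no entrezgene ID was found
n_unmapped <- sum(is.na(toy4$entrezgene))
n_unmapped

# Optional: keep only the rows that could be mapped
mapped_only <- toy4[!is.na(toy4$entrezgene), ]
mapped_only
```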


In [8]:
# entrezgene IDs from ensembl gene IDs
# same mapping as for data1, now applied to data5
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
mygenes <- data5

my.genes1 <- getBM(
  attributes = c('entrezgene', 'ensembl_gene_id'),
  filters = 'ensembl_gene_id',
  values = mygenes$ensembl_gene_id,  # pass the ID column, not the whole data frame
  mart = ensembl
)

data5 <- data5 %>% left_join(my.genes1, by = "ensembl_gene_id")

head(data5)

"Column `ensembl_gene_id` joining factor and character vector, coercing into character vector"

ensembl_gene_id,logFC_DCM,PValue_DCM,hgnc_symbol,entrezgene
ENSG00000000003,-0.05590022,0.7977584,TSPAN6,7105
ENSG00000000005,2.92402119,9.558365e-05,TNMD,64102
ENSG00000000419,-0.09280415,0.7007694,DPM1,8813
ENSG00000000457,0.12344168,0.6039372,SCYL3,57147
ENSG00000000460,-0.07692294,0.8106303,C1orf112,55732
ENSG00000000938,-0.14915484,0.7410703,FGR,2268


In [9]:
# entrezgene IDs from ensembl gene IDs
# same mapping as for data1, now applied to data6
# note: data6 (ICM) is not among the files loaded in the "load data" cell above
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
mygenes <- data6

my.genes1 <- getBM(
  attributes = c('entrezgene', 'ensembl_gene_id'),
  filters = 'ensembl_gene_id',
  values = mygenes$ensembl_gene_id,  # pass the ID column, not the whole data frame
  mart = ensembl
)

data6 <- data6 %>% left_join(my.genes1, by = "ensembl_gene_id")

head(data6)

"Column `ensembl_gene_id` joining factor and character vector, coercing into character vector"

ensembl_gene_id,logFC_ICM,PValue_ICM,hgnc_symbol,entrezgene
ENSG00000000003,-0.30891354,0.23532135,TSPAN6,7105
ENSG00000000005,1.84791635,0.01961827,TNMD,64102
ENSG00000000419,-0.07775583,0.73349588,DPM1,8813
ENSG00000000457,0.36591905,0.14264643,SCYL3,57147
ENSG00000000460,0.14191353,0.64189891,C1orf112,55732
ENSG00000000938,-0.19886117,0.59028843,FGR,2268
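
The mapping chunk is repeated almost verbatim for each dataset, so it could be wrapped in a small function. The sketch below is one possible way to do that. `add_entrez` is a hypothetical helper that is not part of the original notebook, and it takes the mapping table as an argument so the example runs without a BioMart connection; in practice that table would be the result of `getBM()`. Keeping only the first entrezgene ID per gene is an assumption made here to keep the row count unchanged.

```r
# Hypothetical helper (not in the original notebook): add an entrezgene column
# to `df` using a two-column mapping table such as the one returned by getBM().
add_entrez <- function(df, mapping, key = "ensembl_gene_id") {
  # Compare keys as character to avoid factor/character mismatches
  df[[key]] <- as.character(df[[key]])
  mapping[[key]] <- as.character(mapping[[key]])

  # Assumption: keep the first entrezgene ID per key so no rows are duplicated
  mapping <- mapping[!duplicated(mapping[[key]]), , drop = FALSE]

  # match() keeps the row order of df and yields NA for unmapped keys
  idx <- match(df[[key]], mapping[[key]])
  df$entrezgene <- mapping$entrezgene[idx]
  df
}

# Toy data; the logFC values are illustrative only
df <- data.frame(
  ensembl_gene_id = factor(c("ENSG00000000419", "ENSG00000000003", "ENSG00000211952")),
  logFC           = c(0.36, -0.63, 7.96)
)

# Toy mapping; the second ID for ENSG00000000003 (99999) is made up to show a one-to-many case
mapping <- data.frame(
  ensembl_gene_id = c("ENSG00000000003", "ENSG00000000003", "ENSG00000000419"),
  entrezgene      = c(7105L, 99999L, 8813L),
  stringsAsFactors = FALSE
)

res <- add_entrez(df, mapping)
res
```

With such a helper, each dataset would need only one call once the mapping table has been retrieved, and the same function works for HGNC symbols by passing `key = "hgnc_symbol"`.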


We now have entrezgene IDs for all datasets that contained Ensembl IDs, and dataset 2 already had them. Only dataset 3, which contains HGNC symbols, still has to be mapped to entrezgene IDs.

In [10]:
# entrezgene IDs from hgnc symbols
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
mygenes <- data3

my.genes1 <- getBM(
  attributes = c('entrezgene', 'hgnc_symbol'),
  filters = 'hgnc_symbol',
  values = mygenes$hgnc_symbol,  # pass the symbol column, not the whole data frame
  mart = ensembl
)

data3 <- data3 %>% left_join(my.genes1, by = "hgnc_symbol")

head(data3)

"Column `hgnc_symbol` joining factor and character vector, coercing into character vector"

hgnc_symbol,logFC_MUO,PValue_MUO,entrezgene
PLA2G7,3.6562151,7.99e-12,7941
ETFA,-0.6770891,4.31e-11,2108
LYZ,2.3115932,5.25e-11,4069
ALPK3,-2.3067365,5.93e-11,57538
HLA-DMB,0.6703927,1.07e-10,3109
ARRB2,1.30113,1.3e-10,409


Save these files so we have access to our cleaned data files.

In [11]:
# save files
write.table(data1, file.path(getwd(), "Datasets", "Clean", "breast-cancer_stats60_clean.txt"), row.names = FALSE, sep = "\t", quote = FALSE)
write.table(data2, file.path(getwd(), "Datasets", "Clean", "NAFLD_stats_clean.txt"), row.names = FALSE, sep = "\t", quote = FALSE)
write.table(data3, file.path(getwd(), "Datasets", "Clean", "MUO-Lean_stats_clean.txt"), row.names = FALSE, sep = "\t", quote = FALSE)
write.table(data4, file.path(getwd(), "Datasets", "Clean", "RA-control_GSE55235_stats_clean.txt"), row.names = FALSE, sep = "\t", quote = FALSE)
write.table(data5, file.path(getwd(), "Datasets", "Clean", "DCM_clean.txt"), row.names = FALSE, sep = "\t", quote = FALSE)
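
`write.table()` does not create missing directories, so the calls above fail if `Datasets/Clean` does not exist yet. The sketch below shows one way to create the folder first and to check that a written file reads back unchanged. It writes to a temporary directory rather than the project folder, and the file name `toy_clean.txt` is hypothetical.

```r
# Use a temporary location so the example does not touch the real project folder
out_dir <- file.path(tempdir(), "Datasets", "Clean")

# Create the output folder (and its parents) if it is missing
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

# Toy stand-in for a cleaned dataset (values from the data5 output above)
toy <- data.frame(
  ensembl_gene_id = c("ENSG00000000003", "ENSG00000000005"),
  logFC_DCM       = c(-0.05590022, 2.92402119),
  stringsAsFactors = FALSE
)

# Write with the same settings as the notebook, then read the file back
out_file <- file.path(out_dir, "toy_clean.txt")
write.table(toy, out_file, row.names = FALSE, sep = "\t", quote = FALSE)
back <- read.table(out_file, header = TRUE, sep = "\t", stringsAsFactors = FALSE)

# Check that the round trip preserved the table
isTRUE(all.equal(toy, back))
```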

We have now cleaned our gene expression data files. In the next Jupyter notebook, which can be found in this repository, we merge these data files so that we have a single file for all our gene expression data.

In [5]:
# information about session
sessionInfo()

R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)

Matrix products: default

locale:
[1] LC_COLLATE=Dutch_Netherlands.1252  LC_CTYPE=Dutch_Netherlands.1252   
[3] LC_MONETARY=Dutch_Netherlands.1252 LC_NUMERIC=C                      
[5] LC_TIME=Dutch_Netherlands.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] biomaRt_2.38.0       dplyr_0.7.8          RevoUtils_11.0.1    
[4] RevoUtilsMath_11.0.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0           pillar_1.3.1         compiler_3.5.1      
 [4] bindr_0.1.1          prettyunits_1.0.2    progress_1.2.0      
 [7] base64enc_0.1-3      bitops_1.0-6         tools_3.5.1         
[10] digest_0.6.18        uuid_0.1-2           bit_1.1-14          
[13] jsonlite_1.6         evaluate_0.12        RSQLite_2.1.1       
[16] memoise_1.1.0        tibble_2.0.1         pkgconfig_2.0.2