# Cleaning gene expression data
In this R script we clean the gene expression data files that were obtained from GEO and TCGA and then pre-processed. Notebooks documenting the pre-processing can be found on [GitHub](https://github.com/macsbio/inflammation_networks/tree/master/Jupyter-DataPreProcessing). A separate notebook in the same repository merges the cleaned gene expression datasets.

## The following step only works in RStudio. If you are working in another environment, set the working directory manually and check that it is correct.

In [1]:
# set wd to where script file is saved
setwd(dirname(rstudioapi::callFun("getActiveDocumentContext")$path))

ERROR: Error: RStudio not running


In [1]:
# check wd
getwd()
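
Outside RStudio the `setwd()` call above stops with the `RStudio not running` error. A minimal sketch of a fallback, assuming the `rstudioapi` package may or may not be installed; `script_dir()` is a hypothetical helper name that is not part of the original script:

```r
# Hypothetical helper: return the script's folder when running in RStudio,
# otherwise fall back to the current working directory.
script_dir <- function() {
  if (requireNamespace("rstudioapi", quietly = TRUE) && rstudioapi::isAvailable()) {
    path <- rstudioapi::getActiveDocumentContext()$path
    # An unsaved document has an empty path, so only use it when it is set
    if (nzchar(path)) {
      return(dirname(path))
    }
  }
  getwd()
}

setwd(script_dir())
getwd()
```

In a Jupyter session this leaves the working directory unchanged, so it still has to point at the folder that contains `Datasets`.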

In [2]:
# load libraries
library(dplyr)
library(biomaRt)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union



First we read the gene expression data files for all diseases. In this case we have 7 data files. These data files were pre-processed as mentioned above, using R or [ArrayAnalysis](http://www.arrayanalysis.org/).

In [3]:
# load data
data1 <- read.table(file.path(getwd(), "Datasets", "breast-cancer_stats.txt"), header = T, sep = "\t")
data2 <- read.table(file.path(getwd(), "Datasets", "lung-cancer-stats.txt"), header = T, sep = "\t")
data3 <- read.table(file.path(getwd(), "Datasets", "MUO-Lean_stats.txt"), header = T, sep = "\t")
data4 <- read.table(file.path(getwd(), "Datasets", "RA-control_GSE55235_stats.txt"), header = T, sep = "\t")
data5 <- read.table(file.path(getwd(), "Datasets", "RETT-control_FC_stats.txt"), header = T, sep = "\t")
data6 <- read.table(file.path(getwd(), "Datasets", "RETT-control_TC_stats.txt"), header = T, sep = "\t")
data7 <- read.table(file.path(getwd(), "Datasets", "SLE_stats.txt"), header = T, sep = "\t")

# view data1 as an example of what the data look like
head(data1)

GeneID,GeneName,logFC,logCPM,F,PValue,FDR
ENSG00000000003,TSPAN6,-0.19677472,5.878281,0.12159179,0.733608386,0.82208427
ENSG00000000005,TNMD,-8.22242726,3.044306,53.86632165,1.14e-05,0.00081797
ENSG00000000419,DPM1,0.06848394,4.849708,0.04650132,0.833040252,0.892648753
ENSG00000000457,SCYL3,0.88768039,4.612369,4.53550697,0.055529182,0.137126409
ENSG00000000460,C1orf112,1.2717589,2.944843,7.29990937,0.019866978,0.068869167
ENSG00000000938,FGR,-3.09305336,3.656325,29.46189716,0.000177035,0.003730564
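
In R versions before 4.0, `read.table` converts text columns to factors by default, which is the likely source of the "joining factor and character vector" messages printed by the joins later in this notebook. A small sketch, assuming a temporary toy file built from the first two rows of the preview above, shows how `stringsAsFactors = FALSE` keeps the ID column as character:

```r
# Write a two-row toy file to a temporary location (rows taken from the data1 preview)
toy_file <- tempfile(fileext = ".txt")
writeLines(
  c("GeneID\tGeneName\tlogFC",
    "ENSG00000000003\tTSPAN6\t-0.19677472",
    "ENSG00000000005\tTNMD\t-8.22242726"),
  toy_file
)

# Read it the same way as above, but keep text columns as character
toy <- read.table(toy_file, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
str(toy)
```

The same argument could be added to each of the seven `read.table` calls; from R 4.0 onwards it is already the default.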


The column names of data1 are not the ones we want, and some columns (GeneName, logCPM, F and FDR) will not be used, so we remove them.
We apply the same kind of clean-up to all datasets.

In [4]:
# clean up data
data1 <- data1[,c(-2,-4,-5,-7)] 
colnames(data1)[c(1,2,3)] <- c("ensembl_gene_id", "logFC_BC", "PValue_BC")

data2 <- data2[,-6]
colnames(data2)[c(3,4,5)] <- c("ensembl_gene_id", "logFC_LC", "PValue_LC")

data3 <- data3[,c(-1,-3,-5)]
data3 <- data3[,c(3,1,2)]
data3 <- data3[!(data3$hgnc_symbol == "---"),]
colnames(data3)[c(2,3)] <- c("logFC_MUO", "PValue_MUO")

data4 <- data4[,c(-3,-4,-5,-7,-8)]
colnames(data4)[c(1,2,3)] <- c("ensembl_gene_id", "logFC_RA","PValue_RA")

data5 <- data5[,c(-4,-6)]
colnames(data5)[c(1,2,3,4)] <- c("ensembl_gene_id", "hgnc_symbol", "logFC_RETT_FC", "PValue_RETT_FC")

data6 <- data6[,c(-4,-6)]
colnames(data6)[c(1,2,3,4)] <- c("ensembl_gene_id", "hgnc_symbol", "logFC_RETT_TC", "PValue_RETT_TC")

data7 <- data7[,c(-3,-4,-6)]
colnames(data7)[c(1,2,3)] <- c("entrezgene", "logFC_SLE", "PValue_SLE")

# view data1 as an example of what the data look like
head(data1)

ensembl_gene_id,logFC_BC,PValue_BC
ENSG00000000003,-0.19677472,0.733608386
ENSG00000000005,-8.22242726,1.14e-05
ENSG00000000419,0.06848394,0.833040252
ENSG00000000457,0.88768039,0.055529182
ENSG00000000460,1.2717589,0.019866978
ENSG00000000938,-3.09305336,0.000177035
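
The clean-up cell selects and renames columns by position, which silently breaks if an input file changes its column order. As an alternative sketch (not the method used in this notebook), the same result for data1 can be written by column name with `dplyr::select`. The toy data frame below copies the first two rows of the data1 preview:

```r
library(dplyr)

# Toy copy of the first rows of data1 as shown in the preview
toy <- data.frame(
  GeneID = c("ENSG00000000003", "ENSG00000000005"),
  GeneName = c("TSPAN6", "TNMD"),
  logFC = c(-0.19677472, -8.22242726),
  logCPM = c(5.878281, 3.044306),
  F = c(0.12159179, 53.86632165),
  PValue = c(0.733608386, 1.14e-05),
  FDR = c(0.82208427, 0.00081797),
  stringsAsFactors = FALSE
)

# Keep three columns and rename them in one step (new_name = old_name)
toy_clean <- toy %>%
  select(ensembl_gene_id = GeneID, logFC_BC = logFC, PValue_BC = PValue)

colnames(toy_clean)
```

Selecting by name also documents which source column ends up under which new name.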


For example, data1 contains Ensembl gene IDs, while we would like to have Entrez Gene IDs. We therefore map the Ensembl IDs to Entrez Gene IDs.

## Datasets 1, 4, 5 and 6 have Ensembl IDs, dataset 3 has HGNC symbols, and datasets 2 and 7 already have Entrez Gene IDs!

In [5]:
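# entrezgene IDs from ensembl gene IDs
# because data 1, 4, 5 and 6 have ensembl IDs, we have to perform this chunk of code 4 times.
# note: newer Ensembl releases name the 'entrezgene' attribute 'entrezgene_id'
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# pass only the ID column to getBM, not the whole data frame
mygenes <- as.character(data1$ensembl_gene_id)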

my.genes1 <- getBM(
  attributes = c('entrezgene', 'ensembl_gene_id'), 
  filters = 'ensembl_gene_id',
  values = mygenes,
  mart = ensembl
)

data1 <- data1 %>% left_join(my.genes1, by = "ensembl_gene_id")

head(data1)

"Column `ensembl_gene_id` joining factor and character vector, coercing into character vector"

ensembl_gene_id,logFC_BC,PValue_BC,entrezgene
ENSG00000000003,-0.19677472,0.733608386,7105
ENSG00000000005,-8.22242726,1.14e-05,64102
ENSG00000000419,0.06848394,0.833040252,8813
ENSG00000000457,0.88768039,0.055529182,57147
ENSG00000000460,1.2717589,0.019866978,55732
ENSG00000000938,-3.09305336,0.000177035,2268


In [6]:
# entrezgene IDs from ensembl gene IDs
# because data 1, 4, 5 and 6 have ensembl IDs, we have to perform this chunk of code 4 times.
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# pass only the ID column to getBM, not the whole data frame
mygenes <- as.character(data4$ensembl_gene_id)

my.genes1 <- getBM(
  attributes = c('entrezgene', 'ensembl_gene_id'), 
  filters = 'ensembl_gene_id',
  values = mygenes,
  mart = ensembl
)

data4 <- data4 %>% left_join(my.genes1, by = "ensembl_gene_id")

head(data4)

"Column `ensembl_gene_id` joining factor and character vector, coercing into character vector"

ensembl_gene_id,logFC_RA235,PValue_RA235,entrezgene
ENSG00000223865,2.432572,9.646176000000001e-17,3115.0
ENSG00000211952,7.955072,1.875752e-15,
ENSG00000132465,8.390554,1.98938e-15,3512.0
ENSG00000111801,2.761847,2.238005e-15,10384.0
ENSG00000110777,6.009361,2.591924e-15,5450.0
ENSG00000242574,3.658038,3.447606e-15,3109.0


In [7]:
# entrezgene IDs from ensembl gene IDs
# because data 1, 4, 5 and 6 have ensembl IDs, we have to perform this chunk of code 4 times.
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# pass only the ID column to getBM, not the whole data frame
mygenes <- as.character(data5$ensembl_gene_id)

my.genes1 <- getBM(
  attributes = c('entrezgene', 'ensembl_gene_id'), 
  filters = 'ensembl_gene_id',
  values = mygenes,
  mart = ensembl
)

data5 <- data5 %>% left_join(my.genes1, by = "ensembl_gene_id")

head(data5)

"Column `ensembl_gene_id` joining factor and character vector, coercing into character vector"

ensembl_gene_id,hgnc_symbol,logFC_RETT_FC,PValue_RETT_FC,entrezgene
ENSG00000184254,ALDH1A3,1.4808993,5.82e-06,220
ENSG00000184828,ZBTB7C,0.9939575,8.11e-06,201501
ENSG00000112799,LY86,-0.7647789,1.16e-05,9450
ENSG00000168329,CX3CR1,-1.5765757,1.56e-05,1524
ENSG00000119535,CSF3R,-0.8475975,1.88e-05,1441
ENSG00000232859,LYRM9,1.0868543,1.88e-05,201229


In [8]:
# entrezgene IDs from ensembl gene IDs
# because data 1, 4, 5 and 6 have ensembl IDs, we have to perform this chunk of code 4 times.
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# pass only the ID column to getBM, not the whole data frame
mygenes <- as.character(data6$ensembl_gene_id)

my.genes1 <- getBM(
  attributes = c('entrezgene', 'ensembl_gene_id'), 
  filters = 'ensembl_gene_id',
  values = mygenes,
  mart = ensembl
)

data6 <- data6 %>% left_join(my.genes1, by = "ensembl_gene_id")

head(data6)

"Column `ensembl_gene_id` joining factor and character vector, coercing into character vector"

ensembl_gene_id,hgnc_symbol,logFC_RETT_TC,PValue_RETT_TC,entrezgene
ENSG00000168329,CX3CR1,-2.063534,7.22e-07,1524
ENSG00000141750,STAC2,1.2639781,1.6e-06,342667
ENSG00000165025,SYK,-0.8430392,2.35e-06,6850
ENSG00000112799,LY86,-0.8457504,3.78e-06,9450
ENSG00000197943,PLCG2,-0.7966658,4.83e-06,5336
ENSG00000242574,HLA-DMB,-1.2366369,1.18e-05,3109
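
The four mapping cells above differ only in the dataset they join, so the join step could be wrapped in a function. The sketch below is one possible shape, not the notebook's own code: `add_entrez()` is a hypothetical helper, and a small hand-made mapping table stands in for the `getBM` result so that the example runs without a connection to Ensembl. The IDs are taken from the data4 output, where ENSG00000211952 has no Entrez ID.

```r
library(dplyr)

# Hypothetical helper: attach Entrez IDs from an ID-mapping table.
# Both key columns are converted to character first, which avoids the
# factor/character coercion message seen in the outputs above.
add_entrez <- function(data, mapping, key = "ensembl_gene_id") {
  data[[key]] <- as.character(data[[key]])
  mapping[[key]] <- as.character(mapping[[key]])
  left_join(data, mapping, by = key)
}

# Toy inputs using rows shown in the data4 output
toy_data <- data.frame(
  ensembl_gene_id = c("ENSG00000223865", "ENSG00000211952", "ENSG00000132465"),
  logFC_RA = c(2.432572, 7.955072, 8.390554),
  stringsAsFactors = FALSE
)
# Stand-in for the getBM result (assumption: same two columns as above)
toy_map <- data.frame(
  entrezgene = c(3115, 3512),
  ensembl_gene_id = c("ENSG00000223865", "ENSG00000132465"),
  stringsAsFactors = FALSE
)

toy_joined <- add_entrez(toy_data, toy_map)

# Genes without a mapping keep their row and get NA in entrezgene
sum(is.na(toy_joined$entrezgene))
```

Because `left_join` keeps every row of the expression data, unmapped genes stay in the table with a missing `entrezgene` value and can be filtered later if needed.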


We now have Entrez Gene IDs for datasets 1, 2, 4, 5, 6 and 7. Only dataset 3 still has to be mapped to Entrez Gene IDs.

In [9]:
# entrezgene IDs from hgnc symbols
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# pass only the symbol column to getBM, not the whole data frame
mygenes <- as.character(data3$hgnc_symbol)

my.genes1 <- getBM(
  attributes = c('entrezgene', 'hgnc_symbol'), 
  filters = 'hgnc_symbol',
  values = mygenes,
  mart = ensembl
)

data3 <- data3 %>% left_join(my.genes1, by = "hgnc_symbol")

head(data3)

"Column `hgnc_symbol` joining factor and character vector, coercing into character vector"

hgnc_symbol,logFC_MUO,PValue_MUO,entrezgene
PLA2G7,3.6562151,7.99e-12,7941
ETFA,-0.6770891,4.31e-11,2108
LYZ,2.3115932,5.25e-11,4069
ALPK3,-2.3067365,5.93e-11,57538
HLA-DMB,0.6703927,1.07e-10,3109
ARRB2,1.30113,1.3e-10,409
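
A `left_join` adds one row per match, so a symbol that maps to more than one Entrez ID appears more than once after the join. Whether this happens in these datasets is not shown in the notebook. The sketch below uses made-up placeholder symbols and IDs (`GENE_A`, `GENE_B`, 101, 102, 201) to show one way to check for it:

```r
library(dplyr)

# Made-up placeholder data: GENE_A maps to two IDs, GENE_B to one
toy_data <- data.frame(
  hgnc_symbol = c("GENE_A", "GENE_B"),
  logFC_MUO = c(1.5, -0.7),
  stringsAsFactors = FALSE
)
toy_map <- data.frame(
  entrezgene = c(101, 102, 201),
  hgnc_symbol = c("GENE_A", "GENE_A", "GENE_B"),
  stringsAsFactors = FALSE
)

toy_joined <- left_join(toy_data, toy_map, by = "hgnc_symbol")

# Symbols that occur more than once after the join
dup_symbols <- toy_joined %>%
  count(hgnc_symbol) %>%
  filter(n > 1)
dup_symbols
```

Comparing `nrow()` before and after the join gives the same information at a glance.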


Save these files so we have access to our cleaned data files.

In [23]:
# save files
write.table(data1, file.path(getwd(),"Datasets", "Clean", "breast-cancer_stats_clean.txt"), row.names = F, sep = "\t", quote = F)
write.table(data2, file.path(getwd(),"Datasets", "Clean", "lung-cancer-stats_clean.txt"), row.names = F, sep = "\t", quote = F)
write.table(data3, file.path(getwd(),"Datasets", "Clean", "MUO-Lean_stats_clean.txt"), row.names = F, sep = "\t", quote = F)
write.table(data4, file.path(getwd(),"Datasets", "Clean", "RA-control_GSE55235_stats_clean.txt"), row.names = F, sep = "\t", quote = F)
write.table(data5, file.path(getwd(),"Datasets", "Clean", "RETT-control_FC_stats_clean.txt"), row.names = F, sep = "\t", quote = F)
write.table(data6, file.path(getwd(),"Datasets", "Clean", "RETT-control_TC_stats_clean.txt"), row.names = F, sep = "\t", quote = F)
write.table(data7, file.path(getwd(),"Datasets", "Clean", "SLE_stats_clean.txt"), row.names = F, sep = "\t", quote = F)
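
`write.table` does not create missing folders, so the calls above fail if `Datasets/Clean` does not exist yet. A minimal sketch of creating the folder first, using a temporary directory and a toy table (both assumptions made so the example does not touch the real data):

```r
# Use a temporary folder so the sketch does not touch the real Datasets folder
out_dir <- file.path(tempdir(), "Datasets", "Clean")

# Create the folder (and its parents) if it does not exist yet
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

# Toy table using the first rows of the cleaned data1 preview
toy <- data.frame(
  ensembl_gene_id = c("ENSG00000000003", "ENSG00000000005"),
  logFC_BC = c(-0.19677472, -8.22242726)
)
out_file <- file.path(out_dir, "toy_stats_clean.txt")
write.table(toy, out_file, row.names = FALSE, sep = "\t", quote = FALSE)

file.exists(out_file)
```

In the notebook itself the equivalent would be a single `dir.create(file.path(getwd(), "Datasets", "Clean"), recursive = TRUE, showWarnings = FALSE)` before the save cell.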

We have now cleaned our gene expression data files. In the next Jupyter notebook, which can be found in this repository, we merge these data files so that we have a single file for all our gene expression data.