Add differential expression analyses in ANCA-associated vasculitis (#14)

* Fix DOI * Add NARES differential expression notebook * Add GPL file from GEO Downloaded on 15 Sept 2017 * Only ignore the sample metadata in greenelab/rheum-plier-data * Add GPA blood diff exp notebook * Remove lines * Ignore private kidney data * Move function for summarizing expression to util * Add table of contents * Add glomeruli differential expression analysis * Update: save B data.frame for plotting (NARES) * Update: save B data.frame for plotting (GPA blood) * Update: save B data.frame for plotting (kidney) * Add notebook -- first pass at looking across AAV sets differential expression of LVs * Update notebook title * Update: add note about LV96 * Update: add information about clinical data * Write wrapper function for DLVE * Move wrapper function to DLVE util * Use wrapper for GPA blood * Use wrapper for kidney * Save venn diagram as png
greenelab · May 31, 2018 · 62047c2 · 62047c2
1 parent 664690a
commit 62047c2
Show file tree

Hide file tree

Showing 27 changed files with 277,134 additions and 4 deletions.
diff --git a/.gitignore b/.gitignore
@@ -33,4 +33,12 @@ data/expression_data/E-MTAB-2452_hugene11st_SCANfast.pcl
 data/expression_data/GSE18885_series_matrix.txt
 data/expression_data/NARES_SCANfast_ComBat.pcl
 data/expression_data/SLE_WB_all_microarray_QN_zto_before.pcl
-data/sample_info
+data/sample_info/E-GEOD-39088.sdrf.txt
+data/sample_info/E-GEOD-65391.sdrf.txt
+data/sample_info/E-GEOD-78193.sdrf.txt
+data/sample_info/E-MTAB-2452.sdrf.txt
+data/sample_info/NARES_demographic_data.tsv
+data/sample_info/sle-wb_sample_dataset_mapping.tsv
+
+# private kidney data
+data/kidney/
diff --git a/12-train_NARES_PLIER.Rmd b/12-train_NARES_PLIER.Rmd
@@ -10,7 +10,7 @@ output:
 
 The NARES study was published in the following article:
 
-> Grayson PC, Steiling K, Platt M, et al. [Defining the Nasal Transcriptome in Granulomatosis with Polyangiitis](https://dx.doi.org/10.100art.39185). _Arthritis & Rheumatology_, 2015. doi: 10.1002/art.39185.
+> Grayson PC, Steiling K, Platt M, et al. [Defining the Nasal Transcriptome in Granulomatosis with Polyangiitis](https://dx.doi.org/10.1002/art.39185). _Arthritis & Rheumatology_, 2015. doi: 10.1002/art.39185.
 
 We processed this data in the [`greenelab/rheum-data-plier`](https://github.com/greenelab/rheum-plier-data) 
 repo.

diff --git a/18-NARES_differential_expression.Rmd b/18-NARES_differential_expression.Rmd
@@ -0,0 +1,118 @@
+---
+title: "NARES differential expression"
+output:   
+  html_notebook: 
+    toc: true
+    toc_float: true
+---
+
+**J. Taroni 2018**
+
+The NARES study ([Grayson, et al. _Arthritis Rheumatol_. 2015](https://dx.doi.org/10.1002/art.39185)) 
+included three sets of patients with ANCA-associated vasculitis: those with 
+active nasal disease, prior nasal disease, and those with no history of nasal 
+disease.
+There is also a composite comparator group.
+
+Here, we'll test for PLIER LVs that are differentially expressed between groups.
+We'll focus on the multiPLIER LVs (e.g., the recount2 PLIER model), as 
+we have demonstrated the validity of this approach in prior analyses.
+
+## Functions and directory set up
+
+```{r}
+`%>%` <- dplyr::`%>%` 
+source(file.path("util", "test_LV_differences.R"))
+```
+
+```{r}
+# plot and result directory setup for this notebook
+plot.dir <- file.path("plots", "18")
+dir.create(plot.dir, recursive = TRUE, showWarnings = FALSE)
+results.dir <- file.path("results", "18")
+dir.create(results.dir, recursive = TRUE, showWarnings = FALSE)
+```
+
+## Read in data
+
+### Clinical data
+
+This clinical data includes information like measures of disease severity 
+([`BVAS`](https://doi.org/10.1136/ard.2008.101279)) and disease duration.
+We're most interested in the disease labels (`Classification`).
+
+```{r}
+nares.demo <- readr::read_tsv(file.path("data", "sample_info", 
+                                        "NARES_demographic_data.tsv")) %>%
+  dplyr::mutate(Classification = dplyr::recode(Classification, 
+                                               C1 = "Healthy",
+                                               C2 = "Sarcoidosis",
+                                               C3 = "Allergic Rhinitis",
+                                               C4 = "EGPA",
+                                               V1 = "GPA (no prior nasal disease)",
+                                               V2 = "GPA with prior nasal disease",
+                                               V3 = "GPA with active nasal disease"))
+```
+
+### recount2 LVs
+
+```{r}
+recount.b <- readRDS(file.path("results", "13", "NARES_recount2_B.RDS"))
+```
+
+## NARES differential latent variable expression (DLVE)
+
+We're going to compare the seven classification groups, which correspond to
+different disease labels (and were recoded above).
+
+```{r}
+nares.df.list <- LVTestWrapper(b.matrix = recount.b,
+                               sample.info.df = nares.demo,
+                               phenotype.col = "Classification",
+                               file.lead = "NARES_recount2_model", 
+                               plot.dir = plot.dir,
+                               results.dir = results.dir)
+```
+
+### Example LV
+
+```{r}
+nares.limma.df <- nares.df.list$limma
+nares.limma.df %>% 
+  dplyr::filter(adj.P.Val < 0.05) %>% 
+  dplyr::arrange(adj.P.Val)
+```
+
+```{r}
+# read in recount2 summary data.frame
+summary.file <- file.path("results", "08", "recount2_PLIER_summary.tsv")
+recount.summary <- readr::read_tsv(summary.file)
+```
+
+Let's plot `LV96`, as it looks biologically interesting because of it's 
+putative association with natural killer (NK) cells. 
+We'd expect myeloid cell, specifically granulocyte, signatures to be 
+differentially expressed in this dataset, but not necessarily lymphocyte 
+signatures.
+First, does the _name_ agree with the recount2 PLIER summary?
+
+```{r}
+recount.summary %>%
+  dplyr::filter(`LV index` == 96)
+```
+
+Yes, `SVM NK cells activated` is significantly associated with `LV96`.
+
+```{r}
+recount.b.df <- nares.df.list$b.df
+recount.b.df %>%
+  dplyr::filter(LV == "96,SVM NK cells activated") %>%
+  ggplot2::ggplot(ggplot2::aes(x = Classification, y = Value)) +
+  ggplot2::geom_boxplot() +
+  ggplot2::geom_jitter(position = ggplot2::position_jitter(0.15)) + 
+  ggplot2::theme_bw() +
+  ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 45, hjust = 1),
+                 plot.title = ggplot2::element_text(hjust = 0.5, face = "bold")) +
+  ggplot2::ggtitle("LV 96 SVM NK cells activated")
+```
+
diff --git a/18-NARES_differential_expression.nb.html b/18-NARES_differential_expression.nb.html
diff --git a/19-GPA_blood_differential_expression.Rmd b/19-GPA_blood_differential_expression.Rmd
@@ -0,0 +1,191 @@
+---
+title: "GPA blood differential expression"
+output:   
+  html_notebook: 
+    toc: true
+    toc_float: true
+---
+
+**J. Taroni 2018**
+
+Here, we'll perform differential expression (of LV) analyses for the PBMC data 
+from Cheadle C, Berger AE, Andrade F, et al. 
+[Transcription of PR3 and Related Myelopoiesis Genes in Peripheral Blood Mononuclear Cells in Active Wegener’s Granulomatosis](https://dx.doi.org/10.1002/art.27398). 
+_Arthritis & Rheumatism_, 2010. doi: 10.1002/art.27398.
+
+We'll be comparing the healthy controls, the patients with a GPA signature 
+(GPA-positive) and the patients without a GPA signature (GPA-negative).
+We'll first need to project this data into the recount2 PLIER model latent space.
+
+## Functions and directory set up
+
+```{r}
+`%>%` <- dplyr::`%>%` 
+source(file.path("util", "test_LV_differences.R"))
+source(file.path("util", "plier_util.R"))
+```
+
+```{r}
+# plot and result directory setup for this notebook
+plot.dir <- file.path("plots", "19")
+dir.create(plot.dir, recursive = TRUE, showWarnings = FALSE)
+results.dir <- file.path("results", "19")
+dir.create(results.dir, recursive = TRUE, showWarnings = FALSE)
+```
+
+## Read in data
+
+The GPA blood file is in the form of a GEO series matrix, which means the 
+sample metadata is in what are essentially comment fields at the beginning of 
+the file.
+Thus, some ugliness/wrangling will be necessary.
+
+### Expression data
+
+```{r}
+series.mat.file <- file.path("data", "expression_data", 
+                             "GSE18885_series_matrix.txt")
+
+# expression matrix
+gpa.ma.data <- 
+  readr::read_delim(series.mat.file, 
+                    delim = "\t", 
+                    comment = "!",
+                    col_names = TRUE, 
+                    skip = 1)
+
+# this information about GPL6140 was downloaded on 15 Sept 2017, and reportedly
+# last updated on 18 Jan 2013
+gpl.from.geo <- readr::read_tsv(file.path("data", "expression_data", 
+                                          "GPL6104-11576.txt"), comment = "#")
+
+# use the gene symbols from the GPL file -- PLIER uses gene symbols
+gpa.ma.annot <- gpl.from.geo %>%
+  dplyr::select(c(ID, ILMN_Gene)) %>%
+  dplyr::left_join(y = gpa.ma.data,
+                   by = c("ID" = "ID_REF"))
+colnames(gpa.ma.annot)[2] <- "Gene"
+```
+
+```{r}
+# are there any duplicates in the gene symbol column?
+sum(duplicated(gpa.ma.annot$Gene))
+```
+
+```{r}
+# aggregate & write to file
+agg.ma.df <- PrepExpressionDF(dplyr::select(gpa.ma.annot, -ID))
+readr::write_tsv(agg.ma.df,
+                 path = file.path("data", "expression_data",
+                                  "GSE18885_annotated_mean.pcl"))
+
+# get expression matrix
+exprs.mat <- as.matrix(dplyr::select(agg.ma.df, -Gene))
+rownames(exprs.mat) <- agg.ma.df$Gene
+```
+
+### Metadata
+
+As mentioned above, metadata is also extracted from the series matrix file.
+
+```{r}
+# get more information from the series matrix file
+# line 41 has the sample names
+conn <- file(series.mat.file)
+open(conn)
+smpl.name <- read.table(conn, skip = 41, nrow = 1)
+close(conn)
+
+# line 51 contains the WG signature status -- I'll call this GPA signature
+# GPA-positive, GPA-negative to be in line with the current disease name
+conn <- file(series.mat.file)
+open(conn)
+gpa.sig.status <- read.table(conn, skip = 51, nrow = 1)
+close(conn)
+
+# this is the GEO sample accession information
+conn <- file(series.mat.file)
+open(conn)
+gsm.info <- read.table(conn, skip = 77, nrow = 1)
+close(conn)
+
+# get those lines into data.frame format
+smpl.info.df <- as.data.frame(t(dplyr::bind_rows(gsm.info, 
+                                                 smpl.name, 
+                                                 gpa.sig.status))[-1, ])
+colnames(smpl.info.df) <- c("Sample", "Name", "GPA_signature")
+
+# remove 1 row data.frames read in from the series matrix
+rm(conn, gsm.info, smpl.name, gpa.sig.status)
+```
+
+```{r}
+# extract fraction ("cell type(s)") from sample names
+smpl.info.df <- smpl.info.df %>%
+                  dplyr::mutate(Cell_type = sub(".*\\ ", "", Name))
+
+# recode the GPA signature bit
+smpl.info.df <-
+  smpl.info.df %>%
+  dplyr::mutate(GPA_signature = dplyr::case_when(
+                  grepl("control", GPA_signature) ~ "Control", 
+                  grepl("WG Sig -", GPA_signature) ~ "GPAneg",
+                  grepl("WG Sig +", GPA_signature) ~ "GPApos"
+                ))
+
+# write to file
+readr::write_tsv(smpl.info.df, 
+                 path = file.path("data", "sample_info",
+                                  "GSE18885_sample_info.tsv"))
+```
+
+```{r}
+# remove data.frames that won't be needed
+rm(agg.ma.df, gpa.ma.annot, gpa.ma.data, gpl.from.geo)
+```
+
+## Apply recount2 model
+
+```{r}
+# load model itself
+recount.plier <- readRDS(file.path("data", "recount2_PLIER_data", 
+                                   "recount_PLIER_model.RDS"))
+# project
+recount.b <- GetNewDataB(exprs.mat = as.matrix(exprs.mat),
+                         plier.model = recount.plier)
+```
+
+```{r}
+# save B matrix to file
+b.file <- file.path(results.dir, "GPA_blood_recount2_B.RDS")
+saveRDS(recount.b, file = b.file)
+```
+
+## Differential expression
+
+### PBMC only
+
+We're going to restrict our analyses to the peripheral blood mononuclear cell 
+(PBMC) fraction.
+
+```{r}
+# sample information
+pbmc.df <- smpl.info.df %>%
+  dplyr::filter(Cell_type == "PBMC")
+```
+
+```{r}
+# b matrix (latent variables)
+pbmc.b <- recount.b[, pbmc.df$Sample]
+```
+
+### Differential expression analysis itself
+
+```{r}
+LVTestWrapper(b.matrix = pbmc.b,
+              sample.info.df = pbmc.df,
+              phenotype.col = "GPA_signature",
+              file.lead = "GPA_blood_recount2_model",
+              plot.dir = plot.dir,
+              results.dir = results.dir)
+```
diff --git a/19-GPA_blood_differential_expression.nb.html b/19-GPA_blood_differential_expression.nb.html