Skip to content

Commit

Permalink
Add differential expression analyses in ANCA-associated vasculitis (#14)
Browse files Browse the repository at this point in the history
* Fix DOI

* Add NARES differential expression notebook

* Add GPL file from GEO

Downloaded on 15 Sept 2017

* Only ignore the sample metadata in greenelab/rheum-plier-data

* Add GPA blood diff exp notebook

* Remove lines

* Ignore private kidney data

* Move function for summarizing expression to util

* Add table of contents

* Add glomeruli differential expression analysis

* Update: save B data.frame for plotting (NARES)

* Update: save B data.frame for plotting (GPA blood)

* Update: save B data.frame for plotting (kidney)

* Add notebook -- first pass at looking across AAV sets

differential expression of LVs

* Update notebook title

* Update: add note about LV96

* Update: add information about clinical data

* Write wrapper function for DLVE

* Move wrapper function to DLVE util

* Use wrapper for GPA blood

* Use wrapper for kidney

* Save venn diagram as png
  • Loading branch information
jaclyn-taroni committed May 31, 2018
1 parent 664690a commit 62047c2
Show file tree
Hide file tree
Showing 27 changed files with 277,134 additions and 4 deletions.
10 changes: 9 additions & 1 deletion .gitignore
Expand Up @@ -33,4 +33,12 @@ data/expression_data/E-MTAB-2452_hugene11st_SCANfast.pcl
data/expression_data/GSE18885_series_matrix.txt
data/expression_data/NARES_SCANfast_ComBat.pcl
data/expression_data/SLE_WB_all_microarray_QN_zto_before.pcl
data/sample_info
data/sample_info/E-GEOD-39088.sdrf.txt
data/sample_info/E-GEOD-65391.sdrf.txt
data/sample_info/E-GEOD-78193.sdrf.txt
data/sample_info/E-MTAB-2452.sdrf.txt
data/sample_info/NARES_demographic_data.tsv
data/sample_info/sle-wb_sample_dataset_mapping.tsv

# private kidney data
data/kidney/
2 changes: 1 addition & 1 deletion 12-train_NARES_PLIER.Rmd
Expand Up @@ -10,7 +10,7 @@ output:

The NARES study was published in the following article:

> Grayson PC, Steiling K, Platt M, et al. [Defining the Nasal Transcriptome in Granulomatosis with Polyangiitis](https://dx.doi.org/10.100art.39185). _Arthritis & Rheumatology_, 2015. doi: 10.1002/art.39185.
> Grayson PC, Steiling K, Platt M, et al. [Defining the Nasal Transcriptome in Granulomatosis with Polyangiitis](https://dx.doi.org/10.1002/art.39185). _Arthritis & Rheumatology_, 2015. doi: 10.1002/art.39185.
We processed this data in the [`greenelab/rheum-data-plier`](https://github.com/greenelab/rheum-plier-data)
repo.
Expand Down
118 changes: 118 additions & 0 deletions 18-NARES_differential_expression.Rmd
@@ -0,0 +1,118 @@
---
title: "NARES differential expression"
output:
html_notebook:
toc: true
toc_float: true
---

**J. Taroni 2018**

The NARES study ([Grayson, et al. _Arthritis Rheumatol_. 2015](https://dx.doi.org/10.1002/art.39185))
included three sets of patients with ANCA-associated vasculitis: those with
active nasal disease, prior nasal disease, and those with no history of nasal
disease.
There is also a composite comparator group.

Here, we'll test for PLIER LVs that are differentially expressed between groups.
We'll focus on the multiPLIER LVs (e.g., the recount2 PLIER model), as
we have demonstrated the validity of this approach in prior analyses.

## Functions and directory set up

```{r}
`%>%` <- dplyr::`%>%`
source(file.path("util", "test_LV_differences.R"))
```

```{r}
# plot and result directory setup for this notebook
plot.dir <- file.path("plots", "18")
dir.create(plot.dir, recursive = TRUE, showWarnings = FALSE)
results.dir <- file.path("results", "18")
dir.create(results.dir, recursive = TRUE, showWarnings = FALSE)
```

## Read in data

### Clinical data

This clinical data includes information like measures of disease severity
([`BVAS`](https://doi.org/10.1136/ard.2008.101279)) and disease duration.
We're most interested in the disease labels (`Classification`).

```{r}
nares.demo <- readr::read_tsv(file.path("data", "sample_info",
"NARES_demographic_data.tsv")) %>%
dplyr::mutate(Classification = dplyr::recode(Classification,
C1 = "Healthy",
C2 = "Sarcoidosis",
C3 = "Allergic Rhinitis",
C4 = "EGPA",
V1 = "GPA (no prior nasal disease)",
V2 = "GPA with prior nasal disease",
V3 = "GPA with active nasal disease"))
```

### recount2 LVs

```{r}
recount.b <- readRDS(file.path("results", "13", "NARES_recount2_B.RDS"))
```

## NARES differential latent variable expression (DLVE)

We're going to compare the seven classification groups, which correspond to
different disease labels (and were recoded above).

```{r}
nares.df.list <- LVTestWrapper(b.matrix = recount.b,
sample.info.df = nares.demo,
phenotype.col = "Classification",
file.lead = "NARES_recount2_model",
plot.dir = plot.dir,
results.dir = results.dir)
```

### Example LV

```{r}
nares.limma.df <- nares.df.list$limma
nares.limma.df %>%
dplyr::filter(adj.P.Val < 0.05) %>%
dplyr::arrange(adj.P.Val)
```

```{r}
# read in recount2 summary data.frame
summary.file <- file.path("results", "08", "recount2_PLIER_summary.tsv")
recount.summary <- readr::read_tsv(summary.file)
```

Let's plot `LV96`, as it looks biologically interesting because of it's
putative association with natural killer (NK) cells.
We'd expect myeloid cell, specifically granulocyte, signatures to be
differentially expressed in this dataset, but not necessarily lymphocyte
signatures.
First, does the _name_ agree with the recount2 PLIER summary?

```{r}
recount.summary %>%
dplyr::filter(`LV index` == 96)
```

Yes, `SVM NK cells activated` is significantly associated with `LV96`.

```{r}
recount.b.df <- nares.df.list$b.df
recount.b.df %>%
dplyr::filter(LV == "96,SVM NK cells activated") %>%
ggplot2::ggplot(ggplot2::aes(x = Classification, y = Value)) +
ggplot2::geom_boxplot() +
ggplot2::geom_jitter(position = ggplot2::position_jitter(0.15)) +
ggplot2::theme_bw() +
ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 45, hjust = 1),
plot.title = ggplot2::element_text(hjust = 0.5, face = "bold")) +
ggplot2::ggtitle("LV 96 SVM NK cells activated")
```

486 changes: 486 additions & 0 deletions 18-NARES_differential_expression.nb.html

Large diffs are not rendered by default.

191 changes: 191 additions & 0 deletions 19-GPA_blood_differential_expression.Rmd
@@ -0,0 +1,191 @@
---
title: "GPA blood differential expression"
output:
html_notebook:
toc: true
toc_float: true
---

**J. Taroni 2018**

Here, we'll perform differential expression (of LV) analyses for the PBMC data
from Cheadle C, Berger AE, Andrade F, et al.
[Transcription of PR3 and Related Myelopoiesis Genes in Peripheral Blood Mononuclear Cells in Active Wegener’s Granulomatosis](https://dx.doi.org/10.1002/art.27398).
_Arthritis & Rheumatism_, 2010. doi: 10.1002/art.27398.

We'll be comparing the healthy controls, the patients with a GPA signature
(GPA-positive) and the patients without a GPA signature (GPA-negative).
We'll first need to project this data into the recount2 PLIER model latent space.

## Functions and directory set up

```{r}
`%>%` <- dplyr::`%>%`
source(file.path("util", "test_LV_differences.R"))
source(file.path("util", "plier_util.R"))
```

```{r}
# plot and result directory setup for this notebook
plot.dir <- file.path("plots", "19")
dir.create(plot.dir, recursive = TRUE, showWarnings = FALSE)
results.dir <- file.path("results", "19")
dir.create(results.dir, recursive = TRUE, showWarnings = FALSE)
```

## Read in data

The GPA blood file is in the form of a GEO series matrix, which means the
sample metadata is in what are essentially comment fields at the beginning of
the file.
Thus, some ugliness/wrangling will be necessary.

### Expression data

```{r}
series.mat.file <- file.path("data", "expression_data",
"GSE18885_series_matrix.txt")
# expression matrix
gpa.ma.data <-
readr::read_delim(series.mat.file,
delim = "\t",
comment = "!",
col_names = TRUE,
skip = 1)
# this information about GPL6140 was downloaded on 15 Sept 2017, and reportedly
# last updated on 18 Jan 2013
gpl.from.geo <- readr::read_tsv(file.path("data", "expression_data",
"GPL6104-11576.txt"), comment = "#")
# use the gene symbols from the GPL file -- PLIER uses gene symbols
gpa.ma.annot <- gpl.from.geo %>%
dplyr::select(c(ID, ILMN_Gene)) %>%
dplyr::left_join(y = gpa.ma.data,
by = c("ID" = "ID_REF"))
colnames(gpa.ma.annot)[2] <- "Gene"
```

```{r}
# are there any duplicates in the gene symbol column?
sum(duplicated(gpa.ma.annot$Gene))
```

```{r}
# aggregate & write to file
agg.ma.df <- PrepExpressionDF(dplyr::select(gpa.ma.annot, -ID))
readr::write_tsv(agg.ma.df,
path = file.path("data", "expression_data",
"GSE18885_annotated_mean.pcl"))
# get expression matrix
exprs.mat <- as.matrix(dplyr::select(agg.ma.df, -Gene))
rownames(exprs.mat) <- agg.ma.df$Gene
```

### Metadata

As mentioned above, metadata is also extracted from the series matrix file.

```{r}
# get more information from the series matrix file
# line 41 has the sample names
conn <- file(series.mat.file)
open(conn)
smpl.name <- read.table(conn, skip = 41, nrow = 1)
close(conn)
# line 51 contains the WG signature status -- I'll call this GPA signature
# GPA-positive, GPA-negative to be in line with the current disease name
conn <- file(series.mat.file)
open(conn)
gpa.sig.status <- read.table(conn, skip = 51, nrow = 1)
close(conn)
# this is the GEO sample accession information
conn <- file(series.mat.file)
open(conn)
gsm.info <- read.table(conn, skip = 77, nrow = 1)
close(conn)
# get those lines into data.frame format
smpl.info.df <- as.data.frame(t(dplyr::bind_rows(gsm.info,
smpl.name,
gpa.sig.status))[-1, ])
colnames(smpl.info.df) <- c("Sample", "Name", "GPA_signature")
# remove 1 row data.frames read in from the series matrix
rm(conn, gsm.info, smpl.name, gpa.sig.status)
```

```{r}
# extract fraction ("cell type(s)") from sample names
smpl.info.df <- smpl.info.df %>%
dplyr::mutate(Cell_type = sub(".*\\ ", "", Name))
# recode the GPA signature bit
smpl.info.df <-
smpl.info.df %>%
dplyr::mutate(GPA_signature = dplyr::case_when(
grepl("control", GPA_signature) ~ "Control",
grepl("WG Sig -", GPA_signature) ~ "GPAneg",
grepl("WG Sig +", GPA_signature) ~ "GPApos"
))
# write to file
readr::write_tsv(smpl.info.df,
path = file.path("data", "sample_info",
"GSE18885_sample_info.tsv"))
```

```{r}
# remove data.frames that won't be needed
rm(agg.ma.df, gpa.ma.annot, gpa.ma.data, gpl.from.geo)
```

## Apply recount2 model

```{r}
# load model itself
recount.plier <- readRDS(file.path("data", "recount2_PLIER_data",
"recount_PLIER_model.RDS"))
# project
recount.b <- GetNewDataB(exprs.mat = as.matrix(exprs.mat),
plier.model = recount.plier)
```

```{r}
# save B matrix to file
b.file <- file.path(results.dir, "GPA_blood_recount2_B.RDS")
saveRDS(recount.b, file = b.file)
```

## Differential expression

### PBMC only

We're going to restrict our analyses to the peripheral blood mononuclear cell
(PBMC) fraction.

```{r}
# sample information
pbmc.df <- smpl.info.df %>%
dplyr::filter(Cell_type == "PBMC")
```

```{r}
# b matrix (latent variables)
pbmc.b <- recount.b[, pbmc.df$Sample]
```

### Differential expression analysis itself

```{r}
LVTestWrapper(b.matrix = pbmc.b,
sample.info.df = pbmc.df,
phenotype.col = "GPA_signature",
file.lead = "GPA_blood_recount2_model",
plot.dir = plot.dir,
results.dir = results.dir)
```
885 changes: 885 additions & 0 deletions 19-GPA_blood_differential_expression.nb.html

Large diffs are not rendered by default.

0 comments on commit 62047c2

Please sign in to comment.