-
Notifications
You must be signed in to change notification settings - Fork 0
/
nanostring_deseq2_analysis_cho_jan2022.Rmd
1212 lines (884 loc) · 44.3 KB
/
nanostring_deseq2_analysis_cho_jan2022.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "NanoString NCounter analysis: Breastmilk miRNA Obese vs Normals"
author: "Michael S Chimenti"
date: "`r format(Sys.time(), '%d %B %Y')`"
output:
html_document:
df_print: paged
toc: true
toc_depth: 2
toc_float: true
number_sections: true
code_folding: hide
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Introduction
## Project Summary
**Our central hypothesis is that altered breast milk EV (bEV)-miRNA profiles in obese mothers affect the gene expression of intestinal epithelial cells (IECs), which might be associated with unfavorable health outcomes in children. Our long-term goal is to promote the health and development of children by improving the quality of breast milk through interventions and therapeutic strategies. The objectives of this study to achieve the long-term goal are 1) to understand the physiological roles of bEVs associated with maternal obesity, and 2) to establish a research team investigating bEVs with collaborators for the further studies. To attain the overall objectives, we propose the following specific aims:
Aim1. To determine differentially expressed bEV-miRNAs in obese mothers; We will recruit normal weight mothers and overweight/obese mothers and who have delivered a baby in the past one month. EV-miRNA expression profile will be analyzed with Nanostring Human miRNA Expression Panel. It will allow us to screen 798 biologically relevant miRNAs, which will enable us to find novel biomarkers related to maternal obesity. Our working hypothesis is that immune-response and metabolism-related miRNAs are altered in overweight/obese mother breast milk.**
## Methods
Nanostring NCounter data were obtained as “RCC” format files. These were imported into R with 'readRcc' function from the NanoStringQCPro package (Bioconductor; https://www.bioconductor.org/packages/release/bioc/html/NanoStringQCPro.html; Bourgon et. al.). Here, we follow the NanoString RUV normalization procedure outlined in Love, et. al (ref), for quality control checks and iterative RUVg normalization (Risso, et. al.). All samples passed QC checks and were included in the analysis. After visualization of the normalized data with RLE and PCA plots, RUVg-normalized data (k=2) was chosen for downstream DEG analysis. A DESeq2 'dds' object was created from the normalized data using ‘DESeqDataSetFromMatrix’ with design conditioned on a BMI-cutoff factor variable, milk collection date quartile, and the RUVg-learned 'W1' and 'W2' correction factors for technical batch. R code for the analysis is available on Github.
Nickles D, Sandmann T, Ziman R, Bourgon R (2021). NanoStringQCPro: Quality metrics and data processing methods for NanoString mRNA gene expression data. R package version 1.26.0.
Bhattacharya A, Hamilton AM, Furberg H, et al. An approach for normalization and quality control for NanoString RNA expression data. Brief Bioinform. 2021;22(3):bbaa163. doi:10.1093/bib/bbaa163
Risso D, Ngai J, Speed T, Dudoit S (2014). “Normalization of RNA-seq data using factor analysis of control genes or samples.” Nature Biotechnology, 32(9), 896–902. In press, http://www.nature.com/nbt/journal/v32/n9/full/nbt.2931.html.
Love MI, Huber W, Anders S (2014). “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology, 15, 550. doi: 10.1186/s13059-014-0550-8.
## Library imports
Click the "Code" button to see hidden code.
```{r, imports, warning=FALSE, message=FALSE, class.source = 'fold-show'}
## Format and plotting
library(ggrepel)
library(kableExtra)
library(pcaExplorer)
library(Vennerable)
require(ggplot2)
## Nanostring specific
library(RUVSeq)
library(NanoStringQCPro)
library(NanoNormIter)
## DE analysis
library(DESeq2)
library(limma)
library(matrixStats)
library(EnvStats)
## TidyR
library(tidyverse)
library(magrittr)
```
## Function definitions
```{r, function_defs}
## Volcano Plot
do_vol_plot <- function(df, sig=0.05, fc=0.5, size = 4){
df_sig<- df %>% filter(padj < sig)
## genes with labels -- signficant,decreased expression and start with IGF
df_label<- df %>%
filter(padj < sig, abs(log2FoldChange) > fc)
#mutate(short_id = stringr::str_split(gene_id, "000000") %>% map_chr(.,2))
## plot
p <- ggplot(df, aes(log2FoldChange, -log10(padj))) +
geom_point(size=0.8, color="black", alpha=.8) +
geom_point(size=0.8, data=df_sig, aes(log2FoldChange, -log10(padj)), colour="red") +
geom_text_repel(size= size,
colour="black",
segment.size=0.1,
nudge_x=0.06,
nudge_y=0.06,
data=df_label,
aes(log2FoldChange, -log10(padj), label=gene_name),
max.iter= 200,
point.padding = 0.15,
segment.alpha = 1,
box.padding=.15,
min.segment.length = unit(0.15, 'lines'),size=2.5) +
theme(
axis.text.x = element_text(size=12),
axis.text.y = element_text(size=12),
axis.title.x = element_text(size=14, margin = margin(t = 10, r = 0, b = 10, l = 0)),
axis.title.y = element_text(size=14, margin = margin(t = 0, r = 10, b = 0, l = 10)),
plot.margin =unit(c(.5,.5,.5,.5),"cm"),
plot.title = element_text(size = 11)
)
return (p)
}
# Function to add correlation coefficients
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...) {
usr <- par("usr")
on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
Cor <- abs(cor(x, y)) # Remove abs function if desired
txt <- paste0(prefix, format(c(Cor, 0.123456789), digits = digits)[1])
if(missing(cex.cor)) {
cex.cor <- 0.4 / strwidth(txt)
}
text(0.5, 0.5, txt,
cex = 1 + cex.cor * Cor) # Resize the text by level of correlation
}
#### THESE FUNCTIONS ARE IMPORTED BY 'NANONORMITER' PACKAGE
# FOR REFERENCE, HERE ARE THE QC CHECKS:
# imagingQC <- function(rcc){
#
# fovRatio = as.numeric(rcc$Lane_Attributes[3]) / as.numeric(rcc$Lane_Attributes[2])
# if (!(fovRatio > .75)) {return('Flag')}
# if (fovRatio > .75) {return('No flag')}
#
# }
#
# #### INPUT: rcc - input from rcc (use readRcc from NanoStringQCPro)
# #### low, high - the lower and upper limits for binding density
# #### OUTPUT: flag for binding density
#
# bindingDensityQC <- function(rcc,low,high){
#
# bd = as.numeric(rcc$Lane_Attributes[6])
# if(!(bd < high & bd > low)) {return('Flag')}
# if (bd < high & bd > low) {return('No flag')}
#
#
# }
#
# #### INPUT: rcc - input from rcc (use readRcc from NanoStringQCPro)
# #### OUTPUT: flag for linearity for positive controls
#
# positiveLinQC <- function(rcc){
#
# counts = rcc$Code_Summary
# posControls = as.numeric(counts$Count[grepl('POS_',counts$Name)])
# known = c(128,128/4,128/16,128/64,128/256,128/(256*4))
# r2 = summary(lm(sort(posControls)~sort(known)))$r.squared
# if(!(r2 > .95) | is.na(r2)) {return('Flag')}
# if(r2 > .95) {return('No flag')}
#
# }
#
# #### INPUT: rcc - input from rcc (use readRcc from NanoStringQCPro)
# #### numSD - number of standard deviations to calibrate the LOD
# #### OUTPUT: flag for limit of detection
#
# limitOfDetectionQC <- function(rcc,numSD = 0){
#
# counts = rcc$Code_Summary
# posE = as.numeric(counts$Count[counts$Name == 'POS_E'])
# negControls = as.numeric(counts$Count[grepl('NEG',counts$Name)])
# if(!(posE > mean(negControls) + numSD*sd(negControls))) {return('Flag')}
# if (posE > mean(negControls) + numSD*sd(negControls)) {return('No flag')}
#
# }
##### HERE IS THE RUV CALCULATION AS IMPLEMENTED IN NANONORMITER()
# RUV_total <- function(raw,pData,fData,k,hkgenes = NULL,exclude = NULL){
#
# library(RUVSeq)
# library(DESeq2)
# library(limma)
# library(matrixStats)
#
# if (!is.null(hkgenes)){
#
# fData(set)$Class[rownames(set) %in% hkgenes] = 'Housekeeping'
#
# }
#
# fData = fData[rownames(raw),]
# int = intersect(rownames(raw),rownames(fData))
# fData = fData[int,]
# raw = raw[int,]
#
# set <- newSeqExpressionSet(as.matrix(round(raw)),
# phenoData=pData,
# featureData=fData)
#
# cIdx <- rownames(set)[fData(set)$Class == "Housekeeping"]
# cIdx = cIdx[!(cIdx %in% exclude)]
# x <- as.factor(pData$Group)
# set <- betweenLaneNormalization(set, which="upper")
# set <- RUVg(set, cIdx, k=k)
# dds <- DESeqDataSetFromMatrix(counts(set),colData=pData(set),design=~1)
# rowData(dds) <- fData
# dds <- estimateSizeFactors(dds)
# dds <- estimateDispersionsGeneEst(dds)
# cts <- counts(dds, normalized=TRUE)
# disp <- pmax((rowVars(cts) - rowMeans(cts)),0)/rowMeans(cts)^2
# mcols(dds)$dispGeneEst <- disp
# dds <- estimateDispersionsFit(dds, fitType="mean")
# vsd <- varianceStabilizingTransformation(dds, blind=FALSE)
# mat <- assay(vsd)
# covars <- as.matrix(colData(dds)[,grep("W",colnames(colData(dds))),drop=FALSE])
# mat <- removeBatchEffect(mat, covariates=covars)
# assay(vsd) <- mat
# return(list(set = set,vsd = vsd))
#
# }
```
# Exploratory Data Analysis
## Import metadata
```{r, import_meta, class.source = 'fold-show'}
meta_df_full <- readxl::read_xlsx('cleaned_metadata.xlsx', sheet = 'Sheet1', col_names = TRUE, col_types = "guess")
meta_df <- meta_df_full %>% dplyr::select(c(`file name`, batch, `study ID`, Age, Race, `prepregnancy BMI`, `current BMI`, `Breast milk colllection date post delivary`, `Delivery type`))
colnames(meta_df) <- c("filename","batch","SampleID","age","race","prepreg_BMI","curr_BMI","milk_col_dt", "delivery")
meta_df$batch[28:40] <- "4" #fix the singleton batch
```
Let's do some exploratory analysis on the metadata before starting on the Nanostring analysis. We're looking for correlated variables or correlations between batches and variables.
## Exploratory Plots
### Correlation between pre-preg BMI and delivery?
```{r, meta_EDA}
library(ggpubr)
# Box plots with jittered points
# :::::::::::::::::::::::::::::::::::::::::::::::::::
# Change outline colors by groups: dose
# Use custom color palette
# Add jitter points and change the shape by groups
p <- ggboxplot(meta_df, x = "delivery", y = "prepreg_BMI",
color = "delivery", palette =c("#00AFBB", "#E7B800","#FC4E07"),
add = "jitter")
my_comps <- list( c("vaginal", "vbac"), c("vaginal","c-sec"), c("vbac","c-sec"))
p <- p + stat_compare_means(comparisons = my_comps) + stat_compare_means(label.y=50)
p
```
### Correlation between batch and mean BMI
Sample batch appears to be at possibly partially confounded with mean BMI in both pre- and post-pregnancy BMI. We may want to try excluding batches 6 and 7 to see if discovered DE miRNAs are robust to the presence of these batches with higher mean BMI.
However, when we look at % overweight after classifying according to BMI > 24.25, batches 6-8 aren't too far from the others. Batch 9 has only lean samples.
```{r, meta_EDA2}
p1 <- ggboxplot(meta_df, x = "batch", y = "prepreg_BMI",
color = "batch",
add = "jitter")
p1
p2 <- ggboxplot(meta_df, x = "batch", y = "curr_BMI",
color = "batch",
add = "jitter")
p2
## Overweight >= 25. Obese >=30.
meta_df <- meta_df %>% dplyr::mutate(
status_ov = dplyr::case_when(
.data$prepreg_BMI < 25 ~ "Lean",
.data$prepreg_BMI >=25 ~ "Over_Obese",
TRUE ~ NA_character_
)
) %>% dplyr::mutate(
status_ob = dplyr::case_when(
.data$prepreg_BMI < 25 ~ "Lean",
.data$prepreg_BMI >= 30 ~ "Obese",
(.data$prepreg_BMI >= 25 & .data$prepreg_BMI < 30) ~ "Overweight"
)
)
p3 <- ggplot(meta_df, aes(y=batch, fill=status_ov)) + geom_bar(position="fill") + ggtitle("Fraction Overweight by Batch; BMI > 25")
p3
p4 <- ggplot(meta_df, aes(y=batch, fill=status_ob)) + geom_bar(position="fill") + ggtitle("Fraction Obese by Batch; BMI > 30")
p4
```
### Correlations between dataset variables
Only pre- and current BMI appear to be correlated in a significant way.
```{r, corr, class.source = 'fold-show'}
df <- meta_df %>% dplyr::select(c(batch,age,prepreg_BMI,curr_BMI,milk_col_dt)) %>% mutate_at(vars(batch),factor) %>% as.data.frame()
groups <- df[,1]
pairs(df[,2:5], # Data frame of variables
labels = colnames(df[,2:5]), # Variable names
pch = 21, # Pch symbol
bg = rainbow(9)[groups], # Background color of the symbol (pch 21 to 25)
col = rainbow(9)[groups], # Border color of the symbol
main = "Breastmilk dataset (batch=color)", # Title of the plot
row1attop = TRUE, # If FALSE, changes the direction of the diagonal
gap = 1, # Distance between subplots
cex.labels = NULL, # Size of the diagonal text
font.labels = 1,
lower.panel = panel.smooth,
upper.panel = panel.cor) # Font style of the diagonal text
```
## Variable distributions
Here, we look at the distribution of age and BMI in the dataset.
```{r, dist}
# Density plot with mean lines and marginal rug
# :::::::::::::::::::::::::::::::::::::::::::::::::::
# Change outline and fill colors by groups ("sex")
# Use custom palette
p <- gghistogram(meta_df, x = "age",
add = "mean", rug = TRUE, color = "blue")
p
p <- ggdensity(meta_df, x = "prepreg_BMI",
add = "mean", rug = TRUE, color = "blue")
p
p <- ggdensity(meta_df, x = "curr_BMI",
add = "mean", rug = TRUE, color = "blue")
p
```
# RUV-seq QC and Normalization
RUV-seq method of Love et. al. for best-practices normalization and removal of unwanted technical artifacts
Here, we will follow the method of Love, et. al. (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8138885/):
**"Using DESeq2 [25], we identified genes differentially expressed in NK cells primed by CTV-1 or IL-2 cytokines compared to unprimed NK cells at FDR-adjusted equation M22. The two normalization methods led to a different number of differentially expressed genes with a limited overlap of significant genes by both methods (Figure 4A). The raw equation M23-value histograms from differential expression analysis using nSolver-normalized expression exhibited a slope toward 0 for equation M24-values under 0.3, which can indicate issues with unaccounted-for correlations among samples [42], such as residual technical variation. The distributions of equation M25-values using the RUVSeq-normalized data were closer to uniform throughout the range [0,1] for most genes (Figure 4B). While the log2-fold changes were correlated between the two normalization procedures, the genes found to be differentially expressed only with nSolver-normalized data tended to have large standard errors with RUVSeq-normalized data and therefore not statistically significant using RUVSeq (Figure 4C). These differences in DE results emphasize the importance of properly validating normalization prior to downstream genomic analyses."**
## Import data
### Create pheno and expr tables
```{r, import_data, class.source = 'fold-show'}
#getwd()
files.RCC = list.files(".", full.names = TRUE)
files.RCC = files.RCC[grepl('RCC',files.RCC)]
head(files.RCC)
ng = nrow(readRcc(files.RCC[1])$Code_Summary)
ncol = length(files.RCC)
raw_expression = as.data.frame(matrix(nrow = ng,ncol = length(files.RCC)+2))
colnames(raw_expression)[1:2] = c('Gene','Class')
pData = as.data.frame(matrix(nrow = length(files.RCC),ncol = 11))
colnames(pData) = c('BCAC_ID','SampleID','Owner','Comments','Date','GeneRLF','SystemAPF','imagingQC',
'bindingDensityQC','limitOfDetectionQC','positiveLinearityQC')
raw_expression[,1:2] = readRcc(files.RCC[1])$Code_Summary[,c(2,1)]
head(pData, 5)
head(raw_expression, 5)
```
### Populate empty tables from RCC
```{r, create_pData, message=FALSE,warning=FALSE}
## NOTE: I have to override the positiveLinQC function for this panel b/c it uses two types of positive controls and that confuses the grepl search.
positiveLinQC <- function(rcc){
counts = rcc$Code_Summary
#posControls = as.numeric(counts$Count[grepl('POS_',counts$Name)])
posControls = as.numeric(counts$Count[grepl('^POS_',counts$Name)])
known = c(128,128/4,128/16,128/64,128/256,128/(256*4))
r2 = summary(lm(sort(posControls)~sort(known)))$r.squared
if(!(r2 > .95) | is.na(r2)) {return('Flag')}
if(r2 > .95) {return('No flag')}
}
## Override this function to account for two types of Neg controls.
limitOfDetectionQC <- function(rcc,numSD = 0){
counts = rcc$Code_Summary
posE = as.numeric(counts$Count[counts$Name == 'POS_E'])
negControls = as.numeric(counts$Count[grepl('^NEG',counts$Name)])
if(!(posE > mean(negControls) + numSD*sd(negControls))) {return('Flag')}
if (posE > mean(negControls) + numSD*sd(negControls)) {return('No flag')}
}
##
## Populate pData and raw expr tables from RCC files
for (i in 1:length(files.RCC)){
#i = 1
print(i)
rcc = readRcc(files.RCC[i])
raw = rcc$Code_Summary
raw_expression[,i+2] = as.numeric(raw$Count)
#colnames(raw_expression)[i+2] = strsplit(files.RCC[i],'_')[[1]][2:4]
pData[i,2:7] = as.vector(rcc$Sample_Attributes)
pData$imagingQC[i] = imagingQC(rcc)
pData$bindingDensityQC[i] = bindingDensityQC(rcc,.05,2.25)
pData$limitOfDetectionQC[i] = limitOfDetectionQC(rcc)
pData$positiveLinearityQC[i] = positiveLinQC(rcc)
}
## Munge pData and metadata together; create treatment classes
pData$SampleID <- meta_df$SampleID
pData <- dplyr::left_join(pData, meta_df, by = "SampleID")
rownames(pData) <- pData$SampleID
colnames(raw_expression)[3:82] <- pData$SampleID
head(pData)
```
## Quality Control Analysis
No QC flags are set on any of the 80 samples:
```{r, checkQCflags}
pData[,8:11]
```
This hidden code block does more setup and preparation for analysis.
```{r, make_pData, warning=FALSE}
raw = raw_expression[,-c(1:2)]
fData = raw_expression[,c(1:2)]
rownames(raw) = fData$Gene
cIdx <- fData$Gene[fData$Class == "Housekeeping"]
pData$HK_Gene_Miss = colSums(raw[cIdx,] == 0)
rownames(fData) = fData$Gene
rownames(raw) = fData$Gene
#rownames(pData) = colnames(raw)
#
```
### Check Housekeeping Genes
So let's proceed to checking whether housekeeping genes correlate with treatment. RUV-seq normalization assumes that housekeeping genes DO NOT correlate with treatment or biology of interest.
```{r,housekeeping}
#### CHECK IF housekeeping Genes ARE ASSOCIATED WITH PRIMARY PHENO
hk_raw = raw[cIdx,]
pval = vector(length = nrow(hk_raw))
require(MASS)
for (i in 1:nrow(hk_raw)){
reg = glm.nb(as.numeric(hk_raw[i,]) ~ as.factor(pData$status_ov))
pval[i] = coef(summary(reg))[2,4]
}
```
### P-values
These are the p-values of the relationship between the housekeeping genes and the treatment groups. There appears to be no strongly significant correlation between the housekeeping miRNAs and the obese/lean status (this is good!)
```{r, pval_hk}
pval <- pval %>% as_tibble() %>% mutate(hkgene = rownames(hk_raw))
pval
```
### Check limit of detection
We want to define how many genes are below the limit of detection (i.e., >1SD below the negative control mean).
```{r, blod}
neg_raw <- raw %>% filter(row.names(raw) %in% c("NEG_A","NEG_B","NEG_C","NEG_D","NEG_E","NEG_F","NEG_G","NEG_H"))
lod <- colMeans(neg_raw) - apply(neg_raw,1,sd)
num_endo_blod <- colSums(raw_expression[raw_expression$Class == "Endogenous1", -c(1:2)] < lod)
num_hk_blod <- colSums(raw_expression[raw_expression$Class == "Housekeeping", -c(1:2)] < lod)
## Check the BLOD numbers by batch
df <- data.frame(below_lim = num_endo_blod, batch = pData$batch, name=pData$SampleID)
p <- ggbarplot(df, y="below_lim", x="name", color="batch", fill="batch")
p <- p + ggtitle("Number Endogenous miRNAs Below Limit of Detection by Batch")
p
```
We can clearly see a batch effect with Batch 4 having larger than normal "Below Limit of Detection" miRNAs.
## Apply RUVg normalization
Now, we remove correlated negative control genes and apply RUVg normalization:
```{r,RUV_normalization, results='hide', message=FALSE}
## k = 1,2, and 3
## k is the number of sources of technical confounding (i.e., suspected batches)
norm.dat.k1 <- NanoNormIter::RUV_total(raw,pData,fData,k = 1)
norm.dat.k2 <- NanoNormIter::RUV_total(raw,pData,fData,k = 2)
norm.dat.k3 <- NanoNormIter::RUV_total(raw,pData,fData,k = 3)
```
### RLE analysis {.tabset}
Here, we visualize the raw and RUV normalized (k=1,2, or 3; k is the number of sources of suspected technical confounding) datasets using an RLE plot.An RLE plot shows the boxplots of the log-ratios of the gene-level read counts of each sample to those of a reference sample (defined as the median across the samples). Ideally, the distributions should be centered around the zero line and as tight as possible. Clear deviations indicate the need for normalization and/or the presence of outlying samples.
The RUVg function returns two pieces of information: the estimated factors of unwanted variation (added as columns to the phenoData slot of set) and the normalized counts obtained by regressing the original counts on the unwanted factors. The normalized values are stored in the normalizedCounts slot of set and can be accessed with the normCounts method. These counts should be used only for exploration. It is important that subsequent DE analysis be done on the original counts (accessible through the counts method), as removing the unwanted factors from the counts can also remove part of a factor of interest [6].
#### Raw Data
These are the raw, non-normalized counts. Y-axis is the median deviation of log expression. We can see systematic deviations from the median line (zero). This is bad and indicates a definite batch effect.
```{r, vis_raw_RLE}
## Raw data
EDASeq::plotRLE(as.matrix(raw), cex.lab=0.5)
```
#### RUVg (k=1)
These are RUVSeq normalized with K=1. The batch effect in samples ~30-40 has been removed.
```{r, vis_RUV_RLE_1}
## K=1
EDASeq::plotRLE(norm.dat.k1$set)
```
#### RUVg (k=2)
These are RUVSeq normalized with k=2.
```{r, vis_RUV_RLE_2}
## K=2
EDASeq::plotRLE(norm.dat.k2$set)
```
#### RUVg (k=3)
These are RUVSeq normalized with k=3.
```{r, vis_RUV_RLE_3}
## K=3
EDASeq::plotRLE(norm.dat.k3$set)
```
### {-}
### PCA analysis {.tabset}
Here were look at the PCA plots for the raw (non-normalized), and RUVg normalized data with k=1,2,or3
#### PCA raw
Here, there is a huge batch effect, with Batch 4 distinct from 1-3 and 5-9. We see this also in the number of below limit of detection miRNAs (above), and in the individual count plots as well.
```{r, pca_plots, dpi=300, fig.width=9}
## Raw data
# making variance stabilized raw counts for later PCA plotting
dds_raw_counts <- DESeqDataSetFromMatrix(as.matrix(raw),colData=pData,design=~status_ov)
vst_raw_counts <- DESeq2::varianceStabilizingTransformation(dds_raw_counts)
pcaExplorer::pcaplot(vst_raw_counts, intgroup = "batch", ellipse = FALSE, text_labels = TRUE)
pcaExplorer::pcaplot(vst_raw_counts, intgroup = "status_ov", ellipse = FALSE, text_labels = FALSE)
pcaExplorer::pcaplot(vst_raw_counts, intgroup = "status_ob", ellipse = FALSE, text_labels = FALSE)
```
#### PCA RUV (k=1)
Removing one batch variable with RUVg lessens the batch effect somewhat.
```{r, pca2}
## RUVg norm, k=1
pcaExplorer::pcaplot(norm.dat.k1$vsd, intgroup = "batch", ellipse = FALSE, text_labels = FALSE)
pcaExplorer::pcaplot(norm.dat.k1$vsd, intgroup = "status_ov", ellipse = FALSE, text_labels = FALSE)
pcaExplorer::pcaplot(norm.dat.k1$vsd, intgroup = "status_ob", ellipse = FALSE, text_labels = FALSE)
```
#### PCA RUV (k=2)
Removing two batch effects (variables) creates a more reasonable looking PCA plot. Unfortunately, obese and non-obese do not cluster (this doesn't mean that we won't find some DE miRNAs)
```{r,pca3}
## RUVg norm, k=2
pcaExplorer::pcaplot(norm.dat.k2$vsd, intgroup = "batch", ellipse = FALSE, text_labels = FALSE)
pcaExplorer::pcaplot(norm.dat.k2$vsd, intgroup = "status_ov", ellipse = FALSE, text_labels = FALSE)
pcaExplorer::pcaplot(norm.dat.k2$vsd, intgroup = "status_ob", ellipse = FALSE, text_labels = FALSE)
```
#### PCA RUV (k=3)
Removing 3 batch variables may be overkill. We want to be careful not to overfit the data or remove biological effects.
```{r,pca4}
## RUVg norm, k=3
pcaExplorer::pcaplot(norm.dat.k3$vsd, intgroup = "batch", ellipse = FALSE, text_labels = FALSE)
pcaExplorer::pcaplot(norm.dat.k3$vsd, intgroup = "status_ov", ellipse = FALSE, text_labels = FALSE)
```
### {-}
# DESeq2 DE analysis
## Create DESEQ2 object (Overweight)
Let's test the DE analysis on these data using RUV normalized data (k=2)
```{r, DESeq2, class.source = 'fold-show'}
dds <- DESeqDataSetFromMatrix(countData = counts(norm.dat.k2$set[26:828,]),
colData = pData(norm.dat.k2$set),
design = ~ W_1 + W_2 + status_ov)
#####
####W _1 and W_2 basically represent the technical batches in the data. The counts used here for DESeq2 are raw counts, not normalized. So we use ##### RUVg to learn factors of "unwanted variation" and then account for those in the linear model, rather than use the post-normalization counts.
#####
dds <- DESeq(dds)
```
## Analyze DE changes
### Overweight vs Lean
We are comparing samples with prepreg BMI < 25 (lean) and BMI >=25 (Overweight and Obese).
There samples break down as follows:
Lean Obese Overweight
47 18 15
```{r, res1}
res_OV <- as.data.frame(results(dds,contrast = c('status_ov','Over_Obese','Lean')))
```
### DEG table
```{r, table1}
## Table of top DE genes
res_OV_sig <- res_OV %>%
arrange(padj) %>%
filter(padj < 0.1) %>%
dplyr::select(c(padj, baseMean, log2FoldChange)) %>%
mutate(across(c(1:3), ~round(.x, digits=3)))
res_OV_sig
```
There are `r dim(res_OV_sig)[1]` miRNAs with adjusted p < 0.1.
As we can see from the count plots below, a few high count outliers are driving a lot of the difference, although it appears the group means would probably still be different without the outliers.
```{r, counts1, class.source = 'fold-show'}
## Count plots of individual top DE genes
df <- plotCounts(dds, gene = "hsa-miR-630|0", intgroup = "status_ov", returnData = TRUE)
p <- ggboxplot(df,
x='status_ov',
y='count',
fill = "status_ov",
add = "jitter",
label = rownames(df),
label.select = list(criteria = "`y` > 1300"),
repel = TRUE)
p <- p + ggtitle("hsa-miR-630|0 Counts by Status")
p
df <- plotCounts(dds, gene = "hsa-miR-642a-3p|0", intgroup = "status_ov", returnData = TRUE)
p <- ggboxplot(df,
x='status_ov',
y='count',
fill = "status_ov",
add = "jitter",
label = rownames(df),
label.select = list(criteria = "`y` > 130"),
repel = TRUE)
p <- p + ggtitle("hsa-miR-642a-3p|0 Counts by Status")
p
df <- plotCounts(dds, gene = "hsa-miR-575|0", intgroup = "status_ov", returnData = TRUE)
p <- ggboxplot(df,
x='status_ov',
y='count',
fill = "status_ov",
add = "jitter",
label = rownames(df),
label.select = list(criteria = "`y` > 1300"),
repel = TRUE)
p <- p + ggtitle("hsa-miR-575|0 Counts by Status")
p
df <- plotCounts(dds, gene = "hsa-miR-223-3p|0", intgroup = "status_ov", returnData = TRUE)
p <- ggboxplot(df,
x='status_ov',
y='count',
fill = "status_ov",
add = "jitter",
label = rownames(df),
label.select = list(criteria = "`y` > 150"),
repel = TRUE)
p <- p + ggtitle("hsa-miR-223-3p|0 Counts by Status")
p
```
```{r, volPlot}
df <- res_OV %>% mutate(gene_name = rownames(.)) %>% filter(padj < 0.1)
p <- do_vol_plot(df = df, sig = 0.05, fc = 0.3, size = 4)
p <- p + ggtitle("Overweight vs Lean Volcano plot")
p
```
## Create DESEQ2 object (Obese)
Let's test again looking at just obese vs lean (defined as > 30 BMI and < 25 BMI).
Only the lean and obese are tested against each other in this analysis.
```{r, DESeq2_OB, class.source = 'fold-show'}
dat <- norm.dat.k2$set
dat_ob <- dat[,dat$status_ob %in% c("Lean","Obese")]
dds_ob <- DESeqDataSetFromMatrix(countData = counts(dat_ob[26:828,]),
colData = pData(dat_ob),
design = ~ W_1 + W_2 + status_ob) #since we largely normalized out our technical batches, we will not include it explicitly in the design formula
dds_ob <- DESeq(dds_ob)
```
In this way of subsetting the data, there are: `r table(dat$status_ob)`
## Analyze DE changes
### Obese vs Lean
```{r, res2}
res_OB <- as.data.frame(results(dds_ob,contrast = c('status_ob','Obese','Lean')))
```
### DEG table
```{r, table2}
## Table of top DE genes
res_OB_sig <- res_OB %>%
arrange(padj) %>%
filter(padj < 0.1) %>%
dplyr::select(c(padj, baseMean, log2FoldChange)) %>%
mutate(across(c(1:3), ~round(.x, digits=3)))
res_OB_sig
```
There are `r dim(res_OB_sig)[1]` DE miRNAs in this analysis at p_adj < 0.1.
```{r, countplots}
df <- plotCounts(dds_ob, gene = "hsa-miR-575|0", intgroup = "status_ob", returnData = TRUE)
p <- ggboxplot(df,
x='status_ob',
y='count',
fill = "status_ob",
add = "jitter",
label = rownames(df),
label.select = list(criteria = "`y` > 150"),
repel = TRUE)
p <- p + ggtitle("hsa-miR-575|0 Counts by Status")
p
df <- plotCounts(dds_ob, gene = "hsa-miR-630|0", intgroup = "status_ob", returnData = TRUE)
p <- ggboxplot(df,
x='status_ob',
y='count',
fill = "status_ob",
add = "jitter",
label = rownames(df),
label.select = list(criteria = "`y` > 150"),
repel = TRUE)
p <- p + ggtitle("hsa-miR-630|0 Counts by Status")
p
df <- plotCounts(dds_ob, gene = "hsa-miR-30c-5p|0", intgroup = "status_ob", returnData = TRUE)
p <- ggboxplot(df,
x='status_ob',
y='count',
fill = "status_ob",
add = "jitter",
label = rownames(df),
label.select = list(criteria = "`y` > 150"),
repel = TRUE)
p <- p + ggtitle("hsa-miR-30c-5p|0 Counts by Status")
p
df <- plotCounts(dds_ob, gene = "hsa-miR-582-3p|0", intgroup = "status_ob", returnData = TRUE)
p <- ggboxplot(df,
x='status_ob',
y='count',
fill = "status_ob",
add = "jitter",
label = rownames(df),
label.select = list(criteria = "`y` > 50"),
repel = TRUE)
p <- p + ggtitle("hsa-miR-582-3p|0 Counts by Status")
p
```
```{r, volPlot2}
df <- res_OB %>% mutate(gene_name = rownames(.)) %>% filter(padj < 0.1)
p <- do_vol_plot(df = df, sig = 0.05, fc = 0.3, size = 4)
p <- p + ggtitle("Obese vs Lean Volcano plot")
p
```
### Venn overlap of overweight vs obese miRNAs
Here we are comparing miRNAs found as DE between overweight (+obese) vs lean, and just obese (excluding overweight) vs lean.
The "obese" set contains all but 4 miRNAs found in the "overweight" set and also contains 25 other miRNAs.
```{r, VENN}
library(Vennerable)
v <- Vennerable::Venn(list(obese = rownames(res_OB_sig), overweight = rownames(res_OV_sig)))
plot(v, doWeights = TRUE)
```
## Create DESEQ2 object (Milk col date)
Let's test the DE analysis on these data using RUV normalization factors (k=2) and controlling for milk date as a factor. What I'm doing here is using quartiles to place the milk collection dates into four equal groups and incorporating that into the model.
"Milk collection factor" here are 4 equal quartile "bins" of the collection dates.
```{r}
milk_df <- data.frame(milk_col_date = pData$milk_col_dt, milk_col_factor = ntile(pData$milk_col_dt, 4), batch = pData$batch)
p4 <- ggplot(milk_df, aes(y=batch, fill=as.factor(milk_col_factor))) + geom_bar(position="fill") + ggtitle("Milk Collection Quartile by Batch")
p4
```
You can see that the batches 1-4 are biased toward later collection dates, while batches 6-9 are biased toward earlier.
```{r, DESeq2_OB2, class.source = 'fold-show'}
dat <- norm.dat.k2$set
dat_ob <- dat[,dat$status_ob %in% c("Lean","Obese")]
pData_2 <- pData(dat_ob)
pData_2 <- pData_2 %>% as_tibble() %>% mutate(milk_col_factor = as.factor(ntile(milk_col_dt, 4))) #Quartile bins of collection date
dds_ob <- DESeqDataSetFromMatrix(countData = counts(dat_ob[26:828,]),
colData = pData_2,
design = ~ W_1 + W_2 + milk_col_factor + status_ob)
dds_ob <- DESeq(dds_ob)
```
## Analyze DE accounting for milk collection date
### Obese vs Lean
We are comparing samples with prepreg BMI < 25 (lean) and BMI > 30 (Obese).
```{r}
res_OB_MCF <- as.data.frame(results(dds_ob,contrast = c('status_ob','Obese','Lean')))
```
### DEG table
```{r}
## Table of top DE genes
res_OB_MCF_sig <- res_OB_MCF %>%
arrange(padj) %>%
filter(padj < 0.1) %>%
dplyr::select(c(padj, baseMean, log2FoldChange)) %>%
mutate(across(c(1:3), ~round(.x, digits=3)))
res_OB_MCF_sig
```
```{r, VENN2}
v <- Vennerable::Venn(list(obese = rownames(res_OB_sig), obese_MCF = rownames(res_OB_MCF_sig)))
plot(v, doWeights = TRUE)
```
There are only 4 new DE miRNAs detected by accounting for milk collection date. Those are:
"hsa-miR-137|0" "hsa-miR-381-3p|0" "hsa-miR-3195|0" "hsa-miR-3928-3p|0"
## Analyze DE by collection time
Controlled for BMI, we create a contrast on collection time. I am using an LRT model for time-dependent miRNAs.
In this analysis, I am using the Likelihood Ratio Test to test a model fit to `W_1 + W_2 + milk_col_dt + status_ov` vs a reduced model with just `W_1 + W_2 + status_ov`. miRNAs that are DE here are those that are better fit by accounting for milk collection date, all other factors being equal. I am using overweight/obese vs lean as a factor for BMI status in this analysis.
I am also testing the LRT again using milk collection factor (a discrete factor instead of continuous integer value) .
```{r, DESeq2_LRT, class.source = 'fold-show'}
#### Milk collection date (continuous)
dat <- norm.dat.k2$set
dds <- DESeqDataSetFromMatrix(countData = counts(dat[26:828,]),
colData = pData(dat),
design = ~ W_1 + W_2 + milk_col_dt + status_ov) #since we largely normalized out our technical batches, we will not include it explicitly in the design formula
dds <- DESeq(dds, test = "LRT", reduced = ~ W_1 + W_2 + status_ov)
res_col_dt_LRT <- results(dds) %>% as.data.frame() %>% filter(padj < 0.05) %>% arrange(padj)
#### Milk collection factor (discrete quartiles)
dat <- norm.dat.k2$set
pData_2 <- pData(dat) %>% as_tibble() %>% mutate(milk_col_factor = as.factor(ntile(milk_col_dt, 4))) #Quartile bins of collection date
dds <- DESeqDataSetFromMatrix(countData = counts(dat[26:828,]),
colData = pData_2,
design = ~ W_1 + W_2 + milk_col_factor + status_ov) #s
dds <- DESeq(dds, test = "LRT", reduced = ~ W_1 + W_2 + status_ov)
res_col_fac_LRT <- results(dds) %>% as.data.frame() %>% filter(padj < 0.05) %>% arrange(padj)
```
These are the miRNAs that are DE with collection date, controlling for BMI:
```{r}
res_col_fac_LRT
```
In these plots, we see that:
1. Whether we use the continuous variable "milk collection date" or a quartile collection "factor" you find many of the same miRNAs. There are more DE miRNAs found using the factor.
2. DE miRNAs on BMI (Obese vs Lean) controlling for milk collection date do not substantially overlap with DE miRNAs on collection date (controlling for BMI). This is good.
```{r}
v <- Vennerable::Venn(list(milk_col_day = rownames(res_col_dt_LRT), milk_col_quartiles = rownames(res_col_fac_LRT)))
plot(v, doWeights = TRUE)
v <- Vennerable::Venn(list(obese = rownames(res_OB_MCF_sig), milk_col_quartiles = rownames(res_col_fac_LRT)))
plot(v, doWeights = TRUE)
```
## Counts by BMI and Milk Collection Quartile
```{r, plot_counts_by_time}
### dds here is the Deseq2 object calculated on milk_col_factor using the LRT
df <- plotCounts(dds, gene = "hsa-miR-579-3p|0.128", intgroup = c("milk_col_factor", "status_ov"), returnData = TRUE)
p <- ggboxplot(df,
x='milk_col_factor',
y='count',
color = "status_ov",
add = "jitter")
#label = rownames(df),
#label.select = list(criteria = "`y` > 50"),
#repel = TRUE)
p <- p + ggtitle(" hsa-miR-579-3p|0.128 Counts by Status and Milk Col. Date Quartile")
p
df <- plotCounts(dds, gene = "hsa-miR-502-5p|0.005", intgroup = c("milk_col_factor", "status_ov", "batch"), returnData = TRUE)
p <- ggboxplot(df,
x='milk_col_factor',
y='count',
color = "batch",
add = "jitter")
#label = rownames(df),
#label.select = list(criteria = "`y` > 50"),
#repel = TRUE)
p <- p + ggtitle(" hsa-miR-502-5p|0.005 Counts by Status and Milk Col. Date Quartile")
p