nanostring_deseq2_analysis_cho_jan2022.Rmd

---
title: "NanoString NCounter analysis: Breastmilk miRNA Obese vs Normals"
author: "Michael S Chimenti"
date: "`r format(Sys.time(), '%d %B %Y')`"
output:
  html_document:
    df_print: paged
    toc: true
    toc_depth: 2
    toc_float: true
    number_sections: true
    code_folding: hide
---


```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```


# Introduction

## Project Summary 

**Our central hypothesis is that altered breast milk EV (bEV)-miRNA profiles in obese mothers affect the gene expression of intestinal epithelial cells (IECs), which might be associated with unfavorable health outcomes in children. Our long-term goal is to promote the health and development of children by improving the quality of breast milk through interventions and therapeutic strategies. The objectives of this study to achieve the long-term goal are 1) to understand the physiological roles of bEVs associated with maternal obesity, and 2) to establish a research team investigating bEVs with collaborators for the further studies. To attain the overall objectives, we propose the following specific aims:

Aim1. To determine differentially expressed bEV-miRNAs in obese mothers; We will recruit normal weight mothers and overweight/obese mothers and who have delivered a baby in the past one month. EV-miRNA expression profile will be analyzed with Nanostring Human miRNA Expression Panel. It will allow us to screen 798 biologically relevant miRNAs, which will enable us to find novel biomarkers related to maternal obesity. Our working hypothesis is that immune-response and metabolism-related miRNAs are altered in overweight/obese mother breast milk.**

## Methods 

Nanostring NCounter data were obtained as “RCC” format files. These were imported into R with 'readRcc' function from the NanoStringQCPro package (Bioconductor; https://www.bioconductor.org/packages/release/bioc/html/NanoStringQCPro.html; Bourgon et. al.).  Here, we follow the NanoString RUV normalization procedure outlined in Love, et. al (ref), for quality control checks and iterative RUVg normalization (Risso, et. al.).  All samples passed QC checks and were included in the analysis.  After visualization of the normalized data with RLE and PCA plots, RUVg-normalized data (k=2) was chosen for downstream DEG analysis.  A DESeq2 'dds' object was created from the normalized data using ‘DESeqDataSetFromMatrix’ with design conditioned on a BMI-cutoff factor variable, milk collection date quartile, and the RUVg-learned 'W1' and 'W2' correction factors for technical batch. R code for the analysis is available on Github.    

Nickles D, Sandmann T, Ziman R, Bourgon R (2021). NanoStringQCPro: Quality metrics and data processing methods for NanoString mRNA gene expression data. R package version 1.26.0.

Bhattacharya A, Hamilton AM, Furberg H, et al. An approach for normalization and quality control for NanoString RNA expression data. Brief Bioinform. 2021;22(3):bbaa163. doi:10.1093/bib/bbaa163

Risso D, Ngai J, Speed T, Dudoit S (2014). “Normalization of RNA-seq data using factor analysis of control genes or samples.” Nature Biotechnology, 32(9), 896–902. In press, http://www.nature.com/nbt/journal/v32/n9/full/nbt.2931.html.

Love MI, Huber W, Anders S (2014). “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology, 15, 550. doi: 10.1186/s13059-014-0550-8.

## Library imports 
Click the "Code" button to see hidden code.  
```{r, imports, warning=FALSE, message=FALSE, class.source = 'fold-show'}

## Format and plotting 
library(ggrepel)
library(kableExtra)
library(pcaExplorer)
library(Vennerable)
require(ggplot2)

## Nanostring specific 
library(RUVSeq)
library(NanoStringQCPro)
library(NanoNormIter)


## DE analysis 
library(DESeq2)
library(limma)
library(matrixStats)
library(EnvStats)

## TidyR
library(tidyverse)
library(magrittr)

```

## Function definitions 

```{r, function_defs}
## Volcano Plot

do_vol_plot <- function(df, sig=0.05, fc=0.5, size = 4){
  
  df_sig<- df %>% filter(padj < sig)
  
  ## genes with labels -- signficant,decreased expression and start with IGF
  
  df_label<- df %>% 
              filter(padj < sig, abs(log2FoldChange) > fc)
              #mutate(short_id = stringr::str_split(gene_id, "000000") %>% map_chr(.,2))
  
  
  ## plot
  
  p <- ggplot(df, aes(log2FoldChange, -log10(padj)))  +
      geom_point(size=0.8, color="black", alpha=.8) + 
      geom_point(size=0.8, data=df_sig, aes(log2FoldChange, -log10(padj)), colour="red") +
       geom_text_repel(size= size, 
                    colour="black",
                    segment.size=0.1, 
                    nudge_x=0.06, 
                    nudge_y=0.06, 
                    data=df_label,
                    aes(log2FoldChange, -log10(padj), label=gene_name), 
                    max.iter= 200, 
                    point.padding = 0.15, 
                    segment.alpha = 1, 
                    box.padding=.15,
                    min.segment.length = unit(0.15, 'lines'),size=2.5) +
    theme(
      axis.text.x = element_text(size=12),
      axis.text.y = element_text(size=12),
      axis.title.x = element_text(size=14, margin = margin(t = 10, r = 0, b = 10, l = 0)),
      axis.title.y = element_text(size=14, margin = margin(t = 0, r = 10, b = 0, l = 10)),
      plot.margin =unit(c(.5,.5,.5,.5),"cm"),
      plot.title = element_text(size = 11)
    )
  
  return (p)
}

# Function to add correlation coefficients
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...) {
    usr <- par("usr")
    on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    Cor <- abs(cor(x, y)) # Remove abs function if desired
    txt <- paste0(prefix, format(c(Cor, 0.123456789), digits = digits)[1])
    if(missing(cex.cor)) {
        cex.cor <- 0.4 / strwidth(txt)
    }
    text(0.5, 0.5, txt,
         cex = 1 + cex.cor * Cor) # Resize the text by level of correlation
}

#### THESE FUNCTIONS ARE IMPORTED BY 'NANONORMITER' PACKAGE
# FOR REFERENCE, HERE ARE THE QC CHECKS: 
# imagingQC <- function(rcc){
# 
#     fovRatio = as.numeric(rcc$Lane_Attributes[3]) / as.numeric(rcc$Lane_Attributes[2])
#     if (!(fovRatio > .75)) {return('Flag')}
#     if (fovRatio > .75) {return('No flag')}
# 
# }
# 
# #### INPUT: rcc - input from rcc (use readRcc from NanoStringQCPro)
# ####         low, high - the lower and upper limits for binding density
# #### OUTPUT: flag for binding density
# 
# bindingDensityQC <- function(rcc,low,high){
# 
#     bd = as.numeric(rcc$Lane_Attributes[6])
#     if(!(bd < high & bd > low)) {return('Flag')}
#     if (bd < high & bd > low) {return('No flag')}
# 
# 
# }
# 
# #### INPUT: rcc - input from rcc (use readRcc from NanoStringQCPro)
# #### OUTPUT: flag for linearity for positive controls
# 
# positiveLinQC <- function(rcc){
# 
#     counts = rcc$Code_Summary
#     posControls = as.numeric(counts$Count[grepl('POS_',counts$Name)])
#     known = c(128,128/4,128/16,128/64,128/256,128/(256*4))
#     r2 = summary(lm(sort(posControls)~sort(known)))$r.squared
#     if(!(r2 > .95) | is.na(r2)) {return('Flag')}
#     if(r2 > .95) {return('No flag')}
# 
# }
# 
# #### INPUT: rcc - input from rcc (use readRcc from NanoStringQCPro)
# ####         numSD - number of standard deviations to calibrate the LOD
# #### OUTPUT: flag for limit of detection
# 
# limitOfDetectionQC <- function(rcc,numSD = 0){
#
#     counts = rcc$Code_Summary
#     posE = as.numeric(counts$Count[counts$Name == 'POS_E'])
#     negControls = as.numeric(counts$Count[grepl('NEG',counts$Name)])
#     if(!(posE > mean(negControls) + numSD*sd(negControls))) {return('Flag')}
#     if (posE > mean(negControls) + numSD*sd(negControls)) {return('No flag')}
# 
# }
##### HERE IS THE RUV CALCULATION AS IMPLEMENTED IN NANONORMITER()
# RUV_total <- function(raw,pData,fData,k,hkgenes = NULL,exclude = NULL){
# 
#     library(RUVSeq)
#     library(DESeq2)
#     library(limma)
#     library(matrixStats)
# 
#     if (!is.null(hkgenes)){
# 
#         fData(set)$Class[rownames(set) %in% hkgenes] = 'Housekeeping'
# 
#     }
# 
#     fData = fData[rownames(raw),]
#     int = intersect(rownames(raw),rownames(fData))
#     fData = fData[int,]
#     raw = raw[int,]
# 
#     set <- newSeqExpressionSet(as.matrix(round(raw)),
#                                phenoData=pData,
#                                featureData=fData)
# 
#     cIdx <- rownames(set)[fData(set)$Class == "Housekeeping"]
#     cIdx = cIdx[!(cIdx %in% exclude)]
#     x <- as.factor(pData$Group)
#     set <- betweenLaneNormalization(set, which="upper")
#     set <- RUVg(set, cIdx, k=k)
#     dds <- DESeqDataSetFromMatrix(counts(set),colData=pData(set),design=~1)
#     rowData(dds) <- fData
#     dds <- estimateSizeFactors(dds)
#     dds <- estimateDispersionsGeneEst(dds)
#     cts <- counts(dds, normalized=TRUE)
#     disp <- pmax((rowVars(cts) - rowMeans(cts)),0)/rowMeans(cts)^2
#     mcols(dds)$dispGeneEst <- disp
#     dds <- estimateDispersionsFit(dds, fitType="mean")
#     vsd <- varianceStabilizingTransformation(dds, blind=FALSE)
#     mat <- assay(vsd)
#     covars <- as.matrix(colData(dds)[,grep("W",colnames(colData(dds))),drop=FALSE])
#     mat <- removeBatchEffect(mat, covariates=covars)
#     assay(vsd) <- mat
#     return(list(set = set,vsd = vsd))
# 
# }

```

# Exploratory Data Analysis 

## Import metadata

```{r, import_meta, class.source = 'fold-show'}

meta_df_full <- readxl::read_xlsx('cleaned_metadata.xlsx', sheet = 'Sheet1', col_names = TRUE, col_types = "guess")
meta_df <- meta_df_full %>% dplyr::select(c(`file name`, batch, `study ID`, Age, Race, `prepregnancy BMI`, `current  BMI`, `Breast milk colllection date post delivary`, `Delivery type`))

colnames(meta_df) <- c("filename","batch","SampleID","age","race","prepreg_BMI","curr_BMI","milk_col_dt", "delivery")
meta_df$batch[28:40] <- "4"  #fix the singleton batch 

```

Let's do some exploratory analysis on the metadata before starting on the Nanostring analysis. We're looking for correlated variables or correlations between batches and variables. 

## Exploratory Plots 

### Correlation between pre-preg BMI and delivery? 
```{r, meta_EDA}
library(ggpubr)
# Box plots with jittered points
# :::::::::::::::::::::::::::::::::::::::::::::::::::
# Change outline colors by groups: dose
# Use custom color palette
# Add jitter points and change the shape by groups
 p <- ggboxplot(meta_df, x = "delivery", y = "prepreg_BMI",
                color = "delivery", palette =c("#00AFBB", "#E7B800","#FC4E07"),
                add = "jitter")
 
 my_comps <- list( c("vaginal", "vbac"), c("vaginal","c-sec"), c("vbac","c-sec"))
 p <- p + stat_compare_means(comparisons = my_comps) + stat_compare_means(label.y=50)
 p

```
 
### Correlation between batch and mean BMI

Sample batch appears to be at possibly partially confounded with mean BMI in both pre- and post-pregnancy BMI. We may want to try excluding batches 6 and 7 to see if discovered DE miRNAs are robust to the presence of these batches with higher mean BMI.  

However, when we look at % overweight after classifying according to BMI > 24.25, batches 6-8 aren't too far from the others.  Batch 9 has only lean samples.  

```{r, meta_EDA2}
p1 <- ggboxplot(meta_df, x = "batch", y = "prepreg_BMI",
                color = "batch",
                add = "jitter")
 
p1 


p2 <- ggboxplot(meta_df, x = "batch", y = "curr_BMI",
                color = "batch",
                add = "jitter")
 
p2

## Overweight >= 25.  Obese >=30.
meta_df <- meta_df %>% dplyr::mutate(
  status_ov = dplyr::case_when(
    .data$prepreg_BMI < 25 ~ "Lean",
    .data$prepreg_BMI >=25 ~ "Over_Obese",
    TRUE ~ NA_character_
    )
) %>% dplyr::mutate(
  status_ob = dplyr::case_when(
    .data$prepreg_BMI < 25 ~ "Lean",
    .data$prepreg_BMI >= 30 ~ "Obese",
    (.data$prepreg_BMI >= 25 & .data$prepreg_BMI < 30) ~ "Overweight"
    )
)  

p3 <- ggplot(meta_df, aes(y=batch, fill=status_ov)) + geom_bar(position="fill") + ggtitle("Fraction Overweight by Batch; BMI > 25")
p3

p4 <- ggplot(meta_df, aes(y=batch, fill=status_ob)) + geom_bar(position="fill") + ggtitle("Fraction Obese by Batch; BMI > 30")

p4

```

### Correlations between dataset variables 

Only pre- and current BMI appear to be correlated in a significant way.  

```{r, corr, class.source = 'fold-show'}


df <- meta_df %>% dplyr::select(c(batch,age,prepreg_BMI,curr_BMI,milk_col_dt)) %>% mutate_at(vars(batch),factor) %>% as.data.frame()
groups <- df[,1]
pairs(df[,2:5],                     # Data frame of variables
      labels = colnames(df[,2:5]),  # Variable names
      pch = 21,                 # Pch symbol
      bg = rainbow(9)[groups],  # Background color of the symbol (pch 21 to 25)
      col = rainbow(9)[groups], # Border color of the symbol
      main = "Breastmilk dataset (batch=color)",    # Title of the plot
      row1attop = TRUE,         # If FALSE, changes the direction of the diagonal
      gap = 1,                  # Distance between subplots
      cex.labels = NULL,        # Size of the diagonal text
      font.labels = 1,
      lower.panel = panel.smooth,
      upper.panel = panel.cor)          # Font style of the diagonal text
```

## Variable distributions 

Here, we look at the distribution of age and BMI in the dataset.  
```{r, dist}
 # Density plot with mean lines and marginal rug
# :::::::::::::::::::::::::::::::::::::::::::::::::::
# Change outline and fill colors by groups ("sex")
# Use custom palette
p <- gghistogram(meta_df, x = "age",
   add = "mean", rug = TRUE, color = "blue")
p

p <- ggdensity(meta_df, x = "prepreg_BMI",
   add = "mean", rug = TRUE, color = "blue")
p

p <- ggdensity(meta_df, x = "curr_BMI",
   add = "mean", rug = TRUE, color = "blue")
p
```


# RUV-seq QC and Normalization 

RUV-seq method of Love et. al. for best-practices normalization and removal of unwanted technical artifacts

Here, we will follow the method of Love, et. al. (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8138885/):  

**"Using DESeq2 [25], we identified genes differentially expressed in NK cells primed by CTV-1 or IL-2 cytokines compared to unprimed NK cells at FDR-adjusted equation M22. The two normalization methods led to a different number of differentially expressed genes with a limited overlap of significant genes by both methods (Figure 4A). The raw equation M23-value histograms from differential expression analysis using nSolver-normalized expression exhibited a slope toward 0 for equation M24-values under 0.3, which can indicate issues with unaccounted-for correlations among samples [42], such as residual technical variation. The distributions of equation M25-values using the RUVSeq-normalized data were closer to uniform throughout the range [0,1] for most genes (Figure 4B). While the log2-fold changes were correlated between the two normalization procedures, the genes found to be differentially expressed only with nSolver-normalized data tended to have large standard errors with RUVSeq-normalized data and therefore not statistically significant using RUVSeq (Figure 4C). These differences in DE results emphasize the importance of properly validating normalization prior to downstream genomic analyses."**

## Import data 

### Create pheno and expr tables
```{r, import_data, class.source = 'fold-show'}
#getwd()
files.RCC = list.files(".", full.names = TRUE)
files.RCC = files.RCC[grepl('RCC',files.RCC)]

head(files.RCC)

ng = nrow(readRcc(files.RCC[1])$Code_Summary)
ncol = length(files.RCC)

raw_expression = as.data.frame(matrix(nrow = ng,ncol = length(files.RCC)+2))
colnames(raw_expression)[1:2] = c('Gene','Class')
pData = as.data.frame(matrix(nrow = length(files.RCC),ncol = 11))
colnames(pData) = c('BCAC_ID','SampleID','Owner','Comments','Date','GeneRLF','SystemAPF','imagingQC',
                    'bindingDensityQC','limitOfDetectionQC','positiveLinearityQC')
raw_expression[,1:2] = readRcc(files.RCC[1])$Code_Summary[,c(2,1)]

head(pData, 5)
head(raw_expression, 5)
```

### Populate empty tables from RCC

```{r, create_pData, message=FALSE,warning=FALSE}

## NOTE: I have to override the positiveLinQC function for this panel b/c it uses two types of positive controls and that confuses the grepl search.  

positiveLinQC <- function(rcc){

     counts = rcc$Code_Summary
     #posControls = as.numeric(counts$Count[grepl('POS_',counts$Name)])
     posControls = as.numeric(counts$Count[grepl('^POS_',counts$Name)])
     known = c(128,128/4,128/16,128/64,128/256,128/(256*4))
     r2 = summary(lm(sort(posControls)~sort(known)))$r.squared
     if(!(r2 > .95) | is.na(r2)) {return('Flag')}
     if(r2 > .95) {return('No flag')}
 
}

## Override this function to account for two types of Neg controls.  
limitOfDetectionQC <- function(rcc,numSD = 0){
  counts = rcc$Code_Summary
  posE = as.numeric(counts$Count[counts$Name == 'POS_E'])
  negControls = as.numeric(counts$Count[grepl('^NEG',counts$Name)])
  if(!(posE > mean(negControls) + numSD*sd(negControls))) {return('Flag')}
  if (posE > mean(negControls) + numSD*sd(negControls)) {return('No flag')}
  }

## 
## Populate pData and raw expr tables from RCC files
for (i in 1:length(files.RCC)){
  #i = 1 
  print(i)
  rcc = readRcc(files.RCC[i])
  raw = rcc$Code_Summary
  
  raw_expression[,i+2] = as.numeric(raw$Count)
  #colnames(raw_expression)[i+2] = strsplit(files.RCC[i],'_')[[1]][2:4]
  pData[i,2:7] = as.vector(rcc$Sample_Attributes)
  pData$imagingQC[i] = imagingQC(rcc)
  pData$bindingDensityQC[i] = bindingDensityQC(rcc,.05,2.25)
  pData$limitOfDetectionQC[i] = limitOfDetectionQC(rcc)
  pData$positiveLinearityQC[i] = positiveLinQC(rcc)
}

## Munge pData and metadata together; create treatment classes
pData$SampleID <- meta_df$SampleID
pData <- dplyr::left_join(pData, meta_df, by = "SampleID")
rownames(pData) <- pData$SampleID 

colnames(raw_expression)[3:82] <- pData$SampleID

head(pData)
```

## Quality Control Analysis

No QC flags are set on any of the 80 samples: 
```{r, checkQCflags}
pData[,8:11]
```

This hidden code block does more setup and preparation for analysis.  
```{r, make_pData, warning=FALSE}
raw = raw_expression[,-c(1:2)]
fData = raw_expression[,c(1:2)]
rownames(raw) = fData$Gene
cIdx <- fData$Gene[fData$Class == "Housekeeping"]
pData$HK_Gene_Miss = colSums(raw[cIdx,] == 0)
rownames(fData) = fData$Gene
rownames(raw) = fData$Gene
#rownames(pData) = colnames(raw)

#
```

### Check Housekeeping Genes

So let's proceed to checking whether housekeeping genes correlate with treatment.  RUV-seq normalization assumes that housekeeping genes DO NOT correlate with treatment or biology of interest.  
```{r,housekeeping}
#### CHECK IF housekeeping Genes ARE ASSOCIATED WITH PRIMARY PHENO
hk_raw = raw[cIdx,]
pval = vector(length = nrow(hk_raw))

require(MASS)

for (i in 1:nrow(hk_raw)){
  
  reg = glm.nb(as.numeric(hk_raw[i,]) ~ as.factor(pData$status_ov))
  pval[i] = coef(summary(reg))[2,4]
  
}

```

### P-values

These are the p-values of the relationship between the housekeeping genes and the treatment groups.  There appears to be no strongly significant correlation between the housekeeping miRNAs and the obese/lean status (this is good!)

```{r, pval_hk}
pval <- pval %>% as_tibble() %>% mutate(hkgene = rownames(hk_raw))
pval 

```
### Check limit of detection 
We want to define how many genes are below the limit of detection (i.e., >1SD below the negative control mean).  

```{r, blod}
neg_raw <- raw %>% filter(row.names(raw) %in% c("NEG_A","NEG_B","NEG_C","NEG_D","NEG_E","NEG_F","NEG_G","NEG_H"))
lod <- colMeans(neg_raw) - apply(neg_raw,1,sd)
num_endo_blod <- colSums(raw_expression[raw_expression$Class == "Endogenous1", -c(1:2)] < lod)
num_hk_blod <- colSums(raw_expression[raw_expression$Class == "Housekeeping", -c(1:2)] < lod)

## Check the BLOD numbers by batch 
df <- data.frame(below_lim = num_endo_blod, batch = pData$batch, name=pData$SampleID)
p <- ggbarplot(df, y="below_lim", x="name", color="batch", fill="batch")
p <- p + ggtitle("Number Endogenous miRNAs Below Limit of Detection by Batch")
p
```

We can clearly see a batch effect with Batch 4 having larger than normal "Below Limit of Detection" miRNAs.  

## Apply RUVg normalization 

Now, we remove correlated negative control genes and apply RUVg normalization: 
```{r,RUV_normalization, results='hide', message=FALSE}

## k = 1,2, and 3 
## k is the number of sources of technical confounding (i.e., suspected batches)
norm.dat.k1 <- NanoNormIter::RUV_total(raw,pData,fData,k = 1)
norm.dat.k2 <- NanoNormIter::RUV_total(raw,pData,fData,k = 2)
norm.dat.k3 <- NanoNormIter::RUV_total(raw,pData,fData,k = 3)
                               
```

### RLE analysis {.tabset}

Here, we visualize the raw and RUV normalized (k=1,2, or 3; k is the number of sources of suspected technical confounding) datasets using an RLE plot.An RLE plot shows the boxplots of the log-ratios of the gene-level read counts of each sample to those of a reference sample (defined as the median across the samples). Ideally, the distributions should be centered around the zero line and as tight as possible. Clear deviations indicate the need for normalization and/or the presence of outlying samples.

The RUVg function returns two pieces of information: the estimated factors of unwanted variation (added as columns to the phenoData slot of set) and the normalized counts obtained by regressing the original counts on the unwanted factors. The normalized values are stored in the normalizedCounts slot of set and can be accessed with the normCounts method. These counts should be used only for exploration. It is important that subsequent DE analysis be done on the original counts (accessible through the counts method), as removing the unwanted factors from the counts can also remove part of a factor of interest [6].

#### Raw Data 

These are the raw, non-normalized counts.  Y-axis is the median deviation of log expression.  We can see systematic deviations from the median line (zero).  This is bad and indicates a definite batch effect.  
```{r, vis_raw_RLE}

## Raw data
EDASeq::plotRLE(as.matrix(raw), cex.lab=0.5)

```

#### RUVg (k=1)

These are RUVSeq normalized with K=1. The batch effect in samples ~30-40 has been removed.  
```{r, vis_RUV_RLE_1}
## K=1
EDASeq::plotRLE(norm.dat.k1$set)

```

#### RUVg (k=2)

These are RUVSeq normalized with k=2. 
```{r, vis_RUV_RLE_2}
## K=2
EDASeq::plotRLE(norm.dat.k2$set)
```

#### RUVg (k=3)

These are RUVSeq normalized with k=3. 
```{r, vis_RUV_RLE_3}
## K=3
EDASeq::plotRLE(norm.dat.k3$set)
```
### {-}

### PCA analysis {.tabset}

Here were look at the PCA plots for the raw (non-normalized), and RUVg normalized data with k=1,2,or3

#### PCA raw 

Here, there is a huge batch effect, with Batch 4 distinct from 1-3 and 5-9. We see this also in the number of below limit of detection miRNAs (above), and in the individual count plots as well.  
```{r, pca_plots, dpi=300, fig.width=9}
## Raw data
# making variance stabilized raw counts for later PCA plotting 
dds_raw_counts <- DESeqDataSetFromMatrix(as.matrix(raw),colData=pData,design=~status_ov)
vst_raw_counts <- DESeq2::varianceStabilizingTransformation(dds_raw_counts)
pcaExplorer::pcaplot(vst_raw_counts, intgroup = "batch", ellipse = FALSE, text_labels = TRUE)
pcaExplorer::pcaplot(vst_raw_counts, intgroup = "status_ov", ellipse = FALSE, text_labels = FALSE)
pcaExplorer::pcaplot(vst_raw_counts, intgroup = "status_ob", ellipse = FALSE, text_labels = FALSE)


```


#### PCA RUV (k=1)

Removing one batch variable with RUVg lessens the batch effect somewhat.
```{r, pca2}
## RUVg norm, k=1
pcaExplorer::pcaplot(norm.dat.k1$vsd, intgroup = "batch", ellipse = FALSE, text_labels = FALSE)
pcaExplorer::pcaplot(norm.dat.k1$vsd, intgroup = "status_ov", ellipse = FALSE, text_labels = FALSE)
pcaExplorer::pcaplot(norm.dat.k1$vsd, intgroup = "status_ob", ellipse = FALSE, text_labels = FALSE)
```

#### PCA RUV (k=2)

Removing two batch effects (variables) creates a more reasonable looking PCA plot.  Unfortunately, obese and non-obese do not cluster (this doesn't mean that we won't find some DE miRNAs)  
```{r,pca3}
## RUVg norm, k=2
pcaExplorer::pcaplot(norm.dat.k2$vsd, intgroup = "batch", ellipse = FALSE, text_labels = FALSE)
pcaExplorer::pcaplot(norm.dat.k2$vsd, intgroup = "status_ov", ellipse = FALSE, text_labels = FALSE)
pcaExplorer::pcaplot(norm.dat.k2$vsd, intgroup = "status_ob", ellipse = FALSE, text_labels = FALSE)
```


#### PCA RUV (k=3)

Removing 3 batch variables may be overkill.  We want to be careful not to overfit the data or remove biological effects.  
```{r,pca4}
## RUVg norm, k=3
pcaExplorer::pcaplot(norm.dat.k3$vsd, intgroup = "batch", ellipse = FALSE, text_labels = FALSE)
pcaExplorer::pcaplot(norm.dat.k3$vsd, intgroup = "status_ov", ellipse = FALSE, text_labels = FALSE)
```

### {-}

# DESeq2 DE analysis 

## Create DESEQ2 object (Overweight)
Let's test the DE analysis on these data using RUV normalized data (k=2)
 
```{r, DESeq2, class.source = 'fold-show'}
dds <- DESeqDataSetFromMatrix(countData = counts(norm.dat.k2$set[26:828,]),
                              colData = pData(norm.dat.k2$set),
                              design = ~ W_1 + W_2 + status_ov)  


#####
####W   _1 and W_2 basically represent the technical batches in the data.  The counts used here for DESeq2 are raw counts, not normalized.  So we use #####  RUVg to learn factors of "unwanted variation" and then account for those in the linear model, rather than use the post-normalization counts. 
#####
dds <- DESeq(dds)

```


## Analyze DE changes

### Overweight vs Lean 

We are comparing samples with prepreg BMI < 25 (lean) and BMI >=25 (Overweight and Obese).

There samples break down as follows: 
Lean           Obese          Overweight 
        47         18         15 
```{r, res1}
res_OV <- as.data.frame(results(dds,contrast = c('status_ov','Over_Obese','Lean')))
```


### DEG table 


```{r, table1}
## Table of top DE genes 
res_OV_sig <- res_OV %>% 
  arrange(padj) %>% 
  filter(padj < 0.1) %>% 
  dplyr::select(c(padj, baseMean, log2FoldChange)) %>% 
  mutate(across(c(1:3), ~round(.x, digits=3)))
  
  
res_OV_sig
```
There are `r dim(res_OV_sig)[1]` miRNAs with adjusted p < 0.1. 

As we can see from the count plots below, a few high count outliers are driving a lot of the difference, although it appears the group means would probably still be different without the outliers.  

```{r, counts1, class.source = 'fold-show'}
## Count plots of individual top DE genes 


df <- plotCounts(dds, gene = "hsa-miR-630|0", intgroup = "status_ov", returnData = TRUE)
p <- ggboxplot(df, 
               x='status_ov', 
               y='count', 
               fill = "status_ov",
               add = "jitter", 
               label = rownames(df), 
               label.select = list(criteria = "`y` > 1300"),
               repel = TRUE) 

p <- p + ggtitle("hsa-miR-630|0 Counts by Status")
p

df <- plotCounts(dds, gene = "hsa-miR-642a-3p|0", intgroup = "status_ov", returnData = TRUE)
p <- ggboxplot(df, 
               x='status_ov', 
               y='count', 
               fill = "status_ov",
               add = "jitter", 
               label = rownames(df), 
               label.select = list(criteria = "`y` > 130"),
               repel = TRUE) 

p <- p + ggtitle("hsa-miR-642a-3p|0 Counts by Status")
p


df <- plotCounts(dds, gene = "hsa-miR-575|0", intgroup = "status_ov", returnData = TRUE)
p <- ggboxplot(df, 
               x='status_ov', 
               y='count', 
               fill = "status_ov",
               add = "jitter", 
               label = rownames(df), 
               label.select = list(criteria = "`y` > 1300"),
               repel = TRUE) 

p <- p + ggtitle("hsa-miR-575|0 Counts by Status")
p

df <- plotCounts(dds, gene = "hsa-miR-223-3p|0", intgroup = "status_ov", returnData = TRUE)
p <- ggboxplot(df, 
               x='status_ov', 
               y='count', 
               fill = "status_ov",
               add = "jitter", 
               label = rownames(df), 
               label.select = list(criteria = "`y` > 150"),
               repel = TRUE) 

p <- p + ggtitle("hsa-miR-223-3p|0 Counts by Status")
p

```

```{r, volPlot}
df <- res_OV %>% mutate(gene_name = rownames(.)) %>% filter(padj < 0.1) 
p <- do_vol_plot(df = df, sig = 0.05, fc = 0.3, size = 4)
p <- p + ggtitle("Overweight vs Lean Volcano plot")
p
```

## Create DESEQ2 object (Obese)
Let's test again looking at just obese vs lean (defined as > 30 BMI and < 25 BMI). 


Only the lean and obese are tested against each other in this analysis.  
 
```{r, DESeq2_OB, class.source = 'fold-show'}

dat <- norm.dat.k2$set
dat_ob <- dat[,dat$status_ob %in% c("Lean","Obese")]
dds_ob <- DESeqDataSetFromMatrix(countData = counts(dat_ob[26:828,]),
                              colData = pData(dat_ob),
                              design = ~ W_1 + W_2 + status_ob)  #since we largely normalized out our technical batches, we will not include it explicitly in the design formula
dds_ob <- DESeq(dds_ob)

```
In this way of subsetting the data, there are: `r table(dat$status_ob)`


## Analyze DE changes

### Obese vs Lean 
```{r, res2}
res_OB <- as.data.frame(results(dds_ob,contrast = c('status_ob','Obese','Lean')))
```


### DEG table 


```{r, table2}
## Table of top DE genes 
res_OB_sig <- res_OB %>% 
  arrange(padj) %>% 
  filter(padj < 0.1) %>% 
  dplyr::select(c(padj, baseMean, log2FoldChange)) %>% 
  mutate(across(c(1:3), ~round(.x, digits=3))) 

res_OB_sig


```

There are `r dim(res_OB_sig)[1]` DE miRNAs in this analysis at p_adj < 0.1.  

```{r, countplots}
df <- plotCounts(dds_ob, gene = "hsa-miR-575|0", intgroup = "status_ob", returnData = TRUE)
p <- ggboxplot(df, 
               x='status_ob', 
               y='count', 
               fill = "status_ob",
               add = "jitter", 
               label = rownames(df), 
               label.select = list(criteria = "`y` > 150"),
               repel = TRUE) 

p <- p + ggtitle("hsa-miR-575|0 Counts by Status")
p

df <- plotCounts(dds_ob, gene = "hsa-miR-630|0", intgroup = "status_ob", returnData = TRUE)
p <- ggboxplot(df, 
               x='status_ob', 
               y='count', 
               fill = "status_ob",
               add = "jitter", 
               label = rownames(df), 
               label.select = list(criteria = "`y` > 150"),
               repel = TRUE) 

p <- p + ggtitle("hsa-miR-630|0 Counts by Status")
p


df <- plotCounts(dds_ob, gene = "hsa-miR-30c-5p|0", intgroup = "status_ob", returnData = TRUE)
p <- ggboxplot(df, 
               x='status_ob', 
               y='count', 
               fill = "status_ob",
               add = "jitter", 
               label = rownames(df), 
               label.select = list(criteria = "`y` > 150"),
               repel = TRUE) 

p <- p + ggtitle("hsa-miR-30c-5p|0 Counts by Status")
p


df <- plotCounts(dds_ob, gene = "hsa-miR-582-3p|0", intgroup = "status_ob", returnData = TRUE)
p <- ggboxplot(df, 
               x='status_ob', 
               y='count', 
               fill = "status_ob",
               add = "jitter", 
               label = rownames(df), 
               label.select = list(criteria = "`y` > 50"),
               repel = TRUE) 

p <- p + ggtitle("hsa-miR-582-3p|0 Counts by Status")
p


```

```{r, volPlot2}
df <- res_OB %>% mutate(gene_name = rownames(.)) %>% filter(padj < 0.1) 
p <- do_vol_plot(df = df, sig = 0.05, fc = 0.3, size = 4)
p <- p + ggtitle("Obese vs Lean Volcano plot")
p
```

### Venn overlap of overweight vs obese miRNAs

Here we are comparing miRNAs found as DE between overweight (+obese) vs lean, and just obese (excluding overweight) vs lean. 

The "obese" set contains all but 4 miRNAs found in the "overweight" set and also contains 25 other miRNAs.  

```{r, VENN}
library(Vennerable)
v <- Vennerable::Venn(list(obese = rownames(res_OB_sig), overweight = rownames(res_OV_sig)))
plot(v, doWeights = TRUE)
```


## Create DESEQ2 object (Milk col date)

Let's test the DE analysis on these data using RUV normalization factors (k=2) and controlling for milk date as a factor.  What I'm doing here is using quartiles to place the milk collection dates into four equal groups and incorporating that into the model.  

"Milk collection factor" here are 4 equal quartile "bins" of the collection dates.  

```{r}
milk_df <- data.frame(milk_col_date = pData$milk_col_dt, milk_col_factor = ntile(pData$milk_col_dt, 4), batch = pData$batch)

p4 <- ggplot(milk_df, aes(y=batch, fill=as.factor(milk_col_factor))) + geom_bar(position="fill") + ggtitle("Milk Collection Quartile by Batch")
p4
```

You can see that the batches 1-4 are biased toward later collection dates, while batches 6-9 are biased toward earlier. 
 
```{r, DESeq2_OB2, class.source = 'fold-show'}

dat <- norm.dat.k2$set
dat_ob <- dat[,dat$status_ob %in% c("Lean","Obese")]
pData_2 <- pData(dat_ob)
pData_2 <- pData_2 %>% as_tibble() %>% mutate(milk_col_factor = as.factor(ntile(milk_col_dt, 4))) #Quartile bins of collection date

dds_ob <- DESeqDataSetFromMatrix(countData = counts(dat_ob[26:828,]),
                              colData = pData_2,
                              design = ~ W_1 + W_2 + milk_col_factor + status_ob)  
dds_ob <- DESeq(dds_ob)

```


## Analyze DE accounting for milk collection date

### Obese vs Lean 

We are comparing samples with prepreg BMI < 25 (lean) and BMI > 30 (Obese).   
```{r}

res_OB_MCF <- as.data.frame(results(dds_ob,contrast = c('status_ob','Obese','Lean')))
```


### DEG table 

```{r}
## Table of top DE genes 
res_OB_MCF_sig <- res_OB_MCF %>% 
  arrange(padj) %>% 
  filter(padj < 0.1) %>% 
  dplyr::select(c(padj, baseMean, log2FoldChange)) %>% 
  mutate(across(c(1:3), ~round(.x, digits=3)))
  
  
res_OB_MCF_sig
```

```{r, VENN2}
v <- Vennerable::Venn(list(obese = rownames(res_OB_sig), obese_MCF = rownames(res_OB_MCF_sig)))
plot(v, doWeights = TRUE)
```

There are only 4 new DE miRNAs detected by accounting for milk collection date.  Those are: 

"hsa-miR-137|0"     "hsa-miR-381-3p|0"  "hsa-miR-3195|0"    "hsa-miR-3928-3p|0"


## Analyze DE by collection time

Controlled for BMI, we create a contrast on collection time.  I am using an LRT model for time-dependent miRNAs. 

In this analysis, I am using the Likelihood Ratio Test to test a model fit to `W_1 + W_2 + milk_col_dt + status_ov` vs a reduced model with just `W_1 + W_2 + status_ov`.   miRNAs that are DE here are those that are better fit by accounting for milk collection date, all other factors being equal. I am using overweight/obese vs lean as a factor for BMI status in this analysis.  

I am also testing the LRT again using milk collection factor (a discrete factor instead of continuous integer value) . 
 
```{r, DESeq2_LRT, class.source = 'fold-show'}

#### Milk collection date (continuous)
dat <- norm.dat.k2$set
dds <- DESeqDataSetFromMatrix(countData = counts(dat[26:828,]),
                              colData = pData(dat),
                              design = ~ W_1 + W_2 + milk_col_dt + status_ov)  #since we largely normalized out our technical batches, we will not include it explicitly in the design formula
dds <- DESeq(dds, test = "LRT", reduced = ~ W_1 + W_2 + status_ov)
res_col_dt_LRT <- results(dds) %>% as.data.frame() %>% filter(padj < 0.05) %>% arrange(padj)

#### Milk collection factor (discrete quartiles)
dat <- norm.dat.k2$set
pData_2 <- pData(dat) %>% as_tibble() %>% mutate(milk_col_factor = as.factor(ntile(milk_col_dt, 4))) #Quartile bins of collection date

dds <- DESeqDataSetFromMatrix(countData = counts(dat[26:828,]),
                              colData = pData_2,
                              design = ~ W_1 + W_2 + milk_col_factor + status_ov)  #s

dds <- DESeq(dds, test = "LRT", reduced = ~ W_1 + W_2 + status_ov)
res_col_fac_LRT <- results(dds) %>% as.data.frame() %>% filter(padj < 0.05) %>% arrange(padj)
```

These are the miRNAs that are DE with collection date, controlling for BMI: 
```{r}
res_col_fac_LRT
```

In these plots, we see that: 

 1. Whether we use the continuous variable "milk collection date" or a quartile collection "factor" you find many of the same miRNAs.  There are more DE miRNAs found using the factor. 
 
 2. DE miRNAs on BMI (Obese vs Lean) controlling for milk collection date do not substantially overlap with DE miRNAs on collection date (controlling for BMI).   This is good.  
 
```{r}
v <- Vennerable::Venn(list(milk_col_day = rownames(res_col_dt_LRT), milk_col_quartiles = rownames(res_col_fac_LRT)))
plot(v, doWeights = TRUE)

v <- Vennerable::Venn(list(obese = rownames(res_OB_MCF_sig), milk_col_quartiles = rownames(res_col_fac_LRT)))
plot(v, doWeights = TRUE)

```

## Counts by BMI and Milk Collection Quartile

```{r, plot_counts_by_time}

### dds here is the Deseq2 object calculated on milk_col_factor using the LRT
df <- plotCounts(dds, gene = "hsa-miR-579-3p|0.128", intgroup = c("milk_col_factor", "status_ov"), returnData = TRUE)
p <- ggboxplot(df, 
               x='milk_col_factor', 
               y='count', 
               color = "status_ov",
               add = "jitter")
               #label = rownames(df), 
               #label.select = list(criteria = "`y` > 50"),
               #repel = TRUE) 

p <- p + ggtitle(" hsa-miR-579-3p|0.128 Counts by Status and Milk Col. Date Quartile")
p

df <- plotCounts(dds, gene = "hsa-miR-502-5p|0.005", intgroup = c("milk_col_factor", "status_ov", "batch"), returnData = TRUE)
p <- ggboxplot(df, 
               x='milk_col_factor', 
               y='count', 
               color = "batch",
               add = "jitter")
               #label = rownames(df), 
               #label.select = list(criteria = "`y` > 50"),
               #repel = TRUE) 

p <- p + ggtitle(" hsa-miR-502-5p|0.005 Counts by Status and Milk Col. Date Quartile")
p

```


## DE analysis subsetting by batch 

From our exploratory analysis, it seems that batches 5-9 are the most homogenous w.r.t each other.  Batch 4 is a definite outlier, and 1-3 are different in important ways as well, including being biased towards later collection dates.  

While I tried interrogating just batches 5-9 for DE miRNA, unfortunately there were no results.  This may be owing to a reduction in power or for some other reason.  I am not sure.  

So here we will just drop batch 4 (4-1 and 4-2) altogether. 


```{r, batch_subset, class.source = 'fold-show'}

dat <- norm.dat.k2$set
dat_batch <- dat[,dat$batch %in% c("1","2","3","5","6","7","8","9")]
pData_batch <- pData(dat_batch)
pData_batch <- pData_batch %>% as_tibble() %>% mutate(milk_col_factor = as.factor(ntile(milk_col_dt, 4))) #Quartile bins of collection date

dds_batch <- DESeqDataSetFromMatrix(countData = counts(dat_batch[26:828,]),
                              colData = pData_batch,
                              design = ~ W_1 + W_2 + status_ob)  
dds_batch <- DESeq(dds_batch)

```

```{r}

res_OB_drop4 <- as.data.frame(results(dds_batch,contrast = c('status_ob','Obese','Lean')))
```


### DEG table 

```{r}
## Table of top DE genes 
res_OB_drop4_sig <- res_OB_drop4 %>% 
  arrange(padj) %>% 
  filter(padj < 0.1) %>% 
  dplyr::select(c(padj, baseMean, log2FoldChange)) %>% 
  mutate(across(c(1:3), ~round(.x, digits=3)))
  
  
res_OB_drop4_sig
```
There are `r dim(res_OB_drop4_sig)[1]` DE miRNAs when dropping batch 4 from the analysis.  

```{r}
v <- Vennerable::Venn(list(obese_milk_factor = rownames(res_OB_MCF_sig), obese_drop_batch4 = rownames(res_OB_drop4_sig)))
plot(v, doWeights = TRUE)
```

These are the 8 in common between the full analysis and the drop batch 4 analysis: 

```{r}
v@IntersectionSets$`11`
```

## DE analysis milk collection day < 40 

In this section, we will look for DE miRNAs using only those samples collected less than or equal to 40 days post-natality.  We will compare obese to normal.  

```{r, milk_dat_subset, class.source = 'fold-show'}

dat <- norm.dat.k2$set
dat_lt40 <- dat[,dat$milk_col_dt <= 40]
pData_lt40 <- pData(dat_lt40)


dds_lt40 <- DESeqDataSetFromMatrix(countData = counts(dat_lt40[26:828,]),
                              colData = pData_lt40,
                              design = ~ W_1 + W_2 + status_ob)  
dds_lt40 <- DESeq(dds_lt40, fit='local')


```
```{r}

res_OB_lt40 <- as.data.frame(results(dds_lt40,contrast = c('status_ob','Obese','Lean')))
```


### DEG table 

```{r}
## Table of top DE genes 
res_OB_lt40_sig <- res_OB_lt40 %>% 
  arrange(padj) %>% 
  filter(padj < 0.1) %>% 
  dplyr::select(c(padj, baseMean, log2FoldChange)) %>% 
  mutate(across(c(1:3), ~round(.x, digits=3)))
  
  
res_OB_lt40_sig
```

Unfortunately, there are only two miRNAs detected this way, with 575|0 being DE owing to one outlier sample.  On the other hand, 205-5p|0 does look promising and was also detected in the "drop batch4" analysis above.  

```{r}
df <- plotCounts(dds_lt40, gene = "hsa-miR-575|0", intgroup = "status_ob", returnData = TRUE)
p <- ggboxplot(df, 
               x='status_ob', 
               y='count', 
               fill = "status_ob",
               add = "jitter", 
               label = rownames(df), 
               label.select = list(criteria = "`y` > 50"),
               repel = TRUE) 

p <- p + ggtitle("hsa-miR-575|0 Counts by Status")
p

df <- plotCounts(dds_lt40, gene = "hsa-miR-205-5p|0", intgroup = "status_ob", returnData = TRUE)
p <- ggboxplot(df, 
               x='status_ob', 
               y='count', 
               fill = "status_ob",
               add = "jitter", 
               label = rownames(df), 
               label.select = list(criteria = "`y` > 50"),
               repel = TRUE) 

p <- p + ggtitle("hsa-miR-205-5p|0 Counts by Status")
p

```

## DE analysis collection date < 40

Feb 22, 2022 request: 

Please compare the miRNA profiles between the group of sample collection date <40 and the group of the rest. We may need to exclude batch 4s, and control the BMI before comparing these groups.

```{r, milk_dat_subset2, class.source = 'fold-show'}

dat <- norm.dat.k2$set
dat_drop4 <- dat[,dat$batch %in% c("1","2","3","5","6","7","8","9")]
pData_drop4 <- pData(dat_drop4)
pData_drop4 <- pData_drop4 %>% as_tibble() %>% mutate(milk_col_factor = as.factor(milk_col_dt < 40)) #

dds_drop4_day40 <- DESeqDataSetFromMatrix(countData = counts(dat_drop4[26:823,]),
                              colData = pData_drop4,
                              design = ~ W_1 + W_2 + status_ov + milk_col_factor)  

dds_drop4_day40 <- DESeq(dds_drop4_day40, fit='local')

```

```{r}
res_OB_drop4_day40 <- as.data.frame(results(dds_drop4_day40,contrast = c('milk_col_factor','TRUE','FALSE')))
```


### DEG table 

```{r}
## Table of top DE genes 
res_OB_drop4_day40 <- res_OB_drop4_day40 %>% 
  arrange(padj) %>% 
  filter(padj < 0.1) %>% 
  dplyr::select(c(padj, baseMean, log2FoldChange)) %>% 
  mutate(across(c(1:3), ~round(.x, digits=3)))
  
  
res_OB_drop4_day40
```

```{r}
df <- plotCounts(dds_drop4_day40, gene = "hsa-miR-27b-3p|0", intgroup = c("status_ov","milk_col_factor"), returnData = TRUE)
p <- ggboxplot(df, 
               x='milk_col_factor', 
               y='count', 
               #fill = "status_ov",
               add = "jitter", 
               label = rownames(df), 
               label.select = list(criteria = "`y` > 50"),
               repel = TRUE) 

p <- p + ggtitle("hsa-miR-27b-3p|0 Counts by Status")
p

```


```{r, volPlot3}
df <- res_OB_drop4_day40 %>% mutate(gene_name = rownames(.)) %>% filter(padj < 0.1) 
p <- do_vol_plot(df = df, sig = 0.05, fc = 0.3, size = 2)
p <- p + ggtitle("Before Day 40 vs After Day 40, controlled for BMI; red = padj<0.05")
p
```

# Conclusions

**In this analysis, we aim to determine differentially expressed bEV-miRNAs in obese mothers;  EV-miRNA expression was profiled with Nanostring Human NanoString miRNA Expression Panel of 798 biologically relevant miRNAs. Our working hypothesis is that immune-response and metabolism-related miRNAs are altered in overweight/obese mother breast milk.**

**After correcting for sample batch using the "RUVg" method of Love, et. al., we found `r dim(res_OV_sig)[1]` miRNAs that were DE (p-adj < 0.1) between overweight/obese and lean samples.  We also found `r dim(res_OB_sig)[1]` miRNAs that were DE (p-adj < 0.1) between obese and lean samples.  22 of the overweight DE miRNAs were also captured in the obese comparison, despite having fewer samples in the analysis.**  

**Many of the most DE miRNAs have significant outlier samples that may be biasing the results.  We may wish to drop one or more samples, or possibly the most outlier batch (probably batch 4) and re-evaluate. **

**When dropping batch 4 from the analysis, we obtain fewer DE miRNAs.  There are still 8 that overlap the analysis for Obese vs Lean for the full dataset.  I would possibly consider prioritizing these 8 for follow up wet lab studies to validate. **

**In future studies, we may wish to put more emphasis on controlling the technical factors in sample preparation for sequencing.  There appear to be large batch effects that are present and likely masking some of the biologically interesting signal.  Additionally, factors like "milk collection date" should be randomized across batches or standardized to one particular interval (i.e., 12 days).**


# Computational Environment 

```{r, sesh}
sessionInfo()
```