## 1 Preparation

### 1.1 Load the packages

In [None]:
library(tidyverse)
library(vegan)
library(ape)
library(microbiome)
library(mediation)

### 1.2 Load the datasets

In [None]:
mbio <- read.table("/data/precomputed/lld.metaphlan2.species.tsv") # Species abundance table
phen <- read.table("/data/precomputed/lld.phen.tsv")               # Phenotype table

## 2 Preprocess

The original MetaPhlAn2 output is hard to read, so we make a few modifications to tidy it up.

In [None]:
# Shorten the long taxonomy names, only keep the species names
rownames(mbio) <- str_replace_all(rownames(mbio), ".*s__", "s__")

# transpose the abundance table: rows are samples and columns are species
mbio <- t(mbio)

# Rescale the table so that each row (sample) sums to 1
mbio <- apply(mbio, 1, function(x) x / sum(x)) %>% t() %>% as.data.frame()
# rowSums(mbio) # Check the result: every value should be 1
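
The row rescaling above can be checked on a small example. The sketch below uses a made-up 2 x 3 table (the `toy` objects are hypothetical and not part of the course data) and shows why the result of `apply()` has to be transposed back.

```r
# Toy table: 2 samples (rows) x 3 species (columns); values are made up
toy <- matrix(c(2, 1, 1,
                0, 3, 1),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("sampleA", "sampleB"),
                              c("s__A", "s__B", "s__C")))

# apply() over rows returns one COLUMN per sample, so we transpose back
toy.rel <- t(apply(toy, 1, function(x) x / sum(x)))

# Equivalent shortcut: divide every row by its row sum
toy.rel2 <- toy / rowSums(toy)

rowSums(toy.rel)  # each sample should now sum to 1
```

Both forms give the same table; the `apply()` form simply mirrors the cell above.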

## 3 Overall ecological parameters

For the analysis of overall ecological parameters (alpha diversity, beta diversity...), we should use the **unfiltered** abundance table.

### 3.1 Alpha diversity

In [None]:
# Calculate Shannon diversity using vegan's function, and count the species number (richness).
mbio.alpha <- data.frame(Shannon  = vegan::diversity(mbio, index="shannon"), 
                         Richness = rowSums(mbio!=0))

if(!dir.exists("alphaDiversity")){dir.create("alphaDiversity")}
write.table(mbio.alpha, "alphaDiversity/alphaDiversity.tsv", sep = "\t", row.names = T, col.names = NA, quote = F)

# Check if there is a difference in gut microbiome Shannon index between hyperglycemia patients and healthy controls
wilcox.test(mbio.alpha$Shannon ~ phen$Hyperglycemia)

# Visualize the between-group differences in Shannon index using a boxplot
ggplot(mbio.alpha, aes(as.factor(phen$Hyperglycemia), Shannon, color = as.factor(phen$Hyperglycemia)))+
  geom_boxplot() + 
  xlab("Groups") +
  ylab("Shannon index") + 
  theme(legend.position = "none")
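
To make the two alpha diversity measures above concrete, here is a minimal sketch that computes them by hand on a made-up table. The `shannon()` helper and the `toy` objects are hypothetical; the formula H = -sum(p * ln p) with the natural logarithm matches the default of `vegan::diversity()`.

```r
# Shannon index by hand: H = -sum(p * log(p)), skipping absent species
shannon <- function(p) {
  p <- p[p > 0]        # 0 * log(0) is treated as 0
  -sum(p * log(p))     # natural logarithm
}

# Toy relative abundances (rows sum to 1); values are made up
toy <- rbind(even   = c(0.25, 0.25, 0.25, 0.25),
             skewed = c(0.97, 0.01, 0.01, 0.01),
             single = c(1,    0,    0,    0))

toy.alpha <- data.frame(Shannon  = apply(toy, 1, shannon),
                        Richness = rowSums(toy != 0))
toy.alpha
```

The `even` and `skewed` samples have the same richness but very different Shannon values, which is why the two measures are reported side by side.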

### 3.2 Beta diversity

Principal coordinates analysis (PCoA) reduces the dimensionality of microbiome composition data: it captures the main information in a high-dimensional dataset and projects it onto a small number of new dimensions. The PCoA plot visually reflects the distance (dissimilarity) between samples.
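
As a rough illustration of the idea, the sketch below computes Bray-Curtis dissimilarities by hand on a made-up table and runs classical multidimensional scaling with base R's `cmdscale()`, which is the same technique as PCoA. The `bray()` helper and the `toy` objects are hypothetical and are not part of the course pipeline.

```r
# Bray-Curtis dissimilarity between two samples
bray <- function(x, y) sum(abs(x - y)) / sum(x + y)

# Toy relative abundances: s1/s2 share species, s3/s4 share other species
toy <- rbind(s1 = c(0.5, 0.5, 0.0, 0.0),
             s2 = c(0.4, 0.6, 0.0, 0.0),
             s3 = c(0.0, 0.0, 0.5, 0.5),
             s4 = c(0.0, 0.1, 0.4, 0.5))

# Fill the pairwise dissimilarity matrix
n <- nrow(toy)
toy.dist <- matrix(0, n, n, dimnames = list(rownames(toy), rownames(toy)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    toy.dist[i, j] <- bray(toy[i, ], toy[j, ])
  }
}

# Classical MDS (PCoA): project the samples onto 2 new axes
toy.pcoa <- cmdscale(as.dist(toy.dist), k = 2, eig = TRUE)
toy.pcoa$points  # samples with similar composition land close together
```

On the first axis, `s1` and `s2` fall on one side and `s3` and `s4` on the other, reflecting the two groups of similar samples.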

In [None]:
# Calculate Bray-Curtis dissimilarities in microbiome composition between samples.
mbio.dist <- vegdist(mbio, method = "bray")

# PCoA
mbio.pcoa <- ape::pcoa(mbio.dist)

# Extract the principal coordinates (PCs) (new coordinates generated by PCoA)
mbio.pc   <- data.frame(mbio.pcoa$vectors)

# PCoA plot: visualize the first 2 principal coordinates (PCs)
ggplot(mbio.pc, aes(Axis.1, Axis.2, color = as.factor(phen$Hyperglycemia))) +
  geom_point() + 
  xlab("PCo1") +
  ylab("PCo2") +
  labs(color = "Hyperglycemia")

### 3.3 PERMANOVA

Permutational multivariate analysis of variance (PERMANOVA) estimates what proportion of the variance in microbiome composition can be explained by each factor we care about.
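
The logic behind the test can be sketched in base R for the simplest case of a single grouping factor. This is only an illustration under stated assumptions: the data are made up, the `pseudoF()` helper is hypothetical, and the multi-factor model used in this course is more general than this one-factor version.

```r
set.seed(1)  # permutations are random; fix the seed for reproducibility

# Toy data: 6 samples in 2 groups, Euclidean distance on one made-up variable
group <- factor(rep(c("ctrl", "case"), each = 3))
value <- c(1.0, 1.2, 0.9, 3.1, 2.8, 3.3)
d     <- as.matrix(dist(value))

# Pseudo-F and R2 computed directly from squared distances
pseudoF <- function(d, g) {
  n  <- nrow(d)
  d2 <- d^2
  ss.total  <- sum(d2[upper.tri(d2)]) / n
  ss.within <- sum(sapply(levels(g), function(l) {
    idx <- which(g == l)
    sub <- d2[idx, idx, drop = FALSE]
    sum(sub[upper.tri(sub)]) / length(idx)
  }))
  a <- nlevels(g)
  c(F  = ((ss.total - ss.within) / (a - 1)) / (ss.within / (n - a)),
    R2 = (ss.total - ss.within) / ss.total)
}

obs  <- pseudoF(d, group)                                 # observed statistic
perm <- replicate(999, pseudoF(d, sample(group))["F"])    # shuffled labels
p.value <- (sum(perm >= obs["F"]) + 1) / (999 + 1)
obs["R2"]; p.value
```

`R2` is the proportion of variance explained by the grouping, and the P value is the share of label shuffles that give a pseudo-F at least as large as the observed one.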

In [None]:
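# by = "terms" is set explicitly so that each factor gets its own row (tested sequentially, in formula order)
mbio.adonis.res <- adonis2(mbio.dist ~ Age + Gender + FruitIntakeFrequency + Glucose, data = phen, permutations = 999, by = "terms")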

if(!dir.exists("betaDiversity")){dir.create("betaDiversity")}
write.table(as.data.frame(mbio.adonis.res), "betaDiversity/PERMANOVA.tsv", sep = "\t", row.names = T, col.names = NA, quote = F)
knitr::kable(mbio.adonis.res)

## 4 Preparation before association analysis

### 4.1 Pre-calculation of prevalence and mean abundance

Calculate the prevalence and mean abundance of each species, and get the list of species that are both highly abundant and highly prevalent in our samples. We will extract these species **after the CLR transformation**.
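
As a toy illustration of the two filters (all values are made up and the `toy` objects are hypothetical), a species can pass the prevalence threshold and still be dropped because its mean abundance is too low:

```r
# Toy relative abundances: 4 samples x 3 species (rows sum to 1)
toy <- rbind(s1 = c(0.70, 0.3000, 0.0000),
             s2 = c(0.60, 0.4000, 0.0000),
             s3 = c(0.50, 0.4998, 0.0002),
             s4 = c(0.80, 0.2000, 0.0000))
colnames(toy) <- c("s__A", "s__B", "s__C")

toy.meanAbun <- colMeans(toy)               # mean abundance per species
toy.nonZero  <- colSums(toy > 0) / nrow(toy) # prevalence per species

# Same thresholds as in this section: mean abundance > 0.01%, prevalence > 5%
toy.keep <- colnames(toy)[toy.meanAbun > 0.0001 & toy.nonZero > 0.05]
toy.keep
```

Here `s__C` is present in 25% of samples but has a mean abundance of 0.005%, so only `s__A` and `s__B` are kept.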

In [None]:
# Calculate the abundance and prevalence of each species
spe.meanAbun <- colMeans(mbio)             # Mean abundance
spe.nonZero  <- colSums(mbio>0)/nrow(mbio) # Prevalence

# Get the list of species whose mean abundance is greater than 0.01% and which are present in more than 5% of samples
# We will remove the species that are not in this list later
spe.keep <- colnames(mbio)[spe.meanAbun > 0.0001 & spe.nonZero > 0.05]
# length(spe.keep) # Check how many species are in the list.

### 4.2 Centered log-ratio (clr) transformation and filtering

A relative abundance profile is compositional data: because the abundances in each sample sum to 1, the variables are not independent of each other. We reduce this dependency with the CLR transformation, and then do the filtering.
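
For intuition, the CLR value of a species is the log of its abundance minus the mean log abundance of all species in the same sample. The sketch below is a hand-written version on made-up data. The `clr()` helper and its fixed pseudocount are assumptions for illustration only; `microbiome::transform()` handles zeros in its own way, so its values will not match these exactly.

```r
# CLR by hand for one sample
clr <- function(x, pseudocount = 1e-6) {
  x <- x + pseudocount       # assumption: fixed pseudocount to avoid log(0)
  log(x) - mean(log(x))      # log ratio to the geometric mean of the sample
}

# Toy relative abundances: 2 samples x 3 species; values are made up
toy <- rbind(s1 = c(0.5, 0.3, 0.2),
             s2 = c(0.9, 0.1, 0.0))
colnames(toy) <- c("s__A", "s__B", "s__C")

toy.clr <- t(apply(toy, 1, clr))  # apply() returns samples in columns
toy.clr
rowSums(toy.clr)                  # CLR values of each sample sum to ~0
```

Positive values mark species that are more abundant than the geometric mean of their sample, negative values mark species that are less abundant.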

In [None]:
# CLR transformation. Note that microbiome::transform() expects taxa in rows and samples in columns,
# which is why we transpose the table with t() before the transformation and transpose it back afterwards.
mbio.clr <- microbiome::transform( t(mbio), transform = "clr") %>% t()

# Filter out the species with a low abundance or low prevalence (or extract the highly abundant and highly prevalent species)
mbio.clr.filtered <- mbio.clr[,spe.keep]

### 4.3 Assessment and removal of confounding effect caused by sequencing depth

In metagenomic sequencing-based microbiome studies, sequencing depth varies across samples and may bias diversity estimates and species associations. Two strategies can address this: (1) **Rarefaction (downsizing)**: randomly re-sample the reads of each sample down to a fixed read count so that all samples have the same sequencing depth. The drawback is that it discards data that were paid for. (2) **Including sequencing depth (read count) as a covariate** in downstream analyses, so that the statistical model adjusts for the confounding effect of sequencing depth. In our studies and in this course, we recommend the second strategy.

Here we assess the impact of sequencing depth on microbiome diversity estimates and species abundance.
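
For reference, the first strategy (rarefaction) can be sketched in a few lines of base R. The counts and the `rarefy_one()` helper below are made up for illustration. This course does not use rarefaction; the sketch only shows what "re-sampling reads to a fixed depth" means.

```r
set.seed(42)  # sub-sampling is random; fix the seed for reproducibility

# Made-up read counts for one sample (total depth: 1000 reads)
toy.counts <- c(s__A = 500, s__B = 300, s__C = 200)

# Draw `depth` reads without replacement and count them per species
rarefy_one <- function(counts, depth) {
  reads <- rep(names(counts), times = counts)   # one entry per read
  table(factor(sample(reads, depth), levels = names(counts)))
}

toy.rarefied <- rarefy_one(toy.counts, depth = 100)
toy.rarefied  # 100 reads in total; the other 900 are thrown away
```

The 900 discarded reads are the cost of this strategy, which is why the covariate approach is preferred here.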

In [None]:
# Check impact of sequencing depth on microbiome diversity
cor.test(phen$CleanReadCount, mbio.alpha$Shannon)
cor.test(phen$CleanReadCount, mbio.alpha$Richness)

# Association between species abundance and read count
spe.read.nocovar <- as.data.frame(matrix(NA, ncol = 3, nrow = ncol(mbio.clr.filtered)))
colnames(spe.read.nocovar)<-c("Species", "R", "P")

for (i in 1:ncol(mbio.clr.filtered)) {
    spe_i <- cor.test(mbio.clr.filtered[ ,i], phen$CleanReadCount, method = "spearman") # Spearman correlation analysis
    
    spe.read.nocovar$Species[i] <- colnames(mbio.clr.filtered)[i]             # Species name
    spe.read.nocovar$R[i]       <- spe_i$estimate                             # R value (correlation strength) from Spearman correlation
    spe.read.nocovar$P[i]       <- spe_i$p.value                              # P value from Spearman correlation
}

spe.read.nocovar$FDR <- p.adjust(spe.read.nocovar$P, method = "fdr") # Get adjusted P value
spe.read.nocovar     <- dplyr::arrange(spe.read.nocovar, P)          # Sort result table by raw P value

if(!dir.exists("beforeAssociation")){dir.create("beforeAssociation")}
write.table(spe.read.nocovar, "beforeAssociation/read.species.spearman.tsv", sep = "\t", row.names = F, col.names = T, quote = F)

## 5 Association analysis

### 5.1 Correlation between species abundance and phenotypes (No covariate)

We use **Spearman's correlation** (between two continuous variables) or **Wilcoxon rank-sum test** (between a continuous variable and a binary variable 0/1) to conduct association analysis if we don't consider any confounding effect.

In [None]:
# Association between species abundance and binary variables (or between group comparison for binary variables)
spe.hyperglycemia.nocovar <- as.data.frame(matrix(NA, ncol = 4, nrow = ncol(mbio.clr.filtered)))
colnames(spe.hyperglycemia.nocovar)<-c("Species", "Mean_0", "Mean_1", "P")

for (i in 1:ncol(mbio.clr.filtered)) {
    spe_i <- wilcox.test(mbio.clr.filtered[ ,i] ~ phen$Hyperglycemia)                         # Wilcoxon rank-sum test
    
    spe.hyperglycemia.nocovar$Species[i] <- colnames(mbio.clr.filtered)[i]                    # Species name
    spe.hyperglycemia.nocovar$Mean_0[i]  <- mean(mbio.clr.filtered[phen$Hyperglycemia==0, i]) # Mean species abundance in group 0
    spe.hyperglycemia.nocovar$Mean_1[i]  <- mean(mbio.clr.filtered[phen$Hyperglycemia==1, i]) # Mean species abundance in group 1
    spe.hyperglycemia.nocovar$P[i]       <- spe_i$p.value                                     # P value from Wilcoxon rank-sum test
}

spe.hyperglycemia.nocovar$Difference <- spe.hyperglycemia.nocovar$Mean_1 - spe.hyperglycemia.nocovar$Mean_0 # Get differences between group 0 and 1
spe.hyperglycemia.nocovar$FDR        <- p.adjust(spe.hyperglycemia.nocovar$P, method = "fdr")               # Get adjusted P value
spe.hyperglycemia.nocovar            <- dplyr::arrange(spe.hyperglycemia.nocovar, P)                        # Sort result table by raw P value

if(!dir.exists("associationNoCovar")){dir.create("associationNoCovar")}
write.table(spe.hyperglycemia.nocovar, "associationNoCovar/hyperglycemia.differential.species.wilcoxon.tsv", sep = "\t", row.names = F, col.names = T, quote = F)

# Association between species abundance and continuous variables
spe.glucose.nocovar <- as.data.frame(matrix(NA, ncol = 3, nrow = ncol(mbio.clr.filtered)))
colnames(spe.glucose.nocovar)<-c("Species", "R", "P")

for (i in 1:ncol(mbio.clr.filtered)) {
    spe_i <- cor.test(mbio.clr.filtered[ ,i], phen$Glucose, method = "spearman") # Spearman correlation analysis
    
    spe.glucose.nocovar$Species[i] <- colnames(mbio.clr.filtered)[i]             # Species name
    spe.glucose.nocovar$R[i]       <- spe_i$estimate                             # R value (correlation strength) from Spearman correlation
    spe.glucose.nocovar$P[i]       <- spe_i$p.value                              # P value from Spearman correlation
}

spe.glucose.nocovar$FDR <- p.adjust(spe.glucose.nocovar$P, method = "fdr") # Get adjusted P value
spe.glucose.nocovar     <- dplyr::arrange(spe.glucose.nocovar, P)          # Sort result table by raw P value

if(!dir.exists("associationNoCovar")){dir.create("associationNoCovar")}
write.table(spe.glucose.nocovar, "associationNoCovar/Glucose.species.spearman.tsv", sep = "\t", row.names = F, col.names = T, quote = F)
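
The `FDR` columns above come from `p.adjust(..., method = "fdr")`, which applies the Benjamini-Hochberg procedure. As a minimal sketch of what that adjustment does, the hypothetical `bh()` helper below reproduces it on made-up P values:

```r
# Made-up raw P values from 5 tests
p <- c(0.001, 0.02, 0.03, 0.04, 0.5)

# Benjamini-Hochberg by hand
bh <- function(p) {
  n <- length(p)
  o <- order(p, decreasing = TRUE)            # largest P value first
  adj <- pmin(1, cummin(p[o] * n / (n:1)))    # p * n / rank, kept monotone
  adj[order(o)]                               # back to the input order
}

bh(p)
p.adjust(p, method = "fdr")  # same values from base R
```

Each P value is multiplied by the number of tests divided by its rank, and the running minimum keeps the adjusted values in the same order as the raw ones.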

### 5.2 Correlation between species abundance and phenotypes (with covariates)

We use **linear regression** (when y is a continuous variable) or **logistic regression** (when y is a binary variable 0/1) to conduct association analysis, and add potential confounding factors to the models as covariates to remove their confounding effect. The most common confounding factor in microbiome studies is sequencing depth; in human microbiome studies, we also need to consider age and gender as potential confounding factors.
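
The sketch below fits one such model on simulated data and reads the species coefficient from the summary table by name. Everything in it is made up for illustration (the `toy` data frame, the effect sizes, and read counts expressed in millions). Selecting the row by name is one option worth considering: a fixed row index such as 5 only points at the species term when every covariate contributes exactly one coefficient.

```r
set.seed(7)  # simulated data; fix the seed for reproducibility
n <- 100

# Simulated phenotypes and one simulated CLR-transformed species
toy <- data.frame(Age            = rnorm(n, mean = 45, sd = 12),
                  Gender         = factor(sample(c("F", "M"), n, replace = TRUE)),
                  CleanReadCount = rnorm(n, mean = 20, sd = 3),  # millions, made up
                  Microbe        = rnorm(n))
toy$Glucose <- 5 + 0.01 * toy$Age + 0.3 * toy$Microbe + rnorm(n, sd = 0.5)

# Linear regression with covariates
fit   <- lm(Glucose ~ Age + Gender + CleanReadCount + Microbe, data = toy)
coefs <- coef(summary(fit))

rownames(coefs)                               # check which row is which term
coefs["Microbe", c("Estimate", "Pr(>|t|)")]   # beta and P value, by name
```

With a two-level `Gender` the species term is row 5, as assumed in the cell below; a covariate with three or more levels would shift that row.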

In [None]:
# Association between species abundance and binary variables
spe.hyperglycemia <- as.data.frame(matrix(NA, ncol = 3, nrow = ncol(mbio.clr.filtered)))
colnames(spe.hyperglycemia)<-c("Species", "Beta", "P")

for (i in 1:ncol(mbio.clr.filtered)) {
    spe_i <- glm(as.factor(Hyperglycemia) ~ Age+Gender+CleanReadCount+mbio.clr.filtered[,i], data = phen, family = "binomial") # Logistic regression analysis
    spe_i_summ <- summary(spe_i)                                                                                # Summarize the result to get the P value
    
    spe.hyperglycemia$Species[i] <- colnames(mbio.clr.filtered)[i]  # Species name
    spe.hyperglycemia$Beta[i]    <- spe_i_summ$coefficients[5,1]    # Beta coefficient from logistic regression
    spe.hyperglycemia$P[i]       <- spe_i_summ$coefficients[5,4]    # P value from logistic regression
}

spe.hyperglycemia$FDR <- p.adjust(spe.hyperglycemia$P, method = "fdr") # Get adjusted P value
spe.hyperglycemia     <- dplyr::arrange(spe.hyperglycemia, P)          # Sort result table by raw P value

if(!dir.exists("associationCovar")){dir.create("associationCovar")}
write.table(spe.hyperglycemia, "associationCovar/Hyperglycemia.species.logisticReg.tsv", sep = "\t", row.names = F, col.names = T, quote = F)


# Association between species abundance and continuous variables
spe.glucose <- as.data.frame(matrix(NA, ncol = 3, nrow = ncol(mbio.clr.filtered)))
colnames(spe.glucose)<-c("Species", "Beta", "P")

for (i in 1:ncol(mbio.clr.filtered)) {
    spe_i <- lm(Glucose ~ Age+Gender+CleanReadCount+mbio.clr.filtered[,i], data = phen) # Linear regression analysis
    spe_i_summ <- summary(spe_i)                                         # Summarize the result to get the P value
    
    spe.glucose$Species[i] <- colnames(mbio.clr.filtered)[i] # Species name
    spe.glucose$Beta[i]    <- spe_i_summ$coefficients[5,1]   # Beta coefficient from linear regression
    spe.glucose$P[i]       <- spe_i_summ$coefficients[5,4]   # P value from linear regression
}

spe.glucose$FDR <- p.adjust(spe.glucose$P, method = "fdr") # Get adjusted P value
spe.glucose     <- dplyr::arrange(spe.glucose, P)          # Sort result table by raw P value

if(!dir.exists("associationCovar")){dir.create("associationCovar")}
write.table(spe.glucose, "associationCovar/glucose.species.linearReg.tsv", sep = "\t", row.names = F, col.names = T, quote = F)

## 6 Mediation analysis

**Mediation analysis** is a statistical technique that can help us **infer the causal or regulatory relationships between multiple factors**. For instance, in the previous results the bacterial species *Eubacterium eligens* showed the most significant negative association with glucose. We would like to know whether *Eubacterium eligens* could serve as a target for lowering glucose through a lifestyle intervention (e.g. eating more fruit), so we test the regulatory relationship between fruit intake frequency, *Eubacterium eligens* and glucose.

In [None]:
Microbe<-mbio.clr.filtered[,grep("s__Eubacterium_eligens", colnames(mbio.clr.filtered))]
mediation.input<-data.frame(phen, Microbe)

# Check independent variable (FruitIntakeFrequency) to mediator (Microbe)
fit.mediator <- lm(Microbe~Age+Gender+CleanReadCount+FruitIntakeFrequency, data = mediation.input)
summary(fit.mediator)
# Check mediator (Microbe) to outcome (Glucose)
fit.dv <- lm(Glucose~Age+Gender+CleanReadCount+FruitIntakeFrequency+Microbe, data = mediation.input)
summary(fit.dv)

# Mediation analysis
mediation.res <- mediate(fit.mediator, fit.dv, treat = 'FruitIntakeFrequency', mediator = 'Microbe', boot = TRUE)
summary(mediation.res)

# ACME (average causal mediation effect) is the indirect effect of the independent variable (fruit intake frequency) on the outcome (glucose) that goes through the mediator (Eubacterium eligens); this is the key parameter we report in our study.
# ADE (average direct effect) is the direct effect of the independent variable on the outcome.
# Total Effect is the sum of the direct and indirect effects.
# Prop. Mediated is the proportion of the total effect that is indirect; this is also a parameter we are interested in.
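
To see how these four quantities relate, the sketch below computes point estimates by hand from two linear models on simulated data. It assumes linear models without a treatment-mediator interaction, in which case the indirect effect is the product of the two path coefficients. All variable names and effect sizes are made up, and the sketch gives no confidence intervals or P values, which is what `mediate()` adds.

```r
set.seed(123)  # simulated data; fix the seed for reproducibility
n <- 200

# Simulated chain: fruit -> microbe -> glucose, plus a direct fruit -> glucose path
fruit   <- rnorm(n)
microbe <- 0.5 * fruit + rnorm(n)
glucose <- -0.4 * microbe - 0.2 * fruit + rnorm(n)
toy     <- data.frame(fruit, microbe, glucose)

fit.m <- lm(microbe ~ fruit, data = toy)            # treatment -> mediator
fit.y <- lm(glucose ~ fruit + microbe, data = toy)  # mediator -> outcome
fit.t <- lm(glucose ~ fruit, data = toy)            # total effect model

a <- coef(fit.m)["fruit"]     # path a
b <- coef(fit.y)["microbe"]   # path b

indirect <- unname(a * b)                  # corresponds to ACME
direct   <- unname(coef(fit.y)["fruit"])   # corresponds to ADE
total    <- indirect + direct              # Total Effect
prop.med <- indirect / total               # Prop. Mediated

c(indirect = indirect, direct = direct, total = total, prop.med = prop.med)
```

For ordinary least squares the sum `indirect + direct` equals the `fruit` coefficient of `fit.t`, which is a quick sanity check on the decomposition.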