---
# Comparison & Classification Analysis in Microbiome Research

### Questions:
- Univariate analyses: Are there any individual microbial taxa that differ significantly between groups?
- Multivariate analyses: How do the microbial communities differ across groups as a whole, considering the interactions between multiple taxa?
- LEfSe (Linear Discriminant Analysis Effect Size): Which microbial taxa are most differentially abundant between groups, and how strong is their effect size in class separation?
- Random Forest: Can we predict group membership (e.g., male vs female) based on the microbiome composition?

### Objectives:
- Univariate analyses: Identify significant differences in microbial abundance between predefined groups.
- Multivariate analyses: Identify overall patterns in microbiome composition across groups using all or a subset of features.
- LEfSe (Linear Discriminant Analysis Effect Size): Discover biomarkers (features) that discriminate between groups.
- Random Forest: Build a predictive model to classify microbiome samples based on their taxonomic composition or other features. Identify the most important microbial features driving classification performance.

### Keypoints:
- Univariate analyses: Univariate analysis focuses on individual features and doesn’t capture interactions between features, which might be important in complex microbiome data.
- Multivariate analyses: We can explore relationships and differences between groups in a multivariate space. These methods reduce the dimensionality of high-dimensional data (many taxa) to uncover the main axes of variation.
- LEfSe (Linear Discriminant Analysis Effect Size): LEfSe combines the Kruskal-Wallis test (to find features that significantly differ between groups) with Linear Discriminant Analysis (LDA) (to rank features by their effect size and ability to separate groups).
- Random Forest: Random Forest (RF) is a powerful machine learning algorithm used for classification tasks, particularly when there are many features and complex interactions.
---

## Getting Started

In [None]:
# set the variables for your netid
netid = "NETID"

In [None]:
# make a variable for the working directory
work_dir = "/xdisk/bhurwitz/bh_class/" + netid + "/assignments/15_comparisons"

### Comparison & Classification Analyses

In this assignment, we are going to explore Comparison & Classification Analyses in MicrobiomeAnalyst using four techniques:

* Univariate Analysis is useful for identifying individual features that are significantly different between groups, but it does not account for complex interactions between features.
* Multivariate Analysis is powerful for exploring patterns across multiple features, revealing overall community differences, and reducing dimensionality for easier visualization and interpretation.
* LEfSe is designed specifically for biomarker discovery, providing both statistical significance and effect size to identify features that best separate groups.
* Random Forest is ideal for predictive classification and can highlight the most important microbial features for distinguishing between groups, especially when dealing with complex, high-dimensional data. 

# Let's try this out in MicrobiomeAnalyst.

Follow the same steps as in the previous homeworks, using your project BIOM file, until you reach the Analysis Overview step shown below. Then go to the Comparison & Classification section.

![image.png](attachment:image.png)

#### Part 1

Differential Abundance Analysis using classical univariate analysis. 

From the ‘Analysis Overview’ page, click ‘Single Factor Analysis’ in the Comparison & Classification section.

Try out the different options for univariate analyses. 

The results for all differential-abundance analyses follow the same layout. The top half of the page contains parameters with which users can customize their analyses, such as the taxonomic level, statistical method, and significance cutoff. 

![image.png](attachment:image.png)


![image-2.png](attachment:image-2.png)
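
Because a univariate analysis runs one test per taxon, the raw p-values are usually adjusted for multiple testing before the significance cutoff is applied. The sketch below is a minimal, standard-library illustration of the Benjamini-Hochberg false discovery rate (FDR) adjustment. The taxon names and p-values are hypothetical, and the helper `benjamini_hochberg` was written for this example; it is not MicrobiomeAnalyst's own code.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values, one per taxon (illustration only).
raw_p = {
    "Taxon_A": 0.001,
    "Taxon_B": 0.008,
    "Taxon_C": 0.039,
    "Taxon_D": 0.041,
    "Taxon_E": 0.600,
}
cutoff = 0.05  # assumed significance cutoff

fdr = dict(zip(raw_p, benjamini_hochberg(list(raw_p.values()))))
significant = [taxon for taxon, q in fdr.items() if q < cutoff]

for taxon, q in fdr.items():
    print(f"{taxon}: raw p = {raw_p[taxon]:.3f}, adjusted p = {q:.5f}")
print("Significant after adjustment:", significant)
```

In this made-up example, four taxa pass a raw cutoff of 0.05, but only two remain significant after the adjustment, which is why the choice of cutoff and correction matters when you describe your parameters.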

#### Question 1: 

Univariate Comparisons:

Find features that are significantly different between groups for your project using univariate comparisons.

Describe your parameters / methods here:


Paste your Figure here, and describe your results:

#### Part 2

Multiple Linear Regression with Covariate Adjustment

From the ‘Analysis Overview’ page, click ‘Multi Factor Analysis’ in the Comparison & Classification section.

Try out the different options for multi-factor analyses. 

In microbiome analysis, the goal is often to examine how different covariates (e.g., environmental factors, host characteristics) influence microbial abundances or diversity. The choice of model depends on the characteristics of the data, especially the distribution of the count data (such as species abundances). Here’s a summary of how the model types differ from one another:

![image.png](attachment:image.png)

Use multiple linear regression to find differences in taxonomic abundance between groups for your project. Try each of the available models. Be sure to include the negative binomial and zero-inflated negative binomial models, which are commonly used because they handle typical characteristics of microbiome count data, including overdispersion and excess zeros.
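
To see what overdispersion and excess zeros look like in practice, the sketch below checks one taxon's counts against what a Poisson model would expect (variance equal to the mean, and a zero fraction of `exp(-mean)`). The counts are hypothetical and the variable names are chosen only for this illustration; this is a quick diagnostic, not a model fit.

```python
import math
import statistics

# Hypothetical read counts for one taxon across 12 samples (illustration only).
counts = [0, 0, 0, 0, 3, 0, 12, 0, 45, 0, 0, 7]

mean_count = statistics.mean(counts)
variance = statistics.variance(counts)  # sample variance

# Under a Poisson model the variance equals the mean, so a ratio well
# above 1 suggests overdispersion.
dispersion_ratio = variance / mean_count

# Compare the observed share of zeros with the share a Poisson model
# with the same mean would predict.
observed_zero_fraction = counts.count(0) / len(counts)
poisson_zero_fraction = math.exp(-mean_count)

print(f"mean = {mean_count:.2f}, variance = {variance:.2f}")
print(f"variance / mean = {dispersion_ratio:.1f}")
print(f"observed zeros = {observed_zero_fraction:.2f}, "
      f"Poisson-expected zeros = {poisson_zero_fraction:.4f}")
```

A variance far above the mean points toward a negative binomial model, and many more zeros than expected points toward a zero-inflated model.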



#### Question 2: 

Multiple Linear Regression with Covariate Adjustment:

Find features that are significantly different between groups for your project using multivariate comparisons.

Describe your parameters / methods here:


Paste your Figure here, and describe your results:

#### Part 3

Biomarker discovery with linear discriminant analysis effect size (LEfSe).

Next, we will identify robust biomarkers using the LEfSe approach. Linear Discriminant Analysis Effect Size (LEfSe) is a statistical method used for biomarker discovery and identification of features that are significantly different between classes (e.g., different groups, treatments, or conditions) in microbiome and other omics data. It aims both to identify the features (e.g., taxa, genes, metabolites) that discriminate between classes and to estimate the strength of the discriminative effect of these features.

Key Outputs of LEfSe:
* Biomarker Identification: Features that are differentially abundant between the groups, with associated LDA scores and p-values.
* Effect Size: The LDA score, representing the effect size of each feature in class discrimination.
* Visualizations: Cladograms (for microbiome data), bar charts, and other plots that help users interpret the significance and effect sizes of biomarkers.
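
The first step of LEfSe, the Kruskal-Wallis test, can be sketched for a single feature and two groups as shown below. This is a minimal standard-library illustration, not LEfSe's implementation: it covers only the rank-based filtering step and omits the LDA effect-size step entirely. The relative abundances and the helper names `average_ranks` and `kruskal_wallis_two_groups` are hypothetical, and the p-value relies on the chi-square approximation with one degree of freedom, which is rough for small samples.

```python
import math
from collections import Counter


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_two_groups(group_a, group_b):
    """Return (H, p) for a two-group Kruskal-Wallis test with tie correction."""
    pooled = list(group_a) + list(group_b)
    n = len(pooled)
    ranks = average_ranks(pooled)
    rank_sum_a = sum(ranks[:len(group_a)])
    rank_sum_b = sum(ranks[len(group_a):])
    h = 12 / (n * (n + 1)) * (
        rank_sum_a ** 2 / len(group_a) + rank_sum_b ** 2 / len(group_b)
    ) - 3 * (n + 1)
    # Tie correction: 1 - sum(t^3 - t) / (n^3 - n) over groups of tied values.
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    correction = 1 - tie_term / (n ** 3 - n)
    if correction == 0:  # every value identical: no evidence of a difference
        return 0.0, 1.0
    h = max(h / correction, 0.0)
    # Chi-square survival function with 1 degree of freedom.
    p = math.erfc(math.sqrt(h / 2))
    return h, p


# Hypothetical relative abundances of one taxon in two groups.
group_a = [0.10, 0.12, 0.15, 0.11, 0.14]
group_b = [0.02, 0.03, 0.05, 0.04, 0.01]

h_stat, p_value = kruskal_wallis_two_groups(group_a, group_b)
print(f"H = {h_stat:.3f}, p = {p_value:.4f}")
```

In LEfSe, features that pass this kind of test are then ranked by their LDA score, which is the effect size reported in the outputs listed above.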


#### Question 3: 

Biomarker discovery with linear discriminant analysis effect size (LEfSe)

Find features that are significantly different between groups for your project using LEfSe. Note that you might not find any (which is OK)!

Describe your parameters / methods here:


Paste your Figure here (if you have one), and/or describe your results:

#### Part 4

Classification using ‘Random Forests’.

From the ‘Analysis Overview’ page, click ‘Random Forests’. The random forests (RF) algorithm is a powerful machine-learning method that can be applied to microbiome data for classification and selection of important features.
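
Each tree in a random forest is trained on a bootstrap sample of the data (samples drawn with replacement), and the samples a given tree never saw are its out-of-bag (OOB) samples. The OOB error is the error rate obtained when each sample is predicted only by the trees that did not train on it. The sketch below illustrates just the sampling step using the standard library; it does not train any trees, and the sample count, tree count, and helper name `bootstrap_oob` are hypothetical.

```python
import random


def bootstrap_oob(n_samples, rng):
    """Draw one bootstrap sample; return (in-bag indices, out-of-bag indices)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag


rng = random.Random(0)  # fixed seed so the run is repeatable
n_samples = 100         # hypothetical number of microbiome samples
n_trees = 500           # hypothetical number of trees in the forest

oob_fractions = []
for _ in range(n_trees):
    _, out_of_bag = bootstrap_oob(n_samples, rng)
    oob_fractions.append(len(out_of_bag) / n_samples)

mean_oob_fraction = sum(oob_fractions) / n_trees
# Probability that a given sample is never drawn in n_samples draws.
expected_oob_fraction = (1 - 1 / n_samples) ** n_samples

print(f"mean OOB fraction per tree: {mean_oob_fraction:.3f}")
print(f"expected OOB fraction:      {expected_oob_fraction:.3f}")
```

Roughly a third of the samples are left out of each tree, so every sample ends up with many trees that can act as an independent test for it. This is why the OOB error can be reported without setting aside a separate test set.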



#### Question 4: 

Random Forest (RF)

Find the features that are most important for distinguishing between groups for your project using RF.

Describe your parameters / methods here:

1. How many trees were used to create the RF model?
2. What was the ‘Taxonomic level’ set to?

Paste your Figure here, and describe your results:

1. What was the out-of-bag (OOB) error?
2. Which taxa are identified as the most important differences between groups for your project?

#### Question 5:

Discussion: Do you see the same or different taxa coming up as being significant in each of the analyses? Based on what is described in the exercise and in this homework for each method, why might you see these differences? 

## The End

Copy your notebook to your working directory to turn it in.

In [None]:
!cp ~/be487-fall-2024/exercises/15_comparisons/ex15_comparisons.ipynb $work_dir