---
# Clustering and Correlation Analyses

### Questions:
- How can we use clustering and correlation analysis to analyze our microbiome samples?

### Objectives:
- Understand how clustering and correlation analysis can be used to study complex microbial community structures, interactions, and their relationships with environmental or health-related variables.

### Keypoints:
- We can extract meaningful patterns from microbiome data to identify significant relationships between microbial taxa.
- These analyses can uncover insights into the roles microbes play in various contexts (environment, host factors, or health status).

---

## Getting Started

In [None]:
# set the variables for your netid
netid = "NETID"

In [None]:
# make a variable for the working directory
work_dir = "/xdisk/bhurwitz/bh_class/" + netid + "/exercises/14_clustering"

### Clustering and Correlation Networks

In this exercise, we are going to explore clustering techniques and correlation networks in microbiome analysis. Clustering and correlation analysis are two powerful statistical techniques used to understand complex microbial community structures, interactions, and their relationships with environmental or health-related variables. The main objectives of using these approaches are to extract meaningful patterns from microbiome data, identify significant relationships between microbial taxa, and uncover insights into the roles microbes play in various contexts (e.g., health, disease, or environmental factors). 

#### The Purpose of Clustering Data in Microbiome Analysis using Heatmaps:

1. Clustering of Taxa and Samples:

Hierarchical Clustering: A heatmap allows for the visual representation of the hierarchical clustering of both microbial taxa and samples based on their similarities or dissimilarities. The rows of the heatmap represent microbial taxa (e.g., species, genera, or OTUs), while the columns represent the samples (e.g., patient samples, time points, experimental conditions).

Patterns of Similarity: By clustering taxa and samples based on their abundance or presence/absence patterns, the heatmap can reveal which taxa are similarly abundant across the same set of samples. This helps to group samples that share a similar microbial profile, potentially indicating a common ecological or clinical state.

2. Identifying Community Structure and Trends:

Taxa Distribution: A heatmap can show which taxa are abundant or scarce in each sample, making it easier to spot trends, such as the presence of certain microbes associated with specific conditions or treatments. For example, a heatmap could show how the abundance of certain bacterial genera varies between healthy and diseased groups.

Clustered Patterns: Heatmaps can highlight specific taxa that tend to co-occur in certain groups of samples, indicating potential microbial interactions or co-occurrence patterns that may be biologically relevant.

3. Visualizing Complex Data:

Microbiome data can be complex, especially with high-dimensional datasets containing many taxa and samples. A heatmap condenses this complexity into a 2D matrix, allowing researchers to quickly visualize key patterns, anomalies, and trends.

Color Coding: The use of a color scale (e.g., from blue for low abundance to red for high abundance) makes it easy to interpret the relative abundance of each taxon across samples.

4. Identifying Significant Features:

By visualizing how specific taxa behave across different experimental conditions or groups, researchers can identify potential biomarkers or key species that differentiate groups. For example, a heatmap can reveal taxa that are enriched in certain clinical conditions, helping to hypothesize their role in disease progression.

5. Comparative Analysis Across Groups:

A heatmap can be particularly useful when comparing different experimental groups (e.g., treatment vs. control, healthy vs. diseased). By clustering both taxa and samples, the heatmap can highlight differences in microbial composition between groups, providing insights into which microbes are contributing to those differences.

6. Assessing the Effects of Variables:

Researchers can use heatmaps to examine how different variables (e.g., diet, treatment, geographical location) impact microbial community structure. For example, clustering samples based on the geographical location might reveal distinct microbial communities between regions.
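If you would like to see what a clustered heatmap does under the hood, here is a minimal sketch in Python. It assumes NumPy and SciPy are installed, uses made-up taxa names and abundances (not values from example.biom), and is not how MicrobiomeAnalyst is implemented. It clusters the rows and columns of a small abundance table and reorders the table so that similar taxa and similar samples sit next to each other, which is the ordering a heatmap would then color.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

# Hypothetical relative abundances: rows are taxa, columns are samples.
taxa = ["Bacteroides", "Prevotella", "Lactobacillus", "Escherichia"]
samples = ["F1", "M1", "F2", "M2"]
abundance = np.array([
    [0.50, 0.10, 0.45, 0.12],   # higher in the F samples
    [0.30, 0.05, 0.35, 0.08],   # higher in the F samples
    [0.10, 0.45, 0.12, 0.40],   # higher in the M samples
    [0.10, 0.40, 0.08, 0.40],   # higher in the M samples
])

# Cluster the rows (taxa) and the columns (samples) separately.
row_order = leaves_list(linkage(abundance, method="average", metric="euclidean"))
col_order = leaves_list(linkage(abundance.T, method="average", metric="euclidean"))

# Reorder the matrix so similar taxa and similar samples are adjacent.
clustered = abundance[np.ix_(row_order, col_order)]
ordered_taxa = [taxa[i] for i in row_order]
ordered_samples = [samples[i] for i in col_order]

print(ordered_samples)
print(ordered_taxa)
print(clustered)
```

In this toy table the two F samples end up side by side, as do the two M samples, even though they were interleaved in the input.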

# Let's try this out in MicrobiomeAnalyst.

Go through the same steps from the previous exercises with the example.biom file to get to the Analysis Overview step below. Go to the Clustering and Correlation Network Section.

![image.png](attachment:image.png)

#### Step 1

Select the Heatmap visualization. This analysis is meant to help you visualize taxa that are highly abundant in certain groups within the data. Here, you can try out a number of distance metrics and clustering methods to group the data by experimental factors.

![image.png](attachment:image.png)

Here are the possible distance metrics:

![image.png](attachment:image.png)


Here are the possible clustering methods:

![image.png](attachment:image.png)

> ## Exercise: Heatmap Exploration
> Try out different taxonomic levels, distance metrics and clustering algorithms for the host_sex experimental factor. Do you see better clustering by sex by using different parameters?

### Heatmap Example: 

Here I am using the Minkowski distance and the complete linkage clustering method. In this example, I am grouping by host_sex and using "Family" as the taxonomy level.
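To make the two choices above concrete, here is a small sketch of Minkowski distance followed by complete linkage, written with SciPy. The abundance values, the `host_sex` labels, and the exponent `p=3` are all assumptions for illustration, and the web tool may use different defaults. The tree is cut into two clusters so that you can compare them with the known grouping.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical family-level relative abundances:
# rows are samples, columns are families.
host_sex = ["female", "female", "female", "male", "male", "male"]
profiles = np.array([
    [0.60, 0.30, 0.10],
    [0.55, 0.35, 0.10],
    [0.65, 0.25, 0.10],
    [0.20, 0.30, 0.50],
    [0.15, 0.35, 0.50],
    [0.25, 0.25, 0.50],
])

# Minkowski distance with exponent p (p=1 is Manhattan, p=2 is Euclidean).
distances = pdist(profiles, metric="minkowski", p=3)

# Complete linkage scores two clusters by their farthest pair of members.
tree = linkage(distances, method="complete")

# Cut the tree into two clusters and compare with the known grouping.
clusters = fcluster(tree, t=2, criterion="maxclust")
for sex, cluster_id in zip(host_sex, clusters):
    print(sex, cluster_id)
```

Swapping `metric` or `method` here is the scripted equivalent of changing the drop-down menus in the exercise.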

![image.png](attachment:image.png)

### Exploring your Heatmap:

Try mousing over different parts of the heatmap to see the names and abundances of taxa that differ between the groups (male vs. female).

![image.png](attachment:image.png)

> ## Discussion: Clustering Taxa with Heatmaps
> In microbiome analysis, a heatmap is often used for visualizing and clustering taxa to reveal patterns in the composition and structure of microbial communities across different samples. It helps to identify relationships between taxa and how they vary across samples, which can be critical for understanding microbial diversity, identifying biomarkers, or uncovering associations with health or disease states.

What are the advantages of using a heatmap?
> 
<details>
  <summary markdown="span">Solution</summary>
  <p>Advantages of heatmaps in microbiome analysis:</p>
  <ul>
    <li>Intuitive visualization: Heatmaps are easy to interpret visually and provide a clear overview of microbial diversity and relationships across samples.</li>
    <li>Data exploration: Heatmaps facilitate the exploration of complex datasets by allowing researchers to easily spot clusters of taxa or samples with similar abundance patterns.</li>
    <li>Clear comparison: Heatmaps provide a quick way to compare multiple samples and identify which taxa are consistently present or absent across different experimental conditions or groups.</li>
  </ul>

</details>

### Step 2

Dendrogram Analysis:

A dendrogram is a tree-like diagram that is commonly used in microbiome research to represent the hierarchical clustering of samples or taxa based on their similarity or dissimilarity in terms of abundance, composition, or other metrics. The dendrogram is particularly valuable for visualizing and interpreting the structure of microbial communities and the relationships between different samples.
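As a rough illustration of how such a tree is built, the sketch below clusters four made-up samples with SciPy. The sample names, the counts, and the choice of Bray-Curtis dissimilarity with average linkage are assumptions, not the settings MicrobiomeAnalyst uses. Passing `no_plot=True` returns the tree layout without drawing it, so we can inspect the leaf order and the height of each merge.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical count table: rows are samples, columns are taxa.
sample_ids = ["gut_A", "gut_B", "skin_A", "skin_B"]
counts = np.array([
    [90, 10,  0],
    [80, 20,  0],
    [ 5, 15, 80],
    [10, 10, 80],
])

# Bray-Curtis dissimilarity is one common choice for abundance data.
distances = pdist(counts, metric="braycurtis")
tree = linkage(distances, method="average")

# no_plot=True computes the layout only; nothing is drawn.
layout = dendrogram(tree, labels=sample_ids, no_plot=True)
print(layout["ivl"])   # leaf labels in left-to-right order

# Each row of the linkage matrix is one merge: the two nodes joined,
# the height (distance) of the join, and the size of the new cluster.
for left, right, height, size in tree:
    print(int(left), int(right), round(height, 3), int(size))
```

Samples that join at a low height are very similar. Here the two skin samples merge first, then the two gut samples, and the two pairs only join near the top of the tree.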

Go back to Analysis Overview and select the Dendrogram Analysis. This analysis looks at the distance between samples based on a phylogenetic analysis.

![image.png](attachment:image.png)

> ## Exercise: Dendrogram Exploration
> Try out different taxonomic levels, distance metrics and clustering algorithms for the host_sex experimental factor. Do you see better clustering by sex by using different parameters?

> 
![image.png](attachment:image.png)

### Correlations

The aim of correlation networks is to identify potential interactions between microbes that could represent mutualistic, commensal, parasitic or even competitive relationships. Uncovering such interactions could hold important therapeutic implications for the health of the microbial community and ultimately lead to understanding microbiome function. Several simple methods for computing correlation networks exist, such as Pearson’s correlation, which determines whether linear relationships exist between two taxa, and Spearman’s and Kendall’s rank correlations, which measure rank relationships between pairs. However, these naïve methods often fail to address the compositional nature of microbiome data and can be unreliable because of the identification of spurious correlations.
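The difference between the three simple coefficients is easiest to see on a tiny example. The sketch below uses `scipy.stats` on two made-up taxa whose abundances rise together but not along a straight line. The numbers are purely illustrative.

```python
from scipy.stats import pearsonr, spearmanr, kendalltau

# Hypothetical abundances of two taxa across six samples.
# taxon_b rises with taxon_a, but not in a straight line.
taxon_a = [1, 2, 3, 4, 5, 6]
taxon_b = [1, 8, 27, 64, 125, 216]   # taxon_a cubed

pearson_r, _ = pearsonr(taxon_a, taxon_b)
spearman_rho, _ = spearmanr(taxon_a, taxon_b)
kendall_tau, _ = kendalltau(taxon_a, taxon_b)

print(f"Pearson r    = {pearson_r:.3f}")     # below 1: the trend is not linear
print(f"Spearman rho = {spearman_rho:.3f}")  # 1: the rank order matches exactly
print(f"Kendall tau  = {kendall_tau:.3f}")   # 1: every pair of samples is concordant
```

The rank-based coefficients only ask whether the two taxa go up and down together, which is why they reach 1 here while Pearson's does not.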

Alternatively, compositionally robust methods such as SparCC and sparse inverse covariance estimation for ecological association and statistical inference (SPIEC-EASI) have been introduced, both of which make a strong assumption of a sparse correlation network. SparCC uses a log ratio transformation and performs multiple iterations to identify taxa pairs that are outliers to background correlations. SPIEC-EASI uses graphical network models to infer the entire correlation network at once. Both methods are computationally intensive, although an efficient implementation of the SparCC algorithm, named FastSpar, was recently introduced. MicrobiomeAnalyst implements FastSpar as well as Pearson’s, Spearman’s and Kendall’s methods for correlation analysis.
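To give a feel for why log ratios help with compositional data, here is a small NumPy sketch. It is not SparCC or FastSpar; it only computes the variance of log ratios between pairs of taxa, a basic quantity that log-ratio approaches work with. The count table is made up, and the helper name `log_ratio_variance` is hypothetical. The point is that the result is the same whether we start from raw counts or from relative abundances.

```python
import numpy as np

# Hypothetical counts: rows are samples, columns are taxa.
# Taxon 1 is always twice taxon 0; taxon 2 varies on its own.
counts = np.array([
    [10, 20, 70],
    [30, 60, 15],
    [20, 40, 90],
    [40, 80, 25],
    [15, 30, 50],
], dtype=float)

def log_ratio_variance(table):
    """Variance across samples of log(x_i / x_j) for every pair of taxa."""
    logs = np.log(table)
    # ratios[s, i, j] = log(x_i) - log(x_j) in sample s
    ratios = logs[:, :, None] - logs[:, None, :]
    return ratios.var(axis=0)

from_counts = log_ratio_variance(counts)

# Convert to relative abundances (each sample sums to 1) and repeat.
relative = counts / counts.sum(axis=1, keepdims=True)
from_relative = log_ratio_variance(relative)

print(np.round(from_counts, 3))
print(np.allclose(from_counts, from_relative))
```

A log-ratio variance near zero (taxa 0 and 1 here) means the two taxa keep a constant proportion to each other across samples. Real tables contain zeros, which would need to be handled (for example with a pseudocount) before taking logs.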

![image.png](attachment:image.png)

> ## Exercise: Correlation Exploration
> Try using a Spearman rank correlation at the Class level for host_sex as the experimental factor.

![image.png](attachment:image.png)

Can you find classes of bacteria that have different abundances between the groups (male vs. female)?

![image.png](attachment:image.png)

Note that you will get an error if the correlation analysis does not produce any meaningful results. For example, if I run the same analysis as above, but at the family level, there are no meaningful results and I get an error that looks like this:

![image.png](attachment:image.png)

> ## Discussion: Correlation Networks 
> In microbiome analysis, a correlation network is a graphical representation that helps illustrate the relationships between different microbial taxa (such as bacterial species, genera, or operational taxonomic units, OTUs) based on their co-occurrence or co-exclusion patterns. These patterns are derived from the abundance data of various microbes present in a sample or set of samples.

What is the purpose of a correlation network?
> 
<details>
  <summary markdown="span">Solution</summary>
  <ul>
    <li>Community structure: By visualizing correlations, researchers can identify how microbial taxa might interact with each other. Positive correlations suggest microbes that are likely to share a similar ecological niche, while negative correlations suggest competitive or antagonistic relationships.</li>
    <li>Identification of keystone species: Certain microbes may act as hubs in a correlation network, being strongly connected to many other taxa. These "keystone" species are often important for maintaining the balance of the microbiome ecosystem.</li>
    <li>Functional insights: Correlation networks can also offer insights into functional relationships among microbes. For instance, microbes that metabolize similar substrates or share a similar environmental niche may exhibit positive correlations.</li>
    <li>Health implications: In clinical microbiome studies, changes in the structure of these networks (such as the loss of positive correlations or the emergence of negative ones) may reflect shifts in microbial ecology associated with disease states, such as dysbiosis, inflammatory bowel disease (IBD), or obesity.</li>
  </ul>

</details>
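The ideas in the discussion above can be sketched in a few lines of NumPy. This is only an illustration: the abundance table is made up, the threshold of 0.8 is arbitrary, the helper `rank_columns` is hypothetical and assumes no tied values, and no significance testing is done. It computes Spearman correlations between taxa, keeps the strong ones as edges, and counts the edges per taxon to look for candidate hubs.

```python
import numpy as np

def rank_columns(table):
    """Replace each column's values by their ranks (assumes no ties)."""
    return table.argsort(axis=0).argsort(axis=0).astype(float)

# Hypothetical abundances: rows are samples, columns are taxa.
taxa = ["A", "B", "C", "D"]
abundance = np.array([
    [1, 10, 60, 7],
    [2, 25, 50, 3],
    [3, 30, 40, 9],
    [4, 45, 30, 1],
    [5, 50, 20, 8],
    [6, 65, 10, 4],
])

# Spearman correlation is the Pearson correlation of the ranks.
rho = np.corrcoef(rank_columns(abundance), rowvar=False)

# Keep an edge wherever |rho| passes the threshold; drop self-loops.
threshold = 0.8
adjacency = (np.abs(rho) >= threshold) & ~np.eye(len(taxa), dtype=bool)

# Degree = number of edges per taxon; high-degree taxa are candidate hubs.
degree = adjacency.sum(axis=1)
for name, d in zip(taxa, degree):
    print(name, int(d))
```

In this toy table, A and B rise together and C falls as they rise, so those three taxa are all connected, while D is not strongly correlated with anything and has no edges. The sign of `rho` tells you whether an edge represents co-occurrence (positive) or co-exclusion (negative).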

## The End

Copy your notebook for future reference...

In [None]:
!cp ~/be487-fall-2024/exercises/14_clustering/ex14_clustering.ipynb $work_dir