Machine learning to characterise microbial communities
======
This Jupyter notebook covers the progress of Matthew Shaun Grainger's MRes research project.

## Introduction
There are two main lines of investigation for my project, depending upon the data that I am able to work with:
- Compare microbial communities (snapshots of the same overall microbial community) from the same location at different time points.
- Compare microbial communities from different locations at the same time point.

The former approach investigates how the microbial community of a given environment changes over time (temporally), with the hope of understanding its assembly. The latter approach investigates how microbial communities change over space (spatially). These 'spaces' could also correspond to different ecosystem functions, such as in a comparison between the microbial community associated with healthy wheat and the one associated with diseased wheat. Both approaches can easily be combined, allowing for a comparison between the assembly dynamics of different microbial communities across space.

Whichever approach is chosen, microbial communities will need to be identified. To do this, I will run several different pipelines on the same data, with the intention of reaching a consensus characterisation of its microbial community.


## General Pipeline for Microbial Community Identification
To identify a microbial community, microbial 'species' (ASVs or OTUs, hereafter referred to as OTUs) must be grouped. A microbial community is simply a group of OTUs that interact.

The identification of microbial communities often requires several steps:
- Bioinformatics
- Data preparation
- Co-occurrence calculation (or alternatives)
- Network inference (constructing a network)
- Network analysis
  
These steps are discussed in the notebook sections below. Out of these steps, pipelines can differ in the following ways:
- Sequencing
- OTU quantification
- Link type for clustering and measurement of link type for clustering (such as different measurements of correlation)
- Clustering method
- Network analysis method


## Bioinformatics

## Data Preparation
Data preparation can be broken down further into the following steps:
- The microbial community of interest must be sampled. This could be a sample of water or soil for instance. The sample contains microbes.
- DNA is extracted from the sample, amplified, and then sequenced. Different sequencing techniques may be used.
- OTUs are determined.
- OTUs are quantified.

Different sequencing methods provide different results. In this project, I will be using 16S rRNA sequencing data. Sequencing the 16S rRNA gene is common practice; this gene is present in all bacteria, and it contains highly variable regions that can be used to classify bacterial OTUs. OTUs are classified based upon a high degree of sequence similarity.

Once OTUs have been classified, an OTU table must be constructed. An OTU table comprises a quantification of each OTU for each sample. There are various possible methods of quantification. Here, I will be using presence/absence data.
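
As a minimal sketch of what such a table looks like, the cell below builds a small, entirely hypothetical count table (the sample names, OTU names, and counts are invented for illustration) and converts it to presence/absence data. Relative abundances are also calculated for comparison, since several of the methods discussed later work on them.

```python
import pandas as pd

# Hypothetical count table: rows are samples, columns are OTUs.
toy_counts = pd.DataFrame(
    {"otu_a": [0, 6, 12], "otu_b": [3, 0, 0], "otu_c": [0, 0, 1]},
    index=["S1", "S2", "S3"],
)

# Presence/absence: 1 where an OTU was detected in a sample, 0 otherwise.
toy_presence = (toy_counts > 0).astype(int)

# Relative abundance, for comparison: each row (sample) sums to 1.
toy_relative = toy_counts.div(toy_counts.sum(axis=1), axis=0)

print(toy_presence)
```

Note that presence/absence data discards all information about how abundant an OTU is, which may matter for correlation-based methods.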

## Co-occurrence Calculation
There are several ways of constructing a network from microbial community samples. A network consists of nodes with links between them. For example, samples (and clusters of samples) can be nodes, connected by links in the form of their pairwise beta diversity similarities. Here, the samples are assumed to be communities, and each cluster within the network is considered to be a 'community class'. Alternatively, OTUs (and clusters of OTUs) can be nodes, and metabolic compatibility between different OTUs can be used as links.
Perhaps the most common approach to constructing a microbial network is to use OTUs (and clusters of OTUs) as nodes, and to use pairwise correlations between nodes as links. A positive correlation between a pair of nodes is considered to be co-occurrence, and a negative correlation between nodes is considered to be segregation. In this investigation, co-occurrences between OTUs will be used to construct a network.
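
To illustrate how a correlation matrix might be turned into network links, here is a minimal sketch. The correlation values, the OTU names, and the threshold of 0.6 are all assumptions made for this example; in practice the cut-off (or a significance test) would need to be justified.

```python
import numpy as np
import pandas as pd

# Hypothetical correlation matrix for three OTUs (values are illustrative only).
toy_corr = pd.DataFrame(
    [[1.0, 0.8, -0.7], [0.8, 1.0, 0.1], [-0.7, 0.1, 1.0]],
    index=["otu_a", "otu_b", "otu_c"],
    columns=["otu_a", "otu_b", "otu_c"],
)

threshold = 0.6  # arbitrary cut-off chosen for this sketch

# Keep only the upper triangle so each pair appears once and self-links are dropped.
upper = np.triu(np.ones(toy_corr.shape, dtype=bool), k=1)
pairs = toy_corr.where(upper).stack()

# Retain strong correlations (missing values fail the comparison and are dropped)
# and label each link by the sign of its correlation.
edges = pairs[pairs.abs() >= threshold].reset_index()
edges.columns = ["source", "target", "weight"]
edges["type"] = np.where(edges["weight"] > 0, "co-occurrence", "segregation")
print(edges)
```

The resulting edge list has one row per link and could be passed to a network library for analysis.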

For calculating the all-against-all pairwise correlation between OTUs, there are various different metrics:
- Pearson correlation coefficient
- Spearman's rank correlation coefficient
- SparCC (FUNCTIONINK USES THIS)
- SPIEC-EASI (JAKE)
- CoNet (OFTEN MENTIONED)
- FlashWeave (BEST)

#### Pros and cons of co-occurrence calculation methods
The different measures of pairwise correlation between OTUs have different features, assumptions, and limitations. The Pearson correlation coefficient and the Spearman's rank correlation coefficient are not recommended for use on OTU relative abundance data. Because the relative abundances of all OTUs must sum to 1, OTUs are biased towards a negative correlation with each other. For example, if there are only 2 OTUs, an increase in the absolute abundance (and thus the relative abundance) of one OTU will result in a direct decrease in the relative abundance of the other OTU, as it now accounts for a smaller share of the community's composition. This results in a negative correlation. The negative correlation is not biological in nature, as the absolute abundance of the second OTU has not actually decreased; it is an artefact of applying Pearson (or Spearman's rank) correlation to compositional data.

Following on from this, if one OTU has a relative abundance of 0.97, changes in its absolute abundance (and thus its relative abundance) will cause changes in the relative abundances of all the other OTUs. If the major OTU's relative abundance increases, all of the other OTUs' relative abundances decrease, and if the major OTU's relative abundance decreases, all of the other OTUs' relative abundances increase. This results not only in an artificial negative correlation between the major OTU and all other OTUs, but also in an artificial positive correlation between all of the lower-abundance OTUs. These artificial correlations can mask true biological correlations. Despite these issues, however, both methods are still often employed in studies of microbial communities.
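
The artefact described above can be reproduced with simulated data. The cell below is a sketch under stated assumptions: the three OTUs, their log-normal distributions, and the random seed are invented for illustration. Absolute abundances are drawn independently, so any correlation that appears after converting to relative abundances is purely compositional.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_samples = 500

# Simulated absolute abundances, drawn independently for each OTU,
# so the true correlation between any pair of OTUs is zero by construction.
absolute = pd.DataFrame({
    "major": rng.lognormal(mean=6.0, sigma=0.5, size=n_samples),
    "minor_1": rng.lognormal(mean=2.0, sigma=0.1, size=n_samples),
    "minor_2": rng.lognormal(mean=2.0, sigma=0.1, size=n_samples),
})

# Closing the data: each sample's abundances are rescaled to sum to 1.
relative = absolute.div(absolute.sum(axis=1), axis=0)

print("Pearson correlation of absolute abundances:")
print(absolute.corr(method="pearson").round(2))
print("Pearson correlation of relative abundances:")
print(relative.corr(method="pearson").round(2))
```

In the second matrix, the major OTU should appear negatively correlated with both minor OTUs, and the two minor OTUs should appear positively correlated with each other, even though none of these relationships exist in the absolute data.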
  

#### Pearson correlation coefficient
The Pearson correlation coefficient between all pairs of OTUs can be calculated in R or Python, using either the base language or packages, and using either existing functions or a function that I write myself.
In Python, the Pearson correlation coefficient can be calculated using base Python, Pandas, NumPy, or SciPy.
In the following cells, I import an example OTU table and use Pandas in Python to calculate the Pearson correlation between OTUs.

In [13]:
# Importing the required packages:
import pandas as pd
from pathlib import Path

# Importing the example data into a pandas data frame.
# The path is given relative to the notebook, so the working directory is left
# unchanged and the cell can be re-run safely.
data_dir = Path('../data')
example_1_otu_table = pd.read_csv(data_dir / 'waring_example_table.csv', index_col='id') # The 'id' column containing the row names becomes the index.

# Printing the data frame for inspection
print(example_1_otu_table)

     f59ec1a6553dac604541343598c08d44  e5150e28f0dcd12d650144c8fe033139  \
id                                                                        
A10                                 0                                 0   
A11                                 0                                 0   
A3                                  0                                 0   
A4                                  0                                 6   
A5                                  0                                 0   
..                                ...                               ...   
H70                                 0                                 0   
H71                                 0                                 0   
H72                                 0                                 0   
SB                                  0                                 0   
SS                                  0                                 0   

     67478b5e984e5525769

In [14]:
# Using pandas Pearson correlation function
otu_pears_correlation_matrix_1 = example_1_otu_table.corr(method='pearson')

# Printing the resulting OTU correlation matrix
print("This is an OTU correlation matrix, where the Pearson correlation coefficient was used to quantify correlation between OTUs: ")
print(otu_pears_correlation_matrix_1.head())
    

This is an OTU correlation matrix, where the Pearson correlation coefficient was used to quantify correlation between OTUs: 
                                  f59ec1a6553dac604541343598c08d44  \
f59ec1a6553dac604541343598c08d44                          1.000000   
e5150e28f0dcd12d650144c8fe033139                         -0.044203   
67478b5e984e5525769b788d3c4dff5a                         -0.064480   
1cb39d511497c2c772fc50e09d80d2eb                         -0.058044   
21026942efca55e8cc4625f67f3f5031                         -0.026120   

                                  e5150e28f0dcd12d650144c8fe033139  \
f59ec1a6553dac604541343598c08d44                         -0.044203   
e5150e28f0dcd12d650144c8fe033139                          1.000000   
67478b5e984e5525769b788d3c4dff5a                          0.156461   
1cb39d511497c2c772fc50e09d80d2eb                          0.217715   
21026942efca55e8cc4625f67f3f5031                         -0.045462   

                                 

#### Spearman's rank correlation coefficient
The Spearman's rank correlation coefficient is a non-parametric alternative to the Pearson correlation coefficient: it is calculated on the ranks of the values rather than on the values themselves, so it measures how monotonic a relationship is rather than how linear it is. This means that it can be applied to data that are not normally distributed, at the cost of some precision. The relative abundances of some of the bacterial OTUs are likely to be non-normally distributed. For example, a bacterial OTU might only be present under a specific environmental condition; this would result in the majority of its distribution across samples being an abundance of 0, with only samples matching the correct environmental condition showing an abundance greater than 0. For this reason, the Spearman's rank correlation coefficient might be more appropriate for assessing correlations between OTUs than the Pearson correlation coefficient.
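
As a small illustration of how the coefficient works, the sketch below checks that Spearman's rank correlation equals the Pearson correlation of the ranked data. The two zero-heavy OTU columns are hypothetical values invented for this example.

```python
import pandas as pd

# Hypothetical zero-inflated counts for two OTUs across six samples.
toy = pd.DataFrame({
    "otu_a": [0, 0, 0, 2, 15, 40],
    "otu_b": [0, 0, 1, 3, 4, 90],
})

# Rank each OTU's values across samples; tied values (such as the
# repeated zeros) receive the average of the ranks they span.
ranks = toy.rank(method="average")

# Pearson correlation of the ranks versus pandas' built-in Spearman.
spearman_via_ranks = ranks.corr(method="pearson").loc["otu_a", "otu_b"]
spearman_direct = toy.corr(method="spearman").loc["otu_a", "otu_b"]

print(spearman_via_ranks, spearman_direct)
```

Because only the ranks are used, the single very large value in each column has no more influence than any other top-ranked value would.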

As with the Pearson correlation coefficient, the Spearman's rank correlation coefficient between all pairs of OTUs can be calculated in R or Python, using either the base language or packages, and using either existing functions or a function that I write myself. Again, it can be calculated in Python via base Python, Pandas, NumPy, or SciPy. In the following cell, I use Pandas in Python to calculate the Spearman's rank correlation between OTUs.

In [15]:
# Using pandas Spearman's rank correlation function
otu_spear_correlation_matrix_1 = example_1_otu_table.corr(method='spearman')

# Printing the resulting OTU correlation matrix
print("This is an OTU correlation matrix, where the Spearman's rank correlation coefficient was used to quantify correlation between OTUs: ")
print(otu_spear_correlation_matrix_1.head())

This is an OTU correlation matrix, where the Spearman's rank correlation coefficient was used to quantify correlation between OTUs: 
                                  f59ec1a6553dac604541343598c08d44  \
f59ec1a6553dac604541343598c08d44                          1.000000   
e5150e28f0dcd12d650144c8fe033139                         -0.048076   
67478b5e984e5525769b788d3c4dff5a                         -0.070837   
1cb39d511497c2c772fc50e09d80d2eb                         -0.063783   
21026942efca55e8cc4625f67f3f5031                         -0.027022   

                                  e5150e28f0dcd12d650144c8fe033139  \
f59ec1a6553dac604541343598c08d44                         -0.048076   
e5150e28f0dcd12d650144c8fe033139                          1.000000   
67478b5e984e5525769b788d3c4dff5a                          0.144751   
1cb39d511497c2c772fc50e09d80d2eb                          0.179997   
21026942efca55e8cc4625f67f3f5031                         -0.048076   

                         

#### SparCC
SparCC was published in (Friedman & Alm, 2012), and has since been used frequently in inferring microbial correlations. It is intended to remove biases imposed by compositional data (relative abundances of OTUs that sum to 1), and to take into account the sparse nature of microbial networks (where there are likely to be relatively few genuine ecological interactions between taxa compared to the number of taxa). Because of these two features, SparCC is widely considered to be more effective at inferring bacterial co-occurrences than the Pearson correlation coefficient and the Spearman's rank correlation coefficient. SparCC can also be applied to any distribution of OTU relative abundances.

The first step of SparCC involves replacing relative abundances with a less compositional way of quantifying OTUs within each sample. This is the calculation of log-ratios for each pair of OTUs within each sample. For a pair of OTUs A and B within a sample, the ratio of the relative abundance of OTU A to that of OTU B is calculated and then log-transformed (the log-ratio of B to A is simply the negative of this value).
The purpose of calculating these log-ratios is that the ratio of the relative abundances of two OTUs is equal to the ratio of their true abundances. The log-ratios are therefore independent of which other OTUs are included within the sample, removing the compositionality of the data. In other words, these ratios are unaffected by changes in the abundance of OTUs outside of the pair.
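
The invariance of the ratio can be checked with a small worked example. This is a sketch with invented abundances for three hypothetical OTUs (A, B, and C); only OTU C changes between the two scenarios.

```python
import numpy as np

# Hypothetical absolute abundances for three OTUs in one sample.
absolute_before = np.array([100.0, 50.0, 10.0])   # OTU A, OTU B, OTU C
absolute_after = np.array([100.0, 50.0, 1000.0])  # only OTU C has changed


def to_relative(abundances):
    """Rescale abundances so that they sum to 1."""
    return abundances / abundances.sum()


rel_before = to_relative(absolute_before)
rel_after = to_relative(absolute_after)

# The relative abundance of OTU A drops even though its absolute abundance did not.
print(rel_before[0], rel_after[0])

# The log-ratio of A to B is the same before and after, and it matches
# the log-ratio of the absolute abundances.
log_ratio_before = np.log(rel_before[0] / rel_before[1])
log_ratio_after = np.log(rel_after[0] / rel_after[1])
print(log_ratio_before, log_ratio_after, np.log(100.0 / 50.0))
```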

Next, the dependencies between OTUs are investigated. To do this, the variance of the log-ratio for each pair of OTUs across all samples is calculated. A single log-ratio does not provide any information about correlation. However, if the ratio between a pair of OTUs never changes across samples, then the two OTUs rise and fall together. This means that a variance of 0 would indicate a perfect correlation between a pair of OTUs.

The quantity describing the dependency between a pair of OTUs is referred to as t<sub>ij</sub>, where i refers to OTU i and j refers to OTU j. It is the variance, across all samples, of the log of the ratio of the relative abundance of OTU i to the relative abundance of OTU j. On its own, this variance lacks a scale; to solve this, it is related to the correlation between the true (basis) abundances of the OTUs:

t<sub>ij</sub> = ω<sub>i</sub><sup>2</sup> + ω<sub>j</sub><sup>2</sup> − 2ρ<sub>ij</sub>ω<sub>i</sub>ω<sub>j</sub>

Here, ω<sub>i</sub><sup>2</sup> and ω<sub>j</sub><sup>2</sup> are the variances of the log-transformed basis abundances of OTUs i and j, and ρ<sub>ij</sub> is the correlation between them. In words, the two variances are added, and twice the product of the correlation and the two standard deviations is subtracted from this value. If t<sub>ij</sub> is greater than the sum of the variances of the log-transformed basis abundances of the two OTUs, then they are negatively correlated, and if it is less than the sum, then they are positively correlated.
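
The relationship between the variance of a log-ratio and the correlation of the underlying abundances follows from the standard identity Var(a - b) = Var(a) + Var(b) - 2 Cov(a, b). The sketch below checks it numerically on simulated log abundances. The means, variances, correlation of 0.6, and seed are arbitrary assumptions; this is not the SparCC algorithm itself, which has to estimate the individual variances from relative abundances alone.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples = 2000

# Simulated log basis (true) abundances for OTUs i and j with a known correlation.
# Standard deviations of 1.0 and 0.5 and a correlation of 0.6 are arbitrary choices.
cov = np.array([[1.0, 0.6 * 1.0 * 0.5],
                [0.6 * 1.0 * 0.5, 0.25]])
log_basis = rng.multivariate_normal(mean=[5.0, 3.0], cov=cov, size=n_samples)
log_i, log_j = log_basis[:, 0], log_basis[:, 1]

# t_ij: variance across samples of the log-ratio of the two OTUs.
t_ij = np.var(log_i - log_j, ddof=1)

# Right-hand side: sum of the two variances minus twice the covariance term.
var_i = np.var(log_i, ddof=1)
var_j = np.var(log_j, ddof=1)
rho_ij = np.corrcoef(log_i, log_j)[0, 1]
rhs = var_i + var_j - 2.0 * rho_ij * np.sqrt(var_i * var_j)

print(t_ij, rhs, var_i + var_j)
```

Because the simulated OTUs are positively correlated, t_ij comes out smaller than the sum of the two variances, matching the interpretation given above.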
I DO NOT TOTALLY UNDERSTAND THE EQUATION - COME BACK TO THIS LATER.


SparCC is coded in Python, using NumPy, SciPy, and pandas, and it is available officially from the following GitHub repository:
https://github.com/bio-developer/sparcc
There is also an implementation of SparCC (alongside some additional features) contained within the 'sparrow' package for Python.
In the cell below, I use the SparCC code cloned from its GitHub repository to apply SparCC to the example OTU table:

#### SpiecEasi
SpiecEasi was published in (Kurtz et al., 2015). It is intended to address both the compositional nature of relative abundances, and the fact that sequencing data usually comprises hundreds of OTUs to only tens to hundreds of samples (making the inference of OTU-OTU association networks severely under-powered).




NOTES NOTES NOTES NOTES NOTES
SparCC accounts for compositional biases, but it is still based upon correlation. Correlation may not be an effective measure of interaction between OTUs, because it can be a result of indirect connection.

OTU tables have a high level of dimensionality, because the number of OTUs is in the hundreds to thousands, but the number of samples is in the tens to hundreds.

SpiecEasi infers networks from potentially high dimensional community composition data.

First, it transforms the OTU data, and then it estimates the interaction graph from the transformed data using either neighbourhood selection or sparse inverse covariance selection.

Unlike SparCC, which uses empirical correlation or covariance estimation, SpiecEasi infers an underlying graphical model based upon conditional independence.

Two OTUs are conditionally independent if, given the abundances of all other nodes in the network, neither node provides additional information about the other.
If two OTUs are linked in the graphical model, this means that their abundances are not conditionally independent, and that there is a relationship between them.
This avoids the detection of correlated, but only indirectly connected, OTUs.
The model is an undirected graph in which links between nodes represent signed associations between OTUs.
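
The difference between correlation and conditional dependence can be shown with a simulated chain of three variables. This is a sketch under stated assumptions, not SpiecEasi: the variables A, B, and C, the coefficients, and the seed are invented, and the data are Gaussian rather than OTU counts. A affects B and B affects C, but A and C have no direct link.

```python
import numpy as np

rng = np.random.default_rng(2)
n_samples = 5000

# Simulated chain: A influences B, and B influences C; A and C are never linked directly.
a = rng.normal(size=n_samples)
b = 0.8 * a + rng.normal(scale=0.6, size=n_samples)
c = 0.8 * b + rng.normal(scale=0.6, size=n_samples)
data = np.column_stack([a, b, c])

# Marginal (ordinary) correlation: A and C appear associated.
corr = np.corrcoef(data, rowvar=False)

# Partial correlation from the precision matrix (inverse covariance):
# partial_ij = -P_ij / sqrt(P_ii * P_jj).
precision = np.linalg.inv(np.cov(data, rowvar=False))
scale = np.sqrt(np.diag(precision))
partial = -precision / np.outer(scale, scale)
np.fill_diagonal(partial, 1.0)

print("corr(A, C)    =", round(corr[0, 2], 2))
print("partial(A, C) =", round(partial[0, 2], 2))
```

The ordinary correlation between A and C should be clearly positive, while their partial correlation (which conditions on B) should be close to zero, so a conditional independence model would not draw a link between them.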

The first step of SpiecEasi involves transforming OTU count data.
This begins with a table of the relative abundances for each OTU for each sample.
The centered log-ratio (CLR) transformation is then applied within each sample to remove compositionality: each OTU's abundance is divided by the geometric mean of all abundances in that sample, and the result is log-transformed.
Next, a covariance matrix is constructed from the transformed data.
The aim is then to estimate the graph, which corresponds to the non-zero entries of the inverse covariance matrix (also known as a precision matrix). Values within this matrix quantify whether a pair of OTUs have an impact on each other's abundance, when the impact of other OTUs is removed.
There are two options for this estimation step:
1. Neighbourhood selection
2. Inverse covariance selection (the graphical lasso algorithm is used to estimate a sparse precision matrix)
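
The transformation step can be sketched in Python as follows. This is an illustration only, not the SpiecEasi implementation (which is an R package): the count table is hypothetical, and the pseudo-count of 1 used to avoid taking the log of zero is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical count table: rows are samples, columns are OTUs.
toy_counts = pd.DataFrame(
    {"otu_a": [10, 0, 25], "otu_b": [5, 40, 5], "otu_c": [85, 60, 20]},
    index=["S1", "S2", "S3"],
)

# A pseudo-count of 1 is added here so that zeros can be log-transformed;
# this value is an assumption of the sketch.
pseudo = toy_counts + 1

# CLR: log of each value minus the mean log of its sample
# (equivalently, the log of the value over the sample's geometric mean).
log_counts = np.log(pseudo)
clr = log_counts.sub(log_counts.mean(axis=1), axis=0)

# Covariance between OTUs across samples, computed on the CLR values.
clr_cov = clr.cov()
print(clr.round(3))
```

Each row of the CLR-transformed table sums to zero, which is why the resulting covariance matrix cannot simply be inverted and a sparse estimation method is needed instead.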




SpiecEasi assumes that the underlying network is sparse.


END OF NOTES END OF NOTES END OF NOTES

The following cell contains code written within R. Make sure to switch to the R kernel to run it. This is code for SpiecEasi.




#### FlashWeave
FlashWeave is another way of inferring interaction networks based on co-occurrence (or co-abundance).
It has several benefits:
- Optimized for computational speed
- Reduces compositionality effects and other artifacts (such as bystander effects, shared-niche biases, and sequencing biases)
- Allows integration of environmental factors (such as pH and temperature), estimating their influence, and removing indirect associations driven by them
- Mitigates impact of unmeasured confounding influences mediated by structural zeros (non-random absences driven by environmental or technical factors)






















