Machine learning to characterise microbial communities
======
This Jupyter notebook covers the progress of Matthew Shaun Grainger's MRes research project.

## Introduction
There are two main lines of investigation for my project, depending upon the data that I am able to work with:
- Compare microbial communities (snapshots of the same overall microbial community) from the same location at different time points.
- Compare microbial communities from different locations at the same time point.

The former approach investigates how the microbial community of a given environment changes over time (temporally), with the aim of understanding its assembly. The latter approach investigates how microbial communities change over space (spatially). These 'spaces' could also correspond to different ecosystem functions, for example when comparing the microbial community associated with healthy wheat with that associated with diseased wheat. The two approaches can be combined, allowing the assembly dynamics of different microbial communities to be compared across space.

Whichever approach is chosen, microbial communities will need to be identified. To do this, I will run several different pipelines on the same data, with the intention of reaching a consensus characterisation of its microbial community.


## General Pipeline for Microbial Community Identification
To identify a microbial community, microbial 'species' (ASVs or OTUs, hereafter referred to as OTUs) must be grouped. A microbial community is simply a group of OTUs that interact.

The identification of microbial communities often requires several steps:
- Data preparation
- Co-occurrence calculation (or alternatives)
- Network inference (constructing a network)
- Network analysis
  
These steps are discussed in the notebook sections below. Across these steps, pipelines can differ in the following ways:
- Sequencing
- OTU quantification
- Link type used for clustering, and how that link is measured (for example, different measures of correlation)
- Clustering method
- Network analysis method


## Data Preparation
Data preparation can be broken down further into the following steps:
- The microbial community of interest must be sampled, for instance by taking a sample of water or soil that contains microbes.
- DNA is extracted from the sample, amplified, and then sequenced. Different sequencing techniques may be used.
- OTUs are determined.
- OTUs are quantified.

Different sequencing methods provide different results. In this project, I will be using 16S rRNA sequencing data. Sequencing the 16S rRNA gene is common practice: the gene is present in all bacteria, and it contains hypervariable regions that can be used to classify bacterial OTUs. OTUs are classified based upon a high degree of sequence similarity.

Once OTUs have been classified, an OTU table must be constructed. An OTU table comprises a quantification of each OTU for each sample. There are various possible methods of quantification. Here, I will be using presence/absence data.
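
As a minimal sketch of this last step, assuming an OTU table of read counts with samples as rows and OTUs as columns, counts can be converted to presence/absence data with pandas. The sample names, OTU names, and counts below are made up for illustration.

```python
import pandas as pd

# Hypothetical OTU count table: rows are samples, columns are OTUs.
otu_counts = pd.DataFrame(
    {"otu_a": [0, 6, 0, 2], "otu_b": [3, 0, 0, 1], "otu_c": [0, 0, 0, 9]},
    index=["S1", "S2", "S3", "S4"],
)

# An OTU is treated as present in a sample if at least one read was assigned to it.
otu_presence = (otu_counts > 0).astype(int)

# Prevalence: the number of samples in which each OTU is present.
otu_prevalence = otu_presence.sum(axis=0)

print(otu_presence)
print(otu_prevalence)
```

The cut-off of one read is an assumption; a stricter minimum count could be used to guard against spurious reads.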

## Co-occurrence Calculation
There are several ways of constructing a network from microbial community samples. A network consists of nodes with links between them. For example, samples (and clusters of samples) can be nodes, connected by links in the form of their pairwise beta diversity similarities. In this representation, the samples are assumed to be communities, and each cluster within the network is considered to be a 'community class'. Alternatively, OTUs (and clusters of OTUs) can be nodes, and metabolic compatibility between different OTUs can be used as links.
Perhaps the most common approach to constructing a microbial network is to use OTUs (and clusters of OTUs) as nodes, and to use pairwise correlations between nodes as links. A positive correlation between a pair of nodes is considered to be co-occurrence, and a negative correlation between nodes is considered to be segregation. In this investigation, co-occurrences between OTUs will be used to construct a network.
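
To make the idea of correlations as links concrete, here is a rough sketch that turns a correlation matrix into a list of network links. The OTU table, the OTU names, and the cut-off of 0.6 are all hypothetical and chosen only for illustration; they are not part of the project's pipeline.

```python
from itertools import combinations

import pandas as pd

# Hypothetical OTU table: rows are samples, columns are OTUs.
otu_table = pd.DataFrame(
    {
        "otu_a": [1, 2, 3, 4, 5],
        "otu_b": [2, 4, 6, 8, 10],  # rises with otu_a
        "otu_c": [5, 4, 3, 2, 1],   # falls as otu_a rises
        "otu_d": [1, 3, 2, 3, 1],   # no linear relationship with otu_a
    }
)

corr = otu_table.corr(method="pearson")

# Hypothetical cut-off: only strong correlations are kept as links.
threshold = 0.6

rows = []
for otu_1, otu_2 in combinations(corr.columns, 2):  # each OTU pair once
    r = corr.loc[otu_1, otu_2]
    if abs(r) >= threshold:
        link_type = "co-occurrence" if r > 0 else "segregation"
        rows.append((otu_1, otu_2, r, link_type))

edges = pd.DataFrame(rows, columns=["source", "target", "correlation", "link_type"])
print(edges)
```

In practice a significance test or a permutation-based cut-off may be preferable to a fixed threshold.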

Several metrics and tools are available for calculating the all-against-all pairwise correlations between OTUs:
- Pearson correlation coefficient
- Spearman's rank correlation coefficient
- SparCC (FUNCTIONINK USES THIS)
- SPIEC-EASI (JAKE)
- CoNet (OFTEN MENTIONED)
  

#### Pearson correlation coefficient
The Pearson correlation coefficient between all pairs of OTUs can be calculated in R or Python, either with existing functions (from the base language or from packages) or with a custom function.
In Python, it can be calculated using base Python, pandas, NumPy, or SciPy.
In the following cells, I import an example OTU table and use pandas to calculate the Pearson correlation between OTUs.

In [40]:
# Importing the required packages:
import pandas as pd

# Navigating to the data directory.
# (The comment must sit on its own line: %cd treats trailing text as part of the path.)
%cd ../data

# Importing the example data into a pandas data frame
example_1_otu_table = pd.read_csv('waring_example_table.csv')
example_1_otu_table.set_index('id', inplace=True) # Making the first column containing the row names into the index.

# Printing the data frame for inspection
print(example_1_otu_table)

/home/matthew/Documents/ResearchProject/ResearchProjectRepository/data
     f59ec1a6553dac604541343598c08d44  e5150e28f0dcd12d650144c8fe033139  \
id                                                                        
A10                                 0                                 0   
A11                                 0                                 0   
A3                                  0                                 0   
A4                                  0                                 6   
A5                                  0                                 0   
..                                ...                               ...   
H70                                 0                                 0   
H71                                 0                                 0   
H72                                 0                                 0   
SB                    

In [41]:
# Using pandas Pearson correlation function
otu_pears_correlation_matrix_1 = example_1_otu_table.corr(method='pearson')

# Printing the resulting OTU correlation matrix
print("This is an OTU correlation matrix, where the Pearson correlation coefficient was used to quantify correlation between OTUs: ")
print(otu_pears_correlation_matrix_1)
    

This is an OTU correlation matrix, where the Pearson correlation coefficient was used to quantify correlation between OTUs: 
                                  f59ec1a6553dac604541343598c08d44  \
f59ec1a6553dac604541343598c08d44                          1.000000   
e5150e28f0dcd12d650144c8fe033139                         -0.044203   
67478b5e984e5525769b788d3c4dff5a                         -0.064480   
1cb39d511497c2c772fc50e09d80d2eb                         -0.058044   
21026942efca55e8cc4625f67f3f5031                         -0.026120   
...                                                            ...   
1d8da76c31e99cf199779e1001f26966                         -0.018402   
8ba040022c422c12b79b41b1fbccde4c                         -0.022477   
e525189021a1c53f9f93c489b4d906d7                         -0.018402   
93c1e2fae0206f301fab1216e8176919                         -0.018402   
96646b3f8b973ed90a72002ae72544e2                         -0.018402   
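
For reference, the Pearson correlation coefficient between two OTUs is their covariance divided by the product of their standard deviations. The sketch below checks `DataFrame.corr` against that definition, and against `np.corrcoef`, on a small made-up table; the values and OTU names are hypothetical and are not taken from the example data.

```python
import numpy as np
import pandas as pd

# Hypothetical OTU table: rows are samples, columns are OTUs.
toy_table = pd.DataFrame(
    {"otu_a": [0, 6, 0, 2, 1], "otu_b": [3, 0, 0, 1, 4], "otu_c": [0, 5, 1, 2, 0]}
)


def pearson(x, y):
    """Pearson r from its definition: covariance over the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_dev = x - x.mean()
    y_dev = y - y.mean()
    return (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())


pandas_matrix = toy_table.corr(method="pearson")
manual_r = pearson(toy_table["otu_a"], toy_table["otu_b"])

# np.corrcoef treats rows as variables by default, so rowvar=False is needed
# for a samples-by-OTUs table.
numpy_matrix = np.corrcoef(toy_table.to_numpy(), rowvar=False)

print(manual_r, pandas_matrix.loc["otu_a", "otu_b"])
```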

                                  

#### Spearman's rank correlation coefficient
Spearman's rank correlation coefficient is a non-parametric alternative to the Pearson correlation coefficient. It is calculated on the ranks of the values rather than on the values themselves, so it measures monotonic rather than strictly linear association and does not assume normally distributed data, at the cost of some statistical power when the data are in fact normally distributed. The relative abundances of some of the bacterial OTUs are likely to be non-normally distributed. For example, a bacterial OTU might only be present under a specific environmental condition: most samples would then show an abundance of 0, and only samples matching that condition would show an abundance greater than 0. For this reason, Spearman's rank correlation coefficient might be more appropriate than the Pearson correlation coefficient for assessing correlations between OTUs.
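
The difference between the two coefficients on zero-heavy data can be seen in a small made-up example. The abundances below are hypothetical; the sketch also checks that the Spearman coefficient equals the Pearson coefficient applied to ranks.

```python
import pandas as pd

# Hypothetical zero-inflated abundances: both OTUs are absent from most samples.
toy_table = pd.DataFrame(
    {
        "otu_a": [0, 0, 0, 0, 0, 1, 2, 50],
        "otu_b": [0, 0, 0, 0, 0, 3, 4, 5],
    }
)

pearson_r = toy_table.corr(method="pearson").loc["otu_a", "otu_b"]
spearman_r = toy_table.corr(method="spearman").loc["otu_a", "otu_b"]

# Spearman's coefficient is Pearson's coefficient applied to the ranks,
# with tied values (here, the zeros) sharing their average rank.
ranks = toy_table.rank(method="average")
spearman_from_ranks = ranks.corr(method="pearson").loc["otu_a", "otu_b"]

print(pearson_r, spearman_r, spearman_from_ranks)
```

Here the two OTUs rise together in every sample where they occur, so their ranks agree exactly, while the single very large count pulls the Pearson coefficient down.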

As with the Pearson correlation coefficient, Spearman's rank correlation coefficient between all pairs of OTUs can be calculated in R or Python, using existing functions or a custom function; in Python, the options again include base Python, pandas, NumPy, and SciPy. In the following cell, I use pandas to calculate the Spearman's rank correlation between OTUs.

In [42]:
# Using pandas Spearman's rank correlation function
otu_spear_correlation_matrix_1 = example_1_otu_table.corr(method='spearman')

# Printing the resulting OTU correlation matrix
print("This is an OTU correlation matrix, where the Spearman's rank correlation coefficient was used to quantify correlation between OTUs: ")
print(otu_spear_correlation_matrix_1)

This is an OTU correlation matrix, where the Spearman's rank correlation coefficient was used to quantify correlation between OTUs: 
                                  f59ec1a6553dac604541343598c08d44  \
f59ec1a6553dac604541343598c08d44                          1.000000   
e5150e28f0dcd12d650144c8fe033139                         -0.048076   
67478b5e984e5525769b788d3c4dff5a                         -0.070837   
1cb39d511497c2c772fc50e09d80d2eb                         -0.063783   
21026942efca55e8cc4625f67f3f5031                         -0.027022   
...                                                            ...   
1d8da76c31e99cf199779e1001f26966                         -0.018981   
8ba040022c422c12b79b41b1fbccde4c                         -0.027022   
e525189021a1c53f9f93c489b4d906d7                         -0.018981   
93c1e2fae0206f301fab1216e8176919                         -0.018981   
96646b3f8b973ed90a72002ae72544e2                         -0.018981   

                          

#### SparCC
SparCC was introduced by Friedman & Alm (2012), and has since been used frequently to infer microbial correlations. It is intended to remove biases imposed by compositional data (relative abundances of OTUs that sum to 1), and to take into account the sparse nature of microbial networks (where there are likely to be relatively few genuine ecological interactions between taxa compared to the number of taxa). Because of these two features, SparCC is widely considered to be more effective at inferring bacterial co-occurrences than the Pearson correlation coefficient and Spearman's rank correlation coefficient.
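
The compositional bias can be illustrated with a small simulation. The sketch below is not SparCC itself; it only shows the problem that SparCC addresses, using made-up abundances drawn independently for each OTU (the distribution, its parameters, and the table size are arbitrary assumptions). It also shows that the variance of a log-ratio between two OTUs is unaffected by converting to relative abundances, which is the kind of quantity SparCC is built on (Friedman & Alm, 2012).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical absolute abundances: 4 OTUs drawn independently across 200 samples.
absolute = rng.lognormal(mean=2.0, sigma=0.5, size=(200, 4))

# Closure: convert each sample (row) to relative abundances that sum to 1.
relative = absolute / absolute.sum(axis=1, keepdims=True)

# Because every row sums to 1, each row of the covariance matrix sums to zero,
# so an OTU's covariances with the other OTUs sum to minus its own variance.
cov_relative = np.cov(relative, rowvar=False)
row_sums = cov_relative.sum(axis=1)
off_diagonal_sums = row_sums - np.diag(cov_relative)


def log_ratio_variance(table, i, j):
    """Variance across samples of log(x_i / x_j); the per-sample total cancels in the ratio."""
    return np.var(np.log(table[:, i] / table[:, j]))


print(np.corrcoef(absolute, rowvar=False).round(2))  # independent OTUs
print(np.corrcoef(relative, rowvar=False).round(2))  # after closure
print(log_ratio_variance(absolute, 0, 1), log_ratio_variance(relative, 0, 1))
```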

SparCC is written in Python, using NumPy, SciPy, and pandas, and it is officially available from the following GitHub repository:
https://github.com/bio-developer/sparcc
There is also an implementation of SparCC (alongside some additional features) contained within the 'sparrow' package for Python.
In the cell below, I use the SparCC code cloned from its GitHub repository to apply SparCC to the example OTU table: