Summary of 'Replaying the tape of ecology to domesticate wild microbiota'
=====
My research project uses data from 'Replaying the tape of ecology to domesticate wild microbiota'. Below is a summary of this paper:

#### Research questions answered:
    - Can bacterial community assembly be reproduced?
    - Do reproduced bacterial communities confer the same function?
      
#### Experiment:
    - 275 starting communities, each from a unique tree hole
    - Different initial taxonomic compositions
    - Composition of these starting communities was measured before cryopreservation
    - After cryopreservation, 4 replicates of each of the 275 starting communities were placed in a standardised environment for 7 days, and their composition was then sampled
    - Communities associated with leaf litter degradation function

#### Techniques used:
    - Identified that the 4 final community replicates of each of the 275 tree holes were clustered
    - Used ordination to track changes in abundance of taxa, showing that the direction of travel from starting to final communities was consistent among communities and replicates
    - Unsupervised clustering of communities based on composition - 6 initial classes (the `Part_Time0D_6` partition in the metadata), 2 final classes
    - Metagenomic predictions of functional differences between classes: to identify associations between the 2 final community classes and the functional performance of communities, metagenomes were predicted from the 16S rRNA sequencing data using PICRUSt, and the predicted genes were categorised by function using the KEGG database. This shows what sorts of genes each community class has in terms of what the genes do, e.g. take up nutrients
    - Direct measurements of functional performance of final communities: degradation rate of 4 substrates, whole community metabolic activity, respiration rate, and cell numbers.

My biological research question:
======
### How do differences in the network structure of microbial communities link to function?

How can this be achieved?
=====

#### Co-occurrence network options

Co-occurrence network of starting communities:
There are 2 options...
  
- OPTION 1: Because there are no replicates for the starting communities of a given tree hole, only one co-occurrence network can be constructed across the starting communities of all of the tree holes.

- OPTION 2: Separate the tree holes according to community class, and construct co-occurrence networks for each class (6 in total, following the `Part_Time0D_6` partition in the metadata).

Co-occurrence network of final communities:
There are 3 options...
  
- OPTION 1: Calculate a co-occurrence network across all tree holes for each final community replicate. This would result in 4 co-occurrence networks. The co-occurrence network for final community replicate 1, for instance, would be across the replicate 1 final communities for all 275 tree holes. Could also do the same, whilst ignoring replicates, resulting in a network across 4 replicates of the 275 tree holes (1100 samples) - this seems a bit strange as the number of samples between this network and the starting network are mismatched.
  
- OPTION 2: Because there are 4 replicate final communities for every tree hole (and thus every starting community), it is possible to create a co-occurrence network for each tree hole across its 4 replicates. This would result in 275 co-occurrence networks. Unlike in the starting community co-occurrence network and in OPTION 1, here we would be calculating co-occurrence across replicates of a single tree hole, rather than co-occurrence across multiple tree holes. I am not sure how comparable this would be to the starting network, given that the starting tree holes are heterogeneous, so they seemingly cannot be generalised to replicates.
  
- OPTION 3: Separate the tree holes according to community class and replicate, and construct a co-occurrence network for each class within each replicate (8 total because 2 community classes in each of 4 replicates). 
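
To make the trade-off between the three options more concrete, the sketch below counts how many samples would feed each network under each grouping. The sample records are hypothetical toy data (three tree holes rather than 275, and a class that is assumed to be the same across the replicates of a tree hole), and `samples_per_network` is an illustrative helper rather than part of any library.

```python
from collections import Counter

# Hypothetical toy metadata: (tree hole, replicate, final class) per final sample.
# The real data set has 275 tree holes x 4 replicates = 1100 final samples.
samples = [
    (hole, rep, cls)
    for hole, cls in [("A", 1), ("B", 1), ("C", 2)]
    for rep in (1, 2, 3, 4)
]

def samples_per_network(records, key):
    """Count how many samples fall into each network for a given grouping key."""
    return Counter(key(r) for r in records)

option1 = samples_per_network(samples, key=lambda r: r[1])          # one network per replicate
option2 = samples_per_network(samples, key=lambda r: r[0])          # one network per tree hole
option3 = samples_per_network(samples, key=lambda r: (r[1], r[2]))  # one per replicate x class

for name, counts in [("OPTION 1", option1), ("OPTION 2", option2), ("OPTION 3", option3)]:
    print(name, "networks:", len(counts), "samples per network:", sorted(counts.values()))
```

With the real data, OPTION 1 would give 4 networks of 275 samples each, OPTION 2 would give 275 networks of only 4 samples each, and OPTION 3 would give 8 networks whose sizes depend on how many tree holes fall into each class.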

#### Narrowing down the options
- I think that OPTION 2 for the starting community co-occurrence network and OPTION 3 for the final community co-occurrence network are the best options. Because the function of each community class is known, the function associated with each network constructed within a class will also be known. The structure of the networks in the different classes can then be compared in relation to this function.
- PROBLEM: How does this add to the previous paper? What can we use to compare network structure in a biologically meaningful way? I am confused about what additional information constructing a network adds to the existing information on community composition that is available for each community class.

- ALTERNATIVE:
- Could use final community OPTION 2 to get a co-occurrence network for each tree hole, and then compare these co-occurrence network structures in relation to the known function of each sample. Here, we would either ignore the starting communities altogether, or treat them as a single tree hole, whereby each individual tree hole starting community would be treated as a replicate of an overall meta-treehole.
- PROBLEM: The starting tree hole communities were different, so there wasn't really a single starting point for all of the final communities.
- PROBLEM: Still need to find out what network features to compare in a biologically meaningful way, and how this adds to the previous paper.

#### Current choice:
- I have currently chosen OPTION 2 for the starting community co-occurrence network and OPTION 3 for the final community co-occurrence network - I will split the data by class and replicate, and construct a co-occurrence network for each combination.

#### What to compare between networks for biological meaning
- The main question is what to compare between networks for biological meaning. We already know the composition of each community class, and so we already know the composition of each network (assuming we pick the combination of starting community OPTION 2 and final community OPTION 3). What can we compare that provides additional information? Here are some options:
    - Number of links
    - Topology
    - Central species
    - Presence and types of communities (modules) within each network
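
As a rough illustration of how some of these features could be extracted, the sketch below computes the number of links, the density, the highest-degree nodes and the number of connected components from an edge list. It uses only the Python standard library; the edge list is hypothetical toy data, `summarise_network` is an illustrative helper, and connected components are only a crude stand-in for proper community detection.

```python
from collections import defaultdict

# Hypothetical toy edge list in (node A, node B, weight) form.
edges = [
    ("ASV_1", "ASV_2", 0.8),
    ("ASV_2", "ASV_3", 0.5),
    ("ASV_1", "ASV_3", -0.4),
    ("ASV_4", "ASV_5", 0.6),
]

def summarise_network(edge_list):
    """Return simple structural descriptors of an undirected network.

    Assumes each pair of nodes appears at most once in edge_list.
    """
    neighbours = defaultdict(set)
    for a, b, _weight in edge_list:
        neighbours[a].add(b)
        neighbours[b].add(a)

    n_nodes = len(neighbours)
    n_links = len(edge_list)
    # Density: realised links divided by the number of possible node pairs.
    density = n_links / (n_nodes * (n_nodes - 1) / 2) if n_nodes > 1 else 0.0
    # Degree is the simplest centrality measure.
    degree = {node: len(nbrs) for node, nbrs in neighbours.items()}

    # Connected components found with a depth-first search.
    unseen, components = set(neighbours), []
    while unseen:
        stack, component = [unseen.pop()], set()
        while stack:
            node = stack.pop()
            component.add(node)
            new = neighbours[node] & unseen
            unseen -= new
            stack.extend(new)
        components.append(component)

    return {
        "n_nodes": n_nodes,
        "n_links": n_links,
        "density": density,
        "max_degree_nodes": sorted(n for n, d in degree.items() if d == max(degree.values())),
        "n_components": len(components),
    }

summary = summarise_network(edges)
print(summary)
```

Running the same summary on every class-within-replicate network would give one row of descriptors per network, which could then be set against the known function of each class.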


Pipeline
======
This is the pipeline for Matthew Shaun Grainger's MRes Research Project.
Overall, it involves constructing co-occurrence networks for each class of each replicate of the starting and final microbial communities, and then comparing the structures of these networks.

## 1. Preliminaries
1. Importing all of the data from the treeholes paper
2. Familiarizing myself with the data, in particular with the metadata. Extracting the communities that I will be working with.
3. Installing FlashWeave, and making it work. So far, it seems to outperform other methods.

#### 1) Importing all of the data from the treeholes paper
The below code is written in Python - be sure to switch to the Python kernel to run it!

In [1]:
# Importing packages
import pandas as pd

# Importing ASV table
asv_table = pd.read_csv('../data/seqtable_readyforanalysis.csv', sep='\t')
# Importing metadata
metadata = pd.read_csv('../data/metadata_Time0D-7D-4M_May2022_wJSDpart_ext.csv', sep='\t')
# Importing taxonomy
taxonomy_data = pd.read_csv('../data/taxa_wsp_readyforanalysis.csv', sep='\t')

#### 2) Familiarizing myself with the data, in particular with the metadata. Extracting the communities that I will be working with.
The below code is written in Python - be sure to switch to the Python kernel to run it!

##### ASV Table
The ASV table is in the standard format. There is a column for each ASV, and there is a row for each sample. Each cell contains the abundance of an ASV in a given sample.

In [None]:
# Looking at the ASV table
asv_table

##### Metadata
The metadata contains a row for each sample. It has columns which provide additional information about each sample. Out of these columns, the following are relevant:
- sampleid: Each sample has a unique ID
- replicate: Replicate of the experiment. Starting communities are labelled as Rep0
- parent: ID of the starting communities from which final communities departed. For example, final community with id WYT14.1 is the replicate 1 of starting community WYT14, which is the ID indicated in this field.
- Location: Sampling field from which the starting communities were sampled.
- Experiment: Either starting communities (0D) or final communities (7D_rep$), where "$" = 1-4 depending on the replicate. Samples belonging to experiment 4M should be ignored.
- Part_Time0D_17: Id of the class the starting communities belong to, corresponding to the maximum of the Calinski-Harabasz index found considering starting communities only. (1 to 17 and NA for samples not belonging to the set)
- Part_Time0D_6: Id of the class the starting communities belong to, corresponding to the second maximum of the Calinski-Harabasz index found considering starting communities only (analysed in this work) (runs from 1 to 6 and NA for samples not belonging to the set)
- Part_Time7D_rep1_2: Id of the class the first replicate of final communities belong to, corresponding to the maximum of the Calinski-Harabasz index. (1 to 2 and NA for samples not belonging to the set)
- Part_Time7D_rep2_2: Id of the class the second replicate of final communities belong to, corresponding to the maximum of the Calinski-Harabasz index. (1 to 2 and NA for samples not belonging to the set)
- Part_Time7D_rep3_2: Id of the class the third replicate of final communities belong to, corresponding to the maximum of the Calinski-Harabasz index. (1 to 2 and NA for samples not belonging to the set)
- Part_Time7D_rep4_2: Id of the class the fourth replicate of final communities belong to, corresponding to the maximum of the Calinski-Harabasz index. (1 to 2 and NA for samples not belonging to the set)
- replicate.partition: Combination of the replicate and partition ids. Note that the ids obtained for each replicate independently (1 and 2 for each replicate) can be considered paired across replicates, since we showed they have similar compositions. Also note that Rep0.Class1 and Rep0.Class2 are ids used in Experiments 0D and 4M. Therefore, those belonging to Experiment = 4M should be ignored.
- partition: Only the class. As in the previous field, one should exclude samples in Experiment = 4M.
- ExpCompact: Another identifier for the experiment, in which the 4 replicates of final communities have the same id (note that in the field Experiment the different replicates were differentiated). Levels are Starting, Final (and Evolved should be excluded).
- exp.replicate.partition: Combination of ExpCompact, and replicate.partition
- exp.partition: Combination of ExpCompact, and partition
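
The relationship between these combined columns can be sketched on toy data. The rows below are hypothetical (including the `WYT14.4M` sample ID), and the assumption that the parts are joined with a dot is inferred from labels such as `Final.Rep1.Class1` rather than taken from the data set's documentation.

```python
import pandas as pd

# Hypothetical toy rows mimicking the metadata columns described above.
toy_metadata = pd.DataFrame({
    "sampleid": ["WYT14", "WYT14.1", "WYT14.2", "WYT14.4M"],
    "Experiment": ["0D", "7D_rep1", "7D_rep2", "4M"],
    "ExpCompact": ["Starting", "Final", "Final", "Evolved"],
    "replicate.partition": ["Rep0.Class3", "Rep1.Class1", "Rep2.Class1", "Rep0.Class1"],
})

# Drop the 4M (Evolved) samples first, as the column notes recommend.
kept = toy_metadata[toy_metadata["Experiment"] != "4M"].copy()

# Rebuild the combined label from its two parts (assumed to be joined by a dot).
kept["rebuilt_label"] = kept["ExpCompact"] + "." + kept["replicate.partition"]

print(kept[["sampleid", "rebuilt_label"]])
```

Filtering before grouping matters because `Rep0.Class1` can refer to either a starting or a 4M sample; once `ExpCompact` is part of the label the two can no longer be confused.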

In [None]:
# Looking at the metadata
metadata

In [None]:
# The names of the columns in the metadata
print(metadata.columns)

In [None]:
print(metadata['replicate.partition'].value_counts())

In [None]:
print(metadata['exp.replicate.partition'].value_counts())

##### Taxonomy

In [None]:
taxonomy_data

#### Extracting the communities that I will be working with:

In [2]:
# Getting rid of the samples belonging to experiment 4M:
asv_table.reset_index(inplace=True) # Making the sample ID into a column for the ASV table
asv_table.rename(columns={'index': 'sampleid'}, inplace=True) # renaming this new column to 'sampleid'
metadata_asv_table = pd.merge(asv_table, metadata, on='sampleid') # Merging metadata and asv_table by 'sampleid'
main_data = metadata_asv_table[metadata_asv_table['Experiment'] != '4M'] # Taking rows which are not 4M samples

# Separating the samples by class and replicate (a separate data frame for each class within each replicate).
starting_class1 = main_data[main_data['exp.replicate.partition'] == 'Starting.Rep0.Class1']
starting_class2 = main_data[main_data['exp.replicate.partition'] == 'Starting.Rep0.Class2']
starting_class3 = main_data[main_data['exp.replicate.partition'] == 'Starting.Rep0.Class3']
starting_class4 = main_data[main_data['exp.replicate.partition'] == 'Starting.Rep0.Class4']
starting_class5 = main_data[main_data['exp.replicate.partition'] == 'Starting.Rep0.Class5']
starting_class6 = main_data[main_data['exp.replicate.partition'] == 'Starting.Rep0.Class6']

final1_class1 = main_data[main_data['exp.replicate.partition'] == 'Final.Rep1.Class1']
final1_class2 = main_data[main_data['exp.replicate.partition'] == 'Final.Rep1.Class2']
final2_class1 = main_data[main_data['exp.replicate.partition'] == 'Final.Rep2.Class1']
final2_class2 = main_data[main_data['exp.replicate.partition'] == 'Final.Rep2.Class2']
final3_class1 = main_data[main_data['exp.replicate.partition'] == 'Final.Rep3.Class1']
final3_class2 = main_data[main_data['exp.replicate.partition'] == 'Final.Rep3.Class2']
final4_class1 = main_data[main_data['exp.replicate.partition'] == 'Final.Rep4.Class1']
final4_class2 = main_data[main_data['exp.replicate.partition'] == 'Final.Rep4.Class2']

# Removing metadata columns
columns_to_drop = ['sampleid', 'Name.2', 'Community', 'Species', 'replicate',
       'BreakingBag', 'parent', 'Location', 'Experiment', 'Part_Time0D_17',
       'Part_Time0D_6', 'Part_Time4M_64', 'Part_Time7D_rep1_2',
       'Part_Time7D_rep2_2', 'Part_Time7D_rep3_2', 'Part_Time7D_rep4_2',
       'replicate.partition', 'partition', 'ExpCompact',
       'exp.replicate.partition', 'exp.partition']

starting_class1 = starting_class1.drop(columns=columns_to_drop)
starting_class2 = starting_class2.drop(columns=columns_to_drop)
starting_class3 = starting_class3.drop(columns=columns_to_drop)
starting_class4 = starting_class4.drop(columns=columns_to_drop)
starting_class5 = starting_class5.drop(columns=columns_to_drop)
starting_class6 = starting_class6.drop(columns=columns_to_drop)

final1_class1 = final1_class1.drop(columns=columns_to_drop)
final1_class2 = final1_class2.drop(columns=columns_to_drop)
final2_class1 = final2_class1.drop(columns=columns_to_drop)
final2_class2 = final2_class2.drop(columns=columns_to_drop)
final3_class1 = final3_class1.drop(columns=columns_to_drop)
final3_class2 = final3_class2.drop(columns=columns_to_drop)
final4_class1 = final4_class1.drop(columns=columns_to_drop)
final4_class2 = final4_class2.drop(columns=columns_to_drop)

# Creating .csv files for each of these subsets.
# This code is not intended to be reproducible, so it does not need to be super efficient.
# The easiest way to access these Python-created subsets in Julia, where FlashWeave is coded, is to import them from .csv files.
starting_class1.to_csv('../data/starting_class1.csv', index=False)
starting_class2.to_csv('../data/starting_class2.csv', index=False)
starting_class3.to_csv('../data/starting_class3.csv', index=False)
starting_class4.to_csv('../data/starting_class4.csv', index=False)
starting_class5.to_csv('../data/starting_class5.csv', index=False)
starting_class6.to_csv('../data/starting_class6.csv', index=False)

final1_class1.to_csv('../data/final1_class1.csv', index=False)
final1_class2.to_csv('../data/final1_class2.csv', index=False)
final2_class1.to_csv('../data/final2_class1.csv', index=False)
final2_class2.to_csv('../data/final2_class2.csv', index=False)
final3_class1.to_csv('../data/final3_class1.csv', index=False)
final3_class2.to_csv('../data/final3_class2.csv', index=False)
final4_class1.to_csv('../data/final4_class1.csv', index=False)
final4_class2.to_csv('../data/final4_class2.csv', index=False)
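
The same subsetting can also be written as a single loop over the `exp.replicate.partition` labels. This is only a sketch under stated assumptions: `toy_main_data` and `toy_columns_to_drop` are hypothetical stand-ins for `main_data` and `columns_to_drop`, `label_to_filename` and `write_subsets` are illustrative helpers, and the files are written to a temporary directory rather than `../data`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy stand-in for `main_data`: two metadata columns plus two ASV columns.
toy_main_data = pd.DataFrame({
    "sampleid": ["s1", "s2", "s3", "s4"],
    "exp.replicate.partition": [
        "Starting.Rep0.Class1", "Starting.Rep0.Class1",
        "Final.Rep1.Class1", "Final.Rep1.Class2",
    ],
    "ASV_1": [10, 0, 3, 7],
    "ASV_2": [0, 5, 2, 1],
})
toy_columns_to_drop = ["sampleid", "exp.replicate.partition"]

def label_to_filename(label):
    """Map 'Final.Rep1.Class2' to 'final1_class2.csv' and 'Starting.Rep0.Class1' to 'starting_class1.csv'."""
    experiment, replicate, class_name = label.split(".")
    prefix = "starting" if experiment == "Starting" else f"final{replicate[len('Rep'):]}"
    return f"{prefix}_{class_name.lower()}.csv"

def write_subsets(data, drop_columns, out_dir):
    """Write one ASV-only CSV per class-within-replicate group and return the paths."""
    written = []
    for label, subset in data.groupby("exp.replicate.partition"):
        path = Path(out_dir) / label_to_filename(label)
        subset.drop(columns=drop_columns).to_csv(path, index=False)
        written.append(path)
    return written

with tempfile.TemporaryDirectory() as tmp:
    paths = write_subsets(toy_main_data, toy_columns_to_drop, tmp)
    written_names = sorted(p.name for p in paths)
    print(written_names)
```

Grouping on the label avoids repeating one line per subset, and rows without a label are skipped because `groupby` drops missing keys by default.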

In [None]:
starting_class5

#### 3) Installing FlashWeave, and making it work. So far, it seems to outperform other methods.

Installed FlashWeave.
FlashWeave is programmed within Julia - please make sure to switch to the Julia kernel to run the below code.

## 2. Infer networks



In [1]:
using FlashWeave
data_path = "../data/final1_class1.csv"
netw_results = learn_network(data_path, sensitive=true, heterogeneous=false)


### Loading data ###

### Normalizing ###

Removing variables with 0 variance (or equivalently 1 level) and samples with 0 reads
	-> discarded 0 samples and 567 variables

Normalization

### Learning interactions ###

Inferring network with FlashWeave - sensitive (conditional)

	Run information:
	sensitive - true
	heterogeneous - false
	max_k - 3
	alpha - 0.01
	sparse - false
	workers - 1
	OTUs - 901
	MVs - 0

Automatically setting 'n_obs_min' to 20 for enhanced reliability
Computing univariate associations

Univariate degree stats:
Summary Stats:
Length:         901
Missing Count:  0
Mean:           27.738069
Minimum:        1.000000
1st Quartile:   13.000000
Median:         20.000000
3rd Quartile:   36.000000
Maximum:        129.000000



Starting conditioning search

Preparing workers..

Done. Starting inference..
Starting convergence checks at 2437 edges.
Latest convergence step change: 0.31253
Latest convergence step change: 0.54347
Latest convergence step change: 0.09666
Latest 


Mode:
FlashWeave - sensitive (conditional)

Network:
2692 interactions between 901 variables (901 OTUs and 0 MVs)

Unfinished variables:
none

Rejections:
not tracked


In [2]:
save_network("../data/final1_class1_network_output.edgelist", netw_results)

### Applying functionInk

First, modifying the format of the output of FlashWeave to match the input format for functionInk.
Please switch to the Python kernel to run the code below.

In [None]:
import pandas as pd # Importing Pandas again, as switched back to Python kernel

# Removing the headers
with open('../data/final1_class1_network_output.edgelist', 'r') as f:
    lines = f.readlines()
with open('../data/final1_class1_network_output.edgelist', 'w') as f:
    f.writelines(lines[2:])

# Adding new headers and a column for the type of interaction (here, all assumed to be the same)
final1_class1_network_data = pd.read_csv("../data/final1_class1_network_output.edgelist", sep="\t", header=None, names=["ASV_A", "ASV_B", "Interaction"])
final1_class1_network_data['Type'] = 1

# Outputting as a .tsv
final1_class1_network_data.to_csv('../data/final1_class1_network_data.tsv', sep='\t', index=False, header=['#ASV_A', 'ASV_B', 'Interaction', 'Type'])
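
Because the cell above overwrites the `.edgelist` file in place, running it a second time would strip two further lines from it. A minimal sketch of a conversion that leaves the input untouched is given below, using only the standard library. The helper name `edgelist_to_functionink_rows`, the placeholder header lines and the second edge are hypothetical; the only assumptions about the file are the ones the cell above already makes (two header lines, then tab-separated node, node, weight).

```python
import csv
import io

def edgelist_to_functionink_rows(lines, n_header_lines=2, link_type=1):
    """Skip the leading header lines and append a constant link-type column."""
    rows = [["#ASV_A", "ASV_B", "Interaction", "Type"]]
    for line in lines[n_header_lines:]:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        node_a, node_b, weight = line.split("\t")
        rows.append([node_a, node_b, weight, str(link_type)])
    return rows

# Hypothetical toy input: two placeholder header lines followed by tab-separated edges.
toy_lines = [
    "# placeholder header line 1\n",
    "# placeholder header line 2\n",
    "ASV_17\tASV_20\t0.240\n",
    "ASV_17\tASV_35\t-0.310\n",
]

rows = edgelist_to_functionink_rows(toy_lines)

# Written to an in-memory buffer here; a real run would write to a new .tsv file instead.
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
print(buffer.getvalue())
```

Reading from the original file and writing to a separate one keeps the step safe to re-run.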

Next, running the detailed pipeline for functionInk, as described in its vignette. 

In [1]:
# Moving to the functionInk repository to run its commands
%cd functionInk

/home/matthew/Documents/ResearchProject/ResearchProjectRepository/code/functionInk


In [5]:
# The first step to the pipeline - computing similarities between nodes
!./NodeSimilarity.pl -w 1 -d 0 -t 0 -f ../../data/final1_class1_network_data.tsv

  
**************************************************  
* Building nodes similarities NodeSimilarity.pl  *  
**************************************************  
  
>> Reading input arguments: 
~~~ The network is weighted=1 or 2/unweighted=0? Value = 1
~~~ The network is directed=1/undirected=0? Value = 0
~~~ The network has different types=1/or a single type=0? Value = 0
~~~ The network file is = ../../data/final1_class1_network_data.tsv
 
>> Processing input arguments: 
~~~ Reading Node A from column 1
~~~ Reading Node B from column 2
~~~ Working with a weighted network -- Tanimoto coefficient
~~~ Reading weights from column 3
~~~ Working with a network with a single type of link
  
~~~ Input path:  ../../data/final1_class1_network_data.tsv
~~~ Input file:  final1_class1_network_data.tsv
  
 
>> Reading the network: 
~~~ The first lines for the fields read from file and after conversions are: 
..Skip header:  1
~~~ Reading fields: nodeA = ASV_17,nodeB= ASV_20, type= 1, weight = 0.240

In [7]:
# The second step - clustering nodes using the similarity metrics calculated
!./NodeLinkage.pl -fn ../../data/final1_class1_network_data.tsv -fs Nodes-Similarities_final1_class1_network_data.tsv

>> Reading input arguments: 
~~~ The network file is = ../../data/final1_class1_network_data.tsv
~~~ The similarity file is = Nodes-Similarities_final1_class1_network_data.tsv
~~~ The clustering will run until 
~~~~ we get a single cluster 
~~~ Reading the similarity from column  = 3
~~~ Clustering with = Average
  
***********************************************  
* Finding communities with nodes clustering   *  
***********************************************  
  
~~ FIELDS for the FIRST FILE
~~~ Reading Node A from column 1
~~~ Reading Node B from column 2
~~~ Reading Similarities from column 3
~~ FIELDS for the SECOND FILE
~~~ Reading Node A from column 1
~~~ Reading Node B from column 2
~~~ Reading Similarities from column 3
~~ CLUSTERING parameters: 
~~~ Performing a clustering with Average Linkage method
~~~ Clustering with no stopping point
  
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  
~~ Opening the first input file: 
~~~ Input path:  Nodes-Similarities_final1_class1_network_data.t
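
The two steps above were run for a single network. A hedged sketch of how the same two command lines could be generated for all 14 class-within-replicate subsets follows. It reuses only the flags shown above; the helper `functionink_commands` is illustrative, and the `<subset>_network_data.tsv` naming for the other subsets is an assumption based on the one file created so far. The commands are printed rather than executed.

```python
from pathlib import Path

def functionink_commands(network_file):
    """Build the two command lines used above for a given network file."""
    similarity_file = f"Nodes-Similarities_{Path(network_file).name}"
    node_similarity = ["./NodeSimilarity.pl", "-w", "1", "-d", "0", "-t", "0", "-f", network_file]
    node_linkage = ["./NodeLinkage.pl", "-fn", network_file, "-fs", similarity_file]
    return node_similarity, node_linkage

# The 14 class-within-replicate subsets written earlier in this notebook.
subset_names = [f"starting_class{c}" for c in range(1, 7)] + [
    f"final{r}_class{c}" for r in range(1, 5) for c in (1, 2)
]

for name in subset_names:
    # Assumed file naming, following final1_class1_network_data.tsv.
    similarity_cmd, linkage_cmd = functionink_commands(f"../../data/{name}_network_data.tsv")
    print(" ".join(similarity_cmd))
    print(" ".join(linkage_cmd))
    # To execute for real (from inside the functionInk directory), each list could be
    # passed to subprocess.run(..., check=True).
```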

The code in the below cell is copied and modified from nodeLinkage_analysis.R found within the functionInk repository. Please switch to the R kernel to run it.

In [3]:
# Third step - identifying the optimal partition

library(ggplot2)
library(here)

# --- Path to dir of history compact file
dir="." #  path to dir of history compact file relative to the root of the repo (e.g. fix to "." if it is located in the root of the directory)

# --- Name of the history file
file.hist="HistCompact-NL_Average_NoStop_final1_class1_network_data.tsv" #"history file name"

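# here() resolves to the repository root, so the functionInk checkout (assumed to sit under code/functionInk) is named explicitly.
# With here("scripts","analysis_R"), setwd(src.dir) fails with "cannot change working directory".
src.dir=here("code","functionInk","scripts","analysis_R")
root.dir=here("code","functionInk")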
setwd(src.dir)
source("extractPartDensity.R")
if(dir == "."){
  setwd(root.dir)
}else{
  setwd(here(dir))
}

hist.comp=read.table(file=file.hist,sep="\t",header = TRUE) # for current NodeLink.pl version (Dec 2018)

part_density=extractPartDensity(hist.comp)

part_density$total_dens # maximum of the total partition density
part_density$int_dens # maximum of the internal partition density
part_density$ext_dens # maximum of the external partition density
part_density$total_dens_step # step of the clustering in which the maximum of the total partition density was found
part_density$int_dens_step # step of the clustering in which the maximum of the internal partition density was found
part_density$ext_dens_step # step of the clustering in which the maximum of the external partition density was found




here() starts at /home/matthew/Documents/ResearchProject/ResearchProjectRepository



ERROR: Error in setwd(src.dir): cannot change working directory


In [4]:
# Moving some of the output files to the data directory
import os
source_file = 'Nodes-Similarities_final1_class1_network_data.tsv'
destination_directory = '../../data'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)

source_file = 'HistCompact-NL_Average_NoStop_final1_class1_network_data.tsv'
destination_directory = '../../data'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)

# Moving the final output files to the results directory
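
`os.rename` raises `FileNotFoundError` when the source file is no longer there, for example once it has already been moved by an earlier run. A minimal sketch of a loop that moves each output once and reports anything missing is shown below. `move_outputs` is an illustrative helper, and the demonstration runs in a throwaway directory with empty placeholder files rather than in the real repository.

```python
import shutil
import tempfile
from pathlib import Path

def move_outputs(file_names, source_dir, destination_dir):
    """Move each named file once and return the names that were missing."""
    destination = Path(destination_dir)
    destination.mkdir(parents=True, exist_ok=True)
    missing = []
    for name in file_names:
        source = Path(source_dir) / name
        if source.exists():
            shutil.move(str(source), str(destination / name))
        else:
            missing.append(name)  # e.g. already moved by an earlier run
    return missing

# Demonstration in a throwaway directory with empty placeholder files.
outputs = [
    "Nodes-Similarities_final1_class1_network_data.tsv",
    "HistCompact-NL_Average_NoStop_final1_class1_network_data.tsv",
]
with tempfile.TemporaryDirectory() as tmp:
    work, data = Path(tmp) / "functionInk", Path(tmp) / "data"
    work.mkdir()
    for name in outputs:
        (work / name).touch()
    first_run = move_outputs(outputs, work, data)
    second_run = move_outputs(outputs, work, data)  # nothing left to move
    moved = sorted(p.name for p in data.iterdir())

print(first_run, second_run)
```

The same helper could later be pointed at the results directory for the final output files.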