# Matthew Shaun Grainger's MRes Research Project Pipeline
Pipeline for inferring ecological interactions between microbes in tree hole communities. An explanation of how each step works can be found below the pipeline, in the 'Pipeline explanation' section.

Pipeline steps:
1. Importing tree hole data
2. Applying FlashWeave to infer co-occurrence networks
3. Applying functionInk to infer communities from co-occurrence networks
4. Visualising communities using Cytoscape
5. Comparing co-occurrence networks
   - Compare which ASVs are present/absent between networks using Cohen's Kappa
   - Out of the ASVs which are present in all networks, compare the type of correlation (positive, negative, no correlation)
   - Out of the ASV pairs which are (POSITIVELY CORRELATED?) in all networks, compare whether these pairs are found in the same cluster (community)
   - Table containing ASV pairs that are found in all networks with the same type of correlation
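
The presence/absence comparison in step 5 relies on Cohen's Kappa, which measures agreement between two labellings after correcting for the agreement expected by chance. The sketch below is a minimal, standard-library illustration of that statistic, not the code used later in this pipeline; the function name `cohens_kappa` and the two toy presence/absence vectors are assumptions made for the example.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    # Observed agreement: fraction of positions where both labels match
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each sequence's label frequencies
    p_expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if p_expected == 1.0:
        return 1.0  # both sequences use one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical presence (1) / absence (0) of five ASVs in two networks
network_1 = [1, 1, 0, 1, 0]
network_2 = [1, 0, 0, 1, 0]
kappa = cohens_kappa(network_1, network_2)
print(round(kappa, 3))
```

A value of 1 indicates perfect agreement between the two networks, while a value near 0 indicates agreement no better than chance.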

## 1 - Importing tree hole data
Importing the raw data, and converting it to the correct input format for FlashWeave.
The code below is written in the Python kernel.

In [None]:
# Importing pandas
import pandas as pd

# Importing data
asv_table = pd.read_csv('../data/seqtable_readyforanalysis.csv', sep='\t') # Importing ASV table
metadata = pd.read_csv('../data/metadata_Time0D-7D-4M_May2022_wJSDpart_ext.csv', sep='\t') # Importing metadata
taxonomy_data = pd.read_csv('../data/taxa_wsp_readyforanalysis.csv', sep='\t') # Importing taxonomy

##### Cleaning the data, and giving it the correct format. #####

# Getting rid of the samples belonging to experiment 4M:
asv_table.reset_index(inplace=True) # Making the sample ID into a column for the ASV table
asv_table.rename(columns={'index': 'sampleid'}, inplace=True) # renaming this new column to 'sampleid'
asv_table = pd.merge(asv_table, metadata, on='sampleid') # Merging metadata and asv_table by 'sampleid' so that can remove 4M samples
asv_table = asv_table[asv_table['Experiment'] != '4M'] # Taking rows which are not 4M samples

# Splitting the ASV table into starting and final communities
starting_asv_table = asv_table[asv_table['Experiment'] == '0D']
final_asv_table = asv_table[asv_table['Experiment'] != '0D']

# Splitting the metadata and ASV tables
starting_metadata = starting_asv_table[['partition']] # The metadata only contains the column for the community classes
final_metadata = final_asv_table[['partition']]
columns_to_drop = ['sampleid', 'Name.2', 'Community', 'Species', 'replicate',
       'BreakingBag', 'parent', 'Location', 'Experiment', 'Part_Time0D_17',
       'Part_Time0D_6', 'Part_Time4M_64', 'Part_Time7D_rep1_2',
       'Part_Time7D_rep2_2', 'Part_Time7D_rep3_2', 'Part_Time7D_rep4_2',
       'replicate.partition', 'partition', 'ExpCompact',
       'exp.replicate.partition', 'exp.partition'] # Columns to drop for ASV tables
starting_asv_table = starting_asv_table.drop(columns=columns_to_drop) # The ASV tables are now in the correct input format for FlashWeave
final_asv_table = final_asv_table.drop(columns=columns_to_drop)

# Exporting to .tsv files, which can then be inputted to FlashWeave
starting_metadata.to_csv('../data/starting_metadata.tsv', sep='\t', index=False)
final_metadata.to_csv('../data/final_metadata.tsv', sep='\t', index=False)
starting_asv_table.to_csv('../data/starting_asv_table.tsv', sep='\t', index=False)
final_asv_table.to_csv('../data/final_asv_table.tsv', sep='\t', index=False)


The output files of the above process are:
- starting_metadata.tsv : Metadata for starting communities, in correct input format for FlashWeave
- final_metadata.tsv : Metadata for final communities, in correct input format for FlashWeave
- starting_asv_table.tsv : ASV table of starting samples, in correct input format for FlashWeave
- final_asv_table.tsv : ASV table of final samples, in correct input format for FlashWeave

## 2 - Applying FlashWeave
Applying heterogeneous FlashWeave to the starting and final communities, with community classes as a factor, and homogeneous FlashWeave to the starting and final communities, ignoring community classes. This creates 4 networks in total. The code below is written in the Julia kernel.

In [None]:
##### Applying FlashWeave #####
using FlashWeave # Loading FlashWeave

# Co-occurrence network for starting communities, ignoring classes
starting_data_path = "../data/starting_asv_table.tsv"
starting_netw_results = learn_network(starting_data_path, sensitive=true, heterogeneous=false)

# Co-occurrence network for final communities, ignoring classes
final_data_path = "../data/final_asv_table.tsv"
final_netw_results = learn_network(final_data_path, sensitive=true, heterogeneous=false)

# Co-occurrence network for starting communities, taking into account classes
starting_classes_data_path = "../data/starting_asv_table.tsv"
starting_classes_meta_data_path = "../data/starting_metadata.tsv"
starting_classes_netw_results = learn_network(starting_classes_data_path, starting_classes_meta_data_path, sensitive=true, heterogeneous=true)

# Co-occurrence network for final communities, taking into account classes
final_classes_data_path = "../data/final_asv_table.tsv"
final_classes_meta_data_path = "../data/final_metadata.tsv"
final_classes_netw_results = learn_network(final_classes_data_path, final_classes_meta_data_path, sensitive=true, heterogeneous=true)

# Saving the networks
save_network("../data/starting_network_output.edgelist", starting_netw_results)
save_network("../data/starting_classes_network_output.edgelist", starting_classes_netw_results)
save_network("../data/final_network_output.edgelist", final_netw_results)
save_network("../data/final_classes_network_output.edgelist", final_classes_netw_results)

The output files of the above process are:
- starting_network_output.edgelist
- starting_classes_network_output.edgelist
- final_network_output.edgelist
- final_classes_network_output.edgelist

Each of these is a FlashWeave output file, containing correlations between all possible pairs of ASVs. There is one for the starting samples, one for the final samples, one for the starting samples when incorporating classes into FlashWeave's algorithm, and one for the final samples when incorporating classes into FlashWeave's algorithm.

## 3 - Applying functionInk
First, converting the FlashWeave outputs into the correct input format for functionInk. The below code is written in the Python kernel.

In [1]:
##### Converting the FlashWeave outputs into the correct input format for functionInk #####

import pandas as pd # Importing Pandas again, as switched back to Python kernel

# Removing the headers (the first two lines of each FlashWeave edgelist)
# The files are overwritten in place, so this should only be run once per FlashWeave output
edgelist_paths = ['../data/starting_network_output.edgelist',
                  '../data/starting_classes_network_output.edgelist',
                  '../data/final_network_output.edgelist',
                  '../data/final_classes_network_output.edgelist']
for edgelist_path in edgelist_paths:
    with open(edgelist_path, 'r') as f:
        lines = f.readlines()
    with open(edgelist_path, 'w') as f:
        f.writelines(lines[2:])

# Adding new headers
starting_network_data = pd.read_csv("../data/starting_network_output.edgelist", sep="\t", header=None, names=["ASV_A", "ASV_B", "Interaction"])

starting_classes_network_data = pd.read_csv("../data/starting_classes_network_output.edgelist", sep="\t", header=None, names=["ASV_A", "ASV_B", "Interaction"])

final_network_data = pd.read_csv("../data/final_network_output.edgelist", sep="\t", header=None, names=["ASV_A", "ASV_B", "Interaction"])

final_classes_network_data = pd.read_csv("../data/final_classes_network_output.edgelist", sep="\t", header=None, names=["ASV_A", "ASV_B", "Interaction"])

# Outputting as .tsv files
starting_network_data.to_csv('../data/starting_network_data.tsv', sep='\t', index=False, header=['#ASV_A', 'ASV_B', 'Interaction'])
starting_classes_network_data.to_csv('../data/starting_classes_network_data.tsv', sep='\t', index=False, header=['#ASV_A', 'ASV_B', 'Interaction'])
final_network_data.to_csv('../data/final_network_data.tsv', sep='\t', index=False, header=['#ASV_A', 'ASV_B', 'Interaction'])
final_classes_network_data.to_csv('../data/final_classes_network_data.tsv', sep='\t', index=False, header=['#ASV_A', 'ASV_B', 'Interaction'])

The output files of the above cell are:
- starting_network_data.tsv
- starting_classes_network_data.tsv
- final_network_data.tsv
- final_classes_network_data.tsv

These are modified versions of the FlashWeave output files, such that they fit the functionInk input format. Again, they contain the all-against-all pairwise correlations of the ASVs, for each of the 4 networks.
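
The later comparison of correlation types (positive, negative, no correlation) depends on the sign of the value in the `Interaction` column. The sketch below shows one way this could be read from a table in the layout written above, using only the standard library. It assumes the `Interaction` column holds a signed numeric weight and that pairs missing from the table count as "no correlation"; the helper name `classify_edges` and the three example edges are hypothetical.

```python
import csv
import io

# Hypothetical three-edge table in the layout written by the cell above
example_tsv = (
    "#ASV_A\tASV_B\tInteraction\n"
    "ASV1\tASV2\t0.42\n"
    "ASV1\tASV3\t-0.17\n"
    "ASV2\tASV4\t0.08\n"
)


def classify_edges(handle):
    """Map each unordered ASV pair to 'positive' or 'negative' by weight sign."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the '#ASV_A  ASV_B  Interaction' header row
    edge_types = {}
    for asv_a, asv_b, weight in reader:
        pair = frozenset((asv_a, asv_b))  # the order of the two ASVs is ignored
        edge_types[pair] = "positive" if float(weight) > 0 else "negative"
    return edge_types


edge_types = classify_edges(io.StringIO(example_tsv))
for pair, kind in edge_types.items():
    print(sorted(pair), kind)
```

Storing each pair as a `frozenset` means that `ASV1`/`ASV2` and `ASV2`/`ASV1` are treated as the same edge when networks are compared.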

Applying functionInk to each of the 4 networks. The below code is written in the Python kernel.

In [None]:
##### Applying functionInk #####

import os # Importing os package
os.chdir('../code/functionInk') # Moving from the directory in which this notebook is found into the root of the functionInk repository

# The first step to the pipeline - computing similarities between nodes
!./NodeSimilarity.pl -w 2 -d 0 -t 1 -f ../../data/starting_network_data.tsv
!./NodeSimilarity.pl -w 2 -d 0 -t 1 -f ../../data/starting_classes_network_data.tsv
!./NodeSimilarity.pl -w 2 -d 0 -t 1 -f ../../data/final_network_data.tsv
!./NodeSimilarity.pl -w 2 -d 0 -t 1 -f ../../data/final_classes_network_data.tsv

# The second step - clustering nodes using the similarity metrics calculated
!./NodeLinkage.pl -fn ../../data/starting_network_data.tsv -fs Nodes-Similarities_starting_network_data.tsv
!./NodeLinkage.pl -fn ../../data/starting_classes_network_data.tsv -fs Nodes-Similarities_starting_classes_network_data.tsv
!./NodeLinkage.pl -fn ../../data/final_network_data.tsv -fs Nodes-Similarities_final_network_data.tsv
!./NodeLinkage.pl -fn ../../data/final_classes_network_data.tsv -fs Nodes-Similarities_final_classes_network_data.tsv

The output files of the above cell are:
- Nodes-Similarities_starting_network_data.tsv
- Nodes-Similarities_starting_classes_network_data.tsv
- Nodes-Similarities_final_network_data.tsv
- Nodes-Similarities_final_classes_network_data.tsv
- HistExtend-NL_Average_NoStop_starting_network_data.tsv
- HistExtend-NL_Average_NoStop_starting_classes_network_data.tsv
- HistExtend-NL_Average_NoStop_final_network_data.tsv
- HistExtend-NL_Average_NoStop_final_classes_network_data.tsv
- HistCompact-NL_Average_NoStop_starting_network_data.tsv
- HistCompact-NL_Average_NoStop_starting_classes_network_data.tsv
- HistCompact-NL_Average_NoStop_final_network_data.tsv
- HistCompact-NL_Average_NoStop_final_classes_network_data.tsv

The Nodes-Similarities files contain, for every pair of ASVs, the Tanimoto and Jaccard coefficients and information about the number of shared neighbours.
The HistExtend and HistCompact files contain information about the clustering, which here ran all the way to the final step, at which all ASVs are merged into a single cluster. ASVs were clustered using average linkage.
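
To give an intuition for the shared-neighbour similarity mentioned above, the sketch below computes a plain Jaccard coefficient between the neighbour sets of two nodes in a toy network. It is a simplified illustration only: functionInk's own calculation may differ, for example in how it treats link types and weights, and the helper names and edges here are hypothetical.

```python
from collections import defaultdict


def neighbour_sets(edges):
    """Build a node -> set-of-neighbours mapping from undirected edges."""
    neighbours = defaultdict(set)
    for node_a, node_b in edges:
        neighbours[node_a].add(node_b)
        neighbours[node_b].add(node_a)
    return neighbours


def jaccard(set_a, set_b):
    """Shared elements divided by all distinct elements of the two sets."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Hypothetical toy network
edges = [("ASV1", "ASV3"), ("ASV2", "ASV3"), ("ASV1", "ASV4"), ("ASV2", "ASV5")]
neighbours = neighbour_sets(edges)
similarity = jaccard(neighbours["ASV1"], neighbours["ASV2"])
print(similarity)
```

Here ASV1 and ASV2 share one neighbour (ASV3) out of three distinct neighbours, so their similarity is one third.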

Extracting the partition densities. Please switch to the R kernel to run the code below.

In [None]:
##### Sourcing function that extracts partition densities #####
library(ggplot2) # Loading ggplot
source("functionInk/scripts/analysis_R/extractPartDensity.R") # Sourcing the function that extracts the partition densities
setwd("functionInk") # Moving to the functionInk repository

In [None]:
##### Extracting partition densities #####
# Importing the partition histories and cleaning them
hist_comp_starting=read.table(file="HistCompact-NL_Average_NoStop_starting_network_data.tsv")
colnames(hist_comp_starting) <- as.character(unlist(hist_comp_starting[1, ]))
hist_comp_starting <- hist_comp_starting[-1, ]
columns_to_convert <- c("Step", "Similarity", "Density", "DensityInt", "DensityExt", "NumNodesA", "NumEdgesA", 
                       "NumNodesB", "NumEdgesB", "NumNodesAB", "NumEdgesAB", "NumIntNodesA", "NumIntNodesB",
                       "NumExtNodesA", "NumExtNodesB", "NumIntNodesAB", "NumExtNodesAB", "NumIntEdgesA",
                       "NumIntEdgesB", "NumExtEdgesA", "NumExtEdgesB", "NumIntEdgesAB", "NumExtEdgesAB",
                       "NcumInt", "NcumExt", "Ncum")
hist_comp_starting[columns_to_convert] <- lapply(hist_comp_starting[columns_to_convert], as.numeric)

hist_comp_starting_classes=read.table(file="HistCompact-NL_Average_NoStop_starting_classes_network_data.tsv")
colnames(hist_comp_starting_classes) <- as.character(unlist(hist_comp_starting_classes[1, ]))
hist_comp_starting_classes <- hist_comp_starting_classes[-1, ]
hist_comp_starting_classes[columns_to_convert] <- lapply(hist_comp_starting_classes[columns_to_convert], as.numeric)

hist_comp_final=read.table(file="HistCompact-NL_Average_NoStop_final_network_data.tsv")
colnames(hist_comp_final) <- as.character(unlist(hist_comp_final[1, ]))
hist_comp_final <- hist_comp_final[-1, ]
hist_comp_final[columns_to_convert] <- lapply(hist_comp_final[columns_to_convert], as.numeric)

hist_comp_final_classes=read.table(file="HistCompact-NL_Average_NoStop_final_classes_network_data.tsv")
colnames(hist_comp_final_classes) <- as.character(unlist(hist_comp_final_classes[1, ]))
hist_comp_final_classes <- hist_comp_final_classes[-1, ]
hist_comp_final_classes[columns_to_convert] <- lapply(hist_comp_final_classes[columns_to_convert], as.numeric)


# Calculating partition densities, plotting them, and moving the plot into results
print("Starting network partition densities:")
part_density_starting=extractPartDensity(hist.comp=hist_comp_starting, plot = TRUE)
system(paste("mv", "figures/Plot_PartitionDensityVsStep.pdf", "../../results/starting_Plot_PartitionDensityVsStep.pdf"))
print("Step of the clustering in which the maximum of the total partition density was found: ")
part_density_starting$total_dens_step
print("Step of the clustering in which the maximum of the internal partition density was found: ")
part_density_starting$int_dens_step
print("Step of the clustering in which the maximum of the external partition density was found: ")
part_density_starting$ext_dens_step

print("Starting classes network partition densities:")
part_density_starting_classes=extractPartDensity(hist.comp=hist_comp_starting_classes, plot = TRUE)
system(paste("mv", "figures/Plot_PartitionDensityVsStep.pdf", "../../results/starting_classes_Plot_PartitionDensityVsStep.pdf"))
print("Step of the clustering in which the maximum of the total partition density was found: ")
part_density_starting_classes$total_dens_step
print("Step of the clustering in which the maximum of the internal partition density was found: ")
part_density_starting_classes$int_dens_step
print("Step of the clustering in which the maximum of the external partition density was found: ")
part_density_starting_classes$ext_dens_step

print("Final network partition densities:")
part_density_final=extractPartDensity(hist.comp=hist_comp_final, plot = TRUE)
system(paste("mv", "figures/Plot_PartitionDensityVsStep.pdf", "../../results/final_Plot_PartitionDensityVsStep.pdf"))
print("Step of the clustering in which the maximum of the total partition density was found: ")
part_density_final$total_dens_step
print("Step of the clustering in which the maximum of the internal partition density was found: ")
part_density_final$int_dens_step
print("Step of the clustering in which the maximum of the external partition density was found: ")
part_density_final$ext_dens_step

print("Final classes network partition densities:")
part_density_final_classes=extractPartDensity(hist.comp=hist_comp_final_classes, plot = TRUE)
system(paste("mv", "figures/Plot_PartitionDensityVsStep.pdf", "../../results/final_classes_Plot_PartitionDensityVsStep.pdf"))
print("Step of the clustering in which the maximum of the total partition density was found: ")
part_density_final_classes$total_dens_step
print("Step of the clustering in which the maximum of the internal partition density was found: ")
part_density_final_classes$int_dens_step
print("Step of the clustering in which the maximum of the external partition density was found: ")
part_density_final_classes$ext_dens_step


The output files of this cell are:
- starting_Plot_PartitionDensityVsStep.pdf
- starting_classes_Plot_PartitionDensityVsStep.pdf
- final_Plot_PartitionDensityVsStep.pdf
- final_classes_Plot_PartitionDensityVsStep.pdf

These are plots of the partition density (density of links relative to the number of clustered nodes) across clustering steps.
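
The clustering step at which the partition density peaks is what determines where the clustering is stopped later on. As a rough illustration of that selection, and not a reimplementation of `extractPartDensity`, the sketch below picks the step with the highest density from a list of (step, density) pairs. The values and the helper name `step_of_max_density` are hypothetical.

```python
# Hypothetical (step, density) pairs from a clustering history
history = [(1, 0.10), (2, 0.35), (3, 0.52), (4, 0.47), (5, 0.20)]


def step_of_max_density(history):
    """Return the (step, density) pair with the highest density."""
    return max(history, key=lambda row: row[1])


best_step, best_density = step_of_max_density(history)
print(best_step, best_density)
```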

Obtaining communities. Please switch to the Python kernel to run the code below.

In [None]:
import os # Importing os again, as switched back to Python kernel
os.chdir('functionInk') # Moving from the directory in which this notebook is found into the root of the functionInk repository

In [None]:
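##### Running the clustering until the step at which the maximum total partition density is reached #####
# The values passed to -v are the total_dens_step values reported by the partition density cell above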
!./NodeLinkage.pl -fn ../../data/starting_network_data.tsv -fs Nodes-Similarities_starting_network_data.tsv -s step -v 855
!./NodeLinkage.pl -fn ../../data/starting_classes_network_data.tsv -fs Nodes-Similarities_starting_classes_network_data.tsv -s step -v 201
!./NodeLinkage.pl -fn ../../data/final_network_data.tsv -fs Nodes-Similarities_final_network_data.tsv -s step -v 860
!./NodeLinkage.pl -fn ../../data/final_classes_network_data.tsv -fs Nodes-Similarities_final_classes_network_data.tsv -s step -v 341

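##### Running the clustering until the step at which the maximum internal partition density is reached #####
# The values passed to -v are the int_dens_step values reported by the partition density cell above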
!./NodeLinkage.pl -fn ../../data/starting_network_data.tsv -fs Nodes-Similarities_starting_network_data.tsv -s step -v 1003
!./NodeLinkage.pl -fn ../../data/starting_classes_network_data.tsv -fs Nodes-Similarities_starting_classes_network_data.tsv -s step -v 217
!./NodeLinkage.pl -fn ../../data/final_network_data.tsv -fs Nodes-Similarities_final_network_data.tsv -s step -v 859
!./NodeLinkage.pl -fn ../../data/final_classes_network_data.tsv -fs Nodes-Similarities_final_classes_network_data.tsv -s step -v 364

The output files of the above cell for the maximum total partition density are:
- HistExtend-NL_Average_StopStep-855_starting_network_data.tsv
- HistExtend-NL_Average_StopStep-201_starting_classes_network_data.tsv
- HistExtend-NL_Average_StopStep-860_final_network_data.tsv
- HistExtend-NL_Average_StopStep-341_final_classes_network_data.tsv

- HistCompact-NL_Average_StopStep-855_starting_network_data.tsv
- HistCompact-NL_Average_StopStep-201_starting_classes_network_data.tsv
- HistCompact-NL_Average_StopStep-860_final_network_data.tsv
- HistCompact-NL_Average_StopStep-341_final_classes_network_data.tsv

- Clusters-NL_Average_StopStep-855_starting_network_data.tsv
- Clusters-NL_Average_StopStep-201_starting_classes_network_data.tsv
- Clusters-NL_Average_StopStep-860_final_network_data.tsv
- Clusters-NL_Average_StopStep-341_final_classes_network_data.tsv

- Partition-NL_Average_StopStep-855_starting_network_data.tsv
- Partition-NL_Average_StopStep-201_starting_classes_network_data.tsv
- Partition-NL_Average_StopStep-860_final_network_data.tsv
- Partition-NL_Average_StopStep-341_final_classes_network_data.tsv

The output files of the above cell for the maximum internal partition density are:
- HistExtend-NL_Average_StopStep-1003_starting_network_data.tsv
- HistExtend-NL_Average_StopStep-217_starting_classes_network_data.tsv
- HistExtend-NL_Average_StopStep-859_final_network_data.tsv
- HistExtend-NL_Average_StopStep-364_final_classes_network_data.tsv

- HistCompact-NL_Average_StopStep-1003_starting_network_data.tsv
- HistCompact-NL_Average_StopStep-217_starting_classes_network_data.tsv
- HistCompact-NL_Average_StopStep-859_final_network_data.tsv
- HistCompact-NL_Average_StopStep-364_final_classes_network_data.tsv

- Clusters-NL_Average_StopStep-1003_starting_network_data.tsv
- Clusters-NL_Average_StopStep-217_starting_classes_network_data.tsv
- Clusters-NL_Average_StopStep-859_final_network_data.tsv
- Clusters-NL_Average_StopStep-364_final_classes_network_data.tsv

- Partition-NL_Average_StopStep-1003_starting_network_data.tsv
- Partition-NL_Average_StopStep-217_starting_classes_network_data.tsv
- Partition-NL_Average_StopStep-859_final_network_data.tsv
- Partition-NL_Average_StopStep-364_final_classes_network_data.tsv

The HistExtend and HistCompact files again contain information about the clustering. This time, however, the clustering was stopped either at the step with the maximum total partition density or at the step with the maximum internal partition density.
The Clusters files describe the clusters (communities) within the network. They include information about which ASVs are in each community, the types of link within each community, etc.
The Partition files describe which cluster each ASV belongs to.
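
Because a partition assigns each ASV to exactly one cluster, checking whether a pair of ASVs sits in the same community (as planned for the network comparison) reduces to comparing two cluster labels. The sketch below assumes the Partition files have already been loaded into dictionaries mapping ASV to cluster; the dictionaries, the cluster labels, and the helper name `same_cluster` are hypothetical, and the actual file layout is not reproduced here.

```python
# Hypothetical ASV -> cluster assignments for two networks
partition_starting = {"ASV1": 1, "ASV2": 1, "ASV3": 2}
partition_final = {"ASV1": 4, "ASV2": 4, "ASV3": 4}


def same_cluster(partition, asv_a, asv_b):
    """True if both ASVs are assigned to the same cluster in this partition."""
    return (
        asv_a in partition
        and asv_b in partition
        and partition[asv_a] == partition[asv_b]
    )


pairs = [("ASV1", "ASV2"), ("ASV1", "ASV3")]
for asv_a, asv_b in pairs:
    print(asv_a, asv_b,
          same_cluster(partition_starting, asv_a, asv_b),
          same_cluster(partition_final, asv_a, asv_b))
```

An ASV that is missing from a partition is treated as not sharing a cluster with anything, which avoids a `KeyError` when networks contain different sets of ASVs.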

Moving the output files of the functionInk pipeline from the functionInk directory into the results directory.
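
The cell below moves each file with its own block of five lines. As a more compact alternative, the same operation could be written as a loop over file names. The sketch below is one possible version, demonstrated in a temporary directory so that it does not touch the real outputs; `move_files` is a hypothetical helper, not part of functionInk or of this pipeline.

```python
import os
import shutil
import tempfile


def move_files(file_names, source_directory, destination_directory):
    """Move each named file that exists; return the names that were moved."""
    os.makedirs(destination_directory, exist_ok=True)
    moved = []
    for file_name in file_names:
        source_path = os.path.join(source_directory, file_name)
        if os.path.isfile(source_path):
            shutil.move(source_path, os.path.join(destination_directory, file_name))
            moved.append(file_name)
    return moved


# Demonstration in a temporary directory, using empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    source_dir = os.path.join(tmp, "functionInk")
    results_dir = os.path.join(tmp, "results")
    os.makedirs(source_dir)
    names = ["Nodes-Similarities_%s_network_data.tsv" % n for n in ("starting", "final")]
    for name in names:
        open(os.path.join(source_dir, name), "w").close()
    moved = move_files(names + ["missing.tsv"], source_dir, results_dir)
    remaining = os.listdir(source_dir)
    print(moved)
```

Unlike `os.rename`, `shutil.move` also works when the source and destination are on different filesystems, and skipping missing files lets the cell be re-run after a partial move.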

In [None]:
##### Moving the output files from the functionInk pipeline #####
# Moving the node similarity files to the results folder
source_file = 'Nodes-Similarities_starting_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)

source_file = 'Nodes-Similarities_starting_classes_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)

source_file = 'Nodes-Similarities_final_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)

source_file = 'Nodes-Similarities_final_classes_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)



# Moving the compact node clustering history files with no stop
source_file = 'HistCompact-NL_Average_NoStop_starting_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)

source_file = 'HistCompact-NL_Average_NoStop_starting_classes_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)

source_file = 'HistCompact-NL_Average_NoStop_final_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)

source_file = 'HistCompact-NL_Average_NoStop_final_classes_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)



# Moving the extended node clustering history files with no stop
source_file = 'HistExtend-NL_Average_NoStop_starting_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)

source_file = 'HistExtend-NL_Average_NoStop_starting_classes_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)

source_file = 'HistExtend-NL_Average_NoStop_final_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)

source_file = 'HistExtend-NL_Average_NoStop_final_classes_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)



# Moving the compact node clustering history files with the maximum total partition density stop
source_file = 'HistCompact-NL_Average_StopStep-855_starting_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)

source_file = 'HistCompact-NL_Average_StopStep-201_starting_classes_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)

source_file = 'HistCompact-NL_Average_StopStep-860_final_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)

source_file = 'HistCompact-NL_Average_StopStep-341_final_classes_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)



# Moving the extended node clustering history files with the maximum total partition density stop
source_file = 'HistExtend-NL_Average_StopStep-855_starting_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)

source_file = 'HistExtend-NL_Average_StopStep-201_starting_classes_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)

source_file = 'HistExtend-NL_Average_StopStep-860_final_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)

source_file = 'HistExtend-NL_Average_StopStep-341_final_classes_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)



# Moving the cluster description files for the maximum total partition density stop
source_file = 'Clusters-NL_Average_StopStep-855_starting_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)

source_file = 'Clusters-NL_Average_StopStep-201_starting_classes_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)

source_file = 'Clusters-NL_Average_StopStep-860_final_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)

source_file = 'Clusters-NL_Average_StopStep-341_final_classes_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)



# Moving the file describing the cluster to which each node belongs for the maximum total partition density stop
source_file = 'Partition-NL_Average_StopStep-855_starting_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)

source_file = 'Partition-NL_Average_StopStep-201_starting_classes_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)

source_file = 'Partition-NL_Average_StopStep-860_final_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)

source_file = 'Partition-NL_Average_StopStep-341_final_classes_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)



# Moving the compact node clustering history files with the maximum internal partition density stop
source_file = 'HistCompact-NL_Average_StopStep-1003_starting_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)

source_file = 'HistCompact-NL_Average_StopStep-217_starting_classes_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)

source_file = 'HistCompact-NL_Average_StopStep-859_final_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)

source_file = 'HistCompact-NL_Average_StopStep-364_final_classes_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)



# Moving the extended node clustering history files with the maximum internal partition density stop
source_file = 'HistExtend-NL_Average_StopStep-1003_starting_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)

source_file = 'HistExtend-NL_Average_StopStep-217_starting_classes_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)

source_file = 'HistExtend-NL_Average_StopStep-859_final_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)

source_file = 'HistExtend-NL_Average_StopStep-364_final_classes_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)



# Moving the cluster description files for the maximum internal partition density stop
source_file = 'Clusters-NL_Average_StopStep-1003_starting_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)

source_file = 'Clusters-NL_Average_StopStep-217_starting_classes_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)

source_file = 'Clusters-NL_Average_StopStep-859_final_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)

source_file = 'Clusters-NL_Average_StopStep-364_final_classes_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)



# Moving the file describing the cluster to which each node belongs for the maximum internal partition density stop
source_file = 'Partition-NL_Average_StopStep-1003_starting_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)

source_file = 'Partition-NL_Average_StopStep-217_starting_classes_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)

source_file = 'Partition-NL_Average_StopStep-859_final_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)

source_file = 'Partition-NL_Average_StopStep-364_final_classes_network_data.tsv'
destination_directory = '../../results'
file_name = os.path.basename(source_file)
destination_path = os.path.join(destination_directory, file_name)
os.rename(source_file, destination_path)
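
The moves above all follow one naming pattern, so they could also be generated in a loop. The following is a minimal sketch rather than the code used in the pipeline: the helper names `build_file_names` and `move_to_results` are hypothetical, the stop steps are copied from the file names above, and `shutil.move` is swapped in for `os.rename`.

```python
import os
import shutil

# Stop step at the maximum internal partition density for each network,
# copied from the file names used in the cell above.
STOP_STEPS = {
    'starting_network_data': 1003,
    'starting_classes_network_data': 217,
    'final_network_data': 859,
    'final_classes_network_data': 364,
}

# functionInk output types that are moved in the cell above.
PREFIXES = ['HistCompact', 'HistExtend', 'Clusters', 'Partition']


def build_file_names():
    """Return the 16 output file names (4 output types x 4 networks)."""
    return [f'{prefix}-NL_Average_StopStep-{step}_{network}.tsv'
            for prefix in PREFIXES
            for network, step in STOP_STEPS.items()]


def move_to_results(file_names, destination_directory='../../results'):
    """Move each existing file into destination_directory and return the new paths."""
    moved = []
    for file_name in file_names:
        if not os.path.isfile(file_name):
            continue  # Skipping outputs that have not been generated
        destination_path = os.path.join(destination_directory, os.path.basename(file_name))
        shutil.move(file_name, destination_path)
        moved.append(destination_path)
    return moved
```

Calling `move_to_results(build_file_names())` from the functionInk output directory would move whichever of the 16 files exist and return their new paths.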

## 4 - Visualisation in Cytoscape

Cytoscape visualisations of the four networks can be found in the results folder.

## 5 - Comparing co-occurrence network structure

There are several ways of comparing the structure of co-occurrence networks:
- Comparing network metrics:
  - Clustering coefficient: the degree to which ASVs tend to cluster together
  - Average path length: average shortest path length between all pairs of nodes
  - Betweenness centrality: identifies nodes that connect different parts of the network
  - Node degree (connectivity): number of neighbours directly connected to a node
  - Number of clusters
  - Number of edges

- Comparing whether the same motifs (and therefore communities) are found between co-occurrence networks, using Cohen's Kappa.
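
Some of the metrics listed above can be illustrated on a toy network. This is a sketch in plain Python, assuming the network is held as a dictionary that maps each node to the set of its neighbours; the function names and the example network are hypothetical and are not part of the pipeline.

```python
from collections import deque


def node_degrees(adjacency):
    """Number of neighbours directly connected to each node."""
    return {node: len(neighbours) for node, neighbours in adjacency.items()}


def edge_count(adjacency):
    """Number of undirected edges (each edge is stored twice in the dictionary)."""
    return sum(len(neighbours) for neighbours in adjacency.values()) // 2


def average_path_length(adjacency):
    """Mean shortest path length over all pairs of nodes that can reach each other."""
    total_distance = 0
    reachable_pairs = 0
    for source in adjacency:
        # Breadth-first search gives shortest path lengths in an unweighted graph
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
        total_distance += sum(distance.values())
        reachable_pairs += len(distance) - 1
    return total_distance / reachable_pairs if reachable_pairs else float('nan')


# Hypothetical toy network: a chain ASV_1 - ASV_2 - ASV_3
toy_network = {
    'ASV_1': {'ASV_2'},
    'ASV_2': {'ASV_1', 'ASV_3'},
    'ASV_3': {'ASV_2'},
}
print(node_degrees(toy_network))
print(edge_count(toy_network))
print(average_path_length(toy_network))
```

Pairs of nodes in different connected components are left out of the average here; other conventions exist, so values may differ from those reported by a graph library.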

#### 5.1 Comparing network metrics

#### Clustering coefficient

The links in the networks created by functionInk were determined by clustering the most closely correlated ASVs up until the point at which the maximum link density was reached. To compute the clustering coefficient here, links are defined differently: a pair of ASVs is linked when the absolute value of their correlation exceeds a manually chosen threshold. The links used for calculating the clustering coefficient are therefore not the same as the links present in the functionInk networks. One way around this would be to find out which ASVs are linked in the functionInk networks and compute the clustering coefficient from those links instead. The code below is written in R.

In [None]:
# Loading igraph package
library(igraph)

# Importing the FlashWeave edge list (columns: ASV1, ASV2, correlation weight)
starting_correlation_matrix <- read.table("../data/starting_network_output.edgelist", sep = "\t", header = FALSE)
colnames(starting_correlation_matrix) <- c("ASV1", "ASV2", "Weight")

# CURRENTLY ARBITRARY threshold for significant correlation
threshold_significant_corr <- 0.7
# Keeping only the pairs whose absolute correlation exceeds the threshold
starting_significant_links <- starting_correlation_matrix[abs(starting_correlation_matrix$Weight) > threshold_significant_corr, c("ASV1", "ASV2")]

# Creating an undirected graph from the remaining links.
# Building from an edge list means there are no isolated (degree 0) vertices to remove.
starting_graph <- graph_from_data_frame(starting_significant_links, directed = FALSE)
starting_graph <- simplify(starting_graph) # Removing duplicate links and self-loops, if any
starting_clustering_coefficient <- transitivity(starting_graph, type = "local") # One value per ASV
print(starting_clustering_coefficient)
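
The thresholding and the local clustering coefficient can also be written out by hand, which makes explicit what is being measured: for each ASV, the fraction of pairs of its neighbours that are themselves linked. This is a sketch in plain Python under the same assumption as above (a link exists when the absolute correlation exceeds the threshold); the function names and the example edges are hypothetical.

```python
from itertools import combinations


def build_adjacency(edges, threshold):
    """Keep edges whose absolute weight exceeds threshold; return node -> set of neighbours."""
    adjacency = {}
    for asv_a, asv_b, weight in edges:
        if asv_a != asv_b and abs(weight) > threshold:
            adjacency.setdefault(asv_a, set()).add(asv_b)
            adjacency.setdefault(asv_b, set()).add(asv_a)
    return adjacency


def local_clustering(adjacency):
    """Local clustering coefficient per node; NaN for nodes with fewer than two neighbours."""
    coefficients = {}
    for node, neighbours in adjacency.items():
        k = len(neighbours)
        if k < 2:
            coefficients[node] = float('nan')  # Undefined: no pair of neighbours exists
            continue
        # Counting the pairs of neighbours that are linked to each other
        linked_pairs = sum(1 for u, v in combinations(neighbours, 2) if v in adjacency[u])
        coefficients[node] = 2 * linked_pairs / (k * (k - 1))
    return coefficients


# Hypothetical example edges: (ASV1, ASV2, correlation)
example_edges = [
    ('ASV_1', 'ASV_2', 0.9),
    ('ASV_1', 'ASV_3', -0.8),
    ('ASV_2', 'ASV_3', 0.75),
    ('ASV_1', 'ASV_4', 0.95),
    ('ASV_5', 'ASV_6', 0.2),  # Below the threshold, so not treated as a link
]
example_coefficients = local_clustering(build_adjacency(example_edges, threshold=0.7))
print(example_coefficients)
```

In this example `ASV_1` has three neighbours of which one pair is linked, so its coefficient is 1/3, while `ASV_5` and `ASV_6` do not appear because their only correlation falls below the threshold.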



#### 5.2 Comparing motifs between networks
The motifs are compared between networks in three steps:
1. Compare which ASVs are present within each network.
2. For ASV pairs within both networks, look at all possible pairs and compare whether they are correlated within both networks, and whether they have the same type of correlation between networks.
3. For ASV pairs within both networks, identify whether each ASV pair is found within the same cluster, within each network.

Below, I use Cohen's Kappa to compare whether or not each pair of ASVs is found within the same cluster within each network. First, I construct a table recording, for each network, whether each ASV pair is in the same cluster.
The code below is written in R.

#### STEP 1: Comparing which ASVs are present within each network.

In [None]:
## Data frames for each network with one column containing every ASV and another column containing presence
# Starting network
presence_data_starting <- read.table("../results/Partition-NL_Average_StopStep-1003_starting_network_data.tsv", sep = "\t", header = FALSE)
presence_data_starting$Present <- TRUE
presence_data_starting <- presence_data_starting[,c(1, 3)]
colnames(presence_data_starting) <- c("ASV", "Present") # Assigning column name
# Starting classes network
presence_data_starting_classes <- read.table("../results/Partition-NL_Average_StopStep-217_starting_classes_network_data.tsv", sep = "\t", header = FALSE)
presence_data_starting_classes$Present <- TRUE
presence_data_starting_classes <- presence_data_starting_classes[,c(1, 3)]
colnames(presence_data_starting_classes) <- c("ASV", "Present") # Assigning column name
presence_data_starting_classes <- presence_data_starting_classes[grepl("ASV", presence_data_starting_classes$ASV), ] # Getting rid of classes/partitions included as ASVs
# Final network
presence_data_final <- read.table("../results/Partition-NL_Average_StopStep-859_final_network_data.tsv", sep = "\t", header = FALSE)
presence_data_final$Present <- TRUE
presence_data_final <- presence_data_final[,c(1, 3)]
colnames(presence_data_final) <- c("ASV", "Present") # Assigning column name
# Final classes network
presence_data_final_classes <- read.table("../results/Partition-NL_Average_StopStep-364_final_classes_network_data.tsv", sep = "\t", header = FALSE)
presence_data_final_classes$Present <- TRUE
presence_data_final_classes <- presence_data_final_classes[,c(1, 3)]
colnames(presence_data_final_classes) <- c("ASV", "Present") # Assigning column name
presence_data_final_classes <- presence_data_final_classes[grepl("ASV", presence_data_final_classes$ASV), ] # Getting rid of classes/partitions included as ASVs

## Data frames for pairs of networks, containing presence/absence
# Starting-Final
presence_data_starting_final <- merge(presence_data_starting, presence_data_final, by = "ASV", all = TRUE)
presence_data_starting_final$Present.x[is.na(presence_data_starting_final$Present.x)] <- FALSE
presence_data_starting_final$Present.y[is.na(presence_data_starting_final$Present.y)] <- FALSE
colnames(presence_data_starting_final) <- c("ASV", "PresentStarting", "PresentFinal") # Assigning column name
# StartingClasses-FinalClasses
presence_data_starting_cl_final_cl <- merge(presence_data_starting_classes, presence_data_final_classes, by = "ASV", all = TRUE)
presence_data_starting_cl_final_cl$Present.x[is.na(presence_data_starting_cl_final_cl$Present.x)] <- FALSE
presence_data_starting_cl_final_cl$Present.y[is.na(presence_data_starting_cl_final_cl$Present.y)] <- FALSE
colnames(presence_data_starting_cl_final_cl) <- c("ASV", "PresentStartingCl", "PresentFinalCl") # Assigning column name
# Starting-StartingClasses
presence_data_starting_starting_cl <- merge(presence_data_starting, presence_data_starting_classes, by = "ASV", all = TRUE)
presence_data_starting_starting_cl$Present.x[is.na(presence_data_starting_starting_cl$Present.x)] <- FALSE
presence_data_starting_starting_cl$Present.y[is.na(presence_data_starting_starting_cl$Present.y)] <- FALSE
colnames(presence_data_starting_starting_cl) <- c("ASV", "PresentStarting", "PresentStartingCl") # Assigning column name
# Final-FinalClasses
presence_data_final_final_cl <- merge(presence_data_final, presence_data_final_classes, by = "ASV", all = TRUE)
presence_data_final_final_cl$Present.x[is.na(presence_data_final_final_cl$Present.x)] <- FALSE
presence_data_final_final_cl$Present.y[is.na(presence_data_final_final_cl$Present.y)] <- FALSE
colnames(presence_data_final_final_cl) <- c("ASV", "PresentFinal", "PresentFinalCl") # Assigning column name
# All
presence_data_all <- merge(presence_data_starting, presence_data_final, by = "ASV", all = TRUE)
presence_data_all <- merge(presence_data_all, presence_data_starting_classes, by = "ASV", all = TRUE)
presence_data_all <- merge(presence_data_all, presence_data_final_classes, by = "ASV", all = TRUE)
colnames(presence_data_all) <- c("ASV", "PresentStarting", "PresentFinal", "PresentStartingCl", "PresentFinalCl")
presence_data_all$PresentStarting[is.na(presence_data_all$PresentStarting)] <- FALSE
presence_data_all$PresentFinal[is.na(presence_data_all$PresentFinal)] <- FALSE
presence_data_all$PresentStartingCl[is.na(presence_data_all$PresentStartingCl)] <- FALSE
presence_data_all$PresentFinalCl[is.na(presence_data_all$PresentFinalCl)] <- FALSE
presence_data_all <- as.matrix(presence_data_all[2:5])

In [None]:
## Calculating Light's Kappa between all networks for presence/absence of ASVs
library(psych) # Loading required package
all_asv_kappa <- cohen.kappa(presence_data_all)
all_asv_kappa

In [None]:
## Calculating Cohen's Kappa between starting and final networks for presence/absence of ASVs
library(psych) # Loading required package
starting_final_asv_kappa <- cohen.kappa(x=cbind(presence_data_starting_final$PresentStarting, presence_data_starting_final$PresentFinal))
starting_final_asv_kappa

In [None]:
## Calculating Cohen's Kappa between starting classes and final classes networks for presence/absence of ASVs
library(psych) # Loading required package
startingcl_finalcl_asv_kappa <- cohen.kappa(x=cbind(presence_data_starting_cl_final_cl$PresentStartingCl, presence_data_starting_cl_final_cl$PresentFinalCl))
startingcl_finalcl_asv_kappa

In [None]:
## Calculating Cohen's Kappa between starting and starting classes networks for presence/absence of ASVs
library(psych) # Loading required package
starting_startingcl_asv_kappa <- cohen.kappa(x=cbind(presence_data_starting_starting_cl$PresentStarting, presence_data_starting_starting_cl$PresentStartingCl))
starting_startingcl_asv_kappa

In [None]:
## Calculating Cohen's Kappa between final and final classes networks for presence/absence of ASVs
library(psych) # Loading required package
final_finalcl_asv_kappa <- cohen.kappa(x=cbind(presence_data_final_final_cl$PresentFinal, presence_data_final_final_cl$PresentFinalCl))
final_finalcl_asv_kappa

In [None]:
asv_kappa_df <- data.frame(
  Network_Pair = c("Starting-Final", "StartingCl-FinalCl", "Starting-StartingCl", "Final-FinalCl"),
  K = c("-0.17", "-0.45", "0", "0"),
  CI = c("-0.19 ≤ K ≤ -0.16", "-0.51 ≤ K ≤ -0.39", "0 ≤ K ≤ 0", "0 ≤ K ≤ 0")
)
asv_kappa_df
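
For reference, Cohen's Kappa compares the observed agreement between two sets of labels, p_o, with the agreement expected by chance, p_e, as kappa = (p_o - p_e) / (1 - p_e). The sketch below computes the unweighted statistic in plain Python for two presence/absence vectors. It is an illustration only: the function name and the example vectors are hypothetical, and it does not reproduce the confidence intervals reported by `cohen.kappa`.

```python
from collections import Counter


def cohen_kappa_unweighted(labels_a, labels_b):
    """Unweighted Cohen's Kappa for two equally long sequences of labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError('Both label sequences must be non-empty and equally long.')
    n = len(labels_a)
    # Observed agreement: proportion of items given the same label in both sequences
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from the marginal frequency of each label in each sequence
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label]
                   for label in set(counts_a) | set(counts_b)) / n ** 2
    if expected == 1:
        return float('nan')  # Undefined when both sequences use a single identical label
    return (observed - expected) / (1 - expected)


# Hypothetical presence/absence of four ASVs in two networks
present_in_network_a = [True, True, False, False]
present_in_network_b = [True, False, False, False]
print(cohen_kappa_unweighted(present_in_network_a, present_in_network_b))
```

Here the two vectors agree on three of four ASVs (p_o = 0.75) and chance agreement is p_e = 0.5, giving a kappa of 0.5.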

#### Results table for report. CODE WRITTEN IN PYTHON.

In [None]:
## FOR IMPORTING INTO REPORT
#### Kappa comparison of ASV presence between networks
from tabulate import tabulate

# Rows of the table: network pair, kappa, confidence interval (same values as asv_kappa_df above)
asv_kappa_table = [['Starting-Final', '-0.17', '-0.19 ≤ K ≤ -0.16'],
                   ['StartingCl-FinalCl', '-0.45', '-0.51 ≤ K ≤ -0.39'],
                   ['Starting-StartingCl', '0', '0 ≤ K ≤ 0'],
                   ['Final-FinalCl', '0', '0 ≤ K ≤ 0']]

# Define the headers
headers = ['Network Pair', 'K', 'CI']

# Print the table using tabulate
print(tabulate(asv_kappa_table, headers=headers, tablefmt='html'))

#### STEP 2: Comparing the type of correlation between networks.
For ASV pairs found in both networks, all possible pairs are compared to see whether they have the same type of correlation in each network (present or absent, positive or negative).
The narrowed-down data frame of shared ASVs from the previous step could not be used here, because it contains only single ASVs, not pairs of ASVs with correlations. Likewise, the data frame below could not be used to identify shared ASVs, because it contains only ASV pairs, not single ASVs.
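
Two small operations carry most of this step: giving each ASV pair an order-independent label so that the same pair matches across networks, and turning each correlation into a sign category. A minimal sketch in Python, assuming ASV names have the form `ASV_<number>`; the function names are hypothetical.

```python
def asv_pair_label(asv_a, asv_b):
    """Label a pair with the lower-numbered ASV first, e.g. 'ASV_3-ASV_12'."""
    first, second = sorted((asv_a, asv_b),
                           key=lambda name: int(name.replace('ASV_', '')))
    return f'{first}-{second}'


def correlation_sign(weight):
    """Classify a correlation as 'Positive', 'Negative' or 'No link'."""
    if weight > 0:
        return 'Positive'
    if weight < 0:
        return 'Negative'
    return 'No link'


print(asv_pair_label('ASV_12', 'ASV_3'))
print(correlation_sign(-0.42))
```

Sorting by the numeric part rather than by the text avoids `ASV_12` being placed before `ASV_3`.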
The code below is written in R.

In [None]:
#### Importing data frames with correlation
## Starting network
starting_correlation_data <- read.table("../data/starting_network_data.tsv", sep = "\t", header = FALSE)
colnames(starting_correlation_data) <- c("ASV1", "ASV2", "Interaction") # Assigning column names
starting_correlation_data$ASV_pair <- rep(NA, nrow(starting_correlation_data)) # Column for ASV pairs
for (j in 1:nrow(starting_correlation_data)){ # Looping over every row (every ASV pair) and sorting so that the smallest ASV number is first in the pair
    asv1_numeric <- as.numeric(gsub("ASV_", "", starting_correlation_data[j,1]))
    asv2_numeric <- as.numeric(gsub("ASV_", "", starting_correlation_data[j,2]))
    sorted_asvs <- sort(c(asv1_numeric, asv2_numeric))
    starting_correlation_data[j,4] <- paste("ASV_", sorted_asvs[1], "-", "ASV_", sorted_asvs[2], sep = "")
    }
starting_correlation_data <- starting_correlation_data[,3:4] # Getting rid of separate ASV pair columns
starting_correlation_data <- starting_correlation_data[, c("ASV_pair", "Interaction")] # Reordering columns
starting_correlation_data$correlation.sign <- rep(NA, nrow(starting_correlation_data)) # Column for sign of correlation
for (i in 1:nrow(starting_correlation_data)){ # Loop to identify positive, negative, or absent links
    if (starting_correlation_data[i, 2] > 0){
        starting_correlation_data[i, 3] <- "Positive"
        }
    else if (starting_correlation_data[i, 2] < 0) {
        starting_correlation_data[i, 3] <- "Negative"
        }
    else if (starting_correlation_data[i, 2] == 0) {
        starting_correlation_data[i, 3] <- "No link"
        }
    }
starting_correlation_data <- starting_correlation_data[, -2] # Removing correlation column
# Combining with functionInk output so that only pairs from the completed network are included in the comparison (FlashWeave correlation file contains some ASVs that aren't included in completed network)
# Importing list of ASVs from functionInk starting network
starting_network_pairs <- read.table("../results/Partition-NL_Average_StopStep-1003_starting_network_data.tsv", sep = "\t", header = FALSE)
colnames(starting_network_pairs) <- c("ASV", "Cluster") # Assigning column names
# All possible combinations of ASV pairs for starting ASVs, excluding duplicates (e.g. ASV1-ASV1)
starting_asv_pairs <- combn(unique(starting_network_pairs$ASV), 2, simplify = TRUE) # Each pair is a column of 2 rows, and each row is an ASV
# Ensuring that the pairs are smallest ASV - largest ASV so that can merge correctly with correlation data later on
starting_sorted_asv_pairs <- rep(NA, ncol(starting_asv_pairs)) # Vector with a slot for every ASV pair
for (m in 1:ncol(starting_asv_pairs)){ # Looping over every column of asv_pairs (over every ASV pair) and sorting so that the smallest is first in the pair
    asv1_numeric <- as.numeric(gsub("ASV_", "", starting_asv_pairs[1,m]))
    asv2_numeric <- as.numeric(gsub("ASV_", "", starting_asv_pairs[2,m]))
    starting_sorted_asvs <- sort(c(asv1_numeric, asv2_numeric))
    starting_sorted_asv_pairs[m] <- paste("ASV_", starting_sorted_asvs[1], "-", "ASV_", starting_sorted_asvs[2], sep = "")
    }
# Filtering out ASV pairs that are not included within the functionInk completed network
starting_correlation_data <- starting_correlation_data[starting_correlation_data$ASV_pair %in% starting_sorted_asv_pairs, ]
# Printing info on numbers of links
print(paste("There are ", nrow(starting_correlation_data), " total links in the starting network."))
print(paste("There are ", nrow(subset(starting_correlation_data, correlation.sign == "Positive")), " positive links in the starting network."))
print(paste("There are ", nrow(subset(starting_correlation_data, correlation.sign == "Negative")), " negative links in the starting network."))


## Starting classes network
starting_classes_correlation_data <- read.table("../data/starting_classes_network_data.tsv", sep = "\t", header = FALSE)
colnames(starting_classes_correlation_data) <- c("ASV1", "ASV2", "Interaction") # Assigning column names
starting_classes_correlation_data <- starting_classes_correlation_data[grepl("ASV", starting_classes_correlation_data$ASV1) & grepl("ASV", starting_classes_correlation_data$ASV2), ] # Removing rows containing partitions instead of ASVs
starting_classes_correlation_data$ASV_pair <- rep(NA, nrow(starting_classes_correlation_data)) # Column for ASV pairs
for (j in 1:nrow(starting_classes_correlation_data)){ # Looping over every row (every ASV pair) and sorting so that the smallest ASV number is first in the pair
    asv1_numeric <- as.numeric(gsub("ASV_", "", starting_classes_correlation_data[j,1]))
    asv2_numeric <- as.numeric(gsub("ASV_", "", starting_classes_correlation_data[j,2]))
    sorted_asvs <- sort(c(asv1_numeric, asv2_numeric))
    starting_classes_correlation_data[j,4] <- paste("ASV_", sorted_asvs[1], "-", "ASV_", sorted_asvs[2], sep = "")
    }
starting_classes_correlation_data <- starting_classes_correlation_data[,3:4] # Getting rid of separate ASV pair columns
starting_classes_correlation_data <- starting_classes_correlation_data[, c("ASV_pair", "Interaction")] # Reordering columns
starting_classes_correlation_data$correlation.sign <- rep(NA, nrow(starting_classes_correlation_data)) # Column for sign of correlation
for (i in 1:nrow(starting_classes_correlation_data)){ # Loop to identify positive, negative, or absent links
    if (starting_classes_correlation_data[i, 2] > 0){
        starting_classes_correlation_data[i, 3] <- "Positive"
        }
    else if (starting_classes_correlation_data[i, 2] < 0) {
        starting_classes_correlation_data[i, 3] <- "Negative"
        }
    else if (starting_classes_correlation_data[i, 2] == 0) {
        starting_classes_correlation_data[i, 3] <- "No link"
        }
    }
starting_classes_correlation_data <- starting_classes_correlation_data[, -2] # Removing correlation column
# Combining with functionInk output so that only pairs from the completed network are included in the comparison (FlashWeave correlation file contains some ASVs that aren't included in completed network)
# Importing list of ASVs from functionInk starting_classes network
starting_classes_network_pairs <- read.table("../results/Partition-NL_Average_StopStep-217_starting_classes_network_data.tsv", sep = "\t", header = FALSE)
colnames(starting_classes_network_pairs) <- c("ASV", "Cluster") # Assigning column names
# Filtering out partition/classes that are included as ASVs
starting_classes_network_pairs <- starting_classes_network_pairs[grepl("ASV", starting_classes_network_pairs$ASV), ]
# All possible combinations of ASV pairs for starting classes ASVs, excluding duplicates (e.g. ASV1-ASV1)
starting_classes_asv_pairs <- combn(unique(starting_classes_network_pairs$ASV), 2, simplify = TRUE) # Each pair is a column of 2 rows, and each row is an ASV
# Ensuring that the pairs are smallest ASV - largest ASV so that can merge correctly with correlation data later on
starting_classes_sorted_asv_pairs <- rep(NA, ncol(starting_classes_asv_pairs)) # Vector with a slot for every ASV pair
for (m in 1:ncol(starting_classes_asv_pairs)){ # Looping over every column of asv_pairs (over every ASV pair) and sorting so that the smallest is first in the pair
    asv1_numeric <- as.numeric(gsub("ASV_", "", starting_classes_asv_pairs[1,m]))
    asv2_numeric <- as.numeric(gsub("ASV_", "", starting_classes_asv_pairs[2,m]))
    starting_classes_sorted_asvs <- sort(c(asv1_numeric, asv2_numeric))
    starting_classes_sorted_asv_pairs[m] <- paste("ASV_", starting_classes_sorted_asvs[1], "-", "ASV_", starting_classes_sorted_asvs[2], sep = "")
    }
# Filtering out ASV pairs that are not included within the functionInk completed network
starting_classes_correlation_data <- starting_classes_correlation_data[starting_classes_correlation_data$ASV_pair %in% starting_classes_sorted_asv_pairs, ]
# Printing info on numbers of links
print(paste("There are ", nrow(starting_classes_correlation_data), " total links in the starting classes network."))
print(paste("There are ", nrow(subset(starting_classes_correlation_data, correlation.sign == "Positive")), " positive links in the starting classes network."))
print(paste("There are ", nrow(subset(starting_classes_correlation_data, correlation.sign == "Negative")), " negative links in the starting classes network."))


## Final network
final_correlation_data <- read.table("../data/final_network_data.tsv", sep = "\t", header = FALSE)
colnames(final_correlation_data) <- c("ASV1", "ASV2", "Interaction") # Assigning column names
final_correlation_data$ASV_pair <- rep(NA, nrow(final_correlation_data)) # Column for ASV pairs
for (j in 1:nrow(final_correlation_data)){ # Looping over every row (every ASV pair) and sorting so that the smallest ASV number is first in the pair
    asv1_numeric <- as.numeric(gsub("ASV_", "", final_correlation_data[j,1]))
    asv2_numeric <- as.numeric(gsub("ASV_", "", final_correlation_data[j,2]))
    sorted_asvs <- sort(c(asv1_numeric, asv2_numeric))
    final_correlation_data[j,4] <- paste("ASV_", sorted_asvs[1], "-", "ASV_", sorted_asvs[2], sep = "")
    }
final_correlation_data <- final_correlation_data[,3:4] # Getting rid of separate ASV pair columns
final_correlation_data <- final_correlation_data[, c("ASV_pair", "Interaction")] # Reordering columns
final_correlation_data$correlation.sign <- rep(NA, nrow(final_correlation_data)) # Column for sign of correlation
for (i in 1:nrow(final_correlation_data)){ # Loop to identify positive, negative, or absent links
    if (final_correlation_data[i, 2] > 0){
        final_correlation_data[i, 3] <- "Positive"
        }
    else if (final_correlation_data[i, 2] < 0) {
        final_correlation_data[i, 3] <- "Negative"
        }
    else if (final_correlation_data[i, 2] == 0) {
        final_correlation_data[i, 3] <- "No link"
        }
    }
final_correlation_data <- final_correlation_data[, -2] # Removing correlation column
# Combining with functionInk output so that only pairs from the completed network are included in the comparison (FlashWeave correlation file contains some ASVs that aren't included in completed network)
# Importing list of ASVs from functionInk final network
final_network_pairs <- read.table("../results/Partition-NL_Average_StopStep-859_final_network_data.tsv", sep = "\t", header = FALSE)
colnames(final_network_pairs) <- c("ASV", "Cluster") # Assigning column names
# All possible combinations of ASV pairs for final ASVs, excluding duplicates (e.g. ASV1-ASV1)
final_asv_pairs <- combn(unique(final_network_pairs$ASV), 2, simplify = TRUE) # Each pair is a column of 2 rows, and each row is an ASV
# Ensuring that the pairs are smallest ASV - largest ASV so that can merge correctly with correlation data later on
final_sorted_asv_pairs <- rep(NA, ncol(final_asv_pairs)) # Vector with a slot for every ASV pair
for (m in 1:ncol(final_asv_pairs)){ # Looping over every column of asv_pairs (over every ASV pair) and sorting so that the smallest is first in the pair
    asv1_numeric <- as.numeric(gsub("ASV_", "", final_asv_pairs[1,m]))
    asv2_numeric <- as.numeric(gsub("ASV_", "", final_asv_pairs[2,m]))
    final_sorted_asvs <- sort(c(asv1_numeric, asv2_numeric))
    final_sorted_asv_pairs[m] <- paste("ASV_", final_sorted_asvs[1], "-", "ASV_", final_sorted_asvs[2], sep = "")
    }
# Filtering out ASV pairs that are not included within the functionInk completed network
final_correlation_data <- final_correlation_data[final_correlation_data$ASV_pair %in% final_sorted_asv_pairs, ]
# Printing info on numbers of links
print(paste("There are ", nrow(final_correlation_data), " total links in the final network."))
print(paste("There are ", nrow(subset(final_correlation_data, correlation.sign == "Positive")), " positive links in the final network."))
print(paste("There are ", nrow(subset(final_correlation_data, correlation.sign == "Negative")), " negative links in the final network."))


## Final classes network
final_classes_correlation_data <- read.table("../data/final_classes_network_data.tsv", sep = "\t", header = FALSE)
colnames(final_classes_correlation_data) <- c("ASV1", "ASV2", "Interaction") # Assigning column names
final_classes_correlation_data <- final_classes_correlation_data[grepl("ASV", final_classes_correlation_data$ASV1) & grepl("ASV", final_classes_correlation_data$ASV2), ] # Removing rows containing partitions instead of ASVs
final_classes_correlation_data$ASV_pair <- rep(NA, nrow(final_classes_correlation_data)) # Column for ASV pairs
for (j in 1:nrow(final_classes_correlation_data)){ # Looping over every row (every ASV pair) and sorting so that the smallest ASV number is first in the pair
    asv1_numeric <- as.numeric(gsub("ASV_", "", final_classes_correlation_data[j,1]))
    asv2_numeric <- as.numeric(gsub("ASV_", "", final_classes_correlation_data[j,2]))
    sorted_asvs <- sort(c(asv1_numeric, asv2_numeric))
    final_classes_correlation_data[j,4] <- paste("ASV_", sorted_asvs[1], "-", "ASV_", sorted_asvs[2], sep = "")
    }
final_classes_correlation_data <- final_classes_correlation_data[,3:4] # Getting rid of separate ASV pair columns
final_classes_correlation_data <- final_classes_correlation_data[, c("ASV_pair", "Interaction")] # Reordering columns
final_classes_correlation_data$correlation.sign <- rep(NA, nrow(final_classes_correlation_data)) # Column for sign of correlation
for (i in 1:nrow(final_classes_correlation_data)){ # Loop to identify positive, negative, or absent links
    if (final_classes_correlation_data[i, 2] > 0){
        final_classes_correlation_data[i, 3] <- "Positive"
        }
    else if (final_classes_correlation_data[i, 2] < 0) {
        final_classes_correlation_data[i, 3] <- "Negative"
        }
    else if (final_classes_correlation_data[i, 2] == 0) {
        final_classes_correlation_data[i, 3] <- "No link"
        }
    }
final_classes_correlation_data <- final_classes_correlation_data[, -2] # Removing correlation column
# Combining with functionInk output so that only pairs from the completed network are included in the comparison (FlashWeave correlation file contains some ASVs that aren't included in completed network)
# Importing list of ASVs from functionInk final classes network
final_classes_network_pairs <- read.table("../results/Partition-NL_Average_StopStep-364_final_classes_network_data.tsv", sep = "\t", header = FALSE)
colnames(final_classes_network_pairs) <- c("ASV", "Cluster") # Assigning column names
# Filtering out partition/classes that are included as ASVs
final_classes_network_pairs <- final_classes_network_pairs[grepl("ASV", final_classes_network_pairs$ASV), ]
# All possible combinations of ASV pairs for final classes ASVs, excluding duplicates (e.g. ASV1-ASV1)
final_classes_asv_pairs <- combn(unique(final_classes_network_pairs$ASV), 2, simplify = TRUE) # Each pair is a column of 2 rows, and each row is an ASV
# Ensuring that the pairs are smallest ASV - largest ASV so that can merge correctly with correlation data later on
final_classes_sorted_asv_pairs <- rep(NA, ncol(final_classes_asv_pairs)) # Vector with a slot for every ASV pair
for (m in 1:ncol(final_classes_asv_pairs)){ # Looping over every column of asv_pairs (over every ASV pair) and sorting so that the smallest is first in the pair
    asv1_numeric <- as.numeric(gsub("ASV_", "", final_classes_asv_pairs[1,m]))
    asv2_numeric <- as.numeric(gsub("ASV_", "", final_classes_asv_pairs[2,m]))
    final_classes_sorted_asvs <- sort(c(asv1_numeric, asv2_numeric))
    final_classes_sorted_asv_pairs[m] <- paste("ASV_", final_classes_sorted_asvs[1], "-", "ASV_", final_classes_sorted_asvs[2], sep = "")
    }
# Filtering out ASV pairs that are not included within the functionInk completed network
final_classes_correlation_data <- final_classes_correlation_data[final_classes_correlation_data$ASV_pair %in% final_classes_sorted_asv_pairs, ]
# Printing info on numbers of links
print(paste("There are ", nrow(final_classes_correlation_data), " total links in the final classes network."))
print(paste("There are ", nrow(subset(final_classes_correlation_data, correlation.sign == "Positive")), " positive links in the final classes network."))
print(paste("There are ", nrow(subset(final_classes_correlation_data, correlation.sign == "Negative")), " negative links in the final classes network."))


## Data frames of pairs of networks
# All
correlation_all <- merge(starting_correlation_data, final_correlation_data, by = "ASV_pair", all = FALSE)
correlation_all <- merge(correlation_all, starting_classes_correlation_data, by = "ASV_pair", all = FALSE)
correlation_all <- merge(correlation_all, final_classes_correlation_data, by = "ASV_pair", all = FALSE)
colnames(correlation_all) <- c("ASV_pair", "CorrelationStarting", "CorrelationFinal", "CorrelationStartingCl", "CorrelationFinalCl")
correlation_all <- correlation_all[2:5]

# Starting final
correlation_starting_final <- merge(starting_correlation_data, final_correlation_data, by = "ASV_pair", all = FALSE)
colnames(correlation_starting_final) <- c("ASV_Pair", "CorrelationStarting", "CorrelationFinal") # Assigning column name
correlation_starting_final <- correlation_starting_final[2:3]
# Printing info on numbers of links
print(paste("There are ", nrow(correlation_starting_final), " total links shared between the starting and final networks."))
print(paste("There are ", nrow(subset(correlation_starting_final, CorrelationStarting == "Positive" & CorrelationFinal == "Positive")), " positive links shared between the starting and final networks."))
print(paste("There are ", nrow(subset(correlation_starting_final, CorrelationStarting == "Negative" & CorrelationFinal == "Negative")), " negative links shared between the starting and final networks."))

# Starting classes final classes
correlation_startingcl_finalcl <- merge(starting_classes_correlation_data, final_classes_correlation_data, by = "ASV_pair", all = FALSE)
colnames(correlation_startingcl_finalcl) <- c("ASV_Pair", "CorrelationStartingCl", "CorrelationFinalCl") # Assigning column name
correlation_startingcl_finalcl <- correlation_startingcl_finalcl[2:3]
print(paste("There are ", nrow(correlation_startingcl_finalcl), " total links shared between the starting classes and final classes networks."))
print(paste("There are ", nrow(subset(correlation_startingcl_finalcl, CorrelationStartingCl == "Positive" & CorrelationFinalCl == "Positive")), " positive links shared between the starting classes and final classes networks."))
print(paste("There are ", nrow(subset(correlation_startingcl_finalcl, CorrelationStartingCl == "Negative" & CorrelationFinalCl == "Negative")), " negative links shared between the starting classes and final classes networks."))

# Starting starting classes 
correlation_starting_startingcl <- merge(starting_correlation_data, starting_classes_correlation_data, by = "ASV_pair", all = FALSE)
colnames(correlation_starting_startingcl) <- c("ASV_Pair", "CorrelationStarting", "CorrelationStartingCl") # Assigning column name
correlation_starting_startingcl <- correlation_starting_startingcl[2:3]
print(paste("There are ", nrow(correlation_starting_startingcl), " total links shared between the starting and starting classes networks."))
print(paste("There are ", nrow(subset(correlation_starting_startingcl, CorrelationStarting == "Positive" & CorrelationStartingCl == "Positive")), " positive links shared between the starting and starting classes networks."))
print(paste("There are ", nrow(subset(correlation_starting_startingcl, CorrelationStarting == "Negative" & CorrelationStartingCl == "Negative")), " negative links shared between the starting and starting classes networks."))

# Final final classes 
correlation_final_finalcl <- merge(final_correlation_data, final_classes_correlation_data, by = "ASV_pair", all = FALSE)
colnames(correlation_final_finalcl) <- c("ASV_Pair", "CorrelationFinal", "CorrelationFinalCl") # Assigning column name
correlation_final_finalcl <- correlation_final_finalcl[2:3]
print(paste("There are ", nrow(correlation_final_finalcl), " total links shared between the final and final classes networks."))
print(paste("There are ", nrow(subset(correlation_final_finalcl, CorrelationFinal == "Positive" & CorrelationFinalCl == "Positive")), " positive links shared between the final and final classes networks."))
print(paste("There are ", nrow(subset(correlation_final_finalcl, CorrelationFinal == "Negative" & CorrelationFinalCl == "Negative")), " negative links shared between the final and final classes networks."))

In [None]:
## Calculating Light's Kappa between all networks
library(psych) # Loading required package
all_correlation_kappa <- cohen.kappa(as.matrix(correlation_all))
all_correlation_kappa

In [None]:
## Calculating Cohen's Kappa between starting and final networks for sign of correlation
library(psych) # Loading required package
starting_final_correlation_kappa <- cohen.kappa(correlation_starting_final)
starting_final_correlation_kappa

In [None]:
## Calculating Cohen's Kappa between starting classes and final classes networks for sign of correlation
library(psych) # Loading required package
startingcl_finalcl_correlation_kappa <- cohen.kappa(correlation_startingcl_finalcl)
startingcl_finalcl_correlation_kappa

In [None]:
## Calculating Cohen's Kappa between starting and starting classes networks for sign of correlation
library(psych) # Loading required package
starting_startingcl_correlation_kappa <- cohen.kappa(correlation_starting_startingcl)
starting_startingcl_correlation_kappa

In [None]:
## Calculating Cohen's Kappa between final and final classes networks for sign of correlation
library(psych) # Loading required package
final_finalcl_correlation_kappa <- cohen.kappa(correlation_final_finalcl)
final_finalcl_correlation_kappa

#### STEP 3: Comparing which cluster each pair of ASVs belongs to between networks
1. Comparing whether ASVs that are present in all networks are found in the same cluster (taking ASVs that are present in all networks, and performing all-pairwise comparisons of whether they are in the same cluster).
2. Comparing whether ASV pairs that are present in all networks, and that are positively correlated in all networks, are found in the same cluster between networks.
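
The two comparisons above can be illustrated with a small, self-contained sketch. It is written in Python (the kernel used in section 1) with the standard library only, and it is not the R code used in the cells below. The helper names `pair_key` and `same_cluster_flags` and the toy partitions are hypothetical and exist only for illustration. Each ASV pair gets a canonical "smallest-largest" label, is flagged as being in the same cluster or not within each network, and only pairs present in both networks are kept for the comparison.

```python
from itertools import combinations


def pair_key(asv_a, asv_b):
    """Return a canonical 'smaller-larger' label such as 'ASV_2-ASV_10'."""
    lo, hi = sorted((asv_a, asv_b), key=lambda name: int(name.replace("ASV_", "")))
    return f"{lo}-{hi}"


def same_cluster_flags(cluster_of):
    """Map every possible ASV pair to True if both ASVs are in the same cluster."""
    return {
        pair_key(a, b): cluster_of[a] == cluster_of[b]
        for a, b in combinations(sorted(cluster_of), 2)
    }


# Hypothetical toy partitions (ASV -> cluster ID), not real pipeline output
starting = {"ASV_1": 1, "ASV_2": 1, "ASV_10": 2}
final = {"ASV_2": 5, "ASV_10": 5, "ASV_1": 7, "ASV_30": 7}

starting_flags = same_cluster_flags(starting)
final_flags = same_cluster_flags(final)

# Keep only the pairs found in both networks (an inner join on the pair label)
shared = sorted(set(starting_flags) & set(final_flags))
comparison = {pair: (starting_flags[pair], final_flags[pair]) for pair in shared}

for pair, flags in comparison.items():
    print(pair, flags)
```

Sorting by the numeric part of the ASV name, rather than by the string itself, is what keeps `ASV_2` ahead of `ASV_10`, so the same pair always receives the same label in every network.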

In [None]:
# Importing the data - data frame indicating which cluster each ASV in the network belongs to.
starting_cluster_data <- read.table("../results/Partition-NL_Average_StopStep-1003_starting_network_data.tsv", sep = "\t", header = FALSE)
colnames(starting_cluster_data) <- c("ASV", "Cluster") # Assigning column names
starting_classes_cluster_data <- read.table("../results/Partition-NL_Average_StopStep-217_starting_classes_network_data.tsv", sep = "\t", header = FALSE)
colnames(starting_classes_cluster_data) <- c("ASV", "Cluster") # Assigning column names
starting_classes_cluster_data <- starting_classes_cluster_data[grepl("ASV", starting_classes_cluster_data$ASV), ] # Removing non-ASV nodes (partition/classes)
final_cluster_data <- read.table("../results/Partition-NL_Average_StopStep-859_final_network_data.tsv", sep = "\t", header = FALSE)
colnames(final_cluster_data) <- c("ASV", "Cluster") # Assigning column names
final_classes_cluster_data <- read.table("../results/Partition-NL_Average_StopStep-364_final_classes_network_data.tsv", sep = "\t", header = FALSE)
colnames(final_classes_cluster_data) <- c("ASV", "Cluster") # Assigning column names
final_classes_cluster_data <- final_classes_cluster_data[grepl("ASV", final_classes_cluster_data$ASV), ] # Removing non-ASV nodes (partition/classes)


# All possible combinations of ASV pairs, as smallest-largest, and excluding both duplicates and non-ASV nodes
# Starting
starting_asv_pairs <- combn(unique(starting_cluster_data$ASV), 2, simplify = TRUE) # Each pair is a column of 2 rows, and each row is an ASV
# Ensuring that the pairs are smallest ASV - largest ASV so that can merge correctly with correlation data later on
starting_sorted_asv_pairs <- rep(NA, ncol(starting_asv_pairs)) # Vector with a slot for every ASV pair
for (j in 1:ncol(starting_asv_pairs)){ # Looping over every column of asv_pairs (over every ASV pair) and sorting so that the smallest is first in the pair
    asv1_numeric <- as.numeric(gsub("ASV_", "", starting_asv_pairs[1,j]))
    asv2_numeric <- as.numeric(gsub("ASV_", "", starting_asv_pairs[2,j]))
    sorted_asvs <- sort(c(asv1_numeric, asv2_numeric))
    starting_sorted_asv_pairs[j] <- paste("ASV_", sorted_asvs[1], "-", "ASV_", sorted_asvs[2], sep = "")
    }

# Starting classes
starting_classes_asv_pairs <- combn(unique(starting_classes_cluster_data$ASV), 2, simplify = TRUE) # Each pair is a column of 2 rows, and each row is an ASV
# Ensuring that the pairs are smallest ASV - largest ASV so that can merge correctly with correlation data later on
starting_classes_sorted_asv_pairs <- rep(NA, ncol(starting_classes_asv_pairs)) # Vector with a slot for every ASV pair
for (j in 1:ncol(starting_classes_asv_pairs)){ # Looping over every column of asv_pairs (over every ASV pair) and sorting so that the smallest is first in the pair
    asv1_numeric <- as.numeric(gsub("ASV_", "", starting_classes_asv_pairs[1,j]))
    asv2_numeric <- as.numeric(gsub("ASV_", "", starting_classes_asv_pairs[2,j]))
    sorted_asvs <- sort(c(asv1_numeric, asv2_numeric))
    starting_classes_sorted_asv_pairs[j] <- paste("ASV_", sorted_asvs[1], "-", "ASV_", sorted_asvs[2], sep = "")
    }

# Final
final_asv_pairs <- combn(unique(final_cluster_data$ASV), 2, simplify = TRUE) # Each pair is a column of 2 rows, and each row is an ASV
# Ensuring that the pairs are smallest ASV - largest ASV so that can merge correctly with correlation data later on
final_sorted_asv_pairs <- rep(NA, ncol(final_asv_pairs)) # Vector with a slot for every ASV pair
for (j in 1:ncol(final_asv_pairs)){ # Looping over every column of asv_pairs (over every ASV pair) and sorting so that the smallest is first in the pair
    asv1_numeric <- as.numeric(gsub("ASV_", "", final_asv_pairs[1,j]))
    asv2_numeric <- as.numeric(gsub("ASV_", "", final_asv_pairs[2,j]))
    sorted_asvs <- sort(c(asv1_numeric, asv2_numeric))
    final_sorted_asv_pairs[j] <- paste("ASV_", sorted_asvs[1], "-", "ASV_", sorted_asvs[2], sep = "")
    }

# Final classes
final_classes_asv_pairs <- combn(unique(final_classes_cluster_data$ASV), 2, simplify = TRUE) # Each pair is a column of 2 rows, and each row is an ASV
# Ensuring that the pairs are smallest ASV - largest ASV so that can merge correctly with correlation data later on
final_classes_sorted_asv_pairs <- rep(NA, ncol(final_classes_asv_pairs)) # Vector with a slot for every ASV pair
for (j in 1:ncol(final_classes_asv_pairs)){ # Looping over every column of asv_pairs (over every ASV pair) and sorting so that the smallest is first in the pair
    asv1_numeric <- as.numeric(gsub("ASV_", "", final_classes_asv_pairs[1,j]))
    asv2_numeric <- as.numeric(gsub("ASV_", "", final_classes_asv_pairs[2,j]))
    sorted_asvs <- sort(c(asv1_numeric, asv2_numeric))
    final_classes_sorted_asv_pairs[j] <- paste("ASV_", sorted_asvs[1], "-", "ASV_", sorted_asvs[2], sep = "")
    }


## Starting network
# Vector indicating whether each pair is in the same group
same_group <- logical(length = ncol(starting_asv_pairs))
# Checking if each pair is in the same group
for (i in 1:ncol(starting_asv_pairs)) { # Each ASV pair is a column - looping over each ASV pair
  asv_pair <- starting_asv_pairs[, i]
  asv1_group <- starting_cluster_data$Cluster[starting_cluster_data$ASV == asv_pair[1]]
  asv2_group <- starting_cluster_data$Cluster[starting_cluster_data$ASV == asv_pair[2]]
  same_group[i] <- asv1_group == asv2_group
}
# Create a data frame with ASV pairs and whether they are in the same group
starting_pairs_groups <- data.frame(ASV_pair = starting_sorted_asv_pairs,
                        Same_group = same_group)

## Starting classes network
# Vector indicating whether each pair is in the same group
same_group <- logical(length = ncol(starting_classes_asv_pairs))
# Checking if each pair is in the same group
for (i in 1:ncol(starting_classes_asv_pairs)) { # Each ASV pair is a column - looping over each ASV pair
  asv_pair <- starting_classes_asv_pairs[, i]
  asv1_group <- starting_classes_cluster_data$Cluster[starting_classes_cluster_data$ASV == asv_pair[1]]
  asv2_group <- starting_classes_cluster_data$Cluster[starting_classes_cluster_data$ASV == asv_pair[2]]
  same_group[i] <- asv1_group == asv2_group
}
# Create a data frame with ASV pairs and whether they are in the same group
starting_classes_pairs_groups <- data.frame(ASV_pair = starting_classes_sorted_asv_pairs,
                        Same_group = same_group)

## Final network
# Vector indicating whether each pair is in the same group
same_group <- logical(length = ncol(final_asv_pairs))
# Checking if each pair is in the same group
for (i in 1:ncol(final_asv_pairs)) { # Each ASV pair is a column - looping over each ASV pair
  asv_pair <- final_asv_pairs[, i]
  asv1_group <- final_cluster_data$Cluster[final_cluster_data$ASV == asv_pair[1]]
  asv2_group <- final_cluster_data$Cluster[final_cluster_data$ASV == asv_pair[2]]
  same_group[i] <- asv1_group == asv2_group
}
# Create a data frame with ASV pairs and whether they are in the same group
final_pairs_groups <- data.frame(ASV_pair = final_sorted_asv_pairs,
                        Same_group = same_group)

## Final classes network
# Vector indicating whether each pair is in the same group
same_group <- logical(length = ncol(final_classes_asv_pairs))
# Checking if each pair is in the same group
for (i in 1:ncol(final_classes_asv_pairs)) { # Each ASV pair is a column - looping over each ASV pair
  asv_pair <- final_classes_asv_pairs[, i]
  asv1_group <- final_classes_cluster_data$Cluster[final_classes_cluster_data$ASV == asv_pair[1]]
  asv2_group <- final_classes_cluster_data$Cluster[final_classes_cluster_data$ASV == asv_pair[2]]
  same_group[i] <- asv1_group == asv2_group
}
# Create a data frame with ASV pairs and whether they are in the same group
final_classes_pairs_groups <- data.frame(ASV_pair = final_classes_sorted_asv_pairs,
                        Same_group = same_group)



# The above data frames ("starting_pairs_groups", etc) are for all pairs, regardless of sign of correlation
# Below, narrowing it down to ASV pairs that are positively correlated

# Getting correlation data
## Starting network
starting_correlation_data <- read.table("../data/starting_network_data.tsv", sep = "\t", header = FALSE)
colnames(starting_correlation_data) <- c("ASV1", "ASV2", "Interaction") # Assigning column names
starting_correlation_data$ASV_pair <- rep(NA, nrow(starting_correlation_data)) # Column for ASV pairs
for (j in 1:nrow(starting_correlation_data)){ # Looping over every row (every ASV pair) and sorting so that the smallest ASV is first in the pair
    asv1_numeric <- as.numeric(gsub("ASV_", "", starting_correlation_data[j,1]))
    asv2_numeric <- as.numeric(gsub("ASV_", "", starting_correlation_data[j,2]))
    sorted_asvs <- sort(c(asv1_numeric, asv2_numeric))
    starting_correlation_data[j,4] <- paste("ASV_", sorted_asvs[1], "-", "ASV_", sorted_asvs[2], sep = "")
    }
starting_correlation_data <- starting_correlation_data[,3:4] # Getting rid of separate ASV pair columns
starting_correlation_data <- starting_correlation_data[, c("ASV_pair", "Interaction")] # Reordering columns
starting_correlation_data$correlation.sign <- rep(NA, nrow(starting_correlation_data)) # Column for sign of correlation
for (i in 1:nrow(starting_correlation_data)){ # Loop to identify positive, negative, or absent links
    if (starting_correlation_data[i, 2] > 0){
        starting_correlation_data[i, 3] <- "Positive"
        }
    else if (starting_correlation_data[i, 2] < 0) {
        starting_correlation_data[i, 3] <- "Negative"
        }
    else if (starting_correlation_data[i, 2] == 0) {
        starting_correlation_data[i, 3] <- "No link"
        }
    }
starting_correlation_data <- starting_correlation_data[, -2] # Removing correlation column

## Starting classes network
starting_classes_correlation_data <- read.table("../data/starting_classes_network_data.tsv", sep = "\t", header = FALSE)
colnames(starting_classes_correlation_data) <- c("ASV1", "ASV2", "Interaction") # Assigning column names
starting_classes_correlation_data <- starting_classes_correlation_data[grepl("ASV", starting_classes_correlation_data$ASV1) & grepl("ASV", starting_classes_correlation_data$ASV2), ] # Removing rows containing partitions instead of ASVs
starting_classes_correlation_data$ASV_pair <- rep(NA, nrow(starting_classes_correlation_data)) # Column for ASV pairs
for (j in 1:nrow(starting_classes_correlation_data)){ # Looping over every row (every ASV pair) and sorting so that the smallest ASV is first in the pair
    asv1_numeric <- as.numeric(gsub("ASV_", "", starting_classes_correlation_data[j,1]))
    asv2_numeric <- as.numeric(gsub("ASV_", "", starting_classes_correlation_data[j,2]))
    sorted_asvs <- sort(c(asv1_numeric, asv2_numeric))
    starting_classes_correlation_data[j,4] <- paste("ASV_", sorted_asvs[1], "-", "ASV_", sorted_asvs[2], sep = "")
    }
starting_classes_correlation_data <- starting_classes_correlation_data[,3:4] # Getting rid of separate ASV pair columns
starting_classes_correlation_data <- starting_classes_correlation_data[, c("ASV_pair", "Interaction")] # Reordering columns
starting_classes_correlation_data$correlation.sign <- rep(NA, nrow(starting_classes_correlation_data)) # Column for sign of correlation
for (i in 1:nrow(starting_classes_correlation_data)){ # Loop to identify positive, negative, or absent links
    if (starting_classes_correlation_data[i, 2] > 0){
        starting_classes_correlation_data[i, 3] <- "Positive"
        }
    else if (starting_classes_correlation_data[i, 2] < 0) {
        starting_classes_correlation_data[i, 3] <- "Negative"
        }
    else if (starting_classes_correlation_data[i, 2] == 0) {
        starting_classes_correlation_data[i, 3] <- "No link"
        }
    }
starting_classes_correlation_data <- starting_classes_correlation_data[, -2] # Removing correlation column

## Final network
final_correlation_data <- read.table("../data/final_network_data.tsv", sep = "\t", header = FALSE)
colnames(final_correlation_data) <- c("ASV1", "ASV2", "Interaction") # Assigning column names
final_correlation_data$ASV_pair <- rep(NA, nrow(final_correlation_data)) # Column for ASV pairs
for (j in 1:nrow(final_correlation_data)){ # Looping over every row (every ASV pair) and sorting so that the smallest ASV is first in the pair
    asv1_numeric <- as.numeric(gsub("ASV_", "", final_correlation_data[j,1]))
    asv2_numeric <- as.numeric(gsub("ASV_", "", final_correlation_data[j,2]))
    sorted_asvs <- sort(c(asv1_numeric, asv2_numeric))
    final_correlation_data[j,4] <- paste("ASV_", sorted_asvs[1], "-", "ASV_", sorted_asvs[2], sep = "")
    }
final_correlation_data <- final_correlation_data[,3:4] # Getting rid of separate ASV pair columns
final_correlation_data <- final_correlation_data[, c("ASV_pair", "Interaction")] # Reordering columns
final_correlation_data$correlation.sign <- rep(NA, nrow(final_correlation_data)) # Column for sign of correlation
for (i in 1:nrow(final_correlation_data)){ # Loop to identify positive, negative, or absent links
    if (final_correlation_data[i, 2] > 0){
        final_correlation_data[i, 3] <- "Positive"
        }
    else if (final_correlation_data[i, 2] < 0) {
        final_correlation_data[i, 3] <- "Negative"
        }
    else if (final_correlation_data[i, 2] == 0) {
        final_correlation_data[i, 3] <- "No link"
        }
    }
final_correlation_data <- final_correlation_data[, -2] # Removing correlation column

## Final classes network
final_classes_correlation_data <- read.table("../data/final_classes_network_data.tsv", sep = "\t", header = FALSE)
colnames(final_classes_correlation_data) <- c("ASV1", "ASV2", "Interaction") # Assigning column names
final_classes_correlation_data <- final_classes_correlation_data[grepl("ASV", final_classes_correlation_data$ASV1) & grepl("ASV", final_classes_correlation_data$ASV2), ] # Removing rows containing partitions instead of ASVs
final_classes_correlation_data$ASV_pair <- rep(NA, nrow(final_classes_correlation_data)) # Column for ASV pairs
for (j in 1:nrow(final_classes_correlation_data)){ # Looping over every row (every ASV pair) and sorting so that the smallest ASV is first in the pair
    asv1_numeric <- as.numeric(gsub("ASV_", "", final_classes_correlation_data[j,1]))
    asv2_numeric <- as.numeric(gsub("ASV_", "", final_classes_correlation_data[j,2]))
    sorted_asvs <- sort(c(asv1_numeric, asv2_numeric))
    final_classes_correlation_data[j,4] <- paste("ASV_", sorted_asvs[1], "-", "ASV_", sorted_asvs[2], sep = "")
    }
final_classes_correlation_data <- final_classes_correlation_data[,3:4] # Getting rid of separate ASV pair columns
final_classes_correlation_data <- final_classes_correlation_data[, c("ASV_pair", "Interaction")] # Reordering columns
final_classes_correlation_data$correlation.sign <- rep(NA, nrow(final_classes_correlation_data)) # Column for sign of correlation
for (i in 1:nrow(final_classes_correlation_data)){ # Loop to identify positive, negative, or absent links
    if (final_classes_correlation_data[i, 2] > 0){
        final_classes_correlation_data[i, 3] <- "Positive"
        }
    else if (final_classes_correlation_data[i, 2] < 0) {
        final_classes_correlation_data[i, 3] <- "Negative"
        }
    else if (final_classes_correlation_data[i, 2] == 0) {
        final_classes_correlation_data[i, 3] <- "No link"
        }
    }
final_classes_correlation_data <- final_classes_correlation_data[, -2] # Removing correlation column



### Merging correlation data frame into cluster data frame
positive_starting_pairs_groups <- merge(starting_pairs_groups, starting_correlation_data, by = "ASV_pair", all = FALSE)
positive_starting_classes_pairs_groups <- merge(starting_classes_pairs_groups, starting_classes_correlation_data, by = "ASV_pair", all = FALSE)
positive_final_pairs_groups <- merge(final_pairs_groups, final_correlation_data, by = "ASV_pair", all = FALSE)
positive_final_classes_pairs_groups <- merge(final_classes_pairs_groups, final_classes_correlation_data, by = "ASV_pair", all = FALSE)
# Only keeping positive correlations
positive_starting_pairs_groups <- positive_starting_pairs_groups[positive_starting_pairs_groups$correlation.sign == "Positive", ]
positive_starting_classes_pairs_groups <- positive_starting_classes_pairs_groups[positive_starting_classes_pairs_groups$correlation.sign == "Positive", ]
positive_final_pairs_groups <- positive_final_pairs_groups[positive_final_pairs_groups$correlation.sign == "Positive", ]
positive_final_classes_pairs_groups <- positive_final_classes_pairs_groups[positive_final_classes_pairs_groups$correlation.sign == "Positive", ]
# Getting rid of correlation column
positive_starting_pairs_groups <- positive_starting_pairs_groups[1:2]
positive_starting_classes_pairs_groups <- positive_starting_classes_pairs_groups[1:2]
positive_final_pairs_groups <- positive_final_pairs_groups[1:2]
positive_final_classes_pairs_groups <- positive_final_classes_pairs_groups[1:2]

### Data frames for comparing networks
# Starting-Final, all
starting_final_clusters <- merge(starting_pairs_groups, final_pairs_groups, by = "ASV_pair", all = FALSE)
colnames(starting_final_clusters) <- c("ASV_pair", "Same_group_starting", "Same_group_final") # Assigning column names
starting_final_clusters <- starting_final_clusters[,2:3]
# Starting-Final, positive only
positive_starting_final_clusters <- merge(positive_starting_pairs_groups, positive_final_pairs_groups, by = "ASV_pair", all = FALSE)
colnames(positive_starting_final_clusters) <- c("ASV_pair", "Same_group_starting", "Same_group_final") # Assigning column names
positive_starting_final_clusters <- positive_starting_final_clusters[,2:3]
# Starting classes-Final classes, all
startingcl_finalcl_clusters <- merge(starting_classes_pairs_groups, final_classes_pairs_groups, by = "ASV_pair", all = FALSE)
colnames(startingcl_finalcl_clusters) <- c("ASV_pair", "Same_group_startingcl", "Same_group_finalcl") # Assigning column names
startingcl_finalcl_clusters <- startingcl_finalcl_clusters[,2:3]
# Starting classes-Final classes, positive only
positive_startingcl_finalcl_clusters <- merge(positive_starting_classes_pairs_groups, positive_final_classes_pairs_groups, by = "ASV_pair", all = FALSE)
colnames(positive_startingcl_finalcl_clusters) <- c("ASV_pair", "Same_group_startingcl", "Same_group_finalcl") # Assigning column names
positive_startingcl_finalcl_clusters <- positive_startingcl_finalcl_clusters[,2:3]
# Starting-Starting classes, all
starting_startingcl_clusters <- merge(starting_pairs_groups, starting_classes_pairs_groups, by = "ASV_pair", all = FALSE)
colnames(starting_startingcl_clusters) <- c("ASV_pair", "Same_group_starting", "Same_group_startingcl") # Assigning column names
starting_startingcl_clusters <- starting_startingcl_clusters[,2:3]
# Starting-Starting classes, positive only
positive_starting_startingcl_clusters <- merge(positive_starting_pairs_groups, positive_starting_classes_pairs_groups, by = "ASV_pair", all = FALSE)
colnames(positive_starting_startingcl_clusters) <- c("ASV_pair", "Same_group_starting", "Same_group_startingcl") # Assigning column names
positive_starting_startingcl_clusters <- positive_starting_startingcl_clusters[,2:3]
# Final-Final classes, all
final_finalcl_clusters <- merge(final_pairs_groups, final_classes_pairs_groups, by = "ASV_pair", all = FALSE)
colnames(final_finalcl_clusters) <- c("ASV_pair", "Same_group_final", "Same_group_finalcl") # Assigning column names
final_finalcl_clusters <- final_finalcl_clusters[,2:3]
# Final-Final classes, positive only
positive_final_finalcl_clusters <- merge(positive_final_pairs_groups, positive_final_classes_pairs_groups, by = "ASV_pair", all = FALSE)
colnames(positive_final_finalcl_clusters) <- c("ASV_pair", "Same_group_final", "Same_group_finalcl") # Assigning column names
positive_final_finalcl_clusters <- positive_final_finalcl_clusters[,2:3]
# All, all
all_clusters <- merge(starting_pairs_groups, final_pairs_groups, by = "ASV_pair", all = FALSE)
all_clusters <- merge(all_clusters, starting_classes_pairs_groups, by = "ASV_pair", all = FALSE)
all_clusters <- merge(all_clusters, final_classes_pairs_groups, by = "ASV_pair", all = FALSE)
colnames(all_clusters) <- c("ASV_pair","Same_group_starting", "Same_group_final", "Same_group_startingcl", "Same_group_finalcl") # Assigning column names
all_clusters <- all_clusters[,2:5] # Keeping the Same_group columns of all four networks
# All, positive only
positive_all_clusters <- merge(positive_starting_pairs_groups, positive_final_pairs_groups, by = "ASV_pair", all = FALSE)
positive_all_clusters <- merge(positive_all_clusters, positive_starting_classes_pairs_groups, by = "ASV_pair", all = FALSE)
positive_all_clusters <- merge(positive_all_clusters, positive_final_classes_pairs_groups, by = "ASV_pair", all = FALSE)
colnames(positive_all_clusters) <- c("ASV_pair","Same_group_starting", "Same_group_final", "Same_group_startingcl", "Same_group_finalcl") # Assigning column names
positive_all_clusters <- positive_all_clusters[,2:5] # Keeping the Same_group columns of all four networks


In [None]:
library(dplyr)
### Subsetting ASVs that are positively correlated, in the same cluster, and shared between networks
# Starting-Final
starting_final_positive_pairs <- merge(positive_starting_pairs_groups, positive_final_pairs_groups, by = "ASV_pair", all = FALSE)
colnames(starting_final_positive_pairs) <- c("ASV_pair", "Same_group_starting", "Same_group_final") # Assigning column names
starting_final_positive_pairs <- starting_final_positive_pairs %>%
  filter(Same_group_starting == TRUE, Same_group_final == TRUE)
starting_final_positive_pairs <- starting_final_positive_pairs[1]

# StartingCl-FinalCl
startingcl_finalcl_positive_pairs <- merge(positive_starting_classes_pairs_groups, positive_final_classes_pairs_groups, by = "ASV_pair", all = FALSE)
colnames(startingcl_finalcl_positive_pairs) <- c("ASV_pair", "Same_group_startingcl", "Same_group_finalcl") # Assigning column names
startingcl_finalcl_positive_pairs <- startingcl_finalcl_positive_pairs %>%
  filter(Same_group_startingcl == TRUE, Same_group_finalcl == TRUE)
startingcl_finalcl_positive_pairs <- startingcl_finalcl_positive_pairs[1]

# Starting-StartingCl
starting_startingcl_positive_pairs <- merge(positive_starting_pairs_groups, positive_starting_classes_pairs_groups, by = "ASV_pair", all = FALSE)
colnames(starting_startingcl_positive_pairs) <- c("ASV_pair", "Same_group_starting", "Same_group_startingcl") # Assigning column names
starting_startingcl_positive_pairs <- starting_startingcl_positive_pairs %>%
  filter(Same_group_starting == TRUE, Same_group_startingcl == TRUE)
starting_startingcl_positive_pairs <- starting_startingcl_positive_pairs[1]

# Final-FinalCl
final_finalcl_positive_pairs <- merge(positive_final_pairs_groups, positive_final_classes_pairs_groups, by = "ASV_pair", all = FALSE)
colnames(final_finalcl_positive_pairs) <- c("ASV_pair", "Same_group_final", "Same_group_finalcl") # Assigning column names
final_finalcl_positive_pairs <- final_finalcl_positive_pairs %>%
  filter(Same_group_final == TRUE, Same_group_finalcl == TRUE)
final_finalcl_positive_pairs <- final_finalcl_positive_pairs[1]

# Writing these lists to csvs:
write.csv(starting_final_positive_pairs, "../results/starting_final_positive_pairs.csv", row.names = FALSE)
write.csv(startingcl_finalcl_positive_pairs, "../results/startingcl_finalcl_positive_pairs.csv", row.names = FALSE)
write.csv(starting_startingcl_positive_pairs, "../results/starting_startingcl_positive_pairs.csv", row.names = FALSE)
write.csv(final_finalcl_positive_pairs, "../results/final_finalcl_positive_pairs.csv", row.names = FALSE)

In [None]:
## Calculating Cohen's Kappa between starting and final networks for whether pairs are in the same cluster
library(psych) # Loading required package
starting_final_cluster_kappa <- cohen.kappa(starting_final_clusters)
starting_final_cluster_kappa

In [None]:
## Calculating Cohen's Kappa between starting and final networks for whether positively correlated pairs are in the same cluster
library(psych) # Loading required package
positive_starting_final_cluster_kappa <- cohen.kappa(positive_starting_final_clusters)
positive_starting_final_cluster_kappa

In [None]:
## Calculating Cohen's Kappa between starting classes and final classes networks for whether pairs are in the same cluster
library(psych) # Loading required package
startingcl_finalcl_cluster_kappa <- cohen.kappa(startingcl_finalcl_clusters)
startingcl_finalcl_cluster_kappa

In [None]:
## Calculating Cohen's Kappa between starting classes and final classes networks for whether positively correlated pairs are in the same cluster
library(psych) # Loading required package
positive_startingcl_finalcl_cluster_kappa <- cohen.kappa(positive_startingcl_finalcl_clusters)
positive_startingcl_finalcl_cluster_kappa

In [None]:
## Calculating Cohen's Kappa between starting and starting classes networks for whether pairs are in the same cluster
library(psych) # Loading required package
starting_startingcl_cluster_kappa <- cohen.kappa(starting_startingcl_clusters)
starting_startingcl_cluster_kappa

In [None]:
## Calculating Cohen's Kappa between starting and starting classes networks for whether positively correlated pairs are in the same cluster
library(psych) # Loading required package
positive_starting_startingcl_cluster_kappa <- cohen.kappa(positive_starting_startingcl_clusters)
positive_starting_startingcl_cluster_kappa

In [None]:
## Calculating Cohen's Kappa between final and final classes networks for whether pairs are in the same cluster
library(psych) # Loading required package
final_finalcl_cluster_kappa <- cohen.kappa(final_finalcl_clusters)
final_finalcl_cluster_kappa

In [None]:
## Calculating Cohen's Kappa between final and final classes networks for whether positively correlated pairs are in the same cluster
library(psych) # Loading required package
positive_final_finalcl_cluster_kappa <- cohen.kappa(positive_final_finalcl_clusters)
positive_final_finalcl_cluster_kappa

In [None]:
## Calculating Cohen's Kappa between all networks for whether pairs are in the same cluster
library(psych) # Loading required package
all_cluster_kappa <- cohen.kappa(all_clusters)
all_cluster_kappa

In [None]:
## Calculating Cohen's Kappa between all networks for whether positively correlated pairs are in the same cluster
library(psych) # Loading required package
positive_all_cluster_kappa <- cohen.kappa(positive_all_clusters)
positive_all_cluster_kappa

#### Basic network information

In [None]:
##### Number of ASVs in each network #####

# Importing the data - data frame indicating which cluster each ASV in the network belongs to.
starting_cluster_data <- read.table("../results/Partition-NL_Average_StopStep-1003_starting_network_data.tsv", sep = "\t", header = FALSE)
colnames(starting_cluster_data) <- c("ASV", "Cluster") # Assigning column names
starting_classes_cluster_data <- read.table("../results/Partition-NL_Average_StopStep-217_starting_classes_network_data.tsv", sep = "\t", header = FALSE)
colnames(starting_classes_cluster_data) <- c("ASV", "Cluster") # Assigning column names
starting_classes_cluster_data <- starting_classes_cluster_data[grepl("ASV", starting_classes_cluster_data$ASV), ] # Removing non-ASV nodes (partition/classes)
final_cluster_data <- read.table("../results/Partition-NL_Average_StopStep-859_final_network_data.tsv", sep = "\t", header = FALSE)
colnames(final_cluster_data) <- c("ASV", "Cluster") # Assigning column names
final_classes_cluster_data <- read.table("../results/Partition-NL_Average_StopStep-364_final_classes_network_data.tsv", sep = "\t", header = FALSE)
colnames(final_classes_cluster_data) <- c("ASV", "Cluster") # Assigning column names
final_classes_cluster_data <- final_classes_cluster_data[grepl("ASV", final_classes_cluster_data$ASV), ] # Removing non-ASV nodes (partition/classes)

# Printing number of ASVs
print(paste("There are", nrow(starting_cluster_data), "ASVs in the starting network."))
print(paste("There are", nrow(final_cluster_data), "ASVs in the final network."))
print(paste("There are", nrow(starting_classes_cluster_data), "ASVs in the starting classes network."))
print(paste("There are", nrow(final_classes_cluster_data), "ASVs in the final classes network."))

# Shared ASVs
starting_final_shared_asvs <- merge(starting_cluster_data, final_cluster_data, by = "ASV", all = FALSE)
startingcl_finalcl_shared_asvs <- merge(starting_classes_cluster_data, final_classes_cluster_data, by = "ASV", all = FALSE)
starting_startingcl_shared_asvs <- merge(starting_cluster_data, starting_classes_cluster_data, by = "ASV", all = FALSE)
final_finalcl_shared_asvs <- merge(final_cluster_data, final_classes_cluster_data, by = "ASV", all = FALSE)

# Printing number shared ASVs
print(paste("There are", nrow(starting_final_shared_asvs), "ASVs shared between the starting network and the final network."))
print(paste("There are", nrow(startingcl_finalcl_shared_asvs), "ASVs shared between the starting classes network and the final classes network."))
print(paste("There are", nrow(starting_startingcl_shared_asvs), "ASVs shared between the starting network and the starting classes network."))
print(paste("There are", nrow(final_finalcl_shared_asvs), "ASVs shared between the final network and the final classes network."))


### Results

### Comparing ASVs between networks. Written in R.

#### Kappa for comparing ASV presence between networks:
Kappa is zero or negative for every pair of networks, so no pair agrees on which ASVs are present any better than chance. The starting network disagrees with the final network, and the starting classes network disagrees with the final classes network.
It makes sense that the starting network differs from the starting classes network (and likewise for the final networks), because there are far fewer ASVs in the classes networks. This might suggest that many ASVs are only linked into the network because they share the same environment, and that these ASVs drop out once classes are taken into account.
The differences between the starting and final networks are more interesting. The negative Kappa suggests that different ASVs are present in each, even though the starting and final networks contain a similar number of ASVs (1284 and 1207).
However, comparing the total number of ASVs in each network with the number that they share shows that the starting and final networks actually share most of their ASVs (1023). The starting classes and final classes networks, on the other hand, share a much lower proportion of their ASVs (163 of 304 and 506).
Overall, the results from the comparison of ASV presence make sense. The starting and final networks have mostly the same ASVs. The networks that incorporate classes have far fewer ASVs than those that do not, suggesting that the environmental conditions associated with classes explained the correlation of many ASVs. Most interestingly, the proportion of ASVs shared between the starting classes and final classes networks is much lower.
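
The zero and negative Kappa values can be checked by hand. The sketch below is written in Python and is not part of the pipeline: it recomputes Cohen's Kappa from the ASV counts reported in the tables of this section. It assumes that presence/absence was scored over the union of the ASVs found in the two networks being compared, so that no ASV is absent from both. This is an assumption about how the presence table was built, and `kappa_from_counts` is a hypothetical helper name.

```python
def kappa_from_counts(n_a, n_b, n_shared):
    """Cohen's kappa for ASV presence/absence in two networks.

    Assumption: the ASVs scored are the union of both networks, so the
    'absent in both' cell of the 2x2 agreement table is zero.
    """
    only_a = n_a - n_shared
    only_b = n_b - n_shared
    total = n_shared + only_a + only_b      # size of the union of ASVs
    p_observed = n_shared / total           # agreement = present in both
    p_a = n_a / total                       # proportion present in network A
    p_b = n_b / total                       # proportion present in network B
    p_expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if p_expected == 1:                     # identical networks: kappa undefined
        return float("nan")
    return (p_observed - p_expected) / (1 - p_expected)


# (ASVs in first network, ASVs in second network, ASVs shared)
network_pairs = {
    "Starting-Final": (1284, 1207, 1023),
    "StartingCl-FinalCl": (304, 506, 163),
    "Starting-StartingCl": (1284, 304, 304),
    "Final-FinalCl": (1207, 506, 506),
}

for name, counts in network_pairs.items():
    print(name, round(kappa_from_counts(*counts), 2))
```

Under this assumption the values round to -0.17, -0.45, 0 and 0, which matches the reported Kappa values. It also shows why Kappa is exactly 0 when one network is a subset of the other: every ASV is then "present" in the larger network, so observed agreement equals chance agreement. Kappa computed this way is therefore pushed towards zero or below even when most ASVs are shared, and may be better read alongside the shared-ASV counts than on its own.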

In [None]:
library(IRdisplay)
asv_kappa_df <- data.frame(
  Network_Pair = c("Starting-Final", "StartingCl-FinalCl", "Starting-StartingCl", "Final-FinalCl"),
  K = c("-0.17", "-0.45", "0", "0"),
  CI = c("-0.19 ≤ K ≤ -0.16", "-0.51 ≤ K ≤ -0.39", "0 ≤ K ≤ 0", "0 ≤ K ≤ 0")
)
display(asv_kappa_df)

#### Number of ASVs in each network:
The number of ASVs in the final network (1207) was slightly lower than in the starting network (1284), suggesting that some ASVs were unable to adapt to the tea. Strangely, there are more ASVs in the final classes network (506) than in the starting classes network (304). This may be an artefact of the starting classes network having more classes (5) than the final classes network (2). New ASVs were certainly not being introduced.

In [None]:
library(IRdisplay)
asv_numbers <- data.frame(
  Network = c("Starting", "Final", "StartingCl", "FinalCl"),
  Number = c("1284", "1207", "304", "506")
)
display(asv_numbers)

#### Number of ASVs shared between networks:

In [None]:
library(IRdisplay)
asv_shared <- data.frame(
  Network_Pair = c("Starting-Final", "StartingCl-FinalCl", "Starting-StartingCl", "Final-FinalCl"),
  Shared = c("1023", "163", "304", "506"))

display(asv_shared)

### Comparing links between networks

#### Kappa for comparing signs of links between networks.
The starting and final networks have strongly similar types of links (K = 0.71), and so do the starting classes and final classes networks (K = 1). This makes sense, because ASVs that are linked in a certain way in the starting communities seem unlikely to change the way in which they are linked.
The final and final classes networks also have strongly similar types of links (K = 0.93).
The starting and starting classes networks have different types of links (K = -0.015). It is not clear why. Perhaps the shared environment produced many more positive links than remain once it is taken into account via classes.

In [None]:
library(IRdisplay)
correlation_kappa_df <- data.frame(
  Network_Pair = c("Starting-Final", "StartingCl-FinalCl", "Starting-StartingCl", "Final-FinalCl"),
  K = c("0.71", "1", "-0.015", "0.93"),
  CI = c("0.43 ≤ K ≤ 0.98", "1 ≤ K ≤ 1", "-0.036 ≤ K ≤ 0.0059", "0.86 ≤ K ≤ 1")
)
display(correlation_kappa_df)

#### Number of links in each network
There are mostly positive links in each network. This could suggest positive interactions such as cross-feeding. There are more positive links in the final networks, perhaps suggesting an increased level of cooperativity. However, there are also more negative links in the final networks.
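
Because the networks differ a lot in size, the raw counts are easier to compare as proportions. The short Python sketch below (not part of the pipeline) computes the fraction of negative links in each network from the counts in the table of this subsection. The dictionary layout and variable names are illustrative only.

```python
# Positive and negative link counts for each network, taken from the table
links = {
    "Starting": {"positive": 7272, "negative": 186},
    "Final": {"positive": 8430, "negative": 514},
    "StartingCl": {"positive": 489, "negative": 102},
    "FinalCl": {"positive": 1193, "negative": 284},
}

# Fraction of all links in a network that are negative
negative_fraction = {
    name: counts["negative"] / (counts["positive"] + counts["negative"])
    for name, counts in links.items()
}

for name, fraction in negative_fraction.items():
    print(f"{name}: {fraction:.1%} of links are negative")
```

Expressed this way, negative links make up about 2.5% and 5.7% of the starting and final networks, but about 17.3% and 19.2% of the starting classes and final classes networks. The share of negative links is therefore higher in the final networks than in the starting networks, and much higher once classes are included.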

In [None]:
library(IRdisplay)
num_links <- data.frame(
  Network = c("Starting", "Final", "StartingCl", "FinalCl"),
  TotalLinks = c("7458", "8944", "591", "1477"),
  PositiveLinks = c("7272", "8430", "489", "1193"),
  NegativeLinks = c("186", "514", "102", "284"))

display(num_links)

#### Number of links shared between networks
There are comparatively few links shared between networks.
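
One way to put a number on "comparatively few" is the Jaccard index: the number of shared links divided by the number of distinct links across both networks. The Python sketch below (not part of the pipeline) applies it to the total link counts reported in this section. `jaccard` is a hypothetical helper name, and using the Jaccard index here is a suggestion rather than something the pipeline already does.

```python
def jaccard(n_a, n_b, n_shared):
    """Jaccard index from set sizes: |A and B| / |A or B|."""
    return n_shared / (n_a + n_b - n_shared)


# Total links in each network, and total links shared by each pair (from the tables)
total_links = {"Starting": 7458, "Final": 8944, "StartingCl": 591, "FinalCl": 1477}
shared_total = {
    ("Starting", "Final"): 278,
    ("StartingCl", "FinalCl"): 50,
    ("Starting", "StartingCl"): 89,
    ("Final", "FinalCl"): 382,
}

link_overlap = {
    pair: jaccard(total_links[pair[0]], total_links[pair[1]], shared)
    for pair, shared in shared_total.items()
}

for (net_a, net_b), overlap in link_overlap.items():
    print(f"{net_a}-{net_b}: Jaccard index = {overlap:.3f}")
```

The index is below 0.04 for every pair of networks, so fewer than 4% of the distinct links in any two networks are found in both.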

In [None]:
library(IRdisplay)
shared_links <- data.frame(
  Network_pair = c("Starting-Final", "StartingCl-FinalCl", "Starting-StartingCl", "Final-FinalCl"),
  TotalLinksShared = c("278", "50", "89", "382"),
  PositiveLinksShared = c("269", "47", "86", "356"),
  NegativeLinksShared = c("5", "3", "0", "23"))

display(shared_links)

### Comparing clusters between networks

#### Kappa for comparing whether ASV pairs are in the same cluster between networks.
Agreement on whether ASV pairs are in the same cluster is low for every pair of networks (K ≤ 0.13). This suggests that different pairs of ASVs are clustered together in different networks.
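
To make the comparison concrete, the Python sketch below (not part of the pipeline) builds the kind of table that such a Kappa is calculated from: for each pair of ASVs, whether the two ASVs share a cluster in the first network and whether they share a cluster in the second. It assumes that every pair of ASVs present in both networks is scored, which may differ from the pairs actually used in the pipeline. The toy partitions and the name `same_cluster_table` are made up for illustration.

```python
from itertools import combinations


def same_cluster_table(partition_a, partition_b):
    """For each pair of ASVs found in both partitions, record whether the
    pair is in the same cluster in partition A and in partition B.

    Each partition is a dict mapping ASV name -> cluster label.
    """
    shared_asvs = sorted(set(partition_a) & set(partition_b))
    rows = []
    for asv_x, asv_y in combinations(shared_asvs, 2):
        together_a = partition_a[asv_x] == partition_a[asv_y]
        together_b = partition_b[asv_x] == partition_b[asv_y]
        rows.append((asv_x, asv_y, together_a, together_b))
    return rows


# Toy partitions (ASV -> cluster), not real data
starting = {"ASV_1": 1, "ASV_2": 1, "ASV_3": 2, "ASV_4": 2}
final = {"ASV_1": 1, "ASV_2": 2, "ASV_3": 2, "ASV_4": 2, "ASV_5": 3}

table = same_cluster_table(starting, final)
for row in table:
    print(row)
```

The two Boolean columns play the role of the two "raters" that are passed to `cohen.kappa`. In large networks most pairs are in different clusters in both networks, so a low Kappa reflects disagreement specifically about the pairs that are clustered together.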

In [None]:
library(IRdisplay)
cluster_kappa_df <- data.frame(
  Network_Pair = c("Starting-Final", "StartingCl-FinalCl", "Starting-StartingCl", "Final-FinalCl"),
  K = c("0.029", "0.044", "0.13", "0.057"),
  CI = c("0.021 ≤ K ≤ 0.037", "0.0069 ≤ K ≤ 0.081", "0.1 ≤ K ≤ 0.15", "0.042 ≤ K ≤ 0.072")
)
display(cluster_kappa_df)

#### Kappa for comparing whether positively correlated ASV pairs are in the same cluster between networks.
Agreement is still low overall when considering only whether positively correlated ASV pairs are found in the same cluster between networks, although it is highest for Starting-StartingCl (K = 0.41). This is surprising. If a pair of ASVs is positively correlated in two networks, you would expect the pair to be in the same cluster in both networks.

In [None]:
library(IRdisplay)
positive_cluster_kappa_df <- data.frame(
  Network_Pair = c("Starting-Final", "StartingCl-FinalCl", "Starting-StartingCl", "Final-FinalCl"),
  K = c("0.13", "0.27", "0.41", "-0.058"),
  CI = c("0.0033 ≤ K ≤ 0.25", "-0.017 ≤ K ≤ 0.56", "0.21 ≤ K ≤ 0.61", "-0.15 ≤ K ≤ 0.038")
)
display(positive_cluster_kappa_df)


#### Number of Clusters in each network:
The final networks have fewer clusters than the starting networks. Perhaps this suggests that the community became more interconnected. The number of ASVs decreased by 77 from the starting network to the final network, so perhaps many two-ASV clusters were lost. Even so, this would not explain the extent of the decrease in the number of clusters. The number of clusters also decreased in the final classes network compared with the starting classes network, despite the likely artefact whereby the number of ASVs appeared to increase. It is not clear why.
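
Whether many small clusters are involved could be checked by tabulating cluster sizes in each partition. The Python sketch below (not part of the pipeline) shows one way to do this for a partition given as an ASV-to-cluster mapping, such as the two-column partition files read in above. The toy data and the name `cluster_size_distribution` are hypothetical.

```python
from collections import Counter


def cluster_size_distribution(partition):
    """Return a Counter mapping cluster size -> number of clusters of that size.

    `partition` is a dict mapping ASV name -> cluster label.
    """
    asvs_per_cluster = Counter(partition.values())   # cluster label -> number of ASVs
    return Counter(asvs_per_cluster.values())        # cluster size -> number of clusters


# Toy partition: one pair, one singleton, one cluster of three
toy_partition = {
    "ASV_1": "c1", "ASV_2": "c1",
    "ASV_3": "c2",
    "ASV_4": "c3", "ASV_5": "c3", "ASV_6": "c3",
}

size_distribution = cluster_size_distribution(toy_partition)
for size in sorted(size_distribution):
    print(f"{size_distribution[size]} cluster(s) of size {size}")
```

Comparing these distributions between the starting and final partitions would show whether the drop in cluster number comes from fewer singletons and pairs, or from larger clusters merging.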

In [None]:
library(IRdisplay)
num_clusters <- data.frame(
  Network = c("Starting", "Final", "StartingCl", "FinalCl"),
  Clusters = c("281", "93", "453", "143"))

display(num_clusters)

#### Partition density
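Maximum partition densities from across the whole clustering, run without a stop step.
For every network, the total partition density was the highest of the three, and the internal partition density was higher than the external partition density.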

In [None]:
library(IRdisplay)
partition_densities <- data.frame(
  Network = c("Starting", "Final", "StartingCl", "FinalCl"),
  InternalPartition = c("0.0801", "0.0945", "0.1627", "0.1072"),
  TotalPartition = c("0.1139", "0.1291", "0.2209", "0.1337"),
  ExternalPartition = c("0.0471", "0.0402", "0.1248", "0.0471"))

display(partition_densities)

### ASV pairs that are positively correlated, in the same cluster, and shared between networks:

In [None]:
# Starting-Final
starting_final_positive_pairs <- read.csv("../results/starting_final_positive_pairs.csv")
starting_final_positive_pairs

In [None]:
# StartingCl-FinalCl
startingcl_finalcl_positive_pairs <- read.csv("../results/startingcl_finalcl_positive_pairs.csv")
startingcl_finalcl_positive_pairs

In [None]:
# Starting-StartingCl
starting_startingcl_positive_pairs <- read.csv("../results/starting_startingcl_positive_pairs.csv")
starting_startingcl_positive_pairs

In [None]:
# Final-FinalCl
final_finalcl_positive_pairs <- read.csv("../results/final_finalcl_positive_pairs.csv")
final_finalcl_positive_pairs

Pipeline Explanation
-----

### Results
Network comparison:
- Maximum internal partition density for each network (from functionInk), as well as maximum total and maximum external
- Kappa comparison of ASV presence between networks
- Kappa comparison of sign of correlation of ASV pairs between networks
- Kappa comparison of whether pairs of ASVs are found in the same cluster between networks
- Table of ASV pairs that are found within all networks, that have the same type of correlation in all networks, and that are in the same cluster in all networks
- Comparison of betweenness centrality between networks
- Comparison of number of clusters between networks
- Comparison of number of edges between networks
- Cytoscape visualisation of networks
Identify key motifs/ ASVs

### Data set
The data set includes samples from 275 tree holes. There is 1 initial sample from each tree hole (275 total), as well as 4 replicate samples from each tree hole taken after about a week (1100 total). The data is in the form of an ASV table, which lists the counts of each ASV in each sample. Each sample represents a community, and the samples have been grouped into community classes, within both the starting samples and each of the replicate sets of the final samples, based upon beta diversity dissimilarity.

### What is FlashWeave?
FlashWeave is a method for inferring co-occurrences between microbial ASVs, and is often considered to be the gold standard for this task. It has been reported to show improved accuracy and performance on synthetic data when compared with other commonly used methods such as SparCC and SPIEC-EASI.

### How does FlashWeave work?
For each ASV, FlashWeave identifies its directly associated neighbouring ASVs.

### What is functionInk?

### How does functionInk work?

### Different ways of comparing co-occurrence networks











## Visualising co-occurrence networks
Creating visualisations of co-occurrence networks in Cytoscape. There are visualisations of:
- Starting classes network (w/ ASV interaction pairs shared with final classes network in diff colour)
- Final classes network (w/ ASV interaction pairs shared with starting classes network in diff colour)
- Shared ASVs only
- All ASVs across both networks, with interactions from both networks

To visualise a co-occurrence network, you need:
1. The FlashWeave output (in the functionInk format)
2. The functionInk partition output file for the stop step at which the maximum internal partition density is reached

Here, I:
- Take the starting classes modified FlashWeave output file (functionInk input file) and add a type column for positive or negative. I then use this and the partition output file to visualise a network, and use a different colour for ASV pairs shared with the final classes network. I do not need to run functionInk again on the modified FlashWeave output file that has the type column added, because these types correspond to positive and negative, so running functionInk on it should give the same partition output file.
- Do the same for the final classes network
- Combine the starting classes network and final classes network functionInk input files, giving each interaction a type that encodes its sign and its network of origin (starting only, final only, or both). This gives 9 types, plus a tenth for pairs whose type differs between the two networks. Run functionInk on this new input file to get a new partition file (because it is not clear how to use the starting classes and final classes partition files together). Construct the network, and then see whether the functionInk output table is helpful
- Construct network of only ASVs shared between the two networks
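
The combined type coding described in the third bullet can be summarised compactly. The Python sketch below is an illustration only (the pipeline itself does this step in R): it builds an order-independent key for a pair of nodes and assigns the combined type from the per-network types, where 1 = positive, 2 = no link, 3 = negative, and `None` means the pair is absent from that network. It assumes node names of the form `ASV_<number>` or `partition_Class<number>`, and the function names `pair_key` and `combined_type` are hypothetical.

```python
import re


def pair_key(node_a, node_b):
    """Order-independent key for a pair of nodes.

    Partition nodes sort before ASVs, and nodes of the same kind are
    ordered by their numeric suffix.
    """
    def sort_key(node):
        is_asv = node.startswith("ASV_")
        number = int(re.search(r"(\d+)$", node).group(1))
        return (is_asv, number)

    first, second = sorted((node_a, node_b), key=sort_key)
    return f"{first}-{second}"


def combined_type(start_type, final_type):
    """Combine per-network link types (1, 2 or 3; None if the pair is absent).

    1-3: same type in both networks; 4-6: starting only;
    7-9: final only; 10: type differs between the networks.
    """
    if start_type is None and final_type is None:
        raise ValueError("pair must be present in at least one network")
    if final_type is None:
        return start_type + 3
    if start_type is None:
        return final_type + 6
    if start_type == final_type:
        return start_type
    return 10


print(pair_key("ASV_12", "ASV_7"))            # same key whichever way round the pair is listed
print(pair_key("ASV_3", "partition_Class2"))  # partition node is placed first
print(combined_type(1, None), combined_type(None, 3), combined_type(1, 3))
```

Keying each pair in a fixed order matters because the same link may be listed as A-B in one file and B-A in the other. Without a consistent key, the two rows would not be matched when the files are merged.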

Code below written in R.

In [28]:
# Importing FlashWeave outputs that are in functionInk input format
starting_classes_network_data <- read.delim('../data/starting_classes_network_data.tsv', header = TRUE)
final_classes_network_data <- read.delim('../data/final_classes_network_data.tsv', header = TRUE)

# Fixing colnames
colnames(starting_classes_network_data) <- c("#ASV_A", "ASV_B", "Interaction")
colnames(final_classes_network_data) <- c("#ASV_A", "ASV_B", "Interaction")

# Adding type columns
starting_classes_network_data$Type <- rep(NA, nrow(starting_classes_network_data))
final_classes_network_data$Type <- rep(NA, nrow(final_classes_network_data))

# Making it so that the type columns are different for positive (1), no link (2), or negative (3)
# Vectorised with ifelse() rather than looping over rows
starting_classes_network_data$Type <- ifelse(starting_classes_network_data$Interaction > 0, 1,
                                      ifelse(starting_classes_network_data$Interaction == 0, 2, 3))
final_classes_network_data$Type <- ifelse(final_classes_network_data$Interaction > 0, 1,
                                   ifelse(final_classes_network_data$Interaction == 0, 2, 3))

# Getting rid of interaction columns
starting_classes_network_data <- starting_classes_network_data[, -which(colnames(starting_classes_network_data) == "Interaction")]
final_classes_network_data <- final_classes_network_data[, -which(colnames(final_classes_network_data) == "Interaction")]

### Merging
## First, sort ASV A and ASV B to make sure they are in the same order in both data frames
## Then concatenate them so that they are a single column
## Merge both data frames by this column
## Get rid of concatenated column
# Column for sorted, concatenated ASVs
starting_classes_network_data$pair <- rep(NA, nrow(starting_classes_network_data))
final_classes_network_data$pair <- rep(NA, nrow(final_classes_network_data))

# Column for if ASV_A or ASV B is an ASV or class
starting_classes_network_data$ASV_A_is_ASV <- grepl("ASV_", starting_classes_network_data[,1])
final_classes_network_data$ASV_A_is_ASV <- grepl("ASV_", final_classes_network_data[,1])
starting_classes_network_data$ASV_B_is_ASV <- grepl("ASV_", starting_classes_network_data[,2])
final_classes_network_data$ASV_B_is_ASV <- grepl("ASV_", final_classes_network_data[,2])

# Filling the column
for (j in 1:nrow(starting_classes_network_data)){ 
    if (starting_classes_network_data[j, 5] == TRUE & starting_classes_network_data[j, 6] == TRUE){ # If ASV_A and ASV_B are ASVs and not partitions
        asv1_numeric <- as.numeric(gsub("ASV_", "", starting_classes_network_data[j,1]))
        asv2_numeric <- as.numeric(gsub("ASV_", "", starting_classes_network_data[j,2]))
        sorted_asvs <- sort(c(asv1_numeric, asv2_numeric))
        starting_classes_network_data[j,4] <- paste("ASV_", sorted_asvs[1], "-", "ASV_", sorted_asvs[2], sep = "")
        }
    else if (starting_classes_network_data[j, 5] != starting_classes_network_data[j, 6]){ # If one node is an ASV and the other a partition, partition always first
        if (starting_classes_network_data[j, 5] == TRUE){ # The ASV is listed first, so swap the two nodes
            a <- starting_classes_network_data[j, 1]
            b <- starting_classes_network_data[j, 2]
            starting_classes_network_data[j, 1] <- b
            starting_classes_network_data[j, 2] <- a
            }
        # Pair is filled in whether or not a swap was needed, so that it is never left as NA
        starting_classes_network_data[j,4] <- paste(starting_classes_network_data[j, 1], "-", starting_classes_network_data[j, 2], sep = "")
        }
    else if (starting_classes_network_data[j, 5] == FALSE & starting_classes_network_data[j, 6] == FALSE){ # If both A and B are partitions
        part1_numeric <- as.numeric(gsub("partition_Class", "", starting_classes_network_data[j,1]))
        part2_numeric <- as.numeric(gsub("partition_Class", "", starting_classes_network_data[j,2]))
        sorted_asvs <- sort(c(part1_numeric, part2_numeric))
        starting_classes_network_data[j,4] <- paste("partition_Class", sorted_asvs[1], "-", "partition_Class", sorted_asvs[2], sep = "")
        }
    }

for (j in 1:nrow(final_classes_network_data)){ 
    if (final_classes_network_data[j, 5] == TRUE & final_classes_network_data[j, 6] == TRUE){ # If ASV_A and ASV_B are ASVs and not partitions
        asv1_numeric <- as.numeric(gsub("ASV_", "", final_classes_network_data[j,1]))
        asv2_numeric <- as.numeric(gsub("ASV_", "", final_classes_network_data[j,2]))
        sorted_asvs <- sort(c(asv1_numeric, asv2_numeric))
        final_classes_network_data[j,4] <- paste("ASV_", sorted_asvs[1], "-", "ASV_", sorted_asvs[2], sep = "")
        }
    else if (final_classes_network_data[j, 5] != final_classes_network_data[j, 6]){ # If one node is an ASV and the other a partition, partition always first
        if (final_classes_network_data[j, 5] == TRUE){ # The ASV is listed first, so swap the two nodes
            a <- final_classes_network_data[j, 1]
            b <- final_classes_network_data[j, 2]
            final_classes_network_data[j, 1] <- b
            final_classes_network_data[j, 2] <- a
            }
        # Pair is filled in whether or not a swap was needed, so that it is never left as NA
        final_classes_network_data[j,4] <- paste(final_classes_network_data[j, 1], "-", final_classes_network_data[j, 2], sep = "")
        }
    else if (final_classes_network_data[j, 5] == FALSE & final_classes_network_data[j, 6] == FALSE){ # If both A and B are partitions
        part1_numeric <- as.numeric(gsub("partition_Class", "", final_classes_network_data[j,1]))
        part2_numeric <- as.numeric(gsub("partition_Class", "", final_classes_network_data[j,2]))
        sorted_asvs <- sort(c(part1_numeric, part2_numeric))
        final_classes_network_data[j,4] <- paste("partition_Class", sorted_asvs[1], "-", "partition_Class", sorted_asvs[2], sep = "")
        }
    }

# Merging by the column
network_data <- merge(starting_classes_network_data, final_classes_network_data, by = "pair", all = TRUE)

# Removing pair column
network_data <- network_data[,2:11]

## Filling in node names for pairs found in only one network, then getting rid of excess columns
for (m in 1:nrow(network_data)){
    if (is.na(network_data[m, 1])){
        network_data[m, 1] <- network_data[m, 6]
        network_data[m, 2] <- network_data[m, 7]
        }
    if (is.na(network_data[m, 6])){
        network_data[m, 6] <- network_data[m, 1]
        network_data[m, 7] <- network_data[m, 2]
        }    
    }
network_data <- network_data[, c(1,2,3,8)]


# Go through each network's type column
# If the type columns are the same, nothing happens
# If the type columns are different, change them to a new value that indicates discrepancy
# If only in one network, give different type
# 1 = positive, both networks ; 2 = no link, both networks ; 3 = negative, both networks
# 4 = positive, starting only ; 5 = no link, starting only ; 6 = negative, starting only
# 7 = positive, final only ; 8 = no link, final only ; 9 = negative, final only
# 10 = discrepancy of type between networks
for (i in 1:nrow(network_data)){
    
    if (is.na(network_data[i, 3]) & network_data[i,4] == 1){ # Positive only in final
        network_data[i, 3] <- 7
        network_data[i, 4] <- 7
        } 
    
    else if (is.na(network_data[i, 3]) & network_data[i,4] == 2){ # No link only in final
        network_data[i, 3] <- 8
        network_data[i, 4] <- 8
        } 

    else if (is.na(network_data[i, 3]) & network_data[i,4] == 3){ # Negative only in final
        network_data[i, 3] <- 9
        network_data[i, 4] <- 9
        } 

    else if (is.na(network_data[i, 4]) & network_data[i,3] == 1){ # Positive only in starting
        network_data[i, 3] <- 4
        network_data[i, 4] <- 4
        } 

    else if (is.na(network_data[i, 4]) & network_data[i,3] == 2){ # No link only in starting
        network_data[i, 3] <- 5
        network_data[i, 4] <- 5
        } 

    else if (is.na(network_data[i, 4]) & network_data[i,3] == 3){ # Negative only in starting
        network_data[i, 3] <- 6
        network_data[i, 4] <- 6
        } 

    else if (!is.na(network_data[i, 4]) & !is.na(network_data[i,3]) & network_data[i,3] != network_data[i,4]){ # Different
        network_data[i, 3] <- 10
        network_data[i, 4] <- 10
        }    
}

# Getting rid of one of the now identical type columns
network_data <- network_data[1:3]
# Fixing column names
colnames(network_data) <- c("#ASV_A", "ASV_B", "Type")
# Exporting data frame as .tsv file
write.table(network_data, file = '../data/network_types_data.tsv', sep = "\t", row.names = FALSE)


Running functionInk upon this new input file. Below cell is in Python.

In [1]:
##### Applying functionInk #####

import os # Importing os package
os.chdir('../code/functionInk') # Moving from the directory in which this notebook is found into the root of the functionInk repository

# The first step to the pipeline - computing similarities between nodes
!./NodeSimilarity.pl -w 0 -d 0 -t 1 -f ../../data/network_types_data.tsv

# The second step - clustering nodes using the similarity metrics calculated
!./NodeLinkage.pl -fn ../../data/network_types_data.tsv -fs Nodes-Similarities_network_types_data.tsv

  
**************************************************  
* Building nodes similarities NodeSimilarity.pl  *  
**************************************************  
  
>> Reading input arguments: 
~~~ The network is weighted=1 or 2/unweighted=0? Value = 0
~~~ The network is directed=1/undirected=0? Value = 0
~~~ The network has different types=1/or a single type=0? Value = 1
~~~ The network file is = ../../data/network_types_data.tsv
 
>> Processing input arguments: 
~~~ Reading Node A from column 1
~~~ Reading Node B from column 2
~~~ Working with an unweighted network -- Jaccard Similarity
~~~ Working with a network with different types of links
~~~ Reading types from column 3
  
~~~ Input path:  ../../data/network_types_data.tsv
~~~ Input file:  network_types_data.tsv
  
 
>> Reading the network: 
~~~ The first lines for the fields read from file and after conversions are: 
~~~ Reading fields: nodeA = "#ASV_A",nodeB= "ASV_B", type= "Type", weight = 1, 
~~~ Reading fields: nodeA = "ASV_

PLEASE SWITCH TO R KERNEL FOR CODE BELOW.

In [1]:
##### Sourcing function that extracts partition densities #####
library(ggplot2) # Loading ggplot
source("functionInk/scripts/analysis_R/extractPartDensity.R") # Sourcing the function that extracts the partition densities
setwd("functionInk") # Moving to the functionInk repository

In [2]:
##### Extracting partition densities #####
# Importing the partition histories and cleaning them
hist_comp_starting=read.table(file="HistCompact-NL_Average_NoStop_network_types_data.tsv")
colnames(hist_comp_starting) <- as.character(unlist(hist_comp_starting[1, ]))
hist_comp_starting <- hist_comp_starting[-1, ]
columns_to_convert <- c("Step", "Similarity", "Density", "DensityInt", "DensityExt", "NumNodesA", "NumEdgesA", 
                       "NumNodesB", "NumEdgesB", "NumNodesAB", "NumEdgesAB", "NumIntNodesA", "NumIntNodesB",
                       "NumExtNodesA", "NumExtNodesB", "NumIntNodesAB", "NumExtNodesAB", "NumIntEdgesA",
                       "NumIntEdgesB", "NumExtEdgesA", "NumExtEdgesB", "NumIntEdgesAB", "NumExtEdgesAB",
                       "NcumInt", "NcumExt", "Ncum")
hist_comp_starting[columns_to_convert] <- lapply(hist_comp_starting[columns_to_convert], as.numeric)

# Calculating partition densities, plotting them, and moving the plot into results
print("Network partition densities:")
part_density_starting=extractPartDensity(hist.comp=hist_comp_starting, plot = TRUE)
system(paste("mv", "figures/Plot_PartitionDensityVsStep.pdf", "../../results/network_Plot_PartitionDensityVsStep.pdf"))
print("Step of the clustering in which the maximum of the total partition density was found: ")
part_density_starting$total_dens_step
print("Step of the clustering in which the maximum of the internal partition density was found ")
part_density_starting$int_dens_step
print("Step of the clustering in which the maximum of the external partition density was found: ")
part_density_starting$ext_dens_step

# max total at 394
# max internal at 509
# max external at 365

[1] "Network partition densities:"
[1] "-- The maximum value of the total partition density is 0.1388 found at step = 394"
[1] "-- The maximum value of the internal partition density is 0.0669 found at step = 509"
[1] "-- The maximum value of the external partition density is 0.0929 found at step = 365"
[1] "Step of the clustering in which the maximum of the total partition density was found: "


[1] "Step of the clustering in which the maximum of the internal partition density was found "


[1] "Step of the clustering in which the maximum of the external partition density was found: "


PLEASE SWITCH TO PYTHON KERNEL FOR CODE BELOW.

In [1]:
import os # Importing os again, as switched back to Python kernel
os.chdir('functionInk') # Moving from the directory in which this notebook is found into the root of the functionInk repository

In [2]:
##### Running the clustering until the step at which the maximum total partition density is reached #####
!./NodeLinkage.pl -fn ../../data/network_types_data.tsv -fs Nodes-Similarities_network_types_data.tsv -s step -v 394
##### Running the clustering until the step at which the maximum internal partition density is reached #####
!./NodeLinkage.pl -fn ../../data/network_types_data.tsv -fs Nodes-Similarities_network_types_data.tsv -s step -v 509
##### Running the clustering until the step at which the maximum external partition density is reached #####
!./NodeLinkage.pl -fn ../../data/network_types_data.tsv -fs Nodes-Similarities_network_types_data.tsv -s step -v 365


>> Reading input arguments: 
~~~ The network file is = ../../data/network_types_data.tsv
~~~ The similarity file is = Nodes-Similarities_network_types_data.tsv
~~~ The clustering will run until step
~~~ value as stopping criteria = 394
~~~ Reading the similarity from column  = 3
~~~ Clustering with = Average
  
***********************************************  
* Finding communities with nodes clustering   *  
***********************************************  
  
~~ FIELDS for the FIRST FILE
~~~ Reading Node A from column 1
~~~ Reading Node B from column 2
~~~ Reading Similarities from column 3
~~ FIELDS for the SECOND FILE
~~~ Reading Node A from column 1
~~~ Reading Node B from column 2
~~~ Reading Similarities from column 3
~~ CLUSTERING parameters: 
~~~ Performing a clustering with Average Linkage method
~~~ Clustering and recovering classification at step: 394
  
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  
~~ Opening the first input file: 
~~~ Input path:  Nodes-Similarities_network_types