A pipeline for generating predictions for the selective targeting of microbial groups based on one the processing of assembled and annotated metagenomics. A detailed description of this workflow is provided at Ofaim S, Ofek-Lalzar M, Sela N, Jinag J, Kashi Y, Minz D, Freilich S: Analysis of Microbial Functions in the Rhizosphere Using a Metabolic-Network Based Framework for Metagenomics Interpretation. Frontiers in microbiology 2017, 8:1606.https://www.frontiersin.org/articles/10.3389/fmicb.2017.01606/full.
All datasets mentioned in text are available in https://volcanicenter-my.sharepoint.com/:f:/g/personal/shmedina_volcani_agri_gov_il/EicqyHpmSmBJmwgHO0oqJ8MBcxNtvXw0KUs6hYyyQ4wvzA?e=khRLmH.
Install R packages:
- library("tidyverse")
- library("edgeR")
Pip install:
- [python (version >= 3.8)]
- [matplotlib (version 3.3.3)]
- [networkx (version 2.5)]
- [pandas (version 1.1.4)]
- [skbio (version 0.5.6)]
- [numpy (version 1.19.4)]
- [ete3 (version 3.1.2)]
For Step 9 add:
- [plotly 4.14.1]
- [dash 1.18.1]
- [dash_core_components (1.14.1)]
- [dash_html_components (1.1.1)]
- [dash_bootstrap_components (0.11.1)]
- [dash_extensions (0.0.41)]
- [scipy (1.5.0)]
- [statsmodels (0.11.1)]
# Create venv #
mkdir netcom2
cd netcom2
virtualenv netcom2
source netcom2/bin/activate
mkdir netcom2
cd netcom2
python -m venv C:\path\netcom2
C:\path\netcom2\Scripts\Activate
# You may need to install svn and pip3 #
pip install svn
svn export https://github.com/ot483/NetCom2/trunk/Scripts
curl -L https://raw.githubusercontent.com/ot483/NetCom2/main/requirements.txt > requirements.txt
pip install -r requirements.txt
NetCom2 makes use of the input files listed below which are the outcome of the assembly and annotation of metagenomics data using well established standard procedures. Example input files are provided in the folder Step 7 in the drive (https://volcanicenter-my.sharepoint.com/:f:/g/personal/shmedina_volcani_agri_gov_il/EicqyHpmSmBJmwgHO0oqJ8MBcxNtvXw0KUs6hYyyQ4wvzA?e=khRLmH.). Full details for the preparation of the input files are provided in Ofaim et al 2017 https://www.frontiersin.org/articles/10.3389/fmicb.2017.01606/full.
- Taxonomic annotations of the genes. This is how the data should look like for running the code:
k127_1346738_14 "root;cellular organisms;Bacteria;"
k127_12605820_2 "root;cellular organisms;Bacteria;"
k127_12605820_11 "root;cellular organisms;Bacteria;"
k127_12605820_14 "root;cellular organisms;Bacteria;"
k127_12605820_19 "root;cellular organisms;Bacteria;"
...
Input files can be prepared and download from MEGAN - Metagenome Analyzer (Huson et al 2007). Input file can be exported by: i. From the menu, choose Select -> All Nodes. ii. From the menu, choose File -> Export -> CSV Format. iii. To choose all data go to -> In "Choose format", choose "readName_to_taxonPath".
- Functional annotations of the genes. This is how the data should look like for running the code:
k127_13407954_1 "Not assigned"
k127_13407954_2 "Not assigned"
k127_13407954_3 "Not assigned"
k127_13407954_5 "Not assigned"
k127_7684156_1 "Not assigned"
k127_7684156_3 "Not assigned"
k127_6559801_1 "Not assigned"
...
Input files can be prepared and download from MEGAN - Metagenome Analyzer (Huson et al 2007). Input file can be exported by: i. From the menu, choose Select -> All Nodes. ii. From the menu, choose File -> Export -> CSV Format. iii. To choose all data go to -> In "Choose format", choose "readName_to_tKeggName".
- Count table indicating the abundance of contigs across samples. This is how the data should loook like for running the code:
ID NTC_G210_1 NTC_G210_2 NTC_G210_3 NTC_G210_4 NTC_G210_5
k127_11694930 2.1 4 4.8 4.9 5
k127_23574583 3.4 3.8 5.1 5.5 9.7
k127_13509563 2.3 3.1 3.5 1.8 2.2
k127_2728460 3.2 5 5.9 4.1 5.9
k127_11409483 1.5 2.8 1.9 3.7 2
k127_22473034 4 5.1 4.8 4.9 3.8
k127_3501511 4.8 4.7 4.6 2.6 2.6
k127_9469869 6.5 8 8.1 8.8 21
...
The script determines the taxonomic classification of each contig according to the annotations of its respective genes. The script takes as input a Megan output file with taxonomic annotation for each gene and determines the taxonomic annotation of the corresponding contigs according to most frequent taxonomy of the associated genes.
Input: NTC_G210-ReadName_to_TaxonPath.txt
python get_contig_taxonomy.py -i NTC_G210-ReadName_to_TaxonPath.txt -t temp_1 -o get_contig_taxonomy_ranked_NTC_G210.txt --get_rank
Output: get_contig_taxonomy_ranked_NTC_G210.txt
Contig most_frequent_taxonomy common_taxonomy Same_taxonomy? taxonomy_rank
k127_24108426 ;root;cellular_organisms;Bacteria;Nitrospirae;Nitrospira_<class>;Nitrospirales;Nitrospiraceae;Nitrospira ;root;cellular_organisms;Bacteria;Nitrospirae;Nitrospira_<class>;Nitrospirales;Nitrospiraceae;Nitrospira Yes class
k127_19142040 ;root;cellular_organisms;Bacteria;FCB_group;Bacteroidetes/Chlorobi_group;Bacteroidetes;Cytophagia;Cytophagales;Hymenobacteraceae ;root;cellular_organisms;Bacteria;FCB_group;Bacteroidetes/Chlorobi_group;Bacteroidetes;Cytophagia;Cytophagales;Hymenobacteraceae Yes family
k127_8052916 ;root;cellular_organisms;Bacteria ;root;cellular_organisms;Bacteria Yes superkingdom
k127_20237200 ;root;cellular_organisms;Bacteria ;root;cellular_organisms;Bacteria Yes superkingdom
k127_12695121 ;root;cellular_organisms;Bacteria;Terrabacteria_group;Actinobacteria_<phylum>;Actinobacteria;Micromonosporales;Micromonosporaceae;Actinoplanes ;root;cellular_organisms;Bacteria;Terrabacteria_group;Actinobacteria_<phylum>;Actinobacteria;Micromonosporales;Micromonosporaceae;Actinoplanes Yes genus
k127_26869126 ;root;cellular_organisms;Bacteria ;root;cellular_organisms;Bacteria Yes superkingdom
k127_13862072 ;root;cellular_organisms;Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Bradyrhizobiaceae;Bradyrhizobium ;root;cellular_organisms;Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Bradyrhizobiaceae;Bradyrhizobium Yes genus
k127_10729228 ;root;cellular_organisms;Bacteria;Terrabacteria_group ;root;cellular_organisms;Bacteria;Terrabacteria_group Yes no_rank
k127_19943285 ;root;cellular_organisms;Bacteria ;root;cellular_organisms;Bacteria Yes superkingdom
...
The script merges the count table describing the relative abundance of contigs across samples and the taxonomic annotations of the contigs (generated in step 1) and creates a count table based on the selected taxonomic level by merging respective contigs.
Input: get_contig_taxonomy_ranked_NTC_G210.txt, NTC_G210.count_tab.matrix
python prepare_taxonomy_count_table.py -o count_table_taxonomy_NTC_G210.contig.genus.txt -c NTC_G210.count_tab.matrix -t get_contig_taxonomy_ranked_NTC_G210.txt -l genus
-l : specifying taxonomic rank according to ncbi taxonomy etc species,genus,order,phylum
Output: count_table_taxonomy_NTC_G210.contig.genus.txt
NTC_G210_1 NTC_G210_2 NTC_G210_3 NTC_G210_4 NTC_G210_5
Kosakonia 10 3.9 1.7 1.7 1.8
Reinekea 41.9 52.8 48.7 49.2 81.8
Ochrobactrum 66.7 72.9 79.4 58.4 96.2
Actinopolyspora 111.1 102.3 80.3 69.8 87.4
Unidentified_Chroococcales 33.9 96.6 81.8 68.3 90.6
Ustilago 2 3.5 2.8 2 1.7
Chloroflexus 1111.6 1261.1 1250.5 1056.5 1365.5
Methanolacinia 14.7 20.7 24.6 15.8 20.9
Cp51virus 1.5 2.1 3.4 2.3 2.7
Thiolinea 37.3 58.8 58.1 61.3 85.4
Mycolicibacter 13.1 16.2 18.9 14.6 25.7
Gigaspora 7.4 11.6 12.3 5.8 3.7
Demequina 55.2 59.1 59.4 45.8 74.8
Yuavirus 3.3 4.5 7.1 1.8 1.2
Gonium 7.4 6.5 6 6.6 21.2
Fortiea 36.5 51 47.2 39.2 76.9
...
The script takes as input a functional annotations input file (see step 0 for file format and example file in the 'Step 3' drive) and filters only entities with EC/KEGGs KO identifiers.
Input: NTC_G210-ReadName_to_KeggName_Ultimate.txt
Perl Filter_enzymatic_functions.pl
Change parameters: my $file ="NTC_G210-ReadName_to_KeggName_Ultimate"
open IN ,$file
open (OUT,">protein_EC_codes_total_NTC_G210.txt")
Output: protein_EC_codes_total_NTC_G210.txt
k127_5121010_270 1.1.1.1
k127_10112354_97 1.1.1.1
k127_7896669_137 1.1.1.1
k127_1162434_42 1.1.1.1
k127_6366961_84 1.1.1.1
k127_1164994_43 1.1.1.1
k127_23402251_87 1.1.1.1
k127_19133019_43 1.1.1.1
k127_4002997_33 1.1.1.1
k127_25639875_4 1.1.1.1
k127_14110403_48 1.1.1.1
k127_1494996_15 1.1.1.1
k127_3628809_52 1.1.1.1
...
The script merges the count table with functional annotations generated in Step 3, creating a count table based on the functional keys (EC/KEGGs KO).
Input: protein_EC_codes_total_NTC_G210.txt, NTC_G210.count_tab.matrix, NTC_G210-ReadName_to_TaxonPath.txt
python prepare_function_count_table.py -d protein_EC_codes_total_NTC_G210.txt -t temp_1 -o count_table_function NTC_G210.contig.txt -c NTC_G210.count_tab.matrix -p NTC_G210-ReadName_to_TaxonPath.txt
Output: count_table_function_NTC_G210.contig.txt
NTC_G210_1 NTC_G210_2 NTC_G210_3 NTC_G210_4 NTC_G210_5
3.4.11.1 240.49999999999997 302.0 285.4000000000001 214.6 303.5
2.6.1.21 99.39999999999999 122.19999999999999 98.8 80.89999999999999 112.10000000000001
4.2.1.57 10.8 14.899999999999999 12.5 13.5 16.6
5.3.1.- 25.6 47.6 48.9 34.1 47.7
3.6.3.8 87.2 106.39999999999998 107.70000000000002 75.8 113.80000000000001
6.3.4.18 110.7 139.7 124.10000000000001 99.0 160.60000000000002
2.4.1.14 0.0 0.0 0.0 0.0 0.0
1.1.1.163 0.0 0.0 0.0 0.0 0.0
3.4.21.7 0.0 0.0 0.0 0.0 0.0
1.1.1.262 105.3 109.89999999999999 123.99999999999999 81.7 112.29999999999998
4.1.1.71 0.0 0.0 0.0 0.0 0.0
3.3.2.8 6.1 3.6 4.9 4.3 4.7
2.3.1.29 137.5 159.20000000000005 168.70000000000002 141.49999999999997 195.50000000000003
5.3.1.12 34.8 44.800000000000004 44.29999999999999 32.4 67.7
1.13.11.66 0.0 0.0 0.0 0.0 0.0
2.7.1.170 81.50000000000001 94.1 93.4 59.5 96.0
...
The script merges the count table according to a taxonomic (Step 2) and functional (Step 4) keys into a double-key (taxonomic/functional) count table. The output file details the taxonomic distribution of reads assigned to each enzymes at a selected taxonomic level.
Input: protein_EC_codes_total_NTC_G210.txt, NTC_G210.count_tab.matrix, NTC_G210-ReadName_to_TaxonPath.txt, get_contig_taxonomy_NTC_G210.txt
python prepare_function_taxonomy_ctable_contigwise.py -d protein_EC_codes_total_NTC_G210.txt -r temp_1 –o function_taxonomy _countable_NTC_G210.contig.genus.txt -c NTC_G210.count_tab.matrix -p NTC_G210-ReadName_to_TaxonPath.txt -t get_contig_taxonomy_NTC_G210.txt -l genus
-l : specifying taxonomic rank according to ncbi taxonomy etc species,genus,order,phylum
Output: function_taxonomy_countable_NTC_G210.contig.genus.txt
EC taxa NTC_G210_1 NTC_G210_2 NTC_G210_3 NTC_G210_4 NTC_G210_5
1.1.1.1 Unidentified 107.6 148.89999999999998 127.8 103.19999999999999 130.49999999999997
1.1.1.1 Ilumatobacter 4.0 20.0 12.0 9.9 16.0
1.1.1.1 Unidentified_Verrucomicrobiales 3.7 9.2 7.1 4.8 2.4
1.1.1.1 Conexibacter 2.9 5.2 4.1 3.5 6.2
1.1.1.1 Unidentified_Geodermatophilales 6.3 6.6 6.0 4.5 5.4
1.1.1.1 Unidentified_Sphingomonadales 19.2 21.7 18.4 15.2 22.0
1.1.1.1 Candidatus_Entotheonella 4.3 5.2 7.5 6.6 7.7
1.1.1.1 Gaiella 2.4 2.3 1.5 3.3 3.1
...
The script takes as input the enzyme – diversity table generated at Step 5 and calculates a diversity score for each enzyme, indicating whether it is dominantly present in a specific taxonomic group (at a selected taxonomic level) or widely distributed across microbial community.
Input: function_taxonomy_countable_BjSa_NTC.contig.genus.txt
python calculate_enzyme_diveristy_score.py function_taxonomy_countable_BjSa_NTC.contig.genus.txt /path/to/input/file/
Output: ECs_with_dominant_taxon.txt, ECs_with_dominant_taxon_knockout.txt, RscriptOutput.csv, RscriptOutput_filter_IdentMostFreq.csv, RscriptOutput_filter_IdentMostFreqbyEnzyme.csv, RscriptOutput_filter_dominance.csv
ECs_with_dominant_taxon.txt
all 1.13.11.73 2.4.1.83 1.5.-.- 3.3.2.1 3.5.1.84 3.5.3.13 3.5.1.98 3.1.2.12 1.1.1.47 3.1.7.2 1.6.3.- 2.4.1.14 1.8.7.1 2.4.1.18 2.1.3.11 1.6.5.4 1.2.98.1 4.2.1.149 1.1.1.401 2.7.1.- 1.1.99.28 3.1.30.1 3.1.3.78 3.5.1.6 3.5.3.15 3.2.1.52 1.7.2.5 2.1.1.98 3.1.1.5 2.7.9.3 3.5.99.10 1.14.11.8 5.4.99.24 6.3.5.3 1.3.5.6 1.14.14.9 1.5.99.5 2.3.2.6 2.1.1.321 1.3.99.12 1.14.11.27 6.3.5.9 1.13.11.79 3.6.1.54 3.6.3.41 2.3.1.122 1.1.9.1 3.6.1.9 3.7.1.14 2.8.3.16 2.3.1.19 1.3.3.5 2.5.1.48 1.5.1.50 3.1.3.48 6.2.1.27 5.3.1.4 2.3.1.5 2.6.1.108 2.6.1.9 2.8.3.15 3.5.2.14 2.7.9.2 5.4.99.17 1.3.99.27 2.5.1.72 5.4.2.2 2.7.7.43 6.3.1.13 2.5.1.6 4.6.1.- 2.7.7.85 3.2.1.26 5.4.99.63 1.3.98.3 1.14.13.195 5.1.1.17 2.1.1.6 6.3.2.17 2.6.1.86 1.14.11.52 1.1.1.193 5.4.99.64 2.8.1.13 2.1.1.198 4.1.1.22 1.14.99.29 2.7.1.190 4.6.1.16 3.4.24.15 1.8.4.8 1.3.1.43 2.1.1.77 2.3.3.14 3.4.-.- 3.6.1.67 3.2.1.82 3.6.1.57 2.3.1.168 4.2.1.115 5.3.99.11 2.4.99.16 4.2.1.113 1.4.1.26 5.4.4.3 2.6.1.106 3.5.1.4 6.3.1.10 4.2.1.118 2.3.1.51 6.6.1.2 1.10.3.11 3.6.1.63 2.8.1.- 2.7.3.9 1.1.1.303 2.3.1.201 4.1.1.28 3.4.24.21 4.3.1.2 3.4.23.36 4.1.1.99 2.5.1.60 3.4.21.61 2.1.1.104 1.3.8.12 2.1.1.219 4.1.1.33 6.3.1.- 6.2.1.43 3.2.1.50 4.2.1.49 1.1.1.400 3.5.1.29 2.3.1.9 1.1.1.140 3.4.15.5 1.2.1.5 4.1.2.27 5.1.99.6 2.2.1.2 2.5.1.132 2.7.7.6 1.1.1.376 2.6.1.62 3.1.11.6 3.1.3.77 2.1.1.151 1.3.7.11 2.3.2.- 1.1.1.287 2.6.1.85 4.2.1.129 2.3.1.117 5.1.3.2 3.6.1.- 2.3.1.37 2.6.1.44
...
ECs_with_dominant_taxon_knockout.txt
all 1.13.11.73 2.4.1.83 1.5.-.- 3.3.2.1 3.5.1.84 3.5.3.13 3.5.1.98 3.1.2.12 1.1.1.47 3.1.7.2 1.6.3.- 2.4.1.14 1.8.7.1 2.4.1.18 2.1.3.11 1.6.5.4 1.2.98.1 4.2.1.149 1.1.1.401 2.7.1.- 1.1.99.28 3.1.30.1 3.1.3.78 3.5.1.6 3.5.3.15 3.2.1.52 1.7.2.5 2.1.1.98 3.1.1.5 2.7.9.3 3.5.99.10 1.14.11.8 5.4.99.24 6.3.5.3 1.3.5.6 1.14.14.9 1.5.99.5 2.3.2.6 2.1.1.321 1.3.99.12 1.14.11.27 6.3.5.9 1.13.11.79 3.6.1.54 3.6.3.41 2.3.1.122 1.1.9.1 3.6.1.9 3.7.1.14 2.8.3.16 2.3.1.19 1.3.3.5 2.5.1.48 1.5.1.50 3.1.3.48 6.2.1.27 5.3.1.4 2.3.1.5 2.6.1.108 2.6.1.9 2.8.3.15 3.5.2.14 2.7.9.2 5.4.99.17 1.3.99.27 2.5.1.72 5.4.2.2 2.7.7.43 6.3.1.13 2.5.1.6 4.6.1.- 2.7.7.85 3.2.1.26 5.4.99.63 1.3.98.3 1.14.13.195
...
RscriptOutput.csv
Enzyme SimpsonScore_NTC_G210_1 Dominance_NTC_G210_1 mostFreqOrg_NTC_G210_1 SimpsonScore_NTC_G210_2 Dominance_NTC_G210_2 mostFreqOrg_NTC_G210_2 SimpsonScore_NTC_G210_3 Dominance_NTC_G210_3 mostFreqOrg_NTC_G210_3 SimpsonScore_NTC_G210_4 Dominance_NTC_G210_4 mostFreqOrg_NTC_G210_4 SimpsonScore_NTC_G210_5 Dominance_NTC_G210_5 mostFreqOrg_NTC_G210_5 DominanceMean
0 1.1.1.1 0.794587955 0.205412045 Unidentified 0.800093573 0.199906427 Unidentified 0.815299756 0.184700244 Unidentified 0.79956269 0.20043731 Unidentified 0.804941617 0.195058383 Unidentified 0.197102882
1 1.1.1.2 0.626634958 0.373365042 Unidentified 0.586642044 0.413357956 Unidentified 0.58678501 0.41321499 Unidentified 0.5592 0.4408 Unidentified 0.610945866 0.389054134 Unidentified 0.405958424
2 1.1.1.27 0.888954987 0.111045013 Unidentified 0.864858142 0.135141858 Unidentified 0.863252353 0.136747647 Unidentified 0.862202181 0.137797819 Unidentified 0.860412903 0.139587097 Unidentified 0.132063887
3 1.1.2.8 0.780311764 0.219688236 Unidentified 0.766498647 0.233501353 Unidentified 0.794083108 0.205916892 Unidentified 0.75553934 0.24446066 Unidentified 0.753870376 0.246129624 Unidentified 0.229939353
4 1.1.1.284 0.776145944 0.223854056 Unidentified 0.780229568 0.219770432 Unidentified 0.792758639 0.207241361 Unidentified 0.781272752 0.218727248 Unidentified 0.781929204 0.218070796 Unidentified 0.217532779
5 1.2.1.3 0.835101162 0.164898838 Unidentified 0.820531253 0.179468747 Unidentified 0.841063839 0.158936161 Unidentified 0.827654289 0.172345711 Unidentified 0.813291109 0.186708891 Unidentified 0.17247167
...
RscriptOutput_filter_IdentMostFreq.csv
Enzyme SimpsonScore_NTC_G210_1 Dominance_NTC_G210_1 mostFreqOrg_NTC_G210_1 SimpsonScore_NTC_G210_2 Dominance_NTC_G210_2 mostFreqOrg_NTC_G210_2 SimpsonScore_NTC_G210_3 Dominance_NTC_G210_3 mostFreqOrg_NTC_G210_3 SimpsonScore_NTC_G210_4 Dominance_NTC_G210_4 mostFreqOrg_NTC_G210_4 SimpsonScore_NTC_G210_5 Dominance_NTC_G210_5 mostFreqOrg_NTC_G210_5
17 2.7.1.1 0.441616234 0.558383766 Heliobacterium 0.516619183 0.483380817 Heliobacterium 0.450123457 0.549876543 Heliobacterium 0.370844444 0.629155556 Heliobacterium 0.368543388 0.631456612 Heliobacterium
68 2.3.3.8 0.293623795 0.706376205 Unidentified_Nitrospirales 0.348869024 0.651130976 Unidentified_Nitrospirales 0.302205115 0.697794885 Unidentified_Nitrospirales 0.354595543 0.645404457 Unidentified_Nitrospirales 0.380746349 0.619253651 Unidentified_Nitrospirales
105 1.1.3.5 0.5 0.5 Myxococcus 0.375 0.625 Myxococcus 0.48 0.52 Myxococcus 0.444444444 0.555555556 Myxococcus 0.244897959 0.755102041 Myxococcus
106 1.1.1.11 0.444913495 0.555086505 Unidentified_Burkholderiales 0.419981082 0.580018918 Unidentified_Burkholderiales 0.411606979 0.588393021 Unidentified_Burkholderiales 0.350849714 0.649150286 Unidentified_Burkholderiales 0.377580755 0.622419245 Unidentified_Burkholderiales
119 2.7.1.53 0.634920635 0.365079365 Mesorhizobium 0.603550296 0.396449704 Mesorhizobium 0.58979206 0.41020794 Mesorhizobium 0.520775623 0.479224377 Mesorhizobium 0.53968254 0.46031746 Mesorhizobium
146 1.1.1.307 0 1 Gemmata 0 1 Gemmata 0 1 Gemmata 0 1 Gemmata 0 1 Gemmata
...
RscriptOutput_filter_IdentMostFreqbyEnzyme.csv
Organism Enzyme
0 Acanthamoeba ['1.13.11.20']
1 Actinoplanes ['1.1.1.287', '1.1.3.9', '3.2.1.64', '1.14.13.113', '3.4.24.77', '4.1.99.20', '3.2.1.82', '1.20.4.3', '2.3.1.194', '1.14.99.56']
2 Afipia ['1.3.1.87']
3 Algiphilus ['1.1.1.170']
4 Amycolatopsis ['2.7.1.190']
5 Arenimonas ['2.4.1.290']
6 Bradyrhizobium ['3.6.1.7', '1.13.11.34', '1.13.12.9', '3.4.21.68', '1.14.13.81', '1.3.7.7', '2.5.1.62', '2.5.1.133', '4.2.1.131', '2.1.1.210', '1.14.14.15', '1.14.14.-', '2.6.1.106', '4.2.1.159', '4.1.1.61', '3.2.1.81', '4.99.1.5']
7 Candidatus_Entotheonella ['2.1.1.272']
...
RscriptOutput_filter_dominance.csv
Enzyme SimpsonScore_NTC_G210_1 Dominance_NTC_G210_1 mostFreqOrg_NTC_G210_1 SimpsonScore_NTC_G210_2 Dominance_NTC_G210_2 mostFreqOrg_NTC_G210_2 SimpsonScore_NTC_G210_3 Dominance_NTC_G210_3 mostFreqOrg_NTC_G210_3 SimpsonScore_NTC_G210_4 Dominance_NTC_G210_4 mostFreqOrg_NTC_G210_4 SimpsonScore_NTC_G210_5 Dominance_NTC_G210_5 mostFreqOrg_NTC_G210_5 DominanceMean
1 1.1.1.2 0.626634958 0.373365042 Unidentified 0.586642044 0.413357956 Unidentified 0.58678501 0.41321499 Unidentified 0.5592 0.4408 Unidentified 0.610945866 0.389054134 Unidentified 0.405958424
6 1.2.1.5 0.59581252 0.40418748 Unidentified 0.537131519 0.462868481 Unidentified 0.603807258 0.396192742 Unidentified 0.615574023 0.384425977 Unidentified 0.638671875 0.361328125 Unidentified 0.401800561
7 1.2.1.9 0.35345 0.64655 Unidentified 0.365510391 0.634489609 Unidentified 0.34860392 0.65139608 Unidentified 0.366133182 0.633866818 Unidentified 0.26410194 0.73589806 Unidentified 0.660440113
17 2.7.1.1 0.441616234 0.558383766 Heliobacterium 0.516619183 0.483380817 Heliobacterium 0.450123457 0.549876543 Heliobacterium 0.370844444 0.629155556 Heliobacterium 0.368543388 0.631456612 Heliobacterium 0.570450659
29 4.1.1.1 0 1 Unidentified 0 1 Unidentified 0 1 Unidentified 0 1 Unidentified 0 1 Unidentified 1
41 1.2.7.- 0.537079604 0.462920396 Unidentified 0.548895968 0.451104032 Unidentified 0.611364337 0.388635663 Unidentified 0.537909841 0.462090159 Unidentified 0.54836851 0.45163149 Unidentified 0.443276348
...
A pipeline that implements steps 1-6 as a single script using a multi-thread processing.
Input: NTC_G210-ReadName_to_TaxonPath.txt, NTC_G210.count_tab.matrix, NTC_G210-ReadName_to_KeggName_Ultimate.txt Output: all outputs of steps 1-6
Here input files are required - ReadName_to_TaxonPath.txt, count_tab.matrix and ReadName_to_KeggName_Ultimate.txt" There are 4 arguments that have to be stated by the following order: ReadName_to_TaxonPath.txt count_tab.matrix ReadName_to_KeggName_Ultimate.txt and BaseFolder - where the script and input files are.
Removal_Pipeline.py ReadName_to_TaxonPath.txt count_tab.matrix ReadName_to_KeggName_Ultimate.txt /Path/To/InputFiles/
The script determines differential abundance between entities whose relative abundance is described in a count table. The script will produce, for each treatment, a table file with the results of the differential analysis and a plot showing the distribution of differentially represented elements.
Input: count_table_taxonomy_genus_BjSaVSNTC.txt, sample_metadata.txt (Metadata file, with at least four columns: samples ID, name of the rootstock, name of treatment and name of rootstock treatment), contrasts_used_BjSa_NTC.txt
Rscript using_edgeR_generic.R. count_table_taxonomy_genus_ BjSaVSNTC.txt sample_metadata.txt Perfix_for_output_files Treatment_Rootstock contrasts_used_BjSa_NTC.txt 0.05
Parameters: count table, metadata file, prefix for output files, compared treatments as specified by column names in metadata file, text file with the treatments being compared (e.g., NTC vs BjSa), FDR threshold. 0.05-FDR threshold used to determine significance.
Output:
Genus_0.05_BjSaVSNTC
X logFC logCPM F PValue FDR association
Sulfuritalea -5.58755896994623 5.70635795499091 671.232809188422 4.93267572355644e-21 9.00355227037429e-18 NTC
Blastopirellula 2.94868245094395 8.31874579196599 652.040459410995 7.27221978153918e-21 9.00355227037429e-18 BjSa
Kribbella 4.51846129896242 7.64468080204723 635.417496127515 1.02702117152558e-20 9.00355227037429e-18 BjSa
Bdellovibrio 3.40751471675525 7.55515074395945 602.062275727325 2.10905124530474e-20 1.38670119378786e-17 BjSa
Unidentified_Mucorales 10.7416670206439 2.51304852574235 567.715548572001 4.61004629850151e-20 2.42488435301179e-17 BjSa
...
The script identifies pathways that are enriched with differentially abundant enzymes in treatment1 vs treatment2, constructs corresponding metabolic networks, predicts environmental resources that differ between the treatments and simulates metabolic activity within each of the predicted environments. This code is a local version of the online NetCom tool https://freilich-lab-tools.com/netcom/ described in the Ofir et al 2021.
Inputs: The algorithm takes as input an EdgeR output file (generated at Step 8) that screens for the differential abundance of enzymatic reactions in two different treatments (Step 8). Enzymes are classified as associated with Treatment_1, Treatment_2 or not associated. The EdgeR results file is tab separated and is consist of the following columns: enzyme logFC logCPM PValue FDR association.
Instructions: The code is implemented in Python 3.8 (Linux OS, Anaconda). To run the code locally:
Download and unzip Netcom from Scripts/step9 directory
cd NetCom
virtualenv NetCom
source NetCom/bin/activate
pip install -r requirements.txt
execute by:
python app.py
open browser at http://127.0.0.1:8050/netcom/
After a file is uploaded, a new section appears, enables the selection of the control and the experiment treatments. The selection of control and experiment treatments drives the calculation of basic statistics features that are then shown.
Entities number in a pathway: select a range of minimum and maximum number of entities linked with a pathway in the enrichment analysis. The selection of the range above drives the enrichment analysis calculation and a new option is then enabled.
Select pathways to dropout: select which pathways to drop out of the network.
Environmental resource node color: select the color of the compounds which are essential for the network to develop (seeds).
Unique node color: select the color of the compounds which are differentially abundant due to the treatment.
Limit node hubness: Select the maximum allowed number of connected edges to a node.
Set network layout iterations: NetworkX network layout iterations. Higher numbers would produce an arranged network, but consume more time to calculate (up to a few minutes).
Parameters: fisher exact test - scipy.stats.fisher_exact, alternative hypothesis 'greater' Test results and p-value correction for multiple tests - statsmodels.stats.multitest.multipletests default parameters, alpha=0.05
Outputs: The algorithm generates as output:
(i) Lists of differentially abundant enzymes and their pathway
association.
(ii) Prediction of environmental resources that are unique to each
treatment and their pathway association.
(iii) Prediction of environmental compounds that are produced by
the microbial community and pathway association of compounds
that are treatment-specific.
(iv) Network visualization of enzymes, environmental resources and
produced compounds that are treatment specific (2 & 3D).
The script conducts community 'knockouts' simulations in which selected taxonomic groups are removed from the community network by eliminating enzymes associated with the group (according to the scores in step 6). The impact of the removal group is estimated according to differences in the number of metabolites between the network expanded from the truncated enzyme set, and the reference meta-network that includes the full set of enzymatic functions
Input: Knock_out_file_step_6_update_name, Env.txt, dictionary files (From the NetCom package, Step 9, XXX_compounds.txt)
svn export https://github.com/ot483/NetCom2/trunk/Dict
Python Sim4RemovalNet.py
Output:
Conventional_Compounds_pathway.csv
Pathway entities
54 Cysteine and methionine metabolism ['C01137', 'C05526', 'C00021', 'C01180', 'C08276', 'C00409', 'C00097', 'C03145', 'C05824', 'C00059', 'C15651', 'C00051', 'C03539', 'C02291', 'C04188', 'C00957', 'C15650', 'C03089', 'C11537', 'C04582', 'C05528', 'C05823', 'C00506', 'C00094', 'C15606', 'C00019', 'C00283', 'C00155', 'C00170', 'C00073']
67 Fatty acid elongation ['C05264', 'C05270', 'C05269', 'C01944', 'C05274', 'C05271', 'C05262', 'C05265', 'C02843', 'C05263', 'C00638', 'C05275', 'C05267', 'C00136', 'C00264', 'C05276', 'C00640', 'C03221', 'C05268', 'C05266', 'C00658']
132 Porphyrin and chlorophyll metabolism ['C05791', 'C05787', 'C05772', 'C06508', 'C02469', 'C06505', 'C00194', 'C11538', 'C05778', 'C05774', 'C18022', 'C06506', 'C17401', 'C18021', 'C02463', 'C06510', 'C06507', 'C01051', 'C00853', 'C06509']
23 Biosynthesis of antibiotics ['C00956', 'C11984', 'C11981', 'C00449', 'C11982', 'C11967', 'C20139', 'C11980', 'C11983', 'C04076', 'C11921', 'C12426', 'C11963', 'C00097', 'C01944', 'C00059', 'C11965', 'C18680', 'C02291', 'C00224', 'C11968', 'C18681', 'C12414', 'C20138', 'C05551', 'C12395', 'C00136', 'C05556', 'C18678', 'C12045', 'C02954', 'C11979', 'C00019', 'C11964', 'C12425', 'C00395', 'C06630', 'C12406', 'C12405', 'C06567', 'C12413', 'C15777', 'C06635', 'C11966']
7 Anthocyanin biosynthesis ['C16315', 'C08604', 'C16288', 'C16290', 'C12644', 'C08639', 'C16289', 'C05905', 'C12138', 'C12642', 'C12643', 'C16297', 'C16298', 'C12137', 'C12640', 'C08620', 'C16301']
8 Arachidonic acid metabolism ['C14770', 'C00219', 'C14769', 'C14749', 'C14772', 'C14812', 'C14771', 'C14748', 'C14820', 'C14768', 'C14773', 'C14775', 'C14821', 'C00157', 'C14778', 'C14823', 'C14774']
...
simulation_compounds.txt
all C00153 C02947 C03231 C00048 C00350 C04916 C20239 C13425 C15930 C00108 C00605 C00097 C00638 C01153 C02265 C06441 C05682 G00089 C12456 C05301 C17959 C20330 C14866 C06113 C00644 C05745 C04619 C00279 C01495 C03847 C00483 C11435 C04615 C00230 C00249 C00526 C06586 C03269 C05729 C00464 C05840 C00024 C03440 G00090 C00085 C01168 C00545 C16519 C04751 C00468 G09660 C08604 C01061 G10609 G11121 G11122 C00212 C01089 C00603 C00705 C00162 C00542 C00886 C00134 C00385 C04691 C06630 C00826 C16688 G10737 C06057 C02167 C00668 C05577 C04146 C01185 C01563 C03287 C00658 C04405 C04257 C04409 C16406 C05345 C00642 C06033 C06622 C00094 C15782 C03657 C14871 C00286
...
Input/output files names can be updated in the script. Knock_out_file_step_6_update_name is the output of step 6. Env.txt is an output of step 9 – product (ii) Prediction of environmental resources that are unique to each treatment and their pathway association. The script requires dictionary files provided in Dict folder: compounds_lables_jun_1.txt, ec_reac_mapping_jun.txt, full_enzymes_labels_jun.txt, reactions_3_balanced.txt
The script takes as input lists of enzymes, predicted environmental resources and unique compounds and produces a network representation.
Input: allCompounds_BjSa.txt, allCompounds_NTC.txt, Compounds_BjSa_Order.txt, EC_ALL.txt, ECs_BjSa.txt, ECs_NTC.txt, enzymes_BjSa_order.txt, pathways_BjSa_order.txt, seeds_BjSa.txt, seeds_NTC.txt, compounds_lables_jun_1.txt, ec_reac_mapping_jun.txt, full_enzymes_labels_jun.txt, reactions_3_balanced.txt
Output: BjSa_order_removal_network.pdf, BjSa_oder_removal_network.png
The user should change the file names in the script - lines 80-128
python3 Network_Figures.py
OR
The script can be used by arguments as below, just uncomment lines 33-78 and comment lines 80-128
python3 Network_Figures.py <This is base folder> <drop fragments with size smaller then> \
<integer of hubness to filter> <prefix> <All_compounds_B_input txt file> \
<Seeds_B_input txt file> <ECs_All_input txt file> \
<patches_Compounds> <patches_Enzymes> <Pathways_Compounds>
Ofir Tal ( @Ofir = https://sites.google.com/view/ofir-tal-lab/home )
Maria Berihu ( @Maria = https://www.freilich-lab.com/maria-detailes )
Edoardo Pimbo ( @Edoard = https://ec.linkedin.com/in/eduardo-pimbo-525a8193 )
Shiri Freilich ( @Shiri = https://www.freilich-lab.com/ )
Huson DH, Auch AF, Qi J, Schuster SC: MEGAN analysis of metagenomic data. Genome research 2007, 17(3):377-386.
Ofaim S, Zarecki R, Porob S, Gat D, Lahav T, Kashi Y, Aly R, Eizenberg H, Ronen Z, Freilich S: Genome-scale reconstruction of Paenarthrobacter aurescens TC1 metabolic model towards the study of atrazine bioremediation. Sci Rep 2020, 10(1):13019.
Tal O, Bartuv R, Vetcos M, Medina S, Jiang J, Freilich S: NetCom: A Network-Based Tool for Predicting Metabolic Activities of Microbial Communities Based on Interpretation of Metagenomics Data. Microorganisms 2021, 9(9):1838.
This work was funded by the United States - Israel Binational Agricultural Research and Development Fund (BARD) [grant number [US-5390-21]