This repository is the main analysis for the associated shotgun metagenomics benchmarking manuscript. It contains the raw data, analysis methods, and documentation.
-
figures.ipynb
: main file for analysis and generation of statistics. -
requirements.txt
: a list of packages needed for this analysis. You can install these by first creating a Python virtual environment, then:pip install -r requirements.txt
This folder structure contains subdirectories for each pipeline assessed. These are:
-
Amos: Amos, G. C. A. et al. Developing standards for the microbiome field. Microbiome 8, 98, 10.1186/s40168-020-00856-3 (2020).
-
BMock12: Sevim, V. et al. Shotgun metagenome data of a defined mock community using Oxford Nanopore, PacBio and Illumina technologies. Sci Data 6, 285 (2019).
-
CamiSim: Fritz, A., Lesker, T., Bremges, A. & McHardy, A. Cami 2 - multisample benchmark dataset of human microbiome project, 10.4126/FRL01-006425518 (2019).
-
NIST: Bioproject PRJNA970731. SRA Accession SUB12091075.
-
Tourlousse: Tourlousse, D. M. et al. Characterization and demonstration of mock communities as control reagents for accurate human microbiome community measurements. Microbiol. Spectr. 10, e01915–21, 10.1128/spectrum.01915-21 (2022). https://journals.asm.org/doi/pdf/10.1128/spectrum.01915-21.
Within each subdirectory are CSVs of ground truth standardized names and expected abundances. The annotated files have TAXIDs attached in the following format: Species, RA, TAXID.
This folder contains the cleaned outputs from the raw pipeline output, as well as the statistics. The folder layout is identical to the Expected Pipelines, but there are subdirectories for each pipeline. Also, there is a [threshold]_all_stats_replicates_[genus/species].csv file, which summarizes the statistics for the community across all of the pipelines.
This folder contains much of the project's code to convert raw data into data that can be analyzed. There are several subdirectories here. Omitted subdirectories were not needed for this project. There is also useful information about each pipeline individually in the Pipelines file.
-
nist_confusion
: contains the code for making the confusion matrix tables in the supplement. -
pipeline_clean
: contains all the pipeline cleaning code for bioBakery, JAMS, WGSA2, and Woltka. -
compositions.py
: Aitchison Distance functions with associated test file. -
figures_utils
: useful figure file functions.
Contains many of the analysis tools to calculate statistics and tables for the manuscript.
-
analysis
: contains functions for statistical analysis.ad_vs_reads
: calculates the correlation between AD and read length.results
: folder for output of analysissummary_tables
: CSV summary table creation (AD, Sens, FPRA, Num_FP).analyze_stats.ipynb
: performs the majority of the combined statistical analysis, including KW test and combining the stats together into theresults/all_stats_species.csv
files.
-
ncbi
: NCBI standardization and lineage searching protocols.databases
: where the NCBI database should go (i.e., from https://ftp.ncbi.nih.gov/pub/taxonomy/new_taxdump/) and uncompress it.pkls
: initially empty folder -- where the standardized outputs will go from the NCBI parsing routine.names
: contains the name standardization functionslineage.py/.ipynb
: determines the correct genus name from TAXID.jams_convert.py
: annotates non-Excel JAMS output.species_annotated_to_genus
: converts all of the species-level annotated RA tables to genus-level.
-
paper
: diagrams and figures for the manuscript.heatmaps.ipynb
: creates the RA table comparison heatmaps.read_stats.ipynb
: creates and analyzes read stats tables
-
time
: calculates the time performance of each pipeline. -
data_paths.py
: holds the path to the raw data on the Biowulf computing cluster.-
Note: for running this analysis, the tarball
mock_communities.tar.gz
holds the raw data locally. If you would like to recreate this analysis, you will need to extract this archive. Extract the archive in theutils
directory.tar -xzf mock_communities.tar.gz
This will recreate the supercomputer's file structure with the necessary files in a directory named
Volumes
.
-
- The main sources of data can be found in the
data_paths.py
explanation. However, if you wish to conduct your own analysis, may be easier to use the data found inexpected_pipelines
andpipelines
, since these have been transformed into RA values and standardized with TAXIDs.
These flow charts were generated using diagrams.net.
The heatmaps were generated using the heatmaps.ipynb notebook with the raw data sourced from the expected_pipelines
and pipelines
folders.
These figures were generated from the analyze_stats.ipynb notebook, which also created the all_stats_species.csv data.
- Additionally, the pairwise Wilcoxon test analysis is conducted in the same file.
The bivariate plot was generated using the read_AD.ipynb.
The original read statistics CSV file can be found at read_stats.csv.
These metrics were calculated from figures.ipynb, which generated the all_stats
files in the pipelines
folders. Then, the tables.ipynb notebook was used to generate the wanted format. These were then cleaned manually and placed into ad_table.xlsx for easy transformation into LaTeX tables.
These tables were generated from the NIST confusion matrix analysis at nist_sens.ipynb and further cleaning was done with the nist_revisions.ipynb notebook.