This metagenome analysis workflow is in the process of being updated. For specific queries please contact martin.ostrowski@uts.edu.au

An assembly and annotation pipeline using standardised tools to characterise shotgun metagenomes produced by the Marine Microbes and Australian Microbiome Project.

The updated workflow consists of:

(PBS scripts for HPCC)

Quality filtering and adapter trimming - Trimmomatic
Digital dereplication - bbnorm (bbnorm.sh)
Assembly - SPAdes meta (spades.sh)
Naming conventions - rename, reformat (500) (reformat.sh)
Gene calling - prodigal (prodigal.sh)
Abundance mapping - bbmap, performed at gene level and contig level (bbmap.sh)
Taxonomic assignment of contigs against the NCBI database - basta (~/bln.sh then basta loop)
Functional annotation through fine-grained orthologue detection - eggNOG-mapper v1.0.2 (update to 2.0.8 pending)
Reference Gene Catalogue - cd-hit 98 (optional)
Subsystems annotation - SEED mapping
Concatenation and table generation for input to statistical packages (compile_basta.r, bbmap.r)
Overview statistics and visualisation
Hypothesis testing
Visualisation of results using pathways, modules and other functional hierarchies

Each process is described in detail in a series of three R Markdown notebooks. The main processing steps are coded for the UTS HPCC.

The standardised outputs include

a gene abundance table aggregated on taxonomy (genus level) and
a. eggNOG orthologous gene IDs
b. SEED figIDs
c. a contig abundance table

The relative abundances of genes and/or contigs can be estimated in a number of ways: gene-wise or contig-wise, and with or without taxonomic classification, using raw data from the read mapping carried out with bbmap.
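As a minimal sketch of the gene-wise case, assuming the mapped-read counts for one sample have already been extracted from the bbmap output into a plain mapping, relative abundance can be computed directly or after summing over a taxonomic label. The function names, gene IDs and genus label below are illustrative and are not part of the workflow scripts; no gene-length normalisation is applied here.

```python
from collections import defaultdict


def relative_abundance(counts):
    """Convert raw mapped-read counts {feature: count} to proportions summing to 1."""
    total = sum(counts.values())
    if total == 0:
        return {feature: 0.0 for feature in counts}
    return {feature: n / total for feature, n in counts.items()}


def aggregate_by_taxon(counts, taxonomy, unclassified="unclassified"):
    """Sum counts over features that share a taxonomic label (e.g. genus)."""
    summed = defaultdict(int)
    for feature, n in counts.items():
        summed[taxonomy.get(feature, unclassified)] += n
    return dict(summed)


# Hypothetical data for one sample: gene ID -> number of mapped reads.
gene_counts = {"139751_1_1": 30, "139751_1_2": 10, "139751_2_1": 60}
# Hypothetical genus assignments; genes without an entry count as unclassified.
gene_genus = {"139751_1_1": "Synechococcus", "139751_1_2": "Synechococcus"}

gene_wise = relative_abundance(gene_counts)
genus_wise = relative_abundance(aggregate_by_taxon(gene_counts, gene_genus))
```

The same two functions apply unchanged to contig-level counts, since only the keys of the mapping differ.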

Description:

options for specific software packages and linking scripts

Steps 01-08: to be imported; see old workflow below

eggNOG mapper

split -l 1000000 -a 3 -d input_file.faa input_file.chunk_

The input is in FASTA format with one sequence per line, and is split into chunks.
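The chunking step can also be sketched in Python, assuming every record occupies exactly two lines (a header line followed by one sequence line). Under that assumption, 1,000,000 lines per chunk corresponds to 500,000 sequences per chunk. The function name and parameters are hypothetical; the three-digit numeric suffix mirrors `split -a 3 -d`.

```python
from itertools import islice
from pathlib import Path


def split_unwrapped_fasta(path, prefix, records_per_chunk=500_000):
    """Write consecutive chunks of an unwrapped FASTA file and return their paths.

    Assumes exactly two lines per record (header, then sequence), so a record
    is never split across two chunks.
    """
    chunk_paths = []
    with open(path) as handle:
        while True:
            # Read the next block of whole records (two lines each).
            lines = list(islice(handle, 2 * records_per_chunk))
            if not lines:
                break
            chunk_path = Path(f"{prefix}{len(chunk_paths):03d}")
            chunk_path.write_text("".join(lines))
            chunk_paths.append(chunk_path)
    return chunk_paths
```

Calling `split_unwrapped_fasta("input_file.faa", "input_file.chunk_")` would produce `input_file.chunk_000`, `input_file.chunk_001`, and so on.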

utility scripts

all in bin

remove_smalls.pl

unwrap

Contig sequences are named consistently and the fasta files of predicted genes and proteins are converted to one entry per line to enable simple chunking and splitting operations.

perl -ne 'chomp; if ( /^>/ ) { print "\n" if $n; print "$_\n"; $n++; } else { s/\s+//g; print; } END { print "\n"; }' MAI_all500.nt > MAI_all500unwrap.nt
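For readers less familiar with perl, the unwrapping logic can be sketched in Python as follows. This is an illustrative equivalent and not a script shipped in bin; it assumes the input begins with a header line, and the function name is hypothetical.

```python
def unwrap_fasta(lines):
    """Yield FASTA lines with each record's sequence joined onto a single line."""
    sequence = []
    in_record = False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if in_record:
                # Emit the sequence collected for the previous record.
                yield "".join(sequence)
            yield line
            sequence = []
            in_record = True
        else:
            # Remove any internal whitespace before joining the pieces.
            sequence.append("".join(line.split()))
    if in_record:
        yield "".join(sequence)


# Small made-up example: two records, the first wrapped over two lines.
wrapped = [">c1\n", "ACGT\n", "AC GT\n", ">c2\n", "TTTT\n"]
unwrapped = list(unwrap_fasta(wrapped))
```

Because the function works on any iterable of lines, it can be fed an open file handle so that large files are processed without being loaded into memory.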

rename 1

perl -ane 'if(/^>/){$a++;print ">139751_$a\n"}else{print;}' /shared/c3/bio_db/BPA/assemblies/contigs/500/MAI/139751_contigs.nt > /shared/c3/bio_db/BPA/assemblies/contigs/500/MAI/139751_contigs.fasta

rename 2

for i in 35*vnt.out; do CODE=$(basename $i _contigs.vnt.out); basta sequence -p 51 -b 1 -d /shared/c3/bio_db/BPA/nt202008/ /shared/c3/bio_db/BPA/assemblies/contigs/500/MAI/${CODE}_contigs.vnt.out ${CODE}_contigs.vnt$

SEED

SEED annotations were acquired by a DIAMOND search of the predicted proteins against the SEED database, and the hierarchy of functional categories was mapped with the aid of the subsystems R package <https://github$

Initial ordination and clustering, as a sanity check on the data, were carried out with vegan::metaMDS on sqrt-transformed normalised read count data for the top 100,000+ gene profiles (by total abundance) followed by km$
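The data preparation that precedes the ordination can be sketched as below. This is an assumption-laden illustration: the normalisation is taken to be total-sum scaling per sample, genes are ranked by their summed normalised abundance, and the function name is hypothetical. The metaMDS ordination and the subsequent clustering step are not reproduced here.

```python
import math


def top_n_sqrt_profiles(table, n):
    """Normalise, select the n most abundant genes, and sqrt-transform.

    table: {gene: [read count per sample]}, all rows of equal length.
    Returns {gene: [sqrt(proportion) per sample]} for the top n genes,
    ordered from most to least abundant.
    """
    if not table:
        return {}
    n_samples = len(next(iter(table.values())))
    # Per-sample totals, used for total-sum scaling (an assumed normalisation).
    totals = [sum(row[j] for row in table.values()) for j in range(n_samples)]
    normalised = {
        gene: [c / t if t else 0.0 for c, t in zip(row, totals)]
        for gene, row in table.items()
    }
    ranked = sorted(normalised, key=lambda g: sum(normalised[g]), reverse=True)
    return {g: [math.sqrt(x) for x in normalised[g]] for g in ranked[:n]}
```

The returned profiles correspond to the matrix that would be passed to an ordination routine, with genes as rows and samples as columns.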

old workflow below

metagenome

metagenome analysis workflow

AMP

AMP Metagenomic Workflow Description

Metagenome Processing Products Description and Purpose

1. Australian Reference Gene Catalogue (ARGC)

A. non-redundant catalogue of predicted gene sequences (> 200 nt, clustered at 95% nt identity) [clustered alongside the MarRef database, https://mmp.sfb.uit.no?]
B. gene abundance table (c.f. a zOTU table)
C. The ARGC can be annotated against any source database
D. contigs assembled for each discrete sample, with Taxonomic assignment
E. Predicted gene sequences and protein sequences (fasta files)

Tools

Trimmomatic
Fastqc
Spades/megahit
metagenemark
cd-hit
basta

See AMRGC_Readme.txt for a starting point. See metagenome workflow for details on the assembly and read mapping.

Discussion: The RGC structure mirrors the structure of the zOTU table and can be delivered to functional interfaces in .biom format and potentially utilise the same suite of ecological tools.
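To make the parallel with a zOTU table concrete, the sketch below assembles per-sample gene counts into a single genes-by-samples table, with zeros for genes not detected in a sample. It is an illustration only: the function name and the sample and gene IDs are hypothetical, and it does not write .biom format.

```python
def build_abundance_table(per_sample_counts):
    """Combine {sample: {gene: count}} into (samples, {gene: [count per sample]}).

    Rows are genes and columns are samples, mirroring a zOTU table layout.
    """
    samples = sorted(per_sample_counts)
    genes = sorted({g for counts in per_sample_counts.values() for g in counts})
    table = {g: [per_sample_counts[s].get(g, 0) for s in samples] for g in genes}
    return samples, table


# Hypothetical input: two samples with partially overlapping genes.
samples, table = build_abundance_table(
    {"S2": {"g1": 2}, "S1": {"g1": 1, "g2": 5}}
)
```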

The size of the database approaches 56,000,000 rows, but could be cut significantly by omitting singletons or short sequences (e.g. less than 500) if there was an appetite or need.
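A filter of this kind could look like the following sketch. Two assumptions are made that the text does not confirm: a singleton is taken to mean a catalogue entry whose cluster contains a single sequence, and the 500 threshold is taken to be a length in nucleotides. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogueEntry:
    gene_id: str
    length_nt: int      # length of the representative sequence (assumed nt)
    cluster_size: int   # number of sequences in the cluster (assumed definition)


def filter_catalogue(entries, min_length=500, drop_singletons=True):
    """Return entries that pass the length cut-off and, optionally, are not singletons."""
    kept = []
    for entry in entries:
        if entry.length_nt < min_length:
            continue
        if drop_singletons and entry.cluster_size == 1:
            continue
        kept.append(entry)
    return kept


# Made-up entries to show the effect of each criterion.
example = [
    CatalogueEntry("gene_a", 1200, 4),
    CatalogueEntry("gene_b", 300, 7),   # too short
    CatalogueEntry("gene_c", 900, 1),   # singleton
]
filtered = filter_catalogue(example)
```

Setting `min_length=0` or `drop_singletons=False` applies only one of the two criteria, which makes it easy to measure how much each would shrink the table.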

Ideally the gene abundances would be calculated with exactly the same methods as the EBI profiles, to allow intercomparison with global datasets. However, I can't find documentation on the process, so cannot comment on its accuracy or applicability to the AMP data.

Format: ideally a subsettable .biom file served from the Bioplatforms Portal

Annotation sources include: best hit and LCA taxonomic assignment against nt, COG/NOG, GO-slim, nr, Interproscan. (Searches against databases in bold have been done.) The others are not implemented.

2. SUPER-FOCUS functional characterisation - short-term functional classification and enumeration against the SEED database

Discussion: can be implemented quickly. Hierarchical organisation of functional classes at 4 levels. Raw or filtered reads (Fastq or Fasta) can be analysed.

Can be used to produce rapid results. A curated RGC with defined read mapping criteria and intercomparison with global datasets is preferred.

A. a table of results for all levels and all samples. See “output_all_levels_and_functions.xls”

Tools

SUPER-FOCUS

3. Metagenome Assembled Genomes (MAGs)

A. Regional coassemblies with metadata file (length, LCA, coverage)
B. Regional Assembled Genome bins with metadata (length, LCA, coverage, method of binning, contamination, completeness)

i) Bacterial bins
ii) Archaeal bins
iii) Viral bins - complete viral genomes
iv) Eukaryote bins

Tools

bowtie2
samtools
MetaBat
GroopM
Basta
contig assembler (Geneious, CAP3 etc.)
CheckM

Discussion: Some optimisation may be required to maximise the completeness and reduce contamination during the binning process.
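One way to track the effect of such optimisation is to keep the bin metadata listed above in a simple record and filter on completeness and contamination. The sketch below is illustrative only: the class, field and function names are hypothetical, and the default thresholds (at least 90% complete, at most 5% contamination) are placeholder values that are not specified by this workflow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BinRecord:
    bin_id: str
    length: int            # total bin length in bp
    lca: str               # lowest common ancestor assignment
    coverage: float
    binning_method: str
    completeness: float    # percent
    contamination: float   # percent


def select_bins(bins, min_completeness=90.0, max_contamination=5.0):
    """Keep bins meeting both thresholds (placeholder defaults, not workflow settings)."""
    return [
        b for b in bins
        if b.completeness >= min_completeness and b.contamination <= max_contamination
    ]


# Made-up records for illustration.
bins = [
    BinRecord("bin_001", 2_400_000, "Bacteria", 35.2, "MetaBat", 95.1, 1.8),
    BinRecord("bin_002", 1_100_000, "Archaea", 12.0, "GroopM", 62.4, 0.9),
    BinRecord("bin_003", 3_900_000, "Bacteria", 20.5, "MetaBat", 97.0, 11.3),
]
selected = select_bins(bins)
```

Comparing the number of bins that pass before and after a change in binning parameters gives a quick indication of whether the change helped.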

Format: ideally a bioproject on one of the database sites

Priorities

Test the complete workflow on datasets that are a priority for the community

YON
GAB
EAC - Amaranta MQ
Entire dataset - assembled contigs
