Tutorial

Daniel Huson's lab edited this page Apr 14, 2025 · 124 revisions

Tutorial slides

Background

In microbiome analysis, samples are subjected to metagenomic sequencing, and the three main computational questions are:

  • Who is out there? What is the taxonomic content of a sample?
  • What are they doing, or what can they do? What is the functional content of a sample?
  • How do they compare? Do changes in the taxonomic or functional content of samples reflect changes in the microbiome?

The key idea of the DIAMOND+MEGAN pipeline is to align sequencing reads or assembled contigs against a protein reference database using DIAMOND (Buchfink et al, 2015), and then to analyze the alignments to perform taxonomic and functional binning of sequences, using MEGAN (Huson et al, 2016) and/or the daa-meganizer tool.

Why use protein alignments? DNA alignments can certainly be used to identify known genomes in a sample, for example in the context of known pathogen detection or in the analysis of well-studied environments (such as the human gut, for well-studied populations). However, for the analysis of unknown organisms from less well-studied environmental sources, protein alignment is more suitable due to the higher level of sequence conservation.

This plot shows that only a small part of the phylogenetic diversity estimated to exist in the environment is represented by full DNA sequences in genomic databases:

Phylogenetic diversity

In our approach, DNA sequencing reads or contigs are translated into protein sequences and aligned to protein reference sequences in a "translated alignment":

Translated alignment

Metagenomic sequencing projects can involve hundreds of samples each containing tens of millions of sequences. The NCBI-nr protein reference database contains over 500 million reference sequences. Thus, alignment-based metagenomic analysis is computationally demanding and the first steps are usually performed on a server or cluster.

The number of reference proteins is continuing to increase and thus alternative, smaller databases such as AnnoTree (Gautam et al, 2022) may be more suitable in the future:

NCBI-nr growth

To reduce computational load and the amount of required disk space, the DIAMOND+MEGAN pipeline is streamlined and produces only one output file for each input file, as indicated here:

DIAMOND+MEGAN pipeline

The two computationally demanding steps, alignment of sequences against a reference database (DIAMOND alignment), and then analysis of the resulting alignments (MEGANization), are usually run on a server, whereas the third step, interactive exploration and analysis of the results, is performed on a personal computer.

For purposes of this tutorial, we provide toy data and a toy reference database so that all three steps of the analysis can be run on a laptop.

Minimum requirements to run full tutorial

Although the DIAMOND+MEGAN pipeline is designed to be platform-independent, for the purpose of this tutorial, we recommend using a Linux or MacOS system for the initial steps of the analysis. This is because in metagenomic studies the initial steps are usually performed on a Linux server. The resulting data is then further analyzed on a personal computer using interactive analysis software (e.g. MEGAN) or scripting packages in R or Python, say.

If you are using a Windows-based system on your laptop, please run a Linux emulator. One approach is to set up a virtual machine with a Linux distribution such as Ubuntu, for example using VirtualBox, a widely-used virtualization software. VirtualBox allows you to create and manage virtual machines on your Windows operating system. For more information and to download VirtualBox, please visit VirtualBox.

Please note that Ubuntu is also available as Windows Subsystem for Linux (WSL). WSL allows you to install a complete Ubuntu terminal environment on your Windows machine, allowing you to run Linux applications under Windows (10 or 11). You can obtain WSL from the Windows Store or install it using PowerShell.

The first half of this tutorial (DIAMOND+MEGANization) assumes that you are running Linux, MacOS or Linux under Windows. Once these steps have been completed, the interactive analysis of data using MEGAN can be performed on Windows.

Installation

DIAMOND

You can download and install DIAMOND as follows:

Install with Conda

If Conda is installed on your system or virtual environment, you can easily install DIAMOND by using the following command:

conda create -n diamond -c conda-forge -c bioconda diamond=2.1.7
conda activate diamond

This command will fetch and install DIAMOND from the Bioconda channel, which provides a collection of bioinformatics software packages.

Install a precompiled executable

A Linux binary can be downloaded as follows:

wget http://github.com/bbuchfink/diamond/releases/download/v2.1.7/diamond-linux64.tar.gz
tar xzf diamond-linux64.tar.gz

On Ubuntu, you can also install DIAMOND as follows:

sudo apt install diamond-aligner

On MacOS, DIAMOND can also be installed using brew:

brew install diamond

If you installed DIAMOND using Conda or brew, you can verify that the installation was successful by typing:

diamond help

If you have downloaded and unpacked the binary, then please define a variable dpath that refers to the directory that contains the diamond binary. For example, if you have placed diamond in your home directory, then type this:

export dpath=~

In this case, you can verify that the installation was successful by typing:

$dpath/diamond help

Both commands will display the help information for DIAMOND, confirming that the installation was successful.
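The two cases above (DIAMOND on the PATH, or an unpacked binary referenced by dpath) can be folded into one small helper. This is only a sketch: the function name find_diamond and the variable DIAMOND are hypothetical and not part of DIAMOND itself.

```shell
# Print the path of the DIAMOND executable to use.
# Prefers a diamond found on the PATH (Conda, apt or brew install),
# otherwise falls back to $dpath/diamond (unpacked binary).
# find_diamond is a hypothetical helper name, not a DIAMOND command.
find_diamond() {
    if command -v diamond >/dev/null 2>&1; then
        command -v diamond
    elif [ -n "${dpath:-}" ] && [ -x "$dpath/diamond" ]; then
        printf '%s\n' "$dpath/diamond"
    else
        echo "diamond not found: install it or set dpath" >&2
        return 1
    fi
}
```

For example, DIAMOND=$(find_diamond) && "$DIAMOND" help should display the help information in either setup.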

For additional installation methods and options, please see the DIAMOND installation wiki page.

MEGAN

MEGAN is written in Java and contains its own Java runtime environment (so a separate installation of Java is neither required nor useful).

MEGAN is installed using an installer program that can be downloaded from the MEGAN6 Download Page. If the page is down, try this alternative.

MEGAN is available in two editions: In this tutorial we will use the open source Community Edition. There is also a more feature-rich and better-supported "Ultimate Edition" that is built upon the Community Edition and is licensed by Computomics GmbH.

Please select and use the installer that is appropriate for your system:

MEGAN_Community_unix_6_25_10.sh - Linux 
MEGAN_Community_unix_ARM_6_25_10.sh - Linux running on an Apple chip
MEGAN_Community_macos_6_25_10.dmg - MacOS
MEGAN_Community_windows-x64_6_25_10.exe - Windows

This tutorial is specifically designed with Linux or Mac OS X-based systems in mind. To complete the installation process, follow these steps:

For Linux-based systems, open your terminal and paste the commands provided below.

wget https://software-ab.cs.uni-tuebingen.de/download/megan6/MEGAN_Community_unix_6_25_10.sh
chmod +x MEGAN_Community_unix_6_25_10.sh
./MEGAN_Community_unix_6_25_10.sh

This will open the installer dialog. Follow the required steps to install MEGAN6.

For Mac OS X-based systems, download the installer file with the extension .dmg. Double-click on the downloaded file to open the installer dialog, and then follow the required steps to install MEGAN6. This will install MEGAN in your Applications folder.

For Windows, download the installer and run it. If you are running Windows and using Ubuntu from within Windows, then you will have to install MEGAN twice, once under Linux to access the daa-meganizer tool, and once under Windows to access the MEGAN program.

The installer will ask you how much memory to allow MEGAN to use. The more you allow, the faster the program will run. Ideally, allow 16G, but 8G should also work ok. For smaller values, the program might become unresponsive when faced with large data files. But do not exceed 3/4 of the physical memory of your machine.
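To find the 3/4 limit for your own machine, the following sketch may help. The function names are hypothetical; it assumes that physical memory can be read from /proc/meminfo on Linux or via sysctl hw.memsize on MacOS.

```shell
# Print three quarters of a memory size given in kB, in whole GB (rounded down).
three_quarters_gb() {
    echo $(( $1 * 3 / 4 / 1024 / 1024 ))
}

# Print the physical memory of this machine in kB.
# Assumption: Linux exposes MemTotal (in kB) in /proc/meminfo,
# and MacOS reports the size in bytes via "sysctl -n hw.memsize".
physical_mem_kb() {
    if [ -r /proc/meminfo ]; then
        awk '/^MemTotal:/ {print $2}' /proc/meminfo
    else
        echo $(( $(sysctl -n hw.memsize) / 1024 ))
    fi
}
```

Typing three_quarters_gb "$(physical_mem_kb)" prints an upper bound, in GB, for the value to enter in the installer. For a size of exactly 16 GB (16777216 kB), the result is 12.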

Tutorial dataset

Application of the DIAMOND+MEGAN pipeline in microbiome analysis involves very large datasets of sequencing reads (hundreds of millions of reads) that are compared against a very large protein reference database (such as the NCBI-nr database, containing over 500 million sequences, or the NCBI-nr50 database, containing over 50 million sequences). For the purposes of this tutorial, we have produced twelve very small human gut samples of one million reads each, subsampled from data presented in (Willmann et al, 2015). We provide a tiny toy reference database tutorial-nr.gz that contains a very small portion of the NCBI-nr50 database. This data is small enough to be run on any laptop, while large enough to give some sense of what a metagenomic analysis looks like.

There are twelve samples, for two healthy subjects, Alice and Bob, taken on 6 different days, 0, 1, 3, 6, 8 and 34. During days 1-6, both subjects were treated with an antibiotic.

Download data

The tutorial data is available here: https://software-ab.cs.uni-tuebingen.de/download/megan6/tutorial/tutorial-DM.zip. If the server is down, obtain the file from here.

Under Linux, you can download it in a terminal window like this:

wget https://software-ab.cs.uni-tuebingen.de/download/megan6/tutorial/tutorial-DM.zip

The provided command will download the file named tutorial-DM.zip. To extract the contents of this file, you can use the following command:

unzip  tutorial-DM.zip

After extracting the "tutorial-DM.zip" file, you will find a folder named tutorial-DM/. This folder contains the following items:

  1. data/: This directory contains the data files that will be used for the alignment with DIAMOND.
  2. tutorial-nr.gz: This file is a database file that will be used by DIAMOND for the alignment of the files in the "data" folder.
  3. megan-map-tutorial.db: This file is a mapping file required by MEGAN for the meganization of the files generated by DIAMOND.

These files and directories will be used in the subsequent steps of the tutorial. Please ensure that you have downloaded and extracted the "tutorial-DM.zip" file correctly to have access to these resources.

tutorial-DM/
├── data
│    ├── Alice00-1mio.fq.gz
│    ├── Alice01-1mio.fq.gz
│    ├── Alice03-1mio.fq.gz
│    ├── Alice06-1mio.fq.gz
│    ├── Alice08-1mio.fq.gz
│    ├── Alice34-1mio.fq.gz
│    ├── Bob00-1mio.fq.gz
│    ├── Bob01-1mio.fq.gz
│    ├── Bob03-1mio.fq.gz
│    ├── Bob06-1mio.fq.gz
│    ├── Bob08-1mio.fq.gz
│    ├── Bob34-1mio.fq.gz
│    └── metadata.txt
├── tutorial-nr.gz
└── megan-map-tutorial.db

Next, navigate into the "tutorial-DM" directory:

cd tutorial-DM
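
Before continuing, you may want to confirm that the archive was extracted completely. The following sketch checks for the files shown in the listing above; the function name check_tutorial_files is hypothetical.

```shell
# Report any tutorial file that is missing from the given directory
# (default: the current directory). Returns non-zero if something is missing.
# check_tutorial_files is a hypothetical helper, not part of DIAMOND or MEGAN.
check_tutorial_files() {
    dir=${1:-.}
    missing=0
    for subject in Alice Bob; do
        for day in 00 01 03 06 08 34; do
            f="$dir/data/${subject}${day}-1mio.fq.gz"
            [ -f "$f" ] || { echo "missing: $f" >&2; missing=1; }
        done
    done
    for f in data/metadata.txt tutorial-nr.gz megan-map-tutorial.db; do
        [ -f "$dir/$f" ] || { echo "missing: $dir/$f" >&2; missing=1; }
    done
    return "$missing"
}
```

Run check_tutorial_files from within the tutorial-DM directory; it prints nothing if all fifteen files are present.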

Data pre-processing

If you are using your own dataset for analysis, it is important to perform quality control (QC) checks to ensure the reliability of your data. Tools such as Trimmomatic or other QC software can be used to assess the quality of your sequencing reads and perform necessary trimming or filtering steps. QC checks help identify and remove low-quality or unreliable data, improving the accuracy and reliability of downstream analysis. It is recommended to carefully evaluate the quality of your data and apply appropriate QC measures before proceeding with further analysis.
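
As a very light sanity check (not a replacement for a dedicated QC tool), it can be useful to count the reads in a file before and after trimming or filtering. The sketch below assumes the common FASTQ layout of four lines per read; the function name count_reads is hypothetical.

```shell
# Print the number of reads in a gzip-compressed FASTQ file.
# Assumption: each read occupies exactly four lines (no wrapped sequences).
count_reads() {
    gzip -dc "$1" | awk 'END { print int(NR / 4) }'
}
```

For the tutorial samples, count_reads data/Alice00-1mio.fq.gz should report about one million reads, given that each sample was subsampled to that size.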

DIAMOND+MEGAN pipeline

The DIAMOND+MEGAN pipeline consists of three steps:

  • Use DIAMOND to align all reads against a protein reference database.
  • Use the MEGAN tool daa-meganizer to perform taxonomic and functional analysis of all resulting alignments.
  • Use MEGAN to interactively explore and analyze the data

First step: DIAMOND alignment

To perform DIAMOND alignment, we first need to compute an index and then we can run the program on the input datasets.

Index generation

To begin the DIAMOND + MEGAN pipeline, the first step is to generate an index for the database file using DIAMOND. Activate the conda environment in which you have installed DIAMOND. If you have downloaded the DIAMOND binary directly, use the absolute path to the executable.

If you are using an installed version of DIAMOND (e.g. using conda), then type:

diamond makedb --in tutorial-nr.gz --db tutorial-nr

If you are using a downloaded binary version of DIAMOND, then type (here, $dpath is the variable, defined above, that refers to the directory containing the diamond binary):

$dpath/diamond makedb --in tutorial-nr.gz --db tutorial-nr

Here we use the makedb mode of DIAMOND. The input file is specified using the --in option, and the name of the index file to generate is given using the --db option.

Running this command will generate the file "tutorial-nr.dmnd" in your tutorial-DM directory.

Files alignment

First, create an output directory out. We will use this directory to store the DIAMOND-generated alignment files.

mkdir out

To align a single file using an installed version of DIAMOND (e.g. using conda), type:

diamond blastx --db tutorial-nr -q data/Alice00-1mio.fq.gz -o out/Alice00-1mio.daa -f 100 --masking 0 

If you are using a downloaded binary version of DIAMOND, then type:

$dpath/diamond blastx --db tutorial-nr -q data/Alice00-1mio.fq.gz -o out/Alice00-1mio.daa -f 100 --masking 0 

We use the blastx mode of DIAMOND to perform translated alignment. The --db option specifies the index file generated in the previous step, the -q option specifies the query file "Alice00-1mio.fq.gz", and the -o option specifies the output file. To ensure that the output is suitable for further use by MEGAN, we set the -f option to 100, which selects the DIAMOND Alignment Archive (DAA) format. In our run, the alignment of this file took approximately two minutes. (We set --masking to 0 to speed up the calculation; for real analyses, this is usually not recommended.)
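
The short options used above also have long forms (to our understanding of the DIAMOND command-line options, --query for -q, --out for -o and --outfmt for -f), which can make scripts easier to read. The following sketch only prints the resulting command so that it can be inspected before running; the function name blastx_cmd is hypothetical.

```shell
# Print (but do not run) the blastx command for one query file,
# written with long option names.
# Assumption: --query, --out and --outfmt are the long forms of -q, -o and -f.
blastx_cmd() {
    sample=$(basename "$1" .fq.gz)
    printf 'diamond blastx --db tutorial-nr --query %s --out out/%s.daa --outfmt 100 --masking 0\n' "$1" "$sample"
}
```

For example, blastx_cmd data/Alice00-1mio.fq.gz prints the command for the first sample, with out/Alice00-1mio.daa as the output file.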

To align all files, we can iterate over them using a loop, like this (using an installed version):

for file in data/*.fq.gz
do
    # strip the directory and the .fq.gz suffix to name the output file
    ofile="out/$(basename "${file%.*.*}").daa"
    diamond blastx --db tutorial-nr -q "$file" -o "$ofile" -f 100 --masking 0
done

Or, like this, using a downloaded binary:

for file in data/*.fq.gz
do
    # strip the directory and the .fq.gz suffix to name the output file
    ofile="out/$(basename "${file%.*.*}").daa"
    $dpath/diamond blastx --db tutorial-nr -q "$file" -o "$ofile" -f 100 --masking 0
done
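
If the loop is interrupted, it can be convenient to rerun it without realigning samples that are already finished. The following is a sketch of one way to do this, not part of DIAMOND: the names daa_name, align_all and the variable DIAMOND are hypothetical, and the .part suffix is just a convention used here so that an unfinished output file is never mistaken for a complete one.

```shell
# Derive the DAA output name for a query file, e.g.
# data/Alice00-1mio.fq.gz -> out/Alice00-1mio.daa
daa_name() {
    printf 'out/%s.daa\n' "$(basename "${1%.*.*}")"
}

# Align every FASTQ file in data/, skipping samples whose DAA file already exists.
# DIAMOND is a hypothetical variable naming the executable (default: diamond).
align_all() {
    mkdir -p out
    for file in data/*.fq.gz; do
        ofile=$(daa_name "$file")
        if [ -s "$ofile" ]; then
            echo "skipping $file ($ofile exists)"
            continue
        fi
        # write to a temporary name first, rename only on success
        "${DIAMOND:-diamond}" blastx --db tutorial-nr -q "$file" -o "$ofile.part" \
            -f 100 --masking 0 && mv "$ofile.part" "$ofile"
    done
}
```

With an installed version, simply type align_all. With a downloaded binary, type DIAMOND=$dpath/diamond and then align_all.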

If you failed to run DIAMOND on the data, you can download the resulting files here:
https://software-ab.cs.uni-tuebingen.de/download/megan6/tutorial/diamond-out.zip or here.

Second step: MEGANization

MEGAN allows users to map and assign taxonomic and functional labels to sequences based on their alignment against a reference database. The output of DIAMOND alignment can be used as input for MEGAN, enabling the annotation of the sequences with taxonomic information. We call this process meganization.

MEGAN includes a useful tool called daa-meganizer, which is located in the tools directory of MEGAN. This tool is specifically designed for the meganization of DAA files. By utilizing the daa-meganizer, researchers can easily annotate and interpret their DAA files within the MEGAN software, enabling comprehensive taxonomic and functional analysis. Note that the tools directory is only available on Linux and MacOS.

For Linux, by default, MEGAN will be installed in a directory called megan in your home directory. In this case, type the following command to reference the location:

export megan=~/megan

For MacOS, by default, the program will be installed under /Applications/MEGAN. In this case, type:

export megan=/Applications/MEGAN

To perform meganization on a single file, you can use the following command:

$megan/tools/daa-meganizer -i out/Alice00-1mio.daa -mdb megan-map-tutorial.db

Here, we assume that you have installed MEGAN in your home directory under a folder named megan. If not, please adjust the folder path accordingly based on your installation. The -i flag is used to specify the input file (DAA), and the -mdb flag is used to specify the mapping file.

To perform meganization of all files, do this:

$megan/tools/daa-meganizer -i out/*.daa -mdb megan-map-tutorial.db
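
Before meganizing all files, it may be worth checking that every input sample actually produced a DAA file. The following sketch lists samples without a non-empty DAA file in out/; the function name missing_daa is hypothetical.

```shell
# List every sample in data/ that has no non-empty DAA file in out/.
# Prints nothing and returns 0 when all alignments are present.
# missing_daa is a hypothetical helper, not part of MEGAN.
missing_daa() {
    rc=0
    for file in data/*.fq.gz; do
        sample=$(basename "$file" .fq.gz)
        if [ ! -s "out/$sample.daa" ]; then
            echo "$sample"
            rc=1
        fi
    done
    return "$rc"
}
```

Run missing_daa from within the tutorial-DM directory; any sample name that it prints still needs to be aligned with DIAMOND.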

If you failed to meganize the 12 files, you can download the meganized files here:
https://software-ab.cs.uni-tuebingen.de/download/megan6/tutorial/meganizer-out.zip or here.

Third step: MEGAN interactive analysis

Now that we have completed the general steps for the DIAMOND+MEGAN pipeline, let's move on to the interactive analysis using the MEGAN GUI.

MEGAN is an interactive program for exploring and analyzing metagenomic datasets. It can be used to explore the binning of reads and contigs to taxonomic and functional classes, to visualize and to capture these assignments in many different ways, to compare multiple samples, and to perform advanced analysis tasks such as gene-centric alignment and assembly of reads.

Launching MEGAN

Once MEGAN has been installed on your system, the easiest way to launch it in interactive mode is to search for MEGAN on your system and then to click on megan.app or MEGAN, or this icon: startup

From the terminal, you can also type $megan/MEGAN &

At startup, the program will display a window like this:

startup window

Then select the File->Open... menu item and navigate to the tutorial-DM/out directory. If you have successfully run DIAMOND+MEGANizer on the input files, then you should be able to select an input file, such as Alice00-1mio.daa:

open file

Coordinating Windows and Ubuntu in Windows

If you are running the Ubuntu command-line app under Windows, then please launch MEGAN in Windows. To access the files that you have created using the Ubuntu app, open File Explorer on them by typing the following into the Windows 'Type here to search' bar:

\\wsl$\Ubuntu\home\<Your Ubuntu Username>

Here, replace <Your Ubuntu Username> by your chosen Ubuntu user name (which appears at the beginning of the prompt in Ubuntu). Navigate to the directory containing the meganized DAA files (should be tutorial-DM\out) and copy all meganized DAA files from there to a directory in your Windows system. Then you can open the copied files using MEGAN.
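
Alternatively, the copy can be made from within the Ubuntu terminal. This sketch assumes the default WSL setup, in which the Windows C: drive is mounted at /mnt/c; the function name copy_daa and the destination folder are hypothetical.

```shell
# Copy all DAA files from a source directory to a destination directory,
# creating the destination if needed.
# copy_daa is a hypothetical helper name.
copy_daa() {
    mkdir -p "$2" && cp "$1"/*.daa "$2"/
}
```

For example, type copy_daa ~/tutorial-DM/out "/mnt/c/Users/<Your Windows Username>/Documents/tutorial-DM-out", replacing <Your Windows Username> by your Windows user name.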

Basic interaction with the main taxonomy window

This will open the main taxonomy window:

Main taxonomy window

This shows too much detail, so use the Rank tool bar menu to collapse the taxonomy at the Phylum rank (say):

Phylum rank

This data looks like typical human gut data, consisting mainly of Firmicutes, Bacteroidetes and Proteobacteria.

And then uncollapse the Firmicutes node, say, to see more details there:

below Firmicutes

In the taxonomy window, nodes are scaled by the number of reads assigned to them. When you select a node, you will see two numbers, Assigned=34,479 and Summed=188,762, say, indicating that 34,479 reads have been assigned directly to the node and that 188,762 reads have been assigned to the node or any of its descendants in the taxonomy, as shown here:

Assigned and Summed

Not only does MEGAN allow you to see how many reads have been assigned to a particular node, you can also inspect the reads and see the alignments that they have:

Inspecting reads

You can save the reads and/or alignments associated with any set of nodes to a file for further analysis.

The program provides many ways of exporting (and importing) data:

Export dialog

Opening other classification windows

MEGAN also bins reads according to the GTDB taxonomy and you can open this viewer by pressing the GTDB toolbar button:

GTDB

You can collapse and uncollapse nodes, and inspect reads and alignments, in this window, too.

MEGAN provides a number of functional classification windows, such as EggNOG/COG, opened by pressing the COG toolbar button:

COG

The number of assigned reads in this view is very small, due to the fact that this is a tutorial dataset and we have used a tutorial database for alignment.

Comparing datasets

Use the File->Compare... menu item to open multiple samples together as a single document:

Compare dialog

This will show multiple samples together:

Comparison opened

The nodes can be shown as bar charts, pie charts or heat maps, and the values can be scaled linearly, by square root or logarithms. Here we are using a heat map and linear scale:

Heat map, linear scale

Working with metadata and the samples viewer:

Sample metadata is crucial for comparative analysis and such data can be loaded into MEGAN. There is a toy example of metadata in the file data/metadata.txt:

#SampleID	Day	Treatment	Subject
Alice00-1mio	0	0	Alice
Alice01-1mio	1	1	Alice
Alice03-1mio	3	1	Alice
Alice06-1mio	6	1	Alice
Alice08-1mio	8	0	Alice
Alice34-1mio	34	0	Alice
Bob00-1mio	0	0	Bob
Bob01-1mio	1	1	Bob
Bob03-1mio	3	1	Bob
Bob06-1mio	6	1	Bob
Bob08-1mio	8	0	Bob
Bob34-1mio	34	0	Bob

The first line is a header line, starting with the key word #SampleID and then followed by the names of all defined attributes, in this case Day, Treatment and Subject. Each subsequent line lists the name of a sample, Alice00-1mio etc (the sample file name without path or suffix), followed by the value for each of the named attributes. Note that entries are separated by tabs.
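
When preparing your own metadata file, a quick check of the format described above can catch common mistakes, such as spaces used instead of tabs. This is a sketch only; the function name check_metadata is hypothetical and the check is not performed by MEGAN itself.

```shell
# Check a metadata file in the format described above: the header must start
# with #SampleID and every row must have as many tab-separated fields as the header.
# Returns non-zero and prints a message for each problem found.
check_metadata() {
    awk -F '\t' '
        BEGIN { bad = 0 }
        NR == 1 {
            if ($1 != "#SampleID") { print "header must start with #SampleID"; bad = 1 }
            n = NF
            next
        }
        NF == 0 { next }   # ignore empty lines
        NF != n { printf "line %d: expected %d fields, found %d\n", NR, n, NF; bad = 1 }
        END { exit bad }
    ' "$1"
}
```

For the tutorial data, check_metadata data/metadata.txt should print nothing.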

Import the metadata file like this:

Import metadata

The metadata will be displayed in a samples viewer and can be used for a number of purposes. For example, one can select an attribute and then request that samples get colored by that attribute:

Samples viewer

These colors are then used in the taxonomy and classification viewers, but also in charts and PCoA or tree plots (shown below):

Colored by subject

The samples viewer provides a number of calculations on the samples, for example, one can compute new aggregated datasets, based on a selected attribute, here Alice vs Bob:

Selecting Alice vs Bob

This gives rise to a new comparison document:

Showing Alice vs Bob

The charts viewer:

MEGAN provides a charts viewer that offers 14 different types of charts. There are several options for customizing the charts, such as sorting, colors, scaling (linear, square root, logarithmic, or percentage), font sizes, etc. Click on the charts viewer toolbar button to select a type of chart to show:

image

This will open the charts viewer and will display the selected chart, applied to all selected nodes in the current taxonomy or classification viewer. (If no nodes are selected, all leaves will be selected.) For example, this is a word cloud chart:

Phylum-level word cloud

After uncollapsing the nodes in the taxonomy viewer to the genus level and then pressing the synchronize button at the end of the charts viewer's toolbar, the chart is updated to show more details:

Genus-level word cloud

Now, we want to collapse the taxonomy view at the rank of class and show a stacked bar chart using percentages and sorting the displayed taxa by increasing total value:

Class-level stacked bar chart

The cluster analysis viewer:

MEGAN provides a cluster analysis viewer that can be used to show the relationship between different samples (so-called beta-diversity) in a number of different ways (PCoA plot, hierarchical clustering, phylogenetic outlines) using a number of different ecological indices. Like the charts viewer, the calculation is based on the nodes selected in the taxonomy or other classification viewer and there is a sync button at the end of the toolbar to recompute the analysis after changing the selection. Here we show a PCoA plot based on the rank of genus, using JSD (Square-root Jensen-Shannon divergence) distances:

PCoA

This plot shows a clustering of samples by days affected by "treatment" (antibiotics were administered on days 1-6).

Here is the same data displayed using a "phylogenetic outline":

Phylogenetic outline

Megan-Server:

The first two steps of the DIAMOND+MEGAN pipeline (DIAMOND alignment and meganization) are usually performed on a server. The resulting files are very big. While MEGAN provides a command-line program to extract a small summary file from a meganized DAA file, for full exploration, access to the DAA file is required. To address this, MEGAN provides a program called megan-server (Gautam et al, 2023) that can be run on a server and provides online access to served files from within MEGAN.

We run an instance of Megan-Server and its address is preconfigured in MEGAN. You can access it (and any server that you are running) from the File->Open from Server menu item:

Megan Server

Here we provide access to the full datasets upon which the tutorial datasets are based (in directory Willmann-et-al-2015) and also to long-read datasets from several different studies (in directory long-reads).

Can you match samples to environments?

Use MeganServer to open these six long-read samples:

  • ERR3201932.daa (Bertrand et al, 2019),
  • ERR3561495.daa (Brandt et al, 2020),
  • SRR11268056.daa (Liem et al, 2021),
  • ERR3661022.daa (Overholt et al, 2021),
  • SRR11673963.daa (Singleton et al, 2021) and
  • DRR214963.daa (Yahara et al, 2021).

To select them all in the MeganServer dialog, enter the following search term, select regex and press "all" in the search dialog:

ERR3201932|ERR3561495|SRR11268056|ERR3661022|SRR11673963|DRR214963
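
The search term is a regular expression in which the vertical bars separate alternatives, so a file name matches if it contains any one of the six accessions. As a rough illustration outside of MEGAN (whose regex dialect may differ in other details), the same pattern can be tried with grep -E; the function name match_samples is hypothetical.

```shell
# Keep only the lines that contain one of the six accessions.
pattern='ERR3201932|ERR3561495|SRR11268056|ERR3661022|SRR11673963|DRR214963'
match_samples() {
    grep -E "$pattern"
}
```

For example, printf '%s\n' ERR3201932.daa ERR0000000.daa DRR214963.daa | match_samples prints the first and third names only (ERR0000000.daa is a made-up name used for illustration).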

If MeganServer is down, you can download a small MEGAN summary file of these six samples here, or all six files in full detail here (4 GB).

image

Then press the Compare button to open all six as a single comparison document, uncollapse the taxonomy to the rank of genus and then display the samples as word clouds:

Six word clouds

Each comes from a different environment. Can you match the samples to the following?

  1. biogas plant
  2. ground water
  3. human gut
  4. oral
  5. sea water
  6. waste water.

16S tutorial

There is a SILVA+MEGAN tutorial here.

References:

Gautam, A., Zeng, W. and Huson, D.H., 2023. DIAMOND+MEGAN Microbiome Analysis. In Metagenomic Data Analysis (pp. 107-131). New York, NY: Springer US.

Gautam, A., Zeng, W. and Huson, D.H., 2023. MeganServer: facilitating interactive access to metagenomic data on a server. Bioinformatics, 39(3), p.btad105.

Gautam, A., Felderhoff, H., Bağcı, C. and Huson, D.H., 2022. Using AnnoTree to get more assignments, faster, in DIAMOND+MEGAN microbiome analysis. mSystems, 7(1), pp.e01408-21.

Bağcı, C., Patz, S. and Huson, D.H., 2021. DIAMOND+MEGAN: fast and easy taxonomic and functional analysis of short and long microbiome sequences. Current Protocols, 1(3), p.e59.

Arumugam, K., Bağcı, C., Bessarab, I., Beier, S., Buchfink, B., Górska, A., Qiu, G., Huson, D.H. and Williams, R.B., 2019. Annotated bacterial chromosomes from frame-shift-corrected long-read metagenomic data. Microbiome, 7(1), pp.1-13.

Huson, D.H., Albrecht, B., Bağcı, C., Bessarab, I., Gorska, A., Jolic, D. and Williams, R.B., 2018. MEGAN-LR: new algorithms allow accurate binning and easy interactive exploration of metagenomic long reads and contigs. Biology direct, 13(1), pp.1-17.

Huson, D.H., Tappu, R., Bazinet, A.L., Xie, C., Cummings, M.P., Nieselt, K. and Williams, R., 2017. Fast and simple protein-alignment-guided assembly of orthologous gene families from microbiome sequencing reads. Microbiome, 5, pp.1-10.

Buchfink, B., Xie, C. and Huson, D.H., 2015. Fast and sensitive protein alignment using DIAMOND. Nature methods, 12(1), pp.59-60.

Huson, D.H., Beier, S., Flade, I., Górska, A., El-Hadidi, M., Mitra, S., Ruscheweyh, H.J. and Tappu, R., 2016. MEGAN Community Edition - interactive exploration and analysis of large-scale microbiome sequencing data. PLoS Computational Biology, 12(6), p.e1004957.

Huson, D.H., Mitra, S., Ruscheweyh, H.J., Weber, N. and Schuster, S.C., 2011. Integrative analysis of environmental sequences using MEGAN4. Genome Research, 21(9), pp.1552-1560.

Huson, D.H., Auch, A.F., Qi, J. and Schuster, S.C., 2007. MEGAN analysis of metagenomic data. Genome Research, 17(3), pp.377-386.