Tutorial
- Tutorial slides
- Background
- Minimum requirements to run full tutorial
- Installation
- Tutorial dataset
- DIAMOND+MEGAN pipeline
- First step: DIAMOND alignment
- Second step: MEGANization
- Third step: MEGAN interactive analysis
- Megan-Server
- April 2025 tutorial slides can be downloaded here: Introduction-to-DIAMOND+MEGAN-April2025.pdf.
- September 2023 tutorial slides can be downloaded here: Tutorial-DIAMOND+MEGAN-Sep2023.pdf or here.
- July 2023 (ISMB) tutorial slides can be downloaded here: Tutorial-DIAMOND+MEGAN_ISMB-July2023.pdf or here.
In microbiome analysis, samples are subjected to metagenomic sequencing and three main computational questions are:
- Who is out there? What is the taxonomic content of a sample?
- What are they doing, or what can they do? What is the functional content of a sample?
- How do they compare? Do changes in the taxonomic or functional content of samples reflect changes in the microbiome?
The key idea of the DIAMOND+MEGAN pipeline is to align sequencing reads or assembled contigs against a protein reference database using DIAMOND (Buchfink et al, 2015), and then to analyze the alignments to perform taxonomic and functional binning of sequences, using MEGAN (Huson et al, 2016) and/or the daa-meganizer tool.
Why use protein alignments? DNA alignments can certainly be used to identify known genomes in a sample, for example in the context of known pathogen detection, or in the analysis of well-studied environments (such as the human gut, for well-studied populations). However, for the analysis of unknown organisms from less well-studied environmental sources, protein alignment is more suitable due to the higher level of sequence conservation.
This plot shows that only a small part of the phylogenetic diversity estimated to exist in the environment is represented by full DNA sequences in genomic databases:

In our approach, DNA sequencing reads or contigs are translated into protein sequences and aligned to protein reference sequences in a "translated alignment":

Metagenomic sequencing projects can involve hundreds of samples each containing tens of millions of sequences. The NCBI-nr protein reference database contains over 500 million reference sequences. Thus, alignment-based metagenomic analysis is computationally demanding and the first steps are usually performed on a server or cluster.
The number of reference proteins is continuing to increase and thus alternative, smaller databases such as AnnoTree (Gautam et al, 2022) may be more suitable in the future:

To reduce computational load and the amount of required disk space, the DIAMOND+MEGAN pipeline is streamlined and produces only one output file for each input file, as indicated here:

The two computationally demanding steps, alignment of sequences against a reference database (DIAMOND alignment), and then analysis of the resulting alignments (MEGANization), are usually run on a server, whereas the third step, interactive exploration and analysis of the results, is performed on a personal computer.
For purposes of this tutorial, we provide toy data and a toy reference database so that all three steps of the analysis can be run on a laptop.
Although the DIAMOND+MEGAN pipeline is designed to be platform-independent, for the purpose of this tutorial, we recommend using a Linux or MacOS system for the initial steps of the analysis. This is because in metagenomic studies the initial steps are usually performed on a Linux server. The resulting data is then further analyzed on a personal computer using interactive analysis software (e.g. MEGAN) or scripting packages in R or Python, say.
If you are using a Windows-based system on your laptop, please run a Linux emulator. One approach is to set up a virtual machine with a Linux distribution such as Ubuntu, for example using VirtualBox, a widely-used virtualization software. VirtualBox allows you to create and manage virtual machines on your Windows operating system. For more information and to download VirtualBox, please visit VirtualBox.
Please note that Ubuntu is also available as Windows Subsystem for Linux (WSL). WSL allows you to install a complete Ubuntu terminal environment on your Windows machine, allowing you to run Linux applications under Windows (10 or 11). You can obtain WSL from the Windows Store or install it using PowerShell.
The first half of this tutorial (DIAMOND+MEGANization) assumes that you are running Linux, MacOS or Linux under Windows. Once these steps have been completed, the interactive analysis of data using MEGAN can be performed on Windows.
You can download and install DIAMOND as follows:
If Conda is installed on your system or virtual environment, you can easily install DIAMOND by using the following command:
conda create -n diamond -c conda-forge -c bioconda diamond=2.1.7
conda activate diamond
These commands create and activate a Conda environment named diamond, installing DIAMOND from the Bioconda channel, which provides a collection of bioinformatics software packages.
A Linux binary can be downloaded as follows:
wget http://github.com/bbuchfink/diamond/releases/download/v2.1.7/diamond-linux64.tar.gz
tar xzf diamond-linux64.tar.gz
On Ubuntu, you can also install DIAMOND as follows:
sudo apt install diamond-aligner
On MacOS, DIAMOND can also be installed using brew:
brew install diamond
If you installed DIAMOND using Conda or brew, you can verify that the installation was successful by typing:
diamond help
If you have downloaded and unpacked the binary, then please define a variable dpath that refers to the directory that contains the diamond binary. For example, if you have placed diamond in your home directory, then type this:
export dpath=~
In this case, you can verify that the installation was successful by typing:
$dpath/diamond help
Both commands will display the help information for DIAMOND, confirming that the installation was successful.
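The two cases above (DIAMOND on the PATH, or a downloaded binary in $dpath) can be folded into a single check. The following is only a minimal sketch and not part of the tutorial itself: the function name find_diamond is made up for illustration, and it assumes a POSIX-compatible shell.

```shell
# Print the DIAMOND executable to use: prefer one found on PATH,
# otherwise fall back to the downloaded binary in $dpath (if set).
# find_diamond is a hypothetical helper name, not a DIAMOND command.
find_diamond() {
  if command -v diamond >/dev/null 2>&1; then
    command -v diamond
  elif [ -n "${dpath:-}" ] && [ -x "$dpath/diamond" ]; then
    printf '%s\n' "$dpath/diamond"
  else
    echo "diamond not found on PATH or in \$dpath" >&2
    return 1
  fi
}
```

With this in place, "$(find_diamond)" help should behave like either of the two verification commands above, whichever installation method you used.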
For additional installation methods and options, please see the DIAMOND installation wiki page.
MEGAN is written in Java and contains its own Java runtime environment (so a separate installation of Java is neither required nor useful).
MEGAN is installed using an installer program that can be downloaded from the MEGAN6 Download Page. If the page is down, try this alternative.
MEGAN is available in two editions: In this tutorial we will use the open source Community Edition. There is also a more feature-rich and better-supported "Ultimate Edition" that is built upon the Community Edition and is licensed by Computomics GmbH.
Please select and use the installer that is appropriate for your system:
MEGAN_Community_unix_6_25_10.sh - Linux
MEGAN_Community_unix_ARM_6_25_10.sh - Linux running on an Apple chip
MEGAN_Community_macos_6_25_10.dmg - MacOS
MEGAN_Community_windows-x64_6_25_10.exe - Windows
This tutorial is specifically designed with Linux or Mac OS X-based systems in mind. To complete the installation process, follow these steps:
For Linux-based systems, open your terminal and paste the command provided below.
wget https://software-ab.cs.uni-tuebingen.de/download/megan6/MEGAN_Community_unix_6_25_10.sh
chmod +x MEGAN_Community_unix_6_25_10.sh
./MEGAN_Community_unix_6_25_10.sh
This will open the installer dialog. Follow the required steps to install MEGAN6.
For Mac OS X-based systems, download the installer file with the extension .dmg.
Double-click on the downloaded file to open the installer dialog, and then follow the required steps to install MEGAN6. This will install MEGAN in your Applications folder.
For Windows, download the installer and run it. If you are running Windows and using Ubuntu from within Windows, then you will have to install MEGAN twice, once under Linux to access the daa-meganizer tool, and once under Windows to access the MEGAN program.
The installer will ask you how much memory to allow MEGAN to use. The more you allow, the faster the program will run. Ideally, allow 16G, but 8G should also work ok. For smaller values, the program might become unresponsive when faced with large data files. But do not exceed 3/4 of the physical memory of your machine.
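The "3/4 of physical memory" rule is simple arithmetic. As a rough sketch (the helper name max_megan_mem_gb is hypothetical, and the input is assumed to be the total memory in kB, the unit that Linux reports as MemTotal in /proc/meminfo):

```shell
# Upper bound for MEGAN's memory setting: 3/4 of physical memory, in whole GB.
# $1 = total physical memory in kB (the unit of MemTotal in /proc/meminfo).
# max_megan_mem_gb is a hypothetical helper, not part of MEGAN.
max_megan_mem_gb() {
  echo $(( $1 * 3 / 4 / 1024 / 1024 ))
}

# Example call on Linux (left commented out so nothing runs implicitly):
# max_megan_mem_gb "$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)"
```

The result is rounded down to whole gigabytes, so it errs on the side of leaving memory for the rest of the system.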
Application of the DIAMOND+MEGAN pipeline in microbiome analysis involves very large datasets of sequencing reads (hundreds of millions of reads) that are compared against a very large protein reference database (such as the NCBI-nr database, containing over 500 million sequences, or the NCBI-nr50 database, containing over 50 million sequences). For the purposes of this tutorial, we have produced twelve very small human gut samples of one million reads each, subsampled from data presented in (Willmann et al, 2015).
We provide a tiny toy reference database, tutorial-nr.gz, that contains a very small portion of the NCBI-nr50 database.
This data is small enough to be run on any laptop, while large enough to give some sense of what a metagenomic analysis looks like.
There are twelve samples, for two healthy subjects, Alice and Bob, taken on 6 different days, 0, 1, 3, 6, 8 and 34. During days 1-6, both subjects were treated with an antibiotic.
The tutorial data is available here: https://software-ab.cs.uni-tuebingen.de/download/megan6/tutorial/tutorial-DM.zip. If the server is down, obtain the file from here.
Under Linux, you can download it in a terminal window like this:
wget https://software-ab.cs.uni-tuebingen.de/download/megan6/tutorial/tutorial-DM.zip
The provided command will download the file named tutorial-DM.zip. To extract the contents of this file, you can use the following command:
unzip tutorial-DM.zip
After extracting the "tutorial-DM.zip" file, you will find a folder named tutorial-DM/. This folder contains the following items:
- data/: This directory contains the data files that will be used for the alignment with DIAMOND.
- tutorial-nr.gz: This file is a database file that will be used by DIAMOND for the alignment of the files in the "data" folder.
- megan-map-tutorial.db: This file is a mapping file required by MEGAN for the meganization of the files generated by DIAMOND.
These files and directories are essential for the tutorial and will be used in the subsequent steps. Please ensure that you have downloaded and extracted the "tutorial-DM.zip" file correctly to have access to these resources.
tutorial-DM/
├── data
│ ├── Alice00-1mio.fq.gz
│ ├── Alice01-1mio.fq.gz
│ ├── Alice03-1mio.fq.gz
│ ├── Alice06-1mio.fq.gz
│ ├── Alice08-1mio.fq.gz
│ ├── Alice34-1mio.fq.gz
│ ├── Bob00-1mio.fq.gz
│ ├── Bob01-1mio.fq.gz
│ ├── Bob03-1mio.fq.gz
│ ├── Bob06-1mio.fq.gz
│ ├── Bob08-1mio.fq.gz
│ ├── Bob34-1mio.fq.gz
│ └── metadata.txt
├── tutorial-nr.gz
└── megan-map-tutorial.db
Next, navigate into the "tutorial-DM" directory:
cd tutorial-DM
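Before moving on, it can be worth confirming that the extracted folder matches the layout listed above. The sketch below is one way to do so; check_tutorial_layout is a hypothetical helper name, and the expected count of twelve read files is taken from the directory listing in this tutorial.

```shell
# Check that a directory looks like the unpacked tutorial-DM folder.
# $1 = directory to check (defaults to the current directory).
# check_tutorial_layout is a hypothetical helper, not a MEGAN or DIAMOND tool.
check_tutorial_layout() {
  dir=${1:-.}
  status=0
  for f in tutorial-nr.gz megan-map-tutorial.db data/metadata.txt; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $f" >&2
      status=1
    fi
  done
  # Count the read files; the tutorial ships twelve samples.
  n=0
  for f in "$dir"/data/*.fq.gz; do
    if [ -e "$f" ]; then
      n=$((n + 1))
    fi
  done
  if [ "$n" -ne 12 ]; then
    echo "expected 12 read files in data/, found $n" >&2
    status=1
  fi
  return $status
}
```

Called without arguments from inside tutorial-DM, the function returns success only if the database, the mapping file, the metadata file and all twelve read files are present.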
If you are using your own dataset for analysis, it is important to perform quality control (QC) checks to ensure the reliability of your data. Tools such as Trimmomatic or other QC software can be used to assess the quality of your sequencing reads and perform necessary trimming or filtering steps. QC checks help identify and remove low-quality or unreliable data, improving the accuracy and reliability of downstream analysis. It is recommended to carefully evaluate the quality of your data and apply appropriate QC measures before proceeding with further analysis.
The DIAMOND+MEGAN pipeline consists of three steps:
- Use DIAMOND to align all reads against a protein reference database.
- Use the MEGAN tool daa-meganizer to perform taxonomic and functional analysis of all resulting alignments.
- Use MEGAN to interactively explore and analyze the data.
To perform DIAMOND alignment, we first need to compute an index and then we can run the program on the input datasets.
To begin the DIAMOND + MEGAN pipeline, the first step is to generate an index for the database file using DIAMOND. Activate the conda environment in which you have installed DIAMOND. If you have downloaded the DIAMOND binary directly, use the absolute path to the executable.
If you are using an installed version of DIAMOND (e.g. using conda), then type:
diamond makedb --in tutorial-nr.gz --db tutorial-nr
If you are using a downloaded binary version of DIAMOND, then type (here, $dpath is a variable that refers to the directory that contains the diamond binary, as discussed above):
$dpath/diamond makedb --in tutorial-nr.gz --db tutorial-nr
Here we use the makedb mode of DIAMOND. The input file is specified using the --in option, and the name of the index file to generate is given using the --db option.
Running this command will generate the file "tutorial-nr.dmnd" in your tutorial-DM directory.
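Since the index only needs to be built once, the makedb call can be wrapped so that it is skipped when the .dmnd file already exists. This is a sketch under stated assumptions: build_index is a made-up function name, and DIAMOND is a hypothetical environment variable used here to name the executable (for example $dpath/diamond), defaulting to diamond.

```shell
# Build the DIAMOND index only if it does not exist yet.
# $1 = protein FASTA file, $2 = database name (DIAMOND writes $2.dmnd).
# DIAMOND is a hypothetical variable naming the executable; defaults to "diamond".
build_index() {
  if [ -f "$2.dmnd" ]; then
    echo "$2.dmnd already exists, skipping makedb"
    return 0
  fi
  "${DIAMOND:-diamond}" makedb --in "$1" --db "$2"
}
```

For this tutorial the call would be build_index tutorial-nr.gz tutorial-nr, which runs the same makedb command as above the first time and does nothing on later runs.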
First, create an output directory called out. We will use this directory to store the DIAMOND-generated alignment files.
mkdir out
To align a single file using an installed version of DIAMOND (e.g. using conda), type:
diamond blastx --db tutorial-nr -q data/Alice00-1mio.fq.gz -o out/Alice00-1mio.daa -f 100 --masking 0
If you are using a downloaded binary version of DIAMOND, then type:
$dpath/diamond blastx --db tutorial-nr -q data/Alice00-1mio.fq.gz -o out/Alice00-1mio.daa -f 100 --masking 0
We use the blastx mode of DIAMOND to perform translated alignment. The --db option specifies the index file generated in the previous step, the -q option specifies the query file (here "Alice00-1mio.fq.gz"), and the -o option specifies the output file. To ensure that the output format is suitable for further use by MEGAN, we set the -f option to 100, which corresponds to the DIAMOND Alignment Archive (DAA) format. It took approximately two minutes to complete the alignment for this file. (We set --masking 0 to speed up the calculation. Usually, do not do this.)
To align all files, we can iterate over them using a loop, like this (using an installed version):
for file in data/*.fq.gz
do
  ofile="out/$(basename "${file%.*.*}").daa"
  diamond blastx --db tutorial-nr -q "$file" -o "$ofile" -f 100 --masking 0
done
Or, like this, using a downloaded binary:
for file in data/*.fq.gz
do
  ofile="out/$(basename "${file%.*.*}").daa"
  "$dpath"/diamond blastx --db tutorial-nr -q "$file" -o "$ofile" -f 100 --masking 0
done
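The loops above derive each output name with the expansion ${file%.*.*}, which removes the shortest suffix matching ".*.*", that is, the last two dot-separated extensions (.fq.gz). The sketch below spells this out with the explicit suffix and prints the commands instead of running them, so the file naming can be checked first. The names daa_name and plan_alignments are made up for illustration.

```shell
# Map an input read file to its DAA output path:
# data/Alice00-1mio.fq.gz -> out/Alice00-1mio.daa
daa_name() {
  base=$(basename "$1")          # drop the directory part
  echo "out/${base%.fq.gz}.daa"  # drop the .fq.gz suffix, add .daa
}

# Dry run: print (rather than execute) the DIAMOND command for each input.
# $1 = directory holding the *.fq.gz files.
plan_alignments() {
  for file in "$1"/*.fq.gz; do
    [ -e "$file" ] || continue   # no matches: the glob stays literal, skip it
    echo "diamond blastx --db tutorial-nr -q $file -o $(daa_name "$file") -f 100 --masking 0"
  done
}
```

Running plan_alignments data inside tutorial-DM lists one command per sample, which is a cheap way to spot a wrong path before starting the alignments.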
If you failed to run DIAMOND on the data, you can download the resulting files here:
https://software-ab.cs.uni-tuebingen.de/download/megan6/tutorial/diamond-out.zip
or here.
MEGAN allows users to map and assign taxonomic and functional labels to sequences based on their alignment against a reference database. The output of DIAMOND alignment can be used as input for MEGAN, enabling the annotation of the sequences with taxonomic information. We call this process meganization.
MEGAN includes a tool called daa-meganizer, which is located in the tools directory of MEGAN. This tool is specifically designed for the meganization of DAA files. Once a DAA file has been meganized, it can be opened in MEGAN for taxonomic and functional analysis.
Note that the tools directory is only available on Linux and MacOS.
For Linux, by default, MEGAN will be installed in a directory called megan in your home directory. In this case, type the following command to reference the location:
export megan=~/megan
For MacOS, by default, the program will be installed under /Applications/MEGAN. In this case, type:
export megan=/Applications/MEGAN
To perform meganization
on a single file, you can use the following command:
$megan/tools/daa-meganizer -i out/Alice00-1mio.daa -mdb megan-map-tutorial.db
Here, $megan refers to the MEGAN installation directory defined above. If you have installed MEGAN elsewhere, please adjust the path accordingly. The -i option specifies the input file (DAA), and the -mdb option specifies the mapping file.
To perform meganization
of all files, do this:
$megan/tools/daa-meganizer -i out/*.daa -mdb megan-map-tutorial.db
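Because the wildcard out/*.daa simply passes along whatever files happen to exist, it can help to confirm beforehand that every input sample actually has a DAA file. A minimal sketch, with the hypothetical helper name missing_outputs:

```shell
# List input samples that have no (non-empty) DAA file yet.
# $1 = directory with *.fq.gz inputs, $2 = directory with *.daa outputs.
# missing_outputs is a hypothetical helper, not part of MEGAN or DIAMOND.
missing_outputs() {
  for file in "$1"/*.fq.gz; do
    [ -e "$file" ] || continue
    base=$(basename "$file")
    daa="$2/${base%.fq.gz}.daa"
    # -s: the file exists and is not empty
    [ -s "$daa" ] || echo "$daa"
  done
}
```

Running missing_outputs data out inside tutorial-DM prints nothing when all twelve alignments are present, and otherwise lists the DAA files that still need to be produced.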
If you failed to meganize the 12 files, you can download the meganized files here:
https://software-ab.cs.uni-tuebingen.de/download/megan6/tutorial/meganizer-out.zip
or here.
Now that we have completed the general steps for the DIAMOND+MEGAN pipeline, let's move on to the interactive analysis using the MEGAN GUI.
MEGAN is an interactive program for exploring and analyzing metagenomic datasets. It can be used to explore the binning of reads and contigs to taxonomic and functional classes, to visualize and to capture these assignments in many different ways, to compare multiple samples, and to perform advanced analysis tasks such as gene-centric alignment and assembly of reads.
Once MEGAN has been installed on your system, the easiest way to launch it in interactive mode is to search for MEGAN on your system and then to click on megan.app or MEGAN, or on the MEGAN icon.
From the terminal, you can also type
$megan/MEGAN &
At startup, the program will display a window like this:

Then select the File->Open... menu item and navigate to the tutorial-DM/out directory. If you have successfully run DIAMOND+MEGANizer on the input files, then you should be able to select an input file, such as Alice00-1mio.daa:

If you are running the Ubuntu command-line app under Windows, then please launch MEGAN in Windows. To access the files that you have created using the Ubuntu app, open File Explorer on them by typing the following into the Windows 'Type here to search' bar:
\\wsl$\Ubuntu\home\<Your Ubuntu Username>
Here, replace <Your Ubuntu Username> by your chosen Ubuntu user name (which appears at the beginning of the prompt in Ubuntu).
Navigate to the directory containing the meganized DAA files (this should be tutorial-DM\out) and copy all meganized DAA files from there to a directory in your Windows system. Then you can open the copied files using MEGAN.
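The copy can also be done from inside the Ubuntu terminal. The sketch below assumes the default WSL setup, in which the Windows C: drive is mounted at /mnt/c; the helper name copy_daa and the target folder are made up for illustration.

```shell
# Copy all DAA files from one directory to another, creating the target if needed.
# $1 = source directory (e.g. tutorial-DM/out), $2 = target directory.
# copy_daa is a hypothetical helper name.
copy_daa() {
  mkdir -p "$2" || return 1
  for f in "$1"/*.daa; do
    [ -e "$f" ] || continue   # nothing to copy if the glob did not match
    cp "$f" "$2"/ || return 1
  done
}
```

For example, copy_daa ~/tutorial-DM/out "/mnt/c/Users/<Your Windows Username>/Documents/megan-tutorial" would place the files in a Windows folder that MEGAN can open directly, where <Your Windows Username> has to be replaced by your own Windows user name.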
This will open the main taxonomy window:

This shows too much detail, so use the Rank tool bar menu to collapse the taxonomy at the Phylum rank (say):

This data looks like typical human gut data, consisting mainly of Firmicutes, Bacteroidetes and Proteobacteria.
And then uncollapse the Firmicutes node, say, to see more details there:

In the taxonomy window, nodes are scaled by the number of reads assigned to them. When you select a node, you will see two numbers, for example Assigned=34,479 and Summed=188,762, indicating that 34,479 reads have been assigned to the node itself and that 188,762 reads have been assigned to the node or any of its descendants in the taxonomy, as shown here:

Not only does MEGAN allow you to see how many reads have been assigned to a particular node, you can also inspect the reads and see the alignments that they have:

You can save the reads and/or alignments associated with any set of nodes to a file for further analysis.
The program provides many ways of exporting (and importing) data:

MEGAN also bins reads according to the GTDB taxonomy and you can open this viewer by pressing the GTDB
toolbar button:

You can collapse and uncollapse nodes, and inspect reads and alignments, in this window, too.
MEGAN provides a number of functional classification windows, such as EggNOG/COG, opened by pressing the COG
toolbar button:

The number of assigned reads in this view is very small, due to the fact that this is a tutorial dataset and we have used a tutorial database for alignment.
Use the File->Compare... menu item to open multiple samples together as a single document:

This will show multiple samples together:

The nodes can be shown as bar charts, pie charts or heat maps, and the values can be scaled linearly, by square root or logarithms. Here we are using a heat map and linear scale:

Sample metadata is crucial for comparative analysis and such data can be loaded into MEGAN.
There is a toy example of metadata in the file data/metadata.txt
:
#SampleID Day Treatment Subject
Alice00-1mio 0 0 Alice
Alice01-1mio 1 1 Alice
Alice03-1mio 3 1 Alice
Alice06-1mio 6 1 Alice
Alice08-1mio 8 0 Alice
Alice34-1mio 34 0 Alice
Bob00-1mio 0 0 Bob
Bob01-1mio 1 1 Bob
Bob03-1mio 3 1 Bob
Bob06-1mio 6 1 Bob
Bob08-1mio 8 0 Bob
Bob34-1mio 34 0 Bob
The first line is a header line, starting with the key word #SampleID, followed by the names of all defined attributes, in this case Day, Treatment and Subject. Each subsequent line lists the name of a sample, Alice00-1mio etc (the sample file name without path or suffix), followed by the value for each of the named attributes. Note that entries are separated by tabs.
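Since tabs and spaces look alike in most editors, a quick structural check of a metadata file can save some confusion. The following sketch uses awk to test the format described above (a #SampleID header and the same number of tab-separated fields on every line); check_metadata is a hypothetical helper name and not part of MEGAN.

```shell
# Check a metadata file: the header must start with #SampleID and every
# non-empty line must have as many tab-separated fields as the header.
# check_metadata is a hypothetical helper, not a MEGAN tool.
check_metadata() {
  awk -F '\t' '
    NR == 1 {
      if ($1 != "#SampleID") { bad = 1; exit }
      n = NF
      next
    }
    NF == 0 { next }              # ignore blank lines
    NF != n {
      printf "line %d: expected %d fields, found %d\n", NR, n, NF
      bad = 1
    }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}
```

Running check_metadata data/metadata.txt returns success for a well-formed file and reports the offending line numbers otherwise, for instance when a line was typed with spaces instead of tabs.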
Import the metadata file like this:

The metadata will be displayed in a samples viewer
and can be used for a number of purposes. For example, one can select an attribute and then request that samples get colored by that attribute:

These colors are then used in the taxonomy and classification viewers, but also in charts and PCoA or tree plots (shown below):

The samples viewer provides a number of calculations on the samples. For example, one can compute new aggregated datasets based on a selected attribute, here Alice vs Bob:

This gives rise to a new comparison document:

MEGAN provides a charts viewer that offers 14 different types of charts. There are several options for customizing the charts, such as sorting, colors, scaling (linear, square root, logarithmic, or percentage), font sizes, etc. Click on the charts viewer toolbar button to select a type of chart to show:

This will open the charts viewer and will display the selected chart, applied to all selected nodes in the current taxonomy or classification viewer. (If no nodes are selected, all leaves will be selected.) For example, this is a word cloud chart:

After uncollapsing the nodes in the taxonomy viewer to the genus level and then pressing the synchronize
button at the end of the charts viewer's toolbar, the chart is updated to show more details:

Now, we want to collapse the taxonomy view at the rank of class
and show a stacked bar chart
using percentages and sorting the displayed taxa by increasing total value:

MEGAN provides a cluster analysis viewer that can be used to show the relationship between different samples (so-called beta-diversity) in a number of different ways (PCoA plot, hierarchical clustering, phylogenetic outlines) using a number of different ecological indices. Like the charts viewer, the calculation is based on the nodes selected in the taxonomy or other classification viewer and there is a sync
button at the end of the toolbar to recompute the analysis after changing the selection. Here we show a PCoA plot based on the rank of genus, using JSD (Square-root Jensen-Shannon divergence) distances:

This plot shows a clustering of samples by days affected by "treatment" (antibiotics were administered on days 1-6).
Here is the same data displayed using a "phylogenetic outline":

The first two steps of the DIAMOND+MEGAN pipeline (DIAMOND alignment and meganization) are usually performed on a server.
The resulting files are very big. While MEGAN provides a command-line program to extract a small summary file from a meganized DAA file, full exploration requires access to the DAA file itself. To address this, MEGAN provides a program called megan-server (Gautam et al, 2023) that can be run on a server and provides online access to served files from within MEGAN.
We run an instance of Megan-Server and its address is preconfigured in MEGAN. You can access it (and any server that you are running) from the File->Open from Server
menu item:

Here we provide access to the full datasets upon which the tutorial datasets are based (in directory Willmann-et-al-2015) and also to long-read datasets from several different studies (in directory long-reads).
Use MeganServer to open these six long-read samples:
- ERR3201932.daa (Betrand et al, 2019),
- ERR3561495.daa (Brandt et al, 2020),
- SRR11268056.daa (Liem et al, 2021),
- ERR3661022.daa (Overholt et al, 2021),
- SRR11673963.daa (Singleton et al, 2021) and
- DRR214963.daa (Yahara et al, 2021).
To select them all in the MeganServer dialog, use this search term, select regex and press all in the search dialog:
ERR3201932|ERR3561495|SRR11268056|ERR3661022|SRR11673963|DRR214963
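The search term is a regular expression in which | separates alternatives, so it matches any file name containing one of the six accessions. As a quick sanity check outside of MEGAN, the same pattern can be tried with grep -E; this sketch assumes that MEGAN's regex search treats | the same way, and count_matches is a made-up helper name.

```shell
# The search term from above: six accessions joined by "|" (alternation).
pattern='ERR3201932|ERR3561495|SRR11268056|ERR3661022|SRR11673963|DRR214963'

# Count how many of the given file names match the pattern.
# count_matches is a hypothetical helper name.
count_matches() {
  printf '%s\n' "$@" | grep -E -c "$pattern"
}
```

Passing the six long-read file names listed above plus an unrelated name such as Alice00-1mio.daa should yield a count of six, since only the long-read samples contain one of the accessions.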
If MeganServer is down, you can download a small MEGAN summary file of these six samples here, or all six files in full detail here (4 GB).

Then press the Compare
button to open all six as a single comparison document, uncollapse the taxonomy to the rank of genus and then display the samples as word clouds:

Each sample comes from a different environment. Can you match the samples to the following?
- biogas plant
- ground water
- human gut
- oral
- sea water
- waste water.
There is a SILVA+MEGAN tutorial here.