Run the MEGAHIT assembler

Tutorial by Titus Brown, Adina Howe, and Jihoon Yang

MEGAHIT is a very fast, quite good assembler designed for metagenomes.

First, install it:

cd
conda install -c bioconda -y megahit

MEGAHIT will then proceed to install and store its commands in such a way that you can access it.

Try typing:

megahit

You will now see the version and command line options for MEGAHIT. Note that MEGAHIT is looking for paired-end reads and a specified output directory.

Now, download some data to try on an assembly

cd
mkdir assembly
cd assembly
curl -o data.tar.gz -L https://ndownloader.figshare.com/files/16611230
tar -xzvf data.tar.gz
mv data/* .

Now, run the assembler!

megahit --12 example.R1.fastq, example.R2.fastq

This will take about 5-10 minutes; at the end you should see output like this:

... 4930 contigs, total 9642451 bp, min 201 bp, max 212315 bp, avg 1955 bp, N50 7771 bp
... ALL DONE. Time elapsed: 300 seconds

The output assembly will be in megahit_out/final.contigs.fa.

After the assembly is finished

At this point we can do a bunch of things:

annotate the assembly;
evaluate the assembly's inclusion of k-mers and reads;
set up a BLAST database so that we can search it for genes of interest;
bin the contigs in the assembly into species bins;

Annotation with BLAST

Install BLAST

conda install -y -c bioconda blast

Let's look at two blast commands

makeblastdb -h
blastn -h

First, let's make our blast index database file from our reference database, ref.fa

mkdir blast
cd blast
cp ../ref.fa .
makeblastdb -in ref.fa -dbtype nucl

Let's run a blast

blastn -db ref.fa -query ../megahit_out/final.contigs.fa | less

Use the spacebar to scroll through -- that's a lot of output! Let's put it in a file...

Let's try this again, where now we specify a specific output type and filename.

blastn -db ref.fa -query ../megahit_out/final.contigs.fa -outfmt 6 -out contigs.x.ref.blastnout

Let's take a look at this file together. First, this is a nice key for BLAST outputs in tabular format.

What can we see in this output?

**Note, use BLAST settings with [caution]

Checking against References

Let's say that you have some reference genomes to compare to and would like to know the rperesentation of the contigs in these genomes. These three genomes are NC_003112.2, NC_004310.3, NC_009089.1 -- three well known pathogens. Let's format each genome into a blast database and run three different blasts against our assembly.

Wait...this seems like a lot of BLASTING!

Let's got back to the assembly directory --

cd ../

Let's give this a try

for x in NC*fasta; do echo $x; done

In this case, the echo command just "echos" the variable x within the for-loop. We can change that command to "blastn".

for x in NC*fasta; do makeblastdb -in $x -dbtype nucl; done

You should have 3 indexed databases for each reference file. How could you do a blast of each database against the contigs? Give it a try.

Here's the solution, with an extra parameter to speed things up

for x in NC*fasta; do blastn -db $x -query megahit_out/final.contigs.fa -outfmt 6 -out contigs.x.$x.blastnout -num_threads 6; done

This is a real pro-trick in doing bioinformatics swiftly!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

megahit.rst

megahit.rst

Run the MEGAHIT assembler

After the assembly is finished

Annotation with BLAST

Checking against References

Files

megahit.rst

Latest commit

History

megahit.rst

File metadata and controls

Run the MEGAHIT assembler

After the assembly is finished

Annotation with BLAST

Checking against References