
Falcon2Fastg is a tool for converting a FALCON assembly to FASTG format to visualize with Bandage


md5sam/Falcon2Fastg


Falcon2Fastg

This software converts the results of a PacBio assembly produced by FALCON into a FASTG graph that can be visualized using Bandage.

Usage

python Falcon2Fastg.py [--only-output=reads|contigs]

This can be run in the output directory of the FALCON assembly (2-asm-falcon). Before running, copy the preads4falcon.fasta file from the intermediate directory (1-preads_ovl) to the output directory (2-asm-falcon).

Falcon2Fastg needs the following 6 input files:

  • preads4falcon.fasta

  • sg_edges_list

  • utg_data (if --only-output is unset, or set to contigs)

  • ctg_paths (if --only-output is unset, or set to contigs)

  • p_ctg.fa (if --only-output is unset, or set to contigs)

  • p_ctg_tiling_path (if --only-output is unset, or set to contigs)
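The file list above can be checked before running the tool. The snippet below is a minimal sketch, not part of Falcon2Fastg: the function name `missing_inputs` and the two list constants are hypothetical, and it only assumes that the files sit directly in the directory where the tool is run.

```python
import os

# File names taken from the list above. The two groups mirror the
# "(if --only-output is unset, or set to contigs)" notes.
ALWAYS_REQUIRED = ["preads4falcon.fasta", "sg_edges_list"]
CONTIG_ONLY = ["utg_data", "ctg_paths", "p_ctg.fa", "p_ctg_tiling_path"]


def missing_inputs(directory, only_output=None):
    """Return the required input files that are absent from `directory`.

    only_output: None (flag unset), "reads", or "contigs".
    This helper is illustrative and does not exist in Falcon2Fastg.
    """
    required = list(ALWAYS_REQUIRED)
    if only_output != "reads":
        # Contig files are needed unless only the reads graph is requested.
        required += CONTIG_ONLY
    return [name for name in required
            if not os.path.isfile(os.path.join(directory, name))]


if __name__ == "__main__":
    missing = missing_inputs(".")
    if missing:
        print("Missing input files: " + ", ".join(missing))
    else:
        print("All input files found.")
```

Running it in 2-asm-falcon before the conversion gives an early hint if preads4falcon.fasta has not been copied over yet.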

Dependencies :

Biopython (available at http://biopython.org/wiki/Download)

pyfaidx (available at https://github.com/mdshw5/pyfaidx)

Quick installation of dependencies:

pip install biopython pyfaidx  # add --user if you don't have root

Output :

The output of the tool is two FASTG files (reads.fastg and contigs.fastg) that can be opened with Bandage.

Additionally, the tool produces a CSV file : ReadsInContigs.csv that can be loaded with Bandage. This labels the reads according to the contigs that they are a part of, along with the mapping position within the contig.

Alt text

Above is a sample Bandage visualization of a reads.fastg file generated by Falcon2Fastg from a FALCON assembly (a plant mitochondrial genome).

  • Each node is a read, and each node is represented as a colored strip (colors are random)
  • Edges represent the overlaps between reads found by FALCON (better viewed in the zoomed-in image below)
  • Only the edges used in the string graph ("G" flagged in sg_edges_list) are used by Falcon2Fastg to produce the output file.
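The edge filtering described in the last bullet can be sketched as follows. This is an illustration rather than the code used by Falcon2Fastg: the function name `string_graph_edges` is hypothetical, and the column layout (whitespace-separated fields, the two node names first and the flag last) is an assumption about sg_edges_list that should be checked against your own file. The example lines are made up.

```python
def string_graph_edges(lines):
    """Yield (source, target) node names for edges flagged "G".

    Assumed layout (not confirmed by the README): whitespace-separated
    columns, with the two node names first and the flag in the last column.
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        if fields[-1] == "G":
            yield fields[0], fields[1]


if __name__ == "__main__":
    # Made-up lines for illustration only.
    example = [
        "000000001:B 000000002:E 000000002 100 0 1500 99.0 G",
        "000000003:E 000000004:B 000000004 0 200 900 98.5 TR",
    ]
    for source, target in string_graph_edges(example):
        print(source, "->", target)
```

With a real file, the same generator can be fed an open file handle, for example `string_graph_edges(open("sg_edges_list"))`.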

Zooming in on a smaller set of nodes shows the edges in black, connecting the colored nodes :

Alt text

For benchmarking, Falcon2Fastg was run on the preads4falcon.fasta and sg_edges_list files produced from the E. coli test dataset provided with the FALCON install. Instructions on obtaining the dataset are here : https://github.com/PacificBiosciences/FALCON/wiki/Setup:-Complete-example

Execution of Falcon2Fastg took 2 minutes on a desktop computer (size of preads4falcon.fasta: 449 MB).

The figure below represents a visualization of this E. coli data.

Alt text

Contigs visualization

Falcon2Fastg can also be used to visualize the contigs produced by FALCON and the overlaps between them. The contig graph is written to contigs.fastg, which Falcon2Fastg outputs by default. To output only the reads graph, use the --only-output=reads parameter.

To test this visualization mode, we assembled Drosophila melanogaster reads available at:
https://github.com/PacificBiosciences/DevNet/wiki/Drosophila-sequence-and-assembly

The input file (dmel_FALCON_preassembled_reads.fasta) was 2.2 GB in size.

FALCON assembly parameters were not optimized, and were as follows :

length_cutoff = 3000, length_cutoff_pr = 6000, overlap_filtering_setting = --max_diff 100 --max_cov 100 --min_cov 20

The final p_ctg.fa file had 642 contigs with a total length of ~27 Mbp.

Execution of Falcon2Fastg took 5 minutes on a desktop computer (size of preads4falcon.fasta: 2.2 GB).

The figure below is the visualization of these D. melanogaster contigs (colors are random).

Alt text

Read density (approximate read coverage)

Bandage provides a way to visualize k-mer coverage, as reported by the assembler. As FALCON is a string graph assembler, it does not report such information. Ideally, to compute the coverage of a contig, one would need to re-map the reads back to the assembled contigs. Here, we report a simpler metric that is easy to compute from the output of FALCON.

Read density is calculated as (sum of the lengths of all reads used by FALCON to construct the contig) / (length of the contig). We believe that variation in read density reflects variation in coverage.
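The ratio defined above can be written as a short Python 3 function. This is a minimal sketch of the formula only: the function name `read_density` and the read lengths in the example are hypothetical, and the code does not reproduce how Falcon2Fastg collects the reads of a contig.

```python
def read_density(read_lengths, contig_length):
    """Sum of the lengths of the reads in a contig divided by the contig length.

    read_lengths: iterable of read lengths (in bases) used to build the contig.
    contig_length: length of the contig (in bases), must be positive.
    """
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")
    return sum(read_lengths) / contig_length


if __name__ == "__main__":
    # Hypothetical contig of 5,000 bases built from three reads.
    print(read_density([4000, 3000, 3000], 5000))  # prints 2.0
```

Because the numerator adds up whole read lengths, overlapping reads count more than once, which is why the value behaves like an approximate coverage.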

The figure below is a schematic of read density. The blue arrows represent reads that were used by FALCON to create the black (resp. red) contig. The contig above (black) has fewer reads within it; its read density is around 2.0. The contig below (red) has more reads within it; its read density is around 5.0.

Alt text

The figure below is the visualization of the same D. mel. contigs, colored by read density.

Alt text

Zooming in shows that bright red represents a higher read density (6.0x), while contigs colored black have a lower read density (2.0x).

Alt text

Memory Warning

The pyfaidx module is used to read an entire FASTA file into memory. If the size of your preads4falcon.fasta is greater than the amount of available RAM, it is advisable to run this computation on a server with greater available memory.
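A rough pre-flight check along these lines is sketched below. It is not part of Falcon2Fastg: both function names are hypothetical, the comparison against file size is only a coarse proxy for the memory the tool will need, and `total_physical_memory` reports total rather than free RAM and only works on systems that expose the `SC_PAGE_SIZE` and `SC_PHYS_PAGES` sysconf names (typically Linux).

```python
import os


def fits_in_memory(fasta_path, available_bytes):
    """Return True if the FASTA file is no larger than `available_bytes`."""
    return os.path.getsize(fasta_path) <= available_bytes


def total_physical_memory():
    """Total RAM in bytes, or None if the platform does not report it.

    Note: this is total memory, an upper bound on what is actually free.
    """
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return None


if __name__ == "__main__":
    path = "preads4falcon.fasta"
    ram = total_physical_memory()
    if ram is not None and os.path.isfile(path):
        if not fits_in_memory(path, ram):
            print("Warning: %s is larger than total RAM" % path)
```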

Caveats :

  • Reads within "contained" unitigs are not used in the calculation of Read density.

  • Read density is calculated by dividing the total length of all reads in the contig by the length of the contig (obtained from ctg_paths). Depending on the orientation, FALCON ignores either the first read or the last read while reporting a contig. Due to this, in the contigs.fastg file, the forward and rev_comp entries might have different read densities and different lengths.

Any large differences are mostly restricted to short contigs, where one very long read at either extremity can affect the length of the contig.

  • Read density is set to "1" for entries in reads.fastg, as this measure is only relevant for contigs.fastg.

Testing :

Please see the test/ directory for a small example dataset and output

FALCON can be installed following the instructions here : https://github.com/PacificBiosciences/FALCON/wiki/Setup:-Complete-example

Other tools

Additional tools for visualizing read overlap can be found in the utils directory. Please consult utils/README.md for details

License

This content is released under the MIT License. Please see LICENSE.md for details.

Authors

Primary author : Samarth Rangavittal, The Pennsylvania State University (szr165@psu.edu)

Rayan Chikhi, University of Lille 1

Jean-Stéphane Varré, University of Lille 1