
File Format Information

Volodymyr Kuleshov edited this page Aug 15, 2016 · 2 revisions

Architect takes as input:

  1. Genomic contigs in fasta format assembled using a standard (short-read) assembler.
  2. A mapping of read clouds to contigs in bam format.
  3. Optionally, a ground-truth alignment of contigs to the reference.
  4. Optionally, an alignment of paired-end reads to the contigs.

Inputs 2 and 3 must be converted to a special .containment format. The /scripts subfolder contains the scripts we used for TruSeq Synthetic Long Read data; other technologies may require different input processing scripts.

Input 4 is provided to Architect in .tsv format. scripts/pe-connections.py can generate this file, but more specialized scripts (of the kind included with paired-end or mate-pair scaffolders) will produce higher-quality edges, which in turn yield better assemblies.

The output of Architect is:

  1. A set of orderings in .fasta format, represented as regular scaffolds.
  2. A .ordering file that describes the ordering of the input contigs within each scaffold.

Input contigs/scaffolds in Fasta format

We follow the standard specification. Make sure that contig names are unique.
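
A quick way to check for duplicate contig names before running Architect is sketched below. It assumes that a contig's name is the first whitespace-delimited token after ">" on each header line; the function name and the sample records are made up for illustration.

```python
from collections import Counter


def duplicate_fasta_names(lines):
    """Return the sorted contig names that occur on more than one header line."""
    names = Counter()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:  # skip headers with an empty name
                names[tokens[0]] += 1
    return sorted(name for name, count in names.items() if count > 1)


# In-memory example; for a real file, pass an open file handle instead.
example = [">ctg_1 len=4\n", "ACGT\n", ">ctg_2\n", "GGCC\n", ">ctg_1\n", "TTAA\n"]
print(duplicate_fasta_names(example))
```

An empty list means every header name is unique.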

Containment format

A .containment file contains two types of records. The first column identifies the record type: W or R.

Well hit records (W)

When a read cloud from a well is found to align to a contig, we call this event a "hit" and mark it with a well hit record. The format is the following.

W    <vertex id>    <well id>    <start>    <end>

The start and end fields indicate the region of the contig to which the well was found to align.
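
As an illustration, a W record could be read as follows. This is a minimal sketch, not Architect's own parser: it assumes that fields are separated by whitespace, that ids can be kept as strings, and that start and end are integers. The names WellHit and parse_well_hit are hypothetical.

```python
from typing import NamedTuple


class WellHit(NamedTuple):
    vertex_id: str
    well_id: str
    start: int  # start of the contig region hit by the well
    end: int    # end of the contig region hit by the well


def parse_well_hit(line: str) -> WellHit:
    """Parse one W line of a .containment file (whitespace-separated, assumed)."""
    fields = line.split()
    if len(fields) != 5 or fields[0] != "W":
        raise ValueError(f"not a well hit record: {line!r}")
    _, vertex_id, well_id, start, end = fields
    return WellHit(vertex_id, well_id, int(start), int(end))


# Made-up example record.
print(parse_well_hit("W\tcontig_12\t7\t100\t2500"))
```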

Region records (R)

When analyzing the assembly of a genome with a known reference, it can be very useful to give Architect the true alignments of the input contigs to the known genome in the form of R-type records.

R    <vertex id>    <well id>    <chr>    <start>    <end>

The interval chr:start-end indicates where the sequence of a vertex aligns on the reference.

Architect keeps track of these intervals during the scaffolding process. This information can then be used to debug the scaffolding process.
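
The sketch below reads an R record using the field layout exactly as written above and rebuilds the chr:start-end interval. It assumes whitespace-separated fields and integer coordinates; Region, parse_region and interval_string are hypothetical names, not part of Architect.

```python
from typing import NamedTuple


class Region(NamedTuple):
    vertex_id: str
    well_id: str
    chrom: str
    start: int
    end: int


def parse_region(line: str) -> Region:
    """Parse one R line of a .containment file (whitespace-separated, assumed)."""
    fields = line.split()
    if len(fields) != 6 or fields[0] != "R":
        raise ValueError(f"not a region record: {line!r}")
    _, vertex_id, well_id, chrom, start, end = fields
    return Region(vertex_id, well_id, chrom, int(start), int(end))


def interval_string(region: Region) -> str:
    """Format the reference interval as chr:start-end."""
    return f"{region.chrom}:{region.start}-{region.end}"


# Made-up example record.
region = parse_region("R\tcontig_12\t7\tchr1\t5000\t9000")
print(interval_string(region))
```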

Generating containment files

You can generate .containment files with the bam2containment.py script.

usage: bam2containment.py [-h] -b BAM -c CONTAINMENT [-t THRESHOLD]
                             [-s SHIFT] [-m MAP]

optional arguments:
  -h, --help            show this help message and exit
  -b BAM, --bam BAM
  -c CONTAINMENT, --containment CONTAINMENT
  -t THRESHOLD, --threshold THRESHOLD
  -s SHIFT, --shift SHIFT
  -m MAP, --map MAP

The script determines the well id associated with each aligned read in one of two ways. First, you can provide a map of read ids to wells as a separate file (using the --map flag); however, this is very slow. A better option is to encode the well id directly into the bam record. Currently, if no map file is provided, the script assumes that read names start with well%d_ and uses that number as the well id.
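
The read-name convention described above can be parsed along these lines. This is a sketch of the convention only, not the code in bam2containment.py; the function name and the example read names are made up.

```python
import re
from typing import Optional

# Matches names such as "well42_..." and captures the number after "well".
WELL_PREFIX = re.compile(r"^well(\d+)_")


def well_id_from_read_name(read_name: str) -> Optional[int]:
    """Return the well id encoded at the start of a read name, or None."""
    match = WELL_PREFIX.match(read_name)
    return int(match.group(1)) if match else None


print(well_id_from_read_name("well42_read0001"))
print(well_id_from_read_name("read0001"))
```

Reads whose names do not follow the convention return None here, so a caller can decide whether to skip them or fall back to a map file.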

In general, it's easy to modify this script to use a different input format. Also, make sure to set the threshold for minimum number of reads for calling a well hit to a value that is suitable for your specific read cloud technology.
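
To illustrate what the read threshold controls, here is a minimal sketch that counts aligned reads per (contig, well) pair and keeps the pairs that reach the threshold. It is not the logic of bam2containment.py itself: the function name, the input shape, and the use of >= for the comparison are all assumptions.

```python
from collections import Counter


def call_well_hits(alignments, threshold):
    """Return the (contig, well id) pairs supported by at least `threshold` reads.

    `alignments` is an iterable with one (contig, well id) pair per aligned read.
    """
    counts = Counter(alignments)
    return {pair for pair, count in counts.items() if count >= threshold}


# Made-up data: well 3 has three reads on ctg1, well 5 has only one.
reads = [("ctg1", 3), ("ctg1", 3), ("ctg1", 3), ("ctg1", 5)]
print(call_well_hits(reads, threshold=2))
```

A higher threshold suppresses hits caused by a few stray reads, at the cost of missing wells that cover a contig only sparsely.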

TSV Format

The .tsv file encodes support from paired-end reads. The format is the following.

<TYPE>  <v1 id>   <v2 id>  <v1 connection>  <v2 connection>  <orientation>  <support>  <distance>

The type is currently always S, indicating a scaffold edge (from a paired-end alignment). In the future, we could support edges from contig overlaps (via an O type).

Vertex connections are either H or T (head or tail). The orientation is S or R (same or reverse). Support is the number of paired-end alignments that support the connection.
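
A record in this format could be read and validated as sketched below. The sketch assumes whitespace-separated fields (tabs in practice) and an integer support count, and it keeps the distance field as raw text because its type is not specified here. ScaffoldEdge and parse_edge are hypothetical names.

```python
from typing import NamedTuple


class ScaffoldEdge(NamedTuple):
    kind: str         # record type, currently always "S"
    v1: str
    v2: str
    v1_end: str       # "H" (head) or "T" (tail)
    v2_end: str       # "H" (head) or "T" (tail)
    orientation: str  # "S" (same) or "R" (reverse)
    support: int      # number of supporting paired-end alignments
    distance: str     # kept as raw text (type not assumed)


def parse_edge(line: str) -> ScaffoldEdge:
    """Parse one line of the paired-end .tsv file."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}: {line!r}")
    kind, v1, v2, v1_end, v2_end, orientation, support, distance = fields
    if v1_end not in ("H", "T") or v2_end not in ("H", "T"):
        raise ValueError(f"connections must be H or T: {line!r}")
    if orientation not in ("S", "R"):
        raise ValueError(f"orientation must be S or R: {line!r}")
    return ScaffoldEdge(kind, v1, v2, v1_end, v2_end, orientation,
                        int(support), distance)


# Made-up example record.
print(parse_edge("S\tctg_a\tctg_b\tT\tH\tS\t14\t350"))
```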

Ordering format

The format of the ordering file is:

<ordering id>   <ctg_id1;containment_string1;orientation1>    <ctg_id2;containment_string2;orientation2>

A containment string, if available, indicates the regions of the (known) reference genome to which a contig maps. These are derived from the R records in the input containment file. This string can be used for validation; its format is ctg:start-end.

Finally, orientation is either R or S, indicating whether the contig came from the reverse or the forward strand, respectively.
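
One line of an .ordering file could be read as sketched below. The sketch assumes that entries are separated by whitespace, that each entry has exactly three semicolon-separated parts, and that a missing containment string appears as an empty middle part. The containment string is treated as opaque text. The function name and the example line are made up.

```python
def parse_ordering_line(line):
    """Return (ordering id, [(contig id, containment string or None, orientation)])."""
    fields = line.split()
    if len(fields) < 2:
        raise ValueError(f"expected an ordering id and at least one contig: {line!r}")
    ordering_id = fields[0]
    entries = []
    for field in fields[1:]:
        parts = field.split(";")
        if len(parts) != 3:
            raise ValueError(f"expected id;containment;orientation, got {field!r}")
        ctg_id, containment, orientation = parts
        if orientation not in ("S", "R"):
            raise ValueError(f"orientation must be S or R: {field!r}")
        # An empty containment string means no reference region is available.
        entries.append((ctg_id, containment or None, orientation))
    return ordering_id, entries


# Made-up example: the second contig has no containment string.
print(parse_ordering_line("1\tctg_a;chr1:100-900;S\tctg_b;;R"))
```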