Skip to content
Jason Weirather edited this page Dec 18, 2016 · 19 revisions
  1. Requirements
  2. Generating reports
  3. Extracting data from reports
  4. Comparing reports
  5. Indexing alignment files
  6. Outputs
  7. Analyzing short-read alignments
  8. Analyzing transcriptome alignments
  9. Analyzing genomic DNA alignments
  10. Running with limited resources

1. Requirements

Analyzing alignments/generating reports

  1. Linux (with coreutils 8.6 or better)
  2. python 2.7
  3. R

Viewing/accessing analysis reports

  1. Firefox/Chrome/IE browser on any OS or Linux Command Line Interface with python 2.7.

2. Generating reports

with alignqc analyze

alignqc analyze input_bam (-r REFERENCE|--no_reference) (-a ANNOTATION|--no_annotation) <-o OUTPUT>

Input alignment (required)

BAM format alignment file (required)

AlignQC has been thoroughly tested with GMAP and the -f samse output format, and AlignQC aim's to be fully supportive of features in the SAM specification from Samtools.

  • Unaligned reads should be included because they contribute to better understanding your overall mappability.
  • The query sequence is not required for basic mappability, but is required for error rate and error pattern analysis.
  • The BAM file does not need to be ordered.
  • Secondary alignments embedded in optional fields are not currently parsed (i.e. BWA-mem secondary alignments)
  • Run time and memory requirements for large BAM files (i.e. > 1M entries) are high, and are the highest priority for improving AlignQC.

Inputs recommended

Reference genome FASTA file (optional, recommended)

Specify this with the -r <my fasta> or --reference <my fasta> tag. This is necessary for error rate and error pattern analysis. The BAM file will also need the query sequences to be included. If this analysis is not desired, the --no_reference option will omit this part of the analysis.

GenePred format transcriptome annotation file (optional, recommend)

Specify this with the -a <my genepred> or --annotation <my genepred> tag. This is necessary for annotation based analyses. This includes identifying reference annotations, determining coverage of the reference transcriptome, full/partial-length match detection, bias analysis, and rarefraction curves. If these analysis are not desired, the --no_annotation option will omit this part of the analysis.

This GenePred (gpd) format is described by UCSC as "GenePred table format" and more specifically defined as "Gene Predictions and RefSeq Genes with Gene Names".

The eleven tab-separated values that make up each entry are: 1. Gene name, 2. Transcript name, 3. Chromosome name, 4. Strand (+/-), 5. Transcription start position (0-index), 6. Transcription end position (1-index), 7. CDS start position (0-index), 8. CDS end position (1-index), 9. Exon count, 10. Exon starts (comma-separated, 0-index), 11. Exon ends (comma-separated, 1-index)

Output option (at least one is required)

  • -o or --output is a "full" xhtml output of compressed results accessible through a browser or CLI.
  • --portable_output is a "portable", or minimal, xhtml output accessible only through a browser.
  • --output_folder is a folder containing all analysis results, and especially useful for large datasets.

Options - other

Temporary folder parameters (mutually exclusive)

  • --tempdir <my dir> A temporary folder will be created in <my dir>, and then deleted upon completion.
  • --specific_tempdir <my dir> The directory <my dir> will be created and used, and it will NOT be deleted upon completion. Default is --tempdir /tmp.

Performance parameters

  • --threads <my count> The number of threads to run, default is the multiprocessing.cpu_count().

Alignment plot parameters

  • --min_intron_size <my int> The minimum intron size when smoothing indels (default: 68)
  • --min_aligned_bases <my int> The minimum number of bases to consider an alignment as part of a multi-alignment solution (i.e. gapped or chimeric) (default: 50).
  • --max_query_overlap <my int> The maximum number of query bases to overlap when considering a gapped or chimeric alignment.
  • --max_target_overlap <my int> The maximum number of target bases to overlap when considering a gapped or trans-chimeric alignment.
  • --max_query_gap <my int> The maximum number of bases missing from the query when considering an alignment a gapped alignment (not used by default)
  • --max_target_gap <my int> The maximum number of bases in a target sequence for an alignment to be considered as part of a gapped alignment (default: 500000)
  • --required_fractional_improvement <my float> Require a multi-alignment solution (gapped or chimeric) to provide at least this fraction more of mapped bases to be considered a better solution than a single alignment (default: 0.2)

Alignment error parameters

  • --alignment_error_scale <ins_min> <ins_max float> <mismatch_min float> <mismatch_max float> <del_min float> <del_max float> Set the color scale manually for the alignment plot. (by default plots auto-scale)
  • --alignment_error_max_length <my int> The stopping point, or minimum number of aligned bases to calculate the error from. (default: 200000)

Context error parameters

  • --context_error_scale <ins_min> <ins_max float> <mismatch_min float> <mismatch_max float> <del_min float> <del_max float> Set the color scale manually for the alignment plot. (by default plots auto-scale)
  • --context_error_stopping_point <my int> The stopping point, or minimum number of times to observe each individual context before stopping. (default: 1000)

Rarefraction plot parameters:

  • --sample_per_xval <my int> The number of times to sample each xvalue (default: 10)

Rscript parameters:

  • --rscript_path <my path> The location of the Rscript executable if it is not installed. (default: Rscript)

3. Extracting data from reports

Data can simply be clicked and downloaded from full xhtml reports in a web browser however access through CLI can be more convenient and is supported.

with alignqc dump

alignqc dump input_xhtml (-l | -e EXTRACT)

Input (required)

xhtml "full" report from alignqc analyze is required

Option (mutually-exclusive required)

  • -l or --list will produce a list of available data
  • -e <my file> or --extract <my file> will extract <my file> to stdout or -o if specified.

Optional parameters

Temporary folder parameters (mutually exclusive)

  • --tempdir <my dir> A temporary folder will be created in <my dir>, and then deleted upon completion.
  • --specific_tempdir <my dir> The directory <my dir> will be created and used, and it will NOT be deleted upon completion. Default is --tempdir /tmp.

4. Comparing reports

with alignqc compare

5. Indexing alignment files

alignqc compare input_xhtml [input_xhtml ...] -o OUTPUT_DIR

Results from different runs can be compared. Right now the only output being generated is a comprehensive table of mappability and error patterns statistics. This can be very useful for comparing performance of multiple SMRT cells or Flowcells.

with alignqc index

alignqc index input_BAM [input_BAM ...]

AlignQC creates a gzipped index file to help randomly access the bam alignment. This file is stored as a .bgi file, and it is not necessary to create, since AlignQC analyze will create one at run time, but having it already made can speed up run time, and also may be useful for some analyses.

There is one line per reported alignment in the index, and these are the fields:

  1. Query name - Name of read
  2. Target range - Genomic range covered by query
  3. bgzf file bock start - Byte start of block
  4. bgzf inner block start - Byte start of entry within block
  5. Aligned base count - The number of mapped bases (indels removed)
  6. flag - The SAM line flag modified so that only one primary alignment is listed for each Query name

6. Outputs

There are a lot of outputs, so please see the Outputs page for more details.

7. Analyzing short-read alignments

Warning: AlignQC was purpose-built for long read alignment analysis. It is compatible with short reads, such as Illumina, but it is not necessarily efficient or as informative. Additionally the treatment of paired end reads has not been given great consideration.

If you desire to use the error pattern analysis for very low error rate data such as Illumina short reads you should set a higher --context_error_stopping_point and --alignment_error_max_length in the alignqc analysis command, and be advised that this will increase the memory required. If error pattern is the only analysis you are trying to run, randomly downsampling the total file will greatly sleep up the run time.

8. Analyzing transcriptome alignments

Long read transcript alignment analysis was the main purpose of building AlignQC. The annotation analysis is the slowest step but can provide some informative data on how well long reads are spanning the full-length of known transcripts. The error pattern analysis can help you assess the quality of your reads. And the mappability can help assess the overall quality of the alignments, and efficacy of upstream size selection steps.

alignqc analyze myalignment.bam -r mygenome.fa -a mygenepred.gpd -o report.xhtml --portable_output report.portable.xhtml

9. Analyzing genomic DNA alignments

Although AlignQC was purpose-built for transcriptome analysis, it can also be helpful for genomic DNA sequencing alignments. In this case, the -a/--annotation option, providing a transcriptome genePred will probably cause some lengthy and unnecessary analyses to take place, so you could omit these with the --no_annotation tag.

alignqc analyze myalignment.bam -r mygenome.fa --no_alignment -o report.xhtml --poratable_output report.portable.xhtml

This command will generate mappability reports and error pattern analysis, without the lengthy annotation analysis steps.

10. Running with limited resources

AlignQC's most basic analysis of a the bam file for mappability can run efficiently with few resources. Please note that by default, alignqc will take as many cores as the system makes available via the cpu_count() command in python. If you want to limit the alignqc analysis resource usage, you can restrict it to 1 core with the --thread 1 option. This is especially important if you are using the -a/--annotation analysis, because it is more resource intensive. Also note that the memory requirements of the error pattern analysis are proportional to --context_error_stopping_point and --alignment_error_max_length, which resolve error patterns at lower error rates when increased.