ECTools: tools for error correction and working with long read data

Repository files (file name: last commit message, date):

  test: removed biopython dependency (Dec 7, 2013)
  CA_Makefile: deplex supports fastq (Jun 3, 2014)
  LICENSE.md: Update LICENSE.md (Dec 1, 2014)
  README: suggestions on how to use GNU parallel (Oct 10, 2014)
  __init__.py: initial commit (Aug 1, 2013)
  algorithm.py: extensions (Mar 10, 2014)
  alignment_verify.py: library update rename io (Mar 26, 2014)
  args.py: extensions (Mar 10, 2014)
  bases_outsideclr.py: initial commit (Aug 1, 2013)
  clearFastqDescription.py: clear a fastq description field (Mar 12, 2014)
  consensus.sh: added descriptions of scripts into the readme (Feb 20, 2014)
  correct.sh: histogram in schtats (Feb 28, 2014)
  cov.py: Updated coverage script (Jul 18, 2014)
  create_idmap.txt: library update rename io (Mar 26, 2014)
  deltacat.py: concatenate delta files (Mar 12, 2014)
  deplex_pb.py: deplex supports fastq (Jun 3, 2014)
  downsample.py: library update rename io (Mar 26, 2014)
  fastaToFastq.py: script to make fastq from fasta with fake quals (Mar 12, 2014)
  filter.py: information for generating frag files from corrected fa (Sep 30, 2013)
  filter_cluster.py: library update rename io (Mar 26, 2014)
  filter_out.py: updated (Nov 27, 2013)
  filterpairs.py: some new scripts (Apr 17, 2014)
  fix_delta_header.py: added gccontent graph (Dec 7, 2013)
  gc_count.py: library update rename io (Mar 26, 2014)
  gccontent.py: added gccontent graph (Dec 7, 2013)
  getSplitReads.py: schtats outputs spanning coverage (Apr 29, 2014)
  graphCellStats.py: minor changes to plotting scripts (Jun 16, 2014)
  graphpid.py: new graphing scripts for smrtcell throughput (Apr 7, 2014)
  group_by.py: scripts for looking at split reads across assembled unitigs (Aug 12, 2013)
  join.py: added descriptions of scripts into the readme (Feb 20, 2014)
  key2linerecord.sh: scripts for looking at split reads across assembled unitigs (Aug 12, 2013)
  kmer.py: library update rename io (Mar 26, 2014)
  layout2frgid.sh: added some scripts (Aug 1, 2013)
  length_filter.py: updated length filter (Apr 21, 2014)
  lengthhist.py: new graphing scripts for smrtcell throughput (Apr 7, 2014)
  log.py: extensions (Mar 10, 2014)
  m4io.py: deplex supports fastq (Jun 3, 2014)
  m4region.py: zmwstats a new scripts (Apr 24, 2014)
  misc.py: added support for reverse complement with gaps (Jul 15, 2014)
  nucalign.py: initial commit (Aug 1, 2013)
  nucio.py: Looking at alignment stats (Apr 25, 2014)
  partition.py: gzip compat (Mar 26, 2014)
  pb_correct.py: library update rename io (Mar 26, 2014)
  pb_filter_for_shred.py: library update rename io (Mar 26, 2014)
  pb_uncovered.py: library update rename io (Mar 26, 2014)
  plot_hist.py: zmwstats a new scripts (Apr 24, 2014)
  pre_delta_filter.py: library update rename io (Mar 26, 2014)
  qualgen.py: removed biopython dependency (Dec 7, 2013)
  readlength_verror.py: Looking at alignment stats (Apr 25, 2014)
  reference_segments.py: deplex supports fastq (Jun 3, 2014)
  reverseComplement.py: some new scripts (Apr 17, 2014)
  schtats: schtats outputs spanning coverage (Apr 29, 2014)
  seqio.py: gzip compat (Mar 26, 2014)
  sequenceToLine.py: simple seq to line script (Jun 27, 2014)
  simulate_reads.py: added descriptions of scripts into the readme (Feb 20, 2014)
  split_read_layout.py: scripts for looking at split reads across assembled unitigs (Aug 12, 2013)
  stats.py: printing bug for spanning coverage (May 13, 2014)
  strutil.py: histogram in schtats (Feb 28, 2014)
  tig-sense.py: added pb's consensus algorithm with my own modifications (Jun 12, 2014)
  tigiterate.sh: scripts for iteratively running the old unitigger (Jan 28, 2014)
  uniqreads.py: some new scripts (Apr 17, 2014)
  zmwProductivityHeatmap.py: simple seq to line script (Jun 27, 2014)
  zmwstats.py: schtats outputs spanning coverage (Apr 29, 2014)

README

Long Read Correction and Other Correction Tools

This package is a loose collection of scripts. To run the correction
routine, see the section below. Descriptions of the other scripts
are at the bottom of this file.

Contact: gurtowsk@cshl.edu

###
#Correction
###

In short, the correction algorithm takes as input the unitigs from a
short-read assembly and uses them to correct long-read data.
More background on the algorithm can be found at:
http://schatzlab.cshl.edu/presentations/2013-06-18.PBUserMeeting.pdf

Running Correction.

ECTOOLS_HOME=/path/to/this/directory

1. Create Celera unitigs from Illumina (or other high-identity) data
   
   Output: organism.utg.fasta


2. Create a new directory to house the correction
   
   $> mkdir organism_correct

3. Symlink your organism.utg.fasta into the correction dir

   $> cd organism_correct 
   $> ln -s /path/to/organism.utg.fasta

4. Generally we only correct reads longer than 1 kb, so filter out
   shorter reads if possible. If you have it, 20-30x of raw long-read
   coverage is preferred so that ~20x remains after correction.
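
   For reference, a minimal standalone length filter is sketched below.
   It assumes plain (possibly multi-line) fasta input; the file names
   here are hypothetical, and the bundled length_filter.py may take
   different arguments.

   # filter_by_length.py -- keep only reads of at least 1 kb
   import sys

   def read_fasta(path):
       # Yield (header, sequence) records from a fasta file
       header, seq = None, []
       with open(path) as fh:
           for line in fh:
               line = line.rstrip()
               if line.startswith(">"):
                   if header is not None:
                       yield header, "".join(seq)
                   header, seq = line, []
               else:
                   seq.append(line)
           if header is not None:
               yield header, "".join(seq)

   MIN_LEN = 1000  # 1 kb cutoff
   for header, seq in read_fasta(sys.argv[1]):
       if len(seq) >= MIN_LEN:
           print(header)
           print(seq)

   $> python filter_by_length.py pbreads.fa > pbreads.length_filtered.fa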

5. Partition PacBio reads into small batches

   $> python ${ECTOOLS_HOME}/partition.py 20 500 pbreads.length_filtered.fa

   Output: a series of directories and a ReadIndex.txt file which maps
   reads to partitions
   
   *Note: You probably only want a few reads 
   per file (20 is usually good)

6. Copy the correction script to the working dir
   
   $> cp ${ECTOOLS_HOME}/correct.sh .

7. Modify the global variables at the top of correct.sh

8. Ensure the nucmer suite (MUMmer) is in your PATH
   
   $> nucmer

9. Launch correction. The correct.sh script should be run in each of
   the partition directories. The number of files in each directory
   should be specified as the array parameter to Grid Engine. This
   simple for loop works well:
   
   $> for i in {0001..000N}; do cd $i; qsub -cwd -j y -t 1:${NUM_FILES_PER_PARTITION} ../correct.sh; cd ..; done
   
   Where N is the number of partitions (directories) and
   NUM_FILES_PER_PARTITION is the number of files per partition
   specified when running the partition script (500 in this case).


10. Wait for correction to complete. Concatenate the corrected output
    files from all partitions into a single fasta file:
   
   $> cat ????/*.cor.fa > organism.cor.fa


11. Run Celera on the output file.

   *Notes: 
   Use convert-fasta-to-v2.pl to make a Celera frg file from
   organism.cor.fa

   example:
   $> convert-fasta-to-v2.pl -l organism_pbcor -s organism.cor.fa \
      -q <( python ${ECTOOLS_HOME}/qualgen.py organism.cor.fa) \
      > organism.cor.frg
   

   Celera's default parameters work pretty well with corrected PacBio
   data; you can also drop utgGraphErrorRate to 0.01.

   Make sure you use OverlapBasedTrimming (it is on by default).
   

12. (Optional)
If SGE is not available in your compute environment, you can use
GNU parallel to run ECTools. This is not the recommended method,
since alignment can be very compute intensive, but for small genomes
(e.g. bacteria) it can be run on a single machine.

For each of the directories created by the partition script (0001..000N),
cd into the directory and run:

   $> for j in {1..500}; do echo "SGE_TASK_ID=$j TMPDIR=/tmp ../correct.sh"; done | \
     parallel -j <# of compute cores>


###
#Script Descriptions
###

@@ schtats @@
Calculates read length statistics for sequence files.
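The exact output format is best checked by running it, but the core
numbers (read count, total bases, max, N50) can be computed from the
read lengths alone, e.g.:

   def n50(lengths):
       # Smallest length L such that reads of length >= L together
       # contain at least half of all bases
       total = sum(lengths)
       running = 0
       for l in sorted(lengths, reverse=True):
           running += l
           if running * 2 >= total:
               return l
       return 0

   lengths = [1200, 5300, 800, 15000, 2400]  # hypothetical read lengths
   print("%d reads, %d bp total, max %d, N50 %d" %
         (len(lengths), sum(lengths), max(lengths), n50(lengths)))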


@@ simulate_reads.py @@
Simple read simulator. Takes a genome and a file of line-separated
read lengths; a read is produced for each given length at the
specified error rate.
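A rough sketch of the idea (not the script's actual code; the genome,
read lengths, and error rate are hardcoded here for illustration):

   import random

   def simulate_read(genome, length, error_rate):
       # Sample a random substring and add substitution errors
       start = random.randint(0, len(genome) - length)
       read = list(genome[start:start + length])
       for i in range(len(read)):
           if random.random() < error_rate:
               read[i] = random.choice("ACGT")  # may redraw the same base
       return "".join(read)

   genome = "ACGT" * 2500                          # stand-in for a real genome
   for i, length in enumerate([500, 1000, 2000]):  # one read per requested length
       print(">sim_read_%d" % i)
       print(simulate_read(genome, length, 0.15))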

@@ gccontent.py @@
Builds an ASCII graph of GC content for each record in a fasta file.
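The underlying calculation is just the G+C fraction per record, drawn
as a bar; for example (record names and sequences are made up):

   def gc_fraction(seq):
       seq = seq.upper()
       return float(seq.count("G") + seq.count("C")) / max(len(seq), 1)

   records = {"utg001": "ACGTGGCCAACGTGGGCC", "utg002": "ATATATCGATAT"}
   for name, seq in sorted(records.items()):
       gc = gc_fraction(seq)
       # one '#' per 2% GC makes a simple ascii bar
       print("%-8s %5.1f%% %s" % (name, 100 * gc, "#" * int(gc * 50)))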

@@ io.py @@
Contains various parsers for delta files.

@@ join.py @@
General-purpose script for performing a database-like join on flat files.
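The operation is essentially the following (tab-delimited files joined
on their first column; the file names are hypothetical and the script's
actual options may differ):

   def load_table(path, key_col=0):
       # Map the key column to the full row for each line
       table = {}
       with open(path) as fh:
           for line in fh:
               fields = line.rstrip("\n").split("\t")
               table[fields[key_col]] = fields
       return table

   db = load_table("db.txt")
   with open("query.txt") as fh:
       for line in fh:
           fields = line.rstrip("\n").split("\t")
           if fields[0] in db:
               # emit the query row followed by the matching database columns
               print("\t".join(fields + db[fields[0]][1:]))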

@@ group_by.py @@
General-purpose script for grouping sorted, line-oriented data
by a column.
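Conceptually the same as an SQL GROUP BY on pre-sorted input; a sketch
that counts rows per key (the grouping column and the aggregation are
choices the real script presumably exposes as options):

   import sys
   from itertools import groupby

   COL = 0  # column to group on; input must already be sorted on it
   rows = (line.rstrip("\n").split("\t") for line in sys.stdin)
   for key, group in groupby(rows, key=lambda fields: fields[COL]):
       print("%s\t%d" % (key, sum(1 for _ in group)))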

@@ filter.py @@
General-purpose script for filtering line-separated data.
Takes a database file and uses it to filter the query file.
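In other words, query lines are kept or dropped based on keys present
in a second file; a minimal version of the idea (the file names and
key column are hypothetical):

   import sys

   with open("keep_ids.txt") as fh:          # the "database" of keys to keep
       keep = set(line.strip() for line in fh)

   with open("query.txt") as fh:             # the file being filtered
       for line in fh:
           if line.split("\t")[0] in keep:   # key assumed to be in column 1
               sys.stdout.write(line)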

@@ gc_count.py @@
Summarizes GC content from a set of alignments.

@@ fix_delta_header.py @@
Because correct.sh writes all output to a temporary directory under
SGE and then copies it back to the original working directory, delta
files end up with the header of the temporary directory. This script
fixes the header so that downstream MUMmer tools work after the
pipeline has completed.
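A nucmer delta file records the reference and query paths on its first
line, so the fix amounts to rewriting that line; a simplified stand-in
(the real script's interface may differ):

   import sys

   # usage: python fix_header.py alignment.delta /path/to/ref.fa /path/to/qry.fa
   delta_path, ref_path, qry_path = sys.argv[1:4]
   with open(delta_path) as fh:
       lines = fh.readlines()
   lines[0] = "%s %s\n" % (ref_path, qry_path)   # replace the tmpdir paths
   with open(delta_path, "w") as fh:
       fh.writelines(lines)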