tools for error correction and working with long read data
License
jgurtowski/ectools
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
master
Could not load branches
Nothing to show
Could not load tags
Nothing to show
{{ refName }}
default
Name already in use
A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Code
-
Clone
Use Git or checkout with SVN using the web URL.
Work fast with our official CLI. Learn more about the CLI.
- Open with GitHub Desktop
- Download ZIP
Sign In Required
Please sign in to use Codespaces.
Launching GitHub Desktop
If nothing happens, download GitHub Desktop and try again.
Launching GitHub Desktop
If nothing happens, download GitHub Desktop and try again.
Launching Xcode
If nothing happens, download Xcode and try again.
Launching Visual Studio Code
Your codespace will open once ready.
There was a problem preparing your codespace, please try again.
Latest commit
Git stats
Files
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
Long Read Correction and other Correction tools This package is a loose collection of scripts. To run the correction routine see the section below. Descriptions of the other scripts are at the bottom of this file. Contact: gurtowsk@cshl.edu ### #Correction ### In short, the correction algorithm takes as input the unitigs from a short read assembly and uses them to correct long read data. More background information for the algorithm can be found: http://schatzlab.cshl.edu/presentations/2013-06-18.PBUserMeeting.pdf Running Correction. ECTOOLS_HOME=/path/to/this/directory 1. Create Celera Unitigs from Illumina (or other high identity) data Output: organism.utg.fasta 2. Create a new directory to house the correction $> mkdir organism_correct 3. Simlink your organism.utg.fasta into the correction dir $> cd organism_correct $> ln -s /path/to/organism.utg.fasta 3. Generally we only correct reads greater than 1kb. Filter out shorter reads if possible. If you have it, 20-30x of raw long read coverage is preferred so that you have ~20x remaining after correction. 4. Partition pacbio reads into small batches $> python ${ECTOOLS_HOME}/partition.py 20 500 pbreads.legnth_filtered.fa Output: Series of directories and a ReadIndex.txt which maps reads to partitions *Note: You probably only want a few reads per file (20 is usually good) 5. Copy correction script to working dir $> cp ${ECTOOLS_HOME}/correct.sh . 6. Modify the global variables a the top of correct.sh 7. Ensure Nucmer suite is in your PATH $> nucmer 8. Launch Correction The correct.sh script should be run in each of the partition directories. The number of files in directory should be specified as the array parameter to grid engine. This simple for loop works well: $> for i in {0001..000N}; do cd $i; qsub -cwd -j y -t 1:${NUM_FILES_PER_PARTITION} ../correct.sh; cd ..; done Where N is the number of partitions (directories) and NUM_FILES_PER_PARTITION is the number of files per partition specified in the partitions script (500 in this case) 9. Wait for correction to complete. Concatenate all corrected output files from all parititons into a single fa file: $> cat ????/*.cor.fa > organism.cor.fa 10. Run Celera on the output file. *Notes: Use convert-fasta-to-v2.pl to make celera frg file from organism.cor.fa example: $> convert-fasta-to-v2.pl -l organism_pbcor -s organism.cor.fa \ -q <( python ${ECTOOLS_HOME}/qualgen.py organism.cor.fa) \ > organism.cor.frg Celera's default parameters work pretty well with corrected PB data, you can also drop the utgGraphErrorRate to 0.01. Make sure you use OverlapBasedTrimming (it is on by default). 11. (Optional) If SGE is not available in your compute environment, you can use GNU parallel to run ECTools. Although not the recommended method, as alignment can be very compute intensive, for small genomes (bacteria), this method can be used on a single machine. For each of the directories created by the partition script (0001..000N), cd into the directory and run: $>for j in {1..500}; do echo "SGE_TASK_ID=$j TMPDIR=/tmp ../correct.sh"; done | \ parallel -j <# of compute cores> ### #Script Descriptions ### @@ schtats @@ Calculates read length statistics for sequence files. @@ simulate_reads.py @@ Simple read simulator. Takes a genome and a file of line separated read lengths. A read is produced for each given read length at the specified error rate. @@ gccontent.py @@ Builds an ascii graph of gc content for each item in a fasta file @@ io.py @@ Has various parsers for parsing delta files @@ join.py @@ general purpose script to do database like join on flat file. @@ group_by.py @@ general purpose script for grouping sorted line oriented data by a column. @@ filter.py @@ general purpose script for filtering line separated data. Takes a database and uses it to filter the query file. @@ gc_count.py @@ summarizes gc content from a set of alignments @@ fix_delta_header.py @@ Because the correct.sh writes all output to a temporary directory in SGE and the copies it back to the original working directory, delta files will have the header of the temporary directory. This script just fixes the header so that downstream mummer tools work after the pipeline has completed.
About
tools for error correction and working with long read data
Resources
License
Stars
Watchers
Forks
Packages 0
No packages published