Documentation
- Operating system: Linux or Mac OS.
- Python 2.6 or 2.7 is required and must be installed on the system.
- 64 GB of RAM or more is recommended.
- Free hard drive space of about 20 times the size of the input data is needed for temporary output. This space can be reclaimed afterwards by deleting the temp folder.
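The disk space rule above is simple arithmetic, and it can be checked before starting a long run. The following is a minimal sketch, not part of LSCplus: the function names are hypothetical, and it assumes a standalone Python 3 interpreter is available alongside the Python 2 installation that LSCplus itself requires.

```python
import os
import shutil

# Factor taken from the requirement above: about 20x the input data size.
TEMP_SPACE_FACTOR = 20


def estimate_temp_space(paths, factor=TEMP_SPACE_FACTOR):
    """Return the estimated temporary disk space, in bytes, for the input files."""
    return factor * sum(os.path.getsize(p) for p in paths)


def has_enough_space(paths, work_dir=".", factor=TEMP_SPACE_FACTOR):
    """Compare the estimate with the free space on the drive that holds work_dir."""
    needed = estimate_temp_space(paths, factor)
    free = shutil.disk_usage(work_dir).free
    return needed <= free
```

For example, has_enough_space(["data/LR.fa", "data/SR.fa"]) returns False when the drive holding the current folder has less free space than 20 times the combined size of the two reads files.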
Download the Example Data.
♦ File Format
- Long reads should be in FASTA format.
- Single-end short reads are supported (blank lines are not allowed).
- Short reads must be in FASTA format (fasta/fa).
- Short reads should be at least 50 bp long.
- 'Q' is used as a keyword in LSCplus, so the reads files must not contain any 'q' or 'Q', including in the sequence tags.
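The format rules above can be checked mechanically before a run. The sketch below is an illustration only and is not shipped with LSCplus: check_fasta is a hypothetical helper, written for Python 3, that reports blank lines, any 'q' or 'Q' character, and reads shorter than a chosen minimum length.

```python
def _check_length(name, length, min_length, problems):
    """Record a problem if the finished read is shorter than min_length."""
    if name is not None and length < min_length:
        problems.append("%s: read is %d bp, shorter than %d bp"
                        % (name, length, min_length))


def check_fasta(path, min_length=0):
    """Return a list of human-readable problems found in a FASTA file."""
    problems = []
    name, length = None, 0
    with open(path) as handle:
        for number, raw in enumerate(handle, start=1):
            line = raw.rstrip("\r\n")
            if not line.strip():
                problems.append("line %d: blank line" % number)
                continue
            # 'Q' is reserved by LSCplus, in sequence tags as well as sequences.
            if "q" in line.lower():
                problems.append("line %d: contains 'q' or 'Q'" % number)
            if line.startswith(">"):
                _check_length(name, length, min_length, problems)
                name, length = line[1:], 0
            elif name is None:
                problems.append("line %d: sequence before first '>' tag" % number)
            else:
                length += len(line.strip())
    # The last read has no following '>' line, so check it here.
    _check_length(name, length, min_length, problems)
    return problems
```

A call such as check_fasta("data/SR.fa", min_length=50) applies the short-read length rule, while check_fasta("data/LR.fa") checks only for blank lines and the reserved character. An empty list means no problems were found.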
♦ File Path
For convenient use and minimal configuration, we have set some limitations on LSCplus.
- The LSCplus package contains an executable file, LSCplus, and the necessary folders: bowtie2, data, script.
- The names of the necessary folders should not be changed.
- The configuration file should be kept in the data folder and named run.cfg. Its location and name should not be changed.
- We recommend using the default locations for the reads files, data/LR.fa and data/SR.fa, but you can change these settings in the configuration file.
The configuration file is the most important file. It is a plain-text file that contains the paths to your sequencer reads and the configuration settings. It is simple to edit, and you will need to edit it once for each dataset.
♦ Example run.cfg file
##
###################################################
#
# This configuration file contains all settings for a run
# of LScorr.
#
# lines beginning with '#' are comments
#
###################################################
##
#########################
## Required Settings
##
# Long reads file
# (single value)
LR_pathfilename = data/LR.fa
##
# Short reads file
# (single value)
SR_pathfilename = data/SR.fa
##
# Short-reads Coverage Frequency(SCF)
SCF = 100
##
# Remove PacBio tails sub reads? (Y or N)
# The names of PacBio long reads must be in the format of the following example: ">m111006_202713_42141_c100202382555500000315044810141104_s1_p0/16/3441_3479"
# The last two numbers (3441 and 3479 in this example) are the positions of the sub reads.
# (single value)
RemoveBothTails = N
##
# Number of threads for short-read alignment to long reads
# (single value)
Nthread = 20
##
# Max memory usage for unix sort command (-S option) per thread depending on your system memory limit
# Note: This setting is per thread and the number of threads is set through the Nthread parameter
# -1: no-setting (default sort option)
# example: 4G , 100M , ...
sort_max_mem = -1
#########################
##
# Min. number of non-'N' bases after compressing
# (single value)
MinNumberofNonN = 39
##
# Max. number of 'N' allowed after compressing
# (single value)
MaxN = 1
#########################
##
# Maximum error rate percentage to accept a compressed LR-SR alignment
# (single value)
max_error_rate = 20
# Aligner command options
# Note: Do not specify the number of threads in the command options; it is set through Nthread
bowtie2_options = --end-to-end -a -f -L 15 --mp 1,1 --np 1 --rdg 0,1 --rfg 0,1 --score-min L,0,-0.12 --no-unal
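The example file above uses one "key = value" pair per line, with '#' starting a comment. As an illustration of that layout, the sketch below reads such a file into a dictionary and runs a few basic checks before a job is started. It is a standalone Python 3 sketch under stated assumptions: read_run_cfg and check_settings are hypothetical names, and how LSCplus itself parses the file is not described in this document.

```python
def read_run_cfg(path):
    """Read 'key = value' lines into a dict, skipping comments and blank lines."""
    settings = {}
    with open(path) as handle:
        for raw in handle:
            line = raw.strip()
            if not line or line.startswith("#"):
                continue
            # Split on the first '=' only, so values keep any later '=' signs.
            key, sep, value = line.partition("=")
            if not sep:
                continue  # not a 'key = value' line; ignored in this sketch
            settings[key.strip()] = value.strip()
    return settings


def check_settings(settings):
    """Return a list of problems for a few settings from the example file."""
    problems = []
    for key in ("LR_pathfilename", "SR_pathfilename"):
        if not settings.get(key):
            problems.append("missing value for " + key)
    if settings.get("RemoveBothTails") not in ("Y", "N"):
        problems.append("RemoveBothTails should be Y or N")
    nthread = settings.get("Nthread", "")
    if not nthread.isdigit() or int(nthread) < 1:
        problems.append("Nthread should be a positive integer")
    return problems
```

With the example file, read_run_cfg("data/run.cfg") would return all values as strings, including the full bowtie2_options line, and check_settings would return an empty list.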
This tutorial will help you get started with LSCplus. If you experience any problems, please don't hesitate to contact us.
♦ Step 1 - Prepare Datasets
Download the example data or prepare your own data as required.
♦ Step 2 - Check the directory contents and Set run.cfg
1. Download and extract the LSCplus CPP / Python package.
2. Make sure that the files and directories are in the correct locations with their default names.
3. Set the parameter values in data/run.cfg.
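The directory check in this step can be scripted. The sketch below is a hypothetical helper, not part of the LSCplus package, written for Python 3. It looks only for the items this documentation names explicitly: the LSCplus executable, the bowtie2, data, and script folders, and data/run.cfg.

```python
import os

# Names taken from the File Path section of this documentation.
EXPECTED_DIRS = ("bowtie2", "data", "script")
EXPECTED_FILES = ("LSCplus", os.path.join("data", "run.cfg"))


def missing_items(root="."):
    """Return the expected folders and files that are absent under root."""
    missing = [d for d in EXPECTED_DIRS
               if not os.path.isdir(os.path.join(root, d))]
    missing += [f for f in EXPECTED_FILES
                if not os.path.isfile(os.path.join(root, f))]
    return missing
```

Calling missing_items() from inside the LSCplus folder returns an empty list when the layout matches the defaults.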
♦ Step 3 - Run LSCplus
1. In a terminal, change the working directory to the LSCplus folder.
2. Type the following command on one line:
./LSCplus
3. The program prints output similar to the following:
=========== Welcome to LSCplus xx_cpp ============
**
** ***** **** *****
** ******* ******* **
** ** **
** ** ** **
** ** **
******** ******* ********
******** ***** ******
================================================
Start the Job ? (Y/N)y
Start at: 12h 1m 51s
Number of Threads: 20
SCF Value: 60
====== sort and unique SR data ======
Finished at: 12h 1m 54s
=========== solit SR data ===========
Finished at: 12h 1m 55s
========== Remove LR Tails ==========
Finished at: 12h 1m 56s
======= Compress SR & LR data =======
finsish genome
finsish genome
...
...
...
Finished at: 12h 2m 18s
========== bowire2 index LR ==========
Building a SMALL index
Settings:
Output files: "temp/LR_NoTails.fa.cps.*.bt2"
Line rate: 6 (line is 64 bytes)
...
...
...
ebwtTotLen: 7697472
ebwtTotSz: 7697472
color: 0
reverse: 1
Total time for backward call to driver() for mirror index: 00:00:17
Finished at: 12h 2m 43s
====== bowtier2 SR.??.cps ======
46744 reads; of these:
46744 (100.00%) were unpaired; of these:
37118 (79.41%) aligned 0 times
1788 (3.83%) aligned exactly 1 time
7838 (16.77%) aligned >1 times
20.59% overall alignment rate
...
...
...
Finished at: 12h 3m 40s
====== samParser SR.??.cps.nav ======
Finished at: 12h 4m 0s
========= generate LR_SR map ==========
sort finished at: 12h 4m 25s
Done with generating LR_SR.map
Finished at: 12h 4m 34s
========== Error Corrections ==========
LR_lines_number: 31133
Finished at: 12h 4m 57s
========== Arrangement and Summarize ==========
Done with corrected_LR_full.fa
Done with corrected_LR.fa
Finished at: 12h 4m 57s
All works have been done !
♦ Step 4 - Get output
1. While LSCplus is running, two new folders are created:
- temp: This is a temporary directory created during the execution of LSCplus. The results of the initial short-read mapping are stored here, so this directory can be quite large. Note: You can delete this folder after the job is done.
- output: This directory stores all the useful output files produced by LSCplus. It is also created during the execution of LSCplus.
- Output Files:
- corrected_LR.fa: As long as there are short reads (SR) mapped to a long read, the long read can be corrected in the SR-covered regions (see the paper for more details). The sequence from the left-most SR-covered base to the right-most SR-covered base is written to this file.
- corrected_LR_full.fa: Although the terminal sequences are uncorrected, they are concatenated with the corrected sequence to form a "full" sequence. This sequence therefore covers the same length as the raw read and is written to this file.
Only FASTA files are supported by the latest LSCplus. We have developed a toolkit (LSCplus_toolkit) for preparing the SR.fa and LR.fa files.
♦ Download LSCplus_ToolKit
- ConverToPacBio_q2a.py
usage: ./ConverToPacBio_q2a.py input_filename
or: python ConverToPacBio_q2a.py input_filename
Convert long reads from FASTQ format to FASTA format with modified names
(the PacBio read names should be in the format "name/index/1_(length)",
where (length) is the length of the read).
The default output is LR.fa
- SR_fastq2a.py
usage: ./SR_fastq2a.py input_filename
or: python SR_fastq2a.py input_filename
Convert short reads from FASTQ format to FASTA format.
The default output is "output_"+input_filename
- mergeSR.py
usage: ./mergeSR.py filename1 filename2
or: python mergeSR.py filename1 filename2
Concatenate two short reads files
The default output is SR.fa
- selectTopN.py
usage: ./selectTopN.py input_file N output_file
or: python selectTopN.py input_file N output_file
Select the aligned subsequences for the first N reads
and write them to output_file
- SR_pair2single.py
usage: ./SR_pair2single.py input_filename
or: python SR_pair2single.py input_filename
Convert paired-end short reads to single-end format with modified names
The default output is "output_"+input_filename
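To make the FASTQ-to-FASTA conversion concrete, here is a minimal sketch of what such a step involves. It is not the code of SR_fastq2a.py or any other toolkit script: fastq_to_fasta is a hypothetical function, written for Python 3, and it assumes the common four-line FASTQ layout (tag, sequence, '+' line, quality line) with each sequence on a single line. It does not rename reads, so any 'q' or 'Q' in the sequence tags would still need to be handled separately.

```python
def fastq_to_fasta(fastq_path, fasta_path):
    """Write the tag and sequence of each FASTQ record as FASTA; return the read count."""
    count = 0
    with open(fastq_path) as src, open(fasta_path, "w") as dst:
        while True:
            header = src.readline()
            if not header:
                break  # end of file
            if not header.strip():
                continue  # skip blank lines between records
            if not header.startswith("@"):
                raise ValueError("expected '@' at start of record %d" % (count + 1))
            sequence = src.readline()
            src.readline()  # '+' separator line, not needed in FASTA
            src.readline()  # quality line, dropped in FASTA
            dst.write(">" + header[1:].rstrip("\r\n") + "\n")
            dst.write(sequence.rstrip("\r\n") + "\n")
            count += 1
    return count
```

The output contains no blank lines, which matches the input rule for short reads given in the File Format section.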