Documentation

♣ System Requirements

Operating system: Linux or Mac OS.
Python 2.6 or 2.7 is required and must be installed on the system.
64 GB of RAM or more is recommended.
Free hard drive space of about 20 times the size of the input data is needed for temporary output. This space can be reclaimed after the run by deleting the temp folder.
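
The disk space rule above can be checked before a run. The following is a minimal sketch, not part of LSCplus: the function names are hypothetical, and it assumes a Unix-like system where os.statvfs is available.

```python
import os

# Rule of thumb from the requirements above: temporary output needs
# about 20 times the size of the input data.
TEMP_SPACE_FACTOR = 20


def required_temp_bytes(input_paths, factor=TEMP_SPACE_FACTOR):
    """Estimate the temporary disk space (in bytes) for the given input files."""
    return factor * sum(os.path.getsize(path) for path in input_paths)


def free_bytes(directory="."):
    """Return the free space available to the current user on the disk holding directory."""
    stats = os.statvfs(directory)
    return stats.f_bavail * stats.f_frsize


def has_enough_space(input_paths, directory="."):
    """Compare the estimate with the free space where the temp folder will be created."""
    return free_bytes(directory) >= required_temp_bytes(input_paths)
```

For example, has_enough_space(["data/LR.fa", "data/SR.fa"]) could be called from the LSCplus folder before starting a job. The factor of 20 is only the estimate quoted above, so the result is a guide rather than a guarantee.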

♣ Example Data

Download the Example Data.

♣ Prepare Datasets

♦ File Format

Long reads should be in FASTA format.
Short reads should be in FASTA format (fasta/fa).
Single-end short reads are supported; blank lines are not allowed in the file.
Short reads should be at least 50 bp long.
'Q' is used as a keyword in LSCplus, so the reads files must not contain any 'q' or 'Q', including in the sequence tags.
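
These rules can be checked before a run. The sketch below is not part of LSCplus; check_fasta is a hypothetical helper that reports blank lines, any 'q' or 'Q', and reads shorter than a chosen minimum length.

```python
def check_fasta(path, min_length=0):
    """Return a list of problems found in a FASTA file (an empty list means none)."""
    problems = []
    name, length = None, 0  # current read name and its sequence length so far
    with open(path) as handle:
        for number, raw in enumerate(handle, start=1):
            line = raw.rstrip("\r\n")
            if not line.strip():
                problems.append("line %d: blank line" % number)
                continue
            if "q" in line.lower():
                problems.append("line %d: contains 'q' or 'Q'" % number)
            if line.startswith(">"):
                # A new header closes the previous read, so check its length now.
                if name is not None and length < min_length:
                    problems.append("%s: shorter than %d bp" % (name, min_length))
                name, length = line[1:], 0
            else:
                length += len(line.strip())
    # Check the last read in the file.
    if name is not None and length < min_length:
        problems.append("%s: shorter than %d bp" % (name, min_length))
    return problems
```

Under these assumptions, check_fasta("data/SR.fa", min_length=50) would cover the short reads rules, and check_fasta("data/LR.fa") would check the long reads for blank lines and 'Q' only.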

♦ File Path

For convenient use and minimal configuration, LSCplus imposes some restrictions on file layout.

The LSCplus package contains an executable file, LSCplus, and the necessary folders, including bowtie2, data, and script.
The names of the necessary folders should not be changed.
The configuration file should be kept in the data folder and named run.cfg.
The location and name of the configuration file should not be changed.
We recommend using the default settings for the reads files, data/LR.fa and data/SR.fa, but you can change these settings in the configuration file.
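
The expected layout can be verified with a short script. This is a sketch only: missing_items is a hypothetical helper, and it checks just the names given in this section (the LSCplus executable, the bowtie2, data, and script folders, and data/run.cfg).

```python
import os

# Names taken from the File Path section above.
REQUIRED_DIRS = ("bowtie2", "data", "script")
REQUIRED_FILES = ("LSCplus", os.path.join("data", "run.cfg"))


def missing_items(package_dir="."):
    """Return the required folders and files that are absent from package_dir."""
    missing = []
    for name in REQUIRED_DIRS:
        if not os.path.isdir(os.path.join(package_dir, name)):
            missing.append(name)
    for name in REQUIRED_FILES:
        if not os.path.isfile(os.path.join(package_dir, name)):
            missing.append(name)
    return missing
```

Calling missing_items() from inside the LSCplus folder returns an empty list when everything named above is in place.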

♣ Configuration file

This is the most important file. It is a text file that contains the paths to your read files and the configuration settings. It is simple to edit, and you will need to edit it once for each dataset.

♦ Example run.cfg file

##
###################################################
#
# This configuration file contains all settings for a run
# of LScorr.
#
# lines beginning with '#' are comments
#
###################################################
##

#########################
## Required Settings

##
# Long reads file
# (single value)

LR_pathfilename = data/LR.fa

##
# Short reads file
# (single value)

SR_pathfilename = data/SR.fa

## 
# Short-reads Coverage Frequency (SCF)

SCF = 100

##
# Remove PacBio tails sub reads? (Y or N)
# The names of PacBio long reads must be in the format of the following example: ">m111006_202713_42141_c100202382555500000315044810141104_s1_p0/16/3441_3479"
# The last two numbers (3441 and 3479 in this example) are the positions of the sub reads. 
# (single value)

RemoveBothTails = N

##
# Number of threads for aligning short reads to long reads
# (single value)

Nthread = 20

##
# Max memory usage for unix sort command (-S option) per thread depending on your system memory limit
# Note: This setting is per thread, and the number of threads is set through the Nthread parameter
# -1: no-setting (default sort option) 
# example: 4G , 100M , ...

sort_max_mem = -1


#########################
##
# Min. number of non-'N' bases after compressing
# (single value)

MinNumberofNonN = 39

##
# Max. number of 'N' allowed after compressing
# (single value)

MaxN = 1


#########################
##
# Maximum error rate percentage to accept a compressed LR-SR alignment 
# (single value)
max_error_rate = 20

# Aligner command options   
# Note: Do not specify the number of threads in the command options; it is set through Nthread

bowtie2_options = --end-to-end -a -f -L 15 --mp 1,1 --np 1 --rdg 0,1 --rfg 0,1 --score-min L,0,-0.12 --no-unal
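
To inspect the settings in a run.cfg file programmatically, a small reader is enough. The sketch below is an illustration, not the parser LSCplus itself uses: read_cfg is a hypothetical helper that assumes each setting is a single "name = value" line and that lines starting with '#' are comments, as in the example above.

```python
def read_cfg(path):
    """Parse 'name = value' lines, skipping blank lines and '#' comments."""
    settings = {}
    with open(path) as handle:
        for raw in handle:
            line = raw.strip()
            if not line or line.startswith("#"):
                continue
            # Split on the first '=' only, so values may contain further '=' signs.
            key, separator, value = line.partition("=")
            if separator:
                settings[key.strip()] = value.strip()
    return settings
```

All values are returned as strings, so a numeric setting such as Nthread would need int(settings["Nthread"]) before use.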

♣ Tutorial

This tutorial will guide you to get started with LSCplus. If you experience any problems, please don't hesitate to contact us.

♦ Step 1 - Prepare Datasets

Download the example data or prepare your own data as required.

♦ Step 2 - Check the directory contents and Set run.cfg

1. Download and extract the LSCplus CPP / Python package.
2. Make sure that the files and directories are in the correct locations with their default names.
3. Set the parameter values in data/run.cfg.

♦ Step 3 - Run LSCplus

1. In a terminal, change to the LSCplus folder.
2. Type the following command on one line:
./LSCplus
3. LSCplus prints output like the following:

=========== Welcome to LSCplus xx_cpp ============
                                        **   
        **       *****         ****   *****   
       **       *******      *******   **    
      **        **          **               
     **         ** **      **                
    **             **      **                
   ********   *******      ********          
   ********    *****        ******                 

================================================

Start the Job ? (Y/N)y
Start at: 12h 1m 51s
Number of Threads: 20
SCF Value: 60
====== sort and unique SR data ======
Finished at: 12h 1m 54s
=========== solit SR data ===========
Finished at: 12h 1m 55s
========== Remove LR Tails ==========
Finished at: 12h 1m 56s
======= Compress SR & LR data =======
finsish genome
finsish genome
...
...
...
Finished at: 12h 2m 18s
========== bowire2 index LR ==========
Building a SMALL index
Settings:
  Output files: "temp/LR_NoTails.fa.cps.*.bt2"
  Line rate: 6 (line is 64 bytes)
 ...
 ...
 ...
    ebwtTotLen: 7697472
    ebwtTotSz: 7697472
    color: 0
    reverse: 1
Total time for backward call to driver() for mirror index: 00:00:17
Finished at: 12h 2m 43s
====== bowtier2 SR.??.cps ======
46744 reads; of these:
  46744 (100.00%) were unpaired; of these:
    37118 (79.41%) aligned 0 times
    1788 (3.83%) aligned exactly 1 time
    7838 (16.77%) aligned >1 times
20.59% overall alignment rate
...
...
...
Finished at: 12h 3m 40s
====== samParser SR.??.cps.nav ======
Finished at: 12h 4m 0s
========= generate LR_SR map ==========
sort finished at: 12h 4m 25s
Done with generating LR_SR.map
Finished at: 12h 4m 34s
========== Error Corrections ==========
LR_lines_number: 31133
Finished at: 12h 4m 57s
========== Arrangement and Summarize ==========
Done with corrected_LR_full.fa
Done with corrected_LR.fa

Finished at: 12h 4m 57s
All works have been done !

♦ Step 4 - Get output

While LSCplus runs, two new folders are created:

temp: A temporary directory created during the execution of LSCplus. The results of the initial short-read mapping are stored here, so this directory can be quite large. Note: you can delete this folder after the job is done.
output: This directory stores all the useful output files from LSCplus. It is also created during the execution of LSCplus.

Output Files:

corrected_LR.fa: As long as short reads (SR) are mapped to a long read, that long read can be corrected in the SR-covered regions (please see the paper for more details). The sequence from the left-most SR-covered base to the right-most SR-covered base is written to this file.
corrected_LR_full.fa: Although the terminal sequences are uncorrected, they are concatenated with the corrected sequence to form a "full" sequence. This sequence therefore covers a length equivalent to the raw read and is written to this file.
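
A quick way to inspect either output file is to tabulate the length of each read. The sketch below is a generic FASTA reader, not an LSCplus tool; fasta_lengths is a hypothetical helper, and the path output/corrected_LR.fa in the note after it assumes the files are written to the output folder described above.

```python
def fasta_lengths(path):
    """Return a dict mapping each read name to its sequence length."""
    lengths = {}
    name = None
    with open(path) as handle:
        for raw in handle:
            line = raw.strip()
            if line.startswith(">"):
                name = line[1:]
                lengths[name] = 0
            elif name is not None:
                # Sequence lines may be wrapped, so accumulate across lines.
                lengths[name] += len(line)
    return lengths
```

For instance, sum(fasta_lengths("output/corrected_LR.fa").values()) gives the total number of corrected bases, and the same call on corrected_LR_full.fa gives the total for the full-length sequences.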

♣ LSCplus_ToolKit Document

Only FASTA files are supported by the latest LSCplus. We have developed a toolkit (LSCplus_ToolKit) for preparing the SR.fa and LR.fa files.

♦ Download LSCplus_ToolKit

ConverToPacBio_q2a.py

    usage: ./ConverToPacBio_q2a.py input_filename
       or: python ConverToPacBio_q2a.py input_filename
   
    Convert Long Reads FASTQ format to FASTA format with modified names
    (the PacBio read names should be in the format "name/index/1_(length)",
    where (length) is the length of the read).
    The default output is LR.fa

SR_fastq2a.py

    usage: ./SR_fastq2a.py input_filename
       or: python SR_fastq2a.py input_filename
   
    Convert Short Reads FASTQ format to FASTA format.
    The default output is "output_"+input_filename

mergeSR.py

    usage: ./mergeSR.py filename1 filename2
       or: python mergeSR.py filename1 filename2
   
    Concatenate two short reads files
    The default output is SR.fa

selectTopN.py

    usage: ./selectTopN.py input_file N output_file
       or: python selectTopN.py input_file N output_file
   
    Select the aligned subsequences for the first N reads
    and write them to output_file

SR_pair2single.py

    usage: ./SR_pair2single.py input_filename
       or: python SR_pair2single.py input_filename
   
    Convert Short Reads paired-end format to single-end format with modified names.
    The default output is "output_"+input_filename
