Documentation

RuifengHu edited this page Nov 27, 2019 · 29 revisions

♣ System Requirements

  • Operating system: Linux or Mac OS.
  • Python 2.6 or 2.7 is required and must be installed on the system.
  • 64 GB of RAM or more is recommended.
  • Free hard drive space of about 20 times the size of the input data is needed for temporary output. This space can be reclaimed afterwards by deleting the temp folder.
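
The disk space rule above can be checked before a run. The following is a minimal sketch, not part of LSCplus: the function names are hypothetical, the factor of 20 is taken from the requirement above, and the default paths data/LR.fa and data/SR.fa are the defaults mentioned later on this page.

```python
import os


def required_temp_bytes(input_paths, factor=20):
    """Return factor times the combined size of the input files, in bytes."""
    return factor * sum(os.path.getsize(path) for path in input_paths)


def free_bytes(folder="."):
    """Return the space available to the current user on the folder's file system."""
    stats = os.statvfs(folder)  # available on Linux and Mac OS
    return stats.f_bavail * stats.f_frsize


if __name__ == "__main__":
    # Only the default read files that actually exist are counted.
    reads = [p for p in ("data/LR.fa", "data/SR.fa") if os.path.exists(p)]
    needed = required_temp_bytes(reads)
    print("needed: %d bytes, free: %d bytes" % (needed, free_bytes(".")))
```

If the needed value is larger than the free value, the temp folder is likely to fill the disk during the run.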

♣ Example Data

     Download the Example Data.

♣ Prepare Datasets

♦ File Format

  • Long reads should be in FASTA format.
  • Single-end short reads are supported (blank lines are not allowed).
  • Short reads should also be in FASTA format (fasta/fa).
  • Short reads should be at least 50 bp long.
  • 'Q' is used as a keyword in LSCplus, so the reads files must not contain any 'q' or 'Q', including in the sequence tags.
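
The rules above can be checked with a small script before starting a run. This is a minimal sketch under stated assumptions, not a tool shipped with LSCplus: the function name check_reads is hypothetical, each sequence is assumed to sit on a single line (no wrapped sequences), and the blank-line rule is applied to whichever file is passed in.

```python
def check_reads(lines, min_length=None):
    """Return a list of problems found in the lines of a FASTA file.

    min_length: pass 50 for short reads; leave as None for long reads.
    Assumes one sequence per line (no wrapped sequences).
    """
    problems = []
    for number, raw in enumerate(lines, 1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            problems.append("line %d: blank line" % number)
            continue
        # 'q' and 'Q' are reserved, in tags as well as in sequences.
        if "q" in line.lower():
            problems.append("line %d: contains 'q' or 'Q'" % number)
        # The length rule applies to sequence lines only.
        if min_length and not line.startswith(">") and len(line) < min_length:
            problems.append(
                "line %d: read shorter than %d bp" % (number, min_length)
            )
    return problems
```

For example, `check_reads(open("data/SR.fa"), min_length=50)` returns an empty list when the short reads file passes all three checks.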

♦ File Path

     For convenient use and less configuration, we have set some limitations on LSCplus.

  • The LSCplus package contains an executable file, LSCplus, and the necessary folders: bowtie2, data, and script.
  • The names of the necessary folders should not be changed.
  • The configuration file should be kept in the data folder and named run.cfg.
    The position and name of the configuration file should not be changed.
  • We recommend using the default settings for the reads files, data/LR.fa and data/SR.fa, but you can change these settings in the configuration file.
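
The layout described above can be verified with a short check. This is a sketch only: the function name missing_items is hypothetical, and the list of expected items contains just the names given in the text above.

```python
import os

# Names taken from the list above; nothing else is assumed about the package.
EXPECTED_DIRS = ("bowtie2", "data", "script")
EXPECTED_FILES = ("LSCplus", os.path.join("data", "run.cfg"))


def missing_items(root="."):
    """Return the expected folders and files that are absent under root."""
    missing = []
    for name in EXPECTED_DIRS:
        if not os.path.isdir(os.path.join(root, name)):
            missing.append(name + "/")
    for name in EXPECTED_FILES:
        if not os.path.isfile(os.path.join(root, name)):
            missing.append(name)
    return missing


if __name__ == "__main__":
    for item in missing_items("."):
        print("missing: " + item)
```

Run it from the LSCplus folder; no output means every item named above is in place.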

♣ Configuration file

     This is the most important file. It is a text file that contains the path to your sequencer reads and the configuration settings. It is simple to edit and you will need to edit it once for each data-set.

♦ Example run.cfg file

##
###################################################
#
# This configuration file contains all settings for a run
# of LScorr.
#
# lines beginning with '#' are comments
#
###################################################
##

#########################
## Required Settings

##
# Long reads file
# (single value)

LR_pathfilename = data/LR.fa

##
# Short reads file
# (single value)

SR_pathfilename = data/SR.fa

## 
# Short-reads Coverage  Frequency(SCF)

SCF = 100

##
# Remove PacBio tails sub reads? (Y or N)
# The names of PacBio long reads must be in the format of the following example: ">m111006_202713_42141_c100202382555500000315044810141104_s1_p0/16/3441_3479"
# The last two numbers (3441 and 3479 in this example) are the positions of the sub reads. 
# (single value)

RemoveBothTails = N

##
# Number of threads for short reads alignment to long reads
# (single value)

Nthread = 20

##
# Max memory usage for unix sort command (-S option) per thread depending on your system memory limit
# Note: This setting is per thread and the number of threads is set through the Nthread parameter
# -1: no-setting (default sort option) 
# example: 4G , 100M , ...

sort_max_mem = -1


#########################
##
# Min. number of non-'N' bases after compressing
# (single value)

MinNumberofNonN = 39

##
# Max. number of 'N' allowed after compressing
# (single value)

MaxN = 1


#########################
##
# Maximum error rate percentage to accept a compressed LR-SR alignment 
# (single value)
max_error_rate = 20

# Aligner command options   
# Note: Do not specify number of threads in the command options, it is set through Nthread

bowtie2_options = --end-to-end -a -f -L 15 --mp 1,1 --np 1 --rdg 0,1 --rfg 0,1 --score-min L,0,-0.12 --no-unal
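
To review the settings before a run, the file can be read into a dictionary. The sketch below assumes only what the example shows: one `name = value` pair per line, with lines starting with '#' treated as comments. The function name parse_cfg is hypothetical, and this is not necessarily how LSCplus itself reads the file.

```python
def parse_cfg(lines):
    """Return a dict of the name = value settings found in run.cfg lines."""
    settings = {}
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        if "=" not in line:
            continue
        # Split on the first '=' only, so values may contain '=' themselves.
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


if __name__ == "__main__":
    import os

    if os.path.exists("data/run.cfg"):
        with open("data/run.cfg") as handle:
            for key, value in sorted(parse_cfg(handle).items()):
                print("%s = %s" % (key, value))
```

All values are kept as strings; convert them (for example `int(settings["Nthread"])`) where a number is needed.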

♣ Tutorial

This tutorial will guide you to get started with LSCplus. If you experience any problems, please don't hesitate to contact us.

♦ Step 1 - Prepare Datasets

Download the example data or prepare your own data as required.

♦ Step 2 - Check the directory contents and Set run.cfg

  • 1. Download and extract the LSCplus CPP / Python package.
  • 2. Make sure that the files and directories are in the correct positions with their default names.
  • 3. Set values for the parameters in data/run.cfg.

♦ Step 3 - Run LSCplus

1. Open a terminal and change to the LSCplus folder.
2. Type the following command on one line:
               ./LSCplus
3. Output similar to the following is printed:

=========== Welcome to LSCplus xx_cpp ============
                                        **   
        **       *****         ****   *****   
       **       *******      *******   **    
      **        **          **               
     **         ** **      **                
    **             **      **                
   ********   *******      ********          
   ********    *****        ******                 

================================================

Start the Job ? (Y/N)y
Start at: 12h 1m 51s
Number of Threads: 20
SCF Value: 60
====== sort and unique SR data ======
Finished at: 12h 1m 54s
=========== solit SR data ===========
Finished at: 12h 1m 55s
========== Remove LR Tails ==========
Finished at: 12h 1m 56s
======= Compress SR & LR data =======
finsish genome
finsish genome
...
...
...
Finished at: 12h 2m 18s
========== bowire2 index LR ==========
Building a SMALL index
Settings:
  Output files: "temp/LR_NoTails.fa.cps.*.bt2"
  Line rate: 6 (line is 64 bytes)
 ...
 ...
 ...
    ebwtTotLen: 7697472
    ebwtTotSz: 7697472
    color: 0
    reverse: 1
Total time for backward call to driver() for mirror index: 00:00:17
Finished at: 12h 2m 43s
====== bowtier2 SR.??.cps ======
46744 reads; of these:
  46744 (100.00%) were unpaired; of these:
    37118 (79.41%) aligned 0 times
    1788 (3.83%) aligned exactly 1 time
    7838 (16.77%) aligned >1 times
20.59% overall alignment rate
...
...
...
Finished at: 12h 3m 40s
====== samParser SR.??.cps.nav ======
Finished at: 12h 4m 0s
========= generate LR_SR map ==========
sort finished at: 12h 4m 25s
Done with generating LR_SR.map
Finished at: 12h 4m 34s
========== Error Corrections ==========
LR_lines_number: 31133
Finished at: 12h 4m 57s
========== Arrangement and Summarize ==========
Done with corrected_LR_full.fa
Done with corrected_LR.fa

Finished at: 12h 4m 57s
All works have been done !
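
The "Start at" and "Finished at" lines in the output above can be used to measure how long a run took. The sketch below assumes the output was saved to a text file by the user (LSCplus is not stated to write a log file itself) and that the run did not cross midnight, since the timestamps carry no date. The function name elapsed_seconds is hypothetical.

```python
import re

# Matches lines such as "Start at: 12h 1m 51s" and "Finished at: 12h 4m 57s".
STAMP = re.compile(r"^(?:Start|Finished) at: (\d+)h (\d+)m (\d+)s")


def elapsed_seconds(log_lines):
    """Return seconds between the first and last timestamp, or None."""
    stamps = []
    for line in log_lines:
        match = STAMP.match(line.strip())
        if match:
            hours, minutes, seconds = (int(group) for group in match.groups())
            stamps.append(hours * 3600 + minutes * 60 + seconds)
    if len(stamps) < 2:
        return None
    return stamps[-1] - stamps[0]
```

Applied to the example output above, the first timestamp is 12h 1m 51s and the last is 12h 4m 57s, which gives 186 seconds.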

♦ Step 4 - Get output

1. While LSCplus is running, two new folders are created:

  • temp: This is a temporary directory created during the execution of LSCplus. The results of the initial short-read mapping are stored here, so this directory can be quite large. Note: You can delete this folder after the job is done.
  • output: This directory stores all the useful output files from LSCplus. It is also created during the execution of LSCplus.

2. Output Files:

  • corrected_LR.fa: As long as there are short reads (SR) mapped to a long read, this long read can be corrected at the SR-covered regions. (Please see the paper for more details.) The sequence from the left-most SR-covered base to the right-most SR-covered base is written to this file.
  • corrected_LR_full.fa: Although the terminal sequences are uncorrected, they are concatenated with the corrected sequence to form a "full" sequence. Thus, this sequence covers the same length as the raw read and is written to this file.
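
A quick way to compare the two output files is to count their reads and bases. This is a minimal sketch: the function name fasta_stats is hypothetical, and the paths output/corrected_LR.fa and output/corrected_LR_full.fa assume the files sit directly inside the output folder described above.

```python
import os


def fasta_stats(lines):
    """Return (number of reads, total bases) for the lines of a FASTA file."""
    reads = 0
    bases = 0
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            reads += 1  # each tag line starts a new read
        else:
            bases += len(line)  # sequence lines, wrapped or not
    return reads, bases


if __name__ == "__main__":
    for name in ("output/corrected_LR.fa", "output/corrected_LR_full.fa"):
        if os.path.exists(name):
            with open(name) as handle:
                print("%s: %d reads, %d bases" % ((name,) + fasta_stats(handle)))
```

Because corrected_LR_full.fa keeps the terminal sequences, its base count is expected to be at least that of corrected_LR.fa for the same reads.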

♣ LSCplus_ToolKit Document

     Only FASTA files are supported by the latest LSCplus. We have developed a toolkit (LSCplus_toolkit) for preparing the SR.fa and LR.fa files.

♦ Download LSCplus_ToolKit

1. ConverToPacBio_q2a.py

    usage: ./ConverToPacBio_q2a.py input_filename
       or: python ConverToPacBio_q2a.py input_filename

    Convert Long Reads FASTQ format to FASTA format with modified names
    (the PacBio read names should be in the format "name/index/1_(length)",
    where (length) is the length of the read).
    The default output is LR.fa

2. SR_fastq2a.py

    usage: ./SR_fastq2a.py input_filename
       or: python SR_fastq2a.py input_filename

    Convert Short Reads FASTQ format to FASTA format.
    The default output is "output_"+input_filename

3. mergeSR.py

    usage: ./mergeSR.py filename1 filename2
       or: python mergeSR.py filename1 filename2

    Concatenate two short reads files.
    The default output is SR.fa

4. selectTopN.py

    usage: ./selectTopN.py input_file N output_file
       or: python selectTopN.py input_file N output_file

    Select the aligned subsequences for the first N reads
    and write them to output_file

5. SR_pair2single.py

    usage: ./SR_pair2single.py input_filename
       or: python SR_pair2single.py input_filename

    Convert Short Reads paired-end format to single-end format with modified names.
    The default output is "output_"+input_filename
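
To illustrate the kind of conversion these scripts perform, here is a minimal FASTQ to FASTA sketch. It is not the code of SR_fastq2a.py or any other toolkit script: the function name fastq_to_fasta is hypothetical, and the input is assumed to use plain four-line FASTQ records (tag, sequence, '+', qualities) with no wrapped lines.

```python
def fastq_to_fasta(lines):
    """Yield FASTA lines (tag, then sequence) from four-line FASTQ records."""
    for index, raw in enumerate(lines):
        line = raw.rstrip("\r\n")
        position = index % 4  # 0: tag, 1: sequence, 2: '+', 3: qualities
        if position == 0:
            if not line.startswith("@"):
                raise ValueError(
                    "line %d: FASTQ tag must start with '@'" % (index + 1)
                )
            yield ">" + line[1:]
        elif position == 1:
            yield line
        # The '+' line and the quality line are dropped.
```

The generator reads one line at a time, so it can be applied to an open file handle without loading the whole file into memory.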