GitHub - ma9606/callallSV: callallSV package - the structural variant caller for WGS

Branches Tags
Name		Name	Last commit message	Last commit date
Latest commit History 40 Commits
demo_inFiles		demo_inFiles
demo_outExample		demo_outExample
ref		ref
script		script
LICENSE		LICENSE
README		README
callallSV-latest.tar.gz		callallSV-latest.tar.gz
callallsv.cfg		callallsv.cfg
quick.start_demo.sh		quick.start_demo.sh
Repository files navigation

callallSV
==========

CallallSV was developed for somatic structural variant (SV) detection from whole-
genome sequencing (WGS) data. It is optimized for the analysis of a large-scale 
dataset composed of hundreds of cases and outputs the lists of SV candidates with 
multilevel plausibility to reduce false negatives.

As input, the system requires zipped SAM files of whole genome sequencing (WGS) 
data mapped by BWA with -T0 option, which were generated by paired-end reads 
from paired samples (tumor and matched normal samples). The read depth recommended 
is 40 and 30 or higher for tumor and normal samples, respectively. The sequence 
data from normal samples are also used for constructing the normal panel for the 
dataset.

For the thorough detection of SVs, callallSV consists of two independent algorithms. 
The first algorithm used paired-end reads mapped discordantly (both ends align to 
the reference genome uniquely with improper spacing or orientation, or both). 
The second algorithm used single reads split and mapped apart (so-called "soft-
clipped reads") to identify SV breakpoints. 

The detection result is obtained by merging outputs from these two algorisms. Then 
three support types of SV are contained in the list of SV candidates. That is, 
(1) the SV detected by both paired-end and soft-clip algorithms, 
(2) detected by only paired-end algorithms, 
(3) detected by only soft-clip algorism. 
In the second paired-end type of SV candidates, the breakpoints were shown by not 
points but ranges. The merged SV list is filtered by conditions determined by a 
validation experiment (RT-PCR and Sanger DNA sequencing) per each support type. Then 
finally, the system output the detection result as "merged_SV.validated_filt.list" 
under Merge/ directory.


SETUP
=========
Please edit the config file (callallsv.cfg) in the top directory of callallSV to suit 
your environment. The top directory of callallSV must be declared as ${HDR}. CallallSV 
requires BWA, SAMtools, BEDtools, R, and reference files. 
Please edit the paths for these tools and reference files in your environment.

	$ git clone https://github.com/ma9606/callallSV
	 -- or --
	$ tar xvzf callallSV-latest.tar.gz
	
	$ cd ./callallSV		
	$ vi ./callallsv.cfg	
	$ source ./callallsv.cfg


REQUIREMENTS
==============
CallallSV was designed on the assumption that it runs on a supercomputer under 
the management of grid computing software (e.g., Altair Grid Engine: AGE). It 
was tested on the SHIROKANE supercomputer in Human Genome Center, the university 
of Tokyo, which consists of 140 total nodes with 24 CPU cores Intel Xeon E5-2650 
v4 (2.2 GHz) equipp-ed with 128 GB memory. It runs on the standard Linux bash-
shell environment (including Perl, AWK) and calls internally for the following 
tools:
	BWA (v0.7_or_later)
	SAMtools (v1.0_or_later)
	BEDtools (v2.0_or_later)
	R (v2.0_or_later)

The system also uses the mapping results of RefSeq (NCBI Reference Sequence Database) 
and the results of RepeatMasker to annotate genome segments. Please prepare the tab-
delimited format of refFlat.txt and RepeatMask track data from UCSC Genome Table 
(https://genome.ucsc.edu/cgi-bin/hgTables), and specify the paths of the files in 
callallSV.cfg.

For the RepeatMask track data, please re-format it as below.

	$ zcat ./RepeatMask.txt.gz  |  awk -F"\t" '{print $6 FS $7 FS $8 FS $12 FS $13" ["$11"]",$10}'  > rmsk_all.bed
	$ grep rmsk_all.bed ${HDR}/callallsv.cfg
	RMSK=${HDR}/ref/rmsk_all.bed


USAGE
=========
As input, callallSV requires zipped SAM files in which paired-end reads were mapped 
with -T0 option by BWA (version 0.7 or more). The system does not support SAM files 
composed of multiple types of read lengths.


** Quick Start on demo data **
The input dataset for demonstration are prepared. In addition to the TN paired-sam.gz 
files for the two cases, the data set includes a set of lists required for the callallSV 
run (list.insertSize_MX, list.readLen, and list.smplID) and two trimmed reference files 
(refFlat_chr19.txt and rmsk_chr19.bed) modified for the demo sample. 

Please e-mail us [miadachi@ncc.go.jp.] to request the dataset (the total size of about 
110 Mb), we will propose an appropriate transmission method according to your environment.

To check the callallSV pipeline runs correctly in your computer environment, we strongly 
recommend that you first should run callallSV with the demonstration data and compare 
the result with files on ${HDR}/demo_outExample/.

## The demonstration data and its output example are prepared in our server 	##
## (https://data.server/callallsv/demo/).     <-Sorry, now in preparation...   	##
##	$ cd callallSV/								##
##	$ rsync  https://ncc.data.server/demo_inFiles.tar.gz  .			##

Getting "demo_inFiles.tar.gz", please extract the tar.gz file at ${HDR} as following: 

	$ mv ./demo_inFiles.tar.gz ${HDR}/;  cd ${HDR}/
	$ tar xvfz  ./demo_inFiles.tar.gz  	
	$ chmod 770 ./quick.start_demo.sh 
	$ ./quick.start_demo.sh demo_inFiles &> quick.start_demo.stdout &

It takes about half an hour or more to output the results of the demo calculation and 
creates three directories: PE/, SoftClip/, and Merge/. For the format of each output 
data, refer to the "Output" section described below.


** Instruction **
This instruction assumes the analysis of a dataset composed of two or more samples. 
The top-level directory of callallSV is described as $HDR in the following.

1.  Prepare input data structure

1-1) prepare_indatStr_perSmpl.sh
For preparing the input data structure, go to the directory where input sam.gz is 
located and run $HDR/script/prepare_indatStr_perSmpl.sh:
	$ cd [directory_with_input_sam.gz]		
	$ $HDR/script/prepare_indatStr_perSmpl.sh [sampleID([A-Z][A-Z][0-9][0-9][0-9])]  [target.sam.gz]  [status(normal/tumor)]  (opt.sequencer-ID(YYMMDD_))

!!NOTE!! 
As sample ID, please specify five letters consisting of two uppercase letters 
([A-Z][A-Z]) and three integers ([0-9][0-9][0-9]). And as a sequencer ID, please 
designate a directory name that starts with six integers 
([0-9][0-9][0-9][0-9][0-9][0-9]_). The callallSV system targets data only in 
these formatted name directories.  

This process is mandatory for optimizing memory/CPU usage in the calculation, which 
splits an input-sam.gz file into 2M reads and stored in a designated tree structure. 
Running this operation on both tumor and normal samples for each sample produces a 
data structure as below: 

  	$ find . -type d 
	.─ sampleID
	      ├ normal
	      │ └ YYMMDD_sequencer-ID 
	      │        ├ target_N_split2M_[0-9][0-9][0-9].sam.gz
	      └ tumor
		└ YYMMDD_sequencer-ID
	       	       ├ target_T_split2M_[0-9][0-9][0-9].sam.gz

		 

1-2) prepare_indatStr_mklist.sh 
After completing the process for all samples, put these data structures in one directory 
(here referred to as [indata_directory]). Then, executing $HDR/script/prepare_indatStr_mklist.sh 
in [indata_directory] creates a directory (list/) and a data list in it that the callallSV 
system calls internally.

	$ cd [indata_directory] 
　	$ $HDR/script/prepare_indatStr_mklist.sh 

This process makes the following three tab-delimited format lists of all samples in the 
target dataset in list/ directory.

　 	list.insertSize_MX : 	Maximum insertsize 
　 	list.readLen : 	Read length
　 	list.smplID :	Sample ID

The maximum insert size is automatically estimated, and you can edit the values in the 
list if you want. The script will output error messages and terminate if the data structure 
is not formed correctly.


2. Run callallSV 
Pass [indata_directory] as the arguments to $HDR/script/SVcaller_process-1.sh and start 
running callallSV.

	$ cd $HDR
	$ ./script/SVcaller_process-1.sh  [indata_directory]  	...(step.1)  
	$ ./script/chk_process_1.sh > .chk_process_1.stdout

	$ ./script/SVcaller_process-2.sh  [indata_directory]   	...(step.2)   
	$ ./script/chk_process_2pe.sh > .chk_process_2pe.stdout 
	$ ./script/chkprocess_2sc.sh > .chk_process_2sc.stdout 

	$ ./script/SVcaller_process-3.sh  [indata_directory]    	...(step.3)   
	$ ./script/chk_process_3.sh > .chk_process_3.stdout 

The callallSV system is run on a per-dataset basis rather than on a per-sample basis for 
use with large WGS datasets. Please be sure to run each chk_process-${n}.sh after finishing 
SVcaller_process-${n}.sh. The scripts named chk_process-*.sh check intermediate files or 
standard errors to verify that each process was executed successfully. It deletes unnecessary 
intermediate files if it confirms that the preceding SV-caller process was completed without 
error. Otherwise, it returns an error log. If callallSV does not work expectedly, please 
contact us by attaching ".chk_process-*.stdout" and "result.tar" in $HDR directory along with 
your error report.


OUTPUT
=========
The callallSV output is a tab-delimited file named Merge/sampleID/merged_SV.validated_filt.list 
with the following columns:

        $1      Sample ID 
        $2      Type of SV [deletion/tandem-duplication/inversion/translocation]   
        $3      Coordinates of SV junction [Chr#1:breakpoint#1 | breakpoint#2:Chr#2]
        $4      Gene_1 : 	Gene name if the SV breakpoint#1 is within the gene coding region
        $5      Gene_2 : 	Gene name if the SV breakpoint#2 is within the gene coding region
        $6      Size :  	Distance between breakpoint#1 and breakpoint#2
        $7      Sequence modification at SV junction : 	"m" and "i" mean microhomology and interspace, respectively.
        $8      Supporting type : 	"pe" means paired-end reads and "sc" means soft-clipped alignment reads. "pe/sc" means both types of reads support the SV. 
        $9      No.Supp.Read[status(REF,VAR):p-value] : 	Number of reads supporting the presence of SV and the absence of SV. Indicated by "t" for tumor samples and "n" for normal samples, with the p-value calculated by Fisher's exact test indicated after the colon.
        $10     Dir_1 :		Strand direction of support reads for breakpoint#1 	
        $11     Aln_1 : 	Size of the genomic region to which the supporting reads for breakpoint#1 are mapped.
        $12     MQmx_1 :	Maximum mapping quality score of support reads for breakpoint#1
        $13     Asmx_1 : 	Maximum alignment score of support reads for breakpoint#1
        $14     Dir_2 : 	Strand direction of support reads for breakpoint#2 
        $15     Aln_2 : 	Size of the genomic region to which the supporting reads for breakpoint#2 are mapped.
        $16     MQmx_2 :	Maximum mapping quality score of support reads for breakpoint#2
        $17     Asmx_2 : 	Maximum alignment score of support reads for breakpoint#2


SVs in samples with low tumor rates or SVs generated at the genomic segment with 
low copy number may not be detected because they were not supported by enough support 
reads. To reduce these false negatives, SV lists are placed under the PE/, SoftClip/ 
and Merge/ directories as a log of SV breakpoints detected recurrently from the input 
sequence data, prior to filtering of the normal panel or reliability conditions.

Tree view of output structure
	$HDR
	├─ Merge
	│     ├sample_ID_1
	│     │	├ merged_SV.list
	│     │	└ merged_SV.validated_filt.list 
	│     ├sample_ID_2
	│      ....
	│ 
	├─ PE
	│     ├sample_ID_1 
	│     │	└ tumor
	│     │	    ├ rearrangement_pe.txt
	│     │	    └ rearrangement_pe.filt.txt 
	│     ├sample_ID_2
	│      ...
	│ 
	└─ SoftClip
	        ├sample_ID_1 
	        │	└ tumor
	        │	    ├ breakpoint_sc.txt
	        │	    ├ breakpoint_sc.filt.txt	
	        │	    └ breakpoint_sc.filt.realn.txt
	        ├sample_ID_2
	         ...


LICENCE
========

Author: Mihoko Adachi <miadachi@ncc.go.jp> and Yasushi Totoki <ytotoki@ncc.go.jp> 

CallallSV is free software: you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free
Software Foundation; either version 3 of the License, or any later version.

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY 
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public 
License for more details.

You should have received a copy of the GNU General Public License along with 
this program. If not, see <http://www.gnu.org/licenses/>.

---------------------------------------------------------------------------------(EOF)---