callallSV package - the structural variant caller for WGS
License
ma9606/callallSV
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
callallSV ========== CallallSV was developed for somatic structural variant (SV) detection from whole- genome sequencing (WGS) data. It is optimized for the analysis of a large-scale dataset composed of hundreds of cases and outputs the lists of SV candidates with multilevel plausibility to reduce false negatives. As input, the system requires zipped SAM files of whole genome sequencing (WGS) data mapped by BWA with -T0 option, which were generated by paired-end reads from paired samples (tumor and matched normal samples). The read depth recommended is 40 and 30 or higher for tumor and normal samples, respectively. The sequence data from normal samples are also used for constructing the normal panel for the dataset. For the thorough detection of SVs, callallSV consists of two independent algorithms. The first algorithm used paired-end reads mapped discordantly (both ends align to the reference genome uniquely with improper spacing or orientation, or both). The second algorithm used single reads split and mapped apart (so-called "soft- clipped reads") to identify SV breakpoints. The detection result is obtained by merging outputs from these two algorisms. Then three support types of SV are contained in the list of SV candidates. That is, (1) the SV detected by both paired-end and soft-clip algorithms, (2) detected by only paired-end algorithms, (3) detected by only soft-clip algorism. In the second paired-end type of SV candidates, the breakpoints were shown by not points but ranges. The merged SV list is filtered by conditions determined by a validation experiment (RT-PCR and Sanger DNA sequencing) per each support type. Then finally, the system output the detection result as "merged_SV.validated_filt.list" under Merge/ directory. SETUP ========= Please edit the config file (callallsv.cfg) in the top directory of callallSV to suit your environment. The top directory of callallSV must be declared as ${HDR}. CallallSV requires BWA, SAMtools, BEDtools, R, and reference files. Please edit the paths for these tools and reference files in your environment. $ git clone https://github.com/ma9606/callallSV -- or -- $ tar xvzf callallSV-latest.tar.gz $ cd ./callallSV $ vi ./callallsv.cfg $ source ./callallsv.cfg REQUIREMENTS ============== CallallSV was designed on the assumption that it runs on a supercomputer under the management of grid computing software (e.g., Altair Grid Engine: AGE). It was tested on the SHIROKANE supercomputer in Human Genome Center, the university of Tokyo, which consists of 140 total nodes with 24 CPU cores Intel Xeon E5-2650 v4 (2.2 GHz) equipp-ed with 128 GB memory. It runs on the standard Linux bash- shell environment (including Perl, AWK) and calls internally for the following tools: BWA (v0.7_or_later) SAMtools (v1.0_or_later) BEDtools (v2.0_or_later) R (v2.0_or_later) The system also uses the mapping results of RefSeq (NCBI Reference Sequence Database) and the results of RepeatMasker to annotate genome segments. Please prepare the tab- delimited format of refFlat.txt and RepeatMask track data from UCSC Genome Table (https://genome.ucsc.edu/cgi-bin/hgTables), and specify the paths of the files in callallSV.cfg. For the RepeatMask track data, please re-format it as below. $ zcat ./RepeatMask.txt.gz | awk -F"\t" '{print $6 FS $7 FS $8 FS $12 FS $13" ["$11"]",$10}' > rmsk_all.bed $ grep rmsk_all.bed ${HDR}/callallsv.cfg RMSK=${HDR}/ref/rmsk_all.bed USAGE ========= As input, callallSV requires zipped SAM files in which paired-end reads were mapped with -T0 option by BWA (version 0.7 or more). The system does not support SAM files composed of multiple types of read lengths. ** Quick Start on demo data ** The input dataset for demonstration are prepared. In addition to the TN paired-sam.gz files for the two cases, the data set includes a set of lists required for the callallSV run (list.insertSize_MX, list.readLen, and list.smplID) and two trimmed reference files (refFlat_chr19.txt and rmsk_chr19.bed) modified for the demo sample. Please e-mail us [miadachi@ncc.go.jp.] to request the dataset (the total size of about 110 Mb), we will propose an appropriate transmission method according to your environment. To check the callallSV pipeline runs correctly in your computer environment, we strongly recommend that you first should run callallSV with the demonstration data and compare the result with files on ${HDR}/demo_outExample/. ## The demonstration data and its output example are prepared in our server ## ## (https://data.server/callallsv/demo/). <-Sorry, now in preparation... ## ## $ cd callallSV/ ## ## $ rsync https://ncc.data.server/demo_inFiles.tar.gz . ## Getting "demo_inFiles.tar.gz", please extract the tar.gz file at ${HDR} as following: $ mv ./demo_inFiles.tar.gz ${HDR}/; cd ${HDR}/ $ tar xvfz ./demo_inFiles.tar.gz $ chmod 770 ./quick.start_demo.sh $ ./quick.start_demo.sh demo_inFiles &> quick.start_demo.stdout & It takes about half an hour or more to output the results of the demo calculation and creates three directories: PE/, SoftClip/, and Merge/. For the format of each output data, refer to the "Output" section described below. ** Instruction ** This instruction assumes the analysis of a dataset composed of two or more samples. The top-level directory of callallSV is described as $HDR in the following. 1. Prepare input data structure 1-1) prepare_indatStr_perSmpl.sh For preparing the input data structure, go to the directory where input sam.gz is located and run $HDR/script/prepare_indatStr_perSmpl.sh: $ cd [directory_with_input_sam.gz] $ $HDR/script/prepare_indatStr_perSmpl.sh [sampleID([A-Z][A-Z][0-9][0-9][0-9])] [target.sam.gz] [status(normal/tumor)] (opt.sequencer-ID(YYMMDD_)) !!NOTE!! As sample ID, please specify five letters consisting of two uppercase letters ([A-Z][A-Z]) and three integers ([0-9][0-9][0-9]). And as a sequencer ID, please designate a directory name that starts with six integers ([0-9][0-9][0-9][0-9][0-9][0-9]_). The callallSV system targets data only in these formatted name directories. This process is mandatory for optimizing memory/CPU usage in the calculation, which splits an input-sam.gz file into 2M reads and stored in a designated tree structure. Running this operation on both tumor and normal samples for each sample produces a data structure as below: $ find . -type d .─ sampleID ├ normal │ └ YYMMDD_sequencer-ID │ ├ target_N_split2M_[0-9][0-9][0-9].sam.gz └ tumor └ YYMMDD_sequencer-ID ├ target_T_split2M_[0-9][0-9][0-9].sam.gz 1-2) prepare_indatStr_mklist.sh After completing the process for all samples, put these data structures in one directory (here referred to as [indata_directory]). Then, executing $HDR/script/prepare_indatStr_mklist.sh in [indata_directory] creates a directory (list/) and a data list in it that the callallSV system calls internally. $ cd [indata_directory] $ $HDR/script/prepare_indatStr_mklist.sh This process makes the following three tab-delimited format lists of all samples in the target dataset in list/ directory. list.insertSize_MX : Maximum insertsize list.readLen : Read length list.smplID : Sample ID The maximum insert size is automatically estimated, and you can edit the values in the list if you want. The script will output error messages and terminate if the data structure is not formed correctly. 2. Run callallSV Pass [indata_directory] as the arguments to $HDR/script/SVcaller_process-1.sh and start running callallSV. $ cd $HDR $ ./script/SVcaller_process-1.sh [indata_directory] ...(step.1) $ ./script/chk_process_1.sh > .chk_process_1.stdout $ ./script/SVcaller_process-2.sh [indata_directory] ...(step.2) $ ./script/chk_process_2pe.sh > .chk_process_2pe.stdout $ ./script/chkprocess_2sc.sh > .chk_process_2sc.stdout $ ./script/SVcaller_process-3.sh [indata_directory] ...(step.3) $ ./script/chk_process_3.sh > .chk_process_3.stdout The callallSV system is run on a per-dataset basis rather than on a per-sample basis for use with large WGS datasets. Please be sure to run each chk_process-${n}.sh after finishing SVcaller_process-${n}.sh. The scripts named chk_process-*.sh check intermediate files or standard errors to verify that each process was executed successfully. It deletes unnecessary intermediate files if it confirms that the preceding SV-caller process was completed without error. Otherwise, it returns an error log. If callallSV does not work expectedly, please contact us by attaching ".chk_process-*.stdout" and "result.tar" in $HDR directory along with your error report. OUTPUT ========= The callallSV output is a tab-delimited file named Merge/sampleID/merged_SV.validated_filt.list with the following columns: $1 Sample ID $2 Type of SV [deletion/tandem-duplication/inversion/translocation] $3 Coordinates of SV junction [Chr#1:breakpoint#1 | breakpoint#2:Chr#2] $4 Gene_1 : Gene name if the SV breakpoint#1 is within the gene coding region $5 Gene_2 : Gene name if the SV breakpoint#2 is within the gene coding region $6 Size : Distance between breakpoint#1 and breakpoint#2 $7 Sequence modification at SV junction : "m" and "i" mean microhomology and interspace, respectively. $8 Supporting type : "pe" means paired-end reads and "sc" means soft-clipped alignment reads. "pe/sc" means both types of reads support the SV. $9 No.Supp.Read[status(REF,VAR):p-value] : Number of reads supporting the presence of SV and the absence of SV. Indicated by "t" for tumor samples and "n" for normal samples, with the p-value calculated by Fisher's exact test indicated after the colon. $10 Dir_1 : Strand direction of support reads for breakpoint#1 $11 Aln_1 : Size of the genomic region to which the supporting reads for breakpoint#1 are mapped. $12 MQmx_1 : Maximum mapping quality score of support reads for breakpoint#1 $13 Asmx_1 : Maximum alignment score of support reads for breakpoint#1 $14 Dir_2 : Strand direction of support reads for breakpoint#2 $15 Aln_2 : Size of the genomic region to which the supporting reads for breakpoint#2 are mapped. $16 MQmx_2 : Maximum mapping quality score of support reads for breakpoint#2 $17 Asmx_2 : Maximum alignment score of support reads for breakpoint#2 SVs in samples with low tumor rates or SVs generated at the genomic segment with low copy number may not be detected because they were not supported by enough support reads. To reduce these false negatives, SV lists are placed under the PE/, SoftClip/ and Merge/ directories as a log of SV breakpoints detected recurrently from the input sequence data, prior to filtering of the normal panel or reliability conditions. Tree view of output structure $HDR ├─ Merge │ ├sample_ID_1 │ │ ├ merged_SV.list │ │ └ merged_SV.validated_filt.list │ ├sample_ID_2 │ .... │ ├─ PE │ ├sample_ID_1 │ │ └ tumor │ │ ├ rearrangement_pe.txt │ │ └ rearrangement_pe.filt.txt │ ├sample_ID_2 │ ... │ └─ SoftClip ├sample_ID_1 │ └ tumor │ ├ breakpoint_sc.txt │ ├ breakpoint_sc.filt.txt │ └ breakpoint_sc.filt.realn.txt ├sample_ID_2 ... LICENCE ======== Author: Mihoko Adachi <miadachi@ncc.go.jp> and Yasushi Totoki <ytotoki@ncc.go.jp> CallallSV is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>. ---------------------------------------------------------------------------------(EOF)---
About
callallSV package - the structural variant caller for WGS
Resources
License
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published