Skip to content
Permalink
Branch: master
Find file Copy path
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
155 lines (116 sloc) 10.5 KB
1. Introduction
Bisulfite sequencing (BS-seq) has been extensively used for DNA methylome study. Ideally, BS-seq experiment should be able to directly and exactly identify the methylation state of a DNA fragment from the original genome. However, current BS-seq protocols still possess several intrinsic biases, which will impact methylation level estimation, such as overhang end-repair, 5� bisulfite conversion failure, sequencing into the adaptor and 3� low sequencing quality. Since BS-seq experiments are widely used and the resulting data will continue to grow exponentially in the near future, there is a strong need for a dedicated QC tool to evaluate and remove potential technical biases in BS-seq experiments. Here, we developed BSeQC package. It can comprehensively evaluate the quality of BS-seq experiments and automatically trim nucleotides with potential technical biases that may result in inaccurate methylation estimation. In addition, BSeQC also support removing duplicate reads and keeping one copy of the overlapping segment in paired-end sequencing.
2. Installation
Prerequisite:
. Python
. Scipy
. Numpy
. Matplotlib
(We recommend to install this python package. If not, we only can produce the Mbias table, not the Mbias plot)
2.1 Install BSeQC in defaut location
1. tar zxf BSeQC-VERSION.tar.gz
2. cd BSeQC-VERSION
3. python setup.py install
#Skip step4 if '/usr/local/lib/python2.7/site-packages' is already included in your PYTHONPATH.
4. export PYTHONPATH=/usr/local/lib/python2.7/site-packages:$PYTHONPATH
#Skip step5 if '/usr/local/bin' is already included in your PATH.
5. export PATH=/user/local/bin:$PATH
#To make permanent change to your PYTHONPATH or PATH variables, copy the commands (step4 and step5) into your
#'/home/user/.bashrc' or '/home/user/.bash_profile'.
2.2 Install BSeQC in user specified location
1. tar zxf BSeQC-VERSION.tar.gz
2. cd BSeQC-VERSION
#You need to change '/home/user/' accordingly
3. python setup.py install --prefix=/home/user/
#setup PYTHONPATH, so that BSeQC knows where to import modules
4. export PYTHONPATH=/home/user/lib/pythonX.Y/site-packages:$PYTHONPATH
#setup PATH, so that system knows where to find executable files.
5. export PATH=/home/user/bin:$PATH
#To make permanent change to your PYTHONPATH or PATH variables, copy the commands (step4 and step5) into your
#'/home/user/.bashrc' or '/home/user/.bash_profile'.
3. Input
WGBS mapping files:
. SAM files
. BAM files
4. Usage Information
mbias: BSeQC main executable function
-s or --sam The SAM file for quality analysis; Multiple SAM file should be separated by the ','. (Mandatory)
-r or --ref The reference genome fasta file. (Mandatory)
-t or --samtools The path of samtools. (Optional)
-n or --name The name for the output plot and table. default = 'NA' (Optional)
-l or --len If the original mapping reads have been trimmed with adapter or other reasons, the original read length for the sam file should be set. Multiple length can also be separated by ','.
If the read length of two mates in paired-end is different, please separated by '-'. (Optional)
-p or --pvalue How many stds will be set for the trimming cutoff. Default = 2. (Optional)
--drift How many drifts(%mC) will be set for the trimming cutoff. Default = 2. (Optional)
-f or --trim_file User can determine the trimming bp by the trim file. (Optional)
-a or --auto Automatically trim the biased bp. If not you can use the Mcall biases plot to manually decide how many bps to trim and make a trimming file. Default = True. (Optional)
-o or --remove_overlap Keep only one copy of the overlapping segment of two read mates in paired-end seq. Default = True. (Optional)
--filter_dup Pvalue cutoff Poisson distribution test in removing duplicate reads. Default = True. (Optional)
--p_poisson Pvalue cutoff Poisson distribution test in removing duplicate reads. Default = 1e-5.
-g or --gsize Effective genome size for calculate max coverage. It can be 1.0e7 or 10000000, or shortcuts: 'hs' for human (2.7e9), 'mm' for mouse (1.87e9), 'ce' for C. elegans (9e7) and 'dm' for
fruitfly (1.2e8). It is restricted by --filter_dup. (Optional)
--not_mapping Whether keep the not-unique mapping reads in the filter SAM file. Default = True. (Optional)
nonuniform: use the �Bis-SNP strategy� as an alternative method for trimming 5� bisulfite conversion failure
-s or --sam The SAM file for quality analysis; Multiple SAM file should be separated by the ','. (Mandatory)
-r or --ref The reference genome fasta file. (Mandatory)
-t or --samtools The path of samtools. (Optional)
-n or --name The name for the output plot and table. default = 'NA' (Optional)
-o or --remove_overlap Keep only one copy of the overlapping segment of two read mates in paired-end seq. Default = True. (Optional)
--filter_dup Pvalue cutoff Poisson distribution test in removing duplicate reads. Default = True. (Optional)
--p_poisson Pvalue cutoff Poisson distribution test in removing duplicate reads. Default = 1e-5.
-g or --gsize Effective genome size for calculate max coverage. It can be 1.0e7 or 10000000, or shortcuts: 'hs' for human (2.7e9), 'mm' for mouse (1.87e9), 'ce' for C. elegans (9e7) and 'dm' for
fruitfly (1.2e8). It is restricted by --filter_dup. (Optional)
--not_mapping Whether keep the not-unique mapping reads in the filter SAM file. Default = True. (Optional)
Strand symbol and information:
Read type Strand Symbol Strand information
Single-end ++ Watson strand
Single-end -+ Crick strand
Paired-end ++ Forward strand of Watson of reference (mate1)
Paired-end +- Reverse strand of Watson of reference (mate2)
Paired-end -+ Forward strand of Crick of reference (mate1)
Paired-end -- Reverse strand of Crick of reference (mate2)
5. Example
mbias: BSeQC main executable function
##paired-end (testdata: mNPC_chr1)
#just focus on the mbias trimming
bseqc mbias -s SRR299067_chr1.bam,SRR299069_chr1.bam,SRR299071_chr1.bam,SRR299073_chr1.bam,SRR299075_chr1.bam,SRR299077_chr1.bam,SRR299079_chr1.bam,SRR299081_chr1.bam,SRR299083_chr1.bam,SRR299068_chr1.bam,SRR299070_chr1.bam,SRR299072_chr1.bam,SRR299074_chr1.bam,SRR299076_chr1.bam,SRR299078_chr1.bam,SRR299080_chr1.bam,SRR299082_chr1.bam -r mm9.fa -n mNPC_Paired_BSseq_chr1 -o --filter_dup
#filter duplicate reads, remove one copy of the overlapping segment, and not keep not_unique mapping reads
bseqc mbias -s SRR299067_chr1.bam,SRR299069_chr1.bam,SRR299071_chr1.bam,SRR299073_chr1.bam,SRR299075_chr1.bam,SRR299077_chr1.bam,SRR299079_chr1.bam,SRR299081_chr1.bam,SRR299083_chr1.bam,SRR299068_chr1.bam,SRR299070_chr1.bam,SRR299072_chr1.bam,SRR299074_chr1.bam,SRR299076_chr1.bam,SRR299078_chr1.bam,SRR299080_chr1.bam,SRR299082_chr1.bam -r mm9.fa -n mNPC_Paired_BSseq_chr1_roverlap_dup -g 1.9e8 --not_mapping
#use trimming file
bseqc mbias -s SRR299067_chr1.bam,SRR299069_chr1.bam,SRR299071_chr1.bam,SRR299073_chr1.bam,SRR299075_chr1.bam,SRR299077_chr1.bam,SRR299079_chr1.bam,SRR299081_chr1.bam,SRR299083_chr1.bam,SRR299068_chr1.bam,SRR299070_chr1.bam,SRR299072_chr1.bam,SRR299074_chr1.bam,SRR299076_chr1.bam,SRR299078_chr1.bam,SRR299080_chr1.bam,SRR299082_chr1.bam -r mm9.fa -n mNPC_Paired_BSseq_chr1_roverlap_dup -g 1.9e8 --not_mapping -f mNPC_Paired_BSseq_chr1_trim_file.txt
##single-end
#just focus on on the mbias trimming (testdata: H1_chr1)
#for replicate1: set the samtools path; set the original sequence length, because some bps in the 3' end of the reads with low quality or adapter have been trimmed during mapping
bseqc mbias -s methylC-seq_H1_r1_noaq_chr1_43.bam,methylC-seq_H1_r1_noaq_chr1_52.bam,methylC-seq_H1_r1_noaq_chr1_53.bam,methylC-seq_H1_r1_noaq_chr1_76.bam,methylC-seq_H1_r1_noaq_chr1_87.bam,methylC-seq_H1_r1_noaq_chr1_88.bam -r hg19.fa -l 43,52,53,76,87,88 -n H1_Single_BSseq_chr1_replicate1 --filter_dup
#for replicate2
bseqc mbias -s methylC-seq_H1_r2_noaq_chr1.bam -r hg19.fa -n H1_Single_BSseq_chr1_replicate2 -l 87 --filter_dup
nonuniform: use the �Bis-SNP strategy� as an alternative method for trimming 5� bisulfite conversion failure
##testdata: mNPC_chr1
bseqc nonuniform -s SRR299067_chr1.bam,SRR299069_chr1.bam,SRR299071_chr1.bam,SRR299073_chr1.bam,SRR299075_chr1.bam,SRR299077_chr1.bam,SRR299079_chr1.bam,SRR299081_chr1.bam,SRR299083_chr1.bam,SRR299068_chr1.bam,SRR299070_chr1.bam,SRR299072_chr1.bam,SRR299074_chr1.bam,SRR299076_chr1.bam,SRR299078_chr1.bam,SRR299080_chr1.bam,SRR299082_chr1.bam -r /mnt/Storage/data/Bowtie/mm9.fa -n Bis-SNP_strategy --filter_dup -o --not_mapping
##RRBS paired-end
bseqc rrbs -s SRR726536_chr1.sam -r mm9.fa -n rrbs_p
##RRBS single-end
bseqc rrbs -s SRR788619.sam -r mm9.fa -n rrbs_s_rep1
6. Output
mbias: BSeQC main executable function
name + 'CG_Mbias_plot.pdf': the CpG Mbias plot for each read length in each stand
name + 'nonCG_Mbias_plot.pdf': the non-CpG Mbias plot for each read length in each stand
name + 'Dup_dis.pdf': the duplicate reads distribution (when --filter_dup be set true)
name_CG_strand_readlength.csv(in directory:name+'Mbias_table): the CpG Mbias table for each read length in each stand
name_nonCG_strand_readlength.csv(in directory:name+'Mbias_table): the non-CpG Mbias table for each read length in each stand
name_CG_trim_file.txt(in directory:name+Mbias_table): the trimming decision for each read length in each stand from the CpG Mbias plot
name_nonCG_trim_file.txt(in directory:name+Mbias_table): the trimming decision for each read length in each stand from the non-CpG Mbias plot
name_final_trim_file.txt: the most stringent trimming decision made by either CpG or non-CpG cytosines M-bias plots
name_BSeQC_mbias_filter_report.txt: the detail trimming and filtering report
nonuniform: use the �Bis-SNP strategy� as an alternative method for trimming 5� bisulfite conversion failure
name_trimmed_nucleotides_dis.pdf: the distribution of the number of trimmed nucleotides based on the �Bis-SNP strategy�
name_BSeQC_mbias_filter_report.txt: the detail trimming and filtering report
7. NOTE
README file
Because different DNA strands and read lengths can have distinct biases, BSeQC will trim them differently. If users are concerned about the different coverage on two strands, we can use the parameter (-f TRIM_FILE) to let Watson(+) and Crick(-) strands be trimmed with same nucleotides in the 5’ and 3’.
The format of the trim_file(split by tab):
strand read_length trim_5'_pos trim_3'_pos
++ 100 2 98
++ 88 2 78
-+ 100 2 98
-+ 87 2 78
You can’t perform that action at this time.