This study was performed to establish a novel method named LDFF to accurately quantify cell-free fetal DNA fraction (FF) in maternal plasma by utilizing linkage disequilibrium (LD) information from maternal and fetal haplotypes. The workflow consists of four processes, the first of regional LD-ratios were calculated on the 521 genomic regions, as well as the genome coverage, coverage of the reads with a MQ score>0 and PCR duplication rate. Then, several multivariate regression models were generated with different MAF filtering cutoff using all training samples. Thrid, the outliers in the training samples in different MAF filtering models were identified and removed from the corresponding model to avoid over-fitting. Multivariate regression models were rebuilt using the remaining samples. Finally, MAF filtering cutoff was selected as the model has best accuracy.
The following steps are to work in the Perl 5 or R environment. And samtools and STITCH (version v1.5.3.0008) should be installed.
In addition, you need to prepare the genome reference file "hg19.fasta" and the reference panels. The reference panels contain haplotype files and legend files for each chromosome from 1000GP Phase3. The reference panel format used by STITCH is the same as used by QUILT. Please see the QUILT web page here for specific details of the format.
The "bamlist" file records the path of your input bam files with only 1 column as "example/bamlist". The "bamstats" file records the path of your input bam stats files which were generated by the command "samtools stats input.bam > input.bam.stats" with only 1 column as "example/bamstats". The "samplelist" file contains the sample names which record in the "SM" tag in the bam header of your input bam.
samtools mpileup -f database_hg19/hg19.fasta -b bamlist -r chr1:1-5000000 -v -o result/chr1.1.5000000/mpileup.chr1.1.5000000.vcf.gz -l database_hg19/1kg.easaf0.2/chr1.pos.txt
Rscript bin/STITCH.full.step2.R result/chr1.1.5000000/ bamlist database_hg19/1kg.easaf0.01/chr1.pos.txt chr1 1 5000000 database_hg19/1000GP_Phase3/1000GP_Phase3_chr1.legend.gz database_hg19/1000GP_Phase3/1000GP_Phase3_chr1.hap.gz database_hg19/1000GP_Phase3/1000GP_Phase3.sample (For more information about how STITCH works, please refer to the article DOI: 10.1038/ng.3594)
perl bin/regional_LDratio.pl result/chr1.1.5000000/mpileup.chr1.1.5000000.vcf.gz result/chr1.1.5000000/stitch.chr1.1.5000000.vcf.gz samplelist result/chr1.1.5000000/regional_LDratios.txt database_hg19/1000GP_Phase3/1000GP_Phase3_chr1.legend.gz
perl bin/bamstat.combine.pl bamstats > confounders
Pastes the regional LD-ratios of maf 0.02 in regional_LDratios.txt and the genome coverage, coverage of the reads with a MQ score>0 and PCR duplication rate in confounders into train.input.maf0.2
Rscript bin/Linear_regression_chrYbasedff.R train.input.maf0.2 test.input.maf0.2
Rscript bin/run_Reg_Diag.R Linear-model.RData |grep '*' > outliers_line.log
perl bin/remove_outliers.pl train.input.maf0.2 outliers_line.log > train.input.maf0.2.removeoutliers
Rscript bin/Linear_regression_chrYbasedff.R train.input.maf0.2.removeoutliers test.input.maf0.2> test.input.log