PHARAOH is a pipeline used for correcting the phasing of HiFi reads aligned to a diploid assembly. PHARAOH uses ONT UL reads to phase the HiFi reads in regions where the diploid assembly is falsely homozygous for stretches longer than the length of the HiFi reads. For the rest of the genome, PHARAOH uses secondary alignments to re-assign reads to the correct haplotype, as implemented in secphase. PHARAOH was developed to provide highly accurate read phasing to the DeepPolisher model for polishing the HPRC samples.
- Align haplotype 1 assembly to haplotype 2 assembly, and detect homozygous regions. By default we extract homozygous regions > 20kb, the average length of the HiFi reads.
- Align HiFi reads to the diploid assembly using minimap2 or winnowmap. Extract the reads in the homozygous regions identified by step 1, filtering out reads with too high a divergence from the assembly (de > 0.02).
- Align reads from the homozygous regions to each haplotype using minimap2 or winnowmap
- Call variants in the homozygous regions with deepvariant
- Align ONT UL reads > 100kb to each haplotype separately. Using these reads, apply Margin or WhatsHap to phase the variants in homozygous regions called by DeepVariant in Step 4.
- Run secphase in dual mode as described here. Secphase will use phased variants from step 5 to assign reads in homozygous regions to the correct haplotype. In non-homozygous regions, marker-mode of secphase will be applied.
- Final alignment is produced, with read phasing fixed in homozygous regions.
PHARAOH requires three different alignments to be run. First, we need an initial alignment of all the HiFi reads to the diploid assembly, with either minimap2 or winnowmap. The reccommended parameters for this alignment are --cs --eqx -L -Y -I8g
Next, we need an alignment of all ONT reads to the maternal assembly / haplotype 1, and all ONT reads to the paternal assembly / haplotype 2. The reccommended parameters for these alignments are --cs --eqx -L -Y
.
Currently PHARAOH is implemented as a wdl workflow. A WDL file can be run locally using Cromwell, which is an open-source Workflow Management System for bioinformatics. The latest releases of Cromwell are available here and the documentation is available here.
The wdl to use for PHARAOH is located under wdl/workflows/PHARAOH.wdl.
The command to use for running PHARAOH.wdl with cromwell is:
java -jar cromwell-85.jar run \
wdl/workflows/PHARAOH.wdl \
-i inputs.json \
-m outputs.json
All required parameters must be supplied to inputs.json file. You may use womtool to generate an example inputs.json file, and pass in your files. Please also review the disk, memory, and thread defaults to ensure they are appropriate for your mahcine. These can be set in the inputs.json file as well.
wget https://github.com/broadinstitute/cromwell/releases/download/85/womtool-85.jar
womtool-85.jar womtool-85.jar wdl/workflows/PHARAOH.wdl > inputs.json
Below are the reccommended parameters to review for your inputs.json file:
{
"PHARAOH.allHifiToDiploidBam": "File",
"PHARAOH.allHifiToDiploidBai": "File",
"PHARAOH.allONTToHap1Bam": "File",
"PHARAOH.allONTToHap1Bai": "File",
"PHARAOH.allONTToHap2Bam": "File",
"PHARAOH.allONTToHap2Bai": "File",
"PHARAOH.Hap1Fasta": "File",
"PHARAOH.Hap1FastaIndex": "File",
"PHARAOH.Hap2Fasta": "File",
"PHARAOH.Hap2FastaIndex": "File",
"PHARAOH.PharaohAligner": "String (optional, default = \"minimap2\")",
"PHARAOH.PharaohHiFiPreset": "String (optional, default = \"map-hifi\")",
"PHARAOH.diploidFaGz": "File",
"PHARAOH.pafAligner": "String (optional, default = \"minimap2\")",
"PHARAOH.PharaohKmerSize": "String (optional, default = 19)",
"PHARAOH.useMargin": "Boolean (optional, default = false)",
"PHARAOH.sampleName": "String"
}
Detailed description of these parameters for PHARAOH.wdl:
Parameter | value |
---|---|
PHARAOH.allHifiToDiploidBam | Location of bam file with all HiFi reads aligned to diploid assembly. Can be generated by minimap2 or winnowmap |
PHARAOH.allHifiToDiploidBai | .bai index file for allHifiToDiploidBam. Use samtools index to generate |
PHARAOH.allONTToHap1Bam | Location of bam file with all ONT reads aligned to Haplotype 1. Can be generated by minimap2 or winnowmap |
PHARAOH.allONTToHap1Bai | .bai index file for allONTToHap1Bam |
PHARAOH.allONTToHap2Bam | Location of bam file with all ONT reads aligned to Haplotype 2. Can be generated by minimap2 or winnowmap |
PHARAOH.allONTToHap2Bai | .bai index file for allONTToHap2Bam |
PHARAOH.Hap1Fasta | .fasta file containing haplotype 1 |
PHARAOH.Hap1FastaIndex | .fai index file for Hap1Fasta. Use samtools faidx to generate |
PHARAOH.Hap2Fasta | .fasta file containing haplotype 2 |
PHARAOH.Hap2FastaIndex | .fai index file for Hap2Fasta. Use samtools faidx to generate |
PHARAOH.PharaohAligner | Aligner to use for aligning reads to assemblies. Either "minimap2" or "winnowmap" |
PHARAOH.PharaohHiFiPreset | For PharaohAligner="minimap2" use preset "map-hifi" and for PharaohAligner="winnowmap" use preset "map-pb" |
PHARAOH.diploidFaGz | diploid assembly fasta file. Needs to be gzipped. |
PHARAOH.pafAligner | Aligner to use for aligning the Hap1 fasta to the Hap2 fasta. Either "minimap2" or "winnowmap" |
PHARAOH.PharaohKmerSize | For PharaohAligner="minimap2" use 19 and for PharaohAligner="winnowmap" use 15 |
PHARAOH.useMargin | Boolean. True means use Margin for phasing variants in homozygous regions with the UL alignments, False means use WhatsHap. WhatsHap is the default and currently reccommended. |
PHARAOH.sampleName | name of sample being run |