Miscellaneous Definitions

The instance-wide output table produced by ScaleHD contains many flags and data entities that require explanation. During the development of ScaleHD, we identified a range of characteristics that indicate how believable the data from amplicon sequencing are when genotyping with our multiple reference library-based method. These characteristics give us a heuristic for determining whether an attempt at automated genotyping was successful. Here, we define these characteristics and what they mean in literal terms. They vary in importance, but all help to build a picture of data quality. Some are self-explanatory but are defined anyway for completeness.

A further note on SNP calling: either Freebayes or GATK may detect variants in allele contigs that are not relevant to the actual alleles of a given sample. For example, a sample with alleles 17_1_1_7_2 and 23_1_1_10_2 may have variants reported in an irrelevant contig such as 40_1_1_5_2. SNPs are only reported in InstanceResults.csv if they are found within the appropriate contigs for that sample; other 'irrelevant' variant reports are written to IrrelevantVariants.txt in the sample's specific output folder. An individual SNP is reported in the format "{original base}->{mutated base}: @{base pair position in read}", e.g. "C->T: @36".
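The SNP report format described above can be read back into structured values. The following is a minimal sketch only: the names `SnpCall` and `parse_snp` are hypothetical and are not part of ScaleHD, and the tolerance for extra whitespace is an assumption.

```python
import re
from typing import NamedTuple


class SnpCall(NamedTuple):
    """Hypothetical container for one reported SNP."""
    original_base: str
    mutated_base: str
    position: int


# Matches strings such as "C->T: @36"; optional whitespace is tolerated (assumption).
_SNP_PATTERN = re.compile(r"^\s*([ACGTN])\s*->\s*([ACGTN])\s*:\s*@\s*(\d+)\s*$")


def parse_snp(text: str) -> SnpCall:
    """Parse '{original base}->{mutated base}: @{position}' into a SnpCall."""
    match = _SNP_PATTERN.match(text)
    if match is None:
        raise ValueError(f"Unrecognised SNP string: {text!r}")
    original, mutated, position = match.groups()
    return SnpCall(original, mutated, int(position))


call = parse_snp("C->T: @36")
print(call.original_base, call.mutated_base, call.position)
```

A value such as "N/A" does not match the pattern and raises `ValueError`, so callers should check for that case first.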

Update: as of ScaleHD 0.322, only Freebayes is used.

The significance levels are as follows:

  • N/A -- The flag contains discrete information and does not need to be interpreted with regard to genotyping quality.
  • Dependent -- This flag may be significant, depending on other flags. For example, a high level of somatic mosaicism may be an indicator of poor genotyping quality when the CAG repeat tract size is within the non-HD-causing allele size range.
  • Minor -- This entity is of minor significance and in the vast majority of samples will not be a deciding factor for genotyping quality.
  • Moderate -- This entity is of moderate significance. It is unlikely to render a sample's genotype invalid on its own but may contribute to inaccurate genotyping.
  • Major -- This entity is of major significance and is strongly associated with genotyping quality. If any major informative flags are raised, it is recommended to manually inspect the alignment/mapping outputs for that sample.

For maximum genotyping accuracy, we recommend manual inspection of all samples for which any major flag was raised, and of all alleles with >47 CAGs.
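The recommendation above can be expressed as a small triage helper. This is a sketch under stated assumptions: the function names are hypothetical, the genotype string is assumed to carry the CAG count in its first underscore-separated field (as in "17_1_1_7_2"), and the caller is assumed to supply the names of any major flags raised for the sample.

```python
from typing import Iterable

# Alleles with more than this many CAGs are recommended for manual inspection.
CAG_INSPECTION_THRESHOLD = 47


def cag_length(genotype: str) -> int:
    """Return the CAG count, assumed to be the first '_'-separated field."""
    return int(genotype.split("_")[0])


def needs_manual_inspection(genotypes: Iterable[str],
                            major_flags_raised: Iterable[str]) -> bool:
    """True if any major flag was raised or any allele has >47 CAGs."""
    if list(major_flags_raised):
        return True
    return any(cag_length(g) > CAG_INSPECTION_THRESHOLD for g in genotypes)


print(needs_manual_inspection(["17_1_1_7_2", "23_1_1_10_2"], []))
print(needs_manual_inspection(["17_1_1_7_2", "48_1_1_7_2"], []))
print(needs_manual_inspection(["17_1_1_7_2"], ["SVM Failure"]))
```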

+----------------------------+-------------------+------------------------------------------------------------------------------------------------------+
| ScaleHD Flag               | Significance      | Definition                                                                                           |
+============================+===================+======================================================================================================+
| SampleName                 | N/A               | The sample name/label, taken from the file                                                           |
+----------------------------+-------------------+------------------------------------------------------------------------------------------------------+
| Pri/Sec GTYPE              | N/A               | The genotype returned for each allele (e.g. CAG_1_1_CCG_2)                                           |
+----------------------------+-------------------+------------------------------------------------------------------------------------------------------+
| Status                     | Dependent         | Whether the structure of the intervening sequence is typical (1_1) or atypical (anything else).      |
+----------------------------+-------------------+------------------------------------------------------------------------------------------------------+
| BSlippage                  | Dependent         | The amount of backwards slippage, relative to each allele's peak. Calculated as [(n-1 to n-5) / n].* |
+----------------------------+-------------------+------------------------------------------------------------------------------------------------------+
| Somatic Mosaicism          | Dependent         | The amount of somatic mosaicism, relative to each allele's peak. Calculated as [(n+1 to n+10) / n].* |
+----------------------------+-------------------+------------------------------------------------------------------------------------------------------+
| VariantCall                | Dependent         | If a SNP was detected, the value of the nucleotide in REF:OBSERVED. States "N/A" if no SNP is found  |
+----------------------------+-------------------+------------------------------------------------------------------------------------------------------+
| VariantScore               | Dependent         | The "QUAL" value from the allele's VCF file.                                                         |
|                            |                   | The higher the value, the more reliable the variant call.                                            |
+----------------------------+-------------------+------------------------------------------------------------------------------------------------------+
| Confidence                 | Major             | The percentage confidence in a genotype call. See the Confidence Calculation subsection for details. |
|                            |                   | We consider confidence <55% to be of major significance.                                             |
+----------------------------+-------------------+------------------------------------------------------------------------------------------------------+
| Exception Raised           | N/A               | If the pipeline reached a fatal error on a sample, the stage at which it crashed is listed here.     |
+----------------------------+-------------------+------------------------------------------------------------------------------------------------------+
| Homozygous Haplotype       | Moderate/Major    | The sample has a homozygous haplotype, i.e. both alleles have the same genotype.                     |
|                            |                   | Moderate significance in individuals not affected by HD, but major significance when genotyping      |
|                            |                   | a population of HD patients or of individuals at risk of HD.                                         |
+----------------------------+-------------------+------------------------------------------------------------------------------------------------------+
| Neighbouring Peaks         | Moderate          | Two alleles exist within the same CCG dimension, with CAG values separated by 1.                     |
+----------------------------+-------------------+------------------------------------------------------------------------------------------------------+
| Diminished Peaks           | Moderate          | One peak in a CCG-homozygous sample is a large expansion with a relatively minuscule read count.     |
+----------------------------+-------------------+------------------------------------------------------------------------------------------------------+
| Novel Atypical             | Major             | An allele with an atypical intervening sequence that differs from the commonly observed              |
|                            |                   | atypical structures.                                                                                 |
+----------------------------+-------------------+------------------------------------------------------------------------------------------------------+
| Alignment Warning          | Moderate          | When determining CCG values, more values were returned than is possible (i.e. more than 2 results).  |
+----------------------------+-------------------+------------------------------------------------------------------------------------------------------+
| Atypical Alignment Warning | Minor             | When re-aligning for an atypical allele, a particularly poor-quality re-alignment                    |
|                            |                   | produced unclear data.                                                                               |
+----------------------------+-------------------+------------------------------------------------------------------------------------------------------+
| CCG Rewritten              | Moderate          | An allele's CCG value (from DSP) was determined to be invalid and overwritten.                       |
+----------------------------+-------------------+------------------------------------------------------------------------------------------------------+
| CCG Zygosity Rewritten     | Minor             | An allele was deemed CCG-heterozygous (typical reference),                                           |
|                            |                   | but was detected to be an atypical CCG-homozygous allele.                                            |
+----------------------------+-------------------+------------------------------------------------------------------------------------------------------+
| CCT Uncertainty            | Minor             | When using DSP to determine the CCT tract length, there was no clear agreement                       |
|                            |                   | (e.g. CCT2 = 55%, CCT3 = 45%).                                                                       |
+----------------------------+-------------------+------------------------------------------------------------------------------------------------------+
| SVM Failure                | Major             | The confusion matrix produced by the SVM was inconclusive, and CCG zygosity had to be bootstrapped.  |
+----------------------------+-------------------+------------------------------------------------------------------------------------------------------+
| Differential Confusion     | Major             | The allele sorting algorithm could not distinguish between a potential neighbouring peak             |
|                            |                   | and a homozygous haplotype.                                                                          |
+----------------------------+-------------------+------------------------------------------------------------------------------------------------------+
| Missed Expansion           | Major             | One genotyping algorithm believes an allele to be expanded with low reads;                           |
|                            |                   | the other algorithm believes otherwise.                                                              |
+----------------------------+-------------------+------------------------------------------------------------------------------------------------------+
| Peak Inspection Warning    | Minor             | At least one allele failed minimum read-count distribution threshold inspection.                     |
|                            |                   | Common in "bad" sequencing data.                                                                     |
+----------------------------+-------------------+------------------------------------------------------------------------------------------------------+
| Low Distribution Reads     | Dependent         | At least one allele's CAG read distribution (e.g. the CAG read distribution from CCGxyz,             |
|                            |                   | 200 Fw-reads in length) contains a notably low number of reads.                                      |
+----------------------------+-------------------+------------------------------------------------------------------------------------------------------+
| Low Peak Reads             | Major             | In a given allele's read distribution, the n value contains a very low number** of reads.            |
|                            |                   | Genotyping is difficult in such cases.                                                               |
+----------------------------+-------------------+------------------------------------------------------------------------------------------------------+
| Map % (Forward Reads)      | Major (only <90%) | Percentage of forward reads aligned.                                                                 |
|                            |                   | If less than 90% mapped, this has a significant impact on performance.                               |
+----------------------------+-------------------+------------------------------------------------------------------------------------------------------+

*n denotes the number of reads for the modal allele; **very low reads is defined as an n value containing <=200 reads
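The BSlippage and Somatic Mosaicism ratios in the table can be illustrated with a short sketch. It assumes that "(n-1 to n-5)" means the summed read counts of the five CAG sizes immediately below the modal peak, that "(n+1 to n+10)" means the ten sizes immediately above it, and that a read distribution is a plain list indexed by CAG size. The function names are hypothetical and this is not taken from the ScaleHD source.

```python
from typing import Sequence


def window_ratio(counts: Sequence[int], peak: int, offsets: range) -> float:
    """Sum reads at peak+offset for each offset, then divide by reads at the peak."""
    peak_reads = counts[peak]
    if peak_reads == 0:
        raise ValueError("Peak position contains no reads")
    # Offsets that fall outside the distribution are skipped.
    total = sum(counts[peak + o] for o in offsets if 0 <= peak + o < len(counts))
    return total / peak_reads


def backwards_slippage(counts: Sequence[int], peak: int) -> float:
    """[(n-1 to n-5) / n]: reads just below the peak, relative to the peak."""
    return window_ratio(counts, peak, range(-5, 0))


def somatic_mosaicism(counts: Sequence[int], peak: int) -> float:
    """[(n+1 to n+10) / n]: reads just above the peak, relative to the peak."""
    return window_ratio(counts, peak, range(1, 11))


# Illustrative (made-up) read counts with the modal peak at index 6.
distribution = [0, 10, 20, 30, 40, 50, 1000, 100, 50, 25, 0, 0]
print(backwards_slippage(distribution, 6))
print(somatic_mosaicism(distribution, 6))
```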

Confidence Calculation

For each allele, ScaleHD calculates a confidence level for the genotyping result. This is drawn from a variety of sources and attempts to paint an evidence-based picture of the data quality and the resulting genotype confidence. Each allele starts with 100% confidence, and penalties are applied when certain data characteristics are discovered during the genotyping process. The following evidence is used to determine each allele's confidence level:

  • Whether the First Order Differential peak confirmation stage had to re-run itself with a lower threshold. More re-calls result in a higher penalty.
  • Rare characteristics, such as homozygous haplotypes, or neighbouring/diminished peaks, incur a penalty.
  • Atypical alleles are treated with more caution, and scores are weighted slightly more severely than typical alleles.
  • Simple data aspects such as total read count within a sample/distribution/peak are used.
  • Mapping percentages are taken into account, albeit as a minor factor within this algorithm.
  • "Fatal" errors, such as Differential Confusion, incur a significant penalty.

Any confidence score is capped at 100%. If the quality of data in a particular sample is high enough for alleles to be awarded a confidence score higher than 100%, they are still reported as 100%. Generally, a 'good' score is anything over 80%, and we have found that samples returning a score of over 60% are believable. Anything less than this may justify manual inspection.
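The cap and the score bands described above can be sketched as follows. This only illustrates how a reported score might be interpreted; it does not reproduce ScaleHD's penalty calculation. The function names and band labels are hypothetical, and treating a score of exactly 60% as needing inspection is an assumption.

```python
def cap_confidence(raw_score: float) -> float:
    """Scores above 100% are reported as 100%."""
    return min(raw_score, 100.0)


def describe_confidence(raw_score: float) -> str:
    """Map a confidence score onto the bands described in the text."""
    score = cap_confidence(raw_score)
    if score > 80:
        return "good"
    if score > 60:
        return "believable"
    # Assumption: a score of exactly 60 falls into this band.
    return "consider manual inspection"


for raw in (112.5, 75.0, 52.0):
    print(cap_confidence(raw), describe_confidence(raw))
```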