The ClinGen ACMG Variant Interpretation Profile
In collaboration with SEPIO developers, the ClinGen Data Model Working Group created a SEPIO Profile to support representation of Variant Pathogenicity Interpretations produced using the American College of Medical Genetics and Genomics (ACMG) Guidelines. Technical documentation about this model can be found on the ClinGen Variant Interpretation website. Below we describe the scope and domain of the ClinGen-ACMG Interpretation Profile, and detail how this profile extends the core SEPIO Ontology and Information Model.
The SEPIO Framework
SEPIO provides an ontology-based modeling framework for representing scientific assertions and the evidence and provenance supporting them. This framework comprises four components:
- SEPIO Core Ontology: defines the core, domain-agnostic model using the 'open world' OWL description logic language.
- SEPIO Information Model: provides a UML-like view of the ontology with the constraints of a 'closed world' data model, specifying how terms and design patterns defined in SEPIO may be used to structure data.
- SEPIO Profiles: application-specific data models that refine the maximal information model, and can extend it with domain-specific content to support a custom schema for a particular use case.
- SEPIO Value Sets: re-usable collections of terms that can be bound to attributes in a particular Profile to constrain data entry.
Additional information about these components and their interactions can be found here.
Variant Pathogenicity Interpretations
Variant Pathogenicity Interpretations are assertions about the causal relationship between a sequence variant and a genetic condition. For example, an assertion that the GLA gene variant NM_000169.2(GLA):c.639+919G>A is pathogenic for Fabry Disease, or that the BRCA1 gene variant NM_007294.3(BRCA1):c.5572A>C is benign for Familial Breast-Ovarian Cancer. Due to the inherent complexity of establishing variant-disease relationships, these assertions rely on nuanced interpretation of diverse types of evidence. It is not uncommon for experts to reach conflicting conclusions about a variant's pathogenicity, as demonstrated by this recent study. This is problematic, given direct applications of this knowledge in patient care, where it is increasingly used to inform diagnosis and management of disease.
The ACMG Interpretation Guidelines were defined in 2015 to improve the rigor and consistency of variant pathogenicity interpretations. They provide a structured reasoning framework for identifying and weighting individual lines of evidence, and combining them to reach a final conclusion. The guidelines define 28 criteria representing different types of evidence relevant to the pathogenicity of a variant. The criteria are evaluated using diverse types of data, including population frequency data, computational and predictive data, functional data, segregation data, de novo allele data, and genomic context information. For example, the PM2 criterion describes a line of evidence based on population data, which offers moderate (M) support for pathogenicity (P), and is 'met' if a variant is absent from databases cataloging variants found in control populations. The BS3 criterion is based on functional data, and offers strong (S) support for a benign (B) interpretation if well-established studies show no deleterious effect on protein function. The outcomes of one or more of these 'criterion assessments' are combined according to an ACMG 'rule set', to arrive at one of five possible classifications for a given variant: pathogenic (P), likely pathogenic (LP), benign (B), likely benign (LB), or variant of uncertain significance (VUS). For example, one rule assigns a benign classification if one benign stand-alone (BA) criterion is met, or if two or more benign strong (BS) criteria are met.
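The benign rule described above can be sketched in code. This is a minimal illustration of that single rule only, not the full ACMG rule set; the function name and the representation of met criteria as code strings such as `"BA1"` or `"BS3"` are assumptions made for this example.

```python
from typing import Iterable


def meets_benign_rule(met_criteria: Iterable[str]) -> bool:
    """Return True if the met criteria satisfy the benign rule described above.

    Assumes each criterion is a code string whose two-letter prefix encodes
    direction and strength, e.g. "BA1" (benign stand-alone), "BS3" (benign strong).
    """
    codes = set(met_criteria)  # a criterion counts once, even if listed twice
    stand_alone = [c for c in codes if c.startswith("BA")]
    strong = [c for c in codes if c.startswith("BS")]
    # Benign if one stand-alone criterion is met, or two or more strong criteria.
    return len(stand_alone) >= 1 or len(strong) >= 2
```

A single met strong criterion, for example `meets_benign_rule(["BS3"])`, does not satisfy this rule on its own; other rules in the set would then determine the classification.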
The ClinGen Use Case
ClinGen is an international consortium that develops resources for defining the clinical relevance of genes and variants for use in precision medicine and research. Central to these efforts is support for the creation and exchange of variant interpretations among clinical and research information systems. ClinGen has developed a suite of tools for curating, evaluating, and sharing variant pathogenicity interpretations, including detailed evidence and provenance information pertinent to the ACMG Guidelines. The SEPIO Profile they have developed to represent ACMG-based variant interpretations is formally implemented as a JSON-LD schema (LINK), and is being implemented as a standard interface for data exchange across the ClinGen ecosystem.
II. The ClinGen-ACMG Profile
The ClinGen-ACMG Profile extends the core SEPIO model to support domain-specific data types and value sets required for modeling the ACMG interpretation workflow. Below we describe key components of this profile and how they were created.
1. Conceptual Mapping
Concepts in the ClinGen-ACMG data were initially mapped to elements of the SEPIO Information Model. This alignment process helped to identify the subset of types and attributes in the Information Model required to represent ClinGen data, and where extensions would be required to accommodate domain- or application-specific concepts. Many native ClinGen concepts mapped cleanly onto concepts in the SEPIO model, while other mappings required recasting of ClinGen data structures in terms of SEPIO constructs. Figure 1 shows the highest-level SEPIO concepts that were required for representing ClinGen Variant Interpretation data. These included all parts of the model except 'Activities', as the processes generating evidence information are not directly described in the ClinGen data.
Figure 1: High-level concepts and relationships in the SEPIO model. Those required for modeling ClinGen data are highlighted in orange. The "Evidence Item" term is shown as a UML 'stereotype' in guillemets (<< >>) within the Information Content Entity class, indicating that any information contributing to an Evidence Line at this position is inferred to be an instance of the logically defined Evidence Item type in SEPIO.
Note that the ClinGen Profile uses a 'Reified Contribution' (LINK) approach that SEPIO defines as an alternative to the default 'Direct Contribution' modeling pattern. This alternative pattern best supported ClinGen requirements for more detailed representation of how agents contribute to particular assertions or lines of evidence. In this approach a 'Contribution' object organizes information about the part played by a particular agent in the creation or modification of a particular artifact, including the timing of the contribution and the roles the agent played in making it.
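As a rough sketch of the Reified Contribution pattern, a Contribution can be modeled as its own object attached to the artifact it describes. The class names, field names, and identifiers below are hypothetical and chosen for illustration; they are not taken from the ClinGen schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Contribution:
    """One agent's part in creating or modifying an artifact."""
    agent: str  # identifier of the contributing agent (hypothetical)
    role: str   # role the agent played, e.g. "assessor" (hypothetical value)
    date: str   # timing of the contribution, as an ISO 8601 date string


@dataclass
class CriterionAssessment:
    """An artifact that carries its contributions as separate objects."""
    id: str
    contributions: List[Contribution] = field(default_factory=list)


def agents_in_role(artifact: CriterionAssessment, role: str) -> List[str]:
    """List the agents who contributed to the artifact in the given role."""
    return [c.agent for c in artifact.contributions if c.role == role]


assessment = CriterionAssessment(
    id="CritAssess001",
    contributions=[
        Contribution(agent="Agent001", role="assessor", date="2017-01-15"),
        Contribution(agent="Agent002", role="approver", date="2017-02-01"),
    ],
)
```

Because each contribution is a separate object, role and timing can be recorded per agent, which a direct agent-to-artifact link could not express.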
2. Ontology Extension
SEPIO is a domain-agnostic ontology that lacks domain-specific concepts or commitments. A given SEPIO Profile must extend the core ontology to define any domain- or application-specific concepts it requires. These are implemented as terms that 'specialize' core SEPIO classes and properties, in a separate OWL file representing an 'ontology extension' that imports the core SEPIO Ontology. Ontological representation of such domain-specific concepts is an important prerequisite for their use in defining a SEPIO Profile, as all types and attributes defined in a data model must be based on corresponding classes and properties in a supporting ontology.
In order to represent key aspects of the Variant Interpretation domain and ACMG Framework, the ClinGen-ACMG Profile defined several extensions to the SEPIO ontology as specializations of core SEPIO classes. For example, 'Variant Pathogenicity Interpretation' and 'Criterion Assessment' are implemented as subclasses of SEPIO:Assertion (Figure 2A). Additional extensions are created to represent 'domain entities' that variant interpretations and their evidence are about - e.g. Alleles, Phenotypes, and Genetic Conditions (Figure 2B). These 'domain entities' are depicted as specializations of the root 'Entity' class in SEPIO, but in practice may be implemented as subclasses of more specific SEPIO terms (e.g. 'Person' as a subclass of 'Material Entity', or 'Genotype' as a subclass of 'Information Content Entity'). The resulting ontology extension provides all classes and properties needed to define types and attributes for the ClinGen-ACMG Profile.
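The specializations named above can be pictured as a small subclass hierarchy. The sketch below encodes them as a plain Python mapping rather than OWL; the identifiers are simplified labels, not actual SEPIO IRIs, and placing 'Material Entity' and 'Information Content Entity' directly under 'Entity' is an assumption made for illustration.

```python
from typing import Dict, Optional

# child class -> parent class (simplified labels, not real ontology IRIs)
SUBCLASS_OF: Dict[str, str] = {
    "VariantPathogenicityInterpretation": "Assertion",
    "CriterionAssessment": "Assertion",
    "Person": "MaterialEntity",
    "Genotype": "InformationContentEntity",
    "MaterialEntity": "Entity",            # assumed placement under the root
    "InformationContentEntity": "Entity",  # assumed placement under the root
}


def is_a(cls: str, ancestor: str) -> bool:
    """Return True if cls equals ancestor or specializes it, directly or not."""
    current: Optional[str] = cls
    while current is not None:
        if current == ancestor:
            return True
        current = SUBCLASS_OF.get(current)  # None once the chain runs out
    return False
```

With this mapping, `is_a("CriterionAssessment", "Assertion")` holds through a direct subclass link, while `is_a("Person", "Entity")` holds through the intermediate 'Material Entity' class.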
Figure 2: ClinGen-ACMG Ontology Specializations. Core SEPIO Classes are in light orange, and specializations in dark orange boxes. (A) Specializations of core SEPIO concepts. The 'Statement' class is extended with specific types that will serve as evidence in the ClinGen Model - so these Statement subclasses can be seen as specializing the 'Evidence Item' class as well. (B) Specializations defining 'domain entities' - concepts in the domain of discourse that assertions describe.
3. Profile Definition
Once source data has been mapped to the SEPIO model and ontology extensions have been implemented, the next step is creating the data model the profile will implement. At a high level this involves selecting the modeling patterns from the maximal Information Model required to represent source data, and defining the overall 'shape' of the model - including how many 'levels' of evidence it will implement. This is an important consideration given that SEPIO allows for 'evidence graphs' of arbitrary depth, that trace the provenance of a 'target assertion' across multiple levels of evidence based on 'prior assertions' and their respective evidence (LINK).
The ClinGen-ACMG Profile defines a multi-level structure as shown in Figure 3. The shape of this model aligns with the reasoning workflow prescribed in the ACMG Guidelines, and it represents the specific datatypes relevant for variant interpretations. The Criterion Assessment in the middle of the structure serves as a 'prior assertion' in the first level of the model, where it represents an Evidence Item supporting the root Variant Interpretation. This same Criterion Assessment serves as the 'target assertion' in a second level, where it may be supported by additional Evidence Lines. Note that Evidence Items in the second level are uniformly represented as 'Statements' in the ClinGen model, and more foundational data items supporting these statements are captured within the Statement object - e.g. specific population frequency counts and calculations (not shown) that support an Allele Frequency Statement. Note also that the structure does allow for even more levels of evidence, as there are some ClinGen data records where the Statement serving as evidence for a Criterion Assessment is a prior assertion that has its own trail of evidence.
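The two-level shape described above can be sketched as nested data, with a helper that counts how many levels of evidence sit beneath an assertion. The attribute names `evidenceLines` and `evidenceItems` and the type labels are hypothetical placeholders, not the names used in the ClinGen schema.

```python
from typing import Any, Dict

record: Dict[str, Any] = {
    "type": "VariantPathogenicityInterpretation",  # root target assertion
    "evidenceLines": [
        {
            "type": "EvidenceLine",
            "evidenceItems": [
                {
                    # prior assertion in level 1, target assertion in level 2
                    "type": "CriterionAssessment",
                    "evidenceLines": [
                        {
                            "type": "EvidenceLine",
                            "evidenceItems": [
                                {"type": "AlleleFrequencyStatement"}
                            ],
                        }
                    ],
                }
            ],
        }
    ],
}


def evidence_depth(assertion: Dict[str, Any]) -> int:
    """Count the levels of evidence below an assertion (0 if it has none)."""
    depths = [
        1 + evidence_depth(item)
        for line in assertion.get("evidenceLines", [])
        for item in line.get("evidenceItems", [])
    ]
    return max(depths, default=0)
```

Because the helper recurses, it also handles records where a Statement is itself a prior assertion with its own trail of evidence.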
Figure 3: High-Level Structure of the ClinGen-ACMG Profile. Types that represent specializations of core SEPIO classes are in darker orange. The ellipsis at the bottom of the figure indicates that additional levels of evidence may be implemented in the profile.
Once the high-level structure of the model is defined, the finer details of the profile can be specified. These include indicating cardinality and data type constraints on attributes, and binding coded attributes to value sets. Figure 4 shows the detailed model defined for the ClinGen-ACMG Profile, including cardinalities based on ClinGen application requirements, and value sets defined for the ClinGen use case. The model presented here contains a subset of the maximal SEPIO Information Model, which it extends with specialized types and attributes as described above to give a final profile. Figure 6 below shows an example Variant Interpretation record from the ClinGen database structured according to the ClinGen-ACMG Profile.
Figure 4: Detailed Structure of ClinGen-ACMG Profile. Basic UML notation is used to present data types and their attributes. Value sets bound to attributes are indicated in guillemets (<< >>). Note that the attribute names shown here may differ slightly from those in the ClinGen JSON schema, which reflects application-specific preferences - but the underlying identifiers for these attributes are the same. The ClinGen JSON-LD context captures mappings from the ClinGen-preferred type and attribute labels to the corresponding SEPIO ontology terms.
Note that the profile diagram above does not show the schema for each of the 22 Statement types that serve as Evidence Items in the second level of the model. These may differ slightly between Statement types, but all are based on a generic Statement schema defined by SEPIO (LINK). Those Statements that describe study data directly (i.e. 'Study Findings') may include attributes specific to each data type captured in the Statement. For example, the Population Allele Frequency Statement, which describes the results of a population frequency study, includes attributes to capture values for allele count, allele frequency, and median sequencing coverage. See the 'Evidence Statement' section of the ClinGen documentation here for a description of their models for each Statement type.
4. Value Sets
For attributes that take a 'code' as a data type in the Information Model, a Profile can define the value sets it will use. SEPIO provides a Value Set Model (LINK) that extends the SKOS framework (LINK), to support the implementation of value sets as part of the profile's ontology extension. Briefly, a particular value set (e.g. the ClinGen 'Allelic Phase Value Set') is implemented in the ontology extension as an instance of the SEPIO 'Value Set' class. The individual terms comprising a value set (e.g. 'in cis', 'in trans') are implemented as instances of the SKOS Concept class, and linked to their containing value set using the SKOS 'inScheme' property. Attributes of a particular value set that can be defined using the SEPIO model include the notion of 'extensibility' (whether the set is closed or can be extended), and links to one or more 'identifier systems' from which terms in the value set can be taken. Figure 5A shows the basic schema for defining value sets in the SEPIO model.
The ClinGen-ACMG Profile required the creation of 34 value sets, which are implemented in the ClinGen ontology extension using the SEPIO Value Set Model. As an example, Figure 5B shows the model for the 'Allelic Phase Value Set'. This is a 'fixed' value set (i.e. not extensible) that consists of just two values taken from the Genotype Ontology (GENO). Note that the valueSetExtensibility attribute itself takes a code as its value, drawn from the 'Value Set Extensibility Value Set'.
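A minimal sketch of how a fixed value set constrains data entry, using the 'Allelic Phase Value Set' as the example. The class and function names are hypothetical, and terms are represented by their labels rather than by GENO identifiers; this does not reproduce the SEPIO Value Set Model itself.

```python
from dataclasses import dataclass, replace
from typing import FrozenSet


@dataclass(frozen=True)
class ValueSet:
    label: str
    extensible: bool          # False corresponds to a 'fixed' value set
    concepts: FrozenSet[str]  # term labels standing in for SKOS Concepts


ALLELIC_PHASE = ValueSet(
    label="Allelic Phase Value Set",
    extensible=False,
    concepts=frozenset({"in cis", "in trans"}),
)


def is_member(value_set: ValueSet, term: str) -> bool:
    """Check whether a term may be used for an attribute bound to the set."""
    return term in value_set.concepts


def extend(value_set: ValueSet, term: str) -> ValueSet:
    """Return a copy with one more term; fixed value sets reject this."""
    if not value_set.extensible:
        raise ValueError(f"{value_set.label} is fixed and cannot be extended")
    return replace(value_set, concepts=value_set.concepts | {term})
```

Here `is_member(ALLELIC_PHASE, "in trans")` succeeds, while calling `extend` on `ALLELIC_PHASE` raises a `ValueError` because the set is fixed.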
Figure 5: SEPIO Value Set Creation. (A) The schema for the SEPIO Value Set Model. (B) An example of the ClinGen 'Allelic Phase Value Set'. Boxes in grey are individual values from a value set.
5. Formal Schema
Once a profile is informally specified, it can be implemented as a formal schema to support data creation and validation in a particular application. A given profile can be defined in any number of schema definition languages. ClinGen formalizes its profile as a JSON-LD schema, and its applications generate JSON-LD data. An important component of this framework is the ClinGen LD-context file (LINK), which defines mappings between schema elements and ontology terms. This allows SEPIO-compliant RDF graph representations to be automatically derived from ClinGen JSON data. The ClinGen JSON-LD schema is described here (LINK).
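The role of the context file can be illustrated with a simplified key-expansion step. This is not a JSON-LD processor and does not use the actual ClinGen context: the keys, the `http://example.org/` IRIs, and the record below are placeholders invented for this sketch, which only shows how application-preferred labels can be swapped for ontology term identifiers.

```python
from typing import Any, Dict

# Hypothetical context: application label -> ontology term IRI (placeholders)
CONTEXT: Dict[str, str] = {
    "variant": "http://example.org/sepio/variant",
    "condition": "http://example.org/sepio/condition",
    "outcome": "http://example.org/sepio/outcome",
}

# Hypothetical record using application-preferred attribute names
record: Dict[str, Any] = {
    "variant": "NM_007294.3(BRCA1):c.5572A>C",
    "condition": "Familial Breast-Ovarian Cancer",
    "outcome": "benign",
}


def expand_keys(node: Any, context: Dict[str, str]) -> Any:
    """Recursively replace keys found in the context with their mapped IRIs."""
    if isinstance(node, dict):
        return {
            context.get(key, key): expand_keys(value, context)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [expand_keys(item, context) for item in node]
    return node  # literal values are left unchanged


expanded = expand_keys(record, CONTEXT)
```

A full JSON-LD implementation also handles type coercion, nested contexts, and compact IRIs, so a dedicated library would be the better choice for real data.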
Example Variant Interpretation Record
Figure 6: An example Variant Interpretation record from the ClinGen database modeled according to the ClinGen-ACMG Profile. The diagram is annotated to highlight how the central axis is repeated to result in a two-level structure.
TO DO - add text describing this VarInter045 example ClinGen VI record.
The JSON for this example record, as published by ClinGen, can be found here. The JSON-LD context file that maps the type and attribute names used in this JSON data to SEPIO classes and properties can be found here.