# CYP21A2
Congenital adrenal hyperplasia due to 21-hydroxylase deficiency ([OMIM:201910](https://omim.org/entry/201910)) is caused by homozygous or compound heterozygous mutations in the CYP21A2 gene.
This notebook shows how to create phenopackets using GA4GH classes from pyphetools directly to flexibly model elements not yet included in the Template.

Here, we use this to import data specifically from the publication [Xu C, et al. (2019) Genotype-phenotype correlation study and mutational and hormonal analysis in a Chinese cohort with 21-hydroxylase deficiency. Mol Genet Genomic Med.](https://pubmed.ncbi.nlm.nih.gov/30968594/). We processed the data in the supplemental Word file and created an Excel file that
contains the data.

We did not include individuals 70-72, because only one variant was identified.

In some cases, an allele was reported with more than one variant. Since we cannot currently represent this in a simple way, we chose the most clearly pathogenic variant in such cases. Our
choices can be reviewed in the allele_original column of the Excel file.
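
One way to review these choices is to flag cells that list more than one variant. The sketch below assumes that such cells separate variants with commas, as in the `allele_original` value for individual 3 in the table preview further down. The helpers `split_variants` and `lists_multiple_variants` are hypothetical and not part of pyphetools.

```python
from typing import List


def split_variants(cell: str) -> List[str]:
    """Split a cell into individual variant strings (assumes comma separation)."""
    return [part.strip() for part in cell.split(",") if part.strip()]


def lists_multiple_variants(cell: str) -> bool:
    """Return True if the cell names more than one variant."""
    return len(split_variants(cell)) > 1


# Values copied from the allele_original column of the table preview
examples = {
    "individual 1": "c.955C>T",
    "individual 3": "c.290-13A/C>G, c.332-339del8, c.518 T>A",
}
for individual, cell in examples.items():
    print(individual, lists_multiple_variants(cell), split_variants(cell))
```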

We create Measurement objects for many of the lab values. We did not include progesterone, FSH, and LH because the normal values for these analytes can depend on the
stage of the menstrual cycle, which was not available in the input data.

Note that this notebook includes Python code that is specific to the format of the input Excel file, but the approach can be adapted to other cohorts mutatis mutandis.

In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', None) # show entire column contents, important!
import typing
import os
import re
from google.protobuf.json_format import MessageToJson
from pyphetools.creation import Measurements, VariantManager, PromoterVariant, StructuralVariant, OntologyTerms
from pyphetools.pp.v202 import *
import pyphetools
print(f"pyphetools version {pyphetools.__version__}")

pyphetools version 0.9.105


In [2]:
df = pd.read_excel("input/CYP21A2_Xu_2019.xlsx")
df.head()

Unnamed: 0,individual,sex,age_at_dx,17ohp,androgen,testosterone,other,allele_1,allele_2,allele_original,...,Clitoral hypertrophy,Irregular menstruation,Diarrhea,Arrhythmia,Congenital hypothyroidism,Growth delay,Acne,Moon facies,Hyperpigmentation of the skin,Original:symptoms
0,str,sex,age_at_dx,17ohp,androgen,ng/dl,other,gt1,gt2,gt1,...,HP:0008665,HP:0000858,HP:0002014,HP:0011675,HP:0000851,HP:0001510,HP:0001061,HP:0500011,HP:0000953,Original
1,individual 1,M,P20D,800,T: 1.27,127,na,c.955C>T,DEL exon1-7,c.955C>T,...,na,na,na,na,na,na,na,na,observed,"Hyperpigmentation, milk regurgitation"
2,individual 2,M,Congenital onset,526,T: 1.68,168,na,c.332_339del8,DEL exon7-10,c.332_339del8,...,na,na,na,na,na,na,na,na,observed,"Hyperpigmentation, undescended testis"
3,individual 3,M,P1M,685,T: 3.4,340,na,c.332_339del,gene deletion,"c.290-13A/C>G, c.332-339del8, c.518 T>A",...,na,na,na,na,na,na,na,na,na,"milk regurgitation, advanced penile growth"
4,individual 4,M,Congenital onset,488,T: 0.78,78,na,c.1069C>T,gene deletion,c.1069C>T,...,na,na,na,na,na,na,na,na,observed,"Hyperpigmentation, advanced penile growth"


In [3]:
PMID = "PMID:30968594"
title = "Genotype-phenotype correlation study and mutational and hormonal analysis in a Chinese cohort with 21-hydroxylase deficiency"
created_by="ORCID:0000-0002-5648-2155"
metadata = MetaData.metadata_for_pmid(created_by=created_by, pmid=PMID, citation_title=title, include_loinc=True)

# Measurements and Phenotypic Features
The PhenotypicFeature element ("message") of the Phenopacket Schema represents a qualitative observation, indicated by an ontology term, that is either observed or excluded. In contrast, a
Measurement element represents a (usually) numerical result (for instance, of a laboratory test). Measurements can include a reference range, and if the observed value is above the upper limit of normal (e.g., blood sodium of 152), this may also correspond to an ontology term (in this case, Hypernatremia, i.e., elevated blood sodium). In version 2 of the Phenopacket Schema, it is not possible to relate Measurement and PhenotypicFeature objects, but this feature may be added to version 3. Here, we show how to create both Measurement and PhenotypicFeature messages based on laboratory values. Note that if a Measurement is in the normal range (e.g., a blood sodium value of 136), then we can exclude the parent term (Abnormal blood sodium concentration -- excluded).
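
The mapping from a numerical result to an observed or excluded term can be sketched in plain Python. This is only an illustration: the helper `classify_against_range`, the `Call` tuple, and the 135 to 145 sodium reference interval are assumptions made for this example and are not taken from pyphetools or from the input data.

```python
from typing import NamedTuple


class Call(NamedTuple):
    label: str      # ontology term label
    excluded: bool  # True if the term is explicitly ruled out


def classify_against_range(value: float, low: float, high: float,
                           low_label: str, high_label: str,
                           parent_label: str) -> Call:
    """Map a measurement to an observed or excluded term label."""
    if value > high:
        return Call(high_label, excluded=False)
    if value < low:
        return Call(low_label, excluded=False)
    # Within range: exclude the parent term that covers both directions
    return Call(parent_label, excluded=True)


# Assumed reference interval for blood sodium; illustration only
SODIUM = dict(low=135, high=145,
              low_label="Hyponatremia",
              high_label="Hypernatremia",
              parent_label="Abnormal blood sodium concentration")

print(classify_against_range(152, **SODIUM))
print(classify_against_range(136, **SODIUM))
```

The assessment functions below follow the same three-way branch, but build pyphetools `Measurement` and `PhenotypicFeature` objects instead of tuples.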

In [4]:
def measurement_17php(row: pd.Series, 
                      measurement_list: typing.List[Measurement]) -> None:
    """
    In CYP21A2 deficiency, 17-hydroxyprogesterone (17-OHP) is not converted to 11-deoxycortisol.
    Normal values:
    Babies more than 24 hours old - less than 400 to 600 nanograms per deciliter (ng/dL) or 12.12 to 18.18 nanomoles per liter (nmol/L)
    Children before puberty around 100 ng/dL or 3.03 nmol/L
    Adults - less than 200 ng/dL or 6.06 nmol/L
    The units used here are nanogram per deciliter. We do not have the age at measurement, so we cannot apply the normal range (and thus we do not call HPO terms)
    Add a measurement to the list if the value can be parsed.
    """
    value = row["17ohp"]
    if value == "na":
        return None
    try:
        concentration = float(value)
        m = Measurements.nanogram_per_deciliter(code="LOINC:1668-3",
                                      label="17-Hydroxyprogesterone[Mass/Vol]",
                                      concentration=concentration)
        measurement_list.append(m)
    except ValueError:
        print(f"Could not parse \"{value}\"")

In [5]:
def testosterone_assessment(row: pd.Series, 
                            measurement_list: typing.List[Measurement],
                            phenotypic_feature_list: typing.List[PhenotypicFeature]) -> None:
    """
    Male: 300 to 1,000 nanograms per deciliter (ng/dL) or 10 to 35 nanomoles per liter (nmol/L)
    Female: 15 to 70 ng/dL or 0.5 to 2.4 nmol/L
    in this paper, testosterone is reported as (ng/ml), we converted to ng/dL (conversion factor 100)
    The LOINC code is 2986-8; Testosterone [Mass/volume] in Serum or Plasma     
    Add a measurement to the list if we can successfully parse and check if the value is abnormal.
    """
    value = row["testosterone"]
    if value == "na":
        return None
    sex = row["sex"]
    if sex == "M":
        low = 300
        high = 1000
    elif sex == "F":
        low = 15
        high = 70
    else:
        print("Warning: did not recognize sex")
        return None
    try:
        concentration = float(value)
        m = Measurements.nanogram_per_deciliter(code="LOINC:2986-8",
                                      label="Testosterone[Mass/Vol]",
                                      concentration=concentration,
                                      low=low,
                                      high=high)
        measurement_list.append(m)
        if concentration > high:
            pf = PhenotypicFeature(type=OntologyClass(id="HP:0030088", label="Increased serum testosterone level"))
        elif concentration < low:
            pf = PhenotypicFeature(type=OntologyClass(id="HP:0040171", label="Decreased serum testosterone concentration"))
        else:
            pf = PhenotypicFeature(type=OntologyClass(id="HP:0030087", label="Abnormal circulating testosterone concentration"), excluded=True)
        phenotypic_feature_list.append(pf)
    except ValueError:
        print(f"Could not parse \"{value}\"")

In [6]:
def ACTH_assesssment(row:pd.Series, 
                     measurement_list: typing.List[Measurement],
                     phenotypic_feature_list: typing.List[PhenotypicFeature]) -> None:
    """
    ACTH is measured in picograms per milliliter (pg/mL). Test results are influenced by the time of day the test was done. Normal results are: 
    Adults: 10-60 pg/ml (1.3-16.7 pmol/L) for an early morning sample (8 a.m.); less than 20 pg/ml (4.5 pmol/L) for a late afternoon sample (4 p.m.)
    We will choose the 10-60 range as the test normally should be done in early morning
    LOINC: 2141-0, Corticotropin [Mass/volume] in Plasma
    """
    value = row["ACTH"]
    if value == "na":
        return None
    if isinstance(value,str) and value.endswith(" "):
        raise ValueError(f"Malformed ACTH: {value}")
    lower_limit_of_normal = 10
    upper_limit_of_normal = 60
    try:
        concentration = float(value)
        m = Measurements.picogram_per_milliliter(code="LOINC:2141-0",
                                      label="Corticotropin (P) [Mass/Vol]",
                                      concentration=concentration,
                                      low=lower_limit_of_normal,
                                      high=upper_limit_of_normal)
        measurement_list.append(m)
        if concentration > upper_limit_of_normal:
            pf = PhenotypicFeature(type=OntologyClass(id="HP:0003154", label="Increased circulating ACTH level"))
        elif concentration < lower_limit_of_normal:
            pf = PhenotypicFeature(type=OntologyClass(id="HP:0002920", label="Decreased circulating ACTH concentration"))
        else:
            pf = PhenotypicFeature(type=OntologyClass(id="HP:0011043", label="Abnormal circulating adrenocorticotropin concentration"), excluded=True)
        phenotypic_feature_list.append(pf)
    except ValueError:
        print(f"Could not parse \"{value}\"")

In [7]:
def cortisol_assessment(row:pd.Series, 
                        measurement_list: typing.List[Measurement],
                        phenotypic_feature_list: typing.List[PhenotypicFeature]) -> None:
    """
    Normal values for a blood sample taken at 8 in the morning are 5 to 25 mcg/dL or 140 to 690 nmol/L.
    LOINC: 2143-6, Cortisol [Mass/volume] in Serum or Plasma
    """
    value = row["cortisol"]
    if value == "na":
        return None
    if isinstance(value,str) and value.endswith(" "):
        raise ValueError(f"Malformed cortisol: {value}")
    lower_limit_of_normal = 140
    upper_limit_of_normal = 690
    try:
        concentration = float(value)
        m = Measurements.nanomole_per_liter(code="LOINC:2143-6",
                                      label="Cortisol [Mass/Vol]",
                                      concentration=concentration,
                                      low=lower_limit_of_normal,
                                      high=upper_limit_of_normal)
        measurement_list.append(m)
        if concentration > upper_limit_of_normal:
            pf = PhenotypicFeature(type=OntologyClass(id="HP:0003118", label="Increased circulating cortisol level"))
        elif concentration < lower_limit_of_normal:
            pf = PhenotypicFeature(type=OntologyClass(id="HP:0008163", label="Decreased circulating cortisol level"))
        else:
            pf = PhenotypicFeature(type=OntologyClass(id="HP:0011731", label="Abnormality of circulating cortisol level"), excluded=True)
        phenotypic_feature_list.append(pf)
    except ValueError:
        print(f"Could not parse \"{value}\"")

In [8]:
def prolactin_assessment(row:pd.Series, 
                        measurement_list: typing.List[Measurement],
                        phenotypic_feature_list: typing.List[PhenotypicFeature]) -> None:
    """ 
    Normal prolactin levels typically range from 2 to 17 nanograms per millilitre (ng/mL) in non-pregnant women and 2 to 15 ng/mL in men.
    Pregnant women: 80 to 400 ng/mL (80 to 400 µg/L) # we will assume patients tested in non-pregnant state
    LOINC: 2842-3 Prolactin [Mass/volume] in Serum or Plasma
    """
    value = row["prolactin"]
    if value == "na":
        return None
    sex = row["sex"]
    if sex == "M":
        lower_limit_of_normal = 2
        upper_limit_of_normal = 15
    elif sex == "F":
        lower_limit_of_normal = 2
        upper_limit_of_normal = 17
    else:
        print("Warning: did not recognize sex")
        return None
    if isinstance(value,str) and value.endswith(" "):
        raise ValueError(f"Malformed prolactin: {value}")
    try:
        concentration = float(value)
        m = Measurements.nanogram_per_milliliter(code="LOINC:2842-3",
                                      label="Prolactin [Mass/Vol]",
                                      concentration=concentration,
                                      low=lower_limit_of_normal,
                                      high=upper_limit_of_normal)
        measurement_list.append(m)
        if concentration > upper_limit_of_normal:
            pf = PhenotypicFeature(type=OntologyClass(id="HP:0000870", label="Increased circulating prolactin concentration"))
        else:
            pf = PhenotypicFeature(type=OntologyClass(id="HP:0000870", label="Increased circulating prolactin concentration"), excluded=True)
        # Note there is no medically relevant "Reduced" term
        phenotypic_feature_list.append(pf)
    except ValueError:
        print(f"Could not parse \"{value}\"")

In [9]:
def estradiol_assessment(row:pd.Series, 
                         measurement_list: typing.List[Measurement],
                         phenotypic_feature_list: typing.List[PhenotypicFeature]) -> None:
    """ 
    30 to 400 pg/mL for premenopausal women
    0 to 30 pg/mL for postmenopausal women
    10 to 50 pg/mL for men
    LOINC: 2243-4 Estradiol (E2) [Mass/volume] in Serum or Plasma
    """
    value = row["estradiol (E2)"]
    if value == "na":
        return None
    sex = row["sex"]
    if sex == "M":
        lower_limit_of_normal = 10
        upper_limit_of_normal = 50
    elif sex == "F":
        lower_limit_of_normal = 30
        upper_limit_of_normal = 400
    else:
        print("Warning: did not recognize sex")
        return None
    
    if isinstance(value,str) and value.endswith(" "):
        raise ValueError(f"Malformed estradiol: {value}")
    try:
        concentration = float(value)
        m = Measurements.picogram_per_milliliter(code="LOINC:2243-4",
                                      label="Estradiol (E2) [Mass/Vol]",
                                      concentration=concentration,
                                      low=lower_limit_of_normal,
                                      high=upper_limit_of_normal)
        measurement_list.append(m)
        if concentration > upper_limit_of_normal:
            pf = PhenotypicFeature(type=OntologyClass(id="HP:0025134", label="Increased serum estradiol"))
        elif concentration < lower_limit_of_normal:
            pf = PhenotypicFeature(type=OntologyClass(id="HP:0008214", label="Decreased serum estradiol"))
        else:
            pf = PhenotypicFeature(type=OntologyClass(id="HP:0025133", label="Abnormal serum estradiol"), excluded=True)
        phenotypic_feature_list.append(pf)
    except ValueError:
        print(f"Could not parse \"{value}\"")
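
The sex-specific limits used in the functions above could also be kept in a single lookup table. This is a minimal sketch that copies the limits from the docstrings above; the names `REFERENCE_RANGES` and `reference_range` are hypothetical and not part of pyphetools.

```python
from typing import Dict, Optional, Tuple

# (analyte, sex) -> (lower limit of normal, upper limit of normal)
REFERENCE_RANGES: Dict[Tuple[str, str], Tuple[float, float]] = {
    ("testosterone", "M"): (300, 1000),  # ng/dL
    ("testosterone", "F"): (15, 70),     # ng/dL
    ("prolactin", "M"): (2, 15),         # ng/mL
    ("prolactin", "F"): (2, 17),         # ng/mL, non-pregnant
    ("estradiol", "M"): (10, 50),        # pg/mL
    ("estradiol", "F"): (30, 400),       # pg/mL, premenopausal
}


def reference_range(analyte: str, sex: str) -> Optional[Tuple[float, float]]:
    """Return (low, high) for the analyte and sex, or None if unknown."""
    return REFERENCE_RANGES.get((analyte, sex))


print(reference_range("prolactin", "F"))
print(reference_range("estradiol", "U"))  # unrecognized sex gives None
```

Returning `None` for an unrecognized sex mirrors the early return in the functions above, where no measurement is created in that case.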

# HPO terms
The original input table had a list of symptoms that we encode as HPO terms. If a symptom was not mentioned in the original file, we assume it was not available rather than explicitly excluded.
The following function is used to extract observed HPO terms for each individual based on the corresponding table row. Note that we additionally generate HPO terms based on the numerical measurements as shown above.

In [10]:
def get_hpo_terms(row: pd.Series) -> typing.List[PhenotypicFeature]:
    hpo_term_d = {
        "Secondary amenorrhea": "HP:0000869",
        "Infertility": "HP:0000789",
        "Testicular neoplasm": "HP:0010788",
        "Cryptorchidism": "HP:0000028",
        "Hirsutism": "HP:0001007",
        "Hypertension": "HP:0000822",
        "Clitoral hypertrophy": "HP:0008665",
        "Irregular menstruation": "HP:0000858",
        "Diarrhea": "HP:0002014",
        "Arrhythmia": "HP:0011675",
        "Congenital hypothyroidism": "HP:0000851",
        "Growth delay": "HP:0001510",
        "Acne": "HP:0001061",
        "Moon facies": "HP:0500011",
        "Hyperpigmentation of the skin": "HP:0000953"
    }
    phenotypic_feature_list = list()
    for label, hpo_id in hpo_term_d.items():
        status = row[label]
        if not status == "observed":
            continue
        term = OntologyClass(id=hpo_id, label=label)
        phenotypic_feature_list.append(PhenotypicFeature(type=term))
    return phenotypic_feature_list
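
Only cells marked `observed` produce a term in the function above. If a table also recorded explicit exclusions, the same loop could emit excluded features. The sketch below uses plain dictionaries instead of pyphetools classes, and the `excluded` cell value is an assumption: it does not occur in the rows shown in the table preview.

```python
from typing import Dict, List


def features_from_row(row: Dict[str, str], term_ids: Dict[str, str]) -> List[dict]:
    """Translate per-symptom status cells into feature dictionaries."""
    features = []
    for label, hpo_id in term_ids.items():
        status = row.get(label, "na")
        if status == "observed":
            features.append({"id": hpo_id, "label": label, "excluded": False})
        elif status == "excluded":  # hypothetical cell value
            features.append({"id": hpo_id, "label": label, "excluded": True})
        # Any other value (e.g. "na") means no information was available
    return features


terms = {
    "Acne": "HP:0001061",
    "Hyperpigmentation of the skin": "HP:0000953",
    "Diarrhea": "HP:0002014",
}
row = {"Acne": "na", "Hyperpigmentation of the skin": "observed", "Diarrhea": "excluded"}
for feature in features_from_row(row, terms):
    print(feature)
```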


# CYP21A2 variants
The following two cells use VariantValidator to obtain representations of HGVS-encoded variants and use pyphetools classes to represent structural and promoter variants.

In [11]:
cyp21a2_symbol = "CYP21A2"
cyp21a2_id = "HGNC:2600"
cyp21a2_MANE_transcript = "NM_000500.9"
vmanager = VariantManager(df=df, 
                          individual_column_name="individual", 
                          transcript=cyp21a2_MANE_transcript, 
                          gene_id=cyp21a2_id, 
                          gene_symbol=cyp21a2_symbol, 
                          allele_1_column_name="allele_1", 
                          allele_2_column_name="allele_2")
variant_d = vmanager.get_variant_d()

In [12]:
variant_to_object_d = dict()
all_variant_set = set()
deletion_set = {"DEL exon7-10","DEL exon1-7", "DEL exon1-3", "gene deletion"}
duplication_set = {"gene duplication"}
firstRow = None
for _,row in df.iterrows():
    if firstRow is None:
        firstRow = row
        continue ## first row has additional definitions such as HPO identifiers
    allele_1 = row["allele_1"]
    allele_2 = row["allele_2"]
    all_variant_set.add(allele_1)
    all_variant_set.add(allele_2)
for allele in all_variant_set:
    #print(f"\"{allele}\"")
    if allele.startswith("PROMOTER"):
        fields = allele.split(":")
        description = ":".join(fields[1:])
        var_obj = PromoterVariant.two_KB_upstream_variant(description=description,gene_symbol=cyp21a2_symbol, gene_id=cyp21a2_id)
        variant_to_object_d[allele] = var_obj.to_variant_interpretation()
    elif allele in duplication_set:
        dup = StructuralVariant.chromosomal_duplication(cell_contents=allele, gene_id=cyp21a2_id, gene_symbol=cyp21a2_symbol)
        variant_to_object_d[allele] = dup.to_variant_interpretation_202()
    elif allele in deletion_set:
        delet = StructuralVariant.chromosomal_deletion(cell_contents=allele, gene_id=cyp21a2_id, gene_symbol=cyp21a2_symbol)
        variant_to_object_d[allele] = delet.to_variant_interpretation_202()
    elif allele in variant_d:
        var_obj = variant_d.get(allele)
        variant_to_object_d[allele] = var_obj.to_variant_interpretation_202()
    else:
        raise ValueError(f"Did not recognize allele \"{allele}\"")
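
The branching in the loop above amounts to sorting each allele string into one of four categories. The following self-contained sketch reproduces that dispatch without calling VariantValidator or pyphetools. The function `allele_category` is hypothetical, the check for a `c.` prefix only stands in for the `variant_d` lookup, and the promoter string is a made-up placeholder.

```python
DELETIONS = {"DEL exon7-10", "DEL exon1-7", "DEL exon1-3", "gene deletion"}
DUPLICATIONS = {"gene duplication"}


def allele_category(allele: str) -> str:
    """Return the category that decides which variant class is used."""
    if allele.startswith("PROMOTER"):
        return "promoter"
    if allele in DUPLICATIONS:
        return "duplication"
    if allele in DELETIONS:
        return "deletion"
    if allele.startswith("c."):  # assumption: stands in for `allele in variant_d`
        return "hgvs"
    raise ValueError(f"Did not recognize allele \"{allele}\"")


for allele in ["c.955C>T", "DEL exon1-7", "gene duplication",
               "PROMOTER:placeholder description"]:
    print(allele, "->", allele_category(allele))
```

Raising on unknown strings, as the notebook does, makes typos in the Excel file surface immediately instead of silently dropping a variant.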

In [13]:
def row_to_phenopacket(row:pd.Series):
    individual_id = row["individual"]
    phenopacket_id = "PMID_30968594_{}".format(individual_id.replace(" ", "_"))
    sex = row["sex"]
    age_at_dx = row["age_at_dx"]
    i = Individual(id=individual_id)
    if sex == "M":
        i.sex = Sex.MALE
    elif sex == "F":
        i.sex = Sex.FEMALE
    else:
        print(f"Warning: could not identify sex for {individual_id}")
    allele_1 = row["allele_1"]
    allele_2 = row["allele_2"]
    var_list = list()
    if allele_1 == allele_2:
        var = variant_to_object_d.get(allele_1)
        var.variation_descriptor.allelic_state = OntologyTerms.homozygous()
        var_list.append(var)
    else:
        var1 = variant_to_object_d.get(allele_1)
        var2 = variant_to_object_d.get(allele_2)
        var1.variation_descriptor.allelic_state = OntologyTerms.heterozygous()
        var2.variation_descriptor.allelic_state = OntologyTerms.heterozygous()
        var_list.append(var1)
        var_list.append(var2)
    ## create genomic interpretation
    interpretation_list = list()
    for var in var_list:
        genomic_interpretation = GenomicInterpretation(subject_or_biosample_id=individual_id, 
                                                       interpretation_status=GenomicInterpretation.InterpretationStatus.CAUSATIVE,
                                                       call=var)
        interpretation_list.append(genomic_interpretation)
    ## Disease is always OMIM:201910 for this cohort.
    diseaseClass = OntologyClass(id="OMIM:201910", label="Adrenal hyperplasia, congenital, due to 21-hydroxylase deficiency")
    ## all individuals have Congenital onset or an iso8601 onset
    if age_at_dx == "Congenital onset":
        onset = OntologyTerms.congenital_onset()
    elif age_at_dx.startswith("P"):
        onset = Age(iso8601duration=age_at_dx)
    else:
        raise ValueError(f"Did not recognize onset \"{age_at_dx}\"")
    disease = Disease(term=diseaseClass, onset=TimeElement(element=onset))
    diagnosis = Diagnosis(disease=diseaseClass, genomic_interpretations=interpretation_list)
    interpretation = Interpretation(id=individual_id, progress_status=Interpretation.ProgressStatus.SOLVED, diagnosis=diagnosis)
    phenotypic_features = get_hpo_terms(row)
    measurements = list()
    measurement_17php(row=row, measurement_list=measurements) # we have no applicable normal range, no phenotypic features possible.
    testosterone_assessment(row=row, measurement_list=measurements, phenotypic_feature_list=phenotypic_features)
    ACTH_assesssment(row=row, measurement_list=measurements, phenotypic_feature_list=phenotypic_features)
    cortisol_assessment(row=row, measurement_list=measurements, phenotypic_feature_list=phenotypic_features)
    prolactin_assessment(row=row, measurement_list=measurements, phenotypic_feature_list=phenotypic_features)
    estradiol_assessment(row=row, measurement_list=measurements, phenotypic_feature_list=phenotypic_features)
    ppkt = Phenopacket(id=phenopacket_id, 
                       subject=i, 
                       diseases=[disease], 
                       phenotypic_features=phenotypic_features, 
                       measurements=measurements, 
                       interpretations=[interpretation], 
                       meta_data=metadata)
    return ppkt
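
The onset handling in `row_to_phenopacket` accepts either the literal `Congenital onset` or any string that begins with `P`. A stricter check of the ISO 8601 duration could look like the sketch below. The helper `parse_onset` is hypothetical, and the regular expression covers only year, month, week, and day designators, which is an assumption about the values that occur in the table.

```python
import re
from typing import Tuple

# "P" followed by at least one of: nY, nM, nW, nD (in that order)
ISO8601_DATE_DURATION = re.compile(
    r"P(?=\d)(?:\d+Y)?(?:\d+M)?(?:\d+W)?(?:\d+D)?"
)


def parse_onset(cell: str) -> Tuple[str, str]:
    """Classify an onset cell as an ontology term or an age duration."""
    if cell == "Congenital onset":
        return ("ontology_class", cell)
    if ISO8601_DATE_DURATION.fullmatch(cell):
        return ("age", cell)
    raise ValueError(f"Did not recognize onset \"{cell}\"")


# Values copied from the age_at_dx column of the table preview
for cell in ["P20D", "Congenital onset", "P1M"]:
    print(cell, "->", parse_onset(cell))
```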

In [14]:
phenopacket_list = list()
firstRow = None
for _,row in df.iterrows():
    if firstRow is None:
        firstRow = row
        continue ## first row has additional definitions such as HPO identifiers
    item = row_to_phenopacket(row)
    phenopacket_list.append(item.to_message())

In [15]:
outdir = "phenopackets"
if not os.path.isdir(outdir):
    os.makedirs(outdir)
written = 0
json_list = list()
for ppkt in phenopacket_list:    
    json_string = MessageToJson(ppkt)
    fname = re.sub('[^A-Za-z0-9_-]', '', ppkt.id)  # remove any illegal characters from filename
    fname = fname.replace(" ", "_") + ".json"
    outpth = os.path.join(outdir, fname)
    with open(outpth, "wt") as fh:
        fh.write(json_string)
        json_list.append(json_string)
        written += 1
print(f"We output {written} GA4GH phenopackets to the directory {outdir}")

We output 69 GA4GH phenopackets to the directory phenopackets
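
The file name logic in the last cell can be tried out on its own. The sketch below uses the same regular expression; `phenopacket_filename` is a hypothetical wrapper, and the second identifier is an invented example that contains characters the expression removes.

```python
import re


def phenopacket_filename(phenopacket_id: str) -> str:
    """Build a JSON file name from a phenopacket identifier."""
    # Drop every character that is not a letter, digit, underscore, or hyphen
    return re.sub(r"[^A-Za-z0-9_-]", "", phenopacket_id) + ".json"


print(phenopacket_filename("PMID_30968594_individual_1"))
print(phenopacket_filename("PMID_30968594_individual 1/a"))
```

Because the identifiers built in `row_to_phenopacket` already replace spaces with underscores, the substitution leaves them unchanged.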
