# Case Study
### Purpose: Demonstrate the capabilities of the automated rare disease epidemiology extraction pipeline
The pipeline is composed of multiple parts:
1. The user enters a search term or a list of search terms. For our purposes, this can be a GARD Disease ID (`int` or `str`) or any rare disease name.


2. The autosearch function (in *extract_abs*) maps the input to all other synonyms of that disease (using the `GARD_dict` from `extract_abs.load_GARD_diseases()` which reformats information from `gard-id-name-synonyms.json`) and outputs a list of search terms.


3. PubMed is searched through the NCBI and EBI APIs (`search_getAbs(searchterm_list, maxResults, filtering)` from *classify_abs*), and abstracts are gathered until the number of abstracts returned is >= `maxResults` or until the results have been exhausted.


4. `search_getAbs` has three options for filtering the abstracts. We use `'strict'` filtering *after* `int(maxResults)` abstracts have been gathered. Strict filtering keeps an abstract only if at least one of the terms from the search term list appears in it. This was implemented because the APIs are structured to minimize false negatives (they return results generously), so they often return many unrelated, false-positive abstracts, particularly the EBI API.


5. The relevant abstracts are then passed through a long short-term memory recurrent neural network (LSTM RNN) in *classify_abs*. Abstracts with >0.5 probability of having epidemiologic content are sent to the BioBERT NER model and the `get_diseases` function in *extract_abs*.


6. The NER model identifies epidemiologic type (`EPI`); epidemiologic rate (`STAT`); location (`LOC`); date (`DATE`); biological sex (`SEX`); and ethnicity/race/nationality (`ETHN`).


7. The `get_diseases` algorithm identifies rare disease names & synonyms (`DIS`) and GARD IDs (`IDS`) in the abstract using `GARD_dict` and `max_length` (the number of words in the longest disease name/synonym in the `GARD_dict`). By capping the function at `max_length`, the algorithm goes from *O(n<sup>2</sup>)* to *O(n)* time.
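
The capped lookup in step 7 can be sketched as follows. This is a minimal illustration, not the code in *extract_abs*: the function name `find_diseases`, the toy dictionary, and the assumption that the dictionary maps lowercase names to GARD ID strings are all hypothetical.

```python
import re

def find_diseases(text, gard_dict, max_length):
    """Return (names, ids) for dictionary entries found in `text`.

    Assumes `gard_dict` maps lowercase disease names/synonyms to GARD IDs
    (a hypothetical layout chosen for this sketch).
    """
    # Lowercase and keep only word-like tokens so punctuation does not block a match
    tokens = re.findall(r"[a-z0-9\-]+", text.lower())
    names, ids = [], []
    for start in range(len(tokens)):
        # Only windows of up to max_length words are checked, so the number
        # of candidates is at most len(tokens) * max_length
        for size in range(1, max_length + 1):
            if start + size > len(tokens):
                break
            candidate = " ".join(tokens[start:start + size])
            if candidate in gard_dict:
                names.append(candidate)
                ids.append(gard_dict[candidate])
    return names, ids

# Toy dictionary (hypothetical layout; names and IDs taken from this case study)
toy_dict = {
    "classic homocystinuria": "GARD:0006667",
    "cbs deficiency": "GARD:0006667",
    "phenylketonuria": "GARD:0007383",
}
# Number of words in the longest name/synonym
toy_max_length = max(len(name.split()) for name in toy_dict)

names, ids = find_diseases(
    "Six newborns were diagnosed with classic homocystinuria in Kuwait.",
    toy_dict,
    toy_max_length,
)
print(names, ids)
```

Without the cap, every start position would be paired with every possible end position, which is quadratic in the number of words. With a fixed `max_length`, the work per start position is bounded, so the scan is linear in the length of the abstract.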

In [1]:
import pandas as pd
import classify_abs
#classify_abs is a dependency for extract_abs
import extract_abs
pd.set_option('display.max_colwidth', None)
#from IPython.core.display import display, HTML
#display(HTML("<style>.container { width:98% !important; }</style>"))

Load the model and pipeline dependencies once

In [2]:
#LSTM RNN Epi Classifier Model
classify_model_vars = classify_abs.init_classify_model()

#GARD Dictionary - For filtering and exact match disease/GARD ID identification
GARD_dict, max_length = extract_abs.load_GARD_diseases()

#BioBERT-based NER pipeline and its entity classes
NER_pipeline, entity_classes = extract_abs.init_NER_pipeline()

#strict filtering must find at least one of the terms in the search term list in the abstract
filtering = 'strict'

#We will not be extracting diseases in the case study
extract_diseases = False





In [3]:
def search(term, max_results):
    #filtering options are 'strict', 'lenient' (default), 'none'
    return extract_abs.search_term_extraction(term, max_results, filtering,
                                              NER_pipeline, entity_classes,
                                              extract_diseases, GARD_dict, max_length,
                                              classify_model_vars)
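
To make the `'strict'` option concrete, here is a minimal sketch of that kind of filter. It is an assumption about the behavior, not the implementation in *classify_abs*: `strict_filter` is a hypothetical name, the match is a case-insensitive substring test, and the two abstracts are toy strings.

```python
def strict_filter(abstracts, search_terms):
    """Keep only abstracts that mention at least one search term.

    Hypothetical helper: case-insensitive substring matching is an
    assumption made for this sketch.
    """
    terms = [term.lower() for term in search_terms]
    return [
        abstract
        for abstract in abstracts
        if any(term in abstract.lower() for term in terms)
    ]

# Toy inputs for illustration only
terms = ["gracile syndrome", "fellman syndrome"]
abstracts = [
    "GRACILE syndrome has been diagnosed in 25 infants of 18 families.",
    "Iron overload is common in transfusion-dependent patients.",
]

kept = strict_filter(abstracts, terms)
print(len(kept), "of", len(abstracts), "abstracts kept")
```

This kind of filter explains why the runs below report fewer "Relevant Abstracts" than PMIDs found: abstracts that never mention the disease or any of its synonyms are dropped before classification.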

In [4]:
a = search(6667,60)
a

SEARCH TERM MATCHED TO GARD DICTIONARY. SEARCHING FOR:  ['homocystinuria due to cystathionine beta-synthase deficiency', 'cystathionine beta-synthase deficiency', 'homocystinuria due to cbs deficiency', 'classic homocystinuria', 'cbs deficiency']
Found 60 PMIDs. Gathered 25 Relevant Abstracts.


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


2 abstracts classified as epidemiological.


Unnamed: 0,PMID,ABSTRACT,EPI_PROB,IsEpi,EPI,STAT,LOC,DATE,SEX,ETHN
0,34449519,"Early Diagnosis of Classic Homocystinuria in Kuwait through Newborn Screening: A 6-Year Experience. Kuwait is a small Arabian Gulf country with a high rate of consanguinity and where a national newborn screening program was expanded in October 2014 to include a wide range of endocrine and metabolic disorders. A retrospective study conducted between January 2015 and December 2020 revealed a total of 304,086 newborns have been screened in Kuwait. Six newborns were diagnosed with classic homocystinuria with an incidence of 1:50,000, which is not as high as in Qatar but higher than the global incidence. Molecular testing for five of them has revealed three previously reported pathogenic variants in the <i>CBS</i> gene, c.969G>A, p.(Trp323Ter); c.982G>A, p.(Asp328Asn); and the Qatari founder variant c.1006C>T, p.(Arg336Cys). This is the first study to review the screening of newborns in Kuwait for classic homocystinuria, starting with the detection of elevated blood methionine and providing a follow-up strategy for positive results, including plasma total homocysteine and amino acid analyses. Further, we have demonstrated an increase in the specificity of the current newborn screening test for classic homocystinuria by including the methionine to phenylalanine ratio along with the elevated methionine blood levels in first-tier testing. Here, we provide evidence that the newborn screening in Kuwait has led to the early detection of classic homocystinuria cases and enabled the affected individuals to lead active and productive lives.",0.985902,True,[incidence],"[1 : 50 , 000]","[ku, wai, bian gulf country, qa, tar, global]","[oct, ober 2014, between january 2015 and december 2020]",,"[ara, qa, tar]"
1,20567906,"Vascular presentation of cystathionine beta-synthase deficiency in adulthood. Several recent studies describing a solely vascular presentation of cystathionine beta-synthase (CBS) deficiency in adulthood prompted us to analyze the frequency of patients manifesting with vascular complications in the Czech Republic. Between 1980 and 2009, a total of 20 Czech patients with CBS deficiency have been diagnosed yielding an incidence of 1:311,000. These patients were divided into three groups based on symptoms leading to diagnosis: those with vascular complications, with connective tissue manifestation and with neurological presentation. A vascular event such as a clinical feature leading to diagnosis of homocystinuria was present in five patients, while two of them had no other symptoms typical for CBS deficiency at the time of diagnosis. All patients with the vascular manifestation were diagnosed only during the past decade. The median age of diagnosis was 29 years in the vascular, 11.5 years in the connective tissue and 4.5 years in the neurological group. The ratio of pyridoxine responsive to nonresponsive patients was higher in the vascular (4 of 5 patients) and connective tissue groups (6 of 7 patients) than in the neurological group (2 of 8 patients). Mutation c.833T>C (p.I278T) was frequent in patients with vascular (6/10 alleles) and connective tissue presentation (8/14 alleles), while it was not present in patients with neurological involvement (0/16 alleles). During the last decade, we have observed patients with homocystinuria diagnosed solely due to vascular events; this milder form of homocystinuria usually manifests at greater ages, has a high ratio of pyridoxine responsiveness/nonresponsiveness, and the mutation c.833T>C (p.I278T) is often present.",0.861598,True,[incidence],"[1 : 311 , 000]","[the, czech republic]",[between 1980 and 2009],,[cz]


In [5]:
b = search('GRACILE syndrome',50)
b

SEARCH TERM MATCHED TO GARD DICTIONARY. SEARCHING FOR:  ['growth restriction-aminoaciduria-cholestasis-iron overload-lactic acidosis-early death syndrome', 'growth retardation, aminoaciduria, cholestasis, iron overload, lactic acidosis and early death', 'growth delay-aminoaciduria-cholestasis-iron overload-lactic acidosis-early death syndrome', 'finnish lactic acidosis with hepatic hemosiderosis', 'finnish lethal neonatal metabolic syndrome', 'gracile syndrome', 'fellman syndrome', 'fellman disease']
Found 50 PMIDs. Gathered 11 Relevant Abstracts.
1 abstracts classified as epidemiological.


Unnamed: 0,PMID,ABSTRACT,EPI_PROB,IsEpi,EPI,STAT,LOC,DATE,SEX,ETHN
0,12547234,"The GRACILE syndrome, a neonatal lethal metabolic disorder with iron overload. GRACILE syndrome (Fellman syndrome, MIM 603358), an autosomal recessive metabolic disorder of the Finnish disease heritage, has been diagnosed in 25 infants of 18 families. The incidence is at least 1/47,000 in Finland. The main findings are fetal growth retardation, Fanconi type aminoaciduria, cholestasis, iron overload (liver hemosiderosis, hyperferritinemia, hypotransferrinemia, increased transferrin iron saturation, and free plasma iron), profound lactic acidosis, and early death. The pathophysiology of the metabolic disturbance is unsolved. No significant deficiency of complex III activity of respiratory chain has been found, although we recently showed that the underlying genetic cause is a missense mutation (S78G) in the BCS1L gene and other mutations in that gene have been associated with complex III deficiency. BCS1L encodes a mitochondrial protein, acting as a chaperone in the assembly of complex III. Iron accumulation in liver, a typical feature being less abundant with increasing age, might be a primary abnormality or a secondary phenomenon due to liver dysfunction. In order to decrease the iron overload, three infants have been repeatedly treated with apotransferrin followed by exchange transfusion. Improvement in iron biochemistry occurred, but no clear beneficial effect on the clinical condition was found. Further studies will elucidate the role of iron in the pathophysiology of the disease.",0.9975,True,[incidence],"[least, 1 / 47 , 000]",[finland],,,[fin]


In [6]:
c = search('GARD:0007383',50)
c

SEARCH TERM MATCHED TO GARD DICTIONARY. SEARCHING FOR:  ['phenylalanine hydroxylase deficiency', 'oligophrenia phenylpyruvica', 'phenylketonuria', 'folling disease']
Found 50 PMIDs. Gathered 44 Relevant Abstracts.
3 abstracts classified as epidemiological.


Unnamed: 0,PMID,ABSTRACT,EPI_PROB,IsEpi,EPI,STAT,LOC,DATE,SEX,ETHN
2,34082800,"Birth prevalence of phenylalanine hydroxylase deficiency: a systematic literature review and meta-analysis. <h4>Background</h4>Phenylalanine hydroxylase (PAH) deficiency is an autosomal recessive disorder that results in elevated concentrations of phenylalanine (Phe) in the blood. If left untreated, the accumulation of Phe can result in profound neurocognitive disability. The objective of this systematic literature review and meta-analysis was to estimate the global birth prevalence of PAH deficiency from newborn screening studies and to estimate regional differences, overall and for various clinically relevant Phe cutoff values used in confirmatory testing.<h4>Methods</h4>The protocol for this literature review was registered with PROSPERO (International prospective register of systematic reviews). Pubmed and Embase database searches were used to identify studies that reported the birth prevalence of PAH deficiency. Only studies including numeric birth prevalence reports of confirmed PAH deficiency were included.<h4>Results</h4>From the 85 publications included in the review, 238 birth prevalence estimates were extracted. After excluding prevalence estimates that did not meet quality assessment criteria or because of temporal and regional overlap, estimates from 45 publications were included in the meta-analysis. The global birth prevalence of PAH deficiency, estimated by weighting regional birth prevalences relative to their share of the population of all regions included in the study, was 0.64 (95% confidence interval [CI] 0.53-0.75) per 10,000 births and ranged from 0.03 (95% CI 0.02-0.05) per 10,000 births in Southeast Asia to 1.18 (95% CI 0.64-1.87) per 10,000 births in the Middle East/North Africa. Regionally weighted global birth prevalences per 10,000 births by confirmatory test Phe cutoff values were 0.96 (95% CI 0.50-1.42) for the Phe cutoff value of 360 ± 100 µmol/L; 0.50 (95% CI 0.37-0.64) for the Phe cutoff value of 600 ± 100 µmol/L; and 0.30 (95% CI 0.20-0.40) for the Phe cutoff value of 1200 ± 200 µmol/L.<h4>Conclusions</h4>Substantial regional variation in the birth prevalence of PAH deficiency was observed in this systematic literature review and meta-analysis of published evidence from newborn screening. The precision of the prevalence estimates is limited by relatively small sample sizes, despite widespread and longstanding newborn screening in much of the world.",0.997966,True,"[birth prevalence, prevalence, birth prevalence estimates, prevalence estimates, birth prevalences]","[0 . 64, per 10 , 000 births, 0 . 03, 1 . 18]","[global, southeast asia, the middle east / north africa]",,,
0,35023679,"Frequency of PAH Mutations Among Classic Phenylketon Urea Patients in Mazandaran and Golestan Provinces, North of Iran. <h4>Background</h4>Phenylketonuria (PKU) is the most common aminoacidopathy with an autosomal recessive inheritance pattern. A global PKU prevalence is estimated about 6.002 in 100,000 newborns. In Iran, the prevalence of PKU is estimated at about 1 in 4,698, and it shows an increasing trend from north (0.0015%) to south (0.02%) of the country. Untreated PKU causes mental retardation, microcephaly, and seizure. PAH gene mutations located at chromosome 12q23 are responsible for the classical type of this disease. The spectrum of PAH mutations is varied in different ethnicities and different parts of the world. The aim of this study was to investigate the frequency of PAH mutation in the Mazandaran province, which could be useful for genetic counseling and prenatal diagnosis.<h4>Methods</h4>A total of 66 individuals from 33 families from two provinces (9 families from Golestan and 24 families from Mazandaran) from north of Iran participated in this study. After genomic DNA extraction, PAH gene analysis was carried out using DNA sequencing of both coding and non-coding regions by ABI 3130XL genetic analyzer.<h4>Results</h4>Twenty-six different mutations were identified in the PAH gene in this study. Four mutations including IVS10-11 (c.1066-11G>A), c.727C>T (p.Arg243X), c.898G>T (p.Ala300Ser), and c.601C>T (p.His201Tyr) were the most common mutations with 37.48% frequency in Mazandaran province. Most frequent mutations in Golestan province were IVSI0-11 (c.1066-11G>A), c.722delG (p.Arg241fs), c.842C>T (p.Pro281Leu), and IVSII+5 (G>A) with frequency 58.57%.<h4>Conclusions</h4>The results from the present study verify heterogeneity of the PAH gene and may help to diagnose tests for carrier detection and prenatal diagnosis of the PKU disease in Iranian population.",0.996785,True,[prevalence],"[6 . 002 in 100 , 000 newborns, 1 in 4 , 698, %]","[of iran, global, ira, ma, dar, north, iran]",,,[ira]
1,32893076,"Spectrum of PAH gene mutations and genotype-phenotype correlation in patients with phenylalanine hydroxylase deficiency from Shanxi province. <h4>Background</h4>Phenylalanine hydroxylase deficiency (PAHD) is an autosomal recessive inborn error that affects phenylalanine (Phe) metabolism. It has a complex phenotype with many variants and genotypes among different populations. Shanxi province is a high-prevalence area of PAHD in China.<h4>Methods</h4>In this study, eighty-nine PAHD patients were subjected to genetic testing using Sanger sequencing, followed by multiplex ligation-dependent probe amplification analysis (MLPA). Allelic and genotypic phenotype values (APV and GPV, respectively) were used for genotype-based phenotypic prediction.<h4>Results</h4>Fifty-one types of variants, including three novel forms, were identified. The predominant variant was p.R243Q (22.09%), followed by p.R53H (10.47%), p.EX6-96A > G (9.30%), p.V399V (5.23%) and p.R413P (3.49%). Notably, mild hyperphenylalaninemia (MHP) has a high prevalence in this region (up to 45.76%), and the variant p.R53H was solely observed in patients of MHP. According to the genotype-phenotype prediction, the APV/GPV system was well correlated with the metabolic phenotype of most PAHD patients.<h4>Conclusion</h4>We have systematically constructed the mutational and phenotypic spectrum of PAH in Shanxi province. Hence, this study will help to further understand the genotype-phenotype associations in PAHD patients, and it may offer more reliable genetic counseling and management.",0.842657,True,[prevalence],,"[shanxi province, chin, a, shan, province]",,,


In [7]:
a.to_csv('case_study/classic-homocystinuria.csv',index=False)
b.to_csv('case_study/Fellman-syndrome.csv',index=False)
c.to_csv('case_study/phenylketonuria.csv',index=False)
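
The `STAT` values above are tokenized strings such as `1 : 50 , 000` or `1 / 47 , 000`. The sketch below shows one way to turn ratio-style values into a rate per 100,000 for comparison across rows. It is not part of the pipeline: `parse_ratio` and `per_100k` are hypothetical helpers, and only the `a : b` and `a / b` forms are handled.

```python
import re
from fractions import Fraction

def parse_ratio(stat):
    """Parse '1 : 50 , 000' or '1 / 47 , 000' into a Fraction.

    Returns None for anything that is not a simple a:b or a/b ratio.
    """
    # Drop the spaces and thousands separators left by tokenization,
    # e.g. '1 : 50 , 000' -> '1:50000'
    compact = re.sub(r"[\s,]", "", stat)
    match = re.fullmatch(r"(\d+)[:/](\d+)", compact)
    if match is None:
        return None
    numerator, denominator = int(match.group(1)), int(match.group(2))
    if denominator == 0:
        return None
    return Fraction(numerator, denominator)

def per_100k(stat):
    """Convert a ratio-style STAT string to cases per 100,000, or None."""
    ratio = parse_ratio(stat)
    return None if ratio is None else float(ratio * 100_000)

# STAT strings copied from the outputs above
for stat in ["1 : 50 , 000", "1 : 311 , 000", "1 / 47 , 000", "least"]:
    print(stat, "->", per_100k(stat))
```

Values in other forms, such as `0 . 64` followed by a separate `per 10 , 000 births` entity, return `None` here and would need their own handling.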

Other GARD IDs that also give good results:
- GARD:0009941
- GARD:0006667
- GARD:0012301
- GARD:0007137
- GARD:0002470
- GARD:0006209
- GARD:0006665
- GARD:0007383
- GARD:0005274
- GARD:0002153
- GARD:0007627
- GARD:0000092
- GARD:0000111