# Comparing Model to Orphanet

Goal: compare the output of my model with Orphanet's data.

The comparison needs to be made on multiple levels:
 - Disease Name / GARD_ID
 - Prevalence Type
 - Prevalence Class
   - Not if the Source is Expert
 - Location

This also requires me to:
1. Input the Orphanet data
2. XX Write a PMC abstract-getting function (EBI API) (Tried, did not work: the EBI API and PMC API don't have most full-text articles) XX
3. Input the model
4. Default a missing location to Worldwide
5. Make predictions
6. Save predictions

### Input Orphanet Data

In [1]:
import xml.etree.ElementTree as ET
import re
import pandas as pd
import classify_abs
import extract_abs
import time

In [2]:
#This file was downloaded on August 31, 2021. See README.md for details
tree = ET.parse('en_product9_prev.xml')
root = tree.getroot()
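
For orientation, here is a minimal sketch of the record layout that the parsing loop in this notebook expects. The element names are taken from the `find`/`findall` paths used in the loop; the `Sample` wrapper element, the `Class only` qualification text, and the `9918480[PMID]` source format are assumptions made for illustration, not copied from the real file.

```python
import xml.etree.ElementTree as ET

# Made-up sample record. Element names mirror the paths used in this notebook;
# the wrapper element and the Source/Qualification values are assumptions.
SAMPLE_XML = """
<Sample>
  <Disorder>
    <OrphaCode>61</OrphaCode>
    <Name>Alpha-mannosidosis</Name>
    <PrevalenceList>
      <Prevalence>
        <Source>9918480[PMID]</Source>
        <PrevalenceType><Name>Prevalence at birth</Name></PrevalenceType>
        <PrevalenceQualification><Name>Class only</Name></PrevalenceQualification>
        <PrevalenceClass><Name>&lt;1 / 1 000 000</Name></PrevalenceClass>
        <ValMoy>0.09</ValMoy>
        <PrevalenceGeographic><Name>Australia</Name></PrevalenceGeographic>
      </Prevalence>
    </PrevalenceList>
  </Disorder>
</Sample>
"""

sample_root = ET.fromstring(SAMPLE_XML)
records = []
for disorder in sample_root.iter('Disorder'):
    # One disorder can carry several Prevalence entries
    for prevalence in disorder.findall('./PrevalenceList/Prevalence'):
        records.append({
            'code': disorder.find('./OrphaCode').text,
            'name': disorder.find('./Name').text,
            'type': prevalence.find('./PrevalenceType/Name').text,
            'class': prevalence.find('./PrevalenceClass/Name').text,
            'rate': prevalence.find('./ValMoy').text,
            'loc': prevalence.find('./PrevalenceGeographic/Name').text,
            'source': prevalence.find('./Source').text,
        })

print(records)
```

Each `Prevalence` element becomes one flat record that repeats the parent disorder's code and name, which is the shape the comparison table needs.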

In [3]:
df = pd.DataFrame(columns=['OrphaCode',
                           'Disease Name',
                           'Orpha Epi Type',
                           'Orpha Epi Class',
                           'Orpha Epi Rate',
                           'Orpha Loc', 
                           'PMID',#Above here is orphanet (will show to the left)
                           'Title+Abstract', #Below here is my model (will show to the right)
                           'GARD Disease ID',
                           'Pipeline Disease',
                           'Epi Identifier',
                           'Epi Statistics',
                           'Model Location',
                           'Model Date',
                           'Model Sex',
                           'Model Ethnicity'])

In [4]:
NER_pipeline, labels = extract_abs.init_NER_pipeline()
GARD_dict, max_length = extract_abs.load_GARD_diseases()

Parse through the entire Orphanet Database

In [5]:
i = 0
rows = []          # collected comparison rows (DataFrame.append was removed in pandas 2.0)
pmid_abs = {}      # PMID -> abstract, so each abstract is fetched only once
pmid_extract = {}  # PMID -> extraction, so the model runs only once per abstract
# Subtracting 18000 s shifts the printed timestamp back by 5 hours
print(i, time.ctime(time.time()-18000))
for disorder in root.iter('Disorder'):
    code = disorder.find('./OrphaCode').text
    name = disorder.find('./Name').text
    # Each disorder, w/code and name, has multiple prevalence branches
    for prevalence in disorder.findall('./PrevalenceList/Prevalence'):
        EPtype = prevalence.find('./PrevalenceType/Name').text
        if 'class' in prevalence.find('./PrevalenceQualification/Name').text.lower():
            EPclss = prevalence.find('./PrevalenceClass/Name').text
        else:
            EPclss = ''
        EPrate = prevalence.find('./ValMoy').text
        geoloc = prevalence.find('./PrevalenceGeographic/Name').text
        source = prevalence.find('./Source').text
        # Each prevalence, w/geoloc and source, has multiple PMIDs w/abstracts.
        # Skip expert-sourced entries and entries without a prevalence class.
        if 'PMID' in str(source) and 'EXPERT' not in str(source) and len(EPclss)>1:
            pmids = re.findall(r'\d{6,8}', source)
            for pmid in pmids:
                # Cache abstracts so the EBI API is not queried repeatedly for the same PMID
                if pmid not in pmid_abs:
                    pmid_abs[pmid] = classify_abs.PMID_getAb(pmid)
                abstract = pmid_abs[pmid]
                if len(abstract)>5:
                    if pmid not in pmid_extract:
                        extraction = extract_abs.abstract_extraction(abstract, NER_pipeline, labels, GARD_dict, max_length)
                        # Default to worldwide when the model finds no location
                        if len(extraction['LOC']) == 0:
                            extraction['LOC'].update(['worldwide'])
                        pmid_extract[pmid] = extraction
                    else:
                        # Reuse the cached extraction so the model does not rerun on duplicate abstracts
                        extraction = pmid_extract[pmid]
                    # Note: there are duplicate PMIDs next to each other in the dataset,
                    # but they are kept in case Orphanet has different extraction data
                    rows.append({'OrphaCode':code,
                                 'Disease Name':name,
                                 'Orpha Epi Type':EPtype,
                                 'Orpha Epi Class':EPclss,
                                 'Orpha Epi Rate':EPrate,
                                 'Orpha Loc':geoloc,
                                 'PMID':pmid,
                                 'Title+Abstract':abstract,
                                 'GARD Disease ID':extraction['IDS'],
                                 'Pipeline Disease':extraction['DIS'],
                                 'Epi Identifier':extraction['EPI'],
                                 'Epi Statistics':extraction['STAT'],
                                 'Model Location':extraction['LOC'],
                                 'Model Date':extraction['DATE'],
                                 'Model Ethnicity':extraction['ETHN'],
                                 'Model Sex':extraction['SEX']})
                i+=1
                if i%500==0:
                    print(i,time.ctime(time.time()-18000))

# Build the DataFrame once at the end, keeping the column order defined earlier
df = pd.DataFrame(rows, columns=df.columns)

0 Mon Jan 31 21:59:10 2022


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


500 Mon Jan 31 22:07:59 2022
1000 Mon Jan 31 22:19:05 2022
1500 Mon Jan 31 22:29:47 2022
2000 Mon Jan 31 22:39:22 2022
2500 Mon Jan 31 22:51:06 2022
3000 Mon Jan 31 23:02:44 2022
3500 Mon Jan 31 23:12:56 2022
4000 Mon Jan 31 23:24:50 2022
4500 Mon Jan 31 23:38:05 2022
5000 Mon Jan 31 23:54:06 2022
5500 Tue Feb  1 00:06:48 2022
6000 Tue Feb  1 00:18:45 2022
6500 Tue Feb  1 00:27:23 2022
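
The loop above combines two small ideas: pulling PMIDs out of the `Source` field, and caching results per PMID so repeated citations are not fetched or processed twice. The sketch below isolates both. The `Source` strings are hypothetical examples of the format, and `fake_fetch` is a made-up stand-in for `classify_abs.PMID_getAb` so that the snippet runs without network access.

```python
import re

def extract_pmids(source):
    """Return PMID-like digit runs from a Source string.

    Mirrors the filter in the loop: sources without 'PMID', or with any
    'EXPERT' tag, yield no PMIDs.
    """
    text = str(source)
    if 'PMID' not in text or 'EXPERT' in text:
        return []
    return re.findall(r'\d{6,8}', text)

fetch_calls = []  # records every simulated API call

def fake_fetch(pmid):
    # Hypothetical stand-in for classify_abs.PMID_getAb (no network access)
    fetch_calls.append(pmid)
    return f'abstract for {pmid}'

abstract_cache = {}

def get_abstract(pmid):
    # Fetch on the first request only; afterwards serve from the cache
    if pmid not in abstract_cache:
        abstract_cache[pmid] = fake_fetch(pmid)
    return abstract_cache[pmid]

# Hypothetical Source values; the second PMID is cited twice
sources = ['11389160[PMID]_9689990[PMID]', '[EXPERT]', '9689990[PMID]', None]
for source in sources:
    for pmid in extract_pmids(source):
        get_abstract(pmid)

print(fetch_calls)
```

Three PMID citations go in, but only two fetches happen, because the repeated PMID is served from `abstract_cache`.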


In [6]:
df

Unnamed: 0,OrphaCode,Disease Name,Orpha Epi Type,Orpha Epi Class,Orpha Epi Rate,Orpha Loc,PMID,Title+Abstract,GARD Disease ID,Pipeline Disease,Epi Identifier,Epi Statistics,Model Location,Model Date,Model Sex,Model Ethnicity
0,166024,"Multiple epiphyseal dysplasia, Al-Gazali type",Point prevalence,<1 / 1 000 000,0.0,Worldwide,11389160,Localisation of a gene for an autosomal recess...,[],[],{},[],{worldwide},{},{},{oman}
1,166024,"Multiple epiphyseal dysplasia, Al-Gazali type",Point prevalence,<1 / 1 000 000,0.0,Worldwide,9689990,"Autosomal recessive syndrome of macrocephaly, ...",[],[],{},[],{worldwide},{},{},{oman}
2,166032,"Multiple epiphyseal dysplasia, with miniepiphyses",Point prevalence,<1 / 1 000 000,0.0,Worldwide,15523498,Mutations in the known genes are not the major...,[],[],{},[],{worldwide},{},{},{}
3,61,Alpha-mannosidosis,Prevalence at birth,<1 / 1 000 000,0.09,Australia,9918480,Prevalence of lysosomal storage disorders. <h4...,[],[],"{combined prevalence, prevalence}","[1 per 57000 live births, 1 per 4 . 2 million ...","{australia, ##tralia, aus}","{ja, december 31 , 1996, ##nuary 1 , 1980, thr...",{},"{##tralian, aus}"
4,61,Alpha-mannosidosis,Prevalence at birth,1-9 / 1 000 000,0.13,Norway,7900112,[Alpha-mannosidosis]. Alpha-mannosidosis is a ...,[],[],{},[],"{##so, trom}",{between 1983 and 1987},{},{}
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6377,99812,LIG4 syndrome,Point prevalence,<1 / 1 000 000,0.0,Worldwide,27717373,DNA ligase IV syndrome; a review. DNA ligase I...,[],[],{},[],{worldwide},{},{},{}
6378,99806,Oculootodental syndrome,Point prevalence,<1 / 1 000 000,0.0,Worldwide,12147582,First genomic localization of oculo-oto-dental...,[],[],{},[],{worldwide},{},{},"{br, ##itis}"
6379,99807,PEHO-like syndrome,Point prevalence,<1 / 1 000 000,0.0,Worldwide,15968934,"Progressive encephalopathy with edema, hypsarr...",[GARD:0010559],[child],{},[],{worldwide},{},{},{}
6380,99792,Dentin dysplasia-sclerotic bones syndrome,Point prevalence,<1 / 1 000 000,0.0,Worldwide,264650,Dentine dysplasia with sclerotic bone and skel...,[],[],{},[],{worldwide},{},{},{}


In [7]:
#Not going to remove the duplicate PMIDs because Orphanet has different stat, disease, loc, info for each duplicated entry.
df.to_csv('Orphanet-Comparison-FINAL.csv', header = True,index=False, encoding='utf-8')
print('DONE')

DONE
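
As a possible next step toward the location-level comparison named in the goal, here is a minimal sketch that checks whether Orphanet's location appears among the model's extracted locations. The helper name `location_matches` is hypothetical, and the match is a plain case-insensitive string comparison. That is an assumption about how agreement should be scored, and it does not handle wordpiece fragments such as `##tralia`. The sample rows reuse values from the table shown above.

```python
def location_matches(orpha_loc, model_locs):
    """True if the Orphanet location is among the model's locations (case-insensitive)."""
    target = orpha_loc.strip().lower()
    return target in {loc.strip().lower() for loc in model_locs}

# Values taken from rows 0, 3 and 4 of the table above
sample_rows = [
    {'Orpha Loc': 'Worldwide', 'Model Location': {'worldwide'}},
    {'Orpha Loc': 'Australia', 'Model Location': {'australia', '##tralia', 'aus'}},
    {'Orpha Loc': 'Norway', 'Model Location': {'##so', 'trom'}},
]

matches = [location_matches(r['Orpha Loc'], r['Model Location']) for r in sample_rows]
print(matches)  # [True, True, False]
```

The Norway row shows the limitation: the model only produced fragments of a place name, so an exact comparison counts it as a miss.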
