# Load functions

In [1]:
%run "C:/Users/asclo/Desktop/HHS/NIH Dashboard/Python Notebooks/Final Deliverables/FunctionsForAnalysis.py"

# Load data

In [2]:
abstract = pd.read_csv('C:/Users/asclo/Desktop/HHS/NIH Dashboard/Python Notebooks/Final Deliverables/All pmid abstracts from hpo annotations.csv',sep=',', index_col = False)
abstract = abstract.rename(columns = {'Pubmed_ID': 'text_id', 'Abstract': 'text'})
abstract['text_id'] = abstract['text_id'].astype(str)
abstract = abstract.iloc[0:2, :]


# Functions for HPO Annotation File

## HPO Annotations File

**Input:** **Website**

**Output:** Dataset from HPO with annotations for each disease


In [3]:
hpo_annotations = get_hpo_annotations_and_clean()
hpo_annotations

Unnamed: 0,#disease-db,disease-identifier,disease-name,negation,hpo,reference,evidence-code,onset,frequencyHPO,modifier,sub-ontology,alt-names,curators,frequencyRaw,sex,uniqueid
0,DECIPHER,1,Wolf-Hirschhorn Syndrome,,HP:0000252,DECIPHER:1,IEA,,,,P,WOLF-HIRSCHHORN SYNDROME,HPO:skoehler[2013-05-29],-,-,HP:0000252DECIPHER:1
1,DECIPHER,1,Wolf-Hirschhorn Syndrome,,HP:0001249,DECIPHER:1,IEA,,,,P,WOLF-HIRSCHHORN SYNDROME,HPO:skoehler[2013-05-29],-,-,HP:0001249DECIPHER:1
2,DECIPHER,1,Wolf-Hirschhorn Syndrome,,HP:0001250,DECIPHER:1,IEA,,,,P,WOLF-HIRSCHHORN SYNDROME,HPO:skoehler[2013-05-29],-,-,HP:0001250DECIPHER:1
3,DECIPHER,1,Wolf-Hirschhorn Syndrome,,HP:0001252,DECIPHER:1,IEA,,,,P,WOLF-HIRSCHHORN SYNDROME,HPO:skoehler[2013-05-29],-,-,HP:0001252DECIPHER:1
4,DECIPHER,1,Wolf-Hirschhorn Syndrome,,HP:0001518,DECIPHER:1,IEA,,,,P,WOLF-HIRSCHHORN SYNDROME,HPO:skoehler[2013-05-29],-,-,HP:0001518DECIPHER:1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
203678,ORPHA,99978,Klatskin tumor,,HP:0002716,ORPHA:99978,TAS,,HP:0040282,,P,,orphadata,-,-,HP:0002716ORPHA:99978
203679,ORPHA,99978,Klatskin tumor,,HP:0004936,ORPHA:99978,TAS,,HP:0040283,,P,,orphadata,-,-,HP:0004936ORPHA:99978
203680,ORPHA,99978,Klatskin tumor,,HP:0012334,ORPHA:99978,TAS,,HP:0040281,,P,,orphadata,-,-,HP:0012334ORPHA:99978
203681,ORPHA,99978,Klatskin tumor,,HP:0012378,ORPHA:99978,TAS,,HP:0040283,,P,,orphadata,-,-,HP:0012378ORPHA:99978


## HPO Annotations Pubmed and Disease Pairings

**Input:** The HPO Annotations File

**Output:** Dataset with disease names and text_ids for each disease/PubMed ID pairing


In [4]:
hpo_annotations_pmid_and_disease(hpo_annotations)

Unnamed: 0,disease-name,text_id
114,Xq28 (MECP2) duplication,17088400
244,15q26 overgrowth syndrome,19133692
246,15q26 overgrowth syndrome,20603595
250,15q26 overgrowth syndrome,10951463
251,15q26 overgrowth syndrome,12404101
...,...,...
110002,"Mitochondrial complex IV deficiency, nuclear t...",31290619
110025,"Vitamin D-dependent rickets, type 3",29461981
110040,"Cleft palate, proliferative retinopathy, and d...",30976112
110061,Neurodevelopmental disorder with alopecia and ...,30475435


# Functions for HPO Graph (OBO) File

## Loading Graph into Python

**Input:** https://raw.githubusercontent.com/obophenotype/human-phenotype-ontology/master/hp.obo

**Output:** HPO IDs and their hierarchy

In [5]:
help(hpo_hierarchy_graph_load)

Help on function hpo_hierarchy_graph_load in module __main__:

hpo_hierarchy_graph_load()
    Loads all data from the graph/obo file located here:
    
    https://raw.githubusercontent.com/obophenotype/human-phenotype-ontology/master/hp.obo
    
    Input: None
    
    Output: HPO IDs and their hierarchy



In [6]:
g = hpo_hierarchy_graph_load()
g

<networkx.classes.multidigraph.MultiDiGraph at 0x25c293964c8>

## List of Phenotypic Abnormalities

**Input:** HPO graph network

**Output:** List of hpo codes that are under phenotypic abnormality

In [7]:
help(graph_phenotypic_abnormality)

Help on function graph_phenotypic_abnormality in module __main__:

graph_phenotypic_abnormality(graph_network)
    This creates a list of HP codes that are children of the phenotypic abnormality.
    It checks if each HPO code in the graph is a child of HP:0000118, which is phenotypic abnormality.
    
    Input: HPO graph_network
    Output: List of hpo codes that are under phenotypic abnormality



In [8]:
phenotypic_abnormality_list = graph_phenotypic_abnormality(g)
phenotypic_abnormality_list[1:10]

['HP:0000003',
 'HP:0000008',
 'HP:0000009',
 'HP:0000010',
 'HP:0000011',
 'HP:0000012',
 'HP:0000013',
 'HP:0000014',
 'HP:0000015']
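The check described in the help text can be sketched on a toy graph. This is a minimal sketch, not the code in FunctionsForAnalysis.py: it assumes the edges point from child term to parent term, and `toy` and `terms_under` are hypothetical names used only for illustration.

```python
import networkx as nx

# Toy stand-in for the HPO graph. Assumption: edges point child -> parent.
toy = nx.MultiDiGraph()
toy.add_edge("HP:0000118", "HP:0000001", key="is_a")  # phenotypic abnormality -> root
toy.add_edge("HP:0000005", "HP:0000001", key="is_a")  # a sibling branch, not an abnormality
toy.add_edge("HP:0000119", "HP:0000118", key="is_a")
toy.add_edge("HP:0000079", "HP:0000119", key="is_a")

def terms_under(graph, root="HP:0000118"):
    """Return the sorted HPO ids that sit below `root` in the hierarchy."""
    # With child -> parent edges, every node that can reach `root` is a subterm;
    # networkx calls those nodes the "ancestors" of `root`.
    return sorted(nx.ancestors(graph, root))

print(terms_under(toy))
```

If the real graph stores edges in the opposite direction, `nx.descendants` would be the matching call.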

## Get All external references for all HPO codes

**Input:** HPO graph network

**Output:** DataFrame with hpo codes and associated external references


In [9]:
help(get_all_external_references_for_hpo_codes)

Help on function get_all_external_references_for_hpo_codes in module __main__:

get_all_external_references_for_hpo_codes(graph_network)
    This function pulls all of the other codes associated with each hpo code
    
    Input: HPO graph network
    Output: DataFrame with hpo codes and associated external references



In [10]:
get_all_external_references_for_hpo_codes(g).head(5)

Unnamed: 0,hpo,x_ref
HP:0000001,HP:0000001,UMLS:C0444868
HP:0000002,HP:0000002,UMLS:C4025901
HP:0000003,HP:0000003,MSH:D021782
HP:0000005,HP:0000005,UMLS:C1708511
HP:0000006,HP:0000006,SNOMEDCT_US:263681008


## Get UMLS codes for Metamap

**Input:** HPO graph network

**Output:** DataFrame with hpo codes and their associated UMLS concept IDs

In [11]:
help(get_hpos_for_umls_code)

Help on function get_hpos_for_umls_code in module __main__:

get_hpos_for_umls_code(graph_network)
    Metamap is coded under UMLS concept codes.  We need to convert these into HPO code.
    There are external references for HPOs codes in the graph network of HPO phenotypes.
    
    This function gets all of the umls codes for hpos that are phenotypic abnormalities.
    
    Child Function: get_all_external_references_for_hpo_codes



In [12]:
df = get_hpos_for_umls_code(g)
df.head(5)

Unnamed: 0,hpo,conceptId
HP:0000001,HP:0000001,C0444868
HP:0000002,HP:0000002,C4025901
HP:0000005,HP:0000005,C1708511
HP:0000008,HP:0000008,C4025900
HP:0000009,HP:0000009,C3806583
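The conversion from external references to UMLS concept IDs could look roughly like the sketch below. It assumes each graph node stores its references in an `xref` list of `SOURCE:ID` strings; the toy values are copied from the outputs above, `umls_concepts` is a hypothetical name, and the sketch does not apply the phenotypic abnormality filter mentioned in the help text.

```python
import networkx as nx
import pandas as pd

# Toy graph; the "xref" attribute name is an assumption about the node data.
toy = nx.MultiDiGraph()
toy.add_node("HP:0000001", xref=["UMLS:C0444868"])
toy.add_node("HP:0000002", xref=["UMLS:C4025901"])
toy.add_node("HP:0000003", xref=["MSH:D021782"])
toy.add_node("HP:0000006", xref=["SNOMEDCT_US:263681008"])

def umls_concepts(graph):
    """Keep only UMLS references and strip the 'UMLS:' prefix."""
    rows = [
        {"hpo": node, "conceptId": ref.split(":", 1)[1]}
        for node, data in graph.nodes(data=True)
        for ref in data.get("xref", [])
        if ref.startswith("UMLS:")
    ]
    return pd.DataFrame(rows, columns=["hpo", "conceptId"])

print(umls_concepts(toy))
```

Stripping the prefix leaves values in the same form as the `conceptId` column that MetaMap returns, which is what makes a later join possible.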


## Get Alternate IDs

**Input:** One HPO Code; HPO graph network

**Output:** DataFrame with the alternate hpo codes for the given hpo code


In [13]:
help(get_alt_ids)

Help on function get_alt_ids in module __main__:

get_alt_ids(hpo_code, graph_network)
    # %%



In [14]:
get_alt_ids('HP:0002817', g)

Unnamed: 0,alt_hpo,hpo
0,HP:0003838,HP:0002817


##  Get Alternate IDs for full HPO list

**Input:** List of HPO Codes; HPO graph network

**Output:** DataFrame with the alternate hpo codes for each hpo code in the list

In [15]:
help(graph_alternate_direct_ids)

Help on function graph_alternate_direct_ids in module __main__:

graph_alternate_direct_ids(List_of_HPOs, graph_network)
    This gets all alternate HPO codes for each given HPO.  
    There can be many HPO codes being used for the same phenotype issues.
    
    Child function:
        - get_alt_ids - this supplies the alternate HPO codes for one given HPO code



In [16]:
lst = ['HP:0002817', 'HP:0001155']
graph_alternate_direct_ids(lst, g)

Unnamed: 0,alt_hpo,hpo
0,HP:0003838,HP:0002817
0,HP:0005858,HP:0001155
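A rough sketch of the lookup, assuming alternate IDs live in an `alt_id` list on each node (an assumption about the graph's node data). The function name `alt_ids` is hypothetical and the toy values are the ones shown in the outputs above.

```python
import networkx as nx
import pandas as pd

# Toy graph; the "alt_id" attribute name is an assumption.
toy = nx.MultiDiGraph()
toy.add_node("HP:0002817", alt_id=["HP:0003838"])
toy.add_node("HP:0001155", alt_id=["HP:0005858"])
toy.add_node("HP:0040064")  # a term without alternate ids

def alt_ids(hpo_code, graph):
    """Return one row per alternate id recorded for `hpo_code`."""
    alts = graph.nodes[hpo_code].get("alt_id", [])
    return pd.DataFrame({"alt_hpo": alts, "hpo": [hpo_code] * len(alts)})

# The list version simply stacks the per-code results.
codes = ["HP:0002817", "HP:0001155"]
combined = pd.concat([alt_ids(code, toy) for code in codes])
print(combined)
```

Because each per-code frame starts its index at 0, the stacked result repeats index 0, as in the output above.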


## Get Child and Parent Relationships

**Input:** One HPO code; HPO graph network

**Output:** DataFrame with related hpos, original hpo, and type of relationship


In [17]:
help(get_child_parent)

Help on function get_child_parent in module __main__:

get_child_parent(hpo_code, graph_network)
    This function takes one hpo code and returns the parent and child hpos.



In [18]:
df = get_child_parent('HP:0002817', g)
df.head(5)

Unnamed: 0,related_hpo,hpo,relationship
0,HP:0040064,HP:0002817,Parent
0,HP:0001155,HP:0002817,Child
1,HP:0001446,HP:0002817,Child
2,HP:0001454,HP:0002817,Child
3,HP:0002973,HP:0002817,Child
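One way to read parents and children off the graph is sketched below. It assumes edges point from child to parent, so successors are parents and predecessors are children; `child_parent` is a hypothetical stand-in for the real function, and the toy edges reuse codes from the output above.

```python
import networkx as nx
import pandas as pd

# Toy graph. Assumption: edges point child -> parent.
toy = nx.MultiDiGraph()
toy.add_edge("HP:0002817", "HP:0040064", key="is_a")
toy.add_edge("HP:0001155", "HP:0002817", key="is_a")
toy.add_edge("HP:0001446", "HP:0002817", key="is_a")

def child_parent(hpo_code, graph):
    """Return direct parents and children of `hpo_code`, one row each."""
    parents = [(p, hpo_code, "Parent") for p in sorted(graph.successors(hpo_code))]
    children = [(c, hpo_code, "Child") for c in sorted(graph.predecessors(hpo_code))]
    return pd.DataFrame(parents + children,
                        columns=["related_hpo", "hpo", "relationship"])

print(child_parent("HP:0002817", toy))
```

Only one hierarchical level is followed in each direction, which matches the "within one level" matching used later in this notebook.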


## Get Child and Parent Relationships AND Alternative Ids

**Input:** One HPO code; HPO graph network

**Output:** DataFrame with related hpos (including their alternate ids), original hpo, and type of relationship

In [19]:
#Comments
help(get_child_parent_and_alternatives)

Help on function get_child_parent_and_alternatives in module __main__:

get_child_parent_and_alternatives(hpo_code, graph_network)
    Combines the:
    1. get_child_parent function, with
    2. graph alternate_direct_ids function
    
    To get a full list of all parent and child and thier alternative ids for one hpo code.



In [20]:
#Example
df = get_child_parent_and_alternatives('HP:0002817', g)
df.head(5)

Unnamed: 0,related_hpos_with_alternates,hpo,relationship
0,HP:0040064,HP:0002817,Parent
0,HP:0001155,HP:0002817,Child
1,HP:0001446,HP:0002817,Child
2,HP:0001454,HP:0002817,Child
3,HP:0002973,HP:0002817,Child


# Mondo API Annotation

## Get one Mondo Annotation

**Input:** Text for annotation; Id for text; optional settings that mirror the modifications allowed by the API

**Output:** DataFrame of all mondo annotations, the mondo category, the term annotated, and the Id for text

In [21]:
help(get_one_mondo_annotation)

Help on function get_one_mondo_annotation in module __main__:

get_one_mondo_annotation(text, text_id, min_word_length=4, longest_only='true', include_abbreviation='false', include_acronym='false', include_numbers='false')
    Creates a call to the mondo anotator.  it allows for the same modifications to the mondo annotator as the api itself.
    
    Requires input of:
        - Text to be examined
        - ID for the text, most commonly PubMedId



In [22]:
get_one_mondo_annotation(abstract.iloc[0, :]['text'], abstract.iloc[0, :]['text_id'], min_word_length = 4, longest_only = 'true', 
                        include_abbreviation = 'false', include_acronym = 'false', include_numbers = 'false')

Unnamed: 0,id,category,terms,text_id
0,HP:0032320,[],['Affected'],17088400
1,RO:0002418,[],['causally upstream of or within'],17088400
2,RO:0002264,[],['acts upstream of or within'],17088400
3,UBERON:0003101,['anatomical entity'],['male organism'],17088400
4,PATO:0000384,['quality'],['male'],17088400
...,...,...,...,...
110,PATO:0000427,['quality'],['recurrent'],17088400
111,foaf:Document,[],['Document'],17088400
112,IAO:0000310,['publication'],['document'],17088400
113,IAO:0000572,[],['documenting'],17088400


## Get All Mondo Annotations

**Input:** DataFrame (that contains Text for Annotation; Id for text); optional annotator settings (the `_all` keyword arguments)

**Output:** DataFrame of all mondo annotations, the mondo category, the term annotated, and the Id for text

In [23]:
mondo_df = get_all_mondo_annotations(abstract, min_word_length_all = 4, longest_only_all = 'true', 
                    include_abbreviation_all ='false', include_acronym_all = 'false',include_numbers_all = 'false')

mondo_df

Unnamed: 0,annotation_id,category,terms,text_id
0,HP:0032320,[],['Affected'],17088400
1,RO:0002418,[],['causally upstream of or within'],17088400
2,RO:0002264,[],['acts upstream of or within'],17088400
3,UBERON:0003101,['anatomical entity'],['male organism'],17088400
4,PATO:0000384,['quality'],['male'],17088400
...,...,...,...,...
88,MP:0000519,"['phenotype', 'quality']",['hydronephrosis'],19133692
89,HP:0000126,"['phenotype', 'quality']",['Hydronephrosis'],19133692
90,PATO:0000463,['quality'],['conspicuous'],19133692
91,MONDO:0017806,"['disease', 'quality']",['15q overgrowth syndrome'],19133692


## Cleaning Mondo Annotations

**Input:** DataFrame of mondo annotations (one or many)

**Output:** DataFrame with only the annotations that reference an hpo code, the mondo category, the term annotated, and the Id for text

In [24]:
help(cleaning_mondo_to_hpo)

Help on function cleaning_mondo_to_hpo in module __main__:

cleaning_mondo_to_hpo(mondo_data_frame)
    Makes sure HPO code is being referenced.  
    Can be used with one or many annotations.



In [25]:
mondo_df_cleaned = cleaning_mondo_to_hpo(mondo_df)
mondo_df_cleaned

Unnamed: 0,hpo,category,terms,text_id
0,HP:0032320,[],['Affected'],17088400
8,HP:0001417,['inheritance'],['X-linked inheritance'],17088400
10,HP:0010864,"['phenotype', 'quality']","['Intellectual disability, severe']",17088400
11,HP:0001290,"['phenotype', 'quality']",['Generalized hypotonia'],17088400
13,HP:0002205,"['phenotype', 'quality']",['Recurrent respiratory infections'],17088400
29,HP:0001249,"['phenotype', 'quality']",['Intellectual disability'],17088400
83,HP:0008947,"['phenotype', 'quality']",['Infantile muscular hypotonia'],17088400
87,HP:0001344,"['phenotype', 'quality']",['Absent speech'],17088400
89,HP:0001250,"['phenotype', 'quality']",['Seizure'],17088400
92,HP:0001257,"['phenotype', 'quality']",['Spasticity'],17088400


# MetaMap API Annotator

## Get one Metamap Annotation

**Input:** Text for Annotation; Id for text

**Output:** DataFrame of all metamap annotations and the pubmed id associated with the annotations

In [26]:
help(get_one_metamap_annotation)

Help on function get_one_metamap_annotation in module __main__:

get_one_metamap_annotation(text, text_id)
    The metamap api cannont take more than ~1000(?) characters, so we break the text into one sentance each,
    then run each sentance through the metamap annotator.



In [27]:
df = get_one_metamap_annotation(abstract.iloc[0, :]['text'], abstract.iloc[0, :]['text_id'])
df.head(5)

Unnamed: 0,sources,conceptId,conceptName,semanticTypes,preferredName,matchMapList,matchedWords,head,overmatch,positionalInfo,pruningStatus,negationStatus,score,negated,text_id
0,"[AOD, CHV, LNC, MSH, MTH, NCI, SNOMEDCT_US]",C0018017,Objective,[inpr],objective (goal),"[{'conceptMatchEnd': 1, 'lexMatchVariation': 0...",[objective],True,False,"[{'x': 0, 'y': 9}]",0,0,-1000,False,17088400
0,"[LNC, MTH, NCI, SNOMEDCT_US]",C1571702,Objective,[qlco],Objective observation,"[{'conceptMatchEnd': 1, 'lexMatchVariation': 0...",[objective],True,False,"[{'x': 0, 'y': 9}]",0,0,-1000,False,17088400
0,"[AOD, CHV, LNC, MSH, MTH, NCI, SNOMEDCT_US]",C0018017,Goal,[inpr],objective (goal),"[{'conceptMatchEnd': 1, 'lexMatchVariation': 0...",[goal],True,False,"[{'x': 15, 'y': 4}]",0,0,-1000,False,17088400
0,"[HL7V3.0, MTH]",C1571704,Goal,[idcn],Act Mood - Goal,"[{'conceptMatchEnd': 1, 'lexMatchVariation': 0...",[goal],True,False,"[{'x': 15, 'y': 4}]",0,0,-1000,False,17088400
0,"[AOD, CHV, LCH, MSH, MTH, NCI, NCI_CDISC, SNOM...",C0040363,TO,[geoa],Togo,"[{'conceptMatchEnd': 1, 'lexMatchVariation': 0...",[to],True,False,"[{'x': 24, 'y': 2}]",0,0,-1000,False,17088400
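The sentence-by-sentence approach described in the help text might be sketched as follows. This is a naive splitter for illustration only: the real function may split differently, and `split_into_sentences` and the sample text are hypothetical.

```python
import re

def split_into_sentences(text):
    """Split on whitespace that follows '.', '!' or '?' (a naive heuristic)."""
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    return [piece for piece in pieces if piece]

sample = "Objective: to describe a syndrome. Methods: chart review. Results were mixed."
for sentence in split_into_sentences(sample):
    # Each sentence would be sent to the annotator on its own and the
    # per-sentence results concatenated afterwards.
    print(len(sentence), sentence)
```

A splitter like this breaks on abbreviations such as "e.g.", so a production version would likely need something more robust.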


## Get All Metamap Annotations

**Input:** Data frame (with Text for Annotation; Id for text)

**Output:** DataFrame of all metamap annotations and the pubmed id associated with the annotations

In [28]:
help(get_all_metamap_annotations)

Help on function get_all_metamap_annotations in module __main__:

get_all_metamap_annotations(dataset)
    Runs a loop through a full dataset of text and text_ids through the get_one_metamap_annotation function.



In [29]:
df = get_all_metamap_annotations(abstract)
df.head(5)

Unnamed: 0,sources,conceptId,conceptName,semanticTypes,preferredName,matchMapList,matchedWords,head,overmatch,positionalInfo,pruningStatus,negationStatus,score,negated,text_id
0,"[AOD, CHV, LNC, MSH, MTH, NCI, SNOMEDCT_US]",C0018017,Objective,[inpr],objective (goal),"[{'conceptMatchEnd': 1, 'lexMatchVariation': 0...",[objective],True,False,"[{'x': 0, 'y': 9}]",0,0,-1000,False,17088400
0,"[LNC, MTH, NCI, SNOMEDCT_US]",C1571702,Objective,[qlco],Objective observation,"[{'conceptMatchEnd': 1, 'lexMatchVariation': 0...",[objective],True,False,"[{'x': 0, 'y': 9}]",0,0,-1000,False,17088400
0,"[AOD, CHV, LNC, MSH, MTH, NCI, SNOMEDCT_US]",C0018017,Goal,[inpr],objective (goal),"[{'conceptMatchEnd': 1, 'lexMatchVariation': 0...",[goal],True,False,"[{'x': 15, 'y': 4}]",0,0,-1000,False,17088400
0,"[HL7V3.0, MTH]",C1571704,Goal,[idcn],Act Mood - Goal,"[{'conceptMatchEnd': 1, 'lexMatchVariation': 0...",[goal],True,False,"[{'x': 15, 'y': 4}]",0,0,-1000,False,17088400
0,"[AOD, CHV, LCH, MSH, MTH, NCI, NCI_CDISC, SNOM...",C0040363,TO,[geoa],Togo,"[{'conceptMatchEnd': 1, 'lexMatchVariation': 0...",[to],True,False,"[{'x': 24, 'y': 2}]",0,0,-1000,False,17088400


## Cleaning Metamap and Adding HPO Codes

**Input:** DataFrame of metamap annotations; HPO graph network

**Output:** DataFrame of the metamap annotations that have an hpo equivalent, with the hpo code and the pubmed id attached

In [30]:
help(cleaning_metamap_adding_hpo)

Help on function cleaning_metamap_adding_hpo in module __main__:

cleaning_metamap_adding_hpo(metamap_data_frame, graph_network)
    Function requires a metamap data frame AND the graph network for HPO. This graph network is required
    To get all of the umls codes from metamap returned as hpos
    
    This function:
    1. Gets the hpo codes for the umls returned in the metamap.
    2. Drops duplicate values in the data frame
    3. Attached the hpo code to the metamap data frame and removes umls codes that do not have an hpo equivilant.
    
    Uses functions: get_hpos_for_umls_code



In [31]:
df = get_all_metamap_annotations(abstract)
df_clean = cleaning_metamap_adding_hpo(df, g)
df_clean.head(5)

Unnamed: 0,sources,conceptId,conceptName,semanticTypes,preferredName,matchMapList,matchedWords,head,overmatch,positionalInfo,pruningStatus,negationStatus,score,negated,text_id,hpo
13,"[AOD, CHV, DXP, HPO, MSH, MTH, OMIM, SNOMEDCT_US]",C0241764,X-LINKED,['genf'],X-linked inheritance,"[{'conceptMatchEnd': 2, 'lexMatchVariation': 0...","[x, linked]",False,False,"[{'x': 110, 'y': 8}]",0,0,-734,False,17088400,HP:0001417
15,"[AOD, CHV, COSTAR, HPO, ICD10CM, ICD9CM, MTHIC...",C0036857,"Mental retardation, severe",['mobd'],Severe mental retardation (I.Q. 20-34),"[{'conceptMatchEnd': 3, 'lexMatchVariation': 0...","[severe, mental, retardation]",True,False,"[{'x': 131, 'y': 25}]",0,0,-1000,False,17088400,HP:0010864
16,"[AOD, CHV, CSP, CST, DXP, HPO, LCH, LCH_NW, LN...",C0026827,HYPOTONIA,['fndg'],Muscle hypotonia,"[{'conceptMatchEnd': 1, 'lexMatchVariation': 0...",[hypotonia],True,False,"[{'x': 158, 'y': 9}]",0,0,-1000,False,17088400,HP:0001252
90,"[AIR, AOD, CCS, CHV, COSTAR, CSP, CST, DXP, HP...",C0036572,Seizures,['sosy'],Seizures,"[{'conceptMatchEnd': 1, 'lexMatchVariation': 0...",[seizures],True,False,"[{'x': 184, 'y': 8}]",0,1,-1000,True,17088400,HP:0001250
91,"[CHV, CST, DXP, HPO, LCH_NW, MEDLINEPLUS, MSH,...",C0026838,SPASTICITY,['sosy'],Muscle Spasticity,"[{'conceptMatchEnd': 1, 'lexMatchVariation': 0...",[spasticity],True,False,"[{'x': 198, 'y': 10}]",0,1,-1000,True,17088400,HP:0001257


In [32]:
# SETUP FOR FUNCTION

#Annotations to test
one_mondo_annotation = get_one_mondo_annotation(abstract.iloc[0, :]['text'], abstract.iloc[0, :]['text_id'], min_word_length = 4, longest_only = 'true', 
                        include_abbreviation = 'false', include_acronym = 'false', include_numbers = 'false')

id_ref = 'PMID:' + one_mondo_annotation['text_id']

one_mondo_annotation = one_mondo_annotation[one_mondo_annotation['id'].str.contains('HP:')]
one_mondo_annotation = one_mondo_annotation[['id', 'text_id']]
one_mondo_annotation.columns = ['hpo','text_id']

#known annotations
hpo_annotations = get_hpo_annotations_and_clean()
hpo_annotations = hpo_annotations[hpo_annotations['reference'].isin(id_ref)]

# Comparing Annotator to Test Data- Direct Matches

**Input:** known_annotations dataset (with at least uniqueid and hpo columns); dataset of annotations to test (with at least hpo and text_id columns); graph network for hpo

**Output:** one dataset with exact matches, annotations to test that do not have matches, known hpo annotations that do not have a match.

In [33]:
help(one_annotation_direct_matching)

Help on function one_annotation_direct_matching in module __main__:

one_annotation_direct_matching(one_hpo_annotation, one_annotation_to_test, graph_network)
    In this function, we take one set of known hpo annotations and one set of annotations to test against this known set. For this test set, we check for hpo matches directly and with alternate ids.
    
    Input: known_annotations dataset (with at least unqiueid and hpo columns); dataset of annotations to test (with at least hpo and text_id columns); graph network for hpo; graph network
    Output: dataset with exact matches, annotations to test that do not have matches, known hpo annotations that do not have a match.
    
    Child functions: 
    1. graph_alternate_direct_ids



In [34]:
direct_example = one_annotation_direct_matching(hpo_annotations, one_mondo_annotation, graph_network = g)
direct_example

Unnamed: 0,uniqueid,hpo,text_id,exact_match,test_set_annotations_with_no_match,known_annotations_with_no_match
7,HP:0010864PMID:17088400,HP:0010864,17088400.0,1,0.0,0
0,,HP:0032320,17088400.0,0,1.0,0
8,,HP:0001417,17088400.0,0,1.0,0
11,,HP:0001290,17088400.0,0,1.0,0
13,,HP:0002205,17088400.0,0,1.0,0
29,,HP:0001249,17088400.0,0,1.0,0
83,,HP:0008947,17088400.0,0,1.0,0
87,,HP:0001344,17088400.0,0,1.0,0
89,,HP:0001250,17088400.0,0,1.0,0
92,,HP:0001257,17088400.0,0,1.0,0
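The three indicator columns above can be reproduced with an outer merge. The sketch below is an assumption about the mechanics rather than the function's actual code, and it skips the alternate-id step; the toy rows reuse codes from this notebook's outputs.

```python
import pandas as pd

known = pd.DataFrame({
    "uniqueid": ["HP:0010864PMID:17088400", "HP:0002191PMID:17088400"],
    "hpo": ["HP:0010864", "HP:0002191"],
})
to_test = pd.DataFrame({
    "hpo": ["HP:0010864", "HP:0001257"],
    "text_id": ["17088400", "17088400"],
})

# An outer merge keeps unmatched rows from both sides; `indicator=True`
# adds a `_merge` column saying where each row came from.
merged = known.merge(to_test, on="hpo", how="outer", indicator=True)
merged["exact_match"] = (merged["_merge"] == "both").astype(int)
merged["test_set_annotations_with_no_match"] = (merged["_merge"] == "right_only").astype(int)
merged["known_annotations_with_no_match"] = (merged["_merge"] == "left_only").astype(int)
merged = merged.drop(columns="_merge")
print(merged)
```

Rows that exist on only one side get NaN in the other side's columns, which is why `uniqueid` is empty and `text_id` is a float in the output above.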


# Comparing Annotator to Test Data- Direct and Relatives Matches
This tells you whether an annotation is within one hierarchical level of the known hpo codes.

**Input:** known_annotations dataset (with at least uniqueid and hpo columns); dataset of annotations to test (with at least hpo and text_id columns); graph network for hpo

**Output:** one dataset with exact matches, matches to relatives, annotations to test that do not have matches, known hpo annotations that do not have a match.

In [35]:
help(one_annotation_matching_relatives)

Help on function one_annotation_matching_relatives in module __main__:

one_annotation_matching_relatives(one_hpo_annotation, one_annotation_to_test, graph_network)
    In this function, we take one set of known hpo annotations and one set of annotations to test against this known set. For this test set, we check for hpo matches directly, with alternate ids, and finally, with any parent, child relationships (along with thier alternate ids).
    
    Input: known_annotations dataset (with at least unqiueid and hpo columns); dataset of annotations to test (with at least hpo and text_id columns); graph network for hpo; graph network
    Output: dataset with exact matches, matches to relatives, annotations to test that do not have matches, known hpo annotations that do not have a match.
    
    Child functions: 
    1. graph_alternate_direct_ids
    2. graph_parent_child



In [36]:
relative_example = one_annotation_matching_relatives(hpo_annotations, one_mondo_annotation, graph_network = g)
relative_example

Unnamed: 0,uniqueid,hpo,original_hpo,text_id,exact_match,relative_match,test_set_annotations_with_no_match,known_annotations_with_no_match
7,HP:0010864PMID:17088400,HP:0010864,,17088400.0,1,0.0,0.0,0
4,HP:0002191PMID:17088400,HP:0002191,HP:0001257,17088400.0,0,1.0,0.0,0
0,,HP:0032320,,17088400.0,0,0.0,1.0,0
8,,HP:0001417,,17088400.0,0,0.0,1.0,0
11,,HP:0001290,,17088400.0,0,0.0,1.0,0
13,,HP:0002205,,17088400.0,0,0.0,1.0,0
29,,HP:0001249,,17088400.0,0,0.0,1.0,0
83,,HP:0008947,,17088400.0,0,0.0,1.0,0
87,,HP:0001344,,17088400.0,0,0.0,1.0,0
89,,HP:0001250,,17088400.0,0,0.0,1.0,0


In [37]:
# The lengths differ because matching with relatives creates more positive matches between the datasets, and each matched pair occupies a single row
len(direct_example), len(relative_example)

(20, 19)

# Scoring
The current scoring is fairly straightforward, given that we have a binary classification (the hpo either is or is not in the known annotations) and our classifier does not output a degree of certainty. Typically, a classification algorithm gives its prediction as a probability.

Our test is a little counterintuitive since we are testing two distinct samples against each other. This means that every right answer changes the sample size of the combined data. However, what remains clear is that Precision, Recall, and the F1 score are the important measures.

### Explanation of Measurements Outlined Below

**Annotations to Test:** This is the number of annotations from MetaMap or Mondo that are matched against the gold standard list

**Known Annotations:** This is the number of annotations from the gold standard

**Accurately Predicted:**  This is the number of annotations to test that are in the known annotations

**Additional Measures**

**Precision:** (Accurately Predicted / Annotations to Test) Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. A high precision indicates a low false positive rate. This is the key measure for us since our test dataset has all positive observations (i.e. a list of hpo codes we believe exist in the annotated text). *In other words, this is the percentage of the tested annotations that is correct.*

**Recall:** (Accurately Predicted / Known Annotations) Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class. *This is the percentage of known annotations that the annotator found.*

**F1 Score:** F1 Score is the harmonic mean of Precision and Recall: 2 × Precision × Recall / (Precision + Recall). Therefore, this score takes both false positives and false negatives into account. In other words, a high F1 score means that both the percentage of accurate tested annotations and the percentage of known annotations found are high.

**Confusion Matrix:**  True_positive, False_positive, False_Negative, and True_Negative make up the Confusion matrix for the scoring.
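As a worked check of these definitions, the sketch below recomputes the measures from the three counts. The function name `score_counts` is hypothetical and this is not the `scoring` function itself; it assumes false positives are the tested annotations without a match and false negatives are the known annotations without a match. The counts are the ones reported in the scoring outputs in this notebook.

```python
def score_counts(accurately_predicted, annotations_to_test, known_annotations):
    """Derive precision, recall, F1 and confusion-matrix cells from raw counts."""
    precision = accurately_predicted / annotations_to_test
    recall = accurately_predicted / known_annotations
    # F1 is the harmonic mean of precision and recall; guard against 0/0.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {
        "Precision": precision,
        "Recall": recall,
        "F1_Score": f1,
        "True_Positive": accurately_predicted,
        "False_Positive": annotations_to_test - accurately_predicted,
        "False_Negative": known_annotations - accurately_predicted,
    }

direct = score_counts(1, 13, 8)    # counts from the direct-matching example
relative = score_counts(2, 13, 8)  # counts from the relatives-matching example
print({name: round(value, 6) for name, value in direct.items()})
print({name: round(value, 6) for name, value in relative.items()})
```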


### Scoring Inputs and Outputs
**Input:** dataset constructed from one of our matching functions.

**Output:** Basic measurements for the success of the annotator in question.

In [38]:
scoring(direct_example)

Unnamed: 0,ScoringType,Annotations_to_Test,Known_Annotations,Accurately_Predicted,Precision,Recall,F1_Score,True_Positive,False_Positive,False_Negative,True_Negative
0,Direct,13,8,1,0.076923,0.125,0.095238,1,12,7,0


In [39]:
scoring(relative_example)

Unnamed: 0,ScoringType,Annotations_to_Test,Known_Annotations,Accurately_Predicted,Precision,Recall,F1_Score,True_Positive,False_Positive,False_Negative,True_Negative
0,Relative,13,8,2,0.153846,0.25,0.190476,2,11,6,0
