# Phase 3: Result Matching

In Phase 3, we match the organization names extracted by the selected Named Entity Recognition (NER) tools against those listed in the Crossref Funder Registry.


## NER Tools
The organization names are extracted from the funding statements using the following NER tools:

1) spaCy (en_core_web_sm)

2) spaCy (en_core_web_md)

3) spaCy (en_core_web_lg)

4) Flair (from Zalando Research)

All the NER tools have been trained on data in English. 

The details of the spaCy NER models are available here - https://spacy.io/models/en

The details of the Flair NER tool are available here - https://github.com/flairNLP/flair

## Crossref Funder Registry

The Crossref Funder Registry maintains a curated list of funding organizations that are acknowledged in research papers for their support of and contribution to research work. We have extracted all these names and arranged them alphabetically in a dictionary.

In [1]:
import timeit
t_0 = timeit.default_timer()

**1. Importing required libraries**

In [2]:
import numpy as np
import pandas as pd
import os
import csv
import re
import pickle
# module for extracting all the organization names from ".rdf" downloaded from -
# https://gitlab.com/crossref/open_funder_registry
import Crossref_funding_organization_extraction_dict_creation

**2. Importing NER tool results**

A pickle file containing the organization names extracted from the selected biomedical research papers is the first input to the result matching code. Each tool extracts a list of organizations from the funding statements in these research papers.

In [3]:
# input file path
ner_filepath = "../data/ack_ner.pickle"

# load the data pickle file
with open(ner_filepath, 'rb') as handle:
    ner_data = pickle.load(handle)

ner_data.drop(columns = ['index'], inplace = True)
ner_data.head(5)

Unnamed: 0,Article_Title,PMC_ID,DOI,acknowledgement,NER_Spacy (en_core_web_sm),NER_Spacy (en_core_web_md),NER_Spacy (en_core_web_lg),NER_Flair,NER_spacy_sm_org,NER_spacy_md_org,NER_spacy_lg_org,NER_Flair_org
0,Impact of antibiotics on the human microbiome ...,8756738,na,The authors were funded in part by Science Fou...,"{'ORG': ['Science Foundation Ireland'], 'GPE':...","{'ORG': ['Science Foundation Ireland'], 'GPE':...","{'ORG': ['Science Foundation Ireland'], 'GPE':...","{'ORG': ['Science Foundation Ireland', 'APC Mi...",[Science Foundation Ireland],[Science Foundation Ireland],[Science Foundation Ireland],"[Science Foundation Ireland, APC Microbiome Ir..."
1,Novel nitrite reductase domain structure sugge...,8756737,na,The authors thank Dr. Ranjani Murali for advic...,"{'PERSON': ['Ranjani Murali', 'Sarah L. Schwar...","{'PERSON': ['Ranjani Murali', 'Sarah L. Schwar...","{'PERSON': ['Ranjani Murali', 'Sarah L. Schwar...","{'PER': ['Ranjani Murali', 'Sarah L. Schwartz'...",[a National Defense Science and Engineering Gr...,[a National Defense Science and Engineering Gr...,[a National Defense Science and Engineering Gr...,
2,Efficacy of antifungal agents against fungal s...,8756736,na,This study was funded by Dr. Pfleger Arzneimit...,"{'PERSON': ['Pfleger Arzneimittel GmbH'], 'GPE...","{'ORG': ['Dr. Pfleger Arzneimittel GmbH', 'Pro...","{'ORG': ['Pfleger Arzneimittel GmbH', 'Projekt...","{'ORG': ['Dr. Pfleger Arzneimittel GmbH'], 'LO...",,"[Dr. Pfleger Arzneimittel GmbH, Projekt DEAL]","[Pfleger Arzneimittel GmbH, Projekt]",[Dr. Pfleger Arzneimittel GmbH]
3,Assembly and comparative analysis of the first...,8756732,10.1186/s12870-021-03416-5,Thanks to all the members of the Institute of ...,{'ORG': ['the Institute of Leisure Agriculture...,{'ORG': ['the Institute of Leisure Agriculture...,{'ORG': ['the Institute of Leisure Agriculture...,"{'ORG': ['Institute of Leisure Agriculture', '...","[the Institute of Leisure Agriculture, the Gen...","[the Institute of Leisure Agriculture, theÂ Mi...","[the Institute of Leisure Agriculture, theÂ, M...","[Institute of Leisure Agriculture, Genepioneer..."
4,Characterizing the effects of different chemic...,8756731,10.1186/s13007-021-00835-1,Not applicable. This work was supported by the...,"{'ORG': ['the National Research Foundation', '...","{'ORG': ['the National Research Foundation', '...","{'ORG': ['the National Research Foundation', '...","{'ORG': ['National Research Foundation', 'NRF'...","[the National Research Foundation, NRF, MSIT]","[the National Research Foundation, NRF]","[the National Research Foundation, NRF]","[National Research Foundation, NRF, MSIT]"


**3. Importing Crossref Funder Registry Results**

The Crossref Funder Registry maintains a list of organization names, which it stores in a ".rdf" file. We have implemented a module that extracts all the organization names and stores them in a dictionary keyed by the first letter of each name.
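As a rough illustration of this data structure, the sketch below groups a handful of names by their lower-cased first character. The helper `group_by_first_letter` and the sample names are hypothetical; the actual extraction module parses the ".rdf" file and may differ in detail.

```python
def group_by_first_letter(names):
    """Group names into a dict keyed by their lower-cased first character."""
    grouped = {}
    for name in names:
        if not name:
            continue  # skip empty strings, which have no first character
        grouped.setdefault(name[0].lower(), []).append(name)
    return grouped


# hypothetical sample names, for illustration only
sample_names = ["Boeing", "CDC", "Centers for Disease Control", "Department of Defense"]
sample_dict = group_by_first_letter(sample_names)
print(sample_dict)
```

Keying by first letter means a later lookup only has to search one bucket instead of the full list of names.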

In [4]:
# storing the funding organization data in a dictionary
orga_dict = Crossref_funding_organization_extraction_dict_creation.funder_dictionary_creation("../data/registry.rdf")

# printing the first 5 names for each key
for ele in orga_dict:
    print(ele, orga_dict[ele][:5])


a ['Administración de Alimentos y Medicamentos de los Estados Unidos', "American Parkinson's Disease Foundation", 'APDA', 'American Diabetes Association', 'Asociación Americana de la Diabetes']
b ['Boeing', 'Boeing Company', 'Boeing Co', 'Boeing Co.', 'Biological Sciences']
c ['Centers for Disease Control and Prevention', 'Centers for Disease Control & Prevention', 'Centros para el Control y la Prevención de Enfermedades', 'Centers for Disease Control', 'CDC']
d ['Department of Defense', 'DOD', "Department of the Navy's Office of Naval Research", 'David and Lucile Packard Foundation', 'David & Lucile Packard Foundation']
e ['Energy Department', 'ENERGY.GOV', 'El Instituto Nacional de Investigación Dental y Craneofacial', 'El Departamento de Justicia de EE. UU.', 'Education and Human Resources']
f ['Foundation for the National Institutes of Health', 'Foundation for the National Institutes of Health, Inc.', 'Foundation for the NIH', 'Foundation for NIH', 'FNIH']
g ['Graduate Education', 

We use a reference count when comparing the organization names matched between the Crossref Funder Registry and the NER tool output. This reference count is the number of unique organization names present in the Crossref Funder Registry.

In [5]:
ref_count = 0
for ele in orga_dict.values():
    ref_count+= len(ele)
print("The number of unique organization from the Crossref Funder Registry used for result matching:", ref_count)

The number of unique organization from the Crossref Funder Registry used for result matching: 96765
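The loop above sums the lengths of the per-letter lists, so a name that occurs more than once would be counted each time. If a strictly distinct count is needed, one option is to pool the names in a set first. The sketch below uses a small hypothetical dictionary (`toy_dict`) rather than the real registry, and makes no claim about whether the real registry contains repeats.

```python
# hypothetical toy data with one repeated name
toy_dict = {
    "c": ["CDC", "Centers for Disease Control", "CDC"],
    "d": ["DOD", "Department of Defense"],
}

# total number of list entries, i.e. what summing the list lengths gives
total_entries = sum(len(names) for names in toy_dict.values())

# number of distinct names across all keys
distinct_names = len({name for names in toy_dict.values() for name in names})

print(total_entries, distinct_names)
```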


In [6]:
print("The columns of the ner dataframe are:", ner_data.columns)

The columns of the ner dataframe are: Index(['Article_Title', 'PMC_ID', 'DOI', 'acknowledgement',
       'NER_Spacy (en_core_web_sm)', 'NER_Spacy (en_core_web_md)',
       'NER_Spacy (en_core_web_lg)', 'NER_Flair', 'NER_spacy_sm_org',
       'NER_spacy_md_org', 'NER_spacy_lg_org', 'NER_Flair_org'],
      dtype='object')


**4. Preprocessing the text to increase the match count**

We have developed 4 preprocessing functions, which are applied sequentially to the NER tool output and to the funder registry names. The functions are defined in the cells below.

In [7]:
def case_lowering(text : str) -> str:
    """
    Lower case strings in a text
    :param text: input raw text
    :return text_updated: lower_cased string
    """
    text_updated = text.lower()
    return text_updated

In [8]:
def det_removal(text : str) -> list:
    """
    Removing the determiners (a, an, the) from the beginning of the string
    :param text: input raw text
    :return updated_text: list containing the string with the leading determiner removed and its lower-cased first letter
    """
    det_list = ['a', 'an', 'the', 'The', 'A', 'An']
    words = text.split(" ")
    # strip the determiner only when other words follow it
    if len(words) > 1 and words[0] in det_list:
        text_updated = " ".join(words[1:])
    else:
        text_updated = text
    # guard against empty strings, which have no first letter
    first_letter = text_updated[0].lower() if text_updated else ""
    updated_text = [text_updated, first_letter]
    return updated_text

In [9]:
def and_replacement(text : str) -> str:
    """
    Replacing "and" with "&" in the text
    :param text: input raw text
    :return text_updated: text with 'and' replaced with '&' if any
    """
    # match the standalone word "and" (any casing) surrounded by spaces
    text_updated = re.sub(r" and ", " & ", text, flags=re.IGNORECASE)
    return text_updated

In [10]:
def punct_removal(text : str) -> str:
    """
    Removing punctuation marks from the text
    :param text: input raw text
    :return text_updated: text with punctuations and special symbols removed except '&'
    """
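    # the hyphen is escaped so it is matched literally rather than forming a range; '&' is deliberately kept
    regex = r"[!\"#\$%\'\(\)\*\+,\-\./:;<=>\?@\[\\\]\^_`{\|}~]"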
    text_updated = re.sub(regex, "", text)
    return text_updated
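To show how the steps combine, here is a minimal self-contained sketch that chains determiner removal, "and" replacement, and punctuation removal on one sample name. The helper `normalize` is hypothetical and simplified (for instance, it strips a leading determiner in any casing), so it is not a drop-in replacement for the functions above.

```python
import re


def normalize(name):
    """Apply determiner removal, 'and' replacement and punctuation removal in sequence."""
    # drop a leading determiner (a, an, the), case-insensitively
    name = re.sub(r"^(a|an|the) ", "", name, flags=re.IGNORECASE)
    # replace the standalone word "and" with "&"
    name = re.sub(r" and ", " & ", name, flags=re.IGNORECASE)
    # remove ASCII punctuation except "&"
    name = re.sub(r"[!\"#$%'()*+,\-./:;<=>?@\[\\\]^_`{|}~]", "", name)
    return name


print(normalize("The Bill and Melinda Gates Foundation, Inc."))
# prints: Bill & Melinda Gates Foundation Inc
```

The order matters: removing the determiner first ensures that the lookup key (the first letter) is taken from the organization name itself rather than from "the".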

**5. Result Matching**

We will now match the NER tool output against the Crossref Funder Registry, applying the preprocessing functions sequentially.

In [11]:
def text_processing(ner_data : pd.DataFrame, tool : str) -> list:
    """
    Arranging all the organization names identified by a specific tool into a list
    :param ner_data: data frame containing the organizations identified by each tool
    :param tool: name of the column holding the organization names extracted by the NER tool
    :return new_tool_org: list of all the extracted organization names
    """
    new_tool_org = []
    for ele in ner_data[tool]:
        # skip rows that do not hold a list (e.g. missing values)
        if isinstance(ele, list):
            new_tool_org += ele
    # drop empty and single-space entries; both conditions must hold
    new_tool_org = [ele for ele in new_tool_org if ele != " " and ele != ""]
    return new_tool_org

In [12]:
def result_matching(ner_tool_set, orga_dict):
    """Matching the organization names between the NER tool output and the Crossref Funder Registry
    :param ner_tool_set: set of all the organization names extracted by a specific NER tool
    :param orga_dict: dictionary containing names of all the funding organizations arranged alphabetically
    :return match_count: number of matches between the NER tool output and the Crossref Funder Registry
    :return match_result: list of all the matches between the NER tool output and the Crossref Funder Registry
    """
    # build one set per starting letter up front so each lookup is a fast membership test
    lookup = {key: set(names) for key, names in orga_dict.items()}
    match_result = []
    for ele in ner_tool_set:
        if ele != '' and ele[0].lower() in lookup:
            # exact, case-sensitive comparison within the matching letter bucket
            if ele in lookup[ele[0].lower()]:
                match_result.append(ele)
    match_count = len(match_result)
    return match_count, match_result
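The matching logic can be tried on toy data. The sketch below is a self-contained variant with hypothetical names (`count_matches`, `toy_registry`, `toy_extracted`); it assumes, as above, that a match means exact string equality within the bucket for the name's first letter.

```python
def count_matches(extracted_names, registry_by_letter):
    """Return the count and list of extracted names found verbatim in the registry."""
    # convert each bucket to a set once so that membership tests are fast
    lookup = {key: set(names) for key, names in registry_by_letter.items()}
    matches = [
        name for name in extracted_names
        if name and name in lookup.get(name[0].lower(), set())
    ]
    return len(matches), matches


# hypothetical toy data
toy_registry = {"n": ["National Research Foundation", "NRF"], "u": ["UKRI"]}
toy_extracted = ["NRF", "the National Research Foundation", "UKRI", "MSIT"]

toy_count, toy_matches = count_matches(toy_extracted, toy_registry)
print(toy_count, toy_matches)
```

Here "the National Research Foundation" does not match because its first letter sends it to the "t" bucket, which is the kind of miss that determiner removal is meant to address.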

In [13]:
# initializing dictionaries and data frame to store the results
result_match_count = {}
result_match = {}

result_df = pd.DataFrame(columns = ['NER_spacy_sm_org', 
                                    'NER_spacy_sm_org_match',
                                    'NER_spacy_sm_org_match_%', 
                                    'NER_spacy_md_org',
                                    'NER_spacy_md_org_match',
                                    'NER_spacy_md_org_match_%',
                                    'NER_spacy_lg_org', 
                                    'NER_spacy_lg_org_match',
                                    'NER_spacy_lg_org_match_%',
                                    'NER_Flair_org', 
                                    'NER_Flair_org_match',
                                    'NER_Flair_org_match_%'],
                        index = ['Crossref baseline vs NER baseline',
                                 'Crossref baseline vs NER improved version I',
                                 'Crossref baseline vs NER improved version II',
                                 'Crossref baseline vs NER improved version III',
                                 'Crossref baseline vs NER improved version IV',
                                 'NER baseline vs Crossref improved version I',
                                 'NER baseline vs Crossref improved version II',
                                 'NER baseline vs Crossref improved version III',
                                 'NER baseline vs Crossref improved version IV',
                                 'NER improved version I vs Crossref improved version I',
                                 'NER improved version II vs Crossref improved version II',
                                 'NER improved version III vs Crossref improved version III',
                                 'NER improved version IV vs Crossref improved version IV'   
                                ])

**I. NER Baseline vs Crossref Baseline**

We start with a baseline comparison. In both cases the baseline is the list of organization names as extracted, without any of the preprocessing functions applied (the matched names printed below keep their original casing).

The models compared here are:

1) NER baseline model

2) Crossref baseline model

In [14]:
result_match_count['Crossref baseline vs NER baseline'] = {}
result_match['Crossref baseline vs NER baseline'] = {}
tool_output = {}

tool_names = ner_data.columns[-4:]
for tool in tool_names:
    tool_output[tool] = set([ele for ele in text_processing(ner_data, tool)])
    match_count, match_result = result_matching(tool_output[tool],orga_dict)
    result_match_count['Crossref baseline vs NER baseline'][tool] = int(len(tool_output[tool]))
    result_match_count['Crossref baseline vs NER baseline'][tool+"_match"] = int(match_count)
    result_match_count['Crossref baseline vs NER baseline'][tool+"_match_%"] = match_count*100/len(tool_output[tool])
    result_match['Crossref baseline vs NER baseline'][tool] = match_result
    print("The match count between Crossref baseline model and " + tool + " baseline model :")
    print(match_count)

The match count between Crossref baseline model and NER_spacy_sm_org baseline model :
7178
The match count between Crossref baseline model and NER_spacy_md_org baseline model :
7432
The match count between Crossref baseline model and NER_spacy_lg_org baseline model :
7445
The match count between Crossref baseline model and NER_Flair_org baseline model :
11377


In [15]:
print(result_match['Crossref baseline vs NER baseline']['NER_spacy_sm_org'][:10])
print(result_match['Crossref baseline vs NER baseline']['NER_spacy_md_org'][:10])
print(result_match['Crossref baseline vs NER baseline']['NER_spacy_lg_org'][:10])
print(result_match['Crossref baseline vs NER baseline']['NER_Flair_org'][:10])

['AMR', 'CSR', 'Lahore University of Management Sciences', 'VolkswagenStiftung', 'Zhejiang Natural Science Foundation', 'SCA', 'KKF', 'Medical and Health Science and Technology Project of Zhejiang Province', 'China Association for Science and Technology', 'UKRI']
['AMR', 'Lahore University of Management Sciences', 'VolkswagenStiftung', 'Zhejiang Natural Science Foundation', 'SCA', "Scuola Superiore Sant'Anna", 'KKF', 'Medical and Health Science and Technology Project of Zhejiang Province', 'China Association for Science and Technology', 'UKRI']
['AMR', 'Lahore University of Management Sciences', 'VolkswagenStiftung', 'Zhejiang Natural Science Foundation', 'SCA', 'KKF', 'Medical and Health Science and Technology Project of Zhejiang Province', 'China Association for Science and Technology', 'Italian Association for Cancer Research', 'Hasselt University']
['AMR', 'Faculty of Health and Medical Sciences', 'Robert A. Welch Foundation', 'University of Nebraska Medical Center', 'Oregon Health

In [16]:
result_df.loc['Crossref baseline vs NER baseline'] = pd.Series(result_match_count['Crossref baseline vs NER baseline'])


In [17]:
ner_data

Unnamed: 0,Article_Title,PMC_ID,DOI,acknowledgement,NER_Spacy (en_core_web_sm),NER_Spacy (en_core_web_md),NER_Spacy (en_core_web_lg),NER_Flair,NER_spacy_sm_org,NER_spacy_md_org,NER_spacy_lg_org,NER_Flair_org
0,Impact of antibiotics on the human microbiome ...,8756738,na,The authors were funded in part by Science Fou...,"{'ORG': ['Science Foundation Ireland'], 'GPE':...","{'ORG': ['Science Foundation Ireland'], 'GPE':...","{'ORG': ['Science Foundation Ireland'], 'GPE':...","{'ORG': ['Science Foundation Ireland', 'APC Mi...",[Science Foundation Ireland],[Science Foundation Ireland],[Science Foundation Ireland],"[Science Foundation Ireland, APC Microbiome Ir..."
1,Novel nitrite reductase domain structure sugge...,8756737,na,The authors thank Dr. Ranjani Murali for advic...,"{'PERSON': ['Ranjani Murali', 'Sarah L. Schwar...","{'PERSON': ['Ranjani Murali', 'Sarah L. Schwar...","{'PERSON': ['Ranjani Murali', 'Sarah L. Schwar...","{'PER': ['Ranjani Murali', 'Sarah L. Schwartz'...",[a National Defense Science and Engineering Gr...,[a National Defense Science and Engineering Gr...,[a National Defense Science and Engineering Gr...,
2,Efficacy of antifungal agents against fungal s...,8756736,na,This study was funded by Dr. Pfleger Arzneimit...,"{'PERSON': ['Pfleger Arzneimittel GmbH'], 'GPE...","{'ORG': ['Dr. Pfleger Arzneimittel GmbH', 'Pro...","{'ORG': ['Pfleger Arzneimittel GmbH', 'Projekt...","{'ORG': ['Dr. Pfleger Arzneimittel GmbH'], 'LO...",,"[Dr. Pfleger Arzneimittel GmbH, Projekt DEAL]","[Pfleger Arzneimittel GmbH, Projekt]",[Dr. Pfleger Arzneimittel GmbH]
3,Assembly and comparative analysis of the first...,8756732,10.1186/s12870-021-03416-5,Thanks to all the members of the Institute of ...,{'ORG': ['the Institute of Leisure Agriculture...,{'ORG': ['the Institute of Leisure Agriculture...,{'ORG': ['the Institute of Leisure Agriculture...,"{'ORG': ['Institute of Leisure Agriculture', '...","[the Institute of Leisure Agriculture, the Gen...","[the Institute of Leisure Agriculture, theÂ Mi...","[the Institute of Leisure Agriculture, theÂ, M...","[Institute of Leisure Agriculture, Genepioneer..."
4,Characterizing the effects of different chemic...,8756731,10.1186/s13007-021-00835-1,Not applicable. This work was supported by the...,"{'ORG': ['the National Research Foundation', '...","{'ORG': ['the National Research Foundation', '...","{'ORG': ['the National Research Foundation', '...","{'ORG': ['National Research Foundation', 'NRF'...","[the National Research Foundation, NRF, MSIT]","[the National Research Foundation, NRF]","[the National Research Foundation, NRF]","[National Research Foundation, NRF, MSIT]"
...,...,...,...,...,...,...,...,...,...,...,...,...
65829,Selenium Induces Pancreatic Cancer Cell Death ...,8773897,na,"This work, in part was supported by Penn State...","{'ORG': ['Penn State Cancer Institute Funds'],...",{'ORG': ['Penn State Cancer Institute Funds']},"{'ORG': ['Penn State Cancer Institute Funds', ...","{'ORG': ['Penn State Cancer Institute'], 'LOC'...",[Penn State Cancer Institute Funds],[Penn State Cancer Institute Funds],"[Penn State Cancer Institute Funds, R.S.]",[Penn State Cancer Institute]
65830,Neuropsychiatric Manifestations of Antiphospho...,8773877,na,This research received no external funding.,{},{},{},{},,,,
65831,Systemic Delivery of mLIGHT-Armed Myxoma Virus...,8773855,na,"We thank Leslie Sharp, Stephen Potts, Kavita M...","{'PERSON': ['Leslie Sharp', 'Stephen Potts', '...","{'PERSON': ['Leslie Sharp', 'Stephen Potts', '...","{'PERSON': ['Leslie Sharp', 'Stephen Potts', '...","{'PER': ['Leslie Sharp', 'Stephen Potts', 'Kav...","[Arizona State University, an ASU PhD Completi...","[Arizona State University, ASU, G.M., NIH, G.M...","[an Arizona State University, ASU, G.M., ASU P...","[., Arizona State University, ASU, ASU, NIH]"
65832,STAT3 Enhances Sensitivity of Glioblastoma to ...,8773829,na,We thank Hildegard KÃ¶nig for excellent techni...,"{'PERSON': ['Hildegard KÃ¶nig'], 'ORG': ['the ...","{'PERSON': ['Hildegard KÃ¶nig'], 'ORG': ['the ...","{'PERSON': ['Hildegard KÃ¶nig', 'D.K.', 'D.K.'...","{'PER': ['Hildegard KÃ ¶ nig'], 'ORG': ['Smart...","[the Graphical Abstract, Smart Servier Medical...","[the Graphical Abstract, Smart Servier Medical...","[Smart Servier Medical Art., CC BY 3.0, the Ge...","[Smart Servier Medical Art, German Research Fo..."


**II. Improved NER vs Crossref Baseline**

We now sequentially compare the Crossref baseline model with improved versions of the NER tool output. The 4 improved versions of the results from each NER tool are as follows:

1) NER_I - 'det_removal' function applied

2) NER_II - 'and_replacement' function applied

3) NER_III - 'punct_removal' function applied

4) NER_IV - 'det_removal' + 'and_replacement' + 'punct_removal' functions applied
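One side effect of these variants is that the number of distinct extracted names can shrink, because names that differed only by a leading determiner or by punctuation collapse into one. This is consistent with the smaller totals reported for the improved versions in the results table. A minimal sketch with hypothetical sample names and a simplified helper (`strip_determiner`, not the notebook's `det_removal`):

```python
def strip_determiner(name):
    """Drop a leading 'a', 'an' or 'the' (any casing) if more text follows."""
    first, _, rest = name.partition(" ")
    return rest if first.lower() in {"a", "an", "the"} and rest else name


# hypothetical extracted names, for illustration only
extracted = [
    "the National Research Foundation",
    "National Research Foundation",
    "NRF",
    "the NRF",
]

baseline = set(extracted)
improved = {strip_determiner(name) for name in extracted}

print(len(baseline), len(improved))
```

Because the match percentage divides by the number of distinct names, a shrinking denominator and a changing match count both influence the reported percentage.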



In [18]:
result_match['Crossref baseline vs NER improved version I'] = {}
result_match['Crossref baseline vs NER improved version II'] = {}
result_match['Crossref baseline vs NER improved version III'] = {}
result_match['Crossref baseline vs NER improved version IV'] = {}

result_match_count['Crossref baseline vs NER improved version I'] = {}
result_match_count['Crossref baseline vs NER improved version II'] = {}
result_match_count['Crossref baseline vs NER improved version III'] = {}
result_match_count['Crossref baseline vs NER improved version IV'] = {}


#I
tool_output = {}
tool_names = ner_data.columns[-4:]
for tool in tool_names:
    NER_I = set([det_removal(ele)[0] for ele in text_processing(ner_data, tool) ])
    tool_output[tool] = NER_I
    match_count, match_result = result_matching(tool_output[tool],orga_dict)
    result_match_count['Crossref baseline vs NER improved version I'][tool] = int(len(tool_output[tool]))
    result_match_count['Crossref baseline vs NER improved version I'][tool+"_match"] = int(match_count)
    result_match_count['Crossref baseline vs NER improved version I'][tool+"_match_%"] = match_count*100/len(tool_output[tool])
    result_match['Crossref baseline vs NER improved version I'][tool] = match_result
    print("The match count between Crossref baseline model and " + tool + " improved version I :")
    print(match_count)
    
print("********************************************************************************************************")     
    
#II
tool_output = {}
tool_names = ner_data.columns[-4:]
for tool in tool_names:
    NER_II = set([and_replacement(ele) for ele in text_processing(ner_data, tool)])
    tool_output[tool] = NER_II
    match_count, match_result = result_matching(tool_output[tool],orga_dict)
    result_match_count['Crossref baseline vs NER improved version II'][tool] = int(len(tool_output[tool]))
    result_match_count['Crossref baseline vs NER improved version II'][tool+"_match"] = int(match_count)
    result_match_count['Crossref baseline vs NER improved version II'][tool+"_match_%"] = match_count*100/len(tool_output[tool])
    result_match['Crossref baseline vs NER improved version II'][tool] = match_result
    print("The match count between Crossref baseline model and " + tool + " improved version II :")
    print(match_count)

print("********************************************************************************************************")     
    
#III
tool_output = {}
tool_names = ner_data.columns[-4:]
for tool in tool_names:
    NER_III = set([punct_removal(ele) for ele in text_processing(ner_data, tool)])
    tool_output[tool] = NER_III
    match_count, match_result = result_matching(tool_output[tool],orga_dict)
    result_match_count['Crossref baseline vs NER improved version III'][tool] = int(len(tool_output[tool]))
    result_match_count['Crossref baseline vs NER improved version III'][tool+"_match"] = int(match_count)
    result_match_count['Crossref baseline vs NER improved version III'][tool+"_match_%"] = match_count*100/len(tool_output[tool])
    result_match['Crossref baseline vs NER improved version III'][tool] = match_result
    print("The match count between Crossref baseline model and " + tool + " improved version III :")
    print(match_count)

print("********************************************************************************************************")    
    
#IV
tool_output = {}
tool_names = ner_data.columns[-4:]
for tool in tool_names:
    NER_IV = set([punct_removal(and_replacement(det_removal(ele)[0])) for ele in text_processing(ner_data, tool)])
    tool_output[tool] = NER_IV
    match_count, match_result = result_matching(tool_output[tool],orga_dict)
    result_match_count['Crossref baseline vs NER improved version IV'][tool] = int(len(tool_output[tool]))
    result_match_count['Crossref baseline vs NER improved version IV'][tool+"_match"] = int(match_count)
    result_match_count['Crossref baseline vs NER improved version IV'][tool+"_match_%"] = match_count*100/len(tool_output[tool])
    result_match['Crossref baseline vs NER improved version IV'][tool] = match_result
    print("The match count between Crossref baseline model and " + tool + " improved version IV :")
    print(match_count)

The match count between Crossref baseline model and NER_spacy_sm_org improved version I :
9090
The match count between Crossref baseline model and NER_spacy_md_org improved version I :
9278
The match count between Crossref baseline model and NER_spacy_lg_org improved version I :
9315
The match count between Crossref baseline model and NER_Flair_org improved version I :
11303
********************************************************************************************************
The match count between Crossref baseline model and NER_spacy_sm_org improved version II :
6796
The match count between Crossref baseline model and NER_spacy_md_org improved version II :
7000
The match count between Crossref baseline model and NER_spacy_lg_org improved version II :
7023
The match count between Crossref baseline model and NER_Flair_org improved version II :
10576
********************************************************************************************************
The match count between Crossr

In [19]:
result_df.loc['Crossref baseline vs NER improved version I'] = pd.Series(result_match_count['Crossref baseline vs NER improved version I'])
result_df.loc['Crossref baseline vs NER improved version II'] = pd.Series(result_match_count['Crossref baseline vs NER improved version II'])
result_df.loc['Crossref baseline vs NER improved version III'] = pd.Series(result_match_count['Crossref baseline vs NER improved version III'])
result_df.loc['Crossref baseline vs NER improved version IV'] = pd.Series(result_match_count['Crossref baseline vs NER improved version IV'])



In [20]:
result_df

Unnamed: 0,NER_spacy_sm_org,NER_spacy_sm_org_match,NER_spacy_sm_org_match_%,NER_spacy_md_org,NER_spacy_md_org_match,NER_spacy_md_org_match_%,NER_spacy_lg_org,NER_spacy_lg_org_match,NER_spacy_lg_org_match_%,NER_Flair_org,NER_Flair_org_match,NER_Flair_org_match_%
Crossref baseline vs NER baseline,99161.0,7178.0,7.238733,103543.0,7432.0,7.177694,103245.0,7445.0,7.211003,76541.0,11377.0,14.863929
Crossref baseline vs NER improved version I,93599.0,9090.0,9.711642,97695.0,9278.0,9.496904,97531.0,9315.0,9.550809,75990.0,11303.0,14.874326
Crossref baseline vs NER improved version II,98993.0,6796.0,6.865132,103374.0,7000.0,6.771529,103089.0,7023.0,6.81256,76331.0,10576.0,13.855445
Crossref baseline vs NER improved version III,97489.0,7244.0,7.430582,101775.0,7468.0,7.337755,101472.0,7456.0,7.34784,74506.0,11296.0,15.161195
Crossref baseline vs NER improved version IV,91734.0,8443.0,9.203785,95733.0,8580.0,8.962427,95566.0,8588.0,8.98646,73738.0,10453.0,14.175866
NER baseline vs Crossref improved version I,,,,,,,,,,,,
NER baseline vs Crossref improved version II,,,,,,,,,,,,
NER baseline vs Crossref improved version III,,,,,,,,,,,,
NER baseline vs Crossref improved version IV,,,,,,,,,,,,
NER improved version I vs Crossref improved version I,,,,,,,,,,,,
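The percentage columns follow directly from the counts: matches × 100 / number of distinct extracted names. As a quick check, the sketch below recomputes the baseline percentages from the counts in the first row of the table; the variable names are illustrative only.

```python
# (distinct extracted names, matches) from the 'Crossref baseline vs NER baseline' row
baseline_counts = {
    "NER_spacy_sm_org": (99161, 7178),
    "NER_spacy_md_org": (103543, 7432),
    "NER_spacy_lg_org": (103245, 7445),
    "NER_Flair_org": (76541, 11377),
}

match_percent = {
    tool: matched * 100 / total
    for tool, (total, matched) in baseline_counts.items()
}

for tool, pct in match_percent.items():
    print(f"{tool}: {pct:.2f}%")
```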


**Applying the preprocessing functions to orga_dict**

We now apply the preprocessing functions to 'orga_dict', which contains the organization names from the Crossref Funder Registry. The 4 variations of orga_dict are as follows:


1) orga_dict_I - 'det_removal' function applied

2) orga_dict_II - 'and_replacement' function applied

3) orga_dict_III - 'punct_removal' function applied

4) orga_dict_IV - 'det_removal' + 'and_replacement' + 'punct_removal' functions applied

In [21]:
orga_dict_I = {}
orga_dict_II = {}
orga_dict_III = {}
orga_dict_IV = {}


for ele, names in orga_dict.items():
    for name in names:
        # det_removal returns [name without leading determiner, its lower-cased first letter]
        name_no_det, new_key = det_removal(name)

        # removing a determiner can change the first letter, so versions I and IV are re-keyed
        orga_dict_I.setdefault(new_key, []).append(name_no_det)

        # versions II and III keep the original key
        orga_dict_II.setdefault(ele, []).append(and_replacement(name))
        orga_dict_III.setdefault(ele, []).append(punct_removal(name))

        orga_dict_IV.setdefault(new_key, []).append(punct_removal(and_replacement(name_no_det)))
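The reason for re-keying after determiner removal can be seen on a toy example: once a leading "The" is stripped, the name belongs under a different starting letter. The sketch below uses hypothetical sample entries and a simplified inline determiner check, so it only illustrates the idea.

```python
# hypothetical sample entries filed under "t" before determiner removal
toy_dict = {"t": ["The Wellcome Trust", "Thrasher Research Fund"]}

rekeyed = {}
for names in toy_dict.values():
    for name in names:
        # simplified determiner removal: drop a leading "the " in any casing
        stripped = name[4:] if name.lower().startswith("the ") else name
        # file the name under the first letter of the stripped form
        rekeyed.setdefault(stripped[0].lower(), []).append(stripped)

print(rekeyed)
```

Without re-keying, "Wellcome Trust" would stay in the "t" bucket and a lookup starting from the letter "w" would never find it.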

**III. NER Baseline vs Improved Crossref**

We now sequentially compare the NER baseline model with the improved versions of the Crossref Funder Registry names.

In [22]:
result_match['NER baseline vs Crossref improved version I'] = {}
result_match['NER baseline vs Crossref improved version II'] = {}
result_match['NER baseline vs Crossref improved version III'] = {}
result_match['NER baseline vs Crossref improved version IV'] = {}

result_match_count['NER baseline vs Crossref improved version I'] = {}
result_match_count['NER baseline vs Crossref improved version II'] = {}
result_match_count['NER baseline vs Crossref improved version III'] = {}
result_match_count['NER baseline vs Crossref improved version IV'] = {}




#I
tool_output = {}
tool_names = ner_data.columns[-4:]
for tool in tool_names:
    tool_output[tool] = set([ele for ele in text_processing(ner_data, tool)])
    match_count, match_result = result_matching(tool_output[tool],orga_dict_I)
    result_match_count['NER baseline vs Crossref improved version I'][tool] = int(len(tool_output[tool]))
    result_match_count['NER baseline vs Crossref improved version I'][tool+"_match"] = int(match_count)
    result_match_count['NER baseline vs Crossref improved version I'][tool+"_match_%"] = match_count*100/len(tool_output[tool])
    result_match['NER baseline vs Crossref improved version I'][tool] = match_result
    print("The match count between" + tool +  "baseline model and Crossref improved version I :")
    print(match_count)
    
print("********************************************************************************************************")

#II
tool_output = {}
tool_names = ner_data.columns[-4:]
for tool in tool_names:
    tool_output[tool] = set([ele for ele in text_processing(ner_data, tool)])
    match_count, match_result = result_matching(tool_output[tool],orga_dict_II)
    result_match_count['NER baseline vs Crossref improved version II'][tool] = int(len(tool_output[tool]))
    result_match_count['NER baseline vs Crossref improved version II'][tool+"_match"] = int(match_count)
    result_match_count['NER baseline vs Crossref improved version II'][tool+"_match_%"] = match_count*100/len(tool_output[tool])
    result_match['NER baseline vs Crossref improved version II'][tool] = match_result
    print("The match count between" + tool +  "baseline model and Crossref improved version II :")
    print(match_count)
    
print("********************************************************************************************************")

#III
tool_output = {}
tool_names = ner_data.columns[-4:]
for tool in tool_names:
    tool_output[tool] = set([ele for ele in text_processing(ner_data, tool)])
    match_count, match_result = result_matching(tool_output[tool],orga_dict_III)
    result_match_count['NER baseline vs Crossref improved version III'][tool] = int(len(tool_output[tool]))
    result_match_count['NER baseline vs Crossref improved version III'][tool+"_match"] = int(match_count)
    result_match_count['NER baseline vs Crossref improved version III'][tool+"_match_%"] = match_count*100/len(tool_output[tool])
    result_match['NER baseline vs Crossref improved version III'][tool] = match_result
    print("The match count between " + tool + " baseline model and Crossref improved version III :")
    print(match_count)
    
print("********************************************************************************************************")

#IV
tool_output = {}
tool_names = ner_data.columns[-4:]
for tool in tool_names:
    tool_output[tool] = set([ele for ele in text_processing(ner_data, tool)])
    match_count, match_result = result_matching(tool_output[tool],orga_dict_IV)
    result_match_count['NER baseline vs Crossref improved version IV'][tool] = int(len(tool_output[tool]))
    result_match_count['NER baseline vs Crossref improved version IV'][tool+"_match"] = int(match_count)
    result_match_count['NER baseline vs Crossref improved version IV'][tool+"_match_%"] = match_count*100/len(tool_output[tool])
    result_match['NER baseline vs Crossref improved version IV'][tool] = match_result
    print("The match count between " + tool + " baseline model and Crossref improved version IV :")
    print(match_count)

The match count between NER_spacy_sm_org baseline model and Crossref improved version I :
7066
The match count between NER_spacy_md_org baseline model and Crossref improved version I :
7320
The match count between NER_spacy_lg_org baseline model and Crossref improved version I :
7337
The match count between NER_Flair_org baseline model and Crossref improved version I :
11286
********************************************************************************************************
The match count between NER_spacy_sm_org baseline model and Crossref improved version II :
6751
The match count between NER_spacy_md_org baseline model and Crossref improved version II :
6948
The match count between NER_spacy_lg_org baseline model and Crossref improved version II :
6966
The match count between NER_Flair_org baseline model and Crossref improved version II :
10455
********************************************************************************************************
The match count between NER_spacy_sm_org baselin

In [23]:
result_df.loc['NER baseline vs Crossref improved version I'] = pd.Series(result_match_count['NER baseline vs Crossref improved version I'])
result_df.loc['NER baseline vs Crossref improved version II'] = pd.Series(result_match_count['NER baseline vs Crossref improved version II'])
result_df.loc['NER baseline vs Crossref improved version III'] = pd.Series(result_match_count['NER baseline vs Crossref improved version III'])
result_df.loc['NER baseline vs Crossref improved version IV'] = pd.Series(result_match_count['NER baseline vs Crossref improved version IV'])


In [24]:
result_df

Unnamed: 0,NER_spacy_sm_org,NER_spacy_sm_org_match,NER_spacy_sm_org_match_%,NER_spacy_md_org,NER_spacy_md_org_match,NER_spacy_md_org_match_%,NER_spacy_lg_org,NER_spacy_lg_org_match,NER_spacy_lg_org_match_%,NER_Flair_org,NER_Flair_org_match,NER_Flair_org_match_%
Crossref baseline vs NER baseline,99161.0,7178.0,7.238733,103543.0,7432.0,7.177694,103245.0,7445.0,7.211003,76541.0,11377.0,14.863929
Crossref baseline vs NER improved version I,93599.0,9090.0,9.711642,97695.0,9278.0,9.496904,97531.0,9315.0,9.550809,75990.0,11303.0,14.874326
Crossref baseline vs NER improved version II,98993.0,6796.0,6.865132,103374.0,7000.0,6.771529,103089.0,7023.0,6.81256,76331.0,10576.0,13.855445
Crossref baseline vs NER improved version III,97489.0,7244.0,7.430582,101775.0,7468.0,7.337755,101472.0,7456.0,7.34784,74506.0,11296.0,15.161195
Crossref baseline vs NER improved version IV,91734.0,8443.0,9.203785,95733.0,8580.0,8.962427,95566.0,8588.0,8.98646,73738.0,10453.0,14.175866
NER baseline vs Crossref improved version I,99161.0,7066.0,7.125785,103543.0,7320.0,7.069527,103245.0,7337.0,7.106397,76541.0,11286.0,14.745039
NER baseline vs Crossref improved version II,99161.0,6751.0,6.80812,103543.0,6948.0,6.710256,103245.0,6966.0,6.747058,76541.0,10455.0,13.659346
NER baseline vs Crossref improved version III,99161.0,6954.0,7.012838,103543.0,7170.0,6.924659,103245.0,7170.0,6.944646,76541.0,11143.0,14.558211
NER baseline vs Crossref improved version IV,99161.0,6440.0,6.494489,103543.0,6608.0,6.38189,103245.0,6618.0,6.409996,76541.0,10144.0,13.253028
NER improved version I vs Crossref improved version I,,,,,,,,,,,,


**IV. Improved NER vs Improved Crossref**

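Finally, we compare the improved NER results with the improved Crossref dictionaries, pairing each NER improvement with the corresponding Crossref improvement (version I with I, II with II, III with III and IV with IV).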
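
To illustrate why applying the same clean-up to both sides can raise the match count, here is a minimal sketch on invented toy names. The `toy_*` helpers are hypothetical stand-ins written for this example; they are not the notebook's `det_removal`, `and_replacement`, `punct_removal` or `result_matching`, whose exact behaviour may differ. In particular, rewriting "&" as "and" and matching on exact string equality are assumptions.

```python
import re
import string

# Hypothetical stand-ins for illustration only; NOT the notebook's helpers.
def toy_strip_determiner(name):
    """Drop a leading 'the ' (case-insensitive)."""
    return re.sub(r"^\s*the\s+", "", name, flags=re.IGNORECASE)

def toy_replace_ampersand(name):
    """Assumption: '&' is rewritten as 'and'."""
    return re.sub(r"\s*&\s*", " and ", name)

def toy_strip_punctuation(name):
    """Remove ASCII punctuation and collapse repeated whitespace."""
    no_punct = name.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())

def toy_normalize(name):
    """Chain the three steps in the same order as version IV."""
    return toy_strip_punctuation(toy_replace_ampersand(toy_strip_determiner(name)))

def toy_match(ner_names, registry_names):
    """Assumption: a match is exact string equality."""
    return set(ner_names) & set(registry_names)

# Invented toy data, not taken from the NER output or the registry.
ner_names = {"the Example Trust", "Research & Innovation Fund", "U.S. Example Agency"}
registry_names = {"Example Trust", "Research and Innovation Fund", "U.S. Example Agency"}

raw_matches = toy_match(ner_names, registry_names)
normalized_matches = toy_match(
    {toy_normalize(n) for n in ner_names},
    {toy_normalize(n) for n in registry_names},
)
print(len(raw_matches), len(normalized_matches))
```

In this toy case only one name matches before normalization, while all three match once both sides are cleaned in the same way.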

In [31]:
result_match['NER improved version I vs Crossref improved version I'] = {}
result_match['NER improved version II vs Crossref improved version II'] = {}
result_match['NER improved version III vs Crossref improved version III'] = {}
result_match['NER improved version IV vs Crossref improved version IV'] = {}

result_match_count['NER improved version I vs Crossref improved version I'] = {}
result_match_count['NER improved version II vs Crossref improved version II'] = {}
result_match_count['NER improved version III vs Crossref improved version III'] = {}
result_match_count['NER improved version IV vs Crossref improved version IV'] = {}



# Each NER clean-up step is paired with the Crossref dictionary built with the same step:
# I: det_removal, II: and_replacement, III: punct_removal, IV: all three combined
improvements = [
    ("I", lambda ele: det_removal(ele)[0], orga_dict_I),
    ("II", and_replacement, orga_dict_II),
    ("III", punct_removal, orga_dict_III),
    ("IV", lambda ele: punct_removal(and_replacement(det_removal(ele)[0])), orga_dict_IV),
]

tool_names = ner_data.columns[-4:]
for position, (version, clean, orga_dict) in enumerate(improvements):
    key = 'NER improved version ' + version + ' vs Crossref improved version ' + version
    if position > 0:
        # separator between the blocks of output for two versions
        print("********************************************************************************************************")
    tool_output = {}
    for tool in tool_names:
        # unique organization names for this tool after the clean-up step
        tool_output[tool] = {clean(ele) for ele in text_processing(ner_data, tool)}
        match_count, match_result = result_matching(tool_output[tool], orga_dict)
        result_match_count[key][tool] = len(tool_output[tool])
        result_match_count[key][tool+"_match"] = int(match_count)
        result_match_count[key][tool+"_match_%"] = match_count*100/len(tool_output[tool])
        result_match[key][tool] = match_result
        print("The match count between " + tool + " improved version " + version + " and Crossref improved version " + version + " :")
        print(match_count)

The match count between NER_spacy_sm_org improved version I and Crossref improved version I :
9134
The match count between NER_spacy_md_org improved version I and Crossref improved version I :
9328
The match count between NER_spacy_lg_org improved version I and Crossref improved version I :
9369
The match count between NER_Flair_org improved version I and Crossref improved version I :
11360
********************************************************************************************************
The match count between NER_spacy_sm_org improved version II and Crossref improved version II :
7190
The match count between NER_spacy_md_org improved version II and Crossref improved version II :
7443
The match count between NER_spacy_lg_org improved version II and Crossref improved version II :
7459
The match count between NER_Flair_org improved version II and Crossref improved version II :
11379
The match 

In [32]:
result_df.loc['NER improved version I vs Crossref improved version I'] = pd.Series(result_match_count['NER improved version I vs Crossref improved version I'])
result_df.loc['NER improved version II vs Crossref improved version II'] = pd.Series(result_match_count['NER improved version II vs Crossref improved version II'])
result_df.loc['NER improved version III vs Crossref improved version III'] = pd.Series(result_match_count['NER improved version III vs Crossref improved version III'])
result_df.loc['NER improved version IV vs Crossref improved version IV'] = pd.Series(result_match_count['NER improved version IV vs Crossref improved version IV'])


In [33]:
result_df

Unnamed: 0,NER_spacy_sm_org,NER_spacy_sm_org_match,NER_spacy_sm_org_match_%,NER_spacy_md_org,NER_spacy_md_org_match,NER_spacy_md_org_match_%,NER_spacy_lg_org,NER_spacy_lg_org_match,NER_spacy_lg_org_match_%,NER_Flair_org,NER_Flair_org_match,NER_Flair_org_match_%
Crossref baseline vs NER baseline,99161.0,7178.0,7.238733,103543.0,7432.0,7.177694,103245.0,7445.0,7.211003,76541.0,11377.0,14.863929
Crossref baseline vs NER improved version I,93599.0,9090.0,9.711642,97695.0,9278.0,9.496904,97531.0,9315.0,9.550809,75990.0,11303.0,14.874326
Crossref baseline vs NER improved version II,98993.0,6796.0,6.865132,103374.0,7000.0,6.771529,103089.0,7023.0,6.81256,76331.0,10576.0,13.855445
Crossref baseline vs NER improved version III,97489.0,7244.0,7.430582,101775.0,7468.0,7.337755,101472.0,7456.0,7.34784,74506.0,11296.0,15.161195
Crossref baseline vs NER improved version IV,91734.0,8443.0,9.203785,95733.0,8580.0,8.962427,95566.0,8588.0,8.98646,73738.0,10453.0,14.175866
NER baseline vs Crossref improved version I,99161.0,7066.0,7.125785,103543.0,7320.0,7.069527,103245.0,7337.0,7.106397,76541.0,11286.0,14.745039
NER baseline vs Crossref improved version II,99161.0,6751.0,6.80812,103543.0,6948.0,6.710256,103245.0,6966.0,6.747058,76541.0,10455.0,13.659346
NER baseline vs Crossref improved version III,99161.0,6954.0,7.012838,103543.0,7170.0,6.924659,103245.0,7170.0,6.944646,76541.0,11143.0,14.558211
NER baseline vs Crossref improved version IV,99161.0,6440.0,6.494489,103543.0,6608.0,6.38189,103245.0,6618.0,6.409996,76541.0,10144.0,13.253028
NER improved version I vs Crossref improved version I,93599.0,9134.0,9.758651,97695.0,9328.0,9.548083,97531.0,9369.0,9.606176,75990.0,11360.0,14.949335
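
The `_match_%` columns are computed as `match_count*100/len(tool_output[tool])`, so they depend on both the number of matches and the number of unique extracted names. The sketch below recomputes the percentages for two rows of the table above and the gain in percentage points between them. The counts are copied from the table; the variable names are chosen only for this example.

```python
import pandas as pd

baseline = "Crossref baseline vs NER baseline"
improved = "NER improved version I vs Crossref improved version I"
tools = ["NER_spacy_sm_org", "NER_spacy_md_org", "NER_spacy_lg_org", "NER_Flair_org"]

# Unique extracted names per tool (copied from the table above).
extracted = pd.DataFrame(
    [[99161, 103543, 103245, 76541], [93599, 97695, 97531, 75990]],
    index=[baseline, improved],
    columns=tools,
)
# Names that matched a Crossref entry (copied from the table above).
matched = pd.DataFrame(
    [[7178, 7432, 7445, 11377], [9134, 9328, 9369, 11360]],
    index=[baseline, improved],
    columns=tools,
)

# Same formula as in the notebook: matches * 100 / unique extracted names.
match_pct = matched * 100 / extracted
# Gain of the improved setup over the baseline, in percentage points.
gain = match_pct.loc[improved] - match_pct.loc[baseline]

print(match_pct.round(6))
print(gain.round(3))
```

For these two rows the spaCy models gain noticeably more than Flair. For Flair the match count drops slightly (11377 to 11360) while the percentage still rises, because the number of unique extracted names shrinks as well.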


**6. Storing the results in a dataframe**

In [34]:
t_1 = timeit.default_timer()
print("The time elapsed: ", t_1 - t_0)

The time elapsed:  2726.080422


In [35]:
result_df.to_csv('../data/Result_Match.csv')
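
To read the stored results back later, the comparison names need to be restored as the row index. A minimal sketch, using an in-memory buffer as a stand-in for `'../data/Result_Match.csv'` and a one-row frame with values copied from the table above:

```python
import io
import pandas as pd

# One row with two columns copied from the results table; demo_df is a
# hypothetical stand-in for result_df.
demo_df = pd.DataFrame(
    {"NER_spacy_sm_org": [99161.0], "NER_spacy_sm_org_match": [7178.0]},
    index=["Crossref baseline vs NER baseline"],
)

# Write to an in-memory buffer instead of the real file on disk.
buffer = io.StringIO()
demo_df.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the comparison names as the row index instead of
# leaving them in an 'Unnamed: 0' column.
reloaded = pd.read_csv(buffer, index_col=0)
print(reloaded)
```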

**7. Future work**

1. Randomly sample the tool output and analyze subword matches, where an extracted name appears as a substring of an organization name. Check the results against the original articles.

2. Try out new NER tools.

3. Train our own NER tool.
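
The first item could start from something like the sketch below: draw a reproducible random sample of extracted names and list the registry names that contain them, or are contained in them. All names and data here are invented for illustration, and `substring_candidates` is a hypothetical helper, not part of the notebook.

```python
import random

def substring_candidates(extracted_name, registry_names):
    """Registry names that contain, or are contained in, the extracted name."""
    needle = extracted_name.lower()
    return [
        name for name in registry_names
        if needle in name.lower() or name.lower() in needle
    ]

# Invented toy data, not taken from the NER output or the registry.
unmatched_names = [
    "Example",
    "Research and Innovation Fund of Examplestan",
    "Innovation Fund",
    "XYZ Lab",
]
registry_names = ["Example Trust", "Research and Innovation Fund"]

rng = random.Random(0)  # fixed seed so the sample is reproducible
sample = rng.sample(unmatched_names, k=3)
for name in sample:
    print(name, "->", substring_candidates(name, registry_names))
```

The candidates found this way would still need a manual check against the original articles, since a short extracted name can be a substring of many unrelated organizations.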

