# Phase 3: Result Matching

In Phase 3, we match the organization names extracted by the selected Named Entity Recognition (NER) tools against those listed in the Crossref Funder Registry.


## NER Tools
The organization names are extracted from the funding statements using the following NER tools:

1) spaCy (en_core_web_sm)

2) spaCy (en_core_web_md)

3) spaCy (en_core_web_lg)

4) Flair (from Zalando Research)

All the NER tools have been trained on English-language data.

Details of the spaCy NER models are available at https://spacy.io/models/en

Details of the Flair NER models are available at https://github.com/flairNLP/flair

## Crossref Funder Registry

The Crossref Funder Registry maintains a curated list of funding organizations that are acknowledged in research papers for their support of and contribution to research work. We have extracted all these names and stored them in a dictionary keyed by the first letter of each name.
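
As a rough illustration of this layout, the sketch below groups a few names under their first character. The helper name `group_by_first_letter` and the sample names are hypothetical; the extraction module actually used in this notebook is not reproduced here.

```python
from collections import defaultdict

def group_by_first_letter(names):
    """Group lower-cased names under their first character (hypothetical helper)."""
    grouped = defaultdict(list)
    for name in names:
        name = name.strip().lower()
        if name:  # empty strings have no first character to use as a key
            grouped[name[0]].append(name)
    # return a plain dict whose keys are in alphabetical order
    return {key: grouped[key] for key in sorted(grouped)}

sample_names = ["Department of Energy", "Boeing", "DOE", "Amgen Foundation"]
sample_dict = group_by_first_letter(sample_names)
print(sample_dict)
```

Keying on the first character means a later lookup only has to search the names that share that character, not the whole registry.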

In [1]:
import timeit
t_0 = timeit.default_timer()

**1. Importing required libraries**

In [2]:
import numpy as np
import pandas as pd
import os
import csv
import re
import pickle
# module for extracting all the organization names from ".rdf" downloaded from -
# https://gitlab.com/crossref/open_funder_registry
import Crossref_funding_organization_extraction_dict_creation

**2. Importing NER tool results**

A pickle file containing the organization names extracted from the selected biomedical research papers is the first input to the result-matching code. Each tool extracts a list of organizations from the funding statements in these research papers.

In [3]:
# input file path
ner_filepath = "../data/ack_ner.pickle"

# load the data pickle file
with open(ner_filepath, 'rb') as handle:
    ner_data = pickle.load(handle)

ner_data.drop(columns = ['index'], inplace = True)
ner_data.head()

Unnamed: 0,Article_Title,PMC_ID,DOI,acknowledgement,NER_Spacy (en_core_web_sm),NER_Spacy (en_core_web_md),NER_Spacy (en_core_web_lg),NER_Flair,NER_spacy_sm_org,NER_spacy_md_org,NER_spacy_lg_org,NER_Flair_org
0,Impact of antibiotics on the human microbiome ...,8756738,na,The authors were funded in part by Science Fou...,"{'ORG': ['Science Foundation Ireland'], 'GPE':...","{'ORG': ['Science Foundation Ireland'], 'GPE':...","{'ORG': ['Science Foundation Ireland'], 'GPE':...","{'ORG': ['Science Foundation Ireland', 'APC Mi...",[Science Foundation Ireland],[Science Foundation Ireland],[Science Foundation Ireland],"[Science Foundation Ireland, APC Microbiome Ir..."
1,Novel nitrite reductase domain structure sugge...,8756737,na,The authors thank Dr. Ranjani Murali for advic...,"{'PERSON': ['Ranjani Murali', 'Sarah L. Schwar...","{'PERSON': ['Ranjani Murali', 'Sarah L. Schwar...","{'PERSON': ['Ranjani Murali', 'Sarah L. Schwar...","{'PER': ['Ranjani Murali', 'Sarah L. Schwartz'...",[a National Defense Science and Engineering Gr...,[a National Defense Science and Engineering Gr...,[a National Defense Science and Engineering Gr...,
2,Efficacy of antifungal agents against fungal s...,8756736,na,This study was funded by Dr. Pfleger Arzneimit...,"{'PERSON': ['Pfleger Arzneimittel GmbH'], 'GPE...","{'ORG': ['Dr. Pfleger Arzneimittel GmbH', 'Pro...","{'ORG': ['Pfleger Arzneimittel GmbH', 'Projekt...","{'ORG': ['Dr. Pfleger Arzneimittel GmbH'], 'LO...",,"[Dr. Pfleger Arzneimittel GmbH, Projekt DEAL]","[Pfleger Arzneimittel GmbH, Projekt]",[Dr. Pfleger Arzneimittel GmbH]
3,Assembly and comparative analysis of the first...,8756732,10.1186/s12870-021-03416-5,Thanks to all the members of the Institute of ...,{'ORG': ['the Institute of Leisure Agriculture...,{'ORG': ['the Institute of Leisure Agriculture...,{'ORG': ['the Institute of Leisure Agriculture...,"{'ORG': ['Institute of Leisure Agriculture', '...","[the Institute of Leisure Agriculture, the Gen...","[the Institute of Leisure Agriculture, theÂ Mi...","[the Institute of Leisure Agriculture, theÂ, M...","[Institute of Leisure Agriculture, Genepioneer..."
4,Characterizing the effects of different chemic...,8756731,10.1186/s13007-021-00835-1,Not applicable. This work was supported by the...,"{'ORG': ['the National Research Foundation', '...","{'ORG': ['the National Research Foundation', '...","{'ORG': ['the National Research Foundation', '...","{'ORG': ['National Research Foundation', 'NRF'...","[the National Research Foundation, NRF, MSIT]","[the National Research Foundation, NRF]","[the National Research Foundation, NRF]","[National Research Foundation, NRF, MSIT]"


**3. Importing Crossref Funder Registry Results**

The Crossref Funder Registry maintains a list of organization names, which it stores in a ".rdf" file. We have implemented a module that extracts all the organization names and stores them in a dictionary whose keys are the first letter of each name.

In [4]:
# storing the funding organization data in a dictionary
orga_dict = Crossref_funding_organization_extraction_dict_creation.funder_dictionary_creation("../data/registry.rdf")

# printing the first 10 names for each key
for ele in orga_dict:
    print(ele, orga_dict[ele][:10])


a ['administración de alimentos y medicamentos de los estados unidos', "american parkinson's disease foundation", 'apda', 'american diabetes association', 'asociación americana de la diabetes', 'ada', 'amgen foundation', 'amgen foundation, inc.', 'amgen foundation inc', 'american association for cancer research']
b ['boeing', 'boeing company', 'boeing co', 'boeing co.', 'biological sciences (bio)', 'bio', 'bio/oad', 'bio/mcb', 'biological infrastructure', 'bio/dbi']
c ['centers for disease control and prevention', 'centers for disease control & prevention', 'centros para el control y la prevención de enfermedades', 'centers for disease control', 'cdc', 'computer and information science and engineering', 'cise', 'cise/oad', 'congressionally directed medical research programs', 'cdmrp']
d ['department of defense', 'dod', "department of the navy's office of naval research", 'david and lucile packard foundation', 'david & lucile packard foundation', 'dlpf', 'department of energy', 'doe', '
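
The module that performs the extraction step is not shown in this notebook. The following is only a sketch of how names could be pulled out of such a file with the standard library, assuming the labels are stored as SKOS-XL `literalForm` elements. The namespace URI, the tiny inline sample and the helper `extract_names` are assumptions for illustration and should be checked against the downloaded `registry.rdf`.

```python
import xml.etree.ElementTree as ET

# Assumed SKOS-XL namespace; verify it against the real registry file.
SKOSXL = "{http://www.w3.org/2008/05/skos-xl#}"

# Hand-written sample that mimics the assumed structure (not real registry data).
sample_rdf = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#"
         xmlns:skosxl="http://www.w3.org/2008/05/skos-xl#">
  <skos:Concept rdf:about="http://example.org/funder/1">
    <skosxl:prefLabel>
      <skosxl:Label>
        <skosxl:literalForm xml:lang="en">Department of Energy</skosxl:literalForm>
      </skosxl:Label>
    </skosxl:prefLabel>
    <skosxl:altLabel>
      <skosxl:Label>
        <skosxl:literalForm xml:lang="en">DOE</skosxl:literalForm>
      </skosxl:Label>
    </skosxl:altLabel>
  </skos:Concept>
</rdf:RDF>"""

def extract_names(rdf_text):
    """Return every non-empty literalForm value, lower-cased, in document order."""
    root = ET.fromstring(rdf_text)
    return [node.text.strip().lower()
            for node in root.iter(SKOSXL + "literalForm")
            if node.text and node.text.strip()]

extracted = extract_names(sample_rdf)
print(extracted)
```

Collecting both preferred and alternative labels is what would let abbreviations such as 'doe' appear next to the full name in the dictionary.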

We will use a reference count to compare the number of organization names that match between the Crossref Funder Registry and the NER tool output. This reference count is the number of unique organization names present in the Crossref Funder Registry.

In [5]:
ref_count = 0
for ele in orga_dict.values():
    ref_count+= len(ele)
print("The number of unique organization from the Crossref Funder Registry used for result matching:", ref_count)

The number of unique organization from the Crossref Funder Registry used for result matching: 94226
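
Summing the list lengths gives the number of unique names only if no name appears twice in the lists; the notebook does not state whether that holds. A small sketch of how the two figures could be compared, using a hypothetical toy dictionary in place of `orga_dict`:

```python
# Hypothetical toy data; 'doe' is deliberately listed twice.
toy_dict = {
    "b": ["boeing", "boeing co", "bio"],
    "d": ["doe", "doe", "department of energy"],
}

# total number of entries, as computed by summing list lengths
total_entries = sum(len(names) for names in toy_dict.values())

# number of distinct names across all keys
unique_entries = len({name for names in toy_dict.values() for name in names})

print(total_entries, unique_entries)
```

If the two numbers differ on the real data, the smaller one is the count of unique names.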


In [40]:
print("The columns of the ner dataframe are:", ner_data.columns)
ner_data.head()

The columns of the ner dataframe are: Index(['Article_Title', 'PMC_ID', 'DOI', 'acknowledgement',
       'NER_Spacy (en_core_web_sm)', 'NER_Spacy (en_core_web_md)',
       'NER_Spacy (en_core_web_lg)', 'NER_Flair', 'NER_spacy_sm_org',
       'NER_spacy_md_org', 'NER_spacy_lg_org', 'NER_Flair_org'],
      dtype='object')


Unnamed: 0,Article_Title,PMC_ID,DOI,acknowledgement,NER_Spacy (en_core_web_sm),NER_Spacy (en_core_web_md),NER_Spacy (en_core_web_lg),NER_Flair,NER_spacy_sm_org,NER_spacy_md_org,NER_spacy_lg_org,NER_Flair_org
0,Impact of antibiotics on the human microbiome ...,8756738,na,The authors were funded in part by Science Fou...,"{'ORG': ['Science Foundation Ireland'], 'GPE':...","{'ORG': ['Science Foundation Ireland'], 'GPE':...","{'ORG': ['Science Foundation Ireland'], 'GPE':...","{'ORG': ['Science Foundation Ireland', 'APC Mi...",[Science Foundation Ireland],[Science Foundation Ireland],[Science Foundation Ireland],"[Science Foundation Ireland, APC Microbiome Ir..."
1,Novel nitrite reductase domain structure sugge...,8756737,na,The authors thank Dr. Ranjani Murali for advic...,"{'PERSON': ['Ranjani Murali', 'Sarah L. Schwar...","{'PERSON': ['Ranjani Murali', 'Sarah L. Schwar...","{'PERSON': ['Ranjani Murali', 'Sarah L. Schwar...","{'PER': ['Ranjani Murali', 'Sarah L. Schwartz'...",[a National Defense Science and Engineering Gr...,[a National Defense Science and Engineering Gr...,[a National Defense Science and Engineering Gr...,
2,Efficacy of antifungal agents against fungal s...,8756736,na,This study was funded by Dr. Pfleger Arzneimit...,"{'PERSON': ['Pfleger Arzneimittel GmbH'], 'GPE...","{'ORG': ['Dr. Pfleger Arzneimittel GmbH', 'Pro...","{'ORG': ['Pfleger Arzneimittel GmbH', 'Projekt...","{'ORG': ['Dr. Pfleger Arzneimittel GmbH'], 'LO...",,"[Dr. Pfleger Arzneimittel GmbH, Projekt DEAL]","[Pfleger Arzneimittel GmbH, Projekt]",[Dr. Pfleger Arzneimittel GmbH]
3,Assembly and comparative analysis of the first...,8756732,10.1186/s12870-021-03416-5,Thanks to all the members of the Institute of ...,{'ORG': ['the Institute of Leisure Agriculture...,{'ORG': ['the Institute of Leisure Agriculture...,{'ORG': ['the Institute of Leisure Agriculture...,"{'ORG': ['Institute of Leisure Agriculture', '...","[the Institute of Leisure Agriculture, the Gen...","[the Institute of Leisure Agriculture, theÂ Mi...","[the Institute of Leisure Agriculture, theÂ, M...","[Institute of Leisure Agriculture, Genepioneer..."
4,Characterizing the effects of different chemic...,8756731,10.1186/s13007-021-00835-1,Not applicable. This work was supported by the...,"{'ORG': ['the National Research Foundation', '...","{'ORG': ['the National Research Foundation', '...","{'ORG': ['the National Research Foundation', '...","{'ORG': ['National Research Foundation', 'NRF'...","[the National Research Foundation, NRF, MSIT]","[the National Research Foundation, NRF]","[the National Research Foundation, NRF]","[National Research Foundation, NRF, MSIT]"


In [68]:
match_count_spacy_sm_org = 0
match_result_spacy_sm_org = []
match_NER_spacy_sm_org = {}
for pmcid, orga_list in zip(ner_data['PMC_ID'], ner_data['NER_spacy_sm_org']):
    temp = []
    for orga in orga_list:
        orga = orga.lower()
        if orga[0] in orga_dict.keys() and orga in orga_dict[orga[0]]:
            temp.append(orga)
            # update the running totals, as the cells for the other tools do
            match_count_spacy_sm_org += 1
            match_result_spacy_sm_org.append(orga)
    match_NER_spacy_sm_org[pmcid] = temp

In [69]:
match_count_spacy_md_org = 0
match_result_spacy_md_org = []
match_NER_spacy_md_org = {}
for pmcid, orga_list in zip(ner_data['PMC_ID'], ner_data['NER_spacy_md_org']):
    temp = []
    for orga in orga_list:
        orga = orga.lower()
        if orga[0] in orga_dict.keys() and orga in orga_dict[orga[0]]:
            temp.append(orga)
            match_count_spacy_md_org += 1
            match_result_spacy_md_org.append(orga)
    match_NER_spacy_md_org[pmcid] = temp

In [70]:
match_count_spacy_lg_org = 0
match_result_spacy_lg_org = []
match_NER_spacy_lg_org = {}
for pmcid, orga_list in zip(ner_data['PMC_ID'], ner_data['NER_spacy_lg_org']):
    temp = []
    for orga in orga_list:
        orga = orga.lower()
        if orga[0] in orga_dict.keys() and orga in orga_dict[orga[0]]:
            temp.append(orga)
            match_count_spacy_lg_org += 1
            match_result_spacy_lg_org.append(orga)
    match_NER_spacy_lg_org[pmcid] = temp

In [73]:
match_count_flair_org = 0
match_result_flair_org = []
match_NER_flair_org = {}
for pmcid, orga_list in zip(ner_data['PMC_ID'], ner_data['NER_Flair_org']):
    temp = []
    for orga in orga_list:
        orga = orga.lower()
        if orga[0] in orga_dict.keys() and orga in orga_dict[orga[0]]:
            temp.append(orga)
            match_count_flair_org += 1
            match_result_flair_org.append(orga)
    match_NER_flair_org[pmcid] = temp

In [87]:
NER_spacy_sm_match = pd.DataFrame(zip(match_NER_spacy_sm_org.keys(),
                                      ner_data['acknowledgement'],
                                      match_NER_spacy_sm_org.values()),
                                      
                                      columns = ['PMC_ID', 'Acknowledgement', 'Match']
                                     )
NER_spacy_md_match = pd.DataFrame(zip(match_NER_spacy_md_org.keys(),
                                      ner_data['acknowledgement'],
                                      match_NER_spacy_md_org.values()),
                                     
                                     columns = ['PMC_ID', 'Acknowledgement', 'Match']
                                     )
NER_spacy_lg_match = pd.DataFrame(zip(match_NER_spacy_lg_org.keys(),
                                      ner_data['acknowledgement'],
                                      match_NER_spacy_lg_org.values()),
                                     
                                    columns = ['PMC_ID', 'Acknowledgement', 'Match']
                                     )
NER_flair_match = pd.DataFrame(zip(match_NER_flair_org.keys(),
                                   ner_data['acknowledgement'],
                                   match_NER_flair_org.values()),
                                  
                                   columns = ['PMC_ID', 'Acknowledgement', 'Match']
                                  )
with pd.ExcelWriter('match_orga.xlsx') as writer:
    NER_spacy_sm_match.to_excel(writer, sheet_name='NER_spacy_sm_org', index = False)
    NER_spacy_md_match.to_excel(writer, sheet_name='NER_spacy_md_org', index = False)
    NER_spacy_lg_match.to_excel(writer, sheet_name='NER_spacy_lg_org', index = False)
    NER_flair_match.to_excel(writer, sheet_name='NER_Flair_org', index = False)
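
The four matching cells above repeat the same loop once per tool. A minimal sketch of that loop as a reusable function follows, assuming that rows with no extracted organization hold a non-list value such as `None` or `NaN`. The name `match_per_article` and the toy data are hypothetical and not part of the notebook.

```python
def match_per_article(pmc_ids, org_lists, orga_dict):
    """Map each article ID to the extracted names found in the registry dictionary."""
    # one set per key makes each membership test cheap
    lookup = {key: set(names) for key, names in orga_dict.items()}
    matches = {}
    for pmcid, orga_list in zip(pmc_ids, org_lists):
        if not isinstance(orga_list, list):
            orga_list = []  # treat missing values as "nothing extracted"
        lowered = [orga.lower() for orga in orga_list if orga]
        matches[pmcid] = [orga for orga in lowered
                          if orga in lookup.get(orga[0], ())]
    return matches

toy_registry = {
    "n": ["nrf", "national research foundation"],
    "s": ["science foundation ireland"],
}
toy_ids = [8756738, 8756737, 8756731]
toy_orgs = [
    ["Science Foundation Ireland"],
    None,
    ["the National Research Foundation", "NRF", "MSIT"],
]
toy_matches = match_per_article(toy_ids, toy_orgs, toy_registry)
print(toy_matches)
```

In the toy data, "the National Research Foundation" does not match because of its leading determiner. The preprocessing functions in the next section address this kind of miss.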

**4. Preprocessing the text to increase the match count**

We have developed four preprocessing functions, which are applied sequentially to the NER tool output and to the funder registry names. The functions are defined in the cells below.

In [6]:
def case_lowering(text : str) -> str:
    """
    Lower case strings in a text
    :param text: input raw text
    :return text_updated: lower_cased string
    """
    text_updated = text.lower()
    return text_updated

In [7]:
def det_removal(text : str) -> list:
    """
    Removing the determiners (a, an, the) from the beginning of the string
    :param text: input raw text
    :return updated_text: list containing the string with a leading determiner removed and its new first letter
    """
    det_list = ['a', 'an', 'the']
    words = text.split(" ")
    # strip the determiner only when at least one word follows it
    if len(words) > 1 and words[0] in det_list:
        text_updated = " ".join(words[1:])
    else:
        text_updated = text
    # slicing returns "" for an empty string instead of raising an IndexError
    updated_text = [text_updated, text_updated[:1]]
    return updated_text

In [8]:
def and_replacement(text : str) -> str:
    """
    Replacing "and" with "&" in the text
    :param text: input raw text
    :return text_updated: text with 'and' replaced with '&' if any
    """
    text_updated = re.sub(" and ", " & ", text)
    return text_updated

In [9]:
def punct_removal(text : str) -> str:
    """
    Removing punctuation marks from the text
    :param text: input raw text
    :return text_updated: text with punctuations and special symbols removed except '&'
    """
    # '-' is escaped so that it is a literal hyphen and not a range between ',' and '.'
    regex = r"[!\"#\$%\'\(\)\*\+,\-\./:;<=>\?@\[\\\]\^_`{\|}~]"
    text_updated = re.sub(regex, "", text)
    return text_updated
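
To make the combined effect of the four functions concrete, here is a self-contained sketch that applies the same four steps in order and records every intermediate string. It re-implements the steps compactly: `preprocess` is a hypothetical helper, and `string.punctuation` minus '&' is assumed to cover the same characters as the regular expression above.

```python
import string

# punctuation to delete, keeping '&' (assumed equivalent to the regex in punct_removal)
PUNCT_TABLE = str.maketrans("", "", string.punctuation.replace("&", ""))

def preprocess(name):
    """Return the string after each of the four steps, in order."""
    steps = [name.lower()]                               # 1. lower-casing
    first, _, rest = steps[-1].partition(" ")
    keep_rest = rest and first in ("a", "an", "the")
    steps.append(rest if keep_rest else steps[-1])       # 2. leading determiner
    steps.append(steps[-1].replace(" and ", " & "))      # 3. "and" -> "&"
    steps.append(steps[-1].translate(PUNCT_TABLE))       # 4. punctuation
    return steps

steps = preprocess("The David and Lucile Packard Foundation.")
for step in steps:
    print(step)
```

The final string, 'david & lucile packard foundation', is one of the forms listed under key 'd' in the registry output shown earlier, whereas the raw input would not match.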

**5. Result Matching**

We will now match the NER tool output against the Crossref Funder Registry, applying the preprocessing functions sequentially.

In [10]:
def text_processing(ner_data : pd.DataFrame, tool : str) -> list:
    """
    Arranging all the organization names identified by a specific tool into a list
    :param ner_data: data frame containing the organization names identified by each tool
    :param tool: name of the column holding the names extracted by the NER tool under consideration
    :return new_tool_org: list of all the extracted organization names
    """
    new_tool_org = []
    for ele in ner_data[tool]:
        # rows without extracted organizations hold a non-list value and are skipped
        if isinstance(ele, list):
            new_tool_org += ele
    # drop empty and single-space strings ("or" here would keep every element)
    new_tool_org = [ele for ele in new_tool_org if ele != " " and ele != ""]
    return new_tool_org

In [11]:
def result_matching(ner_tool_set, orga_dict):
    """Matching organization names between the NER tool output and the Crossref Funder Registry
    :param ner_tool_set: collection of all the organization names extracted by a specific NER tool
    :param orga_dict: dictionary containing the names of all the funding organizations, keyed by first letter
    :return match_count: number of matches between the NER tool output and the Crossref Funder Registry
    :return match_result: list of all the matches between the NER tool output and the Crossref Funder Registry
    """
    # build each lookup set once instead of once per extracted name
    orga_sets = {key: set(names) for key, names in orga_dict.items()}
    match_result = [ele for ele in ner_tool_set
                    if ele != '' and ele in orga_sets.get(ele[0], ())]
    match_count = len(match_result)
    return match_count, match_result

In [12]:
# initializing dictionaries and a data frame to store the results
result_match_count = {}
result_match = {}

result_df = pd.DataFrame(columns = ['NER_spacy_sm_org', 
                                    'NER_spacy_sm_org_match',
                                    'NER_spacy_sm_org_match_%', 
                                    'NER_spacy_md_org',
                                    'NER_spacy_md_org_match',
                                    'NER_spacy_md_org_match_%',
                                    'NER_spacy_lg_org', 
                                    'NER_spacy_lg_org_match',
                                    'NER_spacy_lg_org_match_%',
                                    'NER_Flair_org', 
                                    'NER_Flair_org_match',
                                    'NER_Flair_org_match_%'],
                        index = ['Crossref baseline vs NER baseline',
                                 'Crossref baseline vs NER improved version I',
                                 'Crossref baseline vs NER improved version II',
                                 'Crossref baseline vs NER improved version III',
                                 'Crossref baseline vs NER improved version IV',
                                 'NER baseline vs Crossref improved version I',
                                 'NER baseline vs Crossref improved version II',
                                 'NER baseline vs Crossref improved version III',
                                 'NER baseline vs Crossref improved version IV',
                                 'NER improved version I vs Crossref improved version I',
                                 'NER improved version II vs Crossref improved version II',
                                 'NER improved version III vs Crossref improved version III',
                                 'NER improved version IV vs Crossref improved version IV'   
                                ])

**I. NER Baseline vs Crossref Baseline**

We start with a baseline comparison. In both cases, the baseline model is a lower-cased list of organization names.

The models compared here are - 

1) NER baseline model

2) Crossref baseline model

In [13]:
result_match_count['Crossref baseline vs NER baseline'] = {}
result_match['Crossref baseline vs NER baseline'] = {}
tool_output = {}

tool_names = ner_data.columns[-4:]
for tool in tool_names:
    tool_output[tool] = set([case_lowering(ele) for ele in text_processing(ner_data, tool)])
    match_count, match_result = result_matching(tool_output[tool],orga_dict)
    result_match_count['Crossref baseline vs NER baseline'][tool] = int(len(tool_output[tool]))
    result_match_count['Crossref baseline vs NER baseline'][tool+"_match"] = int(match_count)
    result_match_count['Crossref baseline vs NER baseline'][tool+"_match_%"] = match_count*100/len(tool_output[tool])
    result_match['Crossref baseline vs NER baseline'][tool] = match_result
    print("The match count between Crossref baseline model and " + tool + " baseline model :")
    print(match_count)

The match count between Crossref baseline model and NER_spacy_sm_org baseline model :
7762
The match count between Crossref baseline model and NER_spacy_md_org baseline model :
8045
The match count between Crossref baseline model and NER_spacy_lg_org baseline model :
8071
The match count between Crossref baseline model and NER_Flair_org baseline model :
11682
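
The `_match_%` value stored for each tool is the match count divided by the number of distinct lower-cased names that tool produced. A small sketch of that calculation on toy data; the registry excerpt, the tool output and the helper `count_matches` are hypothetical stand-ins for `orga_dict`, `tool_output[tool]` and `result_matching`.

```python
# Hypothetical stand-ins for the registry dictionary and one tool's output set.
toy_registry = {
    "i": ["innovate uk"],
    "n": ["national taiwan ocean university"],
}
toy_tool_output = {
    "innovate uk",
    "the ethics committee of",
    "jlg",
    "national taiwan ocean university",
}

def count_matches(names, registry):
    """Count the names that appear under their first-letter key in the registry."""
    lookup = {key: set(values) for key, values in registry.items()}
    matched = sorted(name for name in names
                     if name and name in lookup.get(name[0], ()))
    return len(matched), matched

toy_count, toy_matched = count_matches(toy_tool_output, toy_registry)
toy_percent = toy_count * 100 / len(toy_tool_output)
print(toy_count, toy_percent)
```

Because the denominator is the size of the tool's own output, a tool that extracts many spurious names can have a higher match count and still a lower percentage.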


In [14]:
ner_not_crossref = {}
for tool in tool_names:
    temp =[]
    for name in tool_output[tool]:
        if name not in result_match['Crossref baseline vs NER baseline'][tool]:
            temp.append(name)
    ner_not_crossref[tool] = temp          
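
The cell above collects the names a tool extracted that were not found in the registry. The same result can be expressed as a set difference, which avoids scanning the list of matches for every name. A sketch on hypothetical toy values:

```python
# Hypothetical toy values standing in for tool_output[tool] and the match list.
toy_tool_output = {"jlg", "innovate uk", "sheetal"}
toy_matched = ["innovate uk"]

# names extracted by the tool but absent from the matches
toy_unmatched = sorted(toy_tool_output - set(toy_matched))
print(toy_unmatched)
```

The result is sorted here only to make the output order deterministic.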

In [15]:
pd.DataFrame(orga_dict['n']).to_csv('n.csv')

In [88]:
print(result_match['Crossref baseline vs NER baseline']['NER_spacy_sm_org'][:10])
print(result_match['Crossref baseline vs NER baseline']['NER_spacy_md_org'][:10])
print(result_match['Crossref baseline vs NER baseline']['NER_spacy_lg_org'][:10])
print(result_match['Crossref baseline vs NER baseline']['NER_Flair_org'][:10])

['national taiwan ocean university', 'global health', 'innovate uk', 'laqv', 'fmd', 'industrial technology research institute', 'hospital universiti sains malaysia', 'royal institute of technology', 'dh', 'shanghai municipal health and family planning commission']
['national taiwan ocean university', 'global health', 'innovate uk', 'laqv', 'industrial technology research institute', 'national council for scientific and technological development', 'royal institute of technology', 'dh', 'tsgh', 'beth israel deaconess medical center']
['national taiwan ocean university', 'global health', 'innovate uk', 'fmd', 'industrial technology research institute', 'hospital universiti sains malaysia', 'national council for scientific and technological development', 'mdic', 'royal institute of technology', 'mazums']
['shinshu university', 'loughborough university', 'zaozhuang university', 'international tennis federation', 'national taiwan ocean university', 'seoul national university', 'hasso plattne

In [17]:
NER_spacy_sm_org = pd.DataFrame(ner_not_crossref['NER_spacy_sm_org'], columns = ['NER_spacy_sm_org'])
NER_spacy_md_org = pd.DataFrame(ner_not_crossref['NER_spacy_md_org'], columns = ['NER_spacy_md_org'])
NER_spacy_lg_org = pd.DataFrame(ner_not_crossref['NER_spacy_lg_org'], columns = ['NER_spacy_lg_org'])
NER_Flair_org = pd.DataFrame(ner_not_crossref['NER_Flair_org'], columns = ['NER_Flair_org'])

In [35]:
with pd.ExcelWriter('output.xlsx') as writer:
    NER_spacy_sm_org.to_excel(writer, sheet_name='NER_spacy_sm_org')
    NER_spacy_md_org.to_excel(writer, sheet_name='NER_spacy_md_org')
    NER_spacy_lg_org.to_excel(writer, sheet_name='NER_spacy_lg_org')
    NER_Flair_org.to_excel(writer, sheet_name='NER_Flair_org')

In [20]:
print(ner_not_crossref['NER_spacy_sm_org'][:10])
print(ner_not_crossref['NER_spacy_md_org'][:10])
print(ner_not_crossref['NER_spacy_lg_org'][:10])
print(ner_not_crossref['NER_Flair_org'][:10])

['sheetal', 'the ethics committee of', 'the international maize and wheat improvement center', 'tll temple endowed chair', 'the data intensive research enabling clean technology', 'y. escobar', 'the california breast cancer research grants program office of the university of california', 'the national coronial information system', 'jlg', 'the victorian breast cancer research consortium']
['tll temple endowed chair', 'italfarmaco (milan,', 'the data intensive research enabling clean technology', 'the california breast cancer research grants program office of the university of california', 'the cincinnati childrenâ\x80\x99s hospital', 'the national coronial information system', 'prosperity for all', 'metainsight', 'jlg', 'the victorian breast cancer research consortium']
['bdw (dsi/', 'the data intensive research enabling clean technology', 'the california breast cancer research grants program office of the university of california', 'the national coronial information system', 'jlg', 'th

In [21]:
result_df.loc['Crossref baseline vs NER baseline'] = pd.Series(result_match_count['Crossref baseline vs NER baseline'])


In [91]:
ner_data

Unnamed: 0,Article_Title,PMC_ID,DOI,acknowledgement,NER_Spacy (en_core_web_sm),NER_Spacy (en_core_web_md),NER_Spacy (en_core_web_lg),NER_Flair,NER_spacy_sm_org,NER_spacy_md_org,NER_spacy_lg_org,NER_Flair_org
0,Impact of antibiotics on the human microbiome ...,8756738,na,The authors were funded in part by Science Fou...,"{'ORG': ['Science Foundation Ireland'], 'GPE':...","{'ORG': ['Science Foundation Ireland'], 'GPE':...","{'ORG': ['Science Foundation Ireland'], 'GPE':...","{'ORG': ['Science Foundation Ireland', 'APC Mi...",[Science Foundation Ireland],[Science Foundation Ireland],[Science Foundation Ireland],"[Science Foundation Ireland, APC Microbiome Ir..."
1,Novel nitrite reductase domain structure sugge...,8756737,na,The authors thank Dr. Ranjani Murali for advic...,"{'PERSON': ['Ranjani Murali', 'Sarah L. Schwar...","{'PERSON': ['Ranjani Murali', 'Sarah L. Schwar...","{'PERSON': ['Ranjani Murali', 'Sarah L. Schwar...","{'PER': ['Ranjani Murali', 'Sarah L. Schwartz'...",[a National Defense Science and Engineering Gr...,[a National Defense Science and Engineering Gr...,[a National Defense Science and Engineering Gr...,
2,Efficacy of antifungal agents against fungal s...,8756736,na,This study was funded by Dr. Pfleger Arzneimit...,"{'PERSON': ['Pfleger Arzneimittel GmbH'], 'GPE...","{'ORG': ['Dr. Pfleger Arzneimittel GmbH', 'Pro...","{'ORG': ['Pfleger Arzneimittel GmbH', 'Projekt...","{'ORG': ['Dr. Pfleger Arzneimittel GmbH'], 'LO...",,"[Dr. Pfleger Arzneimittel GmbH, Projekt DEAL]","[Pfleger Arzneimittel GmbH, Projekt]",[Dr. Pfleger Arzneimittel GmbH]
3,Assembly and comparative analysis of the first...,8756732,10.1186/s12870-021-03416-5,Thanks to all the members of the Institute of ...,{'ORG': ['the Institute of Leisure Agriculture...,{'ORG': ['the Institute of Leisure Agriculture...,{'ORG': ['the Institute of Leisure Agriculture...,"{'ORG': ['Institute of Leisure Agriculture', '...","[the Institute of Leisure Agriculture, the Gen...","[the Institute of Leisure Agriculture, theÂ Mi...","[the Institute of Leisure Agriculture, theÂ, M...","[Institute of Leisure Agriculture, Genepioneer..."
4,Characterizing the effects of different chemic...,8756731,10.1186/s13007-021-00835-1,Not applicable. This work was supported by the...,"{'ORG': ['the National Research Foundation', '...","{'ORG': ['the National Research Foundation', '...","{'ORG': ['the National Research Foundation', '...","{'ORG': ['National Research Foundation', 'NRF'...","[the National Research Foundation, NRF, MSIT]","[the National Research Foundation, NRF]","[the National Research Foundation, NRF]","[National Research Foundation, NRF, MSIT]"
...,...,...,...,...,...,...,...,...,...,...,...,...
65829,Selenium Induces Pancreatic Cancer Cell Death ...,8773897,na,"This work, in part was supported by Penn State...","{'ORG': ['Penn State Cancer Institute Funds'],...",{'ORG': ['Penn State Cancer Institute Funds']},"{'ORG': ['Penn State Cancer Institute Funds', ...","{'ORG': ['Penn State Cancer Institute'], 'LOC'...",[Penn State Cancer Institute Funds],[Penn State Cancer Institute Funds],"[Penn State Cancer Institute Funds, R.S.]",[Penn State Cancer Institute]
65830,Neuropsychiatric Manifestations of Antiphospho...,8773877,na,This research received no external funding.,{},{},{},{},,,,
65831,Systemic Delivery of mLIGHT-Armed Myxoma Virus...,8773855,na,"We thank Leslie Sharp, Stephen Potts, Kavita M...","{'PERSON': ['Leslie Sharp', 'Stephen Potts', '...","{'PERSON': ['Leslie Sharp', 'Stephen Potts', '...","{'PERSON': ['Leslie Sharp', 'Stephen Potts', '...","{'PER': ['Leslie Sharp', 'Stephen Potts', 'Kav...","[Arizona State University, an ASU PhD Completi...","[Arizona State University, ASU, G.M., NIH, G.M...","[an Arizona State University, ASU, G.M., ASU P...","[., Arizona State University, ASU, ASU, NIH]"
65832,STAT3 Enhances Sensitivity of Glioblastoma to ...,8773829,na,We thank Hildegard KÃ¶nig for excellent techni...,"{'PERSON': ['Hildegard KÃ¶nig'], 'ORG': ['the ...","{'PERSON': ['Hildegard KÃ¶nig'], 'ORG': ['the ...","{'PERSON': ['Hildegard KÃ¶nig', 'D.K.', 'D.K.'...","{'PER': ['Hildegard KÃ ¶ nig'], 'ORG': ['Smart...","[the Graphical Abstract, Smart Servier Medical...","[the Graphical Abstract, Smart Servier Medical...","[Smart Servier Medical Art., CC BY 3.0, the Ge...","[Smart Servier Medical Art, German Research Fo..."


In [107]:
NER_spacy_sm_org_match = []
for pmcid, orga_list in zip(ner_data['PMC_ID'], ner_data['NER_spacy_sm_org']):
    # print(pmcid,orga_list)
    temp = []
    for orga in orga_list:
        orga = orga.lower()
        if orga[0] in orga_dict.keys() and orga in orga_dict[orga[0]]:
            temp.append(orga)
    NER_spacy_sm_org_match.append(temp)      

In [108]:
NER_spacy_md_org_match = []
for pmcid, orga_list in zip(ner_data['PMC_ID'], ner_data['NER_spacy_md_org']):
    # print(pmcid,orga_list)
    temp = []
    for orga in orga_list:
        orga = orga.lower()
        if orga[0] in orga_dict.keys() and orga in orga_dict[orga[0]]:
            temp.append(orga)
    NER_spacy_md_org_match.append(temp)      

In [110]:
NER_spacy_lg_org_match = []
for pmcid, orga_list in zip(ner_data['PMC_ID'], ner_data['NER_spacy_lg_org']):
    # print(pmcid,orga_list)
    temp = []
    for orga in orga_list:
        orga = orga.lower()
        if orga[0] in orga_dict.keys() and orga in orga_dict[orga[0]]:
            temp.append(orga)
    NER_spacy_lg_org_match.append(temp)    

In [111]:
NER_Flair_org_match = []
for pmcid, orga_list in zip(ner_data['PMC_ID'], ner_data['NER_Flair_org']):
    # print(pmcid,orga_list)
    temp = []
    for orga in orga_list:
        orga = orga.lower()
        if orga[0] in orga_dict.keys() and orga in orga_dict[orga[0]]:
            temp.append(orga)
    NER_Flair_org_match.append(temp)  

**Creating a DataFrame**

In [113]:
NER_spacy_sm_org_match_df = pd.DataFrame(zip(ner_data['PMC_ID'],
                                            ner_data['acknowledgement'],
                                            NER_spacy_sm_org_match),
                                        columns = ['PMC_ID', 'Acknowledgement', 'Matched_Organizations'])
NER_spacy_md_org_match_df = pd.DataFrame(zip(ner_data['PMC_ID'],
                                            ner_data['acknowledgement'],
                                            NER_spacy_md_org_match),
                                        columns = ['PMC_ID', 'Acknowledgement', 'Matched_Organizations'])
NER_spacy_lg_org_match_df = pd.DataFrame(zip(ner_data['PMC_ID'],
                                            ner_data['acknowledgement'],
                                            NER_spacy_lg_org_match),
                                        columns = ['PMC_ID', 'Acknowledgement', 'Matched_Organizations'])
NER_Flair_org_match_df = pd.DataFrame(zip(ner_data['PMC_ID'],
                                            ner_data['acknowledgement'],
                                            NER_Flair_org_match),
                                        columns = ['PMC_ID', 'Acknowledgement', 'Matched_Organizations'])

**Saving the dataframe**

In [114]:
# a single writer keeps all four sheets; separate to_excel calls on the same path overwrite the file
with pd.ExcelWriter("matched_organiztion.xlsx") as writer:
    NER_spacy_sm_org_match_df.to_excel(writer, sheet_name = "NER_spacy_sm_org", index = False)
    NER_spacy_md_org_match_df.to_excel(writer, sheet_name = "NER_spacy_md_org", index = False)
    NER_spacy_lg_org_match_df.to_excel(writer, sheet_name = "NER_spacy_lg_org", index = False)
    NER_Flair_org_match_df.to_excel(writer, sheet_name = "NER_Flair_org", index = False)

**II. Improved NER vs Crossref Baseline**

Now we will sequentially compare the results from the Crossref baseline model with those from improved versions of the NER tool output. The four improved models for the results from each NER tool are as follows:

1) NER_I - 'det_removal' function applied

2) NER_II - 'and_replacement' function applied

3) NER_III - 'punct_removal' function applied

4) NER_IV - 'det_removal' + 'and_replacement' + 'punct_removal' functions applied
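
The four variants listed above can be sketched as a mapping from variant label to transformation, applied to an already lower-cased name. The helper functions are compact, hypothetical re-implementations of the notebook's preprocessing functions, and the sample name is invented for illustration.

```python
import string

# punctuation to delete, keeping '&' (assumed to mirror punct_removal)
PUNCT_TABLE = str.maketrans("", "", string.punctuation.replace("&", ""))

def drop_determiner(text):
    """Remove a leading 'a', 'an' or 'the' when another word follows."""
    first, _, rest = text.partition(" ")
    return rest if rest and first in ("a", "an", "the") else text

def swap_and(text):
    """Replace the word 'and' with '&'."""
    return text.replace(" and ", " & ")

def drop_punct(text):
    """Delete punctuation except '&'."""
    return text.translate(PUNCT_TABLE)

VARIANTS = {
    "NER_I": drop_determiner,
    "NER_II": swap_and,
    "NER_III": drop_punct,
    "NER_IV": lambda text: drop_punct(swap_and(drop_determiner(text))),
}

sample = "the research and development fund, inc."
variant_output = {label: func(sample) for label, func in VARIANTS.items()}
for label, value in variant_output.items():
    print(label, "->", value)
```

Versions I to III each isolate the effect of one function, so their match counts show which step contributes most before version IV combines them.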



In [22]:
result_match['Crossref baseline vs NER improved version I'] = {}
result_match['Crossref baseline vs NER improved version II'] = {}
result_match['Crossref baseline vs NER improved version III'] = {}
result_match['Crossref baseline vs NER improved version IV'] = {}

result_match_count['Crossref baseline vs NER improved version I'] = {}
result_match_count['Crossref baseline vs NER improved version II'] = {}
result_match_count['Crossref baseline vs NER improved version III'] = {}
result_match_count['Crossref baseline vs NER improved version IV'] = {}


#I
tool_output = {}
tool_names = ner_data.columns[-4:]
for tool in tool_names:
    NER_I = set([det_removal(case_lowering(ele))[0] for ele in text_processing(ner_data, tool) ])
    tool_output[tool] = NER_I
    match_count, match_result = result_matching(tool_output[tool],orga_dict)
    result_match_count['Crossref baseline vs NER improved version I'][tool] = int(len(tool_output[tool]))
    result_match_count['Crossref baseline vs NER improved version I'][tool+"_match"] = int(match_count)
    result_match_count['Crossref baseline vs NER improved version I'][tool+"_match_%"] = match_count*100/len(tool_output[tool])
    result_match['Crossref baseline vs NER improved version I'][tool] = match_result
    print("The match count between Crossref baseline model and " + tool + " improved version I :")
    print(match_count)
    
print("********************************************************************************************************")     
    
#II
tool_output = {}
tool_names = ner_data.columns[-4:]
for tool in tool_names:
    NER_II = set([and_replacement(case_lowering(ele)) for ele in text_processing(ner_data, tool)])
    tool_output[tool] = NER_II
    match_count, match_result = result_matching(tool_output[tool],orga_dict)
    result_match_count['Crossref baseline vs NER improved version II'][tool] = int(len(tool_output[tool]))
    result_match_count['Crossref baseline vs NER improved version II'][tool+"_match"] = int(match_count)
    result_match_count['Crossref baseline vs NER improved version II'][tool+"_match_%"] = match_count*100/len(tool_output[tool])
    result_match['Crossref baseline vs NER improved version II'][tool] = match_result
    print("The match count between Crossref baseline model and " + tool + " improved version II :")
    print(match_count)

print("********************************************************************************************************")     
    
#III
tool_output = {}
tool_names = ner_data.columns[-4:]
for tool in tool_names:
    NER_III = set([punct_removal(case_lowering(ele)) for ele in text_processing(ner_data, tool)])
    tool_output[tool] = NER_III
    match_count, match_result = result_matching(tool_output[tool],orga_dict)
    result_match_count['Crossref baseline vs NER improved version III'][tool] = int(len(tool_output[tool]))
    result_match_count['Crossref baseline vs NER improved version III'][tool+"_match"] = int(match_count)
    result_match_count['Crossref baseline vs NER improved version III'][tool+"_match_%"] = match_count*100/len(tool_output[tool])
    result_match['Crossref baseline vs NER improved version III'][tool] = match_result
    print("The match count between Crossref baseline model and " + tool + " improved version III :")
    print(match_count)

print("********************************************************************************************************")    
    
#IV
tool_output = {}
tool_names = ner_data.columns[-4:]
for tool in tool_names:
    NER_IV = set([punct_removal(and_replacement(det_removal(case_lowering(ele))[0])) for ele in text_processing(ner_data, tool)])
    tool_output[tool] = NER_IV
    match_count, match_result = result_matching(tool_output[tool],orga_dict)
    result_match_count['Crossref baseline vs NER improved version IV'][tool] = int(len(tool_output[tool]))
    result_match_count['Crossref baseline vs NER improved version IV'][tool+"_match"] = int(match_count)
    result_match_count['Crossref baseline vs NER improved version IV'][tool+"_match_%"] = match_count*100/len(tool_output[tool])
    result_match['Crossref baseline vs NER improved version IV'][tool] = match_result
    print("The match count between Crossref baseline model and " + tool + " improved version IV :")
    print(match_count)

The match count between Crossref baseline model and NER_spacy_sm_org improved version I :
9294
The match count between Crossref baseline model and NER_spacy_md_org improved version I :
9522
The match count between Crossref baseline model and NER_spacy_lg_org improved version I :
9544
The match count between Crossref baseline model and NER_Flair_org improved version I :
11609
********************************************************************************************************
The match count between Crossref baseline model and NER_spacy_sm_org improved version II :
7348
The match count between Crossref baseline model and NER_spacy_md_org improved version II :
7584
The match count between Crossref baseline model and NER_spacy_lg_org improved version II :
7611
The match count between Crossref baseline model and NER_Flair_org improved version II :
10882
********************************************************************************************************
The match count between Crossr

In [23]:
result_df.loc['Crossref baseline vs NER improved version I'] = pd.Series(result_match_count['Crossref baseline vs NER improved version I'])
result_df.loc['Crossref baseline vs NER improved version II'] = pd.Series(result_match_count['Crossref baseline vs NER improved version II'])
result_df.loc['Crossref baseline vs NER improved version III'] = pd.Series(result_match_count['Crossref baseline vs NER improved version III'])
result_df.loc['Crossref baseline vs NER improved version IV'] = pd.Series(result_match_count['Crossref baseline vs NER improved version IV'])



In [24]:
result_df

Unnamed: 0,NER_spacy_sm_org,NER_spacy_sm_org_match,NER_spacy_sm_org_match_%,NER_spacy_md_org,NER_spacy_md_org_match,NER_spacy_md_org_match_%,NER_spacy_lg_org,NER_spacy_lg_org_match,NER_spacy_lg_org_match_%,NER_Flair_org,NER_Flair_org_match,NER_Flair_org_match_%
Crossref baseline vs NER baseline,97859.0,7762.0,7.93182,102144.0,8045.0,7.876136,101777.0,8071.0,7.930082,75465.0,11682.0,15.480024
Crossref baseline vs NER improved version I,92970.0,9294.0,9.996773,96974.0,9522.0,9.819127,96723.0,9544.0,9.867353,74910.0,11609.0,15.497263
Crossref baseline vs NER improved version II,97688.0,7348.0,7.521906,101971.0,7584.0,7.437409,101618.0,7611.0,7.489815,75256.0,10882.0,14.459977
Crossref baseline vs NER improved version III,96143.0,7811.0,8.124356,100323.0,8057.0,8.03106,99956.0,8064.0,8.06755,73372.0,11590.0,15.796217
Crossref baseline vs NER improved version IV,91058.0,8643.0,9.491753,94958.0,8809.0,9.276733,94712.0,8805.0,9.296604,72601.0,10741.0,14.794562
NER baseline vs Crossref improved version I,,,,,,,,,,,,
NER baseline vs Crossref improved version II,,,,,,,,,,,,
NER baseline vs Crossref improved version III,,,,,,,,,,,,
NER baseline vs Crossref improved version IV,,,,,,,,,,,,
NER improved version I vs Crossref improved version I,,,,,,,,,,,,
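Each `_match_%` column in the table is the match count multiplied by 100 and divided by the number of distinct names the tool produced for that configuration. The sketch below illustrates this on toy data. `result_matching_sketch` is a hypothetical stand-in for the notebook's `result_matching`, which is defined earlier and not shown here. Exact string matching against a dictionary of name lists is an assumption, and the matches are sorted only to make the toy output deterministic.

```python
def result_matching_sketch(extracted_names, registry):
    """Return (match_count, matches) for names found in any registry bucket."""
    registry_names = set()
    for names in registry.values():
        registry_names.update(names)
    matches = sorted(extracted_names & registry_names)
    return len(matches), matches

# Toy inputs: a set of extracted names and a registry keyed by first letter
extracted = {"innovate uk", "global health", "my lab"}
registry = {"g": ["global health"], "i": ["innovate uk", "ibm"]}

match_count, matches = result_matching_sketch(extracted, registry)
match_pct = match_count * 100 / len(extracted)
print(match_count, matches, round(match_pct, 2))

# Same formula applied to the spaCy sm counts reported for improved version I
table_pct = 9294 * 100 / 92970
print(round(table_pct, 6))
```

Because the denominator is the number of distinct names after normalization, a preprocessing step can raise the percentage both by adding matches and by merging duplicate names.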


**Applying the preprocessing functions to orga_dict**

Now we will sequentially apply all the preprocessing functions to 'orga_dict', which contains the organization names from the Crossref Funder Registry. The four variations of orga_dict are as follows - 


1) orga_dict_I - 'det_removal' function applied

2) orga_dict_II - 'and_replacement' function applied

3) orga_dict_III - 'punct_removal' function applied

4) orga_dict_IV - 'det_removal' + 'and_replacement' + 'punct_removal' functions applied

In [25]:
from collections import defaultdict

orga_dict_I = {}
orga_dict_II = {}
orga_dict_III = {}
orga_dict_IV = {}


for ele in orga_dict:
    for name in orga_dict[ele]:
        # det_removal is called once per name: index 0 is the cleaned name,
        # index 1 is the key under which the cleaned name is stored
        det_result = det_removal(name)
        name_no_det, new_key = det_result[0], det_result[1]

        # versions I and IV are regrouped under the new key
        orga_dict_I.setdefault(new_key, []).append(name_no_det)
        orga_dict_IV.setdefault(new_key, []).append(punct_removal(and_replacement(name_no_det)))

        # versions II and III keep the original key
        orga_dict_II.setdefault(ele, []).append(and_replacement(name))
        orga_dict_III.setdefault(ele, []).append(punct_removal(name))
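The regrouping step can be illustrated on a toy registry. This is a sketch under stated assumptions: `det_removal_sketch` is a hypothetical stand-in that drops a leading "the " and keys each name by its first letter, and the registry entries are invented for the example. It uses `defaultdict`, which the cell above imports.

```python
from collections import defaultdict

def det_removal_sketch(name):
    # Hypothetical stand-in: drop a leading "the " and key by first letter
    cleaned = name[4:] if name.startswith("the ") else name
    return cleaned, cleaned[:1]

# Invented names, grouped by first letter as in the original dictionary
toy_registry = {
    "t": ["the alpha foundation", "theta institute"],
    "a": ["atlas fund"],
}

regrouped = defaultdict(list)
for names in toy_registry.values():
    for name in names:
        cleaned, new_key = det_removal_sketch(name)
        # A name that lost its determiner moves to the bucket of its new first letter
        regrouped[new_key].append(cleaned)

regrouped = dict(regrouped)
print(regrouped)
```

In this toy case "the alpha foundation" moves from the "t" bucket to the "a" bucket, while "theta institute" stays where it was because it only starts with the letters "the".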

**III. NER Baseline vs Improved Crossref**

Now we will sequentially compare the results from the NER baseline model with those from the improved versions of the Crossref dictionary (orga_dict_I to orga_dict_IV).

In [26]:
result_match['NER baseline vs Crossref improved version I'] = {}
result_match['NER baseline vs Crossref improved version II'] = {}
result_match['NER baseline vs Crossref improved version III'] = {}
result_match['NER baseline vs Crossref improved version IV'] = {}

result_match_count['NER baseline vs Crossref improved version I'] = {}
result_match_count['NER baseline vs Crossref improved version II'] = {}
result_match_count['NER baseline vs Crossref improved version III'] = {}
result_match_count['NER baseline vs Crossref improved version IV'] = {}




#I
tool_output = {}
tool_names = ner_data.columns[-4:]
for tool in tool_names:
    tool_output[tool] = set([case_lowering(ele) for ele in text_processing(ner_data, tool)])
    match_count, match_result = result_matching(tool_output[tool],orga_dict_I)
    result_match_count['NER baseline vs Crossref improved version I'][tool] = int(len(tool_output[tool]))
    result_match_count['NER baseline vs Crossref improved version I'][tool+"_match"] = int(match_count)
    result_match_count['NER baseline vs Crossref improved version I'][tool+"_match_%"] = match_count*100/len(tool_output[tool])
    result_match['NER baseline vs Crossref improved version I'][tool] = match_result
    print("The match count between " + tool + " baseline model and Crossref improved version I :")
    print(match_count)
    
print("********************************************************************************************************")

#II
tool_output = {}
tool_names = ner_data.columns[-4:]
for tool in tool_names:
    tool_output[tool] = set([case_lowering(ele) for ele in text_processing(ner_data, tool)])
    match_count, match_result = result_matching(tool_output[tool],orga_dict_II)
    result_match_count['NER baseline vs Crossref improved version II'][tool] = int(len(tool_output[tool]))
    result_match_count['NER baseline vs Crossref improved version II'][tool+"_match"] = int(match_count)
    result_match_count['NER baseline vs Crossref improved version II'][tool+"_match_%"] = match_count*100/len(tool_output[tool])
    result_match['NER baseline vs Crossref improved version II'][tool] = match_result
    print("The match count between " + tool + " baseline model and Crossref improved version II :")
    print(match_count)
    
print("********************************************************************************************************")

#III
tool_output = {}
tool_names = ner_data.columns[-4:]
for tool in tool_names:
    tool_output[tool] = set([case_lowering(ele) for ele in text_processing(ner_data, tool)])
    match_count, match_result = result_matching(tool_output[tool],orga_dict_III)
    result_match_count['NER baseline vs Crossref improved version III'][tool] = int(len(tool_output[tool]))
    result_match_count['NER baseline vs Crossref improved version III'][tool+"_match"] = int(match_count)
    result_match_count['NER baseline vs Crossref improved version III'][tool+"_match_%"] = match_count*100/len(tool_output[tool])
    result_match['NER baseline vs Crossref improved version III'][tool] = match_result
    print("The match count between " + tool + " baseline model and Crossref improved version III :")
    print(match_count)
    
print("********************************************************************************************************")

#IV
tool_output = {}
tool_names = ner_data.columns[-4:]
for tool in tool_names:
    tool_output[tool] = set([case_lowering(ele) for ele in text_processing(ner_data, tool)])
    match_count, match_result = result_matching(tool_output[tool],orga_dict_IV)
    result_match_count['NER baseline vs Crossref improved version IV'][tool] = int(len(tool_output[tool]))
    result_match_count['NER baseline vs Crossref improved version IV'][tool+"_match"] = int(match_count)
    result_match_count['NER baseline vs Crossref improved version IV'][tool+"_match_%"] = match_count*100/len(tool_output[tool])
    result_match['NER baseline vs Crossref improved version IV'][tool] = match_result
    print("The match count between " + tool + " baseline model and Crossref improved version IV :")
    print(match_count)

The match count between NER_spacy_sm_org baseline model and Crossref improved version I :
7244
The match count between NER_spacy_md_org baseline model and Crossref improved version I :
7522
The match count between NER_spacy_lg_org baseline model and Crossref improved version I :
7536
The match count between NER_Flair_org baseline model and Crossref improved version I :
11584
********************************************************************************************************
The match count between NER_spacy_sm_org baseline model and Crossref improved version II :
7309
The match count between NER_spacy_md_org baseline model and Crossref improved version II :
7533
The match count between NER_spacy_lg_org baseline model and Crossref improved version II :
7558
The match count between NER_Flair_org baseline model and Crossref improved version II :
10757
********************************************************************************************************
The match count betweenNER_spacy_sm_orgbaselin

In [27]:
result_df.loc['NER baseline vs Crossref improved version I'] = pd.Series(result_match_count['NER baseline vs Crossref improved version I'])
result_df.loc['NER baseline vs Crossref improved version II'] = pd.Series(result_match_count['NER baseline vs Crossref improved version II'])
result_df.loc['NER baseline vs Crossref improved version III'] = pd.Series(result_match_count['NER baseline vs Crossref improved version III'])
result_df.loc['NER baseline vs Crossref improved version IV'] = pd.Series(result_match_count['NER baseline vs Crossref improved version IV'])


In [28]:
result_df

Unnamed: 0,NER_spacy_sm_org,NER_spacy_sm_org_match,NER_spacy_sm_org_match_%,NER_spacy_md_org,NER_spacy_md_org_match,NER_spacy_md_org_match_%,NER_spacy_lg_org,NER_spacy_lg_org_match,NER_spacy_lg_org_match_%,NER_Flair_org,NER_Flair_org_match,NER_Flair_org_match_%
Crossref baseline vs NER baseline,97859.0,7762.0,7.93182,102144.0,8045.0,7.876136,101777.0,8071.0,7.930082,75465.0,11682.0,15.480024
Crossref baseline vs NER improved version I,92970.0,9294.0,9.996773,96974.0,9522.0,9.819127,96723.0,9544.0,9.867353,74910.0,11609.0,15.497263
Crossref baseline vs NER improved version II,97688.0,7348.0,7.521906,101971.0,7584.0,7.437409,101618.0,7611.0,7.489815,75256.0,10882.0,14.459977
Crossref baseline vs NER improved version III,96143.0,7811.0,8.124356,100323.0,8057.0,8.03106,99956.0,8064.0,8.06755,73372.0,11590.0,15.796217
Crossref baseline vs NER improved version IV,91058.0,8643.0,9.491753,94958.0,8809.0,9.276733,94712.0,8805.0,9.296604,72601.0,10741.0,14.794562
NER baseline vs Crossref improved version I,97859.0,7244.0,7.402487,102144.0,7522.0,7.364113,101777.0,7536.0,7.404423,75465.0,11584.0,15.350162
NER baseline vs Crossref improved version II,97859.0,7309.0,7.468909,102144.0,7533.0,7.374883,101777.0,7558.0,7.426039,75465.0,10757.0,14.25429
NER baseline vs Crossref improved version III,97859.0,7516.0,7.680438,102144.0,7756.0,7.593202,101777.0,7767.0,7.63139,75465.0,11426.0,15.140794
NER baseline vs Crossref improved version IV,97859.0,6611.0,6.755638,102144.0,6798.0,6.65531,101777.0,6799.0,6.680291,75465.0,10418.0,13.805075
NER improved version I vs Crossref improved version I,,,,,,,,,,,,
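One way to read the table is to recompute the match percentage from the counts and pick the best row per tool. The sketch below does this for two of the four tools, using the (distinct names, matches) pairs copied from the nine filled rows above. The names `counts`, `match_pct` and `best` are introduced here for illustration only.

```python
# (distinct names, matches) per configuration, copied from the table above
counts = {
    "NER_spacy_sm_org": {
        "Crossref baseline vs NER baseline": (97859, 7762),
        "Crossref baseline vs NER improved version I": (92970, 9294),
        "Crossref baseline vs NER improved version II": (97688, 7348),
        "Crossref baseline vs NER improved version III": (96143, 7811),
        "Crossref baseline vs NER improved version IV": (91058, 8643),
        "NER baseline vs Crossref improved version I": (97859, 7244),
        "NER baseline vs Crossref improved version II": (97859, 7309),
        "NER baseline vs Crossref improved version III": (97859, 7516),
        "NER baseline vs Crossref improved version IV": (97859, 6611),
    },
    "NER_Flair_org": {
        "Crossref baseline vs NER baseline": (75465, 11682),
        "Crossref baseline vs NER improved version I": (74910, 11609),
        "Crossref baseline vs NER improved version II": (75256, 10882),
        "Crossref baseline vs NER improved version III": (73372, 11590),
        "Crossref baseline vs NER improved version IV": (72601, 10741),
        "NER baseline vs Crossref improved version I": (75465, 11584),
        "NER baseline vs Crossref improved version II": (75465, 10757),
        "NER baseline vs Crossref improved version III": (75465, 11426),
        "NER baseline vs Crossref improved version IV": (75465, 10418),
    },
}

def match_pct(total, matched):
    """Same formula as the notebook's _match_% columns."""
    return matched * 100 / total

# Configuration with the highest match percentage for each tool
best = {
    tool: max(rows, key=lambda label: match_pct(*rows[label]))
    for tool, rows in counts.items()
}
for tool, label in best.items():
    print(tool, "->", label, round(match_pct(*counts[tool][label]), 6))
```

Among these nine rows, the highest percentage for spaCy sm comes from NER improved version I, and for Flair from NER improved version III. The ranking is by percentage, so it also reflects the smaller number of distinct names after normalization, not only the raw match count.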


**IV. Improved NER vs Improved Crossref**

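Finally, we will sequentially compare the results of the improved Crossref dictionaries with those of the improved NER models, pairing each NER version with the Crossref version that received the same preprocessing.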
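The pairing of each NER version with the identically preprocessed Crossref version can also be expressed as one loop over a table of transformations. The following is a minimal sketch, not the notebook's code: the four helper functions are deliberately trivial, hypothetical stand-ins, the registry is flattened to a plain list, and all names are invented toy data.

```python
# Hypothetical, simplified stand-ins for the notebook's helper functions
def lower(name):
    return name.lower()

def drop_det(name):
    return name[4:] if name.startswith("the ") else name

def use_ampersand(name):
    return name.replace(" and ", " & ")

def drop_punct(name):
    return "".join(ch for ch in name if ch.isalnum() or ch in " &")

# Each version is an ordered list of steps applied after lowercasing
VERSIONS = {
    "I": [drop_det],
    "II": [use_ampersand],
    "III": [drop_punct],
    "IV": [drop_det, use_ampersand, drop_punct],
}

def normalize(name, steps):
    name = lower(name)
    for step in steps:
        name = step(name)
    return name

def compare_all(extracted, registry_names):
    """Apply the same steps to both sides and count exact matches per version."""
    summary = {}
    for version, steps in VERSIONS.items():
        ner = {normalize(n, steps) for n in extracted}
        reg = {normalize(n, steps) for n in registry_names}
        matched = ner & reg
        label = f"NER improved version {version} vs Crossref improved version {version}"
        summary[label] = {
            "names": len(ner),
            "match": len(matched),
            "match_%": len(matched) * 100 / len(ner),
        }
    return summary

extracted = ["The Alpha Foundation", "Beta Science and Technology Fund", "Gamma Inc."]
registry_names = ["alpha foundation", "beta science & technology fund", "gamma inc"]
summary = compare_all(extracted, registry_names)
for label, row in summary.items():
    print(label, row)
```

On this toy input each single step recovers one match, and only the combined version IV recovers all three.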

In [29]:
result_match['NER improved version I vs Crossref improved version I'] = {}
result_match['NER improved version II vs Crossref improved version II'] = {}
result_match['NER improved version III vs Crossref improved version III'] = {}
result_match['NER improved version IV vs Crossref improved version IV'] = {}

result_match_count['NER improved version I vs Crossref improved version I'] = {}
result_match_count['NER improved version II vs Crossref improved version II'] = {}
result_match_count['NER improved version III vs Crossref improved version III'] = {}
result_match_count['NER improved version IV vs Crossref improved version IV'] = {}



#I
tool_output = {}
tool_names = ner_data.columns[-4:]
for tool in tool_names:
    tool_output[tool] = set([det_removal(case_lowering(ele))[0] for ele in text_processing(ner_data, tool)])
    match_count, match_result = result_matching(tool_output[tool],orga_dict_I)
    result_match_count['NER improved version I vs Crossref improved version I'][tool] = int(len(tool_output[tool]))
    result_match_count['NER improved version I vs Crossref improved version I'][tool+"_match"] = int(match_count)
    result_match_count['NER improved version I vs Crossref improved version I'][tool+"_match_%"] = match_count*100/len(tool_output[tool])
    result_match['NER improved version I vs Crossref improved version I'][tool] = match_result
    print("The match count between " + tool + " improved version I and Crossref improved version I :")
    print(match_count)
    print(match_result)
    
print("********************************************************************************************************")
    
#II
tool_output = {}
tool_names = ner_data.columns[-4:]
for tool in tool_names:
    tool_output[tool] = set([and_replacement(case_lowering(ele)) for ele in text_processing(ner_data, tool)])
    match_count, match_result = result_matching(tool_output[tool],orga_dict_II)
    result_match_count['NER improved version II vs Crossref improved version II'][tool] = int(len(tool_output[tool]))
    result_match_count['NER improved version II vs Crossref improved version II'][tool+"_match"] = int(match_count)
    result_match_count['NER improved version II vs Crossref improved version II'][tool+"_match_%"] = match_count*100/len(tool_output[tool])
    result_match['NER improved version II vs Crossref improved version II'][tool] = match_result
    print("The match count between " + tool + " improved version II and Crossref improved version II :")
    print(match_count)
    print(match_result)
    
print("********************************************************************************************************")    

#III
tool_output = {}
tool_names = ner_data.columns[-4:]
for tool in tool_names:
    tool_output[tool] = set([punct_removal(case_lowering(ele)) for ele in text_processing(ner_data, tool)])
    match_count, match_result = result_matching(tool_output[tool],orga_dict_III)
    result_match_count['NER improved version III vs Crossref improved version III'][tool] = int(len(tool_output[tool]))
    result_match_count['NER improved version III vs Crossref improved version III'][tool+"_match"] = int(match_count)
    result_match_count['NER improved version III vs Crossref improved version III'][tool+"_match_%"] = match_count*100/len(tool_output[tool])
    result_match['NER improved version III vs Crossref improved version III'][tool] = match_result
    print("The match count between " + tool + " improved version III and Crossref improved version III :")
    print(match_count)
    print(match_result)
    
print("********************************************************************************************************")    

#IV
tool_output = {}
tool_names = ner_data.columns[-4:]
for tool in tool_names:
    tool_output[tool] = set([punct_removal(and_replacement(det_removal(case_lowering(ele))[0])) for ele in text_processing(ner_data, tool)])
    match_count, match_result = result_matching(tool_output[tool],orga_dict_IV)
    result_match_count['NER improved version IV vs Crossref improved version IV'][tool] = int(len(tool_output[tool]))
    result_match_count['NER improved version IV vs Crossref improved version IV'][tool+"_match"] = int(match_count)
    result_match_count['NER improved version IV vs Crossref improved version IV'][tool+"_match_%"] = match_count*100/len(tool_output[tool])
    result_match['NER improved version IV vs Crossref improved version IV'][tool] = match_result
    print("The match count between " + tool + " improved version IV and Crossref improved version IV :")
    print(match_count)
    print(match_result)

The match count between NER_spacy_sm_org improved version I and Crossref improved version I :
9337
['international tennis federation', 'national taiwan ocean university', 'global health', 'innovate uk', 'laqv', 'fmd', 'industrial technology research institute', 'hospital universiti sains malaysia', 'national council for scientific and technological development', 'royal institute of technology', 'autonomous university of chapingo', 'montreal general hospital foundation', 'australian genome research facility', 'society for maternal fetal medicine', 'dh', 'shanghai municipal health and family planning commission', 'tsgh', 'beth israel deaconess medical center', 'frc', 'beijing foreign studies university', 'astra', 'spa', 'agria pet insurance', 'lcm', 'university of dar es salaam', 'hlf', 'cai', 'diabetes australia', 'centre for cognitive ageing and cognitive epidemiology', 'aag', 'promotion and mutual aid corporation for private schools of japan', 'bialystok university of technology', 'huna

The match count between NER_spacy_md_org improved version I and Crossref improved version I :
9567
['international tennis federation', 'national taiwan ocean university', 'global health', 'innovate uk', 'laqv', 'industrial technology research institute', 'national council for scientific and technological development', 'royal institute of technology', 'autonomous university of chapingo', 'montreal general hospital foundation', 'australian genome research facility', 'society for maternal fetal medicine', 'dh', 'tsgh', 'beth israel deaconess medical center', 'frc', 'beijing foreign studies university', 'astra', 'spa', 'agria pet insurance', 'lcm', 'university of dar es salaam', 'schweizerische herzstiftung', 'hlf', 'cai', 'diabetes australia', 'sums', 'miller institute', 'promotion and mutual aid corporation for private schools of japan', 'bialystok university of technology', 'hunan provincial science and technology plan project', 'national social science foundation', 'embrapa genetic resou

The match count between NER_spacy_lg_org improved version I and Crossref improved version I :
9593
['international tennis federation', 'national taiwan ocean university', 'global health', 'innovate uk', 'fmd', 'industrial technology research institute', 'hospital universiti sains malaysia', 'national council for scientific and technological development', 'mdic', 'royal institute of technology', 'autonomous university of chapingo', 'mazums', 'montreal general hospital foundation', 'australian genome research facility', 'dh', 'shanghai municipal health and family planning commission', 'tsgh', 'beth israel deaconess medical center', 'frc', 'beijing foreign studies university', 'astra', 'spa', 'agria pet insurance', 'lcm', 'schweizerische herzstiftung', 'hlf', 'cai', 'diabetes australia', 'sums', 'aag', 'promotion and mutual aid corporation for private schools of japan', 'bialystok university of technology', 'hunan provincial science and technology plan project', 'national social science fou

The match count between NER_Flair_org improved version I and Crossref improved version I :
11661
['shinshu university', 'loughborough university', 'zaozhuang university', 'international tennis federation', 'national taiwan ocean university', 'seoul national university', 'hasso plattner foundation', 'sppu', 'kreitman school of advanced graduate studies', 'innovate uk', 'laqv', 'fmd', 'industrial technology research institute', 'baylor university medical center', 'epidemiological and clinical research information network', 'hospital universiti sains malaysia', 'srcd', 'national council for scientific and technological development', 'nidilrr', 'uclh', 'mdic', 'inee', 'royal institute of technology', 'b', 'autonomous university of chapingo', 'national agrarian university', 'hp', 'montreal general hospital foundation', 'australian genome research facility', 'huashan hospital', 'society for maternal fetal medicine', 'dh', 'shanghai municipal health and family planning commission', 'ocp foundat

The match count between NER_spacy_sm_org improved version II and Crossref improved version II :
7774
['oregon clinical & translational research institute', 'shenyang science & technology bureau', 'national taiwan ocean university', 'global health', 'innovate uk', 'laqv', 'fmd', 'industrial technology research institute', 'hospital universiti sains malaysia', 'royal institute of technology', 'fujian provincial science & technology department', 'dh', 'university of electronic science & technology of china', 'tsgh', 'beth israel deaconess medical center', 'frc', 'beijing foreign studies university', 'astra', 'spa', 'agria pet insurance', 'lcm', 'the university grants committee', 'hlf', 'cai', 'diabetes australia', 'aag', 'the manitoba centre for health policy', 'bialystok university of technology', 'national social science foundation', 'ubc', 'naval medical research center', 'petrus och augusta hedlunds stiftelse', 'wp', 'virbac', 'university of macedonia', 'southwest hospital', 'als associ

The match count between NER_spacy_md_org improved version II and Crossref improved version II :
8057
['shenyang science & technology bureau', 'national taiwan ocean university', 'global health', 'innovate uk', 'laqv', 'industrial technology research institute', 'royal institute of technology', 'fujian provincial science & technology department', 'dh', 'university of electronic science & technology of china', 'tsgh', 'beth israel deaconess medical center', 'frc', 'beijing foreign studies university', 'astra', 'spa', 'agria pet insurance', 'lcm', 'the university grants committee', 'schweizerische herzstiftung', 'hlf', 'cai', 'diabetes australia', 'sums', 'the manitoba centre for health policy', 'bialystok university of technology', 'national social science foundation', 'ubc', 'naval medical research center', 'wuhan national biosafety laboratory', 'national program for support of top-notch young professionals', 'wp', 'virbac', 'a-step', 'university of macedonia', 'southwest hospital', 'arsl

The match count between NER_spacy_lg_org improved version II and Crossref improved version II :
8085
['national taiwan ocean university', 'global health', 'innovate uk', 'fmd', 'industrial technology research institute', 'hospital universiti sains malaysia', 'mdic', 'royal institute of technology', 'fujian provincial science & technology department', 'mazums', 'dh', 'university of electronic science & technology of china', 'tsgh', 'beth israel deaconess medical center', 'frc', 'beijing foreign studies university', 'astra', 'spa', 'agria pet insurance', 'lcm', 'the university grants committee', 'schweizerische herzstiftung', 'hlf', 'cai', 'diabetes australia', 'sums', 'aag', 'the manitoba centre for health policy', 'bialystok university of technology', 'national social science foundation', 'ubc', 'naval medical research center', 'virbac', 'university of macedonia', 'southwest hospital', 'arsla', 'als association', 'lg chem ltd.', 'lccc', 'thailand development research institute', 'sic', '

The match count between NER_Flair_org improved version II and Crossref improved version II :
11680
['shinshu university', 'loughborough university', 'oregon clinical & translational research institute', 'zaozhuang university', 'international tennis federation', 'national taiwan ocean university', 'seoul national university', 'hasso plattner foundation', 'the university of manchester', 'sppu', 'kreitman school of advanced graduate studies', 'innovate uk', 'laqv', 'fmd', 'industrial technology research institute', 'baylor university medical center', 'hospital universiti sains malaysia', 'srcd', 'nidilrr', 'uclh', 'mdic', 'inee', 'department of geology & geophysics', 'royal institute of technology', 'foundation for the promotion of health & biomedical research of valencia region', 'b', 'fujian provincial science & technology department', 'office of energy efficiency & renewable energy', 'national agrarian university', 'hp', 'montreal general hospital foundation', 'australian genome research

The match count between NER_spacy_sm_org improved version III and Crossref improved version III :
8143
['national taiwan ocean university', 'global health', 'innovate uk', 'laqv', 'fmd', 'industrial technology research institute', 'hospital universiti sains malaysia', 'royal institute of technology', 'dh', 'shanghai municipal health and family planning commission', 'vall dhebron university hospital', 'tsgh', 'beth israel deaconess medical center', 'frc', 'beijing foreign studies university', 'astra', 'spa', 'agria pet insurance', 'lcm', 'the university grants committee', 'hlf', 'the walter and eliza hall institute of medical research', 'cai', 'diabetes australia', 'aag', 'the manitoba centre for health policy', 'bialystok university of technology', 'hunan provincial science and technology plan project', 'national social science foundation', 'embrapa genetic resources and biotechnology', 'ubc', 'naval medical research center', 'petrus och augusta hedlunds stiftelse', 'wp', 'virbac', 'univ

The match count between NER_spacy_md_org improved version III and Crossref improved version III :
8411
['national taiwan ocean university', 'global health', 'innovate uk', 'laqv', 'fmd', 'industrial technology research institute', 'national council for scientific and technological development', 'royal institute of technology', 'dh', 'vall dhebron university hospital', 'tsgh', 'beth israel deaconess medical center', 'veracyte inc', 'frc', 'beijing foreign studies university', 'astra', 'spa', 'agria pet insurance', 'lcm', 'the university grants committee', 'schweizerische herzstiftung', 'hlf', 'cai', 'diabetes australia', 'sums', 'the manitoba centre for health policy', 'bialystok university of technology', 'hunan provincial science and technology plan project', 'national social science foundation', 'embrapa genetic resources and biotechnology', 'ior', 'ubc', 'naval medical research center', 'wuhan national biosafety laboratory', 'wp', 'virbac', 'university of macedonia', 'southwest hospit

The match count between NER_spacy_lg_org improved version III and Crossref improved version III :
8426
['national taiwan ocean university', 'global health', 'innovate uk', 'fmd', 'industrial technology research institute', 'hospital universiti sains malaysia', 'national council for scientific and technological development', 'mdic', 'royal institute of technology', 'mazums', 'dh', 'shanghai municipal health and family planning commission', 'vall dhebron university hospital', 'tsgh', 'beth israel deaconess medical center', 'veracyte inc', 'frc', 'beijing foreign studies university', 'astra', 'spa', 'agria pet insurance', 'lcm', 'the university grants committee', 'schweizerische herzstiftung', 'hlf', 'cai', 'diabetes australia', 'sums', 'aag', 'the manitoba centre for health policy', 'bialystok university of technology', 'hunan provincial science and technology plan project', 'national social science foundation', 'embrapa genetic resources and biotechnology', 'ior', 'ubc', 'naval medical re

The match count betweenNER_Flair_orgimproved version III and Crossref improved version III :
12027
['shinshu university', 'loughborough university', 'mngha', 'zaozhuang university', 'international tennis federation', 'national taiwan ocean university', 'seoul national university', 'hasso plattner foundation', 'the university of manchester', 'sppu', 'kreitman school of advanced graduate studies', 'innovate uk', 'laqv', 'fmd', 'industrial technology research institute', 'baylor university medical center', 'hospital universiti sains malaysia', 'srcd', 'epidemiological and clinical research information network', 'mws', 'national council for scientific and technological development', 'nidilrr', 'uclh', 'mdic', 'inee', 'royal institute of technology', 'b', 'national agrarian university', 'hp', 'montreal general hospital foundation', 'australian genome research facility', 'huashan hospital', 'society for maternal fetal medicine', 'dh', 'shanghai municipal health and family planning commission

The match count betweenNER_spacy_sm_orgimproved version IV and Crossref improved version IV :
9734
['oregon clinical & translational research institute', 'shenyang science & technology bureau', 'international tennis federation', 'national taiwan ocean university', 'global health', 'innovate uk', 'laqv', 'fmd', 'industrial technology research institute', 'hospital universiti sains malaysia', 'royal institute of technology', 'fujian provincial science & technology department', 'earmarked fund for modern agroindustry technology research system', 'autonomous university of chapingo', 'montreal general hospital foundation', 'australian genome research facility', 'society for maternal fetal medicine', 'dh', 'vall dhebron university hospital', 'university of electronic science & technology of china', 'tsgh', 'beth israel deaconess medical center', 'frc', 'beijing foreign studies university', 'astra', 'spa', 'agria pet insurance', 'lcm', 'university of dar es salaam', 'hlf', 'cai', 'diabetes au

The match count betweenNER_spacy_md_orgimproved version IV and Crossref improved version IV :
9961
['shenyang science & technology bureau', 'international tennis federation', 'national taiwan ocean university', 'global health', 'innovate uk', 'laqv', 'fmd', 'industrial technology research institute', 'royal institute of technology', 'fujian provincial science & technology department', 'earmarked fund for modern agroindustry technology research system', 'autonomous university of chapingo', 'montreal general hospital foundation', 'australian genome research facility', 'society for maternal fetal medicine', 'dh', 'vall dhebron university hospital', 'university of electronic science & technology of china', 'tsgh', 'beth israel deaconess medical center', 'veracyte inc', 'frc', 'beijing foreign studies university', 'astra', 'spa', 'agria pet insurance', 'lcm', 'university of dar es salaam', 'schweizerische herzstiftung', 'hlf', 'cai', 'diabetes australia', 'sums', 'miller institute', 'bialys

The match count betweenNER_spacy_lg_orgimproved version IV and Crossref improved version IV :
9969
['oregon clinical & translational research institute', 'national taiwan ocean university', 'international tennis federation', 'global health', 'innovate uk', 'fmd', 'industrial technology research institute', 'hospital universiti sains malaysia', 'mdic', 'royal institute of technology', 'fujian provincial science & technology department', 'earmarked fund for modern agroindustry technology research system', 'autonomous university of chapingo', 'mazums', 'montreal general hospital foundation', 'australian genome research facility', 'dh', 'vall dhebron university hospital', 'university of electronic science & technology of china', 'tsgh', 'beth israel deaconess medical center', 'veracyte inc', 'frc', 'beijing foreign studies university', 'astra', 'spa', 'agria pet insurance', 'lcm', 'schweizerische herzstiftung', 'hlf', 'cai', 'diabetes australia', 'sums', 'aag', 'bialystok university of tec

The match count betweenNER_Flair_orgimproved version IV and Crossref improved version IV :
11999
['shinshu university', 'loughborough university', 'oregon clinical & translational research institute', 'mngha', 'zaozhuang university', 'international tennis federation', 'national taiwan ocean university', 'seoul national university', 'hasso plattner foundation', 'sppu', 'kreitman school of advanced graduate studies', 'innovate uk', 'laqv', 'fmd', 'industrial technology research institute', 'baylor university medical center', 'hospital universiti sains malaysia', 'srcd', 'mws', 'nidilrr', 'uclh', 'mdic', 'inee', 'royal institute of technology', 'foundation for the promotion of health & biomedical research of valencia region', 'b', 'fujian provincial science & technology department', 'office of energy efficiency & renewable energy', 'autonomous university of chapingo', 'national agrarian university', 'hp', 'montreal general hospital foundation', 'australian genome research facility', 'huas

In [30]:
# copy the match counts of each "NER improved vs Crossref improved" comparison into result_df
for version in ['I', 'II', 'III', 'IV']:
    key = 'NER improved version ' + version + ' vs Crossref improved version ' + version
    result_df.loc[key] = pd.Series(result_match_count[key])


In [31]:
result_df

| Comparison | NER_spacy_sm_org | NER_spacy_sm_org_match | NER_spacy_sm_org_match_% | NER_spacy_md_org | NER_spacy_md_org_match | NER_spacy_md_org_match_% | NER_spacy_lg_org | NER_spacy_lg_org_match | NER_spacy_lg_org_match_% | NER_Flair_org | NER_Flair_org_match | NER_Flair_org_match_% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Crossref baseline vs NER baseline | 97859.0 | 7762.0 | 7.93182 | 102144.0 | 8045.0 | 7.876136 | 101777.0 | 8071.0 | 7.930082 | 75465.0 | 11682.0 | 15.480024 |
| Crossref baseline vs NER improved version I | 92970.0 | 9294.0 | 9.996773 | 96974.0 | 9522.0 | 9.819127 | 96723.0 | 9544.0 | 9.867353 | 74910.0 | 11609.0 | 15.497263 |
| Crossref baseline vs NER improved version II | 97688.0 | 7348.0 | 7.521906 | 101971.0 | 7584.0 | 7.437409 | 101618.0 | 7611.0 | 7.489815 | 75256.0 | 10882.0 | 14.459977 |
| Crossref baseline vs NER improved version III | 96143.0 | 7811.0 | 8.124356 | 100323.0 | 8057.0 | 8.03106 | 99956.0 | 8064.0 | 8.06755 | 73372.0 | 11590.0 | 15.796217 |
| Crossref baseline vs NER improved version IV | 91058.0 | 8643.0 | 9.491753 | 94958.0 | 8809.0 | 9.276733 | 94712.0 | 8805.0 | 9.296604 | 72601.0 | 10741.0 | 14.794562 |
| NER baseline vs Crossref improved version I | 97859.0 | 7244.0 | 7.402487 | 102144.0 | 7522.0 | 7.364113 | 101777.0 | 7536.0 | 7.404423 | 75465.0 | 11584.0 | 15.350162 |
| NER baseline vs Crossref improved version II | 97859.0 | 7309.0 | 7.468909 | 102144.0 | 7533.0 | 7.374883 | 101777.0 | 7558.0 | 7.426039 | 75465.0 | 10757.0 | 14.25429 |
| NER baseline vs Crossref improved version III | 97859.0 | 7516.0 | 7.680438 | 102144.0 | 7756.0 | 7.593202 | 101777.0 | 7767.0 | 7.63139 | 75465.0 | 11426.0 | 15.140794 |
| NER baseline vs Crossref improved version IV | 97859.0 | 6611.0 | 6.755638 | 102144.0 | 6798.0 | 6.65531 | 101777.0 | 6799.0 | 6.680291 | 75465.0 | 10418.0 | 13.805075 |
| NER improved version I vs Crossref improved version I | 92970.0 | 9337.0 | 10.043025 | 96974.0 | 9567.0 | 9.865531 | 96723.0 | 9593.0 | 9.918013 | 74910.0 | 11661.0 | 15.56668 |
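
The `_match_%` columns appear to be the match count divided by the number of extracted names, multiplied by 100. The sketch below checks that reading against the "Crossref baseline vs NER baseline" row; the helper `match_percentage` and the dictionary `baseline_counts` are hypothetical names introduced for illustration and are not part of the notebook.

```python
def match_percentage(n_matched, n_extracted):
    """Share of extracted organization names that were matched, in percent."""
    if n_extracted == 0:
        return 0.0  # avoid division by zero when a tool extracted nothing
    return 100.0 * n_matched / n_extracted


# (extracted names, matched names), copied from the
# "Crossref baseline vs NER baseline" row of the table above.
baseline_counts = {
    "NER_spacy_sm_org": (97859, 7762),
    "NER_spacy_md_org": (102144, 8045),
    "NER_spacy_lg_org": (101777, 8071),
    "NER_Flair_org": (75465, 11682),
}

percentages = {
    tool: match_percentage(n_matched, n_extracted)
    for tool, (n_extracted, n_matched) in baseline_counts.items()
}

for tool, pct in percentages.items():
    print(f"{tool}: {pct:.6f} %")
```

Under this assumption the recomputed values agree with the table to the displayed precision. The table also shows that Flair extracts fewer names than the spaCy models while matching roughly twice the share of them.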


**6. Storing the results in a dataframe**

In [32]:
t_1 = timeit.default_timer()
print("The time elapsed: ", t_1 - t_0)

The time elapsed:  1717.890178708


In [33]:
result_df.to_csv('../data/Result_Match.csv')

**7. Future work**

1. Randomly sample from the tool output and perform subword analysis, i.e. match organization names that appear as substrings of longer extracted names. Check the results against the original articles.

2. Try out new NER tools.

3. Train our own NER tool.
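
The first item above could start from something like the following sketch. It is not code from this notebook: the function `find_substring_matches` and the toy name lists are assumptions made for illustration. It looks for registry names that occur as whole words inside an extracted name that did not match exactly, then draws a small random sample for manual checking against the original articles.

```python
import random
import re


def find_substring_matches(ner_names, registry_names):
    """Map each non-exact NER name to the registry names it contains as whole words."""
    registry = set(registry_names)
    hits = {}
    for name in ner_names:
        if name in registry:
            continue  # exact matches are already counted by the main pipeline
        contained = [
            reg for reg in registry
            # lookarounds make sure the registry name is not part of a longer word
            if re.search(r"(?<!\w)" + re.escape(reg) + r"(?!\w)", name)
        ]
        if contained:
            hits[name] = sorted(contained)
    return hits


# Toy data (hypothetical); the real inputs would be the lowercased NER
# output and the Crossref Funder Registry names.
ner_names = [
    "department of physics seoul national university",
    "innovate uk",
    "the diabetes australia research trust",
    "subcai",
]
registry_names = ["seoul national university", "innovate uk", "diabetes australia", "cai"]

hits = find_substring_matches(ner_names, registry_names)

# Draw a reproducible random sample of candidates for manual review.
rng = random.Random(0)
sample = rng.sample(sorted(hits), k=min(2, len(hits)))
for name in sample:
    print(name, "->", hits[name])
```

This brute-force comparison checks every extracted name against every registry name, so on the full data it would likely need an index or a pre-filter to run in reasonable time.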

