# Medical Institute Matching Analysis #

## Set up ##

In [1]:
import pandas as pd
from tabulate import tabulate
from process import main

### Loading the dataframe

In [2]:
medical_institution = main("PubmedArticle.xml")
print(medical_institution)

100%|██████████| 3551/3551 [14:38<00:00,  4.04it/s]   


      Article_PMID                                      Article_title  \
0         31644467  Common and rare forms of vasculitis associated...   
1         31642912  The quantitative assessment of interstitial lu...   
2         31642912  The quantitative assessment of interstitial lu...   
3         31642912  The quantitative assessment of interstitial lu...   
4         31642912  The quantitative assessment of interstitial lu...   
...            ...                                                ...   
23297     23477430  Primary anetoderma associated with primary Sjö...   
23298     23462883  DPP4 inhibitor-induced polyarthritis: a report...   
23299     23303389  Lymphoma risk in systemic lupus: effects of di...   
23300     23271426  Cryoglobulinemic vasculitis in systemic sclero...   
23301     23011161  Refractory thrombotic thrombocytopenic purpura...   

                                        Article_keywords  \
0                                                   None   
1  

### Of all the rows in the table, how many Affiliation GRID names were matched successfully?

In [3]:
# Keep only the rows that have a PubMed affiliation string.
aff_value_df = medical_institution[medical_institution['Affiliation_name_PubMed'].notna()]

# Count the rows whose GRID name is non-null, i.e. where a match was found.
matched_count = aff_value_df["Affiliation_name_GRID"].notna().sum()
total = len(aff_value_df)
percentage_matched = (matched_count / total) * 100
print(percentage_matched)

11.90026607158184


Only about 11.9% of the rows that have a PubMed affiliation were matched to a GRID name.
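As a cross-check, the match rate can also be computed as the mean of a boolean mask. The sketch below uses a small made-up dataframe (the `toy` values are hypothetical, not taken from the real data) with the same two column names:

```python
import pandas as pd

# Hypothetical toy data: three rows with a PubMed affiliation, one of them matched.
toy = pd.DataFrame({
    "Affiliation_name_PubMed": ["A", "B", "C", None],
    "Affiliation_name_GRID": ["X", None, None, None],
})

# Restrict to rows that have a PubMed affiliation, as in the cell above.
with_affiliation = toy[toy["Affiliation_name_PubMed"].notna()]

# The mean of a boolean mask is the fraction of True values.
matched_fraction = with_affiliation["Affiliation_name_GRID"].notna().mean()
print(f"{matched_fraction:.1%}")  # prints 33.3%
```

Applied to `aff_value_df`, the same expression should give the fraction printed above, without the separate count and division.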

### Inspecting the successful Affiliation GRID matches

In [4]:

matches_found_df = aff_value_df[aff_value_df["Affiliation_name_GRID"].notna()]
df = matches_found_df[["Affiliation_name_PubMed", "Affiliation_name_GRID"]].head(20)
print(tabulate(df, headers='keys', tablefmt='psql'))

+-----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
|     | Affiliation_name_PubMed                                                                                                                                                                              | Affiliation_name_GRID                     |
|-----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------|
|  22 | Rheumatology Unit, Department of Medicine, University of Perugia, Italy.                                                                                                                             | University of Perugia                   

### Inspecting the unsuccessful Affiliation GRID names 

In [11]:
no_grid_found_df = aff_value_df[aff_value_df["Affiliation_name_GRID"].isna()]
df = no_grid_found_df[["Affiliation_name_PubMed",
                       "Affiliation_name_GRID", "Affiliation_country"]].head(20)
print(tabulate(df, headers='keys', tablefmt='psql'))

+----+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------+-----------------------+
|    | Affiliation_name_PubMed                                                                                                                                                                                                                                       | Affiliation_name_GRID   | Affiliation_country   |
|----+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------+-----------------------|
|  0 | Department of Pathophysiology, School of Medicine, Nat

### Inspecting institutes.csv file

In [12]:
from os import environ
from dotenv import load_dotenv
from process import load_grid_institute_names

load_dotenv()
institutes = load_grid_institute_names(environ["GRID_INSTITUTE_FILEPATH"])
print(institutes[0])

Australian National University


Now let's look at some of the university names in Affiliation_name_PubMed for the rows without a match.
For example, take the first row, which mentions the University of Athens.
Let's check whether that match failed because the institute is missing from the CSV file.

In [13]:
# Case-insensitive search for GRID institutes whose name contains the query.
query = 'University of Athens'.lower()
matching_institutes = [institute for institute in institutes if query in institute.lower()]
print(matching_institutes)

['National Technical University of Athens', 'National and Kapodistrian University of Athens', 'Agricultural University of Athens']


Three institutes in the CSV file contain 'University of Athens'.
One of them is 'National and Kapodistrian University of Athens'.
That name appears verbatim in the first row's Affiliation_name_PubMed value, 'Department of Pathophysiology, School of Medicine, National and Kapodistrian University of Athens, Athens, Greece.  ', so the institute is in the file and this row should have been matched.
A similar issue occurs for rows whose Affiliation_name_PubMed value includes 'Wuhan University'.
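To confirm that these rows are matchable in principle, one can test whether any GRID name occurs as a substring of the affiliation string. This is only a diagnostic sketch: `find_grid_matches` and `sample_institutes` are hypothetical names introduced here, and this is not necessarily how `process.main` performs its matching.

```python
def find_grid_matches(affiliation, institutes):
    """Return the GRID names found inside the affiliation string.

    The comparison is a case-insensitive substring test; results are
    ordered longest first so the most specific name comes first.
    """
    text = affiliation.lower()
    hits = [name for name in institutes if name.lower() in text]
    return sorted(hits, key=len, reverse=True)


# Hypothetical stand-in for the list returned by load_grid_institute_names.
sample_institutes = [
    "National Technical University of Athens",
    "National and Kapodistrian University of Athens",
    "Agricultural University of Athens",
    "Wuhan University",
]

affiliation = ("Department of Pathophysiology, School of Medicine, "
               "National and Kapodistrian University of Athens, Athens, Greece.  ")

matches = find_grid_matches(affiliation, sample_institutes)
print(matches)  # ['National and Kapodistrian University of Athens']
```

Running the same check over `no_grid_found_df["Affiliation_name_PubMed"]` with the full `institutes` list would show how many unmatched rows contain a GRID name verbatim. A plain substring test can also produce false positives for short institute names, so it is better suited to diagnosis than to replacing the matcher.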