Data extracted from CC Grant Tracker:

* Log in to CC Grant Tracker
* Go to the Grants tab
* Open the Export tab (next to Search and Saved Filters)
* Click the "export..." button
* Tick "Strip HTML formatting from output"
* Export and save the file to data/CCGrantTracker

The file name must start with export_grants and have the .xlsx extension.
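
The naming rule above can be checked before anything is loaded. This is a minimal sketch, not part of the original notebook: the helper name `find_export` is hypothetical, and picking the most recently modified file is an assumption (the notebook itself takes the first glob hit).

```python
from pathlib import Path


def find_export(folder="data/CCGrantTracker", pattern="export_grants*.xlsx"):
    """Return the most recently modified file that follows the naming rule.

    Hypothetical helper; raises a descriptive error when nothing matches.
    """
    candidates = sorted(Path(folder).glob(pattern), key=lambda p: p.stat().st_mtime)
    if not candidates:
        raise FileNotFoundError(f"No file matching {pattern!r} in {folder!r}")
    return candidates[-1]
```

If used, the returned path could be passed straight to `pd.read_excel`, which gives a clearer failure than an `IndexError` when the export is missing or misnamed.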

In [1]:
import numpy as np
import pandas as pd
import glob
from fuzzywuzzy import process

In [2]:
# take the first matching export; raises IndexError if no file matches
grants_file = glob.glob('data/CCGrantTracker/export_grants*.xlsx')[0]
ccgt = pd.read_excel(grants_file)

# only consider approved grants
ccgt = ccgt[ccgt.Status=='Approved']
ccgt.reset_index(inplace=True,drop=True)

ccgt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73 entries, 0 to 72
Data columns (total 24 columns):
Reference                       73 non-null object
Lead Applicant Full Name        73 non-null object
Lead Applicant Email            73 non-null object
Title                           73 non-null object
Status                          73 non-null object
Outcome                         0 non-null object
Sus                             73 non-null object
Round                           73 non-null object
Organisation                    73 non-null object
Type                            73 non-null object
Rejection Source                0 non-null object
Total Requested                 73 non-null float64
Award Date                      73 non-null datetime64[ns]
Finance Reference               0 non-null float64
Current Award                   0 non-null float64
Start Date                      73 non-null datetime64[ns]
End Date                        73 non-null datetime64[ns]
Grant M

In [3]:
master = pd.read_excel('data/Award Holders Master.xlsx')

In [4]:
# exact title matches (ignoring spaces)
ccgt['MasterID'] = np.nan

# strip spaces once instead of on every loop iteration
ccgt_titles = ccgt.Title.str.replace(' ','')
master_titles = master['Project Title'].str.replace(' ','')
title_in_master = ccgt_titles.isin(master_titles)

for row in ccgt[title_in_master].index:
    # take the first master entry if several share the same title
    ccgt.loc[row,'MasterID'] = master.loc[master_titles==ccgt_titles.loc[row],'Grant Ref Unique'].iloc[0]

print(sum(title_in_master),'titles from ccgt match award master spreadsheet.')
print(sum(~title_in_master),'entries without match.')

64 titles from ccgt match award master spreadsheet.
9 entries without match.
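
The exact-match step can also be written without a loop. The sketch below uses toy data (the titles and IDs are invented, and `title_key`, `ccgt_demo`, `master_demo` are hypothetical names); it assumes that keeping the first ID for a repeated title is acceptable, as the `.iloc[0]` in the cell above does.

```python
import pandas as pd

# Toy stand-ins for the real spreadsheets; titles and IDs are made up.
ccgt_demo = pd.DataFrame(
    {"Title": ["Gene therapy trial", "Retina  imaging", "Unmatched project"]}
)
master_demo = pd.DataFrame({
    "Project Title": ["Gene therapytrial", "Retina imaging"],
    "Grant Ref Unique": ["G001", "G002"],
})


def title_key(titles):
    """Normalise titles the same way as the cell above: drop spaces only."""
    return titles.str.replace(" ", "", regex=False)


# Build a key -> ID lookup, keeping the first ID when a key repeats.
lookup = (
    master_demo.assign(key=title_key(master_demo["Project Title"]))
    .drop_duplicates("key")
    .set_index("key")["Grant Ref Unique"]
)

# Titles with no counterpart in the lookup come back as NaN.
ccgt_demo["MasterID"] = title_key(ccgt_demo["Title"]).map(lookup)
print(ccgt_demo)
```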


In [5]:
# try to match similar titles using fuzzywuzzy.process.extract
for row in ccgt[ccgt['MasterID'].isnull()].index:
    
    result = process.extract(ccgt.loc[row,'Title'],
                             master['Project Title'],
                             limit=1)[0]

    # result is (matched title, score, index label in master)
    matched_title = result[0]
    match_score = result[1]
    match_idx = result[2]
    
    if match_score>95:
        print('CCGT:',ccgt.loc[row,'Title'])      
        print('Master:',matched_title)
        print('score =',match_score)
        
        ccgt.loc[row,'MasterID'] = master.loc[match_idx,'Grant Ref Unique']
        print('----------------------------')

CCGT: Exploring the molecular mechanisms of Aldehyde dehydrogenase (ALDH)/Retinoic acid (RA)-mediated extracellular matrix production in activated mucous membrane pemphigoid (MMP) conjunctival fibroblasts.
Master: Exploring the molecular mechanisms of Aldehyde dehydrogenase (ALDH)/Retinoic acid (RA)-mediated extracellular matrix production in activated mucous membrane pemphigoid (MMP) conjunctival fibroblasts
score = 100
----------------------------
CCGT: Metabolomic and dietary profiling in choroideremia to determine mechanistic, prognostic and therapeutic biomarkers
Master: Metabolomic and dietary profiling in choroideremia to determine mechanistic, prognostic and therapeutic biomarkers.
score = 100
----------------------------
CCGT: GlaucoMirs: MicroRNA-based therapeutics to treat ocular fibrosis in glaucoma.
Master: GlaucoMirs: MicroRNA based therapeutics to treat ocular fibrosis in glaucoma
score = 100
----------------------------
CCGT: Functional MRI as a Potential Predictive Too
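
The scores of 100 above appear even though the titles differ in punctuation. That is because fuzzywuzzy's default preprocessing lowercases the strings and replaces non-alphanumeric characters with spaces before scoring. The sketch below only approximates that step with the standard library; `approx_full_process` is a hypothetical name, not a fuzzywuzzy function.

```python
import re


def approx_full_process(text):
    """Rough stand-in for fuzzywuzzy's default preprocessing (an approximation)."""
    text = re.sub(r"\W", " ", text)  # non-word characters become spaces
    return text.lower().strip()


ccgt_title = "GlaucoMirs: MicroRNA-based therapeutics to treat ocular fibrosis in glaucoma."
master_title = "GlaucoMirs: MicroRNA based therapeutics to treat ocular fibrosis in glaucoma"

print(ccgt_title == master_title)                                            # False
print(approx_full_process(ccgt_title) == approx_full_process(master_title))  # True
```

The same normalisation would explain why these titles were missed by the exact match, which only ignores spaces.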

In [6]:
print(sum(ccgt['MasterID'].isnull()),'remaining CCGT entries without a match in Award Master:')
ccgt.loc[ccgt['MasterID'].isnull(),
         ['Lead Applicant Full Name','Title','Round','Organisation']]

2 remaining CCGT entries without a match in Award Master:


    Lead Applicant Full Name  Title                                              Round                                               Organisation
48  Dr Hannah Levis           Creation of bio-synthetic corneal endothelial ...  Project Grants - Full Application 2016              University of Liverpool
52  Dr Nina Milosavljevic     Quantifying the visual restoration potential o...  Early Career Investigator Awards - Full Applic...  University of Manchester


In [7]:
ccgt.set_index('MasterID',inplace=True)
ccgt.to_excel('data/CCGrantTracker/CCGT_processed.xlsx')
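
Since MasterID becomes the index of the exported file, it may be worth checking it for gaps and repeats first. This is a sketch on toy data with invented IDs, not something the notebook does. It assumes unmatched rows carry a null MasterID, and that two CCGT rows could end up with the same ID (for example through a fuzzy match).

```python
import pandas as pd

# Toy frame; the IDs are invented for illustration.
demo = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "MasterID": ["G001", "G002", None, "G002"],
})

missing = demo["MasterID"].isnull()
# keep=False flags every row that shares an ID, not only the later repeats.
duplicated = demo["MasterID"].notnull() & demo["MasterID"].duplicated(keep=False)

print(missing.sum(), "rows without a MasterID")
print(duplicated.sum(), "rows sharing a MasterID with another row")
```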