# RWJF open data analysis

We have access to three datasets about the projects that RWJF supports:

1. Pioneers dataset: Information about grants awarded as part of the Pioneers programme, which focuses on innovations in the USA
2. Global dataset: Information about grants awarded as part of the Global programme, which focuses on innovations outside the USA
3. Open dataset: Information about all RWJF grants

Datasets 1 and 2 are relatively unstructured but contain rich text, whereas dataset 3 is well structured but contains little text.

We want to rapidly process and analyse these data in order to:

* Describe RWJF's funding portfolio: which topics does it support, and where?
* Enrich the data with additional information from sources such as GRID and CrunchBase, to map collaboration networks.



## Preamble

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
#Additional imports
import os
import ratelim
import re
import io
import urllib
import codecs
import bs4
import json

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from datetime import datetime
from nltk.corpus import stopwords

from analysis.src.nlp.lda_pipeline import LdaPipeline, CleanTokenize
from analysis.src.data.readnwrite import get_data_dir

# NLTK stop-word lists are named in lower case
stop = stopwords.words('english')

In [3]:
%matplotlib inline
# Open a standard set of directories

# Paths

# Get the top path
data_path = get_data_dir()
# Create the path for external data
ext_data = os.path.join(data_path, 'external')
# Raw data
raw_data = os.path.join(data_path, 'raw')
# And processed data
proc_data = os.path.join(data_path, 'processed')
# And interim data
inter_data = os.path.join(data_path, 'interim')
# And figures
fig_path = os.path.join(data_path, 'figures')

# Get date for saving files
today = datetime.today()

today_str = "_".join([str(x) for x in [today.day,today.month,today.year]])
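
The date string above is built as day_month_year without zero-padding, so files that carry it in their name will not sort chronologically. A minimal sketch of an alternative, assuming ISO-style dates are acceptable for this project; `dated_filename` is a hypothetical helper and not part of the `analysis` package:

```python
import os
from datetime import datetime

def dated_filename(directory, stem, ext='csv', when=None):
    """Build a path like <directory>/<stem>_YYYY-MM-DD.<ext> (hypothetical helper)."""
    # Default to today's date when no date is supplied
    when = when or datetime.today()
    return os.path.join(directory, '{}_{}.{}'.format(stem, when.strftime('%Y-%m-%d'), ext))
```

For example, `dated_filename(inter_data, 'rwjf_projects')` would give a path inside the interim data directory that ends with the current date.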

## 1. Load data

### 1.1 Load the Global and Pioneers data

In [4]:
def get_project_meta(project):
    '''
    This function takes a project and returns the name and the id (if they are available, this is not always the case)  
    '''
    
    if 'ID'  in project:
        #Split on the ID string to get the name
        name = project.split('ID')[0].strip()
        
        #Split on the ID string again to get what we want
        grant_id = re.sub('[#:]','',project.split('ID')[1].split('\n')[0].strip()).strip()   
    else:
        #If there is no ID we split on line breaks
        name = project.strip().split('\n')[0].strip()
        grant_id = np.nan

    #description = project.split('\n*')[1]
    return([name,grant_id])

def flatten_list(my_list):
    '''
    Turns a nested list into a flat list
    '''    
    flat = [x for el in my_list for x in el]    
    return(flat)
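
The split-on-`'ID'` approach in `get_project_meta` keeps anything that follows the number on the same line, and it also splits on the first occurrence of the letters `ID` anywhere in the project name. A hedged alternative, assuming grant ids are runs of digits that follow `ID` and optional `#` or `:` characters; `extract_grant_id` and `GRANT_ID_PATTERN` are hypothetical names:

```python
import re

# Assumption: ids look like "ID: 72884" or "ID# 72884" (digits only)
GRANT_ID_PATTERN = re.compile(r'\bID[#:\s]*(\d+)')

def extract_grant_id(project):
    """Return the first grant id found in the project text, or None if there is none."""
    match = GRANT_ID_PATTERN.search(project)
    return match.group(1) if match else None
```

This returns only the digits, so a trailing link on the same line is not carried into the id.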

In [5]:
def read_rwjf_data(file):
    '''
    Reads a project list from RWJF, tidies it up, and returns a list
    where each element holds the project name, grant id and description.
    '''
    #Load the data
    with open(os.path.join(raw_data, file), 'r') as myfile:
        data = myfile.read()
    
    #Split it based on the project separator and leave out the links at the top
    projects = data.split('\n________________\n')[1:]
    
    #Extract metadata
    project_meta = [get_project_meta(x) for x in projects]
    
    #Clean up the project info: drop line breaks and bullet markers, then lower-case.
    #Line breaks are removed without inserting a space, so words on either side of a break are joined.
    projects_clean = [re.sub(r'\* ', '', re.sub('\n', '', project)).lower() for project in projects]
    
    #Each element is [name, grant id, cleaned text]
    return [[name, grant_id, text] for (name, grant_id), text in zip(project_meta, projects_clean)]
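
Because `read_rwjf_data` deletes newlines outright, words on either side of a line break are joined (visible as `72884with` in the output further down). A sketch of a variant that replaces line breaks and bullet markers with a single space instead; `clean_project_text` is a hypothetical helper, and swapping it in would change the stored descriptions:

```python
import re

def clean_project_text(project):
    """Lower-case a project entry, dropping '* ' bullets and collapsing whitespace."""
    # Replace bullet markers with a space so neighbouring words stay separate
    no_bullets = re.sub(r'\*\s', ' ', project)
    # Collapse every run of whitespace (including newlines) into one space
    return re.sub(r'\s+', ' ', no_bullets).strip().lower()
```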

In [6]:
# Load both files
pio = read_rwjf_data('pioneer_grantees.txt')
glob = read_rwjf_data('global_grantees.txt') 

In [7]:
rw_df = pd.DataFrame(
    [x + ['pioneers'] for x in pio] + [x + ['global'] for x in glob],
    columns=['project', 'code', 'description', 'source_id'])

rw_df.to_csv(os.path.join(inter_data, 'rwjf_pioneer_and_global_projects.csv'), index=False)

rw_df.head()

| | project | code | description | source_id |
|---|---|---|---|---|
| 0 | Air Louisville (Propeller Health) (Paul Tarini... | 71592 | air louisville (propeller health) (paul tarini... | pioneers |
| 1 | AARP Catalyst (David) Grant | 72884 | aarp catalyst (david) grant id: 72884with supp... | pioneers |
| 2 | Applying Behavioral Economics to Perplexing Pr... | 69227 | applying behavioral economics to perplexing pr... | pioneers |
| 3 | Atlas of Caregiving (David Adler/Lori Melichar... | 72411 | atlas of caregiving (david adler/lori melichar... | pioneers |
| 4 | MIT Blockchain (Mike) Grant | 74694 [https//medrec.media.mit.edu/] | mit blockchain (mike) grant id: 74694 [https:/... | pioneers |


### 1.2 Load the RWJF open grant data

In [33]:
grant_data = pd.read_csv(os.path.join(raw_data, 'rwjf_grants.csv'))

In [34]:
len(grant_data)

7315

In [35]:
grant_data.head()

Unnamed: 0,about,address,amount_awarded,awarded,awarded_on,contacts,grant_number,location,organization,page,timeframe,title,topics,website,year,about_tokenized
0,The Robert Wood Johnson Clinical Scholars Prog...,University of North Carolina at Chapel Hill Sc...,"$865,484","$865,484",3/25/2009,Desmond Kimo Runyan,48349,"Chapel Hill, NC",University of North Carolina at Chapel Hill Sc...,1344,5/1/2009 - 4/30/2010,Technical assistance and direction for the Rob...,Health Disparities\nHealth Leadership Development,http://www.med.unc.edu/contact,2009,"['robert_wood', 'johnson', 'clinical', 'schola..."
1,The Robert Wood Johnson Clinical Scholars Prog...,University of North Carolina at Chapel Hill Sc...,"$801,175","$801,175",3/8/2010,Desmond Kimo Runyan,48350,"Chapel Hill, NC",University of North Carolina at Chapel Hill Sc...,1201,5/1/2010 - 4/30/2011,Technical assistance and direction for the Rob...,Health Disparities\nHealth Leadership Development,http://www.med.unc.edu/contact,2010,"['robert_wood', 'johnson', 'clinical', 'schola..."
2,This grant will continue long-standing support...,National Public Radio Inc.\n1111 North Capital...,"$2,800,000","$2,800,000",11/13/2008,Anne Gudenkauf,51491,"Washington, DC",National Public Radio Inc.,1391,11/15/2008 - 11/14/2011,Health and health care reporting by National P...,Public and Community Health\nHealth Care Quality,http://www.npr.org/,2008,"['grant', 'continue', 'long', 'standing', 'sup..."
3,This supplemental grant supports the nonpartis...,"George Washington University\n2121 Eye Street,...","$2,671,103","$2,671,103",7/30/2008,Judith Miller Jones,51492,"Washington, DC",George Washington University,1459,9/1/2008 - 11/30/2011,National Health Policy Forum,,http://www.gwu.edu/,2008,"['supplemental', 'grant_supports', 'nonpartisa..."
4,The Foundation's Summer Medical and Dental Edu...,Association of American Medical Colleges\n655 ...,"$1,252,432","$1,252,432",8/7/2008,Charles Terrell\nRichard W. Valachovic,53039,"Washington, DC",Association of American Medical Colleges,1444,9/1/2008 - 8/31/2009,Technical assistance and direction for RWJF's ...,Social Determinants of Health\nHealth Leadersh...,http://www.aamc.org,2008,"['foundation', 'summer', 'medical_dental', 'ed..."
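
The preview shows `amount_awarded` stored as text such as `$865,484` and `topics` as newline-separated labels, so both need parsing before funding can be summed by topic. A minimal sketch under those assumptions; `parse_amount` and `split_topics` are hypothetical helpers:

```python
def parse_amount(amount):
    """Convert a string like '$865,484' to an int; return None if it cannot be parsed."""
    if not isinstance(amount, str):
        return None  # e.g. a missing value read in as NaN
    digits = amount.replace('$', '').replace(',', '').strip()
    return int(digits) if digits.isdigit() else None

def split_topics(topics):
    """Split newline-separated topic labels into a list; missing values give []."""
    if not isinstance(topics, str):
        return []
    return [t.strip() for t in topics.split('\n') if t.strip()]
```

These could be applied column-wise with `grant_data['amount_awarded'].map(parse_amount)` and `grant_data['topics'].map(split_topics)`.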


In [36]:
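#Unfortunately they don't have the grant ids in their open dataset!

#Ten most frequent tokens in the Pioneers descriptions, after removing stop words.
#Splitting on single spaces leaves empty-string tokens wherever spaces are repeated.
pd.Series(flatten_list([[y for y in x[2].split(' ') if y not in stop] for x in pio])).value_counts()[:10]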


                1688
health           329
data             130
grant             95
help              81
care              70
description:      68
research          67
improve           65
people            65
dtype: int64
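
The top entry in the counts above is the empty string, which `split(' ')` produces wherever spaces are repeated. A sketch of a count that splits on any whitespace instead; the stop-word list is passed in as an argument, and `top_tokens` is a hypothetical helper:

```python
from collections import Counter

def top_tokens(texts, stop_words=(), n=10):
    """Count whitespace-separated tokens across texts, skipping stop words."""
    stop_set = set(stop_words)
    counts = Counter(
        token
        for text in texts
        for token in text.split()  # split() with no argument never yields ''
        if token not in stop_set
    )
    return counts.most_common(n)
```

With the objects defined earlier, this could be called as `top_tokens([x[2] for x in pio], stop)`.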