# Gateway to Research

This repo processes and produces indicators based on the Gateway to Research data.

See [here](https://github.com/nestauk/gtr_data_processing) for a fuller description of data processing and enrichment.

We will use a dataset with information about projects to create the following indicators:

* Level of activity in and funding received by discipline (focusing on projects led by organisations in a location)
* Number of participations in research (to capture research participation by organisations located in NUTS areas without many universities)
* Dyadic collaborations (instances where projects include pairs of organisations from the same location)

**NOTE**

In this version of the notebook we use `data_getters_lab` to get a processed version of the GtR data from Nesta DAPS. In a future version we will make the raw data available somewhere that non-Nesta researchers can access it, such as an AWS bucket.

## Preamble

In [28]:
from ast import literal_eval
import random
from beis_indicators.utils.nuts_utils import auto_nuts2_uk

In [2]:
%run ../notebook_preamble.ipy

In [3]:
# Functions etc down here

### Functions

In [4]:
def make_dirs(name,dirs = ['raw','processed']):
    '''
    Utility that creates directories to save the data
    
    '''
    
    for d in dirs:
        if name not in os.listdir(f'../../data/{d}'):
            os.mkdir(f'../../data/{d}/{name}')
            
def flat_freq(a_list):
    '''
    Return value counts for categories in a nested list
    
    '''
    return(pd.Series([x for el in a_list for x in el]).value_counts())

        

def flatten_list(a_list):
    '''
    Flatten a nested list into a single list
    
    '''
    return([x for el in a_list for x in el])

        

In [5]:
def parse_gtr_data(df,vars_to_parse):
    '''
    
    This function parses strings into lists
    
    Args:
        df is the df whose columns we want to parse
        vars_to_parse is a list with the variables to parse
    
    '''
    
    #If the column is in the list above, then parse it
    for c in df.columns:
    
        if c in vars_to_parse:
            df[c] = [literal_eval(x) for x in df[c]]

    return(df)
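
As a minimal sketch of what the parsing step does, the toy example below turns stringified lists (as they come out of a CSV) back into Python lists with `literal_eval`. The dataframe and its values are made up for illustration and are not taken from the GtR data.

```python
from ast import literal_eval

import pandas as pd

# Hypothetical frame: list-like columns are stored as strings after a CSV round trip
toy = pd.DataFrame({
    "title": ["project a", "project b"],
    "lead_lad_code": ["['E08000003']", "[]"],
})

# literal_eval safely evaluates each string into the list it represents
toy["lead_lad_code"] = toy["lead_lad_code"].apply(literal_eval)

print(toy["lead_lad_code"].tolist())
```

Note that `literal_eval` raises an error on values that are not valid Python literals (for example missing values), so real data may need cleaning first.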

In [6]:
#Hopefully this will allow us to convert into the new variables

def convert_to_nut(list_of_lads,lookup,value):
    '''
    This function converts a list where every element is a list of LAD codes into a list where every element is a list of NUTS codes (or names)
    
    Arguments:
        list_of_lads (iterable) is an iterable where every element is a list of LAD codes.
        lookup (dict) is a lookup between LAD codes and NUTS2 codes and names
        value (str) is the lookup key that selects whether we output NUTS codes or names
        
    
    '''
    
    #Note that we have some control flow to deal with LADS missing from the lookup (it might happen) 
    out = [[lookup[x][value] if x in lookup.keys() else np.nan for x in el] for el in list_of_lads]
    
    return(out)

def convert_to_nut_multiple(df,var):
    '''
    
    This function automates some of the above, e.g. we can choose one variable prefix (lead, all) and it automatically converts its LAD codes to NUTS codes and names
    
    Note - this directly transforms the input df, and relies on the global variables lads_to_nuts_lookup and nut_vars
    
    Arguments:
        df (df) is the dataframe where we want to add the converted variables
        var (str) is the prefix of the column with lists of LAD codes ({var}_lad_code) that we want to convert to NUTS codes and names
    
    '''
    
    
    df[f'{var}_nut_code'],df[f'{var}_nut_name'] = [convert_to_nut(df[f'{var}_lad_code'],lads_to_nuts_lookup,n) for n in nut_vars]
    
    #return(df)
    

In [7]:
def make_geo_var_stats(df,geo='nut',var='disc_top_discipline'):
    '''
    This function takes a df with project activity and creates discipline project counts and total amounts by geography
    
    Arguments:
        df (df) is a dataframe where one of the columns is the geography and another the top discipline (or funder - we could change this)
        geo (str) is the variable we want to use in the geo analysis
        var (str) is the variable we want to get project counts and amounts of funding for
    
    '''
    
    df_2 = df.copy()

    
    #Extract the first name and code from each list (the loop variable is called col so that it is not confused with the var argument)
    df_2[f'lead_{geo}_name'],df_2[f'lead_{geo}_code']= [[x[0] if len(x)>0 else np.nan for x in df_2[col]] for col in [f'lead_{geo}_name',
                                                                                        f'lead_{geo}_code']]
    
    
    #Project frequencies by variable and geography
    project_geo_counts = df_2.groupby([f'lead_{geo}_name',f'lead_{geo}_code','year'])[var].value_counts()

    project_geo_counts.name = 'project_count'

    #Pivot to create a wide version

    project_wide = project_geo_counts.reset_index(drop=False).pivot_table(index=[f'lead_{geo}_name',f'lead_{geo}_code','year'],
                                                                columns=var,values='project_count',aggfunc='sum').fillna(0)

    project_wide.columns = [x+'_project_n' for x in project_wide]
    
    #Project funding by discipline
    project_geo_funding = df_2.groupby([f'lead_{geo}_name',f'lead_{geo}_code',var,'year'])['amount'].sum()

    fund_wide = project_geo_funding.reset_index(drop=False).pivot_table(index=[f'lead_{geo}_name',f'lead_{geo}_code','year'],
                                                                columns=var,values='amount',aggfunc='sum').fillna(0)

    fund_wide.columns = [x+'_funding_gpb' for x in fund_wide]
    
    out = pd.concat([project_wide,fund_wide],axis=1)
    
    return(out)
    

In [8]:
# %load ../utilities.py
# Some utilities

def make_data_dict(table,name,path,sample=5):
    '''
    A function to output the form for a data dictionary
    
    Args:
        -table (df) is the df we want to create the data dictionary for
        -name (str) of the df
        -path (str) is the place where we want to save the file
        -sample (int) is the number of values sampled from each column to estimate its type
        

    
    '''
    
    types = [estimate_type(table[x],sample=sample) for x in table.columns]
        
    data_dict = pd.DataFrame()
    data_dict['variable'] = table.columns
        
    data_dict['type'] = types
    
    data_dict['description'] = ['' for x in data_dict['variable']]
        
    out = os.path.join(path,f'{today_str}_{name}.csv')
    
    #print(data_dict.columns)
    
    data_dict.to_csv(out)
    

def estimate_type(variable,sample):
    '''
    Estimates the type of a column. 

    Args:
        variable (iterable) with values
        sample (int) is the number of values to test
    
    '''
    
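    values = list(variable)
    
    #Cap the sample size so that columns shorter than the sample do not raise a ValueError
    selection = random.sample(values,min(sample,len(values)))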
    
    types = pd.Series([type(x) for x in selection]).value_counts().sort_values(ascending=False)
    
    return(types.index[0])

                           
                           
    
    

In [37]:
def make_indicator(table,target_path,var_lookup,year_var,nuts_var='nuts_code',nuts_spec=2018,decimals=3):
    '''
    We use this function to create and save indicators using our standardised format.
    
    Args:
        table (df) is a df with relevant information
        target_path (str) is the location of the directory where we want to save the data (includes interim and processed)
        var_lookup (dict) is a lookup to rename the variable into our standardised name
        year_var (str) is the name of the year variable
        nuts_var (str) is the name of the NUTS code variable. We assume it is nuts_code
        nuts_spec (int) is the year of the NUTS specification. We assume we are working with 2018 NUTS
        decimals (int) is the number of decimals to round to. If 0, values are converted to integers
    
    '''
    #Reset the index, which also returns a new df (we assume that the index is the nuts code, var name and year - this might need to be changed)
    t = table.reset_index(drop=False)
    
    
    
    #Process the interim data into an indicator
    
    #This is the variable name and code
    var_name = list(var_lookup.keys())[0]
    
    var_code = list(var_lookup.values())[0]
    
    #Focus on those (we copy so that the assignments below do not trigger a SettingWithCopyWarning)
    t = t[[year_var,nuts_var,var_name]].copy()
    
    #Add the nuts specification
    t['nuts_year_spec'] = nuts_spec
    
    #Rename variables
    t.rename(columns={var_name:var_code,year_var:'year',nuts_var:'nuts_id'},inplace=True)

    #Round variables
    t[var_code] = [np.round(x,decimals) if decimals>0 else int(x) for x in t[var_code]]
    
    
    #Reorder variables
    t = t[['year','nuts_id','nuts_year_spec',var_code]]
    
    print(t.head())
    
    #Save in the processed folder
    t.to_csv(f'../../data/processed/{target_path}/{var_code}.csv',index=False)
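
To make the standardised format concrete, here is a minimal sketch of the reshaping that `make_indicator` performs, without the file output. The input column names (`nuts_code`, `yr`, `stem_projects`) are hypothetical; the two rows reuse values shown in the output further down.

```python
import pandas as pd

# Hypothetical interim table; only the output column names follow the standard format
toy = pd.DataFrame({
    "nuts_code": ["UKH2", "UKH2"],
    "yr": [2006, 2007],
    "stem_projects": [45.0, 71.0],
})

# Rename to the standard names and add the NUTS specification year
indicator = toy.rename(columns={
    "yr": "year",
    "nuts_code": "nuts_id",
    "stem_projects": "total_gtr_projects_stem",
}).assign(nuts_year_spec=2018)

# Counts are stored as integers (the decimals=0 case)
indicator["total_gtr_projects_stem"] = indicator["total_gtr_projects_stem"].astype(int)

# Standard column order: year, area, NUTS specification, indicator value
indicator = indicator[["year", "nuts_id", "nuts_year_spec", "total_gtr_projects_stem"]]

print(indicator)
```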

In [10]:
#Differently from other sources, we put the data in external because it has already been pre-processed
make_dirs('gtr',['processed','interim','raw'])

### Metadata

In [11]:
nuts_df = pd.read_csv('http://geoportal1-ons.opendata.arcgis.com/datasets/9b4c94e915c844adb11e15a4b1e1294d_0.csv')

In [12]:
#We create a lookup between LADS and NUTS codes and names
lads_to_nuts_lookup = {
    rid: {'NUTS218CD':row['NUTS218CD'],'NUTS218NM':row['NUTS218NM']} for rid,row in nuts_df.set_index('LAD18CD')[['NUTS218CD','NUTS218NM']].iterrows()}

In [13]:
#Also a nuts code to name lookup
nuts_code_to_name = {x['NUTS218CD']:x['NUTS218NM'] for rid,x in nuts_df.drop_duplicates('NUTS218CD').iterrows()}

## 1. Collect data

From Nesta data getters

In [14]:
from data_getters.labs.core import download_file


def get_gtr(file,file_path,progress=True):
    """ Fetch a processed Gateway to Research file

    Repo: https://github.com/nestauk/gtr_data_processing
    Commit: cd3cddb
    File: https://github.com/nestauk/gtr_data_processing/blob/master/notebooks/05_jmg_data_demo.ipynb

    Args:
        file (`str`): Name of the file to fetch.
        file_path (`str`): Directory to download to. The file name is appended to it.
        progress (`bool`, optional): If `True`, display download progress.
    """
    
    return download_file(file_to_fetch=file, download_path=file_path+file, progress=progress)

In [15]:
#Download the data and save in the folders we created before

In [16]:
gtr_org = get_gtr(file='17_9_2019_gtr_orgs.csv',file_path='../../data/raw/gtr/',progress=False)

gtr_proj = get_gtr(file='17_9_2019_gtr_projects.csv',file_path='../../data/raw/gtr/',progress=False)

2020-02-21 08:28:39,138 - botocore.credentials - INFO - Found credentials in shared credentials file: ~/.aws/credentials


## 2. Process data


### Projects

In [17]:
#Load the data
gtr_proj = pd.read_csv('../../data/raw/gtr/17_9_2019_gtr_projects.csv')

In [18]:
#Some tidying up

#Remove the unnamed columns
gtr_proj = gtr_proj[[x for x in gtr_proj.columns if 'Unnamed' not in x]]

#Parse lists
list_var = [x for x in gtr_proj.columns if '_lad_' in x]

gtr_proj = parse_gtr_data(gtr_proj,list_var)

In [19]:
# We drop the columns at positions 25 to 102 (zero-indexed), which include the modelled industry and SDG variables

gtr_proj = gtr_proj.iloc[:,[n for n in np.arange(0,len(gtr_proj.columns)) if n not in set(np.arange(25,103))]]

#### Geoprocessing

In previous work we geocoded the GtR data with LADS. Now we want to transfer this to NUTS2, the geographical unit of analysis for this project.

Let's do it
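
Before running the conversion on the full data, a small sketch of the idea: each project has a list of LAD codes, and each code is mapped to its NUTS2 code or name through a lookup, with NaN for codes missing from the lookup. The lookup entries and LAD lists below are toy values for illustration, and `lads_to_nuts` is a hypothetical stand-in for `convert_to_nut`.

```python
import numpy as np

# Toy lookup in the same shape as lads_to_nuts_lookup (illustrative values only)
toy_lookup = {
    "E08000003": {"NUTS218CD": "UKD3", "NUTS218NM": "Greater Manchester"},
    "E07000008": {"NUTS218CD": "UKH1", "NUTS218NM": "East Anglia"},
}


def lads_to_nuts(lad_lists, lookup, value):
    """Map every LAD code in every nested list to the chosen NUTS2 field."""
    return [
        [lookup[lad][value] if lad in lookup else np.nan for lad in lads]
        for lads in lad_lists
    ]


# One list of LAD codes per project; the third has a code missing from the lookup
toy_lads = [["E08000003"], ["E07000008", "E08000003"], ["X99999999"], []]

toy_codes = lads_to_nuts(toy_lads, toy_lookup, "NUTS218CD")
toy_names = lads_to_nuts(toy_lads, toy_lookup, "NUTS218NM")

print(toy_codes)
print(toy_names)
```

The output keeps one list per project and one element per LAD, so several LADs in the same NUTS2 area produce repeated NUTS2 codes within a project.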

In [20]:
#We use the functions defined above to convert lads to nuts
nut_vars = ['NUTS218CD','NUTS218NM']


convert_to_nut_multiple(gtr_proj,'lead')
convert_to_nut_multiple(gtr_proj,'all')

In [21]:
gtr_proj.head()

Unnamed: 0,index,project_id,title,abstract,year,funder,status,grant_category,amount,currency,...,involved_lad_code,involved_lad_name,lead_scot,inv_scot,inv_scot_n,short_abstract,lead_nut_code,lead_nut_name,all_nut_code,all_nut_name
0,0,9BCF2DEE-0911-4D95-9259-84AEC7D51AC6,Wars of Position: Communism and Civil Society,Coinciding with the centenary of the October r...,2016,AHRC,Closed,Fellowship,143026,GBP,...,[],[],False,False,0.0,False,[UKD3],[Greater Manchester],[UKD3],[Greater Manchester]
1,2,52405730-6906-4BDD-B9FC-FEB8E6C97E84,Burning fat: an in vivo and in vitro study of ...,People in the UK are getting fatter and this h...,2012,BBSRC,Closed,Research Grant,363926,GBP,...,"[E07000242, E07000041, E06000030, E07000178, E...","[East Hertfordshire, Exeter, Swindon, Oxford, ...",False,False,0.0,False,[UKH1],[East Anglia],"[UKH1, UKH2, UKK4, UKK1, UKJ1, UKJ1]","[East Anglia, Bedfordshire and Hertfordshire, ..."
2,3,CBA76D0D-2D95-4028-B8DE-65041D8D1F85,Atomic-Scale Characterisation of Reactor Press...,The reactor pressure vessel (RPV) is a safety ...,2018,EPSRC,Active,Studentship,0,GBP,...,[],[],False,False,0.0,False,[UKJ1],"[Berkshire, Buckinghamshire and Oxfordshire]",[UKJ1],"[Berkshire, Buckinghamshire and Oxfordshire]"
3,4,82E47A5E-A4C6-4F70-9242-821EFC9456D7,University of the West of Scotland Nuclear Phy...,It is just over 100 years since Ernest Rutherf...,2014,STFC,Active,Research Grant,379758,GBP,...,[],[],True,False,1.0,False,[UKM8],[West Central Scotland],[UKM8],[West Central Scotland]
4,6,05BE79C5-AF13-4A67-B042-3BCE38D575A2,Transcription factor hierarchies underlying th...,"Our sense of hearing, and the information we u...",2017,BBSRC,Active,Research Grant,351052,GBP,...,[],[],False,False,0.0,False,[UKH1],[East Anglia],[UKH1],[East Anglia]


#### New variables

**Discipline**

In [22]:
#Each project has a modelled discipline based on a predictive analysis that is fully reported in the source repo.

#We assign each project to its top discipline

disc_vars = [x for x in gtr_proj.columns if 'disc_' in x]

gtr_proj['disc_top_discipline'] = gtr_proj[disc_vars].idxmax(axis=1)
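
As an illustration of the top discipline assignment, the sketch below applies `idxmax(axis=1)` to a toy table of discipline scores. The column names and scores are made up; the real `disc_` variables come from the source repo.

```python
import pandas as pd

# Hypothetical discipline scores for three projects
toy = pd.DataFrame({
    "project_id": ["a", "b", "c"],
    "disc_physics": [0.7, 0.1, 0.2],
    "disc_biology": [0.2, 0.8, 0.3],
    "disc_arts": [0.1, 0.1, 0.5],
})

# Select the score columns before adding the new column, which also starts with disc_
toy_disc_vars = [c for c in toy.columns if c.startswith("disc_")]

# idxmax(axis=1) returns, for each row, the name of the column with the highest value
toy["disc_top_discipline"] = toy[toy_disc_vars].idxmax(axis=1)

print(toy[["project_id", "disc_top_discipline"]])
```

Because the new column name also contains `disc_`, the list of score columns needs to be built before the column is added; re-running the selection afterwards would pick it up as well.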

## 3. Create NUTS aggregations

We are going to create the following:

* Number of projects and level of funding led in the NUTS in various disciplines
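
A minimal sketch of this aggregation on toy data follows. It uses `pivot_table` directly rather than the `groupby` and `value_counts` route taken in `make_geo_var_stats`, and the rows, amounts and column suffixes are illustrative assumptions.

```python
import pandas as pd

# Toy projects: one row per project with its lead NUTS2 area, year, discipline and funding
toy = pd.DataFrame({
    "lead_nut_code": ["UKD3", "UKD3", "UKD3", "UKH1"],
    "year": [2016, 2016, 2016, 2017],
    "disc_top_discipline": ["disc_physics", "disc_physics", "disc_arts", "disc_physics"],
    "amount": [100, 50, 20, 300],
})

keys = ["lead_nut_code", "year"]

# Project counts by area, year and discipline (wide format, 0 where there are no projects)
counts = toy.pivot_table(index=keys, columns="disc_top_discipline",
                         values="amount", aggfunc="count", fill_value=0)
counts.columns = [f"{c}_project_n" for c in counts.columns]

# Total funding by area, year and discipline
funding = toy.pivot_table(index=keys, columns="disc_top_discipline",
                          values="amount", aggfunc="sum", fill_value=0)
funding.columns = [f"{c}_funding_gbp" for c in funding.columns]

toy_stats = pd.concat([counts, funding], axis=1)

print(toy_stats)
```

Each row of the result is one area and year, with one count column and one funding column per discipline.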


**Discipline aggregates**

In [23]:
#We use the function above to create the geo by discipline aggregates
disc = make_geo_var_stats(gtr_proj)

#disc.to_csv(f'../../data/interim/gtr/{today_str}_nuts_discipline_activity.csv')

## 4. Final processing and saving

In [24]:
with open('../../data/aux/gtr_stem_disciplines.txt','r') as infile:
    
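    #Skip empty lines (e.g. a trailing newline) so that they are not treated as column names
    stem = [line for line in infile.read().split('\n') if line]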

In [25]:
#Number of STEM projects per year and NUTS2 area
disc_aggregate = disc[stem].sum(axis=1).reset_index(drop=False)

In [26]:
#Some final prep before saving
#Rename columns
disc_aggregate.rename(columns={'lead_nut_code':'nuts_id',0:'total_gtr_projects_stem'},inplace=True)



In [27]:
disc_aggregate.drop('lead_nut_name',axis=1,inplace=True)

In [39]:
make_indicator(disc_aggregate,'gtr',
              {'total_gtr_projects_stem':'total_gtr_projects_stem'},'year',nuts_var='nuts_id',decimals=0)

   year nuts_id  nuts_year_spec  total_gtr_projects_stem
0  2006    UKH2            2018                       45
1  2007    UKH2            2018                       71
2  2008    UKH2            2018                      106
3  2009    UKH2            2018                       39
4  2010    UKH2            2018                       53


In [40]:
auto_nuts2_uk(pd.read_csv('../../data/processed/gtr/total_gtr_projects_stem.csv')
             ).to_csv('../../data/processed/gtr/total_gtr_projects_stem.csv',index=False)

## Excluded indicators

**These are not included in the final inventory so we have excluded them from the analysis (for now)**

* Number of participations in research
* Number of local collaborations in research (how many projects contain the same NUTS area more than once?)


**Volume of participation in research**

We also want to capture the level of participation from organisations in research even when they are not leading projects (i.e. sites of universities).

We will simply count instances when an organisation appears in a project
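
A small sketch of this count on toy data, assuming one list of NUTS2 codes per project as in `all_nut_code`. Codes are deduplicated within each project so that an area is counted once per project it takes part in; the lists are made up for illustration.

```python
import pandas as pd

# Toy lists of NUTS2 codes, one list per project (repeats within a project are possible)
toy_all_nut_code = [
    ["UKH1", "UKH2", "UKJ1", "UKJ1"],
    ["UKD3"],
    ["UKJ1", "UKD3"],
    [],
]

# Deduplicate within each project, then flatten and count across projects
unique_per_project = [sorted(set(codes)) for codes in toy_all_nut_code]
participation = pd.Series(
    [code for codes in unique_per_project for code in codes]
).value_counts()

print(participation)
```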


In [None]:
def final_process(nuts_freqs,lookup,name):
    '''
    
    This is to avoid repetition when doing the final processing of nuts frequency series
    
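    Args:
        nuts_freqs (series) is a frequency series where the index are nuts codes
        lookup (dict) is a lookup between nuts codes and names
        name (str) is the name we give to the frequency column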
    
    '''
    #Reset the index
    expand = nuts_freqs.reset_index(drop=False)
    
    #Add names using the lookup
    expand['nuts_name'] = expand['index'].map(lookup)
    
    #Rename columns
    expand.rename(columns={'index':'nuts_code',0:name},inplace=True)
    
    out = expand.set_index(['nuts_name','nuts_code'])
    
    return(out)
    
    

In [None]:
#Number of projects in which each NUTS 2 area participates at least once

research_participation = final_process(
    flat_freq([list(set(x)) for x in gtr_proj['all_nut_code']]),nuts_code_to_name,'proj_participation')

**Local collaborations**

Finally, we want to calculate how many projects involve local collaborations.

This one is a bit more complicated. We will do the following:

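* Extract all pairwise combinations from each project's NUTS list and concatenate them (this is an edge list)
* Convert each pair to a set. If its length is 1, both organisations are in the same NUTS area, so the pair is a local collaboration
* Drop the pairs with length greater than 1 and count the remaining ones by NUTS area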
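
The steps above can be sketched on toy data as follows. This version compares the two elements of each pair directly instead of building sets, which should give the same result; the NUTS2 lists are made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Toy lists of NUTS2 codes, one list per project
toy_all_nut_code = [
    ["UKI7", "UKI7", "UKD3"],  # one pair of organisations in the same area
    ["UKJ1", "UKJ1", "UKJ1"],  # three pairs of organisations in the same area
    ["UKH1", "UKD3"],          # no local pair
]

local_pairs = Counter()
for codes in toy_all_nut_code:
    # Every pair of organisations in the project is a potential collaboration
    for first, second in combinations(codes, 2):
        # Keep only pairs where both organisations are in the same NUTS2 area
        if first == second:
            local_pairs[first] += 1

print(local_pairs)
```

Note that this counts pairs rather than projects: an area that appears k times in a single project contributes k(k-1)/2 local collaborations, so projects with many participants from one area weigh heavily in the total.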

In [None]:
from itertools import combinations

In [None]:
#Nice list comprehension: create a list of pairwise combinations from the collaborations, and keep those whose set length is 1 (both NUTS are the same)
edge_list = [list(collab) for collab in [set(x) for x in flatten_list([list(combinations(x,2)) for x in gtr_proj['all_nut_code']])] if len(collab)==1]

In [None]:
local_collabs = final_process(flat_freq(edge_list),nuts_code_to_name,'local_collaborations')

In [None]:
research_activity = pd.concat([research_participation,local_collabs],axis=1)

In [None]:
research_activity.to_csv(f'../../data/interim/gtr/{today_str}_research_act_collab.csv')

#### Quick plot comparing project participation vs local collaboration

In [None]:
research_activity.sort_values('local_collaborations').apply(lambda x: x/x.sum()).plot.barh(figsize=(8,10))

West London is massively overrepresented in the local collaborations