## GtR data showcase

This notebook summarises the content of two Gateway to Research (GtR) datasets that we are using in the project to map innovation in Scotland.

## Preamble

In [None]:
% run notebook_preamble.ipy

In [None]:
from ast import literal_eval

In [None]:
def parse_gtr_data(df,vars_to_parse):
    '''
    
    This function parses strings into lists
    
    Args:
        df is the df whose columns we want to parse
        vars_to_parse is a list with the variables to parse
    
    '''
    
    #If the column is in the list above, then parse it
    for c in df.columns:
    
        if c in vars_to_parse:
            df[c] = [literal_eval(x) for x in df[c]]

    return(df)
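
The parsing is needed because list-valued columns come back from a CSV file as strings. A minimal sketch of what `literal_eval` does to one such cell; the example value is made up for illustration:

```python
from ast import literal_eval

# Hypothetical cell value: a list stored as a string after a CSV round trip
raw_cell = "['Glasgow City', 'City of Edinburgh']"

# literal_eval safely evaluates the string as a Python literal
parsed = literal_eval(raw_cell)

print(type(raw_cell).__name__)  # str
print(type(parsed).__name__)    # list
print(len(parsed))              # 2
```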

## Load datasets

### `proj`

In [None]:
proj = pd.read_csv('../data/temp_scotland/5_9_2019_gtr_projects_geo_labelled.csv',compression='zip')

#Remove the unnamed columns
proj = proj[[x for x in proj.columns if 'Unnamed' not in x]]

In [None]:
#We want to parse the lists in the data

list_var = [x for x in proj.columns if '_lad_' in x]

proj = parse_gtr_data(proj,list_var)

#### Content

In [None]:
proj.shape

In [None]:
len(set(proj['project_id']))

In [None]:
proj['grant_category'].value_counts(normalize=True).head()

In [None]:
proj['funder'].value_counts(normalize=True)

In [None]:
proj['abstract'].value_counts().head(n=7)

In [None]:
pd.Series([len(x) for x in proj['abstract']]).describe()

The dataset contains information about UKRI-funded projects involving all research councils *and* Innovate UK.

It includes all types of grants, although the majority are Research Grants.

All projects include abstracts, although a small number of them are very short.

During modelling we removed some uninformative abstracts but, as the cells above show, a small number of them remain. We remove all abstracts shorter than 300 characters.


In [None]:
rep_text = 'The recent discovery of the Higgs boson at the LHC was a major technical and scientific triumph, but it is not the end of the story. There are still many'

In [None]:
proj.loc[[rep_text in abst for abst in proj['abstract']]][['project_id','title','abstract','year','amount','grant_category','funder','all_lad_name']]

Note that there are some duplicate abstracts. They refer to long-term programmes of work with multiple projects.

In [None]:
#Drop ridiculously short abstracts?

proj['short_abstract'] = [len(x)<300 for x in proj['abstract']]

pd.crosstab(proj['grant_category'],proj['short_abstract'],normalize=0)[True].sort_values(ascending=False).plot.bar()

In [None]:
pd.crosstab(proj['funder'],proj['short_abstract'],normalize=0)[True].sort_values(ascending=False).plot.bar()

Short abstracts are mostly found in less academic grant types: Knowledge Transfer Partnerships (projects where a university researcher is embedded in industry) and SME projects.

In [None]:
#Keep abstracts with at least 300 characters (the complement of the short_abstract flag)
proj_2 = proj.loc[[len(x)>=300 for x in proj['abstract']]].reset_index(drop=False)

In [None]:
len(proj)-len(proj_2)

We lose ~5000 projects with short abstracts.
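
Note that the filter above counts characters, because `len` on a string returns its length in characters rather than words. A small sketch with made-up abstracts showing how the two counts differ:

```python
# Made-up abstracts for illustration only
abstracts = [
    "Awaiting public summary.",
    "word " * 80,  # 80 words, 400 characters
]

# For each abstract: (characters, words, flagged as short?)
summary = [(len(text), len(text.split()), len(text) < 300) for text in abstracts]

for n_chars, n_words, is_short in summary:
    print(n_chars, n_words, is_short)
```

The second abstract has only 80 words but passes a 300-character threshold, so a word-based cut-off would behave quite differently.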

#### Time

In [None]:
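#Reindex over every year in the range (inclusive) so that years without projects show as 0
proj_2['year'].value_counts().reindex(np.arange(min(proj_2['year']),max(proj_2['year'])+1)).fillna(0).plot()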

Most activity begins in 2006, so we remove projects from earlier years. We also remove projects in incomplete years (e.g. 2019 or 2020).



In [None]:
proj_3 = proj_2.loc[(proj_2['year']>=2006)&(proj_2['year']<2019)]

len(proj_2)-len(proj_3)

We lose a further 1280 projects.

#### Geo

All our geographical information here is contained in lists of the Local Authority Districts (LADs) participating in projects.

In [None]:
def flatten_freq(a_list):
    '''
    Create frequency of observations in a nested list
    
    '''
    
    return(pd.Series([x for el in a_list for x in el]).value_counts())

In [None]:
flatten_freq(proj_3['lead_lad_name']).head(n=10)

In [None]:
flatten_freq(proj_3['all_lad_name']).head(n=10)

As the above shows, we have variables about the organisations leading projects and about all organisations participating in projects.
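
To make the counting concrete, here is a sketch of the same flattening logic on made-up data. It uses the standard library rather than pandas, so it is an equivalent illustration and not the `flatten_freq` function itself:

```python
from collections import Counter
from itertools import chain

# Made-up example: one list of LAD names per project
all_lad_name = [
    ["Glasgow City", "City of Edinburgh"],
    ["Glasgow City"],
    ["Aberdeen City", "Glasgow City"],
]

# Flatten the nested list and count how often each LAD appears
lad_counts = Counter(chain.from_iterable(all_lad_name))

print(lad_counts.most_common(1))  # [('Glasgow City', 3)]
```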

#### Disciplines

We have labelled projects with their discipline probabilities.

The steps to do this, presented elsewhere in this repo (`02_jmg`), are:

* Tag projects with keywords
* Detect communities of tags ('disciplines')
* Find pure discipline projects (all tags in one discipline)
* Train a one-vs-rest model on those projects (with grid search)
* Predict labels for all data
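
The "pure discipline" step can be sketched as follows. This is an illustration on made-up tags, not the code in `02_jmg`; the names `tag_to_disc`, `project_tags` and `pure_discipline` are hypothetical:

```python
# Hypothetical mapping from each tag to the community ('discipline') it belongs to
tag_to_disc = {
    "robotics": "eng_tech",
    "sensors": "eng_tech",
    "genomics": "bio",
    "proteins": "bio",
}

# Hypothetical tags per project
project_tags = {
    "p1": ["robotics", "sensors"],
    "p2": ["genomics", "sensors"],
    "p3": ["proteins"],
}


def pure_discipline(tags):
    """Return the discipline if all tags share one, otherwise None."""
    discs = {tag_to_disc[t] for t in tags}
    return discs.pop() if len(discs) == 1 else None


pure = {p: pure_discipline(t) for p, t in project_tags.items()}
print(pure)  # {'p1': 'eng_tech', 'p2': None, 'p3': 'bio'}
```

Under this reading, projects such as `p1` and `p3` would provide the labelled examples for training, while mixed projects such as `p2` only receive predicted probabilities.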



In [None]:
disc_names = [x for x in proj_3.columns if 'disc_' in x]

proj_3[disc_names].head()

In [None]:
proj_3['disc_top'].value_counts()

In [None]:
ax = pd.crosstab(proj_3['year'],proj_3['disc_top']).rolling(window=3).mean().dropna().plot()

ax.legend(bbox_to_anchor=(1,1))

In [None]:
ax = pd.crosstab(proj_3['funder'],proj_3['disc_top'],normalize=0).plot.bar(stacked=True)

ax.legend(bbox_to_anchor=(1,1))

The funder-discipline link is as we would expect.

#### Other observations

The project data also includes the results of our predictive analysis of industry and SDG labels, but we treat these as placeholders because they are still being tuned.

In [None]:
proj_3.to_csv(f'../data/processed/data_getters_lab/{today_str}_gtr_projects.csv')

## `org`

This contains organisation-level information for all organisations in the GtR data.

In [None]:
org = pd.read_csv('../data/processed/17_9_2019_organisation_activities.csv',compression='zip')

In [None]:
org.head()

In [None]:
org.shape

In [None]:
org.columns

For each organisation, it includes:

* ID, name, location
* List of project ids and roles
* Number of projects it has worked in and number of projects it has led
* Level of funding
* Number of projects in different disciplines
* Types of grants it received
* Outputs in its projects
* Propensity to collaborate locally

### Top organisations

By funding

In [None]:
org.sort_values('led_funding',ascending=False)[['name','led_funding']].head(n=10)

By led projects

In [None]:
#Show the variable we sort by (number of led projects)
org.sort_values('lead_project_n',ascending=False)[['name','lead_project_n']].head(n=10)

By local collaborations

In [None]:
org.sort_values('local_collab',ascending=False)[['name','local_collab']].head(n=10)

### Top places by discipline

Note that these totals involve double counting: a project with several participating organisations is counted once for each of them.

In [None]:
org.groupby('lad')[disc_names[:-1]].sum().sort_values('disc_eng_tech',ascending=False).head()
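
The double counting presumably arises because project counts are stored per organisation, so a project with two participants in the same LAD contributes twice to that LAD's total. A sketch with made-up records (the `participation` structure is hypothetical, not a table in the data):

```python
from collections import defaultdict

# Made-up records: (organisation, lad, project_id), all in the same discipline
participation = [
    ("org_a", "Glasgow City", "p1"),
    ("org_b", "Glasgow City", "p1"),  # same project, second organisation
    ("org_b", "Glasgow City", "p2"),
]

# Summing organisation-level counts by LAD counts p1 twice
org_level_total = defaultdict(int)
for _, lad, _ in participation:
    org_level_total[lad] += 1

# Counting distinct projects per LAD counts p1 once
projects_by_lad = defaultdict(set)
for _, lad, project_id in participation:
    projects_by_lad[lad].add(project_id)

print(org_level_total["Glasgow City"])       # 3
print(len(projects_by_lad["Glasgow City"]))  # 2
```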

### Top places by output

In [None]:
output_names = [x for x in org.columns if 'out_' in x]

In [None]:
org.groupby('lad')[output_names].sum().sort_values('out_spin',ascending=False).head(n=10)

To do: check the results for Scotland.

### Save

In [None]:
org.to_csv(f'../data/processed/data_getters_lab/{today_str}_gtr_orgs.csv')