# Indicator Section 1: Public R&D Capability

This notebook uses `processed` data to create our first set of indicators.

The output will be a collection of tables with an indicator field for every `nuts_id` and code. 

**Todo:** Decide how to handle years.

## Preamble

In [None]:
%run ../notebook_preamble.ipy

In [None]:
# Create the indicators folder if it is missing.
# os.chdir() returns None and changes the working directory, so we avoid it here.
os.makedirs('../data/indicators', exist_ok=True)

## 1. Load data

### REF data

In [None]:
ref = pd.read_csv('../../data/processed/ref/9_11_2019_ref_nuts.csv')

In [None]:
ref.head()

### UKRI project funding

In [None]:
ukri = pd.read_csv('../../data/processed/gtr/2019_11_14_nuts_discipline_activity.csv')

ukri.head()

### HESA

#### HESA university income

In [None]:
hesa_data_17_18 = pd.read_csv('../../data/processed/hesa/2019_11_19_hesa_data_2017_18_nuts_2.csv')

hesa_data_17_18.head()

The HESA data combines multiple variables in a single table. It is neither tidy nor well documented.

### Eurostat

In [None]:
berd = pd.read_csv('../../data/processed/eurostat/eurostat_berd_data.csv')

berd.head()

In [None]:
herd = pd.read_csv('../../data/processed/eurostat/eurostat_higher_ed_rd_workforce_data.csv')

herd.head()

## 2. Create indicators

### Subsection 1: Comparative advantage in performing excellent public research

#### 1. REF scores

We interpret this as the overall FTE-weighted score: the mean star rating across all disciplines in a region, weighted by the FTEs submitted at each rating.
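As a rough illustration of the weighting, the sketch below computes the same kind of FTE-weighted score on a toy table. The region codes, disciplines and FTE values are made up, and the column names other than `4*_fte` are assumptions about how the REF table is laid out.

```python
import pandas as pd

# Toy data: FTEs by star rating for two hypothetical regions (all values invented)
toy = pd.DataFrame({
    'nuts_code': ['AA11', 'AA11', 'BB22'],
    'unit_of_assessment_name': ['Physics', 'History', 'Physics'],
    '4*_fte': [10.0, 0.0, 5.0],
    '3*_fte': [10.0, 20.0, 5.0],
    'unclassified_fte': [0.0, 0.0, 10.0],
})

# Long format: one row per region, discipline and star rating
toy_long = toy.melt(id_vars=['nuts_code', 'unit_of_assessment_name'],
                    var_name='variable', value_name='fte')

# Star value parsed from the column name; unclassified work scores 0
toy_long['score'] = [0 if 'unclassified' in v else int(v.split('*')[0])
                     for v in toy_long['variable']]

# Weighted mean per region: sum(fte * score) / sum(fte)
toy_long['weighted'] = toy_long['fte'] * toy_long['score']
totals = toy_long.groupby('nuts_code')[['weighted', 'fte']].sum()
toy_scores = totals['weighted'] / totals['fte']
print(toy_scores)
```

For the first toy region this gives (10 × 4 + 30 × 3) / 40 = 3.25.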

In [None]:
#In order to calculate this we need to melt the data

In [None]:
ref_melted = pd.melt(ref,id_vars=['nuts_name','nuts_code','unit_of_assessment_name','total_fte'])

ref_melted.head()

In [None]:
ref_melted['score'] = [int(x.split('*')[0]) if 'unclassified' not in x else 0 for x in ref_melted['variable']]

In [None]:
# FTE-weighted mean score per region: share of FTEs at each star level times the star value
ref_weighted_scores = ref_melted.groupby(['nuts_code','nuts_name']).apply(
    lambda x: np.sum((x['value']/x['value'].sum())*x['score'])).sort_values(ascending=False)

ref_weighted_scores.head(n=10)

#### 2. REF scores in STEM disciplines

We first need to define which disciplines count as STEM. We load a list of STEM units of assessment from a text file stored in `aux`, which can be edited if the definition needs to change.
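A minimal sketch of the filtering step, assuming the list holds one unit of assessment name per line (which is what the split on newlines in the next cell suggests). The in-memory file and the discipline names are hypothetical stand-ins for the real list.

```python
import io
import pandas as pd

# Stand-in for the list on disk; the contents are hypothetical
toy_file = io.StringIO('Physics\nChemistry\n')

# Drop blank entries, which a trailing newline would otherwise introduce
toy_stem = [line.strip() for line in toy_file.read().splitlines() if line.strip()]

toy_ref = pd.DataFrame({
    'unit_of_assessment_name': ['Physics', 'History', 'Chemistry'],
    'value': [1.0, 2.0, 3.0],
})

# Keep only rows whose unit of assessment is in the STEM list
toy_ref_stem = toy_ref[toy_ref['unit_of_assessment_name'].isin(toy_stem)]
print(toy_ref_stem)
```

Because the match is on exact names, any unit of assessment spelled differently in the list and in the data would be dropped silently, so it may be worth checking which names in the list find no match.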

In [None]:
with open('../../data/aux/ref_stem.txt','r') as infile:
    stem = infile.read().split('\n')

In [None]:
ref_stem = ref_melted.loc[ref_melted['unit_of_assessment_name'].isin(stem)]

In [None]:
ref_stem_weighted_scores = ref_stem.groupby(
    ['nuts_code','nuts_name']).apply(lambda x: np.sum((x['value']/x['value'].sum())*x['score'])).sort_values(ascending=False)

In [None]:
ref_stem_weighted_scores.head(n=10)

#### 3. Excellent researchers submitted to REF

We measure this as the number of 4* FTEs submitted in each region.

In [None]:
ref_excellent = ref.groupby(['nuts_code','nuts_name'])['4*_fte'].sum().sort_values(ascending=False)

In [None]:
ref_excellent.head(n=10)

#### 4. Research income

We use the research income reported in the 2017/18 HESA data.

In [None]:
research_income = hesa_data_17_18.set_index(['nuts_name','nuts_code'])['research_income_(£)'].sort_values(ascending=False)

In [None]:
research_income.head(n=10)

#### 5. UKRI-funded activity in STEM disciplines

This indicator is experimental. We have classified UKRI projects into disciplines and aggregated the number of projects led and the total income by NUTS2 region. Here we focus on project counts to avoid the problem that UKRI only makes funding data available at the project (rather than organisation) level.
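The sketch below illustrates the selection logic on a toy table, assuming a wide layout with one column per discipline and measure, where project-count columns contain the string `project`. The column names, the STEM list and all values are hypothetical.

```python
import pandas as pd

# Hypothetical wide table: one column per discipline and measure
toy_ukri = pd.DataFrame({
    'lead_nut_code': ['AA11', 'BB22'],
    'physics_project': [3, 1],
    'physics_funding': [100.0, 50.0],
    'history_project': [2, 4],
})
toy_ukri_stem = ['physics']  # hypothetical STEM discipline list

# Long format: one row per region and variable
toy_ukri_long = toy_ukri.melt(id_vars='lead_nut_code')

# Keep project-count variables whose name mentions a STEM discipline
is_project = toy_ukri_long['variable'].str.contains('project')
is_stem = toy_ukri_long['variable'].apply(
    lambda v: any(d in v for d in toy_ukri_stem))

toy_counts = (toy_ukri_long[is_project & is_stem]
              .groupby('lead_nut_code')['value'].sum())
print(toy_counts)
```

Since the discipline match is a substring test, an empty string in the discipline list would match every column, so blank entries should be removed before filtering.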

In [None]:
with open('../../data/aux/ukri_stem.txt','r') as infile:
    # Drop blank lines: an empty string is a substring of every column name
    ukri_stem = [x for x in infile.read().split('\n') if x]

ukri_stem

In [None]:
#Melt the data
ukri_long = pd.melt(ukri,id_vars=['lead_nut_code','lead_nut_name'])

In [None]:
#Focus on projects
ukri_projects = ukri_long.loc[['project' in x for x in ukri_long['variable']]]

In [None]:
#Focus on projects in stem disciplines as described above
ukri_projects_stem = ukri_projects.loc[[any(x in v for x in ukri_stem) for v in ukri_projects['variable']]]

In [None]:
#Aggregate
ukri_stem_led_projects = ukri_projects_stem.groupby(['lead_nut_code','lead_nut_name'])['value'].sum().sort_values(ascending=False)

In [None]:
ukri_stem_led_projects.head(n=10)

#### **[TODO]** Field weighted citation impact

### Subsection 2: Business absorptive capacity and private R&D investment

#### University buildings

We continue to use the HESA data.

In [None]:
university_buildings = hesa_data_17_18.set_index(['nuts_name','nuts_code'])['total_number_of_buildings'].sort_values(ascending=False)

In [None]:
university_buildings.head(n=10)

#### Area of university estates

In [None]:
university_site_area = hesa_data_17_18.set_index(['nuts_name','nuts_code'])['total_site_area_(hectares)'].sort_values(ascending=False)

In [None]:
university_site_area.head(n=10)

#### HE Performed R&D Expenditure (HERD)

#### Government Performed R&D Expenditure (GovERD)