# Eurostat
Code to collect and process Eurostat data to create the following indicators:

* Private sector R&D workforce
* Business Enterprise R&D (BERD)
* Share of high growth firms

Raw data are collected from the Eurostat API via the `EuroStat API Client` Python package (https://pypi.org/project/eurostatapiclient/).

## Preamble

In [1]:
from eurostatapiclient import EurostatAPIClient

import numpy as np
import pandas as pd

In [2]:
VERSION = 'v2.1'
FORMAT = 'json'
LANGUAGE = 'en'

In [3]:
client = EurostatAPIClient(VERSION, FORMAT, LANGUAGE)

### Mappings

In [8]:
nuts2_map = {
    'UKC1': 'Tees Valley and Durham',
    'UKC2': 'Northumberland and Tyne and Wear',
    'UKD1': 'Cumbria',
    'UKD6': 'Cheshire',
    'UKD3': 'Greater Manchester',
    'UKD4': 'Lancashire',
    'UKD7': 'Merseyside',
    'UKE1': 'East Riding and North Lincolnshire', 
    'UKE2': 'North Yorkshire',
    'UKE3': 'South Yorkshire',
    'UKE4': 'West Yorkshire', 
    'UKF1': 'Derbyshire and Nottinghamshire',
    'UKF2': 'Leicestershire, Rutland and Northamptonshire', 
    'UKF3': 'Lincolnshire', 
    'UKG1': 'Herefordshire, Worcestershire and Warwickshire',
    'UKG2': 'Shropshire and Staffordshire',
    'UKG3': 'West Midlands', 
    'UKH1': 'East Anglia', 
    'UKH2': 'Bedfordshire and Hertfordshire',
    'UKH3': 'Essex',
    'UKI3': 'Inner London - West', 
    'UKI4': 'Inner London - East',
    'UKI5': 'Outer London - East and North East',
    'UKI6': 'Outer London - South', 
    'UKI7': 'Outer London - West and North West',
    'UKJ1': 'Berkshire, Buckinghamshire, and Oxfordshire', 
    'UKJ2': 'Surrey, East and West Sussex',
    'UKJ3': 'Hampshire and Isle of Wight', 
    'UKJ4': 'Kent', 
    'UKK1': 'Gloucestershire, Wiltshire and Bristol/Bath area', 
    'UKK2': 'Dorset and Somerset', 
    'UKK3': 'Cornwall and Isles of Scilly',
    'UKK4': 'Devon',
    'UKL1': 'West Wales and The Valleys',
    'UKL2': 'East Wales', 
    'UKM2': 'Eastern Scotland', 
    'UKM3': 'South Western Scotland', 
    'UKM5': 'North Eastern Scotland',
    'UKM6': 'Highlands and Islands', 
    'UKN0': 'Northern Ireland', 
    'UKZZ': 'Extra-regio NUTS 2'
    
}

In [9]:
#using the eurostat code labels
vars_map = {
    'EUR_HAB': 'Euro per inhabitant',
    'MIO_EUR': 'Million euro',
    'FTE': 'Full-time equivalent (FTE)',
    'HC': 'Head count', 
    'PC_ACT_FTE': 'Percentage of active population - numerator in full-time equivalent (FTE)',
    'PC_ACT_HC': 'Percentage of active population - numerator in head count (HC)'
}
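Codes that are missing from a mapping become `NaN` when the mapping is applied with `Series.map`, so it can be worth checking coverage before relying on the labels. The following is a minimal sketch of such a check; the helper name `unmapped_codes`, the shortened mapping and the code `'UKXX'` are made up for illustration.

```python
def unmapped_codes(codes, mapping):
    """Return the sorted codes that have no label in mapping (hypothetical helper)."""
    return sorted(set(codes) - set(mapping))

# Shortened mapping and made-up codes, for illustration only
example_map = {
    'UKC1': 'Tees Valley and Durham',
    'UKC2': 'Northumberland and Tyne and Wear',
}
example_codes = ['UKC1', 'UKC2', 'UKC1', 'UKXX']  # 'UKXX' is not a real NUTS code

missing = unmapped_codes(example_codes, example_map)
print(missing)
```

In the notebook the same check could be run as `unmapped_codes(df['geo'], nuts2_map)` on any of the UK subsets, where `df` stands for that subset.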

## Data Collection, Processing & Transformation

This section has one subsection per indicator. Each subsection follows the same steps:

* Query the Eurostat API through the Python package and load the flattened result into a dataframe
* Subset the data to the UK NUTS2 regions
* Replace the codes with their associated labels
* Pivot the data into the desired output format
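The steps after the API call can be sketched as a single helper. This is a minimal sketch rather than the notebook's own code: the function name `tidy_uk_nuts2` and the toy rows are hypothetical, and it assumes the flattened dataframe has the `geo`, `time`, `unit` and `values` columns used in the cells below, with a single unit per query.

```python
import pandas as pd

def tidy_uk_nuts2(flat, unit_labels, value_name):
    """Subset UK rows, label units, pivot by unit and rename columns (hypothetical helper)."""
    # .copy() gives an independent dataframe, so the assignments below
    # do not raise pandas' SettingWithCopyWarning
    uk = flat[flat['geo'].str.contains('UK')].copy()
    uk['time'] = uk['time'].astype(int)
    uk['unit'] = uk['unit'].map(unit_labels)
    # One row per region and year, one column per unit label
    wide = uk.pivot_table(index=['geo', 'time'],
                          columns='unit',
                          values='values').reset_index()
    # Assumes exactly one unit, so three columns remain after the pivot
    wide.columns = ['nuts_id', 'year', value_name]
    wide['nuts_year_spec'] = 2013
    return wide[['year', 'nuts_id', 'nuts_year_spec', value_name]]

# Toy rows with made-up values, not Eurostat data
toy = pd.DataFrame({
    'geo': ['UKC1', 'UKC1', 'FR10'],
    'time': ['2012', '2013', '2012'],
    'unit': ['HC', 'HC', 'HC'],
    'values': [100.0, 110.0, 999.0],
})

result = tidy_uk_nuts2(toy, {'HC': 'Head count'}, 'headcount')
print(result)
```

Wrapping the steps in one function would avoid repeating the same cells for every indicator, at the cost of hiding the intermediate dataframes that the notebook inspects.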

### Private sector R&D workforce data

#### Head Count

In [14]:
#pull in data
data_priv_nuts2 = client.get_dataset('rd_p_persreg?sinceTimePeriod=2012&geoLevel=nuts2&precision=1&sex=T&sectperf=BES&prof_pos=TOTAL&unit=HC')

print(data_priv_nuts2.label)

dataframe_priv_nuts2 = data_priv_nuts2.to_dataframe()

Total R&D personnel and researchers by sectors of performance, sex and NUTS 2 regions
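The string passed to `get_dataset` is a dataset code followed by URL-style filter parameters. As a minimal sketch, the same string can be assembled from keyword arguments so that the filters are easier to read and vary. The helper `build_query` is hypothetical and is not part of `eurostatapiclient`.

```python
from urllib.parse import urlencode

def build_query(dataset, **filters):
    """Join a dataset code and its filters into one query string (hypothetical helper)."""
    # doseq=True repeats a key once per value when a list is supplied
    return f"{dataset}?{urlencode(filters, doseq=True)}"

query = build_query(
    'rd_p_persreg',
    sinceTimePeriod=2012,
    geoLevel='nuts2',
    precision=1,
    sex='T',
    sectperf='BES',
    prof_pos='TOTAL',
    unit='HC',
)
print(query)
```

The result matches the head count query used above and could be passed to `client.get_dataset(query)`. A filter that takes several values, such as `nace_r2` in the high growth firms query, can be given as a list.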


In [15]:
#UK NUTS2 regions subset
#take a copy so the column assignments below do not trigger SettingWithCopyWarning
dataframe_priv_nuts2_uk = dataframe_priv_nuts2[dataframe_priv_nuts2['geo'].str.contains('UK')].copy()

In [16]:
#mappings
dataframe_priv_nuts2_uk['time'] = dataframe_priv_nuts2_uk['time'].astype(int)
dataframe_priv_nuts2_uk['geo_name'] = dataframe_priv_nuts2_uk['geo'].map(nuts2_map)
dataframe_priv_nuts2_uk['unit'] = dataframe_priv_nuts2_uk['unit'].map(vars_map)


In [17]:
#pivot table
d_priv = dataframe_priv_nuts2_uk.pivot_table(index=['geo_name','geo','time'],
               columns = 'unit',
               values = 'values').reset_index().set_index('geo_name')

In [18]:
d_priv['Head count'] = pd.to_numeric(d_priv['Head count'], downcast='integer')

In [19]:
d_priv.columns

Index(['geo', 'time', 'Head count'], dtype='object', name='unit')

In [20]:
d_priv.reset_index(drop=True, inplace=True)
d_priv.columns = ['nuts_id', 'year', 'eurostat_private_rd_headcount_workforce_data']
d_priv['nuts_year_spec'] = [2013]*len(d_priv)

In [21]:
d_priv = d_priv[['year','nuts_id', 'nuts_year_spec', 'eurostat_private_rd_headcount_workforce_data']]

In [22]:
#save data
d_priv.to_csv('../../data/processed/eurostat/eurostat_private_rd_headcount_workforce_data.csv',index=False)

#### Full Time Equivalent (FTE)

In [23]:
#pull in data
data_priv_nuts2 = client.get_dataset('rd_p_persreg?sinceTimePeriod=2012&geoLevel=nuts2&precision=6&sex=T&sectperf=BES&prof_pos=TOTAL&unit=FTE')

print(data_priv_nuts2.label)

dataframe_priv_nuts2 = data_priv_nuts2.to_dataframe()

Total R&D personnel and researchers by sectors of performance, sex and NUTS 2 regions


In [24]:
#UK NUTS2 regions subset
#take a copy so the column assignments below do not trigger SettingWithCopyWarning
dataframe_priv_nuts2_uk = dataframe_priv_nuts2[dataframe_priv_nuts2['geo'].str.contains('UK')].copy()

In [25]:
#mappings
dataframe_priv_nuts2_uk['time'] = dataframe_priv_nuts2_uk['time'].astype(int)
dataframe_priv_nuts2_uk['geo_name'] = dataframe_priv_nuts2_uk['geo'].map(nuts2_map)
dataframe_priv_nuts2_uk['unit'] = dataframe_priv_nuts2_uk['unit'].map(vars_map)


In [26]:
#pivot table
d_priv = dataframe_priv_nuts2_uk.pivot_table(index=['geo_name','geo','time'],
               columns = 'unit',
               values = 'values').reset_index().set_index('geo_name')

In [27]:
d_priv.columns

Index(['geo', 'time', 'Full-time equivalent (FTE)'], dtype='object', name='unit')

In [28]:
d_priv.reset_index(drop=True, inplace=True)
d_priv.columns = ['nuts_id', 'year', 'eurostat_private_rd_fte_workforce_data']
d_priv['nuts_year_spec'] = [2013]*len(d_priv)

In [29]:
d_priv = d_priv[['year','nuts_id', 'nuts_year_spec', 'eurostat_private_rd_fte_workforce_data']]

In [30]:
#save data
d_priv.to_csv('../../data/processed/eurostat/eurostat_private_rd_fte_workforce_data.csv', index=False)

### Business Enterprise R&D (BERD) data

In [31]:
#pull in data
data_berd_nuts2 = client.get_dataset('rd_e_gerdreg?sinceTimePeriod=2012&geoLevel=nuts2&precision=6&sectperf=BES&unit=MIO_EUR')

print(data_berd_nuts2.label)

dataframe_berd_nuts2 = data_berd_nuts2.to_dataframe()

Intramural R&D expenditure (GERD) by sectors of performance and NUTS 2 regions


In [32]:
#UK NUTS2 regions subset
#take a copy so the column assignments below do not trigger SettingWithCopyWarning
dataframe_berd_nuts2_uk = dataframe_berd_nuts2[dataframe_berd_nuts2['geo'].str.contains('UK')].copy()

In [33]:
#mappings
dataframe_berd_nuts2_uk['time'] = dataframe_berd_nuts2_uk['time'].astype(int)
dataframe_berd_nuts2_uk['geo_name'] = dataframe_berd_nuts2_uk['geo'].map(nuts2_map)
dataframe_berd_nuts2_uk['unit'] = dataframe_berd_nuts2_uk['unit'].map(vars_map)


In [34]:
#pivot table
d_berd = dataframe_berd_nuts2_uk.pivot_table(index=['geo_name','geo','time'],
               columns = 'unit',
               values = 'values').reset_index().set_index('geo_name')

In [35]:
d_berd['euros'] = d_berd['Million euro'] * 1000000.00
d_berd.drop(columns=['Million euro'], inplace=True)

In [38]:
d_berd.columns

Index(['geo', 'time', 'euros'], dtype='object', name='unit')

In [39]:
d_berd.reset_index(drop=True, inplace=True)
d_berd.columns = ['nuts_id', 'year', 'eurostat_berd_data']

In [40]:
d_berd['nuts_year_spec'] = [2013]*len(d_berd)

In [41]:
d_berd = d_berd[['year','nuts_id', 'nuts_year_spec', 'eurostat_berd_data']]

In [42]:
#save data
d_berd.to_csv('../../data/processed/eurostat/eurostat_berd_data.csv',index=False)

### Private non-profit (PNPERD) data

In [10]:
#pull in data
data_pnpberd_nuts2 = client.get_dataset('rd_e_gerdreg?sinceTimePeriod=2012&geoLevel=nuts2&precision=6&sectperf=PNP&unit=MIO_EUR')

print(data_pnpberd_nuts2.label)

df_pnpberd_nuts2 = data_pnpberd_nuts2.to_dataframe()

Intramural R&D expenditure (GERD) by sectors of performance and NUTS 2 regions


In [11]:
#UK NUTS2 regions subset
#take a copy so the column assignments below do not trigger SettingWithCopyWarning
df_pnpberd_nuts2_uk = df_pnpberd_nuts2[df_pnpberd_nuts2['geo'].str.contains('UK')].copy()

In [12]:
#mappings
df_pnpberd_nuts2_uk['time'] = df_pnpberd_nuts2_uk['time'].astype(int)
df_pnpberd_nuts2_uk['geo_name'] = df_pnpberd_nuts2_uk['geo'].map(nuts2_map)
df_pnpberd_nuts2_uk['unit'] = df_pnpberd_nuts2_uk['unit'].map(vars_map)


In [13]:
d_pnpberd = df_pnpberd_nuts2_uk.pivot_table(index=['geo_name','geo','time'],
               columns = 'unit',
               values = 'values').reset_index().set_index('geo_name')

In [14]:
d_pnpberd['euros'] = d_pnpberd['Million euro'] * 1000000.00
d_pnpberd.drop(columns=['Million euro'], inplace=True)

In [15]:
d_pnpberd.columns

Index(['geo', 'time', 'euros'], dtype='object', name='unit')

In [16]:
d_pnpberd.reset_index(drop=True, inplace=True)
d_pnpberd.columns = ['nuts_id', 'year', 'eurostat_private_non_profit_rd_workforce_data']

In [17]:
d_pnpberd['nuts_year_spec'] = [2013]*len(d_pnpberd)

In [18]:
d_pnpberd = d_pnpberd[['year','nuts_id', 'nuts_year_spec', 'eurostat_private_non_profit_rd_workforce_data']]

In [19]:
d_pnpberd.to_csv('../../data/processed/eurostat/eurostat_private_non_profit_rd_workforce_data.csv', index=False)

### Higher Education Performed R&D expenditure (HERD)

In [53]:
#pull in data
data_herd_nuts2 = client.get_dataset('rd_e_gerdreg?sinceTimePeriod=2012&geoLevel=nuts2&precision=6&sectperf=HES&unit=MIO_EUR')

print(data_herd_nuts2.label)

df_herd_nuts2 = data_herd_nuts2.to_dataframe()

Intramural R&D expenditure (GERD) by sectors of performance and NUTS 2 regions


In [54]:
#UK NUTS2 regions subset
#take a copy so the column assignments below do not trigger SettingWithCopyWarning
df_herd_nuts2_uk = df_herd_nuts2[df_herd_nuts2['geo'].str.contains('UK')].copy()

In [55]:
#mappings
df_herd_nuts2_uk['time'] = df_herd_nuts2_uk['time'].astype(int)
df_herd_nuts2_uk['geo_name'] = df_herd_nuts2_uk['geo'].map(nuts2_map)
df_herd_nuts2_uk['unit'] = df_herd_nuts2_uk['unit'].map(vars_map)


In [56]:
d_herd = df_herd_nuts2_uk.pivot_table(index=['geo_name','geo','time'],
               columns = 'unit',
               values = 'values').reset_index().set_index('geo_name')

In [57]:
d_herd['euros'] = d_herd['Million euro'] * 1000000.00
d_herd.drop(columns=['Million euro'], inplace=True)

In [58]:
d_herd.columns

Index(['geo', 'time', 'euros'], dtype='object', name='unit')

In [59]:
d_herd.reset_index(drop=True, inplace=True)
d_herd.columns = ['nuts_id', 'year', 'eurostat_higher_ed_rd_workforce_data']

In [60]:
d_herd['nuts_year_spec'] = [2013]*len(d_herd)

In [61]:
d_herd = d_herd[['year','nuts_id', 'nuts_year_spec', 'eurostat_higher_ed_rd_workforce_data']]

In [62]:
d_herd.to_csv('../../data/processed/eurostat/eurostat_higher_ed_rd_workforce_data.csv',index=False)

### Government Performed R&D Expenditure (GovERD)

In [63]:
#pull in data
data_goverd_nuts2 = client.get_dataset('rd_e_gerdreg?sinceTimePeriod=2012&geoLevel=nuts2&precision=2&sectperf=GOV&unit=MIO_EUR')

print(data_goverd_nuts2.label)

df_goverd_nuts2 = data_goverd_nuts2.to_dataframe()

Intramural R&D expenditure (GERD) by sectors of performance and NUTS 2 regions


In [64]:
#UK NUTS2 regions subset
#take a copy so the column assignments below do not trigger SettingWithCopyWarning
df_goverd_nuts2_uk = df_goverd_nuts2[df_goverd_nuts2['geo'].str.contains('UK')].copy()

In [65]:
#mappings
df_goverd_nuts2_uk['time'] = df_goverd_nuts2_uk['time'].astype(int)
df_goverd_nuts2_uk['geo_name'] = df_goverd_nuts2_uk['geo'].map(nuts2_map)
df_goverd_nuts2_uk['unit'] = df_goverd_nuts2_uk['unit'].map(vars_map)


In [66]:
d_goverd = df_goverd_nuts2_uk.pivot_table(index=['geo_name','geo','time'],
               columns = 'unit',
               values = 'values').reset_index().set_index('geo_name')

In [67]:
d_goverd['euros'] = d_goverd['Million euro'] * 1000000.00
d_goverd.drop(columns=['Million euro'], inplace=True)

In [68]:
d_goverd.columns

Index(['geo', 'time', 'euros'], dtype='object', name='unit')

In [69]:
d_goverd.reset_index(drop=True, inplace=True)
d_goverd.columns = ['nuts_id', 'year', 'eurostat_gov_rd_workforce_data']

In [70]:
d_goverd['nuts_year_spec'] = [2013]*len(d_goverd)

In [71]:
d_goverd = d_goverd[['year','nuts_id', 'nuts_year_spec', 'eurostat_gov_rd_workforce_data']]

In [72]:
d_goverd.to_csv('../../data/processed/eurostat/eurostat_gov_rd_workforce_data.csv', index=False)

### Share of high growth firms

In [None]:
#pull in data
data_share_nuts2 = client.get_dataset('bd_hgnace2_r3?sinceTimePeriod=2012&geoLevel=nuts2&precision=1&indic_sb=V97460&nace_r2=B-E&nace_r2=B-S_X_K642&nace_r2=F&nace_r2=G&nace_r2=H&nace_r2=I&nace_r2=J&nace_r2=K_L_X_K642&nace_r2=M_N&nace_r2=P_Q&nace_r2=R_S')

print(data_share_nuts2.label)

dataframe_share_nuts2 = data_share_nuts2.to_dataframe()

In [None]:
dataframe_share_nuts2[dataframe_share_nuts2['geo'] == 'UKC1']

In [None]:
#UK NUTS2 regions subset

dataframe_share_nuts2_uk = dataframe_share_nuts2[dataframe_share_nuts2['geo'].str.contains('UK')]

In [None]:
dataframe_share_nuts2_uk

In [None]:
dataframe_share_nuts2_uk.pivot_table(index=['geo','time'],
               columns = 'indic_sb',
               values = 'values')

Note: There do not appear to be any UK NUTS2 values in this dataset.
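One way to confirm this is to count the returned region codes by country prefix. The sketch below assumes the `geo` values are NUTS codes that begin with a two-letter country code; the helper name `count_by_country` and the sample codes are made up and are not actual API output.

```python
from collections import Counter

def count_by_country(geo_codes):
    """Count region codes by their two-letter country prefix (hypothetical helper)."""
    return Counter(code[:2] for code in geo_codes)

# Made-up sample of region codes, for illustration only
toy_codes = ['FR10', 'FRB0', 'DE11', 'ES30']

counts = count_by_country(toy_codes)
# Counter returns 0 for prefixes that never occur
print(counts['UK'])
```

In the notebook this could be called as `count_by_country(dataframe_share_nuts2['geo'].unique())`; a count of zero for `'UK'` would support the note.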