# APS

Code to process and plot Annual Population Survey (APS) data to create indicators of skills supply in the UK.

We are interested in the following indicators:

* Percentage of the population with tertiary education
* Percentage of population employed in professional occupations

Raw data collected from https://www.nomisweb.co.uk/articles/676.aspx

See [this table](https://docs.google.com/spreadsheets/d/1V2fAQcvuLsoImwo6uLdyIK3x80pBNoX97CxsxkjvRP4/edit?usp=sharing) for more information.

## Preamble

In [1]:
import requests 

import numpy as np 
import pandas as pd 
import json
import seaborn as sns

import os
cwd = os.getcwd()

import matplotlib.pyplot as plt
%matplotlib inline



## Data Processing & Transformation

Raw data is downloaded from the command line using the `get_aps_nomis_data` module in the `beis-indicators/data` directory.
Note: `get_nomis_data` uses `nomis`.

### Processing for 'Percentage of population employed in professional occupations' data

In [2]:
# fetching raw data
data_occupations_json = '../../data/raw/nomis_percent_pro_occs-0-25000.json'
with open(data_occupations_json) as f:
    data = json.load(f)

In [3]:
df_occupations = pd.DataFrame.from_records(data)

In [4]:
# selecting the rows with the variable in question
df_occupations = df_occupations[(df_occupations['measures_name']=='Variable')].reset_index(drop=True)

# creating pivot table with indicators as fields
# index takes column names; values='obs_value' restricts the mean to the numeric observations
df_occupations_pivot = df_occupations.pivot_table(index=['geography_name', 'geography_code', 'date_code'], columns='variable_name', values='obs_value', aggfunc='mean')
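
The filter-then-pivot step can be easier to follow on a small example. The sketch below uses made-up records that only mimic the fields used in this notebook; the region names, codes, values and the `'Confidence'` measure label are illustrative assumptions, not real Nomis output.

```python
import pandas as pd

# Toy records mimicking the fields used above (all names and values are made up).
records = [
    {"geography_name": "Region A", "geography_code": "UKX1", "date_code": "2017-12",
     "measures_name": "Variable", "variable_name": "% toy indicator", "obs_value": 20.5},
    {"geography_name": "Region A", "geography_code": "UKX1", "date_code": "2017-12",
     "measures_name": "Confidence", "variable_name": "% toy indicator", "obs_value": 1.2},
    {"geography_name": "Region A", "geography_code": "UKX1", "date_code": "2018-12",
     "measures_name": "Variable", "variable_name": "% toy indicator", "obs_value": 21.0},
    {"geography_name": "Region B", "geography_code": "UKX2", "date_code": "2018-12",
     "measures_name": "Variable", "variable_name": "% toy indicator", "obs_value": 18.3},
]
df = pd.DataFrame.from_records(records)

# Keep only the rows holding the indicator value itself.
estimates = df[df["measures_name"] == "Variable"]

# One row per (region, code, period) and one column per variable.
pivot = estimates.pivot_table(
    index=["geography_name", "geography_code", "date_code"],
    columns="variable_name",
    values="obs_value",
    aggfunc="mean",
).reset_index()

print(pivot)
```

Passing `values="obs_value"` keeps the aggregation on the numeric column only, so the result has one flat column per variable.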

In [5]:
df_occupations_pivot.reset_index(inplace=True)

In [6]:
df_occupations_pivot.set_index('geography_name',inplace=True)

In [7]:
df_occupations_pivot.columns

Index(['geography_code', 'date_code',
       '% all in employment who are - 2: professional occupations (SOC2010)'],
      dtype='object', name='variable_name')

In [8]:
df_occupations_pivot['date_code'] = df_occupations_pivot['date_code'].apply(lambda x: int(x.split('-')[0]))
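
As a small illustration of the year extraction, the sketch below applies the same logic to a few sample `date_code` strings in the `YYYY-MM` form seen in this notebook, using pandas' vectorised string methods instead of `apply`. The sample values are illustrative.

```python
import pandas as pd

# Sample date codes in the 'YYYY-MM' form used by the raw data.
date_codes = pd.Series(["2012-12", "2013-12", "2018-12"])

# Take the part before the hyphen and convert it to an integer year.
years = date_codes.str.split("-").str[0].astype(int)

print(years.tolist())
```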

In [9]:
df_occupations_pivot.columns = ['nuts_id', 'year', 'aps_pro_occupations_data']
df_occupations_pivot['nuts_year_spec'] = 2016

In [10]:
df_occupations_pivot = df_occupations_pivot[['year','nuts_id', 'nuts_year_spec', 'aps_pro_occupations_data']].reset_index(drop=True)

In [11]:
#saving pivot table
df_occupations_pivot.to_csv('../../data/processed/aps/aps_pro_occupations_data.csv')

### Processing for 'Economically active with NVQ4+ (graduates)' data

In [24]:
# fetching raw data
data_edu_json = '../../data/raw/nomis_tert_quals-0-25000.json'
with open(data_edu_json) as f:
    data = json.load(f)

In [25]:
df_edu = pd.DataFrame.from_records(data)

In [26]:
# selecting the rows with the variable in question
df_edu = df_edu[(df_edu['measures_name']=='Variable')].reset_index(drop=True)

# creating pivot table with indicators as fields
# index takes column names; values='obs_value' restricts the mean to the numeric observations
df_edu_pivot = df_edu.pivot_table(index=['geography_name', 'geography_code', 'date_code'], columns='variable_name', values='obs_value', aggfunc='mean')

In [27]:
df_edu_pivot.reset_index(inplace=True)
df_edu_pivot.set_index('geography_name',inplace=True)

In [28]:
df_edu_pivot.columns

Index(['geography_code', 'date_code',
       '% of economically active with NVQ4+ - aged 16-64'],
      dtype='object', name='variable_name')

In [29]:
df_edu_pivot['date_code'] = df_edu_pivot['date_code'].apply(lambda x: int(x.split('-')[0]))

In [30]:
df_edu_pivot.columns = ['nuts_id', 'year', 'aps_nvq4_education_data']
df_edu_pivot['nuts_year_spec'] = 2016

In [31]:
df_edu_pivot = df_edu_pivot[['year','nuts_id', 'nuts_year_spec', 'aps_nvq4_education_data']].reset_index(drop=True)

In [32]:
df_edu_pivot

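|  | year | nuts_id | nuts_year_spec | aps_nvq4_education_data |
| --- | --- | --- | --- | --- |
| 0 | 2012 | UKH2 | 2016 | 41.7 |
| 1 | 2013 | UKH2 | 2016 | 41.4 |
| 2 | 2014 | UKH2 | 2016 | 43.5 |
| 3 | 2015 | UKH2 | 2016 | 43.5 |
| 4 | 2016 | UKH2 | 2016 | 43.8 |
| ... | ... | ... | ... | ... |
| 275 | 2014 | UKE4 | 2016 | 34.0 |
| 276 | 2015 | UKE4 | 2016 | 35.1 |
| 277 | 2016 | UKE4 | 2016 | 34.6 |
| 278 | 2017 | UKE4 | 2016 | 36.7 |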


In [33]:
#saving pivot table
df_edu_pivot.to_csv('../../data/processed/aps/aps_nvq4_education_data.csv')
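
The occupations and education sections repeat the same sequence of steps (filter, pivot, parse the year, rename, reorder). A minimal sketch of how they could be folded into one function follows. `process_nomis_records` is a hypothetical helper, not part of this repository, and it assumes the single-variable case and the field names used above; the toy records are made up.

```python
import pandas as pd


def process_nomis_records(records, value_name, measure="Variable",
                          column_field="variable_name", nuts_year_spec=2016):
    """Hypothetical helper: turn raw records into a tidy single-indicator table."""
    df = pd.DataFrame.from_records(records)
    df = df[df["measures_name"] == measure]

    pivot = df.pivot_table(
        index=["geography_name", "geography_code", "date_code"],
        columns=column_field,
        values="obs_value",
        aggfunc="mean",
    )
    # This sketch only covers datasets that contain exactly one variable.
    if pivot.shape[1] != 1:
        raise ValueError(f"expected one variable, found {pivot.shape[1]}")

    out = pivot.reset_index()
    out.columns = ["geography_name", "nuts_id", "date_code", value_name]
    out["year"] = out["date_code"].str.split("-").str[0].astype(int)
    out["nuts_year_spec"] = nuts_year_spec
    return out[["year", "nuts_id", "nuts_year_spec", value_name]]


# Toy input (all names and values are made up).
records = [
    {"geography_name": "Region B", "geography_code": "UKX2", "date_code": "2018-12",
     "measures_name": "Variable", "variable_name": "% toy", "obs_value": 18.3},
    {"geography_name": "Region A", "geography_code": "UKX1", "date_code": "2018-12",
     "measures_name": "Variable", "variable_name": "% toy", "obs_value": 21.0},
    {"geography_name": "Region A", "geography_code": "UKX1", "date_code": "2018-12",
     "measures_name": "Confidence", "variable_name": "% toy", "obs_value": 1.2},
]
result = process_nomis_records(records, "toy_indicator")
print(result)
```

Rows come back sorted by `geography_name`, because the pivot sorts on its index, which matches the ordering seen in the outputs above.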

### Processing for 'Economically active science, research, engineering and technology professionals and associate professionals' data

In [91]:
# fetching raw data
data_stem_json = '../../data/raw/nomis_stem-0-25000.json'
with open(data_stem_json) as f:
    data = json.load(f)

In [92]:
data_stem = pd.DataFrame.from_records(data)

In [93]:
# selecting the rows with the variable in question
data_stem = data_stem[(data_stem['measures_name']=='Value')].reset_index(drop=True)

# creating pivot table with indicators as fields
# index takes column names; values='obs_value' restricts the mean to the numeric observations
df_stem_pivot = data_stem.pivot_table(index=['geography_name', 'geography_code', 'date_code'], columns='cell_name', values='obs_value', aggfunc='mean')

In [94]:
df_stem_pivot.reset_index(inplace=True)
df_stem_pivot.set_index('geography_name',inplace=True)

In [95]:
df_stem_pivot.columns

Index(['geography_code', 'date_code',
       'T09a:19 (All people - Science, Engineering and Technology Associate Professionals (SOC2010) : All people )',
       'T09a:7 (All people - Science, Research, Engineering and Technology Professionals (SOC2010) : All people )'],
      dtype='object', name='cell_name')

In [96]:
df_stem_pivot['date_code'] = df_stem_pivot['date_code'].apply(lambda x: int(x.split('-')[0]))

In [97]:
df_stem_pivot.columns = ['nuts_id', 'year', 'aps_econ_active_stem_associate_profs_data','aps_econ_active_stem_profs_data']
# counts are whole numbers, so cast them to int
df_stem_pivot['aps_econ_active_stem_profs_data'] = df_stem_pivot['aps_econ_active_stem_profs_data'].astype(int)
df_stem_pivot['aps_econ_active_stem_associate_profs_data'] = df_stem_pivot['aps_econ_active_stem_associate_profs_data'].astype(int)
df_stem_pivot['nuts_year_spec'] = 2016

In [98]:
df_stem_pivot_prof = df_stem_pivot[['year','nuts_id', 'nuts_year_spec', 'aps_econ_active_stem_profs_data']].reset_index(drop=True)
df_stem_pivot_aprof = df_stem_pivot[['year','nuts_id', 'nuts_year_spec', 'aps_econ_active_stem_associate_profs_data']].reset_index(drop=True)

In [99]:
#saving pivot table
df_stem_pivot_prof.to_csv('../../data/processed/aps/aps_econ_active_stem_profs_data.csv')
df_stem_pivot_aprof.to_csv('../../data/processed/aps/aps_econ_active_stem_associate_profs_data.csv')
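
Assigning to `.columns` by position relies on the pivot returning the associate professionals column before the professionals column. A minimal sketch of a more defensive alternative is to rename by label with an explicit mapping, so a change in column order cannot swap the two indicators. The one-row frame below is made up, and its columns are deliberately placed in the opposite order.

```python
import pandas as pd

ASSOC = ("T09a:19 (All people - Science, Engineering and Technology "
         "Associate Professionals (SOC2010) : All people )")
PROF = ("T09a:7 (All people - Science, Research, Engineering and Technology "
        "Professionals (SOC2010) : All people )")

# Explicit label-to-name mapping, independent of column order.
rename_map = {
    "geography_code": "nuts_id",
    "date_code": "year",
    ASSOC: "aps_econ_active_stem_associate_profs_data",
    PROF: "aps_econ_active_stem_profs_data",
}

# Toy one-row frame (values are made up), with the two indicators swapped.
pivot = pd.DataFrame({
    "geography_code": ["UKX1"],
    "date_code": [2018],
    PROF: [2000.0],
    ASSOC: [1000.0],
})

# Fail loudly if an expected column is absent rather than renaming silently.
missing = set(rename_map) - set(pivot.columns)
if missing:
    raise KeyError(f"columns not found: {sorted(missing)}")

renamed = pivot.rename(columns=rename_map)
print(renamed)
```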

### Processing for 'STEM employee density' data

In [72]:
# fetching raw data
data_stem_dens_json = '../../data/raw/nomis_stem_dens-0-25000.json'
with open(data_stem_dens_json) as f:
    data = json.load(f)

In [73]:
data_stem_dens = pd.DataFrame.from_records(data)

In [74]:
# selecting the rows with the variable in question
data_stem_dens = data_stem_dens[(data_stem_dens['measures_name']=='Variable')].reset_index(drop=True)

# creating pivot table with indicators as fields
# index takes column names; values='obs_value' restricts the mean to the numeric observations
data_stem_dens_pivot = data_stem_dens.pivot_table(index=['geography_name', 'geography_code', 'date_code'], columns='variable_name', values='obs_value', aggfunc='mean')

In [75]:
data_stem_dens_pivot.reset_index(inplace=True)
data_stem_dens_pivot.set_index('geography_name',inplace=True)

In [76]:
data_stem_dens_pivot.columns

Index(['geography_code', 'date_code',
       '% all in employment who are - 21: science, research, engineering and technology profs (SOC2010)'],
      dtype='object', name='variable_name')

In [77]:
data_stem_dens_pivot['date_code'] = data_stem_dens_pivot['date_code'].apply(lambda x: int(x.split('-')[0]))

In [78]:
data_stem_dens_pivot

| geography_name | geography_code | date_code | % all in employment who are - 21: science, research, engineering and technology profs (SOC2010) |
| --- | --- | --- | --- |
| Bedfordshire and Hertfordshire | UKH2 | 2012 | 7.1 |
| Bedfordshire and Hertfordshire | UKH2 | 2013 | 6.8 |
| Bedfordshire and Hertfordshire | UKH2 | 2014 | 6.7 |
| Bedfordshire and Hertfordshire | UKH2 | 2015 | 5.8 |
| Bedfordshire and Hertfordshire | UKH2 | 2016 | 6.8 |
| ... | ... | ... | ... |
| West Yorkshire | UKE4 | 2014 | 4.1 |
| West Yorkshire | UKE4 | 2015 | 4.1 |
| West Yorkshire | UKE4 | 2016 | 4.8 |
| West Yorkshire | UKE4 | 2017 | 4.1 |


In [79]:
data_stem_dens_pivot.columns = ['nuts_id', 'year', 'aps_econ_active_stem_density_data']
data_stem_dens_pivot['nuts_year_spec'] = 2016

In [80]:
# data_stem_dens_pivot
data_stem_dens_pivot = data_stem_dens_pivot[['year','nuts_id', 'nuts_year_spec', 'aps_econ_active_stem_density_data']].reset_index(drop=True)

In [81]:
#saving pivot table
data_stem_dens_pivot.to_csv('../../data/processed/aps/aps_econ_active_stem_density_data.csv')

## (Processed) Data Collection

This section shows how to work with the processed dataset when only one year is of interest (e.g. 2018).

In [None]:
data_edu = '../../data/processed/aps/11_11_2019_aps_tertiary_education_data.csv'
data_occupations = '../../data/processed/aps/11_11_2019_aps_pro_occupations_data.csv'

In [None]:
df_edu = pd.read_csv(data_edu)
df_edu.set_index('geography_name', inplace=True)
df_edu

In [None]:
df_edu_2018 = df_edu[df_edu['date_code'] == '2018-12']
df_edu_2018

In [None]:
df_edu_2018['% with NVQ4+ - aged 16-64'].sort_values(ascending=True).plot(kind='barh', figsize=(10,8))
plt.ylabel('NUTS2 Region', fontsize=12)
plt.xlabel('% of NUTS2 Region Population', fontsize=12)
plt.title('Percentage of population in NUTS2 regions with NVQ4+: 2018')

In [None]:
df_occ = pd.read_csv(data_occupations)
df_occ.set_index('geography_name', inplace=True)
df_occ

In [None]:
df_occ_2018 = df_occ[df_occ['date_code'] == '2018-12']
# df_occ_2018

In [None]:
df_occ_2018.plot(kind='barh', figsize=(10,8), stacked=True)
plt.axvline(x=100, linestyle='--', color='grey', alpha=0.3)
plt.xlabel('% of NUTS2 Region Population')
plt.ylabel('NUTS2 Region')
plt.title('Percentage of population in NUTS2 regions in given employment categories: 2018')
plt.legend(bbox_to_anchor=(1.05, 1.05))
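
The cells in this section read files with `geography_name` and `date_code` columns, whereas the files written in the processing sections use `year`, `nuts_id` and `nuts_year_spec`. A minimal sketch of the same single-year selection against that newer layout might look like the following; the data frame is a made-up stand-in for a processed CSV.

```python
import pandas as pd

# Made-up stand-in for a processed file such as aps_nvq4_education_data.csv.
processed = pd.DataFrame({
    "year": [2017, 2018, 2017, 2018],
    "nuts_id": ["UKX1", "UKX1", "UKX2", "UKX2"],
    "nuts_year_spec": [2016, 2016, 2016, 2016],
    "aps_nvq4_education_data": [40.0, 41.5, 35.0, 34.2],
})

# Select one year, index by region code and sort for plotting.
latest = (
    processed[processed["year"] == 2018]
    .set_index("nuts_id")["aps_nvq4_education_data"]
    .sort_values(ascending=True)
)

print(latest)
```

Calling `latest.plot(kind='barh', figsize=(10, 8))` on the result would then draw a horizontal bar chart in the same style as the cells above.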