# Reconstruction of the ESCO occupation hierarchy

This notebook uses the [ESCO API](https://ec.europa.eu/esco/api/doc/esco_api_doc.html) to reconstruct the ESCO occupation hierarchy, i.e., the 'broader' (parent) and 'narrower' (child) relationships between occupations. The final output is a table `full_hierarchy` (stored as `occupations_ESCO_hierarchy.csv`).

In [3]:
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
import pickle
import requests

# 0. Import already prepared data

In [16]:
# Import ESCO occupations (NB: this file is not strictly a prerequisite, as the dataframe could also be
# recreated from the ESCO API with the requests shown below).
occupations = pd.read_csv('../data/processed/item_occupations.csv')

print(f'Imported {len(occupations)} occupations')
occupations.head(2)

Imported 2942 occupations


| | id | concept_type | concept_uri | preferred_label | alt_labels | description | isco_group | isco_level_1 | isco_level_2 | isco_level_3 | isco_level_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Occupation | http://data.europa.eu/esco/occupation/00030d09... | technical director | technical and operations director\nhead of tec... | Technical directors realise the artistic visio... | 2166 | 2 | 21 | 216 | 2166 |
| 1 | 1 | Occupation | http://data.europa.eu/esco/occupation/000e93a3... | metal drawing machine operator | metal drawing machine technician\nmetal drawin... | Metal drawing machine operators set up and ope... | 8121 | 8 | 81 | 812 | 8121 |


# 1. ESCO API

In [5]:
# Path where results from API calls are stored
api_folder = '../data/interim/ESCO_occupations_API/'

In [6]:
# Example of checking a broader occupation's narrower occupations
link = 'https://ec.europa.eu/esco/api/resource/occupation?uri=http://data.europa.eu/esco/occupation/35bc3847-58ad-46f5-8921-e58acc2762a6'
req_occupation = requests.get(link)
print(f"Broader occupation: {req_occupation.json()['title']}")
print('----')
link_dict = req_occupation.json()['_links']
for occ in link_dict['narrowerOccupation']:
    print(occ['title'])


Broader occupation: chemical engineering technician
----
asphalt laboratory technician
colour sampling technician
chemical manufacturing quality technician
hazardous waste technician
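
The lookup above indexes `_links['narrowerOccupation']` directly, which raises a `KeyError` for occupations without children. A minimal sketch of a more forgiving extraction is shown below; the function name `related_occupations` and the hand-written `sample` payload are illustrative assumptions that only mimic the response structure used in this notebook.

```python
def related_occupations(payload):
    """Return (broader, narrower) title lists from one occupation resource."""
    links = payload.get('_links', {})
    # Missing relationship keys are treated as "no such occupations"
    broader = [occ['title'] for occ in links.get('broaderOccupation', [])]
    narrower = [occ['title'] for occ in links.get('narrowerOccupation', [])]
    return broader, narrower


# Hand-written payload mimicking the structure of the API response (not real API output)
sample = {
    'title': 'chemical engineering technician',
    '_links': {
        'narrowerOccupation': [
            {'title': 'asphalt laboratory technician'},
            {'title': 'colour sampling technician'},
        ]
    },
}

broader, narrower = related_occupations(sample)
print(broader)
print(narrower)
```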


## 1.1 Top level ESCO occupations (Level 5 ISCO)

I refer to these as "top level" ESCO occupations, but note that they sit one level of granularity below the four-digit ISCO-08 unit groups. Counting the four ISCO levels above them, they therefore correspond to Level 5.
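
To make the level numbering concrete, the sketch below splits a four-digit ISCO-08 code into its four nested levels, matching the `isco_level_1` to `isco_level_4` columns shown earlier. The helper name `isco_levels` is an assumption for illustration, and it returns strings rather than the integers stored in the dataframe.

```python
def isco_levels(isco_group):
    """Split a four-digit ISCO-08 code into its level 1-4 prefixes."""
    code = str(isco_group)
    if len(code) != 4 or not code.isdigit():
        raise ValueError(f'Expected a four-digit ISCO code, got {isco_group!r}')
    # Level n is identified by the first n digits of the code
    return [code[:n] for n in range(1, 5)]


print(isco_levels(2166))  # ['2', '21', '216', '2166']
```

An ESCO occupation attached to unit group 2166 then sits at Level 5, and its narrower occupations at Levels 6, 7 and 8.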

In [17]:
# Set up a dataframe to store the hierarchy, with columns pertaining to hierarchy levels and parent occupation's ID
occupation_hierarchy = occupations.copy()
occupation_hierarchy['is_top_level'] = False
occupation_hierarchy['is_second_level'] = False
occupation_hierarchy['parent_occupation_id'] = False

occupation_hierarchy.head(1)

| | id | concept_type | concept_uri | preferred_label | alt_labels | description | isco_group | isco_level_1 | isco_level_2 | isco_level_3 | isco_level_4 | is_top_level | is_second_level | parent_occupation_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Occupation | http://data.europa.eu/esco/occupation/00030d09... | technical director | technical and operations director\nhead of tec... | Technical directors realise the artistic visio... | 2166 | 2 | 21 | 216 | 2166 | False | False | False |


In [18]:
# Use the API to get all the top-level ESCO occupations
head_link = 'https://ec.europa.eu/esco/api'
link = '/resource/taxonomy?uris=http://data.europa.eu/esco/concept-scheme/isco&uris=http://data.europa.eu/esco/concept-scheme/member-occupations&language=en'
req = requests.get(head_link+link)


In [19]:
# ESCO occupations 
occupations_from_API = req.json()['_embedded']['http://data.europa.eu/esco/concept-scheme/member-occupations']['_links']['hasTopConcept']
print(f'{len(occupations_from_API)} top level occupations')


1701 top level occupations
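
The chained indexing used to reach `hasTopConcept` fails with a bare `KeyError` if the response shape differs. The sketch below wraps the same path in a small function with a clearer error message. The name `top_concepts` and the toy payload (including its `example-...` URIs) are assumptions for illustration, not real API output.

```python
MEMBER_OCCUPATIONS = 'http://data.europa.eu/esco/concept-scheme/member-occupations'


def top_concepts(payload, scheme_uri=MEMBER_OCCUPATIONS):
    """Return the list of top concepts of one concept scheme in a taxonomy response."""
    try:
        return payload['_embedded'][scheme_uri]['_links']['hasTopConcept']
    except KeyError as missing:
        raise ValueError(f'Unexpected taxonomy response, missing key {missing}') from None


# Toy payload with made-up URIs, mirroring the nesting used in this notebook
toy_payload = {
    '_embedded': {
        MEMBER_OCCUPATIONS: {
            '_links': {
                'hasTopConcept': [
                    {'uri': 'http://data.europa.eu/esco/occupation/example-1', 'title': 'example one'},
                    {'uri': 'http://data.europa.eu/esco/occupation/example-2', 'title': 'example two'},
                ]
            }
        }
    }
}

concepts = top_concepts(toy_payload)
print(len(concepts), 'top level occupations')
```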


In [20]:
# Create a dataframe with the top level ESCO occupations
top_uri = []
top_titles = []

for occupation in occupations_from_API:
    top_uri.append(occupation['uri'])
    top_titles.append(occupation['title'])
    
top_occupations = pd.DataFrame(data={
    'concept_uri':top_uri,
    'title': top_titles,
    'is_top': [1]*len(top_uri)})


In [21]:
# Update occupation hierarchy dataframe
occupation_hierarchy_ = occupation_hierarchy.merge(top_occupations, on='concept_uri', how='left')
occupation_hierarchy_.loc[occupation_hierarchy_.is_top==1, 'is_top_level'] = True
occupation_hierarchy_.drop(['is_top', 'title'], axis=1, inplace=True)
occupation_hierarchy_.head(2)


| | id | concept_type | concept_uri | preferred_label | alt_labels | description | isco_group | isco_level_1 | isco_level_2 | isco_level_3 | isco_level_4 | is_top_level | is_second_level | parent_occupation_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Occupation | http://data.europa.eu/esco/occupation/00030d09... | technical director | technical and operations director\nhead of tec... | Technical directors realise the artistic visio... | 2166 | 2 | 21 | 216 | 2166 | False | False | False |
| 1 | 1 | Occupation | http://data.europa.eu/esco/occupation/000e93a3... | metal drawing machine operator | metal drawing machine technician\nmetal drawin... | Metal drawing machine operators set up and ope... | 8121 | 8 | 81 | 812 | 8121 | True | False | False |


## 1.2 Second level ESCO occupations (Level 6 ISCO)

In [35]:
# Check which Level 5 occupations have "narrower occupations";
# Note that making all the API calls will take some time
# (alternatively, use the next cell to import the preloaded data)
has_narrower_occupations = []
has_broader_occupations = []
narrow_occupations = []

for j in tqdm(range(len(top_titles)), total=len(top_titles)):
    
    # Get top occupation's data
    link = occupations_from_API[j]['href']
    req_occupation = requests.get(link)

    link_dict = req_occupation.json()['_links']
    k = list(link_dict.keys())

    # Get broader and narrower occupations
    if 'broaderOccupation' in k:
        has_broader_occupations.append(j)
        # has_broader_occupations should stay empty: top-level occupations are not expected to have a broader occupation
    if 'narrowerOccupation' in k:    
        has_narrower_occupations.append(j)
        narrow_occupations.append(link_dict['narrowerOccupation'])
        
pickle.dump(has_narrower_occupations, open(api_folder + 'ESCO_has_narrower_occupations.pickle','wb'))
pickle.dump(narrow_occupations, open(api_folder + 'ESCO_narrower_occupations.pickle','wb'))
pickle.dump(top_occupations, open(api_folder + 'ESCO_top_occupations_dataframe.pickle', 'wb'))
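
The "call the API once, then reload from disk" pattern used in this and the following cells could be wrapped in one helper. The sketch below is one possible way to do it, using `with` blocks so that file handles are closed. `load_or_compute` and `slow_api_stand_in` are hypothetical names, and the demonstration writes to a temporary directory rather than to `api_folder`.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Return the pickled object at `path`, computing and saving it on first use."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    result = compute()
    with open(path, 'wb') as f:
        pickle.dump(result, f)
    return result


calls = []

def slow_api_stand_in():
    """Stand-in for the loop of API requests; records how often it runs."""
    calls.append(1)
    return [0, 3, 7]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.pickle')
    first = load_or_compute(path, slow_api_stand_in)   # computes and saves
    second = load_or_compute(path, slow_api_stand_in)  # loads from disk

print(first == second, len(calls))
```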

In [22]:
# Import the preloaded lists of narrower occupations
has_narrower_occupations = pickle.load(open(api_folder + 'ESCO_has_narrower_occupations.pickle','rb'))
narrow_occupations = pickle.load(open(api_folder + 'ESCO_narrower_occupations.pickle','rb'))
top_occupations = pickle.load(open(api_folder + 'ESCO_top_occupations_dataframe.pickle', 'rb'))

In [23]:
# Get IDs for the broader (parent) and narrower (children) ESCO occupations
top_occupations_id = top_occupations.merge(occupations[['id','concept_uri']], on='concept_uri')

# List of lists of narrow occupation IDs
narrow_occupations_ids = []

# Broad (parent) occupation IDs
broad_occupation_ids = []

# Get IDs of the narrower occupations
for j in range(len(has_narrower_occupations)):
    
    # Get the ID of the parent occupation 
    top_id = top_occupations_id.iloc[has_narrower_occupations[j]]['id']
    broad_occupation_ids.append(top_id)
    
    # Get the IDs of the children occupations
    n_narrow = len(narrow_occupations[j])
    narrow_id = []
    for i in range(n_narrow):
        narrow_id.append(occupations[occupations.concept_uri==narrow_occupations[j][i]['uri']]['id'].values[0])
    
    narrow_occupations_ids.append(narrow_id)
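
The inner loop filters the whole `occupations` dataframe once per child URI and fails with an `IndexError` if a URI is absent. A hedged alternative is to build a URI-to-ID dictionary once and look children up in it. The sketch below uses plain Python records with made-up `example.org` URIs; with the dataframe, a similar dictionary could presumably be built with `dict(zip(occupations.concept_uri, occupations.id))`.

```python
def build_uri_index(records):
    """Map each concept URI to its occupation ID."""
    return {rec['concept_uri']: rec['id'] for rec in records}


def resolve_ids(links, uri_to_id):
    """Translate API link entries into IDs, collecting URIs that are unknown."""
    found, missing = [], []
    for link in links:
        uri = link['uri']
        if uri in uri_to_id:
            found.append(uri_to_id[uri])
        else:
            missing.append(uri)
    return found, missing


# Made-up records and links, for illustration only
records = [
    {'id': 0, 'concept_uri': 'http://example.org/occupation/a'},
    {'id': 1, 'concept_uri': 'http://example.org/occupation/b'},
]
links = [
    {'uri': 'http://example.org/occupation/b'},
    {'uri': 'http://example.org/occupation/unknown'},
]

found, missing = resolve_ids(links, build_uri_index(records))
print(found, missing)
```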
    

In [24]:
# Update occupation hierarchy dataframe
for j in range(len(broad_occupation_ids)):
    parent_id = occupations.loc[broad_occupation_ids[j]]['id']
    occupation_hierarchy_.loc[narrow_occupations_ids[j], 'is_second_level'] = True
    occupation_hierarchy_.loc[narrow_occupations_ids[j], 'parent_occupation_id'] = parent_id

In [25]:
# Get second level URIs for the next API calls
second_level_occupations = occupation_hierarchy_[occupation_hierarchy_.is_second_level==True].copy()
second_level_occupations_uris = second_level_occupations.concept_uri.to_list()
print(f"{len(broad_occupation_ids)} top level occupations have {len(second_level_occupations)} second level children")

230 top level occupations have 1064 second level children
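
The update loop assigns one `parent_occupation_id` per child, so if a child were listed under two different parents, the later assignment would silently win. The sketch below flattens the parallel parent and children lists into a child-to-parent mapping and reports such conflicts. `child_to_parent` and the toy IDs are assumptions for illustration; whether conflicts occur in the ESCO data is not checked here.

```python
def child_to_parent(parent_ids, children_ids):
    """Flatten parallel lists into {child: parent} and report conflicting parents."""
    mapping = {}
    conflicts = []
    for parent, children in zip(parent_ids, children_ids):
        for child in children:
            if child in mapping and mapping[child] != parent:
                # Same child listed under two different parents
                conflicts.append((child, mapping[child], parent))
            mapping[child] = parent
    return mapping, conflicts


# Toy IDs: child 12 appears under parents 1 and 2
mapping, conflicts = child_to_parent([1, 2], [[10, 12], [12, 13]])
print(mapping)
print(conflicts)
```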


## 1.3 Third level ESCO occupations (Level 7 ISCO)

In [57]:
# Check which Level 6 occupations have "narrower occupations";
# Note that making all the API calls will take some time
# (alternatively, use the next cell to import the preloaded data)
has_narrower_occupations_level_2 = []
narrow_occupations_level_2 = []

has_broader_occupations_level_2 = []
broader_occupations_level_2 = []

for j in tqdm(range(len(second_level_occupations)), total=len(second_level_occupations)):

    uri = second_level_occupations_uris[j]
    link = head_link + '/resource/occupation?uris=' + uri + '&language=en'
    req_occupation = requests.get(link)

    keys = list(req_occupation.json()['_embedded'].keys())
    key = keys[0]

    link_dict = req_occupation.json()['_embedded'][key]['_links']
    k = list(link_dict.keys())

    if 'broaderOccupation' in k:
        has_broader_occupations_level_2.append(j)
        broader_occupations_level_2.append(link_dict['broaderOccupation'])
        
    if 'narrowerOccupation' in k:    
        has_narrower_occupations_level_2.append(j)
        narrow_occupations_level_2.append(link_dict['narrowerOccupation'])
        
pickle.dump(has_narrower_occupations_level_2, open(api_folder + 'ESCO_has_narrower_occupations_level_2.pickle','wb'))
pickle.dump(narrow_occupations_level_2, open(api_folder + 'ESCO_narrower_occupations_level_2.pickle','wb'))

pickle.dump(has_broader_occupations_level_2, open(api_folder + 'ESCO_has_broader_occupations_level_2.pickle','wb'))
pickle.dump(broader_occupations_level_2, open(api_folder + 'ESCO_broader_occupations_level_2.pickle','wb'))


In [26]:
# Import the preloaded lists of narrower & broader occupations
has_narrower_occupations_level_2 = pickle.load(open(api_folder + 'ESCO_has_narrower_occupations_level_2.pickle','rb'))
narrow_occupations_level_2 = pickle.load(open(api_folder + 'ESCO_narrower_occupations_level_2.pickle','rb'))

has_broader_occupations_level_2 =  pickle.load(open(api_folder + 'ESCO_has_broader_occupations_level_2.pickle','rb'))
broader_occupations_level_2 = pickle.load(open(api_folder + 'ESCO_broader_occupations_level_2.pickle','rb'))
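
The request URLs for the second level occupations are built by string concatenation, which leaves the occupation URI unencoded inside the query string. A minimal sketch of building the same URL with percent-encoding from the standard library follows. `occupation_url` is a hypothetical helper, and it is an assumption that the API treats the encoded form the same way as the raw one; passing a `params` dictionary to `requests.get` is another way to get the encoding done.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

HEAD_LINK = 'https://ec.europa.eu/esco/api'


def occupation_url(uri, language='en'):
    """Build the occupation resource URL with a percent-encoded query string."""
    query = urlencode({'uris': uri, 'language': language})
    return f'{HEAD_LINK}/resource/occupation?{query}'


uri = 'http://data.europa.eu/esco/occupation/35bc3847-58ad-46f5-8921-e58acc2762a6'
url = occupation_url(uri)
print(url)

# Decoding the query string recovers the original URI
decoded = parse_qs(urlsplit(url).query)
print(decoded['uris'][0] == uri)
```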


In [27]:
# Get IDs for the broader (parent) and narrower (children) ESCO occupations
top_occupations = pd.DataFrame(data={'concept_uri':second_level_occupations_uris})
top_occupations_id = top_occupations.merge(occupations[['id','concept_uri']], on='concept_uri')

narrow_occupations_ids_level_2 = []
broad_occupation_ids_level_2 = []

for j in range(len(has_narrower_occupations_level_2)):
    top_id = top_occupations_id.iloc[has_narrower_occupations_level_2[j]]['id']
    broad_occupation_ids_level_2.append(top_id)
    
    n_narrow = len(narrow_occupations_level_2[j])
    narrow_id = []
    for i in range(n_narrow):
        narrow_id.append(occupations[occupations.concept_uri==narrow_occupations_level_2[j][i]['uri']]['id'].values[0])
    
    narrow_occupations_ids_level_2.append(narrow_id)
    

In [28]:
# Update occupation hierarchy dataframe
occupation_hierarchy__ = occupation_hierarchy_.copy()
occupation_hierarchy__['is_third_level'] = False

print('Second level occupations with third level children')
print('----')

for j in range(len(broad_occupation_ids_level_2)):
    parent_id = occupations.loc[broad_occupation_ids_level_2[j]]['id']
    print(parent_id, occupations.loc[parent_id].preferred_label)
    occupation_hierarchy__.loc[narrow_occupations_ids_level_2[j], 'is_third_level'] = True
    occupation_hierarchy__.loc[narrow_occupations_ids_level_2[j], 'parent_occupation_id'] = parent_id

Second level occupations with third level children
----
13 rental service representative in machinery, equipment and tangible goods
179 musical conductor
254 product development manager
532 rental service representative in personal and household goods
572 warehouse manager
593 quality engineer
619 building electrician
866 sommelier
923 agricultural engineer
982 commercial pilot
1025 botanist
1172 import export specialist
1191 lathe and turning machine operator
1194 distribution manager
1246 outdoor animator
1335 marketing manager
1442 fashion designer
1682 microelectronics engineer
1703 specialist pharmacist
1710 V-belt builder
1711 pastry chef
2016 supply chain manager
2180 import export manager
2277 research manager
2279 securities broker
2364 energy assessor
2525 production engineer
2531 aerospace engineer
2728 industrial production manager
2898 animal trainer
2919 industrial quality manager


In [29]:
# Get third level URIs for the next API calls
third_level_occupations = occupation_hierarchy__[occupation_hierarchy__.is_third_level==True].copy()
third_level_occupations_uris = third_level_occupations.concept_uri.to_list()
print(f"{len(broad_occupation_ids_level_2)} second level occupations have {len(third_level_occupations)} third level children")


31 second level occupations have 136 third level children


## 1.4 Fourth level ESCO occupations (Level 8 ISCO)

In [62]:
# Check which Level 7 occupations have "narrower occupations";
# Note that making all the API calls will take some time
# (alternatively, use the next cell to import the preloaded data)
has_narrower_occupations_level_3 = []
narrow_occupations_level_3 = []

has_broader_occupations_level_3 = []
broader_occupations_level_3 = []

for j in tqdm(range(len(third_level_occupations_uris)), total=len(third_level_occupations_uris)):

    uri = third_level_occupations_uris[j]
    link = head_link + '/resource/occupation?uris=' + uri + '&language=en'
    req_occupation = requests.get(link)

    keys = list(req_occupation.json()['_embedded'].keys())
    key = keys[0]

    link_dict = req_occupation.json()['_embedded'][key]['_links']
    k = list(link_dict.keys())

    if 'broaderOccupation' in k:
        has_broader_occupations_level_3.append(j)
        broader_occupations_level_3.append(link_dict['broaderOccupation'])
        
    if 'narrowerOccupation' in k:    
        has_narrower_occupations_level_3.append(j)
        narrow_occupations_level_3.append(link_dict['narrowerOccupation'])
        
pickle.dump(has_narrower_occupations_level_3, open(api_folder + 'ESCO_has_narrower_occupations_level_3.pickle','wb'))
pickle.dump(narrow_occupations_level_3, open(api_folder + 'ESCO_narrower_occupations_level_3.pickle','wb'))

pickle.dump(has_broader_occupations_level_3, open(api_folder + 'ESCO_has_broader_occupations_level_3.pickle','wb'))
pickle.dump(broader_occupations_level_3, open(api_folder + 'ESCO_broader_occupations_level_3.pickle','wb'))
        

In [30]:
# Import the preloaded lists of narrower & broader occupations
has_narrower_occupations_level_3 = pickle.load(open(api_folder + 'ESCO_has_narrower_occupations_level_3.pickle','rb'))
narrow_occupations_level_3 = pickle.load(open(api_folder + 'ESCO_narrower_occupations_level_3.pickle','rb'))

has_broader_occupations_level_3 = pickle.load(open(api_folder + 'ESCO_has_broader_occupations_level_3.pickle','rb'))
broader_occupations_level_3 = pickle.load(open(api_folder + 'ESCO_broader_occupations_level_3.pickle','rb'))


In [31]:
# Get IDs for the broader (parent) and narrower (children) ESCO occupations
top_occupations = pd.DataFrame(data={'concept_uri': third_level_occupations_uris})
top_occupations_id = top_occupations.merge(occupations[['id','concept_uri']], on='concept_uri')

narrow_occupations_ids_level_3 = []
broad_occupation_ids_level_3 = []

for j in range(len(has_narrower_occupations_level_3)):
    top_id = top_occupations_id.iloc[has_narrower_occupations_level_3[j]]['id']
    broad_occupation_ids_level_3.append(top_id)
    
    n_narrow = len(narrow_occupations_level_3[j])
    narrow_id = []
    for i in range(n_narrow):
        narrow_id.append(occupations[occupations.concept_uri==narrow_occupations_level_3[j][i]['uri']]['id'].values[0])
    
    narrow_occupations_ids_level_3.append(narrow_id)
    

In [32]:
occupation_hierarchy__['is_fourth_level'] = False

print('Third level occupations with fourth level children')
print('----')

for j in range(len(broad_occupation_ids_level_3)):
    parent_id = occupations.loc[broad_occupation_ids_level_3[j]]['id']
    print(parent_id, occupations.loc[parent_id].preferred_label)
    occupation_hierarchy__.loc[narrow_occupations_ids_level_3[j], 'is_fourth_level'] = True
    occupation_hierarchy__.loc[narrow_occupations_ids_level_3[j], 'parent_occupation_id'] = parent_id
    

Third level occupations with fourth level children
----
471 dog trainer
599 leather production manager
671 purchasing manager
1021 outdoor activities instructor
1372 specialised goods distribution manager
1978 sales manager


In [33]:
occupation_hierarchy_final = occupation_hierarchy__.copy()
occupation_hierarchy_final.sample(2)

| | id | concept_type | concept_uri | preferred_label | alt_labels | description | isco_group | isco_level_1 | isco_level_2 | isco_level_3 | isco_level_4 | is_top_level | is_second_level | parent_occupation_id | is_third_level | is_fourth_level |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1305 | 1305 | Occupation | http://data.europa.eu/esco/occupation/6d823dbb... | solar power plant operator | solar electricity power plant operative\nsolar... | Solar power plant operators operate and mainta... | 3131 | 3 | 31 | 313 | 3131 | False | True | 1361 | False | False |
| 1284 | 1284 | Occupation | http://data.europa.eu/esco/occupation/6b5e5aa6... | immigration policy officer | asylum policy officer\ngovernment policy manag... | Immigration policy officers develop strategies... | 2422 | 2 | 24 | 242 | 2422 | False | True | 2400 | False | False |


# 2. Full ESCO occupation hierarchy

In [42]:
full_hierarchy = occupation_hierarchy_final.copy()

# Add a column which explicates the top level parent (Level 5 occupation)
top_level_parent_id = []
for i, row in full_hierarchy.iterrows():
    parent_id = row.parent_occupation_id
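    # NB: False marks "no parent"; since 0 == False in Python, a parent with id 0 would end this walk early
    while parent_id != False: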
        row = full_hierarchy.loc[parent_id]
        parent_id = row.parent_occupation_id
    top_level_parent_id.append(row.id)
full_hierarchy['top_level_parent_id'] = top_level_parent_id

# Reorganise the columns
full_hierarchy = full_hierarchy[[
    'id','concept_type','concept_uri','preferred_label','alt_labels','description',
    'isco_level_1', 'isco_level_2', 'isco_level_3', 'isco_level_4',
    'is_top_level', # True if the occupation is at the top ESCO level (i.e., Level 5, counting the four ISCO-08 levels above it)
    'parent_occupation_id', # ID of the parent occupation that is one level higher
    'is_second_level','is_third_level','is_fourth_level', 
    'top_level_parent_id' # ID of the top level (Level 5) parent
]]

full_hierarchy.loc[full_hierarchy.parent_occupation_id==False, 'parent_occupation_id'] = np.nan
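
The walk up to the top level parent uses `False` as the "no parent" marker. Because `0 == False` in Python, an occupation whose parent has ID 0 would be treated as having no parent. The sketch below contrasts that with a `None` sentinel, and adds a guard against cycles. Both function names and the toy IDs are assumptions for illustration and are not taken from the ESCO data.

```python
def top_level_parent(node, parent_of):
    """Follow parent links until an occupation without a parent is reached."""
    seen = {node}
    while parent_of[node] is not None:
        node = parent_of[node]
        if node in seen:
            raise ValueError(f'Cycle detected at occupation {node}')
        seen.add(node)
    return node


def root_with_false_sentinel(node, parent_of):
    """Same walk, but with False marking 'no parent' and a `!=` comparison."""
    parent = parent_of[node]
    while parent != False:
        node = parent
        parent = parent_of[node]
    return node


# Toy chain 5 -> 0 -> 9, where 9 is the top level occupation
toy_none = {5: 0, 0: 9, 9: None}
toy_false = {5: 0, 0: 9, 9: False}

print(top_level_parent(5, toy_none))           # 9
print(root_with_false_sentinel(5, toy_false))  # 5, because 0 != False is False
```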

In [43]:
full_hierarchy.sample(3)

| | id | concept_type | concept_uri | preferred_label | alt_labels | description | isco_level_1 | isco_level_2 | isco_level_3 | isco_level_4 | is_top_level | parent_occupation_id | is_second_level | is_third_level | is_fourth_level | top_level_parent_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 676 | 676 | Occupation | http://data.europa.eu/esco/occupation/37f7677a... | machinery assembly supervisor | leadingmachinist\nlead machine operative\nmach... | Machinery assembly supervisors monitor the mac... | 3 | 31 | 312 | 3122 | False | 2093 | True | False | False | 2093 |
| 381 | 381 | Occupation | http://data.europa.eu/esco/occupation/1f5932eb... | foster care support worker | fostering support worker\nplacement support wo... | Foster care support workers assist and support... | 3 | 34 | 341 | 3412 | False | 1700 | True | False | False | 1700 |
| 2317 | 2317 | Occupation | http://data.europa.eu/esco/occupation/c3cae318... | motor vehicle engine assembler | truck engine assembler\ngas engine builder\nb... | Motor vehicle engine assemblers build and inst... | 8 | 82 | 821 | 8211 | False | 2941 | True | False | False | 2941 |


## 2.1 Sanity checks

In [44]:
print(full_hierarchy.loc[2903][['preferred_label', 'is_top_level', 'top_level_parent_id']])
print('---')
print(full_hierarchy.loc[59][['preferred_label',  'is_top_level', 'top_level_parent_id']])

preferred_label        international student exchange coordinator
is_top_level                                                False
top_level_parent_id                                            59
Name: 2903, dtype: object
---
preferred_label        administrative assistant
is_top_level                               True
top_level_parent_id                          59
Name: 59, dtype: object


In [45]:
print(full_hierarchy.loc[47][['preferred_label', 'is_top_level', 'parent_occupation_id', 'top_level_parent_id']])
print('---')
print(full_hierarchy.loc[1563][['preferred_label',  'is_top_level', 'parent_occupation_id', 'top_level_parent_id']])

preferred_label         electronic and telecommunications equipment an...
is_top_level                                                        False
parent_occupation_id                                                 1372
top_level_parent_id                                                  1563
Name: 47, dtype: object
---
preferred_label         logistics and distribution manager
is_top_level                                          True
parent_occupation_id                                   NaN
top_level_parent_id                                   1563
Name: 1563, dtype: object
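
The spot checks above could be complemented by a rule-based check over all rows. The sketch below is one possible set of rules, written for plain dictionaries: a top level occupation should have no parent and be its own root, and every `top_level_parent_id` should point at a top level occupation. `check_hierarchy` is a hypothetical helper; the first three rows reuse values printed above, while the row with ID -1 is made up to show a failure.

```python
def check_hierarchy(rows):
    """Return a list of (id, problem) tuples for rows that break the expected rules."""
    by_id = {row['id']: row for row in rows}
    problems = []
    for row in rows:
        if row['is_top_level']:
            if row['parent_occupation_id'] is not None:
                problems.append((row['id'], 'top level occupation has a parent'))
            if row['top_level_parent_id'] != row['id']:
                problems.append((row['id'], 'top level occupation is not its own root'))
        root = by_id.get(row['top_level_parent_id'])
        if root is None or not root['is_top_level']:
            problems.append((row['id'], 'root is missing or not top level'))
    return problems


rows = [
    {'id': 1563, 'is_top_level': True, 'parent_occupation_id': None, 'top_level_parent_id': 1563},
    {'id': 47, 'is_top_level': False, 'parent_occupation_id': 1372, 'top_level_parent_id': 1563},
    {'id': 59, 'is_top_level': True, 'parent_occupation_id': None, 'top_level_parent_id': 59},
]
print(check_hierarchy(rows))

# Made-up row whose root (47) is not a top level occupation
broken = rows + [{'id': -1, 'is_top_level': False, 'parent_occupation_id': 47, 'top_level_parent_id': 47}]
print(check_hierarchy(broken))
```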


# 3. Export the table

In [46]:
# Export the ESCO occupation hierarchy
full_hierarchy.to_csv('../data/processed/occupations_ESCO_hierarchy.csv', index=False)
