# Reconstruction of the ESCO occupational hierarchy

This notebook uses the [ESCO API](https://ec.europa.eu/esco/api/doc/esco_api_doc.html) to reconstruct the 'broader' (parent) and 'narrower' (children) occupational relationships. The final output table `ESCO_occupational_hierarchy.csv` is saved in the `data/processed/ESCO_occupational_hierarchy` folder.

Note that this occupational hierarchy is explicitly specified by ESCO. It differs from the skills-based sectors and sub-sectors that we derived using graph-based clustering and described in the Mapping Career Causeways [report](https://media.nesta.org.uk/documents/Mapping_Career_Causeways.pdf).

**Warning**: Neither this hierarchy nor the API itself is guaranteed to stay the same as ESCO is updated. We have therefore stored backups of the API responses (as they were in summer 2020, when this analysis was carried out) in `data/interim/ESCO_API_occupations`.
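The backup-first pattern described in the warning can be sketched as a small helper that reads a stored result when one exists and only otherwise performs the expensive call. This is a minimal illustration, not the notebook's own code: `load_or_fetch` and its `fetch` argument are hypothetical names, and the notebook itself uses separate cells for calling the API and for loading the pickles.

```python
import pickle
from pathlib import Path


def load_or_fetch(cache_path, fetch):
    """Return the cached object if it exists; otherwise call fetch() and cache it.

    `fetch` is a hypothetical zero-argument callable standing in for an API call.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Prefer the stored backup so results stay reproducible
        with cache_path.open('rb') as f:
            return pickle.load(f)
    result = fetch()
    # Create the folder if needed, then store the fresh result
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open('wb') as f:
        pickle.dump(result, f)
    return result
```

With this shape, rerunning the notebook against a changed API would still reproduce the stored results unless the backup files are deleted on purpose.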

# 0. Import dependencies and inputs

In [1]:
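# The preamble is assumed to define `data_folder` and to import pandas (pd),
# numpy (np), pickle and tqdm, all of which are used below
%run ../notebook_preamble.ipy
import requests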

In [2]:
# Specify the path where results from API calls are stored/saved
api_folder = f'{data_folder}interim/ESCO_API_occupations/'

# Specify the folder for saving the final output table
outputs_folder = f'{data_folder}processed/ESCO_occupational_hierarchy/'

In [3]:
# Import ESCO occupations
occupations = pd.read_csv(f'{data_folder}processed/ESCO_occupations.csv')

print(f'Imported {len(occupations)} occupations')
occupations.head(2)

Imported 2942 occupations


Unnamed: 0,concept_type,concept_uri,isco_group,preferred_label,alt_labels,description,id
0,Occupation,http://data.europa.eu/esco/occupation/00030d09...,2166,technical director,technical and operations director\nhead of tec...,Technical directors realise the artistic visio...,0
1,Occupation,http://data.europa.eu/esco/occupation/000e93a3...,8121,metal drawing machine operator,metal drawing machine technician\nmetal drawin...,Metal drawing machine operators set up and ope...,1


# 1. ESCO API calls

Here, we make API calls to find each occupation's narrower occupations. We do it in several steps, each subsequent step corresponding to a deeper level of the ESCO occupational hierarchy.
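The level-by-level procedure can be sketched generically as a breadth-first walk. This is a minimal illustration under stated assumptions: `collect_levels` and `get_narrower` are hypothetical names, `get_narrower` stands in for one API call returning the URIs of an occupation's narrower occupations, and each occupation is assumed to have a single parent.

```python
def collect_levels(top_uris, get_narrower, max_depth=10):
    """Walk the hierarchy one level at a time.

    Returns a list with one dict per level below the top level,
    each mapping a child URI to its parent URI.
    """
    levels = []
    frontier = list(top_uris)
    seen = set(frontier)
    for _ in range(max_depth):
        child_to_parent = {}
        for parent in frontier:
            for child in get_narrower(parent):
                # Assumption: one parent per occupation; keep the first one seen
                if child not in seen:
                    seen.add(child)
                    child_to_parent[child] = parent
        if not child_to_parent:
            break  # no deeper level exists
        levels.append(child_to_parent)
        frontier = list(child_to_parent)
    return levels


# Toy hierarchy with made-up URIs, used instead of real API calls
toy = {'top-a': ['a1', 'a2'], 'a1': ['a1x'], 'top-b': []}
levels = collect_levels(['top-a', 'top-b'], lambda uri: toy.get(uri, []))
print(levels)
```

The notebook unrolls this loop by hand into sections 1.2 to 1.4, which makes it easy to store a separate backup after each level.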

In [4]:
# Example of a broader occupation's narrower occupations
link = 'https://ec.europa.eu/esco/api/resource/occupation?uri=http://data.europa.eu/esco/occupation/35bc3847-58ad-46f5-8921-e58acc2762a6'
req_occupation = requests.get(link)
print(f"Broader occupation: {req_occupation.json()['title']}")
print('----')
print('Narrower occupations:')
link_dict = req_occupation.json()['_links']
for occ in link_dict['narrowerOccupation']:
    print(occ['title'])


Broader occupation: chemical engineering technician
----
Narrower occupations:
asphalt laboratory technician
hazardous waste technician
chemical manufacturing quality technician
colour sampling technician
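Not every occupation has a `narrowerOccupation` entry, so indexing the `_links` dictionary directly raises a `KeyError` for leaf occupations. A defensive way to read the same structure is sketched below; `narrower_links` is a hypothetical helper, and the payload is a hand-written stand-in with made-up URIs rather than a real API response.

```python
def narrower_links(payload):
    """Return (uri, title) pairs of narrower occupations, or [] if there are none."""
    links = payload.get('_links', {})
    return [(occ['uri'], occ['title']) for occ in links.get('narrowerOccupation', [])]


# Hand-written stand-in for a parsed response (the URIs are made up)
example_payload = {
    'title': 'chemical engineering technician',
    '_links': {
        'narrowerOccupation': [
            {'uri': 'http://example.org/occupation/1', 'title': 'asphalt laboratory technician'},
            {'uri': 'http://example.org/occupation/2', 'title': 'hazardous waste technician'},
        ]
    },
}

for uri, title in narrower_links(example_payload):
    print(title)
```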


## 1.1 Top-level ESCO occupations (ISCO Level 5)

We refer to these as "top-level" ESCO occupations, but note that they constitute the next level of granularity after the four-digit ISCO-08 codes. Thus, with respect to the ISCO classification, we also refer to them as "Level 5 occupations".

In [5]:
# Set up a dataframe to store the hierarchy, with columns pertaining to hierarchy levels and parent occupation's ID
occupation_hierarchy = occupations.copy()
occupation_hierarchy['is_top_level'] = False
occupation_hierarchy['is_second_level'] = False
# False is a placeholder meaning "no parent"; it is replaced by NaN in Section 2
occupation_hierarchy['parent_occupation_id'] = False

occupation_hierarchy.head(1)

Unnamed: 0,concept_type,concept_uri,isco_group,preferred_label,alt_labels,description,id,is_top_level,is_second_level,parent_occupation_id
0,Occupation,http://data.europa.eu/esco/occupation/00030d09...,2166,technical director,technical and operations director\nhead of tec...,Technical directors realise the artistic visio...,0,False,False,False


In [27]:
# Use the API to get all top-level ESCO occupations
# (alternatively, use the next cell to import the preloaded data)
head_link = 'https://ec.europa.eu/esco/api'
link = '/resource/taxonomy?uris=http://data.europa.eu/esco/concept-scheme/isco&uris=http://data.europa.eu/esco/concept-scheme/member-occupations&language=en'
req = requests.get(head_link+link)

# ESCO occupations 
occupations_from_API = req.json()['_embedded']['http://data.europa.eu/esco/concept-scheme/member-occupations']['_links']['hasTopConcept']
print(f'{len(occupations_from_API)} top-level occupations')

# Create a dataframe with the top level ESCO occupations
top_uri = []
top_titles = []

for occupation in occupations_from_API:
    top_uri.append(occupation['uri'])
    top_titles.append(occupation['title'])
    
top_occupations = pd.DataFrame(data={
    'concept_uri':top_uri,
    'title': top_titles,
    'is_top': [1]*len(top_uri)})

## Save a backup of the API response
pickle.dump(top_occupations, open(api_folder + 'ESCO_top_occupations_dataframe.pickle', 'wb'))


1701 top-level occupations


In [6]:
## Load the dataframe with all top occupations
top_occupations = pickle.load(open(api_folder + 'ESCO_top_occupations_dataframe.pickle', 'rb'))


In [7]:
# Update occupation hierarchy dataframe
occupation_hierarchy_ = occupation_hierarchy.merge(top_occupations, on='concept_uri', how='left')
occupation_hierarchy_.loc[occupation_hierarchy_.is_top==1, 'is_top_level'] = True
occupation_hierarchy_.drop(['is_top', 'title'], axis=1, inplace=True)
occupation_hierarchy_.head(2)


Unnamed: 0,concept_type,concept_uri,isco_group,preferred_label,alt_labels,description,id,is_top_level,is_second_level,parent_occupation_id
0,Occupation,http://data.europa.eu/esco/occupation/00030d09...,2166,technical director,technical and operations director\nhead of tec...,Technical directors realise the artistic visio...,0,False,False,False
1,Occupation,http://data.europa.eu/esco/occupation/000e93a3...,8121,metal drawing machine operator,metal drawing machine technician\nmetal drawin...,Metal drawing machine operators set up and ope...,1,True,False,False


## 1.2 Second level ESCO occupations (ISCO Level 6)

In [15]:
# Check which Level 5 occupations have "narrower occupations";
# Note that making all the API calls can take about 20 minutes
# (alternatively, use the next cell to import the preloaded data)
has_narrower_occupations = []
has_broader_occupations = []
narrow_occupations = []

for j in tqdm(range(len(top_titles)), total=len(top_titles)):
    
    # Get top occupation's data
    link = occupations_from_API[j]['href']
    req_occupation = requests.get(link)

    link_dict = req_occupation.json()['_links']
    k = list(link_dict.keys())

    # Get broader and narrower occupations
    if 'broaderOccupation' in k:
        # Top-level occupations should have no broader occupation,
        # so this list should remain empty if the data is consistent
        has_broader_occupations.append(j)
    if 'narrowerOccupation' in k:    
        has_narrower_occupations.append(j)
        narrow_occupations.append(link_dict['narrowerOccupation'])
        
pickle.dump(has_narrower_occupations, open(api_folder + 'ESCO_has_narrower_occupations.pickle','wb'))
pickle.dump(narrow_occupations, open(api_folder + 'ESCO_narrower_occupations.pickle','wb'))


In [8]:
# Import the preloaded lists of narrower occupations
has_narrower_occupations = pickle.load(open(api_folder + 'ESCO_has_narrower_occupations.pickle','rb'))
narrow_occupations = pickle.load(open(api_folder + 'ESCO_narrower_occupations.pickle','rb'))


In [9]:
# Get IDs for the broader (parent) and narrower (children) ESCO occupations
top_occupations_id = top_occupations.merge(occupations[['id','concept_uri']], on='concept_uri')

# List of lists of narrow occupation IDs
narrow_occupations_ids = []

# Broad (parent) occupation IDs
broad_occupation_ids = []

# Get IDs of the narrower occupations
for j in range(len(has_narrower_occupations)):
    
    # Get the ID of the parent occupation 
    top_id = top_occupations_id.iloc[has_narrower_occupations[j]]['id']
    broad_occupation_ids.append(top_id)
    
    # Get the IDs of the children occupations
    n_narrow = len(narrow_occupations[j])
    narrow_id = []
    for i in range(n_narrow):
        narrow_id.append(occupations[occupations.concept_uri==narrow_occupations[j][i]['uri']]['id'].values[0])
    
    narrow_occupations_ids.append(narrow_id)
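The lookup above filters the whole occupations table once per child and raises an `IndexError` if a URI is missing. A dictionary lookup is a common alternative, sketched here in plain Python. The helper names `build_uri_index` and `ids_for` are hypothetical; with the dataframe used in this notebook, the index could presumably be built as `dict(zip(occupations.concept_uri, occupations.id))`.

```python
def build_uri_index(records):
    """Map concept_uri -> id for a list of {'concept_uri': ..., 'id': ...} records."""
    return {rec['concept_uri']: rec['id'] for rec in records}


def ids_for(narrow_list, uri_to_id):
    """Translate API link entries into IDs, collecting unknown URIs instead of raising."""
    ids, missing = [], []
    for occ in narrow_list:
        if occ['uri'] in uri_to_id:
            ids.append(uri_to_id[occ['uri']])
        else:
            missing.append(occ['uri'])
    return ids, missing


# Toy data with made-up URIs
records = [{'concept_uri': 'uri-a', 'id': 0}, {'concept_uri': 'uri-b', 'id': 1}]
uri_to_id = build_uri_index(records)
ids, missing = ids_for([{'uri': 'uri-b'}, {'uri': 'uri-z'}], uri_to_id)
print(ids, missing)
```

Reporting missing URIs separately makes it visible when the API returns an occupation that is absent from the local table, which may happen after an ESCO update.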
    

In [10]:
# Update occupation hierarchy dataframe
for j in range(len(broad_occupation_ids)):
    parent_id = occupations.loc[broad_occupation_ids[j]]['id']
    occupation_hierarchy_.loc[narrow_occupations_ids[j], 'is_second_level'] = True
    occupation_hierarchy_.loc[narrow_occupations_ids[j], 'parent_occupation_id'] = parent_id

In [11]:
# Get second level URIs for the next API calls
second_level_occupations = occupation_hierarchy_[occupation_hierarchy_.is_second_level==True].copy()
second_level_occupations_uris = second_level_occupations.concept_uri.to_list()
# broad_occupation_ids holds the top-level parents; second_level_occupations holds their children
print(f"{len(broad_occupation_ids)} top-level occupations have {len(second_level_occupations)} second level children")

230 top-level occupations have 1064 second level children


## 1.3 Third level ESCO occupations (ISCO Level 7)

In [24]:
# Check which Level 6 occupations have "narrower occupations";
# Note that making all the API calls will take some time
# (alternatively, use the next cell to import the preloaded data)
has_narrower_occupations_level_2 = []
narrow_occupations_level_2 = []

has_broader_occupations_level_2 = []
broader_occupations_level_2 = []

for j in tqdm(range(len(second_level_occupations)), total=len(second_level_occupations)):

    uri = second_level_occupations_uris[j]
    link = head_link + '/resource/occupation?uris=' + uri + '&language=en'
    req_occupation = requests.get(link)

    keys = list(req_occupation.json()['_embedded'].keys())
    key = keys[0]

    link_dict = req_occupation.json()['_embedded'][key]['_links']
    k = list(link_dict.keys())

    if 'broaderOccupation' in k:
        has_broader_occupations_level_2.append(j)
        broader_occupations_level_2.append(link_dict['broaderOccupation'])
        
    if 'narrowerOccupation' in k:    
        has_narrower_occupations_level_2.append(j)
        narrow_occupations_level_2.append(link_dict['narrowerOccupation'])
        
pickle.dump(has_narrower_occupations_level_2, open(api_folder + 'ESCO_has_narrower_occupations_level_2.pickle','wb'))
pickle.dump(narrow_occupations_level_2, open(api_folder + 'ESCO_narrower_occupations_level_2.pickle','wb'))

pickle.dump(has_broader_occupations_level_2, open(api_folder + 'ESCO_has_broader_occupations_level_2.pickle','wb'))
pickle.dump(broader_occupations_level_2, open(api_folder + 'ESCO_broader_occupations_level_2.pickle','wb'))


In [12]:
# Import the preloaded lists of narrower & broader occupations
has_narrower_occupations_level_2 = pickle.load(open(api_folder + 'ESCO_has_narrower_occupations_level_2.pickle','rb'))
narrow_occupations_level_2 = pickle.load(open(api_folder + 'ESCO_narrower_occupations_level_2.pickle','rb'))

has_broader_occupations_level_2 =  pickle.load(open(api_folder + 'ESCO_has_broader_occupations_level_2.pickle','rb'))
broader_occupations_level_2 = pickle.load(open(api_folder + 'ESCO_broader_occupations_level_2.pickle','rb'))


In [13]:
# Get IDs for the broader (parent) and narrower (children) ESCO occupations
top_occupations = pd.DataFrame(data={'concept_uri':second_level_occupations_uris})
top_occupations_id = top_occupations.merge(occupations[['id','concept_uri']], on='concept_uri')

narrow_occupations_ids_level_2 = []
broad_occupation_ids_level_2 = []

for j in range(len(has_narrower_occupations_level_2)):
    top_id = top_occupations_id.iloc[has_narrower_occupations_level_2[j]]['id']
    broad_occupation_ids_level_2.append(top_id)
    
    n_narrow = len(narrow_occupations_level_2[j])
    narrow_id = []
    for i in range(n_narrow):
        narrow_id.append(occupations[occupations.concept_uri==narrow_occupations_level_2[j][i]['uri']]['id'].values[0])
    
    narrow_occupations_ids_level_2.append(narrow_id)
    

In [14]:
# Update occupation hierarchy dataframe
occupation_hierarchy__ = occupation_hierarchy_.copy()
occupation_hierarchy__['is_third_level'] = False

# The parents listed here are second level occupations; their children form the third level
print('Second level occupations with children')
print('----')

for j in range(len(broad_occupation_ids_level_2)):
    parent_id = occupations.loc[broad_occupation_ids_level_2[j]]['id']
    print(parent_id, occupations.loc[parent_id].preferred_label)
    occupation_hierarchy__.loc[narrow_occupations_ids_level_2[j], 'is_third_level'] = True
    occupation_hierarchy__.loc[narrow_occupations_ids_level_2[j], 'parent_occupation_id'] = parent_id

Second level occupations with children
----
13 rental service representative in machinery, equipment and tangible goods
179 musical conductor
254 product development manager
532 rental service representative in personal and household goods
572 warehouse manager
593 quality engineer
619 building electrician
866 sommelier
923 agricultural engineer
982 commercial pilot
1025 botanist
1172 import export specialist
1191 lathe and turning machine operator
1194 distribution manager
1246 outdoor animator
1335 marketing manager
1442 fashion designer
1682 microelectronics engineer
1703 specialist pharmacist
1710 V-belt builder
1711 pastry chef
2016 supply chain manager
2180 import export manager
2277 research manager
2279 securities broker
2364 energy assessor
2525 production engineer
2531 aerospace engineer
2728 industrial production manager
2898 animal trainer
2919 industrial quality manager


In [15]:
# Get third level URIs for the next API calls
third_level_occupations = occupation_hierarchy__[occupation_hierarchy__.is_third_level==True].copy()
third_level_occupations_uris = third_level_occupations.concept_uri.to_list()
# broad_occupation_ids_level_2 holds the second level parents; third_level_occupations holds their children
print(f"{len(broad_occupation_ids_level_2)} second level occupations have {len(third_level_occupations)} third level children")


31 second level occupations have 136 third level children


## 1.4 Fourth level ESCO occupations (ISCO Level 8)

In [29]:
# Check which Level 7 occupations have "narrower occupations";
# Note that making all the API calls will take some time
# (alternatively, use the next cell to import the preloaded data)
has_narrower_occupations_level_3 = []
narrow_occupations_level_3 = []

has_broader_occupations_level_3 = []
broader_occupations_level_3 = []

for j in tqdm(range(len(third_level_occupations_uris)), total=len(third_level_occupations_uris)):

    uri = third_level_occupations_uris[j]
    link = head_link + '/resource/occupation?uris=' + uri + '&language=en'
    req_occupation = requests.get(link)

    keys = list(req_occupation.json()['_embedded'].keys())
    key = keys[0]

    link_dict = req_occupation.json()['_embedded'][key]['_links']
    k = list(link_dict.keys())

    if 'broaderOccupation' in k:
        has_broader_occupations_level_3.append(j)
        broader_occupations_level_3.append(link_dict['broaderOccupation'])
        
    if 'narrowerOccupation' in k:    
        has_narrower_occupations_level_3.append(j)
        narrow_occupations_level_3.append(link_dict['narrowerOccupation'])
        
pickle.dump(has_narrower_occupations_level_3, open(api_folder + 'ESCO_has_narrower_occupations_level_3.pickle','wb'))
pickle.dump(narrow_occupations_level_3, open(api_folder + 'ESCO_narrower_occupations_level_3.pickle','wb'))

pickle.dump(has_broader_occupations_level_3, open(api_folder + 'ESCO_has_broader_occupations_level_3.pickle','wb'))
pickle.dump(broader_occupations_level_3, open(api_folder + 'ESCO_broader_occupations_level_3.pickle','wb'))
        

In [16]:
# Import the preloaded lists of narrower & broader occupations
has_narrower_occupations_level_3 = pickle.load(open(api_folder + 'ESCO_has_narrower_occupations_level_3.pickle','rb'))
narrow_occupations_level_3 = pickle.load(open(api_folder + 'ESCO_narrower_occupations_level_3.pickle','rb'))

has_broader_occupations_level_3 = pickle.load(open(api_folder + 'ESCO_has_broader_occupations_level_3.pickle','rb'))
broader_occupations_level_3 = pickle.load(open(api_folder + 'ESCO_broader_occupations_level_3.pickle','rb'))


In [17]:
# Get IDs for the broader (parent) and narrower (children) ESCO occupations
top_occupations = pd.DataFrame(data={'concept_uri': third_level_occupations_uris})
top_occupations_id = top_occupations.merge(occupations[['id','concept_uri']], on='concept_uri')

narrow_occupations_ids_level_3 = []
broad_occupation_ids_level_3 = []

for j in range(len(has_narrower_occupations_level_3)):
    top_id = top_occupations_id.iloc[has_narrower_occupations_level_3[j]]['id']
    broad_occupation_ids_level_3.append(top_id)
    
    n_narrow = len(narrow_occupations_level_3[j])
    narrow_id = []
    for i in range(n_narrow):
        narrow_id.append(occupations[occupations.concept_uri==narrow_occupations_level_3[j][i]['uri']]['id'].values[0])
    
    narrow_occupations_ids_level_3.append(narrow_id)
    

In [18]:
# Update occupation hierarchy dataframe
occupation_hierarchy__['is_fourth_level'] = False

# The parents listed here are third level occupations; their children form the fourth level
print('Third level occupations with children')
print('----')

for j in range(len(broad_occupation_ids_level_3)):
    parent_id = occupations.loc[broad_occupation_ids_level_3[j]]['id']
    print(parent_id, occupations.loc[parent_id].preferred_label)
    occupation_hierarchy__.loc[narrow_occupations_ids_level_3[j], 'is_fourth_level'] = True
    occupation_hierarchy__.loc[narrow_occupations_ids_level_3[j], 'parent_occupation_id'] = parent_id


Third level occupations with children
----
471 dog trainer
599 leather production manager
671 purchasing manager
1021 outdoor activities instructor
1372 specialised goods distribution manager
1978 sales manager


In [19]:
occupation_hierarchy_final = occupation_hierarchy__.copy()
occupation_hierarchy_final.sample(2)

Unnamed: 0,concept_type,concept_uri,isco_group,preferred_label,alt_labels,description,id,is_top_level,is_second_level,parent_occupation_id,is_third_level,is_fourth_level
1018,Occupation,http://data.europa.eu/esco/occupation/555ad9ba...,4323,gas transmission system operator,gas transmission system operative\nnatural gas...,Gas transmission system operators transport en...,1018,True,False,False,False,False
1552,Occupation,http://data.europa.eu/esco/occupation/81309031...,3412,residential care home worker,care home manager\nsleep in support care worke...,Residential care home workers follow a specifi...,1552,False,True,1700,False,False


# 2. Reconstruct the full ESCO occupational hierarchy

The final table contains information that allows one to reconstruct the full occupational hierarchy of the 2942 ESCO occupations.

The columns `isco_level_{x}`, with `x` from 1 to 4, indicate the ESCO occupation's membership of the ISCO major, sub-major, minor and unit groups, respectively.
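In ISCO-08, each group code is a prefix of the four-digit unit group code, so the four levels can be derived from `isco_group` alone. The sketch below illustrates this; `isco_levels` is a hypothetical helper, since this notebook reads the columns from `ESCO_to_ISCO.csv` and does not show how that file was built. Codes are returned as strings to preserve the leading zero of major group 0, which is an assumption that differs from the integer columns shown in the outputs.

```python
def isco_levels(isco_group):
    """Split a four-digit ISCO-08 unit group code into its four nested levels."""
    # Codes read as integers lose leading zeros (e.g. 110 for '0110'), so pad them back
    code = str(isco_group).zfill(4)
    if len(code) != 4 or not code.isdigit():
        raise ValueError(f'not a four-digit ISCO code: {isco_group!r}')
    # Level n is the first n digits of the unit group code
    return {f'isco_level_{n}': code[:n] for n in range(1, 5)}


print(isco_levels(8312))
```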

The columns `is_{top|second|third|fourth}_level` indicate whether the occupation is what we call a top-level ESCO occupation (i.e. ISCO Level 5) or a lower-level ESCO occupation (`is_second_level` corresponds to ISCO Level 6, `is_third_level` to Level 7 and `is_fourth_level` to Level 8).

Finally, the column `parent_occupation_id` indicates the direct parent ESCO occupation (top-level ESCO occupations have no parent), and `top_level_parent_id` indicates the occupation's top-level ancestor. For a top-level occupation, `top_level_parent_id` is its own `id`.
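Finding the top-level ancestor amounts to following parent links until an occupation without a parent is reached. A minimal sketch in plain Python is given here; `top_level_parent` and the `parent_of` mapping are hypothetical, and `None` is used as the "no parent" marker instead of the `False` placeholder used in this notebook, so that an occupation with ID 0 can be a parent without ambiguity.

```python
def top_level_parent(occ_id, parent_of):
    """Follow parent links upwards and return the ID of the top-level ancestor."""
    seen = set()
    current = occ_id
    while parent_of.get(current) is not None:
        # Guard against malformed data that would otherwise loop forever
        if current in seen:
            raise ValueError(f'cycle detected at id {current}')
        seen.add(current)
        current = parent_of[current]
    return current


# Toy mapping with made-up IDs: 20 -> 10 -> 0, and 30 has no parent
parent_of = {0: None, 10: 0, 20: 10, 30: None}
print(top_level_parent(20, parent_of))
print(top_level_parent(30, parent_of))
```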

In [20]:
# Import a dataframe explicitly specifying different ISCO levels
occupations_isco = pd.read_csv(f'{outputs_folder}ESCO_to_ISCO.csv')

# Create a dataframe with the full hierarchy (ISCO + ESCO levels)
full_hierarchy = occupation_hierarchy_final.merge(occupations_isco[[
    'concept_uri','isco_level_1','isco_level_2','isco_level_3','isco_level_4']], on='concept_uri')

# Add a column which explicates the top level parent (Level 5 occupation)
# Note: False marks "no parent"; we test the type rather than `!= False`,
# because `0 != False` evaluates to False and 0 is a valid occupation ID
top_level_parent_id = []
for i, row in full_hierarchy.iterrows():
    parent_id = row.parent_occupation_id
    while not isinstance(parent_id, (bool, np.bool_)):
        row = full_hierarchy.loc[parent_id]
        parent_id = row.parent_occupation_id
    top_level_parent_id.append(row.id)
full_hierarchy['top_level_parent_id'] = top_level_parent_id

# Reorganise the columns
full_hierarchy = full_hierarchy[[
    'id','concept_type','concept_uri','preferred_label',
    'isco_level_1', 'isco_level_2', 'isco_level_3', 'isco_level_4',
    'is_top_level', # True if the occupation is top ESCO level (i.e., Level 5 including the four-digit ISCO levels)
    'is_second_level','is_third_level','is_fourth_level', 
    'parent_occupation_id', # ID of the parent occupation that is one level higher in the hierarchy
    'top_level_parent_id' # ID of the ESCO top level (Level 5) parent
]]

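# Replace the False placeholder with NaN (the type check avoids also matching a parent ID of 0)
no_parent = full_hierarchy.parent_occupation_id.apply(lambda x: isinstance(x, (bool, np.bool_)))
full_hierarchy.loc[no_parent, 'parent_occupation_id'] = np.nan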


In [21]:
len(full_hierarchy)

2942

In [22]:
full_hierarchy.sample(3)

Unnamed: 0,id,concept_type,concept_uri,preferred_label,isco_level_1,isco_level_2,isco_level_3,isco_level_4,is_top_level,is_second_level,is_third_level,is_fourth_level,parent_occupation_id,top_level_parent_id
408,408,Occupation,http://data.europa.eu/esco/occupation/219e7fdd...,train preparer,8,83,831,8312,True,False,False,False,,408
119,119,Occupation,http://data.europa.eu/esco/occupation/099a901f...,motor vehicles parts advisor,5,52,522,5223,True,False,False,False,,119
2276,2276,Occupation,http://data.europa.eu/esco/occupation/c01cdf85...,financial broker,3,33,331,3311,True,False,False,False,,2276


## 2.1 Examples

We first check that `international student exchange coordinator` is a narrower occupation of the top-level occupation `administrative assistant`.

In [23]:
print(full_hierarchy.loc[2903][['id', 'preferred_label', 'is_top_level', 'top_level_parent_id']])
print('---')
print(full_hierarchy.loc[59][['id', 'preferred_label',  'is_top_level', 'top_level_parent_id']])

id                                                           2903
preferred_label        international student exchange coordinator
is_top_level                                                False
top_level_parent_id                                            59
Name: 2903, dtype: object
---
id                                           59
preferred_label        administrative assistant
is_top_level                               True
top_level_parent_id                          59
Name: 59, dtype: object


Next, we check that `logistics and distribution manager` is a top-level occupation (without a parent ID) and that it is the top-level parent of the Level 8 occupation `electronic and telecommunications equipment and parts distribution manager`.

In [24]:
print(full_hierarchy.loc[47][['id', 'preferred_label', 'is_top_level', 'parent_occupation_id', 'top_level_parent_id']])
print('---')
print(full_hierarchy.loc[1563][['id', 'preferred_label',  'is_top_level', 'parent_occupation_id', 'top_level_parent_id']])

id                                                                     47
preferred_label         electronic and telecommunications equipment an...
is_top_level                                                        False
parent_occupation_id                                                 1372
top_level_parent_id                                                  1563
Name: 47, dtype: object
---
id                                                    1563
preferred_label         logistics and distribution manager
is_top_level                                          True
parent_occupation_id                                   NaN
top_level_parent_id                                   1563
Name: 1563, dtype: object


# 3. Export the table

In [25]:
# Export the ESCO occupation hierarchy
full_hierarchy.to_csv(f'{outputs_folder}ESCO_occupational_hierarchy.csv', index=False)
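A few consistency checks on the exported table can catch problems if the hierarchy is rebuilt against a newer API version. The sketch below uses only the standard library and a three-row toy CSV. `check_hierarchy` is a hypothetical helper, and it assumes that booleans are written as `True`/`False` and missing parents as empty fields, which is how pandas normally writes them.

```python
import csv
import io


def check_hierarchy(rows):
    """Return a list of human-readable problems found in the hierarchy rows."""
    ids = {row['id'] for row in rows}
    problems = []
    for row in rows:
        parent = row['parent_occupation_id']
        is_top = row['is_top_level'] == 'True'
        if is_top and parent != '':
            problems.append(f"{row['id']}: top-level occupation has a parent")
        if not is_top and parent == '':
            problems.append(f"{row['id']}: lower-level occupation has no parent")
        # Parent IDs may be written as '1372' or '1372.0' depending on the column dtype
        if parent != '' and str(int(float(parent))) not in ids:
            problems.append(f"{row['id']}: parent {int(float(parent))} not found")
    return problems


# Toy table with made-up IDs; row 3 points to a parent that does not exist
sample = """id,is_top_level,parent_occupation_id
1,True,
2,False,1
3,False,9
"""
rows = list(csv.DictReader(io.StringIO(sample)))
problems = check_hierarchy(rows)
print(problems)
```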
