# Clustering of ESCO occupations

To find and analyse groups of related occupations that share similar job requirements and work characteristics, we applied graph-based clustering to the occupation similarities. The resulting grouping organises the 2942 ESCO occupations at two hierarchical levels – which we call skills-based sectors and sub-sectors – with 14 groups at the first level and 54 groups at the second level.

For more details on the clustering methodology, consult pp. 98-102 of the Mapping Career Causeways report.

# 0. Import dependencies and inputs

In [1]:
%run ../notebook_preamble.ipy

import mapping_career_causeways.cluster_utils as cluster_utils
import mapping_career_causeways.cluster_profiling_utils as cluster_profiling_utils
import os 

Data = load_data.Data()
Sims = load_data.Similarities()

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/karliskanders/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 0.1 Prepare the similarity matrix used for clustering

Clustering of occupations was based on the combined similarity of their essential skills, optional skills, work activities and work context. For this purpose, we combined our four similarity measures, each of which captures one of these aspects.

The clustering is based on *all* 2942 ESCO occupations. This creates a small complication when combining the similarity measures, as a small number of the occupations did not have work context features (we didn't include these occupations in further transitions analyses, as they also did not have estimates of automation risk). Below, we account for this by excluding the work context measure from the combined similarity if the origin occupation does not have work context features; the remaining three measures are then weighted equally.
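The weighting described above can be illustrated on toy data. This is a minimal sketch, assuming four small random matrices in place of the real similarity measures and a made-up `has_context` mask; the names `W_optional`, `W_context`, `has_context` and `W_toy` are hypothetical and do not appear in the codebase.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5  # toy number of occupations

# Four hypothetical similarity matrices standing in for the real measures
W_essential, W_optional, W_activities, W_context = (
    rng.random((n, n)) for _ in range(4)
)
# Hypothetical mask: True where the origin occupation has work context features
has_context = np.array([True, True, False, True, False])

p = 0.25           # weight per measure when all four are available
p_x = p / (1 - p)  # 1/3: weight per measure when work context is missing

W_four = p * (W_essential + W_optional + W_activities + W_context)
W_three = p_x * (W_essential + W_optional + W_activities)

# Choose the formula row by row, according to the origin occupation
W_toy = np.where(has_context[:, None], W_four, W_three)
```

In both cases the weights sum to one (4 × 0.25 or 3 × 1/3), so rows with and without work context features stay on a comparable scale.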

Later in the project we adopted a slightly different approach, which was used for producing the final similarity matrices for the 2942 occupations (see `notebooks/04_occupation_comparisons/Combine_similarity_measures.ipynb`).

In [2]:
matrix_path = data_folder + 'processed/sim_matrices/OccupationSimilarity_Combined_forClustering.npy'

if os.path.exists(matrix_path):
    W = np.load(matrix_path)
else:
    # For the clustering of occupations, we combined our four occupation similarity measures.
    p = 0.25 # weights for each of the four similarity measures
    p_x = p/(1-p) # for cases where occupation doesn't have work context features

    # Get the list of occupations without work context features
    esco_to_work_context_vector = pd.read_csv(data_folder + 'interim/work_context_features/occupations_work_context_vector.csv')
    # Use a set so that the membership test in the loop below is fast
    esco_with_work_context = set(esco_to_work_context_vector[esco_to_work_context_vector.has_vector==True].id)

    W = np.zeros((2942, 2942))
    for j in tqdm(range(len(W)), total=len(W)):
        if j in esco_with_work_context:
            W[j,:] = (p * Sims.W_essential[j,:]) + (p * Sims.W_all_to_essential[j,:]) + (p * Sims.W_activities[j,:]) + (p * Sims.W_work_context[j,:])
        else:
            W[j,:] = (p_x * Sims.W_essential[j,:]) + (p_x * Sims.W_all_to_essential[j,:]) + (p_x * Sims.W_activities[j,:])

    # Save the similarity matrix
    np.save(matrix_path, W)   
    
# Create a symmetric similarity matrix for clustering
W_cluster = 0.5*W + 0.5*W.T
W_cluster.shape    

(2942, 2942)

In [3]:
# Number of occupations
n_occ = W_cluster.shape[0]

# 1. Set up clustering parameters

In [4]:
# Name of this clustering session
session_name = 'ESCO_occ_v1'
# Number of nearest neighbours used for the graph construction
nearest_neighbours = [15, 20, 25, 30, 60]
# Ensemble size for the first step
N = 1000
# Ensemble size for the consensus step
N_consensus = 100
# Number of clustering trials for each nearest neighbour value
N_nn = N // len(nearest_neighbours)
# Which clusters to break down from the partition
clusters = 'all' # Either a list of integers, or 'all'
# Path to save the clustering results
fpath = f'{data_folder}interim/raw_clustering/{session_name}/'

clustering_params = {
    'N': N,
    'N_consensus': N_consensus,
    'N_nn': N_nn,
    'clusters': clusters,
    'fpath': fpath,
    'session_name': session_name,
    'nearest_neighbours': nearest_neighbours}
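With these settings, each of the five nearest-neighbour values gets `1000 // 5 = 200` clustering trials, which matches the "ensemble with 200 partitions" messages in the logs. To make the role of the `nearest_neighbours` parameter more concrete, here is a minimal sketch of one common way to build a k-nearest-neighbour graph from a similarity matrix. The function `knn_graph` is hypothetical and is not the implementation used in `cluster_utils`, which may differ in how it selects and symmetrises edges.

```python
import numpy as np

def knn_graph(W, k):
    """Keep, for every node, the edges to its k most similar neighbours.

    Returns a symmetric weighted adjacency matrix: an edge is kept if
    either endpoint lists the other among its k nearest neighbours.
    """
    n = W.shape[0]
    S = W.astype(float).copy()
    np.fill_diagonal(S, -np.inf)  # never pick a node as its own neighbour
    # Column indices of the k largest similarities in each row
    idx = np.argpartition(-S, k - 1, axis=1)[:, :k]
    rows = np.repeat(np.arange(n), k)
    A = np.zeros((n, n))
    A[rows, idx.ravel()] = W[rows, idx.ravel()]
    return np.maximum(A, A.T)

# Toy usage on a random symmetric similarity matrix
rng = np.random.default_rng(1)
W_demo = rng.random((6, 6))
W_demo = 0.5 * (W_demo + W_demo.T)
A_demo = knn_graph(W_demo, k=2)
```

Smaller values of `k` give sparser graphs, and using several values is one way to make the ensemble less dependent on any single choice.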

# 2. Perform two steps of clustering

## 2.1 Level-1 clusters

In [5]:
# Prepare and save the Level-0 partition file (all occupations in one cluster)
os.makedirs(fpath, exist_ok=True)  # make sure the output folder exists
partition_df = pd.DataFrame()
partition_df['id'] = Data.occupations.id.to_list()
partition_df['cluster'] = np.zeros((len(Data.occupations)))
partition_df.to_csv(fpath+session_name+'_clusters_Level0.csv')

# Set the random_state variable for reproducibility
clustering_params['random_state'] = 14523

In [6]:
# Perform the clustering
cluster_utils.subcluster_nodes(W=W_cluster, l=0, **clustering_params)

Partitioning cluster 0.0...
Building the graph... done!
Clustering graph with 15 nearest-neighbours...
Setting random seeds...
Generating an ensemble with 200 partitions...
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Elapsed time:  19.00 seconds
Building the graph... done!
Clustering graph with 20 nearest-neighbours...
Setting random seeds...
Generating an ensemble with 200 partitions...
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Elapsed time:  21.00 seconds
Building the graph... done!
Clustering graph with 25 nearest-neighbours...
Setting random seeds...
Generating an ensemble with 200 partitions...
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

In [7]:
# Collect subclusters into one partition
partition_1 = cluster_utils.collect_subclusters(0, fpath, session_name, n_total=n_occ)

# Check that partition contains all nodes
len(partition_1)

Final partition saved in /Users/karliskanders/Documents/mapping-career-causeways/codebase/data/interim/raw_clustering/ESCO_occ_v1/ESCO_occ_v1_clusters_Level1.csv


2942

In [8]:
# Check a summary of the clustering result
cluster_utils.ConsensusClustering.describe_partition(partition_1.cluster.values)

Clustering with 2942 nodes and 14 clusters.


{'n': 14,
 'sizes': [600, 426, 322, 272, 237, 185, 178, 145, 129, 124, 89, 87, 74, 74]}
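As a quick sanity check, the reported cluster sizes should add up to the number of occupations. The sketch below also shows how such a summary could be computed from a vector of labels; the helper `describe` is hypothetical and only mimics the shape of the output above, it is not the `describe_partition` implementation.

```python
from collections import Counter

def describe(labels):
    """Summarise a partition: number of clusters and sizes, largest first."""
    sizes = sorted(Counter(labels).values(), reverse=True)
    return {'n': len(sizes), 'sizes': sizes}

# Sizes of the 14 Level-1 clusters, copied from the output above
reported_sizes = [600, 426, 322, 272, 237, 185, 178, 145, 129, 124, 89, 87, 74, 74]
total = sum(reported_sizes)  # should equal the number of occupations

# Toy usage on a small label vector
toy_summary = describe([0, 0, 1, 2, 2, 2])
```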

## 2.2 Level-2 clusters

In [9]:
# Load the partition that we wish to further split apart
partition = pd.read_csv(fpath + session_name + '_clusters_Level1.csv')

# Set the random_state variable for reproducibility
clustering_params['random_state'] = 1

In [10]:
clustering_params

{'N': 1000,
 'N_consensus': 100,
 'N_nn': 200,
 'clusters': 'all',
 'fpath': '/Users/karliskanders/Documents/mapping-career-causeways/codebase/data/interim/raw_clustering/ESCO_occ_v1/',
 'session_name': 'ESCO_occ_v1',
 'nearest_neighbours': [15, 20, 25, 30, 60],
 'random_state': 1}

In [11]:
# Perform the clustering
cluster_utils.subcluster_nodes(W=W_cluster, l=1, **clustering_params)


Partitioning cluster 0...
Building the graph... done!
Clustering graph with 15 nearest-neighbours...
Setting random seeds...
Generating an ensemble with 200 partitions...
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Elapsed time:  3.00 seconds
Building the graph... done!
Clustering graph with 20 nearest-neighbours...
Setting random seeds...
Generating an ensemble with 200 partitions...
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Elapsed time:  4.00 seconds
Building the graph... done!
Clustering graph with 25 nearest-neighbours...
Setting random seeds...
Generating an ensemble with 200 partitions...
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Elapsed time:  2.00 seconds
Building the graph... done!
Clustering graph with 60 nearest-neighbours...
Setting random seeds...
Generating an ensemble with 200 partitions...
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Elapsed time:  4.00 seconds
Clustering the consensus partition...
Setting random seeds...
Using co-occurrence matrix to do consensus clustering...
Building the graph... done!
Setting random seeds...
Generating an ensemble with 100 partitions...
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Average pairwise AMI across 100 partitions is 0.9744
Clustering with 272 nodes 

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Elapsed time:  1.00 seconds
Building the graph... done!
Clustering graph with 20 nearest-neighbours...
Setting random seeds...
Generating an ensemble with 200 partitions...
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Elapsed time:  1.00 seconds
Building the graph... done!
Clustering graph with 25 nearest-neighbours...
Setting random seeds...
Generating an ensemble with 200 partitions...
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Elapsed time:  1.00 seconds
Building the graph... d
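The logs above mention a co-occurrence matrix being used for the consensus step. The general idea can be sketched as follows: for every pair of nodes, count the fraction of ensemble partitions in which they land in the same cluster. The function `cooccurrence_matrix` is a hypothetical illustration and not the code in `cluster_utils`.

```python
import numpy as np

def cooccurrence_matrix(partitions):
    """Fraction of partitions in which each pair of nodes shares a cluster."""
    partitions = np.asarray(partitions)  # shape: (n_partitions, n_nodes)
    n_partitions, n_nodes = partitions.shape
    C = np.zeros((n_nodes, n_nodes))
    for labels in partitions:
        # Boolean matrix: True where nodes i and j have the same label
        C += labels[:, None] == labels[None, :]
    return C / n_partitions

# Toy ensemble of three partitions over four nodes
ensemble = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
C = cooccurrence_matrix(ensemble)
```

Here nodes 0 and 1 are always grouped together, so `C[0, 1]` is 1, whereas nodes 0 and 3 never are, so `C[0, 3]` is 0. A matrix like this can itself be treated as a similarity matrix and clustered again to obtain a consensus partition.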

In [12]:
# Adjust the clustering labels and save
partition_2 = cluster_utils.collect_subclusters(1, fpath, session_name, n_total = n_occ)

Final partition saved in /Users/karliskanders/Documents/mapping-career-causeways/codebase/data/interim/raw_clustering/ESCO_occ_v1/ESCO_occ_v1_clusters_Level2.csv


In [13]:
cluster_utils.ConsensusClustering.describe_partition(partition_2.cluster.values);

Clustering with 2942 nodes and 54 clusters.


# 3. Combine the partitions from both steps

In [14]:
partition_1 = pd.read_csv(fpath + session_name + '_clusters_Level1.csv')
partition_2 = pd.read_csv(fpath + session_name + '_clusters_Level2.csv')

# Create a dataframe with both partitions
partitions = partition_1.merge(partition_2, on='id')
partitions = partitions.rename(columns={'cluster_x': 'level_1', 'cluster_y': 'level_2'})

# Relabel Level 2 clusters to match the ordering of Level 1 clusters
partitions = partitions.sort_values(['level_1','level_2'])
level_2_labels = partitions.drop_duplicates('level_2').level_2.to_list()
level_2_new_labels = list(range(len(level_2_labels)))
relabel_dict = dict(zip(level_2_labels, level_2_new_labels))
partitions.level_2 = partitions.level_2.apply(lambda x: relabel_dict[x])
partitions = partitions.sort_values('id')

partitions.sort_values(['level_1','level_2'])

Unnamed: 0,id,level_1,level_2
2,2,0,0
51,51,0,0
83,83,0,0
93,93,0,0
113,113,0,0
...,...,...,...
604,604,13,53
1281,1281,13,53
1917,1917,13,53
2850,2850,13,53
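The relabelling step can be illustrated on a toy example in plain Python. The sketch assumes, as the notebook code does, that every Level-2 label belongs to exactly one Level-1 cluster; the `pairs` data and variable names are made up for illustration.

```python
# Toy (level_1, level_2) label pairs, one per occupation; level_2 ids are arbitrary
pairs = [(1, 7), (0, 3), (1, 2), (0, 9), (0, 3), (1, 7)]

# Visit the sub-clusters in (level_1, level_2) order and number them 0, 1, 2, ...
ordered = sorted(set(pairs))
relabel = {old: new for new, (_, old) in enumerate(ordered)}

# Apply the mapping without changing the order of the occupations
relabelled = [(level_1, relabel[level_2]) for level_1, level_2 in pairs]
```

After relabelling, the Level-2 labels increase together with the Level-1 labels, so sub-clusters of the same sector occupy a contiguous range of integers.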


In [15]:
# Final dataframe with occupation clusters
occ_cluster = load_data.Data().occupation_hierarchy.merge(partitions, on='id')
occ_cluster = occ_cluster[['id', 'concept_uri', 'preferred_label',
                           'isco_level_4',
                           'level_1', 'level_2']]

In [16]:
len(occ_cluster)

2942

# 4. Profile the clusters

In [17]:
occ_df = occ_cluster.copy()
occ_df.head(2)

Unnamed: 0,id,concept_uri,preferred_label,isco_level_4,level_1,level_2
0,0,http://data.europa.eu/esco/occupation/00030d09...,technical director,2166,5,25
1,1,http://data.europa.eu/esco/occupation/000e93a3...,metal drawing machine operator,8121,3,16


In [18]:
# Generate keywords for each cluster, using the occupation labels
keywords_level_1, keywords_level_1_ = cluster_profiling_utils.tfidf_keywords(partitions.level_1.values,
                                                                            occ_df, 'preferred_label', [])
keywords_level_2, keywords_level_2_ = cluster_profiling_utils.tfidf_keywords(partitions.level_2.values,
                                                                            occ_df, 'preferred_label', [])
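To give an intuition for keyword extraction of this kind, here is a minimal TF-IDF sketch in plain Python that treats the pooled occupation labels of each cluster as one document. The function `cluster_keywords` and the toy labels are hypothetical; the actual `tfidf_keywords` helper may tokenise, weight and filter terms differently (for instance, the outputs below include two-word phrases).

```python
import math
from collections import Counter

def cluster_keywords(labels_by_cluster, top_n=3):
    """Rank the words of each cluster's pooled labels by a simple TF-IDF score."""
    # One "document" per cluster: word counts over all of its occupation labels
    docs = {c: Counter(" ".join(labels).split())
            for c, labels in labels_by_cluster.items()}
    n_docs = len(docs)
    # Document frequency: in how many clusters does each word appear?
    df = Counter(word for counts in docs.values() for word in counts)
    keywords = {}
    for c, counts in docs.items():
        total = sum(counts.values())
        scores = {w: (tf / total) * math.log(n_docs / df[w])
                  for w, tf in counts.items()}
        # Highest score first; ties broken alphabetically for determinism
        keywords[c] = sorted(scores, key=lambda w: (-scores[w], w))[:top_n]
    return keywords

# Toy usage with two small clusters of occupation labels
toy_clusters = {
    0: ["machine operator", "metal drawing machine operator"],
    1: ["school teacher", "secondary school teacher"],
}
toy_keywords = cluster_keywords(toy_clusters, top_n=2)
```

Words that are frequent within a cluster but rare across clusters score highest, which is why generic terms shared by many clusters tend to drop out.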


In [19]:
# Create dataframes with the keywords
clusters_level_1 = partitions.copy().drop_duplicates('level_1').sort_values('level_1')
clusters_level_1 = clusters_level_1.drop(['level_2','id'], axis=1)
clusters_level_1['keywords'] = keywords_level_1
clusters_level_1 = clusters_level_1.reset_index(drop=True)

clusters_level_2 = partitions.copy().drop_duplicates(['level_1','level_2']).sort_values(['level_2'])
clusters_level_2 = clusters_level_2.drop(['id'], axis=1)
clusters_level_2['keywords'] = keywords_level_2
clusters_level_2 = clusters_level_2.reset_index(drop=True)


In [20]:
clusters_level_1

Unnamed: 0,level_1,keywords
0,0,"technician, operator, inspector, assembler, mi..."
1,1,"manager, officer, policy, policy officer, anal..."
2,2,"shop, shop manager, seller, specialised seller..."
3,3,"operator, machine operator, machine, maker, pr..."
4,4,"engineer, technician, drafter, engineering, en..."
5,5,"artist, director, editor, designer, painter, j..."
6,6,"teacher, school, lecturer, secondary school, s..."
7,7,"leather, leather good, textile, footwear, oper..."
8,8,"operator, food, operator food, machine operato..."
9,9,"import export, export, import, export manager,..."


In [21]:
clusters_level_2

Unnamed: 0,level_1,level_2,keywords
0,0,0,"assembler, technician, inspector, engineering ..."
1,0,1,"supervisor, construction, installer, operator,..."
2,0,2,"technician, repair technician, repair, operato..."
3,0,3,"officer, driver, guard, pilot, aviation, airpo..."
4,0,4,"environmental, inspector, worker, waste, prote..."
5,0,5,"mine, engineer, manager mine, operator, mining..."
6,1,6,"manager, supervisor, assembly supervisor, asse..."
7,1,7,"financial, insurance, analyst, manager, broker..."
8,1,8,"assistant, clerk, officer, court, administrati..."
9,1,9,"manager, business, product manager, officer, p..."


# 5. Export the partitions and cluster keywords

In [22]:
save_path = f'{data_folder}interim/raw_clustering/{session_name}/final_partitions/'
os.makedirs(save_path, exist_ok=True)  # make sure the output folder exists
occ_cluster.to_csv(save_path + f'partitions_{session_name}.csv', index=False)
clusters_level_1.to_csv(save_path + f'partitions_{session_name}_LEVEL1.csv', index=False)
clusters_level_2.to_csv(save_path + f'partitions_{session_name}_LEVEL2.csv', index=False)
