# Calculate Sentence-BERT embeddings of skills and occupations

We use the `sentence_transformers` package to generate Sentence-BERT embeddings (for more information, see [the project's GitHub page](https://github.com/UKPLab/sentence-transformers)).

For much faster calculations, you can also use [this Google Colab notebook](https://colab.research.google.com/drive/1EfEXjdu4MYZMmr7X2gymFamxpw_J23me?usp=sharing), with the GPU enabled.
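
Whether a GPU is actually available depends on the runtime. The following is a minimal sketch of choosing a device before the model is loaded, assuming PyTorch (which `sentence_transformers` builds on) may or may not be installed. `pick_device` is a hypothetical helper and is not part of either library.

```python
def pick_device():
    """Return 'cuda' if PyTorch reports a usable GPU, otherwise 'cpu'."""
    try:
        import torch  # sentence_transformers depends on PyTorch
    except ImportError:
        # Assumption: without PyTorch we simply report the CPU
        return 'cpu'
    return 'cuda' if torch.cuda.is_available() else 'cpu'


device = pick_device()
print(f'Using device: {device}')

# The device can then be passed when loading the model, for example:
# model = SentenceTransformer('bert-base-nli-mean-tokens', device=device)
```

If no device is passed, `SentenceTransformer` picks one itself, so this step is only useful when you want to see or control the choice explicitly.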

# 0. Import dependencies and inputs

In [5]:
%run ../notebook_preamble.ipy

skills = pd.read_csv(data_folder + 'processed/ESCO_skills.csv')
occupations = pd.read_csv(data_folder + 'processed/ESCO_occupations.csv')

In [2]:
from sentence_transformers import SentenceTransformer
bert_model = 'bert-base-nli-mean-tokens'
model = SentenceTransformer(bert_model)

In [3]:
import pandas as pd
import numpy as np
from time import time

In [25]:
def calculate_embeddings(list_of_sentences, save_name=None):
    """Encode sentences with the loaded SBERT model; optionally save the array as a .npy file."""

    # Calculate the sentence embeddings
    t = time()
    print(f'Calculating {len(list_of_sentences)} embeddings...', end=' ')
    sentence_embeddings = np.array(model.encode(list_of_sentences))
    print(f'Done in {time()-t:.2f} seconds')

    # Save the embeddings
    if save_name is not None:
        save_file = f'{data_folder}interim/embeddings/{save_name}.npy'
        np.save(save_file, sentence_embeddings)
        print(f'Embeddings saved in {save_file}')
    
    return sentence_embeddings
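
The saved `.npy` file can be read back with `np.load`. The following is a minimal sketch of that round trip. It uses a small random array in place of real embeddings and a temporary directory in place of `data_folder`; both are stand-ins for illustration only.

```python
import os
import tempfile

import numpy as np

# Stand-in for the output of model.encode: 3 vectors of dimension 768
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(3, 768)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, mirroring the f'{save_name}.npy' pattern above
    save_file = os.path.join(tmp_dir, 'embeddings_example.npy')
    np.save(save_file, fake_embeddings)
    loaded = np.load(save_file)

print(loaded.shape, loaded.dtype)
```

Loading the array this way avoids recomputing the embeddings in later notebooks.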

# 1. Calculate embeddings of skill descriptions

In [26]:
skills.head(1)

| Unnamed: 0 | concept_type | concept_uri | skill_type | reuse_level | preferred_label | alt_labels | description | id |
|---|---|---|---|---|---|---|---|---|
| 0 | KnowledgeSkillCompetence | http://data.europa.eu/esco/skill/0005c151-5b5a... | skill/competence | sector-specific | manage musical staff | manage staff of music\ncoordinate duties of mu... | Assign and manage staff tasks in areas such as... | 0 |


In [27]:
print(len(skills))

13485


In [28]:
# Skills description embeddings
sentence_embeddings = calculate_embeddings(skills.description.to_list(),
                                           save_name='embeddings_skills_description_SBERT')


Calculating 3 embeddings... Done in 0.37 seconds
Embeddings saved in /Users/karliskanders/Documents/mapping-career-causeways/codebase/data/interim/embeddings/embeddings_skills_description_SBERT.npy


In [29]:
print(sentence_embeddings.shape)

(3, 768)


# 2. Calculate embeddings of occupation descriptions 

In [30]:
occupations.head(1)

| Unnamed: 0 | concept_type | concept_uri | isco_group | preferred_label | alt_labels | description | id |
|---|---|---|---|---|---|---|---|
| 0 | Occupation | http://data.europa.eu/esco/occupation/00030d09... | 2166 | technical director | technical and operations director\nhead of tec... | Technical directors realise the artistic visio... | 0 |


In [31]:
print(len(occupations))

2942


In [34]:
# Occupations description embeddings
list_of_sentences = occupations.description.to_list()

# Replace newlines (\n) and non-breaking spaces (\xa0) with regular spaces
list_of_sentences = [s.replace('\n', ' ').replace('\xa0', ' ') for s in list_of_sentences]

sentence_embeddings = calculate_embeddings(list_of_sentences,
                                           save_name='embeddings_occupation_description_SBERT')


Calculating 3 embeddings... Done in 0.51 seconds
Embeddings saved in /Users/karliskanders/Documents/mapping-career-causeways/codebase/data/interim/embeddings/embeddings_occupation_description_SBERT.npy
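
The clean-up applied to the occupation descriptions can also be wrapped in a small helper, so that the same step can be reused for any list of descriptions. This is a sketch rather than the notebook's own code: `clean_description` is a hypothetical name, and collapsing repeated whitespace and mapping missing values to an empty string are added assumptions.

```python
import re


def clean_description(text):
    """Replace non-breaking spaces and newlines with spaces, then collapse repeated whitespace."""
    if not isinstance(text, str):
        # Assumption: missing values (e.g. NaN read by pandas) become empty strings
        return ''
    text = text.replace('\xa0', ' ').replace('\n', ' ')
    return re.sub(r'\s+', ' ', text).strip()


# Made-up inputs for illustration, not taken from the ESCO data
examples = ['First line\nsecond\xa0line  with extra spaces', float('nan')]
cleaned = [clean_description(s) for s in examples]
print(cleaned)
```

In the notebook, this could be applied as `[clean_description(s) for s in occupations.description.to_list()]` before calling `calculate_embeddings`.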


In [36]:
sentence_embeddings.shape

(3, 768)
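
If later steps look up an embedding by row position, the array needs exactly one row per description. The following is a minimal sketch of such a check, run on stand-in data. `check_embeddings` is a hypothetical helper, and the default dimension of 768 is taken from the shapes printed in this notebook.

```python
import numpy as np


def check_embeddings(embeddings, n_expected, dim=768):
    """Raise ValueError unless embeddings has shape (n_expected, dim)."""
    if embeddings.ndim != 2:
        raise ValueError(f'Expected a 2-D array, got {embeddings.ndim} dimension(s)')
    if embeddings.shape != (n_expected, dim):
        raise ValueError(f'Expected shape {(n_expected, dim)}, got {embeddings.shape}')
    return True


# Stand-in array with the same shape as the output printed above
stand_in = np.zeros((3, 768))
ok = check_embeddings(stand_in, n_expected=3)
```

In the notebook this would be called as `check_embeddings(sentence_embeddings, len(occupations))`.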