# Identify normalized skills in CVs

Extracting normalized skills from CVs **simplifies the hiring process** by allowing recruiters to quickly identify candidates with the desired skills.
Although this is still an open problem in general, it can be tackled in specific sectors using language processors and interfaces that support applicants in writing their CVs
and applying for jobs.

Normalization provides a **consistent context** that can be passed both to AI models and to humans for evaluation.



Large platforms like LinkedIn have a large amount of data and significant experience in this area.
On the other hand, a more focused approach for specific tasks can be more effective, especially when evaluating a large number of CVs (e.g. for junior positions).
Moreover, a standardized approach can be more effective in the long term, since it allows building a common knowledge base to support the relocation of workers over time
(e.g. identifying skill gaps and training needs).

ESCO is a multilingual classification of skills, competences, qualifications and occupations maintained and updated periodically by the European Commission.



## Structure of the ESCO dataset

```mermaid
---
title: ESCO dataset
---
graph

SK[Skill]
O[Occupation]
K[Knowledge]

SK -->|broader| SK

O -->|essentialSkill 1..n| SK
O -->|optionalSkill 1..n| SK

SK -.->|type| S
SK -.->|type| K

```



## Structure of Skills and Occupations

Here is an excerpt from the **SaaS (service-oriented modelling)** skill:

- preferred label: SaaS (service-oriented modelling)
- unique identifier as a URI: http://data.europa.eu/esco/skill/eeca3780-8049-499f-a268-95a7ad26642c
- alternative labels: SaaS model
- description: The SaaS model consists of principles and fundamentals of service-oriented modelling for business and software systems that allow the design and specification of service-oriented business systems within a variety of architectural styles, such as enterprise architecture.


While occupations change over time, skills are more stable.
Nonetheless, the ESCO Occupation taxonomy is a useful reference for writing job positions and listing the associated skills.

An ESCO Occupation excerpt for **Cloud engineer**. Note that the URIs of the skills are present in the dataset but not shown below:

- preferred label: cloud engineer
- unique identifier as a URI: http://data.europa.eu/esco/occupation/349ee6f6-c295-4c38-9b98-48765b55280e
- alternative labels: cloud-native engineer, cloud architect, cloud computing engineer, cloud developer, cloud devops engineer, cloud infrastructure engineer, cloud network engineer, cloud security engineer, cloud software engineer, cloud solution engineer, hybrid cloud engineer
- description: Cloud engineers are responsible for the design, planning, management and maintenance of cloud-based systems. They develop and implement cloud-applications, handle the migration of existing on-premise applications to the cloud, and debug cloud stacks.
- required skills: ICT system integration, ICT system programming, SaaS (service-oriented modelling), cloud monitoring and reporting, cloud security and compliance, cloud technologies, computer programming, cyber security, database development tools, implement cloud security and compliance, operating systems, systems development life-cycle, systems theory
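
The two excerpts above can be modelled with a couple of small data classes. This is only a sketch: the class and field names (`Skill`, `Occupation`, `required_skill_labels`) are illustrative assumptions and do not mirror the ESCO schema, while the values are taken from the excerpts.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Skill:
    uri: str
    preferred_label: str
    alt_labels: tuple[str, ...] = ()
    description: str = ""


@dataclass
class Occupation:
    uri: str
    preferred_label: str
    # Only labels are stored here, because the excerpt omits the skill URIs.
    required_skill_labels: list[str] = field(default_factory=list)

    def requires(self, skill: Skill) -> bool:
        """Check, by label, whether the occupation lists the given skill."""
        names = {skill.preferred_label, *skill.alt_labels}
        return any(label in names for label in self.required_skill_labels)


saas = Skill(
    uri="http://data.europa.eu/esco/skill/eeca3780-8049-499f-a268-95a7ad26642c",
    preferred_label="SaaS (service-oriented modelling)",
    alt_labels=("SaaS model",),
)
cloud_engineer = Occupation(
    uri="http://data.europa.eu/esco/occupation/349ee6f6-c295-4c38-9b98-48765b55280e",
    preferred_label="cloud engineer",
    # A subset of the required skills listed in the excerpt.
    required_skill_labels=[
        "SaaS (service-oriented modelling)",
        "cloud technologies",
        "computer programming",
    ],
)
print(cloud_engineer.requires(saas))
```

In the real dataset the link between an occupation and its skills is expressed through URIs, so a lookup by URI would be more robust than this comparison by label.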



## Extracting skills from CVs

This notebook shows how to create a spaCy NLP model, based on the ESCO skill labels and descriptions, that identifies skills in CVs.

In [61]:
import spacy
import yaml
from pathlib import Path
import pandas as pd
import json

# Without GPU this is quite slow. Further info on using a GCP server on GPU can be found in GCP.md
spacy.prefer_gpu()

False

In [62]:
# Load the ESCO data from SPARQL and create a dataframe with some useful columns.
# Alternatively, use the pre-processed esco.json file.
df = pd.read_json("../esco/esco.json.gz", orient="records")
df.index = df.s
skills = df.groupby(df.s).agg(
    {
        "altLabel": lambda x: list(set(x)),
        "label": lambda x: x.iloc[0],
        "description": lambda x: x.iloc[0],
        "skillType": lambda x: x.iloc[0],
    }
)
# Add a lowercase text field for semantic search.
skills["text"] = skills.apply(
    lambda x: "; ".join([x.label] + x.altLabel + [x.description]).lower(), axis=1
)
# .. and a set of all the labels for each skill.
skills["allLabel"] = skills.apply(
    lambda x: {t.lower() for t in x.altLabel} | {x.label.lower()}, axis=1
)
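
To make the aggregation easier to follow, here is the same idea on a tiny, made-up input in plain Python. It assumes that the raw data contains one row per (skill, alternative label) pair, which is what the `groupby` above suggests; the rows and the `skill/1`, `skill/2` identifiers are invented for illustration.

```python
# Hypothetical raw rows: one row per (skill, alternative label) pair.
rows = [
    {"s": "skill/1", "label": "KDevelop", "altLabel": "KDevelop",
     "description": "A suite of software development tools."},
    {"s": "skill/1", "label": "KDevelop", "altLabel": "KDevelop 5.0.0",
     "description": "A suite of software development tools."},
    {"s": "skill/2", "label": "Haskell", "altLabel": "Haskell",
     "description": "Techniques of software development in Haskell."},
]

# Group by skill identifier, collecting the distinct alternative labels.
skills_by_uri = {}
for row in rows:
    entry = skills_by_uri.setdefault(
        row["s"],
        {"label": row["label"], "description": row["description"], "altLabel": set()},
    )
    entry["altLabel"].add(row["altLabel"])

# Derive the lowercase search text and the set of all labels.
for entry in skills_by_uri.values():
    alt = sorted(entry["altLabel"])
    entry["text"] = "; ".join([entry["label"], *alt, entry["description"]]).lower()
    entry["allLabel"] = {a.lower() for a in alt} | {entry["label"].lower()}

print(skills_by_uri["skill/1"]["allLabel"])
```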



In [63]:

with pd.option_context("max_colwidth", None):
    display(skills.head())



s,altLabel,label,description,skillType,text,allLabel
http://data.europa.eu/esco/skill/000f1d3d-220f-4789-9c0a-cc742521fb02,[Haskell],Haskell,"The techniques and principles of software development, such as analysis, algorithms, coding, testing and compiling of programming paradigms in Haskell.",knowledge,"haskell; haskell; the techniques and principles of software development, such as analysis, algorithms, coding, testing and compiling of programming paradigms in haskell.",{haskell}
http://data.europa.eu/esco/skill/00c04e40-35ea-4ed1-824c-82f936c8f876,[Incremental development],Incremental development,The incremental development model is a methodology to design software systems and applications.,knowledge,incremental development; incremental development; the incremental development model is a methodology to design software systems and applications.,{incremental development}
http://data.europa.eu/esco/skill/0121ac3b-775b-4faf-b86d-0078623674bc,"[develop terminology databases, compile terminology databases, define terminology databases prepare terminology databases, write terminology databases, developing terminology databases, populate terminology databases, developing terminology database, draw up terminology databases, develop terminology database]",develop terminology databases,Collect and submit terms after verifying their legitimacy in order to build up terminology databases on an array of domains.,skill,develop terminology databases; develop terminology databases; compile terminology databases; define terminology databases prepare terminology databases; write terminology databases; developing terminology databases; populate terminology databases; developing terminology database; draw up terminology databases; develop terminology database; collect and submit terms after verifying their legitimacy in order to build up terminology databases on an array of domains.,"{develop terminology databases, compile terminology databases, define terminology databases prepare terminology databases, write terminology databases, developing terminology databases, populate terminology databases, developing terminology database, draw up terminology databases, develop terminology database}"
http://data.europa.eu/esco/skill/013441c1-1f13-47e9-80c4-9a53e8e1bc05,"[KDevelop, KDevelop 4.7.0, KDevelop 4.0.0, KDevelop 5.0.0, KDevelop 4.6.0]",KDevelop,"The computer program KDevelop is a suite of software development tools for writing programs, such as compiler, debugger, code editor, code highlights, packaged in a unified user interface. It is developed by the software community KDE.",knowledge,"kdevelop; kdevelop; kdevelop 4.7.0; kdevelop 4.0.0; kdevelop 5.0.0; kdevelop 4.6.0; the computer program kdevelop is a suite of software development tools for writing programs, such as compiler, debugger, code editor, code highlights, packaged in a unified user interface. it is developed by the software community kde.","{kdevelop 4.6.0, kdevelop 4.7.0, kdevelop 5.0.0, kdevelop 4.0.0, kdevelop}"
http://data.europa.eu/esco/skill/0189f448-179e-47cc-9716-c5c3ac4b1aec,[Absorb],Absorb (learning management systems),"The learning system Absorb is an e-learning platform for creating, administrating and delivering e-learning education courses or training programs for secondary school students.",knowledge,"absorb (learning management systems); absorb; the learning system absorb is an e-learning platform for creating, administrating and delivering e-learning education courses or training programs for secondary school students.","{absorb (learning management systems), absorb}"


In [64]:

# Smoke test some skills
list(skills[skills.label.str.contains("MySQL")].altLabel)

[['MySQL Embedded (OEM/ISV)',
  'MySQL',
  'MySQL Cluster CGE',
  'MySQL Enterprise Edition',
  'MySQL Classic Edition',
  'MySQL Standard Edition']]

In [65]:
def make_pattern(kn: dict):
    """Given an ESCO skill entry in the dataframe, create a pattern for the matcher.
    
    The entry has the following fields:
    - label: the preferred label
    - altLabel: a list of alternative labels
    - the skillType: e.g. knowledge, skill, ability

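    The logic uses a heuristic: labels of up to 3 characters are matched
    case-sensitively (TEXT), longer ones case-insensitively (LOWER).
    """
    label = kn["label"]
    # Note: a multi-word preferred label becomes a single-token pattern here, which
    # in practice cannot match; its per-word form is only added when the same label
    # also appears among the alternative labels below.
    pattern = [{"LOWER": label.lower()}] if len(label) > 3 else [{"TEXT": label}]
    patterns = [pattern]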
    altLabel = [kn['altLabel']] if isinstance(kn['altLabel'], str) else kn['altLabel']
    for alt in altLabel:
        if len(alt) <= 3:
            candidate = [{"TEXT": alt}]
        elif len(alt.split()) > 1:
            candidate = [{"LOWER": x} for x in alt.lower().split()]
        else:
            candidate = [{"LOWER": alt.lower()}]
        if candidate in patterns:
            print(f"Skipping {candidate}")
            continue
        patterns.append(candidate)
    pattern_identifier = f"{kn['skillType'][:2]}_{label.replace(' ', '_')}".upper().translate(
        str.maketrans("", "", "()")
    )
    if 'python' in pattern_identifier.lower():
        print(pattern_identifier, patterns)
    return pattern_identifier, patterns

# Create the patterns for the matcher
m = dict(make_pattern(kni) for kni in skills.to_dict(orient="records"))


Skipping [{'LOWER': 'haskell'}]
Skipping [{'LOWER': 'kdevelop'}]
Skipping [{'LOWER': 'maltego'}]
Skipping [{'LOWER': 'erlang'}]
Skipping [{'LOWER': 'wiziq'}]
Skipping [{'LOWER': 'lisp'}]
Skipping [{'LOWER': 'sass'}]
Skipping [{'LOWER': 'edmodo'}]
Skipping [{'TEXT': 'MDX'}]
Skipping [{'LOWER': 'drupal'}]
Skipping [{'LOWER': 'engrade'}]
Skipping [{'LOWER': 'less'}]
Skipping [{'LOWER': 'nessus'}]
Skipping [{'LOWER': 'javascript'}]
Skipping [{'LOWER': 'live'}, {'LOWER': 'script'}]
Skipping [{'LOWER': 'javascript'}]
Skipping [{'LOWER': 'live-script'}]
Skipping [{'TEXT': 'DB2'}]
Skipping [{'LOWER': 'analyze'}, {'LOWER': 'large-scale'}, {'LOWER': 'data'}, {'LOWER': 'in'}, {'LOWER': 'healthcare'}]
Skipping [{'LOWER': 'xquery'}]
Skipping [{'LOWER': 'ponie'}]
Skipping [{'LOWER': 'perl'}]
Skipping [{'LOWER': 'perl'}]
Skipping [{'LOWER': 'phtml'}]
Skipping [{'TEXT': 'PHP'}]
Skipping [{'LOWER': 'database'}]
Skipping [{'LOWER': 'xcode'}]
Skipping [{'LOWER': 'schoology'}]
Skipping [{'LOWER': 'objects
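
The heuristic inside `make_pattern` can be isolated into two small functions. This is a simplified sketch rather than the notebook's code: `label_to_pattern` and `pattern_identifier` are hypothetical names, and, unlike `make_pattern`, the per-word split is applied to every label, including the preferred one.

```python
def label_to_pattern(label: str) -> list[dict]:
    """Turn one label into a token pattern.

    Labels of up to 3 characters are matched case-sensitively; longer ones
    are matched case-insensitively, one dictionary per whitespace-separated word.
    """
    if len(label) <= 3:
        return [{"TEXT": label}]
    return [{"LOWER": word} for word in label.lower().split()]


def pattern_identifier(skill_type: str, label: str) -> str:
    """Build an identifier such as KN_HASKELL from the skill type and the label."""
    raw = f"{skill_type[:2]}_{label.replace(' ', '_')}".upper()
    return raw.translate(str.maketrans("", "", "()"))


print(label_to_pattern("R"))
print(label_to_pattern("MySQL Cluster CGE"))
print(pattern_identifier("knowledge", "Absorb (learning management systems)"))
```

Splitting on whitespace may not coincide with how the spaCy tokenizer splits text, for instance around hyphens or brackets, so patterns built this way can miss labels containing such characters.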

In [66]:
# Use spacy matcher with a blank model to validate the patterns.
# If this doesn't work, spacy will raise an error.
import spacy
from spacy.matcher import Matcher
nlp_test = spacy.blank("en")
m1 = Matcher(nlp_test.vocab, validate=True)
for pid, patterns in m.items():
    m1.add(pid, patterns) 


In [67]:
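# Save the matcher patterns, keyed by pattern identifier.
with open("../generated/esco_matchers.json", "w") as f:
    json.dump(m, f)

# Flatten the patterns into the list-of-dicts format used by the spaCy entity ruler.
esco_p = [{"label": "ESCO", "pattern": pattern} for patterns in m.values() for pattern in patterns]

# Save the entity ruler patterns to a JSON file.
with open("../generated/esco_patterns.json", "w") as f:
    json.dump(esco_p, f)

# Show the first 3 patterns.
list(m.items())[:3]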
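
Every entry in `esco_p` carries the same `ESCO` label, so the pattern no longer records which skill it came from. spaCy entity ruler patterns accept an optional `id` key, which is exposed on matched entities as `ent_id_`. The sketch below shows a flattening that keeps the identifier; the sample `matchers` dictionary is made up and `to_ruler_patterns` is a hypothetical helper, not part of the notebook.

```python
import json


def to_ruler_patterns(matchers: dict, label: str = "ESCO") -> list[dict]:
    """Flatten {identifier: [pattern, ...]} into entity ruler entries with an id."""
    return [
        {"label": label, "pattern": pattern, "id": pattern_id}
        for pattern_id, patterns in matchers.items()
        for pattern in patterns
    ]


# Made-up sample with the same shape as the dictionary `m` built above.
matchers = {
    "KN_HASKELL": [[{"LOWER": "haskell"}]],
    "KN_MYSQL": [
        [{"LOWER": "mysql"}],
        [{"LOWER": "mysql"}, {"LOWER": "cluster"}, {"LOWER": "cge"}],
    ],
}

ruler_patterns = to_ruler_patterns(matchers)
serialized = json.dumps(ruler_patterns)
print(serialized)
```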

## Create an ESCO spacy entity recognizer model

This entity recognizer reuses the en_core_web_trf model, which is quite good at identifying `PRODUCT` entities.
We will add a new entity label, ESCO, that uses the altLabel patterns to identify further entities.

The entity ruler that assigns the ESCO label is added to the pipeline after the NER component: the NER component first identifies the entities already known to the en_core_web_trf model, and the ruler then adds the ESCO entities.
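
To see what the entity ruler contributes on its own, the sketch below applies a few hand-written patterns to a blank English pipeline, assuming spaCy 3 is installed. The blank pipeline, the sample sentence and the three patterns are illustrative assumptions; the notebook attaches the ruler to `en_core_web_trf` instead.

```python
import spacy

# A blank pipeline has a tokenizer but no statistical NER component.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(
    [
        {"label": "ESCO", "pattern": [{"LOWER": "mysql"}]},
        {"label": "ESCO", "pattern": [{"LOWER": "incremental"}, {"LOWER": "development"}]},
        # Short labels are matched case-sensitively.
        {"label": "ESCO", "pattern": [{"TEXT": "R"}]},
    ]
)

doc = nlp("I used MySQL and incremental development, plus some R scripts.")
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```

By default the entity ruler does not overwrite entities set by earlier components, so when it runs after `ner` the statistical model's entities take precedence wherever spans overlap.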

In [68]:
# This model is quite good at recognizing ICT entities like products.
import spacy
from spacy import displacy  # Load a viewer.
nlp_e = spacy.load("en_core_web_trf")
ruler = nlp_e.add_pipe("entity_ruler", after="ner")
ruler.add_patterns(esco_p)
nlp_e.to_disk("../generated/en_core_web_trf_esco_ner")


In [69]:
import sys
from hashlib import sha256
DATADIR = Path(sys.path[0]).parent / "tests" / "data"
text = (DATADIR / "rpolli.txt").read_text()


def get_stats(doc):
    doc_id = sha256(doc.text.encode("utf-8")).hexdigest()
    all_entities = [(ent.text, ent.label_) for ent in doc.ents]
    ent_count = len(all_entities)
    ent_unique = set(all_entities)
    ent_unique_count = len(ent_unique)
    ent_text_unique = set(text for text, _ in ent_unique)
    ent_text_unique_count = len(ent_text_unique)
    ent_skills = set((t, l) for t, l in ent_unique if l in ("ESCO", "PRODUCT"))
    ent_skills_text_unique = len(set(text for text, _ in ent_skills))
    return {
        "doc_id": doc_id,
        "ent_count": ent_count,
        "ent_unique": list(ent_unique),
        "ent_unique_count": ent_unique_count,
        "ent_text_unique": list(ent_text_unique),
        "ent_text_unique_count": ent_text_unique_count,
        "ent_skills": list(ent_skills),
        "ent_skills_text_unique": ent_skills_text_unique,
    }

doc = nlp_e(text)
# Show some stats
get_stats(doc)

{'doc_id': 'eeec810d9b58cd8adbb09104ef5a9deb38a2c29b8713f602c1657fd64df7187f',
 'ent_count': 104,
 'ent_unique': [('2012-2022.', 'DATE'),
  ('Thales Alenia Space', 'ORG'),
  ('Ansible', 'PRODUCT'),
  ('Redis', 'PRODUCT'),
  ('ldap', 'ESCO'),
  ('R', 'ESCO'),
  ('Babel', 'ORG'),
  ('Rome\n', 'ORG'),
  ('2007–2011', 'DATE'),
  ('dstat', 'PRODUCT'),
  ('Apple', 'ORG'),
  ('Bedework', 'PRODUCT'),
  ('ansible', 'PRODUCT'),
  ('OpenShift', 'PRODUCT'),
  ('5 December', 'DATE'),
  ('AIX', 'PRODUCT'),
  ('2018', 'DATE'),
  ('JBoss EAP', 'PRODUCT'),
  ('Oracle', 'ORG'),
  ('Other\n\nItalian', 'LAW'),
  ('Kieran O’Grady', 'PERSON'),
  ('java', 'ESCO'),
  ('389ds', 'PRODUCT'),
  ('the Italian Privacy Code', 'LAW'),
  ('W3C', 'ORG'),
  ('Docker', 'PRODUCT'),
  ('1992', 'DATE'),
  ('1998', 'DATE'),
  ('Digital Transformation Team', 'ORG'),
  ('Google', 'ORG'),
  ('July 26, 2022', 'DATE'),
  ('2005–2007', 'DATE'),
  ('Linux', 'PRODUCT'),
  ('Alenia Spazio Spa', 'ORG'),
  ('IETF', 'ORG'),
  ('Italian'
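
The output above lists both `Ansible` and `ansible` as distinct products, because uniqueness is checked case-sensitively. If that is not desired, one option is to fold the case before counting. A sketch on plain `(text, label)` tuples, where `unique_skill_texts` is a hypothetical helper:

```python
def unique_skill_texts(entities, skill_labels=("ESCO", "PRODUCT")):
    """Return the sorted, case-folded texts of the skill-like entities."""
    return sorted(
        {text.strip().casefold() for text, label in entities if label in skill_labels}
    )


# A few (text, label) pairs in the same shape as `ent_unique`.
entities = [
    ("Ansible", "PRODUCT"),
    ("ansible", "PRODUCT"),
    ("java", "ESCO"),
    ("Google", "ORG"),
    ("Linux", "PRODUCT"),
]
print(unique_skill_texts(entities))
```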

In [None]:
# Visualize the result.
displacy.render(doc, style="ent", jupyter=True)

# Model testing

The recognizer is tested by processing a set of CVs and returning the entities found in each CV.


In [None]:
# Path(".").parent is still ".", so resolve the data directory as in the cells above.
DATADIR = Path(sys.path[0]).parent / "tests" / "data"
cvs = yaml.safe_load((DATADIR / 'curricula.yaml').read_text("utf-8"))


In [None]:
# Prepare the data for the test.


def model_factory(model_name, patterns=None, config=None):
    model = spacy.load(model_name)
    if patterns:
        config = config or {}
        ruler = model.add_pipe("entity_ruler", **config)
        ruler.add_patterns(patterns)
    return model


testcases = {
    "base": model_factory("en_core_web_md"),
    "trf": model_factory("en_core_web_trf"),
    "trf_pre": model_factory("en_core_web_trf", esco_p, config={"before": "ner"}),
    "trf_post": model_factory("en_core_web_trf", esco_p, config={"after": "ner"}),
}


In [None]:
# Run the testcases and save each result asap, since the test can take a long time.
results_out = Path("results.out")
with results_out.open("wb") as fh:
    fh.write(b"[")
    for model_name, nlp_model in testcases.items():
        for doc, cv in nlp_model.pipe([(cv["text"], cv) for cv in cvs], as_tuples=True):
            stats = get_stats(doc)
            stats["model"] = model_name
            fh.write(json.dumps(stats).encode())
            fh.write(b",\n")
    fh.write(b"]")

In [None]:
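# The file ends with a trailing comma, which json.loads rejects but a YAML flow sequence tolerates.
results = Path("results.out").read_text()
data = yaml.safe_load(results)
df = pd.DataFrame(data)
# Replace the model name with its ordinal; this assumes each model processed exactly 9 CVs.
df["model"] = df.index // 9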
# aggregate the dataframe above by doc_id, using the couple (model, ent_.._count) as columns
# and the doc_id as the index
results = df.groupby(["doc_id", "model"]).agg(list).unstack()
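
The cell that writes `results.out` produces a list with a trailing comma, which is why the file is read back with a YAML parser. One alternative, sketched here with made-up records and a temporary file, is to write one JSON document per line (JSON Lines): each record is still saved as soon as it is available, and the file can be read with the standard `json` module.

```python
import json
import tempfile
from pathlib import Path

# Made-up records standing in for the output of get_stats.
records = [
    {"model": "base", "doc_id": "a1", "ent_count": 3},
    {"model": "trf", "doc_id": "a1", "ent_count": 5},
]

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "results.jsonl"
    with out.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")
            fh.flush()  # Make the record visible on disk right away.

    # Each non-empty line is a complete JSON document.
    loaded = [
        json.loads(line)
        for line in out.read_text(encoding="utf-8").splitlines()
        if line
    ]

print(loaded)
```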


# Experiment with the saved model

Now that we have a saved model, we can experiment with it.


In [None]:
import re
import sys
from pathlib import Path

import spacy
from spacy import displacy

DATADIR = Path(sys.path[0]).parent / "tests" / "data"

nlp_esco = spacy.load("../generated/en_core_web_trf_esco_ner/")

In [None]:
text_raw = (DATADIR / "rpolli.txt").read_text()
text = re.sub('\n+','\n',text_raw)

In [None]:
doc = nlp_esco(text.replace("\n", " "))

In [None]:
# Start analyzing the parts of speech (POS)
from collections import Counter
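def most_common(pos, n=6):
    """Return the n most frequent lemmas with the given part of speech, ignoring tokens of 1-2 characters."""
    return Counter(t.lemma_ for t in doc if t.pos_ == pos and len(t.text) > 2).most_common(n)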
print("Most common verbs", most_common("VERB"))
print("Most common nouns", most_common("NOUN"))

In [None]:
# Now a nice display
displacy.render(doc, style="ent", jupyter=True)

In [None]:
def get_stats(doc):
    doc_id = sha256(doc.text.encode("utf-8")).hexdigest()
    all_entities = [(ent.text, ent.label_) for ent in doc.ents]
    ent_count = len(all_entities)
    ent_unique = set(all_entities)
    ent_unique_count = len(ent_unique)
    ent_text_unique = set(text for text, _ in ent_unique)
    ent_text_unique_count = len(ent_text_unique)
    ent_skills = set((t, l) for t, l in ent_unique if l in ("ESCO", "PRODUCT"))
    ent_skills_text_unique = len(set(text for text, _ in ent_skills))
    return {
        "doc_id": doc_id,
        "ent_count": ent_count,
        "ent_unique": list(ent_unique),
        "ent_unique_count": ent_unique_count,
        "ent_text_unique": list(ent_text_unique),
        "ent_text_unique_count": ent_text_unique_count,
        "ent_skills": list(ent_skills),
        "ent_skills_text_unique": ent_skills_text_unique,
    }

get_stats(doc)

In [None]:
from esco import infer_skills_from_products, infer_skills_from_skill

s_ = infer_skills_from_products(
    skills,
    product_labels=[t.text for t in doc.ents if t.label_ in ("PRODUCT", "ESCO")],
)
print(yaml.safe_dump(s_))
print(yaml.safe_dump(list(infer_skills_from_skill(k) for k in s_)))
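
The helpers imported from the `esco` module are not shown in this notebook. As a rough illustration of the underlying idea only, and not of their actual implementation, recognized entity texts can be mapped back to skill URIs through an inverted index over the lowercase labels. The index below is filled by hand with two entries from the skills table shown earlier; `build_label_index` and `normalize` are hypothetical names.

```python
def build_label_index(all_labels: dict[str, set[str]]) -> dict[str, set[str]]:
    """Map each lowercase label to the URIs of the skills that use it."""
    index: dict[str, set[str]] = {}
    for uri, labels in all_labels.items():
        for label in labels:
            index.setdefault(label.lower(), set()).add(uri)
    return index


def normalize(entity_texts, index):
    """Return the URIs matched by the entity texts and the texts left unmatched."""
    matched, unmatched = set(), []
    for text in entity_texts:
        uris = index.get(text.strip().lower())
        if uris:
            matched |= uris
        else:
            unmatched.append(text)
    return matched, unmatched


HASKELL = "http://data.europa.eu/esco/skill/000f1d3d-220f-4789-9c0a-cc742521fb02"
KDEVELOP = "http://data.europa.eu/esco/skill/013441c1-1f13-47e9-80c4-9a53e8e1bc05"

index = build_label_index(
    {HASKELL: {"haskell"}, KDEVELOP: {"kdevelop", "kdevelop 5.0.0"}}
)
matched, unmatched = normalize(["Haskell", "KDevelop 5.0.0", "Docker"], index)
print(sorted(matched), unmatched)
```

Texts that stay unmatched, such as products recognized by the statistical model but absent from ESCO, would need a different strategy, for example a semantic search over the `text` field.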

## Some takeaways

- The model is not perfect, but it is a good starting point for further improvements.
- Even a simple lexical analysis can provide useful insights.
- It helps to identify legacy skills that are no longer relevant in the current market.
- The model is good at identifying knowledge items (e.g., programming languages, tools).
- The model is not good at identifying skills made up of multiple words (e.g., "can do X"). This could be addressed in different ways.
