Improve mapping #196

Open
lizgzil opened this issue Aug 9, 2023 · 9 comments


@lizgzil
Collaborator

lizgzil commented Aug 9, 2023

Some avenues for improving our mapping algorithm:

  • Big refactor to make the code clearer
  • Wikipedia entity linking might help us improve the mapping of entities to skills taxonomies (e.g. https://github.com/amazon-science/ReFinED)
  • Map entities to multiple skills (would be good for the multiskill entities)
@lizgzil
Collaborator Author

lizgzil commented Aug 9, 2023

I had a little go at experimenting with ReFinED. I'll write up my results here.

Using ReFinED

pip install https://github.com/amazon-science/ReFinED/archive/refs/tags/V1.zip

from refined.inference.processor import Refined

# Load the pretrained Wikipedia entity linking model
refined = Refined.from_pretrained(model_name='wikipedia_model_with_numbers',
                                  entity_set="wikipedia")

esco_skill_batch = ['manage musical staff', 'supervise correctional procedures',
                    'apply anti-oppressive practices',
                    'control compliance of railway vehicles regulations',
                    'identify available services', 'perform toxicological studies',
                    'ensure coquille uniformity', 'Haskell',
                    'apply diplomatic principles', 'lead police investigations']

# One result per input text, each holding the spans found in that text
spans_batch = refined.process_text_batch(esco_skill_batch)
for esco_skill, doc in zip(esco_skill_batch, spans_batch):
    # Skills with no detected spans print nothing
    for span in doc.spans or []:
        print(esco_skill)
        print(span)
        # Only print the link details when the span resolved to a Wikidata entity
        if span.candidate_entities and span.predicted_entity.wikidata_entity_id:
            print((
                span.predicted_entity.wikidata_entity_id,
                span.predicted_entity.wikipedia_entity_title,
                span.entity_linking_model_confidence_score,
            ))

This gives:

manage musical staff
['manage', Entity(wikidata_entity_id=Q1320883, wikipedia_entity_title=Talent manager), None]
('Q1320883', 'Talent manager', 0.2776)
control compliance of railway vehicles regulations
['control', None, 'DATE']
ensure coquille uniformity
['coquille', Entity(wikidata_entity_id=Q1778928, wikipedia_entity_title=Permanent mold casting), None]
('Q1778928', 'Permanent mold casting', 0.2473)
Haskell
['Haskell', Entity not linked to a knowledge base, None]

So of the 10 ESCO skills we input, only 2 actually linked to a Wikipedia entity, and the linking confidence scores for both are pretty low (about 0.28 and 0.25).

ESCO skills to wiki

First I mapped all the ESCO skill names to Wikipedia entries.

  • 95920 unique ESCO names
  • 12848 (13%) linked to wiki using the ReFinED model

This is the distribution of linking confidence scores for these 12848 links:

Screenshot 2023-08-09 at 17 55 20

Some of the links when the score was < 0.9:
Screenshot 2023-08-09 at 18 01 12

Some of the links when the score was > 0.9:
Screenshot 2023-08-09 at 18 01 51

Linking a sample of job advert predicted entities to wiki

Using a sample of 1000 job adverts (from the mixed sample):

  • we extracted 17581 entities (SKILL - 12064 entities, MULTISKILL - 4284 and EXPERIENCE - 1233)
  • 12650 of these were unique
  • 2338 (18%) linked to wiki using the ReFinED model

This is the distribution of linking confidence scores for these 2338 links:
Screenshot 2023-08-09 at 17 56 20

Some of the links when the score was < 0.9:
Screenshot 2023-08-09 at 18 02 54

Some of the links when the score was > 0.9:
Screenshot 2023-08-09 at 18 02 39

Merging

I then mapped the predicted entities to ESCO skills via the linked wiki ids.

e.g. if an entity linked to the wiki id 'Q219416' with >0.9 confidence, then I would find which ESCO skills linked to 'Q219416' with >0.9 confidence, and use these as the output.

This actually only yielded 295 entities which could be mapped to ESCO skills this way.

Some examples:

Screenshot 2023-08-09 at 18 06 36
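
The merging step described above can be sketched roughly as follows. This is only an illustration, not the code that produced the numbers in this comment: the tuples of (name, wiki id, confidence score) are invented toy data, and `map_entities_via_wiki` is a hypothetical helper name. Only the 0.9 threshold and the id 'Q219416' come from the description.

```python
from collections import defaultdict

CONFIDENCE_THRESHOLD = 0.9  # cut-off described in the comment above

# Invented toy data: (name, linked wiki id, linking confidence score).
# The pairing of names to ids and scores is an assumption for illustration.
esco_links = [
    ("customer service", "Q219416", 0.95),
    ("provide customer care", "Q219416", 0.93),
    ("manage musical staff", "Q1320883", 0.28),
]
entity_links = [
    ("providing excellent customer service", "Q219416", 0.97),
    ("manage a band", "Q1320883", 0.96),
]


def map_entities_via_wiki(entity_links, esco_links, threshold=CONFIDENCE_THRESHOLD):
    """Map each entity to the ESCO skills that share its wiki id."""
    # Index the confidently linked ESCO skills by wiki id.
    wiki_to_esco = defaultdict(list)
    for esco_name, wiki_id, score in esco_links:
        if score > threshold:
            wiki_to_esco[wiki_id].append(esco_name)

    # Keep only entities that are confidently linked to an indexed wiki id.
    mapped = {}
    for entity, wiki_id, score in entity_links:
        if score > threshold and wiki_id in wiki_to_esco:
            mapped[entity] = wiki_to_esco[wiki_id]
    return mapped


result = map_entities_via_wiki(entity_links, esco_links)
print(result)
```

Because both sides need a confident link to the same wiki id, an entity drops out as soon as either link is weak, which is consistent with so few entities surviving the merge.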

Comparison with the original mapping method

I compared these results to the original way we mapped (using semantic similarity).

For 45% of the entities, there was some crossover between the two methods in which ESCO skill they were mapped to.

e.g.

the extracted entity "providing excellent customer service" was mapped to the ESCO skill "provide outstanding customer service" (7e5786f8-1174-4f75-97e3-cfecfd95d797) using our original method, and via Wikipedia entity linking it was matched to several:

['customer service',
  'maintain customer service',
  'customer care',
  'provide customer care',
  'provide outstanding customer service',
  'provide training in customer service techniques',
  'provide training in approaches to customer service',
  'provide training in customer service methods',
  'pursue the highest possible quality of customer service',
  'work to achieve the highest possible level of customer service',
  'act with the goal of providing the highest possible level of customer service',
  'undertake communication with customer service department',
  'work in communication with customer services',
  'correspond with customer services',
  'provide excellence in customer service']

which had the unique IDs:

['15a33d76-4640-438d-ae64-fdc0c1d3eebc',
  '75dfe1ee-5935-42ce-b820-697f827825c3',
  '704fda1b-cd0a-40fe-99fc-0a24250a2010',
  '7e5786f8-1174-4f75-97e3-cfecfd95d797',
  '8d10ae08-3b0d-4bbb-86c5-25dd2c6858cd',
  'a15dab55-f1da-4f85-ae3e-2b5c5b5333ca',
  'b215031a-dd21-48b1-a998-75d6373838d8',
  'e782f412-4cb5-45f1-b5bc-15be441171aa']

(which as you can see includes 7e5786f8-1174-4f75-97e3-cfecfd95d797)
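
One way to compute this kind of crossover is to check, per entity, whether the ESCO id from the original method appears among the ids from the wiki method. The sketch below assumes each mapping is held in a plain dict; `crossover_rate` is a hypothetical helper, and the second entity and its ids are made up so that the example has a non-matching case.

```python
# Original method: one ESCO id per entity.
original_mapping = {
    "providing excellent customer service": "7e5786f8-1174-4f75-97e3-cfecfd95d797",
    "hypothetical entity": "hypothetical-esco-id-1",  # made-up entry
}
# Wiki method: a list of ESCO ids per entity (shortened here).
wiki_mapping = {
    "providing excellent customer service": [
        "15a33d76-4640-438d-ae64-fdc0c1d3eebc",
        "7e5786f8-1174-4f75-97e3-cfecfd95d797",
    ],
    "hypothetical entity": ["hypothetical-esco-id-2"],  # made-up entry
}


def crossover_rate(original_mapping, wiki_mapping):
    """Share of entities whose original ESCO id is among the wiki-mapped ids."""
    # Only entities mapped by both methods can be compared.
    compared = [entity for entity in wiki_mapping if entity in original_mapping]
    if not compared:
        return 0.0
    overlapping = sum(
        original_mapping[entity] in wiki_mapping[entity] for entity in compared
    )
    return overlapping / len(compared)


rate = crossover_rate(original_mapping, wiki_mapping)
print(f"{rate:.0%} of entities have some crossover")
```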

When the original way didn't find a match

There were 555 entities which couldn't be mapped to ESCO using the original method (i.e. the match was only at the least granular level, e.g. S1).
Of these, only 2 had matches via the wiki method. These were:

Entity: 'Change delivery Project management Business management Stakeholder management Line Management'
ESCO match the original way: 'management skills', 'S4'
ESCO matches the wiki way: ['imprinting visionary aspirations into the business management', 'incorporate visionary aspirations into the business management'], ['272fddbb-917a-4720-8903-85ce51e1cbe5']

and

Entity: 'Fluent in written and spoken English Application deadline'
ESCO match the original way: 'self-management skills and competences', 'T3'
ESCO matches the wiki way: ['interact verbally in English', 'understand spoken English', 'understand written English', 'interacting verbally in English', 'be fluent in English', 'verbally interact in English', 'communicate verbally in English', 'show competency in written English', 'correspond in written English', 'listen to English', 'understanding spoken English', 'comprehend spoken English', 'understand English speech', 'make sense of spoken English', 'interpret spoken English', 'understanding written English', 'interpret written English', 'make sense of written English', 'comprehend written English'], ['0ee9e985-0ee5-4a73-8a12-78b53b261bb2', '64ff8d5f-58a8-4efb-af5a-e161854b3e9a', '7ee20fe2-facd-4cc5-837b-927429e0e7ac', '3993c87c-7719-4186-811b-8ddfb40e76be']

Entity length

Here are 10 random entities which were over 60 characters in length:

| skill | label | orig_mapped_esco_skill | orig_mapped_esco_id | wiki_mapped_esco_skills | unique wiki_mapped_esco_ids | do I think the wiki map is better? |
| --- | --- | --- | --- | --- | --- | --- |
| relevant degree and relevant management experience or equivalent competency gained | EXPERIENCE | management skills | S4.0.0 | ['improve transportation processes through application of management concepts', 'improve transportation processes through application of management principles'] | ['7afd29d7-9c9f-4151-83f2-562b8c94a3af'] | no |
| Transfer all vacant and all-inclusive flats over to a green energy supplier | MULTISKILL | management and administration | K0413 | ['promoting sustainable energy', 'encouraging use of sustainable energy'] | ['1a6c7e0d-fc13-41d7-a5c0-8ca00606de89'] | yes |
| ability to interpret Mechanical drawings Keywords design, estimation, tender, AutoCAD, water, drainage, rainwater, wastewater, Greenford, London | MULTISKILL | interpreting technical documentation and diagrams | S2.1.3 | ['create AutoCAD drawings', 'creation of AutoCAD drawings', 'creating AutoCAD drawings', 'make AutoCAD drawings', 'making of AutoCAD drawings', 'making AutoCAD drawings', 'AutoCAD drawings creation', 'AutoCAD drawing creation', 'drawing with AutoCAD', 'AutoCAD drawing'] | ['76415d7f-0fde-4364-b45f-5c044580d2aa'] | no |
| Perform market research to establish target accounts and contacts | MULTISKILL | performing market research | fe39d4db-4cb5-4299-bb9f-896c8fd6ab13 | ['market research', 'market research performance', 'implement market research'] | ['b011c8b4-76e1-4bbc-8bb9-1d205e7b618a', 'fe39d4db-4cb5-4299-bb9f-896c8fd6ab13'] | same |
| 2+ years of proven track record in account relationship management or customer service | EXPERIENCE | analysing and evaluating information and data | S2.7.0 | ['customer service', 'maintain customer service', 'customer care', 'provide customer care', 'provide outstanding customer service', 'provide training in customer service techniques', 'provide training in approaches to customer service', 'provide training in customer service methods', 'pursue the highest possible quality of customer service', 'work to achieve the highest possible level of customer service', 'act with the goal of providing the highest possible level of customer service', 'undertake communication with customer service department', 'work in communication with customer services', 'correspond with customer services', 'provide excellence in customer service'] | ['15a33d76-4640-438d-ae64-fdc0c1d3eebc', '75dfe1ee-5935-42ce-b820-697f827825c3', '704fda1b-cd0a-40fe-99fc-0a24250a2010', '7e5786f8-1174-4f75-97e3-cfecfd95d797', '8d10ae08-3b0d-4bbb-86c5-25dd2c6858cd', 'a15dab55-f1da-4f85-ae3e-2b5c5b5333ca', 'b215031a-dd21-48b1-a998-75d6373838d8', 'e782f412-4cb5-45f1-b5bc-15be441171aa'] | yes |
| 3-5 years of relevant experience in the planning and management of social development activities | EXPERIENCE | technical or academic writing | S1.13.3 | ['support social change'] | ['644209ac-8452-4e81-959a-2b10050023cc'] | yes |
| Ensure all technical and design information complies with Clients requirements, current Building Regulations | MULTISKILL | integrate building requirements of clients in the architecture designs | bd2102ea-c8d9-40f6-8327-211450120e96 | ['building standards'] | ['615cfc39-797f-4229-8e92-159fcf8f3030'] | no |
| utilisation of project management aligned to the agreed delivery strategy | SKILL | management and administration | K0413 | ['principles of project management'] | ['7111b95d-0ce3-441a-9d92-4c75d05c4388'] | yes |
| Responding to queries via the customer service department received via telephone | SKILL | providing information to the public and clients | S3.4.1 | ['customer service', 'maintain customer service', 'customer care', 'provide customer care', 'provide outstanding customer service', 'provide training in customer service techniques', 'provide training in approaches to customer service', 'provide training in customer service methods', 'pursue the highest possible quality of customer service', 'work to achieve the highest possible level of customer service', 'act with the goal of providing the highest possible level of customer service', 'undertake communication with customer service department', 'work in communication with customer services', 'correspond with customer services', 'provide excellence in customer service'] | ['15a33d76-4640-438d-ae64-fdc0c1d3eebc', '75dfe1ee-5935-42ce-b820-697f827825c3', '704fda1b-cd0a-40fe-99fc-0a24250a2010', '7e5786f8-1174-4f75-97e3-cfecfd95d797', '8d10ae08-3b0d-4bbb-86c5-25dd2c6858cd', 'a15dab55-f1da-4f85-ae3e-2b5c5b5333ca', 'b215031a-dd21-48b1-a998-75d6373838d8', 'e782f412-4cb5-45f1-b5bc-15be441171aa'] | yes |
| Fluent in both German and English with exceptional verbal and written communication skill | MULTISKILL | be fluent in German | 2abb9db5-350c-444c-8292-0e0b2ce00f9a | ['understand spoken German', 'interact verbally in German', 'understand written German', 'understanding spoken German', 'comprehend spoken German', 'listen to German', 'make sense of spoken German', 'communicate verbally in German', 'verbally interact in German', 'interacting verbally in German', 'be fluent in German', 'comprehend written German', 'understanding written German', 'make sense of written German', 'correspond in written German', 'show competency in written German', 'interact verbally in English', 'understand spoken English', 'understand written English', 'interacting verbally in English', 'be fluent in English', 'verbally interact in English', 'communicate verbally in English', 'show competency in written English', 'correspond in written English', 'listen to English', 'understanding spoken English', 'comprehend spoken English', 'understand English speech', 'make sense of spoken English', 'interpret spoken English', 'understanding written English', 'interpret written English', 'make sense of written English', 'comprehend written English'] | ['1d5526b3-f17b-46fc-ba7d-f4a32d908a7e', '2abb9db5-350c-444c-8292-0e0b2ce00f9a', '486e4f39-e968-41f4-955e-56e9eba96ef5', '52894650-9077-40f0-96d6-6f07d1a6cafa', '0ee9e985-0ee5-4a73-8a12-78b53b261bb2', '64ff8d5f-58a8-4efb-af5a-e161854b3e9a', '7ee20fe2-facd-4cc5-837b-927429e0e7ac', '3993c87c-7719-4186-811b-8ddfb40e76be'] | yes |

So 6/10 were better (but not necessarily perfect), 1 was the same, and 3 were worse.

Thoughts

  • It's not perfect, but does provide something extra!
  • Is it just adding more complexity with little reward?
  • Is there a way we could filter which wiki entries it matches to? (we don't really want it to match to just anything)

@india-kerle
Collaborator

wow this is so interesting!

On filtering wiki entries, I'm not sure exactly how ReFinED works, but when we used Wikidata to extract entities from patents/abstracts, there was a way to filter for relevant entities (although the number of pages that had appropriate tags to filter with was very low)

@india-kerle
Collaborator

india-kerle commented Aug 10, 2023

It does feel like we could be adding more complexity - I think treating it as an entity disambiguation problem is still an interesting idea though. Could we treat ESCO as a knowledge base and train our own entity linker to match the extracted skill to an ESCO skill?

I semi looked into spaCy's entity linker as part of a personal project familiarising myself with Prodigy.

@india-kerle
Collaborator

re: Big refactor/make the code more clear and mapping to multiple skills (and ignoring entity disambiguation) - we might want to explore the world of vector dbs/vector search because i think they have a lot of this baked in (i.e. surfacing similar data + speed)

@india-kerle
Collaborator

ok truuuuly live spitballing - what would doing both look like? Is it overkill?

  1. extracted skill -> vectorise -> use faiss/elastic search to identify the top N ESCO skills based on semantic similarity/k nearest neighbours -> label which of the top N is the most appropriate and use that data to train an entity linker to disambiguate the skill
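
The retrieval half of this idea (vectorise, then take the top N ESCO skills by similarity) could look something like the sketch below. It uses a brute-force cosine similarity in plain Python as a stand-in for faiss or Elasticsearch, and the three-dimensional vectors are made-up placeholders for real embeddings. `top_n_candidates` is a hypothetical helper; its output would be the candidate list shown to a labeller.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n_candidates(query_vector, esco_vectors, n=3):
    """Return the n ESCO skills most similar to the query, best first."""
    scored = [
        (name, cosine_similarity(query_vector, vector))
        for name, vector in esco_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Made-up toy embeddings; in practice these would come from a sentence encoder.
esco_vectors = {
    "provide outstanding customer service": [0.9, 0.1, 0.0],
    "market research": [0.1, 0.9, 0.1],
    "be fluent in German": [0.0, 0.1, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # stands in for an extracted skill

candidates = top_n_candidates(query_vector, esco_vectors, n=2)
for name, score in candidates:
    print(f"{score:.2f}  {name}")
```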

@lizgzil
Collaborator Author

lizgzil commented Aug 10, 2023

> ok truuuuly live spitballing - what would doing both look like? Is it overkill?
>
>   1. extracted skill -> vectorise -> use faiss/elastic search to identify the top N ESCO skills based on semantic similarity/k nearest neighbours -> label which of the top N is the most appropriate and use that data to train an entity linker to disambiguate the skill

oo I like it! It does make far more sense to train our own if it works ok. Once we train the EL model then I guess we don't need to apply all the vectorisation+faiss/elastic steps anymore, so is there a benefit to implementing these to just create the training data? (i.e. maybe our current mappings will do?)
I can have a little look at spaCy EL too.

@india-kerle
Collaborator

india-kerle commented Aug 10, 2023

I think if we go down the EL route, that's a good point - i don't really think there's additional benefit to implementing vector dbs beyond creating training data. I wonder how we can quickly assess that approach? what about training an EL model on an engineered training set with skills that our current approach consistently does not match very well on?

@india-kerle
Collaborator

I've already written a custom entity linker recipe in prodigy as part of a personal project so probably could get the labelling side of things up and running relatively quickly -- https://github.com/india-kerle/viclit_food_linker/blob/main/src/cake_recipe.py

@lizgzil
Collaborator Author

lizgzil commented Aug 10, 2023

oo amazing! I was thinking we'd be training on our existing labelled data (rather than creating more)? We'd have to reconfigure the data into the correct format though which might be tricky (but surely easier than labelling more?).
But - I'm not sure - do you think we'd need to relabel?

I've been following this and the training data is in the form

{"text":"Interestingly, Emerson is one of only five tennis players all-time to win multiple slam sets in two disciplines, only matched by Frank Sedgman, Margaret Court, Martina Navratilova and Serena Williams.","_input_hash":2024197919,"_task_hash":-1926469210,"spans":[{"start":15,"end":22,"text":"Emerson","rank":0,"label":"ORG","score":1,"source":"en_core_web_lg","input_hash":2024197919}],"meta":{"score":1},"options":[{"id":"Q48226","html":"<a href='https://www.wikidata.org/wiki/Q48226'>Q48226: American philosopher, essayist, and poet</a>"},{"id":"Q215952","html":"<a href='https://www.wikidata.org/wiki/Q215952'>Q215952: Brazilian footballer</a>"},{"id":"Q312545","html":"<a href='https://www.wikidata.org/wiki/Q312545'>Q312545: Australian tennis player</a>"},{"id":"NIL_otherLink","text":"Link not in options"},{"id":"NIL_ambiguous","text":"Need more context"}],"_session_id":null,"_view_id":"choice","accept":["Q312545"],"answer":"accept"}
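
If we did reuse existing labelled data, the reconfiguring might look roughly like this. It is a sketch under several assumptions: `build_el_record` is a hypothetical helper, the input shape (text, mention, candidate list, chosen id) is a guess at what our existing data could be turned into, and only a subset of the keys from the example record above is produced (the hash, session and view fields are left out). The second candidate id is made up.

```python
import json


def build_el_record(text, mention, candidates, accepted_id, label="SKILL"):
    """Build one record shaped like the example above from an existing mapping."""
    start = text.find(mention)
    if start == -1:
        raise ValueError(f"mention {mention!r} not found in text")
    # One option per candidate ESCO skill, plus the two NIL options from the example.
    options = [{"id": esco_id, "text": name} for esco_id, name in candidates]
    options.append({"id": "NIL_otherLink", "text": "Link not in options"})
    options.append({"id": "NIL_ambiguous", "text": "Need more context"})
    return {
        "text": text,
        "spans": [
            {"start": start, "end": start + len(mention), "text": mention, "label": label}
        ],
        "options": options,
        "accept": [accepted_id],
        "answer": "accept",
    }


record = build_el_record(
    text="Fluent in both German and English",
    mention="German",
    candidates=[
        ("2abb9db5-350c-444c-8292-0e0b2ce00f9a", "be fluent in German"),
        ("hypothetical-esco-id", "some other candidate skill"),  # made-up entry
    ],
    accepted_id="2abb9db5-350c-444c-8292-0e0b2ce00f9a",
)
print(json.dumps(record))  # one line per record, as in the example above
```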
