Add prodigy folder for tagging more skills data using prodigy #193

Merged · 19 commits merged into dev on Aug 18, 2023

Conversation

@lizgzil (Collaborator) commented Jul 10, 2023

Fixes #192

  • combine_labels.py now includes adding in the new Prodigy labels
  • ner_spacy.py uses "en_core_web_lg" as the base model, uses a fixed random seed, and has a few small bug fixes
  • pipeline/skill_ner/prodigy/ folder created with NER Prodigy labelling instructions, a recipe, and a data processing script
  • A new model called 20230808 has been trained. I've added "Update to new model everywhere" #197 as a separate issue + PR to update all the model versions everywhere (as I wanted to keep this PR separate)
  • I noticed an error with the pytest GitHub Action. The fix was found in this and this (adding typing_extensions<4.6.0 and spacy==3.4.0 to requirements_dev.txt; see the sketch after this list)
  • The GitHub Action then raised another error; it looks like the same problem discussed in "Installation Error: Building wheel for scikit-learn (pyproject.toml) did not run successfully" #194
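
For reference, the two pins named above would sit in requirements_dev.txt roughly like this (the file's surrounding contents aren't shown in this PR):

```
# requirements_dev.txt (relevant additions only)
typing_extensions<4.6.0
spacy==3.4.0
```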

Note for reviewer:

  • Check code looks ok
  • Try running:

```python
from ojd_daps_skills.pipeline.skill_ner.ner_spacy import JobNER

job_ner = JobNER()
nlp = job_ner.load_model('outputs/models/ner_model/20230808/', s3_download=True)
text = "We want someone with good communication and maths skills. There are job benefits such as a pension and cycle to work scheme. We would like someone with experience in marketing."
pred_ents = job_ner.predict(text)
```

You should get:

```python
>>> pred_ents
[{'label': 'SKILL', 'start': 26, 'end': 39},
 {'label': 'SKILL', 'start': 44, 'end': 56},
 {'label': 'BENEFIT', 'start': 103, 'end': 123},
 {'label': 'EXPERIENCE', 'start': 152, 'end': 175}]
>>> for ent in pred_ents:
...     print(text[ent['start']:ent['end']])
communication
maths skills
cycle to work scheme
experience in marketing
```

Prodigy labels and new NER model

  • In Prodigy, BENEFIT entities were also labelled. This causes a slight problem since the label-studio data doesn't have BENEFIT labels. I opted to train the NER model with the BENEFIT label anyway, which means the recall has ended up being really low.
  • The number of labelled job adverts has increased from 375 to 500.

Skill metric changes:

|                | F1   | Precision | Recall |
| -------------- | ---- | --------- | ------ |
| Previous model | 0.59 | 0.68      | 0.52   |
| Newest model   | 0.61 | 0.71      | 0.54   |

All metrics from the new model:

| Entity     | F1    | Precision | Recall |
| ---------- | ----- | --------- | ------ |
| Skill      | 0.612 | 0.712     | 0.537  |
| Experience | 0.524 | 0.647     | 0.441  |
| Benefit    | 0.531 | 0.708     | 0.425  |
| All        | 0.590 | 0.680     | 0.521  |

Thanks for contributing to Nesta's Skills Extractor Library 🙏!

If you have suggested changes to code anywhere outside of the ExtractSkills class, please consult the checklist below.

Checklist ✔️🐍:

  • I have refactored my code out from notebooks/
  • I have checked the code runs
  • I have tested the code
  • I have run pre-commit and addressed any issues not automatically fixed
  • I have merged any new changes from dev
  • I have documented the code
    • Major functions have docstrings
    • Appropriate information has been added to READMEs
  • I have explained the feature in this PR or (better) in output/reports/
  • I have requested a code review

If you have suggested changes to documentation (and/or the ExtractSkills class), please ALSO consult the checklist below.

Documentation Checklist ✔️📚:

  • I have run make html in docs
  • I have manually reviewed the docs/build/*.html files locally to ensure they are formatted correctly
  • I have pushed both relevant files AND their corresponding docs/build/*.html files

@lizgzil lizgzil marked this pull request as ready for review August 2, 2023 08:52
Comment on lines +94 to +113:

```python
# Custom way to split into chunks of a certain size.
# It's not ideal if these are too big (the model struggles)
# or too small (it's hard to label).
def split_text(adverts, chunk_size=5):
    for advert in adverts:
        text = advert["text"]
        meta = advert["meta"]
        sentences = text.split(".")
        sentences = [
            sentence.strip()
            for sentence in sentences
            if len(sentence.strip()) != 0
        ]
        for sent_id, i in enumerate(range(0, len(sentences), chunk_size)):
            yield {
                "text": ". ".join(sentences[i : i + chunk_size]),
                "meta": {"id": meta["id"], "chunk": sent_id},
            }

stream = split_text(list(stream))
```
@lizgzil (Collaborator, Author):
This is the only thing different from spaCy's original `ner.correct` recipe.
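
For illustration only (this example is not in the PR), the chunking generator above behaves like this on a toy advert:

```python
# Toy input in the assumed shape {"text": ..., "meta": {"id": ...}}
adverts = [{"text": "One. Two. Three. Four. Five. Six.", "meta": {"id": "ad-1"}}]

for chunk in split_text(adverts, chunk_size=5):
    print(chunk)
# {'text': 'One. Two. Three. Four. Five', 'meta': {'id': 'ad-1', 'chunk': 0}}
# {'text': 'Six', 'meta': {'id': 'ad-1', 'chunk': 1}}
```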

@lizgzil (Collaborator, Author) commented Aug 14, 2023

This is ready for review now @india-kerle. The tests fail for Python 3.9, but I think we can look into that separately (see #194), since it's an environment issue rather than a code issue.

@india-kerle (Collaborator) left a comment:

Thanks Liz for the PR! I was able to predict experiences, skills and benefits using the most up-to-date NER model. The only comment I've made is around dealing with spaCy's char indices being annoying: they've released a nice parameter to deal with this (it's unclear whether it's totally relevant to your use case, but worth knowing about; I've added more detail in a comment).

@@ -116,6 +116,20 @@ def load_s3_json(s3, bucket_name, file_name):
return json.loads(file)


def load_prodigy_jsonl_s3_data(s3, bucket_name, file_name):
@india-kerle (Collaborator):
👍


- This model can be used by running:
+ A trained model can be used by running:
@india-kerle (Collaborator):
I was able to run this in a notebook with no problems and got the same results!

@@ -126,7 +137,7 @@ The `s3_download=True` argument will mean this model will be first downloaded fr
Running

```
- python ojd_daps_skills/pipeline/skill_ner/get_skills.py --model_path outputs/models/ner_model/20220825/ --output_file_dir escoe_extension/outputs/data/skill_ner/skill_predictions/ --job_adverts_filename escoe_extension/inputs/data/skill_ner/data_sample/20220622_sampled_job_ads.json
+ python ojd_daps_skills/pipeline/skill_ner/get_skills.py --model_path outputs/models/ner_model/20230808/ --output_file_dir escoe_extension/outputs/data/skill_ner/skill_predictions/ --job_adverts_filename escoe_extension/inputs/data/skill_ner/data_sample/20220622_sampled_job_ads.json
```
@india-kerle (Collaborator):
🫨

"""
s3 = get_s3_resource()
prodigy_data_chunks = defaultdict(dict)
for prodigy_labelled_data_s3_folder in prodigy_labelled_data_s3_folders:
@india-kerle (Collaborator):
Is there a reason why we're looping, as it looks like we only have one filename? Or am I misunderstanding?

@lizgzil (Collaborator, Author):
No, you are right! I was copying the structure of label-studio, where we had multiple files, without thinking. Will adapt.

Comment on lines +146 to +147:

```python
def combine_prodigy_spans(prodigy_data_chunks):
    """
```
@india-kerle (Collaborator):
This is such a pain for Prodigy/spaCy more generally. When dealing with start/end char indexes in spaCy `Span`s, there is a nice parameter called `alignment_mode` which lets you define how character indices snap to token boundaries. The options are "strict" (no snapping), "contract" (span of all tokens completely within the character span), and "expand" (span of all tokens at least partially covered by the character span); the default is "strict". This is what I modified when I needed to deal with spans for the company descriptions recipe.

More info is in the `Span.char_span` method in the Span documentation. Not sure if this is totally relevant here, but something to consider.
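
A minimal sketch of the snapping behaviour described above (not from this PR, and assuming en_core_web_sm is installed):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("We want good communication skills.")

# "communication" spans characters 13-26; offsets 14-26 start mid-token.
print(doc.char_span(14, 26, label="SKILL", alignment_mode="strict"))    # None (no snapping)
print(doc.char_span(14, 26, label="SKILL", alignment_mode="contract"))  # None (no token fully inside)
print(doc.char_span(14, 26, label="SKILL", alignment_mode="expand"))    # communication
```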

@lizgzil (Collaborator, Author):
Oh, that's good to know about! Thanks 🙏

@@ -0,0 +1,106 @@
"""
Process a dataset of job adverts for labelling in Prodigy
@india-kerle (Collaborator):
I remember you saying that you cleaned the data slightly more than for the company descriptions approach. Do you think it's worth me using this instead, to be more consistent?

@lizgzil (Collaborator, Author):

Sorry, this is probably an out-of-date reply since you've already started labelling the company descriptions. It might have been worth adding more cleaning, but it might not make much of a difference to your task, especially if you are tagging the whole job advert at more or less the sentence level. So I wouldn't worry about it.

The reason I did more cleaning here was that I was splitting into chunks of 5 sentences by full stop, and I kept finding that the previous level of cleaning hadn't separated them out well enough (e.g. it was replacing \n with a space, not a full stop); see the sketch below.
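
A rough sketch of that kind of cleaning (assumed for illustration; this is not the PR's actual cleaning code):

```python
import re

def clean_for_sentence_split(text):
    # Turn newlines into full stops so that splitting on "." later
    # doesn't glue separate lines into one "sentence".
    text = re.sub(r"\n+", ". ", text)
    # Collapse any doubled full stops this creates ("end.. Next" -> "end. Next").
    text = re.sub(r"\.\s*\.", ".", text)
    return text

print(clean_for_sentence_split("Good communication.\nPension scheme"))
# Good communication. Pension scheme
```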

Comment on lines +97 to +98:

```python
def split_text(adverts, chunk_size=5):
    for advert in adverts:
```
@india-kerle (Collaborator):
It's funny how the length is an issue in this recipe but not in the other.

@lizgzil (Collaborator, Author):
It's so odd!

| ---------- | ----- | --------- | ------ |
| Skill | 0.612 | 0.712 | 0.537 |
| Experience | 0.524 | 0.647 | 0.441 |
| Benefit | 0.531 | 0.708 | 0.425 |
@india-kerle (Collaborator):
I know the recall isn't ideal, but honestly I'm impressed with the precision given how few data points are labelled BENEFIT.

@lizgzil (Collaborator, Author):
Yeah, it's good! And really this metric should be applied to just the Prodigy test data too; I think if it was then perhaps the precision would be higher (though I think the recall would be the same)?

I had such a pain with this because I wanted to train one NER model to predict all the entity types, but obviously only the most recent labelled job adverts had BENEFIT labels. So the model effectively sees the label-studio training data as saying there are no BENEFITs when there probably are.

I couldn't find a way to tell it to only train the BENEFIT label on the new data, whilst using all the data to train the SKILL and EXPERIENCE labels.

I tried training a model just with the Prodigy data, and on the BENEFIT label it got:

| F1    | Precision | Recall |
| ----- | --------- | ------ |
| 0.539 | 0.655     | 0.458  |

so arguably more or less the same!

FYI I made a table with the training experiments here

@lizgzil (Collaborator, Author) commented Aug 18, 2023

Thanks so much @india-kerle!!

@lizgzil lizgzil merged commit f24eb5c into dev Aug 18, 2023
1 of 2 checks passed
@lizgzil lizgzil deleted the prodigy_ner branch August 18, 2023 16:18
Successfully merging this pull request may close these issues:

  • Use fix_random_seed in train NER