Add prodigy folder for tagging more skills data using prodigy #193

Merged · 19 commits merged into dev on Aug 18, 2023

Conversation

@lizgzil (Collaborator) commented Jul 10, 2023

Fixes #192

  • combine_labels.py now includes adding in the new Prodigy labels
  • ner_spacy.py uses "en_core_web_lg" as the base model, uses a fixed random seed, and has a few small bug fixes
  • pipeline/skill_ner/prodigy/ folder created with NER Prodigy labelling instructions, a recipe, and a data processing script
  • A new model called 20230808 has been trained. I've added "Update to new model everywhere" #197 as a separate issue + PR to update all the model versions everywhere (as I wanted to keep this PR separate)
  • I noticed an error with the pytest GitHub Action. The fix was found in this and this (adding typing_extensions<4.6.0 and spacy==3.4.0 to requirements_dev.txt; see the sketch after this list)
  • The GitHub Action then raised another error; it looks like the same problem discussed in "Installation Error: Building wheel for scikit-learn (pyproject.toml) did not run successfully" #194
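
For reference, the two pins named above would sit in requirements_dev.txt roughly like this (the file's surrounding contents aren't shown in this PR):

```
# requirements_dev.txt (relevant additions only)
typing_extensions<4.6.0
spacy==3.4.0
```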

Note for reviewer:

  • Check code looks ok
  • Try running:

```python
from ojd_daps_skills.pipeline.skill_ner.ner_spacy import JobNER

job_ner = JobNER()
nlp = job_ner.load_model('outputs/models/ner_model/20230808/', s3_download=True)
text = "We want someone with good communication and maths skills. There are job benefits such as a pension and cycle to work scheme. We would like someone with experience in marketing."
pred_ents = job_ner.predict(text)
```

You should get:

```python
>>> pred_ents
[{'label': 'SKILL', 'start': 26, 'end': 39},
 {'label': 'SKILL', 'start': 44, 'end': 56},
 {'label': 'BENEFIT', 'start': 103, 'end': 123},
 {'label': 'EXPERIENCE', 'start': 152, 'end': 175}]
>>> for ent in pred_ents:
...     print(text[ent['start']:ent['end']])
communication
maths skills
cycle to work scheme
experience in marketing
```

Prodigy labels and new NER model

  • In Prodigy, BENEFIT entities were also labelled. This causes a slight problem since the label-studio data doesn't have BENEFIT labels. I opted to train the NER model with the BENEFIT label anyway, which means the recall has ended up being really low.
  • The number of labelled job adverts has increased from 375 to 500.

Skill metric changes:

|                | F1   | Precision | Recall |
| -------------- | ---- | --------- | ------ |
| Previous model | 0.59 | 0.68      | 0.52   |
| Newest model   | 0.61 | 0.71      | 0.54   |

All metrics from the new model:

| Entity     | F1    | Precision | Recall |
| ---------- | ----- | --------- | ------ |
| Skill      | 0.612 | 0.712     | 0.537  |
| Experience | 0.524 | 0.647     | 0.441  |
| Benefit    | 0.531 | 0.708     | 0.425  |
| All        | 0.590 | 0.680     | 0.521  |

Thanks for contributing to Nesta's Skills Extractor Library 🙏!

If you have suggested changes to code anywhere outside of the ExtractSkills class, please consult the checklist below.

Checklist ✔️🐍:

  • I have refactored my code out from notebooks/
  • I have checked the code runs
  • I have tested the code
  • I have run pre-commit and addressed any issues not automatically fixed
  • I have merged any new changes from dev
  • I have documented the code
    • Major functions have docstrings
    • Appropriate information has been added to READMEs
  • I have explained the feature in this PR or (better) in output/reports/
  • I have requested a code review

If you have suggested changes to documentation (and/or the ExtractSkills class), please ALSO consult the checklist below.

Documentation Checklist ✔️📚:

  • I have run make html in docs
  • I have manually reviewed the docs/build/*.html files locally to ensure they are formatted correctly
  • I have pushed both relevant files AND their corresponding docs/build/*.html files

@lizgzil lizgzil marked this pull request as ready for review August 2, 2023 08:52
Comment on lines +94 to +113:

```python
# Custom way to split into chunks of a certain size.
# It's not ideal if these are too big (the model struggles)
# or too small (it's hard to label).
def split_text(adverts, chunk_size=5):
    for advert in adverts:
        text = advert["text"]
        meta = advert["meta"]
        sentences = text.split(".")
        sentences = [
            sentence.strip()
            for sentence in sentences
            if len(sentence.strip()) != 0
        ]
        for sent_id, i in enumerate(range(0, len(sentences), chunk_size)):
            yield {
                "text": ". ".join(sentences[i : i + chunk_size]),
                "meta": {"id": meta["id"], "chunk": sent_id},
            }

stream = split_text(list(stream))
```
@lizgzil (Collaborator, Author):
This is the only thing different from spaCy's original `ner.correct` recipe.
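
For illustration only (this example is not in the PR), the chunking generator above behaves like this on a toy advert:

```python
# Toy input in the assumed shape {"text": ..., "meta": {"id": ...}}
adverts = [{"text": "One. Two. Three. Four. Five. Six.", "meta": {"id": "ad-1"}}]

for chunk in split_text(adverts, chunk_size=5):
    print(chunk)
# {'text': 'One. Two. Three. Four. Five', 'meta': {'id': 'ad-1', 'chunk': 0}}
# {'text': 'Six', 'meta': {'id': 'ad-1', 'chunk': 1}}
```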

@lizgzil (Collaborator, Author) commented Aug 14, 2023

This is ready for review now @india-kerle. The tests fail for Python 3.9, but I think we can look into that separately (see #194), since it's an environment issue rather than a code issue.

@india-kerle (Collaborator) left a comment:

Thanks Liz for the PR! I was able to predict experiences, skills and benefits using the most up-to-date NER model. The only comment I've made is around dealing with spaCy's char indices being annoying: they've released a nice parameter to deal with this (it's unclear whether it's totally relevant to your use case, but worth knowing about; I've added more detail in a comment).

@@ -116,6 +116,20 @@ def load_s3_json(s3, bucket_name, file_name):
return json.loads(file)


def load_prodigy_jsonl_s3_data(s3, bucket_name, file_name):
@india-kerle (Collaborator):
👍


- This model can be used by running:
+ A trained model can be used by running:
@india-kerle (Collaborator):
I was able to run this in a notebook with no problems and got the same results!

@@ -126,7 +137,7 @@ The `s3_download=True` argument will mean this model will be first downloaded fr
Running

```
- python ojd_daps_skills/pipeline/skill_ner/get_skills.py --model_path outputs/models/ner_model/20220825/ --output_file_dir escoe_extension/outputs/data/skill_ner/skill_predictions/ --job_adverts_filename escoe_extension/inputs/data/skill_ner/data_sample/20220622_sampled_job_ads.json
+ python ojd_daps_skills/pipeline/skill_ner/get_skills.py --model_path outputs/models/ner_model/20230808/ --output_file_dir escoe_extension/outputs/data/skill_ner/skill_predictions/ --job_adverts_filename escoe_extension/inputs/data/skill_ner/data_sample/20220622_sampled_job_ads.json
```
@india-kerle (Collaborator):
🫨

"""
s3 = get_s3_resource()
prodigy_data_chunks = defaultdict(dict)
for prodigy_labelled_data_s3_folder in prodigy_labelled_data_s3_folders:
@india-kerle (Collaborator):
Is there a reason why we're looping, as it looks like we only have one filename? Or am I misunderstanding?

@lizgzil (Collaborator, Author):
No, you are right! I was copying the structure of label-studio, where we had multiple files, without thinking. Will adapt.

Comment on lines +146 to +147:

```python
def combine_prodigy_spans(prodigy_data_chunks):
    """
```
@india-kerle (Collaborator):
This is such a pain for Prodigy/spaCy more generally. When dealing with start/end char indexes in spaCy `Span`s, there is a nice parameter called `alignment_mode` which lets you define how character indices snap to token boundaries. The options are "strict" (no snapping), "contract" (span of all tokens completely within the character span), and "expand" (span of all tokens at least partially covered by the character span); the default is "strict". This is what I modified when I needed to deal with spans for the company descriptions recipe.

More info is in the `Span.char_span` method in the Span documentation. Not sure if this is totally relevant here, but something to consider.
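
A minimal sketch of the snapping behaviour described above (not from this PR, and assuming en_core_web_sm is installed):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("We want good communication skills.")

# "communication" spans characters 13-26; offsets 14-26 start mid-token.
print(doc.char_span(14, 26, label="SKILL", alignment_mode="strict"))    # None (no snapping)
print(doc.char_span(14, 26, label="SKILL", alignment_mode="contract"))  # None (no token fully inside)
print(doc.char_span(14, 26, label="SKILL", alignment_mode="expand"))    # communication
```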

@lizgzil (Collaborator, Author):
Oh, that's good to know about! Thanks 🙏

@@ -0,0 +1,106 @@
"""
Process a dataset of job adverts for labelling in Prodigy
@india-kerle (Collaborator):
I remember you saying that you cleaned the data slightly more than for the company descriptions approach. Do you think it's worth me using this instead, to be more consistent?

@lizgzil (Collaborator, Author):

Sorry, this is probably an out-of-date reply since you've already started labelling the company descriptions. It might have been worth adding more cleaning, but it might not make much of a difference to your task, especially if you are tagging the whole job advert at more or less the sentence level. So I wouldn't worry about it.

The reason I did more cleaning here was that I was splitting into chunks of 5 sentences by full stop, and I kept finding that the previous level of cleaning hadn't separated them out well enough (e.g. it was replacing \n with a space, not a full stop); see the sketch below.
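
A rough sketch of that kind of cleaning (assumed for illustration; this is not the PR's actual cleaning code):

```python
import re

def clean_for_sentence_split(text):
    # Turn newlines into full stops so that splitting on "." later
    # doesn't glue separate lines into one "sentence".
    text = re.sub(r"\n+", ". ", text)
    # Collapse any doubled full stops this creates ("end.. Next" -> "end. Next").
    text = re.sub(r"\.\s*\.", ".", text)
    return text

print(clean_for_sentence_split("Good communication.\nPension scheme"))
# Good communication. Pension scheme
```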

Comment on lines +97 to +98:

```python
def split_text(adverts, chunk_size=5):
    for advert in adverts:
```
@india-kerle (Collaborator):
It's funny how the length is an issue in this recipe but not in the other.

@lizgzil (Collaborator, Author):
It's so odd!

| ---------- | ----- | --------- | ------ |
| Skill | 0.612 | 0.712 | 0.537 |
| Experience | 0.524 | 0.647 | 0.441 |
| Benefit | 0.531 | 0.708 | 0.425 |
@india-kerle (Collaborator):
I know the recall isn't ideal, but honestly I'm impressed with the precision given how few data points are labelled BENEFIT.

@lizgzil (Collaborator, Author):
Yeah, it's good! And really this metric should be applied to just the Prodigy test data too; I think if it was then perhaps the precision would be higher (though I think the recall would be the same)?

I had such a pain with this because I wanted to train one NER model to predict all the entity types, but obviously only the most recent labelled job adverts had BENEFIT labels. So the model effectively sees the label-studio training data as saying there are no BENEFITs when there probably are.

I couldn't find a way to tell it to only train the BENEFIT label on the new data, whilst using all the data to train the SKILL and EXPERIENCE labels.

I tried training a model just with the Prodigy data, and on the BENEFIT label it got:

| F1    | Precision | Recall |
| ----- | --------- | ------ |
| 0.539 | 0.655     | 0.458  |

so arguably more or less the same!

FYI I made a table with the training experiments here

@lizgzil (Collaborator, Author) commented Aug 18, 2023

Thanks so much @india-kerle!!

@lizgzil lizgzil merged commit f24eb5c into dev Aug 18, 2023
1 of 2 checks passed
@lizgzil lizgzil deleted the prodigy_ner branch August 18, 2023 16:18
Successfully merging this pull request may close these issues:

  • Use fix_random_seed in train NER