Add prodigy folder for tagging more skills data using prodigy #193
Conversation
sample data
… training the model
…int, set a random seed, fix saving bug
```python
# Custom way to split into chunks of a certain size
# it's not ideal if these are too big (the model struggles)
# or too small (it's hard to label)
def split_text(adverts, chunk_size=5):
    for advert in adverts:
        text = advert["text"]
        meta = advert["meta"]
        sentences = text.split(".")
        sentences = [
            sentence.strip()
            for sentence in sentences
            if len(sentence.strip()) != 0
        ]
        for sent_id, i in enumerate(range(0, len(sentences), chunk_size)):
            yield {
                "text": ". ".join(sentences[i : i + chunk_size]),
                "meta": {"id": meta["id"], "chunk": sent_id},
            }


stream = split_text(list(stream))
```
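For illustration, here is roughly what the chunking produces on a made-up advert (the text and id are invented; assumes the `split_text` generator above):

```python
# Invented sample advert, purely to show the chunking behaviour.
sample = [
    {
        "text": "Python required. SQL desirable. Team player. Good pay. Remote. Start ASAP. Apply now.",
        "meta": {"id": "advert_1"},
    }
]

for chunk in split_text(sample, chunk_size=5):
    print(chunk)
# {'text': 'Python required. SQL desirable. Team player. Good pay. Remote', 'meta': {'id': 'advert_1', 'chunk': 0}}
# {'text': 'Start ASAP. Apply now', 'meta': {'id': 'advert_1', 'chunk': 1}}
```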
This is the only thing different from spaCy's original `ner.correct` recipe.
This is ready for review now @india-kerle. The tests fail for Python 3.9, but I think we can look into that separately (see #194) since it's an environment issue rather than a code issue.
Thanks Liz for the PR! I was able to predict experiences, skills and benefits using the most up-to-date NER model. The only comment I've made is more around dealing with spaCy's char indices being annoying - they've released a nice parameter to deal with this (although it's unclear if it's totally relevant to your use case, but worth knowing about - I've added more detail in a comment).
```
@@ -116,6 +116,20 @@ def load_s3_json(s3, bucket_name, file_name):
    return json.loads(file)


def load_prodigy_jsonl_s3_data(s3, bucket_name, file_name):
```
👍
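For context, a minimal sketch of what a Prodigy-JSONL-from-S3 loader like this typically does. The body below is an assumption (the real implementation is in the PR diff, not shown in this excerpt), and it assumes `get_s3_resource()` returns a boto3 S3 resource:

```python
import json


def load_prodigy_jsonl_s3_data(s3, bucket_name, file_name):
    # Hypothetical body: download a Prodigy .jsonl export from S3 and
    # return one dict per (non-empty) line.
    obj = s3.Object(bucket_name, file_name)
    body = obj.get()["Body"].read().decode("utf-8")
    return [json.loads(line) for line in body.splitlines() if line.strip()]
```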
```diff
-This model can be used by running:
+A trained model can be used by running:
```
I was able to run this in a notebook with no problems and got the same results!
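The README code block being discussed isn't reproduced in this excerpt. Purely as a hypothetical sketch of the kind of usage it describes (the `JobNER` class and its `load_model`/`predict` methods are assumptions; only the `s3_download=True` argument and the model path come from the surrounding diff):

```python
# Hypothetical sketch - class and method names are assumptions,
# not copied from the PR.
from ojd_daps_skills.pipeline.skill_ner.ner_spacy import JobNER

job_ner = JobNER()
nlp = job_ner.load_model("outputs/models/ner_model/20230808/", s3_download=True)

text = "You will need strong Python skills. We offer 25 days holiday."
print(job_ner.predict(text))  # expected: predicted SKILL/EXPERIENCE/BENEFIT entities
```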
```diff
@@ -126,7 +137,7 @@ The `s3_download=True` argument will mean this model will be first downloaded fr
 Running

-python ojd_daps_skills/pipeline/skill_ner/get_skills.py --model_path outputs/models/ner_model/20220825/ --output_file_dir escoe_extension/outputs/data/skill_ner/skill_predictions/ --job_adverts_filename escoe_extension/inputs/data/skill_ner/data_sample/20220622_sampled_job_ads.json
+python ojd_daps_skills/pipeline/skill_ner/get_skills.py --model_path outputs/models/ner_model/20230808/ --output_file_dir escoe_extension/outputs/data/skill_ner/skill_predictions/ --job_adverts_filename escoe_extension/inputs/data/skill_ner/data_sample/20220622_sampled_job_ads.json
```
🫨
""" | ||
s3 = get_s3_resource() | ||
prodigy_data_chunks = defaultdict(dict) | ||
for prodigy_labelled_data_s3_folder in prodigy_labelled_data_s3_folders: |
Is there a reason why we're looping, as it looks like we only have one filename? Or am I misunderstanding?
No, you are right! I was copying the structure of label-studio (where we had multiple files) without thinking. Will adapt.
```python
def combine_prodigy_spans(prodigy_data_chunks):
    """
```
This is such a pain for prodigy/spaCy more generally - when dealing with start/end char indexes in spaCy `Span`s, they have a nice parameter called `alignment_mode` which lets you define how character indices snap to token boundaries. The options are "strict" (no snapping), "contract" (span of all tokens completely within the character span) and "expand" (span of all tokens at least partially covered by the character span); it defaults to "strict". This is what I modified when I needed to deal with spans for the company descriptions recipe.

More info in the `Span.char_span` method in the Span documentation. Not sure if this is totally relevant here but something to consider.
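For reference, a minimal sketch of how `alignment_mode` behaves (the sentence and character offsets are made up for illustration, and it assumes a spaCy v3 English pipeline such as `en_core_web_sm` is installed):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Experience with Python and SQL is essential.")

# Character offsets that deliberately cut through tokens: "ython and SQ"
start, end = 17, 29

print(doc.char_span(start, end, alignment_mode="strict"))    # None - offsets don't align to token boundaries
print(doc.char_span(start, end, alignment_mode="contract"))  # "and" - only tokens fully inside the character span
print(doc.char_span(start, end, alignment_mode="expand"))    # "Python and SQL" - any token partially covered
```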
oh that's good to know about! thanks 🙏
```
@@ -0,0 +1,106 @@
"""
Process a dataset of job adverts for labelling in Prodigy
```
I remember you saying that you cleaned the data slightly more than for the company descriptions approach - do you think it's worth me using this instead, to be more consistent?
Sorry, this is probably an out-of-date reply since you've already started labelling the company descriptions. It might have been worth adding more cleaning, but it probably won't make much of a difference to your task, especially if you are tagging the whole job advert at more or less the sentence level. So I wouldn't worry about it.

The reason I did more cleaning here was because I was splitting into chunks of 5 sentences by full stop, and I kept finding that the previous level of cleaning hadn't separated them out well enough (e.g. it was replacing \n with a space, not a full stop).
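As a rough illustration of the extra cleaning described above (a hypothetical sketch, not the code in the PR): treat newlines as sentence boundaries before splitting on full stops, rather than replacing them with spaces.

```python
import re


def clean_for_sentence_splitting(text):
    # Hypothetical sketch of the extra cleaning step: turn newlines into
    # ". " so that splitting on "." later separates bullet points and
    # paragraphs into their own "sentences".
    text = re.sub(r"\n+", ". ", text)
    # Collapse any doubled-up full stops this introduces.
    text = re.sub(r"\.\s*\.", ".", text)
    # Squash repeated whitespace.
    return re.sub(r"\s+", " ", text).strip()


print(clean_for_sentence_splitting("Key responsibilities\nManage a team\nReport to the director"))
# "Key responsibilities. Manage a team. Report to the director"
```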
```python
def split_text(adverts, chunk_size=5):
    for advert in adverts:
```
it's funny how the length is an issue in this recipe but not in the other
It's so odd!
| Label      | F1    | Precision | Recall |
| ---------- | ----- | --------- | ------ |
| Skill      | 0.612 | 0.712     | 0.537  |
| Experience | 0.524 | 0.647     | 0.441  |
| Benefit    | 0.531 | 0.708     | 0.425  |
I know the recall isn't ideal, but I'm honestly impressed with the precision given how few data points are labelled Benefit.
Yeh it's good! And really this metric should be applied to just the prodigy test data too; I think if it was then perhaps the precision would be higher (I think the recall would be the same though)?

I had such a pain with this because I wanted to train one NER model to predict all the entity types, but obviously only the most recent labelled job adverts had BENEFIT labels. So the model effectively sees the label-studio training data as saying there are no BENEFITs in them when there probably are. I couldn't find a way to tell it to only train the BENEFIT label on the new data, whilst using all the data to train the SKILL and EXPERIENCE labels.
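To make the issue concrete, a small sketch (invented sentence, not the project's training code) of why the older data acts as negative evidence for BENEFIT, plus the spaCy mechanism that in principle lets unannotated tokens be marked as missing rather than as "O":

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
text = "Knowledge of Python and a generous pension scheme."

# Older, label-studio-style example: only the SKILL span was annotated.
doc = nlp.make_doc(text)
doc.set_ents([Span(doc, 2, 3, label="SKILL")])  # default="outside"
# Every unannotated token (including "pension scheme") is now "O",
# i.e. a negative example for BENEFIT during training.
print([tok.ent_iob_ for tok in doc])

# In principle, unannotated tokens can instead be marked as missing
# (no supervision), so they don't count against BENEFIT.
doc2 = nlp.make_doc(text)
doc2.set_ents([Span(doc2, 2, 3, label="SKILL")], default="missing")
print([tok.ent_iob_ for tok in doc2])  # missing tokens show an empty IOB tag
```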
I tried training a model just with the prodigy data and on the BENEFIT label it got:
| F1    | Precision | Recall |
| ----- | --------- | ------ |
| 0.539 | 0.655     | 0.458  |
so arguably more or less the same!
FYI I made a table with the training experiments here
Thanks so much @india-kerle!!
Fixes #192

- `combine_labels.py` now includes adding in the new prodigy labels
- `ner_spacy.py` uses "en_core_web_lg" as the base model + uses a fixed random seed + a few small bug fixes
- `pipeline/skill_ner/prodigy/` folder created with NER prodigy labelling instructions, a recipe, and a data processing script
- New model `20230808`. I've added Update to new model everywhere #197 as a separate issue + PR to update all the model versions everywhere (as I wanted to make this PR separate)
- Added `typing_extensions<4.6.0` and `spacy==3.4.0` to requirements_dev.txt

Note for reviewer:
You should get:
Prodigy labels and new NER model
Skill metric changes:
All metrics from the new model:
Thanks for contributing to Nesta's Skills Extractor Library 🙏!
If you have suggested changes to code anywhere outside of the ExtractSkills class, please consult the checklist below.
Checklist ✔️🐍:

- I have refactored my code out of `notebooks/`
- I have run `pre-commit` and addressed any issues not automatically fixed
- I have merged any new changes from `dev`
- Appropriate information has been added to the `README`s
- I have explained the feature in this PR or (better) in `output/reports/`
If you have suggested changes to documentation (and/or the ExtractSkills class), please ALSO consult the checklist below.
Documentation Checklist ✔️📚:

- I have built the documentation by running `make html` in `docs`
- I have checked the `docs/build/*.html` files locally to ensure they have formatted correctly
- I have committed the `docs/build/*.html` files