Use word_ids to get labels in run_ner #8962

sgugger · 2020-12-07T15:11:10Z

What does this PR do?

As #8958 pointed out, the current way labels are computed in the run_ner script, using offset mappings, does not work for sentencepiece-based tokenizers. This PR fixes that by using the .word_ids method, which is more elegant and more reliable.
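
For illustration, here is a minimal sketch of how labels can be aligned from word indices. It is not the exact code in run_ner: the helper name `align_labels_with_word_ids`, the `label_all_tokens` parameter, and the sample data are assumptions made for this example. The `word_ids` list is hand-written to mimic what a fast tokenizer's `.word_ids()` returns (`None` for special tokens, otherwise the index of the word each token came from).

```python
from typing import List, Optional

# -100 is the default ignore_index of PyTorch's cross-entropy loss.
IGNORE_INDEX = -100


def align_labels_with_word_ids(
    word_ids: List[Optional[int]],
    word_labels: List[int],
    label_all_tokens: bool = False,
) -> List[int]:
    """Map word-level labels onto tokens (hypothetical helper)."""
    aligned = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens have no source word, so the loss skips them.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous_word_id:
            # First token of a word carries that word's label.
            aligned.append(word_labels[word_id])
        else:
            # Later sub-tokens are either labelled like the word or ignored.
            aligned.append(word_labels[word_id] if label_all_tokens else IGNORE_INDEX)
        previous_word_id = word_id
    return aligned


# Three words, the second split into two sub-tokens, wrapped in special tokens.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [1, 0, 2]
print(align_labels_with_word_ids(word_ids, word_labels))
# [-100, 1, 0, -100, 2, -100]
```

Because the alignment only depends on word indices, it does not rely on character offsets, which is where the sentencepiece case went wrong.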

In passing, it adds an early check that the tokenizer is fast (otherwise the script just doesn't work).
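
A rough sketch of what such an early check could look like, assuming the tokenizer exposes the `is_fast` attribute that transformers tokenizers provide. The stub classes, the helper name `require_fast_tokenizer`, and the error message are made up for this example and may differ from what the script does.

```python
class SlowTokenizerStub:
    """Stand-in for a Python (slow) tokenizer; hypothetical, for illustration."""
    is_fast = False


class FastTokenizerStub:
    """Stand-in for a Rust-backed (fast) tokenizer; hypothetical, for illustration."""
    is_fast = True


def require_fast_tokenizer(tokenizer):
    """Fail early when the tokenizer cannot provide word_ids (hypothetical helper)."""
    if not getattr(tokenizer, "is_fast", False):
        raise ValueError(
            "This script needs a fast tokenizer, because word_ids is only "
            "available on fast tokenizers."
        )
    return tokenizer


require_fast_tokenizer(FastTokenizerStub())  # passes silently

try:
    require_fast_tokenizer(SlowTokenizerStub())
except ValueError as err:
    print(f"Rejected: {err}")
```

Failing right after the tokenizer is loaded gives a clear message instead of an obscure error later during preprocessing.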

Fixes #8958

LysandreJik

LGTM, thanks for looking into it!

sgugger added 2 commits December 7, 2020 10:04

Use word_ids to get labels in run_ner

991f981

Add sanity check

ad072e8

sgugger requested a review from LysandreJik December 7, 2020 15:11

LysandreJik approved these changes Dec 7, 2020

sgugger merged commit 7f9ccff into master Dec 7, 2020

sgugger deleted the fix_8958 branch December 7, 2020 19:26
