
Evaluate does not work for custom dataset #56

Closed
imayachita opened this issue Nov 14, 2019 · 11 comments

Comments
@imayachita

Hi,
I tried using your code on my data. It finished training without any problem, but it ran into a problem during evaluation.

Traceback (most recent call last):
  File "run_ner.py", line 674, in <module>
    main()
  File "run_ner.py", line 661, in main
    temp_1.append(label_map[label_ids[i][j]])
KeyError: 0

I printed the label_ids:

label ids:  0 i:  0 j:  16 label:  [58  1  1  1  1  1  1  1  1  1  1  1  1  1  1 59  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]

It seems this is because the element at j==16 is 0 in the label list, and label 0 is not in the label_map.
I wonder how you built the label_ids?

Thanks

@kamalkraj
Owner

Try:
temp_1.append(label_map.get(label_ids[i][j], 1))
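
For context, here is a hedged guess at how the label map is likely built in run_ner.py and why index 0 is missing; the label list below is a placeholder and the enumerate start of 1 is an assumption inferred from the KeyError: 0 above, not a quote of the repo's code.

# Label indices are presumably assigned starting from 1, leaving 0 for padding,
# which would explain why label_map[0] raises KeyError during evaluation.
label_list = ["O", "B-MISC", "I-MISC", "[CLS]", "[SEP]"]   # placeholder labels
label_map = {i: label for i, label in enumerate(label_list, 1)}
print(label_map)             # {1: 'O', 2: 'B-MISC', ...} -- note there is no key 0
# With the suggested workaround, a padded position (id 0) falls back instead of raising:
print(label_map.get(0, 1))   # -> 1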

@imayachita
Author

Okay, thanks. I tried that, but now I have another problem: y_true and y_pred are both empty.


Evaluating: 100%|███████████████████████████████████████████████████████████████████████| 73/73 [00:25<00:00,  2.89it/s]
y_true:  []
y_pred:  []
Traceback (most recent call last):
  File "run_ner.py", line 677, in <module>
    main()
  File "run_ner.py", line 667, in main
    report = classification_report(y_true, y_pred,digits=4)
  File "/home/inneke/Documents/D_drive/Balikpapan_handil/codes/BERT-NER-dev/venv/lib/python3.6/site-packages/seqeval/metrics/sequence_labeling.py", line 323, in classification_report
    np.average(ps, weights=s),
  File "<__array_function__ internals>", line 6, in average
  File "/home/inneke/Documents/D_drive/Balikpapan_handil/codes/BERT-NER-dev/venv/lib/python3.6/site-packages/numpy/lib/function_base.py", line 420, in average
    "Weights sum to zero, can't be normalized")
ZeroDivisionError: Weights sum to zero, can't be normalized

@kamalkraj
Owner

Could you share your label_map?

@imayachita
Author

Please find it below, together with the data:
data_custom.zip

I also found that the sentences are truncated when loaded:
[screenshot of the truncated sentence]
whereas this is the original data:

figure O
2 O
production O
history O
per O
recovery O
mechanism O
natural B-Pressure Depletion
depletion I-Pressure Depletion
and O
water B-Waterflooding
injection I-Waterflooding
phase O
the O
shallow O
zone O
of O
the O
handil O
field O
which O
contains O
160 O
reservoirs O
, O
experiences O
a O
strong O
aquifer O
drive O
and O
were O
very O
efficiently O
swept O
by O
it O
. O

These words are not loaded:

efficiently O
swept O
by O
it O
. O

Thanks a lot!

@imayachita
Author

Anyway, I modified the preprocessing part to replace the whitespace in the labels with underscores.
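
For reference, a one-line sketch of that replacement (my own illustration, not the actual preprocessing code):

label = "B-Pressure Depletion"
label = label.replace(" ", "_")   # -> "B-Pressure_Depletion", so the tag is a single token
print(label)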

@kamalkraj
Owner

kamalkraj commented Nov 14, 2019

@imayachita,
Remove the label X from the label list. It is not used, right?
The last label in the label list should be [SEP].
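
For example, the label list for this custom dataset might then look like the sketch below (assuming the underscored labels from the preprocessing change above; the exact list depends on get_labels() in your processor):

# Hypothetical label list: the unused "X" removed, "[SEP]" as the last entry.
label_list = ["O",
              "B-Pressure_Depletion", "I-Pressure_Depletion",
              "B-Waterflooding", "I-Waterflooding",
              "[CLS]", "[SEP]"]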

@imayachita
Author

Oh yes! Thanks! It is running now.
But I still wonder what caused the examples to be truncated?
Thanks!

@imayachita
Author

@kamalkraj I think I know what caused the examples to be truncated. It is because the BERT tokenization splits words into subwords, so the labels and the tokens no longer line up.

The original data:

figure O
2 O
production O
history O
per O
recovery O
mechanism O
natural B-Pressure Depletion
depletion I-Pressure Depletion
and O
water B-Waterflooding
injection I-Waterflooding
phase O
the O
shallow O
zone O
of O

turned into this:
[screenshot of the truncated, subword-tokenized sentence]

Do you have any idea how to fix this problem?

Thanks a lot! I really appreciate your help.

@kamalkraj
Owner

kamalkraj commented Nov 14, 2019

BERT subword tokenization is handled. Labels and tokenized words will have different lengths.
I assume you are using zip to combine labels and tokens; when you use zip, iteration stops as soon as one of the iterables is exhausted. Labels are assigned only to the first subtoken produced by the BERT tokenization.
[screenshot illustrating the label/subtoken alignment]
But all the tokens are passed to BERT for feature extraction, and after feature extraction only the features of the first subtoken of each word are passed to the classifier layer.
Refer to #53.
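
Roughly, the alignment works like this (a simplified sketch of the idea, not the repo's exact feature-conversion code; the example sentence, label_map, and tokenizer choice here are assumptions):

from transformers import BertTokenizer   # any WordPiece tokenizer illustrates the point

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
words  = ["Peter", "Hedblom", "(", "Sweden", ")"]
labels = ["B-PER", "I-PER", "O", "B-LOC", "O"]
label_map = {"O": 1, "B-PER": 2, "I-PER": 3, "B-LOC": 4}   # illustrative ids only

tokens, valid, label_ids = [], [], []
for word, label in zip(words, labels):
    subtokens = tokenizer.tokenize(word)   # a word may be split into several WordPiece subtokens
    tokens.extend(subtokens)
    for k in range(len(subtokens)):
        valid.append(1 if k == 0 else 0)   # only the first subtoken of a word is marked "valid"
    label_ids.append(label_map[label])     # labels stay aligned with whole words

# All subtokens in `tokens` go to BERT; `valid` is later used to pick out the
# first-subtoken features for the classifier, so nothing is silently dropped.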

@imayachita
Author

Thanks for the explanation @kamalkraj!
So it means that the tokens that appear after the last index of the iteration are not fed to the model.
For example (using the original data you posted):

Peter NNP B-NP B-PER
Hedblom NNP I-NP I-PER
( ( O O
Sweden NNP B-NP B-LOC
) ) O O
70 CD B-NP O
75 CD I-NP O
75 CD I-NP O
, , O O
Retief NNP B-NP B-PER
Goosen NNP I-NP I-PER
( ( O O
South NNP B-NP B-LOC

When I printed the tokens and labels:
[screenshot of the printed tokens and labels]

It seems to me that because the original sentence has 13 words and the BERT tokenization produces more than 13 tokens, the iteration stopped at "," (which got index 12). Therefore, the words "Retief", "Goosen", "(", "South" are not fed into the model.

Do I understand it correctly? Thanks!

@kamalkraj
Owner

@imayachita,
All the tokens are given to the BERT feature extractor, but only the hidden features of the first subtoken of each word are passed to the classifier layer. Check the forward function of the Ner class.

class Ner(BertForTokenClassification):
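
In outline, that forward pass does something like the sketch below (a simplification of the idea, not the repo's exact code; the argument names, the valid_ids tensor, and the import of the base class are assumptions):

import torch
from transformers import BertForTokenClassification   # the repo may use pytorch_transformers instead

class Ner(BertForTokenClassification):
    def forward(self, input_ids, token_type_ids=None, attention_mask=None, valid_ids=None):
        # Every subtoken goes through BERT for feature extraction.
        sequence_output = self.bert(input_ids=input_ids,
                                    token_type_ids=token_type_ids,
                                    attention_mask=attention_mask)[0]
        batch_size, max_len, feat_dim = sequence_output.shape
        # Keep only positions marked valid (the first subtoken of each word),
        # packed to the front of each sequence.
        valid_output = torch.zeros(batch_size, max_len, feat_dim,
                                   dtype=sequence_output.dtype,
                                   device=sequence_output.device)
        for i in range(batch_size):
            pos = 0
            for j in range(max_len):
                if valid_ids[i][j].item() == 1:
                    valid_output[i][pos] = sequence_output[i][j]
                    pos += 1
        # Only these first-subtoken features reach the classifier head.
        logits = self.classifier(self.dropout(valid_output))
        return logits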
