
Evaluate does not work for custom dataset #56

Closed
imayachita opened this issue Nov 14, 2019 · 11 comments

Comments
@imayachita

Hi,
I tried using your code on my data. It finished training without any problem, but it ran into a problem during evaluation.

Traceback (most recent call last):
  File "run_ner.py", line 674, in <module>
    main()
  File "run_ner.py", line 661, in main
    temp_1.append(label_map[label_ids[i][j]])
KeyError: 0

I printed the label_ids:

label ids:  0 i:  0 j:  16 label:  [58  1  1  1  1  1  1  1  1  1  1  1  1  1  1 59  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]

It seems this is because the element at j==16 is 0 in the label list, and label 0 is not in the label_map.
I wonder how you built the label_ids?

Thanks

@kamalkraj
Owner

Try:
temp_1.append(label_map.get(label_ids[i][j], 1))
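
For context, here is a hedged guess at how the label map is likely built in run_ner.py and why index 0 is missing; the label list below is a placeholder and the enumerate start of 1 is an assumption inferred from the KeyError: 0 above, not a quote of the repo's code.

# Label indices are presumably assigned starting from 1, leaving 0 for padding,
# which would explain why label_map[0] raises KeyError during evaluation.
label_list = ["O", "B-MISC", "I-MISC", "[CLS]", "[SEP]"]   # placeholder labels
label_map = {i: label for i, label in enumerate(label_list, 1)}
print(label_map)             # {1: 'O', 2: 'B-MISC', ...} -- note there is no key 0
# With the suggested workaround, a padded position (id 0) falls back instead of raising:
print(label_map.get(0, 1))   # -> 1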

@imayachita
Author

Okay, thanks. I tried that, but now I have another problem: y_true and y_pred are both empty.


Evaluating: 100%|███████████████████████████████████████████████████████████████████████| 73/73 [00:25<00:00,  2.89it/s]
y_true:  []
y_pred:  []
Traceback (most recent call last):
  File "run_ner.py", line 677, in <module>
    main()
  File "run_ner.py", line 667, in main
    report = classification_report(y_true, y_pred,digits=4)
  File "/home/inneke/Documents/D_drive/Balikpapan_handil/codes/BERT-NER-dev/venv/lib/python3.6/site-packages/seqeval/metrics/sequence_labeling.py", line 323, in classification_report
    np.average(ps, weights=s),
  File "<__array_function__ internals>", line 6, in average
  File "/home/inneke/Documents/D_drive/Balikpapan_handil/codes/BERT-NER-dev/venv/lib/python3.6/site-packages/numpy/lib/function_base.py", line 420, in average
    "Weights sum to zero, can't be normalized")
ZeroDivisionError: Weights sum to zero, can't be normalized

@kamalkraj
Owner

Could you share your label_map?

@imayachita
Author

Please find it below, together with the data:
data_custom.zip

I also found that the sentences are truncated when loaded:
[screenshot of the truncated sentence]
whereas this is the original data:

figure O
2 O
production O
history O
per O
recovery O
mechanism O
natural B-Pressure Depletion
depletion I-Pressure Depletion
and O
water B-Waterflooding
injection I-Waterflooding
phase O
the O
shallow O
zone O
of O
the O
handil O
field O
which O
contains O
160 O
reservoirs O
, O
experiences O
a O
strong O
aquifer O
drive O
and O
were O
very O
efficiently O
swept O
by O
it O
. O

These words are not loaded:

efficiently O
swept O
by O
it O
. O

Thanks a lot!

@imayachita
Author

Anyway, I modified the preprocessing part to replace the whitespace in the labels with underscores.
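
For reference, a one-line sketch of that replacement (my own illustration, not the actual preprocessing code):

label = "B-Pressure Depletion"
label = label.replace(" ", "_")   # -> "B-Pressure_Depletion", so the tag is a single token
print(label)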

@kamalkraj
Owner

kamalkraj commented Nov 14, 2019

@imayachita,
Remove the label X from the label list. It is not used, right?
The last label in the label list should be [SEP].
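
For example, the label list for this custom dataset might then look like the sketch below (assuming the underscored labels from the preprocessing change above; the exact list depends on get_labels() in your processor):

# Hypothetical label list: the unused "X" removed, "[SEP]" as the last entry.
label_list = ["O",
              "B-Pressure_Depletion", "I-Pressure_Depletion",
              "B-Waterflooding", "I-Waterflooding",
              "[CLS]", "[SEP]"]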

@imayachita
Author

Oh yes! Thanks! It is running now.
But I still wonder what caused the examples to be truncated?
Thanks!

@imayachita
Author

@kamalkraj I think I know what caused the examples to be truncated. It is because the BERT tokenization splits words into subwords, so the labels and the tokens no longer line up.

The original data:

figure O
2 O
production O
history O
per O
recovery O
mechanism O
natural B-Pressure Depletion
depletion I-Pressure Depletion
and O
water B-Waterflooding
injection I-Waterflooding
phase O
the O
shallow O
zone O
of O

turned into this:
[screenshot of the truncated, subword-tokenized sentence]

Do you have any idea how to fix this problem?

Thanks a lot! I really appreciate your help.

@kamalkraj
Owner

kamalkraj commented Nov 14, 2019

BERT subword tokenization is handled. Labels and tokenized words will have different lengths.
I assume you are using zip to combine labels and tokens; when you use zip, iteration stops as soon as one of the iterables is exhausted. Labels are assigned only to the first subtoken produced by the BERT tokenization.
[screenshot illustrating the label/subtoken alignment]
But all the tokens are passed to BERT for feature extraction, and after feature extraction only the features of the first subtoken of each word are passed to the classifier layer.
Refer to #53.
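
Roughly, the alignment works like this (a simplified sketch of the idea, not the repo's exact feature-conversion code; the example sentence, label_map, and tokenizer choice here are assumptions):

from transformers import BertTokenizer   # any WordPiece tokenizer illustrates the point

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
words  = ["Peter", "Hedblom", "(", "Sweden", ")"]
labels = ["B-PER", "I-PER", "O", "B-LOC", "O"]
label_map = {"O": 1, "B-PER": 2, "I-PER": 3, "B-LOC": 4}   # illustrative ids only

tokens, valid, label_ids = [], [], []
for word, label in zip(words, labels):
    subtokens = tokenizer.tokenize(word)   # a word may be split into several WordPiece subtokens
    tokens.extend(subtokens)
    for k in range(len(subtokens)):
        valid.append(1 if k == 0 else 0)   # only the first subtoken of a word is marked "valid"
    label_ids.append(label_map[label])     # labels stay aligned with whole words

# All subtokens in `tokens` go to BERT; `valid` is later used to pick out the
# first-subtoken features for the classifier, so nothing is silently dropped.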

@imayachita
Author

Thanks for the explanation @kamalkraj!
So it means that the tokens that appear after the last index of the iteration are not fed to the model.
For example (using the original data you posted):

Peter NNP B-NP B-PER
Hedblom NNP I-NP I-PER
( ( O O
Sweden NNP B-NP B-LOC
) ) O O
70 CD B-NP O
75 CD I-NP O
75 CD I-NP O
, , O O
Retief NNP B-NP B-PER
Goosen NNP I-NP I-PER
( ( O O
South NNP B-NP B-LOC

When I printed the tokens and labels:
[screenshot of the printed tokens and labels]

It seems to me that because the original sentence has 13 words and the BERT tokenization produces more than 13 tokens, the iteration stopped at "," (which got index 12). Therefore, the words "Retief", "Goosen", "(", "South" are not fed into the model.

Do I understand it correctly? Thanks!

@kamalkraj
Owner

@imayachita,
All the tokens are given to the BERT feature extractor, but only the hidden features of the first subtoken of each word are passed to the classifier layer. Check the forward function of the Ner class.

class Ner(BertForTokenClassification):
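
In outline, that forward pass does something like the sketch below (a simplification of the idea, not the repo's exact code; the argument names, the valid_ids tensor, and the import of the base class are assumptions):

import torch
from transformers import BertForTokenClassification   # the repo may use pytorch_transformers instead

class Ner(BertForTokenClassification):
    def forward(self, input_ids, token_type_ids=None, attention_mask=None, valid_ids=None):
        # Every subtoken goes through BERT for feature extraction.
        sequence_output = self.bert(input_ids=input_ids,
                                    token_type_ids=token_type_ids,
                                    attention_mask=attention_mask)[0]
        batch_size, max_len, feat_dim = sequence_output.shape
        # Keep only positions marked valid (the first subtoken of each word),
        # packed to the front of each sequence.
        valid_output = torch.zeros(batch_size, max_len, feat_dim,
                                   dtype=sequence_output.dtype,
                                   device=sequence_output.device)
        for i in range(batch_size):
            pos = 0
            for j in range(max_len):
                if valid_ids[i][j].item() == 1:
                    valid_output[i][pos] = sequence_output[i][j]
                    pos += 1
        # Only these first-subtoken features reach the classifier head.
        logits = self.classifier(self.dropout(valid_output))
        return logits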
