
Trained model on custom dataset, but it predicts only CoNLL-2003 entities #52

Closed
ranjeetds opened this issue Nov 4, 2019 · 16 comments


@ranjeetds

I have a custom dataset with around 8 entity types. I combined it with the CoNLL-2003 data (since I am also interested in the CoNLL-2003 entities) and trained the model. However, the trained model is unable to predict any entities outside CoNLL-2003. Could you please help me figure out whether I am missing anything when training on a custom dataset?

I used the command below to train the model.

nohup python3 run_ner.py --data_dir=data --bert_model=bert-large-cased --task_name=ner --output_dir=out_bert_large --max_seq_length=128 --num_train_epochs 10 --do_train --do_eval --no_cuda --warmup_proportion=0.4 > log.txt &

@kamalkraj
Owner

Hi @ranjeetds,
Did you change `get_labels()`?

@ranjeetds
Author

ranjeetds commented Nov 4, 2019

Yes, I changed it to the list below, which covers the labels I wanted.

```python
# This function returns the list of labels present in the dataset
def get_labels(self):
    return ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC",
            "B-ART", "I-ART", "B-EVE", "I-EVE", "B-GPE", "I-GPE",
            "B-NAT", "I-NAT", "B-TIM", "I-TIM", "[CLS]", "[SEP]"]
```

@kamalkraj
Owner

Check the model_config.json file in the saved model directory and verify that all the labels are present in the label_map.
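
A quick way to verify this is a few lines of Python (a sketch; the path follows the --output_dir used above, and the "label_map" key name follows this repo's model_config.json convention, both assumptions here):

```python
import json

# Confirm every custom label made it into the saved label_map.
with open("out_bert_large/model_config.json") as f:
    config = json.load(f)

expected = {"B-ART", "I-ART", "B-EVE", "I-EVE", "B-GPE", "I-GPE",
            "B-NAT", "I-NAT", "B-TIM", "I-TIM"}
present = set(config["label_map"].values())
print("missing labels:", expected - present)
```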

@ranjeetds
Author

ranjeetds commented Nov 4, 2019

Checked. All the labels are in the model_config.json file's label_map.

@kamalkraj
Owner

What is the distribution of the new entities across the whole dataset?
If possible, share the dataset and I will try it on my machine.

@ranjeetds
Author

Hi,

Below is the distribution across the training dataset:

```
  14988
  47958
    402 B-ART
    308 B-EVE
  15870 B-GPE
  44784 B-LOC
    201 B-NAT
  26464 B-ORG
  23590 B-PER
  20333 B-TIM
    297 I-ART
    253 I-EVE
    198 I-GPE
   8571 I-LOC
     51 I-NAT
  20488 I-ORG
  21779 I-PER
   6528 I-TIM
1063025 O
```
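
For reference, a tag distribution like this can be reproduced with a short Python snippet (a sketch; it assumes the NER tag is the last whitespace-separated column on every non-blank line, as run_ner.py expects):

```python
from collections import Counter

# Count how often each NER tag occurs in the training file.
counts = Counter(line.split()[-1]
                 for line in open("train.txt")
                 if line.strip())
for tag, n in sorted(counts.items()):
    print(f"{n:7d} {tag}")
```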
I am also attaching the files. The valid and test sets are the same as yours.

test.txt
train.txt
valid.txt

@ranjeetds
Author

@kamalkraj did you try this on your machine? If yes, could you please share the results on sentences containing the other entities, such as time, for example "Tomorrow I will write to Shyam regarding hurricane MAHA"?

@kamalkraj
Owner

The training data format was not correct: the new sentences you added were not separated by \n.
Here is the corrected train data:
train.txt
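
A minimal check along these lines can catch missing sentence separators (a sketch; it assumes the blank-line-separated, one-token-per-line layout that run_ner.py reads, and the token threshold is an arbitrary choice):

```python
# Flag suspiciously long "sentences" in a CoNLL-style file; they usually
# mean a blank-line separator between sentences is missing.
def check_separators(path, max_tokens=150):
    length = 0
    for lineno, line in enumerate(open(path), 1):
        if not line.strip():
            length = 0
            continue
        length += 1
        if length == max_tokens:
            print(f"line {lineno}: sentence reached {max_tokens} tokens; "
                  "is a blank line missing?")

check_separators("train.txt")
```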

Training metrics on a bert-base model:
[Screenshot: training metrics, 2019-11-06]

@ranjeetds
Author

ranjeetds commented Nov 7, 2019

Hi @kamalkraj. Thank you for checking the dataset and pointing out the bug. However, when I ran the model on the updated train file you shared, performance dropped to 0 for all classes on the evaluation set. The model now predicts every token as 'O'.

I am running 10 epochs, so do you think the model might be overfitting to the 'O' class? If so, how do I reduce that bias?

@kamalkraj
Copy link
Owner

Hi @ranjeetds,

  1. Don't run the model for 10 epochs; it will only cause overfitting.
  2. Downsample the CoNLL-2003 entities, because the distribution of the new entities you added is very small compared to CoNLL-2003 (one possible approach is sketched below).
  3. Train for 3-5 epochs with a learning rate in {2e-5, 3e-5, 5e-5}.
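
One possible way to do the downsampling in point 2 (a sketch, not code from this repo; the tag set, keep fraction, and file names are assumptions to tune for your data):

```python
import random

NEW_TAGS = {"ART", "EVE", "GPE", "NAT", "TIM"}  # the custom entity types
KEEP_FRACTION = 0.3  # hypothetical ratio; tune it against the dev set

def sentences(path):
    """Yield blank-line-separated sentences as lists of raw lines."""
    sent = []
    for line in open(path):
        if line.strip():
            sent.append(line)
        elif sent:
            yield sent
            sent = []
    if sent:
        yield sent

random.seed(0)
with open("train_downsampled.txt", "w") as out:
    for sent in sentences("train.txt"):
        # Keep every sentence containing a custom entity; subsample the rest.
        types = {l.split()[-1].split("-")[-1] for l in sent}
        if types & NEW_TAGS or random.random() < KEEP_FRACTION:
            out.writelines(sent)
            out.write("\n")
```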

@ranjeetds
Author

Hi @kamalkraj, thanks for the suggestions. I tried them on OntoNotes 5.0 and was able to reach the benchmarks.

Just one last question.

Does your script support transfer learning from a custom-trained model? Say I train a model for 10 entities and store it in an 'out' directory.

Now I get a new dataset with 5 different entities. If I train on it using 'out' as the pretrained model instead of the pretrained BERT model, will that approach work and give better results? Or should I train from scratch each time with the pretrained BERT model?

@kamalkraj
Owner

kamalkraj commented Nov 12, 2019

Hi @ranjeetds,

  • Transfer learning on a custom-trained model is not supported, and I wouldn't recommend doing it.
  • When you add new entities or remove existing ones, the dimensions of the output layer change, so you can only reuse the BERT encoder from the custom-trained model (see the sketch below).
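
To illustrate reusing only the encoder when the label set changes (a sketch, not this repo's code; the checkpoint path, the `classifier` key prefix, and the label count are assumptions):

```python
import torch
from transformers import BertForTokenClassification

# Load the custom-trained checkpoint and drop the old classification
# head, whose dimensions were sized for the previous label set.
state = torch.load("out/pytorch_model.bin", map_location="cpu")
encoder_state = {k: v for k, v in state.items()
                 if not k.startswith("classifier")}

# Fresh model sized for the new label list; strict=False leaves the new
# classifier layer randomly initialized.
model = BertForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=13)  # 13 is a placeholder label count
missing, _ = model.load_state_dict(encoder_state, strict=False)
print("randomly initialized:", missing)  # should list only classifier weights
```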

Could you please share the OntoNotes 5.0 results with me?

@ranjeetds
Author

ranjeetds commented Nov 12, 2019

Hi,

Sure, I will use the BERT encoder from the custom-trained model for fine-tuning.

Below are the results on the OntoNotes 5.0 dataset. SOTA is around 89% F1; with BERT, people have reported F1 around 83% to 85%.

[Screenshot: OntoNotes 5.0 evaluation output]

@kamalkraj
Owner

kamalkraj commented Nov 12, 2019

OntoNotes 5.0 full, or the CoNLL-2012 train/test/dev split?

@ranjeetds
Author

OntoNotes 5.0 full.
I used https://github.com/yuchenlin/OntoNotes-5.0-NER-BIO to convert it to BIO-tagged format.

@Jeetkarsh

I am getting the following error while following the above steps.

```
Traceback (most recent call last):
  File "run_ner.py", line 594, in <module>
    main()
  File "run_ner.py", line 582, in main
    temp_2.append(label_map[logits[i][j]])
KeyError: 0
```
