Format for training custom NER classifiers #6

Closed
mrshu opened this issue Feb 5, 2021 · 6 comments

Comments

mrshu commented Feb 5, 2021

First of all, thanks for open-sourcing trankit. It looks very interesting!

I am interested in training a custom NER model as described in the docs. Could you please comment on what format the .bio files should be stored in?

Thanks!

cc @minhvannguyen

Collaborator

minhhdvn commented Feb 6, 2021

Hi @mrshu,
Thank you for asking.
Regarding the format of the training data for NER modules, the GermEval 2014 NER dataset is a good reference for formatting your own training data.
Please let us know if you have further questions on this.
Thanks

Author

mrshu commented Feb 6, 2021

Thanks for your answer @minhvannguyen!

I put together a simple Colab demo of trankit's NER being fine-tuned on the GermEval 2014 NER dataset:

https://colab.research.google.com/drive/1sgU0U42c1ipn6QbskFcRDs_tdvQgK7Gf

Everything seems to run, but although the training loss gets close to 0 by the second epoch, the dev F1 score is still reported as 0. I suspect this is connected to the format of the input files (currently .tsv). Have you run into something like this, or could you share a few pointers on how an issue like this could be resolved?

If I end up getting it to work, I'd be happy to write it up for the trankit docs so that there could be a step-by-step guide on how to train custom NER models.

Thanks again!

Author

mrshu commented Feb 6, 2021

All right, I found the get_example_from_lines function, which governs which columns are selected: the first one is read as the token and the last one as the NER tag.

I think this should be mentioned in the docs so that it is obvious to anyone trying to train an NER model on a custom dataset.

def get_example_from_lines(sent_lines):
    tokens = []
    ner_tags = []
    for line in sent_lines:
        array = line.split()
        assert len(array) >= 2
        tokens.append(array[0])     # first column: the token
        ner_tags.append(array[-1])  # last column: the NER tag
    ner_tags = convert_to_bioes(convert_to_bio2(ner_tags))
    return {'words': tokens, 'entity-labels': ner_tags}
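Since the function above reads only the first and the last whitespace-separated column, a file with extra columns would need to be reduced to "token tag" lines before training. The following is a minimal sketch of such a conversion, not part of trankit. The helper name to_two_column, the sample rows, and the four-column layout (index, token, outer tag, nested tag) are assumptions made for illustration; check the actual column layout of your own data. Under that assumed layout, an unconverted file would have the index read as the token and the nested tag read as the label.

```python
def to_two_column(lines, token_col, tag_col):
    """Keep only the token and tag columns, so the token ends up first and the tag last.

    Hypothetical helper for illustration; not part of trankit.
    """
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            out.append("")  # keep blank lines as sentence boundaries
            continue
        if line.startswith("#"):
            # Assumption: rows start with a numeric index, so a leading "#"
            # marks a comment/metadata line rather than a token.
            continue
        cols = line.split()
        out.append(f"{cols[token_col]} {cols[tag_col]}")
    return out


# Made-up rows in the assumed layout: index, token, outer tag, nested tag.
sample = [
    "# example sentence",
    "1\tBerlin\tB-LOC\tO",
    "2\tist\tO\tO",
    "3\tgroß\tO\tO",
    "",
]

converted = to_two_column(sample, token_col=1, tag_col=2)
print("\n".join(converted))
```

Choosing the columns by index keeps the sketch independent of any particular dataset; picking tag_col=3 instead would select the nested tag in the assumed layout.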

Collaborator

minhhdvn commented Feb 6, 2021

Thank you a lot for pointing it out, @mrshu.
We will update the docs to specify this information.

Collaborator

minhhdvn commented Feb 6, 2021

The docs have been updated.
Thank you again @mrshu.

Author

mrshu commented Feb 6, 2021

Thanks a ton for your speedy response @minhvannguyen, let's close this issue then!
