Format for training custom NER classifiers #6

mrshu · 2021-02-05T10:13:13Z

First of all, thanks for open-sourcing trankit -- it looks very interesting!

I would be interested in training a custom NER model as described in the docs. Could you please comment on the format in which the .bio files should be stored?

Thanks!

cc @minhvannguyen


minhhdvn · 2021-02-06T05:41:41Z

Hi @mrshu,
Thank you for asking.
Regarding the format of the training data for NER modules, the GermEval 2014 NER dataset would be a good source for formatting your own training data.
Please let us know if you have further questions on this.
Thanks

mrshu · 2021-02-06T14:20:14Z

Thanks for your answer @minhvannguyen!

I put together a simple demo of trankit's NER being fine-tuned on the GermEval 2014 NER dataset in Colab:

https://colab.research.google.com/drive/1sgU0U42c1ipn6QbskFcRDs_tdvQgK7Gf

Everything seems to run, but although the training loss gets close to 0 by the second epoch, the dev F1 score is still reported as 0. I suspect this may be connected to the format of the input files (now .tsv). Have you run into something like this? If not, would you mind sharing a few pointers on how an issue like this could be resolved?

If I end up getting it to work, I'd be happy to write it up for the trankit docs so that there could be a step-by-step guide on how to train custom NER models.

Thanks again!

mrshu · 2021-02-06T16:21:07Z

All right, I found get_example_from_lines, which determines which columns are used: the first column is taken as the token and the last one as the NER tag.

I think something like this could be mentioned in the docs so that it is obvious to anyone trying to train an NER model on a custom dataset.

trankit/trankit/utils/ner_utils.py

Lines 42 to 51 in 576d918

    
    def get_example_from_lines(sent_lines):
        tokens = []
        ner_tags = []
        for line in sent_lines:
            array = line.split()
            assert len(array) >= 2
            tokens.append(array[0])
            ner_tags.append(array[-1])
        ner_tags = convert_to_bioes(convert_to_bio2(ner_tags))
        return {'words': tokens, 'entity-labels': ner_tags}
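
To make the consequence of that column selection concrete, here is a minimal sketch of rewriting a multi-column file so that the token comes first and the tag comes last. It does not call trankit. The helper names (first_and_last, split_sentences, to_token_tag_lines), the defaults token_col=1 and tag_col=2 (a four-column index/token/tag/tag layout), the "#" comment prefix, the blank-line sentence separator, and the sample data are all assumptions made for illustration. The BIO2/BIOES conversion from the snippet above is left out.

```python
def first_and_last(sent_lines):
    """Mimic the column selection quoted above: first column = token, last = tag."""
    tokens, tags = [], []
    for line in sent_lines:
        columns = line.split()
        assert len(columns) >= 2
        tokens.append(columns[0])
        tags.append(columns[-1])
    return tokens, tags


def split_sentences(text):
    """Group non-empty lines into sentences; blank lines are assumed to separate them."""
    sentences, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def to_token_tag_lines(raw_text, token_col=1, tag_col=2, comment_prefix="#"):
    """Rewrite each annotated line as 'token<TAB>tag'.

    Lines starting with comment_prefix are dropped; blank lines are kept
    so that sentence boundaries survive. Column positions are assumptions.
    """
    out = []
    for line in raw_text.splitlines():
        if line.startswith(comment_prefix):
            continue
        if not line.strip():
            out.append("")
            continue
        columns = line.split()
        out.append(f"{columns[token_col]}\t{columns[tag_col]}")
    return "\n".join(out) + "\n"


# Made-up sample in an assumed index/token/tag/tag layout (not real corpus data).
RAW = "\n".join([
    "# made-up sample, not real corpus data",
    "1\tAnna\tB-PER\tO",
    "2\tlives\tO\tO",
    "3\tin\tO\tO",
    "4\tBerlin\tB-LOC\tO",
    "",
    "1\tThanks\tO\tO",
])

converted = to_token_tag_lines(RAW)
sentences = [first_and_last(sent) for sent in split_sentences(converted)]
for tokens, tags in sentences:
    print(tokens, tags)
```

Applied directly to the unconverted sample lines, first_and_last would return the index column as the tokens and the fourth column as the tags, which is why a file in such a layout may need reordering before training.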

minhhdvn · 2021-02-06T17:26:20Z

Thank you a lot for pointing it out, @mrshu.
We will update the docs to specify this information.

minhhdvn · 2021-02-06T19:44:47Z

The docs have been updated.
Thank you again @mrshu.

mrshu · 2021-02-06T22:21:47Z

Thanks a ton for your speedy response, @minhvannguyen. Let's close this issue then!

mrshu closed this as completed Feb 6, 2021

mrshu mentioned this issue Feb 6, 2021

Difficulties in reproducing the GermEval14 NER model #7

Closed
