RoBERTa Named Entity Recognition

This code is based on xlm-roberta-ner by mohammadKhalifa.

Requirements

python 3.6+
torch 1.x
fairseq
pytorch_transformers (for AdamW and the warmup scheduler)

Setting up

Download the Polish RoBERTa base model.

mkdir models/roberta_base_fairseq -p
wget https://github.com/sdadas/polish-roberta/releases/download/models/roberta_base_fairseq.zip
unzip roberta_base_fairseq.zip -d models/roberta_base_fairseq
rm roberta_base_fairseq.zip

Training and evaluating

The code expects the data directory passed to contain 3 dataset splits: train.txt, valid.txt and test.txt.
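
Before starting a long training run, it can help to confirm that the data directory really contains the three expected files. The snippet below is a minimal sketch of such a check. The helper name missing_splits is hypothetical and not part of this repository; only the file names and the data/kpwr_n82 path come from this README.

```python
from pathlib import Path

# The three split files this README says must be present in --data_dir.
REQUIRED_SPLITS = ("train.txt", "valid.txt", "test.txt")


def missing_splits(data_dir):
    """Return the required split files that are absent from data_dir.

    Hypothetical helper for illustration; it is not part of main.py.
    """
    root = Path(data_dir)
    return [name for name in REQUIRED_SPLITS if not (root / name).is_file()]


if __name__ == "__main__":
    # data/kpwr_n82 is the directory used in the commands of this README.
    missing = missing_splits("data/kpwr_n82")
    if missing:
        print("Missing split files:", ", ".join(missing))
    else:
        print("All dataset splits found.")
```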

Named Entity Recognition on KPWr n82

Train the model from scratch

The following commands and parameters were used to train and test a model for fine-grained named entity recognition on the KPWr corpus.

Base model

time python main.py  \
      --data_dir=data/kpwr_n82/  \
      --task_name=ner \
      --output_dir=models/kpwr_n82_base/   \
      --max_seq_length=128   \
      --num_train_epochs 50  \
      --do_eval \
      --warmup_proportion=0.0 \
      --pretrained_path models/roberta_base_fairseq \
      --learning_rate 6e-5 \
      --gradient_accumulation_steps 4 \
      --do_train \
      --eval_on test \
      --train_batch_size 32 \
      --dropout 0.2

Time: 113m29.552s

Large model

time python main.py  \
      --data_dir=data/kpwr_n82/  \
      --task_name=ner \
      --output_dir=models/kpwr_n82_large/   \
      --max_seq_length=128   \
      --num_train_epochs 50  \
      --do_eval \
      --warmup_proportion=0.0 \
      --pretrained_path roberta_large_fairseq \
      --learning_rate 6e-5 \
      --gradient_accumulation_steps 4 \
      --do_train \
      --eval_on test \
      --train_batch_size 32 \
      --dropout 0.2

Time: 260m32.544s
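
Both training commands share the same optimisation flags, and the sketch below illustrates how they might interact. It assumes the convention common in BERT-style fine-tuning scripts, where --train_batch_size is the effective batch per optimizer update and is split evenly across --gradient_accumulation_steps. Whether main.py follows this convention is an assumption and should be checked against the code. The function name training_schedule and the example count of 10000 training sentences are purely illustrative.

```python
def training_schedule(num_examples, train_batch_size,
                      gradient_accumulation_steps, num_train_epochs,
                      warmup_proportion):
    """Derive per-step batch size and update counts from the CLI flags.

    ASSUMPTION: train_batch_size is the effective batch size and is divided
    by gradient_accumulation_steps, as in many BERT fine-tuning scripts.
    This has not been verified against main.py.
    """
    # Number of examples processed in one forward/backward pass.
    micro_batch = train_batch_size // gradient_accumulation_steps
    # One optimizer update happens every gradient_accumulation_steps passes.
    updates_per_epoch = int(
        num_examples / micro_batch / gradient_accumulation_steps
    )
    total_updates = updates_per_epoch * num_train_epochs
    # Number of updates during which the learning rate is ramped up.
    warmup_updates = int(warmup_proportion * total_updates)
    return {
        "micro_batch": micro_batch,
        "total_updates": total_updates,
        "warmup_updates": warmup_updates,
    }


# Flags taken from the commands above; 10000 examples is a made-up figure.
schedule = training_schedule(
    num_examples=10000,
    train_batch_size=32,
    gradient_accumulation_steps=4,
    num_train_epochs=50,
    warmup_proportion=0.0,
)
print(schedule)
```

Under this assumption, --warmup_proportion=0.0 means the learning rate starts at its full value of 6e-5 with no ramp-up phase.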

Download pre-trained model

mkdir -p models
wget https://clarin-pl.eu/dspace/bitstream/handle/11321/743/kpwr_n82_polish_roberta_base.zip
unzip kpwr_n82_polish_roberta_base.zip -d models
rm kpwr_n82_polish_roberta_base.zip

Eval model

time python main.py  \
      --data_dir=data/kpwr_n82/  \
      --task_name=ner \
      --output_dir=models/kpwr_n82_polish_roberta_base/   \
      --max_seq_length=128   \
      --do_eval \
      --pretrained_path models/roberta_base_fairseq \
      --eval_on test \
      --train_batch_size 32

Summary

Results on the test part of the KPWr n82 corpus.

Model                         Precision  Recall  F1     Time     Memory usage  GPU memory  Embeddings size
Polish RoBERTa large          76.10      78.72   77.39  ~ 0.9 m  3.0 GB        3.8 GB      0.71 GB + 1.40 GB
Polish RoBERTa base           74.79      76.42   75.60  ~ 0.5 m  3.0 GB        2.0 GB      0.25 GB + 0.50 GB
PolDeepNer (n82-elmo-kgr10)   73.97      75.49   74.72  ~ 4.0 m  4.5 GB        -           0.4 GB
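
The F1 column is the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a quick sanity check, the sketch below recomputes it from the precision and recall values reported in the table. The function name f1_score is illustrative and is not taken from this repository.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the summary table.
reported = {
    "Polish RoBERTa large": (76.10, 78.72, 77.39),
    "Polish RoBERTa base": (74.79, 76.42, 75.60),
    "PolDeepNer (n82-elmo-kgr10)": (73.97, 75.49, 74.72),
}

for model, (p, r, f1) in reported.items():
    print(f"{model}: computed F1 = {f1_score(p, r):.2f}, reported F1 = {f1:.2f}")
```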

Detailed results for Polish RoBERTa base on KPWr n82 test

                           precision    recall  f1-score   support

nam_adj                       0.4833    0.5577    0.5179        52
nam_adj_city                  0.8409    0.8810    0.8605        42
nam_adj_country               0.6859    0.7892    0.7339       166
nam_adj_person                1.0000    0.3333    0.5000        18
nam_eve                       1.0000    0.8750    0.9333         8
nam_eve_human                 0.3472    0.3205    0.3333        78
nam_eve_human_cultural        0.2609    0.2727    0.2667        22
nam_eve_human_holiday         0.5714    0.4444    0.5000         9
nam_eve_human_sport           0.6712    0.8909    0.7656        55
nam_fac_bridge                0.5000    0.5000    0.5000         4
nam_fac_goe                   0.5179    0.4531    0.4833        64
nam_fac_goe_stop              0.0000    0.0000    0.0000         4
nam_fac_park                  0.8571    0.6000    0.7059        10
nam_fac_road                  0.7788    0.8526    0.8141        95
nam_fac_square                0.6667    0.3333    0.4444         6
nam_fac_system                0.6111    0.4231    0.5000        26
nam_liv_animal                0.0000    0.0000    0.0000        11
nam_liv_god                   0.9412    0.9143    0.9275        35
nam_liv_habitant              0.5000    0.2857    0.3636         7
nam_liv_person                0.8877    0.9319    0.9093       925
nam_loc                       0.0000    0.0000    0.0000         4
nam_loc_country_region        0.2000    0.5000    0.2857         4
nam_loc_gpe_admin1            0.8814    0.8125    0.8455        64
nam_loc_gpe_admin2            0.8286    0.8056    0.8169        36
nam_loc_gpe_admin3            0.8571    0.7660    0.8090        47
nam_loc_gpe_city              0.8069    0.8604    0.8328       437
nam_loc_gpe_country           0.9103    0.9384    0.9241       357
nam_loc_gpe_district          0.3077    0.2222    0.2581        18
nam_loc_gpe_subdivision       0.5714    0.4615    0.5106        26
nam_loc_historical_region     0.6190    0.5909    0.6047        22
nam_loc_hydronym              0.0000    0.0000    0.0000         1
nam_loc_hydronym_lake         1.0000    0.5000    0.6667         2
nam_loc_hydronym_ocean        0.5000    1.0000    0.6667         1
nam_loc_hydronym_river        0.9048    0.7451    0.8172        51
nam_loc_hydronym_sea          1.0000    0.6667    0.8000         3
nam_loc_land                  0.0000    0.0000    0.0000         2
nam_loc_land_continent        0.9667    0.9062    0.9355        32
nam_loc_land_island           0.8000    0.7273    0.7619        11
nam_loc_land_mountain         0.7143    0.5556    0.6250         9
nam_loc_land_region           0.5333    0.7273    0.6154        11
nam_num_house                 0.8462    1.0000    0.9167        11
nam_num_phone                 0.0000    0.0000    0.0000         2
nam_org_company               0.6923    0.5921    0.6383        76
nam_org_group                 0.2500    0.1667    0.2000        18
nam_org_group_band            0.5500    0.5789    0.5641        19
nam_org_group_team            0.8627    0.8919    0.8771       148
nam_org_institution           0.6622    0.7368    0.6975       266
nam_org_nation                0.8333    0.7407    0.7843        81
nam_org_organization          0.6939    0.6911    0.6925       246
nam_org_organization_sub      0.0000    0.0000    0.0000         3
nam_org_political_party       0.8615    0.9655    0.9106        58
nam_oth                       0.3929    0.5000    0.4400        22
nam_oth_currency              0.9388    0.9020    0.9200        51
nam_oth_data_format           0.0000    0.0000    0.0000        10
nam_oth_license               0.4118    0.6364    0.5000        11
nam_oth_position              0.4167    0.5000    0.4545        10
nam_oth_tech                  0.6981    0.6066    0.6491        61
nam_oth_www                   0.5000    0.1000    0.1667        20
nam_pro                       0.0000    0.0000    0.0000         2
nam_pro_award                 0.6316    0.5217    0.5714        23
nam_pro_brand                 0.5000    0.5000    0.5000        46
nam_pro_media                 0.2857    0.2500    0.2667         8
nam_pro_media_periodic        0.7857    0.8049    0.7952        82
nam_pro_media_radio           0.4000    0.6667    0.5000         3
nam_pro_media_tv              0.5000    0.8571    0.6316         7
nam_pro_media_web             0.3538    0.5750    0.4381        40
nam_pro_model_car             0.7778    0.8077    0.7925        26
nam_pro_software              0.7013    0.5567    0.6207        97
nam_pro_software_game         0.3333    0.3333    0.3333         3
nam_pro_title                 0.4146    0.4857    0.4474        35
nam_pro_title_album           0.5000    0.5714    0.5333         7
nam_pro_title_book            0.6667    0.5455    0.6000        11
nam_pro_title_document        0.5444    0.5904    0.5665        83
nam_pro_title_song            0.2000    0.2857    0.2353         7
nam_pro_title_treaty          0.0000    0.0000    0.0000         2
nam_pro_title_tv              0.4500    0.3750    0.4091        24
nam_pro_vehicle               0.0000    0.0000    0.0000         4

micro avg                     0.7479    0.7642    0.7560      4398
macro avg                     0.7491    0.7642    0.7529      4398

Sample usage

Command:

time python sample.py

Expected output:

--------------------
Ala          B-nam_liv_person
z            O
Krakowa      B-nam_loc_gpe_city
jeździ       O
Audi         B-nam_pro_brand
--------------------
Marek        B-nam_liv_person
Nowak        I-nam_liv_person
z            O
Politechniki B-nam_org_organization
Wrocławskiej I-nam_org_organization
mieszka      O
przy         O
ul           O
.            O
Sądeckiej    B-nam_fac_road
--------------------

real	0m6.309s
user	0m9.360s
sys 	0m3.136s
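
The output above uses the BIO scheme: a B- tag opens an entity, I- tags with the same type continue it, and O marks tokens outside any entity. The sketch below shows one way to group such token-level tags into whole entities. The function bio_to_entities is a hypothetical helper and is not part of sample.py, whose internals are not described here. The tokens and tags are copied from the second sentence of the expected output.

```python
def bio_to_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) pairs.

    Illustrative helper only. An I- tag that does not continue the
    currently open entity is discarded, which is one possible design choice.
    """
    entities = []
    current_type, current_tokens = None, []

    def flush():
        # Store the open entity, if any.
        if current_tokens:
            entities.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not match the open entity.
            flush()
            current_type, current_tokens = None, []
    flush()
    return entities


tokens = ["Marek", "Nowak", "z", "Politechniki", "Wrocławskiej",
          "mieszka", "przy", "ul", ".", "Sądeckiej"]
tags = ["B-nam_liv_person", "I-nam_liv_person", "O",
        "B-nam_org_organization", "I-nam_org_organization",
        "O", "O", "O", "O", "B-nam_fac_road"]

for entity_type, text in bio_to_entities(tokens, tags):
    print(f"{entity_type:24s}{text}")
```

For this sentence the helper yields three entities: "Marek Nowak" (nam_liv_person), "Politechniki Wrocławskiej" (nam_org_organization) and "Sądeckiej" (nam_fac_road).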
