limteng-rpi/neural_name_tagging

Dynamic Feature Composition for Name Tagging

Code for our ACL 2019 paper, Reliability-aware Dynamic Feature Composition for Name Tagging.

Input Data Set Directory Structure

  • <input_dir>
    • embed.vocab.tsv (embedding vocab file, 1st column: token, 2nd column: index)
    • embed.count.tsv (embedding token frequency file, 1st column: token, 2nd column: frequency)
    • bc
      • train.tsv (training set)
      • dev.tsv (development set)
      • test.tsv (test set)
      • token.vocab.tsv (token vocab file, 1st column: token, 2nd column: index)
      • char.vocab.tsv (character vocab file: 1st column: character, 2nd column: index)
      • label.vocab.tsv (label vocab file: 1st column: label, 2nd column: index)
    • bn
    • mz
    • nw
    • tc
    • wb

Note:

  • Each of the other subset directories (bn, mz, nw, tc, wb) contains the same six files as bc: train.tsv, dev.tsv, test.tsv, token.vocab.tsv, char.vocab.tsv, and label.vocab.tsv.
  • In our experiments, we generated the *.vocab.tsv files from a merged data set of all subsets.
  • In our experiments, we used CoNLL format files generated from OntoNotes 5.0 with Pradhan et al.'s scripts, which can be found at https://cemantix.org/data/ontonotes.html.
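
The two-column vocab files described above can be read with a few lines of Python. The following is a minimal sketch, not the repository's own loader: the function names `load_vocab` and `load_subset_vocabs` are hypothetical, and it assumes each line holds an item and an integer index separated by a single tab.

```python
import os


def load_vocab(path):
    """Read a two-column TSV (item, index) into a dict mapping item -> int.

    Assumption: columns are tab-separated and the index is the last column.
    """
    vocab = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            # Split on the last tab so the index is always the final field.
            item, index = line.rsplit("\t", 1)
            vocab[item] = int(index)
    return vocab


def load_subset_vocabs(input_dir, subset):
    """Load token, char, and label vocabs for one subset directory (e.g. "bc")."""
    vocabs = {}
    for name in ("token", "char", "label"):
        path = os.path.join(input_dir, subset, "{}.vocab.tsv".format(name))
        vocabs[name] = load_vocab(path)
    return vocabs
```

For example, `load_subset_vocabs("<input_dir>", "bc")["label"]` would return the label-to-index mapping for the bc subset, provided the files follow the layout listed above.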

Pre-processing

The following functions in proprocess.py can be used to create the vocab and frequency files.

  • build_all_vocabs takes a list of CoNLL format files as input and generates {token,char,label}.vocab.tsv in output_dir.
  • build_embed_vocab takes a pre-trained embedding file as input and returns the embedding vocab.
  • build_embed_token_count takes a pre-trained embedding file as input and generates an embedding token frequency file.
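
To illustrate what building the three vocabs involves, here is a minimal sketch. It is not the code in the repository: the helper names `read_conll_sentences`, `build_vocabs`, and `write_vocab` are hypothetical, and it assumes one token per line with whitespace-separated columns, the token in the first column, the label in the last column, and blank lines between sentences. It also does not reserve special entries such as padding or unknown tokens.

```python
from collections import Counter


def read_conll_sentences(lines):
    """Yield sentences as lists of (token, label) pairs.

    Assumption: token is the first column, label is the last column,
    and a blank line ends a sentence.
    """
    sentence = []
    for line in lines:
        line = line.strip()
        if not line:
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split()
        sentence.append((columns[0], columns[-1]))
    if sentence:
        yield sentence


def build_vocabs(sentences):
    """Return (token_vocab, char_vocab, label_vocab) as item -> index dicts."""
    tokens, chars, labels = Counter(), Counter(), Counter()
    for sentence in sentences:
        for token, label in sentence:
            tokens[token] += 1
            chars.update(token)  # counts each character of the token
            labels[label] += 1

    def index(counter):
        # Most frequent items first; ties broken alphabetically for determinism.
        ordered = sorted(counter, key=lambda item: (-counter[item], item))
        return {item: i for i, item in enumerate(ordered)}

    return index(tokens), index(chars), index(labels)


def write_vocab(vocab, path):
    """Write a vocab as a two-column TSV: item, index."""
    with open(path, "w", encoding="utf-8") as f:
        for item, idx in sorted(vocab.items(), key=lambda kv: kv[1]):
            f.write("{}\t{}\n".format(item, idx))
```

Passing the lines of all subsets through `read_conll_sentences` before calling `build_vocabs` would mirror the merged-data-set setup mentioned in the notes on the input directory.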

Train LSTM-CNN

python train_lstmcnn_all.py -d 0 -i <input_dir> -o <output_dir> -e <embedding_file>
  --embed_vocab <embedding_vocab_file> --char_dim 50 --seed <random_seed>

This script trains a model for each subset (the subsets can be specified with the --datasets argument) and reports within-subset (within-genre) and cross-subset (cross-genre) performance.
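
The distinction between within-genre and cross-genre evaluation can be made concrete by enumerating the train/test combinations over the six subsets. This is only an illustrative sketch; the function name `evaluation_pairs` is hypothetical and the training script's actual evaluation loop may be organized differently.

```python
# The six subset directories listed in the input data set structure.
GENRES = ["bc", "bn", "mz", "nw", "tc", "wb"]


def evaluation_pairs(genres):
    """Return (train_genre, test_genre, kind) for every combination.

    A pair is "within-genre" when the model is tested on the genre it was
    trained on, and "cross-genre" otherwise.
    """
    pairs = []
    for train in genres:
        for test in genres:
            kind = "within-genre" if train == test else "cross-genre"
            pairs.append((train, test, kind))
    return pairs


pairs = evaluation_pairs(GENRES)
within = [p for p in pairs if p[2] == "within-genre"]
cross = [p for p in pairs if p[2] == "cross-genre"]
print(len(within), len(cross))  # prints "6 30"
```

With six subsets, each trained model is evaluated once on its own genre and five times on other genres.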

Train LSTM-CNN with Dynamic Feature Composition

python train_lstmcnn_dfc_all.py -d 0 -i <input_dir> -o <output_dir> -e <embedding_file>
  --embed_vocab <embedding_vocab_file> --embed_count <embedding_freq_file> --char_dim 50 --seed <random_seed>
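
When comparing the two training scripts over several random seeds, it can help to generate the command lines programmatically. The sketch below only assembles the flags shown in the commands above; the helper `build_command`, the example paths, the seed values, and the per-seed output directory layout are all assumptions made for illustration. The parameter name `device` is likewise a guess at the meaning of -d, which this README does not explain.

```python
import shlex


def build_command(script, input_dir, output_dir, embedding_file, embed_vocab,
                  seed, embed_count=None, device=0, char_dim=50):
    """Assemble the argument list for one training run.

    Pass embed_count only for train_lstmcnn_dfc_all.py, which takes the
    additional --embed_count flag.
    """
    args = [
        "python", script,
        "-d", str(device),
        "-i", input_dir,
        "-o", output_dir,
        "-e", embedding_file,
        "--embed_vocab", embed_vocab,
    ]
    if embed_count is not None:
        args += ["--embed_count", embed_count]
    args += ["--char_dim", str(char_dim), "--seed", str(seed)]
    return args


if __name__ == "__main__":
    # Hypothetical paths and seeds, for illustration only.
    for seed in (1, 2, 3):
        command = build_command(
            "train_lstmcnn_dfc_all.py",
            input_dir="data",
            output_dir="output/seed_{}".format(seed),
            embedding_file="embeddings.txt",
            embed_vocab="data/embed.vocab.tsv",
            embed_count="data/embed.count.tsv",
            seed=seed,
        )
        # shlex.quote keeps paths with spaces safe when pasted into a shell.
        print(" ".join(shlex.quote(arg) for arg in command))
```

The returned list can also be passed directly to `subprocess.run` instead of being printed.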

Requirements

  • Python 3.5+
  • PyTorch 1.0

Resources

Reference

Lin, Y., Liu, L., Ji, H., Yu, D., Han, J. (2019) Reliability-aware Dynamic Feature Composition for Name Tagging. Proceedings of The 57th Annual Meeting of the Association for Computational Linguistics.

@inproceedings{lin2019reliability,
  title={Reliability-aware Dynamic Feature Composition for Name Tagging},
  author={Lin, Ying and Liu, Liyuan and Ji, Heng and Yu, Dong and Han, Jiawei},
  booktitle={Proceedings of The 57th Annual Meeting of the Association for Computational Linguistics (ACL2019)},
  year={2019}
}
