Learning-Numeral-Embeddings

Source code for the paper "Learning Numeral Embedding" by Chengyue Jiang, Zhonglin Nian, Kaihao Guo, Shanbo Chu, Yinggong Zhao, Libin Shen, and Kewei Tu, accepted to Findings of EMNLP 2020.

Citation

@inproceedings{jiang-etal-2020-learning,
    title = "Learning Numeral Embedding",
    author = "Jiang, Chengyue  and
      Nian, Zhonglin  and
      Guo, Kaihao  and
      Chu, Shanbo  and
      Zhao, Yinggong  and
      Shen, Libin  and
      Tu, Kewei",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.findings-emnlp.235",
    pages = "2586--2599",
    abstract = "Word embedding is an essential building block for deep learning methods for natural language processing. Although word embedding has been extensively studied over the years, the problem of how to effectively embed numerals, a special subset of words, is still underexplored. Existing word embedding methods do not learn numeral embeddings well because there are an infinite number of numerals and their individual appearances in training corpora are highly scarce. In this paper, we propose two novel numeral embedding methods that can handle the out-of-vocabulary (OOV) problem for numerals. We first induce a finite set of prototype numerals using either a self-organizing map or a Gaussian mixture model. We then represent the embedding of a numeral as a weighted average of the prototype number embeddings. Numeral embeddings represented in this manner can be plugged into existing word embedding learning approaches such as skip-gram for training. We evaluated our methods and showed its effectiveness on four intrinsic and extrinsic tasks: word similarity, embedding numeracy, numeral prediction, and sequence labeling.",
}

Run the code (Example)

Requirements

pytorch==0.4.1
scikit-learn==0.19.2
matplotlib==2.2.2
seaborn==0.9.0
numpy==1.15.0

Check options

To list all options and their explanations, run:

python <...>.py --help

Download data

The Wikipedia data is too large to include in this repository, so we describe how we obtained it instead. You can download the Wikipedia dumps here: https://dumps.wikimedia.org/enwiki/20210101/. We took our data from the "Articles, templates, media/file descriptions, and primary meta-pages." section. An example download URL: https://dumps.wikimedia.org/enwiki/20210101/enwiki-20210101-pages-articles1.xml-p1p41242.bz2

We downloaded about 20 of these files and preprocessed them using this Python script: https://github.com/jeffchy/Learning-Numeral-Embeddings/blob/master/pytorch-sgn/data/wikipedia/process_wiki.py
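Before running the full preprocessing, it can help to peek at a downloaded dump file without decompressing it to disk. The following is a minimal sketch using only the Python standard library. The helper name iter_dump_lines is hypothetical and is not part of process_wiki.py.

```python
import bz2


def iter_dump_lines(path, limit=None):
    """Yield text lines from a .bz2 file, decompressing on the fly.

    Hypothetical helper for inspection only; not taken from process_wiki.py.
    """
    with bz2.open(path, mode="rt", encoding="utf-8", errors="replace") as handle:
        for index, line in enumerate(handle):
            # Stop early when the caller only wants a preview.
            if limit is not None and index >= limit:
                break
            yield line.rstrip("\n")
```

For example, list(iter_dump_lines(path, limit=20)) returns the first 20 lines of the dump at path, which is usually enough to confirm that the download is intact.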

Preprocess

python preprocess.py --MAXDATA=20000000 --save_dir=data/wikipedia/preprocess1B --filtered=filtered.txt --corpus=data/wikipedia/wikiraw/bz2/train.txt --max_vocab=300000 --mode=all --window=5 --scheme=numeral_as_numeral --saved_dir_name=NumeralAsNumeral30W

This command preprocesses the original plain-text corpus <--corpus> (train.txt), writes a filtered plain-text file <--filtered> (filtered.txt), and then generates all files needed for training, including vocabularies and training batches, in <--save_dir>.
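As a rough illustration of what "vocabularies and training batches" can mean for skip-gram training, the sketch below caps a vocabulary at a maximum size and emits (center, context) index pairs within a window, mirroring the roles of --max_vocab and --window. The helper names build_vocab and skipgram_pairs are hypothetical and are not taken from preprocess.py. The actual script may filter, subsample, and batch differently.

```python
from collections import Counter


def build_vocab(tokens, max_vocab):
    """Map the max_vocab most frequent tokens to integer ids (hypothetical helper)."""
    counts = Counter(tokens)
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(max_vocab))}


def skipgram_pairs(tokens, vocab, window):
    """Return (center_id, context_id) pairs within `window` positions.

    Out-of-vocabulary tokens are dropped before pairing, which is an
    assumption of this sketch rather than documented behavior of the repo.
    """
    ids = [vocab[t] for t in tokens if t in vocab]
    pairs = []
    for i, center in enumerate(ids):
        lo = max(0, i - window)
        hi = min(len(ids), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, ids[j]))
    return pairs
```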

Train SOM / GMM

python preprocess.py --mode=train_som --num_iters=200000 --num_prototypes=100 --lr=1 --sigma=0.6 --save_dir=data/wikipedia/preprocess1B/ --saved_dir_name=NumeralAsNumeral30W

python preprocess.py --mode=train_gmm --num_components=200 --gmm_iters=100 --save_dir=data/wikipedia/preprocess1B/ --prototype_path=prototypes-200-0.6-1.0.dat --saved_dir_name=NumeralAsNumeral30W --gmm_type=hard
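To give a feel for what inducing prototypes with a Gaussian mixture model involves, here is a minimal, dependency-free sketch of EM for a one-dimensional mixture, where the fitted component means play the role of prototype numerals. The function name fit_gmm_1d and its arguments are hypothetical. This is not the code behind --mode=train_gmm, which may transform the numerals first, use hard assignments (--gmm_type=hard), or rely on a library such as scikit-learn's sklearn.mixture.GaussianMixture.

```python
import math


def fit_gmm_1d(values, init_means, iters=50):
    """Fit a 1-D Gaussian mixture with soft EM (hypothetical illustration).

    Returns (weights, means, variances); the means can serve as prototypes.
    """
    n = len(values)
    k = len(init_means)
    means = list(init_means)
    overall_mean = sum(values) / n
    overall_var = sum((v - overall_mean) ** 2 for v in values) / n or 1.0
    variances = [overall_var] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each value.
        resp = []
        for v in values:
            dens = [
                w * math.exp(-((v - m) ** 2) / (2 * s)) / math.sqrt(2 * math.pi * s)
                for w, m, s in zip(weights, means, variances)
            ]
            total = sum(dens) or 1e-300  # guard against underflow to zero
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and variance of each component.
        for c in range(k):
            nk = sum(r[c] for r in resp)
            if nk < 1e-12:
                continue  # leave an empty component unchanged
            weights[c] = nk / n
            means[c] = sum(r[c] * v for r, v in zip(resp, values)) / nk
            var = sum(r[c] * (v - means[c]) ** 2 for r, v in zip(resp, values)) / nk
            variances[c] = max(var, 1e-6)  # floor keeps densities finite
    return weights, means, variances
```

With two well-separated clusters of values and one initial mean near each, the returned means settle close to the cluster centers.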

Train word vectors

python train.py --cuda --weights --preprocess_dir=./data/wikipedia/preprocess1B/NumeralAsNumeral30W --save_dir=./data/wikipedia/save/1B30W/prototypes/2-0005 --log_dir=./data/wikipedia/logs/1B30W/prototypes/2-0005 --epoch=1 --n_neg=5 --mb=2048 --scheme=prototype --e_dim=300 --prototypes_path=prototypes-200-0.6-1.0.dat --lr=0.0005

Generate Numeral Embeddings

See the file create_embedding.py in /experiments/probing/src.
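The paper's abstract describes the embedding of a numeral as a weighted average of prototype embeddings, which is what allows out-of-vocabulary numerals to be embedded. The sketch below illustrates that idea in plain Python. The weighting used here (inverse distance in a log-squashed space, controlled by beta) is an assumption chosen for illustration, and the names squash, prototype_weights, and numeral_embedding are hypothetical. Refer to create_embedding.py for the actual procedure.

```python
import math


def squash(x):
    """Signed log compression so that large numerals stay comparable (assumed)."""
    if x > 1:
        return math.log(x) + 1.0
    if x < -1:
        return -math.log(-x) - 1.0
    return float(x)


def prototype_weights(numeral, prototypes, beta=1.0, eps=1e-8):
    """Normalized weights that grow as a prototype gets closer to the numeral."""
    target = squash(numeral)
    raw = [(abs(squash(p) - target) + eps) ** -beta for p in prototypes]
    total = sum(raw)
    return [r / total for r in raw]


def numeral_embedding(numeral, prototypes, prototype_vectors, beta=1.0):
    """Weighted average of the prototype vectors for a single numeral."""
    weights = prototype_weights(numeral, prototypes, beta)
    dim = len(prototype_vectors[0])
    return [
        sum(w * vec[d] for w, vec in zip(weights, prototype_vectors))
        for d in range(dim)
    ]
```

Because the weights are non-negative and sum to one, every numeral, including one never seen in training, receives a vector inside the convex hull of the prototype vectors.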
