Hierarchical Embeddings for Hypernymy Detection and Directionality
- spaCy: for parsing, version 2.0.11
- a plain-text corpus, such as a Wikipedia corpus
Create the feature files:
python create_features.py -input corpus-file.txt -output output-file-name -pos pos_tag
where pos_tag is either NN (for noun features) or VB (for verb features).
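Both feature files are typically needed, so the command above can be wrapped in a small helper. The following is a minimal sketch, not part of the repository: the function names and the `prefix-NN` / `prefix-VB` output naming are assumptions made for illustration, and only the flags shown in the command above are used.

```python
import subprocess

# The two POS tags the command above accepts.
VALID_POS = ("NN", "VB")


def feature_command(corpus, output, pos):
    """Build the argument list for one create_features.py run."""
    if pos not in VALID_POS:
        raise ValueError("pos must be one of %s, got %r" % (VALID_POS, pos))
    return ["python", "create_features.py",
            "-input", corpus, "-output", output, "-pos", pos]


def create_all_features(corpus, prefix):
    """Run feature extraction once per POS tag.

    The output naming (prefix + "-" + tag) is a hypothetical convention.
    """
    for pos in VALID_POS:
        subprocess.run(feature_command(corpus, prefix + "-" + pos, pos),
                       check=True)
```

Calling `create_all_features("corpus-file.txt", "features")` would run the script twice, once for nouns and once for verbs, and stop at the first failing run.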
See config.cfg to set the arguments for the model.
java -jar HyperVec.jar config.cfg vector-size window-size
For example, to train 100-dimensional embeddings with a window size of 5:
java -jar HyperVec.jar config.cfg 100 5
Pretrained (hypervec) embeddings
The embeddings used in our paper can be downloaded with the script
get-pretrainedHyperVecEmbeddings/download_embeddings.sh. Note that the script downloads 9 files and concatenates them into a single file (hypervec.txt.gz).
The format is the default word2vec format: the first line contains header information, and each subsequent line contains a word followed by its whitespace-separated vector.
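The format described above can be read with a few lines of Python. This is a minimal sketch under some assumptions: the file is UTF-8 encoded, the header holds the vocabulary size followed by the dimensionality, and words contain no whitespace. The function names are illustrative and not part of this repository.

```python
import gzip
import io


def read_word2vec_text(lines):
    """Parse word2vec text format from an iterable of lines.

    Returns (vocab_size, dim, vectors); vectors maps word -> list of floats.
    """
    lines = iter(lines)
    header = next(lines).split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) != dim + 1:
            continue  # skip empty or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vocab_size, dim, vectors


def load_embeddings(path="hypervec.txt.gz"):
    """Read a gzip-compressed vector file (UTF-8 is an assumption)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return read_word2vec_text(handle)


# Tiny in-memory example in the same layout as the real file.
sample = io.StringIO("2 3\ncat 0.1 0.2 0.3\nanimal 0.4 0.5 0.6\n")
vocab_size, dim, vectors = read_word2vec_text(sample)
```

With about 2.7m words at 100 dimensions, holding everything in plain Python lists takes a lot of memory; an array library or a vocabulary filter may be preferable for the full file.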
Information about the embeddings: created using the ENCOW14A corpus (14.5bn tokens), 100 dimensions, symmetric window of 5, 15 negative samples, learning rate of 0.025, threshold set to 0.05. The resulting vocabulary contains about 2.7m words.
Example usage: Evaluation on BLESS, BIBLESS and AWBLESS
To reproduce our experiments from Table 3 use the code in the
assuming your vector file is located in the same folder and named
java -jar eval-dir.jar hypervec.txt.gz (Evaluate directionality on
BLESS.txt using hyperscore)
java -jar eval-bless.jar hypervec.txt.gz 2 1000 (Evaluate classification on
BIBLESS.txt, AWBLESS.txt using 2% of the training data and 1000 random iterations)
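For readers who want to inspect the hyperscore outside the jar files, here is a minimal sketch. It assumes the definition given in the accompanying paper, HyperScore(u, v) = cos(u, v) * ||v|| / ||u||, where u is the candidate hyponym and v the candidate hypernym. The evaluation jars may differ in detail, and the function names below are illustrative.

```python
import math


def norm(u):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in u))


def cosine(u, v):
    """Cosine similarity; raises ValueError for zero vectors."""
    nu, nv = norm(u), norm(v)
    if nu == 0.0 or nv == 0.0:
        raise ValueError("cosine is undefined for zero vectors")
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def hyperscore(hypo, hyper):
    """Assumed definition: cos(hypo, hyper) * ||hyper|| / ||hypo||."""
    return cosine(hypo, hyper) * norm(hyper) / norm(hypo)


# Toy vectors: same direction, the second one twice as long.
u = [1.0, 0.0]
v = [2.0, 0.0]
forward = hyperscore(u, v)   # u as hyponym, v as hypernym
backward = hyperscore(v, u)  # roles swapped
```

Because the cosine is symmetric, the two directions differ only in the norm ratio, so for a pair with positive cosine the higher score goes to the ordering that puts the longer vector in the hypernym position.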