Skip to content
Switch branches/tags
Go to file

Latest commit


Git stats


Failed to load latest commit information.
Latest commit message
Commit time


The dataset for evaluating of vector differences.

Take and Took, Gaggle and Goose, Book and Read: Evaluating the Utility of Vector Differences for Lexical Relation Learning

Ekaterina Vylomova, Laura Rimell, Trevor Cohn, and Timothy Baldwin

Recent work has shown that simple vector subtraction over word embeddings is surprisingly effective at capturing different lexical relations, despite lacking explicit supervision. Prior work has evaluated this intriguing result using a word analogy prediction formulation and hand-selected relations, but the generality of the finding over a broader range of lexical relation types and different learning settings has not been evaluated. In this paper, we carry out such an evaluation in two learning settings: (1) spectral clustering to induce word relations, and (2) supervised learning to classify vector differences into relation types. We find that word embeddings capture a surprising amount of information, and that, under suitable supervised training, vector subtraction generalises well to a broad range of relations, including over unseen lexical items.

There are asymmetric relations consisting of either verbs or nouns.

The final dataset consists of 12,458 triples {relation, word_1 , word_2}, comprising 15 relation types, extracted from SemEval’12 (Jurgens et al., 2012), BLESS (Baroni and Lenci, 2011), the MSR analogy dataset (Mikolov et al., 2013c), the light verb dataset of Tan et al. (2006a), Princeton WordNet (Fellbaum, 1998), Wiktionary, and a web lexicon of collective nouns

The suset of 9 classes consists of the following relations: collective_noun, event, hyper, mero, noun_singplur, prefix$re, verb_3rd, verb_3rd_past, verb_past

Other: [Slides from ACL'2016 talk] (, [arxiv version of the paper] (, [ACL'2016 talk/video] (, [an article on TechnologyReview] (


Generates the files with vectors difference
Usage: -s <semanticrels> -v <vectors> -o <outputfile>
<semanticrels> -- input file with relations
<vectors> -- input file with word embeddings (use ' '(space) as delimiter)
<outputfile> -- output file

Generate negative samples (opposite and shuffled pairs)
Usage: -s <semanticrels>
<semanticrels> -- input file with relations
Outputs <semanticrels>.oppos.csv and <semanticrels>.shuff.csv files with opposite and shiffled pairs
Run to get WEs for their differences

Run Spectral Clustering over the vectors
Usage: -f <file_vectors> -c <clust_type:affin, knn> -n <number_of_clusters> -d <distance:euclid, cosine> -p <parameter: gamma, k(NN)> [-s <std>]
Example: python -f google.300d.vec -c affin -n 80 -d euclid -p 0.1
* prints V-Measure and Homogeneity
* for each DiffVec it's cluster in the format using ' : ' as delimiter
Automaically gererates the file name like fname.affin."$clust_number".euclid.0.1.1

Script that runs
Need to specify settings and the file inside

Script for clustering evaluation.
Takes average V-Measure for various runs depending on the number of the clusters
Need to set your file name inside
Better to put each run over several clusters in a separate EX\d(EX1, EX2, ..) folder


Evaluating the Utility of Vector Differences for Lexical Relation Learning



No releases published


No packages published