Generic Contexts Word Embeddings: A count-based word embedding for generic contexts.
Based on Levy, Goldberg and Dagan's code (Levy et al. 2015): https://bitbucket.org/omerlevy/hyperwords/src/
Currently, this repository contains a direct copy of the aforementioned code. It will be subsequently modified for the aforementioned purposes.
hyperwords is a collection of scripts and programs for creating word representations, designed to facilitate academic research and prototyping of word representations. It allows you to tune many hyperparameters that are pre-set or ignored in other word representation packages.
hyperwords is free and open software. If you use hyperwords in scientific publication, we would appreciate citations:
"Improving Distributional Similarity with Lessons Learned from Word Embeddings"
Omer Levy, Yoav Goldberg, and Ido Dagan. TACL 2015.
Running hyperwords may require a lot of computational resources:
- disk space for independently pre-processing the corpus
- internal memory for loading sparse matrices
- significant running time; hyperwords is neither optimized nor multi-threaded
hyperwords assumes a *nix shell, and requires Python 2.7 (or later, excluding 3+) with the following packages installed: numpy, scipy, sparsesvd, docopt.
- Download the latest version from BitBucket, unzip, and make sure all scripts have running permissions (chmod 755 *.sh).
- Download a text corpus of your choice.
- To create word vectors...
- ...with SVD over PPMI, use: corpus2svd.sh
- ...with SGNS (skip-grams with negative sampling), use: corpus2sgns.sh
- The vectors should be available in textual format under <output_path>/vectors.txt
To explore the list of hyperparameters, use the -h or --help option.
##Pipeline## The following figure shows the hyperwords' pipeline:
DATA: raw corpus => corpus => pairs => counts => vocab
TRADITIONAL: counts + vocab => pmi => svd
EMBEDDINGS: pairs + vocab => sgns
raw corpus => corpus
- scripts/clean_corpus.sh
- Eliminates non-alphanumeric tokens from the original corpus.
corpus => pairs
- corpus2pairs.py
- Extracts a collection of word-context pairs from the corpus.
pairs => counts
- scripts/pairs2counts.sh
- Aggregates identical word-context pairs.
counts => vocab
- counts2vocab.py
- Creates vocabularies with the words' and contexts' unigram distributions.
counts + vocab => pmi
- counts2pmi.py
- Creates a PMI matrix (scipy.sparse.csr_matrix) from the counts.
pmi => svd
- pmi2svd.py
- Factorizes the PMI matrix using SVD. Saves the result as three dense numpy matrices.
pairs + vocab => sgns
- word2vecf/word2vecf
- An external program for creating embeddings with SGNS. For more information, see:
"Dependency-Based Word Embeddings". Omer Levy and Yoav Goldberg. ACL 2014.
An example pipeline is demonstrated in: example_test.sh
##Evaluation## hyperwords also allows easy evaluation of word representations on two tasks: word similarity and analogies.
Word Similarity
- hyperwords/ws_eval.py
- Compares how a representation ranks pairs of related words by similarity versus human ranking.
- 5 readily-available datasets
Analogies
- hyperwords/analogy_eval.py
- Solves analogy questions, such as: "man is to woman as king is to...?" (answer: queen).
- 2 readily-available datasets
- Shows results of two analogy recovery methods: 3CosAdd and 3CosMul. For more information, see:
"Linguistic Regularities in Sparse and Explicit Word Representations". Omer Levy and Yoav Goldberg. CoNLL 2014.
These programs assume that the representation was created by hyperwords, and can be loaded by hyperwords.representations.embedding.Embedding. Dense vectors in textual format (such as the ones produced by word2vec and GloVe) can be converted to hyperwords' format using hyperwords/text2numpy.py.