-
Notifications
You must be signed in to change notification settings - Fork 5
/
README
89 lines (65 loc) · 3.71 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
# hyperwords: Hyperparameter-Enabled Word Representations #
hyperwords is a collection of scripts and programs for creating word representations, designed to facilitate academic
research and prototyping of word representations. It allows you to tune many hyperparameters that are pre-set or
ignored in other word representation packages.
hyperwords is free and open software. If you use hyperwords in scientific publication, we would appreciate citations:
"Improving Distributional Similarity with Lessons Learned from Word Embeddings"
Omer Levy, Yoav Goldberg, and Ido Dagan. TACL 2015.
## Requirements ##
Running hyperwords may require a lot of computational resources:
- disk space for independently pre-processing the corpus
- internal memory for loading sparse matrices
- significant running time; hyperwords is neither optimized nor multi-threaded
hyperwords assumes a *nix shell, and requires Python 2.7 (or later, excluding 3+) with the following packages installed:
numpy, scipy, sparsesvd.
## Quick-Start ##
1. Download the latest version from BitBucket, unzip, and make sure all scripts have running permissions (chmod 755 *.sh).
2. Download a text corpus of your choice.
3. To create word vectors...
* ...with SVD over PPMI, use: *corpus2svd.sh*
* ...with SGNS (skip-grams with negative sampling), use: *corpus2sgns.sh*
4. The vectors should be available in textual format under <output_path>/vectors.txt
To explore the list of hyperparameters, use the *-h* or *--help* option.
##Pipeline##
The following figure shows the hyperwords' pipeline:
**DATA:** raw corpus => corpus => pairs => counts => vocab
**TRADITIONAL:** counts + vocab => pmi => svd
**EMBEDDINGS:** pairs + vocab => sgns
**raw corpus => corpus**
- *scripts/clean_corpus.sh*
- Eliminates non-alphanumeric tokens from the original corpus.
**corpus => pairs**
- *corpus2pairs.py*
- Extracts a collection of word-context pairs from the corpus.
**pairs => counts**
- *scripts/pairs2counts.sh*
- Aggregates identical word-context pairs.
**counts => vocab**
- *counts2vocab.py*
- Creates vocabularies with the words' and contexts' unigram distributions.
**counts + vocab => pmi**
- *counts2pmi.py*
- Creates a PMI matrix (*scipy.sparse.csr_matrix*) from the counts.
**pmi => svd**
- *pmi2svd.py*
- Factorizes the PMI matrix using SVD. Saves the result as three dense numpy matrices.
**pairs + vocab => sgns**
- *word2vecf/word2vecf*
- An external program for creating embeddings with SGNS. For more information, see:
**"Dependency-Based Word Embeddings". Omer Levy and Yoav Goldberg. ACL 2014.**
An example pipeline is demonstrated in: *example_test.sh*
##Evaluation##
hyperwords also allows easy evaluation of word representations on two tasks: word similarity and analogies.
**Word Similarity**
- *hyperwords/ws_eval.py*
- Compares how a representation ranks pairs of related words by similarity versus human ranking.
- 5 readily-available datasets
**Analogies**
- *hyperwords/analogy_eval.py*
- Solves analogy questions, such as: "man is to woman as king is to...?" (answer: queen).
- 2 readily-available datasets
- Shows results of two analogy recovery methods: 3CosAdd and 3CosMul. For more information, see:
**"Linguistic Regularities in Sparse and Explicit Word Representations". Omer Levy and Yoav Goldberg. CoNLL 2014.**
These programs assume that the representation was created by hyperwords, and can be loaded by
*hyperwords.representations.embedding.Embedding*. Dense vectors in textual format (such as the ones produced by word2vec
and GloVe) can be converted to hyperwords' format using *hyperwords/text2numpy.py*.