COALS

Synopsis

Correlated Occurrence Analogue to Lexical Semantics

Description

This implements the Correlated Occurrence Analogue to Lexical Semantics (COALS; Rohde, Gonnerman, & Plaut, 2005). COALS represents word meaning by means of high-dimensional vectors derived from word co-occurrence patterns in large corpora. The construction of these vectors starts by compiling a co-occurrence table using a ramped (or, alternatively, flat) n-word window. Rohde et al. use a 4-word, ramped window:

 1 2 3 4 [0] 4 3 2 1
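
That is, a word at distance d from the target contributes a count of wsize - d + 1 (with a flat window, every position would contribute 1). As a minimal sketch, assuming a dense co-occurrence matrix and pre-mapped token indices (neither of which necessarily reflects the tool's actual internals), ramped counting could look like this:

/* Minimal sketch, not the tool's actual code: accumulate ramped
 * co-occurrence counts over a tokenized corpus. ids[i] is the
 * row/column index of token i (or -1 if the word is not tracked);
 * counts is a hypothetical dense rows x cols matrix. With wsize = 4,
 * a neighbor at distance d contributes wsize - d + 1, matching the
 * 1 2 3 4 [0] 4 3 2 1 ramp above. */
void count_cooccurrences(const int *ids, long n, int wsize,
        double *counts, int rows, int cols)
{
    for (long i = 0; i < n; i++) {
        if (ids[i] < 0 || ids[i] >= rows)
            continue; /* target word not among the rows */
        for (int d = 1; d <= wsize; d++) {
            double w = wsize - d + 1.0; /* ramped weight */
            if (i - d >= 0 && ids[i - d] >= 0 && ids[i - d] < cols)
                counts[ids[i] * cols + ids[i - d]] += w;
            if (i + d < n && ids[i + d] >= 0 && ids[i + d] < cols)
                counts[ids[i] * cols + ids[i + d]] += w;
        }
    }
}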

Next, all but the m columns (14,000 in the case of Rohde et al.) that correspond to the most frequent words are discarded. Co-occurrence counts are then converted to word-pair correlations; negative correlations are set to zero, and positive correlations are replaced by their square root in order to reduce the differences between them. The rows of the co-occurrence table then constitute the COALS vectors for their respective words. Optionally, Singular Value Decomposition (SVD) can be used to reduce the dimensionality of these vectors, and the reduced vectors can, in turn, be converted to binary vectors by setting negative components to zero and positive components to one.
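
The correlation in question is the Pearson-style correlation that Rohde et al. (2005) compute for each cell from the cell itself, its row and column totals, and the grand total. A minimal sketch of the normalization, again assuming a dense matrix layout rather than the tool's actual data structures:

#include <math.h>
#include <stdlib.h>

/* Minimal sketch, not the tool's actual code: convert raw
 * co-occurrence counts to COALS values. Each cell w is replaced by
 *
 *     (T * w - rt[a] * ct[b])
 *       / sqrt(rt[a] * (T - rt[a]) * ct[b] * (T - ct[b]))
 *
 * where rt/ct are row/column totals and T is the grand total;
 * negative correlations are then set to zero and positive ones are
 * replaced by their square root. */
void normalize_counts(double *m, int rows, int cols)
{
    double T = 0.0;
    double *rt = calloc(rows, sizeof(double)); /* row totals */
    double *ct = calloc(cols, sizeof(double)); /* column totals */
    for (int a = 0; a < rows; a++)
        for (int b = 0; b < cols; b++) {
            rt[a] += m[a * cols + b];
            ct[b] += m[a * cols + b];
            T     += m[a * cols + b];
        }
    for (int a = 0; a < rows; a++)
        for (int b = 0; b < cols; b++) {
            double w   = m[a * cols + b];
            double num = T * w - rt[a] * ct[b];
            double den = sqrt(rt[a] * (T - rt[a])
                * ct[b] * (T - ct[b]));
            double r = den > 0.0 ? num / den : 0.0;
            /* discard negative, compress positive correlations */
            m[a * cols + b] = r > 0.0 ? sqrt(r) : 0.0;
        }
    free(rt);
    free(ct);
}

Binarization of SVD-reduced vectors then amounts to a second pass that maps negative components to zero and positive components to one.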

The implementation also supports an extension to the COALS model proposed by Chang et al. (2012).

  • Chang, Y., Furber, S., & Welbourne, S. (2012). Generating Realistic Semantic Codes for Use in Neural Network Models. In Miyake, N., Peebles, D., & Cooper, R. P. (Eds.), Proceedings of the 34th Annual Meeting of the Cognitive Science Society (CogSci 2012).

  • Rohde, D. L. T., Gonnerman, L. M., & Plaut, D. C. (2005). An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence.

Disclaimer

I released the code "as is"; that is, in the state in which I used it for the modeling work in my PhD thesis. As such, it does what it has to do, but there is still a lot of room for improvement.

Usage

$ ./coals

COALS version 0.99-beta
Copyright (c) 2012-2014 Harm Brouwer <me@hbrouwer.eu>
Center for Language and Cognition, University of Groningen
Netherlands Organisation for Scientific Research (NWO)

usage ./coals [options]

  constructing COALS vectors:
    --wsize <num>       set co-occurrence window size to <num>
    --wtype <type>      use <type> window for co-occurrences (dflt: ramped)
    --rows <num>        set number of rows of the co-occurrence matrix to <num>
    --cols <num>        set number of cols of the co-occurrence matrix to <num>
    --dims <num>        reduce COALS vectors to <num> dimensions (using SVD)
    --vtype <type>      construct <type> (real/binary[_pn]) vectors (dflt: real)
    --unigrams <file>   read unigram counts (word frequencies) from <file>
    --ngrams <file>     read n-gram counts (co-occurrence freqs.) from <file>
    --output <file>     write COALS vectors to <file>
    --enforce <file>    enforce inclusion of words in <file>

    --pos_fts <num>     number of positive features (for binary_pn vectors)
    --neg_fts <num>     number of negative features (for binary_pn vectors)

  extracting similar words:
    --vectors <file>    compute similarities on basis of vectors in <file>
    --output <file>     write top-k similar word sets to <file>
    --topk <num>        extract top-<num> similar words for each word

  basic information for users:
    --help              shows this help message
    --version           shows version

Example

Here is an example of how to create 100-bit binary COALS vectors for the 15,000 most frequent words in a corpus, using a 4-word ramped window and the 14,000 most frequent features:

coals --wsize 4 --wtype ramped --rows 15000 --cols 14000 \
      --unigrams data/1-grams --ngrams data/9-grams \
      --vtype binary --dims 100 --output coals-svdb-100.model

The assumed input format for the unigram counts is:

1|unigram_1|f
1|unigram_2|f
.|.........|.
1|unigram_n|f

where f is an integer representing the frequency of the unigram. The assumed input format for the n-gram counts, in turn, is:

n|ngram_1|f
n|ngram_2|f
.|.......|...
n|ngram_n|f

where n denotes the size of the n-gram, and f its frequency.
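
As a minimal sketch of how such records could be parsed (hypothetical buffer size and handling, not necessarily the tool's actual parser), each line can be split on its first and last '|', since the n-gram field itself may contain spaces:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Minimal sketch, not the tool's actual code: read "n|ngram|f"
 * records, one per line, splitting on the first and last '|'. */
void read_counts(FILE *fp)
{
    char line[1024];
    while (fgets(line, sizeof(line), fp)) {
        char *p1 = strchr(line, '|');  /* after the size field */
        char *p2 = strrchr(line, '|'); /* before the frequency */
        if (!p1 || p2 == p1)
            continue; /* malformed line */
        *p1 = *p2 = '\0';
        int n = atoi(line);            /* n-gram size */
        char *ngram = p1 + 1;          /* the n-gram itself */
        long f = atol(p2 + 1);         /* its frequency */
        printf("%d-gram \"%s\" occurs %ld times\n", n, ngram, f);
    }
}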

Once you have constructed COALS vectors, you can use them to construct lists of the top $k$ most similar words. The following command computes, for each word, the 25 most similar words on the basis of the 100-bit binary COALS vectors constructed by the previous command:

coals --vectors coals-svdb-100.model --output top-similar.txt --topk 25
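
Under the hood, this amounts to scoring every pair of vectors with a similarity measure and keeping the <num> highest-scoring words per row. The measure itself is not documented here; cosine similarity, as sketched below, is an illustrative assumption, and the tool may well use a different measure:

#include <math.h>

/* Minimal sketch (cosine similarity is an assumption, not the tool's
 * documented measure): score one vector against another; ranking the
 * scores per word and keeping the highest <num> yields --topk output. */
double cosine(const double *u, const double *v, int dims)
{
    double uv = 0.0, uu = 0.0, vv = 0.0;
    for (int i = 0; i < dims; i++) {
        uv += u[i] * v[i];
        uu += u[i] * u[i];
        vv += v[i] * v[i];
    }
    return (uu > 0.0 && vv > 0.0) ? uv / (sqrt(uu) * sqrt(vv)) : 0.0;
}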

Dependencies

COALS requires uthash and svdlibc.

License

COALS is available under the Apache License, Version 2.0.