word-sense-modeling

This repository contains code related to my master's thesis. Supplementary documents and code will be added in the future.

Annotate words with parallel corpus (python code)

More information will be added here.

Convert to word2vec (C code)

This tool converts text-based multi-sense word embeddings to the binary word2vec file format. The output file can be used with gensim or other word2vec toolkits. The program has been tested on multi-sense vectors produced by this code and by this paper by Neelakantan et al. (2014).

The text file can be in two different formats:

Default format:
<Total Number of words> <Dimensionality> <Number of senses per word> <Does it have the max number of senses per word? (1 or 0)>
<word> <no. of senses>
<global context vector>
<first sense vector>
<second sense vector>
...

For example, the text file can be:

2 3 2 1
bank 3
0.2 0.1 0.3
0.1 0.2 0.2
0.3 0.3 0.3
0.1 0.1 0.1
lemon 1
0.4 0.5 0.1
0.1 0.6 0.2
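
To make the layout concrete, here is a minimal Python sketch of a reader for the default format. It is not the C converter from this repository. The function name `parse_default_format` and the returned dictionary layout are hypothetical, and the sketch assumes that every entry holds exactly one global context vector followed by as many sense vectors as the entry line states, as in the example above. The third and fourth header fields are read but not interpreted.

```python
from typing import Dict, List, Tuple


def parse_default_format(text: str) -> Tuple[List[int], Dict[str, dict]]:
    """Parse the default text format (hypothetical helper, not the C tool)."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    header = [int(x) for x in rows[0]]
    dim = header[1]  # second header field: dimensionality
    words: Dict[str, dict] = {}
    i = 1
    while i < len(rows):
        word, n_senses = rows[i][0], int(rows[i][1])
        # Assumption: one global context vector, then n_senses sense vectors.
        block = [[float(x) for x in r] for r in rows[i + 1 : i + 2 + n_senses]]
        if len(block) != n_senses + 1 or any(len(v) != dim for v in block):
            raise ValueError(f"malformed entry for {word!r}")
        words[word] = {"global": block[0], "senses": block[1:]}
        i += 2 + n_senses
    return header, words
```

Called on the example above, this should give "bank" one global vector and three sense vectors, and "lemon" one global vector and one sense vector.
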
Second format:
<Total Number of words> <Dimensionality>
<word> <no. of senses>
<global context vector>
<first sense vector>
<first sense cluster center>
<second sense vector>
<second sense cluster center>
...

For example, the text file can be:

2 3 2 1
bank 3
0.2 0.1 0.3
0.1 0.2 0.2
0.11 0.2 0.21
0.3 0.3 0.3
0.301 0.301 0.299
0.1 0.1 0.1
0.09 0.09 0.098
lemon 1
0.4 0.5 0.1
0.1 0.6 0.2
0.101 0.602 0.201
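
The second format can be read the same way, except that each sense vector is followed by its cluster center. The sketch below is again a hypothetical Python helper (`parse_second_format` is not part of this repository). It uses only the first two header fields, because the format description lists two fields while the example header shows four, and it assumes each entry has one global context vector followed by alternating sense vectors and cluster centers.

```python
from typing import Dict, Tuple


def parse_second_format(text: str) -> Tuple[int, int, Dict[str, dict]]:
    """Parse the second text format (hypothetical helper, not the C tool)."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    # Only the first two header fields are used; any extra fields are ignored.
    n_words, dim = int(rows[0][0]), int(rows[0][1])
    words: Dict[str, dict] = {}
    i = 1
    while i < len(rows):
        word, n_senses = rows[i][0], int(rows[i][1])
        n_rows = 1 + 2 * n_senses  # global vector + (sense, center) pairs
        block = [[float(x) for x in r] for r in rows[i + 1 : i + 1 + n_rows]]
        if len(block) != n_rows or any(len(v) != dim for v in block):
            raise ValueError(f"malformed entry for {word!r}")
        words[word] = {
            "global": block[0],
            "senses": block[1::2],   # rows 1, 3, 5, ...
            "centers": block[2::2],  # rows 2, 4, 6, ...
        }
        i += 1 + n_rows
    return n_words, dim, words
```
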

In the output file, the global vector is stored under the exact word. Each word sense is stored as "word..<n>", where <n> is an index number. For example, if the model has 3 sense vectors for the word "bank", the output file will contain "bank..1", "bank..2" and "bank..3" as the 3 sense vectors, and "bank" as the global context vector. If the input was in the second format, with supporting cluster centroids, the centroids appear in the output file as "bank..1c", "bank..2c" and "bank..3c".
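
The naming scheme can be sketched in Python as follows. This is an illustration, not the C implementation: `output_names` and `write_word2vec_binary` are hypothetical helpers, the order in which entries are written is an assumption, and the writer assumes the usual word2vec binary layout (an ASCII header line with vocabulary size and dimensionality, then each name followed by a space, the vector as 32-bit floats, and a newline). Little-endian byte order is also an assumption.

```python
import struct
from typing import List, Sequence, Tuple


def output_names(word: str, n_senses: int, with_centers: bool = False) -> List[str]:
    """Return the entry names for one word: global vector, senses, centroids."""
    names = [word]  # the global context vector keeps the exact word
    for n in range(1, n_senses + 1):
        names.append(f"{word}..{n}")       # n-th sense vector
        if with_centers:
            names.append(f"{word}..{n}c")  # n-th cluster centroid
    return names


def write_word2vec_binary(
    path: str, entries: Sequence[Tuple[str, Sequence[float]]], dim: int
) -> None:
    """Write (name, vector) pairs in the assumed word2vec binary layout."""
    with open(path, "wb") as f:
        f.write(f"{len(entries)} {dim}\n".encode("utf-8"))
        for name, vec in entries:
            f.write(name.encode("utf-8") + b" ")
            # struct.pack raises struct.error if len(vec) != dim.
            f.write(struct.pack(f"<{dim}f", *vec))
            f.write(b"\n")
```

A file written this way should be loadable with gensim's `KeyedVectors.load_word2vec_format(path, binary=True)`, after which a sense vector can be looked up by a name such as "bank..2".
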
