code for building dynamic contextual distributional semantic models
space
ReadMe
builder
counter.py
numcounter
numcounter.py
numcounter.pyc
numfrequer.py
numfrequer.pyc
numtaler.py
numtaler.pyc
numturner.py
numturner.pyc
replacer
replacer.py
replacer.pyc
runner
runner.py
stripper.py
stripper.pyc
talturner.py

ReadMe

This package of Python scripts will build a language model for generating conceptually sensitive conceptual spaces, and also provides a tool for projecting subspaces from the language model in which contextualised clusterings of relevant words can be discovered.  PLEASE NOTE that building the model can take up quite a bit of space.  Trained on the English-language Wikipedia, for instance, the model will take up about 40 GB of hard drive space, and, as configured, the software will consume about 15 GB of RAM in the process of building the model.  Time and space requirements will scale roughly logarithmically with the size, in word tokens, of the input corpus.

To build a new model, run the "builder" script from within the "model" directory, for instance by entering the command "python builder" in a terminal opened in the "model" directory.  The software will ask for a file name as input (including the path if the file has not also been moved into the "model" directory).  This input file should be a plain text file containing the text of the corpus you would like to train over (in either ASCII or Unicode format).  On a corpus the scale of Wikipedia (i.e., on the order of a couple of billion word tokens), the model builder will take several hours to run.  It will save a number of new, in some cases very large, files to the "model/space" directory, and will also save a very large number of files, possibly several million, to the "model/space/contexts" directory.  Please open these files and the "contexts" folder with caution: many text editors and operating systems will not handle them well, and opening the files manually is not necessary for running the software.
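Since a build over a large corpus takes several hours, it can be worth checking the input file first.  A minimal sketch of a token count for a plain text corpus file (the file name and the whitespace-based token heuristic here are illustrative, not part of the package):

```python
import re

def count_word_tokens(path, encoding="utf-8"):
    """Count whitespace-delimited word tokens in a plain text corpus file."""
    total = 0
    with open(path, encoding=encoding) as f:
        for line in f:
            total += len(re.findall(r"\S+", line))
    return total

# Illustrative usage with a tiny stand-in corpus file:
with open("sample_corpus.txt", "w", encoding="utf-8") as f:
    f.write("the quick brown fox jumps over the lazy dog\n")

print(count_word_tokens("sample_corpus.txt"))  # 9
```

A count on this order of a couple of billion is what the Wikipedia-scale timing estimate above assumes.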

Once the language model is built, it can be explored by running the script "runner" (e.g., by typing "python runner" into a terminal in the "model" directory).  The runner application will ask you to input a series of words, which will be used to select the dimensions required to project a conceptually informed subspace from the overall model.  Simply hitting the enter key at the prompt without entering any text will initiate the projection.  You will then be queried for the number of dimensions you would like to use to construct the space; in our experiments, spaces on the order of 20 to 500 dimensions have returned good results.  Once the dimensions for the subspace have been selected, you will be asked to indicate how many word-vectors the model should return; generally, something in the range of 20 to 50 is interesting.  The model will then return words picked out of the space using three different metrics: by smallest distance from a central point, by largest norm of the word vectors, and by smallest cosine difference (i.e., angle) from a central vector.  In general the first two sets of results will be of conceptual interest, though the cosine angle metric provides some interesting insight into the nature of the model as well.
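The three ranking metrics can be sketched in a few lines of plain Python.  The words, vectors, and function names below are illustrative toy data, not the package's actual data structures; note that smallest cosine difference from the central vector is equivalent to largest cosine similarity:

```python
import math

def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

def distance(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    """Cosine similarity; larger similarity means smaller angle."""
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))

def rank_words(vectors, centre, k=2):
    """Return the top-k words under each of the three metrics."""
    by_distance = sorted(vectors, key=lambda w: distance(vectors[w], centre))[:k]
    by_norm = sorted(vectors, key=lambda w: -norm(vectors[w]))[:k]
    by_cosine = sorted(vectors, key=lambda w: -cosine(vectors[w], centre))[:k]
    return by_distance, by_norm, by_cosine

# A toy 3-dimensional "subspace":
vecs = {"cat": [1.0, 0.9, 0.1], "dog": [0.9, 1.0, 0.2], "car": [0.1, 0.2, 2.0]}
centre = [1.0, 1.0, 0.0]
print(rank_words(vecs, centre))
```

On this toy data, "cat" and "dog" lead the distance and cosine rankings while the long vector "car" tops the norm ranking, mirroring how the first two metrics tend to surface conceptually related words while the norm metric surfaces the most strongly weighted ones.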