Clustering Code for Part-of-Speech (PoS) Tag Induction from Clark (2003)
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Failed to load latest commit information.


This code was used in the paper

  title={{Combining distributional and morphological information for part of speech induction}},
  author={Clark, A.},
  journal={Proceedings of 10th EACL},

That would be a good paper to cite if you use this code.

This is some code for part of speech induction.
Its buggy and it isn't very robust -- classic one user research code.
It doesn't deal well with unicode so you need to map it all to ascii.
It accepts data in one word per line format.

Use a command line like this:

cluster_neyessenmorph -s 5 -m 5 -i 10 data data 32 > clusters.nem.32

Each binary has some minimal info about what the arguments and options are.
Run it with no args to get them, but these are probably out of date so look at the source instead.

The output is a file with each line of the form

word class prob

Where class is the class label (0 ... 31), prob is the conditional prob p(w|class).

Edit the makefile to put the binaries somewhere sensible

The binaries are

cluster_baseline:  all the n-1 most frequent words in their own clusters and all the rest in the nth cluster.
cluster_merge: not relevant
cluster_neyessen:  Ney Essen clustering
cluster_traintwohmm: Trains a two level HMM. ie. a HMM where the output function is an interpolation between a multinomial over the set of observed words, and a HMM over strings of letters. 

cluster tag_twohmm: Tags a corpus given a trained two hmm.
cluster_neyessenmorph: Ney essen clustering plus a morphological term.

Any comments etc. welcome.

Alex Clark
2 August 2005 and Feb 12 2008