Automatic Sense Disambiguation of Potentially Idiomatic Expressions

This is the source code for a system that automatically disambiguates potentially idiomatic expressions (PIEs, for short) in text. It implements four methods: a baseline most-frequent-sense method, a baseline canonical form-based method (Fazly et al., 2009), a lexical cohesion graph-based method (Sporleder & Li, 2009), and a variation on that method which uses literal representations of idioms' figurative senses. The methods are evaluated on a combination of four corpora: the VNC-Tokens corpus, the IDIX corpus, the PIE Corpus, and the SemEval-2013 Task 5b dataset. For a detailed description of the systems, see our LAW-MWE-CxG paper.

Requirements

To run this code, you'll need the following Python setup:

  • Python 2.7.6
  • beautifulsoup4 4.5.1
  • numpy 1.14.0
  • scipy 0.19.1
  • spacy 2.0.6 + en_core_web_sm 2.0.0

Different versions might work just as well, but cannot be guaranteed.
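
Because other versions are untested, it can help to compare your environment against the list above before running anything. The following is a minimal sketch, not part of the repository: check_versions and PINNED are hypothetical names, and the check assumes each package exposes a __version__ attribute. It does not cover the en_core_web_sm model.

```python
from __future__ import print_function

import importlib

# Versions from the Requirements list; keys are import names
# (beautifulsoup4 is imported as bs4).
PINNED = {
    "bs4": "4.5.1",
    "numpy": "1.14.0",
    "scipy": "0.19.1",
    "spacy": "2.0.6",
}


def check_versions(pinned):
    """Return (module, expected, found) for every module that does not match.

    found is None when the module cannot be imported or has no __version__.
    """
    mismatches = []
    for name, expected in sorted(pinned.items()):
        try:
            module = importlib.import_module(name)
            found = getattr(module, "__version__", None)
        except ImportError:
            found = None
        if found != expected:
            mismatches.append((name, expected, found))
    return mismatches


if __name__ == "__main__":
    for name, expected, found in check_versions(PINNED):
        print("%s: expected %s, found %s" % (name, expected, found or "not found"))
```

An empty result means the four listed packages match; any printed line only flags a difference and does not by itself mean the system will fail.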

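You'll also need copies of the BNC, the GloVe embeddings, and the four evaluation corpora (VNC-Tokens, IDIX, PIE Corpus, and SemEval-2013 Task 5b); Getting Started explains where to place them.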

Getting Started

  • Clone the repository
  • Create subdirectories called working and ext
  • Add these symlinks (or edit config.py):
    • create a symlink ext/BNC to the Texts directory of your copy of the BNC
    • create a symlink ext/glove to the directory containing the GloVe embeddings
    • create symlinks ext/VNC, ext/IDIX, ext/PIE_Corpus, and ext/SemEval to the main directory of the respective corpora
  • Try running the system with python psd.py -c 0 -m cg -gs 0s. This should run a basic lexical cohesion graph method and evaluate it on the development set of the combined corpora.
  • Get an overview of all options by running python psd.py --help
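
The directory and symlink steps above can be sketched as a small helper. This is an illustration only: setup_layout and SOURCES are hypothetical names, not part of the repository, and every path in SOURCES is a placeholder to be replaced with the real location on your machine.

```python
import os

# Placeholder targets (assumptions): point these at your own copies.
SOURCES = {
    "BNC": "/path/to/BNC/Texts",
    "glove": "/path/to/glove",
    "VNC": "/path/to/VNC-Tokens",
    "IDIX": "/path/to/IDIX",
    "PIE_Corpus": "/path/to/PIE_Corpus",
    "SemEval": "/path/to/SemEval",
}


def setup_layout(repo_root, sources):
    """Create working/ and ext/ under repo_root and symlink each resource.

    sources maps a link name under ext/ (e.g. "BNC") to its target directory.
    Links that already exist are left untouched. Returns the names created.
    """
    ext_dir = os.path.join(repo_root, "ext")
    for directory in (os.path.join(repo_root, "working"), ext_dir):
        if not os.path.isdir(directory):
            os.makedirs(directory)

    created = []
    for link_name, target in sorted(sources.items()):
        link_path = os.path.join(ext_dir, link_name)
        if os.path.lexists(link_path):
            continue  # keep whatever the user already set up
        os.symlink(os.path.abspath(target), link_path)
        created.append(link_name)
    return created
```

Calling setup_layout(".", SOURCES) from the repository root would produce the layout described above. Editing config.py instead, as the list notes, avoids the symlinks entirely.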

Contact

For any questions about (running) the system, feel free to contact me.
