diff --git a/README.md b/README.md
index a1cd25c6..dc4e23f5 100644
--- a/README.md
+++ b/README.md
@@ -10,25 +10,23 @@ nltk 3, lxml, simplejson and yaml python libraries. I recommend to use a python
 and install the packages in the virtual environment with `pip`:
 
 ```bash
-$ virtualenv --no-site-packages --distribute -p /usr/bin/python3 py3
-$ source py3/bin/activate
-(py3)$ pip install -I nltk==3.0.0
-(py3)$ pip install lxml simplejson pyyaml
+git clone git@github.com:mynlp/ccg2lambda.git
+virtualenv --no-site-packages --distribute -p /usr/bin/python3 py3
+source py3/bin/activate
+pip install lxml simplejson pyyaml -I nltk==3.0.0
 ```
 
 You also need to install WordNet:
 
 ```bash
-(py3)$ python
->>> import nltk
->>> nltk.download('wordnet')
+python -c "import nltk; nltk.download('wordnet')"
 ```
 
 To ensure that all software is working as expected, you can run the tests:
 
 ```bash
-(py3)$ cd ccg2lambda/
-(py3)$ python run_tests.py
+cd ccg2lambda/
+python run_tests.py
 ```
 (all tests should pass, except a few expected failures).
 
@@ -50,7 +48,118 @@ Then, compile the coq library that contains the axioms:
 $ coqc coqlib.v
 ```
 
-## Running the pipeline.
+## Using the Semantic Parser
+
+Let's assume that we have a file `sentences.txt` with one sentence per line,
+and that we want to semantically parse those sentences. Here is the content of
+my file:
+
+```
+All women ordered coffee or tea.
+Some woman did not order coffee.
+Some woman ordered tea.
+```
+
+And we want to obtain a symbolic semantic representation such as:
+
+```
+forall x. (woman(x) -> exists y. ((tea(y) \/ coffee(y)) /\ order(x, y)))
+exists x. (woman(x) /\ -exists y. (coffee(y) /\ order(x, y)))
+exists x. (woman(x) /\ exists y. (tea(y) /\ order(x, y)))
+```
+
+First we need to obtain the CCG derivations (parse trees) of the sentences
+in the text file using C&C and convert its XML format into Jigg's XML format:
+
+```bash
+cat sentences.txt | perl tokenizer.perl -l en 2>/dev/null > sentences.tok
+/path/to/candc-1.00/bin/candc --models /path/to/candc-1.00/models --candc-printer xml --input sentences.tok > sentences.candc.xml
+python candc2transccg.py sentences.candc.xml > sentences.xml
+```
+
+Then, we are ready to obtain the semantic representations by using semantic
+templates and the CCG derivations obtained above:
+
+```bash
+python semparse.py sentences.xml semantic_templates_en_emnlp2015.yaml sentences.sem.xml
+```
+
+The semantic representations are in the `sentences.sem.xml` file,
+where a new XML node `` has been added with as many child nodes
+as the CCG structure. Each semantic span has the logical representation
+obtained up to that span. The root span has the logical representation
+of the whole sentence. Here is an excerpt of the semantics XML node
+of the last sentence:
+
+```xml
+
+
+
+
+
+
+
+
+
+
+
+
+```
+
+The `sem` attribute contains the logical formulas, and the `type` attributes
+the types of the predicates (types only appear at the leaves).
+
+## Using a prover (Coq) for recognizing textual entailment
+
+We believe that the semantic representations above can be used
+for several NLP tasks. We have been using them so far
+for recognizing textual entailment. For this purpose,
+we assume that all sentences in the file are premises,
+except the last one, which is the conclusion.
+
+To build a theorem out of those logical representations,
+pipe it to a theorem prover (Coq) and judge the entailment
+relation, you can run the following command:
+
+```bash
+python prove.py sentences.sem.xml 2> graphdebug.html
+```
+
+That command will output `yes` (entailment relation - the conclusion
+can be proved given the premises), `no` (contradiction - the negated
+conclusion can be proved), `unknown` (otherwise).
+
+If the parsing process and theorem proving succeeded,
+graphdebug.html will have a graphical representation
+of the CCG trees, augmented with logical formulas at
+every node below the syntactic category. The script
+that pipes the theorem to Coq is also displayed at
+the bottom. If the semantic parsing or prover fails,
+graphdebug.html may contain plain debugging information
+(e.g. python error messages, etc.). Here is the `graphdebug.html`
+of the example above:
+
+Inline-style:
+![alt text](./doc/images/graphdebug.png "Visualization of semantic parser and prover")
+
+## Visualization
+
+It is also possible to visualize CCG trees, either before
+or after augmenting them with semantic representations.
+For example, to visualize the CCG trees only (without
+semantic representations):
+
+```bash
+python visualize.py sentences.xml > sentences.html
+```
+
+and then open the file `sentences.html` with your favourite web browser.
+You should be able to see something like this:
+
+Inline-style:
+![alt text](./doc/images/ccg_html.png "Visualization of CCG tree (without semantic representations)")
+
+## Running the RTE pipeline on FraCaS.
 
 First, you need to download the copy of
 [FraCaS provided by MacCartney and Manning (2007)](http://www-nlp.stanford.edu/~wcmac/downloads/fracas.xml):
diff --git a/doc/images/ccg_html.png b/doc/images/ccg_html.png
new file mode 100644
index 00000000..269ad0ad
Binary files /dev/null and b/doc/images/ccg_html.png differ
diff --git a/doc/images/graphdebug.png b/doc/images/graphdebug.png
new file mode 100644
index 00000000..c8a4c2ed
Binary files /dev/null and b/doc/images/graphdebug.png differ