Support for converting Dependency Parsing corpus in .conllx format
hankcs committed Mar 2, 2018
1 parent 23cd94a commit 1ef70b2
Showing 3 changed files with 112 additions and 52 deletions.
3 changes: 2 additions & 1 deletion .gitignore
ENV/
# mypy
.mypy_cache/

.idea
*.jar
109 changes: 58 additions & 51 deletions README.md
# TreebankPreprocessing
Python scripts preprocessing [Penn Treebank (PTB)](https://catalog.ldc.upenn.edu/ldc99t42) and [Chinese Treebank 5.1 (CTB)](https://catalog.ldc.upenn.edu/LDC2005T01). They can convert treebanks to:

| Corpus | Format | Description |
| --- | --- | --- |
| constituency parse tree | `.txt` | one line per sentence |
| dependency parse tree | `.conllx` | [Basic Stanford Dependencies (SD)](https://nlp.stanford.edu/software/stanford-dependencies.shtml) |
| part-of-speech tagging corpus | `.tsv` | first column for words, second column for tags; sentences separated by a blank line |
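
For concreteness, here is roughly what one sentence looks like in the `.tsv` and `.conllx` outputs (a hand-made illustration based on the classic Stanford Dependencies example sentence, not actual output of these scripts):

```text
# .tsv: word, tab, tag; sentences separated by a blank line
Ms.      NNP
Haag     NNP
plays    VBZ
Elianti  NNP
.        .

# .conllx: ten tab-separated CoNLL-X columns per token (shown space-aligned)
1   Ms.      _   NNP   NNP   _   2   nn      _   _
2   Haag     _   NNP   NNP   _   3   nsubj   _   _
3   plays    _   VBZ   VBZ   _   0   root    _   _
4   Elianti  _   NNP   NNP   _   3   dobj    _   _
5   .        _   .     .     _   3   punct   _   _
```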


When designing a tagger or parser, preprocessing treebanks is a tedious chore. We need to:

- Split the dataset into train/dev/test sets, following the conventional splits.
- Remove the XML tags inside CTB.
- Combine the multiline bracketed files into one file, one line per sentence (a minimal sketch of this step follows the list).
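
For the combine step, here is a minimal sketch of the idea using NLTK (the corpus path and file pattern are illustrative assumptions, not these scripts' actual code):

```python
# Sketch: flatten multiline bracketed parses into one-line-per-sentence text.
from nltk.corpus import BracketParseCorpusReader

reader = BracketParseCorpusReader('corpus/wsj', r'.*\.mrg')  # assumed layout
with open('combined.txt', 'w') as out:
    for tree in reader.parsed_sents():
        # an effectively infinite margin stops pformat() from wrapping the tree
        out.write(tree.pformat(margin=999999) + '\n')
```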

I wondered why there were no open-source tools for this tedious work, so I finally decided to write one myself. Hopefully it will save you some time.

### Required software

- Python3
- NLTK
- Stanford Parser (for converting the dependency parsing and POS tagging corpora)

## Overview

What kind of task can we perform on treebanks?

### Part-Of-Speech Tagging

As per Collins (2002) and Choi (2016), splits are:
- **PTB** Training: 0-18. Development: 19-21. Test: 22-24.

### Phrase Structure Parsing
These scripts can also convert treebanks into the conventional data setup of Chen and Manning (2014) and Dyer et al. (2015). The detailed splits are listed below (a small sketch mapping PTB sections to splits follows the list):

- **PTB** Training: 02-21. Development: 22. Test: 23.
- **CTB** Training: 001–815, 1001–1136. Development: 886–931, 1148–1151. Test: 816–885, 1137–1147.
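
As a sanity check, the PTB conventions above amount to something like the following sketch (my own illustration, not code from these scripts; the CTB split goes by file id ranges rather than sections):

```python
# Map a WSJ section number (0-24) to its split under the two conventions above.
def wsj_split(section: int, task: str = 'par') -> str:
    if task == 'pos':              # Collins (2002) / Choi (2016)
        if section <= 18:
            return 'train'
        if section <= 21:
            return 'dev'
        return 'test'              # sections 22-24
    # phrase structure parsing: Chen and Manning (2014), Dyer et al. (2015)
    if 2 <= section <= 21:
        return 'train'
    if section == 22:
        return 'dev'
    if section == 23:
        return 'test'
    return 'unused'                # sections 00, 01 and 24
```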

### Dependency Parsing

You will need the Stanford Parser to convert phrase structure trees into dependency parse trees. Please download [Stanford Parser Version 3.3.0](https://nlp.stanford.edu/software/stanford-parser-full-2013-11-12.zip) and place its two jars in this folder:

```
TreebankPreprocessing
├── ...
├── stanford-parser-3.3.0-models.jar
└── stanford-parser.jar
```
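
For reference, these jars are what `tb_to_stanford.py` (added in this commit) shells out to. Run from this folder so that `-cp "*"` picks up both jars, a single file of bracketed trees can be converted by hand roughly like this (`trees.txt` and `trees.conllx` are placeholder names):

```
$ java -cp "*" -mx1g edu.stanford.nlp.trees.EnglishGrammaticalStructure \
    -basic -keepPunct -conllx -treeFile trees.txt > trees.conllx
```

For Chinese, the script substitutes `edu.stanford.nlp.trees.international.pennchinese.ChineseGrammaticalStructure` for the English class.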

OK, let's do it on the fly.

## PTB



### 1. Import PTB into NLTK

This script does all the work for you; it only requires a path to store the output.

```text
$ python3 ptb.py --help
usage: ptb.py [-h] --output OUTPUT [--task TASK]

Combine Penn Treebank WSJ MRG files into train/dev/test set

optional arguments:
  -h, --help       show this help message and exit
  --output OUTPUT  The folder where to store the output
                   train.txt/dev.txt/test.txt
  --task TASK      Which task (par, pos)? Use par for phrase structure
                   parsing, pos for part-of-speech tagging
```
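
E.g., combining the whole corpus takes a single command (progress lines trimmed; sentence counts from a sample run):

```text
$ python3 ptb.py --output ptb-combined
Importing ptb from nltk
Generating ptb-combined/train.txt
39832 sentences.
Generating ptb-combined/dev.txt
1700 sentences.
Generating ptb-combined/test.txt
2416 sentences.
```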

* If you want part-of-speech tagging corpora, simply append `--task pos`.
* Conversion to dependency parsing corpora is then done with `tb_to_stanford.py`:

```text
$ python3 tb_to_stanford.py --help
usage: tb_to_stanford.py [-h] --input INPUT --lang LANG --output OUTPUT

Convert combined Penn Treebank files (.txt) to Stanford Dependency format
(.conllx)

optional arguments:
  -h, --help       show this help message and exit
  --input INPUT    The folder containing train.txt/dev.txt/test.txt in
                   bracketed format
  --lang LANG      Which language? Use en for English, cn for Chinese
  --output OUTPUT  The folder where to store the output
                   train.conllx/dev.conllx/test.conllx in Stanford Dependency
                   format
```
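
For example, to convert the combined PTB from the previous step (the output folder name `ptb-dep` here is just an illustration):

```text
$ python3 tb_to_stanford.py --input ptb-combined --lang en --output ptb-dep
```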

## CTB

The CTB is a little messy: it contains extra XML tags in every gold tree and is not natively supported by NLTK. You need to specify the CTB root path (the folder containing index.html).

```
$ python3 ctb.py --help
usage: ctb.py [-h] --ctb CTB --output OUTPUT [--task TASK]

Combine Chinese Treebank 5.1 fid files into train/dev/test set

optional arguments:
  -h, --help       show this help message and exit
  --ctb CTB        The root path to Chinese Treebank 5.1
  --output OUTPUT  The folder where to store the output
                   train.txt/dev.txt/test.txt
  --task TASK      Which task (par, pos)? Use par for phrase structure
                   parsing, pos for part-of-speech tagging
```
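
E.g. (progress lines trimmed; sentence counts from a sample run):

```text
$ python3 ctb.py --ctb corpus/ctb5.1 --output ctb5.1-combined
Converting CTB: removing xml tags...
Importing to nltk...
Generating ctb5.1-combined/train.txt
16083 sentences.
Generating ctb5.1-combined/dev.txt
803 sentences.
Generating ctb5.1-combined/test.txt
1910 sentences.
```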

- POS tagging and dependency parsing corpora can then be converted just as for PTB.

Then you can start your research. Enjoy!

52 changes: 52 additions & 0 deletions tb_to_stanford.py
# -*- coding:utf-8 -*-
# Filename: tb_to_stanford.py
# Author: hankcs
# Date: 2017-11-22 12:26

import argparse
import tempfile
from os import system, remove, path
from os.path import join

from utility import make_sure_path_exists


def convert(src, dst, lang):
    # The Stanford tool appends one sentence at a time, so start from a clean file.
    if path.isfile(dst):
        remove(dst)
    # Choose the grammatical-structure converter for the requested language.
    java = 'edu.stanford.nlp.trees.EnglishGrammaticalStructure' if lang == 'en' \
        else 'edu.stanford.nlp.trees.international.pennchinese.ChineseGrammaticalStructure'
    print('Generating {}...'.format(dst))
    with open(src) as f:  # avoid shadowing the src argument
        lines = f.readlines()
    for n, line in enumerate(lines):
        if n % 10 == 0 or n == len(lines) - 1:
            # %c with 13 prints a carriage return, so the progress counter
            # overwrites itself in place.
            print("%c%.2f%%" % (13, (n + 1) / float(len(lines)) * 100), end='')
        if len(line.strip()) == 0:
            continue
        # Write each bracketed tree to a temp file, then let the Stanford Parser
        # convert it to Basic SD in CoNLL-X format, appending to the output file.
        file = tempfile.NamedTemporaryFile()
        tmp = file.name
        with open(tmp, 'w') as out:
            out.write(line)
        system(
            'java -cp "*" -mx1g {} -basic -keepPunct -conllx '
            '-treeFile "{}" >> "{}"'.format(java, tmp, dst))
    print()


if __name__ == '__main__':
parser = argparse.ArgumentParser(
description='Convert combined Penn Treebank files (.txt) to Stanford Dependency format (.conllx)')
parser.add_argument("--input", required=True,
help='The folder containing train.txt/dev.txt/test.txt in bracketed format')
parser.add_argument("--lang", required=True, help='Which language? Use en for English, cn for Chinese')
parser.add_argument("--output", required=True, dest="output",
help='The folder where to store the output train.conllx/dev.conllx/test.conllx in Stanford '
'Dependency format')

args = parser.parse_args()
make_sure_path_exists(args.output)
for f in ['train', 'dev', 'test']:
convert(join(args.input, f + '.txt'), join(args.output, f + '.conllx'), args.lang)
