Support for converting Dependency Parsing corpus in .conllx format
hankcs committed Mar 2, 2018
1 parent 23cd94a commit 1ef70b2
Showing 3 changed files with 112 additions and 52 deletions.
3 changes: 2 additions & 1 deletion .gitignore
ENV/
# mypy
.mypy_cache/

.idea
*.jar
109 changes: 58 additions & 51 deletions README.md
# TreebankPreprocessing
Python scripts preprocessing [Penn Treebank (PTB)](https://catalog.ldc.upenn.edu/ldc99t42) and [Chinese Treebank 5.1 (CTB)](https://catalog.ldc.upenn.edu/LDC2005T01). They can convert treebanks to:

| Corpus | Format | Description |
| --- | --- | --- |
| constituency parse tree | `.txt` | one line per sentence |
| dependency parse tree | `.conllx` | [Basic Stanford Dependencies (SD)](https://nlp.stanford.edu/software/stanford-dependencies.shtml) |
| part-of-speech tagging corpus | `.tsv` | first column for words, second column for tags; sentences separated by a blank line |
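
For concreteness, here is roughly what one sentence looks like in the `.tsv` and `.conllx` outputs (a hand-made illustration based on the classic Stanford Dependencies example sentence, not actual output of these scripts):

```text
# .tsv: word, tab, tag; sentences separated by a blank line
Ms.      NNP
Haag     NNP
plays    VBZ
Elianti  NNP
.        .

# .conllx: ten tab-separated CoNLL-X columns per token (shown space-aligned)
1   Ms.      _   NNP   NNP   _   2   nn      _   _
2   Haag     _   NNP   NNP   _   3   nsubj   _   _
3   plays    _   VBZ   VBZ   _   0   root    _   _
4   Elianti  _   NNP   NNP   _   3   dobj    _   _
5   .        _   .     .     _   3   punct   _   _
```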


When designing a tagger or parser, preprocessing treebanks is a tedious chore. We need to:

- Split the dataset into train/dev/test sets, following the conventional splits.
- Remove the XML tags inside CTB.
- Combine the multiline bracketed files into one file, one line per sentence (a minimal sketch of this step follows the list).
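
For the combine step, here is a minimal sketch of the idea using NLTK (the corpus path and file pattern are illustrative assumptions, not these scripts' actual code):

```python
# Sketch: flatten multiline bracketed parses into one-line-per-sentence text.
from nltk.corpus import BracketParseCorpusReader

reader = BracketParseCorpusReader('corpus/wsj', r'.*\.mrg')  # assumed layout
with open('combined.txt', 'w') as out:
    for tree in reader.parsed_sents():
        # an effectively infinite margin stops pformat() from wrapping the tree
        out.write(tree.pformat(margin=999999) + '\n')
```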

I wondered why there were no open-source tools for this tedious work, so I finally decided to write one myself. Hopefully it will save you some time.

### Required software

- Python3
- NLTK
- Stanford Parser (for converting the dependency parsing and POS tagging corpora)

## Overview

What kind of task can we perform on treebanks?

### Part-Of-Speech Tagging

As per Collins (2002) and Choi (2016), splits are:
- **PTB** Training: 0-18. Development: 19-21. Test: 22-24.

### Phrase Structure Parsing
These scripts can also convert treebanks into the conventional data setup of Chen and Manning (2014) and Dyer et al. (2015). The detailed splits are listed below (a small sketch mapping PTB sections to splits follows the list):

- **PTB** Training: 02-21. Development: 22. Test: 23.
- **CTB** Training: 001–815, 1001–1136. Development: 886–931, 1148–1151. Test: 816–885, 1137–1147.
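
As a sanity check, the PTB conventions above amount to something like the following sketch (my own illustration, not code from these scripts; the CTB split goes by file id ranges rather than sections):

```python
# Map a WSJ section number (0-24) to its split under the two conventions above.
def wsj_split(section: int, task: str = 'par') -> str:
    if task == 'pos':              # Collins (2002) / Choi (2016)
        if section <= 18:
            return 'train'
        if section <= 21:
            return 'dev'
        return 'test'              # sections 22-24
    # phrase structure parsing: Chen and Manning (2014), Dyer et al. (2015)
    if 2 <= section <= 21:
        return 'train'
    if section == 22:
        return 'dev'
    if section == 23:
        return 'test'
    return 'unused'                # sections 00, 01 and 24
```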

### Dependency Parsing

You will need the Stanford Parser to convert phrase structure trees into dependency parse trees. Please download [Stanford Parser Version 3.3.0](https://nlp.stanford.edu/software/stanford-parser-full-2013-11-12.zip) and place its two jars in this folder:

```
TreebankPreprocessing
├── ...
├── stanford-parser-3.3.0-models.jar
└── stanford-parser.jar
```
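
For reference, these jars are what `tb_to_stanford.py` (added in this commit) shells out to. Run from this folder so that `-cp "*"` picks up both jars, a single file of bracketed trees can be converted by hand roughly like this (`trees.txt` and `trees.conllx` are placeholder names):

```
$ java -cp "*" -mx1g edu.stanford.nlp.trees.EnglishGrammaticalStructure \
    -basic -keepPunct -conllx -treeFile trees.txt > trees.conllx
```

For Chinese, the script substitutes `edu.stanford.nlp.trees.international.pennchinese.ChineseGrammaticalStructure` for the English class.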

OK, let's do it on the fly.

## PTB



### 1. Import PTB into NLTK

This script does all the work for you; it only requires a path to store the output.

```text
$ python3 ptb.py --help
usage: ptb.py [-h] --output OUTPUT [--task TASK]

Combine Penn Treebank WSJ MRG files into train/dev/test set

optional arguments:
  -h, --help       show this help message and exit
  --output OUTPUT  The folder where to store the output
                   train.txt/dev.txt/test.txt
  --task TASK      Which task (par, pos)? Use par for phrase structure
                   parsing, pos for part-of-speech tagging
```
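
E.g., combining the whole corpus takes a single command (progress lines trimmed; sentence counts from a sample run):

```text
$ python3 ptb.py --output ptb-combined
Importing ptb from nltk
Generating ptb-combined/train.txt
39832 sentences.
Generating ptb-combined/dev.txt
1700 sentences.
Generating ptb-combined/test.txt
2416 sentences.
```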

* If you want part-of-speech tagging corpora, simply append `--task pos`.
* Conversion to dependency parsing corpora is then done with `tb_to_stanford.py`:

```text
$ python3 tb_to_stanford.py --help
usage: tb_to_stanford.py [-h] --input INPUT --lang LANG --output OUTPUT

Convert combined Penn Treebank files (.txt) to Stanford Dependency format
(.conllx)

optional arguments:
  -h, --help       show this help message and exit
  --input INPUT    The folder containing train.txt/dev.txt/test.txt in
                   bracketed format
  --lang LANG      Which language? Use en for English, cn for Chinese
  --output OUTPUT  The folder where to store the output
                   train.conllx/dev.conllx/test.conllx in Stanford Dependency
                   format
```
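
For example, to convert the combined PTB from the previous step (the output folder name `ptb-dep` here is just an illustration):

```text
$ python3 tb_to_stanford.py --input ptb-combined --lang en --output ptb-dep
```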

## CTB

The CTB is a little messy: it contains extra XML tags in every gold tree and is not natively supported by NLTK. You need to specify the CTB root path (the folder containing index.html).

```
$ python3 ctb.py --help
usage: ctb.py [-h] --ctb CTB --output OUTPUT [--task TASK]

Combine Chinese Treebank 5.1 fid files into train/dev/test set

optional arguments:
  -h, --help       show this help message and exit
  --ctb CTB        The root path to Chinese Treebank 5.1
  --output OUTPUT  The folder where to store the output
                   train.txt/dev.txt/test.txt
  --task TASK      Which task (par, pos)? Use par for phrase structure
                   parsing, pos for part-of-speech tagging
```
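
E.g. (progress lines trimmed; sentence counts from a sample run):

```text
$ python3 ctb.py --ctb corpus/ctb5.1 --output ctb5.1-combined
Converting CTB: removing xml tags...
Importing to nltk...
Generating ctb5.1-combined/train.txt
16083 sentences.
Generating ctb5.1-combined/dev.txt
803 sentences.
Generating ctb5.1-combined/test.txt
1910 sentences.
```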

- POS tagging and dependency parsing corpora can then be converted just as for PTB.

Then you can start your research. Enjoy!

52 changes: 52 additions & 0 deletions tb_to_stanford.py
# -*- coding:utf-8 -*-
# Filename: tb_to_stanford.py
# Author: hankcs
# Date: 2017-11-22 12:26

import argparse
import tempfile
from os import system, remove, path
from os.path import join

from utility import make_sure_path_exists


def convert(src, dst, lang):
    # The Stanford tool appends one sentence at a time, so start from a clean file.
    if path.isfile(dst):
        remove(dst)
    # Choose the grammatical-structure converter for the requested language.
    java = 'edu.stanford.nlp.trees.EnglishGrammaticalStructure' if lang == 'en' \
        else 'edu.stanford.nlp.trees.international.pennchinese.ChineseGrammaticalStructure'
    print('Generating {}...'.format(dst))
    with open(src) as f:  # avoid shadowing the src argument
        lines = f.readlines()
    for n, line in enumerate(lines):
        if n % 10 == 0 or n == len(lines) - 1:
            # %c with 13 prints a carriage return, so the progress counter
            # overwrites itself in place.
            print("%c%.2f%%" % (13, (n + 1) / float(len(lines)) * 100), end='')
        if len(line.strip()) == 0:
            continue
        # Write each bracketed tree to a temp file, then let the Stanford Parser
        # convert it to Basic SD in CoNLL-X format, appending to the output file.
        file = tempfile.NamedTemporaryFile()
        tmp = file.name
        with open(tmp, 'w') as out:
            out.write(line)
        system(
            'java -cp "*" -mx1g {} -basic -keepPunct -conllx '
            '-treeFile "{}" >> "{}"'.format(java, tmp, dst))
    print()


if __name__ == '__main__':
parser = argparse.ArgumentParser(
description='Convert combined Penn Treebank files (.txt) to Stanford Dependency format (.conllx)')
parser.add_argument("--input", required=True,
help='The folder containing train.txt/dev.txt/test.txt in bracketed format')
parser.add_argument("--lang", required=True, help='Which language? Use en for English, cn for Chinese')
parser.add_argument("--output", required=True, dest="output",
help='The folder where to store the output train.conllx/dev.conllx/test.conllx in Stanford '
'Dependency format')

args = parser.parse_args()
make_sure_path_exists(args.output)
for f in ['train', 'dev', 'test']:
convert(join(args.input, f + '.txt'), join(args.output, f + '.conllx'), args.lang)
