
Data Augmentation scripts for the parser MaChAmp-TWG as part of my bachelor thesis titled "Data Augmentation for TWG Parsing via Syntactically Well-formed Nonsense Sentences".


Scripts

1_unimorph_to_conllu.py:

Translates the original UniMorph file into the language and format of the CoNLLU file.
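The conversion step can be sketched as follows. This is an illustrative example only: the feature-to-UPOS mapping and the column layout are assumptions about the script, based on the standard three-column UniMorph format (lemma, form, feature bundle) and the ten-column CoNLLU format.

```python
# Hypothetical sketch: turn a UniMorph triple into a CoNLL-U-style token line.
UPOS_MAP = {"N": "NOUN", "V": "VERB", "ADJ": "ADJ"}  # assumed, simplified mapping

def unimorph_to_conllu(line: str) -> str:
    lemma, form, feats = line.rstrip("\n").split("\t")
    upos = UPOS_MAP.get(feats.split(";")[0], "X")
    # CoNLL-U columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
    cols = ["1", form, lemma, upos, "_", feats, "_", "_", "_", "_"]
    return "\t".join(cols)

print(unimorph_to_conllu("run\tran\tV;PST"))
```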

2_improveUnimorph.py:

Looks up every verb from the UniMorph file on dictionary.com and categorizes it as 'transitive' or 'intransitive'. (This is very slow; a dictionary API would be preferable if one is available.)
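The categorization step could look roughly like this. The marker strings are assumptions about dictionary.com's part-of-speech labels ("verb (used with object)" / "verb (used without object)"), not a guaranteed description of how the actual script parses the site.

```python
# Illustrative sketch: classify a verb's transitivity from a dictionary
# entry's HTML by searching for the part-of-speech labels.
def classify_transitivity(entry_html: str) -> set:
    text = entry_html.lower()
    labels = set()
    if "verb (used with object)" in text:
        labels.add("transitive")
    if "verb (used without object)" in text:
        labels.add("intransitive")
    return labels

print(classify_transitivity("<p>verb (used with object)</p>"))
```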

3_improveRRG.py:

Looks up the transitivity of all verbs and adds it to the RRG CoNLLU file.
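A minimal sketch of how transitivity information could be merged into a CoNLLU token line. The choice of the MISC column and the label format are assumptions made for illustration; 3_improveRRG.py may store the information differently.

```python
# Hypothetical sketch: annotate verb tokens with a transitivity label
# taken from a lemma -> label lookup table.
def add_transitivity(token_line: str, lookup: dict) -> str:
    cols = token_line.rstrip("\n").split("\t")
    # CoNLL-U columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
    if len(cols) == 10 and cols[3] == "VERB":
        label = lookup.get(cols[2])  # look up by lemma
        if label:
            cols[9] = label if cols[9] == "_" else cols[9] + "|" + label
    return "\t".join(cols)

line = "3\tate\teat\tVERB\t_\tTense=Past\t0\troot\t_\t_"
print(add_transitivity(line, {"eat": "Transitivity=Tr"}))
```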

4_filterForTrain.py:

Filters out all unused words / lines in the RRG CoNLLU file.

generate.py:

Replaces words in the RRG CoNLLU file with randomly chosen ones from the UniMorph file.
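The replacement idea can be sketched as pooling candidate words by tag and swapping each replaceable token for a random same-tag word. The pools below contain invented nonsense words and the function name is hypothetical; only the tag codes follow the README's --tag list.

```python
import random

# Hypothetical word pools keyed by the README's tag codes.
POOLS = {
    "nS": ["flurbon", "quazzle", "dremp"],      # singular nouns (invented)
    "vPst": ["grimbled", "snorfed", "plived"],  # past-tense verbs (invented)
}

def replace_word(token: str, tag: str) -> str:
    # Swap the token for a random word with the same tag; if the tag has no
    # pool, keep the original token unchanged.
    candidates = POOLS.get(tag)
    return random.choice(candidates) if candidates else token

random.seed(42)
print(replace_word("house", "nS"))  # a random nonsense noun from the pool
print(replace_word("and", "conj"))  # unknown tag: token kept unchanged
```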

augment.py:

Replaces original words in the training file with random ones produced by the generate.py module, then augments the training file with the resulting new sentences.
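The augmentation loop can be sketched as keeping the base sentences and appending one rewritten copy of the file per extra pass, so that -s 2 doubles the file. The sentence representation (list of tokens) and the function names are assumptions for illustration.

```python
# Hypothetical sketch of the extension loop: extension_size - 1 passes over
# the base sentences, each appending rewritten copies.
def augment(sentences, extension_size, replace):
    assert extension_size >= 2, "extension size must be >= 2"
    augmented = list(sentences)
    for _ in range(extension_size - 1):  # one pass per extra copy
        for sent in sentences:
            augmented.append([replace(tok) for tok in sent])
    return augmented

base = [["the", "dog", "barked"]]
bigger = augment(base, 3, lambda tok: tok.upper())
print(len(bigger))  # 3: the original sentence plus two rewritten passes
```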

Requirements

  • Python 3.6 or newer
  • the modules listed in the requirements.txt file

Installation

  1. pip install -r requirements.txt
    
  2. Place all files and folders into the main directory of MaChAmp-TWG.

Input Parameters

Word Replacement Options

--unimorph0:

UniMorph word replacements; verb substitutions do not respect transitivity. In place of --unimorph1, --internal, --supertag, or --original.

--unimorph1:

UniMorph word replacements; verb substitutions respect transitivity. In place of --unimorph0, --internal, --supertag, or --original.

--internal:

Internal word replacements. In place of --unimorph0, --unimorph1, --supertag, or --original.

--supertag:

Internal supertag word replacements. In place of --unimorph0, --unimorph1, --internal, or --original.

--original:

Augmentation with unchanged original sentences. In place of --unimorph0, --unimorph1, --internal, or --supertag.

General Options

-h, --help:

Shows the help message and exits.

-i, --RRGinput:

(OPTIONAL) Filtered RRG file input. Default file: "rrgparbank/conllu/filtered_acc_en_conllu.conllu".

-o, --RRGoutput:

(OPTIONAL) Filtered RRG file output directory. Default directory: "rrgparbank/conllu".

-t, --tag:

Word tags to replace. See the list of available tags below.

-ti, --trainInput:

(OPTIONAL) train.supertags file input. Default file: "experiments/rrgparbank-en/base/train.supertags".

-to, --trainOutput:

(OPTIONAL) train.supertags file output directory. If the directory is not specified, the default directory is used and the filename changes to "new_train.supertags".

-s, --extensionSize:

Extension size of the resulting training file. Must be >= 2: a value of 2 doubles the number of sentences in the base training file, which corresponds to one pass through the file (the number of passes is the -s value minus 1). For example, -s 3 triples the file via two passes.

Available tags (--tag) for replacement task (not for --supertag)

nS: Noun Singular
nP: Noun Plural

aPoss: Adjective Possessive
aCmpr: Adjective Comparative
aSup: Adjective Superlative

vPst: Verb Past Tense
vPresPart: Verb Present Tense, Participle Form
vPstPart: Verb Past Tense, Participle Form

adv (for --internal only): Adverb
advInt (for --internal only): Adverb, Pronominal type: Interrogative
advSup (for --internal only): Adverb Superlative
advCmpr (for --internal only): Adverb Comparative

noun: All nouns
adj: All adjectives
verb: All verbs
all: All available tags

Usage

augment.py [-h] [--unimorph0] [--unimorph1] [--internal] [--supertag] [--original]
[-i RRGINPUT] [-o RRGOUTPUT] [-t TAG] [-ti TRAININPUT] [-to TRAINOUTPUT] -s EXTENSIONSIZE

Example 1:

python augment.py --unimorph0 --tag all --extensionSize 2

or

python augment.py --unimorph0 -t all -s 2



Example 2:

python augment.py --supertag --extensionSize 10

or

python augment.py --supertag -s 10

Sources

Tatiana Bladier, Kilian Evang, Valeria Generalova, Zahra Ghane, Laura Kallmeyer, Robin Möllemann, Natalia Moors, Rainer Osswald, and Simon Petitjean. 2022. RRGparbank: A Parallel Role and Reference Grammar Treebank. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 4833–4841, Marseille, France. European Language Resources Association.

Kilian Evang, Tatiana Bladier, Laura Kallmeyer, and Simon Petitjean. 2021. Bootstrapping Role and Reference Grammar Treebanks via Universal Dependencies. In Proceedings of the Fifth Workshop on Universal Dependencies (UDW, SyntaxFest 2021), pages 30–48, Sofia, Bulgaria. Association for Computational Linguistics.

Tatiana Bladier, Jakub Waszczuk, and Laura Kallmeyer. 2020. Statistical Parsing of Tree Wrapping Grammars. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6759–6766, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Laura Kallmeyer, Rainer Osswald, and Robert D. Van Valin Jr. 2013. Tree Wrapping for Role and Reference Grammar. In Glyn Morrill and Mark-Jan Nederhof, editors, Formal Grammar (FG 2012/2013), Lecture Notes in Computer Science, vol. 8036. Springer, Berlin, Heidelberg.

UniMorph
