A simple rule-based data augmentation scheme aimed at encouraging generalization in sequence-to-sequence models.
Jacob Andreas, ACL 2020. https://arxiv.org/abs/1904.09545
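At a high level, the scheme synthesizes new training examples by swapping fragments that are attested in shared environments. The toy sketch below illustrates that substitution step under a strong simplification (fragments are single tokens, and a "shared environment" is a pair of sequences identical except at one position); it is not the repo's implementation, and all names and data are illustrative.

```python
# Toy illustration of the substitution idea (not the repo's implementation):
# if fragments a and b both occur in some shared environment, then b may be
# substituted wherever a occurs.
from itertools import combinations

def single_token_diff(x, y):
    """Return (i, a, b) if x and y differ at exactly one position, else None."""
    if len(x) != len(y):
        return None
    diffs = [(i, a, b) for i, (a, b) in enumerate(zip(x, y)) if a != b]
    return diffs[0] if len(diffs) == 1 else None

def augment(dataset):
    """Synthesize new sequences by swapping interchangeable fragments."""
    # Collect fragment pairs attested in a shared environment.
    pairs = set()
    for x, y in combinations(dataset, 2):
        diff = single_token_diff(x, y)
        if diff is not None:
            _, a, b = diff
            pairs.update({(a, b), (b, a)})
    # Apply each substitution everywhere its source fragment occurs.
    new = set()
    for seq in dataset:
        for a, b in pairs:
            if a in seq:
                new.add(tuple(b if tok == a else tok for tok in seq))
    return [list(s) for s in new if list(s) not in dataset]

data = [
    ["she", "picks", "the", "wug", "up"],
    ["she", "picks", "the", "daxy", "up"],
    ["the", "wug", "sleeps"],
]
print(augment(data))  # [['the', 'daxy', 'sleeps']]
```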
Data:

- Semantic parsing dataset (Finegan-Dollak et al.): https://github.com/jkkummerfeld/text2sql-data
- SCAN dataset (Lake and Baroni): https://github.com/brendenlake/SCAN
- Language modeling: under `data/lm`. All data is from Wikipedia, except for the Na data, which is derived from "Cross-Lingual Word Embeddings for Low-Resource Language Modeling" (Adams et al. 2017). Note the different train/test splits for Na.
- Human sequence-to-sequence learning (Lake, Linzen and Baroni): Fig. 2 in https://arxiv.org/pdf/1901.04587.pdf
- COGS dataset (Kim and Linzen): https://github.com/najoungkim/COGS
To use on a new dataset:

- Point `torchdec` at https://github.com/jacobandreas/torchdec.
- Create a new data loader under `data` (look at `data/colors.py` for a minimal example; a sketch of the general shape follows this list).
- Update `get_dataset` in `train.py` to use the new loader.
- Run the experiment pipeline (look at `exp/scan_jump/retrieval/run.sh` for an example).
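The exact loader interface is defined by the existing modules under `data` (`data/colors.py` is the smallest); the sketch below only illustrates the general shape such a loader might take. The file layout, class name, and fields are hypothetical.

```python
# Hypothetical sketch of a minimal data loader (all names and the TSV layout
# are illustrative; mirror data/colors.py for the interface train.py expects).

def load_split(path):
    """Read tab-separated (input, output) pairs, one example per line."""
    examples = []
    with open(path) as f:
        for line in f:
            inp, out = line.rstrip("\n").split("\t")
            examples.append((inp.split(), out.split()))
    return examples

class MyDataset:
    """Container for the train/val/test splits of a new dataset."""
    def __init__(self, root):
        self.train = load_split(f"{root}/train.tsv")
        self.val = load_split(f"{root}/val.tsv")
        self.test = load_split(f"{root}/test.tsv")
```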
The `wug_size` and `wug_count` flags (defined in `data/builder.py`) determine the size and number of the fragments that will be extracted from each template. The `template_sim` flag determines whether the whole string or a fixed-size window is used for evaluating template similarity; `sim_window_size` determines the window size. The number and diversity of generated templates can be further controlled with the `variants` and `n_sample` flags.
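As a quick reference, the sketch below restates these flags as an argparse parser. The defaults, types, and the exact semantics of `template_sim` are assumptions, not read from the repo; `data/builder.py` holds the authoritative definitions.

```python
# Schematic restatement of the flags described above (placeholder defaults
# and types; data/builder.py is authoritative).
import argparse

parser = argparse.ArgumentParser(description="GECA augmentation flags (sketch)")
parser.add_argument("--wug_size", type=int, default=1,
                    help="size of each fragment extracted from a template")
parser.add_argument("--wug_count", type=int, default=1,
                    help="number of fragments extracted from each template")
parser.add_argument("--template_sim", action="store_true",
                    help="use a fixed-size window (rather than the whole "
                         "string) for template similarity; direction assumed")
parser.add_argument("--sim_window_size", type=int, default=2,
                    help="window size used for windowed template similarity")
parser.add_argument("--variants", type=int, default=1,
                    help="controls the number of generated templates")
parser.add_argument("--n_sample", type=int, default=100,
                    help="controls the diversity of generated templates")

print(parser.parse_args(["--wug_size", "2", "--template_sim"]))
```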