NoiseMix - data generation for natural language
Latest commit e93c843 May 26, 2018

NoiseMix

NoiseMix is a library for data generation for text datasets.

Data generation or augmentation with perturbations or distortions is a technique so successful in image tasks that support for it is included in major frameworks like PyTorch, TensorFlow and Keras.

However, data generation is not yet common for natural language tasks. (Why? Well, arguably it is more difficult to generate realistic noise for data with many discontinuities.) For more on how NoiseMix can increase performance on various tasks, see Benchmarks.

NoiseMix adds new rows to datasets by applying changes to copies of existing rows. For example, consider:

This was a great book, but their shipping was too slow.

After augmentation, we have the original row plus a few new versions:

This was a great book, but their shipping was too slow.
This was a great book but there shipping was too slow.
this was a great book, but their shipping was to slow.

Thus the generated dataset is at least twice as large as the original.
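
The idea can be sketched in a few lines of Python. This is only an illustration of "original rows plus noisy copies", not NoiseMix's actual implementation: the `augment` and `remove_punct` helpers are hypothetical names introduced here, and the two perturbations are deliberately simple and deterministic.

```python
import string


def remove_punct(text):
    """Drop ASCII punctuation characters from the text."""
    return text.translate(str.maketrans("", "", string.punctuation))


def augment(rows, perturbations):
    """Return the original rows followed by one noisy copy per perturbation."""
    out = list(rows)  # the original rows are always kept
    for row in rows:
        for perturb in perturbations:
            out.append(perturb(row))
    return out


rows = ["This was a great book, but their shipping was too slow."]
augmented = augment(rows, [str.lower, remove_punct])
for row in augmented:
    print(row)
```

With two perturbations per row, the output here has three rows for every input row.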

Installing

From PyPI:

pip install noisemix

Running

To generate the noisy data, call the program with the path to your data file and the format, for example:

  python -m noisemix train.ft.txt -format fastText

This will generate a data file with the added suffix .nmx that includes the original rows and new noisy rows.

To see all parameters:

python -m noisemix

Config

NoiseMix offers word-level and sentence-level perturbations.

The word-level perturbations are: add_letter, repeat_letter, remove_letter, lowercase, remove_punct, word_swap, char_swap, flip_letters and typo_qwerty. The sentence-level perturbations are: remove_space and flip_words.

Each perturbation can be enabled or disabled, repeated multiple times per line, and given a frequency.
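
To make the settings concrete, here is a minimal sketch of how three of the word-level perturbations and their per-perturbation options might fit together. Everything in it is an assumption for illustration: the function bodies, the `CONFIG` layout and the `enabled` / `repeat` / `frequency` key names are hypothetical and may not match NoiseMix's own code or parameter names (run `python -m noisemix` for the real ones).

```python
import random


def repeat_letter(word, rng):
    """Duplicate one randomly chosen character."""
    if not word:
        return word
    i = rng.randrange(len(word))
    return word[:i + 1] + word[i] + word[i + 1:]


def remove_letter(word, rng):
    """Delete one randomly chosen character (words of length 1 are kept)."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]


def char_swap(word, rng):
    """Swap one randomly chosen pair of adjacent characters."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


PERTURBATIONS = {
    "repeat_letter": repeat_letter,
    "remove_letter": remove_letter,
    "char_swap": char_swap,
}

# Hypothetical config: one entry per perturbation.
CONFIG = {
    "repeat_letter": {"enabled": True, "repeat": 1, "frequency": 0.2},
    "remove_letter": {"enabled": False, "repeat": 1, "frequency": 0.2},
    "char_swap": {"enabled": True, "repeat": 2, "frequency": 0.1},
}


def apply_word_noise(text, config, rng):
    """Apply each enabled perturbation `repeat` times; each pass hits a word
    with probability `frequency`."""
    words = text.split(" ")
    for name, opts in config.items():
        if not opts["enabled"]:
            continue
        fn = PERTURBATIONS[name]
        for _ in range(opts["repeat"]):
            words = [fn(w, rng) if rng.random() < opts["frequency"] else w
                     for w in words]
    return " ".join(words)


rng = random.Random(42)  # fixed seed so runs are reproducible
print(apply_word_noise("This was a great book, but their shipping was too slow.", CONFIG, rng))
```

Passing an explicit `random.Random` instance keeps the noise reproducible, which helps when comparing runs on a benchmark.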

Some perturbations are specific to certain languages, groups of languages or alphabets.

Supported data formats

To be usable for labelled datasets, NoiseMix keeps intact non-language formatting like labels and delimiters.

fastText (__label__ prefix) is the only supported format for now.
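
As a rough sketch of what keeping labels intact means for fastText-style rows, the snippet below separates the leading `__label__` tokens from the text and perturbs only the text. The helper names are hypothetical, and the sketch assumes labels appear at the start of the line and are separated by single spaces; it does not claim to mirror NoiseMix's own format handling.

```python
LABEL_PREFIX = "__label__"


def split_fasttext_line(line):
    """Split a line into (leading label tokens, remaining text)."""
    tokens = line.rstrip("\n").split(" ")
    n = 0
    while n < len(tokens) and tokens[n].startswith(LABEL_PREFIX):
        n += 1
    return tokens[:n], " ".join(tokens[n:])


def perturb_fasttext_line(line, perturb):
    """Apply `perturb` to the text only, leaving the labels untouched."""
    labels, text = split_fasttext_line(line)
    return " ".join(labels + [perturb(text)])


line = "__label__positive This was a GREAT book"
print(perturb_fasttext_line(line, str.lower))
```

Here `str.lower` stands in for any perturbation; the label `__label__positive` passes through unchanged.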

Benchmarks

To test the effectiveness of NoiseMix, we compare it against control data on several benchmarks. Several benchmarks and toy datasets are included.

See results and more in benchmarks.

Developing

To make contributions, just git clone this repo or a fork of it, and submit a pull request with your changes.

git clone https://github.com/noisemix/noisemix.git
cd noisemix
pip install -r requirements.txt

Adding a new language

Beyond parameters that can be adjusted for the specifics of each language, NoiseMix includes hand-built lists of common noise for each language.

For example, erroneous swaps of there and their are common in English corpora, and in Italian it is common for users without an Italian keyboard to type word-final a' instead of à.
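
One way such hand-built lists could be organised is sketched below. The `COMMON_NOISE` table, its keys and the `apply_common_noise` helper are hypothetical names for illustration only; the actual structure in data.py may differ. The Italian entry maps word-final à (as in città) to a'.

```python
# Hypothetical per-language noise tables, keyed by language code.
COMMON_NOISE = {
    "en": {
        "word_swaps": {"their": "there", "there": "their"},
        "suffix_swaps": {},
    },
    "it": {
        "word_swaps": {},
        "suffix_swaps": {"à": "a'"},
    },
}


def apply_common_noise(text, lang):
    """Replace whole words or word endings using the table for `lang`."""
    table = COMMON_NOISE[lang]
    out = []
    for word in text.split(" "):
        if word in table["word_swaps"]:
            word = table["word_swaps"][word]
        else:
            for suffix, replacement in table["suffix_swaps"].items():
                if word.endswith(suffix):
                    word = word[:-len(suffix)] + replacement
                    break
        out.append(word)
    return " ".join(out)


print(apply_common_noise("but their shipping was too slow", "en"))
print(apply_common_noise("la città", "it"))
```

In practice each swap would be applied with some probability rather than every time; the sketch applies them unconditionally to keep the table lookup easy to follow.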

Add support for new languages to data.py.

Adding a new format

Add support for new formats to format.py.

You can also open an issue.