Adding Data Augmentation Techniques Natively #39

Open
aflah02 opened this issue Mar 13, 2022 · 18 comments
Labels
type:feature New feature or request

Comments

@aflah02 (Collaborator) commented Mar 13, 2022

I'm interested in contributing scripts that allow users to incorporate data augmentation techniques directly, without using external libraries.
I could start with techniques such as synonym replacement, random insertion, random swap, and random deletion from the paper EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks.
Over time this could be extended with more techniques, such as the additional ones mentioned here.
Any hints and tips on how I can get started?

Edit: I also found this survey paper which seems pretty useful: A Survey of Data Augmentation Approaches for NLP
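
For concreteness, here is a minimal sketch of two of these rule-based operations (random deletion and random swap) in plain Python over whitespace-split words; this is an illustration of the idea, not the original EDA code, and the parameter defaults are arbitrary.

import random

def random_deletion(words, p=0.1):
    # Drop each word independently with probability p, keeping at least one word.
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]

def random_swap(words, n_swaps=1):
    # Swap the positions of two randomly chosen words, n_swaps times.
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n_swaps):
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

print(" ".join(random_swap(random_deletion("the quick brown fox jumps over".split()))))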

@aflah02 (Collaborator, Author) commented Mar 13, 2022

This is also partly inspired by the ideas mentioned in the GSoC document.

@aflah02 (Collaborator, Author) commented Mar 14, 2022

@mattdangerw and the rest of the Keras team, it would be great to hear your thoughts on this.

@aflah02 (Collaborator, Author) commented Mar 16, 2022

As a starting point I've implemented EDA, while also fixing some of the bugs present in the original EDA code, such as stop words not being excluded in some cases, which several issues on that repository point out:
https://colab.research.google.com/drive/192mGhABi1n51cg8SFLvuUCwvsYIMNvx1?usp=sharing
My next step is to show that this achieves the gains reported in the paper, by training a similar model on one of the datasets used, both with and without these methods.

@aflah02 (Collaborator, Author) commented Mar 17, 2022

I've implemented that, and it seems I'm getting the almost-3% gains mentioned in the paper:
https://github.com/aflah02/Easy-Data-Augmentation-Implementation/blob/main/EDA.ipynb
What should the next step be? @mattdangerw or anyone else from the Keras team.

@mattdangerw (Member)

Thank you very much for digging into this!

Bear with us for a bit here as we figure out the approach we would like to take with data augmentation. We are taking a look and will reply more soon. There are a lot of questions this brings up (how to handle static assets, how to make layerized versions of these components, multilingual support), but overall data augmentation is something we would like to explore.

@aflah02 (Collaborator, Author) commented Mar 18, 2022

@mattdangerw Sure! I had quite a bit of fun implementing this. While you figure out the approach you'd prefer, I'll try implementing other techniques too.

@aflah02 (Collaborator, Author) commented Mar 18, 2022

Backtranslation, tried on a smaller sample size, also seems to give pretty good results:
https://github.com/aflah02/BackTranslation-Based-Data-Augmentation
Maybe we could parallelize this to make it faster, since it is painfully slow for large datasets right now; I'll look into that too.
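
As a rough illustration of the backtranslation idea (not the code in the linked repository): translate each example into a pivot language and back again. The translate function below is a hypothetical placeholder for whatever MT model or service is used.

def backtranslate(texts, translate, src_lang="en", pivot_lang="fr"):
    # translate(list_of_strings, source, target) -> list_of_strings is assumed
    # to wrap some MT model or API; it is a hypothetical placeholder here.
    pivoted = translate(texts, src_lang, pivot_lang)
    return translate(pivoted, pivot_lang, src_lang)

# Hypothetical usage: augmented = backtranslate(["the movie was great"], my_mt_fn)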

@mattdangerw (Member)

If you are looking for something to pick up in the meantime, we opened a couple of issues tagged "good first issue" where our design is fully defined. We will try to keep expanding that list.

@aflah02 (Collaborator, Author) commented Mar 19, 2022

I recently came across the paper SSMBA: Self-Supervised Manifold Based Data Augmentation for Improving Out-of-Domain Robustness, which uses a corruption function and a reconstruction function to create new samples from the real data. It seems interesting: although it's computationally more expensive than rule-based techniques, it gives substantial gains on out-of-domain (OOD) samples on a couple of datasets. This could be one of the techniques we implement natively for users who want gains on OOD samples.
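
A very rough sketch of the corrupt-and-reconstruct loop described above, to make the idea concrete; reconstruct is a hypothetical placeholder for a pretrained masked language model that fills in the masked positions, and the masking probability is arbitrary.

import random

def corrupt(words, mask_token="[MASK]", p=0.15):
    # Corruption function: randomly mask a fraction of the tokens.
    return [mask_token if random.random() < p else w for w in words]

def ssmba_augment(sentence, reconstruct, n_samples=4):
    # reconstruct(masked_sentence) -> sentence is assumed to be a pretrained
    # masked language model that fills in the masked positions (hypothetical).
    words = sentence.split()
    return [reconstruct(" ".join(corrupt(words))) for _ in range(n_samples)]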

@aflah02 (Collaborator, Author) commented Mar 21, 2022

Another great paper (Synthetic and Natural Noise Both Break Neural Machine Translation) aims to make NMT models more robust to typos and other corruptions that humans can easily overcome. It uses four techniques, including some interesting ones such as mimicking keyboard typos by substituting characters that sit next to the original character on the keyboard.
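
As an illustration of the keyboard-typo idea (a sketch, not the paper's code): replace a character, with some probability, by one of its physical neighbours on a QWERTY layout. The adjacency map here is deliberately partial.

import random

# Partial QWERTY adjacency map; extend as needed.
KEYBOARD_NEIGHBOURS = {
    "a": "qwsz", "s": "awedxz", "d": "serfcx", "e": "wsdr",
    "o": "iklp", "i": "ujko", "n": "bhjm", "t": "rfgy",
}

def keyboard_typo(text, p=0.05):
    # Replace each character, with probability p, by a neighbouring key.
    out = []
    for ch in text:
        neighbours = KEYBOARD_NEIGHBOURS.get(ch.lower())
        if neighbours and random.random() < p:
            out.append(random.choice(neighbours))
        else:
            out.append(ch)
    return "".join(out)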

@aflah02 (Collaborator, Author) commented Mar 26, 2022

Hi @mattdangerw,
While I'm working on the other issue, are there any updates on how you plan to incorporate these data augmentation techniques?

@mattdangerw (Member)

Yes, we have been discussing this. I've been trying to capture some design requirements we have. Here's what I have so far...

  1. Augmentation functionality should be exposed as layers, not utilities.
  2. Layers should take raw untokenized strings as input, and output augmented raw untokenized strings.
  3. Layer computations should be representable as a TensorFlow graph, though this graph can include calls to tf.numpy_function or tf.py_function (see the sketch below).
  4. We do not want to add new dependencies (no using nltk).
  5. Layers should be fully usable with user-provided data.

For 5) and EDA, that means we would need some way to represent a synonym lookup table that a user could provide. I'm unsure of what this should look like. Is WordNet the prevalent data source here? Is its data available in a simple file format (JSON, text lines)?

It would also be helpful to get a bit of the lay of the land here. For the papers mentioned in the survey you linked (https://github.com/styfeng/DataAug4NLP is the continuously updated GitHub version), we should try to get a sense of which techniques are most commonly used. Citations are not a perfect metric, but they might be the best place to start.
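
To make requirements 2) and 3) concrete, here is a minimal sketch of what such a layer could look like; the class name and details are assumptions on my part, not a settled API. It wraps a pure-Python string-to-string function in tf.numpy_function so it can run inside a tf.data pipeline.

import tensorflow as tf

class PyStringAugmentation(tf.keras.layers.Layer):
    # Wraps a Python str -> str function so it can run inside a tf.data graph.

    def __init__(self, fn, **kwargs):
        super().__init__(**kwargs)
        self.fn = fn

    def _augment(self, x):
        # x arrives as a numpy bytes scalar; decode, transform, re-encode.
        return self.fn(x.decode("utf-8")).encode("utf-8")

    def call(self, inputs):
        return tf.numpy_function(self._augment, [inputs], tf.string)

# Hypothetical usage: dataset = dataset.map(PyStringAugmentation(some_python_fn))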

@aflah02 (Collaborator, Author) commented Mar 30, 2022

Hey @mattdangerw
Thanks for sharing this.
Just to confirm: this means these augmentation layers will always be applied before any other operations, right, since they take in and return untokenized strings? I haven't seen an example so far, but wouldn't it get tricky if some future work introduced augmenting data during training, rather than beforehand as most current work does?
WordNet is quite commonly used for synonym replacement tasks from what I could find, and the original papers also used WordNet. As for having a user provide their own synonym set, we could do that too, perhaps as a parsed dictionary with words as keys and lists of synonyms as values (a possible format is sketched below). Someone did parse WordNet and release it as JSON here. Also, one of the papers I list below used the English thesaurus from the mytheas component of the LibreOffice project, which in turn is created from WordNet; I haven't used that component/project, so I don't have much of a clue how they did it and will have to research that bit.

I did try to search for the most cited ones: among the rule-based techniques, EDA and synonym replacement (Character-Level Convolutional Networks for Text Classification) seem to be the most cited, with 684 and 4079 citations respectively. I think these are a reasonable place to start, while I keep an eye out for highly cited non-rule-based techniques as well.
There are also a ton of small rule-based techniques that are used depending on the use case and provided by other libraries like nlpaug, such as simulating keyboard or OCR typos. I do have some data sources for these; for instance, for OCR errors there is this file which lists common errors.
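
For concreteness, the user-provided synonym table mentioned above could simply be a plain Python dict (possibly parsed from a JSON file); a hypothetical example:

import json

# Hypothetical user-provided synonym table: word -> list of synonyms.
synonyms = {
    "quick": ["fast", "speedy", "swift"],
    "film": ["movie", "picture"],
}
# Or loaded from a JSON file with the same structure (hypothetical path):
# with open("synonyms.json") as f:
#     synonyms = json.load(f)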

@aflah02 (Collaborator, Author) commented Mar 30, 2022

There is also the official WordNet release - https://wordnet.princeton.edu/download/current-version - and the database file format is documented here - https://wordnet.princeton.edu/documentation/wndb5wn - so we could write our own parser for it.
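
A rough sketch of what such a parser could look like for the data.* files described in the wndb documentation (my reading of the format, so treat the details as assumptions): each synset line carries an offset, a lexicographer file number, a part-of-speech tag, and a hexadecimal word count, followed by word/lex_id pairs; grouping the words within each synset yields a synonym table.

from collections import defaultdict

def parse_wordnet_data(path):
    # Parse a WordNet data.* file (wndb format) into word -> set of synonyms.
    synonyms = defaultdict(set)
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("  "):  # skip the license header lines
                continue
            fields = line.split()
            # fields: synset_offset lex_filenum ss_type w_cnt word lex_id ...
            w_cnt = int(fields[3], 16)  # the word count is in hexadecimal
            words = [fields[4 + 2 * i].replace("_", " ") for i in range(w_cnt)]
            for w in words:
                synonyms[w.lower()].update(x.lower() for x in words if x != w)
    return synonyms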

@mattdangerw (Member)

@aflah02 the augmentations should be applied as operations, but before tokenization. A lot of our discussion has been around 3) above. The layer transformations need to be expressible as a graph of TensorFlow ops to work with tf.data, but we believe that doing transformations with purely tf.strings operations would be too restrictive, so using tf.numpy_function will allow writing pure Python transformations of strings (at a performance hit).

Might be a little simpler to frame this in terms of workflows. A common flow we would expect is something like this...

def preprocess(x):
    x = keras_nlp.layers.SomeAugmentation()(x)
    x = keras_nlp.tokenizers.SomeTokenizer()(x)
    return x

# Load a text dataset from disk, e.g. one example per line (hypothetical path).
dataset = tf.data.TextLineDataset("train.txt")
dataset = dataset.map(preprocess).batch(32)

model = ...
model.fit(dataset) # Each epoch will now apply a different augmentation.

@aflah02
Copy link
Collaborator Author

aflah02 commented Mar 31, 2022

Thanks!
Should I now get started on some basic ones that are easy to implement, such as EDA, following the above scheme?

@mattdangerw
Copy link
Member

@aflah02 Yeah, rather than EDA as a whole, I would say maybe we should start with designing a layer for synonym replacement?

It's a strict subset of EDA, we would want it as a standalone layer anyway, and it will start answering a lot of the questions we have.
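
As a possible starting point for that discussion, here is a rough skeleton of a synonym replacement layer under the requirements listed earlier; the name, constructor arguments, and behaviour are my guesses rather than a settled design. It takes a user-provided synonym dictionary and swaps words with a given probability, wrapped in tf.numpy_function so it can run in tf.data.

import random
import tensorflow as tf

class SynonymReplacement(tf.keras.layers.Layer):
    # Hypothetical layer: replaces words with synonyms from a user-provided dict.

    def __init__(self, synonym_dict, rate=0.1, **kwargs):
        super().__init__(**kwargs)
        self.synonym_dict = synonym_dict
        self.rate = rate

    def _augment(self, x):
        words = x.decode("utf-8").split()
        out = []
        for w in words:
            options = self.synonym_dict.get(w.lower())
            if options and random.random() < self.rate:
                out.append(random.choice(options))
            else:
                out.append(w)
        return " ".join(out).encode("utf-8")

    def call(self, inputs):
        return tf.numpy_function(self._augment, [inputs], tf.string)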

@aflah02 (Collaborator, Author) commented Apr 1, 2022

That sounds good. I'll also get started on parsing the WordNet data and try out the data in the GitHub release that I shared.
