Adding Data Augmentation Techniques Natively #39
This is also partly inspired by the ideas mentioned in the GSOC Document
@mattdangerw and the rest of the Keras team, it would be great to hear your thoughts on this.
As a starting point I've implemented EDA, while also fixing some of the bugs present in the original EDA code, such as not excluding stop words in some cases, which several issues on that repo point out.
With my implementation I'm getting almost the 3% gains mentioned in the paper.
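To make the stop-word fix concrete, here is a minimal sketch of EDA's synonym replacement operation with stop words excluded from replacement. The synonym table, stop-word list, and function name here are illustrative placeholders, not the actual implementation or a WordNet lookup:

```python
import random

# Toy synonym table standing in for a real WordNet lookup.
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "happy": ["glad", "joyful"],
}

STOP_WORDS = {"the", "a", "an", "is", "are", "and", "or", "of", "to"}

def synonym_replacement(sentence, n=1, rng=None):
    """EDA's synonym replacement, skipping stop words.

    Excluding stop words from the candidate set is the bug fix
    discussed above -- the original EDA code sometimes replaced them.
    """
    rng = rng or random.Random(0)
    words = sentence.split()
    candidates = [i for i, w in enumerate(words)
                  if w.lower() not in STOP_WORDS and w.lower() in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return " ".join(words)
```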
Thank you very much for digging into this! Bear with us for a bit here as we figure out an approach we would like to take with data augmentation. We are taking a look and will reply more soon. There are a lot of questions this brings up (how to handle static assets, how to make layerized versions of these components, multilingual support), but overall data augmentation is something we would like to explore.
@mattdangerw Sure! I had quite a bit of fun implementing this. While you figure out the approach you'd prefer, I'll try implementing other techniques too.
Backtranslation, even on a smaller sample size, also seems to give good results:
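For context, the core of backtranslation is just a round trip through a pivot language. A minimal sketch, where `translate` is a user-supplied callable (any NMT model or translation API could be plugged in; the name and `(text, src, tgt)` signature are assumptions for illustration):

```python
def backtranslate(sentences, translate, pivot="de"):
    """Augment sentences by translating to a pivot language and back.

    `translate(text, src, tgt)` is supplied by the caller; the round
    trip introduces paraphrase-like variation into the data.
    """
    return [translate(translate(s, "en", pivot), pivot, "en")
            for s in sentences]
```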
If you are looking for something to pick up in the meantime, we've opened a couple of issues tagged with "good first issue" where our design is fully defined. We'll try to keep expanding that list.
I recently came across the paper SSMBA: Self-Supervised Manifold Based Data Augmentation for Improving Out-of-Domain Robustness, which uses a corruption function and a reconstruction function to create new samples from the real data. It looks interesting: although it's computationally more expensive than rule-based techniques, it gives substantial gains on out-of-domain (OOD) samples across a couple of datasets. It could be one of the techniques we implement natively so users can get gains on OOD samples.
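The corrupt-and-reconstruct idea can be sketched roughly as below. In the paper the corruption is token masking and the reconstruction is a masked language model (e.g. BERT); here `reconstruct` is a user-supplied placeholder and the function name is my own:

```python
import random

def ssmba_augment(tokens, reconstruct, mask_token="[MASK]", p=0.15, rng=None):
    """One SSMBA-style corrupt-and-reconstruct step (rough sketch).

    Corruption masks a fraction p of tokens; `reconstruct` (in the
    paper, a masked language model) fills the masks back in, producing
    a new sample that stays near the data manifold.
    """
    rng = rng or random.Random(0)
    corrupted = [mask_token if rng.random() < p else t for t in tokens]
    return reconstruct(corrupted)
```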
Another great paper (Synthetic and Natural Noise Both Break Neural Machine Translation) aims to make NMT models more robust to typos and other corruptions which humans can easily overcome. It uses four techniques, including some interesting ones such as mimicking keyboard typos by substituting characters with their keyboard neighbours.
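The keyboard-typo noise is simple to sketch: with some probability, each character is swapped for a key adjacent to it on the keyboard. The adjacency map below is a deliberately partial, illustrative QWERTY subset; a real implementation would cover the whole keyboard and other layouts:

```python
import random

# Partial QWERTY adjacency map, for illustration only.
NEIGHBOURS = {
    "a": "qwsz", "s": "awedxz", "e": "wsdr",
    "o": "iklp", "t": "rfgy", "n": "bhjm",
}

def keyboard_typo(text, p=0.1, rng=None):
    """Replace each character, with probability p, by one of its
    keyboard neighbours -- the key-swap style noise from the paper."""
    rng = rng or random.Random(0)
    return "".join(
        rng.choice(NEIGHBOURS[ch]) if ch in NEIGHBOURS and rng.random() < p
        else ch
        for ch in text
    )
```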
Hi @mattdangerw
Yes, we have been discussing this. I've been trying to capture some design requirements we have. Here's what I have so far...
For 5) and EDA, that means we would need some way to represent a synonym lookup table that a user could provide. I'm unsure of what this should look like. Is WordNet the prevalent data source here? Do they have data available in a simple file format (json, text lines)? It would also be helpful to get a bit of the lay of the land here. For the papers mentioned in the survey you linked (https://github.com/styfeng/DataAug4NLP is the continuously updated GitHub version), we should try to get a sense of which techniques are most commonly used. Citations are not a perfect metric, but they might be the best place to start.
Hey @mattdangerw, I did try to search for the most cited ones. Among the rule-based techniques, EDA and synonym replacement (Character-Level Convolutional Networks for Text Classification) seem to be the most cited, with 684 and 4079 citations respectively. I think these are a reasonable place to start, and I'll also be on the lookout for highly cited non-rule-based techniques.
There is also this for WordNet - https://wordnet.princeton.edu/download/current-version - and the file format is documented here - https://wordnet.princeton.edu/documentation/wndb5wn - so we can write our own parser for it.
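For reference, a minimal parser for one line of a wndb data file (e.g. data.noun) could look like the sketch below. The field layout follows the linked documentation (a two-digit hexadecimal word count followed by that many word/lex_id pairs, with the gloss after a ` | ` separator and underscores standing in for spaces); the function name and returned dict shape are my own choices:

```python
def parse_data_line(line):
    """Parse one synset line from a WNDB data file (e.g. data.noun).

    Layout: synset_offset lex_filenum ss_type w_cnt word lex_id
    [word lex_id ...] ... | gloss, where w_cnt is hexadecimal.
    """
    fields, _, gloss = line.partition(" | ")
    parts = fields.split()
    w_cnt = int(parts[3], 16)  # word count is a 2-digit hex number
    words = [parts[4 + 2 * i].replace("_", " ") for i in range(w_cnt)]
    return {"offset": parts[0], "pos": parts[2],
            "words": words, "gloss": gloss.strip()}
```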
@aflah02 the augmentations should be applied as operations, but applied before tokenization. A lot of the discussion we've had was around 3) above: the layer transformations need to be expressible as a graph of tensorflow ops to work with tf.data. It might be a little simpler to frame this in terms of workflows. A common flow we would expect is something like this...
Thanks!
@aflah02 yeah, I would say that rather than EDA as a whole, maybe we should start with designing a layer for synonym replacement? It's a strict subset of EDA, we would want it as a standalone layer anyway, and it will start answering a lot of the questions we have.
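A standalone synonym-replacement layer might look something like the sketch below. This is plain Python for clarity and every name here is hypothetical, not the eventual KerasNLP API; a real Keras layer would express the same transformation with tf ops so it can run inside a tf.data pipeline before tokenization. The key design point is that the synonym table is user-provided (e.g. parsed from WordNet), matching requirement 5) above:

```python
import random

class SynonymReplacement:
    """Sketch of a standalone synonym-replacement preprocessing step.

    `synonyms` maps a word to a list of candidate replacements; each
    word is replaced with probability `rate`.
    """

    def __init__(self, synonyms, rate=0.3, seed=0):
        self.synonyms = synonyms
        self.rate = rate
        self.rng = random.Random(seed)

    def __call__(self, sentence):
        out = []
        for w in sentence.split():
            if w in self.synonyms and self.rng.random() < self.rate:
                out.append(self.rng.choice(self.synonyms[w]))
            else:
                out.append(w)
        return " ".join(out)
```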
That sounds good. I'll get to parsing the WordNet data and also try out the data in the GitHub release I shared.
I'm interested in contributing scripts which allow users to incorporate data augmentation techniques directly without using external libraries.
I can start with stuff like synonym replacement, random insertion, random swap, and random deletion from the paper EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks
Over time this can be extended to incorporate more techniques, such as the additional ones mentioned here
Any hints and tips on how I can get started?
Edit: I also found this survey paper which seems pretty useful: A Survey of Data Augmentation Approaches for NLP
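Besides synonym replacement, the other EDA operations named above are small enough to sketch directly. Below are hedged, illustrative versions of random deletion and random swap (function names and defaults are my own, not from the paper's reference code):

```python
import random

def random_deletion(words, p=0.1, rng=None):
    """EDA's random deletion: drop each word independently with
    probability p, never returning an empty sentence."""
    rng = rng or random.Random(0)
    kept = [w for w in words if rng.random() >= p]
    return kept or [rng.choice(words)]

def random_swap(words, n=1, rng=None):
    """EDA's random swap: swap two random positions, repeated n times."""
    rng = rng or random.Random(0)
    words = list(words)
    for _ in range(n):
        i, j = rng.randrange(len(words)), rng.randrange(len(words))
        words[i], words[j] = words[j], words[i]
    return words
```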