UC Berkeley MIDS Program
Joanna Yu, Spring 2020
While Natural Language Processing has developed rapidly over the last decade, the scarcity of labeled data remains a problem. Research has shown that unlabeled data can improve adversarial robustness, and consistency training is among the semi-supervised approaches that show great promise. Recent work by Google, “Unsupervised Data Augmentation for Consistency Training” (UDA), shows that back-translation, an advanced data augmentation technique, is an effective way to improve model performance. This project extends the framework from that paper to investigate the tradeoff between labeled and unlabeled data and the role of domain relevance in semi-supervised learning with unsupervised data augmentation.
- Track model performance with respect to the proportion of labeled vs unlabeled data.
- Investigate how domain relevance of unlabeled data affects performance of the semi-supervised model.
The main dataset, the IMDb movie review dataset, is well suited to the proposed experiments because it contains a good amount of both labeled and unlabeled examples. In addition, using a movie review dataset makes it possible to append additional movie review data for experiments that go beyond the size of IMDb.
Data Type | Positive | Negative | Total |
---|---|---|---|
Labeled Training Data | 12,500 | 12,500 | 25,000 |
Test Data | - | - | 25,000 |
Unlabeled Training Data | - | - | 50,000 |
Datasets for the domain relevance experiments:
Dataset | Domain Relevance |
---|---|
IMDb Movie Reviews | In-Domain |
Amazon Movie and TV Reviews | In-Domain |
Amazon Office Product Reviews | Semi-In-Domain |
Twitter Airline Sentiment | Semi-In-Domain |
Kaggle Natural Disaster Tweets | Out-of-Domain |
- A series of notebooks and scripts is run on Google Colaboratory Pro in a GPU/TPU environment.
- A large amount of Google Cloud Storage is used for this project due to the size of BERT and the model checkpoint files.
- Depending on the total amount of data used for fine-tuning and training, a single model takes 30 minutes to 6+ hours to run on a TPU.
- Seven different amounts of labeled data are selected.
- Model performance is tracked for every increase of 4,000 augmented unlabeled examples from 0 to 16,000. At that point the error rate begins to level off, so additional models are run at 24,000, 48,000, and the full dataset of 69,972 examples.
- Unlabeled data is kept constant at 16,000 examples so the results are comparable across all datasets.
- When feasible, only examples over 128 tokens are selected so training can be done on longer text.
- Back-translation is performed to augment the unlabeled data.
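
The experiment settings listed above can be sketched in a few lines of Python. This is only an illustration, not the project's actual code: the helper `select_unlabeled` and its parameters are hypothetical, and counting tokens by whitespace splitting is an assumption (the project may count tokens differently, for example with the BERT tokenizer).

```python
import random

# Unlabeled-set sizes described above: steps of 4,000 from 0 to 16,000,
# then 24,000, 48,000, and the full set of 69,972 examples.
FULL_UNLABELED_SIZE = 69_972
UNLABELED_SIZES = list(range(0, 16_001, 4_000)) + [24_000, 48_000, FULL_UNLABELED_SIZE]


def select_unlabeled(texts, n_examples=16_000, min_tokens=128, seed=0):
    """Pick up to n_examples texts, preferring texts longer than min_tokens.

    Hypothetical helper. Whitespace tokenization is an assumption made
    for this sketch, not necessarily how the project counts tokens.
    """
    long_texts = [t for t in texts if len(t.split()) > min_tokens]
    # "When feasible": fall back to the whole pool if there are too few long texts.
    pool = long_texts if len(long_texts) >= n_examples else list(texts)
    rng = random.Random(seed)
    rng.shuffle(pool)
    return pool[:n_examples]
```

Keeping the sample size fixed (16,000 by default here) mirrors the choice above to hold the amount of unlabeled data constant so results stay comparable across datasets.
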
A common belief is that domain relevance should play a role in semi-supervised learning and that fine-tuning with out-of-domain data will not benefit the model as much as in-domain data would. The results are surprising in two ways:
- The performance of the model seems to rely much more heavily on the amount of labeled data than the domain relevance of the unlabeled data.
- Longer text in the unlabeled dataset does not lead to any noticeable benefit to model performance.
The UDA main notebook contains the experimentation done in this project. The results notebook contains the results and graphs.
The experiments in this project build on the framework developed in the UDA paper, https://arxiv.org/pdf/1904.12848.pdf
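
For readers unfamiliar with that framework, its core objective combines a supervised cross-entropy term on labeled data with a consistency term that penalizes disagreement between predictions on an unlabeled example and on its augmented (here, back-translated) version. The following is a minimal pure-Python sketch for a single example pair, not the implementation used in this project; the function names and the weight `lam` are illustrative, and it omits additional techniques from the paper such as training signal annealing.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps)) for pi, qi in zip(p, q))


def uda_loss(labeled_logits, label, orig_logits, aug_logits, lam=1.0):
    """Supervised cross-entropy plus a weighted consistency term.

    In actual training the prediction on the original unlabeled example
    is treated as a fixed target (no gradient flows through it); this
    sketch only computes the loss value.
    """
    supervised = -math.log(softmax(labeled_logits)[label])
    consistency = kl_divergence(softmax(orig_logits), softmax(aug_logits))
    return supervised + lam * consistency
```

When the model predicts the same distribution for an example and its back-translation, the consistency term vanishes and only the supervised loss remains.
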