Unsupervised Data Augmentation (UDA)

Unsupervised Data Augmentation


Unsupervised Data Augmentation (UDA) is a semi-supervised learning method that achieves state-of-the-art results on a wide variety of language and vision tasks.

With only 20 labeled examples, UDA outperforms the previous state-of-the-art on IMDb trained on 25,000 labeled examples.

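| Model                  | Number of labeled examples | Error rate |
|------------------------|----------------------------|------------|
| Mixed VAT (Prev. SOTA) | 25,000                     | 4.32       |
| BERT                   | 25,000                     | 4.51       |
| UDA                    | 20                         | 4.20       |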

It reduces the error rate of state-of-the-art methods by more than 30% on CIFAR-10 with 4,000 labeled examples and on SVHN with 1,000 labeled examples:

| Model            | CIFAR-10 | SVHN     |
|------------------|----------|----------|
| ICT (Prev. SOTA) | 7.66±.17 | 3.53±.07 |
| UDA              | 5.27±.11 | 2.46±.17 |

It leads to significant improvements on ImageNet with 10% labeled data.

| Model     | top-1 accuracy | top-5 accuracy |
|-----------|----------------|----------------|
| ResNet-50 | 55.09          | 77.26          |
| UDA       | 68.66          | 88.52          |

How it works

UDA is a semi-supervised learning method that reduces the need for labeled examples and makes better use of unlabeled ones.

What we are releasing

We are releasing the following:

  • Code for text classification based on BERT.
  • Code for image classification on CIFAR-10 and SVHN.
  • Code and checkpoints for our back translation augmentation system.

All of the code in this repository works out-of-the-box with GPU and Google Cloud TPU.


The code is tested on Python 2.7 and TensorFlow 1.13.

Text classification

Run on GPUs

Memory issues

The movie reviews in IMDb are longer than the inputs of many classification tasks, so a longer sequence length leads to better performance. With BERT, the sequence length is limited by TPU/GPU memory (see the out-of-memory issues of BERT). We therefore provide scripts that run with shorter sequence lengths and smaller batch sizes.


If you want to run UDA with BERT base on a GPU with 11 GB memory, go to the text directory and run the following commands:

# Set a larger max_seq_length if your GPU has more than 11 GB of memory

# Download data and pretrained BERT checkpoints
bash scripts/

# Preprocessing
bash scripts/ --max_seq_length=${MAX_SEQ_LENGTH}

# Baseline accuracy: around 68%
bash scripts/ --max_seq_length=${MAX_SEQ_LENGTH}

# UDA accuracy: around 90%
# Set a larger train_batch_size to achieve better performance if your GPU has more memory.
bash scripts/ --train_batch_size=8 --max_seq_length=${MAX_SEQ_LENGTH}

Run on Cloud TPU v3-32 Pod to achieve SOTA performance

The best performance in the paper is achieved by using a max_seq_length of 512 and initializing with BERT large finetuned on in-domain unsupervised data. If you have access to Google Cloud TPU v3-32 Pod, try:


# Download data and pretrained BERT checkpoints
bash scripts/

# Preprocessing
bash scripts/ --max_seq_length=${MAX_SEQ_LENGTH}

# UDA accuracy: 95.3% - 95.9%

Run back translation data augmentation for your dataset

First of all, install the following dependencies:

pip install --user nltk
python -c "import nltk; nltk.download('punkt')"
pip install --user tensor2tensor==1.13.4

The following command translates the provided example file. It automatically splits paragraphs into sentences, translates the English sentences to French, and then translates them back into English. Finally, it composes the paraphrased sentences into paragraphs. Go to the back_translate directory and run:


Guidelines for hyperparameters:

There is a variable sampling_temp in the bash file. It is used to control the diversity and quality of the paraphrases. Increasing sampling_temp will lead to increased diversity but worse quality. Surprisingly, diversity is more important than quality for many tasks we tried.

We suggest trying sampling_temp values of 0.7, 0.8 and 0.9. If your task is very robust to noise, sampling_temp=0.9 or 0.8 should lead to improved performance. If your task is not robust to noise, setting sampling_temp to 0.7 or 0.6 should work better.
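
To make the diversity/quality trade-off concrete, here is a minimal sketch of temperature sampling. It assumes the common convention that logits are divided by the temperature before the softmax; the logits and the function names are hypothetical and are not taken from this repository.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a more diverse distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def sample_token(logits, temperature, rng):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Hypothetical next-token logits for a 4-word vocabulary (illustration only).
logits = [2.0, 1.0, 0.5, 0.1]
for temp in (0.6, 0.7, 0.8, 0.9):
    probs = softmax_with_temperature(logits, temp)
    print(temp, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

A lower temperature concentrates probability on the most likely token, which gives safer but less varied paraphrases. A higher temperature flattens the distribution, which gives more varied but noisier paraphrases.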

If you want to back-translate a large file, you can change the replicas and worker_id arguments. For example, when replicas=3, the data is divided into three parts, and each worker processes only the part selected by its worker_id.
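
The sketch below shows one way such a split could work. It assumes contiguous, near-equal shards; the function name shard_bounds and the slicing scheme are illustrative assumptions, and the actual script may partition the data differently.

```python
def shard_bounds(num_examples, replicas, worker_id):
    """Return the [start, end) range of examples handled by one worker.

    Assumes contiguous shards of near-equal size (an illustrative choice).
    """
    if not 0 <= worker_id < replicas:
        raise ValueError("worker_id must be in [0, replicas)")
    per_worker = -(-num_examples // replicas)  # ceiling division
    start = min(worker_id * per_worker, num_examples)
    end = min(start + per_worker, num_examples)
    return start, end


# Hypothetical input: ten paragraphs split across three workers.
paragraphs = ["paragraph %d" % i for i in range(10)]
for worker_id in range(3):
    start, end = shard_bounds(len(paragraphs), replicas=3, worker_id=worker_id)
    print(worker_id, paragraphs[start:end])
```

Each worker can then be launched independently with the same replicas value and a different worker_id, and the outputs concatenated in worker order.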

Image classification


We generate 100 augmented examples for every original example. To download all the augmented data, go to the image directory and run

bash scripts/ ${AUG_COPY}

Note that you need 230 GB of disk space for all the augmented data. To save space, you can set AUG_COPY to a smaller number. For example, setting AUG_COPY to 30 and 10 leads to an accuracy of 94.30 and 93.64 respectively on CIFAR-10.

Alternatively, you can generate the augmented examples yourself by running

bash scripts/ --aug_copy=${AUG_COPY}

CIFAR-10 with 4,000 examples

We provide different commands to train UDA on TPUs and GPUs since TPUs and GPUs have different implementations for batch norm. All of the scripts can achieve the current SOTA results on CIFAR-10 with 4,000 examples and SVHN with 1,000 examples.

GPU command:

# UDA accuracy: 94.5% - 94.9%
bash scripts/ --aug_copy=${AUG_COPY}

Google Cloud TPU v3-8/v2-8 command:

# UDA accuracy: 94.6% - 95.0%
bash scripts/ --aug_copy=${AUG_COPY}

Google Cloud TPU v3-32 Pod command:

# UDA accuracy: 94.5% - 95.0%
bash scripts/ --aug_copy=${AUG_COPY}

SVHN with 1,000 examples

Google Cloud TPU v3-32 Pod command:

# UDA accuracy: 97.1% - 97.8%
bash scripts/ --aug_copy=${AUG_COPY}

Our hyperparameters for SVHN are essentially the same as those for CIFAR-10. To use GPUs or Cloud TPU v2/v3, take the script for CIFAR-10, change task_name to svhn, change sup_size to 1000, and set learning_rate to 0.03 or 0.05.

General guidelines for setting hyperparameters:

UDA works out of the box and does not require extensive hyperparameter tuning, but to really push the performance, here are some suggestions about hyperparameters:

  • It works well to set the weight on the unsupervised objective 'unsup_coeff' to 1.
  • Use a lower learning rate than in pure supervised learning, because there are two loss terms, computed on labeled data and unlabeled data respectively.
  • If you have an extremely small amount of data, try to tweak 'uda_softmax_temp' and 'uda_confidence_thresh' a bit. For more details about these two hyperparameters, search for "Confidence-based masking" and "Softmax temperature control" in the paper.
  • Effective augmentation for supervised learning usually works well for UDA.
  • Enumerating the TSA schedules (including not using TSA) is helpful.
  • For some tasks, we observed that increasing the batch size for the unsupervised objective leads to better performance. For other tasks, small batch sizes also work well. For example, when we run UDA with GPU on CIFAR-10, the best batch size for the unsupervised objective is 160.
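
As a rough illustration of where these hyperparameters enter the objective, here is a plain-Python sketch for a single example. It is an assumption-laden reading of the paper, not the code in this repository: the function names are hypothetical, the TSA constants follow the schedules described in the paper, and the real implementation operates on TensorFlow batches.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (below 1 sharpens)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def unsup_loss(orig_logits, aug_logits,
               uda_softmax_temp=1.0, uda_confidence_thresh=0.0):
    """Consistency loss for one unlabeled example (illustrative sketch)."""
    # Confidence-based masking: skip examples the model is unsure about.
    if max(softmax(orig_logits)) < uda_confidence_thresh:
        return 0.0
    # Softmax temperature control: sharpen the target, treated as a constant.
    target = softmax(orig_logits, uda_softmax_temp)
    prediction = softmax(aug_logits)
    return kl_divergence(target, prediction)


def tsa_threshold(schedule, step, total_steps, num_classes):
    """Training signal annealing threshold, rising from 1/K to 1."""
    progress = step / float(total_steps)
    if schedule == "linear":
        alpha = progress
    elif schedule == "exp":
        alpha = math.exp((progress - 1.0) * 5.0)
    elif schedule == "log":
        alpha = 1.0 - math.exp(-progress * 5.0)
    else:
        raise ValueError("unknown TSA schedule: %s" % schedule)
    return alpha * (1.0 - 1.0 / num_classes) + 1.0 / num_classes


def total_loss(sup_loss, unsup_losses, unsup_coeff=1.0):
    """Supervised loss plus the weighted mean unsupervised loss."""
    return sup_loss + unsup_coeff * sum(unsup_losses) / len(unsup_losses)
```

In this reading, TSA drops a labeled example from the supervised loss once the model's probability for its correct class exceeds the current threshold, which limits overfitting to a small labeled set while training continues on unlabeled data.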


A large portion of the code is taken from BERT and AutoAugment. Thanks!


Please cite this paper if you use UDA.

@article{xie2019unsupervised,
  title={Unsupervised data augmentation},
  author={Xie, Qizhe and Dai, Zihang and Hovy, Eduard and Luong, Minh-Thang and Le, Quoc V},
  journal={arXiv preprint arXiv:1904.12848},
  year={2019}
}

This is not an officially supported Google product.
