
<a href="https://colab.research.google.com/github/google-research/albert/blob/master/albert_glue_fine_tuning_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# @title Copyright 2020 The ALBERT Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

# ALBERT End to End (Fine-tuning + Predicting) with Cloud TPU

## Overview

ALBERT is "A Lite" version of BERT, a popular unsupervised language representation learning algorithm. ALBERT uses parameter-reduction techniques that allow for large-scale configurations, overcome previous memory limitations, and achieve better behavior with respect to model degradation.

For a technical description of the algorithm, see our paper:

https://arxiv.org/abs/1909.11942

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut

This Colab demonstrates using a free Colab Cloud TPU to fine-tune GLUE tasks built on top of pretrained ALBERT models and 
run predictions with the tuned model. The Colab demonstrates loading pretrained ALBERT models from both [TF Hub](https://www.tensorflow.org/hub) and checkpoints.

**Note:** You will need a GCP (Google Cloud Platform) account and a GCS (Google Cloud 
Storage) bucket for this Colab to run.

Please follow the [Google Cloud TPU quickstart](https://cloud.google.com/tpu/docs/quickstart) to create a GCP account and a GCS bucket. You have [$300 free credit](https://cloud.google.com/free/) to get started with any GCP product. You can learn more about Cloud TPU at https://cloud.google.com/tpu/docs.

This notebook is hosted on GitHub. To view it in its original repository, after opening the notebook, select **File > View on GitHub**.

### Instructions

<h3><a href="https://cloud.google.com/tpu/"><img valign="middle" src="https://raw.githubusercontent.com/GoogleCloudPlatform/tensorflow-without-a-phd/master/tensorflow-rl-pong/images/tpu-hexagon.png" width="50"></a>  &nbsp;&nbsp;Train on TPU</h3>

   1. Create a Cloud Storage bucket for your TensorBoard logs at http://console.cloud.google.com/storage and fill in the BUCKET parameter in the "Parameters" section below.
 
   1. On the main menu, click Runtime and select **Change runtime type**. Set "TPU" as the hardware accelerator.
   1. Click Runtime again and select **Run all** (watch out: the "Colab-only auth for this notebook and the TPU" cell requires user input). You can also run the cells manually with Shift-ENTER.

### Set up your TPU environment

In this section, you perform the following tasks:

*   Set up a Colab TPU running environment
*   Verify that you are connected to a TPU device
*   Upload your credentials to the TPU so that it can access your GCS bucket

In [None]:
# TODO(lanzhzh): Add support for 2.x.
%tensorflow_version 1.x
import os
import pprint
import json
import tensorflow as tf

assert "COLAB_TPU_ADDR" in os.environ, "ERROR: Not connected to a TPU runtime; please see the first cell in this notebook for instructions!"
TPU_ADDRESS = "grpc://" + os.environ["COLAB_TPU_ADDR"] 
TPU_TOPOLOGY = "2x2"
print("TPU address is", TPU_ADDRESS)

from google.colab import auth
auth.authenticate_user()
with tf.Session(TPU_ADDRESS) as session:
  print('TPU devices:')
  pprint.pprint(session.list_devices())

  # Upload credentials to TPU.
  with open('/content/adc.json', 'r') as f:
    auth_info = json.load(f)
  tf.contrib.cloud.configure_gcs(session, credentials=auth_info)
  # Now credentials are set for all future sessions on this TPU.

### Prepare and import ALBERT modules
With your environment configured, you can now prepare and import the ALBERT modules. The following step clones the source code from GitHub.

In [2]:
#TODO(lanzhzh): Add pip support
import sys

!test -d albert || git clone https://github.com/google-research/albert albert
if 'albert' not in sys.path:
  sys.path.append('albert')
  
!pip install sentencepiece


Cloning into 'albert'...
remote: Enumerating objects: 367, done.
remote: Counting objects: 100% (14/14), done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 367 (delta 5), reused 6 (delta 3), pack-reused 353
Receiving objects: 100% (367/367), 262.23 KiB | 3.50 MiB/s, done.
Resolving deltas: 100% (237/237), done.
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.96


## Prepare for training

This next section of code performs the following tasks:

*  Specify the GCS bucket and create an output directory for model checkpoints and eval results.
*  Specify the task and download the training data.
*  Specify the pretrained ALBERT model.





In [4]:
#Download GLUE data
!git clone https://github.com/nyu-mll/GLUE-baselines download_glue

GLUE_DIR='glue_data'
!python download_glue/download_glue_data.py --data_dir $GLUE_DIR --tasks all

Cloning into 'download_glue'...
remote: Enumerating objects: 891, done.
remote: Counting objects: 100% (5/5), done.
remote: Compressing objects: 100% (5/5), done.
remote: Total 891 (delta 1), reused 2 (delta 0), pack-reused 886
Receiving objects: 100% (891/891), 1.48 MiB | 9.17 MiB/s, done.
Resolving deltas: 100% (610/610), done.
Downloading and extracting CoLA...
	Completed!
Downloading and extracting SST...
	Completed!
Processing MRPC...
	Error downloading standard development IDs for MRPC. You will need to manually split your data.
Downloading and extracting QQP...
	Completed!
Downloading and extracting STS...
	Completed!
Downloading and extracting MNLI...
	Note (12/10/20): This script no longer downloads SNLI. You will need to manually download and format the data to use SNLI.
	Completed!
Downloading and extracting QNLI...
	Completed!
Downloading and extracting RTE...
	Completed!
Downloading and extracting WNLI...
	Completed!
Downloading and extracting diagnostic...
	Co

In [24]:
# The full list of tasks and their fine-tuning hyperparameters is at
# https://github.com/google-research/albert/blob/master/run_glue.sh

BUCKET = "luanps" #@param { type: "string" }
TASK = 'RTE' #@param {type:"string"}
# Available pretrained model checkpoints:
#   base, large, xlarge, xxlarge
ALBERT_MODEL = 'base' #@param {type:"string"}

TASK_DATA_DIR = 'glue_data'

BASE_DIR = "gs://" + BUCKET
if not BASE_DIR or BASE_DIR == "gs://":
  raise ValueError("You must enter a BUCKET.")
DATA_DIR = os.path.join(BASE_DIR, "data")
MODELS_DIR = os.path.join(BASE_DIR, "models")
OUTPUT_DIR = 'gs://{}/albert-tfhub/models/{}'.format(BUCKET, TASK)
tf.gfile.MakeDirs(OUTPUT_DIR)
print('***** Model output directory: {} *****'.format(OUTPUT_DIR))

# Download glue data.
#! test -d download_glue_repo || git clone https://gist.github.com/60c2bdb54d156a41194446737ce03e2e.git download_glue_repo
#!python download_glue_repo/download_glue_data.py --data_dir=$TASK_DATA_DIR --tasks=$TASK
#print('***** Task data directory: {} *****'.format(TASK_DATA_DIR))

ALBERT_MODEL_HUB = 'https://tfhub.dev/google/albert_' + ALBERT_MODEL + '/3'

***** Model output directory: gs://luanps/albert-tfhub/models/RTE *****
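
The parameter cell above concatenates user input straight into a GCS path and a TF Hub handle. As a minimal sketch (the helper name `build_paths` and the tuple `VALID_ALBERT_SIZES` are hypothetical, not part of ALBERT or this notebook), the same strings could be built with an early check that the model size is one of the four listed in the cell's comment:

```python
# Hypothetical helper for illustration; the names are not part of ALBERT.
VALID_ALBERT_SIZES = ("base", "large", "xlarge", "xxlarge")


def build_paths(bucket, task, albert_model):
    """Return (output_dir, hub_handle) after validating the inputs."""
    if not bucket:
        raise ValueError("You must enter a BUCKET.")
    if albert_model not in VALID_ALBERT_SIZES:
        raise ValueError(
            "ALBERT_MODEL must be one of {}, got {!r}".format(
                VALID_ALBERT_SIZES, albert_model))
    # Same string formats as the notebook cell.
    output_dir = "gs://{}/albert-tfhub/models/{}".format(bucket, task)
    hub_handle = "https://tfhub.dev/google/albert_{}/3".format(albert_model)
    return output_dir, hub_handle


# "my-bucket" is a placeholder bucket name.
output_dir, hub_handle = build_paths("my-bucket", "RTE", "base")
print(output_dir)
print(hub_handle)
```

Failing early on a misspelled model size avoids discovering the problem only after the TPU job has started.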


Now let's run the fine-tuning scripts. With the default MRPC task, this should finish in around 10 minutes with an accuracy of around 86.5. The cells below use the `RTE` task selected above.

## Choose hyperparameters using [Optuna](https://optuna.readthedocs.io/en/stable/index.html)
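
The objective defined later in this section draws integer hyperparameters from stepped ranges and the learning rate from a log-uniform range. The following is a plain-Python sketch of what those two kinds of search space contain; it does not call Optuna, the helper names are hypothetical, and it assumes that the stepped integer ranges include their upper bound, which is how I understand Optuna's `suggest_int` to behave.

```python
import math
import random


def stepped_int_values(low, high, step):
    """List the candidates of a stepped integer range, upper bound included."""
    return list(range(low, high + 1, step))


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed so the sketch is repeatable

warmup_choices = stepped_int_values(5, 15, 5)
batch_choices = stepped_int_values(16, 128, 16)
learning_rate = sample_log_uniform(1e-5, 5e-5, rng)

print("warmup_steps candidates:", warmup_choices)
print("batch_size candidates:", batch_choices)
print("sampled learning_rate:", learning_rate)
```

Sampling in log space spreads the draws evenly across orders of magnitude, which is the usual choice for learning rates.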

In [29]:
# Install the Optuna optimization library
!pip install optuna

Installing collected packages: pyperclip, pbr, stevedore, Mako, cmd2, autopage, colorlog, cmaes, cliff, alembic, optuna
Successfully installed Mako-1.1.6 alembic-1.7.5 autopage-0.4.0 cliff-3.10.0 cmaes-0.8.2 cmd2-2.3.3 colorlog-6.6.0 optuna-2.10.0 pbr-5.8.0 pyperclip-1.8.2 stevedore-3.5.0


In [59]:
import optuna
import uuid

In [104]:
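def get_last_acc_from_file(result_file):
    """Return the last 'eval_accuracy' value found in an eval results file."""
    acc = None
    with open(result_file, 'r') as f:
        for line in f:
            if 'eval_accuracy' in line:
                # Lines have the form "eval_accuracy = <value>".
                _, value = line.split(' = ')
                acc = float(value)
    if acc is None:
        raise ValueError(f'No eval_accuracy entry found in {result_file}')
    return acc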
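
The helper above assumes that `eval_results.txt` holds one `key = value` pair per line. As a sketch under that assumption, the hypothetical function `parse_eval_results` below reads every numeric metric into a dictionary; the sample file contents are made up for illustration and are not real results.

```python
import os
import tempfile


def parse_eval_results(path):
    """Parse 'key = value' lines into a dict of floats (hypothetical helper)."""
    metrics = {}
    with open(path, "r") as f:
        for line in f:
            if " = " not in line:
                continue  # skip lines that are not key/value pairs
            key, value = line.strip().split(" = ", 1)
            try:
                metrics[key] = float(value)
            except ValueError:
                pass  # ignore non-numeric values
    return metrics


# Made-up sample contents, written to a temporary file for the demo.
sample = "eval_accuracy = 0.75\nglobal_step = 30\nloss = 0.5\n"
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "eval_results.txt")
    with open(sample_path, "w") as f:
        f.write(sample)
    metrics = parse_eval_results(sample_path)

print(metrics["eval_accuracy"])
```

Returning the whole dictionary makes it easy to optimize a different metric later without changing the parser.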

In [137]:
float(5e5)*10

5000000.0

In [127]:
def objective(trial):

    #hyperparameter setting: RTE task
    warmup_steps = trial.suggest_int('warmup_steps', 5,15,5)#100, 500,100)
    train_steps = trial.suggest_int('train_steps', 10,100,10) #400, 2000,100)
    learning_rate = trial.suggest_loguniform("learning_rate", 1e-5, 5e-5)
    batch_size = trial.suggest_int('batch_size', 16, 128,16)

    #Tmp config
    trial_id = str(uuid.uuid4()).split('-')[0]
    OUTPUT_TMP = f'{OUTPUT_DIR}/{trial_id}'
    os.environ['TFHUB_CACHE_DIR'] = OUTPUT_TMP

    !python -m albert.run_classifier \
            --data_dir="glue_data/" \
            --output_dir=$OUTPUT_TMP \
            --albert_hub_module_handle=$ALBERT_MODEL_HUB \
            --spm_model_file="from_tf_hub" \
            --do_train=True \
            --do_eval=True \
            --do_predict=False \
            --max_seq_length=512 \
            --optimizer=adamw \
            --task_name=$TASK \
            --warmup_step=$warmup_steps \
            --learning_rate=$learning_rate \
            --train_step=$train_steps \
            --save_checkpoints_steps=100 \
            --train_batch_size=$batch_size \
            --tpu_name=$TPU_ADDRESS \
            --use_tpu=True

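    # Download this trial's results into the working directory, then read the
    # accuracy. Without a destination, gsutil cp fails and a stale local
    # eval_results.txt from an earlier trial would be read instead.
    !gsutil cp $OUTPUT_TMP/eval_results.txt .
    model_acc = get_last_acc_from_file('eval_results.txt')
    return model_acc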
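
The `!python -m albert.run_classifier` call above relies on IPython shell magic, which only works inside a notebook. A minimal sketch of building the same flags as a Python argument list follows; `build_classifier_command` is a hypothetical helper, the bucket path and TPU address are placeholders, and nothing is executed here.

```python
def build_classifier_command(output_dir, hub_handle, task, tpu_address,
                             warmup_steps, train_steps, learning_rate,
                             batch_size):
    """Return the run_classifier invocation as an argument list.

    Hypothetical helper: the flags mirror the notebook cell above.
    """
    return [
        "python", "-m", "albert.run_classifier",
        "--data_dir=glue_data/",
        f"--output_dir={output_dir}",
        f"--albert_hub_module_handle={hub_handle}",
        "--spm_model_file=from_tf_hub",
        "--do_train=True",
        "--do_eval=True",
        "--do_predict=False",
        "--max_seq_length=512",
        "--optimizer=adamw",
        f"--task_name={task}",
        f"--warmup_step={warmup_steps}",
        f"--learning_rate={learning_rate}",
        f"--train_step={train_steps}",
        "--save_checkpoints_steps=100",
        f"--train_batch_size={batch_size}",
        f"--tpu_name={tpu_address}",
        "--use_tpu=True",
    ]


# Placeholder values for illustration only.
cmd = build_classifier_command(
    output_dir="gs://my-bucket/albert-tfhub/models/RTE/abc123",
    hub_handle="https://tfhub.dev/google/albert_base/3",
    task="RTE",
    tpu_address="grpc://10.0.0.2:8470",
    warmup_steps=10,
    train_steps=30,
    learning_rate=2e-5,
    batch_size=32,
)
print(" ".join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would raise on a non-zero exit code, so a failed training run is not silently scored.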

In [128]:
#Run Optuna optimization
study = optuna.create_study(direction='maximize',study_name=TASK)
study.optimize(objective, n_trials=2)

[I 2022-01-09 00:35:02,612] A new study created in memory with name: RTE


Streaming output truncated to the last 5000 lines.
I0109 00:35:36.142469 139639740966784 tokenization.py:237] using sentence piece tokenzier.
INFO:tensorflow:using sentence piece tokenzier.
I0109 00:35:36.143696 139639740966784 tokenization.py:237] using sentence piece tokenzier.
INFO:tensorflow:using sentence piece tokenzier.
I0109 00:35:36.144931 139639740966784 tokenization.py:237] using sentence piece tokenzier.
INFO:tensorflow:using sentence piece tokenzier.
I0109 00:35:36.146044 139639740966784 tokenization.py:237] using sentence piece tokenzier.
INFO:tensorflow:using sentence piece tokenzier.
I0109 00:35:36.147236 139639740966784 tokenization.py:237] using sentence piece tokenzier.
INFO:tensorflow:using sentence piece tokenzier.
I0109 00:35:36.148418 139639740966784 tokenization.py:237] using sentence piece tokenzier.
INFO:tensorflow:using sentence piece tokenzier.
I0109 00:35:36.149438 139639740966784 tokenization.py:237] using sentence piece tokenzier.
IN

[I 2022-01-09 00:39:06,572] Trial 0 finished with value: 0.7545126 and parameters: {'warmup_steps': 10, 'train_steps': 30, 'learning_rate': 0.038339760285950646, 'batch_size': 80}. Best is trial 0 with value: 0.7545126.


Streaming output truncated to the last 5000 lines.
I0109 00:39:38.714227 139651685500800 tokenization.py:237] using sentence piece tokenzier.
INFO:tensorflow:using sentence piece tokenzier.
I0109 00:39:38.715208 139651685500800 tokenization.py:237] using sentence piece tokenzier.
INFO:tensorflow:using sentence piece tokenzier.
I0109 00:39:38.716257 139651685500800 tokenization.py:237] using sentence piece tokenzier.
INFO:tensorflow:using sentence piece tokenzier.
I0109 00:39:38.717212 139651685500800 tokenization.py:237] using sentence piece tokenzier.
INFO:tensorflow:using sentence piece tokenzier.
I0109 00:39:38.718232 139651685500800 tokenization.py:237] using sentence piece tokenzier.
INFO:tensorflow:using sentence piece tokenzier.
I0109 00:39:38.719267 139651685500800 tokenization.py:237] using sentence piece tokenzier.
INFO:tensorflow:using sentence piece tokenzier.
I0109 00:39:38.720203 139651685500800 tokenization.py:237] using sentence piece tokenzier.
IN

[I 2022-01-09 00:43:24,428] Trial 1 finished with value: 0.7545126 and parameters: {'warmup_steps': 5, 'train_steps': 60, 'learning_rate': 20.461353439247414, 'batch_size': 112}. Best is trial 0 with value: 0.7545126.


In [130]:
study.trials_dataframe()

|   | number | value | datetime_start | datetime_complete | duration | params_batch_size | params_learning_rate | params_train_steps | params_warmup_steps | state |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.754513 | 2022-01-09 00:35:02.616274 | 2022-01-09 00:39:06.571024 | 0 days 00:04:03.954750 | 80 | 0.03834 | 30 | 10 | COMPLETE |
| 1 | 1 | 0.754513 | 2022-01-09 00:39:06.579229 | 2022-01-09 00:43:24.427390 | 0 days 00:04:17.848161 | 112 | 20.461353 | 60 | 5 | COMPLETE |


In [108]:
#Pack Optuna results and save to Bucket
import joblib

study_file = f'{TASK}_study.pkl'
joblib.dump(study, study_file)
!gsutil cp $study_file $OUTPUT_DIR

Copying file://RTE_study.pkl [Content-Type=application/octet-stream]...
/ [1 files][  7.8 KiB/  7.8 KiB]                                                
Operation completed over 1 objects/7.8 KiB.                                      
