# Classify text with BERT

BERT and other Transformer encoder architectures have been highly successful on a variety of NLP tasks. They compute vector-space representations of natural language that are suitable for use in deep learning models. The BERT family of models uses the Transformer encoder architecture to process each token of input text in the full context of all tokens before and after it.
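
To make "full context" concrete, here is a minimal sketch in plain Python (not part of the original notebook). It contrasts which positions each token may attend to in a bidirectional encoder such as BERT with a left-to-right (causal) model. The helper `attention_mask` and the example tokens are hypothetical, and padding masks are ignored for simplicity.

```python
def attention_mask(num_tokens, bidirectional=True):
    """Return a num_tokens x num_tokens grid of 0/1 flags.

    mask[i][j] == 1 means token i is allowed to attend to token j.
    """
    return [
        [1 if (bidirectional or j <= i) else 0 for j in range(num_tokens)]
        for i in range(num_tokens)
    ]


# Hypothetical example sentence, already split into tokens.
tokens = ["the", "movie", "was", "great"]

# Encoder-style: every token sees tokens before and after it.
encoder_mask = attention_mask(len(tokens), bidirectional=True)
# Causal: a token only sees itself and the tokens before it.
causal_mask = attention_mask(len(tokens), bidirectional=False)

for name, mask in [("encoder", encoder_mask), ("causal", causal_mask)]:
    print(name)
    for token, row in zip(tokens, mask):
        print(f"  {token:>6}: {row}")
```

In the encoder mask every row is all ones, which is the property the paragraph above describes.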

In [None]:
#@title ##Install dependencies
#@markdown - tensorflow-text: text preprocessing
#@markdown - tf-models-official: TensorFlow Model Garden utilities (used here for `official.nlp.optimization`)
%%capture --no-stderr
!pip install -Uqq tensorflow-text
!pip install -Uqq tf-models-official

In [None]:
#@title ##Import packages
import os
import shutil
import uuid

import matplotlib.pyplot as plt
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
from google.colab import auth, data_table
from official.nlp import optimization
from tensorflow.keras.preprocessing import text_dataset_from_directory

tf.get_logger().setLevel("ERROR")

print("tensorflow:", tf.__version__)
print("tensorflow_hub:", hub.__version__)

tensorflow: 2.4.0
tensorflow_hub: 0.10.0


In [None]:
#@title ## Google Cloud Setup
#@markdown - Authenticate user
#@markdown - Project config
#@markdown - Create dataset bucket
project_id = "gcp project id" #@param {type:"string"}

auth.authenticate_user()

# !gcloud config set project {project_id}

# bucket_name = f"bert-text-cls-{uuid.uuid1()}"
# !gsutil mb gs://{bucket_name}

# with open("to_upload.txt", "w") as f:
#     f.write("my sample file")

# !gsutil cp to_upload.txt gs://{bucket_name}
# !gsutil cat gs://{bucket_name}/to_upload.txt
# !gsutil rm -f -r gs://{bucket_name}

Updated property [core/project].
Creating gs://bert-text-cls-7bd7f590-4e04-11eb-a9f7-0242ac1c0002/...
Copying file://to_upload.txt [Content-Type=text/plain]...
/ [1 files][   14.0 B/   14.0 B]                                                
Operation completed over 1 objects/14.0 B.                                       
my sample fileRemoving gs://bert-text-cls-7bd7f590-4e04-11eb-a9f7-0242ac1c0002/to_upload.txt#1609706677141184...
/ [1 objects]                                                                   
Operation completed over 1 objects.                                              
Removing gs://bert-text-cls-7bd7f590-4e04-11eb-a9f7-0242ac1c0002/...


In [None]:
#@title Download Dataset - Large Movie Review Dataset
#@markdown [Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/) 
#@markdown that contains the text of 50,000 movie reviews from the 
#@markdown [Internet Movie Database](https://www.imdb.com/).

url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
dataset = tf.keras.utils.get_file(
    "aclImdb_v1.tar.gz", url, untar=True, cache_dir=".", cache_subdir=""
)
dataset_dir = os.path.join(os.path.dirname(dataset), "aclImdb")
train_dir = os.path.join(dataset_dir, "train")
print("Train dataset:", train_dir)

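# Remove the unlabeled `unsup` folder so that only the `pos` and `neg`
# class folders remain. `ignore_errors=True` keeps the cell safe to re-run.
remove_dir = os.path.join(train_dir, "unsup")
shutil.rmtree(remove_dir, ignore_errors=True)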


Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Train dataset: ./aclImdb/train


The IMDB dataset has already been divided into train and test, but it lacks a validation set. Let's create a validation set using an 80:20 split of the training data by using `validation_split`. 

>When using `validation_split` and `subset`, either specify a random `seed` or pass `shuffle=False`, so that the validation and training splits have no overlap.
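
The note above can be illustrated with a small standalone sketch. It is not the Keras implementation; it only mirrors the idea of shuffling the file list deterministically and slicing it, so that two separate calls with the same seed produce complementary subsets. The helper `split_files` and the file names are hypothetical.

```python
import random


def split_files(paths, validation_split, subset, seed):
    """Hypothetical helper: shuffle deterministically, then slice."""
    shuffled = sorted(paths)               # start from a stable order
    random.Random(seed).shuffle(shuffled)  # same seed -> same permutation
    num_val = int(validation_split * len(shuffled))
    cut = len(shuffled) - num_val
    if subset == "training":
        return shuffled[:cut]
    if subset == "validation":
        return shuffled[cut:]
    raise ValueError("subset must be 'training' or 'validation'")


# Made-up file names standing in for the review files.
files = [f"review_{i}.txt" for i in range(100)]

# Two independent calls, as in the notebook, sharing one seed.
train = split_files(files, 0.2, "training", seed=42)
val = split_files(files, 0.2, "validation", seed=42)

print("train:", len(train), "validation:", len(val))
print("overlap:", len(set(train) & set(val)))
```

If the two calls used different seeds, each would slice a different permutation, and some files could land in both subsets.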

In [None]:
AUTOTUNE = tf.data.experimental.AUTOTUNE
batch_size = 32
seed = 42

def optimize_dataset(dataset):
    return dataset.cache().prefetch(buffer_size=AUTOTUNE)

raw_train_ds = text_dataset_from_directory(
    "aclImdb/train",
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed
)

class_names = raw_train_ds.class_names
train_ds = optimize_dataset(raw_train_ds)

val_ds = text_dataset_from_directory(
    "aclImdb/train",
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',
    seed=seed
)
val_ds = optimize_dataset(val_ds)

test_ds = text_dataset_from_directory(
    "aclImdb/test",
    batch_size=batch_size
)
test_ds = optimize_dataset(test_ds)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
Found 25000 files belonging to 2 classes.


In [None]:
text_batch, label_batch = next(train_ds.as_numpy_iterator())
df = pd.DataFrame({"text": text_batch, "label": label_batch})
data_table.DataTable(df, include_index=False, num_rows_per_page=3)

|   | text | label |
|---|------|-------|
| 0 | `b"The idea ia a very short film with a lot of ...` | 1 |
| 1 | `b"When the Chamberlain family is camping near ...` | 1 |
| 2 | `b'It is rare that one comes across a movie as ...` | 1 |
| 3 | `b'Best Stephen King film alongside IT, though ...` | 1 |
| 4 | `b'Silent Night, Deadly Night 5 is the very las...` | 0 |
| 5 | `b"Marjorie, a young woman who works in a museu...` | 1 |
| 6 | `b"This movie scared the crap out of me! I have...` | 1 |
| 7 | `b"Horrible film with bits of the Ramones strew...` | 0 |
| 8 | `b'This is really a new low in entertainment. E...` | 0 |
| 9 | `b"The filming is pleasant and the environment ...` | 1 |


## BERT models overview

These BERT model families are currently available on TensorFlow Hub:

- `BERT-Base`, `Uncased`: the original BERT models.
- `Small BERTs` keep the original architecture but use fewer and/or smaller Transformer blocks.
- `ALBERT` reduces the model size by sharing parameters between layers. It does not reduce processing time.
- `BERT Experts` offer a choice of domain-specific pre-trained models.
- `Electra` is pre-trained as a discriminator, in a setup that resembles a GAN. (A must try!)
- `BERT with talking-heads Attention` modifies the attention mechanism at the core of the Transformer architecture.
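
The ALBERT entry can be made concrete with some rough arithmetic. The sketch below is an approximation rather than an exact model inventory: it counts only the weights and biases inside the encoder blocks (embeddings are ignored) and assumes a BERT-Base-like configuration of 12 layers, hidden size 768, and a feed-forward size of 3072, which is not stated in this notebook. The function names are hypothetical.

```python
def encoder_layer_params(hidden_size, intermediate_size):
    """Approximate weights and biases in one Transformer encoder block."""
    # Query, key, value, and output projections (weights + biases).
    attention = 4 * (hidden_size * hidden_size + hidden_size)
    # Two dense layers: hidden -> intermediate -> hidden.
    feed_forward = (
        hidden_size * intermediate_size + intermediate_size
        + intermediate_size * hidden_size + hidden_size
    )
    # Two layer norms, each with a scale and a shift vector.
    layer_norms = 2 * (2 * hidden_size)
    return attention + feed_forward + layer_norms


def encoder_stack_params(num_layers, hidden_size, intermediate_size,
                         share_layers=False):
    """Parameters stored for the whole stack, with or without sharing."""
    per_layer = encoder_layer_params(hidden_size, intermediate_size)
    return per_layer if share_layers else num_layers * per_layer


unshared = encoder_stack_params(12, 768, 3072)
shared = encoder_stack_params(12, 768, 3072, share_layers=True)

print(f"separate weights per layer : {unshared:,}")
print(f"weights shared by all layers: {shared:,}")
```

Sharing shrinks what has to be stored, but the forward pass still runs all 12 blocks, which is why the processing time does not improve.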

### Game plan

1. Start with a smaller model (`sm BERT uncased`), since smaller models are faster to train.
2. Upgrade to ALBERT if you need higher accuracy.
3. Models such as Electra, Talking Heads, or BERT Experts are the next options for improving accuracy.




In [34]:
#@title Choose a BERT model to fine-tune

bert_model_name = "sm BERT uncased" #@param ["sm BERT uncased", "albert", "sm electra", "talking heads"]

map_name_to_handle = {
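    # NOTE: despite the "sm" label, this handle is the full-size BERT-Base
    # encoder (12 layers, hidden size 768, 12 attention heads).
    'sm BERT uncased':
        'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3',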
    'albert':
        'https://tfhub.dev/tensorflow/albert_en_base/2',
    'sm electra':
        'https://tfhub.dev/google/electra_small/2',
    'talking heads':
        'https://tfhub.dev/tensorflow/talkheads_ggelu_bert_en_base/1',
}

map_model_to_preprocess = {
    'sm BERT uncased':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/2',
    'albert':
        'https://tfhub.dev/tensorflow/albert_en_preprocess/2',
    'sm electra':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/2',
    'talking heads':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/2',
}

tfhub_handle_encoder = map_name_to_handle[bert_model_name]
tfhub_handle_preprocess = map_model_to_preprocess[bert_model_name]

print(f'BERT model selected           : {tfhub_handle_encoder}')
print(f'Preprocess model auto-selected: {tfhub_handle_preprocess}')

BERT model selected           : https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3
Preprocess model auto-selected: https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/2
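
Because each encoder must be paired with a matching preprocessing model, it can help to guard the lookup. The following is a minimal sketch, assuming a hypothetical helper named `resolve_handles`. It copies two entries from the dictionaries above so that it runs on its own, checks that both dictionaries cover the same model names, and fails with a readable message for an unknown name.

```python
# Copies of two entries from the notebook's dictionaries, so the sketch
# is self-contained.
map_name_to_handle = {
    'sm BERT uncased':
        'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3',
    'albert':
        'https://tfhub.dev/tensorflow/albert_en_base/2',
}
map_model_to_preprocess = {
    'sm BERT uncased':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/2',
    'albert':
        'https://tfhub.dev/tensorflow/albert_en_preprocess/2',
}


def resolve_handles(model_name):
    """Hypothetical helper: return (encoder_handle, preprocess_handle)."""
    # Every encoder needs a preprocessing entry, and vice versa.
    unpaired = set(map_name_to_handle) ^ set(map_model_to_preprocess)
    if unpaired:
        raise KeyError(f"Models without a matching pair: {sorted(unpaired)}")
    if model_name not in map_name_to_handle:
        raise ValueError(
            f"Unknown model {model_name!r}; "
            f"choose one of {sorted(map_name_to_handle)}"
        )
    return map_name_to_handle[model_name], map_model_to_preprocess[model_name]


encoder_handle, preprocess_handle = resolve_handles("albert")
print("encoder   :", encoder_handle)
print("preprocess:", preprocess_handle)
```

A typo in the model name then raises a `ValueError` listing the valid choices instead of a bare `KeyError`.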
