# TR Text Classification

## BERT and Tensorflow Hub

Things to do:

* Find the maximum number of tokens per title in the dataset.
* Investigate using the preprocessing layer https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3
* Add a learning rate schedule.


## Import Packages

In [1]:
#!git clone --depth 1 -b v2.3.0 https://github.com/tensorflow/models.git

In [2]:
# install requirements to use tensorflow/models repository
#!pip install -Uqr models/official/requirements.txt

In [3]:
import sys
sys.path.append('models')

import pandas as pd
from sklearn.model_selection import train_test_split

import tensorflow as tf
import tensorflow_hub as hub

from official.nlp.bert import tokenization
from official.nlp.data import classifier_data_lib  # used later by to_feature()

import warnings
warnings.filterwarnings("ignore")



In [4]:
print("TF Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("Hub version: ", hub.__version__)
print("GPU is", "available" if tf.config.experimental.list_physical_devices("GPU") else "NOT AVAILABLE")

TF Version:  2.5.0
Eager mode:  True
Hub version:  0.12.0
GPU is available



## Import Data

In [5]:
df_model = pd.read_csv("./1-Title_Classification/train1.csv")

## EDA

In [6]:
# Drop the ID column
df_model = df_model.drop(columns = ['ID'])

In [7]:
# distribution of topics

df_model['TOPIC'].value_counts()

0    3107
1    2406
2    2404
Name: TOPIC, dtype: int64
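
The counts above describe a mildly imbalanced three-class problem. As a quick sanity check, the sketch below turns them into class shares and into the accuracy of always predicting the most frequent topic. The variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above.
topic_counts = {0: 3107, 1: 2406, 2: 2404}

total = sum(topic_counts.values())
class_share = {topic: count / total for topic, count in topic_counts.items()}

# Always predicting the most frequent topic yields this accuracy,
# a floor that a fine-tuned model should clearly exceed.
majority_baseline = max(class_share.values())

for topic, share in class_share.items():
    print(f"topic {topic}: {share:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

Since the imbalance is small, plain accuracy remains a reasonable first metric here.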

In [8]:
df_model.head()

Unnamed: 0,TITLE,TOPIC
0,RITE AID CORP <RAD> SETS DIVIDEND,0
1,DEL E. WEBB INVESTMENT <DWPA> 4TH QTR NET,0
2,GENERAL HOST CORP <GH> SETS QUARTERLY,0
3,PROFESSOR LIFTS BANC TEXAS <BTX> PREFERRED STAKE,1
4,WINCHELL'S DONUT <WDH> SETS INITIAL QUARTERLY,0


In [9]:
df_model['TITLE'] = df_model['TITLE'].str.lower()

In [10]:
df_model.head()

Unnamed: 0,TITLE,TOPIC
0,rite aid corp <rad> sets dividend,0
1,del e. webb investment <dwpa> 4th qtr net,0
2,general host corp <gh> sets quarterly,0
3,professor lifts banc texas <btx> preferred stake,1
4,winchell's donut <wdh> sets initial quarterly,0


In [11]:
df_model.loc[0, 'TITLE']

'rite aid corp <rad> sets dividend'

In [12]:
# Longest title length in characters (not tokens)

df_model["TITLE"].str.len().max()

135
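
Character length is only a rough proxy for the sequence length BERT sees, which is counted in WordPiece tokens plus the two special tokens `[CLS]` and `[SEP]`. A minimal sketch of the token-count check from the to-do list follows. `max_token_count` is a hypothetical helper, and `str.split` stands in for the real tokenizer so that the snippet runs on its own; in the notebook, `tokenizer.tokenize` would be passed instead.

```python
from typing import Callable, Iterable, List


def max_token_count(texts: Iterable[str],
                    tokenize: Callable[[str], List[str]],
                    num_special_tokens: int = 2) -> int:
    """Return the longest tokenized length, counting [CLS] and [SEP]."""
    return max(len(tokenize(text)) for text in texts) + num_special_tokens


# Sample titles taken from df_model.head(); str.split is only a placeholder tokenizer.
sample_titles = [
    "rite aid corp <rad> sets dividend",
    "del e. webb investment <dwpa> 4th qtr net",
    "professor lifts banc texas <btx> preferred stake",
]
longest = max_token_count(sample_titles, str.split)
print(longest)
```

If the real maximum turns out to be well below 128, `max_seq_length` could probably be lowered to reduce padding and speed up training.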

## Create tf.data.Datasets for Training and Evaluation

In [13]:
df_train, df_valid = train_test_split(df_model,
                                      train_size=0.8,
                                      stratify=df_model['TOPIC'],
                                      random_state=42
                                     )

In [14]:
with tf.device('/cpu:0'):
    train_data = tf.data.Dataset.from_tensor_slices((df_train['TITLE'].values, df_train['TOPIC'].values))
    valid_data = tf.data.Dataset.from_tensor_slices((df_valid['TITLE'].values, df_valid['TOPIC'].values))


In [15]:
for text, label in train_data.take(1):
    print(text)
    print(label)

tf.Tensor(b'teeco properties l.p. oper income down', shape=(), dtype=string)
tf.Tensor(0, shape=(), dtype=int64)


## Download a Pre-trained BERT Model from TensorFlow Hub

Data preprocessing transforms each text into the three BERT input features: `input_word_ids`, `input_mask`, and `segment_ids` (passed to the model as `input_type_ids`).

In [16]:
label_list = [0, 1, 2] # Label categories
max_seq_length = 128 # maximum length of (token) input sequences
train_batch_size = 32

# Get the BERT layer and its tokenizer.
# More details here: https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4", trainable=True)
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)

In [17]:
tokenizer.wordpiece_tokenizer.tokenize('rite aid corp <rad> sets dividend')

['rite', 'aid', 'corp', '<', '##rad', '##>', 'sets', 'divide', '##nd']

In [18]:
tokenizer.convert_tokens_to_ids(tokenizer.wordpiece_tokenizer.tokenize('rite aid corp <rad> sets dividend'))

[14034, 4681, 13058, 1026, 12173, 29631, 4520, 11443, 4859]
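
To make the three input features concrete, the sketch below builds them by hand from a list of token ids. It assumes the standard uncased BERT vocabulary ids `[PAD]`=0, `[CLS]`=101, and `[SEP]`=102, and `build_bert_features` is a hypothetical helper written for illustration. In the notebook this work is delegated to `classifier_data_lib.convert_single_example`.

```python
from typing import Dict, List

PAD_ID, CLS_ID, SEP_ID = 0, 101, 102  # assumed ids from the uncased BERT vocab


def build_bert_features(token_ids: List[int], max_seq_length: int) -> Dict[str, List[int]]:
    """Add [CLS]/[SEP], truncate, and pad a single-sentence example."""
    # Reserve two positions for the special tokens.
    token_ids = token_ids[: max_seq_length - 2]
    input_word_ids = [CLS_ID] + token_ids + [SEP_ID]
    input_mask = [1] * len(input_word_ids)  # 1 = real token, 0 = padding
    num_padding = max_seq_length - len(input_word_ids)
    return {
        "input_word_ids": input_word_ids + [PAD_ID] * num_padding,
        "input_mask": input_mask + [0] * num_padding,
        "input_type_ids": [0] * max_seq_length,  # single segment, so all zeros
    }


# Token ids printed above for 'rite aid corp <rad> sets dividend'.
ids = [14034, 4681, 13058, 1026, 12173, 29631, 4520, 11443, 4859]
features = build_bert_features(ids, max_seq_length=16)
for name, values in features.items():
    print(name, values)
```

Note that `convert_single_example` calls `tokenizer.tokenize`, which splits punctuation before applying WordPiece, so the ids it produces for this title can differ from the ones printed above.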

## Tokenize and Preprocess Text for BERT

In [19]:
# Convert one (text, label) row into BERT input features and a label id

def to_feature(text, label, label_list=label_list, max_seq_length=max_seq_length, tokenizer=tokenizer):
    example = classifier_data_lib.InputExample(guid = None,
                                               text_a = text.numpy(), 
                                               text_b = None, 
                                               label = label.numpy())

    feature = classifier_data_lib.convert_single_example(0, example, label_list,
                                                         max_seq_length, tokenizer)
  
    return (feature.input_ids, feature.input_mask, feature.segment_ids, feature.label_id)

In [20]:
# Wrap a Python Function into a TensorFlow op for Eager Execution

def to_feature_map(text, label):
    input_ids, input_mask, segment_ids, label_id = tf.py_function(to_feature, inp=[text, label], 
                                Tout=[tf.int32, tf.int32, tf.int32, tf.int32])

    # tf.py_function does not set static shapes on its outputs, so set them explicitly.
    input_ids.set_shape([max_seq_length])
    input_mask.set_shape([max_seq_length])
    segment_ids.set_shape([max_seq_length])
    label_id.set_shape([])

    x = {
        'input_word_ids': input_ids,
        'input_mask': input_mask,
        'input_type_ids': segment_ids
        }
    
    return (x, label_id)

## Create a TensorFlow Input Pipeline with `tf.data`

In [21]:
with tf.device('/cpu:0'):
    # Training pipeline: featurize, shuffle, batch, and prefetch
    train_data = (train_data.map(to_feature_map,
                                 num_parallel_calls=tf.data.AUTOTUNE)
                  .shuffle(1000)
                  .batch(train_batch_size, drop_remainder=True)
                  .prefetch(tf.data.AUTOTUNE))

    # Validation pipeline: same steps without shuffling
    valid_data = (valid_data.map(to_feature_map,
                                 num_parallel_calls=tf.data.AUTOTUNE)
                  .batch(train_batch_size, drop_remainder=True)
                  .prefetch(tf.data.AUTOTUNE))
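
Because both pipelines use `drop_remainder=True`, any trailing partial batch is discarded, which matters more for evaluation than for training. The arithmetic can be sketched as follows. The row counts are assumptions derived from an 80/20 split of the 7,917 titles, and `batches_per_epoch` is a hypothetical helper.

```python
def batches_per_epoch(num_examples: int, batch_size: int, drop_remainder: bool = True) -> int:
    """Number of batches produced when a finite dataset is batched."""
    full, leftover = divmod(num_examples, batch_size)
    return full if drop_remainder or leftover == 0 else full + 1


# Assumed split sizes: roughly 80% / 20% of the 7,917 rows.
num_train, num_valid = 6333, 1584
batch_size = 32

for name, n in [("train", num_train), ("valid", num_valid)]:
    num_batches = batches_per_epoch(n, batch_size)
    dropped = n - num_batches * batch_size
    print(f"{name}: {num_batches} batches, {dropped} examples dropped")
```

Setting `drop_remainder=False` for the validation pipeline would evaluate on every held-out title, at the cost of a final batch with a different size.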

In [22]:
# data spec
train_data.element_spec

({'input_word_ids': TensorSpec(shape=(32, 128), dtype=tf.int32, name=None),
  'input_mask': TensorSpec(shape=(32, 128), dtype=tf.int32, name=None),
  'input_type_ids': TensorSpec(shape=(32, 128), dtype=tf.int32, name=None)},
 TensorSpec(shape=(32,), dtype=tf.int32, name=None))

In [23]:
# data spec
valid_data.element_spec

({'input_word_ids': TensorSpec(shape=(32, 128), dtype=tf.int32, name=None),
  'input_mask': TensorSpec(shape=(32, 128), dtype=tf.int32, name=None),
  'input_type_ids': TensorSpec(shape=(32, 128), dtype=tf.int32, name=None)},
 TensorSpec(shape=(32,), dtype=tf.int32, name=None))

## Add a Classification Head to the BERT Layer

In [26]:
# Build the classifier: BERT encoder + dropout + softmax head
def create_model():
    input_word_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name="input_word_ids")
    input_mask = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name="input_mask")
    input_type_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name="input_type_ids")

    # Version 4 of this TF Hub encoder takes a dict of inputs and returns a dict of outputs.
    # A list-style call such as bert_layer([ids, mask, type_ids]) raises
    # "Could not find matching function to call loaded from the SavedModel".
    encoder_inputs = {
        'input_word_ids': input_word_ids,
        'input_mask': input_mask,
        'input_type_ids': input_type_ids
    }
    pooled_output = bert_layer(encoder_inputs)['pooled_output']

    drop = tf.keras.layers.Dropout(0.4)(pooled_output)
    output = tf.keras.layers.Dense(len(label_list), activation="softmax", name="output")(drop)

    model = tf.keras.Model(inputs=encoder_inputs, outputs=output)
    return model
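
The head maps the pooled BERT vector to three class probabilities. Since the labels produced by the input pipeline are integer class ids rather than one-hot vectors, the per-example loss is the negative log of the probability assigned to the true class, which is what Keras calls sparse categorical cross-entropy. A pure-Python sketch of that computation, with made-up logits:

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one example's logits."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_cross_entropy(probs: List[float], label: int) -> float:
    """Loss for an integer class label: -log p[label]."""
    return -math.log(probs[label])


logits = [2.0, 0.5, -1.0]  # hypothetical head outputs for one title
probs = softmax(logits)
print(probs, sparse_cross_entropy(probs, label=0))
```

The loss is small when the true class receives most of the probability mass and grows without bound as that probability approaches zero.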

In [29]:
input_word_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name="input_word_ids")

In [30]:
input_mask = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name="input_mask")

In [31]:
input_type_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name="input_type_ids")

In [32]:
pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, input_type_ids])

ValueError: in user code:

    /home/lawrence/miniconda3/envs/nlp/lib/python3.9/site-packages/tensorflow_hub/keras_layer.py:237 call  *
        result = smart_cond.smart_cond(training,
    /home/lawrence/miniconda3/envs/nlp/lib/python3.9/site-packages/tensorflow/python/saved_model/load.py:670 _call_attribute  **
        return instance.__call__(*args, **kwargs)
    /home/lawrence/miniconda3/envs/nlp/lib/python3.9/site-packages/tensorflow/python/eager/def_function.py:889 __call__
        result = self._call(*args, **kwds)
    /home/lawrence/miniconda3/envs/nlp/lib/python3.9/site-packages/tensorflow/python/eager/def_function.py:924 _call
        results = self._stateful_fn(*args, **kwds)
    /home/lawrence/miniconda3/envs/nlp/lib/python3.9/site-packages/tensorflow/python/eager/function.py:3022 __call__
        filtered_flat_args) = self._maybe_define_function(args, kwargs)
    /home/lawrence/miniconda3/envs/nlp/lib/python3.9/site-packages/tensorflow/python/eager/function.py:3444 _maybe_define_function
        graph_function = self._create_graph_function(args, kwargs)
    /home/lawrence/miniconda3/envs/nlp/lib/python3.9/site-packages/tensorflow/python/eager/function.py:3279 _create_graph_function
        func_graph_module.func_graph_from_py_func(
    /home/lawrence/miniconda3/envs/nlp/lib/python3.9/site-packages/tensorflow/python/framework/func_graph.py:999 func_graph_from_py_func
        func_outputs = python_func(*func_args, **func_kwargs)
    /home/lawrence/miniconda3/envs/nlp/lib/python3.9/site-packages/tensorflow/python/eager/def_function.py:672 wrapped_fn
        out = weak_wrapped_fn().__wrapped__(*args, **kwds)
    /home/lawrence/miniconda3/envs/nlp/lib/python3.9/site-packages/tensorflow/python/saved_model/function_deserialization.py:285 restored_function_body
        raise ValueError(

    ValueError: Could not find matching function to call loaded from the SavedModel. Got:
      Positional arguments (3 total):
        * [<tf.Tensor 'inputs:0' shape=(None, 128) dtype=int32>, <tf.Tensor 'inputs_1:0' shape=(None, 128) dtype=int32>, <tf.Tensor 'inputs_2:0' shape=(None, 128) dtype=int32>]
        * False
        * None
      Keyword arguments: {}
    
    Expected these arguments to match one of the following 4 option(s):
    
    Option 1:
      Positional arguments (3 total):
        * {'input_type_ids': TensorSpec(shape=(None, None), dtype=tf.int32, name='input_type_ids'), 'input_mask': TensorSpec(shape=(None, None), dtype=tf.int32, name='input_mask'), 'input_word_ids': TensorSpec(shape=(None, None), dtype=tf.int32, name='input_word_ids')}
        * False
        * None
      Keyword arguments: {}
    
    Option 2:
      Positional arguments (3 total):
        * {'input_type_ids': TensorSpec(shape=(None, None), dtype=tf.int32, name='inputs/input_type_ids'), 'input_word_ids': TensorSpec(shape=(None, None), dtype=tf.int32, name='inputs/input_word_ids'), 'input_mask': TensorSpec(shape=(None, None), dtype=tf.int32, name='inputs/input_mask')}
        * False
        * None
      Keyword arguments: {}
    
    Option 3:
      Positional arguments (3 total):
        * {'input_type_ids': TensorSpec(shape=(None, None), dtype=tf.int32, name='inputs/input_type_ids'), 'input_mask': TensorSpec(shape=(None, None), dtype=tf.int32, name='inputs/input_mask'), 'input_word_ids': TensorSpec(shape=(None, None), dtype=tf.int32, name='inputs/input_word_ids')}
        * True
        * None
      Keyword arguments: {}
    
    Option 4:
      Positional arguments (3 total):
        * {'input_word_ids': TensorSpec(shape=(None, None), dtype=tf.int32, name='input_word_ids'), 'input_mask': TensorSpec(shape=(None, None), dtype=tf.int32, name='input_mask'), 'input_type_ids': TensorSpec(shape=(None, None), dtype=tf.int32, name='input_type_ids')}
        * True
        * None
      Keyword arguments: {}


## Fine-Tune BERT for Text Classification

In [28]:
model = create_model()


In [27]:
model = create_model()
# Labels are integer class ids (not one-hot), so use the sparse loss and metric.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(),
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])
model.summary()


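The to-do list at the top mentions a learning rate schedule. One common choice for BERT fine-tuning is a linear warmup to the peak rate followed by a linear decay to zero. The sketch below shows the shape of such a schedule in plain Python. `linear_warmup_decay` and the step counts are hypothetical, and the peak rate of 2e-5 matches the value passed to `Adam` above.

```python
def linear_warmup_decay(step: int, peak_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Learning rate at `step`: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Hypothetical budget: 3 epochs of 197 batches, with 10% of the steps used for warmup.
total_steps = 3 * 197
warmup_steps = total_steps // 10

for step in (0, warmup_steps, total_steps // 2, total_steps):
    lr = linear_warmup_decay(step, peak_lr=2e-5,
                             warmup_steps=warmup_steps, total_steps=total_steps)
    print(f"step {step:4d}: lr = {lr:.2e}")
```

In Keras, a schedule like this would typically be wrapped in a `tf.keras.optimizers.schedules.LearningRateSchedule` subclass and passed as the optimizer's `learning_rate`.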