<a href="https://colab.research.google.com/github/gkv856/KaggleData/blob/main/Watson_BERT_v001.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import tensorflow as tf


In [2]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.10.2-py3-none-any.whl (2.8 MB)
[K     |████████████████████████████████| 2.8 MB 5.4 MB/s 
[?25hCollecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
[K     |████████████████████████████████| 3.3 MB 34.3 MB/s 
Collecting pyyaml>=5.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
[K     |████████████████████████████████| 636 kB 49.0 MB/s 
[?25hCollecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
[K     |████████████████████████████████| 895 kB 51.5 MB/s 
Collecting huggingface-hub>=0.0.12
  Downloading huggingface_hub-0.0.17-py3-none-any.whl (52 kB)
[K     |████████████████████████████████| 52 kB 1.5 MB/s 
Installing collected packages: tokenizers, sacremoses, pyyaml, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: Py

In [3]:
from transformers import BertTokenizer, TFBertModel
import matplotlib.pyplot as plt

In [4]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [5]:
TRAIN_URL = "https://raw.githubusercontent.com/gkv856/KaggleData/main/watson_train.csv"
# TRAIN_URL ="https://raw.githubusercontent.com/gkv856/KaggleData/main/train.csv"
train = pd.read_csv(TRAIN_URL)

In [6]:
train.head()

Unnamed: 0,id,premise,hypothesis,lang_abv,language,label
0,5130fd2cb5,and these comments were considered in formulat...,The rules developed in the interim were put to...,en,English,0
1,5b72532a0b,These are issues that we wrestle with in pract...,Practice groups are not permitted to work on t...,en,English,2
2,3931fbe82a,Des petites choses comme celles-là font une di...,J'essayais d'accomplir quelque chose.,fr,French,0
3,5622f0c60b,you know they can't really defend themselves l...,They can't defend themselves because of their ...,en,English,0
4,86aaa48b45,ในการเล่นบทบาทสมมุติก็เช่นกัน โอกาสที่จะได้แสด...,เด็กสามารถเห็นได้ว่าชาติพันธุ์แตกต่างกันอย่างไร,th,Thai,1


In [7]:
# later you can use the whole data for traiing and test
train_df, valid_df = train_test_split(train, 
                                       random_state=42, 
                                      #  train_size=0.9, 
                                       test_size=.25,
                                       stratify=train[["label", "lang_abv"]].values)

# use the below structure for testing if the data is huge
# train_df, remaining = train_test_split(train, 
#                                        random_state=42, 
#                                        train_size=0.0095, 
#                                        stratify=train[["label", "lang_abv"]].values)

# valid_df, _ = train_test_split(remaining, 
#                               random_state=42, 
#                               train_size=0.0095, 
#                               stratify=remaining[["label", "lang_abv"]].values)
len(train_df), len(valid_df)

(9090, 3030)

In [8]:
"""
Each line of the dataset is composed of the review text and its label
- Data preprocessing consists of transforming text to BERT input features:
input_word_ids, input_mask, segment_ids
- In the process, tokenizing the text is done with the provided BERT model tokenizer
"""

# Label categories, right now our data has these categories
label_list = [0, 1, 2]

# maximum length of (token) input sequences, or the words in a question
# to save speed we should reset this
max_seq_length = 128

train_batch_size = 32

In [9]:
model_name = 'bert-base-multilingual-cased'
tokenizer = BertTokenizer.from_pretrained(model_name)

Downloading:   0%|          | 0.00/996k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.96M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/625 [00:00<?, ?B/s]

In [10]:
def encode_sentence(s, is_cls=False):
   tokens = tokenizer.tokenize(s)
   if not is_cls:
    tokens.append('[SEP]')
   return tokenizer.convert_tokens_to_ids(tokens)

In [11]:
tf.convert_to_tensor(encode_sentence("i love machine learning"), dtype=tf.int32)

<tf.Tensor: shape=(5,), dtype=int32, numpy=array([  177, 16138, 21432, 26901,   102], dtype=int32)>

In [12]:
print("TF Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
# print("Hub version: ", hub.__version__)
print("GPU is", "available" if tf.config.experimental.list_physical_devices("GPU") else "NOT AVAILABLE")

TF Version:  2.6.0
Eager mode:  True
GPU is available


In [13]:
def pad_tokens(t1, t2, max_seq_length=max_seq_length):
  """
    this function makes sure that all the outputs are of same length.
    first sequence will be [CLS]{lst of tokens}[SEP]
    second sequence will be {lst of tokens}[SEP]
    there it is impottant to extract first and last item before slicing the list
  """
  half_len_t1 = max_seq_length // 2 - 1
  half_len_t2 = max_seq_length // 2
  
  frt = [t1[0]]
  lst = [t1[-1]]
  t1 = t1[1:-1]
  t1 = frt +t1[1:half_len_t1-1]  + lst
  # print(t1)

  lst = [t2[-1]]
  t2 = t2[0:-1]
  t2 = t2[1:half_len_t2]  + lst
  # print(t2)

  len_t1 = len(t1)
  len_t2 = len(t2)

  if len_t1 < half_len_t1:
    len_pad = half_len_t1 - len_t1
    lst = [0] * len_pad
    zeros = tf.constant(lst, dtype=tf.int32)
    t1 = tf.concat([t1, zeros], axis=0) 

  if len_t2 < half_len_t2:
    len_pad = half_len_t2 - len_t2
    lst = [0] * len_pad
    zeros = tf.constant(lst, dtype=tf.int32)
    t2 = tf.concat([t2, zeros], axis=0)
  
  return t1, t2

def to_feature(texta, textb, label):#, tokenizer=tokenizer):
  hypothese = texta.numpy().decode('UTF-8')
  premise = textb.numpy().decode('UTF-8')
  
  cls = encode_sentence("[CLS]", is_cls=True)
  
  sentence1 = encode_sentence(hypothese)
  sentence2 = encode_sentence(premise)
  
  sentence1, sentence2  = pad_tokens(sentence1, sentence2, max_seq_length=128)
  

  sentence1 = tf.convert_to_tensor(sentence1, dtype=tf.int32)
  sentence2 = tf.convert_to_tensor(sentence2, dtype=tf.int32)
  

  input_word_ids = tf.concat([cls, sentence1, sentence2], axis=0)
  input_mask = tf.ones_like(input_word_ids, dtype=tf.int32)

  type_cls = tf.zeros_like(cls, dtype=tf.int32)
  type_s1 = tf.zeros_like(sentence1, dtype=tf.int32)
  type_s2 = tf.ones_like(sentence2, dtype=tf.int32)

  input_type_ids = tf.concat([type_cls, type_s1, type_s2], axis=-1) 


  label = tf.constant(label.numpy(), dtype=tf.int32)


  return (input_word_ids, input_mask, input_type_ids, label)


In [14]:
def to_feature_map(in_text, label):
  #     print(text, label)
  texta = in_text[0]
  textb = in_text[1]
  out = tf.py_function(to_feature, inp=[texta, textb, label], 
                       Tout=[tf.int32, tf.int32, tf.int32, tf.int32])
  
  iids, imask, segids, label = out[0], out[1], out[2], out[3]
  
  iids.set_shape([max_seq_length])
  imask.set_shape([max_seq_length])
  segids.set_shape([max_seq_length])
  label.set_shape([])
  # print(iids)
 
  x = {
    "input_word_ids": iids,
    "input_mask": imask,
    "input_type_ids": segids
  }
  return (x, label)

In [15]:
# creating datasets
with tf.device('/cpu:0'):
  train_data = tf.data.Dataset.from_tensor_slices(((train_df[["premise", "hypothesis"]].values), 
                                                   train_df["label"].values))
  
  test_data = tf.data.Dataset.from_tensor_slices((valid_df["premise"].values,
                                                   valid_df["label"].values))
train_data, test_data

(<TensorSliceDataset shapes: ((2,), ()), types: (tf.string, tf.int64)>,
 <TensorSliceDataset shapes: ((), ()), types: (tf.string, tf.int64)>)

In [16]:
s = train_data.take(1)
for t, l in s:
  i = to_feature_map(t, l)
  inps = i[0]
  print(inps["input_word_ids"].shape, inps["input_mask"].shape, inps["input_type_ids"].shape, i[1])

(128,) (128,) (128,) tf.Tensor(2, shape=(), dtype=int32)


In [17]:
# creating datasets
with tf.device('/cpu:0'):
  train_data = tf.data.Dataset.from_tensor_slices(((train_df[["premise", "hypothesis"]].values), 
                                                   train_df["label"].values))
  
  test_data = tf.data.Dataset.from_tensor_slices(((valid_df[["premise", "hypothesis"]].values), 
                                                   valid_df["label"].values))
# train_data, test_data

train_data = train_data.map(to_feature_map, num_parallel_calls=tf.data.AUTOTUNE).batch(32, drop_remainder=True).prefetch(tf.data.AUTOTUNE)
test_data = test_data.map(to_feature_map, num_parallel_calls=tf.data.AUTOTUNE).batch(32, drop_remainder=True).prefetch(tf.data.AUTOTUNE)
train_data.element_spec

({'input_mask': TensorSpec(shape=(32, 128), dtype=tf.int32, name=None),
  'input_type_ids': TensorSpec(shape=(32, 128), dtype=tf.int32, name=None),
  'input_word_ids': TensorSpec(shape=(32, 128), dtype=tf.int32, name=None)},
 TensorSpec(shape=(32,), dtype=tf.int32, name=None))

In [18]:
for x, y in train_data.take(2):
  print(x)

{'input_word_ids': <tf.Tensor: shape=(32, 128), dtype=int32, numpy=
array([[  101, 14983, 61649, ...,     0,     0,     0],
       [  101, 10798, 42819, ...,     0,     0,     0],
       [  101, 10377, 30181, ...,     0,     0,     0],
       ...,
       [  101, 15923, 10133, ...,     0,     0,     0],
       [  101,  1450, 87506, ...,     0,     0,     0],
       [  101, 10159,   172, ...,     0,     0,     0]], dtype=int32)>, 'input_mask': <tf.Tensor: shape=(32, 128), dtype=int32, numpy=
array([[1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1],
       ...,
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1]], dtype=int32)>, 'input_type_ids': <tf.Tensor: shape=(32, 128), dtype=int32, numpy=
array([[0, 0, 0, ..., 1, 1, 1],
       [0, 0, 0, ..., 1, 1, 1],
       [0, 0, 0, ..., 1, 1, 1],
       ...,
       [0, 0, 0, ..., 1, 1, 1],
       [0, 0, 0, ..., 1, 1, 1],
       [0, 0, 0, ..., 1, 1, 1]], dtype=int32)>

In [19]:
bert_encoder = TFBertModel.from_pretrained(model_name)

Downloading:   0%|          | 0.00/1.08G [00:00<?, ?B/s]

Some layers from the model checkpoint at bert-base-multilingual-cased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-multilingual-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


In [20]:
def build_model():
    
    input_word_ids = tf.keras.Input(shape=(max_seq_length,), dtype=tf.int32, name="input_word_ids")
    input_mask = tf.keras.Input(shape=(max_seq_length,), dtype=tf.int32, name="input_mask")
    input_type_ids = tf.keras.Input(shape=(max_seq_length,), dtype=tf.int32, name="input_type_ids")
    
    embedding = bert_encoder([input_word_ids, input_mask, input_type_ids])[0]
    output = tf.keras.layers.Dense(3, activation='softmax')(embedding[:,0,:])
    
    model = tf.keras.Model(inputs=[input_word_ids, input_mask, input_type_ids], outputs=output)
    model.compile(tf.keras.optimizers.Adam(lr=1e-5), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    
    return model

In [21]:
model = build_model()
model.summary()

Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method
Cause: while/else statement not yet supported
Cause: while/else statement not yet supported


  "The `lr` argument is deprecated, use `learning_rate` instead.")


Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_word_ids (InputLayer)     [(None, 128)]        0                                            
__________________________________________________________________________________________________
input_mask (InputLayer)         [(None, 128)]        0                                            
__________________________________________________________________________________________________
input_type_ids (InputLayer)     [(None, 128)]        0                                            
__________________________________________________________________________________________________
tf_bert_model (TFBertModel)     TFBaseModelOutputWit 177853440   input_word_ids[0][0]             
                                                                 input_mask[0][0]             

In [None]:
history = model.fit(train_data, 
                    epochs = 2, 
                    verbose = 1, 
                    validation_data=test_data
                    )

Epoch 1/2
Epoch 2/2
 16/284 [>.............................] - ETA: 7:18 - loss: 0.8752 - accuracy: 0.6133