<a href="https://colab.research.google.com/github/pastrop/kaggle/blob/master/BERT_sentiment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERT-Based Sentiment Analysis
*original is at: https://www.kaggle.com/pastrop/toxic-data-comp-data*

In [0]:
!pip install bert-for-tf2


In [0]:
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow as tf
import tensorflow_hub as hub
import bert

In [0]:
import pandas as pd
import os, time

In [0]:
print("TensorFlow Version:",tf.__version__)
print("Hub version: ",hub.__version__)

In [0]:
df = pd.read_csv('path to file')
df.head(5)

# Initial Dataset Processing - Adding word_ids, mask_ids, segment_ids

In [0]:
SEQUENCE_LENGTH = 256

DATA_PATH =  "../input/jigsaw-multilingual-toxic-comment-classification" # data location

#BERT_PATH = "../working/bert_model"
#BERT_PATH_SAVEDMODEL = "../working/bert_model"
#OUTPUT_PATH = "../working"

In [0]:
wiki_toxic_comment_data = "jigsaw-toxic-comment-train.csv" #training file to be processed
wiki_toxic_comment_train = pd.read_csv(os.path.join(DATA_PATH, wiki_toxic_comment_data))
wiki_toxic_comment_train.head()

In [0]:
#building the tokenizer
#bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_cased_L-24_H-1024_A-16/2",trainable=True) - English only
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/2",trainable=True) # Multi-language
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = bert.bert_tokenization.FullTokenizer(vocab_file, do_lower_case)

In [0]:
# Examples of using the tokenizer (not needed for production code, but useful for checking that the tokenizer works correctly)
example_sentence = wiki_toxic_comment_train.iloc[37].comment_text[:150]
print(example_sentence)

example_tokens = tokenizer.tokenize(example_sentence)
print(example_tokens[:17])

example_input_ids = tokenizer.convert_tokens_to_ids(example_tokens)
print(example_input_ids[:17])

This is where the majority of the data preparation time is spent, because every record in the input has to be processed. I believe the process is highly parallelizable. The steps are: (1) read the file into memory, (2) process every record, (3) write the result back to storage. As an example, processing a 230,000-record file takes ~600 seconds on a standard Kaggle compute resource. In theory, step (3) can be skipped if there is enough memory to hold the results for later use, but input files may be arbitrarily large, and the dataset loader used below (`tf.data.experimental.make_csv_dataset`) reads its input from a CSV file. This step is required for pretty much all modern NLP models and has to be performed both in training and in production.
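
The claim that step (2) parallelizes well can be illustrated with a minimal sketch. It is not the code this notebook runs: `chunked`, `process_chunk`, `process_in_parallel`, and `toy_process` are hypothetical names, and `toy_process` only stands in for the real per-record tokenization. The sketch splits the records into chunks, processes the chunks concurrently, and reassembles the results in input order.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(records, chunk_size):
    """Yield consecutive slices of `records` with at most `chunk_size` items."""
    for start in range(0, len(records), chunk_size):
        yield records[start:start + chunk_size]


def process_chunk(chunk, process_record):
    """Apply `process_record` to every record of one chunk."""
    return [process_record(record) for record in chunk]


def process_in_parallel(records, process_record, executor, chunk_size=10000):
    """Process chunks concurrently and return results in the input order."""
    chunks = list(chunked(records, chunk_size))
    # executor.map preserves the order of the submitted chunks.
    per_chunk = executor.map(process_chunk, chunks,
                             [process_record] * len(chunks))
    return [result for chunk_result in per_chunk for result in chunk_result]


def toy_process(text):
    """Hypothetical stand-in for process_sentence: counts whitespace tokens."""
    return len(text.split())


# Threads keep this demo portable. Tokenization is CPU-bound, so a
# ProcessPoolExecutor is the more likely candidate for a real speed-up.
with ThreadPoolExecutor(max_workers=4) as pool:
    demo = process_in_parallel(["a b", "c", "d e f", ""], toy_process, pool,
                               chunk_size=2)
print(demo)
```

Passing the executor in as an argument keeps the chunking logic independent of whether threads or processes do the work. Whether this actually shortens the ~600 seconds quoted above would have to be measured.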

In [0]:
def process_sentence(sentence, max_seq_length=SEQUENCE_LENGTH, tokenizer=tokenizer):
    """Helper function to prepare data for BERT. Converts sentence input examples
    into the form ['input_word_ids', 'input_mask', 'segment_ids']."""
    # Tokenize, and truncate to max_seq_length if necessary.
    tokens = tokenizer.tokenize(sentence)
    if len(tokens) > max_seq_length - 2:
        tokens = tokens[:(max_seq_length - 2)]

    # Convert the tokens in the sentence to word IDs.
    input_ids = tokenizer.convert_tokens_to_ids(["[CLS]"] + tokens + ["[SEP]"])

    # The mask has 1 for real tokens and 0 for padding tokens. Only real
    # tokens are attended to.
    input_mask = [1] * len(input_ids)

    # Zero-pad up to the sequence length.
    pad_length = max_seq_length - len(input_ids)
    input_ids.extend([0] * pad_length)
    input_mask.extend([0] * pad_length)

    # We only have one input segment.
    segment_ids = [0] * max_seq_length

    return (input_ids, input_mask, segment_ids)

def preprocess_and_save_dataset(unprocessed_filename, text_label='comment_text',
                                seq_length=SEQUENCE_LENGTH, verbose=True):
    """Preprocess a CSV to the expected TF Dataset form for multilingual BERT,
    and save the result."""
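    dataframe = pd.read_csv(os.path.join(DATA_PATH, unprocessed_filename),
                            index_col='id')
    # os.path.splitext drops the extension; str.rstrip('.csv') would strip
    # any trailing '.', 'c', 's', or 'v' characters instead of the suffix.
    processed_filename = (os.path.splitext(unprocessed_filename)[0] +
                          "-processed-seqlen{}.csv".format(seq_length))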

    pos = 0
    start = time.time()

    while pos < len(dataframe):
        processed_df = dataframe[pos:pos + 10000].copy()

        processed_df['input_word_ids'], processed_df['input_mask'], processed_df['segment_ids'] = (
            zip(*processed_df[text_label].apply(
                lambda s: process_sentence(s, max_seq_length=seq_length))))

        if pos == 0:
            processed_df.to_csv(processed_filename, index_label='id', mode='w')
        else:
            processed_df.to_csv(processed_filename, index_label='id', mode='a',
                                header=False)

        if verbose:
            # Report the number of rows actually processed so far.
            print('Processed {} examples in {:.1f} seconds'.format(
                min(pos + 10000, len(dataframe)), time.time() - start))
        pos += 10000
  
# Process the training dataset.
preprocess_and_save_dataset(wiki_toxic_comment_data)
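
To check the layout that `process_sentence` produces without loading the tokenizer, here is a standalone sketch of the same truncate, wrap, and pad logic. It works on token IDs directly. `build_bert_inputs` is a hypothetical helper, and the `CLS_ID`/`SEP_ID` values are illustrative assumptions rather than values read from the vocabulary file.

```python
def build_bert_inputs(token_ids, max_seq_length, cls_id, sep_id, pad_id=0):
    """Return (input_ids, input_mask, segment_ids), each max_seq_length long."""
    # Leave room for the [CLS] and [SEP] markers.
    token_ids = token_ids[:max_seq_length - 2]
    input_ids = [cls_id] + token_ids + [sep_id]

    # 1 marks real tokens, 0 marks padding.
    input_mask = [1] * len(input_ids)

    pad_length = max_seq_length - len(input_ids)
    input_ids = input_ids + [pad_id] * pad_length
    input_mask = input_mask + [0] * pad_length

    # A single input segment, so every position belongs to segment 0.
    segment_ids = [0] * max_seq_length
    return input_ids, input_mask, segment_ids


# Assumed marker IDs, for illustration only.
CLS_ID, SEP_ID = 101, 102

short = build_bert_inputs([7, 8, 9], 8, CLS_ID, SEP_ID)
long = build_bert_inputs(list(range(10, 20)), 8, CLS_ID, SEP_ID)
print(short)
print(long)
```

The short input is padded with zeros and its mask ends in zeros. The long input is cut to six tokens so that the two markers still fit into the eight positions.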

# Dataset Transformation into tf.data.Dataset

The data transformations below run quickly.

In [0]:
df = pd.read_csv('path to file',nrows = XXXXX) # Limit nrows for the test run, read the entire  file otherwise
df.head(3)

*tf.data.Dataset creation*

In [0]:
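# The segment column is called 'all_segment_id' here and in the model inputs below.
# preprocess_and_save_dataset above writes it as 'segment_ids'; filter() silently
# skips missing columns, so make sure the name matches the file being read.
test = df.filter(['toxic','input_word_ids','input_mask','all_segment_id'])
#test = test.rename(columns={"all_segment_id": "segment_ids"})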
test.head(3)

In [0]:
# writing dataframe to CSV
test.to_csv('test_processed_256.csv', index = False, mode='w')

In [0]:
#building TF dataset:
def get_dataset(file_path = 'path to file'):
    dataset = tf.data.experimental.make_csv_dataset(
      file_path,
      batch_size=12, # Artificially small, the dataset is batched up later.
      label_name='toxic', #label for the class column
      na_value="?",
      num_epochs=1,
      shuffle=False)
    return dataset

In [0]:
train_data = get_dataset('path to file')
train_data = train_data.unbatch()

In [0]:
def parse_string_list_into_ints(strlist):
    s = tf.strings.strip(strlist)
    s = tf.strings.substr(s, 1, tf.strings.length(s) - 2)  # Drop the first and last characters (the enclosing brackets or parentheses)
    #s = tf.strings.split(s, ',', maxsplit=128)
    s = tf.strings.split(s, ',', maxsplit=256)
    s = tf.strings.to_number(s, tf.int32)
    #s = tf.reshape(s, [128])  # Force shape here needed for XLA compilation (TPU)
    s = tf.reshape(s, [256])  # Force shape here needed for XLA compilation (TPU)
    return s
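
As a plain-Python analogue of the parsing above (not the TensorFlow function itself), the sketch below shows what happens to one CSV cell. `parse_list_cell` is a hypothetical helper, and the sample cell assumes the column was written as a Python list such as `[101, 7, 102, 0]`.

```python
def parse_list_cell(cell, expected_length):
    """Turn a stringified list such as '[101, 7, 102, 0]' into a list of ints."""
    s = cell.strip()
    s = s[1:-1]  # drop the enclosing bracket characters
    values = [int(part) for part in s.split(",")]
    # Mirrors the fixed-shape reshape: every row must have the same length.
    if len(values) != expected_length:
        raise ValueError(
            "expected {} values, got {}".format(expected_length, len(values)))
    return values


parsed = parse_list_cell(" [101, 7, 102, 0] ", expected_length=4)
print(parsed)
```

The length check plays the role of `tf.reshape(s, [256])` in the TensorFlow version: a row whose list is shorter or longer than the sequence length fails instead of passing through silently.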

In [0]:
# prototype function to process the dataset for the Bert layer
def elem_mod(data,label):
    for k,v in data.items():
        data[k] = parse_string_list_into_ints(v)
    return data,label    
    
result = train_data.map(lambda x,y:elem_mod(x,y))

In [0]:
#Final step in getting the dataset ready:
def make_dataset_pipeline(dataset, repeat_and_shuffle=True):
    """Set up the pipeline for the given dataset.   
    Caches, optionally shuffles, batches, and sets the pipeline up to prefetch
    batches. Despite the flag name, no repeat() is applied here."""
    cached_dataset = dataset.cache()
    if repeat_and_shuffle:
        cached_dataset = cached_dataset.shuffle(2048)
    #cached_dataset = cached_dataset.batch(32 * strategy.num_replicas_in_sync)
    cached_dataset = cached_dataset.batch(32)
    cached_dataset = cached_dataset.prefetch(tf.data.experimental.AUTOTUNE)
    return cached_dataset

In [0]:
cached_train_data = make_dataset_pipeline(result)

# Keras Model with BERT Layer

In [0]:
#Building the model (reformat as a function...)
max_seq_length = 256  # Your choice here.
input_word_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                       name="input_word_ids")
input_mask = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                   name="input_mask")
segment_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                    name="segment_ids")
# BERT layer from pretrained model
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/2",trainable=True)
# Dense Layers
pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
output = tf.keras.layers.Dense(32, activation='relu')(pooled_output)
output = tf.keras.layers.Dense(1, activation='sigmoid', name='labels')(output)

In [0]:
# Model
model = tf.keras.Model(inputs={'input_word_ids': input_word_ids,
                                  'input_mask': input_mask,
                                  'all_segment_id': segment_ids},
                          outputs=output)

In [0]:
model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(),
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
    metrics=[tf.keras.metrics.AUC()])

model.summary()

*Model build and compile if a TPU is used*

In [0]:
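# TPU-based model: DON'T RUN WITHOUT A TPU
# tpu_strategy is not defined in this notebook; it must be a tf.distribute TPU strategy created beforehand.
# Instantiating the model in the strategy scope creates the model on the TPU.
with tpu_strategy.scope():
    model = tf.keras.Sequential( … ) # define your model normally
    model.compile( … )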

In [0]:
# Train on English Wikipedia comment data.
history = model.fit(
    # Set steps such that the number of examples per epoch is fixed.
    # This makes training on different accelerators more comparable.
    cached_train_data,
    steps_per_epoch=4000 // 128,
    epochs=7, verbose=1)
#print()
#steps_per_epoch=4000//256
#validation_data=nonenglish_val_datasets['Combined'],
# validation_steps=100
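
The relation between `steps_per_epoch` and the number of examples seen per epoch can be worked out with a short arithmetic sketch. The batch size of 32 comes from `make_dataset_pipeline` and `4000 // 128` from the `fit` call. The helper names are hypothetical, and reading 4000 as the intended number of examples per epoch is an assumption.

```python
def examples_per_epoch(steps_per_epoch, batch_size):
    """Number of examples consumed in one epoch."""
    return steps_per_epoch * batch_size


def steps_for_target(target_examples, batch_size):
    """Whole number of steps needed to see about target_examples per epoch."""
    return target_examples // batch_size


steps_used = 4000 // 128
seen = examples_per_epoch(steps_used, batch_size=32)
steps_needed = steps_for_target(4000, batch_size=32)
print(steps_used, seen, steps_needed)
```

With a batch size of 32, the setting in the `fit` call covers about a quarter of 4000 examples per epoch. If 4000 per epoch is the goal, the step count has to be derived from the batch size actually used.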

In [0]:
# cached_validate_data is not built in this notebook; prepare it the same way as cached_train_data.
results = model.evaluate(cached_validate_data, steps=100, verbose=0)
print('\nLoss, AUC:', results)