# BERT training on raw data

This script fine-tunes the BERT model on the given dataset.
You have to provide two variables:

- data_path: Path to the data. Expects a CSV file with a "Comment_text" column and a "Hateful_or_not" column.
- model_name: Name under which the model will be saved (".h5" is appended when saving).
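
Before starting a long training run, it can help to confirm that the CSV actually has the two expected columns. The following is a minimal sketch using only the standard library; the helper name `missing_columns` and the in-memory sample are illustrative assumptions and are not part of bert_functions_modified.py.

```python
import csv
import io

# Column names the notebook reads from the CSV file
REQUIRED_COLUMNS = {"Comment_text", "Hateful_or_not"}

def missing_columns(csv_file):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.reader(csv_file)
    header = next(reader, [])  # an empty file yields an empty header
    return REQUIRED_COLUMNS - set(header)

# Usage with an in-memory sample (hypothetical data);
# for a real file, pass open(data_path, newline="") instead.
sample = io.StringIO('Comment_text,Hateful_or_not\n"some comment",0\n')
print(missing_columns(sample))  # set()
```

An empty set means both columns are present; any names returned are the ones that need to be added or renamed in the file.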

Fine-tuning a large model like BERT takes a lot of time, so it makes sense to run this script on cloud computing resources.
Make sure to exclude the test set from the BERT fine-tuning process; otherwise you'll have data leakage.

## Preparatory steps



### Upload necessary files to the working directory

The data file (.csv) and the .py script need to be in /content/ (the default working directory in Colab).

Upload the data and bert_functions_modified.py first by executing the upload cell below twice and choosing the two necessary files:

In [None]:
# access data from the drive
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
from google.colab import files
uploaded = files.upload()

Saving training_data_hate_speech.csv to training_data_hate_speech.csv


### Installing necessary packages

Install the necessary BERT packages and TensorFlow 1.15.2.

In [None]:
!pip uninstall gast
!pip install gast==0.2.2
!pip install sentencepiece
!pip install bert-tensorflow==1.0.1
!pip install pandas==0.24

TensorFlow 1.x selected.
Uninstalling gast-0.3.3:
  Would remove:
    /usr/local/lib/python3.7/dist-packages/gast-0.3.3.dist-info/*
    /usr/local/lib/python3.7/dist-packages/gast/*
Proceed (y/n)? y
  Successfully uninstalled gast-0.3.3
Collecting gast==0.2.2
  Downloading https://files.pythonhosted.org/packages/4e/35/11749bf99b2d4e3cceb4d55ca22590b0d7c2c62b9de38ac4a4a7f4687421/gast-0.2.2.tar.gz
Building wheels for collected packages: gast
  Building wheel for gast (setup.py) ... done
  Created wheel for gast: filename=gast-0.2.2-cp37-none-any.whl size=7540 sha256=e6c4c7f2226ff21171af1aa106eecd8166156822dc40a8f3e2557dd743271680
  Stored in directory: /root/.cache/pip/wheels/5c/2e/7e/a1d4d4fcebe6c381f378ce7743a3ced3699feb89bcfbdadadd
Successfully built gast
Installing collected packages: gast
Successfully installed gast-0.2.2
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/f5/99/e0808cb947ba10f575839c43e8fafc9cc44e4a7a2c8f79c60db48220a577

In [None]:
%tensorflow_version 1.x

In [None]:
# for importing modules in .py files, insert the working directory
# import sys
# sys.path.insert(0,'/content/drive/My Drive/AIandLaw/CodeAndModel/')

## Preparing the dataset

In [None]:
import pandas as pd
import numpy as np

import tensorflow as tf

from sklearn.model_selection import train_test_split
from bert_functions_modified import *




### User Input

In [None]:
data_path = "training_data_hate_speech.csv"
model_name = "bert_model_modified"

### Initialize data and session

In [None]:
# Load and split data into training and test data
data = pd.read_csv(data_path, sep=',')
X_train, X_test = train_test_split(data, test_size=0.25, random_state=25)
print(X_train.shape)
print(X_test.shape)

# Initialize session
sess = tf.Session()

# Params for bert model and tokenization
bert_path = "https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1"
# max comment length for padding
max_seq_length = 256

(146299, 3)
(48767, 3)
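
The two shapes printed above can be checked by hand. The sketch below reproduces the row counts under the assumption that, for a fractional `test_size`, scikit-learn rounds the test share up and assigns the remaining rows to the training set.

```python
import math

n_samples = 146299 + 48767  # total rows, taken from the two shapes printed above
test_size = 0.25            # as passed to train_test_split

# Assumption: the test share is rounded up, the rest goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 146299 48767
```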


In [None]:
# Create datasets (Only take up to max_seq_length words for memory)
train_text = X_train['Comment_text'].tolist()
train_text = [' '.join(str(t).split()[0:max_seq_length]) for t in train_text]
train_text = np.array(train_text, dtype=object)[:, np.newaxis]
train_label = X_train['Hateful_or_not'].tolist()

test_text = X_test['Comment_text'].tolist()
test_text = [' '.join(str(t).split()[0:max_seq_length]) for t in test_text]
test_text = np.array(test_text, dtype=object)[:, np.newaxis]
test_label = X_test['Hateful_or_not'].tolist()
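
The word-level truncation used in the cell above can be isolated in a small helper, which makes the behavior easier to check on edge cases. This is an illustrative sketch; `truncate_words` is a hypothetical name and not a function from bert_functions_modified.py.

```python
def truncate_words(text, max_words):
    """Keep at most max_words whitespace-separated words of text."""
    # str() guards against non-string cells, e.g. NaN for an empty comment
    return " ".join(str(text).split()[:max_words])

print(truncate_words("one  two three four", 3))  # one two three
print(truncate_words(float("nan"), 3))           # nan
```

Note that splitting and re-joining also collapses repeated whitespace, and that the limit here counts words, not the tokens the BERT tokenizer produces later.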

### Prepare input features from dataset

In [None]:
# Instantiate tokenizer
tokenizer = create_tokenizer_from_hub_module(bert_path, sess)

# Convert data to InputExample format
train_examples = convert_text_to_examples(train_text, train_label)
test_examples = convert_text_to_examples(test_text, test_label)

# Convert to features
(train_input_ids, train_input_masks, train_segment_ids, train_labels) = convert_examples_to_features(tokenizer, train_examples, max_seq_length=max_seq_length)
(test_input_ids, test_input_masks, test_segment_ids, test_labels) = convert_examples_to_features(tokenizer, test_examples, max_seq_length=max_seq_length)

INFO:tensorflow:Saver not created because there are no variables in the graph to restore



## Build and train model

Note: For the next cell to work, some modifications to bert_functions.py were needed.

In particular,
"class BertLayer(tf.layers.Layer):"
was changed to
"class BertLayer(tf.keras.layers.Layer):"

and a custom get_config function was added to the BertLayer class.

The file was renamed to bert_functions_modified.py.

More details about the hyperparameters and layers can be found in bert_functions_modified.py.

In [None]:
bert_model = build_model(max_seq_length, bert_path=bert_path)

# Instantiate variables
initialize_vars(sess)

bert_model.fit(
    [train_input_ids, train_input_masks, train_segment_ids],
    train_labels,
    validation_data=([test_input_ids, test_input_masks, test_segment_ids], test_labels),
    epochs=1,
    batch_size=128
)

INFO:tensorflow:Saver not created because there are no variables in the graph to restore



Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_ids (InputLayer)          [(None, 256)]        0                                            
__________________________________________________________________________________________________
input_masks (InputLayer)        [(None, 256)]        0                                            
__________________________________________________________________________________________________
segment_ids (InputLayer)        [(None, 256)]        0                                            
__________________________________________________________________________________________________
bert_layer (BertLayer)          (None, 768)          110104890   input_ids[0][0]                  
                                                                 input_masks[0][0]          










Train on 146299 samples, validate on 48767 samples
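
To get a feeling for how long one epoch is, the number of gradient steps can be derived from the sample count and the batch size. The sketch below assumes that the final, smaller batch is also processed, which is the usual behavior when fitting on in-memory arrays.

```python
import math

n_train = 146299  # training samples, from the log line above
batch_size = 128  # as passed to bert_model.fit
epochs = 1

# Assumption: the last partial batch counts as one additional step
steps_per_epoch = math.ceil(n_train / batch_size)
last_batch = n_train - (steps_per_epoch - 1) * batch_size

print(steps_per_epoch * epochs)  # 1143
print(last_batch)                # 123
```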


In [None]:
bert_model.save(model_name + ".h5")