<a href="https://colab.research.google.com/github/julrods/cyber-bullying-detector/blob/main/Iteration_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Changes from iteration 1:** the loss function was changed from SparseCategoricalCrossentropy to BinaryCrossentropy, and the metric from SparseCategoricalAccuracy to BinaryAccuracy.

Reference: https://swatimeena989.medium.com/bert-text-classification-using-keras-903671e0207d#011a

# Environment

## Libraries

In [3]:
!pip install transformers



In [4]:
import tensorflow as tf
import tensorflow_hub as hub
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
import re
import unicodedata
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
import keras
from tqdm import tqdm
import pickle
from keras.models import Model
import keras.backend as K
from sklearn.metrics import confusion_matrix,f1_score,classification_report
import matplotlib.pyplot as plt
from keras.callbacks import ModelCheckpoint
import itertools
from keras.models import load_model
from sklearn.utils import shuffle
#from transformers import *
from transformers import BertTokenizer, TFBertModel, BertConfig, TFBertForSequenceClassification

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


## Functions

In [6]:
def unicode_to_ascii(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')

def clean_stopwords_shortwords(w):
    # A set gives constant-time membership checks
    stopwords_set = set(stopwords.words('english'))
    words = w.split()
    clean_words = [word for word in words if (word not in stopwords_set) and len(word) > 2]
    return " ".join(clean_words)

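def preprocess_sentence(w):
    w = unicode_to_ascii(w.lower().strip())
    # Remove @mentions first, while the '@' character is still present
    w = re.sub(r'@\w+', '', w)
    # Replace punctuation with spaces
    w = re.sub(r"([?.!,¿])", r" ", w)
    # Collapse runs of quotes and spaces into a single space
    w = re.sub(r'[" "]+', " ", w)
    # Replace every remaining run of non-letter characters with a space
    w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)
    # Drop English stopwords and words of two characters or fewer
    w = clean_stopwords_shortwords(w)
    return w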
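
The order of the substitutions in `preprocess_sentence` matters: a mention pattern such as `@\w+` can only match while the `@` character is still in the string. The sketch below illustrates this with two hypothetical helpers (`strip_mentions_last` and `strip_mentions_first` are not part of the notebook) applied to a made-up sample string.

```python
import re

def strip_mentions_last(text):
    # Non-letters (including '@') are replaced first, so the mention pattern never matches
    text = re.sub(r"[^a-zA-Z]+", " ", text)
    return re.sub(r"@\w+", "", text).strip()

def strip_mentions_first(text):
    # The mention is removed while '@' is still present, then non-letters are replaced
    text = re.sub(r"@\w+", "", text)
    return re.sub(r"[^a-zA-Z]+", " ", text).strip()

sample = "@someuser you are wrong"  # made-up example text

print(strip_mentions_last(sample))   # the user name survives
print(strip_mentions_first(sample))  # the user name is removed
```

Running the mention filter late leaves user names in the text as ordinary words, which may or may not be desirable for the classifier.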

## Importing the data

In [7]:
# Import PyDrive and associated libraries.
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [8]:
# Download a file based on its file ID.
file_id = '1W2uM-pWHd9TX0G9WjXKkAORDyqVoDbyu' # id of aggression_parsed_dataset
downloaded = drive.CreateFile({'id': file_id})

In [9]:
### Import data from csv
downloaded.GetContentFile('aggression_parsed_dataset.csv')  
data_original = pd.read_csv('aggression_parsed_dataset.csv')
data = data_original.copy()

In [10]:
print('File has {} rows and {} columns'.format(data.shape[0],data.shape[1]))

File has 115864 rows and 5 columns


In [11]:
data.head()

Unnamed: 0,index,Text,ed_label_0,ed_label_1,oh_label
0,0,`- This is not ``creative``. Those are the di...,0.9,0.1,0
1,1,` :: the term ``standard model`` is itself le...,1.0,0.0,0
2,2,"True or false, the situation as of March 200...",1.0,0.0,0
3,3,"Next, maybe you could work on being less cond...",0.555556,0.444444,0
4,4,This page will need disambiguation.,1.0,0.0,0


In [12]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 115864 entries, 0 to 115863
Data columns (total 5 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   index       115864 non-null  int64  
 1   Text        115864 non-null  object 
 2   ed_label_0  115864 non-null  float64
 3   ed_label_1  115864 non-null  float64
 4   oh_label    115864 non-null  int64  
dtypes: float64(2), int64(2), object(1)
memory usage: 4.4+ MB


# Preprocessing

In [13]:
# Select required columns
data = data[['Text', 'oh_label']]

In [14]:
data = data.rename(columns = {'oh_label': 'label', 'Text': 'text'})

In [15]:
# Shuffle the dataset
data = shuffle(data)

# Print all the unique labels in the dataset   
print('Available labels: ',data.label.unique())

# Clean the text column using preprocess_sentence function defined above
data['text']=data['text'].map(preprocess_sentence)

Available labels:  [0 1]


In [16]:
data.head()

Unnamed: 0,text,label
5652,including including art credits part wiki proc...,0
67505,addresses constantly change several unrelated ...,1
72491,really think triollling mean wrong admit move ...,0
76784,may acceptable article infobox supposed show m...,0
97228,yoyoyoyoyo wassup heyyyyyaaaaaaaaaa mean pleas...,0


# Setting up BERT

In [17]:
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased',
                                                             num_labels=len(data.label.unique()))

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Example with a sentence to see how the tokenizer works

In [None]:
sent = 'how to train the model, lets look at how a trained model calculates its prediction.'
tokens = bert_tokenizer.tokenize(sent)
print(tokens)

['how', 'to', 'train', 'the', 'model', ',', 'lets', 'look', 'at', 'how', 'a', 'trained', 'model', 'calculate', '##s', 'its', 'prediction', '.']
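
The `'calculate', '##s'` pair in the output above comes from WordPiece splitting: a word missing from the vocabulary is broken into known pieces, and every piece after the first carries a `##` prefix. The following is a minimal sketch of greedy longest-match-first splitting, assuming a tiny made-up vocabulary (`toy_vocab` and the `wordpiece` helper are illustrative, not the real BERT vocabulary or the library's implementation).

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary (an assumption for illustration only)
toy_vocab = {"calculate", "##s", "model", "prediction"}

print(wordpiece("calculates", toy_vocab))
print(wordpiece("model", toy_vocab))
```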


In [None]:
tokenized_sequence = bert_tokenizer.encode_plus(sent,
                                               add_special_tokens = True,
                                               max_length = 100,
                                               truncation = True,
                                               padding = 'max_length',
                                               return_attention_mask = True)

In [None]:
tokenized_sequence.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

In [None]:
# The trailing zeros are padding up to max_length, so that all the vectors have the same length
print(tokenized_sequence['input_ids'])

[101, 2129, 2000, 3345, 1996, 2944, 1010, 11082, 2298, 2012, 2129, 1037, 4738, 2944, 18422, 2015, 2049, 17547, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [None]:
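# token_type_ids are segment ids: they tell BERT which sentence each token belongs to
# (0 for the first sentence, 1 for the second). With a single sentence they are all 0.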
print(tokenized_sequence['token_type_ids'])

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
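
The `token_type_ids` are all zero here because only one sentence is encoded. For a sentence pair, BERT marks `[CLS] A [SEP]` with 0 and `B [SEP]` with 1. A minimal sketch of that layout, using a hypothetical helper `build_token_type_ids` that takes only the token counts of the two sentences (padding is ignored for brevity):

```python
def build_token_type_ids(len_a, len_b=0):
    """Segment ids for '[CLS] A [SEP]' optionally followed by 'B [SEP]'."""
    segment_a = [0] * (len_a + 2)                   # [CLS] + tokens of A + [SEP]
    segment_b = [1] * (len_b + 1) if len_b else []  # tokens of B + [SEP]
    return segment_a + segment_b

print(build_token_type_ids(3))     # single sentence of 3 tokens
print(build_token_type_ids(3, 2))  # pair: 3 tokens in A, 2 tokens in B
```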


In [None]:
# The attention mask tells the model which positions to attend to:
# 1 for real tokens and 0 for padding tokens
print(tokenized_sequence['attention_mask'])

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [None]:
# Decoding. Special tokens like [CLS], [SEP] and [PAD] are added by the tokenizer
bert_tokenizer.decode(tokenized_sequence['input_ids'])

'[CLS] how to train the model, lets look at how a trained model calculates its prediction. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]'
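
The relationship between `input_ids`, padding and `attention_mask` shown above can be reproduced by hand. The sketch below is an illustration only: `pad_and_mask` is a hypothetical helper, and its truncation is naive (it simply cuts the list, whereas the real tokenizer keeps the closing `[SEP]`).

```python
PAD_ID = 0  # id of [PAD] in the output above

def pad_and_mask(token_ids, max_length):
    """Truncate or right-pad token_ids to max_length and build the attention mask."""
    ids = list(token_ids[:max_length])
    n_real = len(ids)
    n_pad = max_length - n_real
    return ids + [PAD_ID] * n_pad, [1] * n_real + [0] * n_pad

# The 20 non-padding ids printed above ([CLS] = 101, [SEP] = 102)
ids = [101, 2129, 2000, 3345, 1996, 2944, 1010, 11082, 2298, 2012,
       2129, 1037, 4738, 2944, 18422, 2015, 2049, 17547, 1012, 102]

padded, mask = pad_and_mask(ids, max_length=100)
print(len(padded), sum(mask))
```

The mask sums to the number of real tokens, which is why the model can ignore the 80 padded positions.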

# Fine-tuning the pre-trained BERT model

## Encoding the text with the BERT tokenizer to obtain the input_ids and attention masks fed into the model

In [None]:
sentences = data['text']
labels = data['label']
len(sentences), len(labels)

(115864, 115864)

In [None]:
input_ids = []
attention_masks = []

for sent in sentences:
    bert_inp = bert_tokenizer.encode_plus(sent,
                                          add_special_tokens = True,
                                          max_length = 100,
                                          truncation = True,
                                          padding = 'max_length',
                                          return_attention_mask = True)
    input_ids.append(bert_inp['input_ids'])
    attention_masks.append(bert_inp['attention_mask'])

input_ids = np.asarray(input_ids)
attention_masks = np.array(attention_masks)
labels = np.array(labels)

In [None]:
len(input_ids), len(attention_masks), len(labels)

## Saving and loading the data into the pickle files

In [18]:
# print('Preparing the pickle file.....')
# 
pickle_inp_path='/content/gdrive/MyDrive/Cyber-bullying-project/data/bert_inp.pkl'
pickle_mask_path='/content/gdrive/MyDrive/Cyber-bullying-project/data/bert_mask.pkl'
pickle_label_path='/content/gdrive/MyDrive/Cyber-bullying-project/data/bert_label.pkl'
# 
# pickle.dump((input_ids), open(pickle_inp_path,'wb'))
# pickle.dump((attention_masks), open(pickle_mask_path,'wb'))
# pickle.dump((labels), open(pickle_label_path,'wb'))
# 
# print('Pickle files saved as ', pickle_inp_path, pickle_mask_path, pickle_label_path)

In [19]:
print('Loading the saved pickle files..')

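# Use context managers so the file handles are closed after loading
with open(pickle_inp_path, 'rb') as f:
    input_ids = pickle.load(f)
with open(pickle_mask_path, 'rb') as f:
    attention_masks = pickle.load(f)
with open(pickle_label_path, 'rb') as f:
    labels = pickle.load(f)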

print('Input shape {} \nAttention mask shape {} \nInput label shape {}'.format(input_ids.shape, attention_masks.shape, labels.shape))

Loading the saved pickle files..
Input shape (115864, 100) 
Attention mask shape (115864, 100) 
Input label shape (115864,)


## Splitting into train, test and validation sets

In [20]:
train_inp, test_inp, train_label, test_label, train_mask, test_mask = train_test_split(input_ids,
                                                                                    labels,
                                                                                    attention_masks,
                                                                                    test_size=0.2, 
                                                                                    stratify = labels)

print('Train inp shape {} Test input shape {}\nTrain label shape {} Test label shape {}\nTrain attention mask shape {} Test attention mask shape {}'.format(train_inp.shape, test_inp.shape, train_label.shape, test_label.shape, train_mask.shape, test_mask.shape))

Train inp shape (92691, 100) Test input shape (23173, 100)
Train label shape (92691,) Test label shape (23173,)
Train attention mask shape (92691, 100) Test attention mask shape (23173, 100)


In [21]:
train_inp, val_inp, train_label, val_label, train_mask, val_mask = train_test_split(train_inp,
                                                                                    train_label,
                                                                                    train_mask,
                                                                                    test_size=0.2,
                                                                                    stratify = train_label)

print('Train inp shape {} Val input shape {}\nTrain label shape {} Val label shape {}\nTrain attention mask shape {} Val attention mask shape {}'.format(train_inp.shape, val_inp.shape, train_label.shape, val_label.shape, train_mask.shape, val_mask.shape))

Train inp shape (74152, 100) Val input shape (18539, 100)
Train label shape (74152,) Val label shape (18539,)
Train attention mask shape (74152, 100) Val attention mask shape (18539, 100)
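
Two successive 80/20 splits leave roughly 64% of the data for training, 16% for validation and 20% for testing. The arithmetic can be checked with a short sketch, assuming the held-out size is the ceiling of `test_size * n_samples` (`split_sizes` is a hypothetical helper; under that assumption it reproduces the shapes printed above).

```python
import math

def split_sizes(n_samples, test_size):
    """Sizes of a float test_size split: the held-out part is rounded up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_total = 115864
n_train_full, n_test = split_sizes(n_total, 0.2)   # first split: train+val vs test
n_train, n_val = split_sizes(n_train_full, 0.2)    # second split: train vs val

print(n_train, n_val, n_test)
print([round(n / n_total, 3) for n in (n_train, n_val, n_test)])
```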


## Setting up the loss, metric and the optimizer

Read about callbacks: https://keras.io/api/callbacks/

In [29]:
log_dir = 'tensorboard_data/tb_bert'
model_save_path = '/content/gdrive/MyDrive/Cyber-bullying-project/models/bert_model2.h5'

callbacks = [tf.keras.callbacks.ModelCheckpoint(filepath = model_save_path,
                                                save_weights_only = True,
                                                monitor = 'val_loss',
                                                mode = 'min',
                                                save_best_only=True),
             tf.keras.callbacks.TensorBoard(log_dir = log_dir)]

print('\nBert Model', bert_model.summary())

loss = tf.keras.losses.BinaryCrossentropy(from_logits=False)
metric = tf.keras.metrics.BinaryAccuracy('accuracy')
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5,
                                     epsilon=1e-08)

bert_model.compile(loss = loss, optimizer = optimizer, metrics = [metric])

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  109482240 
_________________________________________________________________
dropout_37 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  1538      
Total params: 109,483,778
Trainable params: 109,483,778
Non-trainable params: 0
_________________________________________________________________

Bert Model None
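
A note on the loss configured in this section: for a label y and a predicted probability p, binary cross-entropy is -[y log p + (1 - y) log(1 - p)]. With `from_logits=False`, Keras's `BinaryCrossentropy` interprets the predictions as probabilities; with `from_logits=True` it treats them as raw logits. Since `TFBertForSequenceClassification` returns raw logits, it may be worth checking which setting matches the model output. The sketch below uses plain Python (not the Keras implementation) and a hypothetical logit value to show that the two formulations agree once the sigmoid is applied.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_prob(y_true, p):
    """Binary cross-entropy when the prediction is already a probability."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def bce_from_logit(y_true, z):
    """Same loss computed directly from a raw logit (numerically stable form)."""
    return max(z, 0.0) - z * y_true + math.log1p(math.exp(-abs(z)))

logit = 2.0  # hypothetical raw model output for one example
print(bce_from_prob(1, sigmoid(logit)))
print(bce_from_logit(1, logit))
```

Feeding a raw logit straight into the probability form would be a different computation, and it is undefined for values outside the interval (0, 1).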


## Training the model

In [25]:
history = bert_model.fit(x = [train_inp, train_mask],
                         y = train_label,
                         batch_size = 32,
                         epochs = 2,
                         validation_data = ([val_inp, val_mask], val_label), 
                         callbacks = callbacks
                         )

# Alberto's model:
#history = model.fit(
#    x={'input_ids': x['input_ids']},
#    y={'oh_label': y[:,0]},
#    validation_split=0.2,
#    batch_size=64,
#    epochs=2)

Epoch 1/2
Epoch 2/2


# Evaluating the performance of the model

In [26]:
#%load_ext tensorboard

In [28]:
#log_dir='tensorboard_data/bert_model'
#%tensorboard --logdir {log_dir}

In [30]:
bert_model.save_weights(model_save_path)

In [31]:
trained_model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
trained_model.compile(loss = loss,
                      optimizer = optimizer, 
                      metrics = [metric])
trained_model.load_weights(model_save_path)

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [32]:
preds = trained_model.predict([test_inp, test_mask],
                              batch_size=32)



In [33]:
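# preds[0] is expected to hold the logits, one row per test example;
# argmax over each row gives the predicted class
pred_labels = [np.argmax(pred) for pred in preds[0]]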
f1 = f1_score(test_label, pred_labels)

In [34]:
print('F1 score', f1)
print('Classification Report')
print(classification_report(test_label, pred_labels))

F1 score 0.7620931487142599
Classification Report
              precision    recall  f1-score   support

           0       0.96      0.98      0.97     20217
           1       0.81      0.72      0.76      2956

    accuracy                           0.94     23173
   macro avg       0.89      0.85      0.86     23173
weighted avg       0.94      0.94      0.94     23173
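
The evaluation above takes the argmax over the two outputs per example and then scores the positive class. A minimal pure-Python sketch of the same two steps, using made-up logits and labels (`argmax` and `binary_f1` are hypothetical helpers, not the scikit-learn implementation):

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def binary_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up logits (one row per example, one column per class) and labels
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4], [0.0, 2.0]]
toy_true = [0, 1, 0, 1, 1]

toy_pred = [argmax(row) for row in toy_logits]
precision, recall, f1 = binary_f1(toy_true, toy_pred)
print(toy_pred, precision, recall, f1)
```

Plugging the rounded precision (0.81) and recall (0.72) for class 1 from the report into the same harmonic-mean formula gives approximately 0.76, in line with the reported F1 score.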

