# About this Notebook

The goal of this notebook is to build a classifier on top of a pre-trained BERT model to detect toxic comments. The data comes from a series of Kaggle competitions on classifying Wikipedia comments as toxic or nontoxic, and was sourced from Google and Jigsaw.

Though the full dataset includes non-English comments, I will restrict myself to English-only comments for this iteration.

For metrics, I will focus on the area under both the ROC and precision-recall curves. In addition, I will look at overall accuracy and perhaps the confusion matrix and performance across the different flavors of toxicity.
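
As a rough illustration of what these metrics measure, here is a dependency-free sketch on made-up labels and scores. It is not the `toxicity.metrics` implementation used later in the notebook; the function names and toy data are hypothetical, and the average-precision helper assumes no tied scores.

```python
from typing import Dict, Sequence


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the chance that a random positive outscores a random negative.

    Ties count as half a win. Quadratic in the number of examples, so this is
    only suitable for small illustrations.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise area under the precision-recall curve (assumes no tied scores)."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("need at least one positive example")
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at each recalled positive
    return total / n_pos


def confusion_counts(y_true: Sequence[int], scores: Sequence[float],
                     threshold: float = 0.5) -> Dict[str, int]:
    """Confusion-matrix cells after thresholding the scores."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for label, score in zip(y_true, scores):
        predicted = int(score > threshold)
        if predicted and label:
            counts["tp"] += 1
        elif predicted and not label:
            counts["fp"] += 1
        elif not predicted and label:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


# Toy data: 1 = toxic, 0 = nontoxic (made up for illustration)
toy_labels = [0, 0, 1, 1, 0, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

cells = confusion_counts(toy_labels, toy_scores)
accuracy = (cells["tp"] + cells["tn"]) / len(toy_labels)
print("ROC AUC:", roc_auc(toy_labels, toy_scores))
print("Average precision:", average_precision(toy_labels, toy_scores))
print("Confusion counts:", cells, "accuracy:", accuracy)
```

Because toxic comments are the minority class, accuracy alone can look good for a model that rarely predicts "toxic", which is one reason to report the precision-recall view alongside ROC.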

Credits:
- https://www.kaggle.com/tanulsingh077/deep-learning-for-nlp-zero-to-transformers-bert
- https://www.kaggle.com/jagangupta/stop-the-s-toxic-comments-eda
- https://www.kaggle.com/clinma/eda-toxic-comment-classification-challenge
- https://www.kaggle.com/abhi111/naive-bayes-baseline-and-logistic-regression

In [None]:
from tqdm import tqdm
import numpy as np
import pandas as pd
%matplotlib inline
  
pd.options.display.max_rows = 999

#Uncomment below if running in colab
#!pip install tokenizers
#!pip install transformers



## Install toxicity package

In [None]:
#Run below if toxicity package is not installed
#!pip install --upgrade git+https://github.com/jkchandalia/toxic-comment-classifier.git@fe5dfe51f09322c166cce0a56818f66a2a2fc5c7


In [None]:
from toxicity import constants, data, features, metrics, visualize, model, text_preprocessing, model_BERT

## Load data

In [None]:
#Mount drive if using google colab nb; skipped when google.colab is unavailable
try:
    from google.colab import drive
    drive.mount('/content/drive')
except ImportError:
    pass

In [None]:
#Use below for local
pre_path = './../'
#Use below for paperspace
#pre_path = '/storage/'
#Use below for colab with drive mounted
#pre_path = '/content/drive/My Drive/toximeter_project/'
input_data_path = pre_path+constants.INPUT_PATH
df_train = data.load(input_data_path, filter=False)

train_full = df_train.copy()
#df_train = df_train.loc[:10000,:]
print("Sample Toxic Comments: ")
print(df_train.comment_text[df_train.toxic==1][1:2].values)
print("Breakdown of nontoxic/toxic comments: ")
df_train.toxic.value_counts()


In [None]:
xtrain, xvalid, ytrain, yvalid = model.make_train_test(df_train)

In [None]:
xtrain.shape

## BERT

This section uses Hugging Face's tokenizer and DistilBERT model. Tokenizer documentation:
https://huggingface.co/transformers/main_classes/tokenizer.html

### Setup basic training configs

In [None]:
#IMP DATA FOR CONFIG
#AUTO = tf.data.experimental.AUTOTUNE

# Configuration
EPOCHS = 120
BATCH_SIZE = 64
MAX_LEN = 512

## Data Preparation/Tokenization
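
The model expects every comment as a fixed-length sequence of integer token ids. The internals of `model_BERT.fast_encode` are not shown in this notebook, so the sketch below only illustrates the general idea of truncating and padding to a maximum length. It uses a toy whitespace tokenizer with a hypothetical vocabulary and padding id; a real BERT-style tokenizer splits text into subword pieces and adds special tokens such as `[CLS]` and `[SEP]`.

```python
from typing import Dict, List

PAD_ID = 0  # hypothetical id reserved for padding
UNK_ID = 1  # hypothetical id for tokens missing from the vocabulary


def build_vocab(texts: List[str]) -> Dict[str, int]:
    """Give each whitespace-separated token an integer id (toy stand-in for subwords)."""
    vocab: Dict[str, int] = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab) + 2)  # ids 0 and 1 are reserved
    return vocab


def encode_fixed_length(texts: List[str], vocab: Dict[str, int],
                        max_len: int) -> List[List[int]]:
    """Map each text to exactly max_len ids by truncating or right-padding."""
    encoded = []
    for text in texts:
        ids = [vocab.get(token, UNK_ID) for token in text.lower().split()]
        ids = ids[:max_len]                     # truncate long comments
        ids += [PAD_ID] * (max_len - len(ids))  # pad short comments
        encoded.append(ids)
    return encoded


comments = ["you are great", "this is a much longer comment than the others"]
vocab = build_vocab(comments)
batch = encode_fixed_length(comments, vocab, max_len=5)
for row in batch:
    print(row)
```

Every row has the same length, so the batch can be stacked into a single 2-D array before being passed to the model.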




In [None]:
x_train = model_BERT.fast_encode(xtrain.astype(str), model_BERT.fast_tokenizer)
x_valid = model_BERT.fast_encode(xvalid.astype(str), model_BERT.fast_tokenizer)
#x_test = fast_encode(test.content.astype(str), fast_tokenizer, maxlen=MAX_LEN)

y_train = ytrain
y_valid = yvalid

## Build Models

In [None]:
build_model = model_BERT.build_BERT_model_classification
build_model_lstm = model_BERT.build_BERT_model_lstm

In [None]:
model_classification = build_model(model_BERT.transformer_layer)
model_classification.summary()


In [None]:
model_lstm = build_model_lstm(model_BERT.transformer_layer)
model_lstm.summary()


## Callbacks

In [None]:
project_name = 'check_output'
callbacks = model_BERT.make_callbacks(pre_path, project_name)

## Start Training


In [None]:
train_history = model_lstm.fit(
    x_train,
    y_train,
    batch_size=BATCH_SIZE,
    validation_data=(x_valid, y_valid),
    epochs=EPOCHS,
    callbacks=callbacks,
)

In [None]:
y_pred = model_lstm.predict(x_valid)


In [None]:
from toxicity.metrics import run_metrics
#Pass both the predictions thresholded at 0.5 and the raw predicted scores
run_metrics(y_pred > 0.5, y_pred, y_valid, visualize=True)