# Toxicity Classification

## Objective
Build a toxicity classification model to demonstrate transfer learning, a custom-built model, improvement strategies, and the relevant NLP and deep learning techniques.

## Introduction 
Toxicity classification is a common NLP problem, and a toxicity dataset is available in TensorFlow. The problem is easy to understand and sufficient to demonstrate key NLP and deep learning techniques, including but not limited to transfer learning, text embedding, and neural network architectures. We will build a transfer learning model and custom-built models. Some interesting performance results will be shown, but the focus is on the model architectures. The architectures presented here are applicable not only to this problem but also to a wide range of applications, which we discuss at the end.

## Outline
1. Data and problem description.
2. Use a TensorFlow pre-trained text embedding to train a toxicity classification model with the TensorFlow NLP dataset.
3. Build a basic model from scratch and compare the performance.
4. Propose an improvement strategy, build the model and compare the performance.
5. Discussion - Application & Improvement.

## Data and problem description

The dataset 'wikipedia_toxicity_subtypes' is taken from the TensorFlow NLP datasets. We use the text and toxicity fields, where toxicity is either 1 or 0, indicating whether the text is toxic or not, and build a binary classifier for toxicity detection.

In [1]:
# General
import numpy as np 
import pandas as pd 
import os

# TensorFlow
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds

# Data Processing 
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Warnings
import warnings
warnings.filterwarnings("ignore")

# Tabulate
from tabulate import tabulate

# Evaluation
from sklearn.metrics import classification_report



In [2]:
#Read data
ds = tfds.load('wikipedia_toxicity_subtypes', split='train')
ds = tfds.as_dataframe(ds)

#Basic data information
print(ds.shape)
print(ds.language.value_counts())
print(ds.toxicity.value_counts())

#Encode the label as integer values [Non-toxic = 0; Toxic = 1]
lab_enc = LabelEncoder()

#Applying to the dataset
ds['label'] = lab_enc.fit_transform(ds['toxicity'])

#Decode text
ds['text'] = ds['text'].str.decode("utf-8")
ds = ds[['text', 'label']]
ds.head()



(159571, 9)
b'en'    159571
Name: language, dtype: int64
0.0    144277
1.0     15294
Name: toxicity, dtype: int64


|   | text | label |
|---|------|-------|
| 0 | "\nThanks Xeno. - • Talk • " | 0 |
| 1 | 2009 (UTC)\nFixed 03:36, 8 June | 0 |
| 2 | Question\nWhat was wrong with the repair I did? | 0 |
| 3 | I agree myself now, actually. (Amazing how the... | 0 |
| 4 | Kisumu \n\nI saw that you contributed to Kisum... | 0 |


### Remark
The text could be cleaned before modeling, for example with the clean function of the text processing package texthero. Cleaning may not yield better results, though, so it is left as a future area to explore.
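
As an illustration of the kind of cleaning meant here, the following is a minimal sketch using only the standard library. It is not texthero's implementation; the function name `clean_text` and the specific rules (lowercasing, ASCII letters and digits only) are assumptions chosen for this English-only dataset.

```python
import re

def clean_text(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # keep ASCII letters, digits, whitespace
    text = re.sub(r"\s+", " ", text)           # newlines and runs of spaces -> one space
    return text.strip()

sample = "Question\nWhat was wrong with the repair I did?"
print(clean_text(sample))
```

Such a step would be applied to `ds['text']` before the train/test split, so that every model sees the same cleaned text.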

## Use a TensorFlow pre-trained text embedding to train a toxicity classification model with the TensorFlow NLP dataset.

In [24]:
# Train Test Split
x_train,x_test,y_train,y_test = train_test_split(ds['text'], ds.label, test_size=0.1, random_state=0)

In [27]:
# Pre-Trained Text Embedding Model & Layer Definition
Embed = 'https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1'
Trainable_Module = True
hub_layer = hub.KerasLayer(Embed, input_shape=[], dtype=tf.string, trainable=Trainable_Module)

In [28]:
model = tf.keras.Sequential()
model.add(hub_layer)           #pre-trained text embedding layer
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1))

In [29]:
print(" -- Model Summary --")
model.summary()

 -- Model Summary --
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 keras_layer (KerasLayer)    (None, 20)                400020    
                                                                 
 dense (Dense)               (None, 16)                336       
                                                                 
 dense_1 (Dense)             (None, 1)                 17        
                                                                 
Total params: 400,373
Trainable params: 400,373
Non-trainable params: 0
_________________________________________________________________


### Description of the model
The pre-trained TensorFlow model is a small text embedding with 20 dimensions, trained on a 130GB corpus (20,000-token vocabulary) using the Swivel matrix factorization method. On top of the pre-trained model, we add a dense layer of 16 units and then an output layer of 1 unit.
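
The parameter counts in the model summary above can be checked by hand. The sketch below assumes the embedding table holds the 20,000 vocabulary rows plus one extra row for out-of-vocabulary tokens, which is what the count of 400,020 suggests; the helper name `dense_params` is only for illustration.

```python
# Assumption: 20,000 vocabulary rows plus 1 out-of-vocabulary row, 20 dims each.
VOCAB_SIZE = 20_000
OOV_BUCKETS = 1
EMBED_DIM = 20
HIDDEN_UNITS = 16

def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

embedding_params = (VOCAB_SIZE + OOV_BUCKETS) * EMBED_DIM   # keras_layer
hidden_params = dense_params(EMBED_DIM, HIDDEN_UNITS)       # dense
output_params = dense_params(HIDDEN_UNITS, 1)               # dense_1
total_params = embedding_params + hidden_params + output_params

print(embedding_params, hidden_params, output_params, total_params)
# 400020 336 17 400373
```

Almost all trainable parameters sit in the embedding table, which is why setting `trainable=True` on the hub layer matters for fine-tuning.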

In [42]:
# Model Compile
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy']
              )

In [43]:
EPOCHS = 10           
BATCH_SIZE = 256      

history = model.fit(x_train,y_train, batch_size = BATCH_SIZE,
                    epochs = EPOCHS, validation_split= 0.1,
                    verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [44]:
accr = model.evaluate(x_test,y_test, verbose=0)
tab_data = [ [ "Model Trained", '{:.2%}'.format(accr[0]), '{:.2%}'.format(accr[1]) ]]
print(tabulate(tab_data, headers=['','LOSS','ACCURACY'], tablefmt='pretty'))

+---------------+--------+----------+
|               |  LOSS  | ACCURACY |
+---------------+--------+----------+
| Model Trained | 30.21% |  93.78%  |
+---------------+--------+----------+


In [None]:
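# Note: the model outputs logits (from_logits=True), so el[0] > 0.5 means probability > sigmoid(0.5) ≈ 0.62
y_pred = [int(el[0] > 0.5) for el in model.predict(x_test)]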
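
Because the last layer has no activation and the loss is configured with `from_logits=True`, `model.predict` returns logits rather than probabilities. The sketch below uses made-up logit values (not model output) to show how a 0.5 cutoff behaves on raw logits versus on sigmoid probabilities; `predict_labels` is a hypothetical helper.

```python
import math

def sigmoid(z: float) -> float:
    """Map a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_labels(logits, prob_threshold=0.5):
    """Threshold on the probability scale."""
    return [int(sigmoid(z) > prob_threshold) for z in logits]

example_logits = [-2.0, 0.2, 0.5, 3.0]  # made-up values for illustration

labels_from_probs = predict_labels(example_logits)         # cutoff at probability 0.5
labels_from_raw = [int(z > 0.5) for z in example_logits]   # cutoff at logit 0.5

print(labels_from_probs)
print(labels_from_raw)
print(round(sigmoid(0.5), 4))
```

A cutoff of 0.5 on the logit is a stricter rule for the positive class than a cutoff of 0.5 on the probability, so it tends to trade recall for precision.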

### Performance Metric
Accuracy alone is not sufficient here: about 90% of the comments are non-toxic, so a model that always predicts 0 would already reach roughly 90% accuracy. Precision, recall and f1-score give a better picture of the performance. I will use the two f1-scores (for classes 0 and 1) to compare the models.

In [73]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.96      0.97      0.97     14422
           1       0.71      0.60      0.65      1536

    accuracy                           0.94     15958
   macro avg       0.83      0.79      0.81     15958
weighted avg       0.93      0.94      0.94     15958
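
To make the numbers in the report above concrete, this short sketch recomputes the toxic-class f1-score from its precision and recall, and the accuracy of a trivial classifier that always predicts the majority class. The inputs are copied from the report; the function name `f1_score` is just for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall of the toxic class (label 1) from the report above.
f1_toxic = f1_score(0.71, 0.60)

# Accuracy of always predicting "non-toxic", from the support counts above.
n_non_toxic, n_total = 14422, 15958
majority_baseline = n_non_toxic / n_total

print(f"toxic-class f1: {f1_toxic:.2f}")
print(f"always-negative accuracy: {majority_baseline:.1%}")
```

The model's 94% accuracy is only a few points above the always-negative baseline, which is why the per-class f1-scores are the more informative comparison.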



## Build a basic model from scratch and compare the performance.

### Model Architecture
The module 'dnn_dense_plus_sparse_module' is a deep learning architecture I built using TensorFlow. It can process sparse features and dense features, build neural network layers on each, and combine the two. In this section, we focus on the sparse (TFIDF) features of the text. Inside the module, sparse matrix multiplication is implemented with TensorFlow, so it is fast.

The neural network structure is:
TFIDF Input -> Hidden Layer (SPARSE_HIDDEN_DIM1=500, specified below) -> Logits (dimension 2) -> Output (softmax over the logits, so it also has dimension 2; the two entries are the probabilities of classes 0 and 1).

Note: This structure is defined in dnn_dense_plus_sparse_module.sparse_graph_sparseOnly.
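
The module's source is not shown in this notebook, so the following is only a sketch of the data flow described above, written in plain Python with toy sizes. The ReLU activation, the weight initialization and all function names are assumptions, not the module's actual code.

```python
import math
import random

def sparse_dense_matmul(sparse_row, weights, bias):
    """Multiply one sparse row {column_index: value} by a dense matrix.

    Only the non-zero entries are visited, which is what makes sparse
    multiplication cheap for TF-IDF vectors.
    """
    out = list(bias)
    for col, value in sparse_row.items():
        for j, w in enumerate(weights[col]):
            out[j] += value * w
    return out

def dense(v, weights, bias):
    """Fully connected layer: v @ weights + bias."""
    return [sum(x * row[j] for x, row in zip(v, weights)) + bias[j]
            for j in range(len(bias))]

def relu(v):
    return [max(0.0, x) for x in v]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
VOCAB, HIDDEN, CLASSES = 1000, 8, 2  # toy sizes; the notebook uses HIDDEN=500

def init(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

W1, b1 = init(VOCAB, HIDDEN), [0.0] * HIDDEN
W2, b2 = init(HIDDEN, CLASSES), [0.0] * CLASSES

tfidf_row = {3: 0.61, 42: 0.35, 977: 0.71}  # made-up non-zero TF-IDF entries
hidden = relu(sparse_dense_matmul(tfidf_row, W1, b1))
probs = softmax(dense(hidden, W2, b2))
print(probs)  # two probabilities: class 0 and class 1
```

With three non-zero entries out of 1,000 columns, the first layer touches 3 rows of `W1` instead of all 1,000, which is the saving a sparse multiplication provides.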

In [56]:
%load_ext autoreload
%autoreload 2

from dnn_dense_plus_sparse_module import DnnModel
from dnn_dense_plus_sparse_module import CreateFeatures

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [4]:
root_path = '/Users/jili/Library/CloudStorage/GoogleDrive-hamlet.j@gmail.com/My Drive/DS/bestegg/'

# The CreateFeatures class of the custom-built module can create a TFIDF model and generate TFIDF for given text
cf = CreateFeatures()
cf.tfidf_fit(ds.text, root_path + '/objects/tfidf.pkl')
vecs = cf.generate_tfidf_from_text(ds.text, root_path + '/objects/tfidf.pkl')

Load Pre-trained TFIDF Vectorizer
Get The TFIDF of Texts
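
The internals of `CreateFeatures` are not shown here. As a rough illustration of what a TF-IDF feature is, the sketch below computes smoothed, L2-normalized TF-IDF weights in plain Python, following the formulation that scikit-learn's `TfidfVectorizer` uses by default. The whitespace tokenization and the function names are simplifying assumptions.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1.0
            for term, df in doc_freq.items()}

def tfidf_vector(doc, idf):
    """Sparse TF-IDF for one document as {term: weight}, L2-normalized."""
    counts = Counter(t for t in doc.split() if t in idf)  # unseen terms are dropped
    raw = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm else {}

docs = ["you are great", "you are wrong", "great repair"]  # toy corpus
idf = fit_idf(docs)
vec = tfidf_vector("you are wrong wrong", idf)
print(vec)
```

Terms that are frequent in a document but rare in the corpus get the largest weights, and each document vector is mostly zeros, which is why the features are stored as a sparse matrix.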


Prepare train and test sets

In [6]:
# Create one-hot target variables
targets = (pd.get_dummies(ds.label)).to_numpy()

# Split train and test
import random
random.seed(0)
n = len(ds)
test_idx = sorted(random.sample(range(n), int(0.2*n)))
train_idx = sorted(list(set(range(n)) - set(test_idx)))

vecs_train = vecs[train_idx]
vecs_test = vecs[test_idx]
trainY = targets[train_idx, :]
testY = targets[test_idx, :]

In [26]:
# Run the dnn model
num_eps = 3
batch_size = 16
train_steps = int(len(train_idx)/batch_size)+1
model_name = 'dnn_sparse221028'

dnn = DnnModel(LEARNING_RATE=0.001,
               BATCH_SIZE=batch_size,
               EVA_STEP=500,
               SAVE_STEP=train_steps,
               NUM_EPOCHS=num_eps,
               BETA=0.00000001,
               KEEP_PROB = 0.7)

dnn.dnn_train_sparseOnly(vecs_train,
                      trainY,
                      root_path+'objects/'+model_name,
                      SPARSE_HIDDEN_DIM1=500, 
                      REGULARIZATION=True)

Graph started running...
There will be 3 epochs. Each eopch will have 7979 steps.
Average loss at Epoch 0 and Step 499 is: 1.807281
Average loss at Epoch 0 and Step 999 is: 1.199665
Average loss at Epoch 0 and Step 1499 is: 0.974012
Average loss at Epoch 0 and Step 1999 is: 0.859267
Average loss at Epoch 0 and Step 2499 is: 0.702072
Average loss at Epoch 0 and Step 2999 is: 0.645045
Average loss at Epoch 0 and Step 3499 is: 0.620844
Average loss at Epoch 0 and Step 3999 is: 0.575954
Average loss at Epoch 0 and Step 4499 is: 0.506442
Average loss at Epoch 0 and Step 4999 is: 0.518207
Average loss at Epoch 0 and Step 5499 is: 0.499323
Average loss at Epoch 0 and Step 5999 is: 0.415307
Average loss at Epoch 0 and Step 6499 is: 0.409736
Average loss at Epoch 0 and Step 6999 is: 0.378404
Average loss at Epoch 0 and Step 7499 is: 0.352180
Model saved to: /Users/jili/Library/CloudStorage/GoogleDrive-hamlet.j@gmail.com/My Drive/DS/bestegg/objects/dnn_sparse221028
Average loss at Epoch 1 and St

In [27]:
# Evaluate
modelidx = int((train_steps * num_eps))
probs = dnn.dnn_eval_sparseOnly(vecs_test,
                     testY,
                     root_path+'objects/',
                     model_name+'-'+str(modelidx)+'.meta')
preds = np.argmax(probs, axis=1)

INFO:tensorflow:Restoring parameters from /Users/jili/Library/CloudStorage/GoogleDrive-hamlet.j@gmail.com/My Drive/DS/bestegg/objects/dnn_sparse221028-23937



Test loss: 0.23965509
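
The step count reported in the training log and the numeric suffix of the restored checkpoint both follow from the split sizes. The sketch below repeats the arithmetic of the cells above using the dataset size printed earlier.

```python
n_rows = 159_571                 # dataset size printed in the data cell
n_test = int(0.2 * n_rows)       # same expression as in the split cell
n_train = n_rows - n_test
batch_size = 16
num_eps = 3

steps_per_epoch = int(n_train / batch_size) + 1
final_step = steps_per_epoch * num_eps   # suffix of the restored checkpoint

print(n_test, n_train, steps_per_epoch, final_step)
# 31914 127657 7979 23937
```

Note that `int(n / batch_size) + 1` counts one step too many whenever the training set size is an exact multiple of the batch size; `math.ceil(n / batch_size)` would avoid that edge case. It does not matter for the sizes used here.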


In [28]:
print(classification_report([el[1] for el in testY], preds))

              precision    recall  f1-score   support

           0       0.95      0.99      0.97     28837
           1       0.86      0.50      0.63      3077

    accuracy                           0.94     31914
   macro avg       0.90      0.74      0.80     31914
weighted avg       0.94      0.94      0.94     31914



### Remark
This custom-built model produces decent results compared with the pre-trained transfer learning model above. Its f1-score on the positive class is 0.63, slightly below the previous model's 0.65, and its f1-score on the negative class, 0.97, is the same. Note that the two models are evaluated on different test splits (10% vs. 20% of the data), so the comparison is approximate.

## Propose an improvement strategy, build the model and compare the performance

The strategy I propose here is to combine the sparse (TFIDF) and dense (embedding from the pre-trained model) features. To do so, the module 'dnn_dense_plus_sparse_module' I built will
1. create neural network layers on top of the sparse and dense features respectively,
2. concatenate the two latent vectors (from the sparse and dense layers respectively), and
3. build neural network layers on top of the concatenated features for the final predictions.

### Model Architecture

The sparse branch is similar to before: TFIDF Input -> Hidden Layer (SPARSE_HIDDEN_DIM1=500, specified below) -> Hidden Layer (SPARSE_HIDDEN_DIM2=200, specified below)

The dense branch is: Embedding Input -> Hidden Layer (DENSE_HIDDEN_DIM1=500) -> Hidden Layer (DENSE_HIDDEN_DIM2=200)

Concatenate: Concatenate the sparse and dense layer (dim==400) -> Hidden Layer (CONCAT_HIDDEN_DIM=200) -> Logits (dim==2) -> Output Layer (by softmax, dim==2). 

Note: This model structure is defined in dnn_dense_plus_sparse_module.dnn_train. 
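
To illustrate the data flow of this two-branch structure, here is a small plain-Python sketch with toy layer sizes. It is not the code of `dnn_dense_plus_sparse_module`; the ReLU activations, the random initialization and every name in it are assumptions made for the example.

```python
import math
import random

rng = random.Random(0)

def init(n_in, n_out):
    """Small random weight matrix (n_in x n_out) and a zero bias."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]
    return weights, [0.0] * n_out

def affine(v, layer):
    weights, bias = layer
    return [sum(x * row[j] for x, row in zip(v, weights)) + bias[j]
            for j in range(len(bias))]

def dense_relu(v, layer):
    return [max(0.0, z) for z in affine(v, layer)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy sizes; the notebook uses 500 -> 200 per branch and 200 after concatenation.
TFIDF_DIM, EMBED_DIM = 50, 20
H1, H2, H_CONCAT, CLASSES = 12, 6, 6, 2

sparse_layers = [init(TFIDF_DIM, H1), init(H1, H2)]
dense_layers = [init(EMBED_DIM, H1), init(H1, H2)]
concat_layer = init(2 * H2, H_CONCAT)
logit_layer = init(H_CONCAT, CLASSES)

def forward(tfidf_vec, embed_vec):
    s = tfidf_vec
    for layer in sparse_layers:      # sparse (TFIDF) branch
        s = dense_relu(s, layer)
    d = embed_vec
    for layer in dense_layers:       # dense (embedding) branch
        d = dense_relu(d, layer)
    joined = s + d                   # concatenation: length 2 * H2
    hidden = dense_relu(joined, concat_layer)
    return softmax(affine(hidden, logit_layer))

tfidf_vec = [rng.random() for _ in range(TFIDF_DIM)]
embed_vec = [rng.uniform(-1, 1) for _ in range(EMBED_DIM)]
probs = forward(tfidf_vec, embed_vec)
print(probs)
```

Each branch first compresses its own input into a latent vector of the same size, so neither feature type dominates the concatenated representation purely because of its input dimensionality.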

In [5]:
# Get the dense embedding from the pre-trained model
Embed = 'https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1'
embed = hub.load(Embed)
emb = embed(ds.text.to_numpy())
emb = emb.numpy()

In [10]:
%load_ext autoreload
%autoreload 2

from dnn_dense_plus_sparse_module import DnnModel
from dnn_dense_plus_sparse_module import CreateFeatures
root_path = '/Users/jili/Library/CloudStorage/GoogleDrive-hamlet.j@gmail.com/My Drive/DS/bestegg/'

# Use the previously fitted TFIDF model to generate the sparse (TFIDF) features
cf = CreateFeatures()
#cf.tfidf_fit(ds.text, root_path + '/objects/tfidf.pkl')
vecs = cf.generate_tfidf_from_text(ds.text, root_path + '/objects/tfidf.pkl')

Instructions for updating:
non-resource variables are not supported in the long term



Load Pre-trained TFIDF Vectorizer
Get The TFIDF of Texts


Prepare Train and Test Data

In [12]:
# Create one-hot target variables
targets = (pd.get_dummies(ds.label)).to_numpy()

# Split train and test
import random
random.seed(0)
n = len(ds)
test_idx = sorted(random.sample(range(n), int(0.2*n)))
train_idx = sorted(list(set(range(n)) - set(test_idx)))

emb_train = emb[train_idx]
emb_test = emb[test_idx]
vecs_train = vecs[train_idx]
vecs_test = vecs[test_idx]
trainY = targets[train_idx, :]
testY = targets[test_idx, :]

In [19]:
# Run the dnn model
num_eps = 100
batch_size = 16
train_steps = int(len(trainY)/batch_size)+1
model_name = 'dnn_dense_plus_sparse221028'

dnn = DnnModel(LEARNING_RATE=0.001,
               BATCH_SIZE=batch_size,
               EVA_STEP=2000,
               SAVE_STEP=train_steps*num_eps/2,
               NUM_EPOCHS=num_eps,
               BETA=0.00000001,
               KEEP_PROB = 0.7)

dnn.dnn_train(emb_train,
              vecs_train,
              trainY,
              root_path+'objects/'+model_name,
              DENSE_HIDDEN_DIM1=500,
              DENSE_HIDDEN_DIM2=200,
              SPARSE_HIDDEN_DIM1=500, 
              SPARSE_HIDDEN_DIM2=200,
              CONCAT_HIDDEN_DIM=200,
              REGULARIZATION=True)

Graph started running...
There will be 100 epochs. Each eopch will have 7979 steps.
Average loss at Epoch 0 and Step 1999 is: 30.119905
Average loss at Epoch 0 and Step 3999 is: 11.324261
Average loss at Epoch 0 and Step 5999 is: 9.890114
Average loss at Epoch 1 and Step 20 is: 0.074699
Average loss at Epoch 1 and Step 2020 is: 8.184718
Average loss at Epoch 1 and Step 4020 is: 7.241530
Average loss at Epoch 1 and Step 6020 is: 6.760329
Average loss at Epoch 2 and Step 41 is: 0.118724
Average loss at Epoch 2 and Step 2041 is: 5.586884
Average loss at Epoch 2 and Step 4041 is: 5.027705
Average loss at Epoch 2 and Step 6041 is: 4.643208
Average loss at Epoch 3 and Step 62 is: 0.117959
Average loss at Epoch 3 and Step 2062 is: 3.911491
Average loss at Epoch 3 and Step 4062 is: 3.385349
Average loss at Epoch 3 and Step 6062 is: 3.202513
Average loss at Epoch 4 and Step 83 is: 0.116218
Average loss at Epoch 4 and Step 2083 is: 2.651119
Average loss at Epoch 4 and Step 4083 is: 2.307343
Aver

Average loss at Epoch 39 and Step 6818 is: 0.176280
Average loss at Epoch 40 and Step 839 is: 0.064223
Average loss at Epoch 40 and Step 2839 is: 0.169233
Average loss at Epoch 40 and Step 4839 is: 0.162135
Average loss at Epoch 40 and Step 6839 is: 0.170222
Average loss at Epoch 41 and Step 860 is: 0.075163
Average loss at Epoch 41 and Step 2860 is: 0.166058
Average loss at Epoch 41 and Step 4860 is: 0.156781
Average loss at Epoch 41 and Step 6860 is: 0.170938
Average loss at Epoch 42 and Step 881 is: 0.071023
Average loss at Epoch 42 and Step 2881 is: 0.164006
Average loss at Epoch 42 and Step 4881 is: 0.165172
Average loss at Epoch 42 and Step 6881 is: 0.166514
Average loss at Epoch 43 and Step 902 is: 0.072328
Average loss at Epoch 43 and Step 2902 is: 0.158829
Average loss at Epoch 43 and Step 4902 is: 0.158715
Average loss at Epoch 43 and Step 6902 is: 0.163570
Average loss at Epoch 44 and Step 923 is: 0.077420
Average loss at Epoch 44 and Step 2923 is: 0.147155
Average loss at E

Average loss at Epoch 78 and Step 7637 is: 0.124558
Average loss at Epoch 79 and Step 1658 is: 0.110349
Average loss at Epoch 79 and Step 3658 is: 0.120801
Average loss at Epoch 79 and Step 5658 is: 0.168154
Average loss at Epoch 79 and Step 7658 is: 0.132623
Average loss at Epoch 80 and Step 1679 is: 0.122042
Average loss at Epoch 80 and Step 3679 is: 0.127349
Average loss at Epoch 80 and Step 5679 is: 0.149564
Average loss at Epoch 80 and Step 7679 is: 0.134376
Average loss at Epoch 81 and Step 1700 is: 0.123229
Average loss at Epoch 81 and Step 3700 is: 0.123528
Average loss at Epoch 81 and Step 5700 is: 0.157738
Average loss at Epoch 81 and Step 7700 is: 0.131207
Average loss at Epoch 82 and Step 1721 is: 0.115307
Average loss at Epoch 82 and Step 3721 is: 0.121630
Average loss at Epoch 82 and Step 5721 is: 0.140189
Average loss at Epoch 82 and Step 7721 is: 0.129195
Average loss at Epoch 83 and Step 1742 is: 0.129241
Average loss at Epoch 83 and Step 3742 is: 0.112733
Average loss

In [20]:
# Evaluate
modelidx = int((train_steps * num_eps))
probs = dnn.dnn_eval(emb_test,
                     vecs_test,
                     testY,
                     root_path+'objects/',
                     model_name+'-'+str(modelidx)+'.meta')
preds = np.argmax(probs, axis=1)
print(classification_report([el[1] for el in testY], preds))

INFO:tensorflow:Restoring parameters from /Users/jili/Library/CloudStorage/GoogleDrive-hamlet.j@gmail.com/My Drive/DS/bestegg/objects/dnn_dense_plus_sparse221028-797900



Test loss: 2.4071696
              precision    recall  f1-score   support

           0       0.97      0.98      0.97     28837
           1       0.77      0.69      0.73      3077

    accuracy                           0.95     31914
   macro avg       0.87      0.83      0.85     31914
weighted avg       0.95      0.95      0.95     31914



### Remark
1. As the log above shows, this model structure needs many epochs to converge to a low loss, while the basic (sparse only) model structure converged in only 3 epochs.
2. This model structure generates and learns from richer features and eventually produces better predictions.
3. The overall performance is better than that of the pre-trained transfer learning model and the basic model, with an f1-score of 0.73 on the positive class vs. 0.63-0.65 for the other models.
4. Regularization techniques, including dropout and L2 regularization, are used.
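
The two regularization techniques named in the last remark can be sketched in a few lines. The values `keep_prob=0.7` and `beta=1e-8` mirror `KEEP_PROB` and `BETA` in the training cell, but how the module applies them is not shown, so the inverted-dropout scaling and the plain sum-of-squares penalty below are assumptions.

```python
import random

def inverted_dropout(activations, keep_prob, rng):
    """Keep each unit with probability keep_prob; rescale survivors by 1/keep_prob."""
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

def l2_penalty(weight_matrices, beta):
    """beta times the sum of squared weights over all matrices."""
    return beta * sum(w * w for matrix in weight_matrices for row in matrix for w in row)

rng = random.Random(0)
hidden = [1.0] * 10_000
dropped = inverted_dropout(hidden, keep_prob=0.7, rng=rng)
kept_fraction = sum(1 for a in dropped if a != 0.0) / len(dropped)

weights = [[[0.5, -0.5], [1.0, 2.0]]]   # one made-up 2x2 weight matrix
penalty = l2_penalty(weights, beta=1e-8)

print(f"kept fraction: {kept_fraction:.3f}")
print(f"L2 penalty: {penalty:.2e}")
```

Dropout is applied only during training, and the penalty term is added to the training loss. With a beta as small as 1e-8 the penalty is a very mild constraint on the weights.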

## Discussion - Application & Improvement

1. The last custom-built model performs well. More importantly, it has a wide range of applications. For example, when we need to combine structured and text data, the model can take the structured data as dense features and the text data as TFIDF features, embedding features, or both. The model structure then combines them naturally through neural network layers.
2. In my work experience, I have implemented this approach to combine structured and text data, and achieved good performance and scalability.
3. I have mainly focused on model architectures. There are certainly areas for improvement, including data processing and modeling.
4. I did not include deployment pieces such as serving the model as an API. If desired, I can share an example of doing so.
5. For further model improvement, I would try a simple model, especially an SVM, on the embedding features. In my experience, this type of model has given the best performance on similar problems.
6. To eventually use it in real-world production, I would explore the use of multiple models. For example, logistic regression may not provide the best performance, but it may provide the best predictive probabilities and explainability. Good predictive probabilities are important for downstream work. When logistic regression is not confident, the data can go to more complex models. A portion of the incoming data is then handled by logistic regression alone, and only the remaining portion goes to complex models. This helps reduce latency and improve accuracy. Unconfident predictions can also be routed to manual processing.
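
The routing idea in the last point can be sketched as a small cascade. Everything here is hypothetical: the confidence band of 0.2 to 0.8, the function names, and the stand-in scorers that return fixed probabilities in place of a trained logistic regression and a trained neural network.

```python
def route(prob_toxic, low=0.2, high=0.8):
    """Accept a prediction only if the probability lies outside (low, high)."""
    if prob_toxic >= high:
        return "accept_toxic"
    if prob_toxic <= low:
        return "accept_non_toxic"
    return "escalate"

def cascade(text, cheap_model, complex_model, low=0.2, high=0.8):
    """Try the cheap model first, then the complex one, then a human."""
    decision = route(cheap_model(text), low, high)
    if decision != "escalate":
        return decision, "cheap"
    decision = route(complex_model(text), low, high)
    if decision != "escalate":
        return decision, "complex"
    return "manual_review", "human"

# Stand-in scorers returning fixed probabilities, for illustration only.
def cheap_model(text):
    return {"hello": 0.05, "hmm": 0.50, "???": 0.45}.get(text, 0.5)

def complex_model(text):
    return {"hmm": 0.90, "???": 0.50}.get(text, 0.5)

for text in ["hello", "hmm", "???"]:
    print(text, cascade(text, cheap_model, complex_model))
```

The width of the confidence band controls the trade-off: a wider band sends more traffic to the slower model and to manual review, while a narrower band lowers latency at the cost of accepting less certain predictions.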