## Dataset Description
You are provided with a large number of Wikipedia comments which have been labeled by human raters for toxic behavior. The types of toxicity are:

- toxic
- severe_toxic
- obscene
- threat
- insult
- identity_hate

You must create a model which predicts a probability of each type of toxicity for each comment.

File descriptions:
- train.csv - the training set; contains comments with their binary labels
- test.csv - the test set; you must predict the toxicity probabilities for these comments. To deter hand labeling, the test set contains some comments which are not included in scoring.
- sample_submission.csv - a sample submission file in the correct format
- test_labels.csv - labels for the test data; value of -1 indicates it was not used for scoring; (Note: file added after competition close!)
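
As a rough illustration of the prediction target described above, the sketch below writes one row per comment with one probability per toxicity type. It assumes the submission layout is an `id` column followed by the six label columns in the order listed above; check this against sample_submission.csv before relying on it. The helper `write_submission`, the ids, and the probabilities are hypothetical and exist only for this example.

```python
import csv
import io

# Label columns as listed in the dataset description.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def write_submission(ids, probabilities, handle):
    """Write one row per comment: its id followed by six probabilities."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["id"] + LABELS)
    for comment_id, probs in zip(ids, probabilities):
        if len(probs) != len(LABELS):
            raise ValueError("expected one probability per label")
        writer.writerow([comment_id] + [f"{p:.4f}" for p in probs])

# Hypothetical ids and placeholder probabilities, for illustration only.
buffer = io.StringIO()
write_submission(
    ["id_a", "id_b"],
    [[0.5] * 6, [0.9, 0.1, 0.8, 0.0, 0.7, 0.05]],
    buffer,
)
submission_text = buffer.getvalue()
print(submission_text)
```

Each probability is independent of the others, so a row does not have to sum to 1.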

In [3]:
import numpy as np
import pandas as pd
import dask.dataframe as dd
from matplotlib import pyplot as plt
import seaborn as sns

import re
import string
from pickle import dump, load
import zipfile
import os

from collections import Counter

from nltk.corpus import stopwords
from nltk.probability import FreqDist

from wordcloud import WordCloud

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.layers import Input, Embedding, Conv1D, MaxPool1D, Bidirectional, LSTM, Flatten, Dense, concatenate
from tensorflow.keras.layers import BatchNormalization, MultiHeadAttention, Dropout
from tensorflow.keras.regularizers import l1_l2
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.optimizers.schedules import ExponentialDecay

from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
from tensorflow.keras.utils import Sequence, plot_model
from tensorflow.keras.metrics import AUC

import tensorflow as tf
import datetime
from tqdm import tqdm

from gensim.models import KeyedVectors

def extract_zip_files(zip_file_path, extract_to_dir):
    """Extract every member of the zip archive into extract_to_dir."""
    try:
        with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
            # Extract all contents to the specified directory
            zip_ref.extractall(extract_to_dir)
        print(f"Contents of {zip_file_path} extracted successfully to {extract_to_dir}.")
    except FileNotFoundError:
        print(f"Error: The file {zip_file_path} does not exist.")
    except zipfile.BadZipFile:
        print(f"Error: The file {zip_file_path} is not a valid zip file.")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        
# load doc into memory
def load_doc(filename):
    with open(filename, 'r') as file:
        return file.read()
    

def clean_doc(doc):
    tokens = doc.split()
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    tokens = [re_punc.sub('', word) for word in tokens]
    tokens = [word for word in tokens if word.isalpha() or word in ['not', 'no']]
    tokens = [word.lower() for word in tokens]
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

# save a dataset to a file
def save_dataset(dataset, filename):
    with open(filename, 'wb') as file:
        dump(dataset, file)
    print('Saved: %s' % filename)
    
# load a clean dataset
def load_output(filename):
    with open(filename, 'rb') as file:
        return load(file)

def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2  # Memory usage in MB

    for col in df.columns:
        col_type = df[col].dtype
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()

            # Downcast integers to the narrowest type whose (inclusive) range holds the column
            if str(col_type)[:3] == 'int':
                if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                # otherwise the column stays int64
            # Downcast floats
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    end_mem = df.memory_usage().sum() / 1024**2
    if verbose:
        reduction_percent = 100 * (start_mem - end_mem) / start_mem
        print(f"Memory usage decreased to {end_mem:.2f} MB ({reduction_percent:.1f}% reduction)")

    return df
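
The integer branch of `reduce_mem_usage` picks the narrowest type whose range covers the column's minimum and maximum. The sketch below shows that selection rule in plain Python, without pandas or NumPy. The table `INT_RANGES` and the helper `smallest_int_type` are illustrative names, not part of the notebook.

```python
# Inclusive value ranges of the signed integer types, narrowest first.
INT_RANGES = [
    ("int8", -2**7, 2**7 - 1),
    ("int16", -2**15, 2**15 - 1),
    ("int32", -2**31, 2**31 - 1),
    ("int64", -2**63, 2**63 - 1),
]

def smallest_int_type(c_min, c_max):
    """Return the name of the narrowest type that can hold [c_min, c_max]."""
    for name, low, high in INT_RANGES:
        if low <= c_min and c_max <= high:
            return name
    raise OverflowError("values do not fit in a 64-bit signed integer")

print(smallest_int_type(0, 1))    # a binary label column
print(smallest_int_type(-1, 1))   # labels that include the -1 marker
print(smallest_int_type(0, 200))  # too wide for int8
```

Binary label columns fit in a single byte, which is where most of the reported memory reduction comes from.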

# Load data:

In [4]:
# extract main file
zip_file_path = 'jigsaw-toxic-comment-classification-challenge.zip'
extract_to_dir = './toxic_files'
extract_zip_files(zip_file_path, extract_to_dir)

Contents of jigsaw-toxic-comment-classification-challenge.zip extracted successfully to ./toxic_files.


In [5]:
# extract training zip file:
zip_file_path = './toxic_files/train.csv.zip'
extract_to_dir = './toxic_files/train_file'
extract_zip_files(zip_file_path, extract_to_dir)

# extract test zip file:
zip_file_path = './toxic_files/test.csv.zip'
extract_to_dir = './toxic_files/test_file'
extract_zip_files(zip_file_path, extract_to_dir)

# extract test labels zip file:
zip_file_path = './toxic_files/test_labels.csv.zip'
extract_to_dir = './toxic_files/test_file'
extract_zip_files(zip_file_path, extract_to_dir)

Contents of ./toxic_files/train.csv.zip extracted successfully to ./toxic_files/train_file.
Contents of ./toxic_files/test.csv.zip extracted successfully to ./toxic_files/test_file.
Contents of ./toxic_files/test_labels.csv.zip extracted successfully to ./toxic_files/test_file.


In [6]:
# load datasets
train_df = pd.read_csv('./toxic_files/train_file/train.csv', index_col=0)
test_doc = pd.read_csv('./toxic_files/test_file/test.csv')
ytest = pd.read_csv('./toxic_files/test_file/test_labels.csv')

# reduce memory:
train_df = reduce_mem_usage(train_df)
test_doc = reduce_mem_usage(test_doc)
ytest = reduce_mem_usage(ytest)

# attach the labels to the test comments:
test_df = pd.merge(test_doc, ytest, on='id', how='left')
test_df.set_index('id', inplace=True)

# remove rows that were not used for scoring (labels equal to -1)
test_df = test_df[~(test_df == -1).any(axis=1)]

# summarise datasets:
print('\nData summary:')
print(f'Training dataset: {train_df.shape[0]} rows, {train_df.shape[1]} columns')
print(f'Test dataset: {test_df.shape[0]} rows, {test_df.shape[1]} columns')

# split into input and output:
train_docs, ytrain = train_df['comment_text'], train_df.iloc[:, -6:].values.astype(int)
test_docs, ytest = test_df['comment_text'], test_df.iloc[:, -6:].values.astype(int)

# save files:
train_docs = train_docs.to_frame()
test_docs = test_docs.to_frame()
train_docs.to_csv('train_docs.csv')
test_docs.to_csv('test_docs.csv')
save_dataset([ytrain, ytest], 'output.pkl')
save_dataset([train_docs, test_docs], 'input.pkl')

Memory usage decreased to 3.35 MB (65.6% reduction)
Memory usage decreased to 2.34 MB (0.0% reduction)
Memory usage decreased to 2.05 MB (75.0% reduction)

Data summary:
Training dataset: 159571 rows, 7 columns
Test dataset: 63978 rows, 7 columns
Saved: output.pkl
Saved: input.pkl
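
The drop from the full test file to 63978 rows comes from discarding comments whose labels are -1. The sketch below restates that rule in plain Python on a few hypothetical rows. Unlike the pandas expression in the cell above, which compares every column with -1, it inspects only the six label columns. `keep_scored_rows` and the sample rows are made up for illustration.

```python
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def keep_scored_rows(rows):
    """Drop every row in which at least one label equals -1."""
    return [row for row in rows if not any(row[label] == -1 for label in LABELS)]

# Hypothetical merged rows: comment text joined with its labels on 'id'.
rows = [
    {"id": "a", "comment_text": "first comment", **dict.fromkeys(LABELS, 0)},
    {"id": "b", "comment_text": "second comment", **dict.fromkeys(LABELS, -1)},
    {"id": "c", "comment_text": "third comment", **{**dict.fromkeys(LABELS, 0), "toxic": 1}},
]

scored = keep_scored_rows(rows)
print([row["id"] for row in scored])
```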


# Analysing toxicity by type: 
#### Note: cells in this section are not executed, because their output contains harmful words that cause GitHub to flag this notebook as harmful.

In [7]:
class toxicity_analyser(object):
    def __init__(self, df, most_common):
        self.df = df
        self.most_common = most_common
        
    def combine_words(self, word_list):
        all_words = []
        for word in word_list:
            all_words += word
        return all_words
        
    def create_word_cloud(self, col_name):
        # keep only the comments flagged with the given label
        df = self.df[self.df[col_name] == 1]
        tokens = df['comment_text'].apply(clean_doc)
        reviewed_tokens = self.combine_words(tokens)
        # (word, count) pairs for the most frequent tokens
        mostcommon = FreqDist(reviewed_tokens).most_common(self.most_common)
        # build the cloud from the counts rather than from the list's string form
        wordcloud = WordCloud(width=1500, height=800, background_color='white').generate_from_frequencies(dict(mostcommon))
        plt.figure(figsize=(30,10), facecolor='white')
        plt.imshow(wordcloud, interpolation="bilinear")
        plt.axis('off')
        plt.title(f'Top {self.most_common} Most Common Words', fontsize=25)
        plt.show()

In [8]:
toxicity = toxicity_analyser(train_df, 100)

In [None]:
toxicity.create_word_cloud('toxic')

In [None]:
toxicity.create_word_cloud('severe_toxic')

In [None]:
toxicity.create_word_cloud('obscene')

In [None]:
toxicity.create_word_cloud('threat')

In [None]:
toxicity.create_word_cloud('insult')

In [None]:
toxicity.create_word_cloud('identity_hate')

# Create Vocabulary:

In [9]:
# add doc to vocabulary
def add_doc_to_vocab(filename, vocab):
    doc = load_doc(filename)
    tokens = clean_doc(doc)
    vocab.update(tokens)
    
# save list to file
def save_list(lines, filename):
    data = '\n'.join(lines)
    # the context manager closes the file (the original `file.close` never called the method)
    with open(filename, 'w') as file:
        file.write(data)

vocab = Counter()
add_doc_to_vocab('train_docs.csv', vocab)
print(len(vocab))
print(vocab.most_common(50))

min_occurrence = 2
tokens = [k for k,c in vocab.items() if c >= min_occurrence]
print(len(tokens))
# save tokens to the vocabulary file
save_list(tokens, 'vocab.text')

212766
[('the', 491916), ('to', 296460), ('of', 223846), ('and', 221212), ('you', 201608), ('is', 175487), ('that', 153679), ('in', 143493), ('it', 128298), ('for', 102129), ('this', 95292), ('not', 92909), ('on', 89218), ('be', 83262), ('as', 76724), ('have', 72026), ('are', 71536), ('your', 62209), ('with', 59396), ('if', 57238), ('article', 55229), ('was', 54418), ('or', 52087), ('but', 50416), ('page', 45521), ('my', 44884), ('an', 44347), ('from', 41275), ('by', 40812), ('do', 39326), ('at', 39244), ('about', 36915), ('me', 36821), ('wikipedia', 35279), ('so', 35247), ('can', 33684), ('what', 32598), ('there', 31207), ('all', 30974), ('talk', 30801), ('has', 30653), ('will', 30327), ('would', 29103), ('its', 28186), ('one', 27880), ('please', 27658), ('like', 27641), ('no', 27355), ('just', 27261), ('they', 27000)]
89418
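
The two numbers above are the vocabulary size before and after dropping tokens seen fewer than twice. The sketch below reproduces that counting and filtering step on three made-up sentences. `clean_tokens` is a simplified stand-in for `clean_doc`, and `toy_vocab` and `min_count` are illustrative names.

```python
import re
import string
from collections import Counter

def clean_tokens(text):
    """Simplified variant of clean_doc: strip punctuation, keep alphabetic tokens, lowercase."""
    re_punc = re.compile("[%s]" % re.escape(string.punctuation))
    tokens = [re_punc.sub("", word) for word in text.split()]
    return [word.lower() for word in tokens if word.isalpha() and len(word) > 1]

# Hypothetical documents, for illustration only.
docs = [
    "The page was edited.",
    "Please edit the page!",
    "A talk page, not an article.",
]

toy_vocab = Counter()
for doc in docs:
    toy_vocab.update(clean_tokens(doc))

# Keep only tokens that occur at least min_count times across all documents.
min_count = 2
kept = sorted(token for token, count in toy_vocab.items() if count >= min_count)
print(len(toy_vocab), kept)
```

Raising the threshold shrinks the embedding matrix at the cost of mapping more rare words to nothing.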


# Build model: n-gram Multi-channel CNN Model with GloVe Embedding

In [2]:
def load_pkl(filename):
    with open(filename, 'rb') as file:
        return load(file)
    
def to_lines(doc, vocab):
    tokens = doc.split()
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    tokens = [re_punc.sub('', word) for word in tokens]
    tokens = [word for word in tokens if word.isalpha() or word in ['not', 'no']]
    tokens = [word.lower() for word in tokens]
    tokens = [w for w in tokens if w in vocab]
    tokens = ' '.join(tokens)
    return tokens

def create_tokenizer(lines):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

def calculate_doc_length(lines):
    # mean number of tokens per document, rounded to an int for use as the input length
    return int(round(np.mean([len(s.split()) for s in lines])))

def encode_text(tokenizer, lines, length):
    encoded = tokenizer.texts_to_sequences(lines)
    padded = pad_sequences(encoded, maxlen=length, padding='post')
    return padded

def define_model(vocab_size, length, kernels, weights, embedding_dim=100):
    inputs = []
    combine = []
    for kernel in kernels:
        in_layer = Input(shape=(length,))
        embedding = Embedding(input_dim=vocab_size, output_dim=embedding_dim, weights=[weights], trainable=True)(in_layer)
        batch_norm1 = BatchNormalization()(embedding)
        conv = Conv1D(filters=32, kernel_size=kernel, activation='relu', kernel_initializer='he_uniform')(batch_norm1)
        drop = Dropout(0.5)(conv)
        pool = MaxPool1D(pool_size=2)(drop)
        flat = Flatten()(pool)
        inputs.append(in_layer)
        combine.append(flat)
    merged = concatenate(combine)
    dense = Dense(50, activation='relu', kernel_initializer='he_uniform')(merged)
    batch_norm2 = BatchNormalization()(dense)
    dense = Dropout(0.5)(batch_norm2)
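    # NOTE: softmax with categorical cross-entropy treats the six labels as mutually
    # exclusive classes, although a comment can carry several labels or none.
    outputs = Dense(6, activation='softmax', kernel_initializer='he_uniform')(dense)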
    model = Model(inputs=inputs, outputs=outputs)
    lr_schedule = ExponentialDecay(initial_learning_rate=0.001, decay_steps=1000, decay_rate=0.9)
    optimizer = Adam(learning_rate=lr_schedule)
    model.compile(loss='categorical_crossentropy', 
                  optimizer=optimizer, 
                  metrics=['accuracy', AUC(name='roc_auc', multi_label=False)])
    model.summary()
    plot_model(model, to_file='n_gram_multihead_cnn_model.png', show_shapes=True)
    return model

class DataGenerator(Sequence):
    def __init__(self, texts, labels, tokenizer, max_length, n_inputs, batch_size=32):
        super().__init__()
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.n_inputs = n_inputs  # number of model input branches (one per kernel size)
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.texts) / self.batch_size))

    def __getitem__(self, idx):
        batch_texts = self.texts[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_labels = self.labels[idx * self.batch_size:(idx + 1) * self.batch_size]
        encoded_texts = encode_text(self.tokenizer, batch_texts, self.max_length)
        # feed the same encoded batch to every input branch
        return [encoded_texts] * self.n_inputs, np.array(batch_labels)

def evaluate_model(model, Xtrain, ytrain, Xtest, ytest, kernels, tokenizer, length, n_repeats=5):
    train_generator = DataGenerator(Xtrain, ytrain, tokenizer, length, n_inputs=len(kernels), batch_size=64)
    test_generator = DataGenerator(Xtest, ytest, tokenizer, length, n_inputs=len(kernels), batch_size=64)
    acc_scores, auc_scores = [], []
    # the same model instance is reused, so each repeat continues training from the previous weights
    for i in range(1, n_repeats+1):
        rlp = ReduceLROnPlateau(monitor='val_loss', mode='min', factor=0.5, patience=3, min_lr=1e-6)
        es = EarlyStopping(monitor='val_roc_auc', mode='max', patience=3, restore_best_weights=True)
        model_name = 'best_model_iter_' + str(i) + '.keras'
        mc = ModelCheckpoint(model_name, monitor='val_roc_auc', mode='max', verbose=1, save_best_only=True)
        model.fit(train_generator, 
                  validation_data=test_generator, 
                  epochs=10, 
                  verbose=1, 
                  callbacks=[rlp, es, mc])
        metrics = model.evaluate(test_generator, verbose=1)
        acc, auc = metrics[1], metrics[2] 
        print(f'> run={i}: Accuracy={acc * 100:.3f}, ROC AUC={auc * 100:.3f}')
        acc_scores.append(acc)
        auc_scores.append(auc)
    mean_acc, mean_auc = np.mean(acc_scores), np.mean(auc_scores)
    print(f'Mean Accuracy={mean_acc * 100:.3f}, Mean ROC AUC={mean_auc * 100:.3f}')
    
if __name__ == '__main__':
    # Load vocab
    vocab_filename = 'vocab.text'
    vocab = load_doc(vocab_filename)
    vocab = set(vocab.split())

    # Load datasets
    ytrain, ytest = load_pkl('output.pkl')  
    train_docs, test_docs = load_pkl('input.pkl')

    # Clean training and test docs
    trainLines = train_docs['comment_text'].apply(lambda x: to_lines(x, vocab)).astype(str)
    testLines = test_docs['comment_text'].apply(lambda x: to_lines(x, vocab)).astype(str)

    # Tokenize and encode datasets
    tokenizer = create_tokenizer(trainLines)
    vocab_size = len(tokenizer.word_index) + 1
    length = calculate_doc_length(trainLines)
    print(f'Vocabulary size: {vocab_size}')
    print(f'Average document length: {length}')

    # load the embedding into memory
    embeddings_index = {}
    data_path = '/home/pmthisi/code_env/6_Deep_Learning_for_Natural_Language_Processing/glove_files'
    with open(f'{data_path}/glove.6B.100d.txt', mode='rt', encoding='utf-8') as f:
        for line in f:
            # each line holds a word followed by its vector components
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    print('Loaded %s word vector' % len(embeddings_index))

    # create a weight matrix for words in the training docs
    embedding_matrix = np.random.uniform(-0.05, 0.05, size=(vocab_size, 100))
    for word, i in tokenizer.word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

    # Define and evaluate model
    kernels = [3, 5, 7]
    model = define_model(vocab_size, length, kernels, embedding_matrix)
    evaluate_model(model, trainLines.tolist(), ytrain, testLines.tolist(), ytest, kernels, tokenizer, length)

Vocabulary size: 89419
Average document length: 62
Loaded 400000 word vector
Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_1 (InputLayer)        [(None, 62)]                 0         []                            
                                                                                                  
 input_2 (InputLayer)        [(None, 62)]                 0         []                            
                                                                                                  
 input_3 (InputLayer)        [(None, 62)]                 0         []                            
                                                                                                  
 embedding (Embedding)       (None, 62, 100)              8941900   ['input_1[0][0]']             
                 

Epoch 4/10
Epoch 4: val_roc_auc improved from 0.60097 to 0.62397, saving model to best_model_iter_1.keras
Epoch 5/10
Epoch 5: val_roc_auc did not improve from 0.62397
Epoch 6/10
Epoch 6: val_roc_auc improved from 0.62397 to 0.65997, saving model to best_model_iter_1.keras
Epoch 7/10
Epoch 7: val_roc_auc did not improve from 0.65997
Epoch 8/10
Epoch 8: val_roc_auc improved from 0.65997 to 0.66973, saving model to best_model_iter_1.keras
Epoch 9/10
Epoch 9: val_roc_auc improved from 0.66973 to 0.67637, saving model to best_model_iter_1.keras
Epoch 10/10
Epoch 10: val_roc_auc did not improve from 0.67637
> run=1: Accuracy=99.706, ROC AUC=67.617
Epoch 1/10
Epoch 1: val_roc_auc improved from -inf to 0.67596, saving model to best_model_iter_2.keras
Epoch 2/10
Epoch 2: val_roc_auc improved from 0.67596 to 0.67620, saving model to best_model_iter_2.keras
Epoch 3/10
Epoch 3: val_roc_auc did not improve from 0.67620
Epoch 4/10
Epoch 4: val_roc_auc did not improve from 0.67620
Epoch 5/10
Epoch 5:

Epoch 2/10
Epoch 2: val_roc_auc improved from 0.67116 to 0.67209, saving model to best_model_iter_4.keras
Epoch 3/10
Epoch 3: val_roc_auc improved from 0.67209 to 0.67218, saving model to best_model_iter_4.keras
Epoch 4/10
Epoch 4: val_roc_auc did not improve from 0.67218
Epoch 5/10
Epoch 5: val_roc_auc did not improve from 0.67218
Epoch 6/10
Epoch 6: val_roc_auc did not improve from 0.67218
> run=4: Accuracy=99.597, ROC AUC=67.218
Epoch 1/10
Epoch 1: val_roc_auc improved from -inf to 0.67168, saving model to best_model_iter_5.keras
Epoch 2/10
Epoch 2: val_roc_auc did not improve from 0.67168
Epoch 3/10
Epoch 3: val_roc_auc improved from 0.67168 to 0.67216, saving model to best_model_iter_5.keras
Epoch 4/10
Epoch 4: val_roc_auc did not improve from 0.67216
Epoch 5/10
Epoch 5: val_roc_auc did not improve from 0.67216
Epoch 6/10
Epoch 6: val_roc_auc did not improve from 0.67216
> run=5: Accuracy=99.567, ROC AUC=67.216
Mean Accuracy=99.624, Mean ROC AUC=67.399
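
The task statement asks for a separate probability per toxicity type, and a comment may carry several labels or none. The sketch below, in plain Python rather than Keras, contrasts the softmax and categorical cross-entropy pairing used in `define_model` with a sigmoid and binary cross-entropy pairing, which is a commonly used alternative for multi-label targets. It is an illustration under those assumptions, not a rerun of the notebook; the logits are hypothetical values.

```python
import math

def softmax(logits):
    """Normalise scores so that the outputs sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logits):
    """Squash each score independently into (0, 1)."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]

def categorical_cross_entropy(y_true, y_prob, eps=1e-7):
    return sum(-t * math.log(max(p, eps)) for t, p in zip(y_true, y_prob))

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

clean_labels = [0, 0, 0, 0, 0, 0]   # a comment with no toxicity label
logits = [-4.0] * 6                 # hypothetical model scores

soft = softmax(logits)
sig = sigmoid(logits)

print("softmax sum:", sum(soft))
print("sigmoid max:", max(sig))
print("categorical CE on all-zero labels:", categorical_cross_entropy(clean_labels, soft))
print("binary CE on all-zero labels:", binary_cross_entropy(clean_labels, sig))
```

With an all-zero label row, the categorical cross-entropy is zero for any prediction, so such rows contribute no training signal, and the softmax outputs cannot all be small at once. In Keras terms the alternative would correspond to `Dense(6, activation='sigmoid')`, `loss='binary_crossentropy'`, and `AUC(multi_label=True)`; whether it improves the scores reported above has not been tested here.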
