# Problem Discussion

In this problem I have to develop machine learning models that identify toxicity in online conversations, where toxicity is defined as anything rude, disrespectful, or otherwise likely to make someone leave a discussion. This is a binary classification problem: a comment is labeled toxic when its `target` score is at least 0.5.

In [None]:
import os
print(os.listdir("../input"))

## Importing Dependencies

* NumPy = Provides array types and mathematical operations on them
* Pandas = Reads data from CSV files and supports data manipulation
* Keras = High-level neural network library with simple, consistent APIs
* TensorFlow = Backend that performs the tensor computations for Keras
* warnings = Standard-library module, used here to suppress warnings

In [None]:
import numpy as np 
import pandas as pd

from keras.preprocessing import text, sequence
from keras.models import Model
from keras.layers import Input, Dense, Embedding, SpatialDropout1D, Dropout, add, concatenate
from keras.layers import CuDNNLSTM, Bidirectional, GlobalMaxPooling1D, GlobalAveragePooling1D
from keras.callbacks import LearningRateScheduler
from keras.losses import binary_crossentropy
from keras import backend as K

import warnings
warnings.filterwarnings("ignore")

## Importing Dataset

In [None]:
train = pd.read_csv('../input/jigsaw-unintended-bias-in-toxicity-classification/train.csv')
test = pd.read_csv('../input/jigsaw-unintended-bias-in-toxicity-classification/test.csv')

## Text Pre-processing

The `preprocess()` function below cleans the `comment_text` column by replacing every punctuation mark and special character with a space.

In [None]:
def preprocess(data):
    
    punctuation = "/-'?!.,#$%\'()*+-/:;<=>@[\\]^_`{|}~`" + '""“”’' + '∞θ÷α•à−β∅³π‘₹´°£€\\×™√²—–&'
    
    def clean_special_chars(text, punct):
        
        for p in punct:
            
            text = text.replace(p, ' ')
            
        return text

    data = data.astype(str).apply(lambda x: clean_special_chars(x, punctuation))
    
    return data

In [None]:
x_train = preprocess(train['comment_text'])

Here I have preprocessed the `comment_text` column from the training set and saved the result in a new pandas Series, `x_train`.
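
To make the cleaning step concrete, here is a minimal sketch of the same replace-with-space idea on a single string. `PUNCT` is a shortened, hypothetical subset of the notebook's punctuation string, and the sample sentence is invented for illustration.

```python
# Toy illustration of the replace-with-space cleaning used by preprocess().
# PUNCT is a shortened, hypothetical subset of the full punctuation string.
PUNCT = "/-'?!.,#$%()*+:;<=>@[]^_`{|}~"

def clean_special_chars(text, punct):
    # Replace every listed character with a single space.
    for p in punct:
        text = text.replace(p, " ")
    return text

sample = "Don't shout!!! It's rude..."
cleaned = clean_special_chars(sample, PUNCT)

print(repr(cleaned))
print(cleaned.split())
```

Runs of punctuation become runs of spaces. That is harmless for whitespace-based splitting, as the `split()` call shows, but note that contractions such as "Don't" are broken into two tokens.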

## Importing Word Embedding 

*Word embedding: an efficient, dense representation in which similar words have similar encodings.*

In this project I tried three non-contextualized word embeddings: word2vec, GloVe, and fastText. fastText gave the best result. The code below loads the fastText vectors together with GloVe.

*FastText Crawl: 300-dimensional pretrained fastText English word vectors released by Facebook.*

Advantages of fastText:

* Captures the meaning of shorter words
* Allows the embeddings to understand suffixes and prefixes
* fastText works well with rare words
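
The prefix, suffix, and rare-word points above come from fastText's subword model, which represents a word by its character n-grams. The sketch below only illustrates how such n-grams can be enumerated; `char_ngrams` is a hypothetical helper, not part of any library. Keep in mind that the `.vec` text file loaded in this notebook is read as a plain word-to-vector table, so lookups here are whole-word only.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Enumerate character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"  # '<' and '>' mark the start and end of the word
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

grams = char_ngrams("unkind", n_min=3, n_max=4)
print(grams)
```

The n-gram `<un` captures the prefix and `nd>` the ending, which is how a subword model can relate "unkind" to other words that share those pieces.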

In [None]:
embedding_files = ['../input/fasttext-crawl-300d-2m/crawl-300d-2M.vec', '../input/glove840b300dtxt/glove.840B.300d.txt']

## Extracting Vectors

The following three functions extract vectors from the word embedding files.

In [None]:
def get_coefs(word, *arr):
    
    return word, np.asarray(arr, dtype = 'float32')

In [None]:
def load_embeddings(path):
    
    with open(path) as f:
        
        return dict(get_coefs(*line.strip().split(' ')) for line in f)

In [None]:
def build_matrix(word_index, path):
    
    embedding_index = load_embeddings(path)
    embedding_matrix = np.zeros((len(word_index) + 1, 300))
    
    for word, i in word_index.items():
        
        try:
            embedding_matrix[i] = embedding_index[word]
            
        except KeyError:
            pass
        
    return embedding_matrix
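
As a rough illustration of what these three functions do together, the sketch below parses two made-up embedding lines and fills a small matrix. The 3-dimensional vectors and the `word_index` are invented for the example; the real files hold 300-dimensional vectors.

```python
import numpy as np

def get_coefs(word, *arr):
    # First field is the word, the remaining fields are the vector values.
    return word, np.asarray(arr, dtype="float32")

# Hypothetical lines in the "word v1 v2 v3" text layout.
lines = ["good 0.1 0.2 0.3", "bad -0.1 -0.2 -0.3"]
embedding_index = dict(get_coefs(*line.strip().split(" ")) for line in lines)

# Index 0 is left unused, mirroring the "+ 1" in build_matrix().
word_index = {"good": 1, "bad": 2, "zzz": 3}
embedding_matrix = np.zeros((len(word_index) + 1, 3))

for word, i in word_index.items():
    vector = embedding_index.get(word)
    if vector is not None:  # unknown words keep an all-zero row
        embedding_matrix[i] = vector

print(embedding_matrix)
```

Words missing from the embedding file, such as "zzz" here, end up with an all-zero row, which is the same outcome as the `except KeyError: pass` branch above.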

## Comment Type

`identity_columns` lists the identity groups that a comment may mention. In the next step, these columns determine the weight assigned to each training sample.

In [None]:
identity_columns = ['male', 'female', 'homosexual_gay_or_lesbian', 'christian', 'jewish', 
                    'muslim', 'black', 'white', 'psychiatric_or_mental_illness']

* Assigning a weight to each training sample based on the identity_columns

In [None]:
# Overall
weights = np.ones((len(x_train),)) / 4

# Subgroup
weights += (train[identity_columns].fillna(0).values>=0.5).sum(axis=1).astype(bool).astype(int) / 4

# Background Positive, Subgroup Negative
weights += (( (train['target'].values>=0.5).astype(bool).astype(int) +
   (train[identity_columns].fillna(0).values<0.5).sum(axis=1).astype(bool).astype(int) ) > 1 ).astype(bool).astype(int) / 4

# Background Negative, Subgroup Positive
weights += (( (train['target'].values<0.5).astype(bool).astype(int) +
   (train[identity_columns].fillna(0).values>=0.5).sum(axis=1).astype(bool).astype(int) ) > 1 ).astype(bool).astype(int) / 4

loss_weight = 1.0 / weights.mean()
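
The weighting above is easier to follow on a toy example. The sketch below applies the same four terms to four invented rows with two hypothetical identity columns, using boolean masks in place of the `sum(...).astype(bool)` chains.

```python
import numpy as np

# Invented toy data: one row per comment.
target = np.array([0.9, 0.9, 0.1, 0.1])
identity = np.array([
    [0.0, 0.0],  # toxic, no identity mentioned
    [1.0, 0.0],  # toxic, identity mentioned
    [0.0, 0.0],  # non-toxic, no identity mentioned
    [1.0, 0.0],  # non-toxic, identity mentioned
])

toxic = target >= 0.5
any_identity = (identity >= 0.5).any(axis=1)      # at least one column >= 0.5
any_non_identity = (identity < 0.5).any(axis=1)   # at least one column < 0.5

weights = np.full(len(target), 0.25)          # overall
weights += any_identity / 4                   # subgroup
weights += (toxic & any_non_identity) / 4     # background positive, subgroup negative
weights += (~toxic & any_identity) / 4        # background negative, subgroup positive

loss_weight = 1.0 / weights.mean()
print(weights, loss_weight)
```

With several identity columns, `any_non_identity` is true for a row unless every column is at least 0.5, so in practice the third term fires for almost every toxic comment. Non-toxic comments that mention an identity receive the same top weight as toxic ones that do.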

In [None]:
y_train = np.vstack([(train['target'].values>=0.5).astype(int),weights]).T
y_aux_train = train[['target', 'severe_toxicity', 'obscene', 'identity_attack', 'insult', 'threat']].values

In [None]:
x_test = preprocess(test['comment_text'])

In [None]:
tokenizer = text.Tokenizer()
tokenizer.fit_on_texts(list(x_train) + list(x_test))

In [None]:
# Each comment is padded or truncated to 220 tokens.

MAX_LEN = 220 

In [None]:
x_train = tokenizer.texts_to_sequences(x_train)
x_test = tokenizer.texts_to_sequences(x_test)
x_train = sequence.pad_sequences(x_train, maxlen = MAX_LEN)
x_test = sequence.pad_sequences(x_test, maxlen = MAX_LEN)
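
For readers unfamiliar with padding, the following is a plain-Python sketch of what `pad_sequences` does with its default settings, which pad and truncate at the front of the sequence. `pad_pre` is a hypothetical helper written for this illustration, not a Keras function.

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic front padding and front truncation on one token-id list."""
    seq = seq[-maxlen:]                      # keep only the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq  # left-fill with the pad value

short = pad_pre([5, 6, 7], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)

print(short)
print(long)
```

Every sequence comes out with exactly `maxlen` entries, so long comments lose their earliest tokens and short ones are filled with zeros on the left.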

In [None]:
embedding_matrix = np.concatenate([build_matrix(tokenizer.word_index, f) for f in embedding_files], axis = -1)
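
Concatenating along the last axis places the two embeddings side by side for each word. A small shape check with placeholder matrices (the vocabulary size of 5 is arbitrary) looks like this:

```python
import numpy as np

vocab_rows = 5  # arbitrary stand-in for len(word_index) + 1
fasttext_part = np.ones((vocab_rows, 300))   # placeholder for the fastText matrix
glove_part = np.zeros((vocab_rows, 300))     # placeholder for the GloVe matrix

combined = np.concatenate([fasttext_part, glove_part], axis=-1)
print(combined.shape)
```

Each word therefore gets a 600-dimensional vector: the first 300 values from the first file and the last 300 from the second.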

## Model Building

In [None]:
NUM_MODELS = 1
BATCH_SIZE = 100
LSTM_UNITS = 128
DENSE_HIDDEN_UNITS = 4 * LSTM_UNITS
EPOCHS = 4

*Model Architecture*

1. Input
2. Word Embedding
3. Dropout 
4. Bidirectional CuDNN LSTM
5. Bidirectional CuDNN LSTM
6. Concatenation of GlobalMaxPooling1D & GlobalAveragePooling1D
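7. Two hidden dense layers with ReLU activation, each combined with the pooled features through an `add` skip connection
8. Sigmoid outputs for the main target and the auxiliary targets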
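
The choice `DENSE_HIDDEN_UNITS = 4 * LSTM_UNITS` follows from the shapes in this architecture. The NumPy sketch below walks through the pooling step with random data standing in for the LSTM output; it assumes the bidirectional layer concatenates its forward and backward outputs, which gives `2 * LSTM_UNITS` features per time step.

```python
import numpy as np

LSTM_UNITS = 128
batch, steps = 2, 220  # small batch, MAX_LEN time steps

rng = np.random.default_rng(0)
# Stand-in for the second bidirectional LSTM output: (batch, steps, 2 * units).
seq_out = rng.normal(size=(batch, steps, 2 * LSTM_UNITS))

max_pool = seq_out.max(axis=1)    # (batch, 256), like GlobalMaxPooling1D
avg_pool = seq_out.mean(axis=1)   # (batch, 256), like GlobalAveragePooling1D
hidden = np.concatenate([max_pool, avg_pool], axis=-1)

print(hidden.shape)
```

The pooled vector has `4 * LSTM_UNITS = 512` features, so the dense layers need the same width for the element-wise `add` to be valid.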

In [None]:
def build_model(embedding_matrix, num_aux_targets, loss_weight):
    
    words = Input(shape = (MAX_LEN,))
    word_embedding = Embedding(*embedding_matrix.shape, weights = [embedding_matrix], trainable = False)(words)
    dropout = SpatialDropout1D(0.3)(word_embedding)
    blstm_1 = Bidirectional(CuDNNLSTM(LSTM_UNITS, return_sequences = True))(dropout)
    blstm_2 = Bidirectional(CuDNNLSTM(LSTM_UNITS, return_sequences = True))(blstm_1)

    hidden = concatenate([
        GlobalMaxPooling1D()(blstm_2),
        GlobalAveragePooling1D()(blstm_2),
    ])
    
    hidden_1 = add([hidden, Dense(DENSE_HIDDEN_UNITS, activation = 'relu')(hidden)])
    hidden_2 = add([hidden, Dense(DENSE_HIDDEN_UNITS, activation = 'relu')(hidden_1)])
    
    result = Dense(1, activation = 'sigmoid')(hidden_2)
    aux_result = Dense(num_aux_targets, activation = 'sigmoid')(hidden_2)
    
    model = Model(inputs = words, outputs = [result, aux_result])
    model.compile(loss = [custom_loss, 'binary_crossentropy'], loss_weights = [loss_weight, 1.0], optimizer = 'adam')

    return model
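
Note that `build_model` compiles with `custom_loss`, which is not defined anywhere in this section and must exist before the model is built. Because `y_train` stores the binary label in column 0 and the sample weight in column 1, one plausible reading is a per-sample weighted binary cross-entropy. The NumPy sketch below shows that arithmetic only; `weighted_bce` is a hypothetical name, and a Keras version would have to express the same computation with backend tensor operations.

```python
import numpy as np

def weighted_bce(y_true, y_pred, eps=1e-7):
    """Per-sample binary cross-entropy scaled by a weight stored in y_true."""
    labels = y_true[:, 0]     # column 0: binary label
    sample_w = y_true[:, 1]   # column 1: sample weight
    p = np.clip(y_pred.reshape(-1), eps, 1 - eps)  # avoid log(0)
    bce = -(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    return bce * sample_w

# Invented example: two samples with different weights, both predicted at 0.5.
y_true = np.array([[1.0, 0.75], [0.0, 0.25]])
y_pred = np.array([[0.5], [0.5]])

losses = weighted_bce(y_true, y_pred)
print(losses)
```

Both samples have the same unweighted loss, so the output differs only by the weights, which is the effect the identity-based weighting is meant to have during training.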

## Memory Reduction

To reduce memory use, the padded test data is written to a temporary file and deleted from memory, then reloaded only when predictions are needed.

Pickle: serializes a Python object into a byte stream so it can be stored in a file or database and restored later.

gc: the interface to Python's automatic garbage collector; `gc.collect()` forces a collection.

In [None]:
import pickle
import gc

with open('temporary.pickle', mode = 'wb') as f:
    
    pickle.dump(x_test, f)

In [None]:
del identity_columns, weights, tokenizer, train, test, x_test # objects no longer needed
gc.collect()
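
The save, delete, and reload pattern can be tried in isolation. This sketch uses a small invented list and a temporary directory instead of the real `x_test` and working directory.

```python
import gc
import os
import pickle
import tempfile

data = [[0, 0, 5, 6, 7], [3, 4, 5, 6, 7]]  # invented stand-in for x_test

# Write the object to disk so the in-memory copy can be dropped.
path = os.path.join(tempfile.mkdtemp(), "temporary.pickle")
with open(path, mode="wb") as f:
    pickle.dump(data, f)

del data
gc.collect()  # ask the collector to reclaim unreachable objects now

# Later, reload the object only when it is needed.
with open(path, mode="rb") as f:
    restored = pickle.load(f)

print(restored)
```

The trade-off is extra disk reads on every reload in exchange for a lower peak memory footprint while the model trains.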

In [None]:
checkpoint_predictions = []
weights = []

## Training Model

In [None]:
for model_idx in range(NUM_MODELS):
    
    model = build_model(embedding_matrix, y_aux_train.shape[-1], loss_weight)
    
    for global_epoch in range(EPOCHS):
        
        model.fit(
            x_train,
            [y_train, y_aux_train],
            batch_size = BATCH_SIZE,
            epochs = 1,
            verbose = 1
        )
        
        with open('temporary.pickle', mode='rb') as f:
            x_test = pickle.load(f)
            
        checkpoint_predictions.append(model.predict(x_test, batch_size = 1024)[0].flatten())
        
        del x_test
        gc.collect()
        
        weights.append(2 ** global_epoch)
        
    del model
    gc.collect()
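
The loop stores one prediction vector per epoch with weight `2 ** global_epoch`, so later checkpoints count more in the final average. A small sketch with invented predictions for two test rows shows the effect:

```python
import numpy as np

# Invented per-epoch predictions for two test comments.
checkpoint_predictions = [
    np.array([0.2, 0.8]),  # epoch 0
    np.array([0.4, 0.6]),  # epoch 1
    np.array([0.6, 0.4]),  # epoch 2
    np.array([0.8, 0.2]),  # epoch 3
]
weights = [2 ** epoch for epoch in range(4)]  # 1, 2, 4, 8

predictions = np.average(checkpoint_predictions, weights=weights, axis=0)
print(predictions)
```

The last epoch alone carries 8 of the 15 total weight units, so the blended result sits closer to the final checkpoint than a plain mean would.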

## Preparing Submission File

In [None]:
predictions = np.average(checkpoint_predictions, weights = weights, axis = 0)

df_submit = pd.read_csv('../input/jigsaw-unintended-bias-in-toxicity-classification/sample_submission.csv')
df_submit.prediction = predictions

df_submit.to_csv('submission.csv', index = False)