In this notebook, we will try out a deep RCNN model based on RCNN v2.

# 1. Preparation
We first need to import the required libraries, download the data, and load it into memory.

## 1.1 Import

In [1]:
print('Importing required packages...')

from IPython.display import clear_output, display
import re
import csv
import pandas as pd
import numpy as np
np.random.seed()
import matplotlib.pyplot as plt
import nltk
from nltk.tokenize import TweetTokenizer
nltk.download('stopwords')
from nltk.stem.wordnet import WordNetLemmatizer 
nltk.download('wordnet')
from keras.preprocessing import sequence
from keras.preprocessing import text as ktxt
from keras.models import Sequential
from keras.layers import Dense, Embedding, GRU, SpatialDropout1D, Bidirectional, Dropout
from keras.layers import Concatenate, GlobalAveragePooling1D, GlobalMaxPooling1D
from keras.layers.convolutional import Conv1D, MaxPooling1D
from keras.callbacks import EarlyStopping
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.utils import class_weight


def hint(message):
    """
    erase previous ipynb output and show new message
    """
    clear_output()
    print(message)

  

Importing required packages...
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ChuanLi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ChuanLi\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Using TensorFlow backend.


## 1.2 Loading the Data

In [2]:
hint('loading data...')
train = pd.read_csv('data/train.csv')
train, valid = train_test_split(train, test_size=0.2)

labels = [
    'toxic', 
    'severe_toxic', 
    'obscene', 
    'threat', 
    'insult', 
    'identity_hate'
]

Ytr = train[labels].values
Yva = valid[labels].values

hint('Label distribution between training and validation set:')
print(pd.DataFrame({
    'label': labels,
    'train': [np.mean(train[label]) for label in labels],
    'validation' : [np.mean(valid[label]) for label in labels],
}))

Label distribution between training and validation set:
           label     train  validation
0          toxic  0.095648    0.096632
1   severe_toxic  0.009886    0.010434
2        obscene  0.053033    0.052608
3         threat  0.003047    0.002789
4         insult  0.049297    0.049632
5  identity_hate  0.008695    0.009243


# 2. Pre-processing the Input
There are many ways to pre-process the raw strings into valid input for the model. Here we do it by building a dictionary from all the comments in the training set, mapping each word to its index in the dictionary, and padding/cropping the resulting sequences so that they all have the same length.
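As a rough, library-free illustration of these three steps (the notebook itself relies on the Keras `Tokenizer` and `pad_sequences`), the sketch below builds a frequency-ranked dictionary, maps words to indices, and pads or crops to a fixed length. The helper names `build_vocab`, `to_sequence`, and `pad_or_crop` are hypothetical, and padding/cropping at the front is an assumption meant to mirror the Keras defaults.

```python
from collections import Counter

def build_vocab(comments):
    """Map each word to an index; index 0 is reserved for padding."""
    counts = Counter(word for c in comments for word in c.split())
    # the most frequent word gets index 1, the next one index 2, and so on
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(comment, vocab):
    """Replace known words by their index and drop unknown words."""
    return [vocab[w] for w in comment.split() if w in vocab]

def pad_or_crop(seq, length):
    """Keep the last `length` items and pad on the left with zeros (length > 0)."""
    seq = seq[-length:]
    return [0] * (length - len(seq)) + seq

comments = ["you be nice", "you be you", "nice"]
vocab = build_vocab(comments)
padded = [pad_or_crop(to_sequence(c, vocab), 4) for c in comments]
print(vocab)
print(padded)
```

Every row of `padded` has the same length, which is what the embedding layer expects.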

## 2.1 Cleaning Input

In [3]:
tkzr = TweetTokenizer(preserve_case=False)
eng_stopwords = (
    'what', 'which', 'who', 'whom', 
    'this', 'that', 'these', 'those', 
    'am', 'is', 'are', 'was', 'were', 
    'be', 'been', 'being', 
    'have', 'has', 'had', 'having', 
    'do', 'does', 'did', 'doing', 
    'a', 'an', 'the', 
    'and', 'but', 'if', 'or', 
    'because', 'as', 'until', 'while', 
    'of', 'at', 'by', 'for', 'with', 
    'about', 'against', 'between', 
    'into', 'through', 'during', 'before', 'after', 
    'above', 'below', 'to', 'from', 
    'up', 'down', 'in', 'out', 'on', 'off', 
    'over', 'under', 'again', 'further', 
    'then', 'once', 'here', 
    'there', 'when', 'where', 'why', 
    'how', 'all', 'any', 'both', 'each', 
    'few', 'more', 'most', 'other', 'some', 
    'such', 'no', 'nor', 'not', 'only', 
    'own', 'same', 'so', 'than', 'too', 'very', 
    'can', 'will', 'just', 'don', 'should', 'now'
)
lmtzr = WordNetLemmatizer()
appos = {
    "aren't" : "are not",
    "can't" : "cannot",
    "couldn't" : "could not",
    "didn't" : "did not",
    "doesn't" : "does not",
    "don't" : "do not",
    "hadn't" : "had not",
    "hasn't" : "has not",
    "haven't" : "have not",
    "he'd" : "he would",
    "he'll" : "he will",
    "he's" : "he is",
    # a dict keeps only the last value of a duplicated key, so "i'd" is listed once
    "i'd" : "I would",
    "i'll" : "I will",
    "i'm" : "I am",
    "isn't" : "is not",
    "it's" : "it is",
    "it'll" : "it will",
    "i've" : "I have",
    "let's" : "let us",
    "mightn't" : "might not",
    "mustn't" : "must not",
    "shan't" : "shall not",
    "she'd" : "she would",
    "she'll" : "she will",
    "she's" : "she is",
    "shouldn't" : "should not",
    "that's" : "that is",
    "there's" : "there is",
    "they'd" : "they would",
    "they'll" : "they will",
    "they're" : "they are",
    "they've" : "they have",
    "we'd" : "we would",
    "we're" : "we are",
    "weren't" : "were not",
    "we've" : "we have",
    "what'll" : "what will",
    "what're" : "what are",
    "what's" : "what is",
    "what've" : "what have",
    "where's" : "where is",
    "who'd" : "who would",
    "who'll" : "who will",
    "who're" : "who are",
    "who's" : "who is",
    "who've" : "who have",
    "won't" : "will not",
    "wouldn't" : "would not",
    "you'd" : "you would",
    "you'll" : "you will",
    "you're" : "you are",
    "you've" : "you have",
    "'re" : " are",
    "wasn't" : "was not",
    "we'll" : "we will"
}

def preprocess(comment):
  
    # credit to the author of this post:
    # https://www.kaggle.com/jagangupta/stop-the-s-toxic-comments-eda

    # replace newlines and tabs with a space
    # (the original pattern '\n\t' only matched a newline directly followed by a tab)
    comment = re.sub(r'[\n\t]+', ' ', comment)

    # replace IP addresses with a placeholder token
    comment = re.sub(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', ' specipaddress ', comment)

    # replace user-page links with a placeholder token
    comment = re.sub(r"\[\[User.*\]", ' specusername ', comment)
    comment = re.sub(r"\[\[User.*\|", ' specusername ', comment)

    # tokenization (also lowercases, since preserve_case=False)
    tokens = tkzr.tokenize(comment)

    # apostrophe replacement: expand contractions
    tokens = [appos.get(token, token) for token in tokens]

    # remove stopwords
    tokens = [token for token in tokens if token not in eng_stopwords]

    # lemmatization, treating every token as a verb
    tokens = [lmtzr.lemmatize(token, 'v') for token in tokens]

    return " ".join(tokens)
  

hint('Cleaning train set...')
Xtr = train['comment_text'].apply(preprocess)
hint('Cleaning validation set...')
Xva = valid['comment_text'].apply(preprocess)
hint('Done')

Done
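To make the effect of the regular expressions in `preprocess` more concrete, here is a small sketch on a made-up string. The first two substitutions use the same patterns as the notebook. The third, non-greedy pattern is a hypothetical alternative that is not part of the original code; it is shown only to illustrate that the greedy `.*` runs to the last `]` on the line and can swallow unrelated text.

```python
import re

text = "see [[User:Foo|Foo]] and [[Page]] ok 99.231.81.164"

# IP-like patterns are replaced by a placeholder token
masked_ip = re.sub(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", " specipaddress ", text)

# greedy `.*` as in the notebook: matches up to the LAST `]` on the line
greedy = re.sub(r"\[\[User.*\]", " specusername ", text)

# hypothetical non-greedy variant: stops at the first closing `]]`
lazy = re.sub(r"\[\[User.*?\]\]", " specusername ", text)

print(masked_ip)
print(greedy)
print(lazy)
```

Whether the greedy behaviour matters in practice depends on how often a comment contains several wiki links on one line, which we have not measured here.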


## 2.2 Transforming Comments to Sequences

In [4]:
vocab_max = 100000

hint('Fitting the tokenizer...')
tokenizer = ktxt.Tokenizer(num_words=vocab_max)
tokenizer.fit_on_texts(Xtr)

hint('Tokenizing...')
Xtr = tokenizer.texts_to_sequences(Xtr)
Xva = tokenizer.texts_to_sequences(Xva)

hint('padding the sequences...')
max_comment_length = 200  # padded/cropped comment length
Xtr = sequence.pad_sequences(Xtr, maxlen=max_comment_length)
Xva = sequence.pad_sequences(Xva, maxlen=max_comment_length)

hint('Done')

Done


# 3. Training Model

In [5]:
hint("Loading pre-embedding file...")
emb = pd.read_table(
    'preembedding/glove.6B.300d.txt', 
    sep=" ", index_col=0, header=None, quoting=csv.QUOTE_NONE
)

hint("Preparing embedding matrix...")
embedding_dimension = 300
embedding_matrix = np.random.normal(
    emb.mean(axis=0), 
    emb.std(axis=0), 
    (vocab_max, embedding_dimension)
)
hint("Constructing embedding matrix")
for word, i in tokenizer.word_index.items():
    if i < vocab_max and word in emb.index:
        embedding_matrix[i] = emb.loc[word].values  # .as_matrix() was removed in pandas 1.0

hint("Done")
# optional: free memory:
emb = None

Done
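The idea behind the embedding matrix can be sketched on toy data: every row starts as random noise drawn from the per-dimension mean and standard deviation of the pretrained vectors, and rows whose word has a pretrained vector are then overwritten. The tiny `pretrained` dictionary and `word_index` below are made-up stand-ins for the GloVe table and the Keras tokenizer index.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy stand-ins: three pretrained 4-d vectors and a 5-slot vocabulary
pretrained = {
    "good": np.array([0.1, 0.2, 0.3, 0.4]),
    "bad": np.array([-0.1, -0.2, -0.3, -0.4]),
    "ok": np.array([0.0, 0.1, 0.0, 0.1]),
}
word_index = {"good": 1, "zzz": 2, "bad": 3, "rare": 7}
vocab_size, dim = 5, 4

vectors = np.stack(list(pretrained.values()))
# random rows drawn per dimension from the pretrained mean and std
matrix = rng.normal(vectors.mean(axis=0), vectors.std(axis=0), (vocab_size, dim))

for word, i in word_index.items():
    if i < vocab_size and word in pretrained:
        matrix[i] = pretrained[word]  # known word: copy its pretrained vector

print(matrix.shape)
```

Here `"zzz"` keeps its random row because it has no pretrained vector, and `"rare"` is skipped because its index lies outside the vocabulary limit.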


In [18]:
model = None
model = Sequential()
model.add(Embedding(
    vocab_max, 
    embedding_dimension, 
    weights=[embedding_matrix],
    input_length=max_comment_length
))
model.add(SpatialDropout1D(0.5))
model.add(Conv1D(filters=256, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(SpatialDropout1D(0.2))
model.add(Bidirectional(GRU(units=64, return_sequences=True)))
model.add(SpatialDropout1D(0.5))
model.add(Conv1D(filters=64, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(SpatialDropout1D(0.2))
model.add(Bidirectional(GRU(units=32, return_sequences=True)))
model.add(GlobalMaxPooling1D())
model.add(Dense(len(labels), activation='sigmoid'))
model.compile(
    optimizer='adam', 
    loss='binary_crossentropy', 
    metrics=['accuracy']
)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 200, 300)          30000000  
_________________________________________________________________
spatial_dropout1d_2 (Spatial (None, 200, 300)          0         
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 200, 256)          230656    
_________________________________________________________________
max_pooling1d_3 (MaxPooling1 (None, 100, 256)          0         
_________________________________________________________________
spatial_dropout1d_3 (Spatial (None, 100, 256)          0         
_________________________________________________________________
bidirectional_3 (Bidirection (None, 100, 128)          123264    
_________________________________________________________________
spatial_dropout1d_4 (Spatial (None, 100, 128)          0         
__________
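The parameter counts reported in the summary can be checked by hand. The sketch below reproduces the first three trainable layers. The function names are hypothetical, and the GRU formula assumes the classic formulation with a single bias vector per gate, which is consistent with the count printed above (newer Keras defaults may add a second bias and give a different number).

```python
def embedding_params(vocab_size, dim):
    # one dim-sized vector per vocabulary slot
    return vocab_size * dim

def conv1d_params(kernel_size, in_channels, filters):
    # one kernel per filter plus one bias per filter
    return kernel_size * in_channels * filters + filters

def bidirectional_gru_params(units, input_dim):
    # classic GRU: 3 gates, each with input weights, recurrent weights, one bias
    one_direction = 3 * (units * (input_dim + units) + units)
    return 2 * one_direction  # forward and backward copies

print(embedding_params(100000, 300))      # 30000000
print(conv1d_params(3, 300, 256))         # 230656
print(bidirectional_gru_params(64, 256))  # 123264
```

The sequence lengths follow the same kind of arithmetic: `padding='same'` keeps the length at 200, and each `MaxPooling1D(pool_size=2)` halves it.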

Now we train the model.

In [19]:
epochs = 3
batch_size = 64

def get_class_weight(x):
    k = 100
    return 3.32*np.log(k/x + 1)
    

history = model.fit(
    Xtr, Ytr, 
    epochs=epochs, 
    batch_size=batch_size,
    validation_data=(Xva, Yva),
    class_weight={
        0: get_class_weight(98),
        1: get_class_weight(10),
        2: get_class_weight(53),
        3: get_class_weight(2),
        4: get_class_weight(49),
        5: get_class_weight(8),
    }
)

Train on 127656 samples, validate on 31915 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
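To see what `get_class_weight` does to the individual labels, the sketch below evaluates the same formula with the standard library only. The arguments passed in the training cell look like approximate per-mille label frequencies; that reading is our interpretation and is not stated in the notebook.

```python
import math

def get_class_weight(x, k=100):
    # same formula as in the training cell, with math.log (natural log)
    return 3.32 * math.log(k / x + 1)

# values copied from the class_weight dictionary above
freqs = {
    "toxic": 98,
    "severe_toxic": 10,
    "obscene": 53,
    "threat": 2,
    "insult": 49,
    "identity_hate": 8,
}
weights = {label: get_class_weight(x) for label, x in freqs.items()}

# print from the smallest to the largest weight
for label, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print("%-14s %.2f" % (label, w))
```

Rarer labels receive larger weights, and the logarithm compresses the spread, so the rarest label is weighted far less than its raw frequency ratio would suggest.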


Next, we make predictions on the validation set.

In [8]:
hint("Making prediction...")
Yva_ = model.predict(Xva)
hint("Done")

Done


# 4. Result Analysis
## 4.1 Global Accuracy

In [9]:
total_sample = Xva.shape[0]
print("validation set sample count: %d\n" % total_sample)
prediction_total = total_sample*Yva.shape[1]
best_t = None
best_accuracy = 0
for t in [i*0.1 for i in range(1, 10)]:
    accuracy = np.sum(Yva == (Yva_ >= t))/prediction_total
    if accuracy > best_accuracy: 
        best_t = t
        best_accuracy = accuracy
    print("accuracy for threshold %.1f: %.2f%%" % (t, accuracy*100))
Yva_T = Yva_ >= best_t
correct = Yva == Yva_T
print("\nbest threshold: %.1f" % best_t)
print("best accuracy: %.2f%%" % (best_accuracy*100))

validation set sample count: 31915

accuracy for threshold 0.1: 96.31%
accuracy for threshold 0.2: 97.25%
accuracy for threshold 0.3: 97.67%
accuracy for threshold 0.4: 97.96%
accuracy for threshold 0.5: 98.14%
accuracy for threshold 0.6: 98.24%
accuracy for threshold 0.7: 98.26%
accuracy for threshold 0.8: 98.19%
accuracy for threshold 0.9: 97.87%

best threshold: 0.7
best accuracy: 98.26%
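The threshold search above can be sketched on a tiny, made-up example without NumPy. Accuracy is counted element-wise over every (sample, label) pair, as in the cell above; `elementwise_accuracy` is a hypothetical helper name.

```python
def elementwise_accuracy(y_true, y_prob, threshold):
    """Share of (sample, label) cells where the thresholded prediction is right."""
    hits = 0
    total = 0
    for true_row, prob_row in zip(y_true, y_prob):
        for t, p in zip(true_row, prob_row):
            hits += int(t == int(p >= threshold))
            total += 1
    return hits / total

# toy data: 4 samples, 2 labels
y_true = [[1, 0], [0, 0], [1, 1], [0, 0]]
y_prob = [[0.9, 0.2], [0.4, 0.1], [0.62, 0.8], [0.3, 0.55]]

thresholds = [i / 10 for i in range(1, 10)]
best_t = max(thresholds, key=lambda t: elementwise_accuracy(y_true, y_prob, t))
print(best_t, elementwise_accuracy(y_true, y_prob, best_t))
```

Because most cells are negative in the real data, element-wise accuracy is high even for a model that rarely predicts a positive label, which is one reason to look at the per-class breakdown as well.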


## 4.2 Accuracy by Classes

In [10]:
overview = pd.DataFrame(index=[
    'label‰ of all',
    'total wrong', 
    'P->N', 
    'N->P', 
    'P->N %', 
    'N->P %',
    'avg len',
])

def analyze_class(i):
    wrong = valid[correct[:, i] != 1]
    total_class_error = len(wrong)
    print("%d predicted incorrectly (%.2f%% of all samples)" % (
        total_class_error, 
        100*total_class_error/total_sample
    ))
        
    wrong_seqs = Xva[correct[:, i] != 1]
    lens = [ len(seq[seq != 0]) for seq in wrong_seqs]
    avg_len = np.mean(lens)
    print("Falsely predicted sequences have an average length of %d" % avg_len)

    PpN = valid[(Yva[:, i] == 1) & (Yva_T[:, i] == 0)]
    PpN_count = len(PpN)
    print("\n%d (%.2f%%) positive label were predicted to be negative" % (
        PpN_count, 
        100*PpN_count/total_class_error 
    ))
    if PpN_count > 4:
        print("Samples:")
        for sample in PpN.sample(5)['comment_text']:
            display(sample)
  
    NpP = valid[(Yva[:, i] == 0) & (Yva_T[:, i] == 1)]
    NpP_count = len(NpP)
    print("\n%d (%.2f%%) negative label were predicted to be positive" % (
        NpP_count, 
        100*NpP_count/total_class_error 
    ))
    if NpP_count > 4:
        print("Samples:")
        for sample in NpP.sample(5)['comment_text']:
            display(sample)
  
    overview[labels[i]] = [
        np.mean(Yva[:, i]*1000),
        total_class_error, 
        PpN_count,  
        NpP_count,
        100*PpN_count/total_class_error,
        100*NpP_count/total_class_error,
        avg_len
    ]
  
    print('\n')
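The counting at the core of `analyze_class` can be sketched for a single label column in plain Python. The helper `error_breakdown` is hypothetical. Unlike `analyze_class`, it guards against a class with no errors at all, where dividing by `total_class_error` would fail.

```python
def error_breakdown(y_true, y_pred):
    """Count positives predicted negative (P->N) and negatives predicted positive (N->P)."""
    p_to_n = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    n_to_p = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    total_wrong = p_to_n + n_to_p

    def share(n):
        # guard against division by zero when the class has no errors
        return 100 * n / total_wrong if total_wrong else 0.0

    return {
        "total wrong": total_wrong,
        "P->N": p_to_n,
        "N->P": n_to_p,
        "P->N %": share(p_to_n),
        "N->P %": share(n_to_p),
    }

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 0, 0, 0]
result = error_breakdown(y_true, y_pred)
print(result)
```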
  

### 4.2.1 Toxic

In [11]:
analyze_class(0)

1186 predicted incorrectly (3.72% of all samples)
Falsely predicted sequences have an average length of 36

926 (78.08%) positive label were predicted to be negative
Samples:




'Current event???? \n\nwtf is the current event about?'

'"\n Yeah, report me \n\nHave fun reporting me, then I won\'t need to do it. Please be so kind and cite the following diffs to the admins, will you:  (""leave the real work to real men, not clowns"");  (""You think your funny, pair of Clowns, probably responsible for the Hitler redirect vandalism"");  (""Just because you get easily confused by logic..."");  (""sheer ignorance"");  (""stop being such a proud moron.""). BTW, I\'m not currently an administrator.  ☼ \n\n(editconflict)Regardless, I am listing the diffs of the context because that doesn\'t give you the right to abuse your authority and (no, ""Capital of Macedonia"" does not redirect here. Why would it?) lie, changing the redirect so that your statement would then be true. I can show the history with the diffs that you have been needlessly impeding and useless, basically provoking and abusing your wrongly given authority. So laugh all your want, adminship isn\'t permanent it comes under review and lies count.  \n\n Oh, and BT

'"\nStupid Vikings.  A creative, new way to lose. Instead of a run play to center the ball and kick the winning FG with time expiring, let\'s have the ol\' man throw across his body to his blind side and get picked off and force OT and hand the game to the Aints.  Creative.  As I\'ve always maintain, the vikings always find a way to lose.   | 76 "'

"I think he's a prick."


260 (21.92%) negative label were predicted to be positive
Samples:


'/* Im gay and im Proud */  \n\nListen every one i recently came out of the closet and came out as a homosexual and i have just told my parents this ....HOOORRAY BEER!!!!!!!!\n\n The VaNdAl'

'"\n\nWould you just do me a favor though and stop harassing me! I have reported you, and there will be something coming to you soon. I don\'t care that you have the ""Stupid Original Barnstar"", but that doesn\'t mean that you can harass me."'

'PHUQ OFF YOU KNOB OF CHEESE!\n\nSALT!\nPHUQ OFF YOU KNOB OF CHEESE!\n\nSALT!\nPHUQ OFF YOU KNOB OF CHEESE!\n\nSALT!\nPHUQ OFF YOU KNOB OF CHEESE!\n\nSALT!\nPHUQ OFF YOU KNOB OF CHEESE!\n\nSALT!\nPHUQ OFF YOU KNOB OF CHEESE!\n\nSALT!\nPHUQ OFF YOU KNOB OF CHEESE!\n\nSALT!\nPHUQ OFF YOU KNOB OF CHEESE!\n\nSALT!\nPHUQ OFF YOU KNOB OF CHEESE!\n\nSALT!\nPHUQ OFF YOU KNOB OF CHEESE!\n\nSALT!\nPHUQ OFF YOU KNOB OF CHEESE!\n\nSALT!\nPHUQ OFF YOU KNOB OF CHEESE!\n\nSALT!\nPHUQ OFF YOU KNOB OF CHEESE!\n\nSALT!\nPHUQ OFF YOU KNOB OF CHEESE!\n\nSALT!\nPHUQ OFF YOU KNOB OF CHEESE!\n\nSALT!\nPHUQ OFF YOU KNOB OF CHEESE!\n\nSALT!\nPHUQ OFF YOU KNOB OF CHEESE!\n\nSALT!\nPHUQ OFF YOU KNOB OF CHEESE!\n\nSALT!\nPHUQ OFF YOU KNOB OF CHEESE!\n\nSALT!\nPHUQ OFF YOU KNOB OF CHEESE!\n\nSALT!\nPHUQ OFF YOU KNOB OF CHEESE!\n\nSALT!\nPHUQ OFF YOU KNOB OF CHEESE!\n\nSALT!\nPHUQ OFF YOU KNOB OF CHEESE!\n\nSALT!\nPHUQ OFF YOU KNOB OF CHEESE!\n\nSALT!\nPHUQ OFF YOU KNOB OF CHEESE!\n\nSALT!\nPHUQ OFF YOU KNOB OF CHE

'And you should also wait for people to tell what you are known for. After all what we are known for is something that belongs to others thoughts not ours. You should make a survey. Maybe it turns out that you are not known for viciously attacking stupidity but for your stupid vicious attacks. 99.231.81.164'

'dude you guys must be on crack! instead of blocking those idiots for trying to delete a page without legitimate reason you blocked me out? with morons like you Wikipedia will not survive mark my words!  \n\ni could not care less for my account any more as i will boycott this madness of a site'





### 4.2.2 Severe Toxic

In [12]:
analyze_class(1)

321 predicted incorrectly (1.01% of all samples)
Falsely predicted sequences have an average length of 28

319 (99.38%) positive label were predicted to be negative
Samples:


"by the way\nthis is a university computer.. so go ahead and block it cock sucker. i'll just use my home computer to let the world know what a stupid asshole you are. FUCK OFFFF DICKBAG."

'DAT CHICK \n\nOMG I DIDNT VANDALISE NUFFINK YH??? JUST EXPRESSIN MY OPINION AND MAKIN CONVERSATION INNIT. GET A LIFE, GET A JOB, FK U ENNIT.'

"You Are a fuck head I'm not a vandal so suck cock ass master.  We'll talk after your done with your Mom."

'oh, and armarkov, you fuck and are a cocksucker.'

'DOOSH DOOSH DOOSH DOOSH DOOSH FUCKING COCK SUCKER QUEIR BATE ASS WIPE DONT TELL ME WHAT THE FUCK TO WRITE'


2 (0.62%) negative label were predicted to be positive




### 4.2.3 Obscene

In [13]:
analyze_class(2)

622 predicted incorrectly (1.95% of all samples)
Falsely predicted sequences have an average length of 34

419 (67.36%) positive label were predicted to be negative
Samples:


"Why do you keep blocking me? \n\nYou are a big as idiot as DMacks now.  You cant block me as I just change my IP.  Why don't you try and negotiate with me and i will stop editing your page?   I WANT DMACKS PRIVELAGES REMOVED!!!\n\nBTW there is no threat so dont even try to turn it on me"

'complete bullshit - your past is catching up with you kurt  after you had defamed as many users as possible and spammed out the usenet groups it appears you moved over to wikipedia where hitherto you have remained unchallenged - that is up until now.'

'Piss off you evil dwarf'

'IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT/////IDIOT////

'A proposal to cut down on AIDS \n\nLess AIDS would be spread if you were to stop inserting your minuscule penis into little boys.'


203 (32.64%) negative label were predicted to be positive
Samples:


"Junior Barranquilla?\n\nWTF? how did you come up with that name Junior Barranquilla? should we call REAL MADRID... ROYAL MADRID??????????????  pleasse push for a revert!!! don't come up with that made up bullshit name! \n\nAtlético Junior is the common name!!! no one calls it Junior Barranquilla... maybe Junior de Barranquilla, but never something such name OMG.. that was just product of pure ignorance.. Cheers 76.19.149.244"

'Because it gives you diarrhea and YOU SPELLS DIARRHEA WRONG!!!!\n):):):('

'YOU NEED SUM PUSSY GO PAY A GIRL SHE WILL TAKE CARE OF YOU, I SAY THIS BECUASE MOST PPL HERE ARE DORKS AND VIRGINS THATS NEVER EVEN TOUCHED THE A FEMALE'

"Another idiot admin who doesn't know the policy of Wikipedia. Good job, Swarm, you stupidly blocked me for returning the page to the last consensus. That's what we're supposed to do in the case of a content dispute. Yes, I know, that is so hard to learn! Another moron example of how admins are the sickness of Wikipedia."

'"\n\n ""What\'s a ballsy?"" \n\nShe really can\'t be that stupid....can she???"'





### 4.2.4 Threat

In [14]:
analyze_class(3)

87 predicted incorrectly (0.27% of all samples)
Falsely predicted sequences have an average length of 26

69 (79.31%) positive label were predicted to be negative
Samples:


'(http://wiki.uschallengecoins.com/Main_Page)\n\nIF YOU REVERT MY FUCKIN EDITS METROS I WILL KILL YOU!'

"Fuck you, like I give a shit. Point to the word where I cast suspicion on her. Point to it. Point to it you stupid asshole. But you can't. Go fuck yourself. I hope you die."

'Fuck you cunt \n\n \nHow fucking dare u judge me cunt. Dont fucking stand up for telstra fucker. You small cock fag. ILl cum on your grave and hope that shit rains down it. Fucker. Fuck I hope you die. Fuck i want you dead. WHy lefty. Do you use your left hand to wank cunt? Fucking wanker. Oww i just came.'

"Personal Attack Number 2 \n\nThis is another personal attack about you being a massive donkey dick sucking homosexual. This is a concerned plea that you should at once drown yourself in a sewer. Fucking fuck off you fag! And if you threaten me with legal action again, I'll fucking come over there, tie you to your computer chair and stick my cock up your arse until it bleeds. You fucker!!"

'YOU ARE A FUCKING NERD. GO KILL YOURSELF YOU NERD.'


18 (20.69%) negative label were predicted to be positive
Samples:


"blow me\n\nIts my right of freedom of speech to edit wikipedia. So fuck off or I'll rape you anally and vaginally. Good day. 78.148.200.177"

'I am going to shoot you in the head and laugh as your brains splatter onto the ground. \n\nI am going to shoot you in the head and laugh as your brains splatter onto the ground.'

'I hope you die!==\nFind a pit and jump into it. <3\n\n=='

'Hi! \n\nI wanna rape you!'

"I'ma smack ya upside da head wit a shovel.  \n\nI'm takin ya down, boi."





### 4.2.5 Insult

In [15]:
analyze_class(4)

856 predicted incorrectly (2.68% of all samples)
Falsely predicted sequences have an average length of 31

450 (52.57%) positive label were predicted to be negative
Samples:


"You must have the lowest IQ of anyone on wikipedia, I can't converse with a fool like you"

'amelia earhart was a prostitute that live on broadway.'

'"\nThank you , and the same to you. worm "'

'Red herrings to continue to excuse Breivik.  Go away, nasty person.'




406 (47.43%) negative label were predicted to be positive
Samples:


"As far as Disputive Edits are concerned You Idiot I didn't made any. Just made a request to delete the Images. Keep your temper to yourself.\nThank You."

'WA NKER ALERT WA NKER ALERT\n\nif i find out where you or james whale are i promise you i will fucking smack the shit out of you'

'to munch pussy and not know how to spell good in English, because I am fucking foreigner for who with which English is the hard language I am dyke.'



'"\nIf you continue acting like a racist fool, you\'ll be called a racist fool. You\'re demonstrating abject anti-semitism, linking up to absurd ""proof"" of your racist theories. Being a racist fool isn\'t civil, and you are definitely behaving like one right now."'





### 4.2.6 Identity Hate

In [16]:
analyze_class(5)

252 predicted incorrectly (0.79% of all samples)
Falsely predicted sequences have an average length of 30

158 (62.70%) positive label were predicted to be negative
Samples:


'Haha. Go ahead. Stupid American.'

'Jewish \n\nPlease dont think that jews are everyone just based on your claims. Also dont think jews are smart. if you think so, you are making a clown out of yourself..'

'Faggot \n\nEnough said'

"American mouths are way too faster than American brain ) You Americans need to learn to listen if you truly want to get out of the mess you're in."

"Hey Harry, You are a fucking fag. There is nothing wrong with that. You delete all my edits because you know they are correct but are much smarter than anything you can come up with. Ps, you're a cunt. 144.132.144.57"


94 (37.30%) negative label were predicted to be positive
Samples:


'He is not gay in Rugrats, but on the internet, he is.'

'a crazy nerd who dislikes women'

'Is it gay if you bang an animal of the same secks?'

'sup \n\nThanks for the anal sex yesterday!\n\nLove,\nGeorge (your homosexual lover)'

'By the way, are there any Azeris looking like this guy. Having lived among the Jews for a long time I can declare that this guy is a Jew. He is a typical Gorsky Jew.'





### 4.2.7 Overview

In [17]:
overview.astype(int)

               toxic  severe_toxic  obscene  threat  insult  identity_hate
label‰ of all     96            10       52       2      49              9
total wrong     1186           321      622      87     856            252
P->N             926           319      419      69     450            158
N->P             260             2      203      18     406             94
P->N %            78            99       67      79      52             62
N->P %            21             0       32      20      47             37
avg len           36            28       34      26      31             30


# 5. Conclusion
The model did not perform as well as RCNN v2. Adding more layers is difficult to get right. Since other approaches have proved to work better, I will leave this approach for later.