In this notebook, we test an improved RCNN model. It utilizes several new tools that might improve the model's performance.

# 1. Preparation
We first need to import the required libraries, download the data, and load the data into memory.

## 1.1 Import

In [1]:
print('Importing required packages...')

from IPython.display import clear_output, display
import re
import csv
import pandas as pd
import numpy as np
np.random.seed()
import matplotlib.pyplot as plt
import nltk
from nltk.tokenize import TweetTokenizer
nltk.download('stopwords')
from nltk.stem.wordnet import WordNetLemmatizer 
nltk.download('wordnet')
from keras.preprocessing import sequence
from keras.preprocessing import text as ktxt
from keras.models import Sequential
from keras.layers import Dense, Embedding, GRU, SpatialDropout1D, Bidirectional, Dropout
from keras.layers import Concatenate, GlobalAveragePooling1D, GlobalMaxPooling1D
from keras.layers.convolutional import Conv1D, MaxPooling1D
from keras.callbacks import EarlyStopping
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.utils import class_weight


def hint(message):
    """
    erase previous ipynb output and show new message
    """
    clear_output()
    print(message)

  

Importing required packages...
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ChuanLi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ChuanLi\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Using TensorFlow backend.


## 1.2 Loading the Data

In [2]:
hint('loading data...')
train = pd.read_csv('data/train.csv')
train, valid = train_test_split(train, test_size=0.2)

labels = [
    'toxic', 
    'severe_toxic', 
    'obscene', 
    'threat', 
    'insult', 
    'identity_hate'
]

Ytr = train[labels].values
Yva = valid[labels].values

hint('Label distribution between training and validation set:')
print(pd.DataFrame({
    'label': labels,
    'train': [np.mean(train[label]) for label in labels],
    'validation' : [np.mean(valid[label]) for label in labels],
}))

Label distribution between training and validation set:
           label     train  validation
0          toxic  0.095906    0.095598
1   severe_toxic  0.009949    0.010183
2        obscene  0.053151    0.052138
3         threat  0.003024    0.002883
4         insult  0.049344    0.049444
5  identity_hate  0.009032    0.007896


# 2. Pre-processing the Input
There are many ways to pre-process the raw strings into valid input for the model. Here we do it by building a dictionary from all the comments in the training set, mapping each word to its index in that dictionary, and padding or cropping the resulting sequences so that they all have the same length.
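
As a rough illustration of the three steps just described, here is a minimal pure-Python sketch on toy data. The helper names `fit_vocab` and `to_padded` are hypothetical and are not used elsewhere in this notebook; the actual cells below rely on the Keras `Tokenizer` and `pad_sequences` instead. The sketch assumes index 0 is reserved for padding and that padding and cropping both happen at the front of the sequence.

```python
from collections import Counter


def fit_vocab(texts, num_words):
    """Rank words by frequency; index 1 is the most frequent, 0 is kept for padding."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = [word for word, _ in counts.most_common()]
    # keep only words whose index is below num_words
    return {word: i + 1 for i, word in enumerate(ranked) if i + 1 < num_words}


def to_padded(text, vocab, maxlen):
    """Map known words to their index, then crop or left-pad to exactly maxlen."""
    seq = [vocab[word] for word in text.split() if word in vocab]
    seq = seq[-maxlen:]                     # crop: keep the last maxlen tokens
    return [0] * (maxlen - len(seq)) + seq  # pad on the left with zeros


toy_texts = ["you be rude", "you be kind kind", "thanks"]
toy_vocab = fit_vocab(toy_texts, num_words=100)
toy_padded = [to_padded(text, toy_vocab, maxlen=3) for text in toy_texts]
print(toy_padded)
```

Every row of `toy_padded` has length 3, which is the property the model input needs.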

## 2.1 Cleaning Input

In [3]:
tkzr = TweetTokenizer(preserve_case=False)
eng_stopwords = (
    'what', 'which', 'who', 'whom', 
    'this', 'that', 'these', 'those', 
    'am', 'is', 'are', 'was', 'were', 
    'be', 'been', 'being', 
    'have', 'has', 'had', 'having', 
    'do', 'does', 'did', 'doing', 
    'a', 'an', 'the', 
    'and', 'but', 'if', 'or', 
    'because', 'as', 'until', 'while', 
    'of', 'at', 'by', 'for', 'with', 
    'about', 'against', 'between', 
    'into', 'through', 'during', 'before', 'after', 
    'above', 'below', 'to', 'from', 
    'up', 'down', 'in', 'out', 'on', 'off', 
    'over', 'under', 'again', 'further', 
    'then', 'once', 'here', 
    'there', 'when', 'where', 'why', 
    'how', 'all', 'any', 'both', 'each', 
    'few', 'more', 'most', 'other', 'some', 
    'such', 'no', 'nor', 'not', 'only', 
    'own', 'same', 'so', 'than', 'too', 'very', 
    'can', 'will', 'just', 'don', 'should', 'now'
)
lmtzr = WordNetLemmatizer()
appos = {
    "aren't" : "are not",
    "can't" : "cannot",
    "couldn't" : "could not",
    "didn't" : "did not",
    "doesn't" : "does not",
    "don't" : "do not",
    "hadn't" : "had not",
    "hasn't" : "has not",
    "haven't" : "have not",
    "he'd" : "he would",
    "he'll" : "he will",
    "he's" : "he is",
    "i'd" : "I would",
    "i'll" : "I will",
    "i'm" : "I am",
    "isn't" : "is not",
    "it's" : "it is",
    "it'll":"it will",
    "i've" : "I have",
    "let's" : "let us",
    "mightn't" : "might not",
    "mustn't" : "must not",
    "shan't" : "shall not",
    "she'd" : "she would",
    "she'll" : "she will",
    "she's" : "she is",
    "shouldn't" : "should not",
    "that's" : "that is",
    "there's" : "there is",
    "they'd" : "they would",
    "they'll" : "they will",
    "they're" : "they are",
    "they've" : "they have",
    "we'd" : "we would",
    "we're" : "we are",
    "weren't" : "were not",
    "we've" : "we have",
    "what'll" : "what will",
    "what're" : "what are",
    "what's" : "what is",
    "what've" : "what have",
    "where's" : "where is",
    "who'd" : "who would",
    "who'll" : "who will",
    "who're" : "who are",
    "who's" : "who is",
    "who've" : "who have",
    "won't" : "will not",
    "wouldn't" : "would not",
    "you'd" : "you would",
    "you'll" : "you will",
    "you're" : "you are",
    "you've" : "you have",
    "'re": " are",
    "wasn't": "was not",
    "we'll" : "we will"
}

def preprocess(comment):
  
    # credit to the author of this post:
    # https://www.kaggle.com/jagangupta/stop-the-s-toxic-comments-eda

    # remove special format
    comment = re.sub('\n\t', '', comment)

    # replace IP addresses with a placeholder token
    comment = re.sub(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', ' specipaddress ', comment)

    # replace user references with a placeholder token
    comment = re.sub(r"\[\[User.*\]", ' specusername ', comment)
    comment = re.sub(r"\[\[User.*\|", ' specusername ', comment)

    # tokenization 
    tokens = tkzr.tokenize(comment)

    # apostrophe (contraction) replacement
    tokens = [ appos[token] if token in appos else token for token in tokens]

    # remove stopwords
    tokens = [ token for token in tokens if token not in eng_stopwords ]

    # lemmatization (each token is treated as a verb)
    tokens = [ lmtzr.lemmatize(token, 'v') for token in tokens]

    return " ".join(tokens)
  

hint('Cleaning train set...')
Xtr = train['comment_text'].apply(lambda c: preprocess(c))
hint('Cleaning validation set...')
Xva = valid['comment_text'].apply(lambda c: preprocess(c))
hint('Done')

Done


## 2.2 Transforming Comments to Sequences

In [4]:
vocab_max = 100000

hint('Fitting the tokenizer...')
tokenizer = ktxt.Tokenizer(num_words=vocab_max)
tokenizer.fit_on_texts(Xtr)

hint('Tokenizing...')
Xtr = tokenizer.texts_to_sequences(Xtr)
Xva = tokenizer.texts_to_sequences(Xva)

hint('padding the sequences...')
max_comment_length = 200  # padded/cropped comment length
Xtr = sequence.pad_sequences(Xtr, maxlen=max_comment_length)
Xva = sequence.pad_sequences(Xva, maxlen=max_comment_length)

hint('Done')

Done


# 3. Training Model

In [5]:
hint("Loading pre-embedding file...")
emb = pd.read_table(
    'preembedding/glove.840B.300d.txt', 
    sep=" ", index_col=0, header=None, quoting=csv.QUOTE_NONE
)

hint("Preparing embedding matrix...")
embedding_dimension = 300
embedding_matrix = np.random.normal(
    emb.mean(axis=0), 
    emb.std(axis=0), 
    (vocab_max, embedding_dimension)
)
hint("Constructing embedding matrix")
for word, i in tokenizer.word_index.items():
    if i < vocab_max and word in emb.index:
        # .values replaces the deprecated .as_matrix()
        embedding_matrix[i] = emb.loc[word].values

hint("Done")
# optional: free memory:
emb = None

Done
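
The cell above fills the embedding matrix with random values drawn from the per-dimension mean and standard deviation of the pre-trained vectors, then overwrites the rows of words that do have a pre-trained vector. The toy sketch below shows the same idea at a tiny scale. The dictionary `pretrained`, the 4-dimensional vectors, and the names prefixed with `toy_` are invented for illustration; NumPy's `std` is used here, which may differ slightly from the pandas default used above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4-dimensional "pre-trained" vectors for three words.
pretrained = {
    "good": np.array([0.1, 0.2, 0.3, 0.4]),
    "bad": np.array([-0.1, -0.2, -0.3, -0.4]),
    "ok": np.array([0.0, 0.1, 0.0, 0.1]),
}
toy_word_index = {"good": 1, "bad": 2, "zzz": 3}  # "zzz" has no pre-trained vector
toy_vocab_max, toy_dim = 4, 4

stacked = np.stack(list(pretrained.values()))

# Start from random rows that follow the per-dimension statistics of the vectors.
toy_matrix = rng.normal(
    stacked.mean(axis=0),
    stacked.std(axis=0),
    (toy_vocab_max, toy_dim),
)

# Overwrite the rows of words that have a pre-trained vector.
for word, i in toy_word_index.items():
    if i < toy_vocab_max and word in pretrained:
        toy_matrix[i] = pretrained[word]

print(toy_matrix)
```

Rows 1 and 2 now hold the known vectors, while row 0 (padding) and row 3 (the unknown word) keep their random initialization.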


In [6]:
model = None
model = Sequential()
model.add(Embedding(
    vocab_max, 
    embedding_dimension, 
    weights=[embedding_matrix],
    input_length=max_comment_length
))
model.add(SpatialDropout1D(0.5))
model.add(Conv1D(filters=256, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(SpatialDropout1D(0.2))
model.add(Bidirectional(GRU(units=64, return_sequences=True)))
model.add(GlobalMaxPooling1D())
#model.add(Dense(64, activation='relu'))
model.add(Dense(len(labels), activation='sigmoid'))
model.compile(
    optimizer='adam', 
    loss='binary_crossentropy', 
    metrics=['accuracy']
)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 200, 300)          30000000  
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, 200, 300)          0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 200, 256)          230656    
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 100, 256)          0         
_________________________________________________________________
spatial_dropout1d_2 (Spatial (None, 100, 256)          0         
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100, 128)          123264    
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 128)               0         
__________
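
The parameter counts in the summary can be checked by hand. The sketch below is a back-of-the-envelope calculation, not part of the original pipeline; the helper names are hypothetical, and the GRU formula assumes one bias vector per gate, which is what reproduces the count reported above.

```python
def conv1d_params(kernel_size, in_channels, filters):
    # one kernel of shape (kernel_size, in_channels) per filter, plus one bias each
    return kernel_size * in_channels * filters + filters


def gru_params(input_dim, units):
    # three weight blocks (update gate, reset gate, candidate state), each with
    # input weights, recurrent weights, and a bias vector
    return 3 * (units * (input_dim + units) + units)


embedding_count = 100000 * 300           # vocab_max * embedding_dimension
conv_count = conv1d_params(3, 300, 256)
bigru_count = 2 * gru_params(256, 64)    # forward and backward GRU

print(embedding_count, conv_count, bigru_count)
```

These reproduce the 30000000, 230656, and 123264 parameters listed for the embedding, convolution, and bidirectional layers.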

Now we train the model.

In [7]:
epochs = 2
batch_size = 64

def get_class_weight(x):
    k = 100
    return 3.32*np.log(k/x + 1)
    

history = model.fit(
    Xtr, Ytr, 
    epochs=epochs, 
    batch_size=batch_size,
    validation_data=(Xva, Yva),
    class_weight={
        0: get_class_weight(98),
        1: get_class_weight(10),
        2: get_class_weight(53),
        3: get_class_weight(2),
        4: get_class_weight(49),
        5: get_class_weight(8),
    }
)

Train on 127656 samples, validate on 31915 samples
Epoch 1/2
Epoch 2/2
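
The arguments passed to `get_class_weight` in the cell above (98, 10, 53, 2, 49, 8) appear to be approximate label frequencies per thousand samples; that reading is an assumption, since the notebook does not say where they come from. Under that assumption, the sketch below derives the weights from the training-set frequencies printed in section 1.2 instead of hard-coding them. The names `train_frequency` and `derived_weights` are introduced here for illustration only.

```python
import numpy as np


def get_class_weight(x, k=100):
    """Same formula as above: rarer labels (small x) receive larger weights."""
    return 3.32 * np.log(k / x + 1)


# Fraction of training rows carrying each label, copied from the table in 1.2.
train_frequency = {
    'toxic': 0.095906,
    'severe_toxic': 0.009949,
    'obscene': 0.053151,
    'threat': 0.003024,
    'insult': 0.049344,
    'identity_hate': 0.009032,
}

# Convert each fraction to a per-thousand rate before applying the formula.
derived_weights = {
    i: get_class_weight(1000 * freq)
    for i, freq in enumerate(train_frequency.values())
}

for (label, _), weight in zip(train_frequency.items(), derived_weights.values()):
    print("%-14s %.2f" % (label, weight))
```

With these frequencies the rarest label, `threat`, gets the largest weight and the most common label, `toxic`, gets the smallest.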


Next, we make predictions on the validation set.

In [8]:
hint("Making prediction...")
Yva_ = model.predict(Xva)
hint("Done")

Done


# 4. Result Analysis
## 4.1 Global Accuracy

In [9]:
total_sample = Xva.shape[0]
print("validation set sample count: %d\n" % total_sample)
prediction_total = total_sample*Yva.shape[1]
best_t = None
best_accuracy = 0
for t in [i*0.1 for i in range(1, 10)]:
    accuracy = np.sum(Yva == (Yva_ >= t))/prediction_total
    if accuracy > best_accuracy: 
        best_t = t
        best_accuracy = accuracy
    print("accuracy for threshold %.1f: %.2f%%" % (t, accuracy*100))
Yva_T = Yva_ >= best_t
correct = Yva == Yva_T
print("\nbest threshold: %.1f" % best_t)
print("best accuracy: %.2f%%" % (best_accuracy*100))

validation set sample count: 31915

accuracy for threshold 0.1: 97.19%
accuracy for threshold 0.2: 97.89%
accuracy for threshold 0.3: 98.16%
accuracy for threshold 0.4: 98.30%
accuracy for threshold 0.5: 98.37%
accuracy for threshold 0.6: 98.32%
accuracy for threshold 0.7: 98.24%
accuracy for threshold 0.8: 98.07%
accuracy for threshold 0.9: 97.66%

best threshold: 0.5
best accuracy: 98.37%
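
Because the labels are heavily imbalanced, it helps to compare this accuracy with a trivial baseline that predicts every label as negative. The sketch below computes that baseline from the validation-set label rates printed in section 1.2; it is an added sanity check, not part of the original analysis.

```python
import numpy as np

# Fraction of validation rows carrying each label, copied from the table in 1.2.
valid_positive_rate = np.array([
    0.095598,  # toxic
    0.010183,  # severe_toxic
    0.052138,  # obscene
    0.002883,  # threat
    0.049444,  # insult
    0.007896,  # identity_hate
])

# Predicting "negative" everywhere is wrong exactly on the positive entries, and
# every label has the same number of rows, so the mean rate is the error rate.
baseline_accuracy = 1 - valid_positive_rate.mean()
print("all-negative baseline accuracy: %.2f%%" % (100 * baseline_accuracy))
```

The baseline comes out at roughly 96.4%, so the 98.37% above is best read as an improvement of about two percentage points over predicting nothing, which is why the per-class breakdown in the next section matters.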


## 4.2 Accuracy by Classes

In [10]:
overview = pd.DataFrame(index=[
    'label‰ of all',
    'total wrong', 
    'P->N', 
    'N->P', 
    'P->N %', 
    'N->P %',
    'avg len',
])

def analyze_class(i):
    wrong = valid[correct[:, i] != 1]
    total_class_error = len(wrong)
    print("%d predicted incorrectly (%.2f%% of all samples)" % (
        total_class_error, 
        100*total_class_error/total_sample
    ))
        
    wrong_seqs = Xva[correct[:, i] != 1]
    lens = [ len(seq[seq != 0]) for seq in wrong_seqs]
    avg_len = np.mean(lens)
    print("Falsely predicted sequences have an average length of %d" % avg_len)

    PpN = valid[(Yva[:, i] == 1) & (Yva_T[:, i] == 0)]
    PpN_count = len(PpN)
    print("\n%d (%.2f%%) positive label were predicted to be negative" % (
        PpN_count, 
        100*PpN_count/total_class_error 
    ))
    if PpN_count > 4:
        print("Samples:")
        for sample in PpN.sample(5)['comment_text']:
            display(sample)
  
    NpP = valid[(Yva[:, i] == 0) & (Yva_T[:, i] == 1)]
    NpP_count = len(NpP)
    print("\n%d (%.2f%%) negative label were predicted to be positive" % (
        NpP_count, 
        100*NpP_count/total_class_error 
    ))
    if NpP_count > 4:
        print("Samples:")
        for sample in NpP.sample(5)['comment_text']:
            display(sample)
  
    overview[labels[i]] = [
        np.mean(Yva[:, i]*1000),
        total_class_error, 
        PpN_count,  
        NpP_count,
        100*PpN_count/total_class_error,
        100*NpP_count/total_class_error,
        avg_len
    ]
  
    print('\n')
  

### 4.2.1 Toxic

In [11]:
analyze_class(0)

1134 predicted incorrectly (3.55% of all samples)
Falsely predicted sequences have an average length of 35

850 (74.96%) positive label were predicted to be negative
Samples:


"Prediction Timetable is getting bloated \n\nWhy do you people have to add EVERY FREAKING DETAIL to the list? It needs to be kept simple and not as wordy, this is a wikipedia article not a damn novel. If you wanna know all the details WATCH THE DAMN SHOW – IT'S FREE ON GOOGLE VIDEOS! The list should only contain the most notable and important details. A lot of what is there isn't all that notable – Do we need to know the name of every stupid bridge and building shown collapsing or can we just say – a bunch of bridges and buildings collapsed here and mention ONE OR TWO by name as examples. I don't want to get involved in edit wars but I will delete things I believe are overly worded, redundant and unnecessary, and not to mention just stupid. Use common sense before you edit."

"I'm not new. I don't know what the hell you are talking about."

'Well i think that is just plain stupid..i have seen retarted things on this site, i actually put together a nice little thing for him with ppl editing it. He is a good teacher and a great local wrestler with a bright carrear in front of him, and i dont see wh yu wont have him on this, all i know is he is my personal hero and i think people should know about him and some of the things he does i.e the wrestling moves he does etc.'

'OMG!!! someone is making threats...iam so scared.. thats funny you change and make up your own barnstars thats low and embarassing!! Get a life!! like one out of this site!'

'. Start your hostile attacks now, Pgagnon999'


284 (25.04%) negative label were predicted to be positive
Samples:


"Strange that. It seemed like you were perfectly willing to engage in debate about what the evidence showed about the crash site until I started asking you simple physical questions like 'if the wings sheared off, how did they get in the building?','why would the wings shear off when they would have applied far more force than the stress bearing capacity of a concrete wall?', 'why weren't there any burn marks on the lawn?', and (for fuck sake) 'small hole, big plane, how does that work in a physical sense ?(with more sophisticated details than 'they were blown to hell and back')'. I asked these questions because the answers to them are vital for impartial users of the site to make an informed decision. I added photographs and measurements (and only these things, without conspiracy fairy stories) to help people do this. Instead of constructively discussing what aspects of the measurements were right and wrong and why, or helping me find the copyright details of the photographs (which I 

'YOUR UNDER ARREST \n\nMA NAME IS DI TYLER AND YOUR UNDERARREST BOI'

'dictatorship \n\ngot a pic of hitler above the mantel????'

'yopu smell like fish'

'"\n\n Possibly Twins? \n\nYou are WillowW and I am PillowP. Perhaps we are separated twins! \n\nWe seem to have so many things in common. I, too, sit around and drool at the computer for hours, furiously editing useless articles about math. And hey, I also have a mediocre knowledge of latin poetry! \n\nMaybe we can get together and feel better than other people because we patrol Wikipedia. Gee, but I look up to you. I wish I had a picture of a statue of a little girl holding a dead bird. (. I\'m pouting! \n\n  The Special Editor because you have been extraordinarily slow with the people who have left comments at your Retardation Featured list, taking the time to drool on every single reviewer.  \n\nI\'m a person just like you\nBut I\'ve got better things to do\nThan sit around and fuck my head\nHang out with the living dead\nSnort white shit up my nose\nPass out at the shows\nI don\'t even think about speed\nThat\'s something I just don\'t need\n\nI\'ve got the straight edge\n\n   \n\





### 4.2.2 Severe Toxic

In [12]:
analyze_class(1)

300 predicted incorrectly (0.94% of all samples)
Falsely predicted sequences have an average length of 25

219 (73.00%) positive label were predicted to be negative
Samples:


'You and Jack greenmaven are MotherFuckers'

"Hello all!  I have been blocked by these assholes for speaking my mind.  Wikipedia and their liberal minions can suck my fucking dick.  No funds from me and I'll see that nobody I know contributes.  Fuck OFF!!"

'"\n\n good job for sucking dick \n\n  dick trophy i dont have to do shit u say . and ur the worlds best dick sucker   "'

'fuck you \n\nfuck you deletionist.'

'SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYFUCK HOMELAND SECURITYF


81 (27.00%) negative label were predicted to be positive
Samples:


'FUCK U! \n\nI AINT G0TTA DO SHIT AND FUCK WIKIPEDIA YALL WEAK MOTHER FUCKERS GET UR INFO RIGHT AND LOCK ALL UR SHIT CUZ YOUR INFO IS ALL WRONG SO SUCK MY DICK AND FUCK U BITCH AND MIND UR BUSINESS OR U CAN SUCK MY DICK OR KISS MY ASS!'

'your fucking face. faggots'

'your a fag suck a dick monky nigger lover.'

'BULLSHIT Bullshit \nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT Bullshit\nBULLSHIT B

"Suitecivil your a pussy ass bitch i'd fucken kick your ass any fucken day you little bitch face cunt fuck you and fuck barcelona."





### 4.2.3 Obscene

In [13]:
analyze_class(2)

571 predicted incorrectly (1.79% of all samples)
Falsely predicted sequences have an average length of 38

257 (45.01%) positive label were predicted to be negative
Samples:


'"\n\nI\'m wondering whether to add the fact to the article that she recently received a bukkake from five black dudes and the video was uploaded to slutload.com. She was first face f*ucked. I\'m not sure if it\'s notable enough to include but it was recent (rather surprising news) and her husband is filing for a divorce after she refused to swallow his sperm. I don\'t have a definite source for this, unfortunately. If I did, I would add it to the article. Anyone have any references? (BTW, I know it sounds preposterous and I\'m surprised too but I\'ve read this in a very reliable source (put in ""nicer terms"" like ""she engaged in sexual acts with african americans"").) Thanks guys."'

"I'm sorry I was an asshole about it.  The point is that the vowels in man and noun are raised in AusE according to Felicity Cox (2005), and the phonetic transcription for that IDEA example I gave you transcribes man as , and indeed it sounds like that to my ears, although that speaker did seem a bit broad.  It's not that uncommon for vowels to be raised before nasals.  Take for example the pin-pen merger in Southern American English and in southwestern Ireland or the non-phonemic æ-tensing that takes place in accents that have undergone the Northern cities vowel shift.  The only reason I brought this up is because I imagine it is very noticeable to English ears (it's even noticeable to my American ears).  Many English people might pronounce man as , which sounds quite a bit different from .  You're the Aussie here, so I'm sure you can verify this by simply contrasting your man with your bad (in this case both would have a long vowel, so try to listen for the quality of the vowels) or 

'THERE HASNT BEEN ANY EPISODES SINCE MARCH 7, DUMBASSES!!! \n\n50.180.208.181'

'I will be perfectly clear:  a man saying another man is interested in male genatalia and children is fighting words (bad english, but the point is there).  You were deliberately rude to me and, frankly you can f*** o**.  Should you have the power with this site to remove/destroy my presence, then you had best: do so; attempt a different tone; leavev me alone.  I herein provide you that opportunity.\n\nMy best,\n\n Xchanter'

"i hate you computer nerds and if i get blocked please get me blocked by 'cant sleep clown will eat me'. ok thats all ive got to say to u patetic losers who sit in front of a computer all day.\n\nBold textFUCK YOU ALL!!!!!!!!"


314 (54.99%) negative label were predicted to be positive
Samples:


"This has nothing to do with the article, but I f*ucking love Paul Miller's late show. Just had to say that, lulz sorry."

'Hello, do you have a real job? Or is your life so F.U that you gain solace from editing wiki all day with your inane drivel? Get a real job that pays!'

"I shat somethin' out this mornin' prettier than you! \n\nI HOPE YOU GET FLUSHED DOWN A TOILET, WHERE YOU BELONG"

"Drakhan!  U have no power over mask of life, so knock off peging there admin and let the link alone, it dosn't break the link page rules, so suck it up!"

'you are dumb \n\nyou are dumb and stupid loser'





### 4.2.4 Threat

In [14]:
analyze_class(3)

89 predicted incorrectly (0.28% of all samples)
Falsely predicted sequences have an average length of 27

85 (95.51%) positive label were predicted to be negative
Samples:


"I'ma smack ya upside da head wit a shovel \n\nI'm takin ya down, boi."

'"\nWow, you\'re the one commenting on how users commenting on the page are ""drooling retards"" and you try to drop this courtesy C&P; on me? You\'re just as pathetic as you\'ve always been and I hope you die in a jizz fire."'



'I really hate you \n\nAnd I want to do nasty things to you'

'Go and hang yourself!'


4 (4.49%) negative label were predicted to be positive




### 4.2.5 Insult

In [15]:
analyze_class(4)

799 predicted incorrectly (2.50% of all samples)
Falsely predicted sequences have an average length of 28

438 (54.82%) positive label were predicted to be negative
Samples:


'" The ""C"" isn\'t just a logogram, like you religiously and idiotically insist it is.  "'

"Jaycie\nIt's not as simple as you put it man! It's not just about some trying to get a reputation! She already is known out there, and read the more detailed one for what I'm trying to get across! FFS! No-one is right deleting it, fuckin' fools!\nMAZITO - Tuesday, 26 December, 2006; 02:20 (GMT)"

'Shut up, you button!'

"Idiot \n\nI just wanted to let you know that you and 5 Albert square are doing an awful job with eastenders characters, full siblings have completely different family layouts than each other, some characters are on a relatives info box and when you click on that character, the character before isn't on their info box!! Just wanted to let you know that you are ruining Wikipedia for everyone else😊that is all."

'"\n\nhey warrens, fuck you too. ""Uncivil manner, means me trying to elaborate on my arguments and getting  revert on revert that\'s what uncivil means. So yeah your a big winner champ and won\'t get banned. Asshole. \n\nSince the vista article contains information about the anti trust situation which, since it seems some people here have got a mouthful of gate\'s balls, is mislabelled anti trust controversy. Not really a controversy though is it when a court elected by over two hundred million people issues a decision, it\'a court ruling, like it or not there\'s no controversy too that. Otherwise we should renaim the ted bundy trial, to the tend bundy controversy, since any court ruling merits this label. \n\nSince there\'s a section on anti trust laws and vista, it\'s worth noting the recent news that microsoft are going to be paying a hefty daily fine for completely ignoring the court\'s injuction. And this does have bearing on vista. You can\'t quote the m.s. pr saying we don\'t k


361 (45.18%) negative label were predicted to be positive
Samples:


"Don't be such a pussy \n\nlol it's funny how your panties are in a bunch...sad...are you on your period?"

'Hey Freepsbane, does your mother know that you have been getting fucked by for some time now?'

"You had no right to close my request. You are a lying pig. I did nothing wrong and am better than all of you f-ing idiots. You can't add 1 and 1 but I am the smartest person here."



'All those stupid little blocks will not stop me, I am very patient, I will keep coming back over and over again.'





### 4.2.6 Identity Hate

In [16]:
analyze_class(5)

235 predicted incorrectly (0.74% of all samples)
Falsely predicted sequences have an average length of 29

175 (74.47%) positive label were predicted to be negative
Samples:


'Hey \nFuck you, you Russia-loving cunt.'

"You are such a fucking faggot, there's no need for references for something that can't really be reliably cited. Even if there was a sociological study on people who have APD, coke drinkers, and cocaine users, it wouldn't be very reliable because there's no way to tell if people are lying. Fuck off you beaurocratic official peice of pedophile, and that citing and copyright shit has got to go. I refused to make bibliographies in high school for one reason: I never claimed that the work was mine, and I shouldn't have to claim who's it was because it's self-explanitory that I didn't make it since I didn't claim to have done so."

'Mate, sound like you are jewish\n\nGayness is in the air'

'To the faggot \n\nCoffee Expliciton'

'"\n\nSupport for that any many more similar statements is ALL OVER the Iranian media, including the ones who are not in line with Ahmadinejad from a political or ideological point of view.  Playing the retarded trick of ""bring me a verifiable quote"" for every sentence that you don\'t like or don\'t want, and even when a quote is provided, deleting it for lame excuses makes you look like, god forbid, a Jew.  Are you a dirty Jew by any chance?   "'


60 (25.53%) negative label were predicted to be positive
Samples:


'Your stupid greeting \n\nScrew that stupid greeting from you and get lost you nigger!'

'Ass rape \n\nAre you in an ass raping relationship with Splash15hotel? Fag semen and whore feces will eventually choke you!!!!'

'i live with my mom and im a gay fag that lives in england'

'answer: Gay people like me.'

'/* Im gay and im Proud */  \n\nListen every one i recently came out of the closet and came out as a homosexual and i have just told my parents this ....HOOORRAY BEER!!!!!!!!\n\n The VaNdAl'





### 4.2.7 Overview

In [17]:
overview.astype(int)

               toxic  severe_toxic  obscene  threat  insult  identity_hate
label‰ of all     95            10       52       2      49              7
total wrong     1134           300      571      89     799            235
P->N             850           219      257      85     438            175
N->P             284            81      314       4     361             60
P->N %            74            73       45      95      54             74
N->P %            25            27       54       4      45             25
avg len           35            25       38      27      28             29
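
The error counts in the overview can be turned into per-class recall and precision. The sketch below is an added illustration: the number of positive validation samples per class is not printed anywhere in the notebook, so it is reconstructed here by multiplying the validation label rates from section 1.2 by the 31915 validation samples and rounding.

```python
import numpy as np

class_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
total_valid = 31915

# Validation label rates from section 1.2 and error counts from the overview.
positive_rate = np.array([0.095598, 0.010183, 0.052138, 0.002883, 0.049444, 0.007896])
false_negative = np.array([850, 219, 257, 85, 438, 175])  # P->N
false_positive = np.array([284, 81, 314, 4, 361, 60])     # N->P

# Reconstructed count of positive samples per class (an approximation).
positives = np.rint(positive_rate * total_valid)
true_positive = positives - false_negative

recall = true_positive / positives
precision = true_positive / (true_positive + false_positive)

for name, r, p in zip(class_names, recall, precision):
    print("%-14s recall %.2f  precision %.2f" % (name, r, p))
```

Under this reconstruction, recall for `threat` is below 0.1 and recall for `severe_toxic` and `identity_hate` is roughly one third, which the global accuracy figure hides.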
