In this notebook, we test the model from the RCNN performance analysis notebook, but this time with only a convolutional layer: the GRU layer has been replaced with a dense layer with a similar number of parameters. The goal is to see whether the recurrent layer actually improved the model's performance.

# 1. Preparation
We first need to import the required libraries, download the NLTK resources, and load the data into memory.

## 1.1 Import

In [1]:
print('Importing required packages...')

from IPython.display import clear_output
import re
import pandas as pd
import numpy as np
np.random.seed()
import matplotlib.pyplot as plt
import nltk
from nltk.tokenize import TweetTokenizer
nltk.download('stopwords')
from nltk.stem.wordnet import WordNetLemmatizer 
nltk.download('wordnet')
from keras.preprocessing import sequence
from keras.preprocessing import text as ktxt
from keras.models import Sequential
from keras.layers import Dense, Embedding, GRU, Dropout, GlobalMaxPooling1D 
from keras.layers.convolutional import Conv1D, MaxPooling1D
from keras.callbacks import EarlyStopping
from sklearn.model_selection import train_test_split
from sklearn import metrics


def hint(message):
    """
    erase previous ipynb output and show new message
    """
    clear_output()
    print(message)

  

Importing required packages...
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ChuanLi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ChuanLi\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Using TensorFlow backend.


## 1.2 Loading the Data

In [2]:
hint('loading data...')
train = pd.read_csv('data/train.csv')
train, valid = train_test_split(train, test_size=0.2)

labels = [
    'toxic', 
    'severe_toxic', 
    'obscene', 
    'threat', 
    'insult', 
    'identity_hate'
]

Ytr = train[labels].values
Yva = valid[labels].values

hint('Label distribution between training and validation set:')
print(pd.DataFrame({
    'label': labels,
    'train': [np.mean(train[label]) for label in labels],
    'validation' : [np.mean(valid[label]) for label in labels],
}))

Label distribution between training and validation set:
           label     train  validation
0          toxic  0.095898    0.095629
1   severe_toxic  0.009996    0.009995
2        obscene  0.052978    0.052828
3         threat  0.002953    0.003165
4         insult  0.049297    0.049632
5  identity_hate  0.008805    0.008805


# 2. Pre-processing the Input
There are many ways to pre-process the raw strings into valid input for the model. Here we build a dictionary from all the comments in the training set, map each word to its index in the dictionary, and pad/crop the resulting sequences so that they all have the same length.
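
The dictionary-and-padding idea can be sketched without Keras. The helper names below (`build_word_index`, `to_padded_sequence`) are hypothetical and only illustrate the mechanics. The notebook itself relies on the Keras `Tokenizer` and `pad_sequences`, whose defaults pad and crop at the front of a sequence, which is what this sketch imitates. It splits on whitespace only and does no punctuation filtering.

```python
from collections import Counter

def build_word_index(texts, vocab_max):
    """Rank words by frequency; index 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.split())
    # Keep the (vocab_max - 1) most frequent words so every index is < vocab_max.
    ranked = [word for word, _ in counts.most_common(vocab_max - 1)]
    return {word: i + 1 for i, word in enumerate(ranked)}

def to_padded_sequence(text, word_index, maxlen):
    """Map words to indexes, drop unknown words, then pad/crop at the front."""
    seq = [word_index[w] for w in text.split() if w in word_index]
    seq = seq[-maxlen:]                      # crop: keep the last maxlen items
    return [0] * (maxlen - len(seq)) + seq   # pad: zeros go in front

texts = ["you be wrong", "you be very very wrong", "thank you"]
word_index = build_word_index(texts, vocab_max=5)
padded = [to_padded_sequence(t, word_index, maxlen=4) for t in texts]
for row in padded:
    print(row)
```

In this toy corpus the rarest word (`thank`) falls outside the vocabulary and is simply dropped, which mirrors what happens to rare words when `vocab_max` is smaller than the number of distinct words.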

## 2.1 Cleaning Input

In [3]:
tkzr = TweetTokenizer(preserve_case=False)
eng_stopwords = (
    'what', 'which', 'who', 'whom', 
    'this', 'that', 'these', 'those', 
    'am', 'is', 'are', 'was', 'were', 
    'be', 'been', 'being', 
    'have', 'has', 'had', 'having', 
    'do', 'does', 'did', 'doing', 
    'a', 'an', 'the', 
    'and', 'but', 'if', 'or', 
    'because', 'as', 'until', 'while', 
    'of', 'at', 'by', 'for', 'with', 
    'about', 'against', 'between', 
    'into', 'through', 'during', 'before', 'after', 
    'above', 'below', 'to', 'from', 
    'up', 'down', 'in', 'out', 'on', 'off', 
    'over', 'under', 'again', 'further', 
    'then', 'once', 'here', 
    'there', 'when', 'where', 'why', 
    'how', 'all', 'any', 'both', 'each', 
    'few', 'more', 'most', 'other', 'some', 
    'such', 'no', 'nor', 'not', 'only', 
    'own', 'same', 'so', 'than', 'too', 'very', 
    'can', 'will', 'just', 'don', 'should', 'now'
)
lmtzr = WordNetLemmatizer()
appos = {
    "aren't" : "are not",
    "can't" : "cannot",
    "couldn't" : "could not",
    "didn't" : "did not",
    "doesn't" : "does not",
    "don't" : "do not",
    "hadn't" : "had not",
    "hasn't" : "has not",
    "haven't" : "have not",
    "he'd" : "he would",
    "he'll" : "he will",
    "he's" : "he is",
    # "i'd" is ambiguous ("I would" / "I had"); a dict keeps only one value per key
    "i'd" : "I had",
    "i'll" : "I will",
    "i'm" : "I am",
    "isn't" : "is not",
    "it's" : "it is",
    "it'll":"it will",
    "i've" : "I have",
    "let's" : "let us",
    "mightn't" : "might not",
    "mustn't" : "must not",
    "shan't" : "shall not",
    "she'd" : "she would",
    "she'll" : "she will",
    "she's" : "she is",
    "shouldn't" : "should not",
    "that's" : "that is",
    "there's" : "there is",
    "they'd" : "they would",
    "they'll" : "they will",
    "they're" : "they are",
    "they've" : "they have",
    "we'd" : "we would",
    "we're" : "we are",
    "weren't" : "were not",
    "we've" : "we have",
    "what'll" : "what will",
    "what're" : "what are",
    "what's" : "what is",
    "what've" : "what have",
    "where's" : "where is",
    "who'd" : "who would",
    "who'll" : "who will",
    "who're" : "who are",
    "who's" : "who is",
    "who've" : "who have",
    "won't" : "will not",
    "wouldn't" : "would not",
    "you'd" : "you would",
    "you'll" : "you will",
    "you're" : "you are",
    "you've" : "you have",
    "'re": " are",
    "wasn't": "was not",
    "we'll": "we will"
}

def preprocess(comment):
  
    # credit to the author of this post:
    # https://www.kaggle.com/jagangupta/stop-the-s-toxic-comments-eda

    # remove newline-plus-tab sequences
    comment = re.sub(r'\n\t', '', comment)

    # replace IP addresses with a placeholder token
    comment = re.sub(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', ' specipaddress ', comment)

    # replace usernames with a placeholder token
    comment = re.sub(r"\[\[User.*\]", ' specusername ', comment)
    comment = re.sub(r"\[\[User.*\|", ' specusername ', comment)

    # tokenization (the tokenizer also lowercases, see preserve_case=False)
    tokens = tkzr.tokenize(comment)

    # apostrophe replacement: expand the contractions listed in appos
    tokens = [appos.get(token, token) for token in tokens]

    # remove stopwords
    tokens = [token for token in tokens if token not in eng_stopwords]

    # lemmatization, treating every token as a verb
    tokens = [lmtzr.lemmatize(token, 'v') for token in tokens]

    return " ".join(tokens)
  

hint('Cleaning train set...')
Xtr = train['comment_text'].apply(lambda c: preprocess(c))
hint('Cleaning validation set...')
Xva = valid['comment_text'].apply(lambda c: preprocess(c))
hint('Done')

Done
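
To make the order of the cleaning steps concrete, here is a simplified sketch that uses only the standard library. It is not the notebook's `preprocess`: the tiny `contractions` and `stopwords` collections are hypothetical stand-ins for `appos` and `eng_stopwords`, lowercasing plus `str.split` replaces the `TweetTokenizer`, and lemmatization is omitted.

```python
import re

# Hypothetical, much smaller stand-ins for `appos` and `eng_stopwords`.
contractions = {"aren't": "are not", "i'm": "I am"}
stopwords = {"are", "the", "not", "at"}

def clean(comment):
    # Mask IPv4-looking strings with a placeholder token.
    comment = re.sub(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', ' specipaddress ', comment)
    # Assumption: lowercase + whitespace split instead of TweetTokenizer.
    tokens = comment.lower().split()
    # Expand contractions; each expansion stays one multi-word token.
    tokens = [contractions.get(t, t) for t in tokens]
    # The stopword filter compares whole tokens.
    tokens = [t for t in tokens if t not in stopwords]
    return " ".join(tokens)

cleaned = clean("You aren't the admin at 69.228.251.134")
print(cleaned)
```

Because contractions are expanded before the stopword filter and each expansion is a single token, an expansion such as `are not` is not matched by the filter even though `are` and `not` are both stopwords. In this sketch, negations that come from contractions therefore survive the cleaning.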


## 2.2 Transforming Comments to Sequences

In [4]:
vocab_max = 20000

hint('Fitting the tokenizer...')
tokenizer = ktxt.Tokenizer(num_words=vocab_max)
tokenizer.fit_on_texts(Xtr)

hint('Tokenizing...')
Xtr = tokenizer.texts_to_sequences(Xtr)
Xva = tokenizer.texts_to_sequences(Xva)

hint('padding the sequences...')
max_comment_length = 200  # padded/cropped comment length
Xtr = sequence.pad_sequences(Xtr, maxlen=max_comment_length)
Xva = sequence.pad_sequences(Xva, maxlen=max_comment_length)

hint('Done')

Done


# 3. Training Model
Note that the GRU layer has been replaced with a dense layer.

In [6]:
model = Sequential()
model.add(Embedding(vocab_max, 100, input_length=max_comment_length))
model.add(Dropout(0.2))
model.add(Conv1D(filters=64, kernel_size=3, padding='same', activation='relu'))
model.add(GlobalMaxPooling1D())
model.add(Dense(64, activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(Dense(len(labels), activation='sigmoid'))
model.compile(
    optimizer='adam', 
    loss='binary_crossentropy', 
    metrics=['accuracy']
)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 200, 100)          2000000   
_________________________________________________________________
dropout_1 (Dropout)          (None, 200, 100)          0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 200, 64)           19264     
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 64)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 64)                4160      
_________________________________________________________________
dense_2 (Dense)              (None, 16)                1040      
_________________________________________________________________
dense_3 (Dense)              (None, 6)                 102       
Total para
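
As a sanity check, the per-layer parameter counts in the summary can be reproduced by hand from the layer shapes. The helper functions below are hypothetical names introduced for illustration; the formulas are the standard weight-plus-bias counts for these layer types.

```python
def embedding_params(vocab, dim):
    # One dim-sized vector per vocabulary entry, no bias.
    return vocab * dim

def conv1d_params(kernel_size, in_channels, filters):
    # Each filter has kernel_size * in_channels weights plus one bias.
    return (kernel_size * in_channels + 1) * filters

def dense_params(n_in, n_out):
    # Each output unit has n_in weights plus one bias.
    return (n_in + 1) * n_out

counts = {
    'embedding_1': embedding_params(20000, 100),
    'conv1d_1': conv1d_params(3, 100, 64),
    'dense_1': dense_params(64, 64),
    'dense_2': dense_params(64, 16),
    'dense_3': dense_params(16, 6),
}
total = sum(counts.values())
for name, n in counts.items():
    print("%-12s %9d" % (name, n))
print("%-12s %9d" % ('sum', total))
```

The embedding accounts for almost all of the parameters, so the layers that replaced the GRU contribute only a few thousand weights.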

Now training the model.

In [7]:
epochs = 2  # matches the training log below (Epoch 1/2, Epoch 2/2)
batch_size = 64

history = model.fit(
    Xtr, Ytr, 
    epochs=epochs, 
    batch_size=batch_size,
    validation_data=(Xva, Yva)
)

Train on 127656 samples, validate on 31915 samples
Epoch 1/2
Epoch 2/2


Making predictions on the validation set.

In [8]:
hint("Making prediction...")
Yva_ = model.predict(Xva)
hint("Done")

Done


# 4. Result Analysis
## 4.1 Global Accuracy

In [9]:
total_sample = Xva.shape[0]
print("validation set sample count: %d\n" % total_sample)
prediction_total = total_sample*Yva.shape[1]
best_t = None
best_accuracy = 0
for t in [i*0.1 for i in range(1, 10)]:
    accuracy = np.sum(Yva == (Yva_ >= t))/prediction_total
    if accuracy > best_accuracy: 
        best_t = t
        best_accuracy = accuracy
    print("accuracy for threshold %.1f: %.2f%%" % (t, accuracy*100))
Yva_T = Yva_ >= best_t
correct = Yva == Yva_T
print("\nbest threshold: %.1f" % best_t)
print("best accuracy: %.2f%%" % (best_accuracy*100))

validation set sample count: 31915

accuracy for threshold 0.1: 96.32%
accuracy for threshold 0.2: 97.41%
accuracy for threshold 0.3: 97.90%
accuracy for threshold 0.4: 98.12%
accuracy for threshold 0.5: 98.21%
accuracy for threshold 0.6: 98.22%
accuracy for threshold 0.7: 98.15%
accuracy for threshold 0.8: 97.99%
accuracy for threshold 0.9: 97.65%

best threshold: 0.6
best accuracy: 98.22%
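
Because the labels are heavily imbalanced, it helps to compare these accuracies with a trivial baseline that predicts every label as negative. The sketch below derives that baseline from the validation column of the label distribution table in section 1.2; the variable names are introduced here for illustration.

```python
# Validation positive rates, copied from the label distribution table in 1.2.
valid_positive_rate = {
    'toxic': 0.095629,
    'severe_toxic': 0.009995,
    'obscene': 0.052828,
    'threat': 0.003165,
    'insult': 0.049632,
    'identity_hate': 0.008805,
}

# Global accuracy averages over all (sample, label) cells, and every label has
# the same number of samples, so an all-negative predictor is wrong exactly
# on the positive cells.
mean_positive_rate = sum(valid_positive_rate.values()) / len(valid_positive_rate)
baseline_accuracy = 1 - mean_positive_rate
print("all-negative baseline accuracy: %.2f%%" % (baseline_accuracy * 100))
```

Under these numbers the baseline is roughly 96.3%, so the gap between it and the best accuracy above is the part that reflects what the model has actually learned.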


## 4.2 Accuracy by Classes

In [10]:
overview = pd.DataFrame(index=[
    'label‰ of all',
    'total wrong', 
    'P->N', 
    'N->P', 
    'P->N %', 
    'N->P %',
    'avg len',
])

def analyze_class(i):
    wrong = valid[correct[:, i] != 1]
    total_class_error = len(wrong)
    print("%d predicted incorrectly (%.2f%% of all samples)" % (
        total_class_error, 
        100*total_class_error/total_sample
    ))
        
    wrong_seqs = Xva[correct[:, i] != 1]
    lens = [ len(seq[seq != 0]) for seq in wrong_seqs]
    avg_len = np.mean(lens)
    print("Falsely predicted sequences have an average length of %d" % avg_len)

    PpN = valid[(Yva[:, i] == 1) & (Yva_T[:, i] == 0)]
    PpN_count = len(PpN)
    print("\n%d (%.2f%%) positive label were predicted to be negative" % (
        PpN_count, 
        100*PpN_count/total_class_error 
    ))
    if PpN_count > 4:
        print("Samples:")
        for sample in PpN.sample(5)['comment_text']:
            display(sample)
  
    NpP = valid[(Yva[:, i] == 0) & (Yva_T[:, i] == 1)]
    NpP_count = len(NpP)
    print("\n%d (%.2f%%) negative label were predicted to be positive" % (
        NpP_count, 
        100*NpP_count/total_class_error 
    ))
    if NpP_count > 4:
        print("Samples:")
        for sample in NpP.sample(5)['comment_text']:
            display(sample)
  
    overview[labels[i]] = [
        np.mean(Yva[:, i]*1000),
        total_class_error, 
        PpN_count,  
        NpP_count,
        100*PpN_count/total_class_error,
        100*NpP_count/total_class_error,
        avg_len
    ]
  
    print('\n')
  

### 4.2.1 Toxic

In [11]:
analyze_class(0)

1221 predicted incorrectly (3.83% of all samples)
Falsely predicted sequences have an average length of 35

821 (67.24%) positive label were predicted to be negative
Samples:


'"\n\nYou invalidate my contribution and then ask for more. If I wasn\'t being civil, I\'d say ""go #@%& yourself, Shii."" This is for you; [Let\'s Get Stupid About 2012 http://www.geocities.com/heartystar/2012]"'

"Oh cool, you deleted other part, what is next? You will delete the whole article too? Good job! Can block me again, it's just that you can do. I-D-I-O-T"

'Binksternet is an anti-Semitewho loves telling lies about Jews and who supports Islamic terrorism. \n\nBinksternet is an anti-Semitewho loves telling lies about Jews and who supports Islamic terrorism.'

'"\n\nNo, YOU ""shut up already"". The only ""consensus"" here is the one of your gang  which has hijacked this article to whitewash police actions blocking anyone who counters this disgusting propaganda. Are you guys getting paid by some police department to this? If I find out you have, I\'ll make sure all of you are banned from wikipedia forever. 69.228.251.134  "'

'Yeah, the only more obvious case I can think of is the gay one from S Club 7'


400 (32.76%) negative label were predicted to be positive
Samples:




'"\n\nThank you for experimenting with the page Fuck on Wikipedia. Your test worked, and has been reverted or removed. Please use the sandbox for any other tests you want to do. Take a look at the welcome page if you would like to learn more about contributing to our encyclopedia.   (talk) "'

"You're the Vandal!\n\nLeave my good buddies Tyar and Stopdroproll alone. Just because they live near each other doesn't mean they are the same hupan!"

'"\nAnd not arguing semantics is the last defence of someone who knows they are wrong or won\'t admit they are wrong. If I understand AGK correctly, he believes I was attempting to be malicious. He assumed I was saying something that I wasn\'t trying to. Instead of saying ""it was not constructive"" (which is untrue and not blockable) he backed his block by implying that I meant something I did not. There are no semantics about it. AGK simply does not believe me when I say it was simple wordplay. So how about you go tell everyone in that section that it was not collaborative? It obviously was since they responded. AGK is the one who had a knee-jerk reaction to it. All that was needed was ""Dude, \'cancer\' can be taken the wrong way. Strike it or get blocked."" Instedd he waited until HOURS after I had edited to make a 24 hour block. So don;t argue semantics with me: Jusitify your poor reaction. Not punitive my ass.  01:30"'







### 4.2.2 Severe Toxic

In [12]:
analyze_class(1)

315 predicted incorrectly (0.99% of all samples)
Falsely predicted sequences have an average length of 25

308 (97.78%) positive label were predicted to be negative
Samples:


'"\n\nhey fuck you, who the fuck checks ""Phnom Nam Lyr Wildlife Sanctuary"" for vandalism. Get a fucking life you turd bag"'

'I FIND YOU TO BE A FUCKING USELESS POS\nGO FUCK YOURSELF'

'hey dickbreath \n\nur retarded'

"gay fucking faggot who has no life and no job and spends his whole shitty fucking life on wikipedia feeling powerful because he can ban and block peoples ips. hes a basement dwelling cunt. Not that i'm suprised however if I had a face like that I would stay indoors too."

'FUCK YOU KID, Get the Fuck out of my country.'


7 (2.22%) negative label were predicted to be positive
Samples:


'I DONT CARE WHAT THE FUCK U DO!! \n\nFUCK you u Racist ass pieces of Shits! Imma let everyone know what wikipedia really is!!! Fuck off Rcaist fuckers!!!! Bitches'

'Fuck of \n\nYou fuck of...212.253.147.120'

'fuck you, you god damn fucking cocksucker. you fucking fuck yourself pan dan.'

'Fuck u bitch \n\nFuck u Bitch'

'Notrhbysouthbanof sucks cocks \nNotrhbysouthbanof sucks cocks \nNotrhbysouthbanof sucks cocks \nNotrhbysouthbanof sucks cocks \nNotrhbysouthbanof sucks cocks \nNotrhbysouthbanof sucks cocks \nNotrhbysouthbanof sucks cocks \nNotrhbysouthbanof sucks cocks \nNotrhbysouthbanof sucks cocks \nNotrhbysouthbanof sucks cocks \nNotrhbysouthbanof sucks cocks \nNotrhbysouthbanof sucks cocks \nNotrhbysouthbanof sucks cocks \nNotrhbysouthbanof sucks cocks \nNotrhbysouthbanof sucks cocks \nNotrhbysouthbanof sucks cocks \nNotrhbysouthbanof sucks cocks \nNotrhbysouthbanof sucks cocks \nNotrhbysouthbanof sucks cocks \nNotrhbysouthbanof sucks cocks \n\nAnthony Bradbury sucks cock and eats shit, Notrhbysouthbanof sucks cocks \nNotrhbysouthbanof sucks cocks \nNotrhbysouthbanof sucks cocks \nNotrhbysouthbanof sucks cocks \nNotrhbysouthbanof sucks cocks \nNotrhbysouthbanof sucks cocks \nNotrhbysouthbanof sucks cocks \nNotrhbysouthbanof sucks cocks \nNotrhbysouthbanof sucks cocks \nNotrhbysouthbanof sucks co





### 4.2.3 Obscene

In [13]:
analyze_class(2)

621 predicted incorrectly (1.95% of all samples)
Falsely predicted sequences have an average length of 38

415 (66.83%) positive label were predicted to be negative
Samples:


'Do not tell me what to do, you shitbag.'

'"\n\nListen asshole, I\'m leaving this on my page so as not disrupt the integrity of yours - I\'ve had it with your constant pestering and what I believe is referred to as ""wikistalking."" This entire matter has been solved and concluded however, you seem to be enjoying yourself by way of acting like a goddamned three-year-old with a wet diaper. I have no interest in engaging in tit-for-tats with some jackass who has nothing better to do with his time than needle at users who settled a matter amicably and have moved on. Cut it out, leave me alone, and move on. \n\nFeel free to erase this when you\'re done reading it."'

"No Picture? \n\nWith such a heinous murder and rape spree, why isn't there a picture of the thug? His spiteful, bastard face must be seen to have the full effect: Mugshot\n\nAlso, it should be mentioned somewhere, maybe in a Controversies section, that the Oakland black community and black activist groups actually came to Mixon's defense- calling him a soldier, a hero, and a victim: Source 1 , Source 2 50.29.10.210"

'Oh, blow it out yer ear, you prententious little be-otch.'

'"""John Henry Haines was born on April 20, 1420, in Houston, Texas to missionary parents.""  WTF does this have to do with this article?\n\n"'


206 (33.17%) negative label were predicted to be positive
Samples:


'More lies and impotent sucking up.  Typical (but amusing nonetheless).'

'ha ha i shoulda known u really dont care bout the rules u just want n excuse to ban ne1 who doesnt kiss ass to admins'

'HATE YOU DO NOT EVER TRY TO DELETE STEPHANIE POOLE. STOP YOUR DUMB WEENIE LIKE TENDACIES!'

'"\n LOOOOOOOOOOOOOOOL. JUST BECAUSE YOU EDIT ON WIKIPEDIA FOR FOUR YEARS NOW, DOESN\'T MEAN I ASSUME YOU\'RE HAVING FUN. YOU PROBABLY HAVE NO LIFE BECAUSE YOU DO THIS AS A ""LIVING"". ITS SO FUNNY HOW YOU GET BUTTHURT OVER THE STUPIDEST SHIT. LIKE YOU SAY THAT EXCESSIVE INFO ABOUT ""AWARDS"" IS RECKLESS. WELL MAYBE YOU SHOULD GET AN AWARD FOR BEING AN ASSHOLE ABOUT IT. MAYBE YOU HAVE SOME CHILL IN YOUR LIFE THEN YOU CAN STFU AND HAVE DIFFERENT OPINIONS LIKE ME. I DONT CARE WHAT YOU AND OTHER GUY THINKS. SO SORRY, YOU DO YOU. I DO ME."'

"Dog -n- Suds article \n\nI'm sorry for coming off as such an asshole on the edit to dog n suds.  I was pissed off that day because I got 3 people threatening to ban me over retarded shit, I was mostly just being a smartass and was expecting it get reverted back and threatened with yet another banning.   I can't seem to do shit right here. aw well fuck em. Just wanted to say sorry  (    )"





### 4.2.4 Threat

In [14]:
analyze_class(3)

101 predicted incorrectly (0.32% of all samples)
Falsely predicted sequences have an average length of 28

101 (100.00%) positive label were predicted to be negative
Samples:


'"\n\n""I think we should kill all the jews and homosexuals and handicaped people."", (said Adolf Hitler)."'

"Shalom \n\nSemite, get the fuck out of here. I will kill you, you son of a bitch if you don't leave wikipedia. If you tell anyone I said this to you, I will rape your jew family. Choke on a bagel, you faggoty hymie cock sucker. You better fucking die, before I hunt you down and piss on your face. If you get me blocked for this, I will find you and shoot you, I just want your jew fucking self gone, you dumb motherfucking piece of shit g-d damn jew. Die! Shalom. We Came In?"

"I don't quite understand why you are so strongly in favour of censorship. You see, the issue is that I happen to be an expert on this subject. You are controlling information, by restricting it, and this is homosexual behaviour which I will be more than happy to rape you to death for."

'-good i hope you die inshallah'

'Urgent!!!! \n\nMind yo own fucken buisiness stop fucken changing my shit fuck you!!! eat a mutha fucken dick. Ill beat yo ass come find me nigga i go here!'


0 (0.00%) negative label were predicted to be positive




### 4.2.5 Insult

In [15]:
analyze_class(4)

875 predicted incorrectly (2.74% of all samples)
Falsely predicted sequences have an average length of 29

674 (77.03%) positive label were predicted to be negative
Samples:


'Get out my talk page you smelling wog!'

'i humped shadw jams mum'

'turkish people \n\npls stop editing turkic related topics, you dont know anything about turkic people and all i want to say that, anatolianism is completely bullshit and has nothing about turkish pple.'

'"\nREADY TO FUCK YOU UP OLD SCHOOL """"\nSee the inviting place with it\'s friendly and fair administrators\nready to give you an olde time wiki welcome.\nSee the how this admin feels about wikilov\n\nThis admin , a Democrat cyber thug, banns me due to differences in writing style.\n\nGamaliel uses this place as a sword for his own power drunk political agenda.\n\nGamaliel have absolutely no evidence to say I\'m a ""sock of joehazleton"" other then \nThe asinine ""duck test""  I would call the duck test an litmus test of\nany editor, which edits on political bio.  If you don\'t follow the\nGamaliel\'s radical democrat ideology test, you are then labeled\na sock of Joehazelton. \n\nThis is outrages, an abuse of power and due process as it shows wikipeidia unfit \nto to have any bios on it and I would encourage any with a wiki bio to \ninvestigate their legal options and force wikipieida to removed their biography from this website."'

'Attn IMROY: if I want your stinky opinion I would have flushed three times pal!'


201 (22.97%) negative label were predicted to be positive
Samples:


"User talk:24.63.62.169\nI am glad you were able to take time away from fapping to jimbo porn to protect my talk page.  Don't bother responding to this I won't read what you write.  I don't read stuff written by faggots because I don't want to get aids."

'not a crime when everybody can edit it ;)\nhave a nice night guys, keep that stick up your ass, heros, it looks good on yas.'

"Dear AuburnPilot \n\nI'll just come back with a different IP address you stupid son of a bitch.  You're a fucking moron."

'But keep in mind I am a gay Jew.'

'i enjoy licking penises and watching the semen flow out'





### 4.2.6 Identity Hate

In [16]:
analyze_class(5)

281 predicted incorrectly (0.88% of all samples)
Falsely predicted sequences have an average length of 26

281 (100.00%) positive label were predicted to be negative
Samples:


"he's not an editor \n\nhe's a nigger"

'Oh yes, john let us all join the hate fest against turks. Remember Turks are evil and Turkiye should be nuked...Yesssssss'

"edit filter disturbs i can't add a gay userbox to your userpage"

'BlackKite being a faggot'

'nigga wit attitude. \n\nyour such a cunt.'


0 (0.00%) negative label were predicted to be positive




### 4.2.7 Overview

In [17]:
overview.astype(int)

               toxic  severe_toxic  obscene  threat  insult  identity_hate
label‰ of all     95             9       52       3      49              8
total wrong     1221           315      621     101     875            281
P->N             821           308      415     101     674            281
N->P             400             7      206       0     201              0
P->N %            67            97       66     100      77            100
N->P %            32             2       33       0      22              0
avg len           35            25       38      28      29             26
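
One way to read the overview is to turn the P->N counts into per-class recall, that is, the share of truly positive samples the model caught. The sketch below is a derived calculation, not an output of the notebook: it estimates the number of positives per class from the validation rates in section 1.2 and the validation sample count, then subtracts the missed positives.

```python
total_sample = 31915  # validation set sample count from section 4.1

# Validation positive rates from the label distribution table in 1.2.
valid_rate = {
    'toxic': 0.095629, 'severe_toxic': 0.009995, 'obscene': 0.052828,
    'threat': 0.003165, 'insult': 0.049632, 'identity_hate': 0.008805,
}
# P->N row of the overview: positives predicted as negative.
missed = {
    'toxic': 821, 'severe_toxic': 308, 'obscene': 415,
    'threat': 101, 'insult': 674, 'identity_hate': 281,
}

recall = {}
for label, rate in valid_rate.items():
    positives = round(rate * total_sample)  # estimated positive count
    recall[label] = (positives - missed[label]) / positives
    print("%-14s positives=%5d  recall=%.3f" % (label, positives, recall[label]))
```

Read this way, the classes where P->N makes up 100% of the errors (threat and identity hate) are classes where no positive sample was detected at the chosen threshold, even though their overall accuracy looks high.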


# 5. Observation and Conclusion