# Toxic Comment Classification
## Part3: More Cleaning

In Part1, I showed that Tfidf + logistic regression makes good predictions for toxic comment classification, but I also found bad words hidden inside disguised words, which had zero coefficients in the logistic regression model. In Part2, I tried more complex models (GloVe word embeddings with an LSTM) as well as simple ensemble methods; these outperformed the logistic regression. In this part (Part3), I clean up the problematic words found in Part1 and check whether the simple logistic regression improves as a result.

Before cleaning the comments further, I first show that increasing the vocabulary size (max_features) from 10,000 to 20,000 helps, and then run a coefficient analysis for each type of toxicity (Part1 only showed the coefficient analysis for the 'toxic' and 'identity_hate' types).

1. <a href='#Section1'>Tfidf + Logistic Regression with Increased max_features</a>
2. <a href='#Section2'>Coefficient Analysis</a>
3. <a href='#Section3'>Tfidf + Logistic Regression after More Cleaning</a>

In [1]:
# import all necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from nltk.stem.snowball import SnowballStemmer
from sklearn.pipeline import Pipeline
import string
import re

In [2]:
# Load datasets
train = pd.read_csv('train.csv') #training set
test = pd.read_csv('test.csv') #test set

In [3]:
train.head()

| | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|---|
| 0 | 0000997932d777bf | Explanation\nWhy the edits made under my usern... | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 000103f0d9cfb60f | D'aww! He matches this background colour I'm s... | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 000113f07ec002fd | Hey man, I'm really not trying to edit war. It... | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0001b41b1c6bb37e | "\nMore\nI can't make any real suggestions on ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0001d958c54c6e35 | You, sir, are my hero. Any chance you remember... | 0 | 0 | 0 | 0 | 0 | 0 |


In [4]:
train.describe()

| | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|
| count | 159571.0 | 159571.0 | 159571.0 | 159571.0 | 159571.0 | 159571.0 |
| mean | 0.095844 | 0.009996 | 0.052948 | 0.002996 | 0.049364 | 0.008805 |
| std | 0.294379 | 0.099477 | 0.223931 | 0.05465 | 0.216627 | 0.09342 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [7]:
# Get comments only (pandas series) 
train_comment = train['comment_text']
test_comment = test['comment_text']

<a id= 'Section1'></a>
## 1. Tfidf + Logistic Regression with Increased max_features

### Stemming

In [8]:
# stemming function
def text_preprocessing(text):
    #remove punctuations
    text_string = re.sub(r'[^\w\s]','',text)
    #split text into words
    word_list = text_string.split()
    #apply stemmer and combine them again
    stemmer= SnowballStemmer("english")
    words = ""
    for word in word_list:
        if word != "":
            word_stemmed = stemmer.stem(word)
            words += word_stemmed + " "
    return words

In [9]:
# Check original comments in the training set
train_comment.head()

0    Explanation\nWhy the edits made under my usern...
1    D'aww! He matches this background colour I'm s...
2    Hey man, I'm really not trying to edit war. It...
3    "\nMore\nI can't make any real suggestions on ...
4    You, sir, are my hero. Any chance you remember...
Name: comment_text, dtype: object

In [10]:
%%time
# stemming all comments in the training set
train_comment_stemmed = train_comment.apply(text_preprocessing)

Wall time: 1min 46s


In [11]:
train_comment_stemmed.head()

0    explan whi the edit made under my usernam hard...
1    daww he match this background colour im seem s...
2    hey man im realli not tri to edit war it just ...
3    more i cant make ani real suggest on improv i ...
4    you sir are my hero ani chanc you rememb what ...
Name: comment_text, dtype: object

In [12]:
%%time
# stemming all comments in the test set
test_comment_stemmed = test_comment.apply(text_preprocessing)

Wall time: 1min 31s


In [13]:
test_comment_stemmed.head()

0    yo bitch ja rule is more succes then youll eve...
1              from rfc the titl is fine as it is imo 
2                         sourc zaw ashton on lapland 
3    if you have a look back at the sourc the infor...
4                    i dont anonym edit articl at all 
Name: comment_text, dtype: object

### TfidfVectorizer with max_features 20000
A vectorizer converts a collection of text documents into a numerical (sparse) matrix with 1 row per document and 1 column per token (e.g. word). I will use the Tf-idf (term frequency inverse document frequency) vectorizer.

I increase max_features from 10,000 to 20,000, but keep all the other hyperparameters that were found to be the best in Part1.
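
To make the weighting concrete, here is a small hand-rolled sketch of what I understand the vectorizer to compute with `sublinear_tf=True` and the scikit-learn defaults (smoothed idf, L2-normalised rows). The function name `tfidf_rows` and the toy documents are made up for illustration, and the sketch ignores stop words, `max_features`, and `token_pattern`.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Toy TF-IDF: sublinear tf, smoothed idf, L2-normalised rows."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # document frequency: number of documents that contain each token
    df = Counter(tok for toks in tokenised for tok in set(toks))
    # smoothed idf (assumed to match the library default): ln((1 + n) / (1 + df)) + 1
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        # sublinear tf: 1 + ln(count) for tokens that occur, 0 otherwise
        row = [(1 + math.log(counts[tok])) * idf[tok] if counts[tok] else 0.0
               for tok in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows

vocab, rows = tfidf_rows(["you sir are my hero", "you are you", "edit war"])
print(vocab)                             # one column per token
print([round(v, 3) for v in rows[1]])    # one row per document
```

Each document becomes one row and each token one column, which is the shape of `train_feature_matrix` below (with up to 20,000 columns instead of 7).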

In [194]:
# Tfidf vectorizing
vectorizer = TfidfVectorizer(analyzer ='word', 
                             stop_words='english',
                             sublinear_tf=True, #term-freq scaling
                             strip_accents='unicode', #works generally
                             token_pattern=r'\w{1,}', #1+ char words
                             ngram_range=(1,1),
                             max_features=20000) #top frequent 20000 words

vectorizer.fit(pd.concat([train_comment_stemmed,test_comment_stemmed]))

train_feature_matrix = vectorizer.transform(train_comment_stemmed)
test_feature_matrix = vectorizer.transform(test_comment_stemmed)

In [29]:
# helper function for checking results at glance
def print_results():
    print("### {} ###".format(category))
    print()
    print("Best hyper-parameters on development set:")
    print(clf.best_params_)
    print()
    print("Grid search scores on development set:")
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.4f (+/-%0.04f) for %r" % (mean, std * 2, params))
    print()    

### Logistic Regression

In [16]:
#train.columns # just to conveniently copy and paste category names below

In [17]:
categories =['toxic', 'severe_toxic', 'obscene', 
             'threat', 'insult', 'identity_hate']

#### Tuning for C

In [30]:
%%time
for category in categories:
    train_labels = train[category]
    parameters = {'C': [.2, .5, 1, 2, 5]}
    log_reg = LogisticRegression(solver = 'sag',random_state =42)   
    clf= GridSearchCV(log_reg, parameters, scoring='roc_auc', cv=3)
    clf.fit(train_feature_matrix, train_labels)
    print_results()

### toxic ###

Best hyper-parameters on development set:
{'C': 2}

Grid search scores on development set:
0.9639 (+/-0.0042) for {'C': 0.2}
0.9683 (+/-0.0033) for {'C': 0.5}
0.9701 (+/-0.0027) for {'C': 1}
0.9707 (+/-0.0023) for {'C': 2}
0.9695 (+/-0.0019) for {'C': 5}

### severe_toxic ###

Best hyper-parameters on development set:
{'C': 0.5}

Grid search scores on development set:
0.9844 (+/-0.0025) for {'C': 0.2}
0.9851 (+/-0.0028) for {'C': 0.5}
0.9850 (+/-0.0032) for {'C': 1}
0.9844 (+/-0.0037) for {'C': 2}
0.9824 (+/-0.0043) for {'C': 5}

### obscene ###

Best hyper-parameters on development set:
{'C': 2}

Grid search scores on development set:
0.9818 (+/-0.0038) for {'C': 0.2}
0.9839 (+/-0.0036) for {'C': 0.5}
0.9848 (+/-0.0035) for {'C': 1}
0.9849 (+/-0.0034) for {'C': 2}
0.9836 (+/-0.0032) for {'C': 5}

### threat ###

Best hyper-parameters on development set:
{'C': 2}

Grid search scores on development set:
0.9767 (+/-0.0026) for {'C': 0.2}
0.9802 (+/-0.0029) for {'C': 0.5}
0

#### Fine tuning for C

In [22]:
%%time
# Tune C for each category CV=5 
parameters_dic={'toxic':{'C':[1.1,1.2,1.3,1.4,1.5, 1.6,1.7,1.8,1.9,2.0,2.1,2.2,2.3,2.4,2.5]},
                'severe_toxic':{'C':[.1,.2,.3,.4,.5,.6,.7,.8,.9,1.0]},
                'obscene':{'C':[1.1,1.2,1.3,1.4,1.5, 1.6,1.7,1.8,1.9,2.0,2.1,2.2,2.3,2.4,2.5]},
                'threat':{'C':[1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2.0,2.1,2.2,2.3,2.4,2.5]}, 
                'insult':{'C':[.6,.7,.8,.9,1.0,1.1,1.2,1.3,1.4,1.5]}, 
                'identity_hate':{'C':[.6,.7,.8,.9,1.0,1.1,1.2,1.3,1.4,1.5]}}
for category in categories:
    train_labels = train[category]
    parameters = parameters_dic[category]
    log_reg = LogisticRegression(solver='sag', random_state =42)   
    clf= GridSearchCV(log_reg, parameters, scoring='roc_auc', cv=5)
    clf.fit(train_feature_matrix, train_labels)
    print_results()

### toxic ###

Best hyper-parameters on development set:
{'C': 1.9}

Grid search scores on development set:
0.9710 (+/-0.0014) for {'C': 1.1}
0.9711 (+/-0.0013) for {'C': 1.2}
0.9711 (+/-0.0013) for {'C': 1.3}
0.9712 (+/-0.0013) for {'C': 1.4}
0.9712 (+/-0.0012) for {'C': 1.5}
0.9713 (+/-0.0012) for {'C': 1.6}
0.9713 (+/-0.0012) for {'C': 1.7}
0.9713 (+/-0.0012) for {'C': 1.8}
0.9713 (+/-0.0011) for {'C': 1.9}
0.9713 (+/-0.0011) for {'C': 2.0}
0.9713 (+/-0.0011) for {'C': 2.1}
0.9713 (+/-0.0011) for {'C': 2.2}
0.9712 (+/-0.0011) for {'C': 2.3}
0.9712 (+/-0.0010) for {'C': 2.4}
0.9712 (+/-0.0010) for {'C': 2.5}

### severe_toxic ###

Best hyper-parameters on development set:
{'C': 0.6}

Grid search scores on development set:
0.9841 (+/-0.0028) for {'C': 0.1}
0.9848 (+/-0.0028) for {'C': 0.2}
0.9852 (+/-0.0029) for {'C': 0.3}
0.9853 (+/-0.0030) for {'C': 0.4}
0.9854 (+/-0.0032) for {'C': 0.5}
0.9854 (+/-0.0033) for {'C': 0.6}
0.9854 (+/-0.0035) for {'C': 0.7}
0.9854 (+/-0.0036) for {'C':

The best values of the inverse regularization parameter C with max_features=20000 differ slightly from those with max_features=10000.
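
For reference, the L2-penalised binary logistic regression objective, as I understand the scikit-learn documentation to state it (up to an overall scaling), can be written as follows. The notation is mine: labels are coded as $y_i \in \{-1, +1\}$, $x_i$ is the Tfidf row of comment $i$, and $w$, $b$ are the coefficients and intercept.

$$
\min_{w,\,b} \; \frac{1}{2}\, w^{\top} w \;+\; C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i \left(x_i^{\top} w + b\right)\right)\right)
$$

Because C multiplies the data-fit term, a smaller C means stronger regularization. Changing the number of columns in $x_i$ changes the balance between the two terms, which is why C is re-tuned here.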

In [23]:
%%time
best_C = {'toxic':1.9,'severe_toxic':.6,'obscene':1.5,
        'threat':1.9, 'insult':1.1, 'identity_hate':1.2}
score_all = []
for category in categories:
    train_labels = train[category]
    log_reg = LogisticRegression(solver='sag',random_state =42)   
    clf= GridSearchCV(log_reg, {'C':[best_C[category]]}, scoring='roc_auc', cv=10)
    clf.fit(train_feature_matrix, train_labels)
    print("Mean CV score on dev. set for %s is %0.4f" %(category,clf.best_score_))
    score_all.append(clf.best_score_)
print()
print("Mean CV score on dev. set for All categories is %0.4f" %np.mean(score_all))

Mean CV score on dev. set for toxic is 0.9717
Mean CV score on dev. set for severe_toxic is 0.9857
Mean CV score on dev. set for obscene is 0.9857
Mean CV score on dev. set for threat is 0.9842
Mean CV score on dev. set for insult is 0.9770
Mean CV score on dev. set for identity_hate is 0.9767

Mean CV score on dev. set for All categories is 0.9802
Wall time: 2min 8s


max_features=20000 gives slightly better performance than 10000 (mean CV AUC 0.9802 vs. 0.9799); 20000 is better for all types of toxicity except 'threat'.

In [27]:
# Preparation for submission
submission = pd.DataFrame()
submission['id']=test['id']

for category in categories:
    train_labels = train[category]
    clf = LogisticRegression(solver='sag', C=best_C[category],random_state =42)   
    clf.fit(train_feature_matrix, train_labels)
    submission[category] = clf.predict_proba(test_feature_matrix)[:,1] #second column!!
    
submission.to_csv('submission_10.csv', index=False)

The only way to evaluate my model on the test set is to submit predictions to the Kaggle leaderboard (LB). The LB AUC for the test set is 0.9747 (Part1 LB AUC: 0.9745). Doubling max_features increased AUC by 0.0002, a 0.78% relative decrease in error (1 - AUC).

In [192]:
# Error change in percent
(0.9747-0.9745)/(1-0.9745)*100

0.7843137254901107

<a id= 'Section2'></a>
## 2. Coefficient Analysis

Here I look at the words with high, low (negative), and zero coefficients to see which words are most and least important in the model.
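
The three rankings can be tried on a toy example first. This is only a sketch: `feature_words_demo`, `coef_demo`, and `rank_words` are hypothetical stand-ins, and the coefficient values are invented rather than taken from the fitted model.

```python
# Toy stand-ins (hypothetical values, not taken from the fitted model)
feature_words_demo = ["thank", "idiot", "articl", "zzzxq", "stupid"]
coef_demo = [-2.1, 6.3, -1.4, 0.0, 5.8]

def rank_words(coefs, words, key=lambda c: c, reverse=False):
    """Return words ordered by key(coefficient)."""
    order = sorted(range(len(words)), key=lambda i: key(coefs[i]), reverse=reverse)
    return [words[i] for i in order]

print(rank_words(coef_demo, feature_words_demo, reverse=True)[:2])  # most positive
print(rank_words(coef_demo, feature_words_demo)[:2])                # most negative
print(rank_words(coef_demo, feature_words_demo, key=abs)[:1])       # closest to zero
```

Sorting indices by coefficient, rather than sorting (coefficient, word) pairs, means ties are not broken by comparing the words themselves.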

In [195]:
# dictionary of word:feature_index key value pairs
voca_dic = vectorizer.vocabulary_ 

In [196]:
# Make a list of words ordered by feature index
feature_words =[None]*len(voca_dic)
for word, index in voca_dic.items():
    feature_words[index]=word
#print(feature_words[:400])

In [44]:
# category: type of toxicity / num_to_check: number of words to output to check
def coeff_analysis(category, num_to_check):
    clf = LogisticRegression(solver='sag', C=best_C[category],random_state =42)   
    clf.fit(train_feature_matrix, train[category])
    coeff = clf.coef_
    #print("coefficients:\n", coeff, "\n")

    # High coefficient words
    sorted_voca = [word for _,word in sorted(zip(coeff[0],feature_words),reverse=True)]
    print("High coefficient words:\n", sorted_voca[:num_to_check])
    print()
    # Low coefficient words (negative coefficients)
    sorted_voca = [word for _,word in sorted(zip(coeff[0],feature_words))]
    print("Low (negative) coefficient words:\n", sorted_voca[:num_to_check])
    print()
    # Words with low absolute values of coefficient (just out of curiosity)
    sorted_voca = [word for _,word in sorted(zip(list(map(abs,coeff[0])),feature_words))]
    #sorted_voca = [(coef,word) for coef,word in sorted(zip(list(map(abs,coeff[0])),feature_words))]
    print("Close to zero or zero coefficient words:\n", sorted_voca[:num_to_check])

#### Toxic

In [197]:
coeff_analysis('toxic',100)

High coefficient words:
 ['fuck', 'idiot', 'stupid', 'shit', 'suck', 'bullshit', 'ass', 'asshol', 'bitch', 'dick', 'crap', 'faggot', 'peni', 'bastard', 'moron', 'pathet', 'cunt', 'nigger', 'jerk', 'fucker', 'shut', 'gay', 'dumbass', 'cock', 'hell', 'motherfuck', 'loser', 'retard', 'dumb', 'fag', 'pussi', 'liar', 'fuckin', 'racist', 'piss', 'fool', 'nazi', 'kill', 'dickhead', 'sex', 'jackass', 'hate', 'wtf', 'sick', 'damn', 'cocksuck', 'pig', 'whore', 'homosexu', 'die', 'bloodi', 'garbag', 'douchebag', 'scum', 'f', 'hypocrit', 'goddamn', 'wanker', 'fing', 'ars', 'vagina', 'fat', 'fk', 'dirti', 'disgust', 'fuckhead', 'nerd', 'fck', 'prick', 'dipshit', 'pedophil', 'fascist', 'anal', 'fart', 'ugli', 'screw', 'freak', 'rubbish', 'rape', 'donkey', 'shame', 'ridicul', 'porn', 'coward', 'fcking', 'imbecil', 'stink', 'nigga', 'monkey', 'fuk', 'lick', 'homo', 'mental', 'shitti', 'ahol', 'butt', 'mouth', 'looser', 'foolish', 'fuckwit']

Low (negative) coefficient words:
 ['thank', 'redirect', 'pl

#### Severe Toxic

In [198]:
coeff_analysis('severe_toxic',100)

High coefficient words:
 ['fuck', 'motherfuck', 'suck', 'bitch', 'asshol', 'fuckin', 'dick', 'fucker', 'shit', 'ass', 'faggot', 'die', 'cunt', 'rape', 'cock', 'cocksuck', 'nigger', 'bastard', 'kill', 'gay', 'fcking', 'fag', 'u', 'fat', 'peni', 'f', 'dumb', 'nazi', 'pig', 'stupid', 'shut', 'prick', 'whore', 'dirti', 'moron', 'piec', 'fing', 'fuke', 'filthi', 'fuk', 'mother', 'head', 'jew', 'cancer', 'mothjer', 'fucken', 'fck', 'dickhead', 'son', 'burn', 'big', 'sucker', 'retard', 'proassadhanibal911your', 'gonna', 'nerd', 'mouth', 'youfuck', 'slut', 'hell', 'pussi', 'dare', 'hate', 'butt', 'nigga', 'wanker', 'anal', 'homosexu', 'mom', 'ars', 'n', 'hairi', 'lick', 'douch', 'cum', 'rot', 'ur', 'shoot', 'urself', 'wank', 'pathet', 'splatter', 'bich', 'piss', 'loser', 'ya', 'eat', 'murder', 'fcuk', 'abus', 'damn', 'stick', 'georg', 'sex', 'dik', 'homo', 'fukin', 'ugli', 'hey', 'fuckfac']

Low (negative) coefficient words:
 ['pleas', 'articl', 'talk', 'thank', 'use', 'sourc', 'link', 'fact',

#### Obscene

In [199]:
coeff_analysis('obscene',100)

High coefficient words:
 ['fuck', 'asshol', 'ass', 'bitch', 'shit', 'bullshit', 'suck', 'dick', 'cunt', 'bastard', 'fucker', 'faggot', 'motherfuck', 'fuckin', 'pussi', 'stupid', 'cock', 'dumbass', 'peni', 'crap', 'idiot', 'damn', 'dickhead', 'cocksuck', 'fcking', 'jackass', 'nigger', 'jerk', 'wtf', 'fag', 'fk', 'fuckhead', 'moron', 'fck', 'anal', 'f', 'scum', 'ars', 'fuk', 'piss', 'whore', 'fuckwit', 'cum', 'butt', 'fucktard', 'slut', 'fing', 'lick', 'dumb', 'ahol', 'anus', 'hell', 'goddamn', 'rape', 'dipshit', 'vagina', 'sex', 'prick', 'fcuk', 'fool', 'homosexu', 'shithead', 'dumbfuck', 'ball', 'tit', 'filthi', 'fucken', 'fuke', 'fat', 'twat', 'scumbag', 'sht', 'nigga', 'masturb', 'blowjob', 'mom', 'dickfac', 'wanker', 'fking', 'shitti', 'shut', 'bloodi', 'u', 'mouth', 'asshat', 'bag', 'douch', 'n', 'piec', 'screw', 'bich', 'dirti', 'slick', 'fu', 'pig', 'boob', 'hole', 'freak', 'fggt', 'blow']

Low (negative) coefficient words:
 ['thank', 'redirect', 'articl', 'sure', 'agre', 'cheer'

#### Threat

In [200]:
coeff_analysis('threat',100)

High coefficient words:
 ['kill', 'die', 'rape', 'death', 'cut', 'burn', 'destroy', 'hunt', 'ill', 'beat', 'shoot', 'hang', 'stab', 'punch', 'hope', 'kick', 'ass', 'dead', 'knock', 'hous', 'face', 'live', 'murder', 'deserv', 'head', 'homosexu', 'hell', 'shot', 'blood', 'slowli', 'im', 'fuck', 'supertr0l', 'gonna', 'corps', 'knife', 'splatter', 'roast', 'disgust', 'watch', 'shall', 'aliv', 'ya', 'life', 'slit', 'skin', 'bloodi', 'gun', 'rvv', 'swear', 'children', 'behead', 'hate', 'smash', 'fking', 'yo', 'jew', 'pain', 'neck', 'castrat', 'fuckin', 'send', 'pull', 'traitor', 'robe', 'miseri', 'horribl', 'hit', 'track', 'come', 'dare', 'throw', 'onc', 'fat', 'hello', 'poop', 'gross', 'fuke', 'shithead', 'pineappl', 'motherfuck', 'bastard', 'throat', 'testicl', 'u', 'll', 'nazi', 'gut', 'blow', 'ki', 'stupid', 'feed', 'heil', 'rv', 'wear', 'reveng', 'wikistalk', '8d', 'laugh', 'suicid']

Low (negative) coefficient words:
 ['thank', 'articl', 'sourc', 'someon', 'pleas', 'chang', 'say', 'doe

#### Insult

In [201]:
coeff_analysis('insult',100)

High coefficient words:
 ['idiot', 'stupid', 'asshol', 'bitch', 'fuck', 'faggot', 'bastard', 'moron', 'cunt', 'ass', 'suck', 'dumb', 'jerk', 'fucker', 'dickhead', 'dick', 'retard', 'fool', 'pathet', 'loser', 'dumbass', 'scum', 'shit', 'nigger', 'motherfuck', 'fat', 'pig', 'fuckin', 'cocksuck', 'fag', 'coward', 'jackass', 'nigga', 'liar', 'scumbag', 'gay', 'slut', 'fcking', 'douchebag', 'ugli', 'fuckhead', 'whore', 'nazi', 'homosexu', 'freak', 'mouth', 'dirti', 'douch', 'prick', 'homo', 'ahol', 'maggot', 'nerd', 'imbecil', 'piec', 'mom', 'fuckwit', 'shithead', 'fucktard', 'hypocrit', 'stink', 'racist', 'pussi', 'fcuk', 'crap', 'goddamn', 'bullshit', 'f', 'smell', 'anus', 'filthi', 'wanker', 'cock', 'littl', 'twat', 'peni', 'looser', 'bulli', 'rape', 'shut', 'bag', 'hell', 'fascist', 'fing', 'damn', 'sick', 'anal', 'disgust', 'sucker', 'lazi', 'mother', 'prostitut', 'fggt', 'bloodi', 'fk', 'die', 'head', 'monkey', 'fuk', 'pompous']

Low (negative) coefficient words:
 ['thank', 'redirect'

#### Identity Hate

In [202]:
coeff_analysis('identity_hate',100)

High coefficient words:
 ['nigger', 'gay', 'nigga', 'homosexu', 'jew', 'faggot', 'nazi', 'homo', 'muslim', 'racist', 'black', 'negro', 'fag', 'white', 'asian', 'hate', 'american', 'scum', 'rape', 'fuck', 'lesbian', 'indian', 'turk', 'arab', 'stupid', 'homophob', 'dirti', 'fucker', 'nig', 'monkey', 'mexican', 'fat', 'niger', 'jewish', 'pedophil', 'shit', 'terrorist', 'niggaz', 'twat', 'chink', 'fagot', 'women', 'sucker', 'dumb', 'fool', 'burn', 'retard', 'disgust', 'boy', 'russian', 'r', 'allah', 'bitch', 'bastard', 'littl', 'queer', 'cunt', 'slave', 'damn', 'communist', 'whyd', 'fuckin', 'pig', 'slut', 'antisemit', 'like', 'irish', 'chines', 'g', 'death', 'islam', 'akbar', 'babi', 'jewhat', 'countri', 'trash', 'jesus', 'whore', 'ass', 'redneck', 'slavic', 'filthi', 'evil', 'hoe', 'filth', 'nl33er', 'kill', 'hitler', 'bag', 'dipshit', 'nao', 'anal', 'u', 'closet', 'malaysian', 'peni', 'motherfuck', 'crazi', 'idiot', 'huge']

Low (negative) coefficient words:
 ['articl', 'talk', 'thank',

The high coefficient words show that each type of toxicity has its own set of bad words. However, the zero coefficient words, especially the disguised words, seem to be similar across all types. Thus, I will use the high coefficient words for the most general type, 'toxic', to extract bad words from concatenated words.

### Error Analysis

I ran the error analysis for every type of toxicity, since Part1 covered only 'toxic'. However, I do not show the sentences below because they contain profane, vulgar, or offensive words (uncomment the calls to see them).

In [81]:
pred = pd.DataFrame()
pred['comment_text']=train['comment_text']
pred['comment_stemmed'] =train_comment_stemmed

def error_analysis(category, prob_cut, num_comment_check):
    clf = LogisticRegression(solver='sag', C=best_C[category],random_state =42)   
    clf.fit(train_feature_matrix, train[category]) 
    pred[category] =train[category] #actual label
    pred['pred_prob'] = clf.predict_proba(train_feature_matrix)[:,1] #how likely the label is 1
    print("### {} ###\n".format(category))
    # positive with low predicted probability
    print("False Negatives:")
    print(pred[(pred[category]==1) & (pred['pred_prob']< prob_cut)]['comment_stemmed'].values[:num_comment_check])
    print()
    # negative with high predicted probability
    print("False Positives:")
    print(pred[(pred[category]==0) & (pred['pred_prob']> (1-prob_cut))]['comment_stemmed'].values[:num_comment_check])

In [203]:
#error_analysis('toxic', 0.1, 5)

In [204]:
#error_analysis('severe_toxic', 0.1, 5)

In [205]:
#error_analysis('obscene', 0.1, 5)

In [206]:
#error_analysis('threat', 0.2, 5) #had to increase threshold due to not enough cases

In [207]:
#error_analysis('insult', 0.1, 5)

In [208]:
#error_analysis('identity_hate', 0.1, 5)

I think many of the false positives should be real positives. Some of them contain several bad words and are predicted as toxic, but were rated as nontoxic by humans. Thus, some false positives are unavoidable errors made by the human raters.
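
For readers who keep the calls above commented out, the selection rule inside `error_analysis` can be tried on made-up data. The rows and the helper name `confident_errors` below are hypothetical; only the two threshold conditions mirror the function above.

```python
# Hypothetical (label, predicted probability, comment) triples for illustration
rows = [
    (1, 0.05, "comment a"),
    (1, 0.93, "comment b"),
    (0, 0.97, "comment c"),
    (0, 0.02, "comment d"),
]

def confident_errors(rows, prob_cut):
    """Pick positives scored below prob_cut and negatives scored above 1 - prob_cut."""
    false_neg = [text for label, prob, text in rows if label == 1 and prob < prob_cut]
    false_pos = [text for label, prob, text in rows if label == 0 and prob > 1 - prob_cut]
    return false_neg, false_pos

fn, fp = confident_errors(rows, prob_cut=0.1)
print("False Negatives:", fn)
print("False Positives:", fp)
```

With `prob_cut=0.1`, only predictions that are wrong by a wide margin are listed, which is why these are the cases most worth reading by hand.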

<a id= 'Section3'></a>
## 3. Tfidf + Logistic Regression after More Cleaning

### More Cleaning 
__Cleaning plans__

- Change the numbers in disguised words back into letters (e.g. 5h1t --> shit)
- Extract bad words hidden in concatenated words
- Replace overly long nonsense words with 'toolongword' after extracting bad words (it might become a high coefficient word).

#### Preparation for removing disguised words with numbers

In [112]:
# dictionary mapping numbers to letters of similar shapes
num_to_char = {'1':'i', #e.g. '5h1t', 'c0pyr1ght'
               '3':'g', #e.g. 'ni33er'
               '5':'s', #e.g. '5h1t', '5hut', '5uck5'
               '0':'o', #e.g. 'c0pyr1ght'
               '2':'2', '4':'4','6':'6','7':'7','8':'8','9':'9'}

In [214]:
test_str = '1sd3g5h0f6'

In [215]:
# raw strings keep '\d' from being read as a string escape
if re.search(r'((?:[a-z]+[0-9]|[0-9]+[a-z])[a-z0-9]*)', test_str):
    print(re.sub(r'(\d)', lambda m: num_to_char[m.group()], test_str))

isdggshof6
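
As an aside, roughly the same substitution can be written without a regex by using `str.translate`. This is an alternative sketch, not the version used in the rest of the notebook: `leet_table` and `undo_leet` are hypothetical names, and the letter-plus-digit check is slightly looser than the regex above (it does not require the letter and digit to be adjacent).

```python
# Same digit-to-letter pairs as num_to_char, restricted to the digits that change
leet_table = str.maketrans({"1": "i", "3": "g", "5": "s", "0": "o"})

def undo_leet(token):
    """Replace look-alike digits only when the token mixes letters and digits."""
    has_letter = any(ch.isalpha() for ch in token)
    has_digit = any(ch.isdigit() for ch in token)
    return token.translate(leet_table) if has_letter and has_digit else token

print(undo_leet("1sd3g5h0f6"))  # isdggshof6
print(undo_leet("2018"))        # 2018 (digits only, left unchanged)
```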


In [209]:
#pattern = re.compile(r'((?:[a-z]+[0-9]|[0-9]+[a-z])[a-z0-9]*)')
#if pattern.search(test_str):
#        print(re.sub('(\d+)', lambda m: num_to_char[m.group()], test_str))

#### Preparation for removing disguised words with concatenation

In [123]:
category = 'toxic'
clf = LogisticRegression(solver='sag', C=best_C[category],random_state =42)   
clf.fit(train_feature_matrix, train[category])
coeff = clf.coef_
sorted_voca = [word for _,word in sorted(zip(coeff[0],feature_words),reverse=True)]

In [133]:
bad_words = [ word for word in sorted_voca[:100] if len(word)>1]
print(bad_words) #'f' removed, so 99 words

['fuck', 'idiot', 'stupid', 'shit', 'suck', 'bullshit', 'ass', 'asshol', 'bitch', 'dick', 'faggot', 'crap', 'peni', 'bastard', 'moron', 'pathet', 'nigger', 'fucker', 'cunt', 'jerk', 'shut', 'dumbass', 'cock', 'gay', 'motherfuck', 'hell', 'dumb', 'loser', 'retard', 'fag', 'pussi', 'fuckin', 'liar', 'racist', 'piss', 'dickhead', 'fool', 'jackass', 'nazi', 'kill', 'sex', 'cocksuck', 'hate', 'wtf', 'sick', 'pig', 'damn', 'whore', 'bloodi', 'homosexu', 'garbag', 'die', 'douchebag', 'goddamn', 'fing', 'scum', 'hypocrit', 'fk', 'wanker', 'vagina', 'ars', 'fuckhead', 'dirti', 'fck', 'fat', 'disgust', 'dipshit', 'nerd', 'pedophil', 'fart', 'prick', 'anal', 'fascist', 'rubbish', 'donkey', 'ugli', 'freak', 'screw', 'imbecil', 'rape', 'fcking', 'nigga', 'stink', 'porn', 'shame', 'ridicul', 'coward', 'shitti', 'fuk', 'looser', 'lick', 'monkey', 'fuckwit', 'foolish', 'ahol', 'homo', 'mental', 'butt', 'scumbag']


These are the high coefficient words for 'toxic'. They will be extracted from concatenated words.

In [218]:
test_word = 'aaaaidiotfooldumb'
#long code
#words_to_be_added = ''
#for word in bad_words:   
#    if test_word.find(word) >=0:
#        words_to_be_added += word + " "
#test_word = words_to_be_added 
#test_word

In [217]:
" ".join([word for word in bad_words if test_word.find(word) >=0])

'idiot dumb fool'

In [141]:
test_word2 = 'goodwordscocatenated'

In [142]:
" ".join([word for word in bad_words if test_word2.find(word) >=0])

''
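
One property of this substring search is worth seeing on a small example: it matches a listed word anywhere inside a token, including inside ordinary words. The sketch below uses a hypothetical helper `extract_hidden` and a five-word subset of `bad_words` ('ass' and 'hell' are both in the list above); the notebook does not measure how often this happens in the data.

```python
# A few entries from bad_words above, in the same relative order
bad_words_demo = ["idiot", "ass", "hell", "dumb", "fool"]

def extract_hidden(token, vocabulary):
    """Return the listed words found anywhere inside token, space-joined."""
    return " ".join(word for word in vocabulary if word in token)

print(extract_hidden("aaaaidiotfooldumb", bad_words_demo))  # idiot dumb fool
print(extract_hidden("class", bad_words_demo))              # ass
print(extract_hidden("hello", bad_words_demo))              # hell
```

The result follows the order of the vocabulary list, not the order in which the words appear inside the token.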

#### All together

In [160]:
# this function will be added to stemming function
def more_cleaning(word_stemmed):
    ### more cleaning here ###
    # disguised with numbers: map look-alike digits back to letters
    if re.search(r'((?:[a-z]+[0-9]|[0-9]+[a-z])[a-z0-9]*)', word_stemmed):
        word_stemmed = re.sub(r'(\d)', lambda m: num_to_char[m.group()], word_stemmed)
    # disguised with concatenation: keep only the bad words found inside
    hidden_bad_words = " ".join([word for word in bad_words if word in word_stemmed])
    if hidden_bad_words != '':
        word_stemmed = hidden_bad_words
    # too long without space (i.e. not treated in the above condition, but long)
    if len(word_stemmed) > 20 and ' ' not in word_stemmed:
        word_stemmed = 'toolongword'
    return word_stemmed

In [161]:
more_cleaning('klajthwetsjethjkh5h1ttkshtkjthkejhksh')

'shit'

In [219]:
more_cleaning('5378632852jsghjkdhjskghjgklshjkglhsgs')

'toolongword'

In [163]:
def text_preprocessing_more_clean(text):
    #remove punctuations
    text_string = re.sub(r'[^\w\s]','',text)
    #split text into words
    word_list = text_string.split()
    #apply stemmer, run the extra cleaning, and combine the words again
    stemmer= SnowballStemmer("english")
    words = ""
    for word in word_list:
        if word != "":
            word_stemmed = stemmer.stem(word)
            # more cleaning: disguised digits, concatenation, too-long words
            # (same steps as more_cleaning defined above)
            word_stemmed = more_cleaning(word_stemmed)
            words += word_stemmed + " "
    return words

In [168]:
%%time
train_comment_stemmed_clean = train_comment.apply(text_preprocessing_more_clean) 
test_comment_stemmed_clean = test_comment.apply(text_preprocessing_more_clean) 

Wall time: 9min 8s


### Tfidf

In [170]:
# Using stemmed and more cleaned comments
vectorizer = TfidfVectorizer(analyzer ='word', 
                             stop_words='english',
                             sublinear_tf=True, #term-freq scaling
                             strip_accents='unicode', #works generally
                             token_pattern=r'\w{1,}', #1+ char words
                             ngram_range=(1,1),
                             max_features=20000) #top frequent 20000 words

vectorizer.fit(pd.concat([train_comment_stemmed_clean,test_comment_stemmed_clean]))
# Make sure to use the right argument
train_feature_matrix = vectorizer.transform(train_comment_stemmed_clean)
test_feature_matrix = vectorizer.transform(test_comment_stemmed_clean)

### Logistic Regression

#### Tuning for C

In [171]:
%%time
for category in categories:
    train_labels = train[category]
    parameters = {'C': [.2, .5, 1, 2, 5]}
    log_reg = LogisticRegression(solver = 'sag',random_state =42)   
    clf= GridSearchCV(log_reg, parameters, scoring='roc_auc', cv=3)
    clf.fit(train_feature_matrix, train_labels)
    print_results()

### toxic ###

Best hyper-parameters on development set:
{'C': 2}

Grid search scores on development set:
0.9658 (+/-0.0036) for {'C': 0.2}
0.9694 (+/-0.0030) for {'C': 0.5}
0.9709 (+/-0.0026) for {'C': 1}
0.9711 (+/-0.0024) for {'C': 2}
0.9696 (+/-0.0024) for {'C': 5}

### severe_toxic ###

Best hyper-parameters on development set:
{'C': 0.5}

Grid search scores on development set:
0.9871 (+/-0.0007) for {'C': 0.2}
0.9877 (+/-0.0012) for {'C': 0.5}
0.9876 (+/-0.0019) for {'C': 1}
0.9869 (+/-0.0028) for {'C': 2}
0.9847 (+/-0.0042) for {'C': 5}

### obscene ###

Best hyper-parameters on development set:
{'C': 1}

Grid search scores on development set:
0.9861 (+/-0.0018) for {'C': 0.2}
0.9874 (+/-0.0019) for {'C': 0.5}
0.9879 (+/-0.0020) for {'C': 1}
0.9878 (+/-0.0020) for {'C': 2}
0.9864 (+/-0.0020) for {'C': 5}

### threat ###

Best hyper-parameters on development set:
{'C': 2}

Grid search scores on development set:
0.9728 (+/-0.0029) for {'C': 0.2}
0.9772 (+/-0.0029) for {'C': 0.5}
0

#### Fine tuning for C

In [172]:
%%time
# Tune C for each category CV=5 
parameters_dic={'toxic':{'C':[1.1,1.2,1.3,1.4,1.5, 1.6,1.7,1.8,1.9,2.0,2.1,2.2,2.3,2.4,2.5]},
                'severe_toxic':{'C':[.1,.2,.3,.4,.5,.6,.7,.8,.9,1.0]},
                'obscene':{'C':[.6,.7,.8,.9,1.0,1.1,1.2,1.3,1.4,1.5]},
                'threat':{'C':[1.6,1.7,1.8,1.9,2.0,2.1,2.2,2.3,2.4,2.5,2.6,2.7,2.8,2.9,3.0]}, 
                'insult':{'C':[.6,.7,.8,.9,1.0,1.1,1.2,1.3,1.4,1.5]}, 
                'identity_hate':{'C':[.6,.7,.8,.9,1.0,1.1,1.2,1.3,1.4,1.5]}}
for category in categories:
    train_labels = train[category]
    parameters = parameters_dic[category]
    log_reg = LogisticRegression(solver='sag', random_state =42)   
    clf= GridSearchCV(log_reg, parameters, scoring='roc_auc', cv=5)
    clf.fit(train_feature_matrix, train_labels)
    print_results()

### toxic ###

Best hyper-parameters on development set:
{'C': 1.6}

Grid search scores on development set:
0.9715 (+/-0.0018) for {'C': 1.1}
0.9716 (+/-0.0018) for {'C': 1.2}
0.9716 (+/-0.0018) for {'C': 1.3}
0.9717 (+/-0.0018) for {'C': 1.4}
0.9717 (+/-0.0018) for {'C': 1.5}
0.9717 (+/-0.0018) for {'C': 1.6}
0.9717 (+/-0.0018) for {'C': 1.7}
0.9717 (+/-0.0018) for {'C': 1.8}
0.9716 (+/-0.0018) for {'C': 1.9}
0.9716 (+/-0.0018) for {'C': 2.0}
0.9716 (+/-0.0018) for {'C': 2.1}
0.9715 (+/-0.0018) for {'C': 2.2}
0.9715 (+/-0.0018) for {'C': 2.3}
0.9715 (+/-0.0018) for {'C': 2.4}
0.9714 (+/-0.0018) for {'C': 2.5}

### severe_toxic ###

Best hyper-parameters on development set:
{'C': 0.6}

Grid search scores on development set:
0.9866 (+/-0.0018) for {'C': 0.1}
0.9873 (+/-0.0014) for {'C': 0.2}
0.9876 (+/-0.0013) for {'C': 0.3}
0.9877 (+/-0.0014) for {'C': 0.4}
0.9878 (+/-0.0015) for {'C': 0.5}
0.9878 (+/-0.0016) for {'C': 0.6}
0.9878 (+/-0.0017) for {'C': 0.7}
0.9877 (+/-0.0019) for {'C':

In [173]:
%%time
best_C = {'toxic':1.6,'severe_toxic':.6,'obscene':1.2,
        'threat':2.2, 'insult':1.0, 'identity_hate':1.3}
score_all = []
for category in categories:
    train_labels = train[category]
    log_reg = LogisticRegression(solver='sag',random_state =42)   
    clf= GridSearchCV(log_reg, {'C':[best_C[category]]}, scoring='roc_auc', cv=10)
    clf.fit(train_feature_matrix, train_labels)
    print("Mean CV score on dev. set for %s is %0.4f" %(category,clf.best_score_))
    score_all.append(clf.best_score_)
print()
print("Mean CV score on dev. set for All categories is %0.4f" %np.mean(score_all))

Mean CV score on dev. set for toxic is 0.9720
Mean CV score on dev. set for severe_toxic is 0.9879
Mean CV score on dev. set for obscene is 0.9884
Mean CV score on dev. set for threat is 0.9829
Mean CV score on dev. set for insult is 0.9789
Mean CV score on dev. set for identity_hate is 0.9800

Mean CV score on dev. set for All categories is 0.9817
Wall time: 2min 14s


In [174]:
submission = pd.DataFrame()
submission['id']=test['id']

for category in categories:
    train_labels = train[category]
    clf = LogisticRegression(solver='sag', C=best_C[category],random_state =42)   
    clf.fit(train_feature_matrix, train_labels)
    submission[category] = clf.predict_proba(test_feature_matrix)[:,1] #second column!!
    
submission.to_csv('submission_11.csv', index=False)

LB AUC: 0.9778 

_This score is the best of all the methods I tried, and even better than GloVe + LSTM in Part2!!! The error (1 - AUC) decreased by about 12% relative to the previous submission after the extra cleaning!!!_

In [220]:
# Error decrease in percent
(.9778-.9747)/(1-.9747)*100

12.252964426877442

#### Check that those words are really gone from the zero coefficient words

In [176]:
# dictionary of word:feature_index key value pairs
voca_dic = vectorizer.vocabulary_ 

In [177]:
# Make a list of words ordered by feature index
feature_words =[None]*len(voca_dic)
for word, index in voca_dic.items():
    feature_words[index]=word

In [186]:
coeff_analysis('toxic',100)

High coefficient words:
 ['fuck', 'idiot', 'stupid', 'shit', 'suck', 'bitch', 'asshol', 'dick', 'bastard', 'moron', 'crap', 'peni', 'pathet', 'dumb', 'nigger', 'cunt', 'fag', 'jerk', 'shut', 'gay', 'retard', 'cock', 'fck', 'racist', 'loser', 'pussi', 'fool', 'piss', 'nazi', 'damn', 'sick', 'scum', 'kill', 'bloodi', 'fuk', 'wtf', 'hypocrit', 'sex', 'f', 'whore', 'nigga', 'douchebag', 'dirti', 'garbag', 'wanker', 'jackass', 'disgust', 'faggot', 'vagina', 'hate', 'stink', 'ugli', 'shame', 'prick', 'freak', 'die', 'donkey', 'screw', 'rubbish', 'bullshit', 'pig', 'ridicul', 'imbecil', 'liar', 'coward', 'nerd', 'pedophil', 'kiss', 'fascist', 'cum', 'hell', 'porn', 'twat', 'masturb', 'mouth', 'homo', 'monkey', 'fart', 'silli', 'slut', 'poop', 'fk', 'worst', 'arrog', 'shove', 'fggt', 'mental', 'lazi', 'ahol', 'butt', 'douch', 'prostitut', 'hole', 'smell', 'ass', 'cougar', 'ball', 'tit', 'blood', 'disgrac']

Low (negative) coefficient words:
 ['thank', 'redirect', 'pleas', 'best', 'appreci', 'c

The words with zero or near-zero coefficients no longer include those hidden bad words!

This part showed that handling hidden bad words improves toxic comment classification. I believe the same cleaning could also improve the deep learning models in Part2. Moreover, I only checked a subset of the words, so there is still room for improvement if more problematic patterns are found.