You are provided with a large number of Wikipedia comments that human raters have labeled for toxic behavior. The types of toxicity are:

    toxic
    severe_toxic
    obscene
    threat
    insult
    identity_hate
You must create a model that predicts the probability of each type of toxicity for each comment.

https://www.kaggle.com/tunguz/logistic-regression-with-words-and-char-n-grams

In [35]:
import pandas as pd
import numpy as np

In [36]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from scipy.sparse import hstack
from scipy.special import logit, expit
from tqdm import tqdm

In [37]:
# Replace missing values with a blank string so the vectorizers never receive NaN.
train = pd.read_csv("data/train.csv").fillna(' ')
test = pd.read_csv("data/test.csv").fillna(' ')

print(train.shape)
print(test.shape)

(159571, 8)
(153164, 2)


In [38]:
train.head(10)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0
5,00025465d4725e87,"""\n\nCongratulations from me as well, use the ...",0,0,0,0,0,0
6,0002bcb3da6cb337,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1,1,1,0,1,0
7,00031b1e95af7921,Your vandalism to the Matt Shirvington article...,0,0,0,0,0,0
8,00037261f536c51d,Sorry if the word 'nonsense' was offensive to ...,0,0,0,0,0,0
9,00040093b2687caa,alignment on this subject and which are contra...,0,0,0,0,0,0


In [39]:
test.head()

Unnamed: 0,id,comment_text
0,00001cee341fdb12,Yo bitch Ja Rule is more succesful then you'll...
1,0000247867823ef7,== From RfC == \n\n The title is fine as it is...
2,00013b17ad220c46,""" \n\n == Sources == \n\n * Zawe Ashton on Lap..."
3,00017563c3f7919a,":If you have a look back at the source, the in..."
4,00017695ad8997eb,I don't anonymously edit articles at all.


In [40]:
train.describe()

Unnamed: 0,toxic,severe_toxic,obscene,threat,insult,identity_hate
count,159571.0,159571.0,159571.0,159571.0,159571.0,159571.0
mean,0.095844,0.009996,0.052948,0.002996,0.049364,0.008805
std,0.294379,0.099477,0.223931,0.05465,0.216627,0.09342
min,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0
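
Because every label column is 0/1, the `mean` row above is the fraction of positive comments per class. A minimal sketch in plain Python, with the values copied from the `describe()` output, turns those fractions into approximate counts. The variable names are illustrative only and do not appear elsewhere in the notebook.

```python
# Values copied from the describe() output above.
n_comments = 159571
positive_rate = {
    "toxic": 0.095844,
    "severe_toxic": 0.009996,
    "obscene": 0.052948,
    "threat": 0.002996,
    "insult": 0.049364,
    "identity_hate": 0.008805,
}

# For a 0/1 column, mean * count is the number of positive rows
# (approximate here, because the printed means are rounded).
approx_positives = {
    name: round(rate * n_comments) for name, rate in positive_rate.items()
}

for name, count in approx_positives.items():
    print(f"{name:>14}: ~{count} positive comments ({positive_rate[name]:.2%})")
```

The rarest class (`threat`) has only a few hundred positive examples, which is one reason a ranking metric such as ROC AUC is more informative here than plain accuracy.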


In [41]:
class_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

### clean data

In [42]:
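# KaggleWord2VecUtility is assumed to be a local helper module; its source is not included in this notebook.
from KaggleWord2VecUtility import KaggleWord2VecUtility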

In [43]:
%time train['comment_clean'] = KaggleWord2VecUtility.apply_by_multiprocessing(train['comment_text'], KaggleWord2VecUtility.review_to_join_words, workers=4)

  ' that document to Beautiful Soup.' % decoded_markup


CPU times: user 253 ms, sys: 486 ms, total: 739 ms
Wall time: 1min 4s


In [44]:
%time test['comment_clean'] = KaggleWord2VecUtility.apply_by_multiprocessing(test['comment_text'], KaggleWord2VecUtility.review_to_join_words, workers=4)

  ' that document to Beautiful Soup.' % decoded_markup


CPU times: user 238 ms, sys: 339 ms, total: 577 ms
Wall time: 57.8 s
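
The source of `KaggleWord2VecUtility` is not shown here, so the exact cleaning steps are unknown. The warnings suggest the real helper passes each comment through Beautiful Soup. As a rough, hypothetical sketch of what a `review_to_join_words`-style cleaner might do (strip markup, keep letters, lowercase, re-join), the function below uses simple regular expressions instead. The name `clean_comment` and every step in it are assumptions, not the helper's actual behavior.

```python
import re
from html import unescape


def clean_comment(text):
    """Hypothetical stand-in for a review_to_join_words-style helper."""
    text = unescape(text)                    # decode entities such as &amp;
    text = re.sub(r"<[^>]+>", " ", text)     # drop anything that looks like an HTML tag
    text = re.sub(r"[^a-zA-Z]", " ", text)   # keep ASCII letters only
    return " ".join(text.lower().split())    # lowercase and collapse whitespace


print(clean_comment("Hey <b>man</b>, I'm really not trying to edit war."))
```

A cleaner like this discards punctuation and casing, both of which can carry signal for toxicity, so whether it helps is an empirical question to settle with the CV score.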


In [45]:
X_train = train['comment_clean']
X_test = test['comment_clean']

X_all = pd.concat([X_train, X_test])

In [46]:
# word_vectorizer = TfidfVectorizer(
#                     sublinear_tf=True,
#                     strip_accents='unicode',
#                     analyzer='word',
#                     token_pattern=r'\w{1,}',
#                     ngram_range=(1, 1),
#                     max_features=15000)

word_vectorizer = TfidfVectorizer(
                    min_df=5,
                    sublinear_tf=True,
                    strip_accents='unicode',
                    analyzer='word',
                    token_pattern=r'\w{1,}',
                    ngram_range=(1, 2),
                    use_idf=True,
                    smooth_idf=True,
                    stop_words='english',
                    max_features=30000)

word_vectorizer.fit(X_all)
X_train_word = word_vectorizer.transform(X_train)
X_test_word = word_vectorizer.transform(X_test)
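
To make the `sublinear_tf=True` and `smooth_idf=True` options concrete, here is a small pure-Python sketch of the weighting scikit-learn documents for `TfidfVectorizer`: the raw count `tf` becomes `1 + ln(tf)`, the inverse document frequency is `ln((1 + n) / (1 + df)) + 1`, and each document vector is scaled to unit L2 length (the default `norm='l2'`). The toy corpus and helper names are made up for illustration, and n-grams, `min_df`, and stop words are left out.

```python
import math
from collections import Counter

# Toy corpus (illustration only).
docs = [
    "you are my hero",
    "you you you are wrong",
    "thanks for the edit",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term.
df = Counter(term for tokens in tokenized for term in set(tokens))


def smooth_idf(term):
    # smooth_idf=True: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df[term])) + 1.0


def tfidf_vector(tokens):
    counts = Counter(tokens)
    # sublinear_tf=True: replace the raw count tf with 1 + ln(tf)
    weights = {t: (1.0 + math.log(c)) * smooth_idf(t) for t, c in counts.items()}
    # norm='l2' (the default): scale the vector to unit length
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


vec = tfidf_vector(tokenized[1])
for term, weight in sorted(vec.items()):
    print(f"{term:>6}: {weight:.4f}")
```

With sublinear scaling, the word repeated three times gets a weight of about twice that of a single occurrence rather than three times, which dampens the effect of comments that repeat one word many times.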

In [47]:
# token_pattern and stop_words only apply when analyzer='word', so they are omitted here.
char_vectorizer = TfidfVectorizer(
                    sublinear_tf=True,
                    strip_accents='unicode',
                    analyzer='char',
                    use_idf=True,
                    smooth_idf=True,
                    ngram_range=(1, 5),
                    max_features=80000)

char_vectorizer.fit(X_all)
X_train_char = char_vectorizer.transform(X_train)
X_test_char = char_vectorizer.transform(X_test)
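
With `analyzer='char'`, features are substrings of the whole comment, spaces included, so n-grams can span word boundaries. The sketch below approximates that extraction in plain Python for a given `ngram_range`. The function `char_ngrams` is a hypothetical helper written for this illustration; it skips accent stripping and the `max_features` cut.

```python
import re


def char_ngrams(text, min_n=1, max_n=5):
    """All character n-grams of length min_n..max_n, spaces included."""
    # Lowercase and collapse runs of whitespace into a single space.
    text = re.sub(r"\s\s+", " ", text.lower())
    grams = []
    for n in range(min_n, min(max_n, len(text)) + 1):
        for i in range(len(text) - n + 1):
            grams.append(text[i:i + n])
    return grams


grams = char_ngrams("bad edit", min_n=2, max_n=3)
print(grams)
```

Character n-grams are robust to misspellings and deliberately obfuscated words, which is a plausible reason they complement the word-level features in this task.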

In [48]:
train_features = hstack([X_train_char, X_train_word])
test_features = hstack([X_test_char, X_test_word])

In [49]:
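# NOTE: despite their names, `losses` and `cv_loss` hold ROC AUC scores (higher is better).
losses = []
predictions = {'id': test['id']}
for class_name in tqdm(class_names):
    train_target = train[class_name]
    # One independent binary classifier per toxicity label.
    classifier = LogisticRegression(solver='sag')

    # Mean ROC AUC over 3 cross-validation folds for this label.
    cv_loss = np.mean(cross_val_score(classifier, train_features, train_target, cv=3, scoring='roc_auc'))
    losses.append(cv_loss)
    print('CV score for class {} is {}'.format(class_name, cv_loss))

    # Refit on the full training set, then keep the probability of the positive class.
    classifier.fit(train_features, train_target)
    predictions[class_name] = classifier.predict_proba(test_features)[:, 1]

print('Total CV score is {}'.format(np.mean(losses)))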

CV score for class toxic is 0.9783055759603592
CV score for class severe_toxic is 0.9888817470165217
CV score for class obscene is 0.9908723802852015
CV score for class threat is 0.9888804922337462
CV score for class insult is 0.9830436190931842
CV score for class identity_hate is 0.9830380998065252
100%|██████████| 6/6 [18:57<00:00, 189.55s/it]

Total CV score is 0.9855036523992564
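
The total above is the plain average of the six per-class ROC AUC values. As a reminder of what each value measures, the sketch below computes ROC AUC directly from its pairwise definition: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The labels and scores are made-up toy numbers, and the quadratic loop is for illustration only, not a replacement for scikit-learn's implementation.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


# Toy labels and scores for two classes (made-up numbers, illustration only).
toy = {
    "toxic": ([0, 0, 1, 1, 0], [0.1, 0.4, 0.35, 0.8, 0.2]),
    "threat": ([0, 0, 0, 1, 0], [0.05, 0.1, 0.2, 0.9, 0.3]),
}
per_class = {name: roc_auc(y, s) for name, (y, s) in toy.items()}
mean_auc = sum(per_class.values()) / len(per_class)
print(per_class, mean_auc)
```

Because the metric depends only on the ordering of scores, a class with very few positives can still reach a high AUC, so the per-class values are worth reading alongside the average.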





In [50]:
submission = pd.DataFrame.from_dict(predictions)

In [51]:
from datetime import datetime

current_time = datetime.now()
current_time = current_time.strftime("%Y%m%d_%H%M%S")

description = "baseline"

submission.to_csv("submissions/{description}_{time}_{score:.5f}.csv".format(description=description, score=np.mean(losses), time=current_time), index=False)

CV score (mean ROC AUC) / Kaggle score

    0.98512 / 0.9788 - baseline, logistic regression
    0.98521 / 0.9793 - clean data
    0.98527 / 0.9795 - word hyperparameters
    0.98542 / 0.9796 - char hyperparameters
    0.98544 - logistic regression (C=4)
    0.98502 - stopword removal