### Kaggle Competition - 13 days to go

You are provided with a large number of Wikipedia comments which have been labeled by human raters for toxic behavior. The types of toxicity are:

1. toxic
2. severe_toxic
3. obscene
4. threat
5. insult
6. identity_hate

You must create a model that predicts the probability of each type of toxicity for each comment.

#### Clone 1 - Logistic Regression with words and char n-grams
https://www.kaggle.com/tunguz/logistic-regression-with-words-and-char-n-grams/code

In [1]:
import numpy as np
import pandas as pd

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from scipy.sparse import hstack

In [3]:
class_names = ['toxic','severe_toxic','obscene','threat','insult','identity_hate']

In [4]:
train = pd.read_csv('data/train.csv').fillna(' ')
test = pd.read_csv('data/test.csv').fillna(' ')

In [5]:
train.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


In [6]:
test.head()

Unnamed: 0,id,comment_text
0,00001cee341fdb12,Yo bitch Ja Rule is more succesful then you'll...
1,0000247867823ef7,== From RfC == \n\n The title is fine as it is...
2,00013b17ad220c46,""" \n\n == Sources == \n\n * Zawe Ashton on Lap..."
3,00017563c3f7919a,":If you have a look back at the source, the in..."
4,00017695ad8997eb,I don't anonymously edit articles at all.


In [12]:
train.severe_toxic.value_counts()

0    157976
1      1595
Name: severe_toxic, dtype: int64

In [13]:
train.toxic.value_counts()

0    144277
1     15294
Name: toxic, dtype: int64

In [14]:
train.obscene.value_counts()

0    151122
1      8449
Name: obscene, dtype: int64

In [15]:
train.threat.value_counts()

0    159093
1       478
Name: threat, dtype: int64

In [16]:
train.insult.value_counts()

0    151694
1      7877
Name: insult, dtype: int64

In [17]:
train.identity_hate.value_counts()

0    158166
1      1405
Name: identity_hate, dtype: int64

In [18]:
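# Most comments are non-toxic. Every label is rare; `toxic` is the most common (about 9.6% of comments) and `threat` the rarest (about 0.3%).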
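
The six `value_counts()` outputs above can be condensed into one positive rate per label. The sketch below hard-codes the counts printed above so that it runs on its own; on the real frame, `train[class_names].mean()` should give the same rates, although that was not run here.

```python
# Positive counts copied from the value_counts() outputs above.
positives = {
    'toxic': 15294,
    'severe_toxic': 1595,
    'obscene': 8449,
    'threat': 478,
    'insult': 7877,
    'identity_hate': 1405,
}
n_comments = 144277 + 15294  # negatives + positives for `toxic`

# Fraction of comments carrying each label.
positive_rates = {name: count / n_comments for name, count in positives.items()}

# Print the labels from most to least frequent.
for name, rate in sorted(positive_rates.items(), key=lambda kv: kv[1], reverse=True):
    print('{:<14} {:.2%}'.format(name, rate))
```

Every label is positive in fewer than 10% of comments, so a ranking metric such as ROC AUC is more informative here than plain accuracy.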

In [19]:
train_text = train['comment_text']
test_text = test['comment_text']

In [20]:
all_text = pd.concat([train_text,test_text])

In [21]:
word_vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='word',
    token_pattern=r'\w{1,}',
    stop_words='english',
    ngram_range=(1, 1),
    max_features=10000)
word_vectorizer.fit(all_text)
train_word_features = word_vectorizer.transform(train_text)
test_word_features = word_vectorizer.transform(test_text)
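
To see what `sublinear_tf=True` changes, here is a minimal sketch on a hypothetical one-document corpus. IDF weighting and normalisation are switched off (unlike in the cell above) so that only the term-frequency scaling is visible: a raw count `tf` becomes `1 + ln(tf)`.

```python
import math

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical one-document corpus: 'spam' appears three times, 'ham' once.
docs = ['spam spam spam ham']

# use_idf=False and norm=None isolate the effect of sublinear_tf.
vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    token_pattern=r'\w{1,}',
    use_idf=False,
    norm=None)
weights = vectorizer.fit_transform(docs).toarray()[0]

# Map each term to its weight through the fitted vocabulary.
by_term = {term: weights[idx] for term, idx in vectorizer.vocabulary_.items()}
print(by_term)
```

A term repeated three times gets weight `1 + ln(3)` rather than 3, which dampens the influence of comments that repeat one word many times.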

In [25]:
char_vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='char',
    # stop_words is omitted: scikit-learn ignores it unless analyzer='word'
    ngram_range=(2, 6),
    max_features=50000)
char_vectorizer.fit(all_text)
train_char_features = char_vectorizer.transform(train_text)
test_char_features = char_vectorizer.transform(test_text)
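
As a rough illustration of what `analyzer='char'` with an `ngram_range` extracts, here is a plain-Python sketch. The helper `char_ngrams` is hypothetical and skips the lowercasing and whitespace normalisation that scikit-learn applies first; it only shows why character n-grams tolerate obfuscated spellings.

```python
def char_ngrams(text, min_n=2, max_n=6):
    """Return every character n-gram of `text` with min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(text) - n + 1):
            grams.append(text[start:start + n])
    return grams


# A word and an obfuscated spelling of it still share several n-grams.
plain = set(char_ngrams('stupid', min_n=2, max_n=3))
masked = set(char_ngrams('st*pid', min_n=2, max_n=3))
shared = plain & masked
print(sorted(shared))
```

A word-level vocabulary would treat the two spellings as unrelated tokens, whereas the character features overlap.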

In [27]:
train_features = hstack([train_char_features, train_word_features])
test_features = hstack([test_char_features, test_word_features])
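
The stacking step can be checked on a small example. The sketch below uses hypothetical stand-in matrices (3 documents, 4 character features, 2 word features) and passes `format='csr'` to make the result format explicit, which the cell above does not do.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Hypothetical stand-ins for the char and word feature matrices.
char_block = csr_matrix(np.array([[1, 0, 0, 2],
                                  [0, 0, 3, 0],
                                  [0, 4, 0, 0]], dtype=float))
word_block = csr_matrix(np.array([[0, 5],
                                  [6, 0],
                                  [0, 0]], dtype=float))

# Row counts must match; the columns are concatenated left to right.
combined = hstack([char_block, word_block], format='csr')

print(combined.shape)
print(combined.format)
```

The character columns come first, so the word features start at column index 4 in this toy case.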

In [31]:
scores = []
submission = pd.DataFrame.from_dict({'id': test['id']})
# Fit one independent binary classifier per toxicity label.
for class_name in class_names:
    train_target = train[class_name]
    classifier = LogisticRegression(solver='sag')

    # Estimate out-of-sample ROC AUC with 10-fold cross-validation.
    cv_score = np.mean(cross_val_score(classifier, train_features, train_target, cv=10, scoring='roc_auc'))
    scores.append(cv_score)
    print('CV score for class {} is {}'.format(class_name, cv_score))

    # Refit on the full training set and predict P(label = 1) for the test set.
    classifier.fit(train_features, train_target)
    submission[class_name] = classifier.predict_proba(test_features)[:, 1]


CV score for class toxic is 0.979512410526
CV score for class severe_toxic is 0.988452439427
CV score for class obscene is 0.990792222448
CV score for class threat is 0.990725292354
CV score for class insult is 0.983185451574
CV score for class identity_hate is 0.9835214429
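
One caveat about the scores above: the vectorizers were fit on all of the text before cross-validation, so each held-out fold contributed to the vocabulary and IDF weights. A stricter variant, sketched below on a hypothetical toy corpus, puts the vectorizer inside a `Pipeline` so that it is refit on each fold's training split. This is an alternative setup and not what the notebook ran.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; not taken from the competition data.
texts = [
    'you are awful', 'awful awful person', 'what an awful edit', 'awful work',
    'thanks for the help', 'great edit thanks', 'thanks a lot', 'nice work thanks',
]
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# The vectorizer lives inside the pipeline, so every CV fold refits it
# on that fold's training split only.
model = make_pipeline(
    TfidfVectorizer(sublinear_tf=True, token_pattern=r'\w{1,}'),
    LogisticRegression(),
)
fold_scores = cross_val_score(model, texts, labels, cv=2, scoring='roc_auc')
print(fold_scores)
```

On the full data this costs one vectorizer fit per fold, which is the trade-off for a cleaner estimate.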


In [30]:
print('Total CV score is {}'.format(np.mean(scores)))

submission.to_csv('submission.csv', index=False)

Total CV score is 0.985302283329
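
The total above is the unweighted mean of the six per-class ROC AUC values. As a reminder of what that metric measures, here is a plain-Python sketch: the AUC is the probability that a randomly chosen positive comment is scored above a randomly chosen negative one. The function `roc_auc` and the toy labels and probabilities are hypothetical, and the pairwise loop is quadratic, so it is for illustration only.

```python
def roc_auc(y_true, y_score):
    """Probability that a random positive is scored above a random negative.

    A tie between a positive and a negative counts as half a win.
    """
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError('ROC AUC needs at least one positive and one negative')
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted probabilities for two classes.
y_true = {'toxic': [1, 0, 1, 0], 'threat': [0, 0, 1, 0]}
y_prob = {'toxic': [0.9, 0.2, 0.4, 0.6], 'threat': [0.1, 0.3, 0.8, 0.2]}

per_class = {name: roc_auc(y_true[name], y_prob[name]) for name in y_true}
mean_auc = sum(per_class.values()) / len(per_class)
print(per_class, mean_auc)
```

Because the mean is unweighted, a rare class such as `threat` counts as much toward the total as `toxic`.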


Public leaderboard score: 0.9792 (rank 2010)