# Homework 2 - TF-IDF Classifier

Your goal is to train a classifier that detects "toxic" comments and to submit your solution to the Kaggle [Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge).

While training the model, you need to answer these ***[questions](https://docs.google.com/forms/d/e/1FAIpQLSd9mQx8EFpSH6FhCy1M_FmISzy3lhgyyqV3TN0pmtop7slmTA/viewform?usp=sf_link)***.

The data can be downloaded here: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data



In [4]:
import numpy as np
import pandas as pd
import re, string

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from scipy.sparse import hstack
from wordcloud import STOPWORDS
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import TweetTokenizer
from nltk.tokenize.toktok import ToktokTokenizer
from gensim.utils import tokenize
from nltk.stem import LancasterStemmer

In [2]:
class_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

train = pd.read_csv('../data/01/train.csv').fillna(' ')
test = pd.read_csv('../data/01/test.csv').fillna(' ')

The standard approaches to text analysis are [Bag of words](https://en.wikipedia.org/wiki/Bag-of-words_model) and its modification, [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).

They are implemented in `sklearn` as [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

More details about them can be found [here](https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-feature-extraction-and-engineering.ipynb).
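As a small illustration of the difference between the two, the sketch below fits both vectorizers on a made-up three-document corpus (the sentences are not taken from the competition data). It uses the default settings of both classes, which differ from the settings used later in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, invented for illustration only.
docs = [
    "you are great",
    "you are awful",
    "awful awful comment",
]

# Bag of words: one column per vocabulary term, each cell holds a raw count.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)
vocab = sorted(count_vec.vocabulary_)  # column order is alphabetical
print(vocab)
print(counts.toarray())

# TF-IDF: the same terms, but terms that occur in many documents are
# down-weighted, and each row is L2-normalised by default.
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs)
weights = tfidf.toarray()
print(weights.round(2))
```

In the first document, "great" occurs in only one document while "you" occurs in two, so "great" should receive the larger TF-IDF weight even though both have a raw count of 1.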

In [3]:
train_text = train['comment_text']
test_text = test['comment_text']
all_text = pd.concat([train_text, test_text])

### Which word occurs most often in the combined train and test dataset? *

In [24]:
all_text_in_one = " ".join(all_text.values)
tokens = tokenize(all_text_in_one.lower())

In [26]:
# idxmax() returns the label (the word); in recent pandas argmax() returns an integer position
pd.Series(tokens).value_counts().idxmax()

'the'
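The same count can be obtained without joining every comment into one large string. The sketch below is one possible alternative using `collections.Counter`; the two comments are made up, and the regular expression is a simplifying assumption that does not reproduce `gensim.utils.tokenize` exactly.

```python
import re
from collections import Counter

# Toy stand-in for all_text (hypothetical comments).
comments = ["The cat sat on the mat.", "Is the dog on the mat?"]

counter = Counter()
for comment in comments:
    # Assumption: a token is a run of ASCII letters after lowercasing.
    counter.update(re.findall(r"[a-z]+", comment.lower()))

most_common_word, freq = counter.most_common(1)[0]
print(most_common_word, freq)  # the 4
```

Updating the counter one comment at a time keeps memory usage proportional to the vocabulary size rather than to the total length of the text.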

## TFIDF + crossvalidation

In [93]:
def tokenizer(text):
    # tokens = ToktokTokenizer().tokenize(text.lower())  # worse
    # tokens = TweetTokenizer().tokenize(text.lower())  # worse
    tokens = tokenize(text.lower())
    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]
    return lemmas

In [12]:
# Try different vectorizers and n-gram sizes, stop words, and pruning of rare and overly frequent words
word_vectorizer = TfidfVectorizer(ngram_range=(1, 1), 
                                  tokenizer=tokenizer,
                                  max_features=20000, 
                                  norm='l2',
                                  smooth_idf=False,
                                  sublinear_tf=True,
                                  strip_accents='unicode',
                                  min_df=2, max_df=0.9)
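To see what `min_df` and `max_df` do to the vocabulary, here is a minimal sketch on a made-up four-document corpus. It uses the default tokenizer rather than the lemmatizing `tokenizer` defined above, so it only illustrates the pruning rules.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration only.
docs = [
    "please stop this nonsense",
    "please read this article",
    "please cite this source",
    "please stop vandalising",
]

# No pruning: every term seen in the corpus is kept.
full = TfidfVectorizer().fit(docs)

# min_df=2 drops terms found in fewer than 2 documents;
# max_df=0.9 drops terms found in more than 90% of documents.
pruned = TfidfVectorizer(min_df=2, max_df=0.9).fit(docs)

full_vocab = sorted(full.vocabulary_)
pruned_vocab = sorted(pruned.vocabulary_)
print(len(full_vocab), full_vocab)
print(len(pruned_vocab), pruned_vocab)
```

Here "please" appears in every document and is removed by `max_df`, while the words that appear only once are removed by `min_df`, leaving just "stop" and "this".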

In [13]:
%%time
word_vectorizer.fit(all_text)
train_word_features = word_vectorizer.transform(train_text)
test_word_features = word_vectorizer.transform(test_text)

CPU times: user 2min 49s, sys: 324 ms, total: 2min 50s
Wall time: 2min 50s


In [96]:
char_vectorizer = TfidfVectorizer(sublinear_tf=True,
                                  strip_accents='unicode',
                                  norm = 'l2',
                                  analyzer='char',
                                  ngram_range=(2, 5),
                                  smooth_idf=False,
                                  max_features=50000)
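Character n-grams can help with obfuscated or misspelled words, because a misspelling still shares many short substrings with the original. The sketch below inspects the n-grams through `build_analyzer()`; the range `(2, 3)` and the example words are chosen only to keep the output short.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Build only the analyzer (no fitting needed) to see which features a string produces.
analyzer = TfidfVectorizer(analyzer='char', ngram_range=(2, 3)).build_analyzer()

ngrams = analyzer("idiot")
obfuscated = analyzer("id1ot")

# N-grams that survive the obfuscation and can still carry signal.
shared = set(ngrams) & set(obfuscated)
print(sorted(set(ngrams)))
print(sorted(shared))
```

A word-level vectorizer would treat "id1ot" as an unrelated token, whereas the character-level features still overlap on "id" and "ot".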

In [97]:
%%time
char_vectorizer.fit(all_text)
train_char_features = char_vectorizer.transform(train_text)
test_char_features = char_vectorizer.transform(test_text)

CPU times: user 9min 2s, sys: 13.7 s, total: 9min 16s
Wall time: 10min 22s


In [14]:
%%time
train_features = hstack([train_char_features, train_word_features])
test_features = hstack([test_char_features, test_word_features])

CPU times: user 3.52 s, sys: 2.85 s, total: 6.37 s
Wall time: 6.37 s
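`hstack` places the feature blocks side by side, so every row keeps its character features followed by its word features. A minimal sketch with two tiny hand-written matrices is shown below. Depending on the SciPy version, the result may come back in COO format, so the example converts it to CSR explicitly; that conversion is a suggestion and is not part of the original cell.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Two tiny feature blocks with the same number of rows (documents).
char_block = csr_matrix(np.array([[1.0, 0.0], [0.0, 2.0]]))
word_block = csr_matrix(np.array([[0.0, 3.0, 0.0], [4.0, 0.0, 0.0]]))

# Columns are concatenated; the number of rows must match.
combined = hstack([char_block, word_block]).tocsr()

print(combined.shape)
print(combined.toarray())
```

CSR supports efficient row slicing, which is what cross-validation does when it splits the training matrix into folds.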


For classification we will use logistic regression: [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

We will train one classifier per class.

To validate model quality, we will use the [cross_val_score](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function.
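The pattern of one binary classifier per label, each scored with cross-validated ROC AUC, can be sketched on synthetic data as follows. The features, the two label names, and the noise level are all invented for illustration and have nothing to do with the real comments.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))

# Two hypothetical binary labels, each driven by a different feature plus noise.
labels = {
    "label_a": (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int),
    "label_b": (X[:, 1] + 0.5 * rng.normal(size=200) > 0).astype(int),
}

scores = {}
for name, y in labels.items():
    clf = LogisticRegression(C=1.0)
    # One independent model per label; the score is the mean ROC AUC over 3 folds.
    scores[name] = cross_val_score(clf, X, y, cv=3, scoring="roc_auc").mean()
    print(name, round(scores[name], 3))

print("mean", round(float(np.mean(list(scores.values()))), 3))
```

Passing `cv` explicitly keeps the number of folds fixed, since the default used by `cross_val_score` has changed between scikit-learn releases.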

In [16]:
%%time
#kaggle 97.98
scores= []
c_s = [1.5, 1, 1.5, 1.8, 1, 1]

for c, class_name in zip(c_s, class_names):
    train_target = train[class_name]
    classifier = LogisticRegression(solver='sag', C=c)
    #grid = GridSearchCV(param_grid=parameters, estimator=classifier, scoring='roc_auc', n_jobs=-1)
    cv_score = np.mean(cross_val_score(classifier, train_features, train_target, scoring='roc_auc'))
    #grid.fit(train_features, train_target)
    print('CV score for class {} is {}'.format(class_name, cv_score))
    scores.append(cv_score)
    #c_s.append(grid.best_params_)

print('Total score is {}'.format(np.mean(scores)))

CV score for class toxic is 0.9794433352823435
CV score for class severe_toxic is 0.988447230370659
CV score for class obscene is 0.9906364594646182
CV score for class threat is 0.9898797548530512
CV score for class insult is 0.9833815847019892
CV score for class identity_hate is 0.9832311610565768
Total score is 0.9858365876215397
CPU times: user 8min 33s, sys: 27.1 s, total: 9min
Wall time: 9min


Try to find the best parameters for `word_vectorizer` and `classifier`, optimizing the [ROC AUC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) metric.
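One way to search over vectorizer and classifier parameters together is to wrap them in a `Pipeline` and pass it to `GridSearchCV`. The sketch below does this on eight made-up sentences with a deliberately tiny grid; the texts, labels, and grid values are assumptions for illustration, not recommended settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration only (1 = toxic, 0 = not toxic).
texts = [
    "you are an idiot", "thanks for the helpful edit",
    "stupid idiot go away", "nice article thanks",
    "what a stupid comment", "great work on this page",
    "idiot stupid fool", "thanks great help",
]
y = [1, 0, 1, 0, 1, 0, 1, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(sublinear_tf=True)),
    ("clf", LogisticRegression()),
])

# Parameter names follow the "<step>__<parameter>" convention.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.5, 1.0, 2.0],
}

search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=2)
search.fit(texts, y)
print(search.best_params_, search.best_score_)
```

Unlike the cells above, which fit the vectorizer once on train and test text together, a pipeline refits the vectorizer on the training part of every fold. That is slower on the full dataset, so in practice the grid may need to stay small.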


---

Submit your best solution to the [Kaggle Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/submit).

In [17]:
submission = pd.DataFrame.from_dict({'id': test['id']})

In [18]:
%%time
c_s = [1.5, 1, 1.5, 1.8, 1, 1]
for c, class_name in zip(c_s, class_names):
    classifier = LogisticRegression(solver='sag', n_jobs=-1, C=c)
    train_target = train[class_name]
    classifier.fit(train_features, train_target)
    submission[class_name] = classifier.predict_proba(test_features)[:, 1]    

CPU times: user 5min 18s, sys: 11.9 s, total: 5min 30s
Wall time: 5min 31s


In [21]:
submission.head()

| | id | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|
| 0 | 00001cee341fdb12 | 0.999931 | 0.234588 | 0.999732 | 0.089795 | 0.984243 | 0.307368 |
| 1 | 0000247867823ef7 | 0.003666 | 0.001656 | 0.001487 | 0.00015 | 0.004741 | 0.002385 |
| 2 | 00013b17ad220c46 | 0.008445 | 0.002868 | 0.0048 | 0.000364 | 0.003073 | 0.001347 |
| 3 | 00017563c3f7919a | 0.002142 | 0.001437 | 0.001673 | 0.000546 | 0.003105 | 0.000585 |
| 4 | 00017695ad8997eb | 0.012878 | 0.001502 | 0.003669 | 0.000642 | 0.005104 | 0.000995 |


In [22]:
submission.to_csv('submission.csv', index=False)
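Before uploading, it can be worth checking that the frame has the expected shape. The helper below is a hypothetical sanity check (`check_submission` is not part of pandas or Kaggle tooling), demonstrated on a made-up two-row frame rather than the real `submission`.

```python
import pandas as pd

class_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# Toy stand-in for the real submission frame (values are made up).
toy_submission = pd.DataFrame({'id': ['a1', 'b2']})
for name in class_names:
    toy_submission[name] = [0.9, 0.1]


def check_submission(df, class_names):
    """Return a list of problems found; an empty list means the frame looks fine."""
    if list(df.columns) != ['id'] + list(class_names):
        return ['unexpected columns']
    problems = []
    if df['id'].duplicated().any():
        problems.append('duplicate ids')
    probs = df[class_names]
    if probs.isna().any().any():
        problems.append('missing values')
    if ((probs < 0) | (probs > 1)).any().any():
        problems.append('probabilities outside [0, 1]')
    return problems


print(check_submission(toy_submission, class_names))  # []
```

The same call could be applied to the real `submission` frame before `to_csv`.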

## Some of the 1000 attempts that I decided to keep for myself

In [78]:
# kaggle 97.95, worse
def tokenizer(text):
    #tokens = ToktokTokenizer().tokenize(text.lower())
    #tokens = TweetTokenizer().tokenize(text.lower())
    tokens = tokenize(text.lower())
    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]
    return lemmas
word_vectorizer = TfidfVectorizer(ngram_range=(1, 1), 
                                  tokenizer=tokenizer,
                                  max_features=15000, 
                                  norm='l2',
                                  smooth_idf=False,
                                  sublinear_tf=True,
                                  strip_accents='unicode')
char_vectorizer = TfidfVectorizer(sublinear_tf=True,
                                  strip_accents='unicode',
                                  norm = 'l2',
                                  analyzer='char',
                                  ngram_range=(2, 4),
                                  smooth_idf=False,
                                  max_features=50000,
                                  stop_words=list(STOPWORDS))  # note: stop_words is ignored when analyzer='char'
word_vectorizer.fit(all_text)
train_word_features = word_vectorizer.transform(train_text)
test_word_features = word_vectorizer.transform(test_text)
char_vectorizer.fit(all_text)
train_char_features = char_vectorizer.transform(train_text)
test_char_features = char_vectorizer.transform(test_text)
train_features = hstack([train_char_features, train_word_features])
test_features = hstack([test_char_features, test_word_features])
scores= []
c_s = [1.5, 1, 1.5, 1.8, 1, 1]

for c, class_name in zip(c_s, class_names):
    train_target = train[class_name]
    classifier = LogisticRegression(solver='lbfgs', C=c)
    cv_score = np.mean(cross_val_score(classifier, train_features, train_target, scoring='roc_auc'))
    print('CV score for class {} is {}'.format(class_name, cv_score))
    scores.append(cv_score)

print('Total score is {}'.format(np.mean(scores)))

CV score for class toxic is 0.9794002214479471
CV score for class severe_toxic is 0.9883912317939827
CV score for class obscene is 0.9907130943933756
CV score for class threat is 0.9897664108800015
CV score for class insult is 0.983198022464808
CV score for class identity_hate is 0.9832588955950823
Total score is 0.9857879794291996


In [108]:
# kaggle 97.96, worse
def tokenizer(text):
    #tokens = ToktokTokenizer().tokenize(text.lower())
    #tokens = TweetTokenizer().tokenize(text.lower())
    tokens = tokenize(text.lower())
    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]
    return lemmas
word_vectorizer = TfidfVectorizer(ngram_range=(1, 1), 
                                  tokenizer=tokenizer,
                                  max_features=15000, 
                                  norm='l2',
                                  smooth_idf=True,
                                  sublinear_tf=True,
                                  strip_accents='unicode', 
                                  max_df=0.95)
char_vectorizer = TfidfVectorizer(sublinear_tf=True,
                                  strip_accents='unicode',
                                  norm = 'l2',
                                  analyzer='char',
                                  ngram_range=(2, 4),
                                  smooth_idf=True,
                                  max_features=30000,
                                  stop_words=list(STOPWORDS))  # note: stop_words is ignored when analyzer='char'
word_vectorizer.fit(all_text)
train_word_features = word_vectorizer.transform(train_text)
test_word_features = word_vectorizer.transform(test_text)
char_vectorizer.fit(all_text)
train_char_features = char_vectorizer.transform(train_text)
test_char_features = char_vectorizer.transform(test_text)
train_features = hstack([train_char_features, train_word_features])
test_features = hstack([test_char_features, test_word_features])
scores= []
c_s = [1.5, 1, 1.5, 1.8, 1, 1]

for c, class_name in zip(c_s, class_names):
    train_target = train[class_name]
    classifier = LogisticRegression(solver='lbfgs', C=c)
    cv_score = np.mean(cross_val_score(classifier, train_features, train_target, scoring='roc_auc'))
    print('CV score for class {} is {}'.format(class_name, cv_score))
    scores.append(cv_score)

print('Total score is {}'.format(np.mean(scores)))


c_s = [1.5, 1, 1.5, 1.8, 1, 1]
for c, class_name in zip(c_s, class_names):
    classifier = LogisticRegression(solver='lbfgs', n_jobs=-1, C=c)
    train_target = train[class_name]
    classifier.fit(train_features, train_target)
    submission[class_name] = classifier.predict_proba(test_features)[:, 1]
    
submission.to_csv('submission1.csv', index=False)

CV score for class toxic is 0.9793825513525802
CV score for class severe_toxic is 0.9884973386970457
CV score for class obscene is 0.9907118909144917
CV score for class threat is 0.9897773714502339
CV score for class insult is 0.9832516759406967
CV score for class identity_hate is 0.9833925208704678
Total score is 0.9858355582042527


In [7]:
#kaggle 97.93
def tokenizer(text):
    #tokens = ToktokTokenizer().tokenize(text.lower())
    #tokens = TweetTokenizer().tokenize(text.lower())
    tokens = tokenize(text.lower())
    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]
    stemmer = LancasterStemmer()
    stems = [stemmer.stem(l) for l in lemmas]
    return stems
word_vectorizer = TfidfVectorizer(ngram_range=(1, 1), 
                                  tokenizer=tokenizer,
                                  max_features=15000, 
                                  norm='l2',
                                  smooth_idf=False,
                                  sublinear_tf=True,
                                  strip_accents='unicode')
char_vectorizer = TfidfVectorizer(sublinear_tf=True,
                                  strip_accents='unicode',
                                  norm = 'l2',
                                  analyzer='char',
                                  ngram_range=(2, 5),
                                  smooth_idf=False,
                                  max_features=50000)
word_vectorizer.fit(all_text)
train_word_features = word_vectorizer.transform(train_text)
test_word_features = word_vectorizer.transform(test_text)
print("1")
char_vectorizer.fit(all_text)
train_char_features = char_vectorizer.transform(train_text)
test_char_features = char_vectorizer.transform(test_text)
print("1")
train_features = hstack([train_char_features, train_word_features])
test_features = hstack([test_char_features, test_word_features])
print("1")
submission = pd.DataFrame.from_dict({'id': test['id']})
c_s = [1.5, 1, 1.5, 1.8, 1, 1]
for c, class_name in zip(c_s, class_names):
    classifier = LogisticRegression(solver='sag', n_jobs=-1, C=c)
    train_target = train[class_name]
    classifier.fit(train_features, train_target)
    submission[class_name] = classifier.predict_proba(test_features)[:, 1]
    
submission.to_csv('submission1.csv', index=False)

1
1
1


In [None]:
def tokenizer(text):
    #tokens = ToktokTokenizer().tokenize(text.lower())
    #tokens = TweetTokenizer().tokenize(text.lower())
    tokens = tokenize(text.lower())
    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]
    return lemmas
word_vectorizer = TfidfVectorizer(ngram_range=(1, 1), 
                                  tokenizer=tokenizer,
                                  max_features=15000, 
                                  norm='l2',
                                  smooth_idf=True,
                                  sublinear_tf=True,
                                  strip_accents='unicode',
                                  min_df=2, max_df=0.7)
word_vectorizer.fit(all_text)
train_word_features = word_vectorizer.transform(train_text)
test_word_features = word_vectorizer.transform(test_text)
print("1")
train_features = hstack([train_char_features, train_word_features])
test_features = hstack([test_char_features, test_word_features])
print("1")
submission = pd.DataFrame.from_dict({'id': test['id']})
c_s = [1.5, 1, 1.5, 1.8, 1, 1]
for c, class_name in zip(c_s, class_names):
    classifier = LogisticRegression(solver='sag', n_jobs=-1, C=c)
    train_target = train[class_name]
    classifier.fit(train_features, train_target)
    submission[class_name] = classifier.predict_proba(test_features)[:, 1]
    
submission.to_csv('submission1.csv', index=False)

1
1
