# Homework 2 - TF-IDF Classifier

Your goal is to train a classifier that detects "toxic" comments and to submit your solution to the Kaggle [Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge).

While training the model, answer the ***[questions](https://docs.google.com/forms/d/e/1FAIpQLSd9mQx8EFpSH6FhCy1M_FmISzy3lhgyyqV3TN0pmtop7slmTA/viewform?usp=sf_link)***.

The data can be downloaded here: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data



In [1]:
import numpy as np
import pandas as pd
import nltk

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, learning_curve, GridSearchCV
from scipy.sparse import hstack, save_npz, load_npz

import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
class_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
train = pd.read_csv('../data/train.csv')
test = pd.read_csv('../data/test.csv')

The standard approaches to text analysis are [Bag of words](https://en.wikipedia.org/wiki/Bag-of-words_model) and its modification, [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).

They are implemented in `sklearn` as [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

You can read more about them [here](https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-feature-extraction-and-engineering.ipynb).
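
To make the difference between the two vectorizers concrete, here is a minimal sketch on a tiny corpus. The `toy_docs` list and all variable names are hypothetical and are not part of the competition data; both vectorizers use their default settings.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, used only for illustration.
toy_docs = [
    "you are great",
    "you are awful",
    "great great work",
]

# Bag of words: each column holds the raw count of one token.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(toy_docs)

# TF-IDF: counts are reweighted so that tokens occurring in fewer
# documents get a larger weight; rows are L2-normalized by default.
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(toy_docs)

# Column order follows the alphabetically sorted vocabulary.
vocab = sorted(count_vec.vocabulary_)
print(vocab)
print(counts.toarray())
print(np.round(tfidf.toarray(), 2))
```

In the second document, "awful" appears in only one document while "you" appears in two, so TF-IDF gives "awful" the larger weight even though both have a raw count of 1.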

In [3]:
train_text = train['comment_text']
test_text = test['comment_text']
all_text = pd.concat([train_text, test_text])

# Most frequent words

In [10]:
words = pd.Series(' '.join(all_text).split())
words.value_counts()[:5]

the    820868
to     522905
of     400016
a      379213
and    376676
dtype: int64
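
The top of this list consists of function words that carry little signal about toxicity. The sketch below, on hypothetical toy comments, counts tokens with and without scikit-learn's built-in English stop word list (`ENGLISH_STOP_WORDS`); unlike the cell above, it also lowercases the text, which is an added assumption.

```python
from collections import Counter
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Hypothetical toy comments, used only for illustration.
toy_comments = [
    "The article is wrong and the sources are bad",
    "Thanks for fixing the article",
    "The sources look fine to me",
]

# Lowercase and split on whitespace, then count every token.
tokens = " ".join(toy_comments).lower().split()
raw_counts = Counter(tokens)

# Count again, skipping tokens from the built-in stop word list.
content_counts = Counter(t for t in tokens if t not in ENGLISH_STOP_WORDS)

print(raw_counts.most_common(3))
print(content_counts.most_common(3))
```

Passing `stop_words="english"` to a vectorizer applies the same list during tokenization.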

# Feature generation

In [None]:
word_vectorizer = TfidfVectorizer(stop_words="english", max_features=50000, sublinear_tf=True)
word_vectorizer.fit(all_text)
train_word_features = word_vectorizer.transform(train_text)

In [None]:
char_vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(2, 3), max_features=50000, sublinear_tf=True)
char_vectorizer.fit(all_text)
train_char_features = char_vectorizer.transform(train_text)

In [None]:
train_features = hstack([train_word_features, train_char_features])
del train_word_features
del train_char_features
save_npz('train.npz', train_features)

or load the precomputed matrix

In [4]:
train_features = load_npz('train.npz')

In [5]:
train_features.shape

(159571, 100000)
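
The 100000 columns are the word features (at most 50000) followed by the character n-gram features (at most 50000), one row per training comment. A minimal sketch of the same stacking on a hypothetical toy corpus, with `max_features` omitted because the toy vocabulary is tiny:

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only for illustration.
toy_docs = ["you are great", "you are awful", "great work"]

# Word-level and character-level TF-IDF, mirroring the settings above.
word_vec = TfidfVectorizer(sublinear_tf=True)
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 3), sublinear_tf=True)

word_part = word_vec.fit_transform(toy_docs)
char_part = char_vec.fit_transform(toy_docs)

# Stack the two blocks column-wise. The result format can vary,
# so convert explicitly to CSR, which supports row slicing.
combined = hstack([word_part, char_part]).tocsr()

print(word_part.shape, char_part.shape, combined.shape)
```

The row count stays the same, and the column count of `combined` is the sum of the two blocks.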

# Parameter tuning

In [6]:
best_params = {}
best_models = {}

Try to find the best parameters for `word_vectorizer` and `classifier`, optimizing the [ROC AUC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) metric.
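
One possible way to tune the vectorizer and the classifier together is to wrap both in a scikit-learn `Pipeline` and search over the combined grid. This is only a sketch under assumptions: the toy texts, labels, and grid values are hypothetical, and on the real data refitting the vectorizer in every fold is considerably slower than searching over a precomputed matrix.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 1 = toxic, 0 = clean.
toy_texts = [
    "you are a stupid idiot", "thanks for the helpful edit",
    "shut up you idiot", "great article thanks",
    "what a stupid article", "please add a source here",
    "you are so dumb", "the edit looks good to me",
    "this is dumb and stupid", "good point thanks for the source",
    "nobody likes you idiot", "I agree with this helpful change",
]
toy_labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(solver="lbfgs")),
])

# Keys use the "<step name>__<parameter>" convention of Pipeline.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__sublinear_tf": [True, False],
    "clf__C": [0.3, 1, 3],
}

# 3-fold stratified cross-validation, scored by ROC AUC.
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=3)
search.fit(toy_texts, toy_labels)

print(search.best_params_)
print(search.best_score_)
```

Because the vectorizer is refit inside each fold, its vocabulary and IDF weights are learned only from that fold's training part.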


In [7]:
%%time
scores = []

for class_name in class_names:
    train_target = train[class_name]
    params = {'C': [0.3, 1, 1.5, 2, 3]}
    
    grid = GridSearchCV(LogisticRegression(solver='lbfgs', n_jobs=-1), params, scoring='roc_auc')
    grid.fit(train_features, train_target)
    best_params[class_name] = grid.best_params_
    best_models[class_name] = grid.best_estimator_
    print(grid.best_params_)
    print('CV score for class {} is {}'.format(class_name, grid.best_score_))
    scores.append(grid.best_score_)

print('Total score is {}'.format(np.mean(scores)))


{'C': 1.5}
CV score for class toxic is 0.9781422249249571
{'C': 1}
CV score for class severe_toxic is 0.9881588284377838
{'C': 1.5}
CV score for class obscene is 0.9901983096686441
{'C': 2}
CV score for class threat is 0.9884130628542789
{'C': 1}
CV score for class insult is 0.9819761475826706
{'C': 1}
CV score for class identity_hate is 0.9822984537669159
Total score is 0.9848645045392085
CPU times: user 6min 20s, sys: 3min 6s, total: 9min 26s
Wall time: 1h 7min 47s


---

Submit your best solution to the [Kaggle Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/submit).

In [8]:
submission = pd.DataFrame.from_dict({'id': test['id']})

In [None]:
test_word_features = word_vectorizer.transform(test_text)
test_char_features = char_vectorizer.transform(test_text)
test_features = hstack([test_word_features, test_char_features])
del test_word_features
del test_char_features

In [None]:
save_npz('test.npz', test_features)

In [9]:
test_features = load_npz('test.npz')

In [10]:
for class_name in class_names:
    classifier = best_models[class_name]
    
    submission[class_name] = classifier.predict_proba(test_features)[:, 1]    

In [11]:
submission.to_csv('best_subm.csv', index=False)