# Homework 2 - TF-IDF Classifier

Your goal is to train a classifier that detects "toxic" comments and to submit your solution to the Kaggle [Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge).

While training the model, answer the ***[questions](https://docs.google.com/forms/d/e/1FAIpQLSd9mQx8EFpSH6FhCy1M_FmISzy3lhgyyqV3TN0pmtop7slmTA/viewform?usp=sf_link)***.

The data can be downloaded here: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data



In [1]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

In [2]:
class_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

train = pd.read_csv('train.csv').fillna(' ')
test = pd.read_csv('test.csv').fillna(' ')

## Most popular word

In [3]:
train_text = train['comment_text']
test_text = test['comment_text']
all_text = pd.concat([train_text, test_text])

In [4]:
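word_vect = TfidfVectorizer(binary=True)
words = word_vect.fit_transform(all_text)
# Term with the largest summed TF-IDF weight over all comments.
# On scikit-learn >= 1.0, use get_feature_names_out() instead.
word_vect.get_feature_names()[words.sum(axis=0).argmax()]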

'the'

In [5]:
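count_vectorizer = CountVectorizer(stop_words='english')
word_features = count_vectorizer.fit_transform(all_text)
# Most frequent term once English stop words are removed.
# On scikit-learn >= 1.0, use get_feature_names_out() instead.
count_vectorizer.get_feature_names()[word_features.sum(axis=0).argmax()]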

'article'
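The two cells above find the top term by summing each column of the document-term matrix and taking the `argmax`. Note that cell 4 sums TF-IDF weights, while cell 5 sums raw counts. A minimal sketch of the count-based variant on a toy corpus (the three documents and the names `docs`, `totals`, and `top_term` are made up for illustration) might look like this:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, not taken from the competition data.
docs = ["the cat sat", "the dog sat on the mat", "a cat"]

cv = CountVectorizer()
counts = cv.fit_transform(docs)  # sparse matrix: documents x terms

# Sum each column to get the total number of occurrences per term.
totals = np.asarray(counts.sum(axis=0)).ravel()

# Invert the term -> column mapping so it works on any scikit-learn version.
index_to_term = {index: term for term, index in cv.vocabulary_.items()}
top_term = index_to_term[int(totals.argmax())]
print(top_term)
```

Using `vocabulary_` to map the column index back to a term avoids depending on either `get_feature_names()` or `get_feature_names_out()`.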

In [11]:
word_vectorizer = TfidfVectorizer(
    binary=True,
    min_df=3,
    norm='l2',
    lowercase=True,
    smooth_idf=True,
    ngram_range=(1, 1),
)
word_vectorizer

TfidfVectorizer(analyzer='word', binary=True, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=3,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [12]:
word_vectorizer.fit(all_text)
train_word_features = word_vectorizer.transform(train_text)
test_word_features = word_vectorizer.transform(test_text)
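To make the vectorizer settings more concrete, the sketch below reproduces by hand what `binary=True`, `smooth_idf=True`, and `norm='l2'` compute, and compares the result with `TfidfVectorizer` itself. It assumes scikit-learn's documented smoothed formula `idf = ln((1 + n) / (1 + df)) + 1`. The toy documents are invented, and `min_df` is left at its default because a three-document corpus has no term that appears in three documents.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, not taken from the competition data.
docs = ["you are great", "you are rude rude rude", "great article"]

vec = TfidfVectorizer(binary=True, norm='l2', smooth_idf=True)
X = vec.fit_transform(docs).toarray()

# Terms ordered by their column index in X.
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

# binary=True: term frequency is 1 if the term occurs at all, else 0.
tf = np.array([[1.0 if term in doc.split() else 0.0 for term in vocab]
               for doc in docs])

# smooth_idf=True: add one to the document count and to each document frequency.
n_docs = len(docs)
df = tf.sum(axis=0)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# norm='l2': scale every row to unit Euclidean length.
manual = tf * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)

print(np.allclose(X, manual))
```

With `binary=True`, the repeated "rude" in the second document counts once, so only the presence of a term and its rarity across documents affect the weight.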

For classification we will use logistic regression: [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

In [18]:
classifier = LogisticRegression(C=2.1, penalty='l2', solver='sag', n_jobs=-1)
classifier

LogisticRegression(C=2.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=-1,
          penalty='l2', random_state=None, solver='sag', tol=0.0001,
          verbose=0, warm_start=False)

In [19]:
scores = []

for class_name in class_names:
    train_target = train[class_name]

    # Mean ROC AUC over the cross-validation folds for this label
    cv_score = np.mean(cross_val_score(classifier, train_word_features, train_target, scoring='roc_auc'))

    print('CV score for class {} is {}'.format(class_name, cv_score))
    scores.append(cv_score)

# Unweighted mean of the six per-label scores
print('Total score is {}'.format(np.mean(scores)))

CV score for class toxic is 0.9725987960312402
CV score for class severe_toxic is 0.9850440655347154
CV score for class obscene is 0.9858141438756091
CV score for class threat is 0.9878025080075586
CV score for class insult is 0.9782569598734153
CV score for class identity_hate is 0.9759329231951609
Total score is 0.9809082327529498
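
The scores above are ROC AUC values. One way to read them: ROC AUC is the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as one half. The sketch below checks that reading against `roc_auc_score` on four made-up predictions. The labels and scores are illustrative only.

```python
from itertools import product

from sklearn.metrics import roc_auc_score

# Toy labels and predicted scores, not taken from the competition data.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

positives = [s for s, y in zip(y_score, y_true) if y == 1]
negatives = [s for s, y in zip(y_score, y_true) if y == 0]

# Count the pairs where the positive outranks the negative (ties count half).
pairs = list(product(positives, negatives))
pairwise_auc = sum(
    1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs
) / len(pairs)

library_auc = roc_auc_score(y_true, y_score)
print(pairwise_auc, library_auc)
```

Because the metric depends only on the ranking of the scores, it stays informative for rare labels such as `threat`, where accuracy would be dominated by the negative class.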


In [20]:
submission = pd.DataFrame.from_dict({'id': test['id']})

In [21]:
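for class_name in class_names:
    # Refit the same estimator on the full training set for each label
    classifier.fit(train_word_features, train[class_name])
    # Column 1 of predict_proba is the probability of the positive class
    submission[class_name] = classifier.predict_proba(test_word_features)[:, 1]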
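
The loop above fits one independent binary model per label, which is the one-vs-rest scheme for multilabel data. As a rough sketch, the same idea can be expressed with scikit-learn's `OneVsRestClassifier`. The synthetic `X` and `Y` below are assumptions for illustration, and the default `lbfgs` solver is swapped in for `sag` so that both variants are deterministic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic multilabel data: 200 samples, 5 features, 3 binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Y = (X[:, :3] + 0.5 * rng.normal(size=(200, 3)) > 0).astype(int)

# Manual variant: one binary logistic regression per label column.
loop_probs = np.column_stack([
    LogisticRegression().fit(X, Y[:, j]).predict_proba(X)[:, 1]
    for j in range(Y.shape[1])
])

# Wrapper variant: OneVsRestClassifier clones and fits the estimator per label.
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
ovr_probs = ovr.predict_proba(X)

print(np.allclose(loop_probs, ovr_probs, atol=1e-6))
```

The explicit loop in the notebook makes it easy to tune or inspect each label separately. The wrapper is more compact when all labels share one configuration.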

In [22]:
submission.to_csv('submission.csv', index=False)
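
Before uploading, it can help to sanity-check the file. The sketch below assumes the expected layout is an `id` column followed by the six label columns, each holding probabilities in [0, 1]. `check_submission` is a hypothetical helper, not part of pandas or Kaggle tooling, and the toy frame stands in for the real `submission`.

```python
import numpy as np
import pandas as pd

class_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']


def check_submission(df, class_names):
    """Hypothetical helper: raise ValueError if the frame looks malformed."""
    expected = ['id'] + list(class_names)
    if list(df.columns) != expected:
        raise ValueError('unexpected columns: {}'.format(list(df.columns)))
    probs = df[list(class_names)]
    if probs.isna().any().any():
        raise ValueError('missing probabilities')
    if not ((probs >= 0) & (probs <= 1)).all().all():
        raise ValueError('probabilities outside [0, 1]')
    return True


# Toy stand-in for the real submission frame.
rng = np.random.default_rng(0)
toy = pd.DataFrame({'id': ['a', 'b', 'c']})
for name in class_names:
    toy[name] = rng.random(len(toy))

ok = check_submission(toy, class_names)
print(ok)
```

In the notebook, the same check could be called as `check_submission(submission, class_names)` right before `to_csv`.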