# Homework 2 - TF-IDF Classifier

Your goal is to train a classifier that detects "toxic" comments and to submit your solution to the Kaggle [Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge).

While training, you also need to answer these ***[questions](https://docs.google.com/forms/d/e/1FAIpQLSd9mQx8EFpSH6FhCy1M_FmISzy3lhgyyqV3TN0pmtop7slmTA/viewform?usp=sf_link)***.

The data can be downloaded here: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data



In [2]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

In [3]:
class_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

train = pd.read_csv('D:/STUDY/Data_Science_Study/University_course/HW_2/train.csv').fillna(' ')
test = pd.read_csv('D:/STUDY/Data_Science_Study/University_course/HW_2/test.csv').fillna(' ')

The standard approaches to text analysis are [Bag of words](https://en.wikipedia.org/wiki/Bag-of-words_model) and its modification [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).

They are implemented in `sklearn` as [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

More details about them can be found [here](https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-feature-extraction-and-engineering.ipynb).
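
To make the difference between the two vectorizers concrete, here is a minimal sketch on a made-up three-document corpus (the sentences are illustrative and not taken from the competition data). `CountVectorizer` stores raw term counts, while `TfidfVectorizer` down-weights terms that occur in many documents and, with its default settings, L2-normalizes every row.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (hypothetical, not from the competition data).
docs = [
    "you are great",
    "you are awful",
    "awful awful comment",
]

# Bag of words: each column is a term, each cell is a raw count.
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(docs)

# TF-IDF: counts are reweighted by inverse document frequency.
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(docs)

# Columns are ordered alphabetically by term.
vocab = sorted(count_vectorizer.vocabulary_)
print(vocab)
print(counts.toarray())
print(tfidf.toarray().round(2))
```

In the first document, "great" should receive a larger TF-IDF weight than "you", because "you" also appears in the second document.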

In [4]:
train_text = train['comment_text']
test_text = test['comment_text']
#all_text = pd.concat([train_text, test_text])

# Data Cleaning

In [5]:
import re

In [6]:
def cleaning_data(noizy_comment):
    # Drop URLs, then lowercase so the contraction patterns below match
    noizy_comment = re.sub(r'http\S+', '', noizy_comment)
    noizy_comment = noizy_comment.lower()
    # Expand contractions; specific patterns must run before the generic ones
    # ("can't" before "n't", "'scuse" before "'s")
    noizy_comment = re.sub(r"what's", "what is ", noizy_comment)
    noizy_comment = re.sub(r"'scuse", " excuse ", noizy_comment)
    noizy_comment = re.sub(r"'s", " ", noizy_comment)
    noizy_comment = re.sub(r"'ve", " have ", noizy_comment)
    noizy_comment = re.sub(r"can't", "cannot ", noizy_comment)
    noizy_comment = re.sub(r"n't", " not ", noizy_comment)
    noizy_comment = re.sub(r"i'm", "i am ", noizy_comment)
    noizy_comment = re.sub(r"'re", " are ", noizy_comment)
    noizy_comment = re.sub(r"'d", " would ", noizy_comment)
    noizy_comment = re.sub(r"'ll", " will ", noizy_comment)
    # Replace every non-word character with a space, then collapse whitespace
    # (\s+ already covers repeated spaces and newlines)
    noizy_comment = re.sub(r'\W', ' ', noizy_comment)
    noizy_comment = re.sub(r'\s+', ' ', noizy_comment)
    return noizy_comment.strip(' ')

In [7]:
def data_set_cleaning(noizy_data_set):
    # Clean every comment and return a new Series with a fresh 0..n-1 index
    cleaned_text = [cleaning_data(comment) for comment in noizy_data_set]
    return pd.Series(cleaned_text).astype(str)

In [8]:
train_text = data_set_cleaning(train_text)

In [9]:
train_text.head()

0    explanation why the edits made under my userna...
1    d aww he matches this background colour i am s...
2    hey man i am really not trying to edit war it ...
3    more i cannot make any real suggestions on imp...
4    you sir are my hero any chance you remember wh...
dtype: object

In [10]:
test_text = data_set_cleaning(test_text)

In [11]:
all_text = pd.concat([train_text, test_text])

In [12]:
all_text.head()

0    explanation why the edits made under my userna...
1    d aww he matches this background colour i am s...
2    hey man i am really not trying to edit war it ...
3    more i cannot make any real suggestions on imp...
4    you sir are my hero any chance you remember wh...
dtype: object

In [13]:
# Try different vectorizers and n-gram sizes, stop words, pruning of rare words, and pruning of overly frequent words
word_vectorizer = TfidfVectorizer(analyzer='word',stop_words='english')
# Note: stop_words only applies when analyzer='word'; it has no effect on the char vectorizer below
char_vectorizer = TfidfVectorizer(analyzer='char',ngram_range=(1,4),stop_words='english')
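
The knobs mentioned in the comment above map onto `TfidfVectorizer` parameters: `ngram_range` for n-gram sizes, `min_df` for pruning rare terms, and `max_df` for pruning overly frequent terms. The sketch below shows their effect on vocabulary size; the four sentences and the specific thresholds are made up for illustration and are not recommended settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical).
toy_docs = [
    "you are a nice person",
    "you are a rude person",
    "you are rude and loud",
    "thanks for the nice edit",
]

# Unigrams only, no pruning.
base = TfidfVectorizer(analyzer='word')

# Unigrams and bigrams; drop terms seen in fewer than 2 documents
# and terms seen in more than 70% of documents.
tuned = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=2, max_df=0.7)

base.fit(toy_docs)
tuned.fit(toy_docs)

print(len(base.vocabulary_), sorted(base.vocabulary_))
print(len(tuned.vocabulary_), sorted(tuned.vocabulary_))
```

On real data, pruning of this kind is one way to shrink the feature matrix, which matters when the word and character features together reach hundreds of thousands of columns.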

In [14]:
from scipy.sparse import hstack

In [15]:
#train_vect_word = word_vectorizer.fit_transform(train_text)
word_vectorizer.fit(all_text)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [16]:
char_vectorizer.fit(all_text)

TfidfVectorizer(analyzer='char', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 4), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [17]:
train_word_features = word_vectorizer.transform(train_text)

In [18]:
train_char_features = char_vectorizer.transform(train_text)

In [19]:
test_char_features = char_vectorizer.transform(test_text)

In [20]:
test_word_features = word_vectorizer.transform(test_text)

In [22]:
del test_text 

In [24]:
del train_text

In [25]:
del all_text

In [26]:
train_features = hstack([train_word_features, train_char_features])

In [27]:
del train_word_features
del train_char_features

In [28]:
test_features = hstack([test_word_features, test_char_features])

In [29]:
train_features

<159571x947992 sparse matrix of type '<class 'numpy.float64'>'
	with 102831348 stored elements in COOrdinate format>

In [30]:
test_features

<153164x947992 sparse matrix of type '<class 'numpy.float64'>'
	with 88864365 stored elements in COOrdinate format>
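
The outputs above report the stacked matrices in COOrdinate format. COO matrices do not support row indexing, so consumers that slice rows (such as cross-validation) have to convert them first. As a hedged sketch, `scipy.sparse.hstack` accepts a `format` argument, so the result can be requested as CSR directly; the small matrices here are stand-ins for the real word and character features.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Tiny stand-ins for the word-level and char-level feature blocks.
word_block = csr_matrix(np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]]))
char_block = csr_matrix(np.array([[0.0, 4.0, 0.0], [5.0, 0.0, 0.0], [0.0, 0.0, 6.0]]))

# Ask for CSR explicitly instead of relying on the default output format.
stacked_csr = hstack([word_block, char_block], format='csr')

print(stacked_csr.format, stacked_csr.shape)
# Row slicing works on CSR.
print(stacked_csr[[0, 2]].toarray())
```

Whether this makes a measurable difference in this notebook is an assumption that would need to be timed on the actual data.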

In [31]:
#classifier = LogisticRegression(intercept_scaling=2.1,class_weight="balanced",solver="lbfgs")#ok lbfgs
classifier = LogisticRegression(solver="lbfgs")#ok lbfgs


In [32]:
scores= []

for class_name in class_names:
    train_target = train[class_name]

    cv_score = np.mean(cross_val_score(classifier, train_features, train_target, scoring='roc_auc'))
    
    print('CV score for class {} is {}'.format(class_name, cv_score))
    scores.append(cv_score)

print('Total score is {}'.format(np.mean(scores)))

CV score for class toxic is 0.9763835996794107
CV score for class severe_toxic is 0.9863922140280357
CV score for class obscene is 0.9888869666383201
CV score for class threat is 0.9858681687290457
CV score for class insult is 0.980764509561962
CV score for class identity_hate is 0.9808866897603467
Total score is 0.9831970247328535


In [50]:
scores= []

for class_name in class_names:
    train_target = train[class_name]

    cv_score = np.mean(cross_val_score(classifier, train_features, train_target, scoring='roc_auc'))
    
    print('CV score for class {} is {}'.format(class_name, cv_score))
    scores.append(cv_score)

print('Total score is {}'.format(np.mean(scores)))

CV score for class toxic is 0.9763836010389947
CV score for class severe_toxic is 0.9863922140280357
CV score for class obscene is 0.9888869666383201
CV score for class threat is 0.9858681687290457
CV score for class insult is 0.9807645145826589
CV score for class identity_hate is 0.9808867032410958
Total score is 0.983197028043025


In [33]:
submission = pd.DataFrame.from_dict({'id': test['id']})

In [34]:
for class_name in class_names:
    # Fit on the full training set for this label, then score the test set
    train_target = train[class_name]
    print("classifier",class_name)
    classifier.fit(train_features, train_target)
    print("submission",class_name)
    submission[class_name] = classifier.predict_proba(test_features)[:, 1]
    print("end",class_name)

classifier toxic
submission toxic
end toxic
classifier severe_toxic
submission severe_toxic
end severe_toxic
classifier obscene
submission obscene
end obscene
classifier threat
submission threat
end threat
classifier insult
submission insult
end insult
classifier identity_hate
submission identity_hate
end identity_hate


scores = []

for class_name in class_names:
    train_target = train[class_name]
    # Estimate quality with cross-validation, then refit on all training data
    cv_score = np.mean(cross_val_score(classifier, train_features, train_target, scoring='roc_auc'))
    print("classifier",class_name)
    classifier.fit(train_features, train_target)
    print("submission",class_name)
    submission[class_name] = classifier.predict_proba(test_features)[:, 1]
    print("end",class_name)

    print('CV score for class {} is {}'.format(class_name, cv_score))
    scores.append(cv_score)

print('Total score is {}'.format(np.mean(scores)))

In [35]:
submission.to_csv('D:/STUDY/Data_Science_Study/University_course/HW_2/submission.csv', index=False)

import os
print(os.listdir("../input"))

In [71]:
#print(os.listdir("../Output"))

In [None]:
# Try different vectorizers and n-gram sizes, stop words, pruning of rare words, and pruning of overly frequent words
word_vectorizer = ... # TfidfVectorizer or CountVectorizer

In [None]:
word_vectorizer.fit(all_text)
train_word_features = word_vectorizer.transform(train_text)
test_word_features = word_vectorizer.transform(test_text)

For classification we will use logistic regression: [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).
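
As a minimal sketch of the classifier's interface, the block below fits `LogisticRegression` on a small random sparse matrix and reads out the positive-class probability, which is the quantity the submission needs. The data is synthetic, and the parameter values (`C=1.0`, `max_iter=1000`) are illustrative assumptions rather than tuned settings.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression

# Synthetic sparse features and a binary target derived from the first column.
rng = np.random.RandomState(0)
dense = rng.rand(40, 5)
X = csr_matrix(dense)
y = (dense[:, 0] > 0.5).astype(int)

# C is the inverse regularization strength: smaller C means stronger regularization.
clf = LogisticRegression(C=1.0, solver='lbfgs', max_iter=1000)
clf.fit(X, y)

# Column 1 of predict_proba is the probability of the positive class.
proba = clf.predict_proba(X)[:, 1]
print(proba[:5].round(3))
```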

In [None]:
classifier = LogisticRegression(...) # Try different parameters and find the best ones with cross-validation

We will train one classifier per class.

To validate model quality, we will use the [cross_val_score](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function.
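
The one-classifier-per-class scheme combined with `cross_val_score` can be sketched on synthetic data as follows. The label names `label_a` and `label_b`, the feature matrix, and the explicit `StratifiedKFold` splitter are assumptions made for this illustration; passing a splitter explicitly simply makes the number of folds and the shuffling reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic features and two hypothetical binary labels.
rng = np.random.RandomState(42)
X = rng.normal(size=(200, 4))
labels = pd.DataFrame({
    'label_a': (X[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(int),
    'label_b': (X[:, 1] + 0.3 * rng.normal(size=200) > 0.5).astype(int),
})

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
model = LogisticRegression(solver='lbfgs', max_iter=1000)

# Each label is treated as an independent binary problem.
toy_scores = {}
for name in labels.columns:
    fold_scores = cross_val_score(model, X, labels[name], cv=cv, scoring='roc_auc')
    toy_scores[name] = fold_scores.mean()
    print('CV score for {} is {:.4f}'.format(name, toy_scores[name]))

print('Mean score is {:.4f}'.format(np.mean(list(toy_scores.values()))))
```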

In [None]:
scores= []

for class_name in class_names:
    train_target = train[class_name]

    cv_score = np.mean(cross_val_score(classifier, train_word_features, train_target, scoring='roc_auc'))
    
    print('CV score for class {} is {}'.format(class_name, cv_score))
    scores.append(cv_score)

print('Total score is {}'.format(np.mean(scores)))

Try to find the best parameters for `word_vectorizer` and `classifier` by optimizing the [ROC AUC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) metric.
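
One possible way to organize this search is `GridSearchCV` over a `Pipeline` that chains the vectorizer and the classifier, scored with `roc_auc`. The sketch below uses a tiny made-up corpus and an arbitrary parameter grid, so the selected values carry no meaning for the real task. Note one difference from the cells above: inside a pipeline the vectorizer is refit on each training fold, whereas this notebook fits it once on the combined train and test text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Toy corpus with a single hypothetical binary label (1 = toxic).
toy_text = [
    "you are an idiot", "thank you for the edit",
    "stupid idiot go away", "nice work on the article",
    "what a stupid comment", "thanks for the help",
    "go away you fool", "great article thanks",
    "you fool and idiot", "the edit looks nice",
    "stupid fool", "helpful and nice edit",
]
toy_target = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer='word')),
    ('clf', LogisticRegression(solver='lbfgs', max_iter=1000)),
])

# Parameter names are prefixed with the pipeline step name and two underscores.
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__sublinear_tf': [False, True],
    'clf__C': [0.1, 1.0, 10.0],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    scoring='roc_auc',
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(toy_text, toy_target)

print(search.best_params_)
print(search.best_score_)
```

With six labels, the same search would be repeated per label, or the mean ROC AUC across labels could be compared for a shared set of parameters.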


---

Submit your best solution to the [Kaggle Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/submit).

In [None]:
submission = pd.DataFrame.from_dict({'id': test['id']})

In [None]:
for class_name in class_names:
    .....
    classifier.fit(...)
    ...
    submission[class_name] = classifier.predict_proba(test_features)[:, 1]    

In [None]:
submission.to_csv('submission.csv', index=False)