<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span><ul class="toc-item"><li><span><a href="#LogisticRegression" data-toc-modified-id="LogisticRegression-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>LogisticRegression</a></span></li><li><span><a href="#LightGBM" data-toc-modified-id="LightGBM-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>LightGBM</a></span></li></ul></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conclusions</a></span></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Review checklist</a></span></li></ul></div>

# Project for "Викишоп" (Wikishop)

The online store Wikishop is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. A dataset labeled for the toxicity of edits is available.

Build a model with an *F1* score of at least 0.75.

**Project plan**

1. Load and prepare the data: remove extraneous characters from the texts, lemmatize, and remove stop words.
2. Train LogisticRegression and LightGBM models.
3. Select the model with the best F1 score and check it on the test set.

Using *BERT* is not required for this project, but you are welcome to try it.

**Data description**

The data is in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target.

## Loading the data

In [16]:
!pip install lightgbm -U

Collecting lightgbm
  Downloading lightgbm-4.3.0-py3-none-win_amd64.whl.metadata (19 kB)
Downloading lightgbm-4.3.0-py3-none-win_amd64.whl (1.3 MB)
Installing collected packages: lightgbm
Successfully installed lightgbm-4.3.0


In [32]:
# Standard library
import re
import warnings
from pathlib import Path

# Third-party
import pandas as pd
from lightgbm import LGBMClassifier
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split

In [33]:
warnings.filterwarnings('ignore')

In [4]:
df = pd.read_csv('/datasets/toxic_comments.csv', index_col=0)

df.info()
df.head(10)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 159292 entries, 0 to 159450
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159292 non-null  object
 1   toxic   159292 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 3.6+ MB


Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0
5,"""\n\nCongratulations from me as well, use the ...",0
6,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1
7,Your vandalism to the Matt Shirvington article...,0
8,Sorry if the word 'nonsense' was offensive to ...,0
9,alignment on this subject and which are contra...,0


## Preparing the data

In [8]:
df[df['text']=='']

Unnamed: 0,text,toxic


There are no records with empty comments. We prepare the texts for vectorization:
- convert the texts to lowercase;
- remove extraneous characters, keeping only words;
- split the text into tokens;
- lemmatize;
- remove stop words.

In [5]:
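# Create the NLTK helpers once instead of on every call.
# Assumes the NLTK 'wordnet' and 'stopwords' corpora are already downloaded.
TOKENIZER = TweetTokenizer()
LEMMATIZER = WordNetLemmatizer()
STOP_WORDS = set(stopwords.words('english'))  # a set makes membership checks fast


def prepare_comments(text):
    """Lowercase, clean, tokenize, lemmatize and drop stop words; return one string."""
    text = text.lower()
    text = remove_waste_symbols(text)
    tokens = tokenize(text)
    tokens = lemmatize(tokens)
    tokens = remove_stop_words(tokens)

    return ' '.join(tokens)


def remove_waste_symbols(text):
    # Replace every run of characters other than a-z and space with one space.
    return re.sub('[^a-z ]+', ' ', text)


def tokenize(text):
    return TOKENIZER.tokenize(text)


def lemmatize(tokens):
    # Without a POS tag, WordNetLemmatizer treats every word as a noun.
    return [LEMMATIZER.lemmatize(word) for word in tokens]


def remove_stop_words(tokens):
    return [word for word in tokens if word not in STOP_WORDS]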
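
To make the cleanup step concrete, here is a minimal sketch that applies the same regular expression to a shortened sample comment. It uses only the standard library; `str.split()` is an assumed stand-in for `TweetTokenizer`, so the lemmatization and stop-word steps are not reproduced here.

```python
import re


def remove_waste_symbols(text):
    # Same pattern as in the notebook: anything except a-z and space becomes a space.
    return re.sub('[^a-z ]+', ' ', text)


# Shortened sample text, lowercased first as in prepare_comments.
sample = "D'aww! He matches this background colour I'm"
cleaned = remove_waste_symbols(sample.lower())

# str.split() stands in for the tokenizer (an assumption for this sketch).
tokens = cleaned.split()
print(tokens)
# ['d', 'aww', 'he', 'matches', 'this', 'background', 'colour', 'i', 'm']
```

Replacing apostrophes with spaces leaves short fragments such as `d` and `m`. They do not appear in the `tokens` column shown for this row, which is consistent with them being dropped at the stop-word step.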

In [6]:
df['tokens'] = df['text'].apply(prepare_comments)
df.head()

Unnamed: 0,text,toxic,tokens
0,Explanation\nWhy the edits made under my usern...,0,explanation edits made username hardcore metal...
1,D'aww! He matches this background colour I'm s...,0,aww match background colour seemingly stuck th...
2,"Hey man, I'm really not trying to edit war. It...",0,hey man really trying edit war guy constantly ...
3,"""\nMore\nI can't make any real suggestions on ...",0,make real suggestion improvement wondered sect...
4,"You, sir, are my hero. Any chance you remember...",0,sir hero chance remember page


The initial text preparation is done: the texts are cleaned of extraneous characters and stop words and lemmatized.

Next, we split the data into training and test sets.

In [8]:
x_train, x_test, y_train, y_test = train_test_split(
    df['tokens'], df['toxic'], test_size=.2, random_state=12345, stratify=df['toxic'])
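
The `stratify=df['toxic']` argument asks `train_test_split` to keep the share of toxic comments the same in both parts. This matters here because only about 10% of the labels are positive (the LightGBM log later in the notebook reports `pavg=0.101613`). The sketch below illustrates the idea in plain Python on made-up labels. The function `stratified_split` is hypothetical and is not how scikit-learn implements it.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=12345):
    """Split indices so that each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def positive_share(labels, indices):
    return sum(labels[i] for i in indices) / len(indices)


# Made-up labels: 10 toxic comments out of 100.
labels = [1] * 10 + [0] * 90
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))
print(positive_share(labels, train_idx), positive_share(labels, test_idx))
```

Both parts end up with the same 10% share of positive labels, so F1 on the test part is measured on a class balance that matches training.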

The last step before training is to convert the texts into vectors. We use TfidfVectorizer, fitted on the training set only and then applied to the test set.

In [9]:
vectorizer = TfidfVectorizer()

x_train = vectorizer.fit_transform(x_train)
x_test = vectorizer.transform(x_test)
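
For intuition, the weighting that `TfidfVectorizer` applies with its default settings can be sketched in plain Python: raw term counts are multiplied by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, and each document vector is then scaled to unit L2 norm. The function `tfidf` below is a hypothetical illustration; it splits on whitespace, which is a simplification of the vectorizer's own tokenization.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return per-document {term: weight} dicts and the idf table."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: count * idf[term] for term, count in tf.items()}
        # L2-normalize; "or 1.0" guards against an empty document.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors, idf


docs = ["hey man edit war", "sir hero chance remember page", "edit page"]
vectors, idf = tfidf(docs)

# Terms that occur in more documents get a lower idf.
print(round(idf["edit"], 4), round(idf["hero"], 4))
```

In this toy corpus `edit` occurs in two documents and `hero` in one, so `edit` receives the smaller weight.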

## Training


We train logistic regression and gradient boosting models and pick the one with the better F1 score.
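
As a reminder of what is being optimized, F1 is the harmonic mean of precision and recall for the positive class (`toxic = 1`). The sketch below computes it by hand on made-up labels. The function `f1_binary` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
def f1_binary(y_true, y_pred):
    """F1 for the positive class (label 1); returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up example: 3 toxic comments out of 10.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
print(f1_binary(y_true, y_pred))

# A model that never predicts "toxic" is right 70% of the time here, but its F1 is 0.
all_negative = [0] * len(y_true)
print(f1_binary(y_true, all_negative))
```

This is why F1 is a more informative target than accuracy when the positive class is rare, as it is in this dataset.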

### LogisticRegression

In [26]:
params_logreg = {
    'max_iter': [10, 100],
    'C':[0.1, 1.0, 10.0]
}

f1_logreg = make_scorer(f1_score)

model_logreg = GridSearchCV(LogisticRegression(random_state=12345), param_grid=params_logreg, scoring=f1_logreg, verbose=10)
model_logreg.fit(x_train, y_train)

best_params_logreg = model_logreg.best_params_
best_f1_logreg = model_logreg.best_score_

print(best_params_logreg)
print(best_f1_logreg)

Fitting 5 folds for each of 6 candidates, totalling 30 fits
[CV 1/5; 1/6] START C=0.1, max_iter=10..........................................


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 1/5; 1/6] END ........................C=0.1, max_iter=10; total time=   0.4s
[CV 2/5; 1/6] START C=0.1, max_iter=10..........................................
[CV 2/5; 1/6] END ........................C=0.1, max_iter=10; total time=   0.5s
[CV 3/5; 1/6] START C=0.1, max_iter=10..........................................
[CV 3/5; 1/6] END ........................C=0.1, max_iter=10; total time=   0.5s
[CV 4/5; 1/6] START C=0.1, max_iter=10..........................................
[CV 4/5; 1/6] END ........................C=0.1, max_iter=10; total time=   0.5s
[CV 5/5; 1/6] START C=0.1, max_iter=10..........................................
[CV 5/5; 1/6] END ........................C=0.1, max_iter=10; total time=   0.3s
[CV 1/5; 2/6] START C=0.1, max_iter=100.........................................
[CV 1/5; 2/6] END .......................C=0.1, max_iter=100; total time=   1.7s
[CV 2/5; 2/6] START C=0.1, max_iter=100.........................................
[CV 2/5; 2/6] END .......................C=0.1, max_iter=100; total time=   1.9s
[CV 3/5; 2/6] START C=0.1, max_iter=100.........................................
[CV 3/5; 2/6] END .......................C=0.1, max_iter=100; total time=   1.6s
[CV 4/5; 2/6] START C=0.1, max_iter=100.........................................
[CV 4/5; 2/6] END .......................C=0.1, max_iter=100; total time=   1.9s
[CV 5/5; 2/6] START C=0.1, max_iter=100.........................................
[CV 5/5; 2/6] END .......................C=0.1, max_iter=100; total time=   1.8s
[CV 1/5; 3/6] START C=1.0, max_iter=10..........................................
[CV 1/5; 3/6] END ........................C=1.0, max_iter=10; total time=   0.4s
[CV 2/5; 3/6] START C=1.0, max_iter=10..........................................
[CV 2/5; 3/6] END ........................C=1.0, max_iter=10; total time=   0.5s
[CV 3/5; 3/6] START C=1.0, max_iter=10..........................................
[CV 3/5; 3/6] END ........................C=1.0, max_iter=10; total time=   0.5s
[CV 4/5; 3/6] START C=1.0, max_iter=10..........................................
[CV 4/5; 3/6] END ........................C=1.0, max_iter=10; total time=   0.5s
[CV 5/5; 3/6] START C=1.0, max_iter=10..........................................
[CV 5/5; 3/6] END ........................C=1.0, max_iter=10; total time=   0.4s
[CV 1/5; 4/6] START C=1.0, max_iter=100.........................................
[CV 1/5; 4/6] END .......................C=1.0, max_iter=100; total time=   3.0s
[CV 2/5; 4/6] START C=1.0, max_iter=100.........................................
[CV 2/5; 4/6] END .......................C=1.0, max_iter=100; total time=   2.9s
[CV 3/5; 4/6] START C=1.0, max_iter=100.........................................
[CV 3/5; 4/6] END .......................C=1.0, max_iter=100; total time=   3.0s
[CV 4/5; 4/6] START C=1.0, max_iter=100.........................................
[CV 4/5; 4/6] END .......................C=1.0, max_iter=100; total time=   3.4s
[CV 5/5; 4/6] START C=1.0, max_iter=100.........................................
[CV 5/5; 4/6] END .......................C=1.0, max_iter=100; total time=   3.5s
[CV 1/5; 5/6] START C=10.0, max_iter=10.........................................
[CV 1/5; 5/6] END .......................C=10.0, max_iter=10; total time=   0.5s
[CV 2/5; 5/6] START C=10.0, max_iter=10.........................................
[CV 2/5; 5/6] END .......................C=10.0, max_iter=10; total time=   0.3s
[CV 3/5; 5/6] START C=10.0, max_iter=10.........................................
[CV 3/5; 5/6] END .......................C=10.0, max_iter=10; total time=   0.5s
[CV 4/5; 5/6] START C=10.0, max_iter=10.........................................
[CV 4/5; 5/6] END .......................C=10.0, max_iter=10; total time=   0.5s
[CV 5/5; 5/6] START C=10.0, max_iter=10.........................................
[CV 5/5; 5/6] END .......................C=10.0, max_iter=10; total time=   0.3s
[CV 1/5; 6/6] START C=10.0, max_iter=100........................................
[CV 1/5; 6/6] END ......................C=10.0, max_iter=100; total time=   3.4s
[CV 2/5; 6/6] START C=10.0, max_iter=100........................................
[CV 2/5; 6/6] END ......................C=10.0, max_iter=100; total time=   3.5s
[CV 3/5; 6/6] START C=10.0, max_iter=100........................................
[CV 3/5; 6/6] END ......................C=10.0, max_iter=100; total time=   3.3s
[CV 4/5; 6/6] START C=10.0, max_iter=100........................................
[CV 4/5; 6/6] END ......................C=10.0, max_iter=100; total time=   3.4s
[CV 5/5; 6/6] START C=10.0, max_iter=100........................................
[CV 5/5; 6/6] END ......................C=10.0, max_iter=100; total time=   3.3s
{'C': 10.0, 'max_iter': 100}
0.7615712241225763


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


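In cross-validation, LogisticRegression reached F1 = 0.76 with C = 10.0 and max_iter = 100. The solver reported hitting the iteration limit for this configuration, so the model had not fully converged and a larger max_iter may be worth trying.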
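
One thing to keep in mind about this search: the vectorizer was fitted on the whole training set before `GridSearchCV`, so within each fold the IDF weights also reflect the validation part. A common way to avoid this is to put the vectorizer and the classifier into one `Pipeline`, so the vectorizer is refitted on the training part of every fold. The sketch below shows the idea on a tiny made-up corpus. The texts, labels, `cv=2` and `max_iter=1000` are assumptions for illustration, not the notebook's settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus: 1 = toxic, 0 = not toxic.
texts = [
    "you are a stupid idiot",
    "this edit is stupid garbage",
    "shut up you idiot",
    "what a useless stupid page",
    "thanks for the helpful edit",
    "please see the talk page",
    "i added a reference to the article",
    "good work on this section",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(random_state=12345, max_iter=1000)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid=param_grid, scoring="f1", cv=2)
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

With the default `refit=True`, `search.best_estimator_` is the whole pipeline refitted on all training texts, so it can be applied directly to raw preprocessed strings with `search.predict(...)`.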

### LightGBM

In [20]:
params_lgbm = {
    'boosting_type': ['gbdt', 'dart', 'goss']
}

f1_lgbm = make_scorer(f1_score)

model_lgbm = GridSearchCV(
    LGBMClassifier(random_state=12345),
    param_grid=params_lgbm,
    scoring=f1_lgbm,
    verbose=10
)

model_lgbm.fit(x_train, y_train)

best_params_lgbm = model_lgbm.best_params_
best_f1_lgbm = model_lgbm.best_score_

print(best_params_lgbm)
print(best_f1_lgbm)

Fitting 5 folds for each of 3 candidates, totalling 15 fits
[CV 1/5; 1/3] START boosting_type=gbdt..........................................
[LightGBM] [Info] Number of positive: 10359, number of negative: 91587
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.966833 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 509318
[LightGBM] [Info] Number of data points in the train set: 101946, number of used features: 9669
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101613 -> initscore=-2.179434
[LightGBM] [Info] Start training from score -2.179434
[CV 1/5; 1/3] END ........................boosting_type=gbdt; total time=  21.0s
[CV 2/5; 1/3] START boosting_type=gbdt..........................................
[LightGBM] [Info] Number of positive: 10359, number of negative: 91587
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 1.159934 seconds.
You can set `force_col_

In cross-validation, LGBMClassifier reached F1 = 0.75.

LogisticRegression has the better F1 score. We check its score on the test set.

In [30]:
best_model_logreg = LogisticRegression(random_state=12345, C=best_params_logreg['C'], max_iter=best_params_logreg['max_iter'])

best_model_logreg.fit(x_train, y_train)

predictions = best_model_logreg.predict(x_test)
f1_test = f1_score(y_test, predictions)

print('f1_score on test for LogisticRegression:', f1_test)

f1_score on test for LogisticRegression: 0.7801709401709402


## Conclusions

In this project:
- the texts were prepared: extraneous characters were removed, the words lemmatized, and stop words removed;
- the texts were converted into vectors with TfidfVectorizer;
- LogisticRegression and LightGBM models were trained;
- LogisticRegression reached F1 = 0.76 in cross-validation;
- LightGBM reached F1 = 0.75 in cross-validation;
- the best model, LogisticRegression, reached F1 = 0.78 on the test set.

## Review checklist

- [x]  Jupyter Notebook is open
- [ ]  All code runs without errors
- [ ]  Code cells are in execution order
- [ ]  Data is loaded and prepared
- [ ]  Models are trained
- [ ]  The *F1* score is at least 0.75
- [ ]  Conclusions are written