## Notebook author: Vitaly Kartashov

This notebook presents a solution for the [Goodreads Books Reviews](https://www.kaggle.com/competitions/goodreads-books-reviews-290312/) competition on Kaggle, in which participants were asked to build a model that predicts a rating (the number of stars given) from reviews left on [Goodreads](https://www.goodreads.com/).

The task is to predict, from the text of a review, the rating the user gave (from 0 to 5 stars).

A more detailed description of the task is available on the [competition page](https://www.kaggle.com/competitions/goodreads-books-reviews-290312/).

The notebook covers solutions based on logistic regression, CatBoost, and BERT.

The best score came from fine-tuning BERT, which gave a public score of 0.59377 (top 79 solutions).

In [None]:
import os
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


import warnings
warnings.simplefilter(action='ignore')

with open('kaggle.json', 'r') as f:
  data = json.load(f)
  os.environ['KAGGLE_USERNAME'] = data["username"]
  os.environ['KAGGLE_KEY'] = data["key"]

In [None]:
!kaggle competitions download -c goodreads-books-reviews-290312
!unzip goodreads-books-reviews-290312.zip
!rm -rf goodreads-books-reviews-290312.zip

Downloading goodreads-books-reviews-290312.zip to /content
 99% 627M/635M [00:04<00:00, 136MB/s]
100% 635M/635M [00:04<00:00, 141MB/s]
Archive:  goodreads-books-reviews-290312.zip
  inflating: goodreads_sample_submission.csv  
  inflating: goodreads_test.csv      
  inflating: goodreads_train.csv     


In [None]:
df_train = pd.read_csv('goodreads_train.csv')
df_test = pd.read_csv('goodreads_test.csv')

df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 900000 entries, 0 to 899999
Data columns (total 11 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       900000 non-null  object
 1   book_id       900000 non-null  int64 
 2   review_id     900000 non-null  object
 3   rating        900000 non-null  int64 
 4   review_text   900000 non-null  object
 5   date_added    900000 non-null  object
 6   date_updated  900000 non-null  object
 7   read_at       808234 non-null  object
 8   started_at    625703 non-null  object
 9   n_votes       900000 non-null  int64 
 10  n_comments    900000 non-null  int64 
dtypes: int64(4), object(7)
memory usage: 75.5+ MB


The EDA section is available in the notebook at [this link](https://colab.research.google.com/drive/1vuvkWXII2NUjuBidBQcBHSLn-iN25ida?usp=sharing).

As in the LightAutoML notebook, we will work with a reduced training set and then look at several approaches to multiclass text classification.

In [None]:
# Check the shape of the training and test data

print(f"Train dataset shape: {df_train.shape}")
print(f"Test dataset shape: {df_test.shape}")

Train dataset shape: (900000, 11)
Test dataset shape: (478033, 10)


In [None]:
# Define the metric used to evaluate results: weighted F1 (f1_score with average='weighted')

from sklearn.metrics import f1_score

def f1_weighted(y_true, y_pred):
    return f1_score(y_true, y_pred, average='weighted')
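
To make explicit what the weighted F1 metric above computes, here is a minimal pure-Python sketch: per-class F1 scores averaged with weights equal to each class's share of the true labels. The helper names and the toy labels are hypothetical and are not part of the notebook; in the notebook itself the metric comes from `sklearn.metrics.f1_score`.

```python
from collections import Counter

def f1_per_class(y_true, y_pred, label):
    # Count true positives, false positives and false negatives for one class
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def f1_weighted_manual(y_true, y_pred):
    # Weight each class's F1 by its support (number of true examples)
    support = Counter(y_true)
    total = len(y_true)
    return sum(f1_per_class(y_true, y_pred, label) * n / total
               for label, n in support.items())

# Toy ratings (hypothetical, not competition data)
y_true = [4, 4, 4, 5, 5, 3]
y_pred = [4, 4, 5, 5, 4, 3]

score = f1_weighted_manual(y_true, y_pred)
print(round(score, 4))
```

Because the weights follow class support, frequent ratings such as 4 and 5 dominate the score, while rare ratings contribute little.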

In [None]:
%%capture
!pip install nltk

In [None]:
# Light text preprocessing

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Lemmatizer
lemmatizer = WordNetLemmatizer()

# Load the stop-word list once, as a set. Calling stopwords.words('english')
# for every token re-reads the list each time, which is very slow.
STOP_WORDS = set(stopwords.words('english'))

# Contraction expansions
contractions = {
    "don't": "do not",
    "doesn't": "does not",
    "didn't": "did not",
    "won't": "will not",
    "isn't" : "is not",
    "weren't" : "were not",
    "aren't" : "are not",
    "ain't" : "are not"
}

contractions_re = re.compile('(%s)' % '|'.join(contractions.keys()))

def expand_contractions(text, contractions_dict=contractions):
    def replace(match):
        return contractions_dict[match.group(0)]

    return contractions_re.sub(replace, text)

def handle_negations(text):
    # Glue "not" to the following word: "not good" -> "notgood"
    negation_pattern = re.compile(r'\b(not)\s+(\w+)')
    return negation_pattern.sub(r'\1\2', text)

def replace_words_with_repeats(text, replacement=''):
    # Replace any word containing a character repeated three or more times in a row (e.g. "sooo")
    pattern = r'\b\w*(\w)\1{2,}\w*\b'
    return re.sub(pattern, replacement, text)

def replace_numbers(text, replacement='num'):
    pattern = r'\b\d+\b'  # Replace standalone numbers with a constant token
    return re.sub(pattern, replacement, text)

def clean_text(text):
    text = text.lower()  # Convert to lower case
    text = expand_contractions(text)  # Expand contractions to their full form
    text = handle_negations(text)  # Concatenate negations with the following word
    text = re.sub(r'[^\w\s]', ' ', text)  # Replace everything that is not a word character or whitespace
    text = re.sub(r'\n', ' ', text)  # Replace newlines with a space so neighbouring words are not glued together
    text = replace_words_with_repeats(text)
    text = replace_numbers(text)
    text = re.sub(r'\s+', ' ', text).strip()

    words = nltk.word_tokenize(text)

    # Remove stop words and lemmatize the remaining words
    cleaned_text = [lemmatizer.lemmatize(word) for word in words if word not in STOP_WORDS]

    # Return the cleaned text
    return ' '.join(cleaned_text)
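
To see what the regex steps above do to a review, here is a small standalone sketch that applies three of them to one sentence. The patterns are copied from the cell above; the sample sentence is made up for illustration, and the NLTK steps (tokenization, stop words, lemmatization) are left out so the example needs only the standard library.

```python
import re

def handle_negations(text):
    # Glue "not" to the following word: "not good" -> "notgood"
    return re.sub(r'\b(not)\s+(\w+)', r'\1\2', text)

def replace_words_with_repeats(text, replacement=''):
    # Drop any word containing a character repeated 3+ times in a row
    return re.sub(r'\b\w*(\w)\1{2,}\w*\b', replacement, text)

def replace_numbers(text, replacement='num'):
    # Replace standalone numbers with a constant token
    return re.sub(r'\b\d+\b', replacement, text)

# Hypothetical review fragment, already lower-cased
sample = "this was not good sooo boring read 3 chapters"

step1 = handle_negations(sample)
step2 = replace_words_with_repeats(step1)
step3 = replace_numbers(step2)
result = re.sub(r'\s+', ' ', step3).strip()

print(result)  # this was notgood boring read num chapters
```

Gluing the negation to the next word matters because "not" is an English stop word in NLTK and would otherwise be removed, which would flip the meaning of "not good".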

In [None]:
df_train['review_text_cleaned'] = df_train['review_text'].apply(lambda x: clean_text(x))
df_test['review_text_cleaned'] = df_test['review_text'].apply(lambda x: clean_text(x))

In [None]:
# Our reduced dataset

train_data = df_train.sample(150000)

## Logistic regression

In [None]:
# Try logistic regression on the reduced dataset

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(train_data['review_text_cleaned'])
y = train_data['rating']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

logreg_model = LogisticRegression(multi_class='multinomial')
logreg_model.fit(X_train, y_train)

In [None]:
predictions = logreg_model.predict(X_test)

In [None]:
print("f1_weighted: ", f1_weighted(y_test, predictions))

f1_weighted:  0.5134982730944393


In [None]:
# Now try the full training set

tfidf = TfidfVectorizer()

X = tfidf.fit_transform(df_train['review_text_cleaned'])
y = df_train['rating']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

logreg_model = LogisticRegression(multi_class='multinomial')
logreg_model.fit(X_train, y_train)

In [None]:
predictions = logreg_model.predict(X_test)

In [None]:
print("f1_weighted: ", f1_weighted(y_test, predictions))

f1_weighted:  0.5272591822833017


Now let's try submitting the predictions.

In [None]:
test_reviews = tfidf.transform(df_test['review_text_cleaned'])
predictions_test = logreg_model.predict(test_reviews)

In [None]:
lr_df = pd.concat([df_test[['review_id']], pd.DataFrame(predictions_test, columns=['rating'])], axis=1)
lr_df.head()

Unnamed: 0,review_id,rating
0,5c4df7e70e9b438c761f07a4620ccb7c,4
1,8eaeaf13213eeb16ad879a2a2591bbe5,4
2,dce649b733c153ba5363a0413cac988f,4
3,8a46df0bb997269d6834f9437a4b0a77,4
4,d11d3091e22f1cf3cb865598de197599,4


In [None]:
lr_df.to_csv('lr_df_prep.csv', index=False) # public score is 0.53085

Public score on Kaggle: 0.53085. This is a very good result for a relatively simple model. However, the solution is not time-efficient: preprocessing the text of the training set took a little over 2 hours.




## Using CatBoost

In [None]:
%%capture
!pip install optuna catboost

In [None]:
# Again, run the model on the reduced dataset first. The text feature is the raw review text, without preprocessing

from sklearn.model_selection import train_test_split

train_df = df_train.sample(150000)
X = train_df[['review_text']]
y = train_df[['rating']]


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Try CatBoostClassifier with near-default settings. The number of iterations is chosen by intuition

from catboost import CatBoostClassifier, Pool

model = CatBoostClassifier(
    task_type = 'GPU',
    verbose = 1000,
    iterations = 15000
)

train_pool = Pool(data=X_train, label=y_train, text_features=['review_text'])
test_pool = Pool(data=X_test, text_features=['review_text'])

model.fit(train_pool)

Learning rate set to 0.018061
0:	learn: 1.7709436	total: 53.2ms	remaining: 13m 17s
1000:	learn: 1.0913840	total: 24.8s	remaining: 5m 46s
2000:	learn: 1.0500232	total: 48.4s	remaining: 5m 14s
3000:	learn: 1.0200007	total: 1m 8s	remaining: 4m 35s
4000:	learn: 0.9953757	total: 1m 31s	remaining: 4m 11s
5000:	learn: 0.9735150	total: 1m 54s	remaining: 3m 49s
6000:	learn: 0.9537695	total: 2m 16s	remaining: 3m 25s
7000:	learn: 0.9352799	total: 2m 38s	remaining: 3m 1s
8000:	learn: 0.9183649	total: 3m 2s	remaining: 2m 39s
9000:	learn: 0.9022333	total: 3m 25s	remaining: 2m 17s
10000:	learn: 0.8869424	total: 3m 46s	remaining: 1m 53s
11000:	learn: 0.8721660	total: 4m 10s	remaining: 1m 31s
12000:	learn: 0.8579178	total: 4m 34s	remaining: 1m 8s
13000:	learn: 0.8443038	total: 4m 58s	remaining: 45.9s
14000:	learn: 0.8313220	total: 5m 19s	remaining: 22.8s
14999:	learn: 0.8185289	total: 5m 43s	remaining: 0us


<catboost.core.CatBoostClassifier at 0x7a798a81dff0>

In [None]:
predictions = model.predict(test_pool)

In [None]:
f1_weighted(y_test, predictions)

0.5542487388501727

A fairly good result. As in the first case, let's submit and look at the final score.

In [None]:
# Run on the full training set

from catboost import CatBoostClassifier, Pool

model = CatBoostClassifier(
    task_type = 'GPU',
    verbose = 1000,
    iterations = 15000
)

train_pool = Pool(data=df_train[['review_text']], label=df_train[['rating']], text_features=['review_text'])
test_pool = Pool(data=df_test[['review_text']], text_features=['review_text'])

model.fit(train_pool)

Learning rate set to 0.027244
0:	learn: 1.7572672	total: 162ms	remaining: 40m 26s
1000:	learn: 1.0494207	total: 1m 15s	remaining: 17m 33s
2000:	learn: 1.0227927	total: 2m 25s	remaining: 15m 47s
3000:	learn: 1.0065117	total: 3m 35s	remaining: 14m 20s
4000:	learn: 0.9945610	total: 4m 43s	remaining: 12m 58s
5000:	learn: 0.9844828	total: 5m 51s	remaining: 11m 42s
6000:	learn: 0.9758285	total: 6m 58s	remaining: 10m 27s
7000:	learn: 0.9679397	total: 8m 5s	remaining: 9m 15s
8000:	learn: 0.9606937	total: 9m 13s	remaining: 8m 4s
9000:	learn: 0.9538373	total: 10m 22s	remaining: 6m 54s
10000:	learn: 0.9473592	total: 11m 31s	remaining: 5m 45s
11000:	learn: 0.9412422	total: 12m 40s	remaining: 4m 36s
12000:	learn: 0.9353618	total: 13m 49s	remaining: 3m 27s
13000:	learn: 0.9296474	total: 14m 58s	remaining: 2m 18s
14000:	learn: 0.9241063	total: 16m 8s	remaining: 1m 9s
14999:	learn: 0.9188003	total: 17m 18s	remaining: 0us


<catboost.core.CatBoostClassifier at 0x7b2c3d340310>

In [None]:
predictions = model.predict(test_pool)

In [None]:
cat_boost_df = pd.concat([df_test[['review_id']], pd.DataFrame(predictions, columns=['rating'])], axis=1)
cat_boost_df.head()

Unnamed: 0,review_id,rating
0,5c4df7e70e9b438c761f07a4620ccb7c,4
1,8eaeaf13213eeb16ad879a2a2591bbe5,4
2,dce649b733c153ba5363a0413cac988f,5
3,8a46df0bb997269d6834f9437a4b0a77,4
4,d11d3091e22f1cf3cb865598de197599,3


In [None]:
cat_boost_df.to_csv('catboost-sub.csv', index=False) # public score is 0.5749

Public score on Kaggle: 0.5749. This is slightly worse than the result obtained with LightAutoML, but it is a good trade-off between speed and quality (under 20 minutes of training).


Let's try to optimize the hyperparameters with Optuna.

In [None]:
import optuna
from tqdm import tqdm
from catboost import CatBoostClassifier, Pool, cv

def objective(trial):
    catboost_params = {
        'iterations': trial.suggest_int('iterations', 100, 20000),
        'task_type': 'GPU',
        'depth': trial.suggest_int('depth', 2, 15),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
        'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 1, 10),
        'random_state': 42,
        'loss_function': 'MultiClass',
        'verbose': False
    }

    train_pool = Pool(data=X_train, label=y_train, text_features=['review_text'])
    cv_results = cv(pool=train_pool, params=catboost_params, fold_count=3, type='Classical')

    # Lowest mean validation MultiClass loss over the boosting iterations
    best_score = min(cv_results['test-MultiClass-mean'])

    return best_score

N_TRIALS = 3 # Kept small to save time

# The objective is a loss, so the study should minimize it.
# NOTE: the run logged below used direction='maximize', which is why its
# "best" trial is the one with the highest loss.
study = optuna.create_study(direction='minimize')
with tqdm(total=N_TRIALS) as pbar:
    def update_progress_bar(study, trial):
        pbar.update(1)

    study.optimize(objective, n_trials=N_TRIALS, callbacks=[update_progress_bar])

print("Best parameters:", study.best_params)


[I 2023-12-28 12:12:26,240] A new study created in memory with name: no-name-0004eea3-b957-4cbb-8fbc-f65d2e0d549c
  0%|          | 0/3 [00:00<?, ?it/s]

Training on fold [0/3]
bestTest = 1.065081219
bestIteration = 617
Training on fold [1/3]
bestTest = 1.059809082
bestIteration = 580
Training on fold [2/3]
bestTest = 1.067233928
bestIteration = 621


[I 2023-12-28 12:41:11,598] Trial 0 finished with value: 1.0642675542139892 and parameters: {'iterations': 19381, 'depth': 7, 'learning_rate': 0.2418038933705475, 'l2_leaf_reg': 8.991016177866072}. Best is trial 0 with value: 1.0642675542139892.
 33%|███▎      | 1/3 [28:45<57:30, 1725.36s/it]

Training on fold [0/3]
bestTest = 1.068182528
bestIteration = 1192
Training on fold [1/3]
bestTest = 1.06111084
bestIteration = 1247
Training on fold [2/3]


[I 2023-12-28 13:13:02,513] Trial 1 finished with value: 1.0669738500968953 and parameters: {'iterations': 6126, 'depth': 10, 'learning_rate': 0.0767197148751083, 'l2_leaf_reg': 8.634828015372172}. Best is trial 1 with value: 1.0669738500968953.
 67%|██████▋   | 2/3 [1:00:36<30:34, 1834.51s/it]

bestTest = 1.071457577
bestIteration = 1194
Training on fold [0/3]
bestTest = 1.076194628
bestIteration = 367
Training on fold [1/3]
bestTest = 1.066938281
bestIteration = 333
Training on fold [2/3]
bestTest = 1.077304842
bestIteration = 423


[I 2023-12-28 13:58:03,048] Trial 2 finished with value: 1.07368332373016 and parameters: {'iterations': 14860, 'depth': 9, 'learning_rate': 0.17629601066152945, 'l2_leaf_reg': 3.0343887261859273}. Best is trial 2 with value: 1.07368332373016.
100%|██████████| 3/3 [1:45:36<00:00, 2112.27s/it]

Best parameters: {'iterations': 14860, 'depth': 9, 'learning_rate': 0.17629601066152945, 'l2_leaf_reg': 3.0343887261859273}





In [None]:
# Run CatBoost with the hyperparameters selected by Optuna

model = CatBoostClassifier(
    task_type = 'GPU',
    verbose = 1000,
    iterations = 14860,
    depth = 9,
    learning_rate=0.17629601066152945,
    l2_leaf_reg=3.0343887261859273
)

train_pool = Pool(data=X_train, label=y_train, text_features=['review_text'])
test_pool = Pool(data=X_test, text_features=['review_text'])

model.fit(train_pool)

0:	learn: 1.5992232	total: 117ms	remaining: 29m 4s
1000:	learn: 0.6192861	total: 1m 6s	remaining: 15m 19s
2000:	learn: 0.4101463	total: 2m 11s	remaining: 14m 5s
3000:	learn: 0.2992838	total: 3m 15s	remaining: 12m 51s
4000:	learn: 0.2312913	total: 4m 16s	remaining: 11m 35s
5000:	learn: 0.1862195	total: 5m 18s	remaining: 10m 27s
6000:	learn: 0.1522820	total: 6m 20s	remaining: 9m 21s
7000:	learn: 0.1273902	total: 7m 22s	remaining: 8m 17s
8000:	learn: 0.1085653	total: 8m 24s	remaining: 7m 12s
9000:	learn: 0.0930475	total: 9m 26s	remaining: 6m 8s
10000:	learn: 0.0804575	total: 10m 26s	remaining: 5m 4s
11000:	learn: 0.0709058	total: 11m 28s	remaining: 4m 1s
12000:	learn: 0.0634134	total: 12m 28s	remaining: 2m 58s
13000:	learn: 0.0576102	total: 13m 29s	remaining: 1m 55s
14000:	learn: 0.0528381	total: 14m 29s	remaining: 53.4s
14859:	learn: 0.0495269	total: 15m 21s	remaining: 0us


<catboost.core.CatBoostClassifier at 0x7a6066972800>

In [None]:
predictions = model.predict(test_pool)

In [None]:
f1_weighted(y_test, predictions)

0.5134820191838674

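The result is worse than with the near-default parameters (0.5542). Most likely, more trials are needed for the search to work properly. Two further signs are visible in the logs above. First, the trial reported as best is the one with the highest cross-validation loss (1.0737), not the lowest. Second, the learn loss falls to about 0.05 over 14,860 iterations, while cross-validation stopped improving after a few hundred iterations for these parameters, which points to overfitting.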
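
The trial values from the Optuna log above can be ranked directly to see which trial a loss-based study should prefer. This is only a sanity check on the logged numbers, not a re-run of the search; the variable names are illustrative.

```python
# Mean cross-validation MultiClass loss per trial, copied from the Optuna log
trial_values = {
    0: 1.0642675542139892,
    1: 1.0669738500968953,
    2: 1.07368332373016,
}

# For a loss, lower is better
best_if_minimized = min(trial_values, key=trial_values.get)
best_if_maximized = max(trial_values, key=trial_values.get)

print("lowest loss:  trial", best_if_minimized)
print("highest loss: trial", best_if_maximized)
```

With only three trials the values differ by less than 0.01, so the ranking says little either way. What it does show is that the parameters taken forward belong to the trial with the highest logged loss.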

In [None]:
# Run on the full dataset

model = CatBoostClassifier(
    task_type = 'GPU',
    verbose = 1000,
    iterations = 14860,
    depth = 9,
    learning_rate=0.17629601066152945,
    l2_leaf_reg=3.0343887261859273
)

train_pool = Pool(data=df_train[['review_text']], label=df_train[['rating']], text_features=['review_text'])
test_pool = Pool(data=df_test[['review_text']], text_features=['review_text'])

model.fit(train_pool)


0:	learn: 1.5785417	total: 237ms	remaining: 58m 48s
1000:	learn: 0.8784131	total: 2m 36s	remaining: 36m
2000:	learn: 0.7710335	total: 5m 25s	remaining: 34m 54s
3000:	learn: 0.6839228	total: 8m 19s	remaining: 32m 55s
4000:	learn: 0.6118347	total: 11m 15s	remaining: 30m 34s
5000:	learn: 0.5517113	total: 14m 9s	remaining: 27m 53s
6000:	learn: 0.4996984	total: 17m 3s	remaining: 25m 10s
7000:	learn: 0.4558657	total: 19m 57s	remaining: 22m 24s
8000:	learn: 0.4178984	total: 22m 50s	remaining: 19m 34s
9000:	learn: 0.3844677	total: 25m 42s	remaining: 16m 44s
10000:	learn: 0.3548785	total: 28m 36s	remaining: 13m 54s
11000:	learn: 0.3291354	total: 31m 29s	remaining: 11m 2s
12000:	learn: 0.3062686	total: 34m 21s	remaining: 8m 11s
13000:	learn: 0.2859437	total: 37m 11s	remaining: 5m 19s
14000:	learn: 0.2675204	total: 40m 1s	remaining: 2m 27s
14859:	learn: 0.2533913	total: 42m 25s	remaining: 0us


<catboost.core.CatBoostClassifier at 0x7a7984c350f0>

In [None]:
predictions = model.predict(test_pool)

In [None]:
cat_boost_optuna_df = pd.concat([df_test[['review_id']], pd.DataFrame(predictions, columns=['rating'])], axis=1)
cat_boost_optuna_df.head()

Unnamed: 0,review_id,rating
0,5c4df7e70e9b438c761f07a4620ccb7c,4
1,8eaeaf13213eeb16ad879a2a2591bbe5,4
2,dce649b733c153ba5363a0413cac988f,5
3,8a46df0bb997269d6834f9437a4b0a77,4
4,d11d3091e22f1cf3cb865598de197599,3


In [None]:
cat_boost_optuna_df.to_csv('catboost-optuna-sub.csv', index=False) # public score is 0.55833

Public score on Kaggle: 0.55833. This is slightly worse than the result of CatBoost with near-default parameters (0.5749).


## Review classification by fine-tuning a BERT model from Hugging Face

In [None]:
%%capture
!pip install transformers
!pip install datasets
!pip install accelerate -U

In [None]:
# Load the data
from datasets import load_dataset
dataset = load_dataset("csv", data_files="goodreads_train.csv")

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['user_id', 'book_id', 'review_id', 'rating', 'review_text', 'date_added', 'date_updated', 'read_at', 'started_at', 'n_votes', 'n_comments'],
        num_rows: 900000
    })
})

In [None]:
# Split the dataset into training and test parts

# Named `splits` so it does not shadow sklearn's train_test_split imported earlier
splits = dataset["train"].train_test_split(test_size=0.2)

dataset["train"] = splits["train"]
dataset["test"] = splits["test"]

print(dataset)

DatasetDict({
    train: Dataset({
        features: ['user_id', 'book_id', 'review_id', 'rating', 'review_text', 'date_added', 'date_updated', 'read_at', 'started_at', 'n_votes', 'n_comments'],
        num_rows: 720000
    })
    test: Dataset({
        features: ['user_id', 'book_id', 'review_id', 'rating', 'review_text', 'date_added', 'date_updated', 'read_at', 'started_at', 'n_votes', 'n_comments'],
        num_rows: 180000
    })
})


In [None]:
# Define the labels

from datasets import ClassLabel

feat_sentiment = ClassLabel(num_classes=6, names=['0_star', '1_star', '2_star', '3_star', '4_star', '5_star'])

dataset = dataset.cast_column("rating", feat_sentiment)

Casting the dataset:   0%|          | 0/720000 [00:00<?, ? examples/s]

Casting the dataset:   0%|          | 0/180000 [00:00<?, ? examples/s]

In [None]:
reviews_df = dataset['train'].to_pandas()
reviews_df.head()

Unnamed: 0,user_id,book_id,review_id,rating,review_text,date_added,date_updated,read_at,started_at,n_votes,n_comments
0,cb89b1792a04c86a9d496d32f0b5ea44,22551730,ecde3c9421af8ff912b3943d59a471b4,2,"DNF. This is very unfortunate, since I truly a...",Fri Oct 07 23:11:45 -0700 2016,Thu Oct 20 22:45:05 -0700 2016,Thu Oct 20 00:00:00 -0700 2016,Fri Oct 07 00:00:00 -0700 2016,1,2
1,87eee2a0b14d67566bfeb1e25f2cfa20,544891,ba5be3a482015fcc2884e01e691bdf26,4,"Cute. \n Beautiful, rich, popular, and worship...",Fri Aug 13 04:27:15 -0700 2010,Fri Jul 25 11:57:46 -0700 2014,Wed Mar 02 00:00:00 -0800 2011,Tue Mar 01 00:00:00 -0800 2011,0,0
2,17856e0571acf74a67a9119521d2b4e1,89717,d6fc8a0583f565019cc7b4572bc8b556,2,In a nutshell: 4 strangers search for evidence...,Mon Oct 10 11:22:05 -0700 2016,Wed Dec 14 13:56:19 -0800 2016,Mon Oct 17 00:00:00 -0700 2016,Mon Oct 10 00:00:00 -0700 2016,0,0
3,4ac9790a722813db73a51a479e904a80,3478,7fadae15c2fef72681fa96498fc2ee7e,3,Really more of a 3.5.WAH! WHY oh why is Nichol...,Wed Sep 11 05:32:54 -0700 2013,Thu Jun 15 08:38:30 -0700 2017,Thu Sep 26 00:00:00 -0700 2013,Wed Sep 11 00:00:00 -0700 2013,0,0
4,b3f40625bbfacfd161804d98ca45300f,13574417,4c6701f705a5d4272bb181ff5fea0fa2,3,"Decent, if predictable, YA sci-fi. Will Earth-...",Wed Apr 10 16:39:48 -0700 2013,Mon Jun 12 20:24:09 -0700 2017,Mon Jun 12 00:00:00 -0700 2017,Sun Jun 11 00:00:00 -0700 2017,0,0


In [None]:
features = dataset['train'].features

In [None]:
# Map class ids to label names

id2label = {idx:features['rating'].int2str(idx) for idx in range(6)}
id2label

{0: '0_star', 1: '1_star', 2: '2_star', 3: '3_star', 4: '4_star', 5: '5_star'}

In [None]:
# And the reverse mapping, from label names to class ids
label2id = {v:k for k,v in id2label.items()}
label2id

{'0_star': 0, '1_star': 1, '2_star': 2, '3_star': 3, '4_star': 4, '5_star': 5}

In [None]:
# Fine-tune the same model as in the LightAutoML solution

from transformers import AutoTokenizer

model_ckpt = 'prajjwal1/bert-tiny'

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

In [None]:
def tokenize_text(examples):
  return tokenizer(examples['review_text'], truncation=True, max_length=512)

In [None]:
dataset = dataset.map(tokenize_text, batched=True)

Map:   0%|          | 0/720000 [00:00<?, ? examples/s]

Map:   0%|          | 0/180000 [00:00<?, ? examples/s]

In [None]:
# The target classes are imbalanced, so weight each class by 1 minus its share of the training set

class_weights = (1 - (reviews_df['rating'].value_counts().sort_index() / len(reviews_df))).values
class_weights

array([0.96566667, 0.96803333, 0.91924306, 0.79027917, 0.65129306,
       0.70548472])
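
The weighting rule used above (1 minus the class share) can be compared with the common inverse-frequency ("balanced") rule. The sketch below is pure Python on made-up labels; both helper names and the toy data are hypothetical, and the inverse-frequency variant is shown only for comparison and is not what the notebook uses.

```python
from collections import Counter

def one_minus_frequency(labels):
    # Weight used in the notebook: 1 - (class count / total)
    counts = Counter(labels)
    total = len(labels)
    return {c: 1 - counts[c] / total for c in sorted(counts)}

def inverse_frequency(labels):
    # "Balanced" alternative: total / (n_classes * class count)
    counts = Counter(labels)
    total = len(labels)
    return {c: total / (len(counts) * counts[c]) for c in sorted(counts)}

# Toy imbalanced ratings (hypothetical): one 0-star, three 3-star, six 4-star
toy_labels = [0] * 1 + [3] * 3 + [4] * 6

w_simple = one_minus_frequency(toy_labels)
w_balanced = inverse_frequency(toy_labels)

print(w_simple)
print(w_balanced)
```

The 1 minus share rule keeps all weights between 0 and 1 and compresses them into a narrow range. In the notebook the rarest class gets 0.966 and the most common one 0.651, even though their shares differ by roughly a factor of ten, so the correction for imbalance is mild.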

In [None]:
import torch

class_weights = torch.from_numpy(class_weights).float().to('cuda')
class_weights

tensor([0.9657, 0.9680, 0.9192, 0.7903, 0.6513, 0.7055], device='cuda:0')

In [None]:
# HF Trainer convention: the target column must be named 'labels'

dataset = dataset.rename_column('rating','labels')

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['user_id', 'book_id', 'review_id', 'labels', 'review_text', 'date_added', 'date_updated', 'read_at', 'started_at', 'n_votes', 'n_comments', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 720000
    })
    test: Dataset({
        features: ['user_id', 'book_id', 'review_id', 'labels', 'review_text', 'date_added', 'date_updated', 'read_at', 'started_at', 'n_votes', 'n_comments', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 180000
    })
})

In [None]:
from torch import nn
import torch
from transformers import Trainer


class WeightedLossTrainer(Trainer):
  def compute_loss(self, model, inputs, return_outputs=False):
    outputs = model(**inputs)
    logits = outputs.get('logits')
    labels = inputs.get('labels')
    loss_func = nn.CrossEntropyLoss(weight=class_weights)
    loss = loss_func(logits,labels)
    return (loss,outputs) if return_outputs else loss
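
The weighted loss in the trainer above can be reproduced by hand on a tiny batch. This is a pure-Python sketch, assuming the documented behaviour of `nn.CrossEntropyLoss` with a `weight` tensor and the default `reduction='mean'`: each example's negative log-likelihood is multiplied by the weight of its true class, and the sum is divided by the sum of those weights. The logits, labels and weights are toy values.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one row of logits
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def weighted_cross_entropy(batch_logits, labels, weights):
    # sum_i w[y_i] * (-log p_i[y_i]) / sum_i w[y_i]
    num = 0.0
    den = 0.0
    for logits, y in zip(batch_logits, labels):
        p = softmax(logits)
        num += weights[y] * -math.log(p[y])
        den += weights[y]
    return num / den

# Toy 3-class batch (hypothetical values)
batch_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
labels = [0, 2]
weights = [0.5, 1.0, 2.0]

loss = weighted_cross_entropy(batch_logits, labels, weights)
print(round(loss, 4))
```

With equal weights the formula reduces to the ordinary mean cross-entropy, so the class weights only change how much each example counts in the batch average.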


In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_ckpt,
                                                           num_labels=6,
                                                           id2label=id2label,
                                                           label2id=label2id)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
from sklearn.metrics import f1_score

def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  f1 = f1_score(labels, preds, average='weighted')
  return {'f1':f1}

In [None]:
# Values are chosen by intuition

from transformers import TrainingArguments

batch_size = 20
logging_steps = len(dataset['train']) // batch_size
output_dir = 'goodreads_rating_classification'
training_args = TrainingArguments(output_dir=output_dir,
                                  num_train_epochs=5,
                                  learning_rate=2e-5,
                                  per_device_train_batch_size=batch_size,
                                  per_device_eval_batch_size=batch_size,
                                  weight_decay=0.01,
                                  evaluation_strategy='epoch',
                                  logging_steps=logging_steps,
                                  fp16=True,
                                  push_to_hub=True)



In [None]:
# Set up the trainer

trainer = WeightedLossTrainer(model=model,
                              args=training_args,
                              compute_metrics=compute_metrics,
                              train_dataset=dataset['train'],
                              eval_dataset=dataset['test'],
                              tokenizer=tokenizer)

In [None]:
import torch
torch.cuda.empty_cache()

In [None]:
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,F1
1,1.1524,1.032172,0.568017
2,1.0338,0.999948,0.583437
3,1.0009,0.985208,0.589782
4,0.9823,0.979809,0.592436
5,0.971,0.978516,0.594124


TrainOutput(global_step=180000, training_loss=1.0280808810763888, metrics={'train_runtime': 6762.9704, 'train_samples_per_second': 532.31, 'train_steps_per_second': 26.616, 'total_flos': 4548953363971440.0, 'train_loss': 1.0280808810763888, 'epoch': 5.0})
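
The step count in the training output can be cross-checked against the settings above. The arithmetic below assumes a single device and no gradient accumulation, which is consistent with the logged `global_step`; the variable names are illustrative.

```python
# Values taken from the notebook: training split size, batch size, epochs, runtime
train_rows = 720_000
batch_size = 20
num_epochs = 5
runtime_s = 6762.9704

steps_per_epoch = train_rows // batch_size   # also the logging_steps value
total_steps = steps_per_epoch * num_epochs   # should match global_step
steps_per_second = total_steps / runtime_s

print(steps_per_epoch, total_steps, round(steps_per_second, 3))
```

Setting `logging_steps` to the number of steps per epoch means the training loss is logged once per epoch, which is why the results table has exactly one row per epoch.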

In [None]:
trainer.save_model("./best_model_checkpoint")

In [None]:
# Look at the results

results = trainer.predict(dataset['test'])
predicted_labels = results.predictions
true_labels = results.label_ids


In [None]:
f1_score(true_labels, predicted_labels.argmax(-1), average='weighted')

0.5941236750143842

The result is better than for the other models. However, the solution is not very time-efficient: training alone took about 6,763 seconds (roughly 1 hour 53 minutes). It is also worth noting that BERT is limited to 512 tokens, which may not suit every task (in that case one can turn to Longformer). For this task, 512 tokens are enough to capture the overall tone of a review and thus infer the likely rating.

Now let's make predictions for the test data.

In [None]:
# Load our model from the HF Hub and wrap the logic in a predict function

import torch
from transformers import AutoModelForSequenceClassification
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('kartashoffv/goodreads_rating_classification')
model = AutoModelForSequenceClassification.from_pretrained('kartashoffv/goodreads_rating_classification', return_dict=True)

@torch.no_grad()
def predict(review):
    inputs = tokenizer(review, max_length=512, padding=True, truncation=True, return_tensors='pt')
    outputs = model(**inputs)
    pred_label = torch.argmax(outputs.logits, dim=1).numpy()
    return pred_label[0]
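
The `predict` function above handles one review per call. For a test set of 478,033 rows, a common alternative is to process the reviews in batches. The sketch below shows only the batching logic in pure Python: `predict_in_batches` and `iter_batches` are hypothetical helpers, and `dummy_predict_batch` is a stand-in for a real function that would tokenize a list of texts with `padding=True` and take the row-wise argmax of the logits.

```python
from typing import Callable, Iterator, List, Sequence

def iter_batches(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    # Yield consecutive slices of at most batch_size items
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def predict_in_batches(texts: Sequence[str],
                       predict_batch: Callable[[List[str]], List[int]],
                       batch_size: int = 64) -> List[int]:
    # Call predict_batch once per batch and keep the input order
    labels: List[int] = []
    for batch in iter_batches(texts, batch_size):
        labels.extend(predict_batch(list(batch)))
    return labels

def dummy_predict_batch(batch: List[str]) -> List[int]:
    # Stand-in for the model call: returns a fake rating in the range 0..5
    return [len(text) % 6 for text in batch]

demo = predict_in_batches(["good", "bad", "meh", "great", "awful"],
                          dummy_predict_batch, batch_size=2)
print(demo)
```

The output list has the same order and length as the input, so it can be assigned straight to a DataFrame column.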

In [None]:
df_test['rating'] = df_test['review_text'].apply(lambda x: predict(x))

In [None]:
df_test[['review_id','rating']].head()

Unnamed: 0,review_id,rating
0,5c4df7e70e9b438c761f07a4620ccb7c,4
1,8eaeaf13213eeb16ad879a2a2591bbe5,3
2,dce649b733c153ba5363a0413cac988f,4
3,8a46df0bb997269d6834f9437a4b0a77,4
4,d11d3091e22f1cf3cb865598de197599,3


In [None]:
df_test[['review_id','rating']].to_csv('bert-sub.csv', index=False) # public score is 0.59377