**Load the required functionality**

In [1]:
import re
import sys
import numpy as np
import pandas as pd
from tqdm import tqdm
from enum import Enum
from src import utils

tqdm.pandas()

**Define the file paths**

In [2]:
class Paths(Enum):
    train = "data/ranking_train.jsonl"
    stopwords_coms = "auxiliary_data/stopwords_coms.txt"
    stopwords_text = "auxiliary_data/stopwords_text.txt"
    hard_words = "auxiliary_data/hard_words.txt"

**Load the data**

In [3]:
train = pd.read_json(Paths.train.value, lines = True)
print(train.shape)
print(f"[memory usage]: {round(sys.getsizeof(train) / 1024 ** 2, 3)}")
train.sample(5)

(88107, 2)
[memory usage]: 18.458


Unnamed: 0,text,comments
59710,Show HN: My weekend Project - Almost Flat UI T...,[{'text': 'Really nice. I'd remove the hover e...
56472,GitLab v4.1 released,[{'text': 'While I love that this is being bui...
58421,Software Engineers: What was your biggest ever...,"[{'text': 'My biggest f'up, so far..It happene..."
1319,Average startup profitability is to decrease b...,[{'text': 'Summary: there appear to be more (h...
83385,A framework for making 2D DOS games in Lua,[{'text': 'This is cool.But if you really want...


**Detect the language of each post and drop invalid posts labeled "unknown"**

In [4]:
train["lang"] = train["text"].progress_apply(utils.get_lang)
train.drop(
    train[train["lang"] == "unknown"].index,
    axis=0,
    inplace=True
)
train.reset_index(drop=True, inplace=True)
train.drop("lang", axis=1, inplace=True)

100%|███████████████████████████████████████████████████████████████████████████| 88107/88107 [07:32<00:00, 194.64it/s]


**Convert the data to a flat table**

In [5]:
train = utils.get_valid_stucture(train)
train.sample(5)

0it [00:00, ?it/s]

Unnamed: 0,text,comments,score
118249,Re-entry to US without backscatter or pat-down,"Oh wow, I had no idea they scan you after re-e...",4
185711,How to hire an idiot,The hiring process and possibly placing too mu...,1
415117,LibreDWG drama: the end or the new beginning? ...,The GPL prevents you from shipping compiled bi...,2
92286,I am an edge case,"Ok, how's this for an edge case...I'm a Canadi...",1
177000,Why Some Languages Sound Faster Than Others,I've been living in Vietnam and trying to lear...,0
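
The implementation of `utils.get_valid_stucture` is not shown in this notebook. A minimal sketch of this kind of flattening, assuming each entry in `comments` is a dict with `text` and `score` keys (the `score` key is an assumption inferred from the output column), might look like the following. The name `flatten_posts` is hypothetical.

```python
def flatten_posts(posts: list) -> list:
    """Turn one row per post into one row per (post, comment) pair."""
    rows = []
    for post in posts:
        for comment in post["comments"]:
            rows.append({
                "text": post["text"],
                "comments": comment["text"],
                # Assumed key: the notebook output only shows a `score` column.
                "score": comment["score"],
            })
    return rows


posts = [{
    "text": "GitLab v4.1 released",
    "comments": [
        {"text": "Nice release.", "score": 0},
        {"text": "Upgrading now.", "score": 1},
    ],
}]
rows = flatten_posts(posts)
print(len(rows))
```

The later progress bars report 440435 rows, which is five times the 88087 groups reported by the distance step, so the data appears to hold five comments per post.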


**Strip URLs from the text, keeping the second-level domain**

In [6]:
train["text"] = train["text"].progress_apply(utils.clean_domain)
train["comments"] = train["comments"].progress_apply(utils.clean_domain)

100%|██████████████████████████████████████████████████████████████████████| 440435/440435 [00:02<00:00, 170849.83it/s]
100%|███████████████████████████████████████████████████████████████████████| 440435/440435 [00:15<00:00, 29345.55it/s]
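
The body of `utils.clean_domain` is not shown. One way to sketch the idea, assuming "second-level domain" means the label just left of the top-level domain (for example `github` in `www.github.com`), uses only the standard library. The names `URL_RE` and `clean_domain_sketch` are hypothetical.

```python
import re
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://\S+")


def _second_level(match) -> str:
    """Return the label just left of the top-level domain, e.g. 'github'."""
    try:
        host = urlparse(match.group(0)).hostname or ""
    except ValueError:  # malformed URL, e.g. a broken IPv6 literal
        return ""
    labels = host.split(".")
    return labels[-2] if len(labels) >= 2 else host


def clean_domain_sketch(text: str) -> str:
    # Replace every URL in the text with its second-level domain label.
    return URL_RE.sub(_second_level, text)


print(clean_domain_sketch("see https://www.github.com/foo/bar now"))
```

This simple rule mishandles multi-part suffixes such as `co.uk` and URLs followed directly by punctuation; a production version would need a public-suffix list.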


**Remove punctuation from the text**

In [7]:
train["text"] = train["text"].progress_apply(utils.clean_punctuation)
train["comments"] = train["comments"].progress_apply(utils.clean_punctuation)

100%|██████████████████████████████████████████████████████████████████████| 440435/440435 [00:00<00:00, 457570.16it/s]
100%|███████████████████████████████████████████████████████████████████████| 440435/440435 [00:04<00:00, 94156.61it/s]
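
`utils.clean_punctuation` is not shown either. A minimal sketch, assuming punctuation is replaced by a space rather than deleted so that glued sentences such as "This is cool.But" in the sample above are split into separate words:

```python
import string

# Map every ASCII punctuation character to a space so that words
# joined by punctuation ("cool.But") are not glued together.
PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def clean_punctuation_sketch(text: str) -> str:
    return text.translate(PUNCT_TO_SPACE)


print(clean_punctuation_sketch("This is cool.But wait!").split())
```

Only ASCII punctuation is covered here; typographic quotes and dashes would need to be added to the table explicitly.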


**Load stop words for posts and for comments**

In [8]:
with open(Paths.stopwords_coms.value, "r") as file:
    stopwords_coms = list(map(
        lambda x: x.strip(), file.readlines()
    ))
    
with open(Paths.stopwords_text.value, "r") as file:
    stopwords_text = list(map(
        lambda x: x.strip(), file.readlines()
    ))

**Remove \xa0 (non-breaking spaces) from the text**

In [9]:
train["text"] = train["text"].progress_apply(utils.replace_xa0)
train["comments"] = train["comments"].progress_apply(utils.replace_xa0)

100%|█████████████████████████████████████████████████████████████████████| 440435/440435 [00:00<00:00, 1424630.43it/s]
100%|█████████████████████████████████████████████████████████████████████| 440435/440435 [00:00<00:00, 1220112.26it/s]


**Remove emoji from the text**

In [10]:
train["text"] = train["text"].progress_apply(utils.remove_emoji)
train["comments"] = train["comments"].progress_apply(utils.remove_emoji)

100%|██████████████████████████████████████████████████████████████████████| 440435/440435 [00:01<00:00, 266701.18it/s]
100%|███████████████████████████████████████████████████████████████████████| 440435/440435 [00:06<00:00, 71976.55it/s]
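
A common way to implement this kind of step is a regular expression over Unicode emoji blocks. The sketch below is an assumption about how `utils.remove_emoji` could work, not its actual code, and the ranges listed are deliberately not exhaustive.

```python
import re

EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # symbols and pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicators (flags)
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "]+"
)


def remove_emoji_sketch(text: str) -> str:
    # Delete every run of characters that falls inside the ranges above.
    return EMOJI_RE.sub("", text)


print(repr(remove_emoji_sketch("nice \U0001F600 work")))
```

Deleting an emoji can leave two adjacent spaces behind, which is one reason a whitespace-collapsing pass follows.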


**Collapse repeated whitespace**

In [11]:
train["text"] = train["text"].progress_apply(utils.remove_whitespaces)
train["comments"] = train["comments"].progress_apply(utils.remove_whitespaces)

100%|██████████████████████████████████████████████████████████████████████| 440435/440435 [00:01<00:00, 302064.09it/s]
100%|███████████████████████████████████████████████████████████████████████| 440435/440435 [00:09<00:00, 44829.34it/s]
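
Collapsing whitespace is typically a one-line regular expression. A minimal sketch, assuming leading and trailing whitespace should also be trimmed (the notebook does not say whether `utils.remove_whitespaces` does this):

```python
import re


def remove_whitespaces_sketch(text: str) -> str:
    # Replace each run of whitespace (spaces, tabs, newlines) with a
    # single space, then trim both ends.
    return re.sub(r"\s+", " ", text).strip()


print(repr(remove_whitespaces_sketch("  a \t b\n\nc ")))
```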


**Remove stop words**

In [12]:
train["text"] = train["text"].progress_apply(
    lambda x: utils.remove_stopwords(x.lower(), stopwords_text)
)
train["comments"] = train["comments"].progress_apply(
    lambda x: utils.remove_stopwords(x.lower(), stopwords_coms)
)

100%|███████████████████████████████████████████████████████████████████████| 440435/440435 [00:06<00:00, 67114.23it/s]
100%|████████████████████████████████████████████████████████████████████████| 440435/440435 [01:15<00:00, 5818.92it/s]
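
The stop-word lists are loaded as Python lists, and membership tests on a list scan it element by element. If `utils.remove_stopwords` filters tokens one by one (an assumption, since its body is not shown), converting the list to a set once up front is a cheap speed-up. The factory name `make_stopword_remover` is hypothetical.

```python
def make_stopword_remover(stopwords):
    """Build the lookup set once and return a function that filters text."""
    stop = frozenset(stopwords)

    def remove(text: str) -> str:
        # Keep only whitespace-separated tokens that are not stop words.
        return " ".join(word for word in text.split() if word not in stop)

    return remove


remove_text_stopwords = make_stopword_remover(["is", "a", "the"])
print(remove_text_stopwords("this is a test"))
```

With this shape, the cell above could pass `remove_text_stopwords` to `progress_apply` after lowercasing, without rebuilding the set for every row.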


**Collapse repeated whitespace left after stop-word removal**

In [13]:
train["text"] = train["text"].progress_apply(utils.remove_whitespaces)
train["comments"] = train["comments"].progress_apply(utils.remove_whitespaces)

100%|██████████████████████████████████████████████████████████████████████| 440435/440435 [00:01<00:00, 367024.19it/s]
100%|███████████████████████████████████████████████████████████████████████| 440435/440435 [00:05<00:00, 86034.60it/s]


**Replace empty strings in the columns with "empty"**

In [14]:
index_nan_text = train[train["text"] == ''].index
index_nan_comments = train[train["comments"] == ''].index

train.loc[index_nan_text, "text"] = "empty"
train.loc[index_nan_comments, "comments"] = "empty"

**Check the data for missing values**

In [15]:
train.isna().sum()

text        0
comments    0
score       0
dtype: int64

**Create files with post and comment embeddings**

In [16]:
utils.create_embeddings(
    lang_model="bert-base-multilingual-cased",
    data=train, 
    column="text",
    output_filename="text_embed_train",
    device="cuda"
)

No sentence-transformers model found with name C:\Users\ozher/.cache\torch\sentence_transformers\bert-base-multilingual-cased. Creating a new one with MEAN pooling.
Some weights of the model checkpoint at C:\Users\ozher/.cache\torch\sentence_transformers\bert-base-multilingual-cased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSeq

Batches:   0%|          | 0/13764 [00:00<?, ?it/s]

In [16]:
utils.create_embeddings(
    lang_model="bert-base-multilingual-cased",
    data=train, 
    column="comments",
    output_filename="comments_embed_train",
    device="cuda"
)

No sentence-transformers model found with name C:\Users\ozher/.cache\torch\sentence_transformers\bert-base-multilingual-cased. Creating a new one with MEAN pooling.
Some weights of the model checkpoint at C:\Users\ozher/.cache\torch\sentence_transformers\bert-base-multilingual-cased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSeq

Batches:   0%|          | 0/13764 [00:00<?, ?it/s]
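
The log above reports that the model is wrapped with MEAN pooling. As a rough illustration of what that pooling step computes (not the sentence-transformers code itself), a sentence vector is the average of the token vectors at non-padding positions. The function name and the plain-list representation are assumptions made for this sketch.

```python
def mean_pool(token_vectors, attention_mask):
    """Average token vectors where the mask is 1; padding (mask 0) is ignored."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    # Guard against an all-padding input to avoid dividing by zero.
    return [t / max(count, 1) for t in total]


tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]  # last token is padding
print(mean_pool(tokens, [1, 1, 0]))
```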

**Create files with toxicity probabilities for posts and comments**

In [22]:
utils.create_toxic(
    df=train,
    column="text",
    output_filename="text_toxic_train",
    device=0
)

In [23]:
utils.create_toxic(
    df=train,
    column="comments",
    output_filename="comments_toxic_train",
    device=0
)

**Create files for three distances: cosine, Euclidean, Manhattan**

In [19]:
with open("auxiliary_data/text_embed_train.npy", "rb") as f:
    df_text = np.load(f)
    
with open("auxiliary_data/comments_embed_train.npy", "rb") as f:
    df_coms = np.load(f)

In [21]:
utils.calulacte_cos_measure(
    df_coms,
    df_text,
    output_filename="cos_measure_train"
)
utils.calculate_euclidean_measure(
    df_coms,
    df_text,
    output_filename="euclidean_measure_train"
)
utils.calculate_manhattan_measure(
    df_coms,
    df_text,
    output_filename="manhattan_measure_train"
)

  0%|          | 0/88087 [00:00<?, ?it/s]

  0%|          | 0/88087 [00:00<?, ?it/s]

  0%|          | 0/88087 [00:00<?, ?it/s]
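
For reference, the three measures can be written in a few lines of plain Python. This is a sketch of the standard definitions for a single pair of vectors, not the `utils` implementations; in particular, whether `cos_measure` stores cosine similarity or cosine distance (one minus similarity) is not stated in the notebook, so similarity is assumed here.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Return 0.0 for a zero vector instead of dividing by zero.
    return dot / norm if norm else 0.0


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


post_vec, comment_vec = [1.0, 0.0], [0.0, 1.0]
print(cosine_similarity(post_vec, comment_vec))
print(euclidean_distance(post_vec, comment_vec))
print(manhattan_distance(post_vec, comment_vec))
```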

**Create three features: the number of words in a comment, the number of words in its group, and the comment's share of the group's words; then drop the group word count**

In [25]:
train["count_words"] = train.comments.progress_apply(utils.get_text_len)
train["count_words_group"] = train.groupby("text")["count_words"].transform("sum")
train["percent_words"] = train["count_words"] / train["count_words_group"]
train.drop("count_words_group", axis=1, inplace=True)

100%|██████████████████████████████████████████████████████████████████████| 440435/440435 [00:00<00:00, 693105.55it/s]
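
To make the group share concrete, the same computation can be spelled out without pandas: sum the word counts per post, then divide each comment's count by its post's total. The function name is hypothetical and the inputs are toy values.

```python
from collections import defaultdict


def percent_words_sketch(posts, counts):
    """Share of each comment's word count within its post's total."""
    totals = defaultdict(int)
    for post, n in zip(posts, counts):
        totals[post] += n
    # A zero total would be 0/0; this sketch returns 0.0 in that case.
    return [n / totals[post] if totals[post] else 0.0
            for post, n in zip(posts, counts)]


posts = ["p1", "p1", "p2"]
counts = [2, 6, 5]
print(percent_words_sketch(posts, counts))
```

Two caveats apply to the pandas version in the cell above: a group whose total is zero produces NaN, and grouping by `text` pools any posts whose cleaned text is identical.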


**Create a feature for the number of unique words in a comment**

In [26]:
train["unique_words"] = train.comments.progress_apply(
    lambda x: utils.get_nunique_words(re.sub(r"\s+", ' ', re.sub(r"\d+", '', x)))
)

100%|█████████████████████████████████████████████████████████████████████████| 440435/440435 [18:20<00:00, 400.34it/s]
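
The simplest reading of this feature is the size of the set of whitespace-separated tokens once digits are stripped. The sketch below assumes exactly that; the 18-minute runtime above suggests the actual `utils.get_nunique_words` does more work per comment than this.

```python
import re


def nunique_words_sketch(text: str) -> int:
    # Drop digit runs, then count distinct whitespace-separated tokens.
    no_digits = re.sub(r"\d+", "", text)
    return len(set(no_digits.split()))


print(nunique_words_sketch("a b a 12 c"))
```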


**Create a feature measuring the similarity between the post text and the comment**

In [27]:
train["resemblance"] = train.progress_apply(
    lambda x: utils.resemblance_text(x["text"], x["comments"]), axis=1
)

100%|█████████████████████████████████████████████████████████████████████████| 440435/440435 [13:40<00:00, 536.48it/s]
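
The notebook does not say which similarity measure `utils.resemblance_text` uses. As an illustration of a lexical similarity between two texts, here is Jaccard similarity over token sets; this is a swapped-in technique chosen for the example, not necessarily the author's method.

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Size of the shared vocabulary divided by the size of the combined one."""
    set_a, set_b = set(text_a.split()), set(text_b.split())
    union = set_a | set_b
    # Two empty texts have no tokens to compare; return 0.0 by convention.
    return len(set_a & set_b) / len(union) if union else 0.0


print(jaccard_similarity("a b c", "b c d"))
```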


**Create a comment toxicity feature and a feature indicating whether the post and comment toxicity match**

In [30]:
text_toxic = pd.DataFrame(np.load("auxiliary_data/text_toxic_train.npy"))
text_toxic.columns = ["toxic_text"]

In [32]:
comments_toxic = pd.DataFrame(np.load("auxiliary_data/comments_toxic_train.npy"))
comments_toxic.columns = ["toxic"]

In [33]:
train = train.join(text_toxic)
train = train.join(comments_toxic)

In [35]:
train["equality_toxic"] = train.progress_apply(utils.check_equality_toxic, axis=1)

100%|██████████████████████████████████████████████████████████████████████| 440435/440435 [00:02<00:00, 181423.82it/s]
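
How `utils.check_equality_toxic` compares the two values is not shown. One plausible reading, assuming both columns hold probabilities and that 0.5 is the decision threshold (both are assumptions), is to binarize each value and test whether the labels agree:

```python
def check_equality_toxic_sketch(row, threshold: float = 0.5) -> int:
    """Return 1 if post and comment fall on the same side of the threshold."""
    post_is_toxic = row["toxic_text"] >= threshold
    comment_is_toxic = row["toxic"] >= threshold
    return int(post_is_toxic == comment_is_toxic)


print(check_equality_toxic_sketch({"toxic_text": 0.1, "toxic": 0.2}))
print(check_equality_toxic_sketch({"toxic_text": 0.9, "toxic": 0.2}))
```

The function only indexes by column name, so it works on a plain dict as well as on a row passed by `DataFrame.apply(..., axis=1)`.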


In [36]:
train.drop("toxic_text", axis=1, inplace=True)

**Load the hard-words file and create a sentence complexity feature**

In [37]:
with open(Paths.hard_words.value, 'r') as file:
    hard_words = list(map(
        lambda x: x.strip(), file.readlines()
    ))

In [38]:
train["hard_sentence"] = train.comments.progress_apply(
    lambda x: utils.feature_hard_word(x, hard_words)
)

100%|██████████████████████████████████████████████████████████████████████| 440435/440435 [00:01<00:00, 303105.50it/s]
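
The definition of the complexity feature is not given. A minimal sketch, assuming it is the fraction of a comment's words that appear in the hard-words list (it could equally be a count or a binary flag), with a hypothetical factory name:

```python
def make_hard_word_scorer(hard_words):
    """Build the lookup set once and return a per-comment scoring function."""
    hard = frozenset(hard_words)

    def score(text: str) -> float:
        words = text.split()
        if not words:
            return 0.0
        # Fraction of tokens found in the hard-words set.
        return sum(word in hard for word in words) / len(words)

    return score


hard_share = make_hard_word_scorer(["ubiquitous", "paradigm"])
print(hard_share("the ubiquitous paradigm shift"))
```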


**Create three distance features: cosine, Euclidean, Manhattan**

In [39]:
cos_measure = pd.DataFrame(np.load("auxiliary_data/cos_measure_train.npy"))
cos_measure.columns = ["cos_measure"]
train = train.join(cos_measure)

In [40]:
euclidean_measure = pd.DataFrame(np.load("auxiliary_data/euclidean_measure_train.npy"))
euclidean_measure.columns = ["euclidean_measure"]
train = train.join(euclidean_measure)

In [41]:
manhattan_measure = pd.DataFrame(np.load("auxiliary_data/manhattan_measure_train.npy"))
manhattan_measure.columns = ["manhattan_measure"]
train = train.join(manhattan_measure)

In [43]:
train.to_csv("processed_train_data.csv", index=False)

---