**Load the required functionality**

In [1]:
import re
import sys
import numpy as np
import pandas as pd
from tqdm import tqdm
from enum import Enum
from src import utils

tqdm.pandas()

**Define the file paths**

In [2]:
class Paths(Enum):
    test = "data/ranking_test.jsonl"
    stopwords_coms = "auxiliary_data/stopwords_coms.txt"
    stopwords_text = "auxiliary_data/stopwords_text.txt"
    hard_words = "auxiliary_data/hard_words.txt"

**Load the data**

In [3]:
test = pd.read_json(Paths.test.value, lines=True)
print(test.shape)
print(f"[memory usage]: {round(sys.getsizeof(test) / 1024 ** 2, 3)}")
test.sample(5)

(14004, 2)
[memory usage]: 3.026


| | text | comments |
|---|---|---|
| 1247 | Why the World Is Too Optimistic About China's ... | [{'text': 'China is a special case. For most ... |
| 7911 | Company hit with $1.1M penalty under Canadian ... | [{'text': 'I&#x27;m seeing a lot of spam by co... |
| 10362 | Windows 10 IoT Core Insider Preview for Raspbe... | [{'text': 'Fun fact: after downloading it and ... |
| 12851 | Ask HN: Boring problems that will never be ven... | [{'text': 'I&#x27;m kind of in the middle of d... |
| 9378 | New Tesla 70D all-wheel drive, 240 mile range ... | [{'text': 'The price listed in the title is &q... |


**Convert the data to a flat table with one row per post-comment pair**

In [4]:
test = utils.get_valid_stucture(test, is_train=False)
test.sample(5)

0it [00:00, ?it/s]

| | text | comments |
|---|---|---|
| 48119 | The Deceptive Anagram Question | I stopped when I scrolled as far as where he m... |
| 44573 | FoundationDB's Lesson: A Fast Key-Value Store ... | Regarding SQLite using FoundationDB as k&#x2F;... |
| 40507 | Can a city really ban cars from its streets? (... | Cars are a huge waste of space:https:&#x2F;&#x... |
| 27552 | Drone carrying drugs crashes near US-Mexico bo... | Physics &#x2F; engineering people: Ignoring co... |
| 28721 | Kythe: A pluggable ecosystem for building tool... | This sounds like the Grok project that Steve Y... |
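
`utils.get_valid_stucture` is not shown in this notebook. Judging by the row counts (14004 posts become 70020 rows) and the `[{'text': ...}]` lists in the raw data, it presumably emits one row per post-comment pair. A minimal sketch of such a flattening follows; `flatten_posts` is a hypothetical name and not the project's implementation.

```python
def flatten_posts(posts):
    """Turn posts with nested comment lists into one row per (post, comment) pair."""
    rows = []
    for post in posts:
        for comment in post["comments"]:
            # Assumption: each comment is a dict with at least a "text" key.
            rows.append({"text": post["text"], "comments": comment["text"]})
    return rows


posts = [
    {"text": "Post A", "comments": [{"text": "first"}, {"text": "second"}]},
    {"text": "Post B", "comments": [{"text": "third"}]},
]
rows = flatten_posts(posts)
print(len(rows))  # 3
```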


**Clean URLs in the text, i.e. keep only the second-level domain**

In [5]:
test["text"] = test["text"].progress_apply(utils.clean_domain)
test["comments"] = test["comments"].progress_apply(utils.clean_domain)

100%|████████████████████████████████████████████████████████████████████████| 70020/70020 [00:00<00:00, 147989.58it/s]
100%|█████████████████████████████████████████████████████████████████████████| 70020/70020 [00:02<00:00, 30137.42it/s]
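
The body of `utils.clean_domain` is not part of this notebook. The sketch below shows one way to reduce a URL to its second-level domain, under the assumption that URLs start with `http://` or `https://`. The name `keep_second_level_domain` is hypothetical, and the rule is naive: multi-part suffixes such as `co.uk` are not handled.

```python
import re
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://\S+")


def keep_second_level_domain(text):
    """Replace every URL in `text` with its second-level domain label."""
    def _replace(match):
        host = urlparse(match.group(0)).hostname or ""
        labels = host.split(".")
        # Naive rule: take the label right before the top-level domain.
        return labels[-2] if len(labels) >= 2 else host

    return URL_RE.sub(_replace, text)


print(keep_second_level_domain("see https://www.github.com/user/repo for details"))
# see github for details
```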


**Remove punctuation from the text**

In [6]:
test["text"] = test["text"].progress_apply(utils.clean_punctuation)
test["comments"] = test["comments"].progress_apply(utils.clean_punctuation)

100%|████████████████████████████████████████████████████████████████████████| 70020/70020 [00:00<00:00, 451745.49it/s]
100%|█████████████████████████████████████████████████████████████████████████| 70020/70020 [00:00<00:00, 70332.41it/s]
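
As an illustration of this step, here is a minimal sketch that maps ASCII punctuation to spaces. `strip_punctuation` is a hypothetical stand-in; the real `utils.clean_punctuation` may treat other characters differently. Replacing with a space rather than deleting keeps neighbouring words from merging, at the cost of leaving runs of spaces behind.

```python
import string

# Map every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def strip_punctuation(text):
    """Return `text` with ASCII punctuation replaced by spaces."""
    return text.translate(PUNCT_TABLE)


print(strip_punctuation("a,b"))  # a b
```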


**Load the stop words for posts and for comments**

In [7]:
with open(Paths.stopwords_coms.value, 'r') as file:
    stopwords_coms = list(map(
        lambda x: x.strip(), file.readlines()
    ))
    
with open(Paths.stopwords_text.value, 'r') as file:
    stopwords_text = list(map(
        lambda x: x.strip(), file.readlines()
    ))

**Remove \xa0 (non-breaking spaces) from the text**

In [8]:
test["text"] = test["text"].progress_apply(utils.replace_xa0)
test["comments"] = test["comments"].progress_apply(utils.replace_xa0)

100%|███████████████████████████████████████████████████████████████████████| 70020/70020 [00:00<00:00, 1296691.95it/s]
100%|███████████████████████████████████████████████████████████████████████| 70020/70020 [00:00<00:00, 1052936.92it/s]
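
For reference, `\xa0` is the non-breaking space (U+00A0), which often survives from HTML `&nbsp;`. A one-line sketch of the replacement, assuming it is swapped for a regular space (`replace_nbsp` is a hypothetical name):

```python
def replace_nbsp(text):
    """Replace non-breaking spaces with regular spaces."""
    return text.replace("\xa0", " ")


print(replace_nbsp("100\xa0km"))  # 100 km
```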


**Remove emoji from the text**

In [9]:
test["text"] = test["text"].progress_apply(utils.remove_emoji)
test["comments"] = test["comments"].progress_apply(utils.remove_emoji)

100%|████████████████████████████████████████████████████████████████████████| 70020/70020 [00:00<00:00, 242203.18it/s]
100%|█████████████████████████████████████████████████████████████████████████| 70020/70020 [00:01<00:00, 64762.24it/s]
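
Emoji removal is commonly done with a regular expression over Unicode code-point ranges. The sketch below uses a partial, non-exhaustive set of ranges; `strip_emoji` is a hypothetical name, and the ranges covered by `utils.remove_emoji` are not known from this notebook.

```python
import re

# A partial set of emoji code-point ranges; not exhaustive.
EMOJI_RE = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols and pictographs
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicators (flags)
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "]+"
)


def strip_emoji(text):
    """Delete every character that falls in the ranges above."""
    return EMOJI_RE.sub("", text)


print(repr(strip_emoji("nice work \U0001F44D\U0001F680")))  # 'nice work '
```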


**Collapse repeated whitespace in the text**

In [10]:
test["text"] = test["text"].progress_apply(utils.remove_whitespaces)
test["comments"] = test["comments"].progress_apply(utils.remove_whitespaces)

100%|████████████████████████████████████████████████████████████████████████| 70020/70020 [00:00<00:00, 279861.14it/s]
100%|█████████████████████████████████████████████████████████████████████████| 70020/70020 [00:01<00:00, 39732.19it/s]
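
A minimal sketch of whitespace normalisation, assuming that runs of whitespace are collapsed to a single space and the ends are trimmed (`collapse_whitespace` is a hypothetical name, not the code of `utils.remove_whitespaces`):

```python
import re


def collapse_whitespace(text):
    """Replace any run of whitespace with one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


print(collapse_whitespace("  a \t b\n\nc "))  # a b c
```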


**Remove stop words**

In [11]:
test["text"] = test["text"].progress_apply(
    lambda x: utils.remove_stopwords(x.lower(), stopwords_text)
)
test["comments"] = test["comments"].progress_apply(
    lambda x: utils.remove_stopwords(x.lower(), stopwords_coms)
)

100%|█████████████████████████████████████████████████████████████████████████| 70020/70020 [00:01<00:00, 64479.44it/s]
100%|██████████████████████████████████████████████████████████████████████████| 70020/70020 [00:13<00:00, 5343.76it/s]
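
The filtering itself can be sketched as follows. This is an assumption about what `utils.remove_stopwords` does, with the hypothetical name `drop_stopwords`. Converting the word list to a `set` makes each membership check constant-time, which may matter for the long comment column.

```python
def drop_stopwords(text, stopwords):
    """Remove whitespace-separated tokens that appear in `stopwords`."""
    stop = set(stopwords)  # set lookup instead of scanning a list for every token
    return " ".join(tok for tok in text.split() if tok not in stop)


print(drop_stopwords("the cat is on the mat", ["the", "is", "on"]))  # cat mat
```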


**Collapse repeated whitespace again after stop-word removal**

In [13]:
test["text"] = test["text"].progress_apply(utils.remove_whitespaces)
test["comments"] = test["comments"].progress_apply(utils.remove_whitespaces)

100%|██████████| 70020/70020 [00:00<00:00, 167103.94it/s]
100%|██████████| 70020/70020 [00:01<00:00, 35864.51it/s]


**Replace empty strings in the columns with "empty"**

In [12]:
# Rows whose post text or comment became empty after cleaning
index_nan_text = test[test["text"] == ''].index
index_nan_comments = test[test["comments"] == ''].index

test.loc[index_nan_text, "text"] = "empty"
test.loc[index_nan_comments, "comments"] = "empty"

**Check the data for missing values**

In [13]:
test.isna().sum()

text        0
comments    0
dtype: int64

**Create the files with post and comment embeddings**

In [15]:
utils.create_embeddings(lang_model='bert-base-multilingual-cased', data=test, 
                        column='text', output_filename='text_embed_test', device='cuda')

Some weights of the model checkpoint at /root/.cache/torch/sentence_transformers/bert-base-multilingual-cased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Batches:   0%|          | 0/2189 [00:00<?, ?it/s]

In [16]:
utils.create_embeddings(lang_model='bert-base-multilingual-cased', data=test, 
                        column='comments', output_filename='comments_embed_test', device='cuda')



Batches:   0%|          | 0/2189 [00:00<?, ?it/s]

**Create the files with post and comment toxicity probabilities**

In [17]:
utils.create_toxic(df=test, column="text", output_filename="text_toxic_test", device=0)
utils.create_toxic(df=test, column="comments", output_filename="comments_toxic_test", device=0)

  0%|          | 0/70020 [00:00<?, ?it/s]

  0%|          | 0/70020 [00:00<?, ?it/s]

**Create the files for three distances: cosine, Euclidean, and Manhattan**

In [18]:
with open("auxiliary_data/text_embed_test.npy", "rb") as f:
    df_text = np.load(f)
    
with open("auxiliary_data/comments_embed_test.npy", "rb") as f:
    df_coms = np.load(f)

In [22]:
utils.calulacte_cos_measure(
    df_coms,
    df_text,
    output_filename="cos_measure_test"
)
utils.calculate_euclidean_measure(
    df_coms,
    df_text,
    output_filename="euclidean_measure_test"
)
utils.calculate_manhattan_measure(
    df_coms,
    df_text,
    output_filename="manhattan_measure_test"
)

  0%|          | 0/14004 [00:00<?, ?it/s]

  0%|          | 0/14004 [00:00<?, ?it/s]

  0%|          | 0/14004 [00:00<?, ?it/s]
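
The three measures can be written out for a single pair of vectors as below. This is a plain-Python sketch of the standard formulas, not the code in `utils`. Whether the project stores cosine similarity or cosine distance (one minus the similarity) is not visible here, so the sketch returns the similarity. In practice the same formulas would be applied to the embedding arrays loaded above.

```python
import math


def cosine_similarity(u, v):
    """Dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def euclidean_distance(u, v):
    """Square root of the summed squared differences (L2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan_distance(u, v):
    """Sum of the absolute differences (L1)."""
    return sum(abs(a - b) for a, b in zip(u, v))


post, comment = [1.0, 0.0, 2.0], [1.0, 2.0, 0.0]
print(round(cosine_similarity(post, comment), 3))   # 0.2
print(round(euclidean_distance(post, comment), 3))  # 2.828
print(manhattan_distance(post, comment))            # 4.0
```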

**Create 3 features: the number of words in a comment, the number of words in the group, and the comment's share of the group's words; then drop the group word count**

In [14]:
test["count_words"] = test.comments.progress_apply(utils.get_text_len)
test["count_words_group"] = test.groupby("text")["count_words"].transform("sum")
test["percent_words"] = test["count_words"] / test["count_words_group"]
test.drop("count_words_group", axis=1, inplace=True)

100%|████████████████████████████████████████████████████████████████████████| 70020/70020 [00:00<00:00, 625146.70it/s]
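
A small worked example of the share computation, written without pandas and assuming that `utils.get_text_len` counts whitespace-separated tokens (an assumption; its body is not shown). Comments that belong to the same post form one group.

```python
from collections import defaultdict

rows = [
    {"text": "post a", "comments": "great read thanks"},
    {"text": "post a", "comments": "agreed"},
    {"text": "post b", "comments": "not sure about this"},
]

# Per-comment word count and per-post total.
group_total = defaultdict(int)
for row in rows:
    row["count_words"] = len(row["comments"].split())
    group_total[row["text"]] += row["count_words"]

# Share of the group's words contributed by each comment.
for row in rows:
    row["percent_words"] = row["count_words"] / group_total[row["text"]]

print([row["percent_words"] for row in rows])  # [0.75, 0.25, 1.0]
```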


**Create the feature for the number of unique words in a comment**

In [15]:
test["unique_words"] = test.comments.progress_apply(
    # Raw strings avoid invalid-escape warnings for \s and \d
    lambda x: utils.get_nunique_words(re.sub(r"\s+", ' ', re.sub(r"\d+", '', x)))
)

100%|███████████████████████████████████████████████████████████████████████████| 70020/70020 [03:11<00:00, 366.50it/s]
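
In its simplest form, a unique-word count is the size of the set of tokens. The sketch below combines that with the digit stripping from the cell above. `count_unique_words` is a hypothetical name, and the real `utils.get_nunique_words` runs at only a few hundred rows per second, so it likely does more work than this.

```python
import re


def count_unique_words(text):
    """Drop digits, collapse whitespace, then count distinct tokens."""
    cleaned = re.sub(r"\s+", " ", re.sub(r"\d+", "", text)).strip()
    return len(set(cleaned.split()))


print(count_unique_words("to be or not to be 42"))  # 4
```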


**Create the feature for similarity between the post text and the comment**

In [16]:
test["resemblance"] = test.progress_apply(
    lambda x: utils.resemblance_text(x["text"], x["comments"]), axis=1
)

100%|███████████████████████████████████████████████████████████████████████████| 70020/70020 [02:05<00:00, 557.54it/s]
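
How `utils.resemblance_text` scores similarity is not shown. One common choice, used here purely as an assumption, is the `ratio()` of `difflib.SequenceMatcher` from the standard library, which returns a value in [0, 1]. `resemblance` is a hypothetical name.

```python
from difflib import SequenceMatcher


def resemblance(post, comment):
    """Similarity of two strings in [0, 1], based on matching subsequences."""
    return SequenceMatcher(None, post, comment).ratio()


print(resemblance("abc", "abc"))  # 1.0
print(resemblance("abc", "xyz"))  # 0.0
```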


**Create the comment toxicity feature and the feature for whether post and comment toxicity match**

In [17]:
text_toxic = pd.DataFrame(np.load("auxiliary_data/text_toxic_test.npy"))
text_toxic.columns = ["toxic_text"]

In [18]:
comments_toxic = pd.DataFrame(np.load("auxiliary_data/comments_toxic_test.npy"))
comments_toxic.columns = ["toxic"]

In [19]:
test = test.join(text_toxic)
test = test.join(comments_toxic)

In [20]:
test["equality_toxic"] = test.progress_apply(utils.check_equality_toxic, axis=1)

100%|████████████████████████████████████████████████████████████████████████| 70020/70020 [00:00<00:00, 173757.02it/s]
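
One plausible reading of the match feature is "post and comment fall on the same side of a toxicity threshold". The sketch below implements that reading. The name `same_toxicity` and the 0.5 threshold are assumptions, not taken from `utils.check_equality_toxic`. It accepts any row-like mapping with the `toxic_text` and `toxic` columns used above.

```python
def same_toxicity(row, threshold=0.5):
    """Return 1 if post and comment are both toxic or both non-toxic, else 0."""
    post_is_toxic = row["toxic_text"] >= threshold
    comment_is_toxic = row["toxic"] >= threshold
    return int(post_is_toxic == comment_is_toxic)


print(same_toxicity({"toxic_text": 0.1, "toxic": 0.2}))  # 1
print(same_toxicity({"toxic_text": 0.1, "toxic": 0.9}))  # 0
```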


In [21]:
test.drop("toxic_text", axis=1, inplace=True)

**Load the file with hard words and create the sentence difficulty feature**

In [22]:
with open(Paths.hard_words.value, 'r') as file:
    hard_words = list(map(
        lambda x: x.strip(), file.readlines()
    ))

In [23]:
test["hard_sentence"] = test.comments.progress_apply(
    lambda x: utils.feature_hard_word(x, hard_words)
)

100%|████████████████████████████████████████████████████████████████████████| 70020/70020 [00:00<00:00, 283513.55it/s]
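
The difficulty feature could, for example, be the share of a comment's tokens that appear in the hard-word list. That definition is an assumption (the real `utils.feature_hard_word` might return a count or a flag instead), and `hard_word_share` is a hypothetical name.

```python
def hard_word_share(text, hard_words):
    """Fraction of whitespace-separated tokens found in `hard_words`."""
    hard = set(hard_words)
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(tok in hard for tok in tokens) / len(tokens)


print(hard_word_share("the epistemology of ubiquitous things",
                      ["epistemology", "ubiquitous"]))  # 0.4
```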


**Create 3 distance features: cosine, Euclidean, and Manhattan**

In [24]:
cos_measure = pd.DataFrame(np.load("auxiliary_data/cos_measure_test.npy"))
cos_measure.columns = ["cos_measure"]
test = test.join(cos_measure)

In [25]:
euclidean_measure = pd.DataFrame(np.load("auxiliary_data/euclidean_measure_test.npy"))
euclidean_measure.columns = ["euclidean_measure"]
test = test.join(euclidean_measure)

In [26]:
manhattan_measure = pd.DataFrame(np.load("auxiliary_data/manhattan_measure_test.npy"))
manhattan_measure.columns = ["manhattan_measure"]
test = test.join(manhattan_measure)

In [30]:
test.to_csv("processed_test_data.csv", index=False)

---