# TASK DESCRIPTION

**Legend:**

Young Alex has a beloved BERT model that he carries everywhere on his trusty flash drive. One day, during an excursion along the River Styx, a few drops of water landed on the precious device, corrupting the model's weights.

Heartbroken, Alex rushed home to fix the neural network. After a quick analysis, he discovered that only the token embeddings were damaged: the rest of the architecture (attention blocks and heads) remained perfectly intact. Now he needs to restore the model's performance on the sentiment analysis task.

**Task:**

You need to repair the broken vectors in the model's embedding matrix so that the model performs better on text sentiment analysis.

**Restrictions:**

- You cannot use any other transformer-based pre-trained models or LLMs.

- You cannot use any additional data.

- You cannot fine-tune or pre-train the model.

===

When you submit, make a Quick Save of the notebook, otherwise we may reject your solution.

You must solve this task on KAGGLE (YOU CAN'T USE CLOUD.RU)

**Note:**

Alex must restore the model's performance while keeping all other weights frozen: no changes to the attention mechanisms or other components are allowed. Your task is to help Alex reach this goal without disturbing his nostalgic attachment to the original model.


# DEPENDENCIES

In [2]:
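import numpy as np
import pandas as pd
import torch

# Seed both NumPy and PyTorch: the random fallback used later relies on torch.rand.
np.random.seed(21)
torch.manual_seed(21)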

# LOAD DATASET

Upload the kaggle.json file to Colab.

In [2]:
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

mv: cannot stat 'kaggle.json': No such file or directory


In [3]:
!kaggle competitions download -c neoai-2025-broken-bert

neoai-2025-broken-bert.zip: Skipping, found more recently modified local copy (use --force to force download)


In [4]:
!unzip neoai-2025-broken-bert.zip

Archive:  neoai-2025-broken-bert.zip
replace test.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: test.csv                
  inflating: val_dataset.csv         


In [3]:
val_data_path = "val_dataset.csv"
test_data_path = "test.csv"

val_df = pd.read_csv(val_data_path)

test_df = pd.read_csv(test_data_path)

# LOAD TOKENIZER & MODEL

In [4]:
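from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("Ilseyar-kfu/broken_bert")
# Load the damaged model here so its embedding matrix can be inspected in the cells below.
model = AutoModelForSequenceClassification.from_pretrained("Ilseyar-kfu/broken_bert")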

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [5]:
class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

In [6]:
val_encodings = tokenizer(val_df["text"].to_list(), truncation=True, padding=True, max_length=256)
val_dataset = Dataset(val_encodings, val_df["labels"].to_list())

In [7]:
texts_2_score = val_df["text"].to_list() + test_df["text"].to_list()

# MODEL CHANGES

### Solution: for each token with a zero embedding, average the embeddings of its sub-tokens.

In [9]:
from tqdm import tqdm

Get token indices with zero embeddings.

In [10]:
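# Work on a copy so that in-place edits do not silently modify the loaded model.
model_embeddings = model.bert.embeddings.word_embeddings.weight.detach().cpu().clone()
# A row is damaged if every one of its components is exactly zero.
zero_rows = (model_embeddings == 0).all(dim=1)
# squeeze(1) keeps a 1-D array even if only a single row is damaged.
zero_indices = torch.nonzero(zero_rows).squeeze(1).numpy()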
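
To make the mask logic concrete, here is a minimal NumPy sketch on a toy matrix. The 4×3 values and the `toy_*` names are made up for illustration; the notebook applies the same idea to the torch tensor.

```python
import numpy as np

# Toy stand-in for the embedding matrix: rows 1 and 3 are "damaged" (all zeros).
toy_embeddings = np.array([
    [0.2, -0.1, 0.5],
    [0.0, 0.0, 0.0],
    [0.0, 0.3, 0.0],  # contains zeros, but is not an all-zero row
    [0.0, 0.0, 0.0],
])

# A row counts as damaged only if every component is exactly zero.
toy_zero_rows = (toy_embeddings == 0).all(axis=1)
toy_zero_indices = np.nonzero(toy_zero_rows)[0]

print(toy_zero_indices.tolist())  # [1, 3]
```

Row 2 is deliberately left out of the result: a few zero components do not make a row damaged.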

Get tokenizer vocabulary:

In [14]:
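token_to_ids = tokenizer.get_vocab()
ids_to_token = {v: k for k, v in token_to_ids.items()}
# Use a set for membership tests: checking `v not in <numpy array>` scans the whole array each time.
zero_index_set = set(zero_indices.tolist())
non_zero_ids_to_token = {v: k for k, v in token_to_ids.items() if v not in zero_index_set}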

Function that returns the ids of all vocabulary tokens that occur as substrings of the given token.

In [15]:
def get_sub_tokens(token, ids_to_token):
  sub_tokens = []
  for idx, sub_token in ids_to_token.items():
    if sub_token in token:
      sub_tokens.append(idx)
  return sub_tokens
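
The plain substring test has a consequence worth seeing on a toy vocabulary. The sketch below assumes the vocabulary follows the usual WordPiece convention of marking continuation pieces with `##`. The toy vocabulary and the `get_sub_tokens_stripped` variant are hypothetical and are not part of the notebook's solution.

```python
def get_sub_tokens(token, ids_to_token):
    """Same substring test as the notebook's function."""
    return [idx for idx, sub in ids_to_token.items() if sub in token]


def strip_prefix(piece):
    """Drop a leading WordPiece continuation marker, if present."""
    return piece[2:] if piece.startswith("##") else piece


def get_sub_tokens_stripped(token, ids_to_token):
    """Hypothetical variant: compare pieces with the '##' marker removed."""
    bare_token = strip_prefix(token)
    matches = []
    for idx, sub in ids_to_token.items():
        bare_sub = strip_prefix(sub)
        # Skip empty strings, which would match every token.
        if bare_sub and bare_sub in bare_token:
            matches.append(idx)
    return matches


# Made-up vocabulary for illustration only.
toy_vocab = {0: "play", 1: "##ing", 2: "##ed", 3: "lay", 4: "cat"}

plain = get_sub_tokens("playing", toy_vocab)
stripped = get_sub_tokens_stripped("playing", toy_vocab)
print(plain)     # [0, 3]
print(stripped)  # [0, 1, 3]
```

With the plain test, `##ing` never matches `playing` because the literal string `##ing` is not a substring of it. Whether stripping the marker helps on the real vocabulary is an open question that would need to be checked on the validation set.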

For each token with a zero embedding, we find its sub-tokens in the tokenizer vocabulary and replace its vector with the mean of their embeddings. If no sub-token is found, the row is filled with random values:

In [17]:
hidden_size = model_embeddings.shape[1]
for zero_index in tqdm(zero_indices):
  zero_token = ids_to_token[zero_index]
  sub_tokens = get_sub_tokens(zero_token, non_zero_ids_to_token)
  if len(sub_tokens) != 0:
    # Average the intact embeddings of every vocabulary token found inside this token.
    model_embeddings[zero_index] = model_embeddings[sub_tokens].mean(dim=0)
  else:
    # No intact sub-token found: fall back to uniform random values in [0, 1).
    model_embeddings[zero_index] = torch.rand(hidden_size)

100%|██████████| 12208/12208 [00:23<00:00, 523.20it/s]
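
The two branches of the repair loop can be illustrated on a toy matrix. This is a NumPy sketch with made-up values. The second branch shows an assumed alternative to the uniform `torch.rand` fallback, namely sampling around the mean and spread of the intact rows; it is not what the notebook does.

```python
import numpy as np

rng = np.random.default_rng(21)

toy = np.array([
    [1.0, 3.0],  # id 0, intact
    [3.0, 5.0],  # id 1, intact
    [0.0, 0.0],  # id 2, damaged, sub-tokens 0 and 1 were found
    [0.0, 0.0],  # id 3, damaged, no sub-tokens were found
])
healthy = toy[[0, 1]]  # fancy indexing returns a copy of the intact rows

# Case 1: sub-tokens exist, so the damaged row becomes their mean.
toy[2] = toy[[0, 1]].mean(axis=0)

# Case 2 (assumed alternative fallback): draw each dimension from a normal
# distribution matching the intact rows' per-dimension mean and std.
toy[3] = rng.normal(healthy.mean(axis=0), healthy.std(axis=0))

print(toy[2].tolist())  # [2.0, 4.0]
```

The reason for considering such a fallback is scale: uniform values in [0, 1) are all positive and may sit far from the range of the intact embeddings.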


Replace the model's embedding matrix with the new embedding matrix:

In [19]:
new_embedings = model_embeddings

In [26]:
# Reload the checkpoint so that all remaining weights come from a clean copy.
model = AutoModelForSequenceClassification.from_pretrained("Ilseyar-kfu/broken_bert")

# Swap in the repaired embedding matrix; only this parameter differs from the checkpoint.
model.bert.embeddings.word_embeddings.weight = torch.nn.Parameter(new_embedings.clone())

# EVALUATION

In [21]:
from sklearn.metrics import f1_score
from numpy import argmax
from transformers import pipeline
import wandb
wandb.init(mode= "disabled")

In [22]:
from sklearn.metrics import classification_report

def evaluate_on_validation(model, tokenizer, df_val):
    # Map the pipeline's generic label names to the dataset's class names.
    label_2_dict = {"LABEL_0": "neutral", "LABEL_1": "positive", "LABEL_2": "negative"}
    classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
    answ = classifier.predict(list(df_val["text"]))
    answ = [label_2_dict[el["label"]] for el in answ]

    print(classification_report(df_val["labels"], answ))

In [23]:
evaluate_on_validation(model, tokenizer, val_df)

Device set to use cpu


              precision    recall  f1-score   support

    negative       0.72      0.13      0.22       935
     neutral       0.33      0.87      0.48       759
    positive       0.57      0.22      0.32       806

    accuracy                           0.39      2500
   macro avg       0.54      0.41      0.34      2500
weighted avg       0.55      0.39      0.33      2500
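
As a sanity check on the report, the `macro avg` and `weighted avg` F1 rows can be recomputed from the per-class rows. The sketch below uses the rounded values printed above, so it is only accurate to about two decimals.

```python
# Per-class F1 and support, copied from the classification report above.
per_class = {
    "negative": {"f1": 0.22, "support": 935},
    "neutral": {"f1": 0.48, "support": 759},
    "positive": {"f1": 0.32, "support": 806},
}

total_support = sum(c["support"] for c in per_class.values())

# Macro average: every class counts equally, regardless of its size.
macro_f1 = sum(c["f1"] for c in per_class.values()) / len(per_class)

# Weighted average: each class is weighted by its number of examples.
weighted_f1 = sum(c["f1"] * c["support"] for c in per_class.values()) / total_support

print(total_support, round(macro_f1, 2), round(weighted_f1, 2))  # 2500 0.34 0.33
```

Both averages are dragged down by the negative and positive classes. The recall of 0.87 on neutral alongside 0.13 and 0.22 on the other two shows that the repaired model still predicts "neutral" for most inputs.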



# MODEL SCORING
When you submit:
1. Make a Quick Save of the notebook, otherwise we may reject your solution!
2. Add the notebook version to the submission comment.


In [24]:
import hashlib

def create_submission(model, tokenizer, df_test):
    label_2_dict = {'LABEL_0': 'neutral', "LABEL_1" : 'positive', "LABEL_2": 'negative'}
    classifier = pipeline("text-classification", model= model, tokenizer = tokenizer)
    answ = classifier.predict(list(df_test["text"]))
    answ = [label_2_dict[el["label"]] for el in answ]

    df = pd.DataFrame({"labels" : answ, "id": df_test['id']})
    hsh = hashlib.sha256(df.to_csv(index=False).encode('utf-8')).hexdigest()[:8]
    submit_path = f"submit_{hsh}.csv"
    print(f"SUBMIT_NAME: {submit_path}")
    print(df.head(10))
    df.to_csv(submit_path,index=False)
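
The file name produced by `create_submission` is derived from the file's contents: the first 8 hex characters of the SHA-256 digest of the CSV text. A minimal sketch of that naming scheme follows; the helper `submission_name` and the two CSV strings are made up for illustration.

```python
import hashlib


def submission_name(csv_text):
    """Build a file name from the first 8 hex characters of the CSV's SHA-256 digest."""
    digest = hashlib.sha256(csv_text.encode("utf-8")).hexdigest()
    return f"submit_{digest[:8]}.csv"


# Two toy submissions that differ in a single label.
csv_a = "labels,id\npositive,5000\nneutral,5001\n"
csv_b = "labels,id\nnegative,5000\nneutral,5001\n"

name_a = submission_name(csv_a)
name_b = submission_name(csv_b)
print(name_a, name_b)
```

Identical predictions always map to the same file name, and any change in the predictions almost certainly changes it. This makes it easy to tell whether two runs produced the same submission.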

In [25]:
create_submission(model, tokenizer, test_df)

Device set to use cpu


SUBMIT_NAME: submit_f5fc7d7e.csv
     labels    id
0  positive  5000
1   neutral  5001
2   neutral  5002
3   neutral  5003
4   neutral  5004
5   neutral  5005
6   neutral  5006
7   neutral  5007
8   neutral  5008
9   neutral  5009
