# Fine-tuning the T5-small model

## Fine-tuning the model
### Run the following cell only if you want to fine-tune the model yourself

In [None]:
!python ../src/models/T5/T5_model_train.py

## Download the weights of the T5-small model

In [None]:
!python ../src/data/load_weights.py

Downloading...
From (uriginal): https://drive.google.com/uc?id=1TS2FeNofbu_bydF3AaxcszcWs6htSk-G
From (redirected): https://drive.google.com/uc?id=1TS2FeNofbu_bydF3AaxcszcWs6htSk-G&confirm=t&uuid=e5fbdea2-d79b-4371-a159-0fb6b71084a8
To: /home/karinochka/Innopolis/text-detoxification/models/t5-small/best-model.zip
100%|████████████████████████████████████████| 225M/225M [00:38<00:00, 5.84MB/s]
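The download above ends with a zip archive (`best-model.zip`). Whether `load_weights.py` also unpacks it is not shown here. If it does not, a minimal sketch of the extraction step could look like the following. The helper name `extract_weights` and the target directory are assumptions for illustration, not part of the project's code.

```python
import zipfile
from pathlib import Path


def extract_weights(archive_path, target_dir):
    """Unpack a zip archive into target_dir and return the archived file names.

    Hypothetical helper: the real load_weights.py may already do this.
    """
    archive_path = Path(archive_path)
    target_dir = Path(target_dir)
    # Fail early on a missing or corrupted download.
    if not zipfile.is_zipfile(archive_path):
        raise ValueError(f"{archive_path} is not a valid zip archive")
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()
```

For example, `extract_weights('../models/t5-small/best-model.zip', '../models/t5-small')` would place the archive contents next to the zip file, assuming that layout matches what the prediction script expects.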


## Data example

In [None]:
with open('../data/interim/test_reference.txt', 'r') as file:
    # Read the contents of the file
    test = file.read().splitlines()

print(test[:5])

['"Walter put a lot ofgoddamn money in his billfold, that\'s why."', 'Everybody nice and fucking comfy now?', 'I feel like a jackass now.', '"Get out of here, you vultures!"', 'He could actually die here.']


## Predicting with the model

In [None]:
!python ../src/models/T5/T5_model_predict.py

2023-11-05 22:33:30.830784: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-11-05 22:33:30.869950: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
100%|█████████████████████████████████████| 5267/5267 [1:03:40<00:00,  1.38it/s]
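The internals of `T5_model_predict.py` are not shown in this notebook. As a rough sketch of what a file-level prediction pass might look like, the loop below reads one sentence per line, applies a detoxification function, and writes one prediction per line. The function name `predict_file` and the `detoxify` callable are hypothetical and only illustrate the shape of the step.

```python
from pathlib import Path


def predict_file(input_path, output_path, detoxify):
    """Apply `detoxify` to each input line and write one prediction per line.

    Hypothetical sketch: `detoxify` is any callable mapping str -> str.
    Returns the number of sentences processed.
    """
    sentences = Path(input_path).read_text(encoding="utf-8").splitlines()
    predictions = [detoxify(sentence) for sentence in sentences]
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(
        "".join(prediction + "\n" for prediction in predictions),
        encoding="utf-8",
    )
    return len(predictions)
```

Keeping the input and output line-aligned matters because a metrics script that takes `--inputs` and `--preds` files typically compares them line by line.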


## Manually testing the model

In [None]:
# Take a random sentence from the test set
import random

example = random.choice(test)
print(example)

"You don't really fucking care at all, do you?"


In [None]:
import sys
from transformers import AutoTokenizer

sys.path.append('../src/models/T5/')

from T5_model_predict import detoxificate, load_model

tokenizer = AutoTokenizer.from_pretrained('t5-small')
model = load_model("../models/t5-small/t5-small")
result = detoxificate(model, tokenizer, example)

print(result)

"you don't care about all, do you?"
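The body of `detoxificate` lives in `T5_model_predict.py` and is not reproduced here. For a Hugging Face seq2seq model, a single-sentence prediction usually follows an encode, generate, decode pattern, sketched below. The function name `detoxify_sentence` and the `max_new_tokens` default are assumptions. If the model was trained with a task prefix, the same prefix would need to be prepended to `text`.

```python
def detoxify_sentence(model, tokenizer, text, max_new_tokens=64):
    """Hypothetical stand-in for `detoxificate`: encode, generate, decode.

    Assumes `tokenizer` and `model` follow the Hugging Face Transformers API
    (callable tokenizer, `model.generate`, `tokenizer.decode`).
    """
    # Encode the sentence as PyTorch tensors, truncating overly long input.
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    # Generate output token ids; the length cap is an assumed default.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode the first (and only) sequence, dropping pad/eos tokens.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

With the objects from the cell above, `detoxify_sentence(model, tokenizer, example)` would be the equivalent call.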


## Evaluating the model's metrics

In [7]:
!python text-detoxification/src/metrics/metrics.py --inputs text-detoxification/data/interim/test_translation.txt --preds text-detoxification/data/interim/result.txt

Calculating style of predictions
Downloading (…)okenizer_config.json: 100% 25.0/25.0 [00:00<00:00, 94.0kB/s]
Downloading (…)olve/main/vocab.json: 100% 798k/798k [00:00<00:00, 7.92MB/s]
Downloading (…)olve/main/merges.txt: 100% 456k/456k [00:00<00:00, 8.86MB/s]
Downloading (…)cial_tokens_map.json: 100% 239/239 [00:00<00:00, 937kB/s]
Downloading (…)lve/main/config.json: 100% 794/794 [00:00<00:00, 3.02MB/s]
Downloading pytorch_model.bin: 100% 501M/501M [00:05<00:00, 94.6MB/s]
Some weights of the model checkpoint at SkolkovoInstitute/roberta_toxicity_classifier were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForS