# Fine-tuning the T5 model 

The `t5-base` checkpoint was used.

For the whole implementation, check the `src/models/t5` folder.

In [None]:
# Required installations
# !pip install sacrebleu rouge_score

## Cloning the repository

In [1]:
!git clone https://github.com/leiluk1/text-detoxification.git

Cloning into 'text-detoxification'...
remote: Enumerating objects: 153, done.
remote: Counting objects: 100% (130/130), done.
remote: Compressing objects: 100% (92/92), done.
remote: Total 153 (delta 51), reused 107 (delta 30), pack-reused 23
Receiving objects: 100% (153/153), 44.14 MiB | 51.84 MiB/s, done.
Resolving deltas: 100% (52/52), done.


In [2]:
# If running on Kaggle
%cd /kaggle/working/text-detoxification

/kaggle/working/text-detoxification


## Making the dataset

In [3]:
!python ./src/data/make_dataset.py 

Done successfully! Check data/interim folder.


## Training the model

In [5]:
!python ./src/models/t5/train.py --batch_size 16 --epochs 10

Downloading (…)lve/main/config.json: 100%|█| 1.21k/1.21k [00:00<00:00, 8.29MB/s]
Downloading (…)ve/main/spiece.model: 100%|███| 792k/792k [00:00<00:00, 8.44MB/s]
Downloading (…)/main/tokenizer.json: 100%|█| 1.39M/1.39M [00:00<00:00, 41.2MB/s]
Downloading builder script: 7.65kB [00:00, 6.04MB/s]                            
Downloading builder script: 5.60kB [00:00, 4.49MB/s]                            
Downloading model.safetensors: 100%|██████████| 892M/892M [00:03<00:00, 228MB/s]
Downloading (…)neration_config.json: 100%|█████| 147/147 [00:00<00:00, 1.00MB/s]
100%|███████████████████████████████████████████| 76/76 [00:06<00:00, 11.92ba/s]
100%|███████████████████████████████████████████| 19/19 [00:01<00:00, 12.27ba/s]
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
{'loss': 1.9855, 'learning_rate': 1.95789473684
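
The training command above passes two flags, `--batch_size` and `--epochs`. As a rough illustration of how a script could accept them, here is a minimal `argparse` sketch. It is not the repository's code (that lives in `src/models/t5/train.py`): only the flag names come from the command, while `build_parser`, the default values, and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: the flag names match the command above,
    # the defaults and help strings are assumptions.
    parser = argparse.ArgumentParser(description="Fine-tune T5 for text detoxification")
    parser.add_argument("--batch_size", type=int, default=32,
                        help="number of examples per training batch")
    parser.add_argument("--epochs", type=int, default=10,
                        help="number of passes over the training set")
    return parser


# Parse the same arguments as in the notebook cell.
args = build_parser().parse_args(["--batch_size", "16", "--epochs", "10"])
print(f"batch_size={args.batch_size}, epochs={args.epochs}")
```

Declaring `type=int` makes the parser reject non-numeric values before any training starts.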

## Making predictions on the test set

The test set consists of 5000 examples from the preprocessed dataset.

In [8]:
!python ./src/models/t5/predict.py 

Generating predictions...: 100%|██████████| 5000/5000 [1:21:27<00:00,  1.02it/s]
Done!


In [9]:
import pandas as pd 

res = pd.read_csv('data/interim/t5_results.csv')
res.head()

| Unnamed: 0 | reference | detox_reference | t5_result |
|------------|-----------|-----------------|-----------|
| 0 | He didn't invite me to his wedding, and he's s... | he didn't invite me to the wedding and he's re... | he didn't invite me to his wedding, and he's s... |
| 1 | God damn it, Mary, that's no life for you! | come on Mary, it's not your life! | oh, my God, Mary, that's not a life for you! |
| 2 | Cut the crap. You wouldn't even vote for your ... | wouldn't you vote for your sister-in-law? | you wouldn't even vote for your sister-in-law? |
| 3 | "Only the image of evil stupidity?" | "just an image of evil?" | "only the image of evil?" |
| 4 | We're here to forget about all that shit. | we're here to forget everything. | we're here to forget everything. |
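
A quick sanity check on such a results table is to count how often the model output matches the reference detoxification exactly. The sketch below uses the column names from the preview; the helper `exact_match_rate` is hypothetical and is not part of the repository.

```python
import pandas as pd


def exact_match_rate(df: pd.DataFrame,
                     pred_col: str = "t5_result",
                     ref_col: str = "detox_reference") -> float:
    """Share of rows whose prediction equals the reference,
    ignoring case and surrounding whitespace."""
    pred = df[pred_col].str.strip().str.lower()
    ref = df[ref_col].str.strip().str.lower()
    return float((pred == ref).mean())


# Rows 1 to 4 of the preview (row 0 is truncated in the display).
sample = pd.DataFrame({
    "detox_reference": [
        "come on Mary, it's not your life!",
        "wouldn't you vote for your sister-in-law?",
        '"just an image of evil?"',
        "we're here to forget everything.",
    ],
    "t5_result": [
        "oh, my God, Mary, that's not a life for you!",
        "you wouldn't even vote for your sister-in-law?",
        '"only the image of evil?"',
        "we're here to forget everything.",
    ],
})
print(f"exact match: {exact_match_rate(sample):.2f}")
```

On the full file the same call would be `exact_match_rate(res)`. Exact match under-counts valid paraphrases, so overlap metrics such as those provided by the `sacrebleu` and `rouge_score` packages installed at the top of the notebook are likely more informative.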


## Inference: some examples

In [6]:
!python ./src/models/t5/predict.py --inference "What a stupid joke."

Detoxified: what a bad joke.


In [7]:
!python ./src/models/t5/predict.py --inference "Fucking damn joke!"

Detoxified: it's a terrible joke!


T5 model results:

| Source sentence       | Detoxified sentence    |
|-----------------------|------------------------|
| "What a stupid joke." | "what a bad joke."     |
| "Fucking damn joke!"  | "it's a terrible joke!"|

Based on these two examples, the model appears capable of replacing toxic wording with natural-sounding, non-toxic alternatives.
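
A crude way to spot-check the two examples is to confirm that the flagged words from the source sentences no longer appear in the outputs. The sketch below assumes a tiny hand-picked word list (`FLAGGED`, hypothetical and far from exhaustive) and a hypothetical helper `flagged_words`; it is not a substitute for a proper toxicity classifier.

```python
import re

# Hypothetical mini-lexicon taken from the two source sentences above.
FLAGGED = {"stupid", "fucking", "damn"}


def flagged_words(text: str) -> set:
    """Return the flagged words that occur in `text` as whole words,
    case-insensitively."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return FLAGGED.intersection(tokens)


pairs = [
    ("What a stupid joke.", "what a bad joke."),
    ("Fucking damn joke!", "it's a terrible joke!"),
]
for source, detox in pairs:
    print(sorted(flagged_words(source)), "->", sorted(flagged_words(detox)))
```

For both pairs the source contains flagged words and the detoxified output contains none, which agrees with the table but says nothing about fluency or meaning preservation.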