# DistilBERT

This notebook serves only as a guide to fine-tuning the **DistilBERT** model. The actual training was run on **Google Colab** to make use of a GPU, so only some of the cells below have been executed.

In [1]:
import torch

from thc.utils.env import check_repository_path


REPOSITORY_DIR = check_repository_path()
PROCESSED_DATA_DIR = REPOSITORY_DIR.joinpath("data", "processed")
print(
    f"Is cuda available? {torch.cuda.is_available()}."
)
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Is cuda available? False.


## Data preparation

The train, validation and test datasets should be prepared as described in the previous [01-train-valid-split](https://github.com/mrtovsky/thc/blob/main/notebooks/01-train-valid-split.ipynb) notebook. The resulting **processed** data folder should look as follows:

In [2]:
!tree ../data/processed/

../data/processed/
├── test_tags.txt
├── test_text.txt
├── train_tags.txt
├── train_text.txt
├── valid_tags.txt
└── valid_text.txt

0 directories, 6 files


Prepare dataset loaders.

In [3]:
from torch.utils.data import DataLoader

from thc import datasets
from thc.preprocessing import TRANSFORMS


BATCH_SIZE: int = 16


train_tweets = datasets.TweetsDataset(
    text_file=PROCESSED_DATA_DIR.joinpath("train_text.txt"),
    tags_file=PROCESSED_DATA_DIR.joinpath("train_tags.txt"),
    transform=TRANSFORMS,
)
valid_tweets = datasets.TweetsDataset(
    text_file=PROCESSED_DATA_DIR.joinpath("valid_text.txt"),
    tags_file=PROCESSED_DATA_DIR.joinpath("valid_tags.txt"),
    transform=TRANSFORMS,
)

train_dataloader = DataLoader(
    dataset=train_tweets, batch_size=BATCH_SIZE, shuffle=True, num_workers=2
)
valid_dataloader = DataLoader(
    dataset=valid_tweets, batch_size=BATCH_SIZE, shuffle=False, num_workers=2
)

## Modeling

The native **Transformers** implementation of **DistilBERT** has been slightly modified to fit the problem at hand. The output size has been changed to match the 3-class classification problem, and a dropout layer has been added before the fully connected layer. The pre-trained weights are left unchanged.

In [4]:
from thc.models import DistilBertClassifier


model = DistilBertClassifier(output_size=3)
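
The modification described above can be sketched roughly as follows. This is not the actual `thc.models.DistilBertClassifier`; the class name `DistilBertClassifierSketch`, the dropout rate, and the choice of the first token's hidden state as the sentence representation are all assumptions made for illustration. A tiny, randomly initialised encoder is used so that the example runs without downloading any weights.

```python
import torch
import torch.nn as nn
from transformers import DistilBertConfig, DistilBertModel


class DistilBertClassifierSketch(nn.Module):
    """Hypothetical stand-in: encoder -> dropout -> fully connected layer."""

    def __init__(self, encoder: DistilBertModel, output_size: int = 3, dropout: float = 0.1):
        super().__init__()
        self.encoder = encoder
        # Dropout placed directly before the fully connected layer.
        self.dropout = nn.Dropout(dropout)
        # Output size matches the number of classes.
        self.classifier = nn.Linear(encoder.config.dim, output_size)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        # Assumption: use the hidden state of the first token as the pooled output.
        pooled = hidden[:, 0]
        return self.classifier(self.dropout(pooled))


# Tiny random configuration, chosen only to keep the example fast and offline.
config = DistilBertConfig(
    vocab_size=100, max_position_embeddings=32, n_layers=1, n_heads=2, dim=16, hidden_dim=32
)
sketch = DistilBertClassifierSketch(DistilBertModel(config), output_size=3)
sketch.eval()

input_ids = torch.randint(0, 100, (4, 8))  # batch of 4 sequences, 8 tokens each
with torch.no_grad():
    logits = sketch(input_ids, attention_mask=torch.ones_like(input_ids))
print(logits.shape)
```

In real use the encoder would be created with `DistilBertModel.from_pretrained(...)` so that the pre-trained weights are loaded instead of random ones.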

### Fine-tuning

In [None]:
import codecs

import numpy as np
import torch.nn as nn
import torch.optim as optim
from torch.utils.tensorboard import SummaryWriter
from transformers import get_linear_schedule_with_warmup

from thc.arena import run_experiment, TrainTestDataloaders
from thc.preprocessing import TOKENIZER


EPOCHS = 20


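artifacts_dir = REPOSITORY_DIR.joinpath("models", "distilbert-fine-tuning")
# parents=True also creates the intermediate "models" and "logs" folders if missing
artifacts_dir.mkdir(parents=True, exist_ok=True)
logs_dir = REPOSITORY_DIR.joinpath("logs", "distilbert-fine-tuning")
logs_dir.mkdir(parents=True, exist_ok=True)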
optimizer = optim.AdamW(model.parameters(), lr=3e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer=optimizer,
    # Warm up over the first 5% of all training steps (cast to a whole number of steps)
    num_warmup_steps=int(len(train_dataloader) * EPOCHS * 0.05),
    num_training_steps=len(train_dataloader) * EPOCHS,
)
# Apply balanced class weights
with codecs.open(PROCESSED_DATA_DIR.joinpath("train_tags.txt"), "r") as file:
    train_tags = [int(tag) for tag in file]
class_weights = torch.from_numpy(
    len(train_tags)
    / (len(np.unique(train_tags)) * np.bincount(train_tags))
).float().to(DEVICE)
print("Class weights:", class_weights)
objective = nn.CrossEntropyLoss(weight=class_weights)
train_test_dataloaders = TrainTestDataloaders(train=train_dataloader, test=valid_dataloader)
writer = SummaryWriter(log_dir=logs_dir)

run_experiment(
    model=model,
    dataloaders=train_test_dataloaders,
    tokenizer=TOKENIZER,
    device=DEVICE,
    optimizer=optimizer,
    objective=objective,
    epochs=EPOCHS,
    scheduler=scheduler,
    artifacts_dir=artifacts_dir,
    writer=writer,
)
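
Two quantities in the cell above are easy to check by hand: the balanced class weights, computed as `n_samples / (n_classes * count_per_class)`, and the learning-rate multiplier produced by a linear schedule with warmup. The snippet below is a minimal pure-Python sketch of both. The helper names `balanced_class_weights` and `linear_warmup_decay` and the toy data are illustrative assumptions, not part of `thc` or `transformers`, and the weights helper assumes every class appears at least once in the tags.

```python
from collections import Counter


def balanced_class_weights(tags):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(tags)
    n_samples = len(tags)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in sorted(counts.items())
    }


def linear_warmup_decay(step, num_warmup_steps, num_training_steps):
    """Learning-rate multiplier: linear ramp up to 1, then linear decay to 0."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


# Toy, imbalanced label list (hypothetical data): 6 / 3 / 1 samples per class.
toy_tags = [0] * 6 + [1] * 3 + [2] * 1
weights = balanced_class_weights(toy_tags)
print(weights)  # the rarest class receives the largest weight

# Multiplier at the start, at the end of warmup, and at the final step.
multipliers = [linear_warmup_decay(step, 5, 100) for step in (0, 5, 100)]
print(multipliers)
```

The weights are inversely proportional to class frequency, so the loss penalises mistakes on rare classes more heavily, while the schedule scales the base learning rate of `3e-5` by the multiplier at every optimizer step.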