## Fine-tuning regression on arbitrarily long texts

Data: https://www.cs.cornell.edu/people/pabo/movie-review-data/

Target task: predict a movie's rating by regression on a continuous 0.0-1.0 scale (the same scale the dataset uses, but treated as continuous).

### Problem:

Normally, BERT-style transformers accept no more than 512 input tokens, including the beginning-of-text and end-of-text markers. Some of the texts in our dataset approach 3K tokens.

### Possible solutions

1. Truncate the input size
2. Use a model with a larger maximum sequence length
3. Split the input text into separate chunks and use the same label for each chunk
4. Run the input text through a text summarization model and train on the summarized version (kind of like dimensionality reduction)
5. **Pooling / sliding window: split text into chunks, then take the average of each chunk's prediction**, as described here: https://github.com/google-research/bert/issues/27

I implement the last approach.
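
To make the pooling idea concrete before the real implementation further down, here is a minimal, library-free sketch. The `toy_scorer` function is a hypothetical stand-in for the fine-tuned model, and the token ids are made up; only the chunk-then-average structure reflects the approach described above.

```python
from statistics import mean

# Special token ids of the uncased BERT vocabulary (the same ids are used later in this notebook).
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def chunk_token_ids(token_ids, max_length=512):
    """Split a flat list of token ids into padded chunks of exactly max_length."""
    body = max_length - 2  # leave room for CLS and SEP in every chunk
    chunks = []
    for start in range(0, len(token_ids), body):
        chunk = [CLS_ID] + token_ids[start:start + body] + [SEP_ID]
        chunk += [PAD_ID] * (max_length - len(chunk))
        chunks.append(chunk)
    return chunks


def predict_pooled(token_ids, score_chunk, max_length=512):
    """Score every chunk separately and average the scores into one prediction."""
    chunks = chunk_token_ids(token_ids, max_length)
    return mean(score_chunk(chunk) for chunk in chunks)


def toy_scorer(chunk):
    """Hypothetical model stand-in: the fraction of non-padding tokens in the chunk."""
    return sum(1 for t in chunk if t != PAD_ID) / len(chunk)


fake_text = list(range(1000, 2200))  # 1200 made-up token ids
chunks = chunk_token_ids(fake_text)
pooled = predict_pooled(fake_text, toy_scorer)
print(len(chunks), [len(c) for c in chunks])
print(pooled)
```

A 1200-token text yields three chunks here, and every chunk has length 512 after padding, so chunks from the same text can be stacked into one tensor.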

Some resources I used:

https://towardsdatascience.com/how-to-apply-transformers-to-any-length-of-text-a5601410af7f

Good chunking solution, but only for inference, not fine-tuning.

#### Main problem I ran into:

I couldn't figure out how to batch inputs of different lengths, first when mapping the dataset for tokenization and then with torch's DataLoader: data instances can't be stacked into a single batch tensor because each one has a different number of chunks. I found this resource, which offers a solution:

https://github.com/mim-solutions/bert_for_longer_texts

It proposes simply forcing torch to return each batch as a list rather than a stacked tensor. It's a good enough solution, though it would probably slow training down if we had more data. I don't like all of the code in this repository, but I borrowed its pooling function.

The longest string in the dataset had 6 chunks, so its input_ids had the shape of [6, 512]. I would be interested to see how the model would behave if we padded every data instance with zero tensors to have the shape [6, 512].
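
One way the fixed-shape idea could look is sketched below with plain lists. It is an assumption on my part, not something tested in this notebook: each text is padded with all-padding chunks up to a fixed chunk count, and a chunk-level mask keeps those padded chunks out of the average. The names `pad_chunks` and `masked_mean` are hypothetical, and the toy sizes (3 chunks of 4 tokens) stand in for [6, 512].

```python
def pad_chunks(chunks, target_chunks, chunk_len, pad_id=0):
    """Pad a list of chunks with all-padding chunks up to target_chunks.

    Returns the padded chunks and a 0/1 mask marking which chunks are real.
    """
    if len(chunks) > target_chunks:
        raise ValueError("text has more chunks than target_chunks")
    n_missing = target_chunks - len(chunks)
    padded = chunks + [[pad_id] * chunk_len for _ in range(n_missing)]
    chunk_mask = [1] * len(chunks) + [0] * n_missing
    return padded, chunk_mask


def masked_mean(chunk_preds, chunk_mask):
    """Average predictions over real chunks only, ignoring padded ones."""
    real = [p for p, m in zip(chunk_preds, chunk_mask) if m]
    return sum(real) / len(real)


# Toy example: chunk_len=4 instead of 512, target_chunks=3 instead of 6.
text_chunks = [[101, 7, 8, 102], [101, 9, 102, 0]]
padded, mask = pad_chunks(text_chunks, target_chunks=3, chunk_len=4)
print(padded)
print(mask)

# Made-up per-chunk predictions; the last one belongs to the all-padding chunk.
pooled = masked_mean([0.6, 0.8, 0.1], mask)
print(pooled)
```

Without the mask, the prediction for the all-padding chunk would be averaged in and would pull the pooled rating toward whatever the model outputs for empty input.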

Finally, in order to fine-tune for regression, I simply specified `num_labels=1` while loading the pre-trained model, and used mean squared error as the loss function.

In order to evaluate how well the model performs, it would be a good idea to establish an MSE baseline (for instance, what is the MSE if every prediction is 0.5), and compare this to the current loss.
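
A baseline like this takes only a few lines. The sketch below uses made-up labels, not the real dataset, and compares two constant predictors: always 0.5, and always the mean of the training labels.

```python
def mse(predictions, labels):
    """Mean squared error between two equally long sequences."""
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels)


# Made-up ratings on the 0.0-1.0 scale, for illustration only.
train_labels = [0.3, 0.4, 0.6, 0.7, 0.5, 0.8, 0.5, 0.3]
test_labels = [0.2, 0.6, 0.9, 0.5]

# Baseline 1: always predict the middle of the scale.
constant_baseline = mse([0.5] * len(test_labels), test_labels)

# Baseline 2: always predict the mean training label.
mean_label = sum(train_labels) / len(train_labels)
mean_baseline = mse([mean_label] * len(test_labels), test_labels)

print(f"MSE when always predicting 0.5: {constant_baseline:.4f}")
print(f"MSE when always predicting the train mean ({mean_label:.4f}): {mean_baseline:.4f}")
```

A fine-tuned model is only useful if its test MSE is clearly below both numbers computed on the real splits.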

To improve this solution further, I'd look into pre-processing the text differently, tuning the hyperparameters, maybe using a learning rate scheduler, increasing the number of epochs and stopping when the validation loss begins to increase, and using a larger transformer model.
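
As an illustration of the learning rate scheduler idea, here is a sketch of a linear decay schedule written as a plain function. The function name and the `warmup_steps` parameter are my own; the base rate of 5e-5 and the 657 total steps (3 epochs of 219 batches) match the values used in this notebook. In practice, `transformers` provides `get_linear_schedule_with_warmup`, which produces this kind of schedule for a torch optimizer.

```python
def linear_decay_lr(step, total_steps, base_lr=5e-5, warmup_steps=0):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 3 * 219  # EPOCHS * batches per epoch

for step in (0, 219, 438, 657):
    print(step, linear_decay_lr(step, total_steps))
```

The rate starts at the base value and reaches zero on the last step, so later epochs take smaller updates than the first one.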

In [None]:
# I can't really use requirements.txt because my virtual env has a custom setup for Mac M1
!pip install torch
!pip install transformers
!pip install datasets
!pip install accelerate -U

In [None]:
import requests

data_link = 'https://www.cs.cornell.edu/people/pabo/movie-review-data/scale_data.tar.gz'

r = requests.get(data_link)
open('scale_data.tar.gz', 'wb').write(r.content)

4029756

In [None]:
import os

In [None]:
import tarfile

data_zip_path = 'scale_data.tar.gz'
data_folder = 'scale_data'
if os.path.isfile(data_zip_path) and not os.path.exists(data_folder):
    # the context manager closes the archive even if extraction fails
    with tarfile.open(data_zip_path, "r:gz") as tar:
        tar.extractall(data_folder)

In [None]:
import pandas as pd

In [None]:
data_path = os.path.join(data_folder, 'scaledata')

In [None]:
# read all data into a dataframe
frames = []
for author in os.listdir(data_path):
    # skip hidden files and folders
    if author[0] != '.':
        # one review per line
        texts_path = os.path.join(data_path, author, 'subj.' + author)
        with open(texts_path, 'r') as f:
            texts = f.read().strip().split('\n')

        # one rating per line, aligned with the reviews
        ratings_path = os.path.join(data_path, author, 'rating.' + author)
        with open(ratings_path, 'r') as f:
            ratings = [float(rating) for rating in f.read().strip().split('\n')]

        frames.append(pd.DataFrame({'text': texts, 'label': ratings}))

data = pd.concat(frames)

In [None]:
len(data)

5006

In [None]:
data.head()

|   | text | label |
|---|------|-------|
| 0 | i'm guessing -- and from the available evidenc... | 0.0 |
| 1 | there's bad buzz , and then there's the the ba... | 0.0 |
| 2 | director : richard rush . director richard rus... | 0.0 |
| 3 | screenplay : johnny brennan & kamal ahmed and ... | 0.0 |
| 4 | screenplay : tim burns & tom stern and anthony... | 0.1 |


In [None]:
import torch
from transformers import AutoTokenizer
from datasets import Dataset
from torch.utils.data import DataLoader

In [None]:
dataset = Dataset.from_pandas(data, preserve_index=False)

In [None]:
# an attempt at memory management
del data

In [None]:
MAX_LENGTH = 512
BATCH_SIZE = 16
EPOCHS = 3
MODEL_NAME = 'distilbert-base-uncased'

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)


In [None]:
# custom dataset creation and batching
def custom_tokenize(dataset_dict, max_length=MAX_LENGTH):
    # tokenize the whole text without special tokens; they are added per chunk below
    tokens = tokenizer(dataset_dict['text'],
                       add_special_tokens=False,
                       return_tensors='pt')

    # split into chunks of max_length - 2 tokens (510 by default), leaving room for CLS and SEP;
    # split() returns an immutable tuple, so convert to a list
    input_id_chunks = list(tokens['input_ids'][0].split(max_length - 2))
    mask_chunks = list(tokens['attention_mask'][0].split(max_length - 2))

    # take the special token ids from the tokenizer instead of hard-coding 101 / 102 / 0
    cls_id = torch.tensor([tokenizer.cls_token_id])
    sep_id = torch.tensor([tokenizer.sep_token_id])
    one = torch.tensor([1])

    # loop through each chunk
    for i in range(len(input_id_chunks)):
        # add CLS and SEP tokens to input IDs
        input_id_chunks[i] = torch.cat([cls_id, input_id_chunks[i], sep_id])
        # CLS and SEP must be attended to, so extend the attention mask with ones
        mask_chunks[i] = torch.cat([one, mask_chunks[i], one])
        # get required padding length
        pad_len = max_length - input_id_chunks[i].shape[0]
        if pad_len > 0:
            # pad with integer tensors so the dtype stays long
            input_id_chunks[i] = torch.cat([
                input_id_chunks[i],
                torch.full((pad_len,), tokenizer.pad_token_id, dtype=torch.long)
            ])
            mask_chunks[i] = torch.cat([
                mask_chunks[i], torch.zeros(pad_len, dtype=torch.long)
            ])

    # check length of each tensor
    for chunk in input_id_chunks:
        assert chunk.shape[0] == max_length

    # shape of both tensors: [number_of_chunks, max_length]
    input_ids = torch.stack(input_id_chunks)
    attention_mask = torch.stack(mask_chunks)

    return {
        'input_ids': input_ids.long(),
        'attention_mask': attention_mask.int()
    }

In [None]:
# this can't be batched due to different token lengths
tokenized_dataset = dataset.map(custom_tokenize)

Map:   0%|          | 0/5006 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (518 > 512). Running this sequence through the model will result in indexing errors


In [None]:
del dataset

In [None]:
# mapping stores the returned tensors as lists, so set the format to get torch tensors back
# https://github.com/huggingface/datasets/issues/1046
tokenized_dataset.set_format(type='torch')

In [None]:
train_test_valid = tokenized_dataset.train_test_split(test_size=0.3)
train_dataset = train_test_valid['train']
print(train_dataset)
test_valid = train_test_valid['test'].train_test_split(test_size=0.5)
valid_dataset = test_valid['train']
print(valid_dataset)
test_dataset = test_valid['test']
print(test_dataset)

Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 3504
})
Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 751
})
Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 751
})


In [None]:
del tokenized_dataset

In [None]:
# by default, torch stacks tensors, but our tensors have different numbers of chunks, so the default method will throw an error
# this forces DataLoader to return lists of per-text tensors instead of one stacked tensor
def collate_fn_pooled_tokens(data):
    input_ids = [item['input_ids'] for item in data]
    attention_mask = [item['attention_mask'] for item in data]
    labels = torch.Tensor([item['label'] for item in data])

    return [input_ids, attention_mask, labels]

In [None]:
train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, collate_fn=collate_fn_pooled_tokens)
valid_dataloader = DataLoader(valid_dataset, batch_size=BATCH_SIZE, collate_fn=collate_fn_pooled_tokens)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE, collate_fn=collate_fn_pooled_tokens)

In [None]:
del train_dataset
del valid_dataset
del test_dataset

In [None]:
from transformers import AutoModelForSequenceClassification
from torch.optim import AdamW

In [None]:
# for regression problems num_labels=1
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
optimizer = AdamW(model.parameters(), lr=5e-5)

In [None]:
num_training_steps = EPOCHS * len(train_dataloader)

In [None]:
# mps is Mac M1
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")
model.to(device)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [None]:
# calculate one pooled prediction per text
def evaluate_single_batch(model, batch):
    input_ids = batch[0]
    attention_mask = batch[1]
    # remember how many chunks each text has, so predictions can be regrouped later
    number_of_chunks = [len(x) for x in input_ids]

    # concatenate the chunks of all texts into one [total_chunks, MAX_LENGTH] batch
    input_ids_combined = torch.cat(input_ids).to(device)
    attention_mask_combined = torch.cat(attention_mask).to(device)

    # get model predictions for the combined batch
    preds = model(input_ids=input_ids_combined, attention_mask=attention_mask_combined)
    preds_logits = preds.logits.flatten().cpu()

    # split the flat predictions back into one group per text
    preds_logits_split = preds_logits.split(number_of_chunks)
    # the final prediction is the average of all predictions per text chunk
    pooled_preds = torch.cat([torch.mean(x).reshape(1) for x in preds_logits_split])

    return pooled_preds

In [None]:
from tqdm.auto import tqdm
from torch.nn import MSELoss

In [None]:
progress_bar = tqdm(range(num_training_steps))
mse = MSELoss()
train_dataloader_length = len(train_dataloader)
val_dataloader_length = len(valid_dataloader)
test_dataloader_length = len(test_dataloader)

for epoch in range(EPOCHS):
    # enable dropout for training (it is switched off for validation below)
    model.train()
    all_train_loss = 0.0
    for step, batch in enumerate(train_dataloader):
        labels = batch[-1].float().cpu()
        predictions = evaluate_single_batch(model, batch)
        loss = mse(predictions, labels)
        all_train_loss += loss.item()
        loss.backward()

        if step > 0 and step % 50 == 0:
            # running mean of the training loss over the batches seen so far in this epoch
            print('Train loss at batch ' + str(step) + ':', "{0:.4f}".format(all_train_loss/(step+1)))

        optimizer.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # disable dropout so the validation loss is not distorted by it
    model.eval()
    with torch.no_grad():
        all_val_loss = 0.0
        for batch in valid_dataloader:
            labels = batch[-1].float().cpu()
            predictions = evaluate_single_batch(model, batch)
            all_val_loss += mse(predictions, labels).item()

    print('Epoch', epoch)

    epoch_train_loss = all_train_loss / train_dataloader_length
    print('Train loss:', "{0:.4f}".format(epoch_train_loss))

    epoch_val_loss = all_val_loss / val_dataloader_length
    print('Val loss:', "{0:.4f}".format(epoch_val_loss))
    print('###########################\n')

  0%|          | 0/657 [00:00<?, ?it/s]

Train loss at batch 50: 0.0366
Train loss at batch 100: 0.0279
Train loss at batch 150: 0.0248
Train loss at batch 200: 0.0223
Epoch 0
Train loss: 0.0219
Val loss: 0.0197
###########################

Train loss at batch 50: 0.0134
Train loss at batch 100: 0.0116
Train loss at batch 150: 0.0122
Train loss at batch 200: 0.0114
Epoch 1
Train loss: 0.0113
Val loss: 0.0118
###########################

Train loss at batch 50: 0.0100
Train loss at batch 100: 0.0086
Train loss at batch 150: 0.0078
Train loss at batch 200: 0.0076
Epoch 2
Train loss: 0.0075
Val loss: 0.0145
###########################
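
In the run above, the validation loss drops to 0.0118 in epoch 1 and rises to 0.0145 in epoch 2 while the training loss keeps falling, which is the pattern early stopping is meant to catch. Below is a minimal sketch of such a check; the `EarlyStopping` class is hypothetical and not part of this notebook's training loop, and it is fed the three validation losses printed above.

```python
class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=1, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


val_losses = [0.0197, 0.0118, 0.0145]  # validation losses printed by the run above
stopper = EarlyStopping(patience=1)
stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    if stopper.step(epoch, val_loss):
        stopped_at = epoch
        print(f"stop after epoch {epoch}; best was epoch {stopper.best_epoch}")
        break
```

In a real loop, the model weights would also be saved whenever `best_epoch` changes, so that the best checkpoint rather than the last one is evaluated on the test set.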



In [None]:
all_test_loss = 0

# make sure dropout is off while measuring the test loss
model.eval()
with torch.no_grad():
    for batch in test_dataloader:
        labels = batch[-1].float().cpu()
        predictions = evaluate_single_batch(model, batch)
        all_test_loss += mse(predictions, labels).item()

all_test_loss / test_dataloader_length

0.012520544201214897
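
An MSE is easier to interpret once converted to a root mean squared error, which is in the same units as the ratings. The short calculation below does that for the test loss reported above. Note that this loss is a mean of per-batch MSEs, so it only approximates the MSE over the whole test set when the last batch is smaller than the others.

```python
import math

test_mse = 0.012520544201214897  # test loss reported above
test_rmse = math.sqrt(test_mse)
print(f"RMSE: {test_rmse:.3f}")
```

This puts the typical prediction error at roughly 0.11 on the 0.0-1.0 rating scale.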

In [None]:
# sanity check: see if the predicted ratings make sense
model.eval()
with torch.no_grad():
    for batch in test_dataloader:
        labels = batch[-1].float().cpu()
        predictions = evaluate_single_batch(model, batch)

        # print (label, prediction) pairs for the first batch only;
        # use len(labels) because a batch can be smaller than BATCH_SIZE
        for ind in range(len(labels)):
            print("{0:.2f}".format(labels[ind].item()),
                  "{0:.2f}".format(predictions[ind].item()))
        break

0.30 0.38
0.40 0.33
0.60 0.63
0.70 0.55
0.50 0.62
0.80 0.73
0.50 0.55
0.30 0.43
0.50 0.35
0.60 0.55
0.30 0.48
0.44 0.37
0.40 0.50
0.53 0.39
0.60 0.62
0.60 0.44


In [None]:
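# note: this pickles the whole model object, so loading it later requires the same model class to be importable
torch.save(model, 'my_model')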