# Recovering columns hidden by the 🤗 Trainer

- comments: false
- categories: [til,nlp,huggingface,transformers]
- badges: true

In [None]:
#hide
# uncomment if running on Colab
# !pip install transformers datasets pandas

In [1]:
#hide
import warnings
import datasets
import transformers

warnings.filterwarnings("ignore")
datasets.logging.set_verbosity_error()
transformers.logging.set_verbosity_error()

Lately, I've been using the `transformers` `Trainer` together with the `datasets` library, and I was a bit mystified by the disappearance of some columns from the training and validation sets after fine-tuning. It wasn't until I saw [Sylvain Gugger's](https://twitter.com/GuggerSylvain?s=20) tutorial on [question answering](https://github.com/huggingface/notebooks/blob/master/examples/question_answering.ipynb) that I realised this is by design! Indeed, as noted in the [docs](https://huggingface.co/transformers/main_classes/trainer.html?highlight=trainer#id1){% fn 1 %} for the `train_dataset` and `eval_dataset` arguments of the `Trainer`:

> If it is an `datasets.Dataset`, columns not accepted by the `model.forward()` method are automatically removed.

A simple one-liner to restore the missing columns is the following:

```python
dataset.set_format(type=dataset.format["type"], columns=list(dataset.features.keys()))
```

To understand _why_ this works, we can peek inside the relevant `Trainer` code:

In [164]:
??Trainer._remove_unused_columns

Signature:
Trainer._remove_unused_columns(
    self,
    dataset: 'datasets.Dataset',
    description: Union[str, NoneType] = None,
)
Docstring: <no docstring>
Source:
    def _remove_unused_columns(self, dataset: "datasets.Dataset", description: Optional[str] = None):
        if not self.args.remove_unused_columns:

and see that we're effectively undoing the final `dataset.set_format()` operation, which restricts the visible columns to those accepted by `model.forward()`.
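To make the filtering idea concrete, here is a minimal sketch that uses only the standard library. It is my paraphrase of the logic, not the library's actual code: `columns_accepted_by` and `toy_forward` are hypothetical names, and the assumption that label columns are always kept may not hold for every `transformers` version.

```python
import inspect


def columns_accepted_by(forward, column_names):
    """Return the columns whose names match an argument of `forward`.

    A rough paraphrase of the filtering idea; the real Trainer method
    may differ between versions.
    """
    signature_columns = list(inspect.signature(forward).parameters)
    # Assumption: label columns are kept regardless of the signature.
    signature_columns += ["label", "label_ids"]
    return [name for name in signature_columns if name in column_names]


# Hypothetical stand-in for `model.forward`, for illustration only.
def toy_forward(input_ids=None, attention_mask=None, labels=None):
    return None


all_columns = ["attention_mask", "idx", "input_ids", "label", "sentence"]
kept = columns_accepted_by(toy_forward, all_columns)
print(kept)
```

Columns such as `idx` and `sentence` match no argument name, so a `set_format` call restricted to `kept` would hide them without deleting any data.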

## A simple example

To see this in action, let's grab the first 1,000 training examples from the CoLA dataset:

In [165]:
from datasets import load_dataset

cola = load_dataset('glue', 'cola', split='train[:1000]')
cola

Dataset({
    features: ['sentence', 'label', 'idx'],
    num_rows: 1000
})

Here we can see that the dataset has three `Dataset.features`: `sentence`, `label`, and `idx`. By inspecting the `Dataset.format` attribute

In [166]:
cola.format

{'type': None,
 'format_kwargs': {},
 'columns': ['idx', 'label', 'sentence'],
 'output_all_columns': False}

we also see that the `type` is `None`. Next, let's load a pretrained model and its corresponding tokenizer:

In [167]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

num_labels = 2
model_name = 'distilbert-base-uncased'
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = (AutoModelForSequenceClassification
         .from_pretrained(model_name, num_labels=num_labels)
         .to(device))

Before fine-tuning the model, we need to tokenize and encode the dataset, so let's do that with a simple `Dataset.map` operation:

In [168]:
def tokenize_and_encode(batch): 
    return tokenizer(batch['sentence'], truncation=True)

cola_enc = cola.map(tokenize_and_encode, batched=True)
cola_enc

Dataset({
    features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence'],
    num_rows: 1000
})

Note that the encoding process has added two new `Dataset.features` to our dataset: `attention_mask` and `input_ids`. Since we don't care about evaluation, let's create a minimal trainer and fine-tune the model for one epoch:

In [169]:
from transformers import TrainingArguments, Trainer

batch_size = 16
logging_steps = len(cola_enc) // batch_size

training_args = TrainingArguments(
    output_dir="results",
    num_train_epochs=1,
    per_device_train_batch_size=batch_size,
    disable_tqdm=False,
    logging_steps=logging_steps)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=cola_enc,
    tokenizer=tokenizer)

trainer.train();

| Step | Training Loss |
|------|---------------|
| 62 | 0.630255 |


By inspecting one of the training examples

In [170]:
cola_enc[0]

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'input_ids': [101,
  2256,
  2814,
  2180,
  1005,
  1056,
  4965,
  2023,
  4106,
  1010,
  2292,
  2894,
  1996,
  2279,
  2028,
  2057,
  16599,
  1012,
  102],
 'label': 1}

it seems that we've lost our `sentence` and `idx` columns! However, by inspecting the `features` attribute

In [171]:
cola_enc.features

{'attention_mask': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'idx': Value(dtype='int32', id=None),
 'input_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'label': ClassLabel(num_classes=2, names=['unacceptable', 'acceptable'], names_file=None, id=None),
 'sentence': Value(dtype='string', id=None)}

we see that they are still present in the dataset. Applying our one-liner to restore them gives the desired result:

In [173]:
cola_enc.set_format(type=cola_enc.format["type"], columns=list(cola_enc.features.keys()))
cola_enc[0]

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'idx': 0,
 'input_ids': [101,
  2256,
  2814,
  2180,
  1005,
  1056,
  4965,
  2023,
  4106,
  1010,
  2292,
  2894,
  1996,
  2279,
  2028,
  2057,
  16599,
  1012,
  102],
 'label': 1,
 'sentence': "Our friends won't buy this analysis, let alone the next one we propose."}
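If you find yourself doing this often, the check and the fix can be wrapped in two small helpers. This is just a sketch: `hidden_columns` and `restore_columns` are names I made up, and they only assume that the object exposes `features`, `format` and `set_format` in the way shown above.

```python
def hidden_columns(dataset):
    """List columns present in `dataset.features` but hidden by its format."""
    visible = set(dataset.format["columns"])
    return sorted(name for name in dataset.features if name not in visible)


def restore_columns(dataset):
    """Make every column visible again, keeping the current output type."""
    dataset.set_format(
        type=dataset.format["type"],
        columns=list(dataset.features.keys()),
    )
```

Calling `hidden_columns(cola_enc)` right after training should list `idx` and `sentence`, and after `restore_columns(cola_enc)` it should come back empty.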


## Footnotes

{{ "Proof positive that I only read documentation after some threshold of confusion." | fndetail: 1 }}