# Using data collators with 🤗 Transformers and Datasets
> Calculating top losses by example

- toc: true 
- badges: false
- comments: false
- categories: [til,huggingface,transformers,nlp,datasets]
- image: images/icml.png

In [1]:
#hide
import warnings
import datasets
import transformers

warnings.filterwarnings("ignore")
datasets.logging.set_verbosity_error()
transformers.logging.set_verbosity_error()

Recently, [Sylvain Gugger](https://twitter.com/GuggerSylvain?s=20) from Hugging Face created some nice tutorials on using `transformers` for [text classification](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/text_classification.ipynb) and [named entity recognition](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification.ipynb#scrollTo=545PP3o8IrJV). One trick that caught my attention was the use of a _data collator_ in the trainer, which automatically pads the model inputs in a batch to the length of the longest example in that batch. This bypasses the need to set a _global_ maximum sequence length, and in practice leads to faster training since fewer computations are wasted on padding tokens.
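
To make the saving concrete, here is a minimal pure-Python sketch of the idea. It is not the `transformers` implementation: the helpers `pad_to_longest` and `count_padding`, the pad id of 0, the toy token ids, and the global maximum length of 8 are all assumptions made up for illustration.

```python
# Illustrative sketch only: these helpers are hypothetical, not part of transformers,
# and 0 is an assumed pad token id.
def pad_to_longest(sequences, pad_id=0, max_length=None):
    """Pad every sequence to `max_length`, or to the longest sequence if None."""
    target = max_length if max_length is not None else max(len(s) for s in sequences)
    return [s + [pad_id] * (target - len(s)) for s in sequences]

def count_padding(padded, pad_id=0):
    """Count how many positions in the padded batch hold the pad token."""
    return sum(token == pad_id for seq in padded for token in seq)

# Three toy examples of lengths 3, 5 and 2
toy_batch = [[101, 7, 102], [101, 7, 8, 9, 102], [101, 102]]

dynamic = pad_to_longest(toy_batch)              # pad to the longest example (5)
fixed = pad_to_longest(toy_batch, max_length=8)  # pad to an assumed global length

print(count_padding(dynamic))  # 5
print(count_padding(fixed))    # 14
```

With per-batch padding the model processes 5 pad tokens here instead of 14, and the gap grows when most examples are much shorter than the global maximum.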

I wanted to use a data collator for both training _and_ error analysis (e.g. by inspecting the top losses of the model). One problem: with `batched=True`, `Dataset.map` passes each batch to the function as a dictionary of lists, while the data collators in `transformers` expect a list of dictionaries. The fix is a one-line conversion inside the processing function:

```python
# get from Trainer.data_collator
data_collator = ...

def processing_function(batch):
    # convert dict of lists to list of dicts
    features = [dict(zip(batch, t)) for t in zip(*batch.values())]
    # pad inputs and (possibly) labels
    batch = data_collator(features)
    ...
    return batch
```
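
To see what the conversion line does on its own, here is a small sketch with a made-up batch (the token ids and labels are toy values, not real tokenizer output). It relies on the fact that iterating over a dictionary yields its keys, so `zip(toy_batch, t)` pairs each column name with that example's value.

```python
# Toy batch in the dict-of-lists layout that Dataset.map passes with batched=True.
# All values are invented for illustration.
toy_batch = {
    "input_ids": [[101, 7, 102], [101, 7, 8, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1]],
    "label": [0, 1],
}

# zip(*values) walks the columns in lockstep, yielding one tuple per example;
# dict(zip(keys, tuple)) then rebuilds a single-example dictionary.
features = [dict(zip(toy_batch, t)) for t in zip(*toy_batch.values())]

print(len(features))  # 2
print(features[1])
# {'input_ids': [101, 7, 8, 102], 'attention_mask': [1, 1, 1, 1], 'label': 1}
```

Each element of `features` is now one example, which is the layout a collator can pad.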

In [1]:
from transformers import DataCollatorWithPadding

In [4]:
data_collator = DataCollatorWithPadding(tokenizer)

In [None]:
batch = {'input_ids': [1, 2, 3], 'attention_mask': [1, 1, 1]}

In [2]:
from datasets import load_dataset

imdb = load_dataset('imdb', split='train').train_test_split(train_size=750, test_size=250)
imdb

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 750
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 250
    })
})

In [3]:
#hide_output
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

num_labels = 2
model_name = 'distilbert-base-cased'
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels).to(device)


In [4]:
def tokenize(batch):
    return tokenizer(batch['text'], truncation=True)

imdb_enc = imdb.map(tokenize, batched=True)
imdb_enc

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'input_ids', 'label', 'text'],
        num_rows: 750
    })
    test: Dataset({
        features: ['attention_mask', 'input_ids', 'label', 'text'],
        num_rows: 250
    })
})

In [5]:
imdb_enc['test'][:1].keys()

odict_keys(['attention_mask', 'input_ids', 'label', 'text'])

In [6]:
import numpy as np
from datasets import load_metric

accuracy_score = load_metric("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy_score.compute(predictions=predictions, references=labels)

In [7]:
from transformers import TrainingArguments

batch_size = 16
logging_steps = len(imdb_enc['train']) // batch_size

training_args = TrainingArguments(
    output_dir="results",
    num_train_epochs=1,
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    disable_tqdm=False,
    logging_steps=logging_steps,
)

In [8]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=imdb_enc['train'],
    eval_dataset=imdb_enc['test'],
    tokenizer=tokenizer
)

trainer.train();

Epoch,Training Loss,Validation Loss,Accuracy
1,0.657553,0.601859,0.812


In [9]:
data_collator = trainer.data_collator

In [10]:
def forward_pass_with_label(batch):
    # Keep only the columns the collator can pad (raw text cannot become a tensor)
    columns = ["input_ids", "attention_mask", "label"]
    # Convert dict of lists to list of dicts
    features = [dict(zip(columns, t)) for t in zip(*(batch[c] for c in columns))]
    # Pad inputs to the longest example in the batch; the collator returns
    # PyTorch tensors and renames "label" to "labels"
    batch = data_collator(features)
    input_ids = batch["input_ids"].to(device)
    attention_mask = batch["attention_mask"].to(device)
    labels = batch["labels"].to(device)

    with torch.no_grad():
        output = model(input_ids, attention_mask)
        batch["predicted_label"] = torch.argmax(output.logits, axis=1)
        # reduction="none" keeps one loss per example instead of averaging
        loss = torch.nn.functional.cross_entropy(output.logits, labels, reduction="none")

    batch["loss"] = loss

    # Datasets expects NumPy arrays (or lists), so move tensors back to the CPU
    return {k: v.cpu().numpy() for k, v in batch.items()}
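
The `reduction="none"` argument is what makes this function useful for error analysis: instead of one averaged number per batch, we get one loss per example. As a rough pure-Python sketch of that computation (the helper `per_example_cross_entropy` and the toy logits are made up for illustration; in the notebook `torch.nn.functional.cross_entropy` does the work):

```python
import math

# Hypothetical helper for illustration; not part of torch or transformers.
def per_example_cross_entropy(logits, labels):
    """Return -log softmax(logits)[label] for each example."""
    losses = []
    for row, label in zip(logits, labels):
        # log-sum-exp with the max subtracted for numerical stability
        m = max(row)
        log_norm = m + math.log(sum(math.exp(x - m) for x in row))
        losses.append(log_norm - row[label])
    return losses

# Toy two-class logits: confident and right, undecided, confident and wrong
toy_logits = [[2.0, 0.0], [0.0, 0.0], [-1.0, 1.0]]
toy_labels = [0, 1, 0]

losses = per_example_cross_entropy(toy_logits, toy_labels)
print([round(loss, 4) for loss in losses])  # [0.1269, 0.6931, 2.1269]
```

For a two-class problem, a loss close to log 2 ≈ 0.693 means the model is essentially undecided on that example, while larger values indicate a confident wrong prediction.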

In [11]:
losses_ds = imdb_enc['test'].map(forward_pass_with_label, batched=True, batch_size=16, remove_columns=['input_ids'])
losses_ds.set_format('pandas')
losses_df = losses_ds[:][['label', 'predicted_label', 'loss']]

In [12]:
losses_df['text'] = imdb['test']['text']

In [13]:
losses_df.head()

Unnamed: 0,label,predicted_label,loss,text
0,1,0,0.69818,"While the original 1932 version, with Preston ..."
1,1,1,0.459517,Ironically the most talked-about American film...
2,0,0,0.602936,"Bingo is the game, bullshit is the name. Rarel..."
3,1,1,0.475849,There are so many reasons as to why I rate the...
4,0,0,0.60692,I thought watching employment videos on corpor...


In [15]:
import pandas as pd
pd.set_option("display.max_colwidth", None)
losses_df.sort_values("loss", ascending=False).head()

Unnamed: 0,label,predicted_label,loss,text
199,0,1,1.011133,"It has been said, ""a city on hill cannot hide itself"" and Virginia City, Nevada, perched on the side of Mt. Davidson at 6200 ft. west of Tahoe, is a prime example, or in the context of the movie, should be. Virginia City exploded in the American dream as a shower of gold and silver, suspiciously the same year the Civil War began. It was the birthplace of the dean of American letters; it was where a young reporter named Samuel Clemens began using the name ""Mark Twain"" and went on to become America's most famous writer. It was also the birthplace of the great Hearst fortune, and the launching pad of John Mackay, who became the wealthiest man in America, the third wealthiest man in the world. Hey, they should have made the movie about him! In the 1860's Virginia CIty was THE boomtown of all boomtowns, the home of the big bonanza, at one time the largest ""metropolitan"" area west of St. Louis and East of San Francisco. But Virginia City (the movie) misses all that and is more about a hogwash North/South duello between the characters played by Errol Flynn and Randolph Scott. Flynn is Capt. Kerry Bradford, a Union officer who is a POW in a concentration camp run by a mean Confederate commander named Capt. Vance Irby, played by Scott. These two are always getting in each other's way. Bradford escapes and then tries to stop a shipment of gold bullion being ""snuck"" out of VC by who else other than . . . Irby! ""Hey, what's he doing here!?"" Horrible. Bogart plays a laughable Mexican bandit who can't decide who's side he's on. Miriam Hopkins plays a murky character named ""Julia Hayne"", obviously a historical lunge at the town's first lady, Julia Bulette, who in real life a celebrated prostitute. She goes to Washington and talks Honest Abe about saving BRADFORD (not Irby) from hanging and blah blah blah. Go figure. They shoulda hung the writer. 
In ""real life"" Twain reports that on the last day of the War, the setting sun caused the American flag atop Mt. Davidson to appear to the puzzled residents to be weirdly on fire, kind of like the movie. Three days later they discovered that on that day the South capitulated. One interesting quirk in the film is how sidekicks Alan Hale and Guin Williams flick their pistols forward when they shoot, like they're fishing, or trying to make the bullets go faster. Not a bad idea for the movie. The same kind of goofiness is lathered over sap and corn throughout the movie. Gosh, how could they miss the gold madness, profligate wealth, gun battles in the silver mines, Mark Twain getting run out of town and beat up after a showdown, the crooked railroad, the Opera House fire, Artemis Ward, Bulette's huge funeral, the Chinese tongs, the black saloons, the Auction . . ? All this high on a mountain surrounded by desert? The truth was unreal. Did its fabulous wealth actually spark the great American holocaust? Well, if you count this movie, it wouldn't be the first debacle to come out of Virginia City. It's a disappointment for Virginia City fans because it misses what made the town a ""city of illusions,"" where it is said evil seeps out of the ground . . . Okay, other than that it's a fun movie. Flynn and the gang are always great no matter what history they're destroying. If Flynn would just play his rotten self I'd double my rating."
129,1,0,0.966005,"This movie has everything that makes a bad movie worth watching - sloppy editing, little to no continuity, insane dialog, bad (you might even say non-existent) acting, pointless story lines, shots that go on FAR too long...and it's perfect for MST3K-style riffing, not to mention the ""Corpse Eaters Drinking Game"": Scribble on forms...take a shot - Sign your name...take a shot - Catch a bad Foley edit...take many, many shots.<br /><br />The only reason I didn't rate it higher than 8 is because there's not enough gratuitous nudity and because despite its insane badness, it's only an hour long - hell, a movie like this should have been at least 20-30 minutes longer!"
102,0,1,0.928655,"Imagine that in adapting a James Bond novel into a movie, the filmmakers eliminated all the action and suspense in order to make it kid-friendly. Or if a television producer told Chris Rock he couldn't cuss so that his specials could be rated PG. In the same way, the director of the movie ""Something Wicked This Way Comes"" took out the excitement and gore in favor of melodrama for younger audiences. This created a monotonous plot without the complications of the book. In trying to make the story of ""Something Wicked This Way Comes"" easier for children to follow, the filmmakers eliminated the theme of good and evil both existing in everyone, and good always prevailing over evil. This is apparent in Will's character transformation, Charles Halloway's rescue of Jim, and the carnival's defeat.<br /><br />Will's transformation into a more adventurous boy has been muted in the movie. The scene in which the Dust Witch visits Will's house in a balloon has been cut from the film. Instead, a green mist follows Jim and Will home and gives them the same bad dream about the Witch and her spiders. The balloon attack shows us that Will has begun to conquer his fear of doing things on his own. He gets on top of a neighbor's roof and tears the balloon with a bow, defeating the Witch. ""Sorry, Dad, he thought, and sat up, smiling. This time it's me out, alone,"" Will decides as he prepares to face her (147.) Removing this scene from the movie prevents us from understanding that Will is becoming more adventuresome. The film shows us many examples of Will being afraid to follow Jim, but never growing curious like his friend. In the book, Will has both a good, quiet side and an ""evil,"" daring side like his Jim. 
In the movie, each boy only has one mode of thought, which destroys Bradbury's theme of both good and evil being present in each person.<br /><br />In the book, Will saves Jim because of their friendship, but, in the movie, Charles Halloway saves Jim to repay Jim's father. Will pulls Jim off the carousel in Bradbury's novel because he doesn't want his best friend to grow up without him. It is the good in Will, the fact that he cares about his friend, that saves Jim from the evil curse of the carnival. On the carousel, Jim ""gestured his other hand free to trail on the wind, the one part of him, the small, white, separate part that still remembered their friendship"" (269.) This shows that there was good left inside of Jim, which has the potential to still defeat evil. But when Charles Halloway saves Jim in the movie, he does it to repay the debt he owes Jim's father, who saved Will when he was a little boy. By changing the motivation for saving Jim, the filmmakers have ruined Bradbury's original idea that it takes good to win against evil.<br /><br />In the end of the movie, the carnival is defeated by a tornado and lightning instead of smiles and laughter. When the book ends, Mr. Dark turns himself into a little boy and Charles Halloway smiles and laughs at him so much that he can't stand it and evaporates. In Bradbury's world, evil people feed off fear and can only be defeated by happiness and love. His message is that good will always prevail over evil, but only if that goodness is expressed outwardly. ""Good to evil seems evil,"" says Charles Halloway as he holds the dying Mr. Dark. ""So I will do only good to you, Jed. I'll simply hold you and watch you poison yourself"" (275.) In the movie, Mr. Dark is the only one left on the carousel when lightning hits it, and he dies. By eliminating the weapon of laughter and smiles, the filmmakers imply that bad weather is the most effective way to defeat evil, as if lightning only strikes those who are bad. 
This takes away the major theme of Bradbury's book, which is that doing ""good"" toward others wards off evil.<br /><br />Good may always triumph over evil, but trying to make movies more kid-friendly will always force filmmakers to leave out some of the themes from the books they are based on. In the movie, ""Something Wicked This Way Comes,"" Will does not transform, Will's friendship does not save Jim, and smiles and laughter do not defeat the carnival. As a result, the filmmakers have left out too many of Bradbury's main points. The process of adapting a book to a movie too often ruins the world the author has established. In the case of this story, Bradbury's frightening world of opposing forces of good and evil has been reduced to a tamer, simpler version of itself."
65,0,1,0.921415,"I'm a Jean Harlow fan, because she had star quality. I don't think her movies are good and I don't even think that she was a good actress, but she certainly was Great in comedies. Every bit of comedy in The Girl from Missouri is very good. But this movie is perhaps more like a love story. Jean Harlow is wonderful in this one and you can forget the rest of the cast - their performances bring nothing new. It always impresses me much to think that Harlow's beautiful body was that of an ill woman. Well, in this movie she does look beautiful."
131,0,1,0.901468,"This is by far one of the most boring and horribly acted accounts of the early days of Adolf Hitler that I have ever watched. Robert Carlyle is a wonderful actor, but to cast him as Hitler is just plain wrong. To cast Liev Schrieber as Hitler's longtime friend and aid, Haefengstal must have emitted cries of despair and anguish from the Simon Wiesenthal Centre. A J-W playing a Nazi supporter, bad bad bad casting. This was not an enjoyable family film with a good historical background. This was Hollywood rubbish at its finest, cashing in on the strength of a strong (but sorely under utilized) supporting cast of actors whom seemed to have all but disappeared from the acting radar in the past 5 years.<br /><br />The fake German accents (vee vill vin zis var) is insulting to German people everywhere. My mother is German and she sat fuming at the sound of the voices which kept switching from American/English/German all in the same sentence. The supporting cast make better cardboard cutouts at the local video store than they do on screen. Jenna Malone as the fated Geli Raubal, was splendid though, she captured the innocence and confusion of this tragic young woman who ultimately ended her own life to escape what her future would have been like in Hitler's shadow.<br /><br />If you would like a tremendously fantastic and historically accurate account of Hitler's early years leading up to and including the war/holocaust, rent ""Inside the Third Reich"" 1983 starring Rutger Hauer as Albert Speer and Derek Jacobi as Hitler. It was good and made more sense then this baloney.<br /><br />As a historical researcher of the Third Reich I can honestly tell you, this had me reaching for my books to confirm its myriad of inaccuracies."
