In [34]:
import transformers
from datasets import load_dataset, DatasetDict
import evaluate
print(transformers.__version__)

4.32.1


## Loading the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and the [🤗 Evaluate](https://github.com/huggingface/evaluate) library to get the metric we need for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `evaluate.load`.

We pass the dataset name `"imdb"` directly to `load_dataset`, which will cache the dataset to avoid downloading it again the next time you run this cell.

In [7]:
dataset = load_dataset("imdb")
accuracy_metric = evaluate.load("accuracy")

Downloading builder script:   0%|          | 0.00/4.31k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/2.17k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/7.59k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/84.1M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

The `dataset` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key per split: here `train`, `test` and `unsupervised`.

In [8]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [35]:
# Reference: https://discuss.huggingface.co/t/how-to-split-main-dataset-into-train-dev-test-as-datasetdict/1090/13
# Split the original train split into train and validation sets in an 80:20 ratio
train_plus_validation = dataset["train"].train_test_split(test_size=0.2)
dataset = DatasetDict({
    'train': train_plus_validation['train'],
    'validation': train_plus_validation['test'],
    'test': dataset["test"],
})

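The resulting `DatasetDict` now has `train`, `validation` and `test` splits. To access an actual element, you need to select a split first, then give an index (for instance `dataset["train"][0]`):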

In [36]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 20000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
})
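To make the 80:20 split more concrete, here is a minimal sketch of the idea using plain Python lists of row indices. The helper `split_indices` and its `seed` default are hypothetical and only for illustration; this is not how 🤗 Datasets implements `train_test_split` internally.

```python
import random

def split_indices(num_rows, test_size=0.2, seed=0):
    """Shuffle row indices and cut them into (train, validation) index lists."""
    indices = list(range(num_rows))
    # A seeded generator makes the shuffle, and so the split, reproducible.
    random.Random(seed).shuffle(indices)
    num_validation = int(num_rows * test_size)
    return indices[num_validation:], indices[:num_validation]

train_idx, validation_idx = split_indices(25000, test_size=0.2, seed=0)
print(len(train_idx), len(validation_idx))  # 20000 5000
```

`train_test_split` also accepts a `seed` argument, which is worth setting if you want the same validation set on every run.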

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [37]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
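The retry loop in the function above draws indices until it has enough distinct ones. As a possible simplification, the standard library's `random.sample` draws without replacement in one call. The sketch below uses a hypothetical helper name, `pick_unique_indices`, and works on a plain length rather than a real dataset.

```python
import random

def pick_unique_indices(dataset_length, num_examples=10):
    """Return `num_examples` distinct indices in [0, dataset_length)."""
    if num_examples > dataset_length:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    # random.sample never returns the same index twice.
    return random.sample(range(dataset_length), num_examples)

picks = pick_unique_indices(20000, num_examples=10)
print(sorted(picks))
```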

In [38]:
show_random_elements(dataset["train"])

Unnamed: 0,text,label
0,"I wasn't terribly impressed with Dante's 1st season offering in ""Homecoming"", it wasn't much of a horror story, but rather a smart political statement with the undead. Screwfly situation is the story of a virus unleashed on the world that causes men's sexual drive to replaced with murderous tendencies toward women. The episode starts out all right with a short film explaining the way the screw fly was killed of by scientists. Then there is short scene where a man is arrested when females bodies are discovered in his home. I assume this is supposed to show the beginning of the outbreak, but is unclear because this is never revisited. The episode go ons for a while introducing characters blah blah blah.It seems cool and mysterious but the episode stars to get worse and worse as it lurches forward until its sad and unsatisfying end. The worst episode. Well, except for chocolate.",neg
1,"Okay, now, I know there are millions of Americans who believe in The Rapture: that moment when all people born again in Christ will be raptured up to meet God and all the rest of humanity will be left on earth to perish in plagues and fire and the heartbreak of psoriasis as the Antichrist battles it out with Jesus (in an uncharacteristically warlike mode). And I know the books were best sellers. . .among believers, anyway. And I mean no disrespect to all that.<br /><br />But I have to say, they stuffed this movie into a sack and beat it with the Suck Stick.<br /><br />I'm sure the books are much better. Really.<br /><br />The plot needs no reprising. If you've watched this movie, chances are you read the book. I may be one of the only people on earth who actually watched this just for the sheer bad-moving-making experience, and I wasn't disappointed. Especially not by Kirk Cameron, the creepy little ""Growing Pains"" gremlin, who came of age on that show, found Christ, and decided that the SHOW should reflect his Christian values. Well, Kirk, your career has gone to the dogs, but now you can be happy that you're spreading the word of God in movies so bad, they never even make it to theatrical release. Well, that's not strictly true: I guess this was the only movie ever made that went to DVD FIRST, with a voucher for a free viewing of the movie when it was briefly released in theaters! I still have the voucher! How many people do you suppose showed up? I don't know about you, but it never came to my town. Of course, I live in NYC, where we Godless liberals sit around tearing pages out of the bible and use them to roll joints. So there you go. In fact, I'll bet out of three million people on Manhattan Island, not one would be raptured.<br /><br />Check out the supplementary materials on the DVD, where you'll learn the creepy behind the scenes details of these movies. . .the CAST and CREW all must be of the same religious mindset. 
They don't come right out and say this, but listen closely to what the filmmakers say. It's like a bunch of Pod People got together to make a Pod movie. How creepazoid is that? Honestly, this stuff just preaches to the converted, doesn't it? Can you imagine anyone who DOESN'T subscribe to the whole apocalypse thing watching this, slapping his forehead and saying, ""HOLY HOOVER DAM! I better get saved PRONTO!"" Anyhow, I'm hooked. I gotta see the rest of these Christian fiasco movies, especially the one with Gary Busey, which I think is TRIBULATIONS. At least Busey has an excuse for taking the part.. . .he cracked his head on some pavement when he crashed his motorcycle.<br /><br />Oy.<br /><br />Oh, and one more thing. What's with all the shots of poor,innocent dogs whimpering, their leashes dragging uselessly along the ground, because their owners have been called to heaven? What's up with that? Are we supposed to feel badly for the dogs, and if we do, what are we to make of God? Doesn't it IRK people that there's no room in heaven for man's best friend? Foo.<br /><br />This is one more reason I'm agnostic. Good night and good luck.",neg
2,"An unmarried, twenty-something hick (played by John Travolta) leaves the farm and goes to Houston, where he learns about life and love in a Texas honky-tonk. At face value, it's a modern love story ... Texas style. There's gobs of cowboy hats, pickup trucks, neon beer signs, and references to big belt-buckles and rodeos. The music, if not Texas native, is Texas adapted, courtesy of the talents of Mickey Gilley, Johnny Lee, and the Charlie Daniels Band. And that Texas twang ... ""y'all"".<br /><br />The story and the characters are about as subtle as the taste of Texas five-alarm chili made with Jalapeno peppers. It's enough to make civilized viewers abort the film in favor of a genteel classic, one starring Laurence Olivier or Ingrid Bergman, maybe. ""Hamlet"" it's not. But ""Urban Cowboy"" is spicy and explicit, and I kinda like it.<br /><br />Technically, the film is generally good. The dialogue, the production design, and the costumes are all realistic; the editing is skillful. And both the casting and the acting are commendable, if not Oscar worthy. I would not have cast Travolta in the role he plays, but he does a fine job ... ditto Debra Winger. Barry Corbin and Brooke Alderson, among others, are good too, in support roles. But, the cinematography seemed weak. The film copy I watched was grainy, and at times suffered from a reddish/orange tint, a visual trait I have noticed in other films from the same time period.<br /><br />At first glance, the film does not seem to offer any social or political ""message"". But I would argue that when ""Urban Cowboy"" was released twenty-five years ago, it had rather prophetic implications. In 1980 the U.S. had all kinds of problems, not the least being American hostages held by Iran. In the minds of a lot of folks back then, the U.S. was being pushed around, bullied.<br /><br />This film, along with others of its time, offered something that Americans wanted to see in their political leaders ... toughness. 
""Urban Cowboy"" is a very physical film. The characters in it may not be the brightest people on Earth. But, they're tough!<br /><br />Everything about ""Urban Cowboy"" is anti-intellectual. As a vehicle for cultural expression then, this 1980 film was one of several that augured a new get-tough era for the U.S. It started in 1980 with the election of Reagan. And that era continues to this day, with a President who probably will not be remembered for his intellect, but will be remembered for his toughness and aggression, traits that Americans seem to gravitate to as surely as Texans to five-alarm chili.",pos
3,"Laughs, adventure, a good time, a killer soundtrack, oscar-worthy acting, and special effects/ animitronics like none other, what else could you want in a movie? If you see this will be on the telly, WATCH IT, otherwise, run out now to RENT IT!!!",pos
4,"I've received this movie from a cousin in Norway and had to convert it from Norwegian to American format with a copied video. Comparing this film (1948) with the Heroes of Telemark (1965), Kampen om Tungtvannet (The Struggle for the Heavy Water) casts the saboteurs themselves, playing their respective roles, though actors were also cast to play the roles of the saboteurs who have given their lives in Norway's struggle for freedom in later campaigns. The plot is in four languages: Norwegian along with French, German and English (complete with Norwegian subtitles).<br /><br />Impressive during this course of history was what led to the struggle. French scientists were interested in obtaining some two hundred kilograms of heavy water from Norsk Hydro in Vemork to take back to France in order to do lab studies on its effectiveness. Simultaneously, the Nazis, too, were interested in obtaining heavy water to build a secret weapon. The French were worried that the Nazis might take an early lead by invading Norway, and through secret codes, their man carefully eluded Nazi spies on his trip to Oslo where he received the heavy water and making it back without hindrance. He was watched by two spies as he boarded an airliner, but they did not see him hop out on the other side where he crossed the tarmac to another plane nearby where his cargo was waiting for him. This clever trick worked by using the airliner as a decoy that the Nazis later forced down in Hamburg.<br /><br />However, the invasion of Norway on the morning of April 9, 1940, the Nazis took over Norsk Hydro and it was up to the Norwegian Underground and British intelligence in London to take action. Professor Leif Trondstad volunteered the services of eleven young Norwegians; the ""Swallow"" and ""Gunnerside"" groups who would successfully sabotage the heavy water production in Vemork. 
This was shown in detail on how they actually carried out the operation, including the sinking of the ferryboat after the Nazis abandoned Norsk Hydro to take the shipment of heavy water on rail cars to Berlin.<br /><br />The quality of the film was fair though there were many splices in the film. I highly recommend this film to anyone interested in World War II history.",pos
5,"This is the epitome of fairytale! The villains are completely wicked and the heroes are refreshingly pure. Danes, Deniro,and Pfieffer are wonderful as well as the new actor who plays the role of Tristan. Outstanding performances, delightful magic, funny and dramatic, and a perfect fairytale ending make this film absolutely fabulous! I'm not so sure all content is appropriate for younger children but for an older audience, there are plenty of hilarious subtleties! The previews do not do this movie justice! My fiancé and I were quite skeptical but were so thrilled we had taken a chance on this movie that I can only hope to assure anyone on the fence about this movie to give it a try!",pos
6,"'I don't understand. None of this makes any sense!', exclaims one exasperated character towards the end of Death Smiles at Murder. Having just sat through this thoroughly confusing mess of a movie, I know exactly how he feels. The story, by the film's director Aristide Massacessi (good old Joe D'amato using his real name for a change), is a clumsy mix of the supernatural, murder/mystery, and pretentious arty rubbish, the likes of which will probably appeal to those who admire trippy 70s garbage such as Jess Franco's more bizarre efforts, but which had me struggling to remain conscious.<br /><br />Opening with a hunchback mourning the death of his beautiful sister (with whom he had been having an incestuous affair, before eventually losing her to a dashing doctor), Death Smiles at Murder soon becomes very confusing when the very same woman (played by Ewa Aulin, who stars in the equally strange 'Death Laid an Egg') is seen alive and kicking, the sole survivor of a coach accident that occurs outside the estate of Walter and Eva von Ravensbrück. After being invited to stay and recuperate in their home, where she is tended to by creepy Dr. Sturges (Klaus Kinski in a throwaway role), the comely lass begins love affairs with both Mr. and Mrs. 
Ravensbrück (meaning that viewers are treated to some brief but welcome scenes of nookie and lesbian lovin').<br /><br />'So far, so good', I thought to myself at this point, 'we've had hunchbacks, incest, some blood and guts, and gratuitous female nudity'all ingredients of a great trashy Euro-horror; what follows, however, is a lame attempt by Massacessi to combine giallo style killings, ghostly goings on, and even elements from Edgar Allan Poe's 'The Black Cat', to tell a very silly, utterly bewildering, and ultimately extremely boring tale of revenge from beyond the grave.<br /><br />This film seems to have quite few admirers here on IMDb, but given the choice, I would much rather watch one of the director's sleazier movies from later in his career; I guess incomprehensible, meandering, surreal 70s Gothic horror just ain't my thing! 2.5 out of 10 (purely for the cheesy gore and nekkidness), rounded up to 3 for IMDb.",neg
7,"I wasn't sure how to rate this movie, since it was so bad it was actually very funny. I'm not a Gackt fan by any means, though he is talented, despite the weird pseudonym that sounds like a cat coughing up a hairball. I always thought Hyde was talented though, Faith is an interesting album.<br /><br />But on topic here folks. This movie is ridiculous. It's so over the top and nonsensical it's almost like a parody of supernatural action films.<br /><br />The movie has almost no plot here, except it's just about vampires with gangster friends. In a way, this film almost reminded me of Spider-Man 3, with how there were too many ideas, which resulted in not enough time to pay attention on one of them.<br /><br />The action scenes were laughable. Quickly edited, almost hard to understand, with choreography that's so laughably bad. Though Hyde looked very stylish during the action scenes, but that's this film's only such redemption. I'm a sucker for good action movies, but the action was horribly done. Though the final shootout was OK and the highlight of this otherwise depressing movie.<br /><br />It keeps jumping between genres, not a good thing. It wants to be a drama, or an action flick, or a horror, or a romance... what the hell.<br /><br />If this review is making you mad, why? Is it because Gackt and Hyde are your love? Don't fool yourself, this MOVIE IS BAD.",neg
8,"It was so terrible. It wasn't fun to watch at all. Even the scene where the girl is using a vibrator, even that's not fun to watch in this movie. I say again, the scene where a girl is masturbating with a vibrator is not even fun to watch. Or maybe if that was the only part of the movie that you watched, just girl on couch using a vibrator. Maybe they should have just released that one scene in theaters, maybe then the movie would be enjoyable on a certain level. My advice, fast forward to that point, watch it, rewind the movie, watch it again, rewind, repeat. Maybe you could enjoy yourself for 2 hours that way. This movie ranks alongside I spit on your grave and Doom generation in the category of worst movies that I have ever seen.",neg
9,"Facts about National Lampoon Goes to the Movies, a.k.a. National Lampoon's Movie Madness:<br /><br />1. The movie is poor, even by Lampoon's typical standards. 2. It's not funny. 3. No one goes to see a movie.<br /><br />So, after I finished watching it, I began wondering why on earth it's called 'National Lampoon Goes to the Movies,' and why it was ever conceived, much less actually made. It would be like calling Austin Powers 'An American Guy Goes to the Movies.' How lame. He isn't American, and he doesn't go to movies. None of the characters in Lampoon's so-called 'satire' are funny, and none go see movies, which causes a bit of a problem. I had hoped it would be something in the vein of Mystery Science Theater 3000, but it isn't.<br /><br />This was National Lampoon's first film after Animal House, although you couldn't tell it from the quality of film. Poorly developed, rough and amateurish by any standard, it induces headaches  not a good sign for an 89-minute movie that seems double the length.<br /><br />I've noticed a pattern. Really bad movies are typically renamed  and this little disaster falls under that category. It has two separate titles -- probably to help try and promote it to people too stupid to remember how bad a panning it received from home video critics in 1982/83. 'Hmm, Movie Madness  I've never heard of this movie before! Let's rent it!' And then, the realization: 'Hey, wait a minute, this is just National Lampoon Goes to the Movies!'<br /><br />It was shelved by MGM/UA, never to be released into theaters or DVD; it occasionally pops up on television a few times per decade, which is just about the only place you'll manage to find it.<br /><br />It's split up into three stories  a parody of self-enlargement videos, butter and corporate ruthlessness, and police brutality/cop-buddy films (I guess). The first segment stars Peter Riegert (Animal House) as a frustrated guy who divorces his wife and does some other stuff. 
I'm not sure what because it was so boring my mind started to drift. Until the sex scene popped up.<br /><br />Part II is about an exotic dancer raped by a stick of butter (don't ask) who decides to become Queen of the Margarine so she can cut off the supply of dairy products. Ouch! This contains the only funny line in the movie: 'Only I can make love with my son!' If you think that doesn't sound very funny, you're right  it's not. And just imagine  it's the highlight of this film!<br /><br />Part III is about a cop who chases down a serial killer (Christopher Lloyd) only to lose his nerve and shoot the guy. It does contain one funny scene but it's extremely over-acted  only Lloyd really exhibits any humor, playing his character dry and compassionate, yet strangely surreal. The part where he's choking his victim and the meek cop stands by watching it all unfold, at least, evoked a chuckle or two.<br /><br />It's a shame to watch such a cast of semi-famous names resort to low standards. The writers of each segment clearly believe that they're being very ironic and clever by spoofing so-called stereotypes  the fault being that the movie becomes one huge contradiction, favoring the standard T & A instead of plot; crude humor instead of witty dialogue; desperate performances instead of inspired ones. It's easy to see that none of the actors were enthralled with the material, muttering their lines, often so embarrassed they can seldom make eye contact with the camera.<br /><br />The movie isn't funny, as I said before. I laughed once, at only one line, and even then it was a halfhearted one. Two chuckles, a smile, and a very weak laugh. 
Compared to Movie Madness, a number of other decent comedies seem like regular laugh tracks.<br /><br />I like National Lampoon's Vacation series (or, at least three of four installments), and their classic Animal House, but their recent slew of direct-to-video bombs such as Golf Punks (with that great comic genius Tom Arnold) provide a good example of why their magazine went out of print more than a decade ago. It gets really old, really fast.<br /><br />Sad to see a new film, called Gold Diggers, is being released with their 'stamp of approval.' It's like condemning a film before it even hits theaters  maybe they should start not advertising their name all over the place<br /><br />Distributor: 'This movie is bad. It gets the National Lampoon stamp of approval. That'll teach you not to make something so awful next time.'<br /><br />Forget the death penalty. Just stick a bunch of criminals in a room and make them watch this over and over every day for a month.<br /><br />It's so bad that I can't even begin to explain its putrid vileness. I give up.",neg


The metric is an instance of [`evaluate.EvaluationModule`](https://huggingface.co/docs/evaluate/package_reference/main_classes#evaluate.EvaluationModule):

In [39]:
accuracy_metric

EvaluationModule(name: "accuracy", module_type: "metric", features: {'predictions': Value(dtype='int32', id=None), 'references': Value(dtype='int32', id=None)}, usage: """
Args:
    predictions (`list` of `int`): Predicted labels.
    references (`list` of `int`): Ground truth labels.
    normalize (`boolean`): If set to False, returns the number of correctly classified samples. Otherwise, returns the fraction of correctly classified samples. Defaults to True.
    sample_weight (`list` of `float`): Sample weights Defaults to None.

Returns:
    accuracy (`float` or `int`): Accuracy score. Minimum possible value is 0. Maximum possible value is 1.0, or the number of examples input, if `normalize` is set to `True`.. A higher score means higher accuracy.

Examples:

    Example 1-A simple example
        >>> accuracy_metric = evaluate.load("accuracy")
        >>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0])
        >>> print(results)
    

You can call its `compute` method with your predictions and labels directly and it will return a dictionary with the metric(s) value:

In [40]:
import numpy as np

# Small hand-written example; np.random.randint(0, 2, size=(64,)) would give random labels instead
fake_preds = np.array([0, 1, 0, 1, 0])
fake_labels = np.array([0, 1, 0, 0, 1])
accuracy_metric.compute(predictions=fake_preds, references=fake_labels)

{'accuracy': 0.6}
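The value above can be checked by hand: accuracy is the fraction of positions where the prediction equals the label. The following sketch recomputes it in plain Python; the function `accuracy` is a hypothetical stand-in for illustration, not the 🤗 Evaluate implementation.

```python
def accuracy(predictions, references):
    """Fraction of predictions that match their reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(int(p == r) for p, r in zip(predictions, references))
    return correct / len(references)

# Same toy values as in the cell above: the first three positions match.
score = accuracy([0, 1, 0, 1, 0], [0, 1, 0, 0, 1])
print(score)  # 0.6
```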

## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer`, which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put them in a format the model expects, as well as generate the other inputs that the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [41]:
from transformers import AutoTokenizer
model_checkpoint = "microsoft/deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


We pass along `use_fast=True` to the call above to use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library. Those fast tokenizers are available for almost all models, but if you got an error with the previous call, remove that argument.

You can directly call this tokenizer on one sentence or a pair of sentences:

In [42]:
tokenizer("Hello, this one sentence!", "And this sentence goes with it.")

{'input_ids': [1, 5365, 261, 291, 311, 4378, 300, 2, 414, 291, 4378, 1242, 275, 278, 260, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Depending on the model you selected, you will see different keys in the dictionary returned by the cell above. They don't matter much for what we're doing here (just know they are required by the model we will instantiate later). You can learn more about them in [this tutorial](https://huggingface.co/transformers/preprocessing.html) if you're interested.
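To see how the three keys relate for a sentence pair, the sketch below rebuilds the output shown above from the two sentences' token IDs. It assumes that ID 1 is the start-of-sequence token and ID 2 the separator, which is inferred from the printed `input_ids` rather than read from the tokenizer; `build_pair_inputs` is a hypothetical helper.

```python
def build_pair_inputs(first_ids, second_ids, cls_id=1, sep_id=2):
    """Assemble model inputs for a sentence pair from already-tokenized IDs."""
    input_ids = [cls_id] + first_ids + [sep_id] + second_ids + [sep_id]
    # Segment 0 covers the start token, the first sentence and its separator;
    # segment 1 covers the second sentence and the final separator.
    token_type_ids = [0] * (len(first_ids) + 2) + [1] * (len(second_ids) + 1)
    # No padding here, so every position is a real token.
    attention_mask = [1] * len(input_ids)
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }

first = [5365, 261, 291, 311, 4378, 300]
second = [414, 291, 4378, 1242, 275, 278, 260]
encoded = build_pair_inputs(first, second)
print(encoded["token_type_ids"])
```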

To preprocess our dataset, we need the name of the column containing the text. For IMDb, each review lives in a single column named `text`.

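We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with the argument `truncation=True`, which asks for any input longer than what the selected model can handle to be truncated to the maximum length accepted by the model. Note that this only takes effect when the tokenizer knows a maximum length; if it does not (as the warning in the output further down reports for this checkpoint), no truncation is applied unless you also pass an explicit `max_length`.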

In [43]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)
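As a rough illustration of what truncation to a maximum length means, the sketch below cuts a list of token IDs so that the result, including the two special tokens, fits a given budget. The helper `truncate_ids`, the limit of 512, and the special-token IDs 1 and 2 are assumptions chosen for the example; a real tokenizer handles this internally.

```python
def truncate_ids(token_ids, max_length, cls_id=1, sep_id=2):
    """Truncate `token_ids` (without special tokens) and wrap them in CLS/SEP."""
    if max_length < 2:
        raise ValueError("max_length must leave room for the two special tokens")
    budget = max_length - 2  # positions left once CLS and SEP are counted
    return [cls_id] + token_ids[:budget] + [sep_id]

long_review = list(range(100, 1100))  # 1000 made-up token IDs
truncated = truncate_ids(long_review, max_length=512)
print(len(truncated))  # 512
```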

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:

In [44]:
preprocess_function(dataset['train'][10000:10005])

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


{'input_ids': [[1, 15506, 304, 5660, 271, 21278, 697, 265, 262, 733, 6846, 260, 25648, 280, 268, 697, 271, 33556, 1015, 266, 509, 625, 1927, 271, 12092, 354, 267, 315, 1532, 2928, 7244, 268, 263, 315, 1108, 18660, 374, 269, 1419, 3529, 352, 262, 1016, 260, 26182, 77071, 269, 299, 15356, 35502, 261, 9006, 6721, 269, 1970, 304, 42457, 263, 9061, 110609, 269, 260, 260, 260, 260, 371, 261, 9061, 110609, 269, 370, 265, 339, 269, 1299, 275, 25648, 267, 291, 926, 261, 304, 373, 702, 280, 297, 553, 322, 611, 618, 264, 2364, 262, 3817, 263, 4522, 260, 4052, 8981, 840, 1504, 4052, 8981, 840, 1504, 17777, 272, 262, 25130, 270, 291, 1421, 652, 278, 670, 267, 3259, 288, 2320, 310, 354, 392, 743, 335, 362, 1315, 261, 304, 272, 363, 1174, 4372, 272, 2131, 1032, 354, 2156, 768, 260, 672, 261, 262, 25373, 1270, 303, 331, 1992, 288, 266, 8782, 37042, 1039, 272, 3441, 262, 5930, 263, 2449, 278, 396, 264, 266, 717, 360, 375, 743, 260, 4052, 8981, 840, 1504, 4052, 8981, 840, 1504, 35905, 493, 267, 1169, 26

To apply this function on all the sentences (or pairs of sentences) in our dataset, we just use the `map` method of our `dataset` object we created earlier. This will apply the function on all the elements of all the splits in `dataset`, so our training, validation and testing data will be preprocessed in one single command.

In [45]:
encoded_dataset = dataset.map(preprocess_function, batched=True, load_from_cache_file=False)

Map:   0%|          | 0/20000 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Even better, the results can be cached automatically by the 🤗 Datasets library to avoid spending time on this step the next time you run your notebook. The 🤗 Datasets library is normally smart enough to detect when the function you pass to `map` has changed (and thus that the cached data should not be used). 🤗 Datasets warns you when it uses cached files; you can pass `load_from_cache_file=False` in the call to `map` (as done above) to not use the cached files and force the preprocessing to be applied again.

Note that we passed `batched=True` to encode the texts in batches. This is to leverage the full benefit of the fast tokenizer we loaded earlier, which will use multi-threading to treat the texts in a batch concurrently.
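The effect of `batched=True` is that the function receives a dictionary of lists (one list per column) rather than one example at a time. Here is a minimal sketch of that calling convention on a toy dataset; `map_batched` and `count_words` are hypothetical helpers, not the 🤗 Datasets implementation, and the batch size of 1000 mirrors the library's default.

```python
def map_batched(columns, function, batch_size=1000):
    """Apply `function` to slices of a dict-of-lists and merge the new columns."""
    num_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, num_rows, batch_size):
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in function(batch).items():
            new_columns.setdefault(name, []).extend(values)
    # Original columns are kept; columns returned by `function` are added.
    return {**columns, **new_columns}

def count_words(batch):
    """Stand-in for the tokenizer: one output value per input text."""
    return {"num_words": [len(text.split()) for text in batch["text"]]}

toy = {"text": ["a great film", "dull", "not bad at all"], "label": [1, 0, 1]}
result = map_batched(toy, count_words, batch_size=2)
print(result["num_words"])  # [3, 1, 4]
```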

## Fine-tuning the model

Now that our data is ready, we can download the pretrained model and fine-tune it. Since our task is sentence classification, we use the `AutoModelForSequenceClassification` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us. The only thing we have to specify is the number of labels for our problem, which is 2 for IMDb (negative and positive):

In [46]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

num_labels = 2  # IMDb is binary: neg / pos
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight', 'classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The warning is telling us that some weights (the `pooler` and `classifier` layers) are newly, randomly initialized. This is absolutely normal in this case, because the head used to pretrain the model is replaced with a new classification head for which we don't have pretrained weights, so the library warns us we should fine-tune this model before using it for inference, which is exactly what we are going to do.

To instantiate a `Trainer`, we will need to define two more things. The most important is the [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [49]:
metric_name = "accuracy"
model_name = model_checkpoint.split("/")[-1]
task="imdb_classification"
batch_size = 2
args = TrainingArguments(
    f"{model_name}-finetuned-{task}",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=1,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    push_to_hub=False,
)

Here we set the evaluation to be done at the end of each epoch, tweak the learning rate, use the `batch_size` defined in the same cell and customize the number of epochs for training, as well as the weight decay. Since the best model might not be the one at the end of training, we ask the `Trainer` to load the best model it saved (according to `metric_name`) at the end of training.
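To make the effect of `load_best_model_at_end` more concrete, here is a minimal sketch in plain Python of the selection it performs conceptually: one evaluation per saved checkpoint, then the checkpoint with the best value of the chosen metric is kept. The checkpoint names and accuracy values are made up for illustration, and `pick_best_checkpoint` is a hypothetical helper, not part of `transformers`.

```python
# Hypothetical per-epoch evaluation results (illustrative values only).
eval_history = [
    {"checkpoint": "checkpoint-epoch-1", "eval_accuracy": 0.91},
    {"checkpoint": "checkpoint-epoch-2", "eval_accuracy": 0.94},
    {"checkpoint": "checkpoint-epoch-3", "eval_accuracy": 0.93},
]


def pick_best_checkpoint(history, metric_name="accuracy", greater_is_better=True):
    """Return the entry with the best value of `eval_<metric_name>`."""
    key = f"eval_{metric_name}"
    chooser = max if greater_is_better else min
    return chooser(history, key=lambda entry: entry[key])


best = pick_best_checkpoint(eval_history)
print(best["checkpoint"])
```

With `num_train_epochs=1`, as in the cell above, there is only one checkpoint to choose from, so the selection only matters once you train for several epochs.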

The last argument, `push_to_hub`, controls whether the model is pushed to the [Hub](https://huggingface.co/models) regularly during training. It is set to `False` here; only set it to `True` if you followed the installation steps at the top of the notebook. If you want to save your model locally under a name that is different from the name of the repository it will be pushed to, or if you want to push your model under an organization rather than your own namespace, use the `hub_model_id` argument to set the repo name (it needs to be the full name, including your namespace: for instance `"sgugger/bert-finetuned-mrpc"` or `"huggingface/bert-finetuned-mrpc"`).

The last thing to define for our `Trainer` is how to compute the metrics from the predictions. We need to define a function for this, which will just use the `accuracy_metric` we loaded earlier. The only preprocessing we have to do is to take the argmax of our predicted logits:

In [50]:
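import numpy as np

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair produced by the Trainer
    predictions, labels = eval_pred
    # Pick the class with the highest logit for each example
    predictions = np.argmax(predictions, axis=1)
    return accuracy_metric.compute(predictions=predictions, references=labels)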
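As a quick sanity check of what this function does, the sketch below runs the same argmax step on a handful of made-up logits and computes the accuracy by hand with NumPy. It deliberately avoids `accuracy_metric` so that it runs without downloading anything; the logits and labels are invented for illustration.

```python
import numpy as np

# Made-up logits for 4 examples and 2 classes (negative, positive).
logits = np.array([
    [2.0, -1.0],
    [0.1, 0.3],
    [-0.5, 1.5],
    [1.2, 0.4],
])
labels = np.array([0, 1, 0, 0])

# argmax over the class axis gives the predicted label for each example.
predictions = np.argmax(logits, axis=1)

# Accuracy is the fraction of predictions that match the labels.
accuracy = float((predictions == labels).mean())
print(predictions, accuracy)
```

Note that the argmax is only meaningful when the model outputs one logit per class; with a single output column it would always return 0.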

Then we just need to pass all of this along with our datasets to the `Trainer`:

In [51]:
validation_key = "validation"
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

You might wonder why we pass along the `tokenizer` when we already preprocessed our data. This is because we will use it one last time to make all the samples we gather the same length by applying padding, which requires knowing the model's preferences regarding padding (to the left or right? with which token?). The `tokenizer` has a `pad` method that will do all of this for us, and the `Trainer` will use it. You can customize this part by defining and passing your own `data_collator`, which will receive the samples as dictionaries like the ones seen above and will need to return a dictionary of tensors.
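To illustrate what this padding step amounts to, here is a minimal sketch of a collator in plain Python. `pad_batch` is a hypothetical helper and the token ids are made up; it returns lists rather than tensors, and the real `tokenizer.pad` (or `DataCollatorWithPadding`) handles more cases, such as extra keys and tensor conversion.

```python
def pad_batch(samples, pad_token_id=0, padding_side="right"):
    """Pad `input_ids` to the longest sample and build matching attention masks."""
    max_len = max(len(s["input_ids"]) for s in samples)
    batch = {"input_ids": [], "attention_mask": []}
    for s in samples:
        ids = list(s["input_ids"])
        n_pad = max_len - len(ids)
        pad = [pad_token_id] * n_pad
        mask = [1] * len(ids)   # 1 marks real tokens
        zeros = [0] * n_pad     # 0 marks padding positions
        if padding_side == "right":
            batch["input_ids"].append(ids + pad)
            batch["attention_mask"].append(mask + zeros)
        else:
            batch["input_ids"].append(pad + ids)
            batch["attention_mask"].append(zeros + mask)
    return batch


# Two made-up tokenized samples of different lengths.
samples = [{"input_ids": [1, 7, 8, 2]}, {"input_ids": [1, 9, 2]}]
batch = pad_batch(samples)
print(batch)
```

Padding to the longest sample in each batch, rather than to a fixed maximum length, keeps batches as short as possible.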

We can now finetune our model by just calling the `train` method:

In [1]:
trainer.train()


We can check with the `evaluate` method that our `Trainer` did reload the best model properly (if it was not the last one):

In [None]:
trainer.evaluate()

{'eval_loss': 0.6563718318939209,
 'eval_matthews_correlation': 0.5432575763528743,
 'epoch': 5.0,
 'total_flos': 354601098011100}

To see how your model fared you can compare it to the [GLUE Benchmark leaderboard](https://gluebenchmark.com/leaderboard).

You can now upload the result of the training to the Hub by executing this instruction:

In [None]:
trainer.push_to_hub()

You can now share this model with all your friends, family, favorite pets: they can all load it with the identifier `"your-username/the-name-you-picked"` so for instance:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("sgugger/my-awesome-model")
```

## Hyperparameter search

The `Trainer` supports hyperparameter search using [optuna](https://optuna.org/) or [Ray Tune](https://docs.ray.io/en/latest/tune/). For this last section you will need one of those libraries installed: uncomment the line you want in the next cell and run it.

In [None]:
# ! pip install optuna
# ! pip install ray[tune]

During hyperparameter search, the `Trainer` will run several trainings, so it needs to have the model defined via a function (so it can be reinitialized at each new run) instead of just having it passed. We just wrap the same `from_pretrained` call as before in a function:

In [None]:
def model_init():
    return AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)
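The reason for passing a function rather than a model can be sketched without any deep learning library. In the toy example below, `TinyModel`, `tiny_model_init` and `run_trials` are all hypothetical stand-ins: the point is only that calling the factory once per trial gives each trial a fresh, untrained object, whereas reusing a single instance would carry state from one trial into the next.

```python
import random


class TinyModel:
    """Hypothetical stand-in for a model whose head is randomly initialized."""

    def __init__(self):
        self.weight = random.random()
        self.trained_steps = 0

    def train(self, steps):
        self.trained_steps += steps


def tiny_model_init():
    # Called once per trial, so every trial starts from a fresh object.
    return TinyModel()


def run_trials(model_factory, step_choices):
    models = []
    for steps in step_choices:
        model = model_factory()  # fresh model for this trial
        model.train(steps)
        models.append(model)
    return models


models = run_trials(tiny_model_init, [10, 20, 30])
print([m.trained_steps for m in models])
```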

And we can instantiate our `Trainer` like before:

In [None]:
trainer = Trainer(
    model_init=model_init,
    args=args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

The method we call this time is `hyperparameter_search`. Note that it can take a long time to run on the full dataset. You can try to find some good hyperparameters on a portion of the training dataset by replacing the `train_dataset` line above with:
```python
train_dataset = encoded_dataset["train"].shard(index=1, num_shards=10)
```
for 1/10th of the dataset. Then you can run a full training on the best hyperparameters picked by the search.

In [None]:
best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize")

[I 2020-10-15 17:06:50,302] A new study created in memory with name: no-name-5ca133fa-884f-4b0b-beb6-82fb19c9da06

Epoch,Training Loss,Validation Loss,Matthews Correlation
1,0.567259,0.567516,0.267643
2,0.424672,0.588076,0.360967
3,0.304736,0.882551,0.373388
4,0.172879,1.131409,0.427625


[I 2020-10-15 17:08:54,670] Trial 0 finished with value: 0.4276247641425646 and parameters: {'learning_rate': 9.066394855554376e-05, 'num_train_epochs': 4, 'seed': 3, 'per_device_train_batch_size': 8}. Best is trial 0 with value: 0.4276247641425646.

Epoch,Training Loss,Validation Loss,Matthews Correlation
1,0.498541,0.470532,0.480649


[I 2020-10-15 17:09:27,505] Trial 1 finished with value: 0.4806486312161432 and parameters: {'learning_rate': 4.70956471834944e-05, 'num_train_epochs': 1, 'seed': 40, 'per_device_train_batch_size': 8}. Best is trial 1 with value: 0.4806486312161432.

Epoch,Training Loss,Validation Loss,Matthews Correlation
1,No log,0.467531,0.446775
2,0.428433,0.476509,0.521516


[I 2020-10-15 17:09:51,733] Trial 2 finished with value: 0.5215162259225145 and parameters: {'learning_rate': 4.357724525964853e-05, 'num_train_epochs': 2, 'seed': 38, 'per_device_train_batch_size': 32}. Best is trial 2 with value: 0.5215162259225145.

Epoch,Training Loss,Validation Loss,Matthews Correlation
1,0.526249,0.48895,0.412369
2,0.344729,0.5057,0.505773
3,0.242301,0.565479,0.514625


[I 2020-10-15 17:10:42,034] Trial 3 finished with value: 0.5146251651052666 and parameters: {'learning_rate': 1.966064728684351e-05, 'num_train_epochs': 3, 'seed': 4, 'per_device_train_batch_size': 16}. Best is trial 2 with value: 0.5215162259225145.

Epoch,Training Loss,Validation Loss,Matthews Correlation
1,0.61917,0.615369,0.0


  mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
[I 2020-10-15 17:11:00,530] Trial 4 finished with value: 0.0 and parameters: {'learning_rate': 1.1681626444738994e-06, 'num_train_epochs': 1, 'seed': 35, 'per_device_train_batch_size': 16}. Best is trial 2 with value: 0.5215162259225145.

Epoch,Training Loss,Validation Loss


  mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
[I 2020-10-15 17:11:16,708] Trial 5 pruned.

Epoch,Training Loss,Validation Loss


[I 2020-10-15 17:11:27,293] Trial 6 pruned.

Epoch,Training Loss,Validation Loss,Matthews Correlation
1,No log,0.467245,0.477506


[I 2020-10-15 17:11:39,998] Trial 7 finished with value: 0.47750559698251505 and parameters: {'learning_rate': 8.294058232883187e-05, 'num_train_epochs': 1, 'seed': 14, 'per_device_train_batch_size': 64}. Best is trial 2 with value: 0.5215162259225145.

Epoch,Training Loss,Validation Loss


  mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
[I 2020-10-15 17:11:50,629] Trial 8 pruned.

Epoch,Training Loss,Validation Loss,Matthews Correlation
1,0.526511,0.487868,0.449608


[I 2020-10-15 17:12:09,216] Trial 9 finished with value: 0.4496077935780946 and parameters: {'learning_rate': 3.983123210032869e-05, 'num_train_epochs': 1, 'seed': 35, 'per_device_train_batch_size': 16}. Best is trial 2 with value: 0.5215162259225145.


The `hyperparameter_search` method returns a `BestRun` object, which contains the value of the objective maximized (by default the sum of all metrics) and the hyperparameters it used for that run.

In [None]:
best_run

BestRun(run_id='2', objective=0.5215162259225145, hyperparameters={'learning_rate': 4.357724525964853e-05, 'num_train_epochs': 2, 'seed': 38, 'per_device_train_batch_size': 32})

You can customize the objective to maximize by passing along a `compute_objective` function to the `hyperparameter_search` method, and you can customize the search space by passing a `hp_space` argument to `hyperparameter_search`. See this [forum post](https://discuss.huggingface.co/t/using-hyperparameter-search-in-trainer/785/10) for some examples.
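As a rough illustration of those two hooks, here is a minimal sketch for the optuna backend. The function names `my_hp_space` and `my_objective`, the search ranges, and the `"eval_accuracy"` key (the accuracy metric with the `eval_` prefix the `Trainer` adds to evaluation metrics) are assumptions made for this example. `FakeTrial` is a hypothetical stub that mimics the three optuna `Trial` methods used, so the snippet runs without optuna installed.

```python
def my_hp_space(trial):
    """Search space for the optuna backend; `trial` is an `optuna.Trial`."""
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 3),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32]
        ),
    }


def my_objective(metrics):
    """Optimize accuracy only, ignoring the loss and any other reported metric."""
    return metrics["eval_accuracy"]


class FakeTrial:
    """Hypothetical stub: always returns the lower bound or the first choice."""

    def suggest_float(self, name, low, high, log=False):
        return low

    def suggest_int(self, name, low, high):
        return low

    def suggest_categorical(self, name, choices):
        return choices[0]


print(my_hp_space(FakeTrial()))
print(my_objective({"eval_loss": 0.3, "eval_accuracy": 0.9}))

# With a real Trainer, these would be passed as:
# best_run = trainer.hyperparameter_search(
#     hp_space=my_hp_space,
#     compute_objective=my_objective,
#     n_trials=10,
#     direction="maximize",
# )
```

Restricting the objective to a single metric avoids mixing quantities on different scales when `compute_metrics` returns more than one value.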

To reproduce the best training, just set the hyperparameters in your `TrainingArguments` before running the training again:

In [None]:
for n, v in best_run.hyperparameters.items():
    setattr(trainer.args, n, v)

trainer.train()


Epoch,Training Loss,Validation Loss,Matthews Correlation
1,No log,0.467531,0.446775
2,0.428433,0.476509,0.521516


TrainOutput(global_step=536, training_loss=0.4191349371155696)