# Sentiment Classification using Transformer

Before diving into this notebook, we strongly recommend 
going through **all** the chapters of the official [🤗 Hugging Face course](https://huggingface.co/course/chapter1/1). This will make it much easier for 
you to follow this notebook and transfer the knowledge to **your** tasks.

In this notebook, we will simulate a real-world use 
case and try to solve it using tools of the Hugging Face ecosystem.

We strongly recommend using this notebook as a template/example to 
solve **your** real-world use case.

# **Defining Task, Dataset & Model**

Before jumping into the actual coding part, it's important to have a clear definition of the use case that you would like to automate or partly automate.
A clear definition of the use case helps in identifying the most suitable task, dataset to use, and model to apply for your use case.

## **Define your NLP task**

Alright, let's dive into a hypothetical problem we wish to solve using natural language processing models. Let's assume we are selling a product and our customer support team receives thousands of messages including feedback, complaints, and questions, which ideally should all be answered.

It quickly becomes obvious, though, that customer support is by no means able to reply to every message. Thus, we decide to only reply 
to the most unsatisfied customers and set the goal of replying to 100% of very unsatisfied messages.

Assuming that a) messages of very unsatisfied customers represent only a fraction of all messages and b) that we can filter out unsatisfied messages in an automated way, customer support should be able to reach this goal.

To filter out unsatisfied messages in an automated way, we plan on applying natural language processing technologies. 


The first step is now to map our use case - *filtering out unsatisfied messages* - to a natural language processing task.

To do so, it is recommended to go over all available tasks on the Hugging Face Hub [here](https://huggingface.co/tasks). If you are not sure which task applies to your use case, you should click through the different tasks to better understand them.

The task of finding messages of the most unsatisfied customers can be framed as a text classification task: classify a message into one of *very unsatisfied*, *unsatisfied*, *neutral*, *satisfied*, or *very satisfied*.



## **Find suitable datasets**

Having decided on the task, next we should find the data the model will be trained on. This is usually more important for the downstream performance of your use case than picking the right model architecture.
Keep in mind that a model is **only as good as the data it has been trained on**. Thus, we should be very careful when curating and/or selecting the dataset.

Since we consider the hypothetical use case of *filtering out unsatisfied messages*, let's look into what datasets are available to us.

For your real-world use case, it is **very likely** that you have internal data that best represents the actual data your NLP system is supposed to handle. Therefore, you should use such internal data to train your NLP system.
It can nevertheless be helpful to also include publicly available data to improve the generalizability of your model.

Let's take a look at all available Datasets on the [Hugging Face Hub](https://huggingface.co/datasets). On the left side, you can filter the datasets according to *Task Categories* as well as *Tasks* which are more specific. Our use case corresponds to *Text Classification* -> *Sentiment Analysis* so let's select [these filters](https://huggingface.co/datasets?task_categories=task_categories:text-classification&task_ids=task_ids:sentiment-classification&sort=downloads). We are left with *ca.* 80 datasets at the time of writing this notebook. Two aspects should be evaluated when picking a dataset:

- **Quality**: Is the dataset of high quality? More specifically: Does the data correspond to the data you expect to deal with in your use case? Is the data diverse, unbiased, ...?
- **Size**: How big is the dataset? Usually one can safely say the bigger the dataset, the better.

It's quite difficult to efficiently evaluate whether a dataset is of high quality, and it's even more difficult to know whether and how the dataset is biased.
An efficient and reasonable heuristic for high quality is to look at the download statistics: the more downloads, the more usage, and the higher the chance that the dataset is of high quality. The size is easy to evaluate as it can usually be looked up quickly. Let's take a look at the most downloaded datasets:

- [Glue](https://huggingface.co/datasets/glue)
- [Amazon polarity](https://huggingface.co/datasets/amazon_polarity)
- [Tweet eval](https://huggingface.co/datasets/tweet_eval)
- [Yelp review full](https://huggingface.co/datasets/yelp_review_full)
- [Amazon reviews multi](https://huggingface.co/datasets/amazon_reviews_multi)

Now we can inspect those datasets in more detail by reading through the dataset card which ideally should give all relevant and important information. In addition, the [dataset viewer](https://huggingface.co/datasets/glue/viewer/cola/test) is an incredibly powerful tool to inspect whether the data suits your use case.

Let's quickly go over the dataset cards of the datasets above:
- *GLUE* is a collection of small datasets that mostly serves as a means to compare new model architectures for researchers. The datasets are too small and don't correspond enough to our use case.
- *Amazon polarity* is huge and a well-suited dataset for customer feedback since the data deals with customer reviews. However, it only has binary labels (positive/negative) whereas we are looking for more granularity in the sentiment classification. 
- *Tweet eval* uses different emojis as labels, which cannot easily be mapped to a scale going from unsatisfied to satisfied.
- *Amazon reviews multi* seems to be the most suited dataset here. We have sentiment labels ranging from 1-5 corresponding to 1-5 stars on Amazon. These labels can very well be mapped to *very unsatisfied, unsatisfied, neutral, satisfied, very satisfied*. Having inspected some examples on [the dataset viewer](https://huggingface.co/datasets/amazon_reviews_multi/viewer/en/train) we can see that the reviews look very similar to how customer feedback reviews would look, so this seems like a very good dataset. In addition, each review has a `product_category` label so we could even go as far as to only use reviews of a product category that corresponds to the one we are working in. The dataset is multi-lingual, but we are just interested in the English version for now.
- *Yelp review full* looks like a very suitable dataset. It's large and contains product reviews and sentiment labels from 1 to 5. Sadly, the dataset viewer is not working here at the moment and the dataset card is also relatively sparse requiring some more time to inspect the dataset. At this point, we should read the paper, but given the time-constraint of this blog post, we'll choose to go for *Amazon reviews multi*.
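
The mapping from star ratings to satisfaction levels mentioned above can be sketched in a few lines. This is only an illustration: the names `SATISFACTION_LABELS` and `stars_to_label` are made up for this example and are not part of the dataset or of any library.

```python
# Hypothetical helper: map a 1-5 star rating to a satisfaction label.
SATISFACTION_LABELS = [
    "very unsatisfied",
    "unsatisfied",
    "neutral",
    "satisfied",
    "very satisfied",
]


def stars_to_label(stars: int) -> str:
    """Return the satisfaction label for a star rating between 1 and 5."""
    if not 1 <= stars <= 5:
        raise ValueError(f"stars must be between 1 and 5, got {stars}")
    # Star ratings start at 1, list indexes start at 0.
    return SATISFACTION_LABELS[stars - 1]


print(stars_to_label(1))  # very unsatisfied
print(stars_to_label(5))  # very satisfied
```

Filtering for the messages that customer support should answer first would then amount to keeping the examples whose label is *very unsatisfied*.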

As a conclusion, let's focus on the [*Amazon reviews multi*](https://huggingface.co/datasets/amazon_reviews_multi) dataset considering all training examples.

As a final note, we recommend making use of the Hub's dataset functionality even when working with private datasets. The Hugging Face Hub, Transformers, and Datasets are tightly integrated, which makes it easy to use them in combination when training models.

In addition, the Hugging Face Hub offers:

- [A dataset viewer for every dataset](https://huggingface.co/datasets/amazon_reviews_multi)
- [Easy demoing of every model using widgets](https://huggingface.co/docs/hub/main#whats-a-widget)
- [Private and Public models](https://huggingface.co/docs/hub/adding-a-model#creating-a-repository)
- [Git version control for repositories](https://huggingface.co/docs/hub/main#whats-a-repository)
- [Highest security mechanisms](https://huggingface.co/docs/hub/security)

## **Find a suitable model**

Having decided on the task and the dataset that best describes our use case, we can now look into choosing a model to be used.

Most likely, you will have to fine-tune a pretrained model for your use case, but it is worth checking whether there are already fine-tuned models on the Hub that perform well. In this case, you might reach a higher performance by just continuing to fine-tune such a model on your dataset.

Let's take a look at all models that have been fine-tuned on Amazon Reviews Multi. You can find the list of models in the bottom right corner: clicking on *Browse models trained on this dataset* shows [a list of all models fine-tuned on the dataset that are publicly available](https://huggingface.co/models?dataset=dataset:amazon_reviews_multi). Note that we are only interested in the English version of the dataset because our customer feedback will only be in English. Most of the most downloaded models appear to be trained on the multi-lingual version of the dataset, and those that don't seem to be multi-lingual come with very little information or poor performance. At this point, 
it might be more sensible to fine-tune a purely pretrained model instead of using one of the already fine-tuned ones shown in the link above.

Alright, the next step now is to find a suitable pretrained model to be used for fine-tuning. This is actually more difficult than it seems given the large number of pretrained and fine-tuned models that are on the [Hugging Face Hub](https://huggingface.co/models). The best option is usually to simply try out a variety of different models to see which one performs best. 
We still haven't found the perfect way of comparing different model checkpoints to each other at Hugging Face, but we provide some resources that are worth looking into:

- The [model summary](https://huggingface.co/docs/transformers/model_summary) gives a short overview of different model architectures.
- A task-specific search on the Hugging Face Hub, *e.g.* [a search on text-classification models](https://huggingface.co/models), shows you the most downloaded checkpoints which is also an indication of how well those checkpoints perform.

Both of the above resources are currently a bit suboptimal, however. The model summary is not always kept up to date. The speed at which new model architectures are released and old model architectures become outdated makes it extremely difficult to have an up-to-date summary of all model architectures.
Similarly, the most downloaded model checkpoint is not necessarily the best one. *E.g.* [`bert-base-uncased`](https://huggingface.co/bert-base-uncased) is amongst the most downloaded model checkpoints but is not the best performing checkpoint anymore.

The best approach is often to try out a variety of different model architectures and to stay up to date with new model architectures by following experts in the field and checking well-known leaderboards.

For text-classification, the important benchmarks to look at are [GLUE](https://gluebenchmark.com/leaderboard) and [SuperGLUE](https://super.gluebenchmark.com/leaderboard). Both benchmarks evaluate pretrained models on a variety of text-classification tasks, such as grammatical correctness, natural language inference, Yes/No question answering, etc., which are quite similar to our target task of sentiment analysis. Thus, it is reasonable to choose one of the leading models of these benchmarks for our task.

At the time of writing this notebook, the best performing models are very large models containing more than 10 billion parameters, most of which are not open-sourced, *e.g.* *ST-MoE-32B*, *Turing NLR v5*, or
*ERNIE 3.0*. One of the top-ranking models that is easily accessible is [DeBERTa](https://huggingface.co/docs/transformers/model_doc/deberta). Let's try out DeBERTa's newest base version, *i.e.* [`microsoft/deberta-v3-base`](https://huggingface.co/microsoft/deberta-v3-base).

# **Training / Fine-tuning a model with 🤗 Transformers and 🤗 Datasets**

In this section, we will jump into the technical details of how to 
fine-tune a model end-to-end to be able to automatically filter out very unsatisfied customer feedback messages.

Cool, let's start by installing all necessary pip packages and by setting up our code environment, then look into preprocessing the dataset and finally start training the model.

The following notebook can be run online in Google Colab Pro with the GPU runtime environment enabled.

## **Install all necessary packages**

To begin with, let's install [`git-lfs`](https://git-lfs.github.com/) so that we can automatically upload our trained checkpoints to the Hub during training.

Also, we install the 🤗 Transformers and 🤗 Datasets libraries to run this notebook. Since we will be using [DeBERTa](https://huggingface.co/docs/transformers/model_doc/deberta-v2#debertav2) in this notebook, we also need to install the [`sentencepiece`](https://github.com/google/sentencepiece) library for its tokenizer.

In [1]:
%%capture
!pip install datasets transformers[sentencepiece]

Next, let's log in to our [Hugging Face account](https://huggingface.co/join) so that models are uploaded correctly under your name tag.

## **Preprocess the dataset**

Before we can start training the model, we should bring the dataset into a format 
that is understandable by the model.

Thankfully, the 🤗 Datasets library makes this extremely easy as you will see in the following cells.

The `load_dataset` function loads the dataset, nicely arranges it into predefined attributes, such as `review_body` and `stars`, and finally saves the newly arranged data using the [arrow format](https://arrow.apache.org/#:~:text=Format,data%20access%20without%20serialization%20overhead.) on disk. 
The arrow format allows for fast and memory-efficient data reading and writing.

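Let's load and prepare the dataset. Note that the cells below do not call `load_dataset`: they read a TSV file of labeled movie reviews (`labeledTrainData.tsv`) from Google Drive with pandas, split it into training, validation, and test sets, and then wrap the splits in 🤗 Datasets objects.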

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd

In [5]:
imdb_df = pd.read_csv('/content/drive/MyDrive/NLP/DLandNLP/NLP/labeledTrainData.tsv', sep = '\t')

In [6]:
imdb_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         25000 non-null  object
 1   sentiment  25000 non-null  int64 
 2   review     25000 non-null  object
dtypes: int64(1), object(2)
memory usage: 586.1+ KB


In [14]:
imdb_df = imdb_df[['review', 'sentiment']]

In [15]:
from sklearn.model_selection import train_test_split

In [42]:
trainval, test = train_test_split(imdb_df, test_size = 0.2)
train, val = train_test_split(trainval, test_size = 0.1)

In [43]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18000 entries, 2192 to 1536
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     18000 non-null  object
 1   sentiment  18000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 421.9+ KB


In [44]:
train.shape, test.shape, val.shape

((18000, 2), (5000, 2), (2000, 2))
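
The shapes above follow directly from the two consecutive splits. The following sketch redoes the arithmetic in plain Python; `split_sizes` is a hypothetical helper written for this example, and it uses `round`, which is a simplification of how `train_test_split` determines the split sizes.

```python
def split_sizes(n_rows, test_fraction, val_fraction):
    """Return (train, val, test) sizes for two consecutive splits."""
    n_test = round(n_rows * test_fraction)
    n_trainval = n_rows - n_test
    # The validation fraction applies to what is left after the test split.
    n_val = round(n_trainval * val_fraction)
    n_train = n_trainval - n_val
    return n_train, n_val, n_test


print(split_sizes(25000, 0.2, 0.1))  # (18000, 2000, 5000)
```

In other words, `test_size=0.1` in the second call refers to 10% of the remaining 20,000 rows, not 10% of the full data.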

In [45]:
train = train.reset_index()
test = test.reset_index()
val = val.reset_index()

Great, that was fast 🔥. Let's take a look at the structure of the dataset.

In [50]:
from datasets import Dataset, DatasetDict

imdb_review_ds = {'train' : Dataset.from_pandas(train),
                  'test' : Dataset.from_pandas(test),
                  'val': Dataset.from_pandas(val)}

In [51]:
imdb_review = DatasetDict(imdb_review_ds)

In [52]:
imdb_review

DatasetDict({
    train: Dataset({
        features: ['index', 'review', 'sentiment'],
        num_rows: 18000
    })
    test: Dataset({
        features: ['index', 'review', 'sentiment'],
        num_rows: 5000
    })
    val: Dataset({
        features: ['index', 'review', 'sentiment'],
        num_rows: 2000
    })
})

We have 18,000 training examples as well as 2,000 validation and 5,000 test examples. This sounds reasonable for training! We're only really interested in the input being the `"review"` column and the target being the `"sentiment"` column. The `"index"` column is a leftover of `reset_index()` and is dropped during preprocessing.

Let's check out a random example.

In [49]:
import random 

In [57]:
random_id = random.randint(0, 10000)

print("Sentiment:", imdb_review["train"][random_id]["sentiment"])
print("Review:", imdb_review["train"][random_id]["review"])

Sentiment: 1
Review: There were a lot of films made by Hollywood during the war years that were designed to drum up support for our troops from the public. Seen today, some might dismiss them or just see them as propaganda--which they technically are, but of a positive sort and meant to unify the nation. This film is a pretty effective and entertaining example of the genre--having a pretty realistic script and good production values. Pat O'Brien plays pretty much the same character he played in MANY other films (you know, the tough-talking, hard-driven but \swell guy\"). Randolph Scott is, as always, competent and entertaining and the rest of the extras are excellent (look for a young Robert Ryan as one of the bombardiers in training). While the story is reminiscent of several other movies about our pilots and crews, the film is well-crafted enough to make it interesting and not too far-fetched. That it, perhaps, except for the very end--where the film is a bit over-the-top but also VE

The dataset is in a human-readable format, but now we need to transform it into a "machine-readable" format. Let's define the model repository which includes all utils necessary to preprocess and fine-tune the checkpoint we decided on.

Next, we load the tokenizer of the model repository. The cell below uses the `distilbert-base-uncased` checkpoint, so the tokenizer is a `DistilBertTokenizerFast`.

In [58]:
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')


As mentioned before, we will use the `"review"` column as the model's input and `"sentiment"` as the model's target. Next, we make use of the tokenizer to transform the input into a sequence of token ids that can be understood by the model. The tokenizer can also limit your input data to a certain length so that you do not run into memory issues. Here, we limit 
the maximum length to 512 tokens, which is the longest input that `distilbert-base-uncased` accepts. 
**Important**: With `truncation=True`, reviews longer than 512 tokens are cut off, so the model only sees the beginning of very long reviews. A smaller maximum length would save memory at the cost of truncating more reviews.

If you want to learn more about tokenization in general, please have a look at [the Tokenizers docs](https://huggingface.co/course/chapter6/1?fw=pt).

The labels are easy to handle as they are already integers in their raw form: `sentiment` is either 0 or 1 (in the examples printed in this notebook, 1 belongs to a positive review and 0 to a negative one), so no shifting or remapping is needed.

Great, let's pour our thoughts into some code. We will define a `preprocess_function` that we'll apply to each data sample. 

In [80]:
def preprocess_function(example):
    output_dict = tokenizer(example["review"], max_length=512, truncation=True)
    output_dict["labels"] = example["sentiment"]
    return output_dict
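
To make the effect of `max_length` and `truncation=True` more concrete, here is a toy sketch of what a tokenizer does conceptually. It is not how `DistilBertTokenizerFast` works internally (real tokenizers split text into subword units); `TOY_VOCAB` and `toy_encode` are invented for this example, the word ids are arbitrary, and only the special-token ids mirror those of BERT-style uncased vocabularies.

```python
# Toy vocabulary, invented for this example. Word ids are arbitrary.
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "the": 5, "movie": 6, "was": 7, "great": 8,
}


def toy_encode(text, max_length):
    """Whitespace-tokenize, map tokens to ids, truncate to max_length."""
    tokens = text.lower().split()
    # Reserve two positions for [CLS] and [SEP], then cut off the rest.
    tokens = tokens[: max_length - 2]
    ids = [TOY_VOCAB["[CLS]"]]
    ids += [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    ids.append(TOY_VOCAB["[SEP]"])
    # Every position holds a real token, so the mask is all ones.
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}


print(toy_encode("The movie was great fun", max_length=10)["input_ids"])
# [101, 5, 6, 7, 8, 100, 102]
print(toy_encode("The movie was great fun", max_length=6)["input_ids"])
# [101, 5, 6, 7, 8, 102]
```

With the shorter `max_length`, the last word is dropped before the closing special token is appended, which is the behavior to keep in mind when long reviews are truncated.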

To apply this function to all data samples in our dataset, we just use the [`map`](https://huggingface.co/docs/datasets/master/en/package_reference/main_classes#datasets.Dataset.map) method of the `imdb_review` object we created earlier. This will apply the function on all the elements of all the splits in `imdb_review`, so our training, validation, and testing data will be preprocessed in one single command. We run the mapping function in `batched=True` mode to speed up the process and also remove all original columns since we don't need them anymore for training.

In [81]:
tokenized_datasets = imdb_review.map(preprocess_function, 
                                     batched=True, 
                                     remove_columns=imdb_review["train"].column_names)

Let's take a look at the new structure.

In [82]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 18000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 5000
    })
    val: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2000
    })
})

We can see that the outer layer of the structure stayed the same but the columns have changed. 
Let's take a look at another random example, first in its raw form and then after preprocessing.

In [83]:
random_id = random.randint(0, 1000)

print("Sentiment:", imdb_review["train"][random_id]["sentiment"])
print("Review:", imdb_review["train"][random_id]["review"])

Sentiment: 0
Review: I don't see enough TV game shows to understand the attraction of SHOW ME THE MONEY, but I suppose it holds some appeal for undemanding audiences. Ostensibly a quiz show, it offers contestants huge sums of money for answering a few simple questions. However, its quiz elements play only a small part in the proceedings, which I find tortuously complicated. For example, before answering a question, a contestant selects which question is to be asked by choosing from among random \A,\" \"B,\" or \"C\" choices. Does this serve any purpose other than to slow the game down? It would be a lot quicker simply to start with \"A.\" Contestants can pass on questions, but must answer one of the three questions in each category.<br /><br />After responding to a question, the contestant is then asked to \"lock in\" the answer--another delaying tactic. The contestant's next task is to name which woman from about a dozen go-go dancers in cages is to unveil a card that indicates how much t

In [84]:
print("Input IDS:", tokenized_datasets["train"][random_id]["input_ids"])
print("Attention mask:", tokenized_datasets["train"][random_id]["attention_mask"])
print("Labels:", tokenized_datasets["train"][random_id]["labels"])

Input IDS: [101, 1045, 2123, 1005, 1056, 2156, 2438, 2694, 2208, 3065, 2000, 3305, 1996, 8432, 1997, 2265, 2033, 1996, 2769, 1010, 2021, 1045, 6814, 2009, 4324, 2070, 5574, 2005, 6151, 16704, 4667, 9501, 1012, 23734, 1037, 19461, 2265, 1010, 2009, 4107, 10584, 4121, 20571, 1997, 2769, 2005, 10739, 1037, 2261, 3722, 3980, 1012, 2174, 1010, 2049, 19461, 3787, 2377, 2069, 1037, 2235, 2112, 1999, 1996, 8931, 1010, 2029, 1045, 2424, 17153, 8525, 13453, 8552, 1012, 2005, 2742, 1010, 2077, 10739, 1037, 3160, 1010, 1037, 10832, 27034, 2029, 3160, 2003, 2000, 2022, 2356, 2011, 10549, 2013, 2426, 6721, 1032, 1037, 1010, 1032, 1000, 1032, 1000, 1038, 1010, 1032, 1000, 2030, 1032, 1000, 1039, 1032, 1000, 9804, 1012, 2515, 2023, 3710, 2151, 3800, 2060, 2084, 2000, 4030, 1996, 2208, 2091, 1029, 2009, 2052, 2022, 1037, 2843, 19059, 3432, 2000, 2707, 2007, 1032, 1000, 1037, 1012, 1032, 1000, 10584, 2064, 3413, 2006, 3980, 1010, 2021, 2442, 3437, 2028, 1997, 1996, 2093, 3980, 1999, 2169, 4696, 1012, 10

Alright, the input text is transformed into a sequence of integers which can be transformed to word embeddings by the model, and the label is passed through unchanged.

## **Fine-tune the model**

Having preprocessed the dataset, next we can fine-tune the model. We will make use of the popular [Hugging Face Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer) which allows us to start training in just a couple of lines of code. The Trainer can be used for more or less all tasks in PyTorch and is extremely convenient by taking care of a lot of boilerplate code needed for training.

Let's start by loading the model checkpoint using the convenient [`AutoModelForSequenceClassification`](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForSequenceClassification). Since the checkpoint of the model repository is just a pretrained checkpoint we should define the size of the classification head by passing `num_labels=2` (since we have 2 sentiment classes).

In [85]:
from transformers import AutoModelForSequenceClassification

model_repository = "distilbert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_repository, num_labels=2)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'pre_classifi
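
With `num_labels=2`, as in the cell above, the newly initialized classification head produces two raw scores (logits) per input. The sketch below shows, in plain Python, how such logits are commonly turned into probabilities and a predicted class. The logit values are made up for illustration and do not come from the model.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    # Subtract the maximum for numerical stability.
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one review: [class 0, class 1].
logits = [-1.2, 2.3]
probs = softmax(logits)
predicted_class = probs.index(max(probs))

print([round(p, 3) for p in probs])
print(predicted_class)  # 1
```

Because the head is freshly initialized, its outputs are essentially random before fine-tuning, which is what the warning above is pointing out.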

Next, we load a data collator. A [data collator](https://huggingface.co/docs/transformers/main_classes/data_collator) is responsible for making sure each batch is correctly padded during training, which should happen dynamically since training samples are reshuffled before each epoch.

In [86]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
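
Dynamic padding means that each batch is padded only up to the length of its own longest sequence rather than to a global maximum. The idea can be sketched in plain Python as follows; `pad_batch` is a hypothetical stand-in for illustration, not the implementation of `DataCollatorWithPadding`, and the token ids are made up.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(batch["input_ids"])       # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

A batch of short reviews therefore needs far less memory than one padded to the full 512 tokens.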

During training, it is important to monitor the performance of the model on a held-out validation set. To do so, we should define a `compute_metrics` function and pass it to the `Trainer`, which then calls it at each validation step during training.

The simplest metric for the text classification task is *accuracy*, which simply states what percentage of the evaluated samples were correctly classified. Using the *accuracy* metric might be problematic, however, if the validation or test data is very unbalanced. Let's verify quickly that this is not the case by counting the occurrences of each label.

In [87]:
from collections import Counter

print("Validation:", Counter(tokenized_datasets["val"]["labels"]))
print("Test:", Counter(tokenized_datasets["test"]["labels"]))

Validation: Counter({1: 1027, 0: 973})
Test: Counter({0: 2507, 1: 2493})


The validation and test data sets are as balanced as they can be, so we can safely use accuracy here!
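
One way to quantify "balanced" is the accuracy of a trivial classifier that always predicts the most frequent label. The sketch below computes this baseline from the counts printed above; `majority_baseline` is a hypothetical helper written for this example.

```python
from collections import Counter


def majority_baseline(label_counts):
    """Accuracy obtained by always predicting the most frequent label."""
    total = sum(label_counts.values())
    return max(label_counts.values()) / total


# Label counts copied from the output above.
val_counts = Counter({1: 1027, 0: 973})
test_counts = Counter({0: 2507, 1: 2493})

print(f"Validation baseline: {majority_baseline(val_counts):.4f}")  # 0.5135
print(f"Test baseline: {majority_baseline(test_counts):.4f}")       # 0.5014
```

Any trained model should clearly beat these values of roughly 50% for its accuracy to be meaningful.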

 Let's load the [accuracy metric](https://huggingface.co/metrics/accuracy) via the datasets library.

In [88]:
from datasets import load_metric

accuracy = load_metric("accuracy")

Next, we define the `compute_metrics` function, which will be applied to the predicted outputs of the model. These are of type [`EvalPrediction`](https://huggingface.co/docs/transformers/main/en/internal/trainer_utils#transformers.EvalPrediction) and therefore expose the model's predictions and the gold labels.
We compute the predicted label class by taking the `argmax` of the model's prediction before passing it alongside the gold labels to the accuracy metric.

In [89]:
import numpy as np

def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_classes = np.argmax(pred_logits, axis=-1)
    labels = np.asarray(pred.label_ids)

    acc = accuracy.compute(predictions=pred_classes, references=labels)

    return {"accuracy": acc["accuracy"]}
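
For readers who want to see what happens inside `compute_metrics` without any library, here is a dependency-free sketch of the same argmax-then-accuracy computation. The helper names and the logits are made up for illustration.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the gold label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical logits for four reviews, columns are [class 0, class 1].
logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
labels = [0, 1, 0, 0]
print(accuracy_from_logits(logits, labels))  # 0.75
```

The third example is predicted as class 1 although its gold label is 0, hence three out of four predictions are correct.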

Great, now all components required for training are ready and all that's left to do is to define the hyper-parameters of the `Trainer`. If the model checkpoints should be uploaded to the Hugging Face Hub during training, setting `push_to_hub=True` does this automatically at every `save_steps` via the convenient [`push_to_hub`](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.push_to_hub) method. The configuration below does not set it, so checkpoints are only written to the local `output_dir`.

Besides, we define some standard hyper-parameters such as learning rate, warm-up steps and training epochs. We will log the loss every 500 steps and run evaluation every 5000 steps.

In [90]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="imdbreviews_v1",
    num_train_epochs=2, 
    learning_rate=2e-5,
    warmup_steps=200,
    logging_steps=500,
    save_steps=5000,
    eval_steps=5000,
    evaluation_strategy="steps",
)
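
It can be useful to estimate how many optimizer steps a configuration results in and at which steps evaluation will run. The sketch below does this in plain Python. It assumes a single device, no gradient accumulation, and a per-device batch size of 8 (the configuration above does not set a batch size, so the default is assumed); the helper names are made up for this example.

```python
import math


def training_steps(num_examples, batch_size, num_epochs):
    """Total optimizer steps, without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


def evaluation_steps(total_steps, eval_every):
    """Steps at which a periodic evaluation would be triggered."""
    return list(range(eval_every, total_steps + 1, eval_every))


# Assumption: one device and a batch size of 8 per device.
total = training_steps(num_examples=18000, batch_size=8, num_epochs=2)
print(total)                          # 4500
print(evaluation_steps(total, 5000))  # []
print(evaluation_steps(total, 1000))  # [1000, 2000, 3000, 4000]
```

Under these assumptions, `eval_steps=5000` is larger than the total number of steps, so a smaller value would be needed to see validation metrics during training.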

Putting it all together, we can finally instantiate the Trainer by passing all required components. We'll use the `"val"` split as the held-out dataset during training.

In [91]:
from transformers import Trainer

trainer = Trainer(
    args=training_args,
    compute_metrics=compute_metrics,
    model=model,
    tokenizer=tokenizer,
    data_collator=data_collator,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["val"]
)

The Trainer is ready to go 🚀 You can start training by calling `trainer.train()`.

In [92]:
train_metrics = trainer.train().metrics
trainer.save_metrics("train", train_metrics)



Step,Training Loss,Validation Loss


Cool, training has finished! Note that the table above contains no rows: evaluation only runs every 5000 steps, and with 18,000 training examples, 2 epochs, and a batch size of 8 (the default, assuming a single GPU), training takes roughly 4,500 steps, so no evaluation step is reached. Lowering `eval_steps` would let us monitor the validation loss and accuracy during training. For a balanced two-class task, accuracy should end up well over random chance (50%). Choosing a bigger model or training for longer might give better results, but the setup is good enough for our hypothetical use case!

Alright, finally let's save the model checkpoint to disk.

In [93]:
trainer.save_model("./model/imdb_model")

## **Evaluate / Analyse the model**

Now that we have fine-tuned the model we need to be very careful about analyzing its performance. It's usually not enough to judge the quality of a model purely on a basic metric such as *accuracy*.
The better approach is to find a metric that best describes the actual use case of the model.

Let's dive into evaluating the model 🤿.

The model has been saved to `./model/imdb_model` after training, so in a first step, let's load it from there again.

In [94]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("./model/imdb_model")

The Trainer is not only an excellent class to train a model, but also to evaluate a model on a dataset. Let's instantiate the trainer with the same instances and functions as before, but this time there is no need to pass a training dataset.

In [95]:
trainer = Trainer(
    args=training_args,
    compute_metrics=compute_metrics,
    model=model,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

We use the Trainer's `predict` function to evaluate the model on the test dataset with the same metric.

In [96]:
prediction_metrics = trainer.predict(tokenized_datasets["test"]).metrics
prediction_metrics

{'test_loss': 0.34345147013664246,
 'test_accuracy': 0.9172,
 'test_runtime': 73.9023,
 'test_samples_per_second': 67.657,
 'test_steps_per_second': 8.457}

With a test accuracy of about 91.7%, the model does seem to generalize quite well to the held-out test data 🔥
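
The reported metrics can be cross-checked with a bit of arithmetic. The sketch below copies the values from the output above and derives the approximate evaluation batch size and the number of misclassified test reviews. The variable names are chosen for this example.

```python
# Values copied from the prediction metrics printed above.
metrics = {
    "test_accuracy": 0.9172,
    "test_runtime": 73.9023,
    "test_samples_per_second": 67.657,
    "test_steps_per_second": 8.457,
}
num_test_examples = 5000

# Samples per step approximates the evaluation batch size.
batch_size = metrics["test_samples_per_second"] / metrics["test_steps_per_second"]
correct = round(metrics["test_accuracy"] * num_test_examples)
errors = num_test_examples - correct

print(round(batch_size))  # 8
print(correct, errors)    # 4586 414
```

Looking at the 414 misclassified reviews individually is often more informative than the aggregate accuracy alone.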

## Optimization

As soon as you think the model's performance is good enough for production it's all about making the model as memory efficient and fast as possible.

There are some obvious solutions to this like choosing the best suited accelerated hardware, *e.g.* better GPUs, making sure no gradients are computed during the forward pass, or lowering the precision, *e.g.* to float16. 

More advanced optimization methods include using open-source accelerator libraries such as [ONNX Runtime](https://onnxruntime.ai/index.html), [quantization](https://pytorch.org/docs/stable/quantization.html), and inference servers like [Triton](https://developer.nvidia.com/nvidia-triton-inference-server).
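
To give a feeling for what quantization does, here is a minimal sketch of symmetric int8 quantization in plain Python. It only illustrates the principle; real toolkits use more elaborate schemes (for example per-channel scales, zero points, and calibration), and the function names and weight values here are made up.

```python
def quantize_int8(values):
    """Symmetric int8 quantization: floats -> integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map int8 values back to approximate floats."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.03, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(q)  # [50, -127, 3, 100]
print([round(r, 2) for r in restored])
```

Each weight is stored in one byte instead of four, at the price of a small rounding error of at most half a quantization step.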

At Hugging Face, we have been working a lot to facilitate the optimization of models, especially with our open-source [Optimum library](https://huggingface.co/hardware). Optimum makes it extremely simple to optimize most 🤗 Transformers models.

If you're looking for **highly optimized** solutions which don't require any technical knowledge, you might be interested in one of Hugging Face's paid inference services:

- [Inference API](https://huggingface.co/inference-api)
- [Infinity](https://huggingface.co/infinity)