<a href="https://colab.research.google.com/github/pierfrancescomartinello/NLP-Project/blob/main/src/training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Transformers installation
!pip install --upgrade transformers datasets
!pip install tf-keras

import os

# Route tf.keras to the legacy Keras 2 package (tf-keras) used by the TensorFlow models in Transformers
os.environ['TF_USE_LEGACY_KERAS'] = '1'



# Fine-tune a pretrained model

There are significant benefits to using a pretrained model: it reduces computation costs and your carbon footprint, and it lets you use state-of-the-art models without having to train one from scratch. 🤗 Transformers provides access to thousands of pretrained models for a wide range of tasks. When you use a pretrained model, you train it on a dataset specific to your task. This is known as fine-tuning, an incredibly powerful training technique. In this tutorial, you will fine-tune a pretrained model with a deep learning framework of your choice:

* Fine-tune a pretrained model with 🤗 Transformers [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer).
* Fine-tune a pretrained model in TensorFlow with Keras.
* Fine-tune a pretrained model in native PyTorch.

<a id='data-processing'></a>

## Train

### Loading data for Keras

When you want to train a 🤗 Transformers model with the Keras API, you need to convert your dataset to a format that
Keras understands. If your dataset is small, you can just convert the whole thing to NumPy arrays and pass it to Keras.
Let's try that first before we do anything more complicated.

First, load a dataset. We'll use the CoLA dataset from the [GLUE benchmark](https://huggingface.co/datasets/glue),
since it's a simple binary text classification task, and just take the training split for now.

In [None]:
from datasets import load_dataset

dataset = load_dataset("glue", "cola")
dataset = dataset["train"]  # Just take the training split for now



Next, load a tokenizer and tokenize the data as NumPy arrays. Note that the labels are already a list of 0s and 1s,
so we can convert them directly to a NumPy array without tokenization!

In [None]:
from transformers import AutoTokenizer
import numpy as np

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenized_data = tokenizer(dataset["sentence"], return_tensors="np", padding=True)
# Tokenizer returns a BatchEncoding, but we convert that to a dict for Keras
tokenized_data = dict(tokenized_data)

labels = np.array(dataset["label"])  # Label is already an array of 0 and 1
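
To make the effect of `padding=True` concrete, here is a minimal pure-Python sketch of what padding to the longest sequence produces. The helper `pad_batch` and the token IDs are hypothetical and only for illustration; the real tokenizer does this internally and, for BERT, also returns further keys such as `token_type_ids`.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-ID lists to the longest length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    # Append pad_id until every row has the same length
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs for illustration, not real tokenizer output
batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Every row ends up with the same length, which is what lets the result be stored as a rectangular NumPy array.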

Finally, load, [`compile`](https://keras.io/api/models/model_training_apis/#compile-method), and [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) the model. Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to:

In [None]:
from transformers import TFAutoModelForSequenceClassification

# Load and compile our model
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased")

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Import Adam from tf_keras (installed above as tf-keras), the legacy Keras 2 package
# the model is built on. An optimizer from the standalone `keras` (Keras 3) package
# fails in compile() with "Could not interpret optimizer identifier".
from tf_keras.optimizers import Adam

# Lower learning rates are often better for fine-tuning transformers
model.compile(optimizer=Adam(3e-5))  # No loss argument!

model.fit(tokenized_data, labels)

<Tip>

You don't have to pass a loss argument to your models when you `compile()` them! Hugging Face models automatically
choose a loss that is appropriate for their task and model architecture if this argument is left blank. You can always
override this by specifying a loss yourself if you want to!

</Tip>
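
As a rough illustration of what such a default loss computes, here is a pure-Python sketch of sparse categorical cross-entropy taken from raw logits. This assumes that the built-in loss for sequence classification with integer labels is of this kind; treat that as an assumption about library internals rather than a specification. The function names are hypothetical.

```python
import math


def sparse_cross_entropy_from_logits(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    # Subtract the largest logit before exponentiating for numerical stability
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[label]


def mean_loss(batch_logits, labels):
    """Average the per-example loss over a batch."""
    losses = [sparse_cross_entropy_from_logits(row, y) for row, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)


# Two-class examples with made-up logits
uniform = mean_loss([[0.0, 0.0]], [0])      # model is undecided
confident = mean_loss([[5.0, -5.0]], [0])   # model strongly favors the true class
print(uniform, confident)
```

The loss shrinks as the logit of the correct class grows relative to the others, which is the signal `fit` minimizes when no `loss` argument is passed.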

This approach works great for smaller datasets, but for larger datasets, you might find it starts to become a problem. Why?
Because the tokenized array and labels would have to be fully loaded into memory, and because NumPy doesn’t handle
“jagged” arrays, so every tokenized sample would have to be padded to the length of the longest sample in the whole
dataset. That’s going to make your array even bigger, and all those padding tokens will slow down training too!
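
The padding overhead can be estimated with a few lines of plain Python. The sketch below compares padding every sample to the global maximum (what a single NumPy array requires) with padding each batch only to its own longest sample. The function `padded_token_count` and the sequence lengths are hypothetical, chosen to show how one long outlier inflates the whole array.

```python
def padded_token_count(lengths, batch_size=None):
    """Total number of token slots stored after padding.

    batch_size=None pads every sample to the global maximum (single-array approach);
    otherwise each consecutive batch is padded only to its own maximum.
    """
    if batch_size is None:
        return max(lengths) * len(lengths)
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total


# Hypothetical token counts per sentence, with one long outlier
lengths = [5, 7, 6, 8, 40, 6, 9, 7]

real_tokens = sum(lengths)
global_total = padded_token_count(lengths)
per_batch_total = padded_token_count(lengths, batch_size=2)
print(real_tokens, global_total, per_batch_total)
```

With these made-up lengths, most of the globally padded array consists of padding tokens, while per-batch padding wastes far less.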

<a id='pytorch_native'></a>