<a href="https://colab.research.google.com/github/nazimboudeffa/hugging-face-trainer/blob/main/fine-tune-gpt-2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

1. Why Should We Fine-Tune Pre-Trained Models?

2. How Does Fine-Tuning Happen Using LoRA and What is Q-LoRA?

3. How Can We Fine-Tune Pre-Trained Models Using Only Open-Source Tools?


[accelerate](https://huggingface.co/docs/accelerate/en/index): Hugging Face library for running a raw PyTorch training script on any kind of device.

[transformers](https://huggingface.co/docs/transformers): Transformers provides APIs and tools to easily download and train state-of-the-art pretrained models.


In [9]:
!pip install -q accelerate datasets transformers

In [10]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
    GPT2LMHeadModel,
    GPT2Tokenizer,
    Trainer
)

In [11]:
# The model that you want to train from the Hugging Face hub
model_name = "gpt2"  # Hub ID of GPT-2 small, the same checkpoint loaded further down

# The dataset to use (plain text of the book, one example per line)
dataset_name = "sigmund-freud-a-general-introduction-to-psychoanalysis"

# Fine-tuned model name
new_model = "gpt-2-sigmund-freud-psychoanalysis"


# QLoRA parameters

In [12]:
# QLoRA parameters

# LoRA attention dimension
lora_r = 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1
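
The three values above are defined here but are not passed to the `Trainer` later in this notebook, which fine-tunes all GPT-2 weights. As a rough illustration of what `lora_r` and `lora_alpha` would control if a LoRA adapter were attached, the sketch below counts the trainable parameters added to a single weight matrix and computes the scaling factor applied to the low-rank update. The 768 x 2304 shape is an assumption chosen to match GPT-2 small's fused attention projection; the helper name `lora_param_count` is hypothetical.

```python
# Sketch: how lora_r and lora_alpha shape a LoRA adapter (illustrative only).
lora_r = 64
lora_alpha = 16

def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r

# Assumption: a 768 x 2304 projection, the shape of GPT-2 small's fused QKV layer.
d_in, d_out = 768, 2304
full = d_in * d_out
added = lora_param_count(d_in, d_out, lora_r)
scaling = lora_alpha / lora_r  # the low-rank update B @ A is multiplied by alpha / r

print(f"full weight:  {full:,} params")
print(f"LoRA adapter: {added:,} params ({added / full:.1%} of the full matrix)")
print(f"scaling factor: {scaling}")
```

With a rank as high as 64 on a model this small, the adapter is still about a ninth of the size of the matrix it adapts, so a lower rank may be worth trying.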


# bitsandbytes parameters

In [13]:
# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False
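
These quantization settings are also only declared: the model is later loaded with `GPT2LMHeadModel.from_pretrained('gpt2')` and no quantization config. To give a feel for what 4-bit loading would change, here is a back-of-the-envelope estimate of weight storage at different precisions. It assumes roughly 124 million parameters for GPT-2 small and ignores quantization constants, activations, gradients, and optimizer state.

```python
# Rough weight-memory estimate at different precisions (weights only).
# Assumption: ~124 million parameters, the commonly cited size of GPT-2 small.
n_params = 124_000_000

bits_per_param = {"float32": 32, "float16": 16, "4-bit (nf4/fp4)": 4}

def weight_megabytes(n: int, bits: int) -> float:
    """Megabytes (10**6 bytes) needed to store n parameters at the given bit width."""
    return n * bits / 8 / 1e6

for name, bits in bits_per_param.items():
    print(f"{name:>16}: {weight_megabytes(n_params, bits):7.1f} MB")
```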

# TrainingArguments parameters


In [14]:

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 1

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = 4

# Batch size per GPU for evaluation
per_device_eval_batch_size = 4

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule
lr_scheduler_type = "cosine"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save a checkpoint every X update steps
save_steps = 0

# Log every X update steps
logging_steps = 25
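
To make `warmup_ratio` and the `"cosine"` schedule concrete, the following sketch computes the learning rate at a given step: a linear ramp from 0 to `learning_rate` over the first 3% of steps, then a half-cosine decay to 0. It is a plain-Python approximation of that shape, not the library's own scheduler, and `total_steps = 1000` is a hypothetical value.

```python
import math

learning_rate = 2e-4
warmup_ratio = 0.03

def lr_at(step: int, total_steps: int) -> float:
    """Linear warmup from 0 to learning_rate, then cosine decay to 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return learning_rate * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return learning_rate * 0.5 * (1.0 + math.cos(math.pi * progress))

total_steps = 1000  # hypothetical run length
for step in (0, 15, 30, 500, 1000):
    print(f"step {step:>4}: lr = {lr_at(step, total_steps):.2e}")
```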


# SFT parameters

In [15]:
# Maximum sequence length to use
max_seq_length = None

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on the GPU 0
device_map = {"": 0}

In [17]:
# Load dataset (you can process it here)
dataset = load_dataset('text', data_files={'train': 'sigmund-freud-a-general-introduction-to-psychoanalysis.txt'})

In [18]:
# Set training parameters
training_arguments = TrainingArguments(
    output_dir="./results",  # Directory where checkpoints and outputs are written
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=3,
    weight_decay=0.01,
    save_steps=10_000,  # Checkpoint frequency in steps (adjust to your needs)
    save_total_limit=2,  # Maximum number of checkpoints to keep (avoids filling up disk space)
)


In [19]:
# Load the GPT-2 model and tokenizer
model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# GPT-2 has no pad token, so reuse the end-of-sequence token for padding
tokenizer.pad_token = tokenizer.eos_token

# Dataset preparation function (tokenization + label creation)
def preprocess_function(examples):
    # Tokenize the text, truncating or padding every line to 24 tokens
    inputs = tokenizer(examples['text'], truncation=True, padding='max_length', max_length=24)

    # Create the labels by copying the input_ids
    inputs['labels'] = inputs['input_ids'].copy()

    # Replace padding tokens with -100 in the labels so the loss ignores them
    inputs['labels'] = [[-100 if token == tokenizer.pad_token_id else token for token in labels] for labels in inputs['labels']]

    return inputs

# Apply tokenization and preprocessing to the whole dataset
tokenized_datasets = dataset.map(preprocess_function, batched=True)

# Print one sample to check the result
print(tokenized_datasets['train'][0])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Map:   0%|          | 0/386 [00:00<?, ? examples/s]

{'text': 'INTRODUCTION ', 'input_ids': [1268, 5446, 28644, 2849, 220, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256], 'attention_mask': [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': [1268, 5446, 28644, 2849, 220, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100]}
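
The sample shows padding positions replaced by `-100` in `labels`, the value the cross-entropy loss skips. Because the pad token is the same ID as the end-of-text token (50256), masking by token value would also hide a genuine end-of-text token if one ever appeared in the data. The sketch below contrasts that approach with an alternative that masks by `attention_mask` instead. The input row and both helper names are hypothetical, chosen only to show the difference.

```python
PAD_ID = 50256        # GPT-2's eos_token_id, reused as pad_token_id in this notebook
IGNORE_INDEX = -100   # label value skipped by the cross-entropy loss

def mask_by_token(input_ids):
    """Approach used above: ignore every position whose token equals the pad ID."""
    return [IGNORE_INDEX if t == PAD_ID else t for t in input_ids]

def mask_by_attention(input_ids, attention_mask):
    """Alternative: ignore only the positions the tokenizer marked as padding."""
    return [t if m == 1 else IGNORE_INDEX for t, m in zip(input_ids, attention_mask)]

# Hypothetical row: two text tokens, a genuine end-of-text token, then two pads.
input_ids = [1268, 5446, 50256, 50256, 50256]
attention_mask = [1, 1, 1, 0, 0]

print(mask_by_token(input_ids))                      # [1268, 5446, -100, -100, -100]
print(mask_by_attention(input_ids, attention_mask))  # [1268, 5446, 50256, -100, -100]
```

For the dataset used here, the two approaches give the same labels whenever the text contains no end-of-text tokens, as in the sample printed above.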


# Training

In [20]:
# Trainer
trainer = Trainer(
    model=model,
    args=training_arguments,
    train_dataset=tokenized_datasets['train'],
)

# Fine-tune the model
trainer.train()

Step,Training Loss
500,2.3476


TrainOutput(global_step=579, training_loss=2.2676173920046687, metrics={'train_runtime': 1411.0669, 'train_samples_per_second': 0.821, 'train_steps_per_second': 0.41, 'total_flos': 14183258112000.0, 'train_loss': 2.2676173920046687, 'epoch': 3.0})
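
The numbers in this output can be cross-checked with a little arithmetic. The sketch below recomputes the total step count from the dataset size (386 examples, per the `Map` progress line), the batch size, and the epoch count, then converts the mean training loss into a perplexity. It assumes a single device and no gradient accumulation, and the perplexity is only a rough training-set figure, since no evaluation split is used.

```python
import math

num_examples = 386   # from the "Map: ... 0/386" progress line
batch_size = 2       # per_device_train_batch_size passed to TrainingArguments
num_epochs = 3

# Assumption: one device and gradient_accumulation_steps = 1 (the defaults).
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(total_steps)   # 579, matching global_step in TrainOutput

train_loss = 2.2676173920046687
perplexity = math.exp(train_loss)  # perplexity = exp(mean cross-entropy)
print(f"approximate training perplexity: {perplexity:.2f}")
```

Only one loss value (step 500) appears in the table because `logging_steps` was not set in `TrainingArguments`, and its default is 500; with 579 steps in total, a second log line is never reached.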

In [21]:
# Save trained model
trainer.model.save_pretrained(new_model)

# Results

In [22]:
# Run the text-generation pipeline with the fine-tuned model
prompt = "What is the definition of psychoanalysis?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
# Note: "<s>[INST] ... [/INST]" is the Llama 2 chat template; GPT-2 was not trained on it and reads it as plain text
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


<s>[INST] What is the definition of psychoanalysis? [/INST] I must, of course, ask you this question. And, by its very "satisfactory" and "impossible"  definition,  this  exposition  describes  everything  which  psychoanalysis  claims  to  prove,  and  what  objections  the  psychoanalytic  lecturer  may  have  raised  him.  One  has  been  satisfied  with,  for  a  few  lectures.  I  cannot  imagine*  how*  some  of  the  students  at  these  lectures.  They* are,  the  result--or   result-  of  one  of  the  previous  lectures.  And  the  lectures  present  no  objection,  no  objection,  no  one,  no  of  them.  There,  I  shall


# Upload

In [24]:
# Upload the model to the Hugging Face Hub
from huggingface_hub import login
login(token="")

model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


HfHubHTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/api/repos/create (Request ID: Root=1-66f9a2cd-3c45394c717e7e23753d7307;f60aa50f-025a-4580-81c4-883c7ffe77c9)

Invalid username or password.
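
The upload cell passes the access token to `login` as a string literal, which is easy to leak when a notebook is shared. A minimal sketch of reading it from the environment instead is shown below. It assumes the token has been exported as an environment variable named `HF_TOKEN`; the helper `read_hf_token` is hypothetical, and the sketch is not claimed to resolve the 401 error above.

```python
import os

def read_hf_token(env=os.environ, var_name="HF_TOKEN"):
    """Return the token stored in an environment variable, or raise if it is missing or empty."""
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set {var_name} before uploading")
    return token

# Usage (commented out because it needs network access and a valid token):
# from huggingface_hub import login
# login(token=read_hf_token())
```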