<a href="https://colab.research.google.com/github/wandb/examples/blob/master/colabs/huggingface/LLM_Finetuning_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
<!--- @wandbcode{llm-finetune-hf} -->

# LLM Finetuning with HuggingFace and Weights and Biases
- Fine-tune a lightweight LLM (OPT-125M) with LoRA and 8-bit quantization using Launch
- Checkpoint the LoRA adapter weights as artifacts
- Link the best checkpoint in Model Registry
- Run inference on a quantized model

The same workflow and principles from this notebook can be applied to fine-tuning some of the stronger OSS LLMs (e.g. Llama2).

### Fine-tune large models using 🤗 `peft` adapters, `transformers` & `bitsandbytes`

In this tutorial we cover how to fine-tune large language models using the `peft` library, with `bitsandbytes` for loading large models in 8-bit.
The fine-tuning method is "Low-Rank Adaptation" (LoRA): instead of fine-tuning the entire model, you fine-tune only a small set of adapter weights and load them into the model.
After fine-tuning the model you can also share your adapters on the 🤗 Hub and load them very easily. Let's get started!

### Install requirements

First, run the cells below to install the requirements:

In [None]:
!pip install -q bitsandbytes datasets accelerate loralib
!pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git
!pip install -q wandb
!pip install -q ctranslate2

### W&B Set-up 🚀

### Client Configuration

We need to connect the client to an account on the web server. In a notebook, we do this by calling `wandb.login`.

If you are not logged in yet, you will see an authorization link and be asked to enter an API key. If you already have an account, you can follow the authorization link and then copy and paste the displayed API key.

In [None]:
# Note that https://api.wandb.ai is the default and points to the publicly hosted
# app. You'll want to change this to a different API endpoint if you are trying
# to connect to a privately hosted server.
#
# You can configure this with environment variables:
# export WANDB_API_KEY=<your-api-key>
# export WANDB_BASE_URL=<your-base-url>

# Or in Colab
# %env WANDB_BASE_URL=<your-base-url>
# wandb.login(host="<your-base-url>")

In [None]:
import wandb

wandb.login()

### Track your experiments

### `wandb.init`

The `wandb.init` function initializes a new `Run`, which you can think of as a comprehensive record of your machine learning experiment. Tracking starts when you call `wandb.init` and ends when you call `wandb.finish` (called automatically via `atexit` hooks if you don't call it manually). You can also use Python's `with` statement to initialize and finish runs (see code cell below).

#### init pattern 1
`wandb.init()`

`# code here`

`wandb.finish()` (needed in a notebook, otherwise optional)

#### init pattern 2
`with wandb.init():`

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`# code here`

### `wandb.init` arguments
`wandb.init()` accepts a number of arguments. The full list can be found [here](https://docs.wandb.ai/ref/python/init), but the most common ones are:
- `entity` (str) which logs the run to a specific entity (individual or team)
- `project` (str) which logs the run to a specific project
- `config` (dict) which sets `wandb.config` for saving inputs to your run, such as hyperparameters for a model or settings for a data preprocessing job

### `wandb.log`

You can call `wandb.log` within your experiment to add metrics to your `Run`. The idea is that you call `wandb.log` many times over an experiment for the same metric, in which case the run saves the whole history of each metric across all of your `wandb.log` calls. The code cell below demonstrates how this looks in a typical training loop.

### Entity
An entity is a username or team name where you're sending runs. This entity must exist before you can send runs there, so make sure to create your account or team in the UI before starting to log runs. Teams you are part of may appear in brackets after your username. _If you want to log to your personal entity, it is only the username that you need._


In [None]:
PROJECT = "<>"  # give your project a name!
ENTITY = "<>"  # your username (see https://wandb.ai/settings or <your-base-url>/settings) or a W&B team that you are part of

In [None]:
config = dict(
  batch_size=32,
  learning_rate=1e-4,
  llm='gpt-5',
  finetune_technique='DPO',
)


"""
The pattern of "with wandb.init()..." causes wandb.finish() to be called as
soon as we leave the with block. This is especially useful when you have a script
or notebook that initializes multiple runs that you want to track separately.
"""
with wandb.init(entity=ENTITY, project=PROJECT, config=config, job_type='finetune') as run:

    for key, value in dict(wandb.config).items():
        print(key, value)

    # Simulate 98 epochs (x = 2..99) of model training / fine-tuning
    for x in range(2, 100):

        # Insert model training here...
        # ...

        # Compute metrics (or in this case, make them up)
        metrics = dict(
            loss=(1/x)**0.25,
            accuracy=1-(1/x)*2
        )

        # Pass metrics to Weights & Biases
        run.log(metrics)
    print(metrics)


### Model Loading

- Here we leverage 8-bit quantization to reduce the memory footprint of the model during training

In [None]:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
import torch
import torch.nn as nn
import bitsandbytes as bnb
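from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, BitsAndBytesConfig

# Load the weights in 8-bit via bitsandbytes. Recent transformers releases expect
# this setting in a BitsAndBytesConfig rather than a bare `load_in_8bit=True` kwarg.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map='auto',
)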

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
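
To make the memory argument concrete, here is a rough back-of-the-envelope sketch. It counts weight storage only and assumes a nominal 125 million parameters (taken from the model name, not measured). It ignores activations, optimizer state, and any layers that stay in higher precision after 8-bit loading, so treat the numbers as an order-of-magnitude estimate.

```python
# Rough weight-memory estimate per data type.
# NUM_PARAMS is an assumption based on the model name, not a measured count.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_mb(num_params: int, dtype: str) -> float:
    """Return the memory needed to store the weights, in MB (10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


NUM_PARAMS = 125_000_000  # assumed nominal size of OPT-125M

for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_mb(NUM_PARAMS, dtype):.0f} MB")
```

The same arithmetic scales linearly, which is why 8-bit loading matters far more for multi-billion-parameter models than for this small one.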

### Post-processing on the model

Finally, we need to apply some post-processing to the 8-bit model to enable training. We freeze all layers and cast the layer norms to `float32` for stability. We also cast the output of the last layer to `float32` for the same reason.

In [None]:
for param in model.parameters():
  param.requires_grad = False  # freeze the model - train adapters later
  if param.ndim == 1:
    # cast the small parameters (e.g. layernorm) to fp32 for stability
    param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()  # reduce number of stored activations
model.enable_input_require_grads()

class CastOutputToFloat(nn.Sequential):
  def forward(self, x): return super().forward(x).to(torch.float32)
model.lm_head = CastOutputToFloat(model.lm_head)
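
The `float32` casts above are about numerical precision. As a small illustration (standard library only, independent of PyTorch), the sketch below rounds Python floats to half precision with `struct` and shows two failure modes: a small increment that disappears, and a value that exceeds the half-precision range. The helper names are made up for this example.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to IEEE single precision and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def fits_in_fp16(x: float) -> bool:
    """Return True if x can be represented in half precision without overflow."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False


small_update = 1.0 + 1e-4
print(to_fp16(small_update) == 1.0)  # the 1e-4 increment is lost in half precision
print(to_fp32(small_update) == 1.0)  # single precision still distinguishes it
print(fits_in_fp16(70000.0))         # beyond the largest half-precision value
```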

### Apply LoRA

Here comes the magic with `peft`! Let's create a `PeftModel` and specify that we are going to use low-rank adapters (LoRA) via the `get_peft_model` utility function from `peft`.

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [None]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

In [None]:
model
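
As a sanity check on the number printed by `print_trainable_parameters`, the adapter size can be estimated by hand. The sketch below assumes the OPT-125M shape is 12 decoder layers with hidden size 768 (an assumption about the architecture, not something read from the loaded model) and uses `r=16` with the two target modules from the `LoraConfig` above. Each adapted matrix gains two small matrices, A with shape (r, d_in) and B with shape (d_out, r).

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


# Assumed OPT-125M shape: 12 decoder layers, hidden size 768.
HIDDEN, LAYERS = 768, 12
TARGETS_PER_LAYER = 2  # q_proj and v_proj, as in target_modules
RANK = 16              # r in the LoraConfig

per_module = lora_params(HIDDEN, HIDDEN, RANK)
total = per_module * TARGETS_PER_LAYER * LAYERS
frozen_matrix = HIDDEN * HIDDEN  # weights in one frozen projection (bias excluded)

print("adapter params per module:", per_module)
print("adapter params in total:  ", total)
print("one frozen projection:    ", frozen_matrix)
```

If the printed total differs from what `print_trainable_parameters` reports, the shape assumptions above are the first thing to check.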

### Training
- [W&B HuggingFace integration](https://docs.wandb.ai/guides/integrations/huggingface) automatically tracks important metrics during the course of training
- Also track the HF checkpoints as artifacts and register them in the model registry!
- Change the number of steps to 200+ for real results!

In [None]:
import transformers
from datasets import load_dataset
import wandb

os.environ["WANDB_LOG_MODEL"] = "checkpoint"

wandb.init(project=PROJECT,
           entity=ENTITY,
           job_type="training")

data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples['quote']), batched=True)

trainer = transformers.Trainer(
    model=model,
    train_dataset=data['train'],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        report_to="wandb",
        warmup_steps=5,
        max_steps=25,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        save_steps=5,
        output_dir='outputs'
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)


model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()
wandb.finish()
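
To put the training arguments above in perspective, the sketch below works out how many examples the run actually sees and what the learning rate looks like over 25 steps. It assumes a single GPU and a linear warmup followed by linear decay to zero, which is our understanding of the `Trainer` default schedule; the function names are made up for this example.

```python
def effective_batch_size(per_device: int, accumulation: int, num_devices: int = 1) -> int:
    """Examples consumed per optimizer step."""
    return per_device * accumulation * num_devices


def linear_warmup_decay(step: int, base_lr: float, warmup: int, total: int) -> float:
    """Linear warmup from 0 to base_lr, then linear decay to 0 at `total`."""
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * max(0.0, (total - step) / (total - warmup))


batch = effective_batch_size(per_device=4, accumulation=4)  # one visible GPU assumed
print("examples per optimizer step:", batch)
print("examples seen in 25 steps:  ", batch * 25)

for step in (0, 5, 15, 25):
    lr = linear_warmup_decay(step, base_lr=2e-4, warmup=5, total=25)
    print(f"step {step:>2}: lr = {lr:.6f}")
```

With so few examples seen, the run is a smoke test rather than a real fine-tune, which is why the notes above suggest raising the step count.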

## Sweep over hyperparameters 🧹

A Sweep combines a strategy for trying out a bunch of hyperparameter values with the code that evaluates them. Whether that strategy is as simple as trying every option or as complex as BOHB, Weights & Biases Sweeps have you covered. You just need to define your strategy in the form of a configuration.

When you're setting up a Sweep in a notebook like this, that config object is a nested dictionary. When you run a Sweep via the command line, the config object is a YAML file.

Let's walk through the definition of a Sweep config together. We'll do it slowly, so we get a chance to explain each component. In a typical Sweep pipeline, this step would be done in a single assignment.

The first thing we need to define is the method for choosing new parameter values.

We provide the following search methods:

- `grid` search: iterate over every combination of hyperparameter values. Very effective, but can be computationally costly.
- `random` search: select each new combination at random according to the provided distributions. Surprisingly effective!
- `bayes` (Bayesian) search: create a probabilistic model of the metric score as a function of the hyperparameters, and choose parameters with a high probability of improving the metric. Works well for small numbers of continuous parameters but scales poorly.

We'll stick with `bayes`.

In [None]:
sweep_config = {
    'method': 'bayes'
    }

In [None]:
metric = {
    'name': 'train/loss',
    'goal': 'minimize'
    }

sweep_config['metric'] = metric

In [None]:
parameters_dict = {
    'warmup_steps': {
        'values': [5, 10]
        },
    'max_steps': {
        'values': [10, 15, 20]
        },
    'learning_rate': {
        'distribution': 'uniform', # https://docs.wandb.ai/guides/sweeps/define-sweep-configuration#distribution
        'min': 2e-4,
        'max': 2e-2,
        },
    'per_device_train_batch_size': {
        'values': [4, 6, 8, 10, 20]
        }
    }

sweep_config['parameters'] = parameters_dict
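
To get a feel for the size of this search space, the sketch below enumerates the discrete values as a grid and draws a few random samples that also include the continuous learning rate. It is plain Python for illustration only and is not how W&B implements its search methods; `sample_random` is a hypothetical helper.

```python
import itertools
import random

# Discrete choices copied from parameters_dict.
discrete = {
    "warmup_steps": [5, 10],
    "max_steps": [10, 15, 20],
    "per_device_train_batch_size": [4, 6, 8, 10, 20],
}
LR_MIN, LR_MAX = 2e-4, 2e-2  # bounds of the uniform learning_rate distribution

# Grid search: every combination of the discrete values.
grid = [dict(zip(discrete, combo)) for combo in itertools.product(*discrete.values())]
print("grid combinations:", len(grid))


def sample_random(rng: random.Random) -> dict:
    """Random search: draw each parameter independently."""
    params = {name: rng.choice(values) for name, values in discrete.items()}
    params["learning_rate"] = rng.uniform(LR_MIN, LR_MAX)
    return params


rng = random.Random(0)  # seeded only to make the sketch repeatable
for _ in range(3):
    print(sample_random(rng))
```

A continuous parameter such as `learning_rate` cannot be enumerated, which is one reason a grid is a poor fit for this particular configuration.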

In [None]:
def train(config=None):
  # Under a sweep agent, wandb.init() receives this run's hyperparameters
  # from the sweep controller and exposes them through wandb.config.
  wandb.init(config=config)

  config = wandb.config
  for key, value in dict(config).items():
    print(key, value)

  data = load_dataset("Abirate/english_quotes")
  data = data.map(lambda samples: tokenizer(samples['quote']), batched=True)

  # NOTE: `model` is the global PEFT model created earlier, so adapter weights
  # carry over from one sweep run to the next within this session.
  trainer = transformers.Trainer(
      model=model,
      train_dataset=data['train'],
      args=transformers.TrainingArguments(
          per_device_train_batch_size=config.per_device_train_batch_size,
          gradient_accumulation_steps=4,
          report_to="wandb",
          warmup_steps=config.warmup_steps,
          max_steps=config.max_steps,
          learning_rate=config.learning_rate,
          fp16=True,
          logging_steps=1,
          save_steps=5,
          output_dir='outputs'
      ),
      data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
  )
  model.config.use_cache = False  # silence the warnings, as in the training cell
  trainer.train()
  wandb.finish()

In [None]:
sweep_id = wandb.sweep(sweep_config, entity=ENTITY, project=PROJECT) # creates sweep controller on W&B server side

In [None]:
wandb.agent(sweep_id, train, count = 3)
wandb.teardown() # if we want to do normal runs after a sweep, in the same session, we must run this.

In [None]:
# Find the run with the lowest training loss in the sweep

# Initialize the wandb API
api = wandb.Api()

# Fetch the sweep
sweep = api.sweep(f"{ENTITY}/{PROJECT}/{sweep_id}")

# Get all runs from the sweep
runs = sweep.runs

# Track the best run and the lowest loss seen so far
best_run = None
lowest_loss = float('inf')

# Iterate over all runs to find the one with the lowest loss
for run in runs:
    # Fetch the summary metrics for the current run
    metrics = run.summary._json_dict

    # Skip runs that never logged 'train/loss'
    if 'train/loss' in metrics:
        # Compare the loss with the lowest loss found so far
        if metrics['train/loss'] < lowest_loss:
            lowest_loss = metrics['train/loss']
            best_run = run

# Print the best run and its loss
print(f"The best run is {best_run.id} with a train loss of {lowest_loss}")
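
The selection logic above can also be written compactly with `min`. The sketch below uses made-up summary dictionaries in place of real API results, so the run IDs and loss values are purely hypothetical. It also shows one way to handle the case where no run logged the metric, which would otherwise leave `best_run` as `None`.

```python
# Hypothetical run summaries standing in for the summaries returned by the API.
summaries = [
    {"id": "run-a", "train/loss": 2.91},
    {"id": "run-b", "train/loss": 2.47},
    {"id": "run-c"},  # e.g. a crashed run that never logged the metric
]


def pick_best(runs, metric="train/loss"):
    """Return the run with the lowest metric, or None if no run logged it."""
    candidates = [r for r in runs if metric in r]
    return min(candidates, key=lambda r: r[metric], default=None)


best = pick_best(summaries)
print(best["id"] if best else "no run logged the metric")
```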

## Adding Model Weights to W&B Registry
- Here we get our best checkpoint from the finetuning run and register it as our best model

In [None]:
registered_model_name = "OPT-125M-english"

REGISTRY_NAME = "model"
COLLECTION_TYPE = "model"

run = wandb.init(project=PROJECT,
                 entity=ENTITY,
                 job_type="registering_best_model")

best_model = wandb.use_artifact(f'{ENTITY}/{PROJECT}/model-{best_run.id}:latest')


run.link_artifact(best_model, f"wandb_Y72QKAKNEFI3G/wandb-registry-model/{registered_model_name}") #change registered model name to path supplied in UI


run.finish()


## Consuming Model From Registry and Quantizing using ctranslate2
- LLMs are typically too large to run in full precision on even decent hardware.
- You can quantize the model to run it more efficiently with minimal loss in accuracy.
   - CTranslate2 is a great first pass at quantization but doesn't do "smart" quantization. It just converts the weights to a lower-precision type (`int8` by default in the conversion cell below).
   - Check out GPTQ and AutoGPTQ for SOTA quantization at scale
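
To illustrate what a simple, non-"smart" weight conversion looks like, here is a minimal sketch of absmax `int8` quantization in plain Python. This is a textbook scheme chosen for illustration, with made-up weights; it is not a description of what CTranslate2, bitsandbytes, or GPTQ do internally.

```python
def quantize_absmax(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale


def dequantize(q_weights, scale):
    """Recover approximate floats from the quantized integers."""
    return [q * scale for q in q_weights]


weights = [0.12, -0.5, 0.33, 0.0, 0.49]  # made-up values for illustration
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("quantized:", q)
print("max round-trip error:", max_error)
```

The round-trip error is bounded by half the scale, so a single large outlier weight makes every other weight less precise. Methods such as GPTQ exist largely to handle that problem more carefully.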

In [None]:
# Pull model from the registry

wandb.init(project=PROJECT, entity=ENTITY, job_type="ctranslate2")
best_model = wandb.use_artifact(f'wandb_<>/wandb-registry-model/{registered_model_name}:latest')  # change the registry path to the path supplied in the UI
best_model.download(root=f'models/{registered_model_name}:latest')
wandb.finish()

In [None]:
from peft import PeftModel, PeftConfig

def convert_qlora2ct2(adapter_path=f'models/{registered_model_name}:latest',
                      full_model_path="opt125m-finetuned",
                      offload_path="opt125m-offload",
                      ct2_path="opt125m-finetuned-ct2",
                      quantization="int8"):


    peft_model_id = adapter_path
    peftconfig = PeftConfig.from_pretrained(peft_model_id)

    model = AutoModelForCausalLM.from_pretrained(
      "facebook/opt-125m",
      offload_folder  = offload_path,
      device_map='auto',
    )

    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

    model = PeftModel.from_pretrained(model, peft_model_id)

    print("Peft model loaded")

    merged_model = model.merge_and_unload()

    merged_model.save_pretrained(full_model_path)
    tokenizer.save_pretrained(full_model_path)

    # Build the converter command; add --quantization only when a type is requested
    cmd = f"ct2-transformers-converter --model {full_model_path} --output_dir {ct2_path} --force"
    if quantization:
        cmd += f" --quantization {quantization}"
    if os.system(cmd) != 0:
        raise RuntimeError("ct2-transformers-converter failed")
    print("Converted successfully")

In [None]:
convert_qlora2ct2(adapter_path=f'models/{registered_model_name}:latest')

## Run Inference Using Quantized CTranslate2 Model
- Record the results in a W&B Table!

In [None]:
import ctranslate2


run = wandb.init(project=PROJECT, entity=ENTITY, job_type="inference")
run.use_artifact(f'{ENTITY}/{PROJECT}/model-{best_run.id}:latest')

generator = ctranslate2.Generator("opt125m-finetuned-ct2")

prompts = ["Hey, are you conscious? Can you talk to me?",
           "What is machine learning?",
           "What is W&B?"]


wandb_table = wandb.Table(columns=['prompt', 'completion'])
for prompt in prompts:
  start_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
  results = generator.generate_batch([start_tokens], max_length=30)
  output = tokenizer.decode(results[0].sequences_ids[0])
  wandb_table.add_data(prompt, output)

wandb.log({"inference_table": wandb_table})
wandb.finish()