# Fine-Tune BERT LLM for Label Prediction with Kubeflow PyTorchJob

This notebook requires:

- A working Kubeflow installation.
- At least **2 GPUs** on your Kubernetes cluster to fine-tune the BERT model on 2 workers.
- A GCS bucket to download the custom dataset from and to export the fine-tuned model to.

You might also want to familiarize yourself with the "local" version of the fine-tuning, shown in `fine_tuning_local.ipynb`.

## Install required packages

In [None]:
!pip install transformers datasets gcsfs
!pip install git+https://github.com/kubeflow/training-operator.git#subdirectory=sdk/python

## Create script to fine-tune BERT model

We need to wrap our fine-tuning logic in a function to create a Kubeflow PyTorchJob.
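
Note that all imports live inside `train_func`. As far as we understand the SDK, the function is shipped to the worker pods and executed there, so it should not rely on names defined elsewhere in the notebook. The sketch below only mimics that idea with plain `exec`; it is not the SDK's actual mechanism, and `demo_func` and `VALUE` are hypothetical stand-ins.

```python
# Source of a self-contained function, as it might be shipped to a worker.
func_source = '''
def demo_func(parameters):
    import math  # imported inside the function, so it travels with it
    return math.sqrt(parameters["VALUE"])
'''

# Run the source in a fresh namespace that knows nothing about the notebook.
worker_namespace = {}
exec(func_source, worker_namespace)
result = worker_namespace["demo_func"]({"VALUE": 9})
print(result)
```

A function that used a module imported at notebook level would fail in the fresh namespace, which is why the imports are repeated inside the function body.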

In [5]:
def train_func(parameters):
    import os
    import gcsfs
    import evaluate
    import numpy as np
    from datasets import load_dataset
    from datasets.distributed import split_dataset_by_node
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        Trainer,
        TrainingArguments,
    )

    model_name = parameters['MODEL_NAME']
    storage_options = parameters['STORAGE_OPTIONS']
    dataset = load_dataset(
        "json",
        data_files=f'gs://{parameters["BUCKET"]}/{parameters["DATASET_FILE"]}',
        storage_options=storage_options,
    )
    ds = dataset["train"].train_test_split(test_size=0.2)
    
    labels = [label for label in ds['train'].features.keys() if label not in ['body', 'title']]
    id2label = {idx:label for idx, label in enumerate(labels)}
    label2id = {label:idx for idx, label in enumerate(labels)}

    print("-" * 40)
    print("Download BERT Model")
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", 
                                                           problem_type="multi_label_classification", 
                                                           num_labels=len(labels),
                                                           id2label=id2label,
                                                           label2id=label2id)
    # Use the tokenizer that matches the "bert-base-uncased" checkpoint loaded above.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
            
    def preprocess_data(example):
        text = f'{example["title"]}\n{example["body"]}'
        # Tokenize the concatenated title and body.
        encoding = tokenizer(text, padding=True, truncation=True)

        # Build a multi-hot label vector: 1.0 for every label set on the example.
        lbls = [0.0] * len(labels)
        for label in labels:
            if label in example and example[label] == True:
                label_id = label2id[label]
                lbls[label_id] = 1.0

        encoding["labels"] = lbls
        return encoding
    
    # Map custom dataset to BERT tokenizer.
    print("-" * 40)
    print("Map dataset to BERT Tokenizer")
    encoded_dataset = ds.map(preprocess_data, remove_columns=ds['train'].column_names)
    encoded_dataset.set_format("torch")
    
    # Distribute train and test datasets between PyTorch workers.
    # Every worker will process chunk of training data.
    # RANK and WORLD_SIZE will be set by Kubeflow Training Operator.
    RANK = int(os.environ["RANK"])
    WORLD_SIZE = int(os.environ["WORLD_SIZE"])
    distributed_ds_train = split_dataset_by_node(
        encoded_dataset["train"],
        rank=RANK,
        world_size=WORLD_SIZE,
    )
    distributed_ds_test = split_dataset_by_node(
        encoded_dataset["test"],
        rank=RANK,
        world_size=WORLD_SIZE,
    )
    
    # Compute accuracy, F1, precision, and recall.
    clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])

    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

    def compute_metrics(eval_pred):
        predictions, labels = eval_pred
        # Turn logits into probabilities and threshold them at 0.5.
        predictions = sigmoid(predictions)
        predictions = (predictions > 0.5).astype(int).reshape(-1)
        # Flatten so that every (example, label) pair counts as one binary decision.
        return clf_metrics.compute(predictions=predictions, references=labels.astype(int).reshape(-1))


    batch_size = 3
    metric_name = "f1"
    args = TrainingArguments(
        f"{model_name}",
        evaluation_strategy = "epoch",
        save_strategy = "epoch",
        learning_rate=2e-5,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        num_train_epochs=5,
        weight_decay=0.01,
        load_best_model_at_end=True,
        metric_for_best_model=metric_name,
    )

    # Define Trainer
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=distributed_ds_train,
        eval_dataset=distributed_ds_test,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
    )
    
    print("-" * 40)
    print(f"Start Distributed Training. RANK: {RANK} WORLD_SIZE: {WORLD_SIZE}")
    
    trainer.train()
    
    print("-" * 40)
    print("Training is complete")
    
    # Export trained model to GCS from the worker with RANK = 0 (master).
    if RANK == 0:
        trainer.save_model(f"./{model_name}")
        fs = gcsfs.GCSFileSystem(**storage_options)
        files = ['config.json', 'model.safetensors', 'special_tokens_map.json', 'tokenizer_config.json', 'tokenizer.json', 'training_args.bin', 'vocab.txt']
        for f in files: 
            fs.put(f'{model_name}/{f}', f'{parameters["BUCKET"]}/{model_name}/{f}')
    
    print("-" * 40)
    print("Model export complete")
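
To make the label handling in `train_func` more concrete, here is a small sketch in plain Python. It uses a made-up record with hypothetical label columns `bug` and `enhancement` (the real label names come from the dataset). It builds the multi-hot target the same way `preprocess_data` does, then applies the sigmoid-and-threshold step from `compute_metrics` to made-up logits.

```python
import math

# Hypothetical record; every column except 'title' and 'body' is a label.
example = {"title": "Crash on start", "body": "It fails.", "bug": True, "enhancement": False}

labels = [name for name in example if name not in ("body", "title")]
label2id = {label: idx for idx, label in enumerate(labels)}

# Multi-hot target: 1.0 for every label set to True, 0.0 otherwise.
target = [0.0] * len(labels)
for label in labels:
    if example[label] is True:
        target[label2id[label]] = 1.0

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Made-up logits for the same record, one per label.
logits = [2.0, -1.5]
predicted = [int(sigmoid(x) > 0.5) for x in logits]

print(labels)     # ['bug', 'enhancement']
print(target)     # [1.0, 0.0]
print(predicted)  # [1, 0]
```

Because `compute_metrics` flattens predictions and references with `reshape(-1)`, each (example, label) pair counts as one binary decision in the reported metrics.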

## Create Kubeflow PyTorchJob to fine-tune BERT on GPUs

Use `TrainingClient()` to create a PyTorchJob that fine-tunes BERT on **2 workers**, using **1 GPU** for each worker.

Your Kubernetes cluster should have sufficient **GPU** resources available.

In [6]:
from kubeflow.training import TrainingClient

job_name = "fine-tune-bert-label-prediction"
bucket = "label-prediction"
model_name = "bert-finetuned"

In [9]:
# Create PyTorchJob
TrainingClient().create_job(
    name=job_name,
    train_func=train_func,
    parameters={
        "BUCKET": bucket,
        "STORAGE_OPTIONS": {"project": "<REPLACE-WITH-YOUR-GCLOUD-PROJECT-ID>", "token": "google_default"},
        "MODEL_NAME": model_name,
        "DATASET_FILE": "prepared-issues-reduced.json"
    },
    num_workers=2,  # Number of PyTorch workers to use.
    resources_per_worker={
        "cpu": "3",
        "memory": "10G",
        "gpu": "1",
    },
    packages_to_install=[
        "gcsfs",
        "transformers",
        "datasets==2.16",
        "evaluate",
        "accelerate",
        "scikit-learn",
    ],  # PIP packages will be installed during PyTorchJob runtime.
)

### Check the PyTorchJob conditions

Use the `TrainingClient()` APIs to get information about the created PyTorchJob.

In [10]:
print("PyTorchJob Conditions")
print(TrainingClient().get_job_conditions(job_name))
print("-" * 40)

# Wait until the PyTorchJob has the Running condition.
job = TrainingClient().wait_for_job_conditions(
    job_name,
    expected_conditions={"Running"},
)
print("PyTorchJob is running")

PyTorchJob Conditions
[{'last_transition_time': datetime.datetime(2024, 10, 28, 10, 46, tzinfo=tzlocal()),
 'last_update_time': datetime.datetime(2024, 10, 28, 10, 46, tzinfo=tzlocal()),
 'message': 'PyTorchJob fine-tune-bert-label-prediction is created.',
 'reason': 'PyTorchJobCreated',
 'status': 'True',
 'type': 'Created'}]
----------------------------------------
PyTorchJob is running


### Get the PyTorchJob pod names

Since we requested 2 workers, the PyTorchJob creates 1 master pod and 1 worker pod to run the distributed training.

In [11]:
TrainingClient().get_job_pod_names(job_name)

['fine-tune-bert-label-prediction-master-0',
 'fine-tune-bert-label-prediction-worker-0']

### Get the PyTorchJob training logs

Every worker processes a part of the training samples in each epoch, since we distribute training across 2 workers.

In [12]:
logs, _ = TrainingClient().get_job_logs(job_name, follow=True)

Downloading data: 100%|██████████| 276k/276k [00:00<00:00, 84.3MB/s]
Generating train split: 359 examples [00:00, 47251.24 examples/s]
[Pod fine-tune-bert-label-prediction-master-0]: ----------------------------------------
[Pod fine-tune-bert-label-prediction-master-0]: Download BERT Model
[Pod fine-tune-bert-label-prediction-master-0]: Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
[Pod fine-tune-bert-label-prediction-master-0]: You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[Pod fine-tune-bert-label-prediction-master-0]: ----------------------------------------
[Pod fine-tune-bert-label-prediction-master-0]: Map dataset to BERT Tokenizer
Map: 100%|██████████| 287/287 [00:00<00:00, 1331.63 examples/s]
Map: 100%|██████████| 72/72 [00:00<00:00, 1108.36 examples/s]
Downloading builder script
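
The log reports 287 training and 72 test examples. As a rough illustration of what splitting by node means, the sketch below assigns contiguous index ranges to each rank. The helper `shard_bounds` is hypothetical, and the exact boundaries chosen by `split_dataset_by_node` may differ; the point is only that every worker sees a disjoint part of the data.

```python
def shard_bounds(num_examples, rank, world_size):
    """Return the [start, end) index range for one rank (contiguous split)."""
    base, extra = divmod(num_examples, world_size)
    # The first `extra` ranks take one additional example each.
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return start, end

world_size = 2
train_shards = [shard_bounds(287, rank, world_size) for rank in range(world_size)]
test_shards = [shard_bounds(72, rank, world_size) for rank in range(world_size)]
print(train_shards)  # [(0, 144), (144, 287)]
print(test_shards)   # [(0, 36), (36, 72)]
```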

Once you see "Model export complete" in the logs, the training is done!

If you want to run it again, first delete the Job resource or rename the Job to avoid name conflicts.
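
One way to avoid name conflicts between runs is to append a short random suffix to the job name. This is a sketch, not part of the original workflow: `unique_job_name` is a hypothetical helper, and the 63-character limit is an assumption based on the usual Kubernetes restriction for label-style names.

```python
import uuid

def unique_job_name(base, max_length=63):
    """Append a random 6-character hex suffix, keeping the name within max_length."""
    suffix = uuid.uuid4().hex[:6]
    # Leave room for the dash and the suffix.
    trimmed = base[: max_length - len(suffix) - 1].rstrip("-")
    return f"{trimmed}-{suffix}"

next_job_name = unique_job_name("fine-tune-bert-label-prediction")
print(next_job_name)  # fine-tune-bert-label-prediction-<random suffix>
```

The result is stored in a separate variable so that `job_name` keeps pointing at the job that is already running.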

### Download the fine-tuned model

We can download our fine-tuned BERT model from GCS to evaluate it.

In [13]:
import gcsfs
import os

storage_options = {"project": "<REPLACE-WITH-YOUR-GCLOUD-PROJECT-ID>", "token": "google_default"}
fs = gcsfs.GCSFileSystem(**storage_options)

# config.json is the model metadata.
# model.safetensors is the model weights & biases.
local_name = "bert-local"
os.makedirs(local_name, exist_ok=True)
files = ['config.json', 'model.safetensors', 'special_tokens_map.json', 'tokenizer_config.json', 'tokenizer.json', 'training_args.bin', 'vocab.txt']
for f in files:
    fs.get(f'{bucket}/{model_name}/{f}', f'{local_name}/{f}')
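
Before loading the model, it can help to confirm that every expected file actually arrived. The helper `missing_files` below is a hypothetical convenience, not part of `gcsfs`; the demonstration runs against a temporary directory so it works without access to the bucket.

```python
import os
import tempfile

def missing_files(directory, expected):
    """Return the names from `expected` that are not present in `directory`."""
    return [name for name in expected if not os.path.isfile(os.path.join(directory, name))]

expected = ["config.json", "model.safetensors", "vocab.txt"]

# Demonstration in a temporary directory that contains only one of the files.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "config.json"), "w").close()
    missing = missing_files(tmp, expected)

print(missing)  # ['model.safetensors', 'vocab.txt']
```

In the notebook you would call `missing_files(local_name, files)` and expect an empty list.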

### Test the fine-tuned BERT model

We are going to use the Hugging Face pipeline to test our model.

We use the `sentiment-analysis` task, which is the pipeline alias for text classification, to run our fine-tuned model.

In [None]:
from transformers import AutoTokenizer, pipeline

# The tokenizer files were exported together with the model, so load them from the same directory.
tokenizer = AutoTokenizer.from_pretrained("./bert-local")

# "sentiment-analysis" is the pipeline alias for text classification.
nlp = pipeline("sentiment-analysis", model="./bert-local", tokenizer=tokenizer)

bug = """
**Describe the bug**
glasskube installation doesn't work. 

**To reproduce**
Run `glasskube bootstrap`. 

**Cluster Info (please complete the following information):**
"""

enhancement = """
**Is your feature request related to a problem? Please describe.**
There are not enough cool buttons on the UI, please add more buttons I can click, that gives me more feelings of power!

**Describe the solution you'd like**
Create a button on the package detail page that says "Check for Updates" or something like that. 

"""


print(nlp(bug))
print(nlp(enhancement))
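
By default the pipeline returns only the highest-scoring label. Since the model was trained for multi-label classification, you may want every label above a threshold instead; passing `top_k=None` to the pipeline call should return a score for each label. The sketch below shows only the selection step on made-up scores; `labels_above` and the label names are hypothetical.

```python
def labels_above(scores, threshold=0.5):
    """Keep the labels whose score exceeds the threshold, highest first."""
    kept = [item for item in scores if item["score"] > threshold]
    ranked = sorted(kept, key=lambda item: item["score"], reverse=True)
    return [item["label"] for item in ranked]

# Made-up scores in the list-of-dicts shape used by text-classification pipelines.
scores = [
    {"label": "bug", "score": 0.91},
    {"label": "enhancement", "score": 0.07},
    {"label": "question", "score": 0.62},
]
selected = labels_above(scores)
print(selected)  # ['bug', 'question']
```

The 0.5 threshold mirrors the one used in `compute_metrics` during training.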



## Delete the PyTorchJob

When done with the training, you can delete the created PyTorchJob.

In [23]:
TrainingClient().delete_job(name=job_name)