<img src="static/event-image-small.jpg" alt="drawing" width="800"/> 

# **LAB 2: FINE TUNING**

Fine-tuning refers to the process of taking a pre-trained language model and 
training it further on a specific task or domain to improve its performance on that task.  
<br />
It is an important technique used to adapt LLMs to specific tasks and domains.  
<br />
In this lab we will explore basic ways to fine-tune large language models using
open source tools. First we look at an example of doing this by hand with the open source 🤗 Transformers
Python library. Familiarity with the 🤗 Transformers package is helpful once we
introduce additional tools with more flexibility, such as H2O LLM Studio.  
<br />
In this notebook, we will explore how to fine-tune a foundational large language
model such that it can generate LinkedIn posts in the style of known influencers
on the platform. 

The prepared data set from lab 1 can be found here: `s3://h2o-public-test-data/generative-ai/`


# Using Hugging Face 

Among open source tools, Hugging Face provides some of the best for understanding
how language modeling works. Before taking a look at expert fine tuning with H2O LLM Studio,
we will look at a brief example of using the `transformers` and `datasets` Python libraries. 

Let's load in the WNLI data set from the General Language Understanding Evaluation (GLUE)
benchmark. (https://gluebenchmark.com/)

From the paper, `The Winograd Schema Challenge (Levesque et al., 2011) is a reading comprehension task
in which a system must read a sentence with a pronoun and select the referent of that pronoun from
a list of choices.`


In [1]:
import warnings
warnings.filterwarnings('ignore')

# set flag for training environment
TRAINING = True

from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "wnli")
checkpoint = "bert-base-uncased"


Downloading builder script: 100%|██████████| 28.8k/28.8k [00:00<00:00, 48.7MB/s]
Downloading metadata: 100%|██████████| 28.7k/28.7k [00:00<00:00, 67.7MB/s]
Downloading readme: 100%|██████████| 27.9k/27.9k [00:00<00:00, 51.8MB/s]
Downloading data: 100%|██████████| 29.0k/29.0k [00:00<00:00, 49.3MB/s]
Generating train split: 100%|██████████| 635/635 [00:00<00:00, 24750.56 examples/s]
Generating validation split: 100%|██████████| 71/71 [00:00<00:00, 18337.17 examples/s]
Generating test split: 100%|██████████| 146/146 [00:00<00:00, 22774.78 examples/s]


In [2]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 635
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 71
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 146
    })
})

# Tokenizer

We can automatically load the correct tokenizer used from the pretrained model
via `AutoTokenizer`.

In [3]:
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequence = 'this will be fun!'

tokenizer.tokenize(sequence)


Downloading (…)okenizer_config.json: 100%|██████████| 28.0/28.0 [00:00<00:00, 222kB/s]
Downloading (…)lve/main/config.json: 100%|██████████| 570/570 [00:00<00:00, 5.00MB/s]
Downloading (…)solve/main/vocab.txt: 100%|██████████| 232k/232k [00:00<00:00, 13.7MB/s]
Downloading (…)/main/tokenizer.json: 100%|██████████| 466k/466k [00:00<00:00, 52.5MB/s]


['this', 'will', 'be', 'fun', '!']

# Tokenizer Output

Let's take a look at the integers (`input_ids`) assigned to each token in the sequence,
as well as the other fields the tokenizer returns: `token_type_ids`, which tells the two
sentences of a pair apart, and `attention_mask`, which marks tokens the attention
mechanism should ignore (padding tokens added to even out sequence lengths, for example).

In [4]:
tokenizer(sequence)

{'input_ids': [101, 2023, 2097, 2022, 4569, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
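
In the output above, `101` and `102` are the IDs of BERT's special `[CLS]` and `[SEP]` tokens, which the tokenizer adds around the sequence. The sketch below is a hand-rolled illustration of how a sentence pair is laid out, not the library's implementation; `encode_pair` is a hypothetical helper, and the second sentence simply reuses IDs from the output above.

```python
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] in the bert-base-uncased vocabulary


def encode_pair(ids_a, ids_b):
    """Lay out two tokenized sentences as [CLS] A [SEP] B [SEP]."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    # Nothing is padded here, so every position is attended to.
    attention_mask = [1] * len(input_ids)
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


first = [2023, 2097, 2022, 4569, 999]  # 'this will be fun !'
second = [2023, 2097]                  # 'this will'
encoded = encode_pair(first, second)
print(encoded)
```

This is the layout we should expect when two sentences are passed to the tokenizer together, as in the sentence-pair task used in this lab.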

In [5]:
# function to tokenize each sentence pair in a batch of examples
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Map: 100%|██████████| 635/635 [00:00<00:00, 28083.20 examples/s]
Map: 100%|██████████| 71/71 [00:00<00:00, 12372.06 examples/s]
Map: 100%|██████████| 146/146 [00:00<00:00, 16078.15 examples/s]
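
`DataCollatorWithPadding` pads each batch to the length of its longest sequence at training time. The sketch below imitates that idea in plain Python so the effect on `input_ids` and `attention_mask` is visible; `pad_batch` is a hypothetical helper, not the collator's actual code, and it assumes right-side padding with BERT's `[PAD]` ID of 0.

```python
PAD_ID = 0  # [PAD] in the bert-base-uncased vocabulary


def pad_batch(batch_input_ids, pad_id=PAD_ID):
    """Pad every sequence on the right to the longest length in the batch."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that attention should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([
    [101, 2023, 2097, 2022, 4569, 999, 102],  # 'this will be fun !'
    [101, 4569, 999, 102],                    # 'fun !'
])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

Padding per batch rather than to a global maximum keeps short batches short, which is why the tokenizer call above only truncates and leaves padding to the collator.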


In [6]:
from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer")

# Load pretrained model weights

In [7]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Downloading model.safetensors: 100%|██████████| 440M/440M [00:01<00:00, 383MB/s] 
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
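
The warning above is expected: the pretrained checkpoint has no classification head, so `classifier.weight` and `classifier.bias` start from random values and are learned during fine tuning. As a rough worked example, assuming the head is a single linear layer on top of the 768-dimensional hidden state of `bert-base-uncased`, the number of newly initialized parameters is small:

```python
hidden_size = 768  # hidden size of bert-base-uncased
num_labels = 2     # matches num_labels=2 passed to from_pretrained

weight_params = hidden_size * num_labels  # classifier.weight
bias_params = num_labels                  # classifier.bias
new_params = weight_params + bias_params
print(new_params)  # 1538
```

Everything else in the model starts from the pretrained weights.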


# Create a Trainer object to begin fine tuning

In [8]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

In [9]:
trainer.train()

  0%|          | 0/240 [00:00<?, ?it/s]You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
100%|██████████| 240/240 [01:25<00:00,  2.82it/s]

{'train_runtime': 85.0038, 'train_samples_per_second': 22.411, 'train_steps_per_second': 2.823, 'train_loss': 0.7035834630330403, 'epoch': 3.0}





TrainOutput(global_step=240, training_loss=0.7035834630330403, metrics={'train_runtime': 85.0038, 'train_samples_per_second': 22.411, 'train_steps_per_second': 2.823, 'train_loss': 0.7035834630330403, 'epoch': 3.0})
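
The output above reports 240 optimization steps. A quick sanity check reproduces that number, assuming the `TrainingArguments` defaults of a per-device batch size of 8 and 3 epochs on a single device (the notebook does not override them). The same snippet computes the loss of a two-class classifier that always predicts 50/50, for comparison with the reported training loss.

```python
import math

train_examples = 635  # size of the WNLI train split shown earlier
batch_size = 8        # assumed: default per-device train batch size, single device
num_epochs = 3        # assumed: default number of training epochs

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 80 240

# Cross-entropy of a uniform guess over two classes is ln(2).
chance_loss = math.log(2)
print(round(chance_loss, 4))  # 0.6931
```

The reported `train_loss` of about 0.70 sits close to ln 2, which suggests the model has not moved far from chance on this small data set.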

# Generate predictions on the validation data

In [10]:
predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)

100%|██████████| 9/9 [00:00<00:00, 13.60it/s]

(71, 2) (71,)





# Model Output

As we can see, the transformer model outputs raw logits (one unnormalized score per class) rather than probabilities.

In [11]:
print(predictions.predictions[:10, :10])

[[0.09305581 0.06058019]
 [0.064982   0.07597283]
 [0.06600507 0.07268026]
 [0.11784149 0.07293674]
 [0.07580852 0.07072549]
 [0.07087269 0.07751691]
 [0.08125432 0.07708483]
 [0.10697782 0.06369373]
 [0.10485618 0.06580386]
 [0.10855884 0.073225  ]]
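
To read the logits above as probabilities, we can apply a softmax to each row. The following is a minimal pure-Python sketch applied to the first row of the output; `softmax` here is our own helper, not a function from `transformers`.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([0.09305581, 0.06058019])  # first row of the logits above
print([round(p, 4) for p in probs])
```

The two probabilities come out at roughly 0.51 and 0.49, so the model is barely more confident in class 0 than in class 1 for this example.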


# Turn into label predictions

In [12]:
import numpy as np

preds = np.argmax(predictions.predictions, axis=-1)
preds

array([0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1,
       1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0])
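
A natural next step is to compare these predictions with the true labels in `predictions.label_ids`. The sketch below shows the accuracy calculation on toy data; `accuracy` is a hypothetical helper, and the labels are made up for illustration since the notebook does not print the real ones.

```python
def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    if len(preds) != len(labels):
        raise ValueError("preds and labels must have the same length")
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)


toy_preds = [0, 1, 1, 0, 0]   # first five predictions from the array above
toy_labels = [0, 1, 0, 0, 1]  # hypothetical labels, for illustration only
score = accuracy(toy_preds, toy_labels)
print(score)  # 0.6
```

In the notebook itself, the same function could be called as `accuracy(preds.tolist(), predictions.label_ids.tolist())`.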

<img src="static/llm-studio-logo.svg" alt="drawing" width="200"/>  

# Fine Tuning like a Kaggle Grandmaster using LLM Studio

H2O LLM Studio was created by some of the most successful Kagglers in the world.
Recently, it was used to win 1st place in the Kaggle LLM Science Exam competition, 
the first competitive LLM competition on the platform. 

LLM Studio is open source technology that allows anyone to fine tune their own
large language model using their own data. 

It can be used through its UI, as well as through a Python command line interface (CLI). Please first
complete the Aquarium lab found at https://aquarium.h2o.ai before trying to use
the Python CLI. Examples of the CLI are found below as extended tasks.

***

<img src="static/aquarium-llm-studio.png" alt="drawing" width="600"/> 

***



# 🎉 **CONGRATULATIONS!** You have completed this lab!
---


# 📚 **EXTENDED TASKS**

## Programmatic Fine Tuning with H2O LLM Studio

H2O LLM Studio is open source technology located here: https://github.com/h2oai/h2o-llmstudio.

It's recommended to have a modern GPU with sufficient VRAM to support the task
of fine tuning some of these truly enormous neural network architectures. For best
results, we recommend at least one NVIDIA A100 with 80GB of GPU memory.

Once up and running, you will be able to use LLM Studio programmatically through 
the use of a `cfg.yaml` file.

You can pass this file to LLM Studio to launch your first experiment. For example,
if you have a configuration file called `my_yaml_cfg.yaml` the Python command to 
launch the experiment would be:

```python train.py -Y my_yaml_cfg.yaml```

An example file can be found in this training repository as well: `example-config.yaml`.

Let's now walk through the configurations within the YAML file.

# Overall Configurations

First of all, there are some overall configurations that are worth discussing:

- `experiment_name` - the name you'd like to call your experiment
- `output_directory` - where your results will be stored
- `problem_type` - the machine learning task you're fine tuning for (e.g. `text_causal_language_modeling`)
- `llm_backbone` - the pretrained checkpoint you'll be using to undertake transfer learning

The backbone is by far the most important configuration for your experiment. 


# Architecture

The architecture settings allow the user to specify various aspects of the 
backbone architecture being used for fine tuning.

```
architecture:
    backbone_dtype: int4
    force_embedding_gradients: false
    gradient_checkpointing: true
    intermediate_dropout: 0.0
    pretrained: true
    pretrained_weights:
```
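
To see why `backbone_dtype: int4` matters on a single GPU, here is a rough back-of-the-envelope estimate of the memory needed just to hold the backbone weights at different precisions. The 7-billion-parameter model size is a hypothetical example, and the estimate ignores activations, gradients, optimizer state, and any quantization overhead.

```python
# Approximate storage cost per parameter for common dtypes, in bytes.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "bfloat16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params, dtype):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


num_params = 7_000_000_000  # hypothetical 7B-parameter backbone
for dtype in ("float16", "int8", "int4"):
    print(dtype, weight_memory_gb(num_params, dtype))
```

Under these assumptions, the weights shrink from about 14 GB in `float16` to about 3.5 GB in `int4`, leaving more room for activations and longer sequences.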




# Tokenizer Settings

```
tokenizer:
    add_prefix_space: false
    add_prompt_answer_tokens: false
    max_length: 4032
    max_length_answer: 2016
    max_length_prompt: 2016
    padding_quantile: 1.0
    use_fast: true
```

# Augmentation Settings

```
augmentation:
    random_parent_probability: 0.0
    skip_parent_probability: 0.0
    token_mask_probability: 0.0
```
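
Our reading of `token_mask_probability` is that it sets how often input tokens are randomly masked during training, so the value `0.0` above disables the augmentation. The sketch below illustrates that idea only; `mask_tokens` and the `MASK_ID` value are hypothetical and are not taken from the LLM Studio code base.

```python
import random

MASK_ID = -1  # hypothetical placeholder; the real ID depends on the tokenizer


def mask_tokens(input_ids, mask_id, probability, rng):
    """Replace each token with mask_id independently with the given probability."""
    return [mask_id if rng.random() < probability else tok for tok in input_ids]


ids = [2023, 2097, 2022, 4569, 999]
rng = random.Random(0)  # seeded so the example is repeatable

print(mask_tokens(ids, MASK_ID, 0.0, rng))  # probability 0.0 leaves the input unchanged
print(mask_tokens(ids, MASK_ID, 0.5, rng))  # roughly half the tokens are masked on average
```

Raising the probability acts as a regularizer: the model sees slightly corrupted prompts and has to rely less on any single token.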

# Dataset Configurations

```
dataset:
    add_eos_token_to_answer: true
    add_eos_token_to_prompt: true
    add_eos_token_to_system: true
    answer_column: content
    chatbot_author: H2O.ai
    chatbot_name: h2oGPT
    data_sample: 1.0
    data_sample_choice:
    - Train
    - Validation
    limit_chained_samples: false
    mask_prompt_labels: true
    parent_id_column: None
    personalize: false
    prompt_column:
    - instruction
    system_column: None
    text_answer_separator: <|answer|>
    text_prompt_start: <|prompt|>
    text_system_start: <|system|>
    train_dataframe: h2o_genai_world_training/influencers_data_cleaned_sample.csv
    validation_dataframe: None
    validation_size: 0.01
    validation_strategy: automatic
```
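
The separator and EOS settings above determine how each CSV row is turned into training text. The sketch below shows one plausible layout implied by those settings, with the `instruction` column as the prompt and the `content` column as the answer. It is an illustration under assumptions: `format_example` is a hypothetical helper, the EOS string `</s>` depends on the backbone's tokenizer, and the sample row is made up.

```python
PROMPT_START = "<|prompt|>"      # text_prompt_start
ANSWER_SEPARATOR = "<|answer|>"  # text_answer_separator
EOS = "</s>"                     # assumed EOS string; tokenizer-specific in practice


def format_example(instruction, content):
    """Build the prompt and answer strings for one training row."""
    # add_eos_token_to_prompt: true -> EOS follows the instruction
    prompt = PROMPT_START + instruction + EOS + ANSWER_SEPARATOR
    # add_eos_token_to_answer: true -> EOS closes the answer
    answer = content + EOS
    return prompt, answer


prompt, answer = format_example(
    "Write a LinkedIn post about open source AI.",  # made-up instruction
    "Open source is changing how we build AI.",     # made-up post content
)
print(prompt)
print(answer)
```

With `mask_prompt_labels: true`, we would expect the loss to be computed on the answer part only, so the model learns to write posts rather than to reproduce instructions.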

# Training Configurations

```
training:
    batch_size: 2
    differential_learning_rate: 1.0e-05
    differential_learning_rate_layers: []
    drop_last_batch: true
    epochs: 1
    evaluate_before_training: false
    evaluation_epochs: 1.0
    grad_accumulation: 1
    gradient_clip: 0.0
    learning_rate: 0.0001
    lora: true
    lora_alpha: 16
    lora_dropout: 0.05
    lora_r: 4
    lora_target_modules: ''
    loss_function: TokenAveragedCrossEntropy
    optimizer: AdamW
    save_best_checkpoint: false
    schedule: Cosine
    train_validation_data: false
    warmup_epochs: 0.0
    weight_decay: 0.0
```
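
A few of the training settings above are easier to interpret with some arithmetic. The sketch below works through the standard LoRA formulas for `lora_r: 4` and `lora_alpha: 16`, and the textbook cosine decay for `learning_rate: 0.0001` with no warmup. The 4096 x 4096 layer size and the 100-step run are hypothetical, and we assume (without confirmation from the source) that LLM Studio's `Cosine` schedule follows this standard form.

```python
import math

lora_r, lora_alpha = 4, 16
learning_rate = 0.0001


def lora_trainable_params(d_in, d_out, r):
    """LoRA adds two low-rank matrices: A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


def cosine_lr(step, total_steps, base_lr):
    """Standard cosine decay from base_lr to 0 with no warmup."""
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))


scaling = lora_alpha / lora_r                # LoRA update is scaled by alpha / r
d = 4096                                     # hypothetical square weight matrix
full_params = d * d
lora_params = lora_trainable_params(d, d, lora_r)
fraction = lora_params / full_params

print(scaling)                                # 4.0
print(lora_params, full_params)               # 32768 16777216
print(cosine_lr(0, 100, learning_rate))       # starts at the base learning rate
print(cosine_lr(50, 100, learning_rate))      # about half the base rate at the midpoint
```

For this hypothetical layer, LoRA trains about 0.2% of the parameters a full update would touch, which is what makes fine tuning with `batch_size: 2` on a single GPU practical.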
