# **CS 236 Final Project**
### Shreyas Lakhtakia
shreyasl@stanford.edu

`source`: https://ludwig.ai/latest/faq/


## **Basic Setup** 🧰

We're going to install Ludwig, set up our HuggingFace token, and load the dataset we will run experiments with.

### **Install Ludwig and Ludwig's LLM-related dependencies.**

Install Ludwig from the latest release

In [3]:
!pip uninstall -y tensorflow --quiet
!pip install ludwig --quiet
!pip install ludwig[llm] --quiet

Create a function to flush the CUDA cache. The CSS hook that enables text wrapping (so we don't have to scroll horizontally) is left commented out.

In [4]:
import torch  # needed by clear_cache() below
from IPython.display import HTML, display

# def set_css():
#   display(HTML('''
#   <style>
#     pre {
#         white-space: pre-wrap;
#     }
#   </style>
#   '''))

# get_ipython().events.register('pre_run_cell', set_css)

def clear_cache():
  if torch.cuda.is_available():
    torch.cuda.empty_cache()

### **Setup HuggingFace Token** 🤗

This enables use of [Llama2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)

In [5]:
import getpass
import locale; locale.getpreferredencoding = lambda: "UTF-8"
import logging
import os
import torch
import yaml

from ludwig.api import LudwigModel

os.environ["HUGGING_FACE_HUB_TOKEN"] = getpass.getpass("Token:")
assert os.environ["HUGGING_FACE_HUB_TOKEN"]

Token: ········


### **Read in MedQuad training data** 📋



In [6]:
# from google.colab import data_table; data_table.enable_dataframe_formatter()
import numpy as np; np.random.seed(236)
import pandas as pd

In [7]:
train = pd.read_csv('cleaned_med_quad_train.csv')
valid = pd.read_csv('cleaned_med_quad_valid.csv')
test = pd.read_csv('cleaned_med_quad_test.csv', on_bad_lines='warn')

In [8]:
print("train", train.shape)
print("valid", valid.shape)
print("test", test.shape)

train (7226, 4)
valid (2409, 4)
test (2409, 4)


In [None]:
# # shrink the datasets for demo purposes
# train = train.head(200)
# valid = valid.head(20)
# test = test.head(20)

## **Fine-Tuning the Model**




Typically, every 3-4 characters map to one *token* (the basic building block that language models use to read and generate text), and large language models have a limit on the number of tokens they can take as input. The maximum context length for the base Llama 2 model is 4096 tokens. Ludwig automatically truncates texts that are too long for the model; in the config below, `global_max_sequence_length: 1024` caps training sequences well under that limit.
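
As a rough illustration of the characters-to-tokens rule of thumb above, the sketch below estimates token counts from character counts and flags examples that may exceed a token budget. The helper names `estimate_tokens` and `may_exceed_limit`, the 3.5 characters-per-token midpoint, and the toy character counts are assumptions for illustration only; exact counts come from the model's tokenizer.

```python
import math


def estimate_tokens(num_chars: int, chars_per_token: float = 3.5) -> int:
    """Rough token estimate from a character count (a heuristic, not a tokenizer)."""
    return math.ceil(num_chars / chars_per_token)


def may_exceed_limit(instruction_chars: int, output_chars: int, max_tokens: int = 4096) -> bool:
    """True if the estimated instruction + output length is over max_tokens."""
    return estimate_tokens(instruction_chars + output_chars) > max_tokens


# Character counts are made up for illustration: (instruction, output).
toy_rows = [(46, 325), (22, 373), (60, 20000)]
for instruction_chars, output_chars in toy_rows:
    total = estimate_tokens(instruction_chars + output_chars)
    print(total, may_exceed_limit(instruction_chars, output_chars))
```

This estimate ignores the fixed prompt template, which also consumes tokens, so it is best read as a lower bound on the real sequence length.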





In [9]:
qlora_fine_tuning_config = yaml.safe_load(
"""
model_type: llm
base_model: meta-llama/Llama-2-7b-hf

input_features:
  - name: instruction
    type: text

output_features:
  - name: output
    type: text

prompt:
  template: >-
    You are a health agent trying to help potential patients who have no alternatives.
    Be helpful, respectful and honest assistant. If you don't know an answer, say so.
    Below is an instruction that describes a question. Write a response that appropriately
    answers the question truthfully.

    ### Instruction: {instruction}

    ### Response:

generation:
  temperature: 0.1
  max_new_tokens: 1024

adapter:
  type: lora

quantization:
  bits: 4

preprocessing:
  global_max_sequence_length: 1024
  split:
    type: random
    probabilities:
    - 1
    - 0
    - 0

trainer:
  type: finetune
  epochs: 1
  batch_size: 1
  eval_batch_size: 2
  gradient_accumulation_steps: 16
  learning_rate: 0.0004
  learning_rate_scheduler:
    warmup_fraction: 0.03
"""
)
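
To make the trainer settings concrete: with `batch_size: 1` and `gradient_accumulation_steps: 16`, gradients from 16 examples are accumulated before each weight update. The sketch below works through that arithmetic for the 7,226 training rows. The function names are illustrative and not part of Ludwig, and counting a final partial accumulation window as one update is an assumption about the trainer's behavior.

```python
import math


def effective_batch_size(batch_size: int, grad_accum_steps: int) -> int:
    """Number of examples that contribute to each weight update."""
    return batch_size * grad_accum_steps


def optimizer_updates(num_examples: int, batch_size: int, grad_accum_steps: int, epochs: int = 1) -> int:
    """Approximate number of weight updates over training.

    Assumes a trailing partial accumulation window still triggers an update.
    """
    micro_batches = math.ceil(num_examples / batch_size)
    return epochs * math.ceil(micro_batches / grad_accum_steps)


print(effective_batch_size(1, 16))            # examples per update
print(optimizer_updates(7226, 1, 16, epochs=1))  # approximate updates in one epoch
```

Accumulating gradients this way keeps per-step memory at the level of a single example while approximating the optimization behavior of a batch of 16.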

In [10]:
model = LudwigModel(config=qlora_fine_tuning_config, logging_level=logging.INFO)


In [11]:
print(train.shape)

(7226, 4)


In [12]:
train.sample(3)

| Unnamed: 0 | instruction | output | num_characters_instruction | num_characters_output |
|---|---|---|---|---|
| 4886 | Is congenital diaphragmatic hernia inherited ? | Isolated congenital diaphragmatic hernia is ra... | 46 | 325 |
| 311 | How to diagnose COPD ? | To confirm a COPD diagnosis, a doctor will use... | 22 | 373 |
| 167 | What are the treatments for Schimke immuno-oss... | These resources address the diagnosis or manag... | 62 | 436 |


In [13]:
results = model.train(dataset=train)


╒════════════════════════╕
│ EXPERIMENT DESCRIPTION │
╘════════════════════════╛

╒══════════════════╤═════════════════════════════════════════════════════════════════════════════════════════╕
│ Experiment name  │ api_experiment                                                                          │
├──────────────────┼─────────────────────────────────────────────────────────────────────────────────────────┤
│ Model name       │ run                                                                                     │
├──────────────────┼─────────────────────────────────────────────────────────────────────────────────────────┤
│ Output directory │ /workspace/results/api_experiment_run                                                   │
├──────────────────┼─────────────────────────────────────────────────────────────────────────────────────────┤
│ ludwig_version   │ '0.8.6'                                                                                 │
├──────────────────┼─────────


Loaded HuggingFace implementation of meta-llama/Llama-2-7b-hf tokenizer
No padding token id found. Using eos_token as pad_token.


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Max length of feature 'None': 117 (without start and stop symbols)
Setting max length using dataset: 119 (including start and stop symbols)
max sequence length is 119 for feature 'None'
Loaded HuggingFace implementation of meta-llama/Llama-2-7b-hf tokenizer
No padding token id found. Using eos_token as pad_token.


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Max length of feature 'output': 2429 (without start and stop symbols)
Setting max length using dataset: 2431 (including start and stop symbols)
max sequence length is 2431 for feature 'output'

Building dataset: DONE
Writing preprocessed training set cache to /workspace/fd1f33d0919411ee8fd00242ac170002.training.hdf5
Writing preprocessed validation set cache to /workspace/fd1f33d0919411ee8fd00242ac170002.validation.hdf5
Writing preprocessed test set cache to /workspace/fd1f33d0919411ee8fd00242ac170002.test.hdf5
Writing train set metadata to /workspace/fd1f33d0919411ee8fd00242ac170002.meta.json
Validation set empty. If this is unintentional, please check the preprocessing configuration.
Test set empty. If this is unintentional, please check the preprocessing configuration.

Dataset Statistics
╒═══════════╤═══════════════╤════════════════════╕
│ Dataset   │   Size (Rows) │ Size (In Memory)   │
╞═══════════╪═══════════════╪════════════════════╡
│ Training  │          7226 │ 1.65 Mb            │
╘═══════════╧═══════════════╧════════════════════╛

╒═══════╕
│ MODEL │
╘═══════╛

Loading large language model...



We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).



Done.
Loaded HuggingFace implementation of meta-llama/Llama-2-7b-hf tokenizer
No padding token id found. Using eos_token as pad_token.
Trainable Parameter Summary For Fine-Tuning
Fine-tuning with adapter: lora
trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199

╒══════════╕
│ TRAINING │
╘══════════╛

Creating fresh model training run.
Training for 7226 step(s), approximately 1 epoch(s).
Early stopping policy: 5 round(s) of evaluation, or 36130 step(s), approximately 5 epoch(s).

Starting with step 0, epoch: 0
Training:   4%|▎         | 256/7226 [01:54<1:00:07,  1.93it/s, loss=nan]   



Training: 100%|██████████| 7226/7226 [53:15<00:00,  1.70it/s, loss=0.0732]    
Running evaluation for step: 7226, epoch: 0
Evaluation took 0.8051s

╒═══════════════════════╤════════════╤══════════════╤════════╕
│                       │      train │ validation   │ test   │
╞═══════════════════════╪════════════╪══════════════╪════════╡
│ bleu                  │     0.2454 │              │        │
├───────────────────────┼────────────┼──────────────┼────────┤
│ char_error_rate       │     0.6612 │              │        │
├───────────────────────┼────────────┼──────────────┼────────┤
│ loss                  │     0.9985 │              │        │
├───────────────────────┼────────────┼──────────────┼────────┤
│ next_token_perplexity │ 16648.4434 │              │        │
├───────────────────────┼────────────┼──────────────┼────────┤
│ perplexity            │ 31953.3770 │              │        │
├───────────────────────┼────────────┼──────────────┼────────┤
│ rouge1_fmeasure       │     0.5
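
The trainable parameter summary in the training log above (4,194,304 trainable out of 6,742,609,920) can be reproduced with a back-of-the-envelope count. A LoRA adapter on a `d_out x d_in` weight adds two low-rank matrices with `rank * (d_in + d_out)` parameters in total. The sketch below assumes rank 8 applied to two 4096 x 4096 attention projections (query and value) in each of the model's 32 layers; the rank and target modules are assumptions, since the config only says `adapter: type: lora`, although they are consistent with the logged count.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


HIDDEN_SIZE = 4096      # Llama-2-7b hidden size
NUM_LAYERS = 32         # Llama-2-7b transformer layers
RANK = 8                # assumed adapter rank (not stated in the config)
TARGETS_PER_LAYER = 2   # assumed targets: query and value projections

trainable = NUM_LAYERS * TARGETS_PER_LAYER * lora_param_count(HIDDEN_SIZE, HIDDEN_SIZE, RANK)
all_params = 6_742_609_920  # "all params" as reported in the training log
trainable_pct = 100 * trainable / all_params

print(f"{trainable:,} trainable params ({trainable_pct:.4f}% of all params)")
```

Only these adapter weights receive gradients; the 4-bit quantized base weights stay frozen, which is what keeps the memory footprint small enough for a single GPU.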

### Save model (duplicate to confirm saving)

In [14]:
model.save('finetuned_model')

#### Double check saved model

In [None]:
# temp_model = LudwigModel.load('finetuned_model')

In [None]:
# temp_model.config

In [None]:
# evaluation_statistics, predictions, output_directory = temp_model.evaluate(
#   dataset=valid,
#   # data_format=None,
#   split='full',
#   # batch_size=None,
#   skip_save_unprocessed_output=False,
#   skip_save_predictions=False,
#   skip_save_eval_stats=False,
#   collect_predictions=True,
#   collect_overall_stats=True,
#   output_directory='finetuned_results',
#   # return_type=<class 'pandas.core.frame.DataFrame'>
# )

In [None]:
# valid

In [15]:
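# NOTE: `predictions` is only defined after one of the `model.evaluate(...)` cells
# under "Perform Inference" has run; executed before that, this line raises NameError.
predictions.output_response[18]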

NameError: name 'predictions' is not defined

#### Perform Inference

We can now use the model we fine-tuned above to make predictions on held-out examples and see whether fine-tuning the large language model improves its ability to perform the task we're asking of it.

In [None]:
evaluation_statistics, predictions, output_directory = model.evaluate(
  dataset=valid,
  # data_format=None,
  split='full',
  # batch_size=None,
  skip_save_unprocessed_output=False,
  skip_save_predictions=False,
  skip_save_eval_stats=False,
  collect_predictions=True,
  collect_overall_stats=True,
  output_directory='finetuned_results_valid',
  # return_type=<class 'pandas.core.frame.DataFrame'>
)

Loaded HuggingFace implementation of meta-llama/Llama-2-7b-hf tokenizer
No padding token id found. Using eos_token as pad_token.


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Loaded HuggingFace implementation of meta-llama/Llama-2-7b-hf tokenizer
No padding token id found. Using eos_token as pad_token.


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Evaluation:   0%|          | 0/1205 [00:00<?, ?it/s]
Decoded text inputs for the first example in batch: you are a health agent trying to help potential patients who have no alternatives. be helpful, respectful and honest assistant. if you don't know an answer, say so. below is an instruction that describes a question. write a response that appropriately answers the question truthfully.
### instruction: what are the treatments for inflammatory myopathies ?
### response:
Decoded generated output for the first example in batch: you are a health agent trying to help potential patients who have no alternatives. be helpful, respectful and honest assistant. if you don't know an answer, say so. below is an instruction that describes a question. write a response that appropriately answers the question truthfully.
### instruction: what are the treatments for inflammatory myopathies ?
### response: these resources address the diagnosis or management of inflammatory myopathies:  - gene review: gen

In [None]:
evaluation_statistics, predictions, output_directory = model.evaluate(
  dataset=test,
  # data_format=None,
  split='full',
  # batch_size=None,
  skip_save_unprocessed_output=False,
  skip_save_predictions=False,
  skip_save_eval_stats=False,
  collect_predictions=True,
  collect_overall_stats=True,
  output_directory='finetuned_results_test',
  # return_type=<class 'pandas.core.frame.DataFrame'>
)
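
One way to judge the fine-tuned model qualitatively is to read its answers next to the reference answers. The sketch below is a minimal, framework-free helper. `side_by_side` and `first_text` are hypothetical names, and the code assumes each prediction is either a string or a list whose first element is the generated text (the `output_response` column appears to hold such values, but check this against your Ludwig version). The sample strings are placeholders, not model output.

```python
from typing import Iterable, List, Sequence, Union

Prediction = Union[str, Sequence[str]]


def first_text(prediction: Prediction) -> str:
    """Return the generated text whether it is stored as a string or a list."""
    if isinstance(prediction, str):
        return prediction
    return prediction[0] if len(prediction) > 0 else ""


def side_by_side(
    questions: Iterable[str],
    references: Iterable[str],
    predictions: Iterable[Prediction],
    max_chars: int = 80,
) -> List[str]:
    """Format question / reference / prediction triples for quick reading."""
    blocks = []
    for question, reference, prediction in zip(questions, references, predictions):
        blocks.append(
            f"Q: {question}\n"
            f"Reference: {reference[:max_chars]}\n"
            f"Generated: {first_text(prediction)[:max_chars]}"
        )
    return blocks


# Placeholder strings for illustration only.
demo = side_by_side(
    ["placeholder question ?"],
    ["placeholder reference answer"],
    [["placeholder generated answer"]],
)
print(demo[0])
```

With the notebook's variables this might be called as `side_by_side(test["instruction"], test["output"], predictions["output_response"])`, assuming those columns are present.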

In [None]:
evaluation_statistics
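
The raw statistics object can be hard to scan. The sketch below flattens it into sorted `(feature, metric, value)` rows, keeping only a few scalar metrics. It assumes `evaluation_statistics` is a nested dict keyed by output feature name and then by metric name; treat that layout as an assumption and inspect the object before relying on it. `flatten_stats` is a hypothetical helper and the numbers in `mock_stats` are placeholders, not results from this run.

```python
from typing import Dict, List, Tuple


def flatten_stats(
    stats: Dict[str, Dict[str, object]],
    keep: Tuple[str, ...] = ("loss", "bleu", "rouge1_fmeasure", "char_error_rate"),
) -> List[Tuple[str, str, float]]:
    """Collect selected scalar metrics as sorted (feature, metric, value) rows."""
    rows = []
    for feature, metrics in stats.items():
        for name, value in metrics.items():
            # Skip non-scalar entries and metrics that were not requested.
            if name in keep and isinstance(value, (int, float)):
                rows.append((feature, name, float(value)))
    return sorted(rows)


# Placeholder values for illustration only.
mock_stats = {"output": {"loss": 1.0, "bleu": 0.5, "confusion_matrix": [[1]]}}
for feature, metric, value in flatten_stats(mock_stats):
    print(f"{feature:>8} {metric:<20} {value:.4f}")
```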


In [None]:
# evaluation_statistics, training_statistics, preprocessed_data, output_directory = model.experiment(
#   # dataset=None,
#   training_set=train,
#   validation_set=valid,
#   test_set=test,
#   training_set_metadata=None,
#   # data_format=None,
#   experiment_name='basic_finetune',
#   model_name='run',
#   model_resume_path=None,
#   eval_split='test',
#   skip_save_training_description=False,
#   skip_save_training_statistics=False,
#   skip_save_model=False,
#   skip_save_progress=False,
#   skip_save_log=False,
#   skip_save_processed_input=False,
#   skip_save_unprocessed_output=False,
#   skip_save_predictions=False,
#   skip_save_eval_stats=False,
#   skip_collect_predictions=False,
#   skip_collect_overall_stats=False,
#   output_directory='results',
#   random_seed=236
# )

In [None]:
# test_examples = pd.DataFrame([
#       {
#             "instruction": "Create an array of length 5 which contains all even numbers between 1 and 10.",
#             "input": ''
#       },
#       {
#             "instruction": "Create an array of length 15 containing numbers divisible by 3 up to 45.",
#             "input": "",
#       },
#       {
#             "instruction": "Create a nested loop to print every combination of numbers between 0-9",
#             "input": ""
#       },
#       {
#             "instruction": "Generate a function that computes the sum of the numbers in a given list",
#             "input": "",
#       },
#       {
#             "instruction": "Create a class to store student names, ages and grades.",
#             "input": "",
#       },
#       {
#             "instruction": "Print out the values in the following dictionary.",
#             "input": "my_dict = {\n  'name': 'John Doe',\n  'age': 32,\n  'city': 'New York'\n}",
#       },
# ])

# predictions = model.predict(test_examples)[0]
# for input_with_prediction in zip(test_examples['instruction'], test_examples['input'], predictions['output_response']):
#   print(f"Instruction: {input_with_prediction[0]}")
#   print(f"Input: {input_with_prediction[1]}")
#   print(f"Generated Output: {input_with_prediction[2][0]}")
#   print("\n\n")

#### **Observations From QLoRA Fine-Tuning** 🔎
- Even when we fine-tune the model on just 100 examples from our dataset (which takes only about 4 minutes), it significantly improves the model on our task 🔥
- The answers are not perfect with just 100 examples, but if we inspect the *logic* in the responses, we can see that it is 95% of the way there. This is significantly better than before: there is no repetition, and the code aspects of the answers are all correct.
- Partial errors such as `sierp` instead of `array` indicate that we need to train on more data for the model to learn to follow instructions without making these kinds of mistakes.

If you're looking for a managed solution that handles the hassle of choosing the right compute for your fine-tuning task, ensures that jobs succeed without CPU or GPU out-of-memory errors, and lets you rapidly deploy models for fast real-time inference, check out [Predibase](https://www.predibase.com/).

In [None]:
# !ludwig upload hf_hub --repo_id arnavgrg/ludwig-webinar --model_path /content/results/api_experiment_run_3