# **CS 236 Final Project**
### Shreyas Lakhtakia
shreyasl@stanford.edu

`source`: https://ludwig.ai/latest/faq/


## **Basic Setup** 🧰

We're going to install Ludwig, set up our HuggingFace token, and load the dataset we will run experiments with.

### **Install Ludwig and Ludwig's LLM related dependencies.**

Install Ludwig from the latest release

In [2]:
!pip uninstall -y tensorflow --quiet
!pip install ludwig --quiet
!pip install ludwig[llm] --quiet


Define a helper to flush the CUDA cache. The text-wrapping snippet, which avoids horizontal scrolling, is kept but commented out.

In [3]:
import torch
from IPython.display import HTML, display

# Optional: wrap long output lines so we don't have to scroll horizontally.
# def set_css():
#   display(HTML('''
#   <style>
#     pre {
#         white-space: pre-wrap;
#     }
#   </style>
#   '''))

# get_ipython().events.register('pre_run_cell', set_css)

def clear_cache():
  """Release cached, unused GPU memory held by PyTorch."""
  if torch.cuda.is_available():
    torch.cuda.empty_cache()
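
`torch.cuda.empty_cache()` only releases cached blocks that no live tensor still occupies, so it can help to run Python's garbage collector first. The variant below is a minimal sketch under that assumption; the name `clear_cache_verbose` is hypothetical, and the guarded import only exists so the snippet also runs on a machine without PyTorch.

```python
import gc

try:
    import torch  # optional here; the notebook itself requires it
except ImportError:
    torch = None


def clear_cache_verbose():
    """Collect unreachable Python objects, flush the CUDA cache,
    and return the number of bytes still allocated on the GPU."""
    gc.collect()
    if torch is None or not torch.cuda.is_available():
        return 0  # nothing to flush without a CUDA device
    torch.cuda.empty_cache()
    return torch.cuda.memory_allocated()


print("bytes still allocated:", clear_cache_verbose())
```

A non-zero return value means some tensors are still referenced, for example by a loaded model.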

### **Setup HuggingFace Token** 🤗

This enables use of [Llama2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)

In [4]:
import getpass
import locale; locale.getpreferredencoding = lambda: "UTF-8"
import logging
import os
import torch
import yaml

from ludwig.api import LudwigModel

os.environ["HUGGING_FACE_HUB_TOKEN"] = getpass.getpass("Token:")
assert os.environ["HUGGING_FACE_HUB_TOKEN"]

Token: ········
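
The `getpass` prompt blocks when the notebook is run non-interactively. A minimal sketch of one way around that, assuming the token may already be exported as `HUGGING_FACE_HUB_TOKEN`; the helper name `get_hf_token` and its parameters are hypothetical and not part of Ludwig or HuggingFace:

```python
import getpass
import os


def get_hf_token(env=None, prompt=getpass.getpass):
    """Return the token from the environment, prompting only if it is missing."""
    env = os.environ if env is None else env
    token = env.get("HUGGING_FACE_HUB_TOKEN", "").strip()
    if not token:
        token = prompt("Token:").strip()
    if not token:
        raise ValueError("A HuggingFace token is required to download Llama-2-7b-hf.")
    # Store it back so downstream libraries can read it from the environment.
    env["HUGGING_FACE_HUB_TOKEN"] = token
    return token
```

Calling `get_hf_token()` with no arguments reads and updates `os.environ`, matching the cell above.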


### **Read in MedQuad inference data** 📋



In [5]:
# from google.colab import data_table; data_table.enable_dataframe_formatter()
import numpy as np; np.random.seed(236)
import pandas as pd

In [6]:
train = pd.read_csv('cleaned_med_quad_train.csv')
valid = pd.read_csv('cleaned_med_quad_valid.csv')
test = pd.read_csv('cleaned_med_quad_test.csv', on_bad_lines='warn')

In [7]:
# shrink the datasets for demo purposes
train = train.head(200)
valid = valid.head(150)
test = test.head(150)

In [8]:
print("train", train.shape)
print("valid", valid.shape)
print("test", test.shape)

train (200, 4)
valid (150, 4)
test (150, 4)


## **Retrieve finetuned model**

In [9]:
# confirm test set size
test.shape

(150, 4)

In [10]:
ft_model = LudwigModel.load('results/api_experiment_run/model')


We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).



In [11]:
ft_model.config

{'input_features': [{'active': True,
   'name': 'instruction',
   'type': 'text',
   'column': 'instruction',
   'proc_column': 'instruction_TityHg',
   'tied': None,
   'preprocessing': {'pretrained_model_name_or_path': 'meta-llama/Llama-2-7b-hf',
    'tokenizer': 'hf_tokenizer',
    'vocab_file': None,
    'sequence_length': None,
    'max_sequence_length': None,
    'most_common': 20000,
    'padding_symbol': '<PAD>',
    'unknown_symbol': '<UNK>',
    'padding': 'left',
    'lowercase': True,
    'missing_value_strategy': 'fill_with_const',
    'fill_value': '<UNK>',
    'computed_fill_value': '<UNK>',
    'ngram_size': 2,
    'cache_encoder_embeddings': False,
    'compute_idf': False},
   'encoder': {'type': 'passthrough', 'skip': False}}],
 'output_features': [{'active': True,
   'name': 'output',
   'type': 'text',
   'column': 'output',
   'proc_column': 'output_9bi87u',
   'reduce_input': 'sum',
   'default_validation_metric': 'loss',
   'dependencies': [],
   'reduce_depende

## **Score with finetuned model**

#### Check to confirm it's working


In [12]:
microtest = test[:2]
microtest

| | instruction | output | num_characters_instruction | num_characters_output |
|---|---|---|---|---|
| 0 | What is the outlook for Gaucher Disease ? | Enzyme replacement therapy is very beneficial ... | 41 | 300 |
| 1 | What is (are) Amish lethal microcephaly ? | Amish lethal microcephaly is a disorder in whi... | 41 | 916 |


In [13]:
# evaluate() returns (evaluation statistics, predictions, output directory)
x, y, z = ft_model.evaluate(microtest)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


In [14]:
pd.DataFrame(x)

| | output | combined |
|---|---|---|
| loss | 10.373506 | 10.373506 |
| token_accuracy | 0.00678 | |
| sequence_accuracy | 0.0 | |
| perplexity | 32000.003906 | |
| next_token_perplexity | 32000.460938 | |
| bleu | 0.0 | |
| rouge1_fmeasure | 0.368979 | |
| rouge1_precision | 0.420455 | |
| rouge1_recall | 0.339009 | |
| rouge2_fmeasure | 0.137222 | |
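
The loss and perplexity values above are worth a sanity check. The sketch below assumes that the reported loss is a mean per-token cross-entropy in nats and that perplexity is its exponential; both are assumptions about how the metrics are defined, not something this notebook confirms. Under those assumptions, the numbers coincide with what a model spreading probability uniformly over the 32,000-token Llama 2 vocabulary would score, so the loss on this two-row check may not be informative on its own.

```python
import math

LLAMA2_VOCAB_SIZE = 32_000  # vocabulary size of the Llama 2 tokenizer
reported_loss = 10.373506   # "loss" value from the two-row check

# Perplexity implied by the loss, if perplexity = exp(cross-entropy).
perplexity_from_loss = math.exp(reported_loss)

# Cross-entropy of a uniform distribution over the vocabulary.
uniform_loss = math.log(LLAMA2_VOCAB_SIZE)

print(f"exp(loss)      = {perplexity_from_loss:.1f}")
print(f"ln(vocab size) = {uniform_loss:.6f}")
```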


#### Perform Inference

We can now use the fine-tuned model loaded above to make predictions on test examples and see whether fine-tuning the large language model improves its ability to follow instructions and perform the task we are asking of it.
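
One way to judge a prediction is its word overlap with the reference answer, which is the idea behind the `rouge1_*` metrics. The function below is a simplified sketch using lowercased whitespace tokens; it is not necessarily how Ludwig computes ROUGE, and the prediction string is invented for illustration.

```python
from collections import Counter


def rouge1(prediction, reference):
    """Unigram-overlap precision, recall and F-measure on lowercased whitespace tokens."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f_measure = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Invented prediction, compared against the start of a MedQuad reference answer.
reference = "Enzyme replacement therapy is very beneficial"
prediction = "Enzyme therapy is beneficial for patients"
precision, recall, f_measure = rouge1(prediction, reference)
print(f"precision={precision:.3f} recall={recall:.3f} f={f_measure:.3f}")
```

Precision penalizes extra words in the prediction, while recall penalizes reference words the prediction misses.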

In [15]:
evaluation_statistics, predictions, output_directory = ft_model.evaluate(
  dataset=test,
  # data_format=None,
  split='full',
  # batch_size=None,
  skip_save_unprocessed_output=False,
  skip_save_predictions=False,
  skip_save_eval_stats=False,
  collect_predictions=True,
  collect_overall_stats=True,
  output_directory='finetuned_results_test',
  # return_type=<class 'pandas.core.frame.DataFrame'>
)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
  return np.sum(np.log(sequence_probabilities))


In [16]:
evaluation_statistics_df = pd.DataFrame(evaluation_statistics)

In [17]:
evaluation_statistics_df.to_csv("testset_evaluation_results.csv")
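
`pd.DataFrame(evaluation_statistics)` yields one column per feature, with empty cells where a metric does not apply. A long format, with one row per feature and metric, can be easier to compare across runs. The sketch below assumes `evaluation_statistics` is a dict mapping feature names to dicts of metric values, as the earlier table suggests; `flatten_stats` is a hypothetical helper and the example values are copied from the two-row check.

```python
import csv
import io


def flatten_stats(stats):
    """Turn {feature: {metric: value}} into (feature, metric, value) rows,
    skipping values that are not plain numbers."""
    rows = []
    for feature, metrics in stats.items():
        for metric, value in metrics.items():
            if isinstance(value, (int, float)):
                rows.append((feature, metric, value))
    return rows


# Assumed shape; values taken from the two-row check earlier in the notebook.
example_stats = {
    "output": {"loss": 10.373506, "token_accuracy": 0.00678, "bleu": 0.0},
    "combined": {"loss": 10.373506},
}

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["feature", "metric", "value"])
writer.writerows(flatten_stats(example_stats))
print(buffer.getvalue())
```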

#### **Observations From QLoRA Fine-Tuning** 🔎
- Even fine-tuning on just 100 examples from our dataset (which takes only about 4 minutes) significantly improves the model on our task 🔥
- The answers are not perfect with only 100 examples, but inspecting the *logic* of the responses shows that it is about 95% of the way there. This is significantly better than before: there is no repetition, and the code portions of the answers are all correct.
- Partial errors such as `sierp` instead of `array` indicate that we need to train on more data for the model to learn to follow instructions without making these kinds of mistakes.

If you're looking for a managed solution that handles the hassle of choosing the right compute for your fine-tuning task, ensures that jobs always succeed without CPU or GPU out-of-memory errors, and lets you rapidly deploy models for fast real-time inference, check out [Predibase](https://www.predibase.com/).

In [18]:
# !ludwig upload hf_hub --repo_id arnavgrg/ludwig-webinar --model_path /content/results/api_experiment_run_3