# **CS 236 Final Project**
### Shreyas Lakhtakia
shreyasl@stanford.edu

`source`: https://ludwig.ai/latest/faq/

## **Basic Setup** 🧰

We're going to install Ludwig, set up our HuggingFace token, and load the dataset that we will run experiments with.

### **Install Ludwig and Ludwig's LLM related dependencies.**

Install Ludwig from the latest release

In [1]:
!pip uninstall -y tensorflow --quiet
!pip install ludwig --quiet
!pip install ludwig[llm] --quiet


Create a function to flush the CUDA cache. The CSS hook that enables text wrapping (so we don't have to scroll horizontally) is left commented out here.

In [2]:
from IPython.display import HTML, display

# def set_css():
#   display(HTML('''
#   <style>
#     pre {
#         white-space: pre-wrap;
#     }
#   </style>
#   '''))

# get_ipython().events.register('pre_run_cell', set_css)

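import torch  # imported here so clear_cache() works even if this cell runs before the later imports

def clear_cache():
  # Release cached, unused GPU memory held by PyTorch (no-op without CUDA).
  if torch.cuda.is_available():
    torch.cuda.empty_cache()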

### **Setup HuggingFace Token** 🤗

This enables use of [Llama2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)

In [4]:
import getpass
import locale; locale.getpreferredencoding = lambda: "UTF-8"
import logging
import os
import torch
import yaml

from ludwig.api import LudwigModel

os.environ["HUGGING_FACE_HUB_TOKEN"] = getpass.getpass("Token:")
assert os.environ["HUGGING_FACE_HUB_TOKEN"]

Token: ········
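The cell above prompts for the token on every run. As a minimal sketch, assuming the token is sometimes already exported in the environment, the helper below reuses an existing value and only prompts when it is missing. `get_hf_token` is a hypothetical name for illustration, not part of Ludwig or any HuggingFace library.

```python
import getpass
import os


def get_hf_token(env_var="HUGGING_FACE_HUB_TOKEN"):
    """Return the token from the environment, prompting only if it is unset."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        # Nothing exported yet: fall back to an interactive prompt.
        token = getpass.getpass("Token:").strip()
    if not token:
        raise ValueError(f"{env_var} is empty; a token is needed for gated models.")
    # Store it so later cells and libraries can read it from the environment.
    os.environ[env_var] = token
    return token

# Usage (prompts only when the variable is not already set):
# get_hf_token()
```

Raising a `ValueError` instead of using `assert` keeps the check active even when Python runs with assertions disabled.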


### **Read in PubMedQA inference data** 📋



In [5]:
# from google.colab import data_table; data_table.enable_dataframe_formatter()
import numpy as np; np.random.seed(236)
import pandas as pd

In [6]:
# NOTE: train and test are both read from the same file, so they contain identical rows.
test = pd.read_csv('cleaned_pubmed_qa_all.csv')
train = pd.read_csv('cleaned_pubmed_qa_all.csv')

In [10]:
# shrink the datasets for demo purposes
train = train.head(100)
test = test.head(100)

In [11]:
print("train", train.shape)
print("test", test.shape)

train (100, 7)
test (100, 7)
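Since `train` and `test` come from the same CSV, their first 100 rows are the same examples. If a held-out test set is wanted, one option is a seeded random split. The sketch below is an illustration on a toy frame, not what this notebook does; `split_train_test` and the toy data are hypothetical, and the split assumes the DataFrame has a unique index.

```python
import pandas as pd


def split_train_test(df, test_fraction=0.2, seed=236):
    """Return disjoint (train, test) frames drawn from df with a fixed seed."""
    # Sample the test rows first, then drop them by index to get the train rows.
    test = df.sample(frac=test_fraction, random_state=seed)
    train = df.drop(test.index)
    return train.reset_index(drop=True), test.reset_index(drop=True)


# Toy stand-in for the real CSV (made-up ids and labels).
toy = pd.DataFrame({"pubid": range(10), "output": ["yes", "no"] * 5})
toy_train, toy_test = split_train_test(toy)
print("train", toy_train.shape)
print("test", toy_test.shape)
```

Reusing the same seed as `np.random.seed(236)` above keeps the split reproducible across runs.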


## **Retrieve finetuned model**

In [9]:
ft_model = LudwigModel.load('results/api_experiment_run/model')

We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

## **Score with finetuned model**

#### Check to confirm it's working


In [10]:
# Copy the slice so the assignments below modify an independent DataFrame
# rather than a view of `test` (this avoids pandas' SettingWithCopyWarning).
microtest = test[:2].copy()

microtest['instruction_1'] = microtest['instruction']
microtest['instruction_2'] = ' Reply in a single word saying yes or no.'

microtest['instruction'] = microtest['instruction_1'] + microtest['instruction_2']
microtest

Unnamed: 0,pubid,instruction,context,long_answer,output,num_characters_instruction,num_characters_output,instruction_1,instruction_2
0,21645374,Do mitochondria play a role in remodelling lac...,{'contexts': ['Programmed cell death (PCD) is ...,Results depicted mitochondrial dynamics in viv...,yes,90,3,Do mitochondria play a role in remodelling lac...,Reply in a single word saying yes or no.
1,16418930,Landolt C and snellen e acuity: differences in...,{'contexts': ['Assessment of visual acuity dep...,"Using the charts described, there was only a s...",no,68,2,Landolt C and snellen e acuity: differences in...,Reply in a single word saying yes or no.


In [12]:
microtest.instruction[1]

'Landolt C and snellen e acuity: differences in strabismus amblyopia? Reply in a single word saying yes or no.'
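The same suffix is appended to the small check set here and to the full test set later. A minimal sketch of a reusable helper follows; `with_yes_no_suffix` is a hypothetical name, and it returns a copy so the input frame is left untouched.

```python
import pandas as pd

YES_NO_SUFFIX = " Reply in a single word saying yes or no."


def with_yes_no_suffix(df, column="instruction", suffix=YES_NO_SUFFIX):
    """Return a copy of df with the suffix appended to every value in column."""
    out = df.copy()
    out[column] = out[column].astype(str) + suffix
    return out


# One question taken from the table above.
questions = pd.DataFrame(
    {"instruction": ["Landolt C and snellen e acuity: differences in strabismus amblyopia?"]}
)
prompted = with_yes_no_suffix(questions)
print(prompted["instruction"][0])
```

Because the helper copies its input, calling it twice on the same frame does not stack the suffix on the original data.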

In [13]:
microtest

Unnamed: 0,pubid,instruction,context,long_answer,output,num_characters_instruction,num_characters_output,instruction_1,instruction_2
0,21645374,Do mitochondria play a role in remodelling lac...,{'contexts': ['Programmed cell death (PCD) is ...,Results depicted mitochondrial dynamics in viv...,yes,90,3,Do mitochondria play a role in remodelling lac...,Reply in a single word saying yes or no.
1,16418930,Landolt C and snellen e acuity: differences in...,{'contexts': ['Assessment of visual acuity dep...,"Using the charts described, there was only a s...",no,68,2,Landolt C and snellen e acuity: differences in...,Reply in a single word saying yes or no.


In [14]:
temp = ft_model.predict(dataset=microtest,
  # data_format=None,
  split='full',
  # batch_size=None,
  skip_save_unprocessed_output=True,
  skip_save_predictions=False,
  skip_save_eval_stats=False,
  collect_predictions=True,
  collect_overall_stats=True,
  output_directory='finetuned_results_microtest_4',
  # return_type=<class 'pandas.core.frame.DataFrame'>
)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
  return np.sum(np.log(sequence_probabilities))


In [12]:
# x, y, z = ft_model.evaluate(dataset=microtest,
#   # data_format=None,
#   split='full',
#   # batch_size=None,
#   skip_save_unprocessed_output=True,
#   skip_save_predictions=False,
#   skip_save_eval_stats=False,
#   collect_predictions=True,
#   collect_overall_stats=True,
#   output_directory='finetuned_results_microtest_4',
#   # return_type=<class 'pandas.core.frame.DataFrame'>
# )

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
  return np.sum(np.log(sequence_probabilities))


### Full testset

In [12]:
# Work on a copy so the added columns do not also appear on `test`.
testset = test.copy()

testset['instruction_1'] = testset['instruction']
testset['instruction_2'] = ' Reply in a single word saying yes or no.'

testset['instruction'] = testset['instruction_1'] + testset['instruction_2'] 
print(testset.shape)
testset.head()

(100, 9)


Unnamed: 0,pubid,instruction,context,long_answer,output,num_characters_instruction,num_characters_output,instruction_1,instruction_2
0,21645374,Do mitochondria play a role in remodelling lac...,{'contexts': ['Programmed cell death (PCD) is ...,Results depicted mitochondrial dynamics in viv...,yes,90,3,Do mitochondria play a role in remodelling lac...,Reply in a single word saying yes or no.
1,16418930,Landolt C and snellen e acuity: differences in...,{'contexts': ['Assessment of visual acuity dep...,"Using the charts described, there was only a s...",no,68,2,Landolt C and snellen e acuity: differences in...,Reply in a single word saying yes or no.
2,9488747,"Syncope during bathing in infants, a pediatric...",{'contexts': ['Apparent life-threatening event...,"""Aquagenic maladies"" could be a pediatric form...",yes,79,3,"Syncope during bathing in infants, a pediatric...",Reply in a single word saying yes or no.
3,17208539,Are the long-term results of the transanal pul...,{'contexts': ['The transanal endorectal pull-t...,Our long-term study showed significantly bette...,no,106,2,Are the long-term results of the transanal pul...,Reply in a single word saying yes or no.
4,10808977,Can tailored interventions increase mammograph...,{'contexts': ['Telephone counseling and tailor...,The effects of the intervention were most pron...,yes,68,3,Can tailored interventions increase mammograph...,Reply in a single word saying yes or no.


In [13]:
testset.instruction[3]

'Are the long-term results of the transanal pull-through equal to those of the transabdominal pull-through? Reply in a single word saying yes or no.'

#### Perform Inference

We can now use the fine-tuned model loaded above to make predictions on the test examples and see whether fine-tuning the large language model improves its ability to follow instructions and perform the task we're asking of it.

In [14]:
classif_results = ft_model.predict(
  dataset=testset,
  # data_format=None,
  split='full',
  # batch_size=None,
  skip_save_unprocessed_output=False,
  skip_save_predictions=False,
  skip_save_eval_stats=False,
  collect_predictions=True,
  collect_overall_stats=True,
  output_directory='finetuned_results_test',
  # return_type=<class 'pandas.core.frame.DataFrame'>
)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
  return np.sum(np.log(sequence_probabilities))
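Generated responses are free text, so before comparing them with the `output` labels it helps to map each one to a single label. The sketch below is one way to do that with the standard library. It assumes a response arrives either as a string or as a list of strings, and it includes `maybe` on the assumption that the labels may contain it; `normalize_answer` is a hypothetical helper.

```python
import re

# First whole-word occurrence of an allowed label wins.
_ANSWER_PATTERN = re.compile(r"\b(yes|no|maybe)\b")


def normalize_answer(response):
    """Map a free-text response to 'yes', 'no', 'maybe', or 'unknown'."""
    # Assumption: a response may be one string or a list/tuple of strings.
    if isinstance(response, (list, tuple)):
        response = " ".join(str(part) for part in response)
    match = _ANSWER_PATTERN.search(str(response).lower())
    return match.group(1) if match else "unknown"


examples = ["Yes.", " no", ["Maybe", "it depends"], "I do not know"]
print([normalize_answer(r) for r in examples])
# ['yes', 'no', 'maybe', 'unknown']
```

Word boundaries keep `no` from matching inside words such as "not" or "know", which is why the last example falls through to `unknown`.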


In [None]:
evaluation_statistics, predictions, output_directory = ft_model.evaluate(
  dataset=testset,
  # data_format=None,
  split='full',
  # batch_size=None,
  skip_save_unprocessed_output=False,
  skip_save_predictions=False,
  skip_save_eval_stats=False,
  collect_predictions=True,
  collect_overall_stats=True,
  output_directory='finetuned_results_test',
  # return_type=<class 'pandas.core.frame.DataFrame'>
)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
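For a yes/no task, a simple exact-match accuracy and a confusion table can be easier to read than the generic evaluation statistics. The sketch below computes both with pandas on toy values; the labels and predictions are made up for illustration, and `yes_no_accuracy` is a hypothetical helper rather than a Ludwig function.

```python
import pandas as pd


def _clean(values):
    """Lowercase and strip answers so 'Yes' and ' yes ' compare equal."""
    return pd.Series(list(values)).str.strip().str.lower()


def yes_no_accuracy(labels, predictions):
    """Fraction of positions where the cleaned prediction equals the label."""
    return float((_clean(labels) == _clean(predictions)).mean())


# Toy values, not results from this notebook.
toy_labels = ["yes", "no", "yes", "no"]
toy_predictions = ["Yes", "no", "no", " no "]

accuracy = yes_no_accuracy(toy_labels, toy_predictions)
confusion = pd.crosstab(
    _clean(toy_labels), _clean(toy_predictions),
    rownames=["label"], colnames=["prediction"],
)
print(accuracy)
print(confusion)
```

The confusion table shows which direction the errors go, for example how often a true `yes` is answered with `no`.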


In [None]:
evaluation_statistics_df = pd.DataFrame(evaluation_statistics)

In [None]:
evaluation_statistics_df.to_csv("testset_evaluation_results.csv")
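If the evaluation statistics are a nested dictionary keyed first by output feature and then by metric name (an assumption about their shape, not something shown in this notebook's output), flattening them into one row per metric makes the saved CSV easier to filter. A minimal sketch with made-up numbers; `flatten_stats` is a hypothetical helper.

```python
def flatten_stats(stats):
    """Turn {feature: {metric: value}} into a list of long-form rows."""
    rows = []
    for feature, metrics in stats.items():
        for metric, value in metrics.items():
            # Keep scalar metrics only; skip nested structures such as matrices.
            if isinstance(value, (int, float)):
                rows.append({"feature": feature, "metric": metric, "value": float(value)})
    return rows


# Made-up numbers in the assumed shape, for illustration only.
toy_stats = {
    "output": {"loss": 0.5, "perplexity": 1.5, "per_token": [0.1, 0.2]},
    "combined": {"loss": 0.5},
}
rows = flatten_stats(toy_stats)
for row in rows:
    print(row)
```

The resulting list can be passed to `pd.DataFrame(rows)` to get a three-column table of feature, metric, and value.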

#### **Observations From QLoRA Fine-Tuning** 🔎
- Even when we fine-tune the model on just 100 examples from our dataset (which takes only about 4 minutes), it significantly improves the model on our task 🔥
- The answers are not perfect with just 100 examples, but if we inspect the *logic* in the responses, we can see that it is roughly 95% of the way there. This is significantly better than before: there is no repetition, and the actual code aspects of the answers are all correct.
- Partial errors, such as `sierp` instead of `array`, indicate that we need to train on more data for the model to better learn how to follow instructions and avoid these kinds of mistakes.

If you're looking for a managed solution that handles the hassle of figuring out the right compute for your fine-tuning tasks, ensures that they always succeed without CPU or GPU out-of-memory errors, and lets you rapidly deploy the resulting models for fast real-time inference, check out [Predibase](https://www.predibase.com/).

In [None]:
# !ludwig upload hf_hub --repo_id arnavgrg/ludwig-webinar --model_path /content/results/api_experiment_run_3