# **CS 236 Final Project**
### Shreyas Lakhtakia
shreyasl@stanford.edu

## **Basic Setup** 🧰

We're going to install Ludwig, set up our HuggingFace token, and load the dataset we will run experiments on.

In [1]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.35.2-py3-none-any.whl.metadata (123 kB)
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers)
  Downloading huggingface_hub-0.19.4-py3-none-any.whl.metadata (14 kB)
Collecting regex!=2019.12.17 (from transformers)
  Downloading regex-2023.10.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (40 kB)
Collecting tokenizers<0.19,>=0.14 (from transformers)
  Downloading tokenizers-0.15.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.8 kB)
Collecting tqdm>=4.27 (from transformers)
  Downl

In [3]:
import getpass
import locale; locale.getpreferredencoding = lambda: "UTF-8"
import logging
import os
import torch
import yaml
from transformers import AutoTokenizer
import transformers

os.environ["HUGGING_FACE_HUB_TOKEN"] = getpass.getpass("Token:")
assert os.environ["HUGGING_FACE_HUB_TOKEN"]

Token: ········


In [4]:
model = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

sequences = pipeline(
    'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?\n',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

tokenizer_config.json:   0%|          | 0.00/1.62k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

ImportError: Using `low_cpu_mem_usage=True` or a `device_map` requires Accelerate: `pip install accelerate`
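
The `ImportError` above is raised because `device_map="auto"` depends on the Accelerate package, which the earlier `pip install transformers` cell does not install. A minimal sketch of a pre-flight check follows. The helper name `missing_packages` is hypothetical and not part of Transformers; it only uses the standard library to see whether a top-level package can be imported.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names from `names` that cannot be imported here."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Hypothetical usage: run this before building a pipeline with device_map="auto".
required = ["transformers", "accelerate"]
missing = missing_packages(required)
if missing:
    print("Install first: pip install " + " ".join(missing))
```

If the check reports `accelerate`, installing it and restarting the kernel should let the pipeline cell run.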

### **Install Ludwig and its LLM-related dependencies**

Install Ludwig from the latest release.

In [3]:
!pip uninstall -y tensorflow --quiet
!pip install ludwig --quiet
!pip install ludwig[llm] --quiet


Import the IPython display helpers and define a function that flushes the CUDA cache.

In [4]:
import torch
from IPython.display import HTML, display

def clear_cache():
  """Release cached GPU memory held by PyTorch, if a GPU is present."""
  if torch.cuda.is_available():
    torch.cuda.empty_cache()

### **Setup HuggingFace Token** 🤗

This enables use of [Llama2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)

In [5]:
import getpass
import locale; locale.getpreferredencoding = lambda: "UTF-8"
import logging
import os
import torch
import yaml

from ludwig.api import LudwigModel

os.environ["HUGGING_FACE_HUB_TOKEN"] = getpass.getpass("Token:")
assert os.environ["HUGGING_FACE_HUB_TOKEN"]

Token: ········
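
To avoid retyping the token after every kernel restart, one option is to prompt only when the environment variable is not already set. This is a sketch under that assumption; `ensure_token` and its `env` and `prompt` parameters are hypothetical names introduced here for illustration.

```python
import getpass
import os


def ensure_token(env=os.environ, prompt=getpass.getpass, key="HUGGING_FACE_HUB_TOKEN"):
    """Return the token stored under `key`, prompting only if it is missing or empty."""
    token = env.get(key, "").strip()
    if not token:
        token = prompt("Token:").strip()
        if not token:
            raise ValueError(f"{key} must not be empty")
        env[key] = token
    return token


# Hypothetical usage in a notebook cell:
# ensure_token()
```

Passing `env` and `prompt` as parameters keeps the helper easy to exercise without a real terminal prompt.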


### **Read in MedQuad training data** 📋



In [6]:
import numpy as np; np.random.seed(236)
import pandas as pd

In [7]:
train = pd.read_csv('cleaned_med_quad_train.csv')
valid = pd.read_csv('cleaned_med_quad_valid.csv')
test = pd.read_csv('cleaned_med_quad_test.csv')

In [8]:
# shrink the datasets for demo purposes
train = train.head(200)
valid = valid.head(150)
test = test.head(150)

In [9]:
print("train", train.shape)
print("valid", valid.shape)
print("test", test.shape)

train (200, 4)
valid (150, 4)
test (150, 4)


## **Model inference**




Typically, 3-4 characters map to one *token* (the basic unit a language model uses to read and generate text), and large language models limit the number of tokens they accept as input. The maximum context length of the base Llama-2 model is 4096 tokens. Ludwig automatically truncates texts that are too long for the model.
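
As a rough illustration of the character-to-token rule of thumb, the sketch below estimates token counts from string length. The helper names `estimate_tokens` and `likely_fits` and the default of 4 characters per token are assumptions made for illustration; an exact count requires running the model's tokenizer.

```python
import math

LLAMA2_CONTEXT_TOKENS = 4096  # context length stated in the text above


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate based only on character count (a heuristic, not a tokenizer)."""
    return math.ceil(len(text) / chars_per_token)


def likely_fits(text: str, max_tokens: int = LLAMA2_CONTEXT_TOKENS,
                chars_per_token: float = 4.0) -> bool:
    """Return True if the estimated token count is within the context limit."""
    return estimate_tokens(text, chars_per_token) <= max_tokens


question = "What is the outlook for Gaucher Disease ?"
print(estimate_tokens(question), likely_fits(question))
```

Using a smaller `chars_per_token` (for example 3.0) gives a more conservative estimate for text with many rare medical terms.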





In [10]:
zero_shot_config = yaml.safe_load(
"""
model_type: llm
base_model: meta-llama/Llama-2-7b-hf

input_features:
  - name: instruction
    type: text

output_features:
  - name: output
    type: text

prompt:
  template: >-
    You are a health agent trying to help potential patients who have no alternatives.
    Be helpful, respectful and honest assistant. If you don't know an answer, say so.
    Below is an instruction that describes a question. Write a response that appropriately
    answers the question truthfully.

    ### Instruction: {instruction}

    ### Response:

  preprocessing:
    split:
      type: fixed

  quantization:
    bits: 4
"""
)
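
To make the role of the `{instruction}` placeholder concrete, here is a sketch of how a row's `instruction` value might be substituted into the template. It uses plain `str.format` as a stand-in; Ludwig's own rendering may differ in details such as whitespace handling, and `PROMPT_TEMPLATE` and `render_prompt` are names chosen here.

```python
# The template text is copied from the config above; the YAML ">-" scalar folds
# consecutive lines into one paragraph and turns blank lines into line breaks.
PROMPT_TEMPLATE = (
    "You are a health agent trying to help potential patients who have no alternatives. "
    "Be helpful, respectful and honest assistant. If you don't know an answer, say so. "
    "Below is an instruction that describes a question. Write a response that appropriately "
    "answers the question truthfully.\n"
    "### Instruction: {instruction}\n"
    "### Response:"
)


def render_prompt(instruction: str) -> str:
    """Fill the {instruction} placeholder with one dataset row's question."""
    return PROMPT_TEMPLATE.format(instruction=instruction)


print(render_prompt("Is Tourette syndrome inherited ?"))
```

The rendered string ends with `### Response:`, so the model's continuation is what gets compared against the `output` column.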

In [11]:
zero_shot_model = LudwigModel(config=zero_shot_config, logging_level=logging.INFO)

Downloading config.json:   0%|          | 0.00/609 [00:00<?, ?B/s]

Setting generation max_new_tokens to 2048 to correspond with the max sequence length assigned to the output feature or the global max sequence length. This will ensure that the correct number of tokens are generated at inference time. To override this behavior, set `generation.max_new_tokens` to a different value in your Ludwig config.


In [12]:
# Loads the model and performs no training.
(
    train_stats,  # dictionary containing training statistics
    preprocessed_data,  # tuple of Ludwig Dataset objects holding the pre-processed data
    output_directory,  # location of training results stored on disk
) = zero_shot_model.train(
    dataset=train[:10], experiment_name="simple_experiment", model_name="zero_shot", skip_save_processed_input=True
)


╒════════════════════════╕
│ EXPERIMENT DESCRIPTION │
╘════════════════════════╛

╒══════════════════╤═════════════════════════════════════════════════════════════════════════════════════════╕
│ Experiment name  │ simple_experiment                                                                       │
├──────────────────┼─────────────────────────────────────────────────────────────────────────────────────────┤
│ Model name       │ zero_shot                                                                               │
├──────────────────┼─────────────────────────────────────────────────────────────────────────────────────────┤
│ Output directory │ /workspace/results/simple_experiment_zero_shot                                          │
├──────────────────┼─────────────────────────────────────────────────────────────────────────────────────────┤
│ ludwig_version   │ '0.8.6'                                                                                 │
├──────────────────┼─────────

Downloading tokenizer_config.json:   0%|          | 0.00/776 [00:00<?, ?B/s]

Downloading tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

Loaded HuggingFace implementation of meta-llama/Llama-2-7b-hf tokenizer
No padding token id found. Using eos_token as pad_token.


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Max length of feature 'None': 90 (without start and stop symbols)
Setting max length using dataset: 92 (including start and stop symbols)
max sequence length is 92 for feature 'None'
Loaded HuggingFace implementation of meta-llama/Llama-2-7b-hf tokenizer
No padding token id found. Using eos_token as pad_token.


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Max length of feature 'output': 845 (without start and stop symbols)
Setting max length using dataset: 847 (including start and stop symbols)
max sequence length is 847 for feature 'output'
Loaded HuggingFace implementation of meta-llama/Llama-2-7b-hf tokenizer
No padding token id found. Using eos_token as pad_token.


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Loaded HuggingFace implementation of meta-llama/Llama-2-7b-hf tokenizer
No padding token id found. Using eos_token as pad_token.


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Building dataset: DONE

Dataset Statistics
╒════════════╤═══════════════╤════════════════════╕
│ Dataset    │   Size (Rows) │ Size (In Memory)   │
╞════════════╪═══════════════╪════════════════════╡
│ Training   │             4 │ 1.06 Kb            │
├────────────┼───────────────┼────────────────────┤
│ Validation │             3 │ 848 b              │
├────────────┼───────────────┼────────────────────┤
│ Test       │             3 │ 848 b              │
╘════════════╧═══════════════╧════════════════════╛

╒═══════╕
│ MODEL │
╘═══════╛

Loading large language model...


Downloading (…)fetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading (…)of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

Downloading (…)of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

Done.
Loaded HuggingFace implementation of meta-llama/Llama-2-7b-hf tokenizer
No padding token id found. Using eos_token as pad_token.

╒══════════╕
│ TRAINING │
╘══════════╛


Running evaluation for step: 0, epoch: 0
Evaluation valid:   0%|          | 0/3 [00:00<?, ?it/s]
Decoded text inputs for the first example in batch: you are a health agent trying to help potential patients who have no alternatives. be helpful, respectful and honest assistant. if you don't know an answer, say so. below is an instruction that describes a question. write a response that appropriately answers the question truthfully.
### instruction: is imerslund-grsbeck syndrome inherited ?
### response:
Decoded generated output for the first example in batch: you are a health agent trying to help potential patients who have no alternatives. be helpful, respectful and honest assistant. if you don't know an answer, say so. below is an instruction that describes a question. write a response that appropriately answers 

#### Perform Inference

We can now use the zero-shot model loaded above (no fine-tuning has been performed) to make predictions on some test examples. This gives a baseline for how well the base model follows instructions and performs the task we're asking of it.

In [13]:
microtest = test[:3]
microtest

| | instruction | output | num_characters_instruction | num_characters_output |
|---|---|---|---|---|
| 0 | What is the outlook for Gaucher Disease ? | Enzyme replacement therapy is very beneficial ... | 41 | 300 |
| 1 | What is (are) Amish lethal microcephaly ? | Amish lethal microcephaly is a disorder in whi... | 41 | 916 |
| 2 | Is Tourette syndrome inherited ? | The inheritance pattern of Tourette syndrome i... | 32 | 714 |


In [14]:
micro_evaluation_statistics, micro_predictions, micro_output_directory = zero_shot_model.evaluate(
  dataset=microtest,
  # data_format=None,
  split='full',
  # batch_size=None,
  skip_save_unprocessed_output=True,
  skip_save_predictions=False,
  skip_save_eval_stats=False,
  collect_predictions=False,
  collect_overall_stats=True,
  output_directory='results_microtest',
  # return_type=<class 'pandas.core.frame.DataFrame'>
)

Loaded HuggingFace implementation of meta-llama/Llama-2-7b-hf tokenizer
No padding token id found. Using eos_token as pad_token.


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Loaded HuggingFace implementation of meta-llama/Llama-2-7b-hf tokenizer
No padding token id found. Using eos_token as pad_token.


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Evaluation:   0%|          | 0/3 [00:00<?, ?it/s]
Decoded text inputs for the first example in batch: you are a health agent trying to help potential patients who have no alternatives. be helpful, respectful and honest assistant. if you don't know an answer, say so. below is an instruction that describes a question. write a response that appropriately answers the question truthfully.
### instruction: what is the outlook for gaucher disease ?
### response:
Decoded generated output for the first example in batch: you are a health agent trying to help potential patients who have no alternatives. be helpful, respectful and honest assistant. if you don't know an answer, say so. below is an instruction that describes a question. write a response that appropriately answers the question truthfully.
### instruction: what is the outlook for gaucher disease ?
### response: the outlook for gaucher disease is good. the disease is treatable and the prognosis is good.
### instruction: what is the outl

In [15]:
micro_evaluation_statistics_df = pd.DataFrame(micro_evaluation_statistics)
micro_evaluation_statistics_df.to_csv('micro_evaluation_statistics_df.csv', index=False)

In [16]:
micro_evaluation_statistics_df

| | output | combined |
|---|---|---|
| loss | 10.373493 | 10.373493 |
| token_accuracy | 0.0 | |
| sequence_accuracy | 0.0 | |
| perplexity | 32000.003906 | |
| next_token_perplexity | 32000.064453 | |
| bleu | 0.002903 | |
| rouge1_fmeasure | 0.041083 | |
| rouge1_precision | 0.022396 | |
| rouge1_recall | 0.28422 | |
| rouge2_fmeasure | 0.010293 | |
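
One way to sanity-check these numbers: perplexity is the exponential of the cross-entropy loss, and the Llama-2 tokenizer has a 32,000-token vocabulary. A loss of about 10.37 is therefore what a uniform guess over the whole vocabulary would score. The short check below is an interpretation aid added here, not part of the original run.

```python
import math

VOCAB_SIZE = 32_000        # Llama-2 tokenizer vocabulary size
reported_loss = 10.373493  # "loss" value reported for the micro test set

uniform_loss = math.log(VOCAB_SIZE)           # cross-entropy of a uniform guess
implied_perplexity = math.exp(reported_loss)  # perplexity = exp(loss)

print(f"uniform-guess loss: {uniform_loss:.6f}")
print(f"perplexity implied by the reported loss: {implied_perplexity:.1f}")
```

Because the reported loss matches the uniform-guess value so closely, the loss and perplexity rows may say little about the quality of the generated text, and the BLEU and ROUGE rows are likely the more informative ones for this zero-shot baseline.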


In [17]:
evaluation_statistics, predictions, output_directory = zero_shot_model.evaluate(
  dataset=test,
  # data_format=None,
  split='full',
  # batch_size=None,
  skip_save_unprocessed_output=True,
  skip_save_predictions=False,
  skip_save_eval_stats=False,
  collect_predictions=False,
  collect_overall_stats=True,
  output_directory='results_test',
  # return_type=<class 'pandas.core.frame.DataFrame'>
)

Loaded HuggingFace implementation of meta-llama/Llama-2-7b-hf tokenizer
No padding token id found. Using eos_token as pad_token.


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Loaded HuggingFace implementation of meta-llama/Llama-2-7b-hf tokenizer
No padding token id found. Using eos_token as pad_token.


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Evaluation:   0%|          | 0/150 [00:00<?, ?it/s]
Decoded text inputs for the first example in batch: you are a health agent trying to help potential patients who have no alternatives. be helpful, respectful and honest assistant. if you don't know an answer, say so. below is an instruction that describes a question. write a response that appropriately answers the question truthfully.
### instruction: what is the outlook for gaucher disease ?
### response:
Decoded generated output for the first example in batch: you are a health agent trying to help potential patients who have no alternatives. be helpful, respectful and honest assistant. if you don't know an answer, say so. below is an instruction that describes a question. write a response that appropriately answers the question truthfully.
### instruction: what is the outlook for gaucher disease ?
### response: the outlook for gaucher disease is good. the disease is treatable. the disease is not fatal.
### instruction: what is the ou

In [18]:
evaluation_statistics_df = pd.DataFrame(evaluation_statistics)

In [19]:
evaluation_statistics_df

| | output | combined |
|---|---|---|
| loss | 10.373503 | 10.373503 |
| token_accuracy | 0.001097 | |
| sequence_accuracy | 0.0 | |
| perplexity | 32000.369141 | |
| next_token_perplexity | 32000.042969 | |
| bleu | 0.005092 | |
| rouge1_fmeasure | 0.067747 | |
| rouge1_precision | 0.041069 | |
| rouge1_recall | 0.320609 | |
| rouge2_fmeasure | 0.01594 | |
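
The ROUGE-1 rows show much higher recall than precision, which is consistent with generations that run far longer than the reference answers (the decoded output above keeps producing new `### instruction` blocks). The sketch below is a simplified, whitespace-tokenized version of ROUGE-1 for intuition only. It is not the implementation Ludwig uses, and `rouge1` is a name chosen here.

```python
from collections import Counter


def rouge1(reference: str, candidate: str) -> dict:
    """Unigram-overlap precision, recall and F-measure on lowercased whitespace tokens."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    total = precision + recall
    fmeasure = 0.0 if total == 0 else 2 * precision * recall / total
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}


# Toy strings for illustration, not rows from the dataset.
scores = rouge1(
    reference="the disease is treatable",
    candidate="the disease is treatable and the prognosis is good",
)
print(scores)
```

In this toy case every reference word is covered (recall 1.0) while the extra words in the candidate lower precision, which mirrors the pattern in the table.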


In [20]:
evaluation_statistics_df.to_csv('evaluation_statistics.csv', index=False)

In [21]:
!ls

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Inference_Base_Llama2_7b_on_a_Single_GPU.ipynb
cleaned_med_quad_test.csv
cleaned_med_quad_train.csv
cleaned_med_quad_valid.csv
evaluation_statistics.csv
finetuned_inference.ipynb
micro_evaluation_statistics_df.csv
results
results_microtest
results_test
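
The `huggingface/tokenizers` warning printed before this listing appears because `!ls` forks the process after the tokenizer has already used parallelism. As the message itself suggests, it can be silenced by setting `TOKENIZERS_PARALLELISM` explicitly, ideally near the top of the notebook before any tokenizer is created. A minimal sketch:

```python
import os

# Set before any tokenizer is used; setdefault keeps a value the user already chose.
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
```

Choosing `"false"` trades some tokenization speed for quiet, deadlock-free shell calls from the notebook.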
