# Medical Text Generation with Llama


Now we will explore fine-tuning Llama! The goal is to compare this model to GPT-2 and the paid fine-tuned GPT-3.5-Turbo. 

Aid from:\
https://huggingface.co/meta-llama/Llama-3.2-3B\
https://github.com/meta-llama/llama\
https://pytorch.org/torchtune/0.3/tutorials/lora_finetune.html\
https://github.com/microsoft/LoRA\
https://huggingface.co/docs/transformers/en/model_doc/llama3

In [1]:
! nvidia-smi

Sun Nov 24 17:32:48 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 551.95                 Driver Version: 551.95         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 4070 ...  WDDM  |   00000000:01:00.0 Off |                  N/A |
| N/A   56C    P8              2W /   50W |       8MiB /   8188MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [2]:
# gather all imports

import gc
import json
import os
import platform
import subprocess

import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from safetensors.torch import load_file
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig



As before, load in the data.

In [3]:
def load_and_prepare(file_path):
    """
    Helper function to load the data and reshape it into a list of
    {"question", "response"} dicts (an empty list is returned on error).

    @PARAMS:
        - file_path -> the file to process
    """
    try:
        with open(file_path, "r", encoding="utf-8") as f:
            data = json.load(f)
        
        # format data to be just question answer pairs
        formatted_data = []
        for entry in data:
            formatted_data.append({
                "question": entry["Question"],
                "response": entry["Answer"]
            })
        
        print(f"Loaded {len(formatted_data)} Q&A pairs from {file_path}!")
        return formatted_data
        
    except Exception as e:
        print(f"Error loading in file...\n{e}")
        return []

In [4]:
# load in formatted data
## TRAIN ##
train_data = load_and_prepare("processed_data/train.json")

## VAL ##
val_data = load_and_prepare("processed_data/validation.json")

## TEST ##
test_data = load_and_prepare("processed_data/test.json")

# print out one value of each to make sure it is loaded correctly
print(train_data[0])
print(val_data[0])
print(test_data[0])

Loaded 18749 Q&A pairs from processed_data/train.json!
Loaded 2344 Q&A pairs from processed_data/validation.json!
Loaded 2344 Q&A pairs from processed_data/test.json!
{'question': 'will eating late evening meals increase my cholesterol?', 'response': 'no. it is what you are eating (as well as your genetics) not when you eat it. it depends on the kinds of foods that you eat. make sure that you are eating healthy foods in order to not gain great amount of cholesterol. you have to always watch what you eat in order to have a healthy skin and body. you may check out www. clearclinic. com for great ideas to achieve an acne free skin.'}
{'question': 'who is affected by arthritis?', 'response': 'arthritis sufferers include men and women children and adults. approximately 350 million people worldwide have arthritis. nearly 40 million people in the united states are affected by arthritis including over a quarter million children! more than 27 million americans have osteoarthritis. approximately

In [5]:
class LLAMA:
    """Baseline LLAMA class to handle the 3.2-3B model."""
    def __init__(self, model_id="meta-llama/Llama-3.2-3B"):
        """
        Initializer function to setup LLAMA!

        @PARAMS:
            - model_id -> the specific llama model to use
        """
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        
        # load tokenizer and model
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.bfloat16,
            device_map="cuda:0"
        )
        self.model.eval()
        
    def generate_response(self, input_text, max_length=200, num_sequences=1, top_k=50, top_p=0.95, temperature=0.8):
        """
        Generate text based on the user prompt.
        
        @PARAMS:
            - input_text -> user query
            - max_length -> total token budget (prompt tokens + generated tokens)
            - [*]        -> sampling parameters passed on to `generate`
        """
        # format the prompt
        formatted_prompt = (
            "Instructions: Provide a clear and accurate answer to the following question.\n"
            f"Question: {input_text}\n"
            "Answer:"
        )
        
        # tokenize input and move it to the same device as the model
        input_ids = self.tokenizer.encode(
            formatted_prompt,
            return_tensors="pt"
        ).to(self.model.device)
        
        with torch.no_grad(): 
            outputs = self.model.generate(
                input_ids,
                max_length=max_length,
                num_return_sequences=num_sequences,
                top_k=top_k,
                top_p=top_p,
                do_sample=True,
                temperature=temperature,
                pad_token_id=self.tokenizer.eos_token_id,
                repetition_penalty=1.2,           
                no_repeat_ngram_size=3,
                early_stopping=True,
                min_length=30,
            )
        
        # decode and return results
        generated_texts = [
            self.tokenizer.decode(output, skip_special_tokens=True)
                .replace(formatted_prompt, "")    # Remove the prompt
                .strip()                         # Remove leading/trailing whitespace
            for output in outputs
        ]
        
        return generated_texts[0] if num_sequences == 1 else generated_texts    
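
The prompt template above is written out by hand again when the training texts are built later in the notebook. A small shared helper is one way to keep the two copies identical. The sketch below is an illustration only: the names `INSTRUCTION` and `build_prompt` are hypothetical and not part of the original notebook.

```python
# Hypothetical helper: one place that defines the prompt template.
INSTRUCTION = "Instructions: Provide a clear and accurate answer to the following question."


def build_prompt(question, answer=None):
    """Return the inference prompt, or the full training text if an answer is given."""
    prompt = f"{INSTRUCTION}\nQuestion: {question}\nAnswer:"
    if answer is None:
        return prompt              # inference: the model continues after "Answer:"
    return f"{prompt} {answer}"    # training: prompt followed by the reference answer


print(build_prompt("what is delta hepatitis?"))
```

Building the prompt in one place also helps when stripping it from the decoded output, since the string passed to `replace` is then the same string that was tokenized.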

In [12]:
# Initialize the generator
llama = LLAMA()

# generate testing examples:
questions = [point['question'] for point in test_data[:5]]
responses = [point['response'] for point in test_data[:5]]


print("\n\nLLAMA Baseline Results:")
for question, response in zip(questions, responses):
    print(f"Question: {question}")
    print(f"Response: {llama.generate_response(question)}")
    print(f"Expected Response: {response}\n\n")

Loading checkpoint shards: 100%|██████████| 2/2 [00:12<00:00,  6.48s/it]




LLAMA Baseline Results:
Question: can i be pregnant if i had unprotected sex the 4th day of being on the depo?
Response: It is possible, but not likely. The Depo-Provera shot (medroxyprogesterone) prevents pregnancy for three months; therefore it would need to have been given before you were exposed to sperm in order for there to be no risk of becoming pregnant within that time frame. If your first period after receiving this injection was not expected, then the most recent one may still be considered fertile due to its short half-life or length between doses so we cannot say with certainty whether or not an unwanted result has occurred just yet!
Expected Response: yes you can. the depo will take about a month or two to take full effect. even then it is not 100% effective.


Question: what is delta hepatitis?
Response: The infection caused by HBV (hepatitis B virus) can be asymptomatic or cause acute liver failure. If not treated properly, it may lead to chronic hepatitis which will 

Looking at the outputs, we definitely have room for improvement! Unlike with GPT-2, I will fine-tune using LoRA (Low-Rank Adaptation), which trains only a small set of added low-rank weights and is therefore much faster and more memory-efficient than full fine-tuning. I will compare against the GPT-2 results to see if the model improves by a similar amount!
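
To make the efficiency claim concrete, the arithmetic below counts the weights a rank-8 LoRA adapter adds when it targets `q_proj` and `v_proj`, as in the configuration used in the next cell. For a linear layer mapping `d_in` to `d_out`, LoRA learns two small matrices of shapes `r x d_in` and `d_out x r`. The layer sizes are assumptions based on my reading of the Llama-3.2-3B config (28 layers, hidden size 3072, 8 key/value heads of dimension 128) and should be checked against the model's `config.json`.

```python
# Assumed Llama-3.2-3B dimensions (verify against the model's config.json).
NUM_LAYERS = 28
HIDDEN_SIZE = 3072            # input width of q_proj and v_proj
Q_OUT = 3072                  # q_proj output width (24 heads x 128)
V_OUT = 8 * 128               # v_proj output width (8 key/value heads x 128)
RANK = 8                      # r in LoraConfig


def lora_params(d_in, d_out, r):
    """Weights added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


per_layer = lora_params(HIDDEN_SIZE, Q_OUT, RANK) + lora_params(HIDDEN_SIZE, V_OUT, RANK)
total = NUM_LAYERS * per_layer
print(f"{total:,} trainable LoRA parameters")
```

Under these assumptions the total is 2,293,760, which matches the `trainable params` line that `print_trainable_parameters()` reports in the training log, roughly 0.07% of the model's weights.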

In [10]:
class LLAMA_FineTuned:
    """Class for the fine-tuned version of LLAMA 3.2-3B"""
    def __init__(self, model_id="meta-llama/Llama-3.2-3B"):
        """
        Initializer function for the fine-tuned version of LLAMA.
        """
        # make sure to clear all GPU memory (collect garbage first so freed tensors can be released)
        gc.collect()
        torch.cuda.empty_cache()

        # save the model id        
        self.model_id = model_id
        
        # load tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        self.tokenizer.pad_token = self.tokenizer.eos_token
        
        # set up the 4-bit (NF4) quantization
        quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_use_double_quant=True,
        )
        
        # load the model with memory optimization
        print("Loading base model...")
        self.model = AutoModelForCausalLM.from_pretrained(
            model_id,
            quantization_config=quantization_config,
            torch_dtype=torch.float16,
            device_map='auto',
            max_memory={0: "10GB"}
        )
        
        # create the configuration for the LORA fine-tuning
        config = LoraConfig(
            r=8,
            lora_alpha=16,
            target_modules=["q_proj", "v_proj"],
            lora_dropout=0.05,
            bias="none",
            task_type="CAUSAL_LM"
        )
        
        # prepare everything 
        print("Applying LoRA...")
        self.model = prepare_model_for_kbit_training(self.model)
        self.model = get_peft_model(self.model, config)
        
        # get some insights into the trainable parameters for this model
        self.model.print_trainable_parameters()
        
    def train(self, train_data, output_dir="llama-lora-medical", num_epochs=1, batch_size=1, max_length=200):
        """
        Main training function for the model with memory optimization.

        @PARAMS:
            - train_data -> all training conversations
            - output_dir -> where the model will be saved
            - [2:*]       -> hyperparameters for tuning
        """
        print("Starting training preparation...")
        
        # process in chunks to save time
        def prepare_batch(batch_items):
            texts = [
                # using a basic instruction to see how well it does
                f"Instructions: Provide a clear and accurate answer to the following question.\nQuestion: {item['question']}\nAnswer: {item['response']}"
                for item in batch_items
            ]
            
            # create the encodings
            encodings = self.tokenizer(
                texts,
                truncation=True,
                max_length=max_length,
                padding=True,
                return_tensors="pt"
            )
            
            return {
                'input_ids': encodings['input_ids'].cuda(),
                'attention_mask': encodings['attention_mask'].cuda(),
                'labels': encodings['input_ids'].cuda()
            }
        
        # setup the optimizer
        optimizer = torch.optim.AdamW(
            self.model.parameters(),
            lr=1e-4,
            weight_decay=0.01,
            eps=1e-7
        )
        
        print("Starting training loop...")
        self.model.train()
        
        # now loop through each epoch - one full pass through the data
        for epoch in range(num_epochs):
            total_loss = 0
            progress_bar = tqdm(range(0, len(train_data), batch_size))
            
            for i in progress_bar:
                # clear cache so GPU isn't overloaded
                torch.cuda.empty_cache()
                
                # prepare the batch processing
                batch_end = min(i + batch_size, len(train_data))
                batch_data = train_data[i:batch_end]
                batch = prepare_batch(batch_data)
                
                # forward pass
                try:
                    outputs = self.model(
                        input_ids=batch['input_ids'],
                        attention_mask=batch['attention_mask'],
                        labels=batch['labels']
                    )
                    
                    # update values
                    loss = outputs.loss
                    total_loss += loss.item()
                    
                    # backward pass
                    loss.backward()
                    optimizer.step()
                    optimizer.zero_grad()
                    
                    # make sure to update the progress bar
                    progress_bar.set_postfix({
                        'loss': loss.item(),
                        'gpu_mem': f"{torch.cuda.memory_allocated()/1024**2:.0f}MB"
                    })
                    
                except RuntimeError as e:
                    # if we are out of memory, drop this batch, clear the cache and continue; otherwise re-raise
                    if "out of memory" in str(e):
                        print(f"\nOOM error at batch {i}. Current memory: {torch.cuda.memory_allocated()/1024**2:.0f}MB")
                        # discard any partial gradients from the failed step
                        optimizer.zero_grad()
                        if torch.cuda.memory_allocated() > 0:
                            torch.cuda.empty_cache()
                            print(f"Memory after cache clear: {torch.cuda.memory_allocated()/1024**2:.0f}MB")
                        continue
                    else:
                        raise
                
                # cleanup!!!
                del outputs
                del loss
                torch.cuda.empty_cache()
                
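            # calculate the average loss per batch (the last batch may be smaller, so count batches exactly)
            num_batches = len(range(0, len(train_data), batch_size))
            avg_loss = total_loss / num_batches
            print(f'\nEpoch {epoch+1} average loss: {avg_loss:.4f}')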
            
            # save a checkpoint in case of crash!
            os.makedirs(output_dir, exist_ok=True)
            self.model.save_pretrained(f"{output_dir}/checkpoint-{epoch+1}")
        
        print("Training complete!")
        self.model.save_pretrained(output_dir)

    def load_finetuned(self, model_path):
        """
        Function to load in a fine-tuned model instead of re-training.

        @PARAMS:
            - model_path -> where the model config values are
        """
        print(f"Loading fine-tuned model from {model_path}...")
        
        try:
            # first set up the same 4-bit quantization used during training
            print("Loading base model...")
            quantization_config = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_quant_type="nf4",
                bnb_4bit_compute_dtype=torch.float16,
                bnb_4bit_use_double_quant=True,
            )
            
            # now load the quantized base model
            self.model = AutoModelForCausalLM.from_pretrained(
                self.model_id,
                quantization_config=quantization_config,
                torch_dtype=torch.float16,
                device_map='auto',
                max_memory={0: "10GB"}
            )
            
            # prepare for LORA
            self.model = prepare_model_for_kbit_training(self.model)
            print("Loading LoRA config...")
            lora_config = LoraConfig.from_pretrained(model_path)
            
            # apply LoRA
            self.model = get_peft_model(self.model, lora_config)
            
            # load in the saved adapter weights
            print("Loading weights and fixing state dict...")
            state_dict = load_file(f"{model_path}/adapter_model.safetensors")
            
            # saved keys omit the adapter name, so re-insert "default" (lora_A.weight -> lora_A.default.weight)
            new_state_dict = {}
            for key, value in state_dict.items():
                if "lora_A.weight" in key or "lora_B.weight" in key:
                    parts = key.split(".")
                    new_key = f"{'.'.join(parts[:-1])}.default.{parts[-1]}"
                    new_state_dict[new_key] = value
                else:
                    new_state_dict[key] = value
            
            # load the dict
            print("Loading adapted weights...")
            self.model.load_state_dict(new_state_dict, strict=False)
            
            # set to eval 
            self.model.eval()
            print("Model loaded successfully!")
            return self
            
        except Exception as e:
            print(f"Error loading model: {str(e)}")
            raise

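    def generate_response(self, question, max_length=200, temperature=0.7):
        """
        Generate the fine-tuned model's answer to a single question.

        @PARAMS:
            - question    -> user query
            - max_length  -> total token budget (prompt tokens + generated tokens)
            - temperature -> sampling temperature
        """
        self.model.eval()
        # note: this prompt omits the "Instructions:" line used in the training texts
        prompt = f"Question: {question}\nAnswer:"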
        
        inputs = self.tokenizer(
            prompt,
            return_tensors="pt",
            max_length=200,
            truncation=True
        ).to('cuda')

        with torch.no_grad(), torch.amp.autocast(device_type='cuda', dtype=torch.float16):
            outputs = self.model.generate(
                **inputs,
                max_length=max_length,
                min_length=20,
                num_return_sequences=1,
                temperature=temperature,
                top_p=0.85,
                top_k=40,
                do_sample=True,
                pad_token_id=self.tokenizer.eos_token_id,
                repetition_penalty=1.1,
                length_penalty=1.0,
                early_stopping=True,
                use_cache=True,
                num_beams=1
            )

        response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        return response.replace(prompt, "").strip()
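
The `load_finetuned` method above rebuilds the adapter by hand and renames the saved keys. PEFT also provides `PeftModel.from_pretrained`, which reads the adapter config and weights from a directory written by `save_pretrained` and attaches them to a base model. The function below is a minimal sketch of that alternative, not the method used in this notebook. `load_lora_adapter` is a hypothetical name, and the imports sit inside the function only so that the definition can be read without the GPU stack installed.

```python
def load_lora_adapter(model_id, adapter_dir):
    """Sketch: load a 4-bit base model and attach a saved LoRA adapter for inference."""
    # Local imports: these packages are only needed when the function is called.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # Same quantization settings as the notebook's training setup.
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
    )
    base_model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quantization_config,
        device_map="auto",
    )

    # Reads the adapter config and adapter weights stored in adapter_dir.
    model = PeftModel.from_pretrained(base_model, adapter_dir)
    model.eval()
    return model
```

A call such as `load_lora_adapter("meta-llama/Llama-3.2-3B", "llama-lora-medical")` could then stand in for the manual state-dict handling, assuming the directory was produced by `save_pretrained` as in `train`.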

In [8]:
# prevent the computer from sleeping when running!
try:
    if platform.system() == 'Windows':
        # Windows command to prevent sleep
        subprocess.Popen(['powercfg', '-change', '-standby-timeout-ac', '0'])
        subprocess.Popen(['powercfg', '-change', '-monitor-timeout-ac', '0'])
        print("Sleep prevention activated for Windows")
    elif platform.system() == 'Darwin':  # macOS
        subprocess.Popen(['caffeinate', '-i'])
        print("Sleep prevention activated for macOS")
    elif platform.system() == 'Linux':
        subprocess.Popen(['systemctl', 'mask', 'sleep.target', 'suspend.target', 
                        'hibernate.target', 'hybrid-sleep.target'])
        print("Sleep prevention activated for Linux")
except Exception as e:
    print(f"Warning: Could not prevent sleep mode: {e}")

# initialize the model
fine_tuner = LLAMA_FineTuned()

# train
fine_tuner.train(
    train_data=train_data,
    num_epochs=1,
    batch_size=8,
    max_length=200
)

Sleep prevention activated for Windows
Loading base model...


Loading checkpoint shards: 100%|██████████| 2/2 [00:06<00:00,  3.21s/it]


Applying LoRA...
trainable params: 2,293,760 || all params: 3,215,043,584 || trainable%: 0.0713
Starting training preparation...
Starting training loop...


  0%|          | 0/2344 [00:00<?, ?it/s]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
  return fn(*args, **kwargs)
  0%|          | 1/2344 [00:25<16:38:18, 25.56s/it, loss=5.05, gpu_mem=3768MB]


KeyboardInterrupt: 

I re-ran the cell above after the final model had been produced. The output shown is just the start of that re-run, ending with the keyboard interrupt.
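
One detail of the training loop above: `labels` is a plain copy of `input_ids`, and the pad token is set to the EOS token, so padded positions are scored by the loss as well. A common adjustment is to set the label at every padded position to -100, the value that the Hugging Face causal-LM loss ignores. The sketch below shows the idea on plain Python lists. `mask_padding_labels` is a hypothetical helper, and the token ids are made up for illustration.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss in transformers


def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids to labels, replacing padded positions (mask == 0) with IGNORE_INDEX."""
    return [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]


# Two made-up sequences; the second is padded with the EOS id (here 2).
input_ids = [[5, 8, 13, 2], [5, 9, 2, 2]]
attention_mask = [[1, 1, 1, 1], [1, 1, 0, 0]]

labels = mask_padding_labels(input_ids, attention_mask)
print(labels)  # [[5, 8, 13, 2], [5, 9, -100, -100]]
```

With tensors, the same effect is usually written as `labels = input_ids.clone()` followed by `labels[attention_mask == 0] = -100`.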

### Collect Data for Evaluation

Now that we have both the baseline and fine-tuned versions of LLAMA, let's gather results in a file that we can evaluate. Due to the long processing times, I am only collecting 20 responses from each model, since even that takes roughly 200 minutes to run.

In [11]:
# Initialize baseline and fine-tuned
llama = LLAMA()
fine_tuner = LLAMA_FineTuned()

# load in the trained fine-tuned model
fine_tuner.load_finetuned("llama-lora-medical")

# get the testing examples:
questions = [point['question'] for point in test_data[:20]]
responses = [point['response'] for point in test_data[:20]]


# write responses to a result file
with open('llama_results.txt', "w", encoding='utf-8') as f:
    for question,response in zip(questions, responses):
        baseline_response = llama.generate_response(question)
        finetuned_response = fine_tuner.generate_response(question)
        f.write(f"Question: {question}\n")
        f.write(f"Baseline Response: {baseline_response}\n")
        f.write(f"Fine-tuned Response: {finetuned_response}\n")
        f.write(f"\nExpected Response: {response}\n")
        # response separator
        f.write("=" * 80 + "\n")

Loading checkpoint shards: 100%|██████████| 2/2 [00:16<00:00,  8.25s/it]


Loading base model...


Loading checkpoint shards: 100%|██████████| 2/2 [00:08<00:00,  4.07s/it]


Applying LoRA...
trainable params: 2,293,760 || all params: 3,215,043,584 || trainable%: 0.0713
Loading fine-tuned model from llama-lora-medical...
Loading base model...


Loading checkpoint shards: 100%|██████████| 2/2 [00:06<00:00,  3.22s/it]


Loading LoRA config...
Loading weights and fixing state dict...
Loading adapted weights...
Model loaded successfully!
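
For the later evaluation, the results file written above has to be read back into records. Here is a minimal sketch of a parser for that layout, assuming each record ends with the 80-character `=` separator line and that the field labels appear in the order they were written. `parse_results` is a hypothetical helper, and the sample text is invented to exercise it.

```python
import re

SEPARATOR = "=" * 80

# One record: the four labelled fields in the order the notebook writes them.
RECORD_PATTERN = re.compile(
    r"Question: (?P<question>.*?)\n"
    r"Baseline Response: (?P<baseline>.*?)\n"
    r"Fine-tuned Response: (?P<finetuned>.*?)\n"
    r"\nExpected Response: (?P<expected>.*)",
    re.DOTALL,
)


def parse_results(text):
    """Split the results text on the separator line and extract the fields of each record."""
    records = []
    for chunk in text.split(SEPARATOR + "\n"):
        match = RECORD_PATTERN.search(chunk)
        if match:
            records.append({key: value.strip() for key, value in match.groupdict().items()})
    return records


# Invented sample in the same layout as llama_results.txt.
sample = (
    "Question: q1\nBaseline Response: b1\nFine-tuned Response: f1\n"
    "\nExpected Response: e1\n" + SEPARATOR + "\n"
)
print(parse_results(sample))
```

Reading the real file would look like `parse_results(open("llama_results.txt", encoding="utf-8").read())`. A response that itself contains one of the field labels would confuse this pattern, so a structured format such as JSON may be more robust.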


In [6]:
# Force GPU memory release and reset the memory statistics on every device
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    for i in range(torch.cuda.device_count()):
        torch.cuda.reset_peak_memory_stats(i)
        torch.cuda.reset_accumulated_memory_stats(i)