# Caltech: Guide to Fine-Tuning Llama Models

This notebook is based on work found here:

* https://github.com/tloen/alpaca-lora/
* https://github.com/samlhuillier/code-llama-fine-tune-notebook/tree/main Thank you Sam!

It has been modified for the Caltech class taught by Brian Ray brian@methodical.company




### 1. Getting Set Up with Modules and Data

This section covers:

1. Pip installs for packages
2. Loading Libraries needed
3. Loading Data
4. Loading a Model


In [None]:
!pip install -q peft==0.13.2 bitsandbytes==0.44.1 datasets==3.0.1 wandb==0.18.5 scipy==1.13.1 google-cloud-secret-manager

### Loading libraries


In [None]:
from datetime import datetime
import sys

import torch
from peft import (
    LoraConfig,
    get_peft_model,
    get_peft_model_state_dict,
    prepare_model_for_kbit_training,
)
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForSeq2Seq, BitsAndBytesConfig


(If you have import errors, try restarting your Jupyter kernel)


Check that CUDA is available, which it should be:

In [None]:
torch.cuda.is_available()

### Helper Functions

In [None]:
import os
import glob
from google.cloud import storage


def get_env_secret(key):
  """Load secret `key` into os.environ; return True on success, False otherwise."""
  value = None
  try:
    # Works only inside Colab, and only if the notebook was granted access
    from google.colab import userdata
    value = userdata.get(key)
  except Exception:
    pass
  if value:
    os.environ[key] = value
    return True
  print(f"{key} Token not found, looking in caltech class project")
  try:
    from google.cloud import secretmanager
    client = secretmanager.SecretManagerServiceClient()
    response = client.access_secret_version(request={"name": f"projects/240830225929/secrets/{key}/versions/1"})
    os.environ[key] = response.payload.data.decode("UTF-8")
    return True
  except Exception as e:
    print(f"Error: {e}")
    return False

def save_model(model, localdir, destination_sub_path="/"):
  model.save_pretrained(localdir)
  save_checkpoints(local_directory=localdir,
                   destination_sub_path=destination_sub_path)
  print(f"Model saved to default cloud bucket")


# Upload files to the subdirectory in the GCS bucket
def upload_files_to_gcs(bucket, subdirectory_path, local_file_path, destination_file_name):
    """Uploads a file to the GCS bucket."""
    blob = bucket.blob(subdirectory_path + destination_file_name)
    blob.upload_from_filename(local_file_path)
    print(f"File {local_file_path} uploaded to {subdirectory_path + destination_file_name}.")


def save_checkpoints(bucket_name='caltech-class',
                     local_directory = "./sql-code-llama/",
                     destination_sub_path="Lab01"):
    # Create a Cloud Storage client
    client = storage.Client()

    # Define the bucket name and the base path
    bucket = client.get_bucket(bucket_name)

    # Get the current username
    username = !gcloud config get-value account
    username = username[0]

    # Define the directory path within the bucket
    subdirectory_path = f"{username}/{destination_sub_path}/"

    # Use glob to recursively find all files in the directory and its subdirectories
    files_to_upload = glob.glob(os.path.join(local_directory, '**', '*'), recursive=True)

    # Upload each file found by glob
    for file_path in files_to_upload:
        if os.path.isfile(file_path):  # Only upload if it's a file, not a directory
            # Maintain relative paths for files in subdirectories
            relative_path = os.path.relpath(file_path, local_directory)
            destination_filename = os.path.join(relative_path)

            # Upload the file to GCS
            upload_files_to_gcs(bucket,
                                subdirectory_path,
                                file_path,
                                destination_filename)



Get environment variables

In [None]:
get_env_secret("HF_TOKEN")
is_set = get_env_secret("WANDB_API_KEY")

if is_set:
  report_to="wandb"  # the value TrainingArguments expects for Weights & Biases
  os.environ["WANDB_PROJECT"] = "lab01_caltech"
else:
  report_to="none"


### Load dataset

You may get a message that Colab does not have access to secrets.

You should grant it, and make sure your HF_TOKEN is set to the token you obtained when setting up your Hugging Face account: https://huggingface.co/settings/tokens


In [None]:
from datasets import load_dataset
dataset = load_dataset("b-mc2/sql-create-context", split="train")
# Split once so the train and eval sets are disjoint; the seed makes it reproducible
split = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = split["train"]
eval_dataset = split["test"]

The above pulls the dataset from the Hugging Face Hub and holds out 10% of it as an evaluation set to check how well the model is doing during training. If you want to load your own dataset, do this:

```
train_dataset = load_dataset('json', data_files='train_set.jsonl', split='train')
eval_dataset = load_dataset('json', data_files='validation_set.jsonl', split='train')
```
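
If you go the JSONL route, each line of the file is one JSON object. The sketch below writes and reads back a tiny file, assuming you want the same `question`, `context`, and `answer` fields as the Hub dataset. The helper names and the temporary directory are illustrative only; the sample row is the one quoted later in this notebook.

```python
import json
import os
import tempfile

# One record per line; field names mirror the b-mc2/sql-create-context dataset.
rows = [
    {
        "question": "Name the comptroller for office of prohibition",
        "context": "CREATE TABLE table_22607062_1 (comptroller VARCHAR, ticket___office VARCHAR)",
        "answer": 'SELECT comptroller FROM table_22607062_1 WHERE ticket___office = "Prohibition"',
    },
]


def write_jsonl(path, records):
    """Write each record as a single JSON object on its own line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Illustrative location; in the notebook this would simply be train_set.jsonl.
jsonl_path = os.path.join(tempfile.mkdtemp(), "train_set.jsonl")
write_jsonl(jsonl_path, rows)
loaded = read_jsonl(jsonl_path)
print(loaded[0]["question"])
```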

And if you want to view any samples in the dataset, just do something like:


In [None]:
print(train_dataset[3])

Each entry is made up of a text 'question', the SQL table 'context', and the 'answer'.

### 2. Load model
I load Code Llama from Hugging Face in int8, which is standard for LoRA:

In [None]:
base_model = "codellama/CodeLlama-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")

In [None]:
save_model(model, "Lab01A", destination_sub_path="Lab01/pretrained_model_A")

`torch_dtype=torch.float16` means the layers that are not quantized, and the computations themselves, use float16, even though the quantized weights are stored as 8-bit integers.
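
To make the "stored as int8, computed in floating point" idea concrete, here is a simplified absmax quantization sketch in plain Python. It is only an illustration under that assumption: the 8-bit kernels in bitsandbytes are more involved, and the helper names below are hypothetical.

```python
def quantize_absmax(weights):
    """Map floats to integers in [-127, 127] using a single scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats for use in higher-precision arithmetic."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)

print(q)  # small integers that fit in 8 bits
print(max(abs(a - b) for a, b in zip(weights, restored)))  # worst-case rounding error
```

The rounding error per weight is at most half the scale factor, which is the precision traded away for the smaller memory footprint.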

If you get the error "ValueError: Tokenizer class CodeLlamaTokenizer does not exist or is not currently imported.", make sure your transformers version is 4.33.0.dev0 or later and accelerate is >=0.20.3.


### 3. Running the Base Foundation Model
A good common practice is to check whether a model can already do the task at hand. Fine-tuning is something you want to avoid if you can:


In [None]:
eval_prompt = """You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.

You must output the SQL query that answers the question.
### Input:
Which Class has a Frequency MHz larger than 91.5, and a City of license of hyannis, nebraska?

### Context:
CREATE TABLE table_name_12 (class VARCHAR, frequency_mhz VARCHAR, city_of_license VARCHAR)

### Response:
"""
# {'question': 'Name the comptroller for office of prohibition', 'context': 'CREATE TABLE table_22607062_1 (comptroller VARCHAR, ticket___office VARCHAR)', 'answer': 'SELECT comptroller FROM table_22607062_1 WHERE ticket___office = "Prohibition"'}
model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))

I get the output:
```
SELECT * FROM table_name_12 WHERE class > 91.5 AND city_of_license = 'hyannis, nebraska'
```
which is clearly wrong: the question asks for just `class`, but the query selects every column and compares `class`, rather than `frequency_mhz`, against 91.5.

### 4. Tokenization
Set up some tokenization settings, like left padding, because it makes [training use less memory](https://ai.stackexchange.com/questions/41485/while-fine-tuning-a-decoder-only-llm-like-llama-on-chat-dataset-what-kind-of-pa):

In [None]:
tokenizer.add_eos_token = True
tokenizer.pad_token_id = 0
tokenizer.padding_side = "left"
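
To make the `padding_side` setting concrete, the sketch below pads a toy batch on either side. The token ids are made up and `pad_batch` is a hypothetical helper, not the tokenizer's own implementation; it only shows where the pad id 0 ends up and how the attention mask marks it.

```python
def pad_batch(sequences, pad_id=0, side="left"):
    """Pad variable-length token-id lists to the length of the longest one."""
    width = max(len(seq) for seq in sequences)
    padded, masks = [], []
    for seq in sequences:
        fill = [pad_id] * (width - len(seq))
        if side == "left":
            padded.append(fill + seq)
            masks.append([0] * len(fill) + [1] * len(seq))
        else:
            padded.append(seq + fill)
            masks.append([1] * len(seq) + [0] * len(fill))
    return padded, masks


batch = [[5, 6, 7], [8, 9]]
left_ids, left_mask = pad_batch(batch, side="left")
right_ids, right_mask = pad_batch(batch, side="right")
print(left_ids)   # [[5, 6, 7], [0, 8, 9]]
print(right_ids)  # [[5, 6, 7], [8, 9, 0]]
```

With left padding, the real tokens of every sequence end at the same position, which is also convenient when generating from a batch.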

Set up the tokenize function to make labels and input_ids the same. This is basically what [self-supervised fine-tuning](https://neptune.ai/blog/self-supervised-learning) is:

In [None]:
def tokenize(prompt):
    result = tokenizer(
        prompt,
        truncation=True,
        max_length=512,
        padding=False,
        return_tensors=None,
    )

    # "self-supervised learning" means the labels are also the inputs:
    result["labels"] = result["input_ids"].copy()

    return result
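
Because the labels are a copy of the inputs, the training signal comes from next-token prediction: Hugging Face causal LM models shift the labels internally, so the output at position i is scored against the token at position i+1. A minimal sketch of that pairing, with made-up token ids and a hypothetical helper name:

```python
def next_token_pairs(input_ids):
    """Pair each input position with the token the model should predict next."""
    labels = list(input_ids)      # labels start as a copy of the inputs
    contexts = input_ids[:-1]     # the last token has nothing left to predict
    targets = labels[1:]          # labels shifted left by one position
    return list(zip(contexts, targets))


pairs = next_token_pairs([1, 15, 27, 2])
print(pairs)  # [(1, 15), (15, 27), (27, 2)]
```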

Then convert each data_point into a prompt, using a template I found online that works quite well:

In [None]:
def generate_and_tokenize_prompt(data_point):
    full_prompt =f"""You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.

You must output the SQL query that answers the question.

### Input:
{data_point["question"]}

### Context:
{data_point["context"]}

### Response:
{data_point["answer"]}
"""
    return tokenize(full_prompt)
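
At inference time, essentially the same template is used with the response left empty, so the model completes it. The sketch below factors the template into a hypothetical `build_prompt` helper (not part of the original notebook) to show the training and inference variants side by side, using the sample row quoted earlier.

```python
TEMPLATE = """You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.

You must output the SQL query that answers the question.

### Input:
{question}

### Context:
{context}

### Response:
{answer}"""


def build_prompt(data_point, include_answer=True):
    """Render the prompt; leave the response empty when generating."""
    answer = data_point["answer"] + "\n" if include_answer else ""
    return TEMPLATE.format(
        question=data_point["question"],
        context=data_point["context"],
        answer=answer,
    )


sample = {
    "question": "Name the comptroller for office of prohibition",
    "context": "CREATE TABLE table_22607062_1 (comptroller VARCHAR, ticket___office VARCHAR)",
    "answer": 'SELECT comptroller FROM table_22607062_1 WHERE ticket___office = "Prohibition"',
}

train_prompt = build_prompt(sample)                        # ends with the answer
infer_prompt = build_prompt(sample, include_answer=False)  # ends after "### Response:"
print(infer_prompt)
```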

Reformat to prompt and tokenize each sample:

In [None]:
tokenized_train_dataset = train_dataset.map(generate_and_tokenize_prompt)
tokenized_val_dataset = eval_dataset.map(generate_and_tokenize_prompt)

### 5. Set Up LoRA

In [None]:
model.train() # put model back into training mode
model = prepare_model_for_kbit_training(model)

config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=[
    "q_proj",
    "k_proj",
    "v_proj",
    "o_proj",
],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
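
For a sense of scale, the number of trainable LoRA parameters can be estimated by hand. Each adapted linear layer with `d_in` inputs and `d_out` outputs gains two low-rank matrices holding `r * (d_in + d_out)` weights in total. The sketch assumes the 7B model's four attention projections are all 4096 x 4096 across 32 layers; treat those dimensions as assumptions to check against `model.config`.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


hidden_size = 4096   # assumed hidden size of the 7B model
num_layers = 32      # assumed number of transformer blocks
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]
r, lora_alpha = 16, 16

per_module = lora_param_count(hidden_size, hidden_size, r)
total = per_module * len(target_modules) * num_layers
scaling = lora_alpha / r  # factor applied to the low-rank update

print(f"{total:,} trainable LoRA parameters, scaling = {scaling}")
```

In the notebook itself, `model.print_trainable_parameters()` on the PEFT model reports the actual count, which is the number to trust.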

In [None]:
if torch.cuda.device_count() > 1:
    print(f"multi cuda devices #{torch.cuda.device_count()}")
    # keeps Trainer from trying its own DataParallelism when more than 1 gpu is available
    model.is_parallelizable = True
    model.model_parallel = True

### 6. Training arguments
If you run out of GPU memory, reduce per_device_train_batch_size. Because gradient_accumulation_steps is derived from it, the effective batch size stays at batch_size, so this shouldn't affect batch dynamics during the training run. The other variables are standard settings that I wouldn't recommend changing:

In [None]:
# Effective batch size is per-device batch size x gradient accumulation steps
batch_size = 128  # Effective batch size per optimizer step
per_device_train_batch_size = 32  # Reduce this if you run out of GPU memory
gradient_accumulation_steps = batch_size // per_device_train_batch_size
output_dir = "sql-code-llama"
max_steps = 80  # Adjust this if needed

training_args = TrainingArguments(
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    warmup_steps=0,
    max_steps=max_steps,
    learning_rate=3e-4,
    fp16=True,  # Keep mixed precision for faster training
    bf16=False,  # Set to True only if hardware supports bf16 (e.g., A100 GPUs)
    logging_steps=20,  # Increase logging steps to reduce logging overhead
    optim="adamw_torch",  # Standard PyTorch AdamW; an 8-bit optimizer is an optional alternative to save memory
    evaluation_strategy="steps",  # Keep evaluation every few steps
    eval_steps=40,  # Reduce evaluation frequency
    save_strategy="steps",
    save_steps=40,  # Reduce saving frequency
    output_dir=output_dir,
    group_by_length=True,  # Keep sequences of similar length for efficiency
    report_to="none",  # Remove reporting to external loggers for faster training
    run_name=f"codellama-{datetime.now().strftime('%Y-%m-%d-%H-%M')}",
    gradient_checkpointing=False,  # Disable gradient checkpointing for faster training (if memory allows)
    dataloader_num_workers=4,  # Increase workers to improve data loading speed
    ddp_find_unused_parameters=False,  # Only relevant for distributed (DDP) multi-GPU training
)

trainer = Trainer(
    model=model,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    args=training_args,
    data_collator=DataCollatorForSeq2Seq(
        tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True
    ),
)
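
The relationship between these batch settings is simple arithmetic: the effective batch size is the per-device batch size times the accumulation steps (times the number of GPUs, assumed to be 1 here). A small sketch with a hypothetical helper shows that shrinking the per-device batch keeps the effective batch at 128:

```python
def accumulation_steps(effective_batch, per_device_batch, num_devices=1):
    """Gradient accumulation steps needed to reach the target effective batch."""
    per_step = per_device_batch * num_devices
    if effective_batch % per_step != 0:
        raise ValueError("effective batch must be a multiple of the per-step batch")
    return effective_batch // per_step


for per_device in (32, 16, 8):
    steps = accumulation_steps(128, per_device)
    # Columns: per-device batch, accumulation steps, effective batch
    print(per_device, steps, per_device * steps)

# Samples consumed by the run configured above: 128 per optimizer step x 80 steps
samples_seen = 128 * 80
print(samples_seen)
```

Smaller per-device batches cost more forward and backward passes per optimizer step, so the run takes longer but each step sees the same number of samples.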


Then we apply some PyTorch-related optimisations (which just make training faster but don't affect accuracy):

In [None]:
model.config.use_cache = False

old_state_dict = model.state_dict
model.state_dict = (lambda self, *_, **__: get_peft_model_state_dict(self, old_state_dict())).__get__(
    model, type(model)
)
if int(torch.__version__.split(".")[0]) >= 2 and sys.platform != "win32":  # compare the major version numerically
    print("compiling the model")
    model = torch.compile(model)

### 7. Execute the Training Job


The cell below will take an hour or more to run. While it is running, you can move on to the next lab.

In [None]:
trainer.train()

### 8. Save Place and Go

This saves the checkpoints to GCS. Then go to the test_notebook to test the model.


In [None]:
save_checkpoints()

For questions, please contact brian@methodical.company, or the original author https://twitter.com/samlhuillier_ about the original code found here: https://github.com/samlhuillier/code-llama-fine-tune-notebook
