## A Guide for Fine-tuning Llama 3.1 (8B parameter) using Ray Framework on Hopsworks
This tutorial demonstrates how to fine-tune Llama 3.1 (8B) with LoRA and DeepSpeed using the Ray framework on Hopsworks. Ray is an open-source framework for distributed computing. This tutorial was run on an OVH cluster, but you can use any cloud provider of your choice.

### Pre-requisites
To perform the steps in this tutorial, you need to create a Hopsworks Kubernetes cluster with Ray enabled. For the fine-tuning task demonstrated in this example, these are the minimum resources required:
* 1 x Node (16 CPU, 64 GB RAM) for the Ray head
* 4 x Nodes (15 CPU, 45 GB RAM, 300 GB disk, 1 Tesla V100S GPU) for the workers

Let's get started!

## 1️⃣ Dataset preparation
We are going to fine-tune the model for question answering (QA), so the dataset must be prepared in a format suitable for supervised fine-tuning. The pre-trained (base) Llama 3.1 models do not require a specific prompt format, so the preprocessing can follow any prompt-completion style. The instruction-tuned models (Meta-Llama-3.1-{8,70,405}B-Instruct) use a multi-turn conversation prompt format that structures the exchange between the user and the model.
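
To make the multi-turn format more concrete, here is a minimal sketch of how a conversation could be rendered with the Llama 3.1 Instruct header and end-of-turn tokens. The function name `render_llama31_chat` is hypothetical, and the tokenizer's own chat template remains the authoritative source, since it may add details that this sketch omits.

```python
# Sketch of the Llama 3.1 Instruct prompt layout for one conversation.
# In practice, prefer the tokenizer's chat template over hand-built strings.

def render_llama31_chat(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a single prompt string."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its answer.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = render_llama31_chat([
    {"role": "system", "content": "You are a careful math tutor."},
    {"role": "user", "content": "What is 7 + 13?"},
])
print(prompt)
```

The fine-tuning data in this tutorial does not use this layout; it uses a simpler prompt-completion style with its own markers.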

The dataset for QA typically includes the following fields:

* Question: The input question to the model.
* Context (optional): A passage or text providing information the model should use to answer.
* Answer: The correct response.
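
As a rough illustration of these fields, the sketch below flattens one QA record into a single training string, using the `<START_Q>`, `<END_Q>`, `<START_A>`, and `<END_A>` markers that appear later in this tutorial. The helper `format_qa_record` and the way the optional context is prepended to the question are assumptions made for the example; GSM8K itself has no context field.

```python
import json

def format_qa_record(question, answer, context=None):
    """Build one prompt-completion record from QA fields."""
    # Assumption: when a context passage exists, place it before the question.
    prompt = question if context is None else f"{context}\n\n{question}"
    return {"input": f"<START_Q>{prompt}<END_Q><START_A>{answer}<END_A>"}


record = format_qa_record("What is 2 + 3?", "2 + 3 = 5\n#### 5")
line = json.dumps(record)  # one line of a JSONL training file
print(line)
```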

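This example fine-tunes Llama 3.1 8B on the GSM8K dataset of grade-school math word problems. Note that the code below downloads the instruction-tuned checkpoint (`meta-llama/Llama-3.1-8B-Instruct`) and trains it with a simple prompt-completion format.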

In [1]:
import os
import json
from datasets import load_dataset

In [2]:
import hopsworks

project = hopsworks.login()

ds = project.get_dataset_api()
mr = project.get_model_registry()
jb = project.get_jobs_api()

2025-02-21 00:27:34,288 INFO: Python Engine initialized.

Logged in to project, explore it here https://hopsworks.ai.local/p/1146


In [3]:
# Create resources directory in HopsFS

llama_ft_resources_dir = "Resources/llama_finetuning"
# PROJECT_PATH is expected to be set when running inside Hopsworks
HOPSFS_STORAGE_PATH = os.path.join(os.environ["PROJECT_PATH"], llama_ft_resources_dir)
os.makedirs(HOPSFS_STORAGE_PATH, exist_ok=True)

In [4]:
# Copy fine-tuning configuration files to the llama resources directory
for root, dirs, files in os.walk("configs"):
    for filename in files:
        ds.upload(os.path.join(root, filename), os.path.join(llama_ft_resources_dir, root), overwrite=True)

Uploading: 0.000%|          | 0/977 elapsed<00:00 remaining<?

Uploading: 0.000%|          | 0/763 elapsed<00:00 remaining<?

Uploading: 0.000%|          | 0/980 elapsed<00:00 remaining<?

Uploading: 0.000%|          | 0/866 elapsed<00:00 remaining<?

Uploading: 0.000%|          | 0/263 elapsed<00:00 remaining<?

In [5]:
# Download training data files

dataset = load_dataset("openai/gsm8k", "main")
dataset_splits = {"train": dataset["train"], "test": dataset["test"]}
dataset_dir = os.path.join(HOPSFS_STORAGE_PATH, "datasets")
os.makedirs(dataset_dir, exist_ok=True)

In [6]:
# Wrap each question and answer in special tokens and write the splits as JSONL files

special_tokens = ["<START_Q>", "<END_Q>", "<START_A>", "<END_A>"]
with open(os.path.join(dataset_dir, "tokens.json"), "w") as f:
    json.dump({"tokens": special_tokens}, f)

max_num_qas = 100  # maximum number of QA pairs written per split; set to None to write all
for key, split in dataset_splits.items():
    with open(os.path.join(dataset_dir, f"{key}.jsonl"), "w") as f:
        for i, item in enumerate(split):
            if max_num_qas is not None and i >= max_num_qas:
                break
            newitem = {
                "input": (
                    f"<START_Q>{item['question']}<END_Q>"
                    f"<START_A>{item['answer']}<END_A>"
                )
            }
            f.write(json.dumps(newitem) + "\n")  # one JSON object per line in the dataset resources dir
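
Before launching a training job, it can be worth reading a few lines back to confirm that every record follows the expected layout. The sketch below parses one JSONL line into its question and answer parts; the helper name `parse_qa_line` is hypothetical and the sample line is made up for illustration.

```python
import json
import re

# Matches the <START_Q>...<END_Q><START_A>...<END_A> layout used for each record.
QA_PATTERN = re.compile(r"<START_Q>(.*?)<END_Q><START_A>(.*?)<END_A>", re.DOTALL)

def parse_qa_line(line):
    """Return (question, answer) from one JSONL line, or raise ValueError."""
    record = json.loads(line)
    match = QA_PATTERN.fullmatch(record["input"])
    if match is None:
        raise ValueError("record does not follow the <START_Q>...<END_A> layout")
    return match.group(1), match.group(2)


sample_line = json.dumps(
    {"input": "<START_Q>What is 6 * 7?<END_Q><START_A>6 * 7 = 42\n#### 42<END_A>"}
)
question, answer = parse_qa_line(sample_line)
print(question)
print(answer)
```

To check the real files, you could loop over the lines of `train.jsonl` and `test.jsonl` in `dataset_dir` and call `parse_qa_line` on each one.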


## 2️⃣ Download and Register the Base Llama3.1 Model
The next step is to download the Llama 3.1 model from Hugging Face. For this, you will need a Hugging Face access token.

In [7]:
!pip install huggingface_hub --quiet

In [None]:
os.environ["HF_TOKEN"] = "<INSERT_YOUR_HF_TOKEN>"
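
If you prefer not to paste the token into the notebook, one option is to set `HF_TOKEN` outside the notebook and only validate it here. The following is a minimal sketch under that assumption; `resolve_hf_token` is a hypothetical helper, not part of `huggingface_hub`.

```python
import os

def resolve_hf_token(env=None):
    """Return the Hugging Face token from the environment, or raise if unusable."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    # Treat a missing value or an unfilled "<...>" placeholder as an error.
    if not token or token.startswith("<"):
        raise RuntimeError("Set the HF_TOKEN environment variable to a valid token.")
    return token


# Example with an explicit mapping instead of the real environment.
token = resolve_hf_token({"HF_TOKEN": " hf_example_value "})
print(len(token))
```

The Llama 3.1 repositories on Hugging Face are gated, so the account that owns the token also needs to have been granted access to the model.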

In [9]:
# Download the model from Hugging Face
from huggingface_hub import snapshot_download

model_id = "meta-llama/Llama-3.1-8B-Instruct"
llama31_local_dir = snapshot_download(model_id, ignore_patterns="original/*")

Fetching 14 files:   0%|          | 0/14 [00:00<?, ?it/s]

In [10]:
# Export Llama3.1 model to the Hopsworks Model Registry
base_model_name = "llama318binstruct"
llama31 = mr.llm.create_model(base_model_name, description="Llama3.1-8B-Instruct model from HuggingFace")
llama31.save(llama31_local_dir)

  0%|          | 0/6 [00:00<?, ?it/s]

Model created, explore it at https://hopsworks.ai.local/p/1146/models/llama318binstruct/1


Model(name: 'llama318binstruct', version: 1)

## 3️⃣ Create the Ray job for the fine-tuning task
We are going to use the Hopsworks Jobs API to create and run the job for the fine-tuning task.

In [11]:
lora_adapter_name = f"lora{base_model_name}"

app_file_path = ds.upload("ray_llm_finetuning.py", llama_ft_resources_dir, overwrite=True)
environment_config_yaml_path = ds.upload("llama_fine_tune_runtime_env.yaml", llama_ft_resources_dir, overwrite=True)

Uploading: 0.000%|          | 0/29148 elapsed<00:00 remaining<?

Uploading: 0.000%|          | 0/548 elapsed<00:00 remaining<?

#### About the runtime environment file
The runtime environment file contains the dependencies required for the Ray job including files, packages, environment variables, and more. This is useful when you need to install specific packages and set environment variables for this particular Ray job. It should be provided as a YAML file. In this example, the runtime environment file has the following configuration.
```
pip:
  - transformers==4.44.0
  - accelerate==0.31.0
  - peft==0.11.1
  - deepspeed==0.16.2
env_vars:
  LIBRARY_PATH: "$CUDA_HOME/lib64:$LIBRARY_PATH"
  PROJECT_DIR: "/home/yarnapp/hopsfs"
  TRAINED_MODEL_STORAGE_PATH: "${PROJECT_DIR}/Resources/llama_finetuning/fine-tuned-model" # Where the fine-tuned model will be saved
  TRAINING_DATA_DIR: "${PROJECT_DIR}/Resources/llama_finetuning/datasets" # dataset location
  TRAINING_CONFIGURATION_DIR: "${PROJECT_DIR}/Resources/llama_finetuning/configs" # location for deepspeed and lora configuration files
```
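
The training script can then pick these variables up from its environment. The sketch below shows one way to do that. It is not taken from `ray_llm_finetuning.py`, whose code is not shown here, and `resolve_training_paths` is a hypothetical helper. Since it is not clear from the file alone whether `${PROJECT_DIR}` is expanded before the script sees the values, the sketch substitutes it explicitly.

```python
import os

PATH_VARIABLES = (
    "TRAINED_MODEL_STORAGE_PATH",
    "TRAINING_DATA_DIR",
    "TRAINING_CONFIGURATION_DIR",
)

def resolve_training_paths(env=None):
    """Read the path variables and expand any literal ${PROJECT_DIR} reference."""
    env = os.environ if env is None else env
    project_dir = env["PROJECT_DIR"]
    paths = {}
    for name in PATH_VARIABLES:
        # Already-expanded values pass through unchanged.
        paths[name] = env[name].replace("${PROJECT_DIR}", project_dir)
    return paths


# Example values copied from the runtime environment file above.
example_env = {
    "PROJECT_DIR": "/home/yarnapp/hopsfs",
    "TRAINED_MODEL_STORAGE_PATH": "${PROJECT_DIR}/Resources/llama_finetuning/fine-tuned-model",
    "TRAINING_DATA_DIR": "${PROJECT_DIR}/Resources/llama_finetuning/datasets",
    "TRAINING_CONFIGURATION_DIR": "/home/yarnapp/hopsfs/Resources/llama_finetuning/configs",
}
paths = resolve_training_paths(example_env)
for name, value in paths.items():
    print(name, "=", value)
```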

In [12]:
# Model config
model_args = f"--base-model-name {base_model_name} --lora-model-name {lora_adapter_name}"

# Torch Trainer scaling config
torch_trainer_num_workers = 4
torch_trainer_worker_cpus = 11
torch_trainer_worker_gpus = 1
torch_trainer_scaling_args = f"-ttnm {torch_trainer_num_workers} -ttwc {torch_trainer_worker_cpus} -ttwg {torch_trainer_worker_gpus}"

# Training config
num_epochs = 2
learning_rate = "5e-4"
batch_size_per_device=4
eval_batch_size_per_device=4
training_config_args = f"--lora --mx fp16 --num-epochs={num_epochs} --lr={learning_rate} --batch-size-per-device={batch_size_per_device} --eval-batch-size-per-device={eval_batch_size_per_device}"

# Ray cluster config
ray_config = jb.get_configuration("RAY")
ray_config['appPath'] = os.path.join('/Projects/' + project.name, app_file_path)
ray_config['environmentName'] = "ray-torch-training-pipeline"
ray_config['driverCores'] = 1
ray_config['driverMemory'] = 4096
ray_config['workerCores'] = 12
ray_config['workerMemory'] = 30816
ray_config['workerMinInstances'] = 4
ray_config['workerMaxInstances'] = 4
ray_config['workerGpus'] = 1
ray_config['runtimeEnvironment'] = os.path.join('/Projects/' + project.name, environment_config_yaml_path)

ray_config['defaultArgs'] = f"{model_args} {torch_trainer_scaling_args} {training_config_args}"
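
The `defaultArgs` string is passed to the application file as command-line arguments. As a sketch of how a script could parse it, the parser below accepts the same flags. The destination names for `-ttnm`, `-ttwc`, and `-ttwg`, as well as all default values, are assumptions, since the argument handling in `ray_llm_finetuning.py` is not shown in this tutorial.

```python
import argparse
import shlex

def build_parser():
    """Hypothetical parser that accepts the flags used in defaultArgs."""
    parser = argparse.ArgumentParser(description="Fine-tuning arguments (sketch)")
    parser.add_argument("--base-model-name", required=True)
    parser.add_argument("--lora-model-name", required=True)
    # Torch Trainer scaling flags; the dest names are illustrative guesses.
    parser.add_argument("-ttnm", dest="num_workers", type=int, default=1)
    parser.add_argument("-ttwc", dest="worker_cpus", type=int, default=1)
    parser.add_argument("-ttwg", dest="worker_gpus", type=int, default=0)
    parser.add_argument("--lora", action="store_true")
    parser.add_argument("--mx", default="fp16")
    parser.add_argument("--num-epochs", type=int, default=1)
    parser.add_argument("--lr", type=float, default=5e-4)
    parser.add_argument("--batch-size-per-device", type=int, default=4)
    parser.add_argument("--eval-batch-size-per-device", type=int, default=4)
    return parser


default_args = (
    "--base-model-name llama318binstruct --lora-model-name lorallama318binstruct "
    "-ttnm 4 -ttwc 11 -ttwg 1 --lora --mx fp16 --num-epochs=2 --lr=5e-4 "
    "--batch-size-per-device=4 --eval-batch-size-per-device=4"
)
args = build_parser().parse_args(shlex.split(default_args))
print(args)
```

Note that each trainer worker requests 11 CPUs while each Ray worker is configured with 12 cores, which leaves one core per worker for other processes.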

In [13]:
job_name = "fine-tune-llama31"
job = jb.create_job(job_name, ray_config)

Job created successfully, explore it at https://hopsworks.ai.local/p/1146/jobs/named/fine-tune-llama31


## 4️⃣ Run the fine-tuning Ray job

In [14]:
finetuning_job = jb.get_job(job_name)

In [15]:
finetuning_job.run()

Launching job: fine-tune-llama31
Job started successfully, you can follow the progress at 
https://hopsworks.ai.local/p/1146/jobs/named/fine-tune-llama31/executions
2025-02-21 00:33:33,865 INFO: Waiting for execution to finish. Current state: INITIALIZING
2025-02-21 00:34:01,210 INFO: Waiting for execution to finish. Current state: PENDING
2025-02-21 00:34:25,490 INFO: Waiting for execution to finish. Current state: RUNNING
2025-02-21 00:56:06,842 INFO: Waiting for execution to finish. Current state: AGGREGATING_LOGS
2025-02-21 00:56:06,870 INFO: Waiting for log aggregation to finish.
2025-02-21 00:59:16,513 INFO: Execution finished successfully.


Execution('SUCCEEDED', 'FINISHED', '2025-02-21T08:33:30.000Z', '--base-model-name llama318binstruct --lora-model-name lorallama318binstruct -ttnm 4 -ttwc 11 -ttwg 1 --lora --mx fp16 --num-epochs=2 --lr=5e-4 --batch-size-per-device=4 --eval-batch-size-per-device=4')
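
The arguments recorded in the execution give a rough idea of the size of the run. The sketch below estimates the global batch size and the number of optimizer steps, assuming plain data-parallel training with one GPU per worker, no gradient accumulation (the DeepSpeed configuration is not shown here), and the 100 training examples written during dataset preparation.

```python
import math

def steps_per_epoch(num_examples, num_workers, batch_size_per_device, grad_accum_steps=1):
    """Return (global batch size, optimizer steps per epoch) for data-parallel training."""
    global_batch = num_workers * batch_size_per_device * grad_accum_steps
    return global_batch, math.ceil(num_examples / global_batch)


# 100 examples, 4 workers, batch size 4 per device, 2 epochs.
global_batch, steps = steps_per_epoch(num_examples=100, num_workers=4, batch_size_per_device=4)
total_steps = steps * 2
print(global_batch, steps, total_steps)
```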

Once the job has started, you can monitor its execution in the Hopsworks UI. From the executions page, you can open the Ray Dashboard, which shows the resources used by the job, the number of workers, the logs, and the running tasks.

After the job finishes successfully, the fine-tuned model is saved in the directory specified by the `TRAINED_MODEL_STORAGE_PATH` variable defined in the runtime environment file.
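
A quick way to confirm that the job produced an adapter is to look for the files that PEFT writes by default when saving a LoRA adapter (`adapter_config.json` plus a weights file). The sketch below walks a directory tree and reports folders that contain them. The helper `find_lora_adapters` and the demo directory layout are hypothetical, and the exact layout written by the training script may differ.

```python
import os
import tempfile

# Default PEFT weight file names (safetensors, or the older .bin format).
WEIGHT_FILES = {"adapter_model.safetensors", "adapter_model.bin"}

def find_lora_adapters(root):
    """Return sorted directories under root that look like saved LoRA adapters."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        names = set(filenames)
        if "adapter_config.json" in names and names & WEIGHT_FILES:
            found.append(dirpath)
    return sorted(found)


# Demo on a throwaway directory that mimics one saved adapter.
with tempfile.TemporaryDirectory() as root:
    adapter_dir = os.path.join(root, "checkpoint", "lora_adapter")  # made-up layout
    os.makedirs(adapter_dir)
    for name in ("adapter_config.json", "adapter_model.safetensors"):
        open(os.path.join(adapter_dir, name), "w").close()
    relative = [os.path.relpath(path, root) for path in find_lora_adapters(root)]
print(relative)
```

In the notebook, you would point `find_lora_adapters` at the fine-tuned model directory under `HOPSFS_STORAGE_PATH`.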

## 5️⃣ Export fine-tuned Llama3.1 model

In [16]:
# Replicate the base model first

fine_tuned_model_name = f"ft{llama31.name}"
ft_llama31 = mr.llm.create_model(fine_tuned_model_name, description="(LoRA fine-tuned) " + llama31.description)
ft_llama31.save(llama31.model_files_path, keep_original_files=True)

  0%|          | 0/6 [00:00<?, ?it/s]

Model created, explore it at https://hopsworks.ai.local/p/1146/models/ftllama318binstruct/1


Model(name: 'ftllama318binstruct', version: 1)

In [17]:
# Copy the fine-tuned LoRA adapter into the model files directory

ftllama31_lora_adapter_path = f"{ft_llama31.model_files_path}/lora_adapter"
if not ds.exists(ftllama31_lora_adapter_path):
    ds.mkdir(ftllama31_lora_adapter_path)

lora_adapter = mr.get_model(lora_adapter_name)    
count, files = ds.list_files(lora_adapter.model_files_path, 0, 100)
for f in files:
    ds.copy(f.path, f"{ftllama31_lora_adapter_path}/{os.path.basename(f.path)}")




## 6️⃣ Deploy the fine-tuned Llama3.1 model

In [19]:
path_to_config_file = f"/Projects/{project.name}/" + ds.upload("llama_vllmconfig.yaml", "Resources", overwrite=True)

Uploading: 0.000%|          | 0/134 elapsed<00:00 remaining<?

In [20]:
ft_llama31_depl = ft_llama31.deploy(
    name="ftllama31",
    description="(LoRA fine-tuned) Llama3.1 8B-Instruct from HuggingFace",
    config_file=path_to_config_file,
    resources={"num_instances": 1, "requests": {"cores": 2, "memory": 1024*12, "gpus": 1}},
)

Deployment created, explore it at https://hopsworks.ai.local/p/1146/deployments/1035
Before making predictions, start the deployment by using `.start()`


In [21]:
ft_llama31_depl.start(await_running=60*15)

  0%|          | 0/5 [00:00<?, ?it/s]

Start making predictions by using `.predict()`


## 7️⃣ Prompting the fine-tuned Llama3.1 model

In [22]:
import httpx

# Get the istio endpoint from the Llama deployment page in the Hopsworks UI.
istio_endpoint = "<ISTIO_ENDPOINT>" # with format "http://<ip-address>"

chat_completions_url = istio_endpoint + "/v1/chat/completions"

# Resolve API key for request authentication
if "SERVING_API_KEY" in os.environ:
    # if running inside Hopsworks
    api_key_value = os.environ["SERVING_API_KEY"]
else:
    # Create an API KEY using the Hopsworks UI and place the value below
    api_key_value = "<API_KEY>"
    
# Prepare request headers
headers = {
    'Content-Type': 'application/json',
    'Authorization': 'ApiKey ' + api_key_value,
    'Host': f"{ft_llama31_depl.name}.{project.name.lower().replace('_', '-')}.hopsworks.ai", # also provided in the Hopsworks UI
}
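
The `Host` header follows a simple naming rule: deployment name, then the project name in lowercase with underscores replaced by hyphens, then the domain. The sketch below wraps the header construction in a helper so the rule is easy to test; `build_serving_headers` and the sample project name are made up for illustration.

```python
def build_serving_headers(deployment_name, project_name, api_key, domain="hopsworks.ai"):
    """Build request headers for a model deployment served behind Istio."""
    # Same transformation as in the cell above: lowercase, underscores to hyphens.
    host = f"{deployment_name}.{project_name.lower().replace('_', '-')}.{domain}"
    return {
        "Content-Type": "application/json",
        "Authorization": f"ApiKey {api_key}",
        "Host": host,
    }


example_headers = build_serving_headers("ftllama31", "My_Project", "example-key")
print(example_headers["Host"])
```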

#### 🟨 Generate answer with the base model

In [24]:
#
# Chat Completion for a user message
#

user_message = "Katy makes coffee using teaspoons of sugar and cups of water in the ratio of 7:13. " \
               "If she used a total of 120 teaspoons of sugar and cups of water, calculate the number of teaspoonfuls of sugar she used."

# user_message = "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. " +
#                "How many clips did Natalia sell altogether in April and May?"

# Improvement proposed by: https://arxiv.org/abs/2205.11916
final_instruction = " Let's think step by step. At the end, you MUST write the answer as an integer after '####'."
    
completion_request = {
    "model": ft_llama31_depl.name,
    "messages": [
        {
            "role": "user",
            "content": user_message + final_instruction
        }
    ]
}

print("Completion request: ", completion_request, end="\n")

response = httpx.post(chat_completions_url, headers=headers, json=completion_request, timeout=45.0)
print(response)
print(response.json()["choices"][0]["message"]["content"])

Completion request:  {'model': 'ftllama31', 'messages': [{'role': 'user', 'content': "Katy makes coffee using teaspoons of sugar and cups of water in the ratio of 7:13. If she used a total of 120 teaspoons of sugar and cups of water, calculate the number of teaspoonfuls of sugar she used. Let's think step by step. At the end, you MUST write the answer as an integer after '####'."}]}
2025-02-21 01:17:20,225 INFO: HTTP Request: POST http://54.37.77.225/v1/chat/completions "HTTP/1.1 200 OK"
<Response [200 OK]>
To find the number of teaspoonfuls of sugar Katy used, we first need to determine the total parts in the ratio. 

The ratio of sugar to water is 7:13. 
To find the total parts, we add the two parts together: 
7 + 13 = 20 parts.

Since the total amount used was 120 (teaspoons of sugar and cups of water), we can find the value of one part by dividing 120 by 20: 
120 / 20 = 6.

Now that we know one part is equal to 6, we can find the number of teaspoonfuls of sugar Katy used. 
In the r
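
The request body is a plain dictionary in the OpenAI-style chat completions format, so it can be built by a small helper and reused with a different `model` value. The sketch below is one way to do that. The helper `build_chat_request` is hypothetical, and the optional `max_tokens` and `temperature` fields are standard in that format but are not set by the requests in this tutorial.

```python
def build_chat_request(model, user_message, final_instruction="", max_tokens=None, temperature=None):
    """Build a single-turn chat completion request body."""
    request = {
        "model": model,
        "messages": [{"role": "user", "content": user_message + final_instruction}],
    }
    # Only include optional sampling fields when the caller sets them.
    if max_tokens is not None:
        request["max_tokens"] = max_tokens
    if temperature is not None:
        request["temperature"] = temperature
    return request


example_request = build_chat_request(
    "ftllama31",
    "What is 7 + 13?",
    final_instruction=" Let's think step by step.",
    max_tokens=512,
)
print(example_request)
```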

#### 🟨 Generate answer via LoRA adapter

In [25]:
#
# Chat Completion for a user message (fine-tuned)
#

user_message = "Katy makes coffee using teaspoons of sugar and cups of water in the ratio of 7:13. " \
               "If she used a total of 120 teaspoons of sugar and cups of water, calculate the number of teaspoonfuls of sugar she used."

# user_message = "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. " +
#                "How many clips did Natalia sell altogether in April and May?"

final_instruction = " Let's think step by step. At the end, you MUST write the answer as an integer after '####'."
        
completion_request = {
    "model": "lora_adapter",
    "messages": [
        {
            "role": "user",
            "content": user_message + final_instruction
        }
    ]
}

print("Completion request: ", completion_request, end="\n")

response = httpx.post(chat_completions_url, headers=headers, json=completion_request, timeout=45.0)
print(response)
print(response.json()["choices"][0]["message"]["content"])

Completion request:  {'model': 'lora_adapter', 'messages': [{'role': 'user', 'content': "Katy makes coffee using teaspoons of sugar and cups of water in the ratio of 7:13. If she used a total of 120 teaspoons of sugar and cups of water, calculate the number of teaspoonfuls of sugar she used. Let's think step by step. At the end, you MUST write the answer as an integer after '####'."}]}
2025-02-21 01:20:00,681 INFO: HTTP Request: POST http://54.37.77.225/v1/chat/completions "HTTP/1.1 200 OK"
<Response [200 OK]>
To find the number of teaspoonfuls of sugar Katy used, we need to follow these steps:

1. The ratio of sugar to water is 7:13. This means that for every 7 teaspoons of sugar, there are 13 cups of water. To simplify this ratio, we can find the least common multiple (LCM) of 7 and 13. The LCM of 7 and 13 is 91 (7 * 13 = 91).

2. Now we know that for every 91 units of the ratio, there are 7 teaspoons of sugar. We also know that the total number of units (sugar and water) used is 120
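
To compare responses automatically, you can extract the integer that follows `####` and check it against a reference value. For this prompt, the reference is 120 × 7 / (7 + 13) = 42 teaspoons of sugar. The sketch below shows both steps; the helper names are hypothetical and the sample text is made up rather than taken from a model response.

```python
import re

def extract_final_answer(text):
    """Return the last integer written after '####', or None if there is none."""
    matches = re.findall(r"####\s*(-?\d[\d,]*)", text)
    if not matches:
        return None
    return int(matches[-1].replace(",", ""))

def share_of_total(total, part, other_part):
    """Split total in the ratio part:other_part and return the first share."""
    return total * part // (part + other_part)


expected = share_of_total(120, 7, 13)
sample_text = "One part is 120 / 20 = 6, so the sugar is 7 * 6 = 42.\n#### 42"
print(extract_final_answer(sample_text) == expected)
```

A response that stops before writing `####` yields `None`, which makes incomplete answers easy to spot.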