# Inference for Massage Robot Assistant in 4-bit Mode

This notebook loads the merged model from the Hugging Face Hub with Unsloth in 4-bit mode for fast inference, then runs a test prompt.

In [1]:
# Define the training/inference prompt style and the test question
train_prompt_style = """Below is an instruction that describes a task for a massage robot, paired with an input that provides further context.
Write a response that generates an executable pipeline in Python using only the provided functions.
Before answering, analyze the task carefully and generate a clear sequence of commands.

### Instruction:
You are provided with high-level instructions for operating a massage robot. Create an executable pipeline in Python that structures task execution through a sub-task pipeline. This pipeline should be composed exclusively of the functions listed below in the Capabilities section, arranged in a logical and correct order so that it can be directly executed. Your response should only consist of the pipeline without additional information.

Capabilities:
    start() → Initializes the robot.
    stop() → Stops the robot.
    home() → Moves the robot to the home position.
    [x, y, z] = detect_body_part(part_name) → Detects the specified body part and returns the coordinates.
    move_to([x, y, z]) → Moves the robot to the specified coordinates.
    change_force(mode, value) → Adjusts the massage force based on the specified mode:
        - If mode is 'absolute', then value is any real number, setting the force directly.
        - If mode is 'relative', then value is between -1 and 1, modifying the current force F as:
          F_new = (1 + value) * F
    automatic_massage(part_name) → Automatically massages the specified body part.

### Question:
{}

### Response:
{}"""

# Test question
question = "up the pressure"
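
The relative mode described in the Capabilities list scales the current force by `(1 + value)`. The arithmetic can be sketched as below. `apply_force_change` is a hypothetical helper written only for illustration; it is not part of the robot API or of this notebook, and the starting force of 10.0 is an arbitrary assumption.

```python
def apply_force_change(current_force: float, mode: str, value: float) -> float:
    """Return the force after change_force(mode, value), following the prompt's rules.

    Hypothetical helper for illustration only; not part of the robot API.
    """
    if mode == "absolute":
        # Absolute mode sets the force directly.
        return value
    if mode == "relative":
        # Relative mode expects a value between -1 and 1.
        if not -1 <= value <= 1:
            raise ValueError("relative value must be between -1 and 1")
        # F_new = (1 + value) * F
        return (1 + value) * current_force
    raise ValueError(f"unknown mode: {mode!r}")


# Assumed starting force of 10.0 (arbitrary units), increased by 20%.
new_force = apply_force_change(10.0, "relative", 0.2)
print(new_force)
```

Under this rule a value of -1 reduces the force to zero, and a value of 1 doubles it.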

In [2]:
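import os

# Set these before importing any library that pulls in huggingface_hub.
# Request the standard downloader rather than hf_transfer.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0"
# Keep download progress bars visible ("1" would hide them).
os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "0"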

In [4]:
# Load the model and tokenizer from the Hugging Face Hub in 4-bit mode using Unsloth
from unsloth import FastLanguageModel

# Replace with your model repository on the Hugging Face Hub
model_repo = "novak247/massage_assistant_v1_full"

print("Loading model and tokenizer from", model_repo, "in 4-bit mode...")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_repo,
    load_in_4bit=True,
    max_seq_length=2048
)
print("Model and tokenizer loaded.")

Loading model and tokenizer from novak247/massage_assistant_v1_full in 4-bit mode...
==((====))==  Unsloth 2025.2.12: Fast Llama patching. Transformers: 4.49.0.
   \\   /|    GPU: NVIDIA GeForce RTX 4060 Ti. Max memory: 15.699 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Model and tokenizer loaded.


In [5]:
# Enable optimized inference mode
FastLanguageModel.for_inference(model)
print("Optimized inference mode enabled.")

Optimized inference mode enabled.


In [6]:
# Fill the prompt template with the question (leaving the response slot empty),
# tokenize it, and move the tensors to the GPU
inputs = tokenizer([train_prompt_style.format(question, "")], return_tensors="pt").to("cuda")
print("Input prepared and moved to GPU.")

Input prepared and moved to GPU.
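
The template has two `{}` placeholders: the first takes the question, and the second is filled with an empty string at inference time so that the prompt ends at the response marker and the model writes the rest. A minimal sketch with a shortened stand-in template (`demo_style` is hypothetical and much shorter than `train_prompt_style`):

```python
# Shortened stand-in for train_prompt_style; only the two placeholders matter here.
demo_style = """### Question:
{}

### Response:
{}"""

# The first slot takes the question; the second is left empty for the model to complete.
prompt = demo_style.format("up the pressure", "")
print(prompt)
```

During training the second slot would typically hold the reference answer instead of an empty string.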


In [8]:
# Generate a response using the model
outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True
)

# Decode the generated output
response = tokenizer.batch_decode(outputs)

# Extract and print only the text after the first "### Response:" marker
result = response[0].split("### Response:", 1)[1]
print("Generated Response:")
print(result)

Generated Response:

Here's the pipeline for the task:

```python
pipeline = [
    start(),
    detect_body_part('shoulder'),
    move_to([0, 0, 0.5]),
    change_force('relative', 0.2),
    automatic_massage('shoulder'),
    home()
]
```

### Explanation:
The pipeline begins by initializing the robot with `start()`. Next, it detects the shoulder part with `detect_body_part('shoulder')`, which gives the coordinates. The robot then moves to the detected coordinates with `move_to()`. To adjust the pressure, it changes the force mode to 'relative' and increases it by 20% using `change_force('relative', 0.2)`. Finally, it performs the automatic massage on the shoulder and returns to the home position with `home()`.

The response provided correctly structures the task execution through a logical sequence of functions, ensuring the robot performs the desired actions without errors.
</think>

To increase the pressure during the massage, the pipeline should adjust the force level appropriately
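
Because the prompt asks for a pipeline built only from the listed functions, but the model also emits explanatory text, it can help to pull the fenced Python block out of the response and check which functions it calls. The sketch below is one possible way to do that with the standard library. The names `ALLOWED_FUNCTIONS`, `extract_python_block`, and `unknown_calls` are hypothetical, and the sample text is a shortened stand-in for a real response.

```python
import ast
import re

# Function names taken from the Capabilities section of the prompt.
ALLOWED_FUNCTIONS = {
    "start", "stop", "home", "detect_body_part",
    "move_to", "change_force", "automatic_massage",
}

# Built programmatically so this example does not contain a literal code fence.
FENCE = "`" * 3


def extract_python_block(text):
    """Return the body of the first fenced python block in text, or None."""
    pattern = re.escape(FENCE) + r"python\s*\n(.*?)" + re.escape(FENCE)
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1) if match else None


def unknown_calls(code):
    """Return the sorted names of called functions that are not in ALLOWED_FUNCTIONS."""
    tree = ast.parse(code)
    called = {
        node.func.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }
    return sorted(called - ALLOWED_FUNCTIONS)


# Shortened stand-in for a generated response.
sample = (
    "Here's the pipeline:\n\n"
    + FENCE + "python\n"
    + "pipeline = [\n    start(),\n    change_force('relative', 0.2),\n    home()\n]\n"
    + FENCE + "\n\nSome explanation follows."
)

code = extract_python_block(sample)
print(unknown_calls(code))
```

This check only covers function names. It does not verify argument values, call order, or whether the coordinates returned by `detect_body_part` are actually passed on to `move_to`.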