## **Out-of-the-Box Assembly Plan Generation: A Peer Evaluation of Open-Source LLMs**

### Code related to the paper 'Out-of-the-Box Assembly Plan Generation: A Peer Evaluation of Open-Source LLMs' by Harkrian Sahota and Adam Klodowski.

The paper **'Out-of-the-Box Assembly Plan Generation: A Peer Evaluation of Open-Source LLMs'** investigates the extent to which pre-trained, general-purpose LLMs can be applied directly, without additional training, to generate structured assembly instructions from minimal input such as part lists and basic constraints.

This notebook contains the code for the tests and evaluations used to explore the feasibility of employing large language models (LLMs) for automated assembly instruction generation. A selection of LLMs is first tested for domain relevance by asking 10 general questions on manufacturing and scoring the responses against reference answers. The top performers are then evaluated on their ability to generate high-quality assembly instructions for different assembly tasks under different prompting strategies.

### **1. Testing Domain Relevance**

The domain relevance of four selected non-proprietary LLMs is tested by asking each model 10 domain-relevant questions and conducting a peer evaluation, in which the same four models use reference answers to assess the correctness, completeness, and domain relevance of the given answers.

To evaluate a model’s domain-specific competence, a set of ten open-ended questions was developed, covering core aspects of manufacturing and production. These questions were designed to assess knowledge across areas such as production process flow, planning, assembly operations, and manufacturing efficiency.

Four LLMs were tested: DeepSeek-V3-0324, Gemma-3-27B-it, Qwen3-235B-A22B, and NVIDIA’s Llama-3.1-Nemotron-Ultra-253B-v1. A detailed explanation of how this selection was made can be found in the corresponding paper.

#### 1.1 Domain Relevance Questions

In [1]:
from huggingface_hub import InferenceClient

from dotenv import load_dotenv
import os

from openai import OpenAI

import random

load_dotenv()

True

The ten questions selected to evaluate each LLM's domain relevance in manufacturing. The list of questions is looped through for each model.

In [16]:
questions = [
    "What are the key steps in a standard manufacturing process flow?",
    "What is the purpose of a Bill of Materials (BOM) in production?",
    "Explain the difference between subtractive and additive manufacturing.",
    "What is lean manufacturing and its core principles?",
    "What is production planning in manufacturing, and why is it important?",
    "What does an efficient production plan in manufacturing look like?",
    "What are common assembly actions used to assemble products?",
    "How would you describe to a layman the steps to assemble two parts with a special screw using a specific tool?",
    "What does \"takt time\" mean, and how is it calculated?",
    "What are common causes of bottlenecks in a manufacturing system?"
]


Selection of models that have been tested. Select one for testing.

In [17]:
#MODEL_NAME = "google/gemma-3-27b-it"
MODEL_NAME = "deepseek-ai/DeepSeek-V3-0324"
#MODEL_NAME = "Qwen/Qwen3-235B-A22B"
#MODEL_NAME = "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1"

Establishing the connection to the specified model using the Nebius provider. Questions and resulting answers are saved in a file named *[MODEL_NAME].txt* in the folder *DomainRelevanceTestResponses*.

<span style="color:red">*Remark: Please make sure to have a valid Nebius access token to run the following code.*</span>
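A missing token otherwise only surfaces as an authentication error on the first request. The following is a minimal sketch of an early check, assuming the token is stored in the `NEBIUS_TOKEN` environment variable as in the code of this notebook; the helper name `require_env` is hypothetical and not part of the original notebook.

```python
import os


def require_env(name, env=None):
    """Return the value of an environment variable or raise a clear error.

    `env` defaults to os.environ; a plain dict can be passed for testing.
    """
    env = os.environ if env is None else env
    value = env.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; "
            "add it to your .env file before running the next cell."
        )
    return value


# Illustration with a stand-in dictionary instead of the real environment
print(require_env("NEBIUS_TOKEN", {"NEBIUS_TOKEN": "example-token"}))
```

With such a helper, the client could be created with `api_key=require_env("NEBIUS_TOKEN")` instead of `os.environ.get("NEBIUS_TOKEN")`.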

In [None]:
client = OpenAI(
    base_url="https://api.studio.nebius.com/v1/",
    api_key=os.environ.get("NEBIUS_TOKEN")
)

# Store Q&A results
qa_pairs = []

for i, question in enumerate(questions, 1):
    completion = client.chat.completions.create(
        model=MODEL_NAME,
        messages=[
            {
                "role": "system",
                "content": """SYSTEM_PROMPT"""
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": question
                    }
                ]
            }
        ]
    )

    answer = completion.choices[0].message.content

    # Store result
    qa_pairs.append((question, answer))
    
    # Print to notebook
    print(f"Q{i}: {question}\nA{i}: {answer}\n{'-'*60}")


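# Save to file (os.path.join keeps the path portable across operating systems)
output_dir = "DomainRelevanceTestResponses"
os.makedirs(output_dir, exist_ok=True)
filename = os.path.join(output_dir, MODEL_NAME.split("/")[-1] + ".txt")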
with open(filename, "w", encoding="utf-8") as f:
    for q, a in qa_pairs:
        f.write(f"Q: {q}\nA: {a}\n{'-'*60}\n")

print(f"\n✅ All results saved to: {filename}")


#### 1.2 Evaluation of Domain Relevance

For a systematic evaluation of the results, rather than relying on a single method, a multi-LLM peer review system is implemented. Each tested model also serves as an evaluation model and assesses the anonymised responses of all four models, its own included. This approach leverages the linguistic and semantic reasoning capabilities of LLMs to provide structured and consistent assessments without requiring large-scale human evaluation.

Therefore, the evaluation models are presented with the initial question, the reference answer, and the four responses given by the models, anonymised and shuffled. The evaluation models are asked to score the answers on a scale from 1 to 5.

-------------------------------------------------------------------------------------------------------------

All four initially selected models now serve as evaluation models. Select the evaluation model.

In [51]:
#MODEL_NAME = "google/gemma-3-27b-it"
#MODEL_NAME = "deepseek-ai/DeepSeek-V3-0324"
#MODEL_NAME = "Qwen/Qwen3-235B-A22B"
MODEL_NAME = "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1"

List of reference answers given to the evaluation models to assess the quality of the tested models' answers.

In [12]:
reference_answers=[
    "Typical manufacturing processes include: 1) Product design & prototyping, 2) Material procurement, 3) Production planning, 4) Manufacturing (e.g., machining, forming), 5) Assembly, 6) Quality control, and 7) Packaging & distribution. ",
    "A BOM is a structured list detailing all materials, parts, and components required to manufacture a product, including quantities and specifications. It is used for production planning, procurement, and inventory control.",
    "Subtractive manufacturing removes material from a solid block (e.g., CNC milling), while additive manufacturing builds material layer by layer (e.g., 3D printing).",
    "Lean manufacturing focuses on reducing waste and increasing value. Core principles include: 1) Define value, 2) Map value stream, 3) Create flow, 4) Establish pull, 5) Pursue perfection.",
    "Production planning involves scheduling, resource allocation, and workflow optimization to meet demand while minimizing waste and delays. It includes forecasting, inventory control, and capacity planning.",
    "It balances demand forecasts with capacity, reduces idle time, ensures material availability, includes buffers for uncertainty, and adapts to changes. Efficiency depends on synchronization across labour, materials, and equipment.",
    "Common actions include: Fastening (screws, bolts), Joining (welding, gluing), Aligning and fitting parts, Wiring, Testing, and Packaging. These vary depending on the product and process type.",
    "1) Align the two parts so their screw holes match, 2) Insert the special screw, 3) Use the specified tool (e.g., Torx driver) to tighten the screw clockwise, 4) Ensure a firm connection without over-tightening.",
    "Takt time is the rate at which products must be produced to meet customer demand. It’s calculated by dividing available production time by required output and is used to synchronize workflow and reduce inefficiencies, as emphasized in the Toyota Production System.",
    "Common bottlenecks in manufacturing include machine breakdowns, poor layout, insufficient workforce, planning issues, and process misalignment. These cause delays and lower efficiency, so addressing them is key to optimizing production."
]

Reading in all answer files and creating a dictionary of answers by each model.

In [13]:
model_answer_files = {
    # Folder name matches the one used when the responses were saved
    "DeepSeek": "DomainRelevanceTestResponses/DeepSeek-V3-0324.txt",
    "Gemma": "DomainRelevanceTestResponses/gemma-3-27b-it.txt",
    "Qwen": "DomainRelevanceTestResponses/Qwen3-235B-A22B.txt",
    "Llama": "DomainRelevanceTestResponses/Llama-3_1-Nemotron-Ultra-253B-v1.txt"
}

In [None]:
import re

def extract_answers(filepath):
    with open(filepath, 'r', encoding='utf-8') as file:
        content = file.read()

    # Use regex to extract answers
    q_a_pairs = re.findall(r"Q:\s*(.*?)\nA:\s*(.*?)(?=\nQ:|\Z)", content, re.DOTALL)

    # Extract only answers in order
    answers = [answer.strip() for _, answer in q_a_pairs]
    return answers

model_answers = {model: extract_answers(path) for model, path in model_answer_files.items()}
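As written, the regular expression above stops each answer at the next `Q:` line, so the 60-dash separator line that the saving cell writes after every answer appears to stay attached to the extracted text. The following is a minimal sketch of a separator-aware variant, assuming the file layout produced by the saving cell (`Q: ...`, `A: ...`, then a line of 60 dashes); the function name `extract_answers_without_separator` is hypothetical, and the sample text is made up for illustration.

```python
import re

SEPARATOR = "-" * 60  # same separator the saving cell writes after each answer


def extract_answers_without_separator(content):
    """Split saved Q/A text into answers, dropping the trailing separator line."""
    pattern = r"Q:\s*(.*?)\nA:\s*(.*?)\n" + re.escape(SEPARATOR) + r"(?=\n|\Z)"
    pairs = re.findall(pattern, content, re.DOTALL)
    return [answer.strip() for _, answer in pairs]


# Made-up sample in the same layout as the saved response files
sample = (
    f"Q: What is a BOM?\nA: A list of parts.\n{SEPARATOR}\n"
    f"Q: What is takt time?\nA: Available time divided by demand.\n{SEPARATOR}\n"
)
print(extract_answers_without_separator(sample))
```

An answer that itself contains a line of exactly 60 dashes would be cut short by this pattern, so a more distinctive separator may be preferable when saving the responses.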


Removing thinking tags from Qwen's answers, as these could bias the evaluation.

In [31]:
def remove_think_tags(text):
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

model_answers["Qwen"] = [remove_think_tags(ans) for ans in model_answers["Qwen"]]


Generating the evaluation prompt by adding the question, the reference answer, and the four answers given by the different models, anonymised and shuffled.

In [32]:
def get_prompt(question, reference_answer, answer1, answer2, answer3, answer4):
    # Fill the evaluation template; candidates A-D follow the shuffled model order
    return """You are an expert evaluator tasked with judging the quality of AI-generated answers in a specific technical domain. Below is a question, a reference answer, and four candidate answers produced by different models. Your goal is to assess how well each candidate answer matches the reference in terms of correctness, completeness, and domain relevance.

---

Question: {question}

Reference Answer: {reference_answer}

Candidate Answers:
A: {answer1}
B: {answer2}
C: {answer3}
D: {answer4}

Please evaluate each candidate answer using the following criteria:
1. **Correctness** Is the information factually accurate?
2. **Completeness** Does it fully answer the question?
3. **Domain relevance** Is it appropriate and useful in the given context?

Give each answer a score between 1 (poor) and 5 (excellent). Provide a brief explanation for each score.

Respond in the following format:

A: [score]  [short explanation]  
B: [score]  [short explanation]  
C: [score]  [short explanation]  
D: [score]  [short explanation]""".format(question=question,reference_answer=reference_answer,answer1=answer1,answer2=answer2,answer3=answer3,answer4=answer4)

List of all four models used to shuffle the answers in the evaluation process.

In [33]:
model_list=["Gemma","Qwen","Llama","DeepSeek"]
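The evaluation cell shuffles this list in place with the global `random` module, so the anonymised order differs on every run. If a reproducible order is wanted, a seeded generator is one option. The following is a minimal sketch under that assumption; the helper name `shuffled_order` and the per-question seed are hypothetical and not part of the original evaluation.

```python
import random


def shuffled_order(models, seed):
    """Return a shuffled copy of `models` that is reproducible for a given seed."""
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    order = list(models)       # copy so the input list is not modified
    rng.shuffle(order)
    return order


models = ["Gemma", "Qwen", "Llama", "DeepSeek"]
# Hypothetical choice: one seed per question index
for question_index in range(3):
    print(question_index, shuffled_order(models, seed=question_index))
```

Because the function returns a copy, the order used for each question can be stored next to the evaluator's response without an explicit `.copy()`.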


Running the evaluation using the selected evaluation model. For every question, the specific prompt is generated (question, reference answer, and the shuffled answers of the four models) and sent to the evaluation model. The responses of the evaluation model to the ten prompts are then saved to a file called *[EVALUATION_MODEL_NAME]_domain_relevance_evaluation.txt* in the folder *DomainRelevanceTestEvaluations*.

<span style="color:red">*Remark: Please make sure to have a valid Nebius access token to run the following code.*</span>

In [None]:
client = OpenAI(
    base_url="https://api.studio.nebius.com/v1/",
    api_key=os.environ.get("NEBIUS_TOKEN")
)

# Store Q&A results
evaluations = []

for i in range(len(questions)):
    question = questions[i]
    reference_answer = reference_answers[i]

    # Shuffle the model order so the evaluator cannot infer which model wrote which answer
    random.shuffle(model_list)

    # Candidates A-D follow the shuffled order; the order is stored with the result below
    answer1 = model_answers[model_list[0]][i]
    answer2 = model_answers[model_list[1]][i]
    answer3 = model_answers[model_list[2]][i]
    answer4 = model_answers[model_list[3]][i]

    evaluation_prompt = get_prompt(question, reference_answer, answer1, answer2, answer3, answer4)
    
    completion = client.chat.completions.create(
        model=MODEL_NAME,
        messages=[
            {
                "role": "system",
                "content": """SYSTEM_PROMPT"""
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": evaluation_prompt
                    }
                ]
            }
        ]
    )

    answer = completion.choices[0].message.content

    # Store result
    evaluations.append((question, answer, model_list.copy()))
    
    # Print to notebook (1-based numbering, matching the output of the response cell)
    print(f"Q{i + 1}: {question}\nA{i + 1}: {answer}\nM:{model_list}\n{'-'*60}\n")


# Save to file
filename = "DomainRelevanceTestEvaluations/" + MODEL_NAME.split("/")[-1] + "_domain_relevance_evaluation.txt"
with open(filename, "w", encoding="utf-8") as f:
    # "order" is the shuffled model list; the name avoids shadowing the built-in `list`
    for q, a, order in evaluations:
        f.write(f"Q: {q}\n\nA: {a}\n\nM: {order}\n{'-'*60}\n\n")

print(f"\nAll results saved to: {filename}")
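To turn the saved evaluations into per-model scores, each evaluator response has to be mapped back from the anonymous labels A to D to the model names through the stored order. How the scores were aggregated is not shown in this part of the notebook, so the following is only a minimal sketch. It assumes the evaluator follows the requested `A: [score] [short explanation]` format; the function name `parse_scores` and the sample response are hypothetical.

```python
import re

LABELS = ["A", "B", "C", "D"]  # candidate labels used in the evaluation prompt


def parse_scores(evaluation_text, order):
    """Return {model_name: score} for one evaluator response.

    `order` is the shuffled model list stored with the response:
    order[0] wrote candidate A, order[1] candidate B, and so on.
    """
    scores = {}
    for label, model in zip(LABELS, order):
        # Tolerate optional Markdown bold markers and square brackets around the score
        pattern = rf"^\s*\**{label}\**\s*:\**\s*\[?([1-5])"
        match = re.search(pattern, evaluation_text, re.MULTILINE)
        if match:
            scores[model] = int(match.group(1))
    return scores


# Made-up evaluator response and model order, for illustration only
sample_response = (
    "A: 4  Accurate but brief.\n"
    "B: [5]  Complete and precise.\n"
    "C: 3  Misses several steps.\n"
    "D: 2  Mostly off topic."
)
sample_order = ["Qwen", "Gemma", "DeepSeek", "Llama"]
print(parse_scores(sample_response, sample_order))
```

Labels without a recognisable score are simply left out of the result, so responses that deviate from the requested format can be detected by checking for missing models.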


    


### **2. Testing Ability to Generate Assembly Instructions**

To evaluate the ability of large language models to generate effective assembly instructions, a structured prompting evaluation is conducted. It also tests how different prompting strategies affect the quality, completeness, and clarity of the generated instructions in realistic manufacturing scenarios.

For this purpose, the two top-performing models from the previous chapter, Gemma-3-27B-it and Qwen3-235B-A22B, are tested. Both models are given five assembly tasks of varying complexity (side table, office chair, Raspberry Pi case, book case, kids' bicycle). Each task is presented four times, each time using a different prompting strategy (zero-shot, few-shot, instruction-tuned, evaluation-aware).
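This test design amounts to a nested loop over tasks and prompting strategies. The following is a minimal sketch of that loop, assuming a dictionary keyed first by task and then by strategy; the helper name `iter_prompts` and the placeholder prompt texts are hypothetical and stand in for the real prompts.

```python
def iter_prompts(prompts):
    """Yield (task, strategy, prompt) for every entry of a nested prompt dictionary."""
    for task, strategies in prompts.items():
        for strategy, prompt in strategies.items():
            yield task, strategy, prompt


tasks = [
    "Task 1 - Side Table",
    "Task 2 - Office Chair",
    "Task 3 - Raspberry Pi Case",
    "Task 4 - Book Case",
    "Task 5 - Kids Bicycle",
]
strategies = ["zero_shot", "few_shot", "instruction_tuned", "evaluation_aware"]

# Placeholder dictionary with the task/strategy shape described above
example_prompts = {
    task: {strategy: f"<{strategy} prompt for {task}>" for strategy in strategies}
    for task in tasks
}

all_prompts = list(iter_prompts(example_prompts))
print(len(all_prompts))  # 5 tasks x 4 strategies
```

Each yielded triple could then be sent to the selected model and the response stored under its task and strategy.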


#### 2.1 Prompting Test

Select one of the test models.

In [3]:
MODEL_NAME = "google/gemma-3-27b-it"
#MODEL_NAME = "Qwen/Qwen3-235B-A22B"

Dictionary with the prompts for each of the five assembly tasks and each of the four distinct prompting strategies (5 tasks with 4 prompts each = 20 prompts).

In [2]:
assembly_prompts = {
    "Task 1 - Side Table": {
        "zero_shot": (
            "Create a step-by-step assembly plan for a small side table consisting of the following parts:\n"
            "1. One tabletop with four pre-drilled holes on the bottom side of each corner (on the bottom face only, not going all the way through).\n"
            "2. Four legs each with one pre-drilled hole on the top.\n"
            "3. Four double ended screws."
        ),
        "few_shot": (
            "Here is an example of an assembly plan for a night-stand with the following parts: centre stack with two pre-drilled holes on each corner,"
            "a table top with two pre-drilled holes on each corner, four legs each with two pre-drilled holes in the centre and on the top, 16 socket screws:\n"
            "Step 1: Position the centre stack.\n"
            "Step 2: Position each leg and attach loosely to the centre stack using two screws per leg.\n"
            "Step 3: Position the table top aligning the screw holes.\n"
            "Step 4: Attach the table top to the assembly using two screws per corner.\n"
            "Step 5: Carefully set the table upright.\n"
            "Step 6: Ensure everything is aligned and tighten all screws securely.\n\n"
            "Now create a step-by-step assembly plan for a small side table consisting of the following parts:\n"
            "1. One tabletop with four pre-drilled holes on the bottom side of each corner (on the bottom face only, not going all the way through).\n"
            "2. Four legs each with one pre-drilled hole on the top.\n"
            "3. Four double ended screws."
        ),
        "instruction_tuned": (
            "You are a planner specialised in creating user-friendly assembly manuals. Your task is to write a clear, step-by-step plan for assembling "
            "a small side table consisting of the following parts:\n"
            "1. One tabletop with four pre-drilled holes on the bottom side of each corner (on the bottom face only, not going all the way through).\n"
            "2. Four legs each with one pre-drilled hole on the top.\n"
            "3. Four double ended screws.\n"
            "Make the instructions easy to follow for beginners."
        ),
        "evaluation_aware": (
            "Create a high-quality, step-by-step assembly plan for a small side table with the following parts:\n"
            "1. One tabletop with four pre-drilled holes on the bottom side of each corner (on the bottom face only, not going all the way through).\n"
            "2. Four legs each with one pre-drilled hole on the top.\n"
            "3. Four double ended screws.\n"
            "Your response will be evaluated and scored based on clarity, completeness, and usefulness, so ensure the instructions are well-structured and easy to follow."
        ),
    },
    "Task 2 - Office Chair": {
        "zero_shot": (
            "Create a step-by-step assembly plan for an office chair consisting of the following parts:\n"
            "1. A seat with four pre-drilled holes on the bottom center and two on each side of the bottom (left and right).\n"
            "2. A backrest with two pre-drilled holes on each side on the back (left and right).\n"
            "3. Two armrests, each with two screw holes on the back (for the backrest) and two on the bottom (for the seat).\n"
            "4. A leg base with a plug-in hole at the center and five radial arms, each with a threaded hole.\n"
            "5. Five caster wheels, each with a pre-attached screw for mounting.\n"
            "6. A gas lift cylinder with a plug-in fitting.\n"
            "7. A seat bottom mechanism with a center hole for the cylinder and four pre-drilled mounting holes on the corners.\n"
            "8. Four short bolts (used to attach the seat mechanism to the seat bottom).\n"
            "9. Eight long bolts (used to attach the armrests to the seat and backrest — four of these are used with washers for the seat attachment).\n"
            "10. Four washers, used with the four long bolts that secure the bottom of the armrests to the seat.\n"
        ),
        "few_shot": (
            "Here is an example of an assembly plan for a night-stand with the following parts: centre stack with two pre-drilled holes on each corner,"
            "a table top with two pre-drilled holes on each corner, four legs each with two pre-drilled holes in the centre and on the top, 16 socket screws:\n"
            "Step 1: Position the centre stack.\n"
            "Step 2: Position each leg and attach loosely to the centre stack using two screws per leg.\n"
            "Step 3: Position the table top aligning the screw holes.\n"
            "Step 4: Attach the table top to the assembly using two screws per corner.\n"
            "Step 5: Carefully set the table upright.\n"
            "Step 6: Ensure everything is aligned and tighten all screws securely.\n\n"
            "Now create a step-by-step assembly plan for an office chair consisting of the following parts:\n"
            "1. A seat with four pre-drilled holes on the bottom center and two on each side of the bottom (left and right).\n"
            "2. A backrest with two pre-drilled holes on each side on the back (left and right).\n"
            "3. Two armrests, each with two screw holes on the back (for the backrest) and two on the bottom (for the seat).\n"
            "4. A leg base with a plug-in hole at the center and five radial arms, each with a threaded hole.\n"
            "5. Five caster wheels, each with a pre-attached screw for mounting.\n"
            "6. A gas lift cylinder with a plug-in fitting.\n"
            "7. A seat bottom mechanism with a center hole for the cylinder and four pre-drilled mounting holes on the corners.\n"
            "8. Four short bolts (used to attach the seat mechanism to the seat bottom).\n"
            "9. Eight long bolts (used to attach the armrests to the seat and backrest — four of these are used with washers for the seat attachment).\n"
            "10. Four washers, used with the four long bolts that secure the bottom of the armrests to the seat."    
        ),
        "instruction_tuned": (
            "You are a planner specialised in creating user-friendly assembly manuals. Your task is to write a clear, step-by-step plan for assembling "
            "an office chair consisting of the following parts:\n"
            "1. A seat with four pre-drilled holes on the bottom center and two on each side of the bottom (left and right).\n"
            "2. A backrest with two pre-drilled holes on each side on the back (left and right).\n"
            "3. Two armrests, each with two screw holes on the back (for the backrest) and two on the bottom (for the seat).\n"
            "4. A leg base with a plug-in hole at the center and five radial arms, each with a threaded hole.\n"
            "5. Five caster wheels, each with a pre-attached screw for mounting.\n"
            "6. A gas lift cylinder with a plug-in fitting.\n"
            "7. A seat bottom mechanism with a center hole for the cylinder and four pre-drilled mounting holes on the corners.\n"
            "8. Four short bolts (used to attach the seat mechanism to the seat bottom).\n"
            "9. Eight long bolts (used to attach the armrests to the seat and backrest — four of these are used with washers for the seat attachment).\n"
            "10. Four washers, used with the four long bolts that secure the bottom of the armrests to the seat.\n"
            "Make the instructions easy to follow for beginners."
        ),
        "evaluation_aware": (
            "Create a high-quality, step-by-step assembly plan for an office chair with the following parts:\n"
            "1. A seat with four pre-drilled holes on the bottom center and two on each side of the bottom (left and right).\n"
            "2. A backrest with two pre-drilled holes on each side on the back (left and right).\n"
            "3. Two armrests, each with two screw holes on the back (for the backrest) and two on the bottom (for the seat).\n"
            "4. A leg base with a plug-in hole at the center and five radial arms, each with a threaded hole.\n"
            "5. Five caster wheels, each with a pre-attached screw for mounting.\n"
            "6. A gas lift cylinder with a plug-in fitting.\n"
            "7. A seat bottom mechanism with a center hole for the cylinder and four pre-drilled mounting holes on the corners.\n"
            "8. Four short bolts (used to attach the seat mechanism to the seat bottom).\n"
            "9. Eight long bolts (used to attach the armrests to the seat and backrest — four of these are used with washers for the seat attachment).\n"
            "10. Four washers, used with the four long bolts that secure the bottom of the armrests to the seat.\n"
            "Your response will be evaluated and scored based on clarity, completeness, and usefulness, so ensure the instructions are well-structured and easy to follow."
        ),
    },
    "Task 3 - Raspberry Pi Case": {
        "zero_shot": (
            "Create a step-by-step assembly plan for a Raspberry Pi case consisting of the following parts:\n"
            "1. One self-adhesive heat sink (to be attached to the CPU on the Raspberry Pi board).\n"
            "2. Four self-adhesive silicone feet for the case base (with marked positions on the bottom).\n"
            "3. Three-part plastic case, consisting of: base (with plug-in connector for the middle panel and cutouts for ports), middle panel with a pre-installed cooling fan and attached "
            "power cable (to connect to the Raspberry Pi's “FAN” header), lid which clips onto the middle panel; the three case parts come pre-clipped together, the Raspberry Pi CPU "
            "(not included in Assembly kit) should be positioned between base and middle panel."
        ),
        "few_shot": (
            "Here is an example of an assembly plan for a night-stand with the following parts: centre stack with two pre-drilled holes on each corner,"
            "a table top with two pre-drilled holes on each corner, four legs each with two pre-drilled holes in the centre and on the top, 16 socket screws:\n"
            "Step 1: Position the centre stack.\n"
            "Step 2: Position each leg and attach loosely to the centre stack using two screws per leg.\n"
            "Step 3: Position the table top aligning the screw holes.\n"
            "Step 4: Attach the table top to the assembly using two screws per corner.\n"
            "Step 5: Carefully set the table upright.\n"
            "Step 6: Ensure everything is aligned and tighten all screws securely.\n\n"
            "Now create a step-by-step assembly plan for a Raspberry Pi case consisting of the following parts:\n"
            "1. One self-adhesive heat sink (to be attached to the CPU on the Raspberry Pi board).\n"
            "2. Four self-adhesive silicone feet for the case base (with marked positions on the bottom).\n"
            "3. Three-part plastic case, consisting of: base (with plug-in connector for the middle panel and cutouts for ports), middle panel with a pre-installed cooling fan and attached "
            "power cable (to connect to the Raspberry Pi's “FAN” header), lid which clips onto the middle panel; the three case parts come pre-clipped together, the Raspberry Pi CPU "
            "(not included in Assembly kit) should be positioned between base and middle panel."
        ),
        "instruction_tuned": (
            "You are a planner specialised in creating user-friendly assembly manuals. Your task is to write a clear, step-by-step plan for assembling "
            "a Raspberry Pi case consisting of the following parts:\n"
            "1. One self-adhesive heat sink (to be attached to the CPU on the Raspberry Pi board).\n"
            "2. Four self-adhesive silicone feet for the case base (with marked positions on the bottom).\n"
            "3. Three-part plastic case, consisting of: base (with plug-in connector for the middle panel and cutouts for ports), middle panel with a pre-installed cooling fan and attached "
            "power cable (to connect to the Raspberry Pi's “FAN” header), lid which clips onto the middle panel; the three case parts come pre-clipped together, the Raspberry Pi CPU "
            "(not included in Assembly kit) should be positioned between base and middle panel.\n"
            "Make the instructions easy to follow for beginners."
        ),
        "evaluation_aware": (
            "Create a high-quality, step-by-step assembly plan for a Raspberry Pi case with the following parts:\n"
            "1. One self-adhesive heat sink (to be attached to the CPU on the Raspberry Pi board).\n"
            "2. Four self-adhesive silicone feet for the case base (with marked positions on the bottom).\n"
            "3. Three-part plastic case, consisting of: base (with plug-in connector for the middle panel and cutouts for ports), middle panel with a pre-installed cooling fan and attached "
            "power cable (to connect to the Raspberry Pi's “FAN” header), lid which clips onto the middle panel; the three case parts come pre-clipped together, the Raspberry Pi CPU "
            "(not included in Assembly kit) should be positioned between base and middle panel.\n"
            "Your response will be evaluated and scored based on clarity, completeness, and usefulness, so ensure the instructions are well-structured and easy to follow."
        ),
    },
    "Task 4 - Book Case": {
        "zero_shot": (
            "Create a step-by-step assembly plan for a book case consisting of the following parts:\n"
            "1. Three horizontal shelves each with: two wooden plug holes on each end (for wooden dowels), two screw plug holes on each end (for cam screws) and corresponding cam lock holes on the underside.\n"
            "2. One skirting/base board with two wooden plug holes on each end (for wooden dowels).\n"
            "3. Two side panels each with: eight plug holes (for wooden dowels from shelves and skirting), six shallow threaded holes only on the inner face (not going all the way through) (for cam screws) and a vertical groove on the inner rear side (to slide in the back panel).\n"
            "4. One back panel.\n"
            "5. 16 wooden dowels (to connect shelves and skirting board to side panels).\n"
            "6. 12 cam screws (to connect shelves to side panels).\n"
            "7. 12 cam locks (to lock shelves in place via cam screws).\n"
            "8. 18 nails (to secure the back panel to the horizonal shelves).\n"
            "9. Optional: wall-mounting brackets (for safety and stability)."
        ),
        "few_shot": (
            "Here is an example of an assembly plan for a night-stand with the following parts: centre stack with two pre-drilled holes on each corner,"
            "a table top with two pre-drilled holes on each corner, four legs each with two pre-drilled holes in the centre and on the top, 16 socket screws:\n"
            "Step 1: Position the centre stack.\n"
            "Step 2: Position each leg and attach loosely to the centre stack using two screws per leg.\n"
            "Step 3: Position the table top aligning the screw holes.\n"
            "Step 4: Attach the table top to the assembly using two screws per corner.\n"
            "Step 5: Carefully set the table upright.\n"
            "Step 6: Ensure everything is aligned and tighten all screws securely.\n\n"
            "Now create a step-by-step assembly plan for a book case consisting of the following parts:\n"
            "1. Three horizontal shelves each with: two wooden plug holes on each end (for wooden dowels), two screw plug holes on each end (for cam screws) and corresponding cam lock holes on the underside.\n"
            "2. One skirting/base board with two wooden plug holes on each end (for wooden dowels).\n"
            "3. Two side panels each with: eight plug holes (for wooden dowels from shelves and skirting), six shallow threaded holes only on the inner face (not going all the way through) (for cam screws) and a vertical groove on the inner rear side (to slide in the back panel).\n"
            "4. One back panel.\n"
            "5. 16 wooden dowels (to connect shelves and skirting board to side panels).\n"
            "6. 12 cam screws (to connect shelves to side panels).\n"
            "7. 12 cam locks (to lock shelves in place via cam screws).\n"
            "8. 18 nails (to secure the back panel to the horizonal shelves).\n"
            "9. Optional: wall-mounting brackets (for safety and stability)."
        ),
        "instruction_tuned": (
            "You are a planner specialised in creating user-friendly assembly manuals. Your task is to write a clear, step-by-step plan for assembling "
            "a book case consisting of the following parts:\n"
            "1. Three horizontal shelves each with: two wooden plug holes on each end (for wooden dowels), two screw plug holes on each end (for cam screws) and corresponding cam lock holes on the underside.\n"
            "2. One skirting/base board with two wooden plug holes on each end (for wooden dowels).\n"
            "3. Two side panels each with: eight plug holes (for wooden dowels from shelves and skirting), six shallow threaded holes only on the inner face (not going all the way through) (for cam screws) and a vertical groove on the inner rear side (to slide in the back panel).\n"
            "4. One back panel.\n"
            "5. 16 wooden dowels (to connect shelves and skirting board to side panels).\n"
            "6. 12 cam screws (to connect shelves to side panels).\n"
            "7. 12 cam locks (to lock shelves in place via cam screws).\n"
            "8. 18 nails (to secure the back panel to the horizonal shelves).\n"
            "9. Optional: wall-mounting brackets (for safety and stability).\n"
            "Make the instructions easy to follow for beginners."
        ),
        "evaluation_aware": (
            "Create a high-quality, step-by-step assembly plan for a book case with the following parts:\n"
            "1. Three horizontal shelves each with: two wooden plug holes on each end (for wooden dowels), two screw plug holes on each end (for cam screws) and corresponding cam lock holes on the underside.\n"
            "2. One skirting/base board with two wooden plug holes on each end (for wooden dowels).\n"
            "3. Two side panels each with: eight plug holes (for wooden dowels from shelves and skirting), six shallow threaded holes only on the inner face (not going all the way through) (for cam screws) and a vertical groove on the inner rear side (to slide in the back panel).\n"
            "4. One back panel.\n"
            "5. 16 wooden dowels (to connect shelves and skirting board to side panels).\n"
            "6. 12 cam screws (to connect shelves to side panels).\n"
            "7. 12 cam locks (to lock shelves in place via cam screws).\n"
            "8. 18 nails (to secure the back panel to the horizonal shelves).\n"
            "9. Optional: wall-mounting brackets (for safety and stability).\n"
            "Your response will be evaluated and scored based on clarity, completeness, and usefulness, so ensure the instructions are well-structured and easy to follow."
        ),
    },
    "Task 5 - Kids Bicycle": {
        "zero_shot": (
            "Create a step-by-step assembly plan for a kids bicycle consisting of the following parts:\n"
            "1. Main frame with front fork, rear wheel, chain, and chain guard pre-installed.\n"
            "2. Front fork, including one bolt (10 mm), nut, and washer for front fender crown installation; fork dropout for front wheel axle; two bolts (4 mm hex) on both sides of the dropout for attaching fender stays.\n"
            "3. Saddle with seat post, includes one pre-installed seat clamp with bolt (5 mm hex); seat post marked with a “minimum insertion line” that must not remain visible after installation.\n"
            "4. Front wheel with axle, two axle nuts (15 mm hex) with two washers, pre-installed on the axle (washer to be placed outside of nut).\n"
            "5. Handlebar with stem, marked with “minimum insertion line” (must be fully inserted during installation), top stem bolt (6 mm hex) pre-installed.\n"
            "6. Front fender with fender stays (braces); one bolt and two flat washers for fork crown installation; two bolts (4 mm hex) and one nut for attaching stays to fork eyelets.\n"
            "7. Two pedals, each marked with “L” (left) or “R” (right) on the spindle; threaded axle requires clockwise installation for the right pedal and counterclockwise for the left.\n"
            "8. Optional: training wheels, each includes one square mounting plate, one bolt (15 mm hex) with washer, and one fender strut to reattach the rear fender above the training wheel after installation.\n"
            "Important assembly dependency: The front fender and its stays must be installed into the fork before the front wheel is placed, as the wheel blocks the necessary clearance for positioning the fender inside the fork."
        ),
        "few_shot": (
            "Here is an example of an assembly plan for a night-stand with the following parts: centre stack with two pre-drilled holes on each corner,"
            "a table top with two pre-drilled holes on each corner, four legs each with two pre-drilled holes in the centre and on the top, 16 socket screws:\n"
            "Step 1: Position the centre stack.\n"
            "Step 2: Position each leg and attach loosely to the centre stack using two screws per leg.\n"
            "Step 3: Position the table top aligning the screw holes.\n"
            "Step 4: Attach the table top to the assembly using two screws per corner.\n"
            "Step 5: Carefully set the table upright.\n"
            "Step 6: Ensure everything is aligned and tighten all screws securely.\n\n"
            "Now create a step-by-step assembly plan for a kids bicycle consisting of the following parts:\n"
            "1. Main frame with front fork, rear wheel, chain, and chain guard pre-installed.\n"
            "2. Front fork, including one bolt (10 mm), nut, and washer for front fender crown installation; fork dropout for front wheel axle; two bolts (4 mm hex) on both sides of the dropout for attaching fender stays.\n"
            "3. Saddle with seat post, includes one pre-installed seat clamp with bolt (5 mm hex); seat post marked with a “minimum insertion line” that must not remain visible after installation.\n"
            "4. Front wheel with axle, two axle nuts (15 mm hex) with two washers, pre-installed on the axle (washer to be placed outside of nut).\n"
            "5. Handlebar with stem, marked with “minimum insertion line” (must be fully inserted during installation), top stem bolt (6 mm hex) pre-installed.\n"
            "6. Front fender with fender stays (braces); one bolt and two flat washers for fork crown installation; two bolts (4 mm hex) and one nut for attaching stays to fork eyelets.\n"
            "7. Two pedals, each marked with “L” (left) or “R” (right) on the spindle; threaded axle requires clockwise installation for the right pedal and counterclockwise for the left.\n"
            "8. Optional: training wheels, each includes one square mounting plate, one bolt (15 mm hex) with washer, and one fender strut to reattach the rear fender above the training wheel after installation.\n"
            "Important assembly dependency: The front fender and its stays must be installed into the fork before the front wheel is placed, as the wheel blocks the necessary clearance for positioning the fender inside the fork."
        ),
        "instruction_tuned": (
            "You are a planner specialised in creating user-friendly assembly manuals. Your task is to write a clear, step-by-step plan for assembling "
            "a kids bicycle consisting of the following parts:\n"
            "1. Main frame with front fork, rear wheel, chain, and chain guard pre-installed.\n"
            "2. Front fork, including one bolt (10 mm), nut, and washer for front fender crown installation; fork dropout for front wheel axle; two bolts (4 mm hex) on both sides of the dropout for attaching fender stays.\n"
            "3. Saddle with seat post, includes one pre-installed seat clamp with bolt (5 mm hex); seat post marked with a “minimum insertion line” that must not remain visible after installation.\n"
            "4. Front wheel with axle, two axle nuts (15 mm hex) with two washers, pre-installed on the axle (washer to be placed outside of nut).\n"
            "5. Handlebar with stem, marked with “minimum insertion line” (must be fully inserted during installation), top stem bolt (6 mm hex) pre-installed.\n"
            "6. Front fender with fender stays (braces); one bolt and two flat washers for fork crown installation; two bolts (4 mm hex) and one nut for attaching stays to fork eyelets.\n"
            "7. Two pedals, each marked with “L” (left) or “R” (right) on the spindle; threaded axle requires clockwise installation for the right pedal and counterclockwise for the left.\n"
            "8. Optional: training wheels, each includes one square mounting plate, one bolt (15 mm hex) with washer, and one fender strut to reattach the rear fender above the training wheel after installation.\n"
            "Important assembly dependency: The front fender and its stays must be installed into the fork before the front wheel is placed, as the wheel blocks the necessary clearance for positioning the fender inside the fork.\n"
            "Make the instructions easy to follow for beginners."
        ),
        "evaluation_aware": (
            "Create a high-quality, step-by-step assembly plan for a kids bicycle with the following parts:\n"
            "1. Main frame with front fork, rear wheel, chain, and chain guard pre-installed.\n"
            "2. Front fork, including one bolt (10 mm), nut, and washer for front fender crown installation; fork dropout for front wheel axle; two bolts (4 mm hex) on both sides of the dropout for attaching fender stays.\n"
            "3. Saddle with seat post, includes one pre-installed seat clamp with bolt (5 mm hex); seat post marked with a “minimum insertion line” that must not remain visible after installation.\n"
            "4. Front wheel with axle, two axle nuts (15 mm hex) with two washers, pre-installed on the axle (washer to be placed outside of nut).\n"
            "5. Handlebar with stem, marked with “minimum insertion line” (must be fully inserted during installation), top stem bolt (6 mm hex) pre-installed.\n"
            "6. Front fender with fender stays (braces); one bolt and two flat washers for fork crown installation; two bolts (4 mm hex) and one nut for attaching stays to fork eyelets.\n"
            "7. Two pedals, each marked with “L” (left) or “R” (right) on the spindle; threaded axle requires clockwise installation for the right pedal and counterclockwise for the left.\n"
            "8. Optional: training wheels, each includes one square mounting plate, one bolt (15 mm hex) with washer, and one fender strut to reattach the rear fender above the training wheel after installation.\n"
            "Important assembly dependency: The front fender and its stays must be installed into the fork before the front wheel is placed, as the wheel blocks the necessary clearance for positioning the fender inside the fork.\n"
            "Your response will be evaluated and scored based on clarity, completeness, and usefulness, so ensure the instructions are well-structured and easy to follow."
        ),
    }
}
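The four prompt conditions of a task share the same parts list and differ only in the framing text around it. As a minimal sketch (not the code used for the paper), the variants could be derived from a single parts list, which keeps the conditions in sync when a part description changes. The helper `build_prompts`, its parameters and the shortened side-table parts list are assumptions made for illustration; the framing sentences are copied from the dictionary above.

```python
def build_prompts(product, parts, few_shot_example, notes=""):
    """Hypothetical helper: derive the four prompt conditions from one parts list."""
    # Number the parts the same way as in the hand-written prompts ("1. ...", "2. ...")
    parts_text = "\n".join(f"{n}. {part}" for n, part in enumerate(parts, 1))
    if notes:
        # Optional trailing remark, e.g. an assembly dependency
        parts_text += "\n" + notes
    return {
        "zero_shot": (
            f"Create a step-by-step assembly plan for a {product} "
            f"consisting of the following parts:\n{parts_text}"
        ),
        "few_shot": (
            f"{few_shot_example}\n\n"
            f"Now create a step-by-step assembly plan for a {product} "
            f"consisting of the following parts:\n{parts_text}"
        ),
        "instruction_tuned": (
            "You are a planner specialised in creating user-friendly assembly manuals. "
            "Your task is to write a clear, step-by-step plan for assembling "
            f"a {product} consisting of the following parts:\n{parts_text}\n"
            "Make the instructions easy to follow for beginners."
        ),
        "evaluation_aware": (
            f"Create a high-quality, step-by-step assembly plan for a {product} "
            f"with the following parts:\n{parts_text}\n"
            "Your response will be evaluated and scored based on clarity, completeness, "
            "and usefulness, so ensure the instructions are well-structured and easy to follow."
        ),
    }


# Shortened, illustrative parts list (not the exact wording used in the tests)
side_table_parts = [
    "One tabletop with four pre-drilled holes.",
    "Four legs each with one pre-drilled hole on the top.",
    "Four double ended screws.",
]
# Placeholder for the worked night-stand example used in the few-shot condition
example = "Here is an example of an assembly plan for a night-stand (shortened)."

prompts = build_prompts("small side table", side_table_parts, example)
print(prompts["zero_shot"])
```

The hand-written dictionary above remains the authoritative record of the prompts that were actually sent to the models.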


Establishing the connection to the specified model using the Nebius provider. Questions and resulting answers are saved in a file named *[MODEL_NAME]_prompting_test_responses.txt* in the folder *PromptingTestResponses*.

<span style="color:red">*Remark: Please make sure to have a valid Nebius access token (read from the `NEBIUS_TOKEN` environment variable) to run the following code.*</span>

In [4]:
client = OpenAI(
    base_url="https://api.studio.nebius.com/v1/",
    api_key=os.environ.get("NEBIUS_TOKEN")
)


# Store Q&A results
qa_pairs = []


for i, task in enumerate(assembly_prompts, 1):
    print(i)
    # Send each prompt condition of the current task to the model
    for prompt_case, question in assembly_prompts[task].items():
        
        completion = client.chat.completions.create(
            model=MODEL_NAME,
            messages=[
                {
                    "role": "system",
                    "content": """SYSTEM_PROMPT"""
                },
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": question
                        }
                    ]
                }
            ]
        )

        answer = completion.choices[0].message.content

        
        # Store the prompt (tagged with task number and prompt condition) together with the answer
        qa_pairs.append((f"TASK {i}:\n{prompt_case}:\n{question}", answer))

    # Print all question-answer pairs collected so far
    print(qa_pairs)

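# Create the output folder if it does not exist yet, then write all question-answer pairs to file
os.makedirs("PromptingTestResponses", exist_ok=True)
filename = "PromptingTestResponses/"+MODEL_NAME.split("/")[-1] + "_prompting_test_responses.txt"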
with open(filename, "w", encoding="utf-8") as f:
    for q, a in qa_pairs:
        f.write(f"{q}\n\nA: {a}\n{'-'*60}\n\n")

print(f"\n✅ All results saved to: {filename}")



1
[('TASK 1:\nzero_shot:\nCreate a step-by-step assembly plan for a small side table consisting of the following parts:\n1. One tabletop with four pre-drilled holes on the bottom side of each corner (on the bottom face only, not going all the way through).\n2. Four legs each with one pre-drilled hole on the top.\n3. Four double ended screws.', "## Side Table Assembly Plan - Step-by-Step\n\nThis plan details how to assemble the side table with the provided parts. **Please read all steps before beginning.**\n\n**Tools Required:**\n\n* Phillips Head Screwdriver (size appropriate for the screws - test before applying force!) - *Crucially important to use the right size to avoid stripping the screw heads.*\n* (Optional) Soft cloth or blanket to protect the tabletop surface during assembly.\n\n**Parts List:**\n\n* 1 x Tabletop\n* 4 x Legs\n* 4 x Double-Ended Screws\n\n\n\n**Assembly Steps:**\n\n**Step 1: Preparation & Protection (2 minutes)**\n\n* **Clear Workspace:** Ensure you have a clean

#### 2.2 Evaluation of Prompting Test

For the evaluation of the responses, the same peer evaluation method as in the previous chapter is employed. The four initially selected models serve as evaluation models and are instructed to compare the answers of both test models pairwise for each task and prompt strategy. The assessment is based on the clarity, completeness and correctness of the responses, with a score from 1 to 5 for every criterion as well as an averaged total score.

----------------------------------

List of models used for the peer evaluation. Each evaluation model can be selected in turn by uncommenting the corresponding line.

In [25]:
#MODEL_NAME = "google/gemma-3-27b-it"
#MODEL_NAME = "deepseek-ai/DeepSeek-V3-0324"
#MODEL_NAME = "Qwen/Qwen3-235B-A22B"
MODEL_NAME = "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1"

For the evaluation, the models are instructed to compare the results of both models per task and prompting condition using the following prompt.

In [6]:
def get_prompt(task,answer1,answer2):
    return f"""You are evaluating two step-by-step assembly plans for the task: {task}. 
    
    Each plan was generated by a different large language model. 
    
    Candidate Answers:
    Response A: {answer1}
    Response B: {answer2}
    
    Evaluate both responses based on three criteria:
    - Clarity
    - Completeness
    - Correctness 
    
    Score each response on a scale from 1 (poor) to 5 (excellent) for each criterion. Then calculate the total score by summing the three individual scores.
    Return your ratings in the format: 

    Respond in the following format:
    A: [total score] - [clarity score] [completeness score] [correctness score] 
    B: [total score] - [clarity score] [completeness score] [correctness score]

    Do not indicate which response is better overall."""
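The reply format requested in the prompt (`A: [total score] - [clarity score] [completeness score] [correctness score]`) can be read back automatically. The following is a minimal sketch of such a parser, not the procedure used for the paper. The names `SCORE_LINE` and `parse_scores` are hypothetical, and the tolerance for optional square brackets around the numbers is an assumption about how evaluation models may format their reply.

```python
import re

# One score line per response label, e.g. "A: 13 - 5 4 4" or "B: [11] - [4] [4] [3]".
# Square brackets are treated as optional (assumption about the evaluator's formatting).
SCORE_LINE = re.compile(
    r"^\s*([AB]):\s*\[?(\d+)\]?\s*-\s*\[?(\d)\]?\s+\[?(\d)\]?\s+\[?(\d)\]?",
    re.MULTILINE,
)


def parse_scores(reply):
    """Return {"A": {...}, "B": {...}} with the three criterion scores and totals."""
    scores = {}
    for label, total, clarity, completeness, correctness in SCORE_LINE.findall(reply):
        criteria = {
            "clarity": int(clarity),
            "completeness": int(completeness),
            "correctness": int(correctness),
        }
        scores[label] = {
            **criteria,
            # Total as stated by the evaluation model
            "reported_total": int(total),
            # Total recomputed from the three criteria, useful as a consistency check
            "total": sum(criteria.values()),
        }
    return scores


reply = "A: 13 - 5 4 4\nB: [11] - [4] [4] [3]"
scores = parse_scores(reply)
print(scores["A"]["total"], scores["B"]["total"])
```

Comparing `reported_total` with the recomputed `total` is one way to spot replies in which the evaluation model made an arithmetic slip.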

Select the model response files and prepare them for evaluation.

In [8]:
model_answer_files = {
    "Gemma": "PromptingTestResponses/gemma-3-27b-it_prompting_test_responses.txt",
    "Qwen": "PromptingTestResponses/Qwen3-235B-A22B_prompting_test_responses.txt",
}

Removing the thinking tags from Qwen's answers, as these could bias the evaluation.

In [7]:
import re

def remove_think_tags(text):
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
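As a brief illustration of what this step does, the sketch below applies the same pattern to a made-up answer. The sample text is invented for illustration, and the function is repeated only so that the snippet runs on its own.

```python
import re


def remove_think_tags(text):
    # Same pattern as in the cell above: drop everything between <think> and </think>,
    # including line breaks (re.DOTALL), then trim surrounding whitespace.
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()


# Invented example of an answer that starts with a reasoning block
raw = "<think>\nThe legs go on first...\n</think>\n\nStep 1: Attach the legs."
cleaned = remove_think_tags(raw)
print(cleaned)  # Step 1: Attach the legs.
```

Note that the pattern only matches complete pairs of tags; an opening `<think>` without a closing tag would be left in the text.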

Extracting all answers from the response file and creating a list of answers.

In [9]:
import re

def extract_answers(filepath):
    with open(filepath, 'r', encoding='utf-8') as file:
        content = file.read()
    
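    # Each record has the form "TASK <n>:\n<prompt case>:\n<question>\n\nA: <answer>\n"
    # and is terminated by a line of 60 dashes
    q_a_pairs = re.findall(r"^TASK\s\d+:\n(.*?):\n(.*?)^A:\s*(.*?)^[-]{60}", content, re.DOTALL | re.MULTILINE)
    
    # Keep only the answers, in file order
    answers = [answer.strip() for _prompt_case, _question, answer in q_a_pairs]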
    
    return answers
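The regular expression relies on the record layout written by the response cell earlier in this chapter. The sketch below rebuilds two records in that layout in memory and applies the same pattern, to show which part of a record ends up in which group. The helper `format_record` and the sample questions and answers are invented for illustration.

```python
import re

# Same pattern as in extract_answers: prompt case, question, answer
PATTERN = re.compile(
    r"^TASK\s\d+:\n(.*?):\n(.*?)^A:\s*(.*?)^[-]{60}",
    re.DOTALL | re.MULTILINE,
)


def format_record(task_no, prompt_case, question, answer):
    # Hypothetical helper mirroring the layout used when the responses are saved
    return f"TASK {task_no}:\n{prompt_case}:\n{question}\n\nA: {answer}\n{'-' * 60}\n\n"


content = (
    format_record(1, "zero_shot",
                  "Create a plan for a side table:\n1. One tabletop.",
                  "Step 1: Attach the legs.")
    + format_record(1, "few_shot",
                    "Now create a plan for a side table.",
                    "Step 1: Unpack all parts.")
)

records = PATTERN.findall(content)
prompt_cases = [case for case, _question, _answer in records]
answers = [answer.strip() for _case, _question, answer in records]
print(prompt_cases)
print(answers)
```

Because the answer group ends at the first line that starts with 60 dashes, an answer that itself contains such a line would be cut short; shorter Markdown rules such as `---` are not affected.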



List of both models used to shuffle the answers in the evaluation process.

In [10]:
model_list=["Qwen","Gemma"]
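Shuffling this list decides which model is presented as Response A and which as Response B, so the order has to be recorded in order to attribute the scores afterwards. The following minimal sketch shows the idea. The fixed seed, the names `label_to_model` and `totals_by_label`, and the example totals are assumptions made for illustration only.

```python
import random

model_list = ["Qwen", "Gemma"]

# Seeded generator used only to make this illustration repeatable;
# the evaluation itself shuffles without a fixed seed.
rng = random.Random(0)

# Shuffle a copy: position 0 is shown as Response A, position 1 as Response B
order = model_list.copy()
rng.shuffle(order)
label_to_model = dict(zip(["A", "B"], order))

# Invented totals, as they might be read from an evaluator reply
totals_by_label = {"A": 13, "B": 11}

# Map the anonymous labels back to the model names
totals_by_model = {
    label_to_model[label]: total for label, total in totals_by_label.items()
}
print(label_to_model)
print(totals_by_model)
```

In the evaluation cell, the recorded order is stored together with each evaluator reply (the `M:` entry in the output file), which serves the same purpose as `label_to_model` here.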

Running the evaluation using the selected evaluation model. For every task and prompt strategy, the specific prompt is generated (using both models' responses, shuffled and anonymised) and sent to the evaluation model. The responses of the evaluation model to the 20 evaluation prompts are then saved to a file called *[EVALUATION_MODEL_NAME]_prompting_test_evaluation.txt* in the folder *PromptingTestEvaluations*.

<span style="color:red">*Remark: Please make sure to have a valid Nebius access token (read from the `NEBIUS_TOKEN` environment variable) to run the following code.*</span>

In [26]:
client = OpenAI(
    base_url="https://api.studio.nebius.com/v1/",
    api_key=os.environ.get("NEBIUS_TOKEN")
)

# Store Q&A results
evaluations = []

#get model answers from both models
model_answers = {model: extract_answers(path) for model, path in model_answer_files.items()}
model_answers["Qwen"] = [remove_think_tags(ans) for ans in model_answers["Qwen"]]



for i,assembly_task_name in enumerate(assembly_prompts):
    
    for k,prompt_task_name in enumerate(assembly_prompts[assembly_task_name],0):
        
        # Get the task prompt for which the evaluation prompt is generated
        task = assembly_prompts[assembly_task_name][prompt_task_name]
        
        # Shuffle the order of Gemma and Qwen so that the position does not reveal the model
        random.shuffle(model_list)
        # Answers are stored task by task with four prompt conditions each, hence index i*4+k
        answer1=model_answers[model_list[0]][i*4+k]
        answer2=model_answers[model_list[1]][i*4+k]

        # Generate the evaluation prompt with the anonymised and shuffled answers
        evaluation_prompt=get_prompt(task,answer1,answer2)

        completion = client.chat.completions.create(
            model=MODEL_NAME,
            messages=[
                {
                    "role": "system",
                    "content": """SYSTEM_PROMPT"""
                },
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": evaluation_prompt
                        }
                    ]
                }
            ]
        )

        answer = completion.choices[0].message.content

        # Store result
        evaluations.append((assembly_task_name,prompt_task_name,task,answer,model_list.copy()))
        
        # Print to notebook
        print(f"{assembly_task_name}: {prompt_task_name}: {task}\nA: {answer}\nM:{model_list}\n{'-'*60}\n")


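# Save to file (the output folder is created if it does not exist yet)
os.makedirs("PromptingTestEvaluations", exist_ok=True)
filename = "PromptingTestEvaluations/"+MODEL_NAME.split("/")[-1] + "_prompting_test_evaluation.txt"
with open(filename, "w", encoding="utf-8") as f:
    # "order" is the shuffled model order, i.e. which model was shown as Response A and B
    for a_t, p_t, t, a, order in evaluations:
        f.write(f"{a_t}: {p_t}:\n {t}\n\nA: {a}\n\nM:{order}\n\n{'-'*60}\n\n\n\n")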

print(f"\n All results saved to: {filename}") 



task: Task 1 - Side Table
prompt condition: 0
prompt condition: 1
prompt condition: 2
prompt condition: 3
task: Task 2 - Office Chair
prompt condition: 0
prompt condition: 1
prompt condition: 2
prompt condition: 3
task: Task 3 - Raspberry Pi Case
prompt condition: 0
prompt condition: 1
prompt condition: 2
prompt condition: 3
task: Task 4 - Book Case
prompt condition: 0
prompt condition: 1
prompt condition: 2
prompt condition: 3
task: Task 5 - Kids Bicycle
prompt condition: 0
prompt condition: 1
prompt condition: 2
prompt condition: 3

 All results saved to: Llama-3_1-Nemotron-Ultra-253B-v1_prompting_test_evaluation.txt
