## **Out-of-the-Box Assembly Plan Generation: A Peer Evaluation of Open-Source LLMs**

### Code related to the paper 'Out-of-the-Box Assembly Plan Generation: A Peer Evaluation of Open-Source LLMs' by Harkrian Sahota and Adam Klodowski.

This code tests to what extent pre-trained, general-purpose LLMs can be used directly, without additional training, to generate structured assembly instructions from minimal input such as part lists and basic constraints.

### **1. Testing Domain Relevance**

The domain relevance of four selected non-proprietary LLMs is tested by asking each model 10 domain-relevant questions. The answers are then peer-evaluated: using reference answers, the same four models assess the correctness, completeness, and domain relevance of each answer.

**1.1 Domain Relevance Questions**

In [15]:
from huggingface_hub import InferenceClient

from dotenv import load_dotenv
import os

from openai import OpenAI

import random

load_dotenv()

True

Ten questions were selected to evaluate each LLM's domain relevance in manufacturing. The list of questions is looped through for each model.

In [16]:
questions = [
    "What are the key steps in a standard manufacturing process flow?",
    "What is the purpose of a Bill of Materials (BOM) in production?",
    "Explain the difference between subtractive and additive manufacturing.",
    "What is lean manufacturing and its core principles?",
    "What is production planning in manufacturing, and why is it important?",
    "What does an efficient production plan in manufacturing look like?",
    "What are common assembly actions used to assemble products?",
    "How would you describe to a layman the steps to assemble two parts with a special screw using a specific tool?",
    "What does \"takt time\" mean, and how is it calculated?",
    "What are common causes of bottlenecks in a manufacturing system?"
]


Selecting the model that should answer the questions.

In [41]:
#MODEL_NAME = "google/gemma-3-27b-it"
MODEL_NAME = "deepseek-ai/DeepSeek-V3-0324"
#MODEL_NAME = "Qwen/Qwen3-235B-A22B"
#MODEL_NAME = "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1"

Establishing the connection to the specified model using the Nebius provider. Questions and resulting answers are saved in a file named *[MODEL_NAME].txt*, where the file name uses the part of the model name after the slash.

<span style="color:red">*Remark: Please make sure to have a valid Nebius access token to run the following code.*</span> 
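
A missing token is easier to diagnose when it is checked before any request is sent. The following is a minimal sketch, assuming the token is stored under `NEBIUS_TOKEN` in the `.env` file; the helper name `require_token` and its `env` parameter are hypothetical and not part of the original code.

```python
import os

def require_token(name="NEBIUS_TOKEN", env=None):
    """Return the token stored under `name`, or raise a descriptive error."""
    # `env` lets the check run against any mapping; it defaults to the real environment
    env = os.environ if env is None else env
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to your .env file."
        )
    return token

# Demonstration with an explicit mapping instead of the real environment
print(require_token(env={"NEBIUS_TOKEN": "example-token"}))
```

In the notebook, the returned value could be passed as `api_key` when constructing the client.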

In [8]:
#MODEL_NAME = "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1"

client = OpenAI(
    base_url="https://api.studio.nebius.com/v1/",
    api_key=os.environ.get("NEBIUS_TOKEN")
)

# Store Q&A results
qa_pairs = []

for i, question in enumerate(questions, 1):
    completion = client.chat.completions.create(
        model=MODEL_NAME,
        messages=[
            {
                "role": "system",
                "content": """SYSTEM_PROMPT"""
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": question
                    }
                ]
            }
        ]
    )

    answer = completion.choices[0].message.content

    # Store result
    qa_pairs.append((question, answer))
    
    # Print to notebook
    print(f"Q{i}: {question}\nA{i}: {answer}\n{'-'*60}")


# Save to file
filename = MODEL_NAME.split("/")[-1] + ".txt"
with open(filename, "w", encoding="utf-8") as f:
    for q, a in qa_pairs:
        f.write(f"Q: {q}\nA: {a}\n{'-'*60}\n")

print(f"\n✅ All results saved to: {filename}")


Q1: What are the key steps in a standard manufacturing process flow?
A1: <think>
Okay, the user is asking about the key steps in a standard manufacturing process flow. Let me start by recalling what I know about manufacturing processes. I think the steps generally start with product design, then planning, sourcing materials, production, quality control, assembly, testing, packaging, and distribution. But wait, maybe I should check if there's a more standardized framework.

I remember that manufacturing processes can vary by industry, so the answer should be generic enough but still specific to cover the essentials. Let me break it down step by step. Product Design and Development definitely comes first. Then, process design, which involves planning how to manufacture the product. Then sourcing raw materials, which is part procurement, part supplier relationships.

Production could be broken down into setup, the actual manufacturing (like machining, molding, etc.), and then maybe in-pro

**1.2 Evaluation of Domain Relevance**

List of reference answers for the evaluation of the model answers

In [18]:
reference_answers=[
    "Typical manufacturing processes include: 1) Product design & prototyping, 2) Material procurement, 3) Production planning, 4) Manufacturing (e.g., machining, forming), 5) Assembly, 6) Quality control, and 7) Packaging & distribution. ",
    "A BOM is a structured list detailing all materials, parts, and components required to manufacture a product, including quantities and specifications. It is used for production planning, procurement, and inventory control.",
    "Subtractive manufacturing removes material from a solid block (e.g., CNC milling), while additive manufacturing builds material layer by layer (e.g., 3D printing).",
    "Lean manufacturing focuses on reducing waste and increasing value. Core principles include: 1) Define value, 2) Map value stream, 3) Create flow, 4) Establish pull, 5) Pursue perfection.",
    "Production planning involves scheduling, resource allocation, and workflow optimization to meet demand while minimizing waste and delays. It includes forecasting, inventory control, and capacity planning.",
    "It balances demand forecasts with capacity, reduces idle time, ensures material availability, includes buffers for uncertainty, and adapts to changes. Efficiency depends on synchronization across labour, materials, and equipment.",
    "Common actions include: Fastening (screws, bolts), Joining (welding, gluing), Aligning and fitting parts, Wiring, Testing, and Packaging. These vary depending on the product and process type.",
    "1) Align the two parts so their screw holes match, 2) Insert the special screw, 3) Use the specified tool (e.g., Torx driver) to tighten the screw clockwise, 4) Ensure a firm connection without over-tightening.",
    "Takt time is the rate at which products must be produced to meet customer demand. It’s calculated by dividing available production time by required output and is used to synchronize workflow and reduce inefficiencies, as emphasized in the Toyota Production System.",
    "Common bottlenecks in manufacturing include machine breakdowns, poor layout, insufficient workforce, planning issues, and process misalignment. These cause delays and lower efficiency, so addressing them is key to optimizing production."
]

Reading in all answer files and creating a dictionary of answers by each model.

In [19]:
model_answer_files = {
    "DeepSeek": "DeepSeek-V3-0324.txt",
    "Gemma": "gemma-3-27b-it.txt",
    "Qwen": "Qwen3-235B-A22B.txt",
    "Llama": "Llama-3_1-Nemotron-Ultra-253B-v1.txt"
}

In [20]:
import re

def extract_answers(filepath):
    with open(filepath, 'r', encoding='utf-8') as file:
        content = file.read()

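    # Each record is saved as "Q: ...", "A: ..." and a line of 60 dashes.
    # Stop each answer at that separator so the dashes are not captured as part of the answer.
    q_a_pairs = re.findall(r"Q:\s*(.*?)\nA:\s*(.*?)(?=\n-{60}\n|\Z)", content, re.DOTALL)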

    # Extract only answers in order
    answers = [answer.strip() for _, answer in q_a_pairs]
    return answers

model_answers = {model: extract_answers(path) for model, path in model_answer_files.items()}
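
To check the parsing without touching the answer files, the same idea can be run on an in-memory string. This is only a sketch: `parse_qa_text` and the sample text are hypothetical, and the pattern stops each answer at the 60-dash separator line that the saving step writes after every record.

```python
import re

SEPARATOR = "-" * 60  # same separator line the saving step writes

def parse_qa_text(content):
    """Return (question, answer) pairs from text in the saved Q/A layout."""
    pattern = r"Q:\s*(.*?)\nA:\s*(.*?)(?=\n-{60}\n|\Z)"
    matches = re.findall(pattern, content, re.DOTALL)
    return [(q.strip(), a.strip()) for q, a in matches]

# Two hypothetical records, the first with a multi-line answer
sample = (
    f"Q: What is a BOM?\nA: A structured list of parts.\nIt includes quantities.\n{SEPARATOR}\n"
    f"Q: What is takt time?\nA: Available time divided by demand.\n{SEPARATOR}\n"
)

pairs = parse_qa_text(sample)
for question, answer in pairs:
    print(question, "->", answer.replace("\n", " "))
```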


Removing thinking tags from Qwen's answers, as these could bias the evaluation.

In [31]:
def remove_think_tags(text):
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

model_answers["Qwen"] = [remove_think_tags(ans) for ans in model_answers["Qwen"]]
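
The pattern only matches closed `<think>...</think>` pairs, so a response that is cut off inside its reasoning block keeps that text. Below is a minimal sketch of a stricter variant, assuming such truncated responses should be reduced to an empty answer; `strip_reasoning` is a hypothetical helper and not part of the original code.

```python
import re

def strip_reasoning(text):
    """Remove <think>...</think> blocks and any unclosed trailing <think> block."""
    # First drop every properly closed reasoning block
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    # Then drop a reasoning block that was opened but never closed
    text = re.sub(r"<think>.*\Z", "", text, flags=re.DOTALL)
    return text.strip()

closed = strip_reasoning("<think>plan the answer</think>Takt time = available time / demand.")
unclosed = strip_reasoning("<think>reasoning that was cut off")
print(repr(closed))
print(repr(unclosed))
```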


Prompt used to evaluate the answers given to each question. The evaluation model is asked to judge each model's answer by scoring it from 1 to 5 and giving a brief explanation. The prompt is completed by inserting the question, the reference answer, and the four answers given by the different models.

In [32]:
def get_prompt(question, reference_answer, answer1, answer2, answer3, answer4):
    return """You are an expert evaluator tasked with judging the quality of AI-generated answers in a specific technical domain. Below is a question, a reference answer, and four candidate answers produced by different models. Your goal is to assess how well each candidate answer matches the reference in terms of correctness, completeness, and domain relevance.

---

Question: {question}

Reference Answer: {reference_answer}

Candidate Answers:
A: {answer1}
B: {answer2}
C: {answer3}
D: {answer4}

Please evaluate each candidate answer using the following criteria:
1. **Correctness** Is the information factually accurate?
2. **Completeness** Does it fully answer the question?
3. **Domain relevance** Is it appropriate and useful in the given context?

Give each answer a score between 1 (poor) and 5 (excellent). Provide a brief explanation for each score.

Respond in the following format:

A: [score]  [short explanation]  
B: [score]  [short explanation]  
C: [score]  [short explanation]  
D: [score]  [short explanation]""".format(question=question,reference_answer=reference_answer,answer1=answer1,answer2=answer2,answer3=answer3,answer4=answer4)

List of all four models, used to shuffle the order in which the answers are presented in the evaluation process.

In [33]:
model_list=["Gemma","Qwen","Llama","DeepSeek"]
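
Since the candidates appear in the prompt only as A to D, the shuffled order has to be kept so that scores can be mapped back to models. The following is a small sketch of that bookkeeping, assuming a local `random.Random` with a fixed seed for repeatable runs; the seed and the helper name `shuffled_labels` are assumptions and not part of the original code.

```python
import random
import string

def shuffled_labels(models, rng):
    """Return a dict mapping candidate letters (A, B, C, ...) to a shuffled copy of `models`."""
    order = list(models)  # copy so the caller's list is left untouched
    rng.shuffle(order)
    return dict(zip(string.ascii_uppercase, order))

rng = random.Random(0)  # fixed seed: an assumption, for repeatable runs
labels = shuffled_labels(["Gemma", "Qwen", "Llama", "DeepSeek"], rng)
print(labels)
```

Storing this mapping next to each evaluation serves the same purpose as saving a copy of the shuffled list.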


Selecting the evaluation model

In [51]:
#MODEL_NAME = "google/gemma-3-27b-it"
#MODEL_NAME = "deepseek-ai/DeepSeek-V3-0324"
#MODEL_NAME = "Qwen/Qwen3-235B-A22B"
MODEL_NAME = "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1"

Running the evaluation using the selected evaluation model. For every question, a specific prompt is generated (question, reference answer, and the shuffled answers of the four models) and sent to the evaluation model. The responses of the model to the ten prompts are then saved to a file called *[MODEL_NAME]_domain_relevance_evaluation.txt*.

<span style="color:red">*Remark: Please make sure to have a valid Nebius access token to run the following code.*</span> 

In [52]:
client = OpenAI(
    base_url="https://api.studio.nebius.com/v1/",
    api_key=os.environ.get("NEBIUS_TOKEN")
)

# Store evaluation results
evaluations = []

for i in range(len(questions)):
    question = questions[i]
    reference_answer = reference_answers[i]

    # Randomize which model's answer appears as candidate A, B, C and D
    random.shuffle(model_list)

    answer1=model_answers[model_list[0]][i]
    answer2=model_answers[model_list[1]][i]
    answer3=model_answers[model_list[2]][i]
    answer4=model_answers[model_list[3]][i]

    evaluation_prompt=get_prompt(question,reference_answer,answer1,answer2,answer3,answer4)
    
    completion = client.chat.completions.create(
        model=MODEL_NAME,
        messages=[
            {
                "role": "system",
                "content": """SYSTEM_PROMPT"""
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": evaluation_prompt
                    }
                ]
            }
        ]
    )

    answer = completion.choices[0].message.content

    # Store result
    evaluations.append((question, answer, model_list.copy()))
    
    # Print to notebook
    print(f"Q{i}: {question}\nA{i}: {answer}\nM:{model_list}\n{'-'*60}\n")


# Save to file
filename = MODEL_NAME.split("/")[-1] + "_domain_relevance_evaluation.txt"
with open(filename, "w", encoding="utf-8") as f:
    # `order` is the shuffled model order for this question (avoids shadowing the built-in `list`)
    for q, a, order in evaluations:
        f.write(f"Q: {q}\n\nA: {a}\n\nM: {order}\n{'-'*60}\n\n")

print(f"\n All results saved to: {filename}") 


    


Q0: What are the key steps in a standard manufacturing process flow?
A0: <think>
Okay, let's tackle this evaluation. I need to assess each candidate answer (A, B, C, D) against the reference answer based on correctness, completeness, and domain relevance. Let me start by understanding the reference answer thoroughly.

The reference lists seven key steps: Product design & prototyping, Material procurement, Production planning, Manufacturing (e.g., machining, forming), Assembly, Quality control, and Packaging & distribution. It's a concise, standard flow.

Now, I'll go through each candidate one by one.

**Candidate A:**
This answer expands on each step with subpoints and adds optional steps. It covers all the reference's points but includes more details like Warehousing, Order Fulfillment, and After-Sales Support. The optional steps might be industry-specific but don't contradict the reference. Correctness seems good as it aligns with the core steps. Completeness is high because it goes
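
The saved evaluations still have to be turned into per-model scores. Below is a hedged sketch of one way to do this, assuming the evaluator follows the requested `A: [score] [short explanation]` format. `parse_scores` is a hypothetical helper and not part of the original code; responses that deviate from the format would need additional handling.

```python
import re

def parse_scores(response, model_order):
    """Map each model to its 1-5 score from lines such as 'A: 4  Accurate but brief'."""
    # Ignore closed reasoning blocks so letters inside them are not mistaken for scores
    response = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL)
    # A line starts with the candidate letter, a colon, then the score (optionally bracketed or bold)
    pattern = r"^\s*\**([A-D])\**\s*:\s*\**\s*\[?([1-5])\]?"
    scores = {}
    for letter, score in re.findall(pattern, response, re.MULTILINE):
        model = model_order["ABCD".index(letter)]
        scores[model] = int(score)
    return scores

# Hypothetical evaluator response and shuffled model order
response = "A: 4  Accurate but brief\nB: [5]  Complete\nC: 3  Misses planning\nD: 2  Off-topic"
scores = parse_scores(response, ["Qwen", "Gemma", "DeepSeek", "Llama"])
print(scores)
```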

### **3. Generating Assembly Plans**