<a href="https://colab.research.google.com/github/owensun2004/Furniture_Assembly/blob/main/Manual2Skill_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Copyright 2023 Google LLC. SPDX-License-Identifier: Apache-2.0

<details>
<summary>Click to expand</summary>

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
</details>

# **Manual2Skill: Learning to Read Manuals and Acquire Robotic Skills for Furniture Assembly Using Vision-Language Models** Demo

Manual2Skill is a framework that enables robots to autonomously perform complex real-world assembly tasks guided by high-level manual instructions.

This Colab walks through the basics of the VLM-Guided Hierarchical Assembly Graph Generation step. It contains the prompts and API calls for the two stages (the input) and the generated assembly graph (the output).

Disclaimer: The VLM used here is GPT-4o-12-24, which is the only tested version. Other models may work, but they have not been tested.
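
The API calls in this notebook go through the `openai` client, which reads the key from the `OPENAI_API_KEY` environment variable. As a minimal sketch (not part of the original notebook), the key can be requested interactively instead of being typed into a cell. The helper name `ensure_api_key` and its `ask`/`env` parameters are hypothetical and exist only for this illustration.

```python
import os
from getpass import getpass


def ensure_api_key(ask=getpass, env=os.environ):
    """Return the API key from `env`, prompting once if it is missing or empty.

    `ask` and `env` are injectable so the helper can be tried without a
    terminal prompt; both are assumptions made for this sketch.
    """
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        key = ask("OpenAI API key: ").strip()
        if not key:
            raise ValueError("No API key provided")
        env["OPENAI_API_KEY"] = key  # make it visible to OpenAI()
    return key


# Typical use in a notebook cell (prompts without echoing the key):
# ensure_api_key()
```

Prompting this way keeps the key out of the saved notebook, which matters if the notebook is later shared.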

In [None]:
import os
os.environ["OPENAI_API_KEY"] = ""

## **Setup**

In [12]:
#@markdown A few imports
!pip install openai==1.52.0 matplotlib==3.9.2 plotly==5.23.0 pyvista ipyvolume graphviz

import json
import os
import re
import base64
import copy
from openai import OpenAI

Collecting pyvista
  Downloading pyvista-0.44.2-py3-none-any.whl.metadata (15 kB)
Collecting ipyvolume
  Downloading ipyvolume-0.6.3-py3-none-any.whl.metadata (2.3 kB)
Collecting vtk<9.4.0 (from pyvista)
  Downloading vtk-9.3.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.2 kB)
Collecting ipywebrtc (from ipyvolume)
  Downloading ipywebrtc-0.6.0-py2.py3-none-any.whl.metadata (825 bytes)
Collecting ipyvuetify (from ipyvolume)
  Downloading ipyvuetify-1.10.0-py2.py3-none-any.whl.metadata (7.5 kB)
Collecting ipyvue>=1.7.0 (from ipyvolume)
  Downloading ipyvue-1.11.2-py2.py3-none-any.whl.metadata (1.1 kB)
Collecting pythreejs>=2.4.0 (from ipyvolume)
  Downloading pythreejs-2.4.2-py3-none-any.whl.metadata (5.4 kB)
Collecting ipydatawidgets>=1.1.1 (from pythreejs>=2.4.0->ipyvolume)
  Downloading ipydatawidgets-4.3.5-py2.py3-none-any.whl.metadata (1.4 kB)
Downloading pyvista-0.44.2-py3-none-any.whl (2.2 MB)

In [4]:
#@markdown Clone the repository for all the input furniture data
!git clone https://github.com/owensun2004/Furniture_Assembly.git

Cloning into 'Furniture-Assembly-Web'...
remote: Enumerating objects: 549, done.
remote: Counting objects: 100% (30/30), done.
remote: Compressing objects: 100% (20/20), done.
remote: Total 549 (delta 14), reused 24 (delta 10), pack-reused 519 (from 2)
Receiving objects: 100% (549/549), 273.03 MiB | 30.92 MiB/s, done.
Resolving deltas: 100% (115/115), done.
Updating files: 100% (34/34), done.


## **Specify the furniture**
Select the furniture to plan the assembly for by setting its name and type below.
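
The two values are used as path components and, in the ground-truth lookup later in this notebook, `furniture_name` is matched against each entry's `"name"` and `furniture_type` against its `"category"` in `main_data.json`. The sketch below shows one way to list the valid choices from such entries. The helper `list_furniture` and the sample entries are hypothetical; in the notebook, the list would come from `json.load` on `main_data.json` after the repository is cloned.

```python
from collections import defaultdict


def list_furniture(entries):
    """Group furniture names by category, both in sorted order."""
    by_category = defaultdict(list)
    for entry in entries:
        by_category[entry["category"]].append(entry["name"])
    return {cat: sorted(names) for cat, names in sorted(by_category.items())}


# Hypothetical entries with the same "name"/"category" keys the notebook reads.
sample_entries = [
    {"category": "Chair", "name": "example_b"},
    {"category": "Bench", "name": "example_c"},
    {"category": "Chair", "name": "example_a"},
]

for category, names in list_furniture(sample_entries).items():
    print(category, names)
```

Any printed category can then be used as `furniture_type`, and any name under it as `furniture_name`.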

In [None]:
furniture_name = ""
furniture_type = ""

## **Stage I: Associate Real Parts with Manuals**
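
The manual pages are sent to the model in page order, which is why the cell below defines a numeric-aware sort key. A small sketch of the difference from plain string sorting follows. The key function mirrors `alphanumeric_sort_key` from the notebook under the illustrative name `natural_key`; the file names other than `page_1.png` are made up for the example.

```python
import re


def natural_key(name):
    """Split a name into text and integer chunks so that 10 sorts after 2."""
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", name)]


pages = ["page_10.png", "page_2.png", "page_1.png"]  # illustrative names

print(sorted(pages))                   # ['page_1.png', 'page_10.png', 'page_2.png']
print(sorted(pages, key=natural_key))  # ['page_1.png', 'page_2.png', 'page_10.png']
```

The key only makes sense when applied to file names; applying it to the base64 payloads would order pages by their encoded bytes rather than by page number.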

In [10]:
def alphanumeric_sort_key(filename):
    # Split the filename into a list of alphabetic and numeric parts
    return [int(text) if text.isdigit() else text for text in re.split(r'(\d+)', filename)]

def select_materials_for_planning(furniture_name, furniture_type, temp):
    scene_path = f"Furniture_Assembly/IKEA-Manuals-at-Work/output/{furniture_type}/{furniture_name}/scene_annotated.png"
    manual_path = f"Furniture_Assembly/IKEA-Manuals-at-Work/data/pdfs/{furniture_type}/{furniture_name}/page_1.png"
    # output_folder = f"Furniture-Assembly/IKEA-Manuals-at-Work/{pdf_path}/{furniture_type}/{furniture_name}/"
    # os.makedirs(output_folder, exist_ok=True)

    # output_path = output_folder + "/label.json"
    json_text = generate_json(scene_path, manual_path, 0)

    prompt_text = "select_material"
    # pdf_path = f"../dataset/pdfs/{furniture_type}/{furniture_name}/0.pdf"
    # print(pdf_path)

    def encode_image(image_path):
        with open(image_path, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode('utf-8')

    # with open(output_path, 'r') as json_file:
    #     json_content = json.load(json_file)

    # json_text = json.dumps(json_content)

    # Collect the manual pages in page order. Sort the file names (not the
    # base64 payloads) so that pages are ordered by their page number.
    pages_dir = f"Furniture_Assembly/IKEA-Manuals-at-Work/data/pdfs/{furniture_type}/{furniture_name}/"
    page_files = sorted(
        (page for page in os.listdir(pages_dir) if page.endswith(".png")),
        key=alphanumeric_sort_key,
    )
    b64_pages_sorted = [encode_image(os.path.join(pages_dir, page)) for page in page_files]

    print(len(b64_pages_sorted))
    base64_image = encode_image(scene_path)

    client = OpenAI()

    with open(f"Furniture_Assembly/furniture_assembly/prompts/{prompt_text}.txt", 'r') as f:
        prompt = f.read()

    # Build the dynamic list of image messages based on the number of pages.
    # All inputs are PNG files, so the data URLs use the image/png MIME type.
    image_messages = [
        {
            "type": "image_url",
            "image_url": {
                "url": f"data:image/png;base64,{base64_image}",
                "detail": "high"
            },
        }
    ]

    # Add each page image to the message
    for b64_image in b64_pages_sorted:
        image_messages.append({
            "type": "image_url",
            "image_url": {
                "url": f"data:image/png;base64,{b64_image}",
                "detail": "high"
            },
        })

    print(len(image_messages))

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt + "\n\nAnd here is the json file: \n" + json_text},
                    *image_messages  # Unpack the dynamic list of image messages
                ],
            }
        ],
        max_tokens=1000,
        temperature = temp
    )

    input_prompt = prompt + "\n\nAnd here is the json file: \n" + json_text

    return input_prompt, response.choices[0].message.content

def generate_json(img_path, manual_path, temp):
    def encode_image(image_path):
        with open(image_path, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode('utf-8')

    base64_image = encode_image(img_path)
    base64_image2 = encode_image(manual_path)
    with open("Furniture_Assembly/furniture_assembly/prompts/generate_json.txt", 'r') as f:
        prompt = f.read()
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{base64_image}",
                            "detail": "high"
                        }
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{base64_image2}",
                            "detail": "high"
                        }
                    },
                ],
            }
        ],
        max_tokens=1000,
        temperature=temp,
    )
    final_output = response.choices[0].message.content.replace("```json", "").replace("```", "")
    return final_output


input_prompt, output_table = select_materials_for_planning(furniture_name, furniture_type, 0)
print(output_table)

## **Stage II: Identify Involved Parts in Each Step**
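
Both stages attach images to the request as base64 data URLs inside `image_url` content parts with `"detail": "high"`. The sketch below factors that pattern into one helper that derives the MIME type from the file extension. The name `image_part` is hypothetical, and this is an illustration of the pattern rather than the code the notebook runs.

```python
import base64
import mimetypes


def image_part(image_path, detail="high"):
    """Build one image content part as a base64 data URL.

    The MIME type is guessed from the file extension (an assumption of this
    sketch); unknown or non-image extensions are rejected.
    """
    mime, _ = mimetypes.guess_type(image_path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Not a recognized image file: {image_path}")
    with open(image_path, "rb") as image_file:
        payload = base64.b64encode(image_file.read()).decode("utf-8")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime};base64,{payload}", "detail": detail},
    }
```

A list of such parts can be unpacked after the text part in the `content` list of a user message, in the same way the cells in this notebook unpack `image_messages`.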

In [None]:
def create_plan(furniture_name, furniture_type, temp):
    prompt_text = "planning_no_seg_no_num"

    def encode_image(image_path):
        with open(image_path, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode('utf-8')

    # Encode the main labeled image
    image_path = f"Furniture_Assembly/IKEA-Manuals-at-Work/output/{furniture_type}/{furniture_name}/scene_annotated.png"
    base64_image = encode_image(image_path)

    # Initialize OpenAI client
    client = OpenAI()

    # Load prompt text
    with open(f"Furniture_Assembly/furniture_assembly/prompts/{prompt_text}.txt", 'r') as f:
        prompt = f.read()

    prompt = prompt + output_table
    print(prompt)

    # Encode 3D OBJ images
    obj_img = []
    # List the files in the mask directory in natural file-name order
    file_list = os.listdir(f"Furniture_Assembly/IKEA-Manuals-at-Work/data/mask/{furniture_type}/{furniture_name}/")
    sorted_file_list = sorted(file_list, key=alphanumeric_sort_key)


    for filename in sorted_file_list:
        if filename.endswith("no_seg.png"):
            img_path = os.path.join(f"Furniture_Assembly/IKEA-Manuals-at-Work/data/mask/{furniture_type}/{furniture_name}/", filename)
            print(img_path)
            obj_img.append(
                encode_image(img_path)
            )
    # All inputs are PNG files, so the data URLs use the image/png MIME type
    image_messages = [
        {
            "type": "image_url",
            "image_url": {
                "url": f"data:image/png;base64,{base64_image}",
                "detail": "high"
            },
        }
    ]
    for b64_obj_image in obj_img:
        image_messages.append({
            "type": "image_url",
            "image_url": {
                "url": f"data:image/png;base64,{b64_obj_image}",
                "detail": "high"
            },
        })

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    *image_messages
                ],
            }
        ],
        max_tokens=1000,
        temperature=temp,
    )
    return response.choices[0].message.content, prompt

output_plan, planning_prompt = create_plan(furniture_name, furniture_type, 0)
print(output_plan)

## **Convert text-based assembly plan to Hierarchical Assembly Graph**
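
The assembly graph is a nested list in which the order of children does not matter, for example `[[1, 2, 0], 3]`. The model returns text, so it has to be parsed before it can be compared with the ground truth. The sketch below assumes the reply is a Python list literal, possibly wrapped in a code fence, and compares trees through a canonical form. This is an alternative to the pairwise matching used in the cell below, not the authors' method, and the names `parse_tree` and `canonical` are hypothetical.

```python
import ast

FENCE = "`" * 3  # markdown code-fence marker that a reply may contain


def parse_tree(reply):
    """Parse a reply such as '[[1, 2, 0], 3]' into a nested list."""
    text = reply.replace(FENCE + "python", "").replace(FENCE, "").strip()
    tree = ast.literal_eval(text)  # raises ValueError/SyntaxError on bad input
    if not isinstance(tree, list):
        raise ValueError("Expected a nested list")
    return tree


def canonical(node):
    """Return an order-independent form: children sorted at every level."""
    if not isinstance(node, list):
        return node
    children = [canonical(child) for child in node]
    return tuple(sorted(children, key=repr))  # repr gives a total order


predicted = parse_tree("[[1, 2, 0], 3]")
ground_truth = [3, [0, 2, 1]]
print(canonical(predicted) == canonical(ground_truth))  # True
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for text produced by a model.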

In [None]:
def convert_to_tree(furniture_name, furniture_type, temp):

  prompt_text = "tree_ikea_manual"

  client = OpenAI()

  with open(f"Furniture_Assembly/furniture_assembly/prompts/{prompt_text}.txt", 'r') as f:
      prompt = f.read()

  prompt = prompt + "\n" + output_plan + "\n\nYOUR REAL OUTPUT:\n"


  response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
      {
        "role": "user",
        "content": [
          {"type": "text", "text": prompt}
        ],
      }
    ],
    max_tokens=2000,
    temperature = temp,
  )

  final_output = response.choices[0].message.content.replace("```python", "").replace("```", "")
  return final_output

def compare_to_gt_tree(furniture_name, furniture_type):
  def load_paths(json_path):
    """Load directory paths from the JSON file."""
    with open(json_path, 'r') as f:
        data = json.load(f)
    return data

  json_file_path = "Furniture_Assembly/IKEA-Manuals-at-Work/data/main_data.json"

  data = load_paths(json_file_path)

  for i in range(len(data)):
    cur_furniture_name = data[i]["name"]
    cur_furniture_category = data[i]["category"]
    if cur_furniture_name == furniture_name and cur_furniture_category == furniture_type:
      return data[i]["assembly_tree"]

  return "Error: No ground truth assembly tree found, check if furniture name and furniture type are inputted correctly"

def are_nested_lists_equal(list1, list2):
    # Create deep copies of the lists to avoid modifying the originals
    list1_copy = copy.deepcopy(list1)
    list2_copy = copy.deepcopy(list2)

    # If both are not lists, compare directly
    if not isinstance(list1_copy, list) or not isinstance(list2_copy, list):
        return list1_copy == list2_copy

    # If lengths are different, they can't be equal
    if len(list1_copy) != len(list2_copy):
        return False

    # Recursively check each element in the lists
    for item1 in list1_copy:
        found_match = False
        for item2 in list2_copy:
            if are_nested_lists_equal(item1, item2):
                found_match = True
                list2_copy.remove(item2)  # Remove the matched item to avoid re-matching
                break
        if not found_match:
            return False

    return True

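import ast

final_tree = convert_to_tree(furniture_name, furniture_type, 0)
print(f"VLM predicted assembly graph: {final_tree}")
gt_tree = compare_to_gt_tree(furniture_name, furniture_type)
print(f"Ground truth assembly graph: {gt_tree}")

# convert_to_tree returns text, so parse it into a nested list before comparing.
# This assumes the reply is a plain Python list literal.
try:
    predicted_tree = ast.literal_eval(final_tree.strip())
except (ValueError, SyntaxError):
    predicted_tree = None
    print("Could not parse the predicted graph as a nested list")

if predicted_tree is not None:
    print(f"Does predicted graph equal ground truth graph? {are_nested_lists_equal(predicted_tree, gt_tree)}")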

## **Visualize Assembly Graph**
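
The visualization below places leaves at consecutive x positions, centers each internal node over its children, and uses depth as the vertical coordinate. The layout step can be sketched on its own, without any rendering library. The function name `layout_tree` and its return format are hypothetical, and unlike the cell below, this sketch treats an empty list as a leaf.

```python
def layout_tree(tree):
    """Return (node, depth, x) triples in post-order for a nested-list tree."""
    positions = []
    next_leaf_x = 0

    def place(node, depth):
        nonlocal next_leaf_x
        if isinstance(node, list) and node:
            # Internal node: centered over its children
            xs = [place(child, depth + 1) for child in node]
            x = sum(xs) / len(xs)
        else:
            # Leaf: take the next free slot from left to right
            x = float(next_leaf_x)
            next_leaf_x += 1
        positions.append((node, depth, x))
        return x

    place(tree, 0)
    return positions


for node, depth, x in layout_tree([[1, 2, 0], 3]):
    print(depth, x, node)
```

For `[[1, 2, 0], 3]`, the three inner leaves land at x = 0, 1, 2, their parent at x = 1, the leaf `3` at x = 3, and the root at x = 2.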

In [11]:
!pip install vpython

from vpython import *

def visualize_tree_3d(tree):
    nodes = []  # Stores node information: id, x, y, color
    edges = []  # Stores parent-child relationships
    id_counter = 0
    x_accumulator = [0]  # Track x positions for leaves

    def next_id():
        nonlocal id_counter
        id_counter += 1
        return id_counter

    node_info = {}  # id: {'x', 'y', 'color'}

    def process_node(node, depth, parent_id):
        nonlocal nodes, edges, x_accumulator
        current_id = next_id()

        # Determine node color and position.
        # The local name is node_color, not color: assigning to `color` inside
        # this function would shadow vpython's `color` and raise UnboundLocalError.
        if isinstance(node, list):
            all_ints = all(isinstance(e, int) for e in node)
            if all_ints and sum(node) == 3:
                node_color = color.blue
            else:
                node_color = color.red

            children_x = []
            for child in node:
                child_id = process_node(child, depth + 1, current_id)
                children_x.append(node_info[child_id]['x'])
            current_x = sum(children_x) / len(children_x) if children_x else 0
            current_y = depth
        else:
            # Leaf node
            node_color = color.green
            current_x = x_accumulator[0]
            x_accumulator[0] += 1
            current_y = depth

        node_info[current_id] = {'x': current_x, 'y': current_y, 'color': node_color}
        nodes.append({'id': current_id, 'x': current_x, 'y': current_y, 'color': node_color})
        if parent_id is not None:
            edges.append((parent_id, current_id))
        return current_id

    # Build node and edge data
    root_id = process_node(tree, depth=0, parent_id=None)

    # Create 3D visualization
    scene = canvas(width=800, height=600, background=color.white)
    node_objects = {}

    # Create spheres for nodes
    for node in nodes:
        n_id = node['id']
        x = node['x']
        y = -node['y']  # Invert y to display root at top
        node_color = node['color']
        node_objects[n_id] = sphere(pos=vector(x, y, 0), radius=0.3, color=node_color)

    # Create cylinders for edges
    for parent_id, child_id in edges:
        parent_pos = node_objects[parent_id].pos
        child_pos = node_objects[child_id].pos
        cylinder(pos=parent_pos, axis=child_pos - parent_pos, radius=0.1, color=color.gray(0.5))

    return scene

# Example usage:
tree = [[1, 2, 0], 3]
visualize_tree_3d(tree)

Collecting vpython
  Downloading vpython-7.6.5-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.7 kB)
Collecting jupyter (from vpython)
  Downloading jupyter-1.1.1-py2.py3-none-any.whl.metadata (2.0 kB)
Collecting jupyter-server-proxy (from vpython)
  Downloading jupyter_server_proxy-4.4.0-py3-none-any.whl.metadata (8.7 kB)
Collecting jupyterlab-vpython>=3.1.8 (from vpython)
  Downloading jupyterlab_vpython-3.1.8-py3-none-any.whl.metadata (5.3 kB)
Collecting autobahn<27,>=22.6.1 (from vpython)
  Downloading autobahn-24.4.2-py2.py3-none-any.whl.metadata (18 kB)
Collecting txaio>=21.2.1 (from autobahn<27,>=22.6.1->vpython)
  Downloading txaio-23.1.1-py2.py3-none-any.whl.metadata (5.4 kB)
Collecting hyperlink>=21.0.0 (from autobahn<27,>=22.6.1->vpython)
  Downloading hyperlink-21.0.0-py2.py3-none-any.whl.metadata (1.5 kB)
Collecting jupyterlab (from jupyter->vpython)
  Downloading jupyterlab-4.3.5-py3-none-any.whl.metadata (16 k

In [None]:
import pyvista as pv

# Create a sphere mesh as a stand-in 3D object
mesh = pv.Sphere()

# Plot the 3D object
plotter = pv.Plotter()
plotter.add_mesh(mesh)
plotter.show()