Matplotlib visualisation callback crashed during training: array reshaping dimension mismatch (9000000 vs 1500×1500×3) #166

@holomo

Problem Description

While fine-tuning the Octo model, the visualization callback crashed at around step 5000 with a Matplotlib-related array reshape error, which prevented training from completing.

Reproduction steps

  1. Set up the environment: Python 3.10, matplotlib 3.10.0 (also tested with 3.5.2)
  2. Run the command: python scripts/finetune.py --config.pretrained_path=hf://rail-berkeley/octo-small-1.5 --debug --batch_size=8
  3. Training crashes at around step 5000

Error message

File "/octo/utils/visualization_lib.py", line 550, in __exit__
self.image = out_image.reshape((self.fig.canvas.get_width_height()[1], self.fig.canvas.get_width_height()[0], 3))
ValueError: cannot reshape array of size 9000000 into shape (1500,1500,3)

Environment information

  • Operating system: Linux
  • Python Version: 3.10
  • matplotlib Version: 3.10.0 (Also tested with 3.5.2, the issue persists)
  • Octo Version: Latest main branch
  • Hardware: NVIDIA GPU with CUDA 11.8

Attempted Solutions

  1. ❌ Downgraded matplotlib to 3.5.2 - The issue persisted.
  2. ✅ Disabled visualization callbacks - Training continued, but visualization functionality was lost.
  3. 🔧 Manually fixed the array reshaping logic - A temporary fix, but not a permanent solution.

Root Cause Analysis

The issue occurred in line 550 of visualization_lib.py:

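  • The reshape assumes an image of 1500×1500×3 = 6,750,000 uint8 values (per the traceback, the height and width come from self.fig.canvas.get_width_height())
  • The actual render buffer holds 9,000,000 values; this equals 1500×1500×4, which may indicate a 4-channel (RGBA) buffer rather than RGB
  • No handling of matplotlib version differences (tostring_rgb is deprecated)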
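The size arithmetic above can be checked with a short sketch. The helper name `square_side` is hypothetical and only tests which square layouts are compatible with the reported buffer size; it does not prove which layout matplotlib actually produced.

```python
import math

BUFFER_SIZE = 9_000_000      # size reported in the ValueError
EXPECTED = 1500 * 1500 * 3   # size implied by the requested shape


def square_side(n_values, channels):
    """Return the side length of a square image with `channels` channels
    that holds exactly `n_values` values, or None if no such image exists."""
    if n_values % channels:
        return None
    pixels = n_values // channels
    side = math.isqrt(pixels)
    return side if side * side == pixels else None


print(EXPECTED)                        # 6750000
print(square_side(BUFFER_SIZE, 3))     # None: no square RGB image fits
print(square_side(BUFFER_SIZE, 4))     # 1500: a square RGBA image fits
print(BUFFER_SIZE == 2000 * 1500 * 3)  # True: a non-square RGB image also fits
```

Since both a 1500×1500 RGBA buffer and a 2000×1500 RGB buffer match 9,000,000 values, the byte count alone does not identify the layout. This is why a fix that reads the shape from the buffer itself is safer than one that infers it.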

Suggested Fix

```python
# Current problematic code:
out_image = np.frombuffer(self.canvas.tostring_rgb(), dtype="uint8")
self.image = out_image.reshape((h, w, 3))

# Suggested fix:
# tostring_rgb() is deprecated; buffer_rgba() is its replacement and carries
# its own (h, w, 4) shape, so the size does not have to be guessed.
# (Estimating h and w with a square root fails here: 9,000,000 // 3 is not a
# perfect square, so the reshape would still raise.)
rgba = np.asarray(self.canvas.buffer_rgba())
self.image = rgba[..., :3].copy()  # drop the alpha channel to keep RGB
```
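For reference, here is a minimal standalone sketch of the same conversion outside of Octo. It assumes the Agg backend and uses a hypothetical helper name, `figure_to_rgb`; it is an illustration of the approach, not the code in visualization_lib.py.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; assumed suitable for a training job

import matplotlib.pyplot as plt
import numpy as np


def figure_to_rgb(fig):
    """Render `fig` and return an (H, W, 3) uint8 RGB array (hypothetical helper)."""
    fig.canvas.draw()
    # buffer_rgba() exposes the rendered pixels with their true (H, W, 4) shape,
    # so width and height never have to be inferred from the byte count.
    rgba = np.asarray(fig.canvas.buffer_rgba())
    # Drop alpha and copy so the result does not alias the canvas buffer.
    return rgba[..., :3].copy()


fig, ax = plt.subplots(figsize=(4, 3), dpi=100)
ax.plot([0, 1], [0, 1])
image = figure_to_rgb(fig)
plt.close(fig)
print(image.shape, image.dtype)
```

Because the array shape comes from the buffer, this avoids any mismatch between get_width_height() and the rendered size.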

Impact Assessment

  • Severity: Medium-to-high (prevents training from completing)
  • Affects: All training using visualization callbacks
  • Temporary Solution: Disable visualization, but lose debugging capabilities.

Additional Information

This issue took me a weekend to debug. I hope the team can fix it quickly to avoid affecting other users.
