Description
Problem Description
While fine-tuning the Octo model, the visualization callback crashes at approximately step 5000 with a reshape error that appears to be a Matplotlib compatibility issue, and this prevents training from completing.
Reproduction steps
- Set up the environment: Python 3.10, matplotlib 3.10.0 (also tested with 3.5.2)
- Run the command:
python scripts/finetune.py --config.pretrained_path=hf://rail-berkeley/octo-small-1.5 --debug --batch_size=8
- Training crashes at about step 5000
Error message
File "/octo/utils/visualization_lib.py", line 550, in __exit__
self.image = out_image.reshape((self.fig.canvas.get_width_height()[1], self.fig.canvas.get_width_height()[0], 3))
ValueError: cannot reshape array of size 9000000 into shape (1500,1500,3)
Environment information
- Operating system: Linux
- Python Version: 3.10
- matplotlib Version: 3.10.0 (Also tested with 3.5.2, the issue persists)
- Octo Version: Latest main branch
- Hardware: NVIDIA GPU with CUDA 11.8
Attempted Solutions
- ❌ Downgraded matplotlib to 3.5.2 - The issue persisted.
- ✅ Disabled visualization callbacks - Training continued, but visualization functionality was lost.
- 🔧 Manually fixed the array reshaping logic - A temporary fix, but not a permanent solution.
Root Cause Analysis
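The failure occurs at line 550 of visualization_lib.py:
- The reshape assumes the canvas size reported by get_width_height() with 3 channels: 1500 × 1500 × 3 = 6,750,000 bytes
- The actual render buffer holds 9,000,000 bytes. Since 9,000,000 = 1500 × 1500 × 4, this is consistent with a 4-channel (RGBA/ARGB) buffer, although a larger 3-channel render would produce the same size
- There is no handling of matplotlib version differences (tostring_rgb is deprecated)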
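The size mismatch can be checked with plain NumPy. This is a minimal sketch, not code from Octo: the helper name `channels_for` is hypothetical, and the 1500 × 1500 canvas size and the 9,000,000-byte buffer are taken from the error message.

```python
import numpy as np


def channels_for(buffer_size: int, width: int, height: int):
    """Return the channel count if buffer_size is an exact multiple of
    width * height, else None.

    Hypothetical helper for illustration only; not part of Octo or matplotlib.
    """
    pixels = width * height
    if pixels == 0 or buffer_size % pixels != 0:
        return None
    return buffer_size // pixels


# Sizes taken from the error message in this issue.
width, height = 1500, 1500
buffer = np.zeros(9_000_000, dtype=np.uint8)

try:
    buffer.reshape((height, width, 3))  # what line 550 attempts
    reshape_ok = True
except ValueError:
    reshape_ok = False  # 9,000,000 != 1500 * 1500 * 3

channels = channels_for(buffer.size, width, height)
print(reshape_ok, channels)
```

A result of 4 channels suggests an RGBA or ARGB buffer rather than RGB, but that is an inference from the arithmetic and has not been confirmed against the Octo code.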
Suggested Fix
# Current problematic code:
out_image = np.frombuffer(self.canvas.tostring_rgb(), dtype="uint8")
self.image = out_image.reshape((h, w, 3))

# Suggested fix: read the RGBA buffer and take the shape from the renderer,
# instead of calling the deprecated tostring_rgb() and guessing (h, w).
# Call this after self.canvas.draw().
rgba = np.asarray(self.canvas.buffer_rgba())  # shape (h, w, 4) on Agg canvases
self.image = rgba[..., :3].copy()  # drop alpha to get (h, w, 3)
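A canvas-to-array conversion that does not depend on a guessed shape can be tried outside Octo. The sketch below assumes the Agg backend and uses `FigureCanvasAgg.buffer_rgba()`. The function name `figure_to_rgb` and the small test figure are illustrative and not part of Octo.

```python
import numpy as np
from matplotlib.backends.backend_agg import FigureCanvasAgg
from matplotlib.figure import Figure


def figure_to_rgb(fig: Figure) -> np.ndarray:
    """Render fig with the Agg backend and return an (h, w, 3) uint8 array.

    Illustrative helper; the name is an assumption, not an Octo function.
    """
    canvas = FigureCanvasAgg(fig)  # attach an Agg canvas to the figure
    canvas.draw()  # rasterize before reading the buffer
    rgba = np.asarray(canvas.buffer_rgba())  # shape (h, w, 4)
    return rgba[..., :3].copy()  # drop alpha and own the memory


# Small stand-in figure: 3 x 2 inches at 100 dpi.
fig = Figure(figsize=(3, 2), dpi=100)
ax = fig.add_subplot()
ax.plot([0, 1, 2], [0, 1, 4])

image = figure_to_rgb(fig)
print(image.shape, image.dtype)
```

For this 3 × 2 inch figure at 100 dpi the result has shape (200, 300, 3). Because the shape comes from the buffer itself, no separate call to get_width_height() is needed and the channel count cannot disagree with the data.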
Impact Assessment
- Severity: Medium-to-high (prevents training from completing)
- Affects: All training using visualization callbacks
- Temporary Solution: Disable visualization, but lose debugging capabilities.
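A less drastic stopgap than disabling visualization entirely is to let a failing callback be skipped instead of aborting the run. This is a generic sketch under stated assumptions: `tolerant` and `broken_viz` are hypothetical names, and how callbacks are actually invoked in scripts/finetune.py is not shown here.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger(__name__)


def tolerant(callback: Callable[..., Any]) -> Callable[..., Optional[Any]]:
    """Wrap callback so exceptions are logged and None is returned
    instead of propagating into the training loop."""

    def wrapped(*args: Any, **kwargs: Any) -> Optional[Any]:
        try:
            return callback(*args, **kwargs)
        except Exception:
            logger.exception("Visualization callback failed; skipping this step")
            return None

    return wrapped


# Hypothetical stand-in for a visualization callback that hits the reshape error.
def broken_viz(step: int) -> dict:
    raise ValueError("cannot reshape array of size 9000000 into shape (1500,1500,3)")


safe_viz = tolerant(broken_viz)
result = safe_viz(5000)
print(result)
```

The caller then has to treat a None result as "nothing to log" for that step. This keeps training alive and records the traceback, but it hides the bug and does not fix it.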
Additional Information
This issue took me a weekend to debug. I hope the team can fix it quickly to avoid affecting other users.