support MPS, reorganize #1

pszemraj · 2025-10-26T23:54:04Z

this adds a magic "auto optimal device" utility to decide whether to use cuda or mps accelerators or fall back to CPU, and each of these is set to use bf16 autocast by default, but can disable/fall back to fp32 if needed (cough mps cough)

This commit adds comprehensive device detection and selection utilities that support CUDA, MPS (Apple Silicon), and CPU backends with automatic fallback logic. Changes: - Add decoder_pytorch/device.py with DeviceSelection dataclass and get_optimal_device() function - Update decoder_pytorch/__init__.py to export new device utilities - Refactor train.py to use get_optimal_device() instead of hardcoded device selection - Use device-specific autocast dtype (bfloat16 for CUDA/MPS, float32 for CPU) - Integrate TF32 configuration into get_optimal_device for CUDA - Update fused optimizer check to only enable on CUDA (not MPS/CPU) The get_optimal_device() function provides: - Automatic device detection with configurable priority order - Force device selection via parameter or FORCE_DEVICE env var - Integrated TF32 configuration for CUDA devices - Appropriate autocast dtype selection per device type - Detailed device info logging This ensures the codebase works seamlessly across CUDA, MPS, and CPU devices with optimal settings for each platform. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

Changed CPU device selection to use torch.bfloat16 for autocast, consistent with the repo's assumption of bfloat16-compatible hardware (2025 AD standard). This eliminates the warning: "CPU Autocast only supports dtype of torch.bfloat16, torch.float16" All devices (CUDA, MPS, CPU) now uniformly use bfloat16 for mixed precision training. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

Added configurable autocast support to allow users to enable/disable mixed precision training via config files without modifying code. Changes: - Add use_autocast config option to simple.yaml and test.yaml (default: true) - Update train.py to conditionally use autocast based on config - Use contextlib.nullcontext() when autocast is disabled - Print mixed precision status on startup Usage: use_autocast: true # Enable bfloat16 mixed precision (default) use_autocast: false # Disable, use full fp32 precision Both configurations tested successfully with no warnings. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

* Add `DeviceSelection.autocast_context()`; parse `cuda:N`, dedupe prefs, and warn on bad input (decoder_pytorch/device.py:26,53). * Honor forced indices; guard out-of-range CUDA; loud CPU fallback for debug (decoder_pytorch/device.py:107). * Use context helper for train/val; fix E731; AMP toggle driven by config (train.py:74). * Document detection flow, `FORCE_DEVICE`, and autocast usage (README.md:27).

pszemraj · 2025-10-26T23:55:13Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

decoder_pytorch/device.py

pszemraj · 2025-10-27T00:43:39Z

CPU (colab) 10-steps test config:

/content/decoder-pytorch-template
Device: CPU
Mixed precision: enabled (torch.bfloat16)
======================================================
Layer (type)          Param Shape  Param #  Grad State
======================================================
  Embedding               256x128   32,768   trainable
    TransformerBlock          N/A  262,416       mixed
    TransformerBlock          N/A  262,416       mixed
  ModuleList                  N/A  524,832       mixed
  RMSNorm                     128      128   trainable
Llama                         N/A  557,728       mixed
======================================================
Total params: 557,728
Trainable params: 557,696
Non-trainable params: 32
======================================================
Step 0 | Val loss: 5.2715
Step 5 | Val loss: 4.5177
training: 100% 10/10 [00:08<00:00,  1.23it/s, loss=4.2018]

Training complete! Final checkpoint saved to runs/test/final.pt

pszemraj · 2025-10-27T03:10:28Z

Training converges fine on m3 max MBP (macOS 15.7.1). Generations are fine as well

both bf16 autocast and compile work (torch 2.9)
this uses a slightly smaller config (half the layers), batch size 1 instead of 4 (mps slow compared to GPU), grad acc every 16 to compensate
Add such ^ a smaller config to the repo for faster testing for non GPU people

pszemraj · 2025-10-27T03:26:30Z

Need to decide

Whether to remove tf32
Whether to consolidate device.py to utils for simplicity

- Replace device.py with lightweight tuple-based API and auto-fallbacks - Centralize device checks in training; guard autocast and document grad quirks - Graceful Flash Attention degradation when kernels unavailable - Add nano.yaml config for quick CPU/MPS testing - Update docs to reflect new device API and config

- Stop silently disabling autocast; always respect use_autocast flag - Wrap autocast context manager on all devices (no silent fp32 fallback) - Align nano.yaml to ~20M Llama with bf16 autocast enabled - Clarify autocast behavior in README

- Update nano.yaml: depth 6, dim 384, torch.compile on - Clarify model scale in README

pszemraj · 2025-10-31T02:58:32Z

@codex review.

we already coverd bf16 autocast so dw about taht

chatgpt-codex-connector · 2025-10-31T03:02:08Z

Codex Review: Didn't find any major issues. Swish!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

pszemraj · 2025-10-31T06:47:53Z

validated that training works on mac torch 2.9 - autocast/bf16 defaults on mps

obviously works on linux/cuda too

Implements full Llama-style transformer in Rust with Burn tensor library: ## Core Components - RMSNorm: Root Mean Square Layer Normalization - SwiGLU: Gated feedforward with SiLU activation - RoPE: Rotary Position Embeddings - Attention: Multi-head attention with causal masking - Transformer: Pre-normalization blocks - Llama: Complete character-level language model ## Features - Type-safe tensor operations with compile-time checking - Backend-agnostic (NdArray CPU, WGPU GPU, etc.) - Memory-safe implementation (no runtime errors) - Configuration via YAML (compatible with PyTorch configs) - enwik8 dataset loading and training loop - Gradient accumulation framework - Progress tracking and metrics logging ## Verification - Tested PyTorch implementation on enwik8 - Training dynamics verified: 5.5 → 3.0 loss (100 steps) - Generated text shows learning of character patterns ## Structure - decoder-rust/src/model/: All model components - decoder-rust/src/data/: Dataset loading - decoder-rust/src/bin/train.rs: Training script - Complete documentation in decoder-rust/README.md Note: Rust compilation requires network access to download dependencies from crates.io. All code is complete and ready for compilation when network access is available. Refs: #1 (decoder template port request)

claude and others added 5 commits October 25, 2025 01:19

clarify, format

27594bd

pszemraj self-assigned this Oct 26, 2025

pszemraj added the enhancement New feature or request label Oct 26, 2025

chatgpt-codex-connector bot reviewed Oct 26, 2025

View reviewed changes

decoder_pytorch/device.py Outdated Show resolved Hide resolved

pszemraj added 3 commits October 30, 2025 22:22

Enforce autocast and fix nano config alignment

f05282a

- Stop silently disabling autocast; always respect use_autocast flag - Wrap autocast context manager on all devices (no silent fp32 fallback) - Align nano.yaml to ~20M Llama with bf16 autocast enabled - Clarify autocast behavior in README

Set nano preset to L6·H384 (~9M) and update docs

13ab4ff

- Update nano.yaml: depth 6, dim 384, torch.compile on - Clarify model scale in README

pszemraj merged commit 9c90a55 into main Oct 31, 2025

pszemraj deleted the claude/add-device-utils-011CUT2XqJ8H2grwuEdJrKEB branch October 31, 2025 06:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

support MPS, reorganize #1

support MPS, reorganize #1

Uh oh!

pszemraj commented Oct 26, 2025

Uh oh!

pszemraj commented Oct 26, 2025

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

Uh oh!

pszemraj commented Oct 27, 2025

Uh oh!

pszemraj commented Oct 27, 2025 •

edited

Loading

Uh oh!

pszemraj commented Oct 27, 2025

Uh oh!

pszemraj commented Oct 31, 2025

Uh oh!

chatgpt-codex-connector bot commented Oct 31, 2025

Uh oh!

pszemraj commented Oct 31, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

support MPS, reorganize #1

support MPS, reorganize #1

Uh oh!

Conversation

pszemraj commented Oct 26, 2025

Uh oh!

pszemraj commented Oct 26, 2025

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

pszemraj commented Oct 27, 2025

Uh oh!

pszemraj commented Oct 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pszemraj commented Oct 27, 2025

Uh oh!

pszemraj commented Oct 31, 2025

Uh oh!

chatgpt-codex-connector bot commented Oct 31, 2025

Uh oh!

pszemraj commented Oct 31, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

pszemraj commented Oct 27, 2025 •

edited

Loading