Skip to content

Conversation

@pszemraj
Copy link
Owner

this adds a magic "auto optimal device" utility to decide whether to use cuda or mps accelerators or fall back to CPU, and each of these is set to use bf16 autocast by default, but can disable/fall back to fp32 if needed (cough mps cough)

claude and others added 5 commits October 25, 2025 01:19
This commit adds comprehensive device detection and selection utilities
that support CUDA, MPS (Apple Silicon), and CPU backends with automatic
fallback logic.

Changes:
- Add decoder_pytorch/device.py with DeviceSelection dataclass and
  get_optimal_device() function
- Update decoder_pytorch/__init__.py to export new device utilities
- Refactor train.py to use get_optimal_device() instead of hardcoded
  device selection
- Use device-specific autocast dtype (bfloat16 for CUDA/MPS, float32
  for CPU)
- Integrate TF32 configuration into get_optimal_device for CUDA
- Update fused optimizer check to only enable on CUDA (not MPS/CPU)

The get_optimal_device() function provides:
- Automatic device detection with configurable priority order
- Force device selection via parameter or FORCE_DEVICE env var
- Integrated TF32 configuration for CUDA devices
- Appropriate autocast dtype selection per device type
- Detailed device info logging

This ensures the codebase works seamlessly across CUDA, MPS, and CPU
devices with optimal settings for each platform.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Changed CPU device selection to use torch.bfloat16 for autocast,
consistent with the repo's assumption of bfloat16-compatible hardware
(2025 AD standard).

This eliminates the warning:
"CPU Autocast only supports dtype of torch.bfloat16, torch.float16"

All devices (CUDA, MPS, CPU) now uniformly use bfloat16 for mixed
precision training.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Added configurable autocast support to allow users to enable/disable
mixed precision training via config files without modifying code.

Changes:
- Add use_autocast config option to simple.yaml and test.yaml (default: true)
- Update train.py to conditionally use autocast based on config
- Use contextlib.nullcontext() when autocast is disabled
- Print mixed precision status on startup

Usage:
  use_autocast: true   # Enable bfloat16 mixed precision (default)
  use_autocast: false  # Disable, use full fp32 precision

Both configurations tested successfully with no warnings.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
* Add `DeviceSelection.autocast_context()`; parse `cuda:N`, dedupe prefs, and warn on bad input (decoder_pytorch/device.py:26,53).
* Honor forced indices; guard out-of-range CUDA; loud CPU fallback for debug (decoder_pytorch/device.py:107).
* Use context helper for train/val; fix E731; AMP toggle driven by config (train.py:74).
* Document detection flow, `FORCE_DEVICE`, and autocast usage (README.md:27).
@pszemraj pszemraj self-assigned this Oct 26, 2025
@pszemraj pszemraj added the enhancement New feature or request label Oct 26, 2025
@pszemraj
Copy link
Owner Author

@codex review

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@pszemraj
Copy link
Owner Author

CPU (colab) 10-steps test config:

/content/decoder-pytorch-template
Device: CPU
Mixed precision: enabled (torch.bfloat16)
======================================================
Layer (type)          Param Shape  Param #  Grad State
======================================================
  Embedding               256x128   32,768   trainable
    TransformerBlock          N/A  262,416       mixed
    TransformerBlock          N/A  262,416       mixed
  ModuleList                  N/A  524,832       mixed
  RMSNorm                     128      128   trainable
Llama                         N/A  557,728       mixed
======================================================
Total params: 557,728
Trainable params: 557,696
Non-trainable params: 32
======================================================
Step 0 | Val loss: 5.2715
Step 5 | Val loss: 4.5177
training: 100% 10/10 [00:08<00:00,  1.23it/s, loss=4.2018]

Training complete! Final checkpoint saved to runs/test/final.pt

@pszemraj
Copy link
Owner Author

pszemraj commented Oct 27, 2025

Training converges fine on m3 max MBP (macOS 15.7.1). Generations are fine as well

  • both bf16 autocast and compile work (torch 2.9)

  • this uses a slightly smaller config (half the layers), batch size 1 instead of 4 (mps slow compared to GPU), grad acc every 16 to compensate

  • Add such ^ a smaller config to the repo for faster testing for non GPU people

@pszemraj
Copy link
Owner Author

Need to decide

  • Whether to remove tf32
  • Whether to consolidate device.py to utils for simplicity

- Replace device.py with lightweight tuple-based API and auto-fallbacks
- Centralize device checks in training; guard autocast and document grad quirks
- Graceful Flash Attention degradation when kernels unavailable
- Add nano.yaml config for quick CPU/MPS testing
- Update docs to reflect new device API and config
- Stop silently disabling autocast; always respect use_autocast flag
- Wrap autocast context manager on all devices (no silent fp32 fallback)
- Align nano.yaml to ~20M Llama with bf16 autocast enabled
- Clarify autocast behavior in README
- Update nano.yaml: depth 6, dim 384, torch.compile on
- Clarify model scale in README
@pszemraj
Copy link
Owner Author

@codex review.

we already coverd bf16 autocast so dw about taht

@chatgpt-codex-connector
Copy link

Codex Review: Didn't find any major issues. Swish!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@pszemraj
Copy link
Owner Author

validated that training works on mac torch 2.9 - autocast/bf16 defaults on mps

obviously works on linux/cuda too

@pszemraj pszemraj merged commit 9c90a55 into main Oct 31, 2025
@pszemraj pszemraj deleted the claude/add-device-utils-011CUT2XqJ8H2grwuEdJrKEB branch October 31, 2025 06:48
pszemraj pushed a commit that referenced this pull request Nov 9, 2025
Implements full Llama-style transformer in Rust with Burn tensor library:

## Core Components
- RMSNorm: Root Mean Square Layer Normalization
- SwiGLU: Gated feedforward with SiLU activation
- RoPE: Rotary Position Embeddings
- Attention: Multi-head attention with causal masking
- Transformer: Pre-normalization blocks
- Llama: Complete character-level language model

## Features
- Type-safe tensor operations with compile-time checking
- Backend-agnostic (NdArray CPU, WGPU GPU, etc.)
- Memory-safe implementation (no runtime errors)
- Configuration via YAML (compatible with PyTorch configs)
- enwik8 dataset loading and training loop
- Gradient accumulation framework
- Progress tracking and metrics logging

## Verification
- Tested PyTorch implementation on enwik8
- Training dynamics verified: 5.5 → 3.0 loss (100 steps)
- Generated text shows learning of character patterns

## Structure
- decoder-rust/src/model/: All model components
- decoder-rust/src/data/: Dataset loading
- decoder-rust/src/bin/train.rs: Training script
- Complete documentation in decoder-rust/README.md

Note: Rust compilation requires network access to download dependencies
from crates.io. All code is complete and ready for compilation when
network access is available.

Refs: #1 (decoder template port request)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants