Releases: joe2gaan/localaiservers

v0.2.0 — ROCm7.2 Dense/MoE GFX906 Active Contracts

21 Jun 21:24

v0.2.0-gfx906-rocm72-dense-moe - ROCm7.2 Dense/MoE GFX906 Active Contracts

Release Boundary

This is the published v0.2.0 release boundary for the ROCm7.2 Dense/MoE
GFX906 active-contract milestone.

v0.1.0 remains the earlier published release baseline for Qwen3.6/GFX906/MI50
TP4. GitHub Releases are the canonical record of what each published release
claims. Docker Hub remains an evergreen artifact distribution channel and
should not be treated as the latest benchmark announcement.

qwen36-gfx906/README.md remains the canonical deployment and reproducibility
package for this release.

These are benchmark and reproducibility artifacts, not general workload
guarantees. The same ROCm7.2 image covers the active dense and MoE contracts
with model-specific environment settings and overlays.

Runtime Image

Image tag:

joe2gaan/localaiservers:qwen36-gfx906-rocm72-dense-moe-runtime-archive-0a2dbd6b7f0b

Docker Hub manifest digest:

sha256:8c380e9ca48943d8617de5a2e2eaf32a26dcc2c341e4b4f4f8c45294a72b8f1e
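Because the tag is mutable and the manifest digest is not, pulling by digest is the safer way to get exactly the image described here. The sketch below builds a digest-pinned reference from the values above. The helper name pinned_reference is hypothetical and is not part of the release tooling.

```python
import re

# Values copied from this release note.
REPOSITORY = "joe2gaan/localaiservers"
MANIFEST_DIGEST = (
    "sha256:8c380e9ca48943d8617de5a2e2eaf32a26dcc2c341e4b4f4f8c45294a72b8f1e"
)

# A sha256 digest is the literal prefix plus 64 lowercase hex characters.
_DIGEST_RE = re.compile(r"sha256:[0-9a-f]{64}")


def pinned_reference(repository: str, digest: str) -> str:
    """Return a digest-pinned image reference of the form repo@sha256:...

    Hypothetical helper for illustration only.
    """
    if not _DIGEST_RE.fullmatch(digest):
        raise ValueError(f"not a sha256 digest: {digest!r}")
    return f"{repository}@{digest}"


if __name__ == "__main__":
    print(pinned_reference(REPOSITORY, MANIFEST_DIGEST))
```

The printed reference can be passed to docker pull in place of the tag, so the pull fails rather than silently resolving to a different image if the tag is ever moved.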

Active Contracts

Verified portable performance at MAX_MODEL_LEN=131072, with eight warmups run
before measurement.

Dense 27B TP8 Full-BAR/P2P-On

  • Profile: dense27b_tp8_fullbar_p2pon
  • Host: .20
  • Strict backend TPS: 69.514
  • c1_2000 backend TPS: 70.347
  • c1_10000 backend TPS: 66.069
  • Strict gate: valid
  • ai-info 10K gate: cleared for the recorded active-contract lane

Qwen3.6 35B-A3B MoE TP8 Full-BAR/P2P-On

  • Profile: moe35b_tp8_fullbar_p2pon
  • Host: .30
  • Strict backend TPS: 94.907
  • c1_2000 backend TPS: 97.028
  • c1_10000 backend TPS: 91.290
  • Strict gate: valid
  • Publication status: meets the strict-valid MoE publication bar

Qwen3.6 35B-A3B MoE TP4 Full-BAR/P2P-On

  • Profile: moe35b_tp4_fullbar_p2pon
  • Host: .30
  • Strict prompt: invalid/runaway
  • c1_2000 backend TPS: 116.146
  • c1_10000 backend TPS: 109.283
  • Publication status: capped fixed-token result only, not a strict-valid
    publication claim

The TP4 strict prompt did not stop after more than 60K tokens, so TP4 remains
capped-only until the strict runaway behavior is resolved.
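To compare the three contracts side by side, the figures above can be tabulated as in the sketch below. It records each profile's reported backend TPS, marks TP4 as capped-only because its strict prompt was invalid, and computes the ratio of c1_10000 to c1_2000 throughput. That ratio is an illustrative derived number, not a metric the project publishes, and the Contract class is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Contract:
    """One active contract, using the TPS figures reported in this release."""

    profile: str
    strict_tps: Optional[float]  # None when the strict prompt was invalid
    c1_2000_tps: float
    c1_10000_tps: float

    @property
    def strict_valid(self) -> bool:
        return self.strict_tps is not None

    @property
    def retention_10k(self) -> float:
        # Illustrative only: share of c1_2000 throughput kept at c1_10000.
        return self.c1_10000_tps / self.c1_2000_tps


CONTRACTS = [
    Contract("dense27b_tp8_fullbar_p2pon", 69.514, 70.347, 66.069),
    Contract("moe35b_tp8_fullbar_p2pon", 94.907, 97.028, 91.290),
    Contract("moe35b_tp4_fullbar_p2pon", None, 116.146, 109.283),
]

if __name__ == "__main__":
    for c in CONTRACTS:
        status = "strict-valid" if c.strict_valid else "capped-only"
        print(f"{c.profile:28s} {status:12s} 10K/2K = {c.retention_10k:.3f}")
```

Keeping the strict result as an optional field makes the publication rule explicit: a profile without a valid strict run can still report capped numbers but is never listed as strict-valid.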

Platform Remediation

The full-BAR/P2P-on lane required platform remediation:

  • Official AMD VBIOS standardization was required.
  • No modified BIOS images were used.
  • amdgpu source patching was required for the full-BAR/P2P-on lane.
  • This is not a user instruction to flash cards.
  • No BIOS or VBIOS binaries are redistributed.
  • No warranty, certification, procurement support, resale support, or hardware
    recommendation is implied.

QC Supporting Proof

The MI50/GFX906 VRAM QC field-check tool remains public under
tools/gfx906-mi50-vram-qc/.

Sanitized supporting reports:

  • The .20 positive-control report passed 1 GiB smoke, 30 GiB single-device,
    and 30 GiB all-device checks.
  • The 16GB GFX906-class negative-control report passed the 1 GiB smoke check
    and failed the 30 GiB check as expected with out-of-memory.
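The reasoning behind the two control reports can be written as a simple expectation: an allocation check should pass only when the requested size fits in device VRAM. The sketch below encodes that expectation. It does not reproduce the QC tool's implementation, the function name is hypothetical, and the 32 GiB capacity used for the positive control is an assumption, since this release does not state the .20 host's per-card VRAM.

```python
def expected_to_pass(request_gib: float, vram_gib: float) -> bool:
    """Expected outcome of an allocation check (illustrative assumption).

    A request is expected to succeed only if it is smaller than the
    device's VRAM; real devices also reserve some memory, which this
    sketch ignores.
    """
    return request_gib < vram_gib


# Assumed capacities in GiB; the 32 GiB figure is an assumption.
POSITIVE_CONTROL_VRAM_GIB = 32
NEGATIVE_CONTROL_VRAM_GIB = 16

if __name__ == "__main__":
    for label, vram in (
        ("positive control", POSITIVE_CONTROL_VRAM_GIB),
        ("negative control", NEGATIVE_CONTROL_VRAM_GIB),
    ):
        for request in (1, 30):
            outcome = "pass" if expected_to_pass(request, vram) else "fail"
            print(f"{label}: {request} GiB check expected to {outcome}")
```

Under these assumptions, a card that passes the 1 GiB smoke check but fails the 30 GiB check behaves like a 16GB-class device, which is the pattern the negative-control report records.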

This QC material is educational hardware-verification methodology only, not
certification or warranty.

v0.1.0-gfx906-qwen36-mi50 — Reproducible GFX906 Qwen3.6 MI50 Runtime

15 Jun 01:14

Summary

This release summarizes the first public LocalAIServers GFX906 Qwen3.6 MI50 TP4
reproducibility artifact. It is the published v0.1.0 release record for the 90+ TPS
sustained 10K publication baseline.

Public-Benefit Purpose

LocalAIServers is a 501(c)(3) public charity providing public education and open-source
infrastructure for locally hosted AI systems. This artifact helps readers inspect a
reproducible local AI runtime path on affordable AI research hardware.

Hardware Target

  • 4x AMD Instinct MI50 32GB.
  • GFX906.
  • Tensor parallel size: 4.

Runtime Target

  • Model: Qwen/Qwen3.6-35B-A3B.
  • Runtime: vLLM on ROCm/GFX906.
  • Canonical technical deployment package:
    qwen36-gfx906/README.md.

Docker/Runtime Artifact

joe2gaan/localaiservers:qwen36-gfx906-c1-topk8-runtime-archive-aa34cb675f83

Docker Hub digest:

sha256:f5e69ee127b766960e386e0e4eda8e26c399bd02f57c494847cb9a92ce04d8ac

Source runtime archive:

aa34cb675f83ff6cade31cbbb357b1c31d793bee18da491f501d7c39fda3612a

This archive SHA is the strict byte-for-byte source-reproduction target for the
published runtime. Source rebuilds should be cited as release-reproduction evidence
only when the exported Docker archive matches this SHA. The live main deploy script
defaults to BYTE_FOR_BYTE_VALIDATION_MODE=auto; set
EXPECTED_REPRO_DOCKER_ARCHIVE_SHA256=aa34cb675f83ff6cade31cbbb357b1c31d793bee18da491f501d7c39fda3612a
when source-reproducing this release so a mismatch fails the source-build path. Use
BYTE_FOR_BYTE_VALIDATION_MODE=0 only for non-canonical local deploys where no
release-reproduction claim is being made. The prebuilt Docker Hub image path is
validated separately by the manifest digest above.
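An independent check of an exported archive against the published SHA can be sketched as follows. This is a minimal standalone example using the standard library, not the deploy script's own validation logic. The function names and the archive path passed on the command line are hypothetical.

```python
import hashlib
import sys

# Source runtime archive SHA-256 published for v0.1.0 (copied from above).
EXPECTED_SHA256 = (
    "aa34cb675f83ff6cade31cbbb357b1c31d793bee18da491f501d7c39fda3612a"
)


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_matches(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """Return True only when the archive hash equals the expected SHA."""
    return sha256_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Usage: python check_archive.py <path-to-exported-docker-archive>
    archive_path = sys.argv[1]
    if archive_matches(archive_path):
        print("archive matches the published SHA")
    else:
        print("archive does NOT match; do not cite as release reproduction")
        sys.exit(1)
```

Exiting non-zero on a mismatch mirrors the intent described above: a rebuild that does not match the published SHA should fail loudly rather than be cited as reproduction evidence.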

Reference Results

Benchmark artifact:
benchmarks/qwen36-gfx906-mi50-tp4/README.md.

c1_10000: 90+ TPS sustained backend decode publication baseline

This release should be cited as the 90+ TPS sustained 10K publication artifact. It is
not a latest-result announcement. LocalAIServers has reached 95+ TPS on Qwen3.6 10K
decode, but that version is outside this v0.1.0 release and should be cited only after
it is published separately. Use the canonical technical deployment package for
reproduction commands and the exact run context for this release lane.

Source-level GFX906 preservation artifacts

These artifacts document source-level kernel/runtime preservation work, including MoE
fastpath work, dense RowParallel/RCCL collective-boundary research, rejected paths,
diagnostic lanes, and active source directions.

Reproducibility Contract

The canonical deployment README preserves Docker image identity, Docker Hub digest,
archive hashes, source pins, deploy commands, run commands, benchmark commands, hardware
requirements, and limitations.

Limitations

  • Scoped to the documented Qwen3.6 GFX906 MI50 TP4 runtime lane.
  • Not a claim about all prompts, workloads, GFX906 systems, or officially supported
    performance.
  • Not a public cloud service or direct public machine-access program.
  • Dense model results should not be inferred from this MoE benchmark.

Next Milestones

  • Publish more benchmark artifacts.
  • Expand QC methodology and hardware verification records.
  • Continue source-level GFX906 preservation work.