Record: MLP3x + Int8 Tok Emb + Grouped LZMA + Sliding Window (val_bpb=1.1623) #160
ChaseWNorton wants to merge 1 commit into openai:main from
MLP3x + Int8 Tok Emb + Grouped LZMA + Sliding Window
This submission starts from the public 10-minute baseline family and makes three main changes:

- The MLP expansion factor is raised from 2x to 3x.
- Evaluation uses a sliding window with seq_len=2048.
- The token embedding is stored at int8 with grouped LZMA compression in the exported artifact.

The model was trained for the official 600s wallclock limit on 8x H100 SXM, then repacked into a submission-valid artifact under the 16,000,000-byte limit.

Model
```
VOCAB_SIZE=1024
NUM_LAYERS=9
MODEL_DIM=512
NUM_HEADS=8
NUM_KV_HEADS=4
MLP_MULT=3
TIE_EMBEDDINGS=1
```

This keeps the backbone close to the baseline while spending more of the parameter budget on the MLP.
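As a rough cross-check on where that budget goes, a back-of-the-envelope parameter count from the configuration above (an estimate only: it ignores norm scales and other small scalar parameters, so the script's exact count will differ slightly):

```python
# Back-of-the-envelope parameter count for the configuration above.
VOCAB_SIZE, NUM_LAYERS, MODEL_DIM = 1024, 9, 512
NUM_HEADS, NUM_KV_HEADS, MLP_MULT = 8, 4, 3
head_dim = MODEL_DIM // NUM_HEADS  # 64

embed = VOCAB_SIZE * MODEL_DIM                     # tied with the LM head
q_proj = MODEL_DIM * NUM_HEADS * head_dim          # full-width queries
kv_proj = 2 * MODEL_DIM * NUM_KV_HEADS * head_dim  # grouped K/V (GQA)
o_proj = MODEL_DIM * MODEL_DIM
mlp = 2 * MODEL_DIM * (MLP_MULT * MODEL_DIM)       # up + down projections

total = embed + NUM_LAYERS * (q_proj + kv_proj + o_proj + mlp)
print(f"{total:,}")  # 21,757,952
```

At fp32 that is roughly 87 MB, in the same ballpark as the 86099351-byte raw checkpoint logged below.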
Training Setup
The timed run used:
```
TRAIN_BATCH_TOKENS=786432
TRAIN_SEQ_LEN=2048
ITERATIONS=20000
MAX_WALLCLOCK_SECONDS=600
WARMUP_STEPS=20
WARMDOWN_ITERS=3000
TIED_EMBED_LR=0.03
MATRIX_LR=0.02
SCALAR_LR=0.02
MUON_MOMENTUM=0.99
```

Logged optimizer summary:

```
tie_embeddings:True embed_lr:0.03 head_lr:0.0 matrix_lr:0.02 scalar_lr:0.02
```

Logged attention summary:

```
attention_mode:gqa num_heads:8 num_kv_heads:4
```

The script includes QAT support, but this specific timed run stopped before QAT activation:

```
qat_enabled:True qat_start_frac:0.500 qat_start_step:10000
```

The run ended at step 7534, before the QAT start step, so the final reported result comes from post-training repacking of the timed checkpoint rather than from a checkpoint that had entered the QAT phase.
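The schedule arithmetic implied by these settings can be sketched as follows. The QAT start step follows directly from the logged fraction; the trapezoidal LR shape is an assumption (a common choice in this baseline family), not something the logs above state explicitly:

```python
ITERATIONS, WARMUP_STEPS, WARMDOWN_ITERS = 20000, 20, 3000
QAT_START_FRAC, STOP_STEP = 0.5, 7534

# QAT would have switched on halfway through the planned 20000 steps,
# but the wallclock cap stopped the run first.
qat_start_step = int(QAT_START_FRAC * ITERATIONS)
print(qat_start_step, STOP_STEP < qat_start_step)  # 10000 True

def lr_scale(step: int) -> float:
    """Assumed trapezoidal multiplier: linear warmup, flat, linear warmdown."""
    if step < WARMUP_STEPS:
        return (step + 1) / WARMUP_STEPS
    if step >= ITERATIONS - WARMDOWN_ITERS:
        return (ITERATIONS - step) / WARMDOWN_ITERS
    return 1.0

print(lr_scale(0), lr_scale(10000), lr_scale(18500))  # 0.05 1.0 0.5
```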
Timed Training Result
The official training run stopped at the wallclock cap:
```
step:7534/20000 train_time:600120ms step_avg:79.65ms
```

Validation at stop:

```
val_loss=1.9844 val_bpb=1.1753
```

Other logged details:

```
16738 MiB
16944 MiB
86099351 bytes
```

Export / Compression
The first export evaluated from the timed run used:
- fp16 token embedding passthrough
- fp16 passthrough for the last two c_k weights
- zlib compression

That version evaluated very well but was not submission-valid because it was over the size cap:

```
16639274 bytes  val_bpb=1.18101095 (standard)  val_bpb=1.16018011 (sliding window)
```

The final submission-valid repack uses:

- QGv3 serialization
- lzma compression
- tok_emb.weight and the fp16 passthrough tensors

This change was enough to get back under the limit while preserving almost all of the sliding-window gain.
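A minimal sketch of the repack idea: group-wise low-bit quantization of a weight tensor, then LZMA over the packed codes. This is an illustrative stand-in, not the actual QGv3 serializer; the group size, min/max rounding scheme, and one-code-per-byte layout are all assumptions:

```python
import lzma
import random
import struct

GROUP = 64               # assumed quantization group size
BITS = 6                 # 6-bit codes for non-embedding tensors (per the notes)
LEVELS = (1 << BITS) - 1

def quantize_groups(w):
    """Group-wise min/max quantization to integer codes plus per-group (lo, scale)."""
    codes, scales = [], []
    for i in range(0, len(w), GROUP):
        g = w[i:i + GROUP]
        lo, hi = min(g), max(g)
        scale = (hi - lo) / LEVELS or 1.0
        scales.append((lo, scale))
        codes.extend(round((x - lo) / scale) for x in g)
    return codes, scales

random.seed(0)
weight = [random.gauss(0.0, 0.02) for _ in range(256 * 1024)]

codes, scales = quantize_groups(weight)
raw_fp32 = struct.pack(f"{len(weight)}f", *weight)
packed = lzma.compress(bytes(codes), preset=6)  # one byte per 6-bit code

print(len(raw_fp32), len(packed))  # packed is far smaller than the fp32 bytes
```

Bit-packing the codes (6 bits each) before compression would shrink the artifact further; this sketch keeps one code per byte and lets LZMA's entropy coder absorb the slack.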
Final Submission Artifact
```
final_model.mixed_tok8_lzma.ptz  lzma  15845980  64924159  10904
```

This is submission-valid under the 16,000,000-byte cap.

Final Scores
Exact post-pack scores for the final under-cap artifact:
standard eval:
```
val_loss=1.99867543 val_bpb=1.18372817
```

sliding-window eval with seq_len=2048, stride=256:

```
val_loss=1.96250243 val_bpb=1.16230441
```

The submission score is therefore:

```
1.16230441 val_bpb
```

Notes
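One way to sanity-check the paired numbers above: assuming the usual conversion val_bpb = (val_loss / ln 2) / bytes_per_token (an assumption about how the eval harness computes bpb, not something stated in the logs), both logged pairs should imply the same bytes-per-token ratio for the validation set:

```python
import math

# Implied bytes-per-token ratio from each logged (val_loss, val_bpb) pair.
pairs = [(1.99867543, 1.18372817),   # standard eval
         (1.96250243, 1.16230441)]   # sliding-window eval
ratios = [loss / (math.log(2) * bpb) for loss, bpb in pairs]
print([round(r, 4) for r in ratios])  # both ≈ 2.4359
```

The two ratios agree to about four decimal places, which is consistent with both scores being computed over the same validation bytes.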
- The token embedding is kept at 8-bit while quantizing the rest of the model to 6-bit.
- The final artifact is a plain torch.save artifact.
- Weights are restored to fp32 to match the trained checkpoint behavior after reload.

Included Files
- README.md - this writeup
- submission.json - submission metadata
- train.log - exact log from the timed 8x H100 SXM run
- train_gpt.py - code snapshot for the submission artifact and evaluation path
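For reference, the sliding-window scoring scheme implied by seq_len=2048 / stride=256 can be sketched as below. The exact mechanics in train_gpt.py may differ; this assumes each window after the first scores only its final stride positions, so every scored token sees long left context:

```python
def window_spans(n_tokens, seq_len=2048, stride=256):
    """Yield (window_start, window_end, first_scored_offset) for each window."""
    start = 0
    while start + seq_len <= n_tokens:
        # First window scores everything; later windows score only the
        # final `stride` positions, which have full left context.
        scored_from = 0 if start == 0 else seq_len - stride
        yield start, start + seq_len, scored_from
        start += stride

spans = list(window_spans(4096))
scored = sum(end - (start + off) for start, end, off in spans)
print(len(spans), scored)  # 9 windows, all 4096 tokens scored exactly once
```

Compared with chunked evaluation at stride equal to seq_len, most scored tokens see far more context, which is where the sliding-window bpb gain over the standard eval comes from.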