A ~110M-parameter language model trained from scratch using the DeepSeek-V4 architecture. This repo contains all the code, configs, and the tokenizer used to pretrain and fine-tune the model.
bf16 NaN: The model produces NaN in bf16 at this small scale. Use fp32 for inference and training. This is due to the Hyper-Connections architecture producing values that overflow bf16 range.
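One way to follow this advice is to cast the model to fp32 before running it and to check that the outputs are finite. The sketch below is illustrative only: `TinyModel` is a hypothetical stand-in for the repo's real model class, and the helper name `run_fp32_checked` is an assumption, not part of this repo.

```python
import torch
from torch import nn


class TinyModel(nn.Module):
    """Hypothetical stand-in; substitute the repo's real model class."""

    def __init__(self, vocab_size: int = 32, dim: int = 8) -> None:
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.head(self.embed(token_ids))


def run_fp32_checked(model: nn.Module, token_ids: torch.Tensor) -> torch.Tensor:
    """Run a forward pass in fp32 and fail loudly on NaN or inf outputs."""
    model = model.to(torch.float32).eval()
    with torch.no_grad():
        logits = model(token_ids)
    if not torch.isfinite(logits).all():
        raise FloatingPointError("non-finite values in model output")
    return logits


torch.manual_seed(0)
logits = run_fp32_checked(TinyModel(), torch.tensor([[1, 2, 3]]))
```

The same finiteness check can be run on the loss during training to catch a NaN at the step where it first appears.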
from_pretrained quirk: The custom architecture causes from_pretrained to re-initialize some weights. Load the weights manually with load_state_dict instead (see model cards for examples).
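A minimal sketch of manual loading, assuming the checkpoint is a plain state dict saved with `torch.save`. `TinyModel`, the helper `load_weights_manually`, and the file name `weights.pt` are hypothetical and used only for illustration; the model cards remain the authoritative reference for the actual checkpoint layout.

```python
import os
import tempfile

import torch
from torch import nn


class TinyModel(nn.Module):
    """Hypothetical stand-in; substitute the repo's real model class."""

    def __init__(self) -> None:
        super().__init__()
        self.embed = nn.Embedding(32, 8)
        self.head = nn.Linear(8, 32)


def load_weights_manually(model: nn.Module, checkpoint_path: str) -> nn.Module:
    """Copy every tensor from a saved state dict into an already-built model."""
    state_dict = torch.load(checkpoint_path, map_location="cpu", weights_only=True)
    # strict=True raises if any key is missing or unexpected, so no weight
    # can silently keep its random initialization.
    model.load_state_dict(state_dict, strict=True)
    return model


# Round trip: save one model's weights, then load them into a fresh instance.
source = TinyModel()
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "weights.pt")  # illustrative file name
    torch.save(source.state_dict(), path)
    restored = load_weights_manually(TinyModel(), path)

weights_match = all(
    torch.equal(tensor, restored.state_dict()[name])
    for name, tensor in source.state_dict().items()
)
```

Keeping `strict=True` is a deliberate choice here: a key mismatch becomes an error instead of a partially initialized model.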
Large vocab / small model: The 129K vocab embedding table accounts for 37% of all parameters, which limits the capacity left for language modeling.
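The arithmetic behind that figure can be checked with the rounded numbers quoted above. The implied embedding width is only a back-of-the-envelope estimate: it assumes the 37% refers to a single vocab-by-width table, which the text does not state.

```python
TOTAL_PARAMS = 110_000_000   # "~110M" as quoted above, rounded
VOCAB_SIZE = 129_000         # "129K" as quoted above, rounded
EMBEDDING_SHARE = 0.37       # "37% of all parameters"

embedding_params = EMBEDDING_SHARE * TOTAL_PARAMS
remaining_params = TOTAL_PARAMS - embedding_params
# Width implied if the 37% is one vocab_size x width table (an assumption).
implied_width = embedding_params / VOCAB_SIZE

print(f"embedding table: ~{embedding_params / 1e6:.1f}M parameters")
print(f"everything else: ~{remaining_params / 1e6:.1f}M parameters")
print(f"implied embedding width: ~{implied_width:.0f}")
```

With these inputs the embedding table comes to roughly 40.7M parameters, leaving about 69.3M for the rest of the model.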