ConvGPT

vLLM and SGLang module for ConvGPT.

Overview

ConvGPT introduces a novel approach to Large Language Model compression by integrating 2D convolutional networks directly into the pre-training architecture, rather than relying on post-training quantization or pruning. Designed specifically for Mobile/Edge (SLM) use cases, it achieves significant parameter reduction while maintaining high reasoning capabilities.

Convolutional Embedding Compression: Unlike standard Transformers, which keep a constant hidden size throughout, ConvGPT uses a Conv2D + Average Pooling layer to compress the input hidden state vector by a factor of 9 before it enters the residual stream. This lets the model keep high-dimensional information in the embedding layer and prediction head while the decoder layers operate on a much smaller, cheaper vector.
Causal Masking in 2D: The architecture applies specialized padding and reshaping during the convolution steps to strictly preserve autoregressive causality. This eliminates "token leakage" (look-ahead bias), so the model remains robust during generation and avoids the test-time degradation often seen in naive convolutional language models.
Extreme Parameter Efficiency:
Current Model: 164M parameters (comparable performance to a standard 722M parameter architecture), a ~4.4x size reduction.
Scaling Potential: The architecture scales efficiently; a configuration with hidden_size=2048 results in just 266M parameters compared to a 1.7B parameter baseline (a ~6.5x reduction).
Performance-to-Size Ratio: Trained on 250B tokens (PleIAs/SYNTH), this 164M model achieves >30% on GPQA-Diamond, a significant outlier for its size class, demonstrating that logic and reasoning capabilities can be preserved even with aggressive vector compression.
Normalization Stability: Includes post-convolution normalization to manage vector value scaling, ensuring training stability and consistent generation output.
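
To make the 9x figure concrete, here is a minimal sketch of how a pooling step can shrink a hidden vector by that factor. It is an illustration under assumptions, not the ConvGPT implementation: it assumes each token's vector is reshaped into a square grid and reduced with non-overlapping 3x3 average pooling, and it omits the learned Conv2D step and the post-convolution normalization. The helper names `avg_pool_3x3` and `compress_hidden` are hypothetical.

```python
def avg_pool_3x3(grid):
    """Non-overlapping 3x3 average pooling over a 2D list of floats."""
    rows, cols = len(grid), len(grid[0])
    assert rows % 3 == 0 and cols % 3 == 0, "grid sides must be multiples of 3"
    pooled = []
    for r in range(0, rows, 3):
        out_row = []
        for c in range(0, cols, 3):
            # Collect the 9 values of the current window and average them.
            window = [grid[r + i][c + j] for i in range(3) for j in range(3)]
            out_row.append(sum(window) / 9.0)
        pooled.append(out_row)
    return pooled


def compress_hidden(vec, side):
    """Reshape a flat vector of length side*side into a grid, pool, flatten.

    Assumption: a square reshape followed by 3x3 pooling; the real layer
    may use a different layout and also applies a learned convolution.
    """
    assert len(vec) == side * side
    grid = [vec[r * side:(r + 1) * side] for r in range(side)]
    pooled = avg_pool_3x3(grid)
    return [x for row in pooled for x in row]


embedding = [float(i) for i in range(36)]  # toy 36-dim vector (6x6 grid)
compressed = compress_hidden(embedding, side=6)
print(len(embedding), "->", len(compressed))  # 36 -> 4
```

Each 3x3 window collapses to one value, so the output has one ninth of the input's elements regardless of the grid size.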
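
The "token leakage" point can be checked mechanically: perturb a later token and confirm that no earlier output changes. The sketch below shows that check on a deliberately simplified 1D convolution over the token axis. It is not the 2D mechanism used by ConvGPT; the functions `causal_conv1d`, `centered_conv1d`, and `leaks_future` are hypothetical and exist only to show the difference between left-only padding and symmetric padding.

```python
def causal_conv1d(seq, kernel):
    """Convolve over the token axis using left-only zero padding.

    Output position t depends only on seq[0..t], never on later tokens.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(seq)
    return [sum(kernel[j] * padded[t + j] for j in range(k))
            for t in range(len(seq))]


def centered_conv1d(seq, kernel):
    """Same kernel with symmetric padding: output t also sees seq[t + 1]."""
    k = len(kernel)
    half = k // 2
    padded = [0.0] * half + list(seq) + [0.0] * half
    return [sum(kernel[j] * padded[t + j] for j in range(k))
            for t in range(len(seq))]


def leaks_future(layer, seq, position):
    """Return True if perturbing seq[position] changes any earlier output."""
    baseline = layer(seq)
    perturbed = list(seq)
    perturbed[position] += 100.0
    changed = layer(perturbed)
    return any(baseline[t] != changed[t] for t in range(position))


tokens = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [0.25, 0.5, 0.25]
causal_leak = leaks_future(lambda s: causal_conv1d(s, kernel), tokens, 3)
naive_leak = leaks_future(lambda s: centered_conv1d(s, kernel), tokens, 3)
print(causal_leak, naive_leak)  # False True
```

The same perturbation test can be applied to any layer that maps a token sequence to a sequence of outputs, which makes it a convenient regression check for causality.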

Installation

pip install -e .

Running

vLLM:

vllm serve mkurman/ConvGPT-SYNTH-250B-EC --served-model-name convgpt --trust-remote-code --max-model-len 16384 --gpu-memory-utilization 0.7 --max-num-batched-tokens 8192

SGLang:

export SGLANG_EXTERNAL_MODEL_PACKAGE=convgpt.sglang && python -m sglang.launch_server --model-path mkurman/ConvGPT-SYNTH-250B-EC --trust-remote-code --port 8000 --mem-fraction-static 0.7 --context-length 16384 --allow-auto-truncate --attention-backend fa3
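
Once either server is up, it can be queried over HTTP. The sketch below uses only the Python standard library and assumes the server is reachable at http://localhost:8000 and exposes the OpenAI-style /v1/completions endpoint, which both vLLM and SGLang provide. The model name "convgpt" matches the --served-model-name in the vLLM command; for SGLang the name is assumed to be the model path (mkurman/ConvGPT-SYNTH-250B-EC) unless configured otherwise. The helper names are hypothetical.

```python
import json
import urllib.request


def build_completion_request(prompt, model="convgpt",
                             base_url="http://localhost:8000", max_tokens=64):
    """Build a POST request for the OpenAI-style /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_completion(request, timeout=60):
    """Send the request and return the generated text of the first choice."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Building the request needs no running server; sending it does.
request = build_completion_request("The capital of France is")
print(request.full_url)  # http://localhost:8000/v1/completions
```

With a server running, send_completion(request) returns the generated continuation as a string.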
