Batteries included: promote basic utilities for reasonably fast offline/batched inference into PyTorch core (maybe based on gpt-fast, nano-vllm, torchao) #229
Given that RL/GRPO workflows need both LLM training (via FSDP) and inference (often with vllm), I wonder if it's time to upstream some basic components / utils for okay-speed inference directly into PyTorch, especially as vllm grows ever more complicated...
The goal would be to run inference on FSDP-wrapped models immediately, without much weight conversion, or to apply torchao quantization to the existing weights. It could also drive dynamic-shape testing for torch.compile / CUDA graphs... A rough sketch of what such a path could look like is below.
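To make this concrete, here is a minimal sketch of the kind of "okay-speed" path an in-core util could package: quantize the existing weights in place with torchao, then compile the decode step with dynamic shapes and the CUDA-graph-backed `reduce-overhead` mode. `TinyLM` is a hypothetical toy stand-in for a causal LM (e.g. a module unwrapped from FSDP), and the `quantize_` / `int8_weight_only` calls reflect the torchao API at the time of writing, which may change:

```python
import torch
import torch.nn as nn

# Hypothetical toy stand-in for a causal LM; the point is plain eager
# PyTorch weights, not the architecture.
class TinyLM(nn.Module):
    def __init__(self, vocab=256, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.proj = nn.Linear(dim, vocab)

    def forward(self, ids):
        return self.proj(self.emb(ids))  # (batch, seq, vocab)

model = TinyLM().eval()

# Quantize the existing weights in place with torchao (API as of current
# torchao releases; hedged, since it is still evolving).
try:
    from torchao.quantization import quantize_, int8_weight_only
    quantize_(model, int8_weight_only())
except ImportError:
    pass  # torchao not installed; keep the fp32 weights

# "reduce-overhead" uses CUDA graphs on GPU; dynamic=True exercises the
# dynamic-shape path mentioned above.
model = torch.compile(model, mode="reduce-overhead", dynamic=True)

@torch.inference_mode()
def greedy_decode(ids: torch.Tensor, max_new_tokens: int = 8) -> torch.Tensor:
    # Naive KV-cache-free greedy loop, just to show the shape of the util.
    for _ in range(max_new_tokens):
        next_tok = model(ids)[:, -1].argmax(-1, keepdim=True)
        ids = torch.cat([ids, next_tok], dim=-1)
    return ids

print(greedy_decode(torch.randint(0, 256, (2, 5))))
```

With something along these lines in core, the same module could go straight from an FSDP training step to rollout generation, while vllm remains the choice for serious serving.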
It is a bit strange that training needs no special framework beyond FSDP, while basic inference requires a dedicated inference framework. So maybe it's time to upstream some of the time-proven components from the inference engines...