
Batteries included: promote some basic utilities for reasonably fast offline/batched inference into PyTorch core (maybe based on gpt-fast, nano-vllm, torchao) #229

@vadimkantorov

Given how much both LLM training (via FSDP) and inference (often with vLLM) are needed for RL/GRPO, I wonder if it's time to upstream some basic components / utils for okay-speed inference directly into PyTorch, as vLLM gets ever more complicated...

The goal would be to run inference on FSDP-wrapped models immediately, without much weight conversion, or to use torchao to quantize the existing weights. It could also drive dynamic-shape testing for torch.compile / CUDA graphs...
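
To make this concrete, here is a rough sketch (an illustration, not a proposed API) of the kind of "batteries included" utility this could enable: quantize existing weights with torchao's `quantize_` and run a compiled, batched greedy-decode loop with dynamic sequence lengths. `TinyLM` and `greedy_generate` are hypothetical stand-ins; the torchao and `torch.compile` calls are the real APIs.

```python
import torch
import torch.nn as nn
from torchao.quantization import quantize_, int8_weight_only


class TinyLM(nn.Module):
    """Hypothetical stand-in for a decoder-only LM."""

    def __init__(self, vocab=32000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.proj = nn.Linear(dim, vocab)

    def forward(self, ids):                # ids: [batch, seq]
        return self.proj(self.embed(ids))  # logits: [batch, seq, vocab]


model = TinyLM().eval()
quantize_(model, int8_weight_only())  # in-place weight-only quantization of nn.Linear

# dynamic=True asks torch.compile not to specialize on the sequence length,
# which exercises exactly the dynamic-shape paths mentioned above.
step = torch.compile(model, dynamic=True)


@torch.no_grad()
def greedy_generate(ids, max_new_tokens=16):
    for _ in range(max_new_tokens):
        logits = step(ids)                                     # recompute full prefix (no KV cache)
        next_ids = logits[:, -1].argmax(dim=-1, keepdim=True)  # greedy pick per sequence
        ids = torch.cat([ids, next_ids], dim=1)                # sequence length grows each step
    return ids


out = greedy_generate(torch.randint(0, 32000, (4, 8)))  # batch of 4 prompts, 8 tokens each
```

A core version would presumably add a KV cache and CUDA-graph capture; this sketch recomputes the full prefix at every step for simplicity.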

It is a bit strange to need no special framework for training besides FSDP, yet to need an inference framework for basic inference. So maybe it's time to upstream some of the time-proven components from the inference engines...
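
To illustrate the first half of that point: an FSDP-wrapped model can already serve batched forward passes as-is (FSDP all-gathers the shards on demand), so no weight conversion is needed; what's missing from core is a fast decode loop around it. A minimal single-process sketch, assuming the gloo backend with world_size=1 purely for demonstration (in practice FSDP runs with NCCL on GPUs, and the tiny model is a stand-in):

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Single-process "distributed" setup, for illustration only.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

# Hypothetical tiny model; in practice this would be the FSDP-wrapped training model.
model = FSDP(nn.Sequential(nn.Embedding(1000, 64), nn.Linear(64, 1000)))

with torch.no_grad():
    logits = model(torch.randint(0, 1000, (4, 8)))  # direct batched forward, no conversion

# Full (unsharded) weights can also be materialized in place, e.g. to hand
# them to torchao for quantization:
with FSDP.summon_full_params(model):
    n_params = sum(p.numel() for p in model.parameters())

dist.destroy_process_group()
```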

This is already happening a bit with:
