
Add /v1/completions endpoint (OpenAI legacy completions API) to transformers serve#44558

Open
rain-1 wants to merge 9 commits into huggingface:main from rain-1:main

Conversation


@rain-1 rain-1 commented Mar 10, 2026

Adds support for the legacy text completions endpoint, which accepts a freeform text prompt (no chat template) and returns generated text in choices[].text. Supports both streaming and non-streaming modes, suffix for fill-in-the-middle insertion, and proper finish_reason detection.
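For context on the finish_reason detection mentioned above: in the legacy completions API, finish_reason is "stop" when generation hit an end-of-text token, "length" when it exhausted max_tokens, and null while a streamed completion is still in progress. A minimal sketch of that decision (hypothetical helper name, not the PR's actual code):

```python
def pick_finish_reason(hit_eos: bool, generated_tokens: int, max_tokens: int):
    """Mirror the legacy completions semantics: "stop" for a natural
    end-of-text, "length" when the token budget ran out, and None
    while a streamed completion is still in progress."""
    if hit_eos:
        return "stop"
    if generated_tokens >= max_tokens:
        return "length"
    return None

print(pick_finish_reason(True, 5, 20))    # natural end-of-text -> "stop"
print(pick_finish_reason(False, 20, 20))  # budget exhausted -> "length"
```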

What does this PR do?

Hello, my motivation for this:

I work with base models. I often need to continue text documents, so I need the OpenAI /v1/completions endpoint, which uses an LLM to continue text. This applies to base models/pretrained foundation models, which have not been post-trained as instruct models to follow a chat template.

The 'transformers' tool is amazingly useful for bringing up a quick API endpoint, and I would love for it to support the ability to continue a prompt! That's why I worked with Claude Code/Opus 4.6 to produce this PR.

Note: While this is a legacy feature of the OpenAI API (they have moved away from providing base model support), vLLM still supports it. It's very useful for research or any work with pretrained models.

Here's an example script that tests the functionality:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# Non-streaming
resp = client.completions.create(
    model="Qwen/Qwen3.5-0.8B-Base",
    prompt="The capital of France is",
    max_tokens=20
)
for i, choice in enumerate(resp.choices):
    print(i)
    print("-")
    print(choice.text)
    print("\n\n")

# Streaming
for chunk in client.completions.create(
    model="Qwen/Qwen3.5-0.8B-Base",
    prompt="Once upon a time",
    max_tokens=50,
    stream=True,
):
    print(chunk.choices[0].text, end="", flush=True)

So I run transformers serve, then run that script, and get this result:

0
-
 Paris, and the capital of the United States is Washington, D.C. The capital of the United



, in a magical land of science, there was a very special kind of animal called a jellyfish. This jellyfish had a really interesting way of living.

So this implementation is working well for my purposes.
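For reference, the streaming mode of this endpoint emits server-sent events: each data line carries a JSON object with "object": "text_completion" and the new text fragment in choices[0].text, and the stream ends with a data: [DONE] sentinel. A minimal sketch of how a server might serialize such chunks (hypothetical helper name and model id, not the PR's actual code):

```python
import json

def sse_chunk(completion_id: str, model: str, text: str, finish_reason=None) -> str:
    """Serialize one legacy-completions streaming chunk as an SSE data line."""
    payload = {
        "id": completion_id,
        "object": "text_completion",
        "model": model,
        "choices": [{"index": 0, "text": text, "finish_reason": finish_reason}],
    }
    return f"data: {json.dumps(payload)}\n\n"

# A stream is a sequence of chunks followed by the [DONE] sentinel.
stream = [
    sse_chunk("cmpl-1", "my-base-model", "Once"),
    sse_chunk("cmpl-1", "my-base-model", " upon a time", finish_reason="length"),
    "data: [DONE]\n\n",
]
```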

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). No, this provides a new feature for the transformers command line tool
  • Did you read the contributor guideline,
    Pull Request section? yes
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case. no, this PR is the first contact
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings. Yes, API endpoint documented
  • Did you write any new necessary tests? Yes, we added a test for this feature, and also tested the functionality with an independent Python script

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Based on git blame, I tag @LysandreJik

rain-1 and others added 2 commits March 10, 2026 06:58
…sformers serve`

Adds support for the legacy text completions endpoint, which accepts a
freeform text prompt (no chat template) and returns generated text in
choices[].text. Supports both streaming and non-streaming modes, suffix
for fill-in-the-middle insertion, and proper finish_reason detection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rain-1
Author

rain-1 commented Mar 10, 2026

Added documentation for the new API endpoint

@LysandreJik
Member

Thanks for the PR, I'll take a look!

Member

@stevhliu stevhliu left a comment


the docs side lgtm, thanks!

rain-1 and others added 3 commits March 11, 2026 09:28
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
@rain-1
Author

rain-1 commented Mar 11, 2026

Thanks for those improvements @stevhliu, I've committed them all

@LysandreJik
Member

Hey @rain-1, we're making significant changes to the structure of serve at this time with @SunMarc, sorry your PR is getting delayed.

Btw, what prohibits you from updating to a newer, maintained API? I'm not sure if it's transformers serve's place to introduce deprecated API endpoints instead of supporting the existing/future ones

@rain-1
Author

rain-1 commented Mar 22, 2026

@LysandreJik No worries. I'll hold off on other contributions until this is done.

newer, maintained API

Do any of these support text continuation?

I'm working with pretrained LLMs. They do not have a chat template, so I need an API endpoint that does input "Once upon a time, " -> output "Once upon a time, there was a princess who"

I'm not sure if it's transformers serve's place to introduce deprecated API endpoints

OpenAI deprecated this endpoint because they cannot safely serve base models to customers as part of their product. It's hard to do safety tuning on those.

Regardless of that, LLM development still has the pretraining phase.

Most tools use a chat template, which is only trained into the model in the post-training phase.

