13 changes: 12 additions & 1 deletion src/pylon/deploy/pylon-config/location.conf.template
@@ -46,10 +46,19 @@ location ~ ^/log-manager/([^/]+):(\d+)/(.*)$ {
# job server
location ~ ^/job-server/([^/]+):(\d+)/(.*)$ {
proxy_pass http://$1:$2/$3$is_args$args;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_connect_timeout 180m;
proxy_read_timeout 180m;
proxy_send_timeout 180m;

# Before nginx commits any response bytes to the client, retry once on
# connection-level errors (TCP RST / ETIMEDOUT during connect).
# non_idempotent is required because inference endpoints are POST.
proxy_next_upstream error timeout non_idempotent;
proxy_next_upstream_tries 2;
proxy_next_upstream_timeout 5s;

# Disable buffering based on content type
proxy_buffering off;
proxy_cache off;
@@ -65,7 +74,9 @@ location ~ ^/copilot/api/operation(.*)$ {

# Model proxy backend
location ~ ^/model-proxy/(.*)$ {
proxy_pass {{MODEL_PROXY_URI}}/$1$is_args$args;
proxy_pass http://model_proxy_upstream/$1$is_args$args;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_connect_timeout 60m;
proxy_read_timeout 60m;
proxy_send_timeout 60m;
Comment on lines +77 to 82

Copilot AI Apr 1, 2026

This hard-codes http://model_proxy_upstream even when MODEL_PROXY_URI is https://... (the upstream definition strips both schemes). That will break TLS-to-upstream and can cause backend connection failures or unintended plaintext traffic. Preserve the upstream scheme (e.g., select http vs https via templating / map, and add the required proxy_ssl_* directives when using HTTPS).

Suggested change
proxy_pass http://model_proxy_upstream/$1$is_args$args;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_connect_timeout 60m;
proxy_read_timeout 60m;
proxy_send_timeout 60m;
proxy_pass {{MODEL_PROXY_UPSTREAM_SCHEME}}://model_proxy_upstream/$1$is_args$args;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_connect_timeout 60m;
proxy_read_timeout 60m;
proxy_send_timeout 60m;
proxy_ssl_server_name on;
proxy_ssl_name model_proxy_upstream;
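If the `MODEL_PROXY_UPSTREAM_SCHEME` variable suggested above is adopted, the template renderer (`render.py` is invoked from run.sh.template) could derive it from `MODEL_PROXY_URI` rather than adding a second hand-maintained setting. A minimal sketch; the function name and how it gets wired into the template context are hypothetical:

```python
from urllib.parse import urlsplit


def derive_upstream_scheme(model_proxy_uri: str) -> str:
    """Return 'https' when MODEL_PROXY_URI uses TLS, otherwise default to 'http'."""
    scheme = urlsplit(model_proxy_uri).scheme
    return scheme if scheme in ("http", "https") else "http"
```

The derived value can then be exposed as `MODEL_PROXY_UPSTREAM_SCHEME` wherever the other template variables are populated, keeping the scheme in one authoritative place.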

21 changes: 17 additions & 4 deletions src/pylon/deploy/pylon-config/nginx.conf.template
@@ -15,10 +15,10 @@
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

worker_processes 1;
worker_processes 1;

events {
worker_connections 1024;
worker_connections 65535;
}
Comment on lines +18 to 22

Copilot AI Apr 1, 2026

Raising worker_connections to 65535 likely won’t be effective unless the process RLIMIT_NOFILE (and/or worker_rlimit_nofile) is also increased to cover worker_processes * worker_connections plus overhead. Without that, Nginx will warn and cap usable connections. Consider setting worker_rlimit_nofile (and aligning the container/pod ulimit) alongside this change.
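Following this suggestion, the descriptor limit could be raised alongside the connection count. A sketch; the sizing is illustrative and would also need to be matched by the container/pod ulimit:

```nginx
worker_processes 1;

# Must cover worker_processes * worker_connections plus listen sockets,
# upstream keepalive connections, and open log files.
worker_rlimit_nofile 131072;

events {
    worker_connections 65535;
}
```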


http {
@@ -65,8 +65,21 @@ http {
# allow 10.1.0.0/16;
# deny all;

{%- if MODEL_PROXY_URI %}
# Upstream keepalive pool for model-proxy.
# Reuses persistent connections and enables automatic stale-connection retry
# before nginx has committed to the client response -- eliminating the race
# that causes [Errno 104] Connection reset by peer under high concurrency.
upstream model_proxy_upstream {
server {{MODEL_PROXY_URI | replace('http://', '') | replace('https://', '')}};
keepalive 32;
keepalive_requests 1000;
keepalive_timeout 60s;
}
{%- endif %}
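For reference, with a value such as `MODEL_PROXY_URI=http://model-proxy:8000` (hostname illustrative), the `replace` filters strip the scheme and the block renders to:

```nginx
upstream model_proxy_upstream {
    server model-proxy:8000;
    keepalive 32;
    keepalive_requests 1000;
    keepalive_timeout 60s;
}
```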

server {
listen 80;
listen 80 backlog=4096;
server_name localhost;
client_max_body_size 0; # Disable checking of client request body size.
client_body_buffer_size 256M;
@@ -84,7 +97,7 @@ http {

{% if SSL_ENABLE %}
server {
listen 443 ssl;
listen 443 ssl backlog=4096;
server_name localhost;

ssl_certificate /root/{{CRT_NAME}};
4 changes: 4 additions & 0 deletions src/pylon/deploy/pylon-config/run.sh.template
@@ -21,6 +21,10 @@ python3 /pylon-config/render.py
cp /root/nginx.conf /etc/nginx/nginx.conf
cp /root/location.conf /etc/nginx/location.conf

# Increase TCP listen backlog to match nginx backlog=4096.
# Requires NET_ADMIN capability; if it fails, fall back to OS default silently.
sysctl -w net.core.somaxconn=4096 2>/dev/null || true
Comment on lines +25 to +26

Copilot AI Apr 1, 2026

The comment states this requires NET_ADMIN, but changing net.core.somaxconn commonly requires sysctl permissions that aren’t granted by NET_ADMIN (and in Kubernetes is typically handled via pod-level securityContext.sysctls). Because failures are fully silenced, operators may believe backlog tuning is active when it isn’t. Consider (a) configuring this via pod securityContext.sysctls instead of container capabilities, and/or (b) logging a one-line warning when the sysctl write fails so it’s observable.

Suggested change
# Requires NET_ADMIN capability; if it fails, fall back to OS default silently.
sysctl -w net.core.somaxconn=4096 2>/dev/null || true
# Note: Changing net.core.somaxconn may require sysctl permissions (e.g. pod securityContext.sysctls in Kubernetes),
# not just NET_ADMIN. If this fails, log a warning and continue with the OS default.
sysctl -w net.core.somaxconn=4096 >/dev/null 2>&1 || echo "Warning: Failed to set net.core.somaxconn=4096; ensure pod securityContext.sysctls is configured if backlog tuning is required." >&2


{% if 'ssl' in cluster_cfg['pylon'] %}
cp /https-config/{{cluster_cfg['pylon']['ssl']['crt_name']}} /root/{{cluster_cfg['pylon']['ssl']['crt_name']}}
cp /https-config/{{cluster_cfg['pylon']['ssl']['key_name']}} /root/{{cluster_cfg['pylon']['ssl']['key_name']}}
3 changes: 3 additions & 0 deletions src/pylon/deploy/pylon.yaml.template
@@ -34,6 +34,9 @@ spec:
- name: pylon
image: {{ cluster_cfg['cluster']['docker-registry']['prefix'] }}pylon:{{ cluster_cfg['cluster']['docker-registry']['tag'] }}
imagePullPolicy: Always
securityContext:
capabilities:
add: ["NET_ADMIN"]
volumeMounts:
- mountPath: /pylon-config
name: pylon-configuration
156 changes: 156 additions & 0 deletions tools/mock_inference_server.py
@@ -0,0 +1,156 @@
import random
import string
import time
import asyncio
from typing import List

Copilot AI Apr 1, 2026

Unused import (List). Removing it avoids lint noise and keeps dependencies tidy.

Suggested change
from typing import List


from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse, StreamingResponse
import uvicorn
import argparse
import json

app = FastAPI()

# ====== global config ======
Contributor

let's remove the comments in Chinese

Contributor

and other places

CONFIG = {
"models": ["mock-gpt-4"],
"min_delay": 0.1,
"max_delay": 1.0,
"min_tokens": 5,
"max_tokens": 50,
}


# ====== utils ======
def random_text(n: int) -> str:
return "".join(random.choices(string.ascii_letters + string.digits, k=n))


def random_delay():
return random.uniform(CONFIG["min_delay"], CONFIG["max_delay"])


def validate_model(model: str):
if model not in CONFIG["models"]:
return CONFIG["models"][0] # fallback
return model


# ====== /v1/models ======
@app.get("/v1/models")
async def list_models():
return {
"object": "list",
"data": [
{
"id": m,
"object": "model",
"created": int(time.time()),
"owned_by": "mock",
}
for m in CONFIG["models"]
],
}


# ====== response ======
def build_chat_response(model: str, content: str):
return {
"id": f"chatcmpl-{random_text(12)}",
"object": "chat.completion",
"created": int(time.time()),
"model": model,
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": content,
},
"finish_reason": "stop",
}
],
"usage": {
"prompt_tokens": random.randint(5, 20),
"completion_tokens": len(content),
"total_tokens": len(content) + random.randint(5, 20),
},
}


async def stream_response(model: str, full_text: str):
chunk_size = 5

for i in range(0, len(full_text), chunk_size):
chunk = full_text[i : i + chunk_size]

data = {
"id": f"chatcmpl-{random_text(12)}",
"object": "chat.completion.chunk",
"model": model,
"choices": [
Comment on lines +88 to +92
Copilot AI Apr 1, 2026

For OpenAI-style streaming responses, clients typically expect the id to remain stable across all chunks of a single completion. Generating a new id per chunk can break client-side correlation/assembling logic. Prefer generating one id per request and reusing it in all chunks (and optionally include consistent created metadata as well).

{
"delta": {"content": chunk},
"index": 0,
"finish_reason": None,
}
],
}

yield f"data: {json.dumps(data)}\n\n"
await asyncio.sleep(0.05)

yield "data: [DONE]\n\n"
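One way to apply the stable-id suggestion above is to generate the `id` (and a `created` timestamp) once per request and reuse them in every chunk. A sketch of the modified generator; the per-chunk sleep is dropped here so it can run standalone:

```python
import asyncio
import json
import random
import string
import time


def random_text(n: int) -> str:
    return "".join(random.choices(string.ascii_letters + string.digits, k=n))


async def stream_response(model: str, full_text: str):
    chunk_size = 5
    # One id/created pair for the whole completion, reused by every chunk,
    # so clients can correlate and reassemble the stream.
    completion_id = f"chatcmpl-{random_text(12)}"
    created = int(time.time())

    for i in range(0, len(full_text), chunk_size):
        chunk = full_text[i : i + chunk_size]
        data = {
            "id": completion_id,
            "object": "chat.completion.chunk",
            "created": created,
            "model": model,
            "choices": [
                {"delta": {"content": chunk}, "index": 0, "finish_reason": None}
            ],
        }
        yield f"data: {json.dumps(data)}\n\n"
        await asyncio.sleep(0)  # yield control; the real server sleeps 0.05s

    yield "data: [DONE]\n\n"
```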


# ====== /v1/chat/completions ======
@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
body = await request.json()

model = validate_model(body.get("model", CONFIG["models"][0]))
stream = body.get("stream", False)

# random output
token_len = random.randint(CONFIG["min_tokens"], CONFIG["max_tokens"])
content = random_text(token_len)

# random delay
await asyncio.sleep(random_delay())

if stream:
return StreamingResponse(
stream_response(model, content),
media_type="text/event-stream",
)
else:
return JSONResponse(build_chat_response(model, content))


# ====== main ======
def main():
parser = argparse.ArgumentParser()

parser.add_argument("--host", default="0.0.0.0")
parser.add_argument("--port", type=int, default=8000)

parser.add_argument("--models", type=str, default="mock-gpt-4")
parser.add_argument("--min-delay", type=float, default=0.1)
parser.add_argument("--max-delay", type=float, default=1.0)
parser.add_argument("--min-tokens", type=int, default=5)
parser.add_argument("--max-tokens", type=int, default=50)

args = parser.parse_args()

CONFIG["models"] = args.models.split(",")
CONFIG["min_delay"] = args.min_delay
CONFIG["max_delay"] = args.max_delay
CONFIG["min_tokens"] = args.min_tokens
CONFIG["max_tokens"] = args.max_tokens

uvicorn.run(app, host=args.host, port=args.port)


if __name__ == "__main__":
main()