Add containerized LCB Service by nv-alicheng · Pull Request #105 · mlcommons/endpoints

nv-alicheng · 2026-01-23T21:56:05Z

What does this PR do?

Moves LiveCodeBench eval to a security-enhanced docker container as a web service.
Decouples LCB eval from the official lcb_runner repo to reduce the number of dependencies.

Type of change

Bug fix
New feature
Documentation update
Refactor/cleanup

Related issues

Testing

Tests added/updated
All tests pass locally
Manual testing completed

Checklist

Code follows project style
Pre-commit hooks pass
Documentation updated (if needed)

github-actions · 2026-01-23T21:56:15Z

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

gemini-code-assist · 2026-01-23T21:56:43Z

Summary of Changes

Hello @nv-alicheng, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a fundamental shift in how LiveCodeBench evaluations are performed by encapsulating the entire process within a dedicated, security-hardened Docker container. This architectural change aims to mitigate security risks associated with executing untrusted code and to simplify dependency management. It also refines the project's overall dependency structure and provides a flexible mechanism for interacting with the new containerized evaluation service.

Highlights

Containerized LiveCodeBench Service: The LiveCodeBench (LCB) evaluation is now moved into a security-enhanced Docker container, running as a web service. This significantly improves security by isolating arbitrary code execution and decouples LCB evaluation from the main repository's dependencies.
Updated Dependency Management: The project's dependencies are now managed primarily through pyproject.toml using optional dependencies, replacing the previous requirements/*.txt files. This streamlines installation for users and developers.
Flexible LCB Evaluation Workflow: The LiveCodeBenchScorer now attempts to evaluate code via a WebSocket connection to the containerized LCB service. If the service is unavailable or disabled, it falls back to a local subprocess execution, with a clear warning about security implications.
Enhanced Extractor Interface: The Extractor interface and its implementations have been updated to include a default parameter, allowing for more robust handling of extraction failures by providing a fallback value.
Comprehensive Documentation for LCB Service: A new README.md file is added specifically for the LiveCodeBench service, detailing its requirements, security hardening best practices, build and run instructions, and troubleshooting.
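The service-first, local-fallback workflow described in the highlights can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `evaluate_with_fallback`, `remote_eval`, `local_eval`, and `allow_local` are hypothetical names, and the real `LiveCodeBenchScorer` talks to the container over a WebSocket rather than through injected callables.

```python
import logging
from collections.abc import Callable

logger = logging.getLogger(__name__)

# Hypothetical alias: problem id -> per-sample pass/fail flags.
Results = dict[str, list[bool]]
Codes = dict[str, list[str]]


def evaluate_with_fallback(
    codes: Codes,
    remote_eval: Callable[[Codes], Results],
    local_eval: Callable[[Codes], Results],
    allow_local: bool = False,
) -> Results:
    """Try the containerized service first; run locally only if opted in."""
    try:
        return remote_eval(codes)
    except OSError as exc:  # includes ConnectionError and its subclasses
        if not allow_local:
            raise RuntimeError(
                "LCB service unreachable and local fallback is disabled"
            ) from exc
        # Local execution has no container isolation, so warn loudly.
        logger.warning(
            "LCB service unavailable (%s); running untrusted code locally.", exc
        )
        return local_eval(codes)
```

Keeping the fallback behind an explicit flag means that a misconfigured service address fails loudly instead of silently executing generated code on the host.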


gemini-code-assist

Code Review

The pull request significantly improves the LiveCodeBench (LCB) evaluation workflow by introducing a containerized FastAPI WebSocket service. This change effectively addresses critical security concerns associated with executing untrusted LLM-generated code by isolating the evaluation environment. The accompanying documentation is comprehensive, detailing security hardening best practices, container setup, and usage. Dependency management has also been modernized by transitioning to "pyproject.toml" and leveraging optional dependencies. The "LiveCodeBenchScorer" now intelligently attempts WebSocket communication first, with a secure, opt-in fallback to local subprocess execution. The "Extractor" classes have been made more robust with the addition of a "default" parameter, enhancing downstream processing reliability. Overall, these changes represent a well-executed architectural improvement that enhances both the security and maintainability of the LCB integration.


nvzhihanj · 2026-01-27T04:55:00Z

Code review

Found 3 issues:

Bug in evaluate_dataframe() method: The method assigns the return value of self.evaluate() (which returns dict[str, list[bool]]) to num_passed, then attempts to divide it by total_samples. This will cause a TypeError: unsupported operand type(s) for /: 'dict' and 'int' at runtime.

endpoints/src/inference_endpoint/evaluation/livecodebench/lcb_serve.py

Lines 467 to 477 in 318e948

    
    # Evaluate and get number of passed samples
    num_passed = self.evaluate(
        codes_dict=codes_dict,
        timeout_sec=timeout_sec,
        on_problem_complete=on_problem_complete,
    )

    # Calculate pass@1
    total_samples = len(df)
    pass_at_1 = num_passed / total_samples if total_samples > 0 else 0.0
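One possible way to address this is to reduce the `dict[str, list[bool]]` to a count before dividing. The sketch below assumes that each boolean corresponds to one evaluated sample; the helper name `pass_at_1_from_results` is hypothetical, and the fix that was eventually merged may differ.

```python
def pass_at_1_from_results(
    results: dict[str, list[bool]], total_samples: int
) -> float:
    """Fraction of samples that passed, given per-problem pass/fail flags."""
    if total_samples <= 0:
        return 0.0
    # True counts as 1 and False as 0, so summing the flags counts the passes.
    num_passed = sum(sum(flags) for flags in results.values())
    return num_passed / total_samples


example = {"problem_a": [True, False], "problem_b": [True]}
score = pass_at_1_from_results(example, total_samples=3)
```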

Hardcoded dataset path ignores lcb_version parameter: The subprocess command in _evaluate_via_subprocess() hardcodes "datasets/livecodebench/release_v6" for the --datasets-dir argument, ignoring self.lcb_version. This causes incorrect behavior when users specify a different version (e.g., release_v5).

endpoints/src/inference_endpoint/evaluation/scoring.py

Lines 510 to 514 in 318e948

    
    self.lcb_version,
    "--datasets-dir",
    "datasets/livecodebench/release_v6",
    "--timeout",
    str(self.timeout),
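A sketch of deriving the directory from the version instead of hardcoding it is shown below. It assumes the datasets are laid out as `<root>/<lcb_version>`, which is inferred only from the hardcoded `release_v6` path above; `build_lcb_args` and `datasets_root` are hypothetical names, and only the arguments visible in the excerpt are reproduced.

```python
from pathlib import PurePosixPath


def build_lcb_args(
    lcb_version: str,
    timeout: int,
    datasets_root: str = "datasets/livecodebench",
) -> list[str]:
    """Build the argument fragment with a version-dependent datasets dir."""
    # Assumed layout: one subdirectory per release, e.g. .../release_v5.
    datasets_dir = PurePosixPath(datasets_root) / lcb_version
    return [
        lcb_version,
        "--datasets-dir",
        str(datasets_dir),
        "--timeout",
        str(timeout),
    ]


args_v5 = build_lcb_args("release_v5", timeout=60)
```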

Breaking change: IdentityExtractor removed but still referenced: The IdentityExtractor class was removed from extractor.py, but three YAML configuration files still reference identity_extractor. This will cause KeyError when loading these configs:
- examples/05_Llama3.1-8B_Example/offline_llama3_8b_cnn.yaml
- examples/05_Llama3.1-8B_Example/online_llama3_8b_cnn.yaml
- examples/06_Llama2-70B_Example/online_llama2_70b_orca.yaml

endpoints/src/inference_endpoint/evaluation/extractor.py

Lines 1 to 40 in 318e948

    
    # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
    # SPDX-License-Identifier: Apache-2.0
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    # http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific permissions and
    # limitations under the License.

    import inspect
    import re
    from abc import ABC, abstractmethod
    from typing import ClassVar


    class Extractor(ABC):
        """An Extractor is used to extract phrases or substrings from the model's outputs using
        multiple regex patterns with a priority system. This is useful for extracting values from
        strings with the same general format but small variations, such as a model outputting a
        numeric value plain or inside a LaTeX block.
        """

        # Provide a registration and lookup system for derived Extractor classes by name.
        # This allows registering new extractors that can be instantiated via config/lookup.
        PREDEFINED: ClassVar[dict[str, type["Extractor"]]] = {}

        def __init_subclass__(
            cls,
            extractor_id: str | None = None,
            **kwargs,
        ):
            super().__init_subclass__(**kwargs)

Generated with Claude Code

If this code review was useful, please react with 👍. Otherwise, react with 👎.


arekay-nv

Awesome, thanks!

nv-alicheng added 5 commits January 23, 2026 11:43

Consolidate dependencies into pyproject.toml

d0cd679

Move lcb_serve to evaluation. Remove dependency on lcb_runner

f8a2917

Add lcb-service, move generate to lcb-service container

705bdc3

Update example07 LCB instructions

70dddff

Fix bug in dockerfile where export was essentially a no-op

a020b1d

nv-alicheng requested a review from a team as a code owner January 23, 2026 21:56

github-actions bot requested review from arekay-nv and nvzhihanj January 23, 2026 21:56

gemini-code-assist bot reviewed Jan 23, 2026

src/inference_endpoint/evaluation/scoring.py (outdated, resolved)

src/inference_endpoint/evaluation/scoring.py (outdated, resolved)

Update CI to install dependencies via pyproject.toml instead of requirements

674b2ea

github-code-quality bot found potential problems Jan 23, 2026


Fix bug where template .yaml files were not being installed as part of the library, fix type annotations

29a385f

arekay-nv reviewed Jan 25, 2026


nv-alicheng added 2 commits January 26, 2026 13:28

Fix rebase error with scoring.py

fc4d231

PR comments - Add logging to lcb server, fix imports

deb83b5

github-code-quality bot found potential problems Jan 26, 2026

src/inference_endpoint/evaluation/livecodebench/run_lcb_tests.py (dismissed)

nv-alicheng added 2 commits January 26, 2026 17:05

Split test cases out of parquet into individual JSON files. Invoke generate as subprocess instead of handling with lcb-service

d86526e

Enable optional pre-load and caching mechanism to dynamically read test suite JSON files

a605064

github-code-quality bot found potential problems Jan 27, 2026

src/inference_endpoint/evaluation/livecodebench/lcb_serve.py (fixed)

nv-alicheng added 2 commits January 26, 2026 18:41

Add more debugging logs, remove __main__ from lcb module

fbc570e

Only preload if cache is unlimited size

84aaf6b

github-code-quality bot found potential problems Jan 27, 2026

src/inference_endpoint/evaluation/livecodebench/lcb_serve.py (fixed)

Update README for lcb eval with tips on setting LCB_TEST_CACHE_SIZE

318e948

nvzhihanj reviewed Jan 27, 2026

src/inference_endpoint/evaluation/livecodebench/lcb_serve.py (outdated, resolved)

nvzhihanj reviewed Jan 27, 2026

src/inference_endpoint/evaluation/scoring.py (outdated, resolved)

nvzhihanj reviewed Jan 27, 2026

src/inference_endpoint/evaluation/extractor.py (resolved)

Address PR comments, fix evaluate_dataframe, fix rebase errors in extractor.py

27a81b8

nvzhihanj approved these changes Jan 27, 2026

arekay-nv approved these changes Jan 27, 2026

nvzhihanj and others added 2 commits January 27, 2026 11:05

Merge branch 'main' into feat/alicheng-lcb-container

0643013

Fix GH workflows yaml

60f8c8b

nv-alicheng merged commit 0140906 into main Jan 27, 2026
4 checks passed

github-actions bot locked and limited conversation to collaborators Jan 27, 2026

arekay-nv deleted the feat/alicheng-lcb-container branch April 2, 2026 03:05
