Add containerized LCB Service #105

Merged
nv-alicheng merged 17 commits into main from feat/alicheng-lcb-container
Jan 27, 2026
@nv-alicheng
Collaborator

What does this PR do?

Moves LiveCodeBench eval into a security-hardened Docker container that runs as a web service.
Decouples LCB eval from the official lcb_runner repo to reduce the number of dependencies.

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor/cleanup

Related issues

Testing

  • Tests added/updated
  • All tests pass locally
  • Manual testing completed

Checklist

  • Code follows project style
  • Pre-commit hooks pass
  • Documentation updated (if needed)

@nv-alicheng nv-alicheng requested a review from a team as a code owner January 23, 2026 21:56
@github-actions

github-actions bot commented Jan 23, 2026

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@gemini-code-assist

Summary of Changes

Hello @nv-alicheng, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a fundamental shift in how LiveCodeBench evaluations are performed by encapsulating the entire process within a dedicated, security-hardened Docker container. This architectural change aims to mitigate security risks associated with executing untrusted code and to simplify dependency management. It also refines the project's overall dependency structure and provides a flexible mechanism for interacting with the new containerized evaluation service.

Highlights

  • Containerized LiveCodeBench Service: The LiveCodeBench (LCB) evaluation is now moved into a security-enhanced Docker container, running as a web service. This significantly improves security by isolating arbitrary code execution and decouples LCB evaluation from the main repository's dependencies.
  • Updated Dependency Management: The project's dependencies are now managed primarily through pyproject.toml using optional dependencies, replacing the previous requirements/*.txt files. This streamlines installation for users and developers.
  • Flexible LCB Evaluation Workflow: The LiveCodeBenchScorer now attempts to evaluate code via a WebSocket connection to the containerized LCB service. If the service is unavailable or disabled, it falls back to a local subprocess execution, with a clear warning about security implications.
  • Enhanced Extractor Interface: The Extractor interface and its implementations have been updated to include a default parameter, allowing for more robust handling of extraction failures by providing a fallback value.
  • Comprehensive Documentation for LCB Service: A new README.md file is added specifically for the LiveCodeBench service, detailing its requirements, security hardening best practices, build and run instructions, and troubleshooting.
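
The service-first, subprocess-fallback workflow described in the highlights can be sketched roughly as follows. This is a minimal illustration, not the PR's actual `LiveCodeBenchScorer` code: `evaluate_with_fallback`, `remote_eval`, `local_eval`, and `allow_local_fallback` are hypothetical names, and the real scorer talks to the container over a WebSocket rather than through plain callables.

```python
import warnings
from collections.abc import Callable

# Per-problem pass/fail results, mirroring the dict[str, list[bool]] shape
# mentioned later in this review thread.
Results = dict[str, list[bool]]


def evaluate_with_fallback(
    remote_eval: Callable[[], Results],
    local_eval: Callable[[], Results],
    allow_local_fallback: bool = False,
) -> Results:
    """Try the containerized service first; fall back only when opted in.

    All names here are hypothetical stand-ins for illustration.
    """
    try:
        return remote_eval()
    except OSError as exc:  # ConnectionError and TimeoutError are subclasses
        if not allow_local_fallback:
            raise RuntimeError(
                "LCB service unavailable and local fallback is disabled"
            ) from exc
        # Local execution runs untrusted code outside the container, so warn.
        warnings.warn(
            "LCB service unavailable; running untrusted code in a local subprocess",
            RuntimeWarning,
            stacklevel=2,
        )
        return local_eval()
```

Making the fallback opt-in means an unreachable service fails loudly by default instead of silently running model-generated code on the host.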


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

The pull request improves the LiveCodeBench (LCB) evaluation workflow by introducing a containerized FastAPI WebSocket service. Isolating the evaluation environment addresses the security concerns of executing untrusted LLM-generated code. The accompanying documentation covers security hardening best practices, container setup, and usage. Dependency management moves to "pyproject.toml" with optional dependencies. The "LiveCodeBenchScorer" now attempts WebSocket communication first, with an opt-in fallback to local subprocess execution that warns about the security implications. The "Extractor" classes gain a "default" parameter, so downstream processing receives a fallback value when extraction fails. Overall, these changes improve both the security and maintainability of the LCB integration.

@nvzhihanj
Collaborator

Code review

Found 3 issues:

  1. Bug in evaluate_dataframe() method: The method assigns the return value of self.evaluate() (which returns dict[str, list[bool]]) to num_passed, then attempts to divide it by total_samples. This will cause a TypeError: unsupported operand type(s) for /: 'dict' and 'int' at runtime.

# Evaluate and get number of passed samples
num_passed = self.evaluate(
    codes_dict=codes_dict,
    timeout_sec=timeout_sec,
    on_problem_complete=on_problem_complete,
)
# Calculate pass@1
total_samples = len(df)
pass_at_1 = num_passed / total_samples if total_samples > 0 else 0.0
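
One possible fix is to reduce the returned dictionary to a count before dividing. The sketch below assumes each boolean in the `dict[str, list[bool]]` result is the pass/fail outcome of one sample; the helper name `pass_at_1_from_results` is hypothetical and not part of the PR.

```python
def pass_at_1_from_results(
    results: dict[str, list[bool]],
    total_samples: int,
) -> float:
    """Compute pass@1 from per-problem pass/fail lists.

    Assumes one boolean per evaluated sample (an assumption, not confirmed
    by the PR). True counts as 1 and False as 0 when summed.
    """
    num_passed = sum(sum(outcomes) for outcomes in results.values())
    return num_passed / total_samples if total_samples > 0 else 0.0
```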

  2. Hardcoded dataset path ignores lcb_version parameter: The subprocess command in _evaluate_via_subprocess() hardcodes "datasets/livecodebench/release_v6" for the --datasets-dir argument, ignoring self.lcb_version. This causes incorrect behavior when users specify a different version (e.g., release_v5).

self.lcb_version,
"--datasets-dir",
"datasets/livecodebench/release_v6",
"--timeout",
str(self.timeout),
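
A fix along these lines would derive the directory from the version instead of hardcoding it. The sketch assumes datasets live under `<root>/<lcb_version>`, which matches the hardcoded `release_v6` path but is not confirmed for other versions; `datasets_dir_for` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def datasets_dir_for(
    lcb_version: str,
    root: str = "datasets/livecodebench",
) -> str:
    """Build the --datasets-dir value from the requested LCB version.

    Assumes a <root>/<version> layout (hypothetical helper for illustration).
    """
    return str(PurePosixPath(root) / lcb_version)
```

The subprocess argument list could then pass `datasets_dir_for(self.lcb_version)` in place of the literal string.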

  3. Breaking change: IdentityExtractor removed but still referenced: The IdentityExtractor class was removed from extractor.py, but three YAML configuration files still reference identity_extractor. This will cause KeyError when loading these configs:
    • examples/05_Llama3.1-8B_Example/offline_llama3_8b_cnn.yaml
    • examples/05_Llama3.1-8B_Example/online_llama3_8b_cnn.yaml
    • examples/06_Llama2-70B_Example/online_llama2_70b_orca.yaml

# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific permissions and
# limitations under the License.
import inspect
import re
from abc import ABC, abstractmethod
from typing import ClassVar


class Extractor(ABC):
    """An Extractor is used to extract phrases or substrings from the model's outputs using
    multiple regex patterns with a priority system. This is useful for extracting values from
    strings with the same general format but small variations, such as a model outputting a
    numeric value plain or inside a LaTeX block.
    """

    # Provide a registration and lookup system for derived Extractor classes by name.
    # This allows registering new extractors that can be instantiated via config/lookup.
    PREDEFINED: ClassVar[dict[str, type["Extractor"]]] = {}

    def __init_subclass__(
        cls,
        extractor_id: str | None = None,
        **kwargs,
    ):
        super().__init_subclass__(**kwargs)

Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

Collaborator

@arekay-nv arekay-nv left a comment

Awesome, thanks!

@nv-alicheng nv-alicheng merged commit 0140906 into main Jan 27, 2026
4 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Jan 27, 2026
@arekay-nv arekay-nv deleted the feat/alicheng-lcb-container branch April 2, 2026 03:05