Harden subprocess calls in bulk_generate_mad.py against command injection #7

Copilot · 2025-10-27T21:20:15Z

Strengthens subprocess usage in build_database function to prevent command injection vulnerabilities through explicit security controls and input validation.

Changes

Input validation: Whitelist language parameter against known CodeQL languages (cpp, csharp, go, java, javascript, python, ruby, rust, swift); validate extractor_options as string-only iterable
Explicit security: Set shell=False explicitly in all subprocess calls (build_database, clone_project, gitroot)
Documentation: Add security expectations docstring and inline comments referencing OWASP/Python subprocess best practices

Example

Before:

def build_database(language: str, extractor_options, project: Project, project_dir: str):
    subprocess.check_call([
        "codeql", "database", "create",
        f"--language={language}",
        *extractor_options,
        database_dir,
    ])

After:

def build_database(language: str, extractor_options, project: Project, project_dir: str):
    SUPPORTED_LANGUAGES = {"cpp", "csharp", "go", "java", ...}
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language '{language}'")
    
    extractor_options_list = list(extractor_options)
    for opt in extractor_options_list:
        if not isinstance(opt, str):
            raise ValueError("All extractor options must be strings")
    
    subprocess.check_call([
        "codeql", "database", "create",
        f"--language={language}",
        *formatted_options,
        database_dir,
    ], shell=False)

Defense-in-depth approach: validates config file inputs despite trusted source, prevents regression from future refactoring.

Original prompt

Refactor the build_database function in misc/scripts/models-as-data/bulk_generate_mad.py to strengthen subprocess usage and prevent command injection vulnerabilities, as described in issue #5. Specifically:

Explicitly set shell=False in all subprocess.check_call invocations (even if already defaulted) to make intent clear.

Validate all user-controlled or variable inputs passed to subprocess, especially 'language' (restrict to a known set) and 'extractor_options' (ensure they are a dict or valid type, and do not contain unsafe data).

Add comments above each subprocess call explaining the reasoning, referencing Python subprocess and OWASP security best practices. Indicate why this use is safe and what future maintainers should check.

Add a docstring to build_database describing its security expectations and assumptions about input sources.

Ensure all changes follow the repo's code style and are concise.

This refactor will make subprocess usage safe by default, guard against regressions, and improve auditability for future maintainers.

This pull request was created as a result of the following prompt from Copilot chat.

Refactor the build_database function in misc/scripts/models-as-data/bulk_generate_mad.py to strengthen subprocess usage and prevent command injection vulnerabilities, as described in issue #5. Specifically:

Explicitly set shell=False in all subprocess.check_call invocations (even if already defaulted) to make intent clear.

Validate all user-controlled or variable inputs passed to subprocess, especially 'language' (restrict to a known set) and 'extractor_options' (ensure they are a dict or valid type, and do not contain unsafe data).

Add comments above each subprocess call explaining the reasoning, referencing Python subprocess and OWASP security best practices. Indicate why this use is safe and what future maintainers should check.

Add a docstring to build_database describing its security expectations and assumptions about input sources.

Ensure all changes follow the repo's code style and are concise.

This refactor will make subprocess usage safe by default, guard against regressions, and improve auditability for future maintainers.

✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.

- Added comprehensive docstring with security expectations - Validate language parameter against whitelist of supported CodeQL languages - Validate extractor_options type and content (must be iterable of strings) - Explicitly set shell=False in all subprocess calls - Added detailed security comments referencing OWASP and Python docs - All changes formatted with Black Co-authored-by: dmsmi <6402178+dmsmi@users.noreply.github.com>

- Remove exception details from error messages to avoid exposing internal details - Remove option values from error messages to prevent potential log exposure - Use descriptive variable name 'extractor_option' instead of 'x' in comprehension Co-authored-by: dmsmi <6402178+dmsmi@users.noreply.github.com>

Initial plan

bb2fdeb

Copilot AI assigned Copilot and dmsmi Oct 27, 2025

Copilot started work on behalf of dmsmi October 27, 2025 21:20 View session

Copilot AI and others added 2 commits October 27, 2025 21:28

Copilot AI changed the title ~~[WIP] Refactor build_database to enhance subprocess security~~ Harden subprocess calls in bulk_generate_mad.py against command injection Oct 27, 2025

Copilot AI requested a review from dmsmi October 27, 2025 21:37

Copilot finished work on behalf of dmsmi October 27, 2025 21:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Harden subprocess calls in bulk_generate_mad.py against command injection #7

Harden subprocess calls in bulk_generate_mad.py against command injection #7

Uh oh!

Copilot AI commented Oct 27, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Harden subprocess calls in bulk_generate_mad.py against command injection #7

Are you sure you want to change the base?

Harden subprocess calls in bulk_generate_mad.py against command injection #7

Uh oh!

Conversation

Copilot AI commented Oct 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes

Example

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Copilot AI commented Oct 27, 2025 •

edited

Loading