Skip to content

Conversation

Copy link

Copilot AI commented Oct 27, 2025

Strengthens subprocess usage in build_database function to prevent command injection vulnerabilities through explicit security controls and input validation.

Changes

  • Input validation: Whitelist language parameter against known CodeQL languages (cpp, csharp, go, java, javascript, python, ruby, rust, swift); validate extractor_options as string-only iterable
  • Explicit security: Set shell=False explicitly in all subprocess calls (build_database, clone_project, gitroot)
  • Documentation: Add security expectations docstring and inline comments referencing OWASP/Python subprocess best practices

Example

Before:

def build_database(language: str, extractor_options, project: Project, project_dir: str):
    subprocess.check_call([
        "codeql", "database", "create",
        f"--language={language}",
        *extractor_options,
        database_dir,
    ])

After:

def build_database(language: str, extractor_options, project: Project, project_dir: str):
    SUPPORTED_LANGUAGES = {"cpp", "csharp", "go", "java", ...}
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language '{language}'")
    
    extractor_options_list = list(extractor_options)
    for opt in extractor_options_list:
        if not isinstance(opt, str):
            raise ValueError("All extractor options must be strings")
    
    subprocess.check_call([
        "codeql", "database", "create",
        f"--language={language}",
        *formatted_options,
        database_dir,
    ], shell=False)

Defense-in-depth approach: validates config file inputs despite trusted source, prevents regression from future refactoring.

Original prompt

Refactor the build_database function in misc/scripts/models-as-data/bulk_generate_mad.py to strengthen subprocess usage and prevent command injection vulnerabilities, as described in issue #5. Specifically:

  • Explicitly set shell=False in all subprocess.check_call invocations (even if already defaulted) to make intent clear.
  • Validate all user-controlled or variable inputs passed to subprocess, especially 'language' (restrict to a known set) and 'extractor_options' (ensure they are a dict or valid type, and do not contain unsafe data).
  • Add comments above each subprocess call explaining the reasoning, referencing Python subprocess and OWASP security best practices. Indicate why this use is safe and what future maintainers should check.
  • Add a docstring to build_database describing its security expectations and assumptions about input sources.
  • Ensure all changes follow the repo's code style and are concise.

This refactor will make subprocess usage safe by default, guard against regressions, and improve auditability for future maintainers.

This pull request was created as a result of the following prompt from Copilot chat.

Refactor the build_database function in misc/scripts/models-as-data/bulk_generate_mad.py to strengthen subprocess usage and prevent command injection vulnerabilities, as described in issue #5. Specifically:

  • Explicitly set shell=False in all subprocess.check_call invocations (even if already defaulted) to make intent clear.
  • Validate all user-controlled or variable inputs passed to subprocess, especially 'language' (restrict to a known set) and 'extractor_options' (ensure they are a dict or valid type, and do not contain unsafe data).
  • Add comments above each subprocess call explaining the reasoning, referencing Python subprocess and OWASP security best practices. Indicate why this use is safe and what future maintainers should check.
  • Add a docstring to build_database describing its security expectations and assumptions about input sources.
  • Ensure all changes follow the repo's code style and are concise.

This refactor will make subprocess usage safe by default, guard against regressions, and improve auditability for future maintainers.


✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.

Copilot AI and others added 2 commits October 27, 2025 21:28
- Added comprehensive docstring with security expectations
- Validate language parameter against whitelist of supported CodeQL languages
- Validate extractor_options type and content (must be iterable of strings)
- Explicitly set shell=False in all subprocess calls
- Added detailed security comments referencing OWASP and Python docs
- All changes formatted with Black

Co-authored-by: dmsmi <6402178+dmsmi@users.noreply.github.com>
- Remove exception details from error messages to avoid exposing internal details
- Remove option values from error messages to prevent potential log exposure
- Use descriptive variable name 'extractor_option' instead of 'x' in comprehension

Co-authored-by: dmsmi <6402178+dmsmi@users.noreply.github.com>
Copilot AI changed the title [WIP] Refactor build_database to enhance subprocess security Harden subprocess calls in bulk_generate_mad.py against command injection Oct 27, 2025
Copilot AI requested a review from dmsmi October 27, 2025 21:37
Copilot finished work on behalf of dmsmi October 27, 2025 21:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants