Expand eval suite to 21 tests and enhance SKILL.md#27
Conversation
Add self-contained eval prompts with inline code samples and content assertions for: PHP 8.x features (readonly, enums, match, named args, nullsafe, #[Override], #[SensitiveParameter], typed constants), strict types, DTOs, PSR interfaces, PHPStan config, Rector config, PHP-CS-Fixer config, PHPat architecture tests, baseline strategy, and migration planning.
Script runs each eval prompt with and without the skill loaded, collecting output size, line count, and duration metrics. Supports selective runs via --indices flag.
Add #[Override], typed constants, #[SensitiveParameter], property hooks, copy-on-write awareness, DOMDocument UTF-8 pitfall, and PHP-CS-Fixer deprecation detection to expertise areas and migration checklist.
There was a problem hiding this comment.
Code Review
This pull request updates the php-modernization plugin to version 1.13.0, expanding its expertise to include PHP 8.3 features like typed constants and the #[Override] attribute. It also introduces a comprehensive A/B testing suite in evals/ to measure the plugin's performance. Feedback focuses on improving the portability and robustness of the run-ab-test.sh script, specifically by replacing non-POSIX commands like date +%s%N and seq with portable Python or native shell alternatives, and fixing a potential shell injection vulnerability when passing variables to Python.
| START_B=$(date +%s%N) | ||
| claude --print --model sonnet --max-turns 3 --dangerously-skip-permissions \ | ||
| --disable-slash-commands \ | ||
| --plugin-dir "$SKILL_DIR/.." \ | ||
| --append-system-prompt "$SYSTEM_HINT" \ | ||
| -p "$EVAL_PROMPT" > "$RESULTS_DIR/${EVAL_NAME}_with.txt" 2>/dev/null || true | ||
| END_B=$(date +%s%N) | ||
| DURATION_B=$(( (END_B - START_B) / 1000000 )) |
There was a problem hiding this comment.
This block has two issues: 1) date +%s%N is not portable across systems. 2) The --plugin-dir path is incorrect; the plugin manifest (plugin.json) is located in the repository root (under .claude-plugin/), but $SKILL_DIR/.. points to the skills/ directory. Correcting the path to the root ensures the skill is loaded correctly.
| START_B=$(date +%s%N) | |
| claude --print --model sonnet --max-turns 3 --dangerously-skip-permissions \ | |
| --disable-slash-commands \ | |
| --plugin-dir "$SKILL_DIR/.." \ | |
| --append-system-prompt "$SYSTEM_HINT" \ | |
| -p "$EVAL_PROMPT" > "$RESULTS_DIR/${EVAL_NAME}_with.txt" 2>/dev/null || true | |
| END_B=$(date +%s%N) | |
| DURATION_B=$(( (END_B - START_B) / 1000000 )) | |
| START_B=$(python3 -c 'import time; print(int(time.time() * 1000))') | |
| claude --print --model sonnet --max-turns 3 --dangerously-skip-permissions \ | |
| --disable-slash-commands \ | |
| --plugin-dir "$(dirname "$SCRIPT_DIR")" \ | |
| --append-system-prompt "$SYSTEM_HINT" \ | |
| -p "$EVAL_PROMPT" > "$RESULTS_DIR/${EVAL_NAME}_with.txt" 2>/dev/null || true | |
| END_B=$(python3 -c 'import time; print(int(time.time() * 1000))') | |
| DURATION_B=$(( END_B - START_B )) |
References
- Prefer POSIX-compliant tools over non-standard command-line flags in shell scripts to ensure portability across different environments.
| if [[ "${1:-}" == "--indices" && -n "${2:-}" ]]; then | ||
| IFS=',' read -ra INDICES <<< "$2" | ||
| else | ||
| INDICES=($(seq 0 $((EVAL_COUNT - 1)))) |
There was a problem hiding this comment.
The seq command is not POSIX-compliant. For better portability across different environments (e.g., BSD vs. GNU), consider using a native shell loop to populate the indices.
| INDICES=($(seq 0 $((EVAL_COUNT - 1)))) | |
| INDICES=(); for ((i=0; i<EVAL_COUNT; i++)); do INDICES+=("$i"); done |
References
- Prefer POSIX-compliant tools over non-standard command-line flags in shell scripts to ensure portability across different environments.
| START_A=$(date +%s%N) | ||
| claude --print --model sonnet --max-turns 3 --dangerously-skip-permissions \ | ||
| --disable-slash-commands \ | ||
| --append-system-prompt "$SYSTEM_HINT" \ | ||
| -p "$EVAL_PROMPT" > "$RESULTS_DIR/${EVAL_NAME}_without.txt" 2>/dev/null || true | ||
| END_A=$(date +%s%N) | ||
| DURATION_A=$(( (END_A - START_A) / 1000000 )) |
There was a problem hiding this comment.
The use of date +%s%N is not POSIX-compliant and will fail on non-GNU systems (like macOS), where the %N flag is not supported. Since python3 is already a dependency for this script, you can use it for portable high-precision timing.
| START_A=$(date +%s%N) | |
| claude --print --model sonnet --max-turns 3 --dangerously-skip-permissions \ | |
| --disable-slash-commands \ | |
| --append-system-prompt "$SYSTEM_HINT" \ | |
| -p "$EVAL_PROMPT" > "$RESULTS_DIR/${EVAL_NAME}_without.txt" 2>/dev/null || true | |
| END_A=$(date +%s%N) | |
| DURATION_A=$(( (END_A - START_A) / 1000000 )) | |
| START_A=$(python3 -c 'import time; print(int(time.time() * 1000))') | |
| claude --print --model sonnet --max-turns 3 --dangerously-skip-permissions \ | |
| --disable-slash-commands \ | |
| --append-system-prompt "$SYSTEM_HINT" \ | |
| -p "$EVAL_PROMPT" > "$RESULTS_DIR/${EVAL_NAME}_without.txt" 2>/dev/null || true | |
| END_A=$(python3 -c 'import time; print(int(time.time() * 1000))') | |
| DURATION_A=$(( END_A - START_A )) |
References
- Prefer POSIX-compliant tools over non-standard command-line flags in shell scripts to ensure portability across different environments.
| python3 -c " | ||
| import json | ||
| metrics = { | ||
| 'name': '$EVAL_NAME', | ||
| 'without': {'bytes': $SIZE_A, 'lines': $LINES_A, 'duration_ms': $DURATION_A}, | ||
| 'with': {'bytes': $SIZE_B, 'lines': $LINES_B, 'duration_ms': $DURATION_B} | ||
| } | ||
| with open('$RESULTS_DIR/${EVAL_NAME}_metrics.json', 'w') as f: | ||
| json.dump(metrics, f, indent=2) | ||
| " |
There was a problem hiding this comment.
Injecting shell variables directly into the Python script string is fragile. If a variable like $EVAL_NAME contains a single quote, it will break the Python syntax and cause a crash. It is safer to pass these values as command-line arguments to the Python process.
| python3 -c " | |
| import json | |
| metrics = { | |
| 'name': '$EVAL_NAME', | |
| 'without': {'bytes': $SIZE_A, 'lines': $LINES_A, 'duration_ms': $DURATION_A}, | |
| 'with': {'bytes': $SIZE_B, 'lines': $LINES_B, 'duration_ms': $DURATION_B} | |
| } | |
| with open('$RESULTS_DIR/${EVAL_NAME}_metrics.json', 'w') as f: | |
| json.dump(metrics, f, indent=2) | |
| " | |
| python3 -c " | |
| import json, sys | |
| name, results_dir = sys.argv[1], sys.argv[2] | |
| metrics = { | |
| 'name': name, | |
| 'without': {'bytes': $SIZE_A, 'lines': $LINES_A, 'duration_ms': $DURATION_A}, | |
| 'with': {'bytes': $SIZE_B, 'lines': $LINES_B, 'duration_ms': $DURATION_B} | |
| } | |
| with open(f'{results_dir}/{name}_metrics.json', 'w') as f: | |
| json.dump(metrics, f, indent=2) | |
| " "$EVAL_NAME" "$RESULTS_DIR" |
There was a problem hiding this comment.
Pull request overview
Expands the php-modernization skill’s evaluation suite and supporting docs/scripts to better cover PHP 8.x modernization patterns, static analysis tooling, and migration guidance.
Changes:
- Expanded
evals/evals.jsonfrom a minimal set to 21 self-contained evals spanning feature usage, tooling config, and migration topics. - Added
evals/run-ab-test.shto run baseline vs. skill A/B comparisons and store per-eval output/metrics. - Updated
skills/php-modernization/SKILL.md,.claude-plugin/plugin.json, and.gitignoreto reflect the new version and generated artifacts.
Reviewed changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| skills/php-modernization/SKILL.md | Updates version and enriches expertise/checklist items for newer PHP features and pitfalls. |
| evals/run-ab-test.sh | Adds a CLI script to run A/B eval comparisons and write results/metrics files. |
| evals/evals.json | Adds a broader, 21-case eval suite with prompt content and assertion checks. |
| .gitignore | Ignores generated A/B results directory. |
| .claude-plugin/plugin.json | Bumps plugin version to 1.13.0. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| "type": "content_contains", | ||
| "value": "public function __construct(", | ||
| "description": "Uses constructor property promotion" |
There was a problem hiding this comment.
In modernize-php-class, the assertion content_contains: "public function __construct(" doesn’t actually verify constructor property promotion (the original code already contains a constructor). Tighten this to assert a visibility keyword on the constructor parameters (e.g., __construct(public ...) or a regex that matches promoted properties, otherwise the eval can pass without using the intended PHP 8.0+ feature.
| "type": "content_contains", | |
| "value": "public function __construct(", | |
| "description": "Uses constructor property promotion" | |
| "type": "content_regex", | |
| "value": "__construct\\s*\\((public|protected|private)\\s", | |
| "description": "Uses constructor property promotion (visibility keyword on constructor parameters)" |
| { | ||
| "type": "content_regex", | ||
| "value": "public (readonly )?function __construct\\(", | ||
| "description": "Uses constructor property promotion" | ||
| }, | ||
| { | ||
| "type": "content_regex", | ||
| "value": "(public|private|protected) readonly", | ||
| "description": "Applies readonly modifier to promoted properties" | ||
| } |
There was a problem hiding this comment.
The readonly-properties eval uses the regex public (readonly )?function __construct\(, but PHP doesn’t support readonly function, and this pattern also doesn’t ensure constructor property promotion occurred. Consider asserting promotion directly (e.g., visibility keywords in the parameter list) and keep readonly checks focused on readonly class / public readonly properties.
| # WITHOUT skill (baseline) | ||
| echo " Running WITHOUT skill..." | ||
| START_A=$(date +%s%N) | ||
| claude --print --model sonnet --max-turns 3 --dangerously-skip-permissions \ | ||
| --disable-slash-commands \ | ||
| --append-system-prompt "$SYSTEM_HINT" \ | ||
| -p "$EVAL_PROMPT" > "$RESULTS_DIR/${EVAL_NAME}_without.txt" 2>/dev/null || true | ||
| END_A=$(date +%s%N) | ||
| DURATION_A=$(( (END_A - START_A) / 1000000 )) | ||
|
|
||
| # WITH skill | ||
| echo " Running WITH skill..." | ||
| START_B=$(date +%s%N) | ||
| claude --print --model sonnet --max-turns 3 --dangerously-skip-permissions \ | ||
| --disable-slash-commands \ | ||
| --plugin-dir "$SKILL_DIR/.." \ | ||
| --append-system-prompt "$SYSTEM_HINT" \ | ||
| -p "$EVAL_PROMPT" > "$RESULTS_DIR/${EVAL_NAME}_with.txt" 2>/dev/null || true | ||
| END_B=$(date +%s%N) |
There was a problem hiding this comment.
The script intentionally ignores all claude failures (2>/dev/null || true). That makes A/B results hard to trust because an invocation can fail silently and still produce “metrics” from an empty/partial output file. Consider capturing stderr to a per-eval log, recording the exit status in the metrics JSON, and failing the run (or at least printing a clear warning) when a call exits non-zero.
| # Parse optional --indices flag for selective runs | ||
| if [[ "${1:-}" == "--indices" && -n "${2:-}" ]]; then | ||
| IFS=',' read -ra INDICES <<< "$2" | ||
| else | ||
| INDICES=($(seq 0 $((EVAL_COUNT - 1)))) | ||
| fi | ||
|
|
||
| echo "Running A/B tests for ${#INDICES[@]} of $EVAL_COUNT evals..." | ||
|
|
||
| for i in "${INDICES[@]}"; do | ||
| EVAL_NAME=$(python3 -c "import json; print(json.load(open('$EVALS_FILE'))[$i]['name'])") | ||
| EVAL_PROMPT=$(python3 -c "import json,sys; sys.stdout.write(json.load(open('$EVALS_FILE'))[$i]['prompt'])") | ||
|
|
There was a problem hiding this comment.
--indices values are interpolated directly into the Python one-liner (...json.load(... )[$i]...). Since i comes from user input, a malicious value (or even accidental whitespace) can break the script or inject Python code. Validate indices as ^[0-9]+$ (and range-check against EVAL_COUNT) before use, or pass the index via sys.argv and convert with int() inside Python.
Summary
evals/run-ab-test.sh) for systematic comparisonEval Coverage (21 tests)
A/B Test Results (8 representative evals, sonnet baseline)
Niche pattern analysis (skill-specific knowledge)
Findings: Baseline Claude passes standard PHP assertions well. The skill adds value for niche patterns (PHPStan treatPhpDocTypesAsCertain, copy-on-write security, deprecated alias detection). Identified gap: PHP 8.4 Dom\HTMLDocument not surfaced in eval responses despite being documented in references.
SKILL.md Improvements
Test plan