Expand eval suite to 21 tests and enhance SKILL.md by CybotTM · Pull Request #27 · netresearch/php-modernization-skill

CybotTM · 2026-04-01T06:20:05Z

Summary

Expand eval suite from 2 to 21 self-contained tests covering all skill areas
Add A/B test runner script (evals/run-ab-test.sh) for systematic comparison
Enhance SKILL.md with PHP 8.2-8.4 features, pitfalls, and migration checklist items
Bump version to 1.13.0

Eval Coverage (21 tests)

Area	Evals
PHP 8.x features	modernize-php-class, switch-to-match, readonly-properties, nullsafe-operator, named-arguments, sensitive-parameter-attribute, override-attribute, typed-class-constants
Strict types	add-strict-types
Enums	constants-to-enum
DTOs/Value Objects	array-to-dto, value-object-pattern, type-safe-collection
Static analysis	configure-phpstan-level9, configure-rector, configure-php-cs-fixer, phpstan-baseline-strategy, phpat-architecture-test, composer-scripts-qa
PSR compliance	psr-interface-typehint
Migration	migration-assessment

A/B Test Results (8 representative evals, sonnet baseline)

Eval	Assertions WITHOUT	Assertions WITH	Output WITHOUT	Output WITH
modernize-php-class	3/3	3/3	1414B	1693B
constants-to-enum	4/4	4/4	1858B	2014B
array-to-dto	3/3	3/3	1958B	1804B
configure-php-cs-fixer	3/3	3/3	3586B	3532B
psr-interface-typehint	3/3	3/3	2096B	2281B
phpstan-baseline-strategy	3/3	3/3	1524B	1976B
migration-assessment	3/4	3/4	3382B	3355B
sensitive-parameter-attribute	1/1	1/1	2662B	3245B
Total	23/24 (96%)	23/24 (96%)

Niche pattern analysis (skill-specific knowledge)

Pattern	WITHOUT	WITH	Advantage
treatPhpDoc + runtime context	N	Y	+SKILL
copy-on-write reference fix ($)	N	Y	+SKILL
PHP 8.4 HTMLDocument alternative	N	N	gap identified

Findings: Baseline Claude passes standard PHP assertions well. The skill adds value for niche patterns (PHPStan treatPhpDocTypesAsCertain, copy-on-write security, deprecated alias detection). Identified gap: PHP 8.4 Dom\HTMLDocument not surfaced in eval responses despite being documented in references.

SKILL.md Improvements

Added PHP 8.2-8.4 features to expertise: #[Override], typed constants, #[SensitiveParameter], property hooks
Added copy-on-write awareness and DOMDocument UTF-8 pitfall to expertise
Enhanced migration checklist with 3 new items
Word count: 462/500

Test plan

Verify evals.json is valid JSON with 21 entries
Run bash evals/run-ab-test.sh --indices 0,3,9 for spot-check
Verify SKILL.md word count <= 500
Verify plugin.json version matches SKILL.md version

Add self-contained eval prompts with inline code samples and content assertions for: PHP 8.x features (readonly, enums, match, named args, nullsafe, #[Override], #[SensitiveParameter], typed constants), strict types, DTOs, PSR interfaces, PHPStan config, Rector config, PHP-CS-Fixer config, PHPat architecture tests, baseline strategy, and migration planning.

Script runs each eval prompt with and without the skill loaded, collecting output size, line count, and duration metrics. Supports selective runs via --indices flag.

Add #[Override], typed constants, #[SensitiveParameter], property hooks, copy-on-write awareness, DOMDocument UTF-8 pitfall, and PHP-CS-Fixer deprecation detection to expertise areas and migration checklist.

gemini-code-assist

Code Review

This pull request updates the php-modernization plugin to version 1.13.0, expanding its expertise to include PHP 8.3 features like typed constants and the #[Override] attribute. It also introduces a comprehensive A/B testing suite in evals/ to measure the plugin's performance. Feedback focuses on improving the portability and robustness of the run-ab-test.sh script, specifically by replacing non-POSIX commands like date +%s%N and seq with portable Python or native shell alternatives, and fixing a potential shell injection vulnerability when passing variables to Python.

gemini-code-assist · 2026-04-01T06:21:38Z

+    START_B=$(date +%s%N)
+    claude --print --model sonnet --max-turns 3 --dangerously-skip-permissions \
+        --disable-slash-commands \
+        --plugin-dir "$SKILL_DIR/.." \
+        --append-system-prompt "$SYSTEM_HINT" \
+        -p "$EVAL_PROMPT" > "$RESULTS_DIR/${EVAL_NAME}_with.txt" 2>/dev/null || true
+    END_B=$(date +%s%N)
+    DURATION_B=$(( (END_B - START_B) / 1000000 ))


This block has two issues: 1) date +%s%N is not portable across systems. 2) The --plugin-dir path is incorrect; the plugin manifest (plugin.json) is located in the repository root (under .claude-plugin/), but $SKILL_DIR/.. points to the skills/ directory. Correcting the path to the root ensures the skill is loaded correctly.

Suggested change

START_B=$(date +%s%N)

claude --print --model sonnet --max-turns 3 --dangerously-skip-permissions \

--disable-slash-commands \

--plugin-dir "$SKILL_DIR/.." \

--append-system-prompt "$SYSTEM_HINT" \

-p "$EVAL_PROMPT" > "$RESULTS_DIR/${EVAL_NAME}_with.txt" 2>/dev/null || true

END_B=$(date +%s%N)

DURATION_B=$(( (END_B - START_B) / 1000000 ))

START_B=$(python3 -c 'import time; print(int(time.time() * 1000))')

claude --print --model sonnet --max-turns 3 --dangerously-skip-permissions \

--disable-slash-commands \

--plugin-dir "$(dirname "$SCRIPT_DIR")" \

--append-system-prompt "$SYSTEM_HINT" \

-p "$EVAL_PROMPT" > "$RESULTS_DIR/${EVAL_NAME}_with.txt" 2>/dev/null || true

END_B=$(python3 -c 'import time; print(int(time.time() * 1000))')

DURATION_B=$(( END_B - START_B ))

References

Prefer POSIX-compliant tools over non-standard command-line flags in shell scripts to ensure portability across different environments.

gemini-code-assist · 2026-04-01T06:21:38Z

+if [[ "${1:-}" == "--indices" && -n "${2:-}" ]]; then
+    IFS=',' read -ra INDICES <<< "$2"
+else
+    INDICES=($(seq 0 $((EVAL_COUNT - 1))))


The seq command is not POSIX-compliant. For better portability across different environments (e.g., BSD vs. GNU), consider using a native shell loop to populate the indices.

Suggested change

INDICES=($(seq 0 $((EVAL_COUNT - 1))))

INDICES=(); for ((i=0; i<EVAL_COUNT; i++)); do INDICES+=("$i"); done

References

Prefer POSIX-compliant tools over non-standard command-line flags in shell scripts to ensure portability across different environments.

gemini-code-assist · 2026-04-01T06:21:38Z

+    START_A=$(date +%s%N)
+    claude --print --model sonnet --max-turns 3 --dangerously-skip-permissions \
+        --disable-slash-commands \
+        --append-system-prompt "$SYSTEM_HINT" \
+        -p "$EVAL_PROMPT" > "$RESULTS_DIR/${EVAL_NAME}_without.txt" 2>/dev/null || true
+    END_A=$(date +%s%N)
+    DURATION_A=$(( (END_A - START_A) / 1000000 ))


The use of date +%s%N is not POSIX-compliant and will fail on non-GNU systems (like macOS), where the %N flag is not supported. Since python3 is already a dependency for this script, you can use it for portable high-precision timing.

Suggested change

START_A=$(date +%s%N)

claude --print --model sonnet --max-turns 3 --dangerously-skip-permissions \

--disable-slash-commands \

--append-system-prompt "$SYSTEM_HINT" \

-p "$EVAL_PROMPT" > "$RESULTS_DIR/${EVAL_NAME}_without.txt" 2>/dev/null || true

END_A=$(date +%s%N)

DURATION_A=$(( (END_A - START_A) / 1000000 ))

START_A=$(python3 -c 'import time; print(int(time.time() * 1000))')

claude --print --model sonnet --max-turns 3 --dangerously-skip-permissions \

--disable-slash-commands \

--append-system-prompt "$SYSTEM_HINT" \

-p "$EVAL_PROMPT" > "$RESULTS_DIR/${EVAL_NAME}_without.txt" 2>/dev/null || true

END_A=$(python3 -c 'import time; print(int(time.time() * 1000))')

DURATION_A=$(( END_A - START_A ))

References

Prefer POSIX-compliant tools over non-standard command-line flags in shell scripts to ensure portability across different environments.

gemini-code-assist · 2026-04-01T06:21:38Z

+    python3 -c "
+import json
+metrics = {
+    'name': '$EVAL_NAME',
+    'without': {'bytes': $SIZE_A, 'lines': $LINES_A, 'duration_ms': $DURATION_A},
+    'with': {'bytes': $SIZE_B, 'lines': $LINES_B, 'duration_ms': $DURATION_B}
+}
+with open('$RESULTS_DIR/${EVAL_NAME}_metrics.json', 'w') as f:
+    json.dump(metrics, f, indent=2)
+"


Injecting shell variables directly into the Python script string is fragile. If a variable like $EVAL_NAME contains a single quote, it will break the Python syntax and cause a crash. It is safer to pass these values as command-line arguments to the Python process.

Suggested change

python3 -c "

import json

metrics = {

'name': '$EVAL_NAME',

'without': {'bytes': $SIZE_A, 'lines': $LINES_A, 'duration_ms': $DURATION_A},

'with': {'bytes': $SIZE_B, 'lines': $LINES_B, 'duration_ms': $DURATION_B}

}

with open('$RESULTS_DIR/${EVAL_NAME}_metrics.json', 'w') as f:

json.dump(metrics, f, indent=2)

"

python3 -c "

import json, sys

name, results_dir = sys.argv[1], sys.argv[2]

metrics = {

'name': name,

'without': {'bytes': $SIZE_A, 'lines': $LINES_A, 'duration_ms': $DURATION_A},

'with': {'bytes': $SIZE_B, 'lines': $LINES_B, 'duration_ms': $DURATION_B}

}

with open(f'{results_dir}/{name}_metrics.json', 'w') as f:

json.dump(metrics, f, indent=2)

" "$EVAL_NAME" "$RESULTS_DIR"

Copilot

Pull request overview

Expands the php-modernization skill’s evaluation suite and supporting docs/scripts to better cover PHP 8.x modernization patterns, static analysis tooling, and migration guidance.

Changes:

Expanded evals/evals.json from a minimal set to 21 self-contained evals spanning feature usage, tooling config, and migration topics.
Added evals/run-ab-test.sh to run baseline vs. skill A/B comparisons and store per-eval output/metrics.
Updated skills/php-modernization/SKILL.md, .claude-plugin/plugin.json, and .gitignore to reflect the new version and generated artifacts.

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
skills/php-modernization/SKILL.md	Updates version and enriches expertise/checklist items for newer PHP features and pitfalls.
evals/run-ab-test.sh	Adds a CLI script to run A/B eval comparisons and write results/metrics files.
evals/evals.json	Adds a broader, 21-case eval suite with prompt content and assertion checks.
.gitignore	Ignores generated A/B results directory.
.claude-plugin/plugin.json	Bumps plugin version to 1.13.0.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-01T06:23:25Z

+        "type": "content_contains",
+        "value": "public function __construct(",
+        "description": "Uses constructor property promotion"


In modernize-php-class, the assertion content_contains: "public function __construct(" doesn’t actually verify constructor property promotion (the original code already contains a constructor). Tighten this to assert a visibility keyword on the constructor parameters (e.g., __construct(public ...) or a regex that matches promoted properties, otherwise the eval can pass without using the intended PHP 8.0+ feature.

Suggested change

"type": "content_contains",

"value": "public function __construct(",

"description": "Uses constructor property promotion"

"type": "content_regex",

"value": "__construct\\s*\\((public|protected|private)\\s",

"description": "Uses constructor property promotion (visibility keyword on constructor parameters)"

Copilot · 2026-04-01T06:23:26Z

+      {
+        "type": "content_regex",
+        "value": "public (readonly )?function __construct\\(",
+        "description": "Uses constructor property promotion"
+      },
+      {
+        "type": "content_regex",
+        "value": "(public|private|protected) readonly",
+        "description": "Applies readonly modifier to promoted properties"
+      }


The readonly-properties eval uses the regex public (readonly )?function __construct\(, but PHP doesn’t support readonly function, and this pattern also doesn’t ensure constructor property promotion occurred. Consider asserting promotion directly (e.g., visibility keywords in the parameter list) and keep readonly checks focused on readonly class / public readonly properties.

Copilot · 2026-04-01T06:23:26Z

+    # WITHOUT skill (baseline)
+    echo "  Running WITHOUT skill..."
+    START_A=$(date +%s%N)
+    claude --print --model sonnet --max-turns 3 --dangerously-skip-permissions \
+        --disable-slash-commands \
+        --append-system-prompt "$SYSTEM_HINT" \
+        -p "$EVAL_PROMPT" > "$RESULTS_DIR/${EVAL_NAME}_without.txt" 2>/dev/null || true
+    END_A=$(date +%s%N)
+    DURATION_A=$(( (END_A - START_A) / 1000000 ))
+
+    # WITH skill
+    echo "  Running WITH skill..."
+    START_B=$(date +%s%N)
+    claude --print --model sonnet --max-turns 3 --dangerously-skip-permissions \
+        --disable-slash-commands \
+        --plugin-dir "$SKILL_DIR/.." \
+        --append-system-prompt "$SYSTEM_HINT" \
+        -p "$EVAL_PROMPT" > "$RESULTS_DIR/${EVAL_NAME}_with.txt" 2>/dev/null || true
+    END_B=$(date +%s%N)


The script intentionally ignores all claude failures (2>/dev/null || true). That makes A/B results hard to trust because an invocation can fail silently and still produce “metrics” from an empty/partial output file. Consider capturing stderr to a per-eval log, recording the exit status in the metrics JSON, and failing the run (or at least printing a clear warning) when a call exits non-zero.

Copilot · 2026-04-01T06:23:26Z

+# Parse optional --indices flag for selective runs
+if [[ "${1:-}" == "--indices" && -n "${2:-}" ]]; then
+    IFS=',' read -ra INDICES <<< "$2"
+else
+    INDICES=($(seq 0 $((EVAL_COUNT - 1))))
+fi
+
+echo "Running A/B tests for ${#INDICES[@]} of $EVAL_COUNT evals..."
+
+for i in "${INDICES[@]}"; do
+    EVAL_NAME=$(python3 -c "import json; print(json.load(open('$EVALS_FILE'))[$i]['name'])")
+    EVAL_PROMPT=$(python3 -c "import json,sys; sys.stdout.write(json.load(open('$EVALS_FILE'))[$i]['prompt'])")
+


--indices values are interpolated directly into the Python one-liner (...json.load(... )[$i]...). Since i comes from user input, a malicious value (or even accidental whitespace) can break the script or inject Python code. Validate indices as ^[0-9]+$ (and range-check against EVAL_COUNT) before use, or pass the index via sys.argv and convert with int() inside Python.

CybotTM added 4 commits April 1, 2026 08:18

feat: add A/B test runner script for eval comparison

7e13500

Script runs each eval prompt with and without the skill loaded, collecting output size, line count, and duration metrics. Supports selective runs via --indices flag.

feat: enhance SKILL.md with PHP 8.2-8.4 features and pitfalls

2a049be

Add #[Override], typed constants, #[SensitiveParameter], property hooks, copy-on-write awareness, DOMDocument UTF-8 pitfall, and PHP-CS-Fixer deprecation detection to expertise areas and migration checklist.

chore: bump version to 1.13.0, gitignore eval results

b4cd10d

Copilot AI review requested due to automatic review settings April 1, 2026 06:20

github-actions Bot approved these changes Apr 1, 2026

View reviewed changes

Copilot started reviewing on behalf of CybotTM April 1, 2026 06:20 View session

gemini-code-assist Bot reviewed Apr 1, 2026

View reviewed changes

Copilot AI reviewed Apr 1, 2026

View reviewed changes

CybotTM merged commit c80ebef into main Apr 1, 2026
9 checks passed

CybotTM deleted the feature/evals-and-improvements branch April 1, 2026 06:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Expand eval suite to 21 tests and enhance SKILL.md#27

Expand eval suite to 21 tests and enhance SKILL.md#27
CybotTM merged 4 commits intomainfrom
feature/evals-and-improvements

CybotTM commented Apr 1, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Apr 1, 2026

Uh oh!

gemini-code-assist Bot Apr 1, 2026

Uh oh!

gemini-code-assist Bot Apr 1, 2026

Uh oh!

gemini-code-assist Bot Apr 1, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 1, 2026

Uh oh!

Copilot AI Apr 1, 2026

Uh oh!

Copilot AI Apr 1, 2026

Uh oh!

Copilot AI Apr 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	INDICES=($(seq 0 $((EVAL_COUNT - 1))))
	INDICES=(); for ((i=0; i<EVAL_COUNT; i++)); do INDICES+=("$i"); done

Conversation

CybotTM commented Apr 1, 2026

Summary

Eval Coverage (21 tests)

A/B Test Results (8 representative evals, sonnet baseline)

Niche pattern analysis (skill-specific knowledge)

SKILL.md Improvements

Test plan

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants