Expand eval suite to 21 tests and enhance SKILL.md #27

Merged
CybotTM merged 4 commits into main from feature/evals-and-improvements
Apr 1, 2026

@CybotTM CybotTM commented Apr 1, 2026

Summary

  • Expand eval suite from 2 to 21 self-contained tests covering all skill areas
  • Add A/B test runner script (evals/run-ab-test.sh) for systematic comparison
  • Enhance SKILL.md with PHP 8.2-8.4 features, pitfalls, and migration checklist items
  • Bump version to 1.13.0

Eval Coverage (21 tests)

| Area | Evals |
|---|---|
| PHP 8.x features | modernize-php-class, switch-to-match, readonly-properties, nullsafe-operator, named-arguments, sensitive-parameter-attribute, override-attribute, typed-class-constants |
| Strict types | add-strict-types |
| Enums | constants-to-enum |
| DTOs/Value Objects | array-to-dto, value-object-pattern, type-safe-collection |
| Static analysis | configure-phpstan-level9, configure-rector, configure-php-cs-fixer, phpstan-baseline-strategy, phpat-architecture-test, composer-scripts-qa |
| PSR compliance | psr-interface-typehint |
| Migration | migration-assessment |

A/B Test Results (8 representative evals, sonnet baseline)

| Eval | Assertions WITHOUT | Assertions WITH | Output WITHOUT | Output WITH |
|---|---|---|---|---|
| modernize-php-class | 3/3 | 3/3 | 1414B | 1693B |
| constants-to-enum | 4/4 | 4/4 | 1858B | 2014B |
| array-to-dto | 3/3 | 3/3 | 1958B | 1804B |
| configure-php-cs-fixer | 3/3 | 3/3 | 3586B | 3532B |
| psr-interface-typehint | 3/3 | 3/3 | 2096B | 2281B |
| phpstan-baseline-strategy | 3/3 | 3/3 | 1524B | 1976B |
| migration-assessment | 3/4 | 3/4 | 3382B | 3355B |
| sensitive-parameter-attribute | 1/1 | 1/1 | 2662B | 3245B |
| Total | 23/24 (96%) | 23/24 (96%) | | |

Niche pattern analysis (skill-specific knowledge)

| Pattern | WITHOUT | WITH | Advantage |
|---|---|---|---|
| treatPhpDoc + runtime context | N | Y | +SKILL |
| copy-on-write reference fix ($) | N | Y | +SKILL |
| PHP 8.4 HTMLDocument alternative | N | N | gap identified |

Findings: Baseline Claude passes standard PHP assertions well. The skill adds value for niche patterns (PHPStan treatPhpDocTypesAsCertain, copy-on-write security, deprecated alias detection). Identified gap: PHP 8.4 Dom\HTMLDocument not surfaced in eval responses despite being documented in references.

SKILL.md Improvements

  • Added PHP 8.2-8.4 features to expertise: #[Override], typed constants, #[SensitiveParameter], property hooks
  • Added copy-on-write awareness and DOMDocument UTF-8 pitfall to expertise
  • Enhanced migration checklist with 3 new items
  • Word count: 462/500

Test plan

  • Verify evals.json is valid JSON with 21 entries
  • Run bash evals/run-ab-test.sh --indices 0,3,9 for spot-check
  • Verify SKILL.md word count <= 500
  • Verify plugin.json version matches SKILL.md version
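
The first, third, and fourth checks in this plan could be automated along the following lines. This is only a sketch: it assumes `evals.json` is a top-level JSON array (which matches how `run-ab-test.sh` indexes into it), that `plugin.json` stores its version under a `"version"` key, and that "word count" means whitespace-separated tokens. None of the function names come from the PR.

```python
import json


def count_evals(evals_text: str) -> int:
    """Parse the evals file content and return the number of entries.

    Assumes the top level is a JSON array of eval objects.
    """
    evals = json.loads(evals_text)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(evals, list):
        raise ValueError("expected a top-level JSON array of evals")
    return len(evals)


def word_count(markdown_text: str) -> int:
    """Count whitespace-separated tokens (an assumed definition of 'word')."""
    return len(markdown_text.split())


def versions_match(plugin_json_text: str, skill_version: str) -> bool:
    """Compare the manifest version (assumed key: "version") to a SKILL.md version string."""
    return json.loads(plugin_json_text).get("version") == skill_version


if __name__ == "__main__":
    # In-memory stand-ins for the real files.
    sample_evals = json.dumps([{"name": f"eval-{i}", "prompt": "..."} for i in range(21)])
    print(count_evals(sample_evals) == 21)
    print(word_count("Modernize PHP code to 8.x") <= 500)
    print(versions_match('{"version": "1.13.0"}', "1.13.0"))
```

In practice the strings would be read from `evals/evals.json`, `skills/php-modernization/SKILL.md`, and `.claude-plugin/plugin.json`; how the version is extracted from SKILL.md is left open here because the PR does not show that file's layout.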

CybotTM added 4 commits April 1, 2026 08:18

Add self-contained eval prompts with inline code samples and content
assertions for: PHP 8.x features (readonly, enums, match, named args,
nullsafe, #[Override], #[SensitiveParameter], typed constants), strict
types, DTOs, PSR interfaces, PHPStan config, Rector config, PHP-CS-Fixer
config, PHPat architecture tests, baseline strategy, and migration
planning.

Script runs each eval prompt with and without the skill loaded,
collecting output size, line count, and duration metrics. Supports
selective runs via --indices flag.

Add #[Override], typed constants, #[SensitiveParameter], property hooks,
copy-on-write awareness, DOMDocument UTF-8 pitfall, and PHP-CS-Fixer
deprecation detection to expertise areas and migration checklist.

Copilot AI review requested due to automatic review settings April 1, 2026 06:20

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the php-modernization plugin to version 1.13.0, expanding its expertise to include PHP 8.3 features like typed constants and the #[Override] attribute. It also introduces a comprehensive A/B testing suite in evals/ to measure the plugin's performance. Feedback focuses on improving the portability and robustness of the run-ab-test.sh script, specifically by replacing non-POSIX commands like date +%s%N and seq with portable Python or native shell alternatives, and fixing a potential shell injection vulnerability when passing variables to Python.

Comment thread evals/run-ab-test.sh
Comment on lines +45 to +52

    START_B=$(date +%s%N)
    claude --print --model sonnet --max-turns 3 --dangerously-skip-permissions \
        --disable-slash-commands \
        --plugin-dir "$SKILL_DIR/.." \
        --append-system-prompt "$SYSTEM_HINT" \
        -p "$EVAL_PROMPT" > "$RESULTS_DIR/${EVAL_NAME}_with.txt" 2>/dev/null || true
    END_B=$(date +%s%N)
    DURATION_B=$(( (END_B - START_B) / 1000000 ))

high

This block has two issues: 1) date +%s%N is not portable across systems. 2) The --plugin-dir path is incorrect; the plugin manifest (plugin.json) is located in the repository root (under .claude-plugin/), but $SKILL_DIR/.. points to the skills/ directory. Correcting the path to the root ensures the skill is loaded correctly.

Suggested change

    - START_B=$(date +%s%N)
    - claude --print --model sonnet --max-turns 3 --dangerously-skip-permissions \
    -     --disable-slash-commands \
    -     --plugin-dir "$SKILL_DIR/.." \
    -     --append-system-prompt "$SYSTEM_HINT" \
    -     -p "$EVAL_PROMPT" > "$RESULTS_DIR/${EVAL_NAME}_with.txt" 2>/dev/null || true
    - END_B=$(date +%s%N)
    - DURATION_B=$(( (END_B - START_B) / 1000000 ))
    + START_B=$(python3 -c 'import time; print(int(time.time() * 1000))')
    + claude --print --model sonnet --max-turns 3 --dangerously-skip-permissions \
    +     --disable-slash-commands \
    +     --plugin-dir "$(dirname "$SCRIPT_DIR")" \
    +     --append-system-prompt "$SYSTEM_HINT" \
    +     -p "$EVAL_PROMPT" > "$RESULTS_DIR/${EVAL_NAME}_with.txt" 2>/dev/null || true
    + END_B=$(python3 -c 'import time; print(int(time.time() * 1000))')
    + DURATION_B=$(( END_B - START_B ))

References
  1. Prefer POSIX-compliant tools over non-standard command-line flags in shell scripts to ensure portability across different environments.

Comment thread evals/run-ab-test.sh

    if [[ "${1:-}" == "--indices" && -n "${2:-}" ]]; then
        IFS=',' read -ra INDICES <<< "$2"
    else
        INDICES=($(seq 0 $((EVAL_COUNT - 1))))

medium

The seq command is not POSIX-compliant. For better portability across different environments (e.g., BSD vs. GNU), consider using a native shell loop to populate the indices.

Suggested change

    - INDICES=($(seq 0 $((EVAL_COUNT - 1))))
    + INDICES=(); for ((i=0; i<EVAL_COUNT; i++)); do INDICES+=("$i"); done

References
  1. Prefer POSIX-compliant tools over non-standard command-line flags in shell scripts to ensure portability across different environments.

Comment thread evals/run-ab-test.sh
Comment on lines +35 to +41

    START_A=$(date +%s%N)
    claude --print --model sonnet --max-turns 3 --dangerously-skip-permissions \
        --disable-slash-commands \
        --append-system-prompt "$SYSTEM_HINT" \
        -p "$EVAL_PROMPT" > "$RESULTS_DIR/${EVAL_NAME}_without.txt" 2>/dev/null || true
    END_A=$(date +%s%N)
    DURATION_A=$(( (END_A - START_A) / 1000000 ))

medium

The use of date +%s%N is not POSIX-compliant and will fail on non-GNU systems (like macOS), where the %N flag is not supported. Since python3 is already a dependency for this script, you can use it for portable high-precision timing.

Suggested change

    - START_A=$(date +%s%N)
    - claude --print --model sonnet --max-turns 3 --dangerously-skip-permissions \
    -     --disable-slash-commands \
    -     --append-system-prompt "$SYSTEM_HINT" \
    -     -p "$EVAL_PROMPT" > "$RESULTS_DIR/${EVAL_NAME}_without.txt" 2>/dev/null || true
    - END_A=$(date +%s%N)
    - DURATION_A=$(( (END_A - START_A) / 1000000 ))
    + START_A=$(python3 -c 'import time; print(int(time.time() * 1000))')
    + claude --print --model sonnet --max-turns 3 --dangerously-skip-permissions \
    +     --disable-slash-commands \
    +     --append-system-prompt "$SYSTEM_HINT" \
    +     -p "$EVAL_PROMPT" > "$RESULTS_DIR/${EVAL_NAME}_without.txt" 2>/dev/null || true
    + END_A=$(python3 -c 'import time; print(int(time.time() * 1000))')
    + DURATION_A=$(( END_A - START_A ))

References
  1. Prefer POSIX-compliant tools over non-standard command-line flags in shell scripts to ensure portability across different environments.
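
If the timing were moved entirely into Python rather than calling `python3 -c` twice per run, it might look like the sketch below. Note that it swaps in `time.monotonic_ns()` for the `time.time()` used in the suggestion, since a monotonic clock is not affected by wall-clock adjustments. The helper name `timed_run` and the stand-in command are illustrative assumptions, not part of the PR.

```python
import subprocess
import sys
import time


def timed_run(cmd: list[str]) -> tuple[int, int]:
    """Run cmd and return (exit_code, duration_ms) measured with a monotonic clock."""
    start_ns = time.monotonic_ns()
    completed = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    duration_ms = (time.monotonic_ns() - start_ns) // 1_000_000
    return completed.returncode, duration_ms


if __name__ == "__main__":
    # Stand-in command; the real script would invoke the claude CLI here.
    code, ms = timed_run([sys.executable, "-c", "pass"])
    print(f"exit={code} duration_ms={ms}")
```

This measures the same interval as the `START_A`/`END_A` pair, including process start-up time of the command being run.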

Comment thread evals/run-ab-test.sh
Comment on lines +64 to +73

    python3 -c "
    import json
    metrics = {
        'name': '$EVAL_NAME',
        'without': {'bytes': $SIZE_A, 'lines': $LINES_A, 'duration_ms': $DURATION_A},
        'with': {'bytes': $SIZE_B, 'lines': $LINES_B, 'duration_ms': $DURATION_B}
    }
    with open('$RESULTS_DIR/${EVAL_NAME}_metrics.json', 'w') as f:
        json.dump(metrics, f, indent=2)
    "

medium

Injecting shell variables directly into the Python script string is fragile. If a variable like $EVAL_NAME contains a single quote, it will break the Python syntax and cause a crash. It is safer to pass these values as command-line arguments to the Python process.

Suggested change

    - python3 -c "
    - import json
    - metrics = {
    -     'name': '$EVAL_NAME',
    -     'without': {'bytes': $SIZE_A, 'lines': $LINES_A, 'duration_ms': $DURATION_A},
    -     'with': {'bytes': $SIZE_B, 'lines': $LINES_B, 'duration_ms': $DURATION_B}
    - }
    - with open('$RESULTS_DIR/${EVAL_NAME}_metrics.json', 'w') as f:
    -     json.dump(metrics, f, indent=2)
    - "
    + python3 -c "
    + import json, sys
    + name, results_dir = sys.argv[1], sys.argv[2]
    + metrics = {
    +     'name': name,
    +     'without': {'bytes': $SIZE_A, 'lines': $LINES_A, 'duration_ms': $DURATION_A},
    +     'with': {'bytes': $SIZE_B, 'lines': $LINES_B, 'duration_ms': $DURATION_B}
    + }
    + with open(f'{results_dir}/{name}_metrics.json', 'w') as f:
    +     json.dump(metrics, f, indent=2)
    + " "$EVAL_NAME" "$RESULTS_DIR"

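The suggestion above still interpolates the six numeric variables into the Python source. A standalone helper that receives everything through `sys.argv` would avoid interpolation altogether. The sketch below is one possible shape; the argument order and the function names are hypothetical, and only the JSON layout is taken from the script.

```python
import json
import sys


def build_metrics(name, size_a, lines_a, ms_a, size_b, lines_b, ms_b):
    """Assemble the per-eval metrics dict; int() rejects non-numeric input."""
    return {
        "name": name,
        "without": {"bytes": int(size_a), "lines": int(lines_a), "duration_ms": int(ms_a)},
        "with": {"bytes": int(size_b), "lines": int(lines_b), "duration_ms": int(ms_b)},
    }


def main(argv):
    # Hypothetical argument order: results_dir, name, then the six numbers.
    results_dir, name, *numbers = argv
    with open(f"{results_dir}/{name}_metrics.json", "w") as f:
        json.dump(build_metrics(name, *numbers), f, indent=2)


if __name__ == "__main__":
    if len(sys.argv) == 9:
        main(sys.argv[1:])
    else:
        # Demo: a name containing a single quote is serialized without breaking anything.
        print(json.dumps(build_metrics("it's-an-eval", "1414", "40", "900", "1693", "52", "1100")))
```

Because the values arrive as data rather than as source text, a quote or other special character in an eval name cannot change what the Python code does.
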
Copilot AI left a comment

Pull request overview

Expands the php-modernization skill’s evaluation suite and supporting docs/scripts to better cover PHP 8.x modernization patterns, static analysis tooling, and migration guidance.

Changes:

  • Expanded evals/evals.json from a minimal set to 21 self-contained evals spanning feature usage, tooling config, and migration topics.
  • Added evals/run-ab-test.sh to run baseline vs. skill A/B comparisons and store per-eval output/metrics.
  • Updated skills/php-modernization/SKILL.md, .claude-plugin/plugin.json, and .gitignore to reflect the new version and generated artifacts.

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| skills/php-modernization/SKILL.md | Updates version and enriches expertise/checklist items for newer PHP features and pitfalls. |
| evals/run-ab-test.sh | Adds a CLI script to run A/B eval comparisons and write results/metrics files. |
| evals/evals.json | Adds a broader, 21-case eval suite with prompt content and assertion checks. |
| .gitignore | Ignores generated A/B results directory. |
| .claude-plugin/plugin.json | Bumps plugin version to 1.13.0. |

Comment thread evals/evals.json
Comment on lines +17 to +19

    "type": "content_contains",
    "value": "public function __construct(",
    "description": "Uses constructor property promotion"

Copilot AI Apr 1, 2026

In modernize-php-class, the assertion content_contains: "public function __construct(" doesn’t actually verify constructor property promotion (the original code already contains a constructor). Tighten this to assert a visibility keyword on the constructor parameters (e.g., __construct(public ...) or a regex that matches promoted properties); otherwise the eval can pass without using the intended PHP 8.0+ feature.

Suggested change

    - "type": "content_contains",
    - "value": "public function __construct(",
    - "description": "Uses constructor property promotion"
    + "type": "content_regex",
    + "value": "__construct\\s*\\((public|protected|private)\\s",
    + "description": "Uses constructor property promotion (visibility keyword on constructor parameters)"

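A quick way to sanity-check the suggested pattern is to run it against small PHP snippets. The sketch below uses Python's `re` module and assumes the eval harness uses a compatible regex engine, which the PR does not state. The PHP strings are made-up samples, and `MULTILINE_OK` is a hypothetical variant rather than something proposed in the review.

```python
import re

# Pattern from the suggested change, as it reads once JSON unescapes the backslashes.
SUGGESTED = re.compile(r"__construct\s*\((public|protected|private)\s")
# Hypothetical variant: also allow whitespace or newlines after the opening parenthesis.
MULTILINE_OK = re.compile(r"__construct\s*\(\s*(?:public|protected|private)\s")

LEGACY = "public function __construct(string $name) { $this->name = $name; }"
PROMOTED_INLINE = "public function __construct(private string $name) {}"
PROMOTED_WRAPPED = "public function __construct(\n    private string $name,\n) {}"

for label, php in [("legacy", LEGACY), ("inline", PROMOTED_INLINE), ("wrapped", PROMOTED_WRAPPED)]:
    print(label, bool(SUGGESTED.search(php)), bool(MULTILINE_OK.search(php)))
# legacy False False
# inline True True
# wrapped False True
```

The last row shows that the suggested pattern misses a constructor whose parameters start on a new line, a common formatting for promoted properties, so allowing `\s*` after the parenthesis may be worth considering. Both patterns only look at the first parameter.
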
Comment thread evals/evals.json
Comment on lines +184 to +193

    {
        "type": "content_regex",
        "value": "public (readonly )?function __construct\\(",
        "description": "Uses constructor property promotion"
    },
    {
        "type": "content_regex",
        "value": "(public|private|protected) readonly",
        "description": "Applies readonly modifier to promoted properties"
    }

Copilot AI Apr 1, 2026

The readonly-properties eval uses the regex public (readonly )?function __construct\(, but PHP doesn’t support readonly function, and this pattern also doesn’t ensure constructor property promotion occurred. Consider asserting promotion directly (e.g., visibility keywords in the parameter list) and keep readonly checks focused on readonly class / public readonly properties.
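
The point can be illustrated by running both assertions from this eval against a legacy class and a modernized one. This is a sketch using Python's `re` module, assuming the harness uses a compatible regex engine; the two PHP snippets are invented samples.

```python
import re

# The assertion currently described as "Uses constructor property promotion".
LOOSE = re.compile(r"public (readonly )?function __construct\(")
# The second assertion from the same eval, which does target readonly properties.
READONLY_PROP = re.compile(r"(public|private|protected) readonly")

LEGACY = """class User {
    private string $name;
    public function __construct(string $name) { $this->name = $name; }
}"""

MODERN = """final class User {
    public function __construct(public readonly string $name) {}
}"""

print(bool(LOOSE.search(LEGACY)), bool(READONLY_PROP.search(LEGACY)))  # True False
print(bool(LOOSE.search(MODERN)), bool(READONLY_PROP.search(MODERN)))  # True True
```

The first pattern matches both classes, so it does not distinguish promoted from non-promoted constructors; only the second assertion separates the two samples.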

Comment thread evals/run-ab-test.sh
Comment on lines +33 to +51

    # WITHOUT skill (baseline)
    echo " Running WITHOUT skill..."
    START_A=$(date +%s%N)
    claude --print --model sonnet --max-turns 3 --dangerously-skip-permissions \
        --disable-slash-commands \
        --append-system-prompt "$SYSTEM_HINT" \
        -p "$EVAL_PROMPT" > "$RESULTS_DIR/${EVAL_NAME}_without.txt" 2>/dev/null || true
    END_A=$(date +%s%N)
    DURATION_A=$(( (END_A - START_A) / 1000000 ))

    # WITH skill
    echo " Running WITH skill..."
    START_B=$(date +%s%N)
    claude --print --model sonnet --max-turns 3 --dangerously-skip-permissions \
        --disable-slash-commands \
        --plugin-dir "$SKILL_DIR/.." \
        --append-system-prompt "$SYSTEM_HINT" \
        -p "$EVAL_PROMPT" > "$RESULTS_DIR/${EVAL_NAME}_with.txt" 2>/dev/null || true
    END_B=$(date +%s%N)

Copilot AI Apr 1, 2026

The script intentionally ignores all claude failures (2>/dev/null || true). That makes A/B results hard to trust because an invocation can fail silently and still produce “metrics” from an empty/partial output file. Consider capturing stderr to a per-eval log, recording the exit status in the metrics JSON, and failing the run (or at least printing a clear warning) when a call exits non-zero.
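
One way to follow this advice, sketched in Python since the script already depends on `python3`: save stdout and stderr to separate files, record the exit status next to the size metrics, and warn on failure. The function name, file naming, and the use of newline count for "lines" (mirroring `wc -l`) are assumptions, not taken from the PR.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_and_record(cmd, out_path: Path, err_path: Path) -> dict:
    """Run cmd, save stdout/stderr to files, and return status info for the metrics."""
    completed = subprocess.run(cmd, capture_output=True, text=True)
    out_path.write_text(completed.stdout)
    err_path.write_text(completed.stderr)
    if completed.returncode != 0:
        # Make failures visible instead of silently producing empty-output metrics.
        print(f"WARNING: {cmd[0]} exited with {completed.returncode}; see {err_path}",
              file=sys.stderr)
    return {
        "exit_code": completed.returncode,
        "bytes": len(completed.stdout.encode()),
        "lines": completed.stdout.count("\n"),
    }


if __name__ == "__main__":
    # Stand-in for a failing claude invocation.
    failing = [sys.executable, "-c", "import sys; sys.stderr.write('boom'); sys.exit(3)"]
    with tempfile.TemporaryDirectory() as tmp:
        status = run_and_record(failing, Path(tmp) / "demo_with.txt", Path(tmp) / "demo_with.err")
        print(status)
```

With the exit code stored in the metrics JSON, a later analysis step can exclude or flag runs where either side of the A/B pair failed.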

Comment thread evals/run-ab-test.sh
Comment on lines +18 to +30

    # Parse optional --indices flag for selective runs
    if [[ "${1:-}" == "--indices" && -n "${2:-}" ]]; then
        IFS=',' read -ra INDICES <<< "$2"
    else
        INDICES=($(seq 0 $((EVAL_COUNT - 1))))
    fi

    echo "Running A/B tests for ${#INDICES[@]} of $EVAL_COUNT evals..."

    for i in "${INDICES[@]}"; do
        EVAL_NAME=$(python3 -c "import json; print(json.load(open('$EVALS_FILE'))[$i]['name'])")
        EVAL_PROMPT=$(python3 -c "import json,sys; sys.stdout.write(json.load(open('$EVALS_FILE'))[$i]['prompt'])")

Copilot AI Apr 1, 2026

--indices values are interpolated directly into the Python one-liner (...json.load(... )[$i]...). Since i comes from user input, a malicious value (or even accidental whitespace) can break the script or inject Python code. Validate indices as ^[0-9]+$ (and range-check against EVAL_COUNT) before use, or pass the index via sys.argv and convert with int() inside Python.
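
The validation described here could be sketched as follows: each token must consist of ASCII digits only and must fall inside the range of available evals. The function name `parse_indices` is hypothetical; in the script, the raw string would be passed in via `sys.argv` rather than interpolated into Python source.

```python
import re


def parse_indices(raw: str, eval_count: int) -> list[int]:
    """Turn a comma-separated --indices value into validated integer indices."""
    indices = []
    for token in raw.split(","):
        # Digits only: rejects whitespace, signs, and anything resembling code.
        if not re.fullmatch(r"[0-9]+", token):
            raise ValueError(f"invalid index {token!r}: expected digits only")
        index = int(token)
        if index >= eval_count:
            raise ValueError(f"index {index} out of range (0..{eval_count - 1})")
        indices.append(index)
    return indices


if __name__ == "__main__":
    print(parse_indices("0,3,9", 21))  # [0, 3, 9]
    for bad in ("0, 3", "0];import os", "21"):
        try:
            parse_indices(bad, 21)
        except ValueError as exc:
            print("rejected:", exc)
```

Rejecting malformed input up front also gives a clearer error than the `IndexError` or `SyntaxError` that the interpolated one-liner would otherwise produce.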

@CybotTM CybotTM merged commit c80ebef into main Apr 1, 2026
9 checks passed
@CybotTM CybotTM deleted the feature/evals-and-improvements branch April 1, 2026 06:31