
Add basic evals for AI features #128

Merged
mreichhoff merged 16 commits into main from eval-setup
Feb 4, 2026


@mreichhoff
Owner

I'm looking to improve the prompts and upgrade to a newer version of Gemini. This change is the first step in that direction.

This commit includes a fix for the English-to-Chinese prompt, which failed when the English input contained the word 'chinese'.

The evals are also integrated with GitHub workflows, so it's quick to check whether a change is an improvement.

I intend to expand the eval datasets over time, especially with more complicated and interesting grammar concepts.
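
For readers new to this kind of check, here is a minimal sketch of what a regression case for the 'chinese'-in-input failure might look like, paired with a check in the spirit of the `chineseTextPresent` evaluator reported in the results on this PR. It is not the project's code: the Han-character ranges, the `CASES` list, and the `fake_translate` stand-in for the model call are all assumptions, chosen so the example runs offline.

```python
import re

# Assumption: "Chinese text" means at least one character from the CJK
# Unified Ideographs block or Extension A. The real evaluator may differ.
HAN_RE = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]")


def chinese_text_present(output: str) -> bool:
    """Return True if the output contains at least one Han character."""
    return bool(HAN_RE.search(output))


# Hypothetical dataset: the second case has the word "Chinese" in the input,
# which is the situation the prompt fix is meant to cover.
CASES = [
    {"input": "I like tea."},
    {"input": "I am learning Chinese."},
]


def fake_translate(english: str) -> str:
    """Stand-in for the model call, so the sketch runs without an API key."""
    canned = {
        "I like tea.": "我喜欢茶。",
        "I am learning Chinese.": "我在学中文。",
    }
    return canned[english]


def run_cases(translate) -> list[bool]:
    """Run every case through `translate` and apply the check to each output."""
    return [chinese_text_present(translate(case["input"])) for case in CASES]
```

Calling `run_cases(fake_translate)` yields one boolean per case. With a real model call passed in place of `fake_translate`, a `False` for the second case would flag the regression.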

@github-actions

github-actions bot commented Feb 1, 2026

🧪 AI Evaluation Results

collocation

Evaluator Pass Rate
0 ✅ 4/4 (100%)
1 ✅ 4/4 (100%)
2 ✅ 4/4 (100%)

word context

Evaluator Pass Rate
0 ✅ 5/5 (100%)

📦 Download full results

@github-actions

github-actions bot commented Feb 1, 2026

🧪 AI Evaluation Results

collocation

Evaluator Pass Rate
chineseTextPresent ✅ 4/4 (100%)
englishTranslationPresent ✅ 4/4 (100%)
outputStructureValid ✅ 4/4 (100%)

explain chinese

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
validPinyinFormat ✅ 5/5 (100%)
outputStructureValid ✅ 5/5 (100%)

generate sentences

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
validPinyinFormat ✅ 5/5 (100%)
outputStructureValid ✅ 5/5 (100%)

word context

Evaluator Pass Rate
outputStructureValid ✅ 5/5 (100%)

📦 Download full results
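
The `outputStructureValid` rows above suggest a deterministic check on the shape of the model output. The sketch below shows one way such a check could be written. The field names in `REQUIRED_FIELDS` are hypothetical, since the project's actual schema is not shown in this conversation, and the function is an illustration rather than the evaluator used here.

```python
import json

# Hypothetical schema: field names and types are assumptions for illustration.
REQUIRED_FIELDS = {"chinese": str, "pinyin": str, "english": str}


def output_structure_valid(raw: str) -> bool:
    """Return True if `raw` is a JSON object with every required non-empty field."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False  # not parseable as JSON at all
    if not isinstance(data, dict):
        return False  # valid JSON, but not an object
    return all(
        isinstance(data.get(name), expected) and data[name].strip() != ""
        for name, expected in REQUIRED_FIELDS.items()
    )
```

A check like this is cheap to run on every case, which fits the "quick checks" goal in the description. Quality judgments such as grammar explanations would need a different kind of evaluator.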

@github-actions

github-actions bot commented Feb 1, 2026

🧪 AI Evaluation Results

collocation

Evaluator Pass Rate
chineseTextPresent ✅ 4/4 (100%)
englishTranslationPresent ✅ 4/4 (100%)
outputStructureValid ✅ 4/4 (100%)

explain chinese

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
validPinyinFormat ✅ 5/5 (100%)
grammarExplanationQuality ✅ 5/5 (100%)
outputStructureValid ✅ 5/5 (100%)

generate sentences

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
validPinyinFormat ✅ 5/5 (100%)
sentenceGenerationQuality ✅ 5/5 (100%)
outputStructureValid ✅ 5/5 (100%)

word context

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
englishTranslationPresent ✅ 5/5 (100%)
outputStructureValid ✅ 5/5 (100%)

📦 Download full results

@github-actions

github-actions bot commented Feb 1, 2026

🧪 AI Evaluation Results

collocation

Evaluator Pass Rate
chineseTextPresent ✅ 4/4 (100%)
englishTranslationPresent ✅ 4/4 (100%)
outputStructureValid ✅ 4/4 (100%)

explain chinese

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
validPinyinFormat ✅ 5/5 (100%)
grammarExplanationQuality ✅ 5/5 (100%)
outputStructureValid ✅ 5/5 (100%)

explain english

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
validPinyinFormat ✅ 5/5 (100%)
grammarExplanationQuality ✅ 5/5 (100%)
outputStructureValid ✅ 5/5 (100%)

generate sentences

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
validPinyinFormat ✅ 5/5 (100%)
sentenceGenerationQuality ✅ 5/5 (100%)
outputStructureValid ✅ 5/5 (100%)

word context

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
englishTranslationPresent ✅ 5/5 (100%)
outputStructureValid ✅ 5/5 (100%)

📦 Download full results

@github-actions

github-actions bot commented Feb 1, 2026

🧪 AI Evaluation Results

collocation

Evaluator Pass Rate
chineseTextPresent ✅ 4/4 (100%)
englishTranslationPresent ✅ 4/4 (100%)
outputStructureValid ✅ 4/4 (100%)

explain chinese

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
validPinyinFormat ✅ 5/5 (100%)
grammarExplanationQuality ✅ 5/5 (100%)
outputStructureValid ✅ 5/5 (100%)

explain english

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
validPinyinFormat ✅ 5/5 (100%)
grammarExplanationQuality 🟡 4/5 (80%)
outputStructureValid ✅ 5/5 (100%)

generate sentences

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
validPinyinFormat ✅ 5/5 (100%)
sentenceGenerationQuality ✅ 5/5 (100%)
outputStructureValid ✅ 5/5 (100%)

word context

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
englishTranslationPresent ✅ 5/5 (100%)
outputStructureValid ✅ 5/5 (100%)

📦 Download full results
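
The `grammarExplanationQuality 🟡 4/5 (80%)` row in the run above shows that a pass-rate cell combines a status icon, a count, and a rounded percentage. A rough sketch of how such a cell could be produced follows. The 50% cutoff for 🟡 and the ❌ icon are assumptions, as only ✅ and 🟡 appear in this conversation and the workflow's real thresholds are not stated.

```python
def format_pass_rate(passed: int, total: int) -> str:
    """Format a pass-rate cell such as '✅ 4/4 (100%)'."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    percent = round(100 * passed / total)
    if passed == total:
        icon = "✅"
    elif percent >= 50:  # assumed cutoff; not taken from the workflow
        icon = "🟡"
    else:
        icon = "❌"  # assumed icon; it does not appear in these results
    return f"{icon} {passed}/{total} ({percent}%)"


def summarize(results: list[bool]) -> str:
    """Collapse per-case pass/fail results into one formatted cell."""
    return format_pass_rate(sum(results), len(results))
```

For example, `summarize([True, True, True, True, False])` gives the same text as the 🟡 row above, and `summarize([True] * 4)` matches the collocation rows.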

@github-actions

github-actions bot commented Feb 1, 2026

🧪 AI Evaluation Results

collocation

Evaluator Pass Rate
chineseTextPresent ✅ 4/4 (100%)
englishTranslationPresent ✅ 4/4 (100%)
outputStructureValid ✅ 4/4 (100%)

explain chinese

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
validPinyinFormat ✅ 5/5 (100%)
grammarExplanationQuality ✅ 5/5 (100%)
outputStructureValid ✅ 5/5 (100%)

explain english

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
validPinyinFormat ✅ 5/5 (100%)
grammarExplanationQuality ✅ 5/5 (100%)
outputStructureValid ✅ 5/5 (100%)

generate sentences

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
validPinyinFormat ✅ 5/5 (100%)
sentenceGenerationQuality ✅ 5/5 (100%)
outputStructureValid ✅ 5/5 (100%)

word context

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
englishTranslationPresent ✅ 5/5 (100%)
outputStructureValid ✅ 5/5 (100%)

📦 Download full results

@github-actions

github-actions bot commented Feb 1, 2026

🧪 AI Evaluation Results

collocation

Evaluator Pass Rate
chineseTextPresent ✅ 4/4 (100%)
englishTranslationPresent ✅ 4/4 (100%)
outputStructureValid ✅ 4/4 (100%)

explain chinese

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
validPinyinFormat ✅ 5/5 (100%)
grammarExplanationQuality ✅ 5/5 (100%)
outputStructureValid ✅ 5/5 (100%)

explain english

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
validPinyinFormat ✅ 5/5 (100%)
grammarExplanationQuality ✅ 5/5 (100%)
outputStructureValid ✅ 5/5 (100%)

generate sentences

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
validPinyinFormat ✅ 5/5 (100%)
sentenceGenerationQuality ✅ 5/5 (100%)
outputStructureValid ✅ 5/5 (100%)

word context

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
englishTranslationPresent ✅ 5/5 (100%)
outputStructureValid ✅ 5/5 (100%)

📦 Download full results

@github-actions

github-actions bot commented Feb 1, 2026

🧪 AI Evaluation Results

collocation

Evaluator Pass Rate
chineseTextPresent ✅ 4/4 (100%)
englishTranslationPresent ✅ 4/4 (100%)
outputStructureValid ✅ 4/4 (100%)

explain chinese

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
validPinyinFormat ✅ 5/5 (100%)
grammarExplanationQuality ✅ 5/5 (100%)
outputStructureValid ✅ 5/5 (100%)

explain english

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
validPinyinFormat ✅ 5/5 (100%)
grammarExplanationQuality ✅ 5/5 (100%)
outputStructureValid ✅ 5/5 (100%)

generate sentences

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
validPinyinFormat ✅ 5/5 (100%)
sentenceGenerationQuality ✅ 5/5 (100%)
outputStructureValid ✅ 5/5 (100%)

word context

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
englishTranslationPresent ✅ 5/5 (100%)
outputStructureValid ✅ 5/5 (100%)

📦 Download full results

@github-actions

github-actions bot commented Feb 2, 2026

🧪 AI Evaluation Results

collocation

Evaluator Pass Rate
chineseTextPresent ✅ 4/4 (100%)
englishTranslationPresent ✅ 4/4 (100%)
outputStructureValid ✅ 4/4 (100%)

explain chinese

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
validPinyinFormat ✅ 5/5 (100%)
grammarExplanationQuality ✅ 5/5 (100%)
outputStructureValid ✅ 5/5 (100%)

explain english

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
validPinyinFormat ✅ 5/5 (100%)
grammarExplanationQuality ✅ 5/5 (100%)
outputStructureValid ✅ 5/5 (100%)

generate sentences

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
validPinyinFormat ✅ 5/5 (100%)
sentenceGenerationQuality ✅ 5/5 (100%)
outputStructureValid ✅ 5/5 (100%)

word context

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
englishTranslationPresent ✅ 5/5 (100%)
outputStructureValid ✅ 5/5 (100%)

📦 Download full results

@github-actions

github-actions bot commented Feb 2, 2026

🧪 AI Evaluation Results

collocation

Evaluator Pass Rate
chineseTextPresent ✅ 4/4 (100%)
englishTranslationPresent ✅ 4/4 (100%)
outputStructureValid ✅ 4/4 (100%)

explain chinese

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
validPinyinFormat ✅ 5/5 (100%)
grammarExplanationQuality ✅ 5/5 (100%)
outputStructureValid ✅ 5/5 (100%)

explain english

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
validPinyinFormat ✅ 5/5 (100%)
grammarExplanationQuality ✅ 5/5 (100%)
outputStructureValid ✅ 5/5 (100%)

generate sentences

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
validPinyinFormat ✅ 5/5 (100%)
sentenceGenerationQuality ✅ 5/5 (100%)
outputStructureValid ✅ 5/5 (100%)

word context

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
englishTranslationPresent ✅ 5/5 (100%)
outputStructureValid ✅ 5/5 (100%)

📦 Download full results

@mreichhoff changed the title from "Add eval framework for AI features" to "Add basic evals for AI features" on Feb 2, 2026
@github-actions

github-actions bot commented Feb 3, 2026

🧪 AI Evaluation Results

collocation

Evaluator Pass Rate
chineseTextPresent ✅ 4/4 (100%)
englishTranslationPresent ✅ 4/4 (100%)
outputStructureValid ✅ 4/4 (100%)

explain chinese

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
validPinyinFormat ✅ 5/5 (100%)
grammarExplanationQuality ✅ 5/5 (100%)
outputStructureValid ✅ 5/5 (100%)

explain english

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
validPinyinFormat ✅ 5/5 (100%)
grammarExplanationQuality ✅ 5/5 (100%)
outputStructureValid ✅ 5/5 (100%)

generate sentences

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
validPinyinFormat ✅ 5/5 (100%)
sentenceGenerationQuality ✅ 5/5 (100%)
outputStructureValid ✅ 5/5 (100%)

word context

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
englishTranslationPresent ✅ 5/5 (100%)
outputStructureValid ✅ 5/5 (100%)

📦 Download full results

@github-actions

github-actions bot commented Feb 4, 2026

🧪 AI Evaluation Results

collocation

Evaluator Pass Rate
chineseTextPresent ✅ 4/4 (100%)
englishTranslationPresent ✅ 4/4 (100%)
outputStructureValid ✅ 4/4 (100%)

explain chinese

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
validPinyinFormat ✅ 5/5 (100%)
grammarExplanationQuality ✅ 5/5 (100%)
outputStructureValid ✅ 5/5 (100%)

explain english

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
validPinyinFormat ✅ 5/5 (100%)
grammarExplanationQuality ✅ 5/5 (100%)
outputStructureValid ✅ 5/5 (100%)

generate sentences

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
validPinyinFormat ✅ 5/5 (100%)
sentenceGenerationQuality ✅ 5/5 (100%)
outputStructureValid ✅ 5/5 (100%)

word context

Evaluator Pass Rate
chineseTextPresent ✅ 5/5 (100%)
englishTranslationPresent ✅ 5/5 (100%)
outputStructureValid ✅ 5/5 (100%)

📦 Download full results

@mreichhoff merged commit 08a33fe into main on Feb 4, 2026
1 check passed
@mreichhoff deleted the eval-setup branch on February 4, 2026, 02:04