ECNU3D commented on Jun 5, 2025

This PR extends the original simple-evals repository with the following key improvements. The full extension of simple-evals to agentic eval generation can be found here: https://github.com/ECNU3D/agentic-simple-evals

Additional Model Support

  • Gemini Models: Added support for Google's Gemini models (GeminiSampler) with both API key and Vertex AI authentication, including support for Gemini grounding capabilities
  • Claude on Vertex AI: Implemented ClaudeVertexCompletionSampler for running Claude models through Google Cloud Vertex AI instead of direct Anthropic API
  • Llama Models on Vertex AI: Added examples to show how to integrate with OpenAI API compatible models
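The OpenAI-compatible integration boils down to packing chat messages and request payloads in the shape any `/chat/completions`-style endpoint accepts. A minimal sketch of such a sampler wrapper is below; the class name, constructor parameters, and method names are illustrative, not the repo's actual API:

```python
# Hypothetical sketch of wrapping an OpenAI API compatible endpoint
# (e.g. a Llama model served on Vertex AI) behind a simple-evals-style
# sampler interface. All names here are illustrative assumptions.
class OpenAICompatibleSampler:
    def __init__(self, base_url, model, api_key_env="OPENAI_API_KEY",
                 temperature=0.0):
        self.base_url = base_url        # endpoint serving the model
        self.model = model              # model identifier on that endpoint
        self.api_key_env = api_key_env  # env var holding the credential
        self.temperature = temperature

    def _pack_message(self, role, content):
        # simple-evals-style chat messages are {"role", "content"} dicts
        return {"role": role, "content": content}

    def build_request(self, message_list):
        # Payload shape accepted by OpenAI-compatible chat endpoints
        return {
            "model": self.model,
            "messages": message_list,
            "temperature": self.temperature,
        }
```

The actual HTTP call can then be made with any OpenAI-compatible client pointed at `base_url`.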

Windows Compatibility

  • Windows HumanEval Fix: Added human_eval_windows_patch.py to resolve Windows compatibility issues with the HumanEval benchmark by replacing Unix-specific timeout mechanisms with Windows-compatible threading-based solutions
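The core idea of the patch is to replace Unix-only `signal.alarm`-based timeouts (unavailable on Windows) with a thread that is abandoned if it overruns. A minimal sketch of that pattern, assuming a generic `run_with_timeout` helper rather than the patch's exact function names:

```python
import threading

def run_with_timeout(fn, timeout_s, *args, **kwargs):
    """Run fn in a daemon thread; raise TimeoutError if it doesn't finish.

    Sketch of the threading-based replacement for signal-based timeouts,
    which only work on Unix. Names are illustrative, not the patch's API.
    """
    result = {}

    def target():
        try:
            result["value"] = fn(*args, **kwargs)
        except Exception as exc:  # propagate errors to the caller's thread
            result["error"] = exc

    t = threading.Thread(target=target, daemon=True)
    t.start()
    t.join(timeout_s)
    if t.is_alive():
        # The daemon thread is abandoned; it dies with the process.
        raise TimeoutError(f"execution exceeded {timeout_s}s")
    if "error" in result:
        raise result["error"]
    return result["value"]
```

Note the usual caveat of this approach: the timed-out thread cannot be forcibly killed, only abandoned, which is acceptable for short sandboxed HumanEval checks.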

Infrastructure Improvements

  • Checkpointing System: Implemented robust checkpointing functionality across all evaluations to support resuming interrupted evaluation runs, with checkpoint loading and saving capabilities
  • Batch Processing: Added configurable batch processing to improve memory management and allow for better control over evaluation execution
  • Enhanced Error Handling: Improved exception handling and retry mechanisms for API calls
  • Progress Tracking: Better progress reporting and logging throughout evaluation processes
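Checkpointing and batch processing compose naturally: evaluate in fixed-size batches and persist accumulated results after each one, so a rerun skips everything already scored. A minimal sketch under those assumptions (function names, checkpoint format, and JSON storage are illustrative, not the repo's actual implementation):

```python
import json
import os

def run_with_checkpoints(samples, eval_fn, ckpt_path, batch_size=8):
    """Evaluate samples in batches, checkpointing results after each batch.

    Illustrative sketch: a JSON list of per-sample results is written after
    every batch, and an interrupted run resumes from its length.
    """
    results = []
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            results = json.load(f)  # resume: skip already-evaluated samples
    start = len(results)
    for i in range(start, len(samples), batch_size):
        batch = samples[i:i + batch_size]
        results.extend(eval_fn(s) for s in batch)
        with open(ckpt_path, "w") as f:
            json.dump(results, f)   # checkpoint after every batch
    return results
```

Batch size here doubles as the checkpoint granularity: smaller batches lose less work on interruption at the cost of more frequent writes.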

Configuration Enhancements

  • Environment Variable Handling: Improved API key and authentication management with fallback mechanisms
  • Configurable Parameters: Enhanced parameterization for batch sizes, timeouts, and other evaluation settings
  • Flexible Authentication: Support for multiple authentication methods including API keys, Vertex AI, and Application Default Credentials
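The fallback order described above can be sketched as a small resolver: prefer Vertex AI (which uses Application Default Credentials and needs no key), then an explicitly passed key, then an environment variable. The helper name, return shape, and default env var below are assumptions for illustration:

```python
import os

def resolve_auth(api_key=None, use_vertex=False, env_var="GEMINI_API_KEY"):
    """Resolve credentials with fallbacks: Vertex AI ADC, explicit key, env var.

    Hypothetical helper illustrating the fallback order; not the repo's
    actual function or configuration names.
    """
    if use_vertex:
        # Vertex AI authenticates via Application Default Credentials
        return ("vertex_adc", None)
    key = api_key or os.environ.get(env_var)
    if key:
        return ("api_key", key)
    raise RuntimeError(
        f"No credentials found: pass api_key, set {env_var}, or enable Vertex AI"
    )
```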
