This skill provides tools to add structured evaluation results to Hugging Face model cards.

# Dependencies
- huggingface_hub>=0.26.0
- markdown-it-py>=3.0.0
- python-dotenv>=1.2.1
- pyyaml>=6.0.3
- requests>=2.32.5
- inspect-ai>=0.3.0
- re (built-in)

# IMPORTANT: Using This Skill

**Always run `--help` to get guidance on table extraction and YAML generation:**
```bash
python scripts/evaluation_manager.py --help
python scripts/evaluation_manager.py inspect-tables --help
python scripts/evaluation_manager.py extract-readme --help
```

The `--help` output includes workflow guidance for converting tables to YAML.

# Core Capabilities

## 1. Inspect and Extract Evaluation Tables from README
- **Inspect Tables**: Use `inspect-tables` to see all tables in a README with their structure, columns, and suggested extraction commands
- **Parse Markdown Tables**: Accurate parsing using markdown-it-py (ignores code blocks and examples)
- **Table Selection**: Use `--table N` to extract from a specific table (required when multiple tables exist)
- **Format Detection**: Recognize common formats (benchmarks as rows, columns, or comparison tables with multiple models)
- **Column Matching**: Automatically identify model columns, with `--model-name-override` for comparison tables
- **YAML Generation**: Convert selected table to model-index YAML format

## 2. Import from Artificial Analysis
- **API Integration**: Fetch benchmark scores directly from Artificial Analysis
The skill includes Python scripts in `scripts/` to perform operations.

### Method 1: Extract from README

Extract evaluation tables from a model's existing README and convert to model-index YAML.

#### Recommended Workflow: Inspect Tables First

**Step 1: Inspect the tables** to see structure and get the extraction command:
```bash
python scripts/evaluation_manager.py inspect-tables --repo-id "allenai/OLMo-7B"
```

This outputs:
```
======================================================================
Tables found in README for: allenai/OLMo-7B
======================================================================

## Table 3
Format: comparison
Rows: 14

Columns (6):
[1] [Llama 7B](...)
[2] [Llama 2 7B](...)
[5] **OLMo 7B** (ours) ~ partial match

Sample rows (first column):
- arc_challenge
- arc_easy
- boolq

⚠ No exact match. Best candidate: **OLMo 7B** (ours)

Suggested command:
python scripts/evaluation_manager.py extract-readme \
--repo-id "allenai/OLMo-7B" \
--table 3 \
--model-name-override "**OLMo 7B** (ours)" \
--dry-run
```

**Step 2: Copy and run the suggested command** (with `--dry-run` to preview YAML):
```bash
python scripts/evaluation_manager.py extract-readme \
--repo-id "allenai/OLMo-7B" \
--table 3 \
--model-name-override "**OLMo 7B** (ours)" \
--dry-run
```

**Step 3: Verify the YAML output** - check benchmark names and values match the README
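For reference, model-index metadata generally has this shape (an illustrative fragment only — the task type, benchmark name, value, and URL below are placeholders, not real output for this model):

```yaml
model-index:
- name: OLMo 7B
  results:
  - task:
      type: text-generation
    dataset:
      name: arc_challenge
      type: arc_challenge
    metrics:
    - name: arc_challenge
      type: accuracy
      value: 48.5
    source:
      name: Model README
      url: https://huggingface.co/allenai/OLMo-7B
```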

**Step 4: Apply changes** - remove `--dry-run` and optionally add `--create-pr`:
```bash
python scripts/evaluation_manager.py extract-readme \
--repo-id "allenai/OLMo-7B" \
--table 3 \
--model-name-override "**OLMo 7B** (ours)" \
--create-pr
```

#### Key Flags

- `--table N`: **Required when multiple tables exist.** Specifies which table to extract (1-indexed, matches `inspect-tables` output)
- `--model-name-override`: Column header text for comparison tables (e.g., `"**OLMo 7B** (ours)"`)
- `--dry-run`: Preview YAML without making changes
- `--create-pr`: Create a pull request instead of direct push

#### Supported Table Formats

**Format 1: Benchmarks as Rows**
Expand Down Expand Up @@ -124,80 +180,14 @@ In this format, the script will:
- Find the row matching your model name (handles bold/markdown formatting)
- Extract all benchmark scores from that specific row only

#### Validation Checklist

Before applying changes (removing `--dry-run`), verify:
- [ ] Correct table selected (use `inspect-tables` to confirm)
- [ ] Correct column extracted (check `--model-name-override` if comparison table)
- [ ] Benchmark names match the README
- [ ] Numeric values match the README exactly
- [ ] No scores from other models included
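One quick way to work through this checklist is to parse the `--dry-run` YAML and print each metric for side-by-side comparison with the README table. A sketch (the sample YAML here is a placeholder, not real output):

```python
import yaml  # pyyaml, already a dependency of this skill

def list_metrics(model_index_yaml: str):
    """Yield (benchmark, value) pairs from model-index YAML for manual review."""
    data = yaml.safe_load(model_index_yaml)
    for model in data["model-index"]:
        for result in model["results"]:
            for metric in result["metrics"]:
                yield metric.get("name", metric["type"]), metric["value"]

# Placeholder --dry-run output; substitute the YAML the script printed.
sample = """
model-index:
- name: OLMo 7B
  results:
  - task:
      type: text-generation
    dataset:
      name: arc_challenge
      type: arc_challenge
    metrics:
    - name: arc_challenge
      type: accuracy
      value: 48.5
"""

for name, value in list_metrics(sample):
    print(f"{name}: {value}")  # compare each line against the README table by hand
```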

### Method 2: Import from Artificial Analysis

Expand Down Expand Up @@ -272,21 +262,35 @@ python scripts/run_eval_job.py \
python scripts/evaluation_manager.py --help
```

**Inspect Tables (start here):**
```bash
python scripts/evaluation_manager.py inspect-tables \
--repo-id "username/model-name"
```
Shows all tables in the README with:
- Table format (simple, comparison, transposed)
- Column headers with model match indicators
- Sample rows from first column
- **Ready-to-use `extract-readme` command** with correct `--table` and `--model-name-override`

Run `inspect-tables --help` to see the full workflow.

**Extract from README:**
```bash
python scripts/evaluation_manager.py extract-readme \
--repo-id "username/model-name" \
[--table N] \
[--model-name-override "Column Header"] \
[--task-type "text-generation"] \
[--dataset-name "Custom Benchmarks"] \
[--dry-run] \
[--create-pr]
```

Key flags:
- `--table N`: Table number from `inspect-tables` output (required if multiple tables)
- `--model-name-override`: Exact column header for comparison tables
- `--dry-run`: Preview YAML output without applying

**Import from Artificial Analysis:**
```bash
Expand Down Expand Up @@ -399,15 +403,15 @@ done < models.txt

### Best Practices

1. **Always start with `inspect-tables`**: See table structure and get the correct extraction command
2. **Use `--help` for guidance**: Run `inspect-tables --help` to see the complete workflow
3. **Use `--dry-run` first**: Preview YAML output before applying changes
4. **Verify extracted values**: Compare YAML output against the README table manually
5. **Use `--table N` for multi-table READMEs**: Required when multiple evaluation tables exist
6. **Use `--model-name-override` for comparison tables**: Copy the exact column header from `inspect-tables` output
7. **Create PRs for Others**: Use `--create-pr` when updating models you don't own
8. **One model per repo**: Only add the main model's results to model-index
9. **No markdown in YAML names**: The model name field in YAML should be plain text

### Model Name Matching
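Table cells often decorate model names with markdown (bold, links, suffixes like "(ours)"), so matching has to normalize before comparing. A minimal sketch of one plausible normalization — illustrative only, not the script's actual implementation:

```python
import re

def normalize_name(cell: str) -> str:
    """Strip common markdown decoration from a table cell for comparison."""
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", cell)  # [label](url) -> label
    text = text.replace("**", "").replace("*", "").replace("`", "")
    return text.strip().lower()

print(normalize_name("**OLMo 7B** (ours)"))  # -> "olmo 7b (ours)"
```

This is why `--model-name-override` expects the exact column header text: after normalization the script can still compare it against each column reliably.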

Expand Down Expand Up @@ -459,11 +463,23 @@ python scripts/evaluation_manager.py import-aa \

### Troubleshooting

**Issue**: "Found N evaluation tables. Run inspect-tables first"
- **Cause**: Multiple tables exist and `--table` was not specified
- **Solution**: Run `inspect-tables` to see available tables, then use `--table N`

**Issue**: Wrong values extracted (scores don't match README)
- **Cause**: Wrong column extracted from comparison table
- **Solution**:
1. Run `inspect-tables` to see column headers
2. Use `--model-name-override` with the exact column header text
3. Use `--dry-run` to verify before applying

**Issue**: "No tables found in README"
- **Cause**: Tables may be in code blocks or non-standard format
- **Solution**: Check README contains proper markdown tables with `|` separators
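Because markdown-it-py tokenizes the whole document, a table-shaped example inside a fenced code block never produces table tokens — only real markdown tables do. A small sketch of how that distinction can be checked (illustrative, not the skill's exact code):

```python
from markdown_it import MarkdownIt

# The commonmark preset has tables disabled; enable the GFM table rule.
md = MarkdownIt("commonmark").enable("table")

doc = "\n".join([
    "| benchmark | score |",
    "| --- | --- |",
    "| arc_easy | 48.5 |",
    "",
    "~~~",                    # fenced code block: its contents are not parsed
    "| fake | table |",
    "| --- | --- |",
    "~~~",
])

tokens = md.parse(doc)
table_count = sum(1 for t in tokens if t.type == "table_open")
print(table_count)  # only the real table is counted; the fenced one is ignored
```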

**Issue**: "Could not find model 'X' in transposed table"
- **Solution**: Use `--model-name-override` with the exact name from the table
- **Example**: `--model-name-override "**Olmo 3-32B**"`

**Issue**: "AA_API_KEY not set"
- **Solution**: Set the `AA_API_KEY` environment variable with your Artificial Analysis API key
**Issue**: "Token does not have write access"
- **Solution**: Ensure HF_TOKEN has write permissions for the repository


### Integration Examples

**Python Script Integration:**
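The CLI can also be driven from Python, e.g. when batch-processing repos. A sketch under the assumption that the script lives at `scripts/evaluation_manager.py`; `build_extract_cmd` is an illustrative helper, not part of the skill:

```python
import subprocess

def build_extract_cmd(repo_id, table=None, model_name=None, dry_run=True):
    """Assemble the argv list for an extract-readme invocation (illustrative)."""
    cmd = ["python", "scripts/evaluation_manager.py", "extract-readme",
           "--repo-id", repo_id]
    if table is not None:
        cmd += ["--table", str(table)]
    if model_name:
        cmd += ["--model-name-override", model_name]
    if dry_run:
        cmd.append("--dry-run")
    return cmd

# Example (uncomment to actually run; requires the skill's scripts on disk):
# subprocess.run(build_extract_cmd("allenai/OLMo-7B", table=3,
#                                  model_name="**OLMo 7B** (ours)"), check=True)
```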