This skill provides tools to add structured evaluation results to Hugging Face model cards.

# Dependencies
- huggingface_hub>=0.26.0
- markdown-it-py>=3.0.0
- python-dotenv>=1.2.1
- pyyaml>=6.0.3
- requests>=2.32.5
- inspect-ai>=0.3.0
- re (built-in)

# IMPORTANT: Using This Skill

**Always run `--help` to get guidance on table extraction and YAML generation:**
```bash
python scripts/evaluation_manager.py --help
python scripts/evaluation_manager.py inspect-tables --help
python scripts/evaluation_manager.py extract-readme --help
```

The `--help` output includes workflow guidance for converting tables to YAML.

# Core Capabilities

## 1. Inspect and Extract Evaluation Tables from README
- **Inspect Tables**: Use `inspect-tables` to see all tables in a README with their structure, columns, and suggested extraction commands
- **Parse Markdown Tables**: Accurate parsing using markdown-it-py (ignores code blocks and examples)
- **Table Selection**: Use `--table N` to extract from a specific table (required when multiple tables exist)
- **Format Detection**: Recognize common formats (benchmarks as rows, columns, or comparison tables with multiple models)
- **Column Matching**: Automatically identify model columns, with `--model-name-override` for comparison tables
- **YAML Generation**: Convert selected table to model-index YAML format

## 2. Import from Artificial Analysis
- **API Integration**: Fetch benchmark scores directly from Artificial Analysis
The skill includes Python scripts in `scripts/` to perform operations.

### Method 1: Extract from README

Extract evaluation tables from a model's existing README and convert to model-index YAML.

#### Recommended Workflow: Inspect Tables First

**Step 1: Inspect the tables** to see structure and get the extraction command:
```bash
python scripts/evaluation_manager.py inspect-tables --repo-id "allenai/OLMo-7B"
```

This outputs:
```
======================================================================
Tables found in README for: allenai/OLMo-7B
======================================================================

## Table 3
Format: comparison
Rows: 14

Columns (6):
[1] [Llama 7B](...)
[2] [Llama 2 7B](...)
[5] **OLMo 7B** (ours) ~ partial match

Sample rows (first column):
- arc_challenge
- arc_easy
- boolq

⚠ No exact match. Best candidate: **OLMo 7B** (ours)

Suggested command:
python scripts/evaluation_manager.py extract-readme \
--repo-id "allenai/OLMo-7B" \
--table 3 \
--model-name-override "**OLMo 7B** (ours)" \
--dry-run
```

**Step 2: Copy and run the suggested command** (with `--dry-run` to preview YAML):
```bash
python scripts/evaluation_manager.py extract-readme \
--repo-id "allenai/OLMo-7B" \
--table 3 \
--model-name-override "**OLMo 7B** (ours)" \
--dry-run
```

**Step 3: Verify the YAML output** - check benchmark names and values match the README
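For reference, model-index metadata generally has this shape (an illustrative fragment only — the task type, benchmark name, value, and URL below are placeholders, not real output for this model):

```yaml
model-index:
- name: OLMo 7B
  results:
  - task:
      type: text-generation
    dataset:
      name: arc_challenge
      type: arc_challenge
    metrics:
    - name: arc_challenge
      type: accuracy
      value: 48.5
    source:
      name: Model README
      url: https://huggingface.co/allenai/OLMo-7B
```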

**Step 4: Apply changes** - remove `--dry-run` and optionally add `--create-pr`:
```bash
python scripts/evaluation_manager.py extract-readme \
--repo-id "allenai/OLMo-7B" \
--table 3 \
--model-name-override "**OLMo 7B** (ours)" \
--create-pr
```

#### Key Flags

- `--table N`: **Required when multiple tables exist.** Specifies which table to extract (1-indexed, matches `inspect-tables` output)
- `--model-name-override`: Column header text for comparison tables (e.g., `"**OLMo 7B** (ours)"`)
- `--dry-run`: Preview YAML without making changes
- `--create-pr`: Create a pull request instead of direct push

#### Supported Table Formats

**Format 1: Benchmarks as Rows**
Expand Down Expand Up @@ -124,80 +180,14 @@ In this format, the script will:
- Find the row matching your model name (handles bold/markdown formatting)
- Extract all benchmark scores from that specific row only

#### Validation Checklist

Before applying changes (removing `--dry-run`), verify:
- [ ] Correct table selected (use `inspect-tables` to confirm)
- [ ] Correct column extracted (check `--model-name-override` if comparison table)
- [ ] Benchmark names match the README
- [ ] Numeric values match the README exactly
- [ ] No scores from other models included
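One quick way to work through this checklist is to parse the `--dry-run` YAML and print each metric for side-by-side comparison with the README table. A sketch (the sample YAML here is a placeholder, not real output):

```python
import yaml  # pyyaml, already a dependency of this skill

def list_metrics(model_index_yaml: str):
    """Yield (benchmark, value) pairs from model-index YAML for manual review."""
    data = yaml.safe_load(model_index_yaml)
    for model in data["model-index"]:
        for result in model["results"]:
            for metric in result["metrics"]:
                yield metric.get("name", metric["type"]), metric["value"]

# Placeholder --dry-run output; substitute the YAML the script printed.
sample = """
model-index:
- name: OLMo 7B
  results:
  - task:
      type: text-generation
    dataset:
      name: arc_challenge
      type: arc_challenge
    metrics:
    - name: arc_challenge
      type: accuracy
      value: 48.5
"""

for name, value in list_metrics(sample):
    print(f"{name}: {value}")  # compare each line against the README table by hand
```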

### Method 2: Import from Artificial Analysis

Expand Down Expand Up @@ -272,21 +262,35 @@ python scripts/run_eval_job.py \
python scripts/evaluation_manager.py --help
```

**Inspect Tables (start here):**
```bash
python scripts/evaluation_manager.py inspect-tables \
--repo-id "username/model-name"
```
Shows all tables in the README with:
- Table format (simple, comparison, transposed)
- Column headers with model match indicators
- Sample rows from first column
- **Ready-to-use `extract-readme` command** with correct `--table` and `--model-name-override`

Run `inspect-tables --help` to see the full workflow.

**Extract from README:**
```bash
python scripts/evaluation_manager.py extract-readme \
--repo-id "username/model-name" \
[--table N] \
[--model-name-override "Column Header"] \
[--task-type "text-generation"] \
[--dataset-name "Custom Benchmarks"] \
[--dry-run] \
[--create-pr]
```

Key flags:
- `--table N`: Table number from `inspect-tables` output (required if multiple tables)
- `--model-name-override`: Exact column header for comparison tables
- `--dry-run`: Preview YAML output without applying

**Import from Artificial Analysis:**
```bash
Expand Down Expand Up @@ -399,15 +403,15 @@ done < models.txt

### Best Practices

1. **Always start with `inspect-tables`**: See table structure and get the correct extraction command
2. **Use `--help` for guidance**: Run `inspect-tables --help` to see the complete workflow
3. **Use `--dry-run` first**: Preview YAML output before applying changes
4. **Verify extracted values**: Compare YAML output against the README table manually
5. **Use `--table N` for multi-table READMEs**: Required when multiple evaluation tables exist
6. **Use `--model-name-override` for comparison tables**: Copy the exact column header from `inspect-tables` output
7. **Create PRs for Others**: Use `--create-pr` when updating models you don't own
8. **One model per repo**: Only add the main model's results to model-index
9. **No markdown in YAML names**: The model name field in YAML should be plain text

### Model Name Matching
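Table cells often decorate model names with markdown (bold, links, suffixes like "(ours)"), so matching has to normalize before comparing. A minimal sketch of one plausible normalization — illustrative only, not the script's actual implementation:

```python
import re

def normalize_name(cell: str) -> str:
    """Strip common markdown decoration from a table cell for comparison."""
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", cell)  # [label](url) -> label
    text = text.replace("**", "").replace("*", "").replace("`", "")
    return text.strip().lower()

print(normalize_name("**OLMo 7B** (ours)"))  # -> "olmo 7b (ours)"
```

This is why `--model-name-override` expects the exact column header text: after normalization the script can still compare it against each column reliably.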

Expand Down Expand Up @@ -459,11 +463,23 @@ python scripts/evaluation_manager.py import-aa \

### Troubleshooting

**Issue**: "Found N evaluation tables. Run inspect-tables first"
- **Cause**: Multiple tables exist and `--table` was not specified
- **Solution**: Run `inspect-tables` to see available tables, then use `--table N`

**Issue**: Wrong values extracted (scores don't match README)
- **Cause**: Wrong column extracted from comparison table
- **Solution**:
1. Run `inspect-tables` to see column headers
2. Use `--model-name-override` with the exact column header text
3. Use `--dry-run` to verify before applying

**Issue**: "No tables found in README"
- **Cause**: Tables may be in code blocks or non-standard format
- **Solution**: Check README contains proper markdown tables with `|` separators
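Because markdown-it-py tokenizes the whole document, a table-shaped example inside a fenced code block never produces table tokens — only real markdown tables do. A small sketch of how that distinction can be checked (illustrative, not the skill's exact code):

```python
from markdown_it import MarkdownIt

# The commonmark preset has tables disabled; enable the GFM table rule.
md = MarkdownIt("commonmark").enable("table")

doc = "\n".join([
    "| benchmark | score |",
    "| --- | --- |",
    "| arc_easy | 48.5 |",
    "",
    "~~~",                    # fenced code block: its contents are not parsed
    "| fake | table |",
    "| --- | --- |",
    "~~~",
])

tokens = md.parse(doc)
table_count = sum(1 for t in tokens if t.type == "table_open")
print(table_count)  # only the real table is counted; the fenced one is ignored
```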

**Issue**: "Could not find model 'X' in transposed table"
- **Solution**: Use `--model-name-override` with the exact name from the table
- **Example**: `--model-name-override "**Olmo 3-32B**"`

**Issue**: "AA_API_KEY not set"
- **Solution**: Set the `AA_API_KEY` environment variable with your Artificial Analysis API key
**Issue**: "Token does not have write access"
- **Solution**: Ensure HF_TOKEN has write permissions for the repository


### Integration Examples

**Python Script Integration:**
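The CLI can also be driven from Python, e.g. when batch-processing repos. A sketch under the assumption that the script lives at `scripts/evaluation_manager.py`; `build_extract_cmd` is an illustrative helper, not part of the skill:

```python
import subprocess

def build_extract_cmd(repo_id, table=None, model_name=None, dry_run=True):
    """Assemble the argv list for an extract-readme invocation (illustrative)."""
    cmd = ["python", "scripts/evaluation_manager.py", "extract-readme",
           "--repo-id", repo_id]
    if table is not None:
        cmd += ["--table", str(table)]
    if model_name:
        cmd += ["--model-name-override", model_name]
    if dry_run:
        cmd.append("--dry-run")
    return cmd

# Example (uncomment to actually run; requires the skill's scripts on disk):
# subprocess.run(build_extract_cmd("allenai/OLMo-7B", table=3,
#                                  model_name="**OLMo 7B** (ours)"), check=True)
```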