From 20472073a8f4e5e3d7cdd34da8e2d2c597f66bf2 Mon Sep 17 00:00:00 2001 From: evalstate <1936278+evalstate@users.noreply.github.com> Date: Fri, 28 Nov 2025 19:25:31 +0000 Subject: [PATCH 1/4] update scripts and skill for table handling --- hf_model_evaluation/SKILL.md | 114 +++++-- .../scripts/evaluation_manager.py | 312 ++++++++++++++++-- 2 files changed, 379 insertions(+), 47 deletions(-) diff --git a/hf_model_evaluation/SKILL.md b/hf_model_evaluation/SKILL.md index 3a82c11..d123c78 100644 --- a/hf_model_evaluation/SKILL.md +++ b/hf_model_evaluation/SKILL.md @@ -17,20 +17,33 @@ This skill provides tools to add structured evaluation results to Hugging Face m # Dependencies - huggingface_hub>=0.26.0 +- markdown-it-py>=3.0.0 - python-dotenv>=1.2.1 - pyyaml>=6.0.3 - requests>=2.32.5 - inspect-ai>=0.3.0 - re (built-in) +# IMPORTANT: Using This Skill + +**Always run `--help` to get guidance on table extraction and YAML generation:** +```bash +python scripts/evaluation_manager.py --help +python scripts/evaluation_manager.py inspect-tables --help +python scripts/evaluation_manager.py extract-readme --help +``` + +The `--help` output includes workflow guidance for converting tables to YAML. + # Core Capabilities -## 1. Extract Evaluation Tables from README -- **Parse Markdown Tables**: Automatically detect and parse evaluation tables in model READMEs -- **Multiple Table Support**: Handle models with multiple benchmark tables -- **Format Detection**: Recognize common evaluation table formats (benchmarks as rows/columns, or transposed with models as rows) -- **Smart Model Matching**: Find and extract scores for specific models in comparison tables -- **Smart Conversion**: Convert parsed tables to model-index YAML format +## 1. 
Inspect and Extract Evaluation Tables from README +- **Inspect Tables**: Use `inspect-tables` to see all tables in a README with their structure, columns, and suggested extraction commands +- **Parse Markdown Tables**: Accurate parsing using markdown-it-py (ignores code blocks and examples) +- **Table Selection**: Use `--table N` to extract from a specific table (required when multiple tables exist) +- **Format Detection**: Recognize common formats (benchmarks as rows, columns, or comparison tables with multiple models) +- **Column Matching**: Automatically identify model columns, with `--model-name-override` for comparison tables +- **YAML Generation**: Convert selected table to model-index YAML format ## 2. Import from Artificial Analysis - **API Integration**: Fetch benchmark scores directly from Artificial Analysis @@ -63,29 +76,72 @@ The skill includes Python scripts in `scripts/` to perform operations. ### Method 1: Extract from README -Extract evaluation tables from a model's existing README and add them to model-index metadata. +Extract evaluation tables from a model's existing README and convert to model-index YAML. -**Basic Usage:** +#### Recommended Workflow: Inspect Tables First + +**Step 1: Inspect the tables** to see structure and get the extraction command: ```bash -python scripts/evaluation_manager.py extract-readme \ - --repo-id "username/model-name" +python scripts/evaluation_manager.py inspect-tables --repo-id "allenai/OLMo-7B" +``` + +This outputs: +``` +====================================================================== +Tables found in README for: allenai/OLMo-7B +====================================================================== + +## Table 3 + Format: comparison + Rows: 14 + + Columns (6): + [1] [Llama 7B](...) + [2] [Llama 2 7B](...) + [5] **OLMo 7B** (ours) ~ partial match + + Sample rows (first column): + - arc_challenge + - arc_easy + - boolq + + ⚠ No exact match. 
Best candidate: **OLMo 7B** (ours) + + Suggested command: + python scripts/evaluation_manager.py extract-readme \ + --repo-id "allenai/OLMo-7B" \ + --table 3 \ + --model-name-override "**OLMo 7B** (ours)" \ + --dry-run ``` -**With Custom Task Type:** +**Step 2: Copy and run the suggested command** (with `--dry-run` to preview YAML): ```bash python scripts/evaluation_manager.py extract-readme \ - --repo-id "username/model-name" \ - --task-type "text-generation" \ - --dataset-name "Custom Benchmarks" + --repo-id "allenai/OLMo-7B" \ + --table 3 \ + --model-name-override "**OLMo 7B** (ours)" \ + --dry-run ``` -**Dry Run (Preview Only):** +**Step 3: Verify the YAML output** - check benchmark names and values match the README + +**Step 4: Apply changes** - remove `--dry-run` and optionally add `--create-pr`: ```bash python scripts/evaluation_manager.py extract-readme \ - --repo-id "username/model-name" \ - --dry-run + --repo-id "allenai/OLMo-7B" \ + --table 3 \ + --model-name-override "**OLMo 7B** (ours)" \ + --create-pr ``` +#### Key Flags + +- `--table N`: **Required when multiple tables exist.** Specifies which table to extract (1-indexed, matches `inspect-tables` output) +- `--model-name-override`: Column header text for comparison tables (e.g., `"**OLMo 7B** (ours)"`) +- `--dry-run`: Preview YAML without making changes +- `--create-pr`: Create a pull request instead of direct push + #### Supported Table Formats **Format 1: Benchmarks as Rows** @@ -272,21 +328,35 @@ python scripts/run_eval_job.py \ python scripts/evaluation_manager.py --help ``` +**Inspect Tables (start here):** +```bash +python scripts/evaluation_manager.py inspect-tables \ + --repo-id "username/model-name" +``` +Shows all tables in the README with: +- Table format (simple, comparison, transposed) +- Column headers with model match indicators +- Sample rows from first column +- **Ready-to-use `extract-readme` command** with correct `--table` and `--model-name-override` + +Run `inspect-tables --help` 
to see the full workflow. + **Extract from README:** ```bash python scripts/evaluation_manager.py extract-readme \ --repo-id "username/model-name" \ + [--table N] \ + [--model-name-override "Column Header"] \ [--task-type "text-generation"] \ [--dataset-name "Custom Benchmarks"] \ - [--model-name-override "Model Name From Table"] \ [--dry-run] \ [--create-pr] ``` -The `--model-name-override` flag is useful when: -- The model name in the table differs from the repo name -- Working with transposed tables where models are listed with different formatting -- The script cannot automatically match the model name +Key flags: +- `--table N`: Table number from `inspect-tables` output (required if multiple tables) +- `--model-name-override`: Exact column header for comparison tables +- `--dry-run`: Preview YAML output without applying **Import from Artificial Analysis:** ```bash diff --git a/hf_model_evaluation/scripts/evaluation_manager.py b/hf_model_evaluation/scripts/evaluation_manager.py index 4ecc32f..7114a22 100644 --- a/hf_model_evaluation/scripts/evaluation_manager.py +++ b/hf_model_evaluation/scripts/evaluation_manager.py @@ -2,6 +2,7 @@ # requires-python = ">=3.13" # dependencies = [ # "huggingface-hub>=1.1.4", +# "markdown-it-py>=3.0.0", # "python-dotenv>=1.2.1", # "pyyaml>=6.0.3", # "requests>=2.32.5", @@ -27,6 +28,7 @@ import requests import yaml from huggingface_hub import ModelCard +from markdown_it import MarkdownIt dotenv.load_dotenv() @@ -297,15 +299,22 @@ def extract_metrics_from_table( if is_transposed_table(header, rows): table_format = "transposed" else: - # Heuristic: if first row has mostly numeric values, benchmarks are columns - try: - numeric_count = sum( - 1 for cell in rows[0] if cell and - re.match(r"^\d+\.?\d*%?$", cell.replace(",", "").strip()) - ) - table_format = "columns" if numeric_count > len(rows[0]) / 2 else "rows" - except (IndexError, ValueError): + # Check if first column header is empty/generic (indicates benchmarks in rows) + 
first_header = header[0].lower().strip() if header else "" + is_first_col_benchmarks = not first_header or first_header in ["", "benchmark", "task", "dataset", "metric", "eval"] + + if is_first_col_benchmarks: table_format = "rows" + else: + # Heuristic: if first row has mostly numeric values, benchmarks are columns + try: + numeric_count = sum( + 1 for cell in rows[0] if cell and + re.match(r"^\d+\.?\d*%?$", cell.replace(",", "").strip()) + ) + table_format = "columns" if numeric_count > len(rows[0]) / 2 else "rows" + except (IndexError, ValueError): + table_format = "rows" if table_format == "rows": # Benchmarks are in rows, scores in columns @@ -438,7 +447,8 @@ def extract_evaluations_from_readme( task_type: str = "text-generation", dataset_name: str = "Benchmarks", dataset_type: str = "benchmark", - model_name_override: Optional[str] = None + model_name_override: Optional[str] = None, + table_index: Optional[int] = None ) -> Optional[List[Dict[str, Any]]]: """ Extract evaluation results from a model's README. 
@@ -448,7 +458,8 @@ def extract_evaluations_from_readme( task_type: Task type for model-index (e.g., "text-generation") dataset_name: Name for the benchmark dataset dataset_type: Type identifier for the dataset - model_name_override: Override model name for matching (useful for transposed tables) + model_name_override: Override model name for matching (column header for comparison tables) + table_index: 1-indexed table number from inspect-tables output Returns: Model-index formatted results or None if no evaluations found @@ -468,28 +479,54 @@ def extract_evaluations_from_readme( else: model_name = repo_id.split("/")[-1] if "/" in repo_id else repo_id - # Extract all tables - tables = extract_tables_from_markdown(readme_content) + # Use markdown-it parser for accurate table extraction + all_tables = extract_tables_with_parser(readme_content) - if not tables: + if not all_tables: print(f"No tables found in README for {repo_id}") return None - # Parse and filter evaluation tables + # If table_index specified, use that specific table + if table_index is not None: + if table_index < 1 or table_index > len(all_tables): + print(f"Invalid table index {table_index}. 
Found {len(all_tables)} tables.")
+            print("Run inspect-tables to see available tables.")
+            return None
+        tables_to_process = [all_tables[table_index - 1]]
+    else:
+        # Filter to evaluation tables only
+        eval_tables = []
+        for table in all_tables:
+            header = table.get("headers", [])
+            rows = table.get("rows", [])
+            if is_evaluation_table(header, rows):
+                eval_tables.append(table)
+
+        if len(eval_tables) > 1:
+            print(f"\n⚠ Found {len(eval_tables)} evaluation tables.")
+            print("Run inspect-tables first, then use --table to select one:")
+            print(f'  python scripts/evaluation_manager.py inspect-tables --repo-id "{repo_id}"')
+            return None
+        elif len(eval_tables) == 0:
+            print(f"No evaluation tables found in README for {repo_id}")
+            return None
+
+        tables_to_process = eval_tables
+
+    # Extract metrics from selected table(s)
     all_metrics = []
-    for table_str in tables:
-        header, rows = parse_markdown_table(table_str)
-
-        if is_evaluation_table(header, rows):
-            metrics = extract_metrics_from_table(header, rows, model_name=model_name)
-            all_metrics.extend(metrics)
+    for table in tables_to_process:
+        header = table.get("headers", [])
+        rows = table.get("rows", [])
+        metrics = extract_metrics_from_table(header, rows, model_name=model_name)
+        all_metrics.extend(metrics)
 
     if not all_metrics:
-        print(f"No evaluation tables found in README for {repo_id}")
+        print("No metrics extracted from the selected table")
         return None
 
     # Build model-index structure
-    model_name = repo_id.split("/")[-1] if "/" in repo_id else repo_id
+    display_name = repo_id.split("/")[-1] if "/" in repo_id else repo_id
 
     results = [{
         "task": {"type": task_type},
@@ -511,6 +548,211 @@ def extract_evaluations_from_readme(
         return None
 
 
+# ============================================================================
+# Table Inspection (using markdown-it-py for accurate parsing)
+# ============================================================================
+
+
+def extract_tables_with_parser(markdown_content: str) -> List[Dict[str, 
Any]]: + """ + Extract tables from markdown using markdown-it-py parser. + Uses GFM (GitHub Flavored Markdown) which includes table support. + """ + md = MarkdownIt("gfm-like") + tokens = md.parse(markdown_content) + + tables = [] + i = 0 + while i < len(tokens): + token = tokens[i] + + if token.type == "table_open": + table_data = {"headers": [], "rows": []} + current_row = [] + in_header = False + + i += 1 + while i < len(tokens) and tokens[i].type != "table_close": + t = tokens[i] + if t.type == "thead_open": + in_header = True + elif t.type == "thead_close": + in_header = False + elif t.type == "tr_open": + current_row = [] + elif t.type == "tr_close": + if in_header: + table_data["headers"] = current_row + else: + table_data["rows"].append(current_row) + current_row = [] + elif t.type == "inline": + current_row.append(t.content.strip()) + i += 1 + + if table_data["headers"] or table_data["rows"]: + tables.append(table_data) + + i += 1 + + return tables + + +def detect_table_format(table: Dict[str, Any], repo_id: str) -> Dict[str, Any]: + """Analyze a table to detect its format and identify model columns.""" + headers = table.get("headers", []) + rows = table.get("rows", []) + + if not headers or not rows: + return {"format": "unknown", "columns": headers, "model_columns": [], "row_count": 0, "sample_rows": []} + + first_header = headers[0].lower() if headers else "" + is_first_col_benchmarks = not first_header or first_header in ["", "benchmark", "task", "dataset", "metric", "eval"] + + # Check for numeric columns + numeric_columns = [] + for col_idx in range(1, len(headers)): + numeric_count = 0 + for row in rows[:5]: + if col_idx < len(row): + try: + val = re.sub(r'\s*\([^)]*\)', '', row[col_idx]) + float(val.replace("%", "").replace(",", "").strip()) + numeric_count += 1 + except (ValueError, AttributeError): + pass + if numeric_count > len(rows[:5]) / 2: + numeric_columns.append(col_idx) + + # Determine format + if is_first_col_benchmarks and 
len(numeric_columns) > 1: + format_type = "comparison" + elif is_first_col_benchmarks and len(numeric_columns) == 1: + format_type = "simple" + elif len(numeric_columns) > len(headers) / 2: + format_type = "transposed" + else: + format_type = "unknown" + + # Find model columns + model_columns = [] + model_name = repo_id.split("/")[-1] if "/" in repo_id else repo_id + model_tokens, _ = normalize_model_name(model_name) + + for idx, header in enumerate(headers): + if idx == 0 and is_first_col_benchmarks: + continue + if header: + header_tokens, _ = normalize_model_name(header) + is_match = model_tokens == header_tokens + is_partial = model_tokens.issubset(header_tokens) or header_tokens.issubset(model_tokens) + model_columns.append({ + "index": idx, + "header": header, + "is_exact_match": is_match, + "is_partial_match": is_partial and not is_match + }) + + return { + "format": format_type, + "columns": headers, + "model_columns": model_columns, + "row_count": len(rows), + "sample_rows": [row[0] for row in rows[:5] if row] + } + + +def inspect_tables(repo_id: str) -> None: + """Inspect and display all evaluation tables in a model's README.""" + try: + card = ModelCard.load(repo_id, token=HF_TOKEN) + readme_content = card.content + + if not readme_content: + print(f"No README content found for {repo_id}") + return + + tables = extract_tables_with_parser(readme_content) + + if not tables: + print(f"No tables found in README for {repo_id}") + return + + print(f"\n{'='*70}") + print(f"Tables found in README for: {repo_id}") + print(f"{'='*70}") + + eval_table_count = 0 + for table in tables: + analysis = detect_table_format(table, repo_id) + + if analysis["format"] == "unknown" and not analysis.get("sample_rows"): + continue + + eval_table_count += 1 + print(f"\n## Table {eval_table_count}") + print(f" Format: {analysis['format']}") + print(f" Rows: {analysis['row_count']}") + + print(f"\n Columns ({len(analysis['columns'])}):") + for col_info in 
analysis.get("model_columns", []): + idx = col_info["index"] + header = col_info["header"] + if col_info["is_exact_match"]: + print(f" [{idx}] {header} ✓ EXACT MATCH") + elif col_info["is_partial_match"]: + print(f" [{idx}] {header} ~ partial match") + else: + print(f" [{idx}] {header}") + + if analysis.get("sample_rows"): + print(f"\n Sample rows (first column):") + for row_val in analysis["sample_rows"][:5]: + print(f" - {row_val}") + + # Build suggested command + cmd_parts = [ + "python scripts/evaluation_manager.py extract-readme", + f'--repo-id "{repo_id}"', + f"--table {eval_table_count}" + ] + + override_value = None + if analysis["format"] == "comparison": + exact = next((c for c in analysis.get("model_columns", []) if c["is_exact_match"]), None) + if exact: + print(f"\n ✓ Column match: {exact['header']}") + else: + partial = next((c for c in analysis.get("model_columns", []) if c["is_partial_match"]), None) + if partial: + override_value = partial["header"] + print(f"\n ⚠ No exact match. Best candidate: {partial['header']}") + elif analysis.get("model_columns"): + print(f"\n ⚠ Could not identify model column. 
Options:") + for col_info in analysis.get("model_columns", []): + print(f' "{col_info["header"]}"') + override_value = analysis["model_columns"][0]["header"] + + if override_value: + cmd_parts.append(f'--model-name-override "{override_value}"') + + cmd_parts.append("--dry-run") + + print(f"\n Suggested command:") + print(f" {cmd_parts[0]} \\") + for part in cmd_parts[1:-1]: + print(f" {part} \\") + print(f" {cmd_parts[-1]}") + + if eval_table_count == 0: + print("\nNo evaluation tables detected.") + + print(f"\n{'='*70}\n") + + except Exception as e: + print(f"Error inspecting tables: {e}") + + # ============================================================================ # Method 2: Import from Artificial Analysis # ============================================================================ @@ -816,12 +1058,13 @@ def main(): help="Extract evaluation tables from model README" ) extract_parser.add_argument("--repo-id", type=str, required=True, help="HF repository ID") + extract_parser.add_argument("--table", type=int, help="Table number (1-indexed, from inspect-tables output)") + extract_parser.add_argument("--model-name-override", type=str, help="Column header for comparison tables") extract_parser.add_argument("--task-type", type=str, default="text-generation", help="Task type") extract_parser.add_argument("--dataset-name", type=str, default="Benchmarks", help="Dataset name") extract_parser.add_argument("--dataset-type", type=str, default="benchmark", help="Dataset type") - extract_parser.add_argument("--model-name-override", type=str, help="Override model name for table matching") extract_parser.add_argument("--create-pr", action="store_true", help="Create PR instead of direct push") - extract_parser.add_argument("--dry-run", action="store_true", help="Preview without updating") + extract_parser.add_argument("--dry-run", action="store_true", help="Preview YAML without updating") # Import from AA command aa_parser = subparsers.add_parser( @@ -847,6 +1090,21 @@ 
def main(): ) validate_parser.add_argument("--repo-id", type=str, required=True, help="HF repository ID") + # Inspect tables command + inspect_parser = subparsers.add_parser( + "inspect-tables", + help="Inspect tables in README → outputs suggested extract-readme command", + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=""" +Workflow: + 1. inspect-tables → see table structure and columns + 2. copy command → suggested extract-readme command + 3. run with --dry-run → preview YAML output + 4. remove --dry-run → apply changes +""" + ) + inspect_parser.add_argument("--repo-id", type=str, required=True, help="HF repository ID") + args = parser.parse_args() if not args.command: @@ -860,7 +1118,8 @@ def main(): task_type=args.task_type, dataset_name=args.dataset_name, dataset_type=args.dataset_type, - model_name_override=args.model_name_override + model_name_override=args.model_name_override, + table_index=args.table ) if not results: @@ -902,6 +1161,9 @@ def main(): elif args.command == "validate": validate_model_index(args.repo_id) + elif args.command == "inspect-tables": + inspect_tables(args.repo_id) + if __name__ == "__main__": main() From 65a4e2d97979aee100439e2e27bb4c9206253133 Mon Sep 17 00:00:00 2001 From: evalstate <1936278+evalstate@users.noreply.github.com> Date: Fri, 28 Nov 2025 19:25:47 +0000 Subject: [PATCH 2/4] update skill text (before bench run wipes it!) --- hf_model_evaluation/SKILL.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/hf_model_evaluation/SKILL.md b/hf_model_evaluation/SKILL.md index d123c78..e4749dd 100644 --- a/hf_model_evaluation/SKILL.md +++ b/hf_model_evaluation/SKILL.md @@ -469,15 +469,15 @@ done < models.txt ### Best Practices -1. **ALWAYS Validate Extraction**: Use `--dry-run` first and manually verify all extracted scores match the README exactly before pushing -2. 
**Check for Transposed Tables**: If the README has comparison tables with multiple models, verify only YOUR model's scores were extracted -3. **Validate After Updates**: Run `validate` and `show` commands to ensure proper formatting -4. **Source Attribution**: Include source information for traceability -5. **Regular Updates**: Keep evaluation scores current as new benchmarks emerge -6. **Create PRs for Others**: Use `--create-pr` when updating models you don't own -7. **Monitor Costs**: Evaluation Jobs are billed by usage. Ensure you check running jobs and costs -8. **One model per repo**: Only add one model's 'results' to the model-index. The main model of the repo. No derivatives or forks! -9. **Markdown formatting**: Never use markdown formatting in the model name. Use the exact name from the table. Only use urls in the source.url field. +1. **Always start with `inspect-tables`**: See table structure and get the correct extraction command +2. **Use `--help` for guidance**: Run `inspect-tables --help` to see the complete workflow +3. **Use `--dry-run` first**: Preview YAML output before applying changes +4. **Verify extracted values**: Compare YAML output against the README table manually +5. **Use `--table N` for multi-table READMEs**: Required when multiple evaluation tables exist +6. **Use `--model-name-override` for comparison tables**: Copy the exact column header from `inspect-tables` output +7. **Create PRs for Others**: Use `--create-pr` when updating models you don't own +8. **One model per repo**: Only add the main model's results to model-index +9. 
**No markdown in YAML names**: The model name field in YAML should be plain text ### Model Name Matching From ffdfab352a33eb437d6fdf3528ba81455dc1a57b Mon Sep 17 00:00:00 2001 From: evalstate <1936278+evalstate@users.noreply.github.com> Date: Fri, 28 Nov 2025 19:26:08 +0000 Subject: [PATCH 3/4] same as last --- hf_model_evaluation/SKILL.md | 24 +++++++++++++++--------- 1 file changed, 15 insertions(+), 9 deletions(-) diff --git a/hf_model_evaluation/SKILL.md b/hf_model_evaluation/SKILL.md index e4749dd..b2f5475 100644 --- a/hf_model_evaluation/SKILL.md +++ b/hf_model_evaluation/SKILL.md @@ -529,11 +529,23 @@ python scripts/evaluation_manager.py import-aa \ ### Troubleshooting -**Issue**: "No evaluation tables found in README" -- **Solution**: Check if README contains markdown tables with numeric scores +**Issue**: "Found N evaluation tables. Run inspect-tables first" +- **Cause**: Multiple tables exist and `--table` was not specified +- **Solution**: Run `inspect-tables` to see available tables, then use `--table N` + +**Issue**: Wrong values extracted (scores don't match README) +- **Cause**: Wrong column extracted from comparison table +- **Solution**: + 1. Run `inspect-tables` to see column headers + 2. Use `--model-name-override` with the exact column header text + 3. Use `--dry-run` to verify before applying + +**Issue**: "No tables found in README" +- **Cause**: Tables may be in code blocks or non-standard format +- **Solution**: Check README contains proper markdown tables with `|` separators **Issue**: "Could not find model 'X' in transposed table" -- **Solution**: The script will display available models. 
Use `--model-name-override` with the exact name from the list +- **Solution**: Use `--model-name-override` with the exact name from the table - **Example**: `--model-name-override "**Olmo 3-32B**"` **Issue**: "AA_API_KEY not set" @@ -542,12 +554,6 @@ python scripts/evaluation_manager.py import-aa \ **Issue**: "Token does not have write access" - **Solution**: Ensure HF_TOKEN has write permissions for the repository -**Issue**: "Model not found in Artificial Analysis" -- **Solution**: Verify creator-slug and model-name match API values - -**Issue**: "Payment required for hardware" -- **Solution**: Add a payment method to your Hugging Face account to use non-CPU hardware - ### Integration Examples **Python Script Integration:** From 5ee3e988acb9fa6cc049f16a7e74e9a84e4a2c71 Mon Sep 17 00:00:00 2001 From: evalstate <1936278+evalstate@users.noreply.github.com> Date: Fri, 28 Nov 2025 19:26:25 +0000 Subject: [PATCH 4/4] remove validaton flow --- hf_model_evaluation/SKILL.md | 82 ++++-------------------------------- 1 file changed, 8 insertions(+), 74 deletions(-) diff --git a/hf_model_evaluation/SKILL.md b/hf_model_evaluation/SKILL.md index b2f5475..ee5c67e 100644 --- a/hf_model_evaluation/SKILL.md +++ b/hf_model_evaluation/SKILL.md @@ -180,80 +180,14 @@ In this format, the script will: - Find the row matching your model name (handles bold/markdown formatting) - Extract all benchmark scores from that specific row only -#### Validating Extraction Results - -**CRITICAL**: Always validate extracted results before creating a PR or pushing changes. - -After running `extract-readme`, you MUST: - -1. **Use `--dry-run` first** to preview the extraction: -```bash -python scripts/evaluation_manager.py extract-readme \ - --repo-id "username/model-name" \ - --dry-run -``` - -2. 
**Manually verify the output**: - - Check that the correct model's scores were extracted (not other models) - - Verify benchmark names are correct - - Confirm all expected benchmarks are present - - Ensure numeric values match the README exactly - -3. **For transposed tables** (models as rows): - - Verify only ONE model's row was extracted - - Check that it matched the correct model name - - Look for warnings like "Could not find model 'X' in transposed table" - - If scores from multiple models appear, the table format was misdetected - -4. **Compare against the source**: - - Open the model README in a browser - - Cross-reference each extracted score with the table - - Verify no scores are mixed from different rows/columns - -5. **Common validation failures**: - - **Multiple models extracted**: Wrong table format detected - - **Missing benchmarks**: Column headers not recognized - - **Wrong scores**: Matched wrong model row or column - - **Empty metrics list**: Table not detected or parsing failed - -**Example validation workflow**: -```bash -# Step 1: Dry run to preview -python scripts/evaluation_manager.py extract-readme \ - --repo-id "allenai/Olmo-3-1125-32B" \ - --dry-run - -# Step 2: If model name not found in table, script shows available models -# ⚠ Could not find model 'Olmo-3-1125-32B' in transposed table -# -# Available models in table: -# 1. **Open-weight Models** -# 2. Qwen-2.5-32B -# ... -# 12. **Olmo 3-32B** -# -# Please select the correct model name from the list above. - -# Step 3: Re-run with the correct model name -python scripts/evaluation_manager.py extract-readme \ - --repo-id "allenai/Olmo-3-1125-32B" \ - --model-name-override "**Olmo 3-32B**" \ - --dry-run - -# Step 4: Review the YAML output carefully -# Verify: Are these all benchmarks for Olmo-3-32B ONLY? -# Verify: Do the scores match the README table? 
- -# Step 5: If validation passes, create PR -python scripts/evaluation_manager.py extract-readme \ - --repo-id "allenai/Olmo-3-1125-32B" \ - --model-name-override "**Olmo 3-32B**" \ - --create-pr - -# Step 6: Validate the model card after update -python scripts/evaluation_manager.py show \ - --repo-id "allenai/Olmo-3-1125-32B" -``` +#### Validation Checklist + +Before applying changes (removing `--dry-run`), verify: +- [ ] Correct table selected (use `inspect-tables` to confirm) +- [ ] Correct column extracted (check `--model-name-override` if comparison table) +- [ ] Benchmark names match the README +- [ ] Numeric values match the README exactly +- [ ] No scores from other models included ### Method 2: Import from Artificial Analysis
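Whichever method populates the metrics, the end result is the same model-index structure that the checklist above is verifying. A minimal sketch of that structure and of the "values match the README exactly" check — model name, benchmark names, and scores here are hypothetical, and this is an illustration of the shape the script builds, not its literal output:

```python
# Illustrative model-index "results" structure (task/dataset/metrics),
# matching the shape extract-readme serializes to YAML.
# All names and values below are hypothetical.
model_index = [{
    "name": "example-model",
    "results": [{
        "task": {"type": "text-generation"},
        "dataset": {"name": "Benchmarks", "type": "benchmark"},
        "metrics": [
            {"name": "arc_challenge", "type": "arc_challenge", "value": 44.5},
            {"name": "boolq", "type": "boolq", "value": 72.4},
        ],
    }],
}]

# Checklist step: every extracted value must equal the score read off
# the README table by hand. Any name in `mismatches` needs a re-check.
readme_scores = {"arc_challenge": 44.5, "boolq": 72.4}  # hypothetical
mismatches = [
    m["name"]
    for m in model_index[0]["results"][0]["metrics"]
    if readme_scores.get(m["name"]) != m["value"]
]
print("mismatches:", mismatches)
```

An empty `mismatches` list corresponds to the last two checklist items passing; a non-empty one usually means the wrong table or column was extracted.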