diff --git a/hf_model_evaluation/SKILL.md b/hf_model_evaluation/SKILL.md index 3a82c11..ebec9d7 100644 --- a/hf_model_evaluation/SKILL.md +++ b/hf_model_evaluation/SKILL.md @@ -17,20 +17,36 @@ This skill provides tools to add structured evaluation results to Hugging Face m # Dependencies - huggingface_hub>=0.26.0 +- markdown-it-py>=3.0.0 - python-dotenv>=1.2.1 - pyyaml>=6.0.3 - requests>=2.32.5 - inspect-ai>=0.3.0 - re (built-in) +# IMPORTANT: Using This Skill + +**Use `--help` for the latest workflow guidance.** Works with plain Python or `uv run`: +```bash +uv run scripts/evaluation_manager.py --help +uv run scripts/evaluation_manager.py inspect-tables --help +uv run scripts/evaluation_manager.py extract-readme --help +``` +Key workflow (matches CLI help): +1) `inspect-tables` → find table numbers/columns +2) `extract-readme --table N` → prints YAML by default +3) add `--apply` (push) or `--create-pr` to write changes + # Core Capabilities -## 1. Extract Evaluation Tables from README -- **Parse Markdown Tables**: Automatically detect and parse evaluation tables in model READMEs -- **Multiple Table Support**: Handle models with multiple benchmark tables -- **Format Detection**: Recognize common evaluation table formats (benchmarks as rows/columns, or transposed with models as rows) -- **Smart Model Matching**: Find and extract scores for specific models in comparison tables -- **Smart Conversion**: Convert parsed tables to model-index YAML format +## 1. Inspect and Extract Evaluation Tables from README +- **Inspect Tables**: Use `inspect-tables` to see all tables in a README with structure, columns, and sample rows +- **Parse Markdown Tables**: Accurate parsing using markdown-it-py (ignores code blocks and examples) +- **Table Selection**: Use `--table N` to extract from a specific table (required when multiple tables exist) +- **Format Detection**: Recognize common formats (benchmarks as rows, columns, or comparison tables with multiple models) +- **Column Matching**: Automatically identify model columns/rows; prefer `--model-column-index` (index from inspect output). Use `--model-name-override` only with exact column header text. +- **YAML Generation**: Convert selected table to model-index YAML format +- **Task Typing**: `--task-type` sets the `task.type` field in model-index output (e.g., `text-generation`, `summarization`) ## 2. Import from Artificial Analysis - **API Integration**: Fetch benchmark scores directly from Artificial Analysis @@ -56,148 +72,42 @@ This skill provides tools to add structured evaluation results to Hugging Face m The skill includes Python scripts in `scripts/` to perform operations. ### Prerequisites -- Install dependencies: `uv add huggingface_hub python-dotenv pyyaml inspect-ai` +- Preferred: use `uv run` (PEP 723 header auto-installs deps) +- Or install manually: `pip install huggingface-hub markdown-it-py python-dotenv pyyaml requests` - Set `HF_TOKEN` environment variable with Write-access token - For Artificial Analysis: Set `AA_API_KEY` environment variable -- Activate virtual environment: `source .venv/bin/activate` - -### Method 1: Extract from README +- `.env` is loaded automatically if `python-dotenv` is installed -Extract evaluation tables from a model's existing README and add them to model-index metadata. 
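+**Example `.env` (optional):** a minimal sketch using the variable names this skill reads; the values below are placeholders.
+```bash
+# Loaded automatically when python-dotenv is installed.
+# HF_TOKEN needs Write access; AA_API_KEY is only required for `import-aa`.
+HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
+AA_API_KEY=aa_xxxxxxxxxxxxxxxxxxxx
+```
+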
+### Method 1: Extract from README (CLI workflow) -**Basic Usage:** +Recommended flow (matches `--help`): ```bash -python scripts/evaluation_manager.py extract-readme \ - --repo-id "username/model-name" +# 1) Inspect tables to get table numbers and column hints +uv run scripts/evaluation_manager.py inspect-tables --repo-id "username/model" + +# 2) Extract a specific table (prints YAML by default) +uv run scripts/evaluation_manager.py extract-readme \ + --repo-id "username/model" \ + --table 1 \ + [--model-column-index ] \ + [--model-name-override ""] # use exact header text if you can't use the index + +# 3) Apply changes (push or PR) +uv run scripts/evaluation_manager.py extract-readme \ + --repo-id "username/model" \ + --table 1 \ + --apply # push directly +# or +uv run scripts/evaluation_manager.py extract-readme \ + --repo-id "username/model" \ + --table 1 \ + --create-pr # open a PR ``` -**With Custom Task Type:** -```bash -python scripts/evaluation_manager.py extract-readme \ - --repo-id "username/model-name" \ - --task-type "text-generation" \ - --dataset-name "Custom Benchmarks" -``` - -**Dry Run (Preview Only):** -```bash -python scripts/evaluation_manager.py extract-readme \ - --repo-id "username/model-name" \ - --dry-run -``` - -#### Supported Table Formats - -**Format 1: Benchmarks as Rows** -```markdown -| Benchmark | Score | -|-----------|-------| -| MMLU | 85.2 | -| HumanEval | 72.5 | -``` - -**Format 2: Benchmarks as Columns** -```markdown -| MMLU | HumanEval | GSM8K | -|------|-----------|-------| -| 85.2 | 72.5 | 91.3 | -``` - -**Format 3: Multiple Metrics** -```markdown -| Benchmark | Accuracy | F1 Score | -|-----------|----------|----------| -| MMLU | 85.2 | 0.84 | -``` - -**Format 4: Transposed Tables (Models as Rows)** -```markdown -| Model | MMLU | HumanEval | GSM8K | ARC | -|----------------|------|-----------|-------|------| -| GPT-4 | 86.4 | 67.0 | 92.0 | 96.3 | -| Claude-3 | 86.8 | 84.9 | 95.0 | 96.4 | -| **Your-Model** | 85.2 | 72.5 | 91.3 | 95.8 | -``` - -In this format, the script will: -- Detect that models are in rows (first column) and benchmarks in columns (header) -- Find the row matching your model name (handles bold/markdown formatting) -- Extract all benchmark scores from that specific row only - -#### Validating Extraction Results - -**CRITICAL**: Always validate extracted results before creating a PR or pushing changes. - -After running `extract-readme`, you MUST: - -1. **Use `--dry-run` first** to preview the extraction: -```bash -python scripts/evaluation_manager.py extract-readme \ - --repo-id "username/model-name" \ - --dry-run -``` - -2. **Manually verify the output**: - - Check that the correct model's scores were extracted (not other models) - - Verify benchmark names are correct - - Confirm all expected benchmarks are present - - Ensure numeric values match the README exactly - -3. **For transposed tables** (models as rows): - - Verify only ONE model's row was extracted - - Check that it matched the correct model name - - Look for warnings like "Could not find model 'X' in transposed table" - - If scores from multiple models appear, the table format was misdetected - -4. **Compare against the source**: - - Open the model README in a browser - - Cross-reference each extracted score with the table - - Verify no scores are mixed from different rows/columns - -5. 
**Common validation failures**: - - **Multiple models extracted**: Wrong table format detected - - **Missing benchmarks**: Column headers not recognized - - **Wrong scores**: Matched wrong model row or column - - **Empty metrics list**: Table not detected or parsing failed - -**Example validation workflow**: -```bash -# Step 1: Dry run to preview -python scripts/evaluation_manager.py extract-readme \ - --repo-id "allenai/Olmo-3-1125-32B" \ - --dry-run - -# Step 2: If model name not found in table, script shows available models -# ⚠ Could not find model 'Olmo-3-1125-32B' in transposed table -# -# Available models in table: -# 1. **Open-weight Models** -# 2. Qwen-2.5-32B -# ... -# 12. **Olmo 3-32B** -# -# Please select the correct model name from the list above. - -# Step 3: Re-run with the correct model name -python scripts/evaluation_manager.py extract-readme \ - --repo-id "allenai/Olmo-3-1125-32B" \ - --model-name-override "**Olmo 3-32B**" \ - --dry-run - -# Step 4: Review the YAML output carefully -# Verify: Are these all benchmarks for Olmo-3-32B ONLY? -# Verify: Do the scores match the README table? - -# Step 5: If validation passes, create PR -python scripts/evaluation_manager.py extract-readme \ - --repo-id "allenai/Olmo-3-1125-32B" \ - --model-name-override "**Olmo 3-32B**" \ - --create-pr - -# Step 6: Validate the model card after update -python scripts/evaluation_manager.py show \ - --repo-id "allenai/Olmo-3-1125-32B" -``` +Validation checklist: +- YAML is printed by default; compare against the README table before applying. +- Prefer `--model-column-index`; if using `--model-name-override`, the column header text must be exact. +- For transposed tables (models as rows), ensure only one row is extracted. ### Method 2: Import from Artificial Analysis @@ -267,46 +177,42 @@ python scripts/run_eval_job.py \ ### Commands Reference -**List Available Commands:** +**Top-level help and version:** +```bash +uv run scripts/evaluation_manager.py --help +uv run scripts/evaluation_manager.py --version +``` + +**Inspect Tables (start here):** ```bash -python scripts/evaluation_manager.py --help +uv run scripts/evaluation_manager.py inspect-tables --repo-id "username/model-name" ``` **Extract from README:** ```bash -python scripts/evaluation_manager.py extract-readme \ +uv run scripts/evaluation_manager.py extract-readme \ --repo-id "username/model-name" \ + --table N \ + [--model-column-index N] \ + [--model-name-override "Exact Column Header or Model Name"] \ [--task-type "text-generation"] \ [--dataset-name "Custom Benchmarks"] \ - [--model-name-override "Model Name From Table"] \ - [--dry-run] \ - [--create-pr] + [--apply | --create-pr] ``` -The `--model-name-override` flag is useful when: -- The model name in the table differs from the repo name -- Working with transposed tables where models are listed with different formatting -- The script cannot automatically match the model name - **Import from Artificial Analysis:** ```bash -python scripts/evaluation_manager.py import-aa \ +AA_API_KEY=... 
uv run scripts/evaluation_manager.py import-aa \ --creator-slug "creator-name" \ --model-name "model-slug" \ --repo-id "username/model-name" \ [--create-pr] ``` -**View Current Evaluations:** -```bash -python scripts/evaluation_manager.py show \ - --repo-id "username/model-name" -``` - -**Validate Model-Index:** +**View / Validate:** ```bash -python scripts/evaluation_manager.py validate \ - --repo-id "username/model-name" +uv run scripts/evaluation_manager.py show --repo-id "username/model-name" +uv run scripts/evaluation_manager.py validate --repo-id "username/model-name" ``` **Run Evaluation Job:** @@ -354,41 +260,6 @@ model-index: WARNING: Do not use markdown formatting in the model name. Use the exact name from the table. Only use urls in the source.url field. -### Advanced Usage - -**Extract Multiple Tables:** -```bash -# The script automatically detects and processes all evaluation tables -python scripts/evaluation_manager.py extract-readme \ - --repo-id "username/model-name" \ - --merge-tables -``` - -**Custom Metric Mapping:** -```bash -# Use a JSON file to map column names to metric types -python scripts/evaluation_manager.py extract-readme \ - --repo-id "username/model-name" \ - --metric-mapping "$(cat metric_mapping.json)" -``` - -Example `metric_mapping.json`: -```json -{ - "MMLU": {"type": "mmlu", "name": "Massive Multitask Language Understanding"}, - "HumanEval": {"type": "humaneval", "name": "Code Generation (HumanEval)"}, - "GSM8K": {"type": "gsm8k", "name": "Grade School Math"} -} -``` - -**Batch Processing:** -```bash -# Process multiple models from a list -while read repo_id; do - python scripts/evaluation_manager.py extract-readme --repo-id "$repo_id" -done < models.txt -``` - ### Error Handling - **Table Not Found**: Script will report if no evaluation tables are detected - **Invalid Format**: Clear error messages for malformed tables @@ -399,15 +270,15 @@ done < models.txt ### Best Practices -1. **ALWAYS Validate Extraction**: Use `--dry-run` first and manually verify all extracted scores match the README exactly before pushing -2. **Check for Transposed Tables**: If the README has comparison tables with multiple models, verify only YOUR model's scores were extracted -3. **Validate After Updates**: Run `validate` and `show` commands to ensure proper formatting -4. **Source Attribution**: Include source information for traceability -5. **Regular Updates**: Keep evaluation scores current as new benchmarks emerge -6. **Create PRs for Others**: Use `--create-pr` when updating models you don't own -7. **Monitor Costs**: Evaluation Jobs are billed by usage. Ensure you check running jobs and costs -8. **One model per repo**: Only add one model's 'results' to the model-index. The main model of the repo. No derivatives or forks! -9. **Markdown formatting**: Never use markdown formatting in the model name. Use the exact name from the table. Only use urls in the source.url field. +1. **Always start with `inspect-tables`**: See table structure and get the correct extraction command +2. **Use `--help` for guidance**: Run `inspect-tables --help` to see the complete workflow +3. **Preview first**: Default behavior prints YAML; review it before using `--apply` or `--create-pr` +4. **Verify extracted values**: Compare YAML output against the README table manually +5. **Use `--table N` for multi-table READMEs**: Required when multiple evaluation tables exist +6. **Use `--model-name-override` for comparison tables**: Copy the exact column header from `inspect-tables` output +7. 
**Create PRs for Others**: Use `--create-pr` when updating models you don't own +8. **One model per repo**: Only add the main model's results to model-index +9. **No markdown in YAML names**: The model name field in YAML should be plain text ### Model Name Matching diff --git a/hf_model_evaluation/scripts/evaluation_manager.py b/hf_model_evaluation/scripts/evaluation_manager.py index 4ecc32f..b58e89b 100644 --- a/hf_model_evaluation/scripts/evaluation_manager.py +++ b/hf_model_evaluation/scripts/evaluation_manager.py @@ -2,6 +2,7 @@ # requires-python = ">=3.13" # dependencies = [ # "huggingface-hub>=1.1.4", +# "markdown-it-py>=3.0.0", # "python-dotenv>=1.2.1", # "pyyaml>=6.0.3", # "requests>=2.32.5", @@ -21,17 +22,61 @@ import argparse import os import re +from textwrap import dedent from typing import Any, Dict, List, Optional, Tuple -import dotenv -import requests -import yaml -from huggingface_hub import ModelCard -dotenv.load_dotenv() +def load_env() -> None: + """Load .env if python-dotenv is available; keep help usable without it.""" + try: + import dotenv # type: ignore + except ModuleNotFoundError: + return + dotenv.load_dotenv() + + +def require_markdown_it(): + try: + from markdown_it import MarkdownIt # type: ignore + except ModuleNotFoundError as exc: + raise ModuleNotFoundError( + "markdown-it-py is required for table parsing. " + "Install with `uv add markdown-it-py` or `pip install markdown-it-py`." + ) from exc + return MarkdownIt + + +def require_model_card(): + try: + from huggingface_hub import ModelCard # type: ignore + except ModuleNotFoundError as exc: + raise ModuleNotFoundError( + "huggingface-hub is required for model card operations. " + "Install with `uv add huggingface_hub` or `pip install huggingface-hub`." + ) from exc + return ModelCard + + +def require_requests(): + try: + import requests # type: ignore + except ModuleNotFoundError as exc: + raise ModuleNotFoundError( + "requests is required for Artificial Analysis import. " + "Install with `uv add requests` or `pip install requests`." + ) from exc + return requests + -HF_TOKEN = os.getenv("HF_TOKEN") -AA_API_KEY = os.getenv("AA_API_KEY") +def require_yaml(): + try: + import yaml # type: ignore + except ModuleNotFoundError as exc: + raise ModuleNotFoundError( + "PyYAML is required for YAML output. " + "Install with `uv add pyyaml` or `pip install pyyaml`." + ) from exc + return yaml # ============================================================================ @@ -275,7 +320,8 @@ def extract_metrics_from_table( header: List[str], rows: List[List[str]], table_format: str = "auto", - model_name: Optional[str] = None + model_name: Optional[str] = None, + model_column_index: Optional[int] = None ) -> List[Dict[str, Any]]: """ Extract metrics from parsed table data. 
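+
+    When `model_column_index` is given it is used directly as the score column,
+    taking precedence over matching a column by `model_name`.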
@@ -297,21 +343,28 @@ def extract_metrics_from_table( if is_transposed_table(header, rows): table_format = "transposed" else: - # Heuristic: if first row has mostly numeric values, benchmarks are columns - try: - numeric_count = sum( - 1 for cell in rows[0] if cell and - re.match(r"^\d+\.?\d*%?$", cell.replace(",", "").strip()) - ) - table_format = "columns" if numeric_count > len(rows[0]) / 2 else "rows" - except (IndexError, ValueError): + # Check if first column header is empty/generic (indicates benchmarks in rows) + first_header = header[0].lower().strip() if header else "" + is_first_col_benchmarks = not first_header or first_header in ["", "benchmark", "task", "dataset", "metric", "eval"] + + if is_first_col_benchmarks: table_format = "rows" + else: + # Heuristic: if first row has mostly numeric values, benchmarks are columns + try: + numeric_count = sum( + 1 for cell in rows[0] if cell and + re.match(r"^\d+\.?\d*%?$", cell.replace(",", "").strip()) + ) + table_format = "columns" if numeric_count > len(rows[0]) / 2 else "rows" + except (IndexError, ValueError): + table_format = "rows" if table_format == "rows": # Benchmarks are in rows, scores in columns # Try to identify the main model column if model_name is provided - target_column = None - if model_name: + target_column = model_column_index + if target_column is None and model_name: target_column = find_main_model_column(header, model_name) for row in rows: @@ -438,7 +491,9 @@ def extract_evaluations_from_readme( task_type: str = "text-generation", dataset_name: str = "Benchmarks", dataset_type: str = "benchmark", - model_name_override: Optional[str] = None + model_name_override: Optional[str] = None, + table_index: Optional[int] = None, + model_column_index: Optional[int] = None ) -> Optional[List[Dict[str, Any]]]: """ Extract evaluation results from a model's README. @@ -448,13 +503,17 @@ def extract_evaluations_from_readme( task_type: Task type for model-index (e.g., "text-generation") dataset_name: Name for the benchmark dataset dataset_type: Type identifier for the dataset - model_name_override: Override model name for matching (useful for transposed tables) + model_name_override: Override model name for matching (column header for comparison tables) + table_index: 1-indexed table number from inspect-tables output Returns: Model-index formatted results or None if no evaluations found """ try: - card = ModelCard.load(repo_id, token=HF_TOKEN) + load_env() + ModelCard = require_model_card() + hf_token = os.getenv("HF_TOKEN") + card = ModelCard.load(repo_id, token=hf_token) readme_content = card.content if not readme_content: @@ -468,28 +527,59 @@ def extract_evaluations_from_readme( else: model_name = repo_id.split("/")[-1] if "/" in repo_id else repo_id - # Extract all tables - tables = extract_tables_from_markdown(readme_content) + # Use markdown-it parser for accurate table extraction + all_tables = extract_tables_with_parser(readme_content) - if not tables: + if not all_tables: print(f"No tables found in README for {repo_id}") return None - # Parse and filter evaluation tables + # If table_index specified, use that specific table + if table_index is not None: + if table_index < 1 or table_index > len(all_tables): + print(f"Invalid table index {table_index}. 
Found {len(all_tables)} tables.") + print("Run inspect-tables to see available tables.") + return None + tables_to_process = [all_tables[table_index - 1]] + else: + # Filter to evaluation tables only + eval_tables = [] + for table in all_tables: + header = table.get("headers", []) + rows = table.get("rows", []) + if is_evaluation_table(header, rows): + eval_tables.append(table) + + if len(eval_tables) > 1: + print(f"\n⚠ Found {len(eval_tables)} evaluation tables.") + print("Run inspect-tables first, then use --table to select one:") + print(f' uv run scripts/evaluation_manager.py inspect-tables --repo-id "{repo_id}"') + return None + elif len(eval_tables) == 0: + print(f"No evaluation tables found in README for {repo_id}") + return None + + tables_to_process = eval_tables + + # Extract metrics from selected table(s) all_metrics = [] - for table_str in tables: - header, rows = parse_markdown_table(table_str) - - if is_evaluation_table(header, rows): - metrics = extract_metrics_from_table(header, rows, model_name=model_name) - all_metrics.extend(metrics) + for table in tables_to_process: + header = table.get("headers", []) + rows = table.get("rows", []) + metrics = extract_metrics_from_table( + header, + rows, + model_name=model_name, + model_column_index=model_column_index + ) + all_metrics.extend(metrics) if not all_metrics: - print(f"No evaluation tables found in README for {repo_id}") + print(f"No metrics extracted from table") return None # Build model-index structure - model_name = repo_id.split("/")[-1] if "/" in repo_id else repo_id + display_name = repo_id.split("/")[-1] if "/" in repo_id else repo_id results = [{ "task": {"type": task_type}, @@ -511,6 +601,185 @@ def extract_evaluations_from_readme( return None +# ============================================================================ +# Table Inspection (using markdown-it-py for accurate parsing) +# ============================================================================ + + +def extract_tables_with_parser(markdown_content: str) -> List[Dict[str, Any]]: + """ + Extract tables from markdown using markdown-it-py parser. + Uses GFM (GitHub Flavored Markdown) which includes table support. + """ + MarkdownIt = require_markdown_it() + # Disable linkify to avoid optional dependency errors; not needed for table parsing. 
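+    # The "gfm-like" preset is what enables table tokens; pipe tables inside fenced
+    # code blocks are tokenized as code, so they never appear in the walk below.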
+ md = MarkdownIt("gfm-like", {"linkify": False}) + tokens = md.parse(markdown_content) + + tables = [] + i = 0 + while i < len(tokens): + token = tokens[i] + + if token.type == "table_open": + table_data = {"headers": [], "rows": []} + current_row = [] + in_header = False + + i += 1 + while i < len(tokens) and tokens[i].type != "table_close": + t = tokens[i] + if t.type == "thead_open": + in_header = True + elif t.type == "thead_close": + in_header = False + elif t.type == "tr_open": + current_row = [] + elif t.type == "tr_close": + if in_header: + table_data["headers"] = current_row + else: + table_data["rows"].append(current_row) + current_row = [] + elif t.type == "inline": + current_row.append(t.content.strip()) + i += 1 + + if table_data["headers"] or table_data["rows"]: + tables.append(table_data) + + i += 1 + + return tables + + +def detect_table_format(table: Dict[str, Any], repo_id: str) -> Dict[str, Any]: + """Analyze a table to detect its format and identify model columns.""" + headers = table.get("headers", []) + rows = table.get("rows", []) + + if not headers or not rows: + return {"format": "unknown", "columns": headers, "model_columns": [], "row_count": 0, "sample_rows": []} + + first_header = headers[0].lower() if headers else "" + is_first_col_benchmarks = not first_header or first_header in ["", "benchmark", "task", "dataset", "metric", "eval"] + + # Check for numeric columns + numeric_columns = [] + for col_idx in range(1, len(headers)): + numeric_count = 0 + for row in rows[:5]: + if col_idx < len(row): + try: + val = re.sub(r'\s*\([^)]*\)', '', row[col_idx]) + float(val.replace("%", "").replace(",", "").strip()) + numeric_count += 1 + except (ValueError, AttributeError): + pass + if numeric_count > len(rows[:5]) / 2: + numeric_columns.append(col_idx) + + # Determine format + if is_first_col_benchmarks and len(numeric_columns) > 1: + format_type = "comparison" + elif is_first_col_benchmarks and len(numeric_columns) == 1: + format_type = "simple" + elif len(numeric_columns) > len(headers) / 2: + format_type = "transposed" + else: + format_type = "unknown" + + # Find model columns + model_columns = [] + model_name = repo_id.split("/")[-1] if "/" in repo_id else repo_id + model_tokens, _ = normalize_model_name(model_name) + + for idx, header in enumerate(headers): + if idx == 0 and is_first_col_benchmarks: + continue + if header: + header_tokens, _ = normalize_model_name(header) + is_match = model_tokens == header_tokens + is_partial = model_tokens.issubset(header_tokens) or header_tokens.issubset(model_tokens) + model_columns.append({ + "index": idx, + "header": header, + "is_exact_match": is_match, + "is_partial_match": is_partial and not is_match + }) + + return { + "format": format_type, + "columns": headers, + "model_columns": model_columns, + "row_count": len(rows), + "sample_rows": [row[0] for row in rows[:5] if row] + } + + +def inspect_tables(repo_id: str) -> None: + """Inspect and display all evaluation tables in a model's README.""" + try: + load_env() + ModelCard = require_model_card() + hf_token = os.getenv("HF_TOKEN") + card = ModelCard.load(repo_id, token=hf_token) + readme_content = card.content + + if not readme_content: + print(f"No README content found for {repo_id}") + return + + tables = extract_tables_with_parser(readme_content) + + if not tables: + print(f"No tables found in README for {repo_id}") + return + + print(f"\n{'='*70}") + print(f"Tables found in README for: {repo_id}") + print(f"{'='*70}") + + eval_table_count = 0 + for table in tables: + 
analysis = detect_table_format(table, repo_id) + + if analysis["format"] == "unknown" and not analysis.get("sample_rows"): + continue + + eval_table_count += 1 + print(f"\n## Table {eval_table_count}") + print(f" Format: {analysis['format']}") + print(f" Rows: {analysis['row_count']}") + + print(f"\n Columns ({len(analysis['columns'])}):") + for col_info in analysis.get("model_columns", []): + idx = col_info["index"] + header = col_info["header"] + if col_info["is_exact_match"]: + print(f" [{idx}] {header} ✓ EXACT MATCH") + elif col_info["is_partial_match"]: + print(f" [{idx}] {header} ~ partial match") + else: + print(f" [{idx}] {header}") + + if analysis.get("sample_rows"): + print(f"\n Sample rows (first column):") + for row_val in analysis["sample_rows"][:5]: + print(f" - {row_val}") + + if eval_table_count == 0: + print("\nNo evaluation tables detected.") + else: + print("\nSuggested next step:") + print(f' uv run scripts/evaluation_manager.py extract-readme --repo-id "{repo_id}" --table [--model-column-index ]') + + print(f"\n{'='*70}\n") + + except Exception as e: + print(f"Error inspecting tables: {e}") + + # ============================================================================ # Method 2: Import from Artificial Analysis # ============================================================================ @@ -527,12 +796,16 @@ def get_aa_model_data(creator_slug: str, model_name: str) -> Optional[Dict[str, Returns: Model data dictionary or None if not found """ + load_env() + AA_API_KEY = os.getenv("AA_API_KEY") if not AA_API_KEY: raise ValueError("AA_API_KEY environment variable is not set") url = "https://artificialanalysis.ai/api/v2/data/llms/models" headers = {"x-api-key": AA_API_KEY} + requests = require_requests() + try: response = requests.get(url, headers=headers, timeout=30) response.raise_for_status() @@ -650,12 +923,15 @@ def update_model_card_with_evaluations( Returns: True if successful, False otherwise """ - if not HF_TOKEN: - raise ValueError("HF_TOKEN environment variable is not set") - try: + load_env() + ModelCard = require_model_card() + hf_token = os.getenv("HF_TOKEN") + if not hf_token: + raise ValueError("HF_TOKEN environment variable is not set") + # Load existing card - card = ModelCard.load(repo_id, token=HF_TOKEN) + card = ModelCard.load(repo_id, token=hf_token) # Get model name model_name = repo_id.split("/")[-1] if "/" in repo_id else repo_id @@ -693,7 +969,7 @@ def update_model_card_with_evaluations( # Push update card.push_to_hub( repo_id, - token=HF_TOKEN, + token=hf_token, commit_message=commit_message, commit_description=commit_description, create_pr=create_pr @@ -711,7 +987,10 @@ def update_model_card_with_evaluations( def show_evaluations(repo_id: str) -> None: """Display current evaluations in a model card.""" try: - card = ModelCard.load(repo_id, token=HF_TOKEN) + load_env() + ModelCard = require_model_card() + hf_token = os.getenv("HF_TOKEN") + card = ModelCard.load(repo_id, token=hf_token) if "model-index" not in card.data: print(f"No model-index found in {repo_id}") @@ -756,7 +1035,10 @@ def show_evaluations(repo_id: str) -> None: def validate_model_index(repo_id: str) -> bool: """Validate model-index format in a model card.""" try: - card = ModelCard.load(repo_id, token=HF_TOKEN) + load_env() + ModelCard = require_model_card() + hf_token = os.getenv("HF_TOKEN") + card = ModelCard.load(repo_id, token=hf_token) if "model-index" not in card.data: print(f"✗ No model-index found in {repo_id}") @@ -805,28 +1087,85 @@ def 
validate_model_index(repo_id: str) -> bool: def main(): parser = argparse.ArgumentParser( - description="Manage evaluation results in Hugging Face model cards" + description=( + "Manage evaluation results in Hugging Face model cards.\n\n" + "Use standard Python or `uv run scripts/evaluation_manager.py ...` " + "to auto-resolve dependencies from the PEP 723 header." + ), + formatter_class=argparse.RawTextHelpFormatter, + epilog=dedent( + """\ + Typical workflows: + - Inspect tables first: + uv run scripts/evaluation_manager.py inspect-tables --repo-id + - Extract from README (prints YAML by default): + uv run scripts/evaluation_manager.py extract-readme --repo-id --table N + - Apply changes: + uv run scripts/evaluation_manager.py extract-readme --repo-id --table N --apply + - Import from Artificial Analysis: + AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa --creator-slug org --model-name slug --repo-id + + Tips: + - YAML is printed by default; use --apply or --create-pr to write changes. + - Set HF_TOKEN (and AA_API_KEY for import-aa); .env is loaded automatically if python-dotenv is installed. + - When multiple tables exist, run inspect-tables then select with --table N. + - To apply changes (push or PR), rerun extract-readme with --apply or --create-pr. + """ + ), ) + parser.add_argument("--version", action="version", version="evaluation_manager 1.2.0") subparsers = parser.add_subparsers(dest="command", help="Command to execute") # Extract from README command extract_parser = subparsers.add_parser( "extract-readme", - help="Extract evaluation tables from model README" + help="Extract evaluation tables from model README", + formatter_class=argparse.RawTextHelpFormatter, + description="Parse README tables into model-index YAML. Default behavior prints YAML; use --apply/--create-pr to write changes.", + epilog=dedent( + """\ + Examples: + uv run scripts/evaluation_manager.py extract-readme --repo-id username/model + uv run scripts/evaluation_manager.py extract-readme --repo-id username/model --table 2 --model-column-index 3 + uv run scripts/evaluation_manager.py extract-readme --repo-id username/model --table 2 --model-name-override \"**Model 7B**\" # exact header text + uv run scripts/evaluation_manager.py extract-readme --repo-id username/model --table 2 --create-pr + + Apply changes: + - Default: prints YAML to stdout (no writes). + - Add --apply to push directly, or --create-pr to open a PR. + Model selection: + - Preferred: --model-column-index
+ - If using --model-name-override, copy the column header text exactly. + """ + ), ) extract_parser.add_argument("--repo-id", type=str, required=True, help="HF repository ID") - extract_parser.add_argument("--task-type", type=str, default="text-generation", help="Task type") + extract_parser.add_argument("--table", type=int, help="Table number (1-indexed, from inspect-tables output)") + extract_parser.add_argument("--model-column-index", type=int, help="Preferred: column index from inspect-tables output (exact selection)") + extract_parser.add_argument("--model-name-override", type=str, help="Exact column header/model name for comparison/transpose tables (when index is not used)") + extract_parser.add_argument("--task-type", type=str, default="text-generation", help="Sets model-index task.type (e.g., text-generation, summarization)") extract_parser.add_argument("--dataset-name", type=str, default="Benchmarks", help="Dataset name") extract_parser.add_argument("--dataset-type", type=str, default="benchmark", help="Dataset type") - extract_parser.add_argument("--model-name-override", type=str, help="Override model name for table matching") extract_parser.add_argument("--create-pr", action="store_true", help="Create PR instead of direct push") - extract_parser.add_argument("--dry-run", action="store_true", help="Preview without updating") + extract_parser.add_argument("--apply", action="store_true", help="Apply changes (default is to print YAML only)") + extract_parser.add_argument("--dry-run", action="store_true", help="Preview YAML without updating (default)") # Import from AA command aa_parser = subparsers.add_parser( "import-aa", - help="Import evaluation scores from Artificial Analysis" + help="Import evaluation scores from Artificial Analysis", + formatter_class=argparse.RawTextHelpFormatter, + description="Fetch scores from Artificial Analysis API and write them into model-index.", + epilog=dedent( + """\ + Examples: + AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa --creator-slug anthropic --model-name claude-sonnet-4 --repo-id username/model + uv run scripts/evaluation_manager.py import-aa --creator-slug openai --model-name gpt-4o --repo-id username/model --create-pr + + Requires: AA_API_KEY in env (or .env if python-dotenv installed). + """ + ), ) aa_parser.add_argument("--creator-slug", type=str, required=True, help="AA creator slug") aa_parser.add_argument("--model-name", type=str, required=True, help="AA model name") @@ -836,71 +1175,114 @@ def main(): # Show evaluations command show_parser = subparsers.add_parser( "show", - help="Display current evaluations in model card" + help="Display current evaluations in model card", + formatter_class=argparse.RawTextHelpFormatter, + description="Print model-index content from the model card (requires HF_TOKEN for private repos).", ) show_parser.add_argument("--repo-id", type=str, required=True, help="HF repository ID") # Validate command validate_parser = subparsers.add_parser( "validate", - help="Validate model-index format" + help="Validate model-index format", + formatter_class=argparse.RawTextHelpFormatter, + description="Schema sanity check for model-index section of the card.", ) validate_parser.add_argument("--repo-id", type=str, required=True, help="HF repository ID") + # Inspect tables command + inspect_parser = subparsers.add_parser( + "inspect-tables", + help="Inspect tables in README → outputs suggested extract-readme command", + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=""" +Workflow: + 1. 
inspect-tables → see table structure, columns, and table numbers + 2. extract-readme → run with --table N (from step 1); YAML prints by default + 3. apply changes → rerun extract-readme with --apply or --create-pr + +Reminder: + - Preferred: use --model-column-index . If needed, use --model-name-override with the exact column header text. +""" + ) + inspect_parser.add_argument("--repo-id", type=str, required=True, help="HF repository ID") + args = parser.parse_args() if not args.command: parser.print_help() return - # Execute command - if args.command == "extract-readme": - results = extract_evaluations_from_readme( - repo_id=args.repo_id, - task_type=args.task_type, - dataset_name=args.dataset_name, - dataset_type=args.dataset_type, - model_name_override=args.model_name_override - ) + try: + # Execute command + if args.command == "extract-readme": + results = extract_evaluations_from_readme( + repo_id=args.repo_id, + task_type=args.task_type, + dataset_name=args.dataset_name, + dataset_type=args.dataset_type, + model_name_override=args.model_name_override, + table_index=args.table, + model_column_index=args.model_column_index + ) - if not results: - print("No evaluations extracted") - return + if not results: + print("No evaluations extracted") + return + + apply_changes = args.apply or args.create_pr + + # Default behavior: print YAML (dry-run) + yaml = require_yaml() + print("\nExtracted evaluations (YAML):") + print( + yaml.dump( + {"model-index": [{"name": args.repo_id.split('/')[-1], "results": results}]}, + sort_keys=False + ) + ) + + if apply_changes: + if args.model_name_override and args.model_column_index is not None: + print("Note: --model-column-index takes precedence over --model-name-override.") + update_model_card_with_evaluations( + repo_id=args.repo_id, + results=results, + create_pr=args.create_pr, + commit_message="Extract evaluation results from README" + ) + + elif args.command == "import-aa": + results = import_aa_evaluations( + creator_slug=args.creator_slug, + model_name=args.model_name, + repo_id=args.repo_id + ) + + if not results: + print("No evaluations imported") + return - if args.dry_run: - print("\nPreview of extracted evaluations:") - print(yaml.dump({"model-index": [{"name": args.repo_id.split("/")[-1], "results": results}]}, sort_keys=False)) - else: update_model_card_with_evaluations( repo_id=args.repo_id, results=results, create_pr=args.create_pr, - commit_message="Extract evaluation results from README" + commit_message=f"Add Artificial Analysis evaluations for {args.model_name}" ) - elif args.command == "import-aa": - results = import_aa_evaluations( - creator_slug=args.creator_slug, - model_name=args.model_name, - repo_id=args.repo_id - ) - - if not results: - print("No evaluations imported") - return - - update_model_card_with_evaluations( - repo_id=args.repo_id, - results=results, - create_pr=args.create_pr, - commit_message=f"Add Artificial Analysis evaluations for {args.model_name}" - ) + elif args.command == "show": + show_evaluations(args.repo_id) - elif args.command == "show": - show_evaluations(args.repo_id) + elif args.command == "validate": + validate_model_index(args.repo_id) - elif args.command == "validate": - validate_model_index(args.repo_id) + elif args.command == "inspect-tables": + inspect_tables(args.repo_id) + except ModuleNotFoundError as exc: + # Surface dependency hints cleanly when user only needs help output + print(exc) + except Exception as exc: + print(f"Error: {exc}") if __name__ == "__main__":