From 20472073a8f4e5e3d7cdd34da8e2d2c597f66bf2 Mon Sep 17 00:00:00 2001 From: evalstate <1936278+evalstate@users.noreply.github.com> Date: Fri, 28 Nov 2025 19:25:31 +0000 Subject: [PATCH 1/4] update scripts and skill for table handling --- hf_model_evaluation/SKILL.md | 114 +++++-- .../scripts/evaluation_manager.py | 312 ++++++++++++++++-- 2 files changed, 379 insertions(+), 47 deletions(-) diff --git a/hf_model_evaluation/SKILL.md b/hf_model_evaluation/SKILL.md index 3a82c11..d123c78 100644 --- a/hf_model_evaluation/SKILL.md +++ b/hf_model_evaluation/SKILL.md @@ -17,20 +17,33 @@ This skill provides tools to add structured evaluation results to Hugging Face m # Dependencies - huggingface_hub>=0.26.0 +- markdown-it-py>=3.0.0 - python-dotenv>=1.2.1 - pyyaml>=6.0.3 - requests>=2.32.5 - inspect-ai>=0.3.0 - re (built-in) +# IMPORTANT: Using This Skill + +**Always run `--help` to get guidance on table extraction and YAML generation:** +```bash +python scripts/evaluation_manager.py --help +python scripts/evaluation_manager.py inspect-tables --help +python scripts/evaluation_manager.py extract-readme --help +``` + +The `--help` output includes workflow guidance for converting tables to YAML. + # Core Capabilities -## 1. Extract Evaluation Tables from README -- **Parse Markdown Tables**: Automatically detect and parse evaluation tables in model READMEs -- **Multiple Table Support**: Handle models with multiple benchmark tables -- **Format Detection**: Recognize common evaluation table formats (benchmarks as rows/columns, or transposed with models as rows) -- **Smart Model Matching**: Find and extract scores for specific models in comparison tables -- **Smart Conversion**: Convert parsed tables to model-index YAML format +## 1. 
Inspect and Extract Evaluation Tables from README +- **Inspect Tables**: Use `inspect-tables` to see all tables in a README with their structure, columns, and suggested extraction commands +- **Parse Markdown Tables**: Accurate parsing using markdown-it-py (ignores code blocks and examples) +- **Table Selection**: Use `--table N` to extract from a specific table (required when multiple tables exist) +- **Format Detection**: Recognize common formats (benchmarks as rows, columns, or comparison tables with multiple models) +- **Column Matching**: Automatically identify model columns, with `--model-name-override` for comparison tables +- **YAML Generation**: Convert selected table to model-index YAML format ## 2. Import from Artificial Analysis - **API Integration**: Fetch benchmark scores directly from Artificial Analysis @@ -63,29 +76,72 @@ The skill includes Python scripts in `scripts/` to perform operations. ### Method 1: Extract from README -Extract evaluation tables from a model's existing README and add them to model-index metadata. +Extract evaluation tables from a model's existing README and convert to model-index YAML. -**Basic Usage:** +#### Recommended Workflow: Inspect Tables First + +**Step 1: Inspect the tables** to see structure and get the extraction command: ```bash -python scripts/evaluation_manager.py extract-readme \ - --repo-id "username/model-name" +python scripts/evaluation_manager.py inspect-tables --repo-id "allenai/OLMo-7B" +``` + +This outputs: +``` +====================================================================== +Tables found in README for: allenai/OLMo-7B +====================================================================== + +## Table 3 + Format: comparison + Rows: 14 + + Columns (6): + [1] [Llama 7B](...) + [2] [Llama 2 7B](...) + [5] **OLMo 7B** (ours) ~ partial match + + Sample rows (first column): + - arc_challenge + - arc_easy + - boolq + + ⚠ No exact match. 
Best candidate: **OLMo 7B** (ours) + + Suggested command: + python scripts/evaluation_manager.py extract-readme \ + --repo-id "allenai/OLMo-7B" \ + --table 3 \ + --model-name-override "**OLMo 7B** (ours)" \ + --dry-run ``` -**With Custom Task Type:** +**Step 2: Copy and run the suggested command** (with `--dry-run` to preview YAML): ```bash python scripts/evaluation_manager.py extract-readme \ - --repo-id "username/model-name" \ - --task-type "text-generation" \ - --dataset-name "Custom Benchmarks" + --repo-id "allenai/OLMo-7B" \ + --table 3 \ + --model-name-override "**OLMo 7B** (ours)" \ + --dry-run ``` -**Dry Run (Preview Only):** +**Step 3: Verify the YAML output** - check benchmark names and values match the README + +**Step 4: Apply changes** - remove `--dry-run` and optionally add `--create-pr`: ```bash python scripts/evaluation_manager.py extract-readme \ - --repo-id "username/model-name" \ - --dry-run + --repo-id "allenai/OLMo-7B" \ + --table 3 \ + --model-name-override "**OLMo 7B** (ours)" \ + --create-pr ``` +#### Key Flags + +- `--table N`: **Required when multiple tables exist.** Specifies which table to extract (1-indexed, matches `inspect-tables` output) +- `--model-name-override`: Column header text for comparison tables (e.g., `"**OLMo 7B** (ours)"`) +- `--dry-run`: Preview YAML without making changes +- `--create-pr`: Create a pull request instead of direct push + #### Supported Table Formats **Format 1: Benchmarks as Rows** @@ -272,21 +328,35 @@ python scripts/run_eval_job.py \ python scripts/evaluation_manager.py --help ``` +**Inspect Tables (start here):** +```bash +python scripts/evaluation_manager.py inspect-tables \ + --repo-id "username/model-name" +``` +Shows all tables in the README with: +- Table format (simple, comparison, transposed) +- Column headers with model match indicators +- Sample rows from first column +- **Ready-to-use `extract-readme` command** with correct `--table` and `--model-name-override` + +Run `inspect-tables --help` 
to see the full workflow. + **Extract from README:** ```bash python scripts/evaluation_manager.py extract-readme \ --repo-id "username/model-name" \ + [--table N] \ + [--model-name-override "Column Header"] \ [--task-type "text-generation"] \ [--dataset-name "Custom Benchmarks"] \ - [--model-name-override "Model Name From Table"] \ [--dry-run] \ [--create-pr] ``` -The `--model-name-override` flag is useful when: -- The model name in the table differs from the repo name -- Working with transposed tables where models are listed with different formatting -- The script cannot automatically match the model name +Key flags: +- `--table N`: Table number from `inspect-tables` output (required if multiple tables) +- `--model-name-override`: Exact column header for comparison tables +- `--dry-run`: Preview YAML output without applying **Import from Artificial Analysis:** ```bash diff --git a/hf_model_evaluation/scripts/evaluation_manager.py b/hf_model_evaluation/scripts/evaluation_manager.py index 4ecc32f..7114a22 100644 --- a/hf_model_evaluation/scripts/evaluation_manager.py +++ b/hf_model_evaluation/scripts/evaluation_manager.py @@ -2,6 +2,7 @@ # requires-python = ">=3.13" # dependencies = [ # "huggingface-hub>=1.1.4", +# "markdown-it-py>=3.0.0", # "python-dotenv>=1.2.1", # "pyyaml>=6.0.3", # "requests>=2.32.5", @@ -27,6 +28,7 @@ import requests import yaml from huggingface_hub import ModelCard +from markdown_it import MarkdownIt dotenv.load_dotenv() @@ -297,15 +299,22 @@ def extract_metrics_from_table( if is_transposed_table(header, rows): table_format = "transposed" else: - # Heuristic: if first row has mostly numeric values, benchmarks are columns - try: - numeric_count = sum( - 1 for cell in rows[0] if cell and - re.match(r"^\d+\.?\d*%?$", cell.replace(",", "").strip()) - ) - table_format = "columns" if numeric_count > len(rows[0]) / 2 else "rows" - except (IndexError, ValueError): + # Check if first column header is empty/generic (indicates benchmarks in rows) + 
first_header = header[0].lower().strip() if header else "" + is_first_col_benchmarks = not first_header or first_header in ["", "benchmark", "task", "dataset", "metric", "eval"] + + if is_first_col_benchmarks: table_format = "rows" + else: + # Heuristic: if first row has mostly numeric values, benchmarks are columns + try: + numeric_count = sum( + 1 for cell in rows[0] if cell and + re.match(r"^\d+\.?\d*%?$", cell.replace(",", "").strip()) + ) + table_format = "columns" if numeric_count > len(rows[0]) / 2 else "rows" + except (IndexError, ValueError): + table_format = "rows" if table_format == "rows": # Benchmarks are in rows, scores in columns @@ -438,7 +447,8 @@ def extract_evaluations_from_readme( task_type: str = "text-generation", dataset_name: str = "Benchmarks", dataset_type: str = "benchmark", - model_name_override: Optional[str] = None + model_name_override: Optional[str] = None, + table_index: Optional[int] = None ) -> Optional[List[Dict[str, Any]]]: """ Extract evaluation results from a model's README. 
@@ -448,7 +458,8 @@ def extract_evaluations_from_readme( task_type: Task type for model-index (e.g., "text-generation") dataset_name: Name for the benchmark dataset dataset_type: Type identifier for the dataset - model_name_override: Override model name for matching (useful for transposed tables) + model_name_override: Override model name for matching (column header for comparison tables) + table_index: 1-indexed table number from inspect-tables output Returns: Model-index formatted results or None if no evaluations found @@ -468,28 +479,54 @@ def extract_evaluations_from_readme( else: model_name = repo_id.split("/")[-1] if "/" in repo_id else repo_id - # Extract all tables - tables = extract_tables_from_markdown(readme_content) + # Use markdown-it parser for accurate table extraction + all_tables = extract_tables_with_parser(readme_content) - if not tables: + if not all_tables: print(f"No tables found in README for {repo_id}") return None - # Parse and filter evaluation tables + # If table_index specified, use that specific table + if table_index is not None: + if table_index < 1 or table_index > len(all_tables): + print(f"Invalid table index {table_index}. 
Found {len(all_tables)} tables.")
+            print("Run inspect-tables to see available tables.")
+            return None
+        tables_to_process = [all_tables[table_index - 1]]
+    else:
+        # Filter to evaluation tables only
+        eval_tables = []
+        for table in all_tables:
+            header = table.get("headers", [])
+            rows = table.get("rows", [])
+            if is_evaluation_table(header, rows):
+                eval_tables.append(table)
+
+        if len(eval_tables) > 1:
+            print(f"\n⚠ Found {len(eval_tables)} evaluation tables.")
+            print("Run inspect-tables first, then use --table to select one:")
+            print(f'  python scripts/evaluation_manager.py inspect-tables --repo-id "{repo_id}"')
+            return None
+        elif len(eval_tables) == 0:
+            print(f"No evaluation tables found in README for {repo_id}")
+            return None
+
+        tables_to_process = eval_tables
+
+    # Extract metrics from selected table(s)
     all_metrics = []
-    for table_str in tables:
-        header, rows = parse_markdown_table(table_str)
-
-        if is_evaluation_table(header, rows):
-            metrics = extract_metrics_from_table(header, rows, model_name=model_name)
-            all_metrics.extend(metrics)
+    for table in tables_to_process:
+        header = table.get("headers", [])
+        rows = table.get("rows", [])
+        metrics = extract_metrics_from_table(header, rows, model_name=model_name)
+        all_metrics.extend(metrics)
 
     if not all_metrics:
-        print(f"No evaluation tables found in README for {repo_id}")
+        print("No metrics extracted from the selected table")
         return None
 
     # Build model-index structure
-    model_name = repo_id.split("/")[-1] if "/" in repo_id else repo_id
+    display_name = repo_id.split("/")[-1] if "/" in repo_id else repo_id
 
     results = [{
         "task": {"type": task_type},
@@ -511,6 +548,211 @@ def extract_evaluations_from_readme(
         return None
 
 
+# ============================================================================
+# Table Inspection (using markdown-it-py for accurate parsing)
+# ============================================================================
+
+
+def extract_tables_with_parser(markdown_content: str) -> List[Dict[str, 
Any]]: + """ + Extract tables from markdown using markdown-it-py parser. + Uses GFM (GitHub Flavored Markdown) which includes table support. + """ + md = MarkdownIt("gfm-like") + tokens = md.parse(markdown_content) + + tables = [] + i = 0 + while i < len(tokens): + token = tokens[i] + + if token.type == "table_open": + table_data = {"headers": [], "rows": []} + current_row = [] + in_header = False + + i += 1 + while i < len(tokens) and tokens[i].type != "table_close": + t = tokens[i] + if t.type == "thead_open": + in_header = True + elif t.type == "thead_close": + in_header = False + elif t.type == "tr_open": + current_row = [] + elif t.type == "tr_close": + if in_header: + table_data["headers"] = current_row + else: + table_data["rows"].append(current_row) + current_row = [] + elif t.type == "inline": + current_row.append(t.content.strip()) + i += 1 + + if table_data["headers"] or table_data["rows"]: + tables.append(table_data) + + i += 1 + + return tables + + +def detect_table_format(table: Dict[str, Any], repo_id: str) -> Dict[str, Any]: + """Analyze a table to detect its format and identify model columns.""" + headers = table.get("headers", []) + rows = table.get("rows", []) + + if not headers or not rows: + return {"format": "unknown", "columns": headers, "model_columns": [], "row_count": 0, "sample_rows": []} + + first_header = headers[0].lower() if headers else "" + is_first_col_benchmarks = not first_header or first_header in ["", "benchmark", "task", "dataset", "metric", "eval"] + + # Check for numeric columns + numeric_columns = [] + for col_idx in range(1, len(headers)): + numeric_count = 0 + for row in rows[:5]: + if col_idx < len(row): + try: + val = re.sub(r'\s*\([^)]*\)', '', row[col_idx]) + float(val.replace("%", "").replace(",", "").strip()) + numeric_count += 1 + except (ValueError, AttributeError): + pass + if numeric_count > len(rows[:5]) / 2: + numeric_columns.append(col_idx) + + # Determine format + if is_first_col_benchmarks and 
len(numeric_columns) > 1: + format_type = "comparison" + elif is_first_col_benchmarks and len(numeric_columns) == 1: + format_type = "simple" + elif len(numeric_columns) > len(headers) / 2: + format_type = "transposed" + else: + format_type = "unknown" + + # Find model columns + model_columns = [] + model_name = repo_id.split("/")[-1] if "/" in repo_id else repo_id + model_tokens, _ = normalize_model_name(model_name) + + for idx, header in enumerate(headers): + if idx == 0 and is_first_col_benchmarks: + continue + if header: + header_tokens, _ = normalize_model_name(header) + is_match = model_tokens == header_tokens + is_partial = model_tokens.issubset(header_tokens) or header_tokens.issubset(model_tokens) + model_columns.append({ + "index": idx, + "header": header, + "is_exact_match": is_match, + "is_partial_match": is_partial and not is_match + }) + + return { + "format": format_type, + "columns": headers, + "model_columns": model_columns, + "row_count": len(rows), + "sample_rows": [row[0] for row in rows[:5] if row] + } + + +def inspect_tables(repo_id: str) -> None: + """Inspect and display all evaluation tables in a model's README.""" + try: + card = ModelCard.load(repo_id, token=HF_TOKEN) + readme_content = card.content + + if not readme_content: + print(f"No README content found for {repo_id}") + return + + tables = extract_tables_with_parser(readme_content) + + if not tables: + print(f"No tables found in README for {repo_id}") + return + + print(f"\n{'='*70}") + print(f"Tables found in README for: {repo_id}") + print(f"{'='*70}") + + eval_table_count = 0 + for table in tables: + analysis = detect_table_format(table, repo_id) + + if analysis["format"] == "unknown" and not analysis.get("sample_rows"): + continue + + eval_table_count += 1 + print(f"\n## Table {eval_table_count}") + print(f" Format: {analysis['format']}") + print(f" Rows: {analysis['row_count']}") + + print(f"\n Columns ({len(analysis['columns'])}):") + for col_info in 
analysis.get("model_columns", []): + idx = col_info["index"] + header = col_info["header"] + if col_info["is_exact_match"]: + print(f" [{idx}] {header} ✓ EXACT MATCH") + elif col_info["is_partial_match"]: + print(f" [{idx}] {header} ~ partial match") + else: + print(f" [{idx}] {header}") + + if analysis.get("sample_rows"): + print(f"\n Sample rows (first column):") + for row_val in analysis["sample_rows"][:5]: + print(f" - {row_val}") + + # Build suggested command + cmd_parts = [ + "python scripts/evaluation_manager.py extract-readme", + f'--repo-id "{repo_id}"', + f"--table {eval_table_count}" + ] + + override_value = None + if analysis["format"] == "comparison": + exact = next((c for c in analysis.get("model_columns", []) if c["is_exact_match"]), None) + if exact: + print(f"\n ✓ Column match: {exact['header']}") + else: + partial = next((c for c in analysis.get("model_columns", []) if c["is_partial_match"]), None) + if partial: + override_value = partial["header"] + print(f"\n ⚠ No exact match. Best candidate: {partial['header']}") + elif analysis.get("model_columns"): + print(f"\n ⚠ Could not identify model column. 
Options:") + for col_info in analysis.get("model_columns", []): + print(f' "{col_info["header"]}"') + override_value = analysis["model_columns"][0]["header"] + + if override_value: + cmd_parts.append(f'--model-name-override "{override_value}"') + + cmd_parts.append("--dry-run") + + print(f"\n Suggested command:") + print(f" {cmd_parts[0]} \\") + for part in cmd_parts[1:-1]: + print(f" {part} \\") + print(f" {cmd_parts[-1]}") + + if eval_table_count == 0: + print("\nNo evaluation tables detected.") + + print(f"\n{'='*70}\n") + + except Exception as e: + print(f"Error inspecting tables: {e}") + + # ============================================================================ # Method 2: Import from Artificial Analysis # ============================================================================ @@ -816,12 +1058,13 @@ def main(): help="Extract evaluation tables from model README" ) extract_parser.add_argument("--repo-id", type=str, required=True, help="HF repository ID") + extract_parser.add_argument("--table", type=int, help="Table number (1-indexed, from inspect-tables output)") + extract_parser.add_argument("--model-name-override", type=str, help="Column header for comparison tables") extract_parser.add_argument("--task-type", type=str, default="text-generation", help="Task type") extract_parser.add_argument("--dataset-name", type=str, default="Benchmarks", help="Dataset name") extract_parser.add_argument("--dataset-type", type=str, default="benchmark", help="Dataset type") - extract_parser.add_argument("--model-name-override", type=str, help="Override model name for table matching") extract_parser.add_argument("--create-pr", action="store_true", help="Create PR instead of direct push") - extract_parser.add_argument("--dry-run", action="store_true", help="Preview without updating") + extract_parser.add_argument("--dry-run", action="store_true", help="Preview YAML without updating") # Import from AA command aa_parser = subparsers.add_parser( @@ -847,6 +1090,21 @@ 
def main(): ) validate_parser.add_argument("--repo-id", type=str, required=True, help="HF repository ID") + # Inspect tables command + inspect_parser = subparsers.add_parser( + "inspect-tables", + help="Inspect tables in README → outputs suggested extract-readme command", + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=""" +Workflow: + 1. inspect-tables → see table structure and columns + 2. copy command → suggested extract-readme command + 3. run with --dry-run → preview YAML output + 4. remove --dry-run → apply changes +""" + ) + inspect_parser.add_argument("--repo-id", type=str, required=True, help="HF repository ID") + args = parser.parse_args() if not args.command: @@ -860,7 +1118,8 @@ def main(): task_type=args.task_type, dataset_name=args.dataset_name, dataset_type=args.dataset_type, - model_name_override=args.model_name_override + model_name_override=args.model_name_override, + table_index=args.table ) if not results: @@ -902,6 +1161,9 @@ def main(): elif args.command == "validate": validate_model_index(args.repo_id) + elif args.command == "inspect-tables": + inspect_tables(args.repo_id) + if __name__ == "__main__": main() From 65a4e2d97979aee100439e2e27bb4c9206253133 Mon Sep 17 00:00:00 2001 From: evalstate <1936278+evalstate@users.noreply.github.com> Date: Fri, 28 Nov 2025 19:25:47 +0000 Subject: [PATCH 2/4] update skill text (before bench run wipes it!) --- hf_model_evaluation/SKILL.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/hf_model_evaluation/SKILL.md b/hf_model_evaluation/SKILL.md index d123c78..e4749dd 100644 --- a/hf_model_evaluation/SKILL.md +++ b/hf_model_evaluation/SKILL.md @@ -469,15 +469,15 @@ done < models.txt ### Best Practices -1. **ALWAYS Validate Extraction**: Use `--dry-run` first and manually verify all extracted scores match the README exactly before pushing -2. 
**Check for Transposed Tables**: If the README has comparison tables with multiple models, verify only YOUR model's scores were extracted -3. **Validate After Updates**: Run `validate` and `show` commands to ensure proper formatting -4. **Source Attribution**: Include source information for traceability -5. **Regular Updates**: Keep evaluation scores current as new benchmarks emerge -6. **Create PRs for Others**: Use `--create-pr` when updating models you don't own -7. **Monitor Costs**: Evaluation Jobs are billed by usage. Ensure you check running jobs and costs -8. **One model per repo**: Only add one model's 'results' to the model-index. The main model of the repo. No derivatives or forks! -9. **Markdown formatting**: Never use markdown formatting in the model name. Use the exact name from the table. Only use urls in the source.url field. +1. **Always start with `inspect-tables`**: See table structure and get the correct extraction command +2. **Use `--help` for guidance**: Run `inspect-tables --help` to see the complete workflow +3. **Use `--dry-run` first**: Preview YAML output before applying changes +4. **Verify extracted values**: Compare YAML output against the README table manually +5. **Use `--table N` for multi-table READMEs**: Required when multiple evaluation tables exist +6. **Use `--model-name-override` for comparison tables**: Copy the exact column header from `inspect-tables` output +7. **Create PRs for Others**: Use `--create-pr` when updating models you don't own +8. **One model per repo**: Only add the main model's results to model-index +9. 
**No markdown in YAML names**: The model name field in YAML should be plain text ### Model Name Matching From ffdfab352a33eb437d6fdf3528ba81455dc1a57b Mon Sep 17 00:00:00 2001 From: evalstate <1936278+evalstate@users.noreply.github.com> Date: Fri, 28 Nov 2025 19:26:08 +0000 Subject: [PATCH 3/4] same as last --- hf_model_evaluation/SKILL.md | 24 +++++++++++++++--------- 1 file changed, 15 insertions(+), 9 deletions(-) diff --git a/hf_model_evaluation/SKILL.md b/hf_model_evaluation/SKILL.md index e4749dd..b2f5475 100644 --- a/hf_model_evaluation/SKILL.md +++ b/hf_model_evaluation/SKILL.md @@ -529,11 +529,23 @@ python scripts/evaluation_manager.py import-aa \ ### Troubleshooting -**Issue**: "No evaluation tables found in README" -- **Solution**: Check if README contains markdown tables with numeric scores +**Issue**: "Found N evaluation tables. Run inspect-tables first" +- **Cause**: Multiple tables exist and `--table` was not specified +- **Solution**: Run `inspect-tables` to see available tables, then use `--table N` + +**Issue**: Wrong values extracted (scores don't match README) +- **Cause**: Wrong column extracted from comparison table +- **Solution**: + 1. Run `inspect-tables` to see column headers + 2. Use `--model-name-override` with the exact column header text + 3. Use `--dry-run` to verify before applying + +**Issue**: "No tables found in README" +- **Cause**: Tables may be in code blocks or non-standard format +- **Solution**: Check README contains proper markdown tables with `|` separators **Issue**: "Could not find model 'X' in transposed table" -- **Solution**: The script will display available models. 
Use `--model-name-override` with the exact name from the list +- **Solution**: Use `--model-name-override` with the exact name from the table - **Example**: `--model-name-override "**Olmo 3-32B**"` **Issue**: "AA_API_KEY not set" @@ -542,12 +554,6 @@ python scripts/evaluation_manager.py import-aa \ **Issue**: "Token does not have write access" - **Solution**: Ensure HF_TOKEN has write permissions for the repository -**Issue**: "Model not found in Artificial Analysis" -- **Solution**: Verify creator-slug and model-name match API values - -**Issue**: "Payment required for hardware" -- **Solution**: Add a payment method to your Hugging Face account to use non-CPU hardware - ### Integration Examples **Python Script Integration:** From 5ee3e988acb9fa6cc049f16a7e74e9a84e4a2c71 Mon Sep 17 00:00:00 2001 From: evalstate <1936278+evalstate@users.noreply.github.com> Date: Fri, 28 Nov 2025 19:26:25 +0000 Subject: [PATCH 4/4] remove validaton flow --- hf_model_evaluation/SKILL.md | 82 ++++-------------------------------- 1 file changed, 8 insertions(+), 74 deletions(-) diff --git a/hf_model_evaluation/SKILL.md b/hf_model_evaluation/SKILL.md index b2f5475..ee5c67e 100644 --- a/hf_model_evaluation/SKILL.md +++ b/hf_model_evaluation/SKILL.md @@ -180,80 +180,14 @@ In this format, the script will: - Find the row matching your model name (handles bold/markdown formatting) - Extract all benchmark scores from that specific row only -#### Validating Extraction Results - -**CRITICAL**: Always validate extracted results before creating a PR or pushing changes. - -After running `extract-readme`, you MUST: - -1. **Use `--dry-run` first** to preview the extraction: -```bash -python scripts/evaluation_manager.py extract-readme \ - --repo-id "username/model-name" \ - --dry-run -``` - -2. 
**Manually verify the output**: - - Check that the correct model's scores were extracted (not other models) - - Verify benchmark names are correct - - Confirm all expected benchmarks are present - - Ensure numeric values match the README exactly - -3. **For transposed tables** (models as rows): - - Verify only ONE model's row was extracted - - Check that it matched the correct model name - - Look for warnings like "Could not find model 'X' in transposed table" - - If scores from multiple models appear, the table format was misdetected - -4. **Compare against the source**: - - Open the model README in a browser - - Cross-reference each extracted score with the table - - Verify no scores are mixed from different rows/columns - -5. **Common validation failures**: - - **Multiple models extracted**: Wrong table format detected - - **Missing benchmarks**: Column headers not recognized - - **Wrong scores**: Matched wrong model row or column - - **Empty metrics list**: Table not detected or parsing failed - -**Example validation workflow**: -```bash -# Step 1: Dry run to preview -python scripts/evaluation_manager.py extract-readme \ - --repo-id "allenai/Olmo-3-1125-32B" \ - --dry-run - -# Step 2: If model name not found in table, script shows available models -# ⚠ Could not find model 'Olmo-3-1125-32B' in transposed table -# -# Available models in table: -# 1. **Open-weight Models** -# 2. Qwen-2.5-32B -# ... -# 12. **Olmo 3-32B** -# -# Please select the correct model name from the list above. - -# Step 3: Re-run with the correct model name -python scripts/evaluation_manager.py extract-readme \ - --repo-id "allenai/Olmo-3-1125-32B" \ - --model-name-override "**Olmo 3-32B**" \ - --dry-run - -# Step 4: Review the YAML output carefully -# Verify: Are these all benchmarks for Olmo-3-32B ONLY? -# Verify: Do the scores match the README table? 
- -# Step 5: If validation passes, create PR -python scripts/evaluation_manager.py extract-readme \ - --repo-id "allenai/Olmo-3-1125-32B" \ - --model-name-override "**Olmo 3-32B**" \ - --create-pr - -# Step 6: Validate the model card after update -python scripts/evaluation_manager.py show \ - --repo-id "allenai/Olmo-3-1125-32B" -``` +#### Validation Checklist + +Before applying changes (removing `--dry-run`), verify: +- [ ] Correct table selected (use `inspect-tables` to confirm) +- [ ] Correct column extracted (check `--model-name-override` if comparison table) +- [ ] Benchmark names match the README +- [ ] Numeric values match the README exactly +- [ ] No scores from other models included ### Method 2: Import from Artificial Analysis
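Whichever method populates the metrics, the end result is the same model-index structure that the checklist above is verifying. A minimal sketch of that structure and of the "values match the README exactly" check — model name, benchmark names, and scores here are hypothetical, and this is an illustration of the shape the script builds, not its literal output:

```python
# Illustrative model-index "results" structure (task/dataset/metrics),
# matching the shape extract-readme serializes to YAML.
# All names and values below are hypothetical.
model_index = [{
    "name": "example-model",
    "results": [{
        "task": {"type": "text-generation"},
        "dataset": {"name": "Benchmarks", "type": "benchmark"},
        "metrics": [
            {"name": "arc_challenge", "type": "arc_challenge", "value": 44.5},
            {"name": "boolq", "type": "boolq", "value": 72.4},
        ],
    }],
}]

# Checklist step: every extracted value must equal the score read off
# the README table by hand. Any name in `mismatches` needs a re-check.
readme_scores = {"arc_challenge": 44.5, "boolq": 72.4}  # hypothetical
mismatches = [
    m["name"]
    for m in model_index[0]["results"][0]["metrics"]
    if readme_scores.get(m["name"]) != m["value"]
]
print("mismatches:", mismatches)
```

An empty `mismatches` list corresponds to the last two checklist items passing; a non-empty one usually means the wrong table or column was extracted.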