# Mena to Scott Catalog Matcher

**Fixed and Complete Version**

Matches stamps from the Mena catalog to the Scott catalog using multi-signal scoring.

## Key Features:
- ✅ Multi-signal scoring (denomination, color, year, perforation)
- ✅ Handles nested Scott data structure
- ✅ Enriches variety stamps with base stamp data
- ✅ Normalizes abbreviations ("2 reales" ↔ "2r", "pl brn" ↔ "pale brown")
- ✅ Color family matching
- ✅ Confidence scoring with thresholds

## 1. Imports

In [2]:
from pathlib import Path
import json
from typing import Dict, List, Any, Optional
from dataclasses import dataclass
import re
from difflib import SequenceMatcher

## 2. Data Structures

In [25]:
@dataclass
class MatchResult:
    """Represents a match between Mena and Scott catalogs"""
    mena_catalog_no: str
    scott_number: str
    confidence: str
    score: float
    signals: Dict[str, float]
    breakdown: str
    boost_reasons: List[str]
    requires_review: bool


@dataclass
class UnmatchedEntry:
    """Represents an unmatched catalog entry"""
    catalog_no: str
    denomination: str
    color: str
    reason: str

## 3. Normalization Dictionaries and Functions

In [24]:
# Color abbreviation mappings
COLOR_ABBREVIATIONS = {
    "pl brn": "pale brown", "dk brn": "dark brown", "lt bl": "light blue",
    "dk bl": "dark blue", "org": "orange", "grn": "green", "dk grn": "dark green",
    "lt grn": "light green", "yel": "yellow", "blk": "black", "scar": "scarlet",
    "car": "carmine", "vio": "violet", "pur": "purple", "brn": "brown",
    "ol": "olive", "org red": "orange red", "red brn": "red brown", "gray": "grey",
}

# Color family groupings
COLOR_FAMILIES = {
    "blue_family": ["blue", "light blue", "dark blue", "pale blue", "ultramarine",
                    "blue violet", "pale gray violet", "gray violet"],
    "red_family": ["red", "scarlet", "carmine", "rose", "vermillion", "crimson",
                   "dark red", "rose red", "lake"],
    "yellow_family": ["yellow", "orange", "lemon", "gold", "amber", "yellow green"],
    "green_family": ["green", "light green", "dark green", "olive", "emerald",
                     "yellow green"],
    "brown_family": ["brown", "pale brown", "dark brown", "sepia", "chocolate",
                     "red brown"],
}


def normalize_color(color_string: str) -> str:
    """Normalize color strings to standard format"""
    if not color_string:
        return ""
    color_lower = color_string.lower().strip()
    if color_lower in COLOR_ABBREVIATIONS:
        return COLOR_ABBREVIATIONS[color_lower]
    return " ".join(color_lower.split())

def clean_scott_color(color_string: str) -> str:
    """
    Remove overprint notation suffixes from Scott color strings.
    
    Examples:
        "carmine (Bk)" → "carmine"
        "green (R)" → "green"
        "blue vio (R)" → "blue vio"
    """
    if not color_string:
        return ""
    
    # Remove overprint suffixes: (R), (Bk), (Bl), (G), (V), etc.
    cleaned = re.sub(r'\s*\([A-Z][a-z]?\)$', '', color_string)
    
    return cleaned.strip()

def find_color_family(color: str) -> Optional[str]:
    """Find which color family a color belongs to"""
    color_normalized = normalize_color(color)
    for family, colors in COLOR_FAMILIES.items():
        if color_normalized in colors:
            return family
    return None


def calculate_color_family_similarity(color1: str, color2: str) -> float:
    """Calculate similarity between two colors based on color families"""
    norm1 = normalize_color(color1)
    norm2 = normalize_color(color2)
    if norm1 == norm2:
        return 1.0
    family1 = find_color_family(norm1)
    family2 = find_color_family(norm2)
    
    # CRITICAL FIX: Different color families should have LOW similarity
    if family1 and family2:
        if family1 == family2:
            return 0.85  # Same family (e.g., light blue vs dark blue)
        else:
            return 0.3   # Different families (e.g., brown vs green) - LOWERED!
    
    return SequenceMatcher(None, norm1, norm2).ratio()
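
As a quick check of the tiers above, the following minimal sketch re-implements the same three-step comparison (exact match, color family, string similarity) on deliberately trimmed tables. The names `ABBREVIATIONS`, `FAMILIES`, `normalize`, `family_of`, and `similarity` are stand-ins chosen for this example and are not part of the notebook.

```python
from difflib import SequenceMatcher
from typing import Optional

# Trimmed, illustrative tables (not the notebook's full dictionaries)
ABBREVIATIONS = {"pl brn": "pale brown", "dk bl": "dark blue"}
FAMILIES = {
    "blue_family": ["blue", "light blue", "dark blue"],
    "brown_family": ["brown", "pale brown", "dark brown"],
}


def normalize(color: str) -> str:
    """Lowercase, collapse whitespace, and expand a known abbreviation."""
    key = " ".join(color.lower().split())
    return ABBREVIATIONS.get(key, key)


def family_of(color: str) -> Optional[str]:
    """Return the family name for an already normalized color, if any."""
    for family, members in FAMILIES.items():
        if color in members:
            return family
    return None


def similarity(a: str, b: str) -> float:
    a, b = normalize(a), normalize(b)
    if a == b:
        return 1.0
    fam_a, fam_b = family_of(a), family_of(b)
    if fam_a and fam_b:
        # Both colors are known: same family scores high, different families low
        return 0.85 if fam_a == fam_b else 0.3
    # At least one color is unknown: fall back to plain string similarity
    return SequenceMatcher(None, a, b).ratio()


print(similarity("pl brn", "Pale Brown"))  # identical after normalization
print(similarity("dk bl", "light blue"))   # same family
print(similarity("dark blue", "brown"))    # different families
```

Only colors missing from every family reach the `SequenceMatcher` fallback, so the fixed 0.85 and 0.3 tiers apply whenever both sides are recognized.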

In [23]:
def parse_denomination_string(denom_string: str) -> Dict[str, Any]:
    """
    Parse Scott denomination strings including surcharges.
    
    Examples:
        '½r' → {"value": 0.5, "unit": "real"}
        '2r' → {"value": 2.0, "unit": "real"}
        '1c on 20c' → {"value": 1.0, "unit": "centavo", "surcharge": {"on_value": 20.0, "on_unit": "centavo"}}
    """
    if not denom_string:
        return {"value": None, "unit": None}
    
    denom_string = denom_string.lower().strip()
    
    # Check if it's a surcharge (contains "on")
    if " on " in denom_string:
        parts = denom_string.split(" on ")
        if len(parts) == 2:
            # Parse the new denomination (first part)
            new_denom = parse_simple_denomination(parts[0].strip())
            # Parse the original denomination (second part)
            orig_denom = parse_simple_denomination(parts[1].strip())
            
            return {
                "value": new_denom["value"],
                "unit": new_denom["unit"],
                "surcharge": {
                    "on_value": orig_denom["value"],
                    "on_unit": orig_denom["unit"]
                }
            }
    
    # Not a surcharge, parse normally
    return parse_simple_denomination(denom_string)


def fix_scott_surcharge_data(scott_stamp: Dict[str, Any]) -> Dict[str, Any]:
    """
    Fix Scott stamps where surcharge info is incorrectly split between denomination and color.
    
    Example:
        Input:  {denomination: "1c", color: "on ½r ('82)"}
        Output: {denomination: "1c on ½r", color: "surcharge color unknown",
                 original_color_field: "on ½r ('82)"}
    """
    denom = str(scott_stamp.get('denomination', '')).strip()
    color = str(scott_stamp.get('color', '')).strip()
    
    # Check if color field contains surcharge info (starts with "on")
    if color.lower().startswith('on '):
        # Reconstruct full denomination
        # Remove year markers like ('82) from color field
        surcharge_part = re.sub(r'\s*\([\'"]?\d{2}\).*$', '', color)
        full_denomination = f"{denom} {surcharge_part}"
        
        # Try to extract actual color from notes or illustration reference
        # For now, mark as unknown
        fixed_color = "surcharge color unknown"
        
        return {
            **scott_stamp,
            'denomination': full_denomination,
            'color': fixed_color,
            'original_color_field': color  # Keep for reference
        }
    
    return scott_stamp

def parse_simple_denomination(denom_string: str) -> Dict[str, Any]:
    """Parse a simple denomination string (helper function)"""
    if not denom_string:
        return {"value": None, "unit": None}
    
    denom_string = denom_string.strip()
    
    # Handle ½
    if "½" in denom_string:
        value = 0.5
        unit = re.sub(r'[½\d\s.]', '', denom_string)
    else:
        match = re.search(r'(\d+\.?\d*)', denom_string)
        if match:
            value = float(match.group(1))
        else:
            return {"value": None, "unit": None}
        unit = re.sub(r'[\d\s.]', '', denom_string)
    
    # Normalize unit
    unit = unit.strip()
    if unit == 'r':
        unit = 'real'
    elif unit == 'p':
        unit = 'peso'
    elif unit in ['c', 'ct', 'cts']:
        unit = 'centavo'
    
    return {"value": value, "unit": unit}


def normalize_denomination(value: float, unit: str) -> Dict[str, Any]:
    """Normalize Mena denomination to match Scott format"""
    unit_normalized = unit.lower().strip()
    
    # Remove plural 's' - CRITICAL FIX!
    if unit_normalized.endswith('es'):
        unit_normalized = unit_normalized[:-2]  # "reales" -> "real"
    elif unit_normalized.endswith('s'):
        unit_normalized = unit_normalized[:-1]  # "centavos" -> "centavo"
    
    # Handle special abbreviations
    if unit_normalized in ['p', 'ps']:
        unit_normalized = 'peso'
    elif unit_normalized in ['r']:
        unit_normalized = 'real'
    elif unit_normalized in ['c', 'ct']:
        unit_normalized = 'centavo'
    
    return {"value": value, "unit": unit_normalized}
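
To make the surcharge handling concrete, here is a minimal, self-contained sketch of the same parsing idea: split on `" on "`, parse each half, and map unit abbreviations to full names. `UNIT_ALIASES`, `parse_simple`, and `parse` are hypothetical names for this example only, and the sketch uses `str.partition` rather than the notebook's `split`, so it behaves slightly differently on strings with more than one `" on "`.

```python
import re
from typing import Any, Dict

# Abbreviation table mirroring the unit normalization described above
UNIT_ALIASES = {"r": "real", "p": "peso", "c": "centavo", "ct": "centavo", "cts": "centavo"}


def parse_simple(text: str) -> Dict[str, Any]:
    """Parse a plain denomination such as '2r' or '½r'."""
    text = text.strip().lower()
    if "½" in text:
        value = 0.5
    else:
        match = re.search(r"\d+\.?\d*", text)
        if not match:
            return {"value": None, "unit": None}
        value = float(match.group())
    # Whatever is left after removing digits, dots, spaces and '½' is the unit
    unit = re.sub(r"[½\d\s.]", "", text)
    return {"value": value, "unit": UNIT_ALIASES.get(unit, unit)}


def parse(text: str) -> Dict[str, Any]:
    """Parse a denomination, recording the original value of a surcharge."""
    new_part, separator, old_part = text.lower().partition(" on ")
    result = parse_simple(new_part)
    if separator:
        original = parse_simple(old_part)
        result["surcharge"] = {"on_value": original["value"], "on_unit": original["unit"]}
    return result


print(parse("½r"))
print(parse("1c on 20c"))
```

Because both halves go through the same unit table, a Mena entry recorded as `1 centavos` and a Scott entry recorded as `1c` end up with the same `(value, unit)` pair.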

## 4. Year Extraction Functions

In [22]:
import re
from typing import Any, Dict, Optional

YEAR_RE = re.compile(r'(?<!\d)(\d{4})(?!\d)')

def _year_from_value(v: Any) -> Optional[int]:
    if v is None:
        return None
    if isinstance(v, int) and 1000 <= v <= 3000:
        return v
    s = str(v)
    m = YEAR_RE.search(s)
    return int(m.group(1)) if m else None

def _year_from_range_text(text: str) -> Optional[int]:
    """
    Extract the start year from strings like '1881-82' or '1881–1882'.
    If only one 4-digit year exists, return that.
    """
    if not text:
        return None
    # Full 4-digit first; this covers '1881-1882' or any plain '1881'
    m = YEAR_RE.search(text)
    if not m:
        return None
    start_year = int(m.group(1))
    # Handle compact '1881-82' where the second part is 2 digits
    m2 = re.search(r'(?<!\d)(\d{4})\s*[-–]\s*(\d{2})(?!\d)', text)
    if m2:
        # Example: 1881-82 -> second becomes 1882 (not used here, but confirms a range)
        return int(m2.group(1))  # choose the starting year as "primary"
    return start_year

def extract_primary_year_from_issue(issue: Dict[str, Any]) -> Optional[int]:
    """
    Best-effort extraction of the 'primary' year for an issue.
    Priority order:
      1) issue_dates: placed_on_sale, probable_first_circulation, announced
      2) issue_dates: second_plate_sale, demonetized (still useful if above missing)
      3) legal_basis[].date  (earliest year)
      4) production_orders.printings[].date (earliest year)
      5) production_orders.remainders.date
      6) title or issue_id (year or start of a range like '1881-82')
    """
    issue_data = issue.get('issue_data', issue)  # allow passing issue_data directly
    issue_dates = issue_data.get('issue_dates', {}) or {}

    # 1 & 2) Strongest: when it first appeared/was sold/announced; then other date fields
    primary_order = [
        'placed_on_sale',
        'probable_first_circulation',
        'announced',
        'second_plate_sale',
        'demonetized',
    ]
    for key in primary_order:
        y = _year_from_value(issue_dates.get(key))
        if y:
            return y

    # 3) Legal basis dates (earliest)
    legal_basis = issue_data.get('legal_basis', []) or []
    lb_years = [_year_from_value(lb.get('date')) for lb in legal_basis if _year_from_value(lb.get('date'))]
    if lb_years:
        return min(lb_years)

    # 4) Production orders (printings earliest)
    prod = issue.get('production_orders') or issue_data.get('production_orders') or {}
    printings = prod.get('printings', []) or []
    printing_years: list[int] = []
    for p in printings:
        y = _year_from_value(p.get('date'))
        if y:
            printing_years.append(y)
    if printing_years:
        return min(printing_years)

    # 5) Remainders date
    remainders = prod.get('remainders') or {}
    y = _year_from_value(remainders.get('date'))
    if y:
        return y

    # 6) Title / issue_id as a last resort (handle ranges like '1881-82')
    title = issue_data.get('title') or ''
    y = _year_from_range_text(title)
    if y:
        return y

    issue_id = issue_data.get('issue_id') or ''
    y = _year_from_range_text(issue_id)
    if y:
        return y

    return None


def extract_scott_year(scott_stamp: Dict[str, Any]) -> Optional[int]:
    """Extract year from Scott stamp entry"""
    if 'year' in scott_stamp and scott_stamp['year']:
        return int(scott_stamp['year'])
    if 'header' in scott_stamp and scott_stamp['header']:
        match = re.search(r'(\d{4})', str(scott_stamp['header']))
        if match:
            return int(match.group(1))
    return None
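
The look-around in `YEAR_RE` is what keeps a four-digit year from being picked out of a longer digit run. The sketch below isolates that behavior; `start_year` is a hypothetical helper for this example and the sample strings are made up.

```python
import re
from typing import Optional

# Exactly four digits, with no digit immediately before or after
YEAR = re.compile(r"(?<!\d)(\d{4})(?!\d)")


def start_year(text: str) -> Optional[int]:
    """Return the first standalone four-digit year in the text, if any."""
    match = YEAR.search(text or "")
    return int(match.group(1)) if match else None


for sample in ["1881-82", "1881–1882", "Issue of 1863", "no date", "188182"]:
    print(repr(sample), "->", start_year(sample))
```

For a range such as `1881-82` the first match is already the starting year, which is the value the notebook treats as primary.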

## 5. Scott Data Preprocessing

This section handles:
1. Flattening nested Scott catalog structure
2. Enriching variety stamps with base stamp data
3. Adding year information from headers
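
The three steps can be sketched on a tiny grouped structure as follows. This is a simplified illustration, not the notebook's implementation: the sample data, `flatten`, and `inherit_from_base` are made up for the example, and the sketch leaves out color extraction from descriptions and number normalization.

```python
import re
from typing import Any, Dict, List

# Illustrative input: one header group holding a base stamp and one variety
grouped = [
    {
        "header": "1863 Coat of Arms",
        "stamps": [
            {"scott_number": "1", "denomination": "½r", "color": "blue"},
            {"scott_number": "1a", "variety_of": "1", "color": "light blue"},
        ],
    }
]


def flatten(groups: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Steps 1 and 3: one row per stamp, carrying the header and its year."""
    flat = []
    for group in groups:
        header = group.get("header", "")
        year_match = re.search(r"\d{4}", header)
        for stamp in group.get("stamps", []):
            row = {**stamp, "header": header}
            if year_match:
                row["year"] = int(year_match.group())
            flat.append(row)
    return flat


def inherit_from_base(flat: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Step 2: varieties take missing fields from the base stamp of the same issue."""
    bases = {
        (s["scott_number"], s.get("year"), s["header"]): s
        for s in flat
        if not s.get("variety_of")
    }
    enriched = []
    for s in flat:
        row = dict(s)
        base = bases.get((s.get("variety_of"), s.get("year"), s["header"]))
        if base:
            for field in ("denomination", "color", "perforation"):
                if not row.get(field) and base.get(field):
                    row[field] = base[field]
        enriched.append(row)
    return enriched


stamps = inherit_from_base(flatten(grouped))
for stamp in stamps:
    print(stamp["scott_number"], stamp.get("denomination"), stamp.get("color"), stamp.get("year"))
```

Keying the base lookup on number, year, and header together is what stops a variety from inheriting data from a stamp with the same number in a different issue.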

In [10]:
def enrich_variety_stamps(scott_stamps: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """
    Enrich variety stamps by inheriting data from base stamps IN THE SAME ISSUE.
    
    CRITICAL FIX: Uses (scott_number + year + header) as key to avoid cross-issue contamination.
    """
    # Build base stamp lookup with issue context
    # Key: (scott_number, year, header) to ensure we match within same issue
    base_stamps = {}
    for stamp in scott_stamps:
        if 'variety_of' not in stamp or not stamp.get('variety_of'):
            scott_no = stamp.get('scott_number', '')
            year = stamp.get('year')
            header = stamp.get('header', '')
            key = (scott_no, year, header)
            base_stamps[key] = stamp
    
    # Enrich varieties
    enriched = []
    for stamp in scott_stamps:
        stamp_copy = stamp.copy()
        
        if 'variety_of' in stamp and stamp['variety_of']:
            base_no = stamp['variety_of']
            year = stamp.get('year')
            header = stamp.get('header', '')
            
            # Look for base stamp in SAME issue
            key = (base_no, year, header)
            base_stamp = base_stamps.get(key)
            
            if base_stamp:
                # Inherit denomination if missing
                if not stamp.get('denomination') and base_stamp.get('denomination'):
                    stamp_copy['denomination'] = base_stamp['denomination']
                
                # Try to extract color from description first
                if not stamp.get('color') and stamp.get('description'):
                    desc = stamp['description'].lower()
                    color_keywords = [
                        'light blue', 'dark blue', 'pale blue', 'blue',
                        'light green', 'dark green', 'pale green', 'green',
                        'light brown', 'dark brown', 'pale brown', 'brown',
                        'light violet', 'dark violet', 'pale violet', 'violet',
                        'blue violet', 'gray violet', 'pale gray violet',
                        'scarlet', 'red', 'carmine', 'rose', 'vermillion',
                        'yellow', 'orange', 'lemon', 'gold',
                        'black', 'purple', 'gray', 'grey'
                    ]
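                    # Check longer phrases first and match whole words, so that
                    # "blue violet" is not read as "blue" and "colored" as "red"
                    for color in sorted(color_keywords, key=len, reverse=True):
                        if re.search(rf'\b{re.escape(color)}\b', desc):
                            stamp_copy['color'] = color
                            break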
                
                # If still no color, inherit from base
                if not stamp_copy.get('color') and base_stamp.get('color'):
                    stamp_copy['color'] = base_stamp['color']
                
                # Inherit perforation if missing
                if not stamp_copy.get('perforation') and base_stamp.get('perforation'):
                    stamp_copy['perforation'] = base_stamp['perforation']
            else:
                # Fallback: try to find ANY base stamp with that number (less ideal)
                for (num, _, _), base_stamp in base_stamps.items():
                    if num == base_no:
                        if not stamp_copy.get('denomination') and base_stamp.get('denomination'):
                            stamp_copy['denomination'] = base_stamp['denomination']
                        if not stamp_copy.get('color') and base_stamp.get('color'):
                            stamp_copy['color'] = base_stamp['color']
                        break
        
        enriched.append(stamp_copy)
    
    return enriched

def strip_leading_zeros(catalog_no: str) -> str:
    """
    Strip leading zeros from regular stamps, but convert Scott's "0X" notation to "OX" (Official).
    
    Scott catalog convention:
    - "01", "02", "022" → Official stamps (convert to "O1", "O2", "O22")
    - "001", "0001" → Regular stamps with leading zeros (strip to "1")
    - "C01" → Keep as-is (already has letter prefix)
    
    Examples:
        "01" → "O1" (Official #1)
        "022" → "O22" (Official #22)
        "001" → "1" (Regular stamp with extra zeros)
        "21" → "21" (no change)
        "C01" → "C01" (no change)
    """
    catalog_no = str(catalog_no).strip()
    
    # If already has letter prefix, keep as-is
    if re.match(r'^[A-Za-z]', catalog_no):
        return catalog_no
    
    # Scott notation: "0" followed by a non-zero digit and at most one more digit = Official stamps
    # Examples: "01" → "O1", "02" → "O2", "022" → "O22"
    # ("001" must not match here, otherwise it would become "O01" instead of "1")
    if re.match(r'^0[1-9]\d?$', catalog_no):
        return 'O' + catalog_no[1:]  # Replace leading 0 with O
    
    # Any remaining leading zeros are just formatting
    # Examples: "001" → "1", "0021" → "21"
    if catalog_no.startswith('0'):
        return catalog_no.lstrip('0') or '0'
    
    return catalog_no

def flatten_and_enrich_scott_data(scott_grouped_data: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Flatten Scott data from grouped structure and enrich variety stamps."""
    flat_stamps = []
    
    for group in scott_grouped_data:
        header = group.get('header', '')
        stamps = group.get('stamps', [])
        
        for stamp in stamps:
            stamp_copy = stamp.copy()
            stamp_copy['header'] = header
            
            # Normalize scott_number
            scott_no = stamp_copy.get('scott_number', '')
            stamp_copy['scott_number'] = strip_leading_zeros(scott_no)
            
            # Extract year from header
            if header:
                match = re.search(r'(\d{4})', str(header))
                if match:
                    stamp_copy['year'] = int(match.group(1))
            
            # CRITICAL FIX: Reconstruct surcharge denominations
            stamp_copy = fix_scott_surcharge_data(stamp_copy)
            
            flat_stamps.append(stamp_copy)
    
    # Enrich varieties
    enriched_stamps = enrich_variety_stamps(flat_stamps)
    
    return enriched_stamps
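
The leading-zero convention from the `strip_leading_zeros` docstring is easy to get wrong, so here is a small standalone sketch of it with the docstring's own examples. `normalize_scott_number` is a hypothetical name, and the pattern encodes one reading of the convention: a single `0` followed by a non-zero digit marks an Official stamp, while any other leading zeros are padding.

```python
import re


def normalize_scott_number(raw: str) -> str:
    """Apply the 0X -> OX Official convention, otherwise strip zero padding."""
    raw = str(raw).strip()
    if re.match(r"[A-Za-z]", raw):
        return raw  # already has a letter prefix, e.g. "C01"
    if re.fullmatch(r"0[1-9]\d?", raw):
        return "O" + raw[1:]  # "01" -> "O1", "022" -> "O22"
    return raw.lstrip("0") or "0"  # "001" -> "1", "21" -> "21"


for raw in ["01", "022", "001", "0021", "21", "C01"]:
    print(raw, "->", normalize_scott_number(raw))
```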

## 6. Matching Logic

In [11]:
def build_candidate_pool(mena_issue: Dict[str, Any], 
                         all_scott_stamps: List[Dict[str, Any]], 
                         year_tolerance: int = 2) -> List[Dict[str, Any]]:
    """Build a pool of Scott stamp candidates based on year"""
    primary_year = extract_primary_year_from_issue(mena_issue)
    if not primary_year:
        print("WARNING: No primary year found for Mena issue, returning all stamps")
        return all_scott_stamps
    
    candidates = []
    no_year_count = 0
    
    for scott_stamp in all_scott_stamps:
        scott_year = extract_scott_year(scott_stamp)
        
        if scott_year is not None:
            if abs(scott_year - primary_year) <= year_tolerance:
                candidates.append(scott_stamp)
        else:
            no_year_count += 1
    
    print(f"Found {len(candidates)} Scott candidates for year {primary_year} (±{year_tolerance} years)")
    print(f"Excluded {no_year_count} stamps without year information")
    
    # Year distribution
    year_counts = {}
    for c in candidates:
        y = extract_scott_year(c)
        year_counts[y] = year_counts.get(y, 0) + 1
    
    print(f"Year distribution: {dict(sorted(year_counts.items()))}")
    
        
    print("\nAll candidates overall:")
    for c in candidates:
        print(f"  Scott #{c.get('scott_number')}: {c.get('denomination')} {c.get('color')} (year={extract_scott_year(c)})")
    
    return candidates
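
The year window is the only filter applied at this stage. A minimal sketch of it, with made-up Scott numbers and years and a hypothetical `within_window` helper:

```python
from typing import Optional


def within_window(candidate_year: Optional[int], primary_year: int, tolerance: int = 2) -> bool:
    """True when the candidate has a year that lies within the tolerance."""
    return candidate_year is not None and abs(candidate_year - primary_year) <= tolerance


# Illustrative data only: Scott number -> year (None = no year information)
scott_years = {"1": 1863, "7": 1881, "16": 1883, "25": 1889, "X": None}

pool = [number for number, year in scott_years.items() if within_window(year, 1882)]
print(pool)
```

Stamps without a year are dropped rather than kept as wildcards, which matches the exclusion count printed by `build_candidate_pool`.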

In [None]:
def normalize_catalog_number(catalog_no: str) -> tuple:
    """
    Normalize catalog number to (category, number, suffix) for matching and sorting.
    
    CRITICAL: Only treat letter "O" as Official, NOT digit "0" (which is just formatting).
    
    Examples:
        "21" → ("", 21.0, "")
        "021" → ("", 21.0, "")  # Leading zero is just formatting
        "O21" → ("O", 21.0, "")  # Letter O means Official
        "C164" → ("C", 164.0, "")
    """
    catalog_no = str(catalog_no).strip()
    
    # Extract category prefix (ONLY letters at start, NOT digits)
    category_match = re.match(r'^([A-Za-z]+)', catalog_no)
    if category_match:
        category = category_match.group(1).upper()
        remaining = catalog_no[len(category):]
    else:
        category = ""  # Regular issue (no letter prefix)
        remaining = catalog_no
    
    # Strip leading zeros from numeric part (for sorting only)
    remaining = remaining.lstrip('0') or '0'
    
    # Extract numeric part
    number_match = re.match(r'^(\d+)', remaining)
    if number_match:
        base_num = float(number_match.group(1))
        remaining = remaining[len(number_match.group(1)):]
        
        # Extract suffix (letters after number)
        suffix_match = re.match(r'^([a-z]+)', remaining, re.IGNORECASE)
        if suffix_match:
            suffix = suffix_match.group(1).lower()
            # Convert suffix to a decimal for sorting (a=0.01, b=0.02, ..., z=0.26),
            # so late letters such as "j" cannot spill into the next whole number
            for i, char in enumerate(suffix):
                base_num += (ord(char) - ord('a') + 1) * (0.01 ** (i + 1))
        else:
            suffix = ""
    else:
        base_num = 999999.0
        suffix = ""
    
    return (category, base_num, suffix)



def make_scott_unique_key(scott_stamp: Dict[str, Any]) -> str:
    """
    Create a unique key for Scott stamps that handles duplicate numbers across years.
    
    Format: "number__year" (e.g., "1__1883", "7__1881")
    """
    scott_no = scott_stamp.get('scott_number', 'UNKNOWN')
    year = extract_scott_year(scott_stamp)
    
    if year:
        return f"{scott_no}__{year}"
    else:
        return scott_no

def get_stamp_category(stamp: Dict[str, Any], is_mena: bool = True) -> str:
    """
    Get the true category of a stamp, considering both catalog prefix and section/type.
    
    For Mena: Use section field as source of truth
    For Scott: Use catalog number prefix
    """
    if is_mena:
        # For Mena, section field is authoritative
        section = stamp.get('issue_data', {}).get('section', '') if 'issue_data' in stamp else ''
        section = section.lower().strip()
        
        # Map Mena sections to categories
        section_to_category = {
            'surface mail': '',  # Regular
            'airmail': 'C',
            'air mail': 'C',
            'official': 'O',
            'telegraph': 'T',
            'telegraphs': 'T',
            'postage due': 'J',
            'dues': 'J',
            'special delivery': 'E',
            'registration': 'F',
            'guanacaste': 'G',
        }
        
        for key, cat in section_to_category.items():
            if key in section:
                return cat
        
        # Fallback to catalog number prefix
        return normalize_catalog_number(stamp.get('catalog_no', ''))[0]
    
    else:
        # For Scott, use catalog number prefix
        return normalize_catalog_number(stamp.get('scott_number', ''))[0]

# Category priority for sorting (regular issues first, then alphabetically)
CATEGORY_PRIORITY = {
    "": 0,      # Regular issues
    "A": 1,     # Airmail (Mena)
    "AR": 2,    # Postal Fiscal (Scott)
    "B": 3,     # Semi-Postal
    "C": 4,     # Airmail (Scott) / Christmas Postal Tax (Mena)
    "CE": 5,    # Air post special delivery
    "CO": 6,    # Air mail official
    "CT": 7,    # Christmas Postal Tax
    "D": 8,     # Dues
    "E": 9,     # Essay (Mena) / Special Delivery (Scott)
    "EN": 10,   # Envelope
    "G": 11,    # Guanacaste
    "J": 12,    # Postage Due (Scott)
    "O": 13,    # Official
    "OA": 14,   # Official Airmail
    "PC": 15,   # Postal Card
    "PR": 16,   # Postal Revenue
    "PS": 17,   # Postal Seal
    "R": 18,    # Revenue
    "RA": 19,   # Postal Tax (Scott)
    "RL": 20,   # Registration Label
    "RS": 21,   # Radiograph Seal
    "SD": 22,   # Special Delivery (Mena)
    "SP": 23,   # Semi-postal / Surcharge Proof
    "SS": 24,   # Souvenir Sheet
    "T": 25,    # Telegraphs
    "TR": 26,   # Telegraph Revenue
    "TS": 27,   # Telegraph Seals
    "W": 28,    # Wrapper
}

def get_category_priority(category: str) -> int:
    """Get sorting priority for a category"""
    return CATEGORY_PRIORITY.get(category.upper(), 999)
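
For sorting, the category priority and the parsed number can be combined into one key. The sketch below is an alternative formulation, not the notebook's code: it keeps the suffix as a string inside a tuple instead of folding it into a decimal, and `PRIORITY` and `sort_key` are hypothetical names using a subset of the table above.

```python
import re

# Subset of the category priorities listed above
PRIORITY = {"": 0, "C": 4, "O": 13}


def sort_key(catalog_no: str):
    """Return (category priority, number, suffix) for a catalog number."""
    match = re.fullmatch(r"([A-Za-z]*)0*(\d+)([A-Za-z]*)", catalog_no.strip())
    if not match:
        return (999, float("inf"), catalog_no)  # unparseable entries sort last
    category = match.group(1).upper()
    number = int(match.group(2))
    suffix = match.group(3).lower()
    return (PRIORITY.get(category, 999), number, suffix)


numbers = ["C2", "21a", "O1", "3", "21", "C10", "021b"]
print(sorted(numbers, key=sort_key))
```

Tuples compare element by element, so regular issues come first, numbers sort numerically (`C2` before `C10`), and a bare number sorts ahead of its lettered varieties.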

In [None]:
def calculate_match_score(mena_stamp: Dict[str, Any], 
                         scott_stamp: Dict[str, Any], 
                         mena_issue_context: Dict[str, Any]) -> Dict[str, Any]:
    """Calculate match score using multiple signals (maximum 110 points: category 10, denomination 35, color 30, year 25, perforation 10)"""
    signals = {}
    total_score = 0.0
    breakdown_parts = []
    
    # ========== PRE-CHECK: CATEGORY MATCH (10 points, -50 if wrong) ==========
    mena_category = get_stamp_category(
        {'issue_data': mena_issue_context, 'catalog_no': mena_stamp.get('catalog_no', '')}, 
        is_mena=True
    )
    scott_category = get_stamp_category(scott_stamp, is_mena=False)
    
    # Define category equivalences
    category_compatible = False
    if mena_category == scott_category:
        category_compatible = True
    else:
        equivalences = [
            ({'', 'C'}, {'', 'C'}),
            ({'O'}, {'O', 'CO'}),
            ({'E'}, {'E', 'CE'}),
            ({'J'}, {'J'}),
            ({'G'}, {'G'}),
        ]
        for mena_set, scott_set in equivalences:
            if mena_category in mena_set and scott_category in scott_set:
                category_compatible = True
                break
    
    if not category_compatible:
        signals['category'] = -50
        total_score -= 50
        breakdown_parts.append(f"Cat: ✗ (M:{mena_category or 'REG'} vs S:{scott_category or 'REG'})")
    else:
        signals['category'] = 10
        total_score += 10
        breakdown_parts.append("Cat: ✓")
    
    # ========== SIGNAL 1: DENOMINATION (35 points) ==========
    mena_denom = normalize_denomination(
        mena_stamp['denomination']['value'], 
        mena_stamp['denomination']['unit']
    )
    scott_denom = parse_denomination_string(scott_stamp.get('denomination', ''))
    
    # Check surcharge status (coerce to bool so a missing overprint compares equal to False)
    overprint = mena_stamp.get('overprint') or {}
    mena_has_surcharge = bool(overprint.get('present')) and overprint.get('type') == 'surcharge'
    scott_has_surcharge = 'surcharge' in scott_denom
    
    # Surcharge mismatch penalty (ONLY if one has and other doesn't)
    if mena_has_surcharge != scott_has_surcharge:
        signals['surcharge_mismatch'] = -10
        total_score -= 10
        breakdown_parts.append("Surcharge: ✗")
    
    # Denomination matching
    if mena_has_surcharge and scott_has_surcharge:
        # Both are surcharges - check BOTH values
        new_match = (mena_denom['value'] == scott_denom['value'] and 
                     mena_denom['unit'] == scott_denom['unit'])
        
        mena_orig = normalize_denomination(
            mena_stamp['overprint']['on_denomination']['value'],
            mena_stamp['overprint']['on_denomination']['unit']
        )
        scott_orig = scott_denom['surcharge']
        orig_match = (mena_orig['value'] == scott_orig['on_value'] and 
                      mena_orig['unit'] == scott_orig['on_unit'])
        
        if new_match and orig_match:
            signals['denomination'] = 35
            total_score += 35
            breakdown_parts.append("Denom: ✓")
        elif new_match:
            signals['denomination'] = 20
            total_score += 20
            breakdown_parts.append("Denom: ⚠️")
        else:
            signals['denomination'] = 0
            breakdown_parts.append("Denom: ✗")
    
    elif (mena_denom['value'] == scott_denom['value'] and 
          mena_denom['unit'] == scott_denom['unit']):
        signals['denomination'] = 35
        total_score += 35
        breakdown_parts.append("Denom: ✓")
    else:
        signals['denomination'] = 0
        breakdown_parts.append("Denom: ✗")
    
    # ========== SIGNAL 2: COLOR (30 points) ==========
    if mena_stamp.get('color') and scott_stamp.get('color'):
        mena_color_raw = mena_stamp['color']
        scott_color_raw = scott_stamp['color']
        
        # CRITICAL: Clean Scott color (remove overprint suffixes)
        scott_color = clean_scott_color(scott_color_raw)
        
        # Handle compound/variant colors
        mena_colors = []
        if '/' in mena_color_raw:
            mena_colors = [c.strip() for c in mena_color_raw.split('/')]
        elif ' & ' in mena_color_raw:
            mena_colors = [c.strip() for c in mena_color_raw.split('&')]
        else:
            mena_colors = [mena_color_raw]
        
        # Try each variant, take BEST match
        best_similarity = 0.0
        best_mena_color = mena_colors[0]
        
        for mena_color in mena_colors:
            similarity = calculate_color_family_similarity(mena_color, scott_color)
            if similarity > best_similarity:
                best_similarity = similarity
                best_mena_color = mena_color
        
        color_score = best_similarity * 30
        signals['color'] = color_score
        total_score += color_score
        
        breakdown_parts.append(f"Color: {int(best_similarity*100)}%")
    else:
        signals['color'] = 0
    
    # ========== SIGNAL 3: YEAR (25 points) ==========
    mena_year = extract_primary_year_from_issue(mena_issue_context)
    scott_year = extract_scott_year(scott_stamp)
    if mena_year and scott_year:
        year_diff = abs(mena_year - scott_year)
        if year_diff == 0:
            signals['year'] = 25  # Increased from 20
            total_score += 25
            breakdown_parts.append("Year: ✓")
        elif year_diff == 1:
            signals['year'] = 15
            total_score += 15
            breakdown_parts.append("Year: ~1")
        elif year_diff == 2:
            signals['year'] = 10  # Decreased from 15
            total_score += 10
            breakdown_parts.append("Year: ~2")
    
    # ========== SIGNAL 4: PERFORATION (10 points) ==========
    mena_perf = str(mena_stamp.get('perforation', '')).strip()
    scott_perf = str(scott_stamp.get('perforation', '')).strip()
    
    if mena_perf and scott_perf:
        mena_perf_num = re.findall(r'[\d.]+', mena_perf)
        scott_perf_num = re.findall(r'[\d.]+', scott_perf)
        
        if mena_perf_num and scott_perf_num:
            if any(m == s for m in mena_perf_num for s in scott_perf_num):
                signals['perforation'] = 10
                total_score += 10
                breakdown_parts.append("Perf: ✓")
            else:
                signals['perforation'] = -5
                total_score -= 5
                breakdown_parts.append("Perf: ✗")
    
    return {
        'total_score': total_score, 
        'signals': signals, 
        'breakdown': " | ".join(breakdown_parts)
    }
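
To see how the point values in the cell above add up, the following sketch reproduces only the arithmetic. It is a simplification: the surcharge penalty and the partial denomination credit are left out, and `year_points`, `PERF_POINTS`, and `total_score` are hypothetical names.

```python
def year_points(year_diff: int) -> int:
    """25 for the same year, 15 for one year apart, 10 for two, otherwise 0."""
    return {0: 25, 1: 15, 2: 10}.get(year_diff, 0)


PERF_POINTS = {"match": 10, "mismatch": -5, "unknown": 0}


def total_score(category_ok: bool, denom_ok: bool, color_similarity: float,
                year_diff: int, perf: str) -> float:
    score = 10.0 if category_ok else -50.0   # category pre-check
    score += 35.0 if denom_ok else 0.0       # denomination
    score += color_similarity * 30.0         # color, scaled by similarity
    score += year_points(year_diff)          # year
    score += PERF_POINTS[perf]               # perforation
    return score


# Same denomination, same color family, one year apart, matching perforation
strong = total_score(True, True, 0.85, 1, "match")
# Perfect on every signal except the category
wrong_category = total_score(False, True, 1.0, 0, "match")
print(strong, wrong_category)
```

With these weights a candidate in the wrong category that is otherwise perfect lands at 50 points, which would still clear the default `min_threshold` of 30.0 used by `score_all_candidates`.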

In [None]:
def score_all_candidates(mena_issue: Dict[str, Any], 
                        scott_candidate_pool: List[Dict[str, Any]], 
                        min_threshold: float = 30.0) -> List[Dict[str, Any]]:
    """Score all Mena stamps against all Scott candidates"""
    scoring_matrix = []
    
    for mena_stamp in mena_issue['stamps']:
        mena_row = {
            'mena_catalog_no': mena_stamp['catalog_no'], 
            'candidates': []
        }
        
        for scott_candidate in scott_candidate_pool:
            score_result = calculate_match_score(
                mena_stamp, scott_candidate, mena_issue['issue_data']
            )
            
            if score_result['total_score'] >= min_threshold:
                unique_key = make_scott_unique_key(scott_candidate)
                
                mena_row['candidates'].append({
                    'scott_number': scott_candidate.get('scott_number', 'UNKNOWN'),
                    'scott_unique_key': unique_key,
                    'scott_year': extract_scott_year(scott_candidate),
                    'score': score_result['total_score'],
                    'signals': score_result['signals'],
                    'breakdown': score_result['breakdown']
                })
        
        if mena_row['candidates']:
            scoring_matrix.append(mena_row)
            
            # # DEBUG: Show scores for Mena #13
            # if mena_stamp['catalog_no'] == '13':
            #     print(f"\n[DEBUG] All candidates for Mena #13 (2c):")
            #     for cand in sorted(mena_row['candidates'], key=lambda x: -x['score'])[:10]:
            #         print(f"  Scott #{cand['scott_number']} ({cand['scott_year']}): {cand['score']:.1f} - {cand['breakdown']}")
    
    return scoring_matrix

In [None]:
def find_optimal_assignment(scoring_matrix: List[Dict[str, Any]]) -> List[MatchResult]:
    """Find optimal assignment using unique Scott keys"""
    from scipy.optimize import linear_sum_assignment
    import numpy as np
    
    mena_stamps = [row['mena_catalog_no'] for row in scoring_matrix]
    
    # CRITICAL: Use unique keys instead of just scott_number
    all_scott_keys = set()
    for row in scoring_matrix:
        for cand in row['candidates']:
            all_scott_keys.add(cand['scott_unique_key'])  # CHANGED
    
    scott_stamps = sorted(all_scott_keys)
    
    # print(f"\n[DEBUG find_optimal_assignment]")
    # print(f"  Mena stamps: {mena_stamps}")
    # print(f"  Unique Scott keys: {sorted(all_scott_keys)[:20]}")
    
    # Build cost matrix
    n_mena = len(mena_stamps)
    n_scott = len(scott_stamps)
    max_dim = max(n_mena, n_scott)
    
    cost_matrix = np.full((max_dim, max_dim), 1000.0)
    
    # CRITICAL: Use unique keys for lookup
    scott_to_idx = {scott_key: i for i, scott_key in enumerate(scott_stamps)}
    
    for i, row in enumerate(scoring_matrix):
        for cand in row['candidates']:
            scott_key = cand['scott_unique_key']  # CHANGED
            if scott_key in scott_to_idx:
                j = scott_to_idx[scott_key]
                cost_matrix[i, j] = -cand['score']
    
    # Find optimal assignment
    mena_indices, scott_indices = linear_sum_assignment(cost_matrix)
    
    # Build results
    assignments = []
    details_map = {}
    
    # CRITICAL: Build lookup with unique keys
    for row in scoring_matrix:
        for cand in row['candidates']:
            key = (row['mena_catalog_no'], cand['scott_unique_key'])  # CHANGED
            details_map[key] = cand
    
    for mena_idx, scott_idx in zip(mena_indices, scott_indices):
        if mena_idx >= n_mena or scott_idx >= n_scott:
            continue
        
        mena_no = mena_stamps[mena_idx]
        scott_key = scott_stamps[scott_idx]  # This is now "7__1883" format
        key = (mena_no, scott_key)
        
        if key not in details_map:
            continue
        
        cand = details_map[key]
        score = cand['score']
        
        if score < 30:
            continue
        
        confidence = "HIGH" if score >= 70 else "MEDIUM" if score >= 50 else "LOW"
        requires_review = score < 70
        
        # Display with year for clarity
        scott_display = f"{cand['scott_number']} ({cand.get('scott_year', '?')})"
        
        assignments.append(MatchResult(
            mena_catalog_no=mena_no,
            scott_number=scott_display,  # CHANGED to show year
            confidence=confidence,
            score=score,
            signals=cand['signals'],
            breakdown=cand['breakdown'],
            boost_reasons=[],
            requires_review=requires_review
        ))
    
    assignments.sort(key=lambda x: normalize_catalog_number(x.mena_catalog_no)[1])
    
    return assignments

def extract_numeric_prefix(catalog_no: str) -> float:
    """
    Extract numeric prefix from catalog number for sorting.
    
    Examples:
        "17" → 17.0
        "17a" → 17.1
        "21" → 21.0
        "22" → 22.0
        "C164" → 164.0 (strips letter prefix)
    """
    # Remove letter prefixes (like "C" in "C164")
    no_prefix = re.sub(r'^[A-Z]+', '', catalog_no)
    
    # Extract the numeric part
    match = re.match(r'(\d+)', no_prefix)
    if match:
        base_num = float(match.group(1))
        
        # Add fractional part for suffixes (a=0.1, b=0.2, etc.).
        # Search only after the digits so a letter prefix ("C164") is not read as a suffix.
        suffix_match = re.search(r'[a-z]', no_prefix[match.end():].lower())
        if suffix_match:
            suffix = suffix_match.group(0)
            base_num += (ord(suffix) - ord('a') + 1) * 0.1
        
        return base_num
    
    # Fallback for non-standard formats
    return 999999.0
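
The assignment step in `find_optimal_assignment` treats matching as a one-to-one assignment problem: scores are negated into costs, so minimizing total cost maximizes total score. The sketch below illustrates why this matters. It uses a brute-force search over permutations as a stand-in for `linear_sum_assignment` on a tiny, made-up score matrix; the scores and the helper names `greedy_total` and `optimal_total` are hypothetical and not part of the notebook.

```python
from itertools import permutations
from typing import List

# Hypothetical scores: rows are Mena stamps, columns are Scott unique keys.
scores: List[List[float]] = [
    [90.0, 85.0],  # Mena "1" vs Scott key 0, Scott key 1
    [88.0, 40.0],  # Mena "2" vs Scott key 0, Scott key 1
]

def greedy_total(matrix: List[List[float]]) -> float:
    """Each row takes its best still-free column, in row order."""
    taken = set()
    total = 0.0
    for row in matrix:
        free = [j for j in range(len(row)) if j not in taken]
        best = max(free, key=lambda j: row[j])
        taken.add(best)
        total += row[best]
    return total

def optimal_total(matrix: List[List[float]]) -> float:
    """Try every one-to-one assignment and keep the highest total score."""
    n = len(matrix)
    return max(
        sum(matrix[i][perm[i]] for i in range(n))
        for perm in permutations(range(n))
    )

print(greedy_total(scores))   # 130.0: Mena "1" grabs key 0, leaving key 1 (40) for Mena "2"
print(optimal_total(scores))  # 173.0: Mena "1" takes key 1 (85), Mena "2" takes key 0 (88)
```

Brute force is only viable for a handful of stamps. The notebook's call to `linear_sum_assignment` solves the same problem in polynomial time.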

## 7. Main Matching Function

In [None]:
def match_mena_to_scott(mena_issue: Dict[str, Any], 
                       all_scott_stamps: List[Dict[str, Any]], 
                       year_tolerance: int = 2, 
                       min_score_threshold: float = 30.0) -> Dict[str, Any]:
    """Main function to match Mena issue to Scott catalog"""
    
    print("\n" + "="*80)
    print("MENA TO SCOTT CATALOG MATCHING")
    print("="*80)
    
    # Build candidate pool
    scott_candidates = build_candidate_pool(mena_issue, all_scott_stamps, year_tolerance)
    
    # # CRITICAL DEBUG: Check what Scott #1, #5, #7, #17, #19 actually are
    # suspect_numbers = ['1', '5', '7', '17', '19']

    # print("\n" + "="*80)
    # print("DEBUGGING: What are these Scott numbers in the candidate pool?")
    # print("="*80)

    # for suspect in suspect_numbers:
    #     matches = [s for s in scott_candidates if s.get('scott_number') == suspect]
    #     if matches:
    #         for s in matches:
    #             print(f"\nScott #{suspect}:")
    #             print(f"  Denomination: {s.get('denomination')}")
    #             print(f"  Color: {s.get('color')}")
    #             print(f"  Year: {extract_scott_year(s)}")
    #             print(f"  Header: {s.get('header', 'N/A')}")
    #             print(f"  Illustration: {s.get('illustration', 'N/A')}")
    #     else:
    #         print(f"\nScott #{suspect}: NOT IN CANDIDATE POOL")

    # print("\n" + "="*80)
    # print("Now checking ALL Scott stamps (not just candidates):")
    # print("="*80)

    # for suspect in suspect_numbers:
    #     all_matches = [s for s in all_scott_stamps if s.get('scott_number') == suspect]
    #     print(f"\nScott #{suspect} appears {len(all_matches)} time(s) in full catalog:")
    #     for i, s in enumerate(all_matches[:3], 1):  # Show first 3
    #         print(f"  {i}. Year={extract_scott_year(s)}, Denom={s.get('denomination')}, Header={s.get('header', 'N/A')[:50]}")
    
    # Score all candidates
    scoring_matrix = score_all_candidates(mena_issue, scott_candidates, min_score_threshold)
    
    # Find optimal assignment
    assignments = find_optimal_assignment(scoring_matrix)
    
    # Calculate statistics
    statistics = {
        'total_mena_stamps': len(mena_issue['stamps']),
        'total_assignments': len(assignments),
        'high_confidence': sum(1 for a in assignments if a.confidence == "HIGH"),
        'medium_confidence': sum(1 for a in assignments if a.confidence == "MEDIUM"),
        'low_confidence': sum(1 for a in assignments if a.confidence == "LOW"),
        'success_rate': round(len(assignments) / len(mena_issue['stamps']) * 100, 1) 
                        if mena_issue['stamps'] else 0
    }
    
    # Build result
    result = {
        'issue_match': {
            'mena_issue_id': mena_issue['issue_data']['issue_id'],
            'mena_title': mena_issue['issue_data']['title'],
            'candidate_pool_size': len(scott_candidates)
        },
        'assignments': [
            {
                'mena_catalog_no': a.mena_catalog_no,
                'scott_number': a.scott_number,
                'confidence': a.confidence,
                'score': round(a.score, 1),
                'signals': {k: round(v, 1) for k, v in a.signals.items()},
                'breakdown': a.breakdown,
                'requires_review': a.requires_review
            }
            for a in assignments
        ],
        'statistics': statistics,
        'scoring_matrix': scoring_matrix
    }
    
    return result

## 8. Results Printing

In [None]:
def print_matching_results(result: Dict[str, Any]) -> None:
    """Pretty print the matching results"""
    print("\n" + "="*80)
    print("MATCHING RESULTS")
    print("="*80)
    
    for assignment in result['assignments']:
        print(f"\n✓ Mena #{assignment['mena_catalog_no']} → Scott #{assignment['scott_number']}")
        print(f"  Confidence: {assignment['confidence']} (Score: {assignment['score']}/100)")
        print(f"  {assignment['breakdown']}")
    
    print("\n" + "="*80)
    stats = result['statistics']
    print(f"Total: {stats['total_mena_stamps']} | Matched: {stats['total_assignments']} ({stats['success_rate']}%)")
    print(f"High: {stats['high_confidence']} | Medium: {stats['medium_confidence']} | Low: {stats['low_confidence']}")
    print("="*80 + "\n")

## 9. Load Your Data

**Replace these paths with your actual file paths!**

In [2]:
# Load Mena issue
PATH = Path("results/parsed_catalogues/mena_parse_results_ALL.json")

# Load
with PATH.open("r", encoding="utf-8") as f:
    mena_parsed_catalog = json.load(f)


In [7]:
mena_issue = mena_parsed_catalog[7]
print(f"Loaded Mena issue: {mena_issue['issue_data']['title']}")
print(f"Number of stamps: {len(mena_issue['stamps'])}")
print(mena_issue)

Loaded Mena issue: Coat of Arms issue
Number of stamps: 10
{'issue_data': {'issue_id': 'CR-1892-COAT-OF-ARMS', 'section': 'Surface Mail', 'title': 'Coat of Arms issue', 'country': 'Costa Rica', 'issue_dates': {'announced': None, 'placed_on_sale': '1892-05-01', 'probable_first_circulation': None, 'second_plate_sale': None, 'demonetized': '1901-03-01'}, 'legal_basis': [{'type': 'decree', 'id': 'Decree #119', 'date': '1892-04-23', 'ids': [], 'officials': []}], 'currency_context': {'original': 'c', 'decimal_adoption': '1864-01-01', 'revaluation_date': None, 'revaluation_map': {}}, 'printing': {'printer': 'Waterlow & Sons', 'process': ['engraved'], 'format': {'panes': 100}, 'plates': {}}, 'perforation': '13.5-15.5'}, 'production_orders': {'printings': [{'date': '1895-01-01', 'quantities': [{'plate_desc': '1c', 'quantity': 500000}, {'plate_desc': '2c', 'quantity': 500000}, {'plate_desc': '5c', 'quantity': 0}, {'plate_desc': '10c', 'quantity': 0}, {'plate_desc': '20c', 'quantity': 0}, {'plate

In [8]:
# Load Scott catalog (grouped structure)
PATH = Path("results/parsed_catalogues/scott_parse_results_ALL.json")

# Load
with PATH.open("r", encoding="utf-8") as f:
    scott_grouped = json.load(f)

print(f"Loaded Scott catalog: {len(scott_grouped)} issue groups")

Loaded Scott catalog: 1086 issue groups


In [12]:
# CRITICAL STEP: Flatten and enrich Scott data
all_scott_stamps = flatten_and_enrich_scott_data(scott_grouped)

print(f"Preprocessed: {len(all_scott_stamps)} total stamps")
print(f"\nExample enriched variety stamp (Scott #1a):")
for stamp in all_scott_stamps[:10]:
    if stamp.get('scott_number') == '1a' and stamp.get('year') == 1863:
        print(f"  denomination: {stamp.get('denomination')}")
        print(f"  color: {stamp.get('color')}")
        print(f"  variety_of: {stamp.get('variety_of')}")
        break

NameError: name 'fix_scott_surcharge_data' is not defined

## 10. Run Matching

In [None]:
# Run the matching algorithm
result = match_mena_to_scott(
    mena_issue=mena_issue,
    all_scott_stamps=all_scott_stamps,
    year_tolerance=2,
    min_score_threshold=60.0
)

# Print results
print_matching_results(result)

In [None]:
scott_raw_candidates = build_candidate_pool(mena_issue, all_scott_stamps, 2)
scott_str_candidates = []
for c in scott_raw_candidates:
    scott_str_candidates.append(f"  Scott #{c.get('scott_number')}: {c.get('denomination')} {c.get('color')} (year={extract_scott_year(c)})")

## 11. Save Results

In [None]:
# Save to JSON
output_file = "matching_results.json"
with open(output_file, 'w', encoding='utf-8') as f:
    json.dump(result, f, indent=2)

print(f"✓ Results saved to: {output_file}")

## 12. Detailed Results Table

In [None]:
import pandas as pd

# Create DataFrame
df = pd.DataFrame([
    {
        'Mena #': a['mena_catalog_no'],
        'Scott #': a['scott_number'],
        'Score': a['score'],
        'Confidence': a['confidence'],
        'Denom': a['signals'].get('denomination', 0),
        'Color': a['signals'].get('color', 0),
        'Year': a['signals'].get('year', 0),
        'Perf': a['signals'].get('perforation', 0),
        'Review': '⚠️' if a['requires_review'] else '✓'
    }
    for a in result['assignments']
])

print("\n" + "="*80)
print("DETAILED MATCHING TABLE")
print("="*80)
print(df.to_string(index=False))
print("\nLegend: Denom=Denomination, Perf=Perforation")
print("="*80)

## Summary

### Key Fixes Applied:

1. ✅ **Scott Data Flattening** - Converts nested structure to flat list
2. ✅ **Variety Enrichment** - Inherits data from base stamps to varieties
3. ✅ **Denomination Normalization** - Handles "reales" → "real", "p" → "peso"
4. ✅ **Color Family Matching** - Recognizes "yellow" ≈ "orange" (85%)
5. ✅ **Year Extraction** - Pulls year from multiple date formats
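
A minimal sketch of the variety enrichment idea in fix 2, assuming each variety carries a `variety_of` field that points at its base stamp's `scott_number`. The helper `enrich_varieties` and the sample records are hypothetical; the notebook's own `flatten_and_enrich_scott_data` may differ in detail. In particular, this sketch keys base stamps by number alone and ignores the year.

```python
from typing import Any, Dict, List, Tuple

def enrich_varieties(
    stamps: List[Dict[str, Any]],
    inherit: Tuple[str, ...] = ("denomination", "color", "year"),
) -> List[Dict[str, Any]]:
    """Copy missing fields from each base stamp onto its varieties."""
    # Base stamps are the records without a variety_of pointer.
    base_by_number = {
        s["scott_number"]: s for s in stamps if not s.get("variety_of")
    }
    enriched = []
    for stamp in stamps:
        copy = dict(stamp)  # leave the input records untouched
        base = base_by_number.get(copy.get("variety_of"))
        if base:
            for field in inherit:
                if not copy.get(field):  # fill only empty or missing fields
                    copy[field] = base.get(field)
        enriched.append(copy)
    return enriched

# Hypothetical sample: variety "1a" lacks denomination and year.
sample = [
    {"scott_number": "1", "denomination": "½r", "color": "blue", "year": 1863},
    {"scott_number": "1a", "variety_of": "1", "color": "light blue"},
]
result = enrich_varieties(sample)
print(result[1])
```

The variety keeps its own color and inherits only the fields it lacks, so variety-specific data is never overwritten by the base stamp.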

### Expected Results:
- **Match Rate**: >90%
- **High Confidence**: >70%
- **Zero False Positives**

### Confidence Levels:
- **HIGH** (70-100): Very reliable, approve immediately
- **MEDIUM** (50-69): Likely correct, review recommended
- **LOW** (30-49): Uncertain, requires manual verification
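
The thresholds above can be written as a small helper. This is a sketch that mirrors the cut-offs used in `find_optimal_assignment` (70, 50, 30); the function name `confidence_label` is hypothetical, and returning `None` for scores under 30 is an assumption standing in for the notebook's behavior of skipping such pairs.

```python
from typing import Optional

def confidence_label(score: float) -> Optional[str]:
    """Map a match score to a confidence level, or None if below the floor."""
    if score >= 70:
        return "HIGH"
    if score >= 50:
        return "MEDIUM"
    if score >= 30:
        return "LOW"
    return None  # assumption: pairs scoring under 30 are dropped, not labeled

for s in (85.0, 69.9, 50.0, 30.0, 29.9):
    print(s, confidence_label(s))
```

Note that running the matcher with `min_score_threshold=60.0`, as in section 10, filters out every LOW candidate and the lower part of the MEDIUM band before assignment.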

In [None]:
def find_catalog_gaps_complete(all_scott_stamps: List[Dict[str, Any]], 
                               analyze_all: bool = False,
                               show_details: bool = False):
    """
    Find gaps in Scott catalog numbering with proper category handling.
    
    Costa Rica Scott Catalog Ranges (approximate):
    - Regular: 1-733
    - Airmail (C): C1-C940
    - Official (O): O1-O75
    - Guanacaste (G): G1-G70
    - Postage Due (J): J1-J50
    - Air Post Official (CO): CO1-CO30
    - Semi-Postal (B): B1-B10
    - Special Delivery (E): E1-E10
    - Postal Tax (RA): RA1-RA50
    
    Args:
        analyze_all: If False, only analyze matchable categories (ignore proofs, specimens)
        show_details: Show color and header details for gaps
    """
    from collections import defaultdict
    
    # Complete Scott category definitions with Costa Rica context
    SCOTT_CATEGORIES = {
        "": {"name": "Regular Issues", "mena_equiv": "(none)", "analyze": True, "typical_max": 733},
        "C": {"name": "Air Post (Airmail)", "mena_equiv": "A", "analyze": True, "typical_max": 940},
        "O": {"name": "Official", "mena_equiv": "O", "analyze": True, "typical_max": 75},
        "CO": {"name": "Air Post Official", "mena_equiv": "OA", "analyze": True, "typical_max": 30},
        "CE": {"name": "Air Post Special Delivery", "mena_equiv": "SD+A", "analyze": True, "typical_max": 10},
        "E": {"name": "Special Delivery", "mena_equiv": "SD", "analyze": True, "typical_max": 10},
        "J": {"name": "Postage Due", "mena_equiv": "D", "analyze": True, "typical_max": 50},
        "B": {"name": "Semi-Postal", "mena_equiv": "SP", "analyze": True, "typical_max": 10},
        "RA": {"name": "Postal Tax", "mena_equiv": "CT", "analyze": True, "typical_max": 50},
        "AR": {"name": "Postal Fiscal", "mena_equiv": "R", "analyze": True, "typical_max": 20},
        "G": {"name": "Guanacaste", "mena_equiv": "G", "analyze": True, "typical_max": 70},
        # Less common categories
        "F": {"name": "Registration", "mena_equiv": "RL", "analyze": analyze_all, "typical_max": 10},
        "Q": {"name": "Parcel Post", "mena_equiv": "-", "analyze": analyze_all, "typical_max": 10},
        "QE": {"name": "Parcel Post Special Delivery", "mena_equiv": "-", "analyze": analyze_all, "typical_max": 5},
    }
    
    # Group stamps by category
    by_category = defaultdict(list)
    
    for stamp in all_scott_stamps:
        scott_no = stamp.get('scott_number', '')
        year = extract_scott_year(stamp)
        
        # Parse catalog number
        cat, num, suffix = normalize_catalog_number(scott_no)
        
        # Only track base numbers (ignore varieties)
        if suffix == "":
            by_category[cat].append({
                'scott_number': scott_no,
                'numeric': int(num),
                'year': year,
                'denomination': stamp.get('denomination', ''),
                'color': stamp.get('color', ''),
                'header': stamp.get('header', '')
            })
    
    # Find gaps
    print("\n" + "="*80)
    print("SCOTT CATALOG GAP ANALYSIS - COSTA RICA")
    print("="*80)
    print("\nAnalyzing matchable categories (Regular, Airmail, Official, etc.)")
    print("Note: Each category has independent numbering:")
    print("      Regular: 1-733, Airmail: C1-C940, Official: O1-O75, etc.")
    print("="*80)
    
    total_gaps = 0
    total_missing = 0
    categories_with_gaps = []
    
    # Sort categories by common usage
    category_order = ["", "C", "O", "CO", "CE", "E", "J", "B", "RA", "AR", "G", "F", "Q", "QE"]
    
    for category in category_order:
        if category not in by_category:
            continue
        
        # Check if we should analyze this category
        cat_info = SCOTT_CATEGORIES.get(category, {"name": f"{category} Issues", "mena_equiv": "-", "analyze": True, "typical_max": 100})
        if not cat_info["analyze"]:
            continue
        
        stamps = sorted(by_category[category], key=lambda x: x['numeric'])
        
        if len(stamps) < 2:
            continue
        
        # Get range info
        min_num = stamps[0]['numeric']
        max_num = stamps[-1]['numeric']
        prefix = category if category else ""
        expected_total = max_num - min_num + 1
        
        # Find gaps
        gaps = []
        for i in range(len(stamps) - 1):
            current_num = stamps[i]['numeric']
            next_num = stamps[i + 1]['numeric']
            
            if next_num - current_num > 1:
                gap_start = current_num + 1
                gap_end = next_num - 1
                gaps.append({
                    'before': stamps[i],
                    'after': stamps[i + 1],
                    'gap_start': gap_start,
                    'gap_end': gap_end,
                    'gap_size': gap_end - gap_start + 1
                })
        
        # Print header for category
        cat_display = cat_info["name"]
        mena_equiv = cat_info["mena_equiv"]
        typical_max = cat_info["typical_max"]
        
        print(f"\n{'='*80}")
        print(f"{cat_display} (Scott: {prefix if prefix else '(none)'} | Mena: {mena_equiv})")
        print(f"{'='*80}")
        print(f"Range: {prefix}{min_num} to {prefix}{max_num} (typical max: ~{prefix}{typical_max})")
        print(f"Found: {len(stamps)} stamps | Expected if consecutive: {expected_total} stamps")
        
        # Check if we're missing a lot vs typical
        if max_num > typical_max * 1.5:
            print(f"⚠️  NOTE: Maximum number ({prefix}{max_num}) exceeds typical range (~{prefix}{typical_max})")
        
        if gaps:
            missing_count = sum(g['gap_size'] for g in gaps)
            total_missing += missing_count
            categories_with_gaps.append(cat_display)
            
            print(f"Status: ⚠️  {len(gaps)} gap(s) found ({missing_count} missing stamps)")
            print("-" * 80)
            
            for gap in gaps:
                total_gaps += 1
                
                # Format gap range
                if gap['gap_size'] == 1:
                    gap_display = f"{prefix}{gap['gap_start']}"
                else:
                    gap_display = f"{prefix}{gap['gap_start']}-{prefix}{gap['gap_end']}"
                
                # Determine severity
                severity = ""
                if gap['gap_size'] > 50:
                    severity = "🚨 CRITICAL - Very large gap! Possible parser error!"
                elif gap['gap_size'] > 20:
                    severity = "⚠️  WARNING - Large gap"
                elif gap['gap_size'] > 10:
                    severity = "⚠️  Moderate gap"
                elif gap['gap_start'] <= 3:
                    severity = "⚠️  Low numbers missing - verify correct"
                
                print(f"\n  Gap #{gaps.index(gap) + 1}: {gap_display}")
                print(f"    Missing: {gap['gap_size']} stamp{'s' if gap['gap_size'] > 1 else ''} {severity}")
                print(f"    Before: #{gap['before']['scott_number']} = {gap['before']['denomination']} ({gap['before']['year']})")
                print(f"    After:  #{gap['after']['scott_number']} = {gap['after']['denomination']} ({gap['after']['year']})")
                
                # Year analysis
                if gap['after']['year'] and gap['before']['year']:
                    year_diff = abs(gap['after']['year'] - gap['before']['year'])
                    if year_diff > 5:
                        print(f"    📅 {year_diff}-year gap between issues")
                    elif year_diff == 0:
                        print(f"    📅 Same year ({gap['before']['year']}) - likely intentional gap or reserved numbers")
                
                # Show details if requested
                if show_details:
                    print(f"    Details:")
                    print(f"      Before: {gap['before']['color']} | {gap['before']['header']}")
                    print(f"      After:  {gap['after']['color']} | {gap['after']['header']}")
        else:
            print(f"Status: ✓ Complete (no gaps - consecutive numbering)")
    
    # Summary
    print("\n" + "="*80)
    print("SUMMARY")
    print("="*80)
    print(f"Total gaps found: {total_gaps}")
    print(f"Total missing stamps: {total_missing}")
    
    if categories_with_gaps:
        print(f"\nCategories with gaps:")
        for cat in categories_with_gaps:
            print(f"  • {cat}")
        print(f"\nNote: Some gaps are normal (reserved numbers, stamps never issued)")
        print(f"      Gaps >50 stamps likely indicate parser errors")
    else:
        print("\n✓ All categories have consecutive numbering!")
    
    print("="*80)
    
    return {
        'total_gaps': total_gaps,
        'total_missing': total_missing,
        'categories_with_gaps': categories_with_gaps
    }

In [None]:
find_catalog_gaps_complete(all_scott_stamps)
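
Stripped of categories and reporting, the gap detection in `find_catalog_gaps_complete` compares each base number with the next one in sorted order. A minimal sketch of that core step follows; the helper name `find_number_gaps` and the sample numbers are hypothetical.

```python
from typing import Iterable, List, Tuple

def find_number_gaps(numbers: Iterable[int]) -> List[Tuple[int, int]]:
    """Return (gap_start, gap_end) for each run of missing numbers."""
    ordered = sorted(set(numbers))
    return [
        (current + 1, nxt - 1)
        for current, nxt in zip(ordered, ordered[1:])
        if nxt - current > 1
    ]

sample = [1, 2, 5, 6, 10]
gaps = find_number_gaps(sample)
missing = sum(end - start + 1 for start, end in gaps)
print(gaps)     # [(3, 4), (7, 9)]
print(missing)  # 5
```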

## Approach 2 LLM Few Shot
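
The prompt used in this approach relies on a fixed mapping between Mena prefixes and Scott families. As a sanity check on the LLM's output, the same mapping can be expressed in plain Python. The sketch below is an assumption-laden illustration: `MENA_TO_SCOTT_PREFIX` and `expected_scott_prefix` are hypothetical names, and prefixes outside the mapping (such as souvenir sheets) simply return `None`.

```python
import re
from typing import Optional

# Mapping taken from the conventions listed in the matcher prompt.
MENA_TO_SCOTT_PREFIX = {
    "": "",      # regular issues: no prefix on either side
    "A": "C",    # airmail
    "OA": "CO",  # official airmail
    "O": "O",    # official
    "D": "J",    # postage due
    "SD": "E",   # special delivery
    "SP": "B",   # semi-postal
    "CT": "RA",  # postal tax / Christmas
    "G": "G",    # Guanacaste
}

def expected_scott_prefix(mena_catalog_no: str) -> Optional[str]:
    """Return the Scott prefix implied by a Mena number, or None if unknown."""
    match = re.match(r"^([A-Z]*)\d+", mena_catalog_no)
    if not match:
        return None
    return MENA_TO_SCOTT_PREFIX.get(match.group(1))

for number in ("59", "CT37", "A12", "SSA497"):
    print(number, "->", expected_scott_prefix(number))
```

A check like this could flag equivalences whose Scott prefix disagrees with the mapping, without replacing the LLM's reasoning about denomination, color, and year.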

In [13]:
import json
import os
import re
import traceback

from dotenv import load_dotenv
from landingai_ade import LandingAIADE

# Load environment variables
load_dotenv()

In [14]:
"""
Mena–Scott Matcher (Costa Rica) — LLM-Driven, Schema-Consistent
Author: (Your Name)

- Input:
    * mena_issue: a parsed Mena JSON (must include issue_data.issue_id and stamps[])
    * scott_candidates: a list of candidate strings (raw or structured text lines)

- Output (ALWAYS this schema):
    {
      "issue_id": "<Mena issue id>",
      "equivalences": [
        { "mena": "<Mena catalog_no>", "scott": "<Scott normalized>", "confidence": "low|medium|high" }
      ]
    }

- Notes:
  * Uses LangChain + an LLM to do the reasoning/matching (no regex scoring).
  * Enforces mapping conventions:
      - Mena prefixes to Scott families:
          A  -> C        (airmail)
          OA -> CO       (official airmail)
          O  -> O        (official)   **Scott leading '0' also means Official**
          D  -> J        (postage due)
          SD -> E        (special delivery)
          SP -> B        (semi-postal)
          CT -> RA       (postal tax / Christmas)
          G  -> G        (Guanacaste)
      - Regular issue: no prefix ↔ no prefix
      - It's OK to **strip leading letter(s)** to compare base numbers,
        but output must keep Scott's original prefix ("O", "0", "RA", etc.).
  * Temperature=0 and structured JSON enforced via JsonOutputParser.
"""

import json
import traceback
from typing import Dict, Any, List, Optional

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, FewShotChatMessagePromptTemplate
from langchain_core.output_parsers import JsonOutputParser
from langchain_community.callbacks import get_openai_callback

def _escape_jinja(s: str) -> str:
    # Escape ALL braces so the prompt template treats them as literals
    return s.replace("{", "{{").replace("}", "}}")

# --------------------------
# System prompt
# --------------------------
# --- Raw PROMPT (with normal braces) ---
MATCHER_SYSTEM_PROMPT_RAW = """
You are MenaScottMatcher — a strict, JSON-only catalog equivalence matcher for Costa Rica stamps.

INPUT:
  1) mena_issue: a Mena JSON (must include issue_data.issue_id and stamps[])
  2) scott_candidates: a list whose items are EITHER:
       • a plain string like "Scott #59: 10c blue & black (year=1907)"
       • an object { "scott_number": <display string>, "scott_number_data": { ... rich fields ... } }
     When the object form is present, prefer the fields in "scott_number_data" for reasoning:
       scott_number, denomination, color, year, header, variety_of, description, overprint, perforation, etc.

OUTPUT (return ONLY this JSON):
{
  "issue_id": "<Mena issue id>",
  "equivalences": [
    { "mena": "<Mena catalog_no>", "scott": "<Scott number with prefix if any>", "confidence": "low|medium|high" }
  ]
}

Rules:
- If no matches: "equivalences": []
- Confidence ∈ {"low","medium","high"} only.
- Never add extra keys; never return commentary.
- Prefer matches aligning denomination (incl. surcharges like "2c on ½r"), year, color, and family/prefix.

Deterministic family mapping (for comparison; always output Scott with its original prefix):
- Mena "A"  -> Scott "C"     (airmail)
- Mena "OA" -> Scott "CO"    (official airmail)
- Mena "O"  -> Scott "O"     (official). For matching, treat Scott "0" and "O" as equivalent; output the original as shown.
- Mena "D"  -> Scott "J"     (postage due)
- Mena "SD" -> Scott "E"     (special delivery)
- Mena "SP" -> Scott "B"     (semi-postal)
- Mena "CT" -> Scott "RA"    (postal tax / Christmas)
- Mena "G"  -> Scott "G"     (Guanacaste)
- Regular (no prefix) ↔ regular (no prefix).

Souvenir sheets (SS*, SSA*) usually map to regular Scott numbers, i.e. Scott numbers without a leading letter prefix.

If evidence is insufficient or conflicting, omit the pair (don’t guess).
Return only the JSON object above.
"""
MATCHER_SYSTEM_PROMPT = _escape_jinja(MATCHER_SYSTEM_PROMPT_RAW)


# --------------------------
# Few-shot examples (now include compound candidates)
# --------------------------

# Example 1: Regular + Official (mixed: simple strings)
FS_INPUT_1 = {
    "mena_issue": {
        "issue_data": {"issue_id": "CR-1907-ISSUE", "issue_dates": {"placed_on_sale": "1907-09-29"}},
        "stamps": [
            {"catalog_no": "59", "status": "regular",
             "denomination": {"value": 10, "unit": "c"}, "color": "blue & black"},
            {"catalog_no": "O48", "status": "official",
             "denomination": {"value": 1, "unit": "c"}, "color": "red brown & indigo"}
        ]
    },
    "scott_candidates": [
        "Scott #59: 10c blue & black (year=1907)",
        "Scott #O48: 1c red brn & ind (year=1907)"
    ]
}
FS_OUTPUT_1 = {
    "issue_id": "CR-1907-ISSUE",
    "equivalences": [
        {"mena": "59", "scott": "59", "confidence": "high"},
        {"mena": "O48", "scott": "O48", "confidence": "high"}
    ]
}

# Example 2: Christmas Postal Tax CT ↔ RA (compound)
FS_INPUT_2 = {
    "mena_issue": {
        "issue_data": {"issue_id": "CR-1968-CHRISTMAS-TAX", "issue_dates": {"placed_on_sale": "1968-12-01"}},
        "stamps": [
            {"catalog_no": "CT37", "status": "postal_tax",
             "denomination": {"value": 5, "unit": "c"}, "color": "gray"},
            {"catalog_no": "CT38", "status": "postal_tax",
             "denomination": {"value": 5, "unit": "c"}, "color": "rose red"}
        ]
    },
    "scott_candidates": [
        {
            "scott_number": "Scott #RA37: 5c gray (year=1968)",
            "scott_number_data": {"scott_number": "RA37", "denomination": "5c", "color": "gray", "year": 1968}
        },
        {
            "scott_number": "Scott #RA38: 5c rose red (year=1968)",
            "scott_number_data": {"scott_number": "RA38", "denomination": "5c", "color": "rose red", "year": 1968}
        }
    ]
}
FS_OUTPUT_2 = {
    "issue_id": "CR-1968-CHRISTMAS-TAX",
    "equivalences": [
        {"mena": "CT37", "scott": "RA37", "confidence": "high"},
        {"mena": "CT38", "scott": "RA38", "confidence": "high"}
    ]
}

# Example 3: Souvenir sheet — should produce empty array (compound)
FS_INPUT_3 = {
    "mena_issue": {
        "issue_data": {"issue_id": "CR-1968-III-PHILATELIC-EXHIBITION-OVERPRINT",
                       "issue_dates": {"announced": "1968-08-01"}},
        "stamps": [
            {"catalog_no": "SSA497", "status": "souvenir_sheet",
             "denomination": {"value": None, "unit": "sheet"}, "color": "multicolor", "perforation": "13.5"},
            {"catalog_no": "SSA497a", "status": "souvenir_sheet",
             "denomination": {"value": None, "unit": "sheet"}, "color": "multicolor", "perforation": ""}
        ]
    },
    "scott_candidates": [
        {
            "scott_number": "Scott #C475: 15c lt bl, blk & lt brn (year=1968)",
            "scott_number_data": {"scott_number": "C475", "denomination": "15c",
                                  "color": "light blue/black/light brown", "year": 1968}
        },
        {
            "scott_number": "Scott #RA37: 5c gray (year=1968)",
            "scott_number_data": {"scott_number": "RA37", "denomination": "5c", "color": "gray", "year": 1968}
        }
    ]
}
FS_OUTPUT_3 = {
    "issue_id": "CR-1968-III-PHILATELIC-EXHIBITION-OVERPRINT",
    "equivalences": []
}

# Example 4: Your 1881–82 surcharges (compound candidates) — teaches “Xc on ½r” mapping
MENA_1881_82_SURCH_ISSUE = {
    "issue_data": {"issue_id": "CR-1881-82-SURCHARGES"},
    "stamps": [
        {"catalog_no": "5", "status": "regular",
         "denomination": {"value": 1, "unit": "c"}, "color": "",
         "overprint": {"present": True, "type": "surcharge",
                       "surcharge_denomination": {"value": 1, "unit": "c"},
                       "on_denomination": {"value": 0.5, "unit": "reales"}}},
        {"catalog_no": "6", "status": "regular",
         "denomination": {"value": 1, "unit": "c"}, "color": "",
         "overprint": {"present": True, "type": "surcharge",
                       "surcharge_denomination": {"value": 1, "unit": "c"},
                       "on_denomination": {"value": 0.5, "unit": "reales"}}},
        {"catalog_no": "7", "status": "regular",
         "denomination": {"value": 2, "unit": "c"}, "color": "",
         "overprint": {"present": True, "type": "surcharge",
                       "surcharge_denomination": {"value": 2, "unit": "c"},
                       "on_denomination": {"value": 0.5, "unit": "reales"}}},
        {"catalog_no": "8", "status": "regular",
         "denomination": {"value": 5, "unit": "c"}, "color": "",
         "overprint": {"present": True, "type": "surcharge",
                       "surcharge_denomination": {"value": 5, "unit": "c"},
                       "on_denomination": {"value": 0.5, "unit": "reales"}}}
    ]
}

# A compact subset of your compound candidate list that’s sufficient to teach the mapping
SCOTT_1881_82_COMPOUND_EXAMPLE = [
    {
        "scott_number": "Scott #7: 1c on ½r surcharge color unknown (year=1881)",
        "scott_number_data": {"scott_number": "7", "denomination": "1c on ½r",
                              "color": "surcharge color unknown", "year": 1881}
    },
    {
        "scott_number": "Scott #7a: 1c on ½r surcharge color unknown (year=1881)",
        "scott_number_data": {"scott_number": "7a", "variety_of": "7",
                              "denomination": "1c on ½r", "color": "surcharge color unknown", "year": 1881}
    },
    {
        "scott_number": "Scott #8: 1c on ½r surcharge color unknown (year=1881)",
        "scott_number_data": {"scott_number": "8", "denomination": "1c on ½r",
                              "color": "surcharge color unknown", "year": 1881}
    },
    {
        "scott_number": "Scott #9: 2c on ½r surcharge color unknown (year=1881)",
        "scott_number_data": {"scott_number": "9", "denomination": "2c on ½r",
                              "color": "surcharge color unknown", "year": 1881}
    },
    {
        "scott_number": "Scott #12: 5c on ½r surcharge color unknown (year=1881)",
        "scott_number_data": {"scott_number": "12", "denomination": "5c on ½r",
                              "color": "surcharge color unknown", "year": 1881}
    },
    # Distractors to encourage precision:
    {
        "scott_number": "Scott #16: 1c green (year=1881)",
        "scott_number_data": {"scott_number": "16", "denomination": "1c", "color": "green", "year": 1881}
    },
    {
        "scott_number": "Scott #O1: 1c green (R) (year=1883)",
        "scott_number_data": {"scott_number": "O1", "denomination": "1c", "color": "green (R)", "year": 1883}
    }
]

FS_INPUT_4 = {
    "mena_issue": MENA_1881_82_SURCH_ISSUE,
    "scott_candidates": SCOTT_1881_82_COMPOUND_EXAMPLE
}
FS_OUTPUT_4 = {
    "issue_id": "CR-1881-82-SURCHARGES",
    "equivalences": [
        {"mena": "5", "scott": "7", "confidence": "high"},
        {"mena": "6", "scott": "8", "confidence": "high"},
        {"mena": "7", "scott": "9", "confidence": "high"},
        {"mena": "8", "scott": "12", "confidence": "high"}
    ]
}

# --------------------------
# Example 5: 1986 President Portraits (compound Scott objects)
# --------------------------

FS_INPUT_5 = {
    "mena_issue": {
        "issue_data": {
            "issue_id": "CR-1986-PRESIDENT-PORTRAITS",
            "issue_dates": {"placed_on_sale": "1986-05-12"}
        },
        "stamps": [
            {
                "catalog_no": "348", "status": "regular",
                "denomination": {"value": 3, "unit": "C"},
                "color": "blue", "perforation": "10.5",
                "notes": "Portrait: F Orlich Se-tenant strips of five"
            },
            {
                "catalog_no": "349", "status": "regular",
                "denomination": {"value": 3, "unit": "C"},
                "color": "blue", "perforation": "10.5",
                "notes": "Portrait: JJ Trejos Se-tenant strips of five"
            },
            {
                "catalog_no": "350", "status": "regular",
                "denomination": {"value": 3, "unit": "C"},
                "color": "blue", "perforation": "10.5",
                "notes": "Portrait: D Oduber Se-tenant strips of five"
            }
        ]
    },
    # Use the compound objects you provided (truncated here to the relevant window).
    # You can paste your entire candidates list; the matcher will prefer the detailed dicts.
    "scott_candidates": [
        {
            "scott_number": "  Scott #344: 3col turq blue (year=1986)",
            "scott_number_data": {
                "scott_number": "344", "denomination": "3col",
                "color": "turq blue", "perforation": "10½",
                "header": "1986, May 12", "year": 1986
            }
        },
        {
            "scott_number": "  Scott #345: 3col turq blue (year=1986)",
            "scott_number_data": {
                "scott_number": "345", "denomination": "3col",
                "color": "turq blue", "perforation": "10½",
                "header": "1986, May 12", "year": 1986
            }
        },
        {
            "scott_number": "  Scott #346: 3col turq blue (year=1986)",
            "scott_number_data": {
                "scott_number": "346", "denomination": "3col",
                "color": "turq blue", "perforation": "10½",
                "header": "1986, May 12", "year": 1986
            }
        },
        {
            "scott_number": "  Scott #347: 3col turq blue (year=1986)",
            "scott_number_data": {
                "scott_number": "347", "denomination": "3col",
                "color": "turq blue", "perforation": "10½",
                "header": "1986, May 12", "year": 1986
            }
        },
        {
            "scott_number": "  Scott #348: 3col turq blue (year=1986)",
            "scott_number_data": {
                "scott_number": "348", "denomination": "3col",
                "color": "turq blue", "perforation": "10½",
                "header": "1986, May 12", "year": 1986
            }
        },
        # (You can include the rest of your long candidate list here unchanged.)
    ]
}

FS_OUTPUT_5 = {
    "issue_id": "CR-1986-PRESIDENT-PORTRAITS",
    "equivalences": [
        {"mena": "348", "scott": "344", "confidence": "high"},
        {"mena": "349", "scott": "345", "confidence": "high"},
        {"mena": "350", "scott": "346", "confidence": "high"}
    ]
}



def _json(obj: Any) -> str:
    return json.dumps(obj, ensure_ascii=False)


def _few_shot_block():
    """
    Few-shot template mixing simple-string and compound-object Scott candidates.
    """
    example_prompt = ChatPromptTemplate.from_messages([
        ("human", "{input}"),
        ("ai", "{output}")
    ])
    return FewShotChatMessagePromptTemplate(
        example_prompt=example_prompt,
        examples=[
            {"input": _json(FS_INPUT_1), "output": _json(FS_OUTPUT_1)},
            {"input": _json(FS_INPUT_2), "output": _json(FS_OUTPUT_2)},
            {"input": _json(FS_INPUT_3), "output": _json(FS_OUTPUT_3)},
            {"input": _json(FS_INPUT_4), "output": _json(FS_OUTPUT_4)},
            {"input": _json(FS_INPUT_5), "output": _json(FS_OUTPUT_5)},

        ],
    )



class MenaScottMatcher:
    """
    LLM-driven matcher that returns a stable, minimal schema:

    {
      "issue_id": "<Mena issue id>",
      "equivalences": [
        { "mena": "<Mena catalog_no>", "scott": "<Scott number>", "confidence": "low|medium|high" }
      ]
    }
    """

    def __init__(
        self,
        openai_api_key: str,
        model_name: str = "gpt-5-mini",
        temperature: float = 1,
    ):
        self.llm = ChatOpenAI(
            model=model_name,
            temperature=temperature,
            api_key=openai_api_key,
            timeout=180.0,
            model_kwargs={
                "verbosity": "low",
                "reasoning_effort": "low",
            }
        )
        self.parser = JsonOutputParser()
        self.chain = self._create_chain()

    def _create_chain(self):
        sys = MATCHER_SYSTEM_PROMPT
        few = _few_shot_block()
        user = ChatPromptTemplate.from_messages([
            ("system", sys),
            few,
            ("human", "{payload}")  # single unified payload per call
        ])
        return user | self.llm | self.parser

    def match(self, mena_issue: Dict[str, Any], scott_candidates: List[Any]) -> Dict[str, Any]:
        payload = {"mena_issue": mena_issue, "scott_candidates": scott_candidates}
        fallback = {
            "issue_id": (mena_issue.get("issue_data") or {}).get("issue_id", "") or "",
            "equivalences": []
        }

        try:
            with get_openai_callback() as cb:
                result = self.chain.invoke({
                    "payload": json.dumps(payload, ensure_ascii=False)
                })
                print(
                    f"[Callback] prompt_tokens={cb.prompt_tokens} "
                    f"completion_tokens={cb.completion_tokens} "
                    f"total_tokens={cb.total_tokens} "
                    f"total_cost={cb.total_cost}"
                )
                cost_per_1m_input = 0.250
                cost_per_1m_output = 2.0
                
                # Convert to cost per token
                cost_per_input_token = cost_per_1m_input / 1_000_000
                cost_per_output_token = cost_per_1m_output / 1_000_000
                
                input_cost = cb.prompt_tokens * cost_per_input_token
                output_cost = cb.completion_tokens * cost_per_output_token
                total_cost = input_cost + output_cost
                print("Cost (USD):", total_cost)
        except Exception as e:
            print("LLM/Parsing error:", repr(e))
            traceback.print_exc()
            return fallback

        # Sanitize the output (just in case)
        issue_id = result.get("issue_id") or fallback["issue_id"]
        eq = result.get("equivalences")
        if not isinstance(eq, list):
            eq = []
        out = []
        for item in eq:
            if not isinstance(item, dict):
                continue
            mena = str(item.get("mena", "")).strip()
            scott = str(item.get("scott", "")).strip()
            conf = str(item.get("confidence", "low")).lower()
            if conf not in ("low", "medium", "high"):
                conf = "low"
            if mena and scott:
                out.append({"mena": mena, "scott": scott, "confidence": conf})
        return {"issue_id": issue_id, "equivalences": out}
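
The per-call cost printed inside `match` can be factored into a small helper. The sketch below assumes the same hard-coded prices used above (0.25 USD per million input tokens, 2.00 USD per million output tokens). `estimate_cost_usd` is a hypothetical name, not part of the notebook, and the prices should be checked against the current pricing of whichever model is configured.

```python
def estimate_cost_usd(prompt_tokens: int, completion_tokens: int,
                      usd_per_1m_input: float = 0.25,
                      usd_per_1m_output: float = 2.0) -> float:
    """Estimate the cost of one call from token counts and per-million prices."""
    input_cost = prompt_tokens * usd_per_1m_input / 1_000_000
    output_cost = completion_tokens * usd_per_1m_output / 1_000_000
    return input_cost + output_cost


# Token counts taken from one of the logged batch calls:
# 10648 prompt tokens and 237 completion tokens.
print(round(estimate_cost_usd(10648, 237), 6))  # prints 0.003136
```

Keeping the prices as parameters makes it easy to re-estimate a batch after a price change without touching the matcher class.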

### Get Mena Issue

In [4]:
# Load Mena issue
PATH = Path("results/parsed_catalogues/mena_parse_results_ALL.json")

# Load
with PATH.open("r", encoding="utf-8") as f:
    mena_parsed_catalog = json.load(f)

In [5]:
len(mena_parsed_catalog)

837

In [10]:
mena_issue = mena_parsed_catalog[801]
print(f"Loaded Mena issue: {mena_issue['issue_data']['title']}")
print(f"Number of stamps: {len(mena_issue['stamps'])}")
print(mena_issue)

Loaded Mena issue: Radio Station T14NRH Anniversary, 1938
Number of stamps: 8
{'issue_data': {'issue_id': 'CR-1938-RADIO-STATION-T14NRH-ANNIVERSARY', 'section': 'Surface Mail', 'title': 'Radio Station T14NRH Anniversary, 1938', 'country': 'Costa Rica', 'issue_dates': {'announced': '1938-01-01', 'placed_on_sale': '1938-01-01', 'probable_first_circulation': '1938-01-01', 'second_plate_sale': None, 'demonetized': None}, 'legal_basis': [], 'currency_context': {'original': 'c', 'decimal_adoption': None, 'revaluation_date': None, 'revaluation_map': {}}, 'printing': {'printer': '', 'process': ['lithography'], 'format': {'panes': None}, 'plates': {}}, 'perforation': '12'}, 'production_orders': {'printings': [], 'remainders': {'date': None, 'note': '', 'quantities': []}}, 'stamps': [{'catalog_no': 'E54', 'issue_id': 'CR-1938-RADIO-STATION-T14NRH-ANNIVERSARY', 'denomination': {'value': 10, 'unit': 'c'}, 'color': 'black & pink', 'plate': None, 'perforation': '', 'watermark': None, 'quantity_repor

### Get Scott Candidates

In [3]:
# Load Scott catalog (grouped structure)
PATH = Path("results/parsed_catalogues/scott_parse_results_ALL.json")

# Load
with PATH.open("r", encoding="utf-8") as f:
    scott_grouped = json.load(f)

print(f"Loaded Scott catalog: {len(scott_grouped)} issue groups")

Loaded Scott catalog: 1086 issue groups


In [26]:
# CRITICAL STEP: Flatten and enrich Scott data
all_scott_stamps = flatten_and_enrich_scott_data(scott_grouped)

print(f"Preprocessed: {len(all_scott_stamps)} total stamps")
print(f"\nExample enriched variety stamp (Scott #1a):")
for stamp in all_scott_stamps[:10]:
    if stamp.get('scott_number') == '1a' and stamp.get('year') == 1863:
        print(f"  denomination: {stamp.get('denomination')}")
        print(f"  color: {stamp.get('color')}")
        print(f"  variety_of: {stamp.get('variety_of')}")
        break

Preprocessed: 2559 total stamps

Example enriched variety stamp (Scott #1a):
  denomination: ½r
  color: light blue
  variety_of: 1
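
The output above shows a variety (`1a`) carrying a denomination even though varieties often omit fields that their base stamp defines. As a simplified illustration of that idea, and not the notebook's actual `flatten_and_enrich_scott_data`, the sketch below fills gaps in a variety from the stamp named in `variety_of`. The function name `enrich_varieties`, the `INHERITED_FIELDS` tuple, and the sample records are assumptions made for this example.

```python
from typing import Any, Dict, List

# Fields a variety is assumed to inherit from its base stamp when missing.
INHERITED_FIELDS = ("denomination", "color", "year", "perforation")


def enrich_varieties(stamps: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of the stamps, with varieties filled in from their base."""
    by_number = {s["scott_number"]: s for s in stamps if "scott_number" in s}
    enriched = []
    for stamp in stamps:
        out = dict(stamp)  # copy so the input list is left untouched
        base = by_number.get(stamp.get("variety_of"))
        if base is not None:
            for field in INHERITED_FIELDS:
                # Only fill gaps; never overwrite a value the variety has.
                if not out.get(field) and base.get(field) is not None:
                    out[field] = base[field]
        enriched.append(out)
    return enriched


# Made-up sample records, for illustration only.
sample = [
    {"scott_number": "1", "denomination": "½r", "color": "blue", "year": 1863},
    {"scott_number": "1a", "variety_of": "1", "color": "light blue"},
]
print(enrich_varieties(sample)[1])
```

In this sketch the variety keeps its own color and inherits only the missing denomination and year. Scott numbers are assumed unique within the list; a real catalog with repeated numbers across sections would need a more specific lookup key.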


In [29]:
scott_raw_candidates = build_candidate_pool(mena_issue, all_scott_stamps, 6)
scott_str_candidates = []
for c in scott_raw_candidates:
    scott_str_candidates.append(f"  Scott #{c.get('scott_number')}: {c.get('denomination')} {c.get('color')} (year={extract_scott_year(c)})")

Found 65 Scott candidates for year 1892 (±6 years)
Excluded 366 stamps without year information
Year distribution: {1886: 5, 1887: 8, 1888: 2, 1889: 50}

All candidates overall:
  Scott #23: 1c rose (year=1889)
  Scott #24: 5c brown (year=1889)
  Scott #25: 1c brown (year=1889)
  Scott #25a: 1c brown (year=1889)
  Scott #25b: 1c brown (year=1889)
  Scott #25c: 1c brown (year=1889)
  Scott #26: 2c dark green (year=1889)
  Scott #26a: 2c dark green (year=1889)
  Scott #26b: 2c dark green (year=1889)
  Scott #26c: 2c dark green (year=1889)
  Scott #27: 5c orange (year=1889)
  Scott #27a: 5c orange (year=1889)
  Scott #27b: 5c orange (year=1889)
  Scott #28: 10c red brown (year=1889)
  Scott #28a: 10c red brown (year=1889)
  Scott #29: 20c yellow green (year=1889)
  Scott #29a: 20c yellow green (year=1889)
  Scott #29b: 20c yellow green (year=1889)
  Scott #30: 50c rose red (year=1889)
  Scott #31: 1p blue (year=1889)
  Scott #32: 2p dull violet (year=1889)
  Scott #32a: 2p dull violet (ye
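
The log above reports a year window, a count of stamps excluded for lacking a year, and a year distribution. The sketch below shows one way such a filter could work. It is an illustration under assumptions, not the notebook's `build_candidate_pool`: the name `filter_by_year_window`, the use of a plain integer `year` field, and the sample data are all hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple


def filter_by_year_window(stamps: List[Dict[str, Any]], target_year: int,
                          window: int) -> Tuple[List[Dict[str, Any]], int]:
    """Keep stamps within ±window years of target_year.

    Returns the kept stamps and the number skipped for lacking a usable year.
    """
    kept: List[Dict[str, Any]] = []
    missing_year = 0
    for stamp in stamps:
        year = stamp.get("year")
        if not isinstance(year, int):
            missing_year += 1
            continue
        if abs(year - target_year) <= window:
            kept.append(stamp)
    return kept, missing_year


# Made-up sample data, for illustration only.
sample_stamps = [
    {"scott_number": "23", "year": 1889},
    {"scott_number": "16", "year": 1881},
    {"scott_number": "X1"},  # no year, so it is excluded and counted
    {"scott_number": "33", "year": 1892},
]
kept, missing_year = filter_by_year_window(sample_stamps, target_year=1892, window=6)
distribution = dict(sorted(Counter(s["year"] for s in kept).items()))
print(f"Found {len(kept)} candidates, excluded {missing_year} without a year")
print(f"Year distribution: {distribution}")
```

A wider window increases recall at the price of a longer prompt, which is why the batch run further down uses a smaller value than the exploratory call above.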

### Test

In [None]:
len(mena_parsed_catalog)

In [None]:
# --------------------------
# Example usage
# --------------------------


mena_issue = mena_parsed_catalog[612]
print(f"Loaded Mena issue: {mena_issue['issue_data']['title']}")
print(f"Number of stamps: {len(mena_issue['stamps'])}")
print(mena_issue)

scott_raw_candidates = build_candidate_pool(mena_issue, all_scott_stamps, 2)
scott_str_candidates = []
scott_candidates = []
for c in scott_raw_candidates:
    candidate_str = f"  Scott #{c.get('scott_number')}: {c.get('denomination')} {c.get('color')} (year={extract_scott_year(c)})"
    scott_entry = {
        "scott_number" : candidate_str,
        "scott_number_data": c
        
    }
    #scott_str_candidates.append(f"  Scott #{c.get('scott_number')}: {c.get('denomination')} {c.get('color')} (year={extract_scott_year(c)})")
    scott_candidates.append(scott_entry)

# Replace with your real key via env or secret manager
import os  # needed here because this cell may run before the batch cell that imports os
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "your-api-key")

matcher = MenaScottMatcher(openai_api_key=OPENAI_API_KEY)

# Demo 1: matches regular + official
mena_issue_demo = mena_issue
candidates_demo = scott_candidates
result = matcher.match(mena_issue_demo, candidates_demo)
print(result)
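
Because the equivalences come from an LLM, it can be worth checking that every returned Scott number was actually offered as a candidate before trusting the result. The sketch below is one possible post-check, assuming candidates use the compound format built above (`scott_number` plus `scott_number_data`). `drop_unknown_scott_numbers` and the demo data are hypothetical and not part of the matcher.

```python
from typing import Any, Dict, List


def drop_unknown_scott_numbers(result: Dict[str, Any],
                               candidates: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Keep only equivalences whose Scott number appears among the candidates."""
    known = {
        str((c.get("scott_number_data") or {}).get("scott_number", "")).strip()
        for c in candidates
        if isinstance(c, dict)
    }
    known.discard("")  # ignore candidates that carry no Scott number
    kept = [e for e in result.get("equivalences", []) if e.get("scott") in known]
    return {"issue_id": result.get("issue_id", ""), "equivalences": kept}


# Made-up demo data: the second equivalence points at a number never offered.
demo_candidates = [
    {"scott_number": "  Scott #344: 3col turq blue (year=1986)",
     "scott_number_data": {"scott_number": "344"}},
]
demo_result = {
    "issue_id": "CR-1986-PRESIDENT-PORTRAITS",
    "equivalences": [
        {"mena": "348", "scott": "344", "confidence": "high"},
        {"mena": "349", "scott": "999", "confidence": "low"},
    ],
}
print(drop_unknown_scott_numbers(demo_result, demo_candidates))
```

Dropping unknown numbers silently is a design choice; logging them instead would make it easier to spot issues where the candidate pool was too narrow.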

In [30]:
# -*- coding: utf-8 -*-
import os, json, time, datetime, traceback
from time import sleep
from concurrent.futures import ThreadPoolExecutor, as_completed
from itertools import islice
from typing import List, Dict, Any, Tuple
from tqdm import tqdm

# =========================================================
# Configuration / Parameters
# =========================================================
# Range of issues to process (1-based so the logs are easier to read)
start_num        = 201          # example: start where you ran your test (612 was a 0-based example)
final_num        = 300          # or whatever upper bound you like
start_idx        = start_num - 1

subbatch_size    = 8
max_workers      = 4
max_retries      = 2
backoff_base_sec = 2
top_k_candidates = 2            # the "2" used when building the pool; the log reports it as a ± year window

OUT_DIR          = "results/match_catalogues"
OUT_OK           = os.path.join(OUT_DIR, f"mena_match_results_{start_num}-{final_num}.json")
OUT_ERR          = os.path.join(OUT_DIR, f"mena_match_errors_{start_num}-{final_num}.json")

OPENAI_API_KEY   = os.getenv("OPENAI_API_KEY", "your-api-key")
matcher          = MenaScottMatcher(openai_api_key=OPENAI_API_KEY)

# =========================================================
# Helpers
# =========================================================
def chunked(it, size):
    it = iter(it)
    while True:
        batch = list(islice(it, size))
        if not batch:
            break
        yield batch

def build_candidates_for_issue(mena_issue: Dict[str, Any]) -> List[Dict[str, Any]]:
    """
    Apply the example logic above to turn the raw pool into the 'candidates'
    list consumed by the matcher. Each entry keeps the format:
      { "scott_number": <human-readable string>,
        "scott_number_data": <original Scott dict> }
    """
    raw = build_candidate_pool(mena_issue, all_scott_stamps, top_k_candidates)
    candidates = []
    for c in raw:
        cand_str = f"  Scott #{c.get('scott_number')}: {c.get('denomination')} {c.get('color')} (year={extract_scott_year(c)})"
        candidates.append({
            "scott_number": cand_str,
            "scott_number_data": c
        })
    return candidates

def safe_match(issue: Dict[str, Any], candidates: List[Dict[str, Any]],
               max_retries: int = 2) -> Tuple[bool, Any]:
    """
    Wrap matcher.match with retries and exponential backoff.
    Returns (ok, data | error_dict).
    """
    # An empty candidate list will not change between attempts, so fail fast
    # instead of sleeping through pointless retries.
    if not candidates:
        return False, {
            "error": "ValueError: No Scott candidates for this issue",
            "traceback": ""
        }
    attempt = 0
    while True:
        try:
            out = matcher.match(issue, candidates)
            return True, out
        except Exception as e:
            attempt += 1
            if attempt > max_retries:
                return False, {
                    "error": f"{type(e).__name__}: {str(e)}",
                    "traceback": traceback.format_exc()
                }
            # Waits backoff_base_sec ** 0, ** 1, ... seconds between attempts
            sleep(backoff_base_sec ** (attempt - 1))

# =========================================================
# Input preparation
# =========================================================
inputs: List[Dict[str, Any]] = []
for i, mena_issue in enumerate(mena_parsed_catalog[start_idx:final_num], start_num):
    try:
        issue_id = mena_issue.get("issue_data", {}).get("issue_id") or f"idx-{i}"
        candidates = build_candidates_for_issue(mena_issue)
        inputs.append({
            "i": i,
            "issue_id": issue_id,
            "title": mena_issue.get("issue_data", {}).get("title"),
            "issue": mena_issue,
            "candidates": candidates
        })
    except Exception as e:
        # If building the candidates failed, record an empty input so it is reported as an error later
        inputs.append({
            "i": i,
            "issue_id": mena_issue.get("issue_data", {}).get("issue_id") or f"idx-{i}",
            "title": mena_issue.get("issue_data", {}).get("title"),
            "issue": mena_issue,
            "candidates": [],
            "prep_error": f"{type(e).__name__}: {str(e)}",
            "prep_traceback": traceback.format_exc()
        })

# =========================================================
# Execution in concurrent waves
# =========================================================
os.makedirs(OUT_DIR, exist_ok=True)

# Final dictionary: issue_id -> match_result
results_by_issue: Dict[str, Any] = {}
# List of errors
error_issues: List[Dict[str, Any]] = []

t0_global   = time.perf_counter()
total_items = len(inputs)

with tqdm(total=total_items, desc="Matcheando issues (oleadas)", unit="iss") as pbar:
    base = 0
    for sub in chunked(inputs, subbatch_size):
        t_oleada = time.perf_counter()
        futures = {}

        with ThreadPoolExecutor(max_workers=max_workers) as ex:
            for j, item in enumerate(sub):
                idx_global = base + j
                # If preparation failed, skip the model call and record the item directly as an error
                if item.get("prep_error"):
                    error_issues.append({
                        "issue_id": item["issue_id"],
                        "title": item.get("title"),
                        "error": f"PREP_ERROR: {item['prep_error']}",
                        "traceback": item.get("prep_traceback", "")
                    })
                    pbar.update(1)
                    continue

                futures[ex.submit(safe_match, item["issue"], item["candidates"], max_retries)] = item

            for future in as_completed(futures):
                item = futures[future]
                try:
                    ok, data = future.result()
                    if ok:
                        results_by_issue[item["issue_id"]] = data
                    else:
                        error_issues.append({
                            "issue_id": item["issue_id"],
                            "title": item.get("title"),
                            "error": data.get("error", "UNKNOWN"),
                            "traceback": data.get("traceback", "")
                        })
                except Exception as e:
                    error_issues.append({
                        "issue_id": item["issue_id"],
                        "title": item.get("title"),
                        "error": f"FUTURE_FAILURE: {type(e).__name__}: {str(e)}",
                        "traceback": traceback.format_exc()
                    })
                finally:
                    pbar.update(1)

        # Per-wave metrics
        iter_sec = time.perf_counter() - t_oleada
        done     = min(base + len(sub), total_items)
        elapsed  = time.perf_counter() - t0_global
        avg      = elapsed / max(1, done)
        remaining_sec = avg * (total_items - done)
        eta = datetime.timedelta(seconds=max(0, int(remaining_sec)))
        pbar.set_postfix(oleada_s=f"{iter_sec:.2f}", avg_s=f"{avg:.2f}", eta=str(eta))

        base += len(sub)

elapsed = time.perf_counter() - t0_global
print(f"Tiempo total: {datetime.timedelta(seconds=int(elapsed))}")

# =========================================================
# Saving results
# =========================================================
with open(OUT_OK, "w", encoding="utf-8") as f:
    json.dump(results_by_issue, f, indent=2, ensure_ascii=False)

with open(OUT_ERR, "w", encoding="utf-8") as f:
    json.dump(error_issues, f, indent=2, ensure_ascii=False)

print(f"OK (issues con match): {len(results_by_issue)} | Errores: {len(error_issues)}")
print(f"Guardados:\n- {OUT_OK}\n- {OUT_ERR}")




Found 68 Scott candidates for year 2000 (±2 years)
Excluded 366 stamps without year information
Year distribution: {1998: 23, 1999: 7, 2000: 16, 2001: 9, 2002: 13}

All candidates overall:
  Scott #503: 10col multicolored (year=1998)
  Scott #504: 30col multicolored (year=1998)
  Scott #505: 45col multicolored (year=1998)
  Scott #506: 50col multicolored (year=1998)
  Scott #504a: 30col multicolored (year=1998)
  Scott #508: 10col multicolored (year=1998)
  Scott #509: 15col multicolored (year=1998)
  Scott #510: 20col multicolored (year=1998)
  Scott #511: 30col multicolored (year=1998)
  Scott #512: 35col multicolored (year=1998)
  Scott #513: 40col multicolored (year=1998)
  Scott #514: 45col multicolored (year=1998)
  Scott #515: 50col multicolored (year=1998)
  Scott #516: 55col multicolored (year=1998)
  Scott #517: 60col multicolored (year=1998)
  Scott #518: 50col multicolored (year=1998)
  Scott #521: Value 1 multi (year=1998)
  Scott #521a: Value 1 multi (year=1998)
  Scott #

Matcheando issues (oleadas):   1%|          | 1/100 [00:06<10:35,  6.42s/iss]

[Callback] prompt_tokens=10648 completion_tokens=237 total_tokens=10885 total_cost=0.0
Cost (USD): 0.003136


Matcheando issues (oleadas):   2%|▏         | 2/100 [00:09<06:48,  4.17s/iss]

[Callback] prompt_tokens=10728 completion_tokens=305 total_tokens=11033 total_cost=0.0
Cost (USD): 0.003292


Matcheando issues (oleadas):   3%|▎         | 3/100 [00:11<05:32,  3.43s/iss]

[Callback] prompt_tokens=10810 completion_tokens=415 total_tokens=11225 total_cost=0.0
Cost (USD): 0.0035325


Matcheando issues (oleadas):   4%|▍         | 4/100 [00:13<04:41,  2.94s/iss]

[Callback] prompt_tokens=9973 completion_tokens=175 total_tokens=10148 total_cost=0.0
Cost (USD): 0.00284325


Matcheando issues (oleadas):   5%|▌         | 5/100 [00:14<03:27,  2.18s/iss]

[Callback] prompt_tokens=10826 completion_tokens=442 total_tokens=11268 total_cost=0.0
Cost (USD): 0.0035904999999999995


Matcheando issues (oleadas):   6%|▌         | 6/100 [00:24<07:45,  4.95s/iss]

[Callback] prompt_tokens=10169 completion_tokens=504 total_tokens=10673 total_cost=0.0
Cost (USD): 0.0035502499999999996


Matcheando issues (oleadas):   7%|▋         | 7/100 [00:39<12:23,  8.00s/iss]

[Callback] prompt_tokens=10544 completion_tokens=573 total_tokens=11117 total_cost=0.0
Cost (USD): 0.0037819999999999998


Matcheando issues (oleadas):   8%|▊         | 8/100 [00:49<13:14,  8.64s/iss, avg_s=6.15, eta=0:09:26, oleada_s=49.20]

[Callback] prompt_tokens=10090 completion_tokens=504 total_tokens=10594 total_cost=0.0
Cost (USD): 0.0035305


Matcheando issues (oleadas):   9%|▉         | 9/100 [00:53<11:10,  7.37s/iss, avg_s=6.15, eta=0:09:26, oleada_s=49.20]

[Callback] prompt_tokens=10002 completion_tokens=179 total_tokens=10181 total_cost=0.0
Cost (USD): 0.0028584999999999995


Matcheando issues (oleadas):  10%|█         | 10/100 [01:00<10:48,  7.20s/iss, avg_s=6.15, eta=0:09:26, oleada_s=49.20]

[Callback] prompt_tokens=11239 completion_tokens=430 total_tokens=11669 total_cost=0.0
Cost (USD): 0.00366975


Matcheando issues (oleadas):  11%|█         | 11/100 [01:01<07:41,  5.18s/iss, avg_s=6.15, eta=0:09:26, oleada_s=49.20]

[Callback] prompt_tokens=10073 completion_tokens=495 total_tokens=10568 total_cost=0.0
Cost (USD): 0.00350825


Matcheando issues (oleadas):  12%|█▏        | 12/100 [01:03<06:08,  4.19s/iss, avg_s=6.15, eta=0:09:26, oleada_s=49.20]

[Callback] prompt_tokens=11590 completion_tokens=111 total_tokens=11701 total_cost=0.0
Cost (USD): 0.0031195


Matcheando issues (oleadas):  13%|█▎        | 13/100 [01:06<05:44,  3.96s/iss, avg_s=6.15, eta=0:09:26, oleada_s=49.20]

[Callback] prompt_tokens=11329 completion_tokens=556 total_tokens=11885 total_cost=0.0
Cost (USD): 0.00394425


Matcheando issues (oleadas):  14%|█▍        | 14/100 [01:08<04:35,  3.21s/iss, avg_s=6.15, eta=0:09:26, oleada_s=49.20]

[Callback] prompt_tokens=11209 completion_tokens=239 total_tokens=11448 total_cost=0.0
Cost (USD): 0.00328025


Matcheando issues (oleadas):  15%|█▌        | 15/100 [01:16<06:41,  4.73s/iss, avg_s=6.15, eta=0:09:26, oleada_s=49.20]

[Callback] prompt_tokens=11749 completion_tokens=672 total_tokens=12421 total_cost=0.0
Cost (USD): 0.0042812499999999995


Matcheando issues (oleadas):  16%|█▌        | 16/100 [01:21<06:59,  5.00s/iss, avg_s=5.12, eta=0:07:10, oleada_s=32.70]

[Callback] prompt_tokens=11321 completion_tokens=506 total_tokens=11827 total_cost=0.0
Cost (USD): 0.0038422499999999997


Matcheando issues (oleadas):  17%|█▋        | 17/100 [01:26<06:39,  4.81s/iss, avg_s=5.12, eta=0:07:10, oleada_s=32.70]

[Callback] prompt_tokens=11267 completion_tokens=182 total_tokens=11449 total_cost=0.0
Cost (USD): 0.0031807499999999995


Matcheando issues (oleadas):  18%|█▊        | 18/100 [01:28<05:21,  3.92s/iss, avg_s=5.12, eta=0:07:10, oleada_s=32.70]

[Callback] prompt_tokens=11604 completion_tokens=215 total_tokens=11819 total_cost=0.0
Cost (USD): 0.0033309999999999998


Matcheando issues (oleadas):  19%|█▉        | 19/100 [01:32<05:16,  3.91s/iss, avg_s=5.12, eta=0:07:10, oleada_s=32.70]

[Callback] prompt_tokens=11124 completion_tokens=447 total_tokens=11571 total_cost=0.0
Cost (USD): 0.003675


Matcheando issues (oleadas):  20%|██        | 20/100 [01:33<04:17,  3.22s/iss, avg_s=5.12, eta=0:07:10, oleada_s=32.70]

[Callback] prompt_tokens=11171 completion_tokens=445 total_tokens=11616 total_cost=0.0
Cost (USD): 0.0036827499999999994


Matcheando issues (oleadas):  21%|██        | 21/100 [01:35<03:52,  2.94s/iss, avg_s=5.12, eta=0:07:10, oleada_s=32.70]

[Callback] prompt_tokens=11192 completion_tokens=383 total_tokens=11575 total_cost=0.0
Cost (USD): 0.0035639999999999995


Matcheando issues (oleadas):  22%|██▏       | 22/100 [01:38<03:47,  2.91s/iss, avg_s=5.12, eta=0:07:10, oleada_s=32.70]

[Callback] prompt_tokens=11020 completion_tokens=160 total_tokens=11180 total_cost=0.0
Cost (USD): 0.003075


Matcheando issues (oleadas):  24%|██▍       | 24/100 [01:39<02:53,  2.29s/iss, avg_s=4.15, eta=0:05:15, oleada_s=17.74]

[Callback] prompt_tokens=11203 completion_tokens=383 total_tokens=11586 total_cost=0.0
Cost (USD): 0.0035667499999999996
[Callback] prompt_tokens=12260 completion_tokens=440 total_tokens=12700 total_cost=0.0
Cost (USD): 0.003945


Matcheando issues (oleadas):  25%|██▌       | 25/100 [01:43<02:46,  2.23s/iss, avg_s=4.15, eta=0:05:15, oleada_s=17.74]

[Callback] prompt_tokens=11029 completion_tokens=176 total_tokens=11205 total_cost=0.0
Cost (USD): 0.0031092499999999996


Matcheando issues (oleadas):  26%|██▌       | 26/100 [01:44<02:15,  1.83s/iss, avg_s=4.15, eta=0:05:15, oleada_s=17.74]

[Callback] prompt_tokens=11047 completion_tokens=112 total_tokens=11159 total_cost=0.0
Cost (USD): 0.0029857499999999997


Matcheando issues (oleadas):  27%|██▋       | 27/100 [01:46<02:11,  1.80s/iss, avg_s=4.15, eta=0:05:15, oleada_s=17.74]

[Callback] prompt_tokens=12313 completion_tokens=215 total_tokens=12528 total_cost=0.0
Cost (USD): 0.0035082499999999996


Matcheando issues (oleadas):  28%|██▊       | 28/100 [01:50<03:03,  2.54s/iss, avg_s=4.15, eta=0:05:15, oleada_s=17.74]

[Callback] prompt_tokens=11991 completion_tokens=237 total_tokens=12228 total_cost=0.0
Cost (USD): 0.00347175


Matcheando issues (oleadas):  29%|██▉       | 29/100 [01:54<03:16,  2.77s/iss, avg_s=4.15, eta=0:05:15, oleada_s=17.74]

[Callback] prompt_tokens=12061 completion_tokens=248 total_tokens=12309 total_cost=0.0
Cost (USD): 0.0035112499999999996


Matcheando issues (oleadas):  30%|███       | 30/100 [01:55<02:36,  2.24s/iss, avg_s=4.15, eta=0:05:15, oleada_s=17.74]

[Callback] prompt_tokens=12355 completion_tokens=468 total_tokens=12823 total_cost=0.0
Cost (USD): 0.00402475


Matcheando issues (oleadas):  31%|███       | 31/100 [02:01<04:01,  3.50s/iss, avg_s=4.15, eta=0:05:15, oleada_s=17.74]

[Callback] prompt_tokens=12632 completion_tokens=1009 total_tokens=13641 total_cost=0.0
Cost (USD): 0.005176


Matcheando issues (oleadas):  32%|███▏      | 32/100 [02:03<03:23,  3.00s/iss, avg_s=3.86, eta=0:04:22, oleada_s=23.78]

[Callback] prompt_tokens=13992 completion_tokens=521 total_tokens=14513 total_cost=0.0
Cost (USD): 0.00454


Matcheando issues (oleadas):  33%|███▎      | 33/100 [02:09<04:13,  3.78s/iss, avg_s=3.86, eta=0:04:22, oleada_s=23.78]

[Callback] prompt_tokens=13924 completion_tokens=224 total_tokens=14148 total_cost=0.0
Cost (USD): 0.003928999999999999


Matcheando issues (oleadas):  34%|███▍      | 34/100 [02:10<03:28,  3.17s/iss, avg_s=3.86, eta=0:04:22, oleada_s=23.78]

[Callback] prompt_tokens=14027 completion_tokens=196 total_tokens=14223 total_cost=0.0
Cost (USD): 0.00389875


Matcheando issues (oleadas):  35%|███▌      | 35/100 [02:11<02:40,  2.47s/iss, avg_s=3.86, eta=0:04:22, oleada_s=23.78]

[Callback] prompt_tokens=13911 completion_tokens=381 total_tokens=14292 total_cost=0.0
Cost (USD): 0.00423975


Matcheando issues (oleadas):  36%|███▌      | 36/100 [02:13<02:24,  2.26s/iss, avg_s=3.86, eta=0:04:22, oleada_s=23.78]

[Callback] prompt_tokens=14075 completion_tokens=451 total_tokens=14526 total_cost=0.0
Cost (USD): 0.004420749999999999


Matcheando issues (oleadas):  37%|███▋      | 37/100 [02:15<02:13,  2.12s/iss, avg_s=3.86, eta=0:04:22, oleada_s=23.78]

[Callback] prompt_tokens=13806 completion_tokens=244 total_tokens=14050 total_cost=0.0
Cost (USD): 0.003939499999999999


Matcheando issues (oleadas):  38%|███▊      | 38/100 [02:17<02:14,  2.16s/iss, avg_s=3.86, eta=0:04:22, oleada_s=23.78]

[Callback] prompt_tokens=13437 completion_tokens=179 total_tokens=13616 total_cost=0.0
Cost (USD): 0.0037172499999999996


Matcheando issues (oleadas):  39%|███▉      | 39/100 [02:18<01:50,  1.81s/iss, avg_s=3.86, eta=0:04:22, oleada_s=23.78]

[Callback] prompt_tokens=13434 completion_tokens=173 total_tokens=13607 total_cost=0.0
Cost (USD): 0.0037045


Matcheando issues (oleadas):  40%|████      | 40/100 [02:22<02:21,  2.37s/iss, avg_s=3.55, eta=0:03:33, oleada_s=18.67]

[Callback] prompt_tokens=13987 completion_tokens=422 total_tokens=14409 total_cost=0.0
Cost (USD): 0.00434075


Matcheando issues (oleadas):  41%|████      | 41/100 [02:26<02:59,  3.04s/iss, avg_s=3.55, eta=0:03:33, oleada_s=18.67]

[Callback] prompt_tokens=13445 completion_tokens=177 total_tokens=13622 total_cost=0.0
Cost (USD): 0.00371525


Matcheando issues (oleadas):  42%|████▏     | 42/100 [02:31<03:32,  3.66s/iss, avg_s=3.55, eta=0:03:33, oleada_s=18.67]

[Callback] prompt_tokens=13763 completion_tokens=332 total_tokens=14095 total_cost=0.0
Cost (USD): 0.00410475
[Callback] prompt_tokens=14621 completion_tokens=493 total_tokens=15114 total_cost=0.0
Cost (USD): 0.0046412499999999995


Matcheando issues (oleadas):  44%|████▍     | 44/100 [02:36<02:50,  3.05s/iss, avg_s=3.55, eta=0:03:33, oleada_s=18.67]

[Callback] prompt_tokens=13538 completion_tokens=506 total_tokens=14044 total_cost=0.0
Cost (USD): 0.004396499999999999


Matcheando issues (oleadas):  46%|████▌     | 46/100 [02:37<01:43,  1.91s/iss, avg_s=3.55, eta=0:03:33, oleada_s=18.67]

[Callback] prompt_tokens=13624 completion_tokens=693 total_tokens=14317 total_cost=0.0
Cost (USD): 0.004791999999999999
[Callback] prompt_tokens=12356 completion_tokens=177 total_tokens=12533 total_cost=0.0
Cost (USD): 0.003443


Matcheando issues (oleadas):  47%|████▋     | 47/100 [02:41<02:14,  2.53s/iss, avg_s=3.55, eta=0:03:33, oleada_s=18.67]

[Callback] prompt_tokens=12351 completion_tokens=110 total_tokens=12461 total_cost=0.0
Cost (USD): 0.00330775


Matcheando issues (oleadas):  48%|████▊     | 48/100 [02:51<03:52,  4.48s/iss, avg_s=3.57, eta=0:03:05, oleada_s=29.21]

[Callback] prompt_tokens=13529 completion_tokens=748 total_tokens=14277 total_cost=0.0
Cost (USD): 0.00487825


Matcheando issues (oleadas):  49%|████▉     | 49/100 [02:56<04:02,  4.75s/iss, avg_s=3.57, eta=0:03:05, oleada_s=29.21]

[Callback] prompt_tokens=12474 completion_tokens=192 total_tokens=12666 total_cost=0.0
Cost (USD): 0.0035024999999999995


Matcheando issues (oleadas):  50%|█████     | 50/100 [03:00<03:36,  4.33s/iss, avg_s=3.57, eta=0:03:05, oleada_s=29.21]

[Callback] prompt_tokens=12507 completion_tokens=319 total_tokens=12826 total_cost=0.0
Cost (USD): 0.00376475


Matcheando issues (oleadas):  51%|█████     | 51/100 [03:00<02:39,  3.25s/iss, avg_s=3.57, eta=0:03:05, oleada_s=29.21]

[Callback] prompt_tokens=12483 completion_tokens=254 total_tokens=12737 total_cost=0.0
Cost (USD): 0.00362875


Matcheando issues (oleadas):  52%|█████▏    | 52/100 [03:01<02:08,  2.69s/iss, avg_s=3.57, eta=0:03:05, oleada_s=29.21]

[Callback] prompt_tokens=12818 completion_tokens=422 total_tokens=13240 total_cost=0.0
Cost (USD): 0.0040485


Matcheando issues (oleadas):  53%|█████▎    | 53/100 [03:07<02:48,  3.58s/iss, avg_s=3.57, eta=0:03:05, oleada_s=29.21]

[Callback] prompt_tokens=12518 completion_tokens=394 total_tokens=12912 total_cost=0.0
Cost (USD): 0.0039175


Matcheando issues (oleadas):  54%|█████▍    | 54/100 [03:12<03:02,  3.97s/iss, avg_s=3.57, eta=0:03:05, oleada_s=29.21]

[Callback] prompt_tokens=12579 completion_tokens=474 total_tokens=13053 total_cost=0.0
Cost (USD): 0.00409275


Matcheando issues (oleadas):  55%|█████▌    | 55/100 [03:16<03:04,  4.10s/iss, avg_s=3.57, eta=0:03:05, oleada_s=29.21]

[Callback] prompt_tokens=12555 completion_tokens=782 total_tokens=13337 total_cost=0.0
Cost (USD): 0.00470275


Matcheando issues (oleadas):  56%|█████▌    | 56/100 [03:19<02:41,  3.68s/iss, avg_s=3.57, eta=0:02:36, oleada_s=28.37]

[Callback] prompt_tokens=12882 completion_tokens=673 total_tokens=13555 total_cost=0.0
Cost (USD): 0.004566499999999999


Matcheando issues (oleadas):  57%|█████▋    | 57/100 [03:25<03:07,  4.37s/iss, avg_s=3.57, eta=0:02:36, oleada_s=28.37]

[Callback] prompt_tokens=12387 completion_tokens=116 total_tokens=12503 total_cost=0.0
Cost (USD): 0.0033287499999999997


Matcheando issues (oleadas):  58%|█████▊    | 58/100 [03:26<02:14,  3.20s/iss, avg_s=3.57, eta=0:02:36, oleada_s=28.37]

[Callback] prompt_tokens=12414 completion_tokens=253 total_tokens=12667 total_cost=0.0
Cost (USD): 0.0036095


Matcheando issues (oleadas):  59%|█████▉    | 59/100 [03:26<01:36,  2.36s/iss, avg_s=3.57, eta=0:02:36, oleada_s=28.37]

[Callback] prompt_tokens=12475 completion_tokens=254 total_tokens=12729 total_cost=0.0
Cost (USD): 0.0036267499999999998


Matcheando issues (oleadas):  60%|██████    | 60/100 [03:30<01:50,  2.77s/iss, avg_s=3.57, eta=0:02:36, oleada_s=28.37]

[Callback] prompt_tokens=13065 completion_tokens=316 total_tokens=13381 total_cost=0.0
Cost (USD): 0.00389825


Matcheando issues (oleadas):  61%|██████    | 61/100 [03:31<01:28,  2.27s/iss, avg_s=3.57, eta=0:02:36, oleada_s=28.37]

[Callback] prompt_tokens=12393 completion_tokens=180 total_tokens=12573 total_cost=0.0
Cost (USD): 0.00345825


Matcheando issues (oleadas):  62%|██████▏   | 62/100 [03:33<01:24,  2.22s/iss, avg_s=3.57, eta=0:02:36, oleada_s=28.37]

[Callback] prompt_tokens=12527 completion_tokens=228 total_tokens=12755 total_cost=0.0
Cost (USD): 0.00358775


Matcheando issues (oleadas):  63%|██████▎   | 63/100 [03:36<01:31,  2.47s/iss, avg_s=3.57, eta=0:02:36, oleada_s=28.37]

[Callback] prompt_tokens=12684 completion_tokens=467 total_tokens=13151 total_cost=0.0
Cost (USD): 0.004104999999999999


Matcheando issues (oleadas):  64%|██████▍   | 64/100 [03:41<01:55,  3.21s/iss, avg_s=3.46, eta=0:02:04, oleada_s=21.79]

[Callback] prompt_tokens=12766 completion_tokens=408 total_tokens=13174 total_cost=0.0
Cost (USD): 0.0040075


Matcheando issues (oleadas):  65%|██████▌   | 65/100 [03:46<02:08,  3.66s/iss, avg_s=3.46, eta=0:02:04, oleada_s=21.79]

[Callback] prompt_tokens=13268 completion_tokens=112 total_tokens=13380 total_cost=0.0
Cost (USD): 0.0035409999999999994


Matcheando issues (oleadas):  66%|██████▌   | 66/100 [03:47<01:35,  2.82s/iss, avg_s=3.46, eta=0:02:04, oleada_s=21.79]

[Callback] prompt_tokens=13235 completion_tokens=106 total_tokens=13341 total_cost=0.0
Cost (USD): 0.00352075


Matcheando issues (oleadas):  67%|██████▋   | 67/100 [03:49<01:30,  2.75s/iss, avg_s=3.46, eta=0:02:04, oleada_s=21.79]

[Callback] prompt_tokens=13124 completion_tokens=31 total_tokens=13155 total_cost=0.0
Cost (USD): 0.003343


Matcheando issues (oleadas):  68%|██████▊   | 68/100 [03:50<01:08,  2.13s/iss, avg_s=3.46, eta=0:02:04, oleada_s=21.79]

[Callback] prompt_tokens=12543 completion_tokens=356 total_tokens=12899 total_cost=0.0
Cost (USD): 0.0038477499999999996


Matcheando issues (oleadas):  69%|██████▉   | 69/100 [03:51<00:54,  1.77s/iss, avg_s=3.46, eta=0:02:04, oleada_s=21.79]

[Callback] prompt_tokens=12777 completion_tokens=471 total_tokens=13248 total_cost=0.0
Cost (USD): 0.00413625


Matcheando issues (oleadas):  70%|███████   | 70/100 [03:53<00:56,  1.88s/iss, avg_s=3.46, eta=0:02:04, oleada_s=21.79]

[Callback] prompt_tokens=13274 completion_tokens=178 total_tokens=13452 total_cost=0.0
Cost (USD): 0.0036745


Matcheando issues (oleadas):  71%|███████   | 71/100 [03:55<00:59,  2.06s/iss, avg_s=3.46, eta=0:02:04, oleada_s=21.79]

[Callback] prompt_tokens=13268 completion_tokens=114 total_tokens=13382 total_cost=0.0
Cost (USD): 0.0035449999999999995


Matcheando issues (oleadas):  72%|███████▏  | 72/100 [03:59<01:14,  2.67s/iss, avg_s=3.33, eta=0:01:33, oleada_s=18.49]

[Callback] prompt_tokens=13788 completion_tokens=366 total_tokens=14154 total_cost=0.0
Cost (USD): 0.004179


Matcheando issues (oleadas):  73%|███████▎  | 73/100 [04:05<01:35,  3.53s/iss, avg_s=3.33, eta=0:01:33, oleada_s=18.49]

[Callback] prompt_tokens=13847 completion_tokens=248 total_tokens=14095 total_cost=0.0
Cost (USD): 0.00395775


Matcheando issues (oleadas):  74%|███████▍  | 74/100 [04:07<01:21,  3.14s/iss, avg_s=3.33, eta=0:01:33, oleada_s=18.49]

[Callback] prompt_tokens=11307 completion_tokens=304 total_tokens=11611 total_cost=0.0
Cost (USD): 0.00343475


Matcheando issues (oleadas):  75%|███████▌  | 75/100 [04:12<01:27,  3.51s/iss, avg_s=3.33, eta=0:01:33, oleada_s=18.49]

[Callback] prompt_tokens=13379 completion_tokens=380 total_tokens=13759 total_cost=0.0
Cost (USD): 0.00410475


Matcheando issues (oleadas):  76%|███████▌  | 76/100 [04:15<01:25,  3.57s/iss, avg_s=3.33, eta=0:01:33, oleada_s=18.49]

[Callback] prompt_tokens=11679 completion_tokens=346 total_tokens=12025 total_cost=0.0
Cost (USD): 0.00361175


Matcheando issues (oleadas):  77%|███████▋  | 77/100 [04:17<01:07,  2.94s/iss, avg_s=3.33, eta=0:01:33, oleada_s=18.49]

[Callback] prompt_tokens=11458 completion_tokens=190 total_tokens=11648 total_cost=0.0
Cost (USD): 0.0032445


Matcheando issues (oleadas):  78%|███████▊  | 78/100 [04:19<01:00,  2.74s/iss, avg_s=3.33, eta=0:01:33, oleada_s=18.49]

[Callback] prompt_tokens=11415 completion_tokens=445 total_tokens=11860 total_cost=0.0
Cost (USD): 0.0037437499999999997


Matcheando issues (oleadas):  79%|███████▉  | 79/100 [04:21<00:53,  2.55s/iss, avg_s=3.33, eta=0:01:33, oleada_s=18.49]

[Callback] prompt_tokens=11663 completion_tokens=280 total_tokens=11943 total_cost=0.0
Cost (USD): 0.0034757499999999997


Matcheando issues (oleadas):  80%|████████  | 80/100 [04:24<00:55,  2.78s/iss, avg_s=3.31, eta=0:01:06, oleada_s=25.03]

[Callback] prompt_tokens=11146 completion_tokens=347 total_tokens=11493 total_cost=0.0
Cost (USD): 0.0034805


Matcheando issues (oleadas):  81%|████████  | 81/100 [04:29<01:05,  3.45s/iss, avg_s=3.31, eta=0:01:06, oleada_s=25.03]

[Callback] prompt_tokens=11061 completion_tokens=214 total_tokens=11275 total_cost=0.0
Cost (USD): 0.00319325


Matcheando issues (oleadas):  82%|████████▏ | 82/100 [04:31<00:51,  2.88s/iss, avg_s=3.31, eta=0:01:06, oleada_s=25.03]

[Callback] prompt_tokens=10890 completion_tokens=249 total_tokens=11139 total_cost=0.0
Cost (USD): 0.0032205000000000003


Matcheando issues (oleadas):  83%|████████▎ | 83/100 [04:33<00:43,  2.55s/iss, avg_s=3.31, eta=0:01:06, oleada_s=25.03]

[Callback] prompt_tokens=10871 completion_tokens=315 total_tokens=11186 total_cost=0.0
Cost (USD): 0.0033477499999999996


Matcheando issues (oleadas):  84%|████████▍ | 84/100 [04:35<00:39,  2.48s/iss, avg_s=3.31, eta=0:01:06, oleada_s=25.03]

[Callback] prompt_tokens=10961 completion_tokens=198 total_tokens=11159 total_cost=0.0
Cost (USD): 0.00313625


Matcheando issues (oleadas):  85%|████████▌ | 85/100 [04:39<00:44,  2.98s/iss, avg_s=3.31, eta=0:01:06, oleada_s=25.03]

[Callback] prompt_tokens=11041 completion_tokens=345 total_tokens=11386 total_cost=0.0
Cost (USD): 0.0034502499999999998


Matcheando issues (oleadas):  86%|████████▌ | 86/100 [04:40<00:30,  2.18s/iss, avg_s=3.31, eta=0:01:06, oleada_s=25.03]

[Callback] prompt_tokens=10758 completion_tokens=362 total_tokens=11120 total_cost=0.0
Cost (USD): 0.0034135


Matcheando issues (oleadas):  87%|████████▋ | 87/100 [04:43<00:31,  2.45s/iss, avg_s=3.31, eta=0:01:06, oleada_s=25.03]

[Callback] prompt_tokens=10866 completion_tokens=250 total_tokens=11116 total_cost=0.0
Cost (USD): 0.0032164999999999997


Matcheando issues (oleadas):  88%|████████▊ | 88/100 [04:44<00:25,  2.14s/iss, avg_s=3.23, eta=0:00:38, oleada_s=19.60]

[Callback] prompt_tokens=11102 completion_tokens=538 total_tokens=11640 total_cost=0.0
Cost (USD): 0.0038515


Matcheando issues (oleadas):  89%|████████▉ | 89/100 [04:48<00:30,  2.73s/iss, avg_s=3.23, eta=0:00:38, oleada_s=19.60]

[Callback] prompt_tokens=11168 completion_tokens=116 total_tokens=11284 total_cost=0.0
Cost (USD): 0.0030239999999999998


Matcheando issues (oleadas):  90%|█████████ | 90/100 [04:50<00:24,  2.43s/iss, avg_s=3.23, eta=0:00:38, oleada_s=19.60]

[Callback] prompt_tokens=10847 completion_tokens=188 total_tokens=11035 total_cost=0.0
Cost (USD): 0.0030877499999999998


Matcheando issues (oleadas):  91%|█████████ | 91/100 [04:51<00:16,  1.89s/iss, avg_s=3.23, eta=0:00:38, oleada_s=19.60]

[Callback] prompt_tokens=11155 completion_tokens=251 total_tokens=11406 total_cost=0.0
Cost (USD): 0.00329075


Matcheando issues (oleadas):  92%|█████████▏| 92/100 [04:51<00:11,  1.45s/iss, avg_s=3.23, eta=0:00:38, oleada_s=19.60]

[Callback] prompt_tokens=11196 completion_tokens=255 total_tokens=11451 total_cost=0.0
Cost (USD): 0.003309


Matcheando issues (oleadas):  93%|█████████▎| 93/100 [04:57<00:19,  2.83s/iss, avg_s=3.23, eta=0:00:38, oleada_s=19.60]

[Callback] prompt_tokens=11209 completion_tokens=382 total_tokens=11591 total_cost=0.0
Cost (USD): 0.00356625


Matcheando issues (oleadas):  94%|█████████▍| 94/100 [04:57<00:12,  2.08s/iss, avg_s=3.23, eta=0:00:38, oleada_s=19.60]

[Callback] prompt_tokens=11277 completion_tokens=356 total_tokens=11633 total_cost=0.0
Cost (USD): 0.0035312499999999997


Matcheando issues (oleadas):  95%|█████████▌| 95/100 [04:59<00:09,  1.90s/iss, avg_s=3.23, eta=0:00:38, oleada_s=19.60]

[Callback] prompt_tokens=10590 completion_tokens=286 total_tokens=10876 total_cost=0.0
Cost (USD): 0.0032194999999999997


Matcheando issues (oleadas):  96%|█████████▌| 96/100 [05:02<00:09,  2.43s/iss, avg_s=3.16, eta=0:00:12, oleada_s=18.42]

[Callback] prompt_tokens=11064 completion_tokens=106 total_tokens=11170 total_cost=0.0
Cost (USD): 0.0029779999999999997


Matcheando issues (oleadas):  97%|█████████▋| 97/100 [05:07<00:08,  2.94s/iss, avg_s=3.16, eta=0:00:12, oleada_s=18.42]

[Callback] prompt_tokens=10479 completion_tokens=97 total_tokens=10576 total_cost=0.0
Cost (USD): 0.00281375
[Callback] prompt_tokens=10417 completion_tokens=111 total_tokens=10528 total_cost=0.0
Cost (USD): 0.0028262499999999998


Matcheando issues (oleadas):  99%|█████████▉| 99/100 [05:07<00:01,  1.67s/iss, avg_s=3.16, eta=0:00:12, oleada_s=18.42]

[Callback] prompt_tokens=10440 completion_tokens=100 total_tokens=10540 total_cost=0.0
Cost (USD): 0.00281


Matcheando issues (oleadas): 100%|██████████| 100/100 [05:20<00:00,  3.20s/iss, avg_s=3.21, eta=0:00:00, oleada_s=17.50]

[Callback] prompt_tokens=10463 completion_tokens=172 total_tokens=10635 total_cost=0.0
Cost (USD): 0.00295975
Tiempo total: 0:05:20
OK (issues con match): 100 | Errores: 0
Guardados:
- results/match_catalogues\mena_match_results_201-300.json
- results/match_catalogues\mena_match_errors_201-300.json
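
**Reading the log above.** The runtime strings are printed in Spanish by the notebook's own code: "Matcheando issues (oleadas)" is roughly "Matching issues (in waves)", `oleada_s` is the duration of the latest wave in seconds, "Tiempo total" is total time, "Errores" is errors, and "Guardados" lists the saved files. The callback always reports `total_cost=0.0`, while the separate `Cost (USD)` line carries the actual figure. Those figures are consistent with a rate of 0.25 USD per million prompt tokens and 2.00 USD per million completion tokens. These rates are inferred from the logged numbers and are not stated anywhere in the notebook, so treat them as an assumption. The sketch below (the function names `estimate_cost` and `summarize_log` are hypothetical) parses a captured log, recomputes each call's cost under that assumption, and compares the total with the sum of the logged values.

```python
import re

# ASSUMPTION: rates inferred from the logged figures, not taken from the notebook.
PROMPT_RATE = 0.25 / 1_000_000      # USD per prompt token
COMPLETION_RATE = 2.00 / 1_000_000  # USD per completion token

# Patterns for the two kinds of lines the callback prints.
CALLBACK_RE = re.compile(r"\[Callback\] prompt_tokens=(\d+) completion_tokens=(\d+)")
COST_RE = re.compile(r"Cost \(USD\): ([0-9.eE+-]+)")


def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Recompute the cost of one call under the assumed rates."""
    return prompt_tokens * PROMPT_RATE + completion_tokens * COMPLETION_RATE


def summarize_log(text: str) -> tuple[int, float, float]:
    """Return (number of calls, recomputed total cost, sum of logged costs)."""
    calls = [(int(p), int(c)) for p, c in CALLBACK_RE.findall(text)]
    logged = [float(x) for x in COST_RE.findall(text)]
    recomputed = sum(estimate_cost(p, c) for p, c in calls)
    return len(calls), recomputed, sum(logged)


# Two entries copied from the log above.
sample = """\
[Callback] prompt_tokens=13924 completion_tokens=224 total_tokens=14148 total_cost=0.0
Cost (USD): 0.003928999999999999
[Callback] prompt_tokens=14027 completion_tokens=196 total_tokens=14223 total_cost=0.0
Cost (USD): 0.00389875
"""

n_calls, recomputed, logged = summarize_log(sample)
print(f"calls={n_calls} recomputed={recomputed:.8f} logged={logged:.8f}")
```

Passing the full captured output to `summarize_log` instead of `sample` would give the total spend for the 201-300 batch. A gap between the recomputed and logged totals would indicate that the assumed rates do not hold for some calls.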



