# Mena to Scott Catalog Matcher - Production Version

**Clean implementation with all fixes applied**

## Features:
- ✅ Multi-signal scoring (category, denomination, color, year, perforation)
- ✅ Handles surcharges and overprints correctly
- ✅ Compound color matching (e.g., "yellow green / dark green")
- ✅ Category-aware matching (Surface Mail, Airmail, Official, etc.)
- ✅ Unique Scott identifiers to handle duplicate numbers across years
- ✅ Hungarian algorithm for optimal assignment
- ✅ Color family similarity matching

## Scoring System (100 points):
- Category match: 10 points (or -50 if wrong)
- Denomination: 35 points
- Color: 30 points
- Year: 25 points (exact match)
- Perforation: 10 points
- Surcharge mismatch: -10 points penalty

## 1. Imports

In [1]:
from pathlib import Path
import json
from typing import Dict, List, Any, Optional
from dataclasses import dataclass
import re
from difflib import SequenceMatcher
import numpy as np
from scipy.optimize import linear_sum_assignment

## 2. Data Structures

In [2]:
@dataclass
class MatchResult:
    """Represents a match between Mena and Scott catalogs"""
    mena_catalog_no: str
    scott_number: str
    confidence: str
    score: float
    signals: Dict[str, float]
    breakdown: str
    boost_reasons: List[str]
    requires_review: bool

## 3. Color Normalization and Matching

In [3]:
COLOR_ABBREVIATIONS = {
    "pl brn": "pale brown", "dk brn": "dark brown", "lt bl": "light blue",
    "dk bl": "dark blue", "org": "orange", "grn": "green", "dk grn": "dark green",
    "lt grn": "light green", "yel": "yellow", "blk": "black", "scar": "scarlet",
    "car": "carmine", "vio": "violet", "pur": "purple", "brn": "brown",
    "ol": "olive", "org red": "orange red", "red brn": "red brown", "gray": "grey",
}

COLOR_FAMILIES = {
    "blue_family": ["blue", "light blue", "dark blue", "pale blue", "ultramarine", 
                    "blue violet", "blue vio", "pale gray violet", "gray violet"],
    "red_family": ["red", "scarlet", "carmine", "rose", "vermillion", "crimson",
                   "dark red", "rose red", "lake"],
    "yellow_family": ["yellow", "orange", "lemon", "gold", "amber", "yellow green"],
    "green_family": ["green", "light green", "dark green", "olive", "emerald", 
                     "yellow green"],
    "brown_family": ["brown", "pale brown", "dark brown", "sepia", "chocolate", 
                     "red brown"],
}

def normalize_color(color_string: str) -> str:
    """Normalize color strings to standard format"""
    if not color_string:
        return ""
    color_lower = color_string.lower().strip()
    if color_lower in COLOR_ABBREVIATIONS:
        return COLOR_ABBREVIATIONS[color_lower]
    return " ".join(color_lower.split())

def clean_scott_color(color_string: str) -> str:
    """
    Remove overprint notation suffixes from Scott color strings.
    Examples: "carmine (Bk)" → "carmine", "green (R)" → "green"
    """
    if not color_string:
        return ""
    # Remove overprint suffixes: (R), (Bk), (BI), (G), (V), etc.
    cleaned = re.sub(r'\s*\([A-Z][a-z]?\)$', '', color_string)
    return cleaned.strip()

def find_color_family(color: str) -> Optional[str]:
    """Find which color family a color belongs to"""
    color_normalized = normalize_color(color)
    for family, colors in COLOR_FAMILIES.items():
        if color_normalized in colors:
            return family
    return None

def calculate_color_family_similarity(color1: str, color2: str) -> float:
    """Calculate similarity between two colors based on color families"""
    norm1 = normalize_color(color1)
    norm2 = normalize_color(color2)
    if norm1 == norm2:
        return 1.0
    family1 = find_color_family(norm1)
    family2 = find_color_family(norm2)
    
    if family1 and family2:
        if family1 == family2:
            return 0.85  # Same family
        else:
            return 0.3   # Different families
    
    return SequenceMatcher(None, norm1, norm2).ratio()

## 4. Denomination Parsing

In [4]:
def parse_simple_denomination(denom_string: str) -> Dict[str, Any]:
    """Parse a simple denomination string"""
    if not denom_string:
        return {"value": None, "unit": None}
    
    denom_string = denom_string.strip()
    
    # Handle ½
    if "½" in denom_string:
        value = 0.5
        unit = re.sub(r'[½\d\s.]', '', denom_string)
    else:
        match = re.search(r'(\d+\.?\d*)', denom_string)
        if match:
            value = float(match.group(1))
        else:
            return {"value": None, "unit": None}
        unit = re.sub(r'[\d\s.]', '', denom_string)
    
    # Normalize unit
    unit = unit.strip()
    if unit == 'r':
        unit = 'real'
    elif unit == 'p':
        unit = 'peso'
    elif unit in ['c', 'ct', 'cts']:
        unit = 'centavo'
    
    return {"value": value, "unit": unit}

def parse_denomination_string(denom_string: str) -> Dict[str, Any]:
    """
    Parse Scott denomination strings including surcharges.
    Examples: '½r' → {"value": 0.5, "unit": "real"}
              '1c on 20c' → {"value": 1, "unit": "c", "surcharge": {"on_value": 20, "on_unit": "c"}}
    """
    if not denom_string:
        return {"value": None, "unit": None}
    
    denom_string = denom_string.lower().strip()
    
    # Check if it's a surcharge
    if " on " in denom_string:
        parts = denom_string.split(" on ")
        if len(parts) == 2:
            new_denom = parse_simple_denomination(parts[0].strip())
            orig_denom = parse_simple_denomination(parts[1].strip())
            return {
                "value": new_denom["value"],
                "unit": new_denom["unit"],
                "surcharge": {
                    "on_value": orig_denom["value"],
                    "on_unit": orig_denom["unit"]
                }
            }
    
    return parse_simple_denomination(denom_string)

def normalize_denomination(value: float, unit: str) -> Dict[str, Any]:
    """Normalize Mena denomination to match Scott format"""
    unit_normalized = unit.lower().strip()
    
    # Remove plural 's'
    if unit_normalized.endswith('es'):
        unit_normalized = unit_normalized[:-2]
    elif unit_normalized.endswith('s'):
        unit_normalized = unit_normalized[:-1]
    
    # Handle abbreviations
    if unit_normalized in ['p', 'ps']:
        unit_normalized = 'peso'
    elif unit_normalized in ['r']:
        unit_normalized = 'real'
    elif unit_normalized in ['c', 'ct']:
        unit_normalized = 'centavo'
    
    return {"value": value, "unit": unit_normalized}

## 5. Year Extraction

In [5]:
def extract_primary_year(issue_dates: Dict[str, Any]) -> Optional[int]:
    """Extract the primary year from Mena issue dates"""
    date_priorities = ['placed_on_sale', 'probable_first_circulation', 'announced']
    for date_key in date_priorities:
        if date_key in issue_dates and issue_dates[date_key]:
            match = re.search(r'(\d{4})', str(issue_dates[date_key]))
            if match:
                return int(match.group(1))
    return None

def extract_scott_year(scott_stamp: Dict[str, Any]) -> Optional[int]:
    """Extract year from Scott stamp entry"""
    if 'year' in scott_stamp and scott_stamp['year']:
        return int(scott_stamp['year'])
    if 'header' in scott_stamp and scott_stamp['header']:
        match = re.search(r'(\d{4})', str(scott_stamp['header']))
        if match:
            return int(match.group(1))
    return None

## 6. Scott Data Preprocessing

In [6]:
def strip_leading_zeros(catalog_no: str) -> str:
    """
    Strip leading zeros from regular stamps for display.
    Examples: "021" → "21", "O21" → "O21" (keep letter prefixes)
    """
    catalog_no = str(catalog_no).strip()
    prefix_match = re.match(r'^([A-Za-z]+)', catalog_no)
    if prefix_match:
        return catalog_no
    return catalog_no.lstrip('0') or '0'

def fix_scott_surcharge_data(scott_stamp: Dict[str, Any]) -> Dict[str, Any]:
    """
    Fix Scott stamps where surcharge info is split between denomination and color.
    Example: {denomination: "1c", color: "on ½r ('82)"} → {denomination: "1c on ½r", color: "unknown"}
    """
    denom = str(scott_stamp.get('denomination', '')).strip()
    color = str(scott_stamp.get('color', '')).strip()
    
    if color.lower().startswith('on '):
        surcharge_part = re.sub(r'\s*\([\'"]?\d{2}\).*$', '', color)
        full_denomination = f"{denom} {surcharge_part}"
        fixed_color = "surcharge color unknown"
        
        return {
            **scott_stamp,
            'denomination': full_denomination,
            'color': fixed_color,
            'original_color_field': color
        }
    
    return scott_stamp

def enrich_variety_stamps(scott_stamps: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Enrich variety stamps by inheriting data from base stamps in the same issue"""
    base_stamps = {}
    for stamp in scott_stamps:
        if 'variety_of' not in stamp or not stamp.get('variety_of'):
            scott_no = stamp.get('scott_number', '')
            year = stamp.get('year')
            header = stamp.get('header', '')
            key = (scott_no, year, header)
            base_stamps[key] = stamp
    
    enriched = []
    for stamp in scott_stamps:
        stamp_copy = stamp.copy()
        
        if 'variety_of' in stamp and stamp['variety_of']:
            base_no = stamp['variety_of']
            year = stamp.get('year')
            header = stamp.get('header', '')
            key = (base_no, year, header)
            base_stamp = base_stamps.get(key)
            
            if base_stamp:
                if not stamp.get('denomination') and base_stamp.get('denomination'):
                    stamp_copy['denomination'] = base_stamp['denomination']
                
                if not stamp.get('color'):
                    desc = stamp.get('description', '').lower()
                    color_keywords = [
                        'light blue', 'dark blue', 'pale blue', 'blue',
                        'light green', 'dark green', 'green',
                        'light brown', 'dark brown', 'brown',
                        'light violet', 'dark violet', 'violet',
                        'blue violet', 'gray violet', 'pale gray violet',
                        'scarlet', 'red', 'carmine', 'rose',
                        'yellow', 'orange', 'black', 'purple', 'gray', 'grey'
                    ]
                    for color in color_keywords:
                        if color in desc:
                            stamp_copy['color'] = color
                            break
                
                if not stamp_copy.get('color') and base_stamp.get('color'):
                    stamp_copy['color'] = base_stamp['color']
                
                if not stamp_copy.get('perforation') and base_stamp.get('perforation'):
                    stamp_copy['perforation'] = base_stamp['perforation']
        
        enriched.append(stamp_copy)
    
    return enriched

def flatten_and_enrich_scott_data(scott_grouped_data: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Flatten Scott data from grouped structure and enrich variety stamps"""
    flat_stamps = []
    
    for group in scott_grouped_data:
        header = group.get('header', '')
        stamps = group.get('stamps', [])
        
        for stamp in stamps:
            stamp_copy = stamp.copy()
            stamp_copy['header'] = header
            
            # Normalize scott_number
            scott_no = stamp_copy.get('scott_number', '')
            stamp_copy['scott_number'] = strip_leading_zeros(scott_no)
            
            # Extract year from header
            if header:
                match = re.search(r'(\d{4})', str(header))
                if match:
                    stamp_copy['year'] = int(match.group(1))
            
            # Fix surcharge denominations
            stamp_copy = fix_scott_surcharge_data(stamp_copy)
            
            flat_stamps.append(stamp_copy)
    
    return enrich_variety_stamps(flat_stamps)

## 7. Category Normalization

In [7]:
def normalize_catalog_number(catalog_no: str) -> tuple:
    """
    Normalize catalog number to (category, number, suffix) for sorting.
    Examples: "17" → ("", 17.0, ""), "O31" → ("O", 31.0, ""), "C164" → ("C", 164.0, "")
    """
    catalog_no = str(catalog_no).strip()
    
    # Extract category prefix (letters only)
    category_match = re.match(r'^([A-Za-z]+)', catalog_no)
    if category_match:
        category = category_match.group(1).upper()
        remaining = catalog_no[len(category):]
    else:
        category = ""
        remaining = catalog_no
    
    # Strip leading zeros
    remaining = remaining.lstrip('0') or '0'
    
    # Extract numeric part
    number_match = re.match(r'^(\d+)', remaining)
    if number_match:
        base_num = float(number_match.group(1))
        remaining = remaining[len(number_match.group(1)):]
        
        # Extract suffix
        suffix_match = re.match(r'^([a-z]+)', remaining, re.IGNORECASE)
        if suffix_match:
            suffix = suffix_match.group(1).lower()
            for i, char in enumerate(suffix):
                base_num += (ord(char) - ord('a') + 1) * (0.1 ** (i + 1))
        else:
            suffix = ""
    else:
        base_num = 999999.0
        suffix = ""
    
    return (category, base_num, suffix)

def get_stamp_category(stamp: Dict[str, Any], is_mena: bool = True) -> str:
    """
    Get the category of a stamp.
    For Mena: Use section field. For Scott: Use catalog number prefix.
    """
    if is_mena:
        section = stamp.get('issue_data', {}).get('section', '') if 'issue_data' in stamp else ''
        section = section.lower().strip()
        
        section_to_category = {
            'surface mail': '',
            'airmail': 'C',
            'air mail': 'C',
            'official': 'O',
            'telegraph': 'T',
            'telegraphs': 'T',
            'postage due': 'J',
            'dues': 'J',
            'special delivery': 'E',
            'registration': 'F',
            'guanacaste': 'G',
        }
        
        for key, cat in section_to_category.items():
            if key in section:
                return cat
        
        return normalize_catalog_number(stamp.get('catalog_no', ''))[0]
    
    else:
        return normalize_catalog_number(stamp.get('scott_number', ''))[0]

def make_scott_unique_key(scott_stamp: Dict[str, Any]) -> str:
    """
    Create unique identifier for Scott stamps: "number__year"
    Handles duplicate numbers across different years.
    """
    scott_no = scott_stamp.get('scott_number', 'UNKNOWN')
    year = extract_scott_year(scott_stamp) or 9999
    return f"{scott_no}__{year}"

## 8. Scoring and Matching Logic

In [8]:
def calculate_match_score(mena_stamp: Dict[str, Any], 
                         scott_stamp: Dict[str, Any], 
                         mena_issue_context: Dict[str, Any]) -> Dict[str, Any]:
    """Calculate match score using multiple signals (100-point system)"""
    signals = {}
    total_score = 0.0
    breakdown_parts = []
    
    # ========== CATEGORY MATCH (10 points, -50 if wrong) ==========
    mena_category = get_stamp_category(
        {'issue_data': mena_issue_context, 'catalog_no': mena_stamp.get('catalog_no', '')}, 
        is_mena=True
    )
    scott_category = get_stamp_category(scott_stamp, is_mena=False)
    
    category_compatible = False
    if mena_category == scott_category:
        category_compatible = True
    else:
        equivalences = [
            ({'', 'C'}, {'', 'C'}),
            ({'O'}, {'O', 'CO'}),
            ({'E'}, {'E', 'CE'}),
            ({'J'}, {'J'}),
            ({'G'}, {'G'}),
        ]
        for mena_set, scott_set in equivalences:
            if mena_category in mena_set and scott_category in scott_set:
                category_compatible = True
                break
    
    if not category_compatible:
        signals['category'] = -50
        total_score -= 50
        breakdown_parts.append(f"Cat: ✗")
    else:
        signals['category'] = 10
        total_score += 10
        breakdown_parts.append("Cat: ✓")
    
    # ========== DENOMINATION (35 points) ==========
    mena_denom = normalize_denomination(
        mena_stamp['denomination']['value'], 
        mena_stamp['denomination']['unit']
    )
    scott_denom = parse_denomination_string(scott_stamp.get('denomination', ''))
    
    # Check surcharge status
    mena_has_surcharge = (mena_stamp.get('overprint', {}).get('present') and 
                          mena_stamp.get('overprint', {}).get('type') == 'surcharge')
    scott_has_surcharge = 'surcharge' in scott_denom
    
    # Surcharge mismatch penalty
    if mena_has_surcharge != scott_has_surcharge:
        signals['surcharge_mismatch'] = -10
        total_score -= 10
        breakdown_parts.append("Surcharge: ✗")
    
    # Denomination matching
    if mena_has_surcharge and scott_has_surcharge:
        new_match = (mena_denom['value'] == scott_denom['value'] and 
                     mena_denom['unit'] == scott_denom['unit'])
        mena_orig = normalize_denomination(
            mena_stamp['overprint']['on_denomination']['value'],
            mena_stamp['overprint']['on_denomination']['unit']
        )
        scott_orig = scott_denom['surcharge']
        orig_match = (mena_orig['value'] == scott_orig['on_value'] and 
                      mena_orig['unit'] == scott_orig['on_unit'])
        
        if new_match and orig_match:
            signals['denomination'] = 35
            total_score += 35
            breakdown_parts.append("Denom: ✓")
        elif new_match:
            signals['denomination'] = 20
            total_score += 20
            breakdown_parts.append("Denom: ⚠️")
        else:
            signals['denomination'] = 0
            breakdown_parts.append("Denom: ✗")
    elif (mena_denom['value'] == scott_denom['value'] and 
          mena_denom['unit'] == scott_denom['unit']):
        signals['denomination'] = 35
        total_score += 35
        breakdown_parts.append("Denom: ✓")
    else:
        signals['denomination'] = 0
        breakdown_parts.append("Denom: ✗")
    
    # ========== COLOR (30 points) ==========
    if mena_stamp.get('color') and scott_stamp.get('color'):
        mena_color_raw = mena_stamp['color']
        scott_color_raw = scott_stamp['color']
        scott_color = clean_scott_color(scott_color_raw)
        
        # Handle compound/variant colors
        mena_colors = []
        if '/' in mena_color_raw:
            mena_colors = [c.strip() for c in mena_color_raw.split('/')]
        elif ' & ' in mena_color_raw:
            mena_colors = [c.strip() for c in mena_color_raw.split('&')]
        else:
            mena_colors = [mena_color_raw]
        
        # Take best match
        best_similarity = 0.0
        for mena_color in mena_colors:
            similarity = calculate_color_family_similarity(mena_color, scott_color)
            if similarity > best_similarity:
                best_similarity = similarity
        
        color_score = best_similarity * 30
        signals['color'] = color_score
        total_score += color_score
        breakdown_parts.append(f"Color: {int(best_similarity*100)}%")
    else:
        signals['color'] = 0
    
    # ========== YEAR (25 points) ==========
    mena_year = extract_primary_year(mena_issue_context['issue_dates'])
    scott_year = extract_scott_year(scott_stamp)
    if mena_year and scott_year:
        year_diff = abs(mena_year - scott_year)
        if year_diff == 0:
            signals['year'] = 25
            total_score += 25
            breakdown_parts.append("Year: ✓")
        elif year_diff == 1:
            signals['year'] = 15
            total_score += 15
            breakdown_parts.append("Year: ~1")
        elif year_diff == 2:
            signals['year'] = 10
            total_score += 10
            breakdown_parts.append("Year: ~2")
    
    # ========== PERFORATION (10 points) ==========
    mena_perf = str(mena_stamp.get('perforation', '')).strip()
    scott_perf = str(scott_stamp.get('perforation', '')).strip()
    
    if mena_perf and scott_perf:
        mena_perf_num = re.findall(r'[\d.]+', mena_perf)
        scott_perf_num = re.findall(r'[\d.]+', scott_perf)
        
        if mena_perf_num and scott_perf_num:
            if any(m == s for m in mena_perf_num for s in scott_perf_num):
                signals['perforation'] = 10
                total_score += 10
            else:
                signals['perforation'] = -5
                total_score -= 5
                breakdown_parts.append(f"Perf: ✗")
    
    return {
        'total_score': total_score, 
        'signals': signals, 
        'breakdown': " | ".join(breakdown_parts)
    }

## 9. Candidate Pool and Scoring

In [9]:
def build_candidate_pool(mena_issue: Dict[str, Any], 
                         all_scott_stamps: List[Dict[str, Any]], 
                         year_tolerance: int = 2) -> List[Dict[str, Any]]:
    """Build a pool of Scott stamp candidates based on year"""
    primary_year = extract_primary_year(mena_issue['issue_data']['issue_dates'])
    if not primary_year:
        return all_scott_stamps
    
    candidates = []
    no_year_count = 0
    
    for scott_stamp in all_scott_stamps:
        scott_year = extract_scott_year(scott_stamp)
        if scott_year is not None:
            if abs(scott_year - primary_year) <= year_tolerance:
                candidates.append(scott_stamp)
        else:
            no_year_count += 1
    
    print(f"Found {len(candidates)} Scott candidates for year {primary_year} (±{year_tolerance} years)")
    print(f"Excluded {no_year_count} stamps without year information")
    
    return candidates

def score_all_candidates(mena_issue: Dict[str, Any], 
                        scott_candidate_pool: List[Dict[str, Any]], 
                        min_threshold: float = 30.0) -> List[Dict[str, Any]]:
    """Score all Mena stamps against all Scott candidates"""
    scoring_matrix = []
    
    for mena_stamp in mena_issue['stamps']:
        mena_row = {
            'mena_catalog_no': mena_stamp['catalog_no'], 
            'candidates': []
        }
        
        for scott_candidate in scott_candidate_pool:
            score_result = calculate_match_score(
                mena_stamp, scott_candidate, mena_issue['issue_data']
            )
            
            if score_result['total_score'] >= min_threshold:
                unique_key = make_scott_unique_key(scott_candidate)
                
                mena_row['candidates'].append({
                    'scott_number': scott_candidate.get('scott_number', 'UNKNOWN'),
                    'scott_unique_key': unique_key,
                    'scott_year': extract_scott_year(scott_candidate),
                    'score': score_result['total_score'],
                    'signals': score_result['signals'],
                    'breakdown': score_result['breakdown']
                })
        
        if mena_row['candidates']:
            scoring_matrix.append(mena_row)
    
    return scoring_matrix

## 10. Optimal Assignment (Hungarian Algorithm)

In [10]:
def find_optimal_assignment(scoring_matrix: List[Dict[str, Any]]) -> List[MatchResult]:
    """Find optimal one-to-one assignment using Hungarian algorithm"""
    mena_stamps = [row['mena_catalog_no'] for row in scoring_matrix]
    
    # Collect unique Scott keys
    all_scott_keys = set()
    for row in scoring_matrix:
        for cand in row['candidates']:
            all_scott_keys.add(cand['scott_unique_key'])
    
    scott_stamps = sorted(all_scott_keys)
    
    # Build cost matrix
    n_mena = len(mena_stamps)
    n_scott = len(scott_stamps)
    max_dim = max(n_mena, n_scott)
    
    cost_matrix = np.full((max_dim, max_dim), 1000.0)
    scott_to_idx = {scott_key: i for i, scott_key in enumerate(scott_stamps)}
    
    for i, row in enumerate(scoring_matrix):
        for cand in row['candidates']:
            scott_key = cand['scott_unique_key']
            if scott_key in scott_to_idx:
                j = scott_to_idx[scott_key]
                cost_matrix[i, j] = -cand['score']
    
    # Find optimal assignment
    mena_indices, scott_indices = linear_sum_assignment(cost_matrix)
    
    # Build results
    assignments = []
    details_map = {}
    
    for row in scoring_matrix:
        for cand in row['candidates']:
            key = (row['mena_catalog_no'], cand['scott_unique_key'])
            details_map[key] = cand
    
    for mena_idx, scott_idx in zip(mena_indices, scott_indices):
        if mena_idx >= n_mena or scott_idx >= n_scott:
            continue
        
        mena_no = mena_stamps[mena_idx]
        scott_key = scott_stamps[scott_idx]
        key = (mena_no, scott_key)
        
        if key not in details_map:
            continue
        
        cand = details_map[key]
        score = cand['score']
        
        if score < 30:
            continue
        
        confidence = "HIGH" if score >= 70 else "MEDIUM" if score >= 50 else "LOW"
        requires_review = score < 70
        
        # Display with year
        scott_display = f"{cand['scott_number']} ({cand.get('scott_year', '?')})"
        
        assignments.append(MatchResult(
            mena_catalog_no=mena_no,
            scott_number=scott_display,
            confidence=confidence,
            score=score,
            signals=cand['signals'],
            breakdown=cand['breakdown'],
            boost_reasons=[],
            requires_review=requires_review
        ))
    
    assignments.sort(key=lambda x: normalize_catalog_number(x.mena_catalog_no)[1])
    
    return assignments

## 11. Main Matching Function

In [11]:
def match_mena_to_scott(mena_issue: Dict[str, Any], 
                       all_scott_stamps: List[Dict[str, Any]], 
                       year_tolerance: int = 2, 
                       min_score_threshold: float = 30.0) -> Dict[str, Any]:
    """Main function to match Mena issue to Scott catalog"""
    
    print("\n" + "="*80)
    print("MENA TO SCOTT CATALOG MATCHING")
    print("="*80)
    
    # Build candidate pool
    scott_candidates = build_candidate_pool(mena_issue, all_scott_stamps, year_tolerance)
    
    # Score all candidates
    scoring_matrix = score_all_candidates(mena_issue, scott_candidates, min_score_threshold)
    
    # Find optimal assignment
    assignments = find_optimal_assignment(scoring_matrix)
    
    # Calculate statistics
    statistics = {
        'total_mena_stamps': len(mena_issue['stamps']),
        'total_assignments': len(assignments),
        'high_confidence': sum(1 for a in assignments if a.confidence == "HIGH"),
        'medium_confidence': sum(1 for a in assignments if a.confidence == "MEDIUM"),
        'low_confidence': sum(1 for a in assignments if a.confidence == "LOW"),
        'success_rate': round(len(assignments) / len(mena_issue['stamps']) * 100, 1) 
                        if mena_issue['stamps'] else 0
    }
    
    # Build result
    result = {
        'issue_match': {
            'mena_issue_id': mena_issue['issue_data']['issue_id'],
            'mena_title': mena_issue['issue_data']['title'],
            'candidate_pool_size': len(scott_candidates)
        },
        'assignments': [
            {
                'mena_catalog_no': a.mena_catalog_no,
                'scott_number': a.scott_number,
                'confidence': a.confidence,
                'score': round(a.score, 1),
                'signals': {k: round(v, 1) for k, v in a.signals.items()},
                'breakdown': a.breakdown,
                'requires_review': a.requires_review
            }
            for a in assignments
        ],
        'statistics': statistics,
        'scoring_matrix': scoring_matrix
    }
    
    return result

## 12. Results Display

In [12]:
def print_matching_results(result: Dict[str, Any]):
    """Print matching results in a formatted way"""
    print("\n" + "="*80)
    print("MATCHING RESULTS")
    print("="*80)
    
    if not result['assignments']:
        print("\nNo matches found above the threshold.")
        return
    
    for assignment in result['assignments']:
        confidence_icon = "✓" if assignment['confidence'] == "HIGH" else "⚠" if assignment['confidence'] == "MEDIUM" else "!"
        print(f"\n{confidence_icon} Mena #{assignment['mena_catalog_no']} → Scott #{assignment['scott_number']}")
        print(f"  Confidence: {assignment['confidence']} (Score: {assignment['score']}/100)")
        print(f"  {assignment['breakdown']}")
    
    stats = result['statistics']
    print("\n" + "="*80)
    print(f"Total: {stats['total_mena_stamps']} | Matched: {stats['total_assignments']} ({stats['success_rate']}%)")
    print(f"High: {stats['high_confidence']} | Medium: {stats['medium_confidence']} | Low: {stats['low_confidence']}")
    print("="*80)

## 13. Example Usage

In [14]:
# Load Mena issue
PATH = Path("results/parsed_catalogues/mena_parse_results_ALL.json")

# Cargar
with PATH.open("r", encoding="utf-8") as f:
    mena_parsed_catalog = json.load(f)


In [15]:
mena_issue = mena_parsed_catalog[3]
print(f"Loaded Mena issue: {mena_issue['issue_data']['title']}")
print(f"Number of stamps: {len(mena_issue['stamps'])}")
print(mena_issue)

Loaded Mena issue: Prospero Fernandez Issue
Number of stamps: 5
{'issue_data': {'issue_id': 'CR-1883-PROSPERO-FERNANDEZ', 'section': 'Surface Mail', 'title': 'Prospero Fernandez Issue', 'country': 'Costa Rica', 'issue_dates': {'announced': None, 'placed_on_sale': '1883-01-13', 'probable_first_circulation': '1883-01-13', 'second_plate_sale': None, 'demonetized': '1889-10-31'}, 'legal_basis': [{'type': 'decree', 'id': 'Decree #17', 'date': '1883-01-13', 'ids': [], 'officials': []}], 'currency_context': {'original': 'c', 'decimal_adoption': '1864-01-01', 'revaluation_date': None, 'revaluation_map': {}}, 'printing': {'printer': 'ABNCo.', 'process': ['engraved'], 'format': {'panes': None}, 'plates': {}, 'notes': ''}, 'perforation': '12'}, 'production_orders': {'printings': [], 'remainders': {'date': None, 'note': '', 'quantities': []}}, 'stamps': [{'catalog_no': '12', 'issue_id': 'CR-1883-PROSPERO-FERNANDEZ', 'denomination': {'value': 1, 'unit': 'c'}, 'color': 'yellow green / dark green', '

In [16]:
# Load Scott catalog (grouped structure)
PATH = Path("results/parsed_catalogues/scott_parse_results_ALL.json")

# Cargar
with PATH.open("r", encoding="utf-8") as f:
    scott_grouped = json.load(f)

print(f"Loaded Scott catalog: {len(scott_grouped)} issue groups")

Loaded Scott catalog: 1083 issue groups


In [17]:
# CRITICAL STEP: Flatten and enrich Scott data
all_scott_stamps = flatten_and_enrich_scott_data(scott_grouped)

print(f"Preprocessed: {len(all_scott_stamps)} total stamps")
print(f"\nExample enriched variety stamp (Scott #1a):")
for stamp in all_scott_stamps[:10]:
    if stamp.get('scott_number') == '1a' and stamp.get('year') == 1863:
        print(f"  denomination: {stamp.get('denomination')}")
        print(f"  color: {stamp.get('color')}")
        print(f"  variety_of: {stamp.get('variety_of')}")
        break

Preprocessed: 2462 total stamps

Example enriched variety stamp (Scott #1a):
  denomination: ½r
  color: light blue
  variety_of: 1


In [18]:
# Run the matching algorithm
result = match_mena_to_scott(
    mena_issue=mena_issue,
    all_scott_stamps=all_scott_stamps,
    year_tolerance=2,
    min_score_threshold=60.0
)

# Print results
print_matching_results(result)


MENA TO SCOTT CATALOG MATCHING
Found 41 Scott candidates for year 1883 (±2 years)
Excluded 355 stamps without year information

MATCHING RESULTS

✓ Mena #12 → Scott #1 (1883)
  Confidence: HIGH (Score: 85.5/100)
  Cat: ✓ | Surcharge: ✗ | Denom: ✓ | Color: 85% | Year: ✓

✓ Mena #13 → Scott #3 (1883)
  Confidence: HIGH (Score: 85.5/100)
  Cat: ✓ | Surcharge: ✗ | Denom: ✓ | Color: 85% | Year: ✓

✓ Mena #14 → Scott #5 (1883)
  Confidence: HIGH (Score: 85.5/100)
  Cat: ✓ | Surcharge: ✗ | Denom: ✓ | Color: 85% | Year: ✓

✓ Mena #15 → Scott #6 (1883)
  Confidence: HIGH (Score: 85.5/100)
  Cat: ✓ | Surcharge: ✗ | Denom: ✓ | Color: 85% | Year: ✓

✓ Mena #16 → Scott #7 (1883)
  Confidence: HIGH (Score: 90.0/100)
  Cat: ✓ | Surcharge: ✗ | Denom: ✓ | Color: 100% | Year: ✓

Total: 5 | Matched: 5 (100.0%)
High: 5 | Medium: 0 | Low: 0
