# OCL Reference Modernizer

**GitHub Issue**: [#2212](https://github.com/OpenConceptLab/ocl_issues/issues/2212)

## Overview
This notebook modernizes OCL (Open Concept Lab) collection references by:
1. 📥 Loading references from OCL collection version export JSON
2. 🔍 Identifying versioned references using URL pattern matching
3. ✨ Generating unversioned equivalents ONLY for versioned references
4. 📤 Creating bulk import/delete files in JSONL format for OCL API
5. 🔧 Supporting multiple cascade presets (OpenMRS, Custom, etc.)
6. 🚫 Preventing duplicate expressions
7. 📊 Reporting detailed statistics on the transformation

## Acceptance Criteria ✅
- ✅ Parse references from OCL collection version export JSON
- ✅ Identify versioned references using URL pattern matching
- ✅ Generate unversioned equivalents ONLY for versioned references
- ✅ Create bulk import/delete JSONL files
- ✅ Handle cascade configuration with presets
- ✅ Prevent duplicate expressions
- ✅ Report concept/mapping reference counts

## Instructions
1. **Run cells sequentially** from top to bottom
2. **Configure your settings** in the Configuration cell
3. **Set your input file path** to your OCL export JSON
4. **Choose cascade preset** (OpenMRS, Custom, etc.)
5. **Review outputs** before applying to your collection
6. **Use OCL's Bulk Import Interface** to apply the generated files: import the deletion file first, then the unversioned references file

---

## Step 1: Import Required Packages

In [35]:
# Import required packages
import json
import re
import os
from typing import List, Dict, Tuple, Optional, Set
from datetime import datetime
from collections import defaultdict, Counter

print("✅ All packages imported successfully!")
print(f"🕐 Notebook started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

✅ All packages imported successfully!
🕐 Notebook started at: 2025-08-20 10:56:15


## Step 2: Configuration

**🔧 Configure your settings here:**

In [36]:
# CONFIGURATION - CUSTOMIZE THESE SETTINGS
# =============================================================================

class CascadePresets:
    """Predefined cascade configurations for different use cases."""
    
    # OpenMRS cascade with transform
    OPENMRS_WITH_TRANSFORM = {
        "method": "sourcetoconcepts",
        "map_types": "Q-AND-A,CONCEPT-SET",
        "cascade_levels": "*",
        "return_map_types": "*",
        "transform": "openmrs"
    }
    
    # OpenMRS cascade without transform
    OPENMRS_WITHOUT_TRANSFORM = {
        "method": "sourcetoconcepts",
        "map_types": "Q-AND-A,CONCEPT-SET",
        "cascade_levels": "*",
        "return_map_types": "*"
    }
    
    # Source to mappings only
    SOURCE_TO_MAPPINGS = {
        "method": "sourcetomappings",
        "map_types": "*",
        "cascade_levels": "1",
        "return_map_types": "*"
    }
    
    # No cascade (simple references)
    NO_CASCADE = None
    
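    # Custom cascade placeholder. Config.get_cascade_settings() returns
    # Config.CUSTOM_CASCADE for the "CUSTOM" preset, so this entry is currently unused.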
    CUSTOM = {
        "method": "sourcetoconcepts",
        "map_types": "Q-AND-A,CONCEPT-SET",
        "cascade_levels": "*",
        "return_map_types": "*"
    }


class Config:
    """Configuration settings for the reference modernizer."""
    
    # 📁 INPUT FILE - SET THIS TO YOUR OCL EXPORT JSON FILE
    INPUT_FILE = "input-example-jamlung/export.json"  # ⚠️ CHANGE THIS TO YOUR FILE PATH
    
    # 📂 Output Settings
    OUTPUT_DIR = "output"
    TIMESTAMP = datetime.now().strftime("%Y%m%d_%H%M%S")
    
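    # 📄 File naming (the two reference files are written as JSON Lines,
    #    one JSON object per line, despite the .json extension)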
    UNVERSIONED_FILE = f"unversioned_references_{TIMESTAMP}.json"
    DELETE_FILE = f"references_to_delete_{TIMESTAMP}.json"
    REPORT_FILE = f"migration_report_{TIMESTAMP}.txt"
    
    # 🔧 Processing Options
    PRESERVE_ORIGINAL_CASCADE = True  # Try to preserve original cascade settings
    PRESERVE_REFERENCE_TYPE = True    # Preserve original reference type from export
    
    # 🎛️ Cascade Configuration
    # Choose from: OPENMRS_WITH_TRANSFORM, OPENMRS_WITHOUT_TRANSFORM, 
    #              SOURCE_TO_MAPPINGS, NO_CASCADE, CUSTOM
    CASCADE_PRESET = "OPENMRS_WITHOUT_TRANSFORM"  # 🔄 CHANGE THIS AS NEEDED
    
    # 🎨 Custom cascade settings (only used if CASCADE_PRESET = "CUSTOM")
    CUSTOM_CASCADE = {
        "method": "sourcetoconcepts",
        "map_types": "Q-AND-A,CONCEPT-SET,SAME-AS",
        "cascade_levels": "2",
        "return_map_types": "*"
    }
    
    # 🚫 Duplicate handling
    SKIP_DUPLICATES = True  # Skip references with duplicate expressions
    REPORT_DUPLICATES = True  # Report duplicate expressions found

    def get_cascade_settings(self) -> Optional[Dict]:
        """Get the cascade settings based on the selected preset."""
        if self.CASCADE_PRESET == "OPENMRS_WITH_TRANSFORM":
            return CascadePresets.OPENMRS_WITH_TRANSFORM
        elif self.CASCADE_PRESET == "OPENMRS_WITHOUT_TRANSFORM":
            return CascadePresets.OPENMRS_WITHOUT_TRANSFORM
        elif self.CASCADE_PRESET == "SOURCE_TO_MAPPINGS":
            return CascadePresets.SOURCE_TO_MAPPINGS
        elif self.CASCADE_PRESET == "NO_CASCADE":
            return CascadePresets.NO_CASCADE
        elif self.CASCADE_PRESET == "CUSTOM":
            return self.CUSTOM_CASCADE
        else:
            print(f"⚠️ Unknown cascade preset: {self.CASCADE_PRESET}, using OpenMRS default")
            return CascadePresets.OPENMRS_WITHOUT_TRANSFORM

# Initialize configuration
config = Config()

print("⚙️ Configuration loaded:")
print(f"   📁 Input file: {config.INPUT_FILE}")
print(f"   📂 Output directory: {config.OUTPUT_DIR}")
print(f"   🕐 Timestamp: {config.TIMESTAMP}")
print(f"   🔧 Preserve cascade: {config.PRESERVE_ORIGINAL_CASCADE}")
print(f"   🎛️ Cascade preset: {config.CASCADE_PRESET}")
print(f"   🚫 Skip duplicates: {config.SKIP_DUPLICATES}")
print()
print("⚠️ Make sure to update INPUT_FILE with your actual export file path!")
print("⚠️ Choose your CASCADE_PRESET based on your needs!")

⚙️ Configuration loaded:
   📁 Input file: input-example-jamlung/export.json
   📂 Output directory: output
   🕐 Timestamp: 20250820_105615
   🔧 Preserve cascade: True
   🎛️ Cascade preset: OPENMRS_WITHOUT_TRANSFORM
   🚫 Skip duplicates: True

⚠️ Make sure to update INPUT_FILE with your actual export file path!
⚠️ Choose your CASCADE_PRESET based on your needs!
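
The preset selection in `get_cascade_settings` can also be written as a dictionary lookup. The sketch below is a standalone illustration, not the notebook's implementation: `PRESETS`, `DEFAULT_PRESET`, and `resolve_cascade` are hypothetical names, and only two of the presets are reproduced. It returns a copy so that editing one reference's cascade cannot alter the shared preset.

```python
from copy import deepcopy
from typing import Dict, Optional

# Hypothetical registry; values copied from the CascadePresets class above.
PRESETS: Dict[str, Optional[Dict[str, str]]] = {
    "OPENMRS_WITHOUT_TRANSFORM": {
        "method": "sourcetoconcepts",
        "map_types": "Q-AND-A,CONCEPT-SET",
        "cascade_levels": "*",
        "return_map_types": "*",
    },
    "NO_CASCADE": None,
}

DEFAULT_PRESET = "OPENMRS_WITHOUT_TRANSFORM"


def resolve_cascade(name: str) -> Optional[Dict[str, str]]:
    """Return a copy of the named preset, falling back to the default."""
    if name not in PRESETS:
        print(f"Unknown cascade preset: {name}, using {DEFAULT_PRESET}")
        name = DEFAULT_PRESET
    return deepcopy(PRESETS[name])


print(resolve_cascade("NO_CASCADE"))
print(resolve_cascade("TYPO"))
```

An unknown name falls back to the default preset with a warning, which mirrors the behavior of the `else` branch in the configuration cell.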


## Step 3: Define Utility Functions

In [37]:
# UTILITY FUNCTIONS
# =============================================================================

def is_versioned_url(url: str) -> bool:
    """
    Check whether a URL ends with a numeric version ID after a concept,
    source, or mapping identifier. Non-numeric version IDs are not detected.
    
    Examples of versioned URLs:
    - /orgs/CIEL/sources/CIEL/concepts/1015/5282451/
    - /users/jamlung/sources/openmrs-demo-source/mappings/76/6495007/
    """
    return bool(re.search(r'/(concepts|sources|mappings)/[^/]+/\d+/?$', url))


def strip_version_from_url(url: str) -> str:
    """
    Remove version number from concepts, sources, or mappings URLs.
    """
    return re.sub(r'/(concepts|sources|mappings)/([^/]+)/\d+/?$', r'/\1/\2/', url)


def get_reference_type(url: str) -> str:
    """
    Determine if reference is to concepts or mappings.
    """
    if '/concepts/' in url:
        return 'concepts'
    elif '/mappings/' in url:
        return 'mappings'
    else:
        return 'other'


def extract_cascade_from_reference(ref_data: Dict) -> Optional[Dict]:
    """
    Extract cascade settings from original reference data.
    """
    # Look for cascade in various possible locations
    cascade_fields = ['cascade', '__cascade', 'cascadeOptions']
    
    for field in cascade_fields:
        if field in ref_data and ref_data[field]:
            return ref_data[field]
    
    return None


def normalize_expression(expression: str) -> str:
    """
    Normalize expression for duplicate detection.
    Ensures consistent trailing slashes and case.
    """
    if not expression:
        return ""
    
    # Ensure trailing slash for consistency
    if not expression.endswith('/'):
        expression += '/'
    
    return expression.lower()

print("🔧 Utility functions defined successfully!")

# Test the functions with examples
test_versioned = "/orgs/CIEL/sources/CIEL/concepts/1015/5282451/"
test_unversioned = "/orgs/CIEL/sources/CIEL/concepts/1015/"

print(f"\n🧪 Function tests:")
print(f"   Versioned URL test: {is_versioned_url(test_versioned)} (should be True)")
print(f"   Unversioned URL test: {is_versioned_url(test_unversioned)} (should be False)")
print(f"   Strip version test: {strip_version_from_url(test_versioned)}")
print(f"   Reference type test: {get_reference_type(test_versioned)}")
print(f"   Normalize test: {normalize_expression(test_unversioned)}")

🔧 Utility functions defined successfully!

🧪 Function tests:
   Versioned URL test: True (should be True)
   Unversioned URL test: False (should be False)
   Strip version test: /orgs/CIEL/sources/CIEL/concepts/1015/
   Reference type test: concepts
   Normalize test: /orgs/ciel/sources/ciel/concepts/1015/
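
The version-matching pattern has a few edge cases worth checking before running it on a real export. The sketch below re-declares the pattern from `strip_version_from_url` so it runs on its own; `VERSIONED`, `is_versioned`, and `strip_version` are illustrative names, and the URL ending in `v2` is a hypothetical example of a non-numeric version ID.

```python
import re

# Same pattern as strip_version_from_url: resource type, resource ID, numeric version.
VERSIONED = re.compile(r'/(concepts|sources|mappings)/([^/]+)/\d+/?$')


def is_versioned(url: str) -> bool:
    """True when the URL ends with a numeric version segment."""
    return bool(VERSIONED.search(url))


def strip_version(url: str) -> str:
    """Drop the trailing numeric version segment, if there is one."""
    return VERSIONED.sub(r'/\1/\2/', url)


cases = [
    "/orgs/CIEL/sources/CIEL/concepts/1015/5282451/",  # versioned, trailing slash
    "/orgs/CIEL/sources/CIEL/concepts/1015/5282451",   # versioned, no trailing slash
    "/orgs/CIEL/sources/CIEL/concepts/1015/",          # already unversioned
    "/orgs/CIEL/sources/CIEL/concepts/1015/v2/",       # hypothetical non-numeric version
]
for url in cases:
    print(f"{is_versioned(url)!s:5}  {url}  ->  {strip_version(url)}")
```

The last case is reported as unversioned and passes through unchanged, so references with non-numeric version IDs would be left as they are.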


## Step 4: Define File I/O Functions

In [38]:
# FILE I/O FUNCTIONS
# =============================================================================

def load_collection_export(file_path: str) -> Dict:
    """
    Load collection data from OCL version export JSON file.
    """
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"Export file not found: {file_path}")
    
    print(f"📥 Loading collection export from: {file_path}")
    
    with open(file_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    
    print(f"✅ Export file loaded successfully")
    return data


def extract_references_from_export(export_data: Dict) -> List[Dict]:
    """
    Extract the top-level 'references' list from collection export data.
    Returns an empty list if the field is missing, empty, or not a list.
    """
    print("🔍 Extracting references from export data...")
    
    references = export_data.get('references', [])
    
    if not references:
        print("   ⚠️ No references found in export data")
        return []
    
    if not isinstance(references, list):
        print("   ❌ References field is not a list")
        return []
    
    print(f"   ✅ Found {len(references)} references")
    
    # Optional: Show a sample reference structure for debugging
    if references:
        sample_ref = references[0]
        print(f"   📋 Sample reference keys: {list(sample_ref.keys())}")
        if 'expression' in sample_ref:
            print(f"   📋 Sample expression: {sample_ref['expression']}")
    
    return references


def save_jsonl(data: List[Dict], filename: str) -> None:
    """
    Save data as JSONL (JSON Lines) format compatible with OCL bulk import.
    """
    try:
        os.makedirs(os.path.dirname(filename) or '.', exist_ok=True)
        with open(filename, 'w', encoding='utf-8') as f:
            for item in data:
                f.write(json.dumps(item, ensure_ascii=False) + '\n')
        print(f"💾 Saved {len(data)} items to {filename}")
    except IOError as e:
        print(f"❌ Error saving {filename}: {e}")

print("📁 File I/O functions defined successfully!")

📁 File I/O functions defined successfully!
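
A quick way to gain confidence in the JSONL output is to write a few records and read them back. The sketch below is a self-contained round trip under assumptions: `load_jsonl` is a hypothetical helper (the notebook only defines a writer), and the collection URL and records are made-up sample data.

```python
import json
import os
import tempfile
from typing import Dict, List


def load_jsonl(path: str) -> List[Dict]:
    """Read a JSON Lines file, skipping blank lines."""
    with open(path, 'r', encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


# Made-up sample records in the same shape the notebook writes.
records = [
    {"type": "Reference",
     "collection_url": "/users/example/collections/demo/",
     "data": {"expressions": ["/orgs/CIEL/sources/CIEL/concepts/1015/"]}},
    {"type": "Reference",
     "collection_url": "/users/example/collections/demo/",
     "data": {"expressions": ["/orgs/CIEL/sources/CIEL/concepts/1016/"]}},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.jsonl")
    # Write one JSON object per line, as save_jsonl does.
    with open(path, 'w', encoding='utf-8') as f:
        for item in records:
            f.write(json.dumps(item, ensure_ascii=False) + '\n')
    round_tripped = load_jsonl(path)

print(f"Round trip preserved {len(round_tripped)} records: {round_tripped == records}")
```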


## Step 5: Define Reference Processing Classes

In [39]:
# REFERENCE PROCESSING CLASSES
# =============================================================================

class ReferenceProcessor:
    """Handles the processing and transformation of references."""
    
    def __init__(self, config: Config):
        self.config = config
        self.stats = {
            'total_references': 0,
            'versioned_references': 0,
            'unversioned_references': 0,
            'concept_references': 0,
            'mapping_references': 0,
            'other_references': 0,
            'cascade_preserved': 0,
            'cascade_preset': 0,
            'duplicates_found': 0,
            'duplicates_skipped': 0,
            'references_to_modernize': 0,
            'references_unchanged': 0
        }
        self.seen_expressions: Set[str] = set()
        self.duplicate_expressions: List[str] = []
        self.export_data = None
    
    def analyze_references(self, references: List[Dict]) -> None:
        """Analyze reference patterns and update statistics."""
        print("📊 Analyzing references...")
        
        self.stats['total_references'] = len(references)
        
        for ref in references:
            # Get the expression/URL from reference
            expression = self._extract_expression(ref)
            if not expression:
                continue
            
            # Check if versioned
            if is_versioned_url(expression):
                self.stats['versioned_references'] += 1
            else:
                self.stats['unversioned_references'] += 1
            
            # Check reference type
            ref_type = get_reference_type(expression)
            if ref_type == 'concepts':
                self.stats['concept_references'] += 1
            elif ref_type == 'mappings':
                self.stats['mapping_references'] += 1
            else:
                self.stats['other_references'] += 1
        
        print("✅ Reference analysis complete!")
    
    def _extract_expression(self, ref: Dict) -> str:
        """Extract the main expression/URL from a reference."""
        # Try common field names for the reference URL
        possible_fields = ['expression', 'url', 'uri', 'reference_url']
        
        for field in possible_fields:
            if field in ref and ref[field]:
                return ref[field]
        
        # If it's in a nested structure
        if 'data' in ref and 'expressions' in ref['data']:
            expressions = ref['data']['expressions']
            if expressions and len(expressions) > 0:
                return expressions[0]
        
        return ""
    
    def _get_collection_url(self, ref: Optional[Dict] = None) -> str:
        """
        Extract collection URL from export data.
        
        The `ref` argument is unused and kept only so existing calls still work.
        
        In OCL exports, the collection URL is available at:
        - export_data['url'] = "/users/jamlung/collections/squad-test-not-validation/"
        - export_data['collection']['url'] = "/users/jamlung/collections/squad-test-not-validation/"
        """
        if not self.export_data:
            print("   ⚠️ Export data not available for collection URL extraction")
            return "/collections/unknown/"
        
        # Prefer the top-level URL, then fall back to the nested collection.url field
        nested = self.export_data.get('collection')
        candidates = [
            self.export_data.get('url'),
            nested.get('url') if isinstance(nested, dict) else None
        ]
        for collection_url in candidates:
            if collection_url:
                # Ensure it ends with /
                return collection_url if collection_url.endswith('/') else collection_url + '/'
        
        print("   ⚠️ Could not determine collection URL from export data")
        return "/collections/unknown/"
    
    def _is_duplicate_expression(self, expression: str) -> bool:
        """Check if expression is a duplicate and track it."""
        normalized = normalize_expression(expression)
        
        if normalized in self.seen_expressions:
            self.stats['duplicates_found'] += 1
            if self.config.REPORT_DUPLICATES:
                self.duplicate_expressions.append(expression)
            return True
        
        self.seen_expressions.add(normalized)
        return False
    
    def process_references(self, references: List[Dict], export_data: Dict) -> Tuple[List[Dict], List[Dict]]:
        """
        Process references to create unversioned equivalents and deletion commands.
        ONLY processes versioned references for modernization.
        """
        self.export_data = export_data
        self.analyze_references(references)
        
        processed_references = []
        references_to_delete = []
        
        print(f"\n🔄 Processing {len(references)} references...")
        print(f"   🎯 Strategy: Only modernize versioned references")
        print(f"   🚫 Duplicate handling: {'Skip' if self.config.SKIP_DUPLICATES else 'Allow'}")
        
        collection_url = self._get_collection_url({})
        
        for i, ref in enumerate(references, 1):
            expression = self._extract_expression(ref)
            if not expression:
                print(f"   ⚠️ Warning: Could not extract expression from reference {i}")
                continue
            
            # ✨ KEY CHANGE: Only process versioned references
            if not is_versioned_url(expression):
                self.stats['references_unchanged'] += 1
                continue  # Skip unversioned references
            
            self.stats['references_to_modernize'] += 1
            
            # Create unversioned equivalent
            unversioned_expression = strip_version_from_url(expression)
            
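            # Check for duplicates: skip the duplicate add, but still queue the
            # versioned original for deletion so it does not remain in the collection
            if self.config.SKIP_DUPLICATES and self._is_duplicate_expression(unversioned_expression):
                self.stats['duplicates_skipped'] += 1
                references_to_delete.append({
                    "type": "Reference",
                    "collection_url": collection_url,
                    "data": {"expressions": [expression]},
                    "__action": "DELETE"
                })
                continue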
            
            # Determine cascade settings
            cascade_settings = None
            if self.config.PRESERVE_ORIGINAL_CASCADE:
                original_cascade = extract_cascade_from_reference(ref)
                if original_cascade:
                    cascade_settings = original_cascade
                    self.stats['cascade_preserved'] += 1
            
            if cascade_settings is None:
                cascade_settings = self.config.get_cascade_settings()
                self.stats['cascade_preset'] += 1
            
            # Create unversioned reference
            new_ref = {
                "type": "Reference",
                "collection_url": collection_url,
                "data": {
                    "expressions": [unversioned_expression]
                }
            }
            
            # Add cascade if specified (only for concept references)
            if cascade_settings and get_reference_type(unversioned_expression) == 'concepts':
                new_ref["__cascade"] = cascade_settings
            
            processed_references.append(new_ref)
            
            # Create deletion command for the versioned reference
            delete_ref = {
                "type": "Reference",
                "collection_url": collection_url,
                "data": {
                    "expressions": [expression]  # Keep original versioned URL
                },
                "__action": "DELETE"
                # Note: Intentionally NOT including cascade for deletes
            }
            references_to_delete.append(delete_ref)
        
        print(f"\n✅ Processing complete!")
        print(f"   📤 Unversioned references to add: {len(processed_references)}")
        print(f"   🗑️ Versioned references to delete: {len(references_to_delete)}")
        print(f"   ⚡ References left unchanged: {self.stats['references_unchanged']}")
        if self.stats['duplicates_skipped'] > 0:
            print(f"   🚫 Duplicates skipped: {self.stats['duplicates_skipped']}")
        
        return processed_references, references_to_delete


class ReportGenerator:
    """Generates detailed reports on the migration process."""
    
    def __init__(self, processor: ReferenceProcessor):
        self.processor = processor
        self.stats = processor.stats
    
    def print_summary(self) -> None:
        """Print a summary to console."""
        print("\n" + "=" * 60)
        print("📊 MODERNIZATION SUMMARY")
        print("=" * 60)
        print(f"📁 Total references: {self.stats['total_references']}")
        print(f"   🧬 Concepts: {self.stats['concept_references']}")
        print(f"   🔗 Mappings: {self.stats['mapping_references']}")
        print(f"   ❓ Other: {self.stats['other_references']}")
        print()
        print(f"🏷️ Reference Versions:")
        print(f"   📌 Versioned (to modernize): {self.stats['versioned_references']}")
        print(f"   📄 Already unversioned: {self.stats['unversioned_references']}")
        print()
        print(f"⚙️ Cascade Handling:")
        print(f"   💾 Original preserved: {self.stats['cascade_preserved']}")
        print(f"   🔧 Preset applied: {self.stats['cascade_preset']}")
        print()
        print(f"🚫 Duplicate Handling:")
        print(f"   🔍 Duplicates found: {self.stats['duplicates_found']}")
        print(f"   ⏭️ Duplicates skipped: {self.stats['duplicates_skipped']}")
        print()
        # New references = versioned references minus skipped duplicates.
        # (seen_expressions is empty when SKIP_DUPLICATES is False, so it cannot be used here.)
        refs_to_add = self.stats['references_to_modernize'] - self.stats['duplicates_skipped']
        net_change = refs_to_add - self.stats['references_to_modernize']
        print(f"📈 Migration Impact:")
        print(f"   🗑️ References to remove: {self.stats['references_to_modernize']}")
        print(f"   ➕ References to add: {refs_to_add}")
        print(f"   ⚡ References unchanged: {self.stats['references_unchanged']}")
        print(f"   📊 Net change: {net_change:+d}")
        print("=" * 60)
        
        # Show duplicate examples if any
        if self.processor.duplicate_expressions:
            print(f"\n🚫 Duplicate expressions found (showing first 5):")
            for i, expr in enumerate(self.processor.duplicate_expressions[:5]):
                print(f"   {i+1}. {expr}")
            if len(self.processor.duplicate_expressions) > 5:
                print(f"   ... and {len(self.processor.duplicate_expressions) - 5} more")
    
    def generate_report(self, output_file: str) -> None:
        """Generate a detailed migration report."""
        report_lines = [
            "OCL Reference Modernizer - Migration Report",
            "=" * 50,
            f"Generated: {datetime.now().isoformat()}",
            f"GitHub Issue: https://github.com/OpenConceptLab/ocl_issues/issues/2212",
            f"Cascade Preset: {self.processor.config.CASCADE_PRESET}",
            "",
            "REFERENCE ANALYSIS",
            "-" * 20,
            f"Total references found: {self.stats['total_references']}",
            f"Versioned references (to modernize): {self.stats['versioned_references']}",
            f"Already unversioned (unchanged): {self.stats['unversioned_references']}",
            f"References processed: {self.stats['references_to_modernize']}",
            "",
            "REFERENCE TYPES",
            "-" * 15,
            f"Concept references: {self.stats['concept_references']}",
            f"Mapping references: {self.stats['mapping_references']}",
            f"Other references: {self.stats['other_references']}",
            "",
            "CASCADE HANDLING",
            "-" * 15,
            f"Original cascade preserved: {self.stats['cascade_preserved']}",
            f"Preset cascade applied: {self.stats['cascade_preset']}",
            "",
            "DUPLICATE HANDLING",
            "-" * 17,
            f"Duplicates found: {self.stats['duplicates_found']}",
            f"Duplicates skipped: {self.stats['duplicates_skipped']}",
            "",
            "MIGRATION SUMMARY",
            "-" * 17,
            f"References to be deleted: {self.stats['references_to_modernize']}",
            # New references = versioned references minus skipped duplicates
            f"New unversioned references: {self.stats['references_to_modernize'] - self.stats['duplicates_skipped']}",
            f"References left unchanged: {self.stats['references_unchanged']}",
            # Net change = adds - deletes, which reduces to minus the skipped duplicates
            f"Net change in references: {-self.stats['duplicates_skipped']:+d}",
            "",
            "NEXT STEPS",
            "-" * 10,
            "1. Review generated JSONL files",
            "2. Test import on a backup/test collection first",
            "3. Use OCL's Bulk Import Interface",
            "4. Import deletion file FIRST to remove versioned references",
            "5. Import unversioned references file SECOND",
            "6. Verify collection state and expansions"
        ]
        
        # Add duplicate details if any
        if self.processor.duplicate_expressions:
            report_lines.extend([
                "",
                "DUPLICATE EXPRESSIONS FOUND",
                "-" * 27
            ])
            for expr in self.processor.duplicate_expressions[:20]:
                report_lines.append(f"- {expr}")
            if len(self.processor.duplicate_expressions) > 20:
                report_lines.append(f"... and {len(self.processor.duplicate_expressions) - 20} more")
        
        try:
            with open(output_file, 'w', encoding='utf-8') as f:
                f.write('\n'.join(report_lines))
            print(f"📄 Migration report saved to: {output_file}")
        except IOError as e:
            print(f"❌ Error saving report: {e}")

print("🔧 Processing classes defined successfully!")

🔧 Processing classes defined successfully!
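
For a single versioned reference, the processor emits one record to add and one to delete. The sketch below reproduces that pairing in isolation so the output shape is easy to inspect; `build_records` is a hypothetical helper, not part of the notebook, and the field names simply mirror those used in `process_references`. Whether the bulk importer accepts exactly this shape is taken from the notebook, not verified here.

```python
import json
import re
from typing import Dict, Optional, Tuple


def build_records(expression: str, collection_url: str,
                  cascade: Optional[Dict] = None) -> Tuple[Dict, Dict]:
    """Return the (add, delete) record pair for one versioned expression."""
    unversioned = re.sub(r'/(concepts|sources|mappings)/([^/]+)/\d+/?$',
                         r'/\1/\2/', expression)
    add = {
        "type": "Reference",
        "collection_url": collection_url,
        "data": {"expressions": [unversioned]},
    }
    # Cascade is attached to concept references only, as in the notebook.
    if cascade and '/concepts/' in unversioned:
        add["__cascade"] = cascade
    delete = {
        "type": "Reference",
        "collection_url": collection_url,
        "data": {"expressions": [expression]},  # original versioned URL
        "__action": "DELETE",
    }
    return add, delete


add, delete = build_records(
    "/orgs/CIEL/sources/CIEL/concepts/1015/5282451/",
    "/users/jamlung/collections/squad-test-not-validation/",
    cascade={"method": "sourcetoconcepts", "cascade_levels": "*"},
)
print(json.dumps(add))
print(json.dumps(delete))
```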


## Step 6: Load and Validate Input File

**📁 Let's load your OCL export file and see what we're working with:**

In [40]:
# Check if input file exists and load it
print(f"📁 Checking input file: {config.INPUT_FILE}")

if not os.path.exists(config.INPUT_FILE):
    print(f"❌ File not found: {config.INPUT_FILE}")
    print("\n📋 Please:")
    print("   1. Export your OCL collection to JSON format")
    print("   2. Update the INPUT_FILE path in the Configuration cell")
    print("   3. Re-run this cell")
    export_data = None
else:
    try:
        # Load the export file
        export_data = load_collection_export(config.INPUT_FILE)
        
        # Show basic file info
        file_size = os.path.getsize(config.INPUT_FILE)
        print(f"📊 File size: {file_size:,} bytes ({file_size/1024/1024:.1f} MB)")
        
        # Show top-level keys in export
        if isinstance(export_data, dict):
            print(f"🔑 Top-level keys in export: {list(export_data.keys())}")
        
        # Show collection info
        if 'url' in export_data:
            print(f"🏠 Collection URL: {export_data['url']}")
        if 'name' in export_data:
            print(f"📛 Collection name: {export_data['name']}")
        
        print("\n✅ Export file loaded successfully!")
        
    except Exception as e:
        print(f"❌ Error loading file: {e}")
        export_data = None

📁 Checking input file: input-example-jamlung/export.json
📥 Loading collection export from: input-example-jamlung/export.json
✅ Export file loaded successfully
📊 File size: 180,743 bytes (0.2 MB)
🔑 Top-level keys in export: ['type', 'uuid', 'id', 'short_code', 'name', 'full_name', 'description', 'collection_type', 'custom_validation_schema', 'public_access', 'default_locale', 'supported_locales', 'website', 'url', 'owner', 'owner_type', 'owner_url', 'version_url', 'previous_version_url', 'created_on', 'updated_on', 'created_by', 'updated_by', 'extras', 'external_id', 'version', 'concepts_url', 'mappings_url', 'expansions_url', 'is_processing', 'released', 'retired', 'canonical_url', 'identifier', 'publisher', 'contact', 'jurisdiction', 'purpose', 'copyright', 'meta', 'immutable', 'revision_date', 'text', 'experimental', 'locked_date', 'autoexpand', 'expansion_url', 'checksums', 'collection', 'concepts', 'references', 'mappings', 'export_time']
🏠 Collection URL: /users/jamlung/collection

## Step 7: Extract References from Export

**📤 Now let's extract the references from your export file:**

In [41]:
if export_data is None:
    print("❌ No export data available. Please fix the input file issue above.")
    references = []
else:
    try:
        # Extract references from the export
        references = extract_references_from_export(export_data)
        
        if not references:
            print("⚠️ No references found in export file.")
            print("\n📋 Export structure:")
            if isinstance(export_data, dict):
                for key in export_data.keys():
                    value = export_data[key]
                    if isinstance(value, list):
                        print(f"   📋 {key}: {len(value)} items")
                    else:
                        print(f"   📁 {key}: {type(value).__name__}")
        else:
            print(f"\n🎯 Success! Found {len(references)} references to analyze.")
            
            # Show sample reference structure
            if references:
                print("\n📋 Sample reference structure:")
                sample_ref = references[0]
                for key, value in sample_ref.items():
                    if isinstance(value, str) and len(value) > 50:
                        print(f"   {key}: {value[:50]}...")
                    else:
                        print(f"   {key}: {value}")
                        
                # Quick analysis
                versioned_count = sum(1 for ref in references if 'expression' in ref and is_versioned_url(ref['expression']))
                print(f"\n🔍 Quick analysis:")
                print(f"   📌 Versioned references: {versioned_count}")
                print(f"   📄 Unversioned references: {len(references) - versioned_count}")
                    
    except Exception as e:
        print(f"❌ Error extracting references: {e}")
        references = []

🔍 Extracting references from export data...
   ✅ Found 21 references
   📋 Sample reference keys: ['expression', 'reference_type', 'id', 'last_resolved_at', 'uri', 'uuid', 'include', 'type', 'code', 'resource_version', 'namespace', 'system', 'version', 'valueset', 'cascade', 'filter', 'display', 'created_at', 'updated_at', 'concepts', 'mappings', 'translation', 'transform']
   📋 Sample expression: /users/jamlung/sources/zim_demo/concepts/12/

🎯 Success! Found 21 references to analyze.

📋 Sample reference structure:
   expression: /users/jamlung/sources/zim_demo/concepts/12/
   reference_type: concepts
   id: 11654253
   last_resolved_at: 2025-08-20T14:50:33.697777Z
   uri: /users/jamlung/collections/facility-test/script-te...
   uuid: 11654253
   include: True
   type: CollectionReference
   code: 12
   resource_version: None
   namespace: None
   system: /users/jamlung/sources/zim_demo/
   version: None
   valueset: None
   cascade: {'method': 'sourcetoconcepts', 'map_types': 'Q-AND-A,
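
The quick analysis above counts only versioned references. A slightly richer breakdown, by reference type and version state together, can be sketched with `collections.Counter`. The `tally` function and the three sample references are illustrative; the expressions are taken from examples elsewhere in this notebook.

```python
import re
from collections import Counter
from typing import Dict, List

VERSIONED = re.compile(r'/(concepts|sources|mappings)/[^/]+/\d+/?$')


def tally(references: List[Dict]) -> Counter:
    """Count references by (type, version state)."""
    counts: Counter = Counter()
    for ref in references:
        expression = ref.get('expression') or ''
        if '/concepts/' in expression:
            kind = 'concepts'
        elif '/mappings/' in expression:
            kind = 'mappings'
        else:
            kind = 'other'
        state = 'versioned' if VERSIONED.search(expression) else 'unversioned'
        counts[(kind, state)] += 1
    return counts


sample = [
    {"expression": "/users/jamlung/sources/zim_demo/concepts/12/"},
    {"expression": "/orgs/CIEL/sources/CIEL/concepts/1015/5282451/"},
    {"expression": "/users/jamlung/sources/openmrs-demo-source/mappings/76/6495007/"},
]
for (kind, state), count in sorted(tally(sample).items()):
    print(f"{kind:9} {state:12} {count}")
```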

## Step 8: Process References

**⚙️ Now let's process the references and generate the modernized versions:**

In [42]:
if not references:
    print("❌ No references to process. Please check the steps above.")
    processor = None
    processed_references = []
    references_to_delete = []
else:
    try:
        # Initialize the processor
        processor = ReferenceProcessor(config)
        
        # Process the references
        processed_references, references_to_delete = processor.process_references(references, export_data)
        
        # Show detailed results
        report_generator = ReportGenerator(processor)
        report_generator.print_summary()
        
    except Exception as e:
        print(f"❌ Error processing references: {e}")
        processor = None
        processed_references = []
        references_to_delete = []

📊 Analyzing references...
✅ Reference analysis complete!

🔄 Processing 21 references...
   🎯 Strategy: Only modernize versioned references
   🚫 Duplicate handling: Skip

✅ Processing complete!
   📤 Unversioned references to add: 18
   🗑️ Versioned references to delete: 18
   ⚡ References left unchanged: 3

📊 MODERNIZATION SUMMARY
📁 Total references: 21
   🧬 Concepts: 6
   🔗 Mappings: 15
   ❓ Other: 0

🏷️ Reference Versions:
   📌 Versioned (to modernize): 18
   📄 Already unversioned: 3

⚙️ Cascade Handling:
   💾 Original preserved: 0
   🔧 Preset applied: 18

🚫 Duplicate Handling:
   🔍 Duplicates found: 0
   ⏭️ Duplicates skipped: 0

📈 Migration Impact:
   🗑️ References to remove: 18
   ➕ References to add: 18
   ⚡ References unchanged: 3
   📊 Net change: +0
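
The sample run reports no duplicates. A duplicate arises when two versioned references point at different versions of the same resource, because both collapse to the same unversioned expression. The sketch below illustrates this with a standalone `plan_adds` function (a hypothetical name) that applies the same normalization as `normalize_expression`; the second version ID is invented for the example.

```python
import re
from typing import List, Tuple

VERSIONED = re.compile(r'/(concepts|sources|mappings)/([^/]+)/\d+/?$')


def plan_adds(expressions: List[str]) -> Tuple[List[str], List[str]]:
    """Split versioned expressions into unique adds and skipped duplicates."""
    seen = set()
    to_add: List[str] = []
    duplicates: List[str] = []
    for expression in expressions:
        unversioned = VERSIONED.sub(r'/\1/\2/', expression)
        # Normalize as the notebook does: trailing slash, lower case.
        key = (unversioned if unversioned.endswith('/') else unversioned + '/').lower()
        if key in seen:
            duplicates.append(expression)
        else:
            seen.add(key)
            to_add.append(unversioned)
    return to_add, duplicates


to_add, duplicates = plan_adds([
    "/orgs/CIEL/sources/CIEL/concepts/1015/5282451/",
    "/orgs/CIEL/sources/CIEL/concepts/1015/5282452/",  # invented second version
])
print("add:", to_add)
print("skipped as duplicate:", duplicates)
```

Only one unversioned reference is added for the pair, which is why the number of references to add can be smaller than the number of versioned references found.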


## Step 9: Generate Output Files

**💾 Let's save the results to files for OCL bulk import:**

In [43]:
if not processed_references and not references_to_delete:
    print("❌ No processed references to save. Please check the processing step above.")
else:
    try:
        print("💾 Saving output files...")
        
        # Create output directory
        output_dir = config.OUTPUT_DIR
        os.makedirs(output_dir, exist_ok=True)
        print(f"📂 Output directory: {output_dir}")
        
        # Generate file paths
        unversioned_path = os.path.join(output_dir, config.UNVERSIONED_FILE)
        delete_path = os.path.join(output_dir, config.DELETE_FILE)
        report_path = os.path.join(output_dir, config.REPORT_FILE)
        
        # Save the files
        if processed_references:
            save_jsonl(processed_references, unversioned_path)
        else:
            print("   ℹ️ No unversioned references to save (no versioned references found)")
            
        if references_to_delete:
            save_jsonl(references_to_delete, delete_path)
        else:
            print("   ℹ️ No references to delete (no versioned references found)")
        
        # Generate detailed report
        if processor:
            report_generator = ReportGenerator(processor)
            report_generator.generate_report(report_path)
        
        print("\n✅ All files saved successfully!")
        print("\n📁 Generated files:")
        if processed_references:
            print(f"   📤 Unversioned references: {unversioned_path}")
        if references_to_delete:
            print(f"   🗑️ References to delete: {delete_path}")
        print(f"   📄 Migration report: {report_path}")
        
    except Exception as e:
        print(f"❌ Error saving files: {e}")

💾 Saving output files...
📂 Output directory: output
💾 Saved 18 items to output\unversioned_references_20250820_105615.json
💾 Saved 18 items to output\references_to_delete_20250820_105615.json
📄 Migration report saved to: output\migration_report_20250820_105615.txt

✅ All files saved successfully!

📁 Generated files:
   📤 Unversioned references: output\unversioned_references_20250820_105615.json
   🗑️ References to delete: output\references_to_delete_20250820_105615.json
   📄 Migration report: output\migration_report_20250820_105615.txt

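Although the generated files carry a `.json` extension, the Overview describes them as JSONL: one JSON object per line. The notebook's own `save_jsonl` helper is defined in an earlier cell. The sketch below only illustrates the format, using the hypothetical helpers `write_jsonl` and `read_jsonl` and a temporary directory.

```python
import json
import os
import tempfile
from typing import Dict, List


def write_jsonl(items: List[Dict], path: str) -> None:
    """Write one compact JSON object per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")


def read_jsonl(path: str) -> List[Dict]:
    """Parse every non-empty line as its own JSON object (hypothetical helper)."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


records = [
    {"type": "Reference", "data": {"expressions": ["/orgs/CIEL/sources/CIEL/mappings/314250/"]}},
    {"type": "Reference", "data": {"expressions": ["/orgs/CIEL/sources/CIEL/mappings/312829/"]}},
]
path = os.path.join(tempfile.mkdtemp(), "example.json")
write_jsonl(records, path)

# Round trip: the file has one line per record and parses back unchanged.
loaded = read_jsonl(path)
print(len(loaded), loaded == records)
```

Because each line is a complete document, calling `json.load` on the whole file would fail once it holds more than one record. Read it line by line instead, as the preview cell in Step 10 does.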

## Step 10: Preview Generated Files

**👀 Let's take a look at what we generated:**

In [44]:
def display_file_preview(filename: str, num_lines: int = 3) -> None:
    """Display preview of a JSONL file."""
    try:
        if os.path.exists(filename):
            print(f"\n📄 Preview of {os.path.basename(filename)} (first {num_lines} lines):")
            print("-" * 50)
            with open(filename, 'r', encoding='utf-8') as f:
                for i, line in enumerate(f):
                    if i < num_lines:
                        try:
                            data = json.loads(line.strip())
                            print(f"Line {i+1}:")
                            print(json.dumps(data, indent=2))
                            print()
                        except json.JSONDecodeError:
                            print(f"Line {i+1}: {line.strip()}")
                    else:
                        break
        else:
            print(f"❌ File not found: {filename}")
    except IOError as e:
        print(f"❌ Error reading {filename}: {e}")


# Preview the generated files
if config.OUTPUT_DIR:
    unversioned_path = os.path.join(config.OUTPUT_DIR, config.UNVERSIONED_FILE)
    delete_path = os.path.join(config.OUTPUT_DIR, config.DELETE_FILE)
    
    print("👁️ Previewing generated files...")
    
    if os.path.exists(unversioned_path):
        display_file_preview(unversioned_path, 2)
    else:
        print(f"\nℹ️ {os.path.basename(unversioned_path)} not created (no versioned references to modernize)")
        
    if os.path.exists(delete_path):
        display_file_preview(delete_path, 2)
    else:
        print(f"\nℹ️ {os.path.basename(delete_path)} not created (no versioned references to delete)")
    
    # Show file sizes
    for filepath in [unversioned_path, delete_path]:
        if os.path.exists(filepath):
            size = os.path.getsize(filepath)
            with open(filepath, 'r', encoding='utf-8') as f:
                lines = sum(1 for _ in f)
            print(f"📊 {os.path.basename(filepath)}: {lines} lines, {size:,} bytes")

👁️ Previewing generated files...

📄 Preview of unversioned_references_20250820_105615.json (first 2 lines):
--------------------------------------------------
Line 1:
{
  "type": "Reference",
  "collection_url": "/users/jamlung/collections/facility-test/",
  "data": {
    "expressions": [
      "/orgs/CIEL/sources/CIEL/mappings/314250/"
    ]
  }
}

Line 2:
{
  "type": "Reference",
  "collection_url": "/users/jamlung/collections/facility-test/",
  "data": {
    "expressions": [
      "/orgs/CIEL/sources/CIEL/mappings/312829/"
    ]
  }
}


📄 Preview of references_to_delete_20250820_105615.json (first 2 lines):
--------------------------------------------------
Line 1:
{
  "type": "Reference",
  "collection_url": "/users/jamlung/collections/facility-test/",
  "data": {
    "expressions": [
      "/orgs/CIEL/sources/CIEL/mappings/314250/5406810/"
    ]
  },
  "__action": "DELETE"
}

Line 2:
{
  "type": "Reference",
  "collection_url": "/users/jamlung/collections/facility-test/",
  "data"

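The previews show two record shapes: an addition carries only the expression, while a deletion also sets `"__action": "DELETE"`. The following is a minimal sketch of building such a pair, assuming only the field names visible in the previews. `build_add_record` and `build_delete_record` are hypothetical helpers, and any cascade settings the notebook may attach are omitted.

```python
import json
from typing import Dict


def build_add_record(collection_url: str, expression: str) -> Dict:
    """Record that adds one expression to the collection (hypothetical helper)."""
    return {
        "type": "Reference",
        "collection_url": collection_url,
        "data": {"expressions": [expression]},
    }


def build_delete_record(collection_url: str, expression: str) -> Dict:
    """Same shape as an addition, plus the DELETE action flag (hypothetical helper)."""
    record = build_add_record(collection_url, expression)
    record["__action"] = "DELETE"
    return record


collection = "/users/jamlung/collections/facility-test/"
add = build_add_record(collection, "/orgs/CIEL/sources/CIEL/mappings/314250/")
delete = build_delete_record(collection, "/orgs/CIEL/sources/CIEL/mappings/314250/5406810/")

print(json.dumps(add))
print(json.dumps(delete))
```
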
## Step 11: Validation & Next Steps

**✅ Let's validate our results and plan next steps:**

In [45]:
print("🔍 VALIDATION CHECKLIST")
print("=" * 40)

# Check if we have valid results
if processor and hasattr(processor, 'stats'):
    stats = processor.stats
    
    # Validation checks
    checks = [
        ("📊 References found", stats['total_references'] > 0),
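        # Zero is a valid count: the collection may already be fully unversioned
        ("🏷️ Versioned refs identified", stats['versioned_references'] >= 0),
        ("🔧 Processing completed", True),  # If we got here, processing completed
        # Passes when the setting exists; skipping duplicates can legitimately be turned off
        ("🚫 Duplicate handling configured", hasattr(config, 'SKIP_DUPLICATES')),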
        ("⚙️ Cascade handling configured", True),  # Always true if we got this far
    ]
    
    for check_name, passed in checks:
        status = "✅" if passed else "❌"
        print(f"{status} {check_name}")
    
    print("\n📋 STATISTICS SUMMARY:")
    print(f"   📁 Total references: {stats['total_references']}")
    print(f"   🧬 Concept references: {stats['concept_references']}")
    print(f"   🔗 Mapping references: {stats['mapping_references']}")
    print(f"   📌 Versioned (to modernize): {stats['versioned_references']}")
    print(f"   📄 Already unversioned: {stats['unversioned_references']}")
    print(f"   🚫 Duplicates handled: {stats['duplicates_found']} found, {stats['duplicates_skipped']} skipped")
    
    print("\n🎯 EXPECTED RESULTS AFTER MIGRATION:")
    if stats['versioned_references'] > 0:
        print(f"   🗑️ References to be deleted: {stats['references_to_modernize']}")
        print(f"   ➕ References to be added: {len(processed_references)}")
        print(f"   ⚡ References left unchanged: {stats['references_unchanged']}")
        print(f"   📊 Net change: {len(processed_references) - stats['references_to_modernize']}")
    else:
        print("   ℹ️ No versioned references found - collection is already modernized!")
        print(f"   📊 All {stats['total_references']} references are already unversioned")
    
else:
    print("❌ No valid processing results to validate.")
    print("   Please check the previous steps for errors.")

print("\n" + "=" * 40)
print("🚀 NEXT STEPS")
print("=" * 40)

if processor and processor.stats['versioned_references'] > 0:
    print("1. 📋 Review the migration report and file previews above")
    print("2. 💾 Backup your OCL collection (create a version)")
    print("3. 🧪 Test on a backup/test collection first (recommended)")
    print("4. 🌐 Use OCL's Bulk Import Interface:")
    print("   a. Import the DELETE file FIRST")
    print("   b. Then import the unversioned references file")
    print("5. ✅ Verify collection state and functionality")
    print("6. 📊 Check collection expansion results")
else:
    print("🎉 Great news! Your collection is already modernized.")
    print("   ✅ All references are already unversioned")
    print("   📊 No migration needed")

print("\n⚠️ IMPORTANT REMINDERS:")
print("   • Always backup before applying changes")
print("   • Test on a small collection first")
print("   • Apply deletions BEFORE additions")
print("   • Validate results thoroughly")

if config.OUTPUT_DIR:
    print(f"\n📁 Your files are ready in: {config.OUTPUT_DIR}/")
    print(f"   🕐 Generated at: {config.TIMESTAMP}")
    print(f"   🎛️ Using cascade preset: {config.CASCADE_PRESET}")

🔍 VALIDATION CHECKLIST
✅ 📊 References found
✅ 🏷️ Versioned refs identified
✅ 🔧 Processing completed
✅ 🚫 Duplicate handling configured
✅ ⚙️ Cascade handling configured

📋 STATISTICS SUMMARY:
   📁 Total references: 21
   🧬 Concept references: 6
   🔗 Mapping references: 15
   📌 Versioned (to modernize): 18
   📄 Already unversioned: 3
   🚫 Duplicates handled: 0 found, 0 skipped

🎯 EXPECTED RESULTS AFTER MIGRATION:
   🗑️ References to be deleted: 18
   ➕ References to be added: 18
   ⚡ References left unchanged: 3
   📊 Net change: 0

🚀 NEXT STEPS
1. 📋 Review the migration report and file previews above
2. 💾 Backup your OCL collection (create a version)
3. 🧪 Test on a backup/test collection first (recommended)
4. 🌐 Use OCL's Bulk Import Interface:
   a. Import the DELETE file FIRST
   b. Then import the unversioned references file
5. ✅ Verify collection state and functionality
6. 📊 Check collection expansion results

⚠️ IMPORTANT REMINDERS:
   • Always backup before applying changes
   • Tes

## 🎉 Congratulations!

You've successfully analyzed your OCL collection references and generated modernization files (if needed)!

### 📁 Generated Files:
- **`unversioned_references_*.json`** - Import this second to add the unversioned references (created only if versioned references were found)
- **`references_to_delete_*.json`** - Import this FIRST to remove the old versioned references (created only if any were found)
- **`migration_report_*.txt`** - Detailed report with statistics and instructions

### ✨ Key Improvements in This Version:
- **🎯 Smart Processing**: Modernizes only versioned references and leaves unversioned ones unchanged
- **🚫 Duplicate Prevention**: Detects and optionally skips duplicate expressions
- **🎛️ Cascade Presets**: Support for OpenMRS and custom cascade configurations
- **📊 Better Reporting**: More detailed statistics and validation
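
Duplicate prevention can be sketched as a single pass with a set of expressions already seen. One way duplicates might arise is when two references to different versions of the same resource collapse to the same unversioned expression. The helper name `skip_duplicates` is hypothetical, and this is not the notebook's implementation.

```python
from typing import Iterable, List, Tuple


def skip_duplicates(expressions: Iterable[str]) -> Tuple[List[str], int]:
    """Keep the first occurrence of each expression and count the repeats."""
    seen = set()
    unique: List[str] = []
    duplicates = 0
    for expression in expressions:
        if expression in seen:
            duplicates += 1  # repeat of an expression already kept
            continue
        seen.add(expression)
        unique.append(expression)
    return unique, duplicates


unique, duplicates = skip_duplicates([
    "/orgs/CIEL/sources/CIEL/mappings/314250/",
    "/orgs/CIEL/sources/CIEL/mappings/312829/",
    "/orgs/CIEL/sources/CIEL/mappings/314250/",  # repeat of the first entry
])
print(unique, duplicates)
```

Keeping the first occurrence preserves the original reference order, which makes the output easier to compare against the export.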

### 🔗 Useful Links:
- **GitHub Issue**: [#2212](https://github.com/OpenConceptLab/ocl_issues/issues/2212)
- **OCL Bulk Import Docs**: [OCL API Reference](https://docs.openconceptlab.org/en/latest/oclapi/apireference/bulkimporting.html)
- **Cascade Documentation**: [OCL Cascade Reference](https://docs.openconceptlab.org/en/latest/oclapi/apireference/cascade.html)

### ⚠️ Remember:
1. **Back up first** - Create a version of your collection before importing
2. **Test first** - Try on a test collection before production
3. **Order matters** - Import deletions before additions
4. **Validate** - Check your collection expansion after migration
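
Before importing, it may be worth confirming that each file contains what the ordering rule assumes: every line in the delete file is flagged `"__action": "DELETE"`, and no line in the add file is. The check below is a minimal sketch over already-parsed records. `find_problems` is a hypothetical helper, and the field names are taken from the previews in Step 10.

```python
from typing import Dict, List


def find_problems(add_records: List[Dict], delete_records: List[Dict]) -> List[str]:
    """Return human-readable problems; an empty list means both files look consistent."""
    problems: List[str] = []
    for i, record in enumerate(delete_records, start=1):
        if record.get("__action") != "DELETE":
            problems.append(f"delete file line {i}: missing __action DELETE")
    for i, record in enumerate(add_records, start=1):
        if "__action" in record:
            problems.append(f"add file line {i}: unexpected __action")
    for label, records in (("add", add_records), ("delete", delete_records)):
        for i, record in enumerate(records, start=1):
            if not record.get("data", {}).get("expressions"):
                problems.append(f"{label} file line {i}: no expressions")
    return problems


adds = [{"type": "Reference", "data": {"expressions": ["/orgs/CIEL/sources/CIEL/mappings/314250/"]}}]
deletes = [{
    "type": "Reference",
    "data": {"expressions": ["/orgs/CIEL/sources/CIEL/mappings/314250/5406810/"]},
    "__action": "DELETE",
}]
print(find_problems(adds, deletes))  # no problems for well-formed records
print(find_problems(deletes, adds))  # swapping the files is reported
```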

---

*This notebook implements the enhanced OCL Reference Modernizer as specified in GitHub issue #2212, with improved behavior for processing only versioned references, preventing duplicates, and supporting cascade presets.*