# Minimal Gzip + JSON Indexer for MRF Files

This notebook implements a system that indexes massive MRF .json.gz files (e.g., 10GB+) to enable partial extraction of top-level JSON keys without full decompression or full parsing.

## Features
- Uses `indexed_gzip` for fast random access to gzip files
- Minimal JSON structural index built with overlapping chunks
- Identifies and extracts complete scalar values during indexing
- Records start offsets for arrays and objects
- **Read-only operations**: Indexing does NOT modify the original .json.gz files

## MRF File Structure
MRF files are always top-level objects with keys like:
- `reporting_entity_name` (scalar)
- `provider_references` (array)
- `in_network` (array)
- `reporting_structure` (object)
- etc.

The indexer records where each top-level key's value starts, allowing extraction without reading the entire file.

## Index Files
- **JSON Structural Index** (`.index`): Maps top-level keys to their uncompressed byte offsets
- **Gzip Index** (`.gzidx`): Enables fast seeking within the gzip file (managed by `indexed_gzip`)

## Imports and Setup

In [12]:
import json
import indexed_gzip as igzip
from pathlib import Path
from typing import Dict, Optional
from dataclasses import dataclass
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
LOG = logging.getLogger(__name__)

## Data Structures

In [13]:
@dataclass
class JsonIndex:
    """Minimal JSON structural index for top-level objects."""
    top_level_keys: Dict[str, int] = None  # key -> value start offset (for arrays and objects)
    scalar_values: Dict[str, str] = None  # key -> complete scalar string value
    
    def __post_init__(self):
        if self.top_level_keys is None:
            self.top_level_keys = {}
        if self.scalar_values is None:
            self.scalar_values = {}
    
    def to_dict(self) -> dict:
        return {
            'top_level_keys': self.top_level_keys,
            'scalar_values': self.scalar_values
        }
    
    @classmethod
    def from_dict(cls, d: dict) -> 'JsonIndex':
        return cls(
            top_level_keys=d.get('top_level_keys', {}),
            scalar_values=d.get('scalar_values', {})
        )
    
    def save(self, path: Path) -> None:
        """Save index to JSON file."""
        with open(path, 'w') as f:
            json.dump(self.to_dict(), f, indent=2)
    
    @classmethod
    def load(cls, path: Path) -> 'JsonIndex':
        """Load index from JSON file."""
        with open(path, 'r') as f:
            return cls.from_dict(json.load(f))

## Helper Functions

In [14]:
# Helper function to check if character is whitespace
def is_whitespace(ch: str) -> bool:
    """Check if character is JSON whitespace."""
    return ch in ' \t\n\r'

## Indexing Algorithm

In [None]:
def build_index(
    path_gz: Path,
    json_index_file: Path,
    gzip_index_file: Optional[Path] = None,
    spacing: int = 300 * 1024,  # 300KB spacing for gzip index
    chunk_size: int = 64 * 1024,  # 64KB chunks for overlapping
    overlap_size: int = 1024  # 1KB overlap to handle values spanning chunks
) -> JsonIndex:
    """
    Build both gzip index and JSON structural index for an MRF file using pattern matching.
    
    Uses a two-pass approach:
    1. Pattern matching pass: Find all "key": patterns, count occurrences, determine value types
    2. Scalar extraction pass: Extract complete string scalar values for top-level keys
    
    For MRF files (top-level objects), this function:
    - Identifies top-level keys by counting pattern occurrences (count=1 = top-level)
    - Determines value types (scalar, array, object) immediately after colon
    - Extracts complete scalar values (strings, numbers, booleans, null)
    - Records start offsets for arrays and objects
    
    Args:
        path_gz: Path to .json.gz file (must be a top-level object)
        json_index_file: Path where to save the JSON structural index
        gzip_index_file: Optional path where to save the gzip index (if None, uses default)
        spacing: Spacing between seek points for gzip index (default: 300KB)
        chunk_size: Size of chunks to read (default: 64KB)
        overlap_size: Overlap between chunks to handle patterns spanning boundaries (default: 1KB)
    
    Returns:
        JsonIndex with top-level keys (offsets for arrays/objects) and scalar_values (complete scalars)
    """
    LOG.info(f"Building indexes for {path_gz.name}...")
    
    # Open file with indexed_gzip (builds gzip index during reading)
    f = igzip.IndexedGzipFile(str(path_gz), spacing=spacing)
    
    try:
        f.seek(0)
        
        # Pattern matching: track all key occurrences
        key_occurrences = {}  # key_name -> count
        key_info = {}  # key_name -> list of (chunk_idx, colon_offset, value_start_offset, value_type)
        
        # Overlap handling for pattern matching
        overlap_buffer = b""  # Keep as bytes for pattern matching
        total_bytes_read = 0  # Track total bytes read from file
        
        LOG.info(f"Pass 1: Pattern matching in chunks of {chunk_size:,} bytes with {overlap_size:,} byte overlap...")
        
        chunk_count = 0
        
        # Pattern to match: "key":
        # In bytes: b'"' + key_name + b'":'
        # We'll scan for b'":' and work backwards to find the key
        
        while True:
            # Read chunk as bytes
            chunk_bytes = f.read(chunk_size)
            if not chunk_bytes:
                break
            
            chunk_count += 1
            
            # Prepend overlap buffer for pattern continuity
            if overlap_buffer:
                chunk = overlap_buffer + chunk_bytes
                chunk_start_offset = total_bytes_read - len(overlap_buffer)
            else:
                chunk = chunk_bytes
                chunk_start_offset = total_bytes_read
            
            # Scan for "key": patterns
            # Pattern: quote, key name, quote, colon
            i = 0
            while i < len(chunk) - 3:  # Need at least 4 bytes: "key":
                # Look for quote-colon pattern: ":
                if chunk[i:i+2] == b'":':
                    # Found potential end of key, work backwards to find start
                    # Find the opening quote before this
                    key_start = i
                    while key_start > 0 and chunk[key_start] != ord('"'):
                        key_start -= 1
                    
                    if key_start < i and chunk[key_start] == ord('"'):
                        # Extract key name (between quotes)
                        key_name_bytes = chunk[key_start+1:i]
                        try:
                            key_name = key_name_bytes.decode('utf-8')
                            
                            # Calculate byte offsets
                            colon_offset = chunk_start_offset + i
                            
                            # Look ahead to determine value type (skip whitespace after colon)
                            value_start = i + 2  # After '":'
                            while value_start < len(chunk) and chunk[value_start] in b' \t\n\r':
                                value_start += 1
                            
                            if value_start < len(chunk):
                                value_char = chunk[value_start]
                                value_start_offset = chunk_start_offset + value_start
                                
                                # Determine value type
                                if value_char == ord('['):
                                    value_type = 'array'
                                elif value_char == ord('{'):
                                    value_type = 'object'
                                elif value_char == ord('"'):
                                    value_type = 'string'
                                elif value_char in b'-0123456789':
                                    value_type = 'number'
                                elif value_char == ord('t'):
                                    value_type = 'true'
                                elif value_char == ord('f'):
                                    value_type = 'false'
                                elif value_char == ord('n'):
                                    value_type = 'null'
                                else:
                                    value_type = 'unknown'
                                
                                # Track this key occurrence
                                if key_name not in key_occurrences:
                                    key_occurrences[key_name] = 0
                                    key_info[key_name] = []
                                
                                key_occurrences[key_name] += 1
                                key_info[key_name].append((
                                    chunk_count,
                                    colon_offset,
                                    value_start_offset,
                                    value_type
                                ))
                                
                                LOG.debug(f"Found key '{key_name}' (type: {value_type}) at offset {colon_offset:,}")
                            
                            # Skip past this key to avoid matching it again
                            i = value_start if value_start < len(chunk) else i + 2
                        else:
                            i += 1
                    else:
                        i += 1
                else:
                    i += 1
            
            # Update total bytes read
            total_bytes_read += len(chunk_bytes)
            
            # Save overlap buffer
            if len(chunk) > overlap_size:
                overlap_buffer = chunk[-overlap_size:]
            else:
                overlap_buffer = chunk
            
            if chunk_count % 100 == 0:
                LOG.debug(f"Processed {chunk_count} chunks, found {len(key_occurrences)} unique keys...")
        
        # Filter to top-level keys (occurrence count = 1)
        top_level_keys = {}  # key -> offset (for arrays and objects)
        scalar_values = {}  # key -> complete scalar value
        top_level_string_keys = []  # Keys that need string extraction
        
        LOG.info(f"Found {len(key_occurrences)} unique key(s), identifying top-level keys...")
        
        for key_name, count in key_occurrences.items():
            if count == 1:
                # This is a top-level key
                info = key_info[key_name][0]
                colon_offset, value_start_offset, value_type = info[1], info[2], info[3]
                
                if value_type in ['array', 'object']:
                    top_level_keys[key_name] = value_start_offset
                    LOG.info(f"Found top-level {value_type} key '{key_name}' at offset {value_start_offset:,}")
                elif value_type == 'string':
                    top_level_string_keys.append((key_name, colon_offset, value_start_offset))
                    LOG.debug(f"Found top-level string key '{key_name}' at offset {value_start_offset:,}")
                elif value_type in ['number', 'true', 'false', 'null']:
                    # These are short, we can extract them now
                    # But we need to read from the file to get the complete value
                    # For now, we'll do a second pass for all scalars
                    top_level_string_keys.append((key_name, colon_offset, value_start_offset))
        
        LOG.info(f"Identified {len(top_level_keys)} top-level array/object key(s)")
        LOG.info(f"Identified {len(top_level_string_keys)} top-level scalar key(s) to extract")
        
        # Pass 2: Extract scalar values (with overlap for strings)
        if top_level_string_keys:
            LOG.info("Pass 2: Extracting scalar values...")
            scalar_values = _extract_scalar_values(f, top_level_string_keys, chunk_size, overlap_size)
        
        # Create JSON index
        json_index = JsonIndex(top_level_keys=top_level_keys, scalar_values=scalar_values)
        
        LOG.info(f"JSON indexing complete:")
        LOG.info(f"  - {len(json_index.scalar_values)} scalar value(s)")
        LOG.info(f"  - {len(json_index.top_level_keys)} array/object key(s)")
        
        if json_index.scalar_values:
            LOG.info(f"  Scalar keys: {list(json_index.scalar_values.keys())}")
        if json_index.top_level_keys:
            LOG.info(f"  Array/Object keys: {list(json_index.top_level_keys.keys())}")
        
        # Save JSON index
        json_index.save(json_index_file)
        LOG.info(f"JSON index saved to {json_index_file}")
        
        # Export gzip index if requested
        if gzip_index_file:
            LOG.info("Exporting gzip index...")
            f.export_index(str(gzip_index_file))
            LOG.info(f"Gzip index saved to {gzip_index_file}")
        
        return json_index
    
    finally:
        f.close()


def _extract_scalar_values(
    f: igzip.IndexedGzipFile,
    string_keys: list,
    chunk_size: int,
    overlap_size: int
) -> dict:
    """
    Extract complete scalar values for top-level string keys using overlapping chunks.
    
    Args:
        f: IndexedGzipFile object (already opened)
        string_keys: List of (key_name, colon_offset, value_start_offset) tuples
        chunk_size: Size of chunks to read
        overlap_size: Overlap between chunks
    
    Returns:
        Dictionary mapping key names to their complete scalar values
    """
    scalar_values = {}
    
    # For each key, extract its value
    for key_name, colon_offset, value_start_offset in string_keys:
        LOG.debug(f"Extracting scalar value for '{key_name}' starting at offset {value_start_offset:,}")
        
        f.seek(value_start_offset)
        
        # Read with overlap to get complete value
        overlap_buffer = b""
        value_buffer = b""
        in_string = False
        escape = False
        
        while True:
            chunk_bytes = f.read(chunk_size)
            if not chunk_bytes:
                break
            
            if overlap_buffer:
                chunk = overlap_buffer + chunk_bytes
            else:
                chunk = chunk_bytes
            
            # Extract string value
            for i, byte_val in enumerate(chunk):
                ch = chr(byte_val)
                
                if escape:
                    escape = False
                    value_buffer += bytes([byte_val])
                elif ch == '\\':
                    escape = True
                    value_buffer += bytes([byte_val])
                elif ch == '"':
                    if in_string:
                        # End of string
                        scalar_values[key_name] = value_buffer.decode('utf-8', errors='replace')
                        LOG.debug(f"Extracted string scalar '{key_name}': '{scalar_values[key_name][:50]}...'")
                        break
                    else:
                        # Start of string
                        in_string = True
                        value_buffer = b""
                else:
                    if in_string:
                        value_buffer += bytes([byte_val])
            
            if key_name in scalar_values:
                break
            
            # Save overlap
            if len(chunk) > overlap_size:
                overlap_buffer = chunk[-overlap_size:]
            else:
                overlap_buffer = chunk
        
        # If we didn't find the closing quote, try to extract number/boolean/null
        if key_name not in scalar_values:
            f.seek(value_start_offset)
            # Read a small chunk to get the value
            value_chunk = f.read(100)  # Should be enough for numbers/booleans/null
            if value_chunk:
                try:
                    value_str = value_chunk.decode('utf-8', errors='replace')
                    # Extract until comma, }, or whitespace
                    value = ""
                    for ch in value_str:
                        if ch in ',} \t\n\r':
                            break
                        value += ch
                    
                    if value in ['true', 'false', 'null'] or (value and value[0] in '-0123456789'):
                        scalar_values[key_name] = value
                        LOG.debug(f"Extracted scalar '{key_name}': {value}")
                except:
                    pass
    
    return scalar_values
            if not chunk_bytes:
                break
            
            chunk_count += 1
            
            # Decode to string
            try:
                new_chunk = chunk_bytes.decode('utf-8')
            except UnicodeDecodeError:
                new_chunk = chunk_bytes.decode('utf-8', errors='replace')
            
            # Prepend overlap buffer to current chunk for continuity
            if overlap_buffer:
                chunk = overlap_buffer + new_chunk
                chunk_start_offset = total_chars_read - len(overlap_buffer)
            else:
                chunk = new_chunk
                chunk_start_offset = total_chars_read
            
            # Process ENTIRE chunk (including overlap) to maintain state and detect complete values
            for i, ch in enumerate(chunk):
                uncompressed_offset = chunk_start_offset + i
                
                # Skip whitespace before first character
                if not first_char_processed:
                    if is_whitespace(ch):
                        continue
                    if ch != '{':
                        raise ValueError(f"Expected top-level object ({{), but found: {ch}")
                    depth_object = 1
                    first_char_processed = True
                    continue
                
                # --- JSON STRUCTURE SCANNING ---
                if in_string:
                    if escape:
                        escape = False
                        if reading_key:
                            current_key_buffer += ch
                    elif ch == '\\':
                        escape = True
                        if reading_key:
                            current_key_buffer += ch
                    elif ch == '"':
                        in_string = False
                        if reading_key:
                            # Finished reading a key string
                            # Only set pending_key if it's not empty
                            if current_key_buffer:
                                pending_key = current_key_buffer
                            else:
                                pending_key = None
                            current_key_buffer = ""
                            reading_key = False
                else:
                    # Check for scalar values after colon (non-whitespace characters)
                    if (seen_colon_after_key and 
                        pending_key is not None and
                        pending_key != "" and
                        pending_key not in scalar_values and
                        pending_key not in top_level_keys and
                        depth_object == 1 and
                        depth_array == 0 and
                        not is_whitespace(ch) and
                        value_start_offset is None):
                        # This is the start of a value - determine type
                        value_start_offset = uncompressed_offset
                        
                        if ch == '"':
                            # String scalar - will be extracted when we see closing quote
                            in_string = True
                            current_scalar_buffer = ""
                        elif ch in '-0123456789':
                            # Number scalar - extract until comma, }, or whitespace
                            current_scalar_buffer = ch
                        elif ch == 't':
                            # Could be 'true'
                            current_scalar_buffer = ch
                        elif ch == 'f':
                            # Could be 'false'
                            current_scalar_buffer = ch
                        elif ch == 'n':
                            # Could be 'null'
                            current_scalar_buffer = ch
                        else:
                            # Not a scalar, reset
                            value_start_offset = None
                    
                    # Continue extracting scalar value if we're in the middle of one
                    elif value_start_offset is not None and pending_key is not None:
                        if in_string:
                            # String extraction - handle escape sequences
                            if escape:
                                escape = False
                                current_scalar_buffer += ch
                            elif ch == '\\':
                                escape = True
                                current_scalar_buffer += ch
                            elif ch == '"':
                                # End of string scalar
                                in_string = False
                                scalar_values[pending_key] = current_scalar_buffer
                                LOG.debug(f"Found string scalar '{pending_key}': '{current_scalar_buffer[:50]}...'")
                                pending_key = None
                                seen_colon_after_key = False
                                value_start_offset = None
                                current_scalar_buffer = ""
                            else:
                                current_scalar_buffer += ch
                        elif current_scalar_buffer:
                            # Number, true, false, or null extraction
                            if ch in '-0123456789.eE+':
                                # Continue number
                                current_scalar_buffer += ch
                            elif current_scalar_buffer == 't' and ch == 'r':
                                current_scalar_buffer += ch
                            elif current_scalar_buffer == 'tr' and ch == 'u':
                                current_scalar_buffer += ch
                            elif current_scalar_buffer == 'tru' and ch == 'e':
                                # Complete 'true'
                                scalar_values[pending_key] = 'true'
                                LOG.debug(f"Found boolean scalar '{pending_key}': true")
                                pending_key = None
                                seen_colon_after_key = False
                                value_start_offset = None
                                current_scalar_buffer = ""
                            elif current_scalar_buffer == 'f' and ch == 'a':
                                current_scalar_buffer += ch
                            elif current_scalar_buffer == 'fa' and ch == 'l':
                                current_scalar_buffer += ch
                            elif current_scalar_buffer == 'fal' and ch == 's':
                                current_scalar_buffer += ch
                            elif current_scalar_buffer == 'fals' and ch == 'e':
                                # Complete 'false'
                                scalar_values[pending_key] = 'false'
                                LOG.debug(f"Found boolean scalar '{pending_key}': false")
                                pending_key = None
                                seen_colon_after_key = False
                                value_start_offset = None
                                current_scalar_buffer = ""
                            elif current_scalar_buffer == 'n' and ch == 'u':
                                current_scalar_buffer += ch
                            elif current_scalar_buffer == 'nu' and ch == 'l':
                                current_scalar_buffer += ch
                            elif current_scalar_buffer == 'nul' and ch == 'l':
                                # Complete 'null'
                                scalar_values[pending_key] = 'null'
                                LOG.debug(f"Found null scalar '{pending_key}'")
                                pending_key = None
                                seen_colon_after_key = False
                                value_start_offset = None
                                current_scalar_buffer = ""
                            elif ch in ',}':
                                # End of number (or invalid token)
                                if current_scalar_buffer and current_scalar_buffer[0] in '-0123456789':
                                    # It's a number
                                    scalar_values[pending_key] = current_scalar_buffer
                                    LOG.debug(f"Found number scalar '{pending_key}': {current_scalar_buffer}")
                                    pending_key = None
                                    seen_colon_after_key = False
                                    value_start_offset = None
                                    current_scalar_buffer = ""
                                else:
                                    # Invalid, reset
                                    value_start_offset = None
                                    current_scalar_buffer = ""
                            elif is_whitespace(ch):
                                # Whitespace might end a number
                                if current_scalar_buffer and current_scalar_buffer[0] in '-0123456789':
                                    scalar_values[pending_key] = current_scalar_buffer
                                    LOG.debug(f"Found number scalar '{pending_key}': {current_scalar_buffer}")
                                    pending_key = None
                                    seen_colon_after_key = False
                                    value_start_offset = None
                                    current_scalar_buffer = ""
                            else:
                                # Invalid character, reset
                                value_start_offset = None
                                current_scalar_buffer = ""
                    
                    if ch == '"':
                        if not in_string:
                            in_string = True
                            if depth_object == 1 and depth_array == 0 and not seen_colon_after_key:
                                # Starting to read a key
                                reading_key = True
                                current_key_buffer = ""
                                pending_key = None
                    
                    elif ch == '{':
                        # Check if this is a top-level object value BEFORE incrementing depth
                        if (depth_object == 1 and 
                            depth_array == 0 and
                            seen_colon_after_key and 
                            pending_key is not None and
                            pending_key != "" and  # Ensure key is not empty
                            pending_key not in top_level_keys and
                            pending_key not in scalar_values):
                            # This is a top-level object value
                            top_level_keys[pending_key] = uncompressed_offset
                            LOG.info(f"Found object key '{pending_key}' at offset {uncompressed_offset}")
                            pending_key = None
                            seen_colon_after_key = False
                            value_start_offset = None
                        
                        depth_object += 1
                    
                    elif ch == '}':
                        # Check if we have a pending scalar that ended with }
                        if value_start_offset is not None and pending_key is not None and current_scalar_buffer:
                            if current_scalar_buffer[0] in '-0123456789':
                                scalar_values[pending_key] = current_scalar_buffer
                                LOG.debug(f"Found number scalar '{pending_key}': {current_scalar_buffer}")
                                pending_key = None
                                seen_colon_after_key = False
                                value_start_offset = None
                                current_scalar_buffer = ""
                        
                        depth_object -= 1
                        if depth_object == 0:
                            # End of top-level object
                            break
                        # Reset state when closing nested object
                        value_start_offset = None
                        current_scalar_buffer = ""
                    
                    elif ch == '[':
                        # Check if this is a top-level array value BEFORE incrementing depth
                        if (depth_object == 1 and 
                            depth_array == 0 and
                            seen_colon_after_key and 
                            pending_key is not None and
                            pending_key != "" and  # Ensure key is not empty
                            pending_key not in top_level_keys and
                            pending_key not in scalar_values):
                            # This is a top-level array value
                            top_level_keys[pending_key] = uncompressed_offset
                            LOG.info(f"Found array key '{pending_key}' at offset {uncompressed_offset}")
                            pending_key = None
                            seen_colon_after_key = False
                            value_start_offset = None
                        
                        depth_array += 1
                        value_start_offset = None
                        current_scalar_buffer = ""
                    
                    elif ch == ']':
                        depth_array -= 1
                        value_start_offset = None
                        current_scalar_buffer = ""
                    
                    elif ch == ':':
                        # Colon means the previous string was a key
                        if (pending_key is not None and 
                            pending_key != "" and  # Ensure key is not empty
                            depth_object == 1 and 
                            depth_array == 0):
                            seen_colon_after_key = True
                            value_start_offset = None  # Reset, will be set when we see value start
                            current_scalar_buffer = ""
                            LOG.debug(f"Saw colon after key '{pending_key}' at offset {uncompressed_offset}")
                        else:
                            # If pending_key is empty or None, reset it
                            pending_key = None
                    
                    elif ch == ',':
                        # Check if we have a pending scalar that ended with comma
                        if value_start_offset is not None and pending_key is not None and current_scalar_buffer:
                            if current_scalar_buffer[0] in '-0123456789':
                                scalar_values[pending_key] = current_scalar_buffer
                                LOG.debug(f"Found number scalar '{pending_key}': {current_scalar_buffer}")
                                pending_key = None
                                seen_colon_after_key = False
                                value_start_offset = None
                                current_scalar_buffer = ""
                        
                        # Comma means we're moving to next key-value pair
                        if pending_key is not None and not seen_colon_after_key:
                            # Key wasn't followed by colon, so it was a value
                            pending_key = None
                        seen_colon_after_key = False
                        value_start_offset = None
                        current_scalar_buffer = ""
            
            # Update total characters read (only count new chunk, not overlap)
            total_chars_read += len(new_chunk)
            
            # Save last part of chunk as overlap buffer (for next iteration)
            if len(chunk) > overlap_size:
                overlap_buffer = chunk[-overlap_size:]
            else:
                overlap_buffer = chunk
            
            if chunk_count % 100 == 0:
                LOG.debug(f"Processed {chunk_count} chunks, found {len(scalar_values)} scalars, {len(top_level_keys)} arrays/objects...")
        
            # Create JSON index
            json_index = JsonIndex(top_level_keys=top_level_keys, scalar_values=scalar_values)

            LOG.info(f"JSON indexing complete:")
            LOG.info(f"  - {len(json_index.scalar_values)} scalar value(s)")
            LOG.info(f"  - {len(json_index.top_level_keys)} array/object key(s)")

            if json_index.scalar_values:
                LOG.info(f"  Scalar keys: {list(json_index.scalar_values.keys())}")
            if json_index.top_level_keys:
                LOG.info(f"  Array/Object keys: {list(json_index.top_level_keys.keys())}")

            # Save JSON index
            json_index.save(json_index_file)
            LOG.info(f"JSON index saved to {json_index_file}")

            # Export gzip index if requested
            if gzip_index_file:
                LOG.info("Exporting gzip index...")
                f.export_index(str(gzip_index_file))
                LOG.info(f"Gzip index saved to {gzip_index_file}")

            return json_index
    
    finally:
        f.close()

SyntaxError: expected 'except' or 'finally' block (1391929143.py, line 138)

## Convenience Wrapper

The `build_index` function uses overlapping chunks to ensure complete scalar values are captured even if they span chunk boundaries. It identifies three types of top-level values:

1. **Scalars**: Complete string values extracted and stored in `scalar_values`
2. **Arrays**: Start offset recorded in `top_level_keys` (value starts with `[`)
3. **Objects**: Start offset recorded in `top_level_keys` (value starts with `{` at top level)

In [None]:
# Convenience wrapper function
def build_index_wrapper(
    input_file: Path,
    json_index_file: Path,
    gzip_index_file: Optional[Path] = None,
    spacing: int = 300 * 1024  # 300KB default (minimum recommended for indexed_gzip is 64KB)
) -> JsonIndex:
    """
    Convenience wrapper to build both indexes.
    
    Args:
        input_file: Path to .json.gz file
        json_index_file: Path where to save JSON structural index
        gzip_index_file: Optional path where to save gzip index
        spacing: Spacing for gzip index (default: 300KB, minimum recommended: 64KB)
    
    Returns:
        JsonIndex object
    """
    if not input_file.exists():
        raise FileNotFoundError(f"File not found: {input_file}")
    
    # Ensure spacing is at least 64KB (minimum for indexed_gzip)
    if spacing < 64 * 1024:
        LOG.warning(f"Spacing {spacing} bytes is too small, using minimum 64KB")
        spacing = 64 * 1024
    
    json_index = build_index(input_file, json_index_file, gzip_index_file, spacing=spacing)
    
    print(f"\nIndex summary:")
    print(f"  JSON index: {json_index_file}")
    if gzip_index_file:
        print(f"  Gzip index: {gzip_index_file}")
    print(f"  Scalar values: {len(json_index.scalar_values)}")
    if json_index.scalar_values:
        print(f"  Scalar keys: {list(json_index.scalar_values.keys())}")
    print(f"  Array/Object keys: {len(json_index.top_level_keys)}")
    if json_index.top_level_keys:
        print(f"  Array/Object keys: {list(json_index.top_level_keys.keys())}")
    
    return json_index

## Usage Examples

In [None]:
mrf_file = Path("D://2026-01_890_58B0_in-network-rates_58_of_60.json.gz")
json_index_file = Path("D://2026-01_890_58B0_in-network-rates_58_of_60.json.gz.index")
gzip_index_file = Path("D://2026-01_890_58B0_in-network-rates_58_of_60.json.gz.gzidx")

# Step 1: Build both indexes (read-only, creates index files)
# Note: spacing defaults to 300KB, minimum is 64KB for indexed_gzip
index = build_index_wrapper(mrf_file, json_index_file, gzip_index_file)

## Production Notes

### Limitations of Current Implementation

1. **Gzip State Serialization**: The current implementation uses a simplified approach for gzip state. For production use, you would need to:
   - Use `zlib.decompressobj()` directly instead of `gzip.GzipFile()`
   - Serialize the actual zlib decompressor state (which is complex)
   - Or use a library that supports gzip seeking (like `indexed_gzip`)

2. **Checkpoint Accuracy**: The current checkpoints are approximate. For exact seeking, you'd need to:
   - Store the exact zlib window state
   - Store the bit buffer state
   - Handle gzip headers and footers correctly

3. **Memory Usage**: For very large files, consider:
   - Streaming checkpoint creation
   - Incremental index updates
   - Using memory-mapped files

### Alternative Approaches

For production, consider:
- Using `indexed_gzip` library (https://github.com/pauldmccarthy/indexed_gzip)
- Using `zstandadr` compression with seekable format
- Pre-processing files into chunked formats (e.g., line-delimited JSON per chunk)