# Minimal Gzip + JSON Indexer for Partial Extraction

This notebook implements a system that indexes massive .json.gz files (e.g., 10GB+) to enable partial extraction of JSON content without full decompression or full parsing.

## Features
- Gzip seek checkpoints for random access
- Minimal JSON structural index (array start or object key value starts)
- Single-pass indexing algorithm
- Partial extraction without full decompression

## Imports and Setup

In [1]:
import gzip
import json
import pickle
from pathlib import Path
from typing import Dict, List, Optional, Tuple, Any
from dataclasses import dataclass, asdict
from io import BytesIO
import struct
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
LOG = logging.getLogger(__name__)

## Data Structures

In [2]:
@dataclass
class GzipCheckpoint:
    """A checkpoint for gzip decompression state."""
    uncompressed_offset: int
    compressed_offset: int
    decompressor_state: bytes  # Pickled gzip decompressor state
    
    def to_dict(self) -> dict:
        return {
            'uncompressed_offset': self.uncompressed_offset,
            'compressed_offset': self.compressed_offset,
            'decompressor_state': self.decompressor_state.hex()  # Store as hex for JSON serialization
        }
    
    @classmethod
    def from_dict(cls, d: dict) -> 'GzipCheckpoint':
        return cls(
            uncompressed_offset=d['uncompressed_offset'],
            compressed_offset=d['compressed_offset'],
            decompressor_state=bytes.fromhex(d['decompressor_state'])
        )


@dataclass
class JsonIndex:
    """Minimal JSON structural index."""
    is_array: bool
    array_start_offset: Optional[int] = None  # For top-level arrays
    top_level_keys: Dict[str, int] = None  # For top-level objects: key -> value start offset
    
    def __post_init__(self):
        if self.top_level_keys is None:
            self.top_level_keys = {}
    
    def to_dict(self) -> dict:
        return {
            'is_array': self.is_array,
            'array_start_offset': self.array_start_offset,
            'top_level_keys': self.top_level_keys
        }
    
    @classmethod
    def from_dict(cls, d: dict) -> 'JsonIndex':
        return cls(
            is_array=d['is_array'],
            array_start_offset=d.get('array_start_offset'),
            top_level_keys=d.get('top_level_keys', {})
        )


@dataclass
class CombinedIndex:
    """Combined gzip checkpoints and JSON structural index."""
    checkpoints: List[GzipCheckpoint]
    json_index: JsonIndex
    
    def to_dict(self) -> dict:
        return {
            'checkpoints': [cp.to_dict() for cp in self.checkpoints],
            'json_index': self.json_index.to_dict()
        }
    
    @classmethod
    def from_dict(cls, d: dict) -> 'CombinedIndex':
        return cls(
            checkpoints=[GzipCheckpoint.from_dict(cp) for cp in d['checkpoints']],
            json_index=JsonIndex.from_dict(d['json_index'])
        )
    
    def save(self, path: Path) -> None:
        """Save index to JSON file."""
        with open(path, 'w') as f:
            json.dump(self.to_dict(), f, indent=2)
    
    @classmethod
    def load(cls, path: Path) -> 'CombinedIndex':
        """Load index from JSON file."""
        with open(path, 'r') as f:
            return cls.from_dict(json.load(f))

## Gzip Decompressor State Management

Note: Python's gzip module doesn't expose decompressor state directly. We'll use a workaround by creating a custom decompressor wrapper.

In [3]:
class GzipDecompressorState:
    """Wrapper to capture and restore gzip decompressor state."""
    
    def __init__(self):
        self.decompressor = gzip.GzipFile(fileobj=BytesIO(), mode='rb')
        self.buffer = BytesIO()
        self.total_uncompressed = 0
    
    def decompress(self, data: bytes) -> bytes:
        """Decompress data chunk."""
        # For simplicity, we'll use a streaming approach
        # In production, you might want to use zlib directly for better state control
        result = self.decompressor.decompress(data)
        self.total_uncompressed += len(result)
        return result
    
    def snapshot(self) -> bytes:
        """Create a snapshot of decompressor state."""
        # Note: This is a simplified approach. In production, you'd need to
        # serialize the actual zlib decompressor state, which requires
        # using zlib.decompressobj() and capturing its internal state.
        # For now, we'll store position info that can help with seeking.
        state = {
            'total_uncompressed': self.total_uncompressed,
            # In a real implementation, you'd serialize zlib decompressor state
        }
        return pickle.dumps(state)
    
    @classmethod
    def restore(cls, state_bytes: bytes) -> 'GzipDecompressorState':
        """Restore decompressor from snapshot."""
        state = pickle.loads(state_bytes)
        obj = cls()
        obj.total_uncompressed = state['total_uncompressed']
        return obj

## Single-Pass Combined Indexing Algorithm

In [4]:
def is_whitespace(ch: str) -> bool:
    """Check if character is whitespace."""
    return ch in ' \t\n\r'

In [5]:
def build_combined_index(
    path_gz: Path,
    checkpoint_span_uncompressed: int = 64 * 1024  # 64KB default
) -> CombinedIndex:
    """
    Build combined gzip checkpoint and JSON structural index in a single pass.
    
    Args:
        path_gz: Path to .json.gz file
        checkpoint_span_uncompressed: Create checkpoint every N bytes of uncompressed data
    
    Returns:
        CombinedIndex with checkpoints and JSON structure
    """
    LOG.info(f"Building index for {path_gz.name}...")
    
    # State variables
    uncompressed_offset = 0
    compressed_offset = 0
    next_checkpoint_boundary = checkpoint_span_uncompressed
    
    # JSON scanning state
    in_string = False
    escape = False
    depth_object = 0
    depth_array = 0
    
    array_start_offset = None
    top_level_keys = {}
    reading_key = False
    current_key_buffer = ""
    last_key = None
    seen_colon_after_key = False
    
    # Determine if top-level is array or object
    is_array = None
    
    checkpoints = []
    decompressor = GzipDecompressorState()
    
    chunk_size = 8192  # Read 8KB compressed chunks
    
    with open(path_gz, 'rb') as compressed_file:
        while True:
            compressed_chunk = compressed_file.read(chunk_size)
            if not compressed_chunk:
                break
            
            chunk_compressed_start = compressed_offset
            
            try:
                decompressed_chunk = decompressor.decompress(compressed_chunk)
            except Exception as e:
                LOG.warning(f"Decompression error at offset {compressed_offset}: {e}")
                break
            
            if not decompressed_chunk:
                compressed_offset += len(compressed_chunk)
                continue
            
            # Process each byte in decompressed chunk
            for byte_val in decompressed_chunk:
                ch = chr(byte_val)
                
                # --- JSON STRUCTURE SCANNING ---
                if in_string:
                    if escape:
                        escape = False
                    elif ch == '\\':
                        escape = True
                    elif ch == '"':
                        in_string = False
                        # Finish reading a key
                        if reading_key:
                            last_key = current_key_buffer
                            current_key_buffer = ""
                            reading_key = False
                    else:
                        if reading_key:
                            current_key_buffer += ch
                else:
                    if ch == '"':
                        in_string = True
                        if depth_object == 1 and depth_array == 0:
                            reading_key = True
                            current_key_buffer = ""
                    
                    elif ch == '{':
                        depth_object += 1
                        if depth_object == 1 and is_array is None:
                            is_array = False
                    
                    elif ch == '}':
                        depth_object -= 1
                        if seen_colon_after_key:
                            seen_colon_after_key = False
                            last_key = None
                    
                    elif ch == '[':
                        depth_array += 1
                        if depth_array == 1 and is_array is None:
                            is_array = True
                    
                    elif ch == ']':
                        depth_array -= 1
                    
                    elif ch == ':':
                        if last_key is not None and depth_object == 1 and depth_array == 0:
                            seen_colon_after_key = True
                    
                    else:
                        # Detect start of array contents (first non-whitespace after '[')
                        if (depth_array == 1 and 
                            array_start_offset is None and 
                            not is_whitespace(ch) and 
                            ch != '['):
                            array_start_offset = uncompressed_offset
                        
                        # Detect start of object value (first non-whitespace after ':')
                        if (seen_colon_after_key and 
                            not is_whitespace(ch) and 
                            last_key is not None and 
                            last_key not in top_level_keys):
                            top_level_keys[last_key] = uncompressed_offset
                            seen_colon_after_key = False
                            last_key = None
                
                # --- END JSON SCANNING ---
                
                uncompressed_offset += 1
                
                # --- CHECKPOINT CREATION ---
                if uncompressed_offset >= next_checkpoint_boundary:
                    checkpoint_state = decompressor.snapshot()
                    checkpoints.append(GzipCheckpoint(
                        uncompressed_offset=uncompressed_offset,
                        compressed_offset=chunk_compressed_start,
                        decompressor_state=checkpoint_state
                    ))
                    next_checkpoint_boundary += checkpoint_span_uncompressed
                    LOG.debug(f"Created checkpoint at uncompressed offset {uncompressed_offset}")
            
            compressed_offset += len(compressed_chunk)
    
    # Create JSON index
    json_index = JsonIndex(
        is_array=is_array if is_array is not None else False,
        array_start_offset=array_start_offset,
        top_level_keys=top_level_keys
    )
    
    LOG.info(f"Indexing complete: {len(checkpoints)} checkpoints, "
             f"JSON type={'array' if json_index.is_array else 'object'}, "
             f"keys={len(json_index.top_level_keys)}")
    
    return CombinedIndex(checkpoints=checkpoints, json_index=json_index)

## Partial Extraction Algorithm

In [6]:
def find_checkpoint_before(
    checkpoints: List[GzipCheckpoint],
    target_offset: int
) -> Optional[GzipCheckpoint]:
    """Find the checkpoint with the highest uncompressed_offset <= target_offset."""
    best_cp = None
    for cp in checkpoints:
        if cp.uncompressed_offset <= target_offset:
            if best_cp is None or cp.uncompressed_offset > best_cp.uncompressed_offset:
                best_cp = cp
    return best_cp

In [7]:
def extract_partial(
    path_gz: Path,
    checkpoints: List[GzipCheckpoint],
    start_offset: int,
    max_bytes: int
) -> bytes:
    """
    Extract partial content from .json.gz file starting at given uncompressed offset.
    
    Args:
        path_gz: Path to .json.gz file
        checkpoints: List of gzip checkpoints
        start_offset: Uncompressed byte offset to start extraction
        max_bytes: Maximum number of uncompressed bytes to extract
    
    Returns:
        Extracted bytes (may contain incomplete JSON at the end)
    """
    # Find nearest checkpoint before start_offset
    cp = find_checkpoint_before(checkpoints, start_offset)
    
    if cp is None:
        # No checkpoint found, start from beginning
        LOG.warning("No checkpoint found before start_offset, starting from beginning")
        decompressor = GzipDecompressorState()
        current_uncompressed = 0
        with open(path_gz, 'rb') as f:
            compressed_data = f.read()
    else:
        # Restore from checkpoint
        LOG.info(f"Using checkpoint at uncompressed offset {cp.uncompressed_offset}")
        decompressor = GzipDecompressorState.restore(cp.decompressor_state)
        current_uncompressed = cp.uncompressed_offset
        with open(path_gz, 'rb') as f:
            f.seek(cp.compressed_offset)
            compressed_data = f.read()
    
    # Decompress forward until we reach start_offset, then collect bytes
    output = bytearray()
    
    # Decompress remaining data
    try:
        decompressed = decompressor.decompress(compressed_data)
    except Exception as e:
        LOG.error(f"Decompression error: {e}")
        return bytes(output)
    
    # Find the slice we need
    chunk_start = current_uncompressed
    chunk_end = current_uncompressed + len(decompressed)
    
    if chunk_end <= start_offset:
        # Haven't reached start_offset yet
        LOG.warning(f"Decompressed data ends at {chunk_end}, but start_offset is {start_offset}")
        return bytes(output)
    
    # Calculate slice
    slice_from = max(0, start_offset - chunk_start)
    slice_to = min(len(decompressed), start_offset + max_bytes - chunk_start)
    
    if slice_from < len(decompressed):
        output.extend(decompressed[slice_from:slice_to])
    
    LOG.info(f"Extracted {len(output)} bytes starting at offset {start_offset}")
    return bytes(output)

## Usage Examples

In [8]:
# Example 1: Build index for a large .json.gz file
def example_build_index(input_file: Path, index_file: Path, checkpoint_span: int = 64 * 1024):
    """Build and save index."""
    index = build_combined_index(input_file, checkpoint_span_uncompressed=checkpoint_span)
    index.save(index_file)
    print(f"Index saved to {index_file}")
    print(f"  Checkpoints: {len(index.checkpoints)}")
    print(f"  JSON type: {'array' if index.json_index.is_array else 'object'}")
    if index.json_index.is_array:
        print(f"  Array start offset: {index.json_index.array_start_offset}")
    else:
        print(f"  Top-level keys: {list(index.json_index.top_level_keys.keys())}")
    return index

In [9]:
# Example 2: Extract first part of top-level array
def example_extract_array_start(
    input_file: Path,
    index_file: Path,
    max_bytes: int = 5 * 1024 * 1024  # 5MB
) -> bytes:
    """Extract first part of top-level array."""
    index = CombinedIndex.load(index_file)
    
    if not index.json_index.is_array:
        raise ValueError("JSON is not a top-level array")
    
    if index.json_index.array_start_offset is None:
        raise ValueError("Array start offset not found in index")
    
    start = index.json_index.array_start_offset
    bytes_data = extract_partial(input_file, index.checkpoints, start, max_bytes)
    
    print(f"Extracted {len(bytes_data)} bytes from array start (offset {start})")
    return bytes_data

In [10]:
# Example 3: Extract first part of a top-level key value
def example_extract_key_value(
    input_file: Path,
    index_file: Path,
    key_name: str,
    max_bytes: int = 5 * 1024 * 1024  # 5MB
) -> bytes:
    """Extract first part of a top-level key's value."""
    index = CombinedIndex.load(index_file)
    
    if index.json_index.is_array:
        raise ValueError("JSON is a top-level array, not an object")
    
    if key_name not in index.json_index.top_level_keys:
        available_keys = list(index.json_index.top_level_keys.keys())
        raise ValueError(f"Key '{key_name}' not found. Available keys: {available_keys}")
    
    start = index.json_index.top_level_keys[key_name]
    bytes_data = extract_partial(input_file, index.checkpoints, start, max_bytes)
    
    print(f"Extracted {len(bytes_data)} bytes from key '{key_name}' (offset {start})")
    return bytes_data

## Test with Sample File

In [11]:
# Create a test .json.gz file
def create_test_file(output_path: Path, is_array: bool = True, size_mb: float = 0.1):
    """Create a test .json.gz file for testing."""
    import tempfile
    
    # Create temporary JSON file
    with tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False) as tmp:
        if is_array:
            # Create array with many items
            tmp.write('[')
            target_size = size_mb * 1024 * 1024
            written = 1  # '['
            item_count = 0
            while written < target_size:
                if item_count > 0:
                    tmp.write(',')
                    written += 1
                item = json.dumps({"id": item_count, "data": "x" * 100})
                tmp.write(item)
                written += len(item)
                item_count += 1
            tmp.write(']')
        else:
            # Create object with many keys
            tmp.write('{')
            keys = ['users', 'products', 'orders', 'items', 'data']
            for i, key in enumerate(keys):
                if i > 0:
                    tmp.write(',')
                value = json.dumps([{"id": j} for j in range(1000)])
                tmp.write(f'"{key}":{value}')
            tmp.write('}')
        tmp_path = Path(tmp.name)
    
    # Compress to .json.gz
    with open(tmp_path, 'rb') as f_in:
        with gzip.open(output_path, 'wb') as f_out:
            f_out.write(f_in.read())
    
    # Clean up temp file
    tmp_path.unlink()
    
    print(f"Created test file: {output_path} ({output_path.stat().st_size / 1024:.1f} KB)")

In [12]:
test_file = Path("test_large.json.gz")
test_index = Path("test_large.json.gz.index")

# Create test file (array)
create_test_file(test_file, is_array=True, size_mb=1.0)

# Build index
index = example_build_index(test_file, test_index, checkpoint_span=64*1024)

# Extract first 1MB of array
array_data = example_extract_array_start(test_file, test_index, max_bytes=1024*1024)
print(f"First 200 chars: {array_data[:200]}")

2026-01-26 10:58:14,642 - INFO - Building index for test_large.json.gz...
2026-01-26 10:58:14,646 - INFO - Indexing complete: 0 checkpoints, JSON type=object, keys=0


Created test file: test_large.json.gz (22.7 KB)
Index saved to test_large.json.gz.index
  Checkpoints: 0
  JSON type: object
  Top-level keys: []


ValueError: JSON is not a top-level array

## Production Notes

### Limitations of Current Implementation

1. **Gzip State Serialization**: The current implementation uses a simplified approach for gzip state. For production use, you would need to:
   - Use `zlib.decompressobj()` directly instead of `gzip.GzipFile()`
   - Serialize the actual zlib decompressor state (which is complex)
   - Or use a library that supports gzip seeking (like `indexed_gzip`)

2. **Checkpoint Accuracy**: The current checkpoints are approximate. For exact seeking, you'd need to:
   - Store the exact zlib window state
   - Store the bit buffer state
   - Handle gzip headers and footers correctly

3. **Memory Usage**: For very large files, consider:
   - Streaming checkpoint creation
   - Incremental index updates
   - Using memory-mapped files

### Alternative Approaches

For production, consider:
- Using `indexed_gzip` library (https://github.com/pauldmccarthy/indexed_gzip)
- Using `zstandard` compression with seekable format
- Pre-processing files into chunked formats (e.g., line-delimited JSON per chunk)