# Minimal Gzip + JSON Indexer for Massive MRF Files

This notebook provides a **minimal structural indexer** for very large `.json.gz` files, such as healthcare price-transparency machine-readable files (MRFs).

## What it does
- Streams a `.json.gz` using `indexed_gzip` (supports random access via a gzip index)
- Scans for **top-level JSON keys**
- Records offsets for **top-level array values** (e.g., `"in_network"`, `"provider_references"`) without full parsing
- Captures **top-level scalar values** (e.g., `"reporting_entity_name"`) when present

## Outputs
- `<file>.index.json`: JSON index with offsets + scalars
- `<file>.gzidx`: gzip seek index created/used by `indexed_gzip`
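
For orientation, the JSON index has roughly the shape sketched below. The `offsets` and `scalars` keys are the ones `build_index` writes; the offset numbers and the entity name are made-up placeholders, not values from a real file.

```python
import json

# Hypothetical index content: the numbers and the name are placeholders.
sample_index = {
    "offsets": {"provider_references": 112, "in_network": 48213},
    "scalars": {"reporting_entity_name": "Example Health Plan"},
}

# Serialize the same way the notebook does, then read it back.
text = json.dumps(sample_index, indent=2)
loaded = json.loads(text)

# Offsets are plain integers, so they round-trip through JSON unchanged.
in_network_start = loaded["offsets"]["in_network"]
print(in_network_start)
```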

## Notes
This is **not** a full JSON parser by design. It is intended for the common MRF pattern:
- top-level object
- large arrays at top-level keys


In [1]:
import json
import logging
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, Optional

import indexed_gzip as igzip

# ---------------------------------------------------------------------
# Logging
# ---------------------------------------------------------------------

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(message)s"
)
LOG = logging.getLogger(__name__)

# ---------------------------------------------------------------------
# Data Structures
# ---------------------------------------------------------------------

@dataclass
class JsonIndex:
    """Minimal structural index for a top-level JSON object."""
    top_level_offsets: Dict[str, int] = field(default_factory=dict)
    scalar_values: Dict[str, str] = field(default_factory=dict)

# ---------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------

def is_whitespace(ch: str) -> bool:
    return ch in " \t\n\r"

# ---------------------------------------------------------------------
# Core Indexing Logic
# ---------------------------------------------------------------------

def build_index(
    path_gz: Path,
    json_index_file: Path,
    gzip_index_file: Optional[Path] = None,
    spacing: int = 300 * 1024,
    chunk_size: int = 64 * 1024,
) -> JsonIndex:
    """Build a minimal index of top-level JSON keys from a .json.gz file.

    Strategy:
    - Stream gzip with indexed_gzip
    - Scan once, byte by byte (JSON's structural characters are ASCII, and
      UTF-8 never reuses ASCII bytes inside multi-byte sequences)
    - Track JSON state manually (depth / strings) across chunk boundaries

    Parameters:
    - path_gz: input .json.gz
    - json_index_file: output index JSON path
    - gzip_index_file: optional gzip index path for indexed_gzip; loaded if it
      exists, otherwise written after the scan
    - spacing: gzip index spacing (smaller -> bigger index, faster seeking)
    - chunk_size: read size (bytes)
    """

    LOG.info("Indexing %s", path_gz)
    if gzip_index_file:
        LOG.info("Using gzip index: %s", gzip_index_file)

    index = JsonIndex()
    # Only hand indexed_gzip an index file that is already on disk
    have_gzip_index = gzip_index_file is not None and gzip_index_file.exists()

    with igzip.IndexedGzipFile(
        filename=str(path_gz),
        index_file=str(gzip_index_file) if have_gzip_index else None,
        spacing=spacing,
    ) as fh:

        pos = 0  # byte offset of chunk[0] within the decompressed stream

        depth = 0  # nesting depth, counting both {} and []
        in_string = False
        escape = False
        capture = False  # True while reading a string at depth 1
        token = bytearray()  # raw bytes of the depth-1 string being read
        literal = bytearray()  # bare scalar: number, true, false or null
        current_key = None  # top-level key still waiting for its value

        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break

            for i, b in enumerate(chunk):
                # String mode: structural characters inside strings are ignored
                if in_string:
                    if escape:
                        escape = False
                    elif b == 0x5C:  # backslash escapes the next byte
                        escape = True
                    elif b == 0x22:  # closing quote
                        in_string = False
                        if capture:
                            raw = token.decode("utf-8", errors="replace")
                            text = json.loads(f'"{raw}"', strict=False)
                            if current_key is None:
                                current_key = text  # a top-level key
                            else:
                                # String value of the pending top-level key
                                index.scalar_values[current_key] = text
                                current_key = None
                        continue
                    if capture:
                        token.append(b)
                    continue

                if b == 0x22:  # opening quote
                    in_string = True
                    capture = depth == 1
                    token.clear()
                elif b in b"{[":
                    if depth == 1 and current_key is not None:
                        # For top-level arrays, record the offset at '['
                        if b == 0x5B:
                            index.top_level_offsets[current_key] = pos + i
                        current_key = None
                    depth += 1
                elif b in b"}],":
                    # A bare top-level scalar ends at ',' or '}'
                    if depth == 1 and current_key is not None and literal:
                        index.scalar_values[current_key] = literal.decode("ascii", "replace")
                        current_key = None
                    literal.clear()
                    if b != 0x2C:  # ',' does not change the depth
                        depth -= 1
                elif (
                    depth == 1
                    and current_key is not None
                    and b != 0x3A  # ':' separates key and value
                    and not is_whitespace(chr(b))
                ):
                    literal.append(b)

            pos += len(chunk)

        # Save the seek points gathered during the scan for later random access
        if gzip_index_file is not None and not have_gzip_index:
            fh.export_index(str(gzip_index_file))

    # Persist index
    json_index_file.write_text(
        json.dumps(
            {"offsets": index.top_level_offsets, "scalars": index.scalar_values},
            indent=2,
        )
    )

    LOG.info("Index written to %s", json_index_file)
    return index

# ---------------------------------------------------------------------
# Convenience Wrapper
# ---------------------------------------------------------------------

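def build_index_wrapper(input_file: Path, output_dir: Optional[Path] = None) -> JsonIndex:
    """Index input_file, writing both index files next to it or into output_dir."""
    output_dir = output_dir or input_file.parent
    output_dir.mkdir(parents=True, exist_ok=True)

    # Index file names are derived from the input file name
    json_index_file = output_dir / f"{input_file.name}.index.json"
    gzip_index_file = output_dir / f"{input_file.name}.gzidx"

    return build_index(
        path_gz=input_file,
        json_index_file=json_index_file,
        gzip_index_file=gzip_index_file,
    )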


## Example usage

Update the path below to point to your `.json.gz` MRF file.

In [None]:
# Example:
input_file = Path("D:/2026-01_890_58B0_in-network-rates_58_of_60.json.gz")
idx = build_index_wrapper(input_file)
idx.top_level_offsets, idx.scalar_values

2026-01-26 14:28:30,152 | INFO | Indexing D:\2026-01_890_58B0_in-network-rates_58_of_60.json.gz
2026-01-26 14:28:30,152 | INFO | Using gzip index: D:\2026-01_890_58B0_in-network-rates_58_of_60.json.gz.gzidx


## Interpreting output

- `offsets`: for each top-level array, the offset in the decompressed stream at which its opening `[` was recorded
- `scalars`: top-level scalar values, stored as strings

The recorded offsets make partial extraction possible, e.g., reading the first N items of `in_network` without parsing the rest of the file.
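
As a sketch of offset-based partial extraction, the helper below reads the first few items of an array from any seekable binary stream. It assumes the offset is a byte offset into the decompressed stream that points exactly at the array's `[`, and it raises `ValueError` otherwise. The name `read_first_items` and the in-memory demo are hypothetical; with a real file, the stream would be an opened `igzip.IndexedGzipFile` rather than `io.BytesIO`.

```python
import codecs
import io
import json
from typing import Any, BinaryIO, List


def read_first_items(fh: BinaryIO, array_offset: int, n: int,
                     chunk_size: int = 64 * 1024) -> List[Any]:
    """Parse up to n items of the JSON array whose '[' is at array_offset."""
    fh.seek(array_offset)
    # Incremental decoding copes with multi-byte characters split across reads.
    utf8 = codecs.getincrementaldecoder("utf-8")()
    decoder = json.JSONDecoder()

    buf = utf8.decode(fh.read(chunk_size))
    if not buf.startswith("["):
        raise ValueError("array_offset does not point at '['")
    buf = buf[1:]

    items: List[Any] = []
    while len(items) < n:
        buf = buf.lstrip(" \t\n\r,")
        if buf.startswith("]"):
            break  # reached the end of the array
        try:
            item, end = decoder.raw_decode(buf)
        except json.JSONDecodeError:
            item, end = None, -1  # value is incomplete so far
        # A value that reaches the end of the buffer may be cut short
        # (e.g. 12 read from 123), so fetch more data before trusting it.
        if end == -1 or end == len(buf):
            chunk = fh.read(chunk_size)
            if not chunk:
                raise ValueError("stream ended before the array was closed")
            buf += utf8.decode(chunk)
            continue
        items.append(item)
        buf = buf[end:]
    return items


demo = b'{"name": "x", "in_network": [{"id": 1}, {"id": 2}, {"id": 3}]}'
first_two = read_first_items(io.BytesIO(demo), demo.index(b"["), 2)
print(first_two)
```

Each retry re-parses the current item from its start, so for very large array items a bigger `chunk_size` reduces repeated work.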