# Minimal Gzip + JSON Indexer for Massive MRF Files

This notebook provides a **minimal structural indexer** for very large `.json.gz` files (e.g., MRF).

## What it does
- Streams a `.json.gz` using `indexed_gzip` (supports random access via a gzip index)
- Scans for **top-level JSON keys**
- Records offsets for **top-level array values** (e.g., `"in_network"`, `"provider_references"`) without full parsing
- Captures **top-level scalar values** (e.g., `"reporting_entity_name"`) when present

## Outputs
- `<file>.index.json`: JSON index with offsets + scalars
- `<file>.gzidx`: gzip seek index created/used by `indexed_gzip`

## Notes
This is **not** a full JSON parser by design. It is intended for the common MRF pattern:
- top-level object
- large arrays at top-level keys


In [23]:
import json
import logging
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, Optional

import indexed_gzip as igzip

# ---------------------------------------------------------------------
# Logging
# ---------------------------------------------------------------------

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(message)s"
)
LOG = logging.getLogger(__name__)

# ---------------------------------------------------------------------
# Data Structures
# ---------------------------------------------------------------------

@dataclass
class JsonIndex:
    """Minimal structural index for a top-level JSON object."""
    top_level_offsets: Dict[str, int] = field(default_factory=dict)
    scalar_values: Dict[str, str] = field(default_factory=dict)

# ---------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------

def is_whitespace(ch: str) -> bool:
    return ch in " \t\n\r"

# ---------------------------------------------------------------------
# Core Indexing Logic
# ---------------------------------------------------------------------

def build_index(
    path_gz: Path,
    json_index_file: Path,
    gzip_index_file: Optional[Path] = None,
    spacing: int = 300 * 1024,
    chunk_size: int = 10000 * 1024,
    overlap_size: int = 10*1024,
    progress_every_mb: int = 5,
) -> JsonIndex:
    """
    Build a minimal index of top-level JSON keys from a .json.gz file.

    Progress reporting is based on COMPRESSED bytes read from the .gz file,
    which is a reliable proxy for elapsed work.

    Parameters
    ----------
    path_gz : Path
        Input .json.gz file
    json_index_file : Path
        Output index JSON path
    gzip_index_file : Optional[Path]
        Optional gzip index path for indexed_gzip
    spacing : int
        Gzip index spacing (bytes)
    chunk_size : int
        Read size (bytes, decompressed)
    overlap_size : int
        Text overlap retained between chunks
    progress_every_mb : int
        Emit progress log every N MB of compressed input processed
    """

    gz_size = path_gz.stat().st_size
    report_interval = progress_every_mb
    next_report = report_interval

    LOG.info(
        "Indexing %s (compressed size %.2f GB)",
        path_gz,
        gz_size / (1024 ** 3),
    )

    index = JsonIndex()

    with igzip.IndexedGzipFile(
        filename=str(path_gz),
        index_file=str(gzip_index_file) if gzip_index_file else None,
        spacing=spacing,
    ) as fh:

        buf = ""
        pos = 0  # approx decompressed character position

        depth = 0
        in_string = False
        escape = False
        current_key = None

        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break

            # ---------------------------------------------------------
            # Progress reporting (compressed bytes)
            # ---------------------------------------------------------
            try:
                comp_pos = fh.fileobj.tell()
                if comp_pos >= next_report:
                    pct = (comp_pos / gz_size) * 100 if gz_size else 0.0
                    LOG.info(
                        "Progress: %.1f%% (%.2f / %.2f GB compressed)",
                        pct,
                        comp_pos / (1024 ** 3),
                        gz_size / (1024 ** 3),
                    )
                    next_report += report_interval
            except Exception:
                # Do not fail indexing if progress reporting is unavailable
                pass

            text = chunk.decode("utf-8", errors="ignore")
            buf += text

            i = 0
            while i < len(buf):
                ch = buf[i]

                # -----------------------------
                # String handling
                # -----------------------------
                if in_string:
                    if escape:
                        escape = False
                    elif ch == "\\":
                        escape = True
                    elif ch == '"':
                        in_string = False
                    i += 1
                    continue

                if ch == '"':
                    start = i + 1
                    end = buf.find('"', start)
                    if end == -1:
                        break  # wait for next chunk

                    key = buf[start:end]
                    i = end + 1

                    # Skip whitespace
                    while i < len(buf) and is_whitespace(buf[i]):
                        i += 1

                    # Top-level key
                    if i < len(buf) and buf[i] == ":" and depth == 1:
                        current_key = key
                    continue

                # -----------------------------
                # Structural characters
                # -----------------------------
                if ch == "{":
                    depth += 1
                elif ch == "}":
                    depth -= 1
                elif ch == "[":
                    # Top-level array start
                    if depth == 1 and current_key:
                        index.top_level_offsets[current_key] = pos + i
                        current_key = None
                elif ch not in ",:" and depth == 1 and current_key:
                    # Scalar value
                    start = i
                    while i < len(buf) and buf[i] not in ",}":
                        i += 1
                    value = buf[start:i].strip()
                    index.scalar_values[current_key] = value
                    current_key = None
                    continue

                i += 1

            # Retain overlap for boundary safety
            pos += len(text)
            buf = buf[-overlap_size:]

    # ---------------------------------------------------------
    # Persist index
    # ---------------------------------------------------------
    json_index_file.write_text(
        json.dumps(
            {
                "offsets": index.top_level_offsets,
                "scalars": index.scalar_values,
            },
            indent=2,
        )
    )

    LOG.info("Index written to %s", json_index_file)
    return index

# ---------------------------------------------------------------------
# Convenience Wrapper
# ---------------------------------------------------------------------

def build_index_wrapper(input_file: Path, output_dir: Optional[Path] = None) -> JsonIndex:
    output_dir = output_dir or input_file.parent

    json_index_file = output_dir / f"{input_file.name}.index.json"
    gzip_index_file = output_dir / f"{input_file.name}.gzidx"

    # Only pass an index_file if it already exists (avoids FileNotFoundError)
    gzip_index_arg = gzip_index_file if gzip_index_file.exists() else None

    return build_index(
        path_gz=input_file,
        json_index_file=json_index_file,
        gzip_index_file=gzip_index_arg,
    )
    index_file=str(gzip_index_file) if gzip_index_file else None

## Example usage

Update the path below to point to your `.json.gz` MRF file.

In [24]:
# Example:
input_file = Path("D://_ingested_2026-01_720_27B0_in-network-rates_01_of_57.json.gz")
idx = build_index_wrapper(input_file)
idx.top_level_offsets, idx.scalar_values

2026-01-26 14:55:04,438 | INFO | Indexing D:\_ingested_2026-01_720_27B0_in-network-rates_01_of_57.json.gz (compressed size 0.02 GB)
2026-01-26 14:56:00,507 | INFO | Index written to D:\_ingested_2026-01_720_27B0_in-network-rates_01_of_57.json.gz.index.json


({'negotiated_rates': 9107278},
 {'reporting_entity_name': '"Blue Cross and Blue Shield of Minnesota"',
  'reporting_entity_type': '"Health insurance Issuer"',
  'last_updated_on': '"2025-11-23"',
  'version': '"1.3.1"',
  'provider_references': '[{"provider_group_id":720.0000237894',
  'location': '"https://mrfdata.hmhs.com/files/720/mn/inbound/local/providergrp/bcbsa/720_pdo_prov_mrf_prvgrp_11_0000265241.json"',
  'provider_group_id': '720.0000265241'})

## Interpreting output

- `offsets`: position (in the decompressed text stream) where the top-level array begins (`[`)
- `scalars`: stringified scalar values captured at the top-level

If you want offset-based partial extraction helpers (e.g., read the first N items of `in_network`), tell me and I will add them as additional cells.