# JSONL File Operations Tutorial

This notebook demonstrates the basic operations of the `jsonlfile` module in the `jsonldb` package, which provides efficient JSONL file handling with byte-position indexing.

## Setup
First, let's import the required libraries and set up our environment.

In [1]:
import os
import random
from datetime import datetime


In [2]:
from jsonldb.jsonlfile import (
    save_jsonl,
    load_jsonl,
    select_jsonl,
    update_jsonl,
    delete_jsonl,
    lint_jsonl,
)

## Generate Sample Data

Let's create a helper function to generate random records with a consistent structure. Each record will have:
- A timestamp
- A numeric value
- A temperature reading
- A status indicator
- A list of tags

In [3]:
def generate_random_record():
    """Generate a random record with consistent structure."""
    return {
        "timestamp": datetime.now().isoformat(),
        "value": random.randint(1, 1000),
        "temperature": round(random.uniform(20.0, 30.0), 2),
        "status": random.choice(["active", "inactive", "pending"]),
        "tags": random.sample(["hot", "cold", "medium", "critical", "normal"], k=2)
    }

# Generate sample data (100 records)
print("Generating sample data...")
data = {
    f"record_{i:04d}": generate_random_record()
    for i in range(100)
}

# Display a sample record
sample_key = next(iter(data))
print(f"\nSample record:\n{sample_key}: {data[sample_key]}")

Generating sample data...

Sample record:
record_0000: {'timestamp': '2025-03-27T14:30:14.789256', 'value': 976, 'temperature': 24.64, 'status': 'pending', 'tags': ['hot', 'cold']}


In [4]:
data

{'record_0000': {'timestamp': '2025-03-27T14:30:14.789256',
  'value': 976,
  'temperature': 24.64,
  'status': 'pending',
  'tags': ['hot', 'cold']},
 'record_0001': {'timestamp': '2025-03-27T14:30:14.789256',
  'value': 97,
  'temperature': 22.05,
  'status': 'inactive',
  'tags': ['medium', 'cold']},
 'record_0002': {'timestamp': '2025-03-27T14:30:14.789256',
  'value': 487,
  'temperature': 21.55,
  'status': 'active',
  'tags': ['normal', 'medium']},
 'record_0003': {'timestamp': '2025-03-27T14:30:14.789256',
  'value': 996,
  'temperature': 23.75,
  'status': 'pending',
  'tags': ['hot', 'normal']},
 'record_0004': {'timestamp': '2025-03-27T14:30:14.789256',
  'value': 103,
  'temperature': 27.03,
  'status': 'pending',
  'tags': ['critical', 'medium']},
 'record_0005': {'timestamp': '2025-03-27T14:30:14.789256',
  'value': 525,
  'temperature': 23.09,
  'status': 'active',
  'tags': ['normal', 'critical']},
 'record_0006': {'timestamp': '2025-03-27T14:30:14.789256',
  'value': 4

## Save Data to JSONL File

Now we'll save our data to a JSONL file. The `save_jsonl` function will:
1. Create a JSONL file with our records
2. Automatically create an index file (`test.jsonl.idx`) for fast access
3. Write each record as a single line of valid JSON

In [5]:
print("Saving data to test.jsonl...")
save_jsonl("test.jsonl", data)

# Verify both files were created
print(f"\nJSONL file exists: {os.path.exists('test.jsonl')}")
print(f"Index file exists: {os.path.exists('test.jsonl.idx')}")

Saving data to test.jsonl...

JSONL file exists: True
Index file exists: True
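
To make the idea of byte-position indexing concrete, here is a minimal, self-contained sketch. It does not reproduce `jsonldb`'s actual implementation or its `.idx` file format. The helper names `write_with_index` and `read_one`, and the one-object-per-line `{key: record}` layout, are assumptions made for illustration only.

```python
import json
import os
import tempfile

def write_with_index(path, records):
    """Write one {key: record} object per line; return {key: byte offset of its line}."""
    index = {}
    with open(path, "wb") as f:
        for key in sorted(records):
            index[key] = f.tell()  # byte offset where this line starts
            line = json.dumps({key: records[key]}) + "\n"
            f.write(line.encode("utf-8"))
    return index

def read_one(path, index, key):
    """Seek straight to a record's line instead of scanning the whole file."""
    with open(path, "rb") as f:
        f.seek(index[key])
        return json.loads(f.readline())[key]

# Hypothetical demo data, unrelated to test.jsonl
demo_records = {"b": {"value": 2}, "a": {"value": 1}, "c": {"value": 3}}
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl")
demo_index = write_with_index(demo_path, demo_records)
print(read_one(demo_path, demo_index, "b"))  # {'value': 2}
```

The point of the index is that a lookup costs one `seek` plus one `readline`, regardless of how large the file is.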


## Load and Verify Data

Let's load the entire file back into memory and verify its contents. The `load_jsonl` function uses the index file for efficient loading.

In [6]:
print("Loading entire file...")
loaded_data = load_jsonl("test.jsonl")
print(f"Loaded {len(loaded_data)} records")

print("\nSample record:")
sample_key = next(iter(loaded_data))
print(f"{sample_key}: {loaded_data[sample_key]}")

Loading entire file...
Loaded 100 records

Sample record:
record_0000: {'timestamp': '2025-03-27T14:30:14.789256', 'value': 976, 'temperature': 24.64, 'status': 'pending', 'tags': ['hot', 'cold']}


In [7]:
loaded_data

{'record_0000': {'timestamp': '2025-03-27T14:30:14.789256',
  'value': 976,
  'temperature': 24.64,
  'status': 'pending',
  'tags': ['hot', 'cold']},
 'record_0001': {'timestamp': '2025-03-27T14:30:14.789256',
  'value': 97,
  'temperature': 22.05,
  'status': 'inactive',
  'tags': ['medium', 'cold']},
 'record_0002': {'timestamp': '2025-03-27T14:30:14.789256',
  'value': 487,
  'temperature': 21.55,
  'status': 'active',
  'tags': ['normal', 'medium']},
 'record_0003': {'timestamp': '2025-03-27T14:30:14.789256',
  'value': 996,
  'temperature': 23.75,
  'status': 'pending',
  'tags': ['hot', 'normal']},
 'record_0004': {'timestamp': '2025-03-27T14:30:14.789256',
  'value': 103,
  'temperature': 27.03,
  'status': 'pending',
  'tags': ['critical', 'medium']},
 'record_0005': {'timestamp': '2025-03-27T14:30:14.789256',
  'value': 525,
  'temperature': 23.09,
  'status': 'active',
  'tags': ['normal', 'critical']},
 'record_0006': {'timestamp': '2025-03-27T14:30:14.789256',
  'value': 4

## Select Range of Records

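The `select_jsonl` function efficiently retrieves the records whose keys fall within a given range. Both bounds are inclusive, which is why selecting `record_0010` through `record_0020` returns 11 records. This is particularly useful for time-series data or when working with sorted keys.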

In [8]:
print("Selecting records in range...")
range_data = select_jsonl("test.jsonl", "record_0010", "record_0020")
print(f"Selected {len(range_data)} records in range")

print("\nFirst selected record:")
first_key = min(range_data.keys())
print(f"{first_key}: {range_data[first_key]}")

Selecting records in range...
Selected 11 records in range

First selected record:
record_0010: {'timestamp': '2025-03-27T14:30:14.789256', 'value': 462, 'temperature': 29.01, 'status': 'active', 'tags': ['normal', 'medium']}
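
An inclusive range lookup over sorted keys can be sketched with the standard `bisect` module. This is only an illustration of the technique, not `select_jsonl`'s actual code. The helper name `select_keys` is hypothetical.

```python
from bisect import bisect_left, bisect_right

def select_keys(sorted_keys, lower, upper):
    """Return the keys k with lower <= k <= upper (both bounds inclusive)."""
    start = bisect_left(sorted_keys, lower)   # first position with key >= lower
    stop = bisect_right(sorted_keys, upper)   # first position with key > upper
    return sorted_keys[start:stop]

keys = [f"record_{i:04d}" for i in range(100)]
selected = select_keys(keys, "record_0010", "record_0020")
print(len(selected), selected[0], selected[-1])  # 11 record_0010 record_0020
```

With a key-to-offset index, each selected key can then be read by seeking to its offset, so the cost depends on the size of the range rather than the size of the file.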


## Update Records

The `update_jsonl` function can both update existing records and insert new ones. Let's demonstrate both operations:

In [9]:
print("Updating records...")
updates = {
    "record_0001": {  # Update existing record
        "timestamp": datetime.now().isoformat(),
        "value": 9999,
        "temperature": 25.0,
        "status": "updated",
        "tags": ["modified", "test"]
    },
    "new_record": {  # Insert new record
        "timestamp": datetime.now().isoformat(),
        "value": 8888,
        "temperature": 22.5,
        "status": "new",
        "tags": ["fresh", "test"]
    }
}
update_jsonl("test.jsonl", updates)

# Verify updates
print("\nVerifying updates...")
updated_data = load_jsonl("test.jsonl")
print("Updated record:")
print(f"record_0001: {updated_data['record_0001']}")
print("\nNew record:")
print(f"new_record: {updated_data['new_record']}")

Updating records...

Verifying updates...
Updated record:
record_0001: {'timestamp': '2025-03-27T14:30:21.049357', 'value': 9999, 'temperature': 25.0, 'status': 'updated', 'tags': ['modified', 'test']}

New record:
new_record: {'timestamp': '2025-03-27T14:30:21.049357', 'value': 8888, 'temperature': 22.5, 'status': 'new', 'tags': ['fresh', 'test']}
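
Because a single call can both overwrite and insert, it can help to preview which keys fall into each group before writing. The sketch below uses plain Python only. `split_updates` is a hypothetical helper, not part of `jsonldb`.

```python
def split_updates(existing_keys, updates):
    """Return (keys that would be overwritten, keys that would be newly inserted)."""
    existing = set(existing_keys)
    to_update = sorted(k for k in updates if k in existing)
    to_insert = sorted(k for k in updates if k not in existing)
    return to_update, to_insert

current_keys = [f"record_{i:04d}" for i in range(100)]
pending = {"record_0001": {"value": 9999}, "new_record": {"value": 8888}}
overwritten, inserted = split_updates(current_keys, pending)
print(overwritten, inserted)  # ['record_0001'] ['new_record']
```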


## Delete Records

The `delete_jsonl` function removes records while maintaining file integrity. Deleted records are overwritten with spaces in the file, and their entries are removed from the index. Starting from 101 records (the original 100 plus `new_record`), deleting two leaves 99.

In [10]:
print("Deleting records...")
delete_jsonl("test.jsonl", ["record_0001", "record_0002"])

# Verify deletions
print("\nVerifying deletions...")
final_data = load_jsonl("test.jsonl")
print(f"Records after deletion: {len(final_data)}")
print("Checking deleted records:")
print(f"'record_0001' exists: {'record_0001' in final_data}")
print(f"'record_0002' exists: {'record_0002' in final_data}")

Deleting records...

Verifying deletions...
Records after deletion: 99
Checking deleted records:
'record_0001' exists: False
'record_0002' exists: False
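
Marking a record as deleted by overwriting it with spaces keeps every other record's byte offset valid, so the rest of the index needs no changes. The following is a rough sketch of that idea under assumed conventions (one `{key: record}` object per line, an in-memory dict as the index). The name `blank_out` is hypothetical, and this is not `delete_jsonl`'s actual code.

```python
import json
import os
import tempfile

def blank_out(path, index, key):
    """Overwrite a record's line with spaces and drop its entry from the index."""
    offset = index.pop(key)
    with open(path, "r+b") as f:
        f.seek(offset)
        body_len = len(f.readline().rstrip(b"\n"))
        f.seek(offset)
        f.write(b" " * body_len)  # the trailing newline is left in place

# Build a tiny demo file and its offset index
blank_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl")
blank_index = {}
with open(blank_path, "wb") as f:
    for key in ("a", "b", "c"):
        blank_index[key] = f.tell()
        f.write((json.dumps({key: {"value": key.upper()}}) + "\n").encode("utf-8"))

size_before = os.path.getsize(blank_path)
blank_out(blank_path, blank_index, "b")

with open(blank_path, "rb") as f:
    demo_lines = f.read().split(b"\n")

# File size is unchanged; the middle line now holds only spaces
print(sorted(blank_index), os.path.getsize(blank_path) == size_before, demo_lines[1].strip() == b"")
```

The trade-off is that the file does not shrink on deletion, which is why a separate compaction step is useful.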


## Lint and Clean

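The `lint_jsonl` function sorts the file by key and removes any deleted records, which compacts the file. The check below compares the key order of the dict returned by `load_jsonl` with lexicographic order. In the recorded run it printed `False`, so the loaded key order did not match sorted order even after linting; the notebook does not show why.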

In [11]:
print("Linting the file...")
lint_jsonl("test.jsonl")
print("File has been sorted and cleaned")

# Verify the file is sorted
final_data = load_jsonl("test.jsonl")
is_sorted = list(final_data.keys()) == sorted(final_data.keys())
print(f"\nFile is sorted: {is_sorted}")

Linting the file...
File has been sorted and cleaned

File is sorted: False
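
The sort-and-compact step can be sketched on a list of lines. This is an illustration under the assumption that each live line is a single `{key: record}` object and that deleted records appear as whitespace-only lines. `compact_and_sort` is a hypothetical helper, not `lint_jsonl` itself.

```python
import json

def compact_and_sort(lines):
    """Drop whitespace-only (deleted) lines and order the rest by their single top-level key."""
    live = [ln for ln in lines if ln.strip()]
    return sorted(live, key=lambda ln: next(iter(json.loads(ln))))

raw_lines = [
    '{"record_0003": {"value": 3}}',
    '                             ',  # a deleted record, blanked with spaces
    '{"new_record": {"value": 8888}}',
    '{"record_0000": {"value": 0}}',
]
cleaned = compact_and_sort(raw_lines)
print([next(iter(json.loads(ln))) for ln in cleaned])
# ['new_record', 'record_0000', 'record_0003']
```

Keys are compared as strings, so `new_record` sorts before every `record_NNNN` key.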


## Cleanup

Finally, let's clean up our test files.

In [12]:
print("Cleaning up...")
os.remove("test.jsonl")
os.remove("test.jsonl.idx")
print("Done!")

Cleaning up...
Done!
