# Time Series Data Example

This notebook demonstrates how to use the `jsonldf` module of the `jsonldb` package with time series data.
We'll create simulated sensor data, save it in JSONL format, and then select a time range,
update readings, and apply a data retention policy.

In [1]:
import pandas as pd
import sys
import os
from datetime import datetime, timedelta
import numpy as np
import math
import random

# Add parent directory to path to import jsonldb
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "../../")))

from jsonldb.jsonldf import (
    save_jsonldf, load_jsonldf, update_jsonldf,
    select_jsonldf, delete_jsonldf, lint_jsonldf
)

## Generate Sensor Data

Let's create a function to generate simulated sensor data with periodic variations.

In [2]:
def generate_sensor_data(start_time, num_points, sensor_id):
    """Generate simulated sensor data with periodic variations.
    
    Args:
        start_time: Starting datetime
        num_points: Number of data points to generate
        sensor_id: Identifier for the sensor
    """
    # Generate timestamps
    timestamps = [start_time + timedelta(minutes=5*i) for i in range(num_points)]
    
    # Generate data with periodic variations
    data = []
    for t in timestamps:
        # Add daily variation
        hour_factor = math.sin(2 * math.pi * t.hour / 24)
        
        # Base values with some randomness
        temperature = 25 + 5 * hour_factor + random.uniform(-1, 1)
        humidity = 60 + 10 * hour_factor + random.uniform(-2, 2)
        
        data.append({
            'timestamp': t,
            'sensor_id': sensor_id,
            'temperature': temperature,
            'humidity': humidity,
            'status': 'normal' if temperature < 30 else 'warning'
        })
    
    return pd.DataFrame(data)
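
To see what the daily term contributes on its own, the sketch below evaluates the same sine expression at a few hours, without the random noise. The helper name `hour_factor_at` is made up for this illustration and is not part of the notebook or of `jsonldb`.

```python
import math

def hour_factor_at(hour):
    """Return the daily variation term used above for a given hour (0-23)."""
    return math.sin(2 * math.pi * hour / 24)

# The factor is 0 at midnight, peaks at 06:00, and bottoms out at 18:00.
for hour in (0, 6, 12, 18):
    factor = hour_factor_at(hour)
    print(f"{hour:02d}:00  factor={factor:+.2f}  "
          f"base temperature={25 + 5 * factor:.1f}  "
          f"base humidity={60 + 10 * factor:.1f}")
```

Because `t.hour` is an integer, the factor changes in hourly steps rather than continuously. With the noise limited to ±1, generated temperatures stay between 19°C and 31°C.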

## Create and Save Initial Data

Generate data for multiple sensors over a 24-hour period.

In [3]:
# Generate data for multiple sensors
start_time = datetime(2024, 1, 1, 0, 0)
num_points = 24 * 12  # 5-minute intervals for 24 hours

dfs = []
for sensor_id in ['sensor_00', 'sensor_01', 'sensor_02']:
    df = generate_sensor_data(start_time, num_points, sensor_id)
    dfs.append(df)

# Combine all sensor data
df = pd.concat(dfs)
df.set_index('timestamp', inplace=True)

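# Keep only the first row per timestamp. All three sensors share the same
# timestamps, so this leaves only the sensor_00 rows (288 of 864).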
df = df[~df.index.duplicated(keep='first')]

print("Generated data shape:", df.shape)
print("\nSample of data:")
print(df.head())

Generated data shape: (288, 4)

Sample of data:
                     sensor_id  temperature   humidity  status
timestamp                                                     
2024-01-01 00:00:00  sensor_00    25.211679  61.506890  normal
2024-01-01 00:05:00  sensor_00    25.321086  59.022296  normal
2024-01-01 00:10:00  sensor_00    24.576836  61.098418  normal
2024-01-01 00:15:00  sensor_00    24.706252  60.416932  normal
2024-01-01 00:20:00  sensor_00    25.836592  60.709151  normal
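
The shape above is (288, 4) rather than (864, 4) because the three sensors share the same 288 timestamps, so dropping duplicated index values leaves only the `sensor_00` rows. One possible way to keep every sensor while still having a unique index is to fold the sensor ID into the key. The sketch below uses plain pandas on a small deterministic frame; the `key` format is an assumption made for illustration, not something `jsonldb` prescribes.

```python
import pandas as pd

# Two sensors reporting at the same three timestamps (deterministic toy data).
timestamps = pd.date_range("2024-01-01", periods=3, freq="5min")
toy = pd.concat(
    [
        pd.DataFrame({"timestamp": timestamps, "sensor_id": sensor, "temperature": 25.0})
        for sensor in ["sensor_00", "sensor_01"]
    ],
    ignore_index=True,
)

# Timestamp-only index: each timestamp appears once per sensor, so it is not unique.
by_time = toy.set_index("timestamp")
print(by_time.index.is_unique)  # False

# Composite key (hypothetical format): "<ISO timestamp>|<sensor_id>".
toy["key"] = toy["timestamp"].dt.strftime("%Y-%m-%dT%H:%M:%S") + "|" + toy["sensor_id"]
by_key = toy.set_index("key")
print(by_key.index.is_unique)   # True
print(len(by_key))              # 6
```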


In [4]:
# Check if the DataFrame index is unique
if not df.index.is_unique:
    print("Duplicate indices found:")
    print(df[df.index.duplicated(keep=False)])
else:
    print("DataFrame index is unique.")


DataFrame index is unique.


## Save Data to JSONL

Save the generated data to a JSONL file.

In [5]:
print("Saving data to JSONL...")
save_jsonldf('sensor_data.jsonl', df)

Saving data to JSONL...
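
For readers new to the format, JSONL stores one JSON object per line. The sketch below shows the general idea using pandas' own line-delimited JSON support. It does not reproduce the exact layout that `save_jsonldf` writes, which is not shown in this notebook.

```python
import io
import json
import pandas as pd

frame = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=3, freq="5min"),
    "sensor_id": "sensor_00",
    "temperature": [25.2, 25.3, 24.6],
})

# orient="records" with lines=True writes one JSON object per line.
text = frame.to_json(orient="records", lines=True, date_format="iso")
lines = text.strip().splitlines()
print(len(lines))

# Every line parses on its own, which is what makes the format easy to append to.
first = json.loads(lines[0])
print(sorted(first))

# Reading the text back yields the same number of rows and columns.
restored = pd.read_json(io.StringIO(text), lines=True)
print(restored.shape)
```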


## Select Recent Data

Select the last 6 hours of data for sensor_00.

In [6]:
# Select the last 6 hours of data for sensor_00.
# Note: start_time is rebound here to the start of the selection window.
end_time = start_time + timedelta(hours=24)
start_time = end_time - timedelta(hours=6)

print(f"Selecting data between {start_time} and {end_time} for sensor_00")
recent_data = select_jsonldf('sensor_data.jsonl', (start_time.isoformat(), end_time.isoformat()))
recent_data = recent_data[recent_data['sensor_id'] == 'sensor_00']

print("\nRecent data:")
print(recent_data)

Selecting data between 2024-01-01 18:00:00 and 2024-01-02 00:00:00 for sensor_00

Recent data:
                     sensor_id  temperature   humidity  status
2024-01-01 00:00:00  sensor_00    25.211679  61.506890  normal
2024-01-01 00:05:00  sensor_00    25.321086  59.022296  normal
2024-01-01 00:10:00  sensor_00    24.576836  61.098418  normal
2024-01-01 00:15:00  sensor_00    24.706252  60.416932  normal
2024-01-01 00:20:00  sensor_00    25.836592  60.709151  normal
...                        ...          ...        ...     ...
2024-01-01 23:35:00  sensor_00    23.605049  56.265599  normal
2024-01-01 23:40:00  sensor_00    23.378021  56.151184  normal
2024-01-01 23:45:00  sensor_00    23.190718  56.643788  normal
2024-01-01 23:50:00  sensor_00    24.535305  59.226843  normal
2024-01-01 23:55:00  sensor_00    24.421930  57.750804  normal

[288 rows x 4 columns]
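
For comparison, the same kind of window can be expressed in plain pandas on an in-memory frame. This sketch is not how `select_jsonldf` works internally. It only shows the intended result of a six-hour window followed by a sensor filter, on deterministic toy data; the names `window_start` and `window_end` are chosen for this example.

```python
from datetime import datetime, timedelta
import pandas as pd

day_start = datetime(2024, 1, 1)
index = pd.date_range(day_start, periods=24 * 12, freq="5min", name="timestamp")
frame = pd.DataFrame({"sensor_id": "sensor_00", "temperature": 25.0}, index=index)

window_end = day_start + timedelta(hours=24)
window_start = window_end - timedelta(hours=6)

# Label slicing on a sorted DatetimeIndex is inclusive at both ends.
window = frame.loc[window_start:window_end]
window = window[window["sensor_id"] == "sensor_00"]

print(len(window))         # 6 hours * 12 readings per hour = 72
print(window.index.min())  # 2024-01-01 18:00:00
print(window.index.max())  # 2024-01-01 23:55:00
```

A six-hour window at 5-minute intervals holds 72 readings, which is a useful sanity check on the result of any range query.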


## Calculate Statistics

Calculate basic statistics for the selected data.

In [7]:
print("\nStatistics for recent data:")
print("Temperature:")
print(f"  Average: {recent_data['temperature'].mean():.2f}°C")
print(f"  Min: {recent_data['temperature'].min():.2f}°C")
print(f"  Max: {recent_data['temperature'].max():.2f}°C")

print("\nHumidity:")
print(f"  Average: {recent_data['humidity'].mean():.2f}%")
print(f"  Min: {recent_data['humidity'].min():.2f}%")
print(f"  Max: {recent_data['humidity'].max():.2f}%")


Statistics for recent data:
Temperature:
  Average: 25.03°C
  Min: 19.01°C
  Max: 30.97°C

Humidity:
  Average: 60.16%
  Min: 48.19%
  Max: 71.62%
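
The six separate calls above can also be collapsed into a single aggregation. A minimal sketch on toy values, assuming only that the frame has `temperature` and `humidity` columns as in this notebook:

```python
import pandas as pd

readings = pd.DataFrame({
    "temperature": [24.0, 25.5, 27.0],
    "humidity": [58.0, 60.0, 65.0],
})

# One row per statistic, one column per measurement.
summary = readings[["temperature", "humidity"]].agg(["mean", "min", "max"])
print(summary.round(2))
```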


## Apply Calibration

Apply a temperature calibration of +0.5°C to sensor_00 readings.

In [8]:
print("\nApplying temperature calibration...")

# Create updates DataFrame with calibrated temperatures
updates = recent_data.copy()
updates['temperature'] += 0.5

update_jsonldf('sensor_data.jsonl', updates)

# Verify calibration
print("\nVerifying calibration:")
calibrated_data = select_jsonldf('sensor_data.jsonl', (start_time.isoformat(), end_time.isoformat()))
calibrated_data = calibrated_data[calibrated_data['sensor_id'] == 'sensor_00']
print(calibrated_data)


Applying temperature calibration...

Verifying calibration:
                     sensor_id  temperature   humidity  status
2024-01-01 00:00:00  sensor_00    25.711679  61.506890  normal
2024-01-01 00:05:00  sensor_00    25.821086  59.022296  normal
2024-01-01 00:10:00  sensor_00    25.076836  61.098418  normal
2024-01-01 00:15:00  sensor_00    25.206252  60.416932  normal
2024-01-01 00:20:00  sensor_00    26.336592  60.709151  normal
...                        ...          ...        ...     ...
2024-01-01 23:35:00  sensor_00    24.105049  56.265599  normal
2024-01-01 23:40:00  sensor_00    23.878021  56.151184  normal
2024-01-01 23:45:00  sensor_00    23.690718  56.643788  normal
2024-01-01 23:50:00  sensor_00    25.035305  59.226843  normal
2024-01-01 23:55:00  sensor_00    24.921930  57.750804  normal

[288 rows x 4 columns]
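
The calibration changes `temperature` but leaves `status` untouched, even though `status` was derived from the 30°C threshold in `generate_sensor_data`. If the status should follow the calibrated value, it could be recomputed before calling `update_jsonldf`. The sketch below shows this on toy values; whether recomputing is desired is an assumption, not something the original notebook does.

```python
import pandas as pd

updates = pd.DataFrame(
    {"temperature": [25.2, 29.7, 30.4], "status": ["normal", "normal", "warning"]},
    index=pd.to_datetime(["2024-01-01 00:00", "2024-01-01 06:00", "2024-01-01 06:05"]),
)

# Apply the same +0.5 offset as above.
updates["temperature"] += 0.5

# Re-derive status with the rule from generate_sensor_data: below 30 is normal.
updates["status"] = updates["temperature"].lt(30).map({True: "normal", False: "warning"})
print(updates)
```

The second reading crosses the threshold only after calibration, so its status changes from `normal` to `warning`.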


## Apply Data Retention Policy

Apply a retention policy that deletes readings dated more than 20 hours before the end of the 24-hour period.

In [9]:
# Delete old data
cutoff_time = end_time - timedelta(hours=20)
print(f"\nDeleting data older than {cutoff_time}")

# Get all timestamps to delete
all_data = load_jsonldf('sensor_data.jsonl')
timestamps_to_delete = all_data[all_data.index < cutoff_time].index

delete_jsonldf('sensor_data.jsonl', [t.isoformat() for t in timestamps_to_delete])


Deleting data older than 2024-01-01 04:00:00
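
The cutoff arithmetic can be checked on its own with plain pandas. The sketch below rebuilds the 5-minute timestamp grid used in this notebook and counts how many readings fall on each side of the cutoff; it does not call `delete_jsonldf`.

```python
from datetime import datetime, timedelta
import pandas as pd

index = pd.date_range(datetime(2024, 1, 1), periods=24 * 12, freq="5min")
frame = pd.DataFrame({"sensor_id": "sensor_00"}, index=index)

window_end = datetime(2024, 1, 2)
cutoff = window_end - timedelta(hours=20)  # 2024-01-01 04:00:00

# Strictly older than the cutoff is expired; the cutoff itself is kept.
expired = frame.index[frame.index < cutoff]
kept = frame[frame.index >= cutoff]

print(len(expired))             # 4 hours * 12 = 48
print(len(kept))                # 288 - 48 = 240
print(expired[-1].isoformat())  # 2024-01-01T03:55:00
```

The 240 remaining readings match the total reported in the final summary.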


## Verify Final State

Check the final state of the data after all operations.

In [10]:
print("\nFinal data summary:")
final_data = load_jsonldf('sensor_data.jsonl')
print(f"Total readings: {len(final_data)}")
print(f"Oldest reading: {final_data.index.min()}")
print(f"Newest reading: {final_data.index.max()}")
print("\nReadings per sensor:")
print(final_data['sensor_id'].value_counts())


Final data summary:
Total readings: 240
Oldest reading: 2024-01-01 04:00:00
Newest reading: 2024-01-01 23:55:00

Readings per sensor:
sensor_00    240
Name: sensor_id, dtype: int64


## Cleanup

Remove the JSONL file and its index file (`sensor_data.jsonl.idx`).

In [11]:
print("\nCleaning up...")
os.remove('sensor_data.jsonl')
os.remove('sensor_data.jsonl.idx')
print("Done!")


Cleaning up...
Done!
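
`os.remove` raises `FileNotFoundError` if a file is already gone, for example when the cleanup cell is run twice. A more forgiving variant might look like the sketch below; `remove_if_present` is a hypothetical helper written for this example, not part of `jsonldb`.

```python
import tempfile
from pathlib import Path

def remove_if_present(*paths):
    """Delete each path if it exists and return the names actually removed."""
    removed = []
    for path in map(Path, paths):
        if path.exists():
            path.unlink()
            removed.append(path.name)
    return removed

# Demonstrate in a temporary directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "sensor_data.jsonl"
    data_file.write_text("{}\n")
    # The index file was never created here, so the helper skips it.
    result = remove_if_present(data_file, Path(tmp) / "sensor_data.jsonl.idx")
    print(result)  # ['sensor_data.jsonl']
```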
