# Time Series Data with FolderDB

This notebook demonstrates how to work with time series data using the FolderDB class. We'll show:
- Generating time series data with datetime keys
- Storing and retrieving time-based records
- Performing range queries with timestamps
- Calculating statistics on time series data

## Setup and Imports

First, let's import the required libraries and set up our environment.

In [2]:
import os
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import sys

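# Add the directory three levels above the current working directory to the Python path
# (assumed here to be the repository root that contains the jsonldb package)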
sys.path.append(os.path.dirname(os.path.dirname(os.path.dirname(os.getcwd()))))


from jsonldb.folderdb import FolderDB
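
The three nested `os.path.dirname` calls above walk up three directory levels from the working directory. As a minimal sketch of what that resolves to, the snippet below uses a made-up path (the directory layout is an assumption, not the actual repository structure) and compares the result with the equivalent `pathlib` expression.

```python
import posixpath
from pathlib import PurePosixPath

# Hypothetical working directory, three levels below the project root.
cwd = "/home/user/project/docs/examples/notebooks"

# Each dirname() call strips one trailing path component.
root_via_dirname = posixpath.dirname(posixpath.dirname(posixpath.dirname(cwd)))

# parents[0] is the immediate parent, so parents[2] is three levels up.
root_via_pathlib = str(PurePosixPath(cwd).parents[2])

print(root_via_dirname)
print(root_via_pathlib)
```

If the notebook is launched from a different directory, the appended path changes with it, so the import only works when the working directory sits at the expected depth.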

## Initialize Database

Let's create a folder for our database and initialize the FolderDB instance.

In [3]:
# Create a folder for our database
db_folder = "timeseries_db"
os.makedirs(db_folder, exist_ok=True)

# Initialize the database
db = FolderDB(db_folder)

## Generate Sample Data

Let's create a function to generate sample sensor data with temperature, humidity, and pressure readings.

In [4]:
def generate_sensor_data(start_time: datetime, duration_minutes: int, interval_minutes: int = 1) -> pd.DataFrame:
    """Generate sample sensor data.
    
    Args:
        start_time: Starting datetime
        duration_minutes: Duration in minutes
        interval_minutes: Time interval between readings in minutes
        
    Returns:
        DataFrame with sensor readings
    """
    # Generate timestamps
    timestamps = [start_time + timedelta(minutes=i) for i in range(0, duration_minutes, interval_minutes)]
    
    # Generate random sensor data
    data = {
        'temperature': np.random.normal(25, 2, len(timestamps)),
        'humidity': np.random.normal(60, 5, len(timestamps)),
        'pressure': np.random.normal(1013, 5, len(timestamps))
    }
    
    # Create DataFrame
    df = pd.DataFrame(data, index=timestamps)
    
    # Round values
    df['temperature'] = df['temperature'].round(1)
    df['humidity'] = df['humidity'].round(1)
    df['pressure'] = df['pressure'].round(1)
    
    return df

# Generate data for two sensors
start_time = datetime.now() - timedelta(hours=1)
sensor1_data = generate_sensor_data(start_time, 60)
sensor2_data = generate_sensor_data(start_time, 60)

print("Sensor 1 Data (first 5 records):")
display(sensor1_data.head())
print("\nSensor 2 Data (first 5 records):")
display(sensor2_data.head())

Sensor 1 Data (first 5 records):


| timestamp | temperature | humidity | pressure |
|---|---|---|---|
| 2025-03-27 13:26:21.504995 | 24.2 | 56.7 | 1015.3 |
| 2025-03-27 13:27:21.504995 | 24.6 | 59.5 | 1016.6 |
| 2025-03-27 13:28:21.504995 | 22.5 | 54.4 | 1011.1 |
| 2025-03-27 13:29:21.504995 | 28.5 | 55.5 | 1015.7 |
| 2025-03-27 13:30:21.504995 | 25.0 | 66.5 | 1022.8 |



Sensor 2 Data (first 5 records):


| timestamp | temperature | humidity | pressure |
|---|---|---|---|
| 2025-03-27 13:26:21.504995 | 24.2 | 56.1 | 1023.5 |
| 2025-03-27 13:27:21.504995 | 29.3 | 67.4 | 1007.9 |
| 2025-03-27 13:28:21.504995 | 23.4 | 54.8 | 1009.8 |
| 2025-03-27 13:29:21.504995 | 25.3 | 53.6 | 1009.1 |
| 2025-03-27 13:30:21.504995 | 21.1 | 63.4 | 1012.4 |
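
To make the timestamp logic in `generate_sensor_data` concrete, here is a small standalone sketch of the same list comprehension. The helper name `make_timestamps` and the fixed start time are illustrative assumptions; the notebook itself uses `datetime.now()`.

```python
from datetime import datetime, timedelta

def make_timestamps(start, duration_minutes, interval_minutes=1):
    """Return one timestamp per interval, starting at `start` (end exclusive)."""
    return [
        start + timedelta(minutes=i)
        for i in range(0, duration_minutes, interval_minutes)
    ]

# Fixed start time so the result is reproducible.
start = datetime(2025, 3, 27, 13, 26, 21)

every_minute = make_timestamps(start, 60)
every_five = make_timestamps(start, 60, interval_minutes=5)

print(len(every_minute), every_minute[0], every_minute[-1])
print(len(every_five), every_five[-1])
```

Because `range` excludes its stop value, a 60-minute duration at 1-minute intervals yields 60 readings, and the last one falls 59 minutes after the start.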


## Save Data to Database

Now let's save our sensor data to the database using the `upsert_df` method.

In [5]:
# Save DataFrames to database
db.upsert_df("sensor1", sensor1_data)
db.upsert_df("sensor2", sensor2_data)

print("Database state after saving:")
print(str(db))

Database state after saving:
FolderDB at timeseries_db
--------------------------------------------------
Found 2 JSONL files

sensor1.jsonl:
  Size: 4740 bytes
  Count: 60
  Key range: 2025-03-27T13:26:21 to 2025-03-27T14:25:21
  Linted: False

sensor2.jsonl:
  Size: 4740 bytes
  Count: 60
  Key range: 2025-03-27T13:26:21 to 2025-03-27T14:25:21
  Linted: False
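
The summary above reports `.jsonl` files with second-precision ISO keys. As a rough illustration of what a datetime-keyed JSON Lines file could look like, the sketch below writes one JSON object per line. The exact on-disk layout used by FolderDB is not shown in this notebook, so the record shape and the helper `to_jsonl` are assumptions for illustration only.

```python
import json
from datetime import datetime

def to_jsonl(records):
    """Serialize {datetime: row_dict} as one JSON object per line, sorted by key.

    Hypothetical layout: each line maps an ISO timestamp to its row.
    """
    lines = []
    for ts in sorted(records):
        # timespec="seconds" drops microseconds, like the key range printed above.
        key = ts.isoformat(timespec="seconds")
        lines.append(json.dumps({key: records[ts]}))
    return "\n".join(lines)

sample = {
    datetime(2025, 3, 27, 13, 27, 21, 504995): {"temperature": 24.6, "humidity": 59.5, "pressure": 1016.6},
    datetime(2025, 3, 27, 13, 26, 21, 504995): {"temperature": 24.2, "humidity": 56.7, "pressure": 1015.3},
}

jsonl_text = to_jsonl(sample)
print(jsonl_text)
```

ISO 8601 strings in a fixed format sort lexicographically in the same order as the datetimes they encode, which is what makes them convenient keys for range lookups.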




## Query Recent Data

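Let's query the last 30 minutes of data from both sensors. `get_df` takes a list of names and returns a dictionary that maps each name to a DataFrame. Note that the returned timestamps have second precision: the microseconds present in the generated data do not appear in the stored keys.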

In [6]:
# Get current time and calculate time range
end_time = datetime.now()
start_time = end_time - timedelta(minutes=30)

# Query recent data
recent_data = db.get_df(["sensor1", "sensor2"], lower_key=start_time, upper_key=end_time)

print("Recent Sensor 1 Data:")
display(recent_data["sensor1"].head())
print("\nRecent Sensor 2 Data:")
display(recent_data["sensor2"].head())

Recent Sensor 1 Data:


| timestamp | temperature | humidity | pressure |
|---|---|---|---|
| 2025-03-27 13:57:21 | 26.4 | 72.6 | 1013.2 |
| 2025-03-27 13:58:21 | 23.6 | 54.7 | 1009.8 |
| 2025-03-27 13:59:21 | 25.0 | 64.1 | 1017.2 |
| 2025-03-27 14:00:21 | 25.5 | 53.1 | 1018.9 |
| 2025-03-27 14:01:21 | 22.3 | 58.7 | 1011.4 |



Recent Sensor 2 Data:


| timestamp | temperature | humidity | pressure |
|---|---|---|---|
| 2025-03-27 13:57:21 | 23.4 | 67.5 | 1020.2 |
| 2025-03-27 13:58:21 | 25.7 | 64.8 | 1011.4 |
| 2025-03-27 13:59:21 | 27.3 | 60.2 | 1010.2 |
| 2025-03-27 14:00:21 | 28.2 | 51.7 | 1014.6 |
| 2025-03-27 14:01:21 | 23.6 | 63.1 | 1007.1 |
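
A range query over sorted keys can be sketched with binary search. The snippet below is not FolderDB's implementation; it assumes sorted keys and inclusive bounds on both ends, and the function name `range_query` is hypothetical.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timedelta

def range_query(sorted_keys, lower, upper):
    """Return the keys k with lower <= k <= upper (both bounds assumed inclusive)."""
    lo = bisect_left(sorted_keys, lower)   # first index with key >= lower
    hi = bisect_right(sorted_keys, upper)  # first index with key > upper
    return sorted_keys[lo:hi]

# Sixty one-minute keys, mirroring the stored key range in this notebook.
first_key = datetime(2025, 3, 27, 13, 26, 21)
keys = [first_key + timedelta(minutes=i) for i in range(60)]

# A 30-minute window ending shortly after the last reading.
window_end = datetime(2025, 3, 27, 14, 26, 30)
window_start = window_end - timedelta(minutes=30)

recent_keys = range_query(keys, window_start, window_end)
print(len(recent_keys), recent_keys[0], recent_keys[-1])
```

With this window the first matching key is 13:57:21, consistent with the first rows shown above. Binary search locates both ends of the slice in logarithmic time, which is the usual reason to keep time series keys sorted.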


## Calculate Statistics

Let's calculate some basic statistics on the sensor data.

In [7]:
# Get all data
all_data = db.get_df(["sensor1", "sensor2"])

print("Sensor 1 Statistics:")
display(all_data["sensor1"].describe())
print("\nSensor 2 Statistics:")
display(all_data["sensor2"].describe())

Sensor 1 Statistics:


| statistic | temperature | humidity | pressure |
|---|---|---|---|
| count | 60.0 | 60.0 | 60.0 |
| mean | 25.575 | 61.428333 | 1013.59 |
| std | 1.937903 | 5.546906 | 4.211433 |
| min | 21.6 | 44.7 | 1006.1 |
| 25% | 24.175 | 58.4 | 1010.325 |
| 50% | 25.25 | 61.6 | 1013.55 |
| 75% | 27.325 | 64.175 | 1016.3 |
| max | 29.0 | 74.0 | 1022.9 |



Sensor 2 Statistics:


| statistic | temperature | humidity | pressure |
|---|---|---|---|
| count | 60.0 | 60.0 | 60.0 |
| mean | 24.61 | 61.261667 | 1012.841667 |
| std | 2.050523 | 4.990537 | 4.792908 |
| min | 19.4 | 50.5 | 1004.4 |
| 25% | 23.3 | 57.775 | 1009.65 |
| 50% | 24.55 | 61.6 | 1011.75 |
| 75% | 25.825 | 65.05 | 1016.4 |
| max | 29.3 | 71.5 | 1026.7 |
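
For readers who want to see what the rows of `describe()` mean, the sketch below recomputes them with the standard library for a single column. It uses the first five Sensor 1 temperature readings from earlier in this notebook; the helper name `describe_column` is an illustrative assumption. pandas reports the sample standard deviation and linearly interpolated quartiles, which correspond to `statistics.stdev` and `statistics.quantiles(..., method="inclusive")`.

```python
import statistics

def describe_column(values):
    """Summary statistics for one numeric column, in the same order as describe()."""
    # method="inclusive" interpolates linearly between the min and max of the data.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": max(values),
    }

temperatures = [24.2, 24.6, 22.5, 28.5, 25.0]
summary = describe_column(temperatures)

for name, value in summary.items():
    print(f"{name:>5}: {value}")
```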


## Apply Calibration

Let's multiply the temperature readings of sensor 1 by a calibration factor of 1.1 and write the result back to the database with `upsert_df`.

In [8]:
# Apply calibration to sensor1
calibration_factor = 1.1
sensor1_calibrated = all_data["sensor1"].copy()
sensor1_calibrated['temperature'] *= calibration_factor

# Save calibrated data
db.upsert_df("sensor1", sensor1_calibrated)

print("Calibrated Sensor 1 Data (first 5 records):")
display(sensor1_calibrated.head())

Calibrated Sensor 1 Data (first 5 records):


| timestamp | temperature | humidity | pressure |
|---|---|---|---|
| 2025-03-27 13:26:21 | 26.62 | 56.7 | 1015.3 |
| 2025-03-27 13:27:21 | 27.06 | 59.5 | 1016.6 |
| 2025-03-27 13:28:21 | 24.75 | 54.4 | 1011.1 |
| 2025-03-27 13:29:21 | 31.35 | 55.5 | 1015.7 |
| 2025-03-27 13:30:21 | 27.5 | 66.5 | 1022.8 |
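
The calibration arithmetic can be checked by hand on the first five temperature readings. In the sketch below, the `calibrate` helper and the rounding step are additions for illustration; the notebook cell multiplies the column without rounding, and the tidy values in its output come from the DataFrame display.

```python
def calibrate(readings, factor, ndigits=2):
    """Scale each reading by `factor` and round away floating-point noise."""
    return [round(value * factor, ndigits) for value in readings]

raw_temperatures = [24.2, 24.6, 22.5, 28.5, 25.0]
calibrated = calibrate(raw_temperatures, 1.1)

for before, after in zip(raw_temperatures, calibrated):
    print(f"{before} -> {after}")
```

Rounding after scaling is optional, but it keeps stored values at a sensible precision, since products such as `24.2 * 1.1` are generally not exactly representable in binary floating point.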


## Delete Old Data

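Let's delete data older than 30 minutes by passing `None` as the lower bound (presumably meaning "no lower limit") and the cutoff time as the upper bound. In the output below, the summary still reports 60 records and the original key range for both files, with `Linted: False`. This suggests that the printed metadata is not refreshed until the database is linted in the next step.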

In [9]:
# Delete old data from both sensors
cutoff_time = datetime.now() - timedelta(minutes=30)

db.delete_file_range("sensor1", None, cutoff_time)
db.delete_file_range("sensor2", None, cutoff_time)

print("Database state after deletion:")
print(str(db))

Database state after deletion:
FolderDB at timeseries_db
--------------------------------------------------
Found 2 JSONL files

sensor1.jsonl:
  Size: 9954 bytes
  Count: 60
  Key range: 2025-03-27T13:26:21 to 2025-03-27T14:25:21
  Linted: False

sensor2.jsonl:
  Size: 4740 bytes
  Count: 60
  Key range: 2025-03-27T13:26:21 to 2025-03-27T14:25:21
  Linted: False
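
The intended effect of a range delete with an open lower bound can be sketched on a plain dictionary. This is not how `delete_file_range` is implemented; the function `delete_range`, its inclusive bounds, and the treatment of `None` as "unbounded" are assumptions made for illustration.

```python
from datetime import datetime, timedelta

def delete_range(records, lower=None, upper=None):
    """Return a copy of `records` without keys in [lower, upper].

    A bound of None is treated as unbounded on that side (assumed semantics).
    """
    def in_range(key):
        above_lower = lower is None or key >= lower
        below_upper = upper is None or key <= upper
        return above_lower and below_upper

    return {key: row for key, row in records.items() if not in_range(key)}

# Sixty one-minute records with placeholder values.
first_key = datetime(2025, 3, 27, 13, 26, 21)
records = {first_key + timedelta(minutes=i): {"reading": i} for i in range(60)}

# Drop everything at or before the cutoff.
cutoff = datetime(2025, 3, 27, 13, 56, 30)
remaining = delete_range(records, None, cutoff)

print(len(records), "->", len(remaining))
print(min(remaining), "to", max(remaining))
```

Under these assumptions, 29 of the 60 records survive, starting at 13:57:21.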




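## Lint DB

Next, let's lint the database files. As the output below shows, linting also updates the metadata of each file.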

In [10]:
db.lint_db()

Found 2 JSONL files to lint.
Linting file: sensor1.jsonl
Successfully linted and updated metadata for sensor1.jsonl.
Linting file: sensor2.jsonl
Successfully linted and updated metadata for sensor2.jsonl.


## Cleanup

Finally, let's clean up by removing the database folder and its contents.

In [11]:
# Cleanup
for file in os.listdir(db_folder):
    os.remove(os.path.join(db_folder, file))
os.rmdir(db_folder)

print("Database folder has been cleaned up.")

Database folder has been cleaned up.
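
The loop above works as long as the folder contains only files, because `os.remove` fails on a directory. As an alternative sketch, `shutil.rmtree` removes a folder together with everything inside it, including subfolders. The example runs in a temporary directory so that it does not touch real data; the file and folder names are placeholders.

```python
import os
import shutil
import tempfile

# Work inside a throwaway directory so nothing real is deleted.
base_dir = tempfile.mkdtemp()
demo_db = os.path.join(base_dir, "timeseries_db")

# Build a small tree: one data file plus a nested folder.
os.makedirs(os.path.join(demo_db, "nested"))
with open(os.path.join(demo_db, "sensor1.jsonl"), "w") as handle:
    handle.write("{}\n")

# Remove the folder and all of its contents in one call.
shutil.rmtree(demo_db)
removed = not os.path.exists(demo_db)
print("Removed:", removed)

# Clean up the temporary base directory as well.
shutil.rmtree(base_dir)
```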
