# Temperature & Irradiance Data Upload Notebook

This notebook processes and uploads temperature and irradiance sensor data from text files to the TimescaleDB database.

## Purpose

- Scan directories for temperature and irradiance data files
- Register sensor metadata in the database
- Upload measurement data to the TimescaleDB database
- Avoid duplicate data entries

## Prerequisites

- Running TimescaleDB instance (configured in docker-compose.yml)
- Access to directory containing temperature and irradiance data files
- Environment variables configured in .env file (for database connection)

## 1. Setup and Imports

Import required libraries and install any missing dependencies.

In [1]:
# Install required packages if not already installed
!pip install psycopg2-binary sqlalchemy pandas tqdm pathlib python-dotenv uuid
import psycopg2



In [2]:
# Core data processing libraries
import os
import re
import pandas as pd
import numpy as np
from datetime import datetime, timezone
from pathlib import Path
import uuid

# Database libraries
from sqlalchemy import create_engine, text

# Progress tracking
from tqdm.notebook import tqdm

# Environment variables
from dotenv import load_dotenv

# Logging
import logging
logging.basicConfig(level=logging.INFO,
                   format='%(asctime)s - %(levelname)s - %(message)s')

## 2. Configuration

Load configuration from environment variables or use defaults.

## TimescaleDB and Measurement Tables Note

Important note about TimescaleDB tables:

1. The measurement tables (`irradiance_measurement`, `temperature_measurement`, etc.) are hypertables in TimescaleDB.
2. These tables do **not** have a traditional primary key column named 'id'.
3. Instead, they use the `timestamp` column as their primary dimension for partitioning and indexing.
4. When checking for existing data, we must use the `timestamp` column rather than looking for an 'id' column.

This is a key difference from regular PostgreSQL tables and affects how we check for duplicates.

In [3]:
# Load environment variables from .env file
# Look for the .env file two directories up from the notebook location
dotenv_path = Path("../../.env")
load_dotenv(dotenv_path)

# Database configuration from environment variables with fallbacks
DB_CONFIG = {
    'host': os.getenv('DB_HOST', 'localhost'),  # Changed from 'timescaledb' to 'localhost'
    'port': int(os.getenv('DB_PORT', 5432)),
    'database': os.getenv('DB_NAME', 'perocube'),
    'user': os.getenv('DB_USER', 'postgres'),
    'password': os.getenv('DB_PASSWORD', 'postgres')
}

# Print database connection info (excluding password)
print(f"Database connection: {DB_CONFIG['host']}:{DB_CONFIG['port']}/{DB_CONFIG['database']} as {DB_CONFIG['user']}")

# Data directory configuration
ROOT_DIRECTORY = os.getenv('DEFAULT_DATA_DIR', "../../sample_data/datasets/PeroCube-sample-data")

# File matching patterns
IRRADIANCE_FILE_PATTERN = r"PT-(\d+)_channel_(\d+)"  # Matches PT-104_channel_01.txt
TEMPERATURE_FILE_PATTERN = r"m7004_ID_(\w+)"

# Batch size for database operations
BATCH_SIZE = 5000

# Data validation configuration
VALIDATION_CONFIG = {
    'enabled': False,  # Master switch for validation
    'remove_nan': True,  # Always remove NaN values
    'validate_ranges': False,  # Optional physical value validation
    'ranges': {
        'irradiance': {'min': 0, 'max': 1500},  # W/m²
        'temperature': {'min': -50, 'max': 100}  # °C
    }
}

def print_validation_config():
    """Print current validation configuration for user awareness"""
    print("\nData Validation Configuration:")
    print(f"- Validation enabled: {VALIDATION_CONFIG['enabled']}")
    print(f"- Remove NaN values: {VALIDATION_CONFIG['remove_nan']}")
    if VALIDATION_CONFIG['enabled'] and VALIDATION_CONFIG['validate_ranges']:
        print("\nPhysical value ranges:")
        for measure, limits in VALIDATION_CONFIG['ranges'].items():
            print(f"- {measure}: {limits['min']} to {limits['max']}")
    else:
        print("\nPhysical value validation is disabled")

# Table names
IRRADIANCE_SENSOR_TABLE = 'irradiance_sensor'
IRRADIANCE_MEASUREMENT_TABLE = 'irradiance_measurement'
TEMPERATURE_SENSOR_TABLE = 'temperature_sensor'
TEMPERATURE_MEASUREMENT_TABLE = 'temperature_measurement'

Database connection: localhost:5432/perocube as postgres


## 3. Utility Functions

Helper functions for database connection and data validation.

In [4]:
def create_db_connection(config=DB_CONFIG):
    """
    Create a SQLAlchemy database engine from configuration.
    
    Args:
        config: Dictionary containing database connection parameters
        
    Returns:
        SQLAlchemy engine instance
    """
    try:
        connection_string = f"postgresql://{config['user']}:{config['password']}@{config['host']}:{config['port']}/{config['database']}"
        # Store connection string as attribute of engine for external access
        engine = create_engine(connection_string)
        engine.connection_string = connection_string  # This makes it accessible via engine.connection_string
        
        # Test the connection
        with engine.connect() as conn:
            result = conn.execute(text("SELECT 1"))
            logging.info(f"Database connection successful: {config['host']}:{config['port']}/{config['database']}")
        return engine
    except Exception as e:
        logging.error(f"Database connection failed: {str(e)}")
        raise

def validate_irradiance_data(df, config=VALIDATION_CONFIG):
    """
    Validate irradiance data according to the specified configuration.
    
    Args:
        df: DataFrame containing irradiance measurements
        config: Dictionary containing validation configuration
        
    Returns:
        Cleaned and validated DataFrame, along with validation statistics
    """
    if df.empty:
        return df, {'initial_count': 0, 'final_count': 0, 'removed': {}}
    
    stats = {
        'initial_count': len(df),
        'final_count': None,
        'removed': {
            'nan_values': 0,
            'irradiance_range': 0
        }
    }
    
    # Always ensure timestamp is in UTC
    if 'timestamp' in df.columns:
        df['timestamp'] = pd.to_datetime(df['timestamp'], utc=True)
    
    # Remove NaN values if configured
    if config['remove_nan']:
        nan_count = df.isna().sum().sum()
        df = df.dropna()
        stats['removed']['nan_values'] = nan_count
    
    # Apply physical value validation if enabled
    if config['enabled'] and config['validate_ranges']:
        if 'irradiance' in df.columns and 'irradiance' in config['ranges']:
            limits = config['ranges']['irradiance']
            invalid_count = len(df[~(df['irradiance'].between(limits['min'], limits['max']))])
            df = df[df['irradiance'].between(limits['min'], limits['max'])]
            stats['removed']['irradiance_range'] = invalid_count
    
    stats['final_count'] = len(df)
    
    # Log validation results
    logging.info("Validation statistics:")
    logging.info(f"Initial records: {stats['initial_count']}")
    if config['remove_nan']:
        logging.info(f"Removed NaN values: {stats['removed']['nan_values']}")
    if config['enabled'] and config['validate_ranges']:
        if stats['removed']['irradiance_range'] > 0:
            logging.info(f"Removed irradiance out of range: {stats['removed']['irradiance_range']}")
    logging.info(f"Final records: {stats['final_count']}")
    
    return df, stats

def validate_temperature_data(df, config=VALIDATION_CONFIG):
    """
    Validate temperature data according to the specified configuration.
    
    Args:
        df: DataFrame containing temperature measurements
        config: Dictionary containing validation configuration
        
    Returns:
        Cleaned and validated DataFrame, along with validation statistics
    """
    if df.empty:
        return df, {'initial_count': 0, 'final_count': 0, 'removed': {}}
    
    stats = {
        'initial_count': len(df),
        'final_count': None,
        'removed': {
            'nan_values': 0,
            'temperature_range': 0
        }
    }
    
    # Always ensure timestamp is in UTC
    if 'timestamp' in df.columns:
        df['timestamp'] = pd.to_datetime(df['timestamp'], utc=True)
    
    # Remove NaN values if configured
    if config['remove_nan']:
        nan_count = df.isna().sum().sum()
        df = df.dropna()
        stats['removed']['nan_values'] = nan_count
    
    # Apply physical value validation if enabled
    if config['enabled'] and config['validate_ranges']:
        if 'temperature' in df.columns and 'temperature' in config['ranges']:
            limits = config['ranges']['temperature']
            invalid_count = len(df[~(df['temperature'].between(limits['min'], limits['max']))])
            df = df[df['temperature'].between(limits['min'], limits['max'])]
            stats['removed']['temperature_range'] = invalid_count
    
    stats['final_count'] = len(df)
    
    # Log validation results
    logging.info("Validation statistics:")
    logging.info(f"Initial records: {stats['initial_count']}")
    if config['remove_nan']:
        logging.info(f"Removed NaN values: {stats['removed']['nan_values']}")
    if config['enabled'] and config['validate_ranges']:
        if stats['removed']['temperature_range'] > 0:
            logging.info(f"Removed temperature out of range: {stats['removed']['temperature_range']}")
    logging.info(f"Final records: {stats['final_count']}")
    
    return df, stats

def check_existing_data(engine, table, sensor_id_col, sensor_id, timestamp, id_field='timestamp'):
    """
    Check if data already exists in the database for given parameters.
    
    Args:
        engine: SQLAlchemy engine
        table: Table name
        sensor_id_col: Column name for sensor ID
        sensor_id: Sensor ID value
        timestamp: Timestamp to check
        id_field: Primary key field name (defaults to timestamp as hypertables use timestamp as primary dimension)
        
    Returns:
        Boolean indicating if data exists
    """
    if not timestamp:
        return False
        
    # Build a query to check for existing data
    query = text(f"""
        SELECT {id_field}
        FROM {table}
        WHERE timestamp = :timestamp
          AND {sensor_id_col} = :sensor_id
        LIMIT 1
    """)
    
    # Execute the query
    with engine.connect() as conn:
        result = conn.execute(query, {
            "timestamp": timestamp,
            "sensor_id": sensor_id
        })
        row = result.fetchone()
        
    # If row is not None, data exists
    return row is not None

def get_sensor_id(engine, sensor_table, name_col, name_val, channel_col=None, channel_val=None):
    """
    Get sensor ID from the database or create if it doesn't exist.
    
    Args:
        engine: SQLAlchemy engine
        sensor_table: Sensor table name
        name_col: Column name for sensor name
        name_val: Sensor name value
        channel_col: Column name for channel (optional)
        channel_val: Channel value (optional)
        
    Returns:
        Sensor ID
    """
    # Build the query based on whether channel is provided
    if channel_col and channel_val is not None:
        query = text(f"""
            SELECT {sensor_table}_id FROM {sensor_table} 
            WHERE {name_col} = :name_val AND {channel_col} = :channel_val
        """)
        params = {"name_val": name_val, "channel_val": channel_val}
    else:
        query = text(f"""
            SELECT {sensor_table}_id FROM {sensor_table} 
            WHERE {name_col} = :name_val
        """)
        params = {"name_val": name_val}
    
    # Execute the query
    with engine.connect() as conn:
        result = conn.execute(query, params)
        row = result.fetchone()
        
    # Return the ID if found
    if row:
        return row[0]
    else:
        logging.warning(f"Sensor not found: {name_val} in {sensor_table}")
        return None

## 4. Sensor Registration Functions

In [5]:
def register_irradiance_sensors(root_dir, engine, pattern=IRRADIANCE_FILE_PATTERN):
    """
    Scan directories for irradiance sensor files and register unique sensors in the database.
    """
    stats = {
        'sensors_found': 0,
        'sensors_registered': 0,
        'sensors_existing': 0
    }
    
    # Convert to Path object for better path handling
    root_path = Path(root_dir)
    if not root_path.exists():
        logging.error(f"Root directory does not exist: {root_dir}")
        return stats
    
    # Compile the regex pattern for efficiency
    pattern_compiled = re.compile(pattern)
    
    # Create a DataFrame to store sensor information
    irradiance_sensors = pd.DataFrame(columns=[
        'irradiance_sensor_id', 'date_installed', 'location', 
        'installation_angle', 'sensor_identifier', 'channel'
    ])
    existing_sensors = set()  # Track unique sensor/channel combinations

    # First, get all existing sensors from the database
    query = text(f"""
        SELECT sensor_identifier, channel, irradiance_sensor_id
        FROM {IRRADIANCE_SENSOR_TABLE}
    """)
    sensor_mapping = {}
    with engine.connect() as conn:
        result = conn.execute(query)
        for row in result:
            # FIXED: Store the key in the same consistent format
            sensor_key = f"{row.sensor_identifier}_{row.channel}"
            sensor_mapping[sensor_key] = row.irradiance_sensor_id

    # Scan directories for sensor files
    for dirpath, dirnames, filenames in os.walk(root_path):
        path_parts = Path(dirpath).parts
        if any(part.startswith("data") for part in path_parts):
            for filename in filenames:
                match = pattern_compiled.search(filename)
                if match:
                    sensor_number = match.group(1)  # PT-{number}
                    channel = int(match.group(2))      # channel_{number}
                    sensor_identifier = f"PT-{sensor_number}"
                    
                    # FIXED: Use a consistent key format
                    sensor_key = f"{sensor_identifier}_{channel}"  
                    
                    if sensor_key not in existing_sensors:
                        # Check if sensor already exists in database
                        if sensor_key in sensor_mapping:
                            sensor_id = sensor_mapping[sensor_key]
                            stats['sensors_existing'] += 1
                        else:
                            # Only generate new UUID for new sensors
                            sensor_id = str(uuid.uuid4())
                            # Create a new sensor entry
                            sensor_data = {
                                'irradiance_sensor_id': sensor_id,
                                'date_installed': None,  # Will be set when actual installation date is known
                                'location': None,       # Will be set to represent physical location in PV system
                                'installation_angle': 0, # Will be set to actual installation angle
                                'sensor_identifier': sensor_identifier,
                                'channel': channel
                            }
                            
                            # Add to DataFrame
                            irradiance_sensors = pd.concat([irradiance_sensors, pd.DataFrame([sensor_data])], ignore_index=True)
                            stats['sensors_found'] += 1
                        
                        existing_sensors.add(sensor_key)

    # Insert only new sensors into the database
    for _, sensor in irradiance_sensors.iterrows():
        try:
            # Convert NaN to None
            sensor_dict = {k: (None if pd.isna(v) else v) for k, v in sensor.items()}
            
            insert_query = text(f"""
                INSERT INTO {IRRADIANCE_SENSOR_TABLE} 
                (irradiance_sensor_id, date_installed, location, installation_angle, sensor_identifier, channel)
                VALUES (:irradiance_sensor_id, :date_installed, :location, :installation_angle, :sensor_identifier, :channel)
                ON CONFLICT (sensor_identifier, channel) DO NOTHING
            """)
            
            with engine.connect() as conn:
                result = conn.execute(insert_query, sensor_dict)
                conn.commit()
                if result.rowcount > 0:
                    stats['sensors_registered'] += 1
                    logging.info(f"Registered irradiance sensor: {sensor['sensor_identifier']} channel {sensor['channel']}")
                
        except Exception as e:
            logging.error(f"Failed to register sensor {sensor['sensor_identifier']} channel {sensor['channel']}: {str(e)}")
    
    logging.info(f"Found {stats['sensors_found']} unique irradiance sensors")
    logging.info(f"Registered {stats['sensors_registered']} new irradiance sensors")
    logging.info(f"Found {stats['sensors_existing']} existing irradiance sensors")
    
    return stats

def register_temperature_sensors(root_dir, engine, pattern=TEMPERATURE_FILE_PATTERN):
    """
    Scan directories for temperature sensor files and register unique sensors in the database.
    """
    stats = {
        'sensors_found': 0,
        'sensors_registered': 0,
        'sensors_existing': 0
    }
    
    # Convert to Path object for better path handling
    root_path = Path(root_dir)
    if not root_path.exists():
        logging.error(f"Root directory does not exist: {root_dir}")
        return stats
    
    # Compile the regex pattern for efficiency
    pattern_compiled = re.compile(pattern)
    
    # Create a DataFrame to store sensor information
    temperature_sensors = pd.DataFrame(columns=[
        'temperature_sensor_id', 'date_installed', 'location', 
        'sensor_identifier'
    ])
    existing_sensors = set()  # Track unique sensors

    # First, get all existing sensors from the database
    query = text(f"""
        SELECT sensor_identifier, temperature_sensor_id
        FROM {TEMPERATURE_SENSOR_TABLE}
    """)
    sensor_mapping = {}
    with engine.connect() as conn:
        result = conn.execute(query)
        for row in result:
            sensor_mapping[row.sensor_identifier] = row.temperature_sensor_id

    # Scan directories for sensor files
    for dirpath, dirnames, filenames in os.walk(root_path):
        path_parts = Path(dirpath).parts
        if any(part.startswith("data") for part in path_parts):
            for filename in filenames:
                match = pattern_compiled.search(filename)
                if match:
                    sensor_number = match.group(1)  # TE-{number}
                    sensor_identifier = f"TE-{sensor_number}"
                    
                    if sensor_identifier not in existing_sensors:
                        # Check if sensor already exists in database
                        if sensor_identifier in sensor_mapping:
                            sensor_id = sensor_mapping[sensor_identifier]
                            stats['sensors_existing'] += 1
                        else:
                            # Only generate new UUID for new sensors
                            sensor_id = str(uuid.uuid4())
                            # Create a new sensor entry
                            sensor_data = {
                                'temperature_sensor_id': sensor_id,
                                'date_installed': None,  # Will be set when actual installation date is known
                                'location': None,       # Will be set to represent physical location in PV system
                                'sensor_identifier': sensor_identifier
                            }
                            
                            # Add to DataFrame
                            temperature_sensors = pd.concat([temperature_sensors, pd.DataFrame([sensor_data])], ignore_index=True)
                            stats['sensors_found'] += 1
                        
                        existing_sensors.add(sensor_identifier)

    # Insert only new sensors into the database
    for _, sensor in temperature_sensors.iterrows():
        try:
            # Convert NaN to None
            sensor_dict = {k: (None if pd.isna(v) else v) for k, v in sensor.items()}
            
            insert_query = text(f"""
                INSERT INTO {TEMPERATURE_SENSOR_TABLE} 
                (temperature_sensor_id, date_installed, location, sensor_identifier)
                VALUES (:temperature_sensor_id, :date_installed, :location, :sensor_identifier)
                ON CONFLICT (sensor_identifier) DO NOTHING
            """)
            
            with engine.connect() as conn:
                result = conn.execute(insert_query, sensor_dict)
                conn.commit()
                if result.rowcount > 0:
                    stats['sensors_registered'] += 1
                    logging.info(f"Registered temperature sensor: {sensor['sensor_identifier']}")
                
        except Exception as e:
            logging.error(f"Failed to register sensor {sensor['sensor_identifier']}: {str(e)}")
    
    logging.info(f"Found {stats['sensors_found']} unique temperature sensors")
    logging.info(f"Registered {stats['sensors_registered']} new temperature sensors")
    logging.info(f"Found {stats['sensors_existing']} existing temperature sensors")
    
    return stats

## 5. Data Processing Functions

In [6]:
def process_irradiance_files(root_dir, engine, pattern=IRRADIANCE_FILE_PATTERN, batch_size=BATCH_SIZE, validate=VALIDATION_CONFIG['enabled']):
    """
    Process irradiance data files and upload to database.
    """
    # Statistics to return
    stats = {
        'files_processed': 0,
        'files_skipped': 0,
        'files_error': 0,
        'rows_inserted': 0,
        'start_time': datetime.now(timezone.utc),
        'total_files': 0
    }

    # Convert to Path object for better path handling
    root_path = Path(root_dir)
    if not root_path.exists():
        logging.error(f"Root directory does not exist: {root_dir}")
        return stats

    # Compile the regex pattern for efficiency
    pattern_compiled = re.compile(pattern)

    # First, collect all matching filepaths
    matching_files = []
    for dirpath, dirnames, filenames in os.walk(root_path):
        path_parts = Path(dirpath).parts
        if any(part.startswith("data") for part in path_parts):
            for filename in filenames:
                filepath = Path(dirpath) / filename
                if pattern_compiled.search(filename):
                    matching_files.append(filepath)

    stats['total_files'] = len(matching_files)
    logging.info(f"Found {len(matching_files)} irradiance data files to process")

    # Create a mapping of sensor identifiers and channels to database IDs
    query = text(f"""SELECT irradiance_sensor_id, sensor_identifier, channel FROM {IRRADIANCE_SENSOR_TABLE}""")
    sensor_mapping = {}
    with engine.connect() as conn:
        result = conn.execute(query)
        for row in result:
            sensor_mapping[f"{row.sensor_identifier}_{row.channel}"] = row.irradiance_sensor_id
    
    if not sensor_mapping:
        logging.error("No irradiance sensors registered in the database")
        return stats

    # Process files with a progress bar
    with tqdm(total=len(matching_files), desc="Processing Irradiance Files") as pbar:
        for filepath in matching_files:
            try:
                # Extract sensor info from filename
                match = pattern_compiled.search(filepath.name)
                if not match:
                    logging.warning(f"Could not parse sensor info from filename: {filepath}")
                    stats['files_skipped'] += 1
                    pbar.update(1)
                    continue

                sensor_number = match.group(1)  # This gets "104" from "PT-104_channel_01"
                channel = int(match.group(2))      # This gets "01" from "PT-104_channel_01"
                sensor_identifier = f"PT-{sensor_number}"
                sensor_key = f"{sensor_identifier}_{channel}"  # Construct key to match actual filename pattern
                
                # Look up sensor ID from mapping
                sensor_id = sensor_mapping.get(sensor_key)
                if not sensor_id:
                    logging.warning(f"No registered sensor found for identifier: {sensor_key}. Skipping file.")
                    stats['files_skipped'] += 1
                    pbar.update(1)
                    continue

                # Read the file into a pandas DataFrame
                df = pd.read_csv(
                    filepath,
                    sep='\t',
                    names=['timestamp', 'raw_reading', 'irradiance']
                )

                if df.empty:
                    logging.warning(f"Empty file: {filepath}")
                    stats['files_skipped'] += 1
                    pbar.update(1)
                    continue

                # Ensure timestamp is in UTC format
                df['timestamp'] = pd.to_datetime(df['timestamp'], utc=True)

                # Add sensor ID to the DataFrame
                df['irradiance_sensor_id'] = sensor_id

                # Validate data if enabled
                if validate:
                    df, validation_stats = validate_irradiance_data(df)
                    if df.empty:
                        logging.warning(f"All data filtered during validation: {filepath}")
                        stats['files_skipped'] += 1
                        pbar.update(1)
                        continue

                # Check if last data point exists to avoid duplicates
                last_timestamp = df['timestamp'].iloc[-1]
                data_exists = check_existing_data(
                    engine,
                    IRRADIANCE_MEASUREMENT_TABLE,
                    'irradiance_sensor_id',
                    sensor_id,
                    last_timestamp,
                    id_field='timestamp'  # Explicitly specify timestamp as the id_field
                )

                if data_exists:
                    logging.info(f"Data already exists for {filepath}. Skipping file.")
                    stats['files_skipped'] += 1
                else:
                    # Upload data in batches for large files
                    total_rows = len(df)
                    for i in range(0, total_rows, batch_size):
                        batch_df = df.iloc[i:i+batch_size]
                        batch_df.to_sql(
                            IRRADIANCE_MEASUREMENT_TABLE,
                            engine,
                            if_exists='append',
                            index=False
                        )

                    stats['rows_inserted'] += total_rows
                    stats['files_processed'] += 1
                    logging.info(f"Successfully uploaded {total_rows} rows from {filepath}")

                # Clean up
                del df
                pbar.update(1)

            except Exception as e:
                logging.error(f"Error processing {filepath}: {str(e)}")
                stats['files_error'] += 1
                pbar.update(1)

    # Calculate duration
    stats['end_time'] = datetime.now(timezone.utc)
    stats['duration_seconds'] = (stats['end_time'] - stats['start_time']).total_seconds()

    logging.info(f"Processing complete. Processed {stats['files_processed']} files, "
                 f"skipped {stats['files_skipped']} files, "
                 f"errors in {stats['files_error']} files. "
                 f"Inserted {stats['rows_inserted']} data points in {stats['duration_seconds']:.2f} seconds.")

    return stats

def process_temperature_files(root_dir, engine, pattern=TEMPERATURE_FILE_PATTERN, batch_size=BATCH_SIZE, validate=VALIDATION_CONFIG['enabled']):
    """
    Process temperature data files and upload to database.
    """
    # Statistics to return
    stats = {
        'files_processed': 0,
        'files_skipped': 0,
        'files_error': 0,
        'rows_inserted': 0,
        'start_time': datetime.now(timezone.utc),
        'total_files': 0
    }

    # Convert to Path object for better path handling
    root_path = Path(root_dir)
    if not root_path.exists():
        logging.error(f"Root directory does not exist: {root_dir}")
        return stats

    # Compile the regex pattern for efficiency
    pattern_compiled = re.compile(pattern)

    # First, collect all matching filepaths
    matching_files = []
    for dirpath, dirnames, filenames in os.walk(root_path):
        path_parts = Path(dirpath).parts
        if any(part.startswith("data") for part in path_parts):
            for filename in filenames:
                filepath = Path(dirpath) / filename
                if pattern_compiled.search(filename):
                    matching_files.append(filepath)

    stats['total_files'] = len(matching_files)
    logging.info(f"Found {len(matching_files)} temperature data files to process")

    # Create a mapping of sensor identifiers to database IDs
    query = text(f"""SELECT temperature_sensor_id, sensor_identifier FROM {TEMPERATURE_SENSOR_TABLE}""")
    sensor_mapping = {}
    with engine.connect() as conn:
        result = conn.execute(query)
        for row in result:
            sensor_mapping[row.sensor_identifier] = row.temperature_sensor_id
    
    if not sensor_mapping:
        logging.error("No temperature sensors registered in the database")
        return stats

    # Process files with a progress bar
    with tqdm(total=len(matching_files), desc="Processing Temperature Files") as pbar:
        for filepath in matching_files:
            try:
                # Extract sensor info from filename
                match = pattern_compiled.search(filepath.name)
                if not match:
                    logging.warning(f"Could not parse sensor info from filename: {filepath}")
                    stats['files_skipped'] += 1
                    pbar.update(1)
                    continue

                sensor_identifier = match.group(1)  # Extract the sensor ID from filename
                
                # Look up sensor ID from mapping
                sensor_id = sensor_mapping.get(sensor_identifier)
                if not sensor_id:
                    logging.warning(f"No registered sensor found for identifier: {sensor_identifier}. Skipping file.")
                    stats['files_skipped'] += 1
                    pbar.update(1)
                    continue

                # Read the file into a pandas DataFrame
                df = pd.read_csv(
                    filepath,
                    sep='\t',
                    names=['timestamp', 'temperature']
                )

                if df.empty:
                    logging.warning(f"Empty file: {filepath}")
                    stats['files_skipped'] += 1
                    pbar.update(1)
                    continue

                # Ensure timestamp is in UTC format
                df['timestamp'] = pd.to_datetime(df['timestamp'], utc=True)

                # Add sensor ID to the DataFrame
                df['temperature_sensor_id'] = sensor_id

                # Validate data if enabled
                if validate:
                    df, validation_stats = validate_temperature_data(df)
                    if df.empty:
                        logging.warning(f"All data filtered during validation: {filepath}")
                        stats['files_skipped'] += 1
                        pbar.update(1)
                        continue

                # Check if last data point exists to avoid duplicates
                last_timestamp = df['timestamp'].iloc[-1]
                data_exists = check_existing_data(
                    engine,
                    TEMPERATURE_MEASUREMENT_TABLE,
                    'temperature_sensor_id',
                    sensor_id,
                    last_timestamp,
                    id_field='timestamp'  # Explicitly specify timestamp as the id_field
                )

                if data_exists:
                    logging.info(f"Data already exists for {filepath}. Skipping file.")
                    stats['files_skipped'] += 1
                else:
                    # Upload data in batches for large files
                    total_rows = len(df)
                    for i in range(0, total_rows, batch_size):
                        batch_df = df.iloc[i:i+batch_size]
                        batch_df.to_sql(
                            TEMPERATURE_MEASUREMENT_TABLE,
                            engine,
                            if_exists='append',
                            index=False
                        )

                    stats['rows_inserted'] += total_rows
                    stats['files_processed'] += 1
                    logging.info(f"Successfully uploaded {total_rows} rows from {filepath}")

                # Clean up
                del df
                pbar.update(1)

            except Exception as e:
                logging.error(f"Error processing {filepath}: {str(e)}")
                stats['files_error'] += 1
                pbar.update(1)

    # Calculate duration
    stats['end_time'] = datetime.now(timezone.utc)
    stats['duration_seconds'] = (stats['end_time'] - stats['start_time']).total_seconds()

    logging.info(f"Processing complete. Processed {stats['files_processed']} files, "
                 f"skipped {stats['files_skipped']} files, "
                 f"errors in {stats['files_error']} files. "
                 f"Inserted {stats['rows_inserted']} data points in {stats['duration_seconds']:.2f} seconds.")

    return stats

## 6. Execute the Data Upload Process

In [7]:
# Create database connection
try:
    engine = create_db_connection()
    logging.info("Database connection established successfully")
except Exception as e:
    logging.error(f"Failed to connect to database: {str(e)}")
    raise

2025-05-15 16:02:03,654 - INFO - Database connection successful: localhost:5432/perocube
2025-05-15 16:02:03,655 - INFO - Database connection established successfully
2025-05-15 16:02:03,655 - INFO - Database connection established successfully


In [8]:
# Execute schema changes to add sensor identifier columns
schema_changes = text("""
-- Add columns to irradiance_sensor table
ALTER TABLE irradiance_sensor 
ADD COLUMN IF NOT EXISTS sensor_identifier VARCHAR(50),
ADD COLUMN IF NOT EXISTS channel INTEGER;

-- Add columns to temperature_sensor table
ALTER TABLE temperature_sensor 
ADD COLUMN IF NOT EXISTS sensor_identifier VARCHAR(50);

-- Add unique constraints
CREATE UNIQUE INDEX IF NOT EXISTS irr_sensor_identifier_channel_idx 
ON irradiance_sensor (sensor_identifier, channel);

CREATE UNIQUE INDEX IF NOT EXISTS temp_sensor_identifier_idx 
ON temperature_sensor (sensor_identifier);
""")

try:
    with engine.connect() as conn:
        conn.execute(schema_changes)
        conn.commit()
        logging.info("Schema changes applied successfully")
except Exception as e:
    logging.error(f"Failed to apply schema changes: {str(e)}")
    raise

2025-05-15 16:02:03,684 - INFO - Schema changes applied successfully


In [9]:
# Step 1: Register sensors in the database
print(f"Starting sensor registration from directory: {ROOT_DIRECTORY}")
print("\n1. Registering irradiance sensors...")
irr_sensor_stats = register_irradiance_sensors(ROOT_DIRECTORY, engine)

print("\n2. Registering temperature sensors...")
temp_sensor_stats = register_temperature_sensors(ROOT_DIRECTORY, engine)

2025-05-15 16:02:03,755 - INFO - Found 0 unique irradiance sensors
2025-05-15 16:02:03,756 - INFO - Registered 0 new irradiance sensors
2025-05-15 16:02:03,756 - INFO - Found 2 existing irradiance sensors
2025-05-15 16:02:03,756 - INFO - Registered 0 new irradiance sensors
2025-05-15 16:02:03,756 - INFO - Found 2 existing irradiance sensors
2025-05-15 16:02:03,762 - INFO - Found 0 unique temperature sensors
2025-05-15 16:02:03,763 - INFO - Registered 0 new temperature sensors
2025-05-15 16:02:03,763 - INFO - Found 5 existing temperature sensors
2025-05-15 16:02:03,762 - INFO - Found 0 unique temperature sensors
2025-05-15 16:02:03,763 - INFO - Registered 0 new temperature sensors
2025-05-15 16:02:03,763 - INFO - Found 5 existing temperature sensors


Starting sensor registration from directory: ../../sample_data/datasets/PeroCube-sample-data

1. Registering irradiance sensors...

2. Registering temperature sensors...


In [10]:
# Debug cell to check registered sensors and their keys
print("\nRegistered irradiance sensors in database:")
query = text(f"""SELECT sensor_identifier, channel, irradiance_sensor_id FROM {IRRADIANCE_SENSOR_TABLE}""")
sensor_mapping = {}
with engine.connect() as conn:
    result = conn.execute(query)
    for idx, row in enumerate(result):
        sensor_key = f"{row.sensor_identifier}_{row.channel}"
        sensor_mapping[sensor_key] = row.irradiance_sensor_id
        # Print first few rows to verify format
        if idx < 5:
            print(f"  - Sensor: {sensor_key} -> {row.irradiance_sensor_id}")
            
print(f"\nTotal registered irradiance sensors: {len(sensor_mapping)}")
print("\nExample file names that will be processed:")

# Show some example filenames to compare with registered keys
import glob
example_files = glob.glob(f"{ROOT_DIRECTORY}/**/PT-*.txt", recursive=True)[:3]
for file in example_files:
    print(f"  - {os.path.basename(file)}")
    # Extract sensor info using the same pattern
    match = re.search(IRRADIANCE_FILE_PATTERN, os.path.basename(file))
    if match:
        sensor_number = match.group(1)  
        channel = int(match.group(2))
        sensor_identifier = f"PT-{sensor_number}"
        sensor_key = f"{sensor_identifier}_{channel}"  # Using the same key format as in processing
        print(f"    Would use key: {sensor_key}")
        print(f"    Sensor exists in DB: {sensor_key in sensor_mapping}")


Registered irradiance sensors in database:
  - Sensor: PT-104_1 -> 7c9b9624-a1f5-4f53-b007-7a0e80d32b93
  - Sensor: PT-104_3 -> 46688ea2-1e86-4120-9693-7836d532e155

Total registered irradiance sensors: 2

Example file names that will be processed:
  - PT-104_channel_01.txt
    Would use key: PT-104_1
    Sensor exists in DB: True
  - PT-104_channel_03.txt
    Would use key: PT-104_3
    Sensor exists in DB: True
  - PT-104_channel_01.txt
    Would use key: PT-104_1
    Sensor exists in DB: True


In [11]:
# Review current validation configuration
print_validation_config()

# Uncomment and modify these lines to change validation settings
# VALIDATION_CONFIG['enabled'] = True
# VALIDATION_CONFIG['validate_ranges'] = True
# VALIDATION_CONFIG['ranges']['irradiance']['max'] = 2000  # Adjust range if needed
# VALIDATION_CONFIG['ranges']['temperature']['min'] = -60  # Adjust range if needed

# Review current validation configuration before processing data
print_validation_config()

# Uncomment and modify these lines to change validation settings
# VALIDATION_CONFIG['enabled'] = True
# VALIDATION_CONFIG['validate_ranges'] = True
# VALIDATION_CONFIG['ranges']['irradiance']['max'] = 2000  # Adjust range if needed
# VALIDATION_CONFIG['ranges']['temperature']['min'] = -60  # Adjust range if needed

# Step 2: Process and upload irradiance data
print("\n3. Processing irradiance data files...")
irr_stats = process_irradiance_files(ROOT_DIRECTORY, engine)

2025-05-15 16:02:03,899 - INFO - Found 35 irradiance data files to process



Data Validation Configuration:
- Validation enabled: False
- Remove NaN values: True

Physical value validation is disabled

Data Validation Configuration:
- Validation enabled: False
- Remove NaN values: True

Physical value validation is disabled

3. Processing irradiance data files...


Processing Irradiance Files:   0%|          | 0/35 [00:00<?, ?it/s]

2025-05-15 16:02:04,358 - INFO - Successfully uploaded 10714 rows from ../../sample_data/datasets/PeroCube-sample-data/data_20240319/data/PT-104_channel_01.txt
2025-05-15 16:02:04,758 - INFO - Successfully uploaded 10714 rows from ../../sample_data/datasets/PeroCube-sample-data/data_20240319/data/PT-104_channel_03.txt
2025-05-15 16:02:04,758 - INFO - Successfully uploaded 10714 rows from ../../sample_data/datasets/PeroCube-sample-data/data_20240319/data/PT-104_channel_03.txt
2025-05-15 16:02:35,197 - INFO - Successfully uploaded 7552 rows from ../../sample_data/datasets/PeroCube-sample-data/data_20240222/data/PT-104_channel_01.txt
2025-05-15 16:02:35,197 - INFO - Successfully uploaded 7552 rows from ../../sample_data/datasets/PeroCube-sample-data/data_20240222/data/PT-104_channel_01.txt
2025-05-15 16:02:35,501 - INFO - Successfully uploaded 7552 rows from ../../sample_data/datasets/PeroCube-sample-data/data_20240222/data/PT-104_channel_03.txt
2025-05-15 16:02:35,501 - INFO - Successful

In [12]:
# Step 3: Process and upload temperature data
print("\n4. Processing temperature data files...")
temp_stats = process_temperature_files(ROOT_DIRECTORY, engine)

2025-05-15 16:04:29,427 - INFO - Found 43 temperature data files to process



4. Processing temperature data files...


Processing Temperature Files:   0%|          | 0/43 [00:00<?, ?it/s]



2025-05-15 16:04:29,452 - INFO - Processing complete. Processed 0 files, skipped 43 files, errors in 0 files. Inserted 0 data points in 0.03 seconds.


## 7. Results Summary

In [13]:
# Display processing statistics
print("\n===== UPLOAD SUMMARY =====\n")

# Sensor registration summary
print("SENSOR REGISTRATION:")
print(f"- Irradiance Sensors: {irr_sensor_stats.get('sensors_found', 0)} found, "
      f"{irr_sensor_stats.get('sensors_registered', 0)} newly registered, "
      f"{irr_sensor_stats.get('sensors_existing', 0)} already existing")

print(f"- Temperature Sensors: {temp_sensor_stats.get('sensors_found', 0)} found, "
      f"{temp_sensor_stats.get('sensors_registered', 0)} newly registered, "
      f"{temp_sensor_stats.get('sensors_existing', 0)} already existing")

# Data processing summary
print("\nDATA PROCESSING:")
print("Irradiance Data:")
print(f"- Files processed: {irr_stats.get('files_processed', 0)}")
print(f"- Files skipped: {irr_stats.get('files_skipped', 0)}")
print(f"- Files with errors: {irr_stats.get('files_error', 0)}")
print(f"- Rows inserted: {irr_stats.get('rows_inserted', 0)}")
if 'duration_seconds' in irr_stats:
    print(f"- Processing time: {irr_stats['duration_seconds']:.2f} seconds")

print("\nTemperature Data:")
print(f"- Files processed: {temp_stats.get('files_processed', 0)}")
print(f"- Files skipped: {temp_stats.get('files_skipped', 0)}")
print(f"- Files with errors: {temp_stats.get('files_error', 0)}")
print(f"- Rows inserted: {temp_stats.get('rows_inserted', 0)}")
if 'duration_seconds' in temp_stats:
    print(f"- Processing time: {temp_stats['duration_seconds']:.2f} seconds")

# Verify database counts
try:
    print("\nDATABASE VERIFICATION:")
    with engine.connect() as conn:
        # Get irradiance data counts
        result = conn.execute(text(f"SELECT COUNT(*) FROM {IRRADIANCE_MEASUREMENT_TABLE}"))
        irradiance_count = result.scalar()
        print(f"- Total irradiance measurements: {irradiance_count}")
        
        # Get temperature data counts
        result = conn.execute(text(f"SELECT COUNT(*) FROM {TEMPERATURE_MEASUREMENT_TABLE}"))
        temperature_count = result.scalar()
        print(f"- Total temperature measurements: {temperature_count}")
        
        # Get sensor counts
        result = conn.execute(text(f"SELECT COUNT(*) FROM {IRRADIANCE_SENSOR_TABLE}"))
        irradiance_sensor_count = result.scalar()
        print(f"- Total irradiance sensors: {irradiance_sensor_count}")
        
        result = conn.execute(text(f"SELECT COUNT(*) FROM {TEMPERATURE_SENSOR_TABLE}"))
        temperature_sensor_count = result.scalar()
        print(f"- Total temperature sensors: {temperature_sensor_count}")
        
except Exception as e:
    print(f"Could not query database: {str(e)}")


===== UPLOAD SUMMARY =====

SENSOR REGISTRATION:
- Irradiance Sensors: 0 found, 0 newly registered, 2 already existing
- Temperature Sensors: 0 found, 0 newly registered, 5 already existing

DATA PROCESSING:
Irradiance Data:
- Files processed: 32
- Files skipped: 0
- Files with errors: 3
- Rows inserted: 2107390
- Processing time: 145.51 seconds

Temperature Data:
- Files processed: 0
- Files skipped: 43
- Files with errors: 0
- Rows inserted: 0
- Processing time: 0.03 seconds

DATABASE VERIFICATION:
- Total irradiance measurements: 2152390
- Total temperature measurements: 0
- Total irradiance sensors: 2
- Total temperature sensors: 5
- Total irradiance measurements: 2152390
- Total temperature measurements: 0
- Total irradiance sensors: 2
- Total temperature sensors: 5
