# Perocube Logbook Data Upload Notebook

This notebook processes and uploads logbook data from the Perocube Excel file to the TimescaleDB database.

## Purpose
- Read logbook data from 'PeroCube_logbook_example.xlsx'
- Parse and validate the data
- Upload the data to the TimescaleDB database
- Avoid duplicate data entries

## Prerequisites
- Running TimescaleDB instance (configured in docker-compose.yml)
- Access to the Perocube logbook Excel file
- Environment variables configured in .env file (for database connection)

## 1. Setup and Imports

Import required libraries and install any missing dependencies.

In [71]:
# Core data processing libraries
import os
import pandas as pd
from datetime import datetime, timezone
from pathlib import Path

# Database libraries
from sqlalchemy import create_engine, text

# Progress tracking
from tqdm.notebook import tqdm

# Environment variables
from dotenv import load_dotenv

# Logging
import logging
logging.basicConfig(level=logging.INFO,
                   format='%(asctime)s - %(levelname)s - %(message)s')

In [72]:
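# Install required packages if not already installed.
# pathlib is part of the Python 3 standard library, so it is not installed from PyPI.
# %pip installs into the environment of the running kernel.
%pip install psycopg2-binary sqlalchemy pandas tqdm python-dotenv openpyxl
import psycopg2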



## 2. Configuration

Load configuration from environment variables and set up constants.

In [73]:
# Look for the .env file two directories up from the notebook location
dotenv_path = Path("../../.env")
load_dotenv(dotenv_path)

# Database configuration from environment variables with fallbacks
DB_CONFIG = {
    'host': os.getenv('DB_HOST', 'localhost'),
    'port': int(os.getenv('DB_PORT', 5432)),
    'database': os.getenv('DB_NAME', 'perocube'),
    'user': os.getenv('DB_USER', 'postgres'),
    'password': os.getenv('DB_PASSWORD', 'postgres')
}

# Print database connection info (excluding password)
print(f"Database connection: {DB_CONFIG['host']}:{DB_CONFIG['port']}/{DB_CONFIG['database']} as {DB_CONFIG['user']}")

# Data directory and file configuration
ROOT_DIRECTORY = os.getenv('DEFAULT_DATA_DIR', "../../sample_data/datasets/PeroCube-sample-data")
LOGBOOK_FILE = "PeroCube_logbook_example.xlsx"
LOGBOOK_SHEET = "Perocube history"

# Batch size for database operations
BATCH_SIZE = 1000

Database connection: localhost:5432/perocube as postgres
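
The configuration above only stores the individual connection fields. As a minimal sketch, assuming the upload later goes through SQLAlchemy with the psycopg2 driver, the fields could be combined into a connection URL as follows. The helper name `build_db_url` and the example password are made up for illustration; they are not part of the notebook.

```python
from urllib.parse import quote_plus


def build_db_url(config: dict) -> str:
    """Build a SQLAlchemy-style PostgreSQL URL from a DB_CONFIG-like dict."""
    # quote_plus escapes characters such as '@' or '/' that would otherwise
    # break the URL when they appear in the user name or password.
    user = quote_plus(str(config["user"]))
    password = quote_plus(str(config["password"]))
    return (
        f"postgresql+psycopg2://{user}:{password}"
        f"@{config['host']}:{config['port']}/{config['database']}"
    )


example_config = {
    "host": "localhost",
    "port": 5432,
    "database": "perocube",
    "user": "postgres",
    "password": "p@ss/word",  # made-up password for illustration
}
db_url = build_db_url(example_config)
print(db_url)
```

The resulting string can be passed to `create_engine`. Escaping the credentials matters as soon as the password contains URL-reserved characters.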


## 3. Read and Process Logbook Data

In [74]:
# Create the full path to the logbook file
logbook_path = Path(ROOT_DIRECTORY) / LOGBOOK_FILE

# Read the Excel sheet, skip first row and use second row as header
try:
    df = pd.read_excel(logbook_path, sheet_name=LOGBOOK_SHEET, header=1)
    print(f"Successfully read {len(df)} rows from {LOGBOOK_FILE}")
    
    # Display the first few rows and data info
    print("\nColumn names:")
    print(df.columns.tolist())
    
    print("\nFirst few rows of the data:")
    display(df.head())
    
    print("\nDataset information:")
    display(df.info())
    
except Exception as e:
    print(f"Error reading Excel file: {e}")
    raise  # stop here: every later cell depends on df

Successfully read 138 rows from PeroCube_logbook_example.xlsx

Column names:
['Date removed', 'Unnamed: 1', 'Board', 'Channel', 'Status', 'Cell type', 'Date installed', 'Cell name', 'Pixel', 'Area', 'Init.PCE', 'Encap.', 'Structure', 'Producer', 'Owner', 'Project', 'Temp sensor', 'Comment 1', 'Comment 2']

First few rows of the data:


Unnamed: 0,Date removed,Unnamed: 1,Board,Channel,Status,Cell type,Date installed,Cell name,Pixel,Area,Init.PCE,Encap.,Structure,Producer,Owner,Project,Temp sensor,Comment 1,Comment 2
0,15.07.2022,,3.0,14.0,,Pero,09.02.2022,T2-2,C,,,,,,Giuxian,WITH polymere,,ok - 0.5V,
1,15.07.2022,,3.0,15.0,,Pero,09.02.2022,T2-2,E,,,,,,Giuxian,WITH polymere,,ok - 0.5V,
2,15.07.2022,,3.0,16.0,,Pero,09.02.2022,GK-6-1,B,,,,,,Giuxian,WITH polymere,,ok - 0.5V,
3,15.07.2022,,3.0,17.0,,Pero,09.02.2022,GK-6-1,D,,,,,,Giuxian,WITH polymere,,ok - 0.5V,
4,15.07.2022,,3.0,18.0,,Pero,09.02.2022,GK-6-1,F,,,,,,Giuxian,WITH polymere,,ok - 0.6V,



Dataset information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 138 entries, 0 to 137
Data columns (total 19 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Date removed    132 non-null    object 
 1   Unnamed: 1      0 non-null      float64
 2   Board           118 non-null    float64
 3   Channel         118 non-null    float64
 4   Status          0 non-null      float64
 5   Cell type       118 non-null    object 
 6   Date installed  118 non-null    object 
 7   Cell name       117 non-null    object 
 8   Pixel           84 non-null     object 
 9   Area            84 non-null     float64
 10  Init.PCE        46 non-null     float64
 11  Encap.          61 non-null     object 
 12  Structure       67 non-null     object 
 13  Producer        65 non-null     object 
 14  Owner           99 non-null     object 
 15  Project         49 non-null     object 
 16  Temp sensor     4 non-null      object 
 17  Comment 1    

None

In [75]:
# Check for invalid date entries ('??') and missing cell names

# 1. Check for invalid date entries
invalid_dates = df['Date removed'].astype(str).str.contains(r'\?')
if invalid_dates.any():
    print(f"Found {invalid_dates.sum()} rows with invalid removal dates ('??')")
    print("\nSample of rows with invalid dates:")
    display(df[invalid_dates][['Cell name', 'Pixel', 'Date removed', 'Date installed']])

# 2. Check for missing cell names
missing_cell_names = df['Cell name'].isna() | (df['Cell name'].astype(str).str.strip() == '')
if missing_cell_names.any():
    print(f"\nFound {missing_cell_names.sum()} rows with missing cell names")
    print("\nSample of rows with missing cell names:")
    display(df[missing_cell_names][['Cell name', 'Pixel', 'Date removed', 'Date installed']])

# Remove rows with either invalid dates or missing cell names
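# .copy() makes df an independent frame, so later column assignments
# do not trigger SettingWithCopyWarning
df = df[~(invalid_dates | missing_cell_names)].copy()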
print(f"\nRemaining rows after removal: {len(df)}")


Found 4 rows with invalid removal dates ('??')

Sample of rows with invalid dates:


Unnamed: 0,Cell name,Pixel,Date removed,Date installed
19,QE-020622-03,3,??.??.2022,13.6.2022
20,QE-020622-04,2,??.??.2022,13.6.2022
21,QE-020622-01,3,??.??.2022,13.6.2022
22,QE-020622-02,2,??.??.2022,13.6.2022



Found 21 rows with missing cell names

Sample of rows with missing cell names:


Unnamed: 0,Cell name,Pixel,Date removed,Date installed
11,,,x,
18,,,x,
23,,,x,
24,,,09.12.2022,15.7.2026
25,,,x,
44,,,x,
49,,,x,
54,,,x,
60,,,x,
62,,,x,



Remaining rows after removal: 113
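
The check above only looks for `?` characters. Other non-date tokens such as `x` are dropped here only because those rows also lack a cell name. A stricter alternative, sketched below under the assumption that every valid date follows the `day.month.year` pattern, is to try parsing each raw value and classify the result. The function name `classify_date` is hypothetical.

```python
from datetime import datetime

DATE_FORMAT = "%d.%m.%Y"


def classify_date(value) -> str:
    """Return 'missing', 'valid', or 'invalid' for one raw logbook date cell."""
    # Treat None, float NaN, and blank strings as missing.
    if value is None or (isinstance(value, float) and value != value):
        return "missing"
    text = str(value).strip()
    if text == "":
        return "missing"
    try:
        datetime.strptime(text, DATE_FORMAT)
    except ValueError:
        return "invalid"
    return "valid"


samples = ["15.07.2022", "13.6.2022", "??.??.2022", "x", None]
labels = {str(s): classify_date(s) for s in samples}
print(labels)
```

Applied with `df['Date removed'].map(classify_date)`, this would give a per-row label that can be counted or filtered before any rows are dropped.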


In [76]:
# Sanity check before date conversion: the previous cell already removed rows
# with '??' dates, so this cell should find nothing and print no output
invalid_dates = df['Date removed'].astype(str).str.contains(r'\?', regex=True)
if invalid_dates.any():
    print(f"Found {invalid_dates.sum()} rows with invalid removal dates ('??')")
    print("\nSample of rows to be removed:")
    display(df[invalid_dates][['Cell name', 'Pixel', 'Date removed', 'Date installed']])
    
    # Remove these rows
    df = df[~invalid_dates]
    print(f"\nRemaining rows after removal: {len(df)}")

In [77]:
# Analyze current dataframe state
print("Checking for unnamed columns:")
unnamed_cols = [col for col in df.columns if 'Unnamed' in str(col)]
print(f"Unnamed columns found: {unnamed_cols}")

print("\nCurrent data types:")
print(df.dtypes)

print("\nMissing values per column:")
print(df.isnull().sum())

print("\nTotal rows with all missing values:")
print(df.isna().all(axis=1).sum())

Checking for unnamed columns:
Unnamed columns found: ['Unnamed: 1']

Current data types:
Date removed       object
Unnamed: 1        float64
Board             float64
Channel           float64
Status            float64
Cell type          object
Date installed     object
Cell name          object
Pixel              object
Area              float64
Init.PCE          float64
Encap.             object
Structure          object
Producer           object
Owner              object
Project            object
Temp sensor        object
Comment 1          object
Comment 2         float64
dtype: object

Missing values per column:
Date removed        0
Unnamed: 1        113
Board               0
Channel             0
Status            113
Cell type           0
Date installed      0
Cell name           0
Pixel              33
Area               29
Init.PCE           67
Encap.             52
Structure          51
Producer           48
Owner              19
Project            64
Temp sensor       109
C

In [78]:
# Clean the dataframe

# 1. Remove unnamed columns
df = df.loc[:, ~df.columns.str.contains('Unnamed')]

# 2. Drop rows where all values are missing
df = df.dropna(how='all')

# Print cleaning results
print("Dataframe shape after cleaning:")
print(f"Shape: {df.shape}")

# Display updated column list
print("\nUpdated column names:")
print(df.columns.tolist())

# Display updated missing values count
print("\nMissing values per column after cleaning:")
print(df.isnull().sum())

# Display first few rows of cleaned data
print("\nFirst few rows of cleaned data:")
display(df.head())

Dataframe shape after cleaning:
Shape: (113, 18)

Updated column names:
['Date removed', 'Board', 'Channel', 'Status', 'Cell type', 'Date installed', 'Cell name', 'Pixel', 'Area', 'Init.PCE', 'Encap.', 'Structure', 'Producer', 'Owner', 'Project', 'Temp sensor', 'Comment 1', 'Comment 2']

Missing values per column after cleaning:
Date removed        0
Board               0
Channel             0
Status            113
Cell type           0
Date installed      0
Cell name           0
Pixel              33
Area               29
Init.PCE           67
Encap.             52
Structure          51
Producer           48
Owner              19
Project            64
Temp sensor       109
Comment 1          55
Comment 2         113
dtype: int64

First few rows of cleaned data:


Unnamed: 0,Date removed,Board,Channel,Status,Cell type,Date installed,Cell name,Pixel,Area,Init.PCE,Encap.,Structure,Producer,Owner,Project,Temp sensor,Comment 1,Comment 2
0,15.07.2022,3.0,14.0,,Pero,09.02.2022,T2-2,C,,,,,,Giuxian,WITH polymere,,ok - 0.5V,
1,15.07.2022,3.0,15.0,,Pero,09.02.2022,T2-2,E,,,,,,Giuxian,WITH polymere,,ok - 0.5V,
2,15.07.2022,3.0,16.0,,Pero,09.02.2022,GK-6-1,B,,,,,,Giuxian,WITH polymere,,ok - 0.5V,
3,15.07.2022,3.0,17.0,,Pero,09.02.2022,GK-6-1,D,,,,,,Giuxian,WITH polymere,,ok - 0.5V,
4,15.07.2022,3.0,18.0,,Pero,09.02.2022,GK-6-1,F,,,,,,Giuxian,WITH polymere,,ok - 0.6V,


In [79]:
# Remove completely empty Comment 2 column
df = df.drop('Comment 2', axis=1)

# Print updated dataframe info
print("Dataframe shape after removing Comment 2:")
print(f"Shape: {df.shape}")

# Display updated column list
print("\nUpdated column names:")
print(df.columns.tolist())

# Display first few rows of cleaned data:
print("\nFirst few rows of cleaned data:")
display(df.head())

Dataframe shape after removing Comment 2:
Shape: (113, 17)

Updated column names:
['Date removed', 'Board', 'Channel', 'Status', 'Cell type', 'Date installed', 'Cell name', 'Pixel', 'Area', 'Init.PCE', 'Encap.', 'Structure', 'Producer', 'Owner', 'Project', 'Temp sensor', 'Comment 1']

First few rows of cleaned data:


Unnamed: 0,Date removed,Board,Channel,Status,Cell type,Date installed,Cell name,Pixel,Area,Init.PCE,Encap.,Structure,Producer,Owner,Project,Temp sensor,Comment 1
0,15.07.2022,3.0,14.0,,Pero,09.02.2022,T2-2,C,,,,,,Giuxian,WITH polymere,,ok - 0.5V
1,15.07.2022,3.0,15.0,,Pero,09.02.2022,T2-2,E,,,,,,Giuxian,WITH polymere,,ok - 0.5V
2,15.07.2022,3.0,16.0,,Pero,09.02.2022,GK-6-1,B,,,,,,Giuxian,WITH polymere,,ok - 0.5V
3,15.07.2022,3.0,17.0,,Pero,09.02.2022,GK-6-1,D,,,,,,Giuxian,WITH polymere,,ok - 0.5V
4,15.07.2022,3.0,18.0,,Pero,09.02.2022,GK-6-1,F,,,,,,Giuxian,WITH polymere,,ok - 0.6V


In [80]:
# Fix data types

# 1. Convert date columns to datetime
df['Date removed'] = pd.to_datetime(df['Date removed'], format='%d.%m.%Y', errors='coerce')
df['Date installed'] = pd.to_datetime(df['Date installed'], format='%d.%m.%Y', errors='coerce')

# 2. Convert numeric columns
numeric_columns = ['Board', 'Channel', 'Status', 'Area', 'Init.PCE']
for col in numeric_columns:
    df[col] = pd.to_numeric(df[col], errors='coerce')

# 3. Ensure string columns are properly formatted (strip whitespace)
string_columns = ['Cell type', 'Cell name', 'Pixel', 'Encap.', 'Structure', 
                  'Producer', 'Owner', 'Project', 'Temp sensor', 'Comment 1']
for col in string_columns:
    df[col] = df[col].astype(str).str.strip()
    # Replace 'nan' strings with actual NaN
    df[col] = df[col].replace('nan', pd.NA)

# Display the updated data types
print("Updated data types:")
print(df.dtypes)

# Display sample of the data to verify conversions
print("\nSample of converted data:")
display(df.head(25))

# Check for any conversion issues (invalid dates or numbers)
print("\nCount of NaN values after type conversion:")
print(df.isna().sum())

Updated data types:
Date removed      datetime64[ns]
Board                    float64
Channel                  float64
Status                   float64
Cell type                 object
Date installed    datetime64[ns]
Cell name                 object
Pixel                     object
Area                     float64
Init.PCE                 float64
Encap.                    object
Structure                 object
Producer                  object
Owner                     object
Project                   object
Temp sensor               object
Comment 1                 object
dtype: object

Sample of converted data:


Unnamed: 0,Date removed,Board,Channel,Status,Cell type,Date installed,Cell name,Pixel,Area,Init.PCE,Encap.,Structure,Producer,Owner,Project,Temp sensor,Comment 1
0,2022-07-15,3.0,14.0,,Pero,2022-02-09,T2-2,C,,,,,,Giuxian,WITH polymere,,ok - 0.5V
1,2022-07-15,3.0,15.0,,Pero,2022-02-09,T2-2,E,,,,,,Giuxian,WITH polymere,,ok - 0.5V
2,2022-07-15,3.0,16.0,,Pero,2022-02-09,GK-6-1,B,,,,,,Giuxian,WITH polymere,,ok - 0.5V
3,2022-07-15,3.0,17.0,,Pero,2022-02-09,GK-6-1,D,,,,,,Giuxian,WITH polymere,,ok - 0.5V
4,2022-07-15,3.0,18.0,,Pero,2022-02-09,GK-6-1,F,,,,,,Giuxian,WITH polymere,,ok - 0.6V
5,2022-07-15,3.0,19.0,,Pero,2022-02-09,QE-060122-1,A,,,,,,Giuxian,w/o polymere (reference),,ok
6,2022-07-15,3.0,20.0,,Pero,2022-02-09,QE-060122-1,C,,,,,,Giuxian,w/o polymere (reference),,/
7,2022-07-15,3.0,21.0,,Pero,2022-02-09,QE-060122-1,E,,,,,,Giuxian,w/o polymere (reference),,/
8,2022-07-15,3.0,22.0,,Pero,2022-02-09,QE-060122-3,B,,,,,,Giuxian,w/o polymere (reference),,ok
9,2022-07-15,3.0,23.0,,Pero,2022-02-09,QE-060122-3,D,,,,,,Giuxian,w/o polymere (reference),,ok



Count of NaN values after type conversion:
Date removed        0
Board               0
Channel             0
Status            113
Cell type           0
Date installed      2
Cell name           0
Pixel              33
Area               29
Init.PCE           67
Encap.             52
Structure          51
Producer           48
Owner              19
Project            64
Temp sensor       109
Comment 1          55
dtype: int64
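
Because `errors='coerce'` silently turns unparseable values into `NaT`, the count for 'Date installed' rises from 0 missing before conversion to 2 afterwards. A small sketch of how the offending raw values could be listed is shown below; it runs on made-up sample data rather than on the logbook itself.

```python
import pandas as pd

# Made-up sample: two valid dates, one unparseable token, one missing value.
raw = pd.Series(["15.07.2022", "x", None, "09.02.2022"], name="Date installed")
parsed = pd.to_datetime(raw, format="%d.%m.%Y", errors="coerce")

# A value "failed" if it was present in the raw data but is NaT after parsing.
failed_mask = raw.notna() & parsed.isna()
failed_values = raw[failed_mask].tolist()
print(failed_values)
```

In the notebook, this would require keeping a copy of the raw column before it is overwritten by the converted one.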


In [81]:
# Remove rows where both dates are null
original_rows = len(df)
df = df.dropna(subset=['Date removed', 'Date installed'], how='all')

# Print results
print(f"Removed {original_rows - len(df)} rows where both dates were missing")
print(f"Remaining rows: {len(df)}")

# Display missing values count after removal
print("\nMissing values per column after removing rows with missing dates:")
print(df.isna().sum())

# Display sample of remaining data
print("\nSample of cleaned data:")
display(df.head(25))

Removed 0 rows where both dates were missing
Remaining rows: 113

Missing values per column after removing rows with missing dates:
Date removed        0
Board               0
Channel             0
Status            113
Cell type           0
Date installed      2
Cell name           0
Pixel              33
Area               29
Init.PCE           67
Encap.             52
Structure          51
Producer           48
Owner              19
Project            64
Temp sensor       109
Comment 1          55
dtype: int64

Sample of cleaned data:


Unnamed: 0,Date removed,Board,Channel,Status,Cell type,Date installed,Cell name,Pixel,Area,Init.PCE,Encap.,Structure,Producer,Owner,Project,Temp sensor,Comment 1
0,2022-07-15,3.0,14.0,,Pero,2022-02-09,T2-2,C,,,,,,Giuxian,WITH polymere,,ok - 0.5V
1,2022-07-15,3.0,15.0,,Pero,2022-02-09,T2-2,E,,,,,,Giuxian,WITH polymere,,ok - 0.5V
2,2022-07-15,3.0,16.0,,Pero,2022-02-09,GK-6-1,B,,,,,,Giuxian,WITH polymere,,ok - 0.5V
3,2022-07-15,3.0,17.0,,Pero,2022-02-09,GK-6-1,D,,,,,,Giuxian,WITH polymere,,ok - 0.5V
4,2022-07-15,3.0,18.0,,Pero,2022-02-09,GK-6-1,F,,,,,,Giuxian,WITH polymere,,ok - 0.6V
5,2022-07-15,3.0,19.0,,Pero,2022-02-09,QE-060122-1,A,,,,,,Giuxian,w/o polymere (reference),,ok
6,2022-07-15,3.0,20.0,,Pero,2022-02-09,QE-060122-1,C,,,,,,Giuxian,w/o polymere (reference),,/
7,2022-07-15,3.0,21.0,,Pero,2022-02-09,QE-060122-1,E,,,,,,Giuxian,w/o polymere (reference),,/
8,2022-07-15,3.0,22.0,,Pero,2022-02-09,QE-060122-3,B,,,,,,Giuxian,w/o polymere (reference),,ok
9,2022-07-15,3.0,23.0,,Pero,2022-02-09,QE-060122-3,D,,,,,,Giuxian,w/o polymere (reference),,ok


In [82]:
# Show rows where installation date is missing
missing_install_date = df[df['Date installed'].isna()]

print(f"Found {len(missing_install_date)} rows with missing installation date:\n")
display(missing_install_date)

Found 2 rows with missing installation date:



Unnamed: 0,Date removed,Board,Channel,Status,Cell type,Date installed,Cell name,Pixel,Area,Init.PCE,Encap.,Structure,Producer,Owner,Project,Temp sensor,Comment 1
118,2024-02-29,4.0,1.0,,Pero,NaT,BO - 1C,C,0.16,,,,,"Jinzhao, FAPI",,,not working
119,2024-02-29,4.0,3.0,,Pero,NaT,BO - 1E,E,0.16,,,,,"Jinzhao, FAPI",,,not working


In [83]:
# Fill missing installation dates with a placeholder date (January 1st, 2020);
# the real installation dates of these rows are unknown
default_install_date = pd.to_datetime('2020-01-01')
df['Date installed'] = df['Date installed'].fillna(default_install_date)

# Verify the changes
print("Checking for any remaining missing installation dates:")
print(f"Missing installation dates: {df['Date installed'].isna().sum()}")

# Display the rows that were updated
print("\nVerifying the rows that were updated:")
display(df[df['Date installed'] == default_install_date])

Checking for any remaining missing installation dates:
Missing installation dates: 0

Verifying the rows that were updated:


Unnamed: 0,Date removed,Board,Channel,Status,Cell type,Date installed,Cell name,Pixel,Area,Init.PCE,Encap.,Structure,Producer,Owner,Project,Temp sensor,Comment 1
118,2024-02-29,4.0,1.0,,Pero,2020-01-01,BO - 1C,C,0.16,,,,,"Jinzhao, FAPI",,,not working
119,2024-02-29,4.0,3.0,,Pero,2020-01-01,BO - 1E,E,0.16,,,,,"Jinzhao, FAPI",,,not working
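
Once the placeholder is written, imputed rows can no longer be told apart from rows that were genuinely installed on that date. One possible safeguard, sketched here on made-up data, is to record a flag before filling. The column name `install_date_imputed` is an assumption and does not exist in the database schema.

```python
import pandas as pd

demo = pd.DataFrame({
    "Cell name": ["T2-2", "BO - 1C"],
    "Date installed": pd.to_datetime(["2022-02-09", None]),
})

default_install_date = pd.Timestamp("2020-01-01")

# Record which rows are about to receive the placeholder *before* filling.
demo["install_date_imputed"] = demo["Date installed"].isna()
demo["Date installed"] = demo["Date installed"].fillna(default_install_date)
print(demo)
```

The flag can be dropped again before upload, or used to skip the placeholder rows when connection events are created.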


In [84]:
# Convert Board and Channel to integers (as per database schema)
df['Board'] = df['Board'].astype('Int64')  # Using Int64 to handle NaN values
df['Channel'] = df['Channel'].astype('Int64')  # Using Int64 to handle NaN values

# Verify the conversion
print("Updated data types for Board and Channel:")
print(df[['Board', 'Channel']].dtypes)

# Display a sample to verify the conversion
print("\nSample of Board and Channel data:")
display(df[['Board', 'Channel']].head(10))

Updated data types for Board and Channel:
Board      Int64
Channel    Int64
dtype: object

Sample of Board and Channel data:


Unnamed: 0,Board,Channel
0,3,14
1,3,15
2,3,16
3,3,17
4,3,18
5,3,19
6,3,20
7,3,21
8,3,22
9,3,23


In [85]:
# Align numeric types with database schema
df['Area'] = df['Area'].astype('float64')
df['Init.PCE'] = df['Init.PCE'].astype('float64')

# Validate and truncate string columns to match VARCHAR(255)
string_columns = ['Cell type', 'Cell name', 'Pixel', 'Encap.', 'Structure', 
                 'Producer', 'Owner', 'Project', 'Temp sensor', 'Comment 1']

for col in string_columns:
    # Check if any string is longer than 255 characters
    mask = df[col].str.len() > 255
    if mask.any():
        print(f"Warning: Found {mask.sum()} values in {col} longer than 255 characters. Truncating...")
        df.loc[mask, col] = df.loc[mask, col].str.slice(0, 255)

# Display updated data types
print("\nUpdated data types after database alignment:")
print(df.dtypes)

# Check for any values that might be too long
print("\nMaximum string lengths:")
for col in string_columns:
    max_len = df[col].str.len().max()
    print(f"{col}: {max_len}")

# Display sample of numeric columns
print("\nSample of numeric columns:")
display(df[['Area', 'Init.PCE']].head())


Updated data types after database alignment:
Date removed      datetime64[ns]
Board                      Int64
Channel                    Int64
Status                   float64
Cell type                 object
Date installed    datetime64[ns]
Cell name                 object
Pixel                     object
Area                     float64
Init.PCE                 float64
Encap.                    object
Structure                 object
Producer                  object
Owner                     object
Project                   object
Temp sensor               object
Comment 1                 object
dtype: object

Maximum string lengths:
Cell type: 16
Cell name: 26
Pixel: 5
Encap.: 10
Structure: 72
Producer: 9
Owner: 19
Project: 39
Temp sensor: 21
Comment 1: 29

Sample of numeric columns:


Unnamed: 0,Area,Init.PCE
0,,
1,,
2,,
3,,
4,,


## 4. Database Mapping Reference

Based on the database schema in `tables.sql`, our logbook data maps to the following tables:

### 1. scientist table
```sql
CREATE TABLE scientist (
    scientist_id UUID PRIMARY KEY,
    name VARCHAR(255) NOT NULL
);
```
- Maps from: 'Owner', 'Producer' columns
- Fields:
  - scientist_id (UUID, generated)
  - name (from 'Owner' and 'Producer')

### 2. solar_cell_device table
```sql
CREATE TABLE solar_cell_device (
    name VARCHAR(255) PRIMARY KEY,
    nomad_id UUID UNIQUE,
    technology VARCHAR(255),
    area DOUBLE PRECISION,
    initial_pce DOUBLE PRECISION,
    date_produced TIMESTAMP WITH TIME ZONE,
    form_factor VARCHAR(255),
    encapsulation VARCHAR(255),
    experiment_id UUID,
    date_encapsulated TIMESTAMP WITH TIME ZONE,
    owner_id UUID REFERENCES scientist(scientist_id),
    producer_id UUID REFERENCES scientist(scientist_id)
);
```
- Maps from multiple columns
- Fields:
  - name (from 'Cell name')
  - nomad_id (UUID; not in logbook, left empty for now)
  - technology (from 'Cell type')
  - area (from 'Area')
  - initial_pce (from 'Init.PCE')
  - date_produced (not in logbook; 'Date installed' could serve as a proxy, left empty for now)
  - form_factor (not in logbook)
  - encapsulation (from 'Encap.')
  - experiment_id (not in logbook)
  - date_encapsulated (not in logbook)
  - owner_id (link to scientist table)
  - producer_id (link to scientist table)

### 3. solar_cell_pixel table
```sql
CREATE TABLE solar_cell_pixel (
    solar_cell_id VARCHAR(255) REFERENCES solar_cell_device(name),
    pixel VARCHAR(255),
    active_area DOUBLE PRECISION,
    PRIMARY KEY (solar_cell_id, pixel)
);
```
- Maps from pixel-specific data
- Fields:
  - solar_cell_id (foreign key to solar_cell_device.name)
  - pixel (from 'Pixel' column)
  - active_area (from 'Area')

### 4. mpp_tracking_channel table
```sql
CREATE TABLE mpp_tracking_channel (
    board INTEGER,
    channel INTEGER,
    PRIMARY KEY (board, channel)
);
```
- Maps from channel data
- Fields:
  - board (from 'Board')
  - channel (from 'Channel')
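
Since the same (board, channel) pair appears in many logbook rows, and the notebook may be run more than once, inserts into this table need to tolerate repeats. The sketch below illustrates the idea with an in-memory SQLite database as a stand-in for TimescaleDB, which is an assumption made purely so the example runs anywhere. PostgreSQL accepts the same `ON CONFLICT ... DO NOTHING` clause, but psycopg2 uses `%s` placeholders instead of `?`.

```python
import sqlite3

# (board, channel) pairs as they might come out of the logbook; the repeat
# mimics the same channel being reused by several cells over time.
pairs = [(3, 14), (3, 15), (3, 14), (4, 1)]

conn = sqlite3.connect(":memory:")  # stand-in for the real database
conn.execute(
    "CREATE TABLE mpp_tracking_channel ("
    " board INTEGER, channel INTEGER, PRIMARY KEY (board, channel))"
)

insert_sql = (
    "INSERT INTO mpp_tracking_channel (board, channel) VALUES (?, ?) "
    "ON CONFLICT (board, channel) DO NOTHING"
)

# Running the insert twice simulates re-running the notebook.
for _ in range(2):
    conn.executemany(insert_sql, pairs)
conn.commit()

rows = conn.execute(
    "SELECT board, channel FROM mpp_tracking_channel ORDER BY board, channel"
).fetchall()
print(rows)
```

Each pair ends up in the table exactly once, regardless of how often the insert is repeated.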

### 5. measurement_connection_event table
```sql
CREATE TABLE measurement_connection_event (
    solar_cell_id VARCHAR(255),
    pixel VARCHAR(255),
    tracking_channel_board INTEGER,
    tracking_channel_channel INTEGER,
    temperature_sensor_id VARCHAR(255),
    connection_datetime TIMESTAMP WITH TIME ZONE NOT NULL,
    FOREIGN KEY (solar_cell_id, pixel) REFERENCES solar_cell_pixel(solar_cell_id, pixel),
    FOREIGN KEY (tracking_channel_board, tracking_channel_channel) REFERENCES mpp_tracking_channel(board, channel)
);
```
- Maps connection events
- Fields:
  - solar_cell_id (from 'Cell name'; together with pixel, references solar_cell_pixel)
  - pixel (from 'Pixel')
  - tracking_channel_board (from 'Board')
  - tracking_channel_channel (from 'Channel')
  - temperature_sensor_id (from 'Temp sensor')
  - connection_datetime (from 'Date installed')
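
The `connection_datetime` column is `TIMESTAMP WITH TIME ZONE`, while the dates parsed from the logbook carry no time zone. A minimal sketch of attaching one is shown below. Interpreting the logbook dates as midnight UTC is an assumption; the logbook does not state a time zone, and a local zone may be more appropriate.

```python
import pandas as pd

installed = pd.Series(pd.to_datetime(["2022-02-09", "2022-06-13"]))

# Assumption: logbook dates are interpreted as midnight UTC.
connection_datetime = installed.dt.tz_localize("UTC")
print(connection_datetime)
```

If the dates should instead be read in a local zone, `tz_localize` accepts an IANA zone name, and `tz_convert("UTC")` can follow.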

This mapping shows we'll need to:
1. Generate UUIDs for new scientists
2. Handle the relationships between tables
3. Convert dates to proper timestamp format
4. Validate data against database constraints
5. Ensure referential integrity across all tables
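
Referential integrity also constrains the order in which the tables can be filled: a row can only be inserted after the rows it references exist. As a sketch, the foreign keys listed in the schema above can be written as a dependency graph and sorted with the standard library. The dictionary below is a hand-transcribed reading of the `REFERENCES` clauses.

```python
from graphlib import TopologicalSorter

# Each table maps to the set of tables it references via foreign keys.
dependencies = {
    "scientist": set(),
    "solar_cell_device": {"scientist"},
    "solar_cell_pixel": {"solar_cell_device"},
    "mpp_tracking_channel": set(),
    "measurement_connection_event": {"solar_cell_pixel", "mpp_tracking_channel"},
}

# static_order() yields every table after all of the tables it depends on.
insert_order = list(TopologicalSorter(dependencies).static_order())
print(insert_order)
```

Several orders are valid, for example `mpp_tracking_channel` may come first or fourth. Any order produced this way places referenced tables before the tables that point to them.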

## 5. Prepare Data for Database Upload

We'll prepare the data for each table in the database schema, starting with the scientist table.

In [86]:
import uuid

# Get unique scientists from both Owner and Producer columns
scientists = pd.concat([df['Owner'].dropna(), df['Producer'].dropna()]).unique()

# Create scientist dataframe
scientist_df = pd.DataFrame({
    'scientist_id': [uuid.uuid4() for _ in range(len(scientists))],
    'name': scientists
})

# Create a mapping dictionary for later use
scientist_id_map = dict(zip(scientist_df['name'], scientist_df['scientist_id']))

# Display the results
print(f"Found {len(scientist_df)} unique scientists")
print("\nScientist table preview:")
display(scientist_df)

print("\nValidating unique constraints:")
print(f"Duplicate names: {scientist_df['name'].duplicated().sum()}")
print(f"Duplicate UUIDs: {scientist_df['scientist_id'].duplicated().sum()}")

# Preview the mapping (first few entries)
print("\nScientist ID mapping (first few entries):")
for name, scientist_id in list(scientist_id_map.items())[:5]:
    print(f"{name}: {scientist_id}")

Found 15 unique scientists

Scientist table preview:


Unnamed: 0,scientist_id,name
0,a8976d76-1320-4390-835e-e398ff63f118,Giuxian
1,d6d26008-bf49-4105-b028-fd26b151b775,Yicheng/prof Brabec
2,c30aeef2-3538-46af-803c-20da9c6fbe61,Fengjiu
3,9aaf9566-ac11-4a83-869c-7f92caaab9e1,Daniel
4,3664bdc4-3643-4f56-bafb-5002a8da01eb,Kenedy
5,fe4fe15e-ea0c-4f58-9761-6252b363abe4,Ulas
6,8b984220-736f-4ffd-90a1-7af1a201968d,J. Zhang
7,7f816acd-9d15-4c30-b6b8-c128466d336e,ZSW
8,f1b7843e-8e29-4b0b-9d90-1d16b07ac9e0,Marlene
9,e73fe6ca-759c-4eef-80a7-1f9405e83455,"Jinzhao, FAPI"



Validating unique constraints:
Duplicate names: 0
Duplicate UUIDs: 0

Scientist ID mapping (first few entries):
Giuxian: a8976d76-1320-4390-835e-e398ff63f118
Yicheng/prof Brabec: d6d26008-bf49-4105-b028-fd26b151b775
Fengjiu: c30aeef2-3538-46af-803c-20da9c6fbe61
Daniel: 9aaf9566-ac11-4a83-869c-7f92caaab9e1
Kenedy: 3664bdc4-3643-4f56-bafb-5002a8da01eb
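
Note that `uuid.uuid4()` is random, so re-running the cell assigns a different ID to the same name each time. The `scientist` table shown earlier has no unique constraint on `name`, so repeated uploads could create duplicate scientists. One alternative, which is not what this notebook does, is to derive the ID from the name with `uuid.uuid5`. The namespace string below is hypothetical.

```python
import uuid

# Hypothetical namespace: any fixed UUID works, as long as it never changes.
SCIENTIST_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "scientist.perocube.example")


def scientist_uuid(name: str) -> uuid.UUID:
    """Derive a stable UUID from a scientist name."""
    return uuid.uuid5(SCIENTIST_NAMESPACE, name)


first_run = scientist_uuid("Giuxian")
second_run = scientist_uuid("Giuxian")
print(first_run == second_run)
```

Another option is to query the existing `scientist` rows first and generate IDs only for names that are not yet present.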


In [87]:
# Clean up scientist data

# Remove NA values
scientist_df = scientist_df.dropna()

# Clean up names
# 1. Strip whitespace
# 2. Replace multiple spaces with single space
# 3. Remove any trailing commas or periods
scientist_df['name'] = scientist_df['name'].str.strip()
scientist_df['name'] = scientist_df['name'].str.replace(r'\s+', ' ', regex=True)
scientist_df['name'] = scientist_df['name'].str.replace(r'[,.]$', '', regex=True)

# Update the mapping dictionary
scientist_id_map = dict(zip(scientist_df['name'], scientist_df['scientist_id']))

# Display cleaned results
print(f"After cleaning, found {len(scientist_df)} scientists")
print("\nCleaned scientist table:")
display(scientist_df)

# Verify no duplicates or NA values remain
print("\nValidating cleaned data:")
print(f"Null values: {scientist_df['name'].isna().sum()}")
print(f"Duplicate names: {scientist_df['name'].duplicated().sum()}")
print(f"Duplicate UUIDs: {scientist_df['scientist_id'].duplicated().sum()}")

After cleaning, found 15 scientists

Cleaned scientist table:


Unnamed: 0,scientist_id,name
0,a8976d76-1320-4390-835e-e398ff63f118,Giuxian
1,d6d26008-bf49-4105-b028-fd26b151b775,Yicheng/prof Brabec
2,c30aeef2-3538-46af-803c-20da9c6fbe61,Fengjiu
3,9aaf9566-ac11-4a83-869c-7f92caaab9e1,Daniel
4,3664bdc4-3643-4f56-bafb-5002a8da01eb,Kenedy
5,fe4fe15e-ea0c-4f58-9761-6252b363abe4,Ulas
6,8b984220-736f-4ffd-90a1-7af1a201968d,J. Zhang
7,7f816acd-9d15-4c30-b6b8-c128466d336e,ZSW
8,f1b7843e-8e29-4b0b-9d90-1d16b07ac9e0,Marlene
9,e73fe6ca-759c-4eef-80a7-1f9405e83455,"Jinzhao, FAPI"



Validating cleaned data:
Null values: 0
Duplicate names: 0
Duplicate UUIDs: 0


In [88]:
# Additional cleaning: Remove <NA> strings and ensure no invalid values remain
print("Checking for '<NA>' values...")

# Check for '<NA>' strings
na_mask = scientist_df['name'].isin(['<NA>', 'NA', '<na>', 'na'])
if na_mask.any():
    print(f"Found {na_mask.sum()} '<NA>' values. Removing them...")
    scientist_df = scientist_df[~na_mask]

# Update the mapping dictionary again
scientist_id_map = dict(zip(scientist_df['name'], scientist_df['scientist_id']))

# Validate final results
print("\nFinal validation:")
print(f"Total scientists after removing <NA>: {len(scientist_df)}")
print(f"Any remaining NA values: {scientist_df['name'].isna().any()}")
print(f"Any remaining <NA> strings: {scientist_df['name'].isin(['<NA>', 'NA', '<na>', 'na']).any()}")

# Display final cleaned data
print("\nFinal cleaned scientist table:")
display(scientist_df)

Checking for '<NA>' values...

Final validation:
Total scientists after removing <NA>: 15
Any remaining NA values: False
Any remaining <NA> strings: False

Final cleaned scientist table:


Unnamed: 0,scientist_id,name
0,a8976d76-1320-4390-835e-e398ff63f118,Giuxian
1,d6d26008-bf49-4105-b028-fd26b151b775,Yicheng/prof Brabec
2,c30aeef2-3538-46af-803c-20da9c6fbe61,Fengjiu
3,9aaf9566-ac11-4a83-869c-7f92caaab9e1,Daniel
4,3664bdc4-3643-4f56-bafb-5002a8da01eb,Kenedy
5,fe4fe15e-ea0c-4f58-9761-6252b363abe4,Ulas
6,8b984220-736f-4ffd-90a1-7af1a201968d,J. Zhang
7,7f816acd-9d15-4c30-b6b8-c128466d336e,ZSW
8,f1b7843e-8e29-4b0b-9d90-1d16b07ac9e0,Marlene
9,e73fe6ca-759c-4eef-80a7-1f9405e83455,"Jinzhao, FAPI"
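
The clean-up above runs after the unique names were collected and only touches `scientist_df`. If it ever changed a name, two variants could collapse into duplicate rows, and the 'Owner' and 'Producer' values in `df` would no longer match the keys of `scientist_id_map`. A possible remedy, sketched with a hypothetical helper `normalize_name`, is to normalize first and apply the same function to both the lookup table and the logbook columns.

```python
import re


def normalize_name(name: str) -> str:
    """Apply the same clean-up steps as the cells above to a single name."""
    name = name.strip()
    name = re.sub(r"\s+", " ", name)    # collapse runs of whitespace
    name = re.sub(r"[,.]$", "", name)   # drop one trailing comma or period
    return name


# Made-up variants of names that should end up as the same scientist.
raw_owners = ["Giuxian", " Giuxian ", "J.  Zhang", "ZSW."]
normalized = [normalize_name(n) for n in raw_owners]
unique_names = sorted(set(normalized))
print(unique_names)
```

In the notebook, this would amount to mapping `normalize_name` over the non-null 'Owner' and 'Producer' values before calling `unique()` and before `.map(scientist_id_map)`.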


### Prepare Solar Cell Device Table

The solar_cell_device table requires the following fields:
- name (from 'Cell name', serves as primary key)
- nomad_id (unique identifier; left empty for now, to be filled later)
- technology (from 'Cell type')
- area (from 'Area')
- initial_pce (from 'Init.PCE')
- date_produced (empty, will be filled later)
- encapsulation (from 'Encap.')
- owner_id (link to scientist table)
- producer_id (link to scientist table)

We'll prepare this data by first creating a dataframe with the required columns, then clean and validate the data.

In [89]:
# First, get unique devices based on cell name to avoid duplicates
unique_devices = df.groupby('Cell name').agg({
    'Cell type': 'first',  # Take first occurrence of cell type
    'Area': 'first',       # Take first area value
    'Init.PCE': 'first',   # Take first PCE value
    'Encap.': 'first',     # Take first encapsulation value
    'Owner': 'first',      # Take first owner
    'Producer': 'first'    # Take first producer
}).reset_index()

# Create initial solar cell device dataframe
solar_cell_device_df = pd.DataFrame({
    'nomad_id': pd.NA,  # Will be filled later
    'name': unique_devices['Cell name'],  # Add device name from Cell name
    'technology': unique_devices['Cell type'],
    'area': unique_devices['Area'],
    'initial_pce': unique_devices['Init.PCE'],
    'date_produced': pd.NA,  # Will be filled later
    'encapsulation': unique_devices['Encap.'],
    'owner_id': unique_devices['Owner'].map(scientist_id_map),
    'producer_id': unique_devices['Producer'].map(scientist_id_map)
})

# Create a mapping dictionary for cell names to use in subsequent tables
cell_name_map = dict(zip(unique_devices['Cell name'], range(len(unique_devices))))

# Display initial state
print(f"Found {len(solar_cell_device_df)} unique cell devices")
print("\nInitial solar cell device table:")
display(solar_cell_device_df.head(25))

# Check for missing values
print("\nMissing values per column:")
print(solar_cell_device_df.isnull().sum())

# Display data types
print("\nColumn data types:")
print(solar_cell_device_df.dtypes)

Found 67 unique cell devices

Initial solar cell device table:


Unnamed: 0,nomad_id,name,technology,area,initial_pce,date_produced,encapsulation,owner_id,producer_id
0,,A11_100C,Pero,0.16,16.1,,Glue,3664bdc4-3643-4f56-bafb-5002a8da01eb,3664bdc4-3643-4f56-bafb-5002a8da01eb
1,,A23_110C,Pero,0.16,16.9,,Glue,3664bdc4-3643-4f56-bafb-5002a8da01eb,3664bdc4-3643-4f56-bafb-5002a8da01eb
2,,B11_120C,Pero,0.16,13.3,,Glue,3664bdc4-3643-4f56-bafb-5002a8da01eb,3664bdc4-3643-4f56-bafb-5002a8da01eb
3,,B23_130C,Pero,0.16,13.3,,Glue,3664bdc4-3643-4f56-bafb-5002a8da01eb,3664bdc4-3643-4f56-bafb-5002a8da01eb
4,,BO - 1C,Pero,0.16,,,,e73fe6ca-759c-4eef-80a7-1f9405e83455,
5,,BO - 1E,Pero,0.16,,,,e73fe6ca-759c-4eef-80a7-1f9405e83455,
6,,C11_140C,Pero,0.16,10.5,,Glue,3664bdc4-3643-4f56-bafb-5002a8da01eb,3664bdc4-3643-4f56-bafb-5002a8da01eb
7,,C23_150C,Pero,0.16,10.2,,Glue,3664bdc4-3643-4f56-bafb-5002a8da01eb,3664bdc4-3643-4f56-bafb-5002a8da01eb
8,,CEL070923SN104,Pero,1.5,8.92,,,af3efe64-2f5f-48ce-a198-8ed8ac1d1542,25adb7bc-6d7b-4a31-a72c-faf43e5dd2cb
9,,CEL070923SN111,Pero,1.5,8.71,,,af3efe64-2f5f-48ce-a198-8ed8ac1d1542,25adb7bc-6d7b-4a31-a72c-faf43e5dd2cb



Missing values per column:
nomad_id         67
name              0
technology        0
area              9
initial_pce      38
date_produced    67
encapsulation    31
owner_id         11
producer_id      27
dtype: int64

Column data types:
nomad_id          object
name              object
technology        object
area             float64
initial_pce      float64
date_produced     object
encapsulation     object
owner_id          object
producer_id       object
dtype: object


In [90]:
# Validate solar cell device dataframe against database requirements

# 1. Add missing columns required by schema
solar_cell_device_df['form_factor'] = pd.NA  # Will need to be filled
solar_cell_device_df['experiment_id'] = pd.NA  # Will need to be filled
solar_cell_device_df['date_encapsulated'] = pd.NA  # Will need to be filled

# 2. Check for duplicate names (would violate PRIMARY KEY constraint)
duplicates = solar_cell_device_df['name'].duplicated()
if duplicates.any():
    print("WARNING: Found duplicate device names (would violate PRIMARY KEY constraint):")
    print(solar_cell_device_df[duplicates]['name'])

# 3. Check the length of the name field (no character check is performed here)
print("\nValidating name field:")
print(f"Max name length: {solar_cell_device_df['name'].str.len().max()} (limit 255)")

# 4. Verify non-null owner_id and producer_id values exist in scientist table
valid_scientist_ids = set(scientist_df['scientist_id'])

# Check only non-null references (each mask is indexed by the non-null rows only)
invalid_owners = ~solar_cell_device_df.loc[solar_cell_device_df['owner_id'].notna(), 'owner_id'].isin(valid_scientist_ids)
invalid_producers = ~solar_cell_device_df.loc[solar_cell_device_df['producer_id'].notna(), 'producer_id'].isin(valid_scientist_ids)

if invalid_owners.any():
    print("\nWARNING: Found invalid owner_id references (excluding NULL values):")
    # Select by the index labels of the True entries; the mask is shorter than the dataframe
    print(solar_cell_device_df.loc[invalid_owners[invalid_owners].index, ['name', 'owner_id']])

if invalid_producers.any():
    print("\nWARNING: Found invalid producer_id references (excluding NULL values):")
    print(solar_cell_device_df.loc[invalid_producers[invalid_producers].index, ['name', 'producer_id']])

# 5. Check for required non-null fields (only name is required)
print("\nChecking required field (name):")
null_count = solar_cell_device_df['name'].isna().sum()
if null_count > 0:
    print(f"WARNING: Found {null_count} missing values in required field 'name'")

# Display updated dataframe structure
print("\nUpdated dataframe structure:")
print(solar_cell_device_df.dtypes)


Validating name field:
Max name length: 26 (limit 255)

Checking required field (name):

Updated dataframe structure:
nomad_id              object
name                  object
technology            object
area                 float64
initial_pce          float64
date_produced         object
encapsulation         object
owner_id              object
producer_id           object
form_factor           object
experiment_id         object
date_encapsulated     object
dtype: object
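
The foreign-key check above can also be written with a mask that spans every row of the dataframe, which keeps the mask aligned with the dataframe index and usable directly for selection or assignment. The following is a minimal sketch on made-up data; the names `devices` and `valid_ids` and the placeholder IDs are assumptions for illustration, not values from the logbook.

```python
import pandas as pd

# Hypothetical sample data; the IDs are placeholders, not real scientist UUIDs
devices = pd.DataFrame({
    "name": ["cell_a", "cell_b", "cell_c"],
    "owner_id": ["id-1", None, "id-9"],
})
valid_ids = {"id-1", "id-2"}

# Full-length boolean mask: True only where a value is present AND unknown
invalid_owner_mask = devices["owner_id"].notna() & ~devices["owner_id"].isin(valid_ids)

# The mask shares the dataframe's index, so it can be used directly with .loc
invalid_rows = devices.loc[invalid_owner_mask, ["name", "owner_id"]]
print(invalid_rows)
```

With this form, NULL references are never flagged, and only the flagged rows would be touched by a later `.loc[mask, column] = ...` assignment.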


In [91]:
# Fix any issues found in validation

# 1. Handle duplicate device names if any
if duplicates.any():
    print("\nHandling duplicate device names...")
    # Add suffix to duplicates
    dup_mask = solar_cell_device_df['name'].duplicated(keep='first')
    dup_names = solar_cell_device_df.loc[dup_mask, 'name']
    for name in dup_names:
        matches = solar_cell_device_df['name'] == name
        # Add numeric suffix to duplicates (e.g., device_1, device_2)
        for i, idx in enumerate(solar_cell_device_df[matches].index[1:], 1):
            solar_cell_device_df.loc[idx, 'name'] = f"{name}_{i}"

# 2. Handle missing values in required fields
print("\nHandling missing required values...")
# Only name is required (PRIMARY KEY)
missing_names = solar_cell_device_df['name'].isna().sum()
if missing_names > 0:
    raise ValueError(f"Found {missing_names} missing values in 'name' column. This is required and cannot be null.")

# 3. Handle invalid scientist references (only for non-null values)
if invalid_owners.any() or invalid_producers.any():
    print("\nHandling invalid scientist references...")
    # Create a default scientist for invalid references
    import uuid  # local import in case uuid was not imported earlier in the notebook
    default_scientist_id = uuid.uuid4()
    scientist_df = pd.concat([scientist_df, pd.DataFrame({
        'scientist_id': [default_scientist_id],
        'name': ['Unknown Scientist']
    })], ignore_index=True)
    # Register the new ID so the final validation accepts it
    valid_scientist_ids.add(default_scientist_id)

    # Only update the rows flagged as invalid (True in the mask), not every non-null row
    if invalid_owners.any():
        solar_cell_device_df.loc[invalid_owners[invalid_owners].index, 'owner_id'] = default_scientist_id
    if invalid_producers.any():
        solar_cell_device_df.loc[invalid_producers[invalid_producers].index, 'producer_id'] = default_scientist_id

# Verify final state
print("\nFinal validation:")
print(f"Duplicate names: {solar_cell_device_df['name'].duplicated().any()}")
print(f"Missing required values (name): {solar_cell_device_df['name'].isna().sum()}")
print(f"Invalid non-null owner references: {(solar_cell_device_df[solar_cell_device_df['owner_id'].notna()]['owner_id'].isin(valid_scientist_ids) == False).sum()}")
print(f"Invalid non-null producer references: {(solar_cell_device_df[solar_cell_device_df['producer_id'].notna()]['producer_id'].isin(valid_scientist_ids) == False).sum()}")

# Display the cleaned data
print("\nCleaned solar cell device table:")
display(solar_cell_device_df.head())


Handling missing required values...

Final validation:
Duplicate names: False
Missing required values (name): 0
Invalid non-null owner references: 0
Invalid non-null producer references: 0

Cleaned solar cell device table:


Unnamed: 0,nomad_id,name,technology,area,initial_pce,date_produced,encapsulation,owner_id,producer_id,form_factor,experiment_id,date_encapsulated
0,,A11_100C,Pero,0.16,16.1,,Glue,3664bdc4-3643-4f56-bafb-5002a8da01eb,3664bdc4-3643-4f56-bafb-5002a8da01eb,,,
1,,A23_110C,Pero,0.16,16.9,,Glue,3664bdc4-3643-4f56-bafb-5002a8da01eb,3664bdc4-3643-4f56-bafb-5002a8da01eb,,,
2,,B11_120C,Pero,0.16,13.3,,Glue,3664bdc4-3643-4f56-bafb-5002a8da01eb,3664bdc4-3643-4f56-bafb-5002a8da01eb,,,
3,,B23_130C,Pero,0.16,13.3,,Glue,3664bdc4-3643-4f56-bafb-5002a8da01eb,3664bdc4-3643-4f56-bafb-5002a8da01eb,,,
4,,BO - 1C,Pero,0.16,,,,e73fe6ca-759c-4eef-80a7-1f9405e83455,,,,


### Prepare Solar Cell Pixel Table

The solar_cell_pixel table requires:
- solar_cell_id (foreign key to solar_cell_device.name)
- pixel (pixel number/identifier)
- active_area (area measurement for specific pixel)

Important considerations:
- The combination of solar_cell_id and pixel forms the PRIMARY KEY
- Different solar cells can have pixels with the same names (e.g., 'a', 'b', 'c')
- We need to preserve all pixel entries while ensuring uniqueness of the composite key

Let's extract and validate this data from our logbook entries.

In [92]:
# Create solar cell pixel dataframe from original logbook data
solar_cell_pixel_df = df[['Cell name', 'Pixel', 'Area']].copy()

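# Clean pixel data
# Note: astype(str) also turns missing values into text ('nan' or '<NA>')
solar_cell_pixel_df['Pixel'] = solar_cell_pixel_df['Pixel'].astype(str)
solar_cell_pixel_df['Pixel'] = solar_cell_pixel_df['Pixel'].str.strip()

# Remove rows where pixel is missing or invalid
# (after the string conversion dropna no longer matches anything; only the literal
# 'nan' is filtered, so cells without a pixel keep '<NA>' as their pixel name)
solar_cell_pixel_df = solar_cell_pixel_df.dropna(subset=['Pixel'])
solar_cell_pixel_df = solar_cell_pixel_df[solar_cell_pixel_df['Pixel'].str.lower() != 'nan']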

# Ensure area is a float
solar_cell_pixel_df['active_area'] = pd.to_numeric(solar_cell_pixel_df['Area'], errors='coerce')

# Get unique combinations of cell and pixel
# Note: We don't deduplicate pixels alone as they can repeat across different cells
solar_cell_pixel_df = solar_cell_pixel_df.drop_duplicates(subset=['Cell name', 'Pixel'])

# Display initial state
print(f"Found {len(solar_cell_pixel_df)} unique cell-pixel combinations")
print(f"Number of unique cells: {solar_cell_pixel_df['Cell name'].nunique()}")
print(f"Number of unique pixel names: {solar_cell_pixel_df['Pixel'].nunique()}")

# Show distribution of pixel names across cells
pixel_counts = solar_cell_pixel_df.groupby('Pixel').size()
print("\nPixel name frequency (showing most common):")
display(pixel_counts.sort_values(ascending=False).head())

# Display initial data
print("\nInitial solar cell pixel table:")
display(solar_cell_pixel_df.head(25))

# Check for missing values
print("\nMissing values per column:")
print(solar_cell_pixel_df.isnull().sum())

# Display data types
print("\nColumn data types:")
print(solar_cell_pixel_df.dtypes)

Found 113 unique cell-pixel combinations
Number of unique cells: 67
Number of unique pixel names: 13

Pixel name frequency (showing most common):


Pixel
<NA>    33
1       10
C       10
E       10
2        9
dtype: int64


Initial solar cell pixel table:


Unnamed: 0,Cell name,Pixel,Area,active_area
0,T2-2,C,,
1,T2-2,E,,
2,GK-6-1,B,,
3,GK-6-1,D,,
4,GK-6-1,F,,
5,QE-060122-1,A,,
6,QE-060122-1,C,,
7,QE-060122-1,E,,
8,QE-060122-3,B,,
9,QE-060122-3,D,,



Missing values per column:
Cell name       0
Pixel           0
Area           29
active_area    29
dtype: int64

Column data types:
Cell name       object
Pixel           object
Area           float64
active_area    float64
dtype: object
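
The frequency table above lists `<NA>` as the most common pixel name, which suggests that missing pixels were converted to text before they could be dropped. If the intent were to keep missing pixels as real missing values, one option is pandas' nullable `string` dtype. This is only a sketch on invented rows; the dataframe `sample` is hypothetical, and whether rows without a pixel should be dropped at all depends on how later tables reference them.

```python
import pandas as pd

# Hypothetical logbook excerpt; the second and fourth rows have no pixel recorded
sample = pd.DataFrame({
    "Cell name": ["T2-2", "T2-2", "GK-6-1", "GK-6-1"],
    "Pixel": [" C ", pd.NA, "B", None],
})

# The nullable "string" dtype keeps missing values missing while still
# supporting the .str accessor (here: stripping surrounding whitespace)
cleaned = sample.assign(Pixel=sample["Pixel"].astype("string").str.strip())

# Missing pixels are still detectable, so dropna removes exactly those rows
with_pixel = cleaned.dropna(subset=["Pixel"])

# The composite key (cell, pixel) should be unique among the remaining rows
has_duplicate_keys = with_pixel.duplicated(subset=["Cell name", "Pixel"]).any()
print(with_pixel)
```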


In [93]:
# Validate and clean pixel data

# 1. Check for invalid pixel values
print("Unique pixel values:")
print(solar_cell_pixel_df['Pixel'].unique())

# 2. Verify all cell names exist in solar_cell_device_df
invalid_cells = ~solar_cell_pixel_df['Cell name'].isin(solar_cell_device_df['name'])
if invalid_cells.any():
    print(f"\nWARNING: Found {invalid_cells.sum()} pixels with invalid cell references")
    print("Invalid cell names:")
    print(solar_cell_pixel_df[invalid_cells]['Cell name'].unique())
    
    # Remove invalid entries
    solar_cell_pixel_df = solar_cell_pixel_df[~invalid_cells]
else:
    print("✓ All cell names in solar_cell_pixel_df are valid")

# 3. Check for negative or zero areas
invalid_areas = solar_cell_pixel_df['active_area'] <= 0
if invalid_areas.any():
    print(f"\nWARNING: Found {invalid_areas.sum()} pixels with invalid areas")
    print("Entries with invalid areas:")
    display(solar_cell_pixel_df[invalid_areas])
    
    # Set invalid areas to NULL
    solar_cell_pixel_df.loc[invalid_areas, 'active_area'] = pd.NA
else:
    print("✓ All active_area values are valid")

# Display cleaned data
print("\nCleaned solar cell pixel table:")
display(solar_cell_pixel_df.head(25))

# Final validation counts
print(f"\nFinal validation:")
print(f"Total unique pixels: {len(solar_cell_pixel_df)}")
print(f"Unique cells: {solar_cell_pixel_df['Cell name'].nunique()}")
print(f"Pixels without area: {solar_cell_pixel_df['active_area'].isna().sum()}")
print(f"Invalid cell references: {(~solar_cell_pixel_df['Cell name'].isin(solar_cell_device_df['name'])).sum()}")

Unique pixel values:
['C' 'E' 'B' 'D' 'F' 'A' '<NA>' '1' '2' '3' 'A / U' 'C / M' 'E / L']
✓ All cell names in solar_cell_pixel_df are valid
✓ All active_area values are valid

Cleaned solar cell pixel table:


Unnamed: 0,Cell name,Pixel,Area,active_area
0,T2-2,C,,
1,T2-2,E,,
2,GK-6-1,B,,
3,GK-6-1,D,,
4,GK-6-1,F,,
5,QE-060122-1,A,,
6,QE-060122-1,C,,
7,QE-060122-1,E,,
8,QE-060122-3,B,,
9,QE-060122-3,D,,



Final validation:
Total unique pixels: 113
Unique cells: 67
Pixels without area: 29
Invalid cell references: 0


In [94]:
# Prepare final table structure for database upload
solar_cell_pixel_upload_df = pd.DataFrame({
    'solar_cell_id': solar_cell_pixel_df['Cell name'].astype('string'),  # VARCHAR(255)
    'pixel': solar_cell_pixel_df['Pixel'].astype('string'),              # VARCHAR(255)
    'active_area': solar_cell_pixel_df['active_area'].astype('float64')  # DOUBLE PRECISION
})

# Validate data types match database schema
print("Data type validation:")
print(solar_cell_pixel_upload_df.dtypes)

# Validate string lengths (VARCHAR(255) limit)
max_solar_cell_id_len = solar_cell_pixel_upload_df['solar_cell_id'].str.len().max()
max_pixel_len = solar_cell_pixel_upload_df['pixel'].str.len().max()
print(f"\nVARCHAR length validation:")
print(f"solar_cell_id max length: {max_solar_cell_id_len}/255")
print(f"pixel max length: {max_pixel_len}/255")

# Ensure the combined solar_cell_id and pixel form a unique identifier
duplicates = solar_cell_pixel_upload_df.duplicated(subset=['solar_cell_id', 'pixel'], keep=False)
if duplicates.any():
    print("\nWARNING: Found duplicate pixel entries:")
    display(solar_cell_pixel_upload_df[duplicates].sort_values(['solar_cell_id', 'pixel']))
    
    # Keep the first occurrence of each pixel per cell
    solar_cell_pixel_upload_df = solar_cell_pixel_upload_df.drop_duplicates(
        subset=['solar_cell_id', 'pixel'],
        keep='first'
    )
else:
    print("✓ No duplicate pixel entries found")

# Validate numeric ranges
print("\nNumeric range validation:")
print("active_area range:")
print(f"min: {solar_cell_pixel_upload_df['active_area'].min()}")
print(f"max: {solar_cell_pixel_upload_df['active_area'].max()}")

# Final validation
print("\nSchema validation:")
print(f"Primary key uniqueness: {not solar_cell_pixel_upload_df.duplicated(subset=['solar_cell_id', 'pixel']).any()}")
print(f"Foreign key validation: {solar_cell_pixel_upload_df['solar_cell_id'].isin(solar_cell_device_df['name']).all()}")
print(f"Total rows: {len(solar_cell_pixel_upload_df)}")

# Null checks
print("\nNull value checks:")
print(solar_cell_pixel_upload_df.isnull().sum())


Data type validation:
solar_cell_id    string[python]
pixel            string[python]
active_area             float64
dtype: object

VARCHAR length validation:
solar_cell_id max length: 26/255
pixel max length: 5/255
✓ No duplicate pixel entries found

Numeric range validation:
active_area range:
min: 0.122
max: 1.5

Schema validation:
Primary key uniqueness: True
Foreign key validation: True
Total rows: 113

Null value checks:
solar_cell_id     0
pixel             0
active_area      29
dtype: int64


### Prepare MPP Tracking Channel Table

The mpp_tracking_channel table requires:
- board (INTEGER)
- channel (INTEGER)

Important considerations:
- The combination of board and channel forms the PRIMARY KEY
- Both fields are required (no NULL values allowed)
- Values must be valid integers
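
As an illustration of these constraints, the sketch below coerces hypothetical spreadsheet values to integers, drops incomplete rows, and enforces uniqueness of the composite key. The dataframe `raw` and its contents are assumptions made for the example.

```python
import pandas as pd

# Hypothetical raw values as they might arrive from a spreadsheet
raw = pd.DataFrame({
    "Board": [3, 3, "4", None, 3],
    "Channel": [14, 15, "1", 2, 14],
})

# Coerce to numbers (unparseable entries become NaN) and drop incomplete rows
channels = raw.apply(pd.to_numeric, errors="coerce").dropna()

# Cast to plain integers and use the schema's column names
channels = channels.astype("int64").rename(columns={"Board": "board", "Channel": "channel"})

# The composite primary key (board, channel) must be unique
channels = channels.drop_duplicates(subset=["board", "channel"]).reset_index(drop=True)
print(channels)
```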

In [95]:
# Create MPP tracking channel dataframe from original logbook data
mpp_tracking_channel_df = df[['Board', 'Channel']].copy()

# Drop any rows where either Board or Channel is missing
mpp_tracking_channel_df = mpp_tracking_channel_df.dropna()

# Convert to integers
mpp_tracking_channel_df['board'] = mpp_tracking_channel_df['Board'].astype('int64')
mpp_tracking_channel_df['channel'] = mpp_tracking_channel_df['Channel'].astype('int64')

# Drop the original columns
mpp_tracking_channel_df = mpp_tracking_channel_df[['board', 'channel']]

# Add the additional required fields with default values
# These will need to be updated with actual values later
mpp_tracking_channel_df['address'] = pd.NA  # Will need actual address mapping
mpp_tracking_channel_df['com_port'] = pd.NA  # Will need actual COM port mapping
mpp_tracking_channel_df['current_limit'] = pd.NA  # Current limit in mA; no value available yet

# Remove duplicates
mpp_tracking_channel_df = mpp_tracking_channel_df.drop_duplicates(subset=['board', 'channel'])

# Display initial state
print(f"Found {len(mpp_tracking_channel_df)} unique board-channel combinations")
print(f"Number of unique boards: {mpp_tracking_channel_df['board'].nunique()}")
print(f"Number of unique channels: {mpp_tracking_channel_df['channel'].nunique()}")

# Show distribution of channels across boards
channel_counts = mpp_tracking_channel_df.groupby('board')['channel'].nunique()
print("\nChannels per board:")
display(channel_counts)

# Validate data
print("\nValidation:")
print(f"Primary key uniqueness: {not mpp_tracking_channel_df.duplicated(subset=['board', 'channel']).any()}")
print(f"Negative board values: {(mpp_tracking_channel_df['board'] < 0).sum()}")
print(f"Negative channel values: {(mpp_tracking_channel_df['channel'] < 0).sum()}")

# Display the prepared data
print("\nPrepared MPP tracking channel table:")
display(mpp_tracking_channel_df.sort_values(['board', 'channel']))

Found 58 unique board-channel combinations
Number of unique boards: 4
Number of unique channels: 24

Channels per board:


board
2     5
3    17
4    20
5    16
Name: channel, dtype: int64


Validation:
Primary key uniqueness: True
Negative board values: 0
Negative channel values: 0

Prepared MPP tracking channel table:


Unnamed: 0,board,channel,address,com_port,current_limit
55,2,17,,,
56,2,18,,,
57,2,19,,,
58,2,20,,,
59,2,21,,,
63,3,1,,,
64,3,2,,,
65,3,3,,,
66,3,4,,,
67,3,5,,,


### Prepare Measurement Connection Event Table

We'll process the cleaned logbook data (`df`) that was prepared in the earlier cells. This dataframe contains:
- Cleaned and validated dates (Date installed, Date removed)
- Integer-converted Board and Channel numbers
- Cleaned string fields for Cell name, Pixel, etc.
- Proper handling of missing values

We'll split the preparation into several steps:
1. Extract connection events
2. Extract disconnection events
3. Combine the events and rename columns to match the schema
4. Convert data types and map temperature sensors
5. Validate referential integrity
6. Validate event sequence logic
7. Validate chronological order
8. Summarize the final data
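
For orientation, the extraction of connection and disconnection events can also be expressed as a single reshape from wide to long form. The following is a compact sketch using `DataFrame.melt` on invented rows; the dataframe `logbook` and the helper column `source_column` are hypothetical and only mirror the column names used in this notebook.

```python
import pandas as pd

# Hypothetical logbook rows; the second cell is still installed (no removal date)
logbook = pd.DataFrame({
    "Cell name": ["T2-2", "GK-6-1"],
    "Pixel": ["C", "B"],
    "Date installed": pd.to_datetime(["2022-02-09", "2022-02-09"]),
    "Date removed": pd.to_datetime(["2022-07-15", None]),
})

# One row per (cell, pixel, date column); rows without a date are not events
events = logbook.melt(
    id_vars=["Cell name", "Pixel"],
    value_vars=["Date installed", "Date removed"],
    var_name="source_column",
    value_name="connection_datetime",
).dropna(subset=["connection_datetime"])

# Translate the source column into the event type
events["event_type"] = events["source_column"].map(
    {"Date installed": "CONNECTED", "Date removed": "DISCONNECTED"}
)
events = events.drop(columns="source_column").reset_index(drop=True)
print(events)
```

The step-by-step cells that follow reach the same shape with explicit copies and a concatenation, which makes each intermediate result easier to inspect.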

In [96]:
# Let's first verify the state of our input dataframe
print("Input dataframe (cleaned logbook data) summary:")
print("-" * 50)
print("Shape:", df.shape)
print("\nColumns:")
print(df.columns.tolist())
print("\nData types:")
print(df.dtypes)
print("\nSample of input data:")
display(df[['Cell name', 'Pixel', 'Board', 'Channel', 'Date installed', 'Date removed']].head(3))

Input dataframe (cleaned logbook data) summary:
--------------------------------------------------
Shape: (113, 17)

Columns:
['Date removed', 'Board', 'Channel', 'Status', 'Cell type', 'Date installed', 'Cell name', 'Pixel', 'Area', 'Init.PCE', 'Encap.', 'Structure', 'Producer', 'Owner', 'Project', 'Temp sensor', 'Comment 1']

Data types:
Date removed      datetime64[ns]
Board                      Int64
Channel                    Int64
Status                   float64
Cell type                 object
Date installed    datetime64[ns]
Cell name                 object
Pixel                     object
Area                     float64
Init.PCE                 float64
Encap.                    object
Structure                 object
Producer                  object
Owner                     object
Project                   object
Temp sensor               object
Comment 1                 object
dtype: object

Sample of input data:


Unnamed: 0,Cell name,Pixel,Board,Channel,Date installed,Date removed
0,T2-2,C,3,14,2022-02-09,2022-07-15
1,T2-2,E,3,15,2022-02-09,2022-07-15
2,GK-6-1,B,3,16,2022-02-09,2022-07-15


In [97]:
# Step 1: Extract Connection Events from cleaned logbook data
logbook_connection_events = df[[
    'Cell name',
    'Pixel',
    'Board',
    'Channel',
    'Temp sensor',
    'Date installed'
]].copy()

# Add event_type column for connections
logbook_connection_events['event_type'] = 'CONNECTED'

print("Connection events extracted from logbook:")
print(f"Total events: {len(logbook_connection_events)}")
print("\nSample of connection events:")
display(logbook_connection_events.head(3))

Connection events extracted from logbook:
Total events: 113

Sample of connection events:


Unnamed: 0,Cell name,Pixel,Board,Channel,Temp sensor,Date installed,event_type
0,T2-2,C,3,14,,2022-02-09,CONNECTED
1,T2-2,E,3,15,,2022-02-09,CONNECTED
2,GK-6-1,B,3,16,,2022-02-09,CONNECTED


In [98]:
# Step 2: Extract Disconnection Events from cleaned logbook data
logbook_disconnection_events = df[[
    'Cell name',
    'Pixel',
    'Board',
    'Channel',
    'Temp sensor',
    'Date removed'
]].copy()

# Add event_type column for disconnections
logbook_disconnection_events['event_type'] = 'DISCONNECTED'

# Rename 'Date removed' to 'Date installed' so the columns line up with the
# connection events when both frames are concatenated
logbook_disconnection_events = logbook_disconnection_events.rename(columns={'Date removed': 'Date installed'})

# Remove rows without a removal date (cells that are still connected)
logbook_disconnection_events = logbook_disconnection_events.dropna(subset=['Date installed'])

print("Disconnection events extracted from logbook:")
print(f"Total events: {len(logbook_disconnection_events)}")
print(f"Currently connected cells (no disconnection date): {len(df) - len(logbook_disconnection_events)}")
print("\nSample of disconnection events:")
display(logbook_disconnection_events.head(3))

Disconnection events extracted from logbook:
Total events: 113
Currently connected cells (no disconnection date): 0

Sample of disconnection events:


Unnamed: 0,Cell name,Pixel,Board,Channel,Temp sensor,Date installed,event_type
0,T2-2,C,3,14,,2022-07-15,DISCONNECTED
1,T2-2,E,3,15,,2022-07-15,DISCONNECTED
2,GK-6-1,B,3,16,,2022-07-15,DISCONNECTED


In [99]:
# Step 3: Combine and Process Events
# Combine connection and disconnection events into a new measurement events dataframe
measurement_connection_df = pd.concat([logbook_connection_events, logbook_disconnection_events], ignore_index=True)

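# Rename columns to match database schema
# (an explicit mapping does not depend on the column order; event_type keeps its name)
measurement_connection_df = measurement_connection_df.rename(columns={
    'Cell name': 'solar_cell_id',
    'Pixel': 'pixel',
    'Board': 'tracking_channel_board',
    'Channel': 'tracking_channel_channel',
    'Temp sensor': 'temperature_sensor_id',
    'Date installed': 'connection_datetime'
})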

# Add default fields
measurement_connection_df['irradiance_sensor_id'] = pd.NA
measurement_connection_df['mppt_mode'] = pd.NA
measurement_connection_df['mppt_polarity'] = pd.NA

print("Combined measurement connection events:")
print(f"Total events: {len(measurement_connection_df)}")
print(f"From which:")
print(f"- Connection events: {len(logbook_connection_events)}")
print(f"- Disconnection events: {len(logbook_disconnection_events)}")
print("\nColumns in final dataframe:")
print(measurement_connection_df.columns.tolist())

Combined measurement connection events:
Total events: 226
From which:
- Connection events: 113
- Disconnection events: 113

Columns in final dataframe:
['solar_cell_id', 'pixel', 'tracking_channel_board', 'tracking_channel_channel', 'temperature_sensor_id', 'connection_datetime', 'event_type', 'irradiance_sensor_id', 'mppt_mode', 'mppt_polarity']


In [100]:
# Step 4: Data Type Conversions
# First create database connection
engine = create_engine(
    f"postgresql://{DB_CONFIG['user']}:{DB_CONFIG['password']}@{DB_CONFIG['host']}:{DB_CONFIG['port']}/{DB_CONFIG['database']}"
)

# Convert board and channel to integers
measurement_connection_df['tracking_channel_board'] = measurement_connection_df['tracking_channel_board'].astype('int64')
measurement_connection_df['tracking_channel_channel'] = measurement_connection_df['tracking_channel_channel'].astype('int64')

# Print current state of datetime column
print("\nBefore timezone localization:")
print("connection_datetime type:", measurement_connection_df['connection_datetime'].dtype)
print("Sample value:", measurement_connection_df['connection_datetime'].iloc[0])

# Add timezone info (dates are already datetime objects).
# Localize only while the column is still timezone-naive, so re-running this cell does not raise.
if measurement_connection_df['connection_datetime'].dt.tz is None:
    measurement_connection_df['connection_datetime'] = measurement_connection_df['connection_datetime'].dt.tz_localize('UTC')

print("\nAfter timezone localization:")
print("connection_datetime type:", measurement_connection_df['connection_datetime'].dtype)

print("\nAfter conversion:")
print("connection_datetime type:", measurement_connection_df['connection_datetime'].dtype)
print("Sample value:", measurement_connection_df['connection_datetime'].iloc[0])

# Get temperature sensor IDs from the database
def get_sensor_id_map(engine):
    """Get mapping of sensor identifiers to their UUIDs from the database."""
    with engine.connect() as conn:
        # Query all temperature sensors
        result = conn.execute(text("""
            SELECT temperature_sensor_id, sensor_identifier 
            FROM temperature_sensor
            WHERE sensor_identifier LIKE 'm7004_ID_%'
        """))
        
        # Create mapping from sensor ID part to UUID
        # e.g., '37F6F9511A64FF28' -> UUID from database
        return {
            identifier.replace('m7004_ID_', ''): uuid 
            for uuid, identifier in result
        }

# Get mapping of sensor identifiers to their database UUIDs
sensor_uuid_map = get_sensor_id_map(engine)

print(f"\nFound {len(sensor_uuid_map)} temperature sensors in database")

# Check which sensors from the logbook exist in our mapping
valid_sensor_mask = measurement_connection_df['temperature_sensor_id'].isin(sensor_uuid_map.keys())
if valid_sensor_mask.any():
    print(f"Found {valid_sensor_mask.sum()} valid temperature sensor entries")
else:
    print("No valid temperature sensor entries found")

# List any sensors that appear in the logbook but aren't in the database
missing_sensors = set(measurement_connection_df['temperature_sensor_id'].dropna()) - set(sensor_uuid_map.keys())
if missing_sensors:
    print(f"Note: {len(missing_sensors)} sensors from logbook are not in database (will be set to NULL)")
    if len(missing_sensors) <= 10:
        print("Missing sensors:", sorted(missing_sensors))

# Map sensors to their UUIDs, setting non-matching ones to NULL
measurement_connection_df['temperature_sensor_id'] = measurement_connection_df['temperature_sensor_id'].map(sensor_uuid_map)

# Print mapping statistics
print(f"\nTemperature sensor mapping summary:")
print(f"Total entries: {len(measurement_connection_df)}")
print(f"Found in database: {valid_sensor_mask.sum()}")
print(f"Successfully mapped: {measurement_connection_df['temperature_sensor_id'].notna().sum()}")
print(f"Set to NULL: {measurement_connection_df['temperature_sensor_id'].isna().sum()}")

print("\nData types after conversion:")
print(measurement_connection_df.dtypes)


Before timezone localization:
connection_datetime type: datetime64[ns]
Sample value: 2022-02-09 00:00:00

After timezone localization:
connection_datetime type: datetime64[ns, UTC]

After conversion:
connection_datetime type: datetime64[ns, UTC]
Sample value: 2022-02-09 00:00:00+00:00

Found 3 temperature sensors in database
Found 2 valid temperature sensor entries
Note: 3 sensors from logbook are not in database (will be set to NULL)
Missing sensors: ['Glued to back glass', 'Temp sensor connected', 'glued-backside']

Temperature sensor mapping summary:
Total entries: 226
Found in database: 2
Successfully mapped: 2
Set to NULL: 224

Data types after conversion:
solar_cell_id                            object
pixel                                    object
tracking_channel_board                    int64
tracking_channel_channel                  int64
temperature_sensor_id                    object
connection_datetime         datetime64[ns, UTC]
event_type                               

In [101]:
# Step 5: Validate Referential Integrity
print("Running referential integrity validation...")

# Check solar_cell_pixel references
valid_pixels = solar_cell_pixel_upload_df.apply(
    lambda row: f"{row['solar_cell_id']}_{row['pixel']}",
    axis=1
).tolist()

test_pixels = measurement_connection_df.apply(
    lambda row: f"{row['solar_cell_id']}_{row['pixel']}",
    axis=1
)

invalid_pixels = ~test_pixels.isin(valid_pixels)
if invalid_pixels.any():
    print("WARNING: Invalid solar cell pixel references found")
    print("Number of invalid references:", invalid_pixels.sum())
    display(measurement_connection_df[invalid_pixels].head())
else:
    print("✓ All solar cell pixel references are valid")

# Check mpp_tracking_channel references
valid_channels = mpp_tracking_channel_df.apply(
    lambda row: f"{row['board']}_{row['channel']}",
    axis=1
).tolist()

test_channels = measurement_connection_df.apply(
    lambda row: f"{row['tracking_channel_board']}_{row['tracking_channel_channel']}",
    axis=1
)

invalid_channels = ~test_channels.isin(valid_channels)
if invalid_channels.any():
    print("\nWARNING: Invalid tracking channel references found")
    print("Number of invalid references:", invalid_channels.sum())
    display(measurement_connection_df[invalid_channels].head())
else:
    print("✓ All tracking channel references are valid")

print("\nReferential integrity validation complete.")

Running referential integrity validation...
✓ All solar cell pixel references are valid
✓ All tracking channel references are valid

Referential integrity validation complete.
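
The check above joins the key parts into one string with an underscore. Because cell names in this logbook can themselves contain underscores (for example `A11_100C`), two different key pairs could in principle produce the same joined string. A sketch of a tuple-based comparison is shown below; the dataframes `pixels` and `events` contain contrived values chosen to expose such a collision and are not taken from the logbook.

```python
import pandas as pd

# Hypothetical parent (pixel table) and child (event table) keys
pixels = pd.DataFrame({"solar_cell_id": ["A_1", "A"], "pixel": ["2", "1"]})
events = pd.DataFrame({"solar_cell_id": ["A", "A", "A_1"], "pixel": ["1", "1_2", "2"]})

# Compare the key columns as tuples instead of joined strings:
# joined with "_", ("A", "1_2") and ("A_1", "2") would both read "A_1_2"
parent_keys = pd.MultiIndex.from_frame(pixels[["solar_cell_id", "pixel"]])
child_keys = pd.MultiIndex.from_frame(events[["solar_cell_id", "pixel"]])

# True where an event refers to a (cell, pixel) pair missing from the pixel table
missing_parent = ~child_keys.isin(parent_keys)
print(events[missing_parent])
```

Here the second event row is reported as unreferenced, whereas a joined-string comparison would accept it.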


In [102]:
# Step 6: Validate Event Sequence Logic
print("Running event sequence validation...")

def check_event_sequence(group):
    """Check if events follow proper sequence"""
    events = group.sort_values('connection_datetime')
    issues = []
    
    # Check for standalone disconnection
    if len(events) == 1 and events.iloc[0]['event_type'] == 'DISCONNECTED':
        issues.append("Disconnection without prior connection")
    
    # Check for consecutive same events
    last_event_type = None
    for _, event in events.iterrows():
        if event['event_type'] == last_event_type:
            issues.append(f"Consecutive {event['event_type']} events")
        last_event_type = event['event_type']
    
    return issues

# Group by cell and pixel and check sequence
sequence_issues = []
for (cell_id, pixel), group in measurement_connection_df.groupby(['solar_cell_id', 'pixel']):
    issues = check_event_sequence(group)
    if issues:
        sequence_issues.append({
            'cell_id': cell_id,
            'pixel': pixel,
            'issues': issues
        })

if sequence_issues:
    print("Event sequence issues found:")
    for issue in sequence_issues:
        print(f"\nCell: {issue['cell_id']}, Pixel: {issue['pixel']}")
        for problem in issue['issues']:
            print(f"  - {problem}")
else:
    print("✓ All event sequences are valid - no issues found")
    print(f"  - Validated {len(measurement_connection_df)} events")
    print(f"  - Checked {measurement_connection_df.groupby(['solar_cell_id', 'pixel']).ngroups} unique cell-pixel combinations")

print("\nEvent sequence validation complete.")

Running event sequence validation...
✓ All event sequences are valid - no issues found
  - Validated 226 events
  - Checked 80 unique cell-pixel combinations

Event sequence validation complete.
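
Only 80 cell-pixel combinations were checked although the pixel table holds 113 rows. This would be consistent with pandas skipping groups whose key contains a missing value, which is the default behaviour of `groupby`. The sketch below, on invented events, shows the effect of `dropna=False` together with a check that the earliest event of each group is a connection; the function name `first_event_is_connection` and the sample data are assumptions for illustration.

```python
import pandas as pd

# Hypothetical events; the cell "X-1" has no pixel recorded
events = pd.DataFrame({
    "solar_cell_id": ["T2-2", "T2-2", "X-1", "X-1"],
    "pixel": ["C", "C", None, None],
    "event_type": ["CONNECTED", "DISCONNECTED", "DISCONNECTED", "CONNECTED"],
    "connection_datetime": pd.to_datetime(
        ["2022-02-09", "2022-07-15", "2022-03-01", "2022-08-01"], utc=True
    ),
})


def first_event_is_connection(group: pd.DataFrame) -> bool:
    """Return True if the earliest event of the group is a connection."""
    ordered = group.sort_values("connection_datetime", kind="stable")
    return ordered.iloc[0]["event_type"] == "CONNECTED"


keys = ["solar_cell_id", "pixel"]

# Default behaviour: groups whose key contains a missing value are skipped
groups_default = events.groupby(keys).ngroups

# dropna=False keeps them, so cells without a pixel are validated too
results = {
    key: first_event_is_connection(group)
    for key, group in events.groupby(keys, dropna=False)
}
print(groups_default, len(results))
```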


In [103]:
# Step 7: Validate Chronological Order
print("Running chronological order validation...")

def check_chronological_order(group):
    """Check if events are in proper chronological order"""
    events = group.sort_values('connection_datetime')
    issues = []
    
    dates = events['connection_datetime'].tolist()
    types = events['event_type'].tolist()
    
    for i in range(len(dates)-1):
        if types[i] == 'DISCONNECTED' and types[i+1] == 'CONNECTED':
            if dates[i] >= dates[i+1]:
                issues.append(f"Disconnection ({dates[i]}) not before next connection ({dates[i+1]})")
    
    return issues

# Check chronological order for each cell/pixel
chronology_issues = []
total_events_checked = 0
unique_pairs_checked = 0

for (cell_id, pixel), group in measurement_connection_df.groupby(['solar_cell_id', 'pixel']):
    issues = check_chronological_order(group)
    if issues:
        chronology_issues.append({
            'cell_id': cell_id,
            'pixel': pixel,
            'issues': issues
        })
    total_events_checked += len(group)
    unique_pairs_checked += 1

if chronology_issues:
    print("Chronological order issues found:")
    for issue in chronology_issues:
        print(f"\nCell: {issue['cell_id']}, Pixel: {issue['pixel']}")
        for problem in issue['issues']:
            print(f"  - {problem}")
else:
    print("✓ All events are in proper chronological order")
    print(f"  - Validated {total_events_checked} events")
    print(f"  - Checked {unique_pairs_checked} unique cell-pixel combinations")

print("\nChronological order validation complete.")

Running chronological order validation...
✓ All events are in proper chronological order
  - Validated 160 events
  - Checked 80 unique cell-pixel combinations

Chronological order validation complete.
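
The 160 events counted here are fewer than the 226 rows in `measurement_connection_df`, and the gap matches the 66 rows with a null `pixel` reported in the final data summary. A likely explanation is that pandas `groupby` leaves out rows whose group key is null unless `dropna=False` is passed. The following is a minimal sketch on made-up rows; the toy `events` frame is an assumption for illustration, not the logbook data.

```python
import pandas as pd

# Toy data: two events with a pixel label, two events without one
events = pd.DataFrame({
    "solar_cell_id": ["T2-2", "T2-2", "CEL070923SN111", "CEL070923SN111"],
    "pixel": ["C", "C", None, None],
    "event_type": ["CONNECTED", "DISCONNECTED", "CONNECTED", "DISCONNECTED"],
})

keys = ["solar_cell_id", "pixel"]

# Default behaviour: rows with a null key are silently left out
default_sizes = events.groupby(keys).size()

# dropna=False keeps null keys as their own group
full_sizes = events.groupby(keys, dropna=False).size()

print(f"default:      {len(default_sizes)} group(s), {default_sizes.sum()} rows")
print(f"dropna=False: {len(full_sizes)} group(s), {full_sizes.sum()} rows")
```

If devices without a pixel label should also be validated, passing `dropna=False` to the `groupby` calls in the validation cells would be one option to consider.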


In [104]:
# Step 8: Final Data Summary
print("Final measurement connection event data summary:")
print("-" * 50)
print(f"Total events: {len(measurement_connection_df)}")
print(f"Connection events: {len(measurement_connection_df[measurement_connection_df['event_type'] == 'CONNECTED'])}")
print(f"Disconnection events: {len(measurement_connection_df[measurement_connection_df['event_type'] == 'DISCONNECTED'])}")
print(f"Unique cells: {measurement_connection_df['solar_cell_id'].nunique()}")
print(f"Unique pixels: {measurement_connection_df.groupby('solar_cell_id')['pixel'].nunique().sum()}")
print(f"Date range: {measurement_connection_df['connection_datetime'].min()} to {measurement_connection_df['connection_datetime'].max()}")

print("\nColumns with null values:")
null_counts = measurement_connection_df.isnull().sum()
print(null_counts[null_counts > 0])

# Display sample of final data
print("\nSample of prepared data:")
display(measurement_connection_df.sample(n=min(5, len(measurement_connection_df))))

Final measurement connection event data summary:
--------------------------------------------------
Total events: 226
Connection events: 113
Disconnection events: 113
Unique cells: 67
Unique pixels: 80
Date range: 2020-01-01 00:00:00+00:00 to 2024-03-26 00:00:00+00:00

Columns with null values:
pixel                     66
temperature_sensor_id    224
irradiance_sensor_id     226
mppt_mode                226
mppt_polarity            226
dtype: int64

Sample of prepared data:


| | solar_cell_id | pixel | tracking_channel_board | tracking_channel_channel | temperature_sensor_id | connection_datetime | event_type | irradiance_sensor_id | mppt_mode | mppt_polarity |
|---|---|---|---|---|---|---|---|---|---|---|
| 102 | Device 27 | E / L | 5 | 11 | | 2023-10-12 00:00:00+00:00 | CONNECTED | | | |
| 110 | CEL070923SN111 | | 4 | 20 | | 2023-12-07 00:00:00+00:00 | CONNECTED | | | |
| 118 | QE-060122-1 | A | 3 | 19 | | 2022-07-15 00:00:00+00:00 | DISCONNECTED | | | |
| 106 | Device 13 | C / M | 5 | 16 | | 2023-10-12 00:00:00+00:00 | CONNECTED | | | |
| 122 | QE-060122-3 | D | 3 | 23 | | 2022-07-15 00:00:00+00:00 | DISCONNECTED | | | |


## 6. Execute the Data Upload Process

With the data validated, we upload the tables in dependency order so that every referenced row exists before the rows that point to it:

1. Scientists first, because other tables reference them
2. Solar cell devices next (they need the scientists to exist)
3. Solar cell pixels after that (they need the devices to exist)
4. Measurement connection events last (they need the pixels to exist)
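
Because the steps depend on each other, a failure in a later step can leave the earlier tables already filled. One way to avoid that is to run all inserts in a single transaction. The sketch below illustrates the idea with the standard-library `sqlite3` module as a stand-in for the PostgreSQL/TimescaleDB connection; `upload_in_order`, the two-table schema, and the sample names are hypothetical and much simpler than the real tables.

```python
import sqlite3

def upload_in_order(conn, batches):
    """Insert (table, columns, rows) batches in the given order, all or nothing."""
    try:
        # "with conn" commits if the block succeeds and rolls back if it raises
        with conn:
            for table, columns, rows in batches:
                placeholders = ", ".join("?" for _ in columns)
                sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
                conn.executemany(sql, rows)
        return True
    except sqlite3.IntegrityError as exc:
        print(f"Upload rolled back: {exc}")
        return False

# Hypothetical, simplified schema: a parent table and a child that references it
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE scientist (name TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE solar_cell_device ("
    "name TEXT PRIMARY KEY, "
    "scientist TEXT NOT NULL REFERENCES scientist(name))"
)

# Parent rows first, then the child rows that reference them
first_ok = upload_in_order(conn, [
    ("scientist", ("name",), [("Alice",)]),
    ("solar_cell_device", ("name", "scientist"), [("Device 1", "Alice")]),
])

# The duplicate device name fails, so the new scientist is rolled back too
second_ok = upload_in_order(conn, [
    ("scientist", ("name",), [("Bob",)]),
    ("solar_cell_device", ("name", "scientist"),
     [("Device 2", "Bob"), ("Device 2", "Bob")]),
])

scientist_count = conn.execute("SELECT COUNT(*) FROM scientist").fetchone()[0]
print(first_ok, second_ok, scientist_count)
```

With the SQLAlchemy engine used in this notebook, a comparable effect should be achievable by opening `with engine.begin() as conn:` and passing `conn` instead of `engine` to each `to_sql` call, though that has not been tested against this schema.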

In [105]:
# Create database connection
def create_db_connection(config=DB_CONFIG):
    """
    Create a SQLAlchemy database engine from configuration.
    
    Args:
        config: Dictionary containing database connection parameters
        
    Returns:
        SQLAlchemy engine instance
    """
    try:
        connection_string = f"postgresql://{config['user']}:{config['password']}@{config['host']}:{config['port']}/{config['database']}"
        engine = create_engine(connection_string)
        # Expose the connection string (which includes the password) as
        # engine.connection_string for external access
        engine.connection_string = connection_string

        # Test the connection with a trivial query
        with engine.connect() as conn:
            conn.execute(text("SELECT 1"))
        logging.info(f"Database connection successful: {config['host']}:{config['port']}/{config['database']}")
        return engine
    except Exception as e:
        logging.error(f"Database connection failed: {str(e)}")
        raise
    
try:
    engine = create_db_connection()
    logging.info("Database connection established successfully")
except Exception as e:
    logging.error(f"Failed to connect to database: {str(e)}")
    raise

# Track upload statistics
stats = {
    'files_processed': 1,  # We process one Excel file
    'files_skipped': 0,
    'files_error': 0,
    'rows_inserted': 0,
    'start_time': datetime.now(timezone.utc),
    'records': {
        'scientists': 0,
        'devices': 0,
        'pixels': 0,
        'connections': 0
    }
}

try:
    # 1. Upload scientists
    logging.info("Uploading scientists...")
    scientist_df.to_sql('scientist', engine, if_exists='append', index=False)
    stats['records']['scientists'] = len(scientist_df)
    stats['rows_inserted'] += len(scientist_df)
    logging.info(f"Successfully uploaded {len(scientist_df):,} scientists")

    # 2. Upload solar cell devices
    logging.info("\nUploading solar cell devices...")
    solar_cell_device_df.to_sql('solar_cell_device', engine, if_exists='append', index=False)
    stats['records']['devices'] = len(solar_cell_device_df)
    stats['rows_inserted'] += len(solar_cell_device_df)
    logging.info(f"Successfully uploaded {len(solar_cell_device_df):,} solar cell devices")

    # 3. Upload solar cell pixels
    logging.info("\nUploading solar cell pixels...")
    solar_cell_pixel_upload_df.to_sql('solar_cell_pixel', engine, if_exists='append', index=False)
    stats['records']['pixels'] = len(solar_cell_pixel_upload_df)
    stats['rows_inserted'] += len(solar_cell_pixel_upload_df)
    logging.info(f"Successfully uploaded {len(solar_cell_pixel_upload_df):,} solar cell pixels")

    # 4. Upload measurement connection events
    logging.info("\nUploading measurement connection events...")
    measurement_connection_df.to_sql('measurement_connection_event', engine, if_exists='append', index=False)
    stats['records']['connections'] = len(measurement_connection_df)
    stats['rows_inserted'] += len(measurement_connection_df)
    logging.info(f"Successfully uploaded {len(measurement_connection_df):,} measurement connection events")

    # Calculate duration
    stats['end_time'] = datetime.now(timezone.utc)
    stats['duration_seconds'] = (stats['end_time'] - stats['start_time']).total_seconds()
    
    logging.info(f"Processing complete. "
                 f"Uploaded {stats['rows_inserted']:,} total records "
                 f"in {stats['duration_seconds']:.2f} seconds.")

except Exception as e:
    stats['files_error'] += 1
    logging.error(f"Error during upload: {str(e)}")
    raise

2025-05-20 15:45:59,611 - INFO - Database connection successful: localhost:5432/perocube
2025-05-20 15:45:59,613 - INFO - Database connection established successfully
2025-05-20 15:45:59,614 - INFO - Uploading scientists...


2025-05-20 15:46:01,357 - INFO - Successfully uploaded 15 scientists
2025-05-20 15:46:01,358 - INFO - 
Uploading solar cell devices...
2025-05-20 15:46:01,376 - INFO - Successfully uploaded 67 solar cell devices
2025-05-20 15:46:01,377 - INFO - 
Uploading solar cell pixels...
2025-05-20 15:46:01,386 - INFO - Successfully uploaded 113 solar cell pixels
2025-05-20 15:46:01,386 - INFO - 
Uploading measurement connection events...
2025-05-20 15:46:01,405 - ERROR - Error during upload: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "measurement_connection_event_pkey"
DETAIL:  Key (solar_cell_id, connection_datetime)=(T2-2, 2022-02-09 00:00:00+00) already exists.

[SQL: INSERT INTO measurement_connection_event (solar_cell_id, pixel, tracking_channel_board, tracking_channel_channel, temperature_sensor_id, connection_datetime, event_type, irradiance_sensor_id, mppt_mode, mppt_polarity) VALUES (%(solar_cell_id__0)s, %( ... 59341 characters truncated ... 5)s, %(

IntegrityError: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "measurement_connection_event_pkey"
DETAIL:  Key (solar_cell_id, connection_datetime)=(T2-2, 2022-02-09 00:00:00+00) already exists.

[SQL: INSERT INTO measurement_connection_event (solar_cell_id, pixel, tracking_channel_board, tracking_channel_channel, temperature_sensor_id, connection_datetime, event_type, irradiance_sensor_id, mppt_mode, mppt_polarity) VALUES (%(solar_cell_id__0)s, %( ... 59341 characters truncated ... 5)s, %(event_type__225)s, %(irradiance_sensor_id__225)s, %(mppt_mode__225)s, %(mppt_polarity__225)s)]
[parameters: {'event_type__0': 'CONNECTED', 'tracking_channel_channel__0': 14, 'mppt_polarity__0': None, 'irradiance_sensor_id__0': None, 'mppt_mode__0': None, 'tracking_channel_board__0': 3, 'pixel__0': 'C', 'connection_datetime__0': datetime.datetime(2022, 2, 9, 0, 0, tzinfo=datetime.timezone.utc), 'solar_cell_id__0': 'T2-2', 'temperature_sensor_id__0': None, 'event_type__1': 'CONNECTED', 'tracking_channel_channel__1': 15, 'mppt_polarity__1': None, 'irradiance_sensor_id__1': None, 'mppt_mode__1': None, 'tracking_channel_board__1': 3, 'pixel__1': 'E', 'connection_datetime__1': datetime.datetime(2022, 2, 9, 0, 0, tzinfo=datetime.timezone.utc), 'solar_cell_id__1': 'T2-2', 'temperature_sensor_id__1': None, 'event_type__2': 'CONNECTED', 'tracking_channel_channel__2': 16, 'mppt_polarity__2': None, 'irradiance_sensor_id__2': None, 'mppt_mode__2': None, 'tracking_channel_board__2': 3, 'pixel__2': 'B', 'connection_datetime__2': datetime.datetime(2022, 2, 9, 0, 0, tzinfo=datetime.timezone.utc), 'solar_cell_id__2': 'GK-6-1', 'temperature_sensor_id__2': None, 'event_type__3': 'CONNECTED', 'tracking_channel_channel__3': 17, 'mppt_polarity__3': None, 'irradiance_sensor_id__3': None, 'mppt_mode__3': None, 'tracking_channel_board__3': 3, 'pixel__3': 'D', 'connection_datetime__3': datetime.datetime(2022, 2, 9, 0, 0, tzinfo=datetime.timezone.utc), 'solar_cell_id__3': 'GK-6-1', 'temperature_sensor_id__3': None, 'event_type__4': 'CONNECTED', 'tracking_channel_channel__4': 18, 'mppt_polarity__4': None, 'irradiance_sensor_id__4': None, 'mppt_mode__4': None, 'tracking_channel_board__4': 3, 'pixel__4': 'F', 'connection_datetime__4': datetime.datetime(2022, 2, 9, 0, 0, tzinfo=datetime.timezone.utc), 'solar_cell_id__4': 'GK-6-1', 'temperature_sensor_id__4': None ... 2160 parameters truncated ... 
'event_type__221': 'DISCONNECTED', 'tracking_channel_channel__221': 18, 'mppt_polarity__221': None, 'irradiance_sensor_id__221': None, 'mppt_mode__221': None, 'tracking_channel_board__221': 5, 'pixel__221': 'E / L', 'connection_datetime__221': datetime.datetime(2024, 3, 18, 0, 0, tzinfo=datetime.timezone.utc), 'solar_cell_id__221': 'Device 13', 'temperature_sensor_id__221': None, 'event_type__222': 'DISCONNECTED', 'tracking_channel_channel__222': 19, 'mppt_polarity__222': None, 'irradiance_sensor_id__222': None, 'mppt_mode__222': None, 'tracking_channel_board__222': 4, 'pixel__222': None, 'connection_datetime__222': datetime.datetime(2024, 3, 26, 0, 0, tzinfo=datetime.timezone.utc), 'solar_cell_id__222': 'CEL070923SN104', 'temperature_sensor_id__222': None, 'event_type__223': 'DISCONNECTED', 'tracking_channel_channel__223': 20, 'mppt_polarity__223': None, 'irradiance_sensor_id__223': None, 'mppt_mode__223': None, 'tracking_channel_board__223': 4, 'pixel__223': None, 'connection_datetime__223': datetime.datetime(2024, 3, 26, 0, 0, tzinfo=datetime.timezone.utc), 'solar_cell_id__223': 'CEL070923SN111', 'temperature_sensor_id__223': None, 'event_type__224': 'DISCONNECTED', 'tracking_channel_channel__224': 21, 'mppt_polarity__224': None, 'irradiance_sensor_id__224': None, 'mppt_mode__224': None, 'tracking_channel_board__224': 4, 'pixel__224': None, 'connection_datetime__224': datetime.datetime(2024, 3, 26, 0, 0, tzinfo=datetime.timezone.utc), 'solar_cell_id__224': 'CEL070923SN213', 'temperature_sensor_id__224': None, 'event_type__225': 'DISCONNECTED', 'tracking_channel_channel__225': 22, 'mppt_polarity__225': None, 'irradiance_sensor_id__225': None, 'mppt_mode__225': None, 'tracking_channel_board__225': 4, 'pixel__225': None, 'connection_datetime__225': datetime.datetime(2024, 3, 26, 0, 0, tzinfo=datetime.timezone.utc), 'solar_cell_id__225': 'CEL070923SN21X', 'temperature_sensor_id__225': None}]
(Background on this error at: https://sqlalche.me/e/20/gkpj)
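
The error names the primary key `(solar_cell_id, connection_datetime)`, and the first two parameter sets in the failed statement are both for `T2-2` at `2022-02-09` (pixels `C` and `E`), so the batch itself appears to contain rows that collide on that key. A pre-upload check could surface such rows before the insert is attempted. The sketch below is one possible approach, not part of the original notebook; `find_key_collisions` is a hypothetical helper, and the three sample rows are copied from the parameters shown in the traceback.

```python
import pandas as pd

def find_key_collisions(df, key_columns):
    """Return every row whose key_columns values occur more than once in df."""
    mask = df.duplicated(subset=key_columns, keep=False)
    return df[mask].sort_values(key_columns)

# Three of the events from the failed INSERT (first parameter sets in the traceback)
events = pd.DataFrame({
    "solar_cell_id": ["T2-2", "T2-2", "GK-6-1"],
    "pixel": ["C", "E", "B"],
    "connection_datetime": pd.to_datetime(
        ["2022-02-09", "2022-02-09", "2022-02-09"], utc=True
    ),
})

primary_key = ["solar_cell_id", "connection_datetime"]
collisions = find_key_collisions(events, primary_key)

if collisions.empty:
    print("No rows collide on the primary key")
else:
    print(f"{len(collisions)} rows collide on {primary_key}:")
    print(collisions)
```

This only detects collisions inside the DataFrame. Rows already stored in the table from an earlier run would need a separate query against the database.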

## 7. Results Summary

The cell below reports the upload statistics collected in `stats` and then queries the database for the current row count of each table:

In [None]:
def format_duration(seconds):
    """Format duration in a human-readable format"""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = seconds % 60
    if hours > 0:
        return f"{hours}h {minutes}m {secs:.1f}s"
    elif minutes > 0:
        return f"{minutes}m {secs:.1f}s"
    else:
        return f"{secs:.1f}s"

def format_number(n):
    """Format number with thousand separators"""
    return f"{n:,}"

# Display processing statistics
if 'stats' in locals():
    print("📊 File Processing")
    print("━━━━━━━━━━━━━━━")
    print(f"📁 Total files processed:        {format_number(stats['files_processed']):>10}")
    print(f"❌ Files with errors:           {format_number(stats['files_error']):>10}")
    
    print("\n📈 Data Statistics")
    print("━━━━━━━━━━━━━━━")
    print(f"👥 Scientists uploaded:         {format_number(stats['records']['scientists']):>10}")
    print(f"📱 Solar cell devices:         {format_number(stats['records']['devices']):>10}")
    print(f"🔲 Solar cell pixels:          {format_number(stats['records']['pixels']):>10}")
    print(f"🔗 Connection events:          {format_number(stats['records']['connections']):>10}")
    print(f"📝 Total records uploaded:     {format_number(stats['rows_inserted']):>10}")
    
    if 'duration_seconds' in stats:
        duration = format_duration(stats['duration_seconds'])
        print("\n⚡ Performance Metrics")
        print("━━━━━━━━━━━━━━━━━━")
        print(f"⏱️  Total processing time:      {duration:>10}")
        
        if stats['rows_inserted'] > 0 and stats['duration_seconds'] > 0:
            throughput = stats['rows_inserted'] / stats['duration_seconds']
            print(f"🚀 Processing speed:           {format_number(int(throughput)):>10} rows/sec")
    
    # Database verification
    try:
        with engine.connect() as conn:
            print("\n🗄️  Database Status")
            print("━━━━━━━━━━━━━━━")
            result = conn.execute(text("SELECT COUNT(*) FROM scientist"))
            print(f"👥 Total scientists:           {format_number(result.scalar()):>10}")
            
            result = conn.execute(text("SELECT COUNT(*) FROM solar_cell_device"))
            print(f"📱 Total devices:             {format_number(result.scalar()):>10}")
            
            result = conn.execute(text("SELECT COUNT(*) FROM solar_cell_pixel"))
            print(f"🔲 Total pixels:              {format_number(result.scalar()):>10}")
            
            result = conn.execute(text("SELECT COUNT(*) FROM measurement_connection_event"))
            print(f"🔗 Total connections:         {format_number(result.scalar()):>10}")
    except Exception as e:
        print("\n⚠️  Could not verify database status:")
        print(f"   {str(e)}")
else:
    print("❌ No statistics available - processing may have failed")