Database Schema

Database schema for a PostgreSQL database.

Database Model Schema
Rationale for Schema
Length and Size Constraints
Schema Creation: SQL or SqlAlchemy?

Database Model Schema

In this application, the database schema are generated from SqlAlchemy model classes in the app.models source code module (app/models.py). Schema here for reference.

1. Users

CREATE TABLE users (
    user_id       INTEGER      PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
    username      VARCHAR(50),                            -- not necessarily unique
    email         VARCHAR(100) UNIQUE,
    created_at    TIMESTAMPTZ  DEFAULT CURRENT_TIMESTAMP,
    updated_at    TIMESTAMPTZ  DEFAULT CURRENT_TIMESTAMP, -- and auto-update
    -- these two fields not implemented yet. I think last_login is very desirable.
    last_login    TIMESTAMPTZ,
    is_active     BOOLEAN      DEFAULT TRUE
);

user_id may be UUID or INTEGER. UUID guarantees unique ids even in a distributed application, but require 16 bytes versus 4 bytes for INTEGER. I decided that size economy is more important, so chose INTEGER. Its not much of a restriction to require single point of registration for new users.

2. Passwords

Not all users will have passwords or be authenticated locally. Store passwords in a separate table:

CREATE TABLE user_passwords (
    user_id       INTEGER      PRIMARY KEY REFERENCES users(id) ON DELETE CASCADE,
    password_hash TEXT         NOT NULL,
    updated_at    TIMESTAMPTZ  DEFAULT CURRENT_TIMESTAMP -- most recent update
);

password_hash can be stored as TEXT, STRING(n), or BYTEA (binary up to 1GB). I chose TEXT because:

The output of both Argon2id (preferred hash algorithm) and bcrypt are readable strings. bcrypt output is byte encoded but easily converted to ASCII using string.decode().
The Argon2id hash() function outputs a 97-char string, such as:
$argon2id$v=19$m=65536,t=3,p=4$p29WTJJnZcDXPGtorG/LwQ$J+yA7NL8hEp/qiKQQooZt1tCUpOXK2DXjwGoKx9Arvg
bcrypt hashpw produces a 60-byte string (still readable), such as (after decoding): $2b$12$CwTycUXWue0Thq9StjUM0uJ8oyfppUIF2M5n6m8/NBJYf8c6vUZ1u
Both Argon2 and bcrypt store the "salt" as part of the output, so we do not need a salt field in the table.
Both VARCHAR(n) and TEXT result in same table size (actual string size + 1-4 bytes to record length). I chose TEXT for flexibility (no length limit).
See Password Security for details of password hashing.

In PostgreSQL ORM, to specify user_id as both primary key and foreign key in UserPassword write:

class UserPassword(Base):
    ...
    user_id: Mapped[int] = mapped_column(
                           Integer,
                           ForeignKey("users.id", ondelete="CASCADE"),
                           primary_key=True,
                           nullable=False
                           )

3. User Identities for Multiple Authentication Providers

The app will eventually support multiple authentication methods, including OAuth and LINE's authentication.

Deepseek suggested a table like this:

I had to remind Deepseek of authentication requirements to get it to create this table)

-- Separate table for OAuth identities
CREATE TABLE user_oauth_identities (
    identity_id       UUID          PRIMARY KEY,
    user_id           INTEGER       REFERENCES users(user_id) ON DELETE CASCADE,
    provider          VARCHAR(20)   NOT NULL CHECK (provider IN ('line', 'google')),
    provider_user_id  VARCHAR(100)  NOT NULL, -- LINE or Google's unique ID
    email             VARCHAR(100), -- From provider
    name              VARCHAR(100), -- From provider
    profile_image     TEXT,         -- URL to avatar
    access_token      TEXT,         -- Encrypted, optional
    refresh_token     TEXT,         -- Encrypted, optional
    token_expiry      TIMESTAMPTZ,
    UNIQUE(provider, provider_user_id) -- One identity per provider
);

ChatGPT suggested

CREATE TABLE auth_identities (
  id                  SERIAL       PRIMARY KEY,
  user_id             INTEGER      NOT NULL REFERENCES users(id) ON DELETE CASCADE,
  provider            VARCHAR(50)  NOT NULL,        -- e.g. 'line', 'google'
  provider_user_id    VARCHAR(255) NOT NULL,        -- external user ID
  created_at          TIMESTAMPTZ  NOT NULL DEFAULT NOW(),
  UNIQUE (provider, provider_user_id)
);

Storing access and refresh tokens allows the application to:

Access user data on their behalf: Some APIs (like LINE, Google) require these tokens to make authorized requests for user data or actions.
Maintain user sessions: Refresh tokens let you obtain new access tokens without requiring the user to log in again, providing a seamless experience.
Support background tasks: Your app can perform actions (like syncing data) even when the user is not actively using the app.

Security:

Only store tokens if the app needs to interact with the provider’s API after login.
Always encrypt sensitive tokens at rest and follow security best practices.

Me: A better uniqueness constraint would be that a user can have only 1 OAuth identity per provider:

  UNIQUE (user_id, provider)

4. Locations

CREATE TABLE locations (
    id                UUID          PRIMARY KEY,
    name              VARCHAR(100)  NOT NULL,
    address           TEXT,
    timezone          VARCHAR(50)   DEFAULT 'UTC',
    created_at        TIMESTAMPTZ   DEFAULT CURRENT_TIMESTAMP,
    created_by        UUID          REFERENCES users(user_id)
);

Consider adding: geo-coordinates

Copilot: To preserve referential integrity, add ON DELETE SET NULL or ON DELETE CASCADE to created_by.

5. Data Sources

CREATE TABLE data_sources (
    id             INTEGER       PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
    name           VARCHAR(100)  NOT NULL,
    owner_id       INTEGER       NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    description    TEXT,                           -- additional info such as meter number
    unit           VARCHAR(20)   NOT NULL,         -- 'kWh', 'gallons', 'm³', etc.
    is_public      BOOLEAN       DEFAULT FALSE,
    created_at     TIMESTAMPTZ   DEFAULT CURRENT_TIMESTAMP,

    -- optional
    created_by     INTEGER       REFERENCES users(user_id),
    source_type    VARCHAR(50)   NOT NULL,  -- 'electric', 'water', 'gas', etc.
    location_id    INTEGER       REFERENCES locations(id) ON DELETE SET NULL,
);

TO DO

Each reading of a data source may have multiple values. Two changes are needed:

Add an array of value names or labels (strings).
Replace unit with an array of unit names.

To preserve referential integrity, add ON DELETE SET NULL or ON DELETE CASCADE to created_by and/or owner_id.
In our app, we don't have a requirement (yet) for one user to create a data source for a different user so created_by isn't needed.

6. Readings or Data Values

A "reading" from a Data Source may be a single number with timestamp, or multiple numbers such as a person's systolic & diastolic blood pressure. Instead of storing each reading in a separate row with a column to identify the value, use JSON or JSONB (Postgres supports, more efficient). For Sqlite, a JSON value is stored in a TEXT field.

CREATE TABLE readings (
    id             UUID         PRIMARY KEY,
    source_id      UUID         REFERENCES data_sources(id) ON DELETE CASCADE,
    timestamp      TIMESTAMPTZ  DEFAULT CURRENT_TIMESTAMP, -- actual time that the reading values refer to
    values         JSONB,
    recorded_at    TIMESTAMPTZ  DEFAULT CURRENT_TIMESTAMP, -- optional, if record date can differ from reading date
    recorded_by    INTEGER      REFERENCES users(id),      -- if someone other than owner can store a reading
);

Some ORM model designs for this are:

from uuid import UUID, uuid4
from sqlalchemy import Column, Integer, Float, DateTime, JSON
from sqlalchemy.ext.mutable import MutableDict
class Reading(Base):
    __tablename__ = "readings"
    __table_args__ = (
        Index("ix_reading_timestamp", "timestamp"),
        Index("ix_reading_data_source", "data_source_id"),
    )  
    id: Mapped[UUID] = mapped_column(primary_key=True, default=uuid4)
    timestamp: Mapped[datetime] = mapped_column(TIMESTAMP(timezone=True), nullable=False)
    data_source_id: Mapped[int] = mapped_column(
                                    Integer,
                                    ForeignKey("data_sources.id",  ondelete="CASCADE"),
                                    nullable=False
                                    )
    values: Mapped[Dict[str, Any]] = mapped_column(
                                    MutableDict.as_mutable(JSONB),
                                    nullable=False
                                    )
    
    data_source: Mapped["DataSource"] = relationship(back_populates="readings")
    
    @validates("values")
    def validate_values(self, key: str, values: Dict[str, Any]) -> Dict[str, Any]:
        if not isinstance(values, dict):
            raise ValueError("Values must be a dictionary")
        if not values:
            raise ValueError("Values dictionary cannot be empty")
        return values
    
    def get_value(self, key: str) -> Optional[float | Any]:
        """Safely get a value with type conversion"""
        val = self.values.get(key)
        try:
            return float(val) if val is not None else None
        except (TypeError, ValueError):
            return val

The use of MutableDict is so that SqlAlchemy can detect in-place modification of values, such as:

   reading.values["temp"] = 28.5   # in place modification

with a plain Column(JSONB) the ORM would not detect the change and flag the object as "dirty" (need to persist changes). Use of MutableDict is recommended if the app does in-place updates (instead of always replacing the whole dict, but can also be achieved for a plain values = Column(JSONB) like this:

from sqlalchemy.orm.attributes import flag_modified

reading.values['temp'] = 28.5
flag_modified(reading, 'values')

UUID as Primary Key

There may be many readings and we might want to enable off-line recording of readings, so UUID is preferred as key. For better indexing and localized data retrieval consider:

include source_id in primary key?
use chronologically ordered UUID such as UUIDv7 or ULID, described in UUID Variants.

7. Permissions

A Many-to-Many Relationship between Users and Data Sources

Deepseek:

CREATE TABLE user_source_access (
    access_id     UUID        PRIMARY KEY,
    user_id       UUID        REFERENCES users(user_id)          ON DELETE CASCADE,
    source_id     UUID        REFERENCES data_sources(source_id) ON DELETE CASCADE,
    access_level  VARCHAR(20) NOT NULL CHECK (access_level IN ('read', 'write', 'admin')),
    granted_at    TIMESTAMP   WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    granted_by    UUID        REFERENCES users(user_id),
    UNIQUE(user_id, source_id)  -- Ensures one access record per user-source combination
);

ChatGPT:

-- Grant view/edit rights on a data_source to other users
CREATE TABLE source_permissions (
  user_id      INTEGER     NOT NULL REFERENCES users(id) ON DELETE CASCADE,
  source_id    INTEGER     NOT NULL REFERENCES data_sources(id) ON DELETE CASCADE,
  can_edit     BOOLEAN     NOT NULL DEFAULT FALSE,
  granted_at   TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  PRIMARY KEY (user_id, source_id)
);

8. Optional Data Validators

Only ChatGPT suggested this.

Actual data validation may be more complex, such as requiring non-decreasing values.

-- e.g. enforce min/max or pattern checks before inserting entries
CREATE TABLE data_validators (
  id             SERIAL      PRIMARY KEY,
  source_id      INTEGER     NOT NULL REFERENCES data_sources(id) ON DELETE CASCADE,
  min_value      NUMERIC,
  max_value      NUMERIC,
  created_at     TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

8. Audit Logging

Only ChatGPT suggested this table.

-- Log all access, creations, and edits in an append-only table
CREATE TABLE audit_logs (
  id             SERIAL       PRIMARY KEY,
  user_id        INTEGER      REFERENCES users(id),
  source_id      INTEGER      REFERENCES data_sources(id),
  entry_id       INTEGER      REFERENCES data_entries(id),
  event_type     VARCHAR(50)  NOT NULL,        -- e.g. 'view_source', 'create_entry', 'update_entry'
  event_time     TIMESTAMPTZ  NOT NULL DEFAULT NOW(),
  details        JSONB        NOT NULL        -- e.g. { "old_value": 123, "new_value": 125 }
);
CREATE INDEX ON audit_logs(event_time);

Rationale for Schema

See Rationale Q-and-A below for Deepseek & ChatGPT's answers to specific questions.

ChatGPT's Rationale

Timezones: All timestamps use TIMESTAMPTZ to preserve exact moments.
Foreign-key cascades: Deleting a user or source automatically removes related records.
Indexes: Add additional indexes as needed, e.g. on audit_logs(user_id) or data_entries(created_by).
Triggers: You can implement triggers on data_entries to enforce validator rules using data_validators and to write audit records automatically.

Security Considerations

Token Storage
- Encrypt access/refresh tokens in database
- Use HTTP-only, Secure-Flag cookies for web JWTs
- Android: Store in EncryptedSharedPreferences
- Rationale: Prevents token theft via XSS attacks or database breaches |
Token Validation
- Verify LINE/Google ID tokens on the backend
- Set JWT expiration, e.g., 1 hour for access, 7 days for refresh
- Rationale: Mitigates replay attacks and limits exposure |

Data Access

Row-Level Security (RLS) in PostgreSQL:

CREATE POLICY user_source_access
ON readings USING (source_id IN (
     SELECT source_id FROM user_source_access WHERE user_id = current_user_id()
                  )   ); 
- **Rationale**: Enforces "users can only access sources they have permissions for" at DB level

API Security
- Rate-limit, e.g., 100 requests/minute
- CORS restricted to your domains
- HSTS headers enforced
- Rationale: Prevents brute-force attacks and unauthorized cross-origin access |
OAuth Hardening
- Store client secrets in environment variables
- Validate redirect_uri strictly
- Use PKCE for mobile apps
- Rationale: Blocks OAuth injection and redirect hijacking | |

Key Additions for LINE/Google Auth

LINE-Specific
- Verify the nonce parameter to prevent CSRF
- Check client_id matches your LINE Channel ID
Google-Specific
- Validate hd (hosted domain) parameter if restricting to organizational emails
- Use prompt=select_account to prevent accidental session reuse

Unified JWT Claims
Include both authentication methods in your JWT:

{
  "sub": "user_123",
  "auth_provider": "line|google",
  "auth_time": 1625097600,
  "scope": "read:data write:data"
}

Schema Rationale Q-and-A

Q: Why use UUID for ID fields instead of SERIAL?

Deepseek:

UUID Pros:

Global Uniqueness: safe for distributed systems, e.g. mobile apps with offline data sync
Avoid lock contention when multiple users create records simultaneously
Obfuscation: harder to scan/guess, e.g. /users/a1b2c3d4-xxx vs /users/102
Merge friendly: no id collisions if merge databases

UUID Cons:

More storage: 16 bytes for UUID versus 4 bytes for SERIAL
Index fragmentation. Use UUIDv7 (time ordered) to improve index locality. Example for PostgreSQL 14+ (turns out to be wrong):
```
reading_id UUID  DEFAULT uuid_generate_v7() PRIMARY KEY,
```
Readability - hard for humans to read. For this app, ids would not normally be displayed, but may make log files harder to read.

SERIAL or modern equivalent INTEGER GENERATED AS IDENTITY Pros:

Simple: easy to understand and debug
Human friendly. More legible in logs.
Slightly faster to index using B-trees. Indexes are more compact.
Integer indices yield better query performance, esp. for large tables.
Monolithic Growth: Newer rows always have higher keys, which helps keep inserts localized to the right end of the index and reduces page splits.
- Migitation: you can use time-ordered UUIDs, e.g. UUIDv7 or ULID, but that adds complexity.

When to prefer SERIAL:

Small, single-server app
Tables with less than 100K records
Where integer joins are critical for performance

If using UUID, then to optimize for analytics use:

   -- Add a SERIAL column for internal reporting
   ALTER TABLE readings ADD COLUMN reading_number SERIAL;

Deepseek Recommendation for HomeLog: Stick with UUIDs because

Your user_source_access table benefits from collision-safe distributed writes
Mobile clients may generate IDs offline (readings)
Future-proof for scaling (sharding, multi-region)

ChatGPT Recommendation for HomeLog: Use Integer for primary keys

Given HomeLog’s scope — only a handful of users, centralized deployment on Supabase/Railway, and no need for offline key generation, you’ll benefit most from using integer-based primary keys:

Performance: Smaller indexes and faster joins on time-series data (your readings table will grow quickly).
Simplicity: Easier schema introspection, debugging, and meaningful URLs.
Operational overhead: No additional complexity for key generation or migration.

If in the future HomeLog evolves into a massively distributed system with cross-region writes or must merge data across silos, you can migrate to UUIDs (or ULIDs) at that time.

For now, stick with SERIAL (or better yet, GENERATED AS IDENTITY) for clean, performant, and maintainable design.

Deepseek Recommendation if Off-line Writes are Not Required

If you don't require offline writes and your system operates purely in a centralized, online-only mode, the recommendation shifts toward traditional sequential IDs. Here's the updated guidance:

1. For Most Tables use GENERATED AS IDENTITY

Use GENERATED ALWAYS AS IDENTITY (PostgreSQL 10+):

CREATE TABLE readings (
    reading_id INTEGER PRIMARY KEY GENERATED ALWAYS AS IDENTITY,  -- Auto-incrementing integer
    source_id INTEGER NOT NULL REFERENCES data_sources(source_id),
    value DOUBLE PRECISION NOT NULL,
    timestamp TIMESTAMPTZ NOT NULL
);

Why?

Better performance (4-byte integers vs 16-byte UUIDs)
Natural sorting (no fragmentation)
Simpler joins (integer indexes are smaller/faster)
Built-in sequence tracking (no risk of duplicates)

2. For Tables Needing External IDs

Add a separate public_id UUID column if you need obscurity:

CREATE TABLE users (
    user_id     INTEGER      PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    public_id   UUID         DEFAULT gen_random_uuid() UNIQUE,  -- For URLs/APIs
    email       TEXT         UNIQUE NOT NULL
);

Usage:

Internal joins: Use user_id (integer)
Public APIs: Expose only public_id (UUID)

Comparison of UUID and Integer Primary Keys

Primary Key	UUID	`INTEGER GENERATED AS IDENTITY`
Distributed System	Offline Writes, multi-server	Single database server
Offline writes	Safe	No. Online only
Sorting	Require UUIDv7 or ULID	Natural integer order
Storage	16 bytes per ID	4 bytes per ID
Joins	Slightly slower	Optimized

When to Use UUIDs

If you anticipate future sharding/multi-region deployment
For any tables that might sync with mobile devices later

Performance Impact

20-30% faster joins on integer keys (based on pgBench tests)
~15% storage reduction for large tables

Length and Size Constraints

Explicit size limits should be placed on fields, particularly strings. This is to reduce database table size (incurs cost) and improve security (defend against attackers sending huge values as input data).

For consistency we use these size constraints in both SqlAlchemy model classes and Pydantic schema classes. Named constants for these values are defined in app/core/config.py, and used in models and schema code.

Field	Maximum Length (chars)
name	50
description	80
address	160
email	160, as explained below

These lengths are characters, not bytes. For non-latin UNIcode chars requiring multi-byte storage, the limits still work in my tests. For example, a 50-char Thai name is correctly stored.

Pydantic has its own length limits for EmailStr! For "username@domain", the max length of username is 64 chars and max length of domainname is 67 chars. A total of 64+1+67 = 132 chars. Far less than the 254 char limit in RFC 5321. So, for a smaller database table I chose 160 instead of 254 chars.

Pydantic's EmailStr uses this built-in email validator.

Complete list of Pydantic String Types on https://docs.pydantic.dev. NameEmail accepts strings like Santa Claus <santa@xmas.org>. To add size limits to Pydantic schemas, use this syntax:

    email: EmailStr = Field(..., max_length=160)   # "..." means a required field
    username: Optional[str] = Field(None, max_length=50)

Schema Creation: SQL or SqlAlchemy?

The database table schema can be defined using SQL statements or using SqlAlchemy. Both SqlAlchemy Core and ORM components can create database tables from definitions given in Python code.

Each approach has its strengths and weaknesses

Aspect	SqlAlchemy	Direct SQL
Integration	Models and logic defined in Python	Tables defined in SQL, logic in Python
DRY?	Yes - models define table structure	No - models duplicate table structure
Migrations	Automatic using Alembic	Manual, but can automate with Alembic
Programmatic Reuse	Yes. Can regenerate tables from code.	Somewhat tied to database.
Database agnostic	Yes	No
Type Safety	IDE support for type checking & hinting	Yes if you define models for tables
Model evolution	Schema automatically updated as models change	Manually revise schema.
Data validation	In code	Done by database but can add Pydantic schema
Precision schema	Less than raw SQL, limited access to Postgres features	Full access to database capabilities
Docker Integration	Harder - need Python code to create database in db container	Easy & reproducible using init script in db container

ChatGPT's Recommendation for this Application

Layer	Approach	Rationale
Initialization	Use sql in Docker	Quick and reproducable using `init.sql` in container volume
Application	Use SqlAlchemy	Clean Pythonic access to models at expense of duplication
Migrations	Use Alembic	Optional - use if schema evolves over time

How to Add Alembic for Schema Migration

install and add to requirements.txt if used in container

pip install alembic

Initialize

alembic init alembic

Configure alembic.ini

sqlalchemy.url = postgresql+asyncpg://postgres:fat_chance@localhost:5432/homelog

In alembic/env.py import the models:

# import models
from app.models import user
target_metadata = user.Base.metadata

Generate a migration

alembic revision --autogenerate -m "add more fields to user"
alembic upgrade head

Database Schema

Database Model Schema

1. Users

2. Passwords

3. User Identities for Multiple Authentication Providers

4. Locations

5. Data Sources

TO DO

6. Readings or Data Values

UUID as Primary Key

7. Permissions

8. Optional Data Validators

8. Audit Logging

Rationale for Schema

Security Considerations

Key Additions for LINE/Google Auth

Schema Rationale Q-and-A

1. For Most Tables use GENERATED AS IDENTITY

2. For Tables Needing External IDs

Comparison of UUID and Integer Primary Keys

When to Use UUIDs

Performance Impact

Length and Size Constraints

Schema Creation: SQL or SqlAlchemy?

How to Add Alembic for Schema Migration

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally