-
Notifications
You must be signed in to change notification settings - Fork 0
Database Schema
Database schema for a PostgreSQL database.
- Database Model Schema
- Rationale for Schema
- Length and Size Constraints
- Schema Creation: SQL or SqlAlchemy?
In this application, the database schema are generated from SqlAlchemy model classes in the source file app/models.py.
The sizes of VARCHAR fields are defined in app/core/config.py. As of this writing:
MAX_NAME = 60
# Pydantic Email validators enforce a limit of 64+1+67 chars for username '@' domainname.
MAX_EMAIL = 160For most tables, the recommended specification on an integer id field is:
id INTEGER PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,this is standard SQL and considered very portable. Some schema generated by SqlAlchemy declared the id as SERIAL, which is non-standard and relies on a sequence.
To get an "identity" column use:
from sqlalchemy import Integer, Identity
id: Mapped[int] = mapped_column(
Integer,
Identity(always=False), # or: Identity(generated_by_default=True)
primary_key=True
)The Identity option always=False enables you to manually specify an id value; useful for importing data.
| SqlAlchemy | SQL DDL |
|---|---|
| Identity(always=False) | GENERATED BY DEFAULT AS IDENTITY |
| Identity(always=True) | GENERATED ALWAYS AS IDENTITY |
id may be UUID or INTEGER. UUID guarantees unique ids even in a distributed application, but require 16 bytes versus 4 bytes for INTEGER. I decided that economy of size is more important, so chose INTEGER. Its not much of a restriction to require single point of registration for new users or data sources.
For readings, I intend to switch to UUID after some testing, to enable distributed or off-line data collection.
In Python with SqlAlchemy:
class User(Base):
"""Model for a User than can own DataSources."""
__tablename__ = "users"
id: Mapped[int] = mapped_column(Integer,
Identity(generated_by_default=True), # or: Identity(always=False)
primary_key=True
)
email: Mapped[str] = mapped_column(String(MAX_EMAIL), unique=True, nullable=False)
username: Mapped[str] = mapped_column(String(MAX_NAME))
created_at: Mapped[datetime] = mapped_column(
TIMESTAMP(timezone=True),
nullable=False,
# server_default=func.now()
default=utcnow
)
# updated_at is automatically updated by database?
updated_at: Mapped[datetime] = mapped_column(
TIMESTAMP(timezone=True),
default=utcnow,
# server_default=func.now(),
onupdate=utcnow
)
# a uni-directional relationship (use of UserPassword doesn't need reference to User)
user_password: Mapped["UserPassword"] = relationship(
"UserPassword",
uselist=False,
# lazy="joined", # don't use eager instantiation
cascade="all, delete-orphan"
)
def __str__(self) -> str:
"""Return a string representation of user data."""
return f'id={self.id} "{self.username[:40]}" <{self.email}>'Using server-side generation of default value server_default=func.now() causes errors in async sessions,
so this code applies defaults on the client side. utcnow is a function reference:
def utcnow() -> datetime:
"""Return the current datetime as a timezone aware value."""
return datetime.now(timezone.utc)This produces the schema:
CREATE TABLE users (
user_id INTEGER PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
username VARCHAR(MAX_NAME), -- not necessarily unique
email VARCHAR(MAX_EMAIL) UNIQUE,
created_at TIMESTAMPTZ,
updated_at TIMESTAMPTZ,
-- these two fields not implemented yet. I think last_login is very desirable.
last_login TIMESTAMPTZ,
is_active BOOLEAN DEFAULT TRUE
);to have PostgreSQL assign default timestamps, use:
created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,Not all users will have passwords or be authenticated locally. Store passwords in a separate table:
CREATE TABLE user_passwords (
user_id INTEGER PRIMARY KEY REFERENCES users(id) ON DELETE CASCADE,
password_hash TEXT NOT NULL,
updated_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP -- most recent update
);password_hash can be stored as TEXT, STRING(n), or BYTEA (binary up to 1GB). I chose TEXT because:
- The output of both Argon2id (preferred hash algorithm) and bcrypt are readable strings. bcrypt output is byte encoded but easily converted to ASCII using
string.decode(). - The Argon2id
hash()function outputs a 97-char string, such as:
$argon2id$v=19$m=65536,t=3,p=4$p29WTJJnZcDXPGtorG/LwQ$J+yA7NL8hEp/qiKQQooZt1tCUpOXK2DXjwGoKx9Arvg -
bcrypt
hashpwproduces a 60-byte string (still readable), such as (after decoding):$2b$12$CwTycUXWue0Thq9StjUM0uJ8oyfppUIF2M5n6m8/NBJYf8c6vUZ1u - Both Argon2 and bcrypt store the "salt" as part of the output, so we do not need a
saltfield in the table. - Both
VARCHAR(n)andTEXTresult in same table size (actual string size + 1-4 bytes to record length). I chose TEXT for flexibility (no length limit). - See Password Security for details of password hashing.
In PostgreSQL ORM, to specify user_id as both primary key and foreign key in UserPassword write:
class UserPassword(Base):
...
user_id: Mapped[int] = mapped_column(
Integer,
ForeignKey("users.id", ondelete="CASCADE"),
primary_key=True,
nullable=False
)The app will eventually support multiple authentication methods, including OAuth and LINE's authentication.
Deepseek suggests a table like this:
I had to remind Deepseek of authentication requirements to get it to create this table
-- Separate table for OAuth identities
CREATE TABLE user_oauth_identities (
identity_id UUID PRIMARY KEY,
user_id INTEGER REFERENCES users(user_id) ON DELETE CASCADE,
provider VARCHAR(20) NOT NULL CHECK (provider IN ('line', 'google')),
provider_user_id VARCHAR(100) NOT NULL, -- LINE or Google's unique ID
email VARCHAR(100), -- From provider
name VARCHAR(100), -- From provider
profile_image TEXT, -- URL to avatar
access_token TEXT, -- Encrypted, optional
refresh_token TEXT, -- Encrypted, optional
token_expiry TIMESTAMPTZ,
UNIQUE(provider, provider_user_id) -- One identity per provider
);ChatGPT suggests
CREATE TABLE auth_identities (
id INTEGER PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
user_id INTEGER NOT NULL REFERENCES users(id) ON DELETE CASCADE,
provider VARCHAR(50) NOT NULL, -- e.g. 'line', 'google'
provider_user_id VARCHAR(255) NOT NULL, -- external user ID
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
UNIQUE (provider, provider_user_id)
);Storing access and refresh tokens allows the application to:
- Access user data on their behalf: Some APIs (like LINE, Google) require these tokens to make authorized requests for user data or actions.
- Maintain user sessions: Refresh tokens let code obtain new access tokens without requiring the user to log in again, providing a more seamless experience.
- Support background tasks: The app can perform actions (like syncing data) even when the user is not actively using the app.
Security:
- Only store tokens if the app needs to interact with the provider’s API after login.
- Always encrypt sensitive tokens at rest and follow security best practices.
Me: A better uniqueness constraint would be that a user can have only 1 OAuth identity per provider:
UNIQUE (user_id, provider)Location info for a data source.
CREATE TABLE locations (
id INTEGER PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
name VARCHAR(100) NOT NULL,
address TEXT,
timezone VARCHAR(50) DEFAULT 'UTC',
created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
created_by INTEGER REFERENCES users(id)
);Consider adding geo-location (coordinates).
To preserve referential integrity, add ON DELETE SET NULL or ON DELETE CASCADE to created_by.
CREATE TABLE data_sources (
id INTEGER PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
name VARCHAR(MAX_NAME) NOT NULL,
owner_id INTEGER NOT NULL REFERENCES users(id) ON DELETE CASCADE,
description TEXT, -- additional info such as meter number
metrics JSON,
is_public BOOLEAN DEFAULT FALSE,
created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
-- optional (not implemented)
created_by INTEGER REFERENCES users(id),
source_type VARCHAR(50) NOT NULL, -- 'electric', 'water', 'gas', etc.
location_id INTEGER REFERENCES locations(id) ON DELETE SET NULL,
);The metrics is a JSON (or Postgres JSONB) field containing a map of value names to units. For example
- Electric meter:
{ "reading": "kWh" } - Blood Pressure & Heart:
{ "systolic": "mmHg", "diastolic": "mmHg", "pulse": "bpm" }
To preserve referential integrity, add ON DELETE SET NULL or ON DELETE CASCADE to created_by and/or owner_id.
In this app, we don't have a requirement (yet) for one user to create a data source for a different user so created_by isn't needed.
A "reading" from a Data Source may be a single number or multiple values such as a person's systolic & diastolic blood pressure. To avoid creating related tables or multiple rows per reading, we store all reading data in a JSON field, or JSONB in Postgres (more efficient). The keys in the JSON data must match the keys in the metrics field of the DataSource.
CREATE TABLE readings (
id UUID PRIMARY KEY,
source_id INTEGER REFERENCES data_sources(id) ON DELETE CASCADE,
timestamp TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP, -- actual time that the reading values refer to
values JSON, -- for Postgres use JSONB
created_by_id INTEGER REFERENCES users(id),
);Some ORM model designs for this are:
from uuid import UUID, uuid4
from sqlalchemy import Column, Integer, Float, DateTime, JSON
from sqlalchemy.ext.mutable import MutableDict
class Reading(Base):
__tablename__ = "readings"
__table_args__ = (
Index("ix_reading_timestamp", "timestamp"),
Index("ix_reading_data_source", "data_source_id"),
)
id: Mapped[UUID] = mapped_column(primary_key=True, default=uuid4)
timestamp: Mapped[datetime] = mapped_column(TIMESTAMP(timezone=True), nullable=False)
data_source_id: Mapped[int] = mapped_column(
Integer,
ForeignKey("data_sources.id", ondelete="CASCADE"),
nullable=False
)
values: Mapped[Dict[str, Any]] = mapped_column(
MutableDict.as_mutable(JSONB),
nullable=False
)
data_source: Mapped["DataSource"] = relationship(back_populates="readings")
@validates("values")
def validate_values(self, key: str, values: Dict[str, Any]) -> Dict[str, Any]:
if not isinstance(values, dict):
raise ValueError("Values must be a dictionary")
if not values:
raise ValueError("Values dictionary cannot be empty")
return values
def get_value(self, key: str) -> Optional[float | Any]:
"""Safely get a value with type conversion"""
val = self.values.get(key)
try:
return float(val) if val is not None else None
except (TypeError, ValueError):
return valThe use of MutableDict is so that SqlAlchemy can detect in-place modification of values, such as:
reading.values["temp"] = 28.5 # in place modificationwith a plain Column(JSONB) the ORM would not detect the change and flag the object as "dirty" (need to persist changes).
Use of MutableDict is recommended if the app does in-place updates (instead of always replacing the whole dict,
but can also be achieved for a plain values = Column(JSONB) like this:
from sqlalchemy.orm.attributes import flag_modified
reading.values['temp'] = 28.5
flag_modified(reading, 'values') There may be many readings and we might want to enable off-line recording of readings, so UUID is preferred as key. For better indexing and localized data retrieval consider:
- include
source_idin primary key? - use chronologically ordered UUID such as UUIDv7 or ULID, described in UUID Variants.
Many people may have either read or write access to a table.
This requires a Many-to-Many Relationship between Users and Data Sources.
Deepseek:
CREATE TABLE user_source_access (
access_id UUID PRIMARY KEY,
user_id INTEGER REFERENCES users(id) ON DELETE CASCADE,
source_id INTEGER REFERENCES data_sources(id) ON DELETE CASCADE,
access_level VARCHAR(20) NOT NULL CHECK (access_level IN ('read', 'write', 'admin')),
granted_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
granted_by INTEGER REFERENCES users(id),
UNIQUE(user_id, source_id) -- Ensures one access record per user-source combination
);ChatGPT:
-- Grant view/edit rights on a data_source to other users
CREATE TABLE source_permissions (
user_id INTEGER NOT NULL REFERENCES users(id) ON DELETE CASCADE,
source_id INTEGER NOT NULL REFERENCES data_sources(id) ON DELETE CASCADE,
can_edit BOOLEAN NOT NULL DEFAULT FALSE,
granted_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
PRIMARY KEY (user_id, source_id)
);Only ChatGPT suggested this.
Actual data validation may be more complex, such as requiring non-decreasing values.
-- e.g. enforce min/max or pattern checks before inserting entries
CREATE TABLE data_validators (
id INTEGER PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
source_id INTEGER NOT NULL REFERENCES data_sources(id) ON DELETE CASCADE,
min_value NUMERIC,
max_value NUMERIC,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);Only ChatGPT suggested this table.
-- Log all access, creations, and edits in an append-only table
CREATE TABLE audit_logs (
id INTEGER PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
user_id INTEGER REFERENCES users(id),
source_id INTEGER REFERENCES data_sources(id),
entry_id INTEGER REFERENCES data_entries(id),
event_type VARCHAR(50) NOT NULL, -- e.g. 'view_source', 'create_entry', 'update_entry'
event_time TIMESTAMPTZ NOT NULL DEFAULT NOW(),
details JSONB NOT NULL -- e.g. { "old_value": 123, "new_value": 125 }
);
CREATE INDEX ON audit_logs(event_time);See Rationale Q-and-A below for Deepseek & ChatGPT's answers to specific questions.
ChatGPT's Rationale
-
Timezones: All timestamps use
TIMESTAMPTZto preserve exact moments. - Foreign-key cascades: Deleting a user or source automatically removes related records.
-
Indexes: Add additional indexes as needed, e.g.
on audit_logs(user_id)ordata_entries(created_by). -
Triggers: You can implement triggers on data_entries to enforce validator rules using
data_validatorsand to write audit records automatically.
-
Token Storage
- Encrypt access/refresh tokens in database
- Use HTTP-only, Secure-Flag cookies for web JWTs
- Android: Store in
EncryptedSharedPreferences - Rationale: Prevents token theft via XSS attacks or database breaches |
-
Token Validation
- Verify LINE/Google ID tokens on the backend
- Set JWT expiration, e.g., 1 hour for access, 7 days for refresh
- Rationale: Mitigates replay attacks and limits exposure |
-
Data Access
- Row-Level Security (RLS) in PostgreSQL:
CREATE POLICY user_source_access ON readings USING (source_id IN ( SELECT source_id FROM user_source_access WHERE user_id = current_user_id() ) ); - **Rationale**: Enforces "users can only access sources they have permissions for" at DB level
-
API Security
- Rate-limit, e.g., 100 requests/minute
- CORS restricted to your domains
- HSTS headers enforced
- Rationale: Prevents brute-force attacks and unauthorized cross-origin access |
-
OAuth Hardening
- Store client secrets in environment variables
- Validate
redirect_uristrictly - Use PKCE for mobile apps
- Rationale: Blocks OAuth injection and redirect hijacking | |
-
LINE-Specific
- Verify the
nonceparameter to prevent CSRF - Check
client_idmatches your LINE Channel ID
- Verify the
-
Google-Specific
- Validate
hd(hosted domain) parameter if restricting to organizational emails - Use
prompt=select_accountto prevent accidental session reuse
- Validate
-
Unified JWT Claims
Include both authentication methods in your JWT:{ "sub": "user_123", "auth_provider": "line|google", "auth_time": 1625097600, "scope": "read:data write:data" }
Q: Why use UUID for ID fields instead of INTEGER?
Deepseek:
UUID Pros:
- Global Uniqueness: safe for distributed systems, e.g. mobile apps with offline data sync
- Avoid lock contention when multiple users create records simultaneously
- Obfuscation: harder to scan/guess, e.g.
/users/a1b2c3d4-xxxvs/users/102 - Merge friendly: no id collisions if merge databases
UUID Cons:
- More storage: 16 bytes for UUID versus 4 bytes for INTEGER
- Index fragmentation. Use
UUIDv7(time ordered) to improve index locality. Example for PostgreSQL 14+ (turns out to be wrong):reading_id UUID DEFAULT uuid_generate_v7() PRIMARY KEY, - Readability - hard for humans to read. For this app, ids would not normally be displayed, but may make log files harder to read.
INTEGER GENERATED AS IDENTITY Pros:
- Simple: easy to understand and debug
- Human friendly. More legible in logs.
- Slightly faster to index using B-trees. Indexes are more compact.
- Integer indices yield better query performance, esp. for large tables.
-
Monolithic Growth: Newer rows always have higher keys, which helps keep inserts localized to the right end of the index and reduces page splits.
- Migitation: you can use time-ordered UUIDs, e.g. UUIDv7 or ULID, but that adds complexity.
When to prefer INTEGER:
- Small, single-server app
- Tables with less than 100K records
- Where integer joins are critical for performance
Deepseek Recommendation for HomeLog: Stick with UUIDs because
- Your
user_source_accesstable benefits from collision-safe distributed writes - Mobile clients may generate IDs offline (readings)
- Future-proof for scaling (sharding, multi-region)
ChatGPT Recommendation for HomeLog: Use Integer for primary keys
Given HomeLog’s scope — only a handful of users, centralized deployment on Supabase/Railway, and no need for offline key generation, you’ll benefit most from using integer-based primary keys:
- Performance: Smaller indexes and faster joins on time-series data (your readings table will grow quickly).
- Simplicity: Easier schema introspection, debugging, and meaningful URLs.
- Operational overhead: No additional complexity for key generation or migration.
If in the future HomeLog evolves into a massively distributed system with cross-region writes or must merge data across silos, you can migrate to UUIDs (or ULIDs) at that time.
For now, stick with INTEGER GENERATED AS IDENTITY for clean, performant, and maintainable design.
Deepseek Recommendation if Off-line Writes are Not Required
If you don't require offline writes and your system operates purely in a centralized, online-only mode, the recommendation shifts toward traditional sequential IDs. Here's the updated guidance:
Use GENERATED ALWAYS AS IDENTITY (PostgreSQL 10+):
CREATE TABLE readings (
reading_id INTEGER PRIMARY KEY GENERATED ALWAYS AS IDENTITY, -- Auto-incrementing integer
...
);Why?
- Better performance (4-byte integers vs 16-byte UUIDs)
- Natural sorting (no fragmentation)
- Simpler joins (integer indexes are smaller/faster)
- Built-in sequence tracking (no risk of duplicates)
Add a separate public_id UUID column if you need obscurity:
CREATE TABLE users (
user_id INTEGER PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
public_id UUID DEFAULT gen_random_uuid() UNIQUE, -- For URLs/APIs
email TEXT UNIQUE NOT NULL
);Usage:
-
Internal joins: Use
user_id(integer) -
Public APIs: Expose only
public_id(UUID)
| Primary Key | UUID | INTEGER GENERATED AS IDENTITY |
|---|---|---|
| Distributed System | Offline Writes, multi-server | Single database server |
| Offline writes | Safe | No. Online only |
| Sorting | Require UUIDv7 or ULID | Natural integer order |
| Storage | 16 bytes per ID | 4 bytes per ID |
| Joins | Slightly slower | Optimized |
- 20-30% faster joins on integer keys (based on pgBench tests)
- ~15% storage reduction for large tables
Explicit size limits are placed on string fields. This is to limit database table size (incurs cost) and for security (defend against attacker sending huge values as input data).
For consistency we use named constants for field sizes in both SqlAlchemy classes and Pydantic schema classes.
The named constants are defined in app/core/config.py. The lengths are characters, not bytes (even for non-Latin characters requiring multiple bytes per char).
Pydantic has its own length limit for EmailStr! For "username@domain", the max length of username is 64 chars and max length of domainname is 67 chars. A total of 64+1+67 = 132 chars.
This is less than the 254 char limit in RFC 5321.
Pydantic's EmailStr uses this built-in email validator.
Complete list of Pydantic String Types on https://docs.pydantic.dev. NameEmail accepts strings like Santa Claus <santa@xmas.org>.
To add size limits to Pydantic schemas, use this syntax:
email: EmailStr = Field(..., max_length=MAX_EMAIL) # "..." means a required field
username: Optional[str] = Field(None, max_length=MAX_NAME)The database table schema can be defined using SQL statements or using SqlAlchemy. Both SqlAlchemy Core and ORM components can create database tables from definitions given in Python code.
Each approach has its strengths and weaknesses
| Aspect | SqlAlchemy | Direct SQL |
|---|---|---|
| Integration | Models and logic defined in Python | Tables defined in SQL, logic in Python |
| DRY? | Yes - models define table structure | No - models duplicate table structure |
| Migrations | Automatic using Alembic | Manual, but can automate with Alembic |
| Programmatic Reuse | Yes. Can regenerate tables from code. | Somewhat tied to database. |
| Database agnostic | Yes | No |
| Type Safety | IDE support for type checking & hinting | Yes if you define models for tables |
| Model evolution | Schema automatically updated as models change | Manually revise schema. |
| Data validation | In code | Done by database but can add Pydantic schema |
| Precision schema | Less than raw SQL, limited access to Postgres features | Full access to database capabilities |
| Docker Integration | Harder - need Python code to create database in db container | Easy & reproducible using init script in db container |
ChatGPT's Recommendation for this Application
| Layer | Approach | Rationale |
|---|---|---|
| Initialization | Use sql in Docker | Quick and reproducable using init.sql in container volume |
| Application | Use SqlAlchemy | Clean Pythonic access to models at expense of duplication |
| Migrations | Use Alembic | Optional - use if schema evolves over time |