
Postgres Canary Test Env#19

Merged
kmacrow merged 7 commits into master from data-transfer-dates on Jul 29, 2025

Conversation

@alexdriedger
Collaborator

  • Moved docker-compose.test-pg.yml to its own test-env directory

  • Added 3 tables that automatically get created for the source database in the test docker compose

  • Added a mock data generator to the test docker compose that writes 100 records to each of the 3 tables every 5 minutes

  • Fixed a variety of SQL errors with reading from/writing to Postgres sources/destinations in the canary

  • Fixed an issue with booleans in the SQLite cache

  • Fixed an issue with date column types not being compatible on the second write

  • Updated schema checking to allow for schemas to have different column order

  • Added a lot of unit tests for the sql connectors
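The order-insensitive schema check mentioned above can be sketched roughly as follows; this is a hypothetical illustration where schemas are plain lists of (column_name, type) pairs, not the canary's actual schema objects:

```python
# Hypothetical sketch of an order-insensitive schema comparison.
# Schemas here are lists of (column_name, type_name) pairs for illustration.
def schemas_match(a, b):
    """True when both schemas have the same columns and types, in any order."""
    if len(a) != len(b):
        return False
    # Sorting normalizes column order before comparing
    return sorted(a) == sorted(b)

existing = [("id", "int64"), ("name", "string"), ("created_at", "timestamp")]
incoming = [("name", "string"), ("created_at", "timestamp"), ("id", "int64")]
assert schemas_match(existing, incoming)
assert not schemas_match(existing, incoming[:-1])
```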

Collaborator

@kmacrow kmacrow left a comment


Left some comments but approving - test env looks great. The AI unit tests are kind of slop imho, but it's fine for now. The only thing that might be worth a follow-up look is the datetime.time mapping issue, but it's not worse than it already was.

bytes: pa.binary(),
datetime.date: pa.date32(),
date: pa.date32(),
datetime.time: pa.time64('us'),
Collaborator


I think there's an issue here that's similar to the previous bug I had with the date type:

>>> import datetime
>>> datetime.time
<class 'datetime.time'> # we want this
>>> from datetime import datetime
>>> datetime.time
<method 'time' of 'datetime.datetime' objects> # we're getting this

The workaround I had was to do from datetime import datetime, timedelta, timezone, date, time as dt_time so the class doesn't conflict with time; that way you get the type, not the method.
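The shadowing and the aliasing workaround can be verified with stdlib-only code (the dt alias here is just for the demonstration, so the module name stays visible):

```python
# The module-level name is the time type, which is what the type map needs:
import datetime
assert isinstance(datetime.time, type)  # <class 'datetime.time'>

# After importing the datetime class, .time is an instance method, not a type:
from datetime import datetime as dt
assert not isinstance(dt.time, type)  # method of datetime objects

# The aliasing workaround: import the type under a non-conflicting name.
from datetime import time as dt_time
assert dt_time is datetime.time  # the real time type, safe to use as a map key
```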

UUID: str,
Decimal: float,
dict: str, # JSONB and other dict-like types get converted to string
date: date, # Date type maps to itself (already in PY_TO_PYARROW_MAP)
Collaborator


This 1:1 mapping doesn't hurt anything, but it's kind of redundant and maybe confusing.

if py_type is datetime:
return val.astimezone(timezone.utc)
elif py_type is date:
return val # date objects should be passed through as-is
Collaborator


I'm not sure this special case is needed for date? The val would be returned as-is for date anyway if that 1:1 mapping weren't in the conversion table. The datetime special case exists because it needs the timezone adjustment, and I didn't have a generalized way to mutate the val on the way through.
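A minimal sketch of the point being made, with illustrative names rather than the connector's real API: with no entry for date in the conversion logic, date values fall through unchanged, so the explicit branch is redundant.

```python
from datetime import date, datetime, timezone

def convert_value(val, py_type):
    """Illustrative value-conversion step (not the connector's actual code).

    Only datetime needs mutation (UTC normalization); everything else,
    including date, passes through as-is with no special case.
    """
    if py_type is datetime:
        return val.astimezone(timezone.utc)
    return val

d = date(2025, 7, 29)
assert convert_value(d, date) is d  # dates pass through untouched
```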


# Check if there are any records to process
stream_size = ds.size(stream)
if stream_size == 0:
Collaborator


We should really add this to all of the destinations.
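Generalizing that guard to a destination's write path might look like this sketch; write_stream and the size callback are illustrative names, not the real connector API:

```python
def write_stream(destination, stream, size_fn=len):
    """Skip per-batch setup and writes entirely when the stream is empty.

    Illustrative sketch: `destination` is anything with .append(), and
    `size_fn` stands in for the real ds.size() helper.
    """
    if size_fn(stream) == 0:
        return 0  # nothing to write; avoid opening transactions etc.
    written = 0
    for record in stream:
        destination.append(record)
        written += 1
    return written

sink = []
assert write_stream(sink, []) == 0          # empty stream short-circuits
assert write_stream(sink, [{"id": 1}]) == 1
```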

Boolean: pa.bool_(),
- Date: pa.timestamp('us', tz='UTC'),
+ Date: pa.date32(),
Time: pa.timestamp('us', tz='UTC'),
Collaborator


I think Time needs to be mapped to pa.time64('us') here (and the inverse in the PYARROW_TO_ALCHEMY table?).

from sqlalchemy.dialects.postgresql import UUID, JSONB, TIMESTAMP


class TestPostgresSource:
Collaborator


There are some good test ideas in here, but I'd probably want to refactor it at some point:

  • I think a bunch of these test cases should be in the Stream tests
  • the SqliteCache cases should probably be in a class of their own, independent of any specific connector
  • there's some duplication between the code/logic being tested and actually calling the real code
  • some of the cases are Postgres specific (JSONB) but most could probably be generic SQLSource tests
  • there's quite a bit of SQL destination testing that should be in a different class
  • modules are being imported inside methods: there's sometimes a valid use for that, but I'm not sure it's needed here, and it kind of obscures what's being tested from just the filename and header

@kmacrow kmacrow merged commit 6c2ff2b into master Jul 29, 2025
@kmacrow kmacrow deleted the data-transfer-dates branch July 29, 2025 15:47