v0.31.1

Latest

Latest

github-actions released this 12 Jun 11:25

3c9582c

Added

pw.io.elasticsearch.read reads an Elasticsearch index into Pathway. Since Elasticsearch has no change-data-capture API, the connector ingests by polling and reconciling the overlap between consecutive queries, so no row is missed or delivered twice. It is configured with timestamp_column (a numeric column it watermarks and orders by), id_column (a unique, sortable identifier used to deduplicate the overlap and as the Pathway row key), and max_transaction_duration (how late a row may still become visible). mode="streaming" (default) keeps polling at poll_interval; mode="static" reads the index once. The index is read in bounded pages of read_batch_size documents (default 10 000), each becoming one minibatch, and an idle index is detected and skipped without re-reading the overlap window. With persistence enabled, the connector resumes from the saved watermark, delivering only rows added since the last checkpoint. At startup it warns if timestamp_column or id_column is mapped in a way it cannot poll on (e.g. an id_column mapped as text, which Elasticsearch cannot sort by).
pw.io.clickhouse.write writes a Pathway table to a ClickHouse table over the native protocol. Two output formats are available via output_table_type: the default "stream_of_changes" appends every change with time/diff columns, while "snapshot" maintains the current state in a ReplacingMergeTree keyed by the required primary_key (query it with SELECT ... FINAL). The init_mode parameter ("default", "create_if_not_exists", "replace") controls whether the connector creates the destination table, and the destination is validated at start-up so a missing table, column, or incompatible type is reported immediately. Most scalar, Optional, list, tuple, and 1-D np.ndarray column types are supported (see the connector documentation for the full type mapping).
pw.io.iceberg.read now decodes every Iceberg primitive type. The new arms are: date materializes as DateTimeNaive at midnight on the calendar day (Pathway has no date-only type); time materializes as Duration representing microseconds since midnight (same convention as the Postgres TIME mapping); uuid materializes as the canonical 8-4-4-4-12 hex string when the column is declared as str in the Pathway schema (or as 16 raw bytes when declared as bytes); fixed(N) materializes as bytes; decimal(p, s) materializes as either float (lossy, with a one-shot startup warning naming each affected column) or str (lossless decimal text — opt in by declaring the column as str).
pw.io.iceberg.write now reconciles Pathway types against the destination table's existing schema and writes the narrower / alternatively-encoded representation when the target column already declares one: Pathway int into an existing int (32-bit) column with overflow detection, Pathway float into an existing float (32-bit) column (cast to 32-bit float; precision beyond ~7 significant decimal digits is lost), Pathway str into an existing decimal(p, s) column (parsed as decimal text) or uuid column (parsed as canonical UUID hex), Pathway bytes into an existing fixed(N) column (length-checked), and Pathway Duration into an existing time column (microseconds since midnight). When Pathway creates the destination table from a Pathway schema, the connector continues to emit the wide representations (long from Pathway int / Duration, double from float, string from str, binary from bytes); choosing a narrow / specialized type at create-time isn't exposed yet. Iceberg date is not supported on write at all — neither at create-time (Pathway has no date-only type to derive from) nor as an existing-column override. Iceberg map<K, V> remains unsupported on both sides.
pw.io.iceberg.read and pw.io.iceberg.write now support Iceberg struct<…> columns through Pathway's positional tuple[…] type. Tuples are written with synthesized field names [0], [1], … (same convention pw.io.deltalake.write already uses). Reads ignore struct field names and bind tuple positions to struct field positions in the destination order; the mapping composes transitively, so list[tuple[…]] works as well. When writing into an existing table whose target column declares a struct with arbitrary field names, the writer adopts the destination's field names automatically, so the user's tuple[…] declaration only needs to align with the destination struct's field order — Pathway has no named-record type that would let a tuple bind to struct fields by name, so reordering the destination struct's fields out-of-band would silently misalign a Pathway pipeline declaring the column as tuple[…].
pw.io.mongodb.read now accepts four additional BSON types that were previously dropped at parse time. ObjectId and Decimal128 map to str (the canonical 24-character hex form and the canonical decimal string respectively); RegularExpression maps to str formatted as "/<pattern>/<options>"; Timestamp maps to int carrying the seconds-since-epoch time component (the companion increment field is dropped). When such a value is written back to MongoDB it is stored as an ordinary string (or integer for Timestamp) field rather than under its original BSON type, so a write-then-read round-trip preserves the value but not the original BSON type of the column.
pw.io.postgres.write now accepts a schema_name parameter for writing to tables in non-default PostgreSQL schemas.
pw.io.postgres.write now supports pre-existing INET, CIDR, MACADDR, and MACADDR8 columns from a str Pathway column, matching the reader round-trip.
pw.io.postgres.read and pw.io.postgres.write now run extensive preflight validation that surfaces misconfigurations (PostgreSQL types that are not yet supported, array element type mismatches, nullability mismatches, REPLICA IDENTITY NOTHING on non-append-only streaming tables, etc.) as clear pipeline-start errors instead of silent row drops or opaque worker panics.
pw.io.mysql.read reads a MySQL table into Pathway. In mode="streaming" (the default) it performs Change Data Capture by reading the MySQL binary log: it takes an initial snapshot and then continuously delivers inserts, updates, and deletes (requires log_bin on, binlog_format=ROW, binlog_row_image=FULL, and the REPLICATION SLAVE / REPLICATION CLIENT privileges). In mode="static" it reads the table once and terminates. The schema must declare at least one primary-key column. Every type produced by pw.io.mysql.write round-trips back, and common native MySQL types (DECIMAL, DATE, integer and text families, JSON, …) are parsed as well. Unlike PostgreSQL logical replication, the connector leaves no server-side state behind — there is no replication slot to retain logs and fill the disk; binary-log retention is governed solely by the server's own settings. With persistence enabled, the streaming connector saves the binary-log coordinates and resumes from them on restart, raising a clear error if the needed binary logs have already been purged by the server's normal expiry.

Changed

pw.io.iceberg.read and pw.io.iceberg.write now retry transient catalog errors automatically (e.g. concurrent-commit conflicts on write, transient REST/Glue catalog failures on read).
pw.io.postgres.write now retries transient PostgreSQL errors automatically — SQLSTATE class 08 (connection exceptions), class 57 (admin / crash shutdown, cannot_connect_now), serialization_failure (40001), deadlock_detected (40P01), and any closed connection are retried up to three times with exponential backoff before the writer surfaces the error. Permanent failures (syntax errors, missing tables, constraint violations) still propagate on the first attempt.
pw.io.postgres.read (streaming mode) no longer requires user, password, or host in postgres_settings. Missing components are omitted from the connection string and resolved by PostgreSQL's standard client defaults (OS user, ~/.pgpass, UNIX socket), matching how static mode has always behaved. This unblocks deployments authenticated via trust, peer, cert, or other passwordless pg_hba.conf modes.
pw.io.postgres connections now tag themselves in PostgreSQL as application_name=pathway[:<name>] (where <name> comes from the connector's name parameter), so operators can identify Pathway sessions in pg_stat_activity, pg_stat_replication, and server logs. The value is sanitized to printable ASCII and truncated to 63 bytes to match PostgreSQL's NAMEDATALEN. A user-supplied application_name in postgres_settings is left untouched.
pw.io.postgres connections now default to TCP keepalives tuned for roughly five-minute dead-peer detection (keepalives_idle=300, keepalives_interval=30, keepalives_count=3, plus tcp_user_timeout=300000), so a SIGKILL'd Pathway process releases its temporary replication slot in minutes rather than the OS-inherited ~2 hour timeline. Each value is only applied when the user has not already set it in postgres_settings.
pw.io.mssql.read and pw.io.mssql.write now validate configuration and schemas at call/init time, producing clear errors for cases that previously surfaced as opaque SQL Server failures partway through the run: invalid primary_key (passed in stream_of_changes mode, with duplicates, referring to a different table, or with Optional dtype), schema columns colliding with the auto-appended time/diff columns or differing only in letter case, non-existent source tables or columns, missing or incompatible destination columns (non-existent, IDENTITY, computed, or required NOT NULL columns absent from the Pathway schema), Optional[T] fields mapped to NOT NULL destination columns, and empty or NUL-containing table_name / schema_name.
pw.io.mssql.write snapshot mode now supports bytes-typed primary keys, and verifies that the destination has a unique index covering exactly the configured primary_key columns — without it, the upsert could silently match the wrong rows.
pw.io.mssql.read streaming/CDC mode now handles previously silent edge cases: it errors when more than one CDC capture instance is registered on the source table, recovers the persistence offset from CDC when no events have been observed yet (so subsequent runs resume from CDC instead of re-snapshotting), and raises a clear error if CDC cleanup advances retention past the connector's last read position.
pw.io.mssql.read accepts SQL Server NUMERIC(N, 0) columns into a Pathway int schema and integer-family columns (TINYINT / SMALLINT / INT / BIGINT) into a Pathway float schema, matching the int → float tolerance of pw.io.sqlite.read.
pw.io.mssql.read and pw.io.mssql.write now correctly handle identifiers containing ].
pw.io.kafka.read now emits a DeprecationWarning (instead of a SyntaxWarning) when topic is passed as a list, and warns when an explicitly configured auto.offset.reset is overridden because start_from_timestamp_ms is set. It also logs a warning (previously an easy-to-miss info message) when start_from_timestamp_ms lands at or past the end of a partition, since no already-written data will be read from it.
pw.io.mysql.write now retries only transient MySQL errors — connection drops, deadlocks, lock-wait timeouts, "too many connections", and "server is shutting down" — with exponential backoff, and lets permanent failures (missing tables, syntax errors, constraint violations) propagate on the first attempt. Previously every error was retried up to three times, delaying permanent failures by several seconds before they surfaced.

Fixed

pw.io.mssql.read in mode="streaming" no longer mistakes an unrelated table for a CDC-enabled one. SQL Server leaves a capture instance behind when a CDC-tracked table is dropped without sp_cdc_disable_table, and its dangling source_object_id can later be reused by a brand-new, non-CDC table; the CDC probe matched on that object id alone, so the reader would report the fresh table as CDC-enabled and then tail a stale change table forever instead of failing fast with a "CDC is not enabled on table" error. The probe now also requires the source table to currently exist and have is_tracked_by_cdc = 1.
pw.io.iceberg.read in mode="static" no longer hangs on an Iceberg table that has no current snapshot (e.g. a table Pathway just created but never wrote data to). The reader treated the absence of a snapshot as "wait for one to appear" — which never returned in static mode — and now correctly reports zero rows and exits.
pw.io.iceberg.write's min_commit_frequency now actually rate-limits all commits over the lifetime of the run, not just the first one. Previously the last-commit timestamp was set at writer construction and never updated, so once the initial interval elapsed every subsequent minibatch was committed individually — producing one Iceberg snapshot per minibatch rather than at most one per min_commit_frequency window.
pw.io.iceberg.write into an Iceberg table that was created outside Pathway (for example, a table pre-created via pyiceberg with a hand-rolled schema) now correctly writes each row's column data into the matching destination column. Previously the writer relied on column-position metadata that only happened to line up when Pathway had both created and written the table — so writes into externally-created tables either failed to load on read-back or silently bound data to the wrong destination columns.
pw.io.postgres.read streaming mode now correctly parses negative time components of INTERVAL values in PostgreSQL's default text format.
pw.io.kafka.read, pw.io.kafka.write and pw.io.kafka.simple_read now validate their arguments when the connector is created and raise a clear ValueError/TypeError, instead of deferring to an opaque error from the engine or librdkafka at run time. This covers, among others: an empty or missing topic name; a missing or empty bootstrap.servers; a missing or empty group.id on the reader; non-positive max_backlog_size, parallel_readers or autocommit_duration_ms; a negative start_from_timestamp_ms; a schema that declares the reserved _metadata column; json_field_paths entries that reference an unknown field or use an invalid JSON Pointer; and contradictory combinations such as read_only_new=True with mode="static".
pw.io.kafka.write now rejects configurations that would otherwise silently drop data or emit malformed messages: duplicate header names, header names colliding with the reserved pathway_time/pathway_diff headers, table columns named time/diff under format="json", an empty subject, and a subject / schema_registry_settings pair where only one side is provided or the format is not json.
pw.io.kafka.write now honors its documented sort_by parameter; previously the column reference was lost when the connector rebuilt its output table, raising an error at run time.
pw.io.kafka.read with mode="static" and a start_from_timestamp_ms past the end of the topic now returns immediately, instead of stalling until the static-read polling budget is exhausted.
pw.io.kafka.SchemaRegistrySettings and pw.io.kafka.SchemaRegistryHeader now validate their fields on construction (the urls list shape and non-emptiness, string-typed credentials, mutually exclusive authentication methods, and the timeout type and positivity), instead of failing later with an opaque error.
pw.io.mongodb.read persistence: on restart, the replayed change-stream events are now delivered atomically, preventing an edge case where a crash partway through the replay could skip events that had been read from MongoDB but not yet processed downstream.
pw.io.mongodb.read in mode="streaming" no longer loses change events that are committed around connector startup. The reader now anchors its change stream to a cluster operation time captured before the initial collection snapshot and keeps a single cursor for both catch-up and live tailing, instead of capturing oplog tokens with short-lived streams and then re-opening a separate live stream. The previous handoff left a gap in which an event arriving during startup — most easily triggered when several MongoDB readers start concurrently against the same server — could be skipped and never delivered.
Passing a non-positive max_batch_size to any output connector that accepts it now raises a clear error (max_batch_size must be a positive integer). Previously the value was handled inconsistently: 0 was silently accepted and disabled size-based batching entirely, while a negative value surfaced an opaque OverflowError.
pw.io.mysql.write now rejects, at construction, a schema column named time or diff (case-insensitive) in stream_of_changes mode, where it would collide with the time/diff metadata columns the connector appends. Previously the conflict surfaced mid-run as an opaque MySQL "Duplicate column name" error. Rename the column or switch to output_table_type="snapshot", which does not append these columns.
pw.io.dynamodb.write now correctly handles which column types may serve as a partition_key / sort_key. A bytes column (serialized as the DynamoDB Binary type) and a pw.Json column (serialized as String) are valid scalar key types but were previously rejected with a "can't be used in the index" error; they are now accepted. Conversely, a bool column was previously allowed as a key and the auto-created table declared it as Binary, so every write then failed with an opaque Invalid attribute value type error from DynamoDB — bool is now rejected up front with a clear message, since DynamoDB has no Boolean key type.
pw.io.dynamodb.write no longer fails when a row is updated within a single minibatch. The update emits a retraction of the old row and an insertion of the new one under the same primary key; previously both were sent in the same BatchWriteItem request, which DynamoDB rejects with Provided list of item keys contains duplicates, so the write failed after exhausting its retries. The connector now reconciles the two operations per key within the batch (the insertion wins), so updates are applied as expected under the documented snapshot semantics.
pw.io.questdb.write now validates its designated-timestamp arguments when the connector is created and raises a clear ValueError, instead of deferring to an opaque error from the engine at run time. This covers designated_timestamp_policy="use_column" passed without a designated_timestamp column, and a designated_timestamp column whose type is neither DateTimeNaive nor DateTimeUtc.

Assets 6