feat: Allow an inline stream map to set output stream name dynamically #2502

menzenski · 2024-06-24T12:09:37Z

Feature scope

Other

Description

See this thread (and this message in particular) in the Meltano Slack for more context.

Here's the scenario I'd like to be able to implement:

Given a tap that produces messages to a generic database_records stream, where each record has a namespace object with database and collection properties (for example, "namespace": {"database": "customer_service", "collection": "Customer"}), I'd like to dynamically split the database_records stream into many streams, one for each combination of namespace.database and namespace.collection values. The example record with "namespace": {"database": "customer_service", "collection": "Customer"} should be mapped to a new stream with stream_id customer_service-Customer (the hyphenated name lets us take advantage of handling in the target to write this record to a specific table).
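To make the requested naming rule concrete, here is a minimal sketch of how the output stream name could be derived from a record. The function name `stream_id_for` is hypothetical and is not part of the Singer SDK; it only illustrates the database-collection convention described above.

```python
def stream_id_for(record: dict) -> str:
    """Build a hyphenated stream ID from the record's namespace object.

    Hypothetical helper for illustration only; not an SDK API.
    """
    namespace = record["namespace"]
    return f"{namespace['database']}-{namespace['collection']}"


example = {"namespace": {"database": "customer_service", "collection": "Customer"}}
print(stream_id_for(example))
```

A record missing either property would raise a KeyError here; a real implementation would need to decide whether to skip such records or route them to a fallback stream.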


edgarrmondragon · 2024-06-25T10:01:02Z

This is even tougher than I imagined and I think you were hinting at this in Slack. The way stream maps currently work is by generating:

one SCHEMA message for each stream map
one RECORD message for each record and stream map

So if a stream with 10 records has 2 stream maps applied to it, then the resulting mapped stream will emit:

2 SCHEMA messages
20 RECORD messages
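The counting above can be reproduced with a small sketch. This is not the SDK's stream map implementation; `apply_stream_maps` and its parameters are hypothetical names used only to show that SCHEMA messages depend on the configured maps while RECORD messages depend on both maps and records.

```python
from collections import Counter
from typing import Iterable, Iterator


def apply_stream_maps(schema: dict, records: Iterable[dict], aliases: list[str]) -> Iterator[dict]:
    """Emit one SCHEMA per alias, then one RECORD per (record, alias) pair."""
    # SCHEMA messages are known up front: they depend only on the maps.
    for alias in aliases:
        yield {"type": "SCHEMA", "stream": alias, "schema": schema, "key_properties": []}
    # Every record is emitted once for each configured map.
    for record in records:
        for alias in aliases:
            yield {"type": "RECORD", "stream": alias, "record": record}


schema = {"type": "object", "properties": {"id": {"type": "integer"}}}
records = [{"id": i} for i in range(10)]
counts = Counter(m["type"] for m in apply_stream_maps(schema, records, ["map_a", "map_b"]))
print(counts["SCHEMA"], counts["RECORD"])
```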

Now, in the current implementation the generated SCHEMA messages don't depend on the individual records, only on the stream map expression. With this proposal however, SCHEMA messages would also need to be dynamic since the stream alias would depend on the contents of each individual record, so this change would require some non-trivial refactoring of the stream maps implementation.

That is, for any implementation of a "stream splitter" like this, based on the SDK stream maps or not, the following transformation would have to occur:

Original Singer output:

{"type": "SCHEMA", "stream": "tenant_resources", "schema": {"properties": {"tenant_id": {"type": "string"}, "resource": {"type": "string"}}, "type": "object"}, "key_properties": []}
{"type": "RECORD", "stream": "tenant_resources", "record": {"tenant_id": "tenant_001", "resource": "resource_A"}}
{"type": "RECORD", "stream": "tenant_resources", "record": {"tenant_id": "tenant_002", "resource": "resource_A"}}
{"type": "RECORD", "stream": "tenant_resources", "record": {"tenant_id": "tenant_001", "resource": "resource_B"}}
{"type": "RECORD", "stream": "tenant_resources", "record": {"tenant_id": "tenant_002", "resource": "resource_B"}}
{"type": "RECORD", "stream": "tenant_resources", "record": {"tenant_id": "tenant_001", "resource": "resource_C"}}

Transformed output based on the tenant_id property:

{"type": "SCHEMA", "stream": "tenant_001-resources", "schema": {"properties": {"resource": {"type": "string"}}, "type": "object"}, "key_properties": []}
{"type": "RECORD", "stream": "tenant_001-resources", "record": {"resource": "resource_A"}}
{"type": "SCHEMA", "stream": "tenant_002-resources", "schema": {"properties": {"resource": {"type": "string"}}, "type": "object"}, "key_properties": []}
{"type": "RECORD", "stream": "tenant_002-resources", "record": {"resource": "resource_A"}}
{"type": "SCHEMA", "stream": "tenant_001-resources", "schema": {"properties": {"resource": {"type": "string"}}, "type": "object"}, "key_properties": []}
{"type": "RECORD", "stream": "tenant_001-resources", "record": {"resource": "resource_B"}}
{"type": "SCHEMA", "stream": "tenant_002-resources", "schema": {"properties": {"resource": {"type": "string"}}, "type": "object"}, "key_properties": []}
{"type": "RECORD", "stream": "tenant_002-resources", "record": {"resource": "resource_B"}}
{"type": "SCHEMA", "stream": "tenant_001-resources", "schema": {"properties": {"resource": {"type": "string"}}, "type": "object"}, "key_properties": []}
{"type": "RECORD", "stream": "tenant_001-resources", "record": {"resource": "resource_C"}}

Notice that, because records arrive in arbitrary order, the same SCHEMA message may be generated multiple times. With some smart caching, so that a SCHEMA is emitted only the first time its stream is seen, this could be simplified to:

{"type": "SCHEMA", "stream": "tenant_001-resources", "schema": {"properties": {"resource": {"type": "string"}}, "type": "object"}, "key_properties": []}
{"type": "RECORD", "stream": "tenant_001-resources", "record": {"resource": "resource_A"}}
{"type": "SCHEMA", "stream": "tenant_002-resources", "schema": {"properties": {"resource": {"type": "string"}}, "type": "object"}, "key_properties": []}
{"type": "RECORD", "stream": "tenant_002-resources", "record": {"resource": "resource_A"}}
{"type": "RECORD", "stream": "tenant_001-resources", "record": {"resource": "resource_B"}}
{"type": "RECORD", "stream": "tenant_002-resources", "record": {"resource": "resource_B"}}
{"type": "RECORD", "stream": "tenant_001-resources", "record": {"resource": "resource_C"}}
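The cached variant could be sketched as a standalone filter over Singer messages, as below. This is an illustration under stated assumptions, not how the SDK's stream maps work: `split_stream`, `key`, and `suffix` are hypothetical names, messages are assumed to be already parsed into dicts, and the "cache" is simply a set of derived stream names whose SCHEMA has already been emitted.

```python
import copy
from typing import Iterable, Iterator


def split_stream(messages: Iterable[dict], key: str, suffix: str) -> Iterator[dict]:
    """Split RECORD messages into per-value streams named '<value>-<suffix>'."""
    source_schemas: dict[str, dict] = {}  # source stream name -> its SCHEMA message
    emitted: set[str] = set()  # derived streams whose SCHEMA was already sent

    for msg in messages:
        if msg["type"] == "SCHEMA":
            # Hold the source schema back; derived schemas are emitted lazily.
            source_schemas[msg["stream"]] = msg
            continue
        if msg["type"] != "RECORD":
            yield msg  # pass other message types (e.g. STATE) through unchanged
            continue

        record = dict(msg["record"])
        target = f"{record.pop(key)}-{suffix}"  # the split key moves into the stream name
        if target not in emitted:
            source = source_schemas[msg["stream"]]
            schema = copy.deepcopy(source["schema"])
            schema["properties"].pop(key, None)
            yield {
                "type": "SCHEMA",
                "stream": target,
                "schema": schema,
                "key_properties": [k for k in source["key_properties"] if k != key],
            }
            emitted.add(target)
        yield {"type": "RECORD", "stream": target, "record": record}
```

Fed the original tenant_resources output above with key "tenant_id" and suffix "resources", this yields the seven-message sequence shown. It deliberately ignores harder questions such as a source SCHEMA changing mid-stream or STATE bookmarks that refer to the original stream name.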

menzenski added kind/Feature New feature or request valuestream/SDK labels Jun 24, 2024

edgarrmondragon added the discussion label Jun 25, 2024
