feat: filter discovery to selected streams #1234
Comments
@kgpayne interesting idea! How were you thinking we'd pass the selection criteria to the tap during discovery? Catalogs are usually the product of discovery, so one wouldn't exist prior to that. This could be specific to my implementation of environments, but if we did use the existing selection syntax I think I'd still need to set the selections multiple times (i.e. once per environment) so I can inject my alternative databases and schemas, right? Ideally, for most use cases, I'd be able to set the selection criteria once in the top-level config and then override the database/schema somehow in each environment config. That would help me avoid selections getting out of sync across my environments. Am I understanding correctly? Do you have an idea of how I could get around that? |
@kgpayne, @pnadolny13 - I like the idea of doing this generically. What if we built something like a native I wrote this up as another option here: As I tried to imagine doing this during discovery via traditional selection logic, I kept running into the fact that you have to first run discovery in order to deselect anything from it. As well as I understand it, the discovery solution (generally) would need to run twice before seeing benefit: Pseudocode
And unless we still check for undiscovered tables, a discovery-based solution isn't safe for enabling by default. The
Does this meet the requirement? |
@aaronsteers I really like this proposal 🙌 Will add comments there 👍
@pnadolny13 I would expect this feature would solve the 'selection criteria per env' case 🙂 e.g.
environments:
  - name: prod
    env:
      SOURCE_DB_ID: db1_live
  - name: uat
    env:
      SOURCE_DB_ID: db1_uat
plugins:
  extractors:
    - name: tap-mysql-db1
      inherit_from: tap-mysql
      select:
        - ${SOURCE_DB_ID}-table1.*
        - ${SOURCE_DB_ID}-table2.*
        - ${SOURCE_DB_ID}-table3.*
      metadata:
        "*":
          replication-method: LOG_BASED
|
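To illustrate how a single set of select rules could resolve differently per environment, here is a minimal Python sketch. It assumes `${VAR}` placeholders are expanded from each environment's `env` mapping before selection is applied; `expand_select` is a hypothetical helper for illustration, not Meltano's actual implementation.

```python
from string import Template

# Select rules written once, with the database ID left as a placeholder.
SELECT = [
    "${SOURCE_DB_ID}-table1.*",
    "${SOURCE_DB_ID}-table2.*",
    "${SOURCE_DB_ID}-table3.*",
]

# Per-environment variables, mirroring the `environments` block above.
ENVIRONMENTS = {
    "prod": {"SOURCE_DB_ID": "db1_live"},
    "uat": {"SOURCE_DB_ID": "db1_uat"},
}


def expand_select(patterns, env):
    """Expand ${VAR} placeholders in each select rule (hypothetical helper)."""
    return [Template(pattern).substitute(env) for pattern in patterns]


# One resolved rule list per environment.
resolved = {name: expand_select(SELECT, env) for name, env in ENVIRONMENTS.items()}
```

With this approach the rule list is maintained in one place, and only the variable values differ between `prod` and `uat`.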
This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the |
Still relevant |
Very much relevant still, we are facing this in MeltanoLabs/tap-postgres#215 |
A good way to handle this that would allow us to adhere to the singer spec is to have meltano pass the select, metadata (etc) fields as configuration options to the tap via config.json. The tap would use these to limit discovery properly. It'd take a bit of work on the sdk side to mimic meltano, but this is probably easiest on meltano users while still allowing the tap to run without meltano |
@kgpayne I read through #1350 and unless I'm missing something I think my idea here may be easier than adding a whole new configuration option. From the tap side it'd look like a configuration option as you mention, maybe the
The difference is that the config that meltano would pass to the tap would look different.
Before changes, config.json passed to tap:
{
  "groups": "meltano",
  "start_date": "2020-01-01T00:00:00Z"
}
After changes, config.json passed to tap:
{
  "groups": "meltano",
  "start_date": "2020-01-01T00:00:00Z",
  "_select": [
    "sourcedbid12345-table1.*",
    "sourcedbid12345-table2.*",
    "sourcedbid12345-table3.*"
  ]
}
The beauty of this is that:
Downsides
|
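As a rough sketch of the tap side of this idea, the snippet below narrows a list of candidate stream names using the `_select` key from the config.json example above. The function names `stream_patterns` and `filter_discovery` are hypothetical, and the sketch assumes each rule has the form `stream.property` (split on the first dot) and that exclusion rules starting with `!` are ignored at discovery time. It is not how the SDK or any existing tap does it.

```python
from fnmatch import fnmatchcase


def stream_patterns(select):
    """Reduce 'stream.property' select rules to their stream part (assumed format)."""
    patterns = set()
    for rule in select:
        if rule.startswith("!"):
            # Exclusion rules cannot safely narrow discovery; skipped in this sketch.
            continue
        stream, _, _prop = rule.partition(".")
        patterns.add(stream)
    return patterns


def filter_discovery(candidates, config):
    """Return only the candidate stream names matched by config['_select']."""
    select = config.get("_select")
    if not select:
        return list(candidates)  # no hint passed: discover everything, as today
    patterns = stream_patterns(select)
    if not patterns:
        return list(candidates)  # only exclusions given: nothing to narrow by
    return [
        name
        for name in candidates
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


config = {
    "groups": "meltano",
    "start_date": "2020-01-01T00:00:00Z",
    "_select": [
        "sourcedbid12345-table1.*",
        "sourcedbid12345-table2.*",
        "sourcedbid12345-table3.*",
    ],
}
candidates = ["sourcedbid12345-table1", "sourcedbid12345-table9", "sourcedbid12345-table3"]
selected = filter_discovery(candidates, config)
```

Because the tap falls back to full discovery when `_select` is absent, it would still run unchanged outside Meltano.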
@visch re: downside 1. - I'd be happy for the SDK implementation of selection to take precedence over Meltano's. I.e. we advertise "selection" as a capability for SDK-based Taps and 'turn off' the Meltano selection engine in favour of pass-through. For legacy/non-SDK Taps (that don't advertise a "selection" capability) Meltano would step in as it currently does 🙂 |
I think you'd still leave selection as it is, following the singer spec. You'd just also allow for the select settings to get passed to the tap during discovery from meltano so the tap could use those select settings to do something more efficient during discovery. This means the singer spec stays intact, people outside meltano can still use the feature if they'd like, and we're good to go. Shorter answer is I don't think that'd follow the singer spec, or at least it would be wonky in Meltano as the catalog wouldn't be controlled by meltano anymore. Seems easier to go the other way 🤷 |
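The capability-based hand-off described in this comment could look roughly like the Python sketch below. It assumes a tap advertises a `"selection"` capability string and that pass-through uses a `_select` config key; `selection_mode` and `build_tap_config` are hypothetical names, not existing Meltano functions.

```python
def selection_mode(capabilities):
    """Decide who applies selection: the tap (pass-through) or Meltano."""
    return "tap" if "selection" in capabilities else "meltano"


def build_tap_config(config, select, capabilities):
    """Add select rules to the tap config only when the tap can use them."""
    if selection_mode(capabilities) == "tap":
        # Assumed key name for passing select rules through to the tap.
        return {**config, "_select": list(select)}
    # Legacy/non-SDK tap: config is unchanged and Meltano applies selection itself.
    return dict(config)


base = {"groups": "meltano", "start_date": "2020-01-01T00:00:00Z"}
rules = ["sourcedbid12345-table1.*"]

sdk_config = build_tap_config(base, rules, ["catalog", "discover", "selection"])
legacy_config = build_tap_config(base, rules, ["catalog", "discover"])
```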
just chiming in here - I would also find this feature very useful as I have been trying to come up with a custom workaround for this on a specific tap! |
Feature scope
Taps (catalog, state, stream maps, etc.)
Description
It is commonly advantageous to filter stream discovery to only selected streams. This can be for both performance and cost (e.g. Snowflake metadata credits) reasons. Full context is captured here.
In several popular taps, this is solved by providing database, schema and table settings, accepting comma-separated names to be used for filtering. Whilst this is adequate, it isn't ideal as it is both verbose (when selecting many itemised tables) and error-prone (when managing long lists of itemised tables, and also translating from table name to stream name).
An improved solution is proposed in that same issue: