Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: filter discovery to selected streams #1234

Open
kgpayne opened this issue Dec 2, 2022 · 13 comments
Open

feat: filter discovery to selected streams #1234

kgpayne opened this issue Dec 2, 2022 · 13 comments

Comments

@kgpayne
Copy link
Contributor

kgpayne commented Dec 2, 2022

Feature scope

Taps (catalog, state, stream maps, etc.)

Description

It is commonly advantageous to filter stream discovery to only selected streams. This can be for both performance and cost (e.g. Snowflake metadata credits) reasons. Full context is captured here.

In several popular taps, this is solved by providing database, schema and table settings, accepting comma-separated names to be used for filtering. Whilst this is adequate, it isn't ideal as it is both verbose (when selecting many itemised tables) and error-prone (when managing long lists of itemised tables, and also translating from table name to stream name).

An improved solution is proposed in that same issue:

My preference would be to use the existing selection syntax and mechanisms to source the table filter list, with a flag to enable/disable the behaviour (i.e. discover all). I think 'limit discovery to selected streams' is a helpful default, but it should also be overridable. This would effectively rely on the catalog to supply the filter, rather than config.json, but means we aren't duplicating/polluting config.json with stream-level config. We should also interpret no provided catalog as a case to discover all.

@pnadolny13
Copy link
Contributor

@kgpayne interesting idea! How were you thinking we'd pass the selection criteria to the tap during discovery? Since usually catalogs are the product of discovery so it wouldnt exist prior to that.

This could be specific to my implementation of environments but if we did use the existing selection syntax I think I'd still need to set the selections multiple times (i.e. once per environment) so I can inject my alternative databases and schemas, right? Ideally for most use cases I'd be able to set the selection criteria once in the top level config then override the database/schema somehow in each environment config. That would help me avoid selections getting out of sync across my environments.

Am I understanding correctly? Do you have an idea of how I could get around that?

@aaronsteers
Copy link
Contributor

aaronsteers commented Dec 3, 2022

@kgpayne, @pnadolny13 - I like the idea of doing this generically. What if we built something like a native exclusion_filter into the SDK, that would be somewhat compatible with and analogous to --exclude in Meltano?

I wrote this up as another option here:

As I tried to imagine doing this during discovery via traditional selection logic, I kept running into the fact that you have to first run discovery in order to deselect anything from it.

As well as I understand it, the discovery solution (generally) would need to run twice before seeing benefit:

Pseudocode
# Discover streams (note that normally `--catalog` input is not allowed or expected here):
tap-snowflake --discover > catalog.json

# [Something like a manual process, meltano, or another tool iterates on the catalog to deselect or delete streams from catalog.json]
./my-deselector-tool catalog.json > filtered-catalog.json

# Subsequent discovery only scans/updates catalog entries if stream is not explicitly de-selected
tap-snowflake --discover --catalog=filtered-catalog.json > updated-schema.json

And unless we still check for undiscovered tables, a discovery-based solution isn't safe for enabling by default. The ignore option has has some other benefits:

  1. No initial discovery is needed in order then 'deselect' from the catalog.
  2. It also solves for cases like excluding information_schema.
  3. If taps themselves (or the hub) provide a default ignore setting value like "ignore": ["information_schema-*"], users can override that if they want to change that behavior.
  4. Other DB schemas (like information_schema) are easy for the end user to exclude - and the exclusion saves time during discovery while also automatically reduces the size of the catalog file.

Does this meet the requirement?

@kgpayne
Copy link
Contributor Author

kgpayne commented Dec 7, 2022

@aaronsteers I really like this proposal 🙌 Will add comments there 👍

I wrote this up as another option here:

As I tried to imagine doing this during discovery via traditional selection logic, I kept running into the fact that you have to first run discovery in order to deselect anything from it.

@pnadolny13 I would expect this feature would solve the 'selection criteria per env' case 🙂 e.g.

environments:
  - name: prod
    env:
     SOURCE_DB_ID: db1_live
  - name: uat
    env:
     SOURCE_DB_ID: db1_uat
plugins:
  extractors:
    - name: tap-mysql-db1
      inherit_from: tap-mysql
      select:
        - ${SOURCE_DB_ID}-table1.*
        - ${SOURCE_DB_ID}-table2.*
        - ${SOURCE_DB_ID}-table3.*
      metadata:
        "*":
          replication-method: LOG_BASED

@stale
Copy link

stale bot commented Jul 18, 2023

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

@stale stale bot added the stale label Jul 18, 2023
@pnadolny13
Copy link
Contributor

Still relevant

@ShahBinoy
Copy link

Very much relevant still, we are facing this MeltanoLabs/tap-postgres#215

@visch
Copy link
Contributor

visch commented Aug 29, 2023

A good way to handle this that would allow us to adhere to the singer spec is to have meltano pass the select, metadata (etc) fields as configuration options to the tap via config.json . The tap would use these to limit discovery properly it'd take a bit of work on the sdk side to mimic meltano but this is probably easiest on meltano users while still allowing the tap to run without meltano

@kgpayne
Copy link
Contributor Author

kgpayne commented Sep 4, 2023

@visch that is my strong preference 😅 I would love for the SDK to implement Meltano-style selection, including the supporting of wildcards etc.

There is some crossover with #1350 too - selection isn't the only stream-wise config that folks want to apply.

@visch
Copy link
Contributor

visch commented Sep 7, 2023

Contributor

@visch that is my strong preference 😅 I would love for the SDK to implement Meltano-style selection, including the supporting of wildcards etc.

There is some crossover with #1350 too - selection isn't the only stream-wise config that folks want to apply.

@kgpayne I read through #1350 and unless I'm missing something I think my idea here may be easier than adding a whole new configuration option. From the tap side it'd look like a configuration option as you mention, maybe the streams key makes sense maybe not (you'll see why it may matter in a sec). From the Meltano side you'd just define your normal select criteria like below (stole @pnadolny13 's example)

plugins:
  extractors:
  - name: tap-gitlab
    variant: meltanolabs
    pip_url: git+https://github.com/meltanolabs/tap-gitlab.git
    select:
      projects.*:
      merge_requests.*:
      issues.*:
      '!issues.description':
      '!issues.title':
      '!merge_requests.description':
      '!merge_requests.title':
environments:
- name: userdev
  config:
    plugins:
      extractors:
      - name: tap-gitlab
        config:
          groups: meltano
          start_date: '2020-01-01T00:00:00Z'

The difference is that the config that meltano would pass to Meltano would look different

before changes config.json passed to tap

{
"groups": "meltano",
"start_date":  "2020-01-01T00:00:00Z"
}

after changes config.json passed to tap

{
"groups": "meltano",
"start_date":  "2020-01-01T00:00:00Z",
"_select":  [
"sourcedbid12345-table1.*",
"sourcedbid12345-table2.*",
"sourcedbid12345-table3.*"]
}

The beauty of this is that:

  1. There's no duplication for Meltano users, stuff just "works".
  2. Taps still follow the singer spec

Downsides

  1. There's a spec that meltano owns that's now duplicated in the sdk versioning this isn't trivial as you'd have to maintain two versions, but this seems better than the other alternative

@kgpayne
Copy link
Contributor Author

kgpayne commented Oct 12, 2023

@visch re: downside 1. - I'd be happy for the SDK implementation of selection to take precedence over the Meltano. I.e. we advertise "selection" as a capability for SDK-based Taps and 'turn off' the Meltano selection engine in favour of pass-through. For legacy/non-SDK Taps (that don't advertise a "selection" capability) Meltano would step in as it currently does 🙂

@visch
Copy link
Contributor

visch commented Oct 12, 2023

@visch re: downside 1. - I'd be happy for the SDK implementation of selection to take precedence over the Meltano. I.e. we advertise "selection" as a capability for SDK-based Taps and 'turn off' the Meltano selection engine in favour of pass-through. For legacy/non-SDK Taps (that don't advertise a "selection" capability) Meltano would step in as it currently does 🙂

I think you'd still leave selection as it is, following the singer spec. You'd just also allow for the select settings to get passed to the tap during discovery from meltano so the tap could use those select settings to do something more efficient during discovery. This means the singer spec stays in tact, people outside meltano can still use the feature if they'd like, and we're good to go.

Shorter answer is I don't think that'd follow the singer spec or at least would be wonky in Meltano as the catalog wouldn't' be controlled by meltano anymore, seems easier to go the other way 🤷

Copy link

stale bot commented Oct 11, 2024

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

@melgazar9
Copy link

just chiming in here - I would also find this feature very useful as I have been trying to come up with a custom workaround for this on a specific tap!

@github-project-automation github-project-automation bot moved this to To Discuss in Office Hours Dec 7, 2024
@edgarrmondragon edgarrmondragon moved this from To Discuss to Up Next in Office Hours Dec 7, 2024
@edgarrmondragon edgarrmondragon moved this from Up Next to Discussed in Office Hours Dec 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Discussed
Development

No branches or pull requests

7 participants