feat: filter discovery to selected streams #1234

kgpayne · 2022-12-02T10:46:23Z

Feature scope

Taps (catalog, state, stream maps, etc.)

Description

It is commonly advantageous to filter stream discovery to only selected streams. This can be for both performance and cost (e.g. Snowflake metadata credits) reasons. Full context is captured here.

In several popular taps, this is solved by providing database, schema and table settings, accepting comma-separated names to be used for filtering. Whilst this is adequate, it isn't ideal as it is both verbose (when selecting many itemised tables) and error-prone (when managing long lists of itemised tables, and also translating from table name to stream name).

An improved solution is proposed in that same issue:

My preference would be to use the existing selection syntax and mechanisms to source the table filter list, with a flag to enable/disable the behaviour (i.e. discover all). I think 'limit discovery to selected streams' is a helpful default, but it should also be overridable. This would effectively rely on the catalog to supply the filter, rather than config.json, but means we aren't duplicating/polluting config.json with stream-level config. We should also interpret no provided catalog as a case to discover all.

The text was updated successfully, but these errors were encountered:

pnadolny13 · 2022-12-02T16:19:51Z

@kgpayne interesting idea! How were you thinking we'd pass the selection criteria to the tap during discovery? Since usually catalogs are the product of discovery so it wouldnt exist prior to that.

This could be specific to my implementation of environments but if we did use the existing selection syntax I think I'd still need to set the selections multiple times (i.e. once per environment) so I can inject my alternative databases and schemas, right? Ideally for most use cases I'd be able to set the selection criteria once in the top level config then override the database/schema somehow in each environment config. That would help me avoid selections getting out of sync across my environments.

Am I understanding correctly? Do you have an idea of how I could get around that?

aaronsteers · 2022-12-03T01:53:11Z

@kgpayne, @pnadolny13 - I like the idea of doing this generically. What if we built something like a native exclusion_filter into the SDK, that would be somewhat compatible with and analogous to --exclude in Meltano?

I wrote this up as another option here:

Feature: Accept ignore glob patterns as standard tap config #1240

As I tried to imagine doing this during discovery via traditional selection logic, I kept running into the fact that you have to first run discovery in order to deselect anything from it.

As well as I understand it, the discovery solution (generally) would need to run twice before seeing benefit:

Pseudocode

# Discover streams (note that normally `--catalog` input is not allowed or expected here):
tap-snowflake --discover > catalog.json

# [Something like a manual process, meltano, or another tool iterates on the catalog to deselect or delete streams from catalog.json]
./my-deselector-tool catalog.json > filtered-catalog.json

# Subsequent discovery only scans/updates catalog entries if stream is not explicitly de-selected
tap-snowflake --discover --catalog=filtered-catalog.json > updated-schema.json

And unless we still check for undiscovered tables, a discovery-based solution isn't safe for enabling by default. The ignore option has has some other benefits:

No initial discovery is needed in order then 'deselect' from the catalog.
It also solves for cases like excluding information_schema.
If taps themselves (or the hub) provide a default ignore setting value like "ignore": ["information_schema-*"], users can override that if they want to change that behavior.
Other DB schemas (like information_schema) are easy for the end user to exclude - and the exclusion saves time during discovery while also automatically reduces the size of the catalog file.

Does this meet the requirement?

kgpayne · 2022-12-07T17:02:53Z

@aaronsteers I really like this proposal 🙌 Will add comments there 👍

I wrote this up as another option here:

Feature: Accept ignore glob patterns as standard tap config #1240

As I tried to imagine doing this during discovery via traditional selection logic, I kept running into the fact that you have to first run discovery in order to deselect anything from it.

@pnadolny13 I would expect this feature would solve the 'selection criteria per env' case 🙂 e.g.

environments:
  - name: prod
    env:
     SOURCE_DB_ID: db1_live
  - name: uat
    env:
     SOURCE_DB_ID: db1_uat
plugins:
  extractors:
    - name: tap-mysql-db1
      inherit_from: tap-mysql
      select:
        - ${SOURCE_DB_ID}-table1.*
        - ${SOURCE_DB_ID}-table2.*
        - ${SOURCE_DB_ID}-table3.*
      metadata:
        "*":
          replication-method: LOG_BASED

stale · 2023-07-18T04:09:50Z

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

pnadolny13 · 2023-07-19T12:46:05Z

Still relevant

ShahBinoy · 2023-08-23T19:31:49Z

Very much relevant still, we are facing this MeltanoLabs/tap-postgres#215

visch · 2023-08-29T16:25:09Z

A good way to handle this that would allow us to adhere to the singer spec is to have meltano pass the select, metadata (etc) fields as configuration options to the tap via config.json . The tap would use these to limit discovery properly it'd take a bit of work on the sdk side to mimic meltano but this is probably easiest on meltano users while still allowing the tap to run without meltano

kgpayne · 2023-09-04T14:51:15Z

@visch that is my strong preference 😅 I would love for the SDK to implement Meltano-style selection, including the supporting of wildcards etc.

There is some crossover with #1350 too - selection isn't the only stream-wise config that folks want to apply.

visch · 2023-09-07T10:32:20Z

Contributor

@visch that is my strong preference 😅 I would love for the SDK to implement Meltano-style selection, including the supporting of wildcards etc.

There is some crossover with #1350 too - selection isn't the only stream-wise config that folks want to apply.

@kgpayne I read through #1350 and unless I'm missing something I think my idea here may be easier than adding a whole new configuration option. From the tap side it'd look like a configuration option as you mention, maybe the streams key makes sense maybe not (you'll see why it may matter in a sec). From the Meltano side you'd just define your normal select criteria like below (stole @pnadolny13 's example)

plugins:
  extractors:
  - name: tap-gitlab
    variant: meltanolabs
    pip_url: git+https://github.com/meltanolabs/tap-gitlab.git
    select:
      projects.*:
      merge_requests.*:
      issues.*:
      '!issues.description':
      '!issues.title':
      '!merge_requests.description':
      '!merge_requests.title':
environments:
- name: userdev
  config:
    plugins:
      extractors:
      - name: tap-gitlab
        config:
          groups: meltano
          start_date: '2020-01-01T00:00:00Z'

The difference is that the config that meltano would pass to Meltano would look different

before changes config.json passed to tap

{
"groups": "meltano",
"start_date":  "2020-01-01T00:00:00Z"
}

after changes config.json passed to tap

{
"groups": "meltano",
"start_date":  "2020-01-01T00:00:00Z",
"_select":  [
"sourcedbid12345-table1.*",
"sourcedbid12345-table2.*",
"sourcedbid12345-table3.*"]
}

The beauty of this is that:

There's no duplication for Meltano users, stuff just "works".
Taps still follow the singer spec

Downsides

There's a spec that meltano owns that's now duplicated in the sdk versioning this isn't trivial as you'd have to maintain two versions, but this seems better than the other alternative

kgpayne · 2023-10-12T15:44:29Z

@visch re: downside 1. - I'd be happy for the SDK implementation of selection to take precedence over the Meltano. I.e. we advertise "selection" as a capability for SDK-based Taps and 'turn off' the Meltano selection engine in favour of pass-through. For legacy/non-SDK Taps (that don't advertise a "selection" capability) Meltano would step in as it currently does 🙂

visch · 2023-10-12T15:55:06Z

@visch re: downside 1. - I'd be happy for the SDK implementation of selection to take precedence over the Meltano. I.e. we advertise "selection" as a capability for SDK-based Taps and 'turn off' the Meltano selection engine in favour of pass-through. For legacy/non-SDK Taps (that don't advertise a "selection" capability) Meltano would step in as it currently does 🙂

I think you'd still leave selection as it is, following the singer spec. You'd just also allow for the select settings to get passed to the tap during discovery from meltano so the tap could use those select settings to do something more efficient during discovery. This means the singer spec stays in tact, people outside meltano can still use the feature if they'd like, and we're good to go.

Shorter answer is I don't think that'd follow the singer spec or at least would be wonky in Meltano as the catalog wouldn't' be controlled by meltano anymore, seems easier to go the other way 🤷

stale · 2024-10-11T16:57:56Z

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

melgazar9 · 2024-12-06T17:23:16Z

just chiming in here - I would also find this feature very useful as I have been trying to come up with a custom workaround for this on a specific tap!

kgpayne added kind/Feature New feature or request valuestream/SDK labels Dec 2, 2022

kgpayne mentioned this issue Dec 2, 2022

Limit Discoverable Tables MeltanoLabs/tap-snowflake#10

Closed

aaronsteers mentioned this issue Dec 3, 2022

Feature: Accept ignore glob patterns as standard tap config #1240

Closed

stale bot added the stale label Jul 18, 2023

stale bot removed the stale label Jul 19, 2023

This was referenced Jul 20, 2023

[Feature]: speed up discovery, just discover selected tables #1048

Closed

Follow-up from SQLStreams: 'only_schemas' and/or 'ignore_schemas' config option #290

Closed

visch mentioned this issue Aug 23, 2023

Support filter_schema setting MeltanoLabs/tap-postgres#215

Closed

edgarrmondragon mentioned this issue Aug 29, 2023

feat: SQL Taps - Filter Schemas #1930

Open

edgarrmondragon mentioned this issue Jun 17, 2024

feat(taps): Allow SQL tap developers to auto-skip certain schemas from discovery #2482

Merged

visch mentioned this issue Jul 17, 2024

Discovery taking a very long time for databases with 1000s of tables BuzzCutNorman/tap-mssql#76

Open

stale bot added the stale label Oct 11, 2024

edgarrmondragon added evergreen Accepting Pull Requests and removed stale labels Oct 11, 2024

edgarrmondragon added this to Office Hours Dec 7, 2024

github-project-automation bot moved this to To Discuss in Office Hours Dec 7, 2024

edgarrmondragon moved this from To Discuss to Up Next in Office Hours Dec 7, 2024

edgarrmondragon moved this from Up Next to Discussed in Office Hours Dec 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: filter discovery to selected streams #1234

feat: filter discovery to selected streams #1234

kgpayne commented Dec 2, 2022

pnadolny13 commented Dec 2, 2022

aaronsteers commented Dec 3, 2022 •

edited

Loading

kgpayne commented Dec 7, 2022

stale bot commented Jul 18, 2023

pnadolny13 commented Jul 19, 2023

ShahBinoy commented Aug 23, 2023

visch commented Aug 29, 2023 •

edited

Loading

kgpayne commented Sep 4, 2023

visch commented Sep 7, 2023 •

edited

Loading

kgpayne commented Oct 12, 2023

visch commented Oct 12, 2023 •

edited

Loading

stale bot commented Oct 11, 2024

melgazar9 commented Dec 6, 2024

feat: filter discovery to selected streams #1234

feat: filter discovery to selected streams #1234

Comments

kgpayne commented Dec 2, 2022

Feature scope

Description

pnadolny13 commented Dec 2, 2022

aaronsteers commented Dec 3, 2022 • edited Loading

kgpayne commented Dec 7, 2022

stale bot commented Jul 18, 2023

pnadolny13 commented Jul 19, 2023

ShahBinoy commented Aug 23, 2023

visch commented Aug 29, 2023 • edited Loading

kgpayne commented Sep 4, 2023

visch commented Sep 7, 2023 • edited Loading

kgpayne commented Oct 12, 2023

visch commented Oct 12, 2023 • edited Loading

stale bot commented Oct 11, 2024

melgazar9 commented Dec 6, 2024

aaronsteers commented Dec 3, 2022 •

edited

Loading

visch commented Aug 29, 2023 •

edited

Loading

visch commented Sep 7, 2023 •

edited

Loading

visch commented Oct 12, 2023 •

edited

Loading