Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support 'catalog_caching' extra for extractors #6915

Closed
aaronsteers opened this issue Oct 19, 2022 · 4 comments
Closed

Support 'catalog_caching' extra for extractors #6915

aaronsteers opened this issue Oct 19, 2022 · 4 comments
Labels

Comments

@aaronsteers
Copy link
Contributor

aaronsteers commented Oct 19, 2022

Related to:

There seems to be a path forward with catalog_caching being able to be declared as an extra and being able to be disabled with meltano config tap-something set _catalog_caching disabled. One nice thing about not starting with a simple true/false, is that we could expand this in future to be meltano config tap-something set _catalog_caching '60 min' to allow short-lived catalogs in the future.

As an initial boolean toggle though, I think we probably would want to use the true/false or enabled/disabled value to drive the following behaviors:

  • If caching enabled, use the same catalog file path we have today. (No change.)
  • If caching is disabled, use a catalog path specific to the execution. (Will not be found on a subsequent run, is multi-process safe, and will still be available afterward for debugging.)

What do you think?

Originally posted by @aaronsteers in #2848 (comment)

@aaronsteers
Copy link
Contributor Author

aaronsteers commented Oct 19, 2022

@tayloramurphy, @visch - Are we happy with the initial spec of:

  1. Add catalog_caching as an 'extra' setting for extractors, with initial support for 'enabled' (default) or 'disabled'.
  2. Add CLI support for: meltano config tap-something set _catalog_caching disabled and/or modifying the yaml file to add catalog_caching: disabled to their plugin definition.
  3. If caching is 'disabled', then the catalog file path is generated with a name seed reflecting the current execution ID. Because of the name change, it will not be available to subsequent executions.
  4. We should probably still cleanup the file after execution completes, since otherwise there could be significant storage bloat after hundreds/thousands of executions. (I'm also open to other thoughts on this.)

A completely different approach, instead of modifying the yaml definition, would be to add something like a CLI flag in meltano run like --no-catalog-cache, or --refresh-catalogs. That would work similarly to --full-refresh except instead of ignoring input state we'd be ignoring the previously cached catalog files. This approach doesn't require any change to how catalogs are stored, but it also doesn't give a lot of control, and it would feel odd to have it on meltano run but not meltano invoke or meltano elt.

A third option is to go back to @edgarrmondragon's suggestion and extend meltano select with a '--refresh' option, or similar. That option is looking increasingly nice. Originally, I suggested a better place would be under a meltano catalog command, but the scope for defining such a command (and the broader catalog feature vision) is beyond what we can achieve in a near-term iteration. Presumably, if we add a meltano catalog command in future, we could refactor this operation under that header in the future. (More discussion on future catalog UX here: #6766.) New issue logged:

@visch
Copy link
Collaborator

visch commented Oct 20, 2022

@tayloramurphy, @visch - Are we happy with the initial spec of:

  1. Add catalog_caching as an 'extra' setting for extractors, with initial support for 'enabled' (default) or 'disabled'.
  2. Add CLI support for: meltano config tap-something set _catalog_caching disabled and/or modifying the yaml file to add catalog_caching: disabled to their plugin definition.
  3. If caching is 'disabled', then the catalog file path is generated with a name seed reflecting the current execution ID. Because of the name change, it will not be available to subsequent executions.
  4. We should probably still cleanup the file after execution completes, since otherwise there could be significant storage bloat after hundreds/thousands of executions. (I'm also open to other thoughts on this.)

A completely different approach, instead of modifying the yaml definition, would be to add something like a CLI flag in meltano run like --no-catalog-cache, or --refresh-catalogs. That would work similarly to --full-refresh except instead of ignoring input state we'd be ignoring the previously cached catalog files. This approach doesn't require any change to how catalogs are stored, but it also doesn't give a lot of control, and it would feel odd to have it on meltano run but not meltano invoke or meltano elt.

A third option is to go back to @edgarrmondragon's suggestion and extend meltano select with a '--refresh' option, or similar. That option is looking increasingly nice. Originally, I suggested a better place would be under a meltano catalog command, but the scope for defining such a command (and the broader catalog feature vision) is beyond what we can achieve in a near-term iteration. Presumably, if we add a meltano catalog command in future, we could refactor this operation under that header in the future. (More discussion on future catalog UX here: #6766.) New issue logged:

1 👍

I just want a way to do this, right now I template this out on all my projects. Work on a new project and forget to do it again and I'm trying hard to not have a seperate repo for workarounds on my end!

@stale
Copy link

stale bot commented Apr 26, 2023

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

@stale stale bot added the stale label Apr 26, 2023
@visch
Copy link
Collaborator

visch commented Apr 26, 2023

Something like this would be very nice, right now catalog caching causes headaches as you have to know about what its behavior is, and it's not easily controllable without knowing the internals.

@stale stale bot closed this as completed May 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants