feat: initial import of the extension sdk #1

Merged: 30 commits, Aug 31, 2022


@pandemicsyn pandemicsyn commented Aug 4, 2022

Output and flags

Based on some early feedback from @pnadolny13, I configured the default logging profile to drop timestamps and info fields. That's mainly to keep output clean when it's nested within a meltano run invocation.

The fields can be re-enabled via CLI options or env vars. In addition, there's also a MELTANO_LOG_FORMAT=true/--meltano-log-format option. That switches the extension to writing output in JSON format. I wanted to have that present out of the gate so that down the road, meltano can start setting that flag and more intelligently parse an extension's output (and selectively present portions of it).
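
As a rough illustration of the toggle described above, here is a minimal standard-library sketch. `JsonFormatter` and `configure_logging` are hypothetical names, not the SDK's actual logging code. The only detail taken from this PR is the `MELTANO_LOG_FORMAT` env var; treating the literal string "true" as the enabling value, and the exact JSON keys, are assumptions.

```python
import json
import logging
import os


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line (hypothetical helper)."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({"level": record.levelname.lower(), "event": record.getMessage()})


def configure_logging(stream, env=None) -> logging.Logger:
    """Build a logger whose output format depends on MELTANO_LOG_FORMAT."""
    env = os.environ if env is None else env
    # Assumption: the literal string "true" switches on JSON output.
    use_json = env.get("MELTANO_LOG_FORMAT", "false").lower() == "true"

    handler = logging.StreamHandler(stream)
    # Default profile: message only (no timestamp, no level) so nested output stays clean.
    handler.setFormatter(JsonFormatter() if use_json else logging.Formatter("%(message)s"))

    logger = logging.getLogger("ext_sketch.json" if use_json else "ext_sketch.plain")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

With the variable unset, a call like `configure_logging(sys.stderr).info("hello")` writes the bare message; with it set to `true`, the same call writes a single JSON object that a parent process could parse.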

Basic cookiecutter and ext sdk overview

main.py and pass_through.py are the two main things generated by the cookiecutter scaffolding. Those cover the CLI invocation, logging setup, disabling the rich & typer colorized styling, etc. The "only" thing a user will have to do is fill out wrapper.py and satisfy the interface defined by the ExtensionBase class (which only has a small handful of required methods).

Expect ExtensionBase to evolve a lot when we tackle Superset/dbt, especially as we'll need more explicit config handling and path settings when we go and implement dbt.

Right now there are two methods attached to the supplied Invoker class (a minimal invoker class users can use to interact with a subprocess they're wrapping) on the extension:

  • run - Useful when you want to run a command but only care about its return code, not its output. Basically background things whose output you neither care about nor want to show users.
  • run_and_log - Best used when you want to run a command and stream the output to a user. This logs output using an opinionated logger that we (meltano) can control via env vars to do things like flip on json logs.
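
To make the difference between the two methods concrete, here is a minimal sketch built on `subprocess`. `SketchInvoker` is a hypothetical stand-in, not the Invoker class from this PR; its constructor argument and method signatures are assumptions, and the real class logs through its own opinionated logger rather than a plain `logging` logger.

```python
import logging
import subprocess
import sys

log = logging.getLogger("invoker_sketch")


class SketchInvoker:
    """Hypothetical stand-in for the SDK's Invoker; not its real implementation."""

    def __init__(self, bin_path: str) -> None:
        self.bin_path = bin_path

    def run(self, *args: str) -> int:
        """Run quietly: capture (and discard) output, report only the return code."""
        proc = subprocess.run([self.bin_path, *args], capture_output=True, text=True)
        return proc.returncode

    def run_and_log(self, *args: str) -> int:
        """Run and forward each output line to the logger as it arrives."""
        with subprocess.Popen(
            [self.bin_path, *args],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr so lines stay in order
            text=True,
        ) as proc:
            assert proc.stdout is not None
            for line in proc.stdout:
                log.info(line.rstrip("\n"))
        # Leaving the `with` block waits for the process, so returncode is set.
        return proc.returncode


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(message)s")
    invoker = SketchInvoker(sys.executable)
    invoker.run("-c", "print('hidden from the user')")
    invoker.run_and_log("-c", "print('streamed to the user')")
```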

Basic usage example

Given this yaml:

  utilities:
  - name: airflow
    namespace: airflow
    pip_url: git+https://github.com/meltano/ext-airflow.git@feat-init-extension apache-airflow==2.3.3
      --constraint https://raw.githubusercontent.com/apache/airflow/constraints-2.3.3/constraints-no-providers-3.8.txt
    executable: airflow_invoker
    commands:
      describe:
        executable: airflow_extension 
        args: describe
      initialize:
        executable: airflow_extension 
        args: initialize
    settings:
    - name: core.dags_folder
      label: DAGs Folder
      value: $MELTANO_PROJECT_ROOT/orchestrate/dags
      env: AIRFLOW__CORE__DAGS_FOLDER
    - name: core.plugins_folder
      label: Plugins Folder
      value: $MELTANO_PROJECT_ROOT/orchestrate/plugins
      env: AIRFLOW__CORE__PLUGINS_FOLDER
    - name: core.load_examples
      label: Load Examples
      value: false
      env: AIRFLOW__CORE__LOAD_EXAMPLES
    - name: core.dags_are_paused_at_creation
      label: Pause DAGs at Creation
      env: AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION
      value: false
    - name: database.sql_alchemy_conn
      label: SQL Alchemy Connection
      value: sqlite:///$MELTANO_PROJECT_ROOT/airflow/airflow.db
      env: AIRFLOW__CORE__SQL_ALCHEMY_CONN
    - name: webserver.web_server_port
      label: Webserver Port
      value: 8080
      env: AIRFLOW__WEBSERVER__WEB_SERVER_PORT
    config:
      home: $MELTANO_PROJECT_ROOT/airflow
      config: $MELTANO_PROJECT_ROOT/airflow/airflow.cfg
(melty-3.8) ➜  extdev meltano install utilities airflow
Installing 1 plugins...
Installing utility 'airflow'...
Installed utility 'airflow'
(melty-3.8) ➜  extdev tree
.
├── logging.yaml-orig
├── meltano.yml
└── plugins
    ├── extractors
    ├── loaders
    └── utilities

4 directories, 2 files

Calling any command creates the Airflow home and all the supporting files. If you want the optional meltano DAG generator, call the plugin's "initialize" command.

(melty-3.8) ➜  extdev meltano invoke airflow:initialize
meltano dag generator not found, will be auto-generated dag_generator_path=PosixPath('/Users/syn/projects/meltano-projects/extdev/orchestrate/dags/meltano_dag_generator.py')
(melty-3.8) ➜  extdev tree
.
├── airflow
│   ├── airflow.cfg
│   ├── airflow.db
│   ├── logs
│   │   └── scheduler
│   │       ├── 2022-08-05
│   │       └── latest -> /Users/syn/projects/meltano-projects/extdev/airflow/logs/scheduler/2022-08-05
│   └── webserver_config.py
├── meltano.yml
├── orchestrate
│   └── dags
│       ├── README.md
│       └── meltano_dag_generator.py
└── plugins
    ├── extractors
    ├── loaders
    └── utilities

Create a user and start the webserver:

(melty-3.8) ➜  extdev meltano invoke airflow users create -u syn@ronin.sh -p password --role Admin -e syn@ronin.sh -f admin -l admin
<-- tons of output from the airflow cli -->
User "syn@ronin.sh" created with role "Admin"

(melty-3.8) ➜  extdev meltano invoke airflow webserver
[2022-08-05 13:05:14 -0500] [17215] [INFO] Starting gunicorn 20.1.0
[2022-08-05 13:05:14 -0500] [17215] [INFO] Listening at: http://0.0.0.0:8088 (17215)
[2022-08-05 13:05:14 -0500] [17215] [INFO] Using worker: sync
[2022-08-05 13:05:14 -0500] [17217] [INFO] Booting worker with pid: 17217
[2022-08-05 13:05:15 -0500] [17218] [INFO] Booting worker with pid: 17218
[2022-08-05 13:05:15 -0500] [17219] [INFO] Booting worker with pid: 17219
[2022-08-05 13:05:15 -0500] [17220] [INFO] Booting worker with pid: 17220

Doesn't yet close meltano/meltano#6398, as we'll still need to chat about how to transition folks to the new extension and how we want to represent it on the hub.

@pandemicsyn
Author

pandemicsyn commented Aug 5, 2022

@aaronsteers @pnadolny13 it might be easier to look at this on the branch (https://github.com/meltano/ext-airflow/tree/feat-init-extension). Full disclosure, I haven't had a chance to retest the cookiecutter since making some changes this morning, but it should be pretty close.

Curious what y'all's first thoughts are.

@pnadolny13
Contributor

pnadolny13 commented Aug 5, 2022

@pandemicsyn this is awesome! I'll review the code also and share any thoughts/feedback but I had an initial question to start.

Could you re-explain the situation with invoke and why we need the second invoke command following the plugin name, like your example meltano invoke airflow invoke xyz? Is there a plan to remove that, just not in the first iteration? Is there a way for us to inject that automatically ourselves behind the scenes when we know it's needed?

@pandemicsyn
Author

pandemicsyn commented Aug 5, 2022

Is there a way for us to inject that automatically ourselves behind the scenes when we know it's needed?

Yes, and we might not even have to. I think @aaronsteers had an idea for reusing the existing command/yaml structure.

But we could definitely hide it on the meltano side and inject it automatically, but on the extension side it's needed because without it you get into command and option shadowing situations. IMHO generally speaking having to use the extension this way should be rare. Like, a proper well-integrated airflow extension would make user creation/management easier, and allow it to be meltano config driven.

--

As for why, our extension spec says that extensions must support three commands right now:

  • airflow_extension describe
  • airflow_extension initialize
  • airflow_extension invoke

Down the road we might have others (e.g. cleanup, service, state, whatever) though.

Those are commands for the extension, NOT for what it wraps behind the scenes. You need the invoke in this scenario to tell the extension that what you're about to ask it to do should be passed on to the thing you're calling (e.g. airflow). Doing it this way prevents an extension command from shadowing a command of the wrapped tool (or vice versa).

For example, if airflow had a describe command or an initialize command and we allowed bare invocations that are proxied straight through to airflow, then when you called airflow_extension describe you'd have a collision. A similar thing would happen with flags.

If your extension has a global flag like --do-something and you called airflow_extension --do-something invoke somecmd, it's clear that --do-something is an option for the extension. When calling airflow_extension --do-something somecmd, you can't tell anymore if the user expects that option to apply to the extension, the wrapped thing you're passing stuff to, or both.
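
The shadowing argument above can be sketched as a tiny routing function. Everything here is illustrative: `route`, `Route`, and `EXTENSION_COMMANDS` are hypothetical names, the sketch assumes extension flags are simple boolean switches that come before the command, and it is not how the SDK's typer-based CLI is actually implemented.

```python
from typing import List, NamedTuple

# The three commands the extension spec requires, per the comment above.
EXTENSION_COMMANDS = {"describe", "initialize", "invoke"}


class Route(NamedTuple):
    target: str                 # "extension" or "wrapped"
    extension_flags: List[str]  # flags that unambiguously belong to the extension
    args: List[str]             # what the chosen target receives


def route(argv: List[str]) -> Route:
    """Decide who handles an invocation when `invoke` is mandatory for pass-through."""
    rest = list(argv)
    extension_flags: List[str] = []
    # Anything before the first command can only belong to the extension.
    while rest and rest[0].startswith("-"):
        extension_flags.append(rest.pop(0))
    if not rest or rest[0] not in EXTENSION_COMMANDS:
        # A bare `somecmd` is rejected instead of guessed at: no shadowing possible.
        raise ValueError(f"expected one of {sorted(EXTENSION_COMMANDS)}")
    command, *args = rest
    if command == "invoke":
        # Everything after `invoke` is forwarded untouched, even a wrapped `describe`.
        return Route("wrapped", extension_flags, args)
    return Route("extension", extension_flags, [command, *args])
```

Under these assumptions, `route(["describe"])` goes to the extension while `route(["invoke", "describe"])` goes to the wrapped tool, so the two `describe` commands never collide.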

There are many ways we can skin this cat, e.g. we could mitigate this some with creative flag use to indicate that we're about to ask an extension to run a meltano command. This could even be a per-extension/author choice. Don't want a clean entry point where you pass on everything, and want to explicitly define the commands you wrap? You can! Just drop invoke and define some specific subcommands instead.

ninja edit to say: I'm totally down for any alternatives. Not strongly tied to this pattern.

@pandemicsyn
Author

pandemicsyn commented Aug 5, 2022

If the invoke pattern is a no-go, we could also invert the requirement, and just require extensions to have a top-level meltano command that houses describe, initialize, and future meltano-specific commands, e.g.:

  • airflow_extension meltano * # handled internally to extension via the sdk
  • airflow_extension meltano describe --format=yaml # handled internally to extension via the sdk
  • airflow_extension * # gets passed on to airflow
  • airflow_extension --something somecommand # gets passed on to airflow
  • airflow_extension --help # gets passed on to airflow

That does mean that flags like airflow_extension --meltano-log-format=true or a future --meltano-context are probably a no-go and would have to be env vars (just to make it easier for extension authors).

That might actually be pretty nice ? /cc @aaronsteers @pnadolny13

@aaronsteers

aaronsteers commented Aug 10, 2022

I like this exploration of options. The "meltano" command (or similar) with subcommands is a nice approach. I just thought of another approach, which is to have 2 CLI entrypoints, one that is the top-level with full capabilities and one is the wrapper that biases toward simplicity of execution and having swap-in capability. For example:

# These are 'special' and handled by the SDK:
airflow-ext initialize ...
airflow-ext describe ...

# This is a passthrough, unless the plugin defines special action:
airflow-ext invoke ...

# This is also a passthrough (basically an alias to `airflow-ext invoke ...`)
airflow-invoke --help                     # This is the help from airflow (passthrough)
airflow-invoke start                      # This is a custom command that the plugin defines
airflow-invoke webserver -p 8080 -D True  # This is a passthrough

# We could also consider alternate namings for the same:
airflow-invoke ...
airflow-wrapper ...
airflow-run ...

This can be accomplished by simply defining an extra line in the pyproject.toml that points to the invoke command instead of the main or generic cli command/class. (Might require some special wrapping but logically at least, we'd just be giving a shortcut to the invoke command.)
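
A minimal sketch of that "shortcut" idea, assuming two console scripts (for example `airflow-ext` and `airflow-invoke`) that each point at one of the functions below. The function names, the `SDK_COMMANDS` set, and the tuple return values are hypothetical placeholders for illustration; a real CLI would execute the wrapped tool instead of returning a description of what it would do.

```python
import sys
from typing import List, Optional, Tuple

SDK_COMMANDS = {"describe", "initialize"}


def ext_main(argv: Optional[List[str]] = None) -> Tuple[str, List[str]]:
    """Full-featured entry point, e.g. installed as `airflow-ext`."""
    argv = sys.argv[1:] if argv is None else list(argv)
    if not argv:
        raise SystemExit("usage: airflow-ext <command> [args...]")
    command, *rest = argv
    if command in SDK_COMMANDS:
        return ("sdk", argv)          # handled by the SDK itself
    if command == "invoke":
        return ("passthrough", rest)  # a real CLI would run the wrapped tool here
    return ("custom", argv)           # plugin-defined command


def invoke_main(argv: Optional[List[str]] = None) -> Tuple[str, List[str]]:
    """Shortcut entry point, e.g. installed as `airflow-invoke`."""
    argv = sys.argv[1:] if argv is None else list(argv)
    # Logically just an alias: prepend `invoke` and reuse the main CLI.
    return ext_main(["invoke", *argv])
```

Because the second entry point only prepends `invoke`, `invoke_main(["--help"])` forwards `--help` to the wrapped tool rather than showing the extension's own help.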

What's nice about this is that it still plays nicely with invocation patterns that want to basically swap one executable with another.

@pandemicsyn - What do you think about this?

@pandemicsyn
Author

pandemicsyn commented Aug 10, 2022

@pandemicsyn - What do you think about this?

@aaronsteers I'm not sure I grok what the benefit is to splitting this up into two distinct commands.

I think from an extension development standpoint this does make it a bit harder to reason about, even with cookiecutter doing the bulk of the heavy lifting, because things aren't quite as clear-cut and click/typer usage might get a bit complicated for them (what you can use where, where you can use a command group, etc). The one big hang-up for me is mixing pass-through and non-pass-through on the wrapper commands:

airflow-invoke --help # This is the help from airflow (passthrough)
airflow-invoke start # This is a custom command that the plugin defines

If you allow that, then I'm not sure what this gains us over just automatically injecting a required meltano top-level command group like in my alternate proposal? It's the same pattern for the pass-through, but with the additional overhead of now having two CLIs to reason about and a slightly more complicated scaffolding.

@pandemicsyn
Author

pandemicsyn commented Aug 11, 2022

@aaronsteers I'm kinda blocked on this since I'm at the point where I'd like to clean up the Airflow and Superset extensions, and get reviews of those while I start on dbt.

So if you wanted to roll with your solution I'd propose a minor tweak of the invoker command NOT allowing custom commands or shadowing:

# These are 'special' and handled by the SDK:
airflow-ext --help         # help for the airflow-ext cli
airflow-ext initialize ...
airflow-ext describe ...   # would register all custom subcommands AND the invoker cli AND the cli being wrapped (e.g. "airflow")
airflow-ext invoke ...     # pass through mostly handled by the SDK
airflow-ext custom ...     # custom command that can take full advantage of typer for arg/option handling
airflow-ext start ...      # custom command that can take full advantage of typer for arg/option handling

# All commands are straight pass throughs (but still wrapped by pre-invoke/post-invoke patterns)
airflow-invoke --help                     # This is the help from airflow (passthrough)
airflow-invoke webserver -p 8080 -D True  # This is a passthrough

I like how clean this is on the extension side, but it feels like it might be messier on the meltano side (who invokes what variation and when)?
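
For the "straight pass throughs (but still wrapped by pre-invoke/post-invoke patterns)" part, a minimal sketch could look like the following. `invoke_with_hooks` and the hook signatures are hypothetical, chosen only to illustrate the ordering; the SDK's own pre_invoke/post_invoke hooks may take different arguments.

```python
import subprocess
import sys
from typing import Callable, Optional, Sequence

PreHook = Callable[[Sequence[str]], None]
PostHook = Callable[[Sequence[str], int], None]


def invoke_with_hooks(
    command: Sequence[str],
    pre_invoke: Optional[PreHook] = None,
    post_invoke: Optional[PostHook] = None,
) -> int:
    """Run `command` unchanged, with optional hooks before and after it."""
    if pre_invoke is not None:
        pre_invoke(command)  # e.g. make sure config files exist
    returncode = subprocess.run(list(command)).returncode
    if post_invoke is not None:
        post_invoke(command, returncode)  # e.g. clean up, even after a failure
    return returncode


if __name__ == "__main__":
    invoke_with_hooks(
        [sys.executable, "--version"],
        pre_invoke=lambda cmd: print("about to run:", list(cmd)),
    )
```

The wrapped command's arguments are never inspected or rewritten here, which is what keeps this variant a pure pass-through.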

@aaronsteers

@pandemicsyn - Yeah, I like this latest proposal iteration a lot. 👍

@aaronsteers

aaronsteers commented Aug 11, 2022

To your question here:

who invokes what variation and when

I can see these usage patterns emerging:

  1. To wrap the plugin commands, use executable: <PLUGIN>-invoke in place of executable: <PLUGIN>. (E.g. sqlfluff-invoke in place of the 3rd party sqlfluff executable name on its own.)
    • If the wrapper is a noop (no pre- and post-hooks and no behavioral tweaks), you can optionally skip this step.
    • Wrapped or not, the tool's own documentation can be used to debug the invocation syntax.
    • If a user believes the wrapper itself is causing some error and/or not working correctly, they can simply swap back temporarily to the original executable (without the -invoke suffix) to debug/confirm the native behavior separately from the wrapper's behavior.
  2. To invoke "special", "custom", or "non-native" commands (like service start or describe), use executable: <PLUGIN>-ext.
    • The user or the reader of the commands enumeration will correctly infer that they are plugin operators and/or SDK operators - and not native to the tool being wrapped.

@pandemicsyn
Author

fyi - rebuilt a basic version of the superset extension on top of the output from the cookiecutter template: meltano/superset-ext#1 🥳

@pandemicsyn
Author

Not really relevant for this first PR, but in case anyone's curious about how we'll handle release of the extension initially, there's some details in the parent issue: meltano/meltano#6398 (comment)

Member

@WillDaSilva WillDaSilva left a comment

I think we should make meltano_extension_sdk into a namespace package named meltano.edk. It would still be versioned/installed separately from Meltano - making it a namespace package only affects the import name (i.e. meltano.edk) and the distribution name meltano-edk.

It's a nice way of organizing things imo. To make it a namespace package, you should only need to update pyproject.toml in the EDK repository so that the name Poetry has for the package is meltano.edk, and then update the directory structure to be meltano/edk instead of meltano_edk/.

@pandemicsyn @aaronsteers Thoughts?

meltano_extension_sdk/process.py (outdated review thread, resolved)
meltano_extension_sdk/extension.py (outdated review thread, resolved)
@WillDaSilva
Member

WillDaSilva commented Aug 29, 2022

class ExtensionCommand(Command):
    description = "The extension cli"
    pass_through_cli: bool = False
    commands: List[str] = [
        "describe",
        "invoke",
        "pre_invoke",
        "post_invoke",
        "initialize",
    ]

@pandemicsyn Since ExtensionCommand is defined as having invoke, pre_invoke, and post_invoke, is there any way to make an extension that cannot be invoked? For the cron utility, the only commands I want to define at the moment are schedule and deschedule. Having describe and a no-op initialize is fine, but having no-op invoke, pre_invoke, and post_invoke commands seems odd.

Also, because this list is hardcoded, the schedule and deschedule commands I've added are not listed by meltano invoke cron:describe.
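
One way the hardcoded-list problem could be avoided is to derive the described commands from whatever has actually been registered. The sketch below is purely illustrative: `CommandRegistry` is a hypothetical helper and not part of the SDK, and the `ExtensionCommand` class in this PR does not work this way.

```python
from typing import Callable, Dict, List


class CommandRegistry:
    """Hypothetical registry: `describe` reports whatever was registered."""

    def __init__(self) -> None:
        self._commands: Dict[str, Callable[..., object]] = {}

    def command(self, name: str) -> Callable[[Callable[..., object]], Callable[..., object]]:
        """Decorator that records a command under `name`."""

        def decorator(func: Callable[..., object]) -> Callable[..., object]:
            self._commands[name] = func
            return func

        return decorator

    def describe(self) -> List[str]:
        """List registered command names instead of a fixed, hardcoded set."""
        return sorted(self._commands)


# A cron-style utility that defines no invoke/pre_invoke/post_invoke at all.
cron = CommandRegistry()


@cron.command("describe")
def describe() -> List[str]:
    return cron.describe()


@cron.command("schedule")
def schedule() -> str:
    return "scheduled"


@cron.command("deschedule")
def deschedule() -> str:
    return "descheduled"
```

With this shape, a utility that never registers `invoke` simply does not advertise it, and custom commands such as `schedule` show up in the description automatically.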

meltano_extension_sdk/process.py (outdated review thread, resolved)
@WillDaSilva
Member

@pandemicsyn Using the EDK, should I have the extension I'm developing call Meltano directly with subprocess.run, or is there some special way provided to call Meltano? Are there plans to add a special way to call Meltano that saves the EDK user from manually handling subprocesses?

@pandemicsyn
Author

I think we should make meltano_extension_sdk into a namespace package named meltano.edk. It would still be versioned/installed separately from Meltano - making it a namespace package only affects the import name (i.e. meltano.edk) and the distribution name meltano-edk.

👍 @WillDaSilva I'm down for this. @aaronsteers any objections ?

@pandemicsyn Since ExtensionCommand is defined as having invoke, pre_invoke, and post_invoke, is there any way to make an extension that cannot be invoked? For the cron utility, the only commands I want to define at the moment are schedule and deschedule. Having describe and a no-op initialize is fine, but having no-op invoke, pre_invoke, and post_invoke commands seems odd.

@WillDaSilva Yea, invoke and friends probably don't need to be required, although it's fine in this first version since we're only targeting wrappers 🤷. Eventually, I think it might come down to whether we want to keep ExtensionCommand generic and layer WrapperCommand and UtilityCommand (or something similar) on top, and then have the cookiecutter template use the appropriate one when a dev starts a new project?

@aaronsteers we've not really chatted about plain utility commands all that much but I know you've been thinking about this for a while, any prefs or thoughts ?

@pandemicsyn Using the EDK, should I have the extension I'm developing call Meltano directly with subprocess.run, or is there some special way provided to call Meltano? Are there plans to add a special way to call Meltano that saves the EDK user from manually handling subprocesses?

@WillDaSilva yea, have the extension call Meltano directly with subprocess.run for now. fyi - there should be a symlink back to meltano in MELTANO_PROJECT_ROOT/run/bin. That's definitely something we should try to make easy for users, because it will probably be pretty common.
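
A minimal sketch of what "call Meltano directly with subprocess.run" might look like from inside an extension. The helper names are hypothetical. The sketch assumes that `MELTANO_PROJECT_ROOT` is set in the environment and that `run/bin` under it is itself the symlink to the meltano executable, as the comment above suggests; adjust the path if the link lives elsewhere in your project.

```python
import os
import subprocess
from pathlib import Path
from typing import List, Mapping, Optional, Sequence


def meltano_command(args: Sequence[str], env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Build the argv for calling Meltano through the project's symlink."""
    env = os.environ if env is None else env
    root = Path(env["MELTANO_PROJECT_ROOT"])
    # Assumption: `<project root>/run/bin` is the symlink back to meltano.
    return [str(root / "run" / "bin"), *args]


def call_meltano(args: Sequence[str], env: Optional[Mapping[str, str]] = None) -> subprocess.CompletedProcess:
    """Run Meltano and hand back the completed process (return code, stdout, stderr)."""
    return subprocess.run(
        meltano_command(args, env),
        capture_output=True,
        text=True,
        check=False,  # let the caller decide how to treat a non-zero exit
    )
```

An extension could then call something like `call_meltano(["--version"])` and inspect `returncode` and `stdout` on the result.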

ext_airflow/wrapper.py (outdated review thread, resolved)
Florian Hines added 2 commits August 30, 2022 18:35
1. ditch universal new lines but still default enabling text mode
2. allow passing in kwargs to run
3. set default env in process.py rather than depending on wrapper.py
@pandemicsyn
Author

@WillDaSilva Yea, invoke and friends probably don't need to be required, although it's fine in this first version since we're only targeting wrappers 🤷. Eventually, I think it might come down to whether we want to keep ExtensionCommand generic and layer WrapperCommand and UtilityCommand (or something similar) on top, and then have the cookiecutter template use the appropriate one when a dev starts a new project?

@aaronsteers we've not really chatted about plain utility commands all that much but I know you've been thinking about this for a while, any prefs or thoughts ?

Punted on this since I figure we can iterate on it when we go to break the EDK out into its own repo tomorrow.

Member

@WillDaSilva WillDaSilva left a comment

A nitpick: some of the files lack newlines at the end of them. It would probably be worth enabling precommit.ci for this repository, and adding a similar pre-commit config as what we've got in our other repositories.

.github/semantic.yml (outdated review thread, resolved)
@pandemicsyn
Author

A nitpick: some of the files lack newlines at the end of them. It would probably be worth enabling precommit.ci for this repository, and adding a similar pre-commit config as what we've got in our other repositories.

I've got #2 logged to set up all the ci/linter things. Knowing that the cookiecutter templates and the edk would need special handling, and that we'd immediately just turn around and modify the configs to remove those anyway, it didn't seem worth the work for the initial PR.

I'll steal the pre-commit config from the SDK repo real quick to cover the basics.

@pandemicsyn
Author

Added a pre-commit config and fixed up all the warnings. I might have missed backporting some to the cookiecutter templates themselves, since I didn't completely regenerate the extension from scratch again. It was mostly missing docstrings - I'll keep an eye out for those when I regenerate the superset extension.


@aaronsteers aaronsteers left a comment

🎉🚀🤘

@pandemicsyn pandemicsyn merged commit 2286a08 into main Aug 31, 2022
@pandemicsyn pandemicsyn deleted the feat-init-extension branch August 31, 2022 18:30
Successfully merging this pull request may close these issues.

feat: Refactor Airflow integration as an external Python plugin
5 participants