Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[KED-2629] Users can't provide custom run_id or save_version to KedroSession #1273

Closed
3 tasks done
antonymilne opened this issue Feb 21, 2022 · 4 comments
Closed
3 tasks done
Assignees
Labels
Component: Framework Issue/PR that addresses core framework functionality
Milestone

Comments

@antonymilne
Copy link
Contributor

antonymilne commented Feb 21, 2022

(transfer from Jira, created by @lorenabalan)

This may be acceptable behaviour, but we should document it better in that case.

Working on this PR (see discussion in the thread too), I discovered that in the new session model, users can't set their own custom run_id, or use a different way to generate a save version (for example use a different format). They can modify KedroContext._get_save_version() or KedroContext._get_run_id() but that's not what will be stored in the session store - instead it will contain only the values equivalent to session_id.

When a session is created, a session_id is generated (timestamp) and written to the store. During a session run, that session_id is loaded from the store and is used for both save_version and run_id, and it'll be the same timestamp every time, i.e.

1 session ↔️ 1 run ↔️ 1 run_id=session_id=save_version

If that's the case, as Antony pointed out, what's the optimal behaviour in this case:

 with KedroSession.create(...) as session:
    session.run(pipeline1)
    session.run(pipeline2)

Note this isn't as strange a thing to do as it might initially seem. In Jupyter, a user could well do multiple session.run.

This ticket includes both the design and implementing the solution. Setup a team discussion to go through design suggestion.

Following the discussion the team had about this issue these tasks need to be done:

After 0.18.0 has been released:

@antonymilne
Copy link
Contributor Author

This comment in Kedro-Viz makes me think that @limdauto was anticipating that run_id should not necessarily be the same as the timestamp?

Also from this Discord conversation with @shaunc on possible integration with DVC, my gut feeling is that saying run_id = session_id = save_version is maybe too restrictive and we should allow for controlling some (all?) of these independently.

@shaunc
Copy link

shaunc commented Feb 22, 2022

I'm integrating kedro with DVC experiment tracking kedro-dvc (integration plans) -- I'm hoping that the kedro interface will support different experiment tracking and session management plugins.

Its great that you support SESSION_STORE_CLASS -- though I'm hoping for more detail on its interface! :)

VS ids -- I'd propose that the session store "rows" (metadata) include various fields, whose names and meanings can also be configurable. For instance:

  • SESSION_ID_FIELD - field guaranteed to be unique over session store
  • SESSION_TIMESTAMP_FIELD and/or SESSION_ORDER_BY_FIELD -- the first for a timestamp, the last for display order in kedro-vis, and for finding the most recent, and maybe for deleting old during garbage collection.

DVC uses git commit hashes for names. A timestamp isn't necessarily unique in distributed runs. However, you could have default config for these things all pointing to the timestamp field you are already using for convenience.


The other thing I'd like to see made abstract is RunsRepository. Is this also going to migrate to kedro core? May I suggest that both this and the default session metadata store be moved to -- say -- kedro-session plugin, which is included by default by the core, but, being a plugin, could be superseded by someone who, perhaps, wanted to use kedro-dvc instead? :)

@merelcht
Copy link
Member

merelcht commented Mar 9, 2022

Based on the discussion the Kedro team had on this topic on Monday it was decided that a session can only every have 1 run, and so the run_id is no longer needed. For the time being, the session_id and save_version will remain the same, but there is a possibility to allow users to add a custom save_version that is different from the session_id. We'll require user research to determine the best solution for allowing this customisation.

See more details in #1335

@merelcht merelcht added the Component: Framework Issue/PR that addresses core framework functionality label Mar 15, 2022
lvijnck pushed a commit to lvijnck/kedro that referenced this issue Apr 7, 2022
Signed-off-by: Laurens Vijnck <laurens_vijnck@mckinsey.com>
@merelcht
Copy link
Member

All tasks have been completed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Framework Issue/PR that addresses core framework functionality
Projects
None yet
Development

No branches or pull requests

3 participants