Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should we allow users provide custom session_id #1551

Closed
noklam opened this issue May 17, 2022 · 9 comments
Closed

Should we allow users provide custom session_id #1551

noklam opened this issue May 17, 2022 · 9 comments
Labels
Issue: Feature Request New feature or improvement to existing feature Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation

Comments

@noklam
Copy link
Contributor

noklam commented May 17, 2022

Description

This was originally a thread from Discord

Is your feature request related to a problem? A clear and concise description of what the problem is: "I'm always frustrated when ..."

When using an orchestrator like Airflow, kedro nodes are run with multiple sessions and it is hard to organise the output dataset since all of them have different timestamps.

Context

(Edited on Jun16)

Use Case

  1. A id to organize for multiple runs with Orchestrator, a "logical run" id for all KedroSession. Logging of all I/O related to a particular session ID
  2. Add metadata for the session_id for various logging purpose

Possible Implementation

  1. Do we need session_id in KedroSession.create()?
  2. Do we allow user to customize session_id other than the default timestamp? (Joel has a suggestion that allow user to add suffix so it won't change the order but give more flexibility for user to add metadata.)
  3. Is save_version = session_id still valid? Parametrized session_id #1571
  4. (Bonus) - KedroSession.reset() to create a new session easily? - this can potentially make the Jupyter workflow nicer. Instead of asking user to create their session with lots of details, they can just take the global session and do session.reset() Parametrized session_id #1571

(Optional) Suggest an idea for implementing the addition or change.
workarounds or changes in kedro framework side.

Possible Alternatives

(Optional) Describe any alternative solutions or features you've considered.

@noklam noklam added the Issue: Feature Request New feature or improvement to existing feature label May 17, 2022
@datajoely
Copy link
Contributor

So it looks simple enough to make this a parameter of the create method...

Is the question that we need to ensure that custom session_ids are sorted correctly as otherwise it will ruin versioning?

I think this has a lot of user value, both Airflow and Prefect users have raised this.

@noklam
Copy link
Contributor Author

noklam commented May 18, 2022

@datajoely Implementation-wise should be simple, but I remember the team discussed related topics when we have the 1 session = 1 run changes 2 months ago. There are probably more discussions before that, so I don't have the full context.

Personally, I think this is a very practical use case.

Is the question that we need to ensure that custom session_ids are sorted correctly as otherwise it will ruin versioning?

Yes for this too, we need to ensure they are monotonic to make sure versioning works.

@datajoely
Copy link
Contributor

I wonder if there is any merit in a prefix and suffix structure where the prefix is Kedro generated, but the suffix can be provided by the user to manage groups etc.

@datajoely datajoely mentioned this issue May 24, 2022
8 tasks
@merelcht merelcht added the Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation label Jun 13, 2022
@noklam
Copy link
Contributor Author

noklam commented Jun 14, 2022

Description

Hello! I know, that you have In a backlog discussion related to parametrized session_id and as a user I would like to add mine extra requirements.

  1. As a user I want to specify (entire or prefix or suffix) of session_id as parameter in Kedro CLI
  2. As a user I want to log all data I/O operations within scope of particular session_id

(entire or prefix or suffix) - This part depends on your decision about how you will manage it

Why as user I would like to have this feature

  1. This feature would help to link node & pipeline logs with logs produced from other components in my system when I execute pipelines using kedro run command
  2. Awareness about session_id inside of functions of DatasetSpecs class would help to distinguish to which kedro session I/O operation does belong inside of logs. But I don't about degree of coupling kedro striving between KedroSession and DataCatalog classes, so I think safest is to make session_id optional to keep lowest

Importance

  1. Not critical, nice-to-have: I consider feature as nice-to-have because, I can store extra details of KedroSession as environment variable, and refer to them when I log something.
  2. Not critical, nice to have: I consider feature as nice-to-have because, I the same as above I can refer to env variable

Thanks for your support

@atmosone I have copied your comments here so we can track it in 1 issue, hope this is fine.

@noklam
Copy link
Contributor Author

noklam commented Jun 16, 2022

Updated the issue to facilitate future tech design session.

@noklam noklam changed the title Custom session id for organising kedro run with orchestrator Should we allow users provide custom session_id Jun 16, 2022
@antonymilne
Copy link
Contributor

Just so we have all this in one place - this is very relevant: #1273

@merelcht
Copy link
Member

merelcht commented Jul 27, 2022

Discussed in Technical Design on 27/7

Discussion

As part of the discussion we talked about the following points:

  • The purpose of the session_id in Kedro at the moment is mainly to link session data (run metadata) to versioned datasets for the experiment tracking view in Viz.
  • The session_id==save_version and we want to keep it that way.
  • It needs to be made very clear to users that when they choose a custom session_id it needs to be chronologically sortable, because this id will also be used for versioning of the datasets (see point above).
  • Does this influence how experiment tracking works on distributed versions Kedro? Probably not, and the suspicion is that this currently doesn't work properly anyway.

Conclusion

The outcome of the discussion was that we will allow users to provide a custom session_id.

Implementation

@merelcht
Copy link
Member

Closing this issue in favour of the action points mentioned above.

@datajoely
Copy link
Contributor

I would add that we should update the Prefect deployment guide too

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Issue: Feature Request New feature or improvement to existing feature Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation
Projects
Status: Done
Development

No branches or pull requests

4 participants