Spike: proposal for better default node names #3575

astrojuanlu · 2024-01-30T09:12:26Z

Description

When a node has no name, the default is to use a string representation for it. For example:

'split_data([model_input_table; params: model_options]) -> [X_train;X_test;y_train;y_test]'

This poses a series of problems:

"kedro airflow create produces very long task ids when using unnamed nodes" (kedro-plugins#397): "and node.name defaults to the signature"
"kedro run CLI incorrectly splits the names of nodes at commas" (#1828) was already discussed in tech design, but no significant changes were made to node name defaults
Users defaulting to func.__name__ for a more convenient node name: https://linen-slack.kedro.org/t/16107421/a-tip-i-wanted-to-share-you-can-name-a-node-using-the-dunder#0e36ce67-23d8-4638-9e79-113d66b2229d
Surprising behavior when trying to filter nodes without names: "create line magic to debug a node in notebook workflow" (#3510 (comment))
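
To make the comma-splitting problem from #1828 concrete, here is a minimal sketch. The node names are hypothetical, and the join/split shown is a naive stand-in for illustration, not Kedro's actual CLI parsing.

```python
# Hypothetical node names: one explicit, one signature-style containing commas.
node_names = [
    "preprocess",
    "split_data([table,params:options]) -> [X_train,X_test]",
]

# A CLI-style argument joins the names with commas...
cli_value = ",".join(node_names)

# ...so naive splitting on commas cannot recover the original two names.
recovered = cli_value.split(",")

print(len(node_names), len(recovered))
print(recovered == node_names)
```

Any default name that embeds the list separator has this problem, which is one reason a shorter, identifier-like default is attractive.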

Context

I contend that 'split_data([model_input_table; params: model_options]) -> [X_train;X_test;y_train;y_test]' is a bad name because:

A human would never name their nodes like that
It's more of a description or a __repr__ rather than a "name"

Plus all the problems discussed above.

Task

Come up with a proposal for a better default node name, taking into account:

backwards compatibility
filtering by node should be usable with the new default
requirements outlined in discussion

Possible Implementation

Use func.__name__ as default node names.


deepyaman · 2024-01-31T15:39:19Z

Use func.__name__ as default node names.

Don't node names need to be unique? Even if not ideal, the current approach guarantees uniqueness, while the latter very frequently won't.

astrojuanlu · 2024-01-31T16:34:36Z

Correct: the current approach guarantees uniqueness under the current assumptions of "no two nodes can output the same dataset".

More strawman proposals:

f"{func.__name__}_{uuid4()}"
f"{func.__name__}_0" (and continues with _1, _2 if node functions are reused)
f"{func.__name__}_{simplify(self.outputs())}" (for instance split_data__X_train_X_test_y_train_y_test)

More ideas?
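
For comparison, the three strawman schemes above can be sketched in plain Python. This is only an illustration: simplify is a hypothetical helper (no such function is defined in the thread), split_data is a stand-in function, and the exact separator between the function name and the outputs is an assumption.

```python
import re
from collections import Counter
from uuid import uuid4


def split_data():  # stand-in node function, for illustration only
    pass


def simplify(outputs):
    """Hypothetical helper: join output names into an identifier-like string."""
    return "_".join(re.sub(r"\W+", "_", name) for name in outputs)


def name_with_uuid(func):
    # Unique, but different on every call and hard to remember.
    return f"{func.__name__}_{uuid4()}"


def names_with_counter(funcs):
    # Appends _0, _1, ... per function name, in the order the nodes are given.
    seen = Counter()
    names = []
    for func in funcs:
        names.append(f"{func.__name__}_{seen[func.__name__]}")
        seen[func.__name__] += 1
    return names


def name_with_outputs(func, outputs):
    # Unique as long as no two nodes produce the same outputs.
    return f"{func.__name__}_{simplify(outputs)}"


print(name_with_uuid(split_data))
print(names_with_counter([split_data, split_data]))
print(name_with_outputs(split_data, ["X_train", "X_test", "y_train", "y_test"]))
```

The counter variant depends on node order, so reordering a pipeline would rename nodes; the outputs variant is stable but grows with the number of outputs.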

noklam · 2024-02-07T13:31:38Z

It may be obvious, but can I ask why node names need to be unique? I feel like they exist for different purposes; I would like to map them out and see if there are opportunities to combine them. There would be at least 2 different purposes: one for internal workings, the other for human readability.

Currently there are different names.

node.name (with namespace)
node.short_name
node._name
node._unique_key (hashable)
node._func_name

Note that some of these already exist in the Node API, though they may not be what you expect.

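from typing import Callable


def _get_readable_func_name(func: Callable) -> str:
    """Get a user-friendly readable name of the function provided.

    Returns:
        str: readable name of the provided callable func.
    """

    if hasattr(func, "__name__"):
        return func.__name__
    # The rest of the function (the fallback for callables without
    # __name__) is not included in this excerpt.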

Re: @deepyaman's point, I think it is fair for any medium-sized pipeline using namespaces, but I think there are also many pipelines without repeated node functions.

p.s. I am asking this just because I am lazy. I will do the digging at some point, but if someone already knows, that could save some time.

astrojuanlu · 2024-02-07T13:38:23Z

Agreed with the sentiment. I imagine a "node ID" should be unique, but not sure why we want names to be unique.

noklam · 2024-02-09T22:35:09Z

This is my understanding:

node._unique_key = node.name + inputs + outputs

This definition is a bit leaky, because I would think the combination of (func, inputs + outputs) is already a unique key. I tested this and Kedro thinks the following are valid:

node(func, "x", None) and node(func, "x", None, name="abc"): Kedro treats them as "different nodes".
If you extend this, does adding a "namespace" to the name of the function, while keeping the same inputs and outputs (assume the output is None, because outputs cannot be duplicated anyway), make it a different Node?

I tend to say it's NO, they are the same.
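
The difference between the two notions of identity can be sketched with a simplified stand-in class. SketchNode and both key properties are hypothetical and only mirror the description above (name + inputs + outputs versus func + inputs + outputs); this is not Kedro's Node implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class SketchNode:
    """Simplified stand-in for a node; not Kedro's Node class."""

    func: Callable
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]
    name: Optional[str] = None

    @property
    def key_with_name(self):
        # Mirrors the description above: name + inputs + outputs.
        # Falling back to func.__name__ is an assumption of this sketch.
        label = self.name or self.func.__name__
        return (label, self.inputs, self.outputs)

    @property
    def key_without_name(self):
        # Alternative: identify a node by what it computes.
        return (self.func, self.inputs, self.outputs)


def func(x):
    return None


a = SketchNode(func, ("x",), ())
b = SketchNode(func, ("x",), (), name="abc")

print(a.key_with_name == b.key_with_name)        # False: treated as different nodes
print(a.key_without_name == b.key_without_name)  # True: same function, inputs, outputs
```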

Back to the original idea, _unique_key should be used as the "Node ID". You may be surprised that _validate_duplicate_nodes is using node.name to compare instead of node._unique_key. I do not understand why this is the case, but it seems that a lot of the __eq__, __lt__ methods were created for toposort; Kedro doesn't use them to compare nodes.

If we offload this responsibility to _unique_key, maybe node.name can be something more human-readable, or even be allowed to be non-unique. namespace.func is not a bad choice for a default node name. The node name doesn't matter to Kedro, only to the user.

What if the user provides a name but multiple nodes are returned with the same name? This affects things such as %load_node <node_name> and kedro run --to-nodes <node_name>.

2 options:

Allow the user to provide the non-human-readable name, which no one can remember unless they copy it from the log when the pipeline fails and Kedro prints a resume suggestion.
In case of ambiguity, Kedro doesn't do anything and asks the user to give the node a name to make it unique.

Con:

Technically a breaking change, but does anyone rely on the node.name?

Pro:

Keep the node uniqueness guarantee
A sensible default name that works 95% of the time; in case of ambiguity, prompt the user to add a name.

Note that having duplicate names doesn't mean that you need to change the name immediately; you only have to do so if you need to specify it as an argument (which is less likely).
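
A minimal sketch of the second option, assuming names may be non-unique and uniqueness lives in a separate key. find_node, AmbiguousNodeNameError, and the (name, unique_key) pairs are all hypothetical and exist only for this illustration; none of them is part of Kedro.

```python
from collections import defaultdict


class AmbiguousNodeNameError(ValueError):
    """Hypothetical error raised when a name matches more than one node."""


def find_node(nodes, name):
    """Return the unique key of the single node called `name`.

    `nodes` is a list of (name, unique_key) pairs.
    """
    by_name = defaultdict(list)
    for node_name, unique_key in nodes:
        by_name[node_name].append(unique_key)

    matches = by_name.get(name, [])
    if not matches:
        raise KeyError(f"No node named {name!r}")
    if len(matches) > 1:
        raise AmbiguousNodeNameError(
            f"{len(matches)} nodes are named {name!r}; give them explicit names"
        )
    return matches[0]


nodes = [
    ("split_data", ("split_data", ("table_a",), ("train_a",))),
    ("split_data", ("split_data", ("table_b",), ("train_b",))),
    ("train_model", ("train_model", ("train_a",), ("model",))),
]

print(find_node(nodes, "train_model"))
try:
    find_node(nodes, "split_data")
except AmbiguousNodeNameError as err:
    print(err)
```

Duplicate names are accepted when the pipeline is built; the error only appears when a duplicated name is actually used to select a node.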

astrojuanlu · 2024-04-08T12:28:14Z

You may be surprised that _validate_duplicate_nodes is using node.name to compare instead of node._unique_key

Maybe this is a good place to start?

it seems that a lot of the __eq__, __lt__ methods were created for toposort; Kedro doesn't use them to compare nodes.

Now that we switched to graphlib #3728, would it be interesting to try to do some "tree-shaking" and see if we can remove (after deprecation) some of these methods?

noklam · 2024-04-08T15:51:30Z

The comment was mostly based on the investigation that I did 2 months ago; it is documented in a bit more detail here: https://noklam.github.io/blog/posts/default_node_name/2024-02-08-default-node-name.html

Now that we switched to graphlib #3728, would it be interesting to try to do some "tree-shaking" and see if we can remove (after deprecation) some of these methods?

We can definitely try this, but it wouldn't address the problem described in this issue, though it may help reduce the API surface.

The main proposal of #3575 (comment) is whether we can loosen things up to make node.name non-unique, and offload the uniqueness check to node._unique_key instead.

astrojuanlu · 2024-07-30T14:34:15Z

Related: one user told me today that he found node names "weird" and "useless", given that they were inferred from function names already.

astrojuanlu · 2024-08-22T13:19:48Z

When discussing kedro-org/kedro-viz#2036 today, we realised a few more surprising things about names:

The current automatically generated node names aren't visible in the Viz UI (too ugly to display?)
Kedro is prepending {namespace}. to user-specified node names
It was confusing that kedro run --to-nodes=function_name wasn't working, and the reason was that the node had no name

This is on top of what we already knew:

It's very annoying having to name a node every time you want to use it for something (see my previous comment)
The current approach to automatically naming nodes is very problematic (see examples)
And now we will add one more problem, because when the slicing functionality is live, these automatically generated node names will start popping up in the generated kedro run commands.

I think we need to plan what to do about this.

lordsoffallen · 2024-09-19T14:18:48Z

Also adding a similar use case where it doesn't work if the name is not specified:

From Slack chat:

I noticed that when I define the following node:

node(
    func=sample_func,
    # name="sample_func"    -> does not work when this is commented
    inputs="epubdf",
    outputs="result"
)

and then run this test:

kedro_session.run(node_names=['sample_func'])

I get an error saying the name doesn't exist, but when I specify the name parameter in the node itself it works.

astrojuanlu mentioned this issue Jan 30, 2024

kedro airflow create produces very long task ids when using unnamed nodes kedro-org/kedro-plugins#397

Closed

AhdraMeraliQB mentioned this issue Jan 31, 2024

[Parent] - Notebook/IPython Debug line magic %load_node feature discussion thread #3535

Closed

5 tasks

github-actions bot mentioned this issue Feb 1, 2024

Monthly issue metrics report #3582

Open

noklam mentioned this issue Mar 22, 2024

Optimise pipeline addition and creation #3730

Merged

7 tasks

noklam mentioned this issue Sep 23, 2024

%load_node integration with flowchart - One Click debugging experience kedro-org/vscode-kedro#60

Open

merelcht changed the title ~~Default node names are problematic~~ Spike: proposal for better default node names Sep 23, 2024

merelcht added this to the Improve Developer Experience milestone Sep 23, 2024
