Workflow of debugging Kedro pipeline in notebook #1832

noklam · 2022-09-06T10:28:08Z

Background

Kedro's philosophy is pretty much you should use a notebook wisely and keep your code as a Python module. But there are situation that you have to debug with a notebook because data infrastructure is tied to the platform.

What are the pain points with debugging Kedro Pipeline?

The situation that you have to use a notebook - use cases may be notebook base platforms like Databricks, or debugger just slow down massively when you loaded large datasets in memory and have to use a notebook as debug session instead. In this case, debugging a python module is annoying as the source code doesn’t live with it, some copy & paste or monkey-patching seems to be unavoidable.
Schedule/Distribute cluster job - use cases may be Deployment, or trying to leverage cluster computing for large-scale ML experiments. In this case, you can’t attach a debugger and I don’t know any workaround for this. If it’s just one remote server then you can attach a remote debugger that VS Code & PyCharm would support.
suggested
Kedro-specific API isn’t friendly enough - MemoryDataSet / CacheDataSet and KedroSession do not have the most user-friendly interface for an interactive environment like Notebook.
I think 3. is something Kedro should solve and I would love more feedback about this. 1 is not a kedro-specific problem, but it’s more common to kedro users due to data science/ML workflow and we may try to make it easier. I don’t have any workaround for 2.

My opinion is:

Not a Kedro-specific problem, but it's a more common problem to Kedro users because of the nature of ML/data science pipeline, and we may figure out what may be a smoother workflow.
Not a Kedro problem, this is true for any Python program, and I don't see anything Kedro could do about this (yet)
This is a Kedro problem that we should improve.

I talked to Tom earlier and try to understand the debugging process that he had.

Steps to debug Kedro pipeline in a notebook

Read from stack trace - find out the line of code that produce the error
Find which node this function belongs to
Trying to rerun the pipeline just before this node
If it's not a persisted dataset, you need to change it in catalog.yml, and re-run the pipeline, error is thrown again
session has already been used once, so if you call session again it will throw error. (so he had a wrapper function that recreate session and do something similar to session.run
Create a new session or %reload_kedro?
Now catalog.load that persisted dataset, i.e. func(catalog.load("some_data"))
Copy the source code of func to notebook, it would work if the function itself is the node function, but if it is some function buried deep down, that's a lot more copy-pasting and change of import maybe.
Change the source code and make it work in the notebook
Rerun the pipeline to ensure everything works

Note that if this is a local development environment, all you would do is set a breakpoint. But you will have to touch on a few files with a notebook i.e.

Notebook cell
Source code of the function that cause the problem
catalog.yml

Problems

KedroSession cannot be re-run, user will called session.run multiple time for debugging purpose.
Sometimes session.run doesn't give the correct output and this issue try to address this problem Should we change the output of session.run? #1802
Errors happened in the 50th node in a 100 nodes pipeline - how can we remove some steps so less copy & paste is needed?
Not all nodes write data to disk - this means they can't be recovered easily. It makes sense to have most things in memory, but can we make it easier for debug sessions, where users can change this behavior instead of going to change every entry in a catalog?

Why it would be less of a problem with stuff like Airflow?

All datasets are persisted - so each node is a self-contained node, you only need to rerun the interested node
UI will show you clearly which node fails

Proposal

We are definitely not trying to re-create the debugger experience. Ideally, it would be great if Kedro can just pop out the correct context at the exact line of code (similar to putting a breakpoint right before an error happen).

Easier way to re-use session? or %reload_kedro enough? If we want to keep things in memory then reload_kedro does not fit well.
%load_node proposal mentioned in Improve DataCatalog and ConfigLoader with autocompletion and meaningful representation when it get printed #1721 - which should address Step1-Step7
Some debug mode with session which you can do session.run(dataset=["a","b","c"]) and keep specific dataset you are interested, or even

Some of this can reuse the backtracking logic we have in #1795 so we don't have to rerun the entire pipeline.

The text was updated successfully, but these errors were encountered:

antonymilne · 2022-09-09T21:35:44Z

This is a great issue, thank you for opening it! I agree with a lot of what you say here, including which bits might be feasible to improve within kedro. I myself have followed pretty much exactly the same debugging workflow that you describe many, many times. I think it's quite a common way of working, even when IDE breakpoints are available. Especially for users coming from a notebook background, debugging things using an interactive Python session in Jupyter is way easier than trying to use pdb or an IDE breakpoint. It also allows you to do things you can't do easily in the IDE (like plot graphs). So I definitely think there's a lot of value in making this process smoother.

The mooted %load_node line magic would help a lot with this. I wonder whether it could go further than the original idea (basically doing the relevant catalog.load operations) and also perform step 8 in the above, i.e. copy and paste the content of the node function into the notebook:

You can then break the function up into multiple cells (or maybe %load_node would create multiple cells?) and step through them as you please. I think this is a very common way of debugging stuff - I've personally done it a lot anyway.

Disclaimers:

I have no idea if it's technically feasible to write a line magic function that pastes stuff into a cell
how do you also get all the functions called by the node function imported? As well as pandas/whatever else required? Not sure
we currently have a feature that goes the other way round, automatically converting a notebook cell into a node. We're removing it because no one uses it (they just copy and paste instead). My feeling is that this way round would be more useful exactly to ease the debugging process as you describe, but it would need some user research to see if anyone would use it

Another idea: when a pipeline fails, we say "try running %load_node debug=True" in your Kedro notebook". Then you go to your notebook and %load_node debug=True already knows the node that failed so you don't need to tell it that. In fact we know exactly what line it failed (captured in the Python traceback somewhere). Hence we can put split the Jupyter cell into code that runs correctly and the code after the exception is raised. e.g. let's say the dropna operation above is the thing that raised the exception. Then it would look like:

I'm guessing in 90% of cases this is basically how you would split the code yourself and just run the top cell. Note sure how well it would work for for loops and the like though 🤔

I think there's two distinct but related use cases here:

developing a node, especially from scratch. I think it's still common to do this from Jupyter notebooks (it's how I would do it still, even though I predominantly use an IDE). I would write an empty node function in the Python file, then do %load_node to get the input variables loaded up, write all the code in a Jupyter notebook cell, and then copy and paste to the Python file
debugging a node that you've finished writing and is in your pipeline properly, like in the debug=True case above

I don't know if the same solution might work for both these or if we need separate solutions.

Something we've wondered before is whether you should be able to open up the node code in a Jupyter cell in kedro-viz. In fact @limdauto had a rough prototype of this. Basically the bit that shows node code in the metadata panel becomes an mini interactive Jupyter cell that executes %load_node. This way you could also step through the code through kedro-viz. If we could show in kedro viz the progress of a pipeline and highlight the node that's failing that would fit in really nicely here.

noklam · 2022-10-05T14:00:46Z

Note for myself: create a demo for existing debugging workflow.

noklam · 2022-10-18T14:54:27Z

One question for Antony - this would work if the error is within the node function - but would it works if it's deeper in the node?

For example

# node.py
def some_func():
    b = a()
    d = c(b) <- Assume the error is in c -> you would then also need to copy paste the code of c to make it works

noklam · 2022-11-09T13:33:34Z

Potentially useful IPython magic

get_ipython().set_next_input(s)
%debug
%load
from inspect import getsource

merelcht · 2022-11-09T15:21:00Z

Discussed in Technical Design

The general agreement is that we need to improve the debugging workflow of Kedro in notebooks. Concrete actions to achieve this:

Create line/cell magic to debug a node in notebook workflow #2009 Implement an ipython line magic that can load a node, the necessary datasets and copy the node code into a notebook cell. This is the %load_node idea described above. We might want to call this magic %debug_node instead to make it clear it does more than just loading the node and is meant for debugging.
As a follow up on the above, we need to document the debugging workflow for notebooks. Currently we only talk about debugging in through IDEs in our docs. Document how users can debug Kedro inside notebooks using the %load_node magic #2011
Investigate if the ipython %debug line magic can be used for debugging Kedro in notebooks #2010 If so, document this for our users.

noklam mentioned this issue Sep 6, 2022

Should we change the output of session.run? #1802

Closed

1 task

noklam added this to the Improve the Interactive Jupyter notebook workflow milestone Sep 6, 2022

noklam changed the title ~~Workflow of debugging Kedro pipeline with notebooks~~ Workflow of debugging Kedro pipeline Sep 6, 2022

noklam added the Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation label Sep 6, 2022

noklam changed the title ~~Workflow of debugging Kedro pipeline~~ Workflow of debugging Kedro pipeline in notebook Sep 6, 2022

merelcht added Type: Parent Issue and removed Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation labels Nov 9, 2022

noklam modified the milestones: Improve the Interactive Jupyter notebook workflow, Improving the debugging experience with Notebook Mar 22, 2023

AhdraMeraliQB mentioned this issue Jan 31, 2024

[Parent] - Notebook/IPython Debug line magic %load_node feature discussion thread #3535

Closed

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Workflow of debugging Kedro pipeline in notebook #1832

Workflow of debugging Kedro pipeline in notebook #1832

noklam commented Sep 6, 2022 •

edited

antonymilne commented Sep 9, 2022 •

edited

noklam commented Oct 5, 2022

noklam commented Oct 18, 2022

noklam commented Nov 9, 2022

merelcht commented Nov 9, 2022 •

edited

Workflow of debugging Kedro pipeline in notebook #1832

Workflow of debugging Kedro pipeline in notebook #1832

Comments

noklam commented Sep 6, 2022 • edited

Background

What are the pain points with debugging Kedro Pipeline?

Steps to debug Kedro pipeline in a notebook

Problems

Why it would be less of a problem with stuff like Airflow?

Proposal

antonymilne commented Sep 9, 2022 • edited

noklam commented Oct 5, 2022

noklam commented Oct 18, 2022

noklam commented Nov 9, 2022

merelcht commented Nov 9, 2022 • edited

noklam commented Sep 6, 2022 •

edited

antonymilne commented Sep 9, 2022 •

edited

merelcht commented Nov 9, 2022 •

edited