
Workflow of debugging Kedro pipeline in notebook #1832

Open
3 tasks
Tracked by #1802
noklam opened this issue Sep 6, 2022 · 5 comments

Comments

@noklam
Contributor

noklam commented Sep 6, 2022

Background

Kedro's philosophy is pretty much that you should use a notebook wisely and keep your code as a Python module. But there are situations where you have to debug in a notebook because the data infrastructure is tied to the platform.

What are the pain points with debugging a Kedro pipeline?

  1. Situations where you have to use a notebook - use cases may be notebook-based platforms like Databricks, or cases where the debugger slows down massively once large datasets are loaded in memory, so you use a notebook as the debug session instead. Here, debugging a Python module is annoying because the source code doesn't live with the notebook; some copy & paste or monkey-patching seems unavoidable.
  2. Scheduled/distributed cluster jobs - use cases may be deployment, or leveraging cluster computing for large-scale ML experiments. In this case you can't attach a debugger, and I don't know of any workaround. If it's just one remote server, you can attach a remote debugger, which VS Code & PyCharm support (see the sketch after this list).
  3. Kedro-specific API isn't friendly enough - MemoryDataSet / CacheDataSet and KedroSession don't have the most user-friendly interface for an interactive environment like a notebook.
    I think 3. is something Kedro should solve, and I would love more feedback on this. 1 is not a Kedro-specific problem, but it's more common among Kedro users due to the data science/ML workflow, and we may try to make it easier. I don't have any workaround for 2.
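
For the single-remote-server case in point 2, a minimal sketch of attaching a remote debugger with debugpy (my addition; the host/port are arbitrary, and VS Code needs a matching "Python: Remote Attach" launch configuration):

import debugpy

# Listen for an incoming debugger connection from the IDE.
debugpy.listen(("0.0.0.0", 5678))
debugpy.wait_for_client()  # block until the IDE attaches
debugpy.breakpoint()       # pause here once attached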

My opinion is:

  1. Not a Kedro-specific problem, but it's more common for Kedro users because of the nature of ML/data science pipelines, and we may be able to figure out a smoother workflow.
  2. Not a Kedro problem; this is true for any Python program, and I don't see anything Kedro could do about it (yet).
  3. This is a Kedro problem that we should improve.

I talked to Tom earlier and tried to understand his debugging process.

Steps to debug a Kedro pipeline in a notebook

  1. Read the stack trace - find the line of code that produced the error
  2. Find which node this function belongs to
  3. Try to rerun the pipeline from just before this node
  4. If its input is not a persisted dataset, you need to change that in catalog.yml and re-run the pipeline; the error is thrown again
  5. The session has already been used once, so calling session.run again throws an error (so he had a wrapper function that recreates the session and does something similar to session.run - see the sketch after this list)
  6. Create a new session or %reload_kedro?
  7. Now catalog.load that persisted dataset, i.e. func(catalog.load("some_data"))
  8. Copy the source code of func into the notebook; this works if the function itself is the node function, but if the error is in some function buried deeper down, that means a lot more copy-pasting and perhaps changed imports
  9. Change the source code and make it work in the notebook
  10. Rerun the pipeline to ensure everything works
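
A rough sketch of steps 5-7 as notebook code (my addition; it assumes a Kedro 0.18-era API, and "some_data" / func are placeholders for the real dataset and node function):

from kedro.framework.startup import bootstrap_project
from kedro.framework.session import KedroSession

# A KedroSession can only be run once, so recreate it for every attempt.
bootstrap_project(".")
with KedroSession.create(project_path=".") as session:
    catalog = session.load_context().catalog

# Load the last persisted dataset before the failing node and call the
# node function on it, mimicking what the runner would do.
result = func(catalog.load("some_data"))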

Note that in a local development environment, all you would do is set a breakpoint. With a notebook, however, you have to touch several files, i.e.

  • Notebook cell
  • Source code of the function that causes the problem
  • catalog.yml

Problems

  • KedroSession cannot be re-run; users will call session.run multiple times for debugging purposes.
  • Sometimes session.run doesn't give the correct output; Should we change the output of session.run? #1802 tries to address this problem.
  • An error happens in the 50th node of a 100-node pipeline - how can we remove some steps so less copy & paste is needed?
  • Not all nodes write data to disk - this means they can't be recovered easily. It makes sense to keep most things in memory, but can we make it easier for debug sessions, where users can change this behavior instead of changing every entry in the catalog? (see the sketch after this list)
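
On the last point, a hedged sketch of one possible escape hatch (my addition, not an existing Kedro feature): before a debug run, register a persisted dataset for every free in-memory output so that each node's inputs become recoverable:

from kedro.extras.datasets.pickle import PickleDataSet

# Any dataset the pipeline uses that has no catalog entry would be a
# MemoryDataSet at runtime; give each one a pickle entry for this debug run.
for name in pipeline.data_sets() - set(catalog.list()):
    catalog.add(name, PickleDataSet(filepath=f"data/debug/{name}.pkl"))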

Why is this less of a problem with tools like Airflow?

  • All datasets are persisted - each node is self-contained, so you only need to rerun the node of interest
  • The UI shows you clearly which node fails

Proposal

We are definitely not trying to re-create the debugger experience. Ideally, Kedro could just pop up the correct context at the exact line of code (similar to putting a breakpoint right before the error happens).

Some of this can reuse the backtracking logic we have in #1795 so we don't have to rerun the entire pipeline.
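
A hedged sketch of how that backtracking could look (my addition; node and dataset APIs as in Kedro 0.18): walk upstream from the failing node until every required input has a catalog entry, then rerun only that sub-pipeline:

from collections import deque

def nodes_to_rerun(pipeline, catalog, failing_node):
    """Find the minimal set of upstream nodes needed to reproduce a failure."""
    persisted = set(catalog.list())  # datasets with a catalog entry
    producers = {out: n for n in pipeline.nodes for out in n.outputs}
    to_run, queue = {failing_node}, deque(failing_node.inputs)
    while queue:
        name = queue.popleft()
        if name in persisted or name not in producers:
            continue  # loadable from the catalog, or a free pipeline input
        node = producers[name]
        if node not in to_run:
            to_run.add(node)
            queue.extend(node.inputs)
    return to_run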

@noklam noklam changed the title Workflow of debugging Kedro pipeline with notebooks Workflow of debugging Kedro pipeline Sep 6, 2022
@noklam noklam added the Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation label Sep 6, 2022
@noklam noklam changed the title Workflow of debugging Kedro pipeline Workflow of debugging Kedro pipeline in notebook Sep 6, 2022
@antonymilne
Contributor

antonymilne commented Sep 9, 2022

This is a great issue, thank you for opening it! I agree with a lot of what you say here, including which bits might be feasible to improve within kedro. I myself have followed pretty much exactly the same debugging workflow that you describe many, many times. I think it's quite a common way of working, even when IDE breakpoints are available. Especially for users coming from a notebook background, debugging things using an interactive Python session in Jupyter is way easier than trying to use pdb or an IDE breakpoint. It also allows you to do things you can't do easily in the IDE (like plot graphs). So I definitely think there's a lot of value in making this process smoother.

The mooted %load_node line magic would help a lot with this. I wonder whether it could go further than the original idea (basically doing the relevant catalog.load operations) and also perform step 8 above, i.e. copy and paste the content of the node function into the notebook:
[image: mock-up of %load_node pasting the node function's source into a notebook cell]

You can then break the function up into multiple cells (or maybe %load_node would create multiple cells?) and step through them as you please. I think this is a very common way of debugging stuff - I've personally done it a lot anyway.

Disclaimers:

  • I have no idea if it's technically feasible to write a line magic function that pastes stuff into a cell
  • how do you also get all the functions called by the node function imported, as well as pandas/whatever else is required? Not sure
  • we currently have a feature that goes the other way round, automatically converting a notebook cell into a node. We're removing it because no one uses it (they just copy and paste instead). My feeling is that this direction would be more useful, precisely because it eases the debugging process you describe, but it would need some user research to see if anyone would use it
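
On the first disclaimer: get_ipython().set_next_input can paste text into the next cell, so a minimal sketch of such a magic might look like this (my addition; it assumes a pipeline object is available in the notebook namespace, as it is after %reload_kedro):

import inspect
from IPython import get_ipython
from IPython.core.magic import register_line_magic

@register_line_magic
def load_node(line):
    """Paste the named node's function source into the next notebook cell."""
    node = next(n for n in pipeline.nodes if n.name == line.strip())
    get_ipython().set_next_input(inspect.getsource(node.func))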

Another idea: when a pipeline fails, we say "try running %load_node debug=True in your Kedro notebook". Then you go to your notebook, and %load_node debug=True already knows the node that failed, so you don't need to tell it. In fact we know exactly which line it failed on (captured in the Python traceback somewhere). Hence we can split the Jupyter cell into the code that runs correctly and the code from where the exception is raised. E.g. let's say the dropna operation above is the thing that raised the exception. Then it would look like:
[image: mock-up of the node source split into two cells at the failing dropna line]
I'm guessing in 90% of cases this is basically how you would split the code yourself, and you'd just run the top cell. Not sure how well it would work for for-loops and the like though 🤔
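
Finding the failing line from the traceback is straightforward (my addition, a hedged sketch):

import inspect
import traceback

def failing_line(exc, func):
    """Return the line number inside func's source file where exc was raised."""
    filename = inspect.getsourcefile(func)
    for frame in reversed(traceback.extract_tb(exc.__traceback__)):
        if frame.filename == filename:
            return frame.lineno
    return None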

I think there are two distinct but related use cases here:

  • developing a node, especially from scratch. I think it's still common to do this from Jupyter notebooks (it's how I would do it still, even though I predominantly use an IDE). I would write an empty node function in the Python file, then do %load_node to get the input variables loaded up, write all the code in a Jupyter notebook cell, and then copy and paste to the Python file
  • debugging a node that you've finished writing and is in your pipeline properly, like in the debug=True case above

I don't know if the same solution might work for both of these or if we need separate solutions.

Something we've wondered before is whether you should be able to open up the node code in a Jupyter cell in kedro-viz. In fact @limdauto had a rough prototype of this. Basically, the bit that shows node code in the metadata panel becomes a mini interactive Jupyter cell that executes %load_node. This way you could also step through the code in kedro-viz. If we could show the progress of a pipeline in kedro-viz and highlight the node that's failing, that would fit in really nicely here.

@noklam
Contributor Author

noklam commented Oct 5, 2022

Note for myself: create a demo of the existing debugging workflow.

@noklam
Contributor Author

noklam commented Oct 18, 2022

One question for Antony - this would work if the error is within the node function itself, but would it work if it's deeper in the node?

For example

# node.py
def a(): ...
def c(b): ...  # assume the error is raised inside c

def some_func():
    b = a()
    d = c(b)  # you would then also need to copy-paste the source of c to make it work
    return d
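
(For reference, a hedged illustration, my addition: IPython's post-mortem debugger does reach the innermost frame however deep the error is, so c's locals can be inspected without copying its source.)

In [2]: %debug
ipdb> where   # show the stack: some_func -> c
ipdb> args    # inspect c's arguments
ipdb> up      # move to the some_func frame
ipdb> quit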

@noklam
Contributor Author

noklam commented Nov 9, 2022

Potentially useful IPython magics and helpers

  • get_ipython().set_next_input(s)
  • %debug
  • %load
  • from inspect import getsource
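
For example, combining the last two items pastes a function's source into the next cell for editing (my addition; some_func is a placeholder):

from inspect import getsource
from IPython import get_ipython

# Queue the function's source as the content of the next notebook cell.
get_ipython().set_next_input(getsource(some_func), replace=False)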

@merelcht
Member

merelcht commented Nov 9, 2022

Discussed in Technical Design

The general agreement is that we need to improve the debugging workflow of Kedro in notebooks. Concrete actions to achieve this:
