Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SDK] Add information about TrainingClient logging #1973

Merged

Conversation

andreyvelich
Copy link
Member

@andreyvelich andreyvelich commented Jan 3, 2024

Fixes: #1946.

As we discussed, in the issue for get_job_logs API, user will see pod logs only if follow=True and we use logger.info() to print it.
If users want to see all messages from TrainingClient APIs, they can override the default logger config in their program:

log = logging.getLogger("kubeflow.training.api.training_client")
log.setLevel(logging.DEBUG)

Also, I fixed get_pytorchjob_template func, since we just need to check if num_procs_per_worker and num_worker_replicas is set before using it.

UPDATE: After discussion on this PR, we decided to use print() for the messages that users are required to see while using SDK.

/assign @droctothorpe @johnugeorge @tenzen-y @deepanker13

/hold for the review

Copy link

@andreyvelich: GitHub didn't allow me to assign the following users: droctothorpe, deepanker13.

Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

Fixes: #1946.

As we discussed, in the issue for get_job_logs API, user will see pod logs only if follow=True and we use logger.info() to print it.
If users want to see all messages from TrainingClient APIs, they can override the default logger config in their program:

log = logging.getLogger("kubeflow.training.api.training_client")
log.setLevel(logging.DEBUG)

Also, I fixed get_pytorchjob_template func, since we just need to check if num_procs_per_worker and num_worker_replicas is set before using it.

/assign @droctothorpe @johnugeorge @tenzen-y @deepanker13

/hold for the review

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@coveralls
Copy link

coveralls commented Jan 3, 2024

Pull Request Test Coverage Report for Build 7423527274

Warning: This coverage report may be inaccurate.

We've detected an issue with your CI configuration that might affect the accuracy of this pull request's coverage report.
To ensure accuracy in future PRs, please see these guidelines.
A quick fix for this PR: rebase it; your next report should be accurate.

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.07%) to 42.828%

Totals Coverage Status
Change from base Build 7285460954: -0.07%
Covered Lines: 3750
Relevant Lines: 8756

💛 - Coveralls

@johnugeorge
Copy link
Member

As we discussed, in the issue for get_job_logs API, user will see pod logs only if follow=True and we use logger.info() to print it.

Won't this be a bad experience for a user who is trying get_job_logs when follow is True? Is there a need to remove it?

@tenzen-y
Copy link
Member

tenzen-y commented Jan 4, 2024

As we discussed, in the issue for get_job_logs API, user will see pod logs only if follow=True and we use logger.info() to print it.

Won't this be a bad experience for a user who is trying get_job_logs when follow is True? Is there a need to remove it?

I would like to ask what @droctothorpe feels about this. It might be output logs at a debug level.

@andreyvelich
Copy link
Member Author

As we discussed, in the issue for get_job_logs API, user will see pod logs only if follow=True and we use logger.info() to print it.

Won't this be a bad experience for a user who is trying get_job_logs when follow is True? Is there a need to remove it?

@johnugeorge Why it is bad experience ? Users can use this parameter to get logs in StdOut and use it in their program to track until Job is complete.
If they just want to get the current logs they can always use get_job_logs without follow parameter.

@johnugeorge
Copy link
Member

With this PR, user won't see any output when they call get_job_logs with follow parameter. How will they get notified to use logger.info() in such a case ?

@andreyvelich
Copy link
Member Author

With this PR, user won't see any output when they call get_job_logs with follow parameter. How will they get notified to use logger.info() in such a case ?

Why users won't see the output ? I just test it and it worked for me.
As you can see in my PR, I modified the default logging level for our logger:

logging.basicConfig()
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

I mentioned above: #1973 (comment) that if user wants to also see debug logs, they can modify the kubeflow.training.api.training_client logger.

@droctothorpe
Copy link
Contributor

droctothorpe commented Jan 4, 2024

Here's the general principle directly from the official Python docs:

Configuring logging by adding handlers, formatters and filters is the responsibility of the application developer, not the library developer. If you are maintaining a library, ensure that you don’t add handlers to any of your loggers other than a NullHandler instance.

-- https://docs.python.org/3/howto/logging-cookbook.html#adding-handlers-other-than-nullhandler-to-a-logger-in-a-library

For example, imagine a training operator SDK consumer sets their log level to WARNING. That configuration will effectively be ignored by this logging handler and INFO statements will nevertheless print because log level inheritance is overridden by the handler instantiation and call to setLevel.

That being said, the above criticism applies to using print as well! When follow is set to True, the library kind of temporarily becomes a CLI rather than an SDK, and since the CLI is the application, not just the consumed library, it makes sense for it to set the logging level.

HOWEVER (so many twists and turns), this specific logger is used in multiple other places within this module, e.g. here. That's a problem. The debug logs will never print, even if the consumer sets the logging level to debug, because the call to setLevel in this module overrides that.

IMO, you can add another handle just for follow=True to this module, but sticking with print is probably simpler.

Here's another link from the Python docs with additional guidelines:
https://docs.python.org/3/howto/logging.html#library-config

Sorry for the wall of text! Take the above with a grain of salt. I'm not an expert in this area.

@andreyvelich
Copy link
Member Author

andreyvelich commented Jan 4, 2024

The debug logs will never print, even if the consumer sets the logging level to debug, because the call to setLevel in this module overrides that.

@droctothorpe Is it actually true? I guess, user can override logging level in their application as follows:

from kubeflow.training import TrainingClient
import logging

def test_job():
   ....
log = logging.getLogger("kubeflow.training.api.training_client")
log.setLevel(logging.DEBUG) 
cl.create_job(name="kubeflow-job", train_func=test_job, num_worker_replicas=1)

...
$ DEBUG:kubeflow.training.api.training_client: PyTorchJob default/kubeflow-job has been created

That being said, the above criticism applies to using print as well! When follow is set to True, the library kind of temporarily becomes a CLI rather than an SDK, and since the CLI is the application, not just the consumed library, it makes sense for it to set the logging level.

I guess, in the future our Python SDK can perform some more application-level actions. For example, build training image on a fly before creating PyTorchJob as discussed here: #1878
How we should send user useful information in that case.

@droctothorpe
Copy link
Contributor

@droctothorpe Is it actually true?

I think so, though we can always test to confirm. It's why the Python docs say this:

Note It is strongly advised that you do not add any handlers other than NullHandler to your library’s loggers. This is because the configuration of handlers is the prerogative of the application developer who uses your library. The application developer knows their target audience and what handlers are most appropriate for their application: if you add handlers ‘under the hood’, you might well interfere with their ability to carry out unit tests and deliver logs which suit their requirements.

This is an oddly contentious topic. Python purists will argue that libraries should never include handlers. But SDKs like the training operator SDK are often meant to be used interactively inside of Jupyter notebooks in a way that's not so different from how CLIs operate, i.e. the library IS the application.

How we should send user useful information in that case.

I think one option is to have a dedicated logger that (this part is key) isn't used in any other context with a dedicated handler. Another option is to use print. That's what KFP does in a couple of places, such as here. You can override a handler's level like you demonstrated, which I think might give the dedicated handler pattern the edge.

@andreyvelich
Copy link
Member Author

I think so, though we can always test to confirm. It's why the Python docs say this:

I've tested it, and I can override the SDK logging level in my application as I messaged above.

When you say that we specify handlers for the logger, please can you explain what exactly do you mean ?
e.g. the following code returns empty list for me:

log = logging.getLogger("kubeflow.training.api.training_client")
print(log.handlers)

$ []

@droctothorpe
Copy link
Contributor

droctothorpe commented Jan 4, 2024

import logging

logging.basicConfig(level=logging.DEBUG)

logger = logging.getLogger(__name__)
logger_info = logging.getLogger(__name__ + "info")
logger_info.setLevel(logging.INFO)

logger.debug("This debug log should print")
logger_info.debug("This debug log should not print because we're overriding the application level handler level")
logger_info.info("This info log should print")

>>> DEBUG:__main__:This debug log should print
>>> INFO:__main__info:This info log should print

You want the application developer to be able to control the logging level globally, except in the very specific case of logger_info, which is analogous to print. In the above example, the second to last line (logger_info.debug) doesn't print to stdout, despite the fact that logging.basicConfig(level=logging.DEBUG) is set. If you change the code as follows:

import logging

logging.basicConfig(level=logging.DEBUG)

logger = logging.getLogger(__name__)
logger_info = logging.getLogger(__name__ + "info")
logger_info.setLevel(logging.INFO)

logger.debug("This debug log should print")
logger.debug("This debug log should print because this logger responds to the logging level set in basicConfig.")
logger_info.info("This info log should print")

>>> DEBUG:__main__:This debug log should print
>>> DEBUG:__main__:This debug log should print because this logger responds to the logging level set in basicConfig.
>>> INFO:__main__info:This info log should print

The second to last line prints after all, i.e. application level control is respected. All of which is to say that if you're going to set logging level in the handler, just make sure to use it only where it's strictly necessary / intentional, and use a logger that responds to global logging levels (via inheritance) everywhere else.

@andreyvelich
Copy link
Member Author

That's sound good, thanks for the clarification @droctothorpe.
Let's keep print() to log application-level logs for now. If users want to run SDK in debug mode, they can always configure their logging as I described in the TrainingClient constructor.
We can evaluate with our users to see if we need to change/improve it.

Please take a look at the docs updates.

@andreyvelich andreyvelich changed the title [SDK] Use logging to print Job logs [SDK] Add information about TrainingClient logging Jan 4, 2024
Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/assign @droctothorpe

"""Get the logs for every Training Job pod. By default it returns logs from
the `master` pod. Logs are returned in this format: { "pod-name": "Log data" }.

If follow = True, this function prints logs to StdOut and returns None.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if a user wants to follow and interact with the response once the job completes? Any reason not to return the logs_dict even if follow is set to True?

Copy link
Member

@tenzen-y tenzen-y Jan 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if a user wants to follow and interact with the response once the job completes?

The pods will be garbage collected based on the job's some runPolicy such as cleanup policy.
Additionally, we cannot guarantee to get logs from completed pods since the pods are removed for some reasons such as node disruption.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, let's return logs if follow=True as well.
Despite on the runPolicy if would be useful for user to get the final results after running get_job_logs.
I made that change @tenzen-y @droctothorpe.

Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, droctothorpe

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Co-authored-by: Alex <mythicalsunlight@gmail.com>
@google-oss-prow google-oss-prow bot removed the lgtm label Jan 5, 2024
@andreyvelich
Copy link
Member Author

/assign @tenzen-y @droctothorpe

@andreyvelich
Copy link
Member Author

/hold cancel

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label Jan 5, 2024
@google-oss-prow google-oss-prow bot merged commit ed3168a into kubeflow:master Jan 5, 2024
33 checks passed
@andreyvelich andreyvelich deleted the issue-1946-use-logging-sdk branch January 5, 2024 15:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[SDK] Setup Logging Verbose Level in TrainingClient
5 participants