Improve logs structure by consolidating libs from controller runtime and controllers #1313

Closed
Jeffwan opened this issue Jul 29, 2021 · 8 comments

Comments

@Jeffwan
Member

Jeffwan commented Jul 29, 2021

Umbrella issue: #1318

The logs are messy, and we should unify on a single logging library. I also created a story in kubeflow/common#143, since this needs changes in both repos.

2021-07-29T13:25:16.864-0700	INFO	controller-runtime.manager.controller.tfjob	Starting Controller	{"reconciler group": "kubeflow.org", "reconciler kind": "TFJob"}
2021-07-29T13:25:16.864-0700	INFO	controller-runtime.manager.controller.tfjob	Starting workers	{"reconciler group": "kubeflow.org", "reconciler kind": "TFJob", "worker count": 1}
INFO[0084] Reconciling for job dist-mnist-for-e2e-test
INFO[0084] Need to create new pod: ps-0                  job=default.dist-mnist-for-e2e-test uid=80e7318c-cbcb-48cd-873e-066b1f13deec
INFO[0084] Controller dist-mnist-for-e2e-test created pod dist-mnist-for-e2e-test-ps-0  job=.dist-mnist-for-e2e-test pod=.dist-mnist-for-e2e-test-ps-0 uid=
INFO[0084] Need to create new pod: ps-1                  job=default.dist-mnist-for-e2e-test uid=80e7318c-cbcb-48cd-873e-066b1f13deec
2021-07-29T13:26:39.488-0700	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"TFJob","namespace":"default","name":"dist-mnist-for-e2e-test","uid":"80e7318c-cbcb-48cd-873e-066b1f13deec","apiVersion":"kubeflow.org/v1","resourceVersion":"2242493"}, "reason": "SuccessfulCreatePod", "message": "Created pod: dist-mnist-for-e2e-test-ps-0"}
INFO[0084] Controller dist-mnist-for-e2e-test created pod dist-mnist-for-e2e-test-ps-1  job=.dist-mnist-for-e2e-test pod=.dist-mnist-for-e2e-test-ps-1 uid=
INFO[0084] need to create new service: ps-0              job=default.dist-mnist-for-e2e-test replica-type=ps uid=80e7318c-cbcb-48cd-873e-066b1f13deec
2021-07-29T13:26:39.510-0700	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"TFJob","namespace":"default","name":"dist-mnist-for-e2e-test","uid":"80e7318c-cbcb-48cd-873e-066b1f13deec","apiVersion":"kubeflow.org/v1","resourceVersion":"2242493"}, "reason": "SuccessfulCreatePod", "message": "Created pod: dist-mnist-for-e2e-test-ps-1"}
INFO[0084] Controller dist-mnist-for-e2e-test created service dist-mnist-for-e2e-test-ps-0
INFO[0084] need to create new service: ps-1              job=default.dist-mnist-for-e2e-test replica-type=ps uid=80e7318c-cbcb-48cd-873e-066b1f13deec
2021-07-29T13:26:39.522-0700	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"TFJob","namespace":"default","name":"dist-mnist-for-e2e-test","uid":"80e7318c-cbcb-48cd-873e-066b1f13deec","apiVersion":"kubeflow.org/v1","resourceVersion":"2242493"}, "reason": "SuccessfulCreateService", "message": "Created service: dist-mnist-for-e2e-test-ps-0"}
INFO[0085] Controller dist-mnist-for-e2e-test created service dist-mnist-for-e2e-test-ps-1
INFO[0085] Need to create new pod: worker-0              job=default.dist-mnist-for-e2e-test uid=80e7318c-cbcb-48cd-873e-066b1f13deec
2021-07-29T13:26:39.544-0700	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"TFJob","namespace":"default","name":"dist-mnist-for-e2e-test","uid":"80e7318c-cbcb-48cd-873e-066b1f13deec","apiVersion":"kubeflow.org/v1","resourceVersion":"2242493"}, "reason": "SuccessfulCreateService", "message": "Created service: dist-mnist-for-e2e-test-ps-1"}
INFO[0085] Controller dist-mnist-for-e2e-test created pod dist-mnist-for-e2e-test-worker-0  job=.dist-mnist-for-e2e-test pod=.dist-mnist-for-e2e-test-worker-0 uid=
INFO[0085] Need to create new pod: worker-1              job=default.dist-mnist-for-e2e-test uid=80e7318c-cbcb-48cd-873e-066b1f13deec
2021-07-29T13:26:39.567-0700	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"TFJob","namespace":"default","name":"dist-mnist-for-e2e-test","uid":"80e7318c-cbcb-48cd-873e-066b1f13deec","apiVersion":"kubeflow.org/v1","resourceVersion":"2242493"}, "reason": "SuccessfulCreatePod", "message": "Created pod: dist-mnist-for-e2e-test-worker-0"}
INFO[0085] Controller dist-mnist-for-e2e-test created pod dist-mnist-for-e2e-test-worker-1  job=.dist-mnist-for-e2e-test pod=.dist-mnist-for-e2e-test-worker-1 uid=
INFO[0085] Need to create new pod: worker-2              job=default.dist-mnist-for-e2e-test uid=80e7318c-cbcb-48cd-873e-066b1f13deec
2021-07-29T13:26:39.613-0700	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"TFJob","namespace":"default","name":"dist-mnist-for-e2e-test","uid":"80e7318c-cbcb-48cd-873e-066b1f13deec","apiVersion":"kubeflow.org/v1","resourceVersion":"2242493"}, "reason": "SuccessfulCreatePod", "message": "Created pod: dist-mnist-for-e2e-test-worker-1"}
INFO[0085] Controller dist-mnist-for-e2e-test created pod dist-mnist-for-e2e-test-worker-2  job=.dist-mnist-for-e2e-test pod=.dist-mnist-for-e2e-test-worker-2 uid=
INFO[0085] Need to create new pod: worker-3              job=default.dist-mnist-for-e2e-test uid=80e7318c-cbcb-48cd-873e-066b1f13deec
2021-07-29T13:26:39.637-0700	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"TFJob","namespace":"default","name":"dist-mnist-for-e2e-test","uid":"80e7318c-cbcb-48cd-873e-066b1f13deec","apiVersion":"kubeflow.org/v1","resourceVersion":"2242493"}, "reason": "SuccessfulCreatePod", "message": "Created pod: dist-mnist-for-e2e-test-worker-2"}
INFO[0085] Controller dist-mnist-for-e2e-test created pod dist-mnist-for-e2e-test-worker-3  job=.dist-mnist-for-e2e-test pod=.dist-mnist-for-e2e-test-worker-3 uid=
INFO[0085] need to create new service: worker-0          job=default.dist-mnist-for-e2e-test replica-type=worker uid=80e7318c-cbcb-48cd-873e-066b1f13deec
2021-07-29T13:26:39.654-0700	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"TFJob","namespace":"default","name":"dist-mnist-for-e2e-test","uid":"80e7318c-cbcb-48cd-873e-066b1f13deec","apiVersion":"kubeflow.org/v1","resourceVersion":"2242493"}, "reason": "SuccessfulCreatePod", "message": "Created pod: dist-mnist-for-e2e-test-worker-3"}
INFO[0085] Controller dist-mnist-for-e2e-test created service dist-mnist-for-e2e-test-worker-0
INFO[0085] need to create new service: worker-1          job=default.dist-mnist-for-e2e-test replica-type=worker uid=80e7318c-cbcb-48cd-873e-066b1f13deec
2021-07-29T13:26:39.670-0700	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"TFJob","namespace":"default","name":"dist-mnist-for-e2e-test","uid":"80e7318c-cbcb-48cd-873e-066b1f13deec","apiVersion":"kubeflow.org/v1","resourceVersion":"2242493"}, "reason": "SuccessfulCreateService", "message": "Created service: dist-mnist-for-e2e-test-worker-0"}
INFO[0085] Controller dist-mnist-for-e2e-test created service dist-mnist-for-e2e-test-worker-1
INFO[0085] need to create new service: worker-2          job=default.dist-mnist-for-e2e-test replica-type=worker uid=80e7318c-cbcb-48cd-873e-066b1f13deec
2021-07-29T13:26:39.680-0700	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"TFJob","namespace":"default","name":"dist-mnist-for-e2e-test","uid":"80e7318c-cbcb-48cd-873e-066b1f13deec","apiVersion":"kubeflow.org/v1","resourceVersion":"2242493"}, "reason": "SuccessfulCreateService", "message": "Created service: dist-mnist-for-e2e-test-worker-1"}
INFO[0085] Controller dist-mnist-for-e2e-test created service dist-mnist-for-e2e-test-worker-2
INFO[0085] need to create new service: worker-3          job=default.dist-mnist-for-e2e-test replica-type=worker uid=80e7318c-cbcb-48cd-873e-066b1f13deec
2021-07-29T13:26:39.748-0700	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"TFJob","namespace":"default","name":"dist-mnist-for-e2e-test","uid":"80e7318c-cbcb-48cd-873e-066b1f13deec","apiVersion":"kubeflow.org/v1","resourceVersion":"2242493"}, "reason": "SuccessfulCreateService", "message": "Created service: dist-mnist-for-e2e-test-worker-2"}
INFO[0085] Controller dist-mnist-for-e2e-test created service dist-mnist-for-e2e-test-worker-3
2021-07-29T13:26:39.765-0700	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"TFJob","namespace":"default","name":"dist-mnist-for-e2e-test","uid":"80e7318c-cbcb-48cd-873e-066b1f13deec","apiVersion":"kubeflow.org/v1","resourceVersion":"2242493"}, "reason": "SuccessfulCreateService", "message": "Created service: dist-mnist-for-e2e-test-worker-3"}
INFO[0085] TFJob=default/dist-mnist-for-e2e-test, ReplicaType=PS expected=2, running=0, failed=0  job=default.dist-mnist-for-e2e-test uid=80e7318c-cbcb-48cd-873e-066b1f13deec
INFO[0085] TFJob=default/dist-mnist-for-e2e-test, ReplicaType=Worker expected=4, running=0, failed=0  job=default.dist-mnist-for-e2e-test uid=80e7318c-cbcb-48cd-873e-066b1f13deec
INFO[0085] Finished updating TFJobs Status "dist-mnist-for-e2e-test" (26.593097ms)  job=default.dist-mnist-for-e2e-test uid=80e7318c-cbcb-48cd-873e-066b1f13deec
2021-07-29T13:26:39.792-0700	INFO	reconcile cancelled, job does not need to do reconcile or has been deleted	{"tfjob": "default/dist-mnist-for-e2e-test", "sync": false, "deleted": false}
2021-07-29T13:26:39.796-0700	INFO	reconcile cancelled, job does not need to do reconcile or has been deleted	{"tfjob": "default/dist-mnist-for-e2e-test", "sync": false, "deleted": false}

@gaocegege
Member

SGTM

@Jeffwan Jeffwan added this to To do in All-in-one Operator Jul 31, 2021
@Jeffwan
Member Author

Jeffwan commented Jul 31, 2021

/help
/good-first-issue

@google-oss-robot

@Jeffwan:
This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

In response to this:

/help
/good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Jeffwan
Member Author

Jeffwan commented Aug 5, 2021

/cc @PatrickXYS If you are interested, feel free to pick these up.

@Jeffwan
Member Author

Jeffwan commented Aug 6, 2021

/assign @PatrickXYS

Feel free to unassign yourself if you are overwhelmed by other work, and let me know if there is anything I can help with.

@gaocegege
Member

The logging format is still inconsistent:

2021-10-28T15:57:17.289+0800    INFO    PyTorchJobSpec is not valid: Master ReplicaSpec must be present {"pytorchjob": "default/elastic-example-echo", "PyTorchJob failed validation": "default/elastic-example-echo"}
INFO[0983] Reconciling for job elastic-example-echo
INFO[0983] PyTorchJob=elastic-example-echo, ReplicaType=Worker expected=2, running=2, succeeded=0 , failed=0

We can still improve it.

@gaocegege
Member

Should we keep "github.com/go-logr/logr" or "github.com/sirupsen/logrus"?

We currently use both. Personally, I prefer keeping logr and removing logrus.

@stale

stale bot commented Mar 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale label Mar 2, 2022
@stale stale bot closed this as completed Apr 17, 2022