Resurface CassandraTask error messages in the Failed condition #612

Closed
adejanovski opened this issue Jan 22, 2024 · 4 comments · Fixed by #614
Labels: done, enhancement

Comments

@adejanovski
Contributor

What is missing?

When a CassandraTask fails, its status carries no information that helps identify the underlying issue. Finding the error message then requires going through the cass-operator logs.
We need to set the Message field of the Failed condition to the error.

Why is this needed?

This would make investigation a lot easier, with no need to go through the logs.

adejanovski added the enhancement and product-backlog labels Jan 22, 2024
@burmanm
Contributor

burmanm commented Jan 22, 2024

That information should be there, if there is a message to be set.

https://github.com/k8ssandra/cass-operator/blob/master/internal/controllers/control/cassandratask_controller.go#L375

@adejanovski
Contributor Author

I'm seeing cases where the task fails with an NPE reported in the cass-operator logs but no message in the Failed condition.
Let's investigate a bit more, I'll come up with a way to reproduce this.

@adejanovski
Contributor Author

adejanovski commented Jan 23, 2024

Here's a sample status from a failed CassandraTask:

status:
  completionTime: '2024-01-23T07:59:44Z'
  conditions:
    - lastTransitionTime: '2024-01-23T07:59:44Z'
      message: ''
      reason: Running
      status: 'False'
      type: Running
    - lastTransitionTime: '2024-01-23T07:59:44Z'
      message: ''
      reason: Complete
      status: 'True'
      type: Complete
    - lastTransitionTime: '2024-01-23T07:59:44Z'
      message: ''
      reason: Failed
      status: 'True'
      type: Failed
  failed: 1
  startTime: '2024-01-23T07:59:43Z'

And here's the error in the cass-operator logs:

2024-01-23T09:32:42.286Z	ERROR	Job failed to successfully complete the task	... {"error": "task failed: java.lang.IllegalArgumentException: Keyspace ALL does not exist"}
github.com/k8ssandra/cass-operator/internal/controllers/control.(*CassandraTaskReconciler).reconcileEveryPodTask
	/workspace/internal/controllers/control/cassandratask_controller.go:660
github.com/k8ssandra/cass-operator/internal/controllers/control.(*CassandraTaskReconciler).Reconcile
	/workspace/internal/controllers/control/cassandratask_controller.go:339
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:122
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:323
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:235

@burmanm
Contributor

burmanm commented Feb 5, 2024

The only way to return a message in this case is to also stop the task execution when a non-retryable failure happens. What happens here is that we might get that error on the first pod, but we currently move on to the next pod and try the same task again, so writing the error message from the first pod didn't make sense.

adejanovski added the ready-for-review label and removed the product-backlog label Feb 5, 2024
burmanm self-assigned this Feb 6, 2024
adejanovski added the review label and removed the ready-for-review label Feb 12, 2024
adejanovski added the done label and removed the review label Feb 12, 2024