Improve K8sGPT Error Reporting for CrashLoopBackOff Pods #1059

Closed
3 of 4 tasks
naveenthangaraj03 opened this issue Apr 10, 2024 · 3 comments


@naveenthangaraj03
Contributor

Checklist

  • I've searched for similar issues and couldn't find anything matching
  • I've included steps to reproduce the behavior

Affected Components

  • K8sGPT (CLI)
  • K8sGPT Operator

K8sGPT Version

v0.3.27

Kubernetes Version

v1.26.5

Host OS and its Version

Linux

Steps to reproduce

  1. Run the Go code as a container
    package main

    import (
        "fmt"
        "os"
    )

    func main() {
        var arr []int
        // Indexing an empty slice panics at runtime.
        fmt.Fprintln(os.Stdout, arr[0])
    }

  2. Run k8sgpt analyze --filter=Pod
    It reports the following errors:
    default/go-panic-pod(go-panic-pod)
    - Error: back-off 2m40s restarting failed container=go-panic-container pod=go-panic-pod_default(60286b08-47b6-4e7a-be19-576d3e9e6f5d)
    - Error: the last termination reason is Error container=go-panic-container pod=go-panic-pod

  3. Check the pod logs with kubectl logs go-panic-pod. They show the real error:
    panic: runtime error: index out of range [0] with length 0

Expected behaviour

In the Pod analyzer, when a pod is in CrashLoopBackOff, the analyzer fetches a message such as back-off 5m0s restarting failed container=prometheus pod=prometheus-prometheus-kube-prometheus-0_ from the pod's CR. If it fetched the error from the pod's logs instead, we would know exactly what the problem is.
If it fetched from the logs, in my case I would get: level=error err="opening storage failed: open /prometheus/wal/00000828: no space left on device". With this I would understand the cause of the CrashLoopBackOff much better.
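To illustrate the request, here is a minimal sketch of picking a root-cause candidate out of log text. It is not K8sGPT code: the function name lastErrorLine and the markers "panic:" and "level=error" are assumptions made for this example, and the log text is assumed to have been fetched already (for instance with kubectl logs).

```go
package main

import (
	"fmt"
	"strings"
)

// errorMarkers are hypothetical substrings chosen for illustration only:
// "panic:" matches the Go panic from the reproduction steps and
// "level=error" matches the Prometheus log line from the real case.
var errorMarkers = []string{"panic:", "level=error"}

// lastErrorLine returns the last line of logs that contains one of the
// markers, and false when no line matches.
func lastErrorLine(logs string) (string, bool) {
	lines := strings.Split(logs, "\n")
	for i := len(lines) - 1; i >= 0; i-- {
		line := strings.TrimSpace(lines[i])
		for _, marker := range errorMarkers {
			if strings.Contains(line, marker) {
				return line, true
			}
		}
	}
	return "", false
}

func main() {
	// Both samples are the log lines quoted in this issue.
	samples := []string{
		"panic: runtime error: index out of range [0] with length 0",
		`ts=2024-04-10T06:59:00.178Z caller=main.go:1180 level=error err="opening storage failed: open /prometheus/wal/00000828: no space left on device"`,
	}
	for _, logs := range samples {
		if line, ok := lastErrorLine(logs); ok {
			fmt.Println("root-cause candidate:", line)
		}
	}
}
```

Fixed markers like these only work for workloads that happen to log in these formats, so a real analyzer would need something more general.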

Actual behaviour

When the pod is in CrashLoopBackOff, the error message is fetched from the pod's CR, so I cannot get the exact reason why the pod is crash-looping.

Additional Information

Here is the real case:

kubectl get pod -n tcl-monitoring

prometheus-prometheus-kube-prometheus-0 1/2 CrashLoopBackOff 704 (3m30s ago) 22d

When the pod is in CrashLoopBackOff, the analyzer fetches the message below as the error:
state:
  waiting:
    message: back-off 5m0s restarting failed container=prometheus pod=prometheus-prometheus-kube-prometheus-0_tcl-monitoring(29368fc9-fa1d-4b3d-9333-241acf0fbece)
    reason: CrashLoopBackOff
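As a rough sketch of why this message cannot explain the crash, the snippet below splits it into its parts. The pattern is inferred only from the two back-off messages quoted in this issue, and parseBackOff and backOffInfo are hypothetical names, not part of K8sGPT or Kubernetes.

```go
package main

import (
	"fmt"
	"regexp"
	"time"
)

// backOffPattern is an assumption based on the messages quoted in this
// issue; it is not a documented, stable format.
var backOffPattern = regexp.MustCompile(`^back-off (\S+) restarting failed container=(\S+) pod=(\S+)$`)

// backOffInfo holds the fields the waiting message carries.
type backOffInfo struct {
	Delay     time.Duration
	Container string
	Pod       string
}

// parseBackOff extracts the delay, container and pod from a waiting
// message, and reports false when the message does not match the pattern.
func parseBackOff(message string) (backOffInfo, bool) {
	m := backOffPattern.FindStringSubmatch(message)
	if m == nil {
		return backOffInfo{}, false
	}
	delay, err := time.ParseDuration(m[1])
	if err != nil {
		return backOffInfo{}, false
	}
	return backOffInfo{Delay: delay, Container: m[2], Pod: m[3]}, true
}

func main() {
	msg := "back-off 5m0s restarting failed container=prometheus pod=prometheus-prometheus-kube-prometheus-0_tcl-monitoring(29368fc9-fa1d-4b3d-9333-241acf0fbece)"
	if info, ok := parseBackOff(msg); ok {
		fmt.Printf("delay=%s container=%s pod=%s\n", info.Delay, info.Container, info.Pod)
	}
}
```

The message yields only a restart delay, a container name, and a pod identifier. The reason the container exited is not in it.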

k8sgpt analyze --filter=Pod --namespace=tcl-monitoring --explain

0 tcl-monitoring/prometheus-prometheus-kube-prometheus-0(prometheus-prometheus-kube-prometheus)

  • Error: back-off 5m0s restarting failed container=prometheus pod=prometheus-prometheus-kube-prometheus-0_tcl-monitoring(29368fc9-fa1d-4b3d-9333-241acf0fbece)
  • Error: the last termination reason is Error container=prometheus pod=prometheus-prometheus-kube-prometheus-0

kubectl logs -n tcl-monitoring prometheus-prometheus-kube-prometheus-0

ts=2024-04-10T06:59:00.178Z caller=main.go:1180 level=error err="opening storage failed: open /prometheus/wal/00000828: no space left on device"

So, when the pod is in CrashLoopBackOff, we could get the error from the logs instead of from the CR message.

@arbreezy
Member

We have the optional log analyzer, which is still experimental and risky: logs may contain sensitive data, and it sends them to your AI backend.

We don't want to expand the Pod analyzer to fetch logs from the pod, because logs are arbitrary to each workload and that would add unnecessary complexity to this analyzer.

The goal at some point is to have a way to compound the errors from analyzers and contextualize them, so that there is cohesion between them rather than stretching individual analyzers.

@naveenthangaraj03
Contributor Author

@arbreezy Thanks for that. We are unable to use a free OpenAI account with k8sgpt. Can you provide some details about how k8sgpt works with the AI backend? We created a new OpenAI account, added the API key, and ran "k8sgpt analyze --filter=Pod --namespace=default --explain". There is one unhealthy pod, but the command reports an exhausted quota. Do we need a paid account to use AI with k8sgpt, or can we use a free account? If a free account works, what is its limit? Could you share the details? We don't have a clear idea about this; we contacted many people but got no response.

@naveenthangaraj03
Contributor Author

@arbreezy, yes, I got it. My OpenAI token had expired, but with Cohere as the AI backend the log analyzer works fine.

Labels
None yet
Projects
Status: Done
Development

No branches or pull requests

2 participants