Improve K8sGPT Error Reporting for CrashLoopBackOff Pods #1059

naveenthangaraj03 · 2024-04-10T07:48:01Z

Checklist

I've searched for similar issues and couldn't find anything matching
I've included steps to reproduce the behavior

Affected Components

K8sGPT (CLI)
K8sGPT Operator

K8sGPT Version

v0.3.27

Kubernetes Version

v1.26.5

Host OS and its Version

Linux

Steps to reproduce

Run the following Go program as a container:
package main

import (
	"fmt"
	"os"
)

func main() {
	var arr []int
	// Indexing an empty slice panics at runtime.
	fmt.Fprintln(os.Stdout, arr[0])
}
Run k8sgpt analyze --filter=Pod.
K8sGPT reports the following errors:
default/go-panic-pod(go-panic-pod)
- Error: back-off 2m40s restarting failed container=go-panic-container pod=go-panic-pod_default(60286b08-47b6-4e7a-be19-576d3e9e6f5d)
- Error: the last termination reason is Error container=go-panic-container pod=go-panic-pod
Check the pod's logs with kubectl logs go-panic-pod; they show the real error:
panic: runtime error: index out of range [0] with length 0

Expected behaviour

When a pod is in CrashLoopBackOff, the Pod analyzer takes its error message from the pod's status, for example: back-off 5m0s restarting failed container=prometheus pod=prometheus-prometheus-kube-prometheus-0_. If the analyzer fetched the error from the pod's logs instead, we would know exactly what the problem is.
In my case, the logs contain: level=error err="opening storage failed: open /prometheus/wal/00000828: no space left on device". That tells me far more about the cause of the CrashLoopBackOff.
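To make the request concrete, here is a minimal sketch of how a likely root-cause line could be picked out of a container's log text. It is not K8sGPT's implementation: the function name likelyCause and the marker list are hypothetical, and the markers only cover the two log styles that appear in this issue.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// errorMarkers is an illustrative, hypothetical list. Real workloads log in
// arbitrary formats, so any fixed list like this is only a heuristic.
var errorMarkers = []string{"panic:", "level=error", "level=fatal"}

// likelyCause returns the last log line that contains one of errorMarkers.
// If no line matches, it falls back to the last non-empty line.
func likelyCause(logs string) string {
	var lastNonEmpty, lastMatch string
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		lastNonEmpty = line
		for _, m := range errorMarkers {
			if strings.Contains(line, m) {
				lastMatch = line
				break
			}
		}
	}
	if lastMatch != "" {
		return lastMatch
	}
	return lastNonEmpty
}

func main() {
	logs := "starting up\npanic: runtime error: index out of range [0] with length 0\n\ngoroutine 1 [running]:\n"
	fmt.Println(likelyCause(logs))
}
```

With the sample input, the panic line is selected rather than the trailing goroutine dump. The log text itself would still have to be fetched from the crashed container, which this sketch leaves out.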

Actual behaviour

When the pod is in CrashLoopBackOff, the error message is taken from the pod's status, so I cannot see the exact reason why the pod is crash-looping.

Additional Information

Here is the real case:

kubectl get pod -n tcl-monitoring

prometheus-prometheus-kube-prometheus-0 1/2 CrashLoopBackOff 704 (3m30s ago) 22d

When the pod is in CrashLoopBackOff, K8sGPT picks up the message below as the error:
state:
  waiting:
    message: back-off 5m0s restarting failed container=prometheus pod=prometheus-prometheus-kube-prometheus-0_tcl-monitoring(29368fc9-fa1d-4b3d-9333-241acf0fbece)
    reason: CrashLoopBackOff

k8sgpt analyze --filter=Pod --namespace=tcl-monitoring --explain

0 tcl-monitoring/prometheus-prometheus-kube-prometheus-0(prometheus-prometheus-kube-prometheus)

Error: back-off 5m0s restarting failed container=prometheus pod=prometheus-prometheus-kube-prometheus-0_tcl-monitoring(29368fc9-fa1d-4b3d-9333-241acf0fbece)
Error: the last termination reason is Error container=prometheus pod=prometheus-prometheus-kube-prometheus-0

kubectl logs -n tcl-monitoring prometheus-prometheus-kube-prometheus-0

ts=2024-04-10T06:59:00.178Z caller=main.go:1180 level=error err="opening storage failed: open /prometheus/wal/00000828: no space left on device"
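The Prometheus line above uses a key=value (logfmt-style) layout, so the err field can be isolated with a few lines of Go. This is only a sketch under that assumption: logfmtValue is a hypothetical helper, not part of K8sGPT or Prometheus, and it handles just plain and double-quoted values.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// logfmtValue returns the value stored under key in a logfmt-style line,
// and false if the key is absent. Hypothetical helper for illustration.
func logfmtValue(line, key string) (string, bool) {
	prefix := key + "="
	for idx := 0; idx < len(line); {
		i := strings.Index(line[idx:], prefix)
		if i < 0 {
			return "", false
		}
		start := idx + i
		idx = start + len(prefix)
		if start > 0 && line[start-1] != ' ' {
			continue // matched the tail of a longer key, keep searching
		}
		rest := line[idx:]
		// Quoted value: take the leading quoted string and unescape it.
		if q, err := strconv.QuotedPrefix(rest); err == nil {
			if v, err := strconv.Unquote(q); err == nil {
				return v, true
			}
		}
		// Bare value: runs until the next space or the end of the line.
		if end := strings.IndexByte(rest, ' '); end >= 0 {
			return rest[:end], true
		}
		return rest, true
	}
	return "", false
}

func main() {
	line := `ts=2024-04-10T06:59:00.178Z caller=main.go:1180 level=error err="opening storage failed: open /prometheus/wal/00000828: no space left on device"`
	if v, ok := logfmtValue(line, "err"); ok {
		fmt.Println(v)
	}
}
```

Run against the log line from this issue, it prints the storage error message without the surrounding fields. Other workloads log in other formats, so a helper like this would not generalize.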

So when the pod is in CrashLoopBackOff, the error could be taken from the logs instead of from the status message.


arbreezy · 2024-04-19T22:05:10Z

We have the optional log analyzer, which is still experimental and risky: it sends logs to your AI backend, and logs may contain sensitive data.

We don't want to expand the Pod analyzer to fetch logs from the pod, because logs are arbitrary to each workload and that would add unnecessary complexity to this analyzer.

The goal at some point is to have a way to compound the errors from analyzers and contextualize them, so that there is cohesion between them rather than stretching individual analyzers.

naveenthangaraj03 · 2024-05-31T07:15:28Z

@arbreezy Thanks for that. We are unable to use a free OpenAI account with k8sgpt. Can you provide some details about how k8sgpt works with the AI backend? We created a new OpenAI account and added the API key. When we run the command "k8sgpt analyze --filter=Pod --namespace=default --explain" with one unhealthy pod in the namespace, it reports an exhausted quota. Do we need a paid account to use AI with k8sgpt, or can we use a free account? If a free account works, what is its limit? Could you share detailed information? We don't have a clear idea about this; we contacted many people but got no response.

naveenthangaraj03 · 2024-07-26T08:58:34Z

@arbreezy, I figured it out. The OpenAI token I had was expired; with Cohere as the AI backend, the log analyzer works fine.

naveenthangaraj03 closed this as completed Jul 26, 2024
