
[Bug] Full diagnostics page cannot be accessed (no change from "Running Diagnostics...") #99

Closed
2 tasks done
githubeto opened this issue May 22, 2024 · 10 comments
Labels: bug, needs-triage

Comments


githubeto commented May 22, 2024

Kubecost Version

2.2.5

Kubernetes Version

1.28

Kubernetes Platform

EKS

Description

As shown in the screenshot, I cannot access the full diagnostics page. Why is that?
We have configured the Athena integration, but because the CUR has not yet arrived in S3, it is still in a waiting state.

Helm values

global:
  grafana:
    enabled: false
    proxy: false
  prometheus:
    enabled: true
ingress:
  enabled: false
kubecostModel:
  etlAssetReconciliationEnabled: false
  etlCloudUsage: false
  extraEnv:
  - name: LOG_LEVEL
    value: warn
  utcOffset: "+09:00"
kubecostProductConfigs:
  athenaBucketName: s3://skystyle-mng-athena-log
  athenaDatabase: athenacurcfn_skystyle_mng_kubecost
  athenaProjectID: "xxxxxxxxxxx"
  athenaRegion: ap-northeast-1
  athenaTable: skystyle_mng_kubecost
  athenaWorkgroup: spdkube-aws-mgr-athena-workgroup
  awsSpotDataBucket: spot-instance-datafeed-subscription
  awsSpotDataRegion: ap-northeast-1
  projectID: "xxxxxxxxxxx"
kubecostToken: xxxxxxxxxxx
networkPolicy:
  enabled: false
persistentVolume:
  dbSize: 32Gi
  enabled: true
  size: 32Gi
pricingCsv:
  enabled: false
priority:
  enabled: false
prometheus:
  server:
    global:
      evaluation_interval: 1m
      external_labels:
        cluster_id: aws-mgr
      scrape_interval: 1m
      scrape_timeout: 60s
reporting:
  productAnalytics: false
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::xxxxxxxxxxx:role/aws-mgr-kubecost-role
  create: true
  name: kubecost
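
(For reference, values like these would typically be applied with something like the following; the release name, namespace, and chart reference are assumptions, not taken from this issue:)

helm upgrade --install kubecost kubecost/cost-analyzer \
  --namespace kubecost --create-namespace \
  -f values.yaml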

kubectl logs -f deployment.apps/kubecost-cost-analyzer -c cost-model | grep ERR

ERR Failed to query prometheus at http://kubecost-prometheus-server.kubecost. Error: query error: 'Post "http://kubecost-prometheus-server.kubecost/api/v1/query?query=up&time=1716355356": dial tcp 172.20.218.103:80: connect: connection refused' fetching query 'up' . Troubleshooting help available at: http://docs.kubecost.com/custom-prom#troubleshoot
ERR Failed to lookup reserved instance data: no reservation data available in Athena
ERR Failed to lookup savings plan data: Error fetching Savings Plan Data: QueryAthenaPaginated: query execution error: no query results available for query 5266696b-f7d7-4570-9dc5-013eb76a8690
ERR Alerts config file failed to load: open /var/configs/alerts/alerts.json: no such file or directory
ERR savings: cluster sizing: failed to get monthly cluster rates: error getting valid asset set in MonthlyNodeClusterRates: failed to query from assets for 2024-05-22 00:00:00 +0000 UTC/2024-05-23 00:00:00 +0000 UTC: boundary error: requested [2024-05-22T00:00:00+0000, 2024-05-23T00:00:00+0000); supported [2024-05-22T02:00:00+0000, 2024-05-22T05:22:51+0000): Store[1h]: store does not have coverage to perform query
ERR Asset ETL: ComputeAssets: clusterManagementQuery: Prometheus communication error: sum_over_time((avg(kubecost_cluster_management_cost{}) by (cluster_id))[60m:1m] offset 322m) * 0.016667: retrying
ERR FA[*types.ContainerStatsSet]: Error building window '{Start:2024-05-21 00:00:00 +0000 UTC End:2024-05-22 00:00:00 +0000 UTC}': building [2024-05-21 00:00:00 +0000 UTC-2024-05-21 00:30:00 +0000 UTC]: querying cpu: Error: Post "http://kubecost-prometheus-server.kubecost/api/v1/query_range?end=2024-05-21T00%3A30%3A00Z&query=irate%28container_cpu_usage_seconds_total%7B%0A++container%21%3D%22%22%2C%0A++container%21%3D%22POD%22%2C%0A++container_name%21%3D%22POD%22%2C%0A%7D%5B5m%5D%29&start=2024-05-21T00%3A00%3A00Z&step=60.000": dial tcp 172.20.218.103:80: connect: connection refused, Body:  Query: irate(container_cpu_usage_seconds_total{
ERR FA[*types.ContainerStatsSet]: Error building window '{Start:2024-05-20 00:00:00 +0000 UTC End:2024-05-21 00:00:00 +0000 UTC}': building [2024-05-20 00:00:00 +0000 UTC-2024-05-20 00:30:00 +0000 UTC]: querying cpu: Error: Post "http://kubecost-prometheus-server.kubecost/api/v1/query_range?end=2024-05-20T00%3A30%3A00Z&query=irate%28container_cpu_usage_seconds_total%7B%0A++container%21%3D%22%22%2C%0A++container%21%3D%22POD%22%2C%0A++container_name%21%3D%22POD%22%2C%0A%7D%5B5m%5D%29&start=2024-05-20T00%3A00%3A00Z&step=60.000": dial tcp 172.20.218.103:80: connect: connection refused, Body:  Query: irate(container_cpu_usage_seconds_total{
ERR Asset ETL: ComputeAssets: clusterManagementQuery: Prometheus communication error: sum_over_time((avg(kubecost_cluster_management_cost{}) by (cluster_id))[60m:1m] offset 322m) * 0.016667: retrying
ERR FA[*types.ContainerStatsSet]: Error building window '{Start:2024-05-19 00:00:00 +0000 UTC End:2024-05-20 00:00:00 +0000 UTC}': building [2024-05-19 00:00:00 +0000 UTC-2024-05-19 00:30:00 +0000 UTC]: querying cpu: Error: Post "http://kubecost-prometheus-server.kubecost/api/v1/query_range?end=2024-05-19T00%3A30%3A00Z&query=irate%28container_cpu_usage_seconds_total%7B%0A++container%21%3D%22%22%2C%0A++container%21%3D%22POD%22%2C%0A++container_name%21%3D%22POD%22%2C%0A%7D%5B5m%5D%29&start=2024-05-19T00%3A00%3A00Z&step=60.000": dial tcp 172.20.218.103:80: connect: connection refused, Body:  Query: irate(container_cpu_usage_seconds_total{
ERR CostModel.ComputeAllocation: failed to build pod map: Prometheus communication error: avg(kube_pod_container_status_running{} != 0) by (pod, namespace, cluster_id)[1h:5m]
ERR ComputeCostData: Request Error: query error: 'Post "http://kubecost-prometheus-server.kubecost/api/v1/query?query=sum%28increase%28kubecost_pod_network_egress_bytes_total%7Binternet%3D%22false%22%2C+sameZone%3D%22false%22%2C+sameRegion%3D%22true%22%2C+%7D%5B2m%5D+%29%29+by+%28namespace%2Cpod_name%2Ccluster_id%29+%2F+1024+%2F+1024+%2F+1024&time=1716355371": dial tcp 172.20.218.103:80: connect: connection refused' fetching query 'sum(increase(kubecost_pod_network_egress_bytes_total{internet="false", sameZone="false", sameRegion="true", }[2m] )) by (namespace,pod_name,cluster_id) / 1024 / 1024 / 1024'
ERR ComputeCostData: Parsing Error: Prometheus communication error: sum(increase(kubecost_pod_network_egress_bytes_total{internet="false", sameZone="false", sameRegion="true", }[2m] )) by (namespace,pod_name,cluster_id) / 1024 / 1024 / 1024
ERR ComputeCostData: Request Error: query error: 'Post "http://kubecost-prometheus-server.kubecost/api/v1/query?query=avg%28%0A%09%09label_replace%28%0A%09%09%09label_replace%28%0A%09%09%09%09label_replace%28%0A%09%09%09%09%09sum_over_time%28container_memory_working_set_bytes%7Bcontainer%21%3D%22%22%2C+container%21%3D%22POD%22%2C+instance%21%3D%22%22%2C+%7D%5B2m%5D+%29%2C+%22node%22%2C+%22%241%22%2C+%22instance%22%2C+%22%28.%2B%29%22%0A%09%09%09%09%29%2C+%22container_name%22%2C+%22%241%22%2C+%22container%22%2C+%22%28.%2B%29%22%0A%09%09%09%29%2C+%22pod_name%22%2C+%22%241%22%2C+%22pod%22%2C+%22%28.%2B%29%22%0A%09%09%29%0A%09%29+by+%28namespace%2C+container_name%2C+pod_name%2C+node%2C+cluster_id%29&time=1716355371": dial tcp 172.20.218.103:80: connect: connection refused' fetching query 'avg(
ERR ComputeCostData: Parsing Error: Prometheus communication error: avg(
ERR ComputeCostData: Request Error: query error: 'Post "http://kubecost-prometheus-server.kubecost/api/v1/query?query=sum%28increase%28kubecost_pod_network_egress_bytes_total%7Binternet%3D%22false%22%2C+sameZone%3D%22false%22%2C+sameRegion%3D%22false%22%2C+%7D%5B2m%5D+%29%29+by+%28namespace%2Cpod_name%2Ccluster_id%29+%2F+1024+%2F+1024+%2F+1024&time=1716355371": dial tcp 172.20.218.103:80: connect: connection refused' fetching query 'sum(increase(kubecost_pod_network_egress_bytes_total{internet="false", sameZone="false", sameRegion="false", }[2m] )) by (namespace,pod_name,cluster_id) / 1024 / 1024 / 1024'
ERR ComputeCostData: Parsing Error: Prometheus communication error: sum(increase(kubecost_pod_network_egress_bytes_total{internet="false", sameZone="false", sameRegion="false", }[2m] )) by (namespace,pod_name,cluster_id) / 1024 / 1024 / 1024
ERR ComputeCostData: Request Error: query error: 'Post "http://kubecost-prometheus-server.kubecost/api/v1/query?query=avg%28%0A%09%09label_replace%28%0A%09%09%09label_replace%28%0A%09%09%09%09label_replace%28%0A%09%09%09%09%09rate%28%0A%09%09%09%09%09%09container_cpu_usage_seconds_total%7Bcontainer%21%3D%22%22%2C+container%21%3D%22POD%22%2C+instance%21%3D%22%22%2C+%7D%5B2m%5D+%0A%09%09%09%09%09%29%2C+%22node%22%2C+%22%241%22%2C+%22instance%22%2C+%22%28.%2B%29%22%0A%09%09%09%09%29%2C+%22container_name%22%2C+%22%241%22%2C+%22container%22%2C+%22%28.%2B%29%22%0A%09%09%09%29%2C+%22pod_name%22%2C+%22%241%22%2C+%22pod%22%2C+%22%28.%2B%29%22%0A%09%09%29%0A%09%29+by+%28namespace%2C+container_name%2C+pod_name%2C+node%2C+cluster_id%29&time=1716355371": dial tcp 172.20.218.103:80: connect: connection refused' fetching query 'avg(
ERR ComputeCostData: Parsing Error: Prometheus communication error: avg(
ERR ComputeCostData: Request Error: query error: 'Post "http://kubecost-prometheus-server.kubecost/api/v1/query?query=sum%28increase%28kubecost_pod_network_egress_bytes_total%7Binternet%3D%22true%22%2C+%7D%5B2m%5D+%29%29+by+%28namespace%2Cpod_name%2Ccluster_id%29+%2F+1024+%2F+1024+%2F+1024&time=1716355371": dial tcp 172.20.218.103:80: connect: connection refused' fetching query 'sum(increase(kubecost_pod_network_egress_bytes_total{internet="true", }[2m] )) by (namespace,pod_name,cluster_id) / 1024 / 1024 / 1024'
ERR ComputeCostData: Parsing Error: Prometheus communication error: sum(increase(kubecost_pod_network_egress_bytes_total{internet="true", }[2m] )) by (namespace,pod_name,cluster_id) / 1024 / 1024 / 1024
ERR CostModel.ComputeAllocation: query context error Errors:
ERR CostModel.ComputeAllocation: query context error Errors:
ERR CostModel.ComputeAllocation: query context error Errors:
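
(All of the failures above are "connection refused" against the ClusterIP of kubecost-prometheus-server on port 80. A hedged first check, assuming the default Service name and namespace from this install, is whether that Service actually has ready endpoints behind it:

kubectl get svc kubecost-prometheus-server -n kubecost
kubectl get endpoints kubecost-prometheus-server -n kubecost

An empty ENDPOINTS column would mean no ready Prometheus pod is backing the Service, even if a pod shows as Running.)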

Steps to reproduce

  1. helm install
  2. Spot datafeed setup
  3. AWS Cloud Billing Integration

Expected behavior

The diagnostics page can be accessed.

Impact

No response

Screenshots

  • "View Full Diagnostics" cannot be accessed (see attached screenshot).

Logs

No response

Slack discussion

No response

Troubleshooting

  • I have read and followed the issue guidelines and this is a bug impacting only the Kubecost application.
  • I have searched other issues in this repository and mine is not recorded.
githubeto added the bug and needs-triage labels on May 22, 2024
@dwbrown2 (Collaborator)

@jessegoodier @AjayTripathy or others will likely be able to provide more detailed troubleshooting recommendations, but it looks like your Prometheus isn't reachable. What's the status of that pod?

@githubeto (Author)

> @jessegoodier @AjayTripathy or others will likely be able to provide more detailed troubleshooting recommendations, but it looks like your Prometheus isn't reachable. What's the status of that pod?

The Prometheus pod is running.

@jessegoodier

Do you have network policies that prevent communication between pods?
Also, is anything else running in this cluster that has a networking issue?
@githubeto

@githubeto (Author)

> Do you have network policies that prevent communication between pods? Also, is anything else running in this cluster that has a networking issue? @githubeto

@jessegoodier
The cluster has Istio installed, but neither an AuthorizationPolicy nor a NetworkPolicy is applied.
There are also no other resources controlling inter-pod communication.
Shouldn't there be more detailed logs when the connection fails? Is it a log-level issue? I would expect it to appear in the logs.
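
(Since Istio is in play, one hedged way to confirm whether the kubecost and Prometheus pods are actually receiving Istio sidecars, using the namespace from this thread, is to list each pod's containers and look for istio-proxy:

kubectl get pods -n kubecost -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].name}{"\n"}{end}'

A pod that lists istio-proxy among its containers is sidecar-injected, which can change how plain HTTP traffic between pods behaves.)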

@jessegoodier

> > Do you have network policies that prevent communication between pods? Also, is anything else running in this cluster that has a networking issue? @githubeto
>
> @jessegoodier The cluster has Istio installed, but neither an AuthorizationPolicy nor a NetworkPolicy is applied. There are also no other resources controlling inter-pod communication. Shouldn't there be more detailed logs when the connection fails? Is it a log-level issue? I would expect it to appear in the logs.

You can try a curl from the frontend:

kubectl exec -i -t -n kubecost deployments/kubecost-cost-analyzer -c cost-analyzer-frontend -- curl http://kubecost-prometheus-server.kubecost

You should get:
<a href="/graph">Found</a>

You can also try a curl to other pods, perhaps Grafana:

curl http://kubecost-grafana.kubecost
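
(Since the errors come from the cost-model container rather than the frontend, it may also be worth running the same test from that container. This is a sketch and assumes the cost-model image ships a shell and a wget or curl binary, which may not be the case:

kubectl exec -i -t -n kubecost deployments/kubecost-cost-analyzer -c cost-model -- sh -c 'wget -qO- http://kubecost-prometheus-server.kubecost || curl -s http://kubecost-prometheus-server.kubecost')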

@jessegoodier

Because Kubecost does not block traffic, I would not expect any logs other than the communication failures you are seeing.

Do you have another cluster to test on to rule out other issues?

@githubeto (Author)

> > Do you have network policies that prevent communication between pods? Also, is anything else running in this cluster that has a networking issue? @githubeto
>
> > @jessegoodier The cluster has Istio installed, but neither an AuthorizationPolicy nor a NetworkPolicy is applied. There are also no other resources controlling inter-pod communication. Shouldn't there be more detailed logs when the connection fails? Is it a log-level issue? I would expect it to appear in the logs.
>
> You can try a curl from the frontend:
>
> kubectl exec -i -t -n kubecost deployments/kubecost-cost-analyzer -c cost-analyzer-frontend -- curl http://kubecost-prometheus-server.kubecost
>
> You should get: <a href="/graph">Found</a>
>
> You can also try a curl to other pods, perhaps Grafana:
>
> curl http://kubecost-grafana.kubecost

@jessegoodier
The curl to kubecost-prometheus-server returned the correct "Found" response.
As you can see from the Helm values, Grafana is not enabled, so I have not checked it.

We have no clusters without Istio, so that is difficult to verify.

@jessegoodier

We do not have other reports of this.

I don't have any other ideas here. It is very strange that the test command works but the cost-model container cannot communicate.

@githubeto (Author)

@jessegoodier @AjayTripathy

While closely monitoring the browser's network requests, I found an interesting log entry.
Could this error be the reason the Full Diagnostics screen cannot be displayed?

The error is a 403 (rate limit) response when accessing
https://api.github.com/repositories/178079595/releases or
https://api.github.com/repos/kubecost/cost-model/releases.

The response from /repositories/178079595/releases:

{
    "message": "API rate limit exceeded for xx.xx.xx.xx. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)",
    "documentation_url": "https://docs.github.com/rest/overview/resources-in-the-rest-api#rate-limiting"
}
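
(For what it's worth, the current unauthenticated rate-limit status for the requesting IP can be checked against GitHub's public endpoint:

curl -s https://api.github.com/rate_limit

The unauthenticated limit is applied per source IP, so a NAT gateway shared by many clients can exhaust it quickly.)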

@chipzoller (Collaborator)

Hello, in an effort to consolidate our bug and feature request tracking, we are deprecating using GitHub to track tickets. If this issue is still outstanding and you have not done so already, please raise a request at https://support.kubecost.com/.
