
[Bug] Full diagnostics page cannot be accessed (no change from "Running Diagnostics...") #99

Closed
2 tasks done
githubeto opened this issue May 22, 2024 · 10 comments
Labels: bug, needs-triage

Comments


githubeto commented May 22, 2024

Kubecost Version

2.2.5

Kubernetes Version

1.28

Kubernetes Platform

EKS

Description

As shown in the screenshot, I cannot access the full diagnostics page. Why is that?
We have configured the Athena integration, but because the CUR has not yet arrived in S3, it is still in a waiting state.

Helm values

global:
  grafana:
    enabled: false
    proxy: false
  prometheus:
    enabled: true
ingress:
  enabled: false
kubecostModel:
  etlAssetReconciliationEnabled: false
  etlCloudUsage: false
  extraEnv:
  - name: LOG_LEVEL
    value: warn
  utcOffset: "+09:00"
kubecostProductConfigs:
  athenaBucketName: s3://skystyle-mng-athena-log
  athenaDatabase: athenacurcfn_skystyle_mng_kubecost
  athenaProjectID: "xxxxxxxxxxx"
  athenaRegion: ap-northeast-1
  athenaTable: skystyle_mng_kubecost
  athenaWorkgroup: spdkube-aws-mgr-athena-workgroup
  awsSpotDataBucket: spot-instance-datafeed-subscription
  awsSpotDataRegion: ap-northeast-1
  projectID: "xxxxxxxxxxx"
kubecostToken: xxxxxxxxxxx
networkPolicy:
  enabled: false
persistentVolume:
  dbSize: 32Gi
  enabled: true
  size: 32Gi
pricingCsv:
  enabled: false
priority:
  enabled: false
prometheus:
  server:
    global:
      evaluation_interval: 1m
      external_labels:
        cluster_id: aws-mgr
      scrape_interval: 1m
      scrape_timeout: 60s
reporting:
  productAnalytics: false
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::xxxxxxxxxxx:role/aws-mgr-kubecost-role
  create: true
  name: kubecost
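
(For reference, values like these would typically be applied with something like the following; the release name, namespace, and chart reference are assumptions, not taken from this issue:)

helm upgrade --install kubecost kubecost/cost-analyzer \
  --namespace kubecost --create-namespace \
  -f values.yaml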

kubectl logs -f deployment.apps/kubecost-cost-analyzer -c cost-model | grep ERR

ERR Failed to query prometheus at http://kubecost-prometheus-server.kubecost. Error: query error: 'Post "http://kubecost-prometheus-server.kubecost/api/v1/query?query=up&time=1716355356": dial tcp 172.20.218.103:80: connect: connection refused' fetching query 'up' . Troubleshooting help available at: http://docs.kubecost.com/custom-prom#troubleshoot
ERR Failed to lookup reserved instance data: no reservation data available in Athena
ERR Failed to lookup savings plan data: Error fetching Savings Plan Data: QueryAthenaPaginated: query execution error: no query results available for query 5266696b-f7d7-4570-9dc5-013eb76a8690
ERR Alerts config file failed to load: open /var/configs/alerts/alerts.json: no such file or directory
ERR savings: cluster sizing: failed to get monthly cluster rates: error getting valid asset set in MonthlyNodeClusterRates: failed to query from assets for 2024-05-22 00:00:00 +0000 UTC/2024-05-23 00:00:00 +0000 UTC: boundary error: requested [2024-05-22T00:00:00+0000, 2024-05-23T00:00:00+0000); supported [2024-05-22T02:00:00+0000, 2024-05-22T05:22:51+0000): Store[1h]: store does not have coverage to perform query
ERR Asset ETL: ComputeAssets: clusterManagementQuery: Prometheus communication error: sum_over_time((avg(kubecost_cluster_management_cost{}) by (cluster_id))[60m:1m] offset 322m) * 0.016667: retrying
ERR FA[*types.ContainerStatsSet]: Error building window '{Start:2024-05-21 00:00:00 +0000 UTC End:2024-05-22 00:00:00 +0000 UTC}': building [2024-05-21 00:00:00 +0000 UTC-2024-05-21 00:30:00 +0000 UTC]: querying cpu: Error: Post "http://kubecost-prometheus-server.kubecost/api/v1/query_range?end=2024-05-21T00%3A30%3A00Z&query=irate%28container_cpu_usage_seconds_total%7B%0A++container%21%3D%22%22%2C%0A++container%21%3D%22POD%22%2C%0A++container_name%21%3D%22POD%22%2C%0A%7D%5B5m%5D%29&start=2024-05-21T00%3A00%3A00Z&step=60.000": dial tcp 172.20.218.103:80: connect: connection refused, Body:  Query: irate(container_cpu_usage_seconds_total{
ERR FA[*types.ContainerStatsSet]: Error building window '{Start:2024-05-20 00:00:00 +0000 UTC End:2024-05-21 00:00:00 +0000 UTC}': building [2024-05-20 00:00:00 +0000 UTC-2024-05-20 00:30:00 +0000 UTC]: querying cpu: Error: Post "http://kubecost-prometheus-server.kubecost/api/v1/query_range?end=2024-05-20T00%3A30%3A00Z&query=irate%28container_cpu_usage_seconds_total%7B%0A++container%21%3D%22%22%2C%0A++container%21%3D%22POD%22%2C%0A++container_name%21%3D%22POD%22%2C%0A%7D%5B5m%5D%29&start=2024-05-20T00%3A00%3A00Z&step=60.000": dial tcp 172.20.218.103:80: connect: connection refused, Body:  Query: irate(container_cpu_usage_seconds_total{
ERR Asset ETL: ComputeAssets: clusterManagementQuery: Prometheus communication error: sum_over_time((avg(kubecost_cluster_management_cost{}) by (cluster_id))[60m:1m] offset 322m) * 0.016667: retrying
ERR FA[*types.ContainerStatsSet]: Error building window '{Start:2024-05-19 00:00:00 +0000 UTC End:2024-05-20 00:00:00 +0000 UTC}': building [2024-05-19 00:00:00 +0000 UTC-2024-05-19 00:30:00 +0000 UTC]: querying cpu: Error: Post "http://kubecost-prometheus-server.kubecost/api/v1/query_range?end=2024-05-19T00%3A30%3A00Z&query=irate%28container_cpu_usage_seconds_total%7B%0A++container%21%3D%22%22%2C%0A++container%21%3D%22POD%22%2C%0A++container_name%21%3D%22POD%22%2C%0A%7D%5B5m%5D%29&start=2024-05-19T00%3A00%3A00Z&step=60.000": dial tcp 172.20.218.103:80: connect: connection refused, Body:  Query: irate(container_cpu_usage_seconds_total{
ERR CostModel.ComputeAllocation: failed to build pod map: Prometheus communication error: avg(kube_pod_container_status_running{} != 0) by (pod, namespace, cluster_id)[1h:5m]
ERR ComputeCostData: Request Error: query error: 'Post "http://kubecost-prometheus-server.kubecost/api/v1/query?query=sum%28increase%28kubecost_pod_network_egress_bytes_total%7Binternet%3D%22false%22%2C+sameZone%3D%22false%22%2C+sameRegion%3D%22true%22%2C+%7D%5B2m%5D+%29%29+by+%28namespace%2Cpod_name%2Ccluster_id%29+%2F+1024+%2F+1024+%2F+1024&time=1716355371": dial tcp 172.20.218.103:80: connect: connection refused' fetching query 'sum(increase(kubecost_pod_network_egress_bytes_total{internet="false", sameZone="false", sameRegion="true", }[2m] )) by (namespace,pod_name,cluster_id) / 1024 / 1024 / 1024'
ERR ComputeCostData: Parsing Error: Prometheus communication error: sum(increase(kubecost_pod_network_egress_bytes_total{internet="false", sameZone="false", sameRegion="true", }[2m] )) by (namespace,pod_name,cluster_id) / 1024 / 1024 / 1024
ERR ComputeCostData: Request Error: query error: 'Post "http://kubecost-prometheus-server.kubecost/api/v1/query?query=avg%28%0A%09%09label_replace%28%0A%09%09%09label_replace%28%0A%09%09%09%09label_replace%28%0A%09%09%09%09%09sum_over_time%28container_memory_working_set_bytes%7Bcontainer%21%3D%22%22%2C+container%21%3D%22POD%22%2C+instance%21%3D%22%22%2C+%7D%5B2m%5D+%29%2C+%22node%22%2C+%22%241%22%2C+%22instance%22%2C+%22%28.%2B%29%22%0A%09%09%09%09%29%2C+%22container_name%22%2C+%22%241%22%2C+%22container%22%2C+%22%28.%2B%29%22%0A%09%09%09%29%2C+%22pod_name%22%2C+%22%241%22%2C+%22pod%22%2C+%22%28.%2B%29%22%0A%09%09%29%0A%09%29+by+%28namespace%2C+container_name%2C+pod_name%2C+node%2C+cluster_id%29&time=1716355371": dial tcp 172.20.218.103:80: connect: connection refused' fetching query 'avg(
ERR ComputeCostData: Parsing Error: Prometheus communication error: avg(
ERR ComputeCostData: Request Error: query error: 'Post "http://kubecost-prometheus-server.kubecost/api/v1/query?query=sum%28increase%28kubecost_pod_network_egress_bytes_total%7Binternet%3D%22false%22%2C+sameZone%3D%22false%22%2C+sameRegion%3D%22false%22%2C+%7D%5B2m%5D+%29%29+by+%28namespace%2Cpod_name%2Ccluster_id%29+%2F+1024+%2F+1024+%2F+1024&time=1716355371": dial tcp 172.20.218.103:80: connect: connection refused' fetching query 'sum(increase(kubecost_pod_network_egress_bytes_total{internet="false", sameZone="false", sameRegion="false", }[2m] )) by (namespace,pod_name,cluster_id) / 1024 / 1024 / 1024'
ERR ComputeCostData: Parsing Error: Prometheus communication error: sum(increase(kubecost_pod_network_egress_bytes_total{internet="false", sameZone="false", sameRegion="false", }[2m] )) by (namespace,pod_name,cluster_id) / 1024 / 1024 / 1024
ERR ComputeCostData: Request Error: query error: 'Post "http://kubecost-prometheus-server.kubecost/api/v1/query?query=avg%28%0A%09%09label_replace%28%0A%09%09%09label_replace%28%0A%09%09%09%09label_replace%28%0A%09%09%09%09%09rate%28%0A%09%09%09%09%09%09container_cpu_usage_seconds_total%7Bcontainer%21%3D%22%22%2C+container%21%3D%22POD%22%2C+instance%21%3D%22%22%2C+%7D%5B2m%5D+%0A%09%09%09%09%09%29%2C+%22node%22%2C+%22%241%22%2C+%22instance%22%2C+%22%28.%2B%29%22%0A%09%09%09%09%29%2C+%22container_name%22%2C+%22%241%22%2C+%22container%22%2C+%22%28.%2B%29%22%0A%09%09%09%29%2C+%22pod_name%22%2C+%22%241%22%2C+%22pod%22%2C+%22%28.%2B%29%22%0A%09%09%29%0A%09%29+by+%28namespace%2C+container_name%2C+pod_name%2C+node%2C+cluster_id%29&time=1716355371": dial tcp 172.20.218.103:80: connect: connection refused' fetching query 'avg(
ERR ComputeCostData: Parsing Error: Prometheus communication error: avg(
ERR ComputeCostData: Request Error: query error: 'Post "http://kubecost-prometheus-server.kubecost/api/v1/query?query=sum%28increase%28kubecost_pod_network_egress_bytes_total%7Binternet%3D%22true%22%2C+%7D%5B2m%5D+%29%29+by+%28namespace%2Cpod_name%2Ccluster_id%29+%2F+1024+%2F+1024+%2F+1024&time=1716355371": dial tcp 172.20.218.103:80: connect: connection refused' fetching query 'sum(increase(kubecost_pod_network_egress_bytes_total{internet="true", }[2m] )) by (namespace,pod_name,cluster_id) / 1024 / 1024 / 1024'
ERR ComputeCostData: Parsing Error: Prometheus communication error: sum(increase(kubecost_pod_network_egress_bytes_total{internet="true", }[2m] )) by (namespace,pod_name,cluster_id) / 1024 / 1024 / 1024
ERR CostModel.ComputeAllocation: query context error Errors:
ERR CostModel.ComputeAllocation: query context error Errors:
ERR CostModel.ComputeAllocation: query context error Errors:
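
(All of the failures above are "connection refused" against the ClusterIP of kubecost-prometheus-server on port 80. A hedged first check, assuming the default Service name and namespace from this install, is whether that Service actually has ready endpoints behind it:

kubectl get svc kubecost-prometheus-server -n kubecost
kubectl get endpoints kubecost-prometheus-server -n kubecost

An empty ENDPOINTS column would mean no ready Prometheus pod is backing the Service, even if a pod shows as Running.)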

Steps to reproduce

  1. helm install
  2. Spot datafeed setup
  3. AWS Cloud Billing Integration

Expected behavior

The diagnostics page can be accessed.

Impact

No response

Screenshots

  • "View Full Diagnostics" cannot be accessed (see attached screenshot).

Logs

No response

Slack discussion

No response

Troubleshooting

  • I have read and followed the issue guidelines and this is a bug impacting only the Kubecost application.
  • I have searched other issues in this repository and mine is not recorded.
githubeto added the bug and needs-triage labels on May 22, 2024
@dwbrown2 (Collaborator)

@jessegoodier @AjayTripathy or others will likely be able to provide more detailed troubleshooting recommendations, but it looks like your Prometheus isn't reachable. What's the status of that pod?

@githubeto (Author)

> @jessegoodier @AjayTripathy or others will likely be able to provide more detailed troubleshooting recommendations, but it looks like your Prometheus isn't reachable. What's the status of that pod?

The Prometheus pod is running.

@jessegoodier

Do you have network policies that prevent communication between pods?
Also, is anything else running in this cluster that has a networking issue?
@githubeto

@githubeto (Author)

> Do you have network policies that prevent communication between pods? Also, is anything else running in this cluster that has a networking issue? @githubeto

@jessegoodier
The cluster has Istio installed, but neither an AuthorizationPolicy nor a NetworkPolicy is applied.
There are also no other resources controlling inter-pod communication.
Shouldn't there be more detailed logs when the connection fails? Is it a log-level issue? I would expect it to appear in the logs.
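
(Since Istio is in play, one hedged way to confirm whether the kubecost and Prometheus pods are actually receiving Istio sidecars, using the namespace from this thread, is to list each pod's containers and look for istio-proxy:

kubectl get pods -n kubecost -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].name}{"\n"}{end}'

A pod that lists istio-proxy among its containers is sidecar-injected, which can change how plain HTTP traffic between pods behaves.)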

@jessegoodier

> > Do you have network policies that prevent communication between pods? Also, is anything else running in this cluster that has a networking issue? @githubeto
>
> @jessegoodier The cluster has Istio installed, but neither an AuthorizationPolicy nor a NetworkPolicy is applied. There are also no other resources controlling inter-pod communication. Shouldn't there be more detailed logs when the connection fails? Is it a log-level issue? I would expect it to appear in the logs.

You can try a curl from the frontend:

kubectl exec -i -t -n kubecost deployments/kubecost-cost-analyzer -c cost-analyzer-frontend -- curl http://kubecost-prometheus-server.kubecost

You should get:
<a href="/graph">Found</a>

You can also try a curl to other pods, perhaps Grafana:

curl http://kubecost-grafana.kubecost
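
(Since the errors come from the cost-model container rather than the frontend, it may also be worth running the same test from that container. This is a sketch and assumes the cost-model image ships a shell and a wget or curl binary, which may not be the case:

kubectl exec -i -t -n kubecost deployments/kubecost-cost-analyzer -c cost-model -- sh -c 'wget -qO- http://kubecost-prometheus-server.kubecost || curl -s http://kubecost-prometheus-server.kubecost')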

@jessegoodier

Because Kubecost does not block traffic, I would not expect any logs other than the communication failures you are seeing.

Do you have another cluster to test on to rule out other issues?

@githubeto (Author)

> > Do you have network policies that prevent communication between pods? Also, is anything else running in this cluster that has a networking issue? @githubeto
>
> > @jessegoodier The cluster has Istio installed, but neither an AuthorizationPolicy nor a NetworkPolicy is applied. There are also no other resources controlling inter-pod communication. Shouldn't there be more detailed logs when the connection fails? Is it a log-level issue? I would expect it to appear in the logs.
>
> You can try a curl from the frontend:
>
> kubectl exec -i -t -n kubecost deployments/kubecost-cost-analyzer -c cost-analyzer-frontend -- curl http://kubecost-prometheus-server.kubecost
>
> You should get: <a href="/graph">Found</a>
>
> You can also try a curl to other pods, perhaps Grafana:
>
> curl http://kubecost-grafana.kubecost

@jessegoodier
The curl to kubecost-prometheus-server returned the correct "Found" response.
As you can see from the Helm values, Grafana is not enabled, so I have not checked it.

We have no clusters without Istio, so that is difficult to verify.

@jessegoodier

We do not have other reports of this.

I don't have any other ideas here. It is very strange that the test command works but the cost-model container cannot communicate.

@githubeto (Author)

@jessegoodier @AjayTripathy

While closely monitoring the browser's network requests, I found an interesting log entry.
Could this error be the reason the Full Diagnostics screen cannot be displayed?

The error is a 403 (rate limit) response when accessing
https://api.github.com/repositories/178079595/releases or
https://api.github.com/repos/kubecost/cost-model/releases.

The response from /repositories/178079595/releases:

{
    "message": "API rate limit exceeded for xx.xx.xx.xx. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)",
    "documentation_url": "https://docs.github.com/rest/overview/resources-in-the-rest-api#rate-limiting"
}
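
(For what it's worth, the current unauthenticated rate-limit status for the requesting IP can be checked against GitHub's public endpoint:

curl -s https://api.github.com/rate_limit

The unauthenticated limit is applied per source IP, so a NAT gateway shared by many clients can exhaust it quickly.)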

@chipzoller (Collaborator)

Hello, in an effort to consolidate our bug and feature request tracking, we are deprecating using GitHub to track tickets. If this issue is still outstanding and you have not done so already, please raise a request at https://support.kubecost.com/.
