This repository has been archived by the owner on Oct 16, 2024. It is now read-only.

[Bug] Aggregator crashing on Kubecost #72

Closed · 2 tasks done
aaj-synth opened this issue Apr 3, 2024 · 27 comments

Labels: bug (Something isn't working), needs-triage (A label added by default to all issues indicating it needs to be curated and triaged internally.)

Comments

@aaj-synth

Kubecost Helm Chart Version

2.2.0

Kubernetes Version

1.29

Kubernetes Platform

EKS

Description

While upgrading Kubecost from v2.1.0 to v2.2.0, the aggregator container in the kubecost-analyzer pod went into CrashLoopBackOff with the error pasted below.

Steps to reproduce

  1. Upgrade the Kubecost Helm chart from v2.1.0 to v2.2.0

Expected behavior

The upgrade was expected to complete successfully, but it failed with the error shown in the Logs section below.

Impact

No response

Screenshots

No response

Logs

ERR error doing initial open of DB: error opening db at path /var/configs/waterfowl/duckdb/v0_9_2/kubecost.duckdb.write: migrating up: no migration found for version 20240306133000: read down for version 20240306133000 migrations: file does not exist
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x1721f75]

goroutine 49 [running]:
database/sql.(*DB).Close(0x0)
	/usr/local/go/src/database/sql/sql.go:910 +0x35
github.com/kubecost/kubecost-cost-model/pkg/duckdb/write.startIngestor(0xc0001ca600, 0xc001485960)
	/app/kubecost-cost-model/pkg/duckdb/write/writer.go:325 +0x28
github.com/kubecost/kubecost-cost-model/pkg/duckdb/write.NewWriter.func5({0x4840ae0?, 0xc000c30060?}, 0xc001569568?)
	/app/kubecost-cost-model/pkg/duckdb/write/writer.go:183 +0x1b
github.com/looplab/fsm.(*FSM).enterStateCallbacks(0xc000c44000, {0x6017be8, 0xc000c34af0}, 0xc00165b7a0)
	/go/pkg/mod/github.com/looplab/fsm@v1.0.1/fsm.go:470 +0x82
github.com/looplab/fsm.(*FSM).Event.(*FSM).Event.func2.func3()
	/go/pkg/mod/github.com/looplab/fsm@v1.0.1/fsm.go:363 +0x150
github.com/looplab/fsm.transitionerStruct.transition(...)
	/go/pkg/mod/github.com/looplab/fsm@v1.0.1/fsm.go:422
github.com/looplab/fsm.(*FSM).doTransition(...)
	/go/pkg/mod/github.com/looplab/fsm@v1.0.1/fsm.go:407
github.com/looplab/fsm.(*FSM).Event(0xc000c44000, {0x60177f0, 0x861da40}, {0x4eac1e4, 0xd}, {0x0, 0x0, 0x0})
	/go/pkg/mod/github.com/looplab/fsm@v1.0.1/fsm.go:390 +0x80a
github.com/kubecost/kubecost-cost-model/pkg/duckdb/write.NewWriter(0xc001485960, {0xc0013f54c0, 0x3a}, {0xc0013f55c0, 0x39})
	/app/kubecost-cost-model/pkg/duckdb/write/writer.go:241 +0x725
github.com/kubecost/kubecost-cost-model/pkg/duckdb/orchestrator.createWriter(0xc001485900)
	/app/kubecost-cost-model/pkg/duckdb/orchestrator/orchestrator.go:399 +0x33
github.com/kubecost/kubecost-cost-model/pkg/duckdb/orchestrator.NewOrchestrator.func7({0x4840ae0?, 0xc00155a930?}, 0xc000c42000)
	/app/kubecost-cost-model/pkg/duckdb/orchestrator/orchestrator.go:213 +0x25
github.com/looplab/fsm.(*FSM).enterStateCallbacks(0xc001560500, {0x6017be8, 0xc000c34050}, 0xc000c42000)
	/go/pkg/mod/github.com/looplab/fsm@v1.0.1/fsm.go:470 +0x82
github.com/looplab/fsm.(*FSM).Event.(*FSM).Event.func2.func3()
	/go/pkg/mod/github.com/looplab/fsm@v1.0.1/fsm.go:363 +0x150
github.com/looplab/fsm.transitionerStruct.transition(...)
	/go/pkg/mod/github.com/looplab/fsm@v1.0.1/fsm.go:422
github.com/looplab/fsm.(*FSM).doTransition(...)
	/go/pkg/mod/github.com/looplab/fsm@v1.0.1/fsm.go:407
github.com/looplab/fsm.(*FSM).Event(0xc001560500, {0x60177f0, 0x861da40}, {0x4edebc3, 0x1b}, {0x0, 0x0, 0x0})
	/go/pkg/mod/github.com/looplab/fsm@v1.0.1/fsm.go:390 +0x80a
github.com/kubecost/kubecost-cost-model/pkg/duckdb/orchestrator.NewOrchestrator.func6.1()
	/app/kubecost-cost-model/pkg/duckdb/orchestrator/orchestrator.go:205 +0x3e
created by github.com/kubecost/kubecost-cost-model/pkg/duckdb/orchestrator.NewOrchestrator.func6 in goroutine 1
	/app/kubecost-cost-model/pkg/duckdb/orchestrator/orchestrator.go:204 +0x4e8

Slack discussion

No response

Troubleshooting

  • I have read and followed the issue guidelines and this is a bug impacting only the Helm chart.
  • I have searched other issues in this repository and mine is not recorded.
@aaj-synth added the bug and needs-triage labels on Apr 3, 2024
@cliffcolvin (Member)

@aaj-synth thank you for reporting this issue, I've got an engineer looking at this today.

@passionInfinite commented Apr 4, 2024

Same thing for me as well, but with a different error: "/var/configs/waterfowl/duckdb/kubecost-1712243791.duckdb.read": Permission denied

Full Error:

2024-04-04T15:16:31.847772518Z ERR entering state: create_read_interface_init, err: setting up migrations: opening '/var/configs/waterfowl/duckdb/kubecost-1712243791.duckdb.read': could not open database: IO Error: Cannot open file "/var/configs/waterfowl/duckdb/kubecost-1712243791.duckdb.read": Permission denied
2024-04-04T15:16:31.847814718Z ERR after event, current state: create_read_interface_init, err: setting up migrations: opening '/var/configs/waterfowl/duckdb/kubecost-1712243791.duckdb.read': could not open database: IO Error: Cannot open file "/var/configs/waterfowl/duckdb/kubecost-1712243791.duckdb.read": Permission denied
2024-04-04T15:16:31.847829319Z ERR error submitting event: setting up migrations: opening '/var/configs/waterfowl/duckdb/kubecost-1712243791.duckdb.read': could not open database: IO Error: Cannot open file "/var/configs/waterfowl/duckdb/kubecost-1712243791.duckdb.read": Permission denied
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x2f72575]

goroutine 52 [running]:
github.com/kubecost/kubecost-cost-model/pkg/duckdb/orchestrator.(*DuckDBProvider).NewSelect(0xc000bf63c0?)
	/app/kubecost-cost-model/pkg/duckdb/orchestrator/duckdbprovider.go:63 +0x35
github.com/kubecost/kubecost-cost-model/pkg/duckdb/allocation/db.(*AllocationDBQueryService).buildAbandonedWorkloadsCTE(0xc000f13ab8, {0x2, 0x1f4, 0x0, {0x0, 0x0}, 0x0, 0x0, 0xc00111ad80})
	/app/kubecost-cost-model/pkg/duckdb/allocation/db/abandonedworkloads.go:212 +0x386
github.com/kubecost/kubecost-cost-model/pkg/duckdb/allocation/db.(*AllocationDBQueryService).QueryAbandonedWorkloadsTopLine(0xc000f13ab8, {0x2, 0x1f4, 0x0, {0x0, 0x0}, 0x0, 0x0, 0x0})
	/app/kubecost-cost-model/pkg/duckdb/allocation/db/abandonedworkloads.go:29 +0x185
github.com/kubecost/kubecost-cost-model/pkg/duckdb/allocation.(*DuckDBAllocationQueryService).GetAllAbandonedWorkloadsTopLine(0xc0000d6d48?, {0x2, 0x1f4, 0x0, {0x0, 0x0}, 0x0, 0x0, 0x0})
	/app/kubecost-cost-model/pkg/duckdb/allocation/abandonedworkloads.go:11 +0x76
github.com/kubecost/kubecost-cost-model/pkg/duckdb/savings.(*DuckDBSavingsQueryService).FindAbandonedWorkloadsTopLine(0x0?, {0x2, 0x1f4, 0x0, {0x0, 0x0}, 0x0, 0x0, 0x0})
	/app/kubecost-cost-model/pkg/duckdb/savings/queryservice.go:191 +0x77
github.com/kubecost/kubecost-cost-model/pkg/duckdb/savings.(*DuckDBSavingsQueryService).summarizeAbandonedWorkloads(0x0?, 0x0?)
	/app/kubecost-cost-model/pkg/duckdb/savings/queryservice.go:237 +0x99
github.com/kubecost/kubecost-cost-model/pkg/duckdb/savings.(*DuckDBSavingsQueryService).refreshSummaryCache.func1()
	/app/kubecost-cost-model/pkg/duckdb/savings/queryservice.go:66 +0x1b
github.com/kubecost/kubecost-cost-model/pkg/duckdb/savings.(*DuckDBSavingsQueryService).refreshIndividualMetric(0xc00120ab40, {0xc0006a8220, 0x1e}, 0xc0000d6fa0)
	/app/kubecost-cost-model/pkg/duckdb/savings/queryservice.go:104 +0x86

@AjayTripathy

@passionInfinite @aaj-synth curious if a downgrade to 2.1 resolves the issue?

@aaj-synth (Author)

Downgrading from 2.2 to 2.1 does not help. I had to go back to 1.108 to have things working again.

@passionInfinite

Downgrading to v2.1.0 produces this error:

2024/04/04 16:44:37 maxprocs: Updating GOMAXPROCS=10: determined from CPU quota
2024-04-04T16:44:37.889918371Z ??? Log level set to info
2024-04-04T16:44:37.889949771Z INF tracing disabled
2024-04-04T16:44:37.890124073Z ERR AllocationReportFileStore: error creating file store: open /var/configs/reports.json: permission denied
2024-04-04T16:44:37.890294475Z ERR creating file store: open /var/configs/asset-reports.json: permission denied
2024-04-04T16:44:37.890402276Z ERR AdvancedReportFileStore: error creating file store: open /var/configs/advanced-reports.json: permission denied
2024-04-04T16:44:37.890489277Z ERR CloudCostFileStore: error creating file store: open /var/configs/cloud-cost-reports.json: permission denied
2024-04-04T16:44:37.890535778Z ERR RecurringBudgetRuleFileStore: error writing file store: open /var/configs/recurring-budget-rules.json: permission denied
2024-04-04T16:44:37.890594378Z ERR BudgetFileStore: error writing file store: open /var/configs/budgets.json: permission denied
2024-04-04T16:44:37.890650979Z ERR Team.FileStore: error creating file store: open /var/configs/teams.json: permission denied
2024-04-04T16:44:37.890681179Z ERR User.FileStore: error creating file store: open /var/configs/users.json: permission denied
2024-04-04T16:44:37.89071078Z ERR Auth.ServiceAccountFileStore: error creating file store: open /var/configs/serviceAccounts.json: permission denied
2024-04-04T16:44:37.890794881Z ERR entering state: create_read_interface_init, err: error making directory /var/configs/waterfowl/duckdb/v0_9_2: mkdir /var/configs/wat
2024-04-04T16:44:37.890808981Z ERR after event, current state: create_read_interface_init, err: error making directory /var/configs/waterfowl/duckdb/v0_9_2: mkdir /var
2024-04-04T16:44:37.890818581Z ERR error submitting event: error making directory /var/configs/waterfowl/duckdb/v0_9_2: mkdir /var/configs/waterfowl/duckdb/v0_9_2: per
2024-04-04T16:44:37.890951182Z ERR error initializing file store: failed to wrtie to file: open /var/configs/collections.json: permission denied
2024-04-04T16:44:37.891031283Z ERR Failed to write trial status: open /var/configs/trialuser.kc: permission denied
Error: initializing: failed to start enterprise trial: FailedToWriteTrialStatus

@passionInfinite

Downgrading from 2.2 to 2.1 does not help. I had to go back to 1.108 to have things working again.

@aaj-synth For me it is still failing on 1.108.1. Is there something else you did that got it working? It looks like the files on the PV are corrupted 🤔

@AjayTripathy

Hi @passionInfinite, this looks like a separate issue with permissions on the PV. Can you open a ticket with support?

@aaj-synth (Author)

I upgraded from v1.108.0 to v2.1.0 and it worked fine. In the meantime I saw the blog post about v2.2.0 being released, and as soon as I upgraded to that, things stopped working. I tried downgrading to v2.1.0, but that ran into the same error I mentioned in the issue. I eventually downgraded to v1.108.0 and just removed the upgrade.toV2 flag from the Helm chart, and it worked for me.

@rahul-chr commented Apr 5, 2024

I can confirm I am facing this too on Kubecost 2.1; I did an upgrade from 1.103.5 to 2.1.1.

ERR error doing initial open of DB: error opening db at path /var/configs/waterfowl/duckdb/v0_9_2/kubecost.duckdb.write: migrating up: Dirty database version 20230712171354. Fix and force version. panic: runtime error: invalid memory address or nil pointer dereference

@michaelmdresser

@rahul-chr did you upgrade directly from v1.103.5 to v2.1.1? No other upgrades/downgrades along the way before seeing that error?

@michaelmdresser

@aaj-synth Downgrades can sometimes be tricky when going between particular versions of v2.x. We're working on making this not a problem. In the meantime, if you'd like to get back to v2.1 or try v2.2 again, please remove the /var/configs/waterfowl folder from your kubecost-cost-analyzer PVC before upgrading to your desired version. I have reason to believe the DB file got into a bad state and needs manual intervention. This does not cause data loss.

The command you would run is this, assuming Kubecost is installed in the kubecost namespace:
kubectl exec -it -n kubecost $(kubectl get pod -n kubecost -l app=cost-analyzer -o jsonpath='{.items[0].metadata.name}') -- rm -r /var/configs/waterfowl
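
If you want to confirm the directory exists before removing it, a read-only check with the same namespace and label assumptions is:

kubectl exec -it -n kubecost $(kubectl get pod -n kubecost -l app=cost-analyzer -o jsonpath='{.items[0].metadata.name}') -- ls -la /var/configs/waterfowl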


@rahul-chr I'm not confident that your problem is the same as @aaj-synth's problem. If you're willing to experiment, trying the same command above might help you, but it also might not.

@michaelmdresser

Also, @aaj-synth and @rahul-chr do Kubecost's PV(C)s have enough space on them? Are any of them filling up or full?
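
A quick way to check, reusing the namespace and label assumptions from the earlier command (and assuming the image ships df, as it does bash and rm):

kubectl exec -n kubecost $(kubectl get pod -n kubecost -l app=cost-analyzer -o jsonpath='{.items[0].metadata.name}') -- df -h /var/configs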

@passionInfinite commented Apr 5, 2024

@michaelmdresser In my case, I found out that the permissions on the PV-mounted folder got changed to root for some reason, but the newer version uses an fsGroup of 1001, hence the permission denied errors?

@michaelmdresser By any chance, does etlUtils run as root? 🤔
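
For reference, a Pod-level securityContext of roughly this shape is what makes Kubernetes chown mounted volumes to group 1001 (an illustrative sketch; the chart's actual values may differ):

securityContext:
  runAsUser: 1001
  runAsGroup: 1001
  fsGroup: 1001   # volumes are recursively chowned to this group at mount time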

@passionInfinite

@michaelmdresser I attached the volume to another test pod and checked the permissions of /var/configs: it was owned by root instead of 1001. I think how we upgrade matters here. Upgrading directly is not going to work, because some versions include the securityContext change that switches ownership from root to 1001. From my experience, the upgrade path that works is (see the sketch after this list):

v1.106.5 (current version) -> v1.107.1
v1.107.1 -> v1.108.1 --> This includes the securityContext change to 1001
v1.108.1 -> v2.1.0 --> Initial migration with Kubecost Aggregator using DuckDB
v2.1.0 -> v2.2.0 (target version) --> Migration schema changes
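
A stepwise upgrade along those lines would look roughly like this, assuming the release is named kubecost and is installed from the kubecost/cost-analyzer chart in the kubecost namespace (adjust names and values files to your setup):

helm upgrade kubecost kubecost/cost-analyzer -n kubecost --version 1.107.1
helm upgrade kubecost kubecost/cost-analyzer -n kubecost --version 1.108.1
helm upgrade kubecost kubecost/cost-analyzer -n kubecost --version 2.1.0
helm upgrade kubecost kubecost/cost-analyzer -n kubecost --version 2.2.0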

Please correct me if something is wrong in my point of view.

@michaelmdresser

@passionInfinite Thank you for the extra information; please open a separate issue to track the file permission problems you have encountered. We are using this issue to track the original error and related problems: ERR error doing initial open of DB: error opening db at path /var/configs/waterfowl/duckdb/v0_9_2/kubecost.duckdb.write: migrating up: Dirty database version 20230712171354. Fix and force version.

@michaelmdresser

I attempted an upgrade directly from v1.103.5 to v2.1.1 without incident. I suspect this issue is limited to situations where downgrades have occurred.

@rahul-chr commented Apr 8, 2024

@rahul-chr did you upgrade directly from v1.103.5 to v2.1.1? No other upgrades/downgrades along the way before seeing that error?

@michaelmdresser Yes, that was a direct upgrade, no downgrades. And here is the output:

Defaulted container "cost-model" out of: cost-model, cost-analyzer-frontend
rm: cannot remove '/var/configs/waterfowl': No such file or directory
command terminated with exit code 1

@michaelmdresser

Yes, that was a direct upgrade, no downgrades.

Fascinating, we're trying to look further into this.


Defaulted container "cost-model" out of: cost-model, cost-analyzer-frontend
rm: cannot remove '/var/configs/waterfowl': No such file or directory
command terminated with exit code 1

@rahul-chr Are you using Aggregator in a StatefulSet configuration? If so, the command I gave you is slightly wrong, and needs to be modified like so:

kubectl exec -it -n kubecost $(kubectl get pod -n kubecost -l app=aggregator -o jsonpath='{.items[0].metadata.name}') -- rm -r /var/configs/waterfowl

@rahul-chr commented Apr 9, 2024

Are you using Aggregator in a StatefulSet configuration? If so, the command I gave you is slightly wrong, and needs to be modified like so:

kubectl exec -it -n kubecost $(kubectl get pod -n kubecost -l app=aggregator -o jsonpath='{.items[0].metadata.name}') -- rm -r /var/configs/waterfowl

Thank you @michaelmdresser for your response! But it looks like this isn't helping either:

kubectl exec -it -n kubecost $(kubectl get pod -n kubecost -l app=aggregator -o jsonpath='{.items[0].metadata.name}')

error: unable to upgrade connection: container not found ("aggregator")

@michaelmdresser commented Apr 9, 2024

@rahul-chr Ah, shucks. I'm guessing that's because it's crash looping. To exec into the Pod to run the recovery despite the crash loop, we're going to have to do this:

  1. Edit the Aggregator StatefulSet via kubectl edit, e.g. kubectl edit statefulset -n kubecost kubecost-aggregator.

    Add the following right underneath name: aggregator in the Pod spec inside the StatefulSet (see the sketch after these steps for where it lands):

    command:
      - /bin/bash
      - -c
      - |
         sleep 36000;
    

    This will start the Pod in sleep mode without starting the app, meaning it will not crash.

  2. After saving the edits, the kubecost-aggregator-0 Pod should terminate and restart

  3. Check the logs on the kubecost-aggregator-0 Pod; there should be no log output, which is expected because the container is only sleeping.

  4. Run the command I sent earlier: kubectl exec -it -n kubecost $(kubectl get pod -n kubecost -l app=aggregator -o jsonpath='{.items[0].metadata.name}') -- rm -r /var/configs/waterfowl/duckdb

  5. Remove the command: block added in step 1. Aggregator should restart with normal log behavior.
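
For orientation, the edited container section would look roughly like this (surrounding fields are illustrative placeholders; only the command block is the actual edit):

containers:
  - name: aggregator
    image: <existing image, unchanged>
    command:
      - /bin/bash
      - -c
      - |
        sleep 36000;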

I apologize for the trouble here. This is an unusual error situation.

@rahul-chr commented Apr 10, 2024

@michaelmdresser I think there is still an obvious problem:

rm: cannot remove '/var/configs/waterfowl/duckdb': Device or resource busy
command terminated with exit code 1

But I tweaked it further: I removed the volumeMounts below, since they were mounted and used by PVCs. I was then able to delete the directory, and later added them back ;)

    - name: aggregator-db-storage
      mountPath: /var/configs/waterfowl/duckdb
    - name: aggregator-staging
      mountPath: /var/configs/waterfowl

It works now

Also, do you think this is a potential bug with this upgrade?

@michaelmdresser

But I tweaked it further:

Ah, thanks for the reminder about that bit of the volume configuration. Thanks for your patience.

It works now

The command works, great! After removing the sleep, has Aggregator started up normally without the crash behavior?

Also, do you think this is a potential bug with this upgrade?

Is this question about the original part of this GH issue, which is migrating up: no migration found for version 20240306133000: read down for version 20240306133000 migrations: file does not exist? I think so, given that we've seen a few reports of it so far. It's a bit troubling, as I haven't been able to reproduce it yet with anything except a downgrade.

@rahul-chr commented Apr 11, 2024

Is this question about the original part of this GH issue, which is migrating up: no migration found for version 20240306133000: read down for version 20240306133000 migrations: file does not exist?

Nope, this is specific to my issue; do you want me to open a GitHub issue for that? Also, I am a bit afraid: is it safe to do this workaround (removing duckdb) in production?

@michaelmdresser

Nope, this is specific to my issue; do you want me to open a GitHub issue for that?

If you're running into a new bug, please do open a new issue.

Also, I am a bit afraid: is it safe to do this workaround (removing duckdb) in production?

Don't worry! DuckDB files are not a "source of truth" -- Aggregator builds its datastore from what we call "ETL" files, which are stored either in object storage (e.g. S3, GCS) or in a different folder on the PV, depending on your configuration. Removing the /var/configs/waterfowl/duckdb directory will indeed trigger a rebuild, but the ETL data it builds from is unaffected, so once the rebuild completes you will be right back where you should be. No data loss.
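
If you want to watch the rebuild after removing the directory, tailing the Aggregator logs works (assuming the app=aggregator label used in the earlier commands):

kubectl logs -n kubecost -l app=aggregator -f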

@chipzoller transferred this issue from kubecost/cost-analyzer-helm-chart on May 1, 2024
@chipzoller (Collaborator)

Does not appear to be an issue with the Helm chart. Transferred to the correct repository.

@TomHellier

@AjayTripathy - This is marked as completed - do you know what version a fix was released in? thanks :)

@AjayTripathy

2.2.5 -- let me check on what's going on in #103 though.
