Tempo leaves behind data files for blocks with no metadata #2754

Open
zalegrala opened this issue Aug 2, 2023 · 2 comments
Labels
keepalive Label to exempt Issues / PRs from stale workflow operations type/bug Something isn't working

Comments

@zalegrala
Contributor

Describe the bug

In some circumstances, a data.parquet file is the only object in a block path. The block therefore shows up in the block list, but its metadata is not available. As of #2678, Tempo deletes the tenant index when a tenant is found to have zero blocks. Because these paths still show up, the tenant is never completely deleted, so index failures recur for these tenants and generate additional backend calls that aren't helpful.

I believe the index deletion only uncovered this issue, which was dormant because the index was left in place prior to #2678.

To Reproduce

Running r106, we see this in environments where a tenant index was deleted because no blocks were found. My hunch is that this has something to do with unclean shutdown, but I have no data to back this up.

Expected behavior
Block meta is either reconstructed from the data, or the data is removed.
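
As a rough illustration of the condition described above, the following sketch flags block paths that contain objects but no metadata object. It is not Tempo code: the key layout `<tenant>/<blockID>/<object>`, the metadata object name `meta.json`, and the helper `orphanedBlocks` are all assumptions made for the example.

```go
package main

import (
	"sort"
	"strings"
)

// metaName is an assumed name for the per-block metadata object.
const metaName = "meta.json"

// orphanedBlocks is a hypothetical helper. It takes object keys of the
// assumed form "<tenant>/<blockID>/<object>" and returns the
// "<tenant>/<blockID>" prefixes that hold objects but no metadata object.
func orphanedBlocks(keys []string) []string {
	hasMeta := map[string]bool{}
	for _, k := range keys {
		parts := strings.Split(k, "/")
		if len(parts) != 3 {
			continue // not a block object under the assumed layout
		}
		block := parts[0] + "/" + parts[1]
		if parts[2] == metaName {
			hasMeta[block] = true
		} else if _, seen := hasMeta[block]; !seen {
			hasMeta[block] = false
		}
	}

	var orphans []string
	for block, ok := range hasMeta {
		if !ok {
			orphans = append(orphans, block)
		}
	}
	sort.Strings(orphans) // stable output for logging and tests
	return orphans
}
```

Given a listing in which one block has only data.parquet, `orphanedBlocks` returns just that block's path. Whether such a block should then have its meta rebuilt or its data removed is the open question in this issue.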

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: jsonnet
@joe-elliott joe-elliott added the type/bug Something isn't working and operations labels Aug 2, 2023
@mdisibio
Contributor

mdisibio commented Aug 9, 2023

Adding that this situation will cause repeated errors like the following until all files are cleaned up: msg="failed to write tenant index" tenant=<id> err="storage: object doesn't exist". This is because it sees no more blocks and tries to delete the tenant index, but it's already been deleted. The delete call at https://github.com/grafana/tempo/blob/main/tempodb/backend/raw.go#L94 isn't handling the does not exist error gracefully.

In this situation, the missing object should not be considered a failure or propagated up. It's falsely triggering the TempoTenantIndexFailure alert. I think this should be an easy fix: ignore backend.ErrDoesNotExist on delete.
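
A minimal sketch of the suggested fix, treating "does not exist" on delete as success. Everything here is a stand-in for illustration: `ErrDoesNotExist` is a local sentinel playing the role of backend.ErrDoesNotExist, and the `deleter` interface, `deleteIfExists`, and `memStore` are hypothetical names, not Tempo's actual API.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrDoesNotExist is a local stand-in for backend.ErrDoesNotExist.
var ErrDoesNotExist = errors.New("does not exist")

// deleter is a hypothetical minimal view of a storage backend.
type deleter interface {
	Delete(name string) error
}

// deleteIfExists deletes name and treats "already gone" as success,
// so a repeated delete does not surface as a failure.
func deleteIfExists(d deleter, name string) error {
	err := d.Delete(name)
	if err != nil && !errors.Is(err, ErrDoesNotExist) {
		return fmt.Errorf("deleting %q: %w", name, err)
	}
	return nil
}

// memStore is an in-memory fake backend used only to exercise the sketch.
type memStore map[string]struct{}

// Delete removes name, or returns a wrapped ErrDoesNotExist if it is absent.
func (m memStore) Delete(name string) error {
	if _, ok := m[name]; !ok {
		return fmt.Errorf("storage: %w", ErrDoesNotExist)
	}
	delete(m, name)
	return nil
}
```

With this shape, deleting the same tenant index twice returns nil both times, while any other backend error still propagates. It relies on the backend wrapping or mapping its native not-found error onto the sentinel so that `errors.Is` can match it.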

@github-actions
Contributor

github-actions bot commented Oct 9, 2023

This issue has been automatically marked as stale because it has not had any activity in the past 60 days.
The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity.
Please apply keepalive label to exempt this Issue.

@github-actions github-actions bot added the stale Used for stale issues / PRs label Oct 9, 2023
@github-actions github-actions bot closed this as not planned Oct 25, 2023
@zalegrala zalegrala reopened this Oct 27, 2023
@zalegrala zalegrala added the keepalive Label to exempt Issues / PRs from stale workflow label Oct 27, 2023
@github-actions github-actions bot removed the stale Used for stale issues / PRs label Oct 28, 2023
zalegrala added a commit to zalegrala/tempo that referenced this issue Apr 10, 2024
The tenant index deletion was originally put in as a TCO win, but did not
have the desired effect and surfaced other issues in the system.

Related grafana#2678
Related grafana#2754
Related grafana#2781
Related grafana#2878
Related grafana#3115
Related grafana#3223

Due to the number of issues here, and causing considerable noise on the
pager, perhaps the right thing to do is back out the tenant deletion.

Raising here for discussion.
@zalegrala zalegrala mentioned this issue Apr 10, 2024
3 tasks