fix bug 1098954 - store crashes in S3 so they can be quickly listed by d... #2484


rhelmer
Contributor

@rhelmer rhelmer commented Nov 15, 2014

...ate and file type later

@rhelmer
Contributor Author

rhelmer commented Nov 15, 2014

r? @twobraids - this is different from but inspired by the crash_id_(to/from)_row_id functions in the hb crashstorage.

Two things I don't like about this:

  1. the name build_s3_dirs
  2. maybe we should be stricter when parsing crash_id

There are tests which currently break if I try to, e.g., do int(crash_date) or insist that the crash_id is the expected length. Those seem like cases where we'd want to raise BadCrashIDException.
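A minimal sketch of what stricter parsing might look like. It is not the code in this PR: the function name `parse_crash_date`, the 36-character UUID-style layout, and the assumption that the last six characters encode the date as YYMMDD are all illustrative, and `BadCrashIDException` is defined locally as a stand-in for the exception named above.

```python
import datetime
import re


class BadCrashIDException(ValueError):
    """Local stand-in for the exception named in the discussion."""


# Assumed layout: a 36-character UUID-style string (hex groups of 8-4-4-4-12).
_CRASH_ID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def parse_crash_date(crash_id):
    """Return the date embedded in crash_id, or raise BadCrashIDException.

    Assumes the last six characters are the date as YYMMDD.
    """
    if not isinstance(crash_id, str) or not _CRASH_ID_RE.fullmatch(crash_id.lower()):
        raise BadCrashIDException("unexpected crash_id shape: %r" % (crash_id,))
    suffix = crash_id[-6:]
    try:
        # strptime rejects non-digits and impossible dates (e.g. month 13).
        return datetime.datetime.strptime(suffix, "%y%m%d").date()
    except ValueError:
        raise BadCrashIDException("no valid date in crash_id: %r" % (crash_id,))


print(parse_crash_date("de1bb258-cbbf-4589-a673-34f800141115"))  # 2014-11-15
```

Being this strict would reject any ID that does not carry a date, so it only makes sense if every caller is expected to supply date-bearing crash_ids.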

@twobraids
Contributor

One of the problems that Netflix ran into in their use of Socorro involved the crash_id. They had their own UUID for each crash and wanted Socorro to respect it. Their UUIDs did not have a date embedded, and they figured that we could just append our date specifier to the end of their UUID. That idea is thwarted by the PG implementation, which uses type UUID in some places to store the crash_id.

Our embedding of the date within crash_id has proven to be really useful, but I think we could handle it better. I was hoping that we wouldn't introduce any new systems that would depend on it.

Question: why do we need to organize things in this manner? Why is listing by day important?

@rhelmer
Contributor Author

rhelmer commented Nov 17, 2014

@twobraids well, date is important in case we want to do analysis or reprocessing using S3 and no other data sources (using Postgres as an index would get around this, I suppose). S3 doesn't support globbing when listing keys, only prefix matching, so otherwise there won't be any quick way to list crashes by date.
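To illustrate the listing constraint, here is a small in-memory sketch. `list_keys` only mimics a prefix-filtered listing and is not an S3 client call; the `v1/` prefix, the key layouts, and the YYMMDD suffix on the crash IDs are assumptions made for the example.

```python
def list_keys(all_keys, prefix):
    """Mimic a prefix-filtered listing: return the keys starting with prefix."""
    return sorted(k for k in all_keys if k.startswith(prefix))


# Flat layout: the date is only visible at the END of each crash_id.
flat_keys = [
    "v1/00000000-0000-4000-8000-000000141115.raw_crash",
    "v1/11111111-1111-4111-8111-111111141116.raw_crash",
    "v1/22222222-2222-4222-8222-222222141115.raw_crash",
]

# Dated layout: the date is part of the key prefix.
dated_keys = [
    "v1/141115/00000000-0000-4000-8000-000000141115.raw_crash",
    "v1/141116/11111111-1111-4111-8111-111111141116.raw_crash",
    "v1/141115/22222222-2222-4222-8222-222222141115.raw_crash",
]

# With the flat layout, every key must be fetched and filtered client-side.
by_scan = [k for k in list_keys(flat_keys, "v1/")
           if k.rsplit(".", 1)[0].endswith("141115")]

# With the dated layout, the prefix alone selects one day's crashes.
by_prefix = list_keys(dated_keys, "v1/141115/")

print(len(by_scan), len(by_prefix))
```

Both approaches find the same two crashes, but only the second avoids walking the whole key space.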

@twobraids
Contributor

@rhelmer since looking up by date is not a primary use case, could we go for an indirect solution? For example, save all the crashes in the current scheme and, in a parallel "virtual file system structure", implement name-based symbolic links:

{{prefix}}/{{crash_id}}.{{name_of_thing}}
{{prefix}}/{{date}}/{{crash_id}}.{{name_of_thing}}

The second structure actually stores empty files. If we can iterate over {{prefix}}/{{date}}, that gives us a list of {{crash_id}}.{{name_of_thing}} entries that we can look up in the other tree to get the actual data.

When we initially save crashes into this structure, we've got the date because we can introspect the raw_crash to get submitted_timestamp (when saving raw_crash & dumps) or date_processed (when saving the processed_crash).

I think this scheme can work, even if the date is not embedded in the crash_id.
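A rough sketch of the two-tree layout described above, using an in-memory dictionary in place of a real bucket. `FakeBucket`, `save_crash`, and `crashes_for_date` are hypothetical names invented for this illustration, and the date string format is an assumption.

```python
class FakeBucket:
    """In-memory stand-in for a key/value object store with prefix listing."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        return self._objects[key]

    def list(self, prefix):
        return sorted(k for k in self._objects if k.startswith(prefix))


def save_crash(bucket, prefix, crash_id, name_of_thing, date, data):
    """Store the data once, plus an empty date-indexed marker pointing at it."""
    leaf = "%s.%s" % (crash_id, name_of_thing)
    bucket.put("%s/%s" % (prefix, leaf), data)              # actual data
    bucket.put("%s/%s/%s" % (prefix, date, leaf), b"")     # empty marker


def crashes_for_date(bucket, prefix, date):
    """Yield (leaf_name, data) for every marker stored under the given date."""
    marker_prefix = "%s/%s/" % (prefix, date)
    for key in bucket.list(marker_prefix):
        leaf = key[len(marker_prefix):]
        yield leaf, bucket.get("%s/%s" % (prefix, leaf))


bucket = FakeBucket()
# The crash_id below deliberately carries no embedded date.
save_crash(bucket, "v1", "external-uuid-1", "raw_crash", "20141117", b"{}")
save_crash(bucket, "v1", "external-uuid-2", "raw_crash", "20141118", b"{}")
print(list(crashes_for_date(bucket, "v1", "20141117")))
```

The date comes from the caller (for example, from submitted_timestamp or date_processed), so nothing here depends on the crash_id format. The cost is one extra write per saved object.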

@rhelmer
Contributor Author

rhelmer commented Nov 17, 2014

@twobraids hm, this doesn't feel worth the added complexity to me; we can just use Postgres if we need a quick index. I am fine wontfixing this for now. We can always restructure later.

@rhelmer rhelmer closed this Nov 17, 2014