fix bug 1098954 - store crashes in S3 so they can be quickly listed by d... #2484

rhelmer · 2014-11-15T01:53:28Z

...ate and file type later


rhelmer · 2014-11-15T01:57:58Z

r? @twobraids - this is different from but inspired by the crash_id_(to/from)_row_id functions in the hb crashstorage.

Two things I don't like about this:

- the name build_s3_dirs
- maybe we should be stricter when parsing crash_id

There are tests which currently break if I try to, e.g., do int(crash_date) and insist that the crash_id is the expected length. Those seem like cases where we'd want to raise BadCrashIDException.
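For illustration, a minimal sketch of what stricter parsing might look like. It assumes the crash_id is a 36-character hyphenated UUID whose last six characters are a YYMMDD date; the function name `parse_crash_date` and the exception definition are hypothetical and are not taken from this patch.

```python
import datetime
import re


class BadCrashIDException(ValueError):
    """Hypothetical definition; raised when a crash_id cannot be parsed."""


# Assumed shape: hyphenated UUID whose final six characters are digits (YYMMDD).
_CRASH_ID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{6}(\d{2})(\d{2})(\d{2})"
)


def parse_crash_date(crash_id):
    """Return the date embedded in crash_id, or raise BadCrashIDException."""
    match = _CRASH_ID_RE.fullmatch(crash_id)
    if match is None:
        # Wrong length, wrong separators, or a non-numeric date suffix.
        raise BadCrashIDException(crash_id)
    year, month, day = (int(part) for part in match.groups())
    try:
        # Two-digit years are assumed to fall in the 2000s.
        return datetime.date(2000 + year, month, day)
    except ValueError:
        # Digits were present but do not form a real calendar date.
        raise BadCrashIDException(crash_id)
```

Under these assumptions, `parse_crash_date("de1bb258-cbbf-4589-a673-34f800141115")` gives `datetime.date(2014, 11, 15)`, while a short ID or a suffix such as `141340` raises the exception instead of passing through silently.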

twobraids · 2014-11-17T15:13:01Z

One of the problems that Netflix encountered in their use of Socorro involved the crash_id. They had their own UUID for each crash and wanted Socorro to respect it. Their UUIDs did not have a date embedded, and they figured that we could just append our date-specifier onto the end of their UUID. That idea is thwarted by the PG implementation, which uses type UUID in some places to save the crash_id.

Our embedding of the date within crash_id has proven to be really useful, but I think we could handle it better. I was hoping that we wouldn't introduce any new systems that would depend on it.

question: what is the reason that we need to organize in this manner? Why is listing by day important?

rhelmer · 2014-11-17T16:51:55Z

@twobraids well date is important in case we want to do analysis or reprocessing using S3 and no other data sources (using postgres as an index would get around this I suppose). S3 doesn't support globbing for listing so there won't be any quick way to list crashes by date otherwise.
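To make the prefix point concrete, here is a small sketch that uses an in-memory list as a stand-in for a bucket listing. The key layout (`prefix/date/file type/crash_id`) and the helper names are assumptions for illustration, not necessarily what build_s3_dirs produces.

```python
def date_first_key(prefix, date, name_of_thing, crash_id):
    # Hypothetical layout: date first, then file type, then crash_id.
    return "/".join((prefix, date, name_of_thing, crash_id))


def list_by_prefix(keys, key_prefix):
    # Stand-in for an S3 list request: a literal prefix match, no wildcards.
    return sorted(key for key in keys if key.startswith(key_prefix))


crash_ids = [
    "de1bb258-cbbf-4589-a673-34f800141115",
    "0a1b2c3d-4e5f-4a6b-8c7d-9e0f10141116",
]
# The date is taken from the last six characters of each crash_id (YYMMDD).
keys = [
    date_first_key("v1", "20" + crash_id[-6:], "raw_crash", crash_id)
    for crash_id in crash_ids
]

# One prefix selects every raw crash for a single day.
nov_15 = list_by_prefix(keys, "v1/20141115/raw_crash/")
```

If the date only appears at the end of the key (inside the crash_id), no prefix can select a single day, so a by-date listing would have to scan every key.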

twobraids · 2014-11-17T17:23:42Z

@rhelmer since looking up by date is not a primary use case, could we go for an indirect solution? For example, save all the crashes in the current scheme and in a parallel "virtual file system structure" implement name based symbolic links:

{{prefix}}/{{crash_id}}.{{name_of_thing}}
{{prefix}}/{{date}}/{{crash_id}}.{{name_of_thing}}

The second structure actually stores empty files. If we can iterate over {{prefix}}/{{date}}, that gives us a list of {{crash_id}}.{{name_of_thing}} entries that we can look up in the other tree to get the actual data.

when we initially save crashes into this structure, we've got the date because we can introspect the raw_crash to get submitted_timestamp (when saving raw_crash & dumps) or date_processed (when saving the processed_crash).

I think this scheme can work, even if the date is not embedded in the crash_id.
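A rough sketch of the two-tree idea, with a plain dict standing in for the bucket. The function names are hypothetical, and the date is passed in by the caller (for example, taken from submitted_timestamp or date_processed), so nothing here depends on the crash_id embedding it.

```python
def save_thing(store, prefix, date, crash_id, name_of_thing, data):
    """Write the real object, plus an empty marker under the date tree."""
    basename = "{}.{}".format(crash_id, name_of_thing)
    # {{prefix}}/{{crash_id}}.{{name_of_thing}} holds the actual data.
    store["{}/{}".format(prefix, basename)] = data
    # {{prefix}}/{{date}}/{{crash_id}}.{{name_of_thing}} is an empty "symlink".
    store["{}/{}/{}".format(prefix, date, basename)] = b""


def things_for_date(store, prefix, date):
    """Yield (basename, data) for everything saved under the given date."""
    date_prefix = "{}/{}/".format(prefix, date)
    for key in sorted(store):
        if key.startswith(date_prefix):
            basename = key[len(date_prefix):]
            # Follow the marker back to the primary tree for the payload.
            yield basename, store["{}/{}".format(prefix, basename)]


store = {}
save_thing(store, "v1", "20141117", "abc123", "raw_crash", b'{"ProductName": "x"}')
save_thing(store, "v1", "20141116", "def456", "raw_crash", b"{}")
found = list(things_for_date(store, "v1", "20141117"))
```

The cost is one extra write per saved object and a second read per lookup by date; the primary lookup by crash_id is unchanged.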

rhelmer · 2014-11-17T17:27:05Z

@twobraids hm doesn't feel worth the added complexity to me - we can just use Postgres if we need a quick index. I am fine wontfixing this for now, we can always restructure later.

fix bug 1098954 - store crashes in S3 so they can be quickly listed b…

39a78d2

…y date and file type later

rhelmer closed this Nov 17, 2014
