
Bug 1249359 store correlations in es #3251

Merged

Conversation

@peterbe (Contributor) commented Mar 25, 2016

The way to run this is like this:

socorro correlations \
--resource.postgresql.database_port=5433 \
--resource.postgresql.database_name=breakpad \
--secrets.postgresql.database_username=breakpad_rw \
--secrets.postgresql.database_password=*********** \
--resource.boto.access_key="**********************" \
--secrets.boto.secret_access_key="*********************" \
--resource.boto.resource_class=socorro.external.boto.connection_context.RegionalS3ConnectionContext \
--resource.boto.bucket_name="org.allizom.crash-stats.rhelmer-test.crashes" \
--resource.postgresql.transaction_executor_class=socorro.database.transaction_executor.TransactionExecutor \
--global.correlations.core.output_class=socorro.external.es.correlations.CoreCounts \
--global.correlations.interesting.output_class=socorro.external.es.correlations.InterestingModules \
--new_crash_source.elasticsearch.elasticsearch_urls=localhost:9222 \
--source.crashstorage_class=socorro.analysis.correlations.correlations_app.LocallyCachedBotoS3CrashStorage \
--producer_consumer.number_of_threads=5 \
--new_crash_source.cap=2000 \
"$@"

That'll start 5 threads that jointly download 2000 processed crashes, then generate the summaries and insert them into a local ES.

Note: I have my ssh tunnel set up so that localhost:9222 is the stage ES. That's where I get a day's worth of UUIDs from.

@peterbe (Contributor Author) commented Mar 25, 2016

@adngdb I know the test is failing but I can't figure out what the error means. It works locally. Does it make any sense to you?

@peterbe closed this on Mar 29, 2016
@peterbe reopened this on Mar 29, 2016
--source.crashstorage_class=socorro.analysis.correlations.correlations_app.LocallyCachedBotoS3CrashStorage

in your call to `socorro correlations ...`
"""
adngdb (Contributor) commented:

Wow, that is awesome!

@peterbe (Contributor Author) commented Mar 29, 2016

@adngdb See the extra new commit.
It solves that problem I emailed you about. It used to work like this:

1) start scan()
2) yield crash_id1
3) download crash_id1 from S3
4) yield crash_id2
5) download crash_id2 from S3
6) yield crash_id3
7) download crash_id3 from S3
...
999) close the new_crashes() iterator and calculate the summaries.

That meant the scroll scan had to stay open for a crazy long time. Instead, now it does this:

1) start scan()
2) list.append(crash_id1)
3) list.append(crash_id2)
4) list.append(crash_id3)
5) close the scroll scan()
6) yield crash_id1
7) yield crash_id2
8) yield crash_id3
9) download crash_id1 from S3
10) download crash_id2 from S3
11) download crash_id3 from S3
...
999) close the new_crashes() iterator and calculate the summaries.

Now the scroll scan can close as soon as all the crash IDs have been extracted, and we can return to the FTS machinery.
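
For illustration, here is a minimal Python sketch of that refactor (es_scan is a placeholder for the Elasticsearch scroll helper, not Socorro's actual API): drain the scroll into a list first, so the scroll context is held open only while collecting IDs, then yield from the in-memory buffer.

def new_crashes(es_scan):
    # Drain the scroll scan up front; the scroll context stays open
    # only for as long as it takes to collect the crash IDs.
    crash_ids = [hit['_id'] for hit in es_scan()]
    # The scroll is closed now; yield the buffered IDs so the
    # consumer can do the (slow) S3 downloads afterwards.
    for crash_id in crash_ids:
        yield crash_id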

@peterbe (Contributor Author) commented Mar 29, 2016

@adngdb I just pushed another little change.
You know that class you can use to avoid ever re-downloading the same thing from S3 twice? Well, I used to write those files to local disk, gzip-compressed. It turns out the CPU time this costs is enormous. I did some measurements of how long it takes to read and write these .json.gz files:

** READS **
TIMES 1789
AVERAGE 0.106031637786
MEDIAN 0.0528
** WRITES **
TIMES 59
AVERAGE 3.71772711864
MEDIAN 1.9134

I.e. it takes, on average, 3.7 SECONDS! to write one of these crashes into a .json.gz file. The upside is that it uses MUCH less space in my /tmp directory:

2017 files, 65MB ==> 32KB per .json.gz file

Instead, I just removed this fancy gzip stuff. Now it takes dramatically less time to write and read:

** READS **
TIMES 1000
AVERAGE 0.0031484
MEDIAN 0.0002
** WRITES **
TIMES 800
AVERAGE 0.012997875
MEDIAN 0.0033

Also, since we always have ujson installed, I used that.

Sure, it uses up more space.

2447 files, 1.8GB ==> 771KB per .json file
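
As a rough Python sketch of that caching idea (the class and the fetch callback are illustrative placeholders, not Socorro's actual LocallyCachedBotoS3CrashStorage), each processed crash gets cached as a plain .json file written with ujson, skipping the gzip step entirely:

import os
import ujson

class LocalJsonCache(object):
    def __init__(self, cache_dir='/tmp/crash-cache'):
        self.cache_dir = cache_dir
        if not os.path.isdir(cache_dir):
            os.makedirs(cache_dir)

    def _path(self, crash_id):
        return os.path.join(self.cache_dir, crash_id + '.json')

    def get(self, crash_id, fetch):
        # Return the cached crash if we have it; otherwise fetch it
        # once (e.g. from S3) and write it out as plain JSON.
        path = self._path(crash_id)
        if os.path.isfile(path):
            with open(path) as f:
                return ujson.load(f)
        crash = fetch(crash_id)
        with open(path, 'w') as f:
            ujson.dump(crash, f)
        return crash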

'date',
'key',
'signature',
'count',
adngdb (Contributor) commented:

I think count should not be in there. It could change when you re-run the job, for example if new crashes came in in the meantime. That can happen if someone triggers the processing of a crash that had been throttled. Or say we had a problem with processing and we want to re-run correlations a bit later, once all crashes have made it into the database.

peterbe (Contributor Author) replied:

You're right. I've changed it.
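
To make that point concrete, here is a hypothetical sketch (the PR's real make_id may differ) of building the ES document ID only from the stable fields, so re-running the job updates the same documents instead of creating duplicates:

import hashlib

def make_id(date, key, signature):
    # 'count' is deliberately excluded: it can change between runs,
    # while date/key/signature identify the same correlation record.
    raw = '%s:%s:%s' % (date, key, signature)
    return hashlib.md5(raw.encode('utf-8')).hexdigest()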

@adngdb (Contributor) commented Mar 30, 2016

It seems to have worked great locally!

r+ when the make_id function is fixed.

@peterbe force-pushed the bug-1249359-store-correlations-in-es branch from edfdf11 to e9fe8e6 on March 30, 2016
@peterbe merged commit fba686b into mozilla-services:master on Mar 30, 2016
@peterbe deleted the bug-1249359-store-correlations-in-es branch on March 31, 2016