
Bug 1249359 store correlations in es #3251

Merged

Conversation

@peterbe (Contributor) commented Mar 25, 2016

The way to run this is like this:

socorro correlations \
--resource.postgresql.database_port=5433 \
--resource.postgresql.database_name=breakpad \
--secrets.postgresql.database_username=breakpad_rw \
--secrets.postgresql.database_password=*********** \
--resource.boto.access_key="**********************" \
--secrets.boto.secret_access_key="*********************" \
--resource.boto.resource_class=socorro.external.boto.connection_context.RegionalS3ConnectionContext \
--resource.boto.bucket_name="org.allizom.crash-stats.rhelmer-test.crashes" \
--resource.postgresql.transaction_executor_class=socorro.database.transaction_executor.TransactionExecutor \
--global.correlations.core.output_class=socorro.external.es.correlations.CoreCounts \
--global.correlations.interesting.output_class=socorro.external.es.correlations.InterestingModules \
--new_crash_source.elasticsearch.elasticsearch_urls=localhost:9222 \
--source.crashstorage_class=socorro.analysis.correlations.correlations_app.LocallyCachedBotoS3CrashStorage \
--producer_consumer.number_of_threads=5 \
--new_crash_source.cap=2000 \
"$@"

That'll start 5 threads that jointly download 2000 processed crashes, then generate the summaries and insert them into a local ES.

Note: I have my ssh tunnel set up so that localhost:9222 is the stage ES. That's where I get a day's worth of UUIDs from.

@peterbe (Contributor Author) commented Mar 25, 2016

@adngdb I know the test is failing but I can't figure out what the error means. It works locally. Does it make any sense to you?

@peterbe closed this on Mar 29, 2016
@peterbe reopened this on Mar 29, 2016
--source.crashstorage_class=socorro.analysis.correlations.correlations_app.LocallyCachedBotoS3CrashStorage

in your call to `socorro correlations ...`
"""
adngdb (Contributor) commented:

Wow, that is awesome!

@peterbe (Contributor Author) commented Mar 29, 2016

@adngdb See the extra new commit.
It solves that problem I emailed you about. It used to work like this:

1) start scan()
2) yield crash_id1
3) download crash_id1 from S3
4) yield crash_id2
5) download crash_id2 from S3
6) yield crash_id3
7) download crash_id3 from S3
...
999) close the new_crashes() iterator and calculate the summaries.

That meant the scroll scan had to stay open for a crazy long time. Instead, now it does this:

1) start scan()
2) list.append(crash_id1)
3) list.append(crash_id2)
4) list.append(crash_id3)
5) close the scroll scan()
6) yield crash_id1
7) yield crash_id2
8) yield crash_id3
9) download crash_id1 from S3
10) download crash_id2 from S3
11) download crash_id3 from S3
...
999) close the new_crashes() iterator and calculate the summaries.

Now the scroll scan can close as soon as all the crash IDs have been extracted, and we can return to the FTS machinery.
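
For illustration, here is a minimal Python sketch of that refactor (es_scan is a placeholder for the Elasticsearch scroll helper, not Socorro's actual API): drain the scroll into a list first, so the scroll context is held open only while collecting IDs, then yield from the in-memory buffer.

def new_crashes(es_scan):
    # Drain the scroll scan up front; the scroll context stays open
    # only for as long as it takes to collect the crash IDs.
    crash_ids = [hit['_id'] for hit in es_scan()]
    # The scroll is closed now; yield the buffered IDs so the
    # consumer can do the (slow) S3 downloads afterwards.
    for crash_id in crash_ids:
        yield crash_id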

@peterbe (Contributor Author) commented Mar 29, 2016

@adngdb I just pushed another little change.
You know that class you can use to avoid ever re-downloading the same thing from S3 twice? Well, I used to write those files to local disk, gzip-compressed. It turns out the CPU time this costs is enormous. I did some measurements of how long it takes to read and write these .json.gz files:

** READS **
TIMES 1789
AVERAGE 0.106031637786
MEDIAN 0.0528
** WRITES **
TIMES 59
AVERAGE 3.71772711864
MEDIAN 1.9134

I.e. it takes, on average, 3.7 SECONDS! to write one of these crashes into a .json.gz file. The upside is that it uses MUCH less space in my /tmp directory:

2017 files, 65MB ==> 32KB per .json.gz file

Instead, I just removed this fancy gzip stuff. Now it takes dramatically less time to write and read:

** READS **
TIMES 1000
AVERAGE 0.0031484
MEDIAN 0.0002
** WRITES **
TIMES 800
AVERAGE 0.012997875
MEDIAN 0.0033

Also, since we always have ujson installed, I used that.

Sure, it uses up more space.

2447 files, 1.8GB ==> 771KB per .json file
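
As a rough Python sketch of that caching idea (the class and the fetch callback are illustrative placeholders, not Socorro's actual LocallyCachedBotoS3CrashStorage), each processed crash gets cached as a plain .json file written with ujson, skipping the gzip step entirely:

import os
import ujson

class LocalJsonCache(object):
    def __init__(self, cache_dir='/tmp/crash-cache'):
        self.cache_dir = cache_dir
        if not os.path.isdir(cache_dir):
            os.makedirs(cache_dir)

    def _path(self, crash_id):
        return os.path.join(self.cache_dir, crash_id + '.json')

    def get(self, crash_id, fetch):
        # Return the cached crash if we have it; otherwise fetch it
        # once (e.g. from S3) and write it out as plain JSON.
        path = self._path(crash_id)
        if os.path.isfile(path):
            with open(path) as f:
                return ujson.load(f)
        crash = fetch(crash_id)
        with open(path, 'w') as f:
            ujson.dump(crash, f)
        return crash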

'date',
'key',
'signature',
'count',
adngdb (Contributor) commented:

I think count should not be in there. It could change when you re-run the job, for example if new crashes came in in the meantime. That can happen if someone triggers the processing of a crash that had been throttled. Or say we had a problem with processing and we want to re-run correlations a bit later, once all crashes have made it into the database.

peterbe (Contributor Author) replied:

You're right. I've changed it.
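
To make that point concrete, here is a hypothetical sketch (the PR's real make_id may differ) of building the ES document ID only from the stable fields, so re-running the job updates the same documents instead of creating duplicates:

import hashlib

def make_id(date, key, signature):
    # 'count' is deliberately excluded: it can change between runs,
    # while date/key/signature identify the same correlation record.
    raw = '%s:%s:%s' % (date, key, signature)
    return hashlib.md5(raw.encode('utf-8')).hexdigest()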

@adngdb (Contributor) commented Mar 30, 2016

It seems to have worked great locally!

r+ when the make_id function is fixed.

@peterbe force-pushed the bug-1249359-store-correlations-in-es branch from edfdf11 to e9fe8e6 on March 30, 2016
@peterbe merged commit fba686b into mozilla-services:master on Mar 30, 2016
@peterbe deleted the bug-1249359-store-correlations-in-es branch on March 31, 2016