Force-pushed from a19d264 to afa2f00
I've run into the dreaded memory-corruption error:

```
server_b_1 | + mc cp minio/server-b/intermediate/external/aggregate/data.ndjson intermediate/external/aggregate/
server_b_1 | `minio/server-b/intermediate/external/aggregate/data.ndjson` -> `intermediate/external/aggregate/data.ndjson`
server_b_1 | Total: 54 B, Transferred: 54 B, Speed: 7.58 KiB/s
server_b_1 | + prio publish --n-data 3 --batch-id test --server-id B --private-key-hex E3AA3CC952C8553E46E699646A9DC3CBA7E3D4C7F0779D58574ABF945E259202 --shared-secret m/AqDal/ZSA9597GwMM+VA== --public-key-hex-internal 01D5D4F179ED233140CF97F79594F0190528268A99A6CDF57EF0E1569E673642 --public-key-hex-external 445C126981113E5684D517826E508F5731A1B35485BACCD63DAA8120DD11DA78 --input-internal intermediate/internal/aggregate/data.ndjson --input-external intermediate/external/aggregate/data.ndjson --output processed/
server_b_1 | Running publish
server_b_1 | corrupted size vs. prev_size
server_b_1 | scripts/server.sh: line 150: 173 Aborted prio publish --n-data $N_DATA --batch-id $BATCH_ID --server-id $SERVER_ID --private-key-hex $PRIVATE_KEY --shared-secret $SHARED_SECRET --public-key-hex-internal $PUBLIC_KEY_INTERNAL --public-key-hex-external $PUBLIC_KEY_EXTERNAL --input-internal intermediate/internal/aggregate/$filename --input-external intermediate/external/aggregate/$filename --output processed/
batched-processing_server_b_1 exited with code 134
server_a_1 | + : 0
server_a_1 | + 0
server_a_1 | scripts/server.sh: line 46: 0: command not found
server_a_1 | + mc stat minio/server-a/intermediate/external/aggregate/data.ndjson
server_a_1 | + mc cp minio/server-a/intermediate/external/aggregate/data.ndjson intermediate/external/aggregate/
server_a_1 | `minio/server-a/intermediate/external/aggregate/data.ndjson` -> `intermediate/external/aggregate/data.ndjson`
server_a_1 | Total: 54 B, Transferred: 54 B, Speed: 8.13 KiB/s
server_a_1 | + prio publish --n-data 3 --batch-id test --server-id A --private-key-hex 19DDC146FB8EE4A0B762A7DAE7E96033F87C9528DBBF8CA899CCD1DB8CD74984 --shared-secret m/AqDal/ZSA9597GwMM+VA== --public-key-hex-internal 445C126981113E5684D517826E508F5731A1B35485BACCD63DAA8120DD11DA78 --public-key-hex-external 01D5D4F179ED233140CF97F79594F0190528268A99A6CDF57EF0E1569E673642 --input-internal intermediate/internal/aggregate/data.ndjson --input-external intermediate/external/aggregate/data.ndjson --output processed/
server_a_1 | Running publish
server_a_1 | corrupted size vs. prev_size
server_a_1 | scripts/server.sh: line 150: 182 Aborted prio publish --n-data $N_DATA --batch-id $BATCH_ID --server-id $SERVER_ID --private-key-hex $PRIVATE_KEY --shared-secret $SHARED_SECRET --public-key-hex-internal $PUBLIC_KEY_INTERNAL --public-key-hex-external $PUBLIC_KEY_EXTERNAL --input-internal intermediate/internal/aggregate/$filename --input-external intermediate/external/aggregate/$filename --output processed/
batched-processing_server_a_1 exited with code 134
```

I can reliably reproduce this issue with the current setup in 76ca805. However, I can copy the files to my host and then run the commands by hand to get the correct results. Full logs: https://gist.github.com/acmiyaguchi/2dea2a2fb95e3e967efddd6cd0dc4b39#file-prio-logs
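Separate from the heap corruption, the trace also shows a shell bug: `scripts/server.sh: line 46: 0: command not found`, preceded by `+ : 0` and `+ 0`. One likely explanation (a guess, since line 46 isn't shown here) is a bare command substitution whose output bash then tries to execute as a command. A minimal reproduction of that trace pattern:

```shell
#!/bin/bash
# Reproduces the "+ : 0" / "0: command not found" trace pattern.
# With `set -x`, a command substitution after `:` is a harmless no-op,
# but a *bare* command substitution makes bash execute its output.
set -x
status() { echo 0; }   # stand-in for whatever prints "0" in server.sh

: $(status)            # traces as `+ : 0` -- no-op, output discarded
$(status) || true      # traces as `+ 0` -> "0: command not found"
```

The `|| true` keeps the demo from aborting; in `server.sh` the stray execution is harmless noise but worth cleaning up.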
I've cleaned up this code significantly; see https://gist.github.com/acmiyaguchi/167dfabb888d0e0ea3eb89b61fea9f1a. Here are the files that have been created:
Force-pushed from 76ca805 to ac3ea4b
I ran it locally and it seems to work! There are some minor things I would consider changing before landing.
I also think we should get this hooked up to CI sooner rather than later.
Force-pushed from ac3ea4b to 516834b
Force-pushed from 1b8dd3a to 2302cd0
I've refactored the main server script so it's easier to reason about. I think the only thing this script is missing is the ability to process multiple partitions at the same time (e.g. using xargs or GNU parallel), but that can be added as necessary. Ready for another look.
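For reference, the multi-partition idea can be sketched with xargs. This is a hypothetical outline, not the script's actual code: `process_partition` stands in for the per-partition body of server.sh, and the demo creates its own directory layout mirroring the paths in the logs above.

```shell
#!/bin/bash
# Hypothetical sketch: process partitions concurrently with xargs.
set -euo pipefail

# demo layout mirroring intermediate/internal/aggregate/ from the logs
cd "$(mktemp -d)"
mkdir -p intermediate/internal/aggregate
touch intermediate/internal/aggregate/data.ndjson

process_partition() {
  local filename=$1
  echo "processing ${filename}"
  # the real body would run:
  # prio publish ... --input-internal "intermediate/internal/aggregate/${filename}" ...
}
export -f process_partition   # make the function visible to child bash processes

# -P 4: up to four partitions in flight; -I{} passes one filename per call
ls intermediate/internal/aggregate/ \
  | xargs -P 4 -I{} bash -c 'process_partition "$1"' _ {}
```

GNU parallel would read similarly (`parallel process_partition ::: ...`) and adds per-job output grouping, which matters once jobs interleave.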
On second thought, I feel like we could get this hooked up to CI without too much trouble. See comments.
Ready for yet another look. Using a custom script instead of relying on docker-compose means the logs from each server are indistinguishable (does a given line come from server A or server B?). On the other hand, it was easy to alter the workflow after figuring out how to watch for exit codes on docker process IDs.
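The prefix-the-logs and wait-on-exit-codes pattern can be sketched with plain background jobs; this is a simplified stand-in for the docker version, with placeholder commands instead of the real server A/B processes:

```shell
#!/bin/bash
# Hypothetical sketch: run each server in the background with a log prefix
# (so server A and B stay distinguishable), then wait on each pid to
# collect its exit code. pipefail makes the pipeline report the server's
# status rather than sed's.
set -o pipefail

run_server() {
  local name=$1; shift
  # prefix every log line with the server name, merge stderr into stdout
  { "$@" 2>&1 | sed "s/^/${name} | /"; } &
}

run_server server_a bash -c 'echo "Running publish"; exit 0'
pid_a=$!
run_server server_b bash -c 'echo "Running publish"; exit 1'
pid_b=$!

wait "$pid_a"; status_a=$?
wait "$pid_b"; status_b=$?
echo "server_a exited with ${status_a}; server_b exited with ${status_b}"
```

With real containers the same shape works via `docker wait <container>`, which blocks and prints the container's exit status.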
A few very small comments, but lgtm. Feel free to merge as soon as you've either fixed or responded to the two comments.
Thanks for the review; having this example in CI is actually pretty nice.