This repository has been archived by the owner on Feb 13, 2024. It is now read-only.

Minio file watcher #46

Merged
merged 22 commits into from Jun 4, 2019

acmiyaguchi
Contributor

No description provided.

@acmiyaguchi
Contributor Author

I've run into the dreaded memory corruption errors:

server_b_1   | + mc cp minio/server-b/intermediate/external/aggregate/data.ndjson intermediate/external/aggregate/
server_b_1   | `minio/server-b/intermediate/external/aggregate/data.ndjson` -> `intermediate/external/aggregate/data.ndjson`
server_b_1   | Total: 54 B, Transferred: 54 B, Speed: 7.58 KiB/s
server_b_1   | + prio publish --n-data 3 --batch-id test --server-id B --private-key-hex E3AA3CC952C8553E46E699646A9DC3CBA7E3D4C7F0779D58574ABF945E259202 --shared-secret m/AqDal/ZSA9597GwMM+VA== --public-key-hex-internal 01D5D4F179ED233140CF97F79594F0190528268A99A6CDF57EF0E1569E673642 --public-key-hex-external 445C126981113E5684D517826E508F5731A1B35485BACCD63DAA8120DD11DA78 --input-internal intermediate/internal/aggregate/data.ndjson --input-external intermediate/external/aggregate/data.ndjson --output processed/
server_b_1   | Running publish
server_b_1   | corrupted size vs. prev_size
server_b_1   | scripts/server.sh: line 150:   173 Aborted                 prio publish --n-data $N_DATA --batch-id $BATCH_ID --server-id $SERVER_ID --private-key-hex $PRIVATE_KEY --shared-secret $SHARED_SECRET --public-key-hex-internal $PUBLIC_KEY_INTERNAL --public-key-hex-external $PUBLIC_KEY_EXTERNAL --input-internal intermediate/internal/aggregate/$filename --input-external intermediate/external/aggregate/$filename --output processed/
batched-processing_server_b_1 exited with code 134
server_a_1   | + : 0
server_a_1   | + 0
server_a_1   | scripts/server.sh: line 46: 0: command not found
server_a_1   | + mc stat minio/server-a/intermediate/external/aggregate/data.ndjson
server_a_1   | + mc cp minio/server-a/intermediate/external/aggregate/data.ndjson intermediate/external/aggregate/
server_a_1   | `minio/server-a/intermediate/external/aggregate/data.ndjson` -> `intermediate/external/aggregate/data.ndjson`
server_a_1   | Total: 54 B, Transferred: 54 B, Speed: 8.13 KiB/s
server_a_1   | + prio publish --n-data 3 --batch-id test --server-id A --private-key-hex 19DDC146FB8EE4A0B762A7DAE7E96033F87C9528DBBF8CA899CCD1DB8CD74984 --shared-secret m/AqDal/ZSA9597GwMM+VA== --public-key-hex-internal 445C126981113E5684D517826E508F5731A1B35485BACCD63DAA8120DD11DA78 --public-key-hex-external 01D5D4F179ED233140CF97F79594F0190528268A99A6CDF57EF0E1569E673642 --input-internal intermediate/internal/aggregate/data.ndjson --input-external intermediate/external/aggregate/data.ndjson --output processed/
server_a_1   | Running publish
server_a_1   | corrupted size vs. prev_size
server_a_1   | scripts/server.sh: line 150:   182 Aborted                 prio publish --n-data $N_DATA --batch-id $BATCH_ID --server-id $SERVER_ID --private-key-hex $PRIVATE_KEY --shared-secret $SHARED_SECRET --public-key-hex-internal $PUBLIC_KEY_INTERNAL --public-key-hex-external $PUBLIC_KEY_EXTERNAL --input-internal intermediate/internal/aggregate/$filename --input-external intermediate/external/aggregate/$filename --output processed/
batched-processing_server_a_1 exited with code 134

I can reliably reproduce this issue with the current setup in 76ca805. However, if I copy the files to my host and run the same commands by hand, I get the correct results.

https://gist.github.com/acmiyaguchi/2dea2a2fb95e3e967efddd6cd0dc4b39#file-prio-logs

@acmiyaguchi
Contributor Author

I've cleaned up this code significantly to use _SUCCESS files with an exponential backoff for processing steps. The container can now process multiple partitions of data. I've provided the logs for the two servers below.

https://gist.github.com/acmiyaguchi/167dfabb888d0e0ea3eb89b61fea9f1a

Here are the files that have been created:

$ mc ls minio --recursive
[2019-05-22 16:55:17 PDT]      0B server-a/intermediate/external/aggregate/_SUCCESS
[2019-05-22 16:55:17 PDT]     54B server-a/intermediate/external/aggregate/part-0.ndjson
[2019-05-22 16:55:17 PDT]     54B server-a/intermediate/external/aggregate/part-1.ndjson
[2019-05-22 16:55:17 PDT]     54B server-a/intermediate/external/aggregate/part-2.ndjson
[2019-05-22 16:55:13 PDT]      0B server-a/intermediate/external/verify1/_SUCCESS
[2019-05-22 16:55:13 PDT]    282B server-a/intermediate/external/verify1/part-0.ndjson
[2019-05-22 16:55:13 PDT]    376B server-a/intermediate/external/verify1/part-1.ndjson
[2019-05-22 16:55:13 PDT]    658B server-a/intermediate/external/verify1/part-2.ndjson
[2019-05-22 16:55:16 PDT]      0B server-a/intermediate/external/verify2/_SUCCESS
[2019-05-22 16:55:16 PDT]    234B server-a/intermediate/external/verify2/part-0.ndjson
[2019-05-22 16:55:16 PDT]    312B server-a/intermediate/external/verify2/part-1.ndjson
[2019-05-22 16:55:16 PDT]    546B server-a/intermediate/external/verify2/part-2.ndjson
[2019-05-22 16:55:17 PDT]      0B server-a/processed/_SUCCESS
[2019-05-22 16:55:17 PDT]      9B server-a/processed/part-0.ndjson
[2019-05-22 16:55:17 PDT]      9B server-a/processed/part-1.ndjson
[2019-05-22 16:55:17 PDT]      9B server-a/processed/part-2.ndjson
[2019-05-22 16:55:09 PDT]      0B server-a/raw/_SUCCESS
[2019-05-22 16:55:09 PDT]  1.1KiB server-a/raw/part-0.ndjson
[2019-05-22 16:55:09 PDT]  1.4KiB server-a/raw/part-1.ndjson
[2019-05-22 16:55:09 PDT]  2.5KiB server-a/raw/part-2.ndjson
[2019-05-22 16:55:17 PDT]      0B server-b/intermediate/external/aggregate/_SUCCESS
[2019-05-22 16:55:17 PDT]     54B server-b/intermediate/external/aggregate/part-0.ndjson
[2019-05-22 16:55:17 PDT]     54B server-b/intermediate/external/aggregate/part-1.ndjson
[2019-05-22 16:55:17 PDT]     54B server-b/intermediate/external/aggregate/part-2.ndjson
[2019-05-22 16:55:13 PDT]      0B server-b/intermediate/external/verify1/_SUCCESS
[2019-05-22 16:55:13 PDT]    282B server-b/intermediate/external/verify1/part-0.ndjson
[2019-05-22 16:55:13 PDT]    376B server-b/intermediate/external/verify1/part-1.ndjson
[2019-05-22 16:55:13 PDT]    658B server-b/intermediate/external/verify1/part-2.ndjson
[2019-05-22 16:55:14 PDT]      0B server-b/intermediate/external/verify2/_SUCCESS
[2019-05-22 16:55:14 PDT]    234B server-b/intermediate/external/verify2/part-0.ndjson
[2019-05-22 16:55:14 PDT]    312B server-b/intermediate/external/verify2/part-1.ndjson
[2019-05-22 16:55:14 PDT]    546B server-b/intermediate/external/verify2/part-2.ndjson
[2019-05-22 16:55:19 PDT]      0B server-b/processed/_SUCCESS
[2019-05-22 16:55:19 PDT]      9B server-b/processed/part-0.ndjson
[2019-05-22 16:55:19 PDT]      9B server-b/processed/part-1.ndjson
[2019-05-22 16:55:19 PDT]      9B server-b/processed/part-2.ndjson
[2019-05-22 16:55:09 PDT]      0B server-b/raw/_SUCCESS
[2019-05-22 16:55:09 PDT]    810B server-b/raw/part-0.ndjson
[2019-05-22 16:55:09 PDT]  1.1KiB server-b/raw/part-1.ndjson
[2019-05-22 16:55:09 PDT]  1.8KiB server-b/raw/part-2.ndjson
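
The `_SUCCESS` marker plus exponential backoff described above could look roughly like the sketch below. It is not the code from `scripts/server.sh`: the function name `wait_for_success` and the variables `MAX_RETRIES` and `BASE_DELAY` are hypothetical, and the check command is passed in by the caller so the helper does not assume any particular storage client.

```shell
#!/usr/bin/env bash
# Sketch only: poll until a check command succeeds, doubling the delay
# after every failed attempt. All names here are hypothetical.

MAX_RETRIES="${MAX_RETRIES:-5}"   # give up after this many failed checks
BASE_DELAY="${BASE_DELAY:-1}"     # first delay in whole seconds

# Usage: wait_for_success <check command> [args...]
# Returns 0 as soon as the check command succeeds, 1 after MAX_RETRIES failures.
wait_for_success() {
    local attempt=0
    local delay="$BASE_DELAY"
    until "$@" >/dev/null 2>&1; do
        attempt=$((attempt + 1))
        if [ "$attempt" -ge "$MAX_RETRIES" ]; then
            echo "gave up after $attempt attempts: $*" >&2
            return 1
        fi
        sleep "$delay"
        delay=$((delay * 2))
    done
}
```

A processing step would then block on the upstream marker before copying any partitions, for example `wait_for_success mc stat minio/server-a/raw/_SUCCESS`. Checking for the marker rather than for individual part files means a step never starts on a directory that is still being written.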

@acmiyaguchi acmiyaguchi marked this pull request as ready for review May 23, 2019 00:01
@acmiyaguchi acmiyaguchi requested a review from wlach May 23, 2019 00:02
Contributor

@wlach wlach left a comment

I ran it locally and it seems to work! There are some minor things I would consider changing before landing.

I also think we should get this hooked up to CI sooner rather than later.

examples/batched-processing/Dockerfile (outdated, resolved)
examples/batched-processing/README.md (outdated, resolved)
examples/batched-processing/scripts/server.sh (outdated, resolved)
examples/batched-processing/README.md (outdated, resolved)
examples/batched-processing/README.md (outdated, resolved)
examples/batched-processing/Makefile (outdated, resolved)
@acmiyaguchi
Contributor Author

I've refactored the main server script so it's easier to reason about. I think the only thing this script is missing is the ability to process multiple partitions at the same time (e.g. using xargs or GNU parallel), but this can be added as necessary.
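
For reference, parallel partition processing with xargs might be sketched as follows. This is an illustration under assumptions, not part of the PR: `run_in_parallel` is a hypothetical helper, `echo processing` stands in for the real per-partition step, and `-P` is an extension available in GNU and BSD xargs rather than a POSIX option.

```shell
#!/usr/bin/env bash
# Sketch only: run one command per partition file, several at a time.
# run_in_parallel is a hypothetical helper name.

# Usage: <list of files on stdin> | run_in_parallel <max_jobs> <command> [args...]
# Runs "<command> [args...] <file>" once per input line, with at most
# <max_jobs> invocations running concurrently. xargs exits non-zero if
# any invocation fails. File names are assumed to contain no whitespace.
run_in_parallel() {
    local max_jobs="$1"
    shift
    xargs -n 1 -P "$max_jobs" "$@"
}

# Stand-in demo: "echo processing" replaces the real per-partition command.
printf '%s\n' part-0.ndjson part-1.ndjson part-2.ndjson |
    run_in_parallel 3 echo processing
```

The output order is not deterministic, since the three invocations run concurrently.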

Ready for another look.

@acmiyaguchi acmiyaguchi requested a review from wlach May 29, 2019 22:28
Contributor

@wlach wlach left a comment

On second thought, I feel like we could get this hooked up to CI without too much trouble. See comments.

@acmiyaguchi acmiyaguchi requested a review from wlach May 30, 2019 21:45
@acmiyaguchi
Contributor Author

Ready for yet another look. Using a custom script instead of relying on docker-compose means the logs from each server are indistinguishable (did a given line come from server A or server B?). On the other hand, it was easy to alter the workflow after figuring out how to watch for exit codes on Docker process IDs.
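
The exit-code watching mentioned above can be illustrated with plain background processes. The sketch below is an assumption about the general technique, not the script from this PR: `wait_all` is a hypothetical helper, and the two `sh -c` commands are stand-ins for the backgrounded server containers.

```shell
#!/usr/bin/env bash
# Sketch only: start two background processes, then fail the run if
# either one exits non-zero. wait_all is a hypothetical helper name.

# Usage: wait_all <pid>...
# Waits for every PID and returns 1 if any of them exited non-zero.
wait_all() {
    local rc=0
    local pid
    for pid in "$@"; do
        if ! wait "$pid"; then
            echo "process $pid failed" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Stand-ins for the two server processes started in the background.
sh -c 'exit 0' &
pid_a=$!
sh -c 'exit 1' &
pid_b=$!

if wait_all "$pid_a" "$pid_b"; then
    echo "both servers finished"
else
    echo "at least one server failed"
fi
```

Because `wait` reports the exit status of the specific PID it is given, a CI job built this way can fail as soon as the script learns that either server aborted, such as the exit code 134 seen in the earlier logs.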

.circleci/config.yml (outdated, resolved)
Contributor

@wlach wlach left a comment

A few very small comments, but lgtm. Feel free to merge as soon as you've either fixed or responded to the two comments.

@acmiyaguchi
Contributor Author

Thanks for the review; having this example in CI is actually pretty nice.

@acmiyaguchi acmiyaguchi merged commit e8d4fc0 into mozilla:master Jun 4, 2019
2 participants