
xl: Avoid multi-disks node to exit when one disk fails #12423

Merged
merged 2 commits into minio:master
Jun 5, 2021

Conversation

vadmeste
Member

@vadmeste vadmeste commented Jun 2, 2021

Description

It makes sense for a node which has multiple disks to start even when one
disk fails, for example by returning an i/o error. This commit makes this
fault tolerance available in this specific use case.

Motivation and Context

The cluster should still start when a disk is broken.

How to test this PR?

Contact me.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Optimization (provides speedup with no functional changes)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • Fixes a regression (If yes, please add commit-id or PR # here)
  • Documentation updated
  • Unit tests added/updated

Member

@harshavardhana harshavardhana left a comment


why do we need this? @vadmeste

@vadmeste
Member Author

vadmeste commented Jun 2, 2021

why do we need this? @vadmeste

There is one report about a node which is not starting because of an input/output error, which is not good if the node has multiple disks. So this PR will ignore disks that return errors when trying to upgrade format.json and clean temporary directories.

@harshavardhana
Member

why do we need this? @vadmeste

There is one report about a node which is not starting because of an input/output error, which is not good if the node has multiple disks. So this PR will ignore disks that return errors when trying to upgrade format.json and clean temporary directories.

We already do that @vadmeste in the registerStorage()

@vadmeste
Member Author

vadmeste commented Jun 2, 2021

We already do that @vadmeste in the registerStorage()

Well yes, but that was not enough; there are some Go os calls that try to upgrade format.json and clean tmp directories before newXLStorage() is called. Those functions are formatErasureMigrateLocalEndpoints(endpoints) and formatErasureCleanupTmpLocalEndpoints(endpoints).

By the way, maybe it makes sense to move formatErasureMigrateLocalEndpoints and formatErasureCleanupTmpLocalEndpoints inside newXLStorage(); that looks like an easier fix.
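
As a rough illustration of what is being discussed here, a minimal sketch in Go (using hypothetical helpers migrateFormatJSON and cleanupTmpDir, not MinIO's actual functions) of startup logic that logs and skips a faulty drive during the format.json upgrade and tmp cleanup instead of exiting the node:

```go
package main

import (
	"errors"
	"log"
	"os"
	"path/filepath"
)

// migrateFormatJSON stands in for the real format.json upgrade step.
func migrateFormatJSON(diskPath string) error {
	_, err := os.Stat(filepath.Join(diskPath, ".minio.sys", "format.json"))
	return err
}

// cleanupTmpDir stands in for the real tmp directory cleanup step.
func cleanupTmpDir(diskPath string) error {
	return os.RemoveAll(filepath.Join(diskPath, ".minio.sys", "tmp"))
}

// prepareLocalDisks logs and skips faulty drives instead of exiting the node.
func prepareLocalDisks(diskPaths []string) {
	for _, diskPath := range diskPaths {
		if err := migrateFormatJSON(diskPath); err != nil && !errors.Is(err, os.ErrNotExist) {
			// A drive returning e.g. an I/O error no longer stops startup;
			// the remaining healthy drives are still brought online.
			log.Printf("skipping drive %s: %v", diskPath, err)
			continue
		}
		if err := cleanupTmpDir(diskPath); err != nil {
			log.Printf("tmp cleanup failed on %s: %v", diskPath, err)
		}
	}
}

func main() {
	prepareLocalDisks([]string{"/mnt/disk1", "/mnt/disk2", "/mnt/disk3", "/mnt/disk4"})
}
```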

@harshavardhana
Member

Well yes, but that was not enough; there are some Go os calls that try to upgrade format.json and clean tmp directories before newXLStorage() is called. Those functions are formatErasureMigrateLocalEndpoints(endpoints) and formatErasureCleanupTmpLocalEndpoints(endpoints).

By the way, maybe it makes sense to move formatErasureMigrateLocalEndpoints and formatErasureCleanupTmpLocalEndpoints inside newXLStorage(); that looks like an easier fix.

Yes, that can be done instead @vadmeste; adding another variable into Endpoints looks unclean.

@vadmeste vadmeste changed the title from "xl: Ignore initializing disks on problematic endpoints" to "xl: Avoid multi-disks node to exit when one disk fails" on Jun 2, 2021
@harshavardhana
Member

PTAL at the conflicts @vadmeste

@vadmeste
Member Author

vadmeste commented Jun 4, 2021

PTAL at the conflicts @vadmeste

Fixed

cmd/prepare-storage.go (outdated)
for _, diskPath := range diskPaths {
    j := &tierJournal{
        diskPath: diskPath,
    }
Contributor


the tier deletion journal is intentionally done centrally, as it is overkill to have it run on all disk paths. @krisis PTAL

Member


@vadmeste selecting the first available local disk on a server works to an extent. It's definitely an improvement. Additionally, we need to act on journals that may be present on other local disks, since a different set of disks may have been available locally during previous server restarts.

E.g.:

  1. On the first startup, when all disks are online, this approach would select the first local disk on each server.
  2. Imagine the first disk on the first server goes temporarily offline. On restart, we would select the second disk on this server.
  3. Now the first disk is back online. On a subsequent restart, we would pick this disk.

At this point we would have a journal on the second disk which would remain unattended. We need to address this too.
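
For illustration, a minimal sketch in Go (with a hypothetical journalPath layout, not MinIO's actual tier journal format) of the follow-up idea: scan every local disk for a leftover journal on startup so entries written while a different disk was the "first available" one are not left unattended:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// journalPath is where this sketch assumes a per-disk journal file would live.
func journalPath(diskPath string) string {
	return filepath.Join(diskPath, ".minio.sys", "tier-journal.bin")
}

// findPendingJournals checks every local disk, not just the first available one,
// so journals written during an earlier restart are not missed.
func findPendingJournals(diskPaths []string) []string {
	var pending []string
	for _, diskPath := range diskPaths {
		if _, err := os.Stat(journalPath(diskPath)); err == nil {
			pending = append(pending, diskPath)
		}
	}
	return pending
}

func main() {
	disks := []string{"/mnt/disk1", "/mnt/disk2", "/mnt/disk3", "/mnt/disk4"}
	for _, d := range findPendingJournals(disks) {
		fmt.Println("pending tier deletion journal on:", d)
	}
}
```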

Member


@vadmeste we could address the issue of possibly unattended tier deletion journals on other local disks in a separate PR.

Member Author


Yes, but I just thought this could not be any worse than what we have right now. We can revert this and address it more properly in another PR.

It makes sense for a node which has multiple disks to start even when one
disk fails, for example by returning an i/o error. This commit makes this
fault tolerance available in this specific use case.
Contributor

@poornas poornas left a comment


LGTM

@vadmeste
Member Author

vadmeste commented Jun 4, 2021

LGTM

@poornas I reverted the journal change; let me know if you see any issue with it.

@minio-trusted
Contributor

Mint Automation

Test Result
mint-large-bucket.sh ✔️
mint-fs.sh ✔️
mint-gateway-s3.sh ✔️
mint-erasure.sh ✔️
mint-dist-erasure.sh ✔️
mint-zoned.sh ✔️
mint-gateway-nas.sh ✔️
mint-compress-encrypt-dist-erasure.sh more...

12423-734a45b/mint-compress-encrypt-dist-erasure.sh.log:

Running with
SERVER_ENDPOINT:      minio-dev7.minio.io:31704
ACCESS_KEY:           minio
SECRET_KEY:           ***REDACTED***
ENABLE_HTTPS:         0
SERVER_REGION:        us-east-1
MINT_DATA_DIR:        /mint/data
MINT_MODE:            full
ENABLE_VIRTUAL_STYLE: 0

To get logs, run 'docker cp 1e96884854e8:/mint/log /tmp/mint-logs'

(1/15) Running aws-sdk-go tests ... done in 1 seconds
(2/15) Running aws-sdk-java tests ... done in 1 seconds
(3/15) Running aws-sdk-php tests ... done in 43 seconds
(4/15) Running aws-sdk-ruby tests ... done in 4 seconds
(5/15) Running awscli tests ... done in 2 minutes and 14 seconds
(6/15) Running healthcheck tests ... done in 0 seconds
(7/15) Running mc tests ... done in 1 minutes and 13 seconds
(8/15) Running minio-dotnet tests ... done in 45 seconds
(9/15) Running minio-go tests ... done in 1 minutes and 53 seconds
(10/15) Running minio-java tests ... FAILED in 1 minutes and 21 seconds
{
  "name": "minio-java",
  "function": "composeObject()",
  "args": "[single source with offset]",
  "duration": 34,
  "status": "FAIL",
  "error": "error occurred\nErrorResponse(code = InvalidArgument, message = Range specified is not valid for source object, bucketName = minio-java-test-qvduse, objectName = minio-java-test-36en505, resource = /minio-java-test-qvduse/minio-java-test-36en505, requestId = 16857D1C44CA16E4, hostId = f609277c-16b3-439d-8929-4b42f6aee275)\nrequest={method=PUT, url=http://minio-dev7.minio.io:31704/minio-java-test-qvduse/minio-java-test-36en505?uploadId=7e0a443c-4b87-44ee-a64e-eeb478321e4f&partNumber=1, headers=x-amz-copy-source: /minio-java-test-qvduse/minio-java-test-sjd6e6\nx-amz-copy-source-range: bytes=2048-1048575\nx-amz-copy-source-if-match: cb92d17a904ccec2e6e23b8bb66245fb\nHost: minio-dev7.minio.io:31704\nAccept-Encoding: identity\nUser-Agent: MinIO (Linux; amd64) minio-java/8.0.3\nContent-MD5: 1B2M2Y8AsgTpgAmY7PhCfg==\nx-amz-content-sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855\nx-amz-date: 20210604T210641Z\nAuthorization: AWS4-HMAC-SHA256 Credential=*REDACTED*/20210604/us-east-1/s3/aws4_request, SignedHeaders=content-md5;host;x-amz-content-sha256;x-amz-copy-source;x-amz-copy-source-if-match;x-amz-copy-source-range;x-amz-date, Signature=*REDACTED*\n}\nresponse={code=400, headers=Accept-Ranges: bytes\nContent-Length: 388\nContent-Security-Policy: block-all-mixed-content\nContent-Type: application/xml\nServer: MinIO\nVary: Origin\nX-Amz-Request-Id: 16857D1C44CA16E4\nX-Xss-Protection: 1; mode=block\nDate: Fri, 04 Jun 2021 21:06:41 GMT\n}\n >>> [io.minio.MinioClient.execute(MinioClient.java:775), io.minio.MinioClient.uploadPartCopy(MinioClient.java:4804), io.minio.MinioClient.composeObject(MinioClient.java:1431), FunctionalTest.testComposeObject(FunctionalTest.java:2120), FunctionalTest.composeObjectTests(FunctionalTest.java:2145), FunctionalTest.composeObject(FunctionalTest.java:2300), FunctionalTest.runObjectTests(FunctionalTest.java:3758), FunctionalTest.runTests(FunctionalTest.java:3783), FunctionalTest.main(FunctionalTest.java:3927)]"
}
(10/15) Running minio-js tests ... done in 47 seconds
(11/15) Running minio-py tests ... done in 2 minutes and 45 seconds
(12/15) Running s3cmd tests ... done in 17 seconds
(13/15) Running s3select tests ... done in 7 seconds
(14/15) Running security tests ... done in 0 seconds

Executed 14 out of 15 tests successfully.

Deleting image on docker hub
Deleting image locally

@harshavardhana harshavardhana merged commit 810af07 into minio:master Jun 5, 2021