Add normal/deep type of heal scanning #7251

vadmeste · 2019-02-18T10:50:37Z

Description

Healing scan used to read all objects parts to check for bitrot
checksum. This commit will add a quicker way of healing scan
by only checking if parts are actually present in disks or not.

Motivation and Context

Fixes #7510

Regression

No, this is a new feature

How Has This Been Tested?

minio server /tmp/xl/{1..4}
mc mb myminio/testbucket/
mc cp file myminio/testbucket/

Test normal heal scan:
4. rm /tmp/xl/1/testbucket/file/
5. mc admin heal -r myminio/testbucket/
6. ls -l /tmp/xl/1/testbucket/file/part.1

Test deep heal scan
7. echo "foo" >> /tmp/xl/1/testbucket/file/part.1
8. mc admin heal -r --scan=deep myminio/testbucket/

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.
I have added unit tests to cover my changes.
I have added/updated functional tests in mint. (If yes, add mint PR # here: )
All new and existing tests passed.

cmd/xl-v1-healing.go

cmd/xl-v1-healing-common.go

harshavardhana · 2019-02-20T19:44:03Z

We should also fix #7199 (comment)

what are the advantages of lot of small part files? One advantage I can see is if there is a bitrot error and we have to heal a part, we will have to heal a smaller part. But ofcourse our current code heals all the parts even if one part has bitrot error (to have the healing code simple) So this advantage also not there.

We should heal only the part that needs to be healed.

vadmeste · 2019-02-20T21:36:04Z

We should also fix #7199 (comment)

what are the advantages of lot of small part files? One advantage I can see is if there is a bitrot error and we have to heal a part, we will have to heal a smaller part. But ofcourse our current code heals all the parts even if one part has bitrot error (to have the healing code simple) So this advantage also not there.

We should heal only the part that needs to be healed.

@harshavardhana can we do this in another PR ? we used to prepare everything under tmp/ directory then a rename atomic, but this looks like it needs to be done differently

harshavardhana · 2019-02-20T22:22:09Z

@harshavardhana can we do this in another PR ? we used to prepare everything under tmp/ directory then a rename atomic, but this looks like it needs to be done differently

Why do we need to do that in other PR @vadmeste - is it harder to do in this? - Since we are bringing in new feature such as deep and normal healing. I thought we would want to address the above situation as well.

krishnasrinivas · 2019-02-26T19:38:35Z

Why do we need to do that in other PR @vadmeste - is it harder to do in this? - Since we are bringing in new feature such as deep and normal healing. I thought we would want to address the above situation as well.

@harshavardhana since we decided not to heal individual parts alone, and heal the entire object, we can take this PR

harshavardhana · 2019-02-28T02:48:38Z

@harshavardhana since we decided not to heal individual parts alone, and heal the entire object, we can take this PR

yeah done.. @vadmeste is this PR ready for review?

vadmeste · 2019-02-28T10:00:30Z

This is ready for review.

@krishnasrinivas I only rebased this commit against master, can you approve again ?

krishnasrinivas · 2019-02-28T20:11:25Z

@vadmeste travis failing

codecov · 2019-02-28T21:16:27Z

Codecov Report

Merging #7251 into master will increase coverage by <.01%.
The diff coverage is 63.63%.

@@            Coverage Diff             @@
##           master    #7251      +/-   ##
==========================================
+ Coverage   48.42%   48.43%   +<.01%     
==========================================
  Files         297      297              
  Lines       46451    46460       +9     
==========================================
+ Hits        22496    22503       +7     
- Misses      21884    21885       +1     
- Partials     2071     2072       +1

Impacted Files	Coverage Δ
cmd/fs-v1.go	`64.05% <ø> (ø)`	⬆️
pkg/madmin/heal-commands.go	`36.03% <ø> (ø)`	⬆️
cmd/admin-heal-ops.go	`4.09% <0%> (ø)`	⬆️
cmd/xl-sets.go	`39.93% <0%> (ø)`	⬆️
cmd/gateway-unsupported.go	`0% <0%> (ø)`	⬆️
cmd/xl-v1-healing.go	`60.48% <100%> (+0.07%)`	⬆️
cmd/xl-v1-healing-common.go	`88.78% <69.56%> (-2.13%)`	⬇️
cmd/bitrot-streaming.go	`78.64% <0%> (-2.92%)`	⬇️
cmd/posix.go	`64.08% <0%> (+0.21%)`	⬆️
cmd/fs-v1-helpers.go	`68.8% <0%> (+0.61%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a0ee7be...5d6ffb9. Read the comment docs.

harshavardhana

Can you state the usefulness of normal/deep scan from the user point of view - let's say some files are corrupted with bit-rot how does a user know that they have to run mc admin heal --scan deep instead of normal.

What additional information we are providing to user in normal scan which makes this decision clear, I see that normal scan is simply stating the xl.json

Currently we short circuit from on the first error from the disk, shouldn't we be continuing? to the next disk?

                                dataErrs[i] = err
                                break

This is the question for both Deep and Normal scan, it also means we won't heal other disks which might have bitrot or xl.json missing.

krishnasrinivas · 2019-02-28T22:05:11Z

Can you state the usefulness of normal/deep scan from the user point of view - let's say some files are corrupted with bit-rot how does a user know that they have to run mc admin heal --scan deep instead of normal.

current implementation's spec was given by AB. On the mc side, do you think we should display anything extra (informing that bitrot check is not being done) when it is run in normal mode?

What additional information we are providing to user in normal scan which makes this decision clear, I see that normal scan is simply stating the `xl.json

we stat xl.json and all parts too. (but no bitrot)

Currently we short circuit from on the first error from the disk, shouldn't we be continuing? to the next disk?

it will continue to the next disk, note that it only breaks the loop for _, part := range partsMetadata[i].Parts {

harshavardhana · 2019-02-28T22:23:05Z

current implementation's spec was given by AB. On the mc side, do you think we should display anything extra (informing that bitrot check is not being done) when it is run in normal mode?

The spec is incomplete without user level indication on what they should be doing. For me personally explaining this to a customer will be a headache here are the reasons why.

From the PR it is harder to tell when should the deep scan be run v/s normal scan
Can normal scan indicate that deep scan might need to be run as we found issues
Is deep scan always better? if not how should we schedule deep scan v/s normal scan? how frequently?

From the PR its not evident that we can tell with clarity that deep scan should be run v/s normal scan. If our most users are just using mc admin heal and expecting it to work today to even heal parts it won't work. There is no easy way to address if we are planning to break compatibility.

I would suggest normal scan should also do bit-rot but not for the entire file instead it should jump to random block in the file and verify its bit-rot. The block is erasure block - this way we are not taking away the essential functionality but we are also not trying to read the entire file. Normal scan just stating xl.json is may not be enough, we should at least try to decode the JSON.

krishnasrinivas · 2019-03-01T01:22:32Z

If our most users are just using mc admin heal and expecting it to work today to even heal parts it won't work.
we should at least try to decode the JSON.

the above cases are handled in normal mode. the difference between normal/deep is just bitrot checking.

We can probably discuss in person.

harshavardhana · 2019-03-01T02:34:32Z

The spec is incomplete without user level indication on what they should be doing. For me personally explaining this to a customer will be a headache here are the reasons why.

From the PR it is harder to tell when should the deep scan be run v/s normal scan

Can normal scan indicate that deep scan might need to be run as we found issues

Is deep scan always better? if not how should we schedule deep scan v/s normal scan? how frequently?

The answers to these are needed before on how we communicate to our users and document them.

donatello · 2019-03-01T22:12:38Z

If we think about it, minio's metadata in xl.json is kind of not relevant to the user. When they manually run heal they are primarily concerned with their objects.

Maybe there is some benefit in only scanning the metadata during healing when healing is automatic and run by the server, but its not clear why it should be exposed to the user.

harshavardhana · 2019-03-01T22:17:56Z

If we think about it, minio's metadata in xl.json is kind of not relevant to the user. When they manually run heal they are primarily concerned with their objects.

Maybe there is some benefit in only scanning the metadata during healing when healing is automatic and run by the server, but its not clear why it should be exposed to the user.

Correct @donatello - I guess this is the gist which should be clarified.

harshavardhana · 2019-03-06T08:32:36Z

We discussed this with @abperiasamy - the normal heal should do the following

healing missing xl.json or corrupted
healing missing part files but not bit-rot errors, i.e new part files are created but if old part files have bit-rot will not be healed

Now bit-rot errors are healed only there is a deep scan, this would read and write all content if needed. the plan is to write a document providing recommendation to our users on when Deep scan should be used and what does normal scan can actually do.

vadmeste · 2019-03-07T19:51:33Z

@harshavardhana so basically nothing to change in this PR.

@donatello @krishnasrinivas can you review/approve again ?

@donatello

PTAL @donatello

harshavardhana · 2019-03-14T04:42:39Z

Can you fix the conflicts @vadmeste ?

Healing scan used to read all objects parts to check for bitrot checksum. This commit will add a quicker way of healing scan by only checking if parts are actually present in disks or not.

minio-ops · 2019-03-14T14:01:13Z

Mint Automation

Test	Result
mint-compression-xl.sh	✔️
mint-xl.sh	✔️
mint-compression-dist-xl.sh	✔️
mint-compression-fs.sh	✔️
mint-worm.sh	✔️
mint-fs.sh	✔️
mint-gateway-nas.sh	✔️
mint-large-bucket.sh	❌ more...
mint-dist-xl.sh	❌ more...

7251-5d6ffb9/mint-large-bucket.sh.log:

Running with
SERVER_ENDPOINT:      72.28.97.52:32632
ACCESS_KEY:           minio
SECRET_KEY:           ***REDACTED***
ENABLE_HTTPS:         0
SERVER_REGION:        us-east-1
MINT_DATA_DIR:        /mint/data
MINT_MODE:            full
ENABLE_VIRTUAL_STYLE: 0

To get logs, run 'docker cp 3dc829c90ebf:/mint/log /tmp/mint-logs'

(1/14) Running aws-sdk-go tests ... FAILED in 0 seconds
{
  "alert": "",
  "args": {
    "bucketName": "aws-sdk-go-test-4a6mmk0wawq0ws",
    "expiry": 60000000000,
    "objectName": "presignedTest"
  },
  "duration": 376,
  "error": "RequestError: send request failed\ncaused by: Put http://72.28.97.52:32632/aws-sdk-go-test-4a6mmk0wawq0ws: dial tcp 72.28.97.52:32632: connect: connection refused",
  "function": "PresignedPut",
  "message": "AWS SDK Go CreateBucket Failed",
  "name": "aws-sdk-go",
  "status": "FAIL"
}

Executed 0 out of 14 tests successfully.

7251-5d6ffb9/mint-dist-xl.sh.log:

Running with
SERVER_ENDPOINT:      72.28.97.57:30902
ACCESS_KEY:           minio
SECRET_KEY:           ***REDACTED***
ENABLE_HTTPS:         0
SERVER_REGION:        us-east-1
MINT_DATA_DIR:        /mint/data
MINT_MODE:            full
ENABLE_VIRTUAL_STYLE: 0

To get logs, run 'docker cp 478fa9c86a75:/mint/log /tmp/mint-logs'

(1/14) Running aws-sdk-go tests ... FAILED in 0 seconds
{
  "alert": "",
  "args": {
    "bucketName": "aws-sdk-go-test-hofoqo5n99fwyf",
    "expiry": 60000000000,
    "objectName": "presignedTest"
  },
  "duration": 332,
  "error": "RequestError: send request failed\ncaused by: Put http://72.28.97.57:30902/aws-sdk-go-test-hofoqo5n99fwyf: dial tcp 72.28.97.57:30902: connect: connection refused",
  "function": "PresignedPut",
  "message": "AWS SDK Go CreateBucket Failed",
  "name": "aws-sdk-go",
  "status": "FAIL"
}

Executed 0 out of 14 tests successfully.

vadmeste · 2019-03-14T19:25:41Z

@krishnasrinivas I only rebased this PR, can you review/approve again ?

vadmeste force-pushed the heal-fast-slow branch from 0741d6d to f423fa1 Compare February 19, 2019 16:42

krishnasrinivas reviewed Feb 20, 2019

View reviewed changes

cmd/xl-v1-healing.go Outdated Show resolved Hide resolved

krishnasrinivas reviewed Feb 20, 2019

View reviewed changes

cmd/xl-v1-healing-common.go Show resolved Hide resolved

vadmeste force-pushed the heal-fast-slow branch 2 times, most recently from 519fe65 to 8207add Compare February 20, 2019 21:27

vadmeste changed the title ~~Add normal/deep type of healing~~ Add normal/deep type of heal scanning Feb 20, 2019

krishnasrinivas previously approved these changes Feb 26, 2019

View reviewed changes

vadmeste marked this pull request as ready for review February 28, 2019 09:54

vadmeste dismissed krishnasrinivas’s stale review via d954781 February 28, 2019 09:56

vadmeste force-pushed the heal-fast-slow branch from 8207add to d954781 Compare February 28, 2019 09:56

nitisht requested review from donatello and krishnasrinivas February 28, 2019 10:00

krishnasrinivas previously approved these changes Feb 28, 2019

View reviewed changes

vadmeste dismissed krishnasrinivas’s stale review via 246d6a4 February 28, 2019 21:16

vadmeste force-pushed the heal-fast-slow branch from d954781 to 246d6a4 Compare February 28, 2019 21:16

harshavardhana previously requested changes Feb 28, 2019

View reviewed changes

vadmeste force-pushed the heal-fast-slow branch 2 times, most recently from 5f12d57 to 0d9c43c Compare March 7, 2019 19:10

krishnasrinivas previously approved these changes Mar 8, 2019

View reviewed changes

nitisht requested review from harshavardhana and removed request for donatello March 9, 2019 16:38

vadmeste dismissed krishnasrinivas’s stale review via 90a0df6 March 12, 2019 14:52

vadmeste force-pushed the heal-fast-slow branch from 3c65851 to 90a0df6 Compare March 12, 2019 14:52

harshavardhana previously approved these changes Mar 12, 2019

View reviewed changes

Add normal/deep type of heal scanning

5d6ffb9

Healing scan used to read all objects parts to check for bitrot checksum. This commit will add a quicker way of healing scan by only checking if parts are actually present in disks or not.

vadmeste dismissed harshavardhana’s stale review via 5d6ffb9 March 14, 2019 13:44

vadmeste force-pushed the heal-fast-slow branch from 90a0df6 to 5d6ffb9 Compare March 14, 2019 13:44

harshavardhana approved these changes Mar 14, 2019

View reviewed changes

krishnasrinivas approved these changes Mar 14, 2019

View reviewed changes

kannappanr merged commit facbd65 into minio:master Mar 14, 2019

vadmeste deleted the heal-fast-slow branch March 14, 2019 20:44

vadmeste mentioned this pull request Apr 4, 2019

Heal operation is too slow #7150

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add normal/deep type of heal scanning #7251

Add normal/deep type of heal scanning #7251

vadmeste commented Feb 18, 2019 •

edited

harshavardhana commented Feb 20, 2019

vadmeste commented Feb 20, 2019 •

edited

harshavardhana commented Feb 20, 2019

krishnasrinivas commented Feb 26, 2019

harshavardhana commented Feb 28, 2019

vadmeste commented Feb 28, 2019

krishnasrinivas commented Feb 28, 2019

codecov bot commented Feb 28, 2019 •

edited

harshavardhana left a comment

krishnasrinivas commented Feb 28, 2019

harshavardhana commented Feb 28, 2019 •

edited

krishnasrinivas commented Mar 1, 2019

harshavardhana commented Mar 1, 2019

donatello commented Mar 1, 2019

harshavardhana commented Mar 1, 2019

harshavardhana commented Mar 6, 2019

vadmeste commented Mar 7, 2019

harshavardhana commented Mar 14, 2019

minio-ops commented Mar 14, 2019

vadmeste commented Mar 14, 2019

Add normal/deep type of heal scanning #7251

Add normal/deep type of heal scanning #7251

Conversation

vadmeste commented Feb 18, 2019 • edited

Description

Motivation and Context

Regression

How Has This Been Tested?

Types of changes

Checklist:

harshavardhana commented Feb 20, 2019

vadmeste commented Feb 20, 2019 • edited

harshavardhana commented Feb 20, 2019

krishnasrinivas commented Feb 26, 2019

harshavardhana commented Feb 28, 2019

vadmeste commented Feb 28, 2019

krishnasrinivas commented Feb 28, 2019

codecov bot commented Feb 28, 2019 • edited

Codecov Report

harshavardhana left a comment

Choose a reason for hiding this comment

krishnasrinivas commented Feb 28, 2019

harshavardhana commented Feb 28, 2019 • edited

krishnasrinivas commented Mar 1, 2019

harshavardhana commented Mar 1, 2019

donatello commented Mar 1, 2019

harshavardhana commented Mar 1, 2019

harshavardhana commented Mar 6, 2019

vadmeste commented Mar 7, 2019

harshavardhana commented Mar 14, 2019

minio-ops commented Mar 14, 2019

Mint Automation

7251-5d6ffb9/mint-large-bucket.sh.log:

7251-5d6ffb9/mint-dist-xl.sh.log:

vadmeste commented Mar 14, 2019

vadmeste commented Feb 18, 2019 •

edited

vadmeste commented Feb 20, 2019 •

edited

codecov bot commented Feb 28, 2019 •

edited

harshavardhana commented Feb 28, 2019 •

edited