
Observed in noobaa log: "cancelled due to ctime change"; the actual upload didn't fail #7833

Closed

rkomandu opened this issue Feb 19, 2024 · 14 comments · Fixed by #8017 or #8158

@rkomandu (Collaborator) commented Feb 19, 2024:

Environment info

  • NooBaa Version: d/s build 0129 (noobaa-core-5.15.0-20240129.el9.x86_64)
  • Platform: RHEL 9.3, x86_64 (RPM install)

rpm -qi noobaa-core-5.15.0-20240129.el9.x86_64
Name : noobaa-core
Version : 5.15.0
Release : 20240129.el9
Architecture: x86_64
Install Date: Wed 07 Feb 2024 01:19:10 AM CST
Group : Unspecified
Size : 409313992
License : Apache-2.0
Signature : RSA/SHA256, Tue 30 Jan 2024 10:48:47 AM CST, Key ID 1904eb93bd37e2d9
Source RPM : noobaa-core-5.15.0-20240129.el9.src.rpm
Build Date : Mon 29 Jan 2024 08:29:07 AM CST
Build Host : x86-64-07.build.eng.rdu2.redhat.com
Packager : Red Hat, Inc. http://bugzilla.redhat.com/bugzilla
Vendor : Red Hat, Inc.
...

Ran the upload of the 10G/5G/8G files in a loop; 99 results are expected, since on the first iteration the files don't yet exist to list for both 10G and 8G.

grep "10737418240 file_10G" upload.log | wc -l
99
grep "8589934592 file_8G" upload.log |wc -l
99
grep "5368709120 file_5G" upload.log |wc -l
100

noobaa.log

Feb 14 05:30:14 c83f1-app3 node[3420425]: [nsfs/3420425] [L0] core.endpoint.s3.ops.s3_put_object_uploadId:: PUT OBJECT PART newbucket-14feb file_10G 1178
Feb 14 05:30:14 c83f1-app3 node[3420425]: Feb-14 5:30:14.007 [nsfs/3420425] [L0] core.endpoint.s3.ops.s3_put_object_uploadId:: PUT OBJECT PART newbucket-14feb file_10G 1178
Feb 14 05:30:14 c83f1-app3 node[3420425]: [nsfs/3420425] [ERROR] core.endpoint.s3.s3_rest:: S3 ERROR <?xml version="1.0" encoding="UTF-8"?><Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><Resource>/newbucket-14feb/file_10G?uploadId=1e01a581-723e-4c90-805b-b6d4aa5aef06&partNumber=1178</Resource><RequestId>lslpn793-1gh7mr-b72</RequestId></Error> PUT /newbucket-14feb/file_10G?uploadId=1e01a581-723e-4c90-805b-b6d4aa5aef06&partNumber=1178 {"host":"172.20.100.33:6443","accept-encoding":"identity","user-agent":"aws-cli/2.15.19 Python/3.11.6 Linux/4.18.0-240.el8.x86_64 exe/x86_64.rhel.8 prompt/off command/s3.cp","content-md5":"lplbWNTL9qqpBBtPAMf2rg==","expect":"100-continue","x-amz-date":"20240214T113014Z","x-amz-content-sha256":"UNSIGNED-PAYLOAD","authorization":"AWS4-HMAC-SHA256 Credential=qVuzjvWlHiVhcrrXf4uS/20240214/us-east-1/s3/aws4_request, SignedHeaders=content-md5;host;x-amz-content-sha256;x-amz-date, Signature=6ce9c23abc18309d2de824f4e7d15bcb5456dbe05fc5780915a21c67a123be15","content-length":"8388608"} Error: FileStat: _path=/gpfs/remote-fvt_fs/s3user6002-dir/newbucket-14feb/.noobaa-nsfs_65cc98c7237eb991d649b04c/multipart-uploads/1e01a581-723e-4c90-805b-b6d4aa5aef06 cancelled due to ctime change
Feb 14 05:30:14 c83f1-app3 node[3420425]: Feb-14 5:30:14.009 [nsfs/3420425] [ERROR] core.endpoint.s3.s3_rest:: S3 ERROR <?xml version="1.0" encoding="UTF-8"?><Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><Resource>/newbucket-14feb/file_10G?uploadId=1e01a581-723e-4c90-805b-b6d4aa5aef06&partNumber=1178</Resource><RequestId>lslpn793-1gh7mr-b72</RequestId></Error> PUT /newbucket-14feb/file_10G?uploadId=1e01a581-723e-4c90-805b-b6d4aa5aef06&partNumber=1178 {"host":"172.20.100.33:6443","accept-encoding":"identity","user-agent":"aws-cli/2.15.19 Python/3.11.6 Linux/4.18.0-240.el8.x86_64 exe/x86_64.rhel.8 prompt/off command/s3.cp","content-md5":"lplbWNTL9qqpBBtPAMf2rg==","expect":"100-continue","x-amz-date":"20240214T113014Z","x-amz-content-sha256":"UNSIGNED-PAYLOAD","authorization":"AWS4-HMAC-SHA256 Credential=qVuzjvWlHiVhcrrXf4uS/20240214/us-east-1/s3/aws4_request, SignedHeaders=content-md5;host;x-amz-content-sha256;x-amz-date, Signature=6ce9c23abc18309d2de824f4e7d15bcb5456dbe05fc5780915a21c67a123be15","content-length":"8388608"} Error: FileStat: _path=/gpfs/remote-fvt_fs/s3user6002-dir/newbucket-14feb/.noobaa-nsfs_65cc98c7237eb991d649b04c/multipart-uploads/1e01a581-723e-4c90-805b-b6d4aa5aef06 cancelled due to ctime change
Feb 14 05:30:14 c83f1-app3 node[3420426]: [nsfs/3420426] [L0] core.endpoint.s3.ops.s3_put_object_uploadId:: PUT OBJECT PART newbucket-14feb file_10G 1179

RHEL 9.3, x86_64 arch

Actual behavior

No upload failures were observed, but what is this message and can it be investigated? Anything logged as an ERROR immediately catches the eye.

Expected behavior

If this is really not an ERROR, can it be investigated and downgraded to a WARN?

Steps to reproduce

The script used is as follows:
i=1
while [ $i -le 100 ]
do
    AWS_ACCESS_KEY_ID=x AWS_SECRET_ACCESS_KEY=x aws --endpoint https://172.20.100.33:6443 --no-verify-ssl s3 cp /root/file_10G s3://newbucket-14feb
    AWS_ACCESS_KEY_ID=x AWS_SECRET_ACCESS_KEY=x aws --endpoint https://172.20.100.34:6443 --no-verify-ssl s3 cp /root/file_8G s3://newbucket-14feb
    AWS_ACCESS_KEY_ID=x AWS_SECRET_ACCESS_KEY=x aws --endpoint https://172.20.100.31:6443 --no-verify-ssl s3 cp /root/file_5G s3://newbucket-14feb
    echo "list the file"
    AWS_ACCESS_KEY_ID=x AWS_SECRET_ACCESS_KEY=x aws --endpoint https://172.20.100.33:6443 --no-verify-ssl s3 ls s3://newbucket-14feb
    echo "delete the file"
    AWS_ACCESS_KEY_ID=x AWS_SECRET_ACCESS_KEY=x aws --endpoint https://172.20.100.36:6443 --no-verify-ssl s3 rm s3://newbucket-13feb/file_5G
    AWS_ACCESS_KEY_ID=x AWS_SECRET_ACCESS_KEY=x aws --endpoint https://172.20.100.35:6443 --no-verify-ssl s3 rm s3://newbucket-14feb/file_10G
    AWS_ACCESS_KEY_ID=x AWS_SECRET_ACCESS_KEY=x aws --endpoint https://172.20.100.32:6443 --no-verify-ssl s3 rm s3://newbucket-14feb/file_8G
    i=$((i + 1))
    echo "$i is value ====="
    sleep 60
done

grep "cancelled due to ctime change" messages-20240218 | grep "file_10G" | wc -l --> it happened for only this 10G object upload
20

grep "cancelled due to ctime change" messages-20240218 | grep "file_8G" | wc -l
0
grep "cancelled due to ctime change" messages-20240218 | grep "file_5G" | wc -l
0

More information - Screenshots / Logs / Other output

Will upload the noobaa.log (it is currently logged under /var/log/messages).

@rkomandu rkomandu added the NS-FS label Feb 19, 2024
@rkomandu (Collaborator, Author):

Uploading the noobaa.log file (compressed archive):

messages-20240218-issue7883.zip

@rkomandu (Collaborator, Author) commented Mar 7, 2024:

@romayalon or @naveenpaul1, any update here on this?

@naveenpaul1 (Contributor):

@rkomandu I tried to reproduce the issue with the same script on my local system but couldn't reproduce it. I will try with a different setup and let you know.

@guymguym (Member):

@romayalon and I looked at the code and think this can happen as a race when upload-part requests are sent to different endpoints, and each one uses stat() to check the written file's attributes. However, that stat does not really need the ctime check, because nothing in this path uses the xattrs other than looking at the encryption xattr. We can address this by either removing that stat call in _finish_upload for the upload-part case, or just disabling the ctime check for that case. Will defer to the next version.
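
For illustration only, here is a minimal Node.js sketch of the second option (skipping the ctime check when finishing an upload-part). The names statWithCtimeValidation and finishUploadStat are hypothetical stand-ins, not NooBaa's actual namespace_fs code; the real change is the one tracked by #8017 / #8158.

'use strict';
const fs = require('fs/promises');

// Models the validated stat: sample the stat twice and cancel if the ctime moved
// in between, which is the "cancelled due to ctime change" error seen in the logs.
async function statWithCtimeValidation(p) {
    const before = await fs.stat(p);
    const after = await fs.stat(p);
    if (before.ctimeMs !== after.ctimeMs) {
        throw new Error(`FileStat: _path=${p} cancelled due to ctime change`);
    }
    return after;
}

// Option 2 from the comment above: for an upload-part, take a plain stat and skip
// the ctime-validated path entirely, since its xattrs are not needed here.
async function finishUploadStat(p, { part_upload = false } = {}) {
    return part_upload ? fs.stat(p) : statWithCtimeValidation(p);
}

module.exports = { finishUploadStat, statWithCtimeValidation };

Either variant removes the cross-request ctime comparison from the upload-part path, so concurrent parts written through different endpoint processes can no longer cancel each other's stat.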

@romayalon (Contributor):

@rkomandu this issue was fixed, can you please verify?

@guymguym (Member) commented Jun 17, 2024:

@romayalon I reproduced this issue with 5.15.4 using s3cat:

> bin/node src/tools/s3cat \
  --endpoint http://localhost:6001 --access_key XXX --secret_key YYY \
  --bucket guy-warp --upload --size $((100*1024)) --part_size 32 --concur 16

Uploading upload-lxjdee2k from generated data of size 100 GB
Upload Speed: 149.7 MB/sec (average 149.7)
Upload Speed: 277.2 MB/sec (average 182.6)
Upload Speed: 289.3 MB/sec (average 205.6)
Upload Speed: 280.5 MB/sec (average 220.3)
Upload Speed: 319.0 MB/sec (average 233.7)
Upload Speed: 234.5 MB/sec (average 233.8)
Upload Speed: 287.9 MB/sec (average 241.0)
Upload Speed: 298.5 MB/sec (average 246.6)
Upload Speed: 323.2 MB/sec (average 254.6)
Upload Speed: 379.1 MB/sec (average 264.1)
Upload Speed: 213.8 MB/sec (average 259.5)
Upload Speed: 292.0 MB/sec (average 261.7)
Upload Speed: 458.5 MB/sec (average 274.0)
Upload Speed: 235.1 MB/sec (average 271.1)
Upload Speed: 229.6 MB/sec (average 268.1)
Upload Speed: 284.2 MB/sec (average 269.2)
Upload Speed: 275.9 MB/sec (average 269.6)
Upload Speed: 289.5 MB/sec (average 270.6)
Upload Speed: 281.7 MB/sec (average 271.1)
Upload Speed: 272.7 MB/sec (average 271.2)
Upload Speed: 283.4 MB/sec (average 271.8)
Upload Speed: 283.2 MB/sec (average 272.2)
Upload Speed: 430.9 MB/sec (average 278.2)
Upload Speed: 193.1 MB/sec (average 274.5)
Upload Speed: 317.1 MB/sec (average 276.1)
Upload Speed: 287.5 MB/sec (average 276.5)
Upload Speed: 315.8 MB/sec (average 277.9)
Upload Speed: 256.5 MB/sec (average 277.2)
Upload Speed: 403.1 MB/sec (average 280.8)
Upload Speed: 203.2 MB/sec (average 277.9)
Upload Speed: 191.2 MB/sec (average 275.7)
Upload Speed: 344.4 MB/sec (average 277.7)
Upload Speed: 292.6 MB/sec (average 278.2)
Upload Speed: 258.4 MB/sec (average 277.5)
Upload Speed: 286.2 MB/sec (average 277.7)
Upload Speed: 311.9 MB/sec (average 278.5)
Upload Speed: 299.8 MB/sec (average 279.0)
UPLOAD ERROR: Error: write EPIPE
    at WriteWrap.onWriteComplete [as oncomplete] (node:internal/stream_base_commons:94:16) {
  errno: -32,
  code: 'TimeoutError',
  syscall: 'write',
  time: 2024-06-17T19:31:13.747Z,
  region: 'us-east-1',
  hostname: 'localhost',
  retryable: true,
  statusCode: 500
}
Upload Speed: 281.0 MB/sec (average 279.1)
Upload Speed: 0.0 MB/sec (average 270.3)
^C

The client upload is stuck from that point, and in the endpoint logs I see:

Jun 17 15:31:11 host [1991532]: [nsfs/1991532] [ERROR] core.endpoint.s3.s3_rest:: S3 ERROR <?xml version="1.0" encoding="UTF-8"?><Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><Resource>/guy-warp/upload-lxjdee2k?partNumber=401&amp;uploadId=3971594e-4f22-47f3-bcd8-23686dfe3d16</Resource><RequestId>lxjdfckd-bdahea-10bv</RequestId></Error> PUT /guy-warp/upload-lxjdee2k?partNumber=401&uploadId=3971594e-4f22-47f3-bcd8-23686dfe3d16 { "user-agent": "aws-sdk-nodejs/2.1550.0 linux/v20.11.0 callback", "content-length": "33554432", "content-type": "application/octet-stream", "x-amz-content-sha256": "99261b161fdf5a25feb9bdb93fc55b2fe462b0a46c4e1bba5d32f134ad7d651f", "host": "localhost:6001", "expect": "100-continue", "x-amz-date": "20240617T193111Z", "authorization": "REDACTED", "connection":"close" } Error: FileStat: _path=/gpfs/fs1/guy-warp/.noobaa-nsfs_660c8859e9ce17bc055da80c/multipart-uploads/3971594e-4f22-47f3-bcd8-23686dfe3d16 cancelled due to ctime change

@romayalon (Contributor):

@guymguym I tried to reproduce it 3 times on my Mac without success. Are you running it on Linux?

@guymguym (Member) commented Jun 18, 2024:

Yes, on RHEL 8.6 with GPFS 5.1.8.
Perhaps @rkomandu can help with reproduction?
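
For reference, a small self-contained Node.js check (not NooBaa code) of the effect underlying the race: creating a part file inside the multipart-uploads/<uploadId> directory updates that directory's ctime, so stat samples taken around another endpoint's write will disagree. The temporary directory and part file name below are illustrative.

'use strict';
const fs = require('fs/promises');
const os = require('os');
const path = require('path');

async function main() {
    // Stand-in for .noobaa-nsfs_<bucket id>/multipart-uploads/<uploadId>
    const uploadDir = await fs.mkdtemp(path.join(os.tmpdir(), 'mpu-'));
    const before = await fs.stat(uploadDir);

    // Simulates another endpoint process writing a part into the same upload dir.
    await fs.writeFile(path.join(uploadDir, 'part-2930'), Buffer.alloc(1024));

    const after = await fs.stat(uploadDir);
    console.log('upload dir ctime changed:', before.ctimeMs !== after.ctimeMs);

    await fs.rm(uploadDir, { recursive: true, force: true });
}

main().catch(err => { console.error(err); process.exit(1); });

Whether two real requests actually collide depends on timing and on the filesystem's timestamp granularity, which may be why the error shows up on a loaded GPFS cluster but not in every local run.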

@rkomandu (Collaborator, Author):

@romayalon

On my machine:

Jun 18 10:17:21 c83f1-app2 [3151039]: [nsfs/3151039] [L0] core.endpoint.s3.ops.s3_put_object_uploadId:: PUT OBJECT PART bucket-7833 upload-lxkhksb0 2925
Jun 18 10:17:21 c83f1-app2 [3151038]: [nsfs/3151038] [L0] core.endpoint.s3.ops.s3_put_object_uploadId:: PUT OBJECT PART bucket-7833 upload-lxkhksb0 2920
Jun 18 10:17:21 c83f1-app2 [3151038]: [nsfs/3151038] [L0] core.endpoint.s3.ops.s3_put_object_uploadId:: PUT OBJECT PART bucket-7833 upload-lxkhksb0 2922
Jun 18 10:17:21 c83f1-app2 [3151038]: [nsfs/3151038] [L0] core.endpoint.s3.ops.s3_put_object_uploadId:: PUT OBJECT PART bucket-7833 upload-lxkhksb0 2924
Jun 18 10:17:21 c83f1-app2 [3151038]: [nsfs/3151038] [L0] core.endpoint.s3.ops.s3_put_object_uploadId:: PUT OBJECT PART bucket-7833 upload-lxkhksb0 2926
Jun 18 10:17:21 c83f1-app2 [3151038]: [nsfs/3151038] [L0] core.endpoint.s3.ops.s3_put_object_uploadId:: PUT OBJECT PART bucket-7833 upload-lxkhksb0 2928
Jun 18 10:17:21 c83f1-app2 [3151039]: [nsfs/3151039] [L0] core.endpoint.s3.ops.s3_put_object_uploadId:: PUT OBJECT PART bucket-7833 upload-lxkhksb0 2927
Jun 18 10:17:22 c83f1-app2 [3151039]: [nsfs/3151039] [L0] core.endpoint.s3.ops.s3_put_object_uploadId:: PUT OBJECT PART bucket-7833 upload-lxkhksb0 2929
Jun 18 10:17:22 c83f1-app2 [3151038]: [nsfs/3151038] [L0] core.endpoint.s3.ops.s3_put_object_uploadId:: PUT OBJECT PART bucket-7833 upload-lxkhksb0 2930
Jun 18 10:17:22 c83f1-app2 [3151038]: [nsfs/3151038] [ERROR] core.endpoint.s3.s3_rest:: S3 ERROR <?xml version="1.0" encoding="UTF-8"?><Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><Resource>/bucket-7833/upload-lxkhksb0?partNumber=2930&amp;uploadId=a1ca6ca3-b1ca-423c-9b72-f41c8cb319f3</Resource><RequestId>lxkhnm8o-90g11v-1czy</RequestId></Error> PUT /bucket-7833/upload-lxkhksb0?partNumber=2930&uploadId=a1ca6ca3-b1ca-423c-9b72-f41c8cb319f3 {"user-agent":"aws-sdk-nodejs/2.1550.0 linux/v20.11.0 callback","content-length":"33554432","content-type":"application/octet-stream","x-amz-content-sha256":"d43d7a2d6be779831eece352a3311393e6b0235ad3087136c62c108fb31a74ef","host":"172.20.100.31:6001","expect":"100-continue","x-amz-date":"20240618T141721Z","authorization":"AWS4-HMAC-SHA256 Credential=gZw9dlE9P9wgzk90KRDE/20240618/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=ee2ac9f3f0830438ed18fa196dc26838437c58b848d064c1b38d8eaa12d739bc","connection":"close"} Error: FileStat: _path=/gpfs/remote_fvt_fs/s3user-8402-dir/bucket-7833/.noobaa-nsfs_66719660c2325092bce2e1b5/multipart-uploads/a1ca6ca3-b1ca-423c-9b72-f41c8cb319f3 cancelled due to ctime change
Jun 18 10:17:22 c83f1-app2 [3151039]: [nsfs/3151039] [L0] core.endpoint.s3.ops.s3_put_object_uploadId:: PUT OBJECT PART bucket-7833 upload-lxkhksb0 2931
Jun 18 10:17:22 c83f1-app2 [3151038]: [nsfs/3151038] [L0] core.endpoint.s3.ops.s3_put_object_uploadId:: PUT OBJECT PART bucket-7833 upload-lxkhksb0 2932
Jun 18 10:17:22 c83f1-app2 [3151039]: [nsfs/3151039] [L0] core.endpoint.s3.ops.s3_put_object_uploadId:: PUT OBJECT PART bucket-7833 upload-lxkhksb0 2933
Jun 18 10:17:22 c83f1-app2 [3151038]: [nsfs/3151038] [L0] core.endpoint.s3.ops.s3_put_object_uploadId:: PUT OBJECT PART bucket-7833 upload-lxkhksb0 2930

The upload completed, though:
[root@c83f1-app2 noobaa-core]# bin/node src/tools/s3cat --endpoint http://172.20.100.31:6001 --access_key gZw9dlE9P9wgzk90KRDE --secret_key 4KVHKR1SMvE3l0rUbffABdNdKGuRFBYx8YVaSLZB --bucket bucket-7833 --upload --size $((100*1024)) --part_size 32 --concur 16
Uploading upload-lxkhksb0 from generated data of size 100 GB
Upload Speed: 348.2 MB/sec (average 348.2)
Upload Speed: 0.0 MB/sec (average 247.0)
Upload Speed: 448.0 MB/sec (average 290.5)
Upload Speed: 95.9 MB/sec (average 255.9)
Upload Speed: 608.0 MB/sec (average 309.0)
Upload Speed: 767.6 MB/sec (average 385.8)
Upload Speed: 573.7 MB/sec (average 406.9)
Upload Speed: 780.5 MB/sec (average 445.2)
Upload Speed: 791.8 MB/sec (average 481.6)
Upload Speed: 5.7 MB/sec (average 242.2)
Upload Speed: 829.6 MB/sec (average 269.1)
Upload Speed: 748.2 MB/sec (average 290.7)
Upload Speed: 763.4 MB/sec (average 309.3)
Upload Speed: 804.9 MB/sec (average 330.6)
...
Upload Speed: 758.9 MB/sec (average 709.6)
Upload Speed: 832.0 MB/sec (average 710.5)
Upload Speed: 741.6 MB/sec (average 710.8)
Upload Speed: 742.7 MB/sec (average 711.0)
Upload Speed: 795.6 MB/sec (average 711.6)
Upload Speed: 800.0 MB/sec (average 712.3)
upload done. 710 MB/sec

ls -lhrt s3user-8402-dir/bucket-7833/
total 100G
-rw-rw----. 1 8402 8400 100G Jun 18 10:17 upload-lxkhksb0

rpm -qa |grep noobaa
noobaa-core-5.15.4-20240611.el9.x86_64

kernel: 5.14.0-362.8.1.el9_3.x86_64

@rkomandu (Collaborator, Author):

After our discussion on the recreate, I uploaded the noobaa.log file: https://ibm.box.com/s/i9qnck3az7nvh08f5poijx52nux9238h

@romayalon, you should have access to this.

@romayalon (Contributor):

Closing the issue since a fix was merged (#8158).
I couldn't see the error anymore after applying the changes.

@rkomandu (Collaborator, Author):

@romayalon, are we fixing this in 5.15.5?

@romayalon (Contributor) commented Jun 20, 2024:

@rkomandu @madhuthorat The request for a backport to 5.15.5 should come from Madhu. Madhu, please check this issue and, if needed, add it to your 5.15.5 candidates and let us know once you have added it.
CC: @nimrod-becker

@madhuthorat (Collaborator):


This is not a priority as of now for 5.15.5, but good to have if you can.
