Upload fails after server to server redirect with cache #2104

Closed
2 tasks done
JonathanMELIUS opened this issue Apr 20, 2023 · 10 comments

Comments

@JonathanMELIUS

JonathanMELIUS commented Apr 20, 2023

  • main
  • 4-2-stable

Bug Report

iRODS Version, OS and Version

OS: Ubuntu 18.04
iRODS: 4.2.11

irods@icat:/tmp/test_iput$ apt list --installed | grep irods
irods-database-plugin-postgres/bionic,now 4.2.11-1~bionic amd64 [installed,upgradable to: 4.3.0-1~bionic]
irods-dev/bionic,now 4.2.11-1~bionic amd64 [installed,upgradable to: 4.3.0-1~bionic]
irods-externals-avro1.9.0-0/bionic,now 1.0~bionic amd64 [installed,automatic]
irods-externals-boost1.67.0-0/bionic,now 1.0~bionic amd64 [installed,automatic]
irods-externals-catch22.3.0-0/bionic,now 1.0~bionic amd64 [installed,automatic]
irods-externals-clang-runtime6.0-0/bionic,now 1.0~bionic amd64 [installed]
irods-externals-clang6.0-0/bionic,now 1.0~bionic amd64 [installed]
irods-externals-cppzmq4.2.3-0/bionic,now 1.0~bionic amd64 [installed,automatic]
irods-externals-fmt6.1.2-1/bionic,now 1.0~bionic amd64 [installed,automatic]
irods-externals-jansson2.7-0/bionic,now 1.0~bionic amd64 [installed]
irods-externals-json3.7.3-0/bionic,now 1.0~bionic amd64 [installed]
irods-externals-libarchive3.3.2-1/bionic,now 1.0~bionic amd64 [installed,automatic]
irods-externals-libs3e4197a5e-0/bionic,now 1.0~bionic amd64 [installed,automatic]
irods-externals-nanodbc2.13.0-1/bionic,now 1.0~bionic amd64 [installed,automatic]
irods-externals-spdlog1.5.0-1/bionic,now 1.0~bionic amd64 [installed,automatic]
irods-externals-zeromq4-14.1.6-0/bionic,now 1.0~bionic amd64 [installed,automatic]
irods-icommands/bionic,now 4.2.11-1~bionic amd64 [installed,upgradable to: 4.3.0-1~bionic]
irods-resource-plugin-s3/bionic,now 4.2.11.2-1~bionic amd64 [installed,upgradable to: 4.3.0.0-1~bionic]
irods-rule-engine-plugin-python/bionic,now 4.2.11.1-1~bionic amd64 [installed,upgradable to: 4.3.0.1-1~bionic]
irods-runtime/bionic,now 4.2.11-1~bionic amd64 [installed,upgradable to: 4.3.0-1~bionic]
irods-server/bionic,now 4.2.11-1~bionic amd64 [installed,upgradable to: 4.3.0-1~bionic]
irods@icat:/tmp/test_iput$ ienv
irods_version - 4.2.11
irods_client_server_negotiation - request_server_negotiation
irods_server_control_plane_key - test_ABCDErulswop1q2b3v4m5z6as98
irods_server_control_plane_port - 1248
irods_client_server_policy - CS_NEG_REQUIRE
irods_host - icat.dh.local
irods_transfer_buffer_size_for_parallel_transfer_in_megabytes - 4
irods_user_name - rods
irods_zone_name - nlmumc
irods_cwd - /nlmumc/home/rods
irods_ssl_verify_server - cert
irods_connection_pool_refresh_time_in_seconds - 300
irods_encryption_key_size - 32
irods_default_hash_scheme - SHA256
irods_environment_file - /var/lib/irods/.irods/irods_environment.json
irods_default_number_of_transfer_threads - 4
irods_encryption_algorithm - AES-256-CBC
irods_ssl_ca_certificate_file - /etc/irods/SSL/test_only_dev_irods_dh_ca_cert.pem
irods_encryption_salt_size - 8
schema_version - v3
irods_home - /nlmumc/home/rods
irods_encryption_num_hash_rounds - 16
irods_default_resource - rootResc
irods_match_hash_policy - compatible
irods_maximum_size_for_single_buffer_in_megabytes - 32
irods_session_environment_file - /var/lib/irods/.irods/irods_environment.json.3436
irods_port - 1247
irods_server_control_plane_encryption_algorithm - AES-256-CBC
schema_name - service_account_environment
irods_server_control_plane_encryption_num_hash_rounds - 16
irods_ssl_certificate_chain_file - /etc/irods/SSL/icat.dh.local.crt
irods_ssl_certificate_key_file - /etc/irods/SSL/icat.dh.local.key
irods_ssl_dh_params_file - /etc/irods/SSL/dhparams.pem

irods@icat:/tmp/test_iput$ ilsresc 
[..]
replRescUMCeph01:replication
├── UM-Ceph-S3-AC:s3
└── UM-Ceph-S3-GL:s3
rootResc:passthru
└── demoResc:unixfilesystem
[..]
irods@icat:/tmp/test_iput$ ilsresc -l UM-Ceph-S3-GL
resource name: UM-Ceph-S3-GL
id: 10171
zone: nlmumc
type: s3
location: ires-ceph-gl.dh.local
vault: /dh-irods-bucket-dev
free space: 
free space time: : Never
status: 
info: 
comment: 
create time: 01681981574: 2023-04-20.11:06:14
modify time: 01681981584: 2023-04-20.11:06:24
context: S3_DEFAULT_HOSTNAME=minio2.dh.local:9000;S3_AUTH_FILE=/var/lib/irods/minio.keypair;S3_REGIONNAME=irods-dev;S3_RETRY_COUNT=1;S3_WAIT_TIME_SEC=3;S3_PROTO=HTTP;ARCHIVE_NAMING_POLICY=consistent;HOST_MODE=cacheless_attached;S3_CACHE_DIR=/cache
parent: 10173
parent context: 

What did you try to do?

Upload (iput) 20+ GB files to a coordinating replication parent resource with S3 resource children

Expected behavior

A successful operation, or an exit status that indicates failure in case of error.

Observed behavior (including steps to reproduce, if applicable)

Steps to reproduce:
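(Each line emitted by yes 123456789 is 10 bytes, so the head counts below produce files of roughly 8 GB, 20 GB, and 45 GB.)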

irods@icat:/tmp/test_iput$ yes 123456789 | head -858993459 > 8GB.bin
irods@icat:/tmp/test_iput$ yes 123456789 | head -2147483648 > 20GB.bin
irods@icat:/tmp/test_iput$ yes 123456789 | head -4784801280 > 45GB.bin
irods@icat:/tmp/test_iput$ ls -lh
total 73G
-rw-rw-r-- 1 irods irods  20G Apr 20 11:08 20GB.bin
-rw-rw-r-- 1 irods irods  45G Apr 20 11:16 45GB.bin
-rw-rw-r-- 1 irods irods 8.0G Apr 20 11:07 8GB.bin
irods@icat:/tmp/test_iput$ iput -R replRescUMCeph01 8GB.bin 
irods@icat:/tmp/test_iput$ echo $?
0
irods@icat:/tmp/test_iput$ iput -R replRescUMCeph01 20GB.bin 
remote addresses: 172.19.0.13 ERROR: putUtil: put error for /nlmumc/home/rods/20GB.bin, status = -702000 status = -702000 S3_PUT_ERROR
irods@icat:/tmp/test_iput$ echo $?
3
irods@icat:/tmp/test_iput$ iput -R replRescUMCeph01 45GB.bin 
irods@icat:/tmp/test_iput$ echo $?
0
irods@icat:/tmp/test_iput$ ils -l
/nlmumc/home/rods:
  rods              0 replRescUMCeph01;UM-Ceph-S3-GL            0 2023-04-20.11:10 X 20GB.bin
  rods              0 replRescUMCeph01;UM-Ceph-S3-GL   4898339840 2023-04-20.11:17 X 45GB.bin
  rods              0 replRescUMCeph01;UM-Ceph-S3-GL   8589934590 2023-04-20.11:09 & 8GB.bin
  rods              1 replRescUMCeph01;UM-Ceph-S3-AC   8589934590 2023-04-20.11:10 & 8GB.bin

The situation is most likely closely related to "Uploads fail after server to server redirect" #1980, where it is mentioned that:

This doesn't appear to be an issue when cache is used. (This is either because of timing differences or some recent server changes fixed this.)

However, the issue is still triggered in our environment for file sizes above 20 GB, even though a cache file is created during the upload:

dev-icat-1  | Apr 20 11:12:02 pid:3461 NOTICE: remoteFileClose: rcFileClose failed for 3, status = -702000
dev-icat-1  | Apr 20 11:12:02 pid:3461 remote addresses: 172.19.0.13, 192.168.64.17, 192.168.64.7 ERROR: [rsDataObjClose:794] - [S3_PUT_ERROR: [close_physical_file_and_throw_on_failure:531] - failed to close physical file [error_code=[-702000], path=[/nlmumc/home/rods/20GB.bin], hierarchy=[replRescUMCeph01;UM-Ceph-S3-GL], physical_path=[/dh-irods-bucket-dev/home/rods/20GB.bin]]

dev-ires-ceph-gl-1  | Apr 20 11:12:00 pid:581 remote addresses: 192.168.64.13 ERROR: /irods_resource_plugin_s3/s3/s3_transport/src/s3_transport.cpp:224 [store_and_log_status] [[140675869951744]]  libs3_types::status: [XmlParseFailure] - 31
dev-ires-ceph-gl-1  | Apr 20 11:12:00 pid:581 remote addresses: 192.168.64.13 ERROR: /irods_resource_plugin_s3/s3/s3_transport/src/s3_transport.cpp:227 [store_and_log_status] [[140675869951744]]  S3Host: minio2.dh.local:9000
dev-ires-ceph-gl-1  | Apr 20 11:12:00 pid:581 remote addresses: 192.168.64.13 ERROR: /irods_resource_plugin_s3/s3/s3_transport/src/s3_transport.cpp:231 [store_and_log_status] [[140675869951744]]  Function: s3_multipart_upload::callback_for_write_to_s3_base::on_response_completion
dev-ires-ceph-gl-1  | Apr 20 11:12:00 pid:581 remote addresses: 192.168.64.13 ERROR: /irods_resource_plugin_s3/s3/s3_transport/include/s3_transport.hpp:2055 (s3_upload_part_worker_routine) [[140675869951744]] S3_upload_part returned error [status=XmlParseFailure][attempt=1][retry_count_limit=1].  Sleeping between 1 and 3 seconds
dev-ires-ceph-gl-1  | Apr 20 11:12:00 pid:581 remote addresses: 192.168.64.13, 192.168.64.8 ERROR: /irods_resource_plugin_s3/s3/s3_transport/src/s3_transport.cpp:224 [store_and_log_status] [[140675714627328]]  libs3_types::status: [XmlParseFailure] - 31
dev-ires-ceph-gl-1  | Apr 20 11:12:00 pid:581 remote addresses: 192.168.64.13, 192.168.64.8 ERROR: /irods_resource_plugin_s3/s3/s3_transport/src/s3_transport.cpp:224 [store_and_log_status] [[140675848869632]]  libs3_types::status: [XmlParseFailure] - 31
dev-ires-ceph-gl-1  | Apr 20 11:12:00 pid:581 remote addresses: 192.168.64.13, 192.168.64.8 ERROR: /irods_resource_plugin_s3/s3/s3_transport/src/s3_transport.cpp:227 [store_and_log_status] [[140675714627328]]  S3Host: minio2.dh.local:9000
dev-ires-ceph-gl-1  | Apr 20 11:12:00 pid:581 remote addresses: 192.168.64.13, 192.168.64.8 ERROR: /irods_resource_plugin_s3/s3/s3_transport/src/s3_transport.cpp:227 [store_and_log_status] [[140675848869632]]  S3Host: minio2.dh.local:9000
dev-ires-ceph-gl-1  | Apr 20 11:12:00 pid:581 remote addresses: 192.168.64.13 ERROR: /irods_resource_plugin_s3/s3/s3_transport/src/s3_transport.cpp:224 [store_and_log_status] [[140675840476928]]  libs3_types::status: [XmlParseFailure] - 31
dev-ires-ceph-gl-1  | Apr 20 11:12:00 pid:581 remote addresses: 192.168.64.13 ERROR: /irods_resource_plugin_s3/s3/s3_transport/src/s3_transport.cpp:231 [store_and_log_status] [[140675848869632]]  Function: s3_multipart_upload::callback_for_write_to_s3_base::on_response_completion
dev-ires-ceph-gl-1  | Apr 20 11:12:00 pid:581  ERROR: /irods_resource_plugin_s3/s3/s3_transport/src/s3_transport.cpp:224 [store_and_log_status] [[140675748198144]]  libs3_types::status: [XmlParseFailure] - 31
dev-ires-ceph-gl-1  | Apr 20 11:12:00 pid:581 remote addresses: 192.168.64.13 ERROR: /irods_resource_plugin_s3/s3/s3_transport/src/s3_transport.cpp:224 [store_and_log_status] [[140675832084224]]  libs3_types::status: [XmlParseFailure] - 31
dev-ires-ceph-gl-1  | Apr 20 11:12:00 pid:581  ERROR: /irods_resource_plugin_s3/s3/s3_transport/src/s3_transport.cpp:224 [store_and_log_status] [[140675723020032]]  libs3_types::status: [XmlParseFailure] - 31
dev-ires-ceph-gl-1  | Apr 20 11:12:00 pid:581 remote addresses: 192.168.64.13 ERROR: /irods_resource_plugin_s3/s3/s3_transport/src/s3_transport.cpp:224 [store_and_log_status] [[140675857262336]]  libs3_types::status: [XmlParseFailure] - 31
dev-ires-ceph-gl-1  | Apr 20 11:12:00 pid:581 remote addresses: 192.168.64.13 ERROR: /irods_resource_plugin_s3/s3/s3_transport/src/s3_transport.cpp:231 [store_and_log_status] [[140675714627328]]  Function: s3_multipart_upload::callback_for_write_to_s3_base::on_response_completion
dev-ires-ceph-gl-1  | Apr 20 11:12:00 pid:581 remote addresses: 192.168.64.13 ERROR: /irods_resource_plugin_s3/s3/s3_transport/src/s3_transport.cpp:227 [store_and_log_status] [[140675840476928]]  S3Host: minio2.dh.local:9000
dev-ires-ceph-gl-1  | Apr 20 11:12:00 pid:581 remote addresses: 192.168.64.13 ERROR: /irods_resource_plugin_s3/s3/s3_transport/src/s3_transport.cpp:224 [store_and_log_status] [[140675731412736]]  libs3_types::status: [XmlParseFailure] - 31
dev-ires-ceph-gl-1  | Apr 20 11:12:00 pid:581 remote addresses: 192.168.64.13 ERROR: /irods_resource_plugin_s3/s3/s3_transport/include/s3_transport.hpp:2055 (s3_upload_part_worker_routine) [[140675848869632]] S3_upload_part returned error [status=XmlParseFailure][attempt=1][retry_count_limit=1].  Sleeping between 1 and 3 seconds
dev-ires-ceph-gl-1  | Apr 20 11:12:00 pid:581 remote addresses: 192.168.64.13 ERROR: /irods_resource_plugin_s3/s3/s3_transport/src/s3_transport.cpp:227 [store_and_log_status] [[140675832084224]]  S3Host: minio2.dh.local:9000
dev-ires-ceph-gl-1  | Apr 20 11:12:00 pid:581  ERROR: /irods_resource_plugin_s3/s3/s3_transport/src/s3_transport.cpp:224 [store_and_log_status] [[140675739805440]]  libs3_types::status: [XmlParseFailure] - 31
dev-ires-ceph-gl-1  | Apr 20 11:12:00 pid:581 remote addresses: 192.168.64.13 ERROR: /irods_resource_plugin_s3/s3/s3_transport/src/s3_transport.cpp:227 [store_and_log_status] [[140675723020032]]  S3Host: minio2.dh.local:9000
dev-ires-ceph-gl-1  | Apr 20 11:12:00 pid:581 remote addresses: 192.168.64.13 ERROR: /irods_resource_plugin_s3/s3/s3_transport/src/s3_transport.cpp:227 [store_and_log_status] [[140675748198144]]  S3Host: minio2.dh.local:9000
dev-ires-ceph-gl-1  | Apr 20 11:12:00 pid:581 remote addresses: 192.168.64.13 ERROR: /irods_resource_plugin_s3/s3/s3_transport/src/s3_transport.cpp:231 [store_and_log_status] [[140675840476928]]  Function: s3_multipart_upload::callback_for_write_to_s3_base::on_response_completion
dev-ires-ceph-gl-1  | Apr 20 11:12:00 pid:581 remote addresses: 192.168.64.13 ERROR: /irods_resource_plugin_s3/s3/s3_transport/src/s3_transport.cpp:227 [store_and_log_status] [[140675857262336]]  S3Host: minio2.dh.local:9000
dev-ires-ceph-gl-1  | Apr 20 11:12:00 pid:581 remote addresses: 192.168.64.13 ERROR: /irods_resource_plugin_s3/s3/s3_transport/include/s3_transport.hpp:2055 (s3_upload_part_worker_routine) [[140675714627328]] S3_upload_part returned error [status=XmlParseFailure][attempt=1][retry_count_limit=1].  Sleeping between 1 and 3 seconds
dev-ires-ceph-gl-1  | Apr 20 11:12:00 pid:581 remote addresses: 192.168.64.13 ERROR: /irods_resource_plugin_s3/s3/s3_transport/src/s3_transport.cpp:231 [store_and_log_status] [[140675832084224]]  Function: s3_multipart_upload::callback_for_write_to_s3_base::on_response_completion
dev-ires-ceph-gl-1  | Apr 20 11:12:00 pid:581 remote addresses: 192.168.64.13 ERROR: /irods_resource_plugin_s3/s3/s3_transport/src/s3_transport.cpp:227 [store_and_log_status] [[140675731412736]]  S3Host: minio2.dh.local:9000
dev-ires-ceph-gl-1  | Apr 20 11:12:00 pid:581 remote addresses: 192.168.64.13 ERROR: /irods_resource_plugin_s3/s3/s3_transport/src/s3_transport.cpp:231 [store_and_log_status] [[140675723020032]]  Function: s3_multipart_upload::callback_for_write_to_s3_base::on_response_completion
dev-ires-ceph-gl-1  | Apr 20 11:12:00 pid:581  ERROR: /irods_resource_plugin_s3/s3/s3_transport/src/s3_transport.cpp:231 [store_and_log_status] [[140675748198144]]  Function: s3_multipart_upload::callback_for_write_to_s3_base::on_response_completion
dev-ires-ceph-gl-1  | Apr 20 11:12:00 pid:581 remote addresses: 192.168.64.13 ERROR: /irods_resource_plugin_s3/s3/s3_transport/src/s3_transport.cpp:227 [store_and_log_status] [[140675739805440]]  S3Host: minio2.dh.local:9000
dev-ires-ceph-gl-1  | Apr 20 11:12:00 pid:581 remote addresses: 192.168.64.13 ERROR: /irods_resource_plugin_s3/s3/s3_transport/include/s3_transport.hpp:2055 (s3_upload_part_worker_routine) [[140675840476928]] S3_upload_part returned error [status=XmlParseFailure][attempt=1][retry_count_limit=1].  Sleeping between 1 and 3 seconds
dev-ires-ceph-gl-1  | Apr 20 11:12:00 pid:581 remote addresses: 192.168.64.13 ERROR: /irods_resource_plugin_s3/s3/s3_transport/src/s3_transport.cpp:231 [store_and_log_status] [[140675857262336]]  Function: s3_multipart_upload::callback_for_write_to_s3_base::on_response_completion
dev-ires-ceph-gl-1  | Apr 20 11:12:00 pid:581 remote addresses: 192.168.64.13 ERROR: /irods_resource_plugin_s3/s3/s3_transport/include/s3_transport.hpp:2055 (s3_upload_part_worker_routine) [[140675832084224]] S3_upload_part returned error [status=XmlParseFailure][attempt=1][retry_count_limit=1].  Sleeping between 1 and 3 seconds
dev-ires-ceph-gl-1  | Apr 20 11:12:00 pid:581 remote addresses: 192.168.64.13 ERROR: /irods_resource_plugin_s3/s3/s3_transport/include/s3_transport.hpp:2055 (s3_upload_part_worker_routine) [[140675723020032]] S3_upload_part returned error [status=XmlParseFailure][attempt=1][retry_count_limit=1].  Sleeping between 1 and 3 seconds
dev-ires-ceph-gl-1  | Apr 20 11:12:00 pid:581 remote addresses: 192.168.64.13 ERROR: /irods_resource_plugin_s3/s3/s3_transport/include/s3_transport.hpp:2055 (s3_upload_part_worker_routine) [[140675748198144]] S3_upload_part returned error [status=XmlParseFailure][attempt=1][retry_count_limit=1].  Sleeping between 1 and 3 seconds
dev-ires-ceph-gl-1  | Apr 20 11:12:00 pid:581 remote addresses: 192.168.64.13 ERROR: /irods_resource_plugin_s3/s3/s3_transport/src/s3_transport.cpp:231 [store_and_log_status] [[140675731412736]]  Function: s3_multipart_upload::callback_for_write_to_s3_base::on_response_completion
dev-ires-ceph-gl-1  | Apr 20 11:12:00 pid:581 remote addresses: 192.168.64.13 ERROR: /irods_resource_plugin_s3/s3/s3_transport/src/s3_transport.cpp:231 [store_and_log_status] [[140675739805440]]  Function: s3_multipart_upload::callback_for_write_to_s3_base::on_response_completion
dev-ires-ceph-gl-1  | Apr 20 11:12:00 pid:581 remote addresses: 192.168.64.13 ERROR: /irods_resource_plugin_s3/s3/s3_transport/include/s3_transport.hpp:2055 (s3_upload_part_worker_routine) [[140675857262336]] S3_upload_part returned error [status=XmlParseFailure][attempt=1][retry_count_limit=1].  Sleeping between 1 and 3 seconds
dev-ires-ceph-gl-1  | Apr 20 11:12:00 pid:581 remote addresses: 192.168.64.13 ERROR: /irods_resource_plugin_s3/s3/s3_transport/include/s3_transport.hpp:2055 (s3_upload_part_worker_routine) [[140675731412736]] S3_upload_part returned error [status=XmlParseFailure][attempt=1][retry_count_limit=1].  Sleeping between 1 and 3 seconds
dev-ires-ceph-gl-1  | Apr 20 11:12:00 pid:581 remote addresses: 192.168.64.13 ERROR: /irods_resource_plugin_s3/s3/s3_transport/include/s3_transport.hpp:2055 (s3_upload_part_worker_routine) [[140675739805440]] S3_upload_part returned error [status=XmlParseFailure][attempt=1][retry_count_limit=1].  Sleeping between 1 and 3 seconds
dev-ires-ceph-gl-1  | 
dev-ires-ceph-gl-1  | ERROR: OK
dev-ires-ceph-gl-1  | Apr 20 11:12:02 pid:581 remote addresses: 192.168.64.13, 192.168.64.8 ERROR: /irods_resource_plugin_s3/s3/s3_transport/include/s3_transport.hpp:482 (close) [[140676242560960]] flush_cache_file returned error
dev-ires-ceph-gl-1  | Apr 20 11:12:02 pid:581 remote addresses: 192.168.64.13 ERROR: [-]	/repos/irods/server/api/src/rsFileClose.cpp:120:int _rsFileClose(rsComm_t *, fileCloseInp_t *) :  status [S3_PUT_ERROR]  errno [] -- message [fileClose failed for [3]]
dev-ires-ceph-gl-1  | 	[-]	/repos/irods/server/drivers/src/fileDriver.cpp:167:irods::error fileClose(rsComm_t *, irods::first_class_object_ptr) :  status [S3_PUT_ERROR]  errno [] -- message [failed to call 'close']
dev-ires-ceph-gl-1  | 		[-]	/repos/irods/plugins/resources/replication/librepl.cpp:837:irods::error repl_file_close(irods::plugin_context &) :  status [S3_PUT_ERROR]  errno [] -- message [Failed while calling child operation.]
dev-ires-ceph-gl-1  | 			[-]	/irods_resource_plugin_s3/s3/s3_transport/include/s3_transport.hpp:483:virtual bool irods::experimental::io::s3_transport::s3_transport<char>::close(const irods::experimental::io::on_close_success *) [CharT = char] :  status [S3_PUT_ERROR]  errno [] -- message [flush_cache_file returned error]

Most importantly, for files of 45 GB and above, the error is silent: the return status code is 0 (success).

We get the following line in the iCAT logs, but nothing in the S3 server logs:

dev-icat-1  | Apr 20 11:19:53 pid:3480 remote addresses: 172.19.0.13, 192.168.64.17, 192.168.64.7 ERROR: [rsDataObjClose:794] - [Unknown iRODS error: [update_replica_size_and_throw_on_failure:454] - failed to get size in vault [error_code=[4898339840], path=[/nlmumc/home/rods/45GB.bin], hierarchy=[replRescUMCeph01;UM-Ceph-S3-GL]]

If the iput is performed directly on the S3 server toward an S3 resource (without a coordinating replication parent), it works fine.
But with a coordinating replication parent as the target, it fails:

irods@ires-ceph-gl:/tmp$ iput -R replRescUMCeph01 20GB.bin
Level 0: selected source hierarchy [replRescUMCeph01;UM-Ceph-S3-AC] is not good and will overwrite an existing replica; replication is not allowed.
irods@ires-ceph-gl:/tmp$ ils -l 20GB.bin
  rods              0 replRescUMCeph01;UM-Ceph-S3-AC  21474836480 2023-04-17.12:28 X 20GB.bin
  rods              1 replRescUMCeph01;UM-Ceph-S3-GL  21474836480 2023-04-17.12:28 X 20GB.bin

Note: We are already planning to move to the cacheless_detached mode in production, which seems to solve the issue in our environment.
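
For reference, switching the mode is just a matter of changing HOST_MODE in the resource context. A sketch of how it could look for our UM-Ceph-S3-GL resource, based on the context shown in the ilsresc -l output above (iadmin modresc ... context replaces the whole context string, so every key has to be repeated; S3_CACHE_DIR left in place):

iadmin modresc UM-Ceph-S3-GL context "S3_DEFAULT_HOSTNAME=minio2.dh.local:9000;S3_AUTH_FILE=/var/lib/irods/minio.keypair;S3_REGIONNAME=irods-dev;S3_RETRY_COUNT=1;S3_WAIT_TIME_SEC=3;S3_PROTO=HTTP;ARCHIVE_NAMING_POLICY=consistent;HOST_MODE=cacheless_detached;S3_CACHE_DIR=/cache"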

@luijs

luijs commented Apr 21, 2023

I have the same issue here with 4.3.0, ubuntu 20.04, see https://groups.google.com/g/iROD-Chat/c/LhSPiQ4t0fs

@JustinKyleJames
Contributor

I have the same issue here with 4.3.0, ubuntu 20.04, see https://groups.google.com/g/iROD-Chat/c/LhSPiQ4t0fs

When I was testing your issue, I was not using two servers. That is likely why I could not reproduce it. I will try this with multiple servers.

JustinKyleJames self-assigned this Apr 29, 2023
@JustinKyleJames
Contributor

JustinKyleJames commented Apr 29, 2023

I have reproduced this issue. With wire logging enabled, I noticed we were getting a huge Content-Length being sent to UploadPart. I believe this happens when the part size is > 2^31 - 1, and is due to an int64_t being converted to an int and then printed as an int64_t.

This will require a change to libs3.
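
A rough back-of-the-envelope check (assuming the part size is simply the file size divided by the number of upload threads, and the default of 10 threads): the 8 GB test file gives parts of 8,589,934,590 / 10 = 858,993,459 bytes, which fits in a signed 32-bit int, while the 20 GB test file gives 21,474,836,480 / 10 = 2,147,483,648 bytes, which is exactly 2^31 and one past INT_MAX (2,147,483,647). That would explain why the failures start right around 20 GB.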

Please try the following workarounds and tell me if they work for you (both are set in the resource context string):

  1. If possible, set HOST_MODE=cacheless_detached.
  2. Increase S3_MPU_THREADS so that (maximum file size) / S3_MPU_THREADS < 2^31 - 1 (see the sizing sketch below).

Note that I am getting a timeout error after 2 minutes when I try the second option. I am not sure why that is the case as each thread is still actively sending data. This might be a second issue.
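
For the second workaround, a sizing sketch using the ~45 GB test file from this report (my arithmetic, under the same assumption that part size = file size / thread count): the file is 4,784,801,280 lines x 10 bytes = 47,848,012,800 bytes, and 47,848,012,800 / 2,147,483,647 ≈ 22.3, so S3_MPU_THREADS would need to be at least 23. For example, appending S3_MPU_THREADS=32 to the resource context string keeps every part around 1.5 GB, well under 2^31 - 1.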

@trel
Member

trel commented Apr 29, 2023

excellent. what is the default S3_MPU_THREADS if nothing is set in the context string?

@JonathanMELIUS
Author

Thank you for the explanation and the formula. It will make it easier to know what to test for, when needed.

Yes, setting the HOST_MODE to cacheless_detached solves this issue for us.
We are actively planning to roll out this configuration change in our acc/prod environment very soon.
It seems to be the better choice for our iRODS installation, as it also introduces true cache-less/streaming upload for large files.

Was it also possible to reproduce the successful exit code for the failed 45GB file upload?

@JustinKyleJames
Contributor

Thank you for the explanation and the formula. It will make it easier to know what to test for, when needed.

Yes, setting the HOST_MODE to cacheless_detached solves this issue for us. We are actively planning to roll out this configuration change in our acc/prod environment very soon. It seems to be the better choice for our iRODS installation, as it also introduces true cache-less/streaming upload for large files.

Was it also possible to reproduce the successful exit code for the failed 45GB file upload?

I have not reproduced that. I will keep an eye open for it.

@JustinKyleJames
Contributor

excellent. what is the default S3_MPU_THREADS if nothing is set in the context string?

The default is 10.

@trel
Member

trel commented May 1, 2023

The default is 10.

Ah, very good - please get that into the README somewhere as part of one of these tweaks/commits. Thanks.

JustinKyleJames added a commit to JustinKyleJames/irods_resource_plugin_s3 that referenced this issue May 12, 2023
alanking pushed a commit that referenced this issue May 12, 2023
JustinKyleJames added a commit to JustinKyleJames/irods_resource_plugin_s3 that referenced this issue Jul 5, 2023
alanking pushed a commit that referenced this issue Jul 5, 2023
@alanking
Contributor

alanking commented Jul 5, 2023

@JustinKyleJames - Please close if complete. Thanks!

@JustinKyleJames
Contributor

Closing
