
replication from compound using python #103

Open
ciklysta opened this issue Jan 17, 2022 · 8 comments

@ciklysta

Bug Report

iRODS version 4.2.11, CentOS 7

I have the following resource hierarchy (OldResource being an old resource that I want to migrate data from):

DiskResource:unixfilesystem
OldResource:compound
├── OldArchiveResource:univmss
└── OldCacheResource:unixfilesystem

Data is only in the archive, the cache is empty.
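
For reference, a hierarchy like this is typically assembled with iadmin, roughly as follows (a sketch only; the hostname and vault paths are illustrative assumptions, not taken from this report):

iadmin mkresc DiskResource unixfilesystem myhost:/data/disk
iadmin mkresc OldResource compound
iadmin mkresc OldCacheResource unixfilesystem myhost:/data/cache
iadmin mkresc OldArchiveResource univmss myhost:/data/archive migration-interface.sh
iadmin addchildtoresc OldResource OldCacheResource cache
iadmin addchildtoresc OldResource OldArchiveResource archive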

I've written a custom rule that handles the replication, since it is a lengthy operation that I need to manage myself.

migrateOneObj.r content:

main {
    testReplicationFromCompound();
}
INPUT null
OUTPUT ruleExecOut

core.re content:

testReplicationFromCompound {
    msiDataObjRepl("/zone/file.zip",
        "rescName=OldResource++++backupRescName=DiskResource", *Status);
}

When I run irule -F migrateOneObj.r as the user that owns file.zip, it works correctly.

However, if I move the function testReplicationFromCompound to Python (core.py):

def testReplicationFromCompound(rule_args, callback, rei):
    # the trailing 0 is a placeholder for the *Status output parameter
    callback.msiDataObjRepl("/zone/file.zip",
        "rescName=OldResource++++backupRescName=DiskResource", 0)

and run irule -F migrateOneObj.r (as the user that owns file.zip), the following happens:

  • the operation blocks while retrieving data from the archive and never finishes
  • there are 2 irodsServer processes that started when irule was started; strace shows that
    • first is blocked on read from a pipe
    • second (child of the previous one) is blocked on futex(0x7f85e89d8ec8, FUTEX_WAIT_PRIVATE, 2, NULL
  • rodsLog's last line is
    Jan 17 12:52:03 pid:88 NOTICE: execCmd:../../var/lib/irods/msiExecCmd_bin/migration-interface.sh argv:stageToCache '/data/archive/dev/file.zip' '/data/cache/dev/file.zip'
    
  • the univmss driver (the shell script migration-interface.sh) is never run (otherwise it would create an entry in a custom log file)
  • an empty file is created in the cache resource vault
  • in the SQL database there are 2 rows (see the iquest sketch after this list):
    • the first corresponds to the archive replica (data_is_dirty = 4)
    • the second corresponds to the cache replica (data_is_dirty = 2)
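
For completeness, the same replica states can be inspected without raw SQL via GenQuery (a sketch, using the zone/path from above; DATA_REPL_STATUS corresponds to the data_is_dirty column):

iquest "SELECT DATA_REPL_NUM, DATA_RESC_NAME, DATA_REPL_STATUS WHERE COLL_NAME = '/zone' AND DATA_NAME = 'file.zip'"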
@trel
Member

trel commented Jan 17, 2022

I am not sure of the functionality/quality of the backupRescName keyword (it should probably be deprecated)...
try using destRescName instead to define the target of the replication operation.

https://docs.irods.org/4.2.11/doxygen/reDataObjOpr_8cpp.html#a957a06d93d1100dceb5a497bb9d1253f
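
The same Python function with that change would look roughly like this (a sketch, reusing the names from the report):

def testReplicationFromCompound(rule_args, callback, rei):
    # destRescName names the target resource of the replication
    callback.msiDataObjRepl("/zone/file.zip",
        "rescName=OldResource++++destRescName=DiskResource", 0)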

It's possible you've found an issue similar to #54 - but this sounds a bit different.

@ciklysta
Author

I"ve just tried destRescName. There is no difference.

The linked bug arises only when multiple threads are used. However, this one indeed has a different cause, as it occurs even with numThreads=1.

@trel
Member

trel commented Jan 20, 2022

Upon further reading/consultation.... we think this is definitely the same as #54.

#54 was reported against 4.2.6 and 4.2.7, before we introduced logical locking in 4.2.9, which makes all data movement create a placeholder in the catalog first... like only parallel transfer did in 4.2.8 and before. This matches the scenario you're seeing above.

Pretty sure this is the reason... #1

  • a PREP-wide mutex getting wedged waiting on a child waiting on its parent.

Also explains why it works fine without coming through the Python rule engine plugin.
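
To see the suspected pattern in isolation, here is a minimal standalone sketch (plain Python, not iRODS code): a process that forks while a lock is held leaves the child with a locked mutex that no thread in the child will ever release.

import os
import threading

lock = threading.Lock()   # stands in for the plugin-wide mutex

with lock:                # parent takes the lock...
    pid = os.fork()       # ...then forks a child while still holding it
    if pid == 0:
        # the child's copy of the lock is already held, but the thread
        # that would release it does not exist here: blocks forever
        lock.acquire()
        os._exit(0)
    os.waitpid(pid, 0)    # parent never releases: it waits on the child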

We'll test whether that lock is still required/essential.

@ciklysta
Author

Thank you for the investigation. In that case, this ticket is a duplicate.

trel transferred this issue from irods/irods Jan 24, 2022
@dworkin

dworkin commented Nov 9, 2022

@trel @ciklysta This may be a duplicate of irods/irods#6622 instead of #54: the problem occurs when numThreads=1 and msiExecCmd is used with a Python rule on the call stack.

@trel
Member

trel commented Nov 12, 2022

Oh, interesting...

Any chance you think that irods/irods#6622 is actually, itself, the same as #54?

In other words, should we now re-test #54 to see if it is still a deadlock with the new irods/irods#6622 codefix in place?

@dworkin

dworkin commented Nov 12, 2022

Both issues deadlock on the same lock, and both require a Python rule on the call stack, but they are distinct. #54 deadlocks without using msiExecCmd.
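
A sketch of two minimal probes to tell them apart (function names are hypothetical; "hello" is assumed to be the stock example script in msiExecCmd_bin, and the trailing 0s are output-parameter placeholders):

def probe_exec_cmd(rule_args, callback, rei):
    # expected to hang only if the irods/irods#6622 deadlock is present
    callback.msiExecCmd("hello", "", "null", "null", "null", 0)

def probe_replication(rule_args, callback, rei):
    # exercises the #54 path: data movement without msiExecCmd
    callback.msiDataObjRepl("/zone/file.zip",
        "rescName=OldResource++++destRescName=DiskResource", 0)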

@trel
Member

trel commented Nov 13, 2022

Got it - right.
