The pep_database_mod_data_obj_meta_* PEPs not called when new file uploaded #6385

Closed
3 tasks done
tedgin opened this issue May 11, 2022 · 20 comments

@tedgin

tedgin commented May 11, 2022

  • main
  • 4-3-stable
  • 4-2-stable

Bug Report

iRODS Version, OS and Version

iRODS 4.2.11, CentOS 7

What did you try to do?

I use the PEP pep_database_mod_data_obj_meta_post to enforce various policies. One of those policies is ensuring that every replica has a checksum. In iRODS 4.2.8, when a data object was created for an uploaded file, this PEP was called after the replica was created on a storage resource. The database_mod_data_obj_meta operation was called to set the size of the new replica in the catalog.

I tried to use iput to create a new data object having a checksum using this same rule logic in iRODS 4.2.11, but with a writeLine('serverLog', 'database_mod_data_obj_meta'); added to the top of the rule attached to pep_database_mod_data_obj_meta_post.

Expected behavior

I expect a newly uploaded data object to have a checksum. I also expect to see a message in the server log mentioning database_mod_data_obj_meta.

Observed behavior (including steps to reproduce, if applicable)

I use iput to upload a file, but the new data object doesn't have a checksum and there is no mention of database_mod_data_obj_meta in the server log.

[centos@bms-irods ~]$ iput motd
[centos@bms-irods ~]$ ils -L motd
  centos            0 demoResc            5 2022-05-11.17:40 & motd
        generic    /var/lib/irods/Vault/home/centos/motd

In 4.2.11, the database_mod_data_obj_meta operation isn't called when a file is uploaded, and a newly uploaded data object doesn't receive a checksum.

It looks like the database plugin is being bypassed. With SQL logging enabled, I can see an INSERT statement that creates an entry for the data object and its zeroth replica with size set to zero, but I don't see an UPDATE statement that sets the replica's correct size after it has been written to the storage resource. However, the catalog has the data object's correct size.

korydraughn added this to the 4.2.12 milestone May 12, 2022
@tedgin
Author

tedgin commented May 16, 2022

I verified that this bug exists in 4.2.10 as well.

@korydraughn
Collaborator

That is correct.

With the introduction of logical locking, we failed to realize that database PEPs weren't being triggered due to direct use of nanodbc.

We will restore this ability in 4.2.12, but not through that PEP. We will provide more details once work on 4.2.12 picks up.

Below are a couple of ways to get around this limitation:

  • Use the pep_api_* PEPs
  • Implement a sweeper that finds all replicas without a checksum and processes them

@tedgin
Author

tedgin commented May 16, 2022

Which pep_api_* PEPs are triggered when a replica is created or updated? From what I can tell, the following operations may create or update a replica. The ones preceded by a ! will likely create or update a replica, while the ones preceded by a ? will possibly create or update a replica. I'm going solely off the names, so I may be wildly incorrect.

!  bulk_data_obj_put(*INSTANCE, *COMM, *BULKOPRINP, *BULKOPRINPBBUF)
!  bulk_data_obj_reg(*INSTANCE, *COMM, *BULKDATAOBJREGINP, *BULKDATAOBJREGOUT)
?  data_copy(*INSTANCE, *COMM, *DATACOPYINP)
!  data_obj_copy(*INSTANCE, *COMM, *DATAOBJCOPYINP, *TRANSSTAT)
!  data_obj_create_and_stat(*INSTANCE, *COMM, *DATAOBJINP, *OPENSTAT)
!  data_obj_create(*INSTANCE, *COMM, *DATAOBJINP)
?  data_object_finalize(*INSTANCE, *COMM, *JSON_INPUT, *JSON_OUTPUT)
!  data_obj_phymv(*INSTANCE, *COMM, *DATAOBJINP, *TRANSSTAT)
!  data_obj_put(*INSTANCE, *COMM, *DATAOBJINP, *DATAOBJINPBBUF, *PORTALOPROUT)
!  data_obj_repl(*INSTANCE, *COMM, *DATAOBJINP, *TRANSSTAT)
!  data_obj_rsync(*INSTANCE, *COMM, *DATAOBJINP, *OUTPARAMARRAY)
!  data_obj_truncate(*INSTANCE, *COMM, *DATAOBJTRUNCATEINP)
!  data_obj_write(*INSTANCE, *COMM, *DATAOBJWRITEINP, *DATAOBJWRITEINPBBUF)
?  data_put(*INSTANCE, *COMM, *DATAOPRINP, *PORTALOPROUT)
?  file_create(*INSTANCE, *COMM, *FILECREATEINP, *OUT)
?  file_put(*INSTANCE, *COMM, *FILEPUTINP, *FILEPUTINPBBUF, *PUT_OUT)
?  file_stage_to_cache(*INSTANCE, *COMM, *FILESTAGETOCACHEINP)
?  file_sync_to_arch(*INSTANCE, *COMM, *FILESYNCTOARCHINP, *SYNC_OUT)
?  file_truncate(*INSTANCE, *COMM, *FILETRUNCATEINP)
?  general_update(*INSTANCE, *COMM, *GENERALUPDATEINP)
?  l3_file_put_single_buf(*INSTANCE, *COMM, *L1DESCINX, *DATAOBJINBBUF)
!  phy_path_reg(*INSTANCE, *COMM, *PHYPATHREGINP)
!  reg_data_obj(*INSTANCE, *COMM, *DATAOBJINFO, *OUTDATAOBJINFO)
!  reg_replica(*INSTANCE, *COMM, *REGREPLICAINP)
?  struct_file_ext_and_reg(*INSTANCE, *COMM, *STRUCTFILEEXTANDREGINP)
?  struct_file_extract(*INSTANCE, *COMM, *STRUCTFILEOPRINP)
?  struct_file_sync(*INSTANCE, *COMM, *STRUCTFILEOPRINP)
?  sub_struct_file_create(*INSTANCE, *COMM, *SUBFILE)
?  sub_struct_file_put(*INSTANCE, *COMM, *SUBFILE, *SUBFILEPUTOUTBBUF)
?  sub_struct_file_truncate(*INSTANCE, *COMM, *SUBFILE)
?  sub_struct_file_write(*INSTANCE, *COMM, *SUBSTRUCTFILEWRITEINP, *SUBSTRUCTFILEWRITEOUTBBUF)
?  touch(*INSTANCE, *COMM, *JSON_INPUT)
!  unbun_and_reg_phy_bunfile(*INSTANCE, *COMM, *DATAOBJINP)

@trel
Member

trel commented May 17, 2022

This is in the service of having a checksum on every object, so a two-pronged approach is probably recommended anyway:

  1. A sweeper (recurring code) that enqueues everything without a checksum to have a checksum calculated. This works for the cold start situation, and can also help find any holes in step 2.
  2. Synchronous PEPs that fire and enqueue/calculate checksums on just-uploaded/created data objects.
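
The sweeper in step 1 could look roughly like the Python sketch below. It is an illustration only, not code from this thread: the record keys (`logical_path`, `physical_path`, `checksum`) are hypothetical, the replicas are passed in as plain dictionaries instead of being queried from the catalog, and the `sha2:` prefix followed by a base64 SHA-256 digest is assumed to be the checksum format in use.

```python
import base64
import hashlib


def sha2_checksum(path, chunk_size=1 << 20):
    """Return 'sha2:<base64 of the SHA-256 digest>' for the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large replicas do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return "sha2:" + base64.b64encode(digest.digest()).decode("ascii")


def sweep(replicas, compute=sha2_checksum):
    """Fill in a checksum for every replica record that lacks one.

    `replicas` is a list of dicts with the hypothetical keys
    'logical_path', 'physical_path' and 'checksum'.
    Returns the logical paths of the records that were updated.
    """
    updated = []
    for replica in replicas:
        if not replica.get("checksum"):
            replica["checksum"] = compute(replica["physical_path"])
            updated.append(replica["logical_path"])
    return updated
```

A real sweeper would obtain the records from the catalog and would more likely ask the server to compute and register the checksum than read the vault file directly; `compute` is injectable for that reason.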

So, until we know exactly how many PEPs to instrument/define/use - the following lists can help us get there...

yes

bulk_data_obj_put(*INSTANCE, *COMM, *BULKOPRINP, *BULKOPRINPBBUF)
bulk_data_obj_reg(*INSTANCE, *COMM, *BULKDATAOBJREGINP, *BULKDATAOBJREGOUT)
data_obj_close(*INSTANCE, *COMM, *DATAOBJCLOSEINP)
data_obj_copy(*INSTANCE, *COMM, *DATAOBJCOPYINP, *TRANSSTAT)
data_obj_create_and_stat(*INSTANCE, *COMM, *DATAOBJINP, *OPENSTAT)
data_obj_create(*INSTANCE, *COMM, *DATAOBJINP)
data_obj_open(*INSTANCE, *COMM, *DATAOBJINP)
replica_close(*INSTANCE, *COMM, *JSON_INPUT)
replica_open(*INSTANCE, *COMM, *JSON_INPUT)
data_obj_rsync(*INSTANCE, *COMM, *DATAOBJINP, *OUTPARAMARRAY)
data_obj_put(*INSTANCE, *COMM, *DATAOBJINP, *DATAOBJINPBBUF, *PORTALOPROUT)
phy_path_reg(*INSTANCE, *COMM, *PHYPATHREGINP)
touch(*INSTANCE, *COMM, *JSON_INPUT)

pretty sure no

  • for the sweeper to help determine over time
data_copy(*INSTANCE, *COMM, *DATACOPYINP)
data_obj_phymv(*INSTANCE, *COMM, *DATAOBJINP, *TRANSSTAT)
data_obj_repl(*INSTANCE, *COMM, *DATAOBJINP, *TRANSSTAT)
data_obj_truncate(*INSTANCE, *COMM, *DATAOBJTRUNCATEINP)
data_obj_write(*INSTANCE, *COMM, *DATAOBJWRITEINP, *DATAOBJWRITEINPBBUF)
data_object_finalize(*INSTANCE, *COMM, *JSON_INPUT, *JSON_OUTPUT)
data_put(*INSTANCE, *COMM, *DATAOPRINP, *PORTALOPROUT)
file_create(*INSTANCE, *COMM, *FILECREATEINP, *OUT)
file_put(*INSTANCE, *COMM, *FILEPUTINP, *FILEPUTINPBBUF, *PUT_OUT)
file_stage_to_cache(*INSTANCE, *COMM, *FILESTAGETOCACHEINP)
file_sync_to_arch(*INSTANCE, *COMM, *FILESYNCTOARCHINP, *SYNC_OUT)
file_truncate(*INSTANCE, *COMM, *FILETRUNCATEINP)
general_update(*INSTANCE, *COMM, *GENERALUPDATEINP)
l3_file_put_single_buf(*INSTANCE, *COMM, *L1DESCINX, *DATAOBJINBBUF)
reg_data_obj(*INSTANCE, *COMM, *DATAOBJINFO, *OUTDATAOBJINFO)
reg_replica(*INSTANCE, *COMM, *REGREPLICAINP)
struct_file_ext_and_reg(*INSTANCE, *COMM, *STRUCTFILEEXTANDREGINP)
struct_file_extract(*INSTANCE, *COMM, *STRUCTFILEOPRINP)
struct_file_sync(*INSTANCE, *COMM, *STRUCTFILEOPRINP)
sub_struct_file_create(*INSTANCE, *COMM, *SUBFILE)
sub_struct_file_put(*INSTANCE, *COMM, *SUBFILE, *SUBFILEPUTOUTBBUF)
sub_struct_file_truncate(*INSTANCE, *COMM, *SUBFILE)
sub_struct_file_write(*INSTANCE, *COMM, *SUBSTRUCTFILEWRITEINP, *SUBSTRUCTFILEWRITEOUTBBUF)
unbun_and_reg_phy_bunfile(*INSTANCE, *COMM, *DATAOBJINP)

@tedgin
Author

tedgin commented May 23, 2022

I figured out how to trigger the following "yes" PEPs.

data_obj_copy   # icp
replica_close   # istream
replica_open    # istream
data_obj_put    # iput
data_obj_rsync  # irsync
phy_path_reg    # ireg
touch           # itouch

But I couldn't figure out how to trigger the following ones.

bulk_data_obj_put  # iput -b doesn't trigger it
bulk_data_obj_reg
data_obj_close
data_obj_create_and_stat
data_obj_create
data_obj_open

How does one trigger these PEPs?

@korydraughn
Collaborator

It depends on the API being invoked by the client. Each of these PEPs has a client-side API endpoint. The only way to trigger them is if the client directly invokes one of those calls.

A quick search in the irods repo and icommands repo indicates that the following PEPs are never invoked by the icommands:

pep_api_data_obj_open
pep_api_data_obj_create
pep_api_data_obj_create_and_stat
pep_api_data_obj_close

iput should be able to trigger the bulk_data_obj_put PEP given that putUtil.cpp contains calls to the bulk API.

@tedgin
Author

tedgin commented May 25, 2022

So I need to use python-irodsclient or another client library to trigger the following.

pep_api_data_obj_open
pep_api_data_obj_create
pep_api_data_obj_create_and_stat
pep_api_data_obj_close

If iput -b doesn't trigger bulk_data_obj_put, how do I use iput to trigger this PEP?

Also, how do I trigger bulk_data_obj_reg?

@tedgin
Author

tedgin commented Jun 2, 2022

I figured out how to trigger bulk_data_obj_put. iput needs to be called with both the -b and -r options. I'm guessing that bulk upload isn't used when -b is provided without -r. If this is true, iput help should state this requirement.

@korydraughn
Collaborator

Agreed. We will capture that in the docs.

@tedgin
Author

tedgin commented Jun 2, 2022

It looks like bulk_data_obj_reg can't be triggered by the iCommands. From my understanding of the code, this family of API PEPs is called from the server endpoint rsBulkDataObjReg, which is only accessed from the corresponding client endpoint rcBulkDataObjReg. The iCommands don't appear to use this function.

@tedgin
Author

tedgin commented Jun 3, 2022

data_obj_rsync isn't needed for a checksum calculation rule. These PEPs are triggered by client calls to rcDataObjRsync. For syncing from client to server, this function always calls rcDataObjPut, which triggers the data_obj_put PEPs. Those PEPs are already being implemented.
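
The findings reported so far in this thread can be collected into a small lookup, sketched here in Python. The table only records what was observed above; the function name `expected_peps` and the table layout are made up for illustration, and whether additional PEPs fire in any of these cases is not stated in the thread.

```python
# iCommand -> API PEP families reported in this thread as triggered by it.
OBSERVED_PEPS = {
    "icp": ["data_obj_copy"],
    "istream": ["replica_open", "replica_close"],
    "iput": ["data_obj_put"],
    # Client-to-server irsync was reported to call rcDataObjPut as well.
    "irsync": ["data_obj_rsync", "data_obj_put"],
    "ireg": ["phy_path_reg"],
    "itouch": ["touch"],
}


def expected_peps(command, flags=()):
    """Return the PEP families reported for `command` with the given flags."""
    flags = set(flags)
    # bulk_data_obj_put was only observed when both -b and -r were given.
    # Whether data_obj_put also fires in that case is not stated in the thread.
    if command == "iput" and {"-b", "-r"} <= flags:
        return ["bulk_data_obj_put"]
    # Assumption: -b without -r behaves like a plain iput.
    return list(OBSERVED_PEPS.get(command, []))
```

For example, `expected_peps("iput", ["-b", "-r"])` returns `["bulk_data_obj_put"]`, while `expected_peps("iput", ["-b"])` falls back to `["data_obj_put"]`.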

@tedgin
Author

tedgin commented Sep 22, 2022

For completeness, here is the rule logic for computing checksums using the API PEPs. The PEPs are in chksum.re, and json.re provides supporting logic for parsing JSON.

@alanking
Contributor

alanking commented Mar 7, 2023

I think we can restore a hook similar to pep_database_mod_data_obj_meta_* by lifting the database code from the data object finalize API into a new database plugin operation. This logic is used by all the APIs which use data object close and replica close APIs (which is... most of them) and would have to come through that database plugin operation.

I already had something similar to this (albeit, for different reasons) on standby in a draft PR from way back: #5664

@korydraughn
Collaborator

korydraughn commented Mar 11, 2023

That makes sense to me.

The atomic API plugins share the same issue (that is, they don't trigger any database PEPs because they reach out to the database directly).

Can discuss more next week.

@alanking
Contributor

It occurred to me this morning that pep_resource_modified_post could be an option for detecting when a new file is uploaded. Haven't thought too deeply about it, but, maybe worth looking into as an alternative or an additional option.

@korydraughn
Collaborator

Looking at the docs, that PEP doesn't appear to expose any info about what was modified.

Am I missing something?

@alanking
Contributor

alanking commented Mar 30, 2023

The plugin context should have all of the information necessary to determine what was modified. Otherwise, the coordinating resources themselves wouldn't be able to perform their "modified" functions (e.g. sync to archive, replicate, etc.). However, if the rule does not have access to the plugin context, then it wouldn't be able to access the info itself.

I could be wrong.

@korydraughn
Collaborator

This just means we need to add examples to the developer docs so no one has to figure it out anymore.

That will be covered by the starter project template work I'm still working on.

Carry on.

@alanking
Contributor

alanking commented Apr 4, 2023

I tried implementing pep_resource_modified_post and it may be a viable replacement. Have you tried using this PEP instead, @tedgin?

For reference, here's the information available in the context (I removed the "++++" delimiters and put each value on its own line):

api_index=606
auth_scheme=native
client_addr=192.168.208.3
connect_count=1
dataId=0
dataType=generic
file_descriptor=-1
file_size=283
flags_kw=0
in_pdmo=
l1_desc_idx=-1
logical_path=/tempZone/home/rods/foo
mode_kw=0
openType=3
option=iput
physical_path=/var/lib/irods/Vault/home/rods/foo
proxy_auth_info_auth_flag=5
proxy_auth_info_auth_scheme=
proxy_auth_info_auth_str=
proxy_auth_info_flag=0
proxy_auth_info_host=
proxy_auth_info_ppid=0
proxy_rods_zone=tempZone
proxy_sys_uid=0
proxy_user_name=rods
proxy_user_other_info_user_comments=
proxy_user_other_info_user_create=
proxy_user_other_info_user_info=
proxy_user_other_info_user_modify=
proxy_user_type=
repl_requested=0
resc_hier=demoResc
socket=11
status=0
user_auth_info_auth_flag=5
user_auth_info_auth_scheme=
user_auth_info_auth_str=
user_auth_info_flag=0
user_auth_info_host=
user_auth_info_ppid=0
user_rods_zone=tempZone
user_sys_uid=0
user_user_name=rods
user_user_other_info_user_comments=
user_user_other_info_user_create=
user_user_other_info_user_info=
user_user_other_info_user_modify=
user_user_type=

It may not be everything one might want, but you get the logical path, resource hierarchy, and some other goodies, so it should serve a good many use cases, I'd imagine.
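
To work with these values programmatically, the serialized context could be split back into key/value pairs. The Python sketch below assumes the context arrives as one string of `key=value` items joined by the "++++" delimiter mentioned above; the function name `parse_context` is hypothetical, and how a rule actually obtains that string is not shown in this thread.

```python
def parse_context(serialized, delimiter="++++"):
    """Split a 'key=value++++key=value' string into a dict of strings."""
    fields = {}
    for item in serialized.split(delimiter):
        if not item:
            continue  # skip empty pieces from leading or trailing delimiters
        # Split on the first '=' only, so values may themselves contain '='.
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


sample = (
    "logical_path=/tempZone/home/rods/foo++++"
    "resc_hier=demoResc++++in_pdmo=++++file_size=283"
)
context = parse_context(sample)
print(context["logical_path"], context["resc_hier"])
```

All values stay strings, including empty ones such as `in_pdmo`, so numeric fields like `file_size` need an explicit `int(...)` before use.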

@tedgin
Author

tedgin commented Apr 5, 2023

I haven't tried that PEP. It looks useful. Thanks for pointing it out.

alanking added a commit to alanking/irods that referenced this issue Apr 10, 2023
This adds a test which ensures that the dynamic post-PEP for the data
object finalize database operation runs after a variety of different
operations. Also gives a first application to the JSON microservice
family.
alanking added a commit to alanking/irods that referenced this issue Apr 10, 2023
This adds a database plugin operation for use in the
data_object_finalize API plugin. This is meant to act as interchangeable
logic with the nanodbc-based database interactions which exist in the
API plugin today.

This change is needed in order for zone administrators to implement
policy around data objects being created or modified in the system. In
the past, this was possible because of the "mod data obj meta" database
operation. This was discarded in favor of the nanodbc-based operations
built directly into the data_object_finalize API plugin. This new
database operation should act as a replacement.

Due to logical locking, this database operation will be enacted once
when the data object is opened and once when it is closed. The rule
implementer is responsible for taking the appropriate action.
alanking added a commit that referenced this issue Apr 10, 2023
This adds a test which ensures that the dynamic post-PEP for the data
object finalize database operation runs after a variety of different
operations. Also gives a first application to the JSON microservice
family.
alanking added a commit that referenced this issue Apr 10, 2023
This adds a database plugin operation for use in the
data_object_finalize API plugin. This is meant to act as interchangeable
logic with the nanodbc-based database interactions which exist in the
API plugin today.

This change is needed in order for zone administrators to implement
policy around data objects being created or modified in the system. In
the past, this was possible because of the "mod data obj meta" database
operation. This was discarded in favor of the nanodbc-based operations
built directly into the data_object_finalize API plugin. This new
database operation should act as a replacement.

Due to logical locking, this database operation will be enacted once
when the data object is opened and once when it is closed. The rule
implementer is responsible for taking the appropriate action.