ETT-1346: testing for populate_rights_data.pl by aelkiss · Pull Request #79 · hathitrust/post_zephir_processing

aelkiss · 2026-05-12T17:23:28Z

wrap populate_rights_data.pl in "main" function so we can unit test
use Test2::Tools::Spec (similar to Test::Spec but maintained)
remove unused --source option from populate_rights
don't set a default source (from old HT-1733)

aelkiss · 2026-05-12T17:24:06Z

    'archive=s'       => \$archive,
    'rights_dir=s'    => \$rights_dir,
    'note=s'          => \$note,
-    'source=s'        => \$new_source_cmdline,


As far as I know this is not an option we would ever use.

aelkiss · 2026-05-12T17:26:17Z

+  }

-# Structure to keep track of results - which records were created.
-my %results;


As written now, these need to be global, which makes this a bit more difficult to test. We could consider turning this whole thing into an object to have some place to store those results.

aelkiss · 2026-05-12T17:28:14Z

+      print $rights "prtest.goodline2\tpd\tbib\ttestuser\tgoogle\n";
+      close($rights);
+
+      my $res = qx(perl -w bin/populate_rights_data.pl --data=$tempdir/testfile3.rights --archive=$tempdir/archive 2>&1);


I wanted to avoid too much refactoring without tests. Adding main got it to the point where we could unit test anything, but running main itself leaves things in a weird state and potentially calls exit and ends the tests early. I guess the question is if it's worth it right now to do more refactoring on populate_rights_data.pl to make this more amenable to integration tests as well without needing to shell out.

I would put a ticket in to refactor the logic in a more OO way, but would be perfectly satisfied with the new level of testing represented here.

aelkiss · 2026-05-12T17:30:38Z

@moseshll questions to consider for review:

Do the tests make sense & do they test what they claim to test?
Is there anything else we should test?
Do the changes in populate_rights_data.pl itself make sense?
Should we consider further refactoring now to avoid the need to shell out for integration tests & to make it possible to assert things about what's currently %results?

aelkiss · 2026-05-12T17:41:36Z

+use File::Temp qw(tempdir);
+use POSIX qw(strftime);
+use Test2::Bundle::Extended;
+use Test2::Tools::Spec;


This is similar to Test::Spec in syntax, but appears to be maintained.

aelkiss · 2026-05-12T17:44:10Z

+      like $@, qr(Invalid source);
+    };
+
+    it "requires source if not previously loaded" => sub {


This is new behavior. Bib rights is generally responsible for getting the "source" (digitizer) from the Zephir metadata and setting it. In practice, this prevents us from accidentally loading rights for things that aren't in the repository.

aelkiss · 2026-05-12T17:46:06Z

+      is([2,1,4],$rights);
+    };
+
+    it "new source updates access profile (as specified in sources)" => sub {


The access profile stuff is kind of a mess; it's specified in multiple tables -- ht_collection_digitizers as well as sources. Ideally we'd probably be using ht_collection_digitizers in preference to sources. (There are also multiple keys for the digitization agent, and part of this stems from work around sources and access profiles that was started in 2014 and never fully completed after some staff departures.)

aelkiss · 2026-05-12T17:47:24Z

+      ok($?);
+      ok($res =~ /Invalid namespace\/barcode/);
+
+      # Should have loaded goodline1, but not goodline2 (since it bailed out after badline)


I'm not sure this is really the behavior that we want, but it is the current behavior.

We should probably consider what we want to happen in this case -- I suspect ideally it would load the rest of the file and alert us in some way to the problem line, or maybe load nothing (but still alert.) Loading part of the file seems like a recipe for problems.

Personally I do not like the idea of loading part of a file. Bad lines should be skipped and reported, and since they're independent of each other a bad one need not bring everything to a halt. (I suspect the most likely issue we will see is Unicode or other text weirdness in the note field.)

moseshll · 2026-05-13T17:01:13Z

I have a couple of suggestions.

 || die("$thisprog -ERR- Database error: " . $dbh->errstr());

is a repeated trope that can be stuck in something like db_prepare and db_execute utilities. Cutting down on long line noise and edge case handling.

Another thing I would add to increase coverage (and Devel::Cover is definitely being stretched to the breaking point on this one so I don't want to lean on it too heavily), is

  it "ic/bib overrides pdus/gfv" => sub { expect_no_update('pdus','gfv','ic','bib') };
  it "und/bib overrides pdus/gfv" => sub { expect_no_update('pdus','gfv','und','bib') };
  it "pd/bib overrides pdus/gfv" => sub { expect_update('pdus','gfv','pd','bib') };
  it "pdus/bib overrides pdus/gfv" => sub { expect_update('pdus','gfv','pdus','bib') };

in the gfv overrides tests. More test coverage we want in place for OO refactor.

In the main the changes make sense and seem well exercised. The coverage achieved here is respectable for a monolith with so much edge case resilience baked in.

moseshll

I would definitely add the suggested additional gfv tests, provided they are actually correct (I think they are). All else passes muster.

aelkiss · 2026-05-13T18:44:07Z

I don't think || die("$thisprog -ERR- Database error: " . $dbh->errstr()); should be necessary at all -- the connection can automatically raise an exception on error, and probably does by default. I'll check that behavior.

I went ahead and added the gfv tests. I also did some other refactoring for readability and work on the way towards the desired behavior w/ access profile for harvard material.

aelkiss · 2026-05-13T19:02:51Z

Verified the constant checks for database error don't do anything -- if there is a database error, DBI raises an exception itself rather than returning, at least in the way we're using it.

aelkiss · 2026-05-13T21:58:05Z

What's here now passes the tests and I believe implements the requirement around using the 'open' access profile for Harvard, but there are several things we should fix while we have the opportunity; having the tests enables refactoring.

I went ahead and removed the explicit error checking for database operations, because I don't think it was doing anything.

I set this to draft for now; there are a few things I'd like to do:

Fix the issue with invalid lines causing partially-loaded files
Use date comparison rather than string comparison for the cutoff date (maybe consider doing this in a sql query rather than perl?)
Pre-fetch values from attributes, rights, sources, access profiles to a hash rather than constantly querying the db. They aren't going to change over the lifetime of populate_rights, and the constant querying just makes the code harder to follow.
Extract a function for fetching a single value/row from a database query using a prepared statement (maybe obviated by above)
Consider untangling some of the nested if statements

If it makes sense to extract some objects in the course of doing that I think there's no reason to hold off.
@moseshll If you want to take a crack at any of that while I'm out, feel free, but I think it's fine to wait until I'm back.

aelkiss · 2026-05-13T22:00:05Z

This also has the risk of becoming a bigger/long running PR, and it might make sense to draw a line at adding the tests and doing a little bit of the cleanup/refactoring before moving on to the support for the Harvard access profiles. @moseshll let me know what you think; again, I can take care of that when I'm back.

aelkiss · 2026-05-22T16:17:25Z

@moseshll Let me know what you think about the new commits since you last reviewed and where to draw the line here with what we merge:

adding tests for existing stuff
adding the Harvard access profile
doing some of the additional refactoring listed above.

moseshll · 2026-05-22T18:51:45Z

@aelkiss I would suggest addressing items #1 (partial file load "bug") and #3 (pre-fetch) from the checklist you have above. The former because it's kinda yucky. The latter is an opportunity to get some more into lib and tested without diving into refactoring logic. And if that's too much for now stick to #1 and wait on everything else.

I optimistically put some of this ticket on my TODO notes but alas it stayed buried near the back of the LIFO.

moseshll · 2026-05-22T20:14:04Z

-        }
+  while (my ($priority, $codes) = each %$priorities) {
+    if (grep {$_ eq $code} @$codes) {
+      warn("$attr/$reason has two priorities ($priority and $toreturn)??") if defined $toreturn;


An argument could be made for not doing this check at all here, but instead checking the data for > 1 priority assigned to a rights combination as part of the tests. Since by the presence of this line it is worth checking for. Better to fail test than be warned at runtime.

I've pushed a fix that obviates the need for this by inverting the hash.

aelkiss · 2026-05-22T20:18:44Z

Sounds good. I'll aim to work on this next week.

aelkiss · 2026-05-27T17:01:28Z

@moseshll I've addressed everything I think I'm planning to at this time.

However, I'm rethinking the partial load thing. Right now, we have no way to get alerted when there are problems in a file. So far as I know, we haven't ever really run into the case of malformed rights files in production (this would be more likely to happen if we were manually generating rights files, which we don't any more.)

However, what we could run into is a database issue, which would succeed if retried.

In the current production behavior, a file that's only partially been processed will be left in the directory to be automatically retried later. If we report the failures but continue on, we will move the file to the processed area and be unaware of the failures.

If we instead keep the current retry process, we could still end up with a situation where there are flappings rights -- e.g. a file with a partial failure gets loaded over and over again, overriding later generated rights.

Ultimately I'm not sure what makes sense to do without significant changes to the whole process, so @moseshll I think we should discuss further.

aelkiss · 2026-05-27T19:03:53Z

That said, I think my leaning @moseshll is that the current behavior (bail out and leave the file in the 'to process' directory) is somewhat safer given that the more likely source of problems is database connection issues or other external, temporary issues rather than issues with the data.

If we were able to easily separate out data issues from database issues, maybe I'd lean the other way, but Perl's entirely text-based free-form exception system leaves something to be desired on that front. I suppose we could include some kind of tag if there was a data issue and skip that line if so, and re-raise if not, BUT given that (so far as I know) there's never been a data issue with automatically generated rights files, that level of complexity is probably not worth it right now.

In any case @moseshll let me know what you think but my leaning at this point is to:

revert f5b0152
make some follow-up issues -- further refactoring to make it more testable & avoid the module globals through OO; cleaning up more of the nested conditionals; figuring out a better way of error handling & reporting
merge and deploy

moseshll

I agree with the reversion to "bail out and leave for subsequent run" approach. Other than that, what we have here is plenty good. APPROVE

* wrap populate_rights_data.pl in "main" function so we can unit test * use Test2::Tools::Spec (similar to Test::Spec but maintained) * remove unused --source option from populate_rights * don't set a default source (from old HT-1733)

* add constants for numerical values * make process_rights_line tests more readable by using join("\t",...) * stub out tests for harvard access profile (skipped; currently failing) * use selectrow_array when only fetching a single value * use placeholders for values selectrow_arrayref (more to do) * add additional gfv tests

* Avoid explicit error checking -- if there is an error, the DBI functions themselves raise an exception. * Prepare statements for things previously using selectrow_arrayref / selectcol_arrayref in populate_rights_data.pl

Initial pass at updating access profile for Harvard material based on scan date.

https://metacpan.org/pod/DBI#finish says: "When all the data has been fetched from a SELECT statement, the driver will automatically call finish for you. So you should not call it explicitly except when you know that you've not fetched all the data from a statement handle and the handle won't be destroyed soon." Generally we're fetching one row from things that should return one or zero rows, so there isn't a need to use it.

* reduces number of database calls * simplifies nested conditionals

Avoids issue of needing to search arrays and the possibility of values being present for multiple keys. Removes unused 'orph/ddd' and 'orphcand/ddd' combinations.

aelkiss · 2026-05-28T19:41:53Z

I backed out that commit. I'm going to go ahead and merge this, but I'll wait until next week to deploy.

aelkiss requested a review from moseshll May 12, 2026 17:23

aelkiss commented May 12, 2026

View reviewed changes

Comment thread bin/populate_rights_data.pl Outdated

aelkiss commented May 12, 2026

View reviewed changes

Comment thread bin/populate_rights_data.pl Outdated

aelkiss commented May 12, 2026

View reviewed changes

aelkiss force-pushed the ETT-1346-hvd-access-profile branch from 7feb98f to cacc03f Compare May 12, 2026 17:29

aelkiss commented May 12, 2026

View reviewed changes

moseshll approved these changes May 13, 2026

View reviewed changes

aelkiss marked this pull request as draft May 13, 2026 21:49

aelkiss requested a review from moseshll May 22, 2026 16:17

moseshll reviewed May 22, 2026

View reviewed changes

Comment thread bin/populate_rights_data.pl Outdated

moseshll reviewed May 22, 2026

View reviewed changes

Comment thread bin/populate_rights_data.pl Outdated

moseshll reviewed May 22, 2026

View reviewed changes

aelkiss force-pushed the ETT-1346-hvd-access-profile branch from ad2c87a to a8bec4c Compare May 27, 2026 16:56

aelkiss requested a review from moseshll May 27, 2026 17:01

moseshll approved these changes May 28, 2026

View reviewed changes

aelkiss added 4 commits May 28, 2026 15:39

ETT-1346: testing for populate_rights_data.pl

8619de7

* wrap populate_rights_data.pl in "main" function so we can unit test * use Test2::Tools::Spec (similar to Test::Spec but maintained) * remove unused --source option from populate_rights * don't set a default source (from old HT-1733)

Back out $no_wait; remove 10 second delay

c22f2f2

Database refactoring

f27f44b

* Avoid explicit error checking -- if there is an error, the DBI functions themselves raise an exception. * Prepare statements for things previously using selectrow_arrayref / selectcol_arrayref in populate_rights_data.pl

aelkiss force-pushed the ETT-1346-hvd-access-profile branch from a8bec4c to c436422 Compare May 28, 2026 19:40

aelkiss added 6 commits May 28, 2026 15:40

ETT-1346 - access profile for Harvard material

ac12efc

Initial pass at updating access profile for Harvard material based on scan date.

Preload data for static tables

c72ec19

* reduces number of database calls * simplifies nested conditionals

Use Date_Cmp for comparing scan date & harvard date

15aa665

Remove crmsworld

cbbd8fe

Swap order of priorities hash

bb2e086

Avoids issue of needing to search arrays and the possibility of values being present for multiple keys. Removes unused 'orph/ddd' and 'orphcand/ddd' combinations.

aelkiss force-pushed the ETT-1346-hvd-access-profile branch from c436422 to bb2e086 Compare May 28, 2026 19:41

aelkiss marked this pull request as ready for review May 28, 2026 19:41

aelkiss merged commit f06d50e into main May 28, 2026
1 check passed

Conversation

aelkiss commented May 12, 2026

Uh oh!

Uh oh!

aelkiss May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

aelkiss commented May 12, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

aelkiss May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

moseshll commented May 13, 2026

Uh oh!

moseshll left a comment

Choose a reason for hiding this comment

Uh oh!

aelkiss commented May 13, 2026

Uh oh!

aelkiss commented May 13, 2026

Uh oh!

aelkiss commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

aelkiss commented May 13, 2026

Uh oh!

aelkiss commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

moseshll commented May 22, 2026

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

aelkiss commented May 22, 2026

Uh oh!

aelkiss commented May 27, 2026

Uh oh!

aelkiss commented May 27, 2026

Uh oh!

moseshll left a comment

Choose a reason for hiding this comment

Uh oh!

aelkiss commented May 28, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

aelkiss May 12, 2026 •

edited

Loading

aelkiss May 12, 2026 •

edited

Loading

aelkiss commented May 13, 2026 •

edited

Loading

aelkiss commented May 22, 2026 •

edited

Loading