Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CsvBooks doesn't like periods in filenames #498

Open
bondjimbond opened this issue Feb 13, 2019 · 17 comments
Open

CsvBooks doesn't like periods in filenames #498

bondjimbond opened this issue Feb 13, 2019 · 17 comments

Comments

@bondjimbond
Copy link
Collaborator

bondjimbond commented Feb 13, 2019

I got a bunch of files with names like this: 1987.0019.0039.0001.TIF

I set page_sequence_separator = .

Result:

[2019-02-13 18:44:34] input validator.ERROR: Input validation failed {"record ID":"1","book object directory":"/Volumes/Barkerville/RG136/books/1","error":"Some files in the book object directory have invalid extensions"} []

If I change the filename: 1987.0019.0039_0001.tif and set page_sequence_separator = _, it works.

Hypothesis: using a period as a page sequence separator doesn't work because MIK reads ".tif" as the page rather than the extension.

Alternate hypothesis: MIK is reading .0019.0039.0001.tif as a file extension rather than as part of a filename followed by a page number followed by extension.

Is it possible to fix this?

@mjordan
Copy link
Collaborator

mjordan commented Feb 16, 2019

I think this is more to do with using TIF as an extension that using a period. The validation error is coming from https://github.com/MarcusBarnes/mik/blob/master/src/inputvalidators/CsvBooks.php#L140. Coincidentally I opened a PR (#496) two weeks ago that will let you indicate TIF as a valid file extension.

Just to be sure, can you change the filenames back to using periods as separators but leave the (currently invalid until we merge #496) TIF extension as is and see what happens?

@mjordan
Copy link
Collaborator

mjordan commented Feb 16, 2019

Sorry, my logic is wrong. Change the extension to tif and try using periods as separators.

@bondjimbond
Copy link
Collaborator Author

OK, a couple of tests:

Separator _, extension tif = success
Separator _, extension TIF = fail
Separator ., extension TIF = fail
Separator ., extension tif = success

So it looks like the uppercase/lowercase extension is the culprit.

But interestingly, it behaves very unpredictably in my set -- because the first record is not .0001.tif but instead .0000.tif.

The presence of this file results in weird behaviours. Typically directory 0 just isn't generated, but it can also cause MIK to skip, say, directory 2, or more directories.

@mjordan
Copy link
Collaborator

mjordan commented Feb 19, 2019

@bondjimbond when you say "can also", are you saying that it behaves differently across runs of MIK with the same input data?

@bondjimbond
Copy link
Collaborator Author

bondjimbond commented Feb 19, 2019

when you say "can also", are you saying that it behaves differently across runs of MIK with the same input data?

Yes... I was running with a test directory of just four images (0000.tif through 0003.tif).

The first few runs, it produced directories 1, 2, and 3.

Later runs, just 1 and 3.

Another later run, just 3.

After removing 0000.tif, it consistently produced 1, 2, and 3.

@mjordan
Copy link
Collaborator

mjordan commented Feb 19, 2019

Could you zip up all your data and config files and send them to me so I can try to replicate that?

@bondjimbond
Copy link
Collaborator Author

Sure, here's my test directory and ini file: https://vault.sfu.ca/index.php/s/ChGaez7NLOygY3w

@mjordan
Copy link
Collaborator

mjordan commented Feb 19, 2019

OK, got it, I'll give it a try this evening.

@mjordan
Copy link
Collaborator

mjordan commented Feb 19, 2019

Can you send me your mappings file?

@bondjimbond
Copy link
Collaborator Author

Ack, of course, sorry
barkerville_mapping.txt

@mjordan
Copy link
Collaborator

mjordan commented Feb 20, 2019

@bondjimbond Strangely, I can't replicate the behavior you are seeing. I ran MIK about 10 times and always got the same thing: page objects for pages 1-3 and an error indicating a problem with the 0000 file ([...] added by me):

"message":"mkdir(): File exists" [...] "filename_segments":["1987","0019","0039","0000"],"page_number":""

The problem is coming from https://github.com/MarcusBarnes/mik/blob/master/src/writers/CsvBooks.php#L132-L134: since we trim all left padding 0s, we need something other than a 0000 as the page number. I'm not sure a fix to allow 0000 as a page number would be trivial.

@mjordan
Copy link
Collaborator

mjordan commented Feb 20, 2019

Although a check to see if $page_number is an empty string, and if it is, assign it a value of 0 to create a 0 directory, might be a simple fix. But do you want 0 to be the first page number instead of 1?

Something like:

$page_number = ltrim(end($filename_segments), '0');
if (strlen($page_number) === 0) {
  $page_number = '0';
}
$page_level_output_dir = $book_level_output_dir . DIRECTORY_SEPARATOR . $page_number;
mkdir($page_level_output_dir);

@mjordan
Copy link
Collaborator

mjordan commented Feb 20, 2019

Just tried that, it worked:

/tmp/brandon_books/
└── 1987
    ├── 0
    │   ├── MODS.xml
    │   └── OBJ.tif
    ├── 1
    │   ├── MODS.xml
    │   └── OBJ.tif
    ├── 2
    │   ├── MODS.xml
    │   └── OBJ.tif
    ├── 3
    │   ├── MODS.xml
    │   └── OBJ.tif
    └── MODS.xml

MODS.xml for page 0 is:

  <titleInfo>
    <title>This is a title, page 0</title>
  </titleInfo>
</mods>

MODS.xml for page 1 is:

  <titleInfo>
    <title>This is a title, page 1</title>
  </titleInfo>
</mods>

@bondjimbond
Copy link
Collaborator Author

That's exactly what I need! :)

@mjordan
Copy link
Collaborator

mjordan commented Feb 20, 2019

OK, I can open a PR for this if you want.

@bondjimbond
Copy link
Collaborator Author

Please do!

mjordan added a commit that referenced this issue Feb 20, 2019
@mjordan
Copy link
Collaborator

mjordan commented Feb 20, 2019

I've made the same change to the CsvNewspapers writer and pushed up the issue-498 branch. I'll need to assemble some test data later but once I do that I'll open a PR.

@mjordan mjordan mentioned this issue Feb 21, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants