Crash with NullPointerException while validating an MRC file #356

Open
pabloab opened this issue Nov 13, 2023 · 6 comments
@pabloab

pabloab commented Nov 13, 2023

I recently found this project after searching for regex patterns for each MARC 21 subfield. I'm a little overwhelmed by all its features. I started by trying to get a report on a set of 102964 records from a MARC file exported from a Koha (v22.05) instance.

It processes for a couple of seconds and then starts sending the contents of all the records to stdout. Then it crashes with a NullPointerException.

$ ./validate --summary --marcFormat ISO --schemaType MARC21 --defaultEncoding UTF-8 koha-2023-11-10.mrc

Nov 13, 2023 5:59:47 PM de.gwdg.metadataqa.marc.cli.ValidatorCli beforeIteration
INFO: schemaType: MARC21
marcVersion: MARC21, MARC21
marcFormat: ISO, Binary (ISO 2709)
dataSource: FILE, from file
limit: -1
offset: -1
MARC files: koha-2023-11-10.mrc
id: null
defaultRecordType: null
fixAlephseq: false
fixAlma: false
alephseq: false
marcxml: false
lineSeparated: false
outputDir: .
trimId: false
ignorableFields: 
allowableRecords: 
ignorableRecords: 
defaultEncoding: UTF-8
alephseqLineType: null
details: true
summary: true
detailsFileName: validation-report.txt
summaryFileName: null
format: simple text
emptyLargeCollectors: false

Nov 13, 2023 5:59:47 PM de.gwdg.metadataqa.marc.cli.ValidatorCli beforeIteration
INFO: details output: ./validation-report.txt
Nov 13, 2023 5:59:47 PM de.gwdg.metadataqa.marc.cli.utils.RecordIterator start
INFO: marcVersion: MARC21, MARC21
Nov 13, 2023 5:59:47 PM de.gwdg.metadataqa.marc.cli.utils.RecordIterator processFile
INFO: processing: koha-2023-11-10.mrc
[main] INFO org.reflections.Reflections - Reflections took 119 ms to scan 1 urls, producing 3 keys and 445 values
Nov 13, 2023 6:00:02 PM de.gwdg.metadataqa.marc.cli.utils.RecordIterator processContent
SEVERE: No record number at 89353, last known ID: PAPER-28569

[....]

999   $c102477$d102477
952   $00$10$2udc$40$50$6504482_S6999$73$9307558$aBC$bBC$cDEP$d2023-11-10$eDiego Lisandro Sonzogni Mazzaro$i91451$l0$o504.4(82) S6999$p91451$r2023-11-10$w2023-11-10$yDEP$�$�$�$�flex$�$�DO$�$�

Nov 13, 2023 5:40:28 PM de.gwdg.metadataqa.marc.cli.utils.RecordIterator processFile
INFO: Finished processing file. Processed 102,125 records.
Nov 13, 2023 5:40:28 PM de.gwdg.metadataqa.marc.cli.ValidatorCli afterIteration
INFO: printCounter
Nov 13, 2023 5:40:28 PM de.gwdg.metadataqa.marc.cli.ValidatorCli afterIteration
INFO: printSummary
Nov 13, 2023 5:40:28 PM de.gwdg.metadataqa.marc.cli.ValidatorCli afterIteration
INFO: printCategoryCounts
Nov 13, 2023 5:40:28 PM de.gwdg.metadataqa.marc.cli.ValidatorCli afterIteration
INFO: printTypeCounts
Nov 13, 2023 5:40:28 PM de.gwdg.metadataqa.marc.cli.ValidatorCli afterIteration
INFO: printTotalCounts
Nov 13, 2023 5:40:28 PM de.gwdg.metadataqa.marc.cli.ValidatorCli afterIteration
INFO: printCollector
Exception in thread "main" java.lang.NullPointerException: file
        at java.base/java.util.Objects.requireNonNull(Objects.java:246)
        at org.apache.commons.io.FileUtils.openOutputStream(FileUtils.java:2444)
        at org.apache.commons.io.FileUtils.writeStringToFile(FileUtils.java:3540)
        at de.gwdg.metadataqa.marc.cli.ValidatorCli.printToFile(ValidatorCli.java:465)
        at de.gwdg.metadataqa.marc.cli.ValidatorCli.print(ValidatorCli.java:459)
        at de.gwdg.metadataqa.marc.cli.ValidatorCli.printCollectorEntry(ValidatorCli.java:445)
        at de.gwdg.metadataqa.marc.cli.ValidatorCli.printCollector(ValidatorCli.java:312)
        at de.gwdg.metadataqa.marc.cli.ValidatorCli.afterIteration(ValidatorCli.java:294)
        at de.gwdg.metadataqa.marc.cli.utils.RecordIterator.start(RecordIterator.java:91)
        at de.gwdg.metadataqa.marc.cli.ValidatorCli.main(ValidatorCli.java:107)
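
A note on reading this trace: `Objects.requireNonNull(obj, message)` copies its second argument into the exception message, so `NullPointerException: file` suggests that the `File` argument passed down from `printToFile` was itself null. The log above also prints `summaryFileName: null`, which may be related, though nothing in this thread confirms that. A minimal sketch of the mechanism, using a hypothetical `NullFileDemo` class that is not QA catalogue code:

```java
import java.io.File;
import java.util.Objects;

// Hypothetical demo class; it only illustrates where the ": file" message comes from.
class NullFileDemo {

    // Mimics a writer that insists on a non-null target, the way the
    // top frames of the stack trace do via Objects.requireNonNull.
    static String describeTarget(File file) {
        try {
            Objects.requireNonNull(file, "file");
            return "would write to " + file.getPath();
        } catch (NullPointerException e) {
            // The exception message is exactly the label passed above.
            return "NullPointerException: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(describeTarget(new File("validation-report.txt")));
        System.out.println(describeTarget(null));
    }
}
```

If that reading is right, the crash would be independent of the record content, which would fit with the one-record file failing in the same way.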

It seems it doesn't consider that a subfield code could be a Unicode character such as a Greek letter (alpha, beta, gamma...):

    <subfield code="o">504.4(82) S6999</subfield>
    <subfield code="p">91451</subfield>
    <subfield code="r">2023-11-10</subfield>
    <subfield code="w">2023-11-10</subfield>
    <subfield code="y">DEP</subfield>
    <subfield code="&#x3B4;">flex</subfield>
    <subfield code="&#x3C3;">DO</subfield>
  </datafield>
</record>
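
To make the point about Greek subfield codes concrete, here is a small sketch. It assumes the usual MARC 21 convention that subfield codes are lowercase ASCII letters or digits, and it shows that δ (U+03B4) and σ (U+03C3) each take two bytes in UTF-8. A reader that takes a single byte as the code would split such a character, which might explain the replacement characters in the dump above; that is an assumption, not something verified here. `SubfieldCodeDemo` and its methods are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of QA catalogue or marc4j.
class SubfieldCodeDemo {

    // Assumed convention: subfield codes are lowercase ASCII letters or digits.
    static boolean isStandardCode(char code) {
        return (code >= 'a' && code <= 'z') || (code >= '0' && code <= '9');
    }

    // Number of bytes the code occupies when the record is encoded as UTF-8.
    static int utf8Length(char code) {
        return String.valueOf(code).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Codes taken from the 952 field above: o, p, y, delta, sigma.
        char[] codes = {'o', 'p', 'y', '\u03B4', '\u03C3'};
        for (char code : codes) {
            System.out.printf("U+%04X standard=%b utf8Bytes=%d%n",
                (int) code, isStandardCode(code), utf8Length(code));
        }
    }
}
```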
@pkiraly
Owner

pkiraly commented Nov 13, 2023

Dear @pabloab,

Thanks for giving QA catalogue a try. Which version of the software do you use? Is it a release, or did you build it from the source code? (I guess it is a released one.) Is this file downloadable from somewhere, or could you upload some records? If you do not want to make it available in the issue, you can send it to me by email: kirunews x gmail. So far I have not worked with records that have Greek characters as subfield codes.

I guess the problem is caused by this line:

FileUtils.writeStringToFile(file, content, Charset.defaultCharset(), true)

Do you know what the default character set is on your machine? I think we should use UTF-8 instead.
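
A minimal sketch of writing with an explicit charset. To stay dependency-free it uses the JDK's `java.nio.file.Files` rather than Commons IO; with `FileUtils.writeStringToFile` the same idea would be to pass `StandardCharsets.UTF_8` in place of `Charset.defaultCharset()`. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class, not the actual ValidatorCli code.
class Utf8AppendDemo {

    // Appends content to the file as UTF-8, creating the file if needed.
    static void append(Path file, String content) {
        try {
            Files.writeString(file, content, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes the lines to a temporary file and reads them back as UTF-8.
    static String roundTrip(String... lines) {
        try {
            Path report = Files.createTempFile("validation-report", ".txt");
            try {
                for (String line : lines) {
                    append(report, line);
                }
                return Files.readString(report, StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(report);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Subfields with Greek codes, as in the 952 field of the report.
        System.out.print(roundTrip("952 $\u03B4flex\n", "952 $\u03C3DO\n"));
    }
}
```

With the charset fixed in the call, the output no longer depends on the JVM's default encoding or the machine's locale.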

And out of curiosity: does UBA stand for Universidad de Buenos Aires?

@pkiraly pkiraly self-assigned this Nov 13, 2023
pkiraly added a commit that referenced this issue Nov 13, 2023
…e explicit UTF-8 instead of the default charset when writing to file.
@pabloab
Author

pabloab commented Nov 13, 2023

I'm using v0.6.0, installed via wget/unzip. The locale is en_US.UTF-8.

I exported a new mrc with just one record, and got the same error:

$ ./validate --summary --marcFormat ISO --schemaType MARC21 --defaultEncoding UTF-8 /tmp/koha3bis.mrc

Nov 13, 2023 7:36:38 PM de.gwdg.metadataqa.marc.cli.ValidatorCli beforeIteration
INFO: schemaType: MARC21
marcVersion: MARC21, MARC21
marcFormat: ISO, Binary (ISO 2709)
dataSource: FILE, from file
limit: -1
offset: -1
MARC files: /tmp/koha3bis.mrc
id: null
defaultRecordType: null
fixAlephseq: false
fixAlma: false
alephseq: false
marcxml: false
lineSeparated: false
outputDir: .
trimId: false
ignorableFields: 
allowableRecords: 
ignorableRecords: 
defaultEncoding: UTF-8
alephseqLineType: null
details: true
summary: true
detailsFileName: validation-report.txt
summaryFileName: null
format: simple text
emptyLargeCollectors: false

Nov 13, 2023 7:36:38 PM de.gwdg.metadataqa.marc.cli.ValidatorCli beforeIteration
INFO: details output: ./validation-report.txt
Nov 13, 2023 7:36:38 PM de.gwdg.metadataqa.marc.cli.utils.RecordIterator start
INFO: marcVersion: MARC21, MARC21
Nov 13, 2023 7:36:38 PM de.gwdg.metadataqa.marc.cli.utils.RecordIterator processFile
INFO: processing: koha3bis.mrc
[main] INFO org.reflections.Reflections - Reflections took 146 ms to scan 1 urls, producing 3 keys and 445 values
Nov 13, 2023 7:36:38 PM de.gwdg.metadataqa.marc.cli.utils.RecordIterator processFile
INFO: Finished processing file. Processed 1 records.
Nov 13, 2023 7:36:38 PM de.gwdg.metadataqa.marc.cli.ValidatorCli afterIteration
INFO: printCounter
Nov 13, 2023 7:36:38 PM de.gwdg.metadataqa.marc.cli.ValidatorCli afterIteration
INFO: printSummary
Nov 13, 2023 7:36:38 PM de.gwdg.metadataqa.marc.cli.ValidatorCli afterIteration
INFO: printCategoryCounts
Nov 13, 2023 7:36:38 PM de.gwdg.metadataqa.marc.cli.ValidatorCli afterIteration
INFO: printTypeCounts
Nov 13, 2023 7:36:38 PM de.gwdg.metadataqa.marc.cli.ValidatorCli afterIteration
INFO: printTotalCounts
Nov 13, 2023 7:36:38 PM de.gwdg.metadataqa.marc.cli.ValidatorCli afterIteration
INFO: printCollector
Exception in thread "main" java.lang.NullPointerException: file
	at java.base/java.util.Objects.requireNonNull(Objects.java:246)
	at org.apache.commons.io.FileUtils.openOutputStream(FileUtils.java:2444)
	at org.apache.commons.io.FileUtils.writeStringToFile(FileUtils.java:3540)
	at de.gwdg.metadataqa.marc.cli.ValidatorCli.printToFile(ValidatorCli.java:465)
	at de.gwdg.metadataqa.marc.cli.ValidatorCli.print(ValidatorCli.java:459)
	at de.gwdg.metadataqa.marc.cli.ValidatorCli.printCollectorEntry(ValidatorCli.java:445)
	at de.gwdg.metadataqa.marc.cli.ValidatorCli.printCollector(ValidatorCli.java:312)
	at de.gwdg.metadataqa.marc.cli.ValidatorCli.afterIteration(ValidatorCli.java:294)
	at de.gwdg.metadataqa.marc.cli.utils.RecordIterator.start(RecordIterator.java:91)
	at de.gwdg.metadataqa.marc.cli.ValidatorCli.main(ValidatorCli.java:107)
$ cat validation-report.txt

"id","MarcPath","categoryId","typeId","type","message","url","instances","records"
2,931,3,9,undefined field,931,,1,1
7,999,3,9,undefined field,999,,1,1
1,691,3,9,undefined field,691,,1,1
5,976,3,9,undefined field,976,,1,1
6,997,3,9,undefined field,997,,1,1
4,962,3,9,undefined field,962,,1,1
3,942,3,9,undefined field,942,,1,1
$ yaz-marcdump  /tmp/koha3bis.mrc

01222cam a22004217a 4500
001 BIBLO-1
005 20230517170929.0
008 000201m19291951nyua|d|f |||| 00| 0|spa|d
044    $a xxu
080    $a 535 $b W759
100 1  $a Winchell, Alexander Newton $4 aut $e autor
245 10 $a Elements of optical mineralogy : $b an introduction to microscopic petrography
250    $a 4th. ed.
260    $a New York, NY : $b Wiley, $c 1929-1951
300    $a 3 v. : $b il., diagrs., tablas (algunas col.)
541    $c DO $a Dr. Ruben Cucchi $n V2E8
562    $e 3V1, 8V2, 3V3
653 10 $a MINERALOGIA
653 10 $a CRISTALOGRAFIA
653 10 $a MINERALES
653 10 $a MINERALES ISOTROPOS
653 10 $a MINERALOGIA OPTICA
653 10 $a OXIDOS
653 10 $a CARBONATOS
653 10 $a MINERALES OPACOS
653 10 $a MINERALES ANISOTROPOS
653 10 $a MINERALES BIRREFRIGENTES
653 10 $a NITRATOS
653 10 $a SORATOS
653 10 $a SULFATOS
653 10 $a FOSFATOS
691  7 $2 fcen-at $a geologia
931    $a PALEO $b PALEONTOLOGIA
942    $2 udc $n 0
962    $a info:eu-repo/semantics/book $a info:ar-repo/semantics/libro $b info:eu-repo/semantics/publishedVersion
976    $a AEX
997    $a MONOGRAF
999    $c 1 $d 1

Yes, stands for Universidad de Buenos Aires. Glad you know about us 😄

@pkiraly
Owner

pkiraly commented Nov 14, 2023

@pabloab Thanks! I tested it. It really does throw an exception with the 0.6.0 release, but it was fixed in 0.7.0, and it also works well with the current development version. So my suggestion is to use 0.7.0 or, if you would like to keep up to date with the latest features, the current source code.

My knowledge about Universidad de Buenos Aires is quite limited, but I know that one of my favorite authors, Jorge Luis Borges was a professor of English at your university before he was appointed as a director of the national library. The teaching activities (such as a seminar about the Saxon language) and teaching subjects (the thoughts of his favorite English writers) appeared in his writings here and there. But it is a good time to learn more about the university itself!

@pabloab
Author

pabloab commented Nov 14, 2023

I first tried to install v0.7.0 by changing the 6 to a 7 in the wget line. Now, after a closer look, I notice that the URL points to another repo, metadata-qa-marc, which has v0.6.0 but not v0.7.0 (in turn, the older version is not present on qa-catalogue).

I tried with v0.7.0 and indeed it doesn't crash. I had other issues that maybe I could file separately:

  • Everything is sent to stderr, both info and warnings/errors.

  • Even with --summary I get a dump of 3k+ records after the message SEVERE: No record number at _, last known ID: CONTROLNUMBERPREFFIX-123.

    • A feature request would be to add a space around subfield codes, like the default line mode MARC output format of yaz-marcdump.
  • I get a bunch of:

    Nov 14, 2023 4:53:40 PM de.gwdg.metadataqa.marc.cli.utils.RecordIterator extracted
    SEVERE: Error (illegal argument) with record 'CONTROLNUMBERPREFFIX-456'. Error in '001 CONTROLNUMBERPREFFIX-456': no type has been detected. Leader: '01276ca  a22002773a 4500'.
    Nov 14, 2023 4:53:40 PM de.gwdg.metadataqa.marc.cli.utils.RecordIterator extracted
    SEVERE: start
    java.security.InvalidParameterException: Error in '001 CONTROLNUMBERPREFFIX-456': no type has been detected. Leader: '01276ca  a22002773a 4500'.
        at de.gwdg.metadataqa.marc.MarcFactory.createFromMarc4j(MarcFactory.java:156)
        at de.gwdg.metadataqa.marc.cli.utils.RecordIterator.transformMarcRecord(RecordIterator.java:191)
        at de.gwdg.metadataqa.marc.cli.utils.RecordIterator.processContent(RecordIterator.java:172)
        at de.gwdg.metadataqa.marc.cli.utils.RecordIterator.processFile(RecordIterator.java:113)
        at de.gwdg.metadataqa.marc.cli.utils.RecordIterator.start(RecordIterator.java:81)
        at de.gwdg.metadataqa.marc.cli.ValidatorCli.main(ValidatorCli.java:107)
    

    The problem wasn't in LDR/06 (Type of record) but in LDR/07 (Bibliographic level), which should be one of [abcdims] and was \s.
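
The leader check discussed here can be sketched as follows. This is a simplified illustration, not QA catalogue's actual logic: it only handles Leader/06 = 'a' (language material), following the MARC 21 rule that 'a' combined with level a, c, d or m means books and 'a' combined with b, i or s means continuing resources. `BOOKS` matches the flag value mentioned in this thread; `CONTINUING_RESOURCES` and the class name are made-up labels.

```java
import java.util.Optional;

// Hypothetical sketch; the real detection code is not shown in this thread.
class LeaderTypeDemo {

    // Values listed in this thread as valid for Leader/07.
    static final String VALID_BIB_LEVELS = "abcdims";

    static char typeOfRecord(String leader) {
        return leader.charAt(6);
    }

    static char bibliographicLevel(String leader) {
        return leader.charAt(7);
    }

    // Returns the detected type, or empty when the 06/07 pair is not recognized.
    static Optional<String> detectType(String leader) {
        char type = typeOfRecord(leader);
        char level = bibliographicLevel(leader);
        if (type == 'a' && "acdm".indexOf(level) >= 0) {
            return Optional.of("BOOKS");
        }
        if (type == 'a' && "bis".indexOf(level) >= 0) {
            return Optional.of("CONTINUING_RESOURCES");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String failing = "01276ca  a22002773a 4500"; // leader from the error message
        String working = "01222cam a22004217a 4500"; // leader from the one-record sample
        for (String leader : new String[] {failing, working}) {
            char level = bibliographicLevel(leader);
            System.out.printf("06='%c' 07='%c' validLevel=%b type=%s%n",
                typeOfRecord(leader), level,
                VALID_BIB_LEVELS.indexOf(level) >= 0,
                detectType(leader).orElse("none"));
        }
    }
}
```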


I also really like Borges (I recently revisited an interview). I was lucky enough to be a professor for some years at that same campus, Puán, which now has its own film (from what I see in the trailer it captures the academic interns quite well).

We also have a copy of H. P. Lovecraft's Necronomicon. Of course, I made sure it has its MARC record 😉

@pkiraly
Owner

pkiraly commented Nov 14, 2023

These are a number of different things:

  1. wget: my mistake, I am fixing it.

  2. "type" errors

no type has been detected. Leader: '01276ca  a22002773a 4500'.

Here the problem is that in order to process the control fields (mainly 008) we have to figure out the type of the record from Leader/06 (Type of record) and Leader/07 (Bibliographic level). Only some combinations of these two characters are valid, and "a " in this case is not among them. You can add an extra flag to all analyses: --defaultRecordType BOOKS, which sets the default record type if the above error happens.

  3. logging: we use java.util.logging.Logger, which could be configured to separate different messages. I am thinking about that. In the common-script file which I mostly use, stderr and stdout are intentionally redirected to the same place. For me it is easier to check everything in one place, but you are right, there might be different expectations.

  4. "A feature request would be to add a space around subfield codes, like the default line mode MARC output format of yaz-marcdump."
    Could you post an example output? In which file does it happen?
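
On the logging point, one possible way to separate the streams with `java.util.logging` is to attach two handlers to a logger: one that only lets records below WARNING through, and one that starts at WARNING. This is a sketch under that assumption, not the project's configuration; `SplitLoggingDemo` and its methods are hypothetical names.

```java
import java.io.OutputStream;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;
import java.util.logging.StreamHandler;

// Hypothetical sketch of routing INFO and SEVERE messages to different streams.
class SplitLoggingDemo {

    // StreamHandler buffers its output, so flush after every record.
    static StreamHandler flushingHandler(OutputStream out) {
        StreamHandler handler = new StreamHandler(out, new SimpleFormatter()) {
            @Override
            public synchronized void publish(LogRecord record) {
                super.publish(record);
                flush();
            }
        };
        handler.setLevel(Level.ALL);
        return handler;
    }

    static Logger splitLogger(String name, OutputStream infoOut, OutputStream errorOut) {
        Logger logger = Logger.getLogger(name);
        logger.setUseParentHandlers(false); // skip the default console handler
        logger.setLevel(Level.ALL);

        // Records below WARNING go to the first stream only.
        StreamHandler low = flushingHandler(infoOut);
        low.setFilter(r -> r.getLevel().intValue() < Level.WARNING.intValue());

        // WARNING and SEVERE go to the second stream only.
        StreamHandler high = flushingHandler(errorOut);
        high.setLevel(Level.WARNING);

        logger.addHandler(low);
        logger.addHandler(high);
        return logger;
    }

    public static void main(String[] args) {
        Logger logger = splitLogger("demo", System.out, System.err);
        logger.info("processing: koha3bis.mrc");
        logger.severe("No record number at 89353");
    }
}
```

Calling `splitLogger` twice with the same name would add the handlers twice, so a real setup would configure the logger once at startup.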

Borges: many thanks! I was not aware of that interview. I like another one from the same time a lot: https://www.youtube.com/watch?v=bNxzQSheCkc, which was done in English for a US TV show. Borges said interesting things, for example that Latin America did not produce literature which would be interesting for the rest of the world. That was some years before Márquez's Nobel Prize and the big success of other Latin American writers (Llosa, Cortázar, etc.). Does Borges have a sculpture or some other memorial at Puán? The film seems to be interesting, and the situation is quite typical in the academic world.

@pabloab
Author

pabloab commented Nov 15, 2023

  • Record type: Clarified. The message "no type has been detected" led me to think it was referring only to LDR/06.

  • Could you post an example output? In which file does it happen?

    I meant yaz-marcdump, from the YAZ toolkit (available as a deb package on Debian and its derivatives). I posted a sample in a previous comment.

    validate output:

    image

    yaz-marcdump output (with a syntax highlight I built over bat)

    image

  • The previous screenshots also show the encoding issue. In other projects the problem was Java using or forcing Latin-1 instead of UTF-8.

  • Probably I shouldn't ask this here, but: why does validate dump record contents if I used --summary? I was expecting something like a table of the form count, field, subfield, type_of_error. This is the kind of table we usually get from marclint with time marclint file.mrc | sort | uniq -c | sort -rn | head -30.
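
For comparison, the `sort | uniq -c | sort -rn | head -N` pipeline above can be sketched in Java as a small tally: count identical messages, order by descending count, keep the first N. The sample messages in `main` are made up for illustration (loosely modeled on the "undefined field" rows earlier in the thread), and `ErrorTallyDemo` is a hypothetical name, not a QA catalogue class.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical sketch of a "count, message" summary table.
class ErrorTallyDemo {

    // Similar in spirit to: sort | uniq -c | sort -rn | head -N
    static List<Map.Entry<String, Long>> top(List<String> messages, int limit) {
        // TreeMap keeps keys sorted, so ties stay in alphabetical order below.
        Map<String, Long> counts = messages.stream()
            .collect(Collectors.groupingBy(m -> m, TreeMap::new, Collectors.counting()));
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(limit)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample input, one message per detected problem.
        List<String> messages = List.of(
            "999: undefined field",
            "942: undefined field",
            "999: undefined field",
            "691: undefined field",
            "999: undefined field",
            "942: undefined field");
        for (Map.Entry<String, Long> entry : top(messages, 30)) {
            System.out.printf("%7d %s%n", entry.getValue(), entry.getKey());
        }
    }
}
```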


AFAIK there is no Borges statue at Puán. No one doubts his talent as a writer, but his political opinions (which he himself says shouldn't be taken into account) are at the opposite extreme from those of a vast majority, especially there.
