Johns Hopkins: MARC files contain non-printing control characters not allowed in XML #829

Open
corylown opened this issue May 20, 2022 · 2 comments

@corylown
Member

corylown commented May 20, 2022

The normalized MARC XML files that POD generates from the MARC21 files submitted by Johns Hopkins contain non-printing control characters that are not allowed in XML. The visible symptom in the UI is that the record count is displayed as ??? for the XML version of the normalized files. Parsing these files with standard MARC tooling such as MarcEdit or ruby-marc raises errors because the files are not well-formed XML. I'm not sure what we can do about this in POD, since the problem originates in the files submitted to POD. The records will still be available to downstream consumers, but those consumers are likely to run into problems using the files because they cannot be parsed.

It's possible we could filter out non-printing characters during the normalized file writing process. Will need to investigate.
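As a rough illustration of what such a filter might look like (this is only a sketch in Python, not POD's actual writer code; the function name `strip_invalid_xml_chars` and the sample string are made up for the example), the idea is to drop every character that falls outside the XML 1.0 `Char` production before the value is serialized:

```python
import re

# Characters permitted by the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# The negated class below matches everything else (e.g. \x07, \x00, \x1b).
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Return text with characters that XML 1.0 forbids removed.

    Hypothetical helper for illustration; a real fix would hook into
    whatever code writes the normalized MARC XML.
    """
    return _INVALID_XML_CHARS.sub("", text)


if __name__ == "__main__":
    # Invented sample value containing two BEL (\x07) characters.
    sample = "Part one\x07 -- Part two\x07"
    print(repr(strip_invalid_xml_chars(sample)))
```

Silently dropping characters changes the data, so it may be worth logging which records were altered so the contributing institution can be told.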

Example record:

JHU 001/bib number: 9638894 contains two instances of \x07 in the MARC 505$a
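To report where such characters occur in a field value (useful when passing details back to the contributing institution), something like the following could work. It is a hedged Python sketch: `find_invalid_xml_chars` is a hypothetical helper, and the sample string is invented rather than taken from the actual 505$a of this record.

```python
def is_xml_char(codepoint: int) -> bool:
    """True if the code point is allowed by the XML 1.0 Char production."""
    return (
        codepoint in (0x09, 0x0A, 0x0D)
        or 0x20 <= codepoint <= 0xD7FF
        or 0xE000 <= codepoint <= 0xFFFD
        or 0x10000 <= codepoint <= 0x10FFFF
    )


def find_invalid_xml_chars(value: str) -> list:
    """Return (offset, escaped_char) pairs for characters XML 1.0 forbids."""
    hits = []
    for offset, ch in enumerate(value):
        codepoint = ord(ch)
        if not is_xml_char(codepoint):
            # Render the offending character as a visible escape, e.g. \x07.
            hits.append((offset, f"\\x{codepoint:02x}"))
    return hits


if __name__ == "__main__":
    # Invented sample with two BEL characters, mimicking the reported symptom.
    sample = "Part 1\x07 -- Part 2\x07"
    for offset, escaped in find_invalid_xml_chars(sample):
        print(f"offset {offset}: {escaped}")
```

Note that this checks decoded subfield text, not raw MARC21 bytes: the binary format itself uses the control bytes 0x1D, 0x1E, and 0x1F as record, field, and subfield delimiters, so a scan over raw bytes would need to exclude those.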

cc @bobpersing

@JohnMarkOckerbloom

It would be best if Johns Hopkins could fix this on their end. (I'm not even sure they know it's happening.) Has @bobpersing talked with them about this?

@bobpersing

I've emailed Jing at Johns Hopkins.
