Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ignore invalid utf-8 sequences by default #475

Closed

Conversation

de-code
Copy link
Collaborator

@de-code de-code commented Aug 5, 2019

This is a workaround for #472

By default it will ignore invalid UTF-8 sequences as it did in GROBID 0.5.3 and before.
The validation can be turned on via the properties file (grobid.3rdparty.pdf2xml.validation.enabled).

/cc @lfoppiano

@@ -15,6 +15,7 @@
String PROP_3RD_PARTY_PDFTOXML = "grobid.3rdparty.pdf2xml.path";
String PROP_3RD_PARTY_PDFTOXML_MEMORY_LIMIT = "grobid.3rdparty.pdf2xml.memory.limit.mb";
String PROP_3RD_PARTY_PDFTOXML_TIMEOUT_SEC = "grobid.3rdparty.pdf2xml.memory.timeout.sec";
String PROP_3RD_PARTY_PDFTOXML_VALIDATION_ENABLED = "grobid.3rdparty.pdf2xml.validation.enabled";
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am assuming this also belongs to "third party"? (just trying to be consistent)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no it's the XML parsing part inside grobid... not third party, but I would suggest to be robust wrt encoding all the time, so not introduce this additional property.

@@ -569,6 +569,12 @@ public static Integer getPdfToXMLTimeoutMs() {
return Integer.parseInt(getPropertyValue(GrobidPropertyKeys.PROP_3RD_PARTY_PDFTOXML_TIMEOUT_SEC, "60"), 10) * 1000;
}

public static boolean isPdfToXMLValidationEnabled() {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite sure why we spell Pdf and XML. Java naming conventions is all caps but personally I prefer Pdf and Xml.

Copy link
Owner

@kermitt2 kermitt2 Aug 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite sure neither :)
Maybe a compromise between the different conventions :D

@coveralls
Copy link

coveralls commented Aug 5, 2019

Coverage Status

Coverage increased (+0.01%) to 38.029% when pulling d41a05a on elifesciences:ignore-invalid-utf-8-sequences into 206583a on kermitt2:master.

Copy link
Collaborator

@lfoppiano lfoppiano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In principle I'm fine on how the change has been done.

I'm wondering whether we should replace these invalid characters with some placeholder or just remove them.
@kermitt2 what do you think?

@de-code
Copy link
Collaborator Author

de-code commented Aug 6, 2019

I'm wondering whether we should replace these invalid characters with some placeholder or just remove them.

Yes, I think that would be better. Actually GROBID 0.5.3 was also displaying some odd character.

@de-code
Copy link
Collaborator Author

de-code commented Aug 6, 2019

I am also thinking whether there should be an option to save the erroneous XML files? I had to hack the code to do that and inspect the XML. Or is there an option not to delete the temp files?

@kermitt2
Copy link
Owner

kermitt2 commented Aug 6, 2019

So as I mentioned in the issue, it's something wrong with pdfalto. The contract between pdfalto and grobid now is somehow that pdfalto must neutralize all these encoding problems before sending the ALTO file to grobid.

Of course having grobid robust to that is also very nice. I find the property name misleading because in the context of XML, validation has a different meaning which is XML validation (something we really don't want to do and to allow with the ALTO files), here it's just unicode correctness in the XML parsing (well-formedness).

What about ignoring the unicode problem all the time in the XML parsing? It's not clear what is the interest of keeping it and having an option to allow it or not.

About the temporary ALTO XML files (the .lxml extension), they are indeed always removed. For inspecting those files, I was simply re-generating them with the pdfalto command line, which is something very simple and avoid adding options?

@de-code
Copy link
Collaborator Author

de-code commented Aug 6, 2019

I have revised it to replace with question marks. It didn't actually completely ignore the bytes but was skipping bytes until the utf-8 sequence is valid. It is now displaying question marks for every invalid byte. In the example above it ends up with ??5 in the output.

@de-code
Copy link
Collaborator Author

de-code commented Aug 6, 2019

So as I mentioned in the issue, it's something wrong with pdfalto. The contract between pdfalto and grobid now is somehow that pdfalto must neutralize all these encoding problems before sending the ALTO file to grobid.

I would agree that the XML generated by pdfalto should ideally be valid.

Of course having grobid robust to that is also very nice. I find the property name misleading because in the context of XML, validation has a different meaning which is XML validation (something we really don't want to do and to allow with the ALTO files), here it's just unicode correctness in the XML parsing (well-formedness).

I agree, validate is an overloaded term and in the XML world usually associated with schema validation like you said. I couldn't think of a better name, I didn't try very hard to be honest. Any suggestion?

What about ignoring the unicode problem all the time in the XML parsing? It's not clear what is the interest of keeping it and having an option to allow it or not.

In principal, rejecting badly encoded files seems to be a fair choice. This is really a work-around as it is not clear what to do with the badly encoded bytes. Once pdfalto fulfils the contract it might, some people might prefer to turn it on. But I wouldn't entirely opposed to removing the configuration.

About the temporary ALTO XML files (the .lxml extension), they are indeed always removed. For inspecting those files, I was simply re-generating them with the pdfalto command line, which is something very simple and avoid adding options?

Okay, fair enough. It is just that it tells me there is an error with that particular XML file but by the time I could look at it, it's gone. Running it manually I would need to make sure that I call it with the same parameters. But it's not a big issue. Let's leave it out.

@kermitt2
Copy link
Owner

kermitt2 commented Aug 6, 2019

I have revised it to replace with question marks. It didn't actually completely ignore the bytes but was skipping bytes until the utf-8 sequence is valid. It is now displaying question marks for every invalid byte. In the example above it ends up with ??5 in the output.

ok it has to be consistent I think with the way place holders from pdfalto are managed in GROBID, otherwise it's going to be messy.

pdfalto generates place holders with certain unicode character range (in theory it should do the same in this case too, but for some reason here there's a problem). For me, the way those place holders are used will depend on the application. For instance the place holder could be considered in the further processing because they are structurally useful (for instance in dictionaries, we have special characters which are separators but failing encoding resolution), or we could replace them all with a special character, or we could ignore them all. In grobid-core, we could ignore them I think.

I don't like putting hard coded question marks. How can we differentiate a normal question mark from an encoding issue then?

I think the best would be to always ignore encoding issue when parsing XML as you introduced here (we don't want to reject badly encoded ALTO files while we can process them), and just log the encoding issue so that we keep track of the error and address the problem in pdfalto.

I would simply ignore those characters in grobid-core - for instance calling a unique method that indicates what to do for all place holder and "CodingErrorAction"?

@de-code
Copy link
Collaborator Author

de-code commented Aug 6, 2019

I have revised it to replace with question marks. It didn't actually completely ignore the bytes but was skipping bytes until the utf-8 sequence is valid. It is now displaying question marks for every invalid byte. In the example above it ends up with ??5 in the output.

ok it has to be consistent I think with the way place holders from pdfalto are managed in GROBID, otherwise it's going to be messy.

pdfalto generates place holders with certain unicode character range (in theory it should do the same in this case too, but for some reason here there's a problem). For me, the way those place holders are used will depend on the application. For instance the place holder could be considered in the further processing because they are structurally useful (for instance in dictionaries, we have special characters which are separators but failing encoding resolution), or we could replace them all with a special character, or we could ignore them all. In grobid-core, we could ignore them I think.

Using a special character that is unlikely to be used otherwise seems to make sense to me. In some cases the original character could have been important. In the example document it is μ (micro) as a unit. What the impact of the misinterpretation of the invalid byte / character is, may not even be clear to the application but only the user. Using that special character could allow it to be visible to the user unless the application really knows what to do with it (and it could replace it).

I don't like putting hard coded question marks. How can we differentiate a normal question mark from an encoding issue then?

I agree, a question mark is not the best choice. Maybe use a designated unicode character. I don't know what pdfalto is using?

I think the best would be to always ignore encoding issue when parsing XML as you introduced here (we don't want to reject badly encoded ALTO files while we can process them), and just log the encoding issue so that we keep track of the error and address the problem in pdfalto.

I would simply ignore those characters in grobid-core - for instance calling a unique method that indicates what to do for all place holder and "CodingErrorAction"?

CodingErrorAction also has REPORT option:

Action indicating that a coding error is to be reported, either by returning a CoderResult object or by throwing a CharacterCodingException, whichever is appropriate for the method implementing the coding process.

I am not quite sure how to control that it should log and ignore. Do you?

@de-code
Copy link
Collaborator Author

de-code commented Aug 8, 2019

I didn't easily find which placholder character used by pdfalto.

Could we maybe agree to for now change it to?:

  • use the placeholder character used by pdfalto (to be defined)
  • remove the configuration option an always replace invalid utf-8 byte sequences with it

Ideally this shouldn't happen anyway and I raised kermitt2/pdfalto#68

removed no longer used line

added line feed
@kermitt2
Copy link
Owner

I've merged manually this PR without the property (commit 2bd9f00) - the invalid URF-8 characters are always ignored to avoid failing on the whole document because of a couple crappy character codes :)
Thanks a lot @de-code !

@kermitt2 kermitt2 closed this Nov 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants