-
Notifications
You must be signed in to change notification settings - Fork 440
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ignore invalid utf-8 sequences by default #475
ignore invalid utf-8 sequences by default #475
Conversation
@@ -15,6 +15,7 @@ | |||
String PROP_3RD_PARTY_PDFTOXML = "grobid.3rdparty.pdf2xml.path"; | |||
String PROP_3RD_PARTY_PDFTOXML_MEMORY_LIMIT = "grobid.3rdparty.pdf2xml.memory.limit.mb"; | |||
String PROP_3RD_PARTY_PDFTOXML_TIMEOUT_SEC = "grobid.3rdparty.pdf2xml.memory.timeout.sec"; | |||
String PROP_3RD_PARTY_PDFTOXML_VALIDATION_ENABLED = "grobid.3rdparty.pdf2xml.validation.enabled"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am assuming this also belongs to "third party"? (just trying to be consistent)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no it's the XML parsing part inside grobid... not third party, but I would suggest to be robust wrt encoding all the time, so not introduce this additional property.
@@ -569,6 +569,12 @@ public static Integer getPdfToXMLTimeoutMs() { | |||
return Integer.parseInt(getPropertyValue(GrobidPropertyKeys.PROP_3RD_PARTY_PDFTOXML_TIMEOUT_SEC, "60"), 10) * 1000; | |||
} | |||
|
|||
public static boolean isPdfToXMLValidationEnabled() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not quite sure why we spell Pdf
and XML
. Java naming conventions is all caps but personally I prefer Pdf
and Xml
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not quite sure neither :)
Maybe a compromise between the different conventions :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In principle I'm fine on how the change has been done.
I'm wondering whether we should replace these invalid characters with some placeholder or just remove them.
@kermitt2 what do you think?
Yes, I think that would be better. Actually GROBID 0.5.3 was also displaying some odd character. |
I am also thinking whether there should be an option to save the erroneous XML files? I had to hack the code to do that and inspect the XML. Or is there an option not to delete the temp files? |
So as I mentioned in the issue, it's something wrong with pdfalto. The contract between pdfalto and grobid now is somehow that pdfalto must neutralize all these encoding problems before sending the ALTO file to grobid. Of course having grobid robust to that is also very nice. I find the property name misleading because in the context of XML, validation has a different meaning which is XML validation (something we really don't want to do and to allow with the ALTO files), here it's just unicode correctness in the XML parsing (well-formedness). What about ignoring the unicode problem all the time in the XML parsing? It's not clear what is the interest of keeping it and having an option to allow it or not. About the temporary ALTO XML files (the |
I have revised it to replace with question marks. It didn't actually completely ignore the bytes but was skipping bytes until the utf-8 sequence is valid. It is now displaying question marks for every invalid byte. In the example above it ends up with |
I would agree that the XML generated by pdfalto should ideally be valid.
I agree,
In principal, rejecting badly encoded files seems to be a fair choice. This is really a work-around as it is not clear what to do with the badly encoded bytes. Once
Okay, fair enough. It is just that it tells me there is an error with that particular XML file but by the time I could look at it, it's gone. Running it manually I would need to make sure that I call it with the same parameters. But it's not a big issue. Let's leave it out. |
ok it has to be consistent I think with the way place holders from pdfalto are managed in GROBID, otherwise it's going to be messy. pdfalto generates place holders with certain unicode character range (in theory it should do the same in this case too, but for some reason here there's a problem). For me, the way those place holders are used will depend on the application. For instance the place holder could be considered in the further processing because they are structurally useful (for instance in dictionaries, we have special characters which are separators but failing encoding resolution), or we could replace them all with a special character, or we could ignore them all. In grobid-core, we could ignore them I think. I don't like putting hard coded question marks. How can we differentiate a normal question mark from an encoding issue then? I think the best would be to always ignore encoding issue when parsing XML as you introduced here (we don't want to reject badly encoded ALTO files while we can process them), and just log the encoding issue so that we keep track of the error and address the problem in I would simply ignore those characters in grobid-core - for instance calling a unique method that indicates what to do for all place holder and "CodingErrorAction"? |
Using a special character that is unlikely to be used otherwise seems to make sense to me. In some cases the original character could have been important. In the example document it is
I agree, a question mark is not the best choice. Maybe use a designated unicode character. I don't know what pdfalto is using?
I am not quite sure how to control that it should log and ignore. Do you? |
I didn't easily find which placholder character used by pdfalto. Could we maybe agree to for now change it to?:
Ideally this shouldn't happen anyway and I raised kermitt2/pdfalto#68 |
35dc2c4
to
a270f58
Compare
413cd89
to
73f452c
Compare
removed no longer used line added line feed
73f452c
to
d41a05a
Compare
This is a workaround for #472
By default it will ignore invalid UTF-8 sequences as it did in GROBID 0.5.3 and before.
The validation can be turned on via the properties file (
grobid.3rdparty.pdf2xml.validation.enabled
)./cc @lfoppiano