Change wording for text extraction scope. #53

titusz · 2019-04-30T18:04:07Z

Currently:
"While text-extraction is out of scope for this specification ..."

Proposed Change:
"While detailed procedures for text-extraction from various document formats are out of scope for this specification ..."

For reproducible Content-ID-Text components, the definition of the extraction tool/version is part of the normative specification. It might be updated with some future version of the ISCC (ideally only after some compatibility tests). Due to the comprehensive text normalization (especially with the upcoming ISCC v1.1), the impact of different text extraction tools/versions should be minimal. Even if two different implementations of the ISCC generate slightly different Content-IDs, this is not regarded as a failure to produce a valid ISCC code. The similarity-preserving nature of the component would still produce a match or near-duplicate match when comparing ISCC codes.
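To illustrate the point about normalization and similarity-preserving comparison, here is a minimal sketch. It is not the ISCC Content-ID-Text algorithm and not the normalization defined in the spec; it is a toy simhash over word tokens, and the names `normalize`, `toy_simhash`, and `hamming` are made up for this example. The idea is only that extraction differences such as line breaks or stray whitespace disappear after normalization, and that the remaining differences show up as a small Hamming distance rather than a completely different code.

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """Toy normalization (an assumption, not the ISCC rules):
    NFKC, lowercase, drop punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())


def toy_simhash(text: str, bits: int = 64) -> int:
    """Toy similarity hash: each word token votes on every bit position."""
    counts = [0] * bits
    for token in normalize(text).split():
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=bits // 8).digest()
        value = int.from_bytes(digest, "big")
        for i in range(bits):
            counts[i] += 1 if (value >> i) & 1 else -1
    # A bit is set when more tokens voted 1 than 0 at that position.
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


# Two hypothetical extractions of the same document from different tools.
tool_a = "The quick brown fox\njumps over the lazy dog."
tool_b = "The  quick brown fox jumps over\tthe lazy dog ."
# A third extraction that dropped one word.
tool_c = "The quick brown fox jumps over the dog."

print(hamming(toy_simhash(tool_a), toy_simhash(tool_b)))  # identical after normalization
print(hamming(toy_simhash(tool_a), toy_simhash(tool_c)))  # distance for the variant with a missing word
```

With a scheme like this, a match would be decided by a distance threshold rather than by exact equality, which is why slightly different extraction output need not break comparison.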

lrosenthol · 2020-05-13T15:51:19Z

"the definition of the extraction tool/version is part of the normative specification"

Mandating a specific tool works only if you also tie it to a specific version of that tool (as you may be implying). But since software is known to have vulnerabilities that require systems to update, it is unreasonable/unacceptable to take this approach.

Additionally, it would prevent innovation in this area, especially for complex formats such as PDF.

titusz added the Affects: Spec label Apr 30, 2019

titusz added this to the Version 1.1 milestone Apr 30, 2019

titusz self-assigned this Apr 30, 2019
