Currently:
"While text-extraction is out of scope for this specification ..."
Proposed Change:
"While detailed procedures for text-extraction from various document formats are out of scope for this specification ..."
For reproducible Content-ID-Text components, the definition of the extraction tool and its version is part of the normative specification. It might be updated in a future version of the ISCC (ideally only after compatibility tests). Thanks to the comprehensive text normalization (especially with the upcoming ISCC v1.1), the impact of different text-extraction tools and versions should be minimal. Even if two different ISCC implementations generate slightly different Content-IDs, this is not regarded as a failure to produce a valid ISCC code. The similarity-preserving nature of the component would still produce a match or near-duplicate match when ISCC codes are compared.
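To illustrate the argument above, here is a minimal sketch of how normalization plus a similarity-preserving hash can absorb differences between extraction tools. It is not the ISCC Content-ID-Text algorithm: the `normalize`, `simhash`, and `hamming` helpers, the 4-character n-grams, the 64-bit width, and the use of BLAKE2b are all assumptions chosen for illustration only.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Hypothetical normalization: NFKC, lowercase, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return " ".join(text.split())


def simhash(text: str, width: int = 4, bits: int = 64) -> int:
    """Illustrative similarity hash over character n-grams (not ISCC's)."""
    text = normalize(text)
    grams = [text[i:i + width] for i in range(max(1, len(text) - width + 1))]
    counts = [0] * bits
    for gram in grams:
        digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=bits // 8).digest()
        value = int.from_bytes(digest, "big")
        for bit in range(bits):
            # Each n-gram votes +1 or -1 on every bit position.
            counts[bit] += 1 if (value >> bit) & 1 else -1
    # A bit is set in the result when the majority of n-grams voted for it.
    return sum(1 << bit for bit in range(bits) if counts[bit] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


# Two extractions of the same sentence: one keeps a ligature, extra spaces
# and a line break, the other does not.
tool_a = "The \ufb01nal  report\nwas  filed."
tool_b = "the final report was filed."
print(hamming(simhash(tool_a), simhash(tool_b)))  # 0, both normalize identically
```

Differences that survive normalization (for example a dropped word) change only some of the n-gram votes, so the resulting hashes would typically differ in a few bits rather than completely. That is the near-duplicate behaviour the paragraph relies on.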
"... the definition of the extraction tool/version is part of the normative specification ..."
Mandating a specific tool only works if you also tie it to a specific version of that tool (as you may be implying). But since software is known to have vulnerabilities that require systems to update, it is unreasonable and unacceptable to take this approach.
Additionally, it would prevent innovation in this area, especially for complex formats such as PDF.