Change wording for text extraction scope. #53

Open
titusz opened this issue Apr 30, 2019 · 1 comment

titusz (Member) commented Apr 30, 2019

Currently:
"While text-extraction is out of scope for this specification ..."

Proposed Change:
"While detailed procedures for text-extraction from various document formats are out of scope for this specification ..."

For reproducible Content-ID-Text components, the definition of the extraction tool/version is part of the normative specification. It might be updated in some future version of the ISCC (ideally only after compatibility tests). Thanks to the comprehensive text normalization (especially with the upcoming ISCC v1.1), the impact of different text extraction tools or versions should be minimal. Even if two different ISCC implementations generate slightly different Content-IDs, this is not regarded as a failure to produce a valid ISCC code. The similarity-preserving nature of the component would still produce a match or near-duplicate match when ISCC codes are compared.
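
To illustrate the argument, here is a minimal sketch of why normalization plus a similarity-preserving hash tolerates extractor differences. It is not the ISCC Content-ID-Text algorithm: `normalize`, `simhash`, and `hamming` are hypothetical helpers, and the shingle width, 64-bit length, and choice of BLAKE2b are assumptions made only for this example.

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """Reduce extractor-specific noise: Unicode form, case, punctuation, whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()  # e.g. ligature "ﬁ" becomes "fi"
    text = re.sub(r"[^\w]+", " ", text)  # runs of punctuation/whitespace -> one space
    return text.strip()


def simhash(text: str, width: int = 3, bits: int = 64) -> int:
    """Toy similarity hash over word shingles (illustrative, not the ISCC algorithm)."""
    words = normalize(text).split()
    shingles = [
        " ".join(words[i:i + width])
        for i in range(max(1, len(words) - width + 1))
    ]
    votes = [0] * bits
    for shingle in shingles:
        digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=bits // 8).digest()
        value = int.from_bytes(digest, "big")
        for bit in range(bits):
            # Each shingle votes for or against setting each output bit.
            votes[bit] += 1 if (value >> bit) & 1 else -1
    return sum(1 << bit for bit in range(bits) if votes[bit] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two codes."""
    return bin(a ^ b).count("1")


# Made-up outputs of two hypothetical extraction tools for the same document:
# they differ in line breaks, hyphenation, spacing, and a ligature.
tool_a = "The ISCC Content-ID is a similarity-preserving\nidenti\ufb01er for text."
tool_b = "The ISCC  Content ID is a similarity preserving identifier for text ."

print(hamming(simhash(tool_a), simhash(tool_b)))  # 0
```

Here the two extractions normalize to the same string, so the codes are identical. When extractors disagree on actual words, only the affected shingles change their votes, so the codes would typically differ in a few bits rather than completely.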

@titusz titusz added this to the Version 1.1 milestone Apr 30, 2019
@titusz titusz self-assigned this Apr 30, 2019
@lrosenthol

"the definition of the extraction tool/version is part of the normative specification"

Mandating a specific tool works only if you also tie it to a specific version of that tool (as you may be implying). But since software is known to have vulnerabilities that require systems to update, it is unreasonable/unacceptable to take this approach.

Additionally, it would prevent innovation in this area, especially for complex formats such as PDF.
