Skip to content

Conversation

@joergsch
Copy link
Contributor

@joergsch joergsch commented Sep 5, 2025

prevent indexing of drawio files as well as files where content is not pure utf8 compatible
potential fix for #15

prevent indexing of drawio files as well as files where content is not pure utf8 compatible
$content = $pdf->getText();
} elseif (
str_starts_with($content, "<mxfile ") // draw.io file
|| !mb_detect_encoding($content, "UTF-8", true) // binary file
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indexing non-UTF8 is a feature, not a bug. 😉 See the call to mb_convert_encoding() below. (Many of - at least my - PDFs have non-UTF8 text in it)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Come to think of it, that call could be modified to explicitly call mb_detect_encoding() with $strict = true first. If the system setting for what encodings to detect then only contains UTF-8, that would return false and you could abort at that point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea was the following:

  1. identify pdf files
  2. identify drwa.io files ("cheap" test, only check the first few bytes)
  3. identify files which are no "text" files ("expensive", since it might be necessary to scan all data)

Just reject that merge request and implement it the way fitting your design.
I'm not good in php, neither do I know the complete return value signatures, hence using a shortcut and "wiping" the content to be stored.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I've implemented at least the binary part in 0e794a1

@jplitza jplitza closed this Sep 10, 2025
@eyona eyona mentioned this pull request Oct 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants