Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix] util.getSafeFilename() maximum extension length #5067

Merged
merged 7 commits into from Jun 10, 2019

Conversation

Projects
None yet
2 participants
@Frenzie
Copy link
Member

commented Jun 9, 2019

Mainly in case of sentences. Fixes #5049.

[fix] util.getSafeFilename() maximum extension length
Mainly in case of sentences. Fixes #5049.

@Frenzie Frenzie added bug Plugin labels Jun 9, 2019

@Frenzie Frenzie added this to the 2019.06 milestone Jun 9, 2019

@poire-z

This comment has been minimized.

Copy link
Contributor

commented Jun 9, 2019

(It's super edge-casy, and there's no clean solution - but I guess that if you get a suffix longer than 10 chars, it's not a suffix, and you could just truncate the filename as a whole - instead of truncating the first part and again truncating the 2nd part that's not really a suffix, making the sentence twicely broken?)

@Frenzie

This comment has been minimized.

Copy link
Member Author

commented Jun 9, 2019

Another indicator would be spaces in the suffix. The conceptual problem with this edge case is that it's all perfectly permissible. The only limit is 255 characters, regardless if it's in the "filename" or the "extension." The fact that extensions are seldom longer than 5 characters and don't contain spaces is just convention.

But for the safe filename function your suggestion makes sense. If the extension comes out longer than 10 or 12 characters or contains spaces, just say filename = str.

@poire-z

This comment has been minimized.

Copy link
Contributor

commented Jun 9, 2019

Also, at least for the documents we save ourselves, which are aimed to be open with a html or epub renderer, it could make sense to force having one of these suffixes (even if wrong, crengine does not care about the suffix, it will guess the content type when opening the file) - so at least, it displays as a supported document with a supported suffix in the file browser.

@Frenzie

This comment has been minimized.

Copy link
Member Author

commented Jun 9, 2019

I think all the callers already do that? :-)

@poire-z

This comment has been minimized.

Copy link
Contributor

commented Jun 9, 2019

So, what happened in the issue this is supposed to fix? You shouldn't have had a long suffix then?

@Frenzie

This comment has been minimized.

Copy link
Member Author

commented Jun 9, 2019

The .epub is appended after getSafeFilename() processing.

@poire-z

This comment has been minimized.

Copy link
Contributor

commented Jun 9, 2019

OK, but what really happened?
A long suffix was detected, the name part truncated, the long suffix appended again, making it too long, and the appending of .epub kept making it too long, and it failed saving (or was discarded)?

@Frenzie

This comment has been minimized.

Copy link
Member Author

commented Jun 9, 2019

Yes, exactly. You can't write a filename that's too long, so it fails.

[w-id_1969015] Jak koduje się pilota (scyzoryk) do #<a href="#peugeot">peugeot<_a> 307 po lifcie? W necie są &quot;jakieś&quot; instrukcje, każda jest inna, a nic nie idzie. Ktoś zna sposób? #<a href="#motoryzacja">motoryzacja<_a> #<a href="#pytanie">pytanie<_a>.epub: File name too long

The part after idzie is the auto-detected suffix.

The idea in the function being you can give it a string or a filename, but that Polish site seems to be a champ at finding length-related edge cases that never come up in my personal use.

Strictly speaking I should go test-driven here. :-)

Alternatively the function could just be ignorant of extensions, although I'd like to be able to toss a Unix-safe filename at it to transform it into one suitable for a Windows filesystem.

@Frenzie

This comment has been minimized.

Copy link
Member Author

commented Jun 10, 2019

It only just occurred to me that we can get a much nicer filename by simply stripping the HTML:

06/10/2019-04:19:39 PM DEBUG Wallabag: DOWNLOAD: filename:  /home/frans/src/kobo/koreader/test/wallabag/[w-id_1969015] Jak koduje się pilota (scyzoryk) do #peugeot 307 po lifcie? W necie są "jakieś" instrukcje, każda jest inna, a nic nie idzie. Ktoś zna sposób? #motoryzacja #pytanie.epub

That actually takes care of the original issue by sufficiently reducing the length, but your remarks still made the code better. :-)

Frenzie added some commits Jun 10, 2019

@poire-z

This comment has been minimized.

Copy link
Contributor

commented Jun 10, 2019

We also have the following, which has the additional effect of removing twicely encoded stuff (if that can happen where we get the filenames from):

koreader/frontend/util.lua

Lines 688 to 716 in da12fc9

--- Convert HTML to plain text if text seems to be HTML
-- Detection of HTML is simple and may raise false positives
-- or negatives, but seems quite good at guessing content type
-- of text found in EPUB's <dc:description>.
--
--- @string text the string with possibly some HTML
--- @treturn string cleaned text
function util.htmlToPlainTextIfHtml(text)
local is_html = false
-- Quick way to check if text is some HTML:
-- look for html tags
local _, nb_tags
_, nb_tags = text:gsub("<%w+.->", "")
if nb_tags > 0 then
is_html = true
else
-- no <tag> found
-- but we may meet some text badly twicely encoded html containing "&lt;br&gt;"
local nb_encoded_tags
_, nb_encoded_tags = text:gsub("&lt;%a+&gt;", "")
if nb_encoded_tags > 0 then
is_html = true
-- decode one of the two encodes
text = util.htmlEntitiesToUtf8(text)
end
end
if is_html then
text = util.htmlToPlainText(text)

@Frenzie

This comment has been minimized.

Copy link
Member Author

commented Jun 10, 2019

@poire-z Thanks! I think that most sources (e.g., Wallabag, Wikipedia) should be fine, but you never know with random newsfeeds or some such.

Since we always get rid of bad characters it doesn't really matter for basic functionality, but it might occasionally have slightly more aesthetically pleasing results.

@codecov

This comment has been minimized.

Copy link

commented Jun 10, 2019

Codecov Report

Merging #5067 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #5067   +/-   ##
=======================================
  Coverage   47.93%   47.93%           
=======================================
  Files         245      245           
  Lines       34888    34888           
=======================================
  Hits        16723    16723           
  Misses      18165    18165

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5df31f2...b4d7ac9. Read the comment docs.

1 similar comment
@codecov

This comment has been minimized.

Copy link

commented Jun 10, 2019

Codecov Report

Merging #5067 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #5067   +/-   ##
=======================================
  Coverage   47.93%   47.93%           
=======================================
  Files         245      245           
  Lines       34888    34888           
=======================================
  Hits        16723    16723           
  Misses      18165    18165

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5df31f2...b4d7ac9. Read the comment docs.

@Frenzie Frenzie merged commit 9300a59 into koreader:master Jun 10, 2019

1 check passed

ci/circleci Your tests passed on CircleCI!
Details

@Frenzie Frenzie deleted the Frenzie:getsafefilename-max-suffix-length branch Jun 10, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.