Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check if filename extensions are coherent with mime-type #187

Open
kelson42 opened this issue Nov 13, 2020 · 4 comments
Open

Check if filename extensions are coherent with mime-type #187

kelson42 opened this issue Nov 13, 2020 · 4 comments

Comments

@kelson42
Copy link
Contributor

  • If not extension at all, don't check.
  • If wrong generate a warning
@jmontleon
Copy link

jmontleon commented Jun 15, 2021

Is this issue based around this code, or something different?
https://github.com/openzim/zim-tools/blob/master/src/zimwriterfs/tools.cpp#L253-L259

  auto index_of_last_dot = filename.find_last_of(".");
  if (index_of_last_dot != std::string::npos) {
    mimeType = filename.substr(index_of_last_dot + 1);
    try {
      return extMimeTypes.at(mimeType);
    } catch (std::out_of_range&) {}
  }

While trying to build a new zim file of stackoverflow it looked like the mimeType list being generated was way too long and the ASSERT at the end just fails and kills the build..

There are lots of links to sites like yuml.me where people are using long URL's that include dots to produce UML diagrams. Most of the ridiculously long lines here are from yuml.me in this comment. These URL's end up serving an svg image. #126 (comment)

At least two of the numeric strings are lat/long from google maps.

Many of the others are obvious typos or copy/paste errors.

Looking at the list I didn't really see anything legitimate under 3 or more than 4 characters so I patched it like so to let it build. Without a doubt that still leaves a lot of garbage, but I can at least build stackoverflow. Maybe there is a better approach.

diff --git a/src/zimwriterfs/tools.cpp b/src/zimwriterfs/tools.cpp
index 06aa8ce..df6f9b2 100644
--- a/src/zimwriterfs/tools.cpp
+++ b/src/zimwriterfs/tools.cpp
@@ -276,9 +276,11 @@ std::string getMimeTypeForFile(const std::string &directoryPath, const std::stri
   auto index_of_last_dot = filename.find_last_of(".");
   if (index_of_last_dot != std::string::npos) {
     mimeType = filename.substr(index_of_last_dot + 1);
-    try {
-      return extMimeTypes.at(mimeType);
-    } catch (std::out_of_range&) {}
+    if (mimeType.length() >2 && mimeType.length() <= 4) {
+      try {
+        return extMimeTypes.at(mimeType);
+      } catch (std::out_of_range&) {}
+    }
   }

@kelson42
Copy link
Contributor Author

This ticket is not a bug, this is a feature request. I believe your comment to be unrealted to this.

@mgautierfr
Copy link
Collaborator

The bug is more that we should not use the extension as mimetype (if we cannot found the mimetype).
The code to change is more at the end of the function than checking the length of the extension.

@venkatd
Copy link

venkatd commented Aug 27, 2021

@jmontleon if you have a working copy of a recent stackoverflow dump, would love to get a hold of it :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants