Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfixes #10

Merged
merged 23 commits into from Jan 7, 2017
Merged

Bugfixes #10

merged 23 commits into from Jan 7, 2017

Conversation

Shura1oplot
Copy link
Contributor

@Shura1oplot Shura1oplot commented Dec 29, 2016

Various bugfixes and improvements:

  1. transliterate filename instead of stripping non-ascii characters
  2. improve scanned pdf detection
  3. add test for tika extractor with scanned pdf files
  4. fix race condition bug caused by flask threaded=True option
  5. make es mapping settings configurable
  6. use unzip for ZIP archives as it can detect non-ascii filenames encoding
  7. add dependency to extract RAR archives

@coveralls
Copy link

coveralls commented Dec 29, 2016

Coverage Status

Coverage increased (+1.0%) to 83.481% when pulling 04382fc on Shura1oplot:improvements2 into d2d9aaf on pcbje:master.

@pcbje
Copy link
Owner

pcbje commented Dec 29, 2016

secure_filename is used for a reason. I suggest adding an 'original_filename' field instead.

@coveralls
Copy link

coveralls commented Dec 29, 2016

Coverage Status

Coverage increased (+1.0%) to 83.481% when pulling 3084c35 on Shura1oplot:improvements2 into d2d9aaf on pcbje:master.

@Shura1oplot
Copy link
Contributor Author

Shura1oplot commented Dec 29, 2016

secure_filename truncates filenames like привет.docx into docx thus breaks extension detection. Can you describe the reason? Possibly I'll find a compromise solution.

@pcbje
Copy link
Owner

pcbje commented Dec 29, 2016

The filename is provided by the client and is used when versions of the file are written to disk later. There is nothing stopping the client from sending a filename like "../hello.txt" and thus write the file to a different location. This can have some bad consequences. (See http://flask-russian-docs.readthedocs.io/ru/latest/patterns/fileuploads.html)

It does sound a bit weird if Flask doesn't support Cyrillic filenames. One option is to not use the filename as base for written files, but then they become harder to work with later. Another is to transliterate the filename before calling secure_filename:

from transliterate import detect_language, translit

language = detect_language(file.name) 
transliterated = translit(file.name, language, reversed=True)
filename = secure_filename(transliterated)

@Shura1oplot
Copy link
Contributor Author

Shura1oplot commented Dec 29, 2016

What do you think about replacing '\0' and '/' on *nix systems and '\' with other illegal characters on windows? And escaping special file names: '.'. '..', 'CON', ...

@coveralls
Copy link

coveralls commented Dec 29, 2016

Coverage Status

Coverage increased (+1.0%) to 83.494% when pulling e353117 on Shura1oplot:improvements2 into d2d9aaf on pcbje:master.

@pcbje
Copy link
Owner

pcbje commented Dec 29, 2016

I'm not if I understand you correctly, but I don't think doing the sanitisation ourselves is a good idea.

@Shura1oplot
Copy link
Contributor Author

I have found a library that can sanitize filenames: https://github.com/thombashi/pathvalidate. It is more permissive than werkzeug.secure_filename. Still some additional check are required for special filenames (., .., CON, ...).

@pcbje
Copy link
Owner

pcbje commented Dec 29, 2016

Is there anything wrong with using both a transliterated + secured filename, as well as adding a new field containing the original filename? The UI may display the original name. To me this seems a bit cleaner.

Something like this in document.py?

def get_document(path, parent=None):
  ...
  language = detect_language(path) 
  transliterated = translit(path, language, reversed=True)
  doc.path = secure_filename(transliterated)
  doc.original_path = path
  ...

Or let 'path' be the original and have a 'secure_path' property we use when writing files to disk.

@Shura1oplot
Copy link
Contributor Author

Transliteration is not universal and depends on dictionaries for particular languages. Moreover, detecting the language using a file path could be not accurate in case of mixing latin chars with non-latin. I'll make secure_path attribute of Document as a combination of docid and secure_filename(path).

@Shura1oplot
Copy link
Contributor Author

After all, I have to admit that your variant securing filenames is better. Now I have an issue with travis as it uses old version of Ubuntu. I`ll add mock script for pdfimage executable, it should fix tests.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.7%) to 83.215% when pulling 763a043 on Shura1oplot:improvements2 into d2d9aaf on pcbje:master.

@coveralls
Copy link

coveralls commented Jan 3, 2017

Coverage Status

Coverage increased (+0.8%) to 83.255% when pulling 88b06cd on Shura1oplot:improvements2 into d2d9aaf on pcbje:master.

@coveralls
Copy link

coveralls commented Jan 3, 2017

Coverage Status

Coverage increased (+0.8%) to 83.248% when pulling 7d93047 on Shura1oplot:improvements2 into d2d9aaf on pcbje:master.

@pcbje pcbje merged commit 63be498 into pcbje:master Jan 7, 2017
@pcbje
Copy link
Owner

pcbje commented Jan 7, 2017

Great stuff! Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants