New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfixes #10

Merged
merged 23 commits into from Jan 7, 2017

Conversation

Projects
None yet
3 participants
@Shura1oplot
Contributor

Shura1oplot commented Dec 29, 2016

Various bugfixes and improvements:

  1. transliterate filename instead of stripping non-ascii characters
  2. improve scanned pdf detection
  3. add test for tika extractor with scanned pdf files
  4. fix race condition bug caused by flask threaded=True option
  5. make es mapping settings configurable
  6. use unzip for ZIP archives as it can detect non-ascii filenames encoding
  7. add dependency to extract RAR archives
@coveralls

This comment has been minimized.

coveralls commented Dec 29, 2016

Coverage Status

Coverage increased (+1.0%) to 83.481% when pulling 04382fc on Shura1oplot:improvements2 into d2d9aaf on pcbje:master.

@pcbje

This comment has been minimized.

Owner

pcbje commented Dec 29, 2016

secure_filename is used for a reason. I suggest adding an 'original_filename' field instead.

@coveralls

This comment has been minimized.

coveralls commented Dec 29, 2016

Coverage Status

Coverage increased (+1.0%) to 83.481% when pulling 3084c35 on Shura1oplot:improvements2 into d2d9aaf on pcbje:master.

@Shura1oplot

This comment has been minimized.

Contributor

Shura1oplot commented Dec 29, 2016

secure_filename truncates filenames like привет.docx into docx thus breaks extension detection. Can you describe the reason? Possibly I'll find a compromise solution.

@pcbje

This comment has been minimized.

Owner

pcbje commented Dec 29, 2016

The filename is provided by the client and is used when versions of the file are written to disk later. There is nothing stopping the client from sending a filename like "../hello.txt" and thus write the file to a different location. This can have some bad consequences. (See http://flask-russian-docs.readthedocs.io/ru/latest/patterns/fileuploads.html)

It does sound a bit weird if Flask doesn't support Cyrillic filenames. One option is to not use the filename as base for written files, but then they become harder to work with later. Another is to transliterate the filename before calling secure_filename:

from transliterate import detect_language, translit

language = detect_language(file.name) 
transliterated = translit(file.name, language, reversed=True)
filename = secure_filename(transliterated)
@Shura1oplot

This comment has been minimized.

Contributor

Shura1oplot commented Dec 29, 2016

What do you think about replacing '\0' and '/' on *nix systems and '\' with other illegal characters on windows? And escaping special file names: '.'. '..', 'CON', ...

@coveralls

This comment has been minimized.

coveralls commented Dec 29, 2016

Coverage Status

Coverage increased (+1.0%) to 83.494% when pulling e353117 on Shura1oplot:improvements2 into d2d9aaf on pcbje:master.

@pcbje

This comment has been minimized.

Owner

pcbje commented Dec 29, 2016

I'm not if I understand you correctly, but I don't think doing the sanitisation ourselves is a good idea.

@Shura1oplot

This comment has been minimized.

Contributor

Shura1oplot commented Dec 29, 2016

I have found a library that can sanitize filenames: https://github.com/thombashi/pathvalidate. It is more permissive than werkzeug.secure_filename. Still some additional check are required for special filenames (., .., CON, ...).

@pcbje

This comment has been minimized.

Owner

pcbje commented Dec 29, 2016

Is there anything wrong with using both a transliterated + secured filename, as well as adding a new field containing the original filename? The UI may display the original name. To me this seems a bit cleaner.

Something like this in document.py?

def get_document(path, parent=None):
  ...
  language = detect_language(path) 
  transliterated = translit(path, language, reversed=True)
  doc.path = secure_filename(transliterated)
  doc.original_path = path
  ...

Or let 'path' be the original and have a 'secure_path' property we use when writing files to disk.

@Shura1oplot

This comment has been minimized.

Contributor

Shura1oplot commented Dec 30, 2016

Transliteration is not universal and depends on dictionaries for particular languages. Moreover, detecting the language using a file path could be not accurate in case of mixing latin chars with non-latin. I'll make secure_path attribute of Document as a combination of docid and secure_filename(path).

@Shura1oplot

This comment has been minimized.

Contributor

Shura1oplot commented Jan 1, 2017

After all, I have to admit that your variant securing filenames is better. Now I have an issue with travis as it uses old version of Ubuntu. I`ll add mock script for pdfimage executable, it should fix tests.

@coveralls

This comment has been minimized.

coveralls commented Jan 3, 2017

Coverage Status

Coverage increased (+0.7%) to 83.215% when pulling 763a043 on Shura1oplot:improvements2 into d2d9aaf on pcbje:master.

@coveralls

This comment has been minimized.

coveralls commented Jan 3, 2017

Coverage Status

Coverage increased (+0.8%) to 83.255% when pulling 88b06cd on Shura1oplot:improvements2 into d2d9aaf on pcbje:master.

@coveralls

This comment has been minimized.

coveralls commented Jan 3, 2017

Coverage Status

Coverage increased (+0.8%) to 83.248% when pulling 7d93047 on Shura1oplot:improvements2 into d2d9aaf on pcbje:master.

@pcbje pcbje merged commit 63be498 into pcbje:master Jan 7, 2017

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
coverage/coveralls Coverage increased (+0.8%) to 83.248%
Details
@pcbje

This comment has been minimized.

Owner

pcbje commented Jan 7, 2017

Great stuff! Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment