Plainy is a small tool to get plain text out of various file formats, e.g. for indexing purposes.
hjjg/plainy
# Needed Debian packages:
#   python-magic (guessing mimetype)
#   python-lxml (docx)
#   python-imaging (docx)
#   xpdf-utils (pdftotext)
#   xmlstarlet (odt)
#   unzip (odt)
#   w3m (html)

# TODO:
# - Implement something for archives (extract, recursively scan)
# - More functions: extend the __name__ == "__main__" part. plainy
#   should be usable as a lib
# - We need to add OCR. Free variants are tesseract-ocr / ocropus;
#   commercial: ABBYY FineReader
# - What about images in documents? Should they be processed or
#   should we provide them to external tools?
# - Should we implement a feature to extract MIME parts from mails?
#   See UUDeview, mpack/munpack
# - Decrease the dependencies on external tools
# - Implement a function to strip things from text, i.e. optimize
#   for fulltext search (stopwords? characters? getopt!)
# - to be continued

# TESSERACT:
# There are Debian packages, but they are outdated (2.x).
# Version 3.x supports layouts (Google's PDF text overlay! yay!)
#
# - http://code.google.com/p/tesseract-ocr/
# - http://de.wikipedia.org/wiki/Tesseract_%28Software%29
# - http://adnanvatandas.wordpress.com/2010/10/28/update-tesseract-3/
# - https://help.ubuntu.com/community/OCR
#
# gs -dNOPAUSE -sDEVICE=tiffg4 -r600x600 -dBATCH -sPAPERSIZE=a4 -sOutputFile=out.tif input.pdf
# mogrify -brightness-contrast 10,80 -colorspace Gray -depth 8 +compress -format tif *.ppm
# tesseract Output_File_Name.tif Name_of_TXT -l eng

# Maybe useful:
# HTML
#   http://www.aaronsw.com/2002/html2text/
#   http://www.unixuser.org/~euske/python/webstemmer/#extract
# DOCX
#   https://github.com/mikemaccana/python-docx
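
The package list above pairs each file format with an external tool (pdftotext for PDF, w3m for HTML). A minimal sketch of how such a dispatch might look follows. It is not plainy's actual code: the names `EXTRACTORS`, `build_command`, and `extract_text` are hypothetical, and the standard-library `mimetypes` module (which only looks at the file extension) stands in for python-magic so that the sketch has no third-party dependencies. The odt and docx paths are left out.

```python
import mimetypes
import subprocess

# Hypothetical dispatch table: MIME type -> function that builds the
# argument list for the external tool named in the package list.
EXTRACTORS = {
    # "pdftotext FILE -" writes the extracted text to stdout.
    "application/pdf": lambda path: ["pdftotext", path, "-"],
    # "w3m -dump" renders the page as plain text on stdout.
    "text/html": lambda path: ["w3m", "-dump", "-T", "text/html", path],
}


def build_command(path):
    """Return the argv list for extracting plain text from `path`.

    Assumption: the MIME type is guessed from the file extension via
    `mimetypes`; a content-based guess (python-magic) would go here.
    """
    mime, _encoding = mimetypes.guess_type(path)
    builder = EXTRACTORS.get(mime)
    if builder is None:
        raise ValueError(f"no extractor for {path!r} (MIME type: {mime})")
    return builder(path)


def extract_text(path):
    """Run the external tool and return its stdout as text."""
    result = subprocess.run(
        build_command(path), capture_output=True, text=True, check=True
    )
    return result.stdout
```

Keeping command construction separate from execution means the dispatch can be checked without the external tools installed; `extract_text("input.pdf")` would only work on a system where `pdftotext` is on the `PATH`.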