hjjg/plainy

# Required Debian packages:
# python-magic (guessing mimetype)
# python-lxml (docx)
# python-imaging (docx)
# xpdf-utils (pdftotext)
# xmlstarlet (odt)
# unzip (odt)
# w3m (html)
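
# A minimal sketch of how the packages above could be tied together:
# guess the MIME type, then pick an external converter for it. This is
# not plainy's actual code. The names CONVERTERS, build_command and
# to_text are hypothetical, and the stdlib mimetypes module (which only
# looks at the file extension) stands in for python-magic so the sketch
# runs without extra packages.

```python
import mimetypes
import subprocess

# Hypothetical mapping from MIME type to an external converter command.
# The "{path}" placeholder is replaced with the input file name.
CONVERTERS = {
    # pdftotext (xpdf-utils): "-" as output file means "write to stdout"
    "application/pdf": ["pdftotext", "{path}", "-"],
    # w3m: -dump renders the page as text, -T forces the input type
    "text/html": ["w3m", "-dump", "-T", "text/html", "{path}"],
}


def build_command(path):
    """Return the argv list for converting path, or None if unsupported."""
    mime, _encoding = mimetypes.guess_type(path)
    template = CONVERTERS.get(mime)
    if template is None:
        return None
    return [path if part == "{path}" else part for part in template]


def to_text(path):
    """Run the matching converter and return its output as text."""
    cmd = build_command(path)
    if cmd is None:
        raise ValueError("no converter for %r" % path)
    result = subprocess.run(cmd, check=True, capture_output=True)
    return result.stdout.decode("utf-8", "replace")
```

# build_command("report.pdf") only builds the argument list; to_text()
# additionally needs pdftotext or w3m to be installed.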

# TODO:
# - Implement something for archives (extract, recursively scan)
# - More functions: extend the __name__ == "__main__" part. plainy should
#   be usable as a library
# - We need to add OCR. Free options are tesseract-ocr and ocropus;
#   a commercial one is ABBYY FineReader
# - What about images in documents? Should they be processed or
#   should we hand them to external tools?
# - Should we implement a feature to extract MIME parts from mails?
#   See UUDeview, mpack/munpack
# - Reduce the dependencies on external tools
# - Implement a function to strip things from text, i.e. optimize
#   for fulltext search (stopwords? characters? getopt!)
# - to be continued
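
# One possible shape for the "strip things from text" item, as a rough
# sketch only. The function name strip_for_search, the tiny STOPWORDS
# set and the min_length parameter are assumptions for illustration;
# a real stopword list would depend on the language being indexed.

```python
import re

# Deliberately tiny, illustrative English stopword list (an assumption).
STOPWORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is"})


def strip_for_search(text, stopwords=STOPWORDS, min_length=2):
    """Lowercase text, drop punctuation, short words and stopwords."""
    # \w+ keeps runs of letters, digits and underscores; everything else
    # (punctuation, whitespace) acts as a separator.
    words = re.findall(r"\w+", text.lower())
    kept = [w for w in words if len(w) >= min_length and w not in stopwords]
    return " ".join(kept)
```

# Example: strip_for_search("The quick, brown fox is in a box!")
# returns "quick brown fox box".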

# TESSERACT:
# There are Debian packages, but they are outdated (2.x).
# Version 3.x supports layouts (Google's PDF text overlay! yay!)
#
# - http://code.google.com/p/tesseract-ocr/
# - http://de.wikipedia.org/wiki/Tesseract_%28Software%29
# - http://adnanvatandas.wordpress.com/2010/10/28/update-tesseract-3/
# - https://help.ubuntu.com/community/OCR
# 
# gs -dNOPAUSE -sDEVICE=tiffg4 -r600x600 -dBATCH -sPAPERSIZE=a4 -sOutputFile=out.tif input.pdf
# mogrify -brightness-contrast 10,80 -colorspace Gray -depth 8 +compress -format tif *.ppm
# tesseract Output_File_Name.tif Name_of_TXT -l eng
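
# The gs and tesseract steps above could be driven from Python roughly
# like this. It is a sketch, not part of plainy: ocr_commands and
# run_ocr are hypothetical names, the optional mogrify cleanup step is
# left out, and the flags are copied from the command lines above.

```python
import subprocess


def ocr_commands(pdf_path, basename="out", lang="eng"):
    """Build the argv lists for rendering a PDF to TIFF and running OCR."""
    tif = basename + ".tif"
    # Ghostscript: render the PDF as a 600 dpi G4-compressed TIFF.
    gs = ["gs", "-dNOPAUSE", "-dBATCH", "-sDEVICE=tiffg4", "-r600x600",
          "-sPAPERSIZE=a4", "-sOutputFile=" + tif, pdf_path]
    # Tesseract appends ".txt" to the output base name itself.
    tesseract = ["tesseract", tif, basename, "-l", lang]
    return [gs, tesseract]


def run_ocr(pdf_path, basename="out", lang="eng"):
    """Run both steps in order and return the name of the text file."""
    for cmd in ocr_commands(pdf_path, basename, lang):
        subprocess.run(cmd, check=True)
    return basename + ".txt"
```

# run_ocr("input.pdf") requires gs and tesseract on the PATH;
# ocr_commands() alone only builds the argument lists.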

# Possibly useful:
# HTML
# http://www.aaronsw.com/2002/html2text/
# http://www.unixuser.org/~euske/python/webstemmer/#extract

# DOCX
# https://github.com/mikemaccana/python-docx

About

Plainy is a small tool to get plain text out of various file formats, e.g. for indexing purposes.
