GitHub - hjjg/plainy: Plainy is a small tool to get plain text out of various file formats, e.g. for indexing purposes.

# Required Debian packages:
# python-magic (guessing mimetype)
# python-lxml (docx)
# python-imaging (docx)
# xpdf-utils (pdftotext)
# xmlstarlet (odt)
# unzip (odt)
# w3m (html)
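
# A minimal sketch of how these tools might be wired together: a table that
# maps a MIME type to the command line that dumps plain text to stdout.
# EXTRACTORS and build_command are hypothetical names, not taken from
# plainy.py, and the stdlib mimetypes module (extension-based guessing)
# stands in for python-magic so the sketch needs no extra packages.
# The docx and odt paths are left out.

```python
import mimetypes

# Hypothetical dispatch table: MIME type -> command template.
# The placeholder "{path}" is replaced by the input file name.
EXTRACTORS = {
    "application/pdf": ["pdftotext", "{path}", "-"],           # "-" = write to stdout
    "text/html": ["w3m", "-dump", "-T", "text/html", "{path}"],
}


def build_command(path, mime=None):
    """Return the argument list that would extract plain text from path."""
    if mime is None:
        # Assumption: guess from the file extension instead of python-magic.
        mime, _encoding = mimetypes.guess_type(path)
    template = EXTRACTORS.get(mime)
    if template is None:
        raise ValueError("no extractor known for %r (%s)" % (path, mime))
    return [path if arg == "{path}" else arg for arg in template]
```

# The returned list could be handed to subprocess.check_output() to get the
# text; build_command("report.pdf") gives ["pdftotext", "report.pdf", "-"].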

# TODO:
# - Implement something for archives (extract, recursively scan)
# - More functions - extend the __name__ == "__main__" part. plainy should
#   be usable as a lib
# - We need to add OCR. Free options are tesseract-ocr and ocropus;
#   a commercial one is ABBYY FineReader
# - What about images in documents? Should they be processed or
#   should we provide them to external tools?
# - Should we implement a feature to extract MIME from mails?
#   See UUDeview, mpack/munpack
# - Reduce the dependencies on external tools
# - Implement a function to strip things from text, i.e. optimize
#   for fulltext search (stopwords? characters? getopt!)
# - to be continued
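
# One way the "strip things from text" item might look, as a rough sketch:
# lowercase the text, keep only word characters, and drop stopwords.
# normalize_for_index and the tiny STOPWORDS set are hypothetical; a real
# list would depend on the language of the indexed documents.

```python
import re

# Hypothetical, deliberately tiny English stopword list.
STOPWORDS = frozenset(["a", "an", "and", "in", "of", "the", "to"])


def normalize_for_index(text, stopwords=STOPWORDS):
    """Return text reduced to lowercase words, without stopwords."""
    tokens = re.findall(r"\w+", text.lower())  # splits on punctuation and whitespace
    return " ".join(t for t in tokens if t not in stopwords)
```

# Which characters and words to strip could later be made configurable
# through command-line options, as the getopt remark suggests.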

# TESSERACT:
# Debian packages exist, but they are outdated (2.x).
# Version 3.x supports layouts (Google's PDF text overlay! yay!)
#
# - http://code.google.com/p/tesseract-ocr/
# - http://de.wikipedia.org/wiki/Tesseract_%28Software%29
# - http://adnanvatandas.wordpress.com/2010/10/28/update-tesseract-3/
# - https://help.ubuntu.com/community/OCR
#
# Example commands (file names are placeholders):
# gs -dNOPAUSE -sDEVICE=tiffg4 -r600x600 -dBATCH -sPAPERSIZE=a4 -sOutputFile=out.tif input.pdf
# mogrify -brightness-contrast 10,80 -colorspace Gray -depth 8 +compress -format tif *.ppm
# tesseract Output_File_Name.tif Name_of_TXT -l eng
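
# The gs and tesseract steps above could be driven from Python roughly like
# this. ocr_commands is a hypothetical helper, the mogrify step is skipped,
# and whether tesseract accepts the multi-page TIFF that gs writes is an
# assumption that depends on the installed version.

```python
def ocr_commands(pdf_path, out_base, lang="eng"):
    """Return the two command lines (gs, tesseract) for OCR of one PDF.

    out_base is the output name without extension: gs writes
    out_base + ".tif" and tesseract writes out_base + ".txt".
    """
    tif_path = out_base + ".tif"
    render = [
        "gs", "-dNOPAUSE", "-sDEVICE=tiffg4", "-r600x600", "-dBATCH",
        "-sPAPERSIZE=a4", "-sOutputFile=" + tif_path, pdf_path,
    ]
    recognize = ["tesseract", tif_path, out_base, "-l", lang]
    return [render, recognize]
```

# Each list can be passed to subprocess.check_call() in order; keeping the
# commands as data makes them easy to log or inspect before running.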

# Maybe Useful:
# HTML
# http://www.aaronsw.com/2002/html2text/
# http://www.unixuser.org/~euske/python/webstemmer/#extract

# DOCX
# https://github.com/mikemaccana/python-docx