Explain how to read text data from PDF and PowerPoint and use it with Texthero #24
Comments
Very interesting comment. Completely agree that we should do something related. Besides textract, there are also other Python tools for PDF extraction such as PyPDF2, PDFminer, etc. Instead of a dataLoader, as the use cases are quite different from task to task and this feature is a bit too far from the core idea of Texthero, an alternative would be to add a detailed tutorial on the blog, with code snippets (that can also be added somewhere in the GitHub repo), that explains how to extract text data from different sources such as PDF and PowerPoint. What do you think about this? Also, having a universal dataLoader might be quite hard, and that's why there is in general a custom Python package that does only that. As a final comment, it's important to define precisely what the goals and objectives of Texthero are: better to do one thing great than five things average. We can also discuss that eventually.
Completely agree with your final comment! Even though this is not one of the core goals of Texthero, I think it can be a cool feature to have. I just wanted to write it down so it can be done later on, after the core is built and running. I think having ideas written down and shared is good for the project. My idea for a universal data loader is that it appears "universal" to the user, but it has multiple implementations and can use different packages under the hood depending on file type / data source. For now, yeah, we can just have a tutorial on the blog!
There's a good library, TIKA-Python (https://github.com/chrismattmann/tika-python), that handles PDFs, emails, and other formats as well. The only con I find is that it needs a JVM to run Tika behind the scenes, but it's very easy to start using it:
import tika
from tika import parser
tika.initVM()  # Gets the Apache Tika jar file (if not present) and launches Tika from the JVM
filename = 'path/to/your/file(ppt|doc|docx|pdf)'
thedoc = parser.from_file(filename)
print(thedoc['metadata'])  # dict with information about the file itself
print(thedoc['content'])  # UTF-8 text extracted from the file
# Dump attachments if the file has any (like .msg, .eml, etc.).
if thedoc.get('attachments', False):
    print(thedoc['attachments'])
Hi @igponce! Thank you for your comment! Adding native PDF support might be a bit outside Texthero's purpose. What is definitely useful is to have a tutorial on Texthero's blog page that explains how to start hero-analyzing a collection of documents, starting from raw PDF and other formats. There are different solutions for doing that; another valid alternative is, for instance, pdfminer.six, as it's very simple to use and is based only on Python (no need for the JVM). For example, to go from raw PDF data to a Pandas DataFrame this line of code does the job:
Would you be interested in writing such a blog post? It would be great to show how to go from raw data to Pandas/Hero using different tools, including Apache Tika, Pdfminer, Textract, ... Regards,
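As a rough illustration of the pdfminer.six route mentioned above (not the snippet from the original comment), the following sketch collects every PDF in a folder into a Pandas DataFrame with one row per file. The helper names and the `filename` / `text` column names are hypothetical; the only third-party calls assumed are `pdfminer.high_level.extract_text` and `pandas.DataFrame`, both imported lazily.

```python
from pathlib import Path


def pdf_to_text(path):
    """Extract the text of one PDF with pdfminer.six (assumed installed)."""
    from pdfminer.high_level import extract_text
    return extract_text(str(path))


def pdfs_to_records(folder, extract=pdf_to_text):
    """Return one {"filename", "text"} record per *.pdf file in `folder`.

    `extract` is any function(path) -> str, so another backend
    (for example Apache Tika) could be swapped in.
    """
    records = []
    for path in sorted(Path(folder).glob("*.pdf")):
        records.append({"filename": path.name, "text": extract(path)})
    return records


def pdfs_to_dataframe(folder, extract=pdf_to_text):
    """Build a DataFrame with hypothetical columns `filename` and `text`."""
    import pandas as pd  # assumed installed; Texthero works on Pandas objects
    return pd.DataFrame(pdfs_to_records(folder, extract), columns=["filename", "text"])
```

From there the `text` column is an ordinary Pandas Series, so something like `df["clean_text"] = hero.clean(df["text"])` (with `import texthero as hero`) should be the natural next step.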
Good point on keeping PDF etc. out of scope: it's very tempting to add stuff, but hard to leave it out.
Sounds amazing! Looking forward to that!
PDFs, PowerPoint presentations, and other unstructured text sources contain very valuable data that can be used for analysis.
There are many tools providing these features. It would be nice if we could provide a single method to read such files so the user doesn't have to bother with this.
There is a Python library, textract, that provides this functionality; unfortunately, it is not maintained.
We can provide a method, loadData or similar, that has a different implementation depending on file type.