import pdftotext

def get_text(filepath, page=None):
    """
    Extract text from a PDF

    Parameters
    ----------
    filepath : str
        Path to a PDF file
    page : int or None

    Returns
    -------
    text : str
    """
    with open(filepath) as f:
        pdf = pdftotext.PDF(f)
    if page is not None:
        text = pdf[page]
    else:
        text = pdf.read_all()
    return text
It returns:
TypeError: 'pdftotext.PDF' object has no attribute '__getitem__'
For compatibility with more platforms and Python versions, consider using open(filepath, "rb") to open the PDF file explicitly in binary mode.
I added __getitem__ later, so you would just need a newer version for that.
Later I also dropped read_all, since it didn't really provide much value. You can get the same result with "\n\n".join(pdf). (I initially thought read_all would be faster, but it turned out to make almost no difference.)
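To illustrate the replacement for read_all, here is a minimal sketch that uses a plain list of strings as a stand-in for a pdftotext.PDF object, so it runs without a PDF file. The page strings and the join_pages helper are hypothetical names made up for this illustration; the only assumption is that the PDF object can be iterated page by page, as the "\n\n".join(pdf) suggestion implies.

```python
# Stand-in for a pdftotext.PDF object (assumption): a sequence of
# page strings. The contents are made up for illustration only.
pages = ["First page text", "Second page text", "Third page text"]


def join_pages(pdf, separator="\n\n"):
    """Concatenate every page into one string, separated by blank lines."""
    return separator.join(pdf)


all_text = join_pages(pages)

# Indexing the stand-in list is zero-based, so index 0 is the first page.
first_page = pages[0]
```

Since str.join accepts any iterable of strings, the same helper should work unchanged when it is given the real PDF object instead of the list.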
Sorry for all the changes, but I made sure to document them and to update the version number accordingly. This should do what you want:
import pdftotext

def get_text(filepath, page=None):
    with open(filepath, "rb") as f:
        pdf = pdftotext.PDF(f)
    if page is not None:
        text = pdf[page]
    else:
        text = "\n\n".join(pdf)
    return text