TypeError: 'pdftotext.PDF' object has no attribute '__getitem__' #6

Closed
MartinThoma opened this issue Aug 2, 2017 · 1 comment

Comments

@MartinThoma

This does not work:

import pdftotext

def get_text(filepath, page=None):
    """
    Extract text from a PDF

    Parameters
    ----------
    filepath : str
        Path to a PDF file
    page : int or None

    Returns
    -------
    text : str
    """
    with open(filepath) as f:
        pdf = pdftotext.PDF(f)
    if page is not None:
        text = pdf[page]
    else:
        text = pdf.read_all()
    return text

It raises:

TypeError: 'pdftotext.PDF' object has no attribute '__getitem__'
@jalan
Owner

jalan commented Aug 2, 2017

A few things here:

  • For compatibility with more platforms and Python versions, consider open(filepath, "rb") to explicitly open the PDF file in binary mode.
  • I added __getitem__ later, so you would just need a newer version for that.
  • Later I also dropped read_all, since it didn't really provide much value. You can get the same result with "\n\n".join(pdf). (I initially thought read_all would be faster, but it turned out to make almost no difference.)

Sorry for all the changes, but I made sure to document them and to update the version number accordingly. This should do what you want:

import pdftotext

def get_text(filepath, page=None):
    with open(filepath, "rb") as f:
        pdf = pdftotext.PDF(f)
    if page is not None:
        text = pdf[page]
    else:
        text = "\n\n".join(pdf)
    return text
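
If it helps, the page selection can be split out so it can be tried without a PDF on hand. This is only a sketch: join_pages is a hypothetical helper name, not part of pdftotext, and the explicit bounds check is an assumption about what a caller might want rather than something the library requires. It relies only on the object supporting len(), indexing, and iteration, with pages counted from zero.

```python
def join_pages(pages, page=None):
    """Return one page, or every page joined by blank lines.

    `pages` is any sequence of page strings, for example a pdftotext.PDF
    object. This helper is hypothetical and not part of pdftotext.
    """
    if page is None:
        # Same result the removed read_all() gave, per the comment above.
        return "\n\n".join(pages)
    # Assumption: fail early with a clear message on an out-of-range page.
    if not 0 <= page < len(pages):
        raise IndexError("page %d out of range (0..%d)" % (page, len(pages) - 1))
    return pages[page]


def get_text(filepath, page=None):
    # Imported here so join_pages can be exercised without pdftotext installed.
    import pdftotext

    # Binary mode, as suggested above.
    with open(filepath, "rb") as f:
        pdf = pdftotext.PDF(f)
    return join_pages(pdf, page)
```

With a plain list standing in for the PDF object, join_pages(["first", "second"], 1) picks the second page and join_pages(["first", "second"]) joins both with a blank line between them.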

@jalan jalan closed this as completed Aug 6, 2017