TypeError: 'pdftotext.PDF' object has no attribute '__getitem__' #6

MartinThoma · 2017-08-02T14:05:26Z

This does not work:

import pdftotext

def get_text(filepath, page=None):
    """
    Extract text from a PDF

    Parameters
    ----------
    filepath : str
        Path to a PDF file
    page : int or None

    Returns
    -------
    text : str
    """
    with open(filepath) as f:
        pdf = pdftotext.PDF(f)
    if page is not None:
        text = pdf[page]
    else:
        text = pdf.read_all()
    return text

It raises:

TypeError: 'pdftotext.PDF' object has no attribute '__getitem__'

jalan · 2017-08-02T16:25:57Z

A few things here:

For compatibility with more platforms and/or python versions, consider open(filepath, "rb") to explicitly open the PDF file in binary mode.
I added __getitem__ later, so you would just need a newer version for that.
Later I also dropped read_all, since it didn't really provide much value. You can get the same result with "\n\n".join(pdf). (I initially thought read_all would be faster, but it turned out to make almost no difference.)

Sorry for all the changes, but I made sure to document them and to update the version number accordingly. This should do what you want:

import pdftotext

def get_text(filepath, page=None):
    with open(filepath, "rb") as f:
        pdf = pdftotext.PDF(f)
    if page is not None:
        text = pdf[page]
    else:
        text = "\n\n".join(pdf)
    return text
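
The page-selection logic above can be tried without a PDF at hand. This is only a minimal sketch, assuming a pdftotext.PDF object behaves like a sequence of page strings (zero-based indexing, len(), iteration). The helper name join_pages is hypothetical, and a plain list stands in for the PDF object:

```python
def join_pages(pages, page=None, sep="\n\n"):
    """Return a single page, or all pages joined by `sep`.

    `pages` is any sequence of strings. In real use it would be a
    pdftotext.PDF object (assumption: it supports len() and indexing).
    """
    if page is None:
        # Same idea as "\n\n".join(pdf) in the comment above.
        return sep.join(pages)
    if not 0 <= page < len(pages):
        raise IndexError(
            "page %d out of range (document has %d pages)" % (page, len(pages))
        )
    return pages[page]


# A plain list stands in for a pdftotext.PDF object in this sketch.
fake_pdf = ["first page", "second page"]
print(join_pages(fake_pdf))          # both pages, separated by a blank line
print(join_pages(fake_pdf, page=1))  # second page only (zero-based index)
```

The explicit range check is a design choice for clearer error messages. Plain indexing would also raise on an out-of-range page.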

jalan closed this as completed Aug 6, 2017
