Crash when trying to extract table it can't quite find #216

Closed
stucka opened this issue May 27, 2020 · 4 comments
@stucka

stucka commented May 27, 2020

pdfplumber crashes when extract_table is called on a page where it can't quite find a table, whereas extract_tables (plural) returns an empty list for the same page. Should extract_table return None or an empty list?
The May 18 file from the page below works fine; the May 25 file crashes:
http://ldh.la.gov/index.cfm/page/3965

      1 for pagenumber, page in enumerate(pdf.pages):
----> 2     table = page.extract_table()

c:\python37\lib\site-packages\pdfplumber\page.py in extract_table(self, table_settings)
    177         # Return the largest table, as measured by number of cells.
    178         sorter = lambda x: (-len(x.cells), x.bbox[1], x.bbox[0])
--> 179         largest = list(sorted(tables, key=sorter))[0]
    180         return largest.extract()
    181 

IndexError: list index out of range
@jsvine
Owner

jsvine commented May 28, 2020

Thanks for flagging, @stucka! That's an oversight on my part, and changing the behavior sounds like a good idea.

@jsvine jsvine added the bug label May 28, 2020
@jsvine jsvine closed this as completed in d64afa8 May 28, 2020
@jsvine
Owner

jsvine commented May 28, 2020

Fixed and now available in v0.5.21. Thanks again!
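
For readers following along, here is a minimal sketch of the guard, based only on the lines visible in the tracebacks in this thread (the `len(tables) == 0` check and the sort key). It is not the library's actual source: `FakeTable`, `pick_largest`, and `rows_from_pages` are hypothetical names, and `FakeTable` is a stand-in carrying just the `cells` and `bbox` attributes that the sort key uses.

```python
from collections import namedtuple

# Hypothetical stand-in for a table object: only the two attributes
# used by the sort key shown in the traceback (cells, bbox).
FakeTable = namedtuple("FakeTable", ["cells", "bbox"])


def pick_largest(tables):
    """Return the largest table by cell count, or None if there are none."""
    # Guard added by the fix: an empty list would otherwise hit [0] below.
    if len(tables) == 0:
        return None
    # Most cells first, then topmost, then leftmost.
    sorter = lambda x: (-len(x.cells), x.bbox[1], x.bbox[0])
    return sorted(tables, key=sorter)[0]


def rows_from_pages(tables_per_page):
    """Caller-side loop that skips pages where no table was found."""
    picked = []
    for tables in tables_per_page:
        largest = pick_largest(tables)
        if largest is None:
            continue  # nothing detected on this page
        picked.append(largest)
    return picked
```

With a None return, a caller that iterates over the result directly (for example `for row in table:`) would need a similar `is None` check, since iterating over None raises a TypeError.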

@stucka
Author

stucka commented Jun 2, 2020

Thank you! Weirdly, still getting that though, and I can't figure out why from your code. It's now doing that on two of three versions of the same report, but the first one worked. Latest:
http://ldh.la.gov/assets/oph/Coronavirus/NursingHomes/NHReport053120.pdf
Download page: http://ldh.la.gov/index.cfm/page/3965

5/25 crashed on the penultimate version of pdfplumber. Here's the 5/31 file:


IndexError                                Traceback (most recent call last)
in
      4 masterlist = []
      5 for page in pdf.pages:
----> 6     table = page.extract_table()
      7     for row in table:
      8         line = OrderedDict()

c:\python37\lib\site-packages\pdfplumber\page.py in extract_table(self, table_settings)
    177
    178         if len(tables) == 0:
--> 179             return None
    180
    181         # Return the largest table, as measured by number of cells.

IndexError: list index out of range

@jsvine
Owner

jsvine commented Jul 18, 2020

Just a note to say that, in my tests, I'm not getting this error. Judging by the traceback in your most recent comment, I wonder whether it was a temporary environment issue, since the IndexError doesn't seem to match up with the code shown in the traceback. (I.e., return None shouldn't ever produce an IndexError, but perhaps I'm misreading.) In any case, if this issue persists for you, feel free to reopen this thread or start a new one. Thanks again for the initial bug report!
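
One possible explanation for the mismatch, offered as an assumption rather than something confirmed in this thread: Python reads the source lines shown in a traceback from disk at the moment the error is displayed. If the package is upgraded while an interpreter or notebook kernel still has the old module loaded, the old code raises the error, but the traceback prints whatever the new file has on that line. The sketch below reproduces the effect with a throwaway module; `demo_mod`, `first`, and `stale_traceback_demo` are hypothetical names used only for illustration.

```python
import importlib.util
import os
import tempfile
import traceback

# "Old" version indexes into the list; "new" version returns None instead.
OLD_SOURCE = "def first(items):\n    return items[0]\n"
NEW_SOURCE = "def first(items):\n    return None\n"


def stale_traceback_demo():
    """Load the old module, overwrite its file, then trigger the old bug."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo_mod.py")
        with open(path, "w") as f:
            f.write(OLD_SOURCE)

        # Import the old code into memory.
        spec = importlib.util.spec_from_file_location("demo_mod", path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)

        # Simulate an upgrade: the file changes, the loaded module does not.
        with open(path, "w") as f:
            f.write(NEW_SOURCE)

        try:
            module.first([])  # still runs the old, in-memory code
        except IndexError:
            # The formatted traceback quotes line 2 of the *new* file.
            return traceback.format_exc()
    return ""


if __name__ == "__main__":
    print(stale_traceback_demo())
```

If that is what happened here, restarting the interpreter or kernel after upgrading should load the fixed code.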
