Commit
MAINT: Quadratic runtime while parsing reduced to linear (#808)
When the PdfFileReader tries to find the xref marker, the readNextEndLine method builds a so-called line by reading byte by byte. Every time a new byte is read, it is concatenated with the line read so far. Because Python strings (including byte strings) are immutable, each concatenation copies the whole line, which leads to quadratic runtime O(n²), where n is the size of the file. For files where the xref marker cannot be found at the end, this takes an enormous amount of time:

* 1 MB of zeros at the end: 45.54 seconds
* 2 MB of zeros at the end: 357.04 seconds

(measured on a laptop made in 2015)

This pull request changes the relevant section of the code to run in linear time O(n), leading to a run time of less than a second for both cases mentioned above. Furthermore, this PR adds a regression test.
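The difference between the two approaches can be sketched as follows. This is a simplified illustration, not the actual PyPDF2 code: the function names `read_prev_line_quadratic` and `read_prev_line_linear` are hypothetical, and the sketch assumes a seekable binary stream that is read backwards one byte at a time until an end-of-line byte is found.

```python
import io


def read_prev_line_quadratic(stream):
    """Read backwards to the previous EOL, prepending each byte (hypothetical helper).

    Each `byte + line` allocates a new bytes object and copies the
    whole line read so far, so the total work grows quadratically.
    """
    line = b""
    while stream.tell() > 0:
        stream.seek(-1, io.SEEK_CUR)   # step back onto the previous byte
        byte = stream.read(1)          # read it (this moves forward again)
        stream.seek(-1, io.SEEK_CUR)   # undo the forward move
        if byte in (b"\n", b"\r"):
            break
        line = byte + line             # copies the entire line every time
    return line


def read_prev_line_linear(stream):
    """Same traversal, but collect the bytes in a list and join once (hypothetical helper)."""
    chunks = []
    while stream.tell() > 0:
        stream.seek(-1, io.SEEK_CUR)
        byte = stream.read(1)
        stream.seek(-1, io.SEEK_CUR)
        if byte in (b"\n", b"\r"):
            break
        chunks.append(byte)            # amortized constant time per byte
    chunks.reverse()                   # bytes were collected back to front
    return b"".join(chunks)            # a single copy at the end
```

Both functions return the same bytes; only the way the line is assembled differs. Appending to a list and joining once at the end is a common way to avoid repeated copying of immutable byte strings.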