python3 - TypeError: ord() expected string of length 1, but int found #254

Closed
LeoFCardoso opened this issue Mar 27, 2016 · 11 comments

@LeoFCardoso

I am getting this error when using python3 and this simple code:

imagepdf = PdfFileReader(open(sys.argv[1], 'rb'), strict=False)
textpdf = PdfFileReader(open(sys.argv[2], 'rb'), strict=False)
for i in range(imagepdf.getNumPages()):
    imagepage = imagepdf.getPage(i)
    textpage = textpdf.getPage(i)
    factor_x = textpage.mediaBox.upperRight[0] / imagepage.mediaBox.upperRight[0]
    factor_y = textpage.mediaBox.upperRight[1] / imagepage.mediaBox.upperRight[1]
    imagepage.scale(float(factor_x), float(factor_y))
    textpage.mergePage(imagepage) # imagepage stay on top
    textpage.compressContentStreams()
    output.addPage(textpage)

Trace:

Traceback (most recent call last):
  File "...", line 34, in <module>
    imagepage.scale(float(factor_x), float(factor_y))
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/PyPDF2/pdf.py", line 2493, in scale
    0, 0])
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/PyPDF2/pdf.py", line 2479, in addTransformation
    originalContent, self.pdf, ctm)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/PyPDF2/pdf.py", line 2180, in _addTransformationMatrix
    contents = ContentStream(contents, pdf)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/PyPDF2/pdf.py", line 2641, in __init__
    data += s.getObject().getData()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/PyPDF2/generic.py", line 837, in getData
    decoded._data = filters.decodeStreamData(self)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/PyPDF2/filters.py", line 350, in decodeStreamData
    data = LZWDecode.decode(data, stream.get("/DecodeParms"))
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/PyPDF2/filters.py", line 255, in decode
    return LZWDecode.decoder(data).decode()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/PyPDF2/filters.py", line 228, in decode
    cW = self.nextCode();
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/PyPDF2/filters.py", line 205, in nextCode
    nextbits=ord(self.data[self.bytepos])
TypeError: ord() expected string of length 1, but int found

Am I doing something wrong?

@mstamy2
Collaborator

mstamy2 commented May 19, 2016

Any chance you can post/send me the PDF(s) you're working with?

Most likely this is a Python 3 type-handling issue in the LZW decoding algorithm, in which case it is easily fixable.
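
For context, here is a minimal sketch of the kind of type issue involved. It is not the actual change made in PyPDF2, and the helper name `byte_at` is hypothetical. In Python 3, indexing a `bytes` object returns an `int`, so an expression like `ord(self.data[self.bytepos])` from the traceback raises `TypeError`. In Python 2 the same index returns a one-character `str`, which `ord()` accepts.

```python
def byte_at(data, pos):
    """Return the integer value of the byte at `pos` (hypothetical helper).

    Works whether indexing `data` yields an int (Python 3 bytes/bytearray)
    or a one-character string (Python 2 str).
    """
    value = data[pos]
    return value if isinstance(value, int) else ord(value)


data = b"\x80\x0b"

# On Python 3 this reproduces the failure mode from the traceback.
try:
    ord(data[0])
except TypeError as exc:
    error = str(exc)
else:
    error = None

# The guarded version returns the byte value without raising.
first = byte_at(data, 0)
```

A guard like this is one possible way to keep a single code path working on both interpreters; a Python 3 only code base could simply drop the `ord()` call.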

@LeoFCardoso
Author

My script is
https://github.com/LeoFCardoso/pdf2pdfocr/blob/master/pdf2pdfocr_multibackground.py

Here is the call:

python3.4 pdf2pdfocr_multibackground.py first.pdf second.pdf result.pdf

second.pdf
first.pdf

Thanks!

@mstamy2
Collaborator

mstamy2 commented May 23, 2016

5bbd5af should take care of these type issues

@mstamy2
Collaborator

mstamy2 commented May 24, 2016

Let me know of any further issues!

mstamy2 closed this as completed May 24, 2016

@LeoFCardoso
Author

It works! Thanks!

@LeoFCardoso
Author

When will the new version be available at https://pypi.python.org/pypi/PyPDF2?
Thanks!

@jguram

jguram commented Nov 15, 2017

199709222.pdf
I am getting the same issue. Can you please help me, @mstamy2?

Here is my code

import os

import PyPDF2
import xlsxwriter

search_words = []
os.chdir(r'C:\Users\jui\Documents\Parabole')
with open('allwords.txt') as f:
    for line in f:
        # Strip the trailing newline so count() can match the word.
        search_words.append(line.strip())
print(len(search_words))

# Raw string: in a normal string, '\t' in '\test' is read as a tab.
p1 = r'C:\data\pdfContainer\test'
file_name = []
for filename in os.listdir(p1):
    file_name.append(filename.lstrip())
print(file_name)

for i in file_name:
    print(i)
    os.chdir(p1)
    # Open the current file (i), not the leftover loop variable `filename`.
    pdf_file = open(i, 'rb')
    print('opening pdf file')
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    print(read_pdf.isEncrypted)

    number_of_pages = read_pdf.getNumPages()
    print(number_of_pages)
    pdf_content = ' '
    for j in range(number_of_pages):
        page = read_pdf.getPage(j)
        print(page)
        page_content = page.extractText()
        print(page_content)
        pdf_content = pdf_content + page_content
    pdf_file.close()
    #print(pdf_content)
    new_dict = {}
    for word in search_words:
        cnt_of_words = pdf_content.count(word)
        new_dict.update({word: cnt_of_words})
    #print(new_dict)
    p = i + '.xlsx'
    print(p)
    workbook = xlsxwriter.Workbook(p)
    worksheet = workbook.add_worksheet()
    row = 0
    col = 0
    for key in new_dict.keys():
        worksheet.write(row, col, key)
        worksheet.write(row, col + 1, new_dict[key])
        # Advance once per word so no blank rows are left between entries.
        row += 1

    workbook.close()

allwords.txt

@jguram

jguram commented Nov 15, 2017

@mstamy2 I want to count the occurrences of the words listed in allwords.txt in the PDF file mentioned above and write the counts to an Excel file.
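
As a rough illustration of the counting step only, here is a sketch that assumes the page text has already been extracted into a plain string. The function name `count_words` and the sample text are made up for this example. It counts whole-word, case-insensitive matches, which differs from `str.count()`: the latter would also count "return" inside "returns".

```python
import re
from collections import Counter


def count_words(text, search_words):
    """Count whole-word, case-insensitive occurrences of each search word."""
    tokens = Counter(re.findall(r"\w+", text.lower()))
    return {word: tokens[word.lower()] for word in search_words}


sample_text = "Risk and return: risk is priced, returns are not guaranteed."
counts = count_words(sample_text, ["risk", "return", "bond"])
```

This only handles single-word search terms; multi-word phrases would need a different matching strategy.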

@rameessahlu

> 5bbd5af should take care of these type issues

These fixes are still not merged.

@Jeff-Winchell

I have multiple files that trigger this error.
I am currently running a loop over 177,000 PDF files. It blows up with that error in the extractText() function. Yes, I am running Python 3, which was shipped A DECADE AGO.

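for index, row in Files.iterrows():
    filename = row.Filename
    pagenum = None  # stays None if the file fails before any page is read
    try:
        pdffile = PdfFileReader(filename)
        for pagenum in range(pdffile.numPages):
            text = pdffile.getPage(pagenum).extractText()
            foo = [word.lower() for word in tokenizer.tokenize(text)
                   if word.lower() not in stopWords and not word.isdigit()]
    except Exception as exc:
        # Report which file and page failed, and why.
        print(index, filename, pagenum, repr(exc))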

The bug has been triggered 6 times in the first 1255 files, so I'm guessing the error rate is about 0.5%.
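
Spelled out, the estimate looks like the sketch below. It assumes the failure rate seen in the first 1255 files holds for the rest of the 177,000, which is only a guess.

```python
failures = 6
files_checked = 1255
total_files = 177000

# Observed failure rate in the sample checked so far.
error_rate = failures / files_checked

# Projected number of failing files if the rate stays the same.
expected_failures = round(error_rate * total_files)

print("{:.2%}".format(error_rate))
print(expected_failures)
```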

@misilot

misilot commented Sep 24, 2020

Can a release be made that includes this fix?
