Encoding issue: PyPDF2.utils.PdfReadError: Illegal character in Name Object #438

Closed
gitzjm opened this issue Jun 23, 2018 · 12 comments
Labels
is-bug: From a user's perspective, this is a bug - a violation of the expected behavior with a compliant PDF
needs-pdf: The issue needs a PDF file to show the problem

Comments

@gitzjm
Contributor

gitzjm commented Jun 23, 2018

(Solved) While adding a watermark to a PDF, I hit the following error, which says that my Name Object contains an illegal character:

Traceback (most recent call last):
  File "E:/test/水印/PDF水印.py", line 66, in <module>
    add_watermark("111111.pdf",r"F:\SVN代码\repository\back\ninstar_demo1\static\watermark\logo_watermark.pdf","output")
  File "E:/test/水印/PDF水印.py", line 61, in add_watermark
    pdf_output.write(output_stream)
  File "C:\Program Files (x86)\Python36-32\lib\site-packages\PyPDF2\pdf.py", line 482, in write
    self._sweepIndirectReferences(externalReferenceMap, self._root)
  File "C:\Program Files (x86)\Python36-32\lib\site-packages\PyPDF2\pdf.py", line 571, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "C:\Program Files (x86)\Python36-32\lib\site-packages\PyPDF2\pdf.py", line 547, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "C:\Program Files (x86)\Python36-32\lib\site-packages\PyPDF2\pdf.py", line 571, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "C:\Program Files (x86)\Python36-32\lib\site-packages\PyPDF2\pdf.py", line 547, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "C:\Program Files (x86)\Python36-32\lib\site-packages\PyPDF2\pdf.py", line 556, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, data[i])
  File "C:\Program Files (x86)\Python36-32\lib\site-packages\PyPDF2\pdf.py", line 571, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "C:\Program Files (x86)\Python36-32\lib\site-packages\PyPDF2\pdf.py", line 547, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "C:\Program Files (x86)\Python36-32\lib\site-packages\PyPDF2\pdf.py", line 547, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "C:\Program Files (x86)\Python36-32\lib\site-packages\PyPDF2\pdf.py", line 547, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "C:\Program Files (x86)\Python36-32\lib\site-packages\PyPDF2\pdf.py", line 577, in _sweepIndirectReferences
    newobj = data.pdf.getObject(data)
  File "C:\Program Files (x86)\Python36-32\lib\site-packages\PyPDF2\pdf.py", line 1611, in getObject
    retval = readObject(self.stream, self)
  File "C:\Program Files (x86)\Python36-32\lib\site-packages\PyPDF2\generic.py", line 66, in readObject
    return DictionaryObject.readFromStream(stream, pdf)
  File "C:\Program Files (x86)\Python36-32\lib\site-packages\PyPDF2\generic.py", line 579, in readFromStream
    value = readObject(stream, pdf)
  File "C:\Program Files (x86)\Python36-32\lib\site-packages\PyPDF2\generic.py", line 60, in readObject
    return NameObject.readFromStream(stream, pdf)
  File "C:\Program Files (x86)\Python36-32\lib\site-packages\PyPDF2\generic.py", line 492, in readFromStream
    raise utils.PdfReadError("Illegal character in Name Object")
PyPDF2.utils.PdfReadError: Illegal character in Name Object

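From the code I can see that the page streams have already been merged, so in theory the watermark has been applied; the exception is thrown only when writing to the output file.
I found that it is raised at line 484 of generic.py:
return NameObject(name.decode('utf-8'))
Because my PDF is in Chinese, I suspected an encoding problem, so I changed utf-8 to GBK,
but then a different exception appeared:
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 8-9: ordinal not in range(256)
I traced this exception to line 238 of utils.py:
r = s.encode('latin-1')
This is another encoding problem. At first I replaced latin-1 with utf-8. The file could then be written, but the text layout was scrambled and a lot of text was missing. I guessed that the PDF contains text in more than one encoding, so I changed the code here to: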

            try:
                r = s.encode('latin-1')
                if len(s) < 2:
                    bc[s] = r
                return r
            except Exception as e:
                print(s)
                r = s.encode('utf-8')
                if len(s) < 2:
                    bc[s] = r
                return r

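This solved the problem, but I suspect that other similar exceptions will still occur. I hope the maintainers will look into compatibility with the different character encodings used in PDFs.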
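The fallback above can be tried outside PyPDF2 with a minimal sketch. It assumes nothing about the library beyond the `s.encode('latin-1')` line quoted in this comment. The helper name `encode_with_fallback` and the sample names are hypothetical, and the `bc` cache from the original snippet is left out for brevity.

```python
def encode_with_fallback(s: str) -> bytes:
    """Encode s as latin-1, falling back to UTF-8 when latin-1 cannot represent it.

    Hypothetical helper for illustration only; not part of PyPDF2.
    """
    try:
        return s.encode("latin-1")
    except UnicodeEncodeError:
        # Characters above U+00FF (for example Chinese) do not exist in latin-1.
        return s.encode("utf-8")


# Plain ASCII text takes the latin-1 branch.
print(encode_with_fallback("/Font"))

# Text with Chinese characters would raise UnicodeEncodeError under latin-1
# alone; here it is encoded as UTF-8 instead.
print(encode_with_fallback("/宋体"))
```

One caveat of this approach: the output can then mix latin-1 and UTF-8 bytes, and nothing in the bytes themselves records which encoding was used for a given string.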

@zhangsanfu

the same problem

@brchiu

brchiu commented Sep 28, 2018

I have the same problem and your fix works for me.
Could you send a pull request to the author?

@gitzjm
Contributor Author

gitzjm commented Oct 5, 2018

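> I have the same problem and your fix works for me.
> Could you send a pull request to the author?

Hello to our friend in Taiwan:
I have submitted a pull request. The specific fix is as follows.
Replace the code on line 486 of generic.py:
return NameObject(name.decode('utf-8'))
with: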

        try:
            ret=name.decode('utf-8')
        except (UnicodeEncodeError, UnicodeDecodeError) as e:
            ret=name.decode('gbk')
        return NameObject(ret)

Then replace lines 238-241 of utils.py:

            r = s.encode('latin-1')
            if len(s) < 2:
                bc[s] = r
            return r

with:
```
try:
    r = s.encode('latin-1')
    if len(s) < 2:
        bc[s] = r
    return r
except Exception as e:
    print(s)
    r = s.encode('utf-8')
    if len(s) < 2:
        bc[s] = r
    return r
```

That is all that is needed.
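The decode side of the generic.py change can also be tried in isolation. This is a sketch only: `decode_name` is a hypothetical helper, not a PyPDF2 function, and the sample name `/宋体` is invented for the demonstration.

```python
def decode_name(raw: bytes) -> str:
    """Decode raw name bytes as UTF-8, falling back to GBK on failure.

    Hypothetical helper for illustration only; not part of PyPDF2.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # bytes.decode() signals invalid input with UnicodeDecodeError.
        return raw.decode("gbk")


# Bytes written by a GBK producer are not valid UTF-8, so the fallback is used.
print(decode_name("/宋体".encode("gbk")))

# Bytes that are already valid UTF-8 (including plain ASCII) decode directly.
print(decode_name("/宋体".encode("utf-8")))
print(decode_name(b"/Font"))
```

Trying UTF-8 first is a heuristic: some GBK byte sequences also happen to be valid UTF-8 and would then be decoded to the wrong characters without any error, and files in other legacy encodings are not covered at all.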

@eagleoflqj

try:
    r = s.encode('latin-1')
except:
    r = s.encode('utf-8')
if len(s) < 2:
    bc[s] = r
return r

cbbing added a commit to cbbing/PyPDF2 that referenced this issue Oct 30, 2019
cbbing added a commit to cbbing/PyPDF2 that referenced this issue Oct 30, 2019
@Vimos mentioned this issue Nov 21, 2019
@zuiyuewentian

I ran into the same problem, so I rebuilt the package and published it here:
https://github.com/zuiyuewentian/PyPDF2/releases/tag/v1.26.1

@MartinThoma added the is-bug label Apr 8, 2022
@MartinThoma changed the title from "编码问题: PyPDF2.utils.PdfReadError: Illegal character in Name Object" to "Encoding issue: PyPDF2.utils.PdfReadError: Illegal character in Name Object" Jun 27, 2022
@MartinThoma
Member

Do you still get the same issue with the latest PyPDF2?

Can somebody share a pdf that causes it?

@MartinThoma added the needs-pdf label Jul 29, 2022
@MartinThoma
Member

I'm closing this issue now as it might have been solved with the latest improvements. Please let me know if it wasn't solved by the latest PyPDF2 version.

Also, please share a PDF which causes issues!

@michelle-chou25

I ran into the same problem again, the same one the author reported.

@zuiyuewentian

zuiyuewentian commented Apr 24, 2023 via email

@pubpub-zz
Collaborator

> I met the same problem again, same as the author

@michelle-chou25
Without the PDF and the code, we cannot do a complete analysis. Please open a new issue and provide the data.

@lwdsw

lwdsw commented Oct 30, 2023

> Do you still get the same issue with the latest PyPDF2?
>
> Can somebody share a pdf that causes it?

Here, use this one of mine:
工艺流程图.pdf

@stefan6419846
Collaborator

@lwdsw Please open a new issue for it with your code and the PDF file as well as an English description. Note that PyPDF2 is deprecated and should be migrated to pypdf.

10 participants