RSCHTMLReader throws bytes/string error #8

chemlynx · 2016-10-13T06:59:09Z

I'm using Python 3.5 and trying to process an RSC article (DOI 10.1039/C6OB02074G).

I see the error:

TypeError: %b requires bytes, or an object that implements __bytes__, not 'str'

The issue seems to be with the replace_rsc_img_chars function in rsc.py.

Looking at it, the matches obtained from parsing the entity XPath (u1 and u2) are unicode strings (see lines 270 and 272). u1 and u2 are then used to generate rep (line 276), where the code tries to insert a unicode string into a byte string.
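
For anyone who wants to see the failure in isolation, here is a minimal sketch that triggers the same TypeError on Python 3.5 or later. The template, the value of `u1`, and the helper `format_bytes_template` are made up for illustration and are not copied from rsc.py.

```python
# Illustrative only: the template and names are assumptions, not the rsc.py code.
u1 = 'e5f8'  # stand-in for a text (unicode) value returned by an XPath query


def format_bytes_template(value):
    """Interpolate value into a bytes template using %-formatting."""
    return b'&#x%s;' % value


try:
    # On Python 3.5+, %s in a bytes template only accepts bytes-like
    # arguments, so passing a str raises TypeError.
    format_bytes_template(u1)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# The same interpolation succeeds once the argument is encoded to bytes.
ok = format_bytes_template(u1.encode('ascii'))
```

The exact wording of the message varies a little between Python 3 releases, but in each case it complains that `%b` needs bytes rather than `str`.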


mcs07 · 2016-10-13T12:43:29Z

Thanks. There have been a lot of these encoding bugs because I haven't properly tested under Python 3. In this case, the cause is that the lxml parser returns byte strings in Python 2 but unicode strings in Python 3. I've committed a fix and will push a new version pending testing.
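
One common way to guard against this kind of Python 2/3 difference is to normalise values to text before formatting them. The sketch below shows that general pattern only; it is not the committed fix, and `to_text` and `build_replacement` are hypothetical names that do not come from the library.

```python
def to_text(value, encoding='utf-8'):
    """Return value as a text string, decoding it first if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def build_replacement(u1, u2):
    """Hypothetical helper: join two matched values as text.

    The real logic in replace_rsc_img_chars is not shown in this thread,
    so the output format here is purely illustrative.
    """
    return '{}-{}'.format(to_text(u1), to_text(u2))


# Mixed inputs are handled the same way regardless of what the parser returns.
mixed = build_replacement(b'e5', 'f8')
```

Converting at the boundary where parser output enters the code means the later string handling never has to care which type the parser returned.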

mcs07 added a commit that referenced this issue Oct 13, 2016

Fix encoding bug in RSC image character handling - fixes #8

020cc21

mcs07 mentioned this issue Oct 13, 2016

Fix encoding bug in RSC image character handling #9

Merged

mcs07 closed this as completed in #9 Oct 13, 2016

mcs07 added the bug label Feb 2, 2017
