Chapter 3 use utf-8-sig for encoding to exclude BOM in book file #225

pjhinton · 2019-08-27T16:52:35Z

In this section:

The following code is used to decode the bytes returned by the Project Gutenberg web server:

raw = response.read().decode('utf8')

With Python 3.7.4, the value of raw begins with a byte-order mark (BOM):

'\ufeffThe Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r'

and the return value of len(raw) will be 1176967 rather than 1176893.
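The leading character can be reproduced without hitting the network. This is only a minimal sketch: `sample` is a made-up byte string standing in for the downloaded file, not the actual book contents.

```python
import codecs

# Hypothetical stand-in for the downloaded bytes: the UTF-8 BOM
# (b'\xef\xbb\xbf') followed by a short piece of text.
sample = codecs.BOM_UTF8 + "The Project Gutenberg EBook".encode("utf-8")

# Plain 'utf8' decoding keeps the BOM as the character U+FEFF.
decoded = sample.decode("utf8")

print(repr(decoded[0]))              # '\ufeff'
print(decoded.startswith("\ufeff"))  # True
```
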

The Python Unicode HOWTO recommends the use of utf-8-sig as an encoding value to exclude the BOM, which really isn't needed for UTF-8.

from urllib import request

url = 'http://www.gutenberg.org/files/2554/2554-0.txt'

response = request.urlopen(url)

raw = response.read().decode(encoding='utf-8-sig')

type(raw)

str

len(raw)

raw[:75]

'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n'
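As far as I can tell, utf-8-sig is also safe for files that do not start with a BOM, since the decoder only skips the three BOM bytes when they are present. A small sketch of that behavior, where `decode_text` is a hypothetical helper name and the byte strings are made up for illustration:

```python
import codecs


def decode_text(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading BOM if one is present."""
    return data.decode("utf-8-sig")


text = "Crime and Punishment"

# Made-up inputs: the same text with and without a leading UTF-8 BOM.
with_bom = codecs.BOM_UTF8 + text.encode("utf-8")
without_bom = text.encode("utf-8")

print(decode_text(with_bom) == text)     # True
print(decode_text(without_bom) == text)  # True
```

So the change should not affect other Project Gutenberg files that are served without a BOM.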


stevenbird self-assigned this Sep 4, 2019

stevenbird added the Chapter 4 label Sep 4, 2019
