
Chapter 3 use utf-8-sig for encoding to exclude BOM in book file #225

Open
pjhinton opened this issue Aug 27, 2019 · 0 comments

In this section:

http://www.nltk.org/book/ch03.html#accessing-text-from-the-web-and-from-disk

The following code is used to decode the bytes returned by the Project Gutenberg web server:

raw = response.read().decode('utf8')

With Python 3.7.4, the decoded value of raw begins with a byte-order mark (BOM, U+FEFF):

'\ufeffThe Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r'

and len(raw) returns 1176967 rather than the book's 1176893.

The Python Unicode HOWTO recommends decoding with the utf-8-sig codec, which strips a leading BOM if one is present. UTF-8 has no byte-order ambiguity, so the BOM is not needed.

>>> from urllib import request
>>> url = 'http://www.gutenberg.org/files/2554/2554-0.txt'
>>> response = request.urlopen(url)
>>> raw = response.read().decode(encoding='utf-8-sig')
>>> type(raw)
str
>>> len(raw)
1176966
>>> raw[:75]
'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n'
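
The difference between the two codecs can also be checked without a network request. The following is a minimal sketch, assuming a short in-memory byte string in place of the downloaded file; the `payload` and `plain` samples are hypothetical and not taken from the Gutenberg text.

```python
import codecs

# Hypothetical sample: UTF-8 bytes that start with a BOM (EF BB BF),
# standing in for the bytes returned by response.read().
payload = codecs.BOM_UTF8 + "The Project Gutenberg EBook".encode("utf-8")

# 'utf-8' keeps the BOM as a leading U+FEFF character.
with_bom = payload.decode("utf-8")

# 'utf-8-sig' drops a leading BOM if one is present.
without_bom = payload.decode("utf-8-sig")

print(with_bom.startswith("\ufeff"))     # whether the BOM survived decoding
print(len(with_bom) - len(without_bom))  # length difference caused by the BOM

# 'utf-8-sig' also decodes input that has no BOM, leaving it unchanged.
plain = "no BOM here".encode("utf-8")
print(plain.decode("utf-8-sig") == plain.decode("utf-8"))
```

Because utf-8-sig accepts input with or without a BOM, it should be safe to use for decoding even if the served file later drops the mark.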

@stevenbird self-assigned this on Sep 4, 2019