DictReader doesn't work with io.StringIO (Python 2.7) #42

Closed
arielpontes opened this issue Jan 26, 2015 · 1 comment

As described in this SO question, I am getting the following error with unicodecsv.DictReader:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe3' in position 2: ordinal not in range(128)

Here's a simplified version of my code:

from io import StringIO
from unicodecsv import DictReader, Dialect, QUOTE_MINIMAL

data = (
    'first_name,last_name,email\r'
    'Elmer,Fudd,elmer@looneytunes.com\r'
    'Jo\xc3\xa3o Ant\xc3\xb4nio,Ara\xc3\xbajo,joaoantonio@araujo.com\r'
)

unicode_data = StringIO(unicode(data, 'utf-8-sig'), newline=None)

class CustomDialect(Dialect):
    delimiter = ','
    doublequote = True
    escapechar = '\\'
    lineterminator = '\r\n'
    quotechar = '"'
    quoting = QUOTE_MINIMAL
    skipinitialspace = True

rows = DictReader(unicode_data, dialect=CustomDialect)

for row in rows:
    print row

If I replace StringIO with BytesIO, the encoding works, but I can no longer pass the newline argument, and then I get:

Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?

jdunck (Owner) commented Jan 26, 2015

It may not be clear from the docs, but the input to a unicodecsv reader is expected to be bytes (str in Python 2), not unicode, so you should use BytesIO rather than StringIO.
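
As a minimal sketch of that point, assuming the text is already a unicode string and UTF-8 is the desired encoding, the data can be encoded back to bytes before it is wrapped. The helper name `to_byte_stream` is hypothetical and not part of unicodecsv; the sketch stops short of calling the reader so that it only depends on the standard library.

```python
import io


def to_byte_stream(text, encoding='utf-8'):
    """Encode unicode text and wrap the resulting bytes in a BytesIO.

    Hypothetical helper: unicodecsv readers expect a byte stream, so the
    text is encoded first instead of being handed over as unicode.
    """
    return io.BytesIO(text.encode(encoding))


# The returned stream is what would be passed to unicodecsv.DictReader.
stream = to_byte_stream(u'first_name,last_name\nJo\xe3o,Ara\xfajo\n')
```

The reader then decodes the bytes itself, so the encoding chosen here has to match the one the reader is configured with.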

Testing with BytesIO rather than StringIO, I do see the "new-line character seen in unquoted field" error. I think this is a bug in the underlying csv module. The docs at https://docs.python.org/2/library/csv.html#csv.Dialect.lineterminator say: "The reader is hard-coded to recognise either '\r' or '\n' as end-of-line, and ignores lineterminator. This behavior may change in the future."

Using \r (as in your original data), I can also reproduce this with a regular file, without io.BytesIO or io.StringIO, even though unicodecsv does generally work with files.

When I change your data to use \n rather than \r, the code works:

from io import StringIO, BytesIO
from unicodecsv import DictReader, Dialect, QUOTE_MINIMAL

data = (
    'first_name,last_name,email\n'
    'Elmer,Fudd,elmer@looneytunes.com\n'
    'Jo\xc3\xa3o Ant\xc3\xb4nio,Ara\xc3\xbajo,joaoantonio@araujo.com\n'
)

unicode_data = StringIO(unicode(data, 'utf-8-sig'), newline=None)
str_data = BytesIO(data)

class CustomDialect(Dialect):
    delimiter = ','
    doublequote = True
    escapechar = '\\'
    lineterminator = '\r'
    quotechar = '"'
    quoting = QUOTE_MINIMAL
    skipinitialspace = True

rows = DictReader(str_data, dialect=CustomDialect)

for row in rows:
    print row

Unfortunately I don't see a way to fix this from within unicodecsv.
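
One possible caller-side workaround, sketched here as an assumption rather than anything unicodecsv provides, is to rewrite the line endings in the raw bytes before wrapping them in BytesIO. This roughly mimics what newline=None does for StringIO. The helper name `normalize_newlines` is hypothetical, and the sketch assumes no quoted field contains a literal \r that must be preserved.

```python
import io


def normalize_newlines(raw):
    """Convert '\\r\\n' and bare '\\r' line endings to '\\n' in a byte string.

    Hypothetical helper. It also rewrites any '\\r' that sits inside a
    quoted field, so it is only safe when fields contain no such bytes.
    """
    # Handle '\r\n' first so it does not turn into two newlines.
    return raw.replace(b'\r\n', b'\n').replace(b'\r', b'\n')


raw = b'first_name,last_name\rElmer,Fudd\r'

# The normalized stream is what would be passed to unicodecsv.DictReader.
str_data = io.BytesIO(normalize_newlines(raw))
```

This reads the whole input into memory, which is fine for small uploads but may not suit very large files.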
