read_csv fails for UTF-16 with BOM (maybe also other encodings with BOM) and skiprows #2298

Closed
gerigk opened this issue Nov 20, 2012 · 10 comments

Comments

@gerigk
commented Nov 20, 2012

Unfortunately I can't send the file, but here is the output of head filename -n 7:

��Name  Ad performance report                           
Type    Ad                                  
Frequency   One time                            
Date range  Custom date range                       
Dates   Sep 19, 2012-Nov 19, 2012                       
Account Day Campaign    Ad group    Ad ID   Client name Destination URL Impressions Clicks  Cost    Avg. position   Status  Conv. (1-per-click)
Categories 2    15.11.2012  something: ��;�C�7�:�8� [somethinglse]{test}: ��;�C�7�:�8�  16902484818 Categories 2    http://www.someurl?ad=291012    333 2   4.7 5.5 approved    0

I guess that the beginning of the file is the BOM and that this causes problems when skipping the rows. Without skiprows everything gets read into one row with the first column containing the BOM.

<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Data columns:
��Name\tAd performance report\t\t\t\t\t\t\t\t\t\t\t
...

The error raised is:

pd.read_csv('/home/arthur/Desktop/client 139 - ads report/test_pandas.csv', sep='\t', skiprows=5)
/usr/local/lib/python2.7/dist-packages/pandas-0.9.2.dev_b8dae94-py2.7-linux-x86_64.egg/pandas/io/parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, header, index_col, names, skiprows, skipfooter, skip_footer, na_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze)
    361                     buffer_lines=buffer_lines)
    362 
--> 363         return _read(filepath_or_buffer, kwds)
    364 
    365     parser_f.__name__ = name

/usr/local/lib/python2.7/dist-packages/pandas-0.9.2.dev_b8dae94-py2.7-linux-x86_64.egg/pandas/io/parsers.pyc in _read(filepath_or_buffer, kwds)
    185 
    186     # Create the parser.
--> 187     parser = TextFileReader(filepath_or_buffer, **kwds)
    188 
    189     if nrows is not None:

/usr/local/lib/python2.7/dist-packages/pandas-0.9.2.dev_b8dae94-py2.7-linux-x86_64.egg/pandas/io/parsers.pyc in __init__(self, f, engine, **kwds)
    465         self.options, self.engine = self._clean_options(options, engine)
    466 
--> 467         self._make_engine(self.engine)
    468 
    469     def _get_options_with_defaults(self, engine):

/usr/local/lib/python2.7/dist-packages/pandas-0.9.2.dev_b8dae94-py2.7-linux-x86_64.egg/pandas/io/parsers.pyc in _make_engine(self, engine)
    567     def _make_engine(self, engine='c'):
    568         if engine == 'c':
--> 569             self._engine = CParserWrapper(self.f, **self.options)
    570         else:
    571             if engine == 'python':

/usr/local/lib/python2.7/dist-packages/pandas-0.9.2.dev_b8dae94-py2.7-linux-x86_64.egg/pandas/io/parsers.pyc in __init__(self, src, **kwds)
    787         ParserBase.__init__(self, kwds)
    788 
--> 789         self._reader = _parser.TextReader(src, **kwds)
    790 
    791         # XXX

/usr/local/lib/python2.7/dist-packages/pandas-0.9.2.dev_b8dae94-py2.7-linux-x86_64.egg/pandas/_parser.so in pandas._parser.TextReader.__cinit__ (pandas/src/parser.c:3579)()

/usr/local/lib/python2.7/dist-packages/pandas-0.9.2.dev_b8dae94-py2.7-linux-x86_64.egg/pandas/_parser.so in pandas._parser.TextReader._get_header (pandas/src/parser.c:4590)()

CParserError: Passed header=0 but only 0 lines in file

@ghost

commented Dec 2, 2012

you should specify the encoding arg explicitly when reading in non-ascii files, but even with that
it's not functioning, so this is a bug.
as a temporary workaround, you can try

pd.read_csv('data.csv', sep='\t', skiprows=5,encoding='utf-16le',engine='python')

it'll use the slower python parser, but should work.
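
The idea behind the workaround (decode in Python first, then skip rows and split on the delimiter) can be sketched with the standard library alone. This is only an illustration, not what pandas does internally: read_utf16_tsv and the sample data are hypothetical, and the code is written for Python 3.

```python
import csv
import io
import itertools
import os
import tempfile


def read_utf16_tsv(path, skiprows=0):
    """Decode a UTF-16 file in Python, skip leading rows, parse the rest as TSV."""
    # encoding='utf-16' consumes the BOM and uses it to pick the byte order.
    with io.open(path, encoding='utf-16', newline='') as f:
        rows = csv.reader(f, delimiter='\t')
        # Note: this skips parsed rows, which equals physical lines only
        # when no quoted field contains a line break.
        return list(itertools.islice(rows, skiprows, None))


# Hypothetical sample standing in for the report described in the issue.
sample = u"Type\tAd\nFrequency\tOne time\nAccount\tDay\nCategories 2\t15.11.2012\n"
fd, tmp_path = tempfile.mkstemp(suffix='.csv')
with os.fdopen(fd, 'wb') as f:
    f.write(sample.encode('utf-16'))  # the plain 'utf-16' codec writes a BOM first

table = read_utf16_tsv(tmp_path, skiprows=2)
os.remove(tmp_path)
print(table)
```

Because the text is decoded before any row handling, the BOM never reaches the row-skipping or delimiter logic.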

@ghost

commented Dec 2, 2012

df=pd.read_csv("2.csv",sep=u'\t'.encode('utf-16'),encoding='utf-16')

comes close, but the column names are not properly decoded into unicode.
if you set them manually to ascii/unicode values, the dataframe is fine.

there's work to do here obviously.

edit: looks like the index is not decoded properly as well.
with the hack in 3e76878

wesm added a commit that referenced this issue Dec 2, 2012
@wesm

Member

commented Dec 2, 2012

I wrote a unit test to try to replicate. Encoding unicode with utf-16 adds the BOM, and it seems it can be successfully read using read_csv(path, encoding='utf-16', skiprows=n). ?

@ghost

commented Dec 2, 2012

the test passes, but the following raises yet another error in ipython:

import pandas as pd
import random
import pandas.util.testing as tm
import os

data = u"""skip this
skip this too
A,B,C
1,2,3
4,5,6"""
path = '/tmp/1.csv'
enc='utf-16'
bytes = data.encode(enc)
with open(path, 'wb') as f:
    f.write(bytes)

result = pd.read_csv(path, encoding=enc, skiprows=2)
#    expected = pd.read_csv(path,encoding=enc, skiprows=2,engine='python')
#   tm.assert_frame_equal(result, expected)

works with enc='ascii' though. Am I missing something?
once that's working, I think comparing engine:c and engine:python
will surface the issue.

@gerigk

Author

commented Dec 2, 2012

I still get an error for a file that I am directly downloading from Google AdWords (the format is called CSV for Excel in case you have an accessible account).
The BOM is

'\xff\xfe'

and with the code above it fails.
If I use the hint of y-p with the encoded separator

pd.read_csv(paths,sep=u'\t'.encode('utf-16le'), skiprows=5, encoding='utf-16le')

the file is read correctly but the Cyrillic letters aren't printed correctly (in IPython I get some meaningless Latin letters and in the standard python shell I get weird boxes �).
Libre Calc opens the file without problems and shows the letters correctly.
Also, pandas 0.9.1 and 0.10dev work fine (adding engine='python' for 0.10dev) with

pd.read_csv(path, sep='\t', skiprows=5, encoding='utf-16le')

but not with sep=u'\t'.encode('utf-16le') (with this addition I get the same weird characters where Cyrillic characters are expected).

@wesm

Member

commented Dec 2, 2012

Can you produce a sufficiently obfuscated output (plus the exact unicode or bytes literal that can be copy-pasted into Python and passed in a StringIO) for me to see where the decoding is going wrong?

@gerigk

Author

commented Dec 2, 2012

I sent you an email.

@wesm

Member

commented Dec 3, 2012

Thanks, I see the problem: in little-endian UTF-16, a null byte \x00 follows every ASCII character, including the delimiter. To properly parse this data in C, you'd need to write a custom UTF-16 tokenizer. I think the best approach is probably to transcode the data as UTF-8 and feed that to the parser. I'll take a look sometime this week.
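
A small illustration of the point about null bytes, under the assumption that a plain bytes.split can stand in for a byte-oriented tokenizer (it is not the pandas C parser). The variable names are made up for this sketch; Python 3.

```python
# One header line, "A<TAB>B", encoded as little-endian UTF-16 without a BOM.
le_line = u"A\tB".encode('utf-16-le')      # b'A\x00\t\x00B\x00'

# Splitting the raw bytes on the one-byte tab leaves stray null bytes behind,
# and the second field no longer starts on a two-byte boundary.
naive_fields = le_line.split(b'\t')

# Transcoding to UTF-8 first gives one byte per ASCII character,
# so the same byte-oriented split yields clean fields.
utf8_line = le_line.decode('utf-16-le').encode('utf-8')
clean_fields = utf8_line.split(b'\t')

print(naive_fields)
print(clean_fields)
```

UTF-8 keeps ASCII characters as single bytes and never uses those byte values inside multi-byte sequences, which is why a byte-level search for the delimiter is safe after transcoding.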

@ghost

commented Dec 3, 2012

detecting the BOM at the start of the file might also be workable.
There are just a small number of possible values.
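
A rough sketch of what such BOM detection could look like, using the BOM constants from the standard codecs module. sniff_bom and the returned codec names are choices made for this example, not an existing pandas API; Python 3.

```python
import codecs

# Longer BOMs are checked first, because the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]


def sniff_bom(raw):
    """Return a codec name that consumes the BOM found at the start of raw, or None."""
    for bom, codec_name in _BOMS:
        if raw.startswith(bom):
            return codec_name
    return None


print(sniff_bom(b'\xff\xfeN\x00a\x00'))   # starts with the BOM reported in this issue
print(sniff_bom(b'Name,Type'))            # no BOM at all
```

One caveat: a UTF-16 LE file whose first character after the BOM is U+0000 would be mistaken for UTF-32 by this check, so it is a heuristic rather than a guarantee.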

wesm added a commit that referenced this issue Dec 6, 2012
wesm added a commit that referenced this issue Dec 6, 2012
@wesm

Member

commented Dec 6, 2012

Looking good now. Arthur, the test case from your e-mail works fine now (do NOT do u'\t'.encode('utf-16le') though, because the delimiter then becomes the two bytes '\t\x00', and with plain 'utf-16' it also gets a BOM, which confuses the CSV reader)
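
To see what an encoded delimiter actually contains, the bytes can be inspected directly. As far as I can tell, only the plain 'utf-16' codec prepends a BOM; 'utf-16le' does not, but it still turns the tab into two bytes. The variable names are illustrative; Python 3.

```python
import codecs

tab_utf16 = u'\t'.encode('utf-16')       # BOM followed by a two-byte tab
tab_utf16le = u'\t'.encode('utf-16le')   # two-byte tab, no BOM

# The byte order chosen by plain 'utf-16' depends on the platform,
# so both possible BOMs are checked.
has_bom = tab_utf16[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

print(repr(tab_utf16), has_bom)
print(repr(tab_utf16le))
```

Either way the delimiter is no longer a single character, so passing sep='\t' together with the encoding argument is the safer choice.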
