New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Add encoding option to numpy text IO #4208
Conversation
unfinished but so people can have a look at the idea and know what I'm talking about on the mailing list. loadtxt somewhat works genfromtxt not yet. |
numpy/lib/npyio.py
Outdated
@@ -781,9 +783,15 @@ def pack_items(items, packing): | |||
start += length | |||
return tuple(ret) | |||
|
|||
def split_line(line): | |||
def split_line(line, encoding=None): | |||
# decode bytes, default to latin1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this explicit decode is needed for backward compatibility with zipped files which have not been opened in text mode
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Meaning that unzip doesn't return strings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not if you open it with "rb", probably also applies to normal files
The underlying assumption in the I/O port was that all scientific text files are actually 1-byte binary files, not text, which may well have been misguided, and leaves little room for unicode. The use of asbytes originates only from the fact that |
"The use of asbytes originates only from the fact that b'%d' % (20,) does not work." interesting -- for the record, there is a big ol' thread about that on Python-dev, and it looks like that's going to be added.: http://www.python.org/dev/peps/pep-0461/ but there are (far more ugly) ways to do it without new features: '%d' % (20,).encode('ascii') for instance. |
tests should now succeed after adding more hacks to keep supporting broken assumptions on data encoding. |
loading of 'S' dtype in a structured array most likely will not work yet. |
numpy/lib/tests/test_io.py
Outdated
@@ -55,6 +55,10 @@ def strptime(s, fmt=None): | |||
else: | |||
return datetime(*time.strptime(s, fmt)[:3]) | |||
|
|||
def strptime_nonbroken(s, fmt=None): | |||
""" works on strings as it should """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This docstring isn't very informative ;) What, in particular, does the function do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it can be removed its some python2.5 cludge that was made worse in the py3 conversion
@charris I think this and a yet to be done genfromtxt fix should be in 1.9, but getting it regression free is difficult as there is all kind of stuff people can be inputing to these functions and it works by accident. |
I'd be inclined to push this to 1.10 with the other genfromtxt fixes. The masked array fixes are in the same spot and I'd like both to have more time to settle out. Maybe we can do a 1.10 as soon as the datetime stuff gets done, or maybe sooner if it doesn't get done ;) |
@juliantaylor Could you revisit this when you finish with higher priority stuff. Might be easier with support for 2.5 dropped. Also interested in @pv comment about text files. |
Pushing this off (again) to 1.11. |
@juliantaylor Needs a rebase. |
@juliantaylor Still interested in this? |
I hope so -- this would be nice :-) |
@juliantaylor Closing this. Please resubmit if you get the urge to continue. Anyone else interested in this is welcome to pull the code and give it a shot. |
Load data in chunks and fill it into an array grown with resize. This significantly reduces the memory consumption of the function.
47978a7
to
c482a5b
Compare
@juliantaylor Ready for review? |
@@ -9,6 +9,8 @@ Highlights | |||
========== | |||
|
|||
* The `np.einsum` function will use BLAS when possible | |||
* ``genfromtxt``, ``loadtxt``, ``fromregex`` and ``savetxt`` can now handle files | |||
with arbitrary encoding supported by Python. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs indentation
|
||
_open = open | ||
|
||
def _check_mode(mode, encoding, newline): | ||
if "t" in mode: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs docstring.
raise ValueError("Argument 'newline' not supported in binary mode") | ||
|
||
def _python2_bz2open(fn, mode, encoding, newline): | ||
""" wrapper to open bz2 in text mode """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs docstring of the standard type with expanded explanation and documentation of parameters.
return bz2.BZ2File(fn, mode) | ||
|
||
def _python2_gzipopen(fn, mode, encoding, newline): | ||
""" wrapper to open gzip in text mode """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs docstring.
@@ -115,7 +173,7 @@ def __getitem__(self, key): | |||
|
|||
_file_openers = _FileOpeners() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This singleton seems a bit odd. Old design I expect.
if isinstance(delimiter, unicode): | ||
delimiter = delimiter.encode('ascii') | ||
if (delimiter is None) or _is_bytes_like(delimiter): | ||
if (delimiter is None) or isinstance(delimiter, basestring): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So in Python 2 either ascii or unicode?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably OK.
@@ -1190,21 +1250,51 @@ def savetxt(fname, X, fmt='%.18e', delimiter=' ', newline='\n', header='', | |||
fmt = asstr(fmt) | |||
delimiter = asstr(delimiter) | |||
|
|||
class WriteWrap(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need better docstring.
if line: | ||
return line.split(delimiter) | ||
else: | ||
return [] | ||
|
||
def read_data(chunk_size): | ||
# Parse each line, including the first |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs better docstring, in particular chunk_size
. @
@@ -1460,6 +1562,15 @@ def genfromtxt(fname, dtype=float, comments='#', delimiter=None, | |||
to read the entire file. | |||
|
|||
.. versionadded:: 1.10.0 | |||
encoding: string, optional | |||
Encoding used to decode the inputfile. Does not apply to input streams. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is a stream in this context?
New functions need docstrings. Also, some of the comments look useful. |
Just to be clear, are the assumptions here that:
|
Rebased and squashed in #10054, so closing this. |
ENH: Add encoding option to numpy text IO.
add encoding flag to np.loadtxt to be able to load non default encoded
text files.