Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add encoding option to numpy text IO #4208

Closed
wants to merge 27 commits into from

Conversation

juliantaylor
Copy link
Contributor

add encoding flag to np.loadtxt to be able to load non default encoded
text files.

@juliantaylor
Copy link
Contributor Author

unfinished but so people can have a look at the idea and know what I'm talking about on the mailing list.

loadtxt somewhat works genfromtxt not yet.
to be blunt, the py3 port of the text IO was seriously botched. All uses of asbytes in the text IO related functions are broken. Why would you assume that text is bytes in the python3 port when that this is fixed is the main selling point of python3 in the first place ...

@@ -781,9 +783,15 @@ def pack_items(items, packing):
start += length
return tuple(ret)

def split_line(line):
def split_line(line, encoding=None):
# decode bytes, default to latin1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this explicit decode is needed for backward compatibility with zipped files which have not been opened in text mode

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meaning that unzip doesn't return strings?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not if you open it with "rb", probably also applies to normal files

@pv
Copy link
Member

pv commented Jan 17, 2014

The underlying assumption in the I/O port was that all scientific text files are actually 1-byte binary files, not text, which may well have been misguided, and leaves little room for unicode. The use of asbytes originates only from the fact that b'%d' % (20,) does not work.

@ChrisBarker-NOAA
Copy link
Contributor

"The use of asbytes originates only from the fact that b'%d' % (20,) does not work."

interesting -- for the record, there is a big ol' thread about that on Python-dev, and it looks like that's going to be added.:

http://www.python.org/dev/peps/pep-0461/

but there are (far more ugly) ways to do it without new features:

'%d' % (20,).encode('ascii')

for instance.

@juliantaylor
Copy link
Contributor Author

tests should now succeed after adding more hacks to keep supporting broken assumptions on data encoding.

@juliantaylor
Copy link
Contributor Author

loading of 'S' dtype in a structured array most likely will not work yet.

@@ -55,6 +55,10 @@ def strptime(s, fmt=None):
else:
return datetime(*time.strptime(s, fmt)[:3])

def strptime_nonbroken(s, fmt=None):
""" works on strings as it should """
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This docstring isn't very informative ;) What, in particular, does the function do?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it can be removed its some python2.5 cludge that was made worse in the py3 conversion

@juliantaylor
Copy link
Contributor Author

@charris I think this and a yet to be done genfromtxt fix should be in 1.9, but getting it regression free is difficult as there is all kind of stuff people can be inputing to these functions and it works by accident.
I can possibly go over it again next week, don't know if that still fits your shedule.

@charris
Copy link
Member

charris commented May 4, 2014

I'd be inclined to push this to 1.10 with the other genfromtxt fixes. The masked array fixes are in the same spot and I'd like both to have more time to settle out. Maybe we can do a 1.10 as soon as the datetime stuff gets done, or maybe sooner if it doesn't get done ;)

@juliantaylor juliantaylor added this to the 1.10 blockers milestone Jul 22, 2014
@charris
Copy link
Member

charris commented Jan 25, 2015

@juliantaylor Could you revisit this when you finish with higher priority stuff. Might be easier with support for 2.5 dropped. Also interested in @pv comment about text files.

@charris
Copy link
Member

charris commented Jun 21, 2015

Pushing this off (again) to 1.11.

@charris charris modified the milestones: 1.11.0 release, 1.10 blockers Jun 21, 2015
@charris
Copy link
Member

charris commented Aug 15, 2015

@juliantaylor Needs a rebase.

@charris
Copy link
Member

charris commented Dec 10, 2015

@juliantaylor Still interested in this?

@ChrisBarker-NOAA
Copy link
Contributor

I hope so -- this would be nice :-)

@charris charris modified the milestones: 1.12.0 release, 1.11.0 release Jan 21, 2016
@charris charris removed this from the 1.12.0 release milestone Jun 13, 2016
@charris charris added the 52 - Inactive Pending author response label Jun 13, 2016
@charris
Copy link
Member

charris commented Jun 13, 2016

@juliantaylor Closing this. Please resubmit if you get the urge to continue. Anyone else interested in this is welcome to pull the code and give it a shot.

@charris
Copy link
Member

charris commented Nov 11, 2017

@juliantaylor Ready for review?

@@ -9,6 +9,8 @@ Highlights
==========

* The `np.einsum` function will use BLAS when possible
* ``genfromtxt``, ``loadtxt``, ``fromregex`` and ``savetxt`` can now handle files
with arbitrary encoding supported by Python.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs indentation


_open = open

def _check_mode(mode, encoding, newline):
if "t" in mode:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs docstring.

raise ValueError("Argument 'newline' not supported in binary mode")

def _python2_bz2open(fn, mode, encoding, newline):
""" wrapper to open bz2 in text mode """
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs docstring of the standard type with expanded explanation and documentation of parameters.

return bz2.BZ2File(fn, mode)

def _python2_gzipopen(fn, mode, encoding, newline):
""" wrapper to open gzip in text mode """
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs docstring.

@@ -115,7 +173,7 @@ def __getitem__(self, key):

_file_openers = _FileOpeners()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This singleton seems a bit odd. Old design I expect.

if isinstance(delimiter, unicode):
delimiter = delimiter.encode('ascii')
if (delimiter is None) or _is_bytes_like(delimiter):
if (delimiter is None) or isinstance(delimiter, basestring):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in Python 2 either ascii or unicode?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably OK.

@@ -1190,21 +1250,51 @@ def savetxt(fname, X, fmt='%.18e', delimiter=' ', newline='\n', header='',
fmt = asstr(fmt)
delimiter = asstr(delimiter)

class WriteWrap(object):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need better docstring.

if line:
return line.split(delimiter)
else:
return []

def read_data(chunk_size):
# Parse each line, including the first
Copy link
Member

@charris charris Nov 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs better docstring, in particular chunk_size. @

@@ -1460,6 +1562,15 @@ def genfromtxt(fname, dtype=float, comments='#', delimiter=None,
to read the entire file.

.. versionadded:: 1.10.0
encoding: string, optional
Encoding used to decode the inputfile. Does not apply to input streams.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is a stream in this context?

@charris
Copy link
Member

charris commented Nov 13, 2017

New functions need docstrings. Also, some of the comments look useful.

@charris
Copy link
Member

charris commented Nov 13, 2017

Just to be clear, are the assumptions here that:

  1. Input files are text, default encoding is ascii (or latin1)?
  2. String data is str in Python 2 and python 3 for default encoding (S).
  3. String data is unicode (U) for other encoding?
  4. Output files are text (unicode).

@charris charris changed the title add encoding option to numpy text IO ENH: Add encoding option to numpy text IO Nov 19, 2017
@charris
Copy link
Member

charris commented Nov 19, 2017

Rebased and squashed in #10054, so closing this.

@charris charris closed this Nov 19, 2017
charris added a commit that referenced this pull request Nov 26, 2017
ENH: Add encoding option to numpy text IO.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
01 - Enhancement Priority: high High priority, also add milestones for urgent issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants