ENH: Add encoding option to numpy text IO #4208

juliantaylor · 2014-01-17T19:02:48Z

Add an encoding flag to np.loadtxt so that text files in a non-default
encoding can be loaded.

juliantaylor · 2014-01-17T19:09:50Z

Unfinished, but posted so people can have a look at the idea and know what I'm talking about on the mailing list.

loadtxt somewhat works; genfromtxt does not yet.
To be blunt, the py3 port of the text IO was seriously botched. All uses of asbytes in the text IO related functions are broken. Why would you assume that text is bytes in the python3 port, when fixing exactly that is the main selling point of python3 in the first place ...

juliantaylor · 2014-01-17T19:12:49Z

numpy/lib/npyio.py

@@ -781,9 +783,15 @@ def pack_items(items, packing):
                start += length
            return tuple(ret)

-    def split_line(line):
+    def split_line(line, encoding=None):
+        # decode bytes, default to latin1


this explicit decode is needed for backward compatibility with zipped files which have not been opened in text mode

Meaning that unzip doesn't return strings?

not if you open it with "rb", probably also applies to normal files
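A minimal sketch of the decode-before-split idea discussed in this review thread. It is not the PR's actual implementation: the `delimiter` parameter is an assumption added for illustration, and the latin1 fallback is taken from the comment in the diff.

```python
import io

def split_line(line, delimiter=None, encoding=None):
    """Split one input line into fields, decoding bytes first."""
    if isinstance(line, bytes):
        # Files opened with "rb" (zipped or not) yield bytes, not str,
        # so decode explicitly; latin1 is the fallback named in the diff.
        line = line.decode(encoding or "latin1")
    line = line.strip()
    return line.split(delimiter) if line else []

# A binary stream stands in for a file that was opened in "rb" mode.
binary_file = io.BytesIO("1,2,3\n4,5,6\n".encode("latin1"))
rows = [split_line(raw, delimiter=",") for raw in binary_file]
```

Lines that are already str pass through unchanged, so the same helper works for files opened in text mode.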

pv · 2014-01-17T19:17:04Z

The underlying assumption in the I/O port was that all scientific text files are actually 1-byte binary files, not text, which may well have been misguided, and leaves little room for unicode. The use of asbytes originates only from the fact that b'%d' % (20,) does not work.

ChrisBarker-NOAA · 2014-01-17T21:28:13Z

"The use of asbytes originates only from the fact that b'%d' % (20,) does not work."

interesting -- for the record, there is a big ol' thread about that on Python-dev, and it looks like that's going to be added:

http://www.python.org/dev/peps/pep-0461/

but there are (far more ugly) ways to do it without new features:

('%d' % (20,)).encode('ascii')

for instance.
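For reference, a small sketch of both routes to a formatted bytes value. Attribute access binds tighter than `%`, so the formatted string has to be parenthesised before `.encode` is called. The version check is only there to keep the sketch runnable on interpreters that predate PEP 461.

```python
import sys

# Format as text first, then encode. The outer parentheses make sure
# .encode() applies to the formatted string and not to the tuple.
formatted = ("%d" % (20,)).encode("ascii")

# PEP 461 restores %-formatting for bytes from Python 3.5 onwards.
if sys.version_info >= (3, 5):
    direct = b"%d" % (20,)
else:
    direct = formatted
```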

juliantaylor · 2014-01-25T22:33:01Z

tests should now succeed after adding more hacks to keep supporting broken assumptions on data encoding.

juliantaylor · 2014-01-25T23:27:46Z

loading of 'S' dtype in a structured array most likely will not work yet.

charris · 2014-02-15T02:24:12Z

numpy/lib/tests/test_io.py

@@ -55,6 +55,10 @@ def strptime(s, fmt=None):
    else:
        return datetime(*time.strptime(s, fmt)[:3])

+def strptime_nonbroken(s, fmt=None):
+    """ works on strings as it should    """


This docstring isn't very informative ;) What, in particular, does the function do?

I guess it can be removed; it's some python2.5 kludge that was made worse in the py3 conversion.

juliantaylor · 2014-05-04T21:25:49Z

@charris I think this and a yet-to-be-done genfromtxt fix should be in 1.9, but getting it regression-free is difficult, as people can be inputting all kinds of stuff to these functions that only works by accident.
I can possibly go over it again next week; I don't know if that still fits your schedule.

charris · 2014-05-04T22:50:26Z

I'd be inclined to push this to 1.10 with the other genfromtxt fixes. The masked array fixes are in the same spot and I'd like both to have more time to settle out. Maybe we can do a 1.10 as soon as the datetime stuff gets done, or maybe sooner if it doesn't get done ;)

charris · 2015-01-25T18:14:29Z

@juliantaylor Could you revisit this when you finish with higher priority stuff? Might be easier with support for 2.5 dropped. Also interested in @pv's comment about text files.

charris · 2015-06-21T20:44:00Z

Pushing this off (again) to 1.11.

charris · 2015-08-15T01:02:24Z

@juliantaylor Needs a rebase.

charris · 2015-12-10T22:22:45Z

@juliantaylor Still interested in this?

ChrisBarker-NOAA · 2015-12-10T23:28:19Z

I hope so -- this would be nice :-)

charris · 2016-06-13T14:52:49Z

@juliantaylor Closing this. Please resubmit if you get the urge to continue. Anyone else interested in this is welcome to pull the code and give it a shot.

charris · 2017-11-11T23:13:11Z

@juliantaylor Ready for review?

charris · 2017-11-13T19:44:32Z

doc/release/1.14.0-notes.rst

@@ -9,6 +9,8 @@ Highlights
 ==========

 * The `np.einsum` function will use BLAS when possible
+* ``genfromtxt``, ``loadtxt``, ``fromregex`` and ``savetxt`` can now handle files
+with arbitrary encoding supported by Python.


Needs indentation

charris · 2017-11-13T19:46:36Z

numpy/lib/_datasource.py


 _open = open

+def _check_mode(mode, encoding, newline):
+    if "t" in mode:


Needs docstring.

charris · 2017-11-13T19:48:47Z

numpy/lib/_datasource.py

+            raise ValueError("Argument 'newline' not supported in binary mode")
+
+def _python2_bz2open(fn, mode, encoding, newline):
+    """ wrapper to open bz2 in text mode """


Needs docstring of the standard type with expanded explanation and documentation of parameters.
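As an illustration of the requested docstring style, here is a hedged Python 3 sketch of a text-mode bz2 opener. The name `open_bz2_text` is hypothetical and the body is not the PR's `_python2_bz2open`; it only shows the general technique of wrapping the binary file object in `io.TextIOWrapper`.

```python
import bz2
import io
import os
import tempfile

def open_bz2_text(fn, mode="r", encoding=None, newline=None):
    """Open a bz2-compressed file, optionally in text mode.

    Parameters
    ----------
    fn : str
        Path of the compressed file.
    mode : str, optional
        Open mode. A "t" in the mode requests text mode; anything
        else is passed to `bz2.BZ2File` unchanged.
    encoding : str, optional
        Encoding used to decode or encode text. Text mode only.
    newline : str, optional
        Newline handling, as for `io.TextIOWrapper`. Text mode only.

    Returns
    -------
    file object
        An `io.TextIOWrapper` in text mode, else a `bz2.BZ2File`.
    """
    if "t" in mode:
        # BZ2File only understands binary modes, so strip the "t"
        # and layer the decoding on top of the binary stream.
        binary = bz2.BZ2File(fn, mode.replace("t", ""))
        return io.TextIOWrapper(binary, encoding=encoding, newline=newline)
    return bz2.BZ2File(fn, mode)

# Round trip through a temporary file.
path = os.path.join(tempfile.mkdtemp(), "data.txt.bz2")
with open_bz2_text(path, "wt", encoding="utf-8") as f:
    f.write("1 2\n3 4\n")
with open_bz2_text(path, "rt", encoding="utf-8") as f:
    text = f.read()
```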

charris · 2017-11-13T19:49:02Z

numpy/lib/_datasource.py

+        return bz2.BZ2File(fn, mode)
+
+def _python2_gzipopen(fn, mode, encoding, newline):
+    """ wrapper to open gzip in text mode """


Needs docstring.

charris · 2017-11-13T19:55:15Z

numpy/lib/_datasource.py

@@ -115,7 +173,7 @@ def __getitem__(self, key):

 _file_openers = _FileOpeners()


This singleton seems a bit odd. Old design I expect.

charris · 2017-11-13T20:04:16Z

numpy/lib/_iotools.py

-        if isinstance(delimiter, unicode):
-            delimiter = delimiter.encode('ascii')
-        if (delimiter is None) or _is_bytes_like(delimiter):
+        if (delimiter is None) or isinstance(delimiter, basestring):


So in Python 2 either ascii or unicode?

Probably OK.

charris · 2017-11-13T20:49:27Z

numpy/lib/npyio.py

@@ -1190,21 +1250,51 @@ def savetxt(fname, X, fmt='%.18e', delimiter=' ', newline='\n', header='',
        fmt = asstr(fmt)
    delimiter = asstr(delimiter)

+    class WriteWrap(object):


Needs a better docstring.

charris · 2017-11-13T20:50:02Z

numpy/lib/npyio.py

        if line:
            return line.split(delimiter)
        else:
            return []

+    def read_data(chunk_size):
+        # Parse each line, including the first


Needs a better docstring, in particular for chunk_size.
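A rough sketch of the chunked-loading technique behind `read_data`, with `chunk_size` documented. This is not the PR's code: the `lines` argument, the helper name `load_chunked` and the float-only parsing are simplifying assumptions.

```python
import io
import itertools

import numpy as np

def read_data(lines, chunk_size):
    """Yield parsed rows in lists of at most `chunk_size` rows.

    Parameters
    ----------
    lines : iterable of str
        Input lines; blank lines are skipped.
    chunk_size : int
        Maximum number of rows held as Python objects at one time.
    """
    rows = ([float(x) for x in line.split()] for line in lines if line.strip())
    while True:
        chunk = list(itertools.islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk

def load_chunked(lines, chunk_size=50000):
    """Build an array by growing it in place, one chunk at a time."""
    out = None
    for chunk in read_data(lines, chunk_size):
        block = np.array(chunk, dtype=float)
        if out is None:
            out = block
        else:
            old = out.shape[0]
            # Grow along the first axis, then fill in the new rows.
            out.resize((old + block.shape[0],) + out.shape[1:], refcheck=False)
            out[old:] = block
    return out

data = load_chunked(io.StringIO("1 2\n3 4\n5 6\n"), chunk_size=2)
```

Only one chunk of Python-level rows exists at any time, which is where the memory saving over building one large list of rows comes from.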

charris · 2017-11-13T21:04:25Z

numpy/lib/npyio.py

@@ -1460,6 +1562,15 @@ def genfromtxt(fname, dtype=float, comments='#', delimiter=None,
        to read the entire file.

        .. versionadded:: 1.10.0
+    encoding: string, optional
+        Encoding used to decode the inputfile. Does not apply to input streams.


What is a stream in this context?

charris · 2017-11-13T21:17:05Z

New functions need docstrings. Also, some of the comments look useful.

charris · 2017-11-13T22:41:08Z

Just to be clear, are the assumptions here that:

1. Input files are text, default encoding is ascii (or latin1)?
2. String data is str in Python 2 and Python 3 for default encoding (S).
3. String data is unicode (U) for other encodings?
4. Output files are text (unicode).
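For context, a short usage sketch of the `encoding` keyword as it exists in released NumPy (1.14 and later). The file name and contents are made up for illustration; the sketch writes a latin1-encoded file and reads it back as unicode strings.

```python
import os
import tempfile

import numpy as np

# Write a small latin1-encoded text file (hypothetical sample data).
path = os.path.join(tempfile.mkdtemp(), "colors.txt")
with open(path, "wb") as f:
    f.write("gr\u00fcn\nblau\n".encode("latin1"))

# Pass the file's encoding explicitly so the bytes are decoded correctly.
words = np.loadtxt(path, dtype=str, encoding="latin1")
```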

charris · 2017-11-19T17:42:07Z

Rebased and squashed in #10054, so closing this.

ENH: Add encoding option to numpy text IO.

juliantaylor reviewed Jan 17, 2014

charris reviewed Feb 15, 2014

juliantaylor mentioned this pull request Feb 15, 2014

np.genfromtxt hates text files #3184

Open

juliantaylor added the Needs work label Feb 24, 2014

This was referenced Apr 6, 2014

ENH: Quoting support in np.genfromtxt(...) #4594

Closed

Bug with NumPy loadtxt() and unicode strings #4600

Closed

juliantaylor added this to the 1.10 blockers milestone Jul 22, 2014

DavidPowell mentioned this pull request Mar 9, 2015

loadtxt fails with complex data under python 3 #5655

Closed

charris modified the milestones: 1.11.0 release, 1.10 blockers Jun 21, 2015

charris modified the milestones: 1.12.0 release, 1.11.0 release Jan 21, 2016

rgommers mentioned this pull request May 27, 2016

DOC: fix broken genfromtxt examples in user guide. Closes gh-7662. #7688

Merged

charris removed this from the 1.12.0 release milestone Jun 13, 2016

charris added the 52 - Inactive Pending author response label Jun 13, 2016

juliantaylor added 9 commits November 6, 2017 22:24

cleanup compressed file handling in datasource

7f0d6f7

remove two now unnecessary abstractions

088f4b3

fix encoding argument not being passed to Linesplitter

097f7c0

move decoding into Linesplitter's handyman function

053449d

cleanup

3aba208

ENH: change loadtxt to use a generator to load data

bec193e

Load data in chunks and fill it into an array grown with resize. This significantly reduces the memory consumption of the function.

DOC: add release notes for text IO changes

1fe69f3

DEPR: add a deprecation warning when reading strings without encoding

e9ae400

add test for savetxt into StringIO

c482a5b

juliantaylor force-pushed the load-encoding branch from 47978a7 to c482a5b Compare November 6, 2017 21:36

mhvk mentioned this pull request Nov 9, 2017

Make it possible to use Table.read on FITS files with no copying astropy/astropy#6821

Merged

charris reviewed Nov 13, 2017

charris changed the title from "add encoding option to numpy text IO" to "ENH: Add encoding option to numpy text IO" Nov 19, 2017

charris mentioned this pull request Nov 19, 2017

ENH: Add encoding option to numpy text IO. #10054

Merged

charris closed this Nov 19, 2017

charris added a commit that referenced this pull request Nov 26, 2017

Merge pull request #10054 from charris/gh-4208

8c441fa

ENH: Add encoding option to numpy text IO.

eric-wieser mentioned this pull request Apr 27, 2018

genfromtxt requires encoding despite input being unicode #10990

Closed

charris mentioned this pull request Jul 29, 2018

np.loadtxt no longer loads bz2 files in python2 #11633

Closed
