
ENH: auto-type detect data loader #143

Closed
wants to merge 8 commits

7 participants

@chrisjordansquire

Created a function to auto-detect data types for each column in a csv-style data file. Works with arbitrary separators between the data and loads the data into a numpy record array.

This is an implementation of an idea of Mark Wiebe's for how to make a function to load data from a text file into a numpy record array by automatically testing the types. It also handles missing data, placing it into masked arrays. (A future update of this can add support for the NA feature.) It also handles dates.

The need this function fills is for users who want to load their data quickly from a text file when the data contains missing values and is not homogeneous. In this case a numpy record array seems like the most natural choice for the base package (though for more domain-specific uses other data structures, such as in pandas, might be more appropriate). Unfortunately the current numpy functions for such reading of data are rather cumbersome. This function is much less cumbersome and includes sensible defaults. It is similar to matplotlib.mlab's csv2rec. This function isn't blazingly fast, and there are lots of potential avenues for improvement. However, it is perfectly usable for non-humongous datasets. For example, it loaded a fairly messy file of 2800 lines and 30 columns in about 1.3s, and the algorithm should scale linearly with the size of the input.
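For illustration, a minimal call might look like the sketch below (the data is made up; delimiter and header are parameters from this patch's signature):

>>> from StringIO import StringIO
>>> f = StringIO('name, height, weight\nAlice, 1.70, NA\nBob, 1.80, 82\n')
>>> table = loadtable(f, delimiter=',', header=True)

Since the weight column contains the default 'NA' marker, the result would come back as a masked record array.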

For the power-user, it's fairly simple to add support for additional types, such as more elaborate datetimes.
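For instance, the date_re and date_strp parameters in this patch's signature should already allow a timestamp format rather than a bare date (a sketch, for some input f; the pattern must be non-capturing, per the docstring):

>>> t = loadtable(f, date_re=r'(?:\d{4}-\d{2}-\d{2} \d{2}:\d{2})',
...               date_strp='%Y-%m-%d %H:%M')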

This also has a modest test suite to check functionality.

@rgommers
Owner

I had the same first reaction as Pierre. The new functionality looks useful, but would it be possible to infer the dtypes per column and then feed that to loadtxt/genfromtxt?

@chrisjordansquire

The main advantages of this function over genfromtxt, I think, are

  • More modular (genfromtxt is a rather large, nearly 500-line, monolithic function. No function in this is longer than around 80 lines, and they're fairly self-contained. This makes it easier for power users to make desired modifications.)
  • delimiters can be specified via regexes
  • missing data can be specified via regexes
  • it's a bit simpler and has sensible defaults
  • it actually works on some (unfortunately proprietary) data that genfromtxt doesn't seem robust enough for
  • it supports datetimes
  • fairly extensible for the power user
  • makes two passes through the file, the first to determine types/sizes for strings and the second to read in the data, and pre-allocates the array for the second pass, so there's no giant memory bloat when reading large text files (a rough sketch of this two-pass shape follows this list)
  • fairly fast, though I think there is plenty of room for optimizations
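
A rough, runnable sketch of that two-pass shape (deliberately simplified: no comments, headers, quoting or missing values; try/except guessing instead of the patch's regexes; all helper names invented):

import numpy as np

def _guess(s):
    # crude stand-in for the patch's regex-based type tests:
    # try int, then float, else fall back to string
    for name, typ in (('i8', int), ('f8', float)):
        try:
            typ(s)
            return name
        except ValueError:
            pass
    return 'S'

def two_pass_load(f, delimiter=','):
    order = ['i8', 'f8', 'S']              # promotion order: int -> float -> str
    coltypes, sizes, nrows = None, None, 0
    for line in f:                         # pass 1: learn dtypes and string sizes
        fields = [x.strip() for x in line.split(delimiter)]
        if coltypes is None:
            coltypes = ['i8'] * len(fields)
            sizes = [1] * len(fields)
        for j, x in enumerate(fields):
            g = _guess(x)
            if order.index(g) > order.index(coltypes[j]):
                coltypes[j] = g            # promote, never demote
            sizes[j] = max(sizes[j], len(x))
        nrows += 1
    dt = [('f%d' % j, 'S%d' % sizes[j] if t == 'S' else t)
          for j, t in enumerate(coltypes)]
    data = np.empty(nrows, dtype=dt)       # pass 2: pre-allocate, then fill
    f.seek(0)
    conv = {'i8': int, 'f8': float}
    for i, line in enumerate(f):
        fields = [x.strip() for x in line.split(delimiter)]
        data[i] = tuple(conv.get(coltypes[j], str)(x)
                        for j, x in enumerate(fields))
    return data

On StringIO('1,2\n3,4.5\n') this gives dtype [('f0', '<i8'), ('f1', '<f8')], with the second column promoted to float when the second line is seen.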

All that said, it's entirely possible that the innards which determine the type should be ripped out and submitted as a function on their own.

I also just pushed some changes that added a bit of documentation, made a few performance enhancements, and fixed a small bug.

@rgommers
Owner

I do like this code better than genfromtxt, although I'm still somewhat worried about duplication. Did you send a message about this to the mailing list? There are a couple of people making regular improvements to loadtxt and genfromtxt who will be interested.

A comment on how the additions are organized: io is not a good module name because it's the name of a stdlib module. For scipy there's the same problem; we ended up recommending to import like import scipy.io as spio. And the python test files with expected output would probably be better placed in a single file.

@rgommers
Owner

I do like the large number of test cases, nice job on that!

@chrisjordansquire

I'll email the general list tonight. Thanks for reminding me.

I actually think I need more test cases. Currently my tests only tell you if it breaks, without giving much indication of where or how. And it's a big enough method that this is a problem.

You mean put all the expected outputs inside the test file itself?

Placing it into io was Mark's idea. He was hoping it would be a stepping stone to later breaking up all of npyio.py into separate method files instead of one huge file. I'd be happy to use a different file name, but I think Mark has the right idea of breaking up that huge file and putting its methods in a folder.

Based on some small performance testing I did, the loadtable function's big bottlenecks are (1) checking for sizes, if that's enabled, and (2) reading in the data. The first can be disabled by the user, while the second can be replaced with a more specialized function or a C extension later on. One of the joys of explicit modularity.

I'd also welcome a name different from loadtable. I don't really like the name, but didn't know what else to call it. I was modeling parts of its expected behavior and API on R's read.table function, hence the name.

Code duplication definitely seems like a potential problem. loadtable isn't well developed enough that it could completely take the place of genfromtxt, but I think there is a potential future where it could. Until then I'm just hoping this is simpler to use and fulfills a niche that genfromtxt and its wrappers don't.

@rgommers
Owner

Or in a separate file. I just see a lot of 3-line long .py files, which feels a bit too Matlab-like :)

Some more modularity is good, it just shouldn't be called io.

@dhomeier

I also find it much clearer to have the (StringIO) input and comparison data together in one test file.
The mode of pre-scanning the data makes this (among all the other functionality!) a complement to #144.
As I posted to the list, automatic decompression and spaces as default delimiter would let this go more seamlessly with the existing npyio routines.
I could also think of a few more pet options:

enable Fortran-style exponentials (3.1415D+00 ;-) OK, probably easy enough to contribute in that framework

allow record names to be passed by the user - somewhat conflicting with the auto-detection, I admit, but I don't know of a way of changing the field names once the record array has been created.
This brings to my mind that your functionality is approaching some of the things provided by the asciitable package
http://cxc.harvard.edu/contrib/asciitable/
This can e.g. auto-read the field names from a header line, but its performance is way below loadtable's (which actually comes very close to loadtxt with the prescan option). But it might be worth checking whether the two projects could benefit from each other.

@chrisjordansquire

I'd been told about the asciitable project, and looked at it some. But it ultimately didn't seem relevant to the things I was thinking about at the time, so I didn't look into it further. (A lot of the stuff I was worrying about was recognizing types, constructing performant regular expressions--ha!--and getting type promotion rules right.) If anyone has specific suggestions for things I should look at in it, I'd happily give it a second go.

As long as we're listing wants, two other options which I'd like this to have (in the 'and I wanna pony' sense) are:

  • read only specific columns (this could actually be added without much trouble once I have time and overcome laziness)
  • able to intelligently read a type for each column, if the user is enlightened enough to specify them above/below the header, so you don't need to auto-detect the types
@dhomeier

It was just a vague idea for the moment - as asciitable is already relatively mature, and other packages like ATPy depend on it, it would not easily be replaced by a new module. One idea for producing some synergy in the mid-range is that maybe the BaseReader of asciitable could be replaced by a more performant engine based on your work. But perhaps this is rather up to the asciitable developers to think about.

I'd also rank the first of your wants above rather high, to provide an equivalent to the 'usecols' option of loadtxt/genfromtxt.

@chrisjordansquire

A note on an earlier comment. Numpy says it will be deprecated in the future, but you can change the names in a record array. If x is a record array with 2 fields:

x.dtype.names = ('new_name1', 'new_name2')
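A minimal sketch of that rename:

>>> import numpy as np
>>> x = np.zeros(2, dtype=[('f0', 'i8'), ('f1', 'f8')])
>>> x.dtype.names = ('new_name1', 'new_name2')
>>> x.dtype.names
('new_name1', 'new_name2')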

@chrisjordansquire

I refactored the tests so it's all in one test file. I also changed the default delimiter from ',' to ' ', as suggested on the mailing list.

@chrisjordansquire

I added the ability to select only certain columns.

@mwiebe
Owner

Would it overcomplicate things to try and automatically detect a delimiter by default, like some spreadsheet apps do? Maybe based on some heuristic regexes on the first line of data?

@chrisjordansquire

I'm not sure what the best heuristics would be. The most common delimiters, according to Wikipedia, are whitespace (either tab or space), commas, colons, and pipes. So perhaps just doing a split on each of those and going with the one that yields the longest list would be appropriate.

But on the mailing list some people indicated they wanted whitespace as the default delimiter. And I kinda think the user should specify if they want some wonky heuristic used to determine the delimiter.

Another route would just be writing a few delimiter specific wrappers. Like a loadcsvtable, except with a non-ugly name.

I'm open to suggestions, as I can't think of a way to do it that really feels 'right'.
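A minimal sketch of that split-and-count idea (ties and quoted fields are exactly where it goes wrong):

def guess_delimiter(line, candidates=('\t', ' ', ',', ':', '|')):
    # pick whichever candidate splits the first data line into the most fields
    return max(candidates, key=lambda d: len(line.split(d)))

guess_delimiter('1,2,3,4')    # ','
guess_delimiter('1|2|3|4')    # '|'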

@dhomeier

It would probably conflict with comma_decimals, right? Though that option is off by default.

@chrisjordansquire

Yes. That's part of why I'm not satisfied with that heuristic. Though it could be modified easily enough by automatically converting anything between quotes into some placeholder, such as 0. (I already do this at one point in the code.)
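Since the patch already compiles quoted_entry_pattern at module level, masking quoted fields before counting delimiters is a one-liner (sketch):

import re

quoted_entry_pattern = re.compile(r'"[^"]*"')   # as compiled in this patch
line = '"A, B, C", 3.1, "1,5"'
masked = quoted_entry_pattern.sub('0', line)
# masked == '0, 3.1, 0' -- commas inside quotes no longer skew the count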

@mwiebe
Owner

In some simple cases it seems obvious:

'1, 2, 3, 4' or '1,2,3,4' vs '1 2 3 4'

It's unlikely here that the first example should produce the strings "1," "2," "3," and the number 4, and the second example should produce the string "1,2,3,4". Things get more complicated when you have quoted strings, though, where a simple Python split() would not separate at the right place anyway:

'"A, B, C", 3.1, "(0 1) \" (2 3)"

which I would expect to produce the string "A, B, C", the number 3.1, and the string "(0 1) \" (2 3)".

One approach for the heuristic would be to define an item regex (quoted string including the delimiter or characters all excluding the delimiter), then matching against <item>(<delimiter><item>)+ to see how well the delimiter works. Probably more details would need to be worked out though.
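A sketch of that item-regex scoring (assuming double-quoted strings, and ignoring the escaped quotes that the last example above would also need):

import re

def matches_delimiter(line, delim):
    # an item is either a quoted string (which may contain the delimiter)
    # or a run of characters containing neither quotes nor the delimiter
    item = '(?:"[^"]*"|[^"%s]+)' % re.escape(delim)
    row = r'\s*%s(?:\s*%s\s*%s)+\s*$' % (item, re.escape(delim), item)
    return re.match(row, line) is not None

matches_delimiter('"A, B, C", 3.1, "(0 1)"', ',')   # True
matches_delimiter('"A, B, C", 3.1, "(0 1)"', '|')   # False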

@chrisjordansquire

I stuck np.lib._datasource in, just as in genfromtxt. It appears to be working, but I don't have a good idea how to test that. I'd appreciate suggestions.

@charris
Owner

If you would like to look through _datasource with an eye to simplifying it, I wouldn't complain ;) It seems a bit strange that a module that advertises itself as a generic file opener for the researcher looks like a private module.

@chrisjordansquire

Yes, I was confused by the _datasource code as well. In the end, I just used the same call that genfromtxt does and verified that it passed all my tests.
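One possible way to exercise that path in a test is via the automatic decompression _datasource advertises (a sketch, assuming np.lib._datasource.open handles .gz files as documented):

import gzip, os, tempfile
import numpy as np

# _datasource.open resolves paths/URLs and decompresses .gz/.bz2 transparently,
# so one checkable property is that a gzipped file round-trips to the same text
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'data.txt.gz')
g = gzip.open(path, 'wb')
g.write('1 2\n3 4\n')
g.close()
f = np.lib._datasource.open(path)
assert f.read() == '1 2\n3 4\n'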

@charris
Owner

What is the status of this request? Are folks happy with the functionality and the addition of a new function?

@rgommers
Owner

Functionality looks good to me, +1 on adding it.

Some minor cleanups are in order though:

  • module shouldn't be named io because this is a stdlib module. should it be a module at all?
  • loadtable.py needs an __all__ list, and I prefer the filename to start with an underscore.
  • the types in docstrings should be corrected; integer -> int, string -> str, etc.; returns need names
@rgommers rgommers commented on the diff
numpy/lib/io/loadtable.py
((390 lines not shown))
+ type_search_order=['b1', 'i8', 'f8','M8[D]'],
+ skip_lines=0,
+ num_lines_search=0,
+ string_sizes=1,
+ check_sizes=True,
+ is_Inf_NaN=True,
+ NA_re='NA',
+ usecols=None,
+ quoted=False,
+ comma_decimals=False,
+ force_mask=False,
+ date_re=r'\d{4}-\d{2}-\d{2}',
+ date_strp='%Y-%m-%d',
+ default_missing_dtype='|S1'):
+ """
+ Load a text file with rows of data into a numpy record array.
@rgommers Owner
rgommers added a note

Add blank line here. Summary should be single line.

@bsouthey
bsouthey added a note

'input file' not 'text file'.

@rgommers rgommers commented on the diff
numpy/lib/io/loadtable.py
((392 lines not shown))
+ num_lines_search=0,
+ string_sizes=1,
+ check_sizes=True,
+ is_Inf_NaN=True,
+ NA_re='NA',
+ usecols=None,
+ quoted=False,
+ comma_decimals=False,
+ force_mask=False,
+ date_re=r'\d{4}-\d{2}-\d{2}',
+ date_strp='%Y-%m-%d',
+ default_missing_dtype='|S1'):
+ """
+ Load a text file with rows of data into a numpy record array.
+ This function will automatically detect the types of each
+ column, as well as the presense of missing values. If there are
@rgommers Owner
rgommers added a note

presence

@rgommers rgommers commented on the diff
numpy/lib/io/loadtable.py
((396 lines not shown))
+ NA_re='NA',
+ usecols=None,
+ quoted=False,
+ comma_decimals=False,
+ force_mask=False,
+ date_re=r'\d{4}-\d{2}-\d{2}',
+ date_strp='%Y-%m-%d',
+ default_missing_dtype='|S1'):
+ """
+ Load a text file with rows of data into a numpy record array.
+ This function will automatically detect the types of each
+ column, as well as the presense of missing values. If there are
+ missing values a masked array is returned, otherwise a numpy
+ array is returned.
+
+ It will also automatically detect the prescense of unlabeled row
@rgommers Owner
rgommers added a note

presence

@rgommers rgommers commented on the diff
numpy/lib/io/loadtable.py
((403 lines not shown))
+ default_missing_dtype='|S1'):
+ """
+ Load a text file with rows of data into a numpy record array.
+ This function will automatically detect the types of each
+ column, as well as the presense of missing values. If there are
+ missing values a masked array is returned, otherwise a numpy
+ array is returned.
+
+ It will also automatically detect the prescense of unlabeled row
+ names, but only if there is one column of them. This is done
+ to make loading data saved from some other systems, such as R,
+ easy to load.
+
+ For most users, the only parameters of interest are fname, delimiter,
+ header, and (possibly) type_search_order. The rest are for various
+ more specialzed/unusual data formats.
@rgommers Owner
rgommers added a note

specialized

@rgommers rgommers commented on the diff
numpy/lib/io/loadtable.py
((489 lines not shown))
+ entries are missing data.
+
+ Returns
+ -------
+ result: Numpy record array or masked record array
+ The data stored in the file, as a numpy record array. The field
+ names default to 'f'+i for field i if header is False, else the
+ names from the first row of non-whitespace, non-comments.
+
+ Raises
+ ------
+ IOError
+ If the input file does not exist or cannot be read.
+ ValueError
+ If the input file does not contain any data.
+ RuntumeError
@rgommers Owner
rgommers added a note

RuntimeError

@rgommers rgommers commented on the diff
numpy/lib/io/loadtable.py
((528 lines not shown))
+
+ * Determning the sizes is expensive. For large arrays containing no
+ string data it is best to set check_sizes to False.
+ * Similarly, determining the dtypes can be expensive. If the text
+ file is very large but has very homogeneous data (i.e. the dtypes are
+ easily determined), then it is best to only check the first k lines
+ for some reasonable value of k.
+ * This method defaults to 64-bit ints and floats. If these sizes are
+ unnecessary they should be reduced to 32-bit ints and floats to
+ conserve space.
+ * Converting comma float strings (i.e. '3,24') is about twice as
+ expensive as converting decimal float strings
+
+ Examples
+ --------
+ First create simply data file and then load it.
@rgommers Owner
rgommers added a note

Needs blank line between description and code here. Same for the rest of these examples

@bsouthey
bsouthey added a note

'input file' not 'data file'.

@rgommers rgommers commented on the diff
numpy/lib/io/loadtable.py
((514 lines not shown))
+ In the first pass it determines the dtypes based on regular expressions
+ for the dtypes and custom promotion rules between dtypes. (The promotion
+ rules are used, for example, if a column appears to be integer and then
+ a float is seen.) In the first pass it can also determine the sizes of
+ strings, if that option is enabled. After determining the dtypes and
+ string sizes, it pre-allocates a numpy array of the appropriate size
+ (or masked array in the prescense of missing data) and fills it line
+ by line.
+
+ The methods within this function are fairly modular, and it requires
+ little difficulty to extract, for example, the method that determines
+ dtypes or change the method for reading in data.
+
+ Performance Tips:
+
+ * Determning the sizes is expensive. For large arrays containing no
@rgommers Owner
rgommers added a note

This won't parse correctly. Lists with multi-line elements are problematic anyway. Perhaps best to add :: after Performance Tips to let this be a literal block in html output. Also indent 2nd/3rd lines of multi-line elements

@bsouthey
bsouthey added a note

Determining?

@rgommers rgommers commented on the diff
numpy/lib/io/loadtable.py
((1,015 lines not shown))
+ else:
+ rowelems = matches.groups()
+ for j,rowelem in enumerate(rowelems):
+ if NA_pattern.match(rowelem):
+ tmpline[j] = dtype_default_missing[
+ dtype_dict[coltypes[j]]]
+ tmpmask[j] = True
+ else:
+ if quoted:
+ rowelem = rowelem.replace('"','')
+ tmpline[j] = dtype_to_conv[coltypes[j]](rowelem)
+ data[i] = tuple(tmpline)
+ data.mask[i] = tuple(tmpmask)
+ i += 1
+ return data
+
@rgommers Owner
rgommers added a note

2 blank lines should be enough. I like pep8, sorry

@rgommers rgommers commented on the diff
numpy/lib/io/loadtable.py
((1,069 lines not shown))
+ col_names: list of strings
+ The column names, if header is True
+ skip_lines: integer
+ The number of lines to skip before reading the text file for data
+ quoted: bool
+ Whether to allow the data to contain quotes, such as "3.14" or
+ "North America"
+ usecols : int or sequence, optional
+ Which columns to use. Selected with 0-indexing, using the same
+ sytax as selecting elements of a list. (i.e. -1 refers to the
+ last element, -2 to the second to last, etc.) If this is used,
+ no additional memory will be used for the unselected columns.
+
+ Returns
+ -------
+ numpy array
@rgommers Owner
rgommers added a note

numpy array == ndarray

@rgommers
Owner

Would be good to also add loadtable in the See Also section of loadtxt and genfromtxt

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((33 lines not shown))
+ r'(?:[eE][-+]?\d+)?|nan|NAN|NaN|[+-]?inf',
+ r'|[+-]?Inf|[+-]?INF)']),
+ 'commafloat64InfNaN': ''.join([r'(?:[+-]?\s*(?:(?:(?:\d+)?,)?\d+|\d+,)',
+ r'(?:[eE][-+]?\d+)?|nan|NAN|NaN|[+-]?inf',
+ r'|[+-]?Inf|[+-]?INF)']),
+ 'complex64': ''.join([r'(?:[-+]?(?:(?:\d+\.?\d*|\d*\.?\d+)(?:[Ee][-+]?',
+ r'\d+)?)?[jJ]|[-+]?(?:\d+\.?\d*|\d*\.?\d+)(?:[Ee]',
+ r'[-+]?\d+)?[-+](?:(?:\d+\.?\d*|\d*\.?\d+)(?:[Ee]',
+ r'[-+]?\d+)?)?[jJ])']),
+ 'complex128': ''.join([r'(?:[-+]?(?:(?:\d+\.?\d*|\d*\.?\d+)(?:[Ee][-+]?',
+ r'\d+)?)?[jJ]|[-+]?(?:\d+\.?\d*|\d*\.?\d+)(?:[Ee]',
+ r'[-+]?\d+)?[-+](?:(?:\d+\.?\d*|\d*\.?\d+)(?:[Ee]',
+ r'[-+]?\d+)?)?[jJ])']),
+ 'datetime64[D]': r'\d{4}-\d{2}-\d{2}',
+ '|S1': r'\S+',
+ }
@bsouthey
bsouthey added a note

All the dtypes currently supported by numpy such as float16, float128, int8 and the unsigned integers must be included - applies to all usages. Also, how can this dictionary be changed on the fly?

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((380 lines not shown))
+# every time loadtable is called.
+quoted_entry_pattern = re.compile(r'"[^"]*"')
+entry_re = r'"[^"]*"|[^"]*?'
+
+
+
+def loadtable(fname,
+ delimiter=' ',
+ comments='#',
+ header=False,
+ type_search_order=['b1', 'i8', 'f8','M8[D]'],
+ skip_lines=0,
+ num_lines_search=0,
+ string_sizes=1,
+ check_sizes=True,
+ is_Inf_NaN=True,
@bsouthey
bsouthey added a note

It is better to use 'isfinite' to be consistent with numpy usage (see numpy.isfinite). But not sure how to handle the NA object...

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((445 lines not shown))
+ to determine the type and size information for data that is very
+ homogenous.
+ string_sizes: int or list of ints, optional
+ If a single int, interpreted as a minimum string size for all entries.
+ If a list of ints, interpreted as the minimum string size for each
+ individual entry. An error is thrown if the lenght of this list
+ differs from the number of entries per row found. If check_sizes is
+ False, then these minimum sizes are never changed.
+ check_sizes: boolean or int, optional
+ Whether to check string sizes in each row for determining the size of
+ string dtypes. This is an expensive option.
+ If true it will check all lines for sizes. If an integer, it will
+ check up to that number of rows from the beginning. And if false
+ it will check no rows and use the defaults given from string_size.
+ is_Inf_NaN: bool, optional
+ Whether to allow floats that are Inf and NaN
@bsouthey
bsouthey added a note

What do you mean here, given that only floats can be infinite or NaN?
What happens if this is True and what happens if False?
If this is a flag to exclude these values then it should also allow for the NA object.

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((447 lines not shown))
+ string_sizes: int or list of ints, optional
+ If a single int, interpreted as a minimum string size for all entries.
+ If a list of ints, interpreted as the minimum string size for each
+ individual entry. An error is thrown if the lenght of this list
+ differs from the number of entries per row found. If check_sizes is
+ False, then these minimum sizes are never changed.
+ check_sizes: boolean or int, optional
+ Whether to check string sizes in each row for determining the size of
+ string dtypes. This is an expensive option.
+ If true it will check all lines for sizes. If an integer, it will
+ check up to that number of rows from the beginning. And if false
+ it will check no rows and use the defaults given from string_size.
+ is_Inf_NaN: bool, optional
+ Whether to allow floats that are Inf and NaN
+ NA_re: string, optional
+ Regular expression for missing data The regular
@bsouthey
bsouthey added a note

Period missing? "Regular expression for missing data. The regular"

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((795 lines not shown))
+ dtype_to_conv['datetime64[D]'] = f
+ dtype_to_conv['NAdatetime64[D]'] = dtype_to_conv['datetime64[D]']
+
+def init_delimiter_and_NA(delimiter,
+ NA_re,
+ type_search_order,
+ re_dict):
+ """
+ Initialize the delimiter and NA_re
+
+ Parameters
+ ----------
+ delimiter: string
+ Regular expression for the delimeter between data.
+ NA_re: string
+ The the regular expression for missing data
@bsouthey
bsouthey added a note

Extra 'the'.

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((1,270 lines not shown))
+ If a single int, interpreted as a minimum string size for all entries.
+ If a list of ints, interpreted as the minimum string size for each
+ individual entry. An error is thrown if the lenght of this list
+ differs from the number of entries per row found. If check_sizes is
+ False, then these minimum sizes are never changed.
+ check_sizes: int
+ Number of lines of data to use for baseline size estimates before
+ checking via regular expressions. This is included because for
+ complicated data files it can be extremely expensive to discover
+ that the current size estimates are wrong, and it is much cheaper
+ to simply spend a longer time determining sizes before assuming
+ you have the correct sizes.
+ col_names:
+ The names for each column
+ NA_re: string
+ The the regular expression for missing data
@bsouthey
bsouthey added a note

Extra 'the'.

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((364 lines not shown))
+ quoted_dtype_to_re[NAkey] = quoted_dtype_to_re[key]
+
+# Dict for default missing values for each dtype, to use in masked arrays
+dtype_default_missing = {
+ 'bool': True,
+ 'int32': 999999,
+ 'int64': 999999,
+ 'float32': 1.e20,
+ 'float64': 1.e20,
+ 'complex64': 1.e20+0j,
+ 'complex128': 1.e20+0j,
+ 'datetime64[D]': 'NaT',
+ '|S1': 'N/A'
+ }
+
+# For easy in the loadtable function, so the re's aren't recompiled
@bsouthey
bsouthey added a note

'For easy in the loadtable' does not make sense.
Expand 're' to 'regular expressions'

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((414 lines not shown))
+ easy to load.
+
+ For most users, the only parameters of interest are fname, delimiter,
+ header, and (possibly) type_search_order. The rest are for various
+ more specialzed/unusual data formats.
+
+ See Notes for performance tips.
+
+ Parameters
+ ----------
+ fname: string or iterable (see description)
+ Either the filename of the file containing the data or an iterable
+ (such as a file object). If an iterable, must have __iter__, next,
+ and seek methods.
+ delimiter: string, optional
+ Regular expression for the delimeter between data. The regular
@bsouthey
bsouthey added a note

You mean 'delimiter' not 'delimeter'?

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((424 lines not shown))
+ fname: string or iterable (see description)
+ Either the filename of the file containing the data or an iterable
+ (such as a file object). If an iterable, must have __iter__, next,
+ and seek methods.
+ delimiter: string, optional
+ Regular expression for the delimeter between data. The regular
+ expression must be non-capturing. (i.e. r'(?:3.14)' instead of
+ r'(3.14)')
+ comments: string, optional
+ Regular expression for the symbol(s) indicating the start
+ of a comment.
+ header: bool, optional
+ Flag indicating whether the data contains a row of column names.
+ type_search_order: list of strings/dtypes
+ List of objects which np.dtype will recognize as dtypes.
+ skip_lines: int, optional
@bsouthey
bsouthey added a note

Use 'skiprows' to be consistent with numpy.genfromtxt.

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((435 lines not shown))
+ header: bool, optional
+ Flag indicating whether the data contains a row of column names.
+ type_search_order: list of strings/dtypes
+ List of objects which np.dtype will recognize as dtypes.
+ skip_lines: int, optional
+ Number of lines in the beginning of the text file to skip before
+ reading the data
+ num_lines_search: int, optional
+ Number of lines, not including comments and header, to search
+ for type and size information. Done to decrease the time required
+ to determine the type and size information for data that is very
+ homogenous.
+ string_sizes: int or list of ints, optional
+ If a single int, interpreted as a minimum string size for all entries.
+ If a list of ints, interpreted as the minimum string size for each
+ individual entry. An error is thrown if the lenght of this list
@bsouthey
bsouthey added a note

Length?

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((433 lines not shown))
+ Regular expression for the symbol(s) indicating the start
+ of a comment.
+ header: bool, optional
+ Flag indicating whether the data contains a row of column names.
+ type_search_order: list of strings/dtypes
+ List of objects which np.dtype will recognize as dtypes.
+ skip_lines: int, optional
+ Number of lines in the beginning of the text file to skip before
+ reading the data
+ num_lines_search: int, optional
+ Number of lines, not including comments and header, to search
+ for type and size information. Done to decrease the time required
+ to determine the type and size information for data that is very
+ homogenous.
+ string_sizes: int or list of ints, optional
+ If a single int, interpreted as a minimum string size for all entries.
@bsouthey
bsouthey added a note

Use 'integer' rather than int.

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((434 lines not shown))
+ of a comment.
+ header: bool, optional
+ Flag indicating whether the data contains a row of column names.
+ type_search_order: list of strings/dtypes
+ List of objects which np.dtype will recognize as dtypes.
+ skip_lines: int, optional
+ Number of lines in the beginning of the text file to skip before
+ reading the data
+ num_lines_search: int, optional
+ Number of lines, not including comments and header, to search
+ for type and size information. Done to decrease the time required
+ to determine the type and size information for data that is very
+ homogenous.
+ string_sizes: int or list of ints, optional
+ If a single int, interpreted as a minimum string size for all entries.
+ If a list of ints, interpreted as the minimum string size for each
@bsouthey
bsouthey added a note

'list of integers'

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((438 lines not shown))
+ List of objects which np.dtype will recognize as dtypes.
+ skip_lines: int, optional
+ Number of lines in the beginning of the text file to skip before
+ reading the data
+ num_lines_search: int, optional
+ Number of lines, not including comments and header, to search
+ for type and size information. Done to decrease the time required
+ to determine the type and size information for data that is very
+ homogenous.
+ string_sizes: int or list of ints, optional
+ If a single int, interpreted as a minimum string size for all entries.
+ If a list of ints, interpreted as the minimum string size for each
+ individual entry. An error is thrown if the lenght of this list
+ differs from the number of entries per row found. If check_sizes is
+ False, then these minimum sizes are never changed.
+ check_sizes: boolean or int, optional
@bsouthey
bsouthey added a note

This appears to be out of alignment with the other lines (I have not downloaded it to check).

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((455 lines not shown))
+ string dtypes. This is an expensive option.
+ If true it will check all lines for sizes. If an integer, it will
+ check up to that number of rows from the beginning. And if false
+ it will check no rows and use the defaults given from string_size.
+ is_Inf_NaN: bool, optional
+ Whether to allow floats that are Inf and NaN
+ NA_re: string, optional
+ Regular expression for missing data The regular
+ expression must be non-capturing. (i.e. r'(?:3.14)' instead of
+ r'(3.14)')
+ usecols : int or sequence, optional
+ Which columns to use. Selected with 0-indexing, using the same
+ sytax as selecting elements of a list. (i.e. -1 refers to the
+ last element, -2 to the second to last, etc.) If this is used,
+ no additional memory will be used for the unselected columns.
+ quoted: bool, optional
@bsouthey
bsouthey added a note

This appears to be out of alignment with the other lines (I have not downloaded it to check).

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((505 lines not shown))
+ If any regular expression matching fails
+
+ See Also
+ --------
+ loadtxt, genfromtxt, dtype
+
+ Notes
+ -----
+ This function operates by making two passes through the text file given.
+ In the first pass it determines the dtypes based on regular expressions
+ for the dtypes and custom promotion rules between dtypes. (The promotion
+ rules are used, for example, if a column appears to be integer and then
+ a float is seen.) In the first pass it can also determine the sizes of
+ strings, if that option is enabled. After determining the dtypes and
+ string sizes, it pre-allocates a numpy array of the appropriate size
+ (or masked array in the prescense of missing data) and fills it line
@bsouthey
bsouthey added a note

Presence?

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((570 lines not shown))
+ masked_array(data = [(--, True, 0.30000001192092896, 5)
+ ('60to70', True, 4.3000001907348633, 3)
+ ('80to90', False, --, 20)],
+ mask = [(True, False, False, False)
+ (False, False, False, False)
+ (False, False, True, False)],
+ fill_value = ('N/A', True, 1.0000000200408773e+20, 999999),
+ dtype = [('TempRange', 'S8'),
+ ('Cloudy', '?'),
+ ('AvgInchesRain', '<f4'),
+ ('Count', '<i4')])
+
+ For more examples see the test files for load_table in numpy/lib/tests
+ """
+
+ # Initialize various variables and sanitize inputs variables
@bsouthey
bsouthey added a note

Input variables?

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((684 lines not shown))
+ f = fname
+ else:
+ raise ValueError(''.join(['fname must be filename, file, ',
+ 'iterable with seek, or valid input to ',
+ 'np.DataSource.']))
+ return f
+
+def init_re_dict(quoted):
+ """
+ Determines whether to use quoted or unquoted regular expressions
+ for the data
+
+ Parameters
+ ----------
+ quoted: boolean
+ True of the data can be quoted, False else
@bsouthey
bsouthey added a note

Two errors here. So perhaps:
'True' if the data can be quoted, otherwise 'False'?

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((758 lines not shown))
+ type_search_order.insert(float_index+2, 'commafloat64InfNaN')
+
+ return type_search_order
+
+def init_datetime(date_re, date_strp, re_dict, quoted):
+ """
+ Initialize the datetime regular expression and converter
+
+ Parameters
+ ----------
+ date_re: string
+ The regular expression for dates.
+ date_strp: string
+ The format to use for converting a date string to a date. Uses
+ the format from datetime.datetime.strptime in the Python Standard
+ Library datetime module.
@bsouthey
bsouthey added a note

This is confusing. I presume that by the second usage that you mean that the string has to conform to the datetime format. If so, can you please rewrite it?

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((861 lines not shown))
+ commentstr,')|(^\s*$)']))
+ # RE for the delimiter including white space
+ delimiter_pattern = re.compile(''.join(['\s*(?:',
+ delimiter,')\s*']))
+ if delimiter_pattern.groups>0:
+ raise ValueError("Delimiter regular expression must be non-capturing")
+ return ignore_pattern, delimiter_pattern
+
+def init_usecols(usecols, coltypes):
+ """
+ Initialize the usecols of columns to be used
+
+ Parameters
+ ----------
+ usecols: int or sequence
+ The columns to be used
@bsouthey
bsouthey added a note

Should be indented more.

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((1,289 lines not shown))
+ Dictionary associating types in type_search_order with regular
+ expressions for each type
+
+ Returns
+ -------
+ integer, list of integers, list of strings, compiled regular expression
+ The number of rows of data, the maximum string size of each column's
+ data, the types of each column, and a compiled regular expression
+ to capture the entries in a row of data
+ """
+
+ nrows_data = 0
+ coltypes = None
+ sizes = None
+ if check_sizes and isinstance(check_sizes,type(True)):
+ check_sizes = np.Inf
@bsouthey
bsouthey added a note

Is 'check_sizes' an integer as stated by the documentation or float or bool?

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((1,361 lines not shown))
+ coltypes,
+ delimiter_pattern,
+ entry_pattern)
+ row_re = make_row_re(coltypes,
+ white,
+ delimiter,
+ NA_re,
+ re_dict)
+ row_re_pattern = re.compile(row_re)
+ if nrows_data<=check_sizes:
+ sizes = update_sizes(sizes, line, entry_pattern)
+ return nrows_data, sizes, coltypes, entry_pattern
+
+def update_sizes(sizes, line, entry_pattern):
+ """
+ Update the sizes for a row of data. Checks a re of the
@bsouthey
bsouthey added a note

What do you mean by 'row' and 'data'? The function only knows about sizes, line and entry_pattern, so you have to document it based on those arguments, since someone can use this function directly without reference to loadtable.

Perhaps:
'Update the field sizes of the input line'?

'regular expression' rather than 're'.

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((1,364 lines not shown))
+ row_re = make_row_re(coltypes,
+ white,
+ delimiter,
+ NA_re,
+ re_dict)
+ row_re_pattern = re.compile(row_re)
+ if nrows_data<=check_sizes:
+ sizes = update_sizes(sizes, line, entry_pattern)
+ return nrows_data, sizes, coltypes, entry_pattern
+
+def update_sizes(sizes, line, entry_pattern):
+ """
+ Update the sizes for a row of data. Checks a re of the
+ current upper limit for the sizes against the line. If it
+ doesn't match, the elements are seperated by delimiter, taking
+ out the whitespace, and the sizes are increased to the sizes
@bsouthey
bsouthey added a note

Why do you say 'taking out the whitespace'? If the delimiter is not whitespace then the whitespace must remain by definition.

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((1,378 lines not shown))
+ doesn't match, the elements are seperated by delimiter, taking
+ out the whitespace, and the sizes are increased to the sizes
+ in the current row.
+
+ Parameters
+ ----------
+ sizes: list of integers
+ The current maximum size of each entry in the data
+ line: string
+ The current line of data
+ delimiter_pattern: compiled regular expression
+ The compiled regular expression for the delimiter separating entries
+
+ Returns
+ -------
+ list of ints, compiled regular expression
@bsouthey
bsouthey added a note

'integers' rather than 'ints'.

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((1,422 lines not shown))
+ """
+ res = []
+ for t in type_search_order:
+ res.append(''.join(['(^', re_dict[t],'$)']))
+ return re.compile('|'.join(res))
+
+
+def update_coltypes(type_search_order,
+ type_re,
+ line,
+ coltypes,
+ delimiter_pattern,
+ entry_pattern):
+ """
+ Update the current gueses for dtypes for each column.
+ Is only called if there is a mismatch between the current guess
@bsouthey
bsouthey added a note

Just start the sentence with 'Only called'.

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((1,429 lines not shown))
+def update_coltypes(type_search_order,
+ type_re,
+ line,
+ coltypes,
+ delimiter_pattern,
+ entry_pattern):
+ """
+ Update the current gueses for dtypes for each column.
+ Is only called if there is a mismatch between the current guess
+ and the data in the row line, according to the current
+ regular expression.
+
+ Parameters
+ ----------
+ type_search_order: list of strings
+ List of the dtypes to be searched for, in order
@bsouthey
bsouthey added a note

'in order' of what?

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((1,178 lines not shown))
+ -------
+ None or list of strings
+ Either returns a list of column names or None if no line
+ containing data is found
+ """
+
+ for count in xrange(skip_lines):
+ next(f)
+
+ for line in f:
+ if(ignore_pattern.match(line)):
+ # line is comment or whitespace
+ pass
+ else:
+ # Find column names. Eliminate double quotes around
+ # column names, if they exist in data file
@bsouthey
bsouthey added a note

'input file' not 'data file'.

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((1,263 lines not shown))
+ type_search_order: list of strings
+ List of string representations for the types to be checked for,
+ in the order in the list
+ num_lines_search: int
+ The number of lines to use when determining the dtype for each
+ column
+ string_sizes: int or list of ints
+ If a single int, interpreted as a minimum string size for all entries.
+ If a list of ints, interpreted as the minimum string size for each
+ individual entry. An error is thrown if the lenght of this list
+ differs from the number of entries per row found. If check_sizes is
+ False, then these minimum sizes are never changed.
+ check_sizes: int
+ Number of lines of data to use for baseline size estimates before
+ checking via regular expressions. This is included because for
+ complicated data files it can be extremely expensive to discover
@bsouthey
bsouthey added a note

'input files' not 'data files'.

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((1,496 lines not shown))
+ coltypes: list of strings
+ String representation of the current guesses for the types of
+ each column
+ white: string
+ Regular expression for white space
+ delimiter: string
+ Regular expression for delimiter between data
+ NA_re: string
+ Regular expression for missing entries
+ re_dict: dictionary
+ Dictionary giving regular expression for each type
+
+ Returns
+ -------
+ string
+ Regular expression each succeeding row of the data file should match
@bsouthey
bsouthey added a note

What 'data file'? The function has no 'input file' or 'data file' argument.
'should match' what?

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((425 lines not shown))
+ Either the filename of the file containing the data or an iterable
+ (such as a file object). If an iterable, must have __iter__, next,
+ and seek methods.
+ delimiter: string, optional
+ Regular expression for the delimeter between data. The regular
+ expression must be non-capturing. (i.e. r'(?:3.14)' instead of
+ r'(3.14)')
+ comments: string, optional
+ Regular expression for the symbol(s) indicating the start
+ of a comment.
+ header: bool, optional
+ Flag indicating whether the data contains a row of column names.
+ type_search_order: list of strings/dtypes
+ List of objects which np.dtype will recognize as dtypes.
+ skip_lines: int, optional
+ Number of lines in the beginning of the text file to skip before
@bsouthey
bsouthey added a note

Isn't this more than just a 'text file'? I think you mean the 'input' or 'input file-like object' rather than 'text file', especially as your 'iterable' is not a 'text file'. This comment applies to the whole documentation.

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((498 lines not shown))
+ Raises
+ ------
+ IOError
+ If the input file does not exist or cannot be read.
+ ValueError
+ If the input file does not contain any data.
+ RuntumeError
+ If any regular expression matching fails
+
+ See Also
+ --------
+ loadtxt, genfromtxt, dtype
+
+ Notes
+ -----
+ This function operates by making two passes through the text file given.
@bsouthey
bsouthey added a note

'input file' not 'text file'.

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((663 lines not shown))
+ return data
+
+def init_file(fname):
+ """
+ initiate the file variable
+
+ Parameters
+ ----------
+ fname: file or string
+ The file to read data from, or the file's path (absolute or
+ relative)
+
+ Returns
+ -------
+ file
+ The file object for the text file containing the data
@bsouthey
bsouthey added a note

'input file' not 'text file'.

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((897 lines not shown))
+ nrows_data,
+ sizes,
+ coltypes,
+ header,
+ col_names,
+ skip_lines,
+ re_dict,
+ quoted,
+ usecols):
+ """
+ Function that actually loads the data into a masked array.
+
+ Parameters
+ ----------
+ f: file
+ The text file containing the data
@bsouthey
bsouthey added a note

'input file' not 'text file'.

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((896 lines not shown))
+ delimiter_pattern,
+ nrows_data,
+ sizes,
+ coltypes,
+ header,
+ col_names,
+ skip_lines,
+ re_dict,
+ quoted,
+ usecols):
+ """
+ Function that actually loads the data into a masked array.
+
+ Parameters
+ ----------
+ f: file
@bsouthey
bsouthey added a note

Is this a 'file' or a 'file object'?

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((411 lines not shown))
+ It will also automatically detect the prescense of unlabeled row
+ names, but only if there is one column of them. This is done
+ to make loading data saved from some other systems, such as R,
+ easy to load.
+
+ For most users, the only parameters of interest are fname, delimiter,
+ header, and (possibly) type_search_order. The rest are for various
+ more specialzed/unusual data formats.
+
+ See Notes for performance tips.
+
+ Parameters
+ ----------
+ fname: string or iterable (see description)
+ Either the filename of the file containing the data or an iterable
+ (such as a file object). If an iterable, must have __iter__, next,
@bsouthey
bsouthey added a note

The usage of 'file' within your documentation as a whole is a little confusing to me. Surely at some stage everything becomes a file-like object, or whatever object name you want to use, so that when you use the word 'file' you mean more than just a 'file' but also the 'iterable' object.

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((479 lines not shown))
+ The regular expression for dates. This assumes that all dates
+ follow the same format. Defaults to the ISO standard. The regular
+ expression must be non-capturing. (i.e. r'(?:3.14)' instead of
+ r'(3.14)')
+ date_strp: string, optional
+ The format to use for converting a date string to a date. Uses
+ the format from datetime.datetime.strptime in the Python Standard
+ Library datetime module.
+ default_missing_dtype: string, optional
+ String representation for the default dtype for columns all of whose
+ entries are missing data.
+
+ Returns
+ -------
+ result: Numpy record array or masked record array
+ The data stored in the file, as a numpy record array. The field
@bsouthey
bsouthey added a note

'input' rather than 'file' (due to the iterable)?

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((480 lines not shown))
+ follow the same format. Defaults to the ISO standard. The regular
+ expression must be non-capturing. (i.e. r'(?:3.14)' instead of
+ r'(3.14)')
+ date_strp: string, optional
+ The format to use for converting a date string to a date. Uses
+ the format from datetime.datetime.strptime in the Python Standard
+ Library datetime module.
+ default_missing_dtype: string, optional
+ String representation for the default dtype for columns all of whose
+ entries are missing data.
+
+ Returns
+ -------
+ result: Numpy record array or masked record array
+ The data stored in the file, as a numpy record array. The field
+ names default to 'f'+i for field i if header is False, else the
@bsouthey
bsouthey added a note

What is 'f'?
I presume you mean the field number prefixed by the character 'f', that is 'f0', 'f1', etc., rather than an attempt to add an integer to a string.

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((1,365 lines not shown))
+ white,
+ delimiter,
+ NA_re,
+ re_dict)
+ row_re_pattern = re.compile(row_re)
+ if nrows_data<=check_sizes:
+ sizes = update_sizes(sizes, line, entry_pattern)
+ return nrows_data, sizes, coltypes, entry_pattern
+
+def update_sizes(sizes, line, entry_pattern):
+ """
+ Update the sizes for a row of data. Checks a re of the
+ current upper limit for the sizes against the line. If it
+ doesn't match, the elements are seperated by delimiter, taking
+ out the whitespace, and the sizes are increased to the sizes
+ in the current row.
@bsouthey
bsouthey added a note

'line' not 'row'.

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((1,370 lines not shown))
+ if nrows_data<=check_sizes:
+ sizes = update_sizes(sizes, line, entry_pattern)
+ return nrows_data, sizes, coltypes, entry_pattern
+
+def update_sizes(sizes, line, entry_pattern):
+ """
+ Update the sizes for a row of data. Checks a re of the
+ current upper limit for the sizes against the line. If it
+ doesn't match, the elements are seperated by delimiter, taking
+ out the whitespace, and the sizes are increased to the sizes
+ in the current row.
+
+ Parameters
+ ----------
+ sizes: list of integers
+ The current maximum size of each entry in the data
@bsouthey
bsouthey added a note

'line' not 'data'.

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((1,372 lines not shown))
+ return nrows_data, sizes, coltypes, entry_pattern
+
+def update_sizes(sizes, line, entry_pattern):
+ """
+ Update the sizes for a row of data. Checks a re of the
+ current upper limit for the sizes against the line. If it
+ doesn't match, the elements are seperated by delimiter, taking
+ out the whitespace, and the sizes are increased to the sizes
+ in the current row.
+
+ Parameters
+ ----------
+ sizes: list of integers
+ The current maximum size of each entry in the data
+ line: string
+ The current line of data
@bsouthey
bsouthey added a note

Cannot use 'line' (still not defining line) or 'data' (which does not exist in the function).

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((1,396 lines not shown))
+
+ matches = entry_pattern.match(line.strip())
+ if not matches:
+ raise RuntimeError('Cannot parse column data')
+ return map(max, zip(sizes, map(len, matches.groups())))
+
+
+def build_type_re(type_search_order, re_dict):
+ """
+ Builds a regular expression for testing for types, using the
+ types and search order specified in type_search_order.
+
+ Parameters
+ ----------
+ type_search_order: list of strings
+ List of the dtypes allowable in the order they're checked for.
@bsouthey
bsouthey added a note

'checked for' what?
I do not see the need for 'order they're checked for' here because the output just has the same order as the input list.

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((1,491 lines not shown))
+ """
+ Make an re for current column type guesses.
+
+ Parameters
+ ----------
+ coltypes: list of strings
+ String representation of the current guesses for the types of
+ each column
+ white: string
+ Regular expression for white space
+ delimiter: string
+ Regular expression for delimiter between data
+ NA_re: string
+ Regular expression for missing entries
+ re_dict: dictionary
+ Dictionary giving regular expression for each type
@bsouthey
bsouthey added a note

'Dictionary of regular expressions for each coltype'?

@bsouthey

The code must work with Python2.4+!
For example, loadtable.py has syntax that is invalid under Python 2.4:

Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "loadtable.py", line 953
    coltypes_input = [dtype_dict[t] if t not in
                                 ^
SyntaxError: invalid syntax
@bsouthey

How do you obtain a specific format, such as when you want integers treated as floats or vice versa?
In this example, the first column becomes a float, so it requires extra processing to convert it to an integer type.
a,b,c,d
1,2,3,4
1.,3,4,5
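
There is indeed no per-column override in the current API, so the workaround is a cast after loading (a sketch, with a made-up structured array shaped like the result above):

import numpy as np

# hypothetical result of loading the example: column 'a' was promoted to float
data = np.array([(1.0, 2, 3, 4), (1.0, 3, 4, 5)],
                dtype=[('a', 'f8'), ('b', 'i8'), ('c', 'i8'), ('d', 'i8')])
a_as_int = data['a'].astype(np.int64)   # extra pass to recover integers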

@bsouthey bsouthey commented on the diff
numpy/lib/io/loadtable.py
((372 lines not shown))
+ 'float64': 1.e20,
+ 'complex64': 1.e20+0j,
+ 'complex128': 1.e20+0j,
+ 'datetime64[D]': 'NaT',
+ '|S1': 'N/A'
+ }
+
+# For easy in the loadtable function, so the re's aren't recompiled
+# every time loadtable is called.
+quoted_entry_pattern = re.compile(r'"[^"]*"')
+entry_re = r'"[^"]*"|[^"]*?'
+
+
+
+def loadtable(fname,
+ delimiter=' ',

The default delimiter must be 'None', meaning whitespace, because that is the default for loadtxt and genfromtxt.

@rgommers
Owner

Hi all, there are quite a few things to fix in the current patch (good job on the detailed comments Bruce), but it looks like those are relatively straightforward. More important is to define what still has to be done in order for this to be merged. It would be good if we could come to a conclusion on that. It seems Pierre's take on this is that it should be able to replace genfromtxt. What's not completely clear to me is:

  • what exactly has to be done to cover all functionality
  • can loadtable be comparable speed-wise? or is this already the case?
  • if too many corner cases are not yet handled, is it possible to define an intermediate goal that allows us to merge this? either way genfromtxt will not be deprecated for a long time (if ever), I think

Besides the above points, it looks to me like there are a number of features already that set this apart from genfromtxt. Things like automatic type detection, handling datetimes and better memory performance. So in my view it's not a given that loadtable should replace genfromtxt.

Can we discuss this on this pull request? If the mailing list is preferred that's fine too. But if we leave this sitting here without resolving these questions, there's a good chance that the effort spent on this will be wasted...

@rgommers
Owner

About the genfromtxt auto-detection, it's not nearly as good. I took the first example in the loadtable docstring, and with genfromtxt I need:

np.genfromtxt(f, delimiter=',', skiprows=3, dtype=None)
@rgommers
Owner

About modularity not being a real advantage (guess you would also include code quality in general?), I couldn't disagree more. Given our limited developer time, it's a very important benefit.

@rgommers
Owner

I agree with your first point, that the API should match where possible. You already identified quite a few easy changes to improve the current patch in this respect.

@rgommers
Owner

Whether or not functions are used more than once is not relevant. Splitting them off as is done here is still far better than putting them all in a single huge function. The init_xx functions are on average about 10 lines long (plus doc/comments); putting them in the loadtable body would triple its length and make it far less readable.

Right now you can figure out in a few minutes how the logic in loadtable works. This is valuable. For genfromtxt I still wouldn't be able to tell you exactly.

@rgommers rgommers commented on the diff
numpy/lib/io/loadtable.py
((534 lines not shown))
+ for some reasonable value of k.
+ * This method defaults to 64-bit ints and floats. If these sizes are
+ unnecessary they should be reduced to 32-bit ints and floats to
+ conserve space.
+ * Converting comma float strings (i.e. '3,24') is about twice as
+ expensive as converting decimal float strings
+
+ Examples
+ --------
+ First create simply data file and then load it.
+ >>> from StringIO import StringIO #StringIO behaves like a file object
+ >>> s = ''.join(['#Comment \n \n TempRange, Cloudy, AvgInchesRain,',
+ ... 'Count\n 60to70, True, 0.3, 5\n 60to70, True, 4.3,',
+ ... '3 \n 80to90, False, 0, 20'])
+ >>> f = StringIO(s)
+
@rgommers Owner

The above example doesn't work. Delimiter is ', ' except for one value which is ','. Delimiter has to be specified to make it work for me. Delimiter auto-detection would be quite handy here.
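(Presumably the example runs with an explicit delimiter, reusing s and f from the snippet above -- a sketch:)

>>> f = StringIO(s)
>>> table = loadtable(f, delimiter=',', header=True)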

@rgommers
Owner

The first example indeed doesn't work. I had just looked at the docstring and assumed that the output shown there is correct.

For example two, the blank line is the one separating header from data columns, so I don't see a problem here. It's quite useful to be able to say header=True instead of having to count the number of lines to skip by hand.

@charris
Owner

@rgommers @bsouthey I'm going to close this unless there is resistance. It sounds like the functionality might be good, but the development has stalled. Does someone else want to pick this up?

@rgommers
Owner

Unfortunate that this has stalled, but closing it makes sense. We can't leave PRs open forever. If someone wants to pick this up and process all review comments, I'm still in favor of merging it.

@rgommers rgommers closed this
Commits on Sep 6, 2011
  1. @chrisjordansquire

    ENH: auto-type detect data loader

    chrisjordansquire authored
    Created a function to auto-detect data types for each column in a csv-style
    data file. Works with arbitrary separators between the data and loads the
    data into a numpy record array.
  2. @chrisjordansquire
  3. @chrisjordansquire
  4. @chrisjordansquire
  5. @chrisjordansquire

    Refactored tests

    chrisjordansquire authored
  6. @chrisjordansquire
Commits on Sep 9, 2011
  1. @chrisjordansquire
Commits on Sep 12, 2011
  1. @chrisjordansquire

    Faster size checking

    chrisjordansquire authored