genfromtxt() handles comments incorrectly with names=True (Trac #2184) #637

numpy-gitbot opened this Issue Oct 19, 2012 · 6 comments

Original ticket on 2012-07-11 by trac user khaeru, assigned to unknown.

The documentation for genfromtxt() reads:

When the variables are named (either by a flexible dtype or with names), there must not be any header in the file (else a ValueError exception is raised).

and also:

If names is True, the field names are read from the first valid line after the first skip_header lines.

The cause of this seems to be in [ numpy/lib/ at lines 1347-9]:

    if names is True:
        if comments in first_line:
            first_line = asbytes('').join(first_line.split(comments)[1:])

The last line should read first_line = first_line.split(comments)[0].

With the current code, the input line:

# Example comment line

will be transformed to:

Example comment line

resulting in columns named 'Example', 'comment' and 'line' (this is what the warning in the documentation is about).

But also the input line:

ColumnA ColumnB ColumnC # the column names precede this comment

will be transformed to:

the column names precede this comment

resulting in columns named 'the', 'column', 'names' …etc. In this instance actual column names present in the file are inappropriately discarded.
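The two transformations above can be reproduced outside NumPy with a minimal sketch. It assumes plain str input rather than bytes, whitespace-delimited fields, and '#' as the comment marker; current_header() is a hypothetical helper name used only for illustration, not a NumPy function.

```python
# Stand-alone sketch of the quoted header handling (not the NumPy source).
# Assumptions: str instead of bytes, whitespace delimiter, '#' comments.
def current_header(first_line, comments="#"):
    """Mimic the quoted logic: keep everything AFTER the first comment marker."""
    if comments in first_line:
        first_line = "".join(first_line.split(comments)[1:])
    return first_line.split()

# A fully commented line: the comment text becomes the column names.
print(current_header("# Example comment line"))

# Real names followed by a trailing comment: the names are thrown away.
print(current_header(
    "ColumnA ColumnB ColumnC # the column names precede this comment"))
```

In the second call the text before the comment marker never reaches the name list, which is the behaviour this ticket objects to.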

By taking the [0] portion of the split instead of [1:]:

  • Lines beginning with comments result in an empty string being passed to split_lines() on L1350, producing no usable output and causing the while not first_values loop to try the next line.
  • Partial-line comments following actual heading names are discarded, instead of the names themselves.
  • As a result, files can have commented headers of any length and column names, simultaneously.
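A rough sketch of how the proposed [0] rule would behave, again outside NumPy. It assumes str input, whitespace-delimited fields, and '#' as the comment marker; proposed_names() is a hypothetical helper standing in for the header loop, not the library's actual code.

```python
# Stand-alone sketch of the proposed rule (not the NumPy source).
# Assumptions: str instead of bytes, whitespace delimiter, '#' comments.
def proposed_names(lines, comments="#"):
    """Return names from the first line that has content before any comment."""
    for line in lines:
        if comments in line:
            # Keep only the part BEFORE the comment marker.
            line = line.split(comments)[0]
        values = line.split()
        if values:
            # First line with usable fields supplies the names.
            return values
    return []

sample = [
    "# Example comment line",
    "# another commented header line",
    "ColumnA ColumnB ColumnC # the column names precede this comment",
    "1 2 3",
]
print(proposed_names(sample))  # ['ColumnA', 'ColumnB', 'ColumnC']
```

One consequence of this rule is that a header written entirely inside a comment, such as "# ColumnA ColumnB ColumnC", is skipped rather than used for names, so files relying on that form would pick up names from a later line instead.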

trac user khaeru wrote on 2012-07-11

Title changed from Remove to genfromtxt() handles comments incorrectly with names=True by trac user khaeru on 2012-07-11


@rgommers wrote on 2012-07-12

We opened GitHub issues only a few weeks ago, and we're in the process of transitioning all Trac tickets to it. When that's done we'll close Trac, or make it read-only. For now you can use either one.


@rgommers wrote on 2012-07-12

Suggested fix looks correct.


trac user khaeru wrote on 2012-07-12

Oh, I see — well, I also posted a branch with this fix and a pull request: #351

NumPy member

#351 was closed as a wrong fix because it broke user code. I don't know the current status of fixing this.

