[Idea/Suggestion] Improved support for "header" lines in input files #612
Comments
I think what might work is some sort of …
Interesting. I definitely agree that this should be as "generic" as possible, so it can support anything at the start of a file, even if only to ignore it. With what you describe, at what point would the custom logic get access to the file data, and in what format? Currently, I'm handling this case with a separate transform function between the read stream and the parser. It's a little gross, since it must "accumulate" chunk data until it has consumed enough of the file to reach the column headers. By the way, a …
Yeah, that's a legitimate point. It'd have to be a function that gets the raw FileReader, or … I think it could get corrupted pretty easily by other parsing options. #609 suggests the …
Another thought on this topic: perhaps a more generic way to handle this, which would open up some other possibilities as well, is to have the option for a function that simply gets a "look" at each line as it's received, but before any other parsing is done. Something like …
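As a sketch of that "look at each line" idea (the `beforeLine` hook name and the toy comma-splitting parser are assumptions for illustration, not the library's real API):

```javascript
// Hypothetical per-line hook: beforeLine sees every raw line before any
// field parsing happens. Returning true means "I consumed this line,
// do not parse it as data"; returning false lets it fall through.
function splitWithHook(text, beforeLine) {
  const rows = [];
  for (const line of text.split('\n')) {
    if (line === '') continue; // skip blank lines
    if (beforeLine(line)) continue; // hook consumed the line
    rows.push(line.split(',')); // stand-in for the real field parser
  }
  return rows;
}
```

A caller could use such a hook to collect header lines while letting everything else fall through to normal parsing.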
3 years on, this issue is still salient
@theLAZYmd feel free to propose a solution for it
I'll draft a PR, but I think it should be possible to provide a …
I'm happy to write the code and make the PR, but can someone make further commits to the PR to update the documentation?
@theLAZYmd I do not mind breaking the API; we would just publish a new major version and it should be fixed. Also, I would like to see a valid use case for it (I do not want to add more complexity without a reason).
Ok. My judgement is that the non-breaking formulation of the function is so inoffensive to use that it is not worth breaking the API just to implement it the 'ideal' way.
Use case: it's fairly common for publishers of data (especially financial data) to have multiple lines as their headers, usually because the CSV data is exported from an Excel file. Please see the attached sheet as an example, which is data from ICAP.
Proposed in #898 :)
Just to mention, I've started using this patch in my production code with quite pleasing results!
A rather common CSV format found in things like web server logs is to have one or more "header" lines at the beginning of the file, like this:
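For illustration, a W3C-extended-log-style file (one common web server format; this particular sample is invented, not taken from the original report) begins with prefixed header lines before the data:

```
#Version: 1.0
#Fields: date time c-ip cs-method cs-uri-stem sc-status
2024-01-01 00:00:01 10.0.0.1 GET /index.html 200
```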
Obviously this library can't be expected to deal with every such header format, but I think this could be solved fairly simply by adding two props to `options`: a prefix to identify such header lines, and a function (maybe called something like `parseHeaders`). The idea is to allow a custom function to have a look at the first n lines of the file, until the function figures out the headers and returns them.

When the library starts processing a file, it's in a "seeking header information" state and not parsing data. Every line that starts with the specified prefix is passed, in its entirety, to the `parseHeaders` function. That function can return either a falsey value or a hash or array containing the headers. Once the headers are returned, the library moves to a "seeking data" state (but it continues passing any line identified by the prefix to this function, in case other such header lines appear after the one that contains the column headers).
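The two-option design above could be sketched roughly like this (the option names `prefix` and `parseHeaders` come from the proposal itself; the comma-splitting record parser is a simplification standing in for the real CSV parsing):

```javascript
// Sketch of the proposed design. parseHeaders receives each prefixed
// line in its entirety; a falsey return means "keep seeking", while a
// returned array becomes the column headers.
function parseWithHeaderScan(lines, { prefix, parseHeaders }) {
  let headers = null;
  const records = [];
  for (const line of lines) {
    if (line.startsWith(prefix)) {
      const result = parseHeaders(line);
      if (!headers && result) headers = result; // leave "seeking header" state
      continue; // prefixed lines are never parsed as data
    }
    if (!headers) continue; // still seeking header information
    const fields = line.split(','); // stand-in for real field parsing
    records.push(Object.fromEntries(headers.map((h, i) => [h, fields[i]])));
  }
  return { headers, records };
}
```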
In any case, the general idea of getting the column headers from the file is already there, but it's currently limited to only looking for headers on the first non-comment/non-blank row and then they must follow the same format as the data, with regard to delimiters, etc. This concept allows extendable support for more complex header formats, including cases where the column headers may not even be itemized in the file at all, but are instead identified by some sort of constant, etc.
I realize this is somewhat similar to #186, but in the case of processing a stream (i.e. in node), you have no control over the chunk size and therefore cannot rely on the entire header being within a single chunk. This concept works at the line level, after line-level parsing.