
[Idea/Suggestion] Improved support for "header" lines in input files #612

Open
Eric24 opened this issue Jan 6, 2019 · 13 comments

Comments

@Eric24

Eric24 commented Jan 6, 2019

A rather common CSV format found in things like web server logs is to have one or more "header" lines at the beginning of the file, like this:

#Version: 1.0
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query cs(Cookie) x-edge-result-type x-edge-request-id x-host-header cs-protocol cs-bytes time-taken x-forwarded-for ssl-protocol ssl-cipher x-edge-response-result-type cs-protocol-version fle-status fle-encrypted-fields
2019-01-04	11:37:39	ORD50-C1	1030	0.0.0.0	GET	xxxxxxxxxx.domain.net	/	200	-	Mozilla/5.0%2520(Windows%2520NT%252010.0;%2520Win64;%2520x64)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/71.0.3578.98%2520Safari/537.36	-	-	Test	qiW8xjO4CoRnsrW843xPY4fE_sKPZp7YYgpeskJhd3N9ZkArsmUvrw==	host.domain.com	https	283	0.188	-	TLSv1.2	ECDHE-RSA-AES128-GCM-SHA256	Miss	HTTP/2.0	-	-

Obviously this library can't be expected to deal with every such header format, but I think this could be solved fairly simply by adding two props to options: a prefix to identify such header lines, and a function (maybe called something like 'parseHeaders'). The idea is to let a custom function have a look at the first n lines of the file, until the function figures out the headers and returns them.

When the library starts processing a file, it's in a "seeking header information" state and is not parsing data. Every line that starts with the specified prefix is passed (in its entirety) to the parseHeaders function. That function can either return a falsey value or a hash or array containing the headers. Once the headers are returned, the library moves to a "seeking data" state (but it continues passing any line identified by the prefix to this function, in case other such header lines appear after the one that contains the column headers).

In any case, the general idea of getting the column headers from the file is already there, but it's currently limited to looking for headers only on the first non-comment/non-blank row, and those headers must then follow the same format as the data with regard to delimiters, etc. This concept allows extensible support for more complex header formats, including cases where the column headers may not be itemized in the file at all but are instead identified by some sort of constant, etc.
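The two-state flow described above can be sketched in plain JavaScript. Note that `headerPrefix` and `parseHeaders` are hypothetical option names from this proposal, not part of the current PapaParse API:

```javascript
// Sketch of the proposed behaviour: stay in a "seeking headers" state,
// offering every prefixed line to parseHeaders, until it returns headers.
function extractHeaders(lines, { headerPrefix, parseHeaders }) {
  let headers = null;
  const dataLines = [];
  for (const line of lines) {
    if (line.startsWith(headerPrefix)) {
      // Prefixed lines keep going to the callback even after headers are
      // found, so later metadata lines are never mistaken for data.
      const result = parseHeaders(line);
      if (!headers && result) headers = result; // falsey => keep seeking
    } else if (headers) {
      dataLines.push(line); // "seeking data" state
    }
  }
  return { headers, dataLines };
}

// Usage against a CloudFront-style log like the one above:
const { headers, dataLines } = extractHeaders(
  ['#Version: 1.0', '#Fields: date time sc-bytes', '2019-01-04\t11:37:39\t1030'],
  {
    headerPrefix: '#',
    parseHeaders: (line) =>
      line.startsWith('#Fields:')
        ? line.slice('#Fields:'.length).trim().split(' ')
        : null, // '#Version: ...' is recognized but yields no headers
  }
);
// headers => ['date', 'time', 'sc-bytes']; dataLines holds the one data row
```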

I realize this is somewhat similar to #186, but in the case of processing a stream (i.e. in Node), you have no control over the chunk size and can therefore not rely on the entire header being in a single chunk. This concept works at the line level, after line-level parsing.

@dboskovic
Collaborator

I think what might work is some sort of findHeaders callback that provides the reader instance and allows a user to write some custom logic that would extract and return the header along with an index for where "data starts". This way we're not trying to add support for specific edge cases, but allowing end developers to add some logic of their own where needed.

@Eric24
Author

Eric24 commented Jan 6, 2019

Interesting. I definitely agree that this should be as "generic" as possible, so it can support anything at the start of a file, even if it's only to ignore it. With what you describe, at what point would the custom logic get access to the file data and in what format? Currently, I'm handling this case with a separate transform function between the readstream and the parser. It's a little gross, since it must "accumulate" chunk data until it has gotten enough of the file to get to the column headers.
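The accumulate-until-headers workaround described above might look roughly like this (an illustrative sketch, not PapaParse API; the `#Fields:` pattern is specific to the log format in this thread). Because chunk boundaries are arbitrary, text is buffered until the last header line and its newline have fully arrived, then everything after it passes through:

```javascript
// Returns a function that swallows input until the "#Fields:" header
// line is complete, then passes all subsequent text through untouched.
function makeHeaderSkipper() {
  let buffer = '';
  let done = false;
  return function onChunk(chunk) {
    if (done) return chunk; // headers already consumed: pass through
    buffer += chunk;
    const m = buffer.match(/^#Fields:[^\n]*\n/m);
    if (!m) return ''; // header line may be split across chunks: keep buffering
    done = true;
    return buffer.slice(m.index + m[0].length); // emit data after the header
  };
}

// The header line is deliberately split across two chunks here:
const skip = makeHeaderSkipper();
let out = '';
for (const chunk of ['#Version: 1.0\n#Fie', 'lds: a b c\nrow1\n', 'row2\n']) {
  out += skip(chunk);
}
// out === 'row1\nrow2\n'
```

In a real Node pipeline this logic would live in a `stream.Transform` placed between the read stream and the parser, which is essentially the "gross" arrangement described above.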

By the way, a headers property on the parser could be an alternative solution. For headerless files, it would allow the user to statically specify headers. Potentially, both features could be combined with the existing header property (false=no headers; true=get headers from first line of file; array=static list of headers; function=get headers with findHeaders function). Everything else works as-is, including transformHeader, which if supplied, would still get the individual headers, regardless of where they came from.
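The combined `header` option proposed above could dispatch on the value's type. This is a sketch of the hypothetical semantics only (the current PapaParse `header` option accepts just a boolean):

```javascript
// Hypothetical dispatch: false = no headers, true = take them from the
// first line, array = static list, function = findHeaders-style callback.
function resolveHeaders(headerOption, firstLine, delimiter = ',') {
  if (headerOption === false) return null;
  if (headerOption === true) return firstLine.split(delimiter);
  if (Array.isArray(headerOption)) return headerOption;
  if (typeof headerOption === 'function') return headerOption(firstLine);
  throw new TypeError('unsupported header option');
}

// resolveHeaders(true, 'a,b,c')        => ['a', 'b', 'c']
// resolveHeaders(['x', 'y'], 'n/a')    => ['x', 'y']  (headerless file)
// resolveHeaders(false, 'a,b,c')       => null        (no headers at all)
```

Whatever this returns would then still flow through `transformHeader` one header at a time, as described above.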

@dboskovic
Collaborator

Yeah, that's a legitimate point. It'd have to be a function that gets the raw FileReader; otherwise I think it could get corrupted pretty easily by other parsing options.

#609 suggests the headers option for the unparse functionality. Might make sense to consider it for parse as well.

@Eric24
Author

Eric24 commented Jan 23, 2019

Another thought on this topic: Perhaps a more generic way to handle this, and open up some other possibilities as well, is to have the option for a function that simply gets a "look" at each line as it's received, but before any other parsing is done. Something like transformHeader, but transformLine. The function can then either return the line as-is as a string (which is essentially the default behavior), return a manipulated version of it as a string, return null (to ignore it), or explicitly set the headers by returning an array. This would solve the "headers in strange formats" issue, as well as allowing "whole line" manipulation for other purposes. Support for headers as an array is still needed (both for headerless files and for the transformLine function), but it would eliminate the need for anything like the findHeaders function and make the whole thing more generic.
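The `transformLine` return-value contract proposed above can be sketched as follows (`transformLine` is a hypothetical option name, not part of the current API):

```javascript
// Apply a transformLine callback to raw lines before any parsing:
// a string keeps/rewrites the line, null drops it, an array sets headers.
function applyTransformLine(rawLines, transformLine) {
  let headers = null;
  const kept = [];
  for (const line of rawLines) {
    const result = transformLine(line);
    if (result === null) continue;   // ignore this line
    if (Array.isArray(result)) {     // explicitly set the headers
      headers = result;
      continue;
    }
    kept.push(result);               // pass the (possibly rewritten) line on
  }
  return { headers, lines: kept };
}

const { headers, lines } = applyTransformLine(
  ['#Version: 1.0', '#Fields: a b', '1\t2'],
  (line) => {
    if (line.startsWith('#Fields:')) return line.slice(8).trim().split(' ');
    if (line.startsWith('#')) return null; // drop other header lines
    return line;                           // default: pass through as-is
  }
);
// headers => ['a', 'b'], lines => ['1\t2']
```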

@theLAZYmd

3 years on, this issue is still salient

@pokoli
Collaborator

pokoli commented Nov 3, 2021

@theLAZYmd feel free to propose a solution for it

@theLAZYmd

I'll draft a PR, but I think it should be possible to provide a config.headerLines option and extend the config.transformHeader function with two more params: acc, the existing accumulated header for that column, and j, the row number. The function then uses those to generate the new header as it iterates over the header rows in a reducer fashion.

[two screenshot attachments showing the proposed implementation]
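The reducer-style signature proposed here could work roughly like this (a sketch of the proposal, not existing API; the two-row header data is invented for illustration):

```javascript
// Two header rows that should fold into one header per column:
const headerRows = [
  ['Price', 'Price', 'Volume'], // row 0: group names
  ['Bid', 'Ask', 'Total'],      // row 1: sub-names
];

// Proposed signature: (header, index, acc, rowNumber). acc is the header
// accumulated so far for this column; j is the header row number.
const transformHeader = (header, index, acc, j) =>
  j === 0 ? header : `${acc} ${header}`;

// Fold the header rows, reducer-style, into the final headers:
let headers = [];
headerRows.forEach((row, j) => {
  headers = row.map((h, i) => transformHeader(h, i, headers[i], j));
});
// headers => ['Price Bid', 'Price Ask', 'Volume Total']
```

With only one header row, `j` is always 0 and the callback behaves like today's `transformHeader(header, index)`, which is why the extra trailing parameters are non-breaking.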

@theLAZYmd

I'm happy to write the code and make the PR, but can someone make further commits to the PR to update the documentation?
Because of the parameter order, this should also be non-breaking, but if we wanted to change the parameter order to make it more conventional ([header, index, acc, rowNumber] is a bit unusual), that would be a breaking change.

@pokoli
Collaborator

pokoli commented Nov 3, 2021

@theLAZYmd I do not mind breaking the API; we just publish a new major version and it is sorted.
But I really do care about the documentation, so please include documentation in the PR if you are going to work on it.
We also require tests to ensure the behaviour is right.

I would also like to see a valid use case for it (I do not want to add more complexity without a reason).
The example above should be handled by activating the comments flag.
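PapaParse's `comments` option skips lines that begin with a given string, e.g. `Papa.parse(csv, { comments: '#', delimiter: '\t' })`. Its effect on the log format above can be illustrated without the library (a minimal stand-in, not how PapaParse is implemented):

```javascript
// What the comments flag does, in effect: prefixed lines never reach
// the row parser at all.
const csv = '#Version: 1.0\n#Fields: date time\n2019-01-04\t11:37:39\n';
const rows = csv
  .split('\n')
  .filter((line) => line && !line.startsWith('#')) // drop comment lines
  .map((line) => line.split('\t'));                // parse remaining rows
// rows => [['2019-01-04', '11:37:39']]
```

Note that the column names inside the `#Fields:` line are skipped along with everything else, which is exactly why this thread asks for a way to recover or supply headers separately.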

@theLAZYmd

Ok. My judgement is that the non-breaking formulation of the function is inoffensive enough to use that it is not worth breaking the API just to get the 'ideal' parameter order.
Understood about documentation; I'm editing that now for the PR.
I wrote one test, which passes (the same number as the existing transformHeader tests). No other tests are broken.

@theLAZYmd

theLAZYmd commented Nov 3, 2021

Use case: it's fairly common for publishers of data (especially financial data) to use multiple lines as their headers, usually because the CSV data is exported from an Excel file. Please see the attached sheet, containing data from ICAP, as an example.
Adding support for transforming multiple lines into a consistent header, using a custom parser, is basically:

  • All the functionality one could reasonably request of this feature
  • Something that would set this package apart from other parsers

Attachment: icap-graph-data-19-03-2019.csv

theLAZYmd added a commit to kayaadvisory/PapaParse that referenced this issue Nov 3, 2021
@theLAZYmd

Proposed in #898 :)
Hope to see it implemented soon! If so, we also need to update https://github.com/DefinitelyTyped/DefinitelyTyped/tree/master/types/papaparse (how are these generated by the way?)

@theLAZYmd

Just to mention, I've started using this patch in my production code with quite pleasing results!
