Add `In_channel.input_lines` and `In_channel.fold_lines` #11843
Very much in favour of these functions. They had been considered for the initial PR that introduced `In_channel` and `Out_channel`, but it was decided to leave new functions for a later PR, which we now have :)
A `Changes` entry is needed, as well as a second approval.
I don't know where you got that from but that's wrong on almost all accounts.
Point 2. is a very annoying behaviour which I would suggest not to have. It's better for IIRC this is actually the reason why they were not introduced in the initial PR. People were not comfortable with having |
Indeed. Personally, I think this behaviour is correct for most line-oriented programs (I don't remember ever having been bitten by it). But in any case, I see this as an issue with
This is right, I misremembered the discussion. Thanks for the clarification. |
Well I have… That's the case where most but not all leads to difficult to track bugs and lots of programmer time lost. People will happily tab complete Take python's fileinput for example. It also does not report a final empty line but the twist is that it reports the lines with the line endings. This looks like the right interface since you usually Please add tools to the stdlib not more footguns. |
I fixed my comment about
If I understand @dbuenzli's terminology correctly, the "final empty line" in question is the empty sequence of characters that sits between a newline at end of file and the end of file itself. Yes, I don't want to see this nothingness materialized as an extra element in the list, it's just useless. Also, I find the terminology needlessly confusing: for me, an empty line is something that my text editor displays as an empty line, namely
This is a non-goal for me. We're parsing line-oriented input. It can be expected to have newlines at the end of every line. If it misses one on the last line, it's probably a mistake but it causes no harm to pretend there was one.
That ship has sailed a long time ago. It was a conscious decision of mine to not include line endings in the result of |
Line endings are badly named: they are new lines. Each new line introduces a new line, if there's nothing on the line before the next new line or the end of file, the line is empty. What is confusing here ? Do a
Certainly if you like new line business churn. In my opinion, very much like text versus binary mode input, these "smart" things add more problems than they solve (because they make assumptions that do not hold in practice). Again, if you are crunching simple line oriented data you will likely Being able to recreate the file you read line by line may be a non-goal for you, but this is a stdlib, why not have that nice functionality in?
Empty lines are quite often significant. For an example, look no further than the markdown language we're using for this spirited exchange, which is kind of designed to be parsed line by line (because end-of-line has a different meaning than white space), and treats empty lines as paragraph breaks. Likewise for (La)TeX. I'm afraid we'll have to agree to disagree. I really want to be able to write
and not have to write
neither
|
(Or maybe we want a quick-and-dirty way to be robust to comments in the middle, and we |
You'll get extra points on the exam for a clever use of
and nothing more clever than that. |
Not really. In fact I'm rather suggesting to keep the newlines at the end of lines (see the python API I linked to), rather than report a final empty line. This has also other benefits if you process files and want to keep their newline convention.
Exactly! So don't pretend they don't exist. Don't pretend the "\n" file and the empty file are equivalent. Don't try to tell me what is (un)important in my data. Yes, I did get puzzling integrity checksum failures because of your smart
I'm afraid the real world is in general a bit more messy than that and if you care about user experience you will likely write anyways:
    In_channel.input_lines ic |> List.map (fun l -> int_of_string (String.trim l))
which, if you include the new lines at the end of lines, is totally equivalent. If you really think what you propose is so good then maybe we could rather have:
    val In_channel.input_lines : with_newlines:bool -> in_channel -> string list
your program still looks good:
    In_channel.input_lines ~with_newlines:false ic |> List.map int_of_string
and we get something nice and useful in the stdlib, not just something for "solving programming puzzles".
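As a rough sketch of the compromise proposed in this comment, assuming a hypothetical `input_lines ~with_newlines` that is not part of the stdlib (the labelled argument and the helpers `split_keeping_newlines` and `chop_newline` are illustrative names only), one could keep each line's terminator and let callers opt out. For simplicity the sketch reads the whole channel with `In_channel.input_all` (OCaml 4.14+) instead of streaming:

```ocaml
(* Hypothetical helper: split [s] after each '\n', keeping the '\n'.
   Concatenating the result gives back [s] exactly. *)
let split_keeping_newlines s =
  let n = String.length s in
  let rec go start acc =
    if start >= n then List.rev acc
    else
      match String.index_from_opt s start '\n' with
      | Some i -> go (i + 1) (String.sub s start (i - start + 1) :: acc)
      | None -> List.rev (String.sub s start (n - start) :: acc)
  in
  go 0 []

(* Hypothetical helper: drop one trailing '\n' if present. *)
let chop_newline l =
  let n = String.length l in
  if n > 0 && l.[n - 1] = '\n' then String.sub l 0 (n - 1) else l

(* Hypothetical variant of the function under discussion; this is an
   assumption for illustration, not the signature that was merged. *)
let input_lines ~with_newlines ic =
  let lines = split_keeping_newlines (In_channel.input_all ic) in
  if with_newlines then lines else List.map chop_newline lines
```

With terminators kept, `String.concat "" lines` reproduces the input byte for byte, and `String.trim` still lets the integer-parsing pipeline above work unchanged.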
Another way to compromise would be to have (I have no opinion on which is the right default and I don't want to get hit by stray bullets here, but I would point out that I very rarely encounter user complaints about the current behavior of |
I rather think that users rarely complain about it because users have learned, as with text mode input, to refrain from using it. A quick GitHub search shows that I'm not the only one, see for example here in which you will learn that semgrep actually has lints for both |
So to sum up if one wanted to bring new line sanity for the next generation of OCaml programmers I think we should first have an input line function that:
Then use that function to implement the functions suggested by this PR. That way we do not perpetuate the frustrating and error prone system we have now. |
@dbuenzli : you've stated your points multiple times; I stated mine ( |
stdlib/in_channel.mli (outdated)
    @since 5.1 *)

    val fold_lines : ('a -> string -> 'a) -> 'a -> t -> 'a
Should be `'acc` for consistency with #11858
Done!
stdlib/in_channel.mli (outdated)
    in the style of a fold. More precisely, [fold_lines f init c] is
    [List.fold_left f init (In_channel.input_lines ic)],
Suggested change:
    - in the style of a fold. More precisely, [fold_lines f init c] is
    + in the style of a fold. More precisely, [fold_lines f init ic] is
      [List.fold_left f init (In_channel.input_lines ic)],
More precisely, though no more accurately. The interleaving of side effects is different: `fold_lines` works on infinite streams and allows writing (a contrived form of) interactive programs.
This could be documented using similar language as `Seq.fold_left`.
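One way to observe the difference in interleaving is to interrupt the fold part-way through. The following is a minimal sketch, assuming OCaml 5.1+ and that `fold_lines` reads one line at a time before calling its function argument; the `Stop` exception, `step`, and the two `next_after_*` helpers are illustrative names, not part of the PR:

```ocaml
exception Stop

(* Count lines, but abort as soon as the line "stop" is seen. *)
let step count line = if line = "stop" then raise Stop else count + 1

(* Streaming fold: only the lines up to and including "stop" have been
   consumed when the exception escapes. *)
let next_after_streaming ic =
  (try ignore (In_channel.fold_lines step 0 ic) with Stop -> ());
  In_channel.input_line ic

(* List-based fold: input_lines drains the channel before step ever runs. *)
let next_after_list ic =
  (try ignore (List.fold_left step 0 (In_channel.input_lines ic))
   with Stop -> ());
  In_channel.input_line ic
```

On a channel containing `"a\nstop\nb\n"`, `next_after_streaming` returns `Some "b"` while `next_after_list` returns `None`, since in the second case the whole input was read before the fold started.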
Interesting point. I tried to explain a little better, but I'm still not 100% happy. On the other hand I'm not 100% happy with the explanation of `Seq.fold_left` either. I'd rather not mention the possibility of infinite I/O channels...
I was surprised to see that python's I personally think removing the newlines is more intuitive. The result of reading the wrong OS file type could be that you end up with |
Ugh, it's really quite hot in here, but I'll risk pointing out that my routine advice for cross-platform programming has for many years been to avoid |
I'm (considerably) more for having these functions in the stdlib, even with `input_line`'s corner-cases, than not having them at all!
We have reached two maintainer approvals so I think that this is going to be merged. @xavierleroy there is a Changes entry conflict, could you rebase? I don't think that a consensus was reached on the debate on the virtues and vices of |
Please do that. But I'll just notice that we have a dev team member saying:
and yet we choose to continue to build on top of it. I'm always happy to defend the stdlib but this is a bit beyond me. For anyone interested in doing the right things here it is. |
@dra27, as said dev team member, what would you recommend doing with this PR? Edit: I realize that it is a silly question as you just approved the PR. |
(Note regarding "a dev team member said": personally I don't necessarily attach more value to the opinion of people who happen to be maintainers than to the one of external contributors, I try to judge by expertise on the specific question at hand. (I can make better guesses on the expertise of people I know better, and may be more conservative otherwise; and there may be an implicit bias in favor of people I work with on a regular basis.) On this topic, I would consider both Xavier and David and Daniel as experts -- I'm not -- and carefully consider their feedback.) |
This was just to draw attention that not only do external people and linters (I linked to) steer away from At that point I'm not even asking for an Some people are reconsidering OCaml because we know why. Do we really want them (and their end-users) to go through the painful and frustrating process of learning to steer away from I always liked the stance of "do not provide primitives that allow to write inefficient code" in this project (e.g. (This will be my last message to this discussion unless explicitly pinged back to interact) |
More drama. Great. I understand that @dra27 and perhaps @dbuenzli are now suggesting that First, I think this suggestion makes sense (unlike the one that includes the line terminators in the result of Second, I don't understand why this suggestion had to be made using grandiose or vaguely-threatening words, e.g. "avoid input_line and like the proverbial plague", or "some people are reconsidering OCaml because we know why" (I don't know why). Can we keep it professional and effective, please? Third, this PR is about adding |
@xavierleroy I've made a decision to merge the PR once you rebase it so that I can click on the button. Or you click. I'm still interested in seeing if we can offer alternative functions that would satisfy @dbuenzli and @dra27 (but I probably won't have the energy to implement them, and @dbuenzli may be pouty after the merge so he won't, so we are left with @dra27). Let's discuss it separately. |
Co-authored-by: wiktorkuchta <35867657+wiktorkuchta@users.noreply.github.com>
Move `input_lines` and `fold_lines` to the appropriate sections.
ad6a704 to 1808fa4
Rebased, updated the docstrings to fit the new sectioning in stdlib/in_channel.mli, squashed, and merged.
This PR adds a function `input_lines` to module `In_channel` that reads lines from an input channel until end of file, and returns the lines as a list of strings. In my experience, it's very useful for writing scripts or solving programming puzzles. Compare for example
with the standard recursive solution
or the standard iterative solution
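A minimal sketch of the three styles being compared, written here as an assumption about their general shape rather than as the PR's own examples. It reads one integer per line; the names `read_ints`, `read_ints_rec`, and `read_ints_iter` are illustrative, and `In_channel.input_lines` requires OCaml 5.1+:

```ocaml
(* With the new function: read every line, then convert. *)
let read_ints ic =
  In_channel.input_lines ic |> List.map int_of_string

(* Recursive version. Not tail-recursive, so it uses stack space
   proportional to the number of lines. *)
let rec read_ints_rec ic =
  match In_channel.input_line ic with
  | None -> []
  | Some line ->
    let n = int_of_string line in
    n :: read_ints_rec ic

(* Iterative version: loop until End_of_file, accumulating in reverse. *)
let read_ints_iter ic =
  let acc = ref [] in
  (try
     while true do
       acc := int_of_string (input_line ic) :: !acc
     done
   with End_of_file -> ());
  List.rev !acc
```

All three return the same list on well-formed input; the difference is how much plumbing the caller has to write.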
To facilitate reviewing, the rest of this PR is written in Q&A style
Q: What if my input file is really big? Isn't there a risk of running out of memory?
A: Text files in line-oriented formats are rarely "really big" in the sense of "several gigabytes long". But that's the reason why, as suggested by @Octachron, this PR also adds the `In_channel.fold_lines` function, to support streaming computations on line-oriented files. For example:
Q: Isn't this equivalent to `String.split_on_char '\n' (In_channel.input_all ic)` and therefore not needed in the stdlib?
A: It's not equivalent because the solution based on `String.split_on_char` adds an extra string at the end of the list, which is empty if the input is well formed (all lines terminated by a newline), and this is not something I want to work with. ~~removes empty lines, while `In_channel.input_lines` carefully preserves them. (They are very often significant.)~~ Also, this alternate solution uses at least twice as much memory (at the point where `String.split_on_char` returns and the big string built by `input_all` can finally be freed). Finally, if the input channel is not a file but e.g. a pipe, `input_all` incurs some overheads in space and time, and reading line by line is probably more efficient.
Q: Why is the result a `string list` and not an `array list`?
A: It's naturally built as a list, since the number of lines is not known in advance. For applications that need an array of lines, just pipe into `Array.of_list`.
Q: Why not a `string Seq.t`? This would save much space by reading lines on demand!
A: As far as I know, we don't want to do lazy I/O because it's too error-prone, e.g. it's all too easy to close the input channel before everything has been read.
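Two of the answers above can be checked with a short sketch, assuming OCaml 5.1+. The first part compares `String.split_on_char` with `In_channel.input_lines`; the second shows the lazy-I/O pitfall with a hypothetical `lines_seq` reader. All helper names (`read_from_string`, `lines_via_split`, `lines_seq`, `read_lazily_wrong`, `read_lazily_ok`) are illustrative and not part of the PR:

```ocaml
(* Illustrative helper: write [contents] to a temporary file and apply
   [read] to a channel opened on it. *)
let read_from_string read contents =
  let path = Filename.temp_file "lines" ".txt" in
  Fun.protect ~finally:(fun () -> Sys.remove path) (fun () ->
    Out_channel.with_open_bin path (fun oc ->
      Out_channel.output_string oc contents);
    In_channel.with_open_bin path read)

(* Part 1: the two ways of getting a list of lines. *)
let lines_via_split ic = String.split_on_char '\n' (In_channel.input_all ic)
let lines_via_input_lines ic = In_channel.input_lines ic

(* Part 2: a hypothetical lazy reader; each line is read on demand. *)
let lines_seq ic : string Seq.t =
  Seq.of_dispenser (fun () -> In_channel.input_line ic)

(* Bug: the sequence escapes with_open_bin, so the channel is already
   closed when the first line is actually requested. *)
let read_lazily_wrong path =
  In_channel.with_open_bin path lines_seq |> List.of_seq

(* Fine: the sequence is fully consumed while the channel is still open. *)
let read_lazily_ok path =
  In_channel.with_open_bin path (fun ic -> List.of_seq (lines_seq ic))
```

On the input `"a\n\nb\n"`, `lines_via_split` yields `["a"; ""; "b"; ""]` and `lines_via_input_lines` yields `["a"; ""; "b"]`: both keep the empty line in the middle and differ only in the trailing element. Calling `read_lazily_wrong` raises `Sys_error`, because reading from a closed channel is an error, which is the kind of mistake the last answer warns about.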