Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add In_channel.input_lines and In_channel.fold_lines #11843

Merged
merged 5 commits into from Jan 13, 2023

Conversation

xavierleroy
Copy link
Contributor

@xavierleroy xavierleroy commented Dec 30, 2022

This PR adds a function input_lines to module In_channel that reads lines from an input channel until end of file, and returns the lines as a list of strings. In my experience, it's very useful for writing scripts or solving programming puzzles. Compare for example

    In_channel.input_lines ic |> List.map int_of_string |> List.fold_left (+) 0

with the standard recursive solution

   let rec sum_lines ic accu =
      match input_line ic with
      | exception End_of_file -> accu
      | line -> sum_lines ic (accu + int_of_string line)

or the standard iterative solution

  let sum = ref 0 in
  begin try
    while true do
      let line = input_line ic in sum := !sum + int_of_string line
    done
  with End_of_file -> () end;
  !sum

To facilitate reviewing, the rest of this PR is written in Q&A style

Q: What if my input file is really big? Isn't there a risk of running out of memory?

A: Text files in line-oriented formats are rarely "really big" in the sense of "several gigabytes long". But that's the reason why, as suggested by @Octachron , this PR also adds the In_channel.fold_lines function, to support streaming computations on line-oriented files. For example:

   In_channel.fold_lines (fun accu line -> accu + int_of_string line) 0 ic

Q: Isn't this equivalent to String.split_on_char '\n' (In_channel.input_all ic) and therefore not needed in the stdlib?

A: It's not equivalent because the solution based on String.split_on_char adds an extra string at the end of the list, which is empty if the input is well formed (all lines terminated by a newline), and this is not something I want to work with. removes empty lines, while In_channel.input_lines carefully preserves them. (They are very often significant.) Also, this alternate solution uses at least twice as much memory (at the point where String.split_on_char returns and the big string built by input_all can finally be freed). Finally, if the input channel is not a file but e.g. a pipe, input_all incurs some overheads in space and time, and reading line per line is probably more efficient.

Q: Why is the result a string list and not an array list ?

A: It's naturally built as a list, since the number of lines is not known in advance. For applications that need an array of lines, just pipe into Array.of_list.

Q: Why not a string Seq.t ? This would save much space by reading lines on demand!

A: As far as I know, we don't want to do lazy I/O because it's too error-prone, e.g. it's all too easy to close the input channel before everything has been read.

Copy link
Contributor

@nojb nojb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very much in favour of these functions. They had been considered for the initial PR that introduced In_channel and Out_channel but it was decided to leave new functions for a later PR, which we now have :)

A Changes entry is needed, as well as a second approval.

@nojb nojb added the stdlib label Dec 30, 2022
@dbuenzli
Copy link
Contributor

It's not equivalent because the solution based on String.split_on_char removes empty lines, while In_channel.input_lines carefully preserves them.

I don't know where you got that from but that's wrong on almost all accounts.

  1. No line is removed by String.split_on_char.
  2. In_channel.input_lines as proposed carefully drops a final empty line.

Point 2. is a very annoying behaviour which I would suggest not to have. It's better for String.concat "\n" (In_channel.input_lines ic) to recreate the file you just read, just like with the result of String.split_on_char… Especially the function as such is useless to easily update a line-oriented file format without introducing end line churn.

IIRC this is actually the reason why they were not introduced in the initial PR. People were not comfortable with having input_line swallowing the final empty line while input_lines would not (personally I don't mind).

@nojb
Copy link
Contributor

nojb commented Dec 31, 2022

In_channel.input_lines as proposed carefully drops a final empty line.

Indeed. Personally, I think this behaviour is correct for most line-oriented programs (I don't remember ever having been bitten by it). But in any case, I see this as an issue with input_line itself, rather than with the functions of this PR.

IIRC this is actually the reason why they were not introduced in the initial PR. People were not comfortable with having input_line swallowing the final empty line while input_lines would not (personally I don't mind).

This is right, I misremembered the discussion. Thanks for the clarification.

@dbuenzli
Copy link
Contributor

Indeed. Personally, I think this behaviour is correct for most line-oriented programs (I don't remember ever having been bitten by it)

Well I have… That's the case where most but not all leads to difficult to track bugs and lots of programmer time lost.

People will happily tab complete input_lines without realizing that the function silently drops data. For such a basic function, making assumptions of what is right and silently dropping data is not a good idea®.

Take python's fileinput for example. It also does not report a final empty line but the twist is that it reports the lines with the line endings. This looks like the right interface since you usually String.strip your line anyways before say Int.of_string. In python's interface you have the convenience but not a single bit of data is lost. That in turn means that you can easily use such an input function to update a file with the line ending convention that was in the file you read.

Please add tools to the stdlib not more footguns.

@xavierleroy
Copy link
Contributor Author

I fixed my comment about String.split_on_char, I had a different "split" function in mind.

In_channel.input_lines as proposed carefully drops a final empty line.

If I understand correctly @dbuenzli's terminology, the "final empty line" in question is the empty sequence of characters that sits between a newline at end of file from end of file. Yes, I don't want to see this nothingness materialized as an extra element in the list, it's just useless.

Also, I find the yrtminology needlessly confusing: for me, an empty line is something that my text editor displays as an empty line, namely ^\n or \n\n in regexp notation, but definitely not \n$.

Point 2. is a very annoying behaviour which I would suggest not to have. It's better for String.concat "\n" (In_channel.input_lines ic) to recreate the file you just read, just like with the result of String.split_on_char Especially the function as such is useless to easily update a line-oriented file format without introducing end line churn.

This is a non-goal for me. We're parsing line-oriented input. It can be expected to have newlines at the end of every line. If it misses one on the last line, it's probably a mistake but it causes no harm to pretend there was one.
Recreating the file as is, with the mistake on the last line, is not a goal. Re-writing the file with the missing newline is an improvement. (Plus, if you want to copy a file as is, there are better ways than line-oriented input, e.g. In_channel.input or In_channel.input_all.)

it reports the lines with the line endings. This looks like the right interface since you usually String.strip your line anyways before say Int.of_string

That ship has sailed a long time ago. It was a conscious decision of mine to not include line endings in the result of input_line, in large part out of bad experience with Perl (where most read line operations must be followed by chop), and more recently with Python (which has some methods that keep the line endings and others that remove them). I still think the input_line behavior is the one that makes programmers' lives easier. The functions in this PR just continue the same choice.

@dbuenzli
Copy link
Contributor

dbuenzli commented Jan 2, 2023

Also, I find the yrtminology needlessly confusing: for me, an empty line is something that my text editor displays as an empty line, namely ^\n or \n\n in regexp notation, but definitely not \n$.

Line endings are badly named: they are new lines. Each new line introduces a new line, if there's nothing on the line before the next new line or the end of file, the line is empty. What is confusing here ?

Do a printf "\n" > /tmp/a, open this file in emacs, there are two empty lines to navigate here. Your input_lines reports only one. That is confusing.

I still think the input_line behavior is the one that makes programmers' lives easier.

Certainly if you like new line business churn. In my opinion, very much like text versus binary mode input, these "smart" things add more problems than they solve (because they make assumptions that do not hold in practice).

Again, if you are crunching simple line oriented data you will likely String.strip anyways for robustness (or split on whitespace) and most likely ignore empty lines as well wherever they happen.

Being able to recreate the file you read line by line maybe a non-goal for you, but this is a stdlib, why not have that nice functionality in ?

@xavierleroy
Copy link
Contributor Author

you will [...] most likely ignore empty lines as well wherever they happen.

Empty lines are quite often significant. For an example, look no further than the markdown language we're using for this spirited exchange, which is kind of designed to be parsed line by line (because end-of-line has a different meaning than white space), and treats empty lines as paragraph breaks. Likewise for (La)TeX.

I'm afraid we'll have to agree to disagree. I really want to be able to write

In_channel.input_lines ic |> List.map int_of_string |> List.fold_left (+) 0

and not have to write

In_channel.input_lines ic
|> List.filter (fun s -> s <> "")
|> List.map int_of_string
|> List.fold_left (+) 0

neither

In_channel.input_lines ic 
|> List.rev
|> (function "" :: l -> l  | l -> l)  (* get rid of the Buenzli-empty line at end *)
|> List.rev
|> List.map int_of_string
|> List.fold_left (+) 0

@gasche
Copy link
Member

gasche commented Jan 3, 2023

(Or maybe we want a quick-and-dirty way to be robust to comments in the middle, and we List.filter_map int_of_string_opt.)

@xavierleroy
Copy link
Contributor Author

xavierleroy commented Jan 3, 2023

(Or maybe we want a quick-and-dirty way to be robust to comments in the middle, and we List.filter_map int_of_string_opt.)

You'll get extra points on the exam for a clever use of int_of_string_opt. (On the other hand, you're really throwing input validation out the window.) At any rate, I'm still adamant that if I'm told the input consists of decimal integers, one per line, I must be able to read it with

In_channel.input_lines ic |> List.map int_of_string

and nothing more clever than that.

@dbuenzli
Copy link
Contributor

dbuenzli commented Jan 4, 2023

|> (function "" :: l -> l | l -> l) (* get rid of the Buenzli-empty line at end *)

Not really. In fact I'm rather suggesting to keep the newlines at the end of lines (see the python API I linked to), rather than report a final empty line. This has also other benefits if you process files and want to keep their newline convention.

Empty lines are quite often significant.

Exactly! So don't pretend they don't exist. Don't pretend the "\n" file and the empty file are equivalent. Don't try to tell me what is (un)important in my data. Yes, I did get puzzling integrity checksum failures because of your smart input_line.

At any rate, I'm still adamant that if I'm told the input consists of decimal integers, one per line, I must be able to read it with

In_channel.input_lines ic |> List.map int_of_string

I'm afraid the real world is in general a bit more messy than that and if you care about user experience you will likely write anyways:

In_channel.input_lines ic |> List.map (fun l -> int_of_string (String.trim l)) 

which if you include the new lines at the end of lines, is totally equivalent.

If you really think what you propose is so good then maybe we could rather have:

val In_channel.input_lines : with_newlines:bool -> in_channel -> string list

your program still looks good:

In_channel.input_lines ~with_newlines:false ic |> List.map int_of_string

and we get something nice and useful in the stdlib, not just something for "solving programming puzzles".

@gasche
Copy link
Member

gasche commented Jan 4, 2023

Another way to compromise would be to have input_full_line and In_channel.input_full_lines that preserve the newline markers at the end of lines. @dbuenzli believes that keeping the newline is the right default, so will not be fond of this solution where this option is out of the path of least effort, but at least the option will be easily available unlike now.

(I have no opinion on which is the right default and I don't want to get hit by stray bullets here, but I would point out that I very rarely encounter user complains about the current behavior of input_line, which would support the claim that it is okay most of the time.)

@dbuenzli
Copy link
Contributor

dbuenzli commented Jan 4, 2023

but I would point out that I very rarely encounter user complains about the current behavior of input_line,

I rather think that user rarely complain about it because, users have learned, like text mode input, to refrain from using it.

A quick github search shows that I'm not the only, see for example here in which you will learn that semgrep actually has lints for both open_in and input_line

@dbuenzli
Copy link
Contributor

dbuenzli commented Jan 4, 2023

So to sum up if one wanted to bring new line sanity for the next generation of OCaml programmers I think we should first have an input line function that:

  1. Inputs lines by recognizing any of \n, \r, \r\n as new lines regardless of channel mode (or at least in binary mode).
  2. Allows to include the recognized newlines as such in the line data (or simpler: always include it).

Then use that function to implement the functions suggested by this PR. That way we do not perpetuate the frustrating and error prone system we have now.

@xavierleroy
Copy link
Contributor Author

xavierleroy commented Jan 4, 2023

@dbuenzli : you've stated your points multiple times; I stated mine (input_line is fine as it is, this PR is consistent with it, and is genuinely useful); let's agree to disagree. Maybe you could stay silent for a while to let others review this PR, if that's not too much to ask?


@since 5.1 *)

val fold_lines : ('a -> string -> 'a) -> 'a -> t -> 'a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be 'acc for consistency with #11858

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Comment on lines 163 to 164
in the style of a fold. More precisely, [fold_lines f init c] is
[List.fold_left f init (In_channel.input_lines ic)],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
in the style of a fold. More precisely, [fold_lines f init c] is
[List.fold_left f init (In_channel.input_lines ic)],
in the style of a fold. More precisely, [fold_lines f init ic] is
[List.fold_left f init (In_channel.input_lines ic)],

More precisely, though no more accurately. The interleaving of side effects is different: fold_lines works on infinite streams and allows writing (a contrived form of) interactive programs.

This could be documented using similar language as Seq.fold_left.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting point. I tried to explain a little better, but I'm still not 100% happy. On the other hand I'm not 100% happy with the explanation of Seq.fold_left either. I'd rather not mention the possibility of infinite I/O channels...

@bluddy
Copy link
Contributor

bluddy commented Jan 4, 2023

I was surprised to see that python's File.read_lines() function does leave in the \n. At the same time, a quick search brings up issues with this function, as users find it unintuitive, and the solution provided is to just use File.read().splitlines() splitlines() being a string method.

I personally think removing the newlines is more intuitive. The result of reading the wrong OS file type could be that you end up with \rs but that's an unfortunate consequence of the OS split, and maybe we could provide a separate cross-platform solution for that.

@dra27
Copy link
Member

dra27 commented Jan 9, 2023

(I have no opinion on which is the right default and I don't want to get hit by stray bullets here, but I would point out that I very rarely encounter user complains about the current behavior of input_line, which would support the claim that it is okay most of the time.)

Ugh, it's really quite hot in here, but I'll risk pointing out that my routine advice for cross-platform programming has for many years been to avoid input_line and like the proverbial plague (it's both in the St Louis workshop talk I gave in 2018 and suggestions similar to @dbuenzli's for input_line is explicitly in #10109)

Copy link
Member

@dra27 dra27 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm (considerably) more for having these functions in the stdlib, even with input_line's corner-cases, than not having them at all!

stdlib/in_channel.mli Outdated Show resolved Hide resolved
@gasche
Copy link
Member

gasche commented Jan 12, 2023

We have reached two maintainer approvals so I think that this is going to be merged.

@xavierleroy there is a Changes entry conflict, could you rebase?

I don't think that a consensus was reached on the debate on the virtues and vices of input_line, but I think that it is reasonable to predict that input_line is not going to change its (default) behavior and that input_lines should reasonably follow what input_line is doing. If we later decide to provide an input_full_line function or somehow add an optional parameter to input_line (I think that input_full_line would be a fine idea, but I am not planning to work on this myself), we can reflect the change in the plural functions as well.

@dbuenzli
Copy link
Contributor

so I think that this is going to be merged.

Please do that. But I'll just notice that we have a dev team member saying:

my routine advice for cross-platform programming has for many years been to avoid input_line and like the proverbial plague

and yet we choose to continue to build on top of it. I'm always happy to defend the stdlib but this is a bit beyond me.

For anyone interested in doing the right things here it is.

@gasche
Copy link
Member

gasche commented Jan 12, 2023

But I'll just notice that we have a dev team member saying: [...]

@dra27, as said dev team member, what would you recommend doing with this PR?

Edit: I realize that it is a silly question as you just approved the PR.

@gasche
Copy link
Member

gasche commented Jan 12, 2023

(Note regarding "a dev team member said": personally I don't necessarily attach more value to the opinion of people who happen to be maintainers than to the one of external contributors, I try to judge by expertise on the specific question at hand. (I can make better guesses on the expertise of people I know better, and may be more conservative otherwise; and there may be an implicit bias in favor of people I work with on a regular basis.) On this topic, I would consider both Xavier and David and Daniel as experts -- I'm not -- and carefully consider their feedback.)

@dbuenzli
Copy link
Contributor

dbuenzli commented Jan 12, 2023

Note regarding "a dev team member said": personally I don't necessarily attach more value to the opinion of people who happen to be maintainers than to the one of external contributors

This was just to draw attention that not only do external people and linters (I linked to) steer away from input_line but also people with the nose deep into this project do.

At that point I'm not even asking for an input_line that preserves endlines or allows to distinguish an empty file from one made of an endline. But at least just for one that is able to parse the lines of a standardized CSV file on unix if you open your file in binary mode (applying String.strip to remove the spurious \r is out of the question here as you may eat field data in the first or last column).

Some people are reconsidering OCaml because we know why. Do we really want them (and their end-users) to go through the painful and frustrating process of learning to steer away from input_line and its derivatives ? If we really want the functions of this PR can't we (1) define an input_line' that is at least endline-convention agnostic in binary mode (2) suggest in general to use that one instead of input_line and (3) base these quick and handy helpers on the better behaved one ?

I always liked the stance of "do not provide primitives that allow to write inefficient code" in this project (e.g. String.explode is a no go). Can we also maybe add the stance "do not provide primitives that allow to write user-unfriendly programs" ?

(This will be my last message to this discussion unless explicitly pinged back to interact)

@xavierleroy
Copy link
Contributor Author

More drama. Great.

I understand that @dra27 and perhaps @dbuenzli are now suggesting that input_line should exclude from its result not only the final \n, but also \r\n, and perhaps even \r+\n. so that CRLF-based formats (many of which can be found in IETF RFCs) are easier to handle.

First, I think this suggestion makes sense (unlike the one that includes the line terminators in the result of input_line), and I'm quite open to implementing it.

Second, I don't understand why this suggestion had to be made using grandiose or vaguely-threatening words, e.g. "avoid input_line and like the proverbial plague", or "some people are reconsidering OCaml because we know why" (I don't know why). Can we keep it professional and effective, please?

Third, this PR is about adding input_lines, not changing input_line. Any improvement to input_line will propagate to input_lines. Yet most of the discussion has been on input_line-bashing. Again, I feel these PR reviews could be more effective if we were to discuss the PR and not our pet peeves with the stdlib.

@gasche
Copy link
Member

gasche commented Jan 13, 2023

@xavierleroy I've made a decision to merge the PR once you rebase it so that I can click on the button. Or you click.

I'm still interested in seeing if we can offer alternative functions that would satisfy @dbuenzli and @dra27 (but I probably won't have the energy to implement them, and @dbuenzli may be pouty after the merge so he won't, so we are left with @dra27). Let's discuss it separately.

xavierleroy and others added 4 commits January 13, 2023 15:54
Co-authored-by: wiktorkuchta <35867657+wiktorkuchta@users.noreply.github.com>
Move `input_lines` and `fold_lines` to the appropriate sections.
@xavierleroy
Copy link
Contributor Author

Re-based, updated the docstrings to fit the new sectioning in stdlib/in_channels.mli, squashed, and merged.

@xavierleroy xavierleroy deleted the input-lines branch January 13, 2023 17:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants