batteries' input_line is slower than the stdlib's one #520

Open
UnixJunkie opened this Issue · 12 comments

6 participants

@UnixJunkie
ocaml-batteries-team member

I can read a 160041-line file ~4.9 times
faster using Legacy.input_line...

@c-cube
ocaml-batteries-team member

I think we need many more benchmarks...

@UnixJunkie
ocaml-batteries-team member

The culprits look like BatIO and Enum, which add some wrapping around each call to
read_line.

@agarwal
ocaml-batteries-team member
@hcarty

@agarwal I think you missed the link in your comment.

@agarwal
ocaml-batteries-team member
@c-cube
ocaml-batteries-team member

I looked at the code, and the source of slowness is probably twofold:

  • the overhead of Enum combinators (including Enum.suffix_action that is only useful for the last element, but still wraps every call to Enum.next)
  • assuming @UnixJunkie used BatIO.lines_of (a very convenient combinator indeed), it just calls read_line repeatedly instead of, say, allocating one buffer and re-using it
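To make the first suspect concrete, here is a minimal sketch using only the OCaml standard library, not the actual Batteries code. The names `gen`, `lines_gen`, `with_suffix_action` and `count_gen` are hypothetical, and `End_of_file` stands in for whatever exception the real Enum uses to signal exhaustion. It contrasts a direct `input_line` loop with a closure-based "next" function wrapped by a suffix-action-style combinator, where the wrapper runs on every call even though the action fires only once.

```ocaml
(* Direct loop: one call to input_line per line, no intermediate closures. *)
let count_lines_direct (ic : in_channel) : int =
  let n = ref 0 in
  (try
     while true do
       ignore (input_line ic);
       incr n
     done
   with End_of_file -> ());
  !n

(* Hypothetical stand-in for an enumeration: a "next" function that
   raises End_of_file when exhausted. *)
type 'a gen = unit -> 'a

let lines_gen (ic : in_channel) : string gen = fun () -> input_line ic

(* Wrapper in the spirit of a suffix action: the action runs once, at the
   end, but the extra closure call and exception handler are paid on
   every element. *)
let with_suffix_action (action : unit -> unit) (next : 'a gen) : 'a gen =
  fun () ->
    try next ()
    with End_of_file ->
      action ();
      raise End_of_file

(* Drain a generator and count its elements. *)
let count_gen (next : 'a gen) : int =
  let n = ref 0 in
  (try
     while true do
       ignore (next ());
       incr n
     done
   with End_of_file -> ());
  !n
```

Each combinator stacked on top of `lines_gen` adds one more indirect call per line, which is the kind of per-element overhead suspected above. Whether it accounts for the measured slowdown would still need a benchmark.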
@UnixJunkie
ocaml-batteries-team member

I used File.lines_of, but yes, at some point it calls BatIO.lines_of.

@UnixJunkie
ocaml-batteries-team member

I think I should close this (since I created it): this performance regression compared to the stdlib
is too old; it was introduced too long ago.

@gasche
ocaml-batteries-team member

I don't think there is anything wrong with fixing old issues. c-cube looks interested in the performance aspect of BatIO (I personally tend to loathe IO-related stuff, so I should apologize for happily staying away from the discussion). It would help to have your actual benchmark code, though -- I would guess that most of the causes we suspect for this regression will turn out not to matter that much in a realistic workflow, with one actual suspect being guilty by a large margin.

Thank you for looking at this!

@c-cube
ocaml-batteries-team member

Yet another argument for putting IO in a separate library, imho: people may want to use batteries in combination with libraries that use the standard input and output channels, and know what happens underneath (IO contains some bloat, like weak sets and whatnot).

@rgrinberg
ocaml-batteries-team member

Yep. I'm also pretty sure that IO brings in Unix. Separating the core of batteries from Unix is one of the main goals of refactoring.

@UnixJunkie
ocaml-batteries-team member

Implementing a wc -l in ocaml with batteries (File.lines_of) is enough to see the problem.
The same program using (Legacy.open_in, Legacy.input_line, Legacy.close_in) will be faster I bet.
Here is my version trying to avoid batteries' IO:

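open Batteries
open Printf

(* My_utils is not part of batteries; it provides unfold_exc. *)
module MU = My_utils

(* Open fn with the stdlib channel API, apply f, then close the channel. *)
let with_in_file fn f =
  let input = Legacy.open_in fn in
  let res = f input in
  Legacy.close_in input;
  res

let main () =
  let nb_lines = ref 0 in
  let _all_lines =
    with_in_file Sys.argv.(1) (fun input ->
      (* Read lines until input_line raises (End_of_file at the end of
         the file), counting them as we go. *)
      let res, _eof =
        MU.unfold_exc
          (fun () -> let l = Legacy.input_line input in
                     incr nb_lines;
                     l
          )
      in
      res
    )
  in
  printf "init %d lines\n" !nb_lines

let () = main ()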

MU.unfold_exc is the new constructor I am pushing for in BatList.
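
For readers without My_utils at hand, here is a minimal sketch of what such a function could look like. It is inferred only from how it is used above: it is called with a `unit -> 'a` function and returns a pair that is destructured as `res, _eof`. The signature and behaviour are therefore assumptions, not the definition proposed for BatList. It needs OCaml 4.02 or later for `match ... with exception`.

```ocaml
(* Hypothetical sketch: call f repeatedly, collecting its results in
   order, until it raises; return the collected list together with the
   exception that ended the loop. *)
let unfold_exc (f : unit -> 'a) : 'a list * exn =
  let rec loop acc =
    match f () with
    | x -> loop (x :: acc)                 (* tail call: safe on long inputs *)
    | exception e -> (List.rev acc, e)
  in
  loop []
```

With the stdlib, reading every line of an already opened channel `ic` would then be `fst (unfold_exc (fun () -> input_line ic))`, and the second component would be `End_of_file` on a normal end of file.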
