
Alternative CSV reader #589

Open
Jolanrensen opened this issue Feb 13, 2024 · 6 comments · May be fixed by #903
Labels: csv (CSV / delim related issues), research (This requires a deeper dive to gather a better understanding)
@Jolanrensen (Collaborator)

Should be investigated: https://github.com/doyaaaaaken/kotlin-csv

@Jolanrensen Jolanrensen added the research This requires a deeper dive to gather a better understanding label Feb 13, 2024
@Jolanrensen Jolanrensen added this to the Backlog milestone Feb 13, 2024
@koperagen (Collaborator) commented Feb 13, 2024

I tried FastCSV and would like to use it on the JVM for performance: it's several times faster than the existing reader and beats pandas too.
I assume you're aiming for KMP, so that's a different matter. Just a note to keep in mind.

@devcrocod (Contributor)

Keep in mind that you can always write your own interface and hide the platform implementation later.
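The "common interface, hidden platform implementation" idea could be sketched roughly like this (all names here are hypothetical, not DataFrame's actual API): callers depend only on the interface, so the JVM backend can later be swapped (FastCSV, Deephaven, ...) without touching user code. In a KMP setup this would become an expect/actual pair.

```kotlin
// Sketch of hiding the platform CSV implementation behind a common
// interface. SimpleCsvReader and NaiveCsvReader are hypothetical names
// for illustration only.
interface SimpleCsvReader {
    fun readAll(text: String): List<List<String>>
}

// A naive JVM-side implementation standing in for a real backend
// (no quoting/escaping support; just enough to show the seam).
class NaiveCsvReader : SimpleCsvReader {
    override fun readAll(text: String): List<List<String>> =
        text.lineSequence()
            .filter { it.isNotEmpty() }
            .map { it.split(",") }
            .toList()
}

fun main() {
    // User code only sees the interface; the backend is swappable.
    val reader: SimpleCsvReader = NaiveCsvReader()
    println(reader.readAll("a,b\n1,2")) // [[a, b], [1, 2]]
}
```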

@Jolanrensen Jolanrensen added the csv CSV / delim related issues label Aug 20, 2024
@Jolanrensen Jolanrensen mentioned this issue Aug 20, 2024
@Jolanrensen (Collaborator, Author) commented Sep 2, 2024

I've been experimenting with different implementations to find the fastest one in combination with DataFrame.

Each test has two versions of the implementation:

  • The default version first loads the entire CSV into memory. This is usually the fastest for smaller CSVs, since columns can be allocated with exactly the right capacity right away. However, it runs into memory issues more quickly for larger CSV files.
  • That's why each test is accompanied by a "sequential" version. This version uses data collectors to stream the CSV rows into separate string columns directly. The downside is that we don't know the right capacity up front, so the ArrayLists need to grow as rows come in, but we never materialize a full List<SomeCsvRowClass>, saving memory in the long run :)
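The "sequential" strategy described above could look roughly like this (a concept sketch with hypothetical names, not DataFrame's actual implementation): rows are streamed one at a time and each field is appended to its own growable column, so no full row list is ever kept.

```kotlin
// Minimal sketch of the "sequential" collection strategy: stream rows,
// append each field to its own growable string column. The final row
// count is unknown up front, so the ArrayLists grow as rows come in.
// collectColumns is a hypothetical name for illustration only.
fun collectColumns(
    header: List<String>,
    rows: Sequence<List<String>>,
): Map<String, List<String>> {
    // One ArrayList per column.
    val columns = header.map { ArrayList<String>() }
    for (row in rows) {
        row.forEachIndexed { i, field -> columns[i].add(field) }
    }
    return header.zip(columns).toMap()
}

fun main() {
    val header = listOf("name", "age")
    val rows = sequenceOf(listOf("Alice", "23"), listOf("Bob", "31"))
    val cols = collectColumns(header, rows)
    println(cols["name"]) // [Alice, Bob]
    println(cols["age"])  // [23, 31]
}
```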

We test:

Small CSV: 65.4 kB
(ops/s: Higher score is better)
[benchmark chart]

(s/op: Lower score is better)
[benchmark chart]

Large CSV: 857.7 MB
(ops/s: Higher score is better)
[benchmark chart]

(s/op: Lower score is better)
[benchmark chart]

@Jolanrensen (Collaborator, Author)

I now added Deephaven-csv:

(s/op: Lower is better)

Benchmark                                    Mode  Cnt   Score    Error  Units
CsvBenchmark.apacheCsvReader                   ss   10   0.007 ±  0.003   s/op
CsvBenchmark.apacheCsvReaderSequential         ss   10   0.008 ±  0.003   s/op
CsvBenchmark.deephavenCsvReader                ss   10   0.009 ±  0.011   s/op
CsvBenchmark.fastCsvReader                     ss   10   0.004 ±  0.001   s/op
CsvBenchmark.fastCsvReaderSequential           ss   10   0.004 ±  0.002   s/op
CsvBenchmark.kotlinCsvReader                   ss   10   0.008 ±  0.001   s/op
CsvBenchmark.kotlinCsvReaderSequential         ss   10   0.007 ±  0.001   s/op
LargeCsvBenchmark.apacheCsvReader              ss    5  72.809 ± 16.879   s/op
LargeCsvBenchmark.apacheCsvReaderSequential    ss    5  46.433 ± 39.409   s/op
LargeCsvBenchmark.deephavenCsvReader           ss    5  16.640 ±  6.664   s/op
LargeCsvBenchmark.fastCsvReader                ss    5  59.848 ± 22.986   s/op
LargeCsvBenchmark.fastCsvReaderSequential      ss    5  40.747 ±  4.598   s/op
LargeCsvBenchmark.kotlinCsvReader              ss    5  80.383 ± 15.870   s/op
LargeCsvBenchmark.kotlinCsvReaderSequential    ss    5  68.547 ± 20.748   s/op

Note: The Deephaven integration might not be optimal yet:

  • It can parse values by type itself, but I haven't figured out how to write custom parsers for it yet, so parsing a string column currently requires parsing twice (or more).
  • Deephaven allows defining your own (typed and unboxed) data collector, which could give an immense boost in combination with Research: ColumnDataHolder/primitive arrays #712
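The typed, unboxed data-collector idea could be sketched like this (a concept sketch, NOT Deephaven's actual Sink API): parsed values land directly in a primitive array that grows geometrically, avoiding per-value boxing entirely.

```kotlin
// Concept sketch of a typed, unboxed data collector. Unlike an
// ArrayList<Int>, which boxes every value as an Integer object, this
// appends straight into a primitive IntArray that doubles when full.
// IntColumnCollector is a hypothetical name for illustration only.
class IntColumnCollector(initialCapacity: Int = 16) {
    private var data = IntArray(initialCapacity)
    private var size = 0

    fun add(value: Int) {
        // Amortized O(1) growth: double the backing array when full.
        if (size == data.size) data = data.copyOf(maxOf(1, data.size * 2))
        data[size++] = value
    }

    // Trim to the exact size once all rows have been read.
    fun toArray(): IntArray = data.copyOf(size)
}

fun main() {
    val col = IntColumnCollector()
    listOf("1", "2", "3").forEach { col.add(it.toInt()) }
    println(col.toArray().toList()) // [1, 2, 3]
}
```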

@Jolanrensen Jolanrensen modified the milestones: Backlog, 0.15.0 Sep 4, 2024
@Jolanrensen (Collaborator, Author)

Combining Deephaven with #712 is very promising.
Reading the large CSV on the ColumnDataHolder branch with properly set up Deephaven reading yields the following results:
[benchmark charts]

Doing the same on the master branch yields:
[benchmark charts]

Both in terms of memory and performance, there's something to gain from using Deephaven and primitive arrays, at least when it comes to reading CSVs :)

@Jolanrensen (Collaborator, Author)

Deephaven with normal ArrayLists (which support nulls this time) and new parsers:

[benchmark chart]

@Jolanrensen Jolanrensen self-assigned this Sep 30, 2024
@Jolanrensen Jolanrensen linked a pull request Nov 1, 2024 that will close this issue