
How to stream when reading CSV #20

Closed
ms-tg opened this issue Oct 30, 2014 · 15 comments


ms-tg commented Oct 30, 2014

How can we use this library without reading the entire CSV into memory at once?

marklister (Owner) commented

That's sadly out of scope for this project. product-collections is primarily about CollSeq, and the CSV I/O is really a convenience rather than a focus. Not returning a CollSeq wouldn't make sense for this project. But what you're asking is possible. Here's how I'd go about it:

Clone the project and build it. Take a look at the file CSVParser1-22.scala in the target/src_managed directory. You'll see how the various CsvParsers work, based on implicit String converters.

You'll probably need to swap out opencsv for another parser, or perhaps a home-grown line-by-line one.

Return an Iterator or a Stream or something.

If a library like this existed I'd use it as a dependency (but still return a CollSeq).
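To make that concrete, here's a rough sketch of the shape I mean. The Converter typeclass and readRows helper are hypothetical names, and the naive split doesn't handle quoted fields the way opencsv does:

import scala.io.Source

// Hypothetical converter typeclass; the generated CsvParsers use
// implicit String converters along these lines.
trait Converter[A] { def convert(s: String): A }
object Converter {
  implicit val str: Converter[String] = new Converter[String] { def convert(s: String) = s }
  implicit val int: Converter[Int]    = new Converter[Int]    { def convert(s: String) = s.trim.toInt }
}

// Parse a delimited file lazily, one line at a time.
// Note: the Source is left open here; a real version would manage closing it.
def readRows[A, B](path: String, delimiter: String = ",")
                  (implicit ca: Converter[A], cb: Converter[B]): Iterator[(A, B)] =
  Source.fromFile(path).getLines().map { line =>
    val f = line.split(delimiter, -1)
    (ca.convert(f(0)), cb.convert(f(1)))
  }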

marklister (Owner) commented

I had another look at this and have exposed a Scala Iterator in CsvParser. This means you can use product-collections I/O with your own data structures. This work is in the CsvWriter branch for now and needs a little tidying up. It returns an Iterator[TupleN] instead of an Iterator[ProductN] because using tupled as a postfix operator -- necessary to instantiate case classes -- seems to demand it (it might be a Scala bug).

scala> new java.io.FileReader("abil.csv")
res0: java.io.FileReader = java.io.FileReader@a9046

scala> CsvParser[String,Int,Int,Int].iterator(res0,delimiter="\t",hasHeader=true)
res1: Iterator[(String, Int, Int, Int)] = non-empty iterator

scala> case class Q(d:String,o:Int,c:Int,h:Int)
defined class Q

scala> val f= Q.apply _ tupled
<console>:15: warning: postfix operator tupled should be enabled
by making the implicit value scala.language.postfixOps visible.
This can be achieved by adding the import clause 'import scala.language.postfixOps'
or by setting the compiler option -language:postfixOps.
See the Scala docs for value scala.language.postfixOps for a discussion
why the feature should be explicitly enabled.
       val f= Q.apply _ tupled
                        ^
f: ((String, Int, Int, Int)) => Q = <function1>

scala> res1.map(f(_)).toList
res2: List[Q] = List(Q(30-APR-12,3885,3922,3859), Q(02-MAY-12,3880,3915,3857), Q(03-MAY-12,3920,3948,3874), Q(04-MAY-12,3909,3952,3885), Q(07-MAY-12,3853,3900,3825), Q(08-MAY-12,3770,3851,3755), Q(09-MAY-12,3700,3782,3666), Q(10-MAY-12,3732,3745,3658), Q(11-MAY-12,3760,3765,3703), Q(14-MAY-12,3660,3750,3655), Q(15-MAY-12,3650,3685,3627), Q(16-MAY-12,3661,3663,3555), Q(17-MAY-12,3620,3690,3600), Q(18-MAY-12,3545,3595,3542), Q(21-MAY-12,3602,3608,3546), Q(22-MAY-12,3650,3675,3615), Q(23-MAY-12,3566,3655,3566), Q(24-MAY-12,3632,3645,3586), Q(25-MAY-12,3610,3665,3583), Q(28-MAY-12,3591,3647,3582), Q(29-MAY-12,3629,3630,3575), Q(30-MAY-12,3593,3625,3565), Q(31-MAY-12,3632,3632,3568), Q(01-JUN-12,3585,3617,3552), Q(04-JUN-12,3632,3649,3533), Q(05-JUN-12,3620,3661,3609), Q(06-JUN-12,3676,3676,...
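Incidentally, the postfix warning above can be avoided with ordinary dot notation, no language import needed:

scala> val f = (Q.apply _).tupled
f: ((String, Int, Int, Int)) => Q = <function1>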


ms-tg commented Oct 31, 2014

@marklister this looks awesome! We are going to check this out and see if this approach will work. I was hoping there would be a way forward so that we could use the excellent type conversion and generally clean interface of your library, but process lazily! :)

marklister (Owner) commented

@ms-tg, let me know how it goes.... The toList above is just so we can see inside the Iterator.

marklister (Owner) commented

And you can always use toStream if you want a Stream.
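For example, reusing the session above:

scala> val rows = CsvParser[String,Int,Int,Int].iterator(new java.io.FileReader("abil.csv"), delimiter="\t", hasHeader=true).toStream
rows: Stream[(String, Int, Int, Int)] = Stream((30-APR-12,3885,3922,3859), ?)

Just bear in mind that a Stream memoizes what it has already produced, so holding on to rows keeps every row reachable.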


ms-tg commented Oct 31, 2014

@marklister: this is fantastic! :)

I have spiked our code to process a large (~2m line) CSV, and using your iterator with toStream works absolutely perfectly. No memory spike.

We're currently building with an sbt dependency on your specific git commit on GitHub as we explore this. Would you want to see any potential pull requests if we want to change anything in a fork, or are you planning to refine these changes and merge them back into your own master first?
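For reference, the sbt wiring we mean looks roughly like this (the commit SHA is elided here):

// build.sbt -- depend on a specific commit of the GitHub repo
lazy val productCollections = RootProject(
  uri("https://github.com/marklister/product-collections.git#<commit-sha>")
)

lazy val root = (project in file(".")).dependsOn(productCollections)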

Again, thanks, and regards,
-Marc

/cc @flicken @waseemtaj

marklister (Owner) commented

It just needs some tests and then I'll merge the CsvWriter branch back to master. Product-collections is such a small project I don't need to get carried away with quality control. I'd appreciate any feedback you have, either as comments or pull requests. I expect it to be merged into master by Monday at the latest.

I can push a release fairly soon if I'm happy the edge cases are covered. You might help speed that along with feedback...


ms-tg commented Oct 31, 2014

No problem, we will send feedback as we go, and submit PRs where it makes sense.

Would you accept conversion typeclasses for BigDecimal, and optional ones for Joda's DateTime and LocalDate?

-Marc



marklister (Owner) commented

conversion typeclasses for BigDecimal, and optional ones for Joda's DateTime

Sounds good.
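Something like this, I imagine -- a sketch assuming a GeneralConverter-style typeclass (the actual converter trait in product-collections may be shaped differently), with the Joda instances kept in an optional module so the core gains no Joda dependency:

import org.joda.time.{DateTime, LocalDate}
import org.joda.time.format.DateTimeFormat

// Stand-in for the library's converter typeclass.
class GeneralConverter[A](val convert: String => A)

implicit val bigDecimalConverter: GeneralConverter[BigDecimal] =
  new GeneralConverter(s => BigDecimal(s.trim))

// The patterns are examples chosen to match the sample data above;
// real instances would likely let the caller supply a format.
implicit val localDateConverter: GeneralConverter[LocalDate] =
  new GeneralConverter(s => LocalDate.parse(s.trim, DateTimeFormat.forPattern("dd-MMM-yy")))

implicit val dateTimeConverter: GeneralConverter[DateTime] =
  new GeneralConverter(s => DateTime.parse(s.trim, DateTimeFormat.forPattern("dd-MMM-yy HH:mm")))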

marklister mentioned this issue Nov 1, 2014

ms-tg commented Nov 5, 2014

Hi @marklister -- could you perhaps publish a release that has these features, that we could depend on?

Thanks :)

marklister (Owner) commented

I can publish it. It'll be v1.01. But I'd appreciate some feedback on the resolution to #21 -- is it sufficient? Does it work OK? #20 (this one) -- are we happy to close it? And maybe some eyes on #22 -- nothing untoward introduced? If you can give me some reassurance I'll pop out a release tomorrow (GMT+2)...

marklister (Owner) commented

Or maybe 1.1 -- I guess it is a feature release.


ms-tg commented Nov 5, 2014

Basically, we haven't seen any bugs yet with any of those issues. We are currently struggling with "GC overhead limit exceeded" errors when streaming over very large CSV files, and we are trying to determine the source, but I doubt it's in this library -- it's more likely that our usage of Streams is causing some data to be uncollectable. We're still investigating that.
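For illustration, the kind of thing we suspect (process and the tuple shape are stand-ins):

def process(row: (String, Int, Int, Int)): Unit = ()

def consume(csvIterator: Iterator[(String, Int, Int, Int)]): Unit = {
  // Suspected anti-pattern: naming the Stream pins its head, so every
  // memoized cell stays reachable until consume returns -- on ~2m rows
  // that looks exactly like "GC overhead limit exceeded".
  // val rows = csvIterator.toStream
  // rows.foreach(process)

  // Safer: stay on the Iterator, or at least never hold the Stream's head.
  csvIterator.foreach(process)
}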

marklister (Owner) commented

It's low risk and the feature I'm worried about is labeled experimental. I'll look at this before I go to bed.

marklister (Owner) commented

@ms-tg,
I've pushed to Bintray -- I use them to forward to Maven Central. It can take some time.
