
How to stream when reading CSV #20

Closed
ms-tg opened this issue Oct 30, 2014 · 15 comments


ms-tg commented Oct 30, 2014

How can we use this library without reading the entire CSV into memory at once?

marklister (Owner) commented

That's sadly out of scope for this project. product-collections is primarily about CollSeq, and the CSV I/O is really a convenience rather than a focus. Not returning a CollSeq wouldn't make sense for this project. But what you're asking is possible. Here's how I'd go about it:

Clone the project and build it. Take a look at the file CSVParser1-22.scala in the target/src_managed directory. You'll see how the various CsvParsers work, based on implicit String converters.

You'll probably need to swap out opencsv for another parser, or perhaps a home-grown line-by-line one.

Return an Iterator or a Stream or something.

If a library like this existed I'd use it as a dependency (but still return a CollSeq).
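To make that concrete, here's a rough sketch of the shape I mean. The Converter typeclass and readRows helper are hypothetical names, and the naive split doesn't handle quoted fields the way opencsv does:

import scala.io.Source

// Hypothetical converter typeclass; the generated CsvParsers use
// implicit String converters along these lines.
trait Converter[A] { def convert(s: String): A }
object Converter {
  implicit val str: Converter[String] = new Converter[String] { def convert(s: String) = s }
  implicit val int: Converter[Int]    = new Converter[Int]    { def convert(s: String) = s.trim.toInt }
}

// Parse a delimited file lazily, one line at a time.
// Note: the Source is left open here; a real version would manage closing it.
def readRows[A, B](path: String, delimiter: String = ",")
                  (implicit ca: Converter[A], cb: Converter[B]): Iterator[(A, B)] =
  Source.fromFile(path).getLines().map { line =>
    val f = line.split(delimiter, -1)
    (ca.convert(f(0)), cb.convert(f(1)))
  }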

marklister (Owner) commented

I had another look at this and have exposed a Scala Iterator in CsvParser. This means you can use product-collections I/O with your own data structures. This work is in the CsvWriter branch for now and needs a little tidying up. It returns an Iterator[TupleN] instead of an Iterator[ProductN] because using tupled as a postfix operator -- necessary to instantiate case classes -- seems to demand it (it might be a Scala bug).

scala> new java.io.FileReader("abil.csv")
res0: java.io.FileReader = java.io.FileReader@a9046

scala> CsvParser[String,Int,Int,Int].iterator(res0,delimiter="\t",hasHeader=true)
res1: Iterator[(String, Int, Int, Int)] = non-empty iterator

scala> case class Q(d:String,o:Int,c:Int,h:Int)
defined class Q

scala> val f= Q.apply _ tupled
<console>:15: warning: postfix operator tupled should be enabled
by making the implicit value scala.language.postfixOps visible.
This can be achieved by adding the import clause 'import scala.language.postfixOps'
or by setting the compiler option -language:postfixOps.
See the Scala docs for value scala.language.postfixOps for a discussion
why the feature should be explicitly enabled.
       val f= Q.apply _ tupled
                        ^
f: ((String, Int, Int, Int)) => Q = <function1>

scala> res1.map(f(_)).toList
res2: List[Q] = List(Q(30-APR-12,3885,3922,3859), Q(02-MAY-12,3880,3915,3857), Q(03-MAY-12,3920,3948,3874), Q(04-MAY-12,3909,3952,3885), Q(07-MAY-12,3853,3900,3825), Q(08-MAY-12,3770,3851,3755), Q(09-MAY-12,3700,3782,3666), Q(10-MAY-12,3732,3745,3658), Q(11-MAY-12,3760,3765,3703), Q(14-MAY-12,3660,3750,3655), Q(15-MAY-12,3650,3685,3627), Q(16-MAY-12,3661,3663,3555), Q(17-MAY-12,3620,3690,3600), Q(18-MAY-12,3545,3595,3542), Q(21-MAY-12,3602,3608,3546), Q(22-MAY-12,3650,3675,3615), Q(23-MAY-12,3566,3655,3566), Q(24-MAY-12,3632,3645,3586), Q(25-MAY-12,3610,3665,3583), Q(28-MAY-12,3591,3647,3582), Q(29-MAY-12,3629,3630,3575), Q(30-MAY-12,3593,3625,3565), Q(31-MAY-12,3632,3632,3568), Q(01-JUN-12,3585,3617,3552), Q(04-JUN-12,3632,3649,3533), Q(05-JUN-12,3620,3661,3609), Q(06-JUN-12,3676,3676,...
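Incidentally, the postfix warning above can be avoided with ordinary dot notation, no language import needed:

scala> val f = (Q.apply _).tupled
f: ((String, Int, Int, Int)) => Q = <function1>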


ms-tg commented Oct 31, 2014

@marklister this looks awesome! We are going to check this out and see if this approach will work. I was hoping there would be a way forward so that we could use the excellent type conversion and generally clean interface of your library, but process lazily! :)

marklister (Owner) commented

@ms-tg, let me know how it goes.... The toList above is just so we can see inside the Iterator.

marklister (Owner) commented

And you can always use toStream if you want a Stream.
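For example, reusing the session above:

scala> val rows = CsvParser[String,Int,Int,Int].iterator(new java.io.FileReader("abil.csv"), delimiter="\t", hasHeader=true).toStream
rows: Stream[(String, Int, Int, Int)] = Stream((30-APR-12,3885,3922,3859), ?)

Just bear in mind that a Stream memoizes what it has already produced, so holding on to rows keeps every row reachable.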


ms-tg commented Oct 31, 2014

@marklister: this is fantastic! :)

I have spiked our code to process a large (~2m line) CSV, and using your iterator with toStream works absolutely perfectly. No memory spike.

We're currently building with an sbt dependency on your specific git commit on GitHub as we explore this. Would you want to see any potential pull requests if we want to change anything in a fork, or are you planning to refine these changes and merge them back into your own master first?
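For reference, the sbt wiring we mean looks roughly like this (the commit SHA is elided here):

// build.sbt -- depend on a specific commit of the GitHub repo
lazy val productCollections = RootProject(
  uri("https://github.com/marklister/product-collections.git#<commit-sha>")
)

lazy val root = (project in file(".")).dependsOn(productCollections)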

Again, thanks, and regards,
-Marc

/cc @flicken @waseemtaj

marklister (Owner) commented

It just needs some tests and then I'll merge the CsvWriter branch back to master. Product-collections is such a small project I don't need to get carried away with quality control. I'd appreciate any feedback you have, either as comments or pull requests. I expect it to be merged into master by Monday at the latest.

I can push a release fairly soon if I'm happy the edge cases are covered. You might help speed that along with feedback...


ms-tg commented Oct 31, 2014

No problem, we will send feedback as we go, and submit PRs where it makes sense.

Would you accept conversion typeclasses for BigDecimal, and optional ones for Joda's DateTime and LocalDate?

-Marc



marklister (Owner) commented

conversion typeclasses for BigDecimal, and optional ones for Joda's DateTime

Sounds good.
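Something like this, I imagine -- a sketch assuming a GeneralConverter-style typeclass (the actual converter trait in product-collections may be shaped differently), with the Joda instances kept in an optional module so the core gains no Joda dependency:

import org.joda.time.{DateTime, LocalDate}
import org.joda.time.format.DateTimeFormat

// Stand-in for the library's converter typeclass.
class GeneralConverter[A](val convert: String => A)

implicit val bigDecimalConverter: GeneralConverter[BigDecimal] =
  new GeneralConverter(s => BigDecimal(s.trim))

// The patterns are examples chosen to match the sample data above;
// real instances would likely let the caller supply a format.
implicit val localDateConverter: GeneralConverter[LocalDate] =
  new GeneralConverter(s => LocalDate.parse(s.trim, DateTimeFormat.forPattern("dd-MMM-yy")))

implicit val dateTimeConverter: GeneralConverter[DateTime] =
  new GeneralConverter(s => DateTime.parse(s.trim, DateTimeFormat.forPattern("dd-MMM-yy HH:mm")))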

marklister mentioned this issue Nov 1, 2014

ms-tg commented Nov 5, 2014

Hi @marklister -- could you perhaps publish a release that has these features, that we could depend on?

Thanks :)

marklister (Owner) commented

I can publish it. It'll be v1.01. But I'd appreciate some feedback on the resolution to #21 -- is it sufficient? Does it work OK? #20 (this one) -- are we happy to close it? And maybe some eyes on #22 -- nothing untoward introduced? If you can give me some reassurance I'll pop out a release tomorrow (GMT+2)...

marklister (Owner) commented

Or maybe 1.1 -- I guess it is a feature release.


ms-tg commented Nov 5, 2014

Basically, we haven't seen any bugs yet with any of those issues. We are currently struggling with "GC overhead limit exceeded" errors when streaming over very large CSV files, and we are trying to determine the source, but I doubt it's in this library -- it's more likely that our usage of Streams is causing some data to be uncollectable. We're still investigating that.
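For illustration, the kind of thing we suspect (process and the tuple shape are stand-ins):

def process(row: (String, Int, Int, Int)): Unit = ()

def consume(csvIterator: Iterator[(String, Int, Int, Int)]): Unit = {
  // Suspected anti-pattern: naming the Stream pins its head, so every
  // memoized cell stays reachable until consume returns -- on ~2m rows
  // that looks exactly like "GC overhead limit exceeded".
  // val rows = csvIterator.toStream
  // rows.foreach(process)

  // Safer: stay on the Iterator, or at least never hold the Stream's head.
  csvIterator.foreach(process)
}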

marklister (Owner) commented

It's low risk and the feature I'm worried about is labeled experimental. I'll look at this before I go to bed.

marklister (Owner) commented

@ms-tg,
I've pushed to Bintray -- I use them to forward to Maven Central. It can take some time.
