How to stream when reading CSV #20
I had another look at this and have exposed a Scala `Iterator`:

```scala
scala> new java.io.FileReader("abil.csv")
res0: java.io.FileReader = java.io.FileReader@a9046

scala> CsvParser[String,Int,Int,Int].iterator(res0,delimiter="\t",hasHeader=true)
res1: Iterator[(String, Int, Int, Int)] = non-empty iterator

scala> case class Q(d:String,o:Int,c:Int,h:Int)
defined class Q

scala> val f = Q.apply _ tupled
<console>:15: warning: postfix operator tupled should be enabled
by making the implicit value scala.language.postfixOps visible.
This can be achieved by adding the import clause 'import scala.language.postfixOps'
or by setting the compiler option -language:postfixOps.
See the Scala docs for value scala.language.postfixOps for a discussion
why the feature should be explicitly enabled.
       val f = Q.apply _ tupled
                         ^
f: ((String, Int, Int, Int)) => Q = <function1>

scala> res1.map(f(_)).toList
res2: List[Q] = List(Q(30-APR-12,3885,3922,3859), Q(02-MAY-12,3880,3915,3857), Q(03-MAY-12,3920,3948,3874), Q(04-MAY-12,3909,3952,3885), Q(07-MAY-12,3853,3900,3825), Q(08-MAY-12,3770,3851,3755), Q(09-MAY-12,3700,3782,3666), Q(10-MAY-12,3732,3745,3658), Q(11-MAY-12,3760,3765,3703), Q(14-MAY-12,3660,3750,3655), Q(15-MAY-12,3650,3685,3627), Q(16-MAY-12,3661,3663,3555), Q(17-MAY-12,3620,3690,3600), Q(18-MAY-12,3545,3595,3542), Q(21-MAY-12,3602,3608,3546), Q(22-MAY-12,3650,3675,3615), Q(23-MAY-12,3566,3655,3566), Q(24-MAY-12,3632,3645,3586), Q(25-MAY-12,3610,3665,3583), Q(28-MAY-12,3591,3647,3582), Q(29-MAY-12,3629,3630,3575), Q(30-MAY-12,3593,3625,3565), Q(31-MAY-12,3632,3632,3568), Q(01-JUN-12,3585,3617,3552), Q(04-JUN-12,3632,3649,3533), Q(05-JUN-12,3620,3661,3609), Q(06-JUN-12,3676,3676,...
```
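As an aside, the postfix-operator warning in the session above can be avoided by calling `tupled` with explicit method syntax. A minimal self-contained sketch (standalone `Q`, hypothetical stand-in rows, no dependency on `CsvParser`):

```scala
// Self-contained sketch: the same tuple-to-case-class conversion as above,
// but `(Q.apply _).tupled` needs no scala.language.postfixOps import.
case class Q(d: String, o: Int, c: Int, h: Int)

object TupledDemo {
  val f: ((String, Int, Int, Int)) => Q = (Q.apply _).tupled

  def main(args: Array[String]): Unit = {
    // Stand-in rows; in the thread these come from CsvParser's iterator.
    val rows = Iterator(("30-APR-12", 3885, 3922, 3859), ("02-MAY-12", 3880, 3915, 3857))
    println(rows.map(f).toList)
  }
}
```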
@marklister this looks awesome! We are going to check this out and see if this approach will work. I was hoping there would be a way forward so that we could use the excellent type conversion and generally clean interface of your library, but process lazily! :)
@ms-tg, let me know how it goes.... The […]

And you can always use […]
@marklister: this is fantastic! :) I have spiked our code to process a large (~2M-line) CSV, and using your iterator with toStream works absolutely perfectly. No memory spike. We are currently building with an sbt dependency on your specific git commit on GitHub as we explore this. Would you want to see any potential pull requests if we want to change anything in a fork, or are you planning to refine these changes and merge them back into your own master first? Again, thanks, and regards. /cc @flicken @waseemtaj
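For readers without the library at hand, the lazy line-by-line pattern being praised here can be sketched with only the standard library. Names below are illustrative, not the product-collections API:

```scala
import scala.io.Source

// Illustrative stdlib-only sketch (not the product-collections API):
// parse tab-separated rows lazily, one line at a time, so only the
// current line needs to be resident in memory.
object LazyTsv {
  def parse(lines: Iterator[String]): Iterator[(String, Int, Int, Int)] =
    lines.map { line =>
      val Array(d, o, c, h) = line.split('\t')
      (d, o.toInt, c.toInt, h.toInt)
    }

  // The caller is responsible for closing the Source when done.
  def fromFile(path: String): Iterator[(String, Int, Int, Int)] =
    parse(Source.fromFile(path).getLines())
}
```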
It just needs some tests and then I'll merge the CsvWriter branch back to master. Product-collections is such a small project that I don't need to get carried away with quality control. I'd appreciate any feedback you have, either as comments or pull requests. I expect it to be merged into master by Monday at the latest. I can push a release fairly soon if I'm happy the edge cases are covered. You might help speed that along with feedback...
No problem, we will give feedback as we go, and submit PRs where it makes sense. Would you accept conversion typeclasses for BigDecimal, and optional ones for Joda's DateTime and LocalDate? -Marc
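A `BigDecimal` conversion typeclass of the kind proposed here could look roughly as follows. The trait name and shape are assumptions for illustration; product-collections' actual converter trait may differ:

```scala
// Hypothetical converter typeclass; the real trait in product-collections
// may have a different name and signature.
trait CellConverter[A] { def convert(s: String): A }

object CellConverter {
  // A BigDecimal instance of the sort the comment above proposes.
  implicit val bigDecimalConverter: CellConverter[BigDecimal] =
    new CellConverter[BigDecimal] { def convert(s: String): BigDecimal = BigDecimal(s) }
}

object Cells {
  // Resolve the converter implicitly and apply it to one cell.
  def parseCell[A](s: String)(implicit c: CellConverter[A]): A = c.convert(s)
}
```

Joda `DateTime`/`LocalDate` instances would follow the same pattern, wrapping a `DateTimeFormatter` inside `convert`.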
Sounds good.
Hi @marklister -- could you perhaps publish a release that has these features, that we could depend on? Thanks :)
I can publish it. It'll be v 1.01. But I'd appreciate some feedback on the resolution to #21 -- is it sufficient? Does it work OK? #20 (this one) -- are we happy to close it? And maybe some eyes on #22 -- nothing untoward introduced? If you can give me some reassurance I'll pop out a release tomorrow GMT+2...
Or maybe 1.1 -- I guess it is a feature release.
Basically, we haven't seen any bugs yet with any of those issues. We are currently struggling with "GC overhead limit exceeded" errors when streaming over very large CSV files, and we are trying to determine the source, but I doubt it's in this library -- it's more likely that our usage of Streams is retaining references and making them uncollectable. We're still investigating that.
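The head-retention pitfall suspected here is easy to reproduce with Scala 2's `Stream` (renamed `LazyList` in 2.13): a `val` pointing at the head keeps every forced cell reachable, while consuming the stream without naming it lets cells be collected. A minimal sketch, with illustrative names:

```scala
object StreamRetention {
  // Anti-pattern: a field like `retained` pins the stream's head, so after
  // a full traversal every forced cell stays reachable; on multi-million-row
  // input this is the kind of thing that produces "GC overhead limit exceeded".
  // val retained: Stream[Row] = hugeCsvStream

  // Safer: build and consume the stream in one expression so the head is
  // never stored anywhere that outlives the traversal.
  def sumRows(n: Int): Long =
    (1 to n).toStream.foldLeft(0L)((acc, x) => acc + x)
}
```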
It's low risk, and the feature I'm worried about is labeled experimental. I'll look at this before I go to bed.
@ms-tg, |
How can we use this library without reading the entire CSV into memory at once?