
[#296] Added chunked(long), chunked(Predicate<T>) #320

Closed
wants to merge 1 commit

Conversation

billoneil
Contributor

@billoneil billoneil commented Jul 25, 2017

#296
I have two of the implementations working and they should be truly lazy.

  • chunked(long chunkSize)
  • chunked(Predicate predicate)
  • chunked(Predicate predicate, boolean excludeDelimiter) ?
  • chunkedWithIndex(BiPredicate<T, Long> biPredicate)

I'm not happy with how the code turned out. The two iterators are a little dependent on each other. Maybe a better approach would be to make a top-level peeking iterator instead of hacking it like I did.
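For illustration, the chunked(long chunkSize) variant can be truly lazy by pulling at most chunkSize elements from the source per call to next(). A minimal sketch over a plain java.util.Iterator (names are illustrative, not the PR's actual code):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Illustrative sketch only: lazily groups a source iterator into chunks
// of up to chunkSize elements. The source is consumed only as chunks are
// requested, so it also works on unbounded sources.
final class ChunkedIterator<T> implements Iterator<List<T>> {
    private final Iterator<T> source;
    private final long chunkSize;

    ChunkedIterator(Iterator<T> source, long chunkSize) {
        if (chunkSize <= 0)
            throw new IllegalArgumentException("chunkSize must be positive");
        this.source = source;
        this.chunkSize = chunkSize;
    }

    @Override
    public boolean hasNext() {
        return source.hasNext();
    }

    @Override
    public List<T> next() {
        if (!source.hasNext())
            throw new NoSuchElementException();
        List<T> chunk = new ArrayList<>();
        for (long i = 0; i < chunkSize && source.hasNext(); i++)
            chunk.add(source.next());
        return chunk;
    }
}
```

The last chunk may be shorter than chunkSize, matching the usual semantics of size-based partitioning.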

@billoneil
Contributor Author

Refactored it a bit to clean it up. The peeking iterator seems to work. Guava's version special-cases remove(); I'm not sure that's necessary if it's not publicly exposed.

@lukaseder
Member

Thank you very much for your suggestion. I'll review shortly.

@billoneil
Contributor Author

I realized some naming is off: takeWhileIterator should probably be takeUntil.

/**
* Created by billoneil on 7/26/17.
*/
public class Iterators {
Member

Can you please make this class package-private

Contributor Author

I meant to make these package-private oops.


import java.util.Iterator;

public interface PeekingIterator<E> extends Iterator<E> {
Member

Same here. This is an internal type and shouldn't be public, in my opinion.

Member

In fact, I wonder if this needs to be an interface, given we only have a single implementation so far...
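A single concrete class would indeed suffice here; a minimal peeking wrapper (similar in spirit to Guava's PeekingIterator, but a hypothetical sketch rather than this PR's code) might look like:

```java
import java.util.Iterator;

// Minimal sketch of a peeking wrapper: one concrete class, no interface.
// peek() buffers one element without consuming it from the caller's view.
final class Peeking<E> implements Iterator<E> {
    private final Iterator<E> delegate;
    private E peeked;
    private boolean hasPeeked;

    Peeking(Iterator<E> delegate) {
        this.delegate = delegate;
    }

    E peek() {
        if (!hasPeeked) {
            peeked = delegate.next(); // throws NoSuchElementException if exhausted
            hasPeeked = true;
        }
        return peeked;
    }

    @Override
    public boolean hasNext() {
        return hasPeeked || delegate.hasNext();
    }

    @Override
    public E next() {
        if (hasPeeked) {
            hasPeeked = false;
            E result = peeked;
            peeked = null; // allow GC of the buffered element
            return result;
        }
        return delegate.next();
    }
}
```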

* <p>
* <code><pre>
* // ((1,1,2), (2), (3,4))
* Seq.of(1, 1, 2, 2, 3, 4).chunked(n -> n % 2 == 0)
Member

@lukaseder lukaseder Jul 28, 2017

Hmm, I think we should exclude the chunk boundaries by default if using a predicate, and include them only explicitly. When including, the question is whether the boundary should belong to:

  1. The ending chunk ((1, 1, 2), (2), (3, 4), ())
  2. The beginning chunk ((1, 1), (2), (2, 3), (4))
  3. Both chunks (?)
  4. No chunk

Given that there is no clear preference between 1/2 and maybe even 3, the default must be 4. So, an enum might be a reasonable additional argument for this, where "no chunk" would be the default.

Contributor Author

I was also thinking an enum would be the appropriate solution for inclusion / exclusion. It makes sense to work for all four.

> 2. The beginning chunk ((1, 1), (2), (2, 3), (4))

Should we assume the beginning chunk starts true? Or should this actually return ((2), (2, 3), (4)), skipping anything before the first chunk?

> 4. No chunk

Would this be excluding the matched element? ((1, 1), (), (3))

Member

> Should we assume the beginning chunk starts true? Or should this actually return ((2), (2, 3), (4)), skipping anything before the first chunk?

Egh... :) OK, so in the no chunk case, we get:

  • 1 chunk for 0 true evaluations of the predicate
  • 2 chunks for 1 true evaluation of the predicate
  • 3 chunks for 2 true evaluations of the predicate

For example:

// ((1, 1, 2, 2, 3, 4))
Seq.of(1, 1, 2, 2, 3, 4).chunked(i -> false);

// ((),(),(),(),(),(),())
Seq.of(1, 1, 2, 2, 3, 4).chunked(i -> true);

I think that options 1-3 should adhere to this logic. My example 1 was wrong, there should be a trailing empty chunk. Will fix the comment.

Member

Hence:

// ((1),(1),(2),(2),(3),(4),())
Seq.of(1, 1, 2, 2, 3, 4).chunked(i -> true, BEFORE);
// ((),(1),(1),(2),(2),(3),(4))
Seq.of(1, 1, 2, 2, 3, 4).chunked(i -> true, AFTER);
// ((1),(1, 1),(1, 2),(2, 2),(2, 3),(3, 4),(4))
Seq.of(1, 1, 2, 2, 3, 4).chunked(i -> true, BOTH);

Still a bit undecided about the BOTH part, although it could be useful and it would be consistent.
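These BEFORE/AFTER semantics can be pinned down with a small eager sketch over lists (the Boundary enum name is hypothetical, and a real implementation would be lazy, but the chunking rule is the same: n matches of the predicate yield n + 1 chunks):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Eager sketch to pin down the BEFORE/AFTER semantics discussed above.
// The enum name Boundary is hypothetical; the PR may choose another.
final class Chunker {
    enum Boundary { BEFORE, AFTER }

    static <T> List<List<T>> chunked(List<T> input, Predicate<T> isBoundary, Boundary mode) {
        List<List<T>> chunks = new ArrayList<>();
        List<T> current = new ArrayList<>();
        for (T t : input) {
            if (isBoundary.test(t)) {
                if (mode == Boundary.BEFORE) {
                    current.add(t);          // boundary ends the current chunk
                    chunks.add(current);
                    current = new ArrayList<>();
                } else {
                    chunks.add(current);     // boundary starts the next chunk
                    current = new ArrayList<>();
                    current.add(t);
                }
            } else {
                current.add(t);
            }
        }
        chunks.add(current); // n boundary matches yield n + 1 chunks
        return chunks;
    }
}
```

With i -> true and BEFORE, this produces the seven chunks ((1),(1),(2),(2),(3),(4),()) from the example above; AFTER produces the leading empty chunk instead.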

Contributor Author

We could always ignore BOTH for now and easily add it later if requested since it will be an enum.

Member

Yes, let's do that.

}

@Test(expected = IllegalArgumentException.class)
public void testCountingPredicateThrowsIllegalArg() {
Member

I see, this is probably the reason why you made the class public...

  1. If the tests are placed in the same package, you can access package-private classes.
  2. I'm not too much of a fan of testing internals with unit tests. The internals will be tested transitively by testing the chunked API. With tests on internals, refactoring those internals will be much harder...

Contributor Author

I think they are in the same package, so it should have been package-private. I must have missed it when I cleaned up the first implementation.

> I'm not too much of a fan of testing internals with unit tests. The internals will be tested transitively by testing the chunked API. With tests on internals, refactoring those internals will be much harder...

👍 Totally agree. I wasn't sure of your preference, so I opted for more testing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm all in for more testing, but high-level tests will never break because the API won't change again. Low-level tests are a bit of a maintenance burden.

return takeWhileIterator(peeking(iterator), predicate);
}

static <E> Iterator<E> takeWhileIterator(PeekingIterator<E> iterator, Predicate<E> predicate) {
Member

I have a feeling that we keep writing this iterator :) Perhaps, can this be refactored with existing code?

Contributor Author

I was trying to use Seq.take*, but it closed the stream after the predicate matched, and we need it to stay open. I will look into a better way to refactor this.

Member

I meant to say: Perhaps Seq.take should use this new iterator. In that case, we should open up several tickets and do things one step at a time...

@billoneil
Contributor Author

I will first look into making the TakeIterator and reusing it, then come back to this after. I'll make a new issue and PR once it's ready.

@lukaseder
Member

> I will first look into making the TakeIterator and reusing it, then come back to this after. I'll make a new issue and PR once it's ready.

Great, looking forward to this. Note: If it's too complex, we don't have to do it. I just thought if you do create new classes and API, it would be nice if they're reusable. Otherwise, I might prefer inlining them where they're really needed.

@lukaseder
Member

Btw, there's also some work by @tlinkowski pending review. He's implemented a SeqBuffer. It goes in a similar direction as this TakeIterator: #305

Maybe, have a look at that first, prior to delving into a refactoring...?

@billoneil
Contributor Author

Hmm, the SeqBuffer doesn't seem like it would work on an infinite-sized Seq; the iterator would. I'll explore some more over the weekend.

@billoneil
Contributor Author

My original use case for this is the ability to stream something like a very large file from S3 into a Seq, lazily batch N records, process each batch of N records, and stream the results to a new S3 file. Ideally it should work on a file of any size, so I have been working under the assumption of infinitely sized streams.
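That pipeline can be sketched independently of Seq: consume a potentially unbounded Stream in fixed-size batches and hand each batch to a processing callback. All names here are illustrative; process stands in for "write a batch to the new S3 file":

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.stream.Stream;

// Illustrative sketch: lazily batches a stream into lists of batchSize
// and hands each batch to a callback; memory use is bounded by one batch.
final class BatchProcessor {
    static <T> int forEachBatch(Stream<T> records, int batchSize, Consumer<List<T>> process) {
        List<T> batch = new ArrayList<>(batchSize);
        int batches = 0;
        Iterator<T> it = records.iterator();
        while (it.hasNext()) {
            batch.add(it.next());
            if (batch.size() == batchSize) {
                process.accept(List.copyOf(batch)); // hand off a full batch
                batch.clear();
                batches++;
            }
        }
        if (!batch.isEmpty()) {                     // trailing partial batch
            process.accept(List.copyOf(batch));
            batches++;
        }
        return batches;
    }
}
```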

@tlinkowski

I think that for this (#296) and for Seq.duplicate to work optimally, SeqBuffer would need to support the following things:

  • SeqBuffer.shutdown() or some other method - after calling this method no more calls to seq() would be legal (similarly to Java's ExecutorService.submit())
  • field SeqBuffer.activeSpliterators inside SeqBuffer that would store all created BufferSpliterators that haven't yet reached the end of iteration
  • field SeqBuffer.startIndex that would store the index corresponding to the first element in List<T> buffer
  • then, upon each call to SeqBuffer.BufferSpliterator.tryAdvance(), SeqBuffer would be notified by the BufferSpliterator, and if this SeqBuffer has been shut down, it would:
    - find minimal nextIndex among all activeSpliterators
    - remove all redundant elements from List<T> buffer
    - and use this minimal index as a new startIndex
  • additionally, List<T> buffer could be converted to LinkedList upon call to shutdown() to improve removal performance

It's not trivial but it seems doable. What do you think, @billoneil, @lukaseder?

@billoneil
Contributor Author

I see two use cases when splitting a Seq

  1. You know the bounds and want to be able to operate on both returned Seqs.
     • It would be totally reasonable to buffer the entire Seq (SeqBuffer) into memory and have logic so that both returned Seqs can be read concurrently (even if they are backed by a single Seq).
  2. You don't know the bounds and are splitting for some type of chunking.
     • In this case you would be forced to read both Seqs sequentially. If you try to read the second before the first is fully consumed, you can either throw an error or forward the first Seq all the way to completion.

They are both very valid but different use cases. I don't know the best way to express that in an API, since they have different contracts. Maybe allow flags like Seq.buffered vs. Seq.lazy as a second argument, or different methods for the different implementations? Or maybe the API just always picks one way.

One of the reasons I liked the iterator approach is that it's lazy by default, and if you want the values in memory you can always call .toList() on either or both Seqs. The issue is what happens if you call .toList() on the second before the first?

This also just reminded me of something when I first brought this up.

Maybe the chunked method should return Seq<List<E>>, because what happens if someone does the following?

Seq<Seq<Integer>> seqs = ...
seqs.map(s -> s.limit(5))

This would mess up sequential streams unless, under the hood, it ignored the limit and traversed the whole iterator.
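The difference can be illustrated with a plain iterator: if each chunk is materialised as a List before being handed out, truncating one chunk (the analogue of s.limit(5)) cannot desynchronise the shared source, because the source has already been advanced past the whole chunk. A hypothetical sketch:

```java
import java.util.Iterator;
import java.util.List;

// With chunks materialised as List<E>, truncating one chunk (the analogue
// of s.limit(5)) does not affect where the next chunk starts, because the
// shared source iterator was already advanced past the full chunk.
final class MaterialisedChunks {
    static List<Integer> demo() {
        Iterator<Integer> source = List.of(1, 2, 3, 4, 5, 6).iterator();
        // Materialise the first chunk of three elements eagerly.
        List<Integer> first = List.of(source.next(), source.next(), source.next());
        List<Integer> truncated = first.subList(0, 2); // "limit(2)" on the chunk
        // The next chunk still starts at 4, unaffected by the truncation.
        return List.of(truncated.size(), source.next());
    }
}
```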

@billoneil
Contributor Author

I'm going to hold off until this is discussed a bit more.

@billoneil
Contributor Author

@lukaseder @tlinkowski Either of you have any thoughts here?

@tlinkowski

@billoneil Well, I'm under the impression that the modified SeqBuffer I described above (supporting a shutdown method) would absolutely suffice to implement what you described under point no. 2. (BTW, you posted almost the same thing twice - you could delete your first post.)

What's more, the advantage of implementing chunked using such a modified SeqBuffer is that you can read the resulting Seqs in any order you like! The only penalty for reading them in a "strange" order are potential performance problems (e.g. if you start reading "from the end", SeqBuffer will have to buffer the entire source Seq into memory). However, if you read these returned "chunk" Seqs in order, SeqBuffer won't actually have to buffer anything! This is because SeqBuffer will know (because it will have been shut down) that the item you've just read from the source Seq won't be accessible from any other Seq. In such a scenario, of course, anyone can call anything on the resulting Seqs (e.g. the seq.limit(5) that you mentioned) and it doesn't break anything - all calls just forward to SeqBuffer.BufferSpliterator, which (together with SeqBuffer) takes care of what needs to be buffered and what can be discarded.

One more thought about the modified SeqBuffer implementation - to construct such chunked Seqs we couldn't just use SeqBuffer.seq().skip(n) because this would yield SeqBuffer.BufferSpliterators with nextIndex = 0, which in turn would lead to a situation where SeqBuffer cannot discard anything until the Spliterators of all the Seqs have been advanced. So we'd need to add some extra method (like SeqBuffer.seqFrom(n)) that would create a SeqBuffer.BufferSpliterator with nextIndex = n.

@billoneil
Contributor Author

> The only penalty for reading them in a "strange" order are potential performance problems (e.g. if you start reading "from the end", SeqBuffer will have to buffer the entire source Seq into memory)

This sounds reasonable as long as it's documented.

> However, if you read these returned "chunk" Seqs in order, SeqBuffer won't actually have to buffer anything!

I might have missed this part when I read the source the first time. This does indeed seem like it can cover all use cases.

> In such scenario, of course, anyone can call anything on the resulting Seqs (e.g. seq.limit(5) that you mentioned) and it doesn't break anything - all calls just forward to SeqBuffer.BufferSpliterator, which (together with SeqBuffer) takes care of what needs to be buffered and what can be discarded

I think this would also work and it just needs to be documented since it could be doing extra work under the hood than expected.

@tlinkowski

@billoneil You haven't missed anything in the source code :) I might have been imprecise, but please note that in my most recent comment I was referring to my previous comment, which contains only a proposal for adapting SeqBuffer to do what is needed to efficiently implement chunked. But it's just an idea - there's no implementation yet, although I'd gladly provide one if @lukaseder confirms that he likes the idea and is willing to have it implemented.

@lukaseder
Member

The PR got closed by GitHub due to a recent rename of the main branch.
