[#296] Added chunked(long), chunked(Predicate<T>) #320
Refactored it a bit to clean it up. The peeking iterator seems to work. Guava's version special-cases remove; I'm not sure that's necessary if it's not publicly exposed.
Thank you very much for your suggestion. I'll review shortly.
I realized some naming is off.
/**
 * Created by billoneil on 7/26/17.
 */
public class Iterators {
Can you please make this class package-private
Oops, I meant to make these package-private.
import java.util.Iterator;

public interface PeekingIterator<E> extends Iterator<E> {
Same here. This is an internal type and shouldn't be public, in my opinion.
In fact, I wonder if this needs to be an interface, given we only have a single implementation so far...
 * <p>
 * <code><pre>
 * // ((1,1,2), (2), (3,4))
 * Seq.of(1, 1, 2, 2, 3, 4).chunked(n -> n %2 == 0)
Hmm, I think we should exclude the chunk boundaries by default if using a predicate, and include them only explicitly. When including, the question is whether the boundary should belong to:
1. The ending chunk: ((1, 1, 2), (2), (3, 4), ())
2. The beginning chunk: ((1, 1), (2), (2, 3), (4))
3. Both chunks (?)
4. No chunk
Given that there is no clear preference between 1/2 and maybe even 3, the default must be 4. So, an enum might be a reasonable additional argument for this, where "no chunk" would be the default.
I was also thinking an enum would be the appropriate solution for inclusion / exclusion. It makes sense for it to support all four.
- The beginning chunk ((1, 1), (2), (2, 3), (4))
Should we assume the beginning chunk starts true? Or should this actually return ((2), (2, 3), (4)), skipping anything before the first chunk?
- No chunk
Would this be excluding the matched element? ((1, 1), (), (3))
Should we assume the beginning chunk starts true? or should this actually return ((2), (2, 3), (4)) skipping anything before the first chunk.
Egh... :) OK, so in the no chunk case, we get:
- 1 chunk for 0 true evaluations of the predicate
- 2 chunks for 1 true evaluation of the predicate
- 3 chunks for 2 true evaluations of the predicate
For example:
// ((1, 1, 2, 2, 3, 4))
Seq.of(1, 1, 2, 2, 3, 4).chunked(i -> false);
// ((),(),(),(),(),(),())
Seq.of(1, 1, 2, 2, 3, 4).chunked(i -> true);
I think that options 1-3 should adhere to this logic. My example 1 was wrong, there should be a trailing empty chunk. Will fix the comment.
Hence:
// ((1),(1),(2),(2),(3),(4),())
Seq.of(1, 1, 2, 2, 3, 4).chunked(i -> true, BEFORE);
// ((),(1),(1),(2),(2),(3),(4))
Seq.of(1, 1, 2, 2, 3, 4).chunked(i -> true, AFTER);
// ((1),(1, 1),(1, 2),(2, 2),(2, 3),(3, 4),(4))
Seq.of(1, 1, 2, 2, 3, 4).chunked(i -> true, BOTH);
Still a bit undecided about the BOTH part, although it could be useful and it would be consistent.
We could always ignore BOTH for now and easily add it later if requested, since it will be an enum.
Yes, let's do that.
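For reference, here is a minimal sketch of the boundary rules worked out above: N true evaluations of the predicate always yield N + 1 chunks, and the enum only decides where the matching element goes. It is an eager, list-based illustration, not the lazy implementation in this PR. The class name `ChunkBoundarySketch`, the `Boundary` enum, and its `NONE` constant are assumptions made for the example; only `BEFORE` and `AFTER` appear in the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

final class ChunkBoundarySketch {

    /** Hypothetical enum: where the element that matched the predicate ends up. */
    enum Boundary { NONE, BEFORE, AFTER }

    /** Eagerly splits the source; every match closes the current chunk and opens a new one. */
    static <T> List<List<T>> chunked(Iterable<T> source, Predicate<? super T> isBoundary, Boundary mode) {
        List<List<T>> chunks = new ArrayList<>();
        List<T> current = new ArrayList<>();
        for (T element : source) {
            if (isBoundary.test(element)) {
                if (mode == Boundary.BEFORE) {
                    current.add(element);      // boundary belongs to the ending chunk
                }
                chunks.add(current);
                current = new ArrayList<>();
                if (mode == Boundary.AFTER) {
                    current.add(element);      // boundary belongs to the beginning chunk
                }
            } else {
                current.add(element);
            }
        }
        chunks.add(current);                   // trailing chunk, possibly empty
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 1, 2, 2, 3, 4);
        System.out.println(chunked(input, n -> n % 2 == 0, Boundary.NONE));   // [[1, 1], [], [3], []]
        System.out.println(chunked(input, n -> n % 2 == 0, Boundary.BEFORE)); // [[1, 1, 2], [2], [3, 4], []]
        System.out.println(chunked(input, n -> n % 2 == 0, Boundary.AFTER));  // [[1, 1], [2], [2, 3], [4]]
    }
}
```

Leaving BOTH out keeps the sketch in line with the decision above; it could be added later as another enum constant without changing the existing cases.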
}

@Test(expected = IllegalArgumentException.class)
public void testCountingPredicateThrowsIllegalArg() {
I see, this is probably the reason why you made the class public...
- If the tests are placed in the same package, you can access package-private classes.
- I'm not a big fan of testing internals with unit tests. The internals will be tested transitively by testing the chunked API. With tests on internals, refactoring those internals will be much harder...
I think they are in the same package, so it should have been package-private. I must have missed it when I cleaned up the first implementation.
- I'm not a big fan of testing internals with unit tests. The internals will be tested transitively by testing the chunked API. With tests on internals, refactoring those internals will be much harder...
👍 Totally agree. I wasn't sure of your preference, so I opted for more testing.
I'm all in for more testing, but high level tests will never break because the API won't change again. Low level tests are a bit of a maintenance burden.
return takeWhileIterator(peeking(iterator), predicate);
}

static <E> Iterator<E> takeWhileIterator(PeekingIterator<E> iterator, Predicate<E> predicate) {
I have a feeling that we keep writing this iterator :) Perhaps, can this be refactored with existing code?
I was trying to use the Seq.take* methods, but they closed the stream after the predicate matched, and we need it to stay open. I will look into a better way to refactor this.
I meant to say: Perhaps Seq.take should use this new iterator. In that case, we should open up several tickets and do things one step at a time...
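To make the "stay open" requirement concrete, here is a rough sketch of a take-while iterator built on a one-element peek. The point it illustrates is that the first element failing the predicate is only peeked at, never consumed, so the underlying iterator can continue with the next chunk. The names `PeekingSketch`, `Peeking`, and `takeWhile` are hypothetical and are not the classes from this PR's diff.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

final class PeekingSketch {

    /** Hypothetical one-element look-ahead wrapper around any iterator. */
    static final class Peeking<E> implements Iterator<E> {
        private final Iterator<E> delegate;
        private boolean hasPeeked;
        private E peeked;

        Peeking(Iterator<E> delegate) {
            this.delegate = delegate;
        }

        /** Returns the next element without consuming it. */
        E peek() {
            if (!hasPeeked) {
                peeked = delegate.next(); // throws NoSuchElementException when exhausted
                hasPeeked = true;
            }
            return peeked;
        }

        @Override public boolean hasNext() {
            return hasPeeked || delegate.hasNext();
        }

        @Override public E next() {
            if (!hasPeeked) {
                return delegate.next();
            }
            E result = peeked;
            hasPeeked = false;
            peeked = null;
            return result;
        }
    }

    /** Yields elements while the predicate holds; the first failing element stays in the source. */
    static <E> Iterator<E> takeWhile(Peeking<E> source, Predicate<? super E> predicate) {
        return new Iterator<E>() {
            @Override public boolean hasNext() {
                // Note: the predicate may be evaluated more than once for the same element.
                return source.hasNext() && predicate.test(source.peek());
            }

            @Override public E next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return source.next();
            }
        };
    }

    public static void main(String[] args) {
        Peeking<Integer> source = new Peeking<>(Arrays.asList(1, 2, 3, 10, 11).iterator());
        Iterator<Integer> head = takeWhile(source, n -> n < 3);
        while (head.hasNext()) {
            System.out.println(head.next()); // prints 1, then 2
        }
        System.out.println(source.next());   // prints 3: the source was not advanced past it
    }
}
```

Because the predicate can run more than once per element in this sketch, a stateful predicate such as a counting one would need to cache its result.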
I will first look into making the TakeIterator and reusing it, then come back to this after. I'll make a new issue and PR once it's ready.
Great, looking forward to this. Note: If it's too complex, we don't have to do it. I just thought if you do create new classes and API, it would be nice if they're reusable. Otherwise, I might prefer inlining them where they're really needed.
Btw, there's also some work by @tlinkowski pending review. He's implemented a Maybe, have a look at that first, prior to delving into a refactoring...?
Hmm the |
My original use case for this would be the ability to Stream something like a very large file in S3 to a Seq, lazily batch N records, batch process N records, stream to a new S3 file. Ideally it should work on any sized file. So I have been working with the assumption of infinite sized streams. |
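As an illustration of that use case, the following sketch batches N records at a time from an arbitrary, possibly infinite, iterator. Only one batch is held in memory at once and nothing is read until a batch is requested. `BatchingSketch` and `batches` are hypothetical names, and the size is an `int` here (rather than the `long` in the PR title) purely to keep the example small.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.stream.Stream;

final class BatchingSketch {

    /** Lazily groups the source into lists of at most {@code size} elements. */
    static <T> Iterator<List<T>> batches(Iterator<T> source, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive: " + size);
        }
        return new Iterator<List<T>>() {
            @Override public boolean hasNext() {
                return source.hasNext();
            }

            @Override public List<T> next() {
                if (!source.hasNext()) {
                    throw new NoSuchElementException();
                }
                List<T> batch = new ArrayList<>(size);
                while (batch.size() < size && source.hasNext()) {
                    batch.add(source.next()); // reads only what this batch needs
                }
                return batch;
            }
        };
    }

    public static void main(String[] args) {
        // An infinite source: only the requested batches are ever materialized.
        Iterator<Integer> numbers = Stream.iterate(1, i -> i + 1).iterator();
        Iterator<List<Integer>> batches = batches(numbers, 3);
        System.out.println(batches.next()); // [1, 2, 3]
        System.out.println(batches.next()); // [4, 5, 6]
    }
}
```

Materializing each batch as a list sidesteps the question of what happens when a caller abandons a chunk halfway, at the cost of holding N elements in memory.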
I think that for this (#296) and for
It's not trivial but it seems doable. What do you think, @billoneil, @lukaseder? |
I see two use cases when splitting a
They are both very valid but different use cases. I wouldn't know the best way to portray that in an API since they have different contracts. Maybe allow some flags like

One of the reasons I liked the iterator approach is because it's lazy by default and if you want the values in memory you can always call

This also just reminded me of something when I first brought this up. Maybe the
Seq<Seq<Integer>> seqs = ...
seqs.map(s -> s.limit(5))
This would mess up sequential streams unless, under the hood, it ignored the limit and traversed the whole iterator.
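One way to picture the "traverse the whole iterator under the hood" idea is sketched below, under the assumption of fixed-size chunks: each inner chunk is lazy, and when the outer iterator advances it silently skips whatever the caller left unread in the previous chunk. `DrainingChunksSketch`, `Chunk`, and `chunks` are made-up names for illustration, not code from this PR.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

final class DrainingChunksSketch {

    /** One lazily consumed chunk; reads at most {@code remaining} elements from the shared source. */
    private static final class Chunk<T> implements Iterator<T> {
        private final Iterator<T> source;
        private int remaining;

        Chunk(Iterator<T> source, int size) {
            this.source = source;
            this.remaining = size;
        }

        @Override public boolean hasNext() {
            return remaining > 0 && source.hasNext();
        }

        @Override public T next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            remaining--;
            return source.next();
        }

        /** Skips whatever was left unread so the next chunk starts at the right offset. */
        void drain() {
            while (hasNext()) {
                next();
            }
            remaining = 0;
        }
    }

    static <T> Iterator<Iterator<T>> chunks(Iterator<T> source, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive: " + size);
        }
        return new Iterator<Iterator<T>>() {
            private Chunk<T> current;

            // Caveat: asking the outer iterator for more invalidates the current chunk.
            @Override public boolean hasNext() {
                if (current != null) {
                    current.drain();
                }
                return source.hasNext();
            }

            @Override public Iterator<T> next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                current = new Chunk<>(source, size);
                return current;
            }
        };
    }

    public static void main(String[] args) {
        Iterator<Iterator<Integer>> outer =
                chunks(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10).iterator(), 4);
        Iterator<Integer> first = outer.next();
        System.out.println(first.next());  // 1, then the caller abandons the first chunk
        Iterator<Integer> second = outer.next();
        System.out.println(second.next()); // 5: elements 2, 3, 4 were skipped under the hood
    }
}
```

The trade-off is the one raised in this thread: the skipping is extra work the caller does not see, so it would need to be documented.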
I'm going to hold off until this is discussed a bit more. |
@lukaseder @tlinkowski Either of you have any thoughts here? |
@billoneil Well, I'm under the impression that the modified
Even more, the advantage of implementing
One more thought about the modified
This sounds reasonable as long as it's documented.
I might have missed this part when I read the source the first time. This does indeed seem like it can cover all use cases.
I think this would also work; it just needs to be documented, since it could be doing more work under the hood than expected.
@billoneil You haven't missed anything in the source code :) I might have been imprecise but please note that in my most recent comment I was referring to my previous comment that contains only a proposal of adapting |
The PR got closed by GitHub due to a recent rename of the main branch.
#296
I have two of the implementations working and they should be truly lazy.
I'm not sure I'm happy with how the code turned out. The two iterators are a little dependent on each other. Maybe a better approach would be to make a more top-level peeking iterator instead of hacking it like I did.