
Add files batching capabilities to the Policies #59

Merged
merged 8 commits into from
Jun 11, 2020

Conversation

Symbianx
Contributor

@Symbianx Symbianx commented May 20, 2020

We use the SimplePolicy to restore events from S3 (1 event per file) and noticed messages took a lot of time before they were pushed into the topics.

This PR adds a new configuration option policy.batch_size that allows the connector to handle a maximum number of files at a time as configured.

Fixes #58.
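As a rough sketch of how the new option might be used, a connector properties fragment could look like the following. Only `policy.batch_size` comes from this PR; the other keys and values are illustrative placeholders, not taken from the project's documentation.

```
name=s3-restore-source
connector.class=com.github.mmolimar.kafka.connect.fs.FsSourceConnector
# New option from this PR: process at most this many files per execution
policy.batch_size=100
```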

@Symbianx Symbianx mentioned this pull request May 20, 2020
@coveralls

coveralls commented May 20, 2020

Coverage Status

Coverage increased (+0.1%) to 95.697% when pulling 04be206 on Symbianx:add-simple-batch-policy into 588d310 on mmolimar:master.

@mmolimar
Owner

Many thanks!
I think we could implement this at the policy level so that all policies benefit from this functionality (it's actually a feature planned for the next minor version).
Also, there is a related PR, #60, about this. Is that OK?

@Symbianx
Contributor Author

Symbianx commented May 25, 2020

I went through #60 and, from my understanding, it doesn't solve the same challenge as this one.
It does add chunking, but it still reads all files/messages before returning them for publishing and committing.

I'm going to try to refactor this to work at the AbstractPolicy level.

@mmolimar
Owner

Great @Symbianx!
By the way, when you update the PR, could you open it against the develop branch? I'd like to include all new features there, and then I'll update the master branch when the new version is released.

@Symbianx Symbianx changed the base branch from master to develop May 25, 2020 15:57
@Symbianx Symbianx force-pushed the add-simple-batch-policy branch 2 times, most recently from 81f6e1e to 21175ba Compare May 25, 2020 16:03
@Symbianx Symbianx changed the title Add a simple batch policy for file systems with a lot of files Add files batching capabilities to the Policies May 25, 2020
@Symbianx
Contributor Author

@mmolimar Made the change to set up the batching in the AbstractPolicy as you requested.

@Symbianx
Contributor Author

@mmolimar I've been trying to fix the failing test, but with no luck. I can't even reproduce it on my machine; do you have any idea what could be wrong?

Owner

@mmolimar mmolimar left a comment


Thanks @Symbianx
I added some comments.

@@ -105,16 +109,25 @@ private String convert(String uri) {
if (hasEnded()) {
throw new IllegalWorkerStateException("Policy has ended. Cannot be retried.");
}

if (batchSize > 0 && currentIterator != null && currentIterator.hasNext()) {
Owner


I think it's better to avoid this check and set the files iterator to the previous value.

Contributor Author


If we skip this check, the preCheck and postCheck methods will be executed, which will cause the CronPolicy, SleepyPolicy and HdfsFileWatcherPolicy to sleep between batches. That breaks the current behaviour of sleeping only after handling all the files. Is this what you want?

Owner


Makes total sense, but how about just returning previous (in case it has elements, ignoring the batch size), OK?

Contributor Author


That sounds OK.
Since previous is the original iterator (not a BatchIterator), I will need to create a custom class extending Iterator with a method to reset the counter to 0. Then we can store the BatchIterator in the previous variable instead.

I will change it tomorrow.

Contributor Author


Changed it to use com.google.common.collect.Iterators.partition instead, which simplified the implementation and is more standard.
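For readers unfamiliar with Guava's Iterators.partition: it wraps an iterator and yields fixed-size lists, with a smaller final list for any leftovers. A minimal plain-Java sketch of the same idea (hypothetical names, not the connector's actual code):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class BatchingSketch {

    // Lazily splits the source iterator into lists of at most `size` elements,
    // similar in spirit to com.google.common.collect.Iterators.partition.
    static <T> Iterator<List<T>> partition(Iterator<T> source, int size) {
        return new Iterator<List<T>>() {
            @Override
            public boolean hasNext() {
                return source.hasNext();
            }

            @Override
            public List<T> next() {
                if (!source.hasNext()) throw new NoSuchElementException();
                List<T> batch = new ArrayList<>(size);
                while (source.hasNext() && batch.size() < size) {
                    batch.add(source.next());
                }
                return batch;
            }
        };
    }

    public static void main(String[] args) {
        Iterator<List<Integer>> batches = partition(List.of(1, 2, 3, 4, 5).iterator(), 3);
        System.out.println(batches.next()); // [1, 2, 3]
        System.out.println(batches.next()); // [4, 5] -- the shorter trailing batch
    }
}
```

Because the splitting is lazy, files after the first batch are not touched until the policy asks for the next batch, which is what makes this useful for the slow-start problem described in the PR.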

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class BatchIterator {
Owner


We could add this functionality inside the listFiles method in the AbstractPolicy class

Contributor Author


If we put it in the listFiles method, we will be batching each file system independently. That doesn't feel like the intended behaviour to me.

Owner


OK. Maybe we could rename this class to Iterators, with a partition method that takes the batch size and a "duplicates" flag to allow removing duplicates in the same partition.

Contributor Author


Sorry, I didn't understand what you mean by "allow removing duplicates in the same partition".

Owner


For example: in the first execution of the policy you get an iterator with 5 records (with a batch size of 3), so 2 remain. In the next execution (let's say we're using the Cron policy) we concat the previous 2 with another 5 records. We re-split that iterator, and "maybe" a batch could contain a duplicated item, so we should ignore that record and get another one.
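The scenario above can be sketched in plain Java. All names here are hypothetical, and the distinct() call merely stands in for the proposed "duplicates" flag; it is not the connector's actual code:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ConcatSketch {
    public static void main(String[] args) {
        // First execution: 5 records with a batch size of 3 leaves 2 unprocessed.
        List<String> firstRun = List.of("a", "b", "c", "d", "e");
        List<String> remaining = firstRun.subList(3, 5); // [d, e]

        // Next execution: the 2 leftovers are concatenated with 5 new records,
        // one of which ("e") was already seen, before re-splitting into batches.
        List<String> nextRun = List.of("e", "f", "g", "h", "i");
        List<String> combined = Stream.concat(remaining.stream(), nextRun.stream())
                .distinct() // drop the duplicated record, as the flag would
                .collect(Collectors.toList());

        System.out.println(combined); // [d, e, f, g, h, i]
    }
}
```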

Contributor Author


I don't think that would be in line with the present implementation.
An execution is only finished when the iterator is fully traversed; only after that do we sleep. If we mix files from the current iterator with the next one, we will be creating weird behaviour in the policies.

@mmolimar
Owner

Thanks!

@mmolimar mmolimar merged commit 7a35c9b into mmolimar:develop Jun 11, 2020