Multi-Threaded Job PR Suggestion #300
I think I could also update the queue writers in fork-join so they block if the queue reaches a certain size, but I don't think that addresses these issues:
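The blocking behaviour described here is what a bounded `BlockingQueue` gives out of the box: `put` blocks once the capacity is reached, applying back pressure on the reader. A minimal, self-contained sketch (not easy-batch code; class and names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative sketch: with a bounded queue, the reading side blocks once
// CAPACITY records are in flight instead of buffering the whole file.
public class BoundedQueueSketch {
    static final int CAPACITY = 100; // hypothetical bound

    public static List<String> pipe(List<String> records) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(CAPACITY);
        List<String> written = new ArrayList<>();
        Thread writer = new Thread(() -> {
            try {
                for (int i = 0; i < records.size(); i++) {
                    written.add(queue.take()); // blocks when the queue is empty
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();
        for (String r : records) {
            queue.put(r); // blocks when the queue is full -> back pressure on the reader
        }
        writer.join(); // join gives a happens-before edge, so reading 'written' is safe
        return written;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> input = new ArrayList<>();
        for (int i = 0; i < 1000; i++) input.add("record-" + i);
        System.out.println(pipe(input).size()); // prints 1000
    }
}
```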
Hi @ipropper Thank you for this analysis! I didn't have time yet to look at this issue, but I will definitely try to do it this weekend and get back to you asap. Kr
Hi @ipropper
Correct. I thought it would be good to have each part of the process as a basic job and then compose them to create a fork/join job. But this is probably not a good decision. I like the approach of @DanieleS82 in #299 with a template method. It would be great to have three
Indeed. There is no mechanism to handle back pressure in Easy Batch. But I was thinking of a
I would go for creating a new job type as I said previously, something like
That was the case in v4, where the whole pipeline was operating on batches and not records. But this approach has many issues (see the change log of v5, and especially #211).
I'm not sure I understand this point. The writers can continuously take elements from the queue and write them. By design, we don't want them to block when the queue is full (or reaches a certain size). In contrast, they will block when the queue is empty, and that is the reason why we use a blocking queue. I hope I answered all your questions 😄 Kr
Hi @benas, thanks for getting back to me; you're always prompt with responses!
Am I correct in assuming the I also implemented #299 in my project, so I definitely see the value 😄. In my implementation I abstracted the queues away from the user (for better or worse). If you want me to look at #299 and suggest some edits, I would be happy to do so.
This sounds good to me; however, I am not sure it meets my use case. I will elaborate at the bottom.
Yeah, let me show you a quick code snippet (pseudo code):
The while loop and batch size are the only new code added. I added this because I believe batches exist so that only a few records are in memory at a time, instead of an entire file. With the original blocking-queue writer it was possible to have most of the file in memory, because the queues were unbounded. If the whole file can be stored in a blocking queue, why have batch sizes at all? Throttling the reader as in issue #290 might alleviate this pressure, but it does not solve it; it creates a magic-number situation. We need to play with
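The constant-memory argument can be demonstrated with plain JDK types: with a bounded queue and a deliberately slow consumer, the number of in-flight records never exceeds the capacity, no matter how fast the reader is. A hypothetical, self-contained sketch (names are illustrative, not easy-batch code):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch: a bounded queue caps in-flight records at its
// capacity, even when the consumer is much slower than the producer.
public class BoundedMemorySketch {
    public static int maxInFlight(int totalRecords, int capacity) throws InterruptedException {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(capacity);
        AtomicInteger maxSeen = new AtomicInteger();
        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < totalRecords; i++) {
                    Thread.sleep(1); // deliberately slow consumer
                    queue.take();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();
        for (int i = 0; i < totalRecords; i++) {
            queue.put(i); // the reader blocks here once capacity is reached
            maxSeen.accumulateAndGet(queue.size(), Math::max);
        }
        consumer.join();
        return maxSeen.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(maxInFlight(500, 100) <= 100); // prints true
    }
}
```

With an unbounded `LinkedBlockingQueue` in the same setup, `maxSeen` would instead approach `totalRecords`, which is the memory-growth symptom described above.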
Not fully. I was referring to the current
Great! In hindsight, my approach of implementing fork/join using multiple jobs is not correct. As a user, I want my data source to be processed in parallel transparently and have one job report in the end. So it should be a single job, not multiple ones. The idea of a template method is the way to go. Unfortunately, the implementation in #299 was inspired by the tutorial, which uses multiple jobs. There is nothing wrong with that, but it still splits the work across jobs. If you have an alternative (that you implemented in your project), please open a separate PR.
I see. But I think we are talking about different topologies: reading and writing data is done serially, while processing is done in parallel by multiple threads. Record dispatching and collecting should be abstracted from the user, and the degree of parallelism should be a parameter:

```java
Job job = new MultiThreadedJobBuilder()
        .reader(new FlatFileRecordReader("input.txt"))
        .processor(new MyProcessor1())
        .processor(new MyProcessor2())
        .writer(new FileRecordWriter("output.txt"))
        .degreeOfParallelism(4)
        .build();
```

Of course, processors should be thread-safe. Do you see? Kr
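One plausible way `degreeOfParallelism(4)` could work internally is N workers draining a shared record source, each running the processing and writing steps. This is a speculative sketch of that topology using only JDK types; `MultiThreadedJobBuilder` is the proposal's name, and nothing here is the library's actual implementation. Note that output order is not preserved, consistent with the later point that ordering makes no sense in a parallel setup:

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Speculative sketch: N workers share one record source; each worker runs
// the whole process-and-write part of the pipeline on the records it takes.
public class ParallelPipelineSketch {
    public static List<String> run(List<String> source,
                                   Function<String, String> processor,
                                   int degreeOfParallelism) throws Exception {
        Queue<String> reader = new ConcurrentLinkedQueue<>(source);
        Queue<String> writer = new ConcurrentLinkedQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(degreeOfParallelism);
        for (int i = 0; i < degreeOfParallelism; i++) {
            pool.submit(() -> {
                String record;
                while ((record = reader.poll()) != null) { // read
                    writer.add(processor.apply(record));   // process + write
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return List.copyOf(writer);
    }

    public static void main(String[] args) throws Exception {
        List<String> out = run(List.of("a", "b", "c", "d"), String::toUpperCase, 4);
        System.out.println(out.size()); // prints 4
    }
}
```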
I like this API:
How would you feel about the reader reading
Again, this is the behaviour of v4: the reader will read
I am sorry that I am being annoying, but v4 is not the master branch, so support for bugs etc. is less likely. All I am hoping to do is contribute something to easy-batch that:
The current fork-join implementation does not do 1 and 2 because the
(diagram: https://user-images.githubusercontent.com/1210553/32417196-0028ccfe-c256-11e7-9e0c-c600fd7a14f8.png)
and this API:
I would be happy to create a version of this API that does not maintain order and does not keep fixed memory (i.e. the readerQueue only blocks when syncing threads). If you think that contribution is worthwhile, let me know. Otherwise, close this thread so I stop bugging you 😄
No, you are not annoying, you are spending your time helping me. And thank you!
I do confirm
Keeping record ordering makes no sense in a parallel setup.
You are welcome! But I think we still need to use a bounded reading queue, or else we will end up with the whole data source in memory. Also watch out for this tricky case: https://github.com/j-easy/easy-batch/wiki/readers#jdbcrecord-caveat. JDBC records cannot be distributed to workers, since their payload depends on the connection (which may be closed before the records are processed).
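The usual way around this kind of caveat is to copy (detach) the payload from the live connection before enqueueing, so worker threads never touch the connection. A hypothetical illustration, with the connection-bound source simulated by a closable class rather than real JDBC:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the JDBC caveat: copy payloads eagerly while
// the source is open; hand only the detached copies to worker threads.
public class DetachBeforeDispatchSketch {
    // Simulates a ResultSet-like source that is only valid while "open".
    static class RowSource implements AutoCloseable {
        private boolean open = true;
        private int next = 0;

        String readRow() {
            if (!open) throw new IllegalStateException("source closed");
            return "row-" + next++;
        }

        public void close() { open = false; }
    }

    public static List<String> readDetached(int count) {
        List<String> detached = new ArrayList<>();
        try (RowSource source = new RowSource()) {
            for (int i = 0; i < count; i++) {
                detached.add(source.readRow()); // copy the payload eagerly
            }
        } // the source is closed here; the copies remain safe to distribute
        return detached;
    }

    public static void main(String[] args) {
        System.out.println(readDetached(3)); // prints [row-0, row-1, row-2]
    }
}
```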
Are you willing to share this in a gist? I'm curious 😄
Hi, yes this is what I had, but I don't think this is the way to go. Keep in mind this gist does not compile; I had to abstract some of my custom code: https://gist.github.com/ipropper/da22825e33c0635dbc8acdab12f4e7a5 The API you mentioned is much better. I will get working on something like your API.
Great! Let me know if you have a working prototype. I will also try to come up with something in the meantime. I'm releasing v5.2 in the coming few days, so let me plan this for v5.3.
Hi,
Hi, this is a prototype of the multi-threaded batch. It hasn't been fully tested. The base idea:
What are your thoughts on this model? I can totally decouple the multi-threaded and single-threaded batch, or I could also build multithreading directly into the BatchJob class so there is no need for a multi-threaded batch class. Additionally, reading happens separately from processing records, but I could change that to make it happen simultaneously.
I've added a couple of notes on this in your PR.
Excellent point! The whole pipeline should be executed in parallel, as shown in the previous diagram. Internally, a
As a user, when I specify my logic, executing the job serially or in parallel should yield the same result; only the speed might be better with a parallel version (which is not always the case, BTW). Think of it like parallel streams in Java 8. Using
Yes, each thread should execute the whole pipeline. This is also important for the semantics of
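The parallel-streams analogy mentioned above can be made concrete: the same mapping logic, run serially or in parallel, must yield the same result. A small, self-contained sketch (illustrative only):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustrative sketch of the parallel-streams analogy: serial and parallel
// execution of the same logic produce the same result; only speed may differ.
public class SerialVsParallelSketch {
    static List<Integer> squares(List<Integer> input, boolean parallel) {
        Stream<Integer> stream = parallel ? input.parallelStream() : input.stream();
        return stream.map(x -> x * x).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> input = List.of(1, 2, 3, 4, 5);
        // Collecting an ordered stream preserves encounter order even in parallel.
        System.out.println(squares(input, false).equals(squares(input, true))); // prints true
    }
}
```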
I was re-reading this thread and found it really interesting! Many thanks to all for sharing these nice ideas 👍 I have already explained my point of view in the previous messages, so I'm not going to repeat it here. In hindsight, I would not introduce a multi-threaded job implementation, to avoid all the joy of thread-safety and whatnot. The implementation complexity and the maintenance burden are higher than the added value of this feature. I do believe Easy Batch jobs are lightweight
This feature can be implemented in a fork if needed. OSS FTW!
Hi,
in reference to #299
May I suggest an alternative to the fork-join method? These are the issues I have been having with the fork-join method:
It breaks up your job into multiple jobs.
It does not have the constant-memory guarantee of a single batch job. A single batch job only processes a batch of size N, so you will only have N records in memory at any given time. With the fork-join model, memory grows somewhat unexpectedly: if your fork threads are much slower than your file-reading thread (as is the case for me), memory can grow very fast. I ran the fork-join tutorial with a million-line file, changing the following code:
I stopped the process early, but as you can see the memory kept growing. I would not expect a batch size of 100 to use 1.5 GB of memory!
To solve this issue, I would like to either create a new job type or update the existing BatchJob class.
The current BatchJob code reads as follows:
and I would like to implement something more like this (pseudo code):
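The pseudo code itself did not survive in this thread. Based on the description elsewhere in the discussion (a while loop bounded by the batch size, so that only one batch is in memory at a time), it might have looked roughly like the following. All names here are hypothetical, not the actual BatchJob code:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical reconstruction: read at most batchSize records per iteration,
// so only one batch is resident in memory at any given time.
public class BatchedReadLoopSketch {
    public static int processInBatches(Iterator<String> reader, int batchSize) {
        int processed = 0;
        while (reader.hasNext()) {
            List<String> batch = new ArrayList<>(batchSize);
            while (reader.hasNext() && batch.size() < batchSize) {
                batch.add(reader.next()); // read one record
            }
            // process + write the batch here; afterwards it becomes garbage,
            // keeping peak memory proportional to batchSize, not file size
            processed += batch.size();
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> file = new ArrayList<>();
        for (int i = 0; i < 250; i++) file.add("line-" + i);
        System.out.println(processInBatches(file.iterator(), 100)); // prints 250
    }
}
```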
The BatchProcessor class allows a processor to run on a batch of records. With this I could create a multiThreadedProcessor that can run on batches. I think this code could provide the following:
Do you think this is a bad idea? Are there any major issues that I am not addressing? If you think this is a good idea, may I attempt a PR?
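The "multiThreadedProcessor that can run on batches" idea can be sketched with a plain `ExecutorService`: split one batch across a pool and join the results. This is a speculative illustration, not the proposed BatchProcessor API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;
import java.util.stream.Collectors;

// Speculative sketch of a multi-threaded batch processor: each record in the
// batch becomes a task; invokeAll runs them on the pool and preserves order.
public class MultiThreadedBatchProcessorSketch {
    public static <I, O> List<O> processBatch(List<I> batch,
                                              Function<I, O> processor,
                                              int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<O>> tasks = batch.stream()
                    .map(r -> (Callable<O>) () -> processor.apply(r))
                    .collect(Collectors.toList());
            List<O> results = new ArrayList<>();
            for (Future<O> f : pool.invokeAll(tasks)) {
                results.add(f.get()); // rethrows processor failures
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(processBatch(List.of(1, 2, 3), x -> x + 1, 2)); // prints [2, 3, 4]
    }
}
```

Since `invokeAll` returns futures in task order, this variant keeps the batch's record order, which only matters if ordering is a requirement; the thread above argues it usually is not.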