Skip to content

java 8 stream support #250

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Oct 20, 2016
Merged

java 8 stream support #250

merged 3 commits into from
Oct 20, 2016

Conversation

chsfleury
Copy link
Contributor

Hello,

These classes add support for java 8 streams and path streams.

I've read the framework is compatible with java 7, so i set java.version to 1.8 only for the added module. Build and tests OK for 1.8.

I hope that it will be useful.
Best regards.

Charles Fleury

@fmbenhassine
Copy link
Member

fmbenhassine commented Oct 19, 2016

This is fantastic! Thank you very much for this addition 👍

I was planning to make Java 8 as minimum java version required for the upcoming v5 ( see #197 ) but many users asked me to keep Easy Batch Java 7 compatible ( I took into account their request: 6b0eaf5).

Hence, the project is compiled with Java 7 for now (which makes the build to be in error with your PR, but nothing to do with your code which is fine)

I know we can compile with Java 8 and specify the target to be Java 7, but this does not seem to work and I had some issues on another project for the same problem.

Overriding java version in the added module as you did is the way to go, but we should be able to compile all modules with Java 7 except the new one with Java 8. Probably you have managed to do it? any help is welcome!

@coveralls
Copy link

Coverage Status

Coverage increased (+0.07%) to 78.603% when pulling 5e63c30 on chsfleury:master into 9489794 on EasyBatch:master.

@chsfleury
Copy link
Contributor Author

Done.

I use this framework in production (and love it).
=> reading 1M+ xml files, map into objects and index into ES

I probably make other small features in the future :)

Good luck for all your projects !

@fmbenhassine
Copy link
Member

Thank you for the quick update! Nice trick 😉 I'm only thinking about how to release this new module since maven release plugin will use java 7.. But no worries, I'll find a way to release it. Probably I'll move all extensions to a separate repo and release them separately (each with its required java version).

Let me merge this addition for now.

A couple of notes though:

  • The StreamRecordReader is perfect. I'll may be add a constructor with only the stream and set the datasource to default value (which can be "In-Memory Stream" to be consistent with IterableRecordReader and StringRecordReader).
  • How is the PathRecordReader different from FileRecordReader. I think there is no need for another directory reader since we have a constructor with a Path in FileRecordReader. Do you agree? We definitely need to refactor the current (ugly!) getFile method with the more elegant Files.walkFileTree as you did and we are done. What do you think?

I use this framework in production (and love it).
=> reading 1M+ xml files, map into objects and index into ES

I'm really glad to hear this 😄

I probably make other small features in the future :)

You are welcome

@chsfleury
Copy link
Contributor Author

The StreamRecordReader is perfect. I'll may be add a constructor with only the stream and set the datasource to default value (which can be "In-Memory Stream" to be consistent with IterableRecordReader and StringRecordReader).

A java 8 stream is not necessarily "in-memory", The Files.walk(Path) (java.nio) is an example. But its a good compromise.

How is the PathRecordReader different from FileRecordReader. I think there is no need for another directory reader since we have a constructor with a Path in FileRecordReader. Do you agree? We definitely need to refactor the current (ugly!) getFile method with the more elegant Files.walkFileTree as you did and we are done. What do you think?

I agree to keep one directory reader, in theory, but ! :)

  • java.io.File is used less and less over time and new APIs do not use it. I do not use it too since the release of java.nio in java 7 (2011) that brought a lot of improvements.
  • The current getFile method list in memory the files before start the batch, 1M+ xml is very very long to list on HDD (its the reason why i create these two classes), on the other side walkFileTree and walk are lazy.
  • Files.walkFileTree is good but its the java 7 version of Files.walk

I recommend (at least) to set as @deprecated for the v5 : FileRecordReader, FileRecord (which belong to java 7), and keep PathRecordReader.

@fmbenhassine
Copy link
Member

I think the new module should only contain the StreamRecordReader.
The improvement/merge of FileRecordReader/PathRecordReader can be done in the core module.

A java 8 stream is not necessarily "in-memory", The Files.walk(Path) (java.nio) is an example. But its a good compromise.

Sure, sorry if I wasn't clear enough. What I mean is that the data source (the stream itself) is in memory. Like an iterable of files: files are on disk, but the data source (ie the iterable object is in memory).

The current getFile method list in memory the files before start the batch, 1M+ xml is very very long to list on HDD (its the reason why i create these two classes), on the other side walkFileTree and walk are lazy.

That's why I'm suggesting to refactor it using a lazy API (be it Files.walkFileTree for Java 7 or Files.walk for Java 8). But since v5 should still Java 7 compatible and this change can be done in the core module, we can use Files.walkFileTree

I recommend (at least) to set as @deprecated for the v5 : FileRecordReader, FileRecord (which belong to java 7), and keep PathRecordReader.

I agree with you, we need to deprecate usage of java.io.File. But, in the end, we are reading "files" from a "directory". So IMHO, FileRecordReader is a good name (regardless of the technical java API representing "files": java.io.File or java.nio.Path). Do you see my point?

What do you think of:

  • Deprecating the constructor taking a File in the FileRecordReader and encourage using the constructor with Path (I should have done this in d7130af)
  • Refactoring getFile method to use Files.walkFileTree?

This way, we take the best of both readers and merge them (this can be done in the core module using Java 7 compatible API, namely Files.walkFileTree).

I don't want to bother you with more updates of the PR, If you agree, I can merge it and do the changes.

Kind regards
Mahmoud

@chsfleury
Copy link
Contributor Author

Ok, i didn't understand that the first time.

I agree, you have a better compromise 😄
I let you do the rest.

Good night !

@fmbenhassine
Copy link
Member

fmbenhassine commented Oct 20, 2016

Great! I'll merge your PR asap.

Thank you for this addition 👍
Just in time before the final v5 release 😄

@fmbenhassine fmbenhassine merged commit cbac281 into j-easy:master Oct 20, 2016
fmbenhassine added a commit that referenced this pull request Oct 20, 2016
fmbenhassine added a commit that referenced this pull request Oct 20, 2016
@fmbenhassine
Copy link
Member

I've merged your changes and added you to the contributors list. I really appreciate your help! Thanks.

You know, the StreamRecordReader is very powerful 😄 Looking at your implementation of PathRecordReader (which is great using a delegate), we can have the equivalent of PathRecordReader with:

StreamRecordReader<Path> streamRecordReader = new StreamRecordReader<>(Files.walk(path)); // recursive

// or

StreamRecordReader<Path> streamRecordReader = new StreamRecordReader<>(Files.list(path)); // non recusrive

Probably in combination with a new Easy Batch RegularFileRecordFilter having this logic stream.filter(Files::isRegularFile). It would be a nice addition BTW.

The current getFile method list in memory the files before start the batch, 1M+ xml is very very long to list on HDD (its the reason why i create these two classes)

I'm not sure it will load the content of all files in memory, but simply pointers to file descriptors. May be even with only pointers in the list, 1M+ would become a performance issue. Did you have performance issues with FileRecordReader ?

@fmbenhassine fmbenhassine added this to the v5.0.0-RC3 milestone Oct 20, 2016
@chsfleury
Copy link
Contributor Author

The real problem with the current getfile method is to visit each file physically on the hdd (descriptors or not) to create the list. its not a memory problem, its a delay-before-start one. walkfiletree send immediatly the first file and continue.

@fmbenhassine
Copy link
Member

Ok I see, I'll refactor this ugly method asap!

@chsfleury
Copy link
Contributor Author

Thank you for the contributor page 😄

fmbenhassine added a commit that referenced this pull request Oct 22, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging this pull request may close these issues.

3 participants