
readers

Mahmoud Ben Hassine edited this page Feb 7, 2020 · 3 revisions

To read records from a data source, you should register an implementation of the RecordReader interface:

Job job = new JobBuilder()
    .reader(new MyRecordReader(myDataSource))
    .build();
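
A custom reader implements three lifecycle methods: open the data source, read the next record (returning null at the end of the stream), and close the source. The sketch below uses simplified stand-ins for Easy Batch's RecordReader and Record types (the real interfaces live in easy-batch-core and carry record headers as well); only the shape of the contract is meant to carry over.

```java
import java.util.Iterator;
import java.util.List;

// Simplified stand-ins for Easy Batch's interfaces (hypothetical, for illustration only).
interface Record<P> {
    P getPayload();
}

interface RecordReader<P> {
    void open() throws Exception;            // acquire the data source
    Record<P> readRecord() throws Exception; // null signals the end of the stream
    void close() throws Exception;           // release the data source
}

// A reader over an in-memory list of strings.
class ListRecordReader implements RecordReader<String> {
    private final List<String> data;
    private Iterator<String> iterator;

    ListRecordReader(List<String> data) {
        this.data = data;
    }

    @Override
    public void open() {
        iterator = data.iterator();
    }

    @Override
    public Record<String> readRecord() {
        if (!iterator.hasNext()) {
            return null; // end of data source
        }
        String payload = iterator.next();
        return () -> payload;
    }

    @Override
    public void close() {
        // nothing to release for an in-memory source
    }
}
```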

There are several built-in record readers to read data from a variety of sources:

  • flat files (delimited and fixed length)
  • xml, json and yaml files
  • MS Excel files
  • in-memory strings
  • databases
  • JMS queues
  • BlockingQueue and Iterable objects
  • Java 8 streams
  • and standard input

Here is a table of built-in readers and how to use them:

| Data source         | Reader                    | Record type   | Module               |
|---------------------|---------------------------|---------------|----------------------|
| String              | StringRecordReader        | StringRecord  | easy-batch-core      |
| Directory           | FileRecordReader          | FileRecord    | easy-batch-core      |
| Iterable            | IterableRecordReader      | GenericRecord | easy-batch-core      |
| Standard input      | StandardInputRecordReader | StringRecord  | easy-batch-core      |
| Java 8 Stream       | StreamRecordReader        | GenericRecord | easy-batch-stream    |
| Flat file           | FlatFileRecordReader      | StringRecord  | easy-batch-flatfile  |
| MS Excel file       | MsExcelRecordReader       | MsExcelRecord | easy-batch-msexcel   |
| Xml stream          | XmlRecordReader           | XmlRecord     | easy-batch-xml       |
| Xml file            | XmlFileRecordReader       | XmlRecord     | easy-batch-xml       |
| Json stream         | JsonRecordReader          | JsonRecord    | easy-batch-json      |
| Json file           | JsonFileRecordReader      | JsonRecord    | easy-batch-json      |
| Yaml stream         | YamlRecordReader          | YamlRecord    | easy-batch-yaml      |
| Yaml file           | YamlFileRecordReader      | YamlRecord    | easy-batch-yaml      |
| Relational database | JdbcRecordReader          | JdbcRecord    | easy-batch-jdbc      |
| Relational database | JpaRecordReader           | GenericRecord | easy-batch-jpa       |
| Relational database | HibernateRecordReader     | GenericRecord | easy-batch-hibernate |
| BlockingQueue       | BlockingQueueRecordReader | GenericRecord | easy-batch-core      |
| JMS queue           | JmsRecordReader           | JmsRecord     | easy-batch-jms       |

Handling data reading failures

Sometimes the data source may be temporarily unavailable, in which case the record reader fails to read data and the job is aborted. The RetryableRecordReader can be used to retry reading through a delegate RecordReader with a given RetryPolicy:

Job job = new JobBuilder()
    .reader(new RetryableRecordReader(unreliableDataSourceReader, new RetryPolicy(5, 1, SECONDS)))
    .build();

This makes the reader retry at most 5 times, waiting one second between attempts. If the data source is still unreachable after 5 attempts, the job is aborted.
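
The retry behaviour itself is simple to picture: attempt the read, and on failure wait for the delay and try again until the attempt budget is exhausted. Below is a plain-Java sketch of that idea (a hypothetical helper, not Easy Batch's actual RetryPolicy implementation):

```java
import java.util.concurrent.Callable;

// Plain-Java sketch of retry-with-delay (hypothetical, not Easy Batch's RetryPolicy).
class Retry {
    static <T> T withRetry(Callable<T> action, int maxAttempts, long delayMillis) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e; // remember the failure and retry after the delay
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis);
                }
            }
        }
        throw last; // all attempts failed: give up, as the job would abort
    }
}
```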

Performance notes

  • The JdbcRecordReader reads records in chunks. For large data sets, set the maxRows and fetchSize parameters to avoid loading the entire result set into memory.

  • The JpaRecordReader loads all data fetched by the JPQL query into a java.util.List object. Be careful with JPQL queries that return large data sets; you can cap the number of rows to fetch using the maxResults parameter.

  • The HibernateRecordReader uses org.hibernate.ScrollableResults behind the scenes to stream records in chunks. You can specify the fetch size and the maximum number of rows to fetch using the fetchSize and maxResult parameters.

Reading data from multiple files

It is possible to read data from multiple files using a MultiFileRecordReader, provided all files share the same format. A MultiFileRecordReader reads the files in sequence, and all records are passed to the processing pipeline as if they were read from a single file. There are 4 implementations: MultiFlatFileRecordReader, MultiXmlFileRecordReader, MultiJsonFileRecordReader and MultiYamlFileRecordReader, to read multiple flat, xml, json and yaml files respectively.
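
The same files-in-sequence behaviour can be sketched with plain Java streams: concatenate the line streams of each file so downstream code sees a single sequence of records. This is an illustration of the idea only, not the MultiFileRecordReader implementation.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

// Sketch: read several same-format flat files as one sequence of lines.
class MultiFileLines {
    static Stream<String> lines(List<Path> files) {
        return files.stream().flatMap(path -> {
            try {
                return Files.lines(path); // lines of each file, in file order
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }
}
```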

JdbcRecord caveat

The JdbcRecordReader produces records of type JdbcRecord, whose payload is a java.sql.ResultSet. In a scenario where a master job reads data from a relational database and dispatches records to workers, the master job may finish reading the data source and dispatch all records to worker queues while the workers are still processing them. The master job then closes the database connection, and the dispatched JDBC records become unusable since their payloads depend on the connection that has just been closed.

A solution to this problem is to make the master job map JDBC records to domain objects and dispatch those objects safely to workers. You can find an example in the fork/join tutorial.
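
That mapping step can be sketched as follows: copy each row's column values into a plain domain object while the connection is still open, so the dispatched objects carry no reference to the ResultSet. The Tweet class and column names here are hypothetical, and rows are modelled as maps to keep the sketch self-contained.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical domain object: all fields are plain values, no JDBC references.
class Tweet {
    final int id;
    final String user;
    final String message;

    Tweet(int id, String user, String message) {
        this.id = id;
        this.user = user;
        this.message = message;
    }
}

// Sketch: eagerly copy row values (modelled here as maps) into domain objects
// before dispatching, so workers never depend on the closed ResultSet.
class TweetMapper {
    static List<Tweet> map(List<Map<String, Object>> rows) {
        return rows.stream()
                .map(row -> new Tweet(
                        (Integer) row.get("id"),
                        (String) row.get("user"),
                        (String) row.get("message")))
                .collect(Collectors.toList());
    }
}
```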
