yakety-csv

The yakety-csv or Yet-Another-Know-Everything-Tentative-Yap-CSV Java library.
It should work with any single-character field separator: comma, semicolon, tab, or a custom character.
At this point it only supports single-character line breaks: \n, \r, or a custom single character. Support for the Windows \r\n sequence will be added soon.

It is meant to be driven by configuration and require very little code to use. Its main feature is lazy parsing, which keeps memory usage low even when parsing files that are gigabytes long.

There are plenty of implementations out there; please look for them, as this is an exercise.

Features

  • Fully functional, Java Stream based API

  • Configuration over code:
    “library user defines rules not process”

    • Columns are defined for reuse (an enum is recommended)

  • At some later point, full support for RFC 4180, both strict and non-strict

  • Support for different locales

  • Support for quoted values

  • On parsing:

    • Step 1 list of ordered string values

    • Step 2 list of string values by key

    • Step 3 nullable values

    • Step 4 basic type coercion from string

    • Step 5 Boxed variation on basic types

    • Step 6 Support for date, time, datetime (timezone or not)

    • Step 7 assembled value object

Bucket list

  • Caching of parsed results (denormalized files tend to repeat a lot, memoized field parsing may save time in some cases).

  • Support for custom line breaks that are more than one character long

  • In case of a coercion error, return a marker that points to the file, line, and field where the problem happened.

  • Replace the Apache validation library with a custom parser and include a formatter parameter that is not recreated on every call.

Release Plans

  • v0.0.X - empty project that compiles

  • v0.1.X - stream String value of fields from a csv using strict RFC4180

  • v0.2.X - Enable column configuration

  • v0.3.X? - Proper handling of quotes

  • v1.0.X - stream Map.Entry of String values by column (RFC4180)

  • v1.1.X - nullable columns (map stream cannot take null values)

  • v1.2.X - proper limited parsing of int and double

  • v1.3.X - proper parsing of long and float

  • v1.4.X - proper parsing to Boxed numbers

  • v1.5.X - proper parsing of Date, Time and DateTime

  • v1.6.X - proper parsing of BigInteger and BigDecimal

  • v1.7.X - support to domain values (enum) parsing

  • v2.0.0 - Populate a java bean from the textual field by column map, includes:

    • safe parsing of standard nullable types: Integer, Long, Float, Double, BigInteger, BigDecimal, LocalDate, LocalDateTime, LocalTime.

    • parsing of domain-based values backed by enums;

    • API for custom field parsing;

    • API that joins field parser and map’s key lookup;

    • any field that fails to be parsed is null;

    • the API avoids exceptions, so the stream is never broken;

    • API for a transformer from Map<K, String> to java bean.

    • API for indexed/id’ed beans

  • v3.0.0 - Adding feature to identify parsing problems, includes:

    • Removal of all deprecated API;

    • implementation of Either;

    • a new parser and bean assembly that uses Either as output;

    • a new interface to represent a problem parsing a field, which can be applied along with ColumnDefinition;

    • introduction of result expectation vs actual result;

    • marker pointing to location where parsing failed.

    • generate a digest-based unique id for the source file and use it as its identifier

  • v4.0.0 - first API stable version, includes:

    • optimizations

    • extra logging

    • integration test comments and documentation

    • package publication

Tasks

  1. setup project:

    • ✓ gradle

    • ✓ spock tests

    • ✓ spock integration tests

    • ✓ git ignores

  2. functionalities:

    • ✓ simple csv to stream of fields

    • ✓ configurable parser

    • ✓ file format configuration

    • ✓ column definition interface

    • ✓ configurable csv columns to stream of String fieldByColumnName maps

    • ✓ indexed row value as field in map

    • ✓ use dynamic programming to check whether a line break is within quotes and ignore it if so; should consume large files without blowing up the stack (see the sketch after this list)

    • ✓ parser localization

    • ✓ column definition map to expected type (string for now)

    • ✓ from the map result apply identity type coercion to bean

    • ❏ add coercion checks with bad results as separate dataset from raw values

    • ❏ add null constraints

    • ❏ configurable csv columns with type coercion (all types)

    • ❏ configurable csv columns with type coercion to list of objects
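As a rough illustration of the quote-aware line break handling mentioned above, here is a minimal sketch, not the library's actual implementation: a line break only terminates a record when it falls outside quotes. Escaped quotes ("") per RFC 4180 are ignored for brevity, and the real parser works lazily over the input rather than on a whole String in memory.

import java.util.ArrayList;
import java.util.List;

static List<String> splitRecords(final String csv, final char quote, final char lineBreak) {
  final List<String> records = new ArrayList<>();
  final StringBuilder current = new StringBuilder();
  boolean insideQuotes = false;
  for (int i = 0; i < csv.length(); i++) {
    final char c = csv.charAt(i);
    if (c == quote) {
      insideQuotes = !insideQuotes;        // toggle quote state on every quote character
    }
    if (c == lineBreak && !insideQuotes) {
      records.add(current.toString());     // a line break outside quotes ends the record
      current.setLength(0);
    } else {
      current.append(c);
    }
  }
  if (current.length() > 0) {
    records.add(current.toString());       // last record, no trailing line break
  }
  return records;
}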

How to build

Environment setup requirements

Java 16+ is needed; get it with SDKMan. The following Gradle configuration is recommended in ~/.gradle/gradle.properties:

org.gradle.parallel=true
org.gradle.jvmargs=-Xmx2048M
org.gradle.caching=true
org.gradle.daemon.idletimeout=1800000
org.gradle.java.home=/home/user/.sdkman/candidates/java/16.0.1-open # (1)
  1. your own path for the JDK 16+

TL;DR

./gradlew

How to use

The intended usage is that you are either exploring data from a file whose format you do not know, or parsing a well-known CSV format multiple times from different files.

Simple CSV to Java Stream<List<String>>

import java.io.File;
import java.util.List;
import java.util.stream.Stream;

import org.shimomoto.yakety.csv.config.FileFormatConfiguration;
import org.shimomoto.yakety.csv.CsvParserFactory;

final FileFormatConfiguration config =
        FileFormatConfiguration.builder().build();
final CsvParser textParser =
        CsvParserFactory.toText(config);

final Stream<List<String>> textResults =
        textParser.parse(new File("that_data.csv"));
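
The stream is lazy, so rows are only read as the stream is consumed, which is what keeps memory usage low on large files. A minimal usage sketch building on the textResults variable above (the printed field is purely illustrative):

// Print the first field of each row; rows are read lazily as the stream is consumed.
textResults
        .map(fields -> fields.isEmpty() ? "" : fields.get(0))
        .forEach(System.out::println);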

Simple CSV to Stream<Map<String,String>>

An extra field is added for the line index, starting at 1 (the header row is index 0). The index field name must not clash with any column name.

Mapping is purely positional (it does not check whether the first field matches the first header column name): if you mess up the field order, you mess up the mapping.

import java.io.File;
import java.util.List;
import java.util.Map;
import java.util.stream.Stream;

import org.shimomoto.yakety.csv.config.FileFormatConfiguration;
import org.shimomoto.yakety.csv.CsvParserFactory;

final FileFormatConfiguration config =
        FileFormatConfiguration.builder()
            .indexColumn("#")
            .columns(List.of("colA","colB","colC"))
            .build();
final CsvParser indexedMapParser =
        CsvParserFactory.toRowIndexedTextMap(config);

final Stream<Map<String,String>> textResults =
        indexedMapParser.parse(new File("that_data.csv"));
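
Each row then arrives as a map keyed by the column names plus the index column. A usage sketch building on textResults above (the column names and the filter are illustrative only):

// Keep only rows where colB has a value; print the row index and colA.
textResults
        .filter(row -> !row.getOrDefault("colB", "").isEmpty())
        .forEach(row -> System.out.println(row.get("#") + ": " + row.get("colA")));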

Simple CSV to Stream<?> where ? is a POJO

It builds upon the field-by-column map with a dynamic index; those results are used to build a Java bean.

A transformer from Stream<Map<? extends ColumnDefinition, String>> to whatever aggregate type you want to build is needed.

import java.io.File;
import java.util.Locale;
import java.util.stream.Stream;

import org.shimomoto.yakety.csv.config.FileFormatConfiguration;
import org.shimomoto.yakety.csv.CsvParserFactory;

final FileFormatConfiguration config =
        FileFormatConfiguration.builder()
            .indexColumn(MyVirtualColumns.INDX) // (1)
            .columns(MyColumns.values()) // (2)
            .build();

final BeanAssembly<MyColumns, MyAggregate> transformer =
        new MyTransformer(Locale.ENGLISH);  // (3)

final CsvParser beanParser =
        CsvParserFactory.toBeans(config);

final Stream<MyAggregate> aggregates =
        beanParser.parse(new File("that_data.csv"));
  1. Must implement org.shimomoto.yakety.csv.api.ColumnDefinition and have a common interface with the columns below.

  2. Must implement org.shimomoto.yakety.csv.api.ColumnDefinition and have a common interface with the index above.

  3. Must implement the BeanAssembly interface; it is responsible for coercing the values from String and assigning them to POJO fields, all without throwing exceptions.
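
For orientation only, the column enum and the transformer could look roughly like the sketch below. The ColumnDefinition and BeanAssembly interfaces are not spelled out in this README, so the method names used here (getDisplayName, assemble) and the MyAggregate constructor are placeholders; check the actual interfaces in the sources.

import java.util.Locale;
import java.util.Map;

import org.shimomoto.yakety.csv.api.ColumnDefinition;

// Placeholder sketch: the methods below are assumptions, not the real interfaces.
enum MyColumns implements ColumnDefinition {
  NAME, POWER;

  public String getDisplayName() {           // assumed accessor for the header name
    return name().toLowerCase(Locale.ROOT);
  }
}

class MyTransformer implements BeanAssembly<MyColumns, MyAggregate> {
  private final Locale locale;

  MyTransformer(final Locale locale) {
    this.locale = locale;
  }

  public MyAggregate assemble(final Map<MyColumns, String> row) {   // assumed method name
    // Coerce the String values here without throwing; fields that fail to parse stay null.
    return new MyAggregate(row.get(MyColumns.NAME), row.get(MyColumns.POWER));
  }
}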

TL;DR

Snippets are not real-life examples?! Ok, read the contents of MarvelIT.groovy: it creates multiple parsers using different approaches and processes some data for close inspection.

If you just want to read from the test results:

./gradlew integrationTest

then open build/reports/spock-reports/integrationTest/index.html; these are the integration test results.

Learning notes

  1. Scanner discards empty elements at the beginning or end, which is fine when splitting lines, and being lazy is a must. String.split(pattern, -1) works correctly (empty fields show up) but takes a String instead of a Pattern; Pattern.split(string, -1) works when the number of fields is unknown; when the number of fields is known, just pass that number instead of a negative limit (see the illustration after this list).

  2. Regular expressions with matches and groups take more processing power; a lookahead does not, and performs like an index-based string walk.
    Splitting by line with a regular expression blows up the stack; the solution I can think of is to consume a line, check whether there is an open quote, and consume further lines until all open quotes are closed; at that point it would be better to just consume the fields along the way.

  3. The Java Pattern class does not implement hashCode or equals 🤷.

  4. The Apache validation library recreates the Formatter on every call, so it does what is needed (no exceptions while parsing) but may degrade performance on large volumes.
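
A small illustration of the split behaviour from note 1: with the default limit, trailing empty fields are dropped, while a negative limit preserves them, which is exactly what a CSV parser needs.

import java.util.Arrays;
import java.util.regex.Pattern;

final Pattern comma = Pattern.compile(",");
// Default limit drops trailing empty strings: prints [a, b]
System.out.println(Arrays.toString(comma.split("a,b,,")));
// Limit -1 keeps them: prints [a, b, , ]
System.out.println(Arrays.toString(comma.split("a,b,,", -1)));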

Questions

  1. Should the ColumnDefinition be enforced at API level? That would force split for String columns.
    Perhaps it should be enforced when types are to be used…
