#17 Support custom conversions

Implement parsing integers with different radix

mrotteveel committed Jul 14, 2023
1 parent 1c84522 commit 7fa326e

Showing 42 changed files with 2,039 additions and 203 deletions.
4 changes: 4 additions & 0 deletions .reuse/dep5
@@ -42,3 +42,7 @@ License: Apache-2.0
Files: src/integrationTest/resources/integration-testdata/customers-1000.csv
Copyright: Datablist
License: LicenseRef-Datablist

Files: src/integrationTest/resources/integration-testdata/id-values-*.csv
Copyright: 2023 Mark Rotteveel
License: Apache-2.0
104 changes: 54 additions & 50 deletions devdoc/adr/2023-07-configuring-value-parsing.adoc
@@ -5,8 +5,8 @@

== Status

* Draft
* Proposed for: 2.0
* Published: 2023-07-14
* Implemented in: 2.0

== Type

@@ -17,98 +17,102 @@
CSV is a string-based format, and non-`char` data types need a pattern to convert from the string format to the desired data type value.
For the integral types we can use the common format of -- in regex -- `[+-]?\d+`, but other data types can have a wide variety of formats, consider dates, or decimal values with either a comma or dot as decimal separator, and others.

While we can use a sensible format as the default, for example ISO 8601 calendar date format -- `yyyy-MM-dd` -- for `date`, this would limit the usefulness to transform real-world data (especially if you don't control the producer of the CSV).
While we can use a sensible format as the default, for example ISO 8601 calendar date format -- `yyyy-MM-dd` -- for `date`, this would limit the usefulness to read real-world data (especially if you don't control the producer of the CSV).

In short, we need a way to transform string input to a destination data type.
In short, we need a way to convert string input to a destination data type.

In the future we may also need to perform the reverse operation: transforming from the column data type to a string value.
In the future we may also need to perform the reverse operation: conversion from the column data type to a string value.

== Decision

Columns receive a common way to define a transformation, to convert a string to the eventual data type of the column.
If the transformation is absent, a default transformation for the specific data type is applied.
Columns receive a common way to define a conversion, or converter, to convert a string to the eventual data type of the column.
If the converter is absent, a default conversion for the specific data type is applied.

A transformation consists of one step to convert from a string to a (Java) data type.
A converter consists of one step to convert from a string to a (Java) data type.
The output data type of this step needs to be compatible with the Java data type of the column data type.
In the future we may introduce support for chaining multiple steps together.

For this design we're ignoring future plans to perform a reverse transformation (from external table to CSV).
We guess and assume this design will also work for that case, for example, by using the transformation in reverse direction, but we'll only evaluate that when the time comes to design for that case.
For this design we're ignoring future plans to perform a reverse conversion (from external table to CSV).
We guess and assume this design will also work for that case, for example, by using the converter in reverse direction, but we'll only evaluate that when the time comes to design for that case.

The XML configuration is extended as follows:

* `column` receives an optional element `transformation`
* `transformation` has one required element which specifies the transformation
* As a starting point, this ADR defines the following transformations.
We opted to define these in terms of Firebird data types, to avoid introducing another set of data type names (e.g. Java names).
The default transformation for the integral number data types is equivalent to using these with radix 10.
** `parseSmallint` -- transforms to Java `Short` -- with attribute:
*** `radix` (optional, defaults to `10`) -- with value in range [2, 36]
** `parseInteger` -- transforms to Java `Integer` -- with attribute:
*** `radix` (optional, defaults to `10`) -- with value in range [2, 36]
** `parseBigint` -- transforms to Java `Long` -- with attribute:
*** `radix` (optional, defaults to `10`) -- with value in range [2, 36]
** `parseInt128` -- transforms to Java `BigInteger`, but restricts it to the range of an `INT128` -- with attributes
* Data types (abstract `DatatypeType`) receive an optional element `converter`
* `converter` has one required element which specifies the converter
* As a starting point, this ADR defines the following converter.
The default conversion for the integral number data types is equivalent to using this with radix 10.
** `parseIntegralNumber` -- which will restrict values based on the enclosing type. It has one attribute:
*** `radix` (optional, defaults to `10`) -- with value in range [2, 36]
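
To make the radix handling and range restriction concrete, here is a minimal sketch of such a conversion -- the class and method names are invented for illustration and are not the actual _ext-table-gen_ code:

[source,java]
----
import java.math.BigInteger;

// Illustrative sketch only; names and structure are assumptions, not the
// actual ext-table-gen implementation.
final class ParseIntegralNumberSketch {

    private static final BigInteger INT128_MIN = BigInteger.ONE.shiftLeft(127).negate();
    private static final BigInteger INT128_MAX = BigInteger.ONE.shiftLeft(127).subtract(BigInteger.ONE);

    // For integer: Integer.parseInt already rejects values outside the 32-bit range
    static int parseInteger(String value, int radix) {
        return Integer.parseInt(value, radix);
    }

    // For int128: BigInteger accepts any magnitude, so an explicit check
    // restricts the result to the range of a Firebird INT128
    static BigInteger parseInt128(String value, int radix) {
        BigInteger result = new BigInteger(value, radix);
        if (result.compareTo(INT128_MIN) < 0 || result.compareTo(INT128_MAX) > 0) {
            throw new NumberFormatException("value out of range for INT128: " + value);
        }
        return result;
    }
}
----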

Other ADRs will define further transformations to add, for example ADR 2023-08 will add the `parseDate` transform.
Other ADRs will define further converters; for example, ADR 2023-08 will add the `parseDate` converter.

Given the design of the XML, it is possible to specify incompatible transformations.
For example, using `parseInteger` instead of `parseSmallint` for a `smallint` column:
Given the design of the XML, it is possible to specify incompatible converters.
For example, using `parseIntegralNumber` for a `char` column:

[source,xml]
----
<column name="ID">
    <smallint/>
    <transformation>
        <parseInteger radix="16"/>
    </transformation>
    <char>
        <converter>
            <parseIntegralNumber radix="16"/>
        </converter>
    </char>
</column>
----

For this specific example, we might be able to implement some flexibility, but we won't.

If the XML defines an incompatible transformation, it is rejected with an error when the XML is transformed to the _ext-table-gen_ `Column`, and exits the application.
If the XML defines an incompatible converter, it is rejected with an error when the XML is transformed to the _ext-table-gen_ `FbDatatype`, and exits the application.
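
A fail-fast check of this kind could look roughly as follows; this is a hypothetical sketch, and `IllegalArgumentException` merely stands in for whatever error _ext-table-gen_ actually raises:

[source,java]
----
// Hypothetical sketch of validating converter compatibility while deriving
// the internal model from the XML; all names are illustrative.
final class ConverterCompatibility {

    static void checkCompatible(Class<?> datatypeTargetType, Class<?> converterTargetType) {
        if (!datatypeTargetType.isAssignableFrom(converterTargetType)) {
            throw new IllegalArgumentException(
                    "converter produces " + converterTargetType.getName()
                    + ", but the data type requires " + datatypeTargetType.getName());
        }
    }
}
----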

=== Rejected Design Decisions

* Nest the transformation in the data type instead of the column
* Put the converter in the column instead of the data type.
+
Maybe this could have been used to validate the compatibility of a transformation in the XSD (we're not entirely sure of that), but this would also have increased complexity.
Our consideration for putting it in the column is that we consider the transformation a bridge between CSV column and Firebird column, so making it a property of `column` seems more appropriate.
* Allow multiple steps in a transformation.
Initially, we implemented this design, with the reasoning that it bridges between the CSV column and the Firebird column, and as such, putting it in the column felt appropriate.
+
For example, consider a conversion from string to long, and then from long (epoch milliseconds) to a datetime value, or using transformations to change formatting (e.g. from string to date, back to string with a different format).
However, during implementation, putting the converter inside the data type turned out to result in simpler code and avoided some problems with generics.
We think that making the converter an aspect of the data type in the XML config as well reduces the cognitive overhead by avoiding two diverging models.
From a design perspective, we can also argue that the data type _is_ the bridge between CSV column and Firebird column, and so putting the converter inside the data type also makes sense.
* Allow multiple steps in `converter`.
+
For example, consider a conversion from string to long, and then from long (epoch milliseconds) to a datetime value, or using converters to change formatting (e.g. from string to date, back to string with a different format).
+
While this may be useful to add in the future, all current known cases require a single step, so we'll design for that, but keep the option open to extend this to allow multiple steps in the future.
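+
A hypothetical two-step chain -- shown purely to illustrate this rejected idea, composed with plain `java.util.function.Function`; all names are invented:
+
[source,java]
----
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.util.function.Function;

// Purely illustrative of the rejected multi-step idea; not part of the design
final class ChainSketch {
    static final Function<String, Long> PARSE_LONG = Long::parseLong;
    static final Function<Long, LocalDateTime> EPOCH_MILLIS_TO_DATETIME =
            millis -> Instant.ofEpochMilli(millis).atZone(ZoneOffset.UTC).toLocalDateTime();
    // A two-step chain: String -> Long (epoch milliseconds) -> LocalDateTime
    static final Function<String, LocalDateTime> CHAINED =
            PARSE_LONG.andThen(EPOCH_MILLIS_TO_DATETIME);
}
----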
* Allow reading of incompatible transformations, failing on actual conversion from CSV to external table.
* Allow reading of incompatible converters, failing on actual conversion from CSV to external table.
+
In our opinion, this results in unnecessary complexity, and fail-fast should be preferred.
* Allow reading of incompatible transformations, by converting the transformation result back to string and then using the default transformation.
* Allow reading of incompatible converters, by converting the converter result back to string and then using the default converter.
+
Though this will work for integral types, it is unlikely to work for other data types.
Though this might work for integral types, it is unlikely to work for other data types.
Being explicit about such configuration errors is preferred over trying to make it work.
See also previous item.
* Restrict compatible converters by data type through the XSD.
+
This probably leads to a more complex XSD, and our XML/XSD-fu is also not that great.
We decided to settle on validation when the internal _ext-table-gen_ model is derived from the XML.
+
This may be revisited in the future.

== Consequences

The documentation must be updated to explain how to use these transformations, and to describe the configuration changes.
The documentation must be updated to explain how to use the converter, and to describe the configuration changes.

The XSD will be amended with the new elements, and suitable error handling will be added to reject incompatible transformations.
The XSD will be amended with the new elements, and suitable error handling will be added to reject incompatible converters.

=== Implementation consequences

Most of the consequences listed below should be considered implementation details and guidelines for implementation, which may change during actual implementation or by future refactoring.

An interface `Transformation<T, U>` is added, which provides a method `U transform(T)` to provide the actual transformation, and methods `Class<T> inputType()` and `Class<U> outputType()` to support runtime type checking.
An interface `Converter<T>` is added, which provides a method `T convert(String)` to perform the actual conversion, and a method `Class<T> targetType()` to support runtime type checking.
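
A minimal sketch of what this interface could look like -- the actual declaration may differ in its details:

[source,java]
----
// Minimal sketch of the Converter interface described above
public interface Converter<T> {

    // Converts a CSV string value to the target Java type
    T convert(String value);

    // Target Java type, used for runtime compatibility checks
    Class<T> targetType();
}
----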

`FbDatatype` will have a property for the `Converter`; when no converter is provided, the default conversion for the data type will be used (which may or may not be implemented using a `Converter`).

A `Column` will have a property for the `Transformation`, when `null`, the default from the data type will be used.
In an earlier iteration, the converter was part of the `Column` and would be applied before calling `writeValue`, with the first parameter of `writeValue` parameterized as `T` instead of `String`.
In a second iteration, the converter was passed from the column to the data type on each invocation of `writeValue`.
During a third iteration, it turned out that moving the converter into the data type and calling it in the `writeValue` method reduced issues with generic typing, and simplified handling of default behaviour.

Existing data types will be modified to accept a suitable Java data type as input, instead of string.
That is `FbDataType.writeValue(String, EncoderOutputStream)` will be changed to `FbDataType.writeValue(U, EncoderOutputStream)`, with `U` a parameterized type specifying the Java data type (we're using `U` as the type parameter to link its meaning to `Transformation<T, U>`).
`FbDataType` will also receive a method `Class<U> inputType()` to specify the input type.
A data type will define a _default_ transformation, returned from the method `Transformation<String, U> defaultTransformation()`, which the column will use when no transformation has been configured.
Existing conversion from string will be moved to that default transformation.
`FbDatatype` will also receive a method `Class<T> targetType()` to specify its Java data type.
A data type will have a _default_ conversion, which will be used when no converter has been configured.
This conversion will not be exposed, and how it is performed is an implementation detail.
Existing conversion from string will be moved to that default conversion.
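
To make this concrete, here is a rough sketch of how a data type could apply the configured converter in `writeValue`, falling back to its default conversion; it uses the `Converter` interface sketched above, `OutputStream` stands in for the project's `EncoderOutputStream`, and the class and method names other than `writeValue` are illustrative:

[source,java]
----
import java.io.IOException;
import java.io.OutputStream;

// Sketch only: how a data type might apply the configured converter, falling
// back to its (unexposed) default conversion; not the actual implementation.
abstract class DatatypeSketch<T> {

    private final Converter<T> converter; // null means: use the default conversion

    DatatypeSketch(Converter<T> converter) {
        this.converter = converter;
    }

    final void writeValue(String value, OutputStream out) throws IOException {
        T converted = converter != null ? converter.convert(value) : defaultConvert(value);
        writeConverted(converted, out);
    }

    // The default conversion is an implementation detail of the data type
    abstract T defaultConvert(String value);

    abstract void writeConverted(T value, OutputStream out) throws IOException;
}
----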

For the integral types `smallint`, `integer`, `bigint`, it may make sense if the transformation has an additional option to support conversion using a primitive type, to avoid additional overhead of object allocation.
For the integral types `smallint`, `integer`, `bigint`, it may make sense if the converter has an additional option to support conversion using a primitive type, to avoid additional overhead of object allocation.
This will be decided during implementation.
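
One conceivable shape for such an option -- purely an assumption at this point -- is a specialized sub-interface of the `Converter` sketched earlier that parses straight to a primitive:

[source,java]
----
// Hypothetical specialization to avoid boxing; not (yet) part of the design
interface IntConverter extends Converter<Integer> {

    // Primitive variant, avoiding Integer allocation on the hot path
    int convertToInt(String value);

    @Override
    default Integer convert(String value) {
        return convertToInt(value);
    }
}
----
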
67 changes: 57 additions & 10 deletions src/docs/asciidoc/chapters/_reference.adoc
@@ -314,13 +314,21 @@ Name of the column (used for the DDL; required)
[#ref-xml-datatype]
====== datatype

At this time, _ext-table-gen_ supports the following column datatypes:
A datatype is a placeholder for an element corresponding to a specific Firebird data type.

- <<ref-xml-bigint>>
- <<ref-xml-char>>
- <<ref-xml-integer>>
- <<ref-xml-int128>>
- <<ref-xml-smallint>>
.Attributes
None, though subtypes may define attributes

.Elements, in order
* <<ref-xml-converter>> -- optional

At this time, _ext-table-gen_ supports the following column data types:

* <<ref-xml-bigint>>
* <<ref-xml-char>>
* <<ref-xml-integer>>
* <<ref-xml-int128>>
* <<ref-xml-smallint>>

[float#ref-xml-bigint]
====== `bigint`
@@ -331,7 +339,7 @@ The `bigint` element represents the Firebird datatype `BIGINT`.
None

.Elements
None
* <<ref-xml-converter>> -- optional

[float#ref-xml-char]
====== `char`
@@ -348,7 +356,7 @@ Encoding (character set) of the column, using Firebird character set names (required).
The XSD uses an enum-type, `encoding`, with supported names.

.Elements
None
* <<ref-xml-converter>> -- optional

[float#ref-xml-integer]
====== `integer`
@@ -359,7 +367,7 @@ The `integer` element represents the Firebird datatype `INTEGER`.
None

.Elements
None
* <<ref-xml-converter>> -- optional

[float#ref-xml-int128]
====== `int128`
@@ -370,7 +378,7 @@ The `int128` element represents the Firebird datatype `INT128`.
None

.Elements
None
* <<ref-xml-converter>> -- optional

[float#ref-xml-smallint]
====== `smallint`
@@ -380,6 +388,45 @@ The `smallint` element represents the Firebird datatype `SMALLINT`.
.Attributes
None

.Elements
* <<ref-xml-converter>> -- optional

[#ref-xml-converter]
===== `converter`

The `converter` element -- if present -- defines a conversion from the CSV string value to a Firebird data type value.
If absent, a default conversion will be applied, specific to the column data type.

.Attributes
None

.Elements
* One of the _converter steps_ listed below -- required

.Available converter steps
* <<ref-xml-parseintegralnumber>>

[NOTE]
====
The name _converter step_ may seem to imply that a converter can consist of multiple steps chained together.
That is currently not supported, but it is something that may be introduced in a future version.
In the actual program code, a _converter step_ is also called a _converter_, but in this manual we wanted to prevent overloading the term as it already refers to the `converter` element.
====

[float#ref-xml-parseintegralnumber]
====== `parseIntegralNumber`

Parses a string, using the specified radix, to an integral number of the type of the enclosing datatype.

This converter step is only valid in <<ref-xml-bigint>>, <<ref-xml-int128>>, <<ref-xml-integer>> or <<ref-xml-smallint>>.

.Attributes
[horizontal]
`radix`::
Radix for parsing a string to an integral number.
Default `10`, range [2, 36].

.Elements
None

22 changes: 22 additions & 0 deletions src/docs/asciidoc/chapters/_usage.adoc
@@ -321,6 +321,27 @@ However, if you generate the table file on a little-endian machine, while your F

See <<ref-xml-externaltable>> for more information on `byteOrder`.

[#usage-config-converter]
=== Using a different data type conversion

Each data type has a default conversion from the CSV string value to a Firebird data type value.
This default conversion is not always suitable.
For example, a CSV with hexadecimal string values for integers cannot be read as an `integer`, because the default conversion uses radix 10.

To address this, the data type elements in the configuration have an optional element, <<ref-xml-converter>>, to specify a conversion using a _converter step_.
Currently, there is only one _converter step_, <<ref-xml-parseintegralnumber>>, which can be used to specify a different radix for converting CSV values.

For example, to convert integer values in radix 16 instead of radix 10, use the following data type definition:

[source,xml]
----
<integer>
    <converter>
        <parseIntegralNumber radix="16"/>
    </converter>
</integer>
----
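
With this configuration, a CSV value such as `7FA3` is read as the integer 32675; conceptually, the converter step behaves like the following illustrative Java snippet (not code you need to write):

[source,java]
----
// What radix 16 parsing does conceptually with a CSV value (illustration only)
int id = Integer.parseInt("7FA3", 16); // 32675
----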

[#usage-workflow]
== Recommended workflow

@@ -335,6 +356,7 @@ For repeated imports, or where multiple character sets are needed, we recommend
. Increase column lengths to expected maximums (see <<usage-config-length>>),
. If needed, change the encoding of columns (see <<usage-config-charset>>),
. Determine if there are more appropriate types for columns and change the type if needed (see <<usage-config-datatype>>),
. Determine if a custom converter is needed (see <<usage-config-converter>>),
. Finally, regenerate the config file (see <<usage-config-regen>>) for up-to-date DDL.

Use the DDL from the last step to create the external table, and use the external table file it generated to verify the definition by querying the external table from Firebird.