Add jdbc connector #931

gurbuzali · 2018-06-21T09:19:04Z

Fixes #869

cangencer · 2018-06-21T12:08:28Z

hazelcast-jet-core/pom.xml

+        <dependency>
+            <groupId>com.h2database</groupId>
+            <artifactId>h2</artifactId>
+            <version>${h2.version}</version>


This should have test scope. You can also embed the version, as with the other test libraries.

cangencer · 2018-06-21T12:09:39Z

hazelcast-jet-core/src/main/java/com/hazelcast/jet/core/processor/SourceProcessors.java

+     */
+    public static <T> ProcessorMetaSupplier readJdbcP(
+            @Nonnull DistributedSupplier<java.sql.Connection> connectionSupplier,
+            @Nonnull DistributedFunction<java.sql.Connection, Statement> statementFn,


how come sink works with PreparedStatement and source only with Statement?

PreparedStatement extends Statement, I'll use Statement for both source and sink

I think it makes sense how it was. PreparedStatement extends Statement and allows it to bind variables. For source, we don't need to bind variables, because the statement is executed only once.

Right, I thought it would get repeatedly executed, but we get a cursor so don't need to deal with pagination.

cangencer · 2018-06-21T12:11:17Z

hazelcast-jet-core/src/main/java/com/hazelcast/jet/pipeline/Sources.java

-     *                    projection, use {@link Util#mapPutEvents} to pass only {@link
-     *                    com.hazelcast.core.EntryEventType#ADDED ADDED} and {@link
-     *                    com.hazelcast.core.EntryEventType#UPDATED UPDATED} events.
+     *                    projection, use {@link com.hazelcast.jet.Util#mapPutEvents} to pass


no need for explicit imports here

I've converted these to explicit imports so that I can use com.hazelcast.jet.impl.util.Util with import

cangencer · 2018-06-21T12:20:04Z

hazelcast-jet-core/src/main/java/com/hazelcast/jet/impl/connector/ReadJdbcP.java

+    @Override
+    public void close(@Nullable Throwable error) throws Exception {
+        if (statement != null) {
+            statement.close();


this can throw exception, which must be caught

Closing the connection closes all the statements. No need to catch, exception in close is just logged and ignored.

I meant that if statement.close() throws, connection.close() will not be called

I know, I meant connection.close() is enough, no need to close the statement.

Checking google, people tend to recommend closing all three: connection, statement and resultset. Even though after connection is closed there's nothing to do for the other two, they say that it's not specified and that some drivers misbehave...

viliam-durina · 2018-06-21T13:46:39Z

hazelcast-jet-core/src/main/java/com/hazelcast/jet/pipeline/Sinks.java

+    @Nonnull
+    public static <T> Sink<T> jdbc(@Nonnull String connectionUrl,
+                                   @Nonnull String updateQuery,
+                                   @Nonnull DistributedBiConsumer<PreparedStatement, T> updateFn) {


It would be better if this were just bindFn: a function expected to bind parameters in the statement, but not expected to call addBatch. The flushFn does executeBatch, and I'm not sure executeBatch wouldn't fail if nothing was added, for example if the user used executeUpdate instead of addBatch.

This is the convenience method; in the full version the user may choose to execute the query directly in the updateFn rather than add it to a batch. I know batching is better performance-wise, but this gives the user flexibility. If you want batching, use addBatch and executeBatch. If you don't, just execute the query and do nothing in the flush function.

Is there any scenario where we would not want to batch?

From the javadoc: "SQLException if a database access error occurs, this method is called on a closed Statement or the driver does not support batch statements."

https://stackoverflow.com/questions/27079070/jdbc-driver-doesnt-support-batch-update-with-retrieval-of-identity-column-why

https://community.cloudera.com/t5/Interactive-Short-cycle-SQL/impala-jdbc-not-support-UPSERT-preparestatement-KUDU/m-p/54419

We can still have bindFn expected only to bind parameters. We can query the driver if batch is supported. If it is, we'll do addBatch/executeBatch. If it's not, we'll do executeQuery only. I think it makes sense for this simple version.

ok, we can do that
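
The agreed approach (a bind-only function, with the processor choosing between batch and per-item execution) could look roughly like the sketch below. This is not the PR's code: `JdbcFlushSketch`, `BindFn` and `flush` are hypothetical names, and the non-batch path uses executeUpdate on the assumption that the statement is an update. The only driver query used is `DatabaseMetaData.supportsBatchUpdates()`.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

final class JdbcFlushSketch {

    /** Hypothetical bind-only callback: sets parameters, never executes. */
    @FunctionalInterface
    interface BindFn<T> {
        void bind(PreparedStatement stmt, T item) throws SQLException;
    }

    /** Asks the driver whether it supports batch updates. */
    static boolean supportsBatch(Connection con) throws SQLException {
        return con.getMetaData().supportsBatchUpdates();
    }

    /** Writes the buffered items, batching only when the driver allows it. */
    static <T> void flush(PreparedStatement stmt, List<T> items,
                          BindFn<T> bindFn, boolean batchSupported) throws SQLException {
        if (items.isEmpty()) {
            return; // nothing buffered: executeBatch is never called on an empty batch
        }
        for (T item : items) {
            bindFn.bind(stmt, item);
            if (batchSupported) {
                stmt.addBatch();
            } else {
                stmt.executeUpdate(); // assumption: one round trip per item
            }
        }
        if (batchSupported) {
            stmt.executeBatch();
        }
    }
}
```

Returning early on an empty list also sidesteps the question of whether executeBatch fails when nothing was added.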

viliam-durina · 2018-06-21T13:52:49Z

hazelcast-jet-core/src/main/java/com/hazelcast/jet/pipeline/Sinks.java

+    @Nonnull
+    public static <T> Sink<T> jdbc(@Nonnull String connectionUrl,
+                                   @Nonnull String updateQuery,
+                                   @Nonnull DistributedBiConsumer<PreparedStatement, T> updateFn) {


Here we also require the connection to be in auto-commit mode because we don't commit. We can add

if (!con.getAutoCommit()) { con.commit(); }

to the flushFn.

From the javadoc: "By default a Connection object is in auto-commit mode..."

Yes, but the user can change it. We depend on auto-committing, so we should document it. But I'd still prefer not to depend on it, which the code above achieves.

The user cannot change it: this is the convenience method, and we create the connection ourselves from the connectionUrl:
() -> uncheckCall(() -> DriverManager.getConnection(connectionUrl)),

Ah, true. Then I'll disable auto-commit and commit in flushFn. It's much more performant.
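
A minimal sketch of that conclusion, assuming the processor owns the connection: auto-commit is switched off once, and each flush ends with an explicit commit. `CommitOnFlushSketch`, `prepare` and `flush` are hypothetical names, not taken from the PR.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

final class CommitOnFlushSketch {

    /** Called once after the connection is opened: turn auto-commit off. */
    static void prepare(Connection con) throws SQLException {
        con.setAutoCommit(false);
    }

    /** Sends the pending batch, then commits unless the connection auto-commits. */
    static void flush(Connection con, PreparedStatement stmt) throws SQLException {
        stmt.executeBatch();
        if (!con.getAutoCommit()) {
            con.commit();
        }
    }
}
```

Keeping the `getAutoCommit()` check means the same flush also works for a user-supplied connection that was left in auto-commit mode.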

viliam-durina · 2018-06-21T13:53:56Z

What about error handling? I think we should try to reconnect, afaik JDBC drivers don't support this transparently.

viliam-durina · 2018-06-21T14:30:02Z

hazelcast-jet-core/src/main/java/com/hazelcast/jet/pipeline/Sinks.java

+     * DistributedFunction, DistributedBiConsumer, DistributedBiConsumer)}.
+     */
+    @Nonnull
+    public static <T> Sink<T> jdbc(@Nonnull String connectionUrl,


We also need the driver class here, and should call Class.forName(driverClass) before calling DriverManager.getConnection. JDBC drivers register themselves in a static initializer in the driver class; if we don't do this, the connection will fail. The test doesn't fail because DeleteDbFiles.execute likely loads the driver.

In previous versions of JDBC, to obtain a connection, you first had to initialize your JDBC driver by calling the method Class.forName. This method required an object of type java.sql.Driver. Each JDBC driver contains one or more classes that implements the interface java.sql.Driver. ... Any JDBC 4.0 drivers that are found in your class path are automatically loaded. (However, you must manually load any drivers prior to JDBC 4.0 with the method Class.forName.)
JDBC 4.0 was introduced with Java 6.

Ok, didn't know that. Maybe you should test whether it works with the dynamic class loading when the driver is submitted in job resources.

I'll test it
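
One way to check what the automatic loading actually registered is to list the drivers known to DriverManager, as in the sketch below. `ListJdbcDrivers` is a hypothetical helper, not part of the PR. Note that `DriverManager.getDrivers()` only returns drivers the caller has access to, which is why the job-resources case raised above is worth testing separately.

```java
import java.sql.Driver;
import java.sql.DriverManager;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ListJdbcDrivers {

    /** Returns the class names of the drivers currently visible through DriverManager. */
    static List<String> registeredDrivers() {
        List<String> names = new ArrayList<>();
        for (Driver driver : Collections.list(DriverManager.getDrivers())) {
            names.add(driver.getClass().getName());
        }
        return names;
    }
}
```

If the expected driver class is missing from the returned list, falling back to an explicit Class.forName would be the next thing to try.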

viliam-durina · 2018-06-27T09:34:36Z

hazelcast-jet-core/src/main/java/com/hazelcast/jet/impl/connector/ReadJdbcP.java

+        return null;
+    }
+
+    private static ResultSetForPartitionFunction resultSetFunction(String query) {


This function should be inlined or its name should be more specific, something like singlePartitionResultSetFn

viliam-durina · 2018-06-27T09:35:07Z

hazelcast-jet-core/src/main/java/com/hazelcast/jet/impl/connector/ReadJdbcP.java

+
+    private static ResultSetForPartitionFunction resultSetFunction(String query) {
+        return (connection, parallelism, index) -> {
+            PreparedStatement statement = uncheckCall(() -> connection.prepareStatement(query));


we can assert parallelism == 1, otherwise it would emit duplicates.

we use ProcessorMetaSupplier.forceTotalParallelismOne on SourceProcessors.readJdbcP, isn't it enough ?

Yes, it is, but it's in a different place. The assert is a safety net in case anything goes wrong...
Btw, why do we return a PS from the supplier and then wrap it in another place? It has to be wrapped anyway, so why not return a PMS right away?
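
To illustrate the duplicate concern: if every processor runs the same unpartitioned query, each row is emitted once per processor, whereas a `MOD(id, parallelism) = index` predicate gives each row to exactly one processor. The sketch below simulates this in plain Java, without a database. `ModPartitionSketch` and its methods are hypothetical, and ids are assumed to be non-negative.

```java
import java.util.ArrayList;
import java.util.List;

final class ModPartitionSketch {

    /** Ids one processor would read under the predicate MOD(id, parallelism) = index. */
    static List<Integer> idsFor(List<Integer> allIds, int parallelism, int index) {
        List<Integer> out = new ArrayList<>();
        for (int id : allIds) {
            if (Math.floorMod(id, parallelism) == index) {
                out.add(id);
            }
        }
        return out;
    }

    /** Everything emitted when all processors run the same unpartitioned query. */
    static List<Integer> unpartitioned(List<Integer> allIds, int parallelism) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            out.addAll(allIds); // each processor reads the full result set
        }
        return out;
    }
}
```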

viliam-durina · 2018-06-27T13:36:39Z

hazelcast-jet-core/src/main/java/com/hazelcast/jet/impl/connector/WriteJdbcP.java

+            flushFn.accept(connection, statement);
+            itemList.clear();
+        } catch (Exception e) {
+            if (e.getCause() instanceof SQLNonTransientException) {


This is inverted
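
My reading of "inverted" (an assumption, since the rest of the method is not shown) is that a non-transient failure should be rethrown while anything else should trigger a reconnect, not the other way round. A sketch of that decision, with the hypothetical name `isRetryable`:

```java
import java.sql.SQLNonTransientException;

final class RetryDecisionSketch {

    /**
     * Returns true when retrying or reconnecting might help. A
     * SQLNonTransientException, either thrown directly or wrapped as the cause,
     * will fail again unless its cause is corrected, so it is not retryable.
     */
    static boolean isRetryable(Exception e) {
        return !(e instanceof SQLNonTransientException)
                && !(e.getCause() instanceof SQLNonTransientException);
    }
}
```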

viliam-durina · 2018-06-27T13:37:18Z

hazelcast-jet-core/src/main/java/com/hazelcast/jet/impl/connector/WriteJdbcP.java

+    @Override
+    public void init(@Nonnull Outbox outbox, @Nonnull Context context) {
+        logger = context.logger();
+        connection = connectionSupplier.get();


We should not fail on a connection error here either. We can assign idleCount=1 and call reconnectIfNecessary

viliam-durina · 2018-06-27T13:37:32Z

hazelcast-jet-core/src/main/java/com/hazelcast/jet/impl/connector/WriteJdbcP.java

+public final class WriteJdbcP<T> implements Processor {
+
+    private static final IdleStrategy IDLER =
+            new BackoffIdleStrategy(0, 0, MILLISECONDS.toNanos(1), SECONDS.toNanos(10));


Shortest idle can be 1 second

viliam-durina · 2018-06-27T13:38:34Z

hazelcast-jet-core/src/main/java/com/hazelcast/jet/impl/connector/WriteJdbcP.java

+    private List<T> itemList = new ArrayList<>();
+    private int idleCount;
+
+    public WriteJdbcP(


Can be made private

viliam-durina · 2018-06-29T07:35:32Z

hazelcast-jet-core/src/main/java/com/hazelcast/jet/impl/connector/WriteJdbcP.java

+                idleCount++;
+            }
+        }
+        idleCount = 0;


We're missing a test for reconnection; I think it doesn't work.
This line will always assign idleCount=0 in case of a recoverable exception
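
The fix implied here is to reset the counter only after a successful reconnect, so that it keeps growing (and the backoff keeps lengthening) across failed attempts. A minimal sketch under that assumption; `ReconnectSketch` is hypothetical and uses a plain `Supplier` in place of the PR's connection supplier.

```java
import java.util.function.Supplier;

final class ReconnectSketch<C> {

    private final Supplier<C> connectionSupplier;
    private C connection;
    private int idleCount;

    ReconnectSketch(Supplier<C> connectionSupplier) {
        this.connectionSupplier = connectionSupplier;
    }

    /** Makes one connection attempt; returns true when a connection is available. */
    boolean reconnectIfNecessary() {
        if (connection != null) {
            return true;
        }
        try {
            connection = connectionSupplier.get();
            idleCount = 0; // reset only after a successful attempt
            return true;
        } catch (RuntimeException e) {
            idleCount++; // carried over to the next attempt
            return false;
        }
    }

    int idleCount() {
        return idleCount;
    }
}
```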

viliam-durina · 2018-06-29T07:36:40Z

hazelcast-jet-core/src/main/java/com/hazelcast/jet/impl/connector/WriteJdbcP.java

+                statement.addBatch();
+            }
+            statement.executeBatch();
+            itemList.clear();


We're missing a commit here

viliam-durina · 2018-06-29T07:38:06Z

hazelcast-jet-core/src/main/java/com/hazelcast/jet/impl/connector/WriteJdbcP.java

+    @Override
+    public void close() throws Exception {
+        Exception stmtException = close(statement);
+        Exception connectionException = close(connection);


We should check for null connection and statement

viliam-durina · 2018-06-29T07:39:19Z

hazelcast-jet-core/src/main/java/com/hazelcast/jet/impl/connector/WriteJdbcP.java

+        Exception stmtException = close(statement);
+        Exception connectionException = close(connection);
+        if (stmtException != null) {
+            throw stmtException;


There's also no need to throw, we can just do

closeWithLogging(logger, statement); closeWithLogging(logger, connection);

and add null-check inside of closeWithLogging
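
A null-safe, log-and-continue close along these lines might look as follows. This is a sketch only: it uses `java.util.logging.Logger` as a stand-in for the project's logger, and the boolean return value is my addition for illustration.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

final class CloseWithLoggingSketch {

    /**
     * Closes the resource if it is non-null. A failure is logged, not thrown,
     * so a failing statement.close() cannot prevent connection.close().
     * Returns false only when close() threw.
     */
    static boolean closeWithLogging(Logger logger, AutoCloseable resource) {
        if (resource == null) {
            return true; // nothing was opened, nothing to close
        }
        try {
            resource.close();
            return true;
        } catch (Exception e) {
            logger.log(Level.WARNING, "Failed to close " + resource, e);
            return false;
        }
    }
}
```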

viliam-durina · 2018-06-29T07:43:25Z

hazelcast-jet-core/src/main/java/com/hazelcast/jet/impl/connector/WriteJdbcP.java

+        try {
+            for (T item : itemList) {
+                bindFn.accept(statement, item);
+                statement.addBatch();


We can also support drivers which don't support batch updates; it's easy to do.

cangencer · 2018-06-29T13:48:13Z

hazelcast-jet-core/src/main/java/com/hazelcast/jet/pipeline/Sources.java

+     * processor. For example:
+     * <pre> {@code
+     *  (connection, parallelism, index) ->
+     *      PreparedStatement stmt = connection.prepareStatement("select * from TABLE where mod(id,%d)=%d)


I think it's more typical that SELECT is capitalized and table is lowercase

cangencer · 2018-06-29T13:49:52Z

hazelcast-jet-core/src/main/java/com/hazelcast/jet/pipeline/Sources.java

+     *         },
+     *         (con, parallelism, index) -> {
+     *             try {
+     *                 return con.prepareStatement("select * from TABLE where mod(id, ?) = ?);


this code is repeated twice in the javadoc

cangencer · 2018-06-29T13:53:34Z

hazelcast-jet-core/src/main/java/com/hazelcast/jet/pipeline/ResultSetForPartitionFunction.java

+
+/**
+ * Represents a function that accepts a JDBC connection to the database,
+ * a total parallelism and a processor index as arguments and


should say "the total processor" and "processor index"

cangencer · 2018-06-29T13:54:18Z

hazelcast-jet-core/src/main/java/com/hazelcast/jet/pipeline/ResultSetForPartitionFunction.java

+     * @param parallelism the total parallelism for the processor
+     * @param index the global processor index
+     */
+    ResultSet createResultSet(Connection connection, int parallelism, int index);


the function name says partition, but it's named index here.

method should also throw SQLException
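
Declaring the checked exception on the functional method lets callers write the lambda without a try/catch around the JDBC calls. The sketch below shows the idea with a hypothetical `ResultSetFn` interface standing in for the PR's type; the table name `person` is also made up.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

final class ThrowingFnSketch {

    /** Hypothetical stand-in for the PR's function type, with the exception declared. */
    @FunctionalInterface
    interface ResultSetFn {
        ResultSet createResultSet(Connection connection, int parallelism, int index)
                throws SQLException;
    }

    /** No try/catch needed: SQLException propagates through the interface method. */
    static final ResultSetFn PARTITIONED = (con, parallelism, index) -> {
        PreparedStatement stmt =
                con.prepareStatement("SELECT * FROM person WHERE MOD(id, ?) = ?");
        stmt.setInt(1, parallelism);
        stmt.setInt(2, index);
        return stmt.executeQuery();
    };
}
```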

cangencer · 2018-06-29T14:02:32Z

hazelcast-jet-core/src/test/java/com/hazelcast/jet/impl/connector/ReadJdbcPTest.java

+    }
+
+    @Test
+    public void testPartitionedQuery() {


tests should use the when_ convention..

cangencer · 2018-06-29T14:11:31Z

hazelcast-jet-core/src/main/java/com/hazelcast/jet/pipeline/ResultSetForPartitionFunction.java

+ * limitations under the License.
+ */
+
+package com.hazelcast.jet.pipeline;


should be under pipeline.jdbc package if there will be several such functions

cangencer · 2018-06-29T14:13:03Z

hazelcast-jet-core/src/main/java/com/hazelcast/jet/impl/connector/ReadJdbcP.java

+import static com.hazelcast.jet.impl.util.Util.uncheckRun;
+
+/**
+ * Private API, use {@link SourceProcessors#readJdbcP}.


this is not an API, but impl class so the comment doesn't make sense. You can just refer to the processor static method and it's enough I think.

There are 12 matches for "Private API" in code...

cangencer · 2018-06-29T14:13:59Z

hazelcast-jet-core/src/main/java/com/hazelcast/jet/impl/connector/ReadJdbcP.java

+
+    @Override
+    protected void init(@Nonnull Context context) {
+        connection = connectionSupplier.get();


minor but inconsistent use of this

cangencer · 2018-06-29T14:19:14Z

hazelcast-jet-core/src/main/java/com/hazelcast/jet/impl/connector/WriteJdbcP.java

+import static java.util.concurrent.TimeUnit.SECONDS;
+
+/**
+ * Private API, use {@link SinkProcessors#writeJdbcP}.


see comments on ReadJdbcP

cangencer · 2018-06-29T14:23:56Z

hazelcast-jet-core/src/main/java/com/hazelcast/jet/impl/connector/WriteJdbcP.java

+
+    @Override
+    public void close() {
+        closeWithLogging(statement);


looks like a different closing approach from ReadJdbcP

Yes, we are already logging while reconnecting instead of throwing the exception

add jdbc connector

91b6132

gurbuzali requested a review from cangencer June 21, 2018 09:19

gurbuzali self-assigned this Jun 21, 2018

gurbuzali added enhancement core Pipeline API and removed Pipeline API labels Jun 21, 2018

gurbuzali added this to the 0.7 milestone Jun 21, 2018

hazelcast deleted a comment from devOpsHazelcast Jun 21, 2018

cangencer reviewed Jun 21, 2018

review comments

6ea08d4

viliam-durina reviewed Jun 21, 2018

use Statement instead of PreparedStatement for sink too

1c273a6

viliam-durina reviewed Jun 21, 2018

Ali Gurbuz and others added 5 commits June 22, 2018 11:11

more review comments

62d10cc

Javadoc

caa3d76

convert to PreparedStatement for sinks, introduce ResultSetFunctionFo…

ca129f2

…rPartition

Merge branch 'master' of github.com:hazelcast/hazelcast-jet into jdbc

fcf96b5

compilation fix

e0982f7

hazelcast deleted a comment from devOpsHazelcast Jun 27, 2018

Ali Gurbuz and others added 2 commits June 27, 2018 11:23

compilation fix

8c95864

Touchups

935e099

viliam-durina reviewed Jun 27, 2018

Ali Gurbuz added 2 commits June 27, 2018 13:44

test fix

7284f3e

add error handling for jdbc sink

dd23c34

viliam-durina reviewed Jun 27, 2018

viliam-durina reviewed Jun 28, 2018

more review comments

8839847

viliam-durina suggested changes Jun 29, 2018

viliam-durina reviewed Jun 29, 2018

viliam-durina and others added 3 commits June 29, 2018 09:53

Update javadoc

198abce

add loop for reconnection

842c7ad

refactor test

e91c4ac

hazelcast deleted a comment from devOpsHazelcast Jun 29, 2018

cangencer reviewed Jun 29, 2018

Ali Gurbuz added 2 commits June 29, 2018 18:30

review comments

f8fdd4b

rename result set function

a006dc1

hazelcast deleted a comment from devOpsHazelcast Jul 2, 2018

remove try/catch

b89c5e2

cangencer approved these changes Jul 2, 2018

viliam-durina approved these changes Jul 2, 2018

hazelcast deleted a comment from devOpsHazelcast Jul 2, 2018

gurbuzali merged commit 8b9548b into hazelcast:master Jul 2, 2018

gurbuzali deleted the jdbc branch July 9, 2018 08:06
