Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JDBC to Spanner support #60

Closed
toshi0607 opened this issue Mar 22, 2022 · 10 comments
Closed

JDBC to Spanner support #60

toshi0607 opened this issue Mar 22, 2022 · 10 comments
Labels
enhancement New feature or request

Comments

@toshi0607
Copy link
Contributor

toshi0607 commented Mar 22, 2022

WHAT

I was impressed that Spanner to JDBC (Cloud SQL) is supported. JDBC to Spanner is also deeply appreciated.

WHY

We are migrating Cloud SQL for PostgreSQL to Spanner and data migration is one of the most difficult part.
When types of columns used for PostgreSQL are primitive, pg_dump as CSV and Cloud Storage Text to Cloud Spanner are available. However, when using more complicated types, such as ARRAY and NUMERIC, we need different method.

HarbourBridge is introduced in GCP doc, but it doesn't work for data migration against the spec. Some of Avro to Spanner templates require to use the same format as Spanner to Avro manifest for each, so it is difficult to use them for data migration even if we can use a date export tools such as spotify/dbeam.

If this feature is supported, many migrators will be happy!!!

Features

  • Type support
    • ARRAY
    • NUMERIC
  • Column Mapping
    • Cloud SQL (column name: phone_number) -> Spanner (column name: PhoneNumber)
    • Select and modify column to sink
  • Integration with secret manager
    • to retrieve a DB password from secret manager to avoid putting password on manifest
@toshi0607 toshi0607 added the enhancement New feature or request label Mar 22, 2022
@toshi0607
Copy link
Contributor Author

toshi0607 commented Mar 24, 2022

JDBC to Spanner is already supported and we are testing the behavior. Please let me clarify how to specify a config file because currently it fails to execute a job (maybe my config fault).

Precondition

Cloud SQL table

a_table

  • column_a
  • column_b
  • column_c
  • column_d
  • column_e
  • column_f
    ...

Spanner table

ATable

  • ColumnA
  • ColumnB
  • ColumnC
  • ColumnD

Config manifest

{
  "sources": [
    {
      "name": "input",
      "module": "jdbc",
      "parameters": {
        "query": "SELECT column_a, column_b, column_c FROM public.a_table",
        "url": "jdbc:postgresql:///xxx?cloudSqlInstance=yyy:asia-northeast1:zzz&socketFactory=com.google.cloud.sql.postgres.SocketFactory",
        "driver": "org.postgresql.Driver",
        "user": "jjjj",
        "password": "kkkk"
      }
    }
  ],
  "sinks": [
    {
      "name": "spanner",
      "module": "spanner",
      "input": "input",
      "parameters": {
        "projectId": "xxxxxx",
        "instanceId": "yyyyyyy",
        "databaseId": "xxxxxxxxxx",
        "table": "TableA",
        "createTable": false,
        "keyFields": ["ColumnA", "ColumnB", "ColumnC"] // extract part of the whole schema to choose supported types (exclude ARRAY and NUMERIC)
      }
    }
  ]
}

Execution

Error

Error message from worker: com.google.cloud.spanner.SpannerException: NOT_FOUND: com.google.api.gax.rpc.NotFoundException: io.grpc.StatusRuntimeException: NOT_FOUND: Column not found in table TableA: column_a resource_type: "spanner.googleapis.com/Column" resource_name: "column_a" com.google.cloud.spanner.SpannerExceptionFactory.newSpannerExceptionPreformatted(SpannerExceptionFactory.java:284) com.google.cloud.spanner.SpannerExceptionFactory.newSpannerException(SpannerExceptionFactory.java:61) 
...

Does this mean column names are the same between Cloud SQL and Spanner or it there any way to map columns between them?

@orfeon
Copy link
Member

orfeon commented Mar 25, 2022

Thank you for sharing the details of the use case !
The field names and data types need to match the table you are writing to.

Is it possible to convert or refine the field names in jdbc source module query parameter as follows?
SELECT column_a AS ColumnA, column_b AS ColumnB, column_c AS ColumnC FROM public.a_table

@toshi0607
Copy link
Contributor Author

Thank you so much! I tried your suggestion and it worked perfectly as intended!!!

I confirmed the original issue JDBC to Spanner support is already supported. Is it better to separate feature requests, such as additional types and secret manager integration, from this issue?

@orfeon
Copy link
Member

orfeon commented Mar 28, 2022

It seems that support for arrays and numeric types will be necessary when submitting data to Spanner, so you can continue with this issue here!

@orfeon
Copy link
Member

orfeon commented Apr 7, 2022

Jdbc -> Spanner now supports Numeric, Json, and Array types.
And user and password parameter can now be taken from the secret manager resource name.
It is available from the following branch.
We are still planning to add a few Jdbc related features. I will merge them if there is no problem after that.
Please let me know if there are any problems.

https://github.com/mercari/DataflowTemplate/tree/enhance-jdbc

@orfeon
Copy link
Member

orfeon commented Apr 7, 2022

Also, we added table parameter. Instead of query parameter, you can specify the name of the table from which you want to get data.
Compared to executing a query, it has the following advantages

  • Data can be retrieved every specified number of times. This is useful when it is tough to run queries for longer periods of time in a production environment.
  • Data can be acquired at a higher speed because the query range is split and distributed execution is performed according to the load (when enableSplit of the experimental function is enabled).

If the table you want to retrieve data from is huge and it takes a long time to retrieve the data, please consider using this option. The following is an example.

https://github.com/mercari/DataflowTemplate/blob/enhance-jdbc/examples/jdbc-table-to-spanner.json

@orfeon
Copy link
Member

orfeon commented Apr 8, 2022

IAM database authentication using dataflow service account is now supported.

@toshi0607
Copy link
Contributor Author

toshi0607 commented Apr 22, 2022

Thank you for the enhancement! We are testing the enhance-jdbc branch with the schemas that include ARRAY type columns. Some tables succeeded and the other failed. May I ask you whether this case is supported?

Postgres schema

there are other columns, but this one may be the only significant difference

  • name: bar_ids
  • type: bigint[]
  • default: '{}'::bigint[]
  • nullable: false

Spanner shchema

  • name: BarIds
  • type: ARRAY<INT64>
  • default: not using default yet
  • nullable: NOT NULL

Actual data sample in postgres

 {56,78,90}
 {12}
 {34}
 {}
 {}
 {}
...

Error message

Error message from worker: java.lang.NullPointerException com.mercari.solution.util.converter.RecordToMutationConverter.setValue(RecordToMutationConverter.java:167) com.mercari.solution.util.converter.RecordToMutationConverter.setValue(RecordToMutationConverter.java:207) com.mercari.solution.util.converter.RecordToMutationConverter.convert(RecordToMutationConverter.java:53) com.mercari.solution.util.converter.DataTypeTransform$SpannerMutationDoFn.processElement(DataTypeTransform.java:419)

Other

tables including the following columns succeeded in migration

  • name: foos
  • type: text[]
  • default: none
  • nullable: true

@orfeon
Copy link
Member

orfeon commented Apr 22, 2022

Thanks for trying it out right away!
bigint type (also bigint[] type) should also be supported as INT64 type.
I checked the error message and found that there is a bug that causes an error if the value of a field of type Integer (TinyInt, SmallInt, etc.) is NULL.
I have committed the hot fix to the branch in a hurry.
(We will also added a check for the case when Integer type is NULL)

@toshi0607
Copy link
Contributor Author

Thank you for the prompt fix! I confirmed the above case worked correctly 👍 I'll try other cases, such as NUMERIC and report if I find something!

@orfeon orfeon mentioned this issue May 21, 2022
@orfeon orfeon closed this as completed Feb 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants