
[SUPPORT] How to store schema in schema registry for capturing nested arrays and JSONs with column names for Hudi table onboarding from Mongo data source? #253

torvalds-dev-testbot bot opened this issue Aug 22, 2024 · 5 comments

Tips before filing an issue

  • Have you gone through our FAQs?

  • Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.

  • If you have triaged this as a bug, then file an issue directly.

Describe the problem you faced

Can someone suggest how to store the schema in a schema registry so that it captures all complex nested arrays and JSON objects with their column names, for Hudi table onboarding when the source is a Mongo data source (e.g., when opting for data with a static schema)?

To Reproduce

Steps to reproduce the behavior:
1.
2.
3.
4.

Expected behavior

A clear and concise description of what you expected to happen.

Environment Description

  • Hudi version :

  • Spark version :

  • Hive version :

  • Hadoop version :

  • Storage (HDFS/S3/GCS..) :

  • Running on Docker? (yes/no) :

Additional context

Add any other context about the problem here.

Stacktrace

Add the stacktrace of the error.

torvalds-dev-testbot (author) commented:

Hello! 👋 Please wait while I generate the follow-up questions; this usually takes about 30 seconds.

torvalds-dev-testbot (author) commented:

Based on the provided documentation snippets, to store the schema in a schema registry so that it captures all complex nested arrays and JSON objects with column names when onboarding a Hudi table from a Mongo data source with static-schema data, you can follow these steps:

  1. Enable schema reconciliation using the config hoodie.datasource.write.reconcile.schema (used in the Spark write sketch further below).
  2. Configure the schema-evolution-on-write scenarios to support schema changes such as adding nullable fields or promoting field datatypes.
  3. Use the Hudi utilities to ingest data from Kafka with the Schema Registry and Confluent Kafka set up; a sketch of a nested Avro schema you could register follows this list.
  4. Use the AWS Glue Data Catalog with Apache Hudi to register tables and interact with data stored in S3.
  5. Use Spark to write data to a Hudi table with specific options such as table type, record key field, table name, partitioning, and more (see the second sketch below).
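Since the question centers on naming every nested column, here is a minimal sketch in Scala of the kind of Avro schema you could register for a Mongo-sourced topic. The field names (`orderId`, `customer`, `items`, etc.) are hypothetical placeholders for your own collection's fields; nested `record` types give each JSON object's columns explicit names, and an `array` of records covers nested lists:

```scala
import org.apache.avro.Schema

// Hypothetical Avro schema modeling a Mongo document that contains a
// nested JSON object ("customer") and an array of nested records ("items").
// This JSON is what you would register under the topic's subject in the
// schema registry; every nested column gets an explicit name and type.
val avroJson =
  """{
    |  "type": "record",
    |  "name": "order",
    |  "namespace": "com.example",
    |  "fields": [
    |    {"name": "orderId", "type": "string"},
    |    {"name": "customer", "type": {
    |      "type": "record", "name": "customer_rec", "fields": [
    |        {"name": "id", "type": "string"},
    |        {"name": "email", "type": ["null", "string"], "default": null}
    |      ]}},
    |    {"name": "items", "type": {"type": "array", "items": {
    |      "type": "record", "name": "item_rec", "fields": [
    |        {"name": "sku", "type": "string"},
    |        {"name": "qty", "type": "int"}
    |      ]}}}
    |  ]
    |}""".stripMargin

// Parsing locally validates the schema before registering it.
val schema: Schema = new Schema.Parser().parse(avroJson)
```

Declaring optional fields as `["null", ...]` unions with a `null` default keeps later additive changes (step 2 above) backward compatible.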

By following these steps and configurations, you can effectively store the schema in a schema registry to handle complex nested data structures in Hudi tables when onboarding data from a Mongo source.
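To make steps 1 and 5 concrete, here is a minimal Spark (Scala) write sketch, not a definitive setup: the table name, field names, and base path are hypothetical placeholders, and the incoming `DataFrame` is assumed to be already deserialized from the Mongo-sourced stream.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical helper: write a Mongo-sourced DataFrame to a Hudi table.
def writeToHudi(df: DataFrame): Unit = {
  df.write
    .format("hudi")
    .option("hoodie.table.name", "orders")                           // hypothetical table name
    .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")   // storage type
    .option("hoodie.datasource.write.recordkey.field", "orderId")    // record key field
    .option("hoodie.datasource.write.precombine.field", "updatedAt") // dedup/ordering field
    .option("hoodie.datasource.write.partitionpath.field", "region") // partitioning
    // Step 1: reconcile incoming schemas with the table schema so newly
    // added nullable fields from Mongo documents are merged in.
    .option("hoodie.datasource.write.reconcile.schema", "true")
    .mode(SaveMode.Append)
    .save("s3://your-bucket/hudi/orders") // hypothetical base path
}
```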


torvalds-dev-testbot (author) commented:

Not satisfied with the answer.
Can you take another look at the answer you generated?

torvalds-dev-testbot (author) commented:

Ticket created successfully. Here is the link to the GitHub issue: #253
