Hello. Idea for a sample pipeline #2

Closed
evogelpohl opened this issue Jan 3, 2021 · 2 comments

Comments

evogelpohl commented Jan 3, 2021

I enjoyed reading your post on the Databricks Auto Loader feature. I'm tinkering with a similar pipeline to process Avro files coming from Azure Event Hubs (Capture -> ADLS Gen2).

I'm not finding a lot of sample pipelines that follow the path I'm attempting.

  1. Use Databricks Auto Loader to create a DataFrame with readStream.
  2. Isolate just the [Body] column from the Avro files, which is in binary format.
  3. Decode it with the from_avro function from pyspark.sql.avro.functions and an AVSC file (a plain file, not registered in a schema registry) that describes the Body's schema.
  4. Fully flatten the Body schema (it contains one nested struct).
  5. writeStream into Delta with Trigger=Once, as a fully flattened table (then, optionally, CREATE TABLE [...]).
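
For reference, the five steps above might be sketched roughly as follows. This is only an illustration under stated assumptions, not a tested pipeline: `flattened_columns` and `run_pipeline` are hypothetical helper names, every path argument is a placeholder, and the explicit `Body BINARY` read schema assumes the standard Event Hubs Capture layout. It also assumes a Databricks runtime where the `cloudFiles` source and the Delta sink are available.

```python
import json
from typing import List, Tuple


def flattened_columns(avsc: dict, prefix: str = "") -> List[Tuple[str, str]]:
    """Walk an Avro record schema (parsed AVSC JSON) and return
    (dotted_path, flat_alias) pairs for every leaf field."""
    pairs = []
    for field in avsc["fields"]:
        path = f"{prefix}{field['name']}"
        ftype = field["type"]
        # Nullable fields are unions such as ["null", {...}]; use the single non-null branch.
        if isinstance(ftype, list):
            non_null = [t for t in ftype if t != "null"]
            if len(non_null) == 1:
                ftype = non_null[0]
        if isinstance(ftype, dict) and ftype.get("type") == "record":
            # Nested struct: recurse and extend the dotted path.
            pairs.extend(flattened_columns(ftype, prefix=f"{path}."))
        else:
            pairs.append((path, path.replace(".", "_")))
    return pairs


def run_pipeline(spark, capture_path, avsc_path, table_path, checkpoint_path):
    """Hypothetical wiring of steps 1-5; all arguments are placeholders."""
    # Imported lazily so flattened_columns can be used without Spark installed.
    from pyspark.sql.functions import col
    from pyspark.sql.avro.functions import from_avro

    with open(avsc_path) as f:
        avsc_text = f.read()

    # Steps 1-2: Auto Loader stream over the Capture files, reading only Body.
    raw = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "avro")
        .schema("Body BINARY")  # assumption: standard Capture column name
        .load(capture_path)
    )

    # Step 3: decode the binary Body with the AVSC schema text.
    decoded = raw.select(from_avro(col("Body"), avsc_text).alias("body"))

    # Step 4: select every leaf field under a flat, underscore-joined alias.
    flat = decoded.select(
        *[
            col(f"body.{path}").alias(alias)
            for path, alias in flattened_columns(json.loads(avsc_text))
        ]
    )

    # Step 5: write to Delta once, then stop.
    return (
        flat.writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)
        .trigger(once=True)
        .start(table_path)
    )
```

Deriving the flattened column list from the AVSC file itself keeps the schema in one place, so a nested struct added later would be picked up without editing the select. The helper only descends into records; arrays, maps, and multi-branch unions are left as single columns.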

I suspect many would benefit from a working demo, as Azure Event Hubs with Capture is a common pattern. If you're inclined to make one, thanks in advance. -EV

@mdrakiburrahman
Owner

@evogelpohl - thanks for the fantastic idea. I've had this on my to-do list as well, but haven't had a chance to implement it yet. Moving it up the list - will tag you with the post here once I have something working.

@mdrakiburrahman
Owner

@evogelpohl - here's the article that covers this topic.

Thanks for the idea once again, and feel free to reopen this issue if you have any questions!
