Snowflake integration #12

Closed · joshuataylor opened this issue May 29, 2022 · 22 comments

@joshuataylor commented May 29, 2022

Hi!

At the company I work for, we use Snowflake. It's pretty good as a data warehouse. It uses JSON as a response format, and also Arrow for supported languages (since Erlang/Elixir doesn't have an Arrow library, we would have to use a Rust NIF to get that working, and the current JSON responses work "good enough" for us). See Snowflake's blog post about Arrow.

I also love Elixir, and so does the company I work for.

I have written a Snowflake library for Elixir, and also an Ecto integration, both of which we use to drive internal apps. They work pretty well, but I want to clean up the library a bit. I'm looking over req_bigquery now as well.

I'm now looking into adding Snowflake support, and will support it via a feature branch on my fork until I've got the public snowflake_elixir libraries up to a standard I'd be comfortable with.

TLDR:
1/ Create a generic snowflake_elixir library, similar to req_bigquery. Maybe req_snowflake ;)
2/ Make snowflake_elixir_dbconnection rely on this
3/ Make snowflake_elixir_ecto rely on snowflake_elixir_dbconnection

Thoughts?

edit:
After an hour, got the table integration in and it works! I'll finish off the tests then create a PR.

[screenshot of the working table integration]

@josevalim (Contributor)

Hi @joshuataylor! We actually discussed Snowflake and did early prototypes (aiming for the Arrow version); we made some progress but never fully concluded it.

In any case, we agree that getting started with the JSON one is good enough! However, we were really planning to go the req_snowflake route. SnowflakeEx works great for Ecto projects, but I believe the req_snowflake route will be considerably leaner, which is important to reduce the notebook footprint. What do you think?

You may actually even be able to go the other way and have SnowflakeEx use req_snowflake. Is this a route you would be willing to explore?

Thank you! ❤️

@joshuataylor (Author)

Yes, I'd be happy to create a version for req_snowflake, as we would use it.

Initially this would use JSON, but we can work on getting it to work for Arrow.

I've worked quite a bit on reverse-engineering the Snowflake protocol, so I'm happy to champion that.

@josevalim (Contributor)

@joshuataylor let's create a separate issue then for Snowflake Arrow. My understanding was that they dump the results to S3 (or S3-like) storage. This is a similar design to AWS Athena, so we can also wait until req_athena is done before we focus on Arrow support. For parsing the results, we can use Explorer (as an optional dependency).

@aleDsz, can you please get started with the issue and dump your findings?

@joshuataylor (Author) commented May 29, 2022

Yeah, it's in a streaming format for Arrow. I also wrote a connector for Rust. You download the parts as you need them, but that's unrelated to the format.
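
To make the "download the parts" bit concrete, here's a rough sketch of how fetching the extra result chunks with Req could look; the chunks/chunkHeaders field names are based on reading the official connectors, so treat them as assumptions about the undocumented API:

    # Sketch: large Snowflake results reference extra chunks hosted on
    # S3-like storage; each chunk is fetched separately with shared headers.
    # Assumes `response` is the decoded JSON body of a query response.
    chunk_headers = Map.get(response["data"], "chunkHeaders", %{})

    chunks =
      for chunk <- Map.get(response["data"], "chunks", []) do
        Req.get!(chunk["url"], headers: chunk_headers).body
      end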

@joshuataylor (Author)

So I've started working on req_snowflake, and should have a repository for review and comments in the next day or so.

I haven't used BigQuery, so I'm not sure how that works -- but req_bigquery looks great.

Snowflake is considered a "Data Warehouse", but it can be used almost exactly the same way Postgres is used, which is why snowflake_elixir was pretty straightforward to write once I had figured out how the Snowflake REST API worked; it tied in with db_connection pretty easily. We've been using this for 18 months for our internal apps at work, and it's been working quite well. Porting to req_snowflake was pretty straightforward once I understood how Req works with request/response steps. I really like it.
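
For anyone unfamiliar with Req's step model, here's a minimal sketch of the plugin shape req_bigquery uses; the option and step names below are illustrative, not the actual req_snowflake API:

    defmodule ReqSnowflake do
      # Illustrative plugin skeleton: register the plugin's options,
      # merge in the caller's values, then append a request step that
      # shapes the HTTP request before it is sent.
      def attach(%Req.Request{} = request, options \\ []) do
        request
        |> Req.Request.register_options([:snowflake_query])
        |> Req.Request.merge_options(options)
        |> Req.Request.append_request_steps(snowflake_run: &run/1)
      end

      defp run(%Req.Request{} = request) do
        # Read the registered option and rewrite the request accordingly.
        _query = request.options[:snowflake_query]
        request
      end
    end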

I have spent a considerable amount of time spelunking through Snowflake's REST API internals (Arrow, JSON responses, header quirks, etc.), so I'm also happy to chat about what I've learned about their implementation; it's a bit wonky in some places, but overall pretty good. Their Python/Golang/NodeJS connectors are pretty good now as well, and you can read through their code to figure out how they do things. I think we'll copy the way they do Snowflake stages as well.

Also, as a random note: about two years ago I worked on a query-editor-style app, similar to Heroku Dataclips, built for Snowflake, but I gave up on it as some of the technology wasn't there yet. Livebook/Table is exactly what I've been wanting, and it's very polished. So I plan on open-sourcing this work and porting it to Livebook. I reckon a proper query editor would be a killer feature for Livebook; autocompletion on par with DataGrip (the best SQL editor I've used, hands down) would be great.

@josevalim (Contributor)

Perfect, we would be very happy to discuss and explore those ideas with you!

@joshuataylor (Author)

I have created a PR here: joshuataylor/req_snowflake#6

Please give feedback if you'd like; feedback from anyone else reading this would also be appreciated.

I am happy to also eventually move this to a generic repo of req plugins if needed.

Maybe we should reopen this ticket when this repo is ready?

@josevalim (Contributor)

Amazing! @wojtekmach is definitely the best person to review it. Also, I think we can keep this issue open too!

Btw, would you be OK with transferring the repo to the livebook org before merging the kino_db integration? Of course you would retain your access, but we also want to have access to all integrations that can be started with a single click from the UI. :)

Thank you! <3

@joshuataylor (Author)

Yep, I pinged him on Slack about this as well, as I wanted to reach out with some Req questions.

Also yep -- I'm happy to move it when required; I'm just not sure of the process.

@josevalim (Contributor)

@joshuataylor the easiest way I know is to transfer it to me, then I transfer to the org. But let's not worry about it right now :D

@aleDsz (Member) commented May 30, 2022

@joshuataylor Hi, awesome to see you already know how the Snowflake REST API works; that's exactly what I was trying to figure out in order to receive the Arrow response format.

I've read the Python connector repo to see how to make the Elixir version impersonate a "Python client" so Snowflake returns the response in Arrow format, but I didn't make much progress with that.

Looking at your current PR for req_snowflake, I'm very excited to see how we could support the Arrow format, but happy to see your initial work with JSON.

As you need to use S3, what do you think about using req_s3?

@joshuataylor (Author) commented May 30, 2022

If you set the application to mimic other clients (Python/Java/etc.), it will sometimes give you Arrow and other times JSON. It seemed (as of two years ago; this could have changed) that using any application name apart from JavaScript returned Arrow.

I ended up converting Arrow using a NIF to get it to work, but found JSON easier to manage (at the time). I'm happy to get this working with Arrow; now that Rustler has precompiled support, I'll get CI to build for every platform (including manually building on Mac M1 & ARM if GitHub doesn't support it yet).

Happy to try req_s3, though I think we might need to support the header here for it to get the response back in the correct format. I'll try removing it and see if it breaks.
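
For reference, here's a rough sketch of the client-mimicking described above. The login-request endpoint is undocumented, and the CLIENT_APP_ID/CLIENT_APP_VERSION field names come from reading the official connectors, so treat all of this as assumed rather than a stable API:

    # Hypothetical sketch: the client name sent at login appears to
    # influence whether result sets come back as JSON or Arrow.
    account = "my-account"

    Req.post!(
      "https://#{account}.snowflakecomputing.com/session/v1/login-request",
      json: %{
        data: %{
          # Mimic another official connector; "JavaScript" reportedly
          # forces JSON responses.
          "CLIENT_APP_ID" => "PythonConnector",
          "CLIENT_APP_VERSION" => "2.7.0",
          "LOGIN_NAME" => "my_user",
          "PASSWORD" => "my_password",
          "ACCOUNT_NAME" => account
        }
      }
    )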

@wojtekmach commented May 30, 2022

I wouldn't add req_s3 as a dependency to your plugin. It's only really meant for end users to easily explore stuff in S3. It's pretty straightforward too; if you find any bits useful, I'd copy them instead.

@josevalim (Contributor)

Hi @joshuataylor! This is just a quick ping to check if there is anything we can help with.

Regarding Arrow support, note that the Explorer library can read Arrow IPC. At the moment the IPC needs to be stored on disk, but we can easily add a function to read it from an Elixir binary. If the read_ipc-from-disk approach works, then I would actually go with Explorer, because you most likely want a dataframe later to explore the data anyway! Also note that Explorer supports only some types (:integer | :float | :boolean | :string | :date | :datetime), but we can also add new data types if required.
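
A minimal sketch of that read-from-disk flow, assuming the raw Arrow IPC binary is already in `body` (read_ipc is the function mentioned above; later Explorer versions renamed it to from_ipc):

    # Sketch: persist the Arrow IPC response, then load it as a dataframe.
    path = Path.join(System.tmp_dir!(), "snowflake_result.arrow")
    File.write!(path, body)

    {:ok, df} = Explorer.DataFrame.read_ipc(path)

    # Print a preview of the dataframe.
    Explorer.DataFrame.table(df)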

@joshuataylor (Author) commented Jun 16, 2022

Hi!

Yep, just been going down the rabbit hole for the last two weeks playing with Arrow2, Rustler, NIFs, and converting to binaries. I'm currently wrapping up some benchmarks, but we can convert everything from Rust to Elixir terms extremely quickly. The main blocker I had was figuring out how binaries and sub-binaries work; major thanks to @filmor for pointing to sub-binaries, which has massively decreased times.

I'll have something out in the next few days. I've found that we're much faster than the JSON implementation, and I've also improved the JSON implementation, so the user has a choice to either:

a/ Use JSON with Jason to decode
b/ Use Arrow

I'll also play with Explorer to see how it compares to doing things in Elixir vs. not.

I'm also curious how we can quickly convert from columns -> rows so the rows are in lists; see here: https://github.com/joshuataylor/list_of_lists

We get the following back from Arrow:

[
  [1,2,3,4],
  ["a","b","c","d"]
]

Then I do:

    data
    |> Native.convert_arrow_stream(Keyword.get(opts, :cast, false))
    # Transpose columns into rows: zip_with/2 with the identity function
    # turns [[1, 2], ["a", "b"]] into [[1, "a"], [2, "b"]].
    |> Enum.zip_with(& &1)

edit: So Arrow has two different IPC formats:

1/ The IPC file format
2/ The IPC streaming format

We'll need to add 2/ to Polars, which is noted here: pola-rs/nodejs-polars#109

@josevalim (Contributor) commented Jun 16, 2022

That's awesome! Keep in mind that sub-binaries have the downside that the underlying binary won't be garbage collected, but that should be alright given Arrow's design.

The cool thing about Explorer is that, as we improve its APIs, you will be able to build large queries that filter, group_by, and order, and execute them only once over the Arrow format in Rust. The conversion to Elixir happens only at the end (as with dataframes in Python).
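
For illustration, a rough sketch of that query-building style using Explorer's dataframe macros; the column names here are made up, and the exact API is still evolving, so this is a shape rather than a commitment:

    require Explorer.DataFrame, as: DF

    # Build up the query; with a lazy frame nothing executes until
    # the result is actually materialized.
    df
    |> DF.filter(amount > 100)
    |> DF.group_by("region")
    |> DF.summarise(total: sum(amount))
    |> DF.arrange(desc: total)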

The other thing about Explorer is that we will likely implement an ExplorerSQL thing, so you can write your SQL queries in Explorer too, translate them to SQL, send them to Snowflake (likely using req_snowflake), and continue processing with Explorer locally. So if you want to go with Arrow out of the box, it is likely the best path forward.

In any case, it is great that you are getting deeper into the Arrow/Rust/NIF side of things, because it is becoming an essential part of the data aspect of Livebook/Nx, and the knowledge will definitely be useful and help!

@joshuataylor (Author)

Perfect, that is the exact use case I also wanted for a tool. It sounds very similar to the way count.co does things: they have a notebook which converts each cell into actual SQL that is executed against Snowflake.

@forest commented Jul 27, 2023

If there is still interest in supporting Snowflake via the JSON API, then this library might be useful; it is built on top of Req:
https://github.com/HGInsights/avalanche

@joshuataylor (Author)

Apologies for the lack of updates on this, life ended up being pretty crazy this year with personal family stuff going on.

Snowflake's API is incredibly undocumented, but it is stable and never really changes, as they have large customers who can't have the API change. I'm happy to chat about the API and any nuances, especially regarding JSON vs. Arrow rows, etc., with anyone -- feel free to email me at joshuataylorx [at] gmail [dot] com and we can either discuss via email or catch up over video.

@josevalim (Contributor)

For what it's worth, we have Elixir+Snowflake+Arrow support directly in ADBC now. Our goal is to integrate it into this project via Explorer+ADBC.
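
A minimal sketch of querying Snowflake through the Elixir adbc bindings, based on the adbc README; the Snowflake URI format shown is an assumption, so check the driver docs for the exact options:

    # Sketch: start a database handle for the Snowflake driver, open a
    # connection, and run a query.
    {:ok, db} =
      Adbc.Database.start_link(
        driver: :snowflake,
        uri: "me:secret@my-account/my_db"
      )

    {:ok, conn} = Adbc.Connection.start_link(database: db)
    {:ok, result} = Adbc.Connection.query(conn, "SELECT 1 AS one")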

@forest commented Jul 28, 2023

Oh, that's great. I will give ADBC a try.

@josevalim (Contributor)

This is in!
