Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate solutions to stream into materialized views #56

Open
3 tasks
pld opened this issue Oct 2, 2020 · 5 comments
Open
3 tasks

Investigate solutions to stream into materialized views #56

pld opened this issue Oct 2, 2020 · 5 comments
Assignees

Comments

@pld
Copy link
Member

pld commented Oct 2, 2020

Questions from @moshthepitt

I am generally on board. But these things are unclear to me:

  • is this a SAAS offering, or something that we can self-host if we wanted to?
  • how do we "deploy" to is? our migrations will work?

Tasks:

@moshthepitt
Copy link
Contributor

Posting a comment from slack: if we are going in Pinot's direction we might as well think about something like cassandra which is in the same class of tools imho and arguably more well-known.

@moshthepitt
Copy link
Contributor

I also think that in comparing these different tools we need to make them produce the same report from the same database and then we can objectively weigh them against each other and against postgres. Luckily (ha!) we can use one of the already existing and challenging Zambia IRS reports.

@moshthepitt moshthepitt self-assigned this Oct 2, 2020
@gstuder-ona
Copy link
Contributor

@ukanga pinot's arch at first glance seems fairly similar to Druid, but isn't Superset-compatible. Is there a killer feature it has that you're most interested in? I ask because we've got a little experience in Druid clusters, so wondering how that went for our analytics - I think presto is also compatible with Druid?

@gstuder-ona
Copy link
Contributor

gstuder-ona commented Oct 2, 2020

@moshthepitt re questions

  • seems very self-hosted, not sure if there is a SaaS? But I didn't dive in.
  • that's one of the huge features, if it works - we in theory can manage materialize with sqitch, in the same repo or whatever. It's all dockerized so we should vbe able to put it in the data-solutions docker-compose

There's a lot of details around editing deployed views, etc. to figure out but it's a very targeted fix

@pld
Copy link
Member Author

pld commented Oct 3, 2020

The 3 options here feel like 3 separate approaches, and (I think?) only materlialize.io addresses the specific problem of view recomputation, although they all target the same underlying problem (reduce time to compute analysis results on big data)

This leads to the broader question of how we want to solve the problem, and I see at least 4 approaches:

  1. materialize/etc (what’s comparable)? solve by updating mat views incrementally
  2. bigquery/redshift/cassandra/vertica/etc solve by improving the query engine to reduce time to run (or recompute) queries and stay in (or near) SQL
  3. Druid/Pinot/etc do the same as (2) but not in SQL
  4. Airflow/Oozie/Azkaban optimize dependant queries to minimize recomputation of incremental results

Dumping everything into redshift and seeing how long our expensive mat views run as queries could be worth looking at, but it looks like there’s a bunch of optimization (designing and tuning tables) that happens there too, although it does support geo, https://docs.aws.amazon.com/redshift/latest/dg/geospatial-overview.html, so we'd have to think if we're merely shifting the problem, probably can't answer that until we spend a day testing.

CC @mberg

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants