Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Materialized view proposal #1997

Open
1 task done
d12frosted opened this issue Dec 5, 2021 · 5 comments
Open
1 task done

Materialized view proposal #1997

d12frosted opened this issue Dec 5, 2021 · 5 comments

Comments

@d12frosted
Copy link
Contributor

d12frosted commented Dec 5, 2021

Brief Abstract

Materialized view is a table where each row contains all information about node, including information from the following tables: nodes, aliases, citations, refs, tags and links.

Benefits

This would improve performance of many query operations, where we rely on multiplication of multiple tables. See vulpea#116 for benchmarks of possible implementation. Those benchmarks are using some vulpea functions, but in short it compares approach of multiplication and view table on db of 9554 notes:

test result size regular view table ratio
filter-on-tags-1 30 notes 4.6693460650999995 1.0112478712 4.6174100
filter-on-tags-2 3168 notes 4.7333844436999996 1.0059819176 4.7052381
filter-on-links-1 1657 notes 4.8095771283 1.0462236128999999 4.5970833
filter-on-links-2 92 notes 4.5517473337999995 1.0204833089 4.4603839

As you can see, when all notes needs to be traversed, view table provides x4.595 performance improvement.

Who would benefit?

The following group of users would benefit from this feature:

  • Regular users of Org Roam, since interactive functions of finding and inserting notes would become much faster.
  • Users of various advanced packages based on Org Roam (like delve or maybe even org-roam-ui) as these applications do lots of querying behind the scenes and they needs most of that information.
  • Developers of applications based on Org Roam, since with view table you can quickly get all you need without thinking too much on how to get all required information at once. Writing query like this is hard.

Long Description

Right now when we need information from multiple tables, we use table multiplication. But the more tables we want to multiply the slower this query becomes.

So instead of doing this 'multiplication' on the read side, we could maintain a separate table that contains all this information in one place. The schema would look like:

([(id :not-null :primary-key)
  (path :not-null)
  (level :not-null)
  (title :not-null)
  (properties :not-null)
  aliases
  tags
  meta
  links
  refs
  citations]
 (:foreign-key [path] :references files [file] :on-delete :cascade))

Proposed Implementation (if any)

See vulpea#116 as example of implementation.

Implementation would consist of 2 parts (can and should be released separately):

  1. Implementing view table lifecycle (e.g. writing).
  2. Using it across the org-roam code base where query happens (e.g. reading).

Writing

Whenever the note is being synced, we also add all relevant information into this view table. I suspect that the sync routine needs to be modified a little bit, so we can avoid double parsing or non atomic inserts.

Reading

Instead of doing horrific SQL multiplication, we will use org-roam-query with this new materialized view table.

FAQ

What about write performance?

You might noticed that in vulpea#116 db sync performance degrades quite a bit. This is explained by the fact that I simply duplicated buffer parsing, so it takes almost twice the time. If view table is implemented in org-roam, the footprint of view table should be minimal, e.g. hardly noticeable.

Any volunteers?

Me 😸 If you think that it worth including such table in org-roam I would gladly work on that. Especially since I already have a working implementation that I use on a daily basis.

Please check the following:

  • No similar feature requests

@jethrokuan Please let me know what you think. If something is not clear, just let me know 😄 I would gladly work on this feature for org-roam. In case you think that read performance doesn't cost data duplication, I have another proposal - to ease adding extra tables in org-roam db (and btw I believe it would be nice to have on its own regardless of this proposal - I will send it later) - with this materialized view can come as an extension.

cc @publicimageltd as you were interested in this happening

@jethrokuan
Copy link
Member

jethrokuan commented Dec 9, 2021

@d12frosted sounds good! Thanks a lot for the benchmarks too, it's useful to know how much queries will speed up without the joins, and thankfully you have already implemented the feature elsewhere.

My desired solution would have minimal additional handwritten writes to any tables. I've been thinking of doing materialized views directly via SQL, I'm pretty sure it's possible to have sqlite handle everything, since we obtain our nodes table via a SQL query anyway.

@d12frosted
Copy link
Contributor Author

I'm pretty sure it's possible to have sqlite handle everything, since we obtain our nodes table via a SQL query anyway.

This question was already raised on discourse. See my response here. What do you think?

@jethrokuan
Copy link
Member

@d12frosted you're right, I tried googling around for "sqlite3 materialized views"and it seems that sqlite3 doesn't support it. I'm all for optimizing node reads if writes don't get too expensive, hope to see a contained solution!

@akashpal-21
Copy link

akashpal-21 commented May 31, 2024

I implemented Materialized View using Sqlite Triggers to offload computation to Sqlite
I have provided benchmark for the case of 20k nodes, 30k tags, 10 aliases, and 10k refs.

https://org-roam.discourse.group/t/materialized-view-for-org-roam-db-using-sqlite-triggers/3483/25

#+RESULTS:
|        0.636916469 | 0 | 0.0 |
|        0.644032131 | 0 | 0.0 |
| 0.6499278900000001 | 0 | 0.0 |
|        0.672380125 | 0 | 0.0 |
|        0.654946752 | 0 | 0.0 |
|        0.658287666 | 0 | 0.0 |
|        0.667262851 | 0 | 0.0 |
|        0.670634778 | 0 | 0.0 |
|        0.695153822 | 0 | 0.0 |
|        0.668818478 | 0 | 0.0 |
|        0.686307308 | 0 | 0.0 |
|         0.214723974 | 0 | 0.0 |
|         0.210380056 | 0 | 0.0 |
|         0.209768058 | 0 | 0.0 |
|         0.211162957 | 0 | 0.0 |
|         0.209120208 | 0 | 0.0 |
|         0.231940973 | 0 | 0.0 |
|         0.249701558 | 0 | 0.0 |
| 0.25441987299999996 | 0 | 0.0 |
| 0.25335254100000004 | 0 | 0.0 |
| 0.25053876199999997 | 0 | 0.0 |
| 0.25305776599999996 | 0 | 0.0 |

Please have a look at the post for more detail - implementing it in elisp should be trivial from here, leaving it here for further consideration.

Thanks and Best,

@publicimageltd
Copy link
Contributor

Thanks for the bump and the proposal! I still think that's a good idea, and it could and should be implemented! So here's my +1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants