Optimization of queries on very large datasets #45680

Closed
mateusfmcota opened this issue Jun 5, 2024 · 5 comments
Labels
0. Needs triage (pending check for reproducibility or if it fits our roadmap) · enhancement

Comments

@mateusfmcota

mateusfmcota commented Jun 5, 2024


Is your feature request related to a problem? Please describe.

Recently I set up a Nextcloud instance for a company to store over 700k small XML invoice files; this started two weeks ago. After the upload finished, I expected Nextcloud to generate a lot of data in the database (indexes, previews, ...), so I didn't mind that it would take some time.

After it had been running for a week, I started to research whether this was normal, and apparently it wasn't. I noticed a few queries that used a lot of CPU/disk I/O and wouldn't let the drive spin down.

From what I could see, most of the trouble is caused by the query containing (a.path LIKE 'appdata\\_oc39a4gb5j29/preview/_/_/_/_/_/_/_/%').

I tried to find where in the code the server issues this query so I could try to fix it myself and submit a PR, but I couldn't find it, as I have never worked on the codebase.

Describe the solution you'd like

Basically, do some fast checks before running the long and complex query. The second terminal command in the image shows a basic query that counts all the entries where fileid is NULL, and it returned zero. If something like an if statement with this check were added before the expensive query, the server would be able to skip it, let the drive spin down, and save resources.
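
A minimal sketch of the kind of guard I mean (the table and column names are the ones from my instance; whether the background job could really rely on this exact check, I'm not sure):

-- Hypothetical cheap guard: if this count is zero, skip the expensive
-- cleanup query entirely and let the drive spin down.
SELECT COUNT(*) FROM oc_filecache WHERE fileid IS NULL;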

Describe alternatives you've considered

Maybe a more complex approach using transactions to filter the data first with simple and fast queries and only then execute the complex ones, something like: get all rows where fileid == null -> filter those where storagetype == 2 -> ...
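
A rough sketch of that staged idea (I'm assuming the column is actually called storage, and I'm reusing the LIKE pattern from above; treat the names as illustrative, not as the real cleanup query):

-- Hypothetical staged filtering: apply the cheap predicates first, then run
-- the expensive pattern match only on the rows that survive.
WITH candidates AS (
    SELECT fileid, path
    FROM oc_filecache
    WHERE storage = 2
)
SELECT fileid, path
FROM candidates
WHERE path LIKE 'appdata\_oc39a4gb5j29/preview/_/_/_/_/_/_/_/%';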

Additional context

[screenshot: terminal output of the queries described above]

mateusfmcota added the 0. Needs triage and enhancement labels on Jun 5, 2024
@csware
Contributor

csware commented Jun 6, 2024

Shouldn't the DB optimizer do that, i.e. filter on fileid and storage first?

@kesselb
Contributor

kesselb commented Jun 6, 2024

Similar report: #28114

The SQL query belongs to: https://github.com/nextcloud/server/blob/master/lib/private/Preview/BackgroundCleanupJob.php

Basically, do some fast checks before running the long and complex query. The second terminal command in the image shows a basic query that counts all the entries where fileid is NULL, and it returned zero.

There are no rows in oc_filecache where fileid is null, because fileid is the primary key; therefore your suggestion to first select all rows where fileid is null does not work.

@mateusfmcota
Author

There are no rows in oc_filecache where fileid is null, because fileid is the primary key; therefore your suggestion to first select all rows where fileid is null does not work.

About this, my idea came mainly from this line:

$qb->expr()->isNull('b.fileid')

The b here refers to oc_filecache, and the expression checks for null, so to me it looked like the column should be nullable.
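
If I follow the code correctly, the IS NULL comes from the left join rather than from the column itself, something with roughly this shape (the join condition below is just my guess for illustration, not the actual code from BackgroundCleanupJob):

-- Rough anti-join shape: b.fileid ends up NULL for preview rows that have no
-- matching row on the right side of the LEFT JOIN, even though fileid itself
-- is the primary key and never NULL inside the table.
SELECT a.fileid, a.path
FROM oc_filecache a
LEFT JOIN oc_filecache b
    ON b.fileid = a.parent   -- assumed join condition, for illustration only
WHERE a.path LIKE 'appdata\_oc39a4gb5j29/preview/_/_/_/_/_/_/_/%'
  AND b.fileid IS NULL;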

After a deeper investigation I found that the problem is the left join; the database takes too much time (hours) to do it. I'm beginning to think the problem could be my MariaDB database: it has around 5M entries and even a simple count takes around 1 minute to run.

[screenshot: a simple COUNT(*) on oc_filecache taking roughly a minute]
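
In case it helps someone debugging something similar, these are standard MariaDB checks for a count that slow (only the table name comes from my setup):

SHOW TABLE STATUS LIKE 'oc_filecache';        -- engine, approximate row count, data size
SHOW INDEX FROM oc_filecache;                 -- verify the expected indexes are present
EXPLAIN SELECT COUNT(*) FROM oc_filecache;    -- see which index the count actually uses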

So this is probably a question for the forums, then.

@mateusfmcota
Author

I'm commenting mostly in case someone else runs into this issue in the future: I found the problem.

The oc_filecache table had become corrupted. I had to drop the table and recreate it with the most recent schema script I found here:

https://help.nextcloud.com/t/how-can-i-recreate-regenerate-the-oc-filecache-table-solve-here-is-the-schema-for-oc-filecache/63395/14

After recreating it I ran occ files:scan --all.
I had tried mariadb-check with check/analyze/repair, but it couldn't find or fix the problem, so my last resort was to drop the table instead.
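
For reference, the server-side equivalents of those maintenance runs look roughly like this (standard MariaDB statements; note that REPAIR TABLE does not apply to InnoDB tables, which oc_filecache normally uses):

-- Table maintenance corresponding to mariadb-check --check/--analyze/--repair/--optimize:
CHECK TABLE oc_filecache;
ANALYZE TABLE oc_filecache;
REPAIR TABLE oc_filecache;      -- only some storage engines support repair; not InnoDB
OPTIMIZE TABLE oc_filecache;    -- for InnoDB this rebuilds the table and refreshes statistics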

@kesselb
Contributor

kesselb commented Jun 15, 2024

For anyone coming across this: deleting the filecache table is a bad idea.

The filecache table is not really just a cache anymore; deleting it loses all shares and other metadata about files.
