Optimization of queries on very large datasets #45680

Closed
mateusfmcota opened this issue Jun 5, 2024 · 5 comments
Labels
0. Needs triage (pending check for reproducibility or if it fits our roadmap) · enhancement

Comments

@mateusfmcota

mateusfmcota commented Jun 5, 2024


Is your feature request related to a problem? Please describe.

Recently I set up a Nextcloud instance for a company to store over 700k small XML invoice files; this started two weeks ago. After the upload finished, I expected Nextcloud to generate a lot of data in the database (indexes, previews, ...), so I didn't mind that it would take some time.

After it had been running for a week, I started to research whether this was normal, and apparently it wasn't. I noticed a few queries that used a lot of CPU/disk I/O and wouldn't let the drive spin down.

From what I could see, most of the trouble is caused by the query containing (a.path LIKE 'appdata\\_oc39a4gb5j29/preview/_/_/_/_/_/_/_/%').

I tried to find where in the code the server issues this query so I could try to fix it myself and submit a PR, but I couldn't find it, as I have never worked on the codebase.

Describe the solution you'd like

Basically, do some fast checks before running the long and complex query. The second terminal command in the image shows a basic query that counts all the entries where fileid is NULL, and it returned zero. If something like an if statement with this check were added before the expensive query, the server would be able to skip it, let the drive spin down, and save resources.
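
A minimal sketch of the kind of guard I mean (the table and column names are the ones from my instance; whether the background job could really rely on this exact check, I'm not sure):

-- Hypothetical cheap guard: if this count is zero, skip the expensive
-- cleanup query entirely and let the drive spin down.
SELECT COUNT(*) FROM oc_filecache WHERE fileid IS NULL;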

Describe alternatives you've considered

Maybe a more complex approach using transactions to filter the data first with simple and fast queries and only then execute the complex ones, something like: get all rows where fileid == null -> filter those where storagetype == 2 -> ...
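
A rough sketch of that staged idea (I'm assuming the column is actually called storage, and I'm reusing the LIKE pattern from above; treat the names as illustrative, not as the real cleanup query):

-- Hypothetical staged filtering: apply the cheap predicates first, then run
-- the expensive pattern match only on the rows that survive.
WITH candidates AS (
    SELECT fileid, path
    FROM oc_filecache
    WHERE storage = 2
)
SELECT fileid, path
FROM candidates
WHERE path LIKE 'appdata\_oc39a4gb5j29/preview/_/_/_/_/_/_/_/%';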

Additional context

[screenshot: terminal output of the queries described above]

mateusfmcota added the 0. Needs triage and enhancement labels on Jun 5, 2024
@csware
Contributor

csware commented Jun 6, 2024

Shouldn't the DB optimizer do that, i.e. filter on fileid and storage first?

@kesselb
Contributor

kesselb commented Jun 6, 2024

Similar report: #28114

The SQL query belongs to: https://github.com/nextcloud/server/blob/master/lib/private/Preview/BackgroundCleanupJob.php

Basically, do some fast checks before running the long and complex query. The second terminal command in the image shows a basic query that counts all the entries where fileid is NULL, and it returned zero.

There are no rows in oc_filecache where fileid is null, because fileid is the primary key; therefore your suggestion to first select all rows where fileid is null does not work.

@mateusfmcota
Author

There are no rows in oc_filecache where fileid is null, because fileid is the primary key; therefore your suggestion to first select all rows where fileid is null does not work.

About this, my idea came mainly from this line:

$qb->expr()->isNull('b.fileid')

The b here refers to oc_filecache, and the expression checks for null, so to me it looked like the column should be nullable.
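
If I follow the code correctly, the IS NULL comes from the left join rather than from the column itself, something with roughly this shape (the join condition below is just my guess for illustration, not the actual code from BackgroundCleanupJob):

-- Rough anti-join shape: b.fileid ends up NULL for preview rows that have no
-- matching row on the right side of the LEFT JOIN, even though fileid itself
-- is the primary key and never NULL inside the table.
SELECT a.fileid, a.path
FROM oc_filecache a
LEFT JOIN oc_filecache b
    ON b.fileid = a.parent   -- assumed join condition, for illustration only
WHERE a.path LIKE 'appdata\_oc39a4gb5j29/preview/_/_/_/_/_/_/_/%'
  AND b.fileid IS NULL;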

After a deeper investigation I found that the problem is the left join; the database takes too much time (hours) to do it. I'm beginning to think the problem could be my MariaDB database: it has around 5M entries and even a simple count takes around 1 minute to run.

[screenshot: a simple COUNT(*) on oc_filecache taking roughly a minute]
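
In case it helps someone debugging something similar, these are standard MariaDB checks for a count that slow (only the table name comes from my setup):

SHOW TABLE STATUS LIKE 'oc_filecache';        -- engine, approximate row count, data size
SHOW INDEX FROM oc_filecache;                 -- verify the expected indexes are present
EXPLAIN SELECT COUNT(*) FROM oc_filecache;    -- see which index the count actually uses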

So this is probably a question for the forums, then.

@mateusfmcota
Author

I'm commenting mostly in case someone else runs into this issue in the future: I found the problem.

The oc_filecache table had become corrupted. I had to drop the table and recreate it with the most recent schema script I found here:

https://help.nextcloud.com/t/how-can-i-recreate-regenerate-the-oc-filecache-table-solve-here-is-the-schema-for-oc-filecache/63395/14

After recreating it I ran occ files:scan --all.
I had tried mariadb-check with check/analyze/repair, but it couldn't find or fix the problem, so my last resort was to drop the table instead.
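
For reference, the server-side equivalents of those maintenance runs look roughly like this (standard MariaDB statements; note that REPAIR TABLE does not apply to InnoDB tables, which oc_filecache normally uses):

-- Table maintenance corresponding to mariadb-check --check/--analyze/--repair/--optimize:
CHECK TABLE oc_filecache;
ANALYZE TABLE oc_filecache;
REPAIR TABLE oc_filecache;      -- only some storage engines support repair; not InnoDB
OPTIMIZE TABLE oc_filecache;    -- for InnoDB this rebuilds the table and refreshes statistics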

@kesselb
Contributor

kesselb commented Jun 15, 2024

For anyone coming across this: deleting the filecache table is a bad idea.

The filecache table is not really just a cache anymore; deleting it loses all shares and other metadata about files.
