Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

traverse folders in php to search in shared files #10048

Merged
merged 2 commits into from
Jul 31, 2014

Conversation

butonic
Copy link
Member

@butonic butonic commented Jul 30, 2014

makeshift fix for #10009

@karlitschek @DeepDiver1975 @craigpg needed for 7.0.1

Why makeshift, you ask? The solution uses the same directory traversing as searchByMime ... which is a performance nightmare and ugly. But it works.

@icewind1991 @PVince81 @schiesbn We definitely need to come up with a better solution... I still vote for a n:m relation table between share and files. Yes it takes longer to share hundreds of files ... but it is a much cleaner solution and allows us to use joins and let the db do it's magic.

@butonic butonic added this to the ownCloud 7.0.1 milestone Jul 30, 2014
@karlitschek
Copy link
Contributor

@schiesbn What is your opinion?

@schiessle
Copy link
Contributor

@butonic do you mean that a share table entry should not only point to the shared folder but to every file below the folder? I would really avoid to go into this direction. This would not only make the initial share operation more expensive but the share table would also need to track all changes to the file system. This would duplicate (large) parts of the file cache and would add additional costs to every file system operation which adds/removes a file.

Currently the share table gives you the starting point at the file cache table and from there you can operate on the file cache table which provides every information needed and keeps track of all changes. I don't see the need to duplicate it.

@ghost
Copy link

ghost commented Jul 30, 2014

🚀 Test Passed. 🚀
Refer to this link for build results: https://ci.owncloud.org/job/pull-request-analyser/6477/

@schiessle
Copy link
Contributor

Regarding your changes, the code looks good to me. Although I never used the search() method by myself. I have no idea how such a "pattern" can look like and whether your changes (for example trim($pattern,'%');) works for all pattern provided to this function.

Would be great if you could add some unit tests to it, thanks!

@butonic
Copy link
Member Author

butonic commented Jul 30, 2014

The pattern is another thing that is broken. Currently OC\Files\View encloses the search term in % to make the pattern usable in a SQL LIKE statement. But not all storage backends might use SQL to search for files. Files_external could query amazon s3 or swift and won't need that kind of pattern.

The search method itself is only directly called from the tests and OC\Files\View https://github.com/owncloud/core/blob/master/lib/private/files/view.php#L1102 / https://github.com/owncloud/core/blob/master/lib/private/files/view.php#L1119

@schiessle
Copy link
Contributor

The pattern is another thing that is broken. Currently OC\Files\View encloses the search term in % to make the pattern usable in a SQL LIKE statement. But not all storage backends might use SQL to search for files. Files_external could query amazon s3 or swift and won't need that kind of pattern.

I think that's just a syntax question right? We need to come up with a syntax for this pattern. Ideally this syntax works for all storage back-ends so that the pattern can be used right away. I believe this was true before we supported object storage. If such a syntax exists we can use it. If not we need to make sure that we translate it, if we call a a storage which expects a different syntax.

The search method itself is only directly called from the tests

But this tests will never call the search method implemented by the Shared_Cache right? Would be good to have a unit test for it too.

@butonic
Copy link
Member Author

butonic commented Jul 30, 2014

The current implementation does not scale because searchByMime() as well as search() uses dozens of sql queries when walking the directory tree: getFolderContents first looks up the source cache via the shares table and that then looks up the files in the filecache. That is at least two queries per folder from what I can tell. Have a look at the following query log: http://pastebin.com/6V0M2NAg

And I already stripped it of completely unneeded duplicate queries for user group and what not. But that is a different issue.

What I mean is having the current filecache and share tables and an additional relation table called 'oc_shares_files' having the fileid and the shareid as foreign keys:

CREATE TABLE `oc_shares_files` (`shareid` int NOT NULL , `fileid` int NOT NULL , PRIMARY KEY (`shareid`, `fileid`));

-- We can then join the share table via the relations to search files
SELECT "sf"."shareid", "fc"."fileid"
FROM "oc_filecache" "fc"
JOIN "oc_share_files" "sf" ON "sf"."fileid"="fc"."fileid"
JOIN "oc_share" "s" ON "sf"."shareid"="s"."id"
 WHERE "fc"."name" LIKE ?
AND "s"."share_with" = ?

pass in the pattern and the user searching through shared files. The query could be optimized to also show files owned by the user.

Maintaining the relations is not that hard:

-- add all when a new share is created:
INSERT INTO  `oc_shares_files` (`shareid`,`fileid`) SELECT ?, `fileid` FROM `oc_filecache` WHERE `path` LIKE ?

with the new share id and the file path appended with '%' to get all children of a folder. All this happens inside the database. A quick test on an oracle db that had to insert 7736 rows took less than a second. Honestly, this is what RDBMS are made for. On my productivy mysql machine I did:

SELECT count(*) FROM `oc_filecache` WHERE `path` LIKE 'files%';
+----------+
| count(*) |
+----------+
|    21750 |
+----------+
1 row in set (0.19 sec)

INSERT INTO  `oc_shares_files` (`shareid`,`fileid`) SELECT 1, `fileid` FROM `oc_filecache` WHERE `path` LIKE 'files%';
Query OK, 21750 rows affected (0.07 sec)
Records: 21750  Duplicates: 0  Warnings: 0

so ... IMHO performance is clearly not an issue here.

Other operations can now also be offloaded to the database:

-- on a create hook add the file to the shares_files relation by looking up the parent in the share_files table
INSERT INTO `oc_shares_files` (`shareid`,`fileid`) SELECT `sf`.`shareid`, `fc`.`fileid` FROM `oc_filecache` `fc` JOIN `oc_share_files` `sf` ON `sf`.`fileid`=`fc`.`parent` WHERE `fc`.`fileid` = ?

-- on a delete hook remove the file or all files of a share from the shares_files relation
-- when deleting a file (or folder)
DELETE FROM `oc_shares_files` WHERE `fileid` IN (SELECT `fileid` FROM `oc_filecache` WHERE `path` LIKE ? )
-- when deleting a share
DELETE FROM `oc_shares_files` WHERE `shareid` = ?

If we were using foreign keys we could even use the cascade option to remove entries from the relation table.

Migration to a relations table is easy because it only adds the relation table.

It is basically a speed tradeoff where we maintain the relations of the shared files in the database to speed up all queries related to file shares, because we can let the database handle permissions and don't have to go through php and the connection.

Looking at [nested foreach loops that are needed to update the etags](https://github.com/owncloud/core/blob/master/apps/files_sharing/lib/updater.php#L72 for all users with access to a shared file) (which kills sync performance btw) my experience tells me that this can also be improved. Yes we have to recursively walk up the tree as well but we can do that for all users in parallel.

@schiessle
Copy link
Contributor

@butonic thanks for the detailed explanation. I'm not sure if it is that easy in all circumstances, e.g. your initial insert if a share was created only looks for the path, but this way you would also insert files from different users who might have a similar path. Also we could not just add the storage ID as a second parameter because the storage ID could change if the shared folder contains a sub-folder which is a external mount. This could probably be solved, but I assume that some extra queries would be involved.

I didn't thought about it in detail, this are just some thoughts coming to my mind while reading it. We would also have to check the other queries if they work in all circumstances. I also wonder if this is a sharing-only issue. Does the searchByMime() method for the local storage is more efficient? I would prefer to optimize this method, if needed. This way all other storage could re-use it. As said, as soon as we know the real path and user of a shared folder, all operation can happen on the file cache without any special code regarding sharing. So if the file cache operation is fast then the same operation for shared files should be fast too.

For now, let's fix it they way you did it. I would only like to have some unit tests for it. 😉
Maybe you can copy&paste your last post to a separate issue so that we can re-consider it during oc8 development where we will add some changes/improvements to sharing anyway.

@butonic
Copy link
Member Author

butonic commented Jul 31, 2014

@schiesbn the SQL was meant to show that inserting the relations is done with one query, limiting the query to only one user (AND user = ?) ony shrinks the result set and has a neglectable performance impact.

A move operation will become more complicated because we need to update the relation as well as the etags. But that is only one query in addition to all the etag updates (which have to traverse up the tree for source and target).

searchByMime() uses the same traversal algorithm, which makes it equally slow. Yet another reason why the gallery will take some time to show images. Both search methods need to walk the tree or we need to come up with a smarter algorithm.

I'll write the tests and create a new issue for this.

@schiessle
Copy link
Contributor

the SQL was meant to show that inserting the relations is done with one query, limiting the query to only one user (AND user = ?) ony shrinks the result set and has a neglectable performance impact.

Did you had a look at the file cache, table? There is no 'user'. Unfortunately it is not that easy to say which file cache entry belongs to which user. The only reference is the storage ID, resolving this to the user seems to be not that easy for this use case. First you would need to find out which storage IDs belongs to which user. Another problem could be that the path is relative to the storage. So even if you know all storage IDs of a given user it could happen that you get false-positives. All solvable, but probably not as easy as it seems at the first moment and it is not sure how much queries we would really need at the end. But let's discuss it in the new issue.

@butonic
Copy link
Member Author

butonic commented Jul 31, 2014

@schiesbn thx for making me write unit tests, found and fixed two bugs.

@ghost
Copy link

ghost commented Jul 31, 2014

💣 Test Failed. 💣
Refer to this link for build results: https://ci.owncloud.org/job/pull-request-analyser/6494/

@icewind1991
Copy link
Contributor

Instead of mapping shareid -> fileids it would make more sense to keep a fileid -> fileid mapping of the three, something like

CREATE TABLE oc_files_three (parentint NOT NULL ,child int NOT NULL, , PRIMARY KEY (parent, child));

Then for searching the shared cache can map the shareid to the parent and create a join with the filecache from there.

Additionally this table can be used for moving and deleting within the cache and searching inside a folder

@butonic
Copy link
Member Author

butonic commented Jul 31, 2014

@icewind1991 can you elaborate on that in #10077

@butonic
Copy link
Member Author

butonic commented Jul 31, 2014

@icewind1991 @schiesbn can I get a 👍 for this to make search in shared files work at least until we come up with a better solution in #10077 ?

@schiessle
Copy link
Contributor

👍 under the assumption that Jenkins will finish successfully

@scrutinizer-notifier
Copy link

The inspection completed: 4 new issues, 3 updated code elements

@ghost
Copy link

ghost commented Jul 31, 2014

🚀 Test Passed. 🚀
Refer to this link for build results: https://ci.owncloud.org/job/pull-request-analyser/6500/

@icewind1991
Copy link
Contributor

👍

butonic added a commit that referenced this pull request Jul 31, 2014
traverse folders in php to search in shared files
@butonic butonic merged commit dc61804 into master Jul 31, 2014
@butonic butonic deleted the fix_search_in_shared_files branch July 31, 2014 16:27
@butonic
Copy link
Member Author

butonic commented Jul 31, 2014

@karlitschek @craigpg backport to 7.0.1? is easy and contained.

@karlitschek
Copy link
Contributor

yes please. but not before monday when 7.0.1 is out

@butonic butonic modified the milestones: ownCloud 7.0.1, ownCloud 7.0.2 Aug 1, 2014
@butonic butonic self-assigned this Aug 1, 2014
@butonic butonic modified the milestones: ownCloud 7.0.2, 2014-sprint-01-current Aug 7, 2014
@lock lock bot locked as resolved and limited conversation to collaborators Aug 16, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants