Auto load pages on content search #6901

yurabakhtin · 2024-03-26T14:37:06Z

What kind of change does this PR introduce? (check at least one)

Feature

The PR fulfills these requirements:

It's submitted to the develop branch, not the master branch if no hotfix
All tests are passing
Changelog was modified

Other information:
Issue: https://github.com/humhub/humhub-internal/issues/207

luke- · 2024-03-27T14:37:53Z

protected/humhub/modules/content/controllers/SearchController.php

@@ -40,7 +40,7 @@ public function actionResults()
    {
        $resultSet = null;

-        $this->searchRequest = new SearchRequest();
+        $this->searchRequest = new SearchRequest(['pageSize' => 3]);


@yurabakhtin Search queries are very resource-intensive. I would prefer a solution where we prefetch with a page size of e.g. 30 and cache the result (content IDs), then use this cache for the streaming response and always deliver it in chunks of 3.
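The prefetch-and-chunk idea can be sketched roughly as follows. This is only an illustration: expensiveSearch() and fetchChunk() are hypothetical names, the cache is a plain array rather than the Yii cache component, and the "search" just returns a range of fake content IDs.

```php
<?php
// Hypothetical stand-in for an expensive search: returns matching content IDs.
// $calls counts how often the "real" search actually runs.
function expensiveSearch(string $keyword, int $limit, int &$calls): array
{
    $calls++;
    return range(1, $limit); // pretend these are the matching content IDs
}

// Prefetch once, cache the IDs, and serve small chunks from the cache.
function fetchChunk(
    array &$cache,
    string $keyword,
    int $page,
    int &$calls,
    int $prefetchSize = 30,
    int $chunkSize = 3
): array {
    if (!isset($cache[$keyword])) {
        $cache[$keyword] = expensiveSearch($keyword, $prefetchSize, $calls);
    }
    // Page numbers are 1-based.
    return array_slice($cache[$keyword], ($page - 1) * $chunkSize, $chunkSize);
}

$cache = [];
$calls = 0;
$first = fetchChunk($cache, 'humhub', 1, $calls);  // first chunk of 3 IDs
$fourth = fetchChunk($cache, 'humhub', 4, $calls); // served from the same cached prefetch
```

In this sketch the expensive search runs only once for the first ten chunks; a real implementation would also need a cache key that covers all request filters and a cache lifetime.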

@luke- I think it is possible, like I did here for Search Providers, but we need an additional modification in the pagination part.
I have already started on this but don't have time to finish it today.

@luke- I have implemented the caching with a new option, SearchRequest::$cachePageNumber.
Calling new SearchRequest(['pageSize' => 3, 'cachePageNumber' => 10]) means that 30 records (10 pages) are cached on the first request. When the user reaches the 11th page, the next search is run and its results are cached as well, for pages 11 to 20.
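The page arithmetic behind cachePageNumber can be illustrated with a small helper. cacheBlockForPage() and the array keys it returns are hypothetical and not part of the PR; the sketch only assumes 1-based page numbers.

```php
<?php
// Map a 1-based page number to the cached block of pages that contains it.
// $cachePageNumber is the number of pages stored per cached block.
function cacheBlockForPage(int $page, int $pageSize, int $cachePageNumber): array
{
    $block = intdiv($page - 1, $cachePageNumber); // 0-based block index
    $firstPage = $block * $cachePageNumber + 1;

    return [
        'firstPage' => $firstPage,
        'lastPage' => $firstPage + $cachePageNumber - 1,
        'recordsPerBlock' => $pageSize * $cachePageNumber,
        // Offset of the requested page inside the cached block of records.
        'offsetInBlock' => ($page - $firstPage) * $pageSize,
    ];
}

$p10 = cacheBlockForPage(10, 3, 10); // still inside the first cached block
$p11 = cacheBlockForPage(11, 3, 10); // first page of the next block, so a new search is needed
```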

Please note that I have implemented a new abstract method runSearch(). Please review how the old method search() should be modified for SolrDriver, like I did for the Mysql and ZendLucence drivers.

@yurabakhtin Is there a reason why you didn't add a new AbstractDriver::searchCached(SearchRequest $request): ResultSet method (instead of runSearch)? It could be non-abstract and contain only the page caching logic.

The real search logic could stay in the abstract AbstractDriver::search(SearchRequest $request): ResultSet method.

Both methods search and searchCached could stay public, and searchCached just adds a layer on top of search.

I see one disadvantage: search() returns an array of content models (ResultSet). But I think if we get back 100 plain Content models and only store the IDs in the cache, the performance should be OK.

@luke- I did it that way because I thought it better to keep the call SearchDriver->search(...) in places that should not contain any logic. I.e. it would be better to call the method from the controller and keep all logic inside search().
I mean that a developer working on the Search Controller should not have to think about what code is executed inside search().
However, maybe you are right, because in my implementation the developer already controls the caching through the new param 'cachePageNumber' => 10, i.e. currently my code looks like this in the Search Controller:

    $searchRequest = new SearchRequest(['cachePageNumber' => 10]);
    $resultSet = $searchDriver->search($searchRequest);

and if I have understood you correctly, the new code should look like this:

    $searchRequest = new SearchRequest();
    $resultSet = $searchDriver->searchCached($searchRequest, $cachePageNumber = 10);

right? Should I redo it?

I see one disadvantage: search() returns an array of content models (ResultSet). But I think if we get back 100 plain Content models and only store the IDs in the cache, the performance should be OK.

Currently all Search Drivers return an array of Content records.
MySQL runs code like Content::find()->where(...)->all().
Zend and Solr do this:

    $content = Content::findOne(['id' => $contentId]);
    if ($content !== null) {
        $resultSet->results[] = $content;
    } else {
        throw new Exception('Could not load result! Content ID: ' . $contentId);
        // ToDo: Delete Result
        Yii::error("Could not load search result content: " . $contentId);
    }

If we modify all Search Drivers to return only Content IDs, we could remove this code from the Zend and Solr drivers. Removing Content::findOne(['id' => $contentId]) could improve performance.

But if we keep it, performance may be worse, because Content::findOne() will run twice (before caching and after reading from the cache).

In addition, we will need to implement a new method ResultSet->getResults() to use in place of the current code in view files, <?php foreach ($resultSet->results as $result): ?>:

    class ResultSet
    {
        public function getResults(): array
        {
            // findAll() accepts only hash-format conditions; an array value yields an IN query.
            return Content::findAll(['id' => $this->results]);
        }
    }

Do you agree with these modifications?

right? Should I redo it?

Yes, please do.

I would prefer this, because then this rather complex code block (https://github.com/humhub/humhub/pull/6901/files#diff-b226b67b6a65e71e7d905c17f216f4eeaab4a2387a29d40ce5996999884fef29R39-R70) lives only in the searchCached method, and it is always obviously caching-related.

Do you agree with these modifications?

Hmm, that's tricky.

What about the following (not sure it's possible)? ResultSet could:

Have a $results attribute of type Content[]|int[]

Be serializable: on serialize, all $results entries of type Content are converted into int (content ID)

Implement Iterator: only the required $results entries are loaded from int to object, ideally as a batch using an IN SQL query

Like this (similar to your current approach):

    class AbstractDriver
    {
        public function searchCached(SearchRequest $request): ResultSet
        {
            // $request->pageSize = 4
            // The closure needs "use ($request)" to see the request.
            $largeResultSet = Yii::$app->cache->getOrSet('some-unique', function () use ($request) {
                $requestLarge = clone $request;
                $requestLarge->pageSize *= 10;
                return $this->search($requestLarge);
            });

            // Return only the part of $largeResultSet that belongs to the requested page.
            return array_slice(/* of $largeResultSet */);
        }

        public function search(SearchRequest $request): ResultSet
        {
            // The real, driver-specific search logic stays here.
            return $resultSet;
        }
    }
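The ResultSet idea described above (entries of type Content[]|int[], ID-only serialization, batched loading on iteration) could look roughly like the sketch below. It is not the code from the PR: the Content class is a hypothetical stand-in for the real ActiveRecord, Content::loadByIds() simulates one "WHERE id IN (...)" query, and for simplicity all remaining IDs are loaded at once rather than only the ones needed for the current page.

```php
<?php
// Hypothetical stand-in for the Content ActiveRecord.
class Content
{
    public static int $queries = 0;
    public int $id;

    public function __construct(int $id)
    {
        $this->id = $id;
    }

    // Simulates one batched "WHERE id IN (...)" query.
    public static function loadByIds(array $ids): array
    {
        self::$queries++;
        return array_map(fn(int $id) => new Content($id), $ids);
    }
}

class ResultSet implements IteratorAggregate
{
    /** @var array<int, Content|int> Content objects or plain content IDs */
    public array $results = [];

    // Only content IDs are serialized, so the cache entry stays small.
    public function __serialize(): array
    {
        return ['results' => array_map(
            fn($r) => $r instanceof Content ? $r->id : $r,
            $this->results
        )];
    }

    public function __unserialize(array $data): void
    {
        $this->results = $data['results'];
    }

    // Entries still stored as IDs are loaded in one batch before iterating.
    public function getIterator(): Iterator
    {
        $ids = array_values(array_filter($this->results, 'is_int'));
        if ($ids !== []) {
            $loaded = [];
            foreach (Content::loadByIds($ids) as $content) {
                $loaded[$content->id] = $content;
            }
            foreach ($this->results as $i => $r) {
                if (is_int($r) && isset($loaded[$r])) {
                    $this->results[$i] = $loaded[$r];
                }
            }
            // Drop IDs whose content could not be loaded.
            $this->results = array_values(
                array_filter($this->results, fn($r) => $r instanceof Content)
            );
        }
        return new ArrayIterator($this->results);
    }
}

$set = new ResultSet();
$set->results = [new Content(5), new Content(7)];

$restored = unserialize(serialize($set)); // simulated cache round trip
$idsInCache = $restored->results;         // plain IDs only

$loadedIds = [];
foreach ($restored as $content) {
    $loadedIds[] = $content->id;
}
```

With this shape, view code can keep iterating over the result set directly, and the objects are restored with a single batched lookup instead of one lookup per ID.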

Yes, please do.

@luke- I have done this in commit 24ed4ee.

@yurabakhtin Thanks, looks good to me!

@luke- Please review commit 5e87703.

Auto load pages on content search

adbe37f

yurabakhtin requested a review from luke- March 26, 2024 14:37

yurabakhtin added 2 commits March 27, 2024 13:29

Merge branch 'develop' into enh/207-content-search-speed

9923b9e

Fix default page size for tests

02182c7

luke- requested changes Mar 27, 2024

View reviewed changes

yurabakhtin and others added 5 commits March 28, 2024 11:41

Cache content searching by portion of 10 pages

e5b325f

Merge branch 'develop' into enh/207-content-search-speed

d7f759e

Refactor search methods

24ed4ee

Merge branch 'develop' into enh/207-content-search-speed

1e893ad

Cache only Content ID on searching

5e87703

luke- approved these changes Apr 2, 2024

View reviewed changes

luke- added this pull request to the merge queue Apr 2, 2024

Merged via the queue into develop with commit 317d598 Apr 2, 2024
6 checks passed
