Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

another index performance issue related to #25 #26

Closed
asakous opened this issue Nov 26, 2020 · 8 comments
Closed

another index performance issue related to #25 #26

asakous opened this issue Nov 26, 2020 · 8 comments
Assignees

Comments

@asakous
Copy link

asakous commented Nov 26, 2020

assume relevanssi table has millions record
in order to retrieve un-index post id . the query (see below) need to be execute each loop and it will perform slower and slower .
my solution is create another table that only record indexed post id (need modify code)and create a pk key for the column.
after that change left join relevanssi table to what I just created.
theory is quite simple. instead of join a big table why not just join small one.

$q = "SELECT post.ID
FROM $wpdb->posts post
LEFT JOIN $wpdb->posts parent ON (post.post_parent=parent.ID)
LEFT JOIN $relevanssi_table r ON (post.ID=r.doc)

@msaari
Copy link
Owner

msaari commented Nov 26, 2020

In general Relevanssi is not recommended for cases where the table has millions of records. That may work, if the hardware is up to it, but I've seen a case where on a shared hosting a site with four million rows in the wp_relevanssi table spent 75 seconds doing a search. Having the indexing run faster does not help there at all. (With good hardware, apparently indexing millions of posts is possible.)

My test site only has 250k rows in the wp_relevanssi table. There the lookup time increases steadily up to a point, then fluctuates without growing much. It doesn't get to problematic levels there. What kind of query times are you seeing? If you log the times in relevanssi_build_index() like this, how does it look like?

	$time_before = microtime();
	$content = $wpdb->get_results( $query ); // phpcs:ignore WordPress.DB.PreparedSQL.NotPrepared
	$time_spent = microtime() - $time_before;
	error_log("Time spent: $time_spent");

I don't mind making the indexing faster, but adding another table is a very big complication in the works, and one I'd like to avoid as far as possible. I would need to see some numbers on how bad the problem is, and how big a difference the change can make – in my case this query takes 1.29 seconds of time, when the whole indexing process takes 309 seconds. This is less than half a percent of the time spent, so it does feel like overoptimizing. But with a big database, the situation may be different. Can you provide some data on this?

@msaari msaari self-assigned this Nov 26, 2020
@asakous
Copy link
Author

asakous commented Nov 27, 2020

same hardware 400K+ record. I just change left join table and insert new post_id to new table.
it seems double the performance at least.

the result
Time spent:00:00:05.670 indexed:20
Time spent:00:00:09.724 indexed:20
Time spent:00:00:08.804 indexed:20
Time spent:00:00:08.012 indexed:20
Time spent:00:00:06.270 indexed:20
Time spent:00:00:18.588 indexed:20
Time spent:00:00:12.463 indexed:15
Time spent:00:00:26.075 indexed:10
Time spent:00:00:06.694 indexed:5
Time spent:00:00:00.694 indexed:5
Time spent:00:00:02.435 indexed:11
Time spent:00:00:07.737 indexed:16
Time spent:00:00:08.043 indexed:16
Time spent:00:00:05.737 indexed:16
Time spent:00:00:06.892 indexed:16
Time spent:00:00:04.763 indexed:16
Time spent:00:00:05.993 indexed:16
Time spent:00:00:08.580 indexed:16
-------------------------------------------- below new method
Time spent:00:00:01.518 indexed:9
Time spent:00:00:04.584 indexed:19
Time spent:00:00:04.091 indexed:24
Time spent:00:00:05.226 indexed:29
Time spent:00:00:03.573 indexed:29
Time spent:00:00:05.087 indexed:34
Time spent:00:00:04.928 indexed:34
Time spent:00:00:04.644 indexed:34
Time spent:00:00:07.229 indexed:39
Time spent:00:00:06.947 indexed:39
Time spent:00:00:05.692 indexed:39
Time spent:00:00:04.733 indexed:39

@msaari
Copy link
Owner

msaari commented Nov 27, 2020

It's faster, yes, but if you compare the total indexing time, how much faster is that? Is there a significant difference in the total indexing time? What exactly did you change here?

@asakous
Copy link
Author

asakous commented Nov 27, 2020

total indexing time.
same hardware
2116391 record in the table and I delete 150654 record.
re-index unindexed post

Time elapsed: 0:05:13 | Time remaining: we're done!-->my join table method
Time elapsed: 0:09:15 | Time remaining: we're done!

my method only affect people who have millions record in the table.
for the small table. don't bother to change it.

@msaari
Copy link
Owner

msaari commented Nov 27, 2020

Well, that is a significant difference there. On the other hand, nine minutes is much faster than I expected in the first place.

What exactly did you do here? I'd like to give it a go.

@asakous
Copy link
Author

asakous commented Nov 27, 2020

1:
CREATE TABLE wp_relevanssi_id (
doc int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (doc)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
2:indexing.php
if ( ! empty( $values ) ) {
$values = implode( ', ', $values );
$query = "INSERT IGNORE INTO $relevanssi_table (doc, term, term_reverse, content, title, comment, tag, link, author, category, excerpt, taxonomy, customfield, type, taxonomy_detail, customfield_detail, mysqlcolumn) VALUES $values";
$query2 = "INSERT IGNORE INTO wp_relevanssi_id (doc) VALUES (".$post->ID.") ";-->add
if ( $debug ) {
relevanssi_debug_echo( "Final indexing query:\n\t$query" );
}

	$wpdb->query( $query2 );-->add 
	$wpdb->query( $query ); // phpcs:ignore WordPress.DB.PreparedSQL.NotPrepared
	
}

3:indexing.php
$q = "SELECT post.ID
FROM $wpdb->posts post
LEFT JOIN $wpdb->posts parent ON (post.post_parent=parent.ID)
LEFT JOIN wp_relevanssi_id r ON (post.ID=r.doc)-->change table name
WHERE
$processed_post_filter
AND
(post.post_status IN ($valid_status)
OR
(post.post_status='inherit'
AND(
(parent.ID is not null AND (parent.post_status IN ($valid_status)))
OR (post.post_parent=0)
)
)
)

not implemented
rebuild whole index should truncate table wp_relevanssi_id too.

another possible optimization is that
when people try to rebuild whole index don't truncate table first.
the rebuild flow should be
1:create a identical relevanssi temporary table name relevanssi_temp
2:insert index to relevanssi_temp
3: when build index completed ,
rename relevanssi table to relevanssi_temp_b
rename relevanssi_temp to relevanssi
rename table relevanssi_temp_b to relevanssi_temp
truncate relevanssi_temp
that way to aviod search downtime

@msaari
Copy link
Owner

msaari commented Nov 27, 2020

Thanks. I'll try this myself. I was thinking it might be possible to implement this as an add-on plugin that could be used when necessary. That might make sense. I'll have to think about this.

Your optimization is a nice trick, but Relevanssi is already taking up so much database space that doubling that – even momentarily – is probably too much.

@asakous
Copy link
Author

asakous commented Nov 27, 2020

thanks.

@asakous asakous closed this as completed Nov 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants