Facet calculation seems tied to search query rather than result set #35

Open
minusdavid opened this issue May 23, 2022 · 15 comments

@minusdavid

I have a Zebra database with over 1 million records that contain the word "the". If I do a complex query like this:

@attrset Bib-1 @Not @or @or @or @or @or @attr 1=36 @attr 4=1 @attr 6=3 @attr 9=32 @attr 2=102 "Yogi the bear" @attr 1=4 @attr 4=1 @attr 6=3 @attr 9=28 @attr 2=102 "Yogi the bear" @attr 1=36 @attr 4=1 @attr 9=26 @attr 2=102 "Yogi the bear" @attr 1=4 @attr 4=6 @attr 9=24 @attr 2=102 "Yogi the bear" @attr 4=6 @attr 5=1 @attr 9=14 @attr 2=102 "yogi? the? bear? " @attr 4=6 @attr 9=14 @attr 2=102 "Yogi the bear" @attr 1=9011 @attr 14=1 1

It takes about 30 seconds to return, with a hit count of 3325. Getting a facet response takes at least 60 seconds using yaz-client. (I was unable to get the Perl ZOOM libraries to return a facet response, even with connection timeouts above 60 seconds.)

If I do a very similar query without the "the":

@attrset Bib-1 @Not @or @or @or @or @or @attr 1=36 @attr 4=1 @attr 6=3 @attr 9=32 @attr 2=102 "Yogi bear" @attr 1=4 @attr 4=1 @attr 6=3 @attr 9=28 @attr 2=102 "Yogi bear" @attr 1=36 @attr 4=1 @attr 9=26 @attr 2=102 "Yogi bear" @attr 1=4 @attr 4=6 @attr 9=24 @attr 2=102 "Yogi bear" @attr 4=6 @attr 5=1 @attr 9=14 @attr 2=102 "Yogi? bear? " @attr 4=6 @attr 9=14 @attr 2=102 "Yogi bear" @attr 1=9011 @attr 14=1 1

It returns instantly with a hit count of 3325. Getting a facet response takes about 2 seconds using yaz-client. (Perl ZOOM libraries cope easily.)

--

Since the result set should be the same for both queries, it seems that the facet calculation cannot be based on the result set alone, and must involve the records that contribute to the creation of the result set.

I don't know enough about Zebra's internals to troubleshoot this one too much further.
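A toy model of the suspicion above, not Zebra's actual code: if a result set is evaluated by merging the posting lists behind it, two queries with the same hit count can cost very different amounts of work. The `posting_list` type and `intersect_count` function are hypothetical names invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical posting list: sorted, unique record ids for one term. */
typedef struct {
    const int *ids;
    size_t len;
} posting_list;

/* Intersect two sorted lists with a linear merge.
 * Returns the number of common ids; *reads receives the number of
 * list entries that had to be consumed to get there. */
size_t intersect_count(posting_list a, posting_list b, size_t *reads)
{
    size_t i = 0, j = 0, hits = 0;

    assert(reads != NULL);
    *reads = 0;
    while (i < a.len && j < b.len) {
        if (a.ids[i] == b.ids[j]) {
            hits++;
            i++;
            j++;
            *reads += 2;      /* one entry consumed from each list */
        } else if (a.ids[i] < b.ids[j]) {
            i++;
            (*reads)++;
        } else {
            j++;
            (*reads)++;
        }
    }
    return hits;
}
```

Intersecting a short list with a very long one (a stand-in for "the") gives the same hit count as intersecting it with another short list, but the read count grows with the long list. If each facet term triggers another pass of this kind, the cost would repeat per term.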

@minusdavid
Author

Actually, my counts were slightly off. The "the" query returned 3323 results, while the query without "the" returned 3325 results. Comparing the Zebra facet responses, the difference is much greater than 2, although I'm guessing the term occurrence counts are based on indexed values rather than records...

--

The Zebra configuration is using "facetNumRecs:1000" so in theory that should limit it further?

It collects 20 terms as per "int no_collect_terms = 20" in index/retrieve.c...

At a glance, the "term_collect_freq" function looks like it should use the result set. But beyond that it starts getting a bit obscure for me.

Do you know what might be causing this large difference in facet calculation times?

@minusdavid
Author

I cloned idzebra, added some additional logging, statically compiled, and then ran on a 1,000,000+ records Zebra database.

For 20 facet terms, it seems to be taking about 3 seconds per term, which then aggregates up to that 60+ seconds.

But I must not be logging the right thing as the output looks the same for the 60 second facet generation as the 2 second facet generation... aside from the first one being much slower...
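For reference, a minimal sketch of the kind of timing helper that could sit around each per-term step when adding logging like this. It assumes the timestamps come from something like `clock_gettime(CLOCK_MONOTONIC, ...)`; the function names `elapsed_ms` and `projected_total_ms` are made up for illustration and are not part of Zebra.

```c
#include <assert.h>
#include <time.h>

/* Milliseconds between two timestamps, e.g. taken before and after
 * one facet term is processed. */
double elapsed_ms(struct timespec start, struct timespec end)
{
    /* end must not be earlier than start */
    assert(end.tv_sec > start.tv_sec ||
           (end.tv_sec == start.tv_sec && end.tv_nsec >= start.tv_nsec));
    return (double)(end.tv_sec - start.tv_sec) * 1000.0 +
           (double)(end.tv_nsec - start.tv_nsec) / 1e6;
}

/* Projected total when every term costs roughly the same. */
double projected_total_ms(double per_term_ms, int terms)
{
    assert(terms >= 0);
    return per_term_ms * (double)terms;
}
```

With the figures reported above, `projected_total_ms(3000.0, 20)` gives 60000 ms, which matches the observed 60+ seconds.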

@minusdavid
Author

The slowdown appears to be in index/zsets.c in the zebra_count_set function.

In the query without "the", the while loop with rset_read executes quickly and with a small number of iterations.

However, for the query with "the", the while loop takes a long time. The first rset_read call can sometimes take up to 2 seconds, and the loop iterates many more times (while ultimately ending up with the same occurrence count).
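One way a counting loop can iterate far more often and still report the same count is if it reads one key per term occurrence and only counts distinct records. The sketch below illustrates that idea under that assumption; `occurrence_key` and `count_records` are hypothetical and do not reproduce zebra_count_set.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical index key: one entry per term occurrence, not per record. */
typedef struct {
    long record_id;   /* record the occurrence belongs to */
    int position;     /* position of the occurrence inside the record */
} occurrence_key;

/* Walk keys sorted by record_id and count distinct records.
 * *iterations receives the number of keys that were read. */
size_t count_records(const occurrence_key *keys, size_t n, size_t *iterations)
{
    size_t count = 0;
    long previous = -1;   /* assumes record ids are non-negative */

    assert(iterations != NULL);
    *iterations = 0;
    for (size_t i = 0; i < n; i++) {
        (*iterations)++;
        if (keys[i].record_id != previous) {
            previous = keys[i].record_id;
            count++;      /* first key seen for this record */
        }
    }
    return count;
}
```

Three records with one key each and three records with several keys each both yield a count of 3, but the second case takes more iterations.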

@minusdavid
Author

The "zebra_count_set" is called from "freq_term" in index/retrieve.c via the following:

zebra_count_set(zh, rset, &hits, zh->approx_limit);

The issue must be with the result set in rset then...

Over the past 40 minutes, I've managed to create 4 temporary result files that are 1.3GB in size... as per #33

@minusdavid
Author

So going back to the original result set... it takes nearly 30 seconds to do that lookup for "the", which probably makes sense as there are over 1,000,000 records that contain "the", although historically I thought Zebra was supposed to be able to process that many records quickly...

14:56:47-24/05 zebrasrv(1) [log] dict_lookup_grep: (\x01\x01\x03)(th(e|\xC3\xA9|\xC3\xA8|\xC3\xAA|\xE1\xBA\xBD|\xC4\x95|\xC4\x99|\xC4\x97|\xC4\x9B|\xC8\x85|\xC8\x87).*)

@minusdavid
Author

Ah, not only that... now that I look at that regex, it's also trying all kinds of variations of "e" with accents. That must be the ICU coming into play. And then there's truncation there as well. So it's certainly doing a lot there...
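To see how much such a pattern admits, here is a small sketch using POSIX `regcomp`/`regexec` rather than Zebra's own dictionary matcher. The pattern is a shortened form of the logged one (the index prefix and most accent variants are left out), and `matches_expanded_the` is a name invented for this example.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Returns 1 if term matches the expanded pattern, 0 if it does not,
 * and -1 if the pattern fails to compile. */
int matches_expanded_the(const char *term)
{
    /* "th", then e or an accented e (as UTF-8 bytes), then anything.
     * The trailing ".*" plays the role of right truncation. */
    static const char pattern[] = "^th(e|\xC3\xA9|\xC3\xA8|\xC3\xAA).*$";
    regex_t re;
    int rc;

    assert(term != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, term, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Under this pattern "the", "theory", and "thé" all match, so a truncated, accent-folded lookup for "the" covers many more dictionary entries than the single word.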

@minusdavid
Author

When I look at "freq_term" in index/retrieve.c, I can get the correct hit count from "reset_set". But then I don't understand what's happening with the "rset" RSET struct that's passed to "zebra_count_set".

It looks like an empty rset is created with the original result set as its child...

@minusdavid
Author

I've run out of time but it's kind of looking like "the result set" also contains all the result sets that went into creating it?

If that's true, that might explain why two result sets with the same hit count can have vastly different faceting times?

It was interesting hacking on Zebra, but the code gets a bit obscure for me.

@MikeTaylor
Contributor

Hi, @minusdavid, and thanks for this and other well-documented issues. Sorry for radio silence. @adamdickmeiss, who is the principal Zebra wizard, is out of the office this week. I imagine he will get back to you early next week. Sorry for the delay, and thank you for the investigations so far.

@minusdavid
Author

minusdavid commented May 24, 2022

No worries, @MikeTaylor. My apologies for all the comments! Hopefully they're helpful.

I probably don't have heaps of time to work on this particular issue, but if @adamdickmeiss can give me some guidance I think I'm in a good place to do more troubleshooting.

It's too bad I didn't set up my little Zebra dev environment sooner. I could've probably sent a PR for that memMax issue heh.

@minusdavid
Author

> Hi, @minusdavid, and thanks for this and other well-documented issues. Sorry for radio silence. @adamdickmeiss, who is the principal Zebra wizard, is out of the office this week. I imagine he will get back to you early next week. Sorry for the delay, and thank you for the investigations so far.

Is there any more word on this one?

@minusdavid
Author

Still noticing very slow facet calculation. Considerably slower than the actual search even.

@mrenvoize

It would be great to see this one move forward. It might help stave off or slow the move to Elasticsearch we're seeing. I have a soft spot for Zebra still personally.

@sebhammer

sebhammer commented Oct 17, 2023 via email

@minusdavid
Author

Thanks for getting back to us, Sebastian.

It was about a decade ago that the Koha community started using the facets in Zebra. I think we still have it on as the default option for new installations, although larger libraries have to turn it off because it's too slow on a large result set. I think that the larger databases will need to switch over to Elastic at some point. As you say, the writing is on the wall.

It's too bad though as Zebra is such a great lightweight tool. Years ago, I actually learned to read C just so that I could read Zebra source code! Zebra was the scariest/least understood part of the Koha stack, so naturally I wanted to learn everything about it.

Like you, though, I find it tough to make time to hack on Zebra/YAZ. I'll keep that moral support in mind.
