Solr 7 - Highlighting #4836

Closed
mheppler opened this issue Jul 12, 2018 · 17 comments

Comments

@mheppler
Contributor

mheppler commented Jul 12, 2018

[Note: this issue has been changed to capture only highlighting. The information on search result ordering is being kept for historical reasons and future development.]

I will preface this by saying I am searching production as a super user for the first time in a long while, so maybe I am not familiar with the type of results I should expect... that said...

When searching for "murray" on production -- which I do quite regularly on production as a guest -- I usually expect to find the MRA dataverse up near the top of the results. Currently 4th in this set of results...

[Screenshot: 2018-07-12 at 5:06:56 pm]

As a super user, the MRA dataverse comes in at a cool 39, on the 4th page of results...

[Screenshot: 2018-07-12 at 5:13:45 pm]

I had wrongfully accused @landreev of breaking indexing, but he assures me that all is right in our config settings to bump dataverses. Maybe this relates to recent updates to Solr. I am not sure.

There appear to be a lot of unpublished datasets from the Robert M. Townsend Dataverse in the first four pages of results. There is nothing to indicate why those results return for "murray".

Also, highlighting in the search results was not added to the new Solr settings and needs to be returned ASAP.

@landreev
Contributor

landreev commented Jul 13, 2018

FWIW, the way I understand the issue, the fact that the superuser is seeing results different from regular users is NOT the main problem. (Admins/users with more permissions are supposed to see different search results by design).
The main issue is that our system of "bumping" certain hits up in the sort order - so that dataverses would appear first, then datasets, and then files - is no longer working.
We supply the configuration that's supposed to achieve this in the file solrconfig.xml (we also explain this in the solr install guide), as follows:

<str name="qf">
dvName^170
dvSubject^160
dvDescription^150
dvAffiliation^140
title^130
subject^120
keyword^110
topicClassValue^100
dsDescriptionValue^90
authorName^80
authorAffiliation^70
publicationCitation^60
producerName^50
fileName^40
fileDescription^30
variableLabel^20
variableName^10
text^1.0
</str>

This used to work back when we were using solr 4* (the config lines are taken verbatim from our old solrconfig). But, for whatever reason, they appear not to be producing the desired effect under solr 7.

The fact that the superuser is seeing things in a different order most likely means that the order is simply random for all users.

For clarity, I would rename the issue to something like "Ordering of solr search results is broken under solr 7".

(there may be something super simple we are missing; maybe it just needs another piece of configuration somewhere else we are still missing...)
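
One way to check whether the qf boosts reach the scorer at all is to run the same query with debugQuery=true and read the per-field score explanation in the response. The sketch below only builds such a request URL. It assumes the request is parsed by edismax (qf is a dismax/edismax parameter), reuses the core name collection1 and port 8983 from the reload URL quoted later in this thread, and copies just four of the boosts listed above. The function name is made up for illustration.

```python
from urllib.parse import urlencode

# A few of the boosts from the qf block above (abbreviated on purpose).
QF_BOOSTS = {"dvName": 170, "title": 130, "fileName": 40, "text": 1.0}


def build_debug_query_url(term, base_url="http://localhost:8983/solr/collection1/select"):
    """Build a select URL that asks Solr to explain how each hit was scored."""
    # qf is a space-separated list of field^boost pairs.
    qf = " ".join(f"{field}^{boost}" for field, boost in QF_BOOSTS.items())
    params = {
        "q": term,
        "defType": "edismax",   # assumption: the handler uses (e)dismax
        "qf": qf,
        "debugQuery": "true",   # adds a score explanation per document
    }
    return f"{base_url}?{urlencode(params)}"


print(build_debug_query_url("murray"))
```

If the explanation for a dataverse hit shows no dvName^170 factor, the boost is probably not being applied; if it does show up, the ordering problem more likely lies elsewhere.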

@djbrooke djbrooke changed the title Search - Super User vs Guest seeing different results? Ordering of solr search results is broken under solr 7 Jul 13, 2018
@mheppler
Contributor Author

@landreev thank you for investigating and clarifying. :dataverseman:

@pdurbin
Member

pdurbin commented Jul 13, 2018

Are we sure boosting isn't working under Solr 7? @matthew-a-dunlap wrote "Also, confirm the solr boosting is working as expected. I did a simple test, taking it out and putting it back, and it seemed to work." over at #4158 (comment) and he documented how to remove the boosting (if installations don't like it) in f90e00a.

@matthew-a-dunlap
Contributor

matthew-a-dunlap commented Jul 13, 2018

I know when I did my test it was mainly to see if there was any impact. I did not have a great understanding of what the desired outcome was so I did not test deeply. I don't mind looking into this more but I won't be able to do so before my appointment today.

@mheppler
Contributor Author

mheppler commented Jul 13, 2018

I am quite sure that as a super user, I found it rather frustrating that the dataverse I was looking for by searching "murray" was four pages of results deep because unpublished and deaccessioned datasets were being returned in the top 10 results just because the MRA is listed as a distributor.

The MRA and the datasets and dataverses that are its children should be bumped higher than a dataset that gets a hit on the distributor field.

I will again bring up that the highlighting from Solr needs to be turned back on, which would make it a lot easier to determine why these results are being returned by displaying values with a bold styling, right in the results card.

[Screenshot: 2018-07-13 at 11:55:57 am]

@matthew-a-dunlap
Contributor

matthew-a-dunlap commented Jul 15, 2018

I spent a bit of time investigating the solr highlighting issue. We had punted on it during the upgrade because the problem seemed amorphous and we didn't want to hold up the release.

The highlighting is on in some form, but weird and inconsistent. For example, the words "test" and "test1" get highlighted in the description, but "Murray" and "murray" do not. "murrayz" does though. Maybe a dictionary is in play?

It looks like we just use the default configuration at 4.2.1 (or earlier?). When I remove the whole section about it from solrconfig.xml some form of highlighting happens. I don't think a reindex is needed for highlighting config changes.

My guess is that a default configuration exists outside our solrconfig.xml and that configuration differs from the defaults we expected back in 4.2.1. But just a guess. I wouldn't be surprised if this is also part of what's happening with the superuser search results.

The newer documentation does not discuss the xml configuration files, but looking back this section is of help: https://wiki.apache.org/solr/SolrConfigXml#The_Highlighter_plugin_configuration_section . We may need to go about using the new managed schema approach for solr, as no one is documenting the xml configurations for the newer versions (even though they are supported).

Hopefully this'll be of help when we pick up this work.

@mheppler mheppler changed the title Ordering of solr search results is broken under solr 7 Solr 7 - Highlighting + ordering of search results is broken/wacky Jul 16, 2018
@mheppler
Contributor Author

Thanks for the input @matthew-a-dunlap. I have added highlighting to the issue title. We need this feature back. The combination of the two problems makes for some confusing results.

@djbrooke
Contributor

Thanks for the investigation!

Note to self for backlog grooming, we should estimate this with and without the highlighting piece and consider smaller batches. I'm OK with no highlighting (I made the call to not include it earlier and we haven't heard any feedback aside from @mheppler) but I'm not OK with no boosting.

@TaniaSchlatter
Member

Is there a description of what "working as expected" is that we can use for a baseline shared understanding, and to make judgements against?

@djbrooke
Contributor

  • We used to have rules about boosting and highlighting; we should investigate why these are no longer being followed (solr upgrade related or otherwise) and reapply them
  • There are some things happening in the search (ex. highlighting of "king") that we should investigate and document

There was some discussion of re-evaluating how we rank search results, but we'll not do this now because this would be a large effort. Instead, we'll plan to restore how it was.

@mheppler
Contributor Author

mheppler commented Jul 18, 2018

The "super user" aspect of this story may be a red herring. After demoing this issue in our sprint planning mtg, we saw questionable results returned for a guest.

[Screenshot: 2018-07-18 at 3:08:55 pm]

The last three results for a guest searching "murray" perfectly illustrate this issue. The 8th and 9th results have "murray" hits in the distributor field (which are not highlighted in bold) and the 10th result has a hit in the title (also not highlighted).

And yes, the top three results are no better. Three files with "murray" in the name are returned higher than a dataverse name hit.

[Screenshot: 2018-07-18 at 3:12:35 pm]

So there appear to be issues not only with dataverse vs dataset vs file bumping, but also with title/name vs distributor bumping.

@matthew-a-dunlap
Contributor

matthew-a-dunlap commented Jul 30, 2018

While waiting on things for other stories, I took another look at the configs around this and touched base with folks on the solr irc. Their recommendations were to start over with a new solrconfig.xml file out of 7.3.0 and customize that as we need it. It seems like a good path as we mostly used defaults before anyways.

[17:04] matthew-dv: Question for y'all! The open source java project I'm a part of (https://dataverse.org/) recently upgraded our solr version from 4.2.1 to 7.3.0. We also updated the java libraries. We opted to use our solrconfig.xml instead of the managed schema approach. After the upgrade most functionality is working but we have noticed that our highlighting is acting differently.
[17:04] matthew-dv: It seems the configuration we have in solrconfig.xml does not have an impact for if I remove the whole section and reload http://localhost:8983/solr/admin/cores?action=RELOAD&core=collection1 the highlighting is in place. Where should I look next to understand this and fix our highlighting? Fwiw it looks like we used the default highlighting that came with the sample 4.2.1 configuration.
[17:06] elyograg: i have never used highlighting, don't know how. Probably what you should do is start with the solrconfigl.xml file in the examples for 7.3.0 and build up a config that does what you need.
[17:31] matthew-dv: Thanks @elyoqrag! I think looking over a few various solrconfig.xml's that we may have missed some of the defined default. I'll look to those next
[17:52] ctargett: matthew-dv - between Solr 4.2.1 and 7.3.0, highlighting underwent a big transformation. I think some of the classes that might have been default in 4.2.1 have been removed and replaced. I think it will be a lot easier to re-implement it as new instead of trying to figure out what changed.
[18:13] matthew-dv: Thanks @ctargett, I think that's possibly the best approach. I tried doublechecking how we ported over functionality from before but nothing in that seems too off. I guess we'll have to dig into the new configurations even more.

@matthew-a-dunlap matthew-a-dunlap self-assigned this Aug 6, 2018
matthew-a-dunlap added a commit that referenced this issue Aug 6, 2018
Also storing two other versions of the config in the project for dev work.
@matthew-a-dunlap
Contributor

matthew-a-dunlap commented Aug 6, 2018

Some to-do:

  • Get highlighting working
    • Cause of immediate issue is related to our text_en field type definition, specifically its filter PorterStemFilterFactory. We either need to reconfigure PorterStemFilterFactory or use a different stemmer.
      • Root cause was that after the upgrade the stemmer was no longer keeping the original word when stemming
      • While in here I should probably clean up some of our fieldType definitions as a lot of them look unused
  • Get boost working
  • Update our core creation steps as they don't follow the correct practice and may have caused unforeseen issues
    • Test changes
    • Maybe add some documentation on how editing solrconfig.xml & schema.xml works
      • solrconfig.xml edits should probably be followed by a reindex (some changes may not need one, but many do)
      • schema.xml requires reindexing
  • Maybe remove the managed-schema from our deploy steps (it should be getting ignored already)
  • Remove the old solrconfig.xml & schema.xml that were added for ease of development

matthew-a-dunlap added a commit that referenced this issue Aug 6, 2018
We are not using collections, those are only a part of SolrCloud. This is the first half of fixing our installation steps via recommendation on solr's IRC
matthew-a-dunlap added a commit that referenced this issue Aug 6, 2018
Before we were creating a folder with our configs and then installing, but the installer itself expects the folder passed with -d to be a reference template. It did not seem to break anything but is bad practice and came up when asking for help from folks at solr
@matthew-a-dunlap
Contributor

matthew-a-dunlap commented Aug 6, 2018

Discussing with the folks in the solr IRC, I learned that if you do not provide configuration for highlighting but your queries to solr have highlighting params, solr uses a system default. This happens with other aspects of configuration as well. This is why removing the highlighting section from our configs had no effect: either way, solr ended up with the same configuration.

The current solrconfig.xml we have in develop is not actually much different than the default. Tomorrow I'll start modifying the solrconfig.xml section for highlighting to get it back to a more acceptable form. What exactly we want in the end is vague, but if anything I'll try to understand why "Murray" does not highlight but "Murra*" does.
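
Since solr highlights whenever the request carries highlighting params, one way to pin down which hits come back without any highlight is to compare the returned docs against the "highlighting" section of the JSON response. The sketch below does that on an already-parsed response. It assumes the unique key field is called id, and the sample response is hand-written for illustration, not real output from our index. The function name is hypothetical.

```python
def docs_without_highlights(solr_response):
    """Return the ids of returned docs that carry no highlight snippets."""
    # Solr keys the "highlighting" section by the unique key of each doc;
    # each entry maps a field name to a list of snippets.
    highlighting = solr_response.get("highlighting", {})
    missing = []
    for doc in solr_response["response"]["docs"]:
        snippets = highlighting.get(doc["id"], {})
        if not any(snippets.values()):
            missing.append(doc["id"])
    return missing


# Hand-written sample in the shape of a JSON response with hl=true
# (assumption: "id" is the unique key; the values are made up).
sample = {
    "response": {"docs": [{"id": "dataset_1"}, {"id": "dataset_2"}]},
    "highlighting": {
        "dataset_1": {"title": ["<em>test</em> data"]},
        "dataset_2": {},
    },
}

print(docs_without_highlights(sample))  # ['dataset_2']
```

Running this for "Murray" and for "Murra*" against the same core would show whether the two queries differ in which fields produce snippets.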

@matthew-a-dunlap
Contributor

matthew-a-dunlap commented Aug 7, 2018

Looks like the highlighting problem is related to how the schema field type text_en has changed between solr 4.6 and 7.3. Switching dsDescription & title to text_general causes the exact matches to show up correctly for highlighting. We had switched away from text_general for better English language support (#444).

Next step is to alter text_en's configuration or switch to a newer type if one is available. Though we may want to think of a more holistic approach as we are looking to support other languages better
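
The to-do list above names the root cause as the stemmer no longer keeping the original word. As a rough illustration of what "keeping the original word" means in an analysis chain, here is a toy model in Python: every token is emitted twice, one copy protected from stemming, and duplicates at the same position are dropped afterwards. Solr ships filters with this behaviour (KeywordRepeatFilterFactory and RemoveDuplicatesTokenFilterFactory), but whether the eventual fix uses exactly these is not stated in this thread. The stemmer here is a made-up stand-in, not the Porter algorithm, and all function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    position: int
    keyword: bool = False  # keyword tokens are skipped by the stemmer


def toy_stem(word):
    # Made-up stand-in for a real stemmer; NOT the Porter algorithm.
    if word.endswith("y") and len(word) > 3:
        return word[:-1] + "i"
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


def tokenize(text):
    # Lowercase and split on whitespace, recording each token's position.
    return [Token(w, i) for i, w in enumerate(text.lower().split())]


def keyword_repeat(tokens):
    # Emit every token twice: once protected as a keyword, once stemmable.
    for t in tokens:
        yield Token(t.text, t.position, keyword=True)
        yield Token(t.text, t.position, keyword=False)


def stem(tokens):
    # Stem only the copies that are not marked as keywords.
    for t in tokens:
        yield t if t.keyword else Token(toy_stem(t.text), t.position)


def remove_duplicates(tokens):
    # Drop a token if the same text was already seen at the same position.
    seen = set()
    for t in tokens:
        key = (t.text, t.position)
        if key not in seen:
            seen.add(key)
            yield t


def analyze(text):
    chain = remove_duplicates(stem(keyword_repeat(tokenize(text))))
    return [t.text for t in chain]


print(analyze("Murray test"))  # ['murray', 'murrai', 'test']
```

In this toy model "test" stems to itself while "murray" does not, so only "murray" gains a second token. That is at least consistent with the earlier observation that "test" highlighted and "Murray" did not, though it is not proof of the mechanism.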

@matthew-a-dunlap
Contributor

matthew-a-dunlap commented Aug 8, 2018

I have created a pull request with just the fix for solr highlighting. My "best practice" solr fixes and my start on fixing the boosting are not in this branch.

For this fix to take effect, the schema.xml file in dataverse needs to be added to solr and we must reindex.

@matthew-a-dunlap matthew-a-dunlap removed their assignment Aug 8, 2018
@matthew-a-dunlap matthew-a-dunlap changed the title Solr 7 - Highlighting + ordering of search results is broken/wacky Solr 7 - Highlighting Aug 8, 2018
@pdurbin
Member

pdurbin commented Aug 9, 2018

Pull request #4937 looks good so I moved it to QA in https://waffle.io/IQSS/dataverse

Here's a copy and paste from my review:

Looks good. I'm glad to see the solution ("Solution was to ensure original word is kept by stemmer") and the link to the answer on Stack Overflow.

Please note that if you want to see fewer lines in the diff (it's mostly whitespace changes), you should add ?w=1 like this: https://github.com/IQSS/dataverse/pull/4937/files?w=1 . I mentioned to @matthew-a-dunlap that I posted some thoughts about whitespace and such back in #3418.
