Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stale search results due to SOLR latency & reindex failures #628

Closed
hornc opened this issue Nov 1, 2017 · 21 comments
Closed

Stale search results due to SOLR latency & reindex failures #628

hornc opened this issue Nov 1, 2017 · 21 comments
Assignees
Labels
Affects: Data Issues that affect book/author metadata or user/account data. [managed] Lead: @cdrini Issues overseen by Drini (Staff: Team Lead & Solr, Library Explorer, i18n) [managed] Module: Solr Issues related to the configuration or use of the Solr subsystem. [managed] Priority: 2 Important, as time permits. [managed] State: Work In Progress This issue is being actively worked on. [managed] Theme: Search Issues related to search UI and backend. [managed] Type: Bug Something isn't working. [managed] Type: Epic A feature or refactor that is big enough to require subissues. [managed]

Comments

@hornc
Copy link
Collaborator

hornc commented Nov 1, 2017

I have fixed the majority of author records from issue #482
but the search results still list them all:
https://openlibrary.org/search/authors?q=please+see

I have tried manually adding the records to the admin/solr list but I beleive something about these records causes the reindex to (silently?) fail so they are skipped.

There may be a general problem that reindexes fail and are not suitably logged for us to investigate. It was difficult to get decent logs to gather more info on this issue.

@dvanduzer I'd love help with this one if you are looking for something to do with SOLR ;)

@hornc hornc added the Module: Solr Issues related to the configuration or use of the Solr subsystem. [managed] label Nov 1, 2017
@LeadSongDog
Copy link

Ok, looking at one of these, OL3392693A, and inserting a publisher Science Research Associates with:
https://openlibrary.org/books/OL19033152M/SRA_Primary_Mental_Abilities/edit
expected /type/author, found /type/delete
Searching:
https://openlibrary.org/search/authors?q=Louis+Leon+Thurstone&mode=everything
I see multiple responses.
The first response:
"
Louis Leon Thurstone 1887 - 1955
11 books about Psychology, Methodology, Intelligence tests, including Primary mental abilities
"
-actually lists three works of one edition each

The second response:
"
Louis Leon Thurstone 1887 - 1955
25 books about Psychology, Factor analysis, Accessible book, Protected DAISY, Methodology, Educational tests and measurements, Psychological tests, Intelligence tests, Psychometrics, Primary mental abilities tests, including The measurement of values
"
-actually lists 37 works in a total of 60 editions
-of which 16 works in 26 editions are by Louis Leon Thurstone and the remainder by Thurstone, L. L.

Four more responses have no works or editions (three having been deleted but not deindexed, and one simply being blank of works):
"
please see Louis Leon Thurstone
4 books including SRA Primary Mental Abilities
please see Louis Leon Thurstone
1 book including Multiple Factor Analysis
please see Louis Leon Thurstone 1887 - 1955
0 books including
please see Louis Leon Thurstone
0 books including
"

@LeadSongDog
Copy link

A few minutes later that's changed from:
37 works in 60 editions of which 16 works in 26 editions are by Louis Leon Thurstone and the remainder by Thurstone, L. L.
to:
40 works in 65 editions of which 19 works in 29 editions are by Louis Leon Thurstone and the remainder by Thurstone, L. L.
This lag in indexing is a real pain...

@LeadSongDog
Copy link

LeadSongDog commented Dec 14, 2017

@hornc I see that on 18 Jan 2016 you reverted spam at https://openlibrary.org/works/OL102793W/The_Jesuits_in_North_America_in_the_seventeenth_century?m=history
Yet the spammed info remained, showing up in https://openlibrary.org/authors/OL29501A/Francis_Parkman though not in https://openlibrary.org/works/OL102793W/ until I deleted the WORK cover image today (and remained cached for perhaps another half-hour), then it changed to show the work title. This suggests that there may be other WORK covers that have no valid corresponding EDITION covers. There ought to be a way to query for this, but it's far beyond my $0 pay grade to do so.

@hornc
Copy link
Collaborator Author

hornc commented Feb 10, 2018

From looking at production updater logs there are numerous timeouts when POSTing update to http://ol-solr2:8983/solr/update

This causes the updater script to die, then re-spawn. The batch of changes for which the POST failed (last 60 seconds worth of changed items, or greater than 100 items) is not updated in SOLR.

I'm theorising that some item updates always cause the POST to timeout since they never get indexed when added to the /admin/solr list

solr-updater.log.5-2017-08-10 05:24:19 [24475] [openlibrary.solr-updater] [INFO] BEGIN commit
solr-updater.log.5-2017-08-10 05:24:19 [24475] [openlibrary.solr] [INFO] POSTing update to http://ol-solr2:8983/solr/update
solr-updater.log.5-Traceback (most recent call last):
solr-updater.log.5-  File "/opt/openlibrary/openlibrary/scripts/new-solr-updater.py", line 279, in <module>
solr-updater.log.5-    main()
solr-updater.log.5-  File "/opt/openlibrary/openlibrary/scripts/new-solr-updater.py", line 267, in main
solr-updater.log.5-    solr.commit(ndocs=count)
solr-updater.log.5-  File "/opt/openlibrary/openlibrary/scripts/new-solr-updater.py", line 206, in commit
solr-updater.log.5-    self._solr_commit()
solr-updater.log.5-  File "/opt/openlibrary/openlibrary/scripts/new-solr-updater.py", line 213, in _solr_commit
solr-updater.log.5-    update_work.solr_update(['<commit/>'], index="works")
solr-updater.log.5-  File "/opt/openlibrary/deploys/openlibrary/51ac973/openlibrary/solr/update_work.py", line 780, in solr_update
solr-updater.log.5-    response = h1.getresponse()
solr-updater.log.5-  File "/usr/lib/python2.7/httplib.py", line 1045, in getresponse
solr-updater.log.5-    response.begin()
solr-updater.log.5-  File "/usr/lib/python2.7/httplib.py", line 409, in begin
solr-updater.log.5-    version, status, reason = self._read_status()
solr-updater.log.5-  File "/usr/lib/python2.7/httplib.py", line 365, in _read_status
solr-updater.log.5-    line = self.fp.readline(_MAXLINE + 1)
solr-updater.log.5-  File "/usr/lib/python2.7/socket.py", line 476, in readline
solr-updater.log.5-    data = self._sock.recv(self._rbufsize)
solr-updater.log.5:socket.timeout: timed out

Trying to determine how often this socket.timeout: timed out happens:
755 between 2017-08-03 04:40:54 and 2017-08-10 05:34:22
1278 between 2017-08-10 05:34:22 and 2017-08-21 10:51:17
628 between 2017-08-27 13:23:01 and 2017-09-02 14:17:28

so roughly 100 times per day.

@LeadSongDog
Copy link

So roughly every 900 seconds it loses 60 seconds of transactions? Ugh! Is there a regular time interval, or is that just a mean figure?

@hornc
Copy link
Collaborator Author

hornc commented Feb 13, 2018

That does sound bad doesn't it :( It is just a mean figure. From what I have seen in the logs, sometimes it is only one record in the 60 second batch, but other times there are other, presumably ok updates that are not being committed too.

I'm still digging into the causes, but have eliminated a range of potential issues in the OL code and updater script (and adding tests and docstrings to the OL code!), so hopefully I am close to finding the root.

@tfmorris
Copy link
Contributor

Would reducing the batch size be a potential workaround for this issue?

@mekarpeles mekarpeles added this to @hornc in 2018 Q2 Mar 13, 2018
2018 Q2 automation moved this from @hornc to Done Mar 15, 2018
@tfmorris
Copy link
Contributor

So, what's implicit in closing this is that the Feb. 10 hypothesis about this being caused by Solr update timeouts was incorrect. I believe that that is, in the main, true, but I wonder if we've got a (smaller) hole where timeouts or other errors in the Solr updater cause us to lose updates.

Do we have a way of detecting and/or recovering from that scenario? Should we be doing periodic full reindexing sweeps?

@hornc
Copy link
Collaborator Author

hornc commented Mar 16, 2018

@tfmorris I think you are correct. I was wrong that it was the cause of this issue, but the logs appeared to show it happening, so it must be having some negative effect. This issue turned out to leave no error traces in logs. I'm trying to go over all the things I've learned and raise new issues for what is outstanding. I there are likely multiple outstanding issues. I want to add issues separating and characterising them, get an idea of their impact.

As an example, I have discovered that there is no automated way that orphaned editions are removed from Solr when they are fixed by associating them with a work... so the orphan will appear as effectivly a duplicate in search results. There may already be an open issue for this one. I'll need to look for it.

@sbshah97
Copy link
Contributor

@hornc Can #482 be closed as well? As mentioned in the first comment in this Issue?

@hornc
Copy link
Collaborator Author

hornc commented Mar 19, 2018

#891 fixed a real bug, but these records are still not being removed because there are still Edition records that reference these Authors. This is an additional data problem with the records on top of the code bug. Once the data is tidied and the stale reference removed, these will be reindexable.

@tfmorris
Copy link
Contributor

@hornc Do you have an example of an edition record which references a to-be-deleted author? Edition pages only show the work author, as far as I can tell, even though I can see the reference in the JSON.

@LeadSongDog
Copy link

See
https://openlibrary.org/authors/OL1422909A/Robert_Ellis_undifferentiated
(until recently just "Robert Ellis".
It still lists 57 works, most as having one edition. Most of those, when one clicks through, show a different Robert Ellis as author, yet they still appear under OL1422909A

@mekarpeles mekarpeles moved this from Done to In Progress in 2018 Q2 Jun 21, 2018
@LeadSongDog
Copy link

A year on from my above, https://openlibrary.org/books/OL9761061M.json still shows the old OL1422909A even though the also-linked work OL5805701W shows OL5207025A.

@cdrini
Copy link
Collaborator

cdrini commented Jul 19, 2019

The existing cases should be fixed by the solr reindex. I created a new issue to track the reindexing progress: #2222 . Although note the root cause as to why these aren't getting updating during the normal flow will not be fixed by that.

@mekarpeles mekarpeles changed the title Cleaned up author records still appearing in search results / SOLR reindex failing Stale search results due to SOLR latency & reindex failures Nov 22, 2019
@mekarpeles mekarpeles added the Lead: @cdrini Issues overseen by Drini (Staff: Team Lead & Solr, Library Explorer, i18n) [managed] label Dec 18, 2019
@xayhewalo xayhewalo added State: Work In Progress This issue is being actively worked on. [managed] and removed Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] labels Dec 24, 2019
@xayhewalo xayhewalo moved this from Needs: Assessment to Triaged in Triage Dec 24, 2019
@mekarpeles
Copy link
Member

mekarpeles commented May 11, 2020

@cdrini mentions this will be fixed by solr 8 -- #3317

@tfmorris
Copy link
Contributor

That seems to contradict his July 2019 #628 (comment) "Although note the root cause as to why these aren't getting updating during the normal flow will not be fixed by that," which I find more believable.

I don't think root cause for the solr updater errors was ever found and we know the logging needs to be improved to even start investigating them. This has absolutely nothing to do with the version of Solr.

@LeadSongDog
Copy link

Well, it’s now TWO years on, and the edition still contains the stale link to the wrong author record. Surely it is clear that an edition which links to a work should not also contain a link to an author except for secondary attributions.

@mekarpeles mekarpeles added this to the Next (proposed) milestone Oct 19, 2020
@cdrini cdrini modified the milestones: Active Sprint, Next (proposed) Jan 4, 2021
@cdrini cdrini modified the milestones: Active Sprint, Next (proposed) Feb 9, 2021
@cdrini cdrini added the Type: Epic A feature or refactor that is big enough to require subissues. [managed] label Feb 9, 2021
@cclauss cclauss added Priority: 2 Important, as time permits. [managed] and removed Priority: 1 Do this week, receiving emails, time sensitive, . [managed] labels Mar 12, 2021
@LeadSongDog
Copy link

Three years on, and the edition record stilll shows the (wrong) author, even after the reindexing
https://testing.openlibrary.org/books/OL9761061M.json

Again, edition records should not have explicit, uncorrectable links to authors. Author links belong on the work records where they can be corrected.

@cdrini cdrini modified the milestones: Next (proposed), Active Sprint Jun 2, 2021
@cdrini
Copy link
Collaborator

cdrini commented Jun 2, 2021

@LeadSongDog I believe the issue you're describing is documented here: #2625 . It's not related to solr indexing.

@cdrini
Copy link
Collaborator

cdrini commented Jun 26, 2021

Calling this one closed (after all this time!); we've been on solr8 for ~2 weeks now, and (1) the caching issue was fixed in a PR, and (2) solr latency has gone from ~15min -> ~1min ! If anyone notices other issues of this nature, where they're edits are not reflected in search, please do create a new issue for it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Affects: Data Issues that affect book/author metadata or user/account data. [managed] Lead: @cdrini Issues overseen by Drini (Staff: Team Lead & Solr, Library Explorer, i18n) [managed] Module: Solr Issues related to the configuration or use of the Solr subsystem. [managed] Priority: 2 Important, as time permits. [managed] State: Work In Progress This issue is being actively worked on. [managed] Theme: Search Issues related to search UI and backend. [managed] Type: Bug Something isn't working. [managed] Type: Epic A feature or refactor that is big enough to require subissues. [managed]
Projects
No open projects
2018 Q2
  
In Progress
Triage
  
Triaged
Development

No branches or pull requests

8 participants