Authors not findable using Search #699

Closed
tfmorris opened this issue Dec 29, 2017 · 18 comments
Labels
Lead: @cdrini Issues overseen by Drini (Staff: Team Lead & Solr, Library Explorer, i18n) [managed] Module: Authors Module: Solr Issues related to the configuration or use of the Solr subsystem. [managed] Priority: 3 Issues that we can consider at our leisure. [managed] Theme: Search Issues related to search UI and backend. [managed] Type: Bug Something isn't working. [managed]

Comments

@tfmorris
Contributor

None of these authors are findable using search even though (many) author records exist for them.

  • United States. Congress. House. Committee on the District of Columbia. Subcommittee on Investigation of Food Storage and Prices
  • United States. Congress. House. Committee on the Pacific Railroad
  • United States. Congress. House. Committee on Transportation and Infrastructure. Subcommittee on Aviation

I thought perhaps it was associated with long author names, but this one is findable even though it's longer than the Committee on the Pacific Railroad entry:
Princeton University. Dept. of Economics and Social Institutions. Industrial Relations Section.

Here's the list of author records for one of the unsearchable names:
/authors/OL4620383A United States. Congress. House. Committee on Transportation and Infrastructure. Subcommittee on Aviation
/authors/OL4620614A
/authors/OL4625592A
/authors/OL4620175A
/authors/OL4620217A
/authors/OL4625064A
/authors/OL48266A
/authors/OL4626004A
/authors/OL4625755A
/authors/OL4625259A
/authors/OL4625904A
/authors/OL4625065A
/authors/OL4625754A
/authors/OL4620231A
/authors/OL4620213A
/authors/OL4623159A
/authors/OL4625899A

@cdrini
Collaborator

cdrini commented Dec 29, 2017

Example: https://openlibrary.org/search/authors?q=United+States+Congress+House+Aviation has 0 results but should yield https://openlibrary.org/authors/OL4620383A (United States. Congress. House. Committee on Transportation and Infrastructure. Subcommittee on Aviation)
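A minimal sketch for repeating this check from a script rather than the browser. It assumes the search page has a JSON variant at `/search/authors.json` and that the response looks like `{"numFound": ..., "docs": [{"key": "OL...A"}]}`; both the endpoint and the response shape are assumptions here, and the helper names are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: JSON variant of the /search/authors page quoted above.
SEARCH_AUTHORS_URL = "https://openlibrary.org/search/authors.json"


def author_search_url(query: str) -> str:
    """Build the author search URL for a free-text query."""
    return f"{SEARCH_AUTHORS_URL}?{urlencode({'q': query})}"


def is_author_in_results(response: dict, olid: str) -> bool:
    """Return True if the author ID appears in a parsed search response.

    Accepts either "OL4620383A" or "/authors/OL4620383A" on both sides,
    since the exact key format in the response is an assumption.
    """
    wanted = olid.rsplit("/", 1)[-1]
    docs = response.get("docs", [])
    return any(str(doc.get("key", "")).rsplit("/", 1)[-1] == wanted for doc in docs)
```

Fetching the URL and decoding the body with `json.loads` is left out so that the check itself can be exercised without network access.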

@tfmorris
Contributor Author

A shorter example is "New Hampshire. Council" which fails to return any of these records:

/authors/OL4896972A	New Hampshire. Council		
/authors/OL4629754A	New Hampshire. Council.		
/authors/OL4896987A	New Hampshire. Council		
/authors/OL4896997A	New Hampshire. Council		
/authors/OL4896999A	New Hampshire. Council		
/authors/OL4896975A	New Hampshire. Council		
/authors/OL4896989A	New Hampshire. Council		
/authors/OL4896984A	New Hampshire. Council		
/authors/OL4896980A	New Hampshire. Council		
/authors/OL4896996A	New Hampshire. Council		
/authors/OL2194235A	New Hampshire. Council on Problems of the Aging.		
/authors/OL4896994A	New Hampshire. Council		
/authors/OL4896982A	New Hampshire. Council		
/authors/OL4896998A	New Hampshire. Council		
/authors/OL4896979A	New Hampshire. Council		
/authors/OL4897000A	New Hampshire. Council		
/authors/OL4896991A	New Hampshire. Council		
/authors/OL4896977A	New Hampshire. Council		
/authors/OL4896990A	New Hampshire. Council		
/authors/OL2374994A	New Hampshire. Council on Postwar Planning and Rehabilitation.		
/authors/OL4896986A	New Hampshire. Council	

@tfmorris
Contributor Author

Not sure it's significant, but https://openlibrary.org/authors/OL4620383A.json doesn't have a created key, while https://openlibrary.org/authors/OL4943246A.json, which is searchable, does.

If the update code is depending on that to exist for some reason, it could be unhappy.

@mekarpeles mekarpeles added Theme: Search Issues related to search UI and backend. [managed] Module: Authors labels Dec 30, 2017
@tfmorris
Contributor Author

I think my note above about the created key was a red herring.

I was looking at the most prolific authors and have a few new record setters which don't show up in search. The first column is the number of works they've authored.

14131 /authors/OL2336667A United States. Congress. Senate. Committee on Pensions
7237 /authors/OL4789289A United States. Congress. Senate. Committee on Claims
6047 /authors/OL2375088A United States. Congress. House. Committee on Invalid Pensions.
5498 /authors/OL4766486A United States. Congress. House. Committee on Claims

@tfmorris
Contributor Author

Here's another batch. Except for the New Hampshire. Council author mentioned above, all others appear to be United States. Congress. entries of some flavor or another. The range of IDs indicates that they weren't all created at the same time.

As an aside, the number in parentheses is the number of works listed on the author's page. The entries with asterisks have counts which are off pretty dramatically.

5082 /authors/OL4521280A United States. Congress. Senate. Committee on Commerce (4560)
5066 /authors/OL2323345A United States. Congress. House. Committee on War Claims.
4894 /authors/OL4523254A United States. Congress. House. Committee on the Judiciary (3643) **
4839 /authors/OL184870A United States. Congress. House. Committee on Military Affairs. (4813)
3531 /authors/OL4521525A United States. Congress. House. Committee on Interstate and Foreign Commerce (3044) *
3407 /authors/OL4521082A United States. Congress. Senate. Committee on the Judiciary (2906)
3326 /authors/OL4774429A United States. Congress. Senate. Committee on Military Affairs (3050)
3230 /authors/OL47374A United States. Congress. House. Committee on Rules. (1517) **
2941 /authors/OL4648820A United States. Congress. House. Committee on Naval Affairs (2577)
2915 /authors/OL4835960A United States. Congress. House. Committee on Rivers and Harbors (699) ***
2861 /authors/OL4521469A United States. Congress. Senate. Committee on Foreign Relations (2271)
2761 /authors/OL43204A United States. Congress. Senate. Committee on Energy and Natural Resources. (1889)
2699 /authors/OL4527173A United States. Congress. House. Committee on Ways and Means (1843) **
2545 /authors/OL4522330A United States. Congress. Senate. Committee on Finance (2292)
2436 /authors/OL4528217A United States. Congress. House. Committee on Foreign Affairs (1822) **
2268 /authors/OL4521848A United States. Congress. Senate. Committee on Appropriations (1746)
2086 /authors/OL4657773A United States. Congress. Senate. Committee on the District of Columbia (1675)
2008 /authors/OL159513A United States. Congress. House. Committee on Public Lands (1836) *
1004 /authors/OL4839666A United States. Congress. Senate. Committee on Public Lands and Surveys (928)
100 /authors/OL868250A United States. Congress. House. Committee on the Judiciary. Subcommittee on Monopolies and Commercial Law. (90)
90 /authors/OL988950A United States. Congress. House. Committee on Science and Technology. Subcommittee on Natural Resources, Agriculture Research, and Environment. (82)
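For anyone who wants to quantify how far off the counts are instead of eyeballing asterisks, here is a rough sketch that parses lines in the format used above and computes the relative gap between the two numbers. The regular expression and the names `parse_count_line` and `relative_gap` are illustrative only, not part of any Open Library tooling.

```python
import re
from typing import NamedTuple, Optional

# Format assumed from the lists in this thread:
#   <work count> <author key, optionally quoted> <name> [(<page count>)] [asterisks]
LINE_RE = re.compile(
    r'^(?P<works>\d+)\s+"?(?P<key>/authors/OL\d+A)"?\s+(?P<name>.*?)'
    r'(?:\s+\((?P<page>\d+)\))?(?:\s+\*+)?\s*$'
)


class AuthorCount(NamedTuple):
    work_count: int
    key: str
    name: str
    page_count: Optional[int]  # None when no parenthesised count is given


def parse_count_line(line: str) -> Optional[AuthorCount]:
    """Parse one list line; return None if it does not match the format."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    page = match.group("page")
    return AuthorCount(
        work_count=int(match.group("works")),
        key=match.group("key"),
        name=match.group("name"),
        page_count=int(page) if page is not None else None,
    )


def relative_gap(entry: AuthorCount) -> Optional[float]:
    """Fraction of the first-column count missing from the author-page count."""
    if entry.page_count is None or entry.work_count == 0:
        return None
    return (entry.work_count - entry.page_count) / entry.work_count
```

Sorting the parsed entries by `relative_gap` would give a ranking of the most inconsistent records without relying on the manual markers.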

@tfmorris
Contributor Author

tfmorris commented Jan 2, 2018

A couple more and a new theory:

789 "/authors/OL24127A" Metropolitan Museum of Art (New York, N.Y.) (408)
374 "/authors/OL4480A" India. Parliament. Committee on Public Undertakings. (125)

Perhaps two or more periods in the name is what causes the problem? Or non-terminal periods?

On the other hand, there's a duplicate Metropolitan Museum of Art entry with the exact same name which did get indexed correctly:

Of course the author which can't be found has 408 works associated with it, while the correctly indexed author has none. :-(
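The period theory above could be tested against a larger sample by tabulating periods per name and comparing that with whether each name is searchable. A small sketch follows; `period_profile` is a hypothetical helper, and it only counts characters, it makes no claim about how Solr tokenizes the names.

```python
def period_profile(name: str) -> dict:
    """Count periods in an author name, separating a trailing one from the rest."""
    stripped = name.rstrip()
    total = stripped.count(".")
    terminal = stripped.endswith(".")
    return {
        "periods": total,
        "non_terminal": total - (1 if terminal else 0),
        "terminal": terminal,
    }
```

Running it over both the unsearchable and the searchable names from this thread, then grouping by `non_terminal`, would show whether the two groups actually differ on this measure.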

@LeadSongDog

https://openlibrary.org/search/authors?q=Metropolitan+Museum+of+Art finds the merged author after I made this edit:
https://openlibrary.org/authors/OL24127A/Metropolitan_Museum_of_Art_(New_York_N.Y.)?b=4&a=3&_compare=Compare&m=diff
One might suspect the two are somehow related. There were earlier issues related to searches when the stopword "New" was part of the query.

@mekarpeles
Member

I remember seeing something similar and thinking "New" was a problematic keyword. There's an issue about it; I don't recall whether it was a related issue or whether "New" was actually the problem. I'll look into it!

@mekarpeles
Member

@LeadSongDog re: Author search, please see #699. Somewhat embarrassed to say, I'm not sure re-indexing is occurring at all in several such cases. #351 (comment)

@mekarpeles
Member

I may need to cc @gdamdam to make sure I kick off this solr-updater process correctly. I believe he has internal docs on this process, which I should try to document more publicly.

@mekarpeles
Member

related: #714

@hornc
Collaborator

hornc commented Mar 16, 2018

https://openlibrary.org/search/authors?q=New+Hampshire.+Council
gives 2 results @ 7:40 UTC
https://openlibrary.org/authors/OL7359992A
and
https://openlibrary.org/authors/OL7406663A

As an experiment I am going to add https://openlibrary.org/authors/OL4896977A/New_Hampshire._Council to the manual admin/solr interface @ 7:40 UTC

and check the search results sometime later.
EDIT: search results had not changed within 7 minutes, but OL4896977A was in the search results at 9:15 UTC (the next time I checked; I'm sure it was added a lot sooner than that). This shows that these authors can be added to the index. Normally this will occur on any edit to the record.

Authors are added by the solr updater if they appear in the infogami edit logs, which happens whenever any of the record's data changes. The admin/solr interface allows admins to add a record into that same update pipeline. I expect OL4896977A will show up in search results within 15 minutes.

I think we need a way to identify and re-index items that have, for whatever reason, missed indexing in the past. There may be a way to do targeted partial re-indexes if we can identify the targets.
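One way the identification step could look, as a sketch only: compare the full set of author keys (for example from a data dump) against the keys known to the search index, then feed the difference back in batches. How either key list is obtained is not covered here, and `find_unindexed` and `in_batches` are hypothetical names, not existing Open Library functions.

```python
from typing import Iterable, Iterator, List


def find_unindexed(all_keys: Iterable[str], indexed_keys: Iterable[str]) -> List[str]:
    """Return the keys present in the catalog but absent from the index, sorted."""
    return sorted(set(all_keys) - set(indexed_keys))


def in_batches(keys: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` keys for a targeted re-index."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(keys), size):
        yield keys[start:start + size]
```

Each batch could then be submitted through whatever path already triggers the update pipeline, such as the admin/solr interface mentioned above.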

@hornc
Collaborator

hornc commented Mar 16, 2018

The one thing I notice these authors have in common is that they were all initially imported in 2008, which is the earliest year OL records were added, and before a lot of the processes were finalised.

https://openlibrary.org/authors/OL4528217A/ was created in 2008 but last edited in 2012, so by my theory above it should have been indexed. It's not in search results https://openlibrary.org/search/authors?q=United+States.+Congress.+House.+Committee+on+Foreign+Affairs @ 9:32 UTC (when I made the edit)

I am making an edit to the record now to see if it gets added to the index soon.
EDIT OL4528217A showed up in search results at 9:47 UTC

@tfmorris
Contributor Author

tfmorris commented Mar 17, 2018

Good to know that these records aren't fundamentally broken in some way and can be indexed if we can identify them.

Implicit in the results of this experiment is that the search index probably hasn't been rebuilt since 2008, which is kind of a frightening thought. Who knows how many holes and errors are in it...

@tfmorris
Contributor Author

tfmorris commented Aug 4, 2019

A spot check shows that these are successfully indexed in my dev Solr instance. For example, "New Hampshire. Council" returns all 21 author records listed above, and "United States. Congress. Senate. Committee on Pensions" is findable as well. The issues with the work_count also appear to be fixed in the new index.

@tfmorris tfmorris self-assigned this Aug 4, 2019
@xayhewalo xayhewalo added this to Un-Triaged in Triage Oct 20, 2019
@xayhewalo xayhewalo added Module: Solr Issues related to the configuration or use of the Solr subsystem. [managed] Priority: 3 Issues that we can consider at our leisure. [managed] State: Backlogged Type: Bug Something isn't working. [managed] labels Nov 19, 2019
@xayhewalo xayhewalo moved this from Un-Triaged to Triaged in Triage Nov 19, 2019
@xayhewalo
Collaborator

Another issue that will be affected/fixed by #2246

@mekarpeles
Member

I think we could use a top-level issue which more surgically outlines and enumerates things which are not indexed by search (there are plenty of works as well which exist and don't seem to be indexed)

@tfmorris tfmorris removed their assignment Apr 30, 2020
@cdrini
Collaborator

cdrini commented Feb 24, 2021

I think @hornc is correct: solr-updater was likely broken/down/? at the time these authors were created, and they were never re-indexed. All the examples here now work, because we've done a few full re-indexes over the last year.

@cdrini cdrini closed this as completed Feb 24, 2021
@cdrini cdrini added Lead: @cdrini Issues overseen by Drini (Staff: Team Lead & Solr, Library Explorer, i18n) [managed] and removed Needs: Lead labels Feb 24, 2021
@cdrini cdrini self-assigned this Feb 24, 2021