Authors not findable using Search #699
Comments
Example: https://openlibrary.org/search/authors?q=United+States+Congress+House+Aviation has 0 results but should yield https://openlibrary.org/authors/OL4620383A (United States. Congress. House. Committee on Transportation and Infrastructure. Subcommittee on Aviation) |
A shorter example is "New Hampshire. Council" which fails to return any of these records:
Not sure it's significant, but https://openlibrary.org/authors/OL4620383A.json doesn't have a created key, while https://openlibrary.org/authors/OL4943246A.json, which is searchable, does. If the update code is depending on that to exist for some reason, it could be unhappy. |
I think my note above about the created key was a red herring. I was looking at the most prolific authors and have a few new record setters which don't show up in search. The first column is the number of works they've authored.
14131 /authors/OL2336667A United States. Congress. Senate. Committee on Pensions
Here's another batch. Except for the New Hampshire. Council author mentioned above, all others appear to be United States. Congress. entries of some flavor or another. The range of IDs indicates that they weren't all created at the same time. As an aside, the number in parentheses is the number of works listed on the author's page. The entries with asterisks have counts which are off pretty dramatically.
5082 /authors/OL4521280A United States. Congress. Senate. Committee on Commerce (4560)
A couple more and a new theory:
789 "/authors/OL24127A" Metropolitan Museum of Art (New York, N.Y.) (408)
Perhaps two or more periods in the name is what causes the problem? Or non-terminal periods? On the other hand, there's a duplicate MOMA entry with the exact same name which did get indexed correctly:
Of course the author which can't be found has 408 works associated with it, while the correctly indexed author has none. :-( |
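To make the "periods in the name" theory easier to check, here is a minimal sketch that counts total and non-terminal periods for names quoted in this thread. The helper name `period_features` is made up for illustration, and the findable flags simply restate what is reported in this issue. Note that the Princeton name from the original report has several non-terminal periods and is still findable, which argues against the theory.

```python
def period_features(name: str) -> dict:
    """Count periods in an author name (hypothetical helper, not Open Library code)."""
    stripped = name.rstrip()
    total = stripped.count(".")
    # A period is non-terminal if it is not the final character of the name.
    non_terminal = total - (1 if stripped.endswith(".") else 0)
    return {"periods": total, "non_terminal": non_terminal}


# Names taken from this issue; the boolean is whether search found them at the time.
observations = {
    "New Hampshire. Council": False,
    "Metropolitan Museum of Art (New York, N.Y.)": False,
    "Princeton University. Dept. of Economics and Social Institutions. "
    "Industrial Relations Section.": True,
}

for name, findable in observations.items():
    print(period_features(name), "findable" if findable else "not findable", name)
```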
https://openlibrary.org/search/authors?q=Metropolitan+Museum+of+Art finds the merged author after I made this edit: |
I remember seeing something similar and thinking "New" was a problematic keyword. There's an issue about it, I don't recall if it was a related issue or if new was actually the problem. I'll look into it! |
@LeadSongDog re: Author search, please see #699. Somewhat embarrassed to say, I'm not sure re-indexing is occurring at all in several such cases. #351 (comment) |
I may need to cc: @gdamdam to make sure I kick off this solr-updater process correctly. I believe he has internal docs on this process which I should try to document more publicly.
related: #714 |
https://openlibrary.org/search/authors?q=New+Hampshire.+Council
As an experiment, I am going to add https://openlibrary.org/authors/OL4896977A/New_Hampshire._Council to the manual admin/solr interface @ 7:40 UTC and check the search results sometime later. Authors are added by the solr updater if they appear in the infogami edit logs, which happens whenever any of the record's data changes. The admin/solr interface allows admins to add a record into that same update pipeline. I expect OL4896977A will show up in search results within 15 minutes. I think we need a way to identify and re-index items that have, for whatever reason, missed indexing in the past. There may be a way to do targeted partial re-indexes if we can identify the targets.
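One possible way to identify such targets is to search for each known author by name and flag the records whose key never comes back. The sketch below is only an illustration, not the solr-updater's logic: `find_unindexed`, `normalize_key`, and `search_author_keys` are hypothetical names, and it assumes the author search page has a JSON variant at `/search/authors.json` that returns a `docs` list with a `key` per author. It also only looks at the first page of results, so a very common name could produce false positives.

```python
from __future__ import annotations

import json
import urllib.parse
import urllib.request
from typing import Callable, Iterable

# Assumption: JSON variant of the /search/authors page used in this thread.
SEARCH_URL = "https://openlibrary.org/search/authors.json"


def normalize_key(key: str) -> str:
    """Reduce '/authors/OL4620383A' or 'OL4620383A' to the bare ID."""
    return key.rsplit("/", 1)[-1]


def search_author_keys(name: str) -> set[str]:
    """Return author IDs on the first page of search results (assumed response shape)."""
    url = SEARCH_URL + "?" + urllib.parse.urlencode({"q": name})
    with urllib.request.urlopen(url, timeout=30) as resp:
        data = json.load(resp)
    return {normalize_key(doc["key"]) for doc in data.get("docs", [])}


def find_unindexed(
    authors: Iterable[tuple[str, str]],
    search: Callable[[str], set[str]] = search_author_keys,
) -> list[str]:
    """Return IDs of (key, name) pairs whose key is absent from a search for the name."""
    missing = []
    for key, name in authors:
        author_id = normalize_key(key)
        if author_id not in search(name):
            missing.append(author_id)
    return missing


# Offline demonstration with a stand-in index, so nothing hits the live site.
fake_index = {"OL4943246A"}
candidates = [
    ("/authors/OL4620383A", "United States. Congress. House. Committee on "
     "Transportation and Infrastructure. Subcommittee on Aviation"),
    ("/authors/OL4943246A", "A searchable author"),
]
print(find_unindexed(candidates, search=lambda name: fake_index))
```

The search function is passed in as a parameter so the check can be run against a dev Solr instance or a stub; the IDs it returns could then be fed to whatever re-index mechanism is available, such as the admin/solr interface.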
The one thing I notice these authors have in common is that they were all initially imported in 2008, which is the earliest year OL records were added, and before a lot of the processes were finalised. https://openlibrary.org/authors/OL4528217A/ was created in 2008 but last edited in 2012, which, by my theory above, means it should have been indexed. It's not in the search results at https://openlibrary.org/search/authors?q=United+States.+Congress.+House.+Committee+on+Foreign+Affairs as of 9:32 UTC, which is when I made an edit to the record to see if it gets added to the index soon.
Good to know that these records aren't fundamentally broken in some way and can be indexed if we can identify them. Implicit in the results of this experiment is that the search index probably hasn't been rebuilt since 2008, which is kind of a frightening thought. Who knows how many holes and errors are in it... |
A spot check shows that these are successfully indexed in my dev Solr instance. For example, "New Hampshire. Council" returns all 21 author records listed above and "United States. Congress. Senate. Committee on Pensions". The issues with the work_count also appear to be fixed in the new index. |
Another issue that will be affected/fixed by #2246 |
I think we could use a top-level issue which more surgically outlines and enumerates things which are not indexed by search (there are plenty of works as well which exist and don't seem to be indexed) |
I think @hornc is correct: solr-updater was likely broken/down/? at the time these authors were created, and they were never re-indexed. All the examples here now work, because we've done a few full re-indexes over the last year.
None of these authors are findable using search even though (many) author records exist for them.
I thought perhaps it was associated with long author names, but this one is findable even though it's longer than the Committee on the Pacific Railroad.
Princeton University. Dept. of Economics and Social Institutions. Industrial Relations Section.
Here's the list of author records for one of the unsearchable names:
/authors/OL4620383A United States. Congress. House. Committee on Transportation and Infrastructure. Subcommittee on Aviation
/authors/OL4620614A
/authors/OL4625592A
/authors/OL4620175A
/authors/OL4620217A
/authors/OL4625064A
/authors/OL48266A
/authors/OL4626004A
/authors/OL4625755A
/authors/OL4625259A
/authors/OL4625904A
/authors/OL4625065A
/authors/OL4625754A
/authors/OL4620231A
/authors/OL4620213A
/authors/OL4623159A
/authors/OL4625899A