WIP: Convert Library of Congress codes into a classification #3309
@cclauss it shows on this thread that the error checks failed. Should I still try it out? I will check out the code and see what's going on, as I don't have python on my computer to run it and my computer's been having problems lately, so I can't add it on - unless there's a different, easier approach to this. |
I looked at the code and it seems at the end it's cut short - there's no return like for the last section. |
As I thought about MLCM95 ['Music'], you're right - sometimes there are misspellings and other issues that may require the number or other options. I'm assuming with this one that it was 'ML' ['Music', 'Literature on music'], but with extra lettering typos. So if we're using the first two letters, then it'll come out like I showed, but if we're truly worried about potential false info, it'll be good to leave it off like you did. I'm not sure.
Fixed. I have gotten out of the habit of writing Python 2 code but the linter caught me so now it is OK to test.
What is your operating system? I can build a standalone app if I know the OS. So, for
If you get a chance to test then please try any valid letter combinations but especially:
Let's be careful to use the correct terminology. LC Classification is different from genre. |
OK...
Yes, please. |
If we look at https://openlibrary.org/books/OL1025841M/Multiregional_demography, we are categorizing it by seven different subjects. And if we follow the |
Subject headings are different from classification. So if you think of a physical library and a physical card catalog, the book is a single item, located in a single place - this is identified by the classification (call number). The card catalog, on the other hand, has index cards to represent the one physical item across multiple categories - those are the subject headings. And again those are separate from genre, which is what the book is versus what the book is about. |
Nice way to put it. I understand now. So in the end, we want just one Class and one Subclass for each book?
Usually, yes, but I've seen some cases where a book has been assigned more than one LC number. Not so common and I don't think it's something to worry about. |
I've installed it before. I have Windows 8.1. I haven't done testing of code like this yet, so I'm still learning.
That is true. The LoC calls them 'classes' and 'subclasses'. I was assigning 'class' as 'genre' when it's posted on the OL UI and 'subclasses' as a 'subgenre' there too. I've seen them called 'categories' on other sites too. Thank you @seabelis for clarifying. You're right - I wasn't specific enough on that. |
What's the relationship of this code to the linked issue? Instead of indexing LCC directly, as that issue describes, is the proposal now to convert them to text and index them as subjects instead? |
@@ -0,0 +1,242 @@ | |||
{ |
What is the provenance of this file (and the other JSON file)?
We parse these .pdf files with a variant of #3290 (comment) (the parsing code is under the MIT License). That is the linkage to that issue.
This is a WIP Draft because there are trade-offs to consider between looking something up in a dict vs. embedding it in every record.
scripts/lcc_to_class_subclass.py
Outdated
>>> get_ol_book_info() # doctest: +ELLIPSIS | ||
{'olid:OL26617202M': ... | ||
""" | ||
url = "https://openlibrary.org/api/books?jscmd=details&format=json&bibkeys=olid:" |
It would be easier to read/update if the query parameters were passed in a dict, as supported by requests.
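For illustration, here is a minimal sketch of what the dict-based form could look like. It uses the standard library's `urlencode` so it runs without a network call; `requests.get(url, params=params)` accepts the same kind of dict. The function name `build_book_info_url` is hypothetical and is not part of this PR.

```python
from urllib.parse import urlencode

BASE_URL = "https://openlibrary.org/api/books"


def build_book_info_url(olid="OL26617202M"):
    """Build the Books API URL from a dict of query parameters.

    Hypothetical helper for illustration only.
    """
    params = {
        "jscmd": "details",
        "format": "json",
        "bibkeys": f"olid:{olid}",  # only this value changes between calls
    }
    # urlencode percent-encodes the ":" in the bibkeys value.
    return f"{BASE_URL}?{urlencode(params)}"
```

Each parameter sits on its own line, so adding or changing one is a one-line diff.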
This is a WIP in Draft mode. Those two utility functions are present to allow testers to experiment if they are more comfortable entering OL codes rather than LCC codes. Given that we want to get detailed data formatted as JSON from a fixed endpoint, these functions only need to modify a single parameter. Dict-driven URLs are verbose and by no means mandatory for fixed endpoints such as APIs.
>>> get_ol_book_info() # doctest: +ELLIPSIS | ||
{'olid:OL26617202M': ... | ||
""" | ||
url = "https://openlibrary.org/api/books?jscmd=details&format=json&bibkeys=olid:" |
Same as above
This file will be removed before we exit WIP/Draft status.
You should be able to run I think of a classification as a hierarchy of classes like zoologists order living creatures by Kingdom, Phylum, Class, Order, Family, Genus, Species. For a creature, I render its classification listing out its membership at all of these levels. I interpret LoC classification starting with a single letter In zoology classifications range over kingdom, phylum, class, order, family, genus, species while in LoC classifications range over class, subclass, subclass, subclass... |
NOTE: Python 3 (3.7 and later) guarantees order-preserving dicts while Python 2 does not.
Ok. For some reason it seems like my Python library (version 3.8.2) doesn't work well with the code (I keep getting errors). Which Python version do you have (so I can run it on that)? |
Try the Binder link above instead. |
MLC stands for Minimal Level Cataloging. More info here. https://www.loc.gov/rr/cmd/collservedcalm.html and here https://www.loc.gov/catdir/cpso/catmodes.pdf. |
*see newest comment Here are the ones that were incorrect (LoC, what shows, Correct version. Source: https://www.loc.gov/catdir/cpso/lcco/):
I'm assuming some of these are due to not enough space, so the text gets cut off. For the others, it seems the lines below the first line don't get read.
Scratch what I said earlier. I do notice something about the 4 letter LCC's - there's possibly two LC's in them (all the ones with 4 letters have a "/" in them - I believe on purpose. As @seabelis said - some books have 2 LCC's). Is this correct @seabelis? |
No. Not correct. MLC stands for Minimal Level Cataloging. I posted two links with more information above. |
@seabelis To make sure I understand everything: MLC12 3/4 (5)
So MLCS 98/02371 (H) is a small book from 1998 in Social Sciences. For class, it'll just show 'Social Sciences'. |
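To make the reading above concrete, here is a rough sketch that splits an MLC call number into its parts. The layout (size letter, year, serial number, optional class letter in parentheses) is an assumption inferred only from the examples in this thread, and `parse_mlc` is a hypothetical name, not a function in this PR.

```python
import re

# Assumed layout, inferred from examples such as "MLCS 98/02371 (H)":
# "MLC" + size letter, year, "/" or "-", serial number, optional "(class)".
MLC_PATTERN = re.compile(
    r"^MLC(?P<size>[A-Z])\s+(?P<year>\d{2,4})[/-](?P<serial>\d+)"
    r"(?:\s*\((?P<lc_class>[A-Z]+)\))?$"
)


def parse_mlc(call_number):
    """Return the parts of an MLC call number, or None if it does not match."""
    match = MLC_PATTERN.match(call_number.strip().upper())
    return match.groupdict() if match else None
```

For `MLCS 98/02371 (H)` this yields size `S`, year `98`, and class `H`, which could then be looked up as 'Social Sciences'. A regular LCC such as `NC248` returns `None`.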
Your help please on this one...
Why would |
@cclauss > Why would That appears to be an error. DP1-402 is History of Spain. Asia is DS. Africa is DT. Australia is under DU. |
Nice use of binder! :) This looks great! Note MLC* are invalid for our purposes here; they don't denote any classes, just physical properties about the book (it would incorrectly list all such books as being about Music :P ). Once out of WIP, you might want to use utils.lcc.short_lcc_to_sortable_lcc
from #3302, which will filter those out.
In terms of permanent home for this, a utils.lcc_classes.py
might be a good place. The main "public" method we'd need to show these in the UI would be the lcc_to_classification
function 👍
scripts/lcc_to_class_subclass.py
Outdated
return chars, 0 | ||
|
||
|
||
def lcc_to_classification(lcc): |
Maybe this name might be better?
def lcc_to_classification(lcc): | |
def get_lcc_classes(lcc): |
@cdrini the MLCs do have classes in parentheses, but I could see leaving it out for now - as it's extra coding. They just don't have subclasses, but just classes is something that should be included (maybe one day).
The name change does look a lot cleaner and more readable :), but I'll leave it to the code writers to make the final decision.
@BrittanyBunk That sounds perfect 👍 Long term, we should probably store these in a different field in openlibrary, not in the LCC field; e.g. maybe something like:
{
"lc_classifications": ["S"],
"mlc": ["MLCS 91-9437"],
}
But for now excluding them seems best 👍
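As a sketch of how that separation might work, the snippet below sorts a list of codes into the two suggested fields. The field names come from the comment above. The rule that any code starting with "MLC" is a Minimal Level Cataloging number is an assumption, and `split_lc_codes` is a hypothetical name.

```python
def split_lc_codes(codes):
    """Sort codes into LC classifications and MLC numbers.

    Assumes a code is an MLC number exactly when it starts with "MLC".
    """
    record = {"lc_classifications": [], "mlc": []}
    for code in codes:
        is_mlc = code.strip().upper().startswith("MLC")
        record["mlc" if is_mlc else "lc_classifications"].append(code)
    return record
```

With this split, only `lc_classifications` would be passed on to the classification lookup, so MLC numbers no longer show up as Music.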
@cdrini maybe it'll even be named `{ I think this MLC file would need to refer to the lc_classifiers_letters_only.json, but I'm thinking it could refer to a smaller one, as it'll use only the first letters. I created this: lc_classifiers_first_letter_only.json. There were some typos that I corrected from the original one, so if the MLCs get added in, they should be consistent. @cclauss should I create a corrected version (that capitalizes letters and such) of lc_classifiers_letters_only.json and place the link here? |
I made a mistake with which LoC page to use. I used the outline, but in reality I should've used the schedules (more specific and up-to-date). So I am going to redo my answers and add them to test_lcc_classifier.py :). |
Cool but... We are using https://github.com/thisismattmiller/lcc-pdf-to-json to download our translation table. This is using outline URLs so what "schedules" URLs do we want to use instead?
You're using the correct ones, the schedule outlines. I accidentally used the outline outlines without realizing they weren't the same ones. So I will correct my answers. The LoC sure has ambiguous terminology lol, but I'll get this on track. |
OK. My sense is that the parser currently has difficulties with classifier text that is wrapped across two or more lines. In some cases these should be one long string and in others they should be two separate strings. Please add some of these to our test data to ensure that our testing properly covers these cases. |
I'm having confusion. Here's the thing: |
@BrittanyBunk I too noticed the discrepancy on the I hope to have time this weekend to get this back into working order so that testing can resume. If you have the bandwidth, please keep creating test cases here or here. Thanks much... |
@cclauss I noticed more issues when going through the whole lc_classifiers_letters_and_numbers.json:
My recommendations:
Does this work for everybody? |
OK.
I worry about you spending a ton of time manually correcting the .json doc that was auto-generated. Are the changes that you plan to make from reading newer .pdf files? If not, what other sources of data do you use? If so, we should try to regenerate the .json file using #3309 (comment). Please show me your corrections after you have done a few letters (A, B, C) and then I can tell you if we can automate some (much!) of this. Thx.
@cclauss I worried about it too. The new pdf files I use are https://www.loc.gov/aba/cataloging/classification/. I put them into one pdf here. They are not the ones in the comment you linked (those are different - I already made that mistake :)). I am able to generate an entire list of all the exceptions that need to be changed - that doesn't take long. Maybe you could make the .json file from them? I'll get started on that and message when that's finished. |
There were other issues, but we'll ignore them for now (just putting for reference, but we won't worry about it):
@cclauss The numbers refer to the ones in this comment. Exceptions lists: These are all the exceptions lists you need to get started. #1 would need to have its lines combined. I didn't do that - but there are ways - as I've done something similar before. Note: |
Using the schedules is harder than the LCC outline, as there's times that look like this: It's a tradeoff - either:
Since we're choosing the first option (a full outline), these consequences (misalignment) will emerge. Fortunately, it's only a few places, so it shouldn't be a big deal. |
@cclauss how come this closed, is it complete now? |
Fixes #3396
Related to #3290
NOTE: See the new testing instructions below.
Run this experiment with the command `scripts/lcc_to_genre_subgenre.py` and then enter Library of Congress Classification codes to see their `genre` and `subgenre`. Problem codes will be written to the file `lcc_to_genre_subgenre.py_debug.txt` so that you can copy from that file and paste into comments here to show us which codes were not able to generate `genre` and `subgenre`.
@BrittanyBunk @finnless Please try to run this and enter valid LCC codes from OpenLibrary to see if you get two entries (genre and subgenre) each time. Are they the classification codes that you expected for each work?
There are two functions in this code. The first one attempts to use the LCC letters only. If that results in both a genre and subgenre then we are done. If not, then we run the second function that uses both the letters and the numbers to attempt to get both a genre and subgenre. If you find codes that do not return both a genre and subgenre then please add comments to this PR.
- `NC248` uses only the first function
- `KLA940` needs to use the second function to obtain the subgenre
- `MLCM95` delivers a genre but no subgenre
% scripts/lcc_to_genre_subgenre.py
Please enter Library of Congress codes like: HB1951 .R64 1995...
Or leave blank to quit:
NC248
NC248 ['Fine Arts', 'Drawing. Design. Illustration']
Or leave blank to quit:
KLA940
Needed numbers: KLA940
KLA940 ['Law', 'Russia, Soviet Union']
Or leave blank to quit:
MLCM95
Needed numbers: MLCM95
MLCM95 ['Music']
Or leave blank to quit:
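The two-step lookup in the session above can be sketched roughly as follows. This is not the PR's actual code. The tables are tiny, hypothetical stand-ins for the JSON files, the number range for `KLA` is a placeholder, and the helper names other than `lcc_to_classification` are made up for illustration.

```python
import re

# Hypothetical, heavily truncated stand-ins for the PR's JSON lookup tables.
LETTERS_ONLY = {
    "K": ["Law"],
    "M": ["Music"],
    "N": ["Fine Arts"],
    "NC": ["Fine Arts", "Drawing. Design. Illustration"],
}
# letters -> list of (low, high, classification); the range is a placeholder.
LETTERS_AND_NUMBERS = {
    "KLA": [(0, 9999, ["Law", "Russia, Soviet Union"])],
}


def split_lcc(lcc):
    """Split 'KLA940 ...' into ('KLA', 940); the number defaults to 0."""
    match = re.match(r"\s*([A-Za-z]+)\s*(\d+)?", lcc)
    if not match:
        return "", 0
    return match.group(1).upper(), int(match.group(2) or 0)


def classify_by_letters(letters):
    """First function: try the longest letter prefix found in LETTERS_ONLY."""
    for end in range(len(letters), 0, -1):
        if letters[:end] in LETTERS_ONLY:
            return LETTERS_ONLY[letters[:end]]
    return []


def classify_by_letters_and_numbers(letters, number):
    """Second function: match the letters and a number range."""
    for low, high, classes in LETTERS_AND_NUMBERS.get(letters, []):
        if low <= number <= high:
            return classes
    return []


def lcc_to_classification(lcc):
    letters, number = split_lcc(lcc)
    classes = classify_by_letters(letters)
    if len(classes) >= 2:  # both entries found from the letters alone
        return classes
    # Otherwise fall back to letters plus numbers, keeping any partial result.
    return classify_by_letters_and_numbers(letters, number) or classes
```

With these stand-in tables the sketch reproduces the three results in the session, including the `MLCM95` case that resolves to Music through its first letter.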
Technical
Testing
Yes. Please. (I have another version with tons of doctests.)
Evidence
Stakeholders