Automatic classification setup #681
A quick shot: Interesting fields could be:
There has already been a lot of discussion about automatic subject indexing, of which I have only noticed a small part. From what I heard, it makes the most sense to assist subject indexers with automatic tools in a semi-automatic workflow. And I guess it doesn't make much sense to guess a subject based only on the bibliographic data. My feeling is that you'd at least need an abstract to get a satisfying result (but I haven't reviewed the literature and could not quickly find a project where this has been tested). We have around 446k resources with a fulltext or summary link and without subject information.
I have wanted to play with automatic subject enrichment for a long time. Without fulltexts we would have to be braver, since we will produce more inadequate subjects. Also, concerning the resources without fulltexts: for those without subjects but whose authors are linked to other resources which already have subjects, I think we wouldn't do too badly in guessing the proper subjects, because then we would already have some domain-specific knowledge. (The same would be true, although broader and less exact, with the publisher: e.g. O'Reilly can be expected to be in the IT/tech domain.)
Add to, remove from, or rearrange that list.
Here is a highly relevant article from 2017 titled "Using Titles vs. Full-text as Source for Automated Semantic Document Annotation": https://arxiv.org/abs/1705.05311 The abstract says:
There is now a newer article which implies the same: https://arxiv.org/abs/1801.06717 - so we should go with title-based subject extraction!
Notably, the projects in the two papers operated over quite homogeneous data ("two datasets are obtained from scientific digital libraries in the domains of economics and political sciences along with two news datasets from Reuters and New York Times"). Compared to this, the hbz catalog describes very heterogeneous bibliographic resources. It would probably make sense to create more homogeneous subsets first before conducting the automatic indexing, or at least to take important information about the field of the resource into account. As I recently said in a face-to-face talk: a good indicator of the topic is also given by the holding institutions, as many of them have a collection focus. E.g. to get a list of libraries from the hbz network whose collections include resources on economics, you can query lobid-organisations like this: http://lobid.org/organisations/search?q=linkedTo.id%3A%22http%3A%2F%2Flobid.org%2Forganisations%2FDE-605%23%21%22+AND+Wirtschaftswissenschaften&location=
As a smaller set that is also somewhat more homogeneous than the full catalog, we could use NWBib: https://test.nwbib.de/search?location=&q=NOT+_exists_%3Asubject Completing the NWBib subjects would also be a self-contained project before taking on the full catalog.
+1 I think the NWBib editors would even review the results and adopt the correct subjects into Aleph.
Get data with training, testing, and missing subjects from NWBib, starting with small training and testing sets, using a single subject from the NWBib classification (Sachsystematik). See hbz/lobid-resources#681
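A rough sketch of what such a loading step might look like, assuming the lobid search API returns a JSON object with a `member` list whose entries carry `title` and `subject` fields (the query parameters, the response layout, and the CSV file names are assumptions for illustration, not the code in the repository):

```python
import csv
import requests

# lobid resources search endpoint; query parameters and the response layout
# (a "member" list with "title" and "subject" entries) are assumptions.
API = "http://lobid.org/resources/search"

def fetch(query, size=200):
    response = requests.get(API, params={"q": query, "size": size, "format": "json"})
    response.raise_for_status()
    return response.json().get("member", [])

def write_csv(path, rows):
    with open(path, "w", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        writer.writerow(["title", "subject"])
        writer.writerows(rows)

# Resources that already have a subject become the training and test sets;
# resources without one are the set we later want to classify.
classified = fetch("_exists_:subject")
unclassified = fetch("NOT _exists_:subject")

labeled = [(r.get("title", ""), r["subject"][0].get("label", ""))
           for r in classified if r.get("subject")]
write_csv("train.csv", labeled[: len(labeled) // 2])
write_csv("test.csv", labeled[len(labeled) // 2 :])
write_csv("missing.csv", [(r.get("title", ""), "") for r in unclassified])
```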
Load data from the local CSV files into a corpus, create feature vectors for each document using the tf-idf values of each term. See hbz/lobid-resources#681
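For illustration, the vectorization could be a plain scikit-learn `TfidfVectorizer` over the titles, assuming CSV files laid out like the ones sketched above:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the corpus from the local CSV files (file and column names as in the sketch above).
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Learn vocabulary and IDF weights on the training corpus only,
# then reuse them to vectorize the test documents.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train["title"].fillna(""))
X_test = vectorizer.transform(test["title"].fillna(""))

y_train, y_test = train["subject"], test["subject"]
print(X_train.shape)  # (number of documents, number of terms)
```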
Vectorize and classify the local data from CSV files using different vectorizer and classifier implementations, evaluate using the test set, and write best result to a CSV file. Trying to approach full NWBib set for experiments, see details in nwbib_subjects_load.py. Using tiny training and test sets, so absolute values are not meaningful, but relative differences should provide hints about which methods to test with the full data set. See hbz/lobid-resources#681
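Continuing the sketch above, such a comparison might look like this (the particular classifiers and the result file layout are assumptions):

```python
import csv
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# X_train, X_test, y_train, y_test come from the vectorization sketch above.
classifiers = {
    "naive_bayes": MultinomialNB(),
    "linear_svm": SGDClassifier(),
    "knn": KNeighborsClassifier(),
}

results = []
for name, classifier in classifiers.items():
    classifier.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, classifier.predict(X_test))
    results.append((name, accuracy))

# Write the best result to a CSV file.
best = max(results, key=lambda result: result[1])
with open("best_result.csv", "w", newline="", encoding="utf-8") as file:
    csv.writer(file).writerows([("classifier", "accuracy"), best])
```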
Get larger sample data in `bulk` format from the Lobid API, vectorize using stateless HashingVectorizer (requires no fitting of full corpus), use stop words, classify using sparse vectors, write trained classifier to disk. See hbz/lobid-resources#681
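A minimal sketch of that bulk setup; the placeholder documents stand in for the actual bulk download, and the stop word handling is an assumption (scikit-learn's built-in list is English only, so German titles would need an explicit list):

```python
import joblib
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Placeholder documents; in the real setup these come from the lobid bulk download.
titles = ["Einführung in die Wirtschaftsinformatik", "Geschichte des Rheinlands"]
labels = ["Wirtschaft", "Geschichte"]

# HashingVectorizer is stateless: it needs no fitting on the full corpus,
# so documents can be vectorized batch by batch. The built-in stop word
# list is English; a German list would have to be passed explicitly.
vectorizer = HashingVectorizer(stop_words="english")
X = vectorizer.transform(titles)

# SGDClassifier works on the sparse hashed vectors and also supports
# partial_fit, so it could be trained incrementally on successive batches.
classifier = SGDClassifier()
classifier.fit(X, labels)

# Persist the trained classifier; the stateless vectorizer can simply be recreated.
joblib.dump(classifier, "classifier.joblib")
```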
I think we have a useful setup for further experiments in https://github.com/fsteeg/python-data-analysis:
The bulk classification uses a different vectorizer than the small-set experiments to make it work with larger data sets, so the results are not directly comparable. The mid-size and full-size experiments, however, use the same setup and yield very comparable results, so the mid-size setup should be a useful basis for further experiments, with a runtime low enough to try multiple configurations (which features to use, how to configure the vectorizer, which classifier to use and how to configure it). Some areas to investigate next:
Find best parameter set to use with vectorizer and classifier by running all combinations of parameters in a cross-validated grid search. Runs jobs concurrently and outputs runtime info. See hbz/lobid-resources#681
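For example, a cross-validated grid search over a vectorizer/classifier pipeline could look like this (the pipeline components and the parameter grid are assumptions, not the configuration used in the repository):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", SGDClassifier()),
])

# Every combination of these parameters is evaluated with cross-validation.
parameters = {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "vectorizer__max_df": [0.5, 0.75, 1.0],
    "classifier__alpha": [1e-3, 1e-4, 1e-5],
}

# n_jobs=-1 runs the candidate fits concurrently; verbose prints runtime information.
search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)

# Raw training titles and their subject labels, as in the CSV sketches above.
search.fit(train["title"].fillna(""), train["subject"])
print(search.best_score_)
print(search.best_params_)
```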
Should we reconsider this, since it was tested for NWBib in hbz/nwbib#560?
As announced in our internal planning document (AEP), we want to expand our expertise in text mining.
As a reasonable, well-defined, and useful project in that area, I suggest we attempt to set up automatic classification for titles in our union catalog. More than half of our catalog has no subjects (a quick way to compare the counts is sketched after the links):
http://lobid.org/resources/search?q=NOT+_exists_%3Asubject
http://lobid.org/resources/search?q=_exists_%3Asubject
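For instance, the two counts could be compared with a short script like this (the `totalItems` field name is an assumption about the API's JSON response layout):

```python
import requests

API = "http://lobid.org/resources/search"

def count(query):
    # The "totalItems" field name is an assumption about the API's JSON response.
    response = requests.get(API, params={"q": query, "format": "json"})
    response.raise_for_status()
    return response.json()["totalItems"]

with_subject = count("_exists_:subject")
without_subject = count("NOT _exists_:subject")
print(f"with subject: {with_subject}, without subject: {without_subject}")
```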
The basic approach could be this: we use a part of the classified documents as our training set, and the rest of the classified documents as the gold standard to evaluate our classification method. The training and gold sets will have to contain a selection of documents across all subjects. With this basic setup, we can experiment with different features to represent a document and different classification algorithms.
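A small sketch of that split; stratifying on the subject keeps documents from every subject in both the training set and the gold set (the placeholder data stands in for the classified part of the catalog):

```python
from sklearn.model_selection import train_test_split

# Placeholder data; in practice, the titles and subjects of all classified documents.
titles = ["title a", "title b", "title c", "title d"]
subjects = ["Wirtschaft", "Wirtschaft", "Geschichte", "Geschichte"]

# Stratifying by subject ensures a selection of documents across all subjects
# in both the training set and the gold set.
train_titles, gold_titles, train_subjects, gold_subjects = train_test_split(
    titles, subjects, test_size=0.5, stratify=subjects, random_state=0
)
```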
When we get good results with our gold set, we can apply our classification method to the unclassified documents. These can get a new value for `subject.source`, allowing them to be treated differently in queries and display.