Automatic classification setup #681
A quick shot: Interesting fields could be:
There has already been a lot of discussion about automatic subject indexing, of which I have only noticed a small part. From what I heard, it makes the most sense to assist subject indexers with automatic tools in a semi-automatic workflow. And I guess it doesn't make much sense to guess a subject based only on the bibliographic data. My feeling is that you'd at least need an abstract to get a satisfying result (but I haven't reviewed the literature and could not quickly find a project where this has been tested). We have around 446k resources with a fulltext or summary link and without subject information.
I have wanted to play with automatic subject enrichment for a long time. Without fulltexts we would have to be braver, since we will produce more inadequate subjects. Also, concerning the resources without fulltexts: for those without subjects but whose authors are linked to other resources which already have subjects, I think we wouldn't do too badly in guessing the proper subjects, because then we would already have some domain-specific knowledge. (The same would be true, although broader and less exact, with the publisher: e.g. O'Reilly can be expected to be in the IT/tech domain.)
Add to, remove from, or rearrange that list.
Here is a highly relevant article from 2017 titled "Using Titles vs. Full-text as Source for Automated Semantic Document Annotation": https://arxiv.org/abs/1705.05311 The abstract says:
There is now a newer article which implies the same: https://arxiv.org/abs/1801.06717 - so we should go with title-based subject extraction!
Notably, the projects in the two papers operated over quite homogeneous data ("two datasets are obtained from scientific digital libraries in the domains of economics and political sciences along with two news datasets from Reuters and New York Times"). Compared to this, the hbz catalog describes very heterogeneous bibliographic resources. It would probably make sense to create more homogeneous subsets first before conducting the automatic indexing, or at least to take important information about the field of the resource into account. As I recently said in a face-to-face talk: a good indicator of the topic is also given by the holding institutions, as many of them have a collection focus. E.g. to get a list of libraries from the hbz network whose collections include resources on economics, you can query lobid-organisations like this: http://lobid.org/organisations/search?q=linkedTo.id%3A%22http%3A%2F%2Flobid.org%2Forganisations%2FDE-605%23%21%22+AND+Wirtschaftswissenschaften&location=
As a smaller set that is also somewhat more homogeneous than the full catalog, we could use NWBib: https://test.nwbib.de/search?location=&q=NOT+_exists_%3Asubject Completing the NWBib subjects would also be a self-contained project before taking on the full catalog.
+1 I think the NWBib editors would even review the results and adopt the correct subjects into Aleph.
Get data with training, testing, and missing subjects from NWBib, starting with small training and testing sets, using a single subject from the NWBib classification (Sachsystematik). See hbz/lobid-resources#681
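A rough sketch of what such a loading step might look like, assuming the lobid search API returns a JSON object with a `member` list whose entries carry `title` and `subject` fields (the query parameters, the response layout, and the CSV file names are assumptions for illustration, not the code in the repository):

```python
import csv
import requests

# lobid resources search endpoint; query parameters and the response layout
# (a "member" list with "title" and "subject" entries) are assumptions.
API = "http://lobid.org/resources/search"

def fetch(query, size=200):
    response = requests.get(API, params={"q": query, "size": size, "format": "json"})
    response.raise_for_status()
    return response.json().get("member", [])

def write_csv(path, rows):
    with open(path, "w", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        writer.writerow(["title", "subject"])
        writer.writerows(rows)

# Resources that already have a subject become the training and test sets;
# resources without one are the set we later want to classify.
classified = fetch("_exists_:subject")
unclassified = fetch("NOT _exists_:subject")

labeled = [(r.get("title", ""), r["subject"][0].get("label", ""))
           for r in classified if r.get("subject")]
write_csv("train.csv", labeled[: len(labeled) // 2])
write_csv("test.csv", labeled[len(labeled) // 2 :])
write_csv("missing.csv", [(r.get("title", ""), "") for r in unclassified])
```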
Load data from the local CSV files into a corpus, create feature vectors for each document using the tf-idf values of each term. See hbz/lobid-resources#681
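For illustration, the vectorization could be a plain scikit-learn `TfidfVectorizer` over the titles, assuming CSV files laid out like the ones sketched above:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the corpus from the local CSV files (file and column names as in the sketch above).
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Learn vocabulary and IDF weights on the training corpus only,
# then reuse them to vectorize the test documents.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train["title"].fillna(""))
X_test = vectorizer.transform(test["title"].fillna(""))

y_train, y_test = train["subject"], test["subject"]
print(X_train.shape)  # (number of documents, number of terms)
```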
Vectorize and classify the local data from CSV files using different vectorizer and classifier implementations, evaluate using the test set, and write best result to a CSV file. Trying to approach full NWBib set for experiments, see details in nwbib_subjects_load.py. Using tiny training and test sets, so absolute values are not meaningful, but relative differences should provide hints about which methods to test with the full data set. See hbz/lobid-resources#681
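Continuing the sketch above, such a comparison might look like this (the particular classifiers and the result file layout are assumptions):

```python
import csv
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# X_train, X_test, y_train, y_test come from the vectorization sketch above.
classifiers = {
    "naive_bayes": MultinomialNB(),
    "linear_svm": SGDClassifier(),
    "knn": KNeighborsClassifier(),
}

results = []
for name, classifier in classifiers.items():
    classifier.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, classifier.predict(X_test))
    results.append((name, accuracy))

# Write the best result to a CSV file.
best = max(results, key=lambda result: result[1])
with open("best_result.csv", "w", newline="", encoding="utf-8") as file:
    csv.writer(file).writerows([("classifier", "accuracy"), best])
```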
Get larger sample data in `bulk` format from the Lobid API, vectorize using stateless HashingVectorizer (requires no fitting of full corpus), use stop words, classify using sparse vectors, write trained classifier to disk. See hbz/lobid-resources#681
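A minimal sketch of that bulk setup; the placeholder documents stand in for the actual bulk download, and the stop word handling is an assumption (scikit-learn's built-in list is English only, so German titles would need an explicit list):

```python
import joblib
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Placeholder documents; in the real setup these come from the lobid bulk download.
titles = ["Einführung in die Wirtschaftsinformatik", "Geschichte des Rheinlands"]
labels = ["Wirtschaft", "Geschichte"]

# HashingVectorizer is stateless: it needs no fitting on the full corpus,
# so documents can be vectorized batch by batch. The built-in stop word
# list is English; a German list would have to be passed explicitly.
vectorizer = HashingVectorizer(stop_words="english")
X = vectorizer.transform(titles)

# SGDClassifier works on the sparse hashed vectors and also supports
# partial_fit, so it could be trained incrementally on successive batches.
classifier = SGDClassifier()
classifier.fit(X, labels)

# Persist the trained classifier; the stateless vectorizer can simply be recreated.
joblib.dump(classifier, "classifier.joblib")
```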
I think we have a useful setup for further experiments in https://github.com/fsteeg/python-data-analysis:
The bulk classification uses a different vectorizer than the small-set experiments to make it work with larger data sets, so the results are not directly comparable. The mid-size and full-size experiments, however, use the same setup and yield very comparable results, so the mid-size setup should be a useful basis for further experiments, with a runtime low enough to try multiple configurations (which features to use, how to configure the vectorizer, which classifier to use and how to configure it). Some areas to investigate next:
Find best parameter set to use with vectorizer and classifier by running all combinations of parameters in a cross-validated grid search. Runs jobs concurrently and outputs runtime info. See hbz/lobid-resources#681
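For example, a cross-validated grid search over a vectorizer/classifier pipeline could look like this (the pipeline components and the parameter grid are assumptions, not the configuration used in the repository):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", SGDClassifier()),
])

# Every combination of these parameters is evaluated with cross-validation.
parameters = {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "vectorizer__max_df": [0.5, 0.75, 1.0],
    "classifier__alpha": [1e-3, 1e-4, 1e-5],
}

# n_jobs=-1 runs the candidate fits concurrently; verbose prints runtime information.
search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)

# Raw training titles and their subject labels, as in the CSV sketches above.
search.fit(train["title"].fillna(""), train["subject"])
print(search.best_score_)
print(search.best_params_)
```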
Should we reconsider this, since it was tested for NWBib in hbz/nwbib#560?
As announced in our internal planning document (AEP), we want to expand our expertise in text mining.
As a reasonable, well-defined, and useful project in that area, I suggest we attempt to set up automatic classification for titles in our union catalog. More than half of our catalog has no subjects (a quick way to compare the counts is sketched after the links):
http://lobid.org/resources/search?q=NOT+_exists_%3Asubject
http://lobid.org/resources/search?q=_exists_%3Asubject
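For instance, the two counts could be compared with a short script like this (the `totalItems` field name is an assumption about the API's JSON response layout):

```python
import requests

API = "http://lobid.org/resources/search"

def count(query):
    # The "totalItems" field name is an assumption about the API's JSON response.
    response = requests.get(API, params={"q": query, "format": "json"})
    response.raise_for_status()
    return response.json()["totalItems"]

with_subject = count("_exists_:subject")
without_subject = count("NOT _exists_:subject")
print(f"with subject: {with_subject}, without subject: {without_subject}")
```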
The basic approach could be this: we use a part of the classified documents as our training set, and the rest of the classified documents as the gold standard to evaluate our classification method. The training and gold sets will have to contain a selection of documents across all subjects. With this basic setup, we can experiment with different features to represent a document and different classification algorithms.
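A small sketch of that split; stratifying on the subject keeps documents from every subject in both the training set and the gold set (the placeholder data stands in for the classified part of the catalog):

```python
from sklearn.model_selection import train_test_split

# Placeholder data; in practice, the titles and subjects of all classified documents.
titles = ["title a", "title b", "title c", "title d"]
subjects = ["Wirtschaft", "Wirtschaft", "Geschichte", "Geschichte"]

# Stratifying by subject ensures a selection of documents across all subjects
# in both the training set and the gold set.
train_titles, gold_titles, train_subjects, gold_subjects = train_test_split(
    titles, subjects, test_size=0.5, stratify=subjects, random_state=0
)
```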
When we get good results with our gold set, we can apply our classification method to the unclassified documents. These can get a new value for `subject.source`, allowing them to be treated differently in queries and display.