
Automatic classification setup #681

Open
fsteeg opened this issue Jan 19, 2018 · 11 comments

@fsteeg
Member

fsteeg commented Jan 19, 2018

As announced in our internal planning document (AEP), we want to expand our expertise in text mining.

As a reasonable, well-defined, and useful project in that area, I suggest we attempt to set up automatic classification for titles in our union catalog. More than half of our catalog has no subjects:

http://lobid.org/resources/search?q=NOT+_exists_%3Asubject
http://lobid.org/resources/search?q=_exists_%3Asubject
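
For reference, a minimal sketch of checking those counts programmatically, assuming the JSON response of the search endpoint contains a totalItems field (an assumption to verify against the lobid API documentation):

```python
# Count classified vs. unclassified resources via the lobid API.
import requests

BASE = "http://lobid.org/resources/search"

def count(query):
    # format=json and the totalItems field are assumptions about the API.
    response = requests.get(BASE, params={"q": query, "format": "json"})
    response.raise_for_status()
    return response.json()["totalItems"]

print("without subjects:", count("NOT _exists_:subject"))
print("with subjects:", count("_exists_:subject"))
```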

The basic approach could be this: we use a part of the classified documents as our training set, and the rest of the classified documents as the gold standard to evaluate our classification method. The training and gold sets will have to contain a selection of documents across all subjects. With this basic setup, we can experiment with different features to represent a document and different classification algorithms.
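
A minimal sketch of this basic setup, assuming scikit-learn; the documents and subjects are made-up toy placeholders, and tf-idf plus Naive Bayes are just one possible feature/algorithm combination to start the experiments with:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

documents = ["Geschichte des Rheinlands", "Industrie im Ruhrgebiet",
             "Kölner Stadtgeschichte", "Bergbau und Wirtschaft",
             "Mittelalter am Niederrhein", "Handel in Westfalen"]  # toy data
subjects = ["Geschichte", "Wirtschaft", "Geschichte",
            "Wirtschaft", "Geschichte", "Wirtschaft"]  # toy labels

# Split the classified documents: one part for training, the rest as the
# gold standard; stratifying keeps all subjects present in both sets.
train_docs, gold_docs, train_y, gold_y = train_test_split(
    documents, subjects, test_size=0.5, stratify=subjects, random_state=0)

vectorizer = TfidfVectorizer()   # one way to represent a document
classifier = MultinomialNB()     # one classification algorithm to try
classifier.fit(vectorizer.fit_transform(train_docs), train_y)

predictions = classifier.predict(vectorizer.transform(gold_docs))
print("accuracy on gold set:", accuracy_score(gold_y, predictions))
```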

When we get good results on our gold set, we can apply our classification method to the documents without subjects. These can get a new value for subject.source, allowing them to be treated differently in queries and display.

@fsteeg fsteeg self-assigned this Jan 19, 2018
@fsteeg fsteeg added the ready label Jan 19, 2018
@fsteeg
Member Author

fsteeg commented Jan 19, 2018

Added more text to the initial comment after accidentally posting it. Thoughts, @acka47 @dr0i?

@ChristophEwertowski
Contributor

A quick shot: interesting fields could be title, shortTitle, otherTitleInformation, or bibliographicCitation. The source would have to be a different one, or maybe we could add it to the title like the DNB supposedly does (I can't open it; the converter is broken).

@acka47
Contributor

acka47 commented Jan 22, 2018

There has already been a lot of discussion about automatic subject indexing, of which I have only noticed a small part. From what I have heard, it makes most sense to assist subject indexers with automatic tools in a semi-automatic process. And I guess it doesn't make much sense to guess a subject based on the bibliographic data alone. My feeling is that you'd at least need an abstract to get a satisfying result (but I haven't reviewed the literature and could not quickly find a project where this has been tested).

We have around 446k resources with a fulltext or summary link but without subject information,
see http://lobid.org/resources/search?q=NOT+_exists_%3Asubject+AND+%28_exists_%3AfulltextOnline+OR+_exists_%3Adescription%29, compared to 532k that have both (http://lobid.org/resources/search?q=_exists_%3Asubject+AND+%28_exists_%3AfulltextOnline+OR+_exists_%3Adescription%29). Maybe that is a good subset to start with. (We probably won't have access to 100% of the linked fulltexts via the hbz network, but I guess we could retrieve the majority.)

@dr0i
Member

dr0i commented Jan 22, 2018

I have wanted to play with automatic subject enrichment for a long time. Without fulltexts we would have to be braver, since we will produce more inadequate subjects. Also, concerning the non-fulltexts: for resources without subjects but with authors who are linked to other resources that already have subjects, I think we won't be that bad at guessing the proper subjects, because then we would already have some domain-specific knowledge. (The same would be true, although broader and less exact, with the publisher: e.g. O'Reilly titles can be expected to fall into the IT/tech domain.)
So there are ways to get proper subjects. Also, I think RVK is not fully incorporated in hbz01, because there is no good matching with resources in e.g. the BVB, right @acka47? I would go the easy and safe paths first, in this order:

  • RVK enrichment
  • text mining using fulltexts/abstracts/TOCs
  • text mining using the metadata, combined with domain-specific knowledge (author, publisher, ...)
  • text mining the metadata alone, with highly inaccurate results to be expected

Feel free to add to, remove from, or rearrange that list.
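
An illustrative sketch of the author-based guess described above: for a resource without subjects, collect the subjects of other resources that share an author and propose the most frequent ones. The `records` structure and its keys are hypothetical, not actual lobid field names.

```python
# Hypothetical record structure: {"authors": [...], "subjects": [...]}.
from collections import Counter

def guess_subjects(record, records, top_n=3):
    """Propose subjects for `record` from other records sharing an author."""
    counts = Counter()
    for other in records:
        if other is record or not other.get("subjects"):
            continue
        if set(record["authors"]) & set(other["authors"]):
            counts.update(other["subjects"])
    return [subject for subject, _ in counts.most_common(top_n)]
```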

@acka47
Contributor

acka47 commented Jan 24, 2018

Here is a highly relevant article from 2017 titled "Using Titles vs. Full-text as Source for Automated Semantic Document Annotation": https://arxiv.org/abs/1705.05311

The abstract says:

The results show that across three of our four datasets, the performance of the classifications using only titles reaches over 90% of the quality compared to the classification performance when using the full-text. Thus, conducting document classification by just using the titles is a reasonable approach for automated semantic annotation and opens up new possibilities for enriching Knowledge Graphs.

@dr0i
Member

dr0i commented Jan 25, 2018

There is now a newer article which implies the same: https://arxiv.org/abs/1801.06717. So we should go with title-based subject extraction!

@acka47
Contributor

acka47 commented Jan 26, 2018

Notably, the projects in the two papers operated on quite homogeneous data ("two datasets are obtained from scientific digital libraries in the domains of economics and political sciences along with two news datasets from Reuters and New York Times"). Compared to this, the hbz catalog contains descriptions of very heterogeneous bibliographic resources. It would probably make sense to create more homogeneous subsets first before conducting the automatic indexing, or at least to take important information about the field of a resource into account. As I recently said in a face-to-face talk: the holding institutions are also a good indicator of the topic, as many of them have a collection focus.

E.g., to get a list of libraries from the hbz network whose collections include resources on economics, you can query lobid-organisations like this: http://lobid.org/organisations/search?q=linkedTo.id%3A%22http%3A%2F%2Flobid.org%2Forganisations%2FDE-605%23%21%22+AND+Wirtschaftswissenschaften&location=
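
Building on that idea, here is a minimal sketch of how the holding institution could be fed into the classification setup as an extra feature next to the title text, assuming scikit-learn; the DataFrame and its column names (`title`, `owner`, `subject`) are placeholders, not actual project code.

```python
# Combine free-text titles with the holding institution as a categorical
# feature, since many institutions have a collection focus.
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

features = ColumnTransformer([
    ("title", TfidfVectorizer(), "title"),  # tf-idf over the title text
    ("owner", OneHotEncoder(handle_unknown="ignore"), ["owner"]),  # institution
])
model = Pipeline([("features", features),
                  ("clf", LogisticRegression(max_iter=1000))])
# model.fit(df[["title", "owner"]], df["subject"])  # df is a placeholder
```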

@fsteeg
Member Author

fsteeg commented Feb 2, 2018

As a smaller set that is also somewhat more homogeneous than the full catalog, we could use NWBib:

https://test.nwbib.de/search?location=&q=NOT+_exists_%3Asubject
https://test.nwbib.de/search?location=&q=_exists_%3Asubject

Completing the NWBib subjects would also be a self-contained project before taking on the full catalog.

@acka47
Contributor

acka47 commented Feb 2, 2018

+1 I think the NWBib editors would even review the results and transfer the correct subjects into Aleph.

fsteeg added a commit to fsteeg/python-data-analysis that referenced this issue Feb 2, 2018
Get data with training, testing, and missing subjects from NWBib,
starting with small training and testing sets, using a single
subject from the NWBib classification (Sachsystematik).

See hbz/lobid-resources#681
@fsteeg fsteeg added working and removed ready labels Feb 6, 2018
fsteeg added a commit to fsteeg/python-data-analysis that referenced this issue Feb 7, 2018
Load data from the local CSV files into a corpus and create feature
vectors for each document using the tf-idf values of each term.

See hbz/lobid-resources#681
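
A minimal sketch of the step this commit describes, assuming pandas and scikit-learn; the file name and column name are placeholders for the actual CSV layout in the repository.

```python
# Load a corpus from CSV and turn each document into a sparse tf-idf vector.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = pd.read_csv("nwbib_training.csv")  # hypothetical file name
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(corpus["title"])  # hypothetical column
print(vectors.shape)  # (number of documents, vocabulary size)
```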
fsteeg added a commit to fsteeg/python-data-analysis that referenced this issue Feb 16, 2018
Vectorize and classify the local data from CSV files using
different vectorizer and classifier implementations, evaluate
using the test set, and write best result to a CSV file.

Trying to approach full NWBib set for experiments, see details in
nwbib_subjects_load.py. Using tiny training and test sets, so
absolute values are not meaningful, but relative differences should
provide hints about which methods to test with the full data set.

See hbz/lobid-resources#681
fsteeg added a commit to fsteeg/python-data-analysis that referenced this issue Feb 16, 2018
Get larger sample data in `bulk` format from the Lobid API,
vectorize using stateless HashingVectorizer (requires no fitting
of full corpus), use stop words, classify using sparse vectors,
write trained classifier to disk.

See hbz/lobid-resources#681
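
A sketch of the approach this commit describes, assuming scikit-learn and joblib: a stateless HashingVectorizer needs no fitting on the full corpus, stop words are filtered out, the classifier works on sparse vectors, and the trained model is written to disk. The stop word list and data names are placeholders.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
import joblib

# scikit-learn only ships an English stop word list, so a German list
# would be passed in explicitly (tiny placeholder shown here).
german_stop_words = ["der", "die", "das", "und", "ein", "eine"]

vectorizer = HashingVectorizer(stop_words=german_stop_words, n_features=2**18)
classifier = SGDClassifier()
# classifier.fit(vectorizer.transform(train_docs), train_labels)  # placeholders
# joblib.dump(classifier, "classifier.joblib")  # persist the trained model
```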
@fsteeg
Member Author

fsteeg commented Feb 19, 2018

I think we have a useful setup for further experiments in https://github.com/fsteeg/python-data-analysis:

The bulk classification uses a different vectorizer than the small-set experiments to make it work with larger data sets, so those results are not directly comparable. The mid-size and full-size runs, however, use the same setup and yield very comparable results, so the mid-size setup should be a useful basis for further experiments, with a runtime low enough to try multiple configurations (which features to use, how to configure the vectorizer, which classifier to use and how to configure it).

Some areas to investigate next:

  • Use the sklearn pipeline API
  • Runtime issues: measure, include in scoring, run different setups concurrently
  • Experiment with word embeddings, paragraph vectors
  • Consider the scikit-learn wrappers in gensim (doc2vec, word2vec); see the sketch after this list
  • Set up Jupyter notebooks (see the gensim notebooks, Word2Vec for document classification)
  • Visualize classifier result distribution and result accuracy
  • Experiment with which fields to use, run experiments with all subsets of all fields
  • Collect additional textual data from super- and subordinate entries, compare results
  • Also work with Raumsystematik, not just Sachsystematik, what about GND subjects?
  • Investigate multi-class classification (currently only 1 class used for training and testing)
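
As referenced in the list above, a hedged sketch of the paragraph-vector idea using gensim's Doc2Vec directly; the toy titles are made-up illustration data, not catalog records.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

titles = ["Geschichte des Rheinlands", "Industrie im Ruhrgebiet"]  # toy data
tagged = [TaggedDocument(words=t.lower().split(), tags=[i])
          for i, t in enumerate(titles)]
model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=20)
# Dense document vectors that could replace tf-idf vectors as classifier input:
vectors = [model.infer_vector(doc.words) for doc in tagged]
```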

@fsteeg fsteeg added ready and removed working labels Feb 19, 2018
fsteeg added a commit to fsteeg/python-data-analysis that referenced this issue Feb 23, 2018
Find best parameter set to use with vectorizer and classifier
by running all combinations of parameters in a cross-validated
grid search. Runs jobs concurrently and outputs runtime info.

See hbz/lobid-resources#681
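
A sketch of the cross-validated grid search this commit describes, assuming scikit-learn; the pipeline steps and the parameter grid are illustrative, not the actual configuration from the repository.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", SGDClassifier())])
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [1e-4, 1e-5],
}
# n_jobs=-1 fits the parameter combinations concurrently; verbose prints
# runtime information for each fit.
search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1, verbose=1)
# search.fit(train_docs, train_labels)  # placeholders
# print(search.best_params_, search.best_score_)
```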
@fsteeg fsteeg added working and removed ready labels Mar 5, 2018
@fsteeg fsteeg removed the working label Mar 8, 2018
@fsteeg fsteeg added the ready label Mar 8, 2018
@acka47 acka47 removed the ready label Apr 9, 2019
@TobiasNx
Contributor

Should we reconsider this, since it was tested for NWBib in hbz/nwbib#560?
