classifier: add core- and anti-HEP-ontology #2282

Closed
wants to merge 4 commits into from

Conversation

fschwenn
Contributor

  • Adds a reduced ontology with only the core HEP keywords.

  • Adds an anti-HEP-ontology with keywords indicating irrelevance
    for HEP.

Signed-off-by: fschwenn florian.schwennsen@desy.de

@kaplun kaplun changed the title from "assifierext: add core- and anti-HEP-ontology" to "classifier: add core- and anti-HEP-ontology" on Apr 25, 2017
@kaplun
Contributor

kaplun commented Apr 25, 2017

@fschwenn do you instead have changes with respect to the current full ontology? Also, how do you envision using the antihep one? As soon as there is an antiHEP keyword matched then it's a reject?

@michamos
Contributor

michamos commented Apr 25, 2017

As soon as there is an antiHEP keyword matched then it's a reject?

I hope not, anti-HEP contains for example Fibonacci, and we have 335 core records with Fibonacci in the fulltext.

@kaplun
Contributor

kaplun commented Apr 25, 2017

@michamos, @fschwenn I dug through the code and discovered that the antiHEP KB is already there: https://github.com/inspirehep/inspire-next/blob/master/inspirehep/kbs/corepar.kb
(a bit hidden given it's called corepar.kb rather than antihep.kb.)
It is used here:
https://github.com/inspirehep/inspire-next/blob/master/inspirehep/modules/workflows/proxies.py#L38-L47

https://github.com/inspirehep/inspire-next/blob/master/inspirehep/modules/workflows/tasks/classifier.py#L43

So the current meaning is: list all core keywords found in the paper that are not anti HEP.
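
A minimal sketch of that filtering step, for illustration only. It does not reproduce the code in proxies.py or classifier.py; the function name `filter_core_keywords` and the sample keywords are assumptions made up for this example.

```python
def filter_core_keywords(found_keywords, excluded_keywords):
    """Return the found keywords that are not on the exclusion list.

    Hypothetical helper: the matching here is a plain case-insensitive
    comparison, which may differ from what the real classifier does.
    """
    excluded = {kw.lower() for kw in excluded_keywords}
    # Preserve the order in which the keywords were found.
    return [kw for kw in found_keywords if kw.lower() not in excluded]


# Invented sample data, not taken from corepar.kb.
found = ["keyword one", "keyword two", "keyword three"]
excluded = ["Keyword Two"]
remaining = filter_core_keywords(found, excluded)
print(remaining)
```

Only the keywords that survive the filter would then be listed for the paper.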

Now I am puzzled: why are there keywords that are core but anti hep? What does it mean? Is it because you want to consider these as keywords that you want to have in the record but not use as criteria to evaluate coreness of a paper?

@fschwenn
Contributor Author

The HEP ontology is up to date.
corepar.kb is the list of particle keywords (core and non-core) that should in fact not be counted when deciding whether a paper is core, since keywords like "B" or "K" can be found in any kind of paper.
antihep.kb contains keywords that appear in articles we scan but strongly indicate that the article is not relevant (like 'interplanetary' in astro papers, 'positron emission tomography' in instrumentation papers, 'Brillouin' in cond-mat papers). The appearance of such a keyword does not lead to rejection by itself, only together with other indicators. For the journals at DESY we use core-keywords, antiHEP keywords, #of references in INSPIRE, #of references in INSPIRE which are CORE, #of core-PACS, #of CORE-papers by the authors in combination.
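
As a rough illustration of how those six indicators could be carried around together, here is a small sketch. The class name `SelectionSignals`, its field names, and the sample numbers are all hypothetical; no weighting or rejection rule is implied, the values are simply collected side by side.

```python
from dataclasses import dataclass


@dataclass
class SelectionSignals:
    """Hypothetical container for the indicators listed above."""
    core_keywords: int           # number of core keywords matched
    anti_hep_keywords: int       # number of antiHEP keywords matched
    refs_in_inspire: int         # references found in INSPIRE
    core_refs_in_inspire: int    # of those, references that are CORE
    core_pacs: int               # PACS codes flagged as CORE
    core_papers_by_authors: int  # CORE papers by the authors

    def summary(self):
        """One-line overview of all indicators, without any scoring."""
        return (
            f"core_kw={self.core_keywords} "
            f"anti_hep_kw={self.anti_hep_keywords} "
            f"refs={self.refs_in_inspire} "
            f"core_refs={self.core_refs_in_inspire} "
            f"core_pacs={self.core_pacs} "
            f"core_papers_by_authors={self.core_papers_by_authors}"
        )


# Invented numbers, for illustration only.
signals = SelectionSignals(3, 1, 12, 7, 2, 5)
print(signals.summary())
```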

@kaplun
Contributor

kaplun commented Jul 5, 2017

@fschwenn coming back to this.

For the journals at DESY we use core-keywords, antiHEP keywords, #of references in INSPIRE, #of references in INSPIRE which are CORE, #of core-PACS, #of CORE-papers by the authors in combination.

Does that mean that you display all this information and then a human manually takes the final decision?

All these metrics sound like something we can compute, and then we can train some ML to learn from the human decisions.

@fschwenn
Contributor Author

fschwenn commented Jul 5, 2017

Does that mean that you display all this information and then a human manually takes the final decision?

Exactly.

@kaplun
Contributor

kaplun commented Jul 5, 2017

What exactly do you mean by #of core-PACS? Also, how do you know the core authors, given that you first need to guess who is who?

@fschwenn
Contributor Author

fschwenn commented Jul 6, 2017

I aggregated a list of mappings 'PACS->plain text' from the different editions of the PACS lists, since for selection purposes '04.60.Bc' is not very helpful but 'Phenomenology of quantum gravity' is! I also added the information whether a PACS code indicates COREness, as is the case for the example '04.60.Bc' and for all PACS codes starting with 1. If the metadata of a journal article includes PACS codes, my program translates them to plain text and counts how many are CORE.
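
The PACS translation and counting could look roughly like the sketch below. This is not the actual DESY program. The names `PACS_TABLE` and `translate_and_count` are assumptions, and the '99.99.Zz' entry is an invented placeholder. Only the '04.60.Bc' entry and the "starts with 1" rule come from the description above.

```python
# Hypothetical excerpt of a 'PACS -> (plain text, is CORE)' mapping.
PACS_TABLE = {
    "04.60.Bc": ("Phenomenology of quantum gravity", True),
    "99.99.Zz": ("Invented non-core placeholder topic", False),
}


def translate_and_count(codes, table):
    """Translate PACS codes to plain text and count how many are CORE.

    A code counts as CORE if it starts with '1' or if the table flags it.
    Codes missing from the table are passed through untranslated.
    """
    texts = []
    core_count = 0
    for code in codes:
        text, flagged = table.get(code, (code, False))
        texts.append(text)
        if code.startswith("1") or flagged:
            core_count += 1
    return texts, core_count


texts, core_count = translate_and_count(["04.60.Bc", "12.60.Jv", "99.99.Zz"], PACS_TABLE)
print(texts, core_count)
```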
My routine for counting 'core authors' is very crude. If there are fewer than 20 authors on the paper, it just searches for "ea:[author as in the metadata]". It counts how many core papers each individual author has, and how many core papers the authors have written all together as co-authors of the same paper. From these INSPIRE papers my code tries to guess whether there is a fieldcode that the authors have very often. For Chinese papers this is not very helpful, but for instrumentation papers the output "0,0,5;0" with fieldcode-guess=HEP-Theory tells me that two authors (probably) have no CORE paper in INSPIRE and the third one probably has the same name as a theorist => paper is (probably) not interesting. In contrast, "4,30,10;3" with fieldcode-guess=Instrumentation would tell me the three authors have already written 3 CORE papers as a team, with the fieldcode-guess confirming that it's really them => paper is interesting.
All in all it is not an extremely strong indicator and cannot be used to immediately reject papers, but I would miss it, as it sometimes draws my attention to papers which I otherwise would have missed. Also, for conference proceedings this way of guessing the field code is quite efficient for distinguishing 'n' from 'x' and 'p' from 'e', where it just saves a bit of my time.
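
A rough sketch of the author counting described above, under stated assumptions: instead of running "ea:" searches against INSPIRE, it works on an in-memory list of paper dicts with invented keys (`authors`, `core`, `fieldcodes`), and the function name `author_signal` is hypothetical. It only mimics the "per-author counts;joint count" output and the most-frequent-fieldcode guess.

```python
from collections import Counter

MAX_AUTHORS = 20  # papers with 20 or more authors are skipped


def author_signal(authors, papers):
    """Return ("n1,n2,...;joint", fieldcode_guess), or None if too many authors.

    n_i   = number of CORE papers by author i
    joint = number of CORE papers on which all authors appear together
    """
    if len(authors) >= MAX_AUTHORS:
        return None
    per_author = [
        sum(1 for p in papers if p["core"] and a in p["authors"])
        for a in authors
    ]
    joint = sum(
        1 for p in papers if p["core"] and set(authors) <= set(p["authors"])
    )
    # Guess the field code as the one seen most often on the authors' papers.
    fields = Counter(
        fc
        for p in papers
        if any(a in p["authors"] for a in authors)
        for fc in p["fieldcodes"]
    )
    guess = fields.most_common(1)[0][0] if fields else None
    counts = ",".join(str(n) for n in per_author)
    return f"{counts};{joint}", guess


# Invented sample data, for illustration only.
papers = [
    {"authors": ["A", "B", "C"], "core": True, "fieldcodes": ["Instrumentation"]},
    {"authors": ["B"], "core": True, "fieldcodes": ["Instrumentation"]},
    {"authors": ["C", "X"], "core": False, "fieldcodes": ["HEP-Theory"]},
]
print(author_signal(["A", "B", "C"], papers))
```

Matching authors by exact name string, as this sketch does, has the same weakness mentioned above: two different people with the same name are counted as one.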

* Updates the DESY maintained HEP ontology.

Signed-off-by: fschwenn <florian.schwennsen@desy.de>
* Updates the DESY maintained HEP ontology.

Signed-off-by: fschwenn <florian.schwennsen@desy.de>
* Updates the DESY maintained reduced HEP ontology with only
  the CORE keywords.

Signed-off-by: fschwenn <florian.schwennsen@desy.de>
@jacquerie jacquerie mentioned this pull request Dec 12, 2017
8 tasks
@ghost ghost removed the Status: WIP label Dec 12, 2017
4 participants