Data model: full support for references #430

Closed
2 tasks done
kaplun opened this issue Oct 20, 2015 · 16 comments · Fixed by #1279

Comments

@kaplun
Contributor

kaplun commented Oct 20, 2015

  • references should be lists of lists (because there might be errata/ibid on the same reference)
  • What was once a pubnote in $s should be split up into journal title, volume, issue, and start page
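
To make the second point concrete, here is a minimal sketch of splitting a pubnote string into separate fields. It assumes the pubnote is a comma-separated `journal,volume,page` string; the function name `split_pubnote`, the output keys, and the sample values are all hypothetical and not the actual schema.

```python
def split_pubnote(pubnote):
    """Split a 'journal,volume,page' string into separate fields.

    Assumption: exactly three comma-separated parts. The key names
    are illustrative only, not the final data model.
    """
    parts = [part.strip() for part in pubnote.split(',')]
    if len(parts) != 3:
        raise ValueError('unexpected pubnote format: %r' % pubnote)
    journal_title, volume, pages = parts
    # Keep only the start page when a range such as '345-350' is given.
    page_start = pages.split('-')[0]
    return {
        'journal_title': journal_title,
        'volume': volume,
        'issue': None,  # not recoverable from this assumed format
        'page_start': page_start,
    }


# Made-up values: one reference carrying an article and its erratum,
# which is the "list of lists" case from the first point.
references = [
    [split_pubnote('J.Example.,12,345-350'),
     split_pubnote('J.Example.,13,99')],
]
print(references[0][0]['page_start'])
```

Anything that does not fit the assumed three-part shape raises an error here, so a real migration would need a fallback for free-form pubnotes.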
@kaplun kaplun added the roadmap label Oct 20, 2015
@kaplun kaplun added this to the Citation machinery on Labs milestone Oct 20, 2015
@kaplun
Contributor Author

kaplun commented Oct 20, 2015

cc: @jalavik, @annetteholtkamp

@annetteholtkamp

We would also need an object for “reportnr, page” in case a report contains several contributions - see e.g. CERN Yellow Reports.
And also “conf acronym, page/article id” e.g. for JACoW conferences.

Annette

On 20 Oct 2015, at 14:30, Samuele Kaplun notifications@github.com wrote:

cc: @jalavik, @annetteholtkamp



@kaplun
Contributor Author

kaplun commented Oct 20, 2015

Well, if we split up $s into its components then we would naturally have separate page and reportnumber supporting your above Yellow report use case. Regarding Conf acronym, do we have some today in 999C5?

@annetteholtkamp

Not yet, but we already have some in 773. As soon as we’re exposing that in our bibliographic data we should be able to recognise the corresponding references as well.

Annette

On 20 Oct 2015, at 14:53, Samuele Kaplun notifications@github.com wrote:

Well, if we split up $s into its components then we would naturally have separate page and reportnumber supporting your above Yellow report use case. Regarding Conf acronym, do we have some today in 999C5?



@aw-bib

aw-bib commented Oct 20, 2015

Just to ask:

  • a reference is (usually) pointing to an existing record
  • basically, thus it has the very same structure as a record
  • it even probably needs this complex structure to model all different pub types etc.

Thus, isn't the link(tm) enough, probably pulling in some fields for display and getting expansions for indexing?

I.e. for me references sound a bit like "just the same as the gigantic workflow, except its children live in HEP space".

The exceptions are references that are not in INSPIRE, i.e. records that usually do not get curated, etc. Thus they will end up as free-form text anyway.

I wonder if such an approach would not simplify the model a lot.

@kaplun
Contributor Author

kaplun commented Oct 20, 2015

@aw-bib we have to store the whole reference (possibly already structured) because we don't know if:

  • it matches any record at the time the record is ingested
  • it will possibly match a future record still to arrive
  • it is currently matching a record but this is a mistake and the reference structure is used to check this.

So the link is not enough.
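
A minimal sketch of why the stored structure matters, under stated assumptions: the `Reference` class, its field names, the `rematch` helper, and the lookup index keyed on `(journal_title, volume, page_start)` are all hypothetical. The point is only that the link can be recomputed from the stored structure at any time.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Reference:
    raw: str                              # original reference string
    journal_title: Optional[str] = None   # structured fields (hypothetical names)
    volume: Optional[str] = None
    page_start: Optional[str] = None
    record_id: Optional[int] = None       # link to a matched record, if any


def rematch(reference: Reference, index: Dict[Tuple, int]) -> Reference:
    """Recompute the link from the stored structure, ignoring the old link."""
    key = (reference.journal_title, reference.volume, reference.page_start)
    reference.record_id = index.get(key)
    return reference


ref = Reference(raw='J. Example 12 (2015) 345',
                journal_title='J.Example.', volume='12', page_start='345')

index = {}                       # no matching record at ingestion time
rematch(ref, index)
assert ref.record_id is None

index[('J.Example.', '12', '345')] = 1001   # the record arrives later
rematch(ref, index)
assert ref.record_id == 1001

ref.record_id = 9999             # a mistaken link...
rematch(ref, index)              # ...is detected and fixed from the structure
assert ref.record_id == 1001
```

With only the link stored, none of the three cases could be re-evaluated after ingestion.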

But indeed it is a good point that the reference could basically be structured almost as a whole record. That opens up quite a few points for reflection...

In fact it all boils down to how much information we are able to match from publishers or guess from PDFs via refextract/Grobid.

@aw-bib

aw-bib commented Oct 21, 2015

I see your point, but wouldn't in this case storage of the raw string be enough?

If not, why not use the extracted data to create a (stub) record and use this for linking? Then you are sure you can map every need. If the real record comes in later, just brush up the stub by the usual merging. I think this simplifies the overall data structure.

So the question is whether it is worth the effort to add a nested record structure, with all its complications.
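
A rough sketch of the stub-record idea, to make the trade-off easier to discuss. Everything here is an assumption for illustration: the helper names `make_stub` and `merge_into_stub`, the `stub` flag, and the rule that values from the real record win on merge.

```python
def make_stub(record_id, extracted):
    """Create a stub record from data extracted out of a reference."""
    stub = dict(extracted)
    stub['id'] = record_id
    stub['stub'] = True   # marks the record as not yet curated
    return stub


def merge_into_stub(stub, real):
    """Brush up a stub once the real record arrives.

    Values from the real record win; fields only the stub has are kept.
    """
    merged = dict(stub)
    merged.update(real)
    merged['id'] = stub['id']   # keep the id so existing links stay valid
    merged['stub'] = False
    return merged


stub = make_stub(1, {'title': 'bla'})
citing_record = {'references': [stub['id']]}   # the reference is just a link

merged = merge_into_stub(stub, {'title': 'blah', 'authors': ['A. Author']})
print(merged)
```

The citing record never changes here: it keeps pointing at the same id while the target is upgraded from stub to full record.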

@kaplun
Contributor Author

kaplun commented May 3, 2016

Moving discussion to dedicated issue #1099

@kaplun
Contributor Author

kaplun commented May 3, 2016

@bittirousku can you take care of the above points:

  • references should be lists of lists (because there might be errata/ibid on the same reference)
  • What was once a pubnote in $s should be split up into journal title, volume, issue, and start page

@bittirousku
Contributor

Sure, I can do that.

@eamonnmag
Contributor

A list of dicts is always better. Otherwise one has to iterate over all items and check properties to discover which reference should be displayed.

On Tue, 19 Jul 2016, 10:45 Jacopo Notarstefano, notifications@github.com
wrote:

Closed #430 via #1279.



@kaplun
Contributor Author

kaplun commented Jul 19, 2016

Uh! @eamonnmag what are you referring to?

@eamonnmag
Contributor

  • references should be lists of lists (because there might be errata/ibid on the same reference)
  • What was once a pubnote in $s should be split up into journal title, volume, issue, and start page

Unless I've misunderstood what you're storing in the references block and it's different from say, holdingpen, then if they are lists of lists you have this:

[
    [
        {'type': 'erratum', 'title': 'bla'},
        {'type': 'correct', 'title': 'blah'}
    ]
]

Now to display the references, I will have to loop over the array and get the 'correct' type in this case. If instead we have this:

[
    {
        'correct': {'title': 'blah'},
        'erratum': {'title': 'bla'}
    }
]

I can just iterate over each reference and get 'correct'.

Obviously the keys are rubbish in this case. But something like this. This would be especially convenient when you have perhaps even more than two.
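
For comparison, the two layouts above can be put side by side in a small runnable sketch. The helper names are made up for illustration, and the keys are the same placeholder ones used in the examples.

```python
# Layout A: list of lists, each inner list holds typed entries.
refs_as_lists = [
    [
        {'type': 'erratum', 'title': 'bla'},
        {'type': 'correct', 'title': 'blah'},
    ],
]

# Layout B: list of dicts, each entry keyed by its role.
refs_as_dicts = [
    {
        'correct': {'title': 'blah'},
        'erratum': {'title': 'bla'},
    },
]


def titles_from_lists(references):
    """Layout A: scan every inner list to find the 'correct' entry."""
    titles = []
    for group in references:
        for entry in group:
            if entry['type'] == 'correct':
                titles.append(entry['title'])
                break
    return titles


def titles_from_dicts(references):
    """Layout B: direct lookup by key, no inner scan."""
    return [reference['correct']['title'] for reference in references]


assert titles_from_lists(refs_as_lists) == ['blah']
assert titles_from_dicts(refs_as_dicts) == ['blah']
```

Both return the same titles; the difference is the nested loop and type check that layout A needs for every reference.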

@jacquerie jacquerie reopened this Jul 19, 2016
@jacquerie
Contributor

Currently references are still a list of dicts, but each reference has a list of raw_references inside.
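
As an illustration only, that shape could look like the sketch below. Only the name `raw_references` comes from this thread; the `title` field, the sample strings, and the two helper functions are hypothetical.

```python
references = [
    {
        'title': 'blah',  # hypothetical structured field used for display
        'raw_references': [
            'first raw string as extracted',
            'second raw string sharing the same reference number',
        ],
    },
]


def display_titles(references):
    """Display needs one lookup per reference, no inner scan."""
    return [reference.get('title', '') for reference in references]


def raw_strings(references):
    """Curators can still reach every raw string that fed a reference."""
    return [
        raw
        for reference in references
        for raw in reference.get('raw_references', [])
    ]


print(display_titles(references))
print(len(raw_strings(references)))
```

The outer list stays flat, so each displayed reference corresponds to exactly one dict, while the raw strings remain available underneath it.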

@mihaibivol
Contributor

The correct / erratum use case should always arise when adding info between versions, and it has to do with the way versioning is done. It will always be v0 = [{'title': 'bla'}] v1 = [{'titles': 'blah'}] -- use magic --> correct = [{'title': 'bla'}] and you will only display that title.

The problem with lists of lists was with references that share the same number. @kaplun, you had a PDF example. So far, refextract has made a mess of legacy references. Lists of lists are generally bad for versioning and merging, since you have to match a list of reference-like things denoting a single reference against another list of reference-like things. The goal is to always keep the correct version on top, and to keep previous versions of raw_reference only so that curators have fast access when fixing things.

@jacquerie
Contributor

Everything that had to be decided about this has already been decided in inspirehep/inspire-schemas#130.

@ghost ghost removed the Status: RFC label Apr 23, 2017