Include sameAs and depiction information from EntityFacts in lobid-gnd index #69

Closed
acka47 opened this Issue Apr 12, 2018 · 17 comments

3 participants
@acka47
Contributor

acka47 commented Apr 12, 2018

I wrote in #44:

Possible further step: Enrich JSON-LD with links from EntityFacts. There might be a dump in the future we could build a map from.

The dump is there now (see https://data.dnb.de/opendata/) and we already set up an ES index with the data.

Reasons for including the information in the data:

  • For some use cases (get all entries with a depiction, give me the GND entry with a specific ISNI etc.) we will need to query the data.
  • Also, it is confusing when some things are shown in the HTML that aren't in the JSON.
@acka47

Contributor

acka47 commented Apr 12, 2018

How will we embed the information? It gets a bit problematic because we already have sameAs information from the GND via OAI-PMH that is partly redundant but structured differently (string array vs. object array). For example, the sameAs for Lambert Heller:

Current lobid:

{
  "sameAs": [
    "http://viaf.org/viaf/313478392",
    "http://orcid.org/0000-0003-0232-7085"
  ]
}

EntityFacts:

{
  "sameAs":[
    {
      "@id":"http://d-nb.info/gnd/1066621098/about",
      "collection":{
        "abbr":"DNB",
        "name":"Gemeinsame Normdatei (GND) im Katalog der Deutschen Nationalbibliothek",
        "publisher":"Deutsche Nationalbibliothek",
        "icon":"http://www.dnb.de/SiteGlobals/StyleBundles/Bilder/favicon.png?__blob=normal&v=1"
      }
    },
    {
      "@id":"http://orcid.org/0000-0003-0232-7085",
      "collection":{
        "abbr":"ORCID",
        "name":"Open Researcher and Contributor ID",
        "publisher":"ORCID"
      }
    },
    {
      "@id":"http://viaf.org/viaf/313478392",
      "collection":{
        "abbr":"VIAF",
        "name":"Virtual International Authority File (VIAF)",
        "publisher":"OCLC",
        "icon":"http://viaf.org/viaf/images/viaf.ico"
      }
    }
  ]
}

After some discussion with @fsteeg, we prefer the following approach:

  • Adjust the JSON so that there is always an object array for sameAs like
{
  "sameAs":[
    {
      "@id":"http://viaf.org/viaf/313478392"
    },
    {
      "@id":"http://orcid.org/0000-0003-0232-7085"
    }
  ]
}
  • When ingesting EntityFacts data, replace the sameAs arrays.
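
The two steps above could be sketched roughly as follows. This is only an illustration, not the lobid-gnd implementation: the function names `normalize_same_as` and `merge_entity_facts` are hypothetical, and records are assumed to be plain dictionaries parsed from JSON.

```python
def normalize_same_as(same_as):
    """Turn a mixed list of URI strings and objects into a list of objects."""
    normalized = []
    for entry in same_as:
        if isinstance(entry, str):
            # Plain URI string (current lobid form): wrap it in an object
            normalized.append({"@id": entry})
        else:
            # Already an object (EntityFacts form): keep a shallow copy
            normalized.append(dict(entry))
    return normalized


def merge_entity_facts(gnd_record, entity_facts_record=None):
    """Return a copy of the GND record whose sameAs is always an object array.

    If EntityFacts provides a sameAs array, it replaces the GND one.
    """
    merged = dict(gnd_record)
    if entity_facts_record and entity_facts_record.get("sameAs"):
        merged["sameAs"] = normalize_same_as(entity_facts_record["sameAs"])
    else:
        merged["sameAs"] = normalize_same_as(gnd_record.get("sameAs", []))
    return merged
```

Records without an EntityFacts counterpart keep their GND links, just in the object form.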
@acka47

Contributor

acka47 commented Apr 12, 2018

Another challenge is the linking to Wikipedia. From GND we get the following:

{
  "id":"http://d-nb.info/gnd/118634313",
  "wikipedia":[
    "https://de.wikipedia.org/wiki/Ludwig_Wittgenstein"
  ]
}

In EntityFacts we get links to the German and English Wikipedia as part of the sameAs array:

{
  "@id":"http://d-nb.info/gnd/118634313",
  "sameAs":[
    {
      "@id":"https://de.wikipedia.org/wiki/Ludwig_Wittgenstein",
      "collection":{
        "abbr":"dewiki",
        "name":"Wikipedia (Deutsch)",
        "publisher":"Wikimedia Foundation Inc.",
        "icon":"https://de.wikipedia.org/static/favicon/wikipedia.ico"
      }
    },
    {
      "@id":"https://en.wikipedia.org/wiki/Ludwig_Wittgenstein",
      "collection":{
        "abbr":"enwiki",
        "name":"Wikipedia (English)",
        "publisher":"Wikimedia Foundation Inc.",
        "icon":"https://en.wikipedia.org/static/favicon/wikipedia.ico"
      }
    }
  ]
}

One solution would be to ignore the Wikipedia links from GND and only use those from EntityFacts. This would mean losing the links for all resources that are not part of EntityFacts. I think this is OK, as only ~170 resources have a Wikipedia link without also being in EntityFacts; see this query for resources with a Wikipedia link that are neither of type person, nor corporate body, nor PlaceOrGeographicName.

@acka47

Contributor

acka47 commented Apr 12, 2018

As discussed offline, I am also OK with keeping the (then mostly redundant) Wikipedia links in the wikipedia field.

@fsteeg fsteeg added the ready label Apr 12, 2018

@fsteeg fsteeg changed the title from Include sameAs and depiction information from EntityFacts to Include sameAs and depiction information from EntityFacts in lobid-gnd index Apr 17, 2018

@fsteeg fsteeg self-assigned this May 2, 2018

@fsteeg fsteeg added working and removed ready labels May 2, 2018

@fsteeg

Contributor

fsteeg commented May 3, 2018

I am currently processing the EntityFacts data and integrating it into our JSON data. When we last discussed this, we said we'd like to stay close to the source data, as we do for the core GND data.

For the depiction, I only replaced @id with id (note it's not an array):

"depiction": {
  "id": "https://commons.wikimedia.org/wiki/Special:FilePath/MarkTwain.LOC.jpg",
  "thumbnail": {
    "id": "https://commons.wikimedia.org/wiki/Special:FilePath/MarkTwain.LOC.jpg?width=270"
  },
  "url": "https://commons.wikimedia.org/wiki/File:MarkTwain.LOC.jpg?uselang=en"
}

For sameAs, I replaced @id with id and added collection.name as sameAs.label:

"sameAs": [ {
  "id": "https://www.deutsche-digitale-bibliothek.de/entity/118624822",
  "label": "Deutsche Digitale Bibliothek",
  "collection": {
    "abbr": "DDB",
    "name": "Deutsche Digitale Bibliothek",
    "publisher": "Deutsche Digitale Bibliothek",
    "icon": "https://www.deutsche-digitale-bibliothek.de/appStatic/images/favicon.ico"
  }
} ]

Added the label to be consistent with the normal, non-EntityFacts sameAs structure with id and label (although we don't have actual labels in the GND sameAs data yet). The same consistency with other fields would be nice for depiction too (but not that urgent, as we don't have depiction fields with the id and label structure, so maybe a separate format for depiction is OK).

I think the whole collection thing is not ideal. It has no id, making it a bit pointless to introduce a separate entity (it's a blank node when converted to triples). The icon is the useful part, and that's similar to the depiction, also an image.

Maybe an id/label/image structure could work both for depiction:

"depiction": [ {
  "id": "https://commons.wikimedia.org/wiki/File:MarkTwain.LOC.jpg?uselang=en",
  "label": "Mark Twain",
  "image": "https://commons.wikimedia.org/wiki/Special:FilePath/MarkTwain.LOC.jpg"
} ]

And for sameAs:

"sameAs": [ {
  "id": "https://www.deutsche-digitale-bibliothek.de/entity/118624822",
  "label": "Deutsche Digitale Bibliothek",
  "image": "https://www.deutsche-digitale-bibliothek.de/appStatic/images/favicon.ico"
} ]

Current implementation uses the EntityFacts JSON and inserts the data into the framed, compacted GND JSON. We will have to look into the context and add some things there to keep our JSON consumable as RDF (depiction and image if we go with the unified approach, a few more if we stay closer to the EntityFacts format).

@acka47

Contributor

acka47 commented May 3, 2018

Maybe an id/label/image structure could work both for depiction:

"depiction": [ {
 "id": "https://commons.wikimedia.org/wiki/File:MarkTwain.LOC.jpg?uselang=en",
 "label": "Mark Twain",
 "image": "https://commons.wikimedia.org/wiki/Special:FilePath/MarkTwain.LOC.jpg"
} ]

If we create a better structure from this, I would like not to stray too far from the source. Maybe something like this?

{
  "depiction":[
    {
      "id":"https://commons.wikimedia.org/wiki/Special:FilePath/MarkTwain.LOC.jpg",
      "url":"https://commons.wikimedia.org/wiki/File:MarkTwain.LOC.jpg?uselang=en",
      "thumbnail":"https://commons.wikimedia.org/wiki/Special:FilePath/MarkTwain.LOC.jpg?width=270"
    }
  ]
}

And for sameAs:

"sameAs": [ {
  "id": "https://www.deutsche-digitale-bibliothek.de/entity/118624822",
  "label": "Deutsche Digitale Bibliothek",
  "image": "https://www.deutsche-digitale-bibliothek.de/appStatic/images/favicon.ico"
} ]

In this case, I'd rather keep the structure as the icon is for the collection and not the linked thing/entry. The abbreviation etc. is also useful for some and should be included. We should probably add ids for the collections, e.g.:

{
  "id":"https://www.deutsche-digitale-bibliothek.de/entity/118624822",
  "collection":{
    "id":"http://d-nb.info/gnd/1070828033",
    "abbr":"DDB",
    "name":"Deutsche Digitale Bibliothek",
    "publisher":"Deutsche Digitale Bibliothek",
    "icon":"https://www.deutsche-digitale-bibliothek.de/appStatic/images/favicon.ico"
  }
}
@fsteeg

Contributor

fsteeg commented May 3, 2018

To summarize our offline discussion:

Reorder fields for depiction:

"depiction": [ {
  "id":"https://commons.wikimedia.org/wiki/Special:FilePath/MarkTwain.LOC.jpg",
  "url":"https://commons.wikimedia.org/wiki/File:MarkTwain.LOC.jpg?uselang=en",
  "thumbnail":"https://commons.wikimedia.org/wiki/Special:FilePath/MarkTwain.LOC.jpg?width=270"
} ]

Add id to collection for sameAs from EntityFacts:

"sameAs": [ {
  "id":"https://www.deutsche-digitale-bibliothek.de/entity/118624822",
  "collection":{
    "id":"http://d-nb.info/gnd/1070828033",
    "abbr":"DDB",
    "name":"Deutsche Digitale Bibliothek",
    "publisher":"Deutsche Digitale Bibliothek",
    "icon":"https://www.deutsche-digitale-bibliothek.de/appStatic/images/favicon.ico"
  }
} ]

Add collection to sameAs from GND instead of label:

"sameAs": [ {
  "id": "http://viaf.org/viaf/9537769",
  "collection": {
    "id": "https://viaf.org",
    "abbr": "VIAF",
    "name": "Virtual International Authority File (VIAF)",
    "publisher": "OCLC",
    "icon": "http://viaf.org/viaf/images/viaf.ico"
  }
} ]
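
A minimal sketch of the conversion summarized above, under some assumptions: the EntityFacts input uses `@id` keys and a nested `thumbnail` object (as in the earlier comment), and collection IDs come from a lookup keyed by the collection abbreviation. The function names and the `collection_ids` mapping are hypothetical, not taken from the lobid-gnd code.

```python
def convert_depiction(ef_depiction):
    """Flatten an EntityFacts depiction object into a one-element array."""
    return [{
        "id": ef_depiction["@id"],
        "url": ef_depiction["url"],
        # The nested thumbnail object becomes a plain URL string
        "thumbnail": ef_depiction["thumbnail"]["@id"],
    }]


def convert_same_as(ef_same_as, collection_ids):
    """Rename @id to id and add an id to each collection where one is known."""
    result = []
    for entry in ef_same_as:
        collection = dict(entry.get("collection", {}))
        abbr = collection.get("abbr")
        if abbr in collection_ids:
            # Put the id first, followed by the original collection fields
            collection = {"id": collection_ids[abbr], **collection}
        result.append({"id": entry["@id"], "collection": collection})
    return result
```

Collections without a known ID are passed through unchanged here; what to do with them is a separate decision.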

fsteeg added a commit that referenced this issue May 3, 2018

Consistent JSON structure for `depiction` and `sameAs`
- Tweak structure for enriched data from EntityFacts
- Add collection details for GND `sameAs` data
- Add new properties to JSON-LD context

See #69

fsteeg added a commit that referenced this issue May 4, 2018

Use enriched `depiction` and `sameAs` data for details view
Remove EntityFacts object and additional index lookup

See #69

fsteeg added a commit that referenced this issue May 4, 2018

Update index config
- Set up dynamic template for *.id subfields
- Replace string type with text or keyword

See #69
@acka47

Contributor

acka47 commented May 4, 2018

Here are the URIs for the different collections in EntityFacts. For consistency's sake I only took URIs from Wikidata. I am not sure that these are all the collections in EntityFacts. Also, some URIs are missing where we have to create a Wikidata entry first.

fsteeg added a commit that referenced this issue May 7, 2018

fsteeg added a commit that referenced this issue May 7, 2018

@fsteeg

Contributor

fsteeg commented May 7, 2018

Deployed new consistent JSON structure to test:

http://test.lobid.org/gnd/search?q=depiction.id:*&format=json
http://test.lobid.org/gnd/118512676.json (sameAs with EntityFacts)
http://test.lobid.org/gnd/1006691-3.json (sameAs without EntityFacts)

The collection.id values deployed are just the domains of the sameAs.id URLs.

The Wikidata QIDs from the comment above will be in the next index (will start conversion now).

@fsteeg

Contributor

fsteeg commented May 9, 2018

Wikidata based collection.id values now also deployed to test:

http://test.lobid.org/gnd/118512676.json (sameAs with EntityFacts)
http://test.lobid.org/gnd/1006691-3.json (sameAs without EntityFacts)

@acka47

Contributor

acka47 commented May 9, 2018

Everything looks good except one minor thing: publisher is mapped to dct:publisher in the context, which has rdfs:range dct:Agent and thus should rather be used with the publisher's URI and not a string. I suggest just using http://purl.org/dc/elements/1.1/publisher instead, but I would like to coordinate this with the DNB as they probably want to fix this in EntityFacts as well. Pinging @jentschk.
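
To illustrate the suggested change, here is a small sketch. It assumes the JSON-LD context maps the term `publisher` directly to a full property IRI; the real context may use a prefixed name or an expanded term definition instead, and the helper `remap_publisher` is hypothetical.

```python
import json

# DCMI Terms property (range dct:Agent, so it expects a resource)
DCT_PUBLISHER = "http://purl.org/dc/terms/publisher"
# Dublin Core Elements 1.1 property (no range restriction, strings are fine)
DC_PUBLISHER = "http://purl.org/dc/elements/1.1/publisher"


def remap_publisher(context):
    """Return a copy of the context with publisher pointing to the DC 1.1 property."""
    updated = dict(context)
    if updated.get("publisher") == DCT_PUBLISHER:
        updated["publisher"] = DC_PUBLISHER
    return updated


# Assumed, simplified context with a single term
context = {"publisher": DCT_PUBLISHER}
print(json.dumps({"@context": remap_publisher(context)}, indent=2))
```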

@jentschk

jentschk commented May 9, 2018

Absolutely with you. Will try to squeeze this in at the last minute as a correction in Release 2018.03 (https://wiki.dnb.de/x/wgcbBQ)! Many thanks for alerting.

@acka47

Contributor

acka47 commented May 9, 2018

@jentschk Nice, thanks.

fsteeg added a commit that referenced this issue May 9, 2018

@fsteeg

Contributor

fsteeg commented May 9, 2018

@acka47

Contributor

acka47 commented May 9, 2018

@jentschk provided me with a list of all collections used in EntityFacts. Looks like I have missed some in #69 (comment):

fsteeg added a commit that referenced this issue May 9, 2018

@fsteeg

Contributor

fsteeg commented May 9, 2018

Looks like I have missed some [...]

Current fallback is to use the domain of the linked resource as the collection ID, see https://test.lobid.org/gnd/125217145.json so I think it's fine to deploy what we have here. I'll move the missing collections to a new issue and assign the pull request for this issue for review.
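
The fallback described here could look roughly like this. It is a sketch only: the prefix-based lookup table, the function name `collection_id`, and the example.org collection ID are made up for illustration, and keeping the URL scheme in the fallback value is an assumption.

```python
from urllib.parse import urlsplit


def collection_id(same_as_id, known_collections):
    """Look up a collection ID by URL prefix, falling back to the link's domain."""
    for prefix, cid in known_collections.items():
        if same_as_id.startswith(prefix):
            return cid
    # No known collection: use scheme and host of the linked resource
    parts = urlsplit(same_as_id)
    return f"{parts.scheme}://{parts.netloc}"


# Hypothetical mapping from URL prefix to collection ID
known = {"http://viaf.org/viaf/": "https://example.org/collections/viaf"}
print(collection_id("http://viaf.org/viaf/9537769", known))
print(collection_id("https://www.deutsche-digitale-bibliothek.de/entity/118624822", known))
```

With this approach every sameAs entry gets some collection.id, and the missing collections can be added to the lookup later without changing the structure.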

@acka47

Contributor

acka47 commented May 14, 2018

I have now added the missing links to the comment above. While doing this I noticed that the link to the Sophie digital library is broken and the icon link also gives a 404, see http://sophie.byu.edu/. @jentschk, can you please replace this in EntityFacts and use https://scholarsarchive.byu.edu/sophie/ instead?

I also noticed that the links to the Vorarlberg Chronik do not work anymore and notified the submitter, see https://de.wikipedia.org/wiki/Benutzer_Diskussion:AndreasPraefcke/BEACON#Links_in_die_Voralberg-Chronik_sind_tot. It would probably be good to remove them from EntityFacts for now...

Also, are there any plans for a new (and ideally regular) EntityFacts dump?

@jentschk

jentschk commented May 14, 2018

Thanks, @acka47, for alerting. Both linking targets will be removed in our next monthly "enrichment update" (should be available by Wednesday morning).

@fsteeg fsteeg closed this in #99 May 15, 2018

fsteeg added a commit that referenced this issue May 15, 2018
