Collection from Union Catalogue missed in lobid #1052

Closed
hagbeck opened this issue Jan 30, 2020 · 16 comments

@hagbeck
Collaborator

hagbeck commented Jan 30, 2020

The union catalogue contains collections, e.g. for e-books, whose names do not start with "ZDB". These are missing in lobid.

In http://lobid.org/resources/HT020022241.json there is the collection name "cuvillier". Other collection names can be found in https://service-wiki.hbz-nrw.de/display/VDBE/Produktsigel+und+interne+Selektionskennzeichen

@acka47
Contributor

acka47 commented Jan 30, 2020

Thanks for the request. In the source data, collection information is in field 078, as in the example http://lobid.org/hbz01/HT020022241:

<datafield tag="078" ind1="e" ind2="1">
  <subfield code="a">cuvillier</subfield>
</datafield>

Currently, we only transform some dedicated collection information (NWBib, Edoweb, FRL, ZDB) to RDF. We will have to think about a way to add generic collection information, maybe with a bnode, so that the resulting JSON could look like this:

{
    "inCollection": [
        {
            "label": "cuvillier"
        }
    ]
}

However, before approaching this it would be nice to know how many different collections are named in 078 and what their tags are.
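The mapping proposed above can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual lobid transformation: the `<record>` wrapper element, the absence of XML namespaces, and the function name `collections_from_078` are assumptions made for the example.

```python
import json
import xml.etree.ElementTree as ET

# Sample fragment copied from the comment above; the <record> wrapper is an
# assumption so that the snippet is a well-formed XML document.
SAMPLE = """
<record>
  <datafield tag="078" ind1="e" ind2="1">
    <subfield code="a">cuvillier</subfield>
  </datafield>
</record>
"""


def collections_from_078(xml_text):
    """Return one {"label": ...} object per non-empty 078 $a subfield."""
    root = ET.fromstring(xml_text)
    entries = []
    for field in root.iter("datafield"):
        if field.get("tag") != "078":
            continue
        for sub in field.findall("subfield"):
            if sub.get("code") == "a" and sub.text and sub.text.strip():
                entries.append({"label": sub.text.strip()})
    return entries


result = {"inCollection": collections_from_078(SAMPLE)}
print(json.dumps(result, indent=4))
```

Real source records may use namespaces or additional subfields, in which case the element lookups would need adjusting.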

acka47 added this to Backlog in lobid board via automation Jan 30, 2020
acka47 moved this from Backlog to Ready in lobid board Jan 30, 2020
acka47 removed their assignment Feb 13, 2020
dr0i moved this from Ready to Working in lobid board Feb 25, 2020
@dr0i
Member

dr0i commented Feb 25, 2020

> However, before approaching this it would be nice to know how many different collections are named in 078 and what their tags are.

We could do this with ES aggs, but this would mean explicitly allowing aggregations over labels (meaning a new index config and a performance impact). Doing this on the 36k smallTest reveals just:

{
    "key" : "Zeitschriftendatenbank (ZDB)",
    "doc_count" : 163
},
{
    "key" : "eResource package",
    "doc_count" : 129
},
{
    "key" : "Nordrhein-Westfälische Bibliographie (NWBib)",
    "doc_count" : 123
},
{
    "key" : "Edoweb Rheinland-Pfalz",
    "doc_count" : 16
},
{
    "key" : "Elektronische Zeitschriftenbibliothek (EZB)",
    "doc_count" : 10
},
{
    "key" : "Fachrepositorium Lebenswissenschaften",
    "doc_count" : 10
},
{
    "key" : "Rheinland-Pfälzische Bibliographie",
    "doc_count" : 8
}
so you may still want to have a complete list on the whole index?

@acka47
Contributor

acka47 commented Mar 5, 2020

> However, before approaching this it would be nice to know how many different collections are named in 078 and what their tags are.

The list with all the collections is linked in the original issue comment:
https://service-wiki.hbz-nrw.de/display/VDBE/Produktsigel+und+interne+Selektionskennzeichen

Here are the 36 IDs that are not ISILs:

asmi
budri
caso
chiso
Cont
cuvillier
dawsonera
edso 
editlib
elgar
lyell
hade
hirzel
huguenots
iorm
kenso
learntechlib
Logos
mansi
misso
MPSO
mpig
NNg
obp
oso
pearson
philon
luther
smalib
minnso
uncso
vkal
vkv
vogel
wageningen
woodhead
wtm

Here is another proposal to model this in JSON-LD with a newly added identifier key that is mapped to dct:identifier. The advantage would be that we'd also have the label; a disadvantage is that we'd have to maintain an id->label map:

{
    "@context":{
        "@import":"http://lobid.org/resources/context.jsonld",
        "identifier":"http://purl.org/dc/terms/identifier"
    },
    "inCollection":[
        {
            "identifier":"cuvillier",
            "label":"Cuvillier-E-Books"
        }
    ]
}

@dr0i
Member

dr0i commented Mar 12, 2020

Hm, would it make sense to use something like this:

"inCollection":[
    {
        "id":"https://lobid.org/vocabs/cuvillier",
        "label":"Cuvillier-E-Books"
    }
]

With this approach we could use Etikett to set the label as we normally do. We could also provide some more data about the collections if we want. Disadvantage: we would have to enhance the vocabs.

dr0i removed their assignment Mar 12, 2020
@acka47
Contributor

acka47 commented Mar 12, 2020

+1, but I think we should then use lobid-resources URIs that – in this case – would not resolve:

"inCollection":[
    {
        "id":"https://lobid.org/resources/cuvillier",
        "label":"Cuvillier-E-Books"
    }
]

@acka47
Contributor

acka47 commented Mar 12, 2020

Decision after offline discussion:

"inCollection":[
    {
        "id":"https://lobid.org/collections#cuvillier",
        "label":"Cuvillier-E-Books"
    }
]

For now, these will not resolve but we could add a file at https://lobid.org/collections in the future if needed.
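To illustrate the agreed shape, here is a minimal Python sketch that builds an inCollection entry from a raw collection ID. It is not the actual transformation or label configuration used in lobid: the `LABELS` dictionary, the constant `COLLECTION_NAMESPACE`, and the function `collection_entry` are hypothetical names, and only the cuvillier label comes from this thread.

```python
COLLECTION_NAMESPACE = "https://lobid.org/collections#"

# Hypothetical excerpt of an id -> label map; only the cuvillier label
# appears in this thread.
LABELS = {"cuvillier": "Cuvillier-E-Books"}


def collection_entry(collection_id, labels=LABELS):
    """Build an inCollection object with a hash URI and, if known, a label."""
    entry = {"id": COLLECTION_NAMESPACE + collection_id}
    label = labels.get(collection_id)
    if label is not None:
        entry["label"] = label
    return entry


record = {
    "inCollection": [
        collection_entry("cuvillier"),
        collection_entry("dilibri"),  # no label known: the entry carries only the id
    ]
}
```

An ID without a known label still yields a usable entry, so new IDs in the source data would not need a code change, only a label added later.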

dr0i assigned dr0i and unassigned acka47 Mar 13, 2020
dr0i added a commit that referenced this issue Mar 13, 2020
@dr0i
Member

dr0i commented Mar 13, 2020

There seem to be more than just 36 IDs, e.g. dilibri. After complete indexing we can obtain a list of them by querying the API.

dr0i moved this from Working to Review in lobid board Mar 13, 2020
dr0i assigned acka47 and unassigned dr0i Mar 13, 2020
acka47 added a commit that referenced this issue Mar 13, 2020
@dr0i
Member

dr0i commented Mar 13, 2020

I'm a little confused about why you removed the dilibri label. If it's not an e-book collection at all (which seems possible, see the definition of dilibri), it should be given a better name than the "Dilibri E-Book" I gave it. Worse, if it's not a collection at all (which IMO it is), it should be filtered out completely in the morph. For now it is subsumed under inCollection without a proper label.

dr0i added a commit that referenced this issue Mar 13, 2020
@dr0i
Member

dr0i commented Mar 17, 2020

In production. Getting this list of ids with counts:

https://lobid.org/collections#NLZ: 491824
https://lobid.org/collections#ldd: 132898
https://lobid.org/collections#vl-ulbd: 64608
https://lobid.org/collections#Springer: 60456
https://lobid.org/collections#vl-ulbms: 22128
https://lobid.org/collections#dilibri: 8249
https://lobid.org/collections#s2w-zbmed: 6139
https://lobid.org/collections#s2w-retropadubpb: 4042
https://lobid.org/collections#GBV-1-NEF: 3392
https://lobid.org/collections#s2w-ulbbonn: 3241
https://lobid.org/collections#s2w-hsspadubpb: 2604
https://lobid.org/collections#vd18: 1666
https://lobid.org/collections#vl-ddbk: 1439
https://lobid.org/collections#s2w-llbdetmold: 1371
https://lobid.org/collections#GBV-1-NEL: 1001
https://lobid.org/collections#s2w-ulbbonndfg: 947
https://lobid.org/collections#Lizenz2009: 913
https://lobid.org/collections#Lizenz2008: 739
https://lobid.org/collections#dawsonera: 669
https://lobid.org/collections#Lizenz2010: 566
https://lobid.org/collections#rez: 546
https://lobid.org/collections#lyell: 475
https://lobid.org/collections#Cont: 443
https://lobid.org/collections#taylor francis: 369
https://lobid.org/collections#Lizenz2011: 321
https://lobid.org/collections#Lizenz2014: 311
https://lobid.org/collections#Lizenz2016: 291
https://lobid.org/collections#Lizenz2012: 226
https://lobid.org/collections#Lizenz2013: 211
https://lobid.org/collections#wbv: 207
https://lobid.org/collections#BeltzLizenz2016: 203
https://lobid.org/collections#BeltzLizenz2017: 187
https://lobid.org/collections#BeltzLizenz2015: 168
https://lobid.org/collections#Lizenz2017: 156
https://lobid.org/collections#thieref: 152
https://lobid.org/collections#mansi: 149
https://lobid.org/collections#Lizenz2018: 141
https://lobid.org/collections#V&RELibraryLizenz2014: 135
https://lobid.org/collections#budri: 134
https://lobid.org/collections#fzo: 131
https://lobid.org/collections#luther: 128
https://lobid.org/collections#elgar: 126
https://lobid.org/collections#Lizenz2015: 123
https://lobid.org/collections#huguenots: 121
https://lobid.org/collections#bloomsbury2016: 118
https://lobid.org/collections#juris: 112
https://lobid.org/collections#bloomsbury2014: 103
https://lobid.org/collections#bloomsbury2015: 101
https://lobid.org/collections#BeltzLizenz2018: 98
https://lobid.org/collections#BeltzLizenz2019: 95
https://lobid.org/collections#BeltzLizenz2014: 92
https://lobid.org/collections#bloomsbury2017: 91
https://lobid.org/collections#bloomsbury2013: 80
https://lobid.org/collections#vogel: 80
https://lobid.org/collections#KohlhammerLizenz2014: 78
https://lobid.org/collections#BeltzLizenz2013: 77
https://lobid.org/collections#smalib: 74
https://lobid.org/collections#pearson: 73
https://lobid.org/collections#mpig: 72
https://lobid.org/collections#Lizenz2019: 69
https://lobid.org/collections#igi global: 68
https://lobid.org/collections#synthesis lectures: 56
https://lobid.org/collections#V&RELibraryLizenz2017: 53
https://lobid.org/collections#MohrSiebeckLizenz2018: 51
https://lobid.org/collections#V&RELibraryLizenz2016: 50
https://lobid.org/collections#WallsteinLizenz2019: 48
https://lobid.org/collections#KohlhammerLizenz2016: 44
https://lobid.org/collections#oso: 39
https://lobid.org/collections#vkal: 39
https://lobid.org/collections#BeltzLizenz2012: 38
https://lobid.org/collections#wageningen: 38
https://lobid.org/collections#KohlhammerLizenz2018: 36
https://lobid.org/collections#WBGLizenz2017: 35
https://lobid.org/collections#KohlhammerLizenz2019: 32
https://lobid.org/collections#learntechlib: 31
https://lobid.org/collections#KohlhammerLizenz2013: 30
https://lobid.org/collections#melanchthon: 30
https://lobid.org/collections#V&RELibraryLizenz2018: 29
https://lobid.org/collections#beofamilien: 28
https://lobid.org/collections#vkv: 26
https://lobid.org/collections#beozivil: 24
https://lobid.org/collections#WallsteinLizenz2018: 23
https://lobid.org/collections#woodhead: 20
https://lobid.org/collections#KohlhammerLizenz2017: 16
https://lobid.org/collections#Lizenz2007: 14
https://lobid.org/collections#WBGLizenz2019: 14
https://lobid.org/collections#chiso: 13
https://lobid.org/collections#cuvillier: 13
https://lobid.org/collections#WBGLizenz2016: 10
https://lobid.org/collections#WBGLizenz2018: 9
https://lobid.org/collections#MPSO: 8
https://lobid.org/collections#hade: 8
https://lobid.org/collections#MohrSiebeckLizenz2013-2015: 7
https://lobid.org/collections#MohrSiebeckLizenz2019: 5
https://lobid.org/collections#caso: 5
https://lobid.org/collections#obp: 5
https://lobid.org/collections#Logos: 4
https://lobid.org/collections#Lizenz2005: 3
https://lobid.org/collections#beosteuer: 3
https://lobid.org/collections#iorm: 3
https://lobid.org/collections#uncso: 3
https://lobid.org/collections#TTP-MCE: 2
https://lobid.org/collections#asmi: 2
https://lobid.org/collections#beoarbeit: 2
https://lobid.org/collections#edso: 2
https://lobid.org/collections#minnso: 2
https://lobid.org/collections#misso: 2
https://lobid.org/collections#mso: 2
https://lobid.org/collections#s2w-hsspadmindest: 2
https://lobid.org/collections#KohlhammerLizenz2015: 1
https://lobid.org/collections#Lizenz2001: 1
https://lobid.org/collections#Lizenz2004: 1
https://lobid.org/collections#Lizenz2006: 1
https://lobid.org/collections#NNg: 1
https://lobid.org/collections#kenso: 1
https://lobid.org/collections#palgraveoa: 1

by doing:

curl -XGET  'http://weywot3.hbz-nrw.de:9200/resources/_search?q=inCollection.id:*collections*&pretty=true' -d '
{
  "size": 0,
  "aggs": {
          "aggs1": {
              "terms": {
                "field": "inCollection.id",
                "size": 11350
              }
          }
        }
}
' | paste - - |grep -v "{" |grep -v "}"| sed 's#.*"key" : "##g' |sed 's#\(.*\)",.*"doc_count" \(.*\)#\1\2#g' | grep  collections
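The paste/grep/sed post-processing can also be done on the parsed JSON. A minimal Python sketch, assuming the response has the usual terms-aggregation shape (an "aggregations" object containing "aggs1" with a "buckets" list); the sample response is hand-written, with two counts copied from the list above, and the function name `collection_counts` is made up for illustration.

```python
import json

# Shortened, hand-written response in the shape of a terms aggregation named
# "aggs1"; the two counts are copied from the list above.
RESPONSE = """
{
  "aggregations": {
    "aggs1": {
      "buckets": [
        {"key": "https://lobid.org/collections#NLZ", "doc_count": 491824},
        {"key": "https://lobid.org/collections#cuvillier", "doc_count": 13}
      ]
    }
  }
}
"""


def collection_counts(response_text, agg_name="aggs1"):
    """Return (key, doc_count) pairs for buckets whose key mentions 'collections'."""
    buckets = json.loads(response_text)["aggregations"][agg_name]["buckets"]
    return [(b["key"], b["doc_count"]) for b in buckets if "collections" in b["key"]]


for key, count in collection_counts(RESPONSE):
    print(f"{key}: {count}")
```

Working on the parsed buckets avoids depending on how the pretty-printed output happens to be split across lines.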

@acka47
Contributor

acka47 commented Mar 17, 2020

Thanks @dr0i, I will look into adding some more labels to the labels.json.

@acka47
Contributor

acka47 commented Mar 17, 2020

I don't like the URI with a space in it: "https://lobid.org/collections#synthesis lectures" Could you just remove spaces during the transformation, @dr0i?

Also https://lobid.org/collections#taylor francis.

@acka47
Contributor

acka47 commented Mar 17, 2020

There are also some with & in them, e.g. https://lobid.org/collections#V&RELibraryLizenz2014. It is probably no problem after the hash, but we might remove those as well...
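For reference, RFC 3986 does not allow a raw space anywhere in a URI, while "&" is permitted inside a fragment. The two obvious options for the IDs with spaces, removing them or percent-encoding them, could look like this in Python. This is a sketch only; the function names are hypothetical and nothing here reflects how the transformation is actually implemented.

```python
from urllib.parse import quote

BASE = "https://lobid.org/collections#"


def strip_spaces(collection_id):
    """Option 1: drop spaces from the ID before building the hash URI."""
    return BASE + collection_id.replace(" ", "")


def percent_encode(collection_id):
    """Option 2: percent-encode characters that are not allowed in a fragment.

    "&" is listed as safe because RFC 3986 permits it inside a fragment.
    """
    return BASE + quote(collection_id, safe="&")


for raw in ("synthesis lectures", "taylor francis", "V&RELibraryLizenz2014"):
    print(strip_spaces(raw), percent_encode(raw))
```

Stripping changes the identifier itself, whereas percent-encoding keeps it recoverable with `urllib.parse.unquote`.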

@dr0i
Member

dr0i commented Mar 17, 2020

Is this really necessary? Using hash URIs is good because, as you've noted, these URLs don't cause any problems (at least not in indexing, displaying, querying, or (not) resolving). Also, we would have a "natural" way of retrieving/building these URLs. Maybe we should concentrate on real use cases. Are there any?

acka47 added a commit that referenced this issue Mar 17, 2020
@hagbeck
Collaborator Author

hagbeck commented Mar 17, 2020

It seems to be OK. I think there is nobody who can verify the counts, so we will see in practice whether it's complete. Our examples are OK.

Many thanks!

@acka47
Contributor

acka47 commented Mar 18, 2020

Thanks for the feedback, @hagbeck. We will close this issue as soon as #1062 is deployed.

@acka47
Contributor

acka47 commented Mar 19, 2020

Closing, as #1062 is deployed to production.
