Missing collections #32

Closed
fsteeg opened this issue Oct 17, 2016 · 9 comments · Fixed by #46

@fsteeg
Member

fsteeg commented Oct 17, 2016

Missing collections:

But they have associated titles:

@fsteeg
Member Author

fsteeg commented Oct 21, 2016

This is weird. The old website lists 146 collections [1] and we have 147 (the additional one is probably our own supercollection), so this does not fit with 9 missing collections. Maybe these are wrong collection IDs? In the old system, the collections have numerical IDs like the resources (see the links in [1]), while we use the content of 992 a as the ID. That's a custom field, right? @dr0i, do you remember why you picked that?

The collections have a 024 field, which sounds like the thing to use for the ID [2]:

<marc:datafield ind1="8" ind2=" " tag="024">
    <marc:subfield code="a">oai:digitalisiertedrucke.de:46713</marc:subfield>
    <marc:subfield code="p">collection</marc:subfield>
</marc:datafield>

Would it make sense to use that number as the ID? It would yield URIs like http://digitalisiertedrucke/collections/46713 for collections, instead of the current http://digitalisiertedrucke/collections/feldzeitungen.ub.hd.de. Don't know if that would solve the issue here, but it seems more correct to me. It might also fix some other collection-related issues we have.
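To make the proposal concrete, here is a minimal sketch of deriving such a URI from the 024 field. It assumes the `marc:` prefix is bound to the MARC 21 slim namespace (the snippet above omits the declaration), and both the function name and the base URL are hypothetical, modelled on the URIs quoted in this comment.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Assumption: the marc: prefix refers to the MARC 21 slim namespace.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}
# Assumption: hypothetical base URL, modelled on the URIs quoted above.
BASE_URL = "http://digitalisiertedrucke.de/collections/"

# The 024 field from this comment, with the namespace declaration added
# so that it can be parsed on its own.
SAMPLE_024 = """
<marc:datafield xmlns:marc="http://www.loc.gov/MARC21/slim"
                ind1="8" ind2=" " tag="024">
    <marc:subfield code="a">oai:digitalisiertedrucke.de:46713</marc:subfield>
    <marc:subfield code="p">collection</marc:subfield>
</marc:datafield>
"""


def collection_uri_from_024(datafield_xml: str) -> Optional[str]:
    """Build a URI from the trailing number of subfield a, if there is one."""
    field = ET.fromstring(datafield_xml)
    sub_a = field.find("marc:subfield[@code='a']", MARC_NS)
    if sub_a is None or not sub_a.text:
        return None
    # "oai:digitalisiertedrucke.de:46713" -> "46713"
    numeric_id = sub_a.text.strip().rsplit(":", 1)[-1]
    if not numeric_id.isdigit():
        return None
    return BASE_URL + numeric_id


print(collection_uri_from_024(SAMPLE_024))
```

Whether the number in subfield a really identifies the collection would still need to be checked against the data.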

[1] http://web.archive.org/web/20130526203147/http://www.digitalisiertedrucke.de/collection/Sammlungsbeschreibungen?ln=de
[2] https://www.loc.gov/marc/bibliographic/bd024.html

@fsteeg fsteeg assigned acka47 and dr0i and unassigned fsteeg Oct 21, 2016
@acka47
Contributor

acka47 commented Oct 21, 2016

+1 for using the ID from 024 for collections.

@fsteeg
Member Author

fsteeg commented Oct 21, 2016

After some further discussion with @acka47, we figured out that field 001 actually contains the ID.

I had tested this with our current system and we get a resource with the same ID as the collection (see http://beta.digitalisiertedrucke.de/resources/46713), but that seems to be due to some kind of error, since the 001 IDs are actually unique:

$ cat hbz_zvdd_resource_marc.xml | grep "tag=\"001\"" | wc -l
491271
$ cat hbz_zvdd_resource_marc.xml | grep "tag=\"001\"" | sort -u | wc -l
491271
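The same check can be sketched in Python if the duplicate values themselves are of interest, not just the counts. This is an illustration only: it assumes the 001 fields are `controlfield` elements in the MARC 21 slim namespace, and the function name is made up. Unlike `sort -u` on whole lines, it compares the field values.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: 001 is stored as a controlfield in the MARC 21 slim namespace.
CONTROLFIELD = "{http://www.loc.gov/MARC21/slim}controlfield"


def count_001_ids(source):
    """Return (total, unique, duplicates) for all 001 control fields.

    `source` is a file name or a binary file object containing MARCXML.
    """
    counts = Counter()
    # Stream through the file so a large dump is not loaded as one tree.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == CONTROLFIELD and elem.get("tag") == "001":
            counts[(elem.text or "").strip()] += 1
        elem.clear()  # drop text and children we no longer need
    total = sum(counts.values())
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    return total, len(counts), duplicates


# Tiny made-up sample with one repeated ID, to show the return value.
SAMPLE = b"""<marc:collection xmlns:marc="http://www.loc.gov/MARC21/slim">
  <marc:record><marc:controlfield tag="001">46713</marc:controlfield></marc:record>
  <marc:record><marc:controlfield tag="001">46714</marc:controlfield></marc:record>
  <marc:record><marc:controlfield tag="001">46713</marc:controlfield></marc:record>
</marc:collection>"""

print(count_001_ids(io.BytesIO(SAMPLE)))
```

For the real dump, one would call `count_001_ids("hbz_zvdd_resource_marc.xml")` and expect the first two numbers to be equal and the list of duplicates to be empty.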

With this approach, we can also reconstruct the old URLs like
http://digitalisiertedrucke.de/record/46713 (I'll open a separate issue for that).

@fsteeg fsteeg mentioned this issue Oct 21, 2016
@fsteeg fsteeg assigned fsteeg and unassigned acka47 and dr0i Oct 21, 2016
@fsteeg fsteeg added working and removed ready labels Oct 21, 2016
@dr0i
Member

dr0i commented Oct 21, 2016

sounds reasonable. +1

fsteeg added a commit that referenced this issue Oct 21, 2016
fsteeg added a commit that referenced this issue Oct 24, 2016
@fsteeg
Member Author

fsteeg commented Oct 24, 2016

So here's the problem: in the 024 field quoted in #32 (comment), the ID in subfield a is not the ID of the collection indicated by subfield p, but the ID of the title resource itself. See title MAB-MXL in [1].

So it seems the isPartOf relationship is only expressed in our data with the IDs we used before. I guess this was the reason @dr0i used that as the ID when he originally wrote the transformation.

We could either stick with the current IDs, or implement a mapping from these to the numerical ones.

This affects #45 (restoring old URLs).

What do you think @acka47?

[1] https://github.com/hbz/digitalisiertedrucke/wiki/Example-resources

@fsteeg fsteeg assigned acka47 and unassigned fsteeg Oct 24, 2016
@fsteeg fsteeg added review and removed working labels Oct 24, 2016
@acka47
Contributor

acka47 commented Oct 25, 2016

@dr0i and I just sat down to understand this. It seems it was decided at the end of digitalisiertedrucke not to add any collection descriptions, but only to create collections by linking records together. That's why 8-9 collections are missing. We already created rudimentary descriptions for these with https://github.com/lobid/lodmill/blob/master/lodmill-rd/transformations/zvdd/data/missing_collections.ttl. These should just be added to the data.

For two collections there is no description, though: collection:einblattdrucke_vd17.gbv.goe.de & collections:zvdd.hbz.k.de. The first is probably an error and we should use collection:einblattdrucke.vd17.oo.de instead (see also https://github.com/lobid/lodmill/blob/master/lodmill-rd/transformations/zvdd/statistic/mismatching_collection-IDs.textile). The second is the super collection for all other collections. I think we can completely discard it in the UI.

Regarding the identifiers, we should keep using the literal ones for collections and look into using the /record/$ID pattern for single resources. That way, we would regain the old URLs for most resources.
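A rough sketch of what this could look like in code, assuming the `/collections/` and `/record/` URL patterns mentioned earlier in this thread. All function and constant names are hypothetical, and the alias only encodes the correction described above as probable.

```python
# Assumption: base URL modelled on the old URLs quoted in this thread.
BASE = "http://digitalisiertedrucke.de"

# Probable mismatch described above: the ID used in the data on the left,
# the ID we would use instead on the right.
COLLECTION_ALIASES = {
    "einblattdrucke_vd17.gbv.goe.de": "einblattdrucke.vd17.oo.de",
}

# Super collection that would be discarded in the UI.
HIDDEN_COLLECTIONS = {"zvdd.hbz.k.de"}


def normalize_collection_id(collection_id: str) -> str:
    """Map a known mismatching collection ID to the one we want to use."""
    return COLLECTION_ALIASES.get(collection_id, collection_id)


def is_visible(collection_id: str) -> bool:
    """True unless the collection is one we decided not to show."""
    return normalize_collection_id(collection_id) not in HIDDEN_COLLECTIONS


def collection_url(collection_id: str) -> str:
    """Collections keep their literal IDs."""
    return f"{BASE}/collections/{normalize_collection_id(collection_id)}"


def record_url(record_id: str) -> str:
    """Single resources use the old /record/$ID pattern."""
    return f"{BASE}/record/{record_id}"


print(collection_url("einblattdrucke_vd17.gbv.goe.de"))
print(record_url("46713"))
```

Keeping the alias and the hidden set as plain data means further mismatching IDs from the statistics file could be added without touching the functions.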

@fsteeg fsteeg added review and removed working labels Oct 31, 2016
@fsteeg fsteeg assigned acka47 and unassigned fsteeg Oct 31, 2016
fsteeg added a commit that referenced this issue Oct 31, 2016
@acka47
Contributor

acka47 commented Nov 2, 2016

+1
