Get GND labels from base data #139
Subject headings and preferred labels are in 902, alternate labels in 952. To find out which type a GND entity has, you have to look at the indicator of 902. From the MAB documentation:
|
Example 1 (without contributor and with only one subject heading type): http://lobid.org/resource/HT010726584 Desired outcome is to have the preferred names, as usual, associated with the GND objects and the alternate names along with the preferred names in field {
"@graph" : [ {
"@id" : "http://d-nb.info/gnd/4046259-6",
"preferredName" : "Plasmaphysik",
"preferredNameForTheSubjectHeading" : "Plasmaphysik"
}, {
"@id" : "http://d-nb.info/gnd/4067488-5",
"preferredName" : "Zeitschrift",
"preferredNameForTheSubjectHeading" : "Zeitschrift"
}, {
"@id" : "http://d-nb.info/gnd/4511937-5",
"preferredName" : "Online-Publikation",
"preferredNameForTheSubjectHeading" : "Online-Publikation"
}, {
"@id" : "http://dewey.info/class/530/",
"prefLabel" : [ {
"@language" : "en",
"@value" : "Physics"
}, {
"@language" : "de",
"@value" : "Physik"
} ]
}, {
"@id" : "http://lobid.org/resource/HT010726584",
...
"subject" : [ "http://d-nb.info/gnd/4067488-5", "http://dewey.info/class/530/", "http://d-nb.info/gnd/4046259-6", "http://d-nb.info/gnd/4511937-5" ],
"subjectLabel" : [ "On-line-Dokument", "Online-Dokument", "On-line-Publikation", "Online-Ressource", "Computerdatei im Fernzugriff (Formschlagwort)", "Netzpublikation", "Zeitschriften", "Online-Datenbank (Formschlagwort)", "Periodikum", "On-line-Datenbank (Formschlagwort)" ],
...
} ]
...
} Aleph XML (snippet): ...
<datafield tag="902" ind1="-" ind2="1">
<subfield code="s">Plasmaphysik</subfield>
<subfield code="9">(DE-588)4046259-6</subfield>
</datafield>
<datafield tag="902" ind1="-" ind2="1">
<subfield code="s">Zeitschrift</subfield>
<subfield code="9">(DE-588)4067488-5</subfield>
</datafield>
<datafield tag="902" ind1="-" ind2="1">
<subfield code="s">Online-Publikation</subfield>
<subfield code="9">(DE-588)4511937-5</subfield>
</datafield>
...
<datafield tag="952" ind1="-" ind2="1">
<subfield code="s">Computerdatei im Fernzugriff</subfield>
<subfield code="h">Formschlagwort</subfield>
</datafield>
<datafield tag="952" ind1="-" ind2="1">
<subfield code="s">Online-Datenbank</subfield>
<subfield code="h">Formschlagwort</subfield>
</datafield>
<datafield tag="952" ind1="-" ind2="1">
<subfield code="s">Online-Dokument</subfield>
</datafield>
<datafield tag="952" ind1="-" ind2="1">
<subfield code="s">On-line-Datenbank</subfield>
<subfield code="h">Formschlagwort</subfield>
</datafield>
<datafield tag="952" ind1="-" ind2="1">
<subfield code="s">On-line-Dokument</subfield>
</datafield>
<datafield tag="952" ind1="-" ind2="1">
<subfield code="s">Online-Ressource</subfield>
</datafield>
<datafield tag="952" ind1="-" ind2="1">
<subfield code="s">On-line-Publikation</subfield>
</datafield>
<datafield tag="952" ind1="-" ind2="1">
<subfield code="s">Netzpublikation</subfield>
</datafield> The implementation looks quite straightforward. For |
Example 2 (with corporate body as contributor and three different types of subject headings): http://lobid.org/resource/HT013077595/about Desired outcome: {
"@graph" : [ {
"@id" : "http://d-nb.info/gnd/109490312",
"preferredName" : "Boer, Hans-Peter",
"preferredNameForThePerson" : "Boer, Hans-Peter"
}, {
"@id" : "http://d-nb.info/gnd/11079267X",
"preferredName" : "Balke, Kirsten",
"preferredNameForThePerson" : "Balke, Kirsten"
}, {
"@id" : "http://d-nb.info/gnd/128755-2",
"preferredName" : "Kreisheimatverein <Coesfeld>",
"preferredNameForTheCorporateBody" : "Kreisheimatverein <Coesfeld>"
}, {
"@id" : "http://d-nb.info/gnd/4010355-9",
"preferredName" : "Coesfeld",
"preferredNameForThePlaceOrGeographicName" : "Coesfeld"
}, {
"@id" : "http://d-nb.info/gnd/4010356-0",
"preferredName" : "Kreis Coesfeld",
"preferredNameForThePlaceOrGeographicName" : "Kreis Coesfeld"
}, {
"@id" : "http://d-nb.info/gnd/4024116-6",
"preferredName" : "Heimatkundeunterricht",
"preferredNameForTheSubjectHeading" : "Heimatkundeunterricht"
}, {
"@id" : "http://lobid.org/resource/HT013077595",
"contributorLabel" : [ "Balke, Kirsten", "Boer, Hans Peter", "Boer, Hans-Peter" ],
"subjectLabel" : [ "Coesfeld. Hauptamt", "Landkreis Coesfeld", "Kreis Coesfeld. Kreistag", "Kreis Coesfeld. Hauptamt", "Kosfel'd", "Kreis Coesfeld. Oberkreisdirektor", "Coesfeld (Kreis)", "Kreis Coesfeld. Landrat", "Landrat (Kreis Coesfeld)", "Oberkreisdirektor (Kreis Coesfeld)", "Kreisverwaltung (Kreis Coesfeld)", "Kreistag (Kreis Coesfeld)", "Heimatkunde (Unterricht)", "Hauptamt (Kreis Coesfeld)", "Heimatkundedidaktik", "Stadtdirektor (Coesfeld)", "Pressestelle (Coesfeld)", "Hauptamt (Coesfeld)", "Coesfeld. Pressestelle", "Coesfeld. Stadtdirektor", "Heimatkunde / Didaktik", "Stadt Coesfeld", "Kreis Coesfeld. Kreisverwaltung" ],
"contributor" : [ "http://d-nb.info/gnd/11079267X", "http://d-nb.info/gnd/128755-2", "http://d-nb.info/gnd/109490312" ],
"subject" : [ "http://d-nb.info/gnd/4010355-9", "http://d-nb.info/gnd/4024116-6", "http://d-nb.info/gnd/4010356-0" ],
"subjectChain" : [ "Coesfeld | Heimatkundeunterricht | Lehrmittel", "Kreis Coesfeld | Heimatkundeunterricht | Lehrmittel (213)", "Kreis Coesfeld | Heimatkundeunterricht | Lehrmittel", "Coesfeld | Heimatkundeunterricht | Lehrmittel (213)" ],
...
}]
...
} Source data (snippet): <datafield tag="104" ind1="b" ind2="1">
<subfield code="p">Boer, Hans-Peter</subfield>
<subfield code="d">1949-</subfield>
<subfield code="b">[Red.]</subfield>
<subfield code="9">(DE-588)109490312</subfield>
</datafield>
<datafield tag="105" ind1="-" ind2="1">
<subfield code="p">Boer, Hans Peter</subfield>
<subfield code="d">1949-</subfield>
</datafield>
<datafield tag="200" ind1="b" ind2="1">
<subfield code="k">Kreisheimatverein</subfield>
<subfield code="h">Coesfeld</subfield>
<subfield code="9">(DE-588)128755-2</subfield>
</datafield>
<datafield tag="331" ind1="-" ind2="1">
<subfield code="a">Geschichte hier</subfield>
</datafield>
...
<datafield tag="902" ind1="-" ind2="1">
<subfield code="g">Coesfeld</subfield>
<subfield code="9">(DE-588)4010355-9</subfield>
</datafield>
<datafield tag="902" ind1="-" ind2="1">
<subfield code="s">Heimatkundeunterricht</subfield>
<subfield code="9">(DE-588)4024116-6</subfield>
</datafield>
<datafield tag="902" ind1="-" ind2="1">
<subfield code="f">Lehrmittel</subfield>
</datafield>
...
<datafield tag="902" ind1="-" ind2="1">
<subfield code="s">Heimatkundeunterricht</subfield>
<subfield code="9">(DE-588)4024116-6</subfield>
</datafield>
<datafield tag="902" ind1="-" ind2="1">
<subfield code="f">Lehrmittel</subfield>
</datafield>
<datafield tag="903" ind1="-" ind2="1">
<subfield code="a">213</subfield>
</datafield>
<datafield tag="907" ind1="-" ind2="1">
<subfield code="g">Kreis Coesfeld</subfield>
<subfield code="9">(DE-588)4010356-0</subfield>
</datafield>
<datafield tag="907" ind1="-" ind2="1">
<subfield code="s">Heimatkundeunterricht</subfield>
<subfield code="9">(DE-588)4024116-6</subfield>
</datafield>
<datafield tag="907" ind1="-" ind2="1">
<subfield code="f">Lehrmittel</subfield>
</datafield>
<datafield tag="908" ind1="-" ind2="1">
<subfield code="a">213</subfield>
</datafield>
<controlfield tag="SYS">011404221</controlfield>
<datafield tag="LOW" ind1="-" ind2="1">
<subfield code="a">M0001</subfield>
</datafield>
<datafield tag="LOW" ind1="-" ind2="1">
<subfield code="a">M1168</subfield>
</datafield>
<datafield tag="952" ind1="-" ind2="1">
<subfield code="k">Coesfeld</subfield>
<subfield code="b">Hauptamt</subfield>
</datafield>
<datafield tag="952" ind1="-" ind2="1">
<subfield code="k">Hauptamt</subfield>
<subfield code="h">Coesfeld</subfield>
</datafield>
<datafield tag="952" ind1="-" ind2="1">
<subfield code="k">Coesfeld</subfield>
<subfield code="b">Stadtdirektor</subfield>
</datafield>
<datafield tag="952" ind1="-" ind2="1">
<subfield code="k">Stadtdirektor</subfield>
<subfield code="h">Coesfeld</subfield>
</datafield>
<datafield tag="952" ind1="-" ind2="1">
<subfield code="k">Coesfeld</subfield>
<subfield code="b">Pressestelle</subfield>
</datafield>
<datafield tag="952" ind1="-" ind2="1">
<subfield code="k">Pressestelle</subfield>
<subfield code="h">Coesfeld</subfield>
</datafield>
<datafield tag="952" ind1="-" ind2="1">
<subfield code="g">Kosfel'd</subfield>
</datafield>
<datafield tag="952" ind1="-" ind2="1">
<subfield code="g">Stadt Coesfeld</subfield>
</datafield>
<datafield tag="952" ind1="-" ind2="1">
<subfield code="s">Heimatkunde</subfield>
<subfield code="h">Unterricht</subfield>
</datafield>
<datafield tag="952" ind1="-" ind2="1">
<subfield code="s">Heimatkunde</subfield>
<subfield code="x">Didaktik</subfield>
</datafield>
<datafield tag="952" ind1="-" ind2="1">
<subfield code="s">Heimatkundedidaktik</subfield>
</datafield>
<datafield tag="957" ind1="-" ind2="1">
<subfield code="k">Kreis Coesfeld</subfield>
<subfield code="b">Oberkreisdirektor</subfield>
</datafield>
<datafield tag="957" ind1="-" ind2="1">
<subfield code="k">Oberkreisdirektor</subfield>
<subfield code="h">Kreis Coesfeld</subfield>
</datafield>
<datafield tag="957" ind1="-" ind2="1">
<subfield code="k">Kreis Coesfeld</subfield>
<subfield code="b">Kreistag</subfield>
</datafield>
<datafield tag="957" ind1="-" ind2="1">
<subfield code="k">Kreistag</subfield>
<subfield code="h">Kreis Coesfeld</subfield>
</datafield>
<datafield tag="957" ind1="-" ind2="1">
<subfield code="k">Kreis Coesfeld</subfield>
<subfield code="b">Hauptamt</subfield>
</datafield>
<datafield tag="957" ind1="-" ind2="1">
<subfield code="k">Hauptamt</subfield>
<subfield code="h">Kreis Coesfeld</subfield>
</datafield>
<datafield tag="957" ind1="-" ind2="1">
<subfield code="k">Kreis Coesfeld</subfield>
<subfield code="b">Landrat</subfield>
</datafield>
<datafield tag="957" ind1="-" ind2="1">
<subfield code="k">Landrat</subfield>
<subfield code="h">Kreis Coesfeld</subfield>
</datafield>
<datafield tag="957" ind1="-" ind2="1">
<subfield code="k">Kreis Coesfeld</subfield>
<subfield code="b">Kreisverwaltung</subfield>
</datafield>
<datafield tag="957" ind1="-" ind2="1">
<subfield code="k">Kreisverwaltung</subfield>
<subfield code="h">Kreis Coesfeld</subfield>
</datafield>
<datafield tag="957" ind1="-" ind2="1">
<subfield code="g">Landkreis Coesfeld</subfield>
</datafield>
<datafield tag="957" ind1="-" ind2="1">
<subfield code="g">Coesfeld</subfield>
<subfield code="h">Kreis</subfield>
</datafield>
<datafield tag="957" ind1="-" ind2="1">
<subfield code="s">Heimatkunde</subfield>
<subfield code="h">Unterricht</subfield>
</datafield>
<datafield tag="957" ind1="-" ind2="1">
<subfield code="s">Heimatkunde</subfield>
<subfield code="x">Didaktik</subfield>
</datafield>
<datafield tag="957" ind1="-" ind2="1">
<subfield code="s">Heimatkundedidaktik</subfield>
</datafield> |
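The way the alternate labels in Example 2 are assembled from the 952/957 subfields can be sketched as follows. This is only a minimal sketch: the joining rules (`$h` in parentheses, `$b` after a period, `$x` after a slash) are read off the desired `subjectLabel` values above and are an assumption, not taken from the MAB documentation. The function names and the `<record>` wrapper element are hypothetical.

```python
import xml.etree.ElementTree as ET

# Joining rules inferred from Example 2 (assumption, not from the MAB docs).
SUFFIX_FORMATS = {
    "h": " ({})",  # qualifier, e.g. "Heimatkunde (Unterricht)"
    "b": ". {}",   # subordinate unit, e.g. "Coesfeld. Hauptamt"
    "x": " / {}",  # subdivision, e.g. "Heimatkunde / Didaktik"
}


def alternate_label(datafield):
    """Build one label string from the subfields of a single datafield."""
    label = ""
    for subfield in datafield.findall("subfield"):
        code = subfield.get("code")
        value = (subfield.text or "").strip()
        if not label:
            # The first subfield ($s, $g, $k, ...) carries the main term.
            label = value
        elif code in SUFFIX_FORMATS:
            label += SUFFIX_FORMATS[code].format(value)
    return label


def subject_labels(record_xml, tags=("952", "957")):
    """Collect the de-duplicated alternate labels of a record, in document order."""
    root = ET.fromstring(record_xml)
    labels = []
    for datafield in root.iter("datafield"):
        if datafield.get("tag") in tags:
            label = alternate_label(datafield)
            if label and label not in labels:
                labels.append(label)
    return labels


# Hypothetical test record built from fields shown in Example 2.
SAMPLE = """<record>
<datafield tag="952" ind1="-" ind2="1">
<subfield code="k">Coesfeld</subfield>
<subfield code="b">Hauptamt</subfield>
</datafield>
<datafield tag="952" ind1="-" ind2="1">
<subfield code="k">Hauptamt</subfield>
<subfield code="h">Coesfeld</subfield>
</datafield>
<datafield tag="952" ind1="-" ind2="1">
<subfield code="s">Heimatkunde</subfield>
<subfield code="x">Didaktik</subfield>
</datafield>
<datafield tag="957" ind1="-" ind2="1">
<subfield code="g">Coesfeld</subfield>
<subfield code="h">Kreis</subfield>
</datafield>
<datafield tag="957" ind1="-" ind2="1">
<subfield code="s">Heimatkunde</subfield>
<subfield code="x">Didaktik</subfield>
</datafield>
</record>"""

if __name__ == "__main__":
    for label in subject_labels(SAMPLE):
        print(label)
```

The real transformation is done in morph; the Python here only illustrates the expected mapping from source fields to label strings and could serve as a reference when writing tests.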
As we currently do, we should record the preferred name in the RDF using both the general and the more specific property, e.g.:
"@id" : "http://d-nb.info/gnd/4076769-3",
"preferredName" : "Römerzeit",
"preferredNameForTheSubjectHeading" : "Römerzeit"
Mapping the subfields from #139 (comment) to RDF properties, or rather their JSON object keys: p: |
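Recording both the general and the specific property could be sketched like this. It is a minimal sketch under assumptions: the subfield-to-property table covers only the four codes whose mapping is visible in the two examples above (`p`, `k`, `g`, `s`), and the function name, the default tag list and the `<record>` wrapper are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Subfield code -> specific property, as seen in Examples 1 and 2 (assumption:
# other subfield codes exist but are not covered here).
SPECIFIC_PROPERTY = {
    "p": "preferredNameForThePerson",
    "k": "preferredNameForTheCorporateBody",
    "g": "preferredNameForThePlaceOrGeographicName",
    "s": "preferredNameForTheSubjectHeading",
}

# Subfield 9 holds the GND identifier with a "(DE-588)" prefix.
GND_ID = re.compile(r"^\(DE-588\)(.+)$")


def gnd_nodes(record_xml, tags=("902", "907")):
    """Return one JSON-LD-like dict per GND entity found in the given fields."""
    root = ET.fromstring(record_xml)
    nodes = {}
    for datafield in root.iter("datafield"):
        if datafield.get("tag") not in tags:
            continue
        subfields = {
            sf.get("code"): (sf.text or "").strip()
            for sf in datafield.findall("subfield")
        }
        match = GND_ID.match(subfields.get("9", ""))
        if not match:
            continue  # e.g. $f "Lehrmittel" carries no GND identifier
        for code, specific in SPECIFIC_PROPERTY.items():
            if code in subfields:
                uri = "http://d-nb.info/gnd/" + match.group(1)
                # Keyed by URI so repeated headings yield a single node.
                nodes[uri] = {
                    "@id": uri,
                    "preferredName": subfields[code],
                    specific: subfields[code],
                }
                break
    return list(nodes.values())


# Hypothetical test record built from fields shown in Example 2.
SAMPLE_RECORD = """<record>
<datafield tag="902" ind1="-" ind2="1">
<subfield code="g">Coesfeld</subfield>
<subfield code="9">(DE-588)4010355-9</subfield>
</datafield>
<datafield tag="902" ind1="-" ind2="1">
<subfield code="s">Heimatkundeunterricht</subfield>
<subfield code="9">(DE-588)4024116-6</subfield>
</datafield>
<datafield tag="902" ind1="-" ind2="1">
<subfield code="f">Lehrmittel</subfield>
</datafield>
<datafield tag="907" ind1="-" ind2="1">
<subfield code="s">Heimatkundeunterricht</subfield>
<subfield code="9">(DE-588)4024116-6</subfield>
</datafield>
</record>"""

if __name__ == "__main__":
    for node in gnd_nodes(SAMPLE_RECORD):
        print(node)
```

Fields without subfield 9 are skipped because they cannot be linked to a GND object; whether such values should still end up in a label field is a separate decision.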
Regarding subfield c, can you point me to an example, @dr0i? |
At the NWBib meeting, customers asked for GND work titles having the author name in the label (see https://wiki1.hbz-nrw.de/x/DQBEB). Example: http://lobid.org/resource?id=HT018312899&format=full Instead of:
it should look like this:
|
See hbz/lobid#139. This will make the enrichment with gnd using hadoop obsolete.
See hbz/lobid#139. On the way to making the enrichment with gnd using hadoop obsolete. * update tests
See #141. Not exactly sure why the old settings weren't working anymore. Note that it broke against an index where the @graph.@id of the items wasn't used yet (as the history of lodmill-ld might suggest, see 7f0b3bbe2f825268fc06b286938fc09f03b943b8 committed on 2015-03-06 in lodmill-ld), because we switched back to an old index due to the hadoop enrichment issue (see #139). This phrase query against the internal id is working fine, though.
See hbz/lobid#139.
* add test and test data
This metafacture module generates json-ld from a jena rdf model. The generated documents are ready to be elasticsearch bulk indexed. There are highly specific requirements for generating documents out of the hbz01 catalog graph built with morph. One hbz01 catalog entry may result in dozens of documents, namely the
- 'main' doc (data about the main resource)
- 'items' of a doc
- 'super' docs ("hasPart" nodes)
- 'sameAs' docs
But not all nodes should be nodes on their own: the gnd nodes must stay sub nodes of the main node. So this module is not generic and may be made generic only to a certain degree.
* add productive lobid index config This metafacture command consumes a HashMap and indexes the json values into an Elasticsearch index. See hbz/lobid#139.
Ready for testing. |
I believe that restricting the type of a resource is now broken, e.g. http://test.lobid.org/resource?name=Tom%2BSawyer&from=0&size=10&type=http%3A%2F%2Fpurl.org%2Fontology%2Fbibo%2FBook returns resources that are not bibo:Book (e.g. http://lobid.org/resource/HT016678345). |
As we don't want to make use of metafacture flow anymore, a "run" package is added for starting processes. The "flow" class starts the transformation, the json conversion and the indexing into elasticsearch. Fixed: the bulk indexer was not reset, so update requests kept accumulating and were indexed all over again, which of course resulted in low performance. * update tests See hbz/lobid#139.
We had observed some redundancy in the elasticsearch index mappings. Most things ran smoothly enough and we thought that was ok. But it is not. This commit adjusts the config mappings so that a lookup of the index mappings doesn't show any "redundancy" (more accurately: unused definitions). It should fix the type query mentioned in hbz/lobid#139 and also hbz/lobid#141 (see the commit a6219bbb3bf5596b3a030da1e489fc1ba852d60a "... the @graph.@id of the items wasn't used yet").
Deployed to staging and production. |
We can close this one as we have this in production and there will probably be only some minor adjustments in the future. |
Since hbz/lobid#139 we don't use hadoop anymore and thus the test set is way easier to generate.
This is necessary because of hbz/lobid#139.
Since hbz/lobid#139 we don't use hadoop anymore and thus the test set is way easier to generate. * add some more test resources mentioned in hbz/lobid#153
As of hbz/lobid#139 we (mostly) don't use lodmill-ld anymore. It's just needed by the old lobid-organisations, which will be replaced by a new way of producing the data. Furthermore, the old lobid-organisations will not be enhanced anymore, so we don't expect lodmill-ld to change. Thus, it is safe to remove lodmill-ld from the build processes, especially for travis, since such a build with all the tests and mockups takes around 7 minutes and sometimes even fails because travis has problems with it (memory, especially).
Making queries using lv#contributorLabel instead of dc:contributor. See hbz/nwbib#117. After resolving #139, an update of the test data in the API reveals what is now missing, e.g. searching resources by author with date of birth and date of death. Enrichments with gutenberg, dbpedia and OpenLibrary are also missing. See also #106.
Making queries using lv#contributorLabel instead of dc:contributor. See hbz/nwbib#117. After resolving #139, an update of the test data in the API reveals what is now missing, e.g. searching resources by author with date of birth and date of death. Enrichments with gutenberg, dbpedia and OpenLibrary are also missing. Also, the organisations index was not properly configured (missing @graph.@properties), so auto completion didn't work. See also #106.
After resolving lobid/hbz#169 we forgot to update the API: making queries using lv#contributorLabel instead of dc:contributor. After resolving #139, an update of the test data in the API reveals what is now missing, e.g. searching resources by author with date of birth and date of death. Enrichments with gutenberg, dbpedia and OpenLibrary are also missing. Also, the organisations index was not properly configured (missing @graph.@properties), so auto completion didn't work. See also #106.
Currently, we are enriching the title data with GND labels using a hadoop job. There are at least two problems with this approach: #84 and one undocumented problem that appeared after the last morph adjustment.
To avoid these problems and reduce transformation time, we will get the labels directly out of the Aleph XML using morph.
Amongst others, we need to know: