Set up Elasticsearch & openly available kibana with complete Loc-instances-Bibframe dataset #23

Closed
acka47 opened this Issue Oct 24, 2018 · 10 comments

acka47 commented Oct 24, 2018

No description provided.

@fsteeg fsteeg added the ready label Oct 31, 2018

@dr0i dr0i added working and removed ready labels Nov 5, 2018

@dr0i dr0i changed the title from "Set up Elasticsearch & openly available kibana with complete Bibframe dataset" to "Set up Elasticsearch & openly available kibana with complete Loc-instances-Bibframe dataset" Nov 5, 2018

dr0i commented Nov 5, 2018

The solution provided in a) hbz/lobid-resources#851 is not proper enough, but b) #2 is not fast enough, so we are going with a) again, using java-jsonld. This brings up some issues which must be resolved, see hbz/lobid-resources#940.

acka47 commented Nov 9, 2018

@dr0i is now able to convert the LOC data with the Java JSON-LD library and the output looks much like what we generate in the workshop. Here is the current status with the same example that we use in the workshop:

The current differences to the JSON-LD from the workshop (at least those I have found) are:

  • all types are given as arrays (which is intended, but we will have to think about whether we actually want it for the LoC data)
  • http://id.loc.gov/ontologies/bflc/EncodingLevel is mapped to EncodedLevel in the context (instead of EncodingLevel), see https://github.com/hbz/lobid-resources/blob/523-addLanguageTagSupport/web/conf/context-loc.jsonld#L65
  • titleSortKey is given as an array ("@container": "@set" in the context)
  • Also, the pending fix for the missing datatypes in the context (#30) has to be implemented in lobid-resources as well.
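
Differences like the EncodedLevel mapping and the "@set" container can be spotted mechanically. The following is only a minimal sketch, assuming the context has been loaded as a plain Python dict; `lint_context` and the sample excerpt are hypothetical illustrations (the IRI used for titleSortKey is an assumption), not part of lobid-resources.

```python
def local_name(iri):
    """Return the part of an IRI after the last '#' or '/'."""
    cut = max(iri.rfind("#"), iri.rfind("/")) + 1
    return iri[cut:]


def lint_context(context):
    """Report terms whose name differs from the IRI's local name,
    and terms that are declared with "@container": "@set"."""
    mismatched, set_containers = [], []
    for term, definition in context.items():
        if term.startswith("@"):
            continue  # keywords such as @vocab are not term definitions
        is_expanded = isinstance(definition, dict)
        iri = definition.get("@id") if is_expanded else definition
        if not isinstance(iri, str) or not iri.startswith("http"):
            continue
        if iri.endswith(("/", "#")):
            continue  # namespace prefix such as "bf", not a term
        if local_name(iri) != term:
            mismatched.append((term, iri))
        if is_expanded and definition.get("@container") == "@set":
            set_containers.append(term)
    return mismatched, set_containers


# Hypothetical excerpt mirroring two of the findings listed above.
sample = {
    "EncodedLevel": "http://id.loc.gov/ontologies/bflc/EncodingLevel",
    "titleSortKey": {
        "@id": "http://id.loc.gov/ontologies/bflc/titleSortKey",  # assumed IRI
        "@container": "@set",
    },
}
mismatched, set_containers = lint_context(sample)
print(mismatched)
print(set_containers)
```

A name mismatch is not necessarily an error, since a context may rename terms on purpose, so the output is a list of candidates to review rather than a list of bugs.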

dr0i added a commit to hbz/lobid-resources that referenced this issue Nov 12, 2018

Improve loc context; add runner
- remove unused JsonConverter et al
- remove unused libraries in pom.xml

See hbz/swib18-workshop#23 and #523.

dr0i commented Nov 12, 2018

Did that. Also indexed 100k test resources, see curl "http://weywot5.hbz-nrw.de:9200/loc_works/_search?q=*" | json_pp. ETL time was 5 minutes, so it takes roughly the same time as the ETL of lobid-resources.
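
For scripted checks, the same query can be issued from Python. This is a rough sketch using only the standard library; the host and index name are taken from the curl call above, while `search_url`, `fetch_hits` and the `size` default are illustrative assumptions. The host may not be reachable from outside the hbz network.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ES_BASE = "http://weywot5.hbz-nrw.de:9200"  # host from the curl call above


def search_url(index, query="*", size=10):
    """Build a URI search request equivalent to the curl call."""
    return f"{ES_BASE}/{index}/_search?" + urlencode({"q": query, "size": size})


def fetch_hits(index, query="*", size=10, timeout=10):
    """Run the search and return the "hits" object of the response."""
    with urlopen(search_url(index, query, size), timeout=timeout) as response:
        return json.load(response)["hits"]


print(search_url("loc_works"))
# hits = fetch_hits("loc_works")  # needs network access to the host
```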

@dr0i dr0i assigned acka47 and unassigned dr0i Nov 12, 2018

@dr0i dr0i added review and removed working labels Nov 12, 2018

acka47 commented Nov 12, 2018

Thanks for indexing. Will soon take a look.

> so it takes roughly the same time as the ETL of lobid-resources.

\o/ This is good to hear.

acka47 commented Nov 13, 2018

Just took a look at the data. Here is what I noticed.

  1. Some things are missing in the context, as MADS classes are not compacted: http://www.loc.gov/mads/rdf/v1#ComplexSubject, http://www.loc.gov/mads/rdf/v1#GenreForm, http://www.loc.gov/mads/rdf/v1#Temporal, http://www.loc.gov/mads/rdf/v1#Geographic etc. The same goes for the MADS property http://www.loc.gov/mads/rdf/v1#isMemberOfMADSScheme and some bflc properties: http://id.loc.gov/ontologies/bflc/consolidates, http://id.loc.gov/ontologies/bflc/relatorMatchKey, http://id.loc.gov/ontologies/bflc/applicableInstitution, where the first two are also missing in the workshop context (see #32).
    Why not use the context at https://github.com/hbz/swib18-workshop/blob/master/data/context.json? (It would be easier for me to work with one context anyway.)
  2. Kibana is quite helpful for finding other errors. When I go to "Management" -> "Index Patterns", select "loc_works" and then filter by http://, I get all fields with "http" in them. There are those already noted in 1. and some more, which have in common that they have "@language": null in the context at https://github.com/hbz/lobid-resources/blob/523-addLanguageTagSupport/web/conf/context-loc.jsonld

I guess nearly all errors will be fixed by using the workshop context. The rest will be fixed with #32.
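
The Kibana filter can also be reproduced offline on a single converted document. The sketch below walks a JSON document and collects every key and type value that is still a full IRI. It assumes the compacted data uses either "type" or "@type" for types; the sample document is made up for illustration.

```python
IRI_PREFIXES = ("http://", "https://")


def uncompacted_iris(node, found=None):
    """Collect keys and type values that are still full http(s) IRIs."""
    if found is None:
        found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key.startswith(IRI_PREFIXES):
                found.add(key)  # property that was not compacted
            if key in ("@type", "type"):
                types = value if isinstance(value, list) else [value]
                found.update(
                    t for t in types
                    if isinstance(t, str) and t.startswith(IRI_PREFIXES)
                )
            uncompacted_iris(value, found)
    elif isinstance(node, list):
        for item in node:
            uncompacted_iris(item, found)
    return found


# Made-up document; IRIs in "id" values are expected and not reported.
doc = {
    "type": ["Work"],
    "subject": [{
        "type": ["http://www.loc.gov/mads/rdf/v1#ComplexSubject"],
        "http://www.loc.gov/mads/rdf/v1#isMemberOfMADSScheme": {
            "id": "http://id.loc.gov/authorities/subjects"
        },
    }],
}
for iri in sorted(uncompacted_iris(doc)):
    print(iri)
```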

acka47 commented Nov 13, 2018

For documentation: Some records are not indexed in ES because they have no valid date as value of adminMetadata.changeDate.
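
Such records could be detected before indexing. The following is a sketch under two assumptions that the thread does not confirm: that adminMetadata is a single object, and that a valid changeDate is an ISO 8601 string. `has_valid_change_date` is a hypothetical helper.

```python
from datetime import datetime


def has_valid_change_date(record):
    """Return True if adminMetadata.changeDate parses as an ISO 8601 date."""
    admin = record.get("adminMetadata")
    if not isinstance(admin, dict):  # assumption: a single object
        return False
    value = admin.get("changeDate")
    if not isinstance(value, str):
        return False
    try:
        datetime.fromisoformat(value)  # assumption: ISO 8601 strings
    except ValueError:
        return False
    return True


# Made-up records for illustration.
records = [
    {"id": "a", "adminMetadata": {"changeDate": "2018-11-13T10:15:00"}},
    {"id": "b", "adminMetadata": {"changeDate": "0000-00-00"}},
    {"id": "c", "adminMetadata": {}},
]
rejected = [r["id"] for r in records if not has_valid_change_date(r)]
print(rejected)
```

Logging the rejected IDs instead of dropping them silently would make it easier to tell how many records are affected.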

acka47 commented Nov 13, 2018

BTW, a lesson learned for Kibana: At "Management" -> "Index Patterns" -> "loc_works" I had to refresh the field list. Otherwise the fields from the previous index would be shown there and also in the field autosuggestions for creating a visualization.

dr0i added a commit to hbz/lobid-resources that referenced this issue Nov 13, 2018

dr0i added a commit to hbz/lobid-resources that referenced this issue Nov 13, 2018

dr0i commented Nov 13, 2018

@acka47 I indexed loc-works-bibframe with the workshop's JSON context. You have to set the index alias to use it for Kibana. 3k more documents are missing now. Tomorrow morning you will get the lobid-admin mail complaining about [MapperParsingException[object mapping for [language] tried to parse field [null] and some more for subject etc., plus the known adminMetadata.changeDate.
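
Setting the alias can be done with one request to the Elasticsearch _aliases endpoint. The sketch below only builds the request body. The alias name "loc_works" is assumed to match the Kibana index pattern, and both concrete index names are hypothetical placeholders, since the thread does not name them.

```python
import json


def alias_switch_body(alias, new_index, old_index=None):
    """Build the body for POST /_aliases that points `alias` at `new_index`."""
    actions = []
    if old_index:
        # Removing and adding in one request makes the switch atomic.
        actions.append({"remove": {"index": old_index, "alias": alias}})
    actions.append({"add": {"index": new_index, "alias": alias}})
    return {"actions": actions}


# Hypothetical index names for illustration only.
body = alias_switch_body("loc_works", "loc_works-new", "loc_works-old")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent with curl -XPOST to the cluster's /_aliases endpoint, with a Content-Type: application/json header.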

acka47 commented Nov 14, 2018

Taking a look at the new index, some of the property/type URIs are now compacted but some others are still missing:

  • http://www.loc.gov/mads/rdf/v1#ComplexSubject
  • http://www.loc.gov/mads/rdf/v1#GenreForm -> missing in context as there is also bf:GenreForm
  • http://www.loc.gov/mads/rdf/v1#Temporal -> missing in context as there is also bf:Temporal
  • http://www.loc.gov/mads/rdf/v1#Geographic
  • http://www.loc.gov/mads/rdf/v1#isMemberOfMADSScheme
  • http://id.loc.gov/ontologies/bflc/consolidates
  • http://id.loc.gov/ontologies/bflc/relatorMatchKey
  • http://id.loc.gov/ontologies/bflc/applicableInstitution

For bflc:consolidates and bflc:relatorMatchKey see #32 (comment). I also found the problem with mads:GenreForm and mads:Temporal. What I do not understand is why mads:isMemberOfMADSScheme is not compacted, even though it is in the current context, see https://github.com/hbz/swib18-workshop/blob/master/data/context.json#L1114
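
The mads:GenreForm and mads:Temporal case looks like a term-name collision: two IRIs from different namespaces share one local name, so presumably only one of them ends up as a term in the context. A minimal sketch for detecting such collisions in a list of class and property IRIs follows; `term_collisions` is a hypothetical helper and the IRI list is a small hand-picked sample.

```python
from collections import defaultdict


def split_iri(iri):
    """Split an IRI into (namespace, local name) at the last '#' or '/'."""
    cut = max(iri.rfind("#"), iri.rfind("/")) + 1
    return iri[:cut], iri[cut:]


def term_collisions(iris):
    """Map each local name used in more than one namespace to its namespaces."""
    by_name = defaultdict(set)
    for iri in iris:
        namespace, name = split_iri(iri)
        by_name[name].add(namespace)
    return {name: sorted(ns) for name, ns in by_name.items() if len(ns) > 1}


# Hand-picked sample based on the classes mentioned above.
iris = [
    "http://id.loc.gov/ontologies/bibframe/GenreForm",
    "http://www.loc.gov/mads/rdf/v1#GenreForm",
    "http://id.loc.gov/ontologies/bibframe/Temporal",
    "http://www.loc.gov/mads/rdf/v1#Temporal",
    "http://www.loc.gov/mads/rdf/v1#Geographic",
]
collisions = term_collisions(iris)
for name, namespaces in collisions.items():
    print(name, namespaces)
```

Colliding names would need a disambiguated term in the context, for example a prefixed form for one of the two namespaces.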

acka47 commented Jan 7, 2019

Closing this one as index and Kibana are up. The data may not be perfect but it is totally fine for an experimental setup.

@acka47 acka47 closed this Jan 7, 2019

@acka47 acka47 removed the review label Jan 7, 2019
