Ingest catalogue data #4

Open
4 tasks
raquelalegre opened this issue May 2, 2018 · 3 comments
Labels
data Ingestion and preprocessing of data

Comments

@raquelalegre
Contributor

raquelalegre commented May 2, 2018

Period information is already in the glossaries ES DB, but location and genre (and other possibly interesting bits like sub/super genre) are not linked to individual glossary entries. Those are in the catalogue.json file Steve sent. We need to:

  • Check all entries in our ES glossary are linked to a P-object in catalogue.json.
  • Add the instance IDs to the ES glossary entries.
  • Add the catalogue "members" (i.e. P-objects).
  • Change the search_all endpoint to also return genre and location (and maybe other fields) by joining catalogue entries with glossary entries.
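
A minimal sketch of the join described in the last task, under stated assumptions: each glossary entry already carries its P-numbers under a hypothetical `texts` key, the catalogue "members" are a dict keyed by P-number, and `genre` and `location` are the field names in each member. None of these names are confirmed by catalogue.json or the ES mapping, and the sample data (other than the P-number format) is made up.

```python
def join_entry_with_catalogue(entry, members, fields=("genre", "location")):
    """Attach catalogue fields to a glossary entry, one record per linked text.

    `entry["texts"]` (hypothetical key) lists the P-numbers the entry occurs in.
    `members` maps P-number -> catalogue record (assumed layout).
    P-numbers with no catalogue record are collected under "unlinked".
    """
    linked, unlinked = [], []
    for p_number in entry.get("texts", []):
        member = members.get(p_number)
        if member is None:
            unlinked.append(p_number)
            continue
        record = {"id": p_number}
        for field in fields:
            # A field missing from the member is stored as None.
            record[field] = member.get(field)
        linked.append(record)
    return {**entry, "catalogue": linked, "unlinked": unlinked}


# Illustrative data only; the values are invented for the example.
members = {"P010632": {"genre": "Lexical", "location": "Nineveh"}}
entry = {"cf": "abu", "texts": ["P010632", "P999999"]}
joined = join_entry_with_catalogue(entry, members)
```

The `unlinked` list doubles as the check from the first task: any entry with a non-empty `unlinked` list refers to a text that is not in the catalogue.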
@ageorgou
Contributor

ageorgou commented May 2, 2018

Some relevant info on adding instance IDs from oracc/elastic-search-poc#4:

  • The instances field of the glossary holds a list of lists of occurrences. Each element (i.e. each list of occurrences) has an id, which is referred to from the xis field of an entry, with two caveats:
    • Some ids have the form [lan].[abcde].p.[per], where the part p.[per] refers to a particular period. These are probably not referred to directly from the entries but are generated automatically for other reasons. Additionally, for each of these, there should be a corresponding instance with id [lan].[abcde], containing the same list of occurrences. We can forget about the period-specific instances and only use the "general" ones (i.e. without the .p.[per] part).
    • Some sub-fields of entries (norms, forms, ...) also have xis fields referring to potentially distinct elements of instances. These will be sub-lists of the list of occurrences referred to by the top-level entry. We can decide how to present the results, whether it's just using the top-level xis or providing more detail.
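
Dropping the period-specific instances could look roughly like the sketch below. It assumes the instances have been loaded as a dict mapping id to its list of occurrences, and that a trailing `.p.[per]` segment is the only marker of a period-specific id; both are assumptions, and the ids in the example are invented. It also reports any period-specific id that lacks the expected "general" counterpart, since the comment above only says one should exist.

```python
import re

# Assumed pattern: a trailing ".p.<per>" marks a period-specific instance id.
PERIOD_SUFFIX = re.compile(r"\.p\.[^.]+$")


def general_instances(instances):
    """Return (general, orphans).

    general: instances whose id has no ".p.<per>" suffix.
    orphans: period-specific ids with no matching general id, sorted.
    """
    general = {
        inst_id: occurrences
        for inst_id, occurrences in instances.items()
        if not PERIOD_SUFFIX.search(inst_id)
    }
    orphans = sorted(
        inst_id
        for inst_id in instances
        if PERIOD_SUFFIX.search(inst_id)
        and PERIOD_SUFFIX.sub("", inst_id) not in general
    )
    return general, orphans


# Invented ids and occurrences, for illustration only.
instances = {
    "akk.r00001": ["occ-a", "occ-b"],
    "akk.r00001.p.1": ["occ-a", "occ-b"],
    "akk.r00002.p.1": ["occ-c"],
}
general, orphans = general_instances(instances)
```

If `orphans` comes back non-empty on real data, the "forget the period-specific instances" approach would lose occurrences and needs revisiting.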

@ageorgou
Contributor

ageorgou commented May 10, 2018

Two questions (probably for Steve):

  • Is it true that all items have a supergenre and genre, but not necessarily a subgenre?
  • Are P-numbers (e.g. P010632) unique across all projects, or only within a certain project?

Looking into this further, it seems that some catalogue entries are missing at least one of genre, supergenre or period (see the results for the "neo" project's catalogue.json in catalogue_missing.txt).
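
A check along these lines can be sketched as follows. This is not the script that produced catalogue_missing.txt; it assumes the catalogue members are a dict keyed by P-number and that the fields are named `genre`, `supergenre` and `period`. The sample records are invented.

```python
REQUIRED_FIELDS = ("genre", "supergenre", "period")


def missing_fields(members, required=REQUIRED_FIELDS):
    """Map each P-number to the required fields it lacks.

    A field counts as missing when it is absent or empty.
    Members with all required fields are left out of the report.
    """
    report = {}
    for p_number, member in members.items():
        absent = [field for field in required if not member.get(field)]
        if absent:
            report[p_number] = absent
    return report


# Invented records, for illustration only.
members = {
    "P000001": {"genre": "Lexical", "supergenre": "LEX", "period": "Neo-Assyrian"},
    "P000002": {"genre": "", "period": "Neo-Assyrian"},
}
report = missing_fields(members)
```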

@raquelalegre
Contributor Author

raquelalegre commented May 15, 2018

From conversation with Steve:

  • P-numbers are unique across the whole DB.
  • Default to an unknown genre if it's not in the DB. Subgenres have not been curated, so they are not standardized, but we can still offer users search on them. Instead of a dropdown with a limited list of options for subgenre, we can use a free text box (Steve thinks it would be useful for users to search subgenres).
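
One way to apply these two decisions at ingestion and search time, as a sketch only: the literal default value `"unknown"`, the field names `genre` and `subgenre`, and the substring-matching rule for the free text box are all assumptions, not something agreed with Steve.

```python
def with_default_genre(member, default="unknown"):
    """Return a copy of a catalogue member with a fallback genre filled in."""
    cleaned = dict(member)
    if not cleaned.get("genre"):
        cleaned["genre"] = default
    return cleaned


def matches_subgenre(member, query):
    """Case-insensitive substring match, suited to an uncurated free-text field.

    An empty query matches every member, including those with no subgenre.
    """
    subgenre = member.get("subgenre") or ""
    return query.strip().lower() in subgenre.lower()


# Invented record, for illustration only.
member = with_default_genre({"subgenre": "Royal Inscription"})
found = matches_subgenre(member, "royal")
```

In the ES-backed endpoint the substring match would presumably be replaced by a text query on the subgenre field; the Python version is only meant to pin down the intended behaviour.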

@ageorgou ageorgou added the data Ingestion and preprocessing of data label May 17, 2019
2 participants