Maximize number of containedIn statements #253

Open
acka47 opened this issue Sep 1, 2016 · 11 comments
@acka47
Contributor

acka47 commented Sep 1, 2016

Currently, we have ~6700 entries with a containedIn link to GeoNames. We get these by querying GeoNames with the Gemeindeschlüssel (the German municipality key); see the current morph-enriched, lines 331-335.

The link is missing for ~15k entries; see http://beta.lobid.org/organisations/search?q=_missing_:containedIn. As Mapzen also provides GeoNames data, we should probably query it with the address plus the Gemeindeschlüssel (if available).

@acka47 acka47 changed the title Check provenance of containedIn statements Maximize number of containedIn statements Sep 1, 2016
@acka47 acka47 added the launch label Sep 1, 2016
@fsteeg fsteeg added the ready label Sep 5, 2016
@fsteeg
Member

fsteeg commented Sep 5, 2016

This would be possible with the Mapzen API [1, 2], but it would significantly increase both our Mapzen call count and the transformation runtime. The current approach based on the Gemeindeschlüssel is much more efficient. Since we also want to improve RS/AGS (see #134), I think we should stay with the current approach and try to improve the numbers from that side.

[1] https://search.mapzen.com/v1/search?text=50676+Köln&sources=geonames&layers=coarse
[2] http://www.geonames.org/2886242
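
For illustration, a minimal sketch of how a request like [1] could be assembled per address. The function name `geonames_search_url` is hypothetical; the endpoint and the `text`, `sources` and `layers` parameters are taken from the URL in [1]. The sketch only builds the URL and makes no network call.

```python
from urllib.parse import urlencode

MAPZEN_SEARCH = "https://search.mapzen.com/v1/search"  # endpoint as used in [1]

def geonames_search_url(postal_code: str, city: str) -> str:
    """Build a coarse, GeoNames-only search URL for one address (hypothetical helper)."""
    params = {
        "text": f"{postal_code} {city}",
        "sources": "geonames",  # restrict results to GeoNames data
        "layers": "coarse",     # administrative areas rather than street addresses
    }
    return f"{MAPZEN_SEARCH}?{urlencode(params)}"

url = geonames_search_url("50676", "Köln")
# url == "https://search.mapzen.com/v1/search?text=50676+K%C3%B6ln&sources=geonames&layers=coarse"
```

One such request would be needed for every entry that lacks containedIn, which is where the additional call count and runtime would come from.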

@fsteeg fsteeg added review and removed ready labels Sep 5, 2016
@fsteeg fsteeg assigned acka47 and unassigned fsteeg Sep 5, 2016
@acka47 acka47 assigned fsteeg and unassigned acka47 Sep 8, 2016
@fsteeg
Member

fsteeg commented Sep 9, 2016

With the increased number of AGS values on staging, we have more containedIn values, too:

http://beta.lobid.org/organisations/search?q=containedIn:*
http://test.lobid.org/organisations/search?q=containedIn:*

These counts are still far lower than the AGS counts. Should there be a GeoNames value for every AGS? Are the corresponding entries missing from https://raw.githubusercontent.com/hbz/lookup-tables/master/data/geonames-map.csv? @SBRitter, where did you get that data from?
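
To check this, one could list the AGS values that occur in the data but have no entry in the lookup table. A minimal sketch, assuming the records are available as dicts with an optional `ags` field and the lookup table has been loaded into a dict keyed by AGS. Both shapes are assumptions made for illustration, and the sample values are made up apart from the GeoNames URI from [2] above.

```python
def unmapped_ags(records, geonames_by_ags):
    """Return the sorted AGS values that occur in records but have no GeoNames entry."""
    seen = {r["ags"] for r in records if r.get("ags")}
    return sorted(seen.difference(geonames_by_ags))

# Illustrative sample data, not taken from the real index or lookup table.
records = [
    {"name": "Library A", "ags": "05315000"},
    {"name": "Library B", "ags": "05111000"},
    {"name": "Library C"},  # no AGS at all, so not counted here
]
geonames_by_ags = {"05315000": "http://www.geonames.org/2886242"}

missing = unmapped_ags(records, geonames_by_ags)
# missing == ["05111000"]
```

If the resulting list is long, the gap lies in the lookup table rather than in the AGS data.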

@fsteeg fsteeg assigned acka47 and unassigned fsteeg Sep 9, 2016
@acka47
Contributor Author

acka47 commented Sep 12, 2016

Here is the current status from the perspective of entries missing a containedIn statement:

http://beta.lobid.org/organisations/search?q=_missing_:containedIn
http://test.lobid.org/organisations/search?q=_missing_:containedIn

@acka47 acka47 assigned SBRitter and unassigned acka47 Sep 16, 2016
@SBRitter
Contributor

@fsteeg, I'm sorry, I don't know and I can't find hints in the git log. Maybe from a HashMap by @dr0i? Going to think about it...

@fsteeg fsteeg removed the review label Sep 20, 2016
@acka47 acka47 assigned dr0i and unassigned SBRitter Sep 21, 2016
@acka47 acka47 added review and removed review labels Sep 21, 2016
@dr0i
Member

dr0i commented Sep 21, 2016

The lodmill repo contains the file https://github.com/lobid/lodmill/blob/master/lodmill-rd/src/main/resources/geonames_DE.csv. It's a simple CSV; the source must be http://download.geonames.org/export/dump/DE.zip. The CSV is transformed in the lodmill repo using Metamorph (https://github.com/lobid/lodmill/blob/master/lodmill-rd/src/main/resources/morphGeonamesCsv2ld.xml) to create triples, which are linked by the Gemeindeschlüssel object (found in ISIL field 032P.n, see https://github.com/lobid/lodmill/blob/master/lodmill-rd/src/main/resources/morph_zdb-isil-file-pica2ld.xml#L110).
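
As a rough illustration of such a mapping (not the actual Metamorph transformation), here is a sketch that builds a Gemeindeschlüssel-to-GeoNames lookup directly from the tab-separated GeoNames dump format. It assumes that the `admin4 code` column carries the Gemeindeschlüssel for German records and, where several places share a code, keeps the most populous one. Both are assumptions for this example, and the sample rows are made up.

```python
import csv
import io

# Column positions in the GeoNames dump (tab-separated, no header row).
GEONAME_ID, NAME, ADMIN4, POPULATION = 0, 1, 13, 14

def ags_to_geonames(dump):
    """Map each admin4 code to the GeoNames URI of its most populous place."""
    best = {}  # code -> (population, uri)
    for row in csv.reader(dump, delimiter="\t", quoting=csv.QUOTE_NONE):
        code = row[ADMIN4]
        if not code:
            continue  # no admin4 code, so nothing to link on
        population = int(row[POPULATION] or 0)
        uri = f"http://www.geonames.org/{row[GEONAME_ID]}"
        if code not in best or population > best[code][0]:
            best[code] = (population, uri)
    return {code: uri for code, (_, uri) in best.items()}

def sample_row(geoname_id, name, admin4, population):
    """Build one 19-column dump line; all other columns stay empty."""
    cols = [""] * 19
    cols[GEONAME_ID], cols[NAME] = geoname_id, name
    cols[ADMIN4], cols[POPULATION] = admin4, population
    return "\t".join(cols)

# Made-up sample rows; the population figures are illustrative only.
sample = io.StringIO("\n".join([
    sample_row("2886242", "Köln", "05315000", "963395"),
    sample_row("1", "Hypothetical suburb", "05315000", "1200"),
    sample_row("2", "Place without code", "", "50"),
]))
mapping = ags_to_geonames(sample)
# mapping == {"05315000": "http://www.geonames.org/2886242"}
```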

@dr0i dr0i assigned fsteeg and unassigned dr0i Sep 21, 2016
@dr0i
Member

dr0i commented Sep 21, 2016

Ah, and the lodmill CSV has far more entries (~180k) than the one in the lookup-tables repo (~11k).

@fsteeg fsteeg removed their assignment Sep 22, 2016
@acka47 acka47 removed the working label Sep 26, 2016
@acka47 acka47 added the ready label Sep 26, 2016
@fsteeg fsteeg self-assigned this Dec 6, 2016
@fsteeg fsteeg added working and removed ready labels Dec 6, 2016
@fsteeg
Member

fsteeg commented Dec 6, 2016

OK, trying to understand how to proceed here and how it's all connected. From #253 (comment) it seems the issue is that we are missing too many containedIn fields, which should be created from the ags field, of which fewer are missing:

http://test.lobid.org/organisations/search?q=_missing_:containedIn
http://test.lobid.org/organisations/search?q=_missing_:ags

A potential solution would be to use https://github.com/lobid/lodmill/blob/master/lodmill-rd/src/main/resources/geonames_DE.csv, which contains the data occurring in 032P.n in its first column, instead of https://raw.githubusercontent.com/hbz/lookup-tables/master/data/geonames-map.csv.

@fsteeg fsteeg added ready and removed working labels Dec 6, 2016
@fsteeg
Member

fsteeg commented Dec 6, 2016

In an offline discussion, @acka47 mentioned that we might actually not need containedIn at all.

Instead, #268 might be the relevant area to improve the data.

In any case, it seems this should not be blocking the launch.

@acka47
Contributor Author

acka47 commented Dec 6, 2016

> In an offline discussion, @acka47 mentioned that we might actually not need containedIn at all.

Links to GeoNames address some interesting use cases, e.g. you can get data on population size and make queries like: get all museums in places with < 10,000 residents. As this will be easy to fix as soon as #268 is finished, we shouldn't close this issue.
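
A minimal sketch of such a query, assuming each organisation is a dict with `name`, `type` and `containedIn` fields and that population figures have already been fetched from GeoNames into a dict keyed by place URI. The `name` and `type` field names, the example.org URI and all sample values are hypothetical.

```python
def orgs_in_small_places(orgs, population_by_place, org_type, max_residents):
    """Names of organisations of org_type whose containedIn place has fewer than max_residents."""
    result = []
    for org in orgs:
        population = population_by_place.get(org.get("containedIn"))
        if (org.get("type") == org_type
                and population is not None
                and population < max_residents):
            result.append(org["name"])
    return result

# Made-up sample data for illustration only.
population_by_place = {
    "http://www.geonames.org/2886242": 963395,
    "http://example.org/place/small-town": 8000,
}
orgs = [
    {"name": "City Museum", "type": "Museum",
     "containedIn": "http://www.geonames.org/2886242"},
    {"name": "Village Museum", "type": "Museum",
     "containedIn": "http://example.org/place/small-town"},
    {"name": "Village Library", "type": "Library",
     "containedIn": "http://example.org/place/small-town"},
    {"name": "Unlinked Museum", "type": "Museum"},  # no containedIn, cannot be matched
]

small = orgs_in_small_places(orgs, population_by_place, "Museum", 10_000)
# small == ["Village Museum"]
```

The unlinked entry shows why the missing containedIn statements matter: without the link, the organisation silently drops out of any such query.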

@acka47
Contributor Author

acka47 commented Jul 12, 2017

With our own Pelias service running, we could do it as @fsteeg suggested in #253 (comment).

@acka47
Contributor Author

acka47 commented Jul 12, 2017

See also lobid/lodmill#488.
