DBS-derived organizations without geo coordinates #89

Closed
acka47 opened this Issue Dec 2, 2015 · 24 comments

4 participants
@acka47
Contributor

acka47 commented Dec 2, 2015

As said in #49 (comment), there are a lot of DBS-derived entries without geo coordinates. See, e.g., this list: http://beta.lobid.org/organisations/search?q=@id=DBS&size=20&from=250

At least for some entries, the reason for the missing geo data is that no street address is provided by DBS. From the CSV file, I can see that this applies to 124 of the entries.

First of all, I would like to know how many entries in total are lacking geo coordinates. Can I find this out with Elasticsearch, i.e. can I get all entries where a particular field is missing?
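A minimal sketch of one way to ask Elasticsearch for entries that lack a field: a `bool` query whose `must_not` clause wraps an `exists` query. The field name `location.geo` and the helper name `missing_field_query` are assumptions for illustration, and depending on the Elasticsearch version an older `missing` filter may be needed instead.

```python
import json


def missing_field_query(field, size=0):
    """Build a search body matching documents in which `field` is absent.

    `size=0` asks only for the hit count, not for the documents themselves.
    """
    return {
        "size": size,
        "query": {"bool": {"must_not": [{"exists": {"field": field}}]}},
    }


# `location.geo` is an assumed field name, not confirmed by the index mapping.
body = missing_field_query("location.geo")
print(json.dumps(body, indent=2))
```

Sending this body to the index's `_search` endpoint should return the number of matching entries as the total hit count in the response.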

@hauschke

hauschke commented Dec 9, 2015

I have found 148 institutions without geonames, and subsequently(?) without lat&lon:
https://gist.github.com/hauschke/b4c998dbee1bb0f75947#file-missing_geonames-csv

Related?

@dr0i

Contributor

dr0i commented Dec 10, 2015

Just as a note: in the old lobid-organisations, most (all?) of the entries that have an ISIL do have geo data, e.g. DE-198 and DE-MUS-430817.

@fsteeg

Member

fsteeg commented Dec 16, 2015

To fix this, we need to

  • do a full reindexing to get coordinates for updated entries
  • implement automatic geo lookup during updates as in the old lobid-organisations

@fsteeg fsteeg self-assigned this Dec 16, 2015

@fsteeg fsteeg added ready nwbib-launch and removed ready labels Dec 16, 2015

@acka47

Contributor Author

acka47 commented Dec 17, 2015

Re. containedIn, as asked by @fsteeg on twitter (https://twitter.com/fsteeg/status/677483772451954688). This is added based on the "Regionalschlüssel" (rs). See #17 (comment) where it reads:

  • Add look-up of geonames in map using 8-digit "Gemeindeschlüssel" ("032P.n" of Sigel and a 8-digit derivation of "gemeindekennzahl" in DBS) for field "containedIn"
@fsteeg

Member

fsteeg commented Dec 17, 2015

Some confusion here:

So what I wrote in the comment above about the steps we need to take is nonsense. The original issue remains (geo coordinates for DBS entries), and geonames are what they are and we have no plans to change anything. For Sigel-Orgs everything seems to be fine, thus I'll remove the nwbib-launch label, unassign myself and move to backlog.

@hauschke If there's anything you need, get in contact (perhaps in a new issue).

@hauschke

hauschke commented Dec 17, 2015

Example of an organisation without lat+long and without a GeoNames ID:
http://beta.lobid.org/organisations/DBS-WA339

Should have:
"containedIn" : "http://www.geonames.org/2875558"

@fsteeg fsteeg removed their assignment Dec 17, 2015

@acka47

Contributor Author

acka47 commented Dec 17, 2015

Example of an organisation without lat+long and without a GeoNames ID:
http://beta.lobid.org/organisations/DBS-WA339

Should have:
"containedIn" : "http://www.geonames.org/2875558"

Generally, only DBS entries contain "Regionalschlüssel" which are then mapped to GeoNames. There can be three reasons why entries don't have a GeoNames link.

  1. Entries solely derived from the ISIL registry (i.e. that are not also part of DBS) don't have a link to GeoNames as the "Regionalschlüssel" is missing. [Edit: This is wrong, as the key is in "032P.n", as said above. The key simply doesn't get into the RDF.]
  2. For some DBS-derived entries, the base data has a "Regionalschlüssel" in it, but it cannot be found in the RDF and probably isn't used for matching with GeoNames.
  3. Some entries might have a "Regionalschlüssel" and can't be mapped to GeoNames as the key is missing there. (I am not sure that these cases actually exist.)
@acka47

Contributor Author

acka47 commented Dec 17, 2015

At least 739 entries don't have rs and containedIn (link to GeoNames) because they are entries for moved/renamed libraries: http://beta.lobid.org/organisations/search?q=name:fr%C3%BCher

See also #76.

@dr0i

Contributor

dr0i commented Dec 18, 2015

@hauschke wrote:

I have found 148 institutions without geonames, and subsequently(?) without lat&lon:

How did you create that data? It seems that many/all ISILs mentioned there do in fact have lat/lon and GeoNames data.

@hauschke

hauschke commented Dec 18, 2015

My mistake. Geonames is missing, lat&long is mostly there. I used #55 (comment) to gather the data and extracted lat&long from geoname.rdf.

@fsteeg

Member

fsteeg commented Jan 15, 2016

So #92 and #93 look good, we have a new issue #90.

I think we can close this. Assigning to @acka47 for review.

@fsteeg fsteeg added the review label Jan 15, 2016

@acka47

Contributor Author

acka47 commented Jan 18, 2016

Checking DBS entries as we did for ISIL entries in #93 (comment):

  1. get total of entries with dbsID: https://beta.lobid.org/organisations/search?q=dbsID:* => 12,299
  2. get total of entries with dbsID and geo data: https://beta.lobid.org/organisations/search?q=dbsID:*%20AND%20location.geo:* => 9,714
  3. DBS-derived entries that are not expected to have geo data (because of a missing street address) => 124

=> 12,299 - 9,714 - 124 = 2,461 (still ~20 %) DBS-derived entries without geo data
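The subtraction behind that figure, as a quick check using only the three counts listed above (the variable names are illustrative):

```python
total_dbs = 12299          # entries with dbsID
with_geo = 9714            # entries with dbsID and geo data
no_street_address = 124    # entries not expected to have geo data

# Entries that lack geo data although a street address is available.
unexplained = total_dbs - with_geo - no_street_address
share = unexplained / total_dbs

print(unexplained, f"{share:.1%}")  # prints: 2461 20.0%
```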

@acka47 acka47 changed the title from "Organizations without geo coordinates" to "DBS-derived organizations without geo coordinates" Jan 18, 2016

@acka47 acka47 added ready and removed review labels Jan 25, 2016

@acka47 acka47 removed their assignment Jan 25, 2016

@fsteeg fsteeg assigned acka47 and unassigned fsteeg Sep 1, 2016

@acka47

Contributor Author

acka47 commented Sep 1, 2016

\o/ +1

@acka47

Contributor Author

acka47 commented Sep 1, 2016

For checking the accuracy of OSM lookups, we can at some point automatically compare geo coordinates with the Regionalschlüssel. E.g., by querying for libraries in NRW via the field rs, I can easily see that at least two lookups went wrong: http://beta.lobid.org/organisations/search?q=rs:05*
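A rough sketch of what such an automatic comparison could look like: map the first two digits of the Regionalschlüssel to a bounding box for the federal state and flag coordinates that fall outside it. Only the prefix `05` for NRW comes from this thread; the bounding-box values are approximate assumptions, and the names `STATE_BOUNDS` and `is_plausible` as well as the sample key are hypothetical.

```python
# Approximate bounding boxes per federal state, keyed by the first two digits
# of the Regionalschlüssel. The NRW values are rough assumptions, not survey data.
STATE_BOUNDS = {
    "05": {"lat": (50.3, 52.6), "lon": (5.8, 9.5)},  # Nordrhein-Westfalen
}


def is_plausible(rs, lat, lon):
    """Return True/False if the state is known, None if it cannot be checked."""
    bounds = STATE_BOUNDS.get(rs[:2])
    if bounds is None:
        return None
    lat_min, lat_max = bounds["lat"]
    lon_min, lon_max = bounds["lon"]
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max


# Illustrative key starting with "05"; coordinates roughly in Köln and in Berlin.
print(is_plausible("053150000000", 50.94, 6.96))
print(is_plausible("053150000000", 52.52, 13.40))
```

A bounding box is deliberately coarse: it catches lookups that land in a different part of the country, but not ones that land in the wrong town within the same state.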

fsteeg added a commit that referenced this issue Sep 1, 2016

Tweak API query format for better results, update tests (#89)
Use correct separation of street address and city in query.

This commit also fixes a UI issue for organisations without
classification (discovered with updated test data).
@acka47

Contributor Author

acka47 commented Sep 1, 2016

@acka47 acka47 assigned fsteeg and unassigned acka47 Sep 1, 2016

fsteeg added a commit that referenced this issue Sep 1, 2016

Include postcode in geo lookup, use confidence treshold (#89)
To improve lookup result quality and avoid false positives.

@fsteeg

Member

fsteeg commented Sep 1, 2016

Deployed a new version to staging where these queries look better. The confidence threshold used is too high though, resulting in missing location data for 3056 organisations. Plus, due to a bug, these organisations are skipped entirely. I will continue by fixing the bug and lowering the threshold.

fsteeg added a commit that referenced this issue Sep 1, 2016

Include postcode in geo lookup, use confidence treshold (#89)
To improve lookup result quality and avoid false positives.
@fsteeg

Member

fsteeg commented Sep 1, 2016

Processed with a threshold of 0.7; this yields values similar to the current beta, with a slight improvement:

http://test.lobid.org/organisations/search?q=dbsID:*+AND+_missing_:location.geo
http://beta.lobid.org/organisations/search?q=dbsID:*+AND+_missing_:location.geo

I think we should move the threshold down a little further and check the results again.
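For repeating this comparison after each reprocessing run, the two URLs above can be generated from one query string. This is only a convenience sketch; the helper name `missing_geo_url` is made up here, and the hosts and query are copied from the links above.

```python
from urllib.parse import urlencode


def missing_geo_url(host):
    """Build the search URL for DBS entries that lack location.geo."""
    query = "dbsID:* AND _missing_:location.geo"
    # Keep ':' and '*' unescaped so the URL reads like the ones in this thread.
    return f"http://{host}/organisations/search?" + urlencode({"q": query}, safe=":*")


for host in ("test.lobid.org", "beta.lobid.org"):
    print(missing_geo_url(host))
```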

@acka47

Contributor Author

acka47 commented Sep 2, 2016

Yes, the results look very good. We might even test a lower threshold.

fsteeg added a commit that referenced this issue Sep 2, 2016

Tweak logging, match address of best result (#89)
Accept result independent of treshold if street and city match.

@fsteeg

Member

fsteeg commented Sep 2, 2016

Reprocessed with a slightly lower threshold (0.675) and an additional rule: if (street AND (postal code OR city)) match, use the result, no matter what the confidence is (I saw cases with missing postal codes in Mapzen but a complete street and city match, resulting in 0.5 confidence).
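The acceptance rule described above can be sketched roughly as follows. This is not the code from the commits; the `Candidate` type, its field names, and the choice of `>=` for the threshold comparison are assumptions made for illustration.

```python
from dataclasses import dataclass

THRESHOLD = 0.675  # value mentioned in this comment


@dataclass
class Candidate:
    """Best geocoder result for one organisation (hypothetical structure)."""
    confidence: float
    street_matches: bool
    postcode_matches: bool
    city_matches: bool


def accept(candidate, threshold=THRESHOLD):
    """Accept on an address match, otherwise fall back to the confidence."""
    address_match = candidate.street_matches and (
        candidate.postcode_matches or candidate.city_matches
    )
    return address_match or candidate.confidence >= threshold


# Street and city match, postal code missing, confidence only 0.5: accepted.
print(accept(Candidate(0.5, True, False, True)))
# No address match and confidence below the threshold: rejected.
print(accept(Candidate(0.6, False, False, True)))
```

Checking the address match first means the threshold only decides the cases where the returned address cannot be confirmed against the source data.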

Results in 1218 missing locations (1921 in beta):

http://test.lobid.org/organisations/search?q=dbsID:*+AND+_missing_:location.geo

False positives look good too:

http://test.lobid.org/organisations/search?q=rs:01*
http://test.lobid.org/organisations/search?q=rs:03*
http://test.lobid.org/organisations/search?q=rs:05*
http://test.lobid.org/organisations/search?q=rs:06*
http://test.lobid.org/organisations/search?q=rs:07*
http://test.lobid.org/organisations/search?q=rs:08*

@fsteeg fsteeg assigned acka47 and unassigned fsteeg Sep 2, 2016

@fsteeg

Member

fsteeg commented Sep 2, 2016

Log output for the 1218 missing locations: missing-geo.txt

@acka47

Contributor Author

acka47 commented Sep 2, 2016

+1
