Implement regular DBS updates #94

acka47 · 2016-01-07T12:52:03Z

To be discussed with DBS colleagues. Questions (amongst others):

How often do we get regular updates of DBS data?
What happens when people report errors in lobid-organisations entires derived from DBS?

acka47 · 2017-05-10T13:01:52Z

I will speak to Ira soon about this. There are different options – short-term and mid-/long-term.

Short-term: Notify DBS regarding changes

Notify DBS staff about necessary corrections like here and here.
We could
1. mail the corrections to them and ask them for an updated export as soon as the data is corrected.
2. do the corrections ourselves and send them a link to the diff. For this, the DBS csv file would have to be in a git repository somewhere. (I think currently this is not the case.)

Mid- to long-term: move DBS-Stammdaten management to another system

Move administration of DBS Stammdaten to the to-be-built hbz customer relationship graph.
This will enable workflows where more people can directly make "pull requests".
Estimated time frame: could be set up until end of 2018

acka47 · 2017-07-27T09:16:57Z

I discussed with Ira that @dr0i will get access to the DBS database. If Gerhard is back in September he will instruct @dr0i.

acka47 · 2017-09-18T07:52:05Z

We will have a meeting with DBS in October where we will probably get access to the database so that we can set up regular exports of DBS data.

acka47 · 2017-10-25T06:48:55Z

Renamed the issue to set up regular DBS updates. We now have access to the database. @dr0i had some questions regarding the next steps.

The current version of DBS export we use for transformation can be found at http://quaoar1.hbz-nrw.de:7001/assets/data/dbs.zip (see https://github.com/hbz/lobid-organisations/blob/master/README.textile#deployment).

I think we made sure in the past that we want the DBS export as near to the original data as possible. Thus, the column names in the csv should be named the same as those in the data base. In the past we exported the following information (though we don't use it all): inr, iso, bib_typ, nam, plz, ort, str, stk_2007, sbi, isil, tvw, tel, fax, ema, url, opa, sta, oef, typ_text, utr_text, gro_text, leitung

fsteeg · 2017-10-25T07:29:11Z

Note that the file at http://quaoar1.hbz-nrw.de:7001/assets/data/dbs.zip contains manual changes, which ideally should be merged with a new database export.

acka47 · 2017-10-25T07:44:12Z

Note that the file at http://quaoar1.hbz-nrw.de:7001/assets/data/dbs.zip contains manual changes, which ideally should be merged with a new database export.

I communicated about those and especially other changes with the DBS colleagues. So the DBS source should be much better than what we currently have in dbs.zip. No need to merge.

dr0i · 2017-11-17T08:37:01Z

@acka47 for your initial test use sol@weywot2. That's our staging environment.

acka47 · 2017-11-17T09:51:56Z

We get nearly all relevant data with SELECT * FROM dbs.adresse;. I put the row names and two examples rows at https://gist.github.com/acka47/67199ec585f1f6adeea3176b0d69dd76.

Filter out active libraries with status_2_id = 1: We only want to import the active libraries to lobid, this means to only get rows with value 1 in status_2_id.

We will have to adjust the transformation a bit as column names differ from the current dbs.csv file. Here is a comparison old -> new:

inr -> inr
iso -> iso
bib_typ -> fb
nam -> name
plz -> plz
ort -> ort
str -> strasse
stk_2007 -> gvz_id
sbi -> sigel
isil -> isigel
tvw -> vorwahl (with other syntax, e.g. "(0221)" instead of "-221")
tel -> tel
fax -> fax
ema -> email
url -> url
opa -> opac
sta -> ?? (Not sure, but I think we don't use this information anyway.)
oef -> oeffnungszeiten
typ_text -> String is in another table, but I think we use ~~bib_typ/fb~~ typ_id anyway.
gro_text -> bestandsgroessenklasse_id (We will have to adjust the mapping here as we will get IDs for the Bestandsgrößenklasse ("1" to "11") instead of textual values like "100.001 - 300.000", see the csv at https://gist.github.com/acka47/f6a248a46249367becc681774d9d0b14
leitung -> leitung

Missing field "Unterhaltsträger"

What is missing in dbs.adresse is information on Unterhaltsträger. This is in another table dbs.utr_von where a DBS ID is associated with a Unterhaltsträger ID (1-15, see this gist). So, @dr0i was right that we have to join different tables, at least in this case.

*Additional fields

We should add sortiername as alternateName.
We should add postfach
We should use fachstelle_inr to add a link to the responsible Fachstelle
We should think about how to add the information from dbv_id

acka47 · 2017-11-17T10:12:56Z

I tried to join the tables without any previous SQL experience and finally this did work:

SELECT dbs.adresse.*, dbs.utr_von.utr_id
FROM dbs.adresse
INNER JOIN dbs.utr_von ON dbs.adresse.inr = dbs.utr_von.adresse_inr;

acka47 · 2017-11-17T10:16:22Z

Here is the query with filtering out non-active libraries:

SELECT dbs.adresse.*, dbs.utr_von.utr_id
FROM dbs.adresse
INNER JOIN dbs.utr_von ON dbs.adresse.inr = dbs.utr_von.adresse_inr
WHERE dbs.adresse.status_2_id = 1;

See #94 (comment)

See #94

dr0i · 2018-07-26T11:50:10Z

Because of DSGVO it was decided to not allow any updates. Closing.

acka47 · 2018-07-26T11:56:30Z

Reopening as the status of DBS data access is still unclear.

acka47 · 2018-08-22T12:32:51Z

It looks like there currently is no DBS data at all in lobid-organisations: https://lobid.org/organisations/search?q=_exists_%3AdbsID+AND+NOT+_exists_%3Aisil

dr0i · 2018-09-06T16:45:44Z

Transformation fails with (see application.log):

Transformation failed
java.lang.IllegalArgumentException: wrong number of columns (expected 18, was 19) in input line:

Comparing the old csv and the new one:
in the old csv there are sometimes commatas part of a value e.g.

... ,"Mo-Fr.: 8-24, Sa: 9-24 Uhr", ...

and these possible fields are then in quotes ("csv with enclosed fields"). In the new dbs-dump these quotes are missing.
So, my sed-command to substitute tabs with commata is not enough, it should be like
sed 's#^#"#g;s#\t#","#g;s#$#"#g'. Don't know if this will work correctly, though, because there exist many fields without quotes in the old csv (the new one would then always have a comma and quote).
Tabs as delimiter (in generally: a delimiter that's not used in the value) are so much easier. However I will write to Therese N. to update the postprocessing regex and see if the transformation will work correctly.

acka47 · 2018-09-06T17:19:06Z

I don't know whether you interpreted the problem correctly. We actually removed one column (fax) from the SQL query, so it is correct that there is one less. So we must adjust the code there to expect 18 columns.

dr0i · 2018-09-07T07:07:24Z

Yeah, me too saw that at first as the most plausible source of the error. I admit to have had problems to easily find where this number is specified, but I believe the first line (the description of the csv columns) is analyzed. There is now correctly the "fax" missing, and, counting the commatas, this makes exactly 18 defined columns. Now, my example of the "opening hours" shows that the data has in fact 19 columns because the quotes are missing.
Also I've made a quick test by manually correcting a line and index that and this was good without errors.

dr0i · 2018-09-07T08:08:13Z

Sent Therese this regex: sed 's#^#"#g;s#\t#","#g;s#$#"#g'.

dr0i · 2018-09-10T08:12:27Z

Update finished. Looks good to me! Assigning @acka47 for review.

acka47 · 2018-09-10T08:38:41Z

+1 Did you already implement an automatic update procedure for DBS data?

acka47 · 2018-09-10T08:41:40Z

As discussed offline, @dr0i already implemented automatic updates. We will have to check whether this works next week.

dr0i · 2018-09-17T12:44:45Z

Seems to work, at least:

file is there
organisations update was triggered

Closing.

acka47 mentioned this issue Jan 7, 2016

Organisation with wrong ISIL/ID #53

Closed

fsteeg mentioned this issue Apr 13, 2016

lobid-organisations frontend #132

Closed

fsteeg mentioned this issue Jul 7, 2016

Enable third party updates to lobid-organisations data #226

Open

acka47 mentioned this issue May 9, 2017

Wrong organisations entry #354

Closed

acka47 self-assigned this May 10, 2017

acka47 added the working label May 10, 2017

acka47 added ready and removed working labels May 31, 2017

acka47 mentioned this issue Sep 18, 2017

Duplicate entries #358

Closed

acka47 mentioned this issue Sep 21, 2017

Can't find out provenance of DE-Frei99 (should now be DE-25-99) #381

Closed

acka47 changed the title ~~Define correction workflow for DBS data~~ Implement regular DBS updates Oct 25, 2017

acka47 assigned dr0i and unassigned acka47 Oct 25, 2017

acka47 assigned acka47 and unassigned dr0i Nov 17, 2017

acka47 added working and removed ready labels Nov 17, 2017

acka47 added a commit that referenced this issue Nov 17, 2017

Adjust field names to new export

7ff8621

See #94 (comment)

acka47 added a commit that referenced this issue Nov 17, 2017

Adjust fields & mappings for controlled values in DBS

40d0eec

See #94 (comment)

acka47 added a commit that referenced this issue Nov 17, 2017

Adjust FileOpener configuration for new DBS export

c367ece

See #94

dr0i closed this as completed Jul 26, 2018

dr0i removed the ready label Jul 26, 2018

acka47 reopened this Aug 22, 2018

acka47 added ready and removed ready labels Sep 6, 2018

dr0i added review and removed working labels Sep 7, 2018

dr0i assigned acka47 and unassigned dr0i Sep 10, 2018

dr0i added review and removed deploy labels Sep 10, 2018

acka47 assigned dr0i and unassigned acka47 Sep 10, 2018

dr0i closed this as completed Sep 17, 2018

dr0i removed the review label Sep 17, 2018

TobiasNx mentioned this issue Aug 9, 2023

Identify the several libraries and ZDB-packages that are more in test #479

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement regular DBS updates #94

Implement regular DBS updates #94

acka47 commented Jan 7, 2016

acka47 commented May 10, 2017

acka47 commented Jul 27, 2017

acka47 commented Sep 18, 2017

acka47 commented Oct 25, 2017

fsteeg commented Oct 25, 2017

acka47 commented Oct 25, 2017

dr0i commented Nov 17, 2017

acka47 commented Nov 17, 2017 •

edited

acka47 commented Nov 17, 2017

acka47 commented Nov 17, 2017

dr0i commented Jul 26, 2018

acka47 commented Jul 26, 2018

acka47 commented Aug 22, 2018

dr0i commented Sep 6, 2018 •

edited

acka47 commented Sep 6, 2018

dr0i commented Sep 7, 2018 •

edited

dr0i commented Sep 7, 2018

dr0i commented Sep 10, 2018

acka47 commented Sep 10, 2018

acka47 commented Sep 10, 2018

dr0i commented Sep 17, 2018 •

edited

Implement regular DBS updates #94

Implement regular DBS updates #94

Comments

acka47 commented Jan 7, 2016

acka47 commented May 10, 2017

Short-term: Notify DBS regarding changes

Mid- to long-term: move DBS-Stammdaten management to another system

acka47 commented Jul 27, 2017

acka47 commented Sep 18, 2017

acka47 commented Oct 25, 2017

fsteeg commented Oct 25, 2017

acka47 commented Oct 25, 2017

dr0i commented Nov 17, 2017

acka47 commented Nov 17, 2017 • edited

acka47 commented Nov 17, 2017

acka47 commented Nov 17, 2017

dr0i commented Jul 26, 2018

acka47 commented Jul 26, 2018

acka47 commented Aug 22, 2018

dr0i commented Sep 6, 2018 • edited

acka47 commented Sep 6, 2018

dr0i commented Sep 7, 2018 • edited

dr0i commented Sep 7, 2018

dr0i commented Sep 10, 2018

acka47 commented Sep 10, 2018

acka47 commented Sep 10, 2018

dr0i commented Sep 17, 2018 • edited

acka47 commented Nov 17, 2017 •

edited

dr0i commented Sep 6, 2018 •

edited

dr0i commented Sep 7, 2018 •

edited

dr0i commented Sep 17, 2018 •

edited