Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement regular DBS updates #94

Closed
acka47 opened this issue Jan 7, 2016 · 40 comments
Closed

Implement regular DBS updates #94

acka47 opened this issue Jan 7, 2016 · 40 comments
Assignees

Comments

@acka47
Copy link
Contributor

acka47 commented Jan 7, 2016

To be discussed with DBS colleagues. Questions (amongst others):

  • How often do we get regular updates of DBS data?
  • What happens when people report errors in lobid-organisations entires derived from DBS?
@acka47
Copy link
Contributor Author

acka47 commented May 10, 2017

I will speak to Ira soon about this. There are different options – short-term and mid-/long-term.

Short-term: Notify DBS regarding changes

  • Notify DBS staff about necessary corrections like here and here.
  • We could
    1. mail the corrections to them and ask them for an updated export as soon as the data is corrected.
    2. do the corrections ourselves and send them a link to the diff. For this, the DBS csv file would have to be in a git repository somewhere. (I think currently this is not the case.)

Mid- to long-term: move DBS-Stammdaten management to another system

  • Move administration of DBS Stammdaten to the to-be-built hbz customer relationship graph.
  • This will enable workflows where more people can directly make "pull requests".
  • Estimated time frame: could be set up until end of 2018

@acka47 acka47 added ready and removed working labels May 31, 2017
@acka47
Copy link
Contributor Author

acka47 commented Jul 27, 2017

I discussed with Ira that @dr0i will get access to the DBS database. If Gerhard is back in September he will instruct @dr0i.

@acka47
Copy link
Contributor Author

acka47 commented Sep 18, 2017

We will have a meeting with DBS in October where we will probably get access to the database so that we can set up regular exports of DBS data.

@acka47 acka47 changed the title Define correction workflow for DBS data Implement regular DBS updates Oct 25, 2017
@acka47 acka47 assigned dr0i and unassigned acka47 Oct 25, 2017
@acka47
Copy link
Contributor Author

acka47 commented Oct 25, 2017

Renamed the issue to set up regular DBS updates. We now have access to the database. @dr0i had some questions regarding the next steps.

The current version of DBS export we use for transformation can be found at http://quaoar1.hbz-nrw.de:7001/assets/data/dbs.zip (see https://github.com/hbz/lobid-organisations/blob/master/README.textile#deployment).

I think we made sure in the past that we want the DBS export as near to the original data as possible. Thus, the column names in the csv should be named the same as those in the data base. In the past we exported the following information (though we don't use it all): inr, iso, bib_typ, nam, plz, ort, str, stk_2007, sbi, isil, tvw, tel, fax, ema, url, opa, sta, oef, typ_text, utr_text, gro_text, leitung

@fsteeg
Copy link
Member

fsteeg commented Oct 25, 2017

Note that the file at http://quaoar1.hbz-nrw.de:7001/assets/data/dbs.zip contains manual changes, which ideally should be merged with a new database export.

@acka47
Copy link
Contributor Author

acka47 commented Oct 25, 2017

Note that the file at http://quaoar1.hbz-nrw.de:7001/assets/data/dbs.zip contains manual changes, which ideally should be merged with a new database export.

I communicated about those and especially other changes with the DBS colleagues. So the DBS source should be much better than what we currently have in dbs.zip. No need to merge.

@acka47 acka47 assigned acka47 and unassigned dr0i Nov 17, 2017
@acka47 acka47 added working and removed ready labels Nov 17, 2017
@dr0i
Copy link
Member

dr0i commented Nov 17, 2017

@acka47 for your initial test use sol@weywot2. That's our staging environment.

@acka47
Copy link
Contributor Author

acka47 commented Nov 17, 2017

We get nearly all relevant data with SELECT * FROM dbs.adresse;. I put the row names and two examples rows at https://gist.github.com/acka47/67199ec585f1f6adeea3176b0d69dd76.

Filter out active libraries with status_2_id = 1: We only want to import the active libraries to lobid, this means to only get rows with value 1 in status_2_id.

We will have to adjust the transformation a bit as column names differ from the current dbs.csv file. Here is a comparison old -> new:

  • inr -> inr
  • iso -> iso
  • bib_typ -> fb
  • nam -> name
  • plz -> plz
  • ort -> ort
  • str -> strasse
  • stk_2007 -> gvz_id
  • sbi -> sigel
  • isil -> isigel
  • tvw -> vorwahl (with other syntax, e.g. "(0221)" instead of "-221")
  • tel -> tel
  • fax -> fax
  • ema -> email
  • url -> url
  • opa -> opac
  • sta -> ?? (Not sure, but I think we don't use this information anyway.)
  • oef -> oeffnungszeiten
  • typ_text -> String is in another table, but I think we use bib_typ/fb typ_id anyway.
  • gro_text -> bestandsgroessenklasse_id (We will have to adjust the mapping here as we will get IDs for the Bestandsgrößenklasse ("1" to "11") instead of textual values like "100.001 - 300.000", see the csv at https://gist.github.com/acka47/f6a248a46249367becc681774d9d0b14
  • leitung -> leitung

Missing field "Unterhaltsträger"

What is missing in dbs.adresse is information on Unterhaltsträger. This is in another table dbs.utr_von where a DBS ID is associated with a Unterhaltsträger ID (1-15, see this gist). So, @dr0i was right that we have to join different tables, at least in this case.

*Additional fields

  • We should add sortiername as alternateName.
  • We should add postfach
  • We should use fachstelle_inr to add a link to the responsible Fachstelle
  • We should think about how to add the information from dbv_id

@acka47
Copy link
Contributor Author

acka47 commented Nov 17, 2017

I tried to join the tables without any previous SQL experience and finally this did work:

SELECT dbs.adresse.*, dbs.utr_von.utr_id
FROM dbs.adresse
INNER JOIN dbs.utr_von ON dbs.adresse.inr = dbs.utr_von.adresse_inr;

@acka47
Copy link
Contributor Author

acka47 commented Nov 17, 2017

Here is the query with filtering out non-active libraries:

SELECT dbs.adresse.*, dbs.utr_von.utr_id
FROM dbs.adresse
INNER JOIN dbs.utr_von ON dbs.adresse.inr = dbs.utr_von.adresse_inr
WHERE dbs.adresse.status_2_id = 1;

@dr0i
Copy link
Member

dr0i commented Jul 26, 2018

Because of DSGVO it was decided to not allow any updates. Closing.

@dr0i dr0i closed this as completed Jul 26, 2018
@dr0i dr0i removed the ready label Jul 26, 2018
@acka47
Copy link
Contributor Author

acka47 commented Jul 26, 2018

Reopening as the status of DBS data access is still unclear.

@acka47 acka47 reopened this Aug 22, 2018
@acka47
Copy link
Contributor Author

acka47 commented Aug 22, 2018

It looks like there currently is no DBS data at all in lobid-organisations: https://lobid.org/organisations/search?q=_exists_%3AdbsID+AND+NOT+_exists_%3Aisil

@acka47 acka47 added ready and removed ready labels Sep 6, 2018
@dr0i
Copy link
Member

dr0i commented Sep 6, 2018

Transformation fails with (see application.log):

Transformation failed
java.lang.IllegalArgumentException: wrong number of columns (expected 18, was 19) in input line:

Comparing the old csv and the new one:
in the old csv there are sometimes commatas part of a value e.g.

... ,"Mo-Fr.: 8-24, Sa: 9-24 Uhr", ...

and these possible fields are then in quotes ("csv with enclosed fields"). In the new dbs-dump these quotes are missing.
So, my sed-command to substitute tabs with commata is not enough, it should be like
sed 's#^#"#g;s#\t#","#g;s#$#"#g'. Don't know if this will work correctly, though, because there exist many fields without quotes in the old csv (the new one would then always have a comma and quote).
Tabs as delimiter (in generally: a delimiter that's not used in the value) are so much easier. However I will write to Therese N. to update the postprocessing regex and see if the transformation will work correctly.

@acka47
Copy link
Contributor Author

acka47 commented Sep 6, 2018

I don't know whether you interpreted the problem correctly. We actually removed one column (fax) from the SQL query, so it is correct that there is one less. So we must adjust the code there to expect 18 columns.

@dr0i
Copy link
Member

dr0i commented Sep 7, 2018

Yeah, me too saw that at first as the most plausible source of the error. I admit to have had problems to easily find where this number is specified, but I believe the first line (the description of the csv columns) is analyzed. There is now correctly the "fax" missing, and, counting the commatas, this makes exactly 18 defined columns. Now, my example of the "opening hours" shows that the data has in fact 19 columns because the quotes are missing.
Also I've made a quick test by manually correcting a line and index that and this was good without errors.

@dr0i
Copy link
Member

dr0i commented Sep 7, 2018

Sent Therese this regex: sed 's#^#"#g;s#\t#","#g;s#$#"#g'.

@dr0i dr0i added review and removed working labels Sep 7, 2018
@dr0i
Copy link
Member

dr0i commented Sep 10, 2018

Update finished. Looks good to me! Assigning @acka47 for review.

@dr0i dr0i assigned acka47 and unassigned dr0i Sep 10, 2018
@dr0i dr0i added review and removed deploy labels Sep 10, 2018
@acka47
Copy link
Contributor Author

acka47 commented Sep 10, 2018

+1 Did you already implement an automatic update procedure for DBS data?

@acka47 acka47 assigned dr0i and unassigned acka47 Sep 10, 2018
@acka47
Copy link
Contributor Author

acka47 commented Sep 10, 2018

As discussed offline, @dr0i already implemented automatic updates. We will have to check whether this works next week.

@dr0i
Copy link
Member

dr0i commented Sep 17, 2018

Seems to work, at least:

  • file is there
  • organisations update was triggered

Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants