Anonymize user id's in logs / Matomo #799

Open
ChristophEwertowski opened this issue Apr 13, 2018 · 19 comments

@ChristophEwertowski
Contributor

Sub-issue of hbz/lobid#363. We have to pseudonymise user IDs in logs. I have heard several different figures for how quickly we have to do it.
@dr0i : Does NWBib have its own logs with user IDs, or are all requests to lobid-resources logged in one file?

@ChristophEwertowski
Contributor Author

Some solutions that could perhaps be used are listed at https://wiki.hbz-nrw.de/pages/viewpage.action?pageId=765100087.

@dr0i dr0i removed the bug label Apr 13, 2018
@dr0i
Member

dr0i commented Apr 13, 2018

All Apache logs containing IPs are written to one file. The NWBib web app doesn't log IPs in its own log.

@dr0i
Member

dr0i commented Apr 30, 2018

> At https://wiki.hbz-nrw.de/pages/viewpage.action?pageId=765100087 some solutions are listed, which maybe can be used.

These are anonymizers, not pseudonymizers. I wrote a pseudonymizer myself; the code resides at @weywot1:/export/lobid-files/logs_apache_emphytos/pseudonymizer.sh .
Let's talk about it at Thursday's meeting.
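
To illustrate the distinction drawn above: an anonymizer discards the identifier, while a pseudonymizer replaces it with a stable token so that requests from the same client can still be grouped. The following is a minimal sketch of the second approach, not the `pseudonymizer.sh` script mentioned above. It assumes the client IP is the first whitespace-separated field of each log line, and the key and function names are hypothetical.

```python
import hashlib
import hmac


def pseudonymize_ip(ip: str, key: bytes) -> str:
    """Map an IP to a stable token: same IP and same key give the same token."""
    digest = hmac.new(key, ip.encode("utf-8"), hashlib.sha256).hexdigest()
    return "ip-" + digest[:16]


def pseudonymize_line(line: str, key: bytes) -> str:
    """Replace the first field of a log line (assumed to be the client IP)."""
    ip, sep, rest = line.partition(" ")
    if not sep:
        return line  # no field separator found; leave the line untouched
    return pseudonymize_ip(ip, key) + sep + rest


if __name__ == "__main__":
    key = b"example-secret"  # hypothetical key; must be kept secret in practice
    sample = '192.0.2.1 - - [13/Apr/2018:10:00:00 +0200] "GET / HTTP/1.1" 200 512'
    print(pseudonymize_line(sample, key))
```

Note that the resulting token is no longer a syntactically valid IP address, so any tool that parses the first field as an IP would need a different token format.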

@dr0i
Member

dr0i commented May 3, 2018

After offline discussion: we will set up Matomo, which is also DSGVO (GDPR) compliant.

@dr0i
Member

dr0i commented May 4, 2018

Matomo is set up. It is available under its real name as a subdomain of lobid. Note: only HTTPS is allowed.

@dr0i
Member

dr0i commented May 7, 2018

The log upload triggered at the weekend failed after 12 min; not sure why. I triggered it again and it has now been running for 5 h. A bit weird: no logs are written, even in DEBUG mode. But top shows Apache and the Python script working, so we'll wait it out.

@dr0i
Member

dr0i commented May 7, 2018

Errors occur when the webserver is restarted. As top shows the script working, I am still hopeful. Also, hopefully the "attempt number 2" appearing in the logs doesn't mean "start again from the beginning". If so, my.

@dr0i
Member

dr0i commented May 14, 2018

The "error" messages like "[INFO] Error when connecting to Matomo: HTTP Error 404: Not Found" can be ignored.
The import is as slow as expected, but it works!

Logs import summary

123680924 requests imported successfully
880610 requests were downloads
202411 requests ignored:
    0 HTTP errors
    0 HTTP redirects
    202411 invalid log lines
    0 filtered log lines
    0 requests did not match any known site
    0 requests did not match any --hostname
    0 requests done by bots, search engines...
    0 requests to static resources (css, js, images, ico, ttf...)
    0 requests to file downloads did not match any --download-extensions

Website import summary

123680924 requests imported to 1 sites
    1 sites already existed
    0 sites were created:

0 distinct hostnames did not match any existing site:

Performance summary

Total time: 440650 seconds
Requests imported per second: 280.68 requests per second

It took 5 days to process these logs:
access_log-20170501, access_log-20170601, access_log-20170701, access_log-20170801, access_log-20170901, access_log-20171001, access_log-20171101, access_log-20171201, access_log-20180101, access_log-20180201, access_log-20180301, access_log-20180401, access_log-20180501
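
The figures in the performance summary can be cross-checked with a few lines of arithmetic. This is only a sanity check on the numbers reported above; the variable names are made up for the illustration.

```python
# Figures copied from the import summary above.
requests_imported = 123_680_924
invalid_lines = 202_411
total_seconds = 440_650

rate = requests_imported / total_seconds      # requests per second
days = total_seconds / 86_400                 # 86,400 seconds per day
invalid_share = invalid_lines / (requests_imported + invalid_lines)

print(f"{rate:.2f} requests per second")      # 280.68, matching the summary
print(f"{days:.1f} days")                     # 5.1, i.e. the "5 days" above
print(f"{invalid_share:.2%} invalid log lines")
```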

After the import, an 'archiving' process must be run before Matomo can be used. This took 5.5 h.

TODO:

  • as the ibdata1 MySQL file is 175 GB in size and there is no easy way to split it into smaller pieces, a new Matomo instance with a fresh MySQL installation should be set up on gaia.
  • check whether everything important is there
  • import the rest of the data
  • get rid of the original Apache logs (hm, hm ... prediction of the day: we will regret this)

"Check whether everything important is there" is crucial. As far as I can see, there is currently no way to distinguish lobid.org from its subdomains, nwbib is missing, etc. I remember changing the Apache log syntax years ago to log the subdomain as a column of its own. GoAccess was subsequently configured accordingly. Now one would have to do the same for Matomo. We have two syntactically different logs, and I don't know whether they can be merged into one in Matomo.
I also can't see any referrers. What else is missing, @acka47 ?
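
One way to deal with two syntactically different logs is to normalise both into the same record shape before importing. The sketch below assumes the older format is Apache's standard "combined" format and that the newer one simply prepends the host as an extra first column; the actual column position used on the server is not stated in this thread, so treat the patterns and the `default_host` parameter as assumptions.

```python
import re
from typing import Optional

# Standard Apache "combined" format; referrer and user agent are optional here.
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?$'
)
# Assumed newer format: the same line with the host prepended as first column.
WITH_HOST = re.compile(r'(?P<host>\S+) ' + COMBINED.pattern)


def parse_line(line: str, default_host: str = "lobid.org") -> Optional[dict]:
    """Parse either format into one dict; return None for unparseable lines."""
    line = line.rstrip("\n")
    for pattern in (WITH_HOST, COMBINED):
        match = pattern.match(line)
        if match:
            record = match.groupdict()
            # Lines in the older format carry no host, so fall back to a default.
            record.setdefault("host", default_host)
            return record
    return None


if __name__ == "__main__":
    old = ('192.0.2.1 - - [13/Apr/2018:10:00:00 +0200] '
           '"GET /resources/search?q=x HTTP/1.1" 200 512 "-" "curl/7.58"')
    print(parse_line(old)["host"])
    print(parse_line("nwbib.de " + old)["host"])
```

Requests containing escaped double quotes are not handled by these patterns and would come back as `None`.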

@dr0i dr0i assigned acka47 and unassigned dr0i and ChristophEwertowski May 14, 2018
@dr0i dr0i added review and removed working labels May 14, 2018
@dr0i dr0i added working and removed review labels May 22, 2018
@dr0i
Member

dr0i commented May 22, 2018

As discussed offline:

  • remove bots
  • test whether the nwbib domain is covered. If not, import the two syntactically different logs accordingly
  • check why GeoIP city detection is inaccurate
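
For the "remove bots" item, one crude option is to drop requests whose user agent looks like a crawler before the lines are imported. The following is only an illustrative heuristic with a hypothetical marker list; it is not how Matomo itself detects bots.

```python
# Hypothetical substrings that commonly appear in crawler user agents.
BOT_MARKERS = ("bot", "crawler", "spider", "slurp")


def is_probable_bot(user_agent: str) -> bool:
    """Return True if the user agent contains one of the known markers."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)


if __name__ == "__main__":
    agents = [
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0",
    ]
    for agent in agents:
        print(is_probable_bot(agent), agent)
```

A substring check like this misses bots that send a browser-like user agent, so it can only reduce the noise, not eliminate it.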

@dr0i dr0i closed this as completed May 22, 2018
@dr0i dr0i assigned dr0i and unassigned acka47 and ChristophEwertowski Jun 8, 2018
@dr0i
Member

dr0i commented Jun 19, 2018

Data is imported up to May. Please check. Paths like lobid.org/resources now have their own WebsiteID. Rendering the data for the first time may take a while, but the result is then stored as a preprocessed archived report and displays almost instantly.
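
Giving paths their own WebsiteID amounts to routing each request to a site by host and path prefix. A minimal sketch of that routing follows; the site IDs and the prefix table are invented for the example and do not reflect the real Matomo configuration.

```python
# Hypothetical mapping of (host, path prefix) to a Matomo site ID.
SITE_IDS = {
    ("lobid.org", "/resources"): 2,
    ("lobid.org", "/gnd"): 3,
    ("lobid.org", "/organisations"): 4,
}
DEFAULT_SITE_ID = 1  # hypothetical catch-all site


def site_id_for(host: str, url: str) -> int:
    """Pick the site whose prefix matches the request path; longest prefix wins."""
    path = url.split("?", 1)[0]  # ignore the query string
    best = None
    for (site_host, prefix), site_id in SITE_IDS.items():
        if host != site_host:
            continue
        # Match the prefix itself or anything below it, but not "/resourcesfoo".
        if path == prefix or path.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, site_id)
    return best[1] if best else DEFAULT_SITE_ID


if __name__ == "__main__":
    print(site_id_for("lobid.org", "/resources/search?q=x"))
    print(site_id_for("lobid.org", "/about"))
```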

@dr0i dr0i added the review label Jun 19, 2018
@dr0i dr0i assigned acka47 and ChristophEwertowski and unassigned dr0i Jun 19, 2018
@dr0i dr0i changed the title Pseudonymise user id's in logs / Matomo ~Pseudonymise user id's in logs / Matomo Jun 19, 2018
@dr0i dr0i changed the title ~Pseudonymise user id's in logs / Matomo ~Pseudonymise~ user id's in logs / Matomo Jun 19, 2018
@dr0i dr0i changed the title ~Pseudonymise~ user id's in logs / Matomo Anonymize user id's in logs / Matomo Jun 19, 2018
@acka47
Contributor

acka47 commented Jun 19, 2018

Looks good. I noticed that two widgets only give information for nwbib.de but not for the lobid services:

  1. "Pages following a site search"
  2. "Keywords" in the widget "Referrer Types" -> Search engines

Why is that?

@acka47
Contributor

acka47 commented Jun 21, 2018

Looks better now, but links behind the shown "Pages following a site search" don't work for lobid-gnd and lobid-resources.

@dr0i
Member

dr0i commented Jun 22, 2018

URLs have to be defined one per line, not comma-separated. I have to reindex again.
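
If the URL list already exists in comma-separated form, it can be converted to the one-per-line form mechanically. This is a small sketch with a hypothetical helper name; it assumes none of the URLs contain a comma themselves.

```python
def urls_one_per_line(value: str) -> str:
    """Turn a comma-separated (or mixed) URL list into one URL per line."""
    parts = [part.strip() for part in value.replace("\n", ",").split(",")]
    return "\n".join(part for part in parts if part)


if __name__ == "__main__":
    print(urls_one_per_line("https://lobid.org/gnd, https://lobid.org/resources"))
```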

@dr0i dr0i self-assigned this Jun 22, 2018
@dr0i dr0i added ready and removed review labels Jul 13, 2018
@dr0i
Member

dr0i commented Jul 24, 2018

Reindexing is finished and the data is up to date (i.e. including June).

@dr0i
Member

dr0i commented Sep 10, 2018

Splitting and indexing now run automatically via the crontabs of www-data and root on gaia. The script is version-controlled in the internal git (git@gaia).
