log import: avoid importing duplicates #12622

Open
sebalis opened this Issue Mar 16, 2018 · 8 comments

@sebalis

sebalis commented Mar 16, 2018

If the import script is called on the same logfile more than once, entries imported in the first run and still present in the file during the second run are imported twice (at least when I tested it on Apache logs). This creates a few problems: running the script has to be tied to the web server rotating the file, and new entries cannot be seen in Matomo until the log has been rotated and the importer run again.

It would be much better if it were possible to run the importer frequently, maybe even every minute. In order to do this, the importer needs to be able to know which entries have already been imported. I don’t know how to implement this in detail. I suppose it should be reasonable to expect that the log entries have time data with one-second granularity. So by remembering the time of the latest entry already imported and perhaps adding some logic for re-identifying entries from that last second, it would become feasible to run the importer as often as one would like without messing up the data. What do you think? For me this might make the difference between choosing Matomo or some other log analysis software.
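
The idea above (remember the time of the latest imported entry, plus the entries seen in that last second) could be sketched roughly as follows. This is only an illustration and not part of the importer: `parse_time` and `select_new_lines` are hypothetical names, and the sketch assumes Apache common/combined log lines with a bracketed timestamp and an English month abbreviation.

```python
import re
from datetime import datetime

# Matches the bracketed timestamp of an Apache common/combined log line,
# e.g. [16/Mar/2018:10:15:02 +0100]. Assumes the default Apache time format.
TIMESTAMP_RE = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def parse_time(line):
    """Return the timezone-aware timestamp of a log line, or None if absent."""
    match = TIMESTAMP_RE.search(line)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")


def select_new_lines(lines, last_time=None, seen_in_last_second=frozenset()):
    """Pick the lines that were not handled by a previous run.

    last_time is the newest timestamp seen by the previous run and
    seen_in_last_second holds the raw lines carrying that timestamp.
    Returns (new_lines, newest_time, lines_with_newest_time), which the
    caller would persist for the next run.
    """
    new_lines = []
    newest = last_time
    seen = set(seen_in_last_second)
    for line in lines:
        line = line.rstrip("\n")
        when = parse_time(line)
        if when is None:
            continue  # not a recognisable log line
        if last_time is not None:
            if when < last_time:
                continue  # strictly older: already imported
            if when == last_time and line in seen_in_last_second:
                continue  # same second and same text: already imported
        new_lines.append(line)
        if newest is None or when > newest:
            newest = when
            seen = {line}
        elif when == newest:
            seen.add(line)
    return new_lines, newest, seen
```

One known limitation of this sketch: two byte-identical requests within the same second cannot be told apart, so a repeat that arrives after the previous run but within its last second would be skipped.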

@fdellwing

Contributor

fdellwing commented Mar 16, 2018

There is no other open source analytics software with this many features, but I'm with you nonetheless.

Personally, I import access.log.1 every day at 0:00, and log rotation happens at 5:00. So I have 2-day-old logs, but no duplicates.

@sebalis

Author

sebalis commented Mar 18, 2018

Matomo looks very good, I would like to use it – although for privacy reasons I will restrict myself to the log importer, which reduces the difference to other products. Also this makes it all the more important to get as much accuracy and timeliness out of the importer as possible.

@fdellwing

Contributor

fdellwing commented Mar 19, 2018

To respect privacy you should imho use the JS tracker, because it respects DNT and other tracking blockers, while log import does not.

@sebalis

Author

sebalis commented Mar 19, 2018

Using JS trackers is out of the question – I do see your point about DNT but let’s not even begin to discuss that. And with my concerns I do of course anonymise my logs (by zeroing the final two bytes of the IP).

@mackuba


mackuba commented Aug 4, 2018

I see that the import_logs script has an --exclude-older-than option (added in December) - would that work, with some kind of "last import" flag that's kept in a file and updated whenever the log is parsed, and then passed to that option? Anyway, I'm planning to set it up this way myself :)
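
A rough sketch of that "last import" file approach might look like this. The helper names, the state file, and the script path are hypothetical; `--url`, `--idsite` and `--exclude-older-than` are existing import_logs options, but the exact date format the last one expects should be checked against the script's `--help` output.

```python
from pathlib import Path


def read_last_import(state_file):
    """Return the stored timestamp string, or None if nothing was stored yet."""
    path = Path(state_file)
    if not path.exists():
        return None
    text = path.read_text().strip()
    return text or None


def write_last_import(state_file, timestamp):
    """Remember the timestamp to pass to the importer on the next run."""
    Path(state_file).write_text(timestamp + "\n")


def build_command(log_file, matomo_url, site_id, last_import=None,
                  script="import_logs.py"):
    """Build the argument list for one importer run (it is not executed here)."""
    command = ["python", script, f"--url={matomo_url}", f"--idsite={site_id}"]
    if last_import is not None:
        # Skip everything the previous run has already covered.
        command.append(f"--exclude-older-than={last_import}")
    command.append(log_file)
    return command
```

A cron job could then call `read_last_import`, pass the resulting list to `subprocess.run`, and call `write_last_import` only after the importer exits successfully, so that a failed run is retried from the same point.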

@sebalis

Author

sebalis commented Nov 27, 2018

Sorry for the late response. I haven’t tested it as my workaround was to set up a job to import the ‘.1’ log file (the first ‘rotated’ one). This was possible since the rotation at this site takes place at regular intervals. But it does seem that this option would work. I might be interested in using it for another case where the logs do not rotate so regularly but don’t know when I will get round to it. Feel free to close the issue.

@sebalis

Author

sebalis commented Nov 27, 2018

One minor quibble: --exclude-older-than t₁ would appear to restrict the import to records with a time t ≥ t₁. If I have imported a logfile I know the time t₀ of the latest record I have imported, so it would be convenient to restrict to t > t₀. It seems like I will have to calculate t₁ = t₀ + 1s in order to use this option. Something like --only-newer-than would be slightly better.
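
The t₁ = t₀ + 1s calculation is small enough to sketch. The function name is hypothetical, and the output format `YYYY-MM-DD hh:mm:ss +0000` is an assumption about what the option accepts. The sketch also relies on the one-second log granularity and on the option keeping records with t ≥ t₁, as described above.

```python
from datetime import datetime, timedelta, timezone


def exclude_older_than_value(last_imported, fmt="%Y-%m-%d %H:%M:%S %z"):
    """Turn t0 (newest imported record, timezone-aware) into t1 = t0 + 1s."""
    return (last_imported + timedelta(seconds=1)).strftime(fmt)


# Example: the newest imported record is one second before midnight, UTC+1.
t0 = datetime(2018, 11, 27, 23, 59, 59, tzinfo=timezone(timedelta(hours=1)))
print(exclude_older_than_value(t0))  # 2018-11-28 00:00:00 +0100
```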

@mackuba


mackuba commented Nov 27, 2018

I've actually implemented a PR doing something like this in the meantime: matomo-org/matomo-log-analytics#232

And yes, I tried to use --exclude-older-than first until I noticed I need to do > and not ≥ 😉

My first approach was to find the last timestamp using a separate script in Ruby, but then I realized that import_logs.py is already going through all lines and parsing dates and stuff from them, so it makes more sense if it finds the timestamp during the import.
