Log Analytics could detect log lines that were already imported and skip them automatically #144
Log Analytics is a powerful tool of the Piwik platform, used by thousands of people in many interesting use cases. It is relatively easy to use and offers many options and features. We like to make our tools as easy to use as possible... this issue is about making Log Analytics easier to use and even more flexible for people.
Issue: the log data is not deduplicated
When you import logs into Piwik, Piwik always imports and tracks every log line. When you import the same log file again, it is processed again by the Tracking API, and the data ends up duplicated in the Piwik database.
Why this is not good enough
Our users rightfully expect Piwik to be easy to use and to do the right thing. Recently @Synchro reported this issue and did not expect Log Analytics to import the same data again and again. See the description at: matomo-org/matomo#10248 (comment)
Over the years many users have reported experiencing this issue.
So far most people manage to use Log Analytics despite this limitation. The common workaround is to create one log file per hour or per day, and to import each log file only once. Commonly, people write a script which makes sure that each log file is imported exactly once; for example, log files may be ingested into Piwik while or after they are rotated. A wrapper along these lines is sketched below.
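For illustration, a minimal sketch of such a wrapper. The marker-file scheme, paths, and file names here are assumptions for the example; only import_logs.py and its --url option come from Log Analytics:

```python
#!/usr/bin/env python
"""Import each rotated log file into Piwik exactly once."""
import glob
import os
import subprocess

PIWIK_URL = "http://piwik.example.com/"  # assumption: your Piwik base URL
STATE_FILE = "imported_files.txt"        # hypothetical list of finished imports

def already_imported():
    if not os.path.exists(STATE_FILE):
        return set()
    with open(STATE_FILE) as f:
        return set(line.strip() for line in f)

def main():
    done = already_imported()
    # Only rotated files (e.g. access.log.1) are stable; the live
    # access.log may still be appended to, so it is skipped here.
    for path in sorted(glob.glob("/var/log/apache2/access.log.*")):
        if path in done:
            continue
        subprocess.check_call(
            ["python", "import_logs.py", "--url=" + PIWIK_URL, path])
        with open(STATE_FILE, "a") as f:
            f.write(path + "\n")

if __name__ == "__main__":
    main()
```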
Ideally, we do not want people to worry whether they have imported a given log file, or even whether a log file was partially imported before and is re-imported again. We want Piwik to automatically deduplicate the tracking API data.
So far I see two possible ways to fix this issue:
1. New Piwik Tracking API feature: request ID deduplicator
The Tracking API could introduce a new feature letting Tracking API users specify a unique request ID with each tracking request, so that requests seen before can be detected and skipped.
Log Analytics would then simply create a request ID for each parsed log line and pass it along with the tracking request, letting the Tracking API deduplicate the requests. Log Analytics could derive this request ID, for example, as a hash of the log line.
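A minimal sketch of the Log Analytics side, assuming a hypothetical req_id parameter (no such Tracking API parameter exists today; idsite, rec, and url are real parameters):

```python
import hashlib

def request_id(log_line):
    """Derive a stable request ID from the raw log line."""
    return hashlib.sha1(log_line.encode("utf-8")).hexdigest()

# Hypothetical tracking request: 'req_id' illustrates what the proposed
# deduplicator could look like; it is not an existing API parameter.
line = '203.0.113.4 - - [10/Feb/2017:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
params = {
    "idsite": 1,
    "rec": 1,
    "url": "http://example.com/index.html",
    "req_id": request_id(line),  # the server would skip a req_id it has already seen
}
```

One caveat with hashing the raw line: two byte-identical lines (say, the same client requesting the same URL twice within one second) would share a request ID, so a legitimate repeated hit could be dropped.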
2. Implement request ID deduplicator in Log Analytics only
Alternatively, we could implement this feature exclusively in Log Analytics, making the tool clever enough to send each log line's tracking data to the Piwik Tracking API only once.
The Log Analytics Python app could, for example, keep track of the log files that were imported before, as well as the request IDs/hashes of all log lines imported before, indexed by date, perhaps in an SQLite database.
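A minimal sketch of such a store, assuming SQLite; the database file name and schema are inventions for this example, not part of Log Analytics today:

```python
import hashlib
import sqlite3

# Sketch of the proposed local deduplication store.
conn = sqlite3.connect("piwik_import_state.db")
conn.execute("""CREATE TABLE IF NOT EXISTS imported_lines (
                    hash TEXT PRIMARY KEY,  -- SHA-1 of the raw log line
                    day  TEXT NOT NULL      -- lets old entries be pruned by date
                )""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_day ON imported_lines(day)")

def seen_before(log_line, day):
    """Record the line's hash; return True if it was already imported."""
    h = hashlib.sha1(log_line.encode("utf-8")).hexdigest()
    cur = conn.execute(
        "INSERT OR IGNORE INTO imported_lines (hash, day) VALUES (?, ?)",
        (h, day))
    conn.commit()
    return cur.rowcount == 0  # nothing inserted -> hash was already there
```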
This feature would be awesome to have, and would make Log Analytics much more flexible and easier to use and set up.
What do you think?
Simply keeping track of the last imported log line of a file would suffice. Only records after this log line would need to be imported.
To handle log file rotation (the last tracked line is no longer contained in the log file, so nothing would be imported at all), the first line of the log file would also need to be tracked. If the first line does not match the tracked first line, the log file has been rotated and the whole file needs to be imported.
Maybe look into how the
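A minimal sketch of this first-line bookkeeping. The JSON state file is a hypothetical stand-in, and storing a line count is an equivalent simplification of remembering the last imported line, given the first-line rotation check:

```python
import json
import os

STATE_FILE = "logfile_state.json"  # hypothetical per-file bookkeeping store

def new_lines(path):
    """Return only the lines of `path` that were not imported before."""
    state = {}
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            state = json.load(f)

    with open(path) as f:
        lines = f.readlines()
    if not lines:
        return []

    previous = state.get(path, {})
    if previous.get("first_line") != lines[0]:
        # First line differs from what we tracked: the file was rotated
        # (or is seen for the first time), so import everything.
        start = 0
    else:
        start = previous.get("count", 0)

    state[path] = {"first_line": lines[0], "count": len(lines)}
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)
    return lines[start:]
```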
@mattab Thanks for the clear issue description.
I am all for an improvement. I recall that in my tests not so long ago, importing the same file twice did not result in a doubled view count, but I might be mistaken. Is there any news on this issue?
In my use case it is important to be able to import log files to fill "gaps" left by network or other issues. Say I import weekly via cron and realize that one week in the last month was skipped. That is why @cweiske's approach would not help ME very much.
I think one can line up the use cases and solutions and sort them roughly from "not bad" to "clean" (imho @cweiske's fair proposition is on the "not bad" end).
Obviously the problem is that many different use cases exist, and the cleanest implementation has probably been laid out by @mattab already. For my use case I would chip in another "not bad" approach/workaround: keeping a hash of the entire log file in Log Analytics only. That would already solve my issue. Bonus points for storing the first and last request time with it, so that it would be easy to DELETE all the visits of a particular log file (say, the log format changed/was extended and we want to add more information in hindsight). In every case, log analytics should implement a
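A rough sketch of that whole-file-hash bookkeeping, with the first/last request times stored alongside; the file names and layout here are assumptions for the example:

```python
import hashlib
import json
import os

STATE_FILE = "imported_file_hashes.json"  # hypothetical store

def file_hash(path):
    """SHA-1 over the whole file, so identical content is recognised."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def load_state():
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {}

def already_imported(path):
    return file_hash(path) in load_state()

def record_import(path, first_request, last_request):
    """Remember the file plus its request time range, so the matching
    visits could later be DELETEd (e.g. after a log format change)."""
    state = load_state()
    state[file_hash(path)] = {
        "file": path,
        "first_request": first_request,
        "last_request": last_request,
    }
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)
```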
I'm not sure if this feature is implemented yet, but I'm interested in when it would be.
I didn't solve it - I stopped using Matomo. I'm not at all interested in JS-based tools; I only want offline log analysis.
One really good tool that helps with log files is AWStats' logresolvemerge.pl. This utility can take any number of log files (compressed or uncompressed) and merge them, removing duplicates, sorting lines by timestamp, and performing DNS lookups. It really "just works" and is also pretty fast. Matomo could do with a tool like that, so it could either be ported or simply used as is, despite the language mismatch.
I once wrote some utilities in PHP for processing logs, allowing split/merge/filtering by date, including using
Thanks @Synchro for the reference. However, I don't see how this AWStats Perl script solves the issue for log entries that are already recorded in Matomo's database.