Log Analytics could detect log lines that were already imported and skip them automatically #144

Open
mattab opened this Issue Jul 12, 2016 · 12 comments


mattab commented Jul 12, 2016

Log Analytics is a powerful tool of the Piwik platform, used by thousands of people in many interesting use cases. It is relatively easy to use and offers many options and features. We want to make our tools as easy as possible to use, and this issue is about making Log Analytics easier to use and even more flexible.

Issue: the log data is not deduplicated

When you import logs into Piwik, every log line is imported and tracked. If you import the same log file again, it is imported again by the Tracking API, and the data ends up duplicated in the Piwik database.

Why this is not good enough

Our users rightfully expect Piwik to be easy to use and to do the right thing. Recently @Synchro reported this issue and did not expect Log Analytics to import the same data again and again. See the description at: matomo-org/matomo#10248 (comment)

Over the years many users have reported experiencing this issue.

Existing workaround

So far most people manage to use Log Analytics despite this limitation. The common workaround is to create one log file per hour or per day and to import each log file only once. Commonly, people write a script which ensures that each log file is imported only once, for example by ingesting log files into Piwik while/after they are rotated (see the sketch below).
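
As an illustration of such a wrapper script, here is a minimal sketch that remembers which rotated files were already imported. The paths, the history file, and the import_logs.py invocation are assumptions about a typical setup, not a recommended implementation.

```python
# Minimal sketch of a wrapper that imports each rotated log file only once.
# Paths and the import_logs.py invocation are illustrative assumptions.
import glob
import os
import subprocess

HISTORY = "/var/lib/piwik-import/imported.txt"  # list of files already imported

def already_imported():
    if not os.path.exists(HISTORY):
        return set()
    with open(HISTORY) as fh:
        return set(line.strip() for line in fh)

done = already_imported()
for path in sorted(glob.glob("/var/log/apache2/access.log-*")):
    if path in done:
        continue  # this rotated file was handled by an earlier run
    subprocess.run(
        ["python", "/path/to/piwik/misc/log-analytics/import_logs.py",
         "--url=https://piwik.example.org", path],
        check=True,
    )
    with open(HISTORY, "a") as fh:
        fh.write(path + "\n")
```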

Solution

Ideally, people should not have to worry about whether they have already imported a given log file, or even whether a log file was partially imported before and is being re-imported. We want Piwik to automatically deduplicate the Tracking API data.

So far I see two possible ways to fix this issue:

1. New Piwik Tracking API feature: request ID deduplicator

The Tracking API could introduce a new feature that lets Tracking API users specify a request ID for a given request. Piwik would store the request ID for each request and use it as a unique key. If a tracking request with a given request ID has already been tracked/imported for a given date, the request would be skipped. Each request ID would be imported at most once per day.

Log Analytics would then simply create a request ID for each parsed log line and pass it along with the tracking request, letting the Tracking API deduplicate the requests. Log Analytics could create this request ID as a hash of the log line, for example (see the sketch below).

  • Pros: other Tracking API SDKs and clients will be able to use this feature to deduplicate the data.
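
A minimal sketch of what Log Analytics could compute per line, assuming a hypothetical rid tracking parameter that carries the deduplication key (no such parameter exists in the Tracking API today):

```python
# Minimal sketch: derive a stable request ID from a raw log line and attach it
# to the tracking request. The "rid" parameter name is hypothetical.
import hashlib

def request_id_for_line(raw_line: str) -> str:
    """Return a stable hash of the raw log line, usable as a request ID."""
    return hashlib.sha1(raw_line.encode("utf-8", errors="replace")).hexdigest()

line = '1.2.3.4 - - [12/Jul/2016:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512'
params = {
    "idsite": 1,
    "url": "http://example.org/index.html",
    "rid": request_id_for_line(line),  # hypothetical deduplication key
}
print(params)
```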

2. Implement request ID deduplicator in Log Analytics only

Alternatively, we could implement this feature exclusively in Log Analytics and make the tool clever enough to send each log line's tracking data to the Piwik Tracking API only once.

The Log Analytics Python app could, for example, keep track of the log files that were imported before, as well as the request IDs / hashes of all the log lines that were imported before, indexed by date, perhaps in an SQLite database (see the sketch below).

  • Pros: maybe easier to implement.
  • Cons: this works only when people import their data from a single server (when several servers run Log Analytics, they would not share the "request ID database" among them and may import the same data twice).
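
A minimal sketch of such a local deduplication store, assuming request IDs are hashes of the raw log lines and the history lives in a local SQLite file (file and table names are illustrative):

```python
# Minimal sketch of a local SQLite store that remembers which (day, line hash)
# pairs were already sent to the Tracking API. Names are illustrative.
import hashlib
import sqlite3

db = sqlite3.connect("import_history.sqlite")
db.execute(
    "CREATE TABLE IF NOT EXISTS imported ("
    "  day TEXT NOT NULL,"
    "  request_id TEXT NOT NULL,"
    "  PRIMARY KEY (day, request_id)"
    ")"
)

def should_import(day: str, raw_line: str) -> bool:
    """Return True the first time a (day, line hash) pair is seen."""
    request_id = hashlib.sha1(raw_line.encode("utf-8", errors="replace")).hexdigest()
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT INTO imported (day, request_id) VALUES (?, ?)",
                (day, request_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # already imported for that day, skip the line
```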

Summary

This feature would be awesome to have, and would make Log Analytics much more flexible and easier to use and set up.

What do you think?


cweiske commented Sep 16, 2016

Simply keeping track of the last imported log line of a file would suffice. Only records after this log line would need to be imported.

To solve the issue of log file rotation (the last tracked line is no longer contained in the log file, so nothing would be imported at all), the first line of the log file would also need to be tracked. If the first line does not match the tracked first line, the log file has been rotated and the whole file needs to be imported.
Alternatively, the creation date of the log file could be tracked; a change there would indicate a rotation, too. Not sure if all log rotators create new files, though.

Maybe look into how the since command works.
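
A minimal sketch of this approach, assuming the state (first and last imported line) is kept in a small JSON file next to the importer; the file names are illustrative:

```python
# Minimal sketch: remember the first and last imported line of a log file,
# import only new records, and re-import everything when rotation is detected.
import json
import os

STATE_FILE = "import_state.json"  # illustrative location for the saved state

def lines_to_import(log_path: str):
    state = {}
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as fh:
            state = json.load(fh)

    with open(log_path) as fh:
        lines = fh.readlines()
    if not lines:
        return []

    if state.get("first_line") != lines[0]:
        new_lines = lines  # first line changed: file was rotated, import it all
    else:
        last = state.get("last_line")
        start = lines.index(last) + 1 if last in lines else 0
        new_lines = lines[start:]  # only records after the last imported line

    with open(STATE_FILE, "w") as fh:
        json.dump({"first_line": lines[0], "last_line": lines[-1]}, fh)
    return new_lines
```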


mattab commented Sep 26, 2016

Simply keeping track of the last imported log line of a file would suffice. Only records after this log line would need to be imported.

Yes, it would already be very useful to implement this simpler solution...


glatzenarsch commented Nov 3, 2016

Is this solution already implemented in 2.17, or is it just planned for a future version that you are working on?
Thanks


fwolfst commented May 8, 2017

@mattab Thanks for the clear issue description.

I am all for an improvement. I recall that in my tests not so long ago, importing the same file twice did not result in a doubled view count, but I might be mistaken. Is there any news on this issue?

In my use case it is important to be able to import log files that fill "gaps" caused by network or other issues. Say I import weekly via cron and realize that one week in the last month was skipped. That is why @cweiske's approach would not help ME very much.

I think one can line up the use cases and solutions and sort them roughly from "not bad" to "clean" (imho @cweiske's fair proposition is in the "not bad" range).

Obviously the problem is that many different use cases exist, and the cleanest implementation has probably been laid out by @mattab already. For my use case I would chip in another "not bad" approach/workaround: keeping a hash of the entire log file in Log Analytics only. That would already solve my issue. Bonus points for storing the first and last request time with it, so that it would be easy to DELETE all the visits of a particular log file (say, the log format changed/was extended and we want to add more information in hindsight). In any case, Log Analytics should implement a --force-XYZ flag (XYZ should be a specific name) to override any clever logic that suddenly is not as clever as needed. A sketch of the whole-file-hash idea follows below.
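
A minimal sketch of the whole-file-hash workaround, assuming a parse_time() helper (supplied by the caller, not shown) that extracts the request time from a log line:

```python
# Minimal sketch: fingerprint a whole log file and record its first/last
# request times, so a duplicate import can be detected (and, later, the
# matching date range invalidated/deleted if needed).
import hashlib

def file_fingerprint(log_path, parse_time):
    """Return the file's SHA-256 plus the first and last request times."""
    sha = hashlib.sha256()
    first_ts = None
    last_ts = None
    with open(log_path, "rb") as fh:
        for raw in fh:
            sha.update(raw)
            ts = parse_time(raw.decode("utf-8", errors="replace"))
            if ts is not None:
                if first_ts is None:
                    first_ts = ts
                last_ts = ts
    return {"sha256": sha.hexdigest(), "first": first_ts, "last": last_ts}

# A file would be imported only if its fingerprint is not already on record;
# the stored first/last times make later clean-up of that period easier.
```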


ilmtr commented Jul 10, 2017

This feature is very much needed. In my setup, (semi) realtime logs are appended to a file about every hour. After a while this file is archived to a gzip file. This archiving happens unpredictably, anywhere from every 6 hours to every 3 days.


AlexeyKosov commented Jan 31, 2018

When can we expect this feature to be implemented?


glatzenarsch commented Mar 27, 2018

I'm not sure if this feature is implemented yet, but I'm interested in when it will be.
I am running Matomo 3.3 on about 50 websites, with cron jobs running every hour that import the same active Apache access log, and it would be nice to know that the results are reliable and not duplicated :)

thank you


mhow2 commented Jul 20, 2018

Hi,
Let's speak in terms of workarounds, as I can tell this feature is not yet close to being implemented.
Suppose someone made the bad move of importing the same log file twice: how do you fix the mistake?
Can you delete reports for a given period of time? Might core:invalidate-report-data be of any help? Could anyone with good experience of this share it and publish an entry in the FAQ?


Synchro commented Jul 20, 2018

I didn't solve it - I stopped using Matomo. I'm not at all interested in JS-based tools, I only want offline log analysis.

One really good tool that helps with log files is AWStats' logresolvemerge.pl. This utility can take any number of log files (compressed or uncompressed) and merge them together, removing duplicates, sorting lines by timestamp and performing DNS lookups. It really "just works" and is also pretty fast. Matomo could do with a tool like that, so it could either be ported or simply used as is, despite the language mismatch.

I once wrote some utilities in PHP for processing logs, allowing split/merge/filtering by date, including using strtotime, which allows you to do nice things like "find all entries from the last 2 days" while not having to worry too much about the timestamp format. I'll see if I can find them again.


mhow2 commented Nov 12, 2018

Thanks @Synchro for the reference. However, I don't see how this AWStats Perl script solves the issue for log entries that are already recorded in Matomo's database.
@mattab do you have any hints/input about this? Otherwise I'll be forced to track somewhere (outside Matomo) which files have been successfully imported (if I can even assert what a "success" is).


Synchro commented Nov 12, 2018

It doesn't solve it directly, but it can help avoid problems - for example, you can make a real mess of your Matomo database if you import log files in the wrong order, but passing them through that tool first avoids the problem.


mackuba commented Nov 17, 2018

I've added a PR that adds a new --timestamp-file option which solves this problem by tracking the last processed timestamp in a selected file and then ignoring logs up to that date: #232
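
For illustration, a minimal sketch of the idea (not the actual PR code): remember the last processed timestamp in a side file and skip any record at or before it.

```python
# Minimal sketch: keep the newest processed timestamp in a side file and
# yield only records that are newer. File name and record shape are illustrative.
import os

def load_last_timestamp(path):
    if os.path.exists(path):
        with open(path) as fh:
            return float(fh.read().strip() or 0)
    return 0.0

def save_last_timestamp(path, ts):
    with open(path, "w") as fh:
        fh.write(str(ts))

def filter_new_records(records, timestamp_path="matomo_import.timestamp"):
    """Yield only log records newer than the last recorded timestamp."""
    last = load_last_timestamp(timestamp_path)
    newest = last
    for ts, line in records:  # records are (unix_timestamp, raw_line) pairs
        if ts > last:
            newest = max(newest, ts)
            yield line
    save_last_timestamp(timestamp_path, newest)  # runs once the input is exhausted
```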
