Add an option to only process log entries that haven't been processed before #232

wants to merge 2 commits into base: master


mackuba commented Nov 17, 2018

I want to use Matomo with log analytics only. My Nginx logs are rotated every week, but I want my reports to be updated much more often, e.g. every hour. If I simply feed the same log file to the importer again, the visits that were already reported get imported a second time and show up as duplicates. So I need to either rotate the logs every hour (very inconvenient) or somehow prevent log entries from being imported twice. As far as I could find, there is currently no easy way to do this.

This pull request solves this by tracking the latest visit timestamp found in an imported log file and saving it to the file specified with the --timestamp-file option. On the next run, this timestamp is loaded at startup and all visits at or before it are ignored (like --exclude-older-than, but inclusive, since the log entry with the equal timestamp was already imported).
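To illustrate the idea for reviewers, here is a minimal standalone sketch of the load / filter / save cycle described above. It is not the code from this PR: the helper names (`load_timestamp`, `save_timestamp`, `select_new_hits`), the `"date"` key on each hit, and the on-disk timestamp format are all assumptions made for the example.

```python
import os
from datetime import datetime, timezone

# Assumed on-disk format, e.g. "2018-11-07 11:59:42 +0000" (hypothetical choice)
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S %z"


def load_timestamp(path):
    """Return the saved timestamp, or None if the file is missing or empty."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        text = f.read().strip()
    return datetime.strptime(text, TIMESTAMP_FORMAT) if text else None


def save_timestamp(path, timestamp):
    """Overwrite the timestamp file with the given timezone-aware datetime."""
    with open(path, "w") as f:
        f.write(timestamp.strftime(TIMESTAMP_FORMAT))


def select_new_hits(hits, initial_timestamp):
    """Keep only hits strictly newer than initial_timestamp.

    Hits at or before initial_timestamp are skipped (inclusive filter).
    Returns the kept hits and the latest timestamp seen among them,
    falling back to initial_timestamp when nothing new was found.
    """
    latest = initial_timestamp
    new_hits = []
    for hit in hits:
        ts = hit["date"]  # hypothetical field holding an aware datetime
        if initial_timestamp is not None and ts <= initial_timestamp:
            continue
        new_hits.append(hit)
        if latest is None or ts > latest:
            latest = ts
    return new_hits, latest
```

On each run the caller would load the timestamp, pass it to `select_new_hits`, import the returned hits, and then save the returned latest timestamp so the next run starts from there.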

This partially solves #144.

I've put initial_timestamp (loaded from the file at startup) in the config and latest_timestamp (updated after every log record) in the stats. These can be moved elsewhere if that's not the best place.

I've also added some lines to the summary to print the status of the timestamp-based filtering, and included the older/newer than filtering too since it's related:

Logs import summary

    85 requests imported successfully
    36 requests were downloads
    10627 requests ignored:
        73 HTTP errors
        2 HTTP redirects
        0 invalid log lines
        345 filtered log lines
        0 requests did not match any known site
        0 requests did not match any --hostname
        153 requests done by bots, search engines...
        10054 requests to static resources (css, js, images, ico, ttf...)
        0 requests to file downloads did not match any --download-extensions

    Processed logs since: 2018-11-06 19:05:12 +0000
    Saved last timestamp: 2018-11-07 11:59:42 +0000

I also tweaked the printing there to remove extra empty lines (runs of more than 2 newlines are compacted into 2). This was already a problem before: the space between the 2nd and 3rd sections was bigger than between sections 1/2 and 3/4 because of %(sites_ignored)s, but it became more visible once the date filtering section was added.
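The newline compaction can be expressed with a single regular expression. The snippet below is a sketch of the technique, not necessarily the exact change in this PR, and `compact_blank_lines` is a hypothetical name:

```python
import re


def compact_blank_lines(text):
    """Collapse any run of 3 or more newlines into exactly 2 newlines."""
    return re.sub(r"\n{3,}", "\n\n", text)
```

Applied to the rendered summary, this leaves at most one empty line between sections, even when an optional section such as %(sites_ignored)s expands to an empty string.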
