New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Matomo tracker select queries can get slow when most visits have same visitorId #16904
Comments
The where clause with one year back might not really help though because most visits might be within one year (seeing such a DB here). Maybe a config setting forcing look back to be only few days might be better but this would need a config. Ideally it was fixed automatically. |
A solution might be otherwise a custom index. |
Putting this into the milestone to see if we can somehow improve this or workaround it when a visitor has hundreds of thousands of visit to still have it fast. It may not be easily possible and that be fine. Ideally we also wouldn't really make it slower for the normal tracking request. |
This depends on whether you'd want the year period to work w/ cohorts. If not, then it's really a judgement call on the number of months. Though, I would expect the query to be fast if requests force new visits is always used? Since the last known visit would normally be recent? Unless it's common for visitors to come from years ago? Or does the ordering not get applied in the query? |
In our case all visits are recent like last 1 hour. So it's probably unrelated to how far back we look. |
Maybe it's the index being updated constantly? Just a guess. |
@diosmosis I'm not so much into details right now but it was basically fetching thousands of visits every time so I'm assuming the sorting might be the problem or so I'm not really sure. If you're keen to look into this I can let you know where this happens on slack potentially. |
Quick idea: Would a local (not synced) file based cache that just saved the idvisit => idvisitor mapping for 30mins be useful? If less than 30 mins, we use the cached value. If 30 mins passes we run the query again. If forcing a new visit, we unset the cache value if it's there. |
I'm not sure if it would help because
|
FYI one (reactive) workaround that might help is adding a third column to the index idsite idvisitor index like this (not fully tested):
In my case this improved the performance from |
We could also have a setting to block new visits for a specific visitor after for example say 500 visits. However, this might make it hard for people to troubleshoot this issue and sometimes you will want to track that many visits. It's not uncommon that if you have daily visitors over a 1-2 years to reach such a number. So it can't really be a solution. Maybe there could be some upper limit though like 10000 visits or so. |
@tsteur I wonder if we can do something like what's done w/ the Live plugin to avoid having to look over too many rows at the same time. For example, we look in the last 30 minutes for a idvisitor, then the last hour, then the last day, and so on until we find a previous visit. There would be more queries, but perhaps the queries would be faster? Just an idea. |
One of the problems that makes this query so hard to optimise is the requirement with the |
Was wondering the same but I don't think it would help too much maybe because we're using the idvisitor index here and it means it would still need to look at all the visits. It would depend if the sorting is the slow CPU consuming part, then it might help. Although often we then might need to select twice which may make it overall a bit slower again maybe. |
Is this something we can test or is high concurrency a contributing factor and it's only slow when we're actively tracking? |
fyi @diosmosis I've profiled the query and here's the output for a visitor with 20K visits. The big part is indeed creating the sort index. The where was I've then adjusted the where to only look at the last 2 weeks ( I then executed the first query again where I look at all visits. It took consistently I then ran the query without any time in the where and it didn't really make a difference (used Then used To have reliable numbers would need to execute all the queries many times and look at like 95th percentile or so. Having no order clause made the sorting part in the profile disappear (makes sense) but of course sending data took ages It should mean though maybe the less entries we select the faster the sorting should be. Where looking eg first only at least week is in the end faster or not really depends on the workload. If most visitors have say < 100 visits then it might make it slower. Also depends how much time visits usually have between visits etc. So overall it really almost depends on a per site basis and what kind of data we track. Probably changing it would make it worse for most users since there a visitor typically has only a few visits and often they don't visit for a week but then again after few weeks. |
Another option would be sorting in PHP as PHP can be scaled easily
again when there are thousands of visits the problem might be though that we'll be sending and transferring a lot of data. So far still can only think of adjusting the index as mentioned in #16904 (comment) but this of course affects all writes. |
fyi also tested below query thinking maybe it improves as there is less data to sort or so but it's not improving the performance
|
What did make the query fast is below query. But this will only work if the visits are actually in order and the highest idivist is the most recent visit
|
If I see this right: Basically, if we could cache whether they ever used a custom timestamp for a site during tracking or not then we could optimise the query big big times by sorting idvisit desc instead. As soon as they use a custom timestamp in a tracking request we would be no longer able to use it though and fall back to slow. Most users would maybe not send a custom timestamp. |
fyi tested adding an index |
We can't reproduce the performance improvement anymore when sorting by idvisit instead of visit_last_action_time so this doesn't seem to be a solution. For more information see #17649 (comment) |
Let's document this problem and the solution. Once done, let's close it as wontfix. |
@peterhashair I think you documented this one? Is it fully published? |
@tsteur just published it. This issue Should be good to go. |
@peterhashair can you please post a link to the new document and close this issue? |
To resolve this issue. Please go to FAQ HOW DO I IMPROVE THE LOAD TIME OF THE FIND VISITOR SQL QUERY WHEN VISITORS HAVE MANY VISITS? |
I'll reopen this issue for Matomo 5.0. In Matomo 5.0 we will change the existing index according to this FAQ: https://matomo.org/faq/troubleshooting/how-do-i-improve-the-load-time-of-the-find-visitor-sql-query-when-visitors-have-many-visits/ so the index looks like this:
We've been testing this index on the cloud and while it made writes slightly slower the reads are a lot faster. This query was basically top load contributor previously |
This was already solved in #18636 and the update script for Matomo 5 is already merged causing an index change is there already. We only need to close it as part of Matomo 5 release to make sure it appears in changelog. |
Say you are tracking apps and you send a userId with every tracking request. Then by default the userId will be used to generate the visitorId. It may also be a problem if for every action a new visit is forced.
Queries like this will become very slow:
Explain:

or

To workaround this performance issue set
[Tracker]enable_userid_overwrites_visitorid=0
in yourconfig/config.ini.php
if userId is used. If userId is not used, there is currently no workaround.Ideally we find a way to not needing this setting though. Maybe we could at least limit time range by adding to the where
visit_last_action_time > 1 year before upper limit
. It won't make it much better but at least it be slightly faster especially when the table has over the years collected many hundred thousands or millions of requests for the same visitor. In our use case we have for example for the same idVisitor 1.6M entries which is roughly 30% of all visits for the same visitor.We used to have this lower limit in the where clause but AFAIK we removed it for Cohorts. @diosmosis do you maybe remember? To make cohorts still useful, how far back would we need to look? 1 year? Or would it be fine to only look back 6 months? I suppose we'd need to look back kind of ideally at least 1 year for the reports to be fine when you look at months for cohorts.
Any other thoughts on how to make this use case faster be appreciated
UPDATE
See my comment further below. This issue is now about changing the existing index according to https://matomo.org/faq/troubleshooting/how-do-i-improve-the-load-time-of-the-find-visitor-sql-query-when-visitors-have-many-visits/
The text was updated successfully, but these errors were encountered: