Matomo tracker select queries can get slow when most visits have same visitorId #16904

tsteur · 2020-12-07T21:07:44Z

Say you are tracking apps and you send a userId with every tracking request. Then by default the userId will be used to generate the visitorId. It may also be a problem if for every action a new visit is forced.

Queries like this will become very slow:

explain extended SELECT visit_last_action_time, visit_first_action_time, idvisitor, idvisit, user_id, visit_exit_idaction_url, visit_exit_idaction_name, visitor_returning, visitor_seconds_since_first, visitor_seconds_since_order, visitor_count_visits, visit_goal_buyer, location_country, location_region, location_city, location_latitude, location_longitude, referer_name, referer_keyword, referer_type, idsite, profilable, visit_entry_idaction_url, visit_total_actions, visit_total_interactions, visit_total_searches, referer_url, config_browser_name, config_client_type, config_device_brand, config_device_model, config_device_type, visit_total_events, visit_total_time, location_ip, location_browser_lang, campaign_content, campaign_group, campaign_id, campaign_keyword, campaign_medium, campaign_name, campaign_placement, campaign_source, last_idlink_va, custom_dimension_1, custom_dimension_2, custom_dimension_3, custom_dimension_4, custom_dimension_5, custom_dimension_6, custom_var_k1, custom_var_v1, custom_var_k2, custom_var_v2, custom_var_k3, custom_var_v3, custom_var_k4, custom_var_v4, custom_var_k5, custom_var_v5  FROM log_visit FORCE INDEX (index_idsite_idvisitor)  WHERE idsite = '4' AND visit_last_action_time <= '2020-12-07 20:34:48' AND idvisitor = '...'
                ORDER BY visit_last_action_time DESC
                LIMIT 1

Explain:

or

To workaround this performance issue set [Tracker]enable_userid_overwrites_visitorid=0 in your config/config.ini.php if userId is used. If userId is not used, there is currently no workaround.

Ideally we find a way to not needing this setting though. Maybe we could at least limit time range by adding to the where visit_last_action_time > 1 year before upper limit . It won't make it much better but at least it be slightly faster especially when the table has over the years collected many hundred thousands or millions of requests for the same visitor. In our use case we have for example for the same idVisitor 1.6M entries which is roughly 30% of all visits for the same visitor.

We used to have this lower limit in the where clause but AFAIK we removed it for Cohorts. @diosmosis do you maybe remember? To make cohorts still useful, how far back would we need to look? 1 year? Or would it be fine to only look back 6 months? I suppose we'd need to look back kind of ideally at least 1 year for the reports to be fine when you look at months for cohorts.

Any other thoughts on how to make this use case faster be appreciated

UPDATE

See my comment further below. This issue is now about changing the existing index according to https://matomo.org/faq/troubleshooting/how-do-i-improve-the-load-time-of-the-find-visitor-sql-query-when-visitors-have-many-visits/

The text was updated successfully, but these errors were encountered:

tsteur · 2020-12-07T21:09:35Z

The where clause with one year back might not really help though because most visits might be within one year (seeing such a DB here). Maybe a config setting forcing look back to be only few days might be better but this would need a config. Ideally it was fixed automatically.

tsteur · 2020-12-07T23:17:10Z

A solution might be otherwise a custom index.

tsteur · 2021-03-01T21:12:28Z

Putting this into the milestone to see if we can somehow improve this or workaround it when a visitor has hundreds of thousands of visit to still have it fast. It may not be easily possible and that be fine. Ideally we also wouldn't really make it slower for the normal tracking request.

diosmosis · 2021-03-01T22:13:36Z

@tsteur

To make cohorts still useful, how far back would we need to look? 1 year?

This depends on whether you'd want the year period to work w/ cohorts. If not, then it's really a judgement call on the number of months.

Though, I would expect the query to be fast if requests force new visits is always used? Since the last known visit would normally be recent? Unless it's common for visitors to come from years ago? Or does the ordering not get applied in the query?

tsteur · 2021-03-01T22:16:55Z

In our case all visits are recent like last 1 hour. So it's probably unrelated to how far back we look.

diosmosis · 2021-03-01T22:17:58Z

Maybe it's the index being updated constantly? Just a guess.

tsteur · 2021-03-01T22:20:03Z

@diosmosis I'm not so much into details right now but it was basically fetching thousands of visits every time so I'm assuming the sorting might be the problem or so I'm not really sure. If you're keen to look into this I can let you know where this happens on slack potentially.

diosmosis · 2021-03-01T22:29:39Z

Quick idea: Would a local (not synced) file based cache that just saved the idvisit => idvisitor mapping for 30mins be useful? If less than 30 mins, we use the cached value. If 30 mins passes we run the query again. If forcing a new visit, we unset the cache value if it's there.

tsteur · 2021-03-01T22:36:54Z

I'm not sure if it would help because

you might have heaps of tracker instances and each instance would need its own cache plus they would need to have a way to invalidate each other's cache otherwise there might be outdated information
when tracking heaps of traffic then the caches could be quite big. This could be solved though by having a custom nested directory structure for this to make sure to have one cache file per eg configId/visitorId and to never have too many files in one directory (otherwise things get slow)

tsteur · 2021-03-02T22:36:10Z

FYI one (reactive) workaround that might help is adding a third column to the index idsite idvisitor index like this (not fully tested):

                drop index `index_idsite_idvisitor` on log_visit;
                CREATE INDEX `index_idsite_idvisitor` ON log_visit (idsite, idvisitor, visit_last_action_time DESC)

In my case this improved the performance from 0.13436800s to 0.00180650s

tsteur · 2021-03-02T22:39:13Z

We could also have a setting to block new visits for a specific visitor after for example say 500 visits. However, this might make it hard for people to troubleshoot this issue and sometimes you will want to track that many visits. It's not uncommon that if you have daily visitors over a 1-2 years to reach such a number. So it can't really be a solution. Maybe there could be some upper limit though like 10000 visits or so.

diosmosis · 2021-06-02T19:47:45Z

@tsteur I wonder if we can do something like what's done w/ the Live plugin to avoid having to look over too many rows at the same time. For example, we look in the last 30 minutes for a idvisitor, then the last hour, then the last day, and so on until we find a previous visit. There would be more queries, but perhaps the queries would be faster? Just an idea.

tsteur · 2021-06-02T19:48:07Z

One of the problems that makes this query so hard to optimise is the requirement with the visit_last_action_time because we need to find the correct last visit before that timestamp. Maybe, if the tracking request doesn't provide a custom timestamp we could think about optimizing this query. This would be the case for most of the requests. In that case we would be mostly looking for the most recent match. Possible though that this still might not be 100% possible because visits might be out of order or it might require an index update etc.

tsteur · 2021-06-02T19:49:56Z

@tsteur I wonder if we can do something like what's done w/ the Live plugin to avoid having to look over too many rows at the same time. For example, we look in the last 30 minutes for a idvisitor, then the last hour, then the last day, and so on until we find a previous visit. There would be more queries, but perhaps the queries would be faster? Just an idea.

Was wondering the same but I don't think it would help too much maybe because we're using the idvisitor index here and it means it would still need to look at all the visits. It would depend if the sorting is the slow CPU consuming part, then it might help. Although often we then might need to select twice which may make it overall a bit slower again maybe.

diosmosis · 2021-06-02T19:51:21Z

Is this something we can test or is high concurrency a contributing factor and it's only slow when we're actively tracking?

tsteur · 2021-06-02T20:16:42Z

fyi @diosmosis I've profiled the query and here's the output for a visitor with 20K visits. The big part is indeed creating the sort index. The where was WHERE idsite = '3' AND visit_last_action_time <= '2021-06-04 19:31:39'

I've then adjusted the where to only look at the last 2 weeks (WHERE idsite = '3' AND visit_last_action_time > '2021-05-20 19:31:39' AND visit_last_action_time <= '2021-06-04 19:31:39'). The sorting took longer for some reason. I executed it multiple times and it was always slower and took consistently 0.07*.

I then executed the first query again where I look at all visits. It took consistently 0.065 and was always slightly faster surprisingly. Would need to check what is all part of Creating sort index and whether it includes the where or something.

I then ran the query without any time in the where and it didn't really make a difference (used WHERE idsite = '3' AND idvisitor = @v1 ORDER BY visit_last_action_time DESC).

Then used ORDER BY idsite, visit_last_action_time DESC (as there is such an index) and it seemed slightly faster with 0.06 but not sure it's actually faster.

To have reliable numbers would need to execute all the queries many times and look at like 95th percentile or so.

Having no order clause made the sorting part in the profile disappear (makes sense) but of course sending data took ages

It should mean though maybe the less entries we select the faster the sorting should be.

Where looking eg first only at least week is in the end faster or not really depends on the workload. If most visitors have say < 100 visits then it might make it slower. Also depends how much time visits usually have between visits etc. So overall it really almost depends on a per site basis and what kind of data we track. Probably changing it would make it worse for most users since there a visitor typically has only a few visits and often they don't visit for a week but then again after few weeks.

tsteur · 2021-06-02T20:19:13Z

Another option would be sorting in PHP as PHP can be scaled easily

SELECT SQL_NO_CACHE visit_last_action_time, idvisit  FROM log_visit FORCE INDEX (index_idsite_idvisitor)  WHERE idsite = '3'  AND visit_last_action_time <= '2021-06-04 19:31:39' AND idvisitor = ...

again when there are thousands of visits the problem might be though that we'll be sending and transferring a lot of data.

So far still can only think of adjusting the index as mentioned in #16904 (comment) but this of course affects all writes.

tsteur · 2021-06-02T20:21:15Z

fyi also tested below query thinking maybe it improves as there is less data to sort or so but it's not improving the performance


SELECT SQL_NO_CACHE idvisit  FROM log_visit FORCE INDEX (index_idsite_idvisitor)  WHERE idsite = '3'  AND visit_last_action_time <= '2021-06-04 19:31:39' AND idvisitor = @v1  ORDER BY idsite, visit_last_action_time DESC
                LIMIT 1

tsteur · 2021-06-02T20:23:38Z

What did make the query fast is below query. But this will only work if the visits are actually in order and the highest idivist is the most recent visit

SELECT SQL_NO_CACHE idvisit  FROM log_visit FORCE INDEX (index_idsite_idvisitor)  WHERE idsite = '3'  AND visit_last_action_time <= '2021-06-04 19:31:39' AND idvisitor = @v1  ORDER BY  idvisit DESC
                LIMIT 1

tsteur · 2021-06-02T20:33:00Z

If I see this right: Basically, if we could cache whether they ever used a custom timestamp for a site during tracking or not then we could optimise the query big big times by sorting idvisit desc instead. As soon as they use a custom timestamp in a tracking request we would be no longer able to use it though and fall back to slow. Most users would maybe not send a custom timestamp.

tsteur · 2021-06-04T02:52:54Z

fyi tested adding an index idsite, visit_last_action_time DESC but this seems to not have helped.

tsteur · 2021-06-23T22:53:31Z

We can't reproduce the performance improvement anymore when sorting by idvisit instead of visit_last_action_time so this doesn't seem to be a solution. For more information see #17649 (comment)

tsteur · 2021-07-22T22:55:22Z

Let's document this problem and the solution. Once done, let's close it as wontfix.

tsteur · 2021-10-07T20:26:58Z

@peterhashair I think you documented this one? Is it fully published?

peterhashair · 2021-10-07T22:07:17Z

@tsteur just published it. This issue Should be good to go.

justinvelluppillai · 2021-10-07T22:08:53Z

@peterhashair can you please post a link to the new document and close this issue?

peterhashair · 2021-10-07T22:13:38Z

To resolve this issue. Please go to FAQ HOW DO I IMPROVE THE LOAD TIME OF THE FIND VISITOR SQL QUERY WHEN VISITORS HAVE MANY VISITS?

tsteur · 2021-12-20T00:43:46Z

I'll reopen this issue for Matomo 5.0.

In Matomo 5.0 we will change the existing index according to this FAQ: https://matomo.org/faq/troubleshooting/how-do-i-improve-the-load-time-of-the-find-visitor-sql-query-when-visitors-have-many-visits/

so the index looks like this:

`index_idsite_idvisitor` ON matomo_log_visit (idsite, idvisitor, visit_last_action_time DESC)

We've been testing this index on the cloud and while it made writes slightly slower the reads are a lot faster. This query was basically top load contributor previously

tsteur · 2022-05-10T02:48:50Z

This was already solved in #18636 and the update script for Matomo 5 is already merged causing an index change is there already. We only need to close it as part of Matomo 5 release to make sure it appears in changelog.

tsteur added the c: Performance For when we could improve the performance / speed of Matomo. label Dec 7, 2020

tsteur added this to the 4.3.0 milestone Mar 1, 2021

tsteur added the Major Indicates the severity or impact or benefit of an issue is much higher than normal but not critical. label Mar 1, 2021

tsteur mentioned this issue Mar 22, 2021

Huge query on Matomo hundreds of thousands of rows on repeat #17376

Closed

tsteur modified the milestones: 4.3.0, 4.4.0 Mar 28, 2021

diosmosis mentioned this issue Jun 5, 2021

if cdt is not used during tracking, sort by idvisit descending when looking for existing visits #17649

Closed

10 tasks

diosmosis mentioned this issue Jun 15, 2021

queuedtracking:process creates heavy load on SQL Database (after Matomo 4.x.x Update) matomo-org/plugin-QueuedTracking#151

Closed

tsteur added the c: Documentation For issues related to in-app product help messages, or to the Matomo knowledge base. label Jul 22, 2021

tsteur mentioned this issue Sep 21, 2021

Optimise ecommerce life time metric query #18036

Closed

4 tasks

peterhashair self-assigned this Oct 7, 2021

tsteur closed this as completed Oct 7, 2021

justinvelluppillai added the wontfix If you can reproduce this issue, please reopen the issue or create a new one describing it. label Nov 29, 2021

tsteur modified the milestones: 4.6.0, 5.0.0 Dec 20, 2021

tsteur removed c: Documentation For issues related to in-app product help messages, or to the Matomo knowledge base. wontfix If you can reproduce this issue, please reopen the issue or create a new one describing it. labels Dec 20, 2021

tsteur reopened this Dec 20, 2021

tsteur unassigned peterhashair Jan 10, 2022

mattab added the 5.0.0 label Jan 4, 2023

sgiehl closed this as completed Feb 6, 2023

L3on1d mentioned this issue Apr 27, 2023

Tons of FORCE INDEX (index_idsite_idvisitor) MySQL queries killing Matomo #20639

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Matomo tracker select queries can get slow when most visits have same visitorId #16904

Matomo tracker select queries can get slow when most visits have same visitorId #16904

tsteur commented Dec 7, 2020 •

edited

tsteur commented Dec 7, 2020

tsteur commented Dec 7, 2020

tsteur commented Mar 1, 2021

diosmosis commented Mar 1, 2021

tsteur commented Mar 1, 2021

diosmosis commented Mar 1, 2021 •

edited

tsteur commented Mar 1, 2021

diosmosis commented Mar 1, 2021

tsteur commented Mar 1, 2021

tsteur commented Mar 2, 2021

tsteur commented Mar 2, 2021 •

edited

diosmosis commented Jun 2, 2021

tsteur commented Jun 2, 2021

tsteur commented Jun 2, 2021

diosmosis commented Jun 2, 2021

tsteur commented Jun 2, 2021

tsteur commented Jun 2, 2021

tsteur commented Jun 2, 2021

tsteur commented Jun 2, 2021

tsteur commented Jun 2, 2021

tsteur commented Jun 4, 2021

tsteur commented Jun 23, 2021

tsteur commented Jul 22, 2021

tsteur commented Oct 7, 2021

peterhashair commented Oct 7, 2021

justinvelluppillai commented Oct 7, 2021

peterhashair commented Oct 7, 2021

tsteur commented Dec 20, 2021 •

edited

tsteur commented May 10, 2022

Matomo tracker select queries can get slow when most visits have same visitorId #16904

Matomo tracker select queries can get slow when most visits have same visitorId #16904

Comments

tsteur commented Dec 7, 2020 • edited

tsteur commented Dec 7, 2020

tsteur commented Dec 7, 2020

tsteur commented Mar 1, 2021

diosmosis commented Mar 1, 2021

tsteur commented Mar 1, 2021

diosmosis commented Mar 1, 2021 • edited

tsteur commented Mar 1, 2021

diosmosis commented Mar 1, 2021

tsteur commented Mar 1, 2021

tsteur commented Mar 2, 2021

tsteur commented Mar 2, 2021 • edited

diosmosis commented Jun 2, 2021

tsteur commented Jun 2, 2021

tsteur commented Jun 2, 2021

diosmosis commented Jun 2, 2021

tsteur commented Jun 2, 2021

tsteur commented Jun 2, 2021

tsteur commented Jun 2, 2021

tsteur commented Jun 2, 2021

tsteur commented Jun 2, 2021

tsteur commented Jun 4, 2021

tsteur commented Jun 23, 2021

tsteur commented Jul 22, 2021

tsteur commented Oct 7, 2021

peterhashair commented Oct 7, 2021

justinvelluppillai commented Oct 7, 2021

peterhashair commented Oct 7, 2021

tsteur commented Dec 20, 2021 • edited

tsteur commented May 10, 2022

tsteur commented Dec 7, 2020 •

edited

diosmosis commented Mar 1, 2021 •

edited

tsteur commented Mar 2, 2021 •

edited

tsteur commented Dec 20, 2021 •

edited