New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve storage of URLs, normalization at DB Level #2976
Comments
s/remote/remove/ |
I think this is an acceptable edge case. From prior discussion, SEO would discourage the use of alias domains. |
We need to focus on performance, because this feature will be potentially very slow. We can't afford adding a new field to log_action to deal with this edge case. To maximise performance, solution 4) is the way to go. It is not a big drawback IMO because most websites do not use alias. vote for wontfix/worksforme/ acceptable edge case. |
Note: a relevant new FAQ to put up to help with this issue is to explain how to track the custom URL to be the canonical URL (if websites generate it, most CMS/apps do these days). See #2974 for code example |
See also #2805 (Purge logs from log_action) - could this be done if we happen to change URL storage algorithm? |
After discussing further with timo, we came to the conclusion that storing protocol + hostname for Page URLS (type==1) in log_action table is not currently useful very much.
Therefore, it makes sense to consider not storing the protocol+hostname in the future.
TODO: New FAQ to explain how to do alias domain URL segmentation.
The benefits of this change far outweight the downsides. The very slow upgrade will be a challenge, but a necessary one if we want to keep innovating :) |
After some discussion via email, here's an updated version of the ticket description. |
I think that's OK for V2 but we don't need the switch for V1, as this is an edge case and I think will not be useful for many people.
Because now the normalization is simple, maybe we can do something like:
|
Trac won't let me add files with 1K lines. So here's the patch in gist: https://gist.github.com/2157026 . It consists of two commits I have in a git branch. I hope you can manage to apply it to your SVN working copy.
Please review the patch carefully. |
Patch looks good, thanks. +1 for the tests! The only potential problem I can think of is the inserting of NULL values, which I remember has some issues with Mysqli. But, if that's the case, the beauty is that tests will fail when commmitted on jenkins, so we shall see. great stuff! |
(In [6790]) refs #2976 updates can be marked a major |
(In [6791]) refs #2976 fixing outdated doc of Piwik_Config |
(In [6792]) refs #2976 url normalization: store protocol and www in the url_prefix column of log_action. treat pages with different protocol or with/without www as the same action. includes a major db transformation and tests. |
It would be good if as many eyes as possible could review this update since it's quite critical. |
(In [6794]) refs #2976 svn properties |
Looks like I broke the build... /home/www/data/root/jenkins.private/jobs/Piwik/workspace/build/plugins/PrivacyManager/tests/PrivacyManager.test.php -> Test_Piwik_PrivacyManager -> test_purgeData_deleteReportsKeepRangeReports -> Unexpected exception of type [with message [SQLSTATE42000: Syntax error or access violation: 1064 You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ' , What is this trying to tell me?? This particular test case doesn't complete on my machine at all. |
(In [6802]) refs #2976 fixed privacymanager tests |
(In [6808]) Adding quick tip about disabling maintenance and re-enable tracker refs #2976 |
Great code Timo! It will be useful to have this data cleaned up and simplified int he DB...
Not much to add, good stuff. Marking as fixed |
(In [6814]) No duplicate code + Testing for entryPageUrl/exitPageUrl Refs #2976 |
in [7017] Refs #2976 Removing comments from the SQL as, when we display SQL, it gets all inlined and #Xxxx comments gets inlined causing SQL to fail. Reported in forums eg. http://forum.piwik.org/read.php?2,93934 |
Problem
Pages might have multiple URLs, especially if a site has multiple domains.
The basic cases are
Current solution
domain1.com/
anddomain2.com/something/
.Why this is not enough
For upcoming features, we need data based on URLs. One URL might have multiple idactions.
Using
WHERE idaction IN ( ... )
works in some cases but not always. E.g. if you want the pages visitors viewed directly after a certain page you can useidaction_url_ref IN ( ... )
but you cannot recognize aliases in the result (column idaction) on the DB level.Proposed solution
The text was updated successfully, but these errors were encountered: