Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP

Loading…

Improve storage of URLs, normalization at DB Level #2976

Closed
BeezyT opened this Issue · 21 comments

4 participants

@BeezyT
Collaborator

Problem

Pages might have multiple URLs, especially if a site has multiple domains.

The basic cases are

  1. Alias domains
  2. With and without www
  3. http and https

Current solution

  • For the Actions report, normalization is done implicitly in Piwik_Actions::getActionExplodedNames. The domain is removed, which takes care of (2) and (3). It also handles (1) correctly if the domains are real aliases, but it fails if the aliases are domain1.com/ and domain2.com/something/.
  • For the segement pageUrl, one can use =@ (i.e. contains) and omit domain and protocol.

Why this is not enough

For upcoming features, we need data based on URLs. One URL might have multiple idactions.

Using WHERE idaction IN ( ... ) works in some cases but not always. E.g. if you want the pages visitors viewed directly after a certain page you can use idaction_url_ref IN ( ... ) but you cannot recognize aliases in the result (column idaction) on the DB level.

Proposed solution

  1. Do some normalization before writing to log_action. We would not record protocol and www for page URLs ie. type == TYPE_ACTION_URL == 1.
  2. Add the information about protocol and www as a TINYINT to log_action. 1 = http://, 2 = http://www, 3 = https://, 4 = https://www. This information is added when new actions are inserted and never modified. It can be used to reconstruct the full URL. When actions should be tracked with the same URL but different TINYINT, use the existing idaction.
  3. Analysis can be done without worrying about problem (2) and (3).
  4. Problem (4) - alias domains - is acceptable. Maybe we can add a switch to the site configuration where users can configure Piwik to treat the domains as aliases. If set, the alias domains will be replaced with the main domain when tracking. By default, this option is off.
  5. We need to upgrade existing databases. Transform log_action, find duplicates, change foreign keys to duplicates in other tables, remove duplicates. Ideally, this is done without using stored procedures / functions and without data shipping to PHP.
@robocoder
Collaborator

s/remote/remove/

@robocoder
Collaborator

I think this is an acceptable edge case. From prior discussion, SEO would discourage the use of alias domains.

@mattab
Owner

We need to focus on performance, because this feature will be potentially very slow. We can't afford adding a new field to log_action to deal with this edge case. To maximise performance, solution 4) is the way to go. It is not a big drawback IMO because most websites do not use alias.

vote for wontfix/worksforme/ acceptable edge case.

@mattab
Owner

Note: a relevant new FAQ to put up to help with this issue is to explain how to track the custom URL to be the canonical URL (if websites generate it, most CMS/apps do these days). See #2974 for code example

@mattab
Owner

See also #2805 (Purge logs from log_action) - could this be done if we happen to change URL storage algorithm?

@mattab
Owner

After discussing further with timo, we came to the conclusion that storing protocol + hostname for Page URLS (type==1) in log_action table is not currently useful very much.

  • Storing hostname is only used when using the segments pageUrl, exitPageUrl, entryPageUrl.
  • However storing hostname for all rows where type==1 in log_action causes extra disk usage and prevents us from writing efficient and user friendly new features

Therefore, it makes sense to consider not storing the protocol+hostname in the future.
Proposed solution

  • Tracking: We would not record protocol+hostname for Page URLs ie. type== TYPE_ACTION_URL ==1
    • log_action rows for type TYPE_OUTLINK = 2; TYPE_DOWNLOAD = 3; TYPE_ACTION_NAME = 4; would be unchanged and still have full URL
  • Upgrade: The upgrade procedure would be quite slow. Timo is working on a test upgrade that we will try on the demo. Definitely high traffic piwik users will have to stop tracking and disable UI during the upgrade.
    • Post upgrade: the log_action table would be smaller (protocol+host stripped out)
    • Future developments will not have to worry about URL normalization. It will also improve user friendliness since URLs will be automatically aggregated.
  • Archiving: We would not put the hostname in the Actions Page URLs datatables.
  • API: The hostname would be prepended in the metadata 'url' column in the Actions.getPageUrls API call
    • This change to Archiving+API would result in smaller future Actions Datatable (not a significant improvement but always nice)

TODO: New FAQ to explain how to do alias domain URL segmentation.

  • Because some users might use pageUrl, exitPageUrl, entryPageUrl segmentation to test various aliases performance, we should document, before releasing this change, how to do Domain segmentation via Custom Vars.
    Typically, user could record the hostname in a custom variable of scope "page". Then it could segment the reports using
    &segment=customVariablePageName1==example.org;pageUrl=http://this-domain-is-not-used/page.html

    This will first normalize the pageUrl to "page.html" then select the idaction, then segment further only page views that had a custom variable "hostname" set to "example.org".

    This would achieve the current behavior of
    &segment=pageUrl=http://example.org/page.html
    .

  • document in the segmentation page the fact that segment pageUrl, entryPageUrl, exitPageUrl do not use the hostname.
    • This is the main downside of this change: that segments implementation is not really what their name mean, because only the path+querystring from the specified segment valueswill be used (rather than the whole URL).

The benefits of this change far outweight the downsides. The very slow upgrade will be a challenge, but a necessary one if we want to keep innovating :)

@BeezyT
Collaborator

After some discussion via email, here's an updated version of the ticket description.

@mattab
Owner

Maybe we can add a switch to the site configuration where users can configure Piwik to treat the domains as aliases. If set, the alias domains will be replaced with the main domain when tracking. By default, this option is off.

I think that's OK for V2 but we don't need the switch for V1, as this is an edge case and I think will not be useful for many people.

Update script
Hopefully you can manage to do the update without stored functions indeed, as I'm pretty sure many users will not be allowed to create the functions on shared hosts.

Because now the normalization is simple, maybe we can do something like:

select @prefix_type := (
case when LEFT(name, 10) = 'http://www' then 'http://www' 
when LEFT(name, 7) = 'http://' then 'http://' 
case when LEFT(name, 11) = 'https://www' then 'https://www'
when LEFT(name, 8) = 'https://' then 'https://' 
else '' 
end), 
name, 
IF(@prefix_type <> '', REPLACE(name,@prefix_type,''), name ) as nameafter
from piwik_log_action
where type=1
#AND idaction > 730641 
@BeezyT
Collaborator

Trac won't let me add files with 1K lines. So here's the patch in gist: https://gist.github.com/2157026 . It consists of two commits I have in a git branch. I hope you can manage to apply it to your SVN working copy.

  • All the integration tests pass and I added a new test suite for the normalization.
  • I called the update 1.7.3-b1, assuming that we release 1.7.2 before we do this. But the version can be changed when I commit to trunk.

Please review the patch carefully.

@mattab
Owner

Patch looks good, thanks. +1 for the tests!
I didn't apply it but I generally don't apply patches anyway, just read them.

The only potential problem I can think of is the inserting of NULL values, which I remember has some issues with Mysqli. But, if that's the case, the beauty is that tests will fail when commmitted on jenkins, so we shall see.

great stuff!

@BeezyT
Collaborator

(In [6790]) refs #2976 updates can be marked a major

@BeezyT
Collaborator

(In [6791]) refs #2976 fixing outdated doc of Piwik_Config

@BeezyT
Collaborator

(In [6792]) refs #2976 url normalization: store protocol and www in the url_prefix column of log_action. treat pages with different protocol or with/without www as the same action. includes a major db transformation and tests.

@BeezyT
Collaborator

It would be good if as many eyes as possible could review this update since it's quite critical.

@BeezyT
Collaborator

(In [6794]) refs #2976 svn properties

@BeezyT
Collaborator

Looks like I broke the build...

/home/www/data/root/jenkins.private/jobs/Piwik/workspace/build/plugins/PrivacyManager/tests/PrivacyManager.test.php -> Test_Piwik_PrivacyManager -> test_purgeData_deleteReportsKeepRangeReports -> Unexpected exception of type [with message [SQLSTATE42000: Syntax error or access violation: 1064 You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ' ,
, , 1000)' at line 4

What is this trying to tell me??

This particular test case doesn't complete on my machine at all.

@sgiehl
Collaborator

(In [6802]) refs #2976 fixed privacymanager tests

@mattab
Owner

(In [6808]) Adding quick tip about disabling maintenance and re-enable tracker refs #2976

@mattab
Owner

Great code Timo!

It will be useful to have this data cleaned up and simplified int he DB...
This wasnt an easy upgrade script to write, you did really well

  • Tests look good, testing both the segmentation, and that the log_action is correctly built, and case sensitivity..

Not much to add, good stuff. Marking as fixed

@mattab
Owner

(In [6814]) No duplicate code + Testing for entryPageUrl/exitPageUrl Refs #2976

@mattab
Owner

in [7017] Refs #2976 Removing comments from the SQL as, when we display SQL, it gets all inlined and #Xxxx comments gets inlined causing SQL to fail. Reported in forums eg. http://forum.piwik.org/read.php?2,93934

@BeezyT BeezyT added this to the 1.12.x - Piwik 1.12.x milestone
@BeezyT BeezyT self-assigned this
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.