New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make it possible to define joins for log tables using `getWaysToJoinToOtherLogTables` #14062

Open
wants to merge 9 commits into
base: 3.x-dev
from

Conversation

Projects
None yet
2 participants
@sgiehl
Copy link
Member

sgiehl commented Jan 31, 2019

In some cases it might be useful to join tables to visitor data using other columns than idVisit and idAction.

There is already a method LogTable::getWaysToJoinToOtherLogTables, which makes it possible to define ways to join the table other than using idVisit and idAction.
But currently that method is only used in PrivacyManager in order to remove related data.

This change should also make it possible to use tables joined by getWaysToJoinToOtherLogTables as dimensions and in segmentation.

@sgiehl sgiehl force-pushed the improvejoins branch from 9b8f961 to fa74db6 Jan 31, 2019

@sgiehl sgiehl requested a review from tsteur Feb 3, 2019

@@ -53,6 +53,10 @@ private function addMissingTablesNeededForJoins()
if (!$logTable->getColumnToJoinOnIdVisit()) {
$tableNameToJoin = $logTable->getLinkTableToBeAbleToJoinOnVisit();
if (empty($tableNameToJoin) && $logTable->getWaysToJoinToOtherLogTables()) {

This comment has been minimized.

@tsteur

tsteur Feb 3, 2019

Member

not sure I understand what this does here? Wouldn't we want to add some tables here if we cannot join on idvisit but there is a matching way to join with another table?

This comment has been minimized.

@sgiehl

sgiehl Feb 4, 2019

Author Member

guess might make sense to add those tables if they are missing

This comment has been minimized.

@tsteur

tsteur Feb 5, 2019

Member

I'm not 100% sure about this... wouldn't we need to bubble up all ways to join other log tables to see if it can be joined with the original table eventually? Eg

custom_table.userid => refers to custom_table2.userid but this one has no idvisit ... then we shouldn't add those tables etc and not assume they can be joined when otherwise only log_link_visit_action is present. It may be edge case and should pretty much only happen though when not configuring it properly maybe or when creating log tables that don't make sense in terms of Matomo?

This comment has been minimized.

@tsteur

tsteur Feb 5, 2019

Member

or is it not too important because the query will then fail anyway?

This comment has been minimized.

@sgiehl

sgiehl Feb 5, 2019

Author Member

I guess the query would fail if something is configured badly. E.g. if a plugin defines dimensions and/or segments for a log table that is not joinable with log_visit or log_action. But I'll have a thought if there's a smart way to check if bubbling up ends in log_visit or log_action

This comment has been minimized.

@sgiehl

sgiehl Feb 6, 2019

Author Member

@tsteur I tried starting to implement a way to automatically add all tables up the hierarchy
It gets quite more complicated though. See
https://github.com/matomo-org/matomo/compare/improvejoins...improvejoins2?expand=1

Not sure if it's worth the effort. Otherwise it would only work if a log table defines another way to directly join into log_visit or log_action

This comment has been minimized.

@tsteur

tsteur Feb 6, 2019

Member

OK maybe not too important. I reckon could have been also useful for custom reports to know whether a table can be joined or not in order to eg know which dimensions can be combined etc. I suppose it might not be too important for now.

This comment has been minimized.

@sgiehl

sgiehl Feb 7, 2019

Author Member

So how to proceed here now? Adding the table or not?

This comment has been minimized.

@sgiehl

sgiehl Feb 8, 2019

Author Member

@tsteur Did some more testing on this. Seems its needed to add required tables up the hierarchy. Otherwise archiving for all reports including conversions fails, as it's not possible to join the custom log table with log_conversion when using a segment for the custom log table. Will merge the other branch here. Would be good if you could do another review then. Will also try to add some more specific tests

@tsteur

This comment has been minimized.

Copy link
Member

tsteur commented Feb 3, 2019

Those PRs are always tricky :) We would need to run all system tests of all premium features and for custom reports on top the integration tests to make sure everything will still work fine. (I suppose it will as they mainly define arrays in the archiving anyway so this logic should be ignored). Can you run them or should I?

@sgiehl

This comment has been minimized.

Copy link
Member Author

sgiehl commented Feb 4, 2019

@tsteur Updated the branch. Regarding running the tests. Would be nice if you could run them as well. Would need to set up a new vm to do that properly, as my current setup uses php 5.5. Some of the plugins have failing system tests due to sorting issues...

$otherJoins = $logTable->getWaysToJoinToOtherLogTables();
foreach ($otherJoins as $joinTable => $column) {
if($availableLogTable->getName() == $joinTable) {
$join = sprintf("%s.%s = %s.%s", $table, $column, $availableLogTable->getName(), $column);

This comment has been minimized.

@tsteur

tsteur Feb 5, 2019

Member

I'm not 100% sure it looks like this is also used in line 131:


                        $table['joinOn'] = $this->findJoinCriteriasForTables($logTable, $availableLogTables);

when dealing with arrays. but in the above code in addMissingTablesNeededForJoins() we don't add missing tables when it is an array? not sure if you know what I mean, or if it's an issue ... but may need to be looked at and probably needs a test. currently we're adding only needed tables when table is defined as a string log_xyz but not through an array array('table' => 'log_xyz').

@sgiehl sgiehl added this to the 3.9.0 milestone Feb 7, 2019

@sgiehl sgiehl force-pushed the improvejoins branch 6 times, most recently from 387e48a to 848b50d Feb 8, 2019

@sgiehl

This comment has been minimized.

Copy link
Member Author

sgiehl commented Feb 9, 2019

@tsteur I've now implemented the automatic join up the hierarchy if it's joinable with log_visit or log_action. To prove that works as expected I've added a new plugin ExampleLogTables which includes two custom log tables. One that joins to log_visit using the user_id column, and another log table that joins the other custom table using another custom column. Also there is a custom dimension/segment for both log tables.
The plugin includes various system tests to check that all APIs/archiving still work when using a segment of one of the custom log tables.

$otherTableJoins = $table->getWaysToJoinToOtherLogTables();
foreach ($otherTableJoins as $logTable => $column) {
$this->addMissingTablesForOtherTableJoin($logTable, $tableName);

This comment has been minimized.

@tsteur

tsteur Feb 11, 2019

Member

Feel free to ignore but in theory we would maybe only add the tables actually needed. Eg if there's log_form_foo and it can be joined with log_form_bar and log_form_baz, then we would maybe only add log_form_bar if this leads us to another log table that we can resolve? It's bit edge case and complicated to solve... the same way as figuring out if a table can be resolved or not... it's a rare feature and so far there's always only one dependency I think so it can be ignored probably until maybe needed or when there's a problem.

This comment has been minimized.

@sgiehl

sgiehl Feb 11, 2019

Author Member

So you mean if getWaysToJoinToOtherLogTables defines multiple ways to join to different tables?
I'm currently kind of expecting that there is only one possibility defined.

return;
}
$this->implicitTableDependencies[$table] = [$dependentTable];

This comment has been minimized.

@tsteur

tsteur Feb 11, 2019

Member

Not sure I understand, can't there be multiple dependencies?

This comment has been minimized.

@sgiehl

sgiehl Feb 11, 2019

Author Member

In theory there could be multiple, but actually I'm only adding those dependencies, that $table is in the list after the static defined ones. Otherwise it can happen that a custom table is sorted before e.g. log_visit, which breaks the query. (Removing this would also break the tests I've added)

sgiehl added some commits Jan 31, 2019

@sgiehl sgiehl force-pushed the improvejoins branch from 63af4f2 to c94396d Feb 15, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment