Fix repeated actions on repository feed #20738

Gusted · 2022-08-09T19:25:39Z

Some background on why this patch is needed: the actions table stores actions such as commenting on an issue, merging a pull request, or pushing a tag. However, for historical (and likely performance) reasons, each action (e.g. commenting on an issue) is duplicated for each user (such as the poster, and the watchers of that issue and repository).
This means that if you only specify the repo_id you will end up with a lot of duplicated actions. We fix this by de-duplicating the actions by their created_unix, op_type and act_user_id. If created_unix, op_type and act_user_id are all the same, it's likely the same action, just stored for different users. A "collision" is only possible if a user manages to create two or more actions of the same type within the same second.
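A minimal sketch of the duplication described above, run against an in-memory SQLite database. The column names follow the ones mentioned in this thread (`user_id`, `act_user_id`, `repo_id`, `op_type`, `created_unix`); the table is a simplified stand-in for the real one, and the numeric `op_type` values and row data are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the action table (assumption: the real table has more columns).
conn.execute(
    "CREATE TABLE action ("
    " id INTEGER PRIMARY KEY, user_id INT, act_user_id INT,"
    " repo_id INT, op_type INT, created_unix INT)"
)
rows = [
    # One comment by user 7 on repo 22, stored once per recipient (user_id).
    (1, 7, 7, 22, 10, 1660000000),
    (2, 3, 7, 22, 10, 1660000000),
    (3, 4, 7, 22, 10, 1660000000),
    # One push by user 8 a minute later, stored for two recipients.
    (4, 8, 8, 22, 5, 1660000060),
    (5, 3, 8, 22, 5, 1660000060),
]
conn.executemany("INSERT INTO action VALUES (?, ?, ?, ?, ?, ?)", rows)

# Filtering on repo_id alone returns every per-user copy.
raw = conn.execute("SELECT id FROM action WHERE repo_id = 22").fetchall()

# Grouping by (created_unix, op_type, act_user_id) collapses the copies.
deduped = conn.execute(
    "SELECT created_unix, op_type, act_user_id, COUNT(*) AS copies"
    " FROM action WHERE repo_id = 22"
    " GROUP BY created_unix, op_type, act_user_id"
    " ORDER BY created_unix"
).fetchall()

print(len(raw))  # 5 rows for 2 logical actions
print(deduped)   # [(1660000000, 10, 7, 3), (1660000060, 5, 8, 2)]
```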

- Some background on why this patch is needed: the actions table stores actions such as commenting on an issue, merging a pull request, or pushing a tag. However, for historical (and likely performance) reasons, each action (e.g. commenting on an issue) is duplicated for each user (such as the poster, and the watchers of that issue and repository).
- This means that if you only specify the `repo_id` you will end up with a lot of duplicated actions. We fix this by de-duplicating the actions by their `created_unix`. While this isn't a perfect way of solving the problem, it will do the job in 99% of cases. Problems will only arise for highly active repositories in which actions are taken within the same second.


silverwind · 2022-08-09T19:59:30Z

Unrelated, but we should probably migrate those unix timestamps in DB to millisecond precision at some point.

CirnoT · 2022-08-09T23:39:52Z

Won't this cause issues when a branch is deleted on merge? In such cases, both actions happen in the same second.

Gusted · 2022-08-09T23:41:25Z

> Won't this cause issues when a branch is deleted on merge? In such cases, both actions happen in the same second.

That would indeed cause such an issue, but there's currently no better way of handling these kinds of edge cases. The actions table is fundamentally flawed for this task.

CirnoT · 2022-08-09T23:45:12Z

Not exactly correct. If you want to avoid duplicated actions you should deduplicate them by content AND timestamp.

Also we definitely should do what @silverwind proposed, the sooner the better.

Gusted · 2022-08-09T23:48:17Z

> them by content

It would only cover a handful more edge cases; not every action has content. Most are generated on the fly.

CirnoT · 2022-08-09T23:48:42Z

Some other cases that this PR causes issues with:

- Merge (creates a merge action and a push action in the same second)
- Automation bots (such as Renovate) usually create a branch and a PR very quickly. On fast servers both land in the same second, so only the first action (branch creation) would be shown, while we're more interested in the PR creation.
- Comment and close
- Any issue/PR modification or creation composed of multiple steps, such as (creation+label), (creation+reviewer), (creation+milestone) and so on
- Any composed push, such as pushing new commits to multiple branches, or pushing a new branch and a new tag
- A review composed of multiple delayed comments (the Start review button delays creation of the comments until the review is submitted)
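The first case in the list can be sketched as follows, assuming a simplified action table and made-up `op_type` labels (`merge_pull_request`, `commit_repo`). It only illustrates how grouping on `created_unix` alone folds two different actions from the same second into one entry.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the action table; op_type is text here purely for readability.
conn.execute(
    "CREATE TABLE action ("
    " id INTEGER PRIMARY KEY, user_id INT, act_user_id INT,"
    " op_type TEXT, created_unix INT)"
)
rows = [
    # A merge and the push it triggers, same actor, same second, two recipients each.
    (1, 7, 7, "merge_pull_request", 1660000000),
    (2, 3, 7, "merge_pull_request", 1660000000),
    (3, 7, 7, "commit_repo", 1660000000),
    (4, 3, 7, "commit_repo", 1660000000),
]
conn.executemany("INSERT INTO action VALUES (?, ?, ?, ?, ?)", rows)

# Number of feed entries when grouping by the timestamp only.
by_second = conn.execute(
    "SELECT COUNT(*) FROM (SELECT 1 FROM action GROUP BY created_unix)"
).fetchone()[0]

# Number of distinct (op_type, actor, timestamp) combinations actually present.
logical = conn.execute(
    "SELECT COUNT(*) FROM"
    " (SELECT 1 FROM action GROUP BY created_unix, op_type, act_user_id)"
).fetchone()[0]

print(by_second, logical)  # 1 2
```

Cases where the same user performs the same kind of action twice within one second (for example two comments) would still collapse under the wider key.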

CirnoT · 2022-08-09T23:51:05Z

> It would only cover a handful of more edge-cases, not every action has content. Most are generated on the fly.

By content I meant overall action entity sans for-whom (ie. all columns in type Action except for ID and UserID).
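A rough sketch of what de-duplicating on everything except ID and UserID could look like once rows are in memory. The `Action` class below is a hypothetical, trimmed-down model with only a few fields, not the real type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    # Hypothetical subset of the real columns, for illustration only.
    id: int
    user_id: int       # recipient of this copy
    act_user_id: int   # who performed the action
    repo_id: int
    op_type: str
    content: str
    created_unix: int


def dedupe(actions):
    """Keep the first copy of each action, comparing every field except id and user_id."""
    seen = set()
    unique = []
    for a in actions:
        key = (a.act_user_id, a.repo_id, a.op_type, a.content, a.created_unix)
        if key not in seen:
            seen.add(key)
            unique.append(a)
    return unique


feed = [
    Action(1, 7, 7, 22, "comment", "first reply", 1660000000),
    Action(2, 3, 7, 22, "comment", "first reply", 1660000000),
    Action(3, 7, 7, 22, "comment", "second reply", 1660000000),
    Action(4, 3, 7, 22, "comment", "second reply", 1660000000),
]
unique_ids = [a.id for a in dedupe(feed)]
print(unique_ids)  # [1, 3]
```

Here two comments made in the same second survive because their content differs. The sketch says nothing about how many rows would have to be fetched to fill a page of unique actions.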

Gusted · 2022-08-09T23:58:14Z

> By content I meant overall action entity sans for-whom (ie. all columns in type Action except for ID and UserID).

I don't think that's the correct way, I'm not even sure if that's efficient for a SQL server. That's more than a hack to fix this.

CirnoT · 2022-08-10T00:05:18Z

> I don't think that's the correct way, I'm not even sure if that's efficient for a SQL server. That's more than a hack to fix this.

It is the only way to do it properly and not lose data. Usually you'd fetch it from SQL first and then filter it once data is in memory (we already fetch it in order to display it).

I don't think you can fix this by a simple hack, it's a bigger issue that requires migration of entire actions table to be more friendly for such use-cases.

We can't really call these "edge-cases". My daily activity consists 90% of actions that would be affected (see #20738 (comment), mostly review and merge) and considering core nature of actions affected (review, comment, push, merge) it is not something we can play around with freely; users are relying on seeing these actions on feeds.

wxiaoguang · 2022-08-10T00:39:29Z

Is it covered by tests? I guess the change will produce SQLs like SELECT a, b FROM t GROUP BY a.

IIRC, SELECT a, b FROM t GROUP BY a is not standard SQL: some databases (or modes) report an error for it, and others return undefined results. The standard SQL is SELECT a, some_aggr_func(b) FROM t GROUP BY a.

Correct me if I am wrong or I misunderstood something.

Gusted · 2022-08-10T00:49:16Z

> It is the only way to do it properly and not lose data.

Whatever we decide here, it won't lose data. We don't modify any items in the database; we just restrict what we receive.

> Usually you'd fetch it from SQL first and then filter it once data is in memory (we already fetch it in order to display it).

However, that's not an option here: for larger repositories we can end up fetching multiple times just to de-duplicate one action. The database should take care of this.

> I don't think you can fix this by a simple hack, it's a bigger issue that requires migration of entire actions table to be more friendly for such use-cases.

Yeah, and I'm currently not up for that, as just thinking about it gives me a headache. This PR does solve the issue with repeated actions, and we can deal with a proper long-term fix later on, which requires a complex migration that handles every edge case.

> it is not something we can play around with freely; users are relying on seeing these actions on feeds.

I'm sorry, but on which feed? We are currently just talking about the repository RSS feed, which is borked (example). I don't think anybody is relying on it currently, and if they do they also see "Oh this data is crap and repeated".

> Is it covered by tests? I guess the change will produce SQLs like SELECT a, b FROM t GROUP BY a.
>
> IIRC, SELECT a, b FROM t GROUP BY a is not a standard SQL and some database(modes) report error for it, some databases(modes) return undefined results. The standard SQL is SELECT a, some_aggr_func(b) FROM t GROUP BY a.
>
> Correct me if I am wrong or I misunderstood something.

I have no idea, this is why we're using an ORM, right? I see many other occurrences of GroupBy. And we only select action.* in the SQL query.

wxiaoguang · 2022-08-10T00:56:06Z

create table test (
  a int,
  b int
);
insert into test values (1, 2);
insert into test values (1, 3);

select * from test group by a;

db-fiddle: https://www.db-fiddle.com

MySQL

Query Error: Error: ER_WRONG_FIELD_WITH_GROUP: Expression #2 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'test.test.b' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by

PostgreSQL

Query Error: error: column "test.b" must appear in the GROUP BY clause or be used in an aggregate function

Gusted · 2022-08-10T01:04:10Z

So what's the problem? Why do other SQL queries that just use GroupBy work? And what's the proposed fix?

wxiaoguang · 2022-08-10T01:07:40Z

> So what's the problem?

The problem is that the non-standard SQL will cause errors.

> Why do other SQL queries that just use GroupBy work?

Maybe they were all written correctly? (I did a quick check: yes, all the GroupBy calls I checked were written correctly.)

> And what's the proposed fix?

See #20738 (comment): the standard SQL syntax is SELECT a, some_aggr_func(b) FROM t GROUP BY a. All other selected fields must be wrapped in an aggregate function.

Update: I am talking about the SQL only. I have no idea (and haven't thought) about how to improve the RSS feed at the moment. Maybe the solution is more complex than it looks like.

Gusted · 2022-08-10T01:23:29Z

Hmm, tests are passing and it seems like this was already covered by test cases. I've now added an explicit test case for this bug, 11a82cb. Maybe XORM is doing some tricky things to make this work 🤷🏽.

lunny · 2022-08-10T01:50:24Z

On a big instance, created_unix will collide frequently; I don't think this can be accepted.

Gusted · 2022-08-10T02:11:27Z

> On a big instance, created_unix will collide frequently; I don't think this can be accepted.

I've added op_type & act_user_id. If another person performs the same or a different action on that repository at the same time as someone else, it will not be "de-duped". So unless a single user is able to perform the same kind of action more than once within a second, collisions shouldn't occur.

wxiaoguang · 2022-08-10T02:29:38Z

> Hmm, tests are passing and it seems like this was already covered by test cases. I've now added an explicit test case for this bug, 11a82cb. Maybe XORM is doing some tricky things to make this work 🤷🏽.

Standard is standard... I do not think it's right to do SELECT * FROM t GROUP BY a for all databases.

The unit tests only run with SQLite, and SQLite tolerates the non-standard SQL (you can try it on https://sqlite.org/fiddle/).

I just tried putting the code in the integration tests and started the CI; I expect the tests to fail there.

See https://drone.gitea.io/go-gitea/gitea/59003/2/14 and https://drone.gitea.io/go-gitea/gitea/59003/3/8

[Discarded] A dummy test to run CI #20744

GetFeeds: Find: Error 1055: Expression #1 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'testgitea.action.id' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by

GetFeeds: Find: pq: column "action.id" must appear in the GROUP BY clause or be used in an aggregate function
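The SQLite tolerance mentioned above can be reproduced locally with Python's built-in `sqlite3` module, using the same `test` table as the earlier db-fiddle snippet. This is only a sketch of the SQLite side; the MySQL and PostgreSQL errors quoted in this thread are not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table test (a int, b int);
    insert into test values (1, 2);
    insert into test values (1, 3);
    """
)

# Non-standard: b is neither grouped nor aggregated. SQLite accepts this and
# returns one row per group, with b taken from some row of that group.
rows = conn.execute("select * from test group by a").fetchall()
print(len(rows))  # 1

# Standard form: every non-grouped column goes through an aggregate.
standard = conn.execute("select a, min(b) from test group by a").fetchall()
print(standard)  # [(1, 2)]
```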

Gusted · 2022-08-10T02:32:00Z

This is beyond my knowledge of SQL, feel free to push commits to make it the SQL standard.

wxiaoguang · 2022-08-10T02:34:31Z

> This is beyond my knowledge of SQL, feel free to push commits to make it the SQL standard.

Sorry, but I can't help modify the SQL at the moment; it comes back to the problem I mentioned in #20738 (comment):

> I am talking about the SQL only. I have no idea (and haven't thought) about how to improve the RSS feed at the moment. Maybe the solution is more complex than it looks like.

I have no idea how to help improve the feed list at the moment. 😂

Gusted · 2022-08-10T03:27:17Z

Okay, I think I might have something: what if we only request the action's id via the group-by query, and then use a generic SELECT * FROM table WHERE id IN (query with group by) to get the other columns? db-fiddle seems to say that MySQL & PostgreSQL accept this.

Ex.

create table test (
  a int,
  b int,
  id int
);

insert into test values (1, 2, 1);
insert into test values (1, 3, 2);
insert into test values (2, 3, 3);
insert into test values (2, 7, 4);
insert into test values (1, 3, 5);
insert into test values (5, 3, 6);

select * from test where id in (select min(id) as id from test group by a);
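For reference, here is the same example run through Python's built-in `sqlite3` module. An `order by id` is added only so the printed result is deterministic; it is not part of the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table test (a int, b int, id int);
    insert into test values (1, 2, 1);
    insert into test values (1, 3, 2);
    insert into test values (2, 3, 3);
    insert into test values (2, 7, 4);
    insert into test values (1, 3, 5);
    insert into test values (5, 3, 6);
    """
)

# The inner query picks one id per group (the smallest); the outer query then
# fetches the full rows for those ids, so no bare column is mixed with GROUP BY.
result = conn.execute(
    "select * from test"
    " where id in (select min(id) as id from test group by a)"
    " order by id"
).fetchall()
print(result)  # [(1, 2, 1), (2, 3, 3), (5, 3, 6)]
```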

wxiaoguang · 2022-08-10T03:32:47Z

Yup, that's a common trick for combining select and group-by. However, once you use group-by, you run into new problems with pagination in the code (SetSessionPagination, i.e. COUNT/LIMIT combined with GROUP BY). So the solution will be more complex than the problem makes it look.
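One way a COUNT can go wrong next to GROUP BY is sketched below on the `test` table from the previous comment. This is an assumption about the kind of problem meant; how SetSessionPagination actually builds its queries is not shown in this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table test (a int, b int, id int);
    insert into test values (1, 2, 1);
    insert into test values (1, 3, 2);
    insert into test values (2, 3, 3);
    insert into test values (2, 7, 4);
    insert into test values (1, 3, 5);
    insert into test values (5, 3, 6);
    """
)

# COUNT(*) on a grouped query yields one count per group, not one total.
per_group = conn.execute(
    "select count(*) from test group by a order by a"
).fetchall()
print(per_group)  # [(3,), (2,), (1,)]

# To count de-duplicated rows, the grouped query has to be wrapped first.
total = conn.execute(
    "select count(*) from (select min(id) from test group by a)"
).fetchone()[0]
print(total)  # 3

# LIMIT/OFFSET then apply to the outer, already de-duplicated result.
page_two = conn.execute(
    "select * from test"
    " where id in (select min(id) from test group by a)"
    " order by id limit 2 offset 2"
).fetchall()
print(page_two)  # [(5, 3, 6)]
```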

Gusted · 2022-08-10T03:49:51Z

Hmm unless this is SQLite acting up again. I'm now at this query and it's producing the correct results:

 SELECT `action`.* FROM `action` INNER JOIN `repository` ON `repository`.id = `action`.repo_id WHERE repo_id=? AND is_deleted=? AND `action`.`id` IN (SELECT min(id) as id FROM action GROUP BY `action`.created_unix, `action`.op_type, `action`.act_user_id) ORDER BY `action`.`created_unix` DESC LIMIT 30 [22 false]

Which in theory should work for MySQL/PostgreSQL. And does produce the correct result on SQLite.

Gusted · 2022-08-12T17:46:25Z

#20744 seems to suggest that my awful creation of a SQL query still doesn't work? I don't see many options anymore (none that avoid calling the database thousands of times for worst-case repositories). I'm also happy to just demolish the feature for now so we can all spend some time engineering a better solution to this problem, because as it stands the feature is just plain broken for repository feeds.

CirnoT · 2022-08-12T18:00:59Z

> (that doesn't involve calling the database a thousands times in worst-case repositories)

RSS feeds should not be so long, there should be a limit (say actions from past 6h).

Also, having every action separate in RSS feed is unhelpful, for stuff like replies to issues there should be a single entry per repository per issue per action (reply, review) that gets updated (RSS allows to update items so that readers show them as unread again). Ideally the title of item would contain something like "X new replies to issue Y", where X is number of unique replies since user last read (on site) that issue.

Basically, I do not think these feeds are useful to anyone, people would want to have feed of their notifications instead.

Gusted · 2022-08-12T18:12:11Z

> RSS feeds should not be so long, there should be a limit (say actions from past 6h).

That's not the issue: even one action can be duplicated for hundreds of users, and we need to iterate through all of them if the database can't de-duplicate for us.

There's a hard limit of 30 actions shown in the RSS feed.

> Also, having every action separate in RSS feed is unhelpful, for stuff like replies to issues there should be a single entry per repository per issue per action (reply, review) that gets updated (RSS allows to update items so that readers show them as unread again). Ideally the title of item would contain something like "X new replies to issue Y", where X is number of unique replies since user last read (on site) that issue.

Unfortunately that's not the case; the current system is stupid, so we have to deal with it.

> Basically, I do not think these feeds are useful to anyone, people would want to have feed of their notifications instead.

If it wasn't useful it wouldn't have been implemented. See #569

CirnoT · 2022-08-12T22:09:53Z

> If it wasn't useful it wouldn't have been implemented. See #569

I am certain that they are useful for someone, but that does not mean they are useful to a larger audience. The desired use case was most likely to use them instead of webhooks, as feeds are pull-based whereas hooks are push-based. Maybe it was necessary for someone with an air-gapped instance who, instead of creating a solution themselves (i.e. a separate service receiving webhooks on loopback and exposing them as a fetch endpoint), decided to pollute the codebase with features that will now have to be maintained.

In the first place, if feature is in such broken state it should have never been approved and merged, it should've been on the original author to devise a working solution; as it stands now it just adds additional complexity and steals time from maintainers who are now left with a feature they'll have to fix and maintain in future.

Anyway, that is enough for my rant. Regarding PR itself - it does seem like the solution you came up with is no-go, both technically as well as behavior-wise. I have absolutely no idea how to handle this nicely without refactoring entire Action model.

Maybe consider dropping feed support for repos and orgs in general and instead add feed support for:

- Each issue separately
- Commits of specific repo
- Releases of specific repo

No idea what to do with user feed though.

wxiaoguang · 2022-08-13T01:08:42Z

> #20744 Seems to suggest that SQL query still doesn't work?

The 20744 only collected errors for SQL SELECT * FROM t GROUP BY a. The newly committed sub-query is valid SQL.

I have read the new change, SELECT * FROM t WHERE id IN (SELECT ... GROUP BY ...). There could still be a problem: the action table can be quite large, maybe millions of rows on a busy instance, and the builder.Select("min(id) as id").From("action").GroupBy("`action`.created_unix, `action`.op_type, `action`.act_user_id")) is then a full table scan every time, which could be very slow.
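A rough way to look at this concern on SQLite is `EXPLAIN QUERY PLAN`. The sketch below uses a simplified action table and a hypothetical index named `idx_action_repo_id`; neither is taken from the real schema. It compares the sub-query as written with a variant that also filters on `repo_id`. Planner behaviour on MySQL and PostgreSQL may differ and is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in table and a hypothetical index, for illustration only.
conn.execute(
    "CREATE TABLE action ("
    " id INTEGER PRIMARY KEY, repo_id INT, act_user_id INT,"
    " op_type INT, created_unix INT)"
)
conn.execute("CREATE INDEX idx_action_repo_id ON action (repo_id)")


def plan(sql, params=()):
    """Return the 'detail' column of EXPLAIN QUERY PLAN, joined into one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)


group_by = " GROUP BY created_unix, op_type, act_user_id"

# The sub-query as discussed: no WHERE clause, so every row is visited.
unfiltered = plan("SELECT min(id) FROM action" + group_by)

# Variant restricted to one repository before grouping.
filtered = plan("SELECT min(id) FROM action WHERE repo_id = ?" + group_by, (22,))

print(unfiltered)
print(filtered)
```

On SQLite the first plan reports a scan of the table while the second reports an index search. Whether restricting the sub-query like this would be fast enough in practice was not tested in this thread.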

Gusted · 2022-08-13T11:31:25Z

> is a full-table-scan for every time, it could be very slow.

Confirmed that on larger repositories with quite a lot of actions this is indeed slow. I will close the PR for now and think about some kind of magical migration that will fix this.

Gusted · 2022-08-13T11:32:03Z

@CirnoT Please open a new issue for your concerns, this wasn't really the place to discuss such things.

Gusted added this to the 1.18.0 milestone Aug 9, 2022

Gusted requested a review from 6543 August 9, 2022 19:25

Gusted added type/bug backport/v1.17 labels Aug 9, 2022

Gusted mentioned this pull request Aug 9, 2022

Fix repeated actions on repository feed (#20738) #20739

Closed

Gusted changed the title ~~Fix duplicated actions on repository feed~~ Fix repeated actions on repository feed Aug 9, 2022

GiteaBot added the lgtm/need 2 label (This PR needs two approvals by maintainers to be considered for merging.) Aug 9, 2022

Merge branch 'main' into fix-duplicated-actions-repo-feed

5479f7f

Merge branch 'main' into fix-duplicated-actions-repo-feed

1b46584

Add basic test-case

11a82cb

Use more advanced GroupBy + test case

f4b0d75

Merge branch 'main' into fix-duplicated-actions-repo-feed

f053a9c

Gusted requested review from lunny and wxiaoguang August 10, 2022 02:11

Use standardized SQL

4c90914

wxiaoguang mentioned this pull request Aug 10, 2022

[Discarded] A dummy test to run CI #20744

Closed

Merge branch 'main' into fix-duplicated-actions-repo-feed

2841845

Merge branch 'main' into fix-duplicated-actions-repo-feed

4194b8c

Gusted closed this Aug 13, 2022

Gusted deleted the fix-duplicated-actions-repo-feed branch August 13, 2022 11:32

zeripath removed the type/bug, lgtm/need 2 (This PR needs two approvals by maintainers to be considered for merging.) and backport/v1.17 labels Aug 21, 2022

yan12125 mentioned this pull request Oct 2, 2022

RSS feeds for repositories owned by organizations have duplicate entries #20986

Closed

lunny removed this from the 1.18.0 milestone Oct 26, 2022

go-gitea locked and limited conversation to collaborators May 3, 2023
