Skip to content

Conversation

@Gusted
Copy link
Contributor

@Gusted Gusted commented Aug 9, 2022

  • Before understanding why this patch is here. The actions table stores the actions, this includes commenting on a issue, merging a pull request and pushing a tag etc. However, due to historical (and likely performance) reason, each action(e.g. commenting on a issue) is duplicated for each user(such as the poster, watcher of that issue and repository etc.).
  • This means, if you only specify the repo_id you will end up with a lot of duplicated actions. We fix this by de-duplicating the actions by their created_unix, op_type and act_user_id. So that means if created_unix, op_type and act_user_id is the same it's likely the same action but meant for meant for different users. It's only possible to create a "collision" if a user is able to create a two or more actions under a second.

- Before understanding why this patch is here. The actions table stores
the actions, this includes commenting on a issue, merging a pull
request and pushing a tag etc. However, due to historical (and likely
performance) reason, each action(e.g. commenting on a issue) is
duplicated for each user(such as the poster, watcher of that issue and
repository etc.).
- This means, if you only specify the `repo_id` you will
end up with a lot of duplicated actions. We fix this by de-duplicating
the actions by their `created_unix`. While this isn't a perfect way of
solving this problem, it will do the job for 99%. Only problems will
arise for highly active repositories in which actions are being taken on
the same second.
@Gusted Gusted added this to the 1.18.0 milestone Aug 9, 2022
@Gusted Gusted requested a review from 6543 August 9, 2022 19:25
Gusted pushed a commit to Gusted/gitea that referenced this pull request Aug 9, 2022
- Backport go-gitea#20738
  - Before understanding why this patch is here. The actions table stores the actions, this includes commenting on a issue, merging a pull request and pushing a tag etc. However, due to historical (and likely performance) reason, each action(e.g. commenting on a issue) is duplicated for each user(such as the poster, watcher of that issue and repository etc.).
  - This means, if you only specify the `repo_id` you will end up with a lot of duplicated actions. We fix this by de-duplicating the actions by their `created_unix`. While this isn't a perfect way of solving this problem, it will do the job for 99%. Only problems will arise for highly active repositories in which actions are being taken on the same second.
@Gusted Gusted changed the title Fix duplicated actions on repository feed Fix repeated actions on repository feed Aug 9, 2022
@silverwind
Copy link
Member

Unrelated, but we should probably migrate those unix timestamps in DB to millisecond precision at some point.

@GiteaBot GiteaBot added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label Aug 9, 2022
@CirnoT
Copy link
Contributor

CirnoT commented Aug 9, 2022

Won't this cause issues where branch is deleted when merging? In such cases, both actions are on same second

@Gusted
Copy link
Contributor Author

Gusted commented Aug 9, 2022

Won't this cause issues where branch is deleted when merging? In such cases, both actions are on same second

That would indeed cause such issue, but there's currently no better way of handling these kind of edge-cases. The actions table is fundamentally flawed for this task.

@CirnoT
Copy link
Contributor

CirnoT commented Aug 9, 2022

Not exactly correct. If you want to avoid duplicated actions you should deduplicate them by content AND timestamp.

Also we definitely should do what @silverwind proposed, the sooner the better.

@Gusted
Copy link
Contributor Author

Gusted commented Aug 9, 2022

them by content

It would only cover a handful of more edge-cases, not every action has content. Most are generated on the fly.

@CirnoT
Copy link
Contributor

CirnoT commented Aug 9, 2022

Some other cases that this PR causes issues with:

  • Merge (creates action merge and push on same second)
  • Automation bots (such as Renovate) will usually create branch and PR very quickly (on fast servers it ends up being same second and as such only first action [branch creation] would be shown [while we're more interested in PR creation])
  • Comment and close
  • Any issue/PR modification/creation action composed of multiple steps, such as (creation+label) or (creation+reviewer) or (creation+milestone) and so on)
  • Any composed push, such as pushing new commits to multiple branches or pushing new branch and new tag
  • Review composed of multiple delayed comments (Start review button delays creation of comment till review is submitted)

@CirnoT
Copy link
Contributor

CirnoT commented Aug 9, 2022

It would only cover a handful of more edge-cases, not every action has content. Most are generated on the fly.

By content I meant overall action entity sans for-whom (ie. all columns in type Action except for ID and UserID).

@Gusted
Copy link
Contributor Author

Gusted commented Aug 9, 2022

By content I meant overall action entity sans for-whom (ie. all columns in type Action except for ID and UserID).

I don't think that's the correct way, I'm not even sure if that's efficient for a SQL server. That's more than a hack to fix this.

@CirnoT
Copy link
Contributor

CirnoT commented Aug 10, 2022

I don't think that's the correct way, I'm not even sure if that's efficient for a SQL server. That's more than a hack to fix this.

It is the only way to do it properly and not lose data. Usually you'd fetch it from SQL first and then filter it once data is in memory (we already fetch it in order to display it).

I don't think you can fix this by a simple hack, it's a bigger issue that requires migration of entire actions table to be more friendly for such use-cases.

We can't really call these "edge-cases". My daily activity consists 90% of actions that would be affected (see #20738 (comment), mostly review and merge) and considering core nature of actions affected (review, comment, push, merge) it is not something we can play around with freely; users are relying on seeing these actions on feeds.

@wxiaoguang
Copy link
Contributor

Is it covered by tests? I guess the change will produce SQLs like SELECT a, b FROM t GROUP BY a.

IIRC, SELECT a, b FROM t GROUP BY a is not a standard SQL and some database(modes) report error for it, some databases(modes) return undefined results. The standard SQL is SELECT a, some_aggr_func(b) FROM t GROUP BY a.

Correct me if I am wrong or I misunderstood something.

@Gusted
Copy link
Contributor Author

Gusted commented Aug 10, 2022

It is the only way to do it properly and not lose data.

Whatever we decide here, it won't lose data. We don't modify any items in the database we just restrict what we receive.

Usually you'd fetch it from SQL first and then filter it once data is in memory (we already fetch it in order to display it).

However that's not an option here as for larger repository we can end up fetching multiple times just to de-deplicate one action, the database should take care of this.

I don't think you can fix this by a simple hack, it's a bigger issue that requires migration of entire actions table to be more friendly for such use-cases.

Yeah and I'm currently not in to do that as just thinking about it gives me a headache, this will PR does solve the issue with repeated actions and we can deal with a proper long-term fix later on, which requires a complex migration that handles every edge case.

it is not something we can play around with freely; users are relying on seeing these actions on feeds.

I'm sorry but on which feed? We are currently just talking about the repository RSS that is borked(example), I don't think anybody is relying on this currently and if they do they also see "Oh this data is crap and repeated".

Is it covered by tests? I guess the change will produce SQLs like SELECT a, b FROM t GROUP BY a.

IIRC, SELECT a, b FROM t GROUP BY a is not a standard SQL and some database(modes) report error for it, some databases(modes) return undefined results. The standard SQL is SELECT a, some_aggr_func(b) FROM t GROUP BY a.

Correct me if I am wrong or I misunderstood something.

I have no idea, this is why we're using a ORM right? I see many other occurrences of GroupBy. And we only select action.* in the SQL query.

@wxiaoguang
Copy link
Contributor

wxiaoguang commented Aug 10, 2022

create table test (
  a int,
  b int
);
insert into test values (1, 2);
insert into test values (1, 3);

select * from test group by a;

db-fiddle: https://www.db-fiddle.com

MySQL

Query Error: Error: ER_WRONG_FIELD_WITH_GROUP: Expression #2 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'test.test.b' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by

image

PostgreSQL

Query Error: error: column "test.b" must appear in the GROUP BY clause or be used in an aggregate function

image

@Gusted
Copy link
Contributor Author

Gusted commented Aug 10, 2022

So what's the problem? Why does other SQL queries work that just uses GroupBy? And what's the proposed fix?

@wxiaoguang
Copy link
Contributor

wxiaoguang commented Aug 10, 2022

So what's the problem?

The problem is that the non-standard SQL will cause errors

Why does other SQL queries work that just uses GroupBy?

Maybe they were all written correctly? (I did a quick check, yes, all GroupBy I checked were written correctly before)

And what's the proposed fix?

#20738 (comment) , the standard SQL syntax is SELECT a, some_aggr_func(b) FROM t GROUP BY a, all other fields to be selected must be applied with an aggregate function.


Update: I am talking about the SQL only. I have no idea (and haven't thought) about how to improve the RSS feed at the moment. Maybe the solution is more complex than it looks like.

@Gusted
Copy link
Contributor Author

Gusted commented Aug 10, 2022

Hmm, tests are passing and seems like this was already covered by test-cases. I now added a explicit test case for this bug, 11a82cb. Maybe XORM is doing some tricky things to make this work 🤷🏽.

@lunny
Copy link
Member

lunny commented Aug 10, 2022

For a big instance, created_unix will be collision frequently, I don't think this could be accepted.

@Gusted
Copy link
Contributor Author

Gusted commented Aug 10, 2022

For a big instance, created_unix will be collision frequently, I don't think this could be accepted.

I've added op_type & act_user_id. If another person did the same or different action on that repository at the same time with someone else it will not be "de-duped". So unless someone is able to do a certain action within a second, collisions shouldn't occur.

@Gusted Gusted requested review from lunny and wxiaoguang August 10, 2022 02:11
@wxiaoguang
Copy link
Contributor

wxiaoguang commented Aug 10, 2022

Hmm, tests are passing and seems like this was already covered by test-cases. I now added a explicit test case for this bug, 11a82cb. Maybe XORM is doing some tricky things to make this work 🤷🏽.

Standard is standard .... I do not think it's right to do SELECT * FROM t GROUP BY a for all databases.

The unit tests only run with SQLite, SQLite can bear the non-standard SQLs (you can try it on https://sqlite.org/fiddle/).

I just tried to put the code in integration tests and started the CI, I would suppose the tests fail there.


See https://drone.gitea.io/go-gitea/gitea/59003/2/14 and https://drone.gitea.io/go-gitea/gitea/59003/3/8

GetFeeds: Find: Error 1055: Expression #1 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'testgitea.action.id' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by

GetFeeds: Find: pq: column "action.id" must appear in the GROUP BY clause or be used in an aggregate function

@Gusted
Copy link
Contributor Author

Gusted commented Aug 10, 2022

This is beyond my knowledge of SQL, feel free to push commits to make it the SQL standard.

@wxiaoguang
Copy link
Contributor

This is beyond my knowledge of SQL, feel free to push commits to make it the SQL standard.

Sorry but I can not help to modify the SQL at the moment, it comes to the problem I mentioned in #20738 (comment)

I am talking about the SQL only. I have no idea (and haven't thought) about how to improve the RSS feed at the moment. Maybe the solution is more complex than it looks like.

I have no idea about how to help to improve the feed list at the moment. 😂

@Gusted
Copy link
Contributor Author

Gusted commented Aug 10, 2022

Okay I think I might have something, what if we only request the action's id via the group_by query and then use a generic SELECT * FROM table WHERE id IN (query with group by) to get the other columns. db-fiddle seems to say that MySQL & PostgreSQL likes this.

Ex.

create table test (
  a int,
  b int,
  id int
);

insert into test values (1, 2, 1);
insert into test values (1, 3, 2);
insert into test values (2, 3, 3);
insert into test values (2, 7, 4);
insert into test values (1, 3, 5);
insert into test values (5, 3, 6);

select * from test where id in (select min(id) as id from test group by a);

@wxiaoguang
Copy link
Contributor

Yup, that's a common trick to do the select+group. However, when you do group-by, you will have new problems with pagination (the SetSessionPagination, aka COUNT/LIMIT + GROUP BY) in code. So, the solution will be more complex than the problem it looks like.

@Gusted
Copy link
Contributor Author

Gusted commented Aug 10, 2022

Hmm unless this is SQLite acting up again. I'm now at this query and it's producing the correct results:

 SELECT `action`.* FROM `action` INNER JOIN `repository` ON `repository`.id = `action`.repo_id WHERE repo_id=? AND is_deleted=? AND `action`.`id` IN (SELECT min(id) as id FROM action GROUP BY `action`.created_unix, `action`.op_type, `action`.act_user_id) ORDER BY `action`.`created_unix` DESC LIMIT 30 [22 false]

Which in theory should work for MySQL/PostgreSQL. And does produce the correct result on SQLite.

@Gusted
Copy link
Contributor Author

Gusted commented Aug 12, 2022

#20744 Seems to suggest that my awful creation of SQL query still doesn't work? I don't see a lot of options anymore(that doesn't involve calling the database a thousands times in worst-case repositories), I'm also happy to just demolish the feature for now so we all can spend some time engineering a better solution to this problem as currently as-is the feature is just plain broken for repositories feeds.

@CirnoT
Copy link
Contributor

CirnoT commented Aug 12, 2022

(that doesn't involve calling the database a thousands times in worst-case repositories)

RSS feeds should not be so long, there should be a limit (say actions from past 6h).

Also, having every action separate in RSS feed is unhelpful, for stuff like replies to issues there should be a single entry per repository per issue per action (reply, review) that gets updated (RSS allows to update items so that readers show them as unread again). Ideally the title of item would contain something like "X new replies to issue Y", where X is number of unique replies since user last read (on site) that issue.

Basically, I do not think these feeds are useful to anyone, people would want to have feed of their notifications instead.

@Gusted
Copy link
Contributor Author

Gusted commented Aug 12, 2022

RSS feeds should not be so long, there should be a limit (say actions from past 6h).

That's not the issue, even 1 action can be duplicated for hunderds of users. And we need to iterate through them if the database can't de-duplicate for us.

There's a hard limit of 30 actions to be shown in the RSS feed.

Also, having every action separate in RSS feed is unhelpful, for stuff like replies to issues there should be a single entry per repository per issue per action (reply, review) that gets updated (RSS allows to update items so that readers show them as unread again). Ideally the title of item would contain something like "X new replies to issue Y", where X is number of unique replies since user last read (on site) that issue.

Unfortunately that's not the case, current system is stupid so we got to deal with it.

Basically, I do not think these feeds are useful to anyone, people would want to have feed of their notifications instead.

If it wasn't useful it wouldn't have been implemented. See #569

@CirnoT
Copy link
Contributor

CirnoT commented Aug 12, 2022

If it wasn't useful it wouldn't have been implemented. See #569

I am certain that they are useful for someone, but that does not mean they are useful in general to larger audience. The desired use-case was most likely to use them instead of webhooks, as feeds are fetch wheras hooks are push. Maybe it was necessary for someone having air-gapped instance and instead of creating a solution themselves (ie. having separate service receiving webhooks on loopback and exposing them as fetch endpoint) they decided to pollute codebase with features that now will have to be maintained.

In the first place, if feature is in such broken state it should have never been approved and merged, it should've been on the original author to devise a working solution; as it stands now it just adds additional complexity and steals time from maintainers who are now left with a feature they'll have to fix and maintain in future.

Anyway, that is enough for my rant. Regarding PR itself - it does seem like the solution you came up with is no-go, both technically as well as behavior-wise. I have absolutely no idea how to handle this nicely without refactoring entire Action model.

Maybe consider dropping feed support for repos and orgs in general and instead add feed support for:

  • Each issue separately
  • Commits of specific repo
  • Releases of specific repo

No idea what to do with user feed though.

@wxiaoguang
Copy link
Contributor

wxiaoguang commented Aug 13, 2022

#20744 Seems to suggest that SQL query still doesn't work?

The 20744 only collected errors for SQL SELECT * FROM t GROUP BY a. The newly committed sub-query is valid SQL.

I have read the new change: SELECT * FROM t WHERE id IN (SELECT ... GROUP BY ...), there could still be a problem: the action table can be quite large, maybe millions rows for a busy instance, then the builder.Select("min(id) as id").From("action").GroupBy("`action`.created_unix, `action`.op_type, `action`.act_user_id")) is a full-table-scan for every time, it could be very slow.

@Gusted
Copy link
Contributor Author

Gusted commented Aug 13, 2022

is a full-table-scan for every time, it could be very slow.

Confirmed that on larger repositories with quite a lot of actions this is indeed slow. I will close the PR for now and think about some kind of magical migration that will fix this.

@Gusted Gusted closed this Aug 13, 2022
@Gusted
Copy link
Contributor Author

Gusted commented Aug 13, 2022

@CirnoT Please open a new issue for your concerns, this wasn't really the place to discuss such things.

@Gusted Gusted deleted the fix-duplicated-actions-repo-feed branch August 13, 2022 11:32
@zeripath zeripath removed type/bug lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. backport/v1.17 labels Aug 21, 2022
@lunny lunny removed this from the 1.18.0 milestone Oct 26, 2022
@go-gitea go-gitea locked and limited conversation to collaborators May 3, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants