Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prune hook_task Table #10741

Closed
2 of 7 tasks
jag3773 opened this issue Mar 16, 2020 · 19 comments
Closed
2 of 7 tasks

Prune hook_task Table #10741

jag3773 opened this issue Mar 16, 2020 · 19 comments
Labels
issue/confirmed Issue has been reviewed and confirmed to be present or accepted to be implemented type/enhancement An improvement of existing functionality

Comments

@jag3773
Copy link

jag3773 commented Mar 16, 2020

  • Gitea version (or commit ref): NA
  • Git version: NA
  • Operating system: NA
  • Database (use [x]): NA
    • PostgreSQL
    • MySQL
    • MSSQL
    • SQLite
  • Can you reproduce the bug at https://try.gitea.io:
    • Yes (provide example URL)
    • No
    • Not relevant
  • Log gist:

Description

The hook_task table doesn't appear to be pruned at any point, so on an active site this table can grow to be very large and it makes loading the edit webhook page quite slow. I'm not sure how the action table relates but that may need to be pruned also?

Possible Solution

One solution I thought of is to allow administrators to set a max number of deliveries to retain per webhook.

Screenshots

MySQL [(none)]> SELECT 
    ->      table_schema as `Database`, 
    ->      table_name AS `Table`, 
    ->      round(((data_length + index_length) / 1024 / 1024), 2) `Size in MB` 
    -> FROM information_schema.TABLES 
    -> ORDER BY (data_length + index_length) DESC;
+--------------------+------------------------------------------------------+------------+
| Database           | Table                                                | Size in MB |
+--------------------+------------------------------------------------------+------------+
| gogs               | hook_task                                            |    1723.52 |
| gogs               | action                                               |     908.06 |
@lafriks
Copy link
Member

lafriks commented Mar 16, 2020

Action table can't really be pruned as it contains valuable activity information, while I agree webhooks could be cleared up by some policy

@lunny lunny added the type/enhancement An improvement of existing functionality label Mar 17, 2020
@lunny
Copy link
Member

lunny commented Mar 17, 2020

At first, we can add button on repository webhook management UI to delete old webhooks.

@jag3773
Copy link
Author

jag3773 commented Mar 17, 2020

@lunny If I understand your suggestion, I don't think that solves the problem. That's only helpful if you have a handful of repositories, in which case you would never have this problem in the first place. Imagine having to click that button on 25,000 repositories!

The only realistic solution in that case is to have a global setting that the admin can control.

@lunny
Copy link
Member

lunny commented Mar 17, 2020

@jag3773 I also think there should be a button on admin panel but it's not conflicted with my idea. The button on admin panel will delete all the repositories in the gitea instance and for a public instance, we should let user chose whether to delete them.

@bhalbright
Copy link
Contributor

Just posting a note I am looking into this issue, thanks!

@stale
Copy link

stale bot commented Jun 13, 2020

This issue has been automatically marked as stale because it has not had recent activity. I am here to help clear issues left open even if solved or waiting for more insight. This issue will be closed if no further activity occurs during the next 2 weeks. If the issue is still valid just add a comment to keep it alive. Thank you for your contributions.

@stale stale bot added the issue/stale label Jun 13, 2020
@lunny lunny added the issue/confirmed Issue has been reviewed and confirmed to be present or accepted to be implemented label Jun 13, 2020
@stale stale bot removed the issue/stale label Jun 13, 2020
@lunny
Copy link
Member

lunny commented Jun 13, 2020

A configuration item could be added to keep recent 1 year(or longer) webhooks. A background go routine could clean them every day.

@bhalbright
Copy link
Contributor

@lunny thanks, I had submitted a PR for an implementation based on deleting all but x webhooks (can be set per repo). Do you have any thoughts there...that was the implementation that worked best for our use case but I can understand if you'd rather have something a little different for general usage.

@lunny
Copy link
Member

lunny commented Jun 18, 2020

Then we could have two choices. One is to delete old finished records, another is to keep some recent unfinished records.

@jag3773
Copy link
Author

jag3773 commented Jun 18, 2020

Sounds like you are both saying similar things. My request is to limit the number of items kept in the history, whether it is date based or count based is not so important.

The advantage of "count based" is that every repo will retain a certain number of recent events. In a date based system, it's likely you would not see any history for a repo that hasn't been used recently.

@bhalbright
Copy link
Contributor

I guess @lunny you are saying we should give the user the option to either purge by "older than x days" or an option like we had suggested "delete all but most recent x entries"? Which should be the default option?

@lunny
Copy link
Member

lunny commented Jun 21, 2020

@bhalbright Yes. That's what I meant.

@jgkirschbaum
Copy link

jgkirschbaum commented Jul 16, 2020

Action table can't really be pruned as it contains valuable activity information, while I agree webhooks could be cleared up by some policy

@lunny @lafriks If I see this correctly, the action table could also be pruned. In my opinion the only function of the action table is to provide data for the dashboard. The timespan for the dashboard is currently 1 year, so data older than one year could be pruned. Our action table is about 1 GB in size after 1 year usage of gitea, which makes up 90% of the total database size.

So it would be perfect if someone could please implement the following features:

  • The timespan displayed of the dashboard should be configurable, so not always a full year has to be displayed (dashboard_display_period).
  • The timespan for historization of the action table should be configurable (action_historization_period (default is dashboard_display_period)).
  • Pruning of the action table should be implemented by a gitea cron task in dependence of the configured value for action_historization_period.

I'm sorry I can't support coding, but I don't speak go.

@lunny
Copy link
Member

lunny commented Jul 16, 2020

I think we could also have a button on admin panel to clean the two tables.

@jgkirschbaum
Copy link

jgkirschbaum commented Jul 16, 2020

Yes, would be a first step.

@zeripath
Copy link
Contributor

So partially this is a problem of database management. When you get into large enough systems those running Gitea are going to need to actually do some DB management themselves and not rely on the ORM creating a perfect DB.

For example here if you were using a postgres 10+ DB backend you could simply PARTITION the action table etc. Similarly for other DB systems.

Throwing away action data is a decision for server managers - I'm not sure that we at Gitea should be running anything that deletes data by default.

However, we can and should do a few things here.

  • I agree putting in some configurabilty about dashboard length/feed length is required.
  • We could provide a cron task that would delete old actions etc. BUT they absolutely cannot run by default

Another option is whether we can store these actions on disk as a sort of hybrid db - however, we are then getting in to the situation of effectively being a DBMS - it's the job of the DB to decide how to partition and manage big tables. I'm not certain how GH or Facebook manage their big tables - I know some people advocate throwing this sort of stuff into a non-relational/NoSQL db depending on the inherent structure within these - and given we mostly don't use the relational features of this data that could work.

One thing we should additionally look at is why this data is updated and not immutable - if it can be made immutable then the hybrid approach may make more sense.

@bhalbright
Copy link
Contributor

@zeripath regarding the cron to delete old actions being OFF by default, would you expect the same for a cron job to delete from hook_task? In the PR I had submitted it was turned on by default globally and then you could turn on/off by repo in the UI.

@jgkirschbaum
Copy link

jgkirschbaum commented Jul 29, 2020

@zeripath

For example here if you were using a postgres 10+ DB backend you could simply PARTITION the action table etc. Similarly for other DB systems.

You are right, that would be simple and effective, but I would not recommend this approach to Gitea admins, because then you have another DDL as the one delivered with Gitea.

However, we can and should do a few things here.

  • I agree putting in some configurabilty about dashboard length/feed length is required.
  • We could provide a cron task that would delete old actions etc. BUT they absolutely cannot run by default

That's great and I appreciate that.

Another option is whether we can store these actions on disk as a sort of hybrid db - however, we are then getting in to the situation of effectively being a DBMS - it's the job of the DB to decide how to partition and manage big tables. I'm not certain how GH or Facebook manage their big tables - I know some people advocate throwing this sort of stuff into a non-relational/NoSQL db depending on the inherent structure within these - and given we mostly don't use the relational features of this data that could work.

IMHO I would advice you not to use plain text files and create a hybrid db of your own. Plain text files are a mess and in a container environment they are a mess and pain. I think the preasure on this topic isn't so high, so making the things configurable paired with a few simple db jobs inside Gitea as you recommended would be sufficient.

One thing we should additionally look at is why this data is updated and not immutable - if it can be made immutable then the hybrid approach may make more sense.

Which data is updated? As far as I know, the data in both the action table and the hook_task table are immutable and are not updated. But as I mentioned before I would not go with an hybrid approach, this introduces unneeded complexity.

Thank you for your efforts.

@zeripath
Copy link
Contributor

@bhalbright in regards to the hook_task table - in some ways that could be argued as being just user visible logging. Actions though are the unique behaviours of the users - it's a bit more than logging.

6543 pushed a commit that referenced this issue Jan 26, 2021
Close **Prune hook_task Table (#10741)**

Added a cron job to delete webhook deliveries in the hook_task table. It can be turned on/off and the schedule controlled globally via app.ini. The data can be deleted by either the age of the delivery which is the default or by deleting the all but the most recent deliveries _per webhook_.

Note: I had previously submitted pr #11416  but I closed it when I realized that I had deleted per repository instead of per webhook. Also, I decided allowing the settings to be overridden via the ui was overkill. Also this version allows the deletion by age which is probably what most people would want.
@lunny lunny closed this as completed Apr 25, 2021
@go-gitea go-gitea locked and limited conversation to collaborators Jun 4, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
issue/confirmed Issue has been reviewed and confirmed to be present or accepted to be implemented type/enhancement An improvement of existing functionality
Projects
None yet
Development

No branches or pull requests

6 participants