[Proposal][Discuss] Gitea Cluster #13791

Open · 8 tasks
lunny opened this issue Dec 2, 2020 · 21 comments
Labels
  • type/proposal: The new feature has not been accepted yet but needs to be discussed first.
  • type/summary: This issue aggregates a bunch of other issues.

Comments

@lunny (Member) commented Dec 2, 2020

How does a Gitea deployment scale? A Gitea cluster should resolve part of that.

Currently, when running several Gitea instances that share a database and git storage, there are still some things that need to be resolved.

  • Crons: Currently every Gitea instance runs all the cron jobs. This is duplicated work and wastes CPU and disk. The idea is that the cron tasks should be split across the Gitea instances (a rough sketch follows below this list).
  • Migrating: You cannot stop a running migration task, because you don't know which Gitea instance is running it.
  • Git storage: A shared or replicated git storage is required. Alternatively, every Gitea instance could store only part of the repositories, and incoming requests would be routed to the right Gitea instance. Integrating Gitaly is a possible solution.
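
A rough sketch of how such a cron split could look (purely illustrative; shouldRunCron, instanceIndex and instanceCount are hypothetical names, not existing Gitea code): hash the task name so that each cron job is executed by exactly one instance.

package cron

import "hash/fnv"

// shouldRunCron reports whether this instance should execute the given cron task.
// instanceIndex and instanceCount would come from configuration or from a
// registry table in the shared database (both hypothetical here).
func shouldRunCron(taskName string, instanceIndex, instanceCount int) bool {
  h := fnv.New32a()
  h.Write([]byte(taskName))
  return int(h.Sum32()%uint32(instanceCount)) == instanceIndex
}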

comment by @wxiaoguang

  • If there is no session or IP stickiness, I guess it would trigger database deadlocks more frequently due to transaction conflicts.
  • Actions crons: It also conflicts with Actions cron; there will be multiple duplicate tasks.
  • Packages: It would trigger Docker's duplicate-insert bug again, because there is only a workaround (an in-process mutex) at the moment.
  • UI notifications: I guess "event source" doesn't work with a cluster either.
  • Locks: Some packages depend on the ExclusivePool, which is also in-process now. Create a lock abstract layer and remove old one #22176
@lunny added the type/proposal label Dec 2, 2020
@6543 (Member) commented Dec 2, 2020

  • for cron I propose: cron only creates tasks, which are represented in the DB (like it's done with migration tasks)

  • for tasks: each instance should have a unique ID (GUID); when an instance fetches tasks from the DB, it alters their state by changing the status to running and adding its GUID & PID to the table

  • there must be some way Gitea instances can speak to each other, using the GUID as identifier, to:

    • send cancel/pause/continue "signals"
  • I propose a heartbeat to recover & clean up tasks of crashed Gitea instances:

type Heartbeat struct {
  GUID        int64
  Beat        int64 // unix timestamp of the last heartbeat
  RecoverGUID int64 // zero if nothing crashed
}


On process.Manager creation, start heartbeatFunc():

func heartbeatFunc() {
  for {
    // refresh our own heartbeat
    x.Where("guid = ?", getGUID()).Update(&Heartbeat{Beat: time.Now().Unix()})

    // find instances whose heartbeat timed out and that nobody has started to recover yet
    var crashed []Heartbeat
    x.Where("beat < ? AND recover_guid = 0", time.Now().Add(-timeout).Unix()).Find(&crashed)

    for _, crash := range crashed {
      // try to claim the recovery for this instance
      x.Where("guid = ?", crash.GUID).Update(&Heartbeat{RecoverGUID: getGUID()})

      // make sure no other instance has taken the recovery step
      if has, _ := x.Exist(&Heartbeat{GUID: crash.GUID, RecoverGUID: getGUID()}); !has {
        continue
      }

      // now reset all tasks owned by crash.GUID
      resetTasks(crash.GUID)
    }

    time.Sleep(20 * time.Second)
  }
}

The tasks in modules/task will need to be refactored to have an easy interface:
task.Signal(task.CANCEL, guid, pid) // if guid is not the running instance's, send the signal to that specific instance ...
task.Run(t *task)
...

@lafriks (Member) commented Dec 2, 2020

Some kind of git storage layer would be needed IMHO (something like GitLab has)

@6543 (Member) commented Dec 2, 2020

I would focus on tasks, since git data via shared storage works quite well at the moment

@lunny (Member, Author) commented Dec 3, 2020

I would focus on tasks, since git data via shared storage works quite well at the moment

It does, but in fact it's expensive. So a distributed git data storage layer will still be a necessary feature of Gitea in the future.

@Codeberg-org (Contributor) commented Dec 3, 2020

I would focus on tasks, since git data via shared storage works quite well at the moment

+1

Safe distributed/concurrent gitea is surely the highest priority from a user point of view, as off-the-shelf options for distributed SQL databases and distributed file systems are readily available.

@lunny closed this as completed Dec 3, 2020
@lunny reopened this Dec 3, 2020
@6543 (Member) commented Feb 3, 2021

Roadmap:

  1. master election
  2. logging & communication for the processManager
  3. tasks

master election

done by the DBMS: whichever instance gets its SQL select-update query in first wins (see the sketch below)

  • needs a heartbeat table in the DB
  • GUID creation in process.GetManager()
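
A minimal sketch of that DBMS-side election, assuming a hypothetical single-row leader table (id, guid, beat); whichever instance's conditional UPDATE matches first becomes (or stays) the master:

import (
  "database/sql"
  "time"
)

// tryBecomeLeader claims (or renews) leadership. Only one instance can win,
// because the WHERE clause only matches when the previous leader's heartbeat
// timed out or when we already are the leader.
func tryBecomeLeader(db *sql.DB, myGUID string, timeout time.Duration) (bool, error) {
  res, err := db.Exec(
    `UPDATE leader SET guid = ?, beat = ? WHERE id = 1 AND (beat < ? OR guid = ?)`,
    myGUID, time.Now().Unix(), time.Now().Add(-timeout).Unix(), myGUID,
  )
  if err != nil {
    return false, err
  }
  n, err := res.RowsAffected()
  return n == 1, err
}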

~7 msg types

  • CANCEL - cancel processes
  • ACK
  • LIST - get running processes
  • IMRUNNING
  • PING - are you alive
  • STATUS - get status of a specific process
  • CREATE - opt. create a process on another instance

msg communication

some sort of https://nats.io/, https://activemq.apache.org/cross-language-clients, ... or over the DB, Redis, ... ?
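
If Redis were chosen, a minimal pub/sub sketch with the go-redis client could look like this (the channel naming and the "CANCEL:<pid>" payload format are made up for illustration):

import (
  "context"
  "fmt"

  "github.com/redis/go-redis/v9"
)

// listenSignals subscribes to this instance's own channel; other instances
// address it by its GUID.
func listenSignals(ctx context.Context, rdb *redis.Client, myGUID string) {
  sub := rdb.Subscribe(ctx, "gitea-signal-"+myGUID)
  for msg := range sub.Channel() {
    fmt.Println("received signal:", msg.Payload) // e.g. "CANCEL:1234"
  }
}

// sendCancel asks the instance identified by targetGUID to cancel a process.
func sendCancel(ctx context.Context, rdb *redis.Client, targetGUID, pid string) error {
  return rdb.Publish(ctx, "gitea-signal-"+targetGUID, "CANCEL:"+pid).Err()
}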

sidenotes

  • only the master should start cron jobs
  • ? trigger webhooks ?

@gary-mazz commented Mar 18, 2021

Interesting discussion. I think this started back in 2017 #2959.

There needs to be recognition of 2 cluster use cases: Load Balancing and High Availability (HA) with 2 types of location configurations: local and remote.

The more distant the cluster participants, the more data handling shifts from synchronous (near real-time) to delayed, creating a spectrum of data synchronization quality levels from highly consistent to eventually consistent.

Technologies picked should be able to operate at a distance as well as on local premises without reconfiguration. Secure communication via tunneling and certificate-based authentication between nodes should also be considered.

The "tricky part" is figuring out where to put the replication. Since gitea supports multiple databases, and each employs different and incompatible replication mechanisms, a formalized middle-ware layer is likely required to replicate data. The mid-layer replications also allows different db backend configuration (eg postgresql and Mysql) to provide transparent replication.

Replication will need some type of lockout strategy for check-in/check-out and zip operations during replication activity. The options are:

  1. Lock out client access until replication activity is complete.
  2. Lock out replication updates (cache operations) until client activity is complete.
  3. Fail client operations if replication updates touch files involved in client operations.
  4. Delay/pause client operations until replication activities and status checks are complete (for remote site failover and load balancing).

With remote site load balancing, it is possible to have check-in collisions causing inconsistencies. The use cases that cause these conditions are:

  1. system clocks fall out of sync between servers (both local and remote locations)
  2. remote site load balancing loses replication network connection(s) (two-headed monster)
  3. normal networking and load delays cause race conditions between servers (occurs in both local and remote configurations)

I hope this helps with some of your design decisions.

PS: don't forget config file change pushes.

@lafriks (Member) commented Mar 22, 2021

We would probably also need some kind of git repository access layer so that repositories could be distributed across the cluster with local storage.

@imacks commented Jun 3, 2021

Just want to contribute my own experience using Gitea for the last couple of years.

Our first attempt was to run dockerized Gitea in kube, with the storage back end provided by NFS. We rely on the kube healthcheck to restart an unresponsive Gitea instance, which can run on any tainted host managed by kube. This solves the reliability issue somewhat, though there is a period of unavailability while the container restarts.

Our v2 setup swaps out NFS for ceph CSI in kube. R/W performance improves dramatically. We also use the S3 compat layer in ceph to store LFS data.

My most pressing desire for v3 is HA. We can be less ambitious and work on a single local cluster first. There could be a dedicated pod for running cron tasks, so Gitea can concentrate on doing git and webserver stuff. We could also use S3 exclusively for storage, for its sync capabilities.

@viceice (Contributor) commented Jan 4, 2023

Our v2 setup swaps out NFS for ceph CSI in kube. R/W performance improves dramatically. We also use the S3 compat layer in ceph to store LFS data.

Do you have some hints on moving from NFS to ceph CSI? I'd like to test out the performance. I already use S3 (minio) for all other Gitea storage.

@piamo commented Mar 10, 2023

Our v2 setup swaps out NFS for ceph CSI in kube. R/W performance improves dramatically.

Will there be concurrency problems when using Ceph CSI, since there is no file lock protection?

@imacks commented Mar 11, 2023

@piamo no. Only a single instance of gitea runs at any one time, so no locking is necessary. The appropriate ceph volume is auto mounted on whichever host the gitea container runs on. So yeah my setup is not HA, just resilient to host failure.

@piamo commented Apr 21, 2023

@piamo no. Only a single instance of gitea runs at any one time, so no locking is necessary. The appropriate ceph volume is auto mounted on whichever host the gitea container runs on. So yeah my setup is not HA, just resilient to host failure.

@imacks But if two or more concurrent requests try to change the same repo, a lock is still necessary.

@harryzcy (Contributor) commented May 2, 2023

I think one immediate step for Gitea would be to allow limiting instances to read-only operations and disabling cron, to somewhat achieve high availability. Many parts can already be deployed in an HA way:

  • database: depends on the replication of the database itself, e.g. when using Postgres replication
  • git storage: NFS, ceph, or longhorn are possible solutions (but ReadWriteMany and ReadWriteOnce may have drastically different performance)
  • session: a redis cluster already handles that

What we need right now is to allow disabling cron jobs; then Gitea can be deployed in a cluster with ReadWriteMany storage for git objects. To support ReadWriteOnce storage, the files need to be replicated by Gitea instead of the storage provider. Then Gitea must have a read-only mode, and those replicas need to pull changes from the master instance. In this case, the read-only operations should be identified so that a load balancer can route traffic properly.
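
A minimal sketch of how a read-only replica could identify and refuse mutating traffic (the readOnly flag and this middleware are hypothetical, not an existing Gitea setting):

import "net/http"

// readOnlyMiddleware lets safe methods through and rejects mutating requests
// on a replica, so a load balancer can keep writes on the master instance.
func readOnlyMiddleware(readOnly bool, next http.Handler) http.Handler {
  return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
    switch r.Method {
    case http.MethodGet, http.MethodHead, http.MethodOptions:
      next.ServeHTTP(w, r)
    default:
      if readOnly {
        http.Error(w, "this replica is read-only", http.StatusServiceUnavailable)
        return
      }
      next.ServeHTTP(w, r)
    }
  })
}

Git pushes over smart HTTP arrive as POST requests to git-receive-pack, so they would be rejected on the replica as intended.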

After we have done the above step, we could try to find a leader election protocol so that a replica can be promoted to master if the master is down. This would be the second step.

Only after we have done that can we start to split cron jobs across multiple instances. I think this is more complicated than the first two steps above.

@pat-s (Member) commented May 2, 2023

Just FYI, we have an active WIP for a Gitea-HA setup in the helm-chart going on right now: https://gitea.com/gitea/helm-chart/pulls/437

It is based on Postgres-HA, a RWX file system, and redis-cluster.
I think that using RWX storage solves some part of the leader-election logic with respect to tasks and communication.

The only thing that is still a real issue is the duplicated cron executions. The biggest problem would be that both instances do the same thing at the exact same moment and therefore crash.
I haven't tested it in practice yet though.

Maybe implementing a random offset/sleep could help in the first place, to at least ensure proper functionality? Even if all jobs were still executed redundantly, it would at least allow us to make some initial progress.
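
A minimal sketch of that random offset idea (runWithJitter and the job function are hypothetical): each instance sleeps a random amount before starting a job, which reduces, but does not eliminate, the chance of two instances running it at the exact same moment.

import (
  "math/rand"
  "time"
)

// runWithJitter delays a cron job by a random offset before executing it.
func runWithJitter(maxJitter time.Duration, job func()) {
  time.Sleep(time.Duration(rand.Int63n(int64(maxJitter))))
  job()
}

// usage, with a made-up job name:
//   runWithJitter(30*time.Second, updateMirrors)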

@lunny (Member, Author) commented May 3, 2023

There are in fact still some locks besides cron that need to be refactored, see #22176
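
For illustration, a minimal sketch of what such a lock abstraction could look like (the interface and names are hypothetical, not the API proposed in #22176): single-node deployments keep an in-process implementation, and clusters plug in a Redis- or DB-backed one behind the same interface.

import (
  "context"
  "sync"
)

// Locker is an abstract lock; a cluster deployment would provide a distributed
// implementation with the same signature.
type Locker interface {
  Lock(ctx context.Context, key string) (unlock func(), err error)
}

// memoryLocker is the in-process variant, roughly what ExclusivePool does today.
type memoryLocker struct {
  mu    sync.Mutex
  locks map[string]*sync.Mutex
}

func (l *memoryLocker) Lock(ctx context.Context, key string) (func(), error) {
  l.mu.Lock()
  if l.locks == nil {
    l.locks = map[string]*sync.Mutex{}
  }
  m, ok := l.locks[key]
  if !ok {
    m = &sync.Mutex{}
    l.locks[key] = m
  }
  l.mu.Unlock()
  m.Lock()
  return m.Unlock, nil
}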

@wxiaoguang (Contributor):

  • If there is no session or IP stickiness, I guess it would trigger database deadlocks more frequently due to transaction conflicts.
  • It conflicts with Actions cron; there will be multiple duplicate tasks (related to the cron/task problem above).
  • It would trigger Docker's duplicate-insert bug again, because there is only a workaround (an in-process mutex) at the moment.
  • I guess "eventsource" doesn't work with a cluster either.
  • Some packages depend on the ExclusivePool, which is also in-process now (mentioned above).

@pat-s (Member) commented May 15, 2023

I don't know what the "docker's duplicate insert bug" is here, and all the other points are also somewhat unclear in terms of severity. I think we need to check and find out in the end.

And to test all of them, we need a (functional) HA cluster first to test on.

I can provide an instance for testing if needed. Are you interested, @wxiaoguang @lunny? I could also give you access to the k8s namespace so you can explore the pods yourself.

On the other hand, I wonder if this could also be set up and tested using the project funds? A terraform setup which destroys everything again after testing is not a big deal. And the helm chart logic for an HA setup is ready.

@lunny (Member, Author) commented May 16, 2023

I think most problems here are obvious at the code level. Maybe we will find more when we start testing. Thank you for your idea about the testing infrastructure. When we need it, we can discuss it. But for now, there are so many problems that maybe we should begin by starting some discussions or sending some PRs.

@wxiaoguang (Contributor) commented May 16, 2023

I don't know what the "docker's duplicate insert bug" is here, and all the other points are also somewhat unclear in terms of severity. I think we need to check and find out in the end.

Context:

I can provide an instance for testing if needed. Are you interested? I could also give you access to the k8s namespace so you can explore the pods yourself.

I am interested, however, I have a quite long TODO list and many new PRs:

So I don't think I have the bandwidth at the moment.

@prskr (Contributor) commented Nov 9, 2023

I didn't check everything in the code so far, but I think something like https://github.com/hibiken/asynq could help with the cron issues?

For the shared repo access I was actually wondering why not try to abstract it, e.g. with an S3-compatible storage, and use something like redlock to synchronize access to repositories. I'd even assume concurrent reads should be fine? It's only about consistency when writing to a repository (presumably)?
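
A minimal sketch of that redlock idea with the go-redsync library (the "repo-lock:" key naming is made up for illustration; reads would simply skip the lock):

import (
  "github.com/go-redsync/redsync/v4"
  "github.com/go-redsync/redsync/v4/redis/goredis/v9"
  goredislib "github.com/redis/go-redis/v9"
)

// newRedsync builds a redsync instance backed by a single Redis server.
func newRedsync(addr string) *redsync.Redsync {
  client := goredislib.NewClient(&goredislib.Options{Addr: addr})
  return redsync.New(goredis.NewPool(client))
}

// withRepoWriteLock serializes writes to one repository across all instances.
func withRepoWriteLock(rs *redsync.Redsync, repoKey string, write func() error) error {
  mutex := rs.NewMutex("repo-lock:" + repoKey)
  if err := mutex.Lock(); err != nil {
    return err
  }
  defer mutex.Unlock()
  return write()
}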
