
Support for multiple servers/mirrors #110

Closed
DavidJFelix opened this issue Feb 7, 2013 · 34 comments

@DavidJFelix

Possible enhancement idea I had when rolling out a second Seafile instance on a local machine (my current setup is on Amazon AWS): I thought that a client being able to connect to multiple mirrored servers might be beneficial for data durability and for download speed/network utilization. It's really just a basic idea; I haven't had much time to think about how it might be implemented or any downsides, just thought I'd bring it up for discussion.

@freeplant
Member

The libraries are easy to mirror, while other information stored in the database, such as user info and permission info, is hard to mirror.

For download speed/network utilization, we can mirror the file blocks, and the client can download file blocks from multiple block servers. This feature is implemented but not tested yet.
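The multi-server download described here could be pictured roughly as below. This is a minimal sketch, not Seafile's actual client code; the mirror objects and their `fetch_block` method are assumed stand-ins for the real protocol.

```python
# Sketch of downloading file blocks from multiple mirrored block servers in
# parallel. The server objects and fetch_block() are illustrative stand-ins,
# not Seafile's real API.
from concurrent.futures import ThreadPoolExecutor

def download_blocks(block_servers, block_ids):
    """Spread block requests round-robin across mirrors and fetch in parallel."""
    def fetch(indexed):
        i, block_id = indexed
        server = block_servers[i % len(block_servers)]  # round-robin assignment
        return block_id, server.fetch_block(block_id)

    with ThreadPoolExecutor(max_workers=len(block_servers)) as pool:
        results = dict(pool.map(fetch, enumerate(block_ids)))
    # Reassemble the file from its blocks, in the original order.
    return b"".join(results[b] for b in block_ids)
```

With two mirrors and four blocks, each mirror serves two blocks, so the transfer can speed up noticeably as long as the client's own downlink isn't the bottleneck.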

@blablup

blablup commented Feb 12, 2013

How can I test this feature? What will it do when only one server is reachable?

@freeplant
Member

It can't be tested yet; the config options haven't been added.

Seafile breaks files into blocks. In the future, the Seafile server will be split into one metadata server and multiple block servers, and you will be able to replicate blocks across multiple block servers. The data replication will be asynchronous. If one block server is unreachable, causing some blocks to be missing, the file syncing processes that require the missing blocks will be blocked until the server becomes reachable.

In the current version, the Seafile server acts as both metadata server and block server. The syncing process is already split into two phases, i.e., metadata syncing and block syncing.
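The two-phase process can be sketched as follows; this is only an illustration of the idea, with `fetch_metadata`/`fetch_block` as hypothetical stand-ins for the real protocol. Note how a block that is missing from every reachable mirror (because replication is asynchronous) blocks the sync until some server can serve it.

```python
# Illustrative two-phase sync: metadata first, then blocks.
# All names here are hypothetical, not Seafile's actual API.
import time

def sync_file(metadata_server, block_servers, file_id):
    # Phase 1: metadata syncing -- learn which blocks make up the file.
    block_ids = metadata_server.fetch_metadata(file_id)
    # Phase 2: block syncing -- fetch each block from any mirror that has it.
    blocks = {b: fetch_block_with_retry(block_servers, b) for b in block_ids}
    return [blocks[b] for b in block_ids]

def fetch_block_with_retry(block_servers, block_id, delay=1.0):
    # Replication is asynchronous, so a mirror may not have a block yet;
    # the sync blocks (retries) until some reachable server can serve it.
    while True:
        for server in block_servers:
            block = server.fetch_block(block_id)
            if block is not None:
                return block
        time.sleep(delay)  # wait for replication to catch up, then retry
```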

@DavidJFelix
Author

I understand this is still in the planning phase, but is there a plan for metadata mirrors? Having additional block servers is fine, but redundancy at that level could be accomplished with a local disk mirrored to an offsite NAS. Multiple servers mean a lower likelihood of outages; relying on a single metadata server defeats this purpose, as it's still a single point of failure.

@freeplant
Member

The Seafile metadata server stores its data in the database; the server itself is stateless. If you set up a high-availability solution for the database, the problem is solved.

@DavidJFelix
Author

So say I have a MySQL cluster and two Seafile servers. Is the DB the metadata server, or is it one of the Seafile servers? Would I have to designate one of the servers as a "metadata" server, or does it do that automatically? When the metadata server (if it's a Seafile server, per the answer above) goes down, do I need to detect that and tell the other server to stand in, or can the other server detect the event and handle it on its own? I'm just confused because above it felt like you were implying a master/slave style of configuration, so I'm wondering whether I wrongly inferred that, or whether there is some mechanism (which I don't understand) for a slave/block server to stand in as a master/metadata server.

@freeplant
Member

There are actually two problems, high availability and scalability.

Suppose we store the metadata in MySQL cluster and the file blocks in AWS S3 like distributed storage.

To address high availability, the simplest solution is to run two Seafile servers with different IP addresses. The Seafile client connects to the server via DNS. When one server is down, you can switch the DNS to the second server.

To address scalability, we separate Seafile servers into two groups: metadata servers and block servers. The client keeps a persistent connection with one metadata server and constantly checks whether something needs to be synced. The metadata server checks the database when the client asks.

During syncing, the metadata server reads metadata from the database and sends it to the client, then tells the client the addresses of the block servers. The client connects to the block servers to get blocks. When all file blocks are downloaded, the client updates the local files.

This is the planned architecture. We will implement a Seafile cluster on AWS in the next few months and work out the details.

@freeplant
Member

Another, harder problem is how to deploy Seafile in different data centers to improve availability and performance. Say your company has two offices in two different countries but wants one Seafile deployment that serves both places (appearing as one Seafile to users).

The metadata servers can only be deployed in one data center. The block servers can be deployed in different data centers to speed up block transfers, but blocks can only be replicated asynchronously. This is what I meant by "The data replication will be asynchronous. If one block server is unreachable, causing some blocks to be missing, the file syncing processes that require the missing blocks will be blocked until the server becomes reachable."

@DavidJFelix
Author

Can two metadata servers operate simultaneously, or do you have to use DNS switching? Say I have two metadata servers, cloud1.example.com and cloud2.example.com. My DNS record for cloud.example.com points to the primary metadata server, cloud1. If this server fails, I'll update the DNS record so that it points to cloud2 -- but while the update is propagating, users can still connect if their client knows to contact cloud2. Perhaps when clients connect to the primary metadata server, it could push a configuration update notifying them of the other active metadata servers; the client could then do load balancing and attempt to recover connections on its own.
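The client-side recovery proposed here could look roughly like the sketch below: the client keeps a list of known metadata server endpoints (seeded by the primary) and simply tries the next one when a connection fails. The `connect_any` helper and the endpoint list are hypothetical, not part of the actual client.

```python
# Sketch of client-side failover across several metadata servers.
# Endpoints are (host, port) pairs; the helper name is illustrative only.
import socket

def connect_any(endpoints, timeout=3.0):
    """Try each known metadata server in order; return the first live socket."""
    last_error = None
    for host, port in endpoints:
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError as exc:  # unreachable or refused -> try the next mirror
            last_error = exc
    raise ConnectionError(f"no metadata server reachable: {last_error}")
```

With this approach the client keeps working through a DNS propagation window, since it no longer depends on a single name resolving correctly.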

@freeplant
Member

Two metadata servers can operate simultaneously.

@DavidJFelix
Author

Sweet. So is there anything that's missing or should I just close this?

Recap:

  • Database can be clustered for durability/availability
  • Metadata servers can be run on multiple machines for durability/availability
  • Block servers can be mirrored via LVM RAID 1/NAS for durability
  • Block servers can operate in a redundant fashion for availability (existing feature that needs a config setting)

Anything I missed?

@blablup

blablup commented Mar 5, 2013

I don't know how hard it would be to implement, but for the issue with two data centers, wouldn't it be possible:

  • for the block servers themselves to asynchronously replicate their data to another block server (it would still be asynchronous, I know);
  • to have different metadata servers, with all needed databases replicated to the other data center (I saw that you split things up into different databases; maybe replicate all databases except the config one);
  • for each metadata server to have its own DNS name settings?
    Then all clients would connect only to their local metadata and block server and download/upload files only there.
    Syncing of shared libraries might be stuck in one data center while the block servers replicate, but that is (IMHO) not a problem, because the transfer would succeed once the data is replicated.

If someone needs HA beyond that, it could be added on top in each data center. Or, if possible, maybe even allow a failover metadata/block server, so the client would only connect to the other data center if the local instance is down.

That would also address the scenario where a company needs to deploy some data to many locations: the data could be edited by one employee and read locally in all branch offices.

Just some thoughts I had (and some use cases in my head which could really use this, especially the first feature).

@freeplant freeplant reopened this Mar 6, 2013
@freeplant
Member

I forgot to mention MooseFS. You can store file blocks in MooseFS. It is an easy-to-set-up distributed file system and can provide a simple high-availability solution.

Note that MooseFS is not good at storing lots of small files, so it can't provide a scalability solution for Seafile.

@DavidJFelix
Author

I'm not familiar with MooseFS. Would it have any benefits over configuring your drives as a SAN, passing them to LVM, and then using whatever FS you want on top of an LVM RAID 1 (or 1+0)?

@0xarve

0xarve commented Mar 6, 2013

I don't think Seafile should handle content replication. There are various other good projects, such as OpenStack Swift, that focus purely on this task. I see it as complicating the service if you were to add such functionality into the product directly. Rather, the focus should be on adding more storage backends (local, Swift, S3, etc.).

For metadata, look at adding support for a key/value (NoSQL) based database that scales easily, rather than MySQL or any other relational database engine. Perhaps Couchbase (http://www.couchbase.com).

@freeplant
Member

With MooseFS, users can build a distributed file system for HA using 3-10 ordinary Linux machines. I think SAN is a high-end solution, while MooseFS is a low-end solution.

@freeplant
Member

@maxim We use MySQL because we need transactions and strong consistency; NoSQL is not good at either. I used Cassandra before: it is only eventually consistent and has no transactions.

@blablup

blablup commented Mar 6, 2013

Is it possible for two servers to access the same data pool (via a replicated filesystem like MooseFS/GlusterFS/Ceph...) without a problem?
Then my question in issue #113 (#issuecomment-13516103) should have been answered "yes", because if that is possible I can run another instance of seaf/ccnet on different ports with the same data pool and a shared database.

@killing
Member

killing commented Mar 6, 2013

I think Swift is the most promising open-source storage solution for Seafile. It's object storage with the same interface as S3, and it supports HA and is scalable. I believe Rackspace is using it for Cloud Files, but I'm not sure how reliable it is for production use outside of Rackspace.

MooseFS has only one metadata server, so there is a SPOF. But I think any NAS-like filesystem is good enough for in-house deployment. For internet-scale deployment, it's better to use S3 or Swift.

@killing
Member

killing commented Mar 6, 2013

@blablup Yes, it's completely possible to use shared storage and a shared DB.

@chenull

chenull commented Mar 17, 2013

Besides Swift, Ceph also has distributed block storage (RADOS), and it already has a metadata server and a monitoring server. Is it feasible for Seafile to use the library provided by Ceph? Let Seafile tackle the higher level (web, users, revisions, etc.) while Ceph/librados provides block storage.

@killing
Member

killing commented Mar 18, 2013

@chenull Yes, Seafile already supports rados as a block backend.

@bvleur

bvleur commented Apr 16, 2013

freeplant commented:

This is the planned architecture. We will implement a Seafile cluster on AWS in the next few months and work out the details.

For people watching this issue: it can be implemented successfully (as seen on Seacloud), but on the mailing list I got the disappointing answer:

Blocks and objects are stored in S3. We currently don't plan to open source the backend for S3.

So we will have to duplicate the effort and implement our own backends.

Is anyone else working on that or open to collaboration?

@jackloom

jackloom commented May 2, 2013

Interested!

@Deradon

Deradon commented Jan 19, 2014

Another, harder problem is how to deploy Seafile in different data centers to improve availability and performance. Say your company has two offices in two different countries but wants one Seafile deployment that serves both places (appearing as one Seafile to users).

Is this possible yet?

I'd like to set up Seafile in two different LANs (for sync performance), which themselves should sync with a Seafile instance on the web. Not sure if this is possible at the moment.

@freeplant
Member

Not possible yet.

@freeplant
Member

It is now possible to deploy Seafile in different data centers, by using the Swift storage backend and a MariaDB cluster.

@thorgbarth

Hello freeplant,

this is good news :-)

We are considering implementing this to synchronize three AFP (Mac OS X) file servers at different company sites in Germany, for about 40 users who are frequently mobile with their MacBooks but also frequently in one of the three offices. I like Seafile's ability to pause syncing of every checked-out library for bandwidth management while mobile... but Seafile lacks the LAN sync option found in Dropbox, so a multi-site installation is needed to save bandwidth when 40 users work with larger data sets where nearly a GB changes each day during work hours, over a 10 Mbit/s internet connection. :)

Is there a possibility of getting paid support for such a project? Is this really a good solution for production use? I am thinking of having a server instance on every Mac OS file server, plus one in a datacenter on the web, which would be the "master".

Regards
Thorsten

@xchardon

Also interested. Is there any documentation? And/or any possibility of paid support?
Anything new about the multiple block servers? I mean, without using a distributed file system?

@fossxplorer

Very interesting. Has anyone tested any of the mentioned/discussed solutions, or run them in production?
Thx.

@ftrojahn

I'm interested in this, too.

As a workaround, at the moment, one can tweak the DNS entry of the Seafile server locally.

Example: myserver.mydomain.com is the external address where the Seafile server is reachable online. In my local LAN, I change the IP of myserver.mydomain.com to the server's local, private IP, so synchronization bypasses the online route and uses the local LAN directly. This makes no sense if the Seafile server is only reachable online, e.g., in a datacenter.
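Concretely, this kind of local override is usually done in `/etc/hosts` on the LAN clients (the IP address below is illustrative):

```shell
# On each LAN client, pin the server's public name to its private LAN address
# by appending a line like this to /etc/hosts (illustrative IP):
#
#   192.168.1.20  myserver.mydomain.com
#
# Then check which address the client now resolves:
getent hosts myserver.mydomain.com
```

Remote clients keep using the public DNS record, e.g. the external IP of a firewall that port-forwards to the server, so no reconfiguration is needed when a client moves between networks.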

@geojanm

geojanm commented Nov 9, 2015

@ftrojahn How did you manage to sync different servers?
My infrastructure should have one server that all users work on, but I need to completely duplicate it to another server with a very slow internet connection. All devices connected (locally only) to that permanently slow (or temporarily offline) server should also stay up to date.

@ftrojahn

ftrojahn commented Nov 9, 2015

@geojanm No, my example needs only one server. If the client is local, it uses the IP of the Seafile server on the local LAN; if the client is remote, it gets the online-reachable IP, e.g., the external IP of a firewall that port-forwards to the local server. This does not provide syncing between different offices; it just speeds up local sync when the client is local, and it needs no reconfiguration when the client is remote, since the DNS name of the server is locally faked with a different IP.

Recently I came across the possibility of GlusterFS geo-replication, especially georepsetup. I haven't used it myself yet and don't know whether Seafile works on top of it, but perhaps you could give the combination a try.

@shoeper
Collaborator

shoeper commented Nov 11, 2015

The same should work with Ceph as well, but I'm not sure how well it copes with a slow link between the two locations (this is not specific to Ceph but applies to such a setup in general).
