DB Sharding #1522

Closed
lvca opened this issue Jun 13, 2013 · 17 comments

@lvca (Member) commented Jun 13, 2013

Even though auto-sharding is not available yet (it will arrive in 2.0), we could support sharding by using a cross-database RID. The classic RID is:

cluster-id:cluster-position, example #12:433

And the cross-database RID will be:

db:cluster-id:cluster-position, example #demo3:12:433

Now, what should we store as the db part? The database name? That could be a waste of space. It could be a 1-byte id (0-127 databases, or even 0-255 by treating the byte as unsigned), but where should the database list be mapped?
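
A minimal sketch, purely illustrative and not OrientDB API, of how such a cross-database RID could be modeled with a 1-byte database id resolved against a shared database list (all names below are hypothetical):

// Illustrative model of the proposed cross-database RID.
// The 1-byte dbId indexes a shared database list (0-255 entries).
public final class CrossDbRid {
    public final int dbId;              // 1 byte on disk: index into the database list
    public final int clusterId;
    public final long clusterPosition;

    public CrossDbRid(int dbId, int clusterId, long clusterPosition) {
        this.dbId = dbId;
        this.clusterId = clusterId;
        this.clusterPosition = clusterPosition;
    }

    // Renders the RID through the database-id-to-name mapping, e.g. "#demo3:12:433".
    public String toText(String[] dbNames) {
        return "#" + dbNames[dbId] + ":" + clusterId + ":" + clusterPosition;
    }

    public static void main(String[] args) {
        String[] dbNames = { "demo3" };   // the shared database list
        System.out.println(new CrossDbRid(0, 12, 433).toText(dbNames)); // #demo3:12:433
    }
}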

@ghost assigned lvca Jun 13, 2013
@kowalot (Contributor) commented Jun 13, 2013

Do you consider #db to be a database on the same server, or a remote database as well?

@lvca (Member, Author) commented Jun 13, 2013

Both: you provide a URL and OrientDB will do the rest.

@kowalot (Contributor) commented Jun 13, 2013

Cool! Questions in bulk :)
1. Are you going to support cross-database queries, like Gremlin/SELECT/TRAVERSE traversing partially through remote database records?
2. Will I be able to provide the db# at creation time (my custom sharding key)?
3. Will I be able to send a TX_COMMIT containing objects from mixed #dbs to one database?

@lvca (Member, Author) commented Jun 13, 2013

  1. Yes, it will be transparent.
  2. This is an open point. We could use the orientdb-server-config.xml.
  3. No. This would mean implementing a 2-phase commit. This will be supported in the future, but not right now.

@kowalot (Contributor) commented Jun 13, 2013

Great, thanks for the info. I look forward to testing the first workable commits.
Re 2: it would be nice to have it for cases like CustomersA on ServerA or UsersID < 1000000 on ServerA.

@lvca (Member, Author) commented Jun 13, 2013

Exactly. It will be the application's responsibility to find the best way to cluster the data.

@laa (Member) commented Jun 13, 2013

Hi, I will add my five cents.
Two-phase commit is not durable and is slow; I think it is better to use Paxos for this kind of functionality: http://www.cs.utexas.edu/users/lorenzo/corsi/cs380d/past/03F/notes/paxos-simple.pdf

@lvca (Member, Author) commented Jul 8, 2013

Hey guys, I don't like this solution anymore, because it obliges us to migrate (= update) all the RIDs when records move between nodes. I prefer manual sharding support in the current configuration, by specifying that some record clusters are not replicated but rather have a single node that owns the data.

@kowalot (Contributor) commented Jul 9, 2013

Hi Luca, in what case do you expect to move records between nodes? Rebalancing nodes when sharding?

@lvca (Member, Author) commented Jul 9, 2013

Yes, rebalancing. This will be done manually through a new command, but there could be pluggable strategies.

@kowalot (Contributor) commented Jul 9, 2013

This is an interesting view on sharding and repartitioning in SQL Azure.
I started to build my custom sharding based on that pattern, but I decided to wait when I noticed the cross-database feature on your roadmap. Maybe it will inspire you somehow. Cheers.

@lvca (Member, Author) commented Jul 25, 2013

Postponed.

@lvca (Member, Author) commented Aug 5, 2013

Renamed to "DB Sharding"; it has been planned for 1.6.

@lvca (Member, Author) commented Aug 10, 2013

For cross-links I've just created a new issue, "federated databases" (#1597), planned for 2.0.

@lvca (Member, Author) commented Aug 10, 2013

This will allow spanning portions of a database across multiple servers at record-cluster granularity. This works pretty well with OrientDB's object-orientation concepts. Example: this is the default:

Class Customer -> Cluster Customer

But if you assign:

Class Customer -> Clusters [ Customer_USA, Customer_Europe ]

OrientDB will use both clusters when you refer to the class "Customer". With the new sharding feature, if you assign Customer_USA to server 1 and Customer_Europe to server 2, you can work against the generic "Customer" class or against single clusters.

You could also work only at the class level by creating subclasses:

Customer <|--- Customer_USA, Customer_Europe

In this case OrientDB automatically creates 3 different clusters. So these queries:

select from customer -> will retrieve customers from both servers
select from customer_usa -> will retrieve only the customers from the USA server

So this isn't auto-sharding, because you have to design how to split the database efficiently, but it will remain the best solution even when auto-sharding is ready, because it's very hard to create a generic algorithm that shards the database well across all the different use cases.
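
For example, a rough sketch of the subclass approach using the OrientDB Java document API (the server URL, database name and credentials below are assumptions for illustration, not part of this proposal):

import java.util.List;

import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.record.impl.ODocument;
import com.orientechnologies.orient.core.sql.OCommandSQL;
import com.orientechnologies.orient.core.sql.query.OSQLSynchQuery;

public class CustomerShardingSketch {
    public static void main(String[] args) {
        // Hypothetical connection details.
        ODatabaseDocumentTx db =
                new ODatabaseDocumentTx("remote:localhost/demo").open("admin", "admin");
        try {
            // Base class plus one subclass (and therefore one cluster) per shard.
            db.command(new OCommandSQL("CREATE CLASS Customer")).execute();
            db.command(new OCommandSQL("CREATE CLASS Customer_USA EXTENDS Customer")).execute();
            db.command(new OCommandSQL("CREATE CLASS Customer_Europe EXTENDS Customer")).execute();

            // Write into a specific shard by targeting its subclass.
            db.command(new OCommandSQL("INSERT INTO Customer_USA SET name = 'Jay'")).execute();

            // Polymorphic query: spans every cluster of Customer, i.e. every shard.
            List<ODocument> all =
                    db.query(new OSQLSynchQuery<ODocument>("SELECT FROM Customer"));

            // Restricted query: hits only the USA cluster.
            List<ODocument> usaOnly =
                    db.query(new OSQLSynchQuery<ODocument>("SELECT FROM Customer_USA"));

            System.out.println("all: " + all.size() + ", usa: " + usaOnly.size());
        } finally {
            db.close();
        }
    }
}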

Take a look at this tutorial: https://github.com/orientechnologies/orientdb/wiki/Tutorial%3A-Clusters, but imagine that each cluster can reside only on certain servers.

@lvca (Member, Author) commented Aug 14, 2013

OK, we probably need to review replication to offer a simple configuration while keeping a lot of power. I was thinking of changing the default cfg in this way:

{
    "synchronization" : true,
    "clusters" : {
        "internal" : { "synchronization" : false },
        "index" : { "synchronization" : false },
        "ODistributedConflict" : { "synchronization" : false },
        "*" : {
            "synchronization" : true,
            "master" : "$auto",
            "synchReplicas" : { "total" : 2, "nodes" : [] },
            "asynchReplicas" : { "total" : "$all", "nodes" : [] }
        }
    }
}

synchReplicas.total means that 2 nodes must keep the database in synch. The master will be chosen from the list in synchReplicas.nodes, which is the list of node names populated as soon as the server nodes connect. So after the first node, "europe1", starts up, the cfg will change to:

{
    "synchronization" : true,
    "clusters" : {
        "internal" : { "synchronization" : false },
        "index" : { "synchronization" : false },
        "ODistributedConflict" : { "synchronization" : false },
        "*" : {
            "synchronization" : true,
            "master" : "$auto",
            "synchReplicas" : { "total" : 2, "nodes" : ["europe1"] },
            "asynchReplicas" : { "total" : "$all", "nodes" : [] }
        }
    }
}

When the second node, "usa1", joins:

{
    "synchronization" : true,
    "clusters" : {
        "internal" : { "synchronization" : false },
        "index" : { "synchronization" : false },
        "ODistributedConflict" : { "synchronization" : false },
        "*" : {
            "synchronization" : true,
            "master" : "$auto",
            "synchReplicas" : { "total" : 2, "nodes" : ["europe1", "usa1"] },
            "asynchReplicas" : { "total" : "$all", "nodes" : [] }
        }
    }
}

This is because, until the synch node list is full, each joining server takes the synch role. The third server that joins, "europe2", changes the cfg to:

{
    "synchronization" : true,
    "clusters" : {
        "internal" : { "synchronization" : false },
        "index" : { "synchronization" : false },
        "ODistributedConflict" : { "synchronization" : false },
        "*" : {
            "synchronization" : true,
            "master" : "$auto",
            "synchReplicas" : { "total" : 2, "nodes" : ["europe1", "usa1"] },
            "asynchReplicas" : { "total" : "$all", "nodes" : ["europe2"] }
        }
    }
}

This is because the synch node list is already full and asynch is "$all", which means all the remaining nodes.
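
In other words, the rule for a joining node could be modeled roughly like this (a hypothetical sketch of the behaviour described above, not OrientDB code):

import java.util.ArrayList;
import java.util.List;

public class ReplicaAssignmentSketch {
    static final int SYNCH_TOTAL = 2;                      // "synchReplicas.total"
    static List<String> synchNodes = new ArrayList<>();    // "synchReplicas.nodes"
    static List<String> asynchNodes = new ArrayList<>();   // "asynchReplicas.nodes"

    static void join(String nodeName) {
        if (synchNodes.size() < SYNCH_TOTAL)
            synchNodes.add(nodeName);   // node keeps the database in synch
        else
            asynchNodes.add(nodeName);  // replicated asynchronously ("$all" rule)
    }

    public static void main(String[] args) {
        join("europe1");
        join("usa1");
        join("europe2");
        System.out.println("synch:  " + synchNodes);   // [europe1, usa1]
        System.out.println("asynch: " + asynchNodes);  // [europe2]
    }
}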

@lvca (Member, Author) commented Sep 20, 2013

This is the definitive format:

{
    "replication": true,
    "clusters": {
        "internal": {
            "replication": false
        },
        "index": {
            "replication": false
        },
        "ODistributedConflict": {
            "replication": false
        },
        "*": {
            "replication": true,
            "writeQuorum": 2,
            "partitioning": {
                "strategy": "round-robin",
                "default": 0,
                "partitions": [
                    [ "<NEW_NODE>" ]
                ]
            }
        }
    }
}
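
As a side note on "writeQuorum": 2, assuming it has the usual quorum meaning, a write would be acknowledged to the client only once at least 2 replicas have applied it. A trivial illustrative check (not OrientDB code):

public class WriteQuorumSketch {
    // A write succeeds only when the number of acknowledging replicas
    // reaches the configured writeQuorum.
    static boolean isWriteSuccessful(int acks, int writeQuorum) {
        return acks >= writeQuorum;
    }

    public static void main(String[] args) {
        System.out.println(isWriteSuccessful(1, 2)); // false: quorum not reached
        System.out.println(isWriteSuccessful(2, 2)); // true: quorum reached
    }
}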

lvca added a commit that referenced this issue Sep 21, 2013
…eouts based on number of nodes and command types. working to issue #1522 about sharding. Fixed problem on triggers: now are executed on master and remote nodes depending by trigger
lvca closed this as completed Oct 14, 2013