If you are using MySQL with application replication, you'll often have something like this in your my.cnf:
auto-increment-increment = 10
auto-increment-offset = 3 # this is different on each database
which means that you won't get an even distribution of enabled users if the check is id % 100 < percentage.
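To make the skew concrete, here is a rough Ruby sketch. It assumes ids generated by a single writer with the settings above (3, 13, 23, ...); enabled_fraction is a hypothetical helper made up for illustration, not part of any library.

```ruby
# Hypothetical helper: fraction of ids that the plain modulus check enables.
def enabled_fraction(ids, percentage)
  enabled = ids.count { |id| id % 100 < percentage }
  enabled.to_f / ids.size
end

# Ids as one server would hand them out with
# auto-increment-increment = 10 and auto-increment-offset = 3.
ids = (0...1000).map { |n| 3 + 10 * n }

# Compare the requested percentage with what the check actually enables.
[3, 5, 10, 25, 50].each do |percentage|
  actual = enabled_fraction(ids, percentage) * 100
  puts format("%3d%% requested -> %.0f%% enabled", percentage, actual)
end
```

Because every id from that server leaves a remainder ending in 3, the enabled fraction can only move in jumps instead of tracking the requested percentage.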
Yep, this is definitely not perfect. I've used this exact logic on apps with millions of users and though their ids were sequential, the active set of user ids was not and it worked fine. I tend to think of it not as exact, but more a general idea.
That said, I'm totally open to something that doesn't impact performance and works better.
The problem is that if you have something like the example I have, you'll have 0 users active until you hit 30% and then you'll have all users active...
It would require having a configuration option for what your increment interval is, but I'm having trouble reasoning out the exact math at the moment.
What about using a hashing algorithm instead of modulus? I don't know how legitimate implementations do it, but it shouldn't be too hard to hash the id and then check whether that hash falls in the first x percent.
Seems like the fix here is a hash function to provide repeatable, but "randomized" distribution before doing the modulus and comparison.
https://github.com/funny-falcon/murmurhash3-ruby could be a good fit.
That said, maybe this should be left up to the individual application, because I suspect this is a pretty uncommon case.
I could certainly make the algorithm pluggable. Allow people to change it and just default to the modulus. The one downside is right now I force id to be an integer. Using non-integers wouldn't work, but that can be changed as well.
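As a rough idea of what "pluggable" could look like, here is a minimal Ruby sketch. The class name, the strategy: keyword, and the MODULUS constant are all assumptions made up for illustration, not the library's actual API.

```ruby
# Hypothetical gate: NOT the library's real class, just a sketch.
class PercentageOfUsers
  # Default strategy: the plain modulus check (requires an integer id).
  MODULUS = ->(id, percentage) { id % 100 < percentage }

  # strategy is any callable taking (id, percentage) and returning true/false.
  def initialize(percentage, strategy: MODULUS)
    @percentage = percentage
    @strategy = strategy
  end

  def enabled?(id)
    @strategy.call(id, @percentage)
  end
end

# Default behaviour: modulus on the integer id.
gate = PercentageOfUsers.new(25)
gate.enabled?(13) # true, because 13 % 100 < 25
gate.enabled?(93) # false, because 93 % 100 >= 25

# An application could swap in its own strategy, e.g. one that accepts any id.
always_on = PercentageOfUsers.new(25, strategy: ->(_id, _percentage) { true })
always_on.enabled?("any-id") # true
```

Keeping the modulus as the default means existing behaviour would not change unless someone opts in to a different strategy.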
I would guess it's more common than you think, and most people just don't realize they're running into it. Anyone doing MySQL replication is going to have this problem whether they know it or not.
What about just using MD5 or a similar hashing function? It would give a random distribution in every scenario.
I haven't done any research on how uniform the distribution of the last byte (for instance) of an MD5 hash is for these purposes, but it's likely it would be fine.
Good hashing functions tend to be uniform.
Here's the distribution of the last character of the MD5 hash of each string from "1" to "10000":
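For anyone who wants to regenerate those counts, a short Ruby sketch along these lines should do it. It only uses Digest::MD5 from the standard library; the variable names are just for illustration.

```ruby
require "digest/md5"

# Tally the last hex character of the MD5 digest of each string "1".."10000".
counts = Hash.new(0)
(1..10_000).each do |n|
  last_char = Digest::MD5.hexdigest(n.to_s)[-1]
  counts[last_char] += 1
end

# Print one line per hex character, in sorted order.
counts.sort.each do |char, count|
  puts "#{char}: #{count}"
end
```

With 16 possible characters and 10000 inputs, a perfectly even split would be 625 per character.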
Seems good enough to me.
Does seem pretty uniform. I'm open to whatever. Someone want to put a pull together?
For what it is worth, I like this better. Instead of forcing integers, I can just to_s, then hash, etc.
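Roughly, the to_s-then-hash idea could look like the Ruby sketch below. This is an assumption about the shape of the fix, not necessarily the code that was merged; enabled_by_hash? is a hypothetical name, and MD5 is just the hash discussed in this thread.

```ruby
require "digest/md5"

# Hypothetical helper: stringify the id, hash it, then apply the same
# modulus-and-compare check to the hash instead of the raw id.
def enabled_by_hash?(id, percentage)
  bucket = Digest::MD5.hexdigest(id.to_s).to_i(16) % 100
  bucket < percentage
end

# Same skewed ids as a server with increment 10 and offset 3 would produce.
ids = (0...10_000).map { |n| 3 + 10 * n }
enabled = ids.count { |id| enabled_by_hash?(id, 25) }
puts format("%.1f%% enabled", 100.0 * enabled / ids.size)

# Non-integer ids work too, since everything goes through to_s first.
enabled_by_hash?("user-abc", 25)
```

The result for a given id is repeatable, since the hash is deterministic, so a user stays in or out of the enabled set as long as the percentage doesn't change.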
This is fixed as of the pull request I merged above.