Question: How do percentages work with non-sequential IDs? #5

Closed
eric opened this Issue Aug 27, 2012 · 19 comments

Comments

Projects
None yet
6 participants
@eric
Contributor

eric commented Aug 27, 2012

If you are using MySQL with application replication, you'll often have something like this in your my.cnf:

auto-increment-increment = 10
auto-increment-offset    = 3 # this is different on each database

which means that you won't have an even distribution of users being enabled if you say id % 100 < percentage.

@ryanlecompte

This comment has been minimized.

Show comment Hide comment
@ryanlecompte

ryanlecompte Aug 27, 2012

Does the rollout gem have the same problem?

https://github.com/jamesgolick/rollout/blob/master/lib/rollout.rb#L97

On Mon, Aug 27, 2012 at 1:33 PM, Eric Lindvall notifications@github.comwrote:

If you are using MySQL with application replication, you'll often have
something like this in your my.cnf:

auto-increment-increment = 10
auto-increment-offset = 3 # this is different on each database

which means that you won't have an even distribution of users being
enabled if you say id % 100 < percentage.


Reply to this email directly or view it on GitHubhttps://github.com/jnunemaker/flipper/issues/5.

Does the rollout gem have the same problem?

https://github.com/jamesgolick/rollout/blob/master/lib/rollout.rb#L97

On Mon, Aug 27, 2012 at 1:33 PM, Eric Lindvall notifications@github.comwrote:

If you are using MySQL with application replication, you'll often have
something like this in your my.cnf:

auto-increment-increment = 10
auto-increment-offset = 3 # this is different on each database

which means that you won't have an even distribution of users being
enabled if you say id % 100 < percentage.


Reply to this email directly or view it on GitHubhttps://github.com/jnunemaker/flipper/issues/5.

@eric

This comment has been minimized.

Show comment Hide comment
@eric

eric Aug 27, 2012

Contributor

It does.

Contributor

eric commented Aug 27, 2012

It does.

@jnunemaker

This comment has been minimized.

Show comment Hide comment
@jnunemaker

jnunemaker Aug 27, 2012

Owner

Yep, this is definitely not perfect. I've used this exact logic on apps with millions of users and though their ids were sequential, the active set of user ids was not and it worked fine. I tend to think of it not as exact, but more a general idea.

That said, I'm totally open to something that is doesn't impact performance and works better.

Owner

jnunemaker commented Aug 27, 2012

Yep, this is definitely not perfect. I've used this exact logic on apps with millions of users and though their ids were sequential, the active set of user ids was not and it worked fine. I tend to think of it not as exact, but more a general idea.

That said, I'm totally open to something that is doesn't impact performance and works better.

@eric

This comment has been minimized.

Show comment Hide comment
@eric

eric Aug 27, 2012

Contributor

The problem is that if you have something like the example I have, you'll have 0 users active until you hit 30% and then you'll have all users active...

Contributor

eric commented Aug 27, 2012

The problem is that if you have something like the example I have, you'll have 0 users active until you hit 30% and then you'll have all users active...

@jnunemaker

This comment has been minimized.

Show comment Hide comment
@jnunemaker

jnunemaker Aug 27, 2012

Owner

That would suck. :)

Any ideas?

On Aug 27, 2012, at 4:50 PM, Eric Lindvall notifications@github.com wrote:

The problem is that if you have something like the example I have, you'll have 0 users active until you hit 30% and then you'll have all users active...


Reply to this email directly or view it on GitHub.

Owner

jnunemaker commented Aug 27, 2012

That would suck. :)

Any ideas?

On Aug 27, 2012, at 4:50 PM, Eric Lindvall notifications@github.com wrote:

The problem is that if you have something like the example I have, you'll have 0 users active until you hit 30% and then you'll have all users active...


Reply to this email directly or view it on GitHub.

@eric

This comment has been minimized.

Show comment Hide comment
@eric

eric Aug 27, 2012

Contributor

It would require having a configuration option for what your increment interval is, but I'm having trouble reasoning out the exact math at the moment.

Contributor

eric commented Aug 27, 2012

It would require having a configuration option for what your increment interval is, but I'm having trouble reasoning out the exact math at the moment.

@bkeepers

This comment has been minimized.

Show comment Hide comment
@bkeepers

bkeepers Aug 28, 2012

Collaborator

What about using a hashing algorithm instead of modulus? I don't know how legitimate implementations do it, but shouldn't be to hard to do a hash on the id, and then see if that hash is in the first x percent.

Collaborator

bkeepers commented Aug 28, 2012

What about using a hashing algorithm instead of modulus? I don't know how legitimate implementations do it, but shouldn't be to hard to do a hash on the id, and then see if that hash is in the first x percent.

@ryanlecompte

This comment has been minimized.

Show comment Hide comment
@ryanlecompte

ryanlecompte Aug 28, 2012

That's actually how I've done it in the past. I basically implemented
Java's String#hashCode implementation in Ruby and then mod based on that. I
don't hash on the database user id, but instead a different unique
identifier for that user.

module StringUtils
  def self.java_hash(txt)
    hash = txt.chars.inject(0) { |hash, c| ((hash << 5) - hash) + c.ord }
    [hash].pack('l').unpack('l').first
  end
end

Side note: If you ever implement Java's hash code implementation in
JavaScript, be careful! JavaScript does not handle mod'ing against negative
numbers well and needs a workaround.

On Mon, Aug 27, 2012 at 5:18 PM, Brandon Keepers
notifications@github.comwrote:

What about using a hashing algorithm instead of modulus? I don't know how
legitimate implementations do it, but shouldn't be to hard to do a hash on
the id, and then see if that hash is in the first x percent.


Reply to this email directly or view it on GitHubhttps://github.com/jnunemaker/flipper/issues/5#issuecomment-8075565.

That's actually how I've done it in the past. I basically implemented
Java's String#hashCode implementation in Ruby and then mod based on that. I
don't hash on the database user id, but instead a different unique
identifier for that user.

module StringUtils
  def self.java_hash(txt)
    hash = txt.chars.inject(0) { |hash, c| ((hash << 5) - hash) + c.ord }
    [hash].pack('l').unpack('l').first
  end
end

Side note: If you ever implement Java's hash code implementation in
JavaScript, be careful! JavaScript does not handle mod'ing against negative
numbers well and needs a workaround.

On Mon, Aug 27, 2012 at 5:18 PM, Brandon Keepers
notifications@github.comwrote:

What about using a hashing algorithm instead of modulus? I don't know how
legitimate implementations do it, but shouldn't be to hard to do a hash on
the id, and then see if that hash is in the first x percent.


Reply to this email directly or view it on GitHubhttps://github.com/jnunemaker/flipper/issues/5#issuecomment-8075565.

@eric

This comment has been minimized.

Show comment Hide comment
@eric

eric Aug 28, 2012

Contributor

I think the idea of using a hash of the value is a great idea.

Careful: using built-in hashcodes are not safe for distributed computing:

http://martin.kleppmann.com/2012/06/18/java-hashcode-unsafe-for-distributed-systems.html

Make sure you use something that is.

Contributor

eric commented Aug 28, 2012

I think the idea of using a hash of the value is a great idea.

Careful: using built-in hashcodes are not safe for distributed computing:

http://martin.kleppmann.com/2012/06/18/java-hashcode-unsafe-for-distributed-systems.html

Make sure you use something that is.

@ryanlecompte

This comment has been minimized.

Show comment Hide comment
@ryanlecompte

ryanlecompte Aug 28, 2012

Great point about hash codes!

On Mon, Aug 27, 2012 at 5:30 PM, Eric Lindvall notifications@github.comwrote:

I think the idea of using a hash of the value is a great idea.

Careful: using built-in hashcodes are not safe for distributed computing:

http://martin.kleppmann.com/2012/06/18/java-hashcode-unsafe-for-distributed-systems.html

Make sure you use something that is.


Reply to this email directly or view it on GitHubhttps://github.com/jnunemaker/flipper/issues/5#issuecomment-8075750.

Great point about hash codes!

On Mon, Aug 27, 2012 at 5:30 PM, Eric Lindvall notifications@github.comwrote:

I think the idea of using a hash of the value is a great idea.

Careful: using built-in hashcodes are not safe for distributed computing:

http://martin.kleppmann.com/2012/06/18/java-hashcode-unsafe-for-distributed-systems.html

Make sure you use something that is.


Reply to this email directly or view it on GitHubhttps://github.com/jnunemaker/flipper/issues/5#issuecomment-8075750.

@jqr

This comment has been minimized.

Show comment Hide comment
@jqr

jqr Aug 28, 2012

Collaborator

Seems like the fix here is a hash function to provide repeatable, but "randomized" distribution before doing the modulus and comparison.

https://github.com/funny-falcon/murmurhash3-ruby could be a good fit.

That said, maybe this should be left up to the individual application because this I suspect this is a pretty uncommon case.

Collaborator

jqr commented Aug 28, 2012

Seems like the fix here is a hash function to provide repeatable, but "randomized" distribution before doing the modulus and comparison.

https://github.com/funny-falcon/murmurhash3-ruby could be a good fit.

That said, maybe this should be left up to the individual application because this I suspect this is a pretty uncommon case.

@jnunemaker

This comment has been minimized.

Show comment Hide comment
@jnunemaker

jnunemaker Aug 28, 2012

Owner

I could certainly make the algorithm pluggable. Allow people to change it and just default to the modulus. The one downside is right now I force id to be an integer. Using non-integers wouldn't work, but that can be changed as well.

Owner

jnunemaker commented Aug 28, 2012

I could certainly make the algorithm pluggable. Allow people to change it and just default to the modulus. The one downside is right now I force id to be an integer. Using non-integers wouldn't work, but that can be changed as well.

@eric

This comment has been minimized.

Show comment Hide comment
@eric

eric Aug 28, 2012

Contributor

I would guess it's more common than you think, and most people just don't realize they're running into it. Anyone doing MySQL replication is going to have this problem if they know it or not.

Contributor

eric commented Aug 28, 2012

I would guess it's more common than you think, and most people just don't realize they're running into it. Anyone doing MySQL replication is going to have this problem if they know it or not.

@bkeepers

This comment has been minimized.

Show comment Hide comment
@bkeepers

bkeepers Aug 28, 2012

Collaborator

What about just using MD5 or similar hashing function? It would give a random distribution in every scenario.

Collaborator

bkeepers commented Aug 28, 2012

What about just using MD5 or similar hashing function? It would give a random distribution in every scenario.

@eric

This comment has been minimized.

Show comment Hide comment
@eric

eric Aug 28, 2012

Contributor

I haven't done any research on how uniform the distribution of the last byte (for instance) of an md5 has is for these purposes, but it's likely it would be fine.

Contributor

eric commented Aug 28, 2012

I haven't done any research on how uniform the distribution of the last byte (for instance) of an md5 has is for these purposes, but it's likely it would be fine.

@reinh

This comment has been minimized.

Show comment Hide comment
@reinh

reinh Sep 6, 2012

Good hashing functions tend to be uniform.

Here's the distribution of the last character of the MD5 hash of each string from "1" to "10000":

0: ===============================================================
1: ================================================================
2: ===========================================================
3: ================================================================
4: ==========================================================
5: ==============================================================
6: =========================================================
7: ================================================================
8: ================================================================
9: ===========================================================
a: ==============================================================
b: =============================================================
c: =================================================================
d: ===============================================================
e: =============================================================
f: ==================================================================

Seems good enough to me.

reinh commented Sep 6, 2012

Good hashing functions tend to be uniform.

Here's the distribution of the last character of the MD5 hash of each string from "1" to "10000":

0: ===============================================================
1: ================================================================
2: ===========================================================
3: ================================================================
4: ==========================================================
5: ==============================================================
6: =========================================================
7: ================================================================
8: ================================================================
9: ===========================================================
a: ==============================================================
b: =============================================================
c: =================================================================
d: ===============================================================
e: =============================================================
f: ==================================================================

Seems good enough to me.

@jnunemaker

This comment has been minimized.

Show comment Hide comment
@jnunemaker

jnunemaker Sep 6, 2012

Owner

Does seem pretty uniform. I'm open to whatever. Someone want to put a pull together?

Owner

jnunemaker commented Sep 6, 2012

Does seem pretty uniform. I'm open to whatever. Someone want to put a pull together?

@jnunemaker

This comment has been minimized.

Show comment Hide comment
@jnunemaker

jnunemaker Sep 6, 2012

Owner

For what it is worth, I like this better. Instead of forcing integers, I can just to_s, then hash, etc.

Owner

jnunemaker commented Sep 6, 2012

For what it is worth, I like this better. Instead of forcing integers, I can just to_s, then hash, etc.

@jnunemaker

This comment has been minimized.

Show comment Hide comment
@jnunemaker

jnunemaker Nov 14, 2012

Owner

This is fixed as of the pull request I merged above.

Owner

jnunemaker commented Nov 14, 2012

This is fixed as of the pull request I merged above.

@jnunemaker jnunemaker closed this Nov 14, 2012

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment