Regenerated

commit 930f38c0377e6d4c0dcc30cd8d3dd50323d71727 1 parent f66ea3c
Philip (flip) Kromer authored
2  2010/09/scalable_sampling/index.html
@@ -43,7 +43,7 @@
by: Infochimps Dev Team | posted: September 7th, 2010
</div>
- <h3>Sampling and Random Numbers</h3>&#x000A;<p>Found a really good caveat about using random numbers in a distributed system at the <a href="http://blog.rapleaf.com/dev/2009/08/14/using-random-numbers-in-mapreduce-is-dangerous/">rapleaf blog.</a> It&#8217;s subtle, so I&#8217;ll let you go read it there.</p>&#x000A;<p>Before you even get to such advanced mis-uses of random numbers<sup class="footnote" id="fnr1"><a href="#fn1">1</a></sup>, be sure you should be using them in the first place. People often reach for a <strong>random</strong> mapping what they really want is a <strong>well-mixed</strong> mapping: a function such that similar but distinguishable objects will receive arbitrarily different outcomes. The MD5 hash is an easy way to do this.<sup class="footnote" id="fnr2"><a href="#fn2">2</a></sup></p>&#x000A;<h4>Consistent Shuffling</h4>&#x000A;<p>For example, you can shuffle a set of records by taking the MD5 hash of its primary key. The mixing is &#8220;consistent&#8221;: every run yields the same mixing. If you&#8217;d like it to <strong>not</strong> remain the same, use a salt<sup class="footnote" id="fnr3"><a href="#fn3">3</a></sup>:</p>&#x000A;<pre><code> MD5( [key, salt].join(":") )</code></pre>&#x000A;<p>Runs wich the same salt and data will receive an the same mixing. <em>Good salts_: If you use the job</em>id as salt, different runs will give different shuffles but within each such run, identical input keys are shuffled identically. If you use the source filename + split boundary + a running counter<sup class="footnote" id="fnr4"><a href="#fn4">4</a></sup>, each record will be mixed arbitrarily, but in a way that&#8217;s predicatable acreoss runs. 
<em>Bad Salts</em>: random numbers, timestamps and the hostname + <span class="caps">PID</span> are bad salts, for <a href="http://blog.rapleaf.com/dev/2009/08/14/using-random-numbers-in-mapreduce-is-dangerous/">the reasons given in the rapleaf post.</a></p>&#x000A;<h4>Sampling</h4>&#x000A;<p>To take a 1/n sample from a set of records, take the MD5 hash and emit only records which are zero modulo n. If you have arbitrarily-assigned numeric primary keys you can just modulo n them directly, as long as n is large. In both cases note that you can&#8217;t subsample using this trick.</p>&#x000A;<h3>Uniform-All Sample</h3>&#x000A;<p>Here&#8217;s the wrong way to sample three related tables:</p>&#x000A;<ul>&#x000A; <li>Sample 1/100 users</li>&#x000A; <li>Sample 1/100 products</li>&#x000A; <li>Sample 1/100 transactions</li>&#x000A;</ul>&#x000A;<p>The problem is that none of them will join: for most transactions, you won&#8217;t be able to look up the buyers, sellers or products.<sup class="footnote" id="fnr5"><a href="#fn5">5</a></sup></p>&#x000A;<hr />&#x000A;<h3 style="vertical-align:middle;">Uniform plus Edges (Global-feature preserving) Sample</h3>&#x000A;<p>This is better:</p>&#x000A;<ul>&#x000A; <li>Take all users whose ids hash correctly (n1)</li>&#x000A; <li>Do a join of the transactions with n1</li>&#x000A; <li>Do some joins to get relationships with a user from n1 on the left (and/or) right</li>&#x000A;</ul>&#x000A;<p>However, it&#8217;s computationally harder than doing straight samples of each. The consistent hash answers that problem: just use the same hash on the <strong>foreign key</strong> (in this case, the user_id):</p>&#x000A;<ul>&#x000A; <li>Take all users whose ids hash correctly</li>&#x000A; <li>Take all products whose seller_id hashes correctly</li>&#x000A; <li>Take all transactions whose buyer_id (and/or) seller_id hashes correctly</li>&#x000A;</ul>&#x000A;<p>This gives you a very efficient uniform sample. 
If 4% of your buyers are from Florida, about 4% of the sampled users should be too, and about 4% of the transactions will be from Floridians. (<a href="http://kottke.org/10/05/monday-puzzle-time">Don&#8217;t get careless,</a> though)</p>&#x000A;<p>Some caveats. You don&#8217;t have good control over the sample fraction: your transactions probably obey a long-tail distribution (a few users account for a disproportionate number of transactions), which introduces high variance for the quantity recovered.</p>&#x000A;<p>The sample is also sparse, which can make analysis hard in some contexts. If you sample 1% of buyers, a product with 100 purchases will in general retain 1 buyer. You can&#8217;t test an algortihm that looks for similar products, or measures reputation flow. The problem with joins</p>&#x000A;<h3>Subuniverse (Local-structure preserving) Sample</h3>&#x000A;<p>To do a &#8216;subuniverse&#8217; sample, find a handle for some connected neighborhood of the graph &#8212; say, &#8220;sellers of quilts&#8221;.</p>&#x000A;<ul>&#x000A; <li>Take all users who match the handle (n0)</li>&#x000A; <li>Broaden n0 along all relevant connections: buy- or sell-transactions, sellers of products sold by people in n0, etc. Call this n1_all.</li>&#x000A; <li>Prune n1_all: eliminate entities with very few or very weak ties to n0, and call this n1.</li>&#x000A; <li>Do a join of n1 on the products&#8217;s seller_id. (This requires a join, but since n1 is &#8216;only&#8217; a few million rows, you can do a fairly efficient map-side (aka fragment-replicate) join)</li>&#x000A; <li>Do some joins of n1 on the transactions, keeping those with a member of n1 on the left (and/or) right.</li>&#x000A;</ul>&#x000A;<p>You want highly similar features in n0, or n1 will get too large. 
&#8220;People from Denver&#8221; would be a bad handle for a shopping site, a decent handle for a fantasy football site.</p>&#x000A;<p>Here&#8217;s the same thing for our favorite network graph:</p>&#x000A;<ul>&#x000A; <li>Take all users who match the handle (n0)</li>&#x000A; <li>Broaden n0 along all relevant connections: for example atsign, follow, topic usage, etc &#8211; call this n1_all</li>&#x000A; <li>Prune n1_all: elminate users with very few or very weak ties to n0, and call this n1.</li>&#x000A; <li>Do a join of n1 on the tweets. (Note that, since n1 is &#8216;only&#8217; a few million rows, you can do a map-side aka fragment-replicate join, which is actually quite efficient)</li>&#x000A; <li>Do some joins of n1 to get relationships with a sample user on the left (and/or) right</li>&#x000A;</ul>&#x000A;<p>For example, the subuniverse we typically work with is &#8220;users who have mentioned @infochimps, hadoop, opendata, or bigdata&#8221;. We chose this handle for a few reasons (besides the &#8220;we are big dorks&#8221;). Since we infochimps land in there, it&#8217;s easy to inspect the results of an experiment against a familiar object (ourselves). It also gives very correlated edges: many such people also follow each other, use other similar terms, etc. Without this correlation, we&#8217;d span too much of the graph.</p>&#x000A;<p>Within the subuniverse, we can happily do joins, calculate trstrank, and examine local community structure.</p>&#x000A;<p>Of course, the sample is heavily skewed by its handle. There&#8217;s the obvious way: among people who mention &#8216;hadoop&#8217;, conference planning is easy, dating is unfortunately hard. 
More importantly, no matter what handle you use the subuniverse will be heavily biased towards the &#8216;core&#8217; of the graph:</p>&#x000A;<ul>&#x000A; <li>twitter users with millions of followers are going to land in almost any given subuniverse</li>&#x000A; <li>the trstrank of any given subuniverse is going to be vastly higher than the whole graph average</li>&#x000A; <li>Since real-world dynamic graphs typically densify over time (more roads are built, you follow more people on twitter), a subuniverse sample will have disproportionately few recent nodes.</li>&#x000A;</ul>&#x000A;<h3>Connectivity-preserving Sample</h3>&#x000A;<p>There&#8217;s one other type of sample you might like to do: one that preserves the global connectivity of edges.</p>&#x000A;<ul>&#x000A; <li>For each edge, record the total degree of the node at it&#8217;s ends (deg_a, deg_b).</li>&#x000A; <li>Stream through all the edges and with a probability of ( f * ( 1/deg_a + 1/deg_b )), keep the edge.</li>&#x000A;</ul>&#x000A;<p>(The parameter f adjusts the fraction of edges sampled.) In this equation, a node with one inbound link has a high chance of survival. On average, each node will have f inbound and f outbound links survive.</p>&#x000A;<p>This also in general retains all nodes: with f ~ 0.5, a 1B-edge graph on 100m nodes will come out with about 100m edges and 100m nodes. 
You&#8217;ll have to turn f down pretty far for a significant number of nodes to start failing the binomial trial at each end.</p>&#x000A;<p>To do this consistently, set g = 1/f and do</p>&#x000A;<pre><code> ( (MD5([node_a_id, node_b_id, 'a', salt].join(":")) % (deg_a * g) = 0) ||&#x000A; (MD5([node_a_id, node_b_id, 'b', salt].join(":")) % (deg_b * g) = 0) )</code></pre>&#x000A;<p>(since deg_a and deg_b may be correlated, we perturb it by adding &#8216;a&#8217; and &#8216;b&#8217; as extra salts)</p>&#x000A;<hr />&#x000A;<h4 style="vertical-align:middle;">Footnotes</h4>&#x000A;<p class="footnote" id="fn1"><a href="#fnr1"><sup>1</sup></a> We seem mostly rid of stupid and/or non-threadsafe RNGs. However, many <span class="caps">UUID</span> implementations (including Java&#8217;s, I think) require a global lock.</p>&#x000A;<p class="footnote" id="fn2"><a href="#fnr2"><sup>2</sup></a> By claiming an MD5 is good for anything I just made a cryptographer cry. So let me hurry to disclaim that you should really be using something-something-<span class="caps">HMAC</span>-whatever. That is &#8212; if you care that this mixing is cryptographically strong, go look up the one that is.</p>&#x000A;<p class="footnote" id="fn3"><a href="#fnr3"><sup>3</sup></a> Make sure to join with a character that can&#8217;t appear in the key (here, &#8216;:&#8217;). Without the separator, key 12 in job 34 and key 123 in job 4 would hash identically.</p>&#x000A;<p class="footnote" id="fn4"><a href="#fnr4"><sup>4</sup></a> These are available as environment variables if you&#8217;re streaming</p>&#x000A;<p class="footnote" id="fn5"><a href="#fnr5"><sup>5</sup></a> Note that you always need to use the <strong>least</strong> significant bytes, because of <a href="http://en.wikipedia.org/wiki/Benford%27s_law">Benford&#8217;s law</a></p>
+ <h3>Sampling and Random Numbers</h3>&#x000A;<p>Found a really good caveat about using random numbers in a distributed system at the <a href="http://blog.rapleaf.com/dev/2009/08/14/using-random-numbers-in-mapreduce-is-dangerous/">rapleaf blog.</a> It&#8217;s subtle, so I&#8217;ll let you go read it there.</p>&#x000A;<p>Before you even get to such advanced mis-uses of random numbers<sup class="footnote" id="fnr1"><a href="#fn1">1</a></sup>, be sure you should be using them in the first place. People often reach for a <strong>random</strong> mapping when what they really want is a <strong>well-mixed</strong> mapping: a function such that similar but distinguishable objects will receive arbitrarily different outcomes. The MD5 hash is an easy way to do this.<sup class="footnote" id="fnr2"><a href="#fn2">2</a></sup></p>&#x000A;<h4>Consistent Shuffling</h4>&#x000A;<p>For example, you can shuffle a set of records by taking the MD5 hash of each record&#8217;s primary key. The mixing is &#8220;consistent&#8221;: every run yields the same mixing. If you&#8217;d like it to <strong>not</strong> remain the same, use a salt<sup class="footnote" id="fnr3"><a href="#fn3">3</a></sup>:</p>&#x000A;<pre><code> MD5( [key, salt].join(":") )</code></pre>&#x000A;<p>Runs with the same salt and data will receive the same mixing. <em>Good salts</em>: If you use the Hadoop job id as salt, different runs will give different shuffles, but within each such run, identical input keys are shuffled identically. If you use the source filename + split boundary + a running counter<sup class="footnote" id="fnr4"><a href="#fn4">4</a></sup>, each record will be mixed arbitrarily, but in a way that&#8217;s predictable across runs. 
<em>Bad Salts</em>: random numbers, timestamps, and the hostname + <span class="caps">PID</span> are bad salts, for <a href="http://blog.rapleaf.com/dev/2009/08/14/using-random-numbers-in-mapreduce-is-dangerous/">the reasons given in that rapleaf post.</a></p>&#x000A;<h4>Sampling</h4>&#x000A;<p>To take a 1/n sample from a set of records, take the MD5 hash and emit only records whose hash is zero modulo n. If you have arbitrarily-assigned numeric primary keys you can just take them modulo n directly, as long as n is large. In both cases note that you can&#8217;t subsample using this trick.</p>&#x000A;<h3>Uniform-All Sample</h3>&#x000A;<p>Here&#8217;s the wrong way to sample three related tables:</p>&#x000A;<ul>&#x000A; <li>Sample 1/100 users</li>&#x000A; <li>Sample 1/100 products</li>&#x000A; <li>Sample 1/100 transactions</li>&#x000A;</ul>&#x000A;<p>The problem is that none of them will join: for most transactions, you won&#8217;t be able to look up the buyers, sellers or products.<sup class="footnote" id="fnr5"><a href="#fn5">5</a></sup></p>&#x000A;<hr />&#x000A;<h3 style="vertical-align:middle;">Uniform plus Edges (Global-feature preserving) Sample</h3>&#x000A;<p>This is better:</p>&#x000A;<ul>&#x000A; <li>Take all users whose ids hash correctly (n1)</li>&#x000A; <li>Do a join of the transactions with n1</li>&#x000A; <li>Do some joins to get relationships with a user from n1 on the left (and/or) right</li>&#x000A;</ul>&#x000A;<p>However, it&#8217;s computationally harder than doing straight samples of each. The consistent hash answers that problem: just use the same hash on the <strong>foreign key</strong> (in this case, the user_id):</p>&#x000A;<ul>&#x000A; <li>Take all users whose ids hash correctly</li>&#x000A; <li>Take all products whose seller_id hashes correctly</li>&#x000A; <li>Take all transactions whose buyer_id (and/or) seller_id hashes correctly</li>&#x000A;</ul>&#x000A;<p>This gives you a very efficient uniform sample. 
If 4% of your buyers are from Florida, about 4% of the sampled users should be too, and about 4% of the transactions will be from Floridians. (<a href="http://kottke.org/10/05/monday-puzzle-time">Don&#8217;t get careless,</a> though.)</p>&#x000A;<p>Some caveats. You don&#8217;t have good control over the sample fraction: your transactions probably obey a long-tail distribution (a few users account for a disproportionate number of transactions), which introduces high variance for the quantity recovered.</p>&#x000A;<p>The sample is also sparse, which can make analysis hard in some contexts. If you sample 1% of buyers, a product with 100 purchases will in general retain 1 buyer. You can&#8217;t test an algorithm that looks for similar products, or measures reputation flow. The join problem from before comes right back.</p>&#x000A;<h3>Subuniverse (Local-structure preserving) Sample</h3>&#x000A;<p>To do a &#8216;subuniverse&#8217; sample, find a handle for some connected neighborhood of the graph &#8212; say, &#8220;sellers of quilts&#8221;.</p>&#x000A;<ul>&#x000A; <li>Take all users who match the handle (n0)</li>&#x000A; <li>Broaden n0 along all relevant connections: buy- or sell-transactions, sellers of products sold by people in n0, etc. Call this n1_all.</li>&#x000A; <li>Prune n1_all: eliminate entities with very few or very weak ties to n0, and call this n1.</li>&#x000A; <li>Do a join of n1 on the products&#8217; seller_id. (This requires a join, but since n1 is &#8216;only&#8217; a few million rows, you can do a fairly efficient map-side (aka fragment-replicate) join)</li>&#x000A; <li>Do some joins of n1 on the transactions, keeping those with a member of n1 on the left (and/or) right.</li>&#x000A;</ul>&#x000A;<p>You want highly similar features in n0, or n1 will get too large. 
&#8220;People from Denver&#8221; would be a bad handle for a shopping site, a decent handle for a fantasy football site.</p>&#x000A;<p>Here&#8217;s the same thing for our favorite network graph:</p>&#x000A;<ul>&#x000A; <li>Take all users who match the handle (n0)</li>&#x000A; <li>Broaden n0 along all relevant connections: for example atsign, follow, topic usage, etc. &#8211; call this n1_all</li>&#x000A; <li>Prune n1_all: eliminate users with very few or very weak ties to n0, and call this n1.</li>&#x000A; <li>Do a join of n1 on the tweets. (Note that, since n1 is &#8216;only&#8217; a few million rows, you can do a map-side aka fragment-replicate join, which is actually quite efficient)</li>&#x000A; <li>Do some joins of n1 to get relationships with a sample user on the left (and/or) right</li>&#x000A;</ul>&#x000A;<p>For example, the subuniverse we typically work with is &#8220;users who have mentioned @infochimps, hadoop, opendata, or bigdata&#8221;. We chose this handle for a few reasons (besides the obvious &#8220;we are big dorks&#8221;). Since we at infochimps land in there, it&#8217;s easy to inspect the results of an experiment against a familiar object (ourselves). It also gives very correlated edges: many such people also follow each other, use other similar terms, etc. Without this correlation, we&#8217;d span too much of the graph.</p>&#x000A;<p>Within the subuniverse, we can happily do joins, calculate trstrank, and examine local community structure.</p>&#x000A;<p>Of course, the sample is heavily skewed by its handle. There&#8217;s the obvious skew: among people who mention &#8216;hadoop&#8217;, conference planning is easy, dating is unfortunately hard. 
More importantly, no matter what handle you use, the subuniverse will be heavily biased towards the &#8216;core&#8217; of the graph:</p>&#x000A;<ul>&#x000A; <li>twitter users with millions of followers are going to land in almost any given subuniverse</li>&#x000A; <li>the trstrank of any given subuniverse is going to be vastly higher than the whole graph average</li>&#x000A; <li>Since real-world dynamic graphs typically densify over time (more roads are built, you follow more people on twitter), a subuniverse sample will have disproportionately few recent nodes.</li>&#x000A;</ul>&#x000A;<h3>Connectivity-preserving Sample</h3>&#x000A;<p>There&#8217;s one other type of sample you might like to do: one that preserves the global connectivity of edges.</p>&#x000A;<ul>&#x000A; <li>For each edge, record the total degree of the nodes at its ends (deg_a, deg_b).</li>&#x000A; <li>Stream through all the edges and keep each edge with probability f * ( 1/deg_a + 1/deg_b ).</li>&#x000A;</ul>&#x000A;<p>(The parameter f adjusts the fraction of edges sampled.) Under this rule, a node with one inbound link has a high chance of survival. On average, each node will have f inbound and f outbound links survive.</p>&#x000A;<p>This also in general retains all nodes: with f ~ 0.5, a 1B-edge graph on 100m nodes will come out with about 100m edges and 100m nodes. 
You&#8217;ll have to turn f down pretty far for a significant number of nodes to start failing the binomial trial at each end.</p>&#x000A;<p>To do this consistently, set g = 1/f and do</p>&#x000A;<pre><code> ( (MD5([node_a_id, node_b_id, 'a', salt].join(":")) % (deg_a * g) == 0) ||&#x000A; (MD5([node_a_id, node_b_id, 'b', salt].join(":")) % (deg_b * g) == 0) )</code></pre>&#x000A;<p>(Since deg_a and deg_b may be correlated, we perturb the two trials by adding &#8216;a&#8217; and &#8216;b&#8217; as extra salts.)</p>&#x000A;<hr />&#x000A;<h4 style="vertical-align:middle;">Footnotes</h4>&#x000A;<p class="footnote" id="fn1"><a href="#fnr1"><sup>1</sup></a> We seem mostly rid of stupid and/or non-threadsafe RNGs. However, many <span class="caps">UUID</span> implementations (including Java&#8217;s, I think) require a global lock.</p>&#x000A;<p class="footnote" id="fn2"><a href="#fnr2"><sup>2</sup></a> By claiming an MD5 is good for anything I just made a cryptographer cry. So let me hurry to disclaim that you should really be using something-something-<span class="caps">HMAC</span>-whatever. That is &#8212; if you care that this mixing is cryptographically strong, go look up the one that is.</p>&#x000A;<p class="footnote" id="fn3"><a href="#fnr3"><sup>3</sup></a> Make sure to join with a character that can&#8217;t appear in the key (here, &#8216;:&#8217;). Without the separator, key 12 in job 34 and key 123 in job 4 would hash identically.</p>&#x000A;<p class="footnote" id="fn4"><a href="#fnr4"><sup>4</sup></a> These are available as environment variables if you&#8217;re streaming.</p>&#x000A;<p class="footnote" id="fn5"><a href="#fnr5"><sup>5</sup></a> Note that you always need to use the <strong>least</strong> significant bytes, because of <a href="http://en.wikipedia.org/wiki/Benford%27s_law">Benford&#8217;s law</a>.</p>
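To make the post's mechanics concrete, here is a minimal Ruby sketch of the salted-MD5 mixing and the 1/n consistent sample; the job-id salt value and the user_* keys are made up for illustration, and the "low 8 hex digits as the integer hash" convention is an assumption, not something the post specifies:

```ruby
require 'digest/md5'

# Consistent mixing: the same key + salt always hashes the same way.
# Join with a separator that can't appear in the key (footnote 3).
def mixed(key, salt)
  Digest::MD5.hexdigest([key, salt].join(":"))
end

# 1/n sample: keep a record iff its mixed key is zero modulo n,
# using the least significant bytes of the digest (footnote 5).
def sampled?(key, salt, n)
  mixed(key, salt)[-8..-1].to_i(16) % n == 0
end

keys   = (1..10_000).map { |i| "user_#{i}" }
salt   = "job_201009070001_0001"   # e.g. the Hadoop job id (made-up value)
sample = keys.select { |k| sampled?(k, salt, 100) }
# roughly 1/100 of the keys, and the identical subset on every run
```

Applying the same predicate to a foreign key (the transactions' buyer_id, say) instead of the primary key is what makes the "uniform plus edges" sample join correctly.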
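Likewise, the deterministic connectivity-preserving test can be sketched as below, under the same assumed hashing convention; the star-graph example at the end is purely illustrative:

```ruby
require 'digest/md5'

# An edge survives if it wins the hash trial at either endpoint.
# A node of degree d keeps each incident edge with chance ~1/(d*g),
# so with g = 1/f about f edges survive per node on average.
def trial(parts, modulus)
  Digest::MD5.hexdigest(parts.join(":"))[-8..-1].to_i(16) % modulus == 0
end

def keep_edge?(node_a_id, node_b_id, deg_a, deg_b, g, salt)
  # the extra 'a'/'b' salts decorrelate the two trials, since
  # deg_a and deg_b may themselves be correlated
  trial([node_a_id, node_b_id, 'a', salt], deg_a * g) ||
    trial([node_a_id, node_b_id, 'b', salt], deg_b * g)
end

# A star graph: hub of degree 1000, leaves of degree 1. With g = 2,
# each leaf-end trial succeeds about half the time, so roughly half
# the edges survive -- degree-1 nodes rarely vanish, as the post says.
kept = (1..1000).count { |i| keep_edge?("hub", "leaf_#{i}", 1000, 1, 2, "s1") }
```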
<div id='disqus_thread'>
<script type='text/javascript'>
//<![CDATA[
2  about.html
@@ -36,7 +36,7 @@
<div id="page">
<div id="content">
<div id="main">
- <div class="content"><h2>About Us</h2>&#x000A;<p>This is the infochimps dev team blog.</p>&#x000A;<p>The last time someone bothered to update this page, we were:</p>&#x000A;<ul>&#x000A; <li>Flip Kromer</li>&#x000A; <li>Dhruv Bansal</li>&#x000A; <li>Jacob Perkins</li>&#x000A; <li>Jesse Crouch</li>&#x000A; <li>David Snyder</li>&#x000A; <li>Chris Howe</li>&#x000A;</ul>&#x000A;<hr />&#x000A;<h4 style="vertical-align:middle;">Find out more:</h4>&#x000A;<ul>&#x000A; <li><a href="http://infochimps.org/">Where we work</a></li>&#x000A; <li><a href="/colophon.html">Colophon</a> for the blog</li>&#x000A; <li>Here&#8217;s a <a href="http://www.youtube.com/watch?v=xp9Gm-aRe5A">Chimpanzee riding on a segway</a></li>&#x000A;</ul></div>
+ <div class="content"><h2>About Us</h2>&#x000A;<p>This is the infochimps dev team blog.</p>&#x000A;<p>We are:</p>&#x000A;<ul>&#x000A; <li>Flip Kromer</li>&#x000A; <li>Dhruv Bansal</li>&#x000A; <li>Jacob Perkins</li>&#x000A; <li>Jesse Crouch</li>&#x000A; <li>David Snyder</li>&#x000A; <li>Chris Howe</li>&#x000A;</ul>&#x000A;<hr />&#x000A;<h4 style="vertical-align:middle;">Find out more:</h4>&#x000A;<ul>&#x000A; <li><a href="http://infochimps.org/">Where we work</a></li>&#x000A; <li><a href="/colophon.html">Colophon</a> for the blog</li>&#x000A; <li>Here&#8217;s a <a href="http://www.youtube.com/watch?v=xp9Gm-aRe5A">Chimpanzee riding on a segway</a></li>&#x000A;</ul></div>
</div>
<div id="sidebar"></div>
</div>
2  atom.xml
@@ -39,7 +39,7 @@
&lt;h4&gt;Consistent Shuffling&lt;/h4&gt;
&lt;p&gt;For example, you can shuffle a set of records by taking the MD5 hash of its primary key. The mixing is &#8220;consistent&#8221;: every run yields the same mixing. If you&#8217;d like it to &lt;strong&gt;not&lt;/strong&gt; remain the same, use a salt&lt;sup class=&quot;footnote&quot; id=&quot;fnr3&quot;&gt;&lt;a href=&quot;#fn3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt; MD5( [key, salt].join(&quot;:&quot;) )&lt;/code&gt;&lt;/pre&gt;
- &lt;p&gt;Runs wich the same salt and data will receive an the same mixing. &lt;em&gt;Good salts_: If you use the job&lt;/em&gt;id as salt, different runs will give different shuffles but within each such run, identical input keys are shuffled identically. If you use the source filename + split boundary + a running counter&lt;sup class=&quot;footnote&quot; id=&quot;fnr4&quot;&gt;&lt;a href=&quot;#fn4&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;, each record will be mixed arbitrarily, but in a way that&#8217;s predicatable acreoss runs. &lt;em&gt;Bad Salts&lt;/em&gt;: random numbers, timestamps and the hostname + &lt;span class=&quot;caps&quot;&gt;PID&lt;/span&gt; are bad salts, for &lt;a href=&quot;http://blog.rapleaf.com/dev/2009/08/14/using-random-numbers-in-mapreduce-is-dangerous/&quot;&gt;the reasons given in the rapleaf post.&lt;/a&gt;&lt;/p&gt;
+ &lt;p&gt;Runs with the same salt and data will receive the same mixing. &lt;em&gt;Good salts&lt;/em&gt;: If you use the Hadoop job id as salt, different runs will give different shuffles, but within each such run, identical input keys are shuffled identically. If you use the source filename + split boundary + a running counter&lt;sup class=&quot;footnote&quot; id=&quot;fnr4&quot;&gt;&lt;a href=&quot;#fn4&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;, each record will be mixed arbitrarily, but in a way that&#8217;s predictable across runs. &lt;em&gt;Bad Salts&lt;/em&gt;: random numbers, timestamps, and the hostname + &lt;span class=&quot;caps&quot;&gt;PID&lt;/span&gt; are bad salts, for &lt;a href=&quot;http://blog.rapleaf.com/dev/2009/08/14/using-random-numbers-in-mapreduce-is-dangerous/&quot;&gt;the reasons given in that rapleaf post.&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;Sampling&lt;/h4&gt;
&lt;p&gt;To take a 1/n sample from a set of records, take the MD5 hash and emit only records which are zero modulo n. If you have arbitrarily-assigned numeric primary keys you can just modulo n them directly, as long as n is large. In both cases note that you can&#8217;t subsample using this trick.&lt;/p&gt;
&lt;h3&gt;Uniform-All Sample&lt;/h3&gt;
2  index.html
@@ -36,7 +36,7 @@
<div id="page">
<div id="content">
<div id="main">
- <div class="content"><div class="blog">&#x000A; <div class="article">&#x000A; <h2><a class="title" href="/2010/09/scalable_sampling">Scalable Sampling</a></h2>&#x000A; <div class="meta">&#x000A; posted: September 7th, 2010&#x000A; &#x000A; </div>&#x000A; <h3>Sampling and Random Numbers</h3>&#x000A;<p>Found a really good caveat about using random numbers in a distributed system at the <a href="http://blog.rapleaf.com/dev/2009/08/14/using-random-numbers-in-mapreduce-is-dangerous/">rapleaf blog.</a> It&#8217;s subtle, so I&#8217;ll let you go read it there.</p>&#x000A;<p>Before you even get to such advanced mis-uses of random numbers<sup class="footnote" id="fnr1"><a href="#fn1">1</a></sup>, be sure you should be using them in the first place. People often reach for a <strong>random</strong> mapping what they really want is a <strong>well-mixed</strong> mapping: a function such that similar but distinguishable objects will receive arbitrarily different outcomes. The MD5 hash is an easy way to do this.<sup class="footnote" id="fnr2"><a href="#fn2">2</a></sup></p>&#x000A;<h4>Consistent Shuffling</h4>&#x000A;<p>For example, you can shuffle a set of records by taking the MD5 hash of its primary key. The mixing is &#8220;consistent&#8221;: every run yields the same mixing. If you&#8217;d like it to <strong>not</strong> remain the same, use a salt<sup class="footnote" id="fnr3"><a href="#fn3">3</a></sup>:</p>&#x000A;<pre><code> MD5( [key, salt].join(":") )</code></pre>&#x000A;<p>Runs wich the same salt and data will receive an the same mixing. <em>Good salts_: If you use the job</em>id as salt, different runs will give different shuffles but within each such run, identical input keys are shuffled identically. If you use the source filename + split boundary + a running counter<sup class="footnote" id="fnr4"><a href="#fn4">4</a></sup>, each record will be mixed arbitrarily, but in a way that&#8217;s predicatable acreoss runs. 
<em>Bad Salts</em>: random numbers, timestamps and the hostname + <span class="caps">PID</span> are bad salts, for <a href="http://blog.rapleaf.com/dev/2009/08/14/using-random-numbers-in-mapreduce-is-dangerous/">the reasons given in the rapleaf post.</a></p>&#x000A;<h4>Sampling</h4>&#x000A;<p>To take a 1/n sample from a set of records, take the MD5 hash and emit only records which are zero modulo n. If you have arbitrarily-assigned numeric primary keys you can just modulo n them directly, as long as n is large. In both cases note that you can&#8217;t subsample using this trick.</p>&#x000A;<h3>Uniform-All Sample</h3>&#x000A;<p>Here&#8217;s the wrong way to sample three related tables:</p>&#x000A;<ul>&#x000A; <li>Sample 1/100 users</li>&#x000A; <li>Sample 1/100 products</li>&#x000A; <li>Sample 1/100 transactions</li>&#x000A;</ul>&#x000A;<p>The problem is that none of them will join: for most transactions, you won&#8217;t be able to look up the buyers, sellers or products.<sup class="footnote" id="fnr5"><a href="#fn5">5</a></sup></p>&#x000A;(<a href="/2010/09/scalable_sampling" class="cont">continued&hellip;</a>)&#x000A; </div>&#x000A; <div class="article">&#x000A; <h2><a class="title" href="/2010/09/firsties">Firsties</a></h2>&#x000A; <div class="meta">&#x000A; posted: September 6th, 2010&#x000A; &#x000A; </div>&#x000A; <h3 style="color:red;">First Post wooooo!!!</h3>&#x000A;<p>Welcome to the new Infochimps dev blog. As opposed to the long-form thoughtful stuff you&#8217;ll find over at our <a href="http://blog.infochimps.org">main blog</a> this will be pure geek thoughtstream. 
Posts might be one line or hundreds, and might contain only formulas or only code.</p>&#x000A;<h4>A word about this blog.</h4>&#x000A;<p>We&#8217;re using the <a href="http://github.com/imathis/octopress">Octopress framework</a> for <a href="http://github.com/mojombo/jekyll">Jekyll.</a> Since octopress required some extinct fork of jekyll to render <span class="caps">HAML</span>, we did <a href="http://github.com/infochimps/infochimps.github.com/tree/master/_plugins">horrible, horrible monkey things</a> to make it render <span class="caps">HAML</span> text and layouts, but not require a special fork of Jekyll.</p>&#x000A;<p>We also use a tiny little sinatra shim inspired by Jesse Storimer&#8217;s <a href="http://jstorimer.com/2009/12/29/jekyll-on-heroku.html">Jekyll on Heroku</a> post. We added <a href="http://github.com/infochimps/infochimps.github.com/blob/master/devblog.rb">two tweaks:</a> one is to allow no-extension permalinks (redirects <code>/2010/09/foo</code> to <code>/2010/09/foo/index.html</code>), the other is to render the custom <a href="/404.html">/404.html</a> page.</p>&#x000A;<p>Get your own copy here:</p>&#x000A;<ul>&#x000A; <li><a href="http://github.com/infochimps/infochimps.github.com">Infochimps Blog Source Code</a></li>&#x000A;</ul>&#x000A;<p>Posts are composed in <a href="http://redcloth.org/textile">Textile</a> using Emacs (except for Jesse, who has some insane Dvorak-inverted notation textmate retroclone thing going).</p>&#x000A; </div>&#x000A; <div class="footer">&#x000A; <a href="/archives.html" title="archives">&laquo; Blog Archives</a>&#x000A; </div>&#x000A;</div></div>
+ <div class="content"><div class="blog">&#x000A; <div class="article">&#x000A; <h2><a class="title" href="/2010/09/scalable_sampling">Scalable Sampling</a></h2>&#x000A; <div class="meta">&#x000A; posted: September 7th, 2010&#x000A; &#x000A; </div>&#x000A; <h3>Sampling and Random Numbers</h3>&#x000A;<p>Found a really good caveat about using random numbers in a distributed system at the <a href="http://blog.rapleaf.com/dev/2009/08/14/using-random-numbers-in-mapreduce-is-dangerous/">rapleaf blog.</a> It&#8217;s subtle, so I&#8217;ll let you go read it there.</p>&#x000A;<p>Before you even get to such advanced mis-uses of random numbers<sup class="footnote" id="fnr1"><a href="#fn1">1</a></sup>, be sure you should be using them in the first place. People often reach for a <strong>random</strong> mapping when what they really want is a <strong>well-mixed</strong> mapping: a function such that similar but distinguishable objects will receive arbitrarily different outcomes. The MD5 hash is an easy way to do this.<sup class="footnote" id="fnr2"><a href="#fn2">2</a></sup></p>&#x000A;<h4>Consistent Shuffling</h4>&#x000A;<p>For example, you can shuffle a set of records by taking the MD5 hash of each record&#8217;s primary key. The mixing is &#8220;consistent&#8221;: every run yields the same mixing. If you&#8217;d like it to <strong>not</strong> remain the same, use a salt<sup class="footnote" id="fnr3"><a href="#fn3">3</a></sup>:</p>&#x000A;<pre><code> MD5( [key, salt].join(":") )</code></pre>&#x000A;<p>Runs with the same salt and data will receive the same mixing. <em>Good salts</em>: If you use the Hadoop job id as salt, different runs will give different shuffles, but within each such run, identical input keys are shuffled identically. If you use the source filename + split boundary + a running counter<sup class="footnote" id="fnr4"><a href="#fn4">4</a></sup>, each record will be mixed arbitrarily, but in a way that&#8217;s predictable across runs. 
<em>Bad Salts</em>: random numbers, timestamps and the hostname + <span class="caps">PID</span> are bad salts, for <a href="http://blog.rapleaf.com/dev/2009/08/14/using-random-numbers-in-mapreduce-is-dangerous/">the reasons given in that rapleaf post.</a></p>&#x000A;<h4>Sampling</h4>&#x000A;<p>To take a 1/n sample from a set of records, take the MD5 hash and emit only records which are zero modulo n. If you have arbitrarily-assigned numeric primary keys you can just take them modulo n directly, as long as n is large. In both cases note that you can&#8217;t subsample using this trick: every record in the sample already hashes to zero modulo n, so re-applying the same test keeps them all.</p>&#x000A;<h3>Uniform-All Sample</h3>&#x000A;<p>Here&#8217;s the wrong way to sample three related tables:</p>&#x000A;<ul>&#x000A; <li>Sample 1/100 users</li>&#x000A; <li>Sample 1/100 products</li>&#x000A; <li>Sample 1/100 transactions</li>&#x000A;</ul>&#x000A;<p>The problem is that none of them will join: for most transactions, you won&#8217;t be able to look up the buyers, sellers or products.<sup class="footnote" id="fnr5"><a href="#fn5">5</a></sup></p>&#x000A;(<a href="/2010/09/scalable_sampling" class="cont">continued&hellip;</a>)&#x000A; </div>&#x000A; <div class="article">&#x000A; <h2><a class="title" href="/2010/09/firsties">Firsties</a></h2>&#x000A; <div class="meta">&#x000A; posted: September 6th, 2010&#x000A; &#x000A; </div>&#x000A; <h3 style="color:red;">First Post wooooo!!!</h3>&#x000A;<p>Welcome to the new Infochimps dev blog. As opposed to the long-form thoughtful stuff you&#8217;ll find over at our <a href="http://blog.infochimps.org">main blog</a>, this will be pure geek thoughtstream. 
Posts might be one line or hundreds, and might contain only formulas or only code.</p>&#x000A;<h4>A word about this blog.</h4>&#x000A;<p>We&#8217;re using the <a href="http://github.com/imathis/octopress">Octopress framework</a> for <a href="http://github.com/mojombo/jekyll">Jekyll.</a> Since Octopress required some extinct fork of Jekyll to render <span class="caps">HAML</span>, we did <a href="http://github.com/infochimps/infochimps.github.com/tree/master/_plugins">horrible, horrible monkey things</a> to make it render <span class="caps">HAML</span> text and layouts, but not require a special fork of Jekyll.</p>&#x000A;<p>We also use a tiny little Sinatra shim inspired by Jesse Storimer&#8217;s <a href="http://jstorimer.com/2009/12/29/jekyll-on-heroku.html">Jekyll on Heroku</a> post. We added <a href="http://github.com/infochimps/infochimps.github.com/blob/master/devblog.rb">two tweaks:</a> one is to allow no-extension permalinks (redirecting <code>/2010/09/foo</code> to <code>/2010/09/foo/index.html</code>), the other is to render the custom <a href="/404.html">/404.html</a> page.</p>&#x000A;<p>Get your own copy here:</p>&#x000A;<ul>&#x000A; <li><a href="http://github.com/infochimps/infochimps.github.com">Infochimps Blog Source Code</a></li>&#x000A;</ul>&#x000A;<p>Posts are composed in <a href="http://redcloth.org/textile">Textile</a> using Emacs (except for Jesse, who has some insane Dvorak-inverted-notation TextMate retroclone thing going).</p>&#x000A; </div>&#x000A; <div class="footer">&#x000A; <a href="/archives.html" title="archives">&laquo; Blog Archives</a>&#x000A; </div>&#x000A;</div></div>
</div>
<div id="sidebar"></div>
</div>
View
6 stylesheets/screen.css
@@ -479,7 +479,7 @@ html body a:visited {
-ms-border-bottom-right-radius: 2px;
-khtml-border-bottom-right-radius: 2px;
border-bottom-right-radius: 2px;
- background: #aaaaaa url('/images/code_bg.png?1283853265') top repeat-x;
+ background: #aaaaaa url('/images/code_bg.png?1283854295') top repeat-x;
position: relative;
margin: 0.3em 0 1.3em;
padding: 0 3px 3px;
@@ -863,7 +863,7 @@ pre.console .stdin {
}
/* line 5, ../../stylesheets/partials/_search.sass */
#search form {
- background: url('/images/search_bg.png?1283853265') no-repeat;
+ background: url('/images/search_bg.png?1283854295') no-repeat;
padding: 0;
height: 28px;
width: 218px;
@@ -1059,7 +1059,7 @@ pre.console .stdin {
#nav ul li.subscribe a {
display: inline-block;
padding-left: 28px;
- background: url('/images/rss.png?1283853265') left top no-repeat;
+ background: url('/images/rss.png?1283854295') left top no-repeat;
}
/* line 32, ../../stylesheets/partials/_navigation.sass */
#nav ul li a {