712b0a1 rendered site
Philip (flip) Kromer authored
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "">
<html xml:lang="en" xmlns="">
<head>
<title>Infochimps Dev Blog :: Blog</title>
<link href='/stylesheets/screen.css' media='screen, projection' rel='stylesheet' type='text/css' />
<script src='' type='text/javascript'></script>
<script src='/javascripts/mootools-' type='text/javascript'></script>
<link href='/atom.xml' rel='alternate' title='Infochimps Dev Blog' type='application/atom+xml' />
</head>
<body id="">
<div id="header">
<div class='content'>
<h1>
<a class='title' href='/'>Infochimps Dev Blog</a>
</h1>
</div>
</div>
<div id="nav">
<div class='content'>
<ul>
<li class='alpha'>
<a href='/'>Blog</a>
</li>
<li>
<a href='/archives.html'>Archives</a>
</li>
<li class='omega'>
<a href='/about.html'>About</a>
</li>
<li class='subscribe'>
<a href='/atom.xml'>Subscribe</a>
</li>
</ul>
</div>
</div>
<div id="page">
<div id="content">
<div id="main">
<div class="content"><div class="blog">&#x000A; <div class="article">&#x000A; <h2><a class="title" href="/2010/09/scalable_sampling">Scalable Sampling</a></h2>&#x000A; <div class="meta">&#x000A; posted: September 7th, 2010&#x000A; &#x000A; </div>&#x000A; <h3>Sampling and Random Numbers</h3>&#x000A;<p>Found a really good caveat about using random numbers in a distributed system at the <a href="">rapleaf blog.</a> It&#8217;s subtle, so I&#8217;ll let you go read it there.</p>&#x000A;<p>Before you even get to such advanced mis-uses of random numbers<sup class="footnote" id="fnr1"><a href="#fn1">1</a></sup>, be sure you should be using them in the first place. People often reach for a <strong>random</strong> mapping when what they really want is a <strong>well-mixed</strong> mapping: a function such that similar but distinguishable objects will receive arbitrarily different outcomes. The MD5 hash is an easy way to do this.<sup class="footnote" id="fnr2"><a href="#fn2">2</a></sup></p>&#x000A;<h4>Consistent Shuffling</h4>&#x000A;<p>For example, you can shuffle a set of records by sorting them on the MD5 hash of each record&#8217;s primary key. The mixing is &#8220;consistent&#8221;: every run yields the same mixing. If you&#8217;d like it to <strong>not</strong> remain the same, use a salt<sup class="footnote" id="fnr3"><a href="#fn3">3</a></sup>:</p>&#x000A;<pre><code> MD5( [key, salt].join(":") )</code></pre>&#x000A;<p>Runs with the same salt and data will receive the same mixing. <em>Good salts</em>: If you use the Hadoop job id as salt, different runs will give different shuffles, but within each run identical input keys are shuffled identically. If you use the source filename + split boundary + a running counter<sup class="footnote" id="fnr4"><a href="#fn4">4</a></sup>, each record will be mixed arbitrarily, but in a way that&#8217;s predictable across runs. 
<em>Bad Salts</em>: random numbers, timestamps and the hostname + <span class="caps">PID</span> are bad salts, for <a href="">the reasons given in that rapleaf post.</a></p>&#x000A;<h4>Sampling</h4>&#x000A;<p>To take a 1/n sample from a set of records, take the MD5 hash of each record&#8217;s key and emit only the records whose hash is zero modulo n. If you have arbitrarily-assigned numeric primary keys you can just modulo n them directly, as long as n is large. In both cases note that you can&#8217;t subsample using this trick: every record in the sample is already zero modulo n, so re-applying the same test keeps all of them.</p>&#x000A;<h3>Uniform-All Sample</h3>&#x000A;<p>Here&#8217;s the wrong way to sample three related tables:</p>&#x000A;<ul>&#x000A; <li>Sample 1/100 users</li>&#x000A; <li>Sample 1/100 products</li>&#x000A; <li>Sample 1/100 transactions</li>&#x000A;</ul>&#x000A;<p>The problem is that none of them will join: for most transactions, you won&#8217;t be able to look up the buyers, sellers or products.<sup class="footnote" id="fnr5"><a href="#fn5">5</a></sup></p>&#x000A;(<a href="/2010/09/scalable_sampling" class="cont">continued&hellip;</a>)&#x000A; </div>&#x000A; <div class="article">&#x000A; <h2><a class="title" href="/2010/09/firsties">Firsties</a></h2>&#x000A; <div class="meta">&#x000A; posted: September 6th, 2010&#x000A; &#x000A; </div>&#x000A; <h3 style="color:red;">First Post wooooo!!!</h3>&#x000A;<p>Welcome to the new Infochimps dev blog. As opposed to the long-form thoughtful stuff you&#8217;ll find over at our <a href="">main blog</a>, this will be pure geek thoughtstream. 
Posts might be one line or hundreds, and might contain only formulas or only code.</p>&#x000A;<h4>A word about this blog.</h4>&#x000A;<p>We&#8217;re using the <a href="">Octopress framework</a> for <a href="">Jekyll.</a> Since Octopress required some extinct fork of Jekyll to render <span class="caps">HAML</span>, we did <a href="">horrible, horrible monkey things</a> to make it render <span class="caps">HAML</span> text and layouts without requiring a special fork of Jekyll.</p>&#x000A;<p>We also use a tiny little Sinatra shim inspired by Jesse Storimer&#8217;s <a href="">Jekyll on Heroku</a> post. We added <a href="">two tweaks:</a> one is to allow no-extension permalinks (redirects <code>/2010/09/foo</code> to <code>/2010/09/foo/index.html</code>), the other is to render the custom <a href="/404.html">/404.html</a> page.</p>&#x000A;<p>Get your own copy here:</p>&#x000A;<ul>&#x000A; <li><a href="">Infochimps Blog Source Code</a></li>&#x000A;</ul>&#x000A;<p>Posts are composed in <a href="">Textile</a> using Emacs (except for Jesse, who has some insane Dvorak-inverted notation TextMate retroclone thing going).</p>&#x000A; </div>&#x000A; <div class="footer">&#x000A; <a href="/archives.html" title="archives">&laquo; Blog Archives</a>&#x000A; </div>&#x000A;</div></div>
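<p>The salted MD5 shuffle and the 1/n sample described in the Scalable Sampling post can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions: the helper names <code>mixed_value</code>, <code>consistent_shuffle</code> and <code>sample_one_in</code> are hypothetical and do not come from the original post, and the records are assumed to be identified by plain string keys.</p>

```ruby
require 'digest/md5'

# NOTE: all method names below are hypothetical, chosen for illustration.

# Well-mixed integer derived from a key and an optional salt,
# following the post's MD5( [key, salt].join(":") ) recipe.
def mixed_value(key, salt = nil)
  Digest::MD5.hexdigest([key, salt].compact.join(":")).to_i(16)
end

# Consistent shuffle: order the keys by their hash.
# Same keys and same salt always give the same order.
def consistent_shuffle(keys, salt = nil)
  keys.sort_by { |key| mixed_value(key, salt) }
end

# Keep roughly 1/n of the keys: those whose hash is zero modulo n.
def sample_one_in(keys, n, salt = nil)
  keys.select { |key| (mixed_value(key, salt) % n).zero? }
end

keys = (1..1000).map { |i| "user-#{i}" }
shuffled = consistent_shuffle(keys, "job_1") # "job_1" stands in for a job id
sample   = sample_one_in(keys, 10)
```

<p>Calling <code>sample_one_in(sample, 10)</code> on the result returns the same sample unchanged, which is the subsampling caveat from the post: every surviving key is already zero modulo n.</p>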
</div>
<div id="sidebar"></div>
</div>
</div>
<div id='footer'>
<div class='content'>
Copyright &copy; 2010 - Infochimps Dev Blog -
<span class='credit'><a href="/colophon.html">colophon</a></span>
&middot;
<span class='credit'>
<a href='' title='our thing'>Infochimps: Find Data</a>
</span>
</div>
</div>
<script type='text/javascript'>
//<![CDATA[
(function() {
var links = document.getElementsByTagName('a');
var query = '?';
for(var i = 0; i < links.length; i++) {
if(links[i].href.indexOf('#disqus_thread') >= 0) {
query += 'url' + i + '=' + encodeURIComponent(links[i].href) + '&';
}
}
document.write('<script charset="utf-8" type="text/javascript" src="' + query + '"></' + 'script>');
})();
//]]>
</script>
</body>
</html>