
regen'ed site

1 parent e700b36 commit fd2a5ff52c2b00b296aec62f9d5358c11215f8e1 Philip (flip) Kromer committed
4 _site/2010/09/firsties/index.html
@@ -43,7 +43,7 @@
by: Infochimps Dev Team | posted: September 6th, 2010
</div>
- <h3 style="color:red;">First Post wooooo!!!</h3>
+ <h3 style="color:red;">First Post wooooo!!!</h3>&#x000A;<p>Welcome to the new Infochimps dev blog. As opposed to the long-form thoughtful stuff you&#8217;ll find over at our <a href="http://blog.infochimps.org">main blog</a> this will be pure geek thoughtstream. Posts might be one line or hundreds, and might contain only formulas or only code.</p>&#x000A;<h4>A word about this blog.</h4>&#x000A;<p>We&#8217;re using the <a href="http://github.com/imathis/octopress">Octopress framework</a> for <a href="http://github.com/mojombo/jekyll">Jekyll.</a> Since octopress required some extinct fork of jekyll to render <span class="caps">HAML</span>, we did <a href="http://github.com/infochimps/infochimps.github.com/tree/master/_plugins">horrible, horrible monkey things</a> to make it render <span class="caps">HAML</span> text and layouts, but not require a special fork of Jekyll.</p>&#x000A;<p>We also use a tiny little sinatra shim inspired by Jesse Storimer&#8217;s <a href="http://jstorimer.com/2009/12/29/jekyll-on-heroku.html">Jekyll on Heroku</a> post. We added <a href="http://github.com/infochimps/infochimps.github.com/blob/master/devblog.rb">two tweaks:</a> one is to allow no-extension permalinks (redirects <code>/2010/09/foo</code> to <code>/2010/09/foo/index.html</code>), the other is to render the custom <a href="/404.html">/404.html</a> page.</p>&#x000A;<p>Get your own copy here:</p>&#x000A;<ul>&#x000A; <li><a href="http://github.com/infochimps/infochimps.github.com">Infochimps Blog Source Code</a></li>&#x000A;</ul>&#x000A;<p>Posts are composed in <a href="http://redcloth.org/textile">Textile</a> using Emacs (except for Jesse, who has some insane Dvorak-inverted notation textmate retroclone thing going).</p>
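For the curious, the no-extension permalink tweak amounts to a rewrite rule along these lines. This is a hypothetical sketch, not the actual devblog.rb (see the linked repo for that); `SITE_ROOT` and `resolve` are invented for illustration:

```ruby
# Map a no-extension permalink onto the file Jekyll baked out:
# "/2010/09/foo" (or ".../foo/") => "_site/2010/09/foo/index.html".
# Anything that doesn't look like a permalink falls through to the
# custom 404 page.
SITE_ROOT = "_site"
PERMALINK = %r{\A/\d{4}/\d{2}/[^./]+/?\z}

def resolve(path)
  if path =~ PERMALINK
    File.join(SITE_ROOT, path, "index.html")
  else
    File.join(SITE_ROOT, "404.html")
  end
end
```

In the real shim this would sit behind a Sinatra route, with the 404 branch rendering /404.html instead of serving it blindly.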
<div id='disqus_thread'>
<script type='text/javascript'>
//<![CDATA[
@@ -64,7 +64,7 @@ <h3 style="color:red;">First Post wooooo!!!</h3>
<div id='footer'>
<div class='content'>
Copyright &copy; 2010 - Infochimps Dev Blog -
- <span class='credit'>Powered by <a href="http://octopress.org">Octopress</a></span>
+ <span class='credit'><a href="/colophon.html">colophon</a></span>
</div>
</div>
<script type='text/javascript'>
85 _site/2010/09/scalable_sampling/index.html
@@ -0,0 +1,85 @@
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
+<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
+ <head>
+ <title>Infochimps Dev Blog :: Scalable Sampling</title>
+ <link href='/stylesheets/screen.css' media='screen, projection' rel='stylesheet' type='text/css' />
+ <script src='http://ajax.googleapis.com/ajax/libs/mootools/1.2.4/mootools-yui-compressed.js' type='text/javascript'></script>
+ <script src='/javascripts/mootools-1.2.4.2-more.js' type='text/javascript'></script>
+ <link href='/atom.xml' rel='alternate' title='Infochimps Dev Blog' type='application/atom+xml' />
+ </head>
+ <body id="">
+ <div id="header">
+ <div class='content'>
+ <h1>
+ <a class='title' href='/'>Infochimps Dev Blog</a>
+ </h1>
+ </div>
+ </div>
+ <div id="nav">
+ <div class='content'>
+ <ul>
+ <li class='alpha'>
+ <a href='/'>Blog</a>
+ </li>
+ <li>
+ <a href='/archives.html'>Archives</a>
+ </li>
+ <li class='omega'>
+ <a href='/about.html'>About</a>
+ </li>
+ <li class='subscribe'>
+ <a href='/atom.xml'>Subscribe</a>
+ </li>
+ </ul>
+ </div>
+ </div>
+ <div id="page">
+ <div id="content">
+ <div id="main">
+ <div class="blog content">
+ <div class='article'>
+ <h2>Scalable Sampling</h2>
+ <div class='meta'>
+ by: Infochimps Dev Team | posted: September 7th, 2010
+
+ </div>
+ <h3>Sampling and Random Numbers</h3>&#x000A;<p>Found a really good caveat about using random numbers in a distributed system at the <a href="http://blog.rapleaf.com/dev/2009/08/14/using-random-numbers-in-mapreduce-is-dangerous/">rapleaf blog.</a> It&#8217;s subtle, so I&#8217;ll let you go read it there.</p>&#x000A;<p>Before you even get to such advanced mis-uses of random numbers<sup class="footnote" id="fnr1"><a href="#fn1">1</a></sup>, be sure you should be using them in the first place. People often reach for a <strong>random</strong> mapping when what they really want is a <strong>well-mixed</strong> mapping: a function such that similar but distinguishable objects will receive arbitrarily different outcomes. The MD5 hash is an easy way to do this.<sup class="footnote" id="fnr2"><a href="#fn2">2</a></sup></p>&#x000A;<h4>Consistent Shuffling</h4>&#x000A;<p>For example, you can shuffle a set of records by taking the MD5 hash of each record&#8217;s primary key. The mixing is &#8220;consistent&#8221;: every run yields the same mixing. If you&#8217;d like it to <strong>not</strong> remain the same, use a salt<sup class="footnote" id="fnr3"><a href="#fn3">3</a></sup>:</p>&#x000A;<pre><code> MD5( [key, salt].join(":") )</code></pre>&#x000A;<p>Runs with the same salt and data will receive the same mixing. <em>Good salts</em>: If you use the job_id as salt, different runs will give different shuffles but within each such run, identical input keys are shuffled identically. If you use the source filename + split boundary + a running counter<sup class="footnote" id="fnr4"><a href="#fn4">4</a></sup>, each record will be mixed arbitrarily, but in a way that&#8217;s predictable across runs. 
<em>Bad Salts</em>: random numbers, timestamps and the hostname + <span class="caps">PID</span> are bad salts, for <a href="http://blog.rapleaf.com/dev/2009/08/14/using-random-numbers-in-mapreduce-is-dangerous/">the reasons given in the rapleaf post.</a></p>&#x000A;<h4>Sampling</h4>&#x000A;<p>To take a 1/n sample from a set of records, take the MD5 hash and emit only records which are zero modulo n. If you have arbitrarily-assigned numeric primary keys you can just modulo n them directly, as long as n is large. In both cases note that you can&#8217;t subsample using this trick.</p>&#x000A;<h3>Uniform-All Sample</h3>&#x000A;<p>Here&#8217;s the wrong way to sample three related tables:</p>&#x000A;<ul>&#x000A; <li>Sample 1/100 users</li>&#x000A; <li>Sample 1/100 products</li>&#x000A; <li>Sample 1/100 transactions</li>&#x000A;</ul>&#x000A;<p>The problem is that none of them will join: for most transactions, you won&#8217;t be able to look up the buyers, sellers or products.<sup class="footnote" id="fnr5"><a href="#fn5">5</a></sup></p>&#x000A;<hr />&#x000A;<h3 style="vertical-align:middle;">Uniform plus Edges (Global-feature preserving) Sample</h3>&#x000A;<p>This is better:</p>&#x000A;<ul>&#x000A; <li>Take all users whose ids hash correctly (n1)</li>&#x000A; <li>Do a join of the transactions with n1</li>&#x000A; <li>Do some joins to get relationships with a user from n1 on the left (and/or) right</li>&#x000A;</ul>&#x000A;<p>However, it&#8217;s computationally harder than doing straight samples of each. The consistent hash answers that problem: just use the same hash on the <strong>foreign key</strong> (in this case, the user_id):</p>&#x000A;<ul>&#x000A; <li>Take all users whose ids hash correctly</li>&#x000A; <li>Take all products whose seller_id hashes correctly</li>&#x000A; <li>Take all transactions whose buyer_id (and/or) seller_id hashes correctly</li>&#x000A;</ul>&#x000A;<p>This gives you a very efficient uniform sample. 
If 4% of your buyers are from Florida, about 4% of the sampled users should be too, and about 4% of the transactions will be from Floridians. (<a href="http://kottke.org/10/05/monday-puzzle-time">Don&#8217;t get careless,</a> though)</p>&#x000A;<p>Some caveats. You don&#8217;t have good control over the sample fraction: your transactions probably obey a long-tail distribution (a few users account for a disproportionate number of transactions), which introduces high variance for the quantity recovered.</p>&#x000A;<p>The sample is also sparse, which can make analysis hard in some contexts. If you sample 1% of buyers, a product with 100 purchases will in general retain 1 buyer. You can&#8217;t test an algorithm that looks for similar products, or measures reputation flow.</p>&#x000A;<h3>Subuniverse (Local-structure preserving) Sample</h3>&#x000A;<p>To do a &#8216;subuniverse&#8217; sample, find a handle for some connected neighborhood of the graph &#8212; say, &#8220;sellers of quilts&#8221;.</p>&#x000A;<ul>&#x000A; <li>Take all users who match the handle (n0)</li>&#x000A; <li>Broaden n0 along all relevant connections: buy- or sell-transactions, sellers of products sold by people in n0, etc. Call this n1_all.</li>&#x000A; <li>Prune n1_all: eliminate entities with very few or very weak ties to n0, and call this n1.</li>&#x000A; <li>Do a join of n1 on the products&#8217; seller_id. (This requires a join, but since n1 is &#8216;only&#8217; a few million rows, you can do a fairly efficient map-side (aka fragment-replicate) join)</li>&#x000A; <li>Do some joins of n1 on the transactions, keeping those with a member of n1 on the left (and/or) right.</li>&#x000A;</ul>&#x000A;<p>You want highly similar features in n0, or n1 will get too large. 
&#8220;People from Denver&#8221; would be a bad handle for a shopping site, a decent handle for a fantasy football site.</p>&#x000A;<p>Here&#8217;s the same thing for our favorite network graph:</p>&#x000A;<ul>&#x000A; <li>Take all users who match the handle (n0)</li>&#x000A; <li>Broaden n0 along all relevant connections: for example atsign, follow, topic usage, etc &#8211; call this n1_all</li>&#x000A; <li>Prune n1_all: eliminate users with very few or very weak ties to n0, and call this n1.</li>&#x000A; <li>Do a join of n1 on the tweets. (Note that, since n1 is &#8216;only&#8217; a few million rows, you can do a map-side aka fragment-replicate join, which is actually quite efficient)</li>&#x000A; <li>Do some joins of n1 to get relationships with a sample user on the left (and/or) right</li>&#x000A;</ul>&#x000A;<p>For example, the subuniverse we typically work with is &#8220;users who have mentioned @infochimps, hadoop, opendata, or bigdata&#8221;. We chose this handle for a few reasons (besides the &#8220;we are big dorks&#8221;). Since we infochimps land in there, it&#8217;s easy to inspect the results of an experiment against a familiar object (ourselves). It also gives very correlated edges: many such people also follow each other, use other similar terms, etc. Without this correlation, we&#8217;d span too much of the graph.</p>&#x000A;<p>Within the subuniverse, we can happily do joins, calculate trstrank, and examine local community structure.</p>&#x000A;<p>Of course, the sample is heavily skewed by its handle. There&#8217;s the obvious way: among people who mention &#8216;hadoop&#8217;, conference planning is easy, dating is unfortunately hard. 
More importantly, no matter what handle you use, the subuniverse will be heavily biased towards the &#8216;core&#8217; of the graph:</p>&#x000A;<ul>&#x000A; <li>twitter users with millions of followers are going to land in almost any given subuniverse</li>&#x000A; <li>the trstrank of any given subuniverse is going to be vastly higher than the whole graph average</li>&#x000A; <li>Since real-world dynamic graphs typically densify over time (more roads are built, you follow more people on twitter), a subuniverse sample will have disproportionately few recent nodes.</li>&#x000A;</ul>&#x000A;<h3>Connectivity-preserving Sample</h3>&#x000A;<p>There&#8217;s one other type of sample you might like to do: one that preserves the global connectivity of edges.</p>&#x000A;<ul>&#x000A; <li>For each edge, record the total degree of the nodes at its ends (deg_a, deg_b).</li>&#x000A; <li>Stream through all the edges and with a probability of ( f * ( 1/deg_a + 1/deg_b )), keep the edge.</li>&#x000A;</ul>&#x000A;<p>(The parameter f adjusts the fraction of edges sampled.) In this equation, a node with one inbound link has a high chance of survival. On average, each node will have f inbound and f outbound links survive.</p>&#x000A;<p>This also in general retains all nodes: with f ~ 0.5, a 1B-edge graph on 100m nodes will come out with about 100m edges and 100m nodes. 
You&#8217;ll have to turn f down pretty far for a significant number of nodes to start failing the binomial trial at each end.</p>&#x000A;<p>To do this consistently, set g = 1/f and do</p>&#x000A;<pre><code> ( (MD5([node_a_id, node_b_id, 'a', salt].join(":")) % (deg_a * g) = 0) ||&#x000A; (MD5([node_a_id, node_b_id, 'b', salt].join(":")) % (deg_b * g) = 0) )</code></pre>&#x000A;<p>(since deg_a and deg_b may be correlated, we perturb it by adding &#8216;a&#8217; and &#8216;b&#8217; as extra salts)</p>&#x000A;<hr />&#x000A;<h4 style="vertical-align:middle;">Footnotes</h4>&#x000A;<p class="footnote" id="fn1"><a href="#fnr1"><sup>1</sup></a> We seem mostly rid of stupid and/or non-threadsafe RNGs. However, many <span class="caps">UUID</span> implementations (including Java&#8217;s, I think) require a global lock.</p>&#x000A;<p class="footnote" id="fn2"><a href="#fnr2"><sup>2</sup></a> By claiming an MD5 is good for anything I just made a cryptographer cry. So let me hurry to disclaim that you should really be using something-something-<span class="caps">HMAC</span>-whatever. That is &#8212; if you care that this mixing is cryptographically strong, go look up the one that is.</p>&#x000A;<p class="footnote" id="fn3"><a href="#fnr3"><sup>3</sup></a> Make sure to join with a character that can&#8217;t appear in the key (here, &#8216;:&#8217;). Without the separator, key 12 in job 34 and key 123 in job 4 would hash identically.</p>&#x000A;<p class="footnote" id="fn4"><a href="#fnr4"><sup>4</sup></a> These are available as environment variables if you&#8217;re streaming</p>&#x000A;<p class="footnote" id="fn5"><a href="#fnr5"><sup>5</sup></a> Note that you always need to use the <strong>least</strong> significant bytes, because of <a href="http://en.wikipedia.org/wiki/Benford%27s_law">Benford&#8217;s law</a></p>
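The salted-MD5 sampling described in the post boils down to a few lines of Ruby. This is an illustrative sketch, not Infochimps' actual code; the `sample?` helper and the toy tables are invented for the example:

```ruby
require 'digest/md5'

# Consistent sampling predicate: a record survives iff the MD5 of its
# (key, salt) pair is zero modulo n. Same key and salt give the same
# verdict on every machine and every run.
def sample?(key, n, salt)
  # Join with a separator that can't appear in the key, so that
  # (key "12", salt "34") and (key "1", salt "234") hash differently.
  digest = Digest::MD5.hexdigest([key, salt].join(":"))
  # Use the least-significant bytes of the digest as the integer.
  digest[-8, 8].to_i(16) % n == 0
end

salt = "job_201009070142" # e.g. the job id: new run, new (but consistent) shuffle

# Toy tables, just to show the join property:
users        = (1..1000).map { |i| { id: i } }
products     = (1..500).map  { |i| { id: i, seller_id: (i % 1000) + 1 } }
transactions = (1..2000).map { |i| { id: i, buyer_id: (i % 1000) + 1 } }

# Sample users by primary key, then apply the *same* hash to the
# foreign keys, so sampled products and transactions still join
# back to sampled users:
users_sample        = users.select        { |u| sample?(u[:id],        10, salt) }
products_sample     = products.select     { |p| sample?(p[:seller_id], 10, salt) }
transactions_sample = transactions.select { |t| sample?(t[:buyer_id],  10, salt) }

kept_ids = users_sample.map { |u| u[:id] }
# Every sampled transaction's buyer is, by construction, a sampled user.
```

The same predicate, fed `[node_a_id, node_b_id, 'a', salt]` and a degree-scaled modulus, gives the connectivity-preserving edge sample at the end of the post.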
+ <div id='disqus_thread'>
+ <script type='text/javascript'>
+ //<![CDATA[
+ var disqus_url = "http://icsblog.heroku.com/2010/09/scalable_sampling";
+ //]]>
+ </script>
+ <noscript>
+ <a href='http://infochimps.disqus.com/?url=ref'>View the discussion thread</a>
+ </noscript>
+ <script src='http://disqus.com/forums/infochimps/embed.js' type='text/javascript'></script>
+ </div>
+ </div>
+ </div>
+ </div>
+ <div id="sidebar"></div>
+ </div>
+ </div>
+ <div id='footer'>
+ <div class='content'>
+ Copyright &copy; 2010 - Infochimps Dev Blog -
+ <span class='credit'><a href="/colophon.html">colophon</a></span>
+ </div>
+ </div>
+ <script type='text/javascript'>
+ //<![CDATA[
+ (function() {
+ var links = document.getElementsByTagName('a');
+ var query = '?';
+ for(var i = 0; i < links.length; i++) {
+ if(links[i].href.indexOf('#disqus_thread') >= 0) {
+ query += 'url' + i + '=' + encodeURIComponent(links[i].href) + '&';
+ }
+ }
+ document.write('<script charset="utf-8" type="text/javascript" src="http://disqus.com/forums/infochimps/get_num_replies.js' + query + '"></' + 'script>');
+ })();
+ //]]>
+ </script>
+ </body>
+</html>
2 _site/404.html
@@ -44,7 +44,7 @@
<div id='footer'>
<div class='content'>
Copyright &copy; 2010 - Infochimps Dev Blog -
- <span class='credit'>Powered by <a href="http://octopress.org">Octopress</a></span>
+ <span class='credit'><a href="/colophon.html">colophon</a></span>
</div>
</div>
<script type='text/javascript'>
2 _site/about.html
@@ -44,7 +44,7 @@
<div id='footer'>
<div class='content'>
Copyright &copy; 2010 - Infochimps Dev Blog -
- <span class='credit'>Powered by <a href="http://octopress.org">Octopress</a></span>
+ <span class='credit'><a href="/colophon.html">colophon</a></span>
</div>
</div>
<script type='text/javascript'>
4 _site/archives.html
@@ -36,7 +36,7 @@
<div id="page">
<div id="content">
<div id="main">
- <div class="content"><h2>Blog Archives</h2>&#x000A;<h3>2010</h3>&#x000A;<ul>&#x000A; <li class="">&#x000A; <a href="/2010/09/firsties">Firsties</a>&#x000A; <span class="pubdate">06 Sep, 2010</span>&#x000A; </li>&#x000A;</ul></div>
+ <div class="content"><h2>Blog Archives</h2>&#x000A;<h3>2010</h3>&#x000A;<ul>&#x000A; <li class="">&#x000A; <a href="/2010/09/firsties">Firsties</a>&#x000A; <span class="pubdate">06 Sep, 2010</span>&#x000A; </li>&#x000A; <li class="">&#x000A; <a href="/2010/09/scalable_sampling">Scalable Sampling</a>&#x000A; <span class="pubdate">07 Sep, 2010</span>&#x000A; </li>&#x000A;</ul></div>
</div>
<div id="sidebar"></div>
</div>
@@ -44,7 +44,7 @@
<div id='footer'>
<div class='content'>
Copyright &copy; 2010 - Infochimps Dev Blog -
- <span class='credit'>Powered by <a href="http://octopress.org">Octopress</a></span>
+ <span class='credit'><a href="/colophon.html">colophon</a></span>
</div>
</div>
<script type='text/javascript'>
98 _site/atom.html
@@ -3,7 +3,7 @@
<title>Infochimps Developers Blog: Big Data, Hadoop, Cassandra, Chef, Ruby, Rails and more.</title>
<link href="http://icsblog.heroku.com//atom.xml" rel="self" />
<link href="http://icsblog.heroku.com/" />
- <updated>2010-09-07T02:26:41-05:00</updated>
+ <updated>2010-09-07T04:16:05-05:00</updated>
<id>http://icsblog.heroku.com/</id>
<author>
<name>Infochimps Dev Team</name>
@@ -16,6 +16,102 @@
<id>http://icsblog.heroku.com//2010/09/firsties</id>
<content type="html">
&lt;h3 style=&quot;color:red;&quot;&gt;First Post wooooo!!!&lt;/h3&gt;
+ &lt;p&gt;Welcome to the new Infochimps dev blog. As opposed to the long-form thoughtful stuff you&#8217;ll find over at our &lt;a href=&quot;http://blog.infochimps.org&quot;&gt;main blog&lt;/a&gt; this will be pure geek thoughtstream. Posts might be one line or hundreds, and might contain only formulas or only code.&lt;/p&gt;
+ &lt;h4&gt;A word about this blog.&lt;/h4&gt;
+ &lt;p&gt;We&#8217;re using the &lt;a href=&quot;http://github.com/imathis/octopress&quot;&gt;Octopress framework&lt;/a&gt; for &lt;a href=&quot;http://github.com/mojombo/jekyll&quot;&gt;Jekyll.&lt;/a&gt; Since octopress required some extinct fork of jekyll to render &lt;span class=&quot;caps&quot;&gt;HAML&lt;/span&gt;, we did &lt;a href=&quot;http://github.com/infochimps/infochimps.github.com/tree/master/_plugins&quot;&gt;horrible, horrible monkey things&lt;/a&gt; to make it render &lt;span class=&quot;caps&quot;&gt;HAML&lt;/span&gt; text and layouts, but not require a special fork of Jekyll.&lt;/p&gt;
+ &lt;p&gt;We also use a tiny little sinatra shim inspired by Jesse Storimer&#8217;s &lt;a href=&quot;http://jstorimer.com/2009/12/29/jekyll-on-heroku.html&quot;&gt;Jekyll on Heroku&lt;/a&gt; post. We added &lt;a href=&quot;http://github.com/infochimps/infochimps.github.com/blob/master/devblog.rb&quot;&gt;two tweaks:&lt;/a&gt; one is to allow no-extension permalinks (redirects &lt;code&gt;/2010/09/foo&lt;/code&gt; to &lt;code&gt;/2010/09/foo/index.html&lt;/code&gt;), the other is to render the custom &lt;a href=&quot;http://icsblog.heroku.com//404.html&quot;&gt;/404.html&lt;/a&gt; page.&lt;/p&gt;
+ &lt;p&gt;Get your own copy here:&lt;/p&gt;
+ &lt;ul&gt;
+ &lt;li&gt;&lt;a href=&quot;http://github.com/infochimps/infochimps.github.com&quot;&gt;Infochimps Blog Source Code&lt;/a&gt;&lt;/li&gt;
+ &lt;/ul&gt;
+ &lt;p&gt;Posts are composed in &lt;a href=&quot;http://redcloth.org/textile&quot;&gt;Textile&lt;/a&gt; using Emacs (except for Jesse, who has some insane Dvorak-inverted notation textmate retroclone thing going).&lt;/p&gt;
+ </content>
+ </entry>
+ <entry>
+ <title>Scalable Sampling</title>
+ <link href="http://icsblog.heroku.com//2010/09/scalable_sampling" />
+ <updated>2010-09-07T00:00:00-05:00</updated>
+ <id>http://icsblog.heroku.com//2010/09/scalable_sampling</id>
+ <content type="html">
+ &lt;h3&gt;Sampling and Random Numbers&lt;/h3&gt;
+ &lt;p&gt;Found a really good caveat about using random numbers in a distributed system at the &lt;a href=&quot;http://blog.rapleaf.com/dev/2009/08/14/using-random-numbers-in-mapreduce-is-dangerous/&quot;&gt;rapleaf blog.&lt;/a&gt; It&#8217;s subtle, so I&#8217;ll let you go read it there.&lt;/p&gt;
+ &lt;p&gt;Before you even get to such advanced mis-uses of random numbers&lt;sup class=&quot;footnote&quot; id=&quot;fnr1&quot;&gt;&lt;a href=&quot;#fn1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;, be sure you should be using them in the first place. People often reach for a &lt;strong&gt;random&lt;/strong&gt; mapping when what they really want is a &lt;strong&gt;well-mixed&lt;/strong&gt; mapping: a function such that similar but distinguishable objects will receive arbitrarily different outcomes. The MD5 hash is an easy way to do this.&lt;sup class=&quot;footnote&quot; id=&quot;fnr2&quot;&gt;&lt;a href=&quot;#fn2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
+ &lt;h4&gt;Consistent Shuffling&lt;/h4&gt;
+ &lt;p&gt;For example, you can shuffle a set of records by taking the MD5 hash of each record&#8217;s primary key. The mixing is &#8220;consistent&#8221;: every run yields the same mixing. If you&#8217;d like it to &lt;strong&gt;not&lt;/strong&gt; remain the same, use a salt&lt;sup class=&quot;footnote&quot; id=&quot;fnr3&quot;&gt;&lt;a href=&quot;#fn3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;
+ &lt;pre&gt;&lt;code&gt; MD5( [key, salt].join(&quot;:&quot;) )&lt;/code&gt;&lt;/pre&gt;
+ &lt;p&gt;Runs with the same salt and data will receive the same mixing. &lt;em&gt;Good salts&lt;/em&gt;: If you use the job_id as salt, different runs will give different shuffles but within each such run, identical input keys are shuffled identically. If you use the source filename + split boundary + a running counter&lt;sup class=&quot;footnote&quot; id=&quot;fnr4&quot;&gt;&lt;a href=&quot;#fn4&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;, each record will be mixed arbitrarily, but in a way that&#8217;s predictable across runs. &lt;em&gt;Bad Salts&lt;/em&gt;: random numbers, timestamps and the hostname + &lt;span class=&quot;caps&quot;&gt;PID&lt;/span&gt; are bad salts, for &lt;a href=&quot;http://blog.rapleaf.com/dev/2009/08/14/using-random-numbers-in-mapreduce-is-dangerous/&quot;&gt;the reasons given in the rapleaf post.&lt;/a&gt;&lt;/p&gt;
+ &lt;h4&gt;Sampling&lt;/h4&gt;
+ &lt;p&gt;To take a 1/n sample from a set of records, take the MD5 hash and emit only records which are zero modulo n. If you have arbitrarily-assigned numeric primary keys you can just modulo n them directly, as long as n is large. In both cases note that you can&#8217;t subsample using this trick.&lt;/p&gt;
+ &lt;h3&gt;Uniform-All Sample&lt;/h3&gt;
+ &lt;p&gt;Here&#8217;s the wrong way to sample three related tables:&lt;/p&gt;
+ &lt;ul&gt;
+ &lt;li&gt;Sample 1/100 users&lt;/li&gt;
+ &lt;li&gt;Sample 1/100 products&lt;/li&gt;
+ &lt;li&gt;Sample 1/100 transactions&lt;/li&gt;
+ &lt;/ul&gt;
+ &lt;p&gt;The problem is that none of them will join: for most transactions, you won&#8217;t be able to look up the buyers, sellers or products.&lt;sup class=&quot;footnote&quot; id=&quot;fnr5&quot;&gt;&lt;a href=&quot;#fn5&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
+ &lt;hr /&gt;
+ &lt;h3 style=&quot;vertical-align:middle;&quot;&gt;Uniform plus Edges (Global-feature preserving) Sample&lt;/h3&gt;
+ &lt;p&gt;This is better:&lt;/p&gt;
+ &lt;ul&gt;
+ &lt;li&gt;Take all users whose ids hash correctly (n1)&lt;/li&gt;
+ &lt;li&gt;Do a join of the transactions with n1&lt;/li&gt;
+ &lt;li&gt;Do some joins to get relationships with a user from n1 on the left (and/or) right&lt;/li&gt;
+ &lt;/ul&gt;
+ &lt;p&gt;However, it&#8217;s computationally harder than doing straight samples of each. The consistent hash answers that problem: just use the same hash on the &lt;strong&gt;foreign key&lt;/strong&gt; (in this case, the user_id):&lt;/p&gt;
+ &lt;ul&gt;
+ &lt;li&gt;Take all users whose ids hash correctly&lt;/li&gt;
+ &lt;li&gt;Take all products whose seller_id hashes correctly&lt;/li&gt;
+ &lt;li&gt;Take all transactions whose buyer_id (and/or) seller_id hashes correctly&lt;/li&gt;
+ &lt;/ul&gt;
+ &lt;p&gt;This gives you a very efficient uniform sample. If 4% of your buyers are from Florida, about 4% of the sampled users should be too, and about 4% of the transactions will be from Floridians. (&lt;a href=&quot;http://kottke.org/10/05/monday-puzzle-time&quot;&gt;Don&#8217;t get careless,&lt;/a&gt; though)&lt;/p&gt;
+ &lt;p&gt;Some caveats. You don&#8217;t have good control over the sample fraction: your transactions probably obey a long-tail distribution (a few users account for a disproportionate number of transactions), which introduces high variance for the quantity recovered.&lt;/p&gt;
+ &lt;p&gt;The sample is also sparse, which can make analysis hard in some contexts. If you sample 1% of buyers, a product with 100 purchases will in general retain 1 buyer. You can&#8217;t test an algorithm that looks for similar products, or measures reputation flow.&lt;/p&gt;
+ &lt;h3&gt;Subuniverse (Local-structure preserving) Sample&lt;/h3&gt;
+ &lt;p&gt;To do a &#8216;subuniverse&#8217; sample, find a handle for some connected neighborhood of the graph &#8212; say, &#8220;sellers of quilts&#8221;.&lt;/p&gt;
+ &lt;ul&gt;
+ &lt;li&gt;Take all users who match the handle (n0)&lt;/li&gt;
+ &lt;li&gt;Broaden n0 along all relevant connections: buy- or sell-transactions, sellers of products sold by people in n0, etc. Call this n1_all.&lt;/li&gt;
+ &lt;li&gt;Prune n1_all: eliminate entities with very few or very weak ties to n0, and call this n1.&lt;/li&gt;
+ &lt;li&gt;Do a join of n1 on the products&#8217; seller_id. (This requires a join, but since n1 is &#8216;only&#8217; a few million rows, you can do a fairly efficient map-side (aka fragment-replicate) join)&lt;/li&gt;
+ &lt;li&gt;Do some joins of n1 on the transactions, keeping those with a member of n1 on the left (and/or) right.&lt;/li&gt;
+ &lt;/ul&gt;
+ &lt;p&gt;You want highly similar features in n0, or n1 will get too large. &#8220;People from Denver&#8221; would be a bad handle for a shopping site, a decent handle for a fantasy football site.&lt;/p&gt;
+ &lt;p&gt;Here&#8217;s the same thing for our favorite network graph:&lt;/p&gt;
+ &lt;ul&gt;
+ &lt;li&gt;Take all users who match the handle (n0)&lt;/li&gt;
+ &lt;li&gt;Broaden n0 along all relevant connections: for example atsign, follow, topic usage, etc &#8211; call this n1_all&lt;/li&gt;
+ &lt;li&gt;Prune n1_all: eliminate users with very few or very weak ties to n0, and call this n1.&lt;/li&gt;
+ &lt;li&gt;Do a join of n1 on the tweets. (Note that, since n1 is &#8216;only&#8217; a few million rows, you can do a map-side aka fragment-replicate join, which is actually quite efficient)&lt;/li&gt;
+ &lt;li&gt;Do some joins of n1 to get relationships with a sample user on the left (and/or) right&lt;/li&gt;
+ &lt;/ul&gt;
+ &lt;p&gt;For example, the subuniverse we typically work with is &#8220;users who have mentioned @infochimps, hadoop, opendata, or bigdata&#8221;. We chose this handle for a few reasons (besides the &#8220;we are big dorks&#8221;). Since we infochimps land in there, it&#8217;s easy to inspect the results of an experiment against a familiar object (ourselves). It also gives very correlated edges: many such people also follow each other, use other similar terms, etc. Without this correlation, we&#8217;d span too much of the graph.&lt;/p&gt;
+ &lt;p&gt;Within the subuniverse, we can happily do joins, calculate trstrank, and examine local community structure.&lt;/p&gt;
+ &lt;p&gt;Of course, the sample is heavily skewed by its handle. There&#8217;s the obvious way: among people who mention &#8216;hadoop&#8217;, conference planning is easy, dating is unfortunately hard. More importantly, no matter what handle you use, the subuniverse will be heavily biased towards the &#8216;core&#8217; of the graph:&lt;/p&gt;
+ &lt;ul&gt;
+ &lt;li&gt;twitter users with millions of followers are going to land in almost any given subuniverse&lt;/li&gt;
+ &lt;li&gt;the trstrank of any given subuniverse is going to be vastly higher than the whole graph average&lt;/li&gt;
+ &lt;li&gt;Since real-world dynamic graphs typically densify over time (more roads are built, you follow more people on twitter), a subuniverse sample will have disproportionately few recent nodes.&lt;/li&gt;
+ &lt;/ul&gt;
+ &lt;h3&gt;Connectivity-preserving Sample&lt;/h3&gt;
+ &lt;p&gt;There&#8217;s one other type of sample you might like to do: one that preserves the global connectivity of edges.&lt;/p&gt;
+ &lt;ul&gt;
+ &lt;li&gt;For each edge, record the total degree of the nodes at its ends (deg_a, deg_b).&lt;/li&gt;
+ &lt;li&gt;Stream through all the edges and with a probability of ( f * ( 1/deg_a + 1/deg_b )), keep the edge.&lt;/li&gt;
+ &lt;/ul&gt;
+ &lt;p&gt;(The parameter f adjusts the fraction of edges sampled.) In this equation, a node with one inbound link has a high chance of survival. On average, each node will have f inbound and f outbound links survive.&lt;/p&gt;
+ &lt;p&gt;This also in general retains all nodes: with f ~ 0.5, a 1B-edge graph on 100m nodes will come out with about 100m edges and 100m nodes. You&#8217;ll have to turn f down pretty far for a significant number of nodes to start failing the binomial trial at each end.&lt;/p&gt;
+ &lt;p&gt;To do this consistently, set g = 1/f and do&lt;/p&gt;
+ &lt;pre&gt;&lt;code&gt; ( (MD5([node_a_id, node_b_id, 'a', salt].join(&quot;:&quot;)) % (deg_a * g) = 0) ||
+ (MD5([node_a_id, node_b_id, 'b', salt].join(&quot;:&quot;)) % (deg_b * g) = 0) )&lt;/code&gt;&lt;/pre&gt;
+ &lt;p&gt;(since deg_a and deg_b may be correlated, we perturb it by adding &#8216;a&#8217; and &#8216;b&#8217; as extra salts)&lt;/p&gt;
+ &lt;hr /&gt;
+ &lt;h4 style=&quot;vertical-align:middle;&quot;&gt;Footnotes&lt;/h4&gt;
+ &lt;p class=&quot;footnote&quot; id=&quot;fn1&quot;&gt;&lt;a href=&quot;#fnr1&quot;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt; We seem mostly rid of stupid and/or non-threadsafe RNGs. However, many &lt;span class=&quot;caps&quot;&gt;UUID&lt;/span&gt; implementations (including Java&#8217;s, I think) require a global lock.&lt;/p&gt;
+ &lt;p class=&quot;footnote&quot; id=&quot;fn2&quot;&gt;&lt;a href=&quot;#fnr2&quot;&gt;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt; By claiming an MD5 is good for anything I just made a cryptographer cry. So let me hurry to disclaim that you should really be using something-something-&lt;span class=&quot;caps&quot;&gt;HMAC&lt;/span&gt;-whatever. That is &#8212; if you care that this mixing is cryptographically strong, go look up the one that is.&lt;/p&gt;
+ &lt;p class=&quot;footnote&quot; id=&quot;fn3&quot;&gt;&lt;a href=&quot;#fnr3&quot;&gt;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt; Make sure to join with a character that can&#8217;t appear in the key (here, &#8216;:&#8217;). Without the separator, key 12 in job 34 and key 123 in job 4 would hash identically.&lt;/p&gt;
+ &lt;p class=&quot;footnote&quot; id=&quot;fn4&quot;&gt;&lt;a href=&quot;#fnr4&quot;&gt;&lt;sup&gt;4&lt;/sup&gt;&lt;/a&gt; These are available as environment variables if you&#8217;re streaming&lt;/p&gt;
+ &lt;p class=&quot;footnote&quot; id=&quot;fn5&quot;&gt;&lt;a href=&quot;#fnr5&quot;&gt;&lt;sup&gt;5&lt;/sup&gt;&lt;/a&gt; Note that you always need to use the &lt;strong&gt;least&lt;/strong&gt; significant bytes, because of &lt;a href=&quot;http://en.wikipedia.org/wiki/Benford%27s_law&quot;&gt;Benford&#8217;s law&lt;/a&gt;&lt;/p&gt;
</content>
</entry>
</feed>
150 _site/atom.xml
@@ -3,7 +3,7 @@
<title>Infochimps Developers Blog: Big Data, Hadoop, Cassandra, Chef, Ruby, Rails and more.</title>
<link href="http://icsblog.heroku.com//atom.xml" rel="self" />
<link href="http://icsblog.heroku.com/" />
- <updated>2010-09-07T02:21:29-05:00</updated>
+ <updated>2010-09-07T02:59:02-05:00</updated>
<id>http://icsblog.heroku.com/</id>
<author>
<name>Infochimps Dev Team</name>
@@ -16,6 +16,154 @@
<id>http://icsblog.heroku.com//2010/09/firsties</id>
<content type="html">
&lt;h3 style=&quot;color:red;&quot;&gt;First Post wooooo!!!&lt;/h3&gt;
+ &lt;p&gt;Welcome to the new Infochimps dev blog. As opposed to the long-form thoughtful stuff you&#8217;ll find over at our &lt;a href=&quot;http://blog.infochimps.org&quot;&gt;main blog&lt;/a&gt; this will be pure geek thoughtstream. Posts might be one line or hundreds, and might contain only formulas or only code.&lt;/p&gt;
+ &lt;h4&gt;A word about this blog.&lt;/h4&gt;
+ &lt;p&gt;We&#8217;re using the &lt;a href=&quot;http://github.com/imathis/octopress&quot;&gt;Octopress framework&lt;/a&gt; for &lt;a href=&quot;http://github.com/mojombo/jekyll&quot;&gt;Jekyll.&lt;/a&gt; Since octopress required some extinct fork of jekyll to render &lt;span class=&quot;caps&quot;&gt;HAML&lt;/span&gt;, we did &lt;a href=&quot;http://github.com/infochimps/infochimps.github.com/tree/master/_plugins&quot;&gt;horrible, horrible monkey things&lt;/a&gt; to make it render &lt;span class=&quot;caps&quot;&gt;HAML&lt;/span&gt; text and layouts, but not require a special fork of Jekyll.&lt;/p&gt;
+ &lt;p&gt;We also use a tiny little sinatra shim inspired by Jesse Storimer&#8217;s &lt;a href=&quot;http://jstorimer.com/2009/12/29/jekyll-on-heroku.html&quot;&gt;Jekyll on Heroku&lt;/a&gt; post. We added &lt;a href=&quot;http://github.com/infochimps/infochimps.github.com/blob/master/devblog.rb&quot;&gt;two tweaks:&lt;/a&gt; one is to allow no-extension permalinks (redirects &lt;code&gt;/2010/09/foo&lt;/code&gt; to &lt;code&gt;/2010/09/foo/index.html&lt;/code&gt;), the other is to render the custom &lt;a href=&quot;http://icsblog.heroku.com//404.html&quot;&gt;/404.html&lt;/a&gt; page.&lt;/p&gt;
+ &lt;p&gt;Get your own copy here:&lt;/p&gt;
+ &lt;ul&gt;
+ &lt;li&gt;&lt;a href=&quot;http://github.com/infochimps/infochimps.github.com&quot;&gt;Infochimps Blog Source Code&lt;/a&gt;&lt;/li&gt;
+ &lt;/ul&gt;
+ &lt;p&gt;Posts are composed in &lt;a href=&quot;http://redcloth.org/textile&quot;&gt;Textile&lt;/a&gt; using Emacs (except for Jesse, who has some insane Dvorak-inverted notation textmate retroclone thing going).&lt;/p&gt;
+ </content>
+ </entry>
+ <entry>
+ <title>Sample Simply</title>
+ <link href="http://icsblog.heroku.com//2010/09/sampling_aint_easy" />
+ <updated>2010-09-06T00:00:00-05:00</updated>
+ <id>http://icsblog.heroku.com//2010/09/sampling_aint_easy</id>
+ <content type="html">
+ &lt;h2&gt;Sampling and Random Numbers&lt;/h2&gt;
+ &lt;p&gt;Found a really good caveat about using random numbers in a distributed system at the &lt;a href=&quot;http://blog.rapleaf.com/dev/2009/08/14/using-random-numbers-in-mapreduce-is-dangerous/&quot;&gt;rapleaf blog.&lt;/a&gt; It&#8217;s subtle, so I&#8217;ll let you go read it there.(*)&lt;/p&gt;
+ &lt;p&gt;A lot of times when people reach for &#8216;random&#8217;, what they really want is &#8216;mixed&lt;br /&gt;
+ arbitrarily&#8217;: that is, a function such that similar objects will receive&lt;br /&gt;
+ arbitrarily different outcomes. The MD5 hash is an easy way to do this.**&lt;/p&gt;
+ &lt;p&gt;To shuffle a set of records, take the MD5 hash of its primary key. The mixing is&lt;br /&gt;
+ &#8220;consistent&#8221;: every time you run this you&#8217;ll get the same mixing. If you&#8217;d like&lt;br /&gt;
+ it to &lt;strong&gt;not&lt;/strong&gt; be consistent, use a salt:&lt;/p&gt;
+ &lt;pre&gt;&lt;code&gt; MD5( [key, salt].join(&quot;:&quot;) )&lt;/code&gt;&lt;/pre&gt; (**)
+ &lt;p&gt;Now every run using the same salt will receive an identical mixing that is still&lt;br /&gt;
+ arbitrary within the run. To vary by the job, the task, the partition, or the&lt;br /&gt;
+ row, salt using the job_id, task_id, source filename + split boundary(**), or&lt;br /&gt;
+ source filename + split boundary + running counter.&lt;/p&gt;
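The salted mixing described above can be sketched in Ruby. This is a minimal sketch, not code from the post; `consistent_shuffle` and its arguments are made-up names:

```ruby
require 'digest'

# Order records by the MD5 of their key plus an optional salt.
# The same salt always yields the same "shuffle"; omit the salt
# (or fix it per job) for a fully consistent mixing.
def consistent_shuffle(records, salt = nil)
  records.sort_by { |r| Digest::MD5.hexdigest([r, salt].compact.join(":")) }
end
```

Because the digest is deterministic, two runs with the same salt and data produce an identical ordering, while any change of salt rearranges everything arbitrarily.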
+ &lt;p&gt;A random number, a timestamp, or the hostname + &lt;span class=&quot;caps&quot;&gt;PID&lt;/span&gt; are bad salts, for the reasons given at http://blog.rapleaf.com/dev/2009/08/14/using-random-numbers-in-mapreduce-is-dangerous/&lt;/p&gt;
+ &lt;p&gt;To take a 1/n sample from a set of records, take the MD5 hash and emit only&lt;br /&gt;
+ those records which are zero modulo n. If you have arbitrarily-assigned numeric&lt;br /&gt;
+ primary keys you can just modulo n them directly, as long as n is large. In both&lt;br /&gt;
+ cases note that you can&#8217;t subsample using this trick.&lt;/p&gt;
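The modulo-n trick can be sketched in Ruby as follows; `in_sample?` is a hypothetical helper name, and per the footnote it reads only the least significant bytes of the digest:

```ruby
require 'digest'

# Keep a record iff the MD5 of its (optionally salted) key is zero mod n.
def in_sample?(key, n, salt = nil)
  digest = Digest::MD5.hexdigest([key, salt].compact.join(":"))
  digest[-8, 8].to_i(16) % n == 0   # least significant 8 hex chars
end
```

Every run keeps exactly the same ~1/n of the records, which is what makes the samples reproducible.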
+ &lt;h2&gt;Uniform-All Sample&lt;/h2&gt;
+ &lt;p&gt;Say you&#8217;re a site where users sell products to each other. For development, you&lt;br /&gt;
+ want a 1% sample to test on. Here&#8217;s the wrong thing to do:&lt;/p&gt;
+ &lt;ul&gt;
+ &lt;li&gt;Sample 1/100 users&lt;/li&gt;
+ &lt;li&gt;Sample 1/100 products&lt;/li&gt;
+ &lt;li&gt;Sample 1/100 transactions&lt;/li&gt;
+ &lt;/ul&gt;
+ &lt;p&gt;The good thing is that each given product, transaction or user has the same&lt;br /&gt;
+ uniform chance of being included.&lt;/p&gt;
+ &lt;p&gt;The problem is that none of them will join: for most transactions, you won&#8217;t be&lt;br /&gt;
+ able to look up the buyers, sellers or products. **&lt;/p&gt;
+
+ &lt;ul&gt;
+ &lt;li&gt;If you&#8217;re developing an exploratory data analysis tool for big data please support at least the Subuniverse and Uniform-plus-Edges samples described below&lt;/li&gt;
+ &lt;/ul&gt;&lt;h2&gt;Uniform plus Edges (Global-feature preserving) Sample&lt;/h2&gt;
+ &lt;p&gt;This is better:&lt;/p&gt;
+ &lt;ul&gt;
+ &lt;li&gt;Take all users whose ids hash correctly (n1)&lt;/li&gt;
+ &lt;li&gt;Do a join of the transactions with n1&lt;/li&gt;
+ &lt;li&gt;Do some joins to get relationships with a user from n1 on the left (and/or) right&lt;/li&gt;
+ &lt;/ul&gt;
+ &lt;p&gt;However, it&#8217;s computationally harder than doing straight samples of each. The&lt;br /&gt;
+ consistent hash answers that problem: just use the same hash on the &lt;strong&gt;foreign&lt;br /&gt;
+ key&lt;/strong&gt; (in this case, the user_id):&lt;/p&gt;
+ &lt;ul&gt;
+ &lt;li&gt;Take all users whose ids hash correctly&lt;/li&gt;
+ &lt;li&gt;Take all products whose seller_id hashes correctly&lt;/li&gt;
+ &lt;li&gt;Take all transactions whose buyer_id (and/or) seller_id hashes correctly&lt;/li&gt;
+ &lt;/ul&gt;
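The list above can be sketched in Ruby with made-up toy tables (arrays of hashes; `hash_hit?` and `joinable_sample` are hypothetical names). The key move is applying the same hash predicate to the user id wherever it appears as a foreign key:

```ruby
require 'digest'

# Same consistent predicate everywhere the user id appears.
def hash_hit?(id, n)
  Digest::MD5.hexdigest(id.to_s)[-8, 8].to_i(16) % n == 0
end

def joinable_sample(users, products, transactions, n)
  {
    users:        users.select        { |u| hash_hit?(u[:id], n) },
    products:     products.select     { |p| hash_hit?(p[:seller_id], n) },
    transactions: transactions.select { |t| hash_hit?(t[:buyer_id],  n) ||
                                            hash_hit?(t[:seller_id], n) }
  }
end
```

Every sampled product's seller is guaranteed to be a sampled user, because both survive or die by the same hash of the same id.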
+ &lt;p&gt;This gives you a very efficient uniform sample. If 4% of your buyers are from&lt;br /&gt;
+ Florida, about 4% of the sampled users should be too, and about 4% of the&lt;br /&gt;
+ transactions will be from Floridians. (&lt;a href=&quot;http://kottke.org/10/05/monday-puzzle-time&quot;&gt;Don&#8217;t get careless,&lt;/a&gt; though)&lt;/p&gt;
+ &lt;p&gt;Some caveats. You don&#8217;t have good control over the sample fraction: your&lt;br /&gt;
+ transactions probably obey a long-tail distribution (a few users account for a&lt;br /&gt;
+ disproportionate number of transactions), which introduces high variance for the&lt;br /&gt;
+ quantity recovered.&lt;/p&gt;
+ &lt;p&gt;The sample is also sparse, which can make analysis hard in some contexts. If you&lt;br /&gt;
+ sample 1% of buyers, a product with 100 purchases will in general retain 1&lt;br /&gt;
+ buyer. You can&#8217;t test an algorithm that looks for similar products, or measures&lt;br /&gt;
+ reputation flow.&lt;/p&gt;
+ &lt;h2&gt;Subuniverse (Local-structure preserving) Sample&lt;/h2&gt;
+ &lt;p&gt;To do a &#8216;subuniverse&#8217; sample, find some handle that lets you pick up a connected&lt;br /&gt;
+ neighborhood of the graph &#8212; say, &#8220;sellers of quilts&#8221;.&lt;/p&gt;
+ &lt;ul&gt;
+ &lt;li&gt;Take all users who match the handle (n0)&lt;/li&gt;
+ &lt;li&gt;Broaden n0 along all relevant connections: buy- or sell-transactions, sellers of products sold by people in n0, etc. Call this n1_all.&lt;/li&gt;
+ &lt;li&gt;Prune n1_all: eliminate entities with very few or very weak ties to n0, and call this n1.&lt;/li&gt;
+ &lt;li&gt;Do a join of n1 on the products&#8217;s seller_id. (This requires a join, but since n1 is &#8216;only&#8217; a few million rows, you can do a fairly efficient map-side (aka fragment-replicate) join)&lt;/li&gt;
+ &lt;li&gt;Do some joins of n1 on the transactions, keeping those with a member of n1 on the left (and/or) right.&lt;/li&gt;
+ &lt;/ul&gt;
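The broaden-then-prune step above can be sketched in Ruby on a toy edge list ([from, to] pairs) standing in for the real transaction tables; `subuniverse` and `min_ties` are made-up names:

```ruby
require 'set'

# n0: Set of seed ids matching the handle. Broaden along edges,
# then prune outside nodes with fewer than min_ties ties into n0.
def subuniverse(n0, edges, min_ties = 2)
  ties = Hash.new(0)
  edges.each do |a, b|
    ties[b] += 1 if n0.include?(a) && !n0.include?(b)
    ties[a] += 1 if n0.include?(b) && !n0.include?(a)
  end
  n0 | ties.select { |_node, count| count >= min_ties }.keys
end
```

In practice the broadening and pruning are joins over the edge tables rather than an in-memory pass, but the filtering logic is the same.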
+ &lt;p&gt;You want highly similar features in n0, or n1 will get too large. &#8220;People from&lt;br /&gt;
+ Denver&#8221; would be a bad handle for a shopping site, a decent handle for a fantasy&lt;br /&gt;
+ football site.&lt;/p&gt;
+ &lt;p&gt;Here&#8217;s the same thing for our favorite network graph:&lt;/p&gt;
+ &lt;ul&gt;
+ &lt;li&gt;Take all users who match the handle (n0)&lt;/li&gt;
+ &lt;li&gt;Broaden n0 along all relevant connections: for example atsign, follow, topic usage, etc &#8211; call this n1_all&lt;/li&gt;
+ &lt;li&gt;Prune n1_all: eliminate users with very few or very weak ties to n0, and call this n1.&lt;/li&gt;
+ &lt;li&gt;Do a join of n1 on the tweets. (Note that, since n1 is &#8216;only&#8217; a few million rows, you can do a map-side aka fragment-replicate join, which is actually quite efficient)&lt;/li&gt;
+ &lt;li&gt;Do some joins of n1 to get relationships with a sample user on the left (and/or) right&lt;/li&gt;
+ &lt;/ul&gt;
+ &lt;p&gt;For example, the subuniverse we typically work with is &#8220;users who have mentioned&lt;br /&gt;
+ @infochimps, hadoop, opendata, or bigdata&#8221;. We chose this handle for a few&lt;br /&gt;
+ reasons (besides the &#8220;we are big dorks&#8221;). Since we infochimps land in there,&lt;br /&gt;
+ it&#8217;s easy to inspect the results of an experiment against a familiar object&lt;br /&gt;
+ (ourselves). It also gives very correlated edges: many such people also follow&lt;br /&gt;
+ each other, use other similar terms, etc. Without this correlation, we&#8217;d span&lt;br /&gt;
+ too much of the graph.&lt;/p&gt;
+ &lt;p&gt;Within the subuniverse, we can happily do joins, calculate trstrank, and examine&lt;br /&gt;
+ local community structure.&lt;/p&gt;
+ &lt;p&gt;Of course, the sample is heavily skewed by its handle. There&#8217;s the obvious way:&lt;br /&gt;
+ among people who mention &#8216;hadoop&#8217;, conference planning is easy, dating is&lt;br /&gt;
+ unfortunately hard. More importantly, no matter what handle you use the&lt;br /&gt;
+ subuniverse will be heavily biased towards the &#8216;core&#8217; of the graph:&lt;/p&gt;
+ &lt;ul&gt;
+ &lt;li&gt;twitter users with millions of followers are going to land in almost any given subuniverse&lt;/li&gt;
+ &lt;li&gt;the trstrank of any given subuniverse is going to be vastly higher than the whole graph average&lt;/li&gt;
+ &lt;li&gt;Since real-world dynamic graphs typically densify over time (more roads are built, you follow more people on twitter), a subuniverse sample will have disproportionately few recent nodes.&lt;/li&gt;
+ &lt;/ul&gt;
+ &lt;h2&gt;Connectivity-preserving Sample&lt;/h2&gt;
+ &lt;p&gt;There&#8217;s one other type of sample you might like to do: one that preserves the&lt;br /&gt;
+ global connectivity of edges.&lt;/p&gt;
+ &lt;ul&gt;
+ &lt;li&gt;For each edge, record the total degree of the nodes at its ends (deg_a, deg_b).&lt;/li&gt;
+ &lt;li&gt;Stream through all the edges and with a probability of ( f * ( 1/deg_a + 1/deg_b )), keep the edge.&lt;/li&gt;
+ &lt;/ul&gt;
+ &lt;p&gt;(The parameter f adjusts the fraction of edges sampled.) In this equation, a&lt;br /&gt;
+ node with one inbound link has a high chance of survival. On average, each node&lt;br /&gt;
+ will have f inbound and f outbound links survive.&lt;/p&gt;
+ &lt;p&gt;This also in general retains all nodes: with f ~ 0.5, a 1B-edge graph on 100m&lt;br /&gt;
+ nodes will come out with about 100m edges and 100m nodes. You&#8217;ll have to turn f&lt;br /&gt;
+ down pretty far for a significant number of nodes to start failing the binomial&lt;br /&gt;
+ trial at each end.&lt;/p&gt;
+ &lt;p&gt;To do this consistently, set g = 1/f and do&lt;/p&gt;
+ &lt;pre&gt;&lt;code&gt;( (MD5( [node_a_id, node_b_id, &#39;a&#39;, salt].join(&quot;:&quot;) ) % (deg_a * g) = 0) ||
+   (MD5( [node_a_id, node_b_id, &#39;b&#39;, salt].join(&quot;:&quot;) ) % (deg_b * g) = 0) )&lt;/code&gt;&lt;/pre&gt;
+ &lt;p&gt;(I introduced &#8216;a&#8217; and &#8216;b&#8217; as extra salts: deg_a and deg_b may be correlated, but&lt;br /&gt;
+ we need the two trials to be independent)&lt;/p&gt;
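The consistent keep/drop test above might look like this in Ruby (a sketch; `edge_hash` and `keep_edge?` are hypothetical names). The 'a' and 'b' tags keep the two ends' trials independent even when deg_a and deg_b are correlated:

```ruby
require 'digest'

def edge_hash(a, b, tag, salt)
  Digest::MD5.hexdigest([a, b, tag, salt].join(":"))[-8, 8].to_i(16)
end

# g = 1/f; an edge survives if either end's trial succeeds,
# each with probability ~ 1/(deg * g).
def keep_edge?(a, b, deg_a, deg_b, g, salt = 0)
  edge_hash(a, b, "a", salt) % (deg_a * g) == 0 ||
    edge_hash(a, b, "b", salt) % (deg_b * g) == 0
end
```

As in the text, a node with a single inbound link and g = 1 passes its trial with certainty, so low-degree nodes are protected.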
+ &lt;ul&gt;
+ &lt;li&gt;I think we&#8217;ve rid the earth of non-threadsafe RNGs, but many &lt;span class=&quot;caps&quot;&gt;UUID&lt;/span&gt; implementations (incl. Java&#8217;s, I think) require a global lock.&lt;/li&gt;
+ &lt;li&gt;(I just made a cryptographer cry, so let me disclaim that you should really be using something-something-&lt;span class=&quot;caps&quot;&gt;HMAC&lt;/span&gt;-whatever i.e. if you care that this mixing is cryptographically strong take some time and look it up.)
+ &lt;ul&gt;
+ &lt;li&gt;make sure to join with a character that can&#8217;t appear in the key (here, &#8216;:&#8217;). Without the separator, key 12 in job 34 and key 123 in job 4 would hash identically.
+ &lt;ul&gt;
+ &lt;li&gt;These are available as environment variables if you&#8217;re streaming
+ &lt;ul&gt;
+ &lt;li&gt;Note that you always need to use the &lt;strong&gt;least&lt;/strong&gt; significant bytes because of Benford&#8217;s law&lt;/li&gt;
+ &lt;/ul&gt;&lt;/li&gt;
+ &lt;/ul&gt;&lt;/li&gt;
+ &lt;/ul&gt;&lt;/li&gt;
+ &lt;/ul&gt;
</content>
</entry>
</feed>
4 _site/colophon.html
@@ -36,7 +36,7 @@
<div id="page">
<div id="content">
<div id="main">
- <div class="content"><h2>Colophon</h2>&#x000A;<p>Uses <a href="http://wiki.github.com/imathis/octopress/">Octopress, a blogging framework designed for<br />&#x000A;hackers</a>, based on<br />&#x000A;<a href="http://github.com/mojombo/jekyll">Jekyll,</a> the blog aware static site generator.</p>&#x000A;<p>Composed in emacs, except for Jesse who isn&#8217;t enlightened that way yet.</p></div>
+ <div class="content"><h2>Colophon</h2>&#x000A;<p>We&#8217;re using the <a href="http://github.com/imathis/octopress">Octopress framework</a> for <a href="http://github.com/mojombo/jekyll">Jekyll.</a> Since octopress required some extinct fork of jekyll to render <span class="caps">HAML</span>, we did <a href="http://github.com/infochimps/infochimps.github.com/tree/master/_plugins">horrible, horrible monkey things</a> to make it render <span class="caps">HAML</span> text and layouts, but not require a special fork of Jekyll.</p>&#x000A;<p>We also use a tiny little sinatra shim inspired by Jesse Storimer&#8217;s <a href="http://jstorimer.com/2009/12/29/jekyll-on-heroku.html">Jekyll on Heroku</a> post. We added <a href="http://github.com/infochimps/infochimps.github.com/blob/master/devblog.rb">two tweaks:</a> one is to allow no-extension permalinks (redirects <code>/2010/09/foo</code> to <code>/2010/09/foo/index.html</code>), the other is to render the custom <a href="/404.html">/404.html</a> page.</p>&#x000A;<p>Get your own copy here:</p>&#x000A;<ul>&#x000A; <li><a href="http://github.com/infochimps/infochimps.github.com">Infochimps Blog Source Code</a></li>&#x000A;</ul>&#x000A;<p>Posts are composed in <a href="http://redcloth.org/textile">Textile</a> using Emacs (except for Jesse, who has some insane Dvorak-inverted notation textmate retroclone thing going).</p></div>
</div>
<div id="sidebar"></div>
</div>
@@ -44,7 +44,7 @@
<div id='footer'>
<div class='content'>
Copyright &copy; 2010 - Infochimps Dev Blog -
- <span class='credit'>Powered by <a href="http://octopress.org">Octopress</a></span>
+ <span class='credit'><a href="/colophon.html">colophon</a></span>
</div>
</div>
<script type='text/javascript'>
4 _site/index.html
@@ -36,7 +36,7 @@
<div id="page">
<div id="content">
<div id="main">
- <div class="content"><div class="blog">&#x000A; <div class="article">&#x000A; <h2><a class="title" href="/2010/09/firsties">Firsties</a></h2>&#x000A; <div class="meta">&#x000A; posted: September 6th, 2010&#x000A; &#x000A; </div>&#x000A; <h3 style="color:red;">First Post wooooo!!!</h3>&#x000A; </div>&#x000A; <div class="footer">&#x000A; <a href="/archives.html" title="archives">&laquo; Blog Archives</a>&#x000A; </div>&#x000A;</div></div>
+ <div class="content"><div class="blog">&#x000A; <div class="article">&#x000A; <h2><a class="title" href="/2010/09/scalable_sampling">Scalable Sampling</a></h2>&#x000A; <div class="meta">&#x000A; posted: September 7th, 2010&#x000A; &#x000A; </div>&#x000A; <h3>Sampling and Random Numbers</h3>&#x000A;<p>Found a really good caveat about using random numbers in a distributed system at the <a href="http://blog.rapleaf.com/dev/2009/08/14/using-random-numbers-in-mapreduce-is-dangerous/">rapleaf blog.</a> It&#8217;s subtle, so I&#8217;ll let you go read it there.</p>&#x000A;<p>Before you even get to such advanced mis-uses of random numbers<sup class="footnote" id="fnr1"><a href="#fn1">1</a></sup>, be sure you should be using them in the first place. People often reach for a <strong>random</strong> mapping when what they really want is a <strong>well-mixed</strong> mapping: a function such that similar but distinguishable objects will receive arbitrarily different outcomes. The MD5 hash is an easy way to do this.<sup class="footnote" id="fnr2"><a href="#fn2">2</a></sup></p>&#x000A;<h4>Consistent Shuffling</h4>&#x000A;<p>For example, you can shuffle a set of records by taking the MD5 hash of its primary key. The mixing is &#8220;consistent&#8221;: every run yields the same mixing. If you&#8217;d like it to <strong>not</strong> remain the same, use a salt<sup class="footnote" id="fnr3"><a href="#fn3">3</a></sup>:</p>&#x000A;<pre><code> MD5( [key, salt].join(":") )</code></pre>&#x000A;<p>Runs with the same salt and data will receive the same mixing. <em>Good salts</em>: If you use the job_id as salt, different runs will give different shuffles but within each such run, identical input keys are shuffled identically. If you use the source filename + split boundary + a running counter<sup class="footnote" id="fnr4"><a href="#fn4">4</a></sup>, each record will be mixed arbitrarily, but in a way that&#8217;s predictable across runs. 
<em>Bad Salts</em>: random numbers, timestamps and the hostname + <span class="caps">PID</span> are bad salts, for <a href="http://blog.rapleaf.com/dev/2009/08/14/using-random-numbers-in-mapreduce-is-dangerous/">the reasons given in the rapleaf post.</a></p>&#x000A;<h4>Sampling</h4>&#x000A;<p>To take a 1/n sample from a set of records, take the MD5 hash and emit only records which are zero modulo n. If you have arbitrarily-assigned numeric primary keys you can just modulo n them directly, as long as n is large. In both cases note that you can&#8217;t subsample using this trick.</p>&#x000A;<h3>Uniform-All Sample</h3>&#x000A;<p>Here&#8217;s the wrong way to sample three related tables:</p>&#x000A;<ul>&#x000A; <li>Sample 1/100 users</li>&#x000A; <li>Sample 1/100 products</li>&#x000A; <li>Sample 1/100 transactions</li>&#x000A;</ul>&#x000A;<p>The problem is that none of them will join: for most transactions, you won&#8217;t be able to look up the buyers, sellers or products.<sup class="footnote" id="fnr5"><a href="#fn5">5</a></sup></p>&#x000A;(<a href="/2010/09/scalable_sampling">continued&hellip;</a>)&#x000A; </div>&#x000A; <div class="article">&#x000A; <h2><a class="title" href="/2010/09/firsties">Firsties</a></h2>&#x000A; <div class="meta">&#x000A; posted: September 6th, 2010&#x000A; &#x000A; </div>&#x000A; <h3 style="color:red;">First Post wooooo!!!</h3>&#x000A;<p>Welcome to the new Infochimps dev blog. As opposed to the long-form thoughtful stuff you&#8217;ll find over at our <a href="http://blog.infochimps.org">main blog</a> this will be pure geek thoughtstream. 
Posts might be one line or hundreds, and might contain only formulas or only code.</p>&#x000A;<h4>A word about this blog.</h4>&#x000A;<p>We&#8217;re using the <a href="http://github.com/imathis/octopress">Octopress framework</a> for <a href="http://github.com/mojombo/jekyll">Jekyll.</a> Since octopress required some extinct fork of jekyll to render <span class="caps">HAML</span>, we did <a href="http://github.com/infochimps/infochimps.github.com/tree/master/_plugins">horrible, horrible monkey things</a> to make it render <span class="caps">HAML</span> text and layouts, but not require a special fork of Jekyll.</p>&#x000A;<p>We also use a tiny little sinatra shim inspired by Jesse Storimer&#8217;s <a href="http://jstorimer.com/2009/12/29/jekyll-on-heroku.html">Jekyll on Heroku</a> post. We added <a href="http://github.com/infochimps/infochimps.github.com/blob/master/devblog.rb">two tweaks:</a> one is to allow no-extension permalinks (redirects <code>/2010/09/foo</code> to <code>/2010/09/foo/index.html</code>), the other is to render the custom <a href="/404.html">/404.html</a> page.</p>&#x000A;<p>Get your own copy here:</p>&#x000A;<ul>&#x000A; <li><a href="http://github.com/infochimps/infochimps.github.com">Infochimps Blog Source Code</a></li>&#x000A;</ul>&#x000A;<p>Posts are composed in <a href="http://redcloth.org/textile">Textile</a> using Emacs (except for Jesse, who has some insane Dvorak-inverted notation textmate retroclone thing going).</p>&#x000A; </div>&#x000A; <div class="footer">&#x000A; <a href="/archives.html" title="archives">&laquo; Blog Archives</a>&#x000A; </div>&#x000A;</div></div>
</div>
<div id="sidebar"></div>
</div>
@@ -44,7 +44,7 @@
<div id='footer'>
<div class='content'>
Copyright &copy; 2010 - Infochimps Dev Blog -
- <span class='credit'>Powered by <a href="http://octopress.org">Octopress</a></span>
+ <span class='credit'><a href="/colophon.html">colophon</a></span>
</div>
</div>
<script type='text/javascript'>
