
re-adding _site

1 parent 2d3478f commit 5297f61637c69b256392be6e96ff483308f4e74d Philip (flip) Kromer committed Sep 7, 2010
@@ -0,0 +1,85 @@
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
+<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
+ <head>
+ <title>Infochimps Dev Blog :: Firsties</title>
+ <link href='/stylesheets/screen.css' media='screen, projection' rel='stylesheet' type='text/css' />
+ <script src='http://ajax.googleapis.com/ajax/libs/mootools/1.2.4/mootools-yui-compressed.js' type='text/javascript'></script>
+ <script src='/javascripts/mootools-1.2.4.2-more.js' type='text/javascript'></script>
+ <link href='/atom.xml' rel='alternate' title='Infochimps Dev Blog' type='application/atom+xml' />
+ </head>
+ <body id="">
+ <div id="header">
+ <div class='content'>
+ <h1>
+ <a class='title' href='/'>Infochimps Dev Blog</a>
+ </h1>
+ </div>
+ </div>
+ <div id="nav">
+ <div class='content'>
+ <ul>
+ <li class='alpha'>
+ <a href='/'>Blog</a>
+ </li>
+ <li>
+ <a href='/archives.html'>Archives</a>
+ </li>
+ <li class='omega'>
+ <a href='/about.html'>About</a>
+ </li>
+ <li class='subscribe'>
+ <a href='/atom.xml'>Subscribe</a>
+ </li>
+ </ul>
+ </div>
+ </div>
+ <div id="page">
+ <div id="content">
+ <div id="main">
+ <div class="blog content">
+ <div class='article'>
+ <h2>Firsties</h2>
+ <div class='meta'>
+ by: Infochimps Dev Team | posted: September 6th, 2010
+
+ </div>
+          <h3 style="color:red;">First Post wooooo!!!</h3>&#x000A;<p>Welcome to the new Infochimps dev blog. As opposed to the long-form thoughtful stuff you&#8217;ll find over at our <a href="http://blog.infochimps.org">main blog</a>, this will be pure geek thoughtstream. Posts might be one line or hundreds, and might contain only formulas or only code.</p>&#x000A;<h4>A word about this blog.</h4>&#x000A;<p>We&#8217;re using the <a href="http://github.com/imathis/octopress">Octopress framework</a> for <a href="http://github.com/mojombo/jekyll">Jekyll.</a> Since Octopress required some extinct fork of Jekyll to render <span class="caps">HAML</span>, we did <a href="http://github.com/infochimps/infochimps.github.com/tree/master/_plugins">horrible, horrible monkey things</a> to make it render <span class="caps">HAML</span> text and layouts, but not require a special fork of Jekyll.</p>&#x000A;<p>We also use a tiny little Sinatra shim inspired by Jesse Storimer&#8217;s <a href="http://jstorimer.com/2009/12/29/jekyll-on-heroku.html">Jekyll on Heroku</a> post. We added <a href="http://github.com/infochimps/infochimps.github.com/blob/master/devblog.rb">two tweaks:</a> one is to allow no-extension permalinks (redirecting <code>/2010/09/foo</code> to <code>/2010/09/foo/index.html</code>), the other is to render the custom <a href="/404.html">/404.html</a> page.</p>&#x000A;<p>Get your own copy here:</p>&#x000A;<ul>&#x000A; <li><a href="http://github.com/infochimps/infochimps.github.com">Infochimps Blog Source Code</a></li>&#x000A;</ul>&#x000A;<p>Posts are composed in <a href="http://redcloth.org/textile">Textile</a> using Emacs (except for Jesse, who has some insane Dvorak-inverted-notation TextMate retroclone thing going).</p>
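The no-extension permalink tweak described above can be sketched as a small Ruby function. This is only an illustrative sketch with a method name of our own choosing; the real logic lives in the linked devblog.rb:

```ruby
# Hypothetical sketch of the no-extension permalink rewrite: paths shaped
# like /YYYY/MM/slug get "/index.html" appended; everything else is untouched.
def permalink_target(path)
  if path =~ %r{\A/\d{4}/\d{2}/[^./]+\z}
    "#{path}/index.html"  # e.g. /2010/09/foo -> /2010/09/foo/index.html
  else
    path                  # anything with an extension is served as-is
  end
end
```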
+ <div id='disqus_thread'>
+ <script type='text/javascript'>
+ //<![CDATA[
+ var disqus_url = "http://icsblog.heroku.com/2010/09/firsties";
+ //]]>
+ </script>
+ <noscript>
+ <a href='http://infochimps.disqus.com/?url=ref'>View the discussion thread</a>
+ </noscript>
+ <script src='http://disqus.com/forums/infochimps/embed.js' type='text/javascript'></script>
+ </div>
+ </div>
+ </div>
+ </div>
+ <div id="sidebar"></div>
+ </div>
+ </div>
+ <div id='footer'>
+ <div class='content'>
+ Copyright &copy; 2010 - Infochimps Dev Blog -
+ <span class='credit'><a href="/colophon.html">colophon</a></span>
+ </div>
+ </div>
+ <script type='text/javascript'>
+ //<![CDATA[
+ (function() {
+ var links = document.getElementsByTagName('a');
+ var query = '?';
+ for(var i = 0; i < links.length; i++) {
+ if(links[i].href.indexOf('#disqus_thread') >= 0) {
+ query += 'url' + i + '=' + encodeURIComponent(links[i].href) + '&';
+ }
+ }
+ document.write('<script charset="utf-8" type="text/javascript" src="http://disqus.com/forums/infochimps/get_num_replies.js' + query + '"></' + 'script>');
+ })();
+ //]]>
+ </script>
+ </body>
+</html>
@@ -0,0 +1,85 @@
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
+<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
+ <head>
+ <title>Infochimps Dev Blog :: Scalable Sampling</title>
+ <link href='/stylesheets/screen.css' media='screen, projection' rel='stylesheet' type='text/css' />
+ <script src='http://ajax.googleapis.com/ajax/libs/mootools/1.2.4/mootools-yui-compressed.js' type='text/javascript'></script>
+ <script src='/javascripts/mootools-1.2.4.2-more.js' type='text/javascript'></script>
+ <link href='/atom.xml' rel='alternate' title='Infochimps Dev Blog' type='application/atom+xml' />
+ </head>
+ <body id="">
+ <div id="header">
+ <div class='content'>
+ <h1>
+ <a class='title' href='/'>Infochimps Dev Blog</a>
+ </h1>
+ </div>
+ </div>
+ <div id="nav">
+ <div class='content'>
+ <ul>
+ <li class='alpha'>
+ <a href='/'>Blog</a>
+ </li>
+ <li>
+ <a href='/archives.html'>Archives</a>
+ </li>
+ <li class='omega'>
+ <a href='/about.html'>About</a>
+ </li>
+ <li class='subscribe'>
+ <a href='/atom.xml'>Subscribe</a>
+ </li>
+ </ul>
+ </div>
+ </div>
+ <div id="page">
+ <div id="content">
+ <div id="main">
+ <div class="blog content">
+ <div class='article'>
+ <h2>Scalable Sampling</h2>
+ <div class='meta'>
+ by: Infochimps Dev Team | posted: September 7th, 2010
+
+ </div>
+          <h3>Sampling and Random Numbers</h3>&#x000A;<p>Found a really good caveat about using random numbers in a distributed system at the <a href="http://blog.rapleaf.com/dev/2009/08/14/using-random-numbers-in-mapreduce-is-dangerous/">rapleaf blog.</a> It&#8217;s subtle, so I&#8217;ll let you go read it there.</p>&#x000A;<p>Before you even get to such advanced misuses of random numbers<sup class="footnote" id="fnr1"><a href="#fn1">1</a></sup>, be sure you should be using them in the first place. People often reach for a <strong>random</strong> mapping when what they really want is a <strong>well-mixed</strong> mapping: a function such that similar but distinguishable objects will receive arbitrarily different outcomes. The MD5 hash is an easy way to do this.<sup class="footnote" id="fnr2"><a href="#fn2">2</a></sup></p>&#x000A;<h4>Consistent Shuffling</h4>&#x000A;<p>For example, you can shuffle a set of records by taking the MD5 hash of each record&#8217;s primary key. The mixing is &#8220;consistent&#8221;: every run yields the same mixing. If you&#8217;d like it to <strong>not</strong> remain the same, use a salt<sup class="footnote" id="fnr3"><a href="#fn3">3</a></sup>:</p>&#x000A;<pre><code> MD5( [key, salt].join(":") )</code></pre>&#x000A;<p>Runs with the same salt and data will receive the same mixing. <em>Good salts</em>: If you use the job_id as salt, different runs will give different shuffles, but within each such run identical input keys are shuffled identically. If you use the source filename + split boundary + a running counter<sup class="footnote" id="fnr4"><a href="#fn4">4</a></sup>, each record will be mixed arbitrarily, but in a way that&#8217;s predictable across runs. 
<em>Bad Salts</em>: random numbers, timestamps and the hostname + <span class="caps">PID</span> are bad salts, for <a href="http://blog.rapleaf.com/dev/2009/08/14/using-random-numbers-in-mapreduce-is-dangerous/">the reasons given in the rapleaf post.</a></p>&#x000A;<h4>Sampling</h4>&#x000A;<p>To take a 1/n sample from a set of records, take the MD5 hash of each key and emit only records whose hash is zero modulo n. If you have arbitrarily-assigned numeric primary keys you can just take them modulo n directly, as long as n is large. In both cases note that you can&#8217;t subsample using this trick.</p>&#x000A;<h3>Uniform-All Sample</h3>&#x000A;<p>Here&#8217;s the wrong way to sample three related tables:</p>&#x000A;<ul>&#x000A; <li>Sample 1/100 users</li>&#x000A; <li>Sample 1/100 products</li>&#x000A; <li>Sample 1/100 transactions</li>&#x000A;</ul>&#x000A;<p>The problem is that none of them will join: for most transactions, you won&#8217;t be able to look up the buyers, sellers or products.<sup class="footnote" id="fnr5"><a href="#fn5">5</a></sup></p>&#x000A;<hr />&#x000A;<h3 style="vertical-align:middle;">Uniform plus Edges (Global-feature preserving) Sample</h3>&#x000A;<p>This is better:</p>&#x000A;<ul>&#x000A; <li>Take all users whose ids hash correctly (n1)</li>&#x000A; <li>Do a join of the transactions with n1</li>&#x000A; <li>Do some joins to get relationships with a user from n1 on the left (and/or) right</li>&#x000A;</ul>&#x000A;<p>However, it&#8217;s computationally harder than doing straight samples of each. The consistent hash answers that problem: just use the same hash on the <strong>foreign key</strong> (in this case, the user_id):</p>&#x000A;<ul>&#x000A; <li>Take all users whose ids hash correctly</li>&#x000A; <li>Take all products whose seller_id hashes correctly</li>&#x000A; <li>Take all transactions whose buyer_id (and/or) seller_id hashes correctly</li>&#x000A;</ul>&#x000A;<p>This gives you a very efficient uniform sample. 
If 4% of your buyers are from Florida, about 4% of the sampled users should be too, and about 4% of the transactions will be from Floridians. (<a href="http://kottke.org/10/05/monday-puzzle-time">Don&#8217;t get careless,</a> though)</p>&#x000A;<p>Some caveats. You don&#8217;t have good control over the sample fraction: your transactions probably obey a long-tail distribution (a few users account for a disproportionate number of transactions), which introduces high variance for the quantity recovered.</p>&#x000A;<p>The sample is also sparse, which can make analysis hard in some contexts. If you sample 1% of buyers, a product with 100 purchases will in general retain 1 buyer. You can&#8217;t test an algorithm that looks for similar products, or measures reputation flow.</p>&#x000A;<h3>Subuniverse (Local-structure preserving) Sample</h3>&#x000A;<p>To do a &#8216;subuniverse&#8217; sample, find a handle for some connected neighborhood of the graph &#8212; say, &#8220;sellers of quilts&#8221;.</p>&#x000A;<ul>&#x000A; <li>Take all users who match the handle (n0)</li>&#x000A; <li>Broaden n0 along all relevant connections: buy- or sell-transactions, sellers of products sold by people in n0, etc. Call this n1_all.</li>&#x000A; <li>Prune n1_all: eliminate entities with very few or very weak ties to n0, and call this n1.</li>&#x000A; <li>Do a join of n1 on the products&#8217; seller_id. (This requires a join, but since n1 is &#8216;only&#8217; a few million rows, you can do a fairly efficient map-side (aka fragment-replicate) join)</li>&#x000A; <li>Do some joins of n1 on the transactions, keeping those with a member of n1 on the left (and/or) right.</li>&#x000A;</ul>&#x000A;<p>You want highly similar features in n0, or n1 will get too large. 
&#8220;People from Denver&#8221; would be a bad handle for a shopping site, a decent handle for a fantasy football site.</p>&#x000A;<p>Here&#8217;s the same thing for our favorite network graph:</p>&#x000A;<ul>&#x000A; <li>Take all users who match the handle (n0)</li>&#x000A; <li>Broaden n0 along all relevant connections: for example atsign, follow, topic usage, etc &#8211; call this n1_all</li>&#x000A; <li>Prune n1_all: eliminate users with very few or very weak ties to n0, and call this n1.</li>&#x000A; <li>Do a join of n1 on the tweets. (Note that, since n1 is &#8216;only&#8217; a few million rows, you can do a map-side aka fragment-replicate join, which is actually quite efficient)</li>&#x000A; <li>Do some joins of n1 to get relationships with a sample user on the left (and/or) right</li>&#x000A;</ul>&#x000A;<p>For example, the subuniverse we typically work with is &#8220;users who have mentioned @infochimps, hadoop, opendata, or bigdata&#8221;. We chose this handle for a few reasons (besides the obvious &#8220;we are big dorks&#8221; one). Since we infochimps ourselves land in there, it&#8217;s easy to inspect the results of an experiment against a familiar object (ourselves). It also gives very correlated edges: many such people also follow each other, use other similar terms, etc. Without this correlation, we&#8217;d span too much of the graph.</p>&#x000A;<p>Within the subuniverse, we can happily do joins, calculate trstrank, and examine local community structure.</p>&#x000A;<p>Of course, the sample is heavily skewed by its handle. There&#8217;s the obvious way: among people who mention &#8216;hadoop&#8217;, conference planning is easy, dating is unfortunately hard. 
More importantly, no matter what handle you use, the subuniverse will be heavily biased towards the &#8216;core&#8217; of the graph:</p>&#x000A;<ul>&#x000A; <li>twitter users with millions of followers are going to land in almost any given subuniverse</li>&#x000A; <li>the trstrank of any given subuniverse is going to be vastly higher than the whole graph average</li>&#x000A; <li>Since real-world dynamic graphs typically densify over time (more roads are built, you follow more people on twitter), a subuniverse sample will have disproportionately few recent nodes.</li>&#x000A;</ul>&#x000A;<h3>Connectivity-preserving Sample</h3>&#x000A;<p>There&#8217;s one other type of sample you might like to do: one that preserves the global connectivity of edges.</p>&#x000A;<ul>&#x000A; <li>For each edge, record the total degree of the nodes at its ends (deg_a, deg_b).</li>&#x000A; <li>Stream through all the edges and with a probability of ( f * ( 1/deg_a + 1/deg_b )), keep the edge.</li>&#x000A;</ul>&#x000A;<p>(The parameter f adjusts the fraction of edges sampled.) In this equation, a node with one inbound link has a high chance of survival. On average, each node will have f inbound and f outbound links survive.</p>&#x000A;<p>This also in general retains all nodes: with f ~ 0.5, a 1B-edge graph on 100m nodes will come out with about 100m edges and 100m nodes. 
You&#8217;ll have to turn f down pretty far for a significant number of nodes to start failing the binomial trial at each end.</p>&#x000A;<p>To do this consistently, set g = 1/f and do</p>&#x000A;<pre><code> ( (MD5([node_a_id, node_b_id, 'a', salt].join(":")) % (deg_a * g) == 0) ||&#x000A; (MD5([node_a_id, node_b_id, 'b', salt].join(":")) % (deg_b * g) == 0) )</code></pre>&#x000A;<p>(since deg_a and deg_b may be correlated, we perturb them by adding &#8216;a&#8217; and &#8216;b&#8217; as extra salts)</p>&#x000A;<hr />&#x000A;<h4 style="vertical-align:middle;">Footnotes</h4>&#x000A;<p class="footnote" id="fn1"><a href="#fnr1"><sup>1</sup></a> We seem mostly rid of stupid and/or non-threadsafe RNGs. However, many <span class="caps">UUID</span> implementations (including Java&#8217;s, I think) require a global lock.</p>&#x000A;<p class="footnote" id="fn2"><a href="#fnr2"><sup>2</sup></a> By claiming an MD5 is good for anything I just made a cryptographer cry. So let me hurry to disclaim that you should really be using something-something-<span class="caps">HMAC</span>-whatever. That is &#8212; if you care that this mixing is cryptographically strong, go look up the one that is.</p>&#x000A;<p class="footnote" id="fn3"><a href="#fnr3"><sup>3</sup></a> Make sure to join with a character that can&#8217;t appear in the key (here, &#8216;:&#8217;). Without the separator, key 12 in job 34 and key 123 in job 4 would hash identically.</p>&#x000A;<p class="footnote" id="fn4"><a href="#fnr4"><sup>4</sup></a> These are available as environment variables if you&#8217;re streaming.</p>&#x000A;<p class="footnote" id="fn5"><a href="#fnr5"><sup>5</sup></a> Note that you always need to use the <strong>least</strong> significant bytes, because of <a href="http://en.wikipedia.org/wiki/Benford%27s_law">Benford&#8217;s law</a></p>
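The hash recipes in this post (salted consistent mixing, the 1/n sample, and the degree-weighted edge-survival test) can be sketched in a few lines of Ruby. This is an illustrative sketch with method names of our own choosing, not code from our pipeline:

```ruby
require 'digest/md5'

# Consistent mixing, as in MD5( [key, salt].join(":") ). The ":" separator
# keeps (key "12", salt "34") distinct from (key "123", salt "4").
def mixed(key, salt)
  Digest::MD5.hexdigest([key, salt].join(':')).to_i(16)
end

# Salted 1/n sample: keep a record iff its mixed key is zero modulo n.
def sampled?(key, n, salt)
  (mixed(key, salt) % n).zero?
end

# Connectivity-preserving edge sample with g = 1/f: an edge survives if it
# wins the trial at either end. The extra 'a'/'b' salts keep the two trials
# independent even when deg_a and deg_b are correlated.
def keep_edge?(node_a_id, node_b_id, deg_a, deg_b, g, salt)
  (mixed([node_a_id, node_b_id, 'a'].join(':'), salt) % (deg_a * g)).zero? ||
    (mixed([node_a_id, node_b_id, 'b'].join(':'), salt) % (deg_b * g)).zero?
end
```

Because the mixing is deterministic, re-running a job with the same salt reproduces the same sample exactly, which is the whole point.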
+ <div id='disqus_thread'>
+ <script type='text/javascript'>
+ //<![CDATA[
+ var disqus_url = "http://icsblog.heroku.com/2010/09/scalable_sampling";
+ //]]>
+ </script>
+ <noscript>
+ <a href='http://infochimps.disqus.com/?url=ref'>View the discussion thread</a>
+ </noscript>
+ <script src='http://disqus.com/forums/infochimps/embed.js' type='text/javascript'></script>
+ </div>
+ </div>
+ </div>
+ </div>
+ <div id="sidebar"></div>
+ </div>
+ </div>
+ <div id='footer'>
+ <div class='content'>
+ Copyright &copy; 2010 - Infochimps Dev Blog -
+ <span class='credit'><a href="/colophon.html">colophon</a></span>
+ </div>
+ </div>
+ <script type='text/javascript'>
+ //<![CDATA[
+ (function() {
+ var links = document.getElementsByTagName('a');
+ var query = '?';
+ for(var i = 0; i < links.length; i++) {
+ if(links[i].href.indexOf('#disqus_thread') >= 0) {
+ query += 'url' + i + '=' + encodeURIComponent(links[i].href) + '&';
+ }
+ }
+ document.write('<script charset="utf-8" type="text/javascript" src="http://disqus.com/forums/infochimps/get_num_replies.js' + query + '"></' + 'script>');
+ })();
+ //]]>
+ </script>
+ </body>
+</html>