testing deploy

commit 44acdac43d14e173c502a4866fa1e1b3c36e09a6 1 parent 9d0a8fe
@mrflip mrflip authored
Showing with 3 additions and 3 deletions.
  1. +1 −1  Rakefile
  2. +1 −1  _site/atom.html
  3. +1 −1  _site/index.html
2  Rakefile
@@ -21,7 +21,7 @@ document_root = "~/document_root/" # for rsync deployment
# Read http://pages.github.com for guidance
# If you're not using this, you can remove it
source_branch = "working" # this compiles to your deploy branch
-deploy_branch = "master" # For user pages, use "master" for project pages use "gh-pages"
+deploy_branch = "rendered" # For user pages, use "master" for project pages use "gh-pages"
## ---- ##
def ok_failed(condition)
2  _site/atom.html
@@ -3,7 +3,7 @@
<title>Infochimps Developers Blog: Big Data, Hadoop, Cassandra, Chef, Ruby, Rails and more.</title>
<link href="http://icsblog.heroku.com//atom.xml" rel="self" />
<link href="http://icsblog.heroku.com/" />
- <updated>2010-09-07T04:16:05-05:00</updated>
+ <updated>2010-09-07T04:23:24-05:00</updated>
<id>http://icsblog.heroku.com/</id>
<author>
<name>Infochimps Dev Team</name>
2  _site/index.html
@@ -36,7 +36,7 @@
<div id="page">
<div id="content">
<div id="main">
- <div class="content"><div class="blog">&#x000A; <div class="article">&#x000A; <h2><a class="title" href="/2010/09/scalable_sampling">Scalable Sampling</a></h2>&#x000A; <div class="meta">&#x000A; posted: September 7th, 2010&#x000A; &#x000A; </div>&#x000A; <h3>Sampling and Random Numbers</h3>&#x000A;<p>Found a really good caveat about using random numbers in a distributed system at the <a href="http://blog.rapleaf.com/dev/2009/08/14/using-random-numbers-in-mapreduce-is-dangerous/">rapleaf blog.</a> It&#8217;s subtle, so I&#8217;ll let you go read it there.</p>&#x000A;<p>Before you even get to such advanced mis-uses of random numbers<sup class="footnote" id="fnr1"><a href="#fn1">1</a></sup>, be sure you should be using them in the first place. People often reach for a <strong>random</strong> mapping what they really want is a <strong>well-mixed</strong> mapping: a function such that similar but distinguishable objects will receive arbitrarily different outcomes. The MD5 hash is an easy way to do this.<sup class="footnote" id="fnr2"><a href="#fn2">2</a></sup></p>&#x000A;<h4>Consistent Shuffling</h4>&#x000A;<p>For example, you can shuffle a set of records by taking the MD5 hash of its primary key. The mixing is &#8220;consistent&#8221;: every run yields the same mixing. If you&#8217;d like it to <strong>not</strong> remain the same, use a salt<sup class="footnote" id="fnr3"><a href="#fn3">3</a></sup>:</p>&#x000A;<pre><code> MD5( [key, salt].join(":") )</code></pre>&#x000A;<p>Runs wich the same salt and data will receive an the same mixing. <em>Good salts_: If you use the job</em>id as salt, different runs will give different shuffles but within each such run, identical input keys are shuffled identically. If you use the source filename + split boundary + a running counter<sup class="footnote" id="fnr4"><a href="#fn4">4</a></sup>, each record will be mixed arbitrarily, but in a way that&#8217;s predicatable acreoss runs. 
<em>Bad Salts</em>: random numbers, timestamps and the hostname + <span class="caps">PID</span> are bad salts, for <a href="http://blog.rapleaf.com/dev/2009/08/14/using-random-numbers-in-mapreduce-is-dangerous/">the reasons given in the rapleaf post.</a></p>&#x000A;<h4>Sampling</h4>&#x000A;<p>To take a 1/n sample from a set of records, take the MD5 hash and emit only records which are zero modulo n. If you have arbitrarily-assigned numeric primary keys you can just modulo n them directly, as long as n is large. In both cases note that you can&#8217;t subsample using this trick.</p>&#x000A;<h3>Uniform-All Sample</h3>&#x000A;<p>Here&#8217;s the wrong way to sample three related tables:</p>&#x000A;<ul>&#x000A; <li>Sample 1/100 users</li>&#x000A; <li>Sample 1/100 products</li>&#x000A; <li>Sample 1/100 transactions</li>&#x000A;</ul>&#x000A;<p>The problem is that none of them will join: for most transactions, you won&#8217;t be able to look up the buyers, sellers or products.<sup class="footnote" id="fnr5"><a href="#fn5">5</a></sup></p>&#x000A;(<a href="/2010/09/scalable_sampling">continued&hellip;</a>)&#x000A; </div>&#x000A; <div class="article">&#x000A; <h2><a class="title" href="/2010/09/firsties">Firsties</a></h2>&#x000A; <div class="meta">&#x000A; posted: September 6th, 2010&#x000A; &#x000A; </div>&#x000A; <h3 style="color:red;">First Post wooooo!!!</h3>&#x000A;<p>Welcome to the new Infochimps dev blog. As opposed to the long-form thoughtful stuff you&#8217;ll find over at our <a href="http://blog.infochimps.org">main blog</a> this will be pure geek thoughtstream. 
Posts might be one line or hundreds, and might contain only formulas or only code.</p>&#x000A;<h4>A word about this blog.</h4>&#x000A;<p>We&#8217;re using the <a href="http://github.com/imathis/octopress">Octopress framework</a> for <a href="http://github.com/mojombo/jekyll">Jekyll.</a> Since octopress required some extinct fork of jekyll to render <span class="caps">HAML</span>, we did <a href="http://github.com/infochimps/infochimps.github.com/tree/master/_plugins">horrible, horrible monkey things</a> to make it render <span class="caps">HAML</span> text and layouts, but not require a special fork of Jekyll.</p>&#x000A;<p>We also use a tiny little sinatra shim inspired by Jesse Storimer&#8217;s <a href="http://jstorimer.com/2009/12/29/jekyll-on-heroku.html">Jekyll on Heroku</a> post. We added <a href="http://github.com/infochimps/infochimps.github.com/blob/master/devblog.rb">two tweaks:</a> one is to allow no-extension permalinks (redirects <code>/2010/09/foo</code> to <code>/2010/09/foo/index.html</code>), the other is to render the custom <a href="/404.html">/404.html</a> page.</p>&#x000A;<p>Get your own copy here:</p>&#x000A;<ul>&#x000A; <li><a href="http://github.com/infochimps/infochimps.github.com">Infochimps Blog Source Code</a></li>&#x000A;</ul>&#x000A;<p>Posts are composed in <a href="http://redcloth.org/textile">Textile</a> using Emacs (except for Jesse, who has some insane Dvorak-inverted notation textmate retroclone thing going).</p>&#x000A; </div>&#x000A; <div class="footer">&#x000A; <a href="/archives.html" title="archives">&laquo; Blog Archives</a>&#x000A; </div>&#x000A;</div></div>
+ <div class="content"><div class="blog">&#x000A; <div class="article">&#x000A; <h2><a class="title" href="/2010/09/scalable_sampling">Scalable Sampling</a></h2>&#x000A; <div class="meta">&#x000A; posted: September 7th, 2010&#x000A; &#x000A; </div>&#x000A; <h3>Sampling and Random Numbers</h3>&#x000A;<p>Found a really good caveat about using random numbers in a distributed system at the <a href="http://blog.rapleaf.com/dev/2009/08/14/using-random-numbers-in-mapreduce-is-dangerous/">rapleaf blog.</a> It&#8217;s subtle, so I&#8217;ll let you go read it there.</p>&#x000A;<p>Before you even get to such advanced mis-uses of random numbers<sup class="footnote" id="fnr1"><a href="#fn1">1</a></sup>, be sure you should be using them in the first place. People often reach for a <strong>random</strong> mapping what they really want is a <strong>well-mixed</strong> mapping: a function such that similar but distinguishable objects will receive arbitrarily different outcomes. The MD5 hash is an easy way to do this.<sup class="footnote" id="fnr2"><a href="#fn2">2</a></sup></p>&#x000A;<h4>Consistent Shuffling</h4>&#x000A;<p>For example, you can shuffle a set of records by taking the MD5 hash of its primary key. The mixing is &#8220;consistent&#8221;: every run yields the same mixing. If you&#8217;d like it to <strong>not</strong> remain the same, use a salt<sup class="footnote" id="fnr3"><a href="#fn3">3</a></sup>:</p>&#x000A;<pre><code> MD5( [key, salt].join(":") )</code></pre>&#x000A;<p>Runs wich the same salt and data will receive an the same mixing. <em>Good salts_: If you use the job</em>id as salt, different runs will give different shuffles but within each such run, identical input keys are shuffled identically. If you use the source filename + split boundary + a running counter<sup class="footnote" id="fnr4"><a href="#fn4">4</a></sup>, each record will be mixed arbitrarily, but in a way that&#8217;s predicatable acreoss runs. 
<em>Bad Salts</em>: random numbers, timestamps and the hostname + <span class="caps">PID</span> are bad salts, for <a href="http://blog.rapleaf.com/dev/2009/08/14/using-random-numbers-in-mapreduce-is-dangerous/">the reasons given in the rapleaf post.</a></p>&#x000A;<h4>Sampling</h4>&#x000A;<p>To take a 1/n sample from a set of records, take the MD5 hash and emit only records which are zero modulo n. If you have arbitrarily-assigned numeric primary keys you can just modulo n them directly, as long as n is large. In both cases note that you can&#8217;t subsample using this trick.</p>&#x000A;<h3>Uniform-All Sample</h3>&#x000A;<p>Here&#8217;s the wrong way to sample three related tables:</p>&#x000A;<ul>&#x000A; <li>Sample 1/100 users</li>&#x000A; <li>Sample 1/100 products</li>&#x000A; <li>Sample 1/100 transactions</li>&#x000A;</ul>&#x000A;<p>The problem is that none of them will join: for most transactions, you won&#8217;t be able to look up the buyers, sellers or products.<sup class="footnote" id="fnr5"><a href="#fn5">5</a></sup></p>&#x000A;(<a href="/2010/09/scalable_sampling" class="cont">continued&hellip;</a>)&#x000A; </div>&#x000A; <div class="article">&#x000A; <h2><a class="title" href="/2010/09/firsties">Firsties</a></h2>&#x000A; <div class="meta">&#x000A; posted: September 6th, 2010&#x000A; &#x000A; </div>&#x000A; <h3 style="color:red;">First Post wooooo!!!</h3>&#x000A;<p>Welcome to the new Infochimps dev blog. As opposed to the long-form thoughtful stuff you&#8217;ll find over at our <a href="http://blog.infochimps.org">main blog</a> this will be pure geek thoughtstream. 
Posts might be one line or hundreds, and might contain only formulas or only code.</p>&#x000A;<h4>A word about this blog.</h4>&#x000A;<p>We&#8217;re using the <a href="http://github.com/imathis/octopress">Octopress framework</a> for <a href="http://github.com/mojombo/jekyll">Jekyll.</a> Since octopress required some extinct fork of jekyll to render <span class="caps">HAML</span>, we did <a href="http://github.com/infochimps/infochimps.github.com/tree/master/_plugins">horrible, horrible monkey things</a> to make it render <span class="caps">HAML</span> text and layouts, but not require a special fork of Jekyll.</p>&#x000A;<p>We also use a tiny little sinatra shim inspired by Jesse Storimer&#8217;s <a href="http://jstorimer.com/2009/12/29/jekyll-on-heroku.html">Jekyll on Heroku</a> post. We added <a href="http://github.com/infochimps/infochimps.github.com/blob/master/devblog.rb">two tweaks:</a> one is to allow no-extension permalinks (redirects <code>/2010/09/foo</code> to <code>/2010/09/foo/index.html</code>), the other is to render the custom <a href="/404.html">/404.html</a> page.</p>&#x000A;<p>Get your own copy here:</p>&#x000A;<ul>&#x000A; <li><a href="http://github.com/infochimps/infochimps.github.com">Infochimps Blog Source Code</a></li>&#x000A;</ul>&#x000A;<p>Posts are composed in <a href="http://redcloth.org/textile">Textile</a> using Emacs (except for Jesse, who has some insane Dvorak-inverted notation textmate retroclone thing going).</p>&#x000A; </div>&#x000A; <div class="footer">&#x000A; <a href="/archives.html" title="archives">&laquo; Blog Archives</a>&#x000A; </div>&#x000A;</div></div>
</div>
<div id="sidebar"></div>
</div>
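
Note on the post text carried in the `_site/index.html` hunk: the salted-MD5 shuffle and the 1/n sample it describes can be sketched in Ruby roughly as below. This is a minimal illustration, not code from this repository. The helper names `mixed_value`, `sampled?` and `consistent_shuffle` are hypothetical, and the sketch assumes keys stringify in a stable way.

```ruby
require "digest/md5"

# Hypothetical helper: map a key (plus optional salt) to a well-mixed integer.
# The same key and salt always yield the same value.
def mixed_value(key, salt = nil)
  token = salt.nil? ? key.to_s : [key, salt].join(":")
  Digest::MD5.hexdigest(token).to_i(16)
end

# Hypothetical helper: keep a record when its hash is zero modulo n,
# which selects roughly 1/n of the keys.
def sampled?(key, n, salt = nil)
  (mixed_value(key, salt) % n).zero?
end

# Hypothetical helper: a "consistent" shuffle, ordering keys by their hash.
# Changing the salt changes the order; reusing it reproduces the order.
def consistent_shuffle(keys, salt = nil)
  keys.sort_by { |key| mixed_value(key, salt) }
end
```

Because the decision depends only on the key and the salt, two runs over the same data with the same salt select the same records and produce the same order, whichever machine processes them.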