Atom feed adopts timestamp of last post

commit cf7707fb0dde454385a250675dcf14b2070e5c50 1 parent 44acdac
@mrflip authored
117 _site/atom.html
@@ -1,117 +0,0 @@
-<?xml version="1.0" encoding="utf-8" ?>
-<feed xmlns="http://www.w3.org/2005/Atom">
- <title>Infochimps Developers Blog: Big Data, Hadoop, Cassandra, Chef, Ruby, Rails and more.</title>
- <link href="http://icsblog.heroku.com//atom.xml" rel="self" />
- <link href="http://icsblog.heroku.com/" />
- <updated>2010-09-07T04:23:24-05:00</updated>
- <id>http://icsblog.heroku.com/</id>
- <author>
- <name>Infochimps Dev Team</name>
- <email>coders@infochimps.org</email>
- </author>
- <entry>
- <title>Firsties</title>
- <link href="http://icsblog.heroku.com//2010/09/firsties" />
- <updated>2010-09-06T00:00:00-05:00</updated>
- <id>http://icsblog.heroku.com//2010/09/firsties</id>
- <content type="html">
- &lt;h3 style=&quot;color:red;&quot;&gt;First Post wooooo!!!&lt;/h3&gt;
- &lt;p&gt;Welcome to the new Infochimps dev blog. As opposed to the long-form thoughtful stuff you&#8217;ll find over at our &lt;a href=&quot;http://blog.infochimps.org&quot;&gt;main blog&lt;/a&gt; this will be pure geek thoughtstream. Posts might be one line or hundreds, and might contain only formulas or only code.&lt;/p&gt;
- &lt;h4&gt;A word about this blog.&lt;/h4&gt;
- &lt;p&gt;We&#8217;re using the &lt;a href=&quot;http://github.com/imathis/octopress&quot;&gt;Octopress framework&lt;/a&gt; for &lt;a href=&quot;http://github.com/mojombo/jekyll&quot;&gt;Jekyll.&lt;/a&gt; Since octopress required some extinct fork of jekyll to render &lt;span class=&quot;caps&quot;&gt;HAML&lt;/span&gt;, we did &lt;a href=&quot;http://github.com/infochimps/infochimps.github.com/tree/master/_plugins&quot;&gt;horrible, horrible monkey things&lt;/a&gt; to make it render &lt;span class=&quot;caps&quot;&gt;HAML&lt;/span&gt; text and layouts, but not require a special fork of Jekyll.&lt;/p&gt;
- &lt;p&gt;We also use a tiny little sinatra shim inspired by Jesse Storimer&#8217;s &lt;a href=&quot;http://jstorimer.com/2009/12/29/jekyll-on-heroku.html&quot;&gt;Jekyll on Heroku&lt;/a&gt; post. We added &lt;a href=&quot;http://github.com/infochimps/infochimps.github.com/blob/master/devblog.rb&quot;&gt;two tweaks:&lt;/a&gt; one is to allow no-extension permalinks (redirects &lt;code&gt;/2010/09/foo&lt;/code&gt; to &lt;code&gt;/2010/09/foo/index.html&lt;/code&gt;), the other is to render the custom &lt;a href=&quot;http://icsblog.heroku.com//404.html&quot;&gt;/404.html&lt;/a&gt; page.&lt;/p&gt;
- &lt;p&gt;Get your own copy here:&lt;/p&gt;
- &lt;ul&gt;
- &lt;li&gt;&lt;a href=&quot;http://github.com/infochimps/infochimps.github.com&quot;&gt;Infochimps Blog Source Code&lt;/a&gt;&lt;/li&gt;
- &lt;/ul&gt;
- &lt;p&gt;Posts are composed in &lt;a href=&quot;http://redcloth.org/textile&quot;&gt;Textile&lt;/a&gt; using Emacs (except for Jesse, who has some insane Dvorak-inverted notation textmate retroclone thing going).&lt;/p&gt;
- </content>
- </entry>
- <entry>
- <title>Scalable Sampling</title>
- <link href="http://icsblog.heroku.com//2010/09/scalable_sampling" />
- <updated>2010-09-07T00:00:00-05:00</updated>
- <id>http://icsblog.heroku.com//2010/09/scalable_sampling</id>
- <content type="html">
- &lt;h3&gt;Sampling and Random Numbers&lt;/h3&gt;
- &lt;p&gt;Found a really good caveat about using random numbers in a distributed system at the &lt;a href=&quot;http://blog.rapleaf.com/dev/2009/08/14/using-random-numbers-in-mapreduce-is-dangerous/&quot;&gt;rapleaf blog.&lt;/a&gt; It&#8217;s subtle, so I&#8217;ll let you go read it there.&lt;/p&gt;
- &lt;p&gt;Before you even get to such advanced mis-uses of random numbers&lt;sup class=&quot;footnote&quot; id=&quot;fnr1&quot;&gt;&lt;a href=&quot;#fn1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;, be sure you should be using them in the first place. People often reach for a &lt;strong&gt;random&lt;/strong&gt; mapping when what they really want is a &lt;strong&gt;well-mixed&lt;/strong&gt; mapping: a function such that similar but distinguishable objects will receive arbitrarily different outcomes. The MD5 hash is an easy way to do this.&lt;sup class=&quot;footnote&quot; id=&quot;fnr2&quot;&gt;&lt;a href=&quot;#fn2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
- &lt;h4&gt;Consistent Shuffling&lt;/h4&gt;
- &lt;p&gt;For example, you can shuffle a set of records by taking the MD5 hash of its primary key. The mixing is &#8220;consistent&#8221;: every run yields the same mixing. If you&#8217;d like it to &lt;strong&gt;not&lt;/strong&gt; remain the same, use a salt&lt;sup class=&quot;footnote&quot; id=&quot;fnr3&quot;&gt;&lt;a href=&quot;#fn3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;
- &lt;pre&gt;&lt;code&gt; MD5( [key, salt].join(&quot;:&quot;) )&lt;/code&gt;&lt;/pre&gt;
- &lt;p&gt;Runs with the same salt and data will receive the same mixing. &lt;em&gt;Good salts&lt;/em&gt;: If you use the job_id as salt, different runs will give different shuffles but within each such run, identical input keys are shuffled identically. If you use the source filename + split boundary + a running counter&lt;sup class=&quot;footnote&quot; id=&quot;fnr4&quot;&gt;&lt;a href=&quot;#fn4&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;, each record will be mixed arbitrarily, but in a way that&#8217;s predictable across runs. &lt;em&gt;Bad Salts&lt;/em&gt;: random numbers, timestamps and the hostname + &lt;span class=&quot;caps&quot;&gt;PID&lt;/span&gt; are bad salts, for &lt;a href=&quot;http://blog.rapleaf.com/dev/2009/08/14/using-random-numbers-in-mapreduce-is-dangerous/&quot;&gt;the reasons given in the rapleaf post.&lt;/a&gt;&lt;/p&gt;
- &lt;h4&gt;Sampling&lt;/h4&gt;
- &lt;p&gt;To take a 1/n sample from a set of records, take the MD5 hash and emit only records which are zero modulo n. If you have arbitrarily-assigned numeric primary keys you can just modulo n them directly, as long as n is large. In both cases note that you can&#8217;t subsample using this trick.&lt;/p&gt;
- &lt;h3&gt;Uniform-All Sample&lt;/h3&gt;
- &lt;p&gt;Here&#8217;s the wrong way to sample three related tables:&lt;/p&gt;
- &lt;ul&gt;
- &lt;li&gt;Sample 1/100 users&lt;/li&gt;
- &lt;li&gt;Sample 1/100 products&lt;/li&gt;
- &lt;li&gt;Sample 1/100 transactions&lt;/li&gt;
- &lt;/ul&gt;
- &lt;p&gt;The problem is that none of them will join: for most transactions, you won&#8217;t be able to look up the buyers, sellers or products.&lt;sup class=&quot;footnote&quot; id=&quot;fnr5&quot;&gt;&lt;a href=&quot;#fn5&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
- &lt;hr /&gt;
- &lt;h3 style=&quot;vertical-align:middle;&quot;&gt;Uniform plus Edges (Global-feature preserving) Sample&lt;/h3&gt;
- &lt;p&gt;This is better:&lt;/p&gt;
- &lt;ul&gt;
- &lt;li&gt;Take all users whose ids hash correctly (n1)&lt;/li&gt;
- &lt;li&gt;Do a join of the transactions with n1&lt;/li&gt;
- &lt;li&gt;Do some joins to get relationships with a user from n1 on the left (and/or) right&lt;/li&gt;
- &lt;/ul&gt;
- &lt;p&gt;However, it&#8217;s computationally harder than doing straight samples of each. The consistent hash answers that problem: just use the same hash on the &lt;strong&gt;foreign key&lt;/strong&gt; (in this case, the user_id):&lt;/p&gt;
- &lt;ul&gt;
- &lt;li&gt;Take all users whose ids hash correctly&lt;/li&gt;
- &lt;li&gt;Take all products whose seller_id hashes correctly&lt;/li&gt;
- &lt;li&gt;Take all transactions whose buyer_id (and/or) seller_id hashes correctly&lt;/li&gt;
- &lt;/ul&gt;
- &lt;p&gt;This gives you a very efficient uniform sample. If 4% of your buyers are from Florida, about 4% of the sampled users should be too, and about 4% of the transactions will be from Floridians. (&lt;a href=&quot;http://kottke.org/10/05/monday-puzzle-time&quot;&gt;Don&#8217;t get careless,&lt;/a&gt; though)&lt;/p&gt;
- &lt;p&gt;Some caveats. You don&#8217;t have good control over the sample fraction: your transactions probably obey a long-tail distribution (a few users account for a disproportionate number of transactions), which introduces high variance for the quantity recovered.&lt;/p&gt;
- &lt;p&gt;The sample is also sparse, which can make analysis hard in some contexts. If you sample 1% of buyers, a product with 100 purchases will in general retain 1 buyer. You can&#8217;t test an algorithm that looks for similar products, or measures reputation flow; joins run into the same sparsity problem.&lt;/p&gt;
- &lt;h3&gt;Subuniverse (Local-structure preserving) Sample&lt;/h3&gt;
- &lt;p&gt;To do a &#8216;subuniverse&#8217; sample, find a handle for some connected neighborhood of the graph &#8212; say, &#8220;sellers of quilts&#8221;.&lt;/p&gt;
- &lt;ul&gt;
- &lt;li&gt;Take all users who match the handle (n0)&lt;/li&gt;
- &lt;li&gt;Broaden n0 along all relevant connections: buy- or sell-transactions, sellers of products sold by people in n0, etc. Call this n1_all.&lt;/li&gt;
- &lt;li&gt;Prune n1_all: eliminate entities with very few or very weak ties to n0, and call this n1.&lt;/li&gt;
- &lt;li&gt;Do a join of n1 on the products&#8217; seller_id. (This requires a join, but since n1 is &#8216;only&#8217; a few million rows, you can do a fairly efficient map-side (aka fragment-replicate) join)&lt;/li&gt;
- &lt;li&gt;Do some joins of n1 on the transactions, keeping those with a member of n1 on the left (and/or) right.&lt;/li&gt;
- &lt;/ul&gt;
- &lt;p&gt;You want highly similar features in n0, or n1 will get too large. &#8220;People from Denver&#8221; would be a bad handle for a shopping site, a decent handle for a fantasy football site.&lt;/p&gt;
- &lt;p&gt;Here&#8217;s the same thing for our favorite network graph:&lt;/p&gt;
- &lt;ul&gt;
- &lt;li&gt;Take all users who match the handle (n0)&lt;/li&gt;
- &lt;li&gt;Broaden n0 along all relevant connections: for example atsign, follow, topic usage, etc &#8211; call this n1_all&lt;/li&gt;
- &lt;li&gt;Prune n1_all: eliminate users with very few or very weak ties to n0, and call this n1.&lt;/li&gt;
- &lt;li&gt;Do a join of n1 on the tweets. (Note that, since n1 is &#8216;only&#8217; a few million rows, you can do a map-side aka fragment-replicate join, which is actually quite efficient)&lt;/li&gt;
- &lt;li&gt;Do some joins of n1 to get relationships with a sample user on the left (and/or) right&lt;/li&gt;
- &lt;/ul&gt;
- &lt;p&gt;For example, the subuniverse we typically work with is &#8220;users who have mentioned @infochimps, hadoop, opendata, or bigdata&#8221;. We chose this handle for a few reasons (besides the obvious &#8220;we are big dorks&#8221; one). Since we infochimps land in there, it&#8217;s easy to inspect the results of an experiment against a familiar object (ourselves). It also gives very correlated edges: many such people also follow each other, use other similar terms, etc. Without this correlation, we&#8217;d span too much of the graph.&lt;/p&gt;
- &lt;p&gt;Within the subuniverse, we can happily do joins, calculate trstrank, and examine local community structure.&lt;/p&gt;
- &lt;p&gt;Of course, the sample is heavily skewed by its handle. There&#8217;s the obvious skew: among people who mention &#8216;hadoop&#8217;, conference planning is easy, dating is unfortunately hard. More importantly, no matter what handle you use, the subuniverse will be heavily biased towards the &#8216;core&#8217; of the graph:&lt;/p&gt;
- &lt;ul&gt;
- &lt;li&gt;twitter users with millions of followers are going to land in almost any given subuniverse&lt;/li&gt;
- &lt;li&gt;the trstrank of any given subuniverse is going to be vastly higher than the whole graph average&lt;/li&gt;
- &lt;li&gt;Since real-world dynamic graphs typically densify over time (more roads are built, you follow more people on twitter), a subuniverse sample will have disproportionately few recent nodes.&lt;/li&gt;
- &lt;/ul&gt;
- &lt;h3&gt;Connectivity-preserving Sample&lt;/h3&gt;
- &lt;p&gt;There&#8217;s one other type of sample you might like to do: one that preserves the global connectivity of edges.&lt;/p&gt;
- &lt;ul&gt;
- &lt;li&gt;For each edge, record the total degrees of the nodes at its ends (deg_a, deg_b).&lt;/li&gt;
- &lt;li&gt;Stream through all the edges and with a probability of ( f * ( 1/deg_a + 1/deg_b )), keep the edge.&lt;/li&gt;
- &lt;/ul&gt;
- &lt;p&gt;(The parameter f adjusts the fraction of edges sampled.) In this equation, a node with one inbound link has a high chance of survival. On average, each node will have f inbound and f outbound links survive.&lt;/p&gt;
- &lt;p&gt;This also in general retains all nodes: with f ~ 0.5, a 1B-edge graph on 100m nodes will come out with about 100m edges and 100m nodes. You&#8217;ll have to turn f down pretty far for a significant number of nodes to start failing the binomial trial at each end.&lt;/p&gt;
- &lt;p&gt;To do this consistently, set g = 1/f and do&lt;/p&gt;
- &lt;pre&gt;&lt;code&gt; ( (MD5([node_a_id, node_b_id, 'a', salt].join(&quot;:&quot;)) % (deg_a * g) = 0) ||
- (MD5([node_a_id, node_b_id, 'b', salt].join(&quot;:&quot;)) % (deg_b * g) = 0) )&lt;/code&gt;&lt;/pre&gt;
- &lt;p&gt;(since deg_a and deg_b may be correlated, we perturb it by adding &#8216;a&#8217; and &#8216;b&#8217; as extra salts)&lt;/p&gt;
- &lt;hr /&gt;
- &lt;h4 style=&quot;vertical-align:middle;&quot;&gt;Footnotes&lt;/h4&gt;
- &lt;p class=&quot;footnote&quot; id=&quot;fn1&quot;&gt;&lt;a href=&quot;#fnr1&quot;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt; We seem mostly rid of stupid and/or non-threadsafe RNGs. However, many &lt;span class=&quot;caps&quot;&gt;UUID&lt;/span&gt; implementations (including Java&#8217;s, I think) require a global lock.&lt;/p&gt;
- &lt;p class=&quot;footnote&quot; id=&quot;fn2&quot;&gt;&lt;a href=&quot;#fnr2&quot;&gt;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt; By claiming an MD5 is good for anything I just made a cryptographer cry. So let me hurry to disclaim that you should really be using something-something-&lt;span class=&quot;caps&quot;&gt;HMAC&lt;/span&gt;-whatever. That is &#8212; if you care that this mixing is cryptographically strong, go look up the one that is.&lt;/p&gt;
- &lt;p class=&quot;footnote&quot; id=&quot;fn3&quot;&gt;&lt;a href=&quot;#fnr3&quot;&gt;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt; Make sure to join with a character that can&#8217;t appear in the key (here, &#8216;:&#8217;). Without the separator, key 12 in job 34 and key 123 in job 4 would hash identically.&lt;/p&gt;
- &lt;p class=&quot;footnote&quot; id=&quot;fn4&quot;&gt;&lt;a href=&quot;#fnr4&quot;&gt;&lt;sup&gt;4&lt;/sup&gt;&lt;/a&gt; These are available as environment variables if you&#8217;re streaming.&lt;/p&gt;
- &lt;p class=&quot;footnote&quot; id=&quot;fn5&quot;&gt;&lt;a href=&quot;#fnr5&quot;&gt;&lt;sup&gt;5&lt;/sup&gt;&lt;/a&gt; Note that you always need to use the &lt;strong&gt;least&lt;/strong&gt; significant bytes, because of &lt;a href=&quot;http://en.wikipedia.org/wiki/Benford%27s_law&quot;&gt;Benford&#8217;s law&lt;/a&gt;&lt;/p&gt;
- </content>
- </entry>
-</feed>
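
The sampling recipes in the feed above all reduce to one consistent-hash trick. A minimal Ruby sketch of the consistent shuffle/sample and the foreign-key (uniform-plus-edges) variant, assuming only the stdlib Digest::MD5; the toy tables and field names are illustrative, not taken from the post:

    require 'digest/md5'

    # Consistent mixing: the same key + salt always yields the same value.
    # Join with a separator that can't appear in the key (see footnote 3).
    def mixed(key, salt = '')
      Digest::MD5.hexdigest([key, salt].join(':')).to_i(16)
    end

    # Keep roughly 1/n of records: emit only keys whose hash is 0 modulo n.
    def sampled?(key, n, salt = '')
      (mixed(key, salt) % n).zero?
    end

    # Toy tables; deterministic ids stand in for a real store.
    users        = (1..10_000).map { |i| { id: i } }
    transactions = (1..50_000).map { |i| { id: i, buyer_id: (i * 7919) % 10_000 + 1 } }

    # Uniform-plus-edges sample: hash the *foreign key* (buyer_id), so a
    # transaction survives exactly when its buyer does, and joins still work.
    kept_users = users.select        { |u| sampled?(u[:id],       100) }
    kept_txns  = transactions.select { |t| sampled?(t[:buyer_id], 100) }
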
138 _site/atom.xml
@@ -3,7 +3,7 @@
<title>Infochimps Developers Blog: Big Data, Hadoop, Cassandra, Chef, Ruby, Rails and more.</title>
<link href="http://icsblog.heroku.com//atom.xml" rel="self" />
<link href="http://icsblog.heroku.com/" />
- <updated>2010-09-07T02:59:02-05:00</updated>
+ <updated>2010-09-07T00:00:00-05:00</updated>
<id>http://icsblog.heroku.com/</id>
<author>
<name>Infochimps Dev Team</name>
@@ -28,74 +28,47 @@
</content>
</entry>
<entry>
- <title>Sample Simply</title>
- <link href="http://icsblog.heroku.com//2010/09/sampling_aint_easy" />
- <updated>2010-09-06T00:00:00-05:00</updated>
- <id>http://icsblog.heroku.com//2010/09/sampling_aint_easy</id>
+ <title>Scalable Sampling</title>
+ <link href="http://icsblog.heroku.com//2010/09/scalable_sampling" />
+ <updated>2010-09-07T00:00:00-05:00</updated>
+ <id>http://icsblog.heroku.com//2010/09/scalable_sampling</id>
<content type="html">
- &lt;h2&gt;Sampling and Random Numbers&lt;/h2&gt;
- &lt;p&gt;Found a really good caveat about using random numbers in a distributed system at the &lt;a href=&quot;http://blog.rapleaf.com/dev/2009/08/14/using-random-numbers-in-mapreduce-is-dangerous/&quot;&gt;rapleaf blog.&lt;/a&gt; It&#8217;s subtle, so I&#8217;ll let you go read it there.(*)&lt;/p&gt;
- &lt;p&gt;Besides lot of times that people reach for &#8216;random&#8217; what they really want is &#8216;mixed arbitrarily&#8217;.&lt;/p&gt;
- &lt;p&gt;A lot of times that people reach for &#8216;random&#8217; what they really want is &#8216;mixed&lt;br /&gt;
- arbitrarily&#8217;: that is, a function such that similar objects will receive&lt;br /&gt;
- arbitrarily different outcomes. The MD5 hash is an easy way to do this.**&lt;/p&gt;
- &lt;p&gt;To shuffle a set of records, take the MD5 hash of its primary key. The mixing is&lt;br /&gt;
- &#8220;consistent&#8221;: every time you run this you&#8217;ll get the same mixing. If you&#8217;d like&lt;br /&gt;
- it to &lt;strong&gt;not&lt;/strong&gt; be consistent, use a salt:&lt;/p&gt;
- &lt;acronym title=&quot; [key, salt].join(&quot;:&quot;&quot;&gt;&lt;span class=&quot;caps&quot;&gt;MD5&lt;/span&gt;&lt;/acronym&gt; ) (**)
- &lt;p&gt;Now every run using the same salt will receive an identical mixing that is still&lt;br /&gt;
- arbitrary within the run. To vary by the job, the task, the partition, or the&lt;br /&gt;
- row, salt using the job_id, task_id, source filename + split boundary(&lt;strong&gt;*&lt;/strong&gt;), or&lt;br /&gt;
- source filename + split boundary + running counter.&lt;/p&gt;
- &lt;p&gt;A random number, a timestamp, or the hostname + &lt;span class=&quot;caps&quot;&gt;PID&lt;/span&gt; are bad salts, for the reasons given http://blog.rapleaf.com/dev/2009/08/14/using-random-numbers-in-mapreduce-is-dangerous/&lt;/p&gt;
- &lt;p&gt;To take a 1/n sample from a set of records, take the MD5 hash and emit only&lt;br /&gt;
- those records which are zero modulo n. If you have arbitrarily-assigned numeric&lt;br /&gt;
- primary keys you can just modulo n them directly, as long as n is large. In both&lt;br /&gt;
- cases note that you can&#8217;t subsample using this trick.&lt;/p&gt;
- &lt;h2&gt;Uniform-All Sample&lt;/h2&gt;
- &lt;p&gt;Say you&#8217;re a site where users sell products to each other. For development, you&lt;br /&gt;
- want a 1% sample to test on. Here&#8217;s the wrong thing to do:&lt;/p&gt;
+ &lt;h3&gt;Sampling and Random Numbers&lt;/h3&gt;
+ &lt;p&gt;Found a really good caveat about using random numbers in a distributed system at the &lt;a href=&quot;http://blog.rapleaf.com/dev/2009/08/14/using-random-numbers-in-mapreduce-is-dangerous/&quot;&gt;rapleaf blog.&lt;/a&gt; It&#8217;s subtle, so I&#8217;ll let you go read it there.&lt;/p&gt;
+ &lt;p&gt;Before you even get to such advanced mis-uses of random numbers&lt;sup class=&quot;footnote&quot; id=&quot;fnr1&quot;&gt;&lt;a href=&quot;#fn1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;, be sure you should be using them in the first place. People often reach for a &lt;strong&gt;random&lt;/strong&gt; mapping when what they really want is a &lt;strong&gt;well-mixed&lt;/strong&gt; mapping: a function such that similar but distinguishable objects will receive arbitrarily different outcomes. The MD5 hash is an easy way to do this.&lt;sup class=&quot;footnote&quot; id=&quot;fnr2&quot;&gt;&lt;a href=&quot;#fn2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
+ &lt;h4&gt;Consistent Shuffling&lt;/h4&gt;
+ &lt;p&gt;For example, you can shuffle a set of records by taking the MD5 hash of its primary key. The mixing is &#8220;consistent&#8221;: every run yields the same mixing. If you&#8217;d like it to &lt;strong&gt;not&lt;/strong&gt; remain the same, use a salt&lt;sup class=&quot;footnote&quot; id=&quot;fnr3&quot;&gt;&lt;a href=&quot;#fn3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;
+ &lt;pre&gt;&lt;code&gt; MD5( [key, salt].join(&quot;:&quot;) )&lt;/code&gt;&lt;/pre&gt;
+ &lt;p&gt;Runs with the same salt and data will receive the same mixing. &lt;em&gt;Good salts&lt;/em&gt;: If you use the job_id as salt, different runs will give different shuffles but within each such run, identical input keys are shuffled identically. If you use the source filename + split boundary + a running counter&lt;sup class=&quot;footnote&quot; id=&quot;fnr4&quot;&gt;&lt;a href=&quot;#fn4&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;, each record will be mixed arbitrarily, but in a way that&#8217;s predictable across runs. &lt;em&gt;Bad Salts&lt;/em&gt;: random numbers, timestamps and the hostname + &lt;span class=&quot;caps&quot;&gt;PID&lt;/span&gt; are bad salts, for &lt;a href=&quot;http://blog.rapleaf.com/dev/2009/08/14/using-random-numbers-in-mapreduce-is-dangerous/&quot;&gt;the reasons given in the rapleaf post.&lt;/a&gt;&lt;/p&gt;
+ &lt;h4&gt;Sampling&lt;/h4&gt;
+ &lt;p&gt;To take a 1/n sample from a set of records, take the MD5 hash and emit only records which are zero modulo n. If you have arbitrarily-assigned numeric primary keys you can just modulo n them directly, as long as n is large. In both cases note that you can&#8217;t subsample using this trick.&lt;/p&gt;
+ &lt;h3&gt;Uniform-All Sample&lt;/h3&gt;
+ &lt;p&gt;Here&#8217;s the wrong way to sample three related tables:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Sample 1/100 users&lt;/li&gt;
&lt;li&gt;Sample 1/100 products&lt;/li&gt;
&lt;li&gt;Sample 1/100 transactions&lt;/li&gt;
&lt;/ul&gt;
- &lt;p&gt;The good thing is that each given product, transaction or user has the same&lt;br /&gt;
- uniform chance of being included.&lt;/p&gt;
- &lt;p&gt;The problem is that none of them will join: for most transactions, you won&#8217;t be&lt;br /&gt;
- able to look up the buyers, sellers or products. **&lt;/p&gt;
-
- &lt;ul&gt;
- &lt;li&gt;If you&#8217;re developing an exploratory data analysis tool for big data please support at least the Subuniverse and&lt;/li&gt;
- &lt;/ul&gt;&lt;h2&gt;Uniform plus Edges (Global-feature preserving) Sample&lt;/h2&gt;
+ &lt;p&gt;The problem is that none of them will join: for most transactions, you won&#8217;t be able to look up the buyers, sellers or products.&lt;sup class=&quot;footnote&quot; id=&quot;fnr5&quot;&gt;&lt;a href=&quot;#fn5&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
+ &lt;hr /&gt;
+ &lt;h3 style=&quot;vertical-align:middle;&quot;&gt;Uniform plus Edges (Global-feature preserving) Sample&lt;/h3&gt;
&lt;p&gt;This is better:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Take all users whose ids hash correctly (n1)&lt;/li&gt;
&lt;li&gt;Do a join of the transactions with n1&lt;/li&gt;
&lt;li&gt;Do some joins to get relationships with a user from n1 on the left (and/or) right&lt;/li&gt;
&lt;/ul&gt;
- &lt;p&gt;However, it&#8217;s computationally harder than doing straight samples of each. The&lt;br /&gt;
- consistent hash answers that problem: just use the same hash on the &lt;strong&gt;foreign&lt;br /&gt;
- key&lt;/strong&gt; (in this case, the user_id):&lt;/p&gt;
+ &lt;p&gt;However, it&#8217;s computationally harder than doing straight samples of each. The consistent hash answers that problem: just use the same hash on the &lt;strong&gt;foreign key&lt;/strong&gt; (in this case, the user_id):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Take all users whose ids hash correctly&lt;/li&gt;
&lt;li&gt;Take all products whose seller_id hashes correctly&lt;/li&gt;
&lt;li&gt;Take all transactions whose buyer_id (and/or) seller_id hashes correctly&lt;/li&gt;
&lt;/ul&gt;
- &lt;p&gt;This gives you a very efficient uniform sample. If 4% of your buyers are from&lt;br /&gt;
- Florida, about 4% of the sampled users should be too, and about 4% of the&lt;br /&gt;
- transactions will be from Floridians. (&lt;a href=&quot;http://kottke.org/10/05/monday-puzzle-time&quot;&gt;Don&#8217;t get careless,&lt;/a&gt; though)&lt;/p&gt;
- &lt;p&gt;Some caveats. You don&#8217;t have good control over the sample fraction: your&lt;br /&gt;
- transactions probably obey a long-tail distribution (a few users account for a&lt;br /&gt;
- disproportionate number of transactions), which introduces high variance for the&lt;br /&gt;
- quantity recovered.&lt;/p&gt;
- &lt;p&gt;The sample is also sparse, which can make analysis hard in some contexts. If you&lt;br /&gt;
- sample 1% of buyers, a product with 100 purchases will in general retain 1&lt;br /&gt;
- buyer. You can&#8217;t test an algortihm that looks for similar products, or measures&lt;br /&gt;
- reputation flow. The problem with joins&lt;/p&gt;
- &lt;h2&gt;Subuniverse (Local-structure preserving) Sample&lt;/h2&gt;
- &lt;p&gt;To do a &#8216;subuniverse&#8217; sample, find some handle that lets you pick up a connected&lt;br /&gt;
- neighborhood of the graph &#8212; say, &#8220;sellers of quilts&#8221;.&lt;/p&gt;
+ &lt;p&gt;This gives you a very efficient uniform sample. If 4% of your buyers are from Florida, about 4% of the sampled users should be too, and about 4% of the transactions will be from Floridians. (&lt;a href=&quot;http://kottke.org/10/05/monday-puzzle-time&quot;&gt;Don&#8217;t get careless,&lt;/a&gt; though)&lt;/p&gt;
+ &lt;p&gt;Some caveats. You don&#8217;t have good control over the sample fraction: your transactions probably obey a long-tail distribution (a few users account for a disproportionate number of transactions), which introduces high variance for the quantity recovered.&lt;/p&gt;
+ &lt;p&gt;The sample is also sparse, which can make analysis hard in some contexts. If you sample 1% of buyers, a product with 100 purchases will in general retain 1 buyer. You can&#8217;t test an algorithm that looks for similar products, or measures reputation flow; joins run into the same sparsity problem.&lt;/p&gt;
+ &lt;h3&gt;Subuniverse (Local-structure preserving) Sample&lt;/h3&gt;
+ &lt;p&gt;To do a &#8216;subuniverse&#8217; sample, find a handle for some connected neighborhood of the graph &#8212; say, &#8220;sellers of quilts&#8221;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Take all users who match the handle (n0)&lt;/li&gt;
&lt;li&gt;Broaden n0 along all relevant connections: buy- or sell-transactions, sellers of products sold by people in n0, etc. Call this n1_all.&lt;/li&gt;
@@ -103,9 +76,7 @@
 &lt;li&gt;Do a join of n1 on the products&#8217; seller_id. (This requires a join, but since n1 is &#8216;only&#8217; a few million rows, you can do a fairly efficient map-side (aka fragment-replicate) join)&lt;/li&gt;
&lt;li&gt;Do some joins of n1 on the transactions, keeping those with a member of n1 on the left (and/or) right.&lt;/li&gt;
&lt;/ul&gt;
- &lt;p&gt;You want highly similar features in n0, or n1 will get too large. &#8220;People from&lt;br /&gt;
- Denver&#8221; would be a bad handle for a shopping site, a decent handle for a fantasy&lt;br /&gt;
- football site.&lt;/p&gt;
+ &lt;p&gt;You want highly similar features in n0, or n1 will get too large. &#8220;People from Denver&#8221; would be a bad handle for a shopping site, a decent handle for a fantasy football site.&lt;/p&gt;
&lt;p&gt;Here&#8217;s the same thing for our favorite network graph:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Take all users who match the handle (n0)&lt;/li&gt;
@@ -114,56 +85,33 @@
&lt;li&gt;Do a join of n1 on the tweets. (Note that, since n1 is &#8216;only&#8217; a few million rows, you can do a map-side aka fragment-replicate join, which is actually quite efficient)&lt;/li&gt;
&lt;li&gt;Do some joins of n1 to get relationships with a sample user on the left (and/or) right&lt;/li&gt;
&lt;/ul&gt;
- &lt;p&gt;For example, the subuniverse we typically work with is &#8220;users who have mentioned&lt;br /&gt;
- @infochimps, hadoop, opendata, or bigdata&#8221;. We chose this handle for a few&lt;br /&gt;
- reasons (besides the &#8220;we are big dorks&#8221;). Since we infochimps land in there,&lt;br /&gt;
- it&#8217;s easy to inspect the results of an experiment against a familiar object&lt;br /&gt;
- (ourselves). It also gives very correlated edges: many such people also follow&lt;br /&gt;
- each other, use other similar terms, etc. Without this correlation, we&#8217;d span&lt;br /&gt;
- too much of the graph.&lt;/p&gt;
- &lt;p&gt;Within the subuniverse, we can happily do joins, calculate trstrank, and examine&lt;br /&gt;
- local community structure.&lt;/p&gt;
- &lt;p&gt;Of course, the sample is heavily skewed by its handle. There&#8217;s the obvious way:&lt;br /&gt;
- among people who mention &#8216;hadoop&#8217;, conference planning is easy, dating is&lt;br /&gt;
- unfortunately hard. More importantly, no matter what handle you use the&lt;br /&gt;
- subuniverse will be heavily biased towards the &#8216;core&#8217; of the graph:&lt;/p&gt;
+ &lt;p&gt;For example, the subuniverse we typically work with is &#8220;users who have mentioned @infochimps, hadoop, opendata, or bigdata&#8221;. We chose this handle for a few reasons (besides the obvious &#8220;we are big dorks&#8221; one). Since we infochimps land in there, it&#8217;s easy to inspect the results of an experiment against a familiar object (ourselves). It also gives very correlated edges: many such people also follow each other, use other similar terms, etc. Without this correlation, we&#8217;d span too much of the graph.&lt;/p&gt;
+ &lt;p&gt;Within the subuniverse, we can happily do joins, calculate trstrank, and examine local community structure.&lt;/p&gt;
+ &lt;p&gt;Of course, the sample is heavily skewed by its handle. There&#8217;s the obvious skew: among people who mention &#8216;hadoop&#8217;, conference planning is easy, dating is unfortunately hard. More importantly, no matter what handle you use, the subuniverse will be heavily biased towards the &#8216;core&#8217; of the graph:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;twitter users with millions of followers are going to land in almost any given subuniverse&lt;/li&gt;
&lt;li&gt;the trstrank of any given subuniverse is going to be vastly higher than the whole graph average&lt;/li&gt;
&lt;li&gt;Since real-world dynamic graphs typically densify over time (more roads are built, you follow more people on twitter), a subuniverse sample will have disproportionately few recent nodes.&lt;/li&gt;
&lt;/ul&gt;
- &lt;h2&gt;Connectivity-preserving Sample&lt;/h2&gt;
- &lt;p&gt;There&#8217;s one other type of sample you might like to do: one that preserves the&lt;br /&gt;
- global connectivity of edges.&lt;/p&gt;
+ &lt;h3&gt;Connectivity-preserving Sample&lt;/h3&gt;
+ &lt;p&gt;There&#8217;s one other type of sample you might like to do: one that preserves the global connectivity of edges.&lt;/p&gt;
&lt;ul&gt;
 &lt;li&gt;For each edge, record the total degrees of the nodes at its ends (deg_a, deg_b).&lt;/li&gt;
&lt;li&gt;Stream through all the edges and with a probability of ( f * ( 1/deg_a + 1/deg_b )), keep the edge.&lt;/li&gt;
&lt;/ul&gt;
- &lt;p&gt;(The parameter f adjusts the fraction of edges sampled.) In this equation, a&lt;br /&gt;
- node with one inbound link has a high chance of survival. On average, each node&lt;br /&gt;
- will have f inbound and f outbound links survive.&lt;/p&gt;
- &lt;p&gt;This also in general retains all nodes: with f ~ 0.5, a 1B-edge graph on 100m&lt;br /&gt;
- nodes will come out with about 100m edges and 100m nodes. You&#8217;ll have to turn f&lt;br /&gt;
- down pretty far for a significant number of nodes to start failing the binomial&lt;br /&gt;
- trial at each end.&lt;/p&gt;
+ &lt;p&gt;(The parameter f adjusts the fraction of edges sampled.) In this equation, a node with one inbound link has a high chance of survival. On average, each node will have f inbound and f outbound links survive.&lt;/p&gt;
+ &lt;p&gt;This also in general retains all nodes: with f ~ 0.5, a 1B-edge graph on 100m nodes will come out with about 100m edges and 100m nodes. You&#8217;ll have to turn f down pretty far for a significant number of nodes to start failing the binomial trial at each end.&lt;/p&gt;
&lt;p&gt;To do this consistently, set g = 1/f and do&lt;/p&gt;
- ( (&lt;acronym title=&quot;[node_a_id, node_b_id, &#39;a&#39;, salt].join(&quot;:&quot;&quot;&gt;&lt;span class=&quot;caps&quot;&gt;MD5&lt;/span&gt;&lt;/acronym&gt;) % (deg_a * g) = 0) ||
- (&lt;acronym title=&quot;[node_a_id, node_b_id, &#39;b&#39;, salt].join(&quot;:&quot;&quot;&gt;&lt;span class=&quot;caps&quot;&gt;MD5&lt;/span&gt;&lt;/acronym&gt;) % (deg_b * g) = 0) )
- &lt;p&gt;(I introduced &#8216;a&#8217; and &#8216;b&#8217; as extra salts: deg_a and deg_b may be correlated, but&lt;br /&gt;
- we need the two trials to be independent)&lt;/p&gt;
- &lt;ul&gt;
- &lt;li&gt;I think we&#8217;ve rid the earth of non-threadsafe RNGs, but many &lt;span class=&quot;caps&quot;&gt;UUID&lt;/span&gt; implementations (incl. Java&#8217;s, I think) require a global lock.&lt;/li&gt;
- &lt;li&gt;(I just made a cryptographer cry, so let me disclaim that you should really be using something-something-&lt;span class=&quot;caps&quot;&gt;HMAC&lt;/span&gt;-whatever i.e. if you care that this mixing is cryptographically strong take some time and look it up.)
- &lt;ul&gt;
- &lt;li&gt;make sure to join with a character that can&#8217;t appear in the key (here, &#8216;:&#8217;). Without the separator, key 12 in job 34 and key 123 in job 4 would hash identically.
- &lt;ul&gt;
- &lt;li&gt;These are available as environment variables if you&#8217;re streaming
- &lt;ul&gt;
- &lt;li&gt;Note that you always need to use the &lt;strong&gt;least&lt;/strong&gt; significant bytes because of Benford&#8217;s law&lt;/li&gt;
- &lt;/ul&gt;&lt;/li&gt;
- &lt;/ul&gt;&lt;/li&gt;
- &lt;/ul&gt;&lt;/li&gt;
- &lt;/ul&gt;
+ &lt;pre&gt;&lt;code&gt; ( (MD5([node_a_id, node_b_id, 'a', salt].join(&quot;:&quot;)) % (deg_a * g) = 0) ||
+ (MD5([node_a_id, node_b_id, 'b', salt].join(&quot;:&quot;)) % (deg_b * g) = 0) )&lt;/code&gt;&lt;/pre&gt;
+ &lt;p&gt;(since deg_a and deg_b may be correlated, we perturb it by adding &#8216;a&#8217; and &#8216;b&#8217; as extra salts)&lt;/p&gt;
+ &lt;hr /&gt;
+ &lt;h4 style=&quot;vertical-align:middle;&quot;&gt;Footnotes&lt;/h4&gt;
+ &lt;p class=&quot;footnote&quot; id=&quot;fn1&quot;&gt;&lt;a href=&quot;#fnr1&quot;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt; We seem mostly rid of stupid and/or non-threadsafe RNGs. However, many &lt;span class=&quot;caps&quot;&gt;UUID&lt;/span&gt; implementations (including Java&#8217;s, I think) require a global lock.&lt;/p&gt;
+ &lt;p class=&quot;footnote&quot; id=&quot;fn2&quot;&gt;&lt;a href=&quot;#fnr2&quot;&gt;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt; By claiming an MD5 is good for anything I just made a cryptographer cry. So let me hurry to disclaim that you should really be using something-something-&lt;span class=&quot;caps&quot;&gt;HMAC&lt;/span&gt;-whatever. That is &#8212; if you care that this mixing is cryptographically strong, go look up the one that is.&lt;/p&gt;
+ &lt;p class=&quot;footnote&quot; id=&quot;fn3&quot;&gt;&lt;a href=&quot;#fnr3&quot;&gt;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt; Make sure to join with a character that can&#8217;t appear in the key (here, &#8216;:&#8217;). Without the separator, key 12 in job 34 and key 123 in job 4 would hash identically.&lt;/p&gt;
+ &lt;p class=&quot;footnote&quot; id=&quot;fn4&quot;&gt;&lt;a href=&quot;#fnr4&quot;&gt;&lt;sup&gt;4&lt;/sup&gt;&lt;/a&gt; These are available as environment variables if you&#8217;re streaming.&lt;/p&gt;
+ &lt;p class=&quot;footnote&quot; id=&quot;fn5&quot;&gt;&lt;a href=&quot;#fnr5&quot;&gt;&lt;sup&gt;5&lt;/sup&gt;&lt;/a&gt; Note that you always need to use the &lt;strong&gt;least&lt;/strong&gt; significant bytes, because of &lt;a href=&quot;http://en.wikipedia.org/wiki/Benford%27s_law&quot;&gt;Benford&#8217;s law&lt;/a&gt;&lt;/p&gt;
</content>
</entry>
</feed>
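
The connectivity-preserving formula at the end of the feed translates directly into Ruby. A hedged sketch, again on stdlib Digest::MD5; node ids, degrees, f, and salt are assumed inputs, and 'a'/'b' are the extra salts that keep the two per-end trials independent even though deg_a and deg_b may be correlated:

    require 'digest/md5'

    def mixed(key)
      Digest::MD5.hexdigest(key).to_i(16)
    end

    # Keep an edge with probability ~ f * (1/deg_a + 1/deg_b), consistently:
    # with g = 1/f, each end passes its trial with probability 1/(deg * g).
    def keep_edge?(a_id, b_id, deg_a, deg_b, f, salt)
      g = (1.0 / f).round
      (mixed([a_id, b_id, 'a', salt].join(':')) % (deg_a * g)).zero? ||
        (mixed([a_id, b_id, 'b', salt].join(':')) % (deg_b * g)).zero?
    end

    # The low-degree end dominates the survival chance, as the post notes.
    keep_edge?(42, 99, 3, 1_000_000, 0.5, 'run7')  # => true or false, stable per salt
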
6 _site/stylesheets/screen.css
@@ -479,7 +479,7 @@ html body a:visited {
-ms-border-bottom-right-radius: 2px;
-khtml-border-bottom-right-radius: 2px;
border-bottom-right-radius: 2px;
- background: #aaaaaa url('/images/code_bg.png?1283827278') top repeat-x;
+ background: #aaaaaa url('/images/code_bg.png?1283852883') top repeat-x;
position: relative;
margin: 0.3em 0 1.3em;
padding: 0 3px 3px;
@@ -863,7 +863,7 @@ pre.console .stdin {
}
/* line 5, ../../stylesheets/partials/_search.sass */
#search form {
- background: url('/images/search_bg.png?1283827278') no-repeat;
+ background: url('/images/search_bg.png?1283852883') no-repeat;
padding: 0;
height: 28px;
width: 218px;
@@ -1059,7 +1059,7 @@ pre.console .stdin {
#nav ul li.subscribe a {
display: inline-block;
padding-left: 28px;
- background: url('/images/rss.png?1283827278') left top no-repeat;
+ background: url('/images/rss.png?1283852883') left top no-repeat;
}
/* line 32, ../../stylesheets/partials/_navigation.sass */
#nav ul li a {
2  source/atom.haml
@@ -10,7 +10,7 @@ full_url: http://icsblog.heroku.com/
%title= page.blog_title
%link(href="#{page.full_url}/atom.xml" rel="self")
%link(href="#{page.full_url}")
- %updated= Time.now.xmlschema
+ %updated= site.posts.first.date.xmlschema
%id=page.full_url
%author
%name= page.author
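
The one-line atom.haml change is the point of the commit: stamp the feed's <updated> from the newest post rather than the build time, so readers only see a "new" feed when a post actually lands. A small illustration with Ruby's stdlib time library (in this template site.posts.first is the newest post, as the new <updated> value of 2010-09-07 in the diff above confirms):

    require 'time'

    # Time.now changes on every regeneration, so readers saw a "new" feed
    # after each deploy; the newest post's date only moves when a post does.
    newest_post_date = Time.parse('2010-09-07 00:00:00 -05:00')
    puts newest_post_date.xmlschema  # => "2010-09-07T00:00:00-05:00"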