
Fixed site feed, re-rendered

1 parent 9d0a8fe commit ab39464038554aa44e5d1420d933269a0d2f158d Philip (flip) Kromer committed Sep 7, 2010
Showing with 47 additions and 216 deletions.
  1. +0 −117 _site/atom.html
  2. +43 −95 _site/atom.xml
  3. +1 −1 _site/index.html
  4. +3 −3 _site/stylesheets/screen.css
@@ -1,117 +0,0 @@
-<?xml version="1.0" encoding="utf-8" ?>
-<feed xmlns="">
- <title>Infochimps Developers Blog: Big Data, Hadoop, Cassandra, Chef, Ruby, Rails and more.</title>
- <link href="" rel="self" />
- <link href="" />
- <updated>2010-09-07T04:16:05-05:00</updated>
- <id></id>
- <author>
- <name>Infochimps Dev Team</name>
- <email></email>
- </author>
- <entry>
- <title>Firsties</title>
- <link href="" />
- <updated>2010-09-06T00:00:00-05:00</updated>
- <id></id>
- <content type="html">
- &lt;h3 style=&quot;color:red;&quot;&gt;First Post wooooo!!!&lt;/h3&gt;
- &lt;p&gt;Welcome to the new Infochimps dev blog. As opposed to the long-form thoughtful stuff you&#8217;ll find over at our &lt;a href=&quot;;&gt;main blog&lt;/a&gt; this will be pure geek thoughtstream. Posts might be one line or hundreds, and might contain only formulas or only code.&lt;/p&gt;
- &lt;h4&gt;A word about this blog.&lt;/h4&gt;
- &lt;p&gt;We&#8217;re using the &lt;a href=&quot;;&gt;Octopress framework&lt;/a&gt; for &lt;a href=&quot;;&gt;Jekyll.&lt;/a&gt; Since octopress required some extinct fork of jekyll to render &lt;span class=&quot;caps&quot;&gt;HAML&lt;/span&gt;, we did &lt;a href=&quot;;&gt;horrible, horrible monkey things&lt;/a&gt; to make it render &lt;span class=&quot;caps&quot;&gt;HAML&lt;/span&gt; text and layouts, but not require a special fork of Jekyll.&lt;/p&gt;
- &lt;p&gt;We also use a tiny little sinatra shim inspired by Jesse Storimer&#8217;s &lt;a href=&quot;;&gt;Jekyll on Heroku&lt;/a&gt; post. We added &lt;a href=&quot;;&gt;two tweaks:&lt;/a&gt; one is to allow no-extension permalinks (redirects &lt;code&gt;/2010/09/foo&lt;/code&gt; to &lt;code&gt;/2010/09/foo/index.html&lt;/code&gt;), the other is to render the custom &lt;a href=&quot;;&gt;/404.html&lt;/a&gt; page.&lt;/p&gt;
- &lt;p&gt;Get your own copy here:&lt;/p&gt;
- &lt;ul&gt;
- &lt;li&gt;&lt;a href=&quot;;&gt;Infochimps Blog Source Code&lt;/a&gt;&lt;/li&gt;
- &lt;/ul&gt;
- &lt;p&gt;Posts are composed in &lt;a href=&quot;;&gt;Textile&lt;/a&gt; using Emacs (except for Jesse, who has some insane Dvorak-inverted notation textmate retroclone thing going).&lt;/p&gt;
- </content>
- </entry>
- <entry>
- <title>Scalable Sampling</title>
- <link href="" />
- <updated>2010-09-07T00:00:00-05:00</updated>
- <id></id>
- <content type="html">
- &lt;h3&gt;Sampling and Random Numbers&lt;/h3&gt;
- &lt;p&gt;Found a really good caveat about using random numbers in a distributed system at the &lt;a href=&quot;;&gt;rapleaf blog.&lt;/a&gt; It&#8217;s subtle, so I&#8217;ll let you go read it there.&lt;/p&gt;
- &lt;p&gt;Before you even get to such advanced mis-uses of random numbers&lt;sup class=&quot;footnote&quot; id=&quot;fnr1&quot;&gt;&lt;a href=&quot;#fn1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;, be sure you should be using them in the first place. People often reach for a &lt;strong&gt;random&lt;/strong&gt; mapping when what they really want is a &lt;strong&gt;well-mixed&lt;/strong&gt; mapping: a function such that similar but distinguishable objects will receive arbitrarily different outcomes. The MD5 hash is an easy way to do this.&lt;sup class=&quot;footnote&quot; id=&quot;fnr2&quot;&gt;&lt;a href=&quot;#fn2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
- &lt;h4&gt;Consistent Shuffling&lt;/h4&gt;
- &lt;p&gt;For example, you can shuffle a set of records by taking the MD5 hash of its primary key. The mixing is &#8220;consistent&#8221;: every run yields the same mixing. If you&#8217;d like it to &lt;strong&gt;not&lt;/strong&gt; remain the same, use a salt&lt;sup class=&quot;footnote&quot; id=&quot;fnr3&quot;&gt;&lt;a href=&quot;#fn3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;
- &lt;pre&gt;&lt;code&gt; MD5( [key, salt].join(&quot;:&quot;) )&lt;/code&gt;&lt;/pre&gt;
- &lt;p&gt;Runs with the same salt and data will receive the same mixing. &lt;em&gt;Good salts&lt;/em&gt;: if you use the job_id as salt, different runs will give different shuffles, but within each such run, identical input keys are shuffled identically. If you use the source filename + split boundary + a running counter&lt;sup class=&quot;footnote&quot; id=&quot;fnr4&quot;&gt;&lt;a href=&quot;#fn4&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;, each record will be mixed arbitrarily, but in a way that&#8217;s predictable across runs. &lt;em&gt;Bad salts&lt;/em&gt;: random numbers, timestamps and the hostname + &lt;span class=&quot;caps&quot;&gt;PID&lt;/span&gt; are bad salts, for &lt;a href=&quot;;&gt;the reasons given in the rapleaf post.&lt;/a&gt;&lt;/p&gt;
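The consistent salted shuffle described above can be sketched in a few lines of Ruby. This is a hypothetical illustration (the helper name and record shape are invented, not from the post): sorting records by the MD5 of `key:salt` yields the same mixing on every run with the same salt.

```ruby
require 'digest/md5'

# Hypothetical sketch: shuffle records consistently by sorting on the
# MD5 hash of "key:salt". Same data + same salt => same ordering, every run.
def consistent_shuffle(records, salt)
  records.sort_by { |rec| Digest::MD5.hexdigest([rec[:key], salt].join(":")) }
end

records = [{ key: "alice" }, { key: "bob" }, { key: "carol" }]
consistent_shuffle(records, "job-42")  # deterministic order for this salt
```

Swapping in a different salt (say, a different job_id) gives a different but equally deterministic ordering.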
- &lt;h4&gt;Sampling&lt;/h4&gt;
- &lt;p&gt;To take a 1/n sample from a set of records, take the MD5 hash and emit only records which are zero modulo n. If you have arbitrarily-assigned numeric primary keys you can just modulo n them directly, as long as n is large. In both cases note that you can&#8217;t subsample using this trick.&lt;/p&gt;
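The 1/n sampling trick above can be sketched as follows; the predicate name is hypothetical, and the hex digest is read as a big integer before taking the modulus.

```ruby
require 'digest/md5'

# Hypothetical sketch: keep a record iff the MD5 of its key is 0 modulo n.
def in_sample?(key, n)
  Digest::MD5.hexdigest(key.to_s).to_i(16) % n == 0
end

keys   = (1..10_000).map { |i| "user-#{i}" }
sample = keys.select { |k| in_sample?(k, 100) }
# sample.size will be roughly 100 (about 1/100 of 10,000)
```

Because the hash is consistent, re-running the job selects exactly the same records, which is what makes the cross-table trick in the next section work.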
- &lt;h3&gt;Uniform-All Sample&lt;/h3&gt;
- &lt;p&gt;Here&#8217;s the wrong way to sample three related tables:&lt;/p&gt;
- &lt;ul&gt;
- &lt;li&gt;Sample 1/100 users&lt;/li&gt;
- &lt;li&gt;Sample 1/100 products&lt;/li&gt;
- &lt;li&gt;Sample 1/100 transactions&lt;/li&gt;
- &lt;/ul&gt;
- &lt;p&gt;The problem is that none of them will join: for most transactions, you won&#8217;t be able to look up the buyers, sellers or products.&lt;sup class=&quot;footnote&quot; id=&quot;fnr5&quot;&gt;&lt;a href=&quot;#fn5&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
- &lt;hr /&gt;
- &lt;h3 style=&quot;vertical-align:middle;&quot;&gt;Uniform plus Edges (Global-feature preserving) Sample&lt;/h3&gt;
- &lt;p&gt;This is better:&lt;/p&gt;
- &lt;ul&gt;
- &lt;li&gt;Take all users whose ids hash correctly (n1)&lt;/li&gt;
- &lt;li&gt;Do a join of the transactions with n1&lt;/li&gt;
- &lt;li&gt;Do some joins to get relationships with a user from n1 on the left (and/or) right&lt;/li&gt;
- &lt;/ul&gt;
- &lt;p&gt;However, it&#8217;s computationally harder than doing straight samples of each. The consistent hash answers that problem: just use the same hash on the &lt;strong&gt;foreign key&lt;/strong&gt; (in this case, the user_id):&lt;/p&gt;
- &lt;ul&gt;
- &lt;li&gt;Take all users whose ids hash correctly&lt;/li&gt;
- &lt;li&gt;Take all products whose seller_id hashes correctly&lt;/li&gt;
- &lt;li&gt;Take all transactions whose buyer_id (and/or) seller_id hashes correctly&lt;/li&gt;
- &lt;/ul&gt;
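The foreign-key trick in the list above can be sketched like so. The table layouts and helper name are invented for illustration: because every table is filtered by hashing the same user id, the sampled tables still join.

```ruby
require 'digest/md5'

# Hypothetical sketch: sample all three tables by hashing the *user id*,
# whether it appears as a primary key or a foreign key, so samples join.
N = 10  # keep roughly 1/10 of users

def keep_user?(user_id)
  Digest::MD5.hexdigest(user_id.to_s).to_i(16) % N == 0
end

users    = (1..1000).map { |i| { id: i } }
products = (1..1000).map { |i| { id: i, seller_id: rand(1..1000) } }
txns     = (1..1000).map { |i| { id: i, buyer_id:  rand(1..1000) } }

sampled_users    = users.select    { |r| keep_user?(r[:id]) }
sampled_products = products.select { |r| keep_user?(r[:seller_id]) }
sampled_txns     = txns.select     { |r| keep_user?(r[:buyer_id]) }
# Every sampled product's seller and every sampled transaction's buyer
# is guaranteed to be present in sampled_users.
```

No join is needed to build the sample itself; each table is filtered independently in one pass.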
- &lt;p&gt;This gives you a very efficient uniform sample. If 4% of your buyers are from Florida, about 4% of the sampled users should be too, and about 4% of the transactions will be from Floridians. (&lt;a href=&quot;;&gt;Don&#8217;t get careless,&lt;/a&gt; though)&lt;/p&gt;
- &lt;p&gt;Some caveats. You don&#8217;t have good control over the sample fraction: your transactions probably obey a long-tail distribution (a few users account for a disproportionate number of transactions), which introduces high variance for the quantity recovered.&lt;/p&gt;
- &lt;p&gt;The sample is also sparse, which can make analysis hard in some contexts. If you sample 1% of buyers, a product with 100 purchases will in general retain 1 buyer. You can&#8217;t test an algorithm that looks for similar products, or measures reputation flow.&lt;/p&gt;
- &lt;h3&gt;Subuniverse (Local-structure preserving) Sample&lt;/h3&gt;
- &lt;p&gt;To do a &#8216;subuniverse&#8217; sample, find a handle for some connected neighborhood of the graph &#8212; say, &#8220;sellers of quilts&#8221;.&lt;/p&gt;
- &lt;ul&gt;
- &lt;li&gt;Take all users who match the handle (n0)&lt;/li&gt;
- &lt;li&gt;Broaden n0 along all relevant connections: buy- or sell-transactions, sellers of products sold by people in n0, etc. Call this n1_all.&lt;/li&gt;
- &lt;li&gt;Prune n1_all: eliminate entities with very few or very weak ties to n0, and call this n1.&lt;/li&gt;
- &lt;li&gt;Do a join of n1 on the products&#8217;s seller_id. (This requires a join, but since n1 is &#8216;only&#8217; a few million rows, you can do a fairly efficient map-side (aka fragment-replicate) join)&lt;/li&gt;
- &lt;li&gt;Do some joins of n1 on the transactions, keeping those with a member of n1 on the left (and/or) right.&lt;/li&gt;
- &lt;/ul&gt;
- &lt;p&gt;You want highly similar features in n0, or n1 will get too large. &#8220;People from Denver&#8221; would be a bad handle for a shopping site, a decent handle for a fantasy football site.&lt;/p&gt;
- &lt;p&gt;Here&#8217;s the same thing for our favorite network graph:&lt;/p&gt;
- &lt;ul&gt;
- &lt;li&gt;Take all users who match the handle (n0)&lt;/li&gt;
- &lt;li&gt;Broaden n0 along all relevant connections: for example atsign, follow, topic usage, etc &#8211; call this n1_all&lt;/li&gt;
- &lt;li&gt;Prune n1_all: eliminate users with very few or very weak ties to n0, and call this n1.&lt;/li&gt;
- &lt;li&gt;Do a join of n1 on the tweets. (Note that, since n1 is &#8216;only&#8217; a few million rows, you can do a map-side aka fragment-replicate join, which is actually quite efficient)&lt;/li&gt;
- &lt;li&gt;Do some joins of n1 to get relationships with a sample user on the left (and/or) right&lt;/li&gt;
- &lt;/ul&gt;
- &lt;p&gt;For example, the subuniverse we typically work with is &#8220;users who have mentioned @infochimps, hadoop, opendata, or bigdata&#8221;. We chose this handle for a few reasons (besides the obvious &#8220;we are big dorks&#8221; one). Since we at infochimps land in there, it&#8217;s easy to inspect the results of an experiment against a familiar object (ourselves). It also gives very correlated edges: many such people also follow each other, use other similar terms, etc. Without this correlation, we&#8217;d span too much of the graph.&lt;/p&gt;
- &lt;p&gt;Within the subuniverse, we can happily do joins, calculate trstrank, and examine local community structure.&lt;/p&gt;
- &lt;p&gt;Of course, the sample is heavily skewed by its handle. There&#8217;s the obvious way: among people who mention &#8216;hadoop&#8217;, conference planning is easy, dating is unfortunately hard. More importantly, no matter what handle you use the subuniverse will be heavily biased towards the &#8216;core&#8217; of the graph:&lt;/p&gt;
- &lt;ul&gt;
- &lt;li&gt;twitter users with millions of followers are going to land in almost any given subuniverse&lt;/li&gt;
- &lt;li&gt;the trstrank of any given subuniverse is going to be vastly higher than the whole graph average&lt;/li&gt;
- &lt;li&gt;Since real-world dynamic graphs typically densify over time (more roads are built, you follow more people on twitter), a subuniverse sample will have disproportionately few recent nodes.&lt;/li&gt;
- &lt;/ul&gt;
- &lt;h3&gt;Connectivity-preserving Sample&lt;/h3&gt;
- &lt;p&gt;There&#8217;s one other type of sample you might like to do: one that preserves the global connectivity of edges.&lt;/p&gt;
- &lt;ul&gt;
- &lt;li&gt;For each edge, record the total degrees of the nodes at its ends (deg_a, deg_b).&lt;/li&gt;
- &lt;li&gt;Stream through all the edges and with a probability of ( f * ( 1/deg_a + 1/deg_b )), keep the edge.&lt;/li&gt;
- &lt;/ul&gt;
- &lt;p&gt;(The parameter f adjusts the fraction of edges sampled.) In this equation, a node with one inbound link has a high chance of survival. On average, each node will have f inbound and f outbound links survive.&lt;/p&gt;
- &lt;p&gt;This also in general retains all nodes: with f ~ 0.5, a 1B-edge graph on 100m nodes will come out with about 100m edges and 100m nodes. You&#8217;ll have to turn f down pretty far for a significant number of nodes to start failing the binomial trial at each end.&lt;/p&gt;
- &lt;p&gt;To do this consistently, set g = 1/f and do&lt;/p&gt;
- &lt;pre&gt;&lt;code&gt; ( (MD5([node_a_id, node_b_id, 'a', salt].join(&quot;:&quot;)) % (deg_a * g) = 0) ||
- (MD5([node_a_id, node_b_id, 'b', salt].join(&quot;:&quot;)) % (deg_b * g) = 0) )&lt;/code&gt;&lt;/pre&gt;
- &lt;p&gt;(since deg_a and deg_b may be correlated, we perturb the two trials by adding &#8216;a&#8217; and &#8216;b&#8217; as extra salts)&lt;/p&gt;
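The consistent edge-sampling test above can be sketched in Ruby. This is a hypothetical rendering of the post's pseudocode (the function name is invented, and g = 1/f must be an integer here): an edge survives if either end's salted hash is 0 modulo (degree * g).

```ruby
require 'digest/md5'

# Hypothetical sketch of the consistent connectivity-preserving sample:
# keep the edge if either end-specific salted hash is 0 mod (degree * g),
# where g = 1/f. The 'a'/'b' tags decorrelate the two trials.
def keep_edge?(a_id, b_id, deg_a, deg_b, g, salt)
  trial = lambda do |tag, deg|
    Digest::MD5.hexdigest([a_id, b_id, tag, salt].join(":")).to_i(16) % (deg * g) == 0
  end
  trial.call('a', deg_a) || trial.call('b', deg_b)
end
```

A node of degree 1 with g = 1 always keeps its edge (any integer is 0 mod 1), matching the claim that low-degree nodes have a high chance of survival.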
- &lt;hr /&gt;
- &lt;h4 style=&quot;vertical-align:middle;&quot;&gt;Footnotes&lt;/h4&gt;
- &lt;p class=&quot;footnote&quot; id=&quot;fn1&quot;&gt;&lt;a href=&quot;#fnr1&quot;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt; We seem mostly rid of stupid and/or non-threadsafe RNGs. However, many &lt;span class=&quot;caps&quot;&gt;UUID&lt;/span&gt; implementations (including Java&#8217;s, I think) require a global lock.&lt;/p&gt;
- &lt;p class=&quot;footnote&quot; id=&quot;fn2&quot;&gt;&lt;a href=&quot;#fnr2&quot;&gt;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt; By claiming an MD5 is good for anything I just made a cryptographer cry. So let me hurry to disclaim that you should really be using something-something-&lt;span class=&quot;caps&quot;&gt;HMAC&lt;/span&gt;-whatever. That is &#8212; if you care that this mixing is cryptographically strong, go look up the one that is.&lt;/p&gt;
- &lt;p class=&quot;footnote&quot; id=&quot;fn3&quot;&gt;&lt;a href=&quot;#fnr3&quot;&gt;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt; Make sure to join with a character that can&#8217;t appear in the key (here, &#8216;:&#8217;). Without the separator, key 12 in job 34 and key 123 in job 4 would hash identically.&lt;/p&gt;
- &lt;p class=&quot;footnote&quot; id=&quot;fn4&quot;&gt;&lt;a href=&quot;#fnr4&quot;&gt;&lt;sup&gt;4&lt;/sup&gt;&lt;/a&gt; These are available as environment variables if you&#8217;re streaming&lt;/p&gt;
- &lt;p class=&quot;footnote&quot; id=&quot;fn5&quot;&gt;&lt;a href=&quot;#fnr5&quot;&gt;&lt;sup&gt;5&lt;/sup&gt;&lt;/a&gt; Note that you always need to use the &lt;strong&gt;least&lt;/strong&gt; significant bytes, because of &lt;a href=&quot;;&gt;Benford&#8217;s law&lt;/a&gt;&lt;/p&gt;
- </content>
- </entry>