
Created gh-pages branch via GitHub

0 parents commit 95a52584bd03af6a1a8b93b95a269b512df4c441 @matthayes matthayes committed Apr 9, 2012
@@ -0,0 +1,246 @@
+<!DOCTYPE html>
+<html>
+
+ <head>
+ <meta charset='utf-8' />
+ <meta http-equiv="X-UA-Compatible" content="chrome=1" />
+ <meta name="description" content="DataFu : Hadoop library for large-scale data processing" />
+
+ <link rel="stylesheet" type="text/css" media="screen" href="stylesheets/stylesheet.css">
+
+ <title>DataFu</title>
+ </head>
+
+ <body>
+
+ <!-- HEADER -->
+ <div id="header_wrap" class="outer">
+ <header class="inner">
+ <a id="forkme_banner" href="https://github.com/linkedin/datafu">Fork Me on GitHub</a>
+
+ <h1 id="project_title">DataFu</h1>
+ <h2 id="project_tagline">Hadoop library for large-scale data processing</h2>
+
+ <section id="downloads">
+ <a class="zip_download_link" href="https://github.com/linkedin/datafu/zipball/master">Download this project as a .zip file</a>
+ <a class="tar_download_link" href="https://github.com/linkedin/datafu/tarball/master">Download this project as a tar.gz file</a>
+ </section>
+ </header>
+ </div>
+
+ <!-- MAIN CONTENT -->
+ <div id="main_content_wrap" class="outer">
+ <section id="main_content" class="inner">
+ <h1>DataFu</h1>
+
+<p>DataFu is a collection of user-defined functions for working with large-scale data in Hadoop and Pig. The library was born out of the need for a stable, well-tested library of UDFs for data mining and statistics. It is used at LinkedIn in many of our offline workflows for data-derived products such as "People You May Know" and "Skills". It contains functions for:</p>
+
+<ul>
+<li>PageRank</li>
+<li>Quantiles (median), variance, etc.</li>
+<li>Sessionization</li>
+<li>Convenience bag functions (e.g., set operations, enumerating bags, etc.)</li>
+<li>Convenience utility functions (e.g., assertions, easier writing of EvalFuncs)</li>
+<li>and <a href="http://sna-projects.com/datafu/javadoc/0.0.4/">more</a>...</li>
+</ul><p>Each function is unit tested, and code coverage is tracked for the entire library. It has been tested against Pig 0.9.</p>
+
+<p><a href="http://sna-projects.com/datafu/">http://sna-projects.com/datafu/</a></p>
+
+<h2>What can you do with it?</h2>
+
+<p>Here's a taste of what you can do in Pig.</p>
+
+<h3>Statistics</h3>
+
+<p>Compute the <a href="http://en.wikipedia.org/wiki/Median">median</a> of a sorted bag of values:</p>
+
+<pre><code>define Median datafu.pig.stats.Median();
+
+-- input: 3,5,4,1,2
+input = LOAD 'input' AS (val:int);
+
+grouped = GROUP input ALL;
+
+-- produces median of 3
+medians = FOREACH grouped {
+ sorted = ORDER input BY val;
+ GENERATE Median(sorted.val);
+}
+</code></pre>
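For reference, here is what the median computation above does, as a quick plain-Python sketch (not part of DataFu):

```python
# Median of a sorted sequence: the middle element for odd length,
# the average of the two middle elements for even length.
def median(sorted_vals):
    n = len(sorted_vals)
    mid = n // 2
    if n % 2 == 1:
        return sorted_vals[mid]
    return (sorted_vals[mid - 1] + sorted_vals[mid]) / 2.0

print(median(sorted([3, 5, 4, 1, 2])))  # 3
```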
+
+<p>Similarly, compute any arbitrary <a href="http://en.wikipedia.org/wiki/Quantile">quantiles</a>:</p>
+
+<pre><code>define Quantile datafu.pig.stats.Quantile('0.0','0.5','1.0');
+
+-- input: 9,10,2,3,5,8,1,4,6,7
+input = LOAD 'input' AS (val:int);
+
+grouped = GROUP input ALL;
+
+-- produces: (1,5.5,10)
+quantiles = FOREACH grouped {
+ sorted = ORDER input BY val;
+ GENERATE Quantile(sorted.val);
+}
+</code></pre>
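The exact interpolation rule DataFu uses is not spelled out here, but linear interpolation between the closest order statistics reproduces the example output, as this plain-Python sketch shows:

```python
# Quantile via linear interpolation between closest ranks (0 <= q <= 1).
# This is an illustrative sketch, not DataFu's implementation.
def quantile(sorted_vals, q):
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    frac = pos - lo
    if frac == 0:
        return sorted_vals[lo]
    return sorted_vals[lo] + frac * (sorted_vals[lo + 1] - sorted_vals[lo])

vals = sorted([9, 10, 2, 3, 5, 8, 1, 4, 6, 7])
print([quantile(vals, q) for q in (0.0, 0.5, 1.0)])  # [1, 5.5, 10]
```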
+
+<h3>Set Operations</h3>
+
+<p>Treat sorted bags as sets and compute their intersection:</p>
+
+<pre><code>define SetIntersect datafu.pig.bags.sets.SetIntersect();
+
+-- ({(3),(4),(1),(2),(7),(5),(6)},{(0),(5),(10),(1),(4)})
+input = LOAD 'input' AS (B1:bag{T:tuple(val:int)},B2:bag{T:tuple(val:int)});
+
+-- ({(1),(4),(5)})
+intersected = FOREACH input {
+ sorted_b1 = ORDER B1 by val;
+ sorted_b2 = ORDER B2 by val;
+ GENERATE SetIntersect(sorted_b1,sorted_b2);
+}
+</code></pre>
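The reason both bags must be sorted is that the intersection can then be computed in a single merge-style pass. In plain Python the idea looks like this (a sketch, not DataFu's code):

```python
# Merge-style intersection of two sorted lists: advance whichever
# cursor points at the smaller value; emit on a match.
def set_intersect(b1, b2):
    out, i, j = [], 0, 0
    while i < len(b1) and j < len(b2):
        if b1[i] < b2[j]:
            i += 1
        elif b1[i] > b2[j]:
            j += 1
        else:
            out.append(b1[i])
            i += 1
            j += 1
    return out

print(set_intersect([1, 2, 3, 4, 5, 6, 7], [0, 1, 4, 5, 10]))  # [1, 4, 5]
```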
+
+<p>Compute the set union:</p>
+
+<pre><code>define SetUnion datafu.pig.bags.sets.SetUnion();
+
+-- ({(3),(4),(1),(2),(7),(5),(6)},{(0),(5),(10),(1),(4)})
+input = LOAD 'input' AS (B1:bag{T:tuple(val:int)},B2:bag{T:tuple(val:int)});
+
+-- ({(3),(4),(1),(2),(7),(5),(6),(0),(10)})
+unioned = FOREACH input GENERATE SetUnion(B1,B2);
+</code></pre>
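A plain-Python analogy of the union, keeping first-seen order to match the example output above (a sketch, not DataFu's implementation):

```python
# Union of any number of bags, preserving the order in which
# each distinct value is first seen.
def set_union(*bags):
    seen, out = set(), []
    for bag in bags:
        for v in bag:
            if v not in seen:
                seen.add(v)
                out.append(v)
    return out

print(set_union([3, 4, 1, 2, 7, 5, 6], [0, 5, 10, 1, 4]))
# [3, 4, 1, 2, 7, 5, 6, 0, 10]
```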
+
+<p>You can even operate on several bags at once:</p>
+
+<pre><code>unioned = FOREACH input GENERATE SetUnion(B1,B2,B3);
+</code></pre>
+
+<h3>Bag operations</h3>
+
+<p>Concatenate two or more bags:</p>
+
+<pre><code>define BagConcat datafu.pig.bags.BagConcat();
+
+-- ({(1),(2),(3)},{(4),(5)},{(6),(7)})
+input = LOAD 'input' AS (B1: bag{T: tuple(v:INT)}, B2: bag{T: tuple(v:INT)}, B3: bag{T: tuple(v:INT)});
+
+-- ({(1),(2),(3),(4),(5),(6),(7)})
+output = FOREACH input GENERATE BagConcat(B1,B2,B3);
+</code></pre>
+
+<p>Append a tuple to a bag:</p>
+
+<pre><code>define AppendToBag datafu.pig.bags.AppendToBag();
+
+-- ({(1),(2),(3)},(4))
+input = LOAD 'input' AS (B: bag{T: tuple(v:INT)}, T: tuple(v:INT));
+
+-- ({(1),(2),(3),(4)})
+output = FOREACH input GENERATE AppendToBag(B,T);
+</code></pre>
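The two bag operations above behave like ordinary list concatenation and append; a rough Python analogy, with list elements standing in for bag tuples:

```python
from itertools import chain

# BagConcat: concatenate the tuples of several bags into one bag.
def bag_concat(*bags):
    return list(chain(*bags))

# AppendToBag: append a single tuple to the end of a bag.
def append_to_bag(bag, t):
    return bag + [t]

print(bag_concat([1, 2, 3], [4, 5], [6, 7]))  # [1, 2, 3, 4, 5, 6, 7]
print(append_to_bag([1, 2, 3], 4))            # [1, 2, 3, 4]
```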
+
+<h3>PageRank</h3>
+
+<p>Run PageRank on a large number of independent graphs:</p>
+
+<pre><code>define PageRank datafu.pig.linkanalysis.PageRank('dangling_nodes','true');
+
+topic_edges = LOAD 'input_edges' as (topic:INT,source:INT,dest:INT,weight:DOUBLE);
+
+topic_edges_grouped = GROUP topic_edges by (topic, source);
+topic_edges_grouped = FOREACH topic_edges_grouped GENERATE
+ group.topic as topic,
+ group.source as source,
+ topic_edges.(dest,weight) as edges;
+
+topic_edges_grouped_by_topic = GROUP topic_edges_grouped BY topic;
+
+topic_ranks = FOREACH topic_edges_grouped_by_topic GENERATE
+ group as topic,
+ FLATTEN(PageRank(topic_edges_grouped.(source,edges))) as (source,rank);
+
+topic_ranks = FOREACH topic_ranks GENERATE
+ topic, source, rank;
+</code></pre>
+
+<p>This implementation stores the nodes and edges (mostly) in memory. It is therefore best suited when one needs to compute PageRank on many reasonably sized graphs in parallel.</p>
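To make the per-graph computation concrete, here is a generic power-iteration sketch of weighted PageRank in plain Python. This is not DataFu's implementation: edge weights are normalized per source node, and dangling nodes (the subject of DataFu's <code>'dangling_nodes'</code> option) are not handled here.

```python
# edges maps each source node to a list of (dest, weight) pairs.
# Dangling nodes (no out-edges) are assumed absent in this sketch.
def pagerank(edges, damping=0.85, iters=50):
    nodes = set(edges)
    for outs in edges.values():
        nodes.update(dest for dest, _ in outs)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src, outs in edges.items():
            total = sum(w for _, w in outs)  # normalize weights per source
            for dest, w in outs:
                new[dest] += damping * rank[src] * (w / total)
        rank = new
    return rank

# Two-node cycle: both nodes converge to rank 0.5.
print(pagerank({1: [(2, 1.0)], 2: [(1, 1.0)]}))
```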
+
+<h2>Start Using It</h2>
+
+<p>The JAR can be found <a href="http://search.maven.org/#search%7Cga%7C1%7Cg%3A%22com.linkedin.datafu%22">here</a> in the Maven central repository. The GroupId and ArtifactId are <code>com.linkedin.datafu</code> and <code>datafu</code>, respectively.</p>
+
+<p>If you are using Ivy:</p>
+
+<pre><code>&lt;dependency org="com.linkedin.datafu" name="datafu" rev="0.0.4"/&gt;
+</code></pre>
+
+<p>If you are using Maven:</p>
+
+<pre><code>&lt;dependency&gt;
+ &lt;groupId&gt;com.linkedin.datafu&lt;/groupId&gt;
+ &lt;artifactId&gt;datafu&lt;/artifactId&gt;
+ &lt;version&gt;0.0.4&lt;/version&gt;
+&lt;/dependency&gt;
+</code></pre>
+
+<p>Or you can download one of the packages from the <a href="https://github.com/linkedin/datafu/downloads">downloads</a> section. </p>
+
+<h2>Working with the source code</h2>
+
+<p>Here are some common tasks when working with the source code.</p>
+
+<h3>Build the JAR</h3>
+
+<pre><code>ant jar
+</code></pre>
+
+<h3>Run all tests</h3>
+
+<pre><code>ant test
+</code></pre>
+
+<h3>Run specific tests</h3>
+
+<p>Override <code>testclasses.pattern</code>, which defaults to <code>**/*.class</code>. For example, to run all tests defined in <code>QuantileTests</code>:</p>
+
+<pre><code>ant test -Dtestclasses.pattern=**/QuantileTests.class
+</code></pre>
+
+<h3>Compute code coverage</h3>
+
+<pre><code>ant coverage
+</code></pre>
+
+<h2>Contribute</h2>
+
+<p>The source code is available under the Apache 2.0 license. </p>
+
+<p>For help, please see the <a href="http://groups.google.com/group/datafu">discussion group</a>. Bugs and feature requests can be filed <a href="http://linkedin.jira.com/browse/DATAFU">here</a>.</p>
+ </section>
+ </div>
+
+ <!-- FOOTER -->
+ <div id="footer_wrap" class="outer">
+ <footer class="inner">
+ <p class="copyright">DataFu maintained by <a href="https://github.com/linkedin">linkedin</a></p>
+ <p>Published with <a href="http://pages.github.com">GitHub Pages</a></p>
+ </footer>
+ </div>
+
+ <script type="text/javascript">
+ var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
+ document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
+ </script>
+ <script type="text/javascript">
+ try {
+ var pageTracker = _gat._getTracker("UA-30533336-1");
+ pageTracker._trackPageview();
+ } catch(err) {}
+ </script>
+
+
+ </body>
+</html>
@@ -0,0 +1 @@
+console.log('This would be the main JS file.');
@@ -0,0 +1,70 @@
+.highlight .hll { background-color: #ffffcc }
+.highlight { background: #f0f3f3; }
+.highlight .c { color: #0099FF; font-style: italic } /* Comment */
+.highlight .err { color: #AA0000; background-color: #FFAAAA } /* Error */
+.highlight .k { color: #006699; font-weight: bold } /* Keyword */
+.highlight .o { color: #555555 } /* Operator */
+.highlight .cm { color: #0099FF; font-style: italic } /* Comment.Multiline */
+.highlight .cp { color: #009999 } /* Comment.Preproc */
+.highlight .c1 { color: #0099FF; font-style: italic } /* Comment.Single */
+.highlight .cs { color: #0099FF; font-weight: bold; font-style: italic } /* Comment.Special */
+.highlight .gd { background-color: #FFCCCC; border: 1px solid #CC0000 } /* Generic.Deleted */
+.highlight .ge { font-style: italic } /* Generic.Emph */
+.highlight .gr { color: #FF0000 } /* Generic.Error */
+.highlight .gh { color: #003300; font-weight: bold } /* Generic.Heading */
+.highlight .gi { background-color: #CCFFCC; border: 1px solid #00CC00 } /* Generic.Inserted */
+.highlight .go { color: #AAAAAA } /* Generic.Output */
+.highlight .gp { color: #000099; font-weight: bold } /* Generic.Prompt */
+.highlight .gs { font-weight: bold } /* Generic.Strong */
+.highlight .gu { color: #003300; font-weight: bold } /* Generic.Subheading */
+.highlight .gt { color: #99CC66 } /* Generic.Traceback */
+.highlight .kc { color: #006699; font-weight: bold } /* Keyword.Constant */
+.highlight .kd { color: #006699; font-weight: bold } /* Keyword.Declaration */
+.highlight .kn { color: #006699; font-weight: bold } /* Keyword.Namespace */
+.highlight .kp { color: #006699 } /* Keyword.Pseudo */
+.highlight .kr { color: #006699; font-weight: bold } /* Keyword.Reserved */
+.highlight .kt { color: #007788; font-weight: bold } /* Keyword.Type */
+.highlight .m { color: #FF6600 } /* Literal.Number */
+.highlight .s { color: #CC3300 } /* Literal.String */
+.highlight .na { color: #330099 } /* Name.Attribute */
+.highlight .nb { color: #336666 } /* Name.Builtin */
+.highlight .nc { color: #00AA88; font-weight: bold } /* Name.Class */
+.highlight .no { color: #336600 } /* Name.Constant */
+.highlight .nd { color: #9999FF } /* Name.Decorator */
+.highlight .ni { color: #999999; font-weight: bold } /* Name.Entity */
+.highlight .ne { color: #CC0000; font-weight: bold } /* Name.Exception */
+.highlight .nf { color: #CC00FF } /* Name.Function */
+.highlight .nl { color: #9999FF } /* Name.Label */
+.highlight .nn { color: #00CCFF; font-weight: bold } /* Name.Namespace */
+.highlight .nt { color: #330099; font-weight: bold } /* Name.Tag */
+.highlight .nv { color: #003333 } /* Name.Variable */
+.highlight .ow { color: #000000; font-weight: bold } /* Operator.Word */
+.highlight .w { color: #bbbbbb } /* Text.Whitespace */
+.highlight .mf { color: #FF6600 } /* Literal.Number.Float */
+.highlight .mh { color: #FF6600 } /* Literal.Number.Hex */
+.highlight .mi { color: #FF6600 } /* Literal.Number.Integer */
+.highlight .mo { color: #FF6600 } /* Literal.Number.Oct */
+.highlight .sb { color: #CC3300 } /* Literal.String.Backtick */
+.highlight .sc { color: #CC3300 } /* Literal.String.Char */
+.highlight .sd { color: #CC3300; font-style: italic } /* Literal.String.Doc */
+.highlight .s2 { color: #CC3300 } /* Literal.String.Double */
+.highlight .se { color: #CC3300; font-weight: bold } /* Literal.String.Escape */
+.highlight .sh { color: #CC3300 } /* Literal.String.Heredoc */
+.highlight .si { color: #AA0000 } /* Literal.String.Interpol */
+.highlight .sx { color: #CC3300 } /* Literal.String.Other */
+.highlight .sr { color: #33AAAA } /* Literal.String.Regex */
+.highlight .s1 { color: #CC3300 } /* Literal.String.Single */
+.highlight .ss { color: #FFCC33 } /* Literal.String.Symbol */
+.highlight .bp { color: #336666 } /* Name.Builtin.Pseudo */
+.highlight .vc { color: #003333 } /* Name.Variable.Class */
+.highlight .vg { color: #003333 } /* Name.Variable.Global */
+.highlight .vi { color: #003333 } /* Name.Variable.Instance */
+.highlight .il { color: #FF6600 } /* Literal.Number.Integer.Long */
+
+.type-csharp .highlight .k { color: #0000FF }
+.type-csharp .highlight .kt { color: #0000FF }
+.type-csharp .highlight .nf { color: #000000; font-weight: normal }
+.type-csharp .highlight .nc { color: #2B91AF }
+.type-csharp .highlight .nn { color: #000000 }
+.type-csharp .highlight .s { color: #A31515 }
+.type-csharp .highlight .sc { color: #A31515 }
