[Streaming] Examples using Twitter's Algebird library #480

MLnick · 2013-02-19T15:59:56Z

This PR adds two examples for streaming that use monoids from Twitter's Algebird library:

HyperLogLog for approximate distinct object counting with low memory overhead
CountMinSketch for approximating object frequency in a stream as well as TopK or "heavy hitters" estimation
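For readers unfamiliar with the sketches named above, here is a minimal Count-Min Sketch in plain Scala. It is an illustrative sketch only, not Algebird's implementation or API: the names `MiniCMS`, `add`, `estimate` and `plus`, and the row hash, are all hypothetical. It shows the two properties the examples rely on: frequency estimates never undercount, and two sketches of the same shape combine by cell-wise addition, which is what makes the structure a monoid.

```scala
// Illustrative only: a tiny Count-Min Sketch over Long keys.
// MiniCMS and its methods are hypothetical names, not Algebird's API.
final case class MiniCMS(depth: Int, width: Int, table: Vector[Vector[Long]]) {

  // Column of `item` in row `row`. Each row uses a different odd multiplier
  // (an arbitrary choice for this sketch, not a tuned hash family).
  private def column(row: Int, item: Long): Int = {
    val multiplier = 0x5DEECE66DL * (2L * row + 1)
    val mixed = item * multiplier + row
    val folded = mixed ^ (mixed >>> 29)
    (((folded % width) + width) % width).toInt
  }

  // Returns a new sketch with `count` added to one cell per row.
  def add(item: Long, count: Long = 1L): MiniCMS = {
    val updated = table.zipWithIndex.map { case (cells, row) =>
      val c = column(row, item)
      cells.updated(c, cells(c) + count)
    }
    copy(table = updated)
  }

  // Minimum over the rows: collisions can only inflate a cell,
  // so the estimate is never below the true count.
  def estimate(item: Long): Long =
    table.zipWithIndex.map { case (cells, row) => cells(column(row, item)) }.min

  // Monoid-style combine: cell-wise sum of two sketches of the same shape.
  def plus(other: MiniCMS): MiniCMS = {
    require(depth == other.depth && width == other.width, "sketch shapes must match")
    copy(table = table.zip(other.table).map { case (a, b) =>
      a.zip(b).map { case (x, y) => x + y }
    })
  }
}

object MiniCMS {
  // The empty sketch acts as the monoid's zero.
  def empty(depth: Int, width: Int): MiniCMS = {
    require(depth > 0 && width > 0, "depth and width must be positive")
    MiniCMS(depth, width, Vector.fill(depth)(Vector.fill(width)(0L)))
  }
}

object MiniCMSDemo {
  def main(args: Array[String]): Unit = {
    val zero = MiniCMS.empty(depth = 4, width = 64)
    // Two "batches" of user ids, sketched independently and then merged.
    val batch1 = Seq(1L, 1L, 2L).foldLeft(zero)((s, id) => s.add(id))
    val batch2 = Seq(1L, 3L).foldLeft(zero)((s, id) => s.add(id))
    val merged = batch1.plus(batch2)
    println("estimated count for id 1: " + merged.estimate(1L))
  }
}
```

Because merging is plain addition, sketching each batch separately and combining the results gives the same table as sketching the whole stream at once.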

See https://groups.google.com/forum/?fromgroups=#!topic/spark-users/4ht9ndVaZQY


MLnick · 2013-02-19T16:03:48Z

examples/src/main/scala/spark/streaming/examples/TwitterAlgebirdCMS.scala

+    val stream = ssc.twitterStream(username, password, filters,
+      StorageLevel.MEMORY_ONLY_SER)
+
+    val users = stream.map(status => status.getUser.getId)


A note about this: Algebird's CMS currently supports only Long inputs. Since it uses hashing under the hood, it should be possible to accept any hashable input, as HyperLogLog does, but that is not supported yet.

So for now this example works on user ids. Running it over relatively short durations will therefore not produce very heavily skewed data, which is where the sketch is most useful. If we could take String inputs, it would be more interesting: we could do TopK on hashtags (for example), which are likely to be a lot more skewed.

This may be an important point that could confuse people. Can you add a line to the comment at the top?
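One possible workaround for the Long-only restriction discussed above, sketched here purely as an assumption and not as something this PR does: hash each String to a Long before feeding it to the sketch, and keep a side map to recover the text. The object `HashtagKeys` and its methods are hypothetical names. The hash is 64-bit FNV-1a over the UTF-8 bytes; distinct strings can in principle collide on the same key, so counts for colliding tags would be merged.

```scala
// Illustrative only: turn String keys into Long keys for a Long-only sketch.
// HashtagKeys, toLongKey and index are hypothetical names.
object HashtagKeys {

  // 64-bit FNV-1a: start from the offset basis, then for each byte
  // XOR it in and multiply by the FNV prime (arithmetic wraps mod 2^64).
  def toLongKey(s: String): Long = {
    var hash = 0xcbf29ce484222325L
    for (b <- s.getBytes("UTF-8")) {
      hash ^= (b & 0xff).toLong
      hash *= 0x100000001b3L
    }
    hash
  }

  // Maps each key back to its original text so TopK results can be displayed.
  def index(tags: Seq[String]): Map[Long, String] =
    tags.map(t => toLongKey(t) -> t).toMap
}

object HashtagKeysDemo {
  def main(args: Array[String]): Unit = {
    val tags = Seq("#spark", "#scala", "#spark")
    val keys = tags.map(HashtagKeys.toLongKey) // these could feed a Long-only sketch
    val names = HashtagKeys.index(tags)
    keys.distinct.foreach(k => println(k + " -> " + names(k)))
  }
}
```

The side map grows with the number of distinct tags, so in a real stream it would need bounding, for example by only retaining names for keys that currently rank as heavy hitters.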

johnynek · 2013-02-20T19:17:29Z

Glad to see this pull request. I hope it helps CMS and HLL make more of an impact.

I agree that the CMS interface is suboptimal now. We are going to update it to support the same approach as HLL (probably in algebird 0.2.0). Let us know if there are any algorithms to add. I'd love to collaborate and share this code in Algebird (which we extracted from scalding).

sritchie · 2013-02-20T21:14:58Z

examples/pom.xml

-      <version>3.0.3</version>
+      <groupId>com.twitter</groupId>
+      <artifactId>algebird-core_2.9.2</artifactId>
+      <version>0.1.8</version>


0.1.9 is out!


MLnick · 2013-02-21T11:55:20Z

@johnynek thanks for the comments! I look forward to 0.2.0 in that case, since CMS with any hashable input will be neat. Also, if I find some time, I'd be happy to try a Scalding version of the example.


tdas · 2013-02-22T20:33:47Z

Thank you very much. This is a great addition.

@mridulm

Handful of 0.9 fixes

This patch addresses a few fixes for Spark 0.9.0 based on the last release candidate. @mridulm gets credit for reporting most of the issues here. Many of the fixes here are based on his work in mesos#477 and follow up discussion with him.

1. Fix SPARK-1441: compile spark core error with hadoop 0.23.x
2. Fix SPARK-1491: maven hadoop-provided profile fails to build
3. Fix org.scala-lang:*, org.apache.avro:* inconsistent versions dependency
4. A modified on the sql/catalyst/pom.xml, sql/hive/pom.xml, sql/core/pom.xml (Four spaces formatted into two spaces)

Author: witgo <witgo@qq.com>

Closes mesos#480 from witgo/format_pom and squashes the following commits:

03f652f [witgo] review commit
b452680 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
bee920d [witgo] revert fix SPARK-1629: Spark Core missing commons-lang dependence
7382a07 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
6902c91 [witgo] fix SPARK-1629: Spark Core missing commons-lang dependence
0da4bc3 [witgo] merge master
d1718ed [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
e345919 [witgo] add avro dependency to yarn-alpha
77fad08 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
62d0862 [witgo] Fix org.scala-lang: * inconsistent versions dependency
1a162d7 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
934f24d [witgo] review commit
cf46edc [witgo] exclude jruby
06e7328 [witgo] Merge branch 'SparkBuild' into format_pom
99464d2 [witgo] fix maven hadoop-provided profile fails to build
0c6c1fc [witgo] Fix compile spark core error with hadoop 0.23.x
6851bec [witgo] Maintain consistent SparkBuild.scala, pom.xml

MLnick added 4 commits February 19, 2013 13:21

Adding streaming HyperLogLog example using Algebird

015893f

Merge remote-tracking branch 'upstream/streaming' into streaming-eg-a…

315ea06

…lgebird Conflicts: project/SparkBuild.scala

Dependencies and refactoring for streaming HLL example, and using con…

d8ee184

…text.twitterStream method

Streaming example using Twitter Algebird's Count Min Sketch monoid

8a28139

MLnick reviewed Feb 19, 2013

sritchie reviewed Feb 20, 2013

MLnick added 3 commits February 21, 2013 09:33

Merge remote-tracking branch 'upstream/streaming' into streaming-eg-a…

16d4567

…lgebird

Bumping Algebird to 0.1.9

718474b

Adding documentation for HLL and CMS examples. More efficient and cle…

d9bdae8

…ar use of the monoids.

tdas added a commit that referenced this pull request Feb 22, 2013

Merge pull request #480 from MLnick/streaming-eg-algebird

cfa65eb

[Streaming] Examples using Twitter's Algebird library

tdas merged commit cfa65eb into mesos:streaming Feb 22, 2013
