Perl and Big Data

Speaker: Martin Holste (he previously spoke at Madison.PM about Sphinx)

Signs it is getting big

  • Distributed insert/select too slow for workload
  • Too many records to iterate over
  • You can't control how much data you're getting
    • 52 billion rows

ELSA goals

  • Centralized logging for unlimited data
  • Ad-hoc Google-fast searches
  • Splunk was too expensive

Challenge: Input

  • Couldn't even receive syslog traditionally
  • Regex too slow, does not scale

Solution: PatternDB

  • Syslog-NG with PatternDB
  • Uses pattern matching state engine
  • Parses 100k+ messages/sec into field/value pairs
  • Robust test framework
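
To illustrate why a pattern state engine can scale better than trying regexes one after another: the message is tokenized once and walked through a tree of literal and typed tokens, so lookup cost depends roughly on message length rather than on how many patterns are loaded. The following is a toy Python sketch of that idea only. It is not syslog-ng's PatternDB implementation or its real pattern syntax; the `@KIND:field@` placeholder form, the `PatternTrie` class, and the `ssh_fail` pattern are all hypothetical.

```python
import ipaddress


def _is_ipv4(token):
    try:
        ipaddress.IPv4Address(token)
        return True
    except ValueError:
        return False


# Hypothetical typed placeholders; each validates one whitespace-separated token.
PARSERS = {
    "NUMBER": str.isdigit,
    "IPV4": _is_ipv4,
    "WORD": lambda token: True,
}


def _new_node():
    return {"literal": {}, "typed": [], "name": None}


class PatternTrie:
    """Toy token trie: literal tokens plus typed @KIND:field@ placeholders."""

    def __init__(self):
        self.root = _new_node()

    def add(self, name, pattern):
        node = self.root
        for token in pattern.split():
            if len(token) > 2 and token.startswith("@") and token.endswith("@"):
                kind, field = token[1:-1].split(":")
                # Reuse an existing typed branch if one matches.
                for known_kind, known_field, child in node["typed"]:
                    if (known_kind, known_field) == (kind, field):
                        node = child
                        break
                else:
                    child = _new_node()
                    node["typed"].append((kind, field, child))
                    node = child
            else:
                if token not in node["literal"]:
                    node["literal"][token] = _new_node()
                node = node["literal"][token]
        node["name"] = name

    def match(self, message):
        """Return (pattern_name, fields) or None if nothing matches."""
        return self._walk(self.root, message.split(), 0, {})

    def _walk(self, node, tokens, i, fields):
        if i == len(tokens):
            return (node["name"], fields) if node["name"] is not None else None
        token = tokens[i]
        # Literal tokens are a single dict lookup; try them first.
        child = node["literal"].get(token)
        if child is not None:
            hit = self._walk(child, tokens, i + 1, fields)
            if hit is not None:
                return hit
        # Then try typed placeholders, capturing the token as a field.
        for kind, field, child in node["typed"]:
            if PARSERS[kind](token):
                hit = self._walk(child, tokens, i + 1, {**fields, field: token})
                if hit is not None:
                    return hit
        return None


trie = PatternTrie()
trie.add("ssh_fail",
         "Failed password for @WORD:user@ from @IPV4:srcip@ port @NUMBER:port@")
result = trie.match("Failed password for root from 10.0.0.5 port 22")
```

Here `result` holds the matched pattern name together with the extracted `user`, `srcip`, and `port` fields, which is the "parse into field/value pairs" step in miniature.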

Challenge: Insert into DB

  • Basic INSERT is too slow
  • Traditional DBMS tops out at around 5k/sec
  • All DBMSes fairly equal (MS, Oracle, Pg, MySQL)

Solution: Batch load

  • LOAD DATA (bulk loading) on any DBMS reaches 100k+/sec
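
A minimal sketch of the batch-load idea, in Python: parsed rows are buffered into a tab-separated file, and one bulk statement loads the whole file instead of issuing one INSERT per row. The table name `syslogs`, the column names, and the helper names are hypothetical; the statement text follows MySQL's `LOAD DATA INFILE` syntax, and actually executing it needs a database connection, which is not shown.

```python
import os
import tempfile


def escape_field(value):
    """Escape backslash, tab and newline so a field stays on one TSV line."""
    text = str(value)
    return text.replace("\\", "\\\\").replace("\t", "\\t").replace("\n", "\\n")


def rows_to_tsv(rows):
    """Render an iterable of row tuples as tab-separated lines."""
    return "".join(
        "\t".join(escape_field(v) for v in row) + "\n" for row in rows
    )


def write_batch(rows, directory):
    """Write one batch of rows to a new .tsv file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".tsv", dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8", newline="") as handle:
        handle.write(rows_to_tsv(rows))
    return path


def load_statement(path, table, columns):
    """Build the bulk-load SQL for a batch file (hypothetical table/columns)."""
    return (
        f"LOAD DATA INFILE '{path}' INTO TABLE {table} "
        "FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n' "
        f"({', '.join(columns)})"
    )


batch = [
    (1, "host1", "Failed password for root"),
    (2, "host2", "message with a\ttab"),
]
batch_path = write_batch(batch, tempfile.gettempdir())
sql = load_statement(batch_path, "syslogs", ["id", "host", "msg"])
```

The point of the design is that the per-row cost becomes a file append, and the database pays its per-statement overhead once per batch rather than once per log line.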

Challenge: Indexing

  • Even simple integer indexing was too slow
  • Way too slow for text columns
  • Tried special storage engines (TokuDB)
  • Tried NoSQL: Cassandra, Mongo, Couch, etc
  • All way too slow (less than 5k/sec)

Solution: Sphinx

  • Indexes 100k rows/sec
  • Slurps rows in batches
  • Provides MySQL-compatible DB handle
  • Multi-threaded searches take advantage of cores

Syslog-NG -> MySQL -> Sphinx -> Web

  • Perl glues these together
  • 1.0 used Syslog-NG -> MySQL -> Sphinx -> POE -> middleware -> CGI::Application
  • 2.0 used Syslog-NG -> MySQL -> Sphinx -> Plack/AnyEvent

Some basic modules

  • Search::QueryParser (Google-style)
  • Log::Log4perl
  • CHI (for all caching)
  • Config::JSON
  • Not using DBIx::Class (weird schema) or Sphinx::Search (need async)

Showed big graphs of 1400ms queries over billions of records.


  • Chain searches together with results from one as params for next

    sql_injection groupby:srcip | subsearch(status_code:500)


  • Map = query on index
  • Reduce = run function (plugin) or report on field
  • Recurse as necessary
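
The chaining described above can be sketched with a toy in-memory example in Python: a search plays the "map" role, a group-by plays the "reduce" role, and the grouped values become parameters for the next search. This is an illustration under stated assumptions, not ELSA's implementation; the `LOGS` records, the field names (`sig`, `srcip`, `status_code`), and the function names are all hypothetical.

```python
from collections import Counter

# Hypothetical sample records standing in for an index.
LOGS = [
    {"srcip": "10.0.0.1", "sig": "sql_injection", "status_code": 200},
    {"srcip": "10.0.0.1", "sig": "sql_injection", "status_code": 500},
    {"srcip": "10.0.0.2", "sig": "port_scan", "status_code": 500},
    {"srcip": "10.0.0.3", "sig": "sql_injection", "status_code": 404},
    {"srcip": "10.0.0.1", "sig": "web_request", "status_code": 500},
]


def search(logs, **terms):
    """Map step: return the records whose fields equal all given terms."""
    return [r for r in logs if all(r.get(k) == v for k, v in terms.items())]


def groupby(rows, field):
    """Reduce step: count result rows per distinct value of one field."""
    return Counter(r[field] for r in rows)


def subsearch(logs, field, values, **terms):
    """Chained search: keep only records whose field is in the earlier results."""
    wanted = set(values)
    return [r for r in search(logs, **terms) if r.get(field) in wanted]


# Stage 1: which source IPs triggered the signature, and how often?
attackers = groupby(search(LOGS, sig="sql_injection"), "srcip")

# Stage 2: of those IPs, which also produced server errors?
errors = subsearch(LOGS, "srcip", attackers, status_code=500)
```

Recursing "as necessary" then just means feeding `errors` (or a group-by over it) into another `subsearch` call.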

Our stats

  • Billions of full-text indexed logs per node
  • 1 billion logs per day across nodes
  • Dozens of customers querying
  • Most queries finish in under 100ms
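
As a back-of-the-envelope check, the sustained rate implied by 1 billion logs per day can be compared with the rates quoted earlier in these notes (about 5k/sec for row-at-a-time INSERT, 100k+/sec for bulk load and Sphinx indexing). This Python sketch assumes traffic is spread evenly over the day, which real log traffic is not, so peaks will be higher than the average shown.

```python
LOGS_PER_DAY = 1_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

INSERT_CEILING = 5_000    # row-at-a-time INSERT rate from the notes
BULK_RATE = 100_000       # bulk load / indexing rate from the notes

avg_rate = LOGS_PER_DAY / SECONDS_PER_DAY
print(f"average rate: {avg_rate:,.0f} logs/sec")              # average rate: 11,574 logs/sec
print(f"vs INSERT ceiling: {avg_rate / INSERT_CEILING:.1f}x")  # vs INSERT ceiling: 2.3x
print(f"share of bulk rate: {avg_rate / BULK_RATE:.0%}")       # share of bulk rate: 12%
```

So even the daily average is more than twice what plain INSERTs could absorb, while it sits well inside the quoted bulk-load rate.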