Switch branches/tags
Nothing to show
Find file
Fetching contributors…
Cannot retrieve contributors at this time
81 lines (61 sloc) 2.08 KB

Perl and Big Data

Martin Holste (This is the guy who was at Madison.PM talking about Sphinx)

Signs it is getting big

  • Distributed insert/select too slow for work load
  • too many record to iteratoe over
  • You can't control how much data you're getting
    • 52 billion rows

ELSA goals

  • Centralized logging for unlimited data
  • Ad-hoc Google-fast searches
  • Splunk was too expensive

Challenge: Input

  • Couldn't even receive syslog traditionally
  • Regex too slow, does not scale

Solution: PatternDB

  • Syslog-NG with PatternDB
  • Uses pattern matching state engine
  • Parses 100k+/sec into fields/value
  • Robust test framework

Challenge: Insert into DB

  • Basic INSERT is too slow
  • Traditional DBCS tops out at around 5k/sec
  • All DBMSes fairly equal (MS, Oracle, Pg, MySQL)

Solution: Batch load

  • LOAD DATA on any DBMS 100k+/sec

Challenge: Indexing

  • Even simple integer indexing was too slow
  • Way too slow for text columns
  • Tried special strage engines (TokuDB)
  • Tried NoSQL: Cassandra, Mongo, Couch, etc
  • All way too slow (less than 5k/sec)

Solution: Sphinx

  • Indexes 100k rows/sec
  • Slurps rows in batches
  • Provides MySQL-compatible DB handle
  • Multi-threaded searches take advantage of cores

Syslog-NG -> MySql -> Sphinx -> Web

  • Perl glues these together
  • 1.0 used Syslog->NG -> MySQL -> Sphinx -> POE -> m iddlware -> CGI::Application
  • 2.0 used Syslog->NG -> MySQL -> Sphinx -> Plack/AnyEvent

Some basic modules

  • Search::QueryParser (Google-style)
  • Log::Log4perl
  • CHI (for all caching)
  • Config::JSON
  • Not using DBIx::Class (weird schema) or Sphinx::Search (need async)

Showed big graphs of 1400ms queries of billions of records.


  • Chain searches together with results from one as params for next

    sqj_injection groupby:srcip | subsearch(status_code:500)


  • Map = query on index
  • Reduce = run function (plugin) or report on field
  • Recurse as necessary

Our stats

  • Billinos of fulpl-text indexed logs per node
  • 1 billion logs per day across nodes
  • Dozens of customers querying
  • Most queries finish in under 100ms