Large Data and Clojure: the middle ground between RAM and EC2

Beat the DB: Recipes for Large Data Analysis with Clojure

Learn techniques for working with large data sets: those which are too large to
fit in RAM, but not so large they require distributed computing. Take random
samples and find highly duplicated values – without a database import.

I’ll walk you through efficient recipes for: taking random samples in a single
pass, finding highly duplicated values in a single pass using a bloom filter,
and processing large data files in parallel – without preprocessing. These
techniques let you avoid having to import the data sets into your database,
and efficiently perform operations that would incur large memory or IO overhead
in a database.

Though not specific to Clojure, many of the techniques lead to elegant, easy to
understand implementations in Clojure by leveraging its sequence abstraction,
immutability and parallelization features.

Come hear about some new gadgets to add to your data munging utility belt!

I recently learned some new tools and techniques for working with large data
sets – those which are too large to fit in RAM, but not so large you need
distributed computing to work with them. I’ll discuss things like taking
random samples and finding duplicated values, as well as other types of basic
analysis.

How to Win the Random Sample Turkey Shoot

You have been asked to take a random sample of a few hundred million records in
order to estimate some metrics of the data set as a whole.

[ dice and a turkey ]

I Know: the database!

SELECT * FROM some_table ORDER BY RANDOM() LIMIT 1000 [ HUGE red X ]

BZZT! You just wasted time fixing the other guy’s data export errors, then you
wasted time importing the data into your database only to find that the above
query is estimated to take somewhere between two days and the heat death of the
universe to execute on your workstation.

[ a clock face and a DB cylinder ]

I’ll use the ‘GNU Signal!’

[ Bat-signal-like spotlight shining a huge GNU onto clouds? ]

It turns out you can: GNU sort supports a ‘random’ flag:

sort -R

This is faster than the database… but:

Uses disk space on the order of your original data set. [wasteful]

Spends lots of IO computing the output. [ unhappy anthropomorphic database ]

“You Shall Not Pass!” (multiple times)

“Selection without replacement”
Stream the data:
odds are N/M of picking an element
N is your desired sample set size
M is the size of the population you haven’t yet considered

Decrement M after considering each member of the population

Decrement N after you’ve chosen an element.
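
A minimal sketch of that single-pass selection sampling in Clojure. The name
sample-without-replacement is mine, and it assumes you know the population
size (total-line-count below) up front:

(defn sample-without-replacement
  "Select n elements from coll, whose total size is m, in a single pass.
   Each element is kept with probability (remaining needed)/(remaining unseen)."
  [n m coll]
  (lazy-seq
    (when (and (pos? n) (seq coll))
      (if (< (rand) (/ n m))
        (cons (first coll)
              (sample-without-replacement (dec n) (dec m) (rest coll)))
        (sample-without-replacement n (dec m) (rest coll))))))

;; e.g. a 1,000-element sample, with total-line-count known ahead of time:
;; (doall (sample-without-replacement 1000 total-line-count
;;                                    (ds/read-lines "some-large-file.tab")))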

Sit back and bask in the Win.

[ relaxed person? the Reddit meme paint drawing / person? ]

One Machine, One Pass over the file, output on the order of your sample set
size. The very definition of W.I.N. (Why did I Not think of that before?)

That Was My Hook…

I was advised by a good friend of mine, Jonny Tran, to tell you something
valuable in the first few minutes of the talk. I hope that counts…

What do I mean by ‘Large Data’?

When your data set, or the information you need to track to analyze it, fits
into RAM, you can effectively ignore most of these techniques. Finding
duplicated values in an array that already fits in memory is simple: you
iterate through it, keeping a table of the number of times you’ve seen each
element. At worst (no duplicates), you’ll use memory on the order of the size
of the data set, plus the space it takes to store each count (an integer).
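
For contrast, here’s the fits-in-RAM version in Clojure; frequencies builds
exactly that count table:

;; In-memory duplicate counting, only viable when the data fits in RAM.
(->> (frequencies ["a" "b" "a" "c" "a"])
     (filter (fn [[_ cnt]] (> cnt 1)))
     (into {}))
;; => {"a" 3}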

Counting and Analysis

Just Keep a HashMap/Table

[ HUGE red X ]

BZZT!

Remember: too big for RAM, possibly way too big.

Next Utility Belt Gadget: Bloom Filters

What’s a Bloom Filter?

Find likely Dupes in 1 pass

Emit anything that’s a HIT with the Bloom Filter. You’ll get some false positives. This is good enough for many of my use cases.
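
A sketch of that pass. The toy filter below (java.util.BitSet plus seeded
Clojure hashes) and the names bloom-filter, bloom-hit? and likely-dupes are
mine, just to make the idea concrete; a real Bloom filter library slots in the
same way:

(defn bloom-filter [num-bits num-hashes]
  {:bits (java.util.BitSet. num-bits) :n num-bits :k num-hashes})

(defn bloom-hit?
  "True if x has (probably) been seen before; records x either way."
  [{:keys [bits n k]} x]
  (let [idxs (for [seed (range k)] (mod (hash [seed x]) n))
        hit? (every? #(.get bits %) idxs)]
    (doseq [i idxs] (.set bits i))
    hit?))

(defn likely-dupes
  "One pass: emit every line that hits the filter – real dupes plus some false positives."
  [lines]
  (let [bf (bloom-filter (* 64 1024 1024) 4)]   ;; 64M bits, about 8 MB
    (filter #(bloom-hit? bf %) lines)))

Consume the result once, in order – the filter mutates its BitSet as it goes.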

Find Actual Dupes / Counts in 1 Pass

Use a HashMap to count the elements that are a hit. Dump anything that has a count > 1 when you’re done.
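
A hedged sketch of that second stage, building on likely-dupes above. Note the
map only sees occurrences after the first one, so the recorded count for a
real dupe is its occurrences minus one; adjust the threshold to taste:

(defn dupe-counts
  "Single pass: count only the lines that hit the Bloom filter,
   then keep anything whose hit count is > 1."
  [lines]
  (let [bf (bloom-filter (* 64 1024 1024) 4)]
    (->> lines
         (filter #(bloom-hit? bf %))
         (reduce (fn [counts line] (update-in counts [line] (fnil inc 0))) {})
         (filter (fn [[_ cnt]] (> cnt 1)))
         (into {}))))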

Bonus: use proxy keys, not just natural keys

You can compute a value on the data before testing it vs the bloom filter.
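
For example – a sketch only, with a made-up :name field and a purely
illustrative normalization – test a computed proxy key instead of the raw
value:

(require '[clojure.string :as str])

(defn name-proxy-key
  "Proxy key: lower-cased, whitespace-collapsed :name."
  [record]
  (-> (:name record)
      str/lower-case
      (str/replace #"\s+" " ")
      str/trim))

;; test the proxy key against the filter instead of the natural key:
;; (bloom-hit? bf (name-proxy-key record))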

Records and Identity

Identity within the file.

Many of the dumps I receive have no persistent identifiers. Many have no relative identifiers! Give each record an ID. You can just count up from 1 or, if you want to get fancy, use an LFSR.

[ seq util that adds a counter to each record ]
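
A minimal version of that seq util (the {:id … :record …} shape is just an
assumption):

(defn with-counter-ids
  "Pair each record with an id, counting up from 1."
  [records]
  (map (fn [id record] {:id id :record record})
       (iterate inc 1)
       records))

;; (with-counter-ids ["a" "b" "c"])
;; => ({:id 1, :record "a"} {:id 2, :record "b"} {:id 3, :record "c"})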

What’s an LFSR?

Linear Feedback Shift Register: a class of PRNG.

An interesting subset: m-sequences (which have maximal period).

A binary numbering system – but non-sequential (remember: a PRNG). Fast, easy to serialize, not a hash.

Keeps adjacent records from having nearby IDs. This is not an issue for software / machines, but it keeps humans looking at the data from assuming that IDs which are close together have some meaning.

[ seq util that adds an lfsr id to each record ]
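
A sketch of that util around a 16-bit Galois LFSR. The tap mask 0xB400 gives a
maximal-period m-sequence; 16 bits (65,535 ids) is only for illustration – in
practice pick a width larger than your record count:

(defn lfsr-seq
  "Lazy, non-repeating, non-sequential walk of the 16-bit states 1..65535."
  ([] (lfsr-seq 0xACE1))
  ([state]
   (lazy-seq
     (cons state
           (lfsr-seq (let [shifted (bit-shift-right state 1)]
                       (if (odd? state)
                         (bit-xor shifted 0xB400)
                         shifted)))))))

(defn with-lfsr-ids [records]
  (map (fn [id record] {:id id :record record}) (lfsr-seq) records))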

Cheap Indexing Outside the Database / Grouping Records

Random Access

Really means the ability to seek within the file, not anything random.

Not a streaming / sequencing interface, boo.

With a few Clojure functions over the indexes, we can re-make an ordered sequence of records!

Creating Indexes

Natural Keys are easy: just extract and return

indexer.clj supports > 1 key for each record

Proxy keys are easy too: just compute the proxy keys for the row and return
them (e.g., soundex)
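
The general shape of such an indexer, as a hedged sketch (the actual
indexer.clj in this repo may differ): key-fn returns one or more keys per
record, and the index maps each key to the byte offsets of its records. The
field names and the soundex call in the comments are illustrative only:

(defn build-index
  "records-with-offsets is a seq of [byte-offset record] pairs.
   key-fn takes a record and returns a seq of keys (natural or proxy).
   Returns {key [offset ...]}."
  [key-fn records-with-offsets]
  (reduce (fn [idx [offset record]]
            (reduce (fn [idx k] (update-in idx [k] (fnil conj []) offset))
                    idx
                    (key-fn record)))
          {}
          records-with-offsets))

;; natural key:  (build-index (fn [rec] [(:ssn rec)]) recs)
;; proxy key:    (build-index (fn [rec] [(soundex (:last-name rec))]) recs)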

Using Indexes

Interactive Lookups (yes, like a SQL SELECT; see the sketch after this list)

Intersecting multiple indexes

Getting Record Groupings

The sequence merge function.
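
A sketch of the lookup and intersection side, under the same assumed index
shape as build-index above: intersect the offset sets from two indexes, then
seek into the file for the matching records. The function names and example
keys are mine:

(require '[clojure.set :as set])

(defn lookup-offsets [index k]
  (set (get index k)))

(defn intersect-lookups
  "Offsets of records matching key-a in idx-a AND key-b in idx-b."
  [idx-a key-a idx-b key-b]
  (set/intersection (lookup-offsets idx-a key-a)
                    (lookup-offsets idx-b key-b)))

(defn record-at
  "Seek to a byte offset and read one newline-delimited record."
  [path offset]
  (with-open [raf (java.io.RandomAccessFile. path "r")]
    (.seek raf (long offset))
    (.readLine raf)))

;; (map #(record-at "some-huge-file.tab" %)
;;      (intersect-lookups idx-name "smith" idx-zip "15213"))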

Fuzzy Duplicate Detection

Given a record grouping (based on proxy key), can we estimate similarity?
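
One hedged way to do that – the talk doesn’t prescribe a particular measure –
is a cheap Jaccard similarity over token sets of the records in a grouping:

(require '[clojure.set :as set]
         '[clojure.string :as str])

(defn token-set [record-line]
  (disj (set (str/split (str/lower-case record-line) #"\W+")) ""))

(defn jaccard
  "Similarity in [0,1]: size of the intersection over size of the union."
  [a b]
  (let [ta (token-set a) tb (token-set b)]
    (/ (count (set/intersection ta tb))
       (max 1 (count (set/union ta tb))))))

;; similarity of record pairs within one proxy-key grouping:
;; (for [a group, b group :when (not= a b)] [a b (double (jaccard a b))])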

Concurrency: Keeping your cores busy

Processing a File: lines in parallel

(pmap some-fn (ds/read-lines some-file))

For heavyweight per-line work this can be enough, but not always. Also, because
of pmap’s implementation (see: http://data-sorcery.org/2010/10/23/clojureconj/),
you may not get the parallelization you anticipate if you don’t keep the thread
pool hot / busy. Please see David Edgar Liebke’s slide deck for more details on
this – I’ve seen it happen.

Creating simple work queues:

Break a large file down with GNU split:

split -l 10000 some-large-file.tab working-directory/inp

Then it’s simple to create a work queue:

(pmap some-expensive-operation
      (map str (filter #(.isFile %) (.listFiles (java.io.File. "working-directory/")))))

Recombine results:

for f in working-directory/outp-*; do cat $f >> all-results.tab; done

Do this w/o Leaving Clojure: Processing a File by Chunks

(byte-partitions-at-line-boundaries "some-huge-file.tab" (* 5 1024 1024)) => ([start1 end1] [start2 end2] [start3 end3] …)

These will be ‘about’ every 5 MB through the file. Then, turn each of those
into a seq of lines:

(pmap (fn [[start end]]
        (doseq [line (read-lines-from-file-segment "some-huge-file.tab" start end)]
          (some-expensive-thing line)))
      (byte-partitions-at-line-boundaries "some-huge-file.tab" (* 5 1024 1024)))

You can also do this by line count, using the Clojure built-in function
@partition@ and @pmap@’ing over the results.
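
A sketch of the line-count version, using partition-all (partition’s sibling
that keeps the final short chunk) and the ds/ alias already used above:

;; doall inside the pmap'd fn forces each chunk's work eagerly,
;; so laziness doesn't defer it back onto one thread.
(->> (ds/read-lines "some-huge-file.tab")
     (partition-all 10000)
     (pmap (fn [chunk] (doall (map some-expensive-thing chunk))))
     doall)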

Aside: pmap and when it’s worth it.

Concurrency has some overhead.

For cheap operations, that overhead outweighs the benefits.
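
A quick way to see it for yourself (timings are machine-dependent; inc is
about as cheap as an operation gets):

;; cheap per-element work: pmap's coordination cost dominates
(time (doall (map inc (range 1000000))))   ;; typically faster
(time (doall (pmap inc (range 1000000))))  ;; typically slower, despite using more cores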

Special Thanks

Rob DiMarco for helping with the title and abstract for this talk.

Paul Santa Clara for being part of the learning process.
