Learn techniques for working with large data sets: those which are too large to
fit in RAM, but not so large they require distributed computing. Take random
samples and find highly duplicated values – without a database import.
I’ll walk you through efficient recipes for: taking random samples in a single
pass, finding highly duplicated values in a single pass using a bloom filter,
and processing large data files in parallel – without preprocessing. These
techniques let you avoid having to import the data sets into your database,
and efficiently perform operations that would incur large memory or IO overhead
in a database.
Though not specific to Clojure, many of the techniques lead to elegant, easy-to-understand implementations in Clojure by leveraging its sequence abstraction, immutability, and parallelization features.
Come hear about some new gadgets to add to your data munging utility belt!
I recently learned some new tools and techniques for working with large data
sets – those which are too large to fit in RAM, but not so large you need
distributed computing to work with them. I’ll discuss things like: taking
random samples; finding duplicated values; and other types of basic analysis.
You have been asked to take a random sample of a few hundred million records in
order to estimate some metrics of the data set as a whole.
BZZT! You just wasted time fixing the other guy’s data export errors, then you
wasted time importing the data into your database only to find that the above
query is estimated to take somewhere between two days and the heat death of the
universe to execute on your workstation.
Turns out you can: sort supports a ‘random’ flag, sort -R.
This is faster than the database, but it uses disk space on the order of your original data set (wasteful), and lots of IO is spent computing the output.
“Selection without replacement”
Stream the data:
the odds of picking each element are N/M
N is the number of samples you still need
M is the size of the population you haven’t yet considered
Decrement M after considering each member of the population
Decrement N after you’ve chosen an element. (A Clojure sketch follows.)
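Here’s a minimal sketch of that single pass, assuming you know the population size up front (the function name is mine, not canonical code):

(defn sample-without-replacement
  "Pick n elements from coll (of total size m) in a single pass."
  [n m coll]
  (loop [needed n, remaining m, xs (seq coll), acc []]
    (if (or (zero? needed) (nil? xs))
      acc
      (if (< (rand) (/ needed remaining))
        (recur (dec needed) (dec remaining) (next xs) (conj acc (first xs)))
        (recur needed (dec remaining) (next xs) acc)))))

;; e.g. sample 5 of a population of 100:
;; (sample-without-replacement 5 100 (range 100))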
Sit back and bask in the Win.
One Machine, One Pass over the file, output on the order of your sample set
size. The very definition of W.I.N. (Why did I Not think of that before?)
A good friend of mine, Jonny Tran, advised me to tell you something valuable
in the first few minutes of the talk. I hope that counts…
When your data set, or the information you need to track to analyze it, fits
into RAM, you can effectively ignore most of these techniques. Finding
duplicated values in an array that already fits in memory is simple: you
iterate through it, keeping a table of how many times you’ve seen each
element. At worst (no duplicates), you’ll use memory on the order of the size
of the data set, plus the space it takes to store each count (an integer). In
Clojure that’s shown below.
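That table of counts is exactly what Clojure’s built-in frequencies gives you; a quick illustration (the duplicates helper is my own):

;; frequencies builds the value -> count table in a single pass.
(defn duplicates [coll]
  (for [[v n] (frequencies coll) :when (> n 1)] v))

;; (duplicates [1 2 3 2 1 1]) yields 1 and 2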
Remember: too big for RAM, possibly way too big.
Random Access: really means the ability to seek within the file, not truly random. It’s not a streaming / sequential interface, boo. But with a few Clojure functions over the indexes we can re-make an ordered sequence of records!
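For example, here’s a minimal sketch of seeking to indexed byte offsets and rebuilding an ordered sequence of records (my own illustration; it assumes one record per line):

(defn read-record-at
  "Seek to a byte offset and read the record (line) found there."
  [^java.io.RandomAccessFile raf offset]
  (.seek raf offset)
  (.readLine raf))

(defn records-at
  "Re-make an ordered sequence of records from a collection of offsets."
  [file offsets]
  (with-open [raf (java.io.RandomAccessFile. file "r")]
    (doall (map #(read-record-at raf %) (sort offsets)))))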
Natural Keys are easy: just extract and return them
indexer.clj supports > 1 key for each record
Proxy keys are easy too: just compute the proxy keys for the row and return them. A rough sketch of the indexing idea follows.
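This is not indexer.clj’s actual API (the names and data shapes are my assumptions), but the idea is roughly: map each key a record yields to the byte offsets where matching records live.

(require '[clojure.java.io :as io])

(defn build-index
  "Index a file of line-oriented records: key -> vector of byte offsets.
  key-fn returns a seq of keys (natural or proxy) for a record. Assumes
  a single-byte encoding, so each line advances the offset by length + 1."
  [file key-fn]
  (with-open [rdr (io/reader file)]
    (loop [lines (line-seq rdr), offset 0, idx {}]
      (if-let [[line & more] (seq lines)]
        (recur more
               (+ offset (count line) 1) ; +1 for the newline
               (reduce (fn [m k] (update-in m [k] (fnil conj []) offset))
                       idx
                       (key-fn line)))
        idx))))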
Interactive Lookups (yes, like a SQL SELECT)
Intersecting multiple indexes (a toy example follows this list)
The sequence merge function.
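If each index maps a key to a set of record offsets, intersecting indexes is plain clojure.set (the toy data and index names here are mine, purely illustrative):

(require '[clojure.set :as set])

;; Toy indexes: key -> set of record offsets.
(def state-index {"NY" #{0 120 240}, "CA" #{360}})
(def year-index  {2010 #{120 360}, 2011 #{0 240}})

;; Roughly "SELECT ... WHERE state = 'NY' AND year = 2010":
(set/intersection (state-index "NY") (year-index 2010))
;; => #{120}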
Given a record grouping (based on proxy key), can we estimate similarity?
For heavyweight operations this works, but it may not behave the way you
expect. Based on pmap’s implementation (see:
http://data-sorcery.org/2010/10/23/clojureconj/), you may not get the
parallelization you anticipate if you don’t keep the threadpool hot / busy.
Please see David Edgar Liebke’s slide deck for more details, but I’ve seen it
happen.
These chunk boundaries will fall ‘about’ every 5MB through the file. Then,
turn each of those into a seq of lines.
You can also do this by line count, using the Clojure built-ins partition (or
partition-all, which keeps the final short chunk) and pmap over the results; a
sketch follows.
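A minimal sketch of the line-count variant (the chunk size and process-line are placeholders I’ve made up):

(require '[clojure.java.io :as io])

(defn process-file
  "Apply process-line to every line of file, in parallel, chunk by chunk."
  [file process-line]
  (with-open [rdr (io/reader file)]
    (->> (line-seq rdr)
         (partition-all 10000)                ; chunks of 10k lines
         (pmap #(doall (map process-line %))) ; realize each chunk in its thread
         (apply concat)
         doall)))                             ; force results before the reader closes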
Concurrency has some overhead.
For cheap operations, it outweighs the benefits.
Rob DiMarco for helping with the title and abstract for this talk.
Paul Santa Clara for being part of the learning process.