Bigg(ish) Data: Working the Middle ground between RAM and a Cluster
We all want fast answers, even as our data sets grow in size. Simple
techniques stop working when our data no longer fits in a single process. In
this talk I’d like to share a few of the techniques I’ve learned that stretch
what you can do on a single machine. I’ll cover taking random samples from
large data sets with stream sampling and identifying (probable) duplicates when
the set is too large to fit in memory.
NB: I’m the process of updating this talk.
- data processing techniques for single processes
- stream sampling
- bloom filters
- clojure and distributed data processing
- getting a REPL on your Storm topology
- ? connecting a REPL to Spark
…tell a story…
2. When RAM is not enoughThere are a couple of algorithms that I’ve learned that help stretch what can be done in a single process. They help you use a fixed amount of memory to achieve the same outcomes without pulling your entire data set into memory. These started out as interesting algorithms when I read about them, I didn’t know where or when I was going to use them. Knowing about them let me start to see where they could be applied. That’s one of the things I love about learning new techniques. Two of the techniques I’m going to cover in this talk are: taking random samples and detecting duplicates in large data sets. These techniques let you work with data that’s too large to fit in memory in a single process, while still staying in a single process.
3. Mission: ProbableA few years ago, my company purchased a large data set and wanted to get a sense for how good the data was. We needed to pull a random sample of the data and verify it. The verification was too manual and expensive to do for the all of the data. The first thing that comes to mind is to iterate over the data and emit a record with a 20k out of 400M probability. This is pretty quick, but it doesn’t guarantee that we’ll get the exact sample size we’re looking for. Lets see, what else comes to mind?
7. Flip a Coin?
We could iterate through the data set, generate a random number for each element and if it’s less than 20k out of 400M, select that element. This will let us do one pass over the data, though we will be over or under our sample size by some small amount. If we have random access to the elements, we could generate random indexes into the 400M and select elements that way. We’d have to be a little careful not to select the same record more than once.
Can we refine the ‘flip a coin’ approach? If you flip a coin 1/M for each
element there is a chance you won’t hit your sample size. Should we choosing a
random index until we hit your sample size? That’s sampling with replacement,
which may result in duplicates. Was that our last hope? No, there is another.
Running the example code
test -d $HOME/bin || mkdir $HOME/bin test -x $HOME/bin/bake || curl https://raw.githubusercontent.com/kyleburton/bake/master/bake > $HOME/bin/bake chmod 755 $HOME/bin/bake test -x $HOME/bin/lein || curl https://raw.githubusercontent.com/technomancy/leiningen/stable/bin/lein > $HOME/bin/lein chmod 755 $HOME/bin/lein export PATH=$HOME/bin:$PATH bake emacs-install-cider bake emacs-install-ac-cider bake get-sample-data
h1. Beat the DB: Recipes for Large data Analysis with Clojure
Learn techniques for working with large data sets: those which are too large to
fit in RAM, but not so large they require distributed computing. Take random
samples and find highly duplicated values – without a database import.
I’ll walk you through efficient recipes for: taking random samples in a single
pass, finding highly duplicated values in a single pass using a bloom filter,
and processing large data files in parallel – without preprocessing. These
techniques let you avoid having to import the data sets into your database,
and efficiently perform operations that would incur large memory or IO overhead
in a database.
Though not specific to Clojure, many of the techniques lead to elegant, easy to
understand implementations in Clojure by leveraging it’s sequence abstraction,
immutability and parallelization features.
Come hear about some new gadgets to add to your data munging utility belt!
I recently learned some new tools and techniques for working with large data
sets – those which are too large to fit in RAM, but not so large you need
distributed computing to work with them. I’ll discuss things like: taking
random samples; finding duplicated values as well as other types of basic
How to Win the Random Sample Turkey Shoot
You have been asked to take a random sample of a few hundred million records in
order to estimate some metrics of the data set as a whole.
SELECT * FROM some_table ORDER BY RANDOM() LIMIT 1000 [ HUGE red X ] I Know: the database!
BZZT! You just wasted time fixing the other guy’s data export errors, then you
wasted time importing the data into your database only to find that the above
query is estimated to take somewhere between two days and the heat death of the
universe to execute on your workstation.
[ Bat-signal-like spotlight shining a huge GNU onto clouds? ] I’ll use the ‘GNU Signal!’
Turns out you can, sort supports a ‘random’ flag:sort -R
This is faster than the database…uses up disk space on the order of your original data set [wasteful] lots of IO spent computing the output [ unhappy anthropomorphic database ]
“You Shall Not Pass!” (multiple times)
“Selection without replacement”
Stream the data:
odds are N/M of picking an element
N is your desired sample set size
M is the size of the population you haven’t yet considered
Decrement M after considering each member of the population
Decrement N after you’ve chosen an element.
Sit back and bask in the Win.[ relaxed person? the Reddit meme paint drawing / person? ]
One Machine, One Pass over the file, output on the order of your sample set
size. The very definition of W.I.N. (Why did I Not think of that before)
That Was My Hook…
I was advised to tell you something valuable in the first few minutes of the
talk by a good friend of mine: Jonny Tran. I hope that counts…
What do I mean by ‘Large Data’?
When your data set, or the information you need to track to analyze it, fits
into RAM, you can effectively ignore most of these techniques. Finding
duplicated values in an array that already fits in memory is simple: you
iterate through it keeping a table of the times you’ve seen each element. At
worst (no duplicates), you’ll use memory on the order of the size of the data
set, plus the storage space it takes to store the count (an integer).
Counting and Analysis
[ HUGE red X ] Just Keep a HashMap/Table
Remember: too big for RAM, possibly way to big.
Next Utility Belt Gadget: Bloom Filters
What’s a Bloom Filter
Emit anything that’s a HIT with the Bloom Filter. You’ll get Some false positives. This is good enough for many of my use cases. Find likely Dupes in 1 pass
Use a HashMap to count the elements that are a hit. Dump anything that has a count > 1 when you’re done. Find Actual Dupes / Counts in 1 Pass
You can compute a value on the data before testing it vs the bloom filter. Bonus: use proxy keys, not just natural keys
Identity within the file. Many of the dumps I receive have no persistent identifiers. Many have no relative identifiers! Give each record an ID. [ seq util that adds a counter to each record ] You can just count up from 1, or if you want to get fancy, use an LFSR. Records and Identity
Linear Feedback Shift Register. Class of PRNG. Interesting subset: m-sequences (have maximal period) Binary Numbering System – but non-sequential (remember: a prng) Fast, easy to serialize, not a hash. Keeps adjacent records from being close based on ID. This is not an issue for software / machines: but may help humans looking at the data from assuming that IDs which are close have some meaning. [ seq util that adds an lfsr id to each record ] What’s an LFSR?
Cheap Indexing Outside the Database / Grouping Records
Random AccessReally means ability to seek within the file, not really random. Not a Streaming / Sequencing interface, boo With a few Clojure functions over the indexes we can re-make an ordered sequence of records!
Natual Keys are easy: just extract and return
indexer.clj supports > 1 key for each record
Proxy keys are easy, just compute the proxy keys for the row and return them
Interactive Lookups (yes, like a sql select)
Intersecting multiple indexes
Getting Record Groupings
The sequence merge function.
Fuzzy Duplicate Detection
Given a record grouping (based on proxy key), can we estimate similarity?
Concurrency: Keeping your cores busy
(pmap some-fn (ds/read-lines some-file)) Processing a File: lines in parallel
For heavyweight stuff this works but may not. Also, based on pmap’s
implementation (see: http://data-sorcery.org/2010/10/23/clojureconj/), you may
not get the parallelization you anticipate if you don’t keep the threadpool hot
/ busy. Please see David Edgar Liebke’s slide deck for more details on this,
but I’ve seen it happen.
Break a large file down with GNU split split Creating simple work queues:
(byte-partitions-at-line-boundaries “some-huge-file.tab” (* 5 1024 1024)) => ([start1 end1] [start2 end2] [start3 end3] …) Do this w/o Leaving Clojure : Processing a file: By Chunks
These will be ‘about’ every 5Mb through the file. Then, turn each of those
into a seq of lines:
You can also do this pretty easily by line count as well, by using the Clojure
built-in function: partition and @pmap@’ing over the results.
pmap and When it’s worth it.
Concurrency has some overhead.
For cheap operations, this it outweighs the benefits.
Rob DiMarco for helping with the title and abstract for this talk.
Paul Santa Clara for being part of the learning process.