---
Probabilistic Data Structures
===

<img src="images/die.png" style="width: 400px;"/>

      
> Get your data structures correct first, and the rest of the program will write itself.  
> \- David Jones

---
Student Discussion
---

<details><summary>
If your data is too big to query, what do you do?
</summary>
- Sample (i.e., use less of it)  
- Get more resources (rent a bigger computer or a cluster)
</details>

---
Why probabilistic data structures?
---

__Too much data (often too fast)__

<img src="http://screenmediadaily.com/wp-content/uploads/2015/01/IAB-Complexity.jpg" style="width: 400px;"/>

Get an approximate answer (an estimate). This is the opposite of traditional data structures, which provide precise answers. As a data scientist, you can manage that ambiguity.

> You can get the approximate answer shortly or ...   
> the precise answer NEVER!

Reduce the computational complexity and storage requirements (RAM, CPU, etc.)  
(Usually) fixed size memory and storage requirements

Need to accept a certain error rate. 

Tradeoff between __size__ and __error__. 

---
When to use them?
---

- Designed to answer queries only about specific properties 
    (e.g., set cardinality, set membership, etc.)
- Support only specific operations
    (e.g., adding a set member, set unions but not removing or intersections)
- It is okay to have possible false positives/overestimations

### Student Activity

List types of non-probabilistic data structures

---
Probabilistic data structures covered
---

1. Bloom filter
2. Count–min sketch
3. Locality-sensitive hashing (LSH)
4. HyperLogLog

---
Hash functions ftw
---

<img src="images/hash.png" style="width: 400px;"/>

- Hash functions map an arbitrary input to a single numeric output
- Hash functions are deterministic: the same input always produces the same output
- Right now, hash functions are a "solved" problem: there are perfect hashing schemes and well-known mechanisms to handle collisions
- Names of common hash functions:
    - hashCode()
    - CRC32
    - MD5
    - SHA
    - MurmurHash
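
As a small illustration of the points above, the sketch below uses two hash functions from Python's standard library (`zlib.crc32` and `hashlib.md5`) to turn a string into a number and then into a bucket index. The bucket count of 10 and the helper name `bucket_for` are arbitrary choices made for this example.

```python
import hashlib
import zlib

def bucket_for(key, n_buckets=10):
    """Map a string to a bucket index using CRC32 (hypothetical helper)."""
    return zlib.crc32(key.encode("utf-8")) % n_buckets

# Deterministic: the same input always gives the same output
assert zlib.crc32(b"foo") == zlib.crc32(b"foo")

# MD5 returns 16 bytes; interpret them as one (large) integer
md5_as_int = int.from_bytes(hashlib.md5(b"foo").digest(), "big")

print(bucket_for("foo"), bucket_for("bar"))
print(md5_as_int % 10)
```

The modulo step is the same trick used later to place a hash value into a fixed-size bit array.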

---
Review Question
---

<details><summary>
What is the difference between Hash Set and Hash Map?
</summary>
A set is an unordered collection of unique objects. <br>
A map is an unordered collection of unique keys (with associated values).
</details>

---
Hash Set
---

- Cardinality and membership are O(1)
- Insert and delete are O(1)
- Fast and universal
- However, they can take a lot of resources
    - If you want to store UUIDs in a HashSet: 
        - UUID - 36 bytes
        - ~100m UUIDs * 36B ~= 3.35GB

---
Bloom Filter
---

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/ac/Bloom_filter.svg/2000px-Bloom_filter.svg.png" style="width: 400px;"/>

- What does it do? Probabilistic membership queries
- Several hash functions map a single input to multiple bits in a bit array
- An element is reported as a member if all of its hash values map to a 1

---
Add
---

<img src="images/add.jpg" style="width: 400px;"/>

---
Membership Check
---

<img src="images/check.png" style="width: 400px;"/>

---
Bloom Filters scale well
----

Remember our UUID problem...

In a Bloom Filter, 100m UUIDs ~= 79.85MB (with 4% error rate)
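
The figure above can be roughly reproduced with the standard Bloom filter sizing formulas, $m = -n \ln p / (\ln 2)^2$ bits and $k = (m/n) \ln 2$ hash functions. The sketch below is a back-of-the-envelope check; the function names are made up for this example, and the hash set number counts only the raw 36-byte UUIDs, not any container overhead.

```python
import math

def bloom_size_bits(n_items, error_rate):
    """Optimal number of bits m for n items at target false positive rate p."""
    return math.ceil(-n_items * math.log(error_rate) / (math.log(2) ** 2))

def optimal_hash_count(m_bits, n_items):
    """Number of hash functions k that minimizes false positives."""
    return max(1, round(m_bits / n_items * math.log(2)))

n = 100_000_000          # 100m UUIDs
p = 0.04                 # 4% error rate

m = bloom_size_bits(n, p)
k = optimal_hash_count(m, n)
bloom_mib = m / 8 / 2**20
hashset_gib = n * 36 / 2**30   # raw UUID bytes only

print(f"Bloom filter: {bloom_mib:.1f} MiB with k={k} hashes")
print(f"Raw UUIDs:    {hashset_gib:.2f} GiB")
```

Lowering the error rate grows the filter only logarithmically, which is why the size/error tradeoff is usually so favorable.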

---
Bloom Filters for DBs
---

<img src="http://crzyjcky.com/wp-content/uploads/2013/01/bloom-filter-database.png" style="width: 400px;"/>

True -> maybe contains   
False -> definitely __does not__ contain

---
Let's build a Bloom Filter
---

__What You'll Need__:
1. A bit array (as the name suggests it's just an array of bits)
2. A quick non-cryptographic hash function (e.g., murmurhash3 or cityhash)

In [25]:
! pip install bitarray



In [26]:
! pip install mmh3



In [27]:
from bitarray import bitarray
import mmh3
 
# Create our empty bit array
bit_array = bitarray(10)
bit_array.setall(0)

In [28]:
bit_array

bitarray('0000000000')

In [29]:
# Let's hash something
mmh3.hash("foo", seed=42)

-1322301282

---
Challenge question
---

<details><summary>
How do we put that hash into our bit array?
</summary>
`index = mmh3.hash("foo", 42) % len(bit_array)`
<br>
`bit_array[index] = 1` <br>
</details>

<br>
<br>
<br>
<br>

In [30]:
index = mmh3.hash("foo", seed=42) % len(bit_array)
bit_array[index] = 1

In [31]:
bit_array

bitarray('0000000010')

Bloom filters work by hashing an object several times, using either multiple hash functions or the same hash function with different seeds. This ensures that the hashes of a single object are unlikely to land on the same bit.

In [32]:
bit_array[mmh3.hash("foo", seed=999) % len(bit_array)] = 1

In [33]:
bit_array

bitarray('0000000110')

In [35]:
if (bit_array[mmh3.hash("foo", seed=42) % len(bit_array)] == 1) and \
    (bit_array[mmh3.hash("foo", seed=999) % len(bit_array)] == 1):
    print("Probably in set")
else:
    print("Definitely not in set")

Probably in set


Let's make a class

In [11]:
from bitarray import bitarray
import mmh3
 
class BloomFilter(object):
    "Simple implementation of a Bloom Filter"
    
    def __init__(self, size, hash_count):
        self.size = size
        self.hash_count = hash_count # Number of hashes for seeds
        self.bit_array = bitarray(size) 
        self.bit_array.setall(0)
        
    def add(self, string):
        """Add element to current Bloom Filter.
        TODO: Add awesome code
        """
        
    def lookup(self, string):
        """Check if element is in current Bloom Filter.
        
        >>> bf = BloomFilter(size=10, hash_count=3)
        >>> bf.add('pudding')
        >>> bf.lookup('pudding')
        'Probably'
        >>> bf.lookup('salad')
        'Nope'

        TODO: Add awesome code
        """

---
Solutions
---

<details><summary>
Click here for `add` code.
</summary>
```
def add(self, string):
    """Add element to current Bloom Filter.
    """
    for seed in range(self.hash_count):
        result = mmh3.hash(string, seed) % self.size
        self.bit_array[result] = 1
```
</details>

<details><summary>
Click here for `lookup` code.
</summary>
```
def lookup(self, string):
    """Check if element is in current Bloom Filter.

    >>> bf = BloomFilter(size=10, hash_count=3)
    >>> bf.add('pudding')
    >>> bf.lookup('pudding')
    'Probably'
    >>> bf.lookup('salad')
    'Nope'
    """
    for seed in range(self.hash_count):
        result = mmh3.hash(string, seed) % self.size
        if self.bit_array[result] == 0:
            return "Nope"
    return "Probably"
```
</details>

[Source](http://www.maxburstein.com/blog/creating-a-simple-bloom-filter/)

---
Let's use a Bloom Filter package
---

<img src="http://i.imgur.com/xTiuVsi.jpg" style="width: 400px;"/>

In [37]:
! pip install pybloom



In [13]:
from pybloom import BloomFilter

In [38]:
# Create a Bloom filter
bf = BloomFilter(capacity=1000, 
                error_rate=0.001)

In [39]:
# Let's put in some numbers
[bf.add(_) for _ in range(5)]

[False, False, False, False, False]

In [40]:
# Are the numbers there?
0 in bf

True

In [42]:
# Are they all there?
all((_ in bf) for _ in range(5))

True

In [43]:
# How about other numbers?
5 in bf

False

---
Fun Facts about Bloom Filters
---

- Databases such as Cassandra and HBase use Bloom filters to decide whether an expensive lookup is worth doing.

- Elements can only be added, never removed

- A fixed size Bloom filter can represent a set with an arbitrarily large number of elements. Adding an element never fails due to the data structure "filling up." However, the false positive rate increases steadily as elements are added until all bits in the filter are set to 1, at which point __all queries yield a positive result__.
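
To watch a filter fill up, the sketch below adds items to a small bit array and tracks the fraction of bits set. A non-member is a false positive when all of its probed bits are already 1, so that fraction raised to the number of hashes gives a rough estimate of the false positive rate. It uses salted SHA-256 from the standard library in place of `mmh3` so it runs without extra packages; the size of 1000 bits and 3 hashes are arbitrary.

```python
import hashlib

SIZE = 1000        # number of bits (arbitrary for this demo)
HASH_COUNT = 3     # number of hash functions (arbitrary)

def indexes(item, size=SIZE, hash_count=HASH_COUNT):
    """Derive hash_count bit positions by salting SHA-256 with a seed."""
    for seed in range(hash_count):
        digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
        yield int.from_bytes(digest[:8], "big") % size

bits = [0] * SIZE
estimates = {}
for n in range(1, 5001):
    for i in indexes(f"item-{n}"):
        bits[i] = 1
    if n in (100, 500, 1000, 5000):
        fill = sum(bits) / SIZE
        # Chance that all HASH_COUNT probed bits of a non-member are set
        estimates[n] = fill ** HASH_COUNT
        print(f"{n:>5} items: {fill:.0%} of bits set, "
              f"est. false positive rate {estimates[n]:.1%}")
```

Adding never fails, but by the last checkpoint nearly every bit is set and almost any query comes back positive.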

---
What is a use case for Bloom Filters?
---

A browser could keep a local copy of all the malicious URLs that it could query to warn users (eliminates the network requests).

But what about the wrong answers? 

A Bloom filter of malicious URLs will never report a malicious URL as “safe”; it might only report a “safe” URL as “malicious”. 
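
One way to deal with those false positives (an assumption here, not something the slides prescribe) is to treat a filter hit as "maybe" and confirm it against the full list, so that only hits cost a network request. In the sketch below, `TinyBloom`, `is_malicious`, and the URLs are made up, and a Python set stands in for the remote service.

```python
import hashlib

class TinyBloom:
    """Minimal Bloom filter for illustration (salted SHA-256, not mmh3)."""

    def __init__(self, size=1024, hash_count=4):
        self.size = size
        self.hash_count = hash_count
        self.bits = [0] * size

    def _indexes(self, item):
        for seed in range(self.hash_count):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for i in self._indexes(item):
            self.bits[i] = 1

    def __contains__(self, item):
        return all(self.bits[i] for i in self._indexes(item))

# Hypothetical blocklist; in practice this would live on a server
BLOCKLIST = {"http://evil.example/a", "http://evil.example/b"}

local_filter = TinyBloom()
for url in BLOCKLIST:
    local_filter.add(url)

def is_malicious(url):
    if url not in local_filter:
        return False            # definitely not listed: no request needed
    return url in BLOCKLIST     # maybe listed: confirm (stand-in for a request)

print(is_malicious("http://evil.example/a"))   # True
print(is_malicious("http://safe.example/"))    # False
```

Because the filter has no false negatives, skipping the confirmation step for a negative answer is always safe.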

---
Additional Resources
---

- [Cuckoo Filters](http://mybiasedcoin.blogspot.com/2014/10/cuckoo-filters.html)
- [Cuckoo Filters: Better than Bloom](https://www.cs.cmu.edu/~binfan/papers/login_cuckoofilter.pdf)
- [Cuckoo Filter: Simplification and Analysis](http://arxiv.org/abs/1604.06067)

<br>
<br> 
<br>

----
What is Count-Min Sketch?
---

- Analogous to a Bloom filter, but stores approximate frequency counts instead of membership bits
- #1 use case: top entries by frequency (e.g., “Heavy Hitters” on Twitter)

----
How does count-min sketch work?
----

- Initialize a 2-dimensional array of counters
- Dimensions: number of hash functions × a (fixed) number of counters per hash function
- Insert by incrementing the counter at each hashed index

![](images/count_min.png)

---
Add
---

Given a stream of occurrences

<img src="images/add_count.png" style="width: 400px;"/>

---
Count
---

<img src="images/count_count.png" style="width: 400px;"/>

----
What kind of errors does a Count-Min Sketch have?
----

> Can overestimate, but never underestimate
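
The add and count steps, and the one-sided error, can be sketched in a few lines. This is a minimal illustration rather than a tuned implementation: it uses salted SHA-256 from the standard library instead of `mmh3`, and the width of 50 and depth of 4 are arbitrary.

```python
import hashlib

class CountMinSketch:
    """Minimal Count-Min Sketch: depth hash rows, each with width counters."""

    def __init__(self, width=50, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def count(self, item):
        # Collisions only ever inflate a counter, so take the minimum
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

stream = ["cat"] * 30 + ["dog"] * 10 + [f"user-{i}" for i in range(200)]
cms = CountMinSketch()
for item in stream:
    cms.add(item)

print(cms.count("cat"), cms.count("dog"), cms.count("never-seen"))
```

Every counter an item touches includes that item's true count plus whatever collided with it, which is why the estimate can be too high but never too low.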

---
Check for understanding
---

<details><summary>
If I wanted to estimate frequency, which data structure would I use?
</summary>
Count-Min Sketch<br>
</details>

<br>
<br> 
<br>

----
Summary
----

- Trade bounded errors for greatly reduced resource costs
- The algorithms are straightforward to implement
- Could possibly be the "silver bullet" for your scaling problem

### Summary Table

| Problem | Solution  |  
|:-------:|:------:|
| Set Membership | Bloom Filter  |
| Frequency summaries | Count-Min Sketch |

<br>
<br> 
<br>

----
Bonus Materials
----

---
Distributed Streaming Quantiles (DSQ)
---

- Sketch-based streaming quantile computation
- The domain of possible values is split into dyadic intervals (powers of two) up to a specified number of levels. (This also means the domain must be specified in advance.) 
- Each level has its own count-min sketch. 
- When computing a quantile, we estimate the number of observations to the left of the specified quantile (along with the total count of observations). 
- The estimate is built by taking the largest possible intervals from the highest levels of the tree; higher levels are more accurate.
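
The dyadic decomposition can be sketched on a tiny integer domain. To keep the example short and exact, a plain `Counter` stands in for the count-min sketch at each level, so this shows only the interval logic, not the approximation. The level numbering (level `l` has intervals of width `2**l`) and the function names are choices made for this example, not taken from DSQ.

```python
from collections import Counter

LEVELS = 4   # domain is the integers 0 .. 2**LEVELS - 1, fixed in advance
# One counter per level; a real implementation would use a sketch here
counts = [Counter() for _ in range(LEVELS + 1)]

def add(value):
    """Record value in the dyadic interval containing it at every level."""
    for level in range(LEVELS + 1):
        counts[level][value >> level] += 1   # interval width is 2**level

def rank(x):
    """Observations strictly less than x, summed over disjoint dyadic intervals."""
    result = 0
    for level in range(LEVELS + 1):
        if (x >> level) & 1:                 # this level contributes one interval
            result += counts[level][(x >> level) - 1]
    return result

def quantile(q):
    """Smallest value v such that at least q * total observations are <= v."""
    total = counts[LEVELS][0]                # top level has a single interval
    lo, hi = 0, 2**LEVELS - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if rank(mid + 1) >= q * total:
            hi = mid
        else:
            lo = mid + 1
    return lo

data = [1, 3, 3, 4, 7, 8, 8, 9, 12, 15]
for v in data:
    add(v)

print(rank(8), quantile(0.5))   # 5 7
```

Each rank query reads at most one interval per level, which is what keeps the number of sketch lookups, and therefore the accumulated error, small.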

---
Additional Resources
----

- [DSQ in PySpark](https://github.com/laserson/dsq)
- [Original Paper](http://www.sciencedirect.com/science/article/pii/S0196677403001913)