In [1]:
from demo_utils import *

# Begin demo

## Count-Min Sketch estimates across distributions 
The plots below show a scatterplot of errors per data point (left), the distribution of the data (center), and the distribution of errors (right). Again, the Count-Min Sketch gives us decent estimates: none of the errors exceed the threshold. Interestingly, switching to a normal distribution almost halved our average error, from 20.154 to 11.4482, while our max error increased by 12.
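To make the `eps` and `delta` sliders concrete, here is a minimal Count-Min Sketch for integer keys. It is a sketch under assumptions, not the code inside `demo_utils`: the class name, the hash family, and the sizing (width `ceil(e / eps)`, depth `ceil(ln(1 / delta))`) follow the standard textbook construction, and the "threshold" is assumed to be the usual `eps * n` bound, which holds for each key with probability at least `1 - delta`.

```python
import math
import random
from collections import Counter


class CountMinSketch:
    """Illustrative Count-Min Sketch for integer keys (hypothetical helper)."""

    _PRIME = (1 << 61) - 1  # large prime for the (a * x + b) mod p hash family

    def __init__(self, eps, delta, seed=0):
        # Standard sizing: wider tables shrink the error, deeper tables
        # shrink the chance of exceeding it.
        self.width = math.ceil(math.e / eps)
        self.depth = max(1, math.ceil(math.log(1 / delta)))
        rng = random.Random(seed)
        self.params = [
            (rng.randrange(1, self._PRIME), rng.randrange(self._PRIME))
            for _ in range(self.depth)
        ]
        self.table = [[0] * self.width for _ in range(self.depth)]
        self.total = 0

    def _index(self, row, key):
        a, b = self.params[row]
        return ((a * key + b) % self._PRIME) % self.width

    def add(self, key, count=1):
        self.total += count
        for row in range(self.depth):
            self.table[row][self._index(row, key)] += count

    def estimate(self, key):
        # Collisions only ever add to a cell, so the minimum over rows
        # is the least-contaminated estimate and never undercounts.
        return min(
            self.table[row][self._index(row, key)] for row in range(self.depth)
        )


# Sample 1000 integers from a normal distribution (assumed setup).
rng = random.Random(1)
data = [round(rng.gauss(0, 100)) for _ in range(1000)]

sketch = CountMinSketch(eps=0.01, delta=0.01)
for x in data:
    sketch.add(x)

true_counts = Counter(data)
errors = [sketch.estimate(k) - c for k, c in true_counts.items()]
print("max error:", max(errors), "threshold eps * n:", 0.01 * sketch.total)
```

Because every estimate is a minimum over cells that include the key's true count, the errors here are never negative; only their size depends on `eps`, `delta`, and the data.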

In [2]:
opts1 = {
    'title': "Scatterplot of Errors",
    'ylabel': "Error",
    'xlabel': "Data Point",
    'ylim': (0, 500)}
nbi.scatter(generate_hist_sample, get_y_errors,
            n=(1, 2000), eps=(0.01, 1, 0.01), delta=(0.01, 1, 0.01),
            distribution={'normal': 'normal', 'zipf': 'zipf', 'uniform': 'uniform',
                          'exp': 'exp', 'geometric': 'geometric', 'lognorm': 'lognorm'},
            options=opts1)


VBox(children=(interactive(children=(IntSlider(value=1000, description='n', max=2000, min=1), FloatSlider(valu…

In [3]:
opts2 = {
    'title': "Distribution of Data",
    'ylabel': "Count",
    'xlabel': "Data Point",}
nbi.hist(generate_hist_sample, n=(1, 10000),
         distribution={'normal': 'normal', 'zipf': 'zipf', 'uniform': 'uniform',
                       'exp': 'exp', 'geometric': 'geometric', 'lognorm': 'lognorm'},
         options=opts2)


VBox(children=(interactive(children=(IntSlider(value=5000, description='n', max=10000, min=1), Dropdown(descri…

In [4]:
opts3 = {
    'title': "Distribution of Errors",
    'ylabel': "Count",
    'xlabel': "Error Magnitude",}
nbi.hist(get_data_for_hist_errors,
         n=(1, 2000), eps=(0.01, 1, 0.01), delta=(0.01, 1, 0.01),
         distribution={'normal': 'normal', 'zipf': 'zipf', 'uniform': 'uniform',
                       'exp': 'exp', 'geometric': 'geometric', 'lognorm': 'lognorm'},
         options=opts3)


VBox(children=(interactive(children=(IntSlider(value=1000, description='n', max=2000, min=1), FloatSlider(valu…

# How does data spread affect our errors?
Let’s take this a step further and isolate the effect of standard deviation (a mathematical proxy for the “spread” of the data) on the errors. We sample n = 1000 integers from a normal distribution with a mean of μ = 0 and vary the standard deviation (the sliders below span 0.01 to 1000) to understand how spreading out the distribution affects the errors.
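Before moving the sliders, it can help to see what "spread" does to the data itself. The snippet below is a small illustration, not part of `demo_utils`: the helper names and the rounding of samples to integers are assumptions. It counts how many distinct integers appear and how large the most frequent one is as the standard deviation grows.

```python
import random
from collections import Counter


def sample_normal_ints(n, sd, seed=0):
    """Draw n samples from N(0, sd) and round them to integers (assumed setup)."""
    rng = random.Random(seed)
    return [round(rng.gauss(0, sd)) for _ in range(n)]


def spread_summary(n, sd, seed=0):
    """Summarize how concentrated the sample is for a given standard deviation."""
    counts = Counter(sample_normal_ints(n, sd, seed))
    return {
        "sd": sd,
        "distinct": len(counts),              # number of different keys
        "max_count": max(counts.values()),    # frequency of the heaviest key
    }


summaries = [spread_summary(1000, sd) for sd in (1, 10, 100)]
for summary in summaries:
    print(summary)
```

A small standard deviation packs the 1000 samples into a handful of heavy keys, while a large one spreads them over many keys with small counts. Both the number of keys competing for sketch cells and the size of the counts that collide change with the spread, which is what the plots below let us explore.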

In [5]:
opts4 = {
    'title': "Scatterplot of Errors",
    'ylabel': "Error",
    'xlabel': "Data Point",
    'ylim': (0, 500),}
print("Vary the standard deviation to see how the errors change!")

nbi.scatter(get_sample_sd, get_y_errors_sd, n=(1, 2000), eps=(0.01, 1, 0.01), delta=(0.01, 1, 0.01), sd=(0.01, 1000, 10), options=opts4)


Vary the standard deviation to see how the errors change!


VBox(children=(interactive(children=(IntSlider(value=1000, description='n', max=2000, min=1), FloatSlider(valu…

In [6]:
opts5 = {
    'title': "Distribution of Errors",
    'ylabel': "Count",
    'xlabel': "Error Magnitude",}
print("Vary the standard deviation to see how the error distribution changes!")
nbi.hist(get_data_for_hist_errors_sd, n=(1, 2000), eps=(0.01, 1, 0.01), delta=(0.01, 1, 0.01), sd=(0.01, 1000, 10), options=opts5)


Vary the standard deviation to see how the error distribution changes!


VBox(children=(interactive(children=(IntSlider(value=1000, description='n', max=2000, min=1), FloatSlider(valu…

# Optimization #1: The Learned Count-Min Sketch

One approach is simply to treat the heavy hitters and non-heavy-hitters separately. This is where we can motivate our data structure design with two ideas from the original Learned Index Structures paper: recursive models and auxiliary structures.
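The idea of treating heavy hitters separately can be sketched as follows. This is an illustration under assumptions, not the implementation behind `opt_1_learned`: `TinyCMS`, `HeavyHitterSplitSketch`, and the `is_heavy` predicate are hypothetical names, and a hand-written rule stands in for a learned model. Keys flagged as heavy get their own exact counters (the auxiliary structure), and everything else falls through to an ordinary Count-Min Sketch.

```python
import random
from collections import Counter


class TinyCMS:
    """Compact Count-Min Sketch for integer keys (illustrative only)."""

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.p = (1 << 61) - 1
        self.width = width
        self.hashes = [
            (rng.randrange(1, self.p), rng.randrange(self.p)) for _ in range(depth)
        ]
        self.rows = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        # Yield (row, column index) for the one cell this key maps to per row.
        for row, (a, b) in zip(self.rows, self.hashes):
            yield row, ((a * key + b) % self.p) % self.width

    def add(self, key):
        for row, i in self._cells(key):
            row[i] += 1

    def estimate(self, key):
        return min(row[i] for row, i in self._cells(key))


class HeavyHitterSplitSketch:
    """Route predicted heavy hitters to exact counters, the rest to a sketch."""

    def __init__(self, is_heavy, width, depth):
        self.is_heavy = is_heavy        # stand-in for a learned classifier
        self.exact = Counter()          # auxiliary structure: one counter per heavy key
        self.sketch = TinyCMS(width, depth)

    def add(self, key):
        if self.is_heavy(key):
            self.exact[key] += 1
        else:
            self.sketch.add(key)

    def estimate(self, key):
        if self.is_heavy(key):
            return self.exact[key]
        return self.sketch.estimate(key)


def is_heavy(key):
    # Hypothetical rule: integers near the mean of N(0, 50) are the frequent ones.
    return abs(key) <= 10


rng = random.Random(2)
data = [round(rng.gauss(0, 50)) for _ in range(2000)]

split = HeavyHitterSplitSketch(is_heavy, width=64, depth=3)
for x in data:
    split.add(x)

true_counts = Counter(data)
split_errors = {k: split.estimate(k) - c for k, c in true_counts.items()}
print("max error on heavy keys:",
      max(e for k, e in split_errors.items() if is_heavy(k)))
```

Keys routed to the exact counters are reported with zero error, and they no longer inflate the cells shared by the remaining keys. The cost is the extra memory for the counters plus whatever the predictor misclassifies.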


In [7]:
opts = {
    'title': "Count-Min Sketch",
    'ylabel': "Frequency",
    'xlabel': "Error",}
print("Vary the standard deviation to see how the error distribution changes!")
nbi.hist(opt_1_normal, sd=(0, 1000, 10), options=opts)


Vary the standard deviation to see how the error distribution changes!


VBox(children=(interactive(children=(IntSlider(value=500, description='sd', max=1000, step=10), Output()), _do…

In [10]:
opts = {
    'title': "Learned Count-Min Sketch",
    'ylabel': "Frequency",
    'xlabel': "Error",}
print("Vary the standard deviation to see how the error distribution changes!")
nbi.hist(opt_1_learned, sd=(0, 1000, 10), options=opts)


Vary the standard deviation to see how the error distribution changes!


VBox(children=(interactive(children=(IntSlider(value=500, description='sd', max=1000, step=10), Output()), _do…

In [11]:
opts = {
    'title': "Rule-Based Count-Min Sketch",
    'ylabel': "Frequency",
    'xlabel': "Error",}
print("Vary the standard deviation to see how the error distribution changes!")
nbi.hist(opt_2, sd=(0, 1000, 10), options=opts)


Vary the standard deviation to see how the error distribution changes!


VBox(children=(interactive(children=(IntSlider(value=500, description='sd', max=1000, step=10), Output()), _do…