
Lab 5 response

Yahoo and non-Enron accounts are where the interesting content is

Many of the messages are spam

Approximate Query Processing

Goal: low-latency queries on large datasets

Use a summary or subset of the data to approximately answer queries over the original data

Tricks

  • Histograms
  • Samples
  • Sketches

History lesson: DBs used to run query plans on a sample of the data for cost estimation. Later, the same trick resurfaced as approximate query processing to handle large datasets.

Histograms

Frequency count of values:

Salary    Count
0-10        5
10-20      10
20-30       5
30-40      12
>40        15

Answer

select count(*) from T where salary between 25 and 35
Estimate as 1/2 * 5 + 1/2 * 12 = 8.5
(assuming values are uniform within each bucket: the range covers half of the 20-30 bucket and half of the 30-40 bucket)

Nice for these aggregation queries.
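
A rough sketch of the bucket-interpolation estimate above (Python; bucket bounds copied from the table, the open-ended >40 bucket left out since it has no upper bound to interpolate over). Assumes values are uniform within each bucket:

    # Illustrative buckets mirroring the salary histogram above: (low, high, count)
    buckets = [
        (0, 10, 5),
        (10, 20, 10),
        (20, 30, 5),
        (30, 40, 12),
    ]

    def estimate_range_count(lo, hi, buckets):
        """Estimate count(*) where value between lo and hi."""
        total = 0.0
        for b_lo, b_hi, count in buckets:
            overlap = max(0.0, min(hi, b_hi) - max(lo, b_lo))
            # assume values are spread uniformly within each bucket
            total += count * overlap / (b_hi - b_lo)
        return total

    print(estimate_range_count(25, 35, buckets))  # 0.5*5 + 0.5*12 = 8.5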

Not so good for

  • Outliers: catch-all bins (e.g., >40) throw away outlier information. Could use exception lists, etc.
  • Non-numeric values (e.g., LIKE)
  • May fix a minimum granularity (usually use an equi-height histogram rather than equi-width)
  • Guarantees?
  • Joins
  • Combining multiple histograms (independence assumptions break if dimensions are correlated)
  • Multi-dimensional histograms are expensive

Sampling

  • Compute sample, use to answer query

given:

f: sampling fraction (|S| / size of the original data)
S: sample
g: answer computed on the sample
g': estimated answer over the original data

if Q(S) = g, then g' = g/f (for count- and sum-style queries)
Can figure out the right scaling for some Qs (mean, count, sum, etc)
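
A rough sketch of the scaling idea (Python); the toy table and sample size are made up for illustration:

    import random

    data = list(range(1, 10001))      # toy "full table" of salaries
    n = 100
    sample = random.sample(data, n)   # uniform sample without replacement
    f = n / len(data)                 # sampling fraction

    est_count = len(sample) / f            # counts scale up by 1/f
    est_sum = sum(sample) / f              # sums scale up by 1/f
    est_mean = sum(sample) / len(sample)   # means need no scaling

    print(est_count, est_sum, est_mean)
    print(len(data), sum(data), sum(data) / len(data))  # exact, for comparison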

Pros:

  • Sample retains non-numeric fields (so predicates like LIKE still work)
  • Correlations may be preserved

Cons:

  • Computing extrema (e.g., max, min) is hard
  • Sample size grows roughly quadratically with the desired accuracy (error shrinks as 1/sqrt(n)); see the sketch after this list
  • Rare items may not appear in the sample
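
A quick empirical check of that blow-up (Python, toy Gaussian data): the spread of the sample mean roughly halves each time the sample size quadruples.

    import random
    import statistics

    data = [random.gauss(50, 10) for _ in range(100_000)]  # toy data

    def std_error(n, trials=200):
        """Spread of the sample mean across repeated samples of size n."""
        means = [statistics.mean(random.sample(data, n)) for _ in range(trials)]
        return statistics.stdev(means)

    # error roughly halves each time n quadruples (1/sqrt(n) behaviour)
    for n in (100, 400, 1600):
        print(n, round(std_error(n), 3))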

How to actually use sampling?

At query time, read the data in a single pass and use Bernoulli sampling (keep each row independently with probability p). This doesn't guarantee the sample size!
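
A sketch of the single-pass Bernoulli approach (Python); note how the sample size varies between runs:

    import random

    def bernoulli_sample(rows, p):
        """Single pass: keep each row independently with probability p."""
        return [r for r in rows if random.random() < p]

    sizes = [len(bernoulli_sample(range(100_000), 0.01)) for _ in range(5)]
    print(sizes)  # around 1000, but varies run to run -- size is not guaranteed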

Reservoir sampling: a math trick to guarantee a fixed sample size in a single pass. To convince yourself it is uniform, consider how it works with sample size 1: keep the i-th item seen with probability 1/i.
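
A sketch of the standard one-pass reservoir algorithm (Python, sample size k):

    import random

    def reservoir_sample(stream, k):
        """One pass; every item ends up in the sample with probability k/n."""
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                reservoir.append(item)      # fill the reservoir first
            else:
                j = random.randint(0, i)    # replace a slot with probability k/(i+1)
                if j < k:
                    reservoir[j] = item
        return reservoir

    print(reservoir_sample(range(1_000_000), 5))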

Probably don't want to compute the sample at query time (that would require reading all the data, which is exactly what we are trying to avoid!). Instead, we can:

  • precompute and store samples, then choose sample to use at query time
  • (somehow) randomly pick blocks to read at query time
    • Problem: we want to avoid random IO!
    • trick works if rows are randomly distributed across blocks (then a sequential read of a few blocks behaves like a random sample)
    • randomized indices

Theoretical guarantees break down if you keep reusing a single sample that happened to be unlucky

  • Could use escape hatch and run over all the data
  • Could store multiple samples at multiple sizes

How to handle rare-subgroups?

  • Stratified sampling (see the sketch after this list)

      sample from group if size > k
      entire group if size < k
    
  • Pick attributes to stratify on using past workload queries

  • note: using past workloads is a classic database trick
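
A sketch of the size-cap rule above (Python); the grouping key and the cap k are illustrative:

    import random
    from collections import defaultdict

    def stratified_sample(rows, key, k):
        """Keep small groups whole; cap large groups at k rows."""
        groups = defaultdict(list)
        for row in rows:
            groups[key(row)].append(row)
        sample = []
        for group in groups.values():
            if len(group) <= k:
                sample.extend(group)                    # rare group: keep all
            else:
                sample.extend(random.sample(group, k))  # common group: sample k
        return sample

    # e.g. stratify employee rows on department so rare departments survive
    rows = [("eng", 100), ("eng", 120), ("eng", 95), ("sales", 90), ("hr", 70)]
    print(stratified_sample(rows, key=lambda r: r[0], k=2))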

BlinkDB@Berkeley/MIT uses stratified sampling

select avg(sal) from T 
where dept = 'eng' within X seconds

or 

select avg(sal) from T
where dept = 'eng' w/ conf 95%

Works as follows

workload -> pick lots of stratified samples
query -> pick samples to use -> run query on sample on Hive -> error estimation -> result

Sketches/Streaming algorithms

Very compact, and highly optimized for a single task (Bloom filter, count-min, etc)

Count-Min

Keep k hash functions and k hash tables

                 buckets
                0  1 …  m
hashtable 1     0  0 .. 0
…
hashtable k     0  0 .. 0
  • For value v, hash into each hashtable and increment the bucket.
  • At query time, estimate count(v) = min(hashtable_i[hash_i(v)]) over the k hash functions
  • Code of count-min is posted online. Super simple algorithm.

Really good for heavy hitters (really popular values). If everything occurs with roughly the same frequency, it is hard to distinguish values.
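
A minimal count-min sketch in Python; this is not the posted code, and the hashing scheme (Python's built-in hash salted with a per-row seed) is just for illustration:

    import random

    class CountMinSketch:
        """k hash functions, each over its own row of m counters."""
        def __init__(self, k=4, m=1024, seed=42):
            rng = random.Random(seed)
            self.m = m
            self.seeds = [rng.getrandbits(32) for _ in range(k)]
            self.tables = [[0] * m for _ in range(k)]

        def _bucket(self, seed, value):
            return hash((seed, value)) % self.m

        def add(self, value, count=1):
            # hash into each row and increment that bucket
            for seed, table in zip(self.seeds, self.tables):
                table[self._bucket(seed, value)] += count

        def estimate(self, value):
            # collisions only inflate counters, so take the min across rows
            return min(table[self._bucket(seed, value)]
                       for seed, table in zip(self.seeds, self.tables))

    cms = CountMinSketch()
    for word in ["a"] * 1000 + ["b"] * 10 + ["c"]:
        cms.add(word)
    print(cms.estimate("a"), cms.estimate("b"), cms.estimate("c"))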