# SDSC3001 - Assignment 3


## Question 1

In [None]:
import random


def reservoir_sampling(k, stream):
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill reservoir until we have k items
            reservoir.append(item)
        else:
            # Randomly decide whether to replace an item
            j = random.randint(0, i)  # Probability of keeping new item: k/(i+1)
            if j < k:
                reservoir[j] = item

    return reservoir

In [None]:
stream = range(1000)  # Simulate a data stream
sample_size = 5
result = reservoir_sampling(sample_size, stream)
print(f"Random sample of {sample_size} items:", result)

Random sample of 5 items: [425, 33, 491, 102, 978]


### Proof of correctness

Claim: at any time $t \ge k$, each of the first $t$ elements of the stream is in the reservoir with probability $\frac{k}{t}$. We prove this by induction on $t$.

Base case: when $t = k$, the reservoir holds exactly the first $k$ elements, so each of them is in the reservoir with probability $1 = \frac{k}{k}$.

Inductive step: assume that after processing $t - 1$ elements (with $t > k$), each element $x_i$, $i \le t - 1$, is in the reservoir with probability $\frac{k}{t-1}$. When the $t$-th element $x_t$ arrives, an index $j$ is drawn uniformly from $t$ possible values, and $x_t$ enters the reservoir only if $j < k$, that is, with probability $\frac{k}{t}$. It overwrites one particular slot with probability $\frac{1}{t}$, so an element $x_i$ that is already in the reservoir is not replaced by $x_t$ with probability $1-\frac{1}{t}$.

Therefore, the probability that $x_i$ is still a sample after step $t$ is the product of these two probabilities: $\frac{k}{t-1} \cdot \left(1-\frac{1}{t}\right) = \frac{k}{t-1} \cdot \frac{t-1}{t} = \frac{k}{t}$. Since $x_t$ itself is in the reservoir with probability $\frac{k}{t}$ as well, the claim holds for all $t$ elements.
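The result can also be checked empirically. The sketch below repeats the same sampling procedure many times on a short stream and compares each element's observed inclusion frequency with $\frac{k}{t}$. The `rng` parameter, the helper `inclusion_frequencies`, and the values `k=3`, `t=10`, `trials=20_000` are assumptions added for illustration only.

```python
import random
from collections import Counter


def reservoir_sampling(k, stream, rng=random):
    """Same procedure as the cell above; `rng` is an added parameter for reproducibility."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends: i + 1 possible values
            if j < k:
                reservoir[j] = item
    return reservoir


def inclusion_frequencies(k, t, trials, seed=0):
    """Estimate how often each of the t stream elements ends up in the reservoir."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        counts.update(reservoir_sampling(k, range(t), rng))
    return [counts[x] / trials for x in range(t)]


freqs = inclusion_frequencies(k=3, t=10, trials=20_000)
print("expected:", 3 / 10)
print("observed:", [round(f, 3) for f in freqs])
```

Every observed frequency should be close to $0.3$, for early and late elements alike.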

## Question 2

### Part A

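An itemset $I$ of size $m$ has $2^m - 1$ non-empty subsets, and each of them has support at least that of $I$. When mining the top-$k$ most frequent patterns, all of these subsets must therefore fit among the $k$ patterns (up to ties) whenever $I$ does:

$$
\begin{aligned}
2^m - 1 &\leq k \\
2^m &\leq k + 1 \\
m &\leq \log_2(k + 1) \\
\therefore\ m_{\max} &= \lfloor \log_2(k + 1) \rfloor
\end{aligned}
$$

The last step uses the fact that $m$ is an integer, so the largest admissible value is the floor of $\log_2(k + 1)$.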
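As a quick numerical check, the largest integer $m$ satisfying $2^m - 1 \le k$ can be computed directly. The helper name `max_pattern_length` is hypothetical, and the values of `k` are chosen only for illustration.

```python
def max_pattern_length(k):
    """Largest integer m with 2**m - 1 <= k, i.e. floor(log2(k + 1))."""
    # For k >= 0, (k + 1).bit_length() - 1 equals floor(log2(k + 1))
    # and avoids floating-point rounding.
    return (k + 1).bit_length() - 1


for k in (1, 3, 7, 500):
    m = max_pattern_length(k)
    # A pattern of size m fits within k patterns; one of size m + 1 does not.
    assert 2**m - 1 <= k < 2 ** (m + 1) - 1
    print(k, m)
```

For $k = 500$ this gives $m = 8$, since $2^8 - 1 = 255 \le 500 < 511 = 2^9 - 1$.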

### Part B

#### b.1

#### b.2

#### b.3

#### b.4
Question B4:

Set $k=500$. Run your Misra–Gries Algorithm on the "trans.txt" dataset and report the values of $L$ and $minSup(A)$ when setting $C=500000, 750000, 1000000$. To compute $minSup(A)$, you can refer to the file "patterns_Apriori.txt" containing all the frequent patterns of support at least $21$. Each line of "patterns_Apriori.txt" is in the form $id_1,id_2,...,id_l:sup$, where $id_1,id_2,...,id_l$ denotes a pattern $\{id_1,id_2,...,id_l\}$ and $sup$ is the support of this pattern. (Hint: the file "patterns_Apriori.txt" contains enough information. If your algorithm returns some pattern that is not in the "patterns_Apriori.txt" file, your algorithm is probably not implemented correctly.)
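For reference, the textbook Misra–Gries summary over a stream of single items can be sketched as below. This is only the classic item-level version, written under the assumption that the counter budget plays the role of $C$; it is not the pattern-level variant that b.1 to b.3 ask for, which is not shown here. The names `misra_gries` and `capacity` and the toy stream are illustrative.

```python
def misra_gries(stream, capacity):
    """Classic Misra-Gries summary keeping at most `capacity` counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < capacity:
            counters[item] = 1
        else:
            # Summary is full: decrement every counter and drop those that reach zero.
            # The incoming item itself is discarded in this step.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Toy stream: "a" is the only heavy item.
stream = ["a"] * 6 + ["b", "c", "d"]
summary = misra_gries(stream, capacity=2)
print(summary)
```

Each stored count never exceeds the true frequency and underestimates it by at most $\frac{n}{\text{capacity} + 1}$ for a stream of length $n$, because every decrement step discards $\text{capacity} + 1$ occurrences at once.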

In [1]:
import numpy as np
import polars as pl
import math
from collections import Counter
from itertools import combinations

In [2]:
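class FrequentPatterns:
    """Holds the transactions and the reference pattern supports."""

    def __init__(self):
        self.transactions = []  # one list of item ids per transaction
        self.patterns = {}  # pattern (tuple of item ids) -> support

    def load_data(self):
        # Each line of trans.txt is split on whitespace into integer item ids.
        with open("trans.txt") as f:
            for line in f:
                transaction = list(map(int, line.split()))
                self.transactions.append(transaction)

        # Each line of patterns_Apriori.txt has the form id_1,id_2,...,id_l:sup.
        with open("patterns_Apriori.txt") as f:
            for line in f:
                key, value = line.strip().split(":")
                key = tuple(map(int, key.split(",")))
                self.patterns[key] = int(value)

    def Misra_Gries(self, C, k=500): ...

In [3]:
frequent_patterns = FrequentPatterns()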
frequent_patterns.load_data()

In [4]:
frequent_patterns.patterns

{(0,): 70783,
 (10,): 40917,
 (42,): 40106,
 (24,): 38108,
 (30,): 33995,
 (2,): 32890,
 (4,): 32399,
 (7,): 27725,
 (23,): 22716,
 (22,): 22520,
 (55,): 22254,
 (1,): 20080,
 (43,): 17745,
 (151,): 17585,
 (36,): 17317,
 (81,): 16679,
 (40,): 16182,
 (26,): 15703,
 (54,): 15429,
 (0, 30): 14544,
 (5,): 14325,
 (45,): 13704,
 (19,): 13168,
 (3,): 12655,
 (18,): 12433,
 (11,): 11730,
 (66,): 11540,
 (89,): 10577,
 (72,): 10090,
 (144,): 10015,
 (53,): 9998,
 (125,): 9681,
 (47,): 9646,
 (32,): 9644,
 (21,): 9488,
 (20,): 9481,
 (0, 10): 9223,
 (6,): 9121,
 (63,): 8671,
 (8,): 8547,
 (37,): 8467,
 (67,): 7744,
 (87,): 7264,
 (109,): 7150,
 (34,): 6982,
 (108,): 6948,
 (104,): 6896,
 (123,): 6891,
 (0, 5): 6864,
 (162,): 6823,
 (129,): 6809,
 (136,): 6751,
 (0, 2): 6750,
 (24, 151): 6741,
 (27,): 6703,
 (0, 4): 6665,
 (22, 24): 6491,
 (767,): 6458,
 (2, 10): 6241,
 (48,): 6027,
 (59,): 6023,
 (85,): 5914,
 (77,): 5912,
 (2, 26): 5887,
 (33,): 5795,
 (24, 45): 5777,
 (19, 24): 5737,
 (17,)

In [None]:
frequent_patterns.Misra_Gries(500_000)

In [None]:
for count in [500_000, 750_000, 1_000_000]:
    frequent_patterns.Misra_Gries(count)