# CLAM Clustering: New Algorithm and Recurrence Relations

Given:

- $f$: a distance function
- $C$: a Cluster to partition
- $criteria$: user-defined continuation criteria

The algorithm partitions $C$ into child clusters $L$ and $R$ as follows:

1. $S \leftarrow$ a random sample of $\Big\lceil \sqrt{|C|} \Big\rceil$ points from $C$
2. $c \leftarrow$ the geometric median of $S$
3. Remove $c$ from $C$ and assign it as the center of $C$.
4. $l \leftarrow$ the point in $C$ farthest from $c$
5. $r \leftarrow$ the point in $C$ farthest from $l$
6. $L \leftarrow$ the points in $C$ closer to $l$ than to $r$
7. $R \leftarrow$ the remaining points in $C$
8. If $|L| > 2$ and $criteria(L)$ is true, recursively partition $L$
9. If $|R| > 2$ and $criteria(R)$ is true, recursively partition $R$

The key difference from the earlier algorithm is that the center $c$ is not passed down to the child clusters. This also changes the definition of a leaf cluster: a leaf contains either 1 point (which is its own center) or 2 points (one is the center, and the other, being a single point, cannot be split into further children).
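
The partitioning steps can be sketched in Python as follows. This is a minimal illustration, not the CLAM implementation: the `Cluster` container, the `build` and `count_clusters` names, and the use of tuples as points are all assumptions made for the example. It also assumes the points are distinct, and for brevity it applies `criteria` to the root as well as to the children.

```python
import math
import random
from dataclasses import dataclass, field
from typing import Callable

Point = tuple[float, ...]


@dataclass
class Cluster:
    """Hypothetical container for this sketch: a center and the child subtrees."""

    center: Point
    cardinality: int  # number of points in the cluster, center included
    children: list["Cluster"] = field(default_factory=list)


def build(
    points: list[Point],
    f: Callable[[Point, Point], float],
    criteria: Callable[[list[Point]], bool],
    rng: random.Random,
) -> Cluster:
    # Steps 1-2: geometric median of a random sample of ceil(sqrt(|C|)) points.
    sample = rng.sample(points, math.ceil(math.sqrt(len(points))))
    center = min(sample, key=lambda p: sum(f(p, q) for q in sample))
    cluster = Cluster(center, len(points))

    # Leaves (1 or 2 points) and clusters rejected by the criteria are not split.
    if len(points) <= 2 or not criteria(points):
        return cluster

    # Step 3: the center is removed and is not passed down to the children.
    rest = list(points)
    rest.remove(center)

    # Steps 4-5: the two poles.
    left_pole = max(rest, key=lambda p: f(center, p))
    right_pole = max(rest, key=lambda p: f(left_pole, p))

    # Steps 6-7: points strictly closer to the left pole go left, the rest go right.
    left = [p for p in rest if f(p, left_pole) < f(p, right_pole)]
    right = [p for p in rest if not (f(p, left_pole) < f(p, right_pole))]

    # Steps 8-9: recurse; the size and criteria checks happen at the top of build.
    cluster.children = [build(part, f, criteria, rng) for part in (left, right) if part]
    return cluster


def count_clusters(cluster: Cluster) -> int:
    """Total number of clusters in the subtree rooted at `cluster`."""
    return 1 + sum(count_clusters(child) for child in cluster.children)
```

Calling `build(points, math.dist, lambda pts: True, random.Random(0))` on a list of distinct coordinate tuples builds the full tree. Because every cluster consumes a distinct point as its center, `count_clusters` never exceeds the number of points.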

## Recurrence Relations

Let $T(n)$ be the number of clusters in the tree produced for a dataset containing $n$ points. The recurrence relation for $T(n)$ is given by:

1. Base Case (the leaf clusters): $T(1) = 1$ and $T(2) = 1$.
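2. Recursive Case (the parent clusters with $n > 2$). The center is removed and the remaining $n - 1$ points are split as evenly as possible, so for $k \geq 1$:
    - $T(2k + 1) = 1 + 2T(k)$ when $n = 2k + 1$ is odd
    - $T(2k + 2) = 1 + T(k + 1) + T(k)$ when $n = 2k + 2$ is even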
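
The recurrence can be evaluated directly. The sketch below is a minimal memoized version, assuming that the points left after removing the center are always split as evenly as possible; the name `tree_size` is illustrative and does not appear elsewhere in this notebook.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def tree_size(n: int) -> int:
    """Number of clusters T(n) for n points under evenly balanced binary splits."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if n <= 2:
        return 1  # leaf clusters: T(1) = T(2) = 1
    # The center is removed, then the remaining n - 1 points are split in two.
    half, extra = divmod(n - 1, 2)
    # One parent, plus a child of size half + extra and a child of size half.
    return 1 + tree_size(half + extra) + tree_size(half)


print([tree_size(n) for n in range(1, 8)])  # [1, 1, 3, 3, 3, 5, 7]
```

The recursion depth grows only logarithmically in `n`, so the top-down form is adequate for spot checks; a bottom-up table is the better fit when every value up to some maximum is needed.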

Every cluster has a distinct point as its center, so $T(n) \leq n$ by the pigeonhole principle. The two are equal when every leaf cluster contains exactly one point (its own center), which happens when $n = 2^h - 1$, where $h$ is the height of the tree counted in levels. At the other extreme, when every leaf cluster contains exactly two points, $T(n)$ approaches a lower bound of $\frac{2n}{3}$ for large $n$ (this happens when $n = 2^{h-1} + 2^h - 1$, because there are $2^{h-1}$ leaf clusters).
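
As a worked check of these two extremes: a complete binary tree with $h$ levels has $2^h - 1$ clusters, of which $2^{h-1}$ are leaves, so

$$
\begin{aligned}
\text{one point per leaf:} \quad n &= 2^h - 1, & \frac{T(n)}{n} &= \frac{2^h - 1}{2^h - 1} = 1 \\
\text{two points per leaf:} \quad n &= (2^h - 1) + 2^{h-1}, & \frac{T(n)}{n} &= \frac{2^h - 1}{3 \cdot 2^{h-1} - 1} \longrightarrow \frac{2}{3} \text{ as } h \to \infty
\end{aligned}
$$

For instance, $h = 3$ gives $n = 11$ and $T(11) = 7$, a ratio of about $0.636$, which is slightly below $\frac{2}{3}$.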

In [None]:
import pandas
import plotly.colors as pc
import plotly.graph_objects as go
from plotly.subplots import make_subplots


In [None]:
# pyright: reportUnknownMemberType=false

def compute_memo(min_n: int, max_n: int, b: int = 2) -> pandas.DataFrame:
    """Compute the memoization table for our recurrence relation.

    For a dataset of size n and a branching factor of b, the number of clusters in the tree T(n)
    is given by the following recurrence relations:
      - T(1) = 1 and T(2) = 1, the leaf clusters
      - T(n) = n for 3 <= n <= b + 1, a parent (1 cluster) whose n - 1 children are single-point leaves
      - T(n) = 1 + a * T(q + 1) + (b - a) * T(q) for n >= b + 2,
        where q = (n - 1) // b and a = (n - 1) % b

    Args:
        min_n: The minimum value of n to compute. This is to reduce noise in the output.
        max_n: The maximum value of n to compute.
        b: The branching factor.

    Returns:
        A pandas DataFrame with columns "n", "T(n)", and "T(n)/n".
    """
    memo = [0] * (max_n + 1)
    memo[0] = 0  # placeholder for an empty cluster; never read because q >= 1 below
    memo[1] = 1
    memo[2] = 1

    # Parents of single-point leaves: the center plus one leaf per remaining point.
    for n in range(3, b + 2):
        memo[n] = n

    for n in range(b + 2, max_n + 1):
        q = (n - 1) // b  # points given to each of the smaller children
        a = (n - 1) % b  # number of children that receive one extra point
        memo[n] = 1 + a * memo[q + 1] + (b - a) * memo[q]

    memo = memo[min_n:]  # keep T(min_n)..T(max_n) so the labels from enumerate line up
    ratios = [(n, t, t / n) for n, t in enumerate(memo, start=min_n)]
    (n, t, r) = tuple(zip(*ratios))

    return pandas.DataFrame({"n": n, "T(n)": t, "T(n)/n": r})


In [None]:
# pyright: reportUnknownMemberType=false

type Data = list[tuple[int, pandas.DataFrame]] | pandas.DataFrame


def make_plot(min_n: int, max_n: int, data: Data) -> go.Figure:
    """Plot T(n) and T(n)/n from either a single DataFrame (b = 2) or a list of (b, DataFrame) pairs."""
    if isinstance(data, pandas.DataFrame):
        data = [(2, data)]
        return _make_plot(min_n, max_n, data, False)
    return _make_plot(min_n, max_n, data, True)


def _make_plot(min_n: int, max_n: int, data: list[tuple[int, pandas.DataFrame]], reveal: bool) -> go.Figure:
    """Create the plots for the recurrence relations, using the same color for each branching factor and combining the legends."""
    # Assign a color for each branching factor
    bs = [b for b, _ in data]
    palette = pc.qualitative.Set1 if len(bs) <= len(pc.qualitative.Set1) else pc.qualitative.Plotly
    color_map = {b: palette[i % len(palette)] for i, b in enumerate(bs)}

    if reveal:
        titles = ["T(n, b) vs n for various branching factors b", "Ratio of clusters to data points: T(n, b)/n vs n for various branching factors b"]
    else:
        titles = ["T(n) vs n", "Ratio of clusters to data points: T(n)/n vs n"]

    fig = make_subplots(
        rows=1, cols=2,
        subplot_titles=titles,
        column_widths=[0.33, 0.67],
        horizontal_spacing=0.05
    )
    fig.add_shape(
        type="line",
        x0=min_n,
        y0=min_n,
        x1=max_n,
        y1=max_n,
        line=dict(color="Black", dash="dash"),
        row=1, col=1
    )
    # Dashed reference line on the ratio plot: 1/2 when comparing branching factors, 2/3 for b = 2 alone.
    lower_bound = 1 / 2 if reveal else 2 / 3
    fig.add_trace(
        go.Scatter(
            x=[min_n, max_n],
            y=[lower_bound, lower_bound],
            mode="lines",
            line=dict(color="Black", dash="dash"),
            name="Lower Bound",
            legendgroup="Lower Bound",
        ),
        row=1,
        col=2,
    )

    for b, tree_size_df in data:
        color = color_map[b]
        name = f"b={b}" if reveal else "T(n)"
        # First plot: T(n) vs n
        fig.add_trace(
            go.Scatter(x=tree_size_df["n"], y=tree_size_df["T(n)"], mode="lines", name=name, legendgroup=name, line=dict(color=color)),
            row=1, col=1
        )
        # Second plot: T(n)/n vs n
        fig.add_trace(
            go.Scatter(x=tree_size_df["n"], y=tree_size_df["T(n)/n"], mode="lines", name=name, legendgroup=name, line=dict(color=color), showlegend=False),
            row=1, col=2
        )

    fig.update_xaxes(type="log", title_text="n (log scale)", row=1, col=1)
    fig.update_yaxes(type="log", title_text="T(n) (log scale)", row=1, col=1)
    fig.update_xaxes(type="log", title_text="n (log scale)", row=1, col=2)
    fig.update_yaxes(title_text="T(n)/n", row=1, col=2)

    # Layout adjustments
    fig.update_layout(width=1600, height=600, showlegend=True)
    return fig


In [None]:
min_n = 10
max_n = 100_000

tree_size_df = compute_memo(min_n, max_n)


In [None]:
fig = make_plot(min_n, max_n, tree_size_df)
fig.show()


## Generalization to Arbitrary Branching Factor `b`

For a dataset of size $n$ and a branching factor of $b$, the number of clusters in the tree, $T(n)$, is given by the following recurrence relation:

1. Base Case (the leaf clusters): $T(1) = 1$ and $T(2) = 1$.
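2. Just before the base case (the parents of leaf clusters): $T(n) = n$ for $3 \leq n \leq b + 1$, since the parent keeps one point as its center and each of the remaining $n - 1$ points becomes a single-point leaf. This agrees with $T(3) = 1 + 2T(1) = 3$ in the binary case.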
3. Recursive Case (the parent clusters with $n > b + 1$): write $n - 1 = bq + a$ with $q \geq 1$ and $0 \leq a < b$. Then:
   - $T(1 + a + bq) = 1 + aT(q + 1) + (b - a)T(q)$.
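
To see where the ratio $T(n)/n$ bottoms out for a given $b$, the recursive case can be iterated along one family of sizes. The sketch below assumes, by analogy with the binary argument, that the smallest ratios occur in a complete $b$-ary tree in which every leaf holds two points; it uses only $T(2) = 1$ and the recursive case with $a = 0$. The helper name `worst_case_ratios` is hypothetical.

```python
from fractions import Fraction


def worst_case_ratios(b: int, levels: int) -> list[Fraction]:
    """Exact T(n)/n for complete b-ary trees with 1..levels levels and two-point leaves."""
    n, t = 2, 1  # a single leaf holding two points: T(2) = 1
    ratios = [Fraction(t, n)]
    for _ in range(levels - 1):
        # Recursive case with a = 0: a new center on top of b equal subtrees.
        n, t = 1 + b * n, 1 + b * t
        ratios.append(Fraction(t, n))
    return ratios


print(worst_case_ratios(2, 4))
# [Fraction(1, 2), Fraction(3, 5), Fraction(7, 11), Fraction(15, 23)]
```

For $b = 2$ the ratios climb toward $\frac{2}{3}$. Under the stated assumption, a short calculation gives the limit $\frac{b}{2b - 1}$ for general $b$, which tends to $\frac{1}{2}$ as $b$ grows.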

In [None]:
bs = list(range(2, 11))
data = [(b, compute_memo(min_n, max_n, b)) for b in bs]


In [None]:
fig = make_plot(min_n, max_n, data)
fig.show()


In [None]:
min_n = 10
max_n = 100_000
bs = list(range(2, 65))
data = [(b, compute_memo(min_n, max_n, b)) for b in bs]


In [None]:
fig = make_plot(min_n, max_n, data)
fig.show()
