You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Single-processor C++ implementation of an efficient algorithm for computing
$M(n)$, the number of distinct entries in the $n \times n$ multiplication table.
The Problem
$$M(n) = |{ij : 1 \le i, j \le n}|$$
Despite its elementary definition, $M(n)$ grows slower than $n^2$.
Erdős (1955) showed $M(n) = o(n^2)$, and Ford (2008) established
Equivalently, most integers less than $n^2$ are not products of two integers both at most $n$.
Algorithm
The implementation is based on Algorithm 4 of:
R. Brent, C. Pomerance, D. Purdum, J. Webster,
Algorithms for the Multiplication Table Problem,
arXiv:1908.04251 (2021).
This implementation goes beyond the above and incorporates the ability to
correct shapes from unit shifts of Algorithm 4.
The key identity is
$$M(n) = \sum_{k=1}^{n} (k - \delta(k)),$$
where $\delta(k)$ counts multiples of $k$ already in the $(k-1) \times (k-1)$ table.
Unit-shift strategy. For a fixed multiplier$m$, the divisor shape for
$mk$ changes predictably as $k$ increases by 1. When $k$ is prime the shape
shifts exactly (no correction needed); when $k$ is composite a small correction
accounts for extra divisors of $mk$ not in $M \cup kM$.
Variable-smoothness multipliers. A multiplier $m$ is included when
$m \cdot P^+(m) \le n$, where $P^+(m)$ is the largest prime factor of $m$.
Multipliers are processed in decreasing order of $\tau(m) = |{d : d \mid m}|$
so that high-divisor-count multipliers claim the most composite integers early,
minimising redundant work.
Complexity. The runtime is conjectured to be $O(n^{5/3} \log n)$, which
is superlinear — scaling from $n = 10^6$ to $n = 10^7$ increases the running
time by a factor of roughly $10^{5/3} \approx 46$, not 10.
Build
Requires a C++17 compiler. No external dependencies.
make
To enable SIMD acceleration on your platform, edit CXXFLAGS in the Makefile:
The value for $2^{30}-1$ is from Brent et al. (2021); the value for $2^{32}-1$ is new.
Memory
The bit-vector of $n$ bits is read and written with random-access patterns
throughout the entire multiplier sweep; performance degrades once it exceeds
the L3 cache. The uint32_t array of $k - \delta(k)$ values is written
with strided access (multiples of each multiplier $m$) during the sweep and
read sequentially once at the end to form the sum $M(n)$.
$n$
bit-vector
uint32_t array
$10^6$
125 KB
4 MB
$10^8$
12.5 MB
400 MB
$10^9$
125 MB
4 GB
For larger $n$
mtable performs all steps in a single pass on a single processor, which
becomes impractical beyond roughly $n \sim 10^8$. Scaling to $n = 2^{32}$
and beyond requires several architectural changes:
Decouple the pipeline steps. The single-pass structure of mtable
couples multiplier generation, the unit-shift sweep, corrections, and
accumulation. At large scale these become separate programs communicating
through files, allowing each stage to be tuned, restarted, or distributed
independently.
Segment the bit-vector. Due to its random-access pattern, the bit-vector
must fit in L3 cache to avoid severe performance degradation from cache misses.
For $n = 2^{32}$ the full bit-vector is 512 MB — far exceeding a typical L3
cache. Segmenting the sweep so that each pass works on a contiguous chunk
that fits in cache significantly improves throughput.
Distribute multipliers across processors. Each multiplier is largely
independent: it owns a disjoint set of composite integers and sweeps its own
portion of the bit-vector. Multipliers can therefore be partitioned across
nodes in a cluster, each node writing its share of the $k - \delta(k)$ values
to a memory-mapped file. An ownership bit-vector prevents two nodes from
counting the same composite twice.
Replace the val array with a memory-mapped file. When multipliers are
distributed across nodes, each node must write its $k - \delta(k)$ values
into a shared result array. Memory-mapping a single file on shared storage
allows all nodes to write to disjoint regions concurrently without explicit
coordination, and the completed file serves directly as input to the final
summation step.