{{ message }}
/ primesieve Public

# Segmented sieve of Eratosthenes

##### Clone this wiki locally

This page contains a step by step explanation of a simple but fast C++ implementation of the segmented sieve of Eratosthenes that generates the primes below n using operations and space. It runs significantly faster than a traditional sieve of Eratosthenes implementation due to its more efficient CPU cache usage i.e. it uses the CPU's L1 data cache size as its sieve array size. This ensures that less than 1 percent of the memory accesses will be cache misses. This implementation generates the primes below 10^9 in just 0.8 seconds (single-threaded) on an Intel Core i7-6700 3.4 GHz CPU.

# Initialization

We set the size of the sieve array named `segment_size` to be the same size as the CPU's L1 data cache size per core e.g. 32768 bytes. The segment size must not be smaller than the square root of the sieving limit otherwise the run-time complexity of the algorithm deteriorates.

```/// Set your CPU's L1 data cache size (in bytes) here
const int64_t L1D_CACHE_SIZE = 32768;

/// Generate primes using the segmented sieve of Eratosthenes.
/// This algorithm uses O(n log log n) operations and O(sqrt(n)) space.
/// @param limit  Sieve primes <= limit.
///
void segmented_sieve(int64_t limit)
{
int64_t sqrt = (int64_t) std::sqrt(limit);
int64_t segment_size = std::max(sqrt, L1D_CACHE_SIZE);```

# Outer sieving loop

We sieve one segment in each iteration of the outer sieving loop. At the beginning of each iteration we reset the sieve array and set the lower and upper bounds of the current segment. Hence the current segment corresponds to the interval [low, high].

```for (int64_t low = 0; low <= limit; low += segment_size)
{
std::fill(sieve.begin(), sieve.end(), true);

// current segment = [low, high]
int64_t high = low + segment_size - 1;
high = std::min(high, limit);```

# Generate the sieving primes

Before we can start sieving the current segment [low, high] we need to generate the sieving primes ≤ sqrt(high). We do this using a simple sieve of Eratosthenes implementation. When we find a new sieving prime we store it in the primes array and we also store the index of the first multiple that needs to be crossed off (in the segmented sieve) in the multiples array.

```// generate sieving primes using simple sieve of Eratosthenes
for (; i * i <= high; i += 2)
if (is_prime[i])
for (int64_t j = i * i; j <= sqrt; j += i)
is_prime[j] = false;

// initialize sieving primes for segmented sieve
for (; s * s <= high; s += 2)
{
if (is_prime[s])
{
primes.push_back(s);
multiples.push_back(s * s - low);
}
}```

# Sieve a segment

After having generated the sieving primes we can proceed with the actual prime sieving. We start crossing off the multiples of the first sieving prime 3 until we reach a multiple of 3 ≥ `segment_size`, when this happens we calculate the index of that multiple in the next segment using (multiple - segment_size) and store it in the multiples array. We then cross off the multiples of the other sieving primes using the same procedure.

```// sieve the current segment
for (std::size_t i = 0; i < primes.size(); i++)
{
int64_t j = multiples[i];
for (int64_t k = primes[i] * 2; j < segment_size; j += k)
sieve[j] = false;
multiples[i] = j - segment_size;
}```

# Identify the primes

Once we have crossed off the multiples of all sieving primes in the current segment we iterate over the sieve array and count (or print) the primes.

```for (; n <= high; n += 2)
if (sieve[n - low]) // n is a prime
count++;```

# Putting it all together

```/// @file     segmented_sieve.cpp
/// @author   Kim Walisch, <kim.walisch@gmail.com>
/// @brief    This is a simple implementation of the segmented sieve of
///           Eratosthenes with a few optimizations. It generates the
///           primes below 10^9 in 0.8 seconds (single-threaded) on an
///           Intel Core i7-6700 3.4 GHz CPU.

#include <iostream>
#include <algorithm>
#include <cmath>
#include <vector>
#include <cstdlib>
#include <stdint.h>

/// Set your CPU's L1 data cache size (in bytes) here
const int64_t L1D_CACHE_SIZE = 32768;

/// Generate primes using the segmented sieve of Eratosthenes.
/// This algorithm uses O(n log log n) operations and O(sqrt(n)) space.
/// @param limit  Sieve primes <= limit.
///
void segmented_sieve(int64_t limit)
{
int64_t sqrt = (int64_t) std::sqrt(limit);
int64_t segment_size = std::max(sqrt, L1D_CACHE_SIZE);
int64_t count = (limit < 2) ? 0 : 1;

// we sieve primes >= 3
int64_t i = 3;
int64_t n = 3;
int64_t s = 3;

std::vector<char> sieve(segment_size);
std::vector<char> is_prime(sqrt + 1, true);
std::vector<int64_t> primes;
std::vector<int64_t> multiples;

for (int64_t low = 0; low <= limit; low += segment_size)
{
std::fill(sieve.begin(), sieve.end(), true);

// current segment = [low, high]
int64_t high = low + segment_size - 1;
high = std::min(high, limit);

// generate sieving primes using simple sieve of Eratosthenes
for (; i * i <= high; i += 2)
if (is_prime[i])
for (int64_t j = i * i; j <= sqrt; j += i)
is_prime[j] = false;

// initialize sieving primes for segmented sieve
for (; s * s <= high; s += 2)
{
if (is_prime[s])
{
primes.push_back(s);
multiples.push_back(s * s - low);
}
}

// sieve the current segment
for (std::size_t i = 0; i < primes.size(); i++)
{
int64_t j = multiples[i];
for (int64_t k = primes[i] * 2; j < segment_size; j += k)
sieve[j] = false;
multiples[i] = j - segment_size;
}

for (; n <= high; n += 2)
if (sieve[n - low]) // n is a prime
count++;
}

std::cout << count << " primes found." << std::endl;
}

/// Usage: ./segmented_sieve n
/// @param n  Sieve the primes up to n.
///
int main(int argc, char** argv)
{
if (argc >= 2)
segmented_sieve(std::atoll(argv));
else
segmented_sieve(1000000000);

return 0;
}```

# How to run it

```# Compile using default C++ compiler
c++ -O3 segmented_sieve.cpp -o segmented_sieve

# Count the primes below 10^9
time ./segmented_sieve 1000000000```

# Using a bit array

The segmented sieve of Eratosthenes implementation presented before is both simple and fast. But what if we wanted to further optimize it so that it runs even faster? Well the next step is to change our implementation so that it uses a bit array instead of a byte array for sieving. The idea is to map numbers to bits in the sieve array e.g. we define that each bit in the sieve array corresponds to an odd number. Hence bit0 = 1, bit1 = 3, bit2 = 5, bit3 = 7 and so forth. This also means that each byte of the sieve array corresponds to a interval, as each byte has 8 bits byte0 = [0, 15], byte1 = [16, 31], byte2 = [32, 47] and so forth.

# Segmented bit sieving

This loop is very similar to the segmented byte sieving loop. But here when we cross off a multiple we unset the corresponding bit instead of setting the corresponding byte to false. We divide the multiple by 16 (right shift by 4) to find the corresponding byte and we use the `unset_bit` lookup table to mask off the bit corresponding to the multiple in the sieve array.

```// sieve segment using bit array
for (std::size_t i = 0; i < primes.size(); i++)
{
int64_t j = multiples[i];
for (int64_t k = primes[i] * 2; j < segment_size; j += k)
sieve[j >> 4] &= unset_bit[j & 15];
multiples[i] = j - segment_size;
}```

# Counting bits

Each 1 bit that remains in our sieve array after the sieving step corresponds to a prime. Instead of iterating over each bit and checking whether it is set or not we will use a precomputed popcnt lookup table that tells us how many bits are set in each byte, this is much faster. We also need to unset the bits in the sieve array that correspond to numbers > sieving limit. These are the most significant bits in the last byte of the sieve array.

```// unset bits > limit
if (high == limit)
{
int64_t bits = 0xff << (limit % 16 + 1) / 2;
sieve[sieve_size - 1] &= ~bits;
}

// count primes
for (int64_t n = 0; n < sieve_size; n++)
count += popcnt[sieve[n]];```

# Run the segmented bit sieve

```# Compile using default C++ compiler
c++ -O3 segmented_bit_sieve.cpp -o segmented_bit_sieve

# Count the primes below 10^9
time ./segmented_bit_sieve 1000000000```

# Why is bit sieving faster?

The performance of the segmented sieve of Eratosthenes mainly depends on CPU caching. If the sieve array fits into the CPU's L1 or L2 cache sieving will be extremely fast, if however the sieve array is larger than the CPU's cache size the sieving performance will deteriorate by up to 2 orders of magnitude. As the size of the sieve array is set to the square root of the sieving limit the sieve array will not fit into the CPU's cache if the sieving limit is sufficiently large e.g. 10^10. By using a bit array for sieving the sieve array uses less memory and hence it will fit into the CPU's caches even for larger sieving limits. Actually the bit sieving array will fit into the CPU's L1 cache up to ~ 10^11.4 and into the CPU's L2 cache up to ~ 10^13.8.

# Benchmarks

 x Prime Count Segmented byte sieve Intel Core i7-6700, 32 KiB L1 cache Segmented bit sieve Intel Core i7-6700, 32 KiB L1 cache 107 664,579 0.03s 0.03s 108 5,761,455 0.11s 0.11s 109 50,847,534 0.80s 0.65s 1010 455,052,511 11.10s 7.25s 1011 4,118,054,813 141.88s 88.47s 1012 37,607,912,018 1742.95s 1104.00s