# Pair counting for Tutorial: Aperiodic Data and Jackknives

This computes the pair counts and correlation functions for [the main tutorial notebook](tutorial.ipynb).
If you already have the pair counting results saved, you do not need to run this notebook again unless you change the settings affecting the correlation function and/or pair counts.

This part should be separate from the main notebook/script for two reasons:

- If `pycorr` pair counting is called from Python with multi-threading (OpenMP) enabled, it may disable multi-threading (OpenMP) in all subsequent external computations, including `RascalC`, harming their performance.
- Pair counts (usually) do not need to be recomputed before each `RascalC` run.

We will have to repeat the data preparation steps from [the main tutorial notebook](tutorial.ipynb):

## Preparing the catalogs

First of all, we make sure that repeated warnings are always shown rather than suppressed after their first occurrence.

In [None]:
import warnings
warnings.filterwarnings("always")

Set the filenames.

In [1]:
galaxies_filename = "mock_galaxy_DR12_CMASS_N_QPM_0001.rdzw"
randoms_filename = "mock_random_DR12_CMASS_N_50x1.rdzw"

Read the original files: text with RA, DEC, Z (redshift) and weight columns.
This, and many of the next cells, may take about 10 seconds because of the large number of randoms.

In [2]:
import numpy as np
from astropy.table import Table
galaxies = Table(np.loadtxt(galaxies_filename, usecols = range(4)), names = ["RA", "DEC", "Z", "WEIGHT"]) # ignore the last column, not sure what it is
randoms = Table(np.loadtxt(randoms_filename, usecols = range(4)), names = ["RA", "DEC", "Z", "WEIGHT"])

Compute the comoving distance within the fiducial (grid) cosmology. Here we use a utility function from the RascalC library to do this.

In [3]:
from RascalC.pre_process import comoving_distance_Mpch
Omega_m = 0.29; Omega_k = 0; w_DE = -1 # density parameters of matter and curvature, and the equation-of-state parameter of dark energy
galaxies["comov_dist"] = comoving_distance_Mpch(galaxies["Z"], Omega_m, Omega_k, w_DE)
randoms["comov_dist"] = comoving_distance_Mpch(randoms["Z"], Omega_m, Omega_k, w_DE)
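
For orientation, here is a minimal sketch of what such a utility might compute: the line-of-sight comoving distance in Mpc/h from a numerical integral of $1/E(z)$. The helper `comoving_distance_sketch` is hypothetical and is not the `RascalC` function; it neglects radiation and assumes the dark energy density is fixed by $\Omega_{\rm DE} = 1 - \Omega_m - \Omega_k$, so the library implementation may differ in details.

```python
import numpy as np

C_OVER_H0 = 2997.92458  # speed of light / (100 km/s/Mpc), i.e. the Hubble distance in Mpc/h

def comoving_distance_sketch(z, Omega_m, Omega_k=0.0, w_DE=-1.0, n_steps=4096):
    """Line-of-sight comoving distance in Mpc/h (hypothetical helper, not the RascalC function)."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    Omega_DE = 1.0 - Omega_m - Omega_k  # assumption: dark energy density from the closure relation
    z_grid = np.linspace(0.0, max(z.max(), 1e-6), n_steps + 1)
    a_inv = 1.0 + z_grid
    E = np.sqrt(Omega_m * a_inv**3 + Omega_k * a_inv**2 + Omega_DE * a_inv**(3.0 * (1.0 + w_DE)))
    integrand = 1.0 / E
    # cumulative trapezoidal integral of dz / E(z) from 0 to each grid point
    steps = 0.5 * (integrand[1:] + integrand[:-1]) * np.diff(z_grid)
    cumulative = np.concatenate(([0.0], np.cumsum(steps)))
    # interpolate the tabulated integral at the requested redshifts
    return C_OVER_H0 * np.interp(z, z_grid, cumulative)

distances = comoving_distance_sketch([0.0, 0.43, 0.7], Omega_m=0.29)
```

The distance is zero at $z = 0$ and grows monotonically with redshift; for an Einstein-de Sitter universe ($\Omega_m = 1$) the result can be checked against the analytic expression $2 (c/H_0) \left[1 - (1+z)^{-1/2}\right]$.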

Let us define a utility function for position formatting that will be useful on several more occasions.

In [4]:
def get_rdd_positions(catalog: Table) -> tuple[np.ndarray, np.ndarray, np.ndarray]: # utility function to format positions from a catalog
    return (catalog["RA"], catalog["DEC"], catalog["comov_dist"])
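
To illustrate what the (RA, DEC, distance) format encodes, the sketch below converts such positions to Cartesian coordinates. The function `rdd_to_cartesian` is a hypothetical helper written for this illustration, and the axis convention (x towards RA = 0, DEC = 0; z towards the north pole) is an assumption; nothing in this notebook requires calling it.

```python
import numpy as np

def rdd_to_cartesian(ra_deg, dec_deg, dist):
    """Convert RA, DEC (degrees) and distance to an (N, 3) array of x, y, z (hypothetical helper)."""
    ra = np.radians(ra_deg)
    dec = np.radians(dec_deg)
    x = dist * np.cos(dec) * np.cos(ra)
    y = dist * np.cos(dec) * np.sin(ra)
    z = dist * np.sin(dec)
    return np.stack([x, y, z], axis=-1)

# three toy points along the coordinate axes
xyz = rdd_to_cartesian(np.array([0.0, 90.0, 0.0]), np.array([0.0, 0.0, 90.0]), np.array([1.0, 2.0, 3.0]))
```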

Assign jackknife regions to both galaxies and randoms.

The `get_subsampler_xirunpc` function generates a $K$-means subsampler from `sklearn` under the hood through a `pycorr` interface, ensuring that it is run single-threaded.
If you call it in your own way, beware that it might run multi-threaded if `OMP_` environment variables are set, and that is known to impede the performance of pair counting or the main `RascalC` covariance computation later (due to OpenMP limitations).
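
If you are unsure whether such variables are set in your session, a quick check along these lines may help. The helper `list_openmp_settings` is hypothetical and only reads the environment; it does not change any threading behavior.

```python
import os

def list_openmp_settings(environ=os.environ):
    """Return the OMP_* environment variables found in the given mapping (hypothetical helper)."""
    return {name: value for name, value in environ.items() if name.startswith("OMP_")}

openmp_settings = list_openmp_settings()
print(openmp_settings if openmp_settings else "No OMP_ environment variables are set")
```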

$K$-means is nice in that it can generate a fixed number of regions of similar size with realistic survey geometry.
(However, it is not without issues, e.g. it does not guarantee similar completeness patterns that may affect the shot-noise rescaling.)
Previously, jackknife regions were assigned as `healpix` pixels (with their number controlled by the `NSIDE` variable), which is simpler but less balanced and less flexible.
In cubic/rectangular boxes, box subsamplers from `pycorr` may be useful.
If you have better ideas, feel free to use them and perhaps share with the code authors if they work out well.

In [5]:
from RascalC.pre_process import get_subsampler_xirunpc
n_jack = 60 # number of regions
subsampler = get_subsampler_xirunpc(get_rdd_positions(galaxies), n_jack, position_type = "rdd") # "rdd" means RA, DEC in degrees and then distance (corresponding to pycorr)
galaxies["JACK"] = subsampler.label(get_rdd_positions(galaxies), position_type = "rdd")
randoms["JACK"] = subsampler.label(get_rdd_positions(randoms), position_type = "rdd")
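
The fit-then-label pattern above can be illustrated with a bare-bones $K$-means written in plain `numpy`. This is only a sketch of the idea (Lloyd's algorithm on toy 3D points), not the `sklearn` or `pycorr` implementation; the function names, the toy data and the fixed number of iterations are all assumptions made for the example.

```python
import numpy as np

def label_points(points, centers):
    """Assign each point to its nearest center, like labelling randoms after fitting on galaxies."""
    points = np.asarray(points, dtype=float)
    dist2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return dist2.argmin(axis=1)

def fit_kmeans_centers(points, n_regions, n_iter=20, seed=42):
    """Plain Lloyd's algorithm returning the region centers (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    centers = points[rng.choice(len(points), n_regions, replace=False)]  # start from random points
    for _ in range(n_iter):
        labels = label_points(points, centers)
        for k in range(n_regions):
            members = points[labels == k]
            if len(members) > 0:  # keep the old center if a region became empty
                centers[k] = members.mean(axis=0)
    return centers

# toy "galaxies" in two well-separated clumps, and toy "randoms" covering the same areas
rng = np.random.default_rng(0)
toy_galaxies = np.concatenate([rng.normal(0.0, 1.0, (100, 3)), rng.normal(100.0, 1.0, (100, 3))])
toy_randoms = np.concatenate([rng.normal(0.0, 1.0, (300, 3)), rng.normal(100.0, 1.0, (300, 3))])

centers = fit_kmeans_centers(toy_galaxies, n_regions=2)  # fit on galaxies only
galaxy_labels = label_points(toy_galaxies, centers)
random_labels = label_points(toy_randoms, centers)  # randoms reuse the same regions
```

Fitting on the galaxies and then labelling both catalogs with the same centers keeps the region definitions identical for data and randoms, which is the property the jackknife assignment relies on.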

Select a smaller subset of randoms to make pair counting and `RascalC` importance sampling more feasible.

In [6]:
x_randoms = 10 # desired ratio of the number of randoms to the number of galaxies; the full random catalog has ≈50x as many points as the galaxy catalog
np.random.seed(42) # for reproducibility
randoms_subset = randoms[np.random.choice(len(randoms), x_randoms * len(galaxies), replace = False, p = randoms["WEIGHT"] / np.sum(randoms["WEIGHT"]))]
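
The weighted selection without replacement can be tried on toy data first. The sketch below uses the newer `numpy` `Generator` interface instead of the legacy `np.random.seed`, so it will not reproduce the exact subset chosen above; the toy weights and sizes are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # independent generator; does not touch the global random state
toy_weights = rng.uniform(0.5, 1.5, size=1000)  # made-up weights for 1000 toy randoms
n_keep = 100

# selection probability proportional to the weight, each point picked at most once
probabilities = toy_weights / toy_weights.sum()
kept_indices = rng.choice(len(toy_weights), size=n_keep, replace=False, p=probabilities)
```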

## Pair counts and correlation functions with [`pycorr`](https://github.com/cosmodesi/pycorr)

In addition to the galaxy and randoms catalogs, `RascalC` requires the random counts and correlation functions.
We use [`pycorr`](https://github.com/cosmodesi/pycorr), a wrapper over the fast pair-counting engine `Corrfunc` that saves all the counts and allows re-binning.
So it can nicely do in one go what used to require several scripts in the legacy version of this tutorial.
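
As a reminder of how pair counts turn into a correlation function, here is a sketch of the widely used Landy-Szalay estimator on plain arrays. We assume here that this is the estimator `pycorr` applies when randoms are supplied. The function `landy_szalay` is a hypothetical helper, and its normalization by simple products of weight sums ignores the self-pair corrections that a careful implementation would include.

```python
import numpy as np

def landy_szalay(DD, DR, RR, sum_w_data, sum_w_randoms):
    """Landy-Szalay estimate from raw weighted pair counts (illustrative helper, simplified normalization)."""
    DD_norm = DD / sum_w_data**2
    DR_norm = DR / (sum_w_data * sum_w_randoms)
    RR_norm = RR / sum_w_randoms**2
    return (DD_norm - 2 * DR_norm + RR_norm) / RR_norm

# toy counts in two bins: data distributed exactly like the randoms, then with doubled data-data pairs
RR = np.array([400.0, 800.0])
DR = RR / 2
xi_unclustered = landy_szalay(RR / 4, DR, RR, sum_w_data=10.0, sum_w_randoms=20.0)
xi_clustered = landy_szalay(RR / 2, DR, RR, sum_w_data=10.0, sum_w_randoms=20.0)
```

In the first case the normalized counts coincide and the estimate vanishes; doubling the data-data counts gives an estimate of 1 in both bins.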

We remind you to make sure your current Python environment has [the custom version of `Corrfunc`](https://github.com/cosmodesi/Corrfunc) as suggested for [`pycorr`](https://github.com/cosmodesi/pycorr) (see also [the Installing dependencies section](tutorial.ipynb#Installing-dependencies) at the beginning of the main tutorial notebook).

Pair counting requires the same catalogs as the main `RascalC` covariance computation.
To minimize the repetitions and/or avoid splitting in this tutorial, we decided to incorporate the `pycorr` run here.
This required some extra tricks to avoid the OpenMP (multi-threading) interference, even more so to make this run safely in a Jupyter notebook (and not just in a script).

In a "production" run, you might want to set up the `pycorr` pair counting separately.
We typically do so, noting that it can be performed on GPU while the main covariance computation with `RascalC` proper requires CPU only.
But it is **very important** to reproduce the pre-processing (including jackknife assignment) for `RascalC` inputs in the same way it was done for pair counting.

First, we choose whether to use full randoms or a smaller subset for pair counting.
The latter is faster but less precise (and may cause convergence problems later).

In [7]:
randoms_for_counts = randoms
# randoms_for_counts = randoms_subset

We continue by splitting the randoms into parts of roughly the same size as data.
This gives high precision at fixed computing cost [(Keihänen et al. 2019)](https://arxiv.org/abs/1905.01133).

In [8]:
n_splits = int(np.rint(len(randoms_for_counts) / len(galaxies))) # the number of parts to split the randoms into
print(f"Splitting randoms into {n_splits} parts")

# split randoms into the desired number of parts randomly
random_indices = np.arange(len(randoms_for_counts))
np.random.seed(42) # for reproducibility
np.random.shuffle(random_indices) # random shuffle in place
random_parts = [randoms_for_counts[random_indices[i_random::n_splits]] for i_random in range(n_splits)] # quick way to produce parts of almost the same size

# normalize the weights in each part, since fluctuations in their sums may be a bit of a problem
for i_random in range(n_splits):
    random_parts[i_random]["WEIGHT"] /= np.sum(random_parts[i_random]["WEIGHT"])

Splitting randoms into 10 parts
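
The strided slicing used for the split can be checked on a small toy example. The sizes and weights below are made up, and the sketch uses a separate `numpy` `Generator` rather than the global seed, so it only mirrors the logic of the cell above.

```python
import numpy as np

n_items, n_parts = 23, 5  # toy sizes, deliberately not divisible
indices = np.arange(n_items)
rng = np.random.default_rng(42)
rng.shuffle(indices)  # random shuffle in place

# part i takes every n_parts-th shuffled index starting from position i
parts = [indices[i_part::n_parts] for i_part in range(n_parts)]
part_sizes = [len(part) for part in parts]
print(part_sizes)  # [5, 5, 5, 4, 4]

# normalize made-up weights within each part so that every part sums to 1
toy_weights = rng.uniform(0.5, 1.5, size=n_items)
normalized = [toy_weights[part] / toy_weights[part].sum() for part in parts]
```

Every index ends up in exactly one part, and the part sizes differ by at most one.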


Settings for pair counts:

In [9]:
n_threads = 10 # number of threads for pycorr computation, feel free to adjust according to your CPU

split_above = 20 # Mpc/h. Below this, will use concatenated randoms. Above, will use split.
s_max = 200 # maximal separation in Mpc/h
n_mu = 200 # number of angular (µ) bins
counts_filename = f"allcounts_mock_galaxy_DR12_CMASS_N_QPM_0001_lin_njack{n_jack}_nran{n_splits}_split{split_above}.npy" # filename to save counts

Define the function to compute the counts, using split randoms at larger separations.

In [None]:
from pycorr import TwoPointCorrelationFunction, setup_logging
from tqdm import trange # nice progress bar

mu_edges = np.linspace(-1, 1, n_mu + 1) # make uniform µ bins between -1 and 1, or half as many bins between 0 and 1 after wrapping (will be done within RascalC wrapper)
s_edges_all = (np.arange(split_above + 1), np.arange(split_above, s_max + 1)) # 1 Mpc/h wide separation bins from 0 to s_max Mpc/h, separated to concatenated/split random regions. Can be rebinned to any bin width that divides split_above and s_max

def run_pair_counts():
    setup_logging()
    results = []
    # compute
    for i_split_randoms, s_edges in enumerate(s_edges_all):
        result = 0
        D1D2 = None # compute the data-data counts on the first pass only and reuse them afterwards
        for i_random in trange(n_splits if i_split_randoms else 1, desc="Computing counts with random part"):
            these_randoms = random_parts[i_random] if i_split_randoms else randoms_for_counts
            tmp = TwoPointCorrelationFunction(mode = 'smu', edges = (s_edges, mu_edges),
                                            data_positions1 = get_rdd_positions(galaxies), data_weights1 = galaxies["WEIGHT"], data_samples1 = galaxies["JACK"],
                                            randoms_positions1 = get_rdd_positions(these_randoms), randoms_weights1 = these_randoms["WEIGHT"], randoms_samples1 = these_randoms["JACK"],
                                            position_type = "rdd", engine = "corrfunc", D1D2 = D1D2, gpu = False, nthreads = n_threads)
            # "rdd" means RA, DEC in degrees and then distance
            D1D2 = tmp.D1D2 # once computed, becomes not None and will not be recomputed
            result += tmp
        results.append(result)
    corr = results[0].concatenate_x(*results) # join the unsplit and split parts
    corr.D1D2.attrs['nsplits'] = n_splits

    corr.save(counts_filename)
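
The comments on re-binning and µ wrapping can be made concrete with plain arrays. The sketch below is an illustration of the bookkeeping only and does not use the `pycorr` API; the helpers `allowed_bin_widths`, `rebin_s` and `wrap_mu` are hypothetical, and they operate on a raw 2D array of counts with separation bins along the first axis and µ bins along the second.

```python
import math
import numpy as np

def allowed_bin_widths(split_above, s_max):
    """Integer bin widths (in Mpc/h) that divide both split_above and s_max."""
    common = math.gcd(split_above, s_max)
    return [width for width in range(1, common + 1) if common % width == 0]

def rebin_s(counts, factor):
    """Sum every `factor` adjacent separation bins (rows)."""
    n_s, n_mu = counts.shape
    if n_s % factor != 0:
        raise ValueError("the number of separation bins must be divisible by the factor")
    return counts.reshape(n_s // factor, factor, n_mu).sum(axis=1)

def wrap_mu(counts):
    """Fold bins at negative µ onto their mirror bins at positive µ, halving the number of µ bins."""
    n_mu = counts.shape[1]
    if n_mu % 2 != 0:
        raise ValueError("the number of µ bins must be even")
    half = n_mu // 2
    return counts[:, half:] + counts[:, half - 1::-1]

print(allowed_bin_widths(20, 200))  # [1, 2, 4, 5, 10, 20]

toy_counts = np.arange(16, dtype=float).reshape(4, 4)  # 4 separation bins x 4 µ bins
toy_rebinned = rebin_s(toy_counts, 2)  # 2 separation bins x 4 µ bins
toy_wrapped = wrap_mu(toy_rebinned)  # 2 separation bins x 2 µ bins
```

Both operations only add counts together, so the total number of pairs is conserved.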

Run the pair counts.
The computation does take a while <!-- (about 10 minutes for me at NERSC login node; expect to see 1/10 progress in a few minutes) --> even with multi-threading (which I haven't managed to make work with `Corrfunc` on macOS yet).

In [11]:
run_pair_counts()

Computing xi with random part: 100%|██████████| 10/10 [09:07<00:00, 54.73s/it]


If the previous cell concluded successfully, you should be good to go back to [the main tutorial notebook](tutorial.ipynb) and load the saved pair counts.