In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

pd.set_option('display.float_format', '{:.4g}'.format)

Suppose we draw a sample of size 2 from a population of size 30. The statistics we're interested in are the range and the variance.

A (true) simple random sample of size $k$ from a population of size $n$ has chance $1/{n \choose k}$ of resulting in each of the ${n \choose k}$ possible subsets. The joint distribution of the number of times each possible sample is selected in $N$ independent random samples is therefore multinomial with ${n \choose k}$ categories, equal category probabilities $1/{n \choose k}$, and $N$ draws. The pigeonhole arguments prove that the actual distribution cannot be exactly multinomial, but how bad is the approximation? We test the multinomial hypothesis using the range of the category counts as the test statistic. Since we can't trust simulations to give an accurate null distribution (that's the problem we are studying!), we need to rely on asymptotics.
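
One way to get an asymptotic p-value for the range is sketched below. It is an assumption on our part, not necessarily the method behind the tables in this notebook. With $c$ equiprobable categories and $N$ draws, the range of the counts divided by $\sqrt{N/c}$ is approximately distributed as the range of $c$ independent standard normals, whose CDF is $c \int \phi(x)\,[\Phi(x+w) - \Phi(x)]^{c-1}\,dx$. The function name `range_pvalue` is hypothetical.

```python
import numpy as np
from scipy import integrate, stats


def range_pvalue(observed_range, n_draws, n_categories):
    """Approximate P(range >= observed_range) under an equiprobable multinomial."""
    # Rescale so the statistic is comparable to the range of iid N(0, 1) variables.
    w = observed_range / np.sqrt(n_draws / n_categories)

    def integrand(x):
        # Density of the minimum at x times the chance that all others fall in [x, x + w].
        inner = stats.norm.cdf(x + w) - stats.norm.cdf(x)
        return stats.norm.pdf(x) * inner ** (n_categories - 1)

    # The standard normal density is negligible outside [-9, 9].
    integral, _ = integrate.quad(integrand, -9, 9, limit=200)
    cdf = n_categories * integral
    # Clip: numerical error can push 1 - cdf slightly outside [0, 1].
    return float(min(max(1.0 - cdf, 0.0), 1.0))
```

For instance, `range_pvalue(2286, 5 * 10**7, 435)` evaluates a range of 2286 across the 435 possible samples. Clipping the result to $[0, 1]$ avoids tiny negative values caused by floating-point error when the CDF is numerically 1.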

We use PIKK to draw $5 \times 10^7$ samples and compare the following PRNGs: RANDU, Super-Duper, Mersenne Twister, and the SHA-256-based PRNG.
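
As a minimal sketch, and assuming PIKK stands for "permute indices and keep $k$": assign a pseudo-random key to each of the $n$ items, sort the items by key, and keep the first $k$. The function name `pikk` is hypothetical, and NumPy's default generator is used only for illustration; it is not one of the PRNGs compared here.

```python
import numpy as np


def pikk(n, k, rng):
    """Return k distinct indices from range(n) by sorting n pseudo-random keys."""
    keys = rng.random(n)         # one uniform key per population item
    return np.argsort(keys)[:k]  # indices of the k smallest keys


rng = np.random.default_rng(100)  # illustration only
sample = pikk(30, 2, rng)
```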

# First-order selection probabilities

Under the null, the number of times each individual item is drawn should follow a multinomial distribution with $(5 \times 2) \times 10^7 = 10^8$ draws and equal probabilities $1/30$.
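
The chi-squared statistic, its degrees of freedom, and the range of the item counts could be computed along the following lines. This sketch runs on synthetic counts drawn from the null distribution, not on the data summarized in `firstOrderSummary.csv`, and all variable names are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # illustration only; not one of the PRNGs under study
n_items, n_draws = 30, 10**8

# Synthetic item counts drawn from the null distribution itself.
counts = rng.multinomial(n_draws, np.full(n_items, 1 / n_items))

# chisquare defaults to equal expected frequencies, i.e. n_draws / n_items per item.
chi2, p_value = stats.chisquare(counts)
df = n_items - 1
count_range = counts.max() - counts.min()
```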

In [2]:
FO = pd.read_csv('firstOrderSummary.csv')
cols = ['PRNG', 'seed', 'Chi-squared', 'Df', 'P-value', 'Range', 'Range P-value']
FO[cols].sort_values(['Range P-value', 'P-value'])

Unnamed: 0,PRNG,seed,Chi-squared,Df,P-value,Range,Range P-value
4,MT,100,27.5,29,0.5447,8804,0.1436
11,MT_choice,233424280,30.56,29,0.3867,7612,0.4245
10,MT_choice,100,29.45,29,0.442,7552,0.4429
2,SD,100,21.89,29,0.8248,6921,0.6445
5,MT,233424280,21.89,29,0.8247,6850,0.6667
8,SHA256,233424280,23.42,29,0.7571,6733,0.7024
6,MT,429496729,19.17,29,0.9167,6324,0.8144
9,SHA256,429496729,17.55,29,0.9529,6078,0.869
12,MT_choice,429496729,22.32,29,0.8069,5695,0.9319
7,SHA256,100,16.47,29,0.9698,5211,0.9764


# Unique sample selection probabilities

Under the null, the number of times each sample is drawn should follow a multinomial distribution with $5 \times 10^7$ draws and equal probabilities $1/{30 \choose 2}$.

In [None]:
US = pd.read_csv('uniqueSampleSummary.csv')
cols = ['PRNG', 'seed', 'Chi-squared', 'Df', 'P-value', 'Range', 'Range P-value']
US[cols].sort_values(['Range P-value', 'P-value'])

Unnamed: 0,PRNG,seed,Chi-squared,Df,P-value,Range,Range P-value
0,RANDU,100,12160.0,434,0.0,8709,-4.441e-16
1,RANDU,233424280,11920.0,434,0.0,8517,-4.441e-16
10,MT_choice,100,407.4,434,0.8156,2286,0.08578
7,SHA256,100,409.6,434,0.7942,2196,0.171
2,SD,100,355.3,434,0.9977,2189,0.1798
4,MT,100,454.7,434,0.2378,2174,0.1996
6,MT,429496729,493.7,434,0.02489,2089,0.341
11,MT_choice,233424280,461.0,434,0.1783,2026,0.4741
5,MT,233424280,429.9,434,0.546,1965,0.6143
9,SHA256,429496729,482.1,434,0.05502,1964,0.6166


## Selection frequencies for RANDU

In [None]:
randu_us = pd.read_csv('../rawdata/US_RANDU.csv', header=None)
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 4))
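
# Column 4 holds the seed and column 1 the selection count (inferred from their use below).
for axis, ss in zip(axes, randu_us[4].unique()):
    tmp = randu_us[randu_us[4] == ss]
    xax = np.arange(len(tmp.index))
    yax = np.sort(tmp[1])  # sorted counts, one bar per possible sample

    axis.bar(xax, yax, facecolor="b", edgecolor="b")
    axis.set_title('seed = ' + str(ss))
    axis.set_ylim(105000, 117500)
    # Expected count per sample under the null: 5e7 draws over 435 possible samples.
    # axhline's xmin/xmax are axes fractions in [0, 1]; the defaults span the full width.
    axis.axhline(y=5*10**7/435, color="red")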
plt.show()

## Selection frequencies for Super-Duper

In [None]:
sd_us = pd.read_csv('../rawdata/US_SD.csv', header=None)
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 4))
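
# Column 4 holds the seed and column 1 the selection count (inferred from their use below).
for axis, ss in zip(axes, sd_us[4].unique()):
    tmp = sd_us[sd_us[4] == ss]
    xax = np.arange(len(tmp.index))
    yax = np.sort(tmp[1])  # sorted counts, one bar per possible sample

    axis.bar(xax, yax, facecolor="b", edgecolor="b")
    axis.set_title('seed = ' + str(ss))
    axis.set_ylim(105000, 117500)
    # Expected count per sample under the null: 5e7 draws over 435 possible samples.
    # axhline's xmin/xmax are axes fractions in [0, 1]; the defaults span the full width.
    axis.axhline(y=5*10**7/435, color="red")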
plt.show()

## Selection frequencies for Mersenne Twister

### With PIKK

In [None]:
mt_us = pd.read_csv('../rawdata/US_MT.csv', header=None)
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(18, 4))
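
# Column 4 holds the seed and column 1 the selection count (inferred from their use below).
for axis, ss in zip(axes, mt_us[4].unique()):
    tmp = mt_us[mt_us[4] == ss]
    xax = np.arange(len(tmp.index))
    yax = np.sort(tmp[1])  # sorted counts, one bar per possible sample

    axis.bar(xax, yax, facecolor="b", edgecolor="b")
    axis.set_title('seed = ' + str(ss))
    axis.set_ylim(105000, 117500)
    # Expected count per sample under the null: 5e7 draws over 435 possible samples.
    # axhline's xmin/xmax are axes fractions in [0, 1]; the defaults span the full width.
    axis.axhline(y=5*10**7/435, color="red")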
plt.show()

### With np.random.choice

In [None]:
mt_us = pd.read_csv('../rawdata/US_MT_choice.csv', header=None)
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(18, 4))
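
# Column 4 holds the seed and column 1 the selection count (inferred from their use below).
for axis, ss in zip(axes, mt_us[4].unique()):
    tmp = mt_us[mt_us[4] == ss]
    xax = np.arange(len(tmp.index))
    yax = np.sort(tmp[1])  # sorted counts, one bar per possible sample

    axis.bar(xax, yax, facecolor="b", edgecolor="b")
    axis.set_title('seed = ' + str(ss))
    axis.set_ylim(105000, 117500)
    # Expected count per sample under the null: 5e7 draws over 435 possible samples.
    # axhline's xmin/xmax are axes fractions in [0, 1]; the defaults span the full width.
    axis.axhline(y=5*10**7/435, color="red")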
plt.show()

## Selection frequencies for SHA-256 PRNG

In [None]:
sha_us = pd.read_csv('../rawdata/US_SHA256.csv', header=None)
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(18, 4))
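
# Column 4 holds the seed and column 1 the selection count (inferred from their use below).
for axis, ss in zip(axes, sha_us[4].unique()):
    tmp = sha_us[sha_us[4] == ss]
    xax = np.arange(len(tmp.index))
    yax = np.sort(tmp[1])  # sorted counts, one bar per possible sample

    axis.bar(xax, yax, facecolor="b", edgecolor="b")
    axis.set_title('seed = ' + str(ss))
    axis.set_ylim(105000, 117500)
    # Expected count per sample under the null: 5e7 draws over 435 possible samples.
    # axhline's xmin/xmax are axes fractions in [0, 1]; the defaults span the full width.
    axis.axhline(y=5*10**7/435, color="red")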
plt.show()

# Range statistic

In [None]:
biases = pd.read_csv('statBias.csv')

First, let's look at the range of the sample of size 2. The population has one item with value $1$, one item with value $-1$, and 28 items with value $0$. The sample range is 2 with probability ${30 \choose 2}^{-1}$ (when we get both 1 and -1 in the sample), 1 with probability ${2 \choose 1}{28 \choose 1}/{30 \choose 2}$ (when we get exactly one of 1 or -1 in the sample), and 0 with probability ${28 \choose 2}/{30 \choose 2}$ (when we get neither 1 nor -1).

Thus, the expected sample range of a sample of size 2 from this population is $0.1333$ and the standard error over the $5 \times 10^7$ samples is $4.902 \times 10^{-5}$.
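
The two numbers above follow directly from the three-point distribution of the sample range. A short check, with illustrative variable names:

```python
from math import comb, sqrt

n_samples = 5 * 10**7
total = comb(30, 2)  # 435 possible samples of size 2

# Distribution of the sample range: value -> probability.
range_dist = {
    2: 1 / total,                         # both -1 and 1 are drawn
    1: comb(2, 1) * comb(28, 1) / total,  # exactly one of -1, 1 is drawn
    0: comb(28, 2) / total,               # neither is drawn
}

expected_range = sum(v * p for v, p in range_dist.items())
var_range = sum(v**2 * p for v, p in range_dist.items()) - expected_range**2
se_range = sqrt(var_range / n_samples)  # standard error of the mean over all samples
```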

We use two adversarial methods to construct the population. For the method "least freq sample," we identify the pair of items that occurs least frequently, then put $-1$ and $1$ on those items. We'd expect the sample range to be biased towards 0 in this case, since the only sample with range 2 is under-selected. For the method "extreme items," we identify the individual items occurring most often and least often throughout all the samples. We place a $-1$ on the least frequent item and $1$ on the most frequent item. This may introduce some bias, but its direction isn't clear.
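
The two constructions might be sketched as follows. The function names and input formats (a dict mapping item pairs to counts, and a sequence of per-item counts) are assumptions for illustration, not the code used to produce `statBias.csv`.

```python
import numpy as np


def least_freq_sample_population(pair_counts, n_items=30):
    """Put -1 and 1 on the two items of the least frequently drawn pair."""
    i, j = min(pair_counts, key=pair_counts.get)
    population = np.zeros(n_items)
    population[i], population[j] = -1, 1
    return population


def extreme_items_population(item_counts):
    """Put -1 on the least frequently drawn item and 1 on the most frequent one."""
    item_counts = np.asarray(item_counts)
    population = np.zeros(len(item_counts))
    population[np.argmin(item_counts)] = -1
    population[np.argmax(item_counts)] = 1
    return population


# Toy example with 3 items instead of 30.
toy_pairs = {(0, 1): 5, (0, 2): 3, (1, 2): 7}
pop_pair = least_freq_sample_population(toy_pairs, n_items=3)
pop_items = extreme_items_population([8, 12, 10])
```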

In [None]:
cols = ['PRNG', 'seed', 'method', 'Avg Sample Range', 'Range Bias', 'Range Bias/SE']
biases[cols].sort_values(['Range Bias/SE', 'PRNG', 'seed'], ascending = True)

# Sample variance

Next, we look at the variance of the sample of size 2. The sample variance is 2 with probability ${30 \choose 2}^{-1}$ (when we get 1 and -1 in the sample), 1/2 with probability ${2 \choose 1}{28 \choose 1}/{30 \choose 2}$ (when we get one of 1 or -1 in the sample), and 0 with probability ${28 \choose 2}/{30 \choose 2}$ (when we get neither 1 nor -1).

Thus, the expected sample variance of a sample of size 2 from this population is $0.06897$ and the standard error over the $5 \times 10^7$ samples is $2.706 \times 10^{-5}$.
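
These values can also be checked by brute force: enumerate all 435 equally likely samples and compute the sample variance of each. Variable names are illustrative.

```python
from itertools import combinations
from math import sqrt

import numpy as np

population = np.zeros(30)
population[0], population[1] = -1, 1  # which items carry -1 and 1 does not matter here

# Sample variance (ddof=1) of each of the 435 equally likely samples of size 2.
variances = np.array([np.var(pair, ddof=1) for pair in combinations(population, 2)])

expected_var = variances.mean()
# variances.var() uses ddof=0, i.e. the variance over the equally likely samples.
se_var = sqrt(variances.var() / (5 * 10**7))
```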

In [None]:
cols = ['PRNG', 'seed', 'method', 'Avg Sample Var', 'Var Bias', 'Var Bias/SE']
biases[cols].sort_values(['Var Bias/SE', 'PRNG', 'seed'], ascending = True)