# Innovative Statistical Methods for Small Data Sets

### 1. Bootstrapping

Bootstrapping means repeatedly resampling the observed data with replacement, so any observation can be selected more than once within a resample. It is a useful non-parametric technique for small datasets, where it supports tasks such as confidence interval estimation and hypothesis testing without assuming a particular distribution for the data.

In [2]:
import numpy as np

The following code shows how to perform bootstrapping:

In [4]:
#define the data
data = np.array([2.3, 1.9, 2.7, 2.8, 3.1])

#define the number of bootstrap samples
n_samples = 1000

#define statistic function to use
statistic = np.mean

In [8]:
#generate bootstrap samples
bootstrap_samples = np.random.choice(
       data,
       size=(n_samples, len(data)),
       replace=True
    )

bootstrap_samples

array([[3.1, 2.8, 2.3, 3.1, 2.3],
       [3.1, 1.9, 1.9, 1.9, 2.7],
       [1.9, 2.7, 2.3, 3.1, 1.9],
       ...,
       [2.8, 1.9, 2.8, 3.1, 2.7],
       [2.3, 1.9, 3.1, 2.8, 2.3],
       [3.1, 3.1, 2.7, 2.8, 1.9]], shape=(1000, 5))

In [None]:
#calculate the statistic for each bootstrap sample
bootstrap_statistics = np.apply_along_axis(
    statistic,
    axis=1,
    arr=bootstrap_samples
    )

bootstrap_statistics

In [10]:
#find the 2.5th and 97.5th percentiles of the bootstrap statistics
confidence_interval = np.percentile(bootstrap_statistics, [2.5, 97.5])

#output the confidence interval
print(f"95% Confidence Interval: {confidence_interval}")

95% Confidence Interval: [2.22 2.9 ]


The code above can be written as a function to make it reusable:

In [14]:
#write a function out of the bootstrap code
def bootstrap(data, n_samples, statistic):
    bootstrap_samples = np.random.choice(
        data,
        size=(n_samples, len(data)),
        replace=True
    )
    bootstrap_statistics = np.apply_along_axis(
        statistic,
        axis=1,
        arr=bootstrap_samples
    )
    confidence_interval = np.percentile(bootstrap_statistics, [2.5, 97.5])
    print(f"95% Confidence Interval: {confidence_interval}")

In [None]:
#run the function
bootstrap(data, n_samples, statistic)

95% Confidence Interval: [2.2185 2.9   ]
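
The two intervals above differ slightly because the resamples are random. A minimal sketch of a reproducible variant that returns the interval instead of printing it, assuming NumPy's `np.random.default_rng` with a fixed seed. The name `bootstrap_ci` and the `confidence` and `seed` parameters are illustrative and not part of the original notebook:

```python
import numpy as np

def bootstrap_ci(data, statistic=np.mean, n_samples=1000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval (illustrative helper)."""
    rng = np.random.default_rng(seed)  # seeded generator, so reruns give the same interval
    data = np.asarray(data)
    # draw n_samples resamples of the same size as the data, with replacement
    resamples = rng.choice(data, size=(n_samples, len(data)), replace=True)
    # evaluate the statistic on every resample (one row per resample)
    stats = np.apply_along_axis(statistic, 1, resamples)
    # cut off equal tails on both sides of the bootstrap distribution
    tail = (1 - confidence) / 2 * 100
    lower, upper = np.percentile(stats, [tail, 100 - tail])
    return lower, upper

data = np.array([2.3, 1.9, 2.7, 2.8, 3.1])
lower, upper = bootstrap_ci(data)
print(f"95% Confidence Interval: [{lower:.3f}, {upper:.3f}]")
```

Returning the bounds makes the result usable in later calculations, and changing `seed` shows how much the interval moves between runs.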


### 2. Permutation Tests

A permutation test is a non-parametric test of the null hypothesis that two populations have the same distribution, and it is suitable for small datasets.  
It compares the observed difference in means between the two samples with the distribution of differences obtained by repeatedly shuffling the pooled data and splitting it into two groups of the original sizes, which is what would be expected if the null hypothesis were true.  
The two-sided p-value is the proportion of permuted differences whose absolute value is greater than or equal to the absolute observed difference.


The code below shows how to perform a permutation test:

In [19]:
#define the datasets
data1 = np.array([2.3, 1.9, 2.7])
data2 = np.array([2.8, 3.1, 3.4])

In [20]:
#return the observed difference in means
observed_difference = np.mean(data1) - np.mean(data2)

In [21]:
#combine both datasets
combined_data = np.concatenate([data1, data2])

In [22]:
#variables for the permutation test
n_permutations = 10_000
count = 0

In [24]:
#run permutation test
for _ in range(n_permutations):
    #shuffle combined data
    np.random.shuffle(combined_data)
    #split data into 2 groups
    split_data1 = combined_data[:len(data1)]
    split_data2 = combined_data[len(data1):]
    #find permuted difference in means
    permuted_difference = np.mean(split_data1) - np.mean(split_data2)

    #check if difference is as extreme as the observed difference
    if abs(permuted_difference) >= abs(observed_difference):
        count += 1

In [25]:
# find p-value
p_value = count / n_permutations

#output result
print(f"Observed Difference: {observed_difference}\nP-value: {p_value}")

Observed Difference: -0.8000000000000003
P-value: 0.0468


Function version of the permutation test:

In [26]:
#writing a function out of the permutation test code
def permutation_test(data1, data2, n_permutations=10_000):
    observed_difference = np.mean(data1) - np.mean(data2)
    combined_data = np.concatenate([data1, data2])
    count = 0

    for _ in range(n_permutations):
        np.random.shuffle(combined_data)
        permuted_difference = np.mean(combined_data[:len(data1)]) - np.mean(combined_data[len(data1):])
        if abs(permuted_difference) >= abs(observed_difference):
            count += 1
    
    p_value = count / n_permutations
    return observed_difference, p_value

In [27]:
#run function
observed_diff, p_value = permutation_test(data1, data2)

print(f"Observed Difference: {observed_diff}\nP-value: {p_value}")

Observed Difference: -0.8000000000000003
P-value: 0.0431
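
With only six observations, every possible split can be enumerated instead of sampled at random. A minimal sketch of such an exact test, assuming a small tolerance `tol` so that splits tying with the observed difference are still counted despite floating-point rounding. The name `exact_permutation_test` and the `tol` parameter are illustrative and not part of the original notebook:

```python
import numpy as np
from itertools import combinations

def exact_permutation_test(data1, data2, tol=1e-12):
    """Two-sided exact permutation test on the difference in means (illustrative helper)."""
    combined = np.concatenate([data1, data2])
    n1, n = len(data1), len(combined)
    observed = abs(np.mean(data1) - np.mean(data2))
    count = total = 0
    # visit every way of choosing which n1 observations form the first group
    for idx in combinations(range(n), n1):
        mask = np.zeros(n, dtype=bool)
        mask[list(idx)] = True
        diff = abs(combined[mask].mean() - combined[~mask].mean())
        # tol lets splits that tie with the observed difference count despite rounding
        if diff >= observed - tol:
            count += 1
        total += 1
    return count / total, total

data1 = np.array([2.3, 1.9, 2.7])
data2 = np.array([2.8, 3.1, 3.4])
p_value, n_splits = exact_permutation_test(data1, data2)
print(f"Exact P-value: {p_value} from {n_splits} splits")
```

For this data only 2 of the 20 splits are as extreme as the observed one (the observed split and its mirror image), so the exact two-sided p-value is 0.1. The random-shuffle estimates above fall well short of that, which suggests that splits tying with the observed difference are being lost to floating-point rounding in the `>=` comparison. Subtracting a small tolerance, as in the sketch, avoids this.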


### 3. Jackknife Resampling

Jackknife resampling is another non-parametric technique, used to estimate the bias and variance of a statistic from a small dataset.  
It works by removing one observation at a time from the dataset and recalculating the statistic on the remaining data.  
Repeating this for every observation gives n leave-one-out estimates, from which the bias and variance of the statistic are estimated.

The code below performs jackknife resampling step by step:

In [32]:
#define data and statistic function
data = np.array([2.3, 1.9, 2.7, 2.8, 3.1])
statistic = np.mean

n = len(data)

In [33]:
#compute jackknife samples
jackknife_samples = []

for i in range(n):
    #remove a data point at every iteration and find statistic for the remaining data
    reduced_sample = np.delete(data, i)
    jackknife_samples.append(statistic(reduced_sample))

jackknife_samples = np.array(jackknife_samples)

In [34]:
#find the jackknife mean
jackknife_mean = np.mean(jackknife_samples)

#find jackknife variance
jackknife_variance = (n - 1) * np.mean((jackknife_samples - jackknife_mean) ** 2)

In [35]:
# Output the jackknife mean and variance
print("Jackknife Mean:", jackknife_mean)
print("Jackknife Variance:", jackknife_variance)

Jackknife Mean: 2.56
Jackknife Variance: 0.04360000000000007


Function for performing jackknife resampling:

In [30]:
def jackknife(data, statistic=np.mean):
    n = len(data)
    jackknife_samples = np.array([statistic(np.delete(data, i)) for i in range(n)])
    jackknife_mean = np.mean(jackknife_samples)
    jackknife_variance = (n - 1) * np.mean((jackknife_samples - jackknife_mean) ** 2)
    return jackknife_mean, jackknife_variance

In [None]:
mean, variance = jackknife(data)
print(f"Jackknife Mean: {mean}, Variance: {variance}")
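
The code above estimates the variance of the statistic but not its bias. A minimal sketch of the standard jackknife bias estimate, `(n - 1) * (mean of leave-one-out estimates - full-sample estimate)`, applied to the plug-in variance `np.var` (which divides by n) as an example of a biased statistic. The name `jackknife_bias` is illustrative and not part of the original notebook:

```python
import numpy as np

def jackknife_bias(data, statistic):
    """Jackknife bias estimate and bias-corrected estimate (illustrative helper)."""
    data = np.asarray(data)
    n = len(data)
    full_estimate = statistic(data)
    # leave-one-out estimates: drop observation i, recompute the statistic
    loo = np.array([statistic(np.delete(data, i)) for i in range(n)])
    bias = (n - 1) * (loo.mean() - full_estimate)
    return bias, full_estimate - bias

data = np.array([2.3, 1.9, 2.7, 2.8, 3.1])
bias, corrected = jackknife_bias(data, np.var)  # np.var defaults to ddof=0
print(f"Bias: {bias:.4f}, corrected variance: {corrected:.4f}")
print(f"Unbiased sample variance: {np.var(data, ddof=1):.4f}")
```

For the plug-in variance, the bias-corrected jackknife estimate coincides with the usual unbiased sample variance (`ddof=1`). For the mean, the estimated bias is zero up to rounding.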

### 4. Sign Test

The sign test is a non-parametric test of whether the sample median differs significantly from a hypothesized median.  
It is suited to small datasets and, unlike parametric tests, does not assume a particular distribution for the data.  
It is performed by counting the data points above and below the hypothesized median (values equal to it are discarded) and taking the smaller count as the test statistic.  
The significance is calculated by comparing the test statistic against the binomial distribution with success probability 0.5.

In [28]:
from scipy.stats import binom

In [29]:
data = [12, 15, 14, 16, 13, 10]
hypothesized_median = 14

above_hypothesized = sum(value > hypothesized_median for value in data)
below_hypothesized = sum(value < hypothesized_median for value in data)
n = above_hypothesized + below_hypothesized

#double the smaller tail for a two-sided test; cap at 1 because doubling can exceed it
p_value = min(1.0, 2 * binom.cdf(min(above_hypothesized, below_hypothesized), n, .5))
print(f"P-value: {p_value}")

P-value: 1.0
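
A minimal sketch of the same test as a reusable function, cross-checked against `scipy.stats.binomtest` (available in SciPy 1.7 and later). The name `sign_test` is illustrative and not part of the original notebook:

```python
from scipy.stats import binom, binomtest

def sign_test(data, hypothesized_median):
    """Two-sided sign test; values equal to the hypothesized median are dropped."""
    above = sum(value > hypothesized_median for value in data)
    below = sum(value < hypothesized_median for value in data)
    n = above + below
    # doubling the smaller tail can exceed 1, so cap it
    p_value = min(1.0, 2 * binom.cdf(min(above, below), n, 0.5))
    return above, below, p_value

data = [12, 15, 14, 16, 13, 10]
above, below, p_value = sign_test(data, 14)
# cross-check against SciPy's exact binomial test
check = binomtest(min(above, below), above + below, p=0.5).pvalue
print(f"above={above}, below={below}, P-value: {p_value}, binomtest: {check}")
```

With two values above and three below the hypothesized median, both calculations give a p-value of 1.0, so the data offer no evidence against a median of 14.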
