
# Building empirical interval estimates and computing intervals with the "Bootstrap" 

We demonstrate a nonparametric bootstrap to compare intervals over a sample of the embedding vectors in the "small" dataset.

As a statistic, we look at the pairwise difference of random vectors in three subsets - all positive, all negative and positive -- negative pairs.  For each of these subsets we take a sample of N (200), and show histograms of the differences. From these we can create interval estimates of the values. 

class 9
JMA March 2024

In [36]:
import pandas as pd
import numpy as np
import os
# For vector norms, and determinants. 
import numpy.linalg as la

### Plotting
import seaborn as sns
from bokeh.plotting import figure, show
from bokeh.models import Band, BoxAnnotation, ColumnDataSource, VBar, Span, Legend
from bokeh.io import output_notebook
from bokeh.palettes import Category10
output_notebook()

### use to create the embeddings
# See  https://www.sbert.net/
from sentence_transformers import SentenceTransformer, util # use sbert's cosine similarity measure
EMBEDDING_DIMENSIONS = 384     # This model creates vectors of this length

### Stats
# use statsmodels to compute regression standard errors
# since scikit learn doesn't compute them.
import statsmodels.api as sm
# Principal component analysis
from sklearn.decomposition import PCA, TruncatedSVD
# Distributions
from scipy.stats import norm
# The random number generator can be used to select a random set of columns for visualization
from numpy.random import Generator, PCG64
rng = Generator(PCG64())

In [111]:
# A plot that shows the difference between the normal approximation interval and an equivalent coverage empirical interval, 

def expand_range(low, high, by=2.0):
    mid = (low + high)/2
    half_width = by * abs(high - low)/2
    return (mid - half_width, mid + half_width)

def plot_intervals(the_sample, bin_ct = 50, subtitle = ''): 
    normal_approx = interval_estimate(the_sample)
    lower_limit = min(normal_approx['normal_interval'][0], normal_approx['quantile_interval'][0])
    upper_limit = max(normal_approx['normal_interval'][1], normal_approx['quantile_interval'][1])
    lower_plot_limit, upper_plot_limit = expand_range(lower_limit, upper_limit) 
    print('Limits ', lower_plot_limit, upper_plot_limit)
    p = figure(width = 800, height = 300, x_range = (lower_plot_limit,  upper_plot_limit ), title='Normal versus Empirical Interval Estimates.\n' + subtitle)

    hist, bin_edges = np.histogram(the_sample, bins=bin_ct, density=True)
    hist_df = pd.DataFrame(dict(density=hist, rv= bin_edges[:-1]), columns=['rv', 'density'])
    # Shade the background corresponding to the normal interval
    box = BoxAnnotation(left=normal_approx['normal_interval'][0], right=normal_approx['normal_interval'][1], bottom=0, top=max(1, max(hist)), fill_alpha=0.2, fill_color='#D55E00')
    p.add_layout(box)
    bar_width = (bin_edges[-1] - bin_edges[0])/bin_ct
    hist_src = ColumnDataSource(hist_df)
    
    glyph = VBar(x='rv', top='density', bottom=0, width=bar_width, fill_color='limegreen')
    p.add_glyph(hist_src, glyph)
    # Overlay a selection of the histogram within the estimate interval
    interval_src = ColumnDataSource(hist_df[(hist_df.rv > normal_approx['quantile_interval'][0]) & (hist_df.rv < normal_approx['quantile_interval'][1])])
    glyph = VBar(x='rv', top='density', bottom=0, width=bar_width, fill_color='darkgreen')
    p.add_glyph(interval_src, glyph)
    # Overlay the normal distribution within the interval. 
    x = np.linspace(normal_approx['normal_interval'][0], normal_approx['normal_interval'][1])
    p.line(x=x, y=norm.pdf(x, loc= normal_approx['mean'], scale = normal_approx['sd']))
    return p

# Examples using positive and negative subsets of the embedding vectors

Lets see if distances between embedding vectors within the positive and negative classes are significantly different than between classes. We compute random pair-wise distances within and between classes for comparison. 

## First load the embedding vectors and separate them into subsets

In [3]:
# Load the embedding vectors dataset 
if os.environ['HOME'] == '/Users/jma':
    featurized_parquet_file = '/Users/jma/Library/CloudStorage/OneDrive-Personal/' +\
    'teaching/sjsu/ISE201/ISE 201 - Math Foundations for Decision and Data Science/embeddings/' +\
    'sentence_embeddings.parquet'
else:
    featurized_parquet_file = 'data/sentence_embeddings.parquet'

text_df = pd.read_parquet(featurized_parquet_file)

In [4]:
# The review column contains varying-length text that was converted into the vector embeddings
# Sentiment is a 0, 1 label that classifies the text as either positive or negative
# The embedding vectors are stuffed into a the last column
text_df

Unnamed: 0,review,sentiment,vector
0,So there is no way for me to plug it in here i...,0,"[0.08027008920907974, -0.04396028444170952, -0..."
1,"Good case, Excellent value.",1,"[-0.009648566134274006, 0.10622689127922058, 0..."
2,Great for the jawbone.,1,"[-0.07081733644008636, 0.07361650466918945, 0...."
3,Tied to charger for conversations lasting more...,0,"[-0.0739610344171524, 0.06734045594930649, 0.0..."
4,The mic is great.,1,"[-0.09819574654102325, 0.010798277333378792, 0..."
...,...,...,...
743,I just got bored watching Jessice Lange take h...,0,"[-0.02032049186527729, -0.07333985716104507, 0..."
744,"Unfortunately, any virtue in this film's produ...",0,"[-0.025788182392716408, 0.007497682701796293, ..."
745,"In a word, it is embarrassing.",0,"[0.026193976402282715, 0.022175997495651245, 0..."
746,Exceptionally bad!,0,"[-0.027648691087961197, -0.004298456944525242,..."


In [5]:
# Create the positive and negative subsets
how_many_positive = text_df[text_df['sentiment'] == 1]['vector']
how_many_negative = text_df[text_df['sentiment'] == 0]['vector']

Xpositives = np.array([v for v in how_many_positive])
Xnegatives = np.array([v for v in how_many_negative])

Xpositives.shape, Xnegatives.shape, 

((1386, 384), (1362, 384))

In [7]:
# Create pairwise cosine distance samples - positive, negative and differences
sample_count = 2000

# Select pairwise differences 
def pair_wise_distance(X):
    pos_pair = np.random.choice(np.arange(len(X)), 2, replace=False)
    return util.cos_sim(X[pos_pair[0],:], X[pos_pair[1],:]).tolist()[0][0]

def contrast_pair_distance(X,Y):
    pos = np.random.choice(np.arange(len(X)), 1, replace=False)
    neg = np.random.choice(np.arange(len(Y)), 1, replace=False)
    return util.cos_sim(X[pos,:], X[neg,:]).tolist()[0][0]

pos_distances = [pair_wise_distance(Xpositives) for k in range(sample_count)]
neg_distances = [pair_wise_distance(Xnegatives) for k in range(sample_count)]
contrast_distances = [contrast_pair_distance(Xpositives, Xnegatives) for k in range(sample_count)]

# Check the mean cosine distances of each sample 
# np.apply_along_axis(la.norm, 1, positive_samples), np.apply_along_axis(la.norm, 1, negative_samples)
np.array(pos_distances).mean(), np.array(neg_distances).mean(), np.array(contrast_distances).mean()

(0.13920222591322096, 0.12436748612423594, 0.14137404903581174)

## Two kinds of intervals - "normal distribution"  and "empirical distribution"

To express the "standard error" around the mean of a distribution these are two comparable methods.  

1) Use the mean and standard deviation to create an interval with 95% of the probability mass.  

2) Compute percentiles of the empirical (histogram) and find the boundaries of the lower and upper 2.5% probability tails

Note how these appear when applied to the embedding distance distributions.  There is clearly no clean separation between positive and negative vectors! 

In [120]:
# Compute two different interval estimates
# The first assuming a normal distribution, using standard deviation
# The second computing the equivalent quantiles 0.025 and 0.975, from the empirical data. Note: For extreme distributions, such as those on a closed or half interval, this may not be the best approximation
def interval_estimate(data_list):
    data_ar = np.array(data_list)
    mn = data_ar.mean()
    sd = data_ar.std()
    return {'mean': mn, 'sd': sd,
        'normal_interval': (mn-1.96*sd, mn+1.96*sd), 
            'quantile_interval': (np.percentile(data_ar, 2.5), np.percentile(data_ar, 97.5))}

interval_estimate(pos_distances)

{'mean': 0.13920222591322096,
 'sd': 0.12135931687992416,
 'normal_interval': (-0.09866203517143038, 0.37706648699787226),
 'quantile_interval': (-0.05143368842438651, 0.42687950723485385)}

In [113]:
show(plot_intervals(pos_distances, subtitle='pair-wise positive distances'))

Limits  -0.36143280637457254 0.689650278437996


In [114]:
show(plot_intervals(neg_distances, subtitle='pair-wise negative distances'))

Limits  -0.31623396855466857 0.6098488752564705


In [112]:
show(plot_intervals(contrast_distances, subtitle='Distribution of differences'))

Limits  -0.3689014302474778 0.7060777896467401


### For comparison, here's what it looks like on a normally distributed sample

In [116]:
# Test this on a normal sample
normal_density = norm.rvs(loc=0, scale=1, size=1000)
p = plot_intervals(normal_density, subtitle='normal distribution')
show(p)

Limits  -3.964520783258499 4.037253746017703


# More sampling statistics

__An exercise is working with non-normal distributions__

Most cases there isn't an analytic solution for the standard error (SE) (the standard deviation of a statistic)  In finding an SE, a particular egregious case is the quotient of two random variables.  Assuming the above-derived positive and negative differences samples are independent, we create a new rv of their quotient. 

In [121]:
quotient_rv_sample = pd.DataFrame(np.array(pos_distances) / np.array(neg_distances))
quotient_rv_sample.describe()

Unnamed: 0,0
count,2000.0
mean,7.58381
std,242.684015
min,-764.039003
25%,0.211558
50%,0.834779
75%,1.974618
max,10749.860138


In [115]:
# Intervals for the quotient of two random variables. 
p = plot_intervals(quotient_rv_sample, bin_ct=300, subtitle='quotient distribution')
show(p)

Limits  -943.4996666219088 958.6672875759598


In [117]:
inverse_rv = np.divide(np.ones(np.array(pos_distances).shape), np.array(pos_distances))
p = plot_intervals(inverse_rv, bin_ct= 240, subtitle='inverse random variable distribution')
show(p)

Limits  -2426.7831245765447 2405.618174562098


In [118]:
squared_rv = np.multiply(np.array(pos_distances), np.array(pos_distances))
p = plot_intervals(squared_rv, bin_ct=180, subtitle='squared random variable distribution')
show(p)

Limits  -0.19106181221744112 0.30665547835379464


## An application of the Bootstrap. 

The "bootstrap" uses resampling given only a single sample, to find the standard error of statistics that summarize the sample. 

The _non-parametric_ bootstrap gives a general purpose alternative to using normal distribution approximations of mean and standard deviation to computing interval estimates.  

### Estimating the error interval for a statistic

In this example we consider the mean of the squared value of our existing sample.  The "underlying" distribution of this transformed random variable -- perhaps could be determined by theory; instead we estimate it by re-sampling the sample of squared values. 
By squaring the random variable it is limited to positive values, and the normal approximation fails. 
Here we show that by bootstrapping an interval for the mean of the squared value of a random variable we can still determine an error estimate fo the mean statistic. 


In [122]:
# Create a large number of bootstrap replications 

def bootstrap_replicate(the_sample, B=1000):
    replicates = np.zeros((len(the_sample), B))
    for k in range(B):
        replicates[:,k] = np.random.choice(the_sample, len(the_sample), replace=True)
    return replicates

bootstraps = bootstrap_replicate(squared_rv[1:80], B=2000)
print('Interval estimates: ', interval_estimate(bootstraps.mean(axis=0)))
p = plot_intervals(bootstraps.mean(axis=0), bin_ct=40, subtitle= 'Bootstrap applied to the squared rv mean')
show(p)

Interval estimates:  {'mean': 0.02224072874084596, 'sd': 0.003931812761745101, 'normal_interval': (0.014534375727825562, 0.029947081753866355), 'quantile_interval': (0.015062368209781282, 0.030587791233485827)}
Limits  0.00650766797499543 0.03861449898631596
