# Project Template: Regression Experiments using sentence embeddings

ISE 201, section 33

28 Mar 2024  JM Agosta

_A template to follow for your final class project._

Your presentation from this notebook should be divided into sections. Each section starts with a markdown cell that mentions the purpose of the following cells. Pls divide your computation into manageably sized cells. You can check intermediate results by an expression in the last line of the cell -- no need for a print function to enclose them.

The goal of this project is to consolidate the work you've done over the past projects, to run a regression experiment using sentence embedding data. An experiment works by

- picking an evaluation measure (a "statistic"),
- running a baseline evaluation of the regression,
- coming up with a presumed improvement to the regression that could be a generation or selection of features going into the regression,
- making a comparative evaluation to see if the improvement is "statistically" significant.

You can refer to the previous class notebooks that can be found on Google Drive, and feel free to share ideas that others have contributed in the class, that can be found in the notebooks they've shared.  

Design your notebook, with sections and visualizations that are self-explanatory (label your graph axes!) so that you make the best use of the 8 minutes you'll have to present.

## Setup. Import libraries.  Check if any need to be loaded that are missing.

In [229]:
import pandas as pd
import numpy as np
import re, os
# For vector norms, and determinants.
import numpy.linalg as la
# Plotting
import matplotlib as plt
import seaborn as sns

# javascript plots
from bokeh.plotting import figure, show
from bokeh.models import BoxAnnotation, ColumnDataSource, VBar, Span, Legend
from bokeh.io import output_notebook
from bokeh.palettes import Category10
output_notebook()

import sklearn.linear_model as sl

# Distributions
from scipy.stats import norm

# Principal component analysis
from sklearn.decomposition import PCA, TruncatedSVD
# The random number generator can be used to select a random set of columns for visualization
from numpy.random import Generator, PCG64
rng = Generator(PCG64())

In [230]:
# use statsmodels to compute OLS regression standard errors
# since scikit learn doesn't compute them.
# This update fails
# os.system('python -m pip install git+https://github.com/statsmodels/statsmodels')
import statsmodels.api as sm
sm.__version__

'0.14.1'

### Set globals

In [231]:
# Global variables
EMBEDDING_DIMENSIONS = 384     # This model creates vectors of this length.
TEST_SET_SIZE = 500
BOOTSTRAP_RESAMPLES = 200
RELOAD_PACKAGE = False         # Reload the embedding package

In [232]:
# If you need to rerun the sentence embeddings transformer, you'll have to reload the package since
# its not included in Colab.  Set RELOAD_PACKAGE to True to reload it.


# We'll use this dataset of 3000 labeled text samples.
# https://archive.ics.uci.edu/dataset/331/sentiment+labelled+sentences

# For converting the text into vectors, use this llm package
# See https://pypi.org/project/sentence-transformers/
if RELOAD_PACKAGE:
  try:
      from sentence_transformers import SentenceTransformer
  except:

      os.system('python -m pip install sentence-transformers')
      from sentence_transformers import SentenceTransformer



## Load the data, and run the sentence tranformers on it, if needed

If you're interested in looking at a larger set of 50,000 examples you can find them on the Kaggle site:

We've been using the data set as indexed by embeddings from ---a set of 3,000 items from 3 different sources, - "IMDB", Amazon, and Yelp reviews, on which the 'all-MiniLM-L6-v2' sentenceTransformer model generates 384 features.

If you need to download the dataset, go to:

    https://archive.ics.uci.edu/dataset/331/sentiment+labelled+sentences

If you want to use a larger set, here's an NLP data set from Kaggle  of 50,000 items.

To download the dataset, go to:

https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

You would need to download the "IMDB Dataset.csv" file - 66.21MB


In [233]:
# Run this if you want to run on Colab and use the already featurized data

if not os.path.exists('/content/drive'):
  from google.colab import drive
  drive.mount('/content/drive/')
os.chdir('/content/drive/MyDrive/ISE201/notebooks/data')
os.listdir(os.getcwd())

['3000_sentence_embeddings.parquet']

In [234]:
# Load the embedding vectors dataset
if os.environ['HOME'] == '/Users/jma':
    featurized_parquet_file = '/Users/jma/Library/CloudStorage/OneDrive-Personal/' +\
    'teaching/sjsu/ISE201/ISE 201 - Math Foundations for Decision and Data Science/embeddings/' +\
    'sentence_embeddings.parquet'
else:
  # Find the file on Googe Drive
    featurized_parquet_file = './3000_sentence_embeddings.parquet'

text_df = pd.read_parquet(featurized_parquet_file)

# Extract the outcome y vector
outcomes = text_df['sentiment'].values

# Extract the vector field and expand it to multiple rows.
n_samples, embedding_dim = text_df.shape
data_ar = np.empty((n_samples, EMBEDDING_DIMENSIONS), 'float')
for i in range(n_samples):
    x = text_df.values[i,2]
    data_ar[i] = x

# Check the data shapes: raw NLP data, and itsEMB sentence embeddings, outocomes
text_df.shape, data_ar.shape, outcomes.shape

((2748, 3), (2748, 384), (2748,))

In [235]:
# Split the data into a learning and holdout set
# Create a random sample, with replacement
sample_size = outcomes.shape[0]- TEST_SET_SIZE
learning_set = np.random.choice(len(data_ar),\
                                sample_size,\
                                replace=False)
learning_data = data_ar[learning_set,:]
learning_outcome = text_df['sentiment'].values[learning_set]

# Run predictions against a hold-out set
not_learning_set = list(set(text_df.index).difference(set(learning_set)))
holdout_data = data_ar[not_learning_set,:]
# holdout_data = sm.add_constant(holdout_data)
holdout_actuals = text_df['sentiment'].values[not_learning_set]
len(learning_set), sample_size, learning_outcome.shape

(2248, 2248, (2248,))

## Pick an evaluation metric & create visualization graphics

In [236]:
# An in-sample metric
def AIC(the_fit):
  'Use this to compute AIC on different models.'
  return 2* (the_fit.df_model+1) - 2*the_fit.llf

# A metric that can be used in-sample or out-of-sample
def one_zero_error(predicted, actual, test_threshold = 0.5):
  'The predicted is a continuous value, so it needs to be thresholded. '
  test_df = pd.DataFrame(dict(prediction=pd.Series(predicted >= test_threshold), actual=actual))
  test_error = sum(test_df.prediction != actual)/len(actual)
  return test_error

In [237]:
# A plot that shows the difference between the normal approximation interval and an equivalent coverage empirical interval,

def expand_range(low, high, by=2.0):
    mid = (low + high)/2
    half_width = by * abs(high - low)/2
    return (mid - half_width, mid + half_width)

# Compute two different interval estimates
# The first assuming a normal distribution, using standard deviation
# The second computing the equivalent quantiles 0.025 and 0.975, from the empirical data. Note: For extreme distributions, such as those on a closed or half interval, this may not be the best approximation
def interval_estimate(data_list):
    data_ar = np.array(data_list)
    mn = data_ar.mean()
    sd = data_ar.std()
    return {'mean': mn, 'sd': sd,
        'normal_interval': (mn-1.96*sd, mn+1.96*sd),
            'quantile_interval': (np.percentile(data_ar, 2.5), np.percentile(data_ar, 97.5))}

def plot_intervals(the_sample, bin_ct = 50, subtitle = ''):
    normal_approx = interval_estimate(the_sample)
    lower_limit = min(normal_approx['normal_interval'][0], normal_approx['quantile_interval'][0])
    upper_limit = max(normal_approx['normal_interval'][1], normal_approx['quantile_interval'][1])
    lower_plot_limit, upper_plot_limit = expand_range(lower_limit, upper_limit)
    print('Limits ', lower_plot_limit, upper_plot_limit)
    p = figure(width = 800, height = 300, x_range = (lower_plot_limit,  upper_plot_limit ), title='Normal versus Empirical Bootstrap Interval Estimates.\n' + subtitle)

    hist, bin_edges = np.histogram(the_sample, bins=bin_ct, density=True)
    hist_df = pd.DataFrame(dict(density=hist, rv= bin_edges[:-1]), columns=['rv', 'density'])
    # Shade the background corresponding to the normal interval
    box = BoxAnnotation(left=normal_approx['normal_interval'][0], right=normal_approx['normal_interval'][1], bottom=0, top=max(1, max(hist)), fill_alpha=0.2, fill_color='#D55E00')
    p.add_layout(box)
    bar_width = (bin_edges[-1] - bin_edges[0])/bin_ct
    hist_src = ColumnDataSource(hist_df)

    glyph = VBar(x='rv', top='density', bottom=0, width=bar_width, fill_color='limegreen')
    p.add_glyph(hist_src, glyph)
    # Overlay a selection of the histogram within the estimate interval
    interval_src = ColumnDataSource(hist_df[(hist_df.rv > normal_approx['quantile_interval'][0]) & (hist_df.rv < normal_approx['quantile_interval'][1])])
    glyph = VBar(x='rv', top='density', bottom=0, width=bar_width, fill_color='darkgreen')
    p.add_glyph(interval_src, glyph)
    # Overlay the normal distribution within the interval.
    x = np.linspace(normal_approx['normal_interval'][0], normal_approx['normal_interval'][1])
    p.line(x=x, y=norm.pdf(x, loc= normal_approx['mean'], scale = normal_approx['sd']))
    return p

## Run the baseline regression

In [238]:
# Run an OLS regression on the full data.
def regress(y, X):
  design_matrix = X # sm.add_constant(X)
  ols_model = sm.OLS(y, design_matrix)
  #print(ols_model.fit().summary())
  return ols_model.fit()

def print_eval_summary(the_fit):
  # Just show the evaluation table, without coefs (tables[1] give coefs)
  s = the_fit.summary()
  print (s.tables[0])
  return None

In [239]:
fit = regress(outcomes, data_ar)
print_eval_summary(fit)
'AIC: ', AIC(fit)

                                 OLS Regression Results                                
Dep. Variable:                      y   R-squared (uncentered):                   0.849
Model:                            OLS   Adj. R-squared (uncentered):              0.825
Method:                 Least Squares   F-statistic:                              34.85
Date:                Sat, 30 Mar 2024   Prob (F-statistic):                        0.00
Time:                        05:22:04   Log-Likelihood:                         -360.32
No. Observations:                2748   AIC:                                      1485.
Df Residuals:                    2366   BIC:                                      3746.
Df Model:                         382                                                  
Covariance Type:            nonrobust                                                  


('AIC: ', 1486.6478837927152)

In [240]:
# Compute in-sample error
one_zero_error(fit.predict(), outcomes, test_threshold = 0.5)

0.06841339155749636

## An example Interval estimate for out-of-sample 0-1 error via Bootstrap re-sampling

In [241]:
# Create a large number of bootstrap replications

def bootstrap_replicates(the_sample, B=BOOTSTRAP_RESAMPLES):
    replicates = np.zeros((len(the_sample), B), dtype='int')
    for k in range(B):
        replicates[:,k] = np.random.choice(the_sample, len(the_sample), replace=True).astype('int')
    return replicates

resamplings = bootstrap_replicates(list(range(sample_size)))
resamplings.shape

(2248, 200)

In [224]:
# Bootstrap the zero-one error on a hold-out set.

# Create a large number of bootstrap replications
resampled_errors = np.zeros(BOOTSTRAP_RESAMPLES)

# Retrain the model on the resampled learning set,
# and compute the hold-out set error for each resample
print('learning AIC: ')
for k in range(BOOTSTRAP_RESAMPLES):
  resampled_data  = learning_data[resamplings[:,k],:] #[resamplings[:,k],:]
  retraining_fit = regress(learning_outcome[resamplings[:,k]], resampled_data) # [resamplings[:,k]], resampled_data)
  print(AIC(retraining_fit), end=', ')


  # Predict on the hold-out set
  hold_out_predict = retraining_fit.predict(holdout_data)
  resampled_errors[k] = one_zero_error(hold_out_predict, holdout_actuals)


learning AIC: 
820.5940694308738, 996.9854574322089, 855.1051770412487, 801.3387115882633, 811.3364580079442, 948.486844398557, 785.4408160581452, 847.7226041245776, 936.8544080094407, 893.2464042715173, 905.3592927112895, 933.2542854999092, 811.0524741100744, 812.8359526518661, 825.8616101995831, 1029.6701864565293, 851.7223093523389, 821.6344778392631, 791.4813600270472, 786.6576783291994, 810.7910403809747, 897.0395060005894, 874.84331183427, 914.4149962096135, 746.2675010529356, 849.4826067610029, 728.5723307139906, 1065.983898592307, 967.1680043341348, 929.1799780899519, 941.8000595463509, 831.707085438301, 904.7196224553827, 907.3084737619502, 846.4245778417708, 923.5466528694633, 848.7103406599426, 746.1561692841933, 927.5219781168698, 855.942787040578, 862.6805067199512, 1006.9787061853967, 803.3588636928926, 904.999898219362, 813.1551033758824, 906.7633413904341, 790.3843372071706, 763.6861741319199, 854.5202086112904, 856.8403927286354, 675.846441884325, 1069.7570656503085, 7

In [225]:
# What to the bootstrap errors look like?
pd.DataFrame(resampled_errors).describe()

Unnamed: 0,0
count,200.0
mean,0.154195
std,0.019573
min,0.091954
25%,0.143678
50%,0.155172
75%,0.166667
max,0.212644


In [226]:
#Plot the error distribution, and compute the interval estimate limit.
p = plot_intervals(resampled_errors)
show(p)

Limits  0.07766034339696512 0.230730461200736


##  Generate selected features, for an alternative model

In [227]:
#TBD

## Compare baseline and alternative models

In [228]:
#TBD