# Project 3 Report

### Team Members and Contributions
This team consisted of Luke Peng (lukep2), Xiying Zhao (xiyingz2), and Tuan Tran (atran49). Luke and Xiying are both Master's students studying Statistics on campus; Tuan is a Master's student in Computer Science in the online MCS program.

Tuan wrote the code that performed data pre-processing and hyperparameter searching for the model in Section 1, and worked on the report. Luke executed the hyperparameter search, completed Section 2, and also worked on the report. Xiying validated the results.

## Section 1

### Technical Details
 - Very little data pre-processing was required; the only step was removing the HTML tags with a regex.
 - We trained a logistic regression model (`LogisticRegressionCV`) with an elastic net penalty, using the 1536-dimensional OpenAI embeddings as features and the sentiment as the response.
 - We searched for the optimal value of `Cs` (the inverse of the regularization strength) among the options `[0.1, 0.5, 1, 2]`, and of `l1_ratios` (the elastic net mixing parameter, alpha) among the options `[0.1, 0.5, 0.9]`.
 - We performed the hyperparameter search using 5-fold cross-validation, with the AUC as the scoring metric.
 - The optimal model used a `C` value of 2 and an `l1_ratio` of 0.5.
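
The search described in the list above could be set up roughly as follows. This is a minimal sketch on synthetic data, not the code in `mymain.py`: the toy `X` and `y`, the `saga` solver, and the `max_iter` value are assumptions made for illustration (in scikit-learn, the elastic net penalty is only supported by the `saga` solver).

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

# Synthetic stand-in for the embedding features (assumption: toy data only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

Cs = [0.1, 0.5, 1, 2]          # candidate inverse regularization strengths
l1_ratios = [0.1, 0.5, 0.9]    # candidate elastic net mixing values

clf = LogisticRegressionCV(
    Cs=Cs,
    l1_ratios=l1_ratios,
    penalty="elasticnet",
    solver="saga",       # the only solver that supports elastic net
    cv=5,                # 5-fold cross-validation
    scoring="roc_auc",   # select hyperparameters by AUC
    max_iter=2000,       # assumption: generous cap so saga converges
)
clf.fit(X, y)

# np.ravel handles both array-valued and scalar-valued fitted attributes.
best_C = float(np.ravel(clf.C_)[0])
best_l1_ratio = float(np.ravel(clf.l1_ratio_)[0])
print(best_C, best_l1_ratio)
```

On the real data, `X` would hold the OpenAI embeddings and `y` the sentiment labels; the selected values printed here depend on the toy data and say nothing about the report's results.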

### Performance
The model was trained on a machine with an NVIDIA RTX 3060 Ti GPU, an 11th Gen Intel i7-11700 processor, and 32 GB of RAM. The AUC and execution time for each split are shown in the table below (the execution time covers only training the model with the optimal hyperparameters and making predictions):

| Split | AUC    | Time (s) |
| -------- | -------- | ------- |
| 1 | 0.98682 | 23.45  |
| 2 | 0.98634 | 23.63 |
| 3 | 0.98614 | 26.13 |
| 4 | 0.98663 | 25.39 |
| 5 | 0.98603 | 25.57 |
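
As a reminder of what the metric in the table measures, the AUC equals the probability that a randomly chosen positive review receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by brute force on made-up labels and scores; `pairwise_auc` is a hypothetical helper written for illustration, not the code that produced the table.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
toy_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)  # 0.75
```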

## Section 2
The original Python notebook that generated this HTML document is on Luke's [GitHub](https://github.com/lukepeng02/stat542/tree/main/project3/proj3report.ipynb). To reproduce these results, download it and move it into your `split_1/` directory (optional). The code for the steps that are described in the notebook but not included here can be found in [this Python notebook](https://github.com/lukepeng02/stat542/tree/main/project3/part2.ipynb).

Our interpretability analysis focuses on the sentiment of individual sentences within a review. To achieve this, we used [Sentence-BERT](https://sbert.net/) embeddings, which can represent multiple words, or even whole sentences, as a single vector. After splitting the reviews into sentences, we use a previously fitted Linear Regression model to map the Sentence-BERT embeddings to OpenAI embeddings. We then use the mapped embeddings to predict the sentiment probability of each sentence. To visualize the result, we use a red-white-green gradient: sentences with highly positive sentiment are highlighted in green, those with highly negative sentiment in red, and those with neutral sentiment in white. This gradient visually emphasizes the most meaningful parts of each review, making it easier to interpret the model's predictions.

We begin by loading the test data in `split_1` and removing the HTML tags:

In [None]:
# import the necessary libraries
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import re
import requests
import joblib
from io import BytesIO

# load and pre-process the data
test = pd.read_csv("test.csv")
test['review'] = test['review'].str.replace('<.*?>', ' ', regex=True)

y_test = pd.read_csv("test_y.csv")
y_test['sentiment'] = y_test["sentiment"].astype(int)
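
As a quick illustration of the tag-stripping step, the same pattern can be applied to a single string with the standard `re` module. The sample text is made up. Because `<.*?>` is non-greedy, each tag is matched separately and replaced by one space.

```python
import re

raw = "Great film!<br /><br />Would watch again."
clean = re.sub(r"<.*?>", " ", raw)  # each tag becomes a single space
print(repr(clean))  # 'Great film!  Would watch again.'
```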

Next, we will fetch the model trained on this split from GitHub. (The code used to train this model can be found in the submitted `mymain.py` file.)

In [5]:
model_url = "https://github.com/lukepeng02/stat542/raw/refs/heads/main/project3/part1_classifier.joblib"
response = requests.get(model_url)

loaded_model = joblib.load(BytesIO(response.content))

This model was trained on all 1536 dimensions of the OpenAI embeddings. However, as per the project instructions, we need a mapping from some other embedding to the OpenAI embeddings. After embedding the reviews in the training set with Sentence-BERT, we fit a Linear Regression model, using the new embeddings as the $X$ matrix and the OpenAI embeddings as the $Y$. We can now load the resulting coefficient matrix $W$ and intercept vector $b$ from GitHub.

In [6]:
W_url = "https://github.com/lukepeng02/stat542/raw/refs/heads/main/project3/W_matrix.npy"
response = requests.get(W_url)

W_loaded = np.load(BytesIO(response.content))

b_url = "https://github.com/lukepeng02/stat542/raw/refs/heads/main/project3/b_vector.npy"
response = requests.get(b_url)

b_loaded = np.load(BytesIO(response.content))
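
For reference, a mapping of this kind can be obtained by ordinary least squares. The sketch below is an assumption about the general technique, not the code that produced `W_matrix.npy` and `b_vector.npy`: the toy dimensions, the random data, and every variable name are made up. It stores `W` with shape `(target_dim, source_dim)`, which is the layout implied by the expression `embedding @ W_loaded.T + b_loaded` used later in the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_src, d_tgt = 200, 8, 5  # toy sizes standing in for the real embedding dimensions

# Toy "source" embeddings and "target" embeddings related by an exact linear map.
X = rng.normal(size=(n, d_src))
W_true = rng.normal(size=(d_tgt, d_src))
b_true = rng.normal(size=d_tgt)
Y = X @ W_true.T + b_true

# Append a column of ones so the intercept is estimated together with the weights.
X_aug = np.hstack([X, np.ones((n, 1))])
coef, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)  # shape (d_src + 1, d_tgt)

W = coef[:-1].T  # shape (d_tgt, d_src)
b = coef[-1]     # shape (d_tgt,)

Y_hat = X @ W.T + b  # same form as the mapping applied in the notebook
print(W.shape, b.shape, np.allclose(Y_hat, Y))
```

With real embeddings the fit would not be exact, so the mapped vectors are only approximations of the OpenAI embeddings.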

Next, we randomly sampled five positive and five negative reviews.

In [9]:
np.random.seed(8209)

# randomly select 5 positive and 5 negative reviews from test set
positive_idx = y_test[y_test['sentiment'] == 1].index
negative_idx = y_test[y_test['sentiment'] == 0].index

positive_sample = np.random.choice(positive_idx, size=5, replace=False)
negative_sample = np.random.choice(negative_idx, size=5, replace=False)

sample_idx = np.concatenate((positive_sample, negative_sample))

sample_reviews = list(test['review'][sample_idx])

We first embedded each review with Sentence-BERT. Then, for each review, we embedded every sentence with Sentence-BERT. All of these embeddings are stored on GitHub. The code used in this step can be found in the second Python notebook mentioned at the beginning of this section. (Note that the Sentence-BERT embeddings stored on GitHub correspond to this sample *only*, meaning they are somewhat hard-coded. However, if you want to generate embeddings for other reviews, you can use the code in that notebook, simply changing the seed.)
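
The sentence boundaries come from the regular expression defined in the display cell later in this section. The sketch below applies that pattern to a made-up review to show where it splits; `toy_review` is an assumption for illustration only.

```python
import re

# Pattern copied from the notebook's display cell.
sentence_regex = r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s"

toy_review = "I loved it. Mr. Smith was great? Yes. It works, e.g. here."
sentences = [s for s in re.split(sentence_regex, toy_review) if s and not s.isspace()]
for s in sentences:
    print(s)
```

The two negative lookbehinds keep abbreviations such as "e.g." and titles such as "Mr." from ending a sentence. The pattern only treats `.` and `?` as terminators, so a sentence ending in `!` stays attached to the one that follows.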

We also defined a function that sets the background color of a sentence according to a red-white-green gradient. Given some probability value `yhat` (the probability of having a positive sentiment), the background is white if $0.25\leq$ `yhat` $\leq0.75$. If `yhat` $<0.25$, it shades linearly from white toward pure red as `yhat` approaches 0, and if `yhat` $>0.75$, it shades linearly from white toward pure green as `yhat` approaches 1.

In [10]:
def interp_red_green(sentence, yhat, cutoff=0.25):
    red = np.array([255,0,0], dtype=np.float64)
    white = np.array([255,255,255], dtype=np.float64)
    green = np.array([0,255,0], dtype=np.float64)

    if yhat <= cutoff: # linear interpolation
        rgb_color = np.array(red + yhat/cutoff * (white - red), dtype=np.int32)
    elif yhat >= 1 - cutoff:
        rgb_color = np.array(white + (yhat - (1-cutoff))/cutoff * (green - white), dtype=np.int32)
    else:
        rgb_color = white.astype(np.int32)

    ansi_code = f"\033[48;2;{rgb_color[0]};{rgb_color[1]};{rgb_color[2]}m"
    ansi_null = "\033[0m"
    return ansi_code + sentence + ansi_null
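
To make the interpolation concrete, the sketch below recomputes the RGB triple for a few probabilities. `background_rgb` is a hypothetical pure-Python mirror of the arithmetic in `interp_red_green`, written only for illustration; the values 0.000971 and 0.804622 are sentence probabilities from the output further down.

```python
def background_rgb(yhat, cutoff=0.25):
    """Hypothetical mirror of the color arithmetic in interp_red_green."""
    red, white, green = (255, 0, 0), (255, 255, 255), (0, 255, 0)
    if yhat <= cutoff:
        start, end, t = red, white, yhat / cutoff
    elif yhat >= 1 - cutoff:
        start, end, t = white, green, (yhat - (1 - cutoff)) / cutoff
    else:
        return white
    # int() truncates toward zero, like the int32 cast in the notebook.
    return tuple(int(a + t * (b - a)) for a, b in zip(start, end))

for p in (0.0, 0.000971, 0.5, 0.804622, 1.0):
    print(p, background_rgb(p))
```

A probability just above the upper cutoff, such as 0.804622, yields a pale green `(199, 255, 199)`, while values near 0 or 1 yield nearly saturated red or green.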

Now, we can finally load the embeddings from GitHub. For each Sentence-BERT embedding, we approximated the corresponding OpenAI embedding using our Linear Regression coefficients, and then passed the approximation to our pre-trained model from Section 1. This generated a probability for each review and each sentence, which we then used for visualization. The following is our result (note that the first five reviews are positive, while the last five are negative):

In [11]:
sentence_regex = r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s" # regex used to split sentences (raw string avoids invalid-escape warnings)

review_url = "https://github.com/lukepeng02/stat542/raw/refs/heads/main/project3/sample_sbert.npy"
response = requests.get(review_url)
all_review_embedding = np.load(BytesIO(response.content)) # file contains the Sentence-BERT embedding of each sampled review as a whole
all_review_openai = all_review_embedding @ W_loaded.T + b_loaded
all_pred_prob = loaded_model.predict_proba(all_review_openai)[:,1]

for i in range(10):
    url = f"https://github.com/lukepeng02/stat542/raw/refs/heads/main/project3/sbert_sentence_embeddings/sbert{i}.npy"
    response = requests.get(url)
    review_embedding = np.load(BytesIO(response.content)) # file contains Sentence-BERT embedding for each sentence in review
    review_openai = review_embedding @ W_loaded.T + b_loaded
    pred_prob = loaded_model.predict_proba(review_openai)[:,1]

    print(f"\033[1mSampled review number {i+1}:\033[0m")
    print(f"Overall probability of review: {np.round(all_pred_prob[i], 6)}")
    prob_str = "Probability"
    print(f"{prob_str:^12} Sentence")
    review_sentences = [s for s in re.split(sentence_regex, sample_reviews[i]) if not s.isspace() and s != ""]
    for idx, sentence in enumerate(review_sentences):
        colored_sentence = interp_red_green(sentence, pred_prob[idx])
        print(f"{np.round(pred_prob[idx], 6):^12} {colored_sentence}")

    print()

Sampled review number 1:
Overall probability of review: 0.991365
Probability  Sentence
  0.804622   Yes, it's not a great cinematic achievement, but Toy Soldiers is a fun and entertaining movie.
  0.988018   The young cast does a great job with both dramatic and comedic aspects of the story, and I particularly liked Shawn Phelan as Derek/\Yogurt\".
  0.98754    I've seen this one plenty of times over the years, and will probably see it several more.
  0.992956   Just don't think too much and you'll love it - enjoy!"

Sampled review number 2:
Overall probability of review: 0.94589
Probability  Sentence
  0.999723   A great story, based on a true story about a young black man and all the difficulties along the road.
  0.000971   Being that this is Denzel Washington's first ever movie that he himself was gonna direct, i have to admit i was a tad sceptical, 

These are the advantages and disadvantages of our interpretability approach:

Pros:
- By breaking reviews into sentences and analyzing them individually, we gain a better understanding of how each sentence contributes to the predicted probability. This helps distinguish the sentences that carry the most sentiment (the green and red ones) from the others.
- Because the Sentence-BERT embeddings are mapped to OpenAI embeddings by linear regression, the interpretability results are produced by the same logistic regression model as in Section 1. Hence they are more likely to reflect, and support, that model's behavior.
- This approach can scale to larger datasets because the embeddings are pre-computed.

Cons:
- Analyzing reviews at the sentence level may overlook the overall context. Sentiment may be driven by the flow of a review, so a run of consecutive bad or good sentences could affect the sentiment more than alternating bad and good ones.
- The color gradient relies on a predefined threshold, which is not necessarily optimal, since we have no information about which threshold is best for each color. The worst case arises for sentences with borderline probabilities.
- By focusing on sentences rather than words, this approach ignores sentiment at the word level.