## Section 1: Sentiment Classification Model

### Technical Details

This section aims to construct a binary classification model to predict the sentiment of movie reviews.

I began by extracting the 1536-dimensional embeddings from the training data as the input features X_train, and assigned the "sentiment" column to y_train.
Given that all features are embeddings of a consistent type, I opted not to perform any data pre-processing.

I then built a logistic regression model with the penalty set to 'elasticnet', the solver set to 'saga', and a maximum of 1000 iterations.
To optimize the model, I used GridSearchCV with 5-fold cross-validation to tune the hyperparameters "C" and "l1_ratio". The combination that yielded the highest "roc_auc" score was C = 10 and l1_ratio = 0.1. I then used this tuned configuration to evaluate the AUC on each of the 5 splits.
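
The tuning step described above can be sketched roughly as follows. This is a minimal illustration, not the report's actual script: the synthetic data stands in for the 1536-dimensional embeddings, and the grid values are assumptions, since the report does not list the values that were searched.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the embedding features and sentiment labels (hypothetical data)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 20))
y_train = (X_train[:, :3].sum(axis=1) + 0.5 * rng.normal(size=300) > 0).astype(int)

# Elastic-net logistic regression with the settings named in the text
clf = LogisticRegression(penalty="elasticnet", solver="saga", max_iter=1000)

# Illustrative grid only; the values actually searched are not given in the report
param_grid = {"C": [0.1, 1, 10], "l1_ratio": [0.1, 0.5, 0.9]}

# 5-fold cross-validation, selecting the combination with the best ROC AUC
search = GridSearchCV(clf, param_grid, scoring="roc_auc", cv=5)
search.fit(X_train, y_train)

print(search.best_params_, round(search.best_score_, 4))
```

With the default `refit=True`, `search.best_estimator_` is refit on the full training set with the selected parameters, so it can be used directly for prediction.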

### Performance Metrics

- The computer system I used is a MacBook Pro 13.3" laptop (Apple M2 chip, 24GB memory, 1TB SSD).
- The performance and the execution time (including data loading) on the 5 splits are as follows:

| Split No. | AUC | Running Time (s) |
|:--------:|:--------:|:--------:|
| 1 | 0.9870943141872321 | 22.849359035491943 |
| 2 | 0.9867909406872515 | 22.652480125427246 |
| 3 | 0.9864186818834573 | 23.762585163116455 |
| 4 | 0.9869783852661665 | 23.567501068115234 |
| 5 | 0.9862663732679459 | 24.464163064956665 |

## Section 2: Interpretability Analysis

### Interpretability Approach

First, I converted the raw reviews from split 1 into 768-dimensional BERT embeddings. These BERT embeddings were designated as X, while the original 1536-dimensional OpenAI embeddings were designated as Y. I then trained a linear regression model to map BERT embeddings onto OpenAI embeddings.
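
As a rough sketch of this alignment step, a single multi-output linear regression can map vectors from one embedding space to another. The dimensions (8 and 16) and the randomly generated data below are hypothetical stand-ins for the 768-dimensional BERT and 1536-dimensional OpenAI embeddings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 8-dim "source" and 16-dim "target" embeddings
X = rng.normal(size=(200, 8))
W_true = rng.normal(size=(8, 16))
b_true = rng.normal(size=16)
Y = X @ W_true + b_true  # noise-free linear relation, for illustration only

# One multi-output regression learns a (16 x 8) coefficient matrix plus intercepts
reg = LinearRegression().fit(X, Y)

# New source vectors are mapped into the target space with predict()
X_new = rng.normal(size=(5, 8))
aligned = reg.predict(X_new)
print(aligned.shape)  # (5, 16)
```

On real embeddings the relation is not exactly linear, so the aligned vectors are only approximations of the target embeddings.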

Following the alignment, I used the sentiment classification model trained in Section 1 to predict the overall sentiment probability of each review in the test sample (X_test). A review was classified as positive if its predicted probability was at least 0.5, and as negative otherwise.

Subsequently, I randomly selected 5 positive reviews and 5 negative reviews, split each into individual sentences, extracted the BERT embedding of each sentence, and mapped it into the OpenAI embedding space using the fitted linear regression model. These aligned embeddings were then fed into the sentiment classification model from Section 1 to predict sentence-level sentiment probabilities, which serve as a proxy for each sentence's contribution to the overall sentiment assessment.
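
The sentence-level pipeline can be sketched with pluggable components. Everything below is a hypothetical stand-in: the regex splitter replaces `sent_tokenize`, `toy_embed` replaces the BERT encoder, `toy_align` replaces the fitted regression, and `toy_predict_proba` replaces the Section 1 classifier. Only the order of the steps mirrors the description above.

```python
import re
import numpy as np

def split_sentences(review):
    # Naive regex splitter, standing in for nltk's sent_tokenize
    return [s for s in re.split(r"(?<=[.!?])\s+", review.strip()) if s]

def sentence_probabilities(review, embed, align, predict_proba):
    """Return (sentence, P(positive)) pairs for one review."""
    sentences = split_sentences(review)
    embeddings = np.vstack([embed(s) for s in sentences])  # (n_sentences, d_source)
    aligned = align(embeddings)                             # (n_sentences, d_target)
    probs = predict_proba(aligned)[:, 1]                    # column 1 = positive class
    return list(zip(sentences, probs))

# Toy stand-ins (hypothetical, for illustration only)
def toy_embed(sentence):
    words = re.findall(r"[a-z']+", sentence.lower())
    return np.array([words.count("great"), words.count("awful")], dtype=float)

def toy_align(E):
    return E  # identity mapping in place of the fitted regression

def toy_predict_proba(E):
    p = 1.0 / (1.0 + np.exp(-3.0 * (E[:, 0] - E[:, 1])))
    return np.column_stack([1.0 - p, p])

review = "The cast is great. The plot is awful! I left early."
for sentence, prob in sentence_probabilities(review, toy_embed, toy_align, toy_predict_proba):
    print(f"{prob:.2f}  {sentence}")
```

In the notebook, the corresponding real components are `get_bert_embeddings`, `reg.predict`, and `part1_model.predict_proba`.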

Finally, I identified and highlighted the sentences with significant contributions, defined as those with probabilities exceeding 0.99 for positive overall review predictions and falling below 0.01 for negative overall review predictions.
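
A minimal sketch of this highlighting rule is shown below. The function name, its parameters, and the example sentences and probabilities are hypothetical. Unlike the notebook code, this version escapes each sentence with `html.escape`, which is an added assumption about how raw text should be rendered.

```python
from html import escape

def highlight_review(sentences, probs, overall_positive, high=0.99, low=0.01):
    """Build HTML in which strongly contributing sentences get a yellow background."""
    parts = []
    for sentence, prob in zip(sentences, probs):
        # Positive reviews: highlight very high probabilities; negative: very low ones
        strong = prob > high if overall_positive else prob < low
        style = " style='background-color: yellow;'" if strong else ""
        parts.append(f"<p{style}>{escape(sentence)}</p>")
    return "".join(parts)

html_out = highlight_review(
    ["Loved every minute.", "The ending drags a bit."],
    [0.997, 0.42],
    overall_positive=True,
)
print(html_out)
```

In a notebook, the returned string can be passed to `IPython.display.HTML` for rendering.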

### Effectiveness and Limitations

The above interpretability analysis has several strengths and limitations:

**Effectiveness:**

1. Sentence-Level Analysis: The analysis delves into the individual sentences within reviews, allowing for a granular examination of their sentiment contributions. This can provide valuable insights into the specific aspects of reviews that influence the overall sentiment.
2. Highlighting Significant Sentences: Identifying and highlighting sentences with high contributions to the overall sentiment facilitates a more focused interpretation of the sentiment analysis results. This can aid in understanding the key drivers of positive or negative sentiments within the movie reviews.

**Limitations:**

1. Sentence-Level Sections: The interpretability of sentiment contributions at the sentence level may be limited when dealing with long and complex sentences. Breaking down the text into smaller sections for analysis may help improve the accuracy of sentiment interpretation.
2. Model Reliance: The interpretability heavily relies on the performance of the pretrained sentiment classification model and the linear regression alignment model. Any shortcomings or biases in these models can potentially impact the accuracy and reliability of the interpretability analysis.
3. Interpretation Subjectivity: Using a fixed threshold of 0.5 to determine positive or negative sentiments may oversimplify the sentiment analysis process. Also, the definition of “significant contribution” based on probabilities exceeding 0.99 for positive reviews and below 0.01 for negative reviews is arbitrary. Different thresholds may lead to varying interpretations of the results.
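
The threshold sensitivity noted in the last limitation can be illustrated with a small check. The probabilities below are made up for illustration and are not results from the analysis.

```python
import numpy as np

# Hypothetical sentence-level probabilities for one positively classified review
probs = np.array([0.999, 0.995, 0.97, 0.91, 0.60, 0.30])

# Count how many sentences would be highlighted under each candidate threshold
counts = {}
for threshold in (0.90, 0.95, 0.99):
    counts[threshold] = int((probs > threshold).sum())
    print(f"threshold={threshold:.2f}: {counts[threshold]} of {len(probs)} sentences highlighted")
```

In this toy case, moving the cutoff from 0.90 to 0.99 halves the number of highlighted sentences, which is the kind of variation the limitation refers to.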

### Code

In [1]:
# Import necessary packages
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from transformers import BertTokenizer, BertModel

import joblib
import requests
from io import BytesIO

import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize
from IPython.display import HTML

[nltk_data] Downloading package punkt to /Users/minjie/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
# Load pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

In [3]:
# Functions to convert texts to BERT embeddings
def get_bert_embeddings(texts):
    embeddings_list = []
    for text in texts:
        # Call the text_to_bert_embedding function to get the embedding for each text
        embeddings = text_to_bert_embedding(text)
        # Detach from the autograd graph and convert the embedding to a numpy array
        embedding_array = embeddings.detach().numpy()
        # Append the embedding array to the embeddings_list
        embeddings_list.append(embedding_array)
    return embeddings_list

def text_to_bert_embedding(text):
    # Tokenize the input text (inputs longer than the model's maximum length are truncated)
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True)
    # Run the BERT model on the tokenized input
    outputs = model(**inputs)
    # Mean-pool the last-layer token embeddings into one 768-dimensional vector
    # (this averages over all tokens; it does not use the CLS token alone)
    embeddings = outputs.last_hidden_state.mean(dim=1).squeeze()
    return embeddings

In [4]:
# Load data
train_data = pd.read_csv("./F24_Proj3_data/split_1/train.csv")
test_data = pd.read_csv("./F24_Proj3_data/split_1/test.csv")
X_test = test_data.iloc[:, 2:].values

In [5]:
# The BERT embeddings of all split 1 reviews were precomputed and uploaded to GitHub
# Download the BERT embeddings and denote them as X
url1 = "https://github.com/minjiefu/test/releases/download/bert_embeddings/bert_embeddings.npy"
response1 = requests.get(url1)
X = np.load(BytesIO(response1.content), allow_pickle=True)

In [6]:
# Denote the OpenAI embeddings as Y
Y = train_data.iloc[:, 3:].values

In [7]:
# Fit a Linear Regression Model to align BERT embeddings to OpenAI embeddings
reg = LinearRegression().fit(X, Y)

In [8]:
# Load the Sentiment Classification Model trained in section 1
url = "https://raw.githubusercontent.com/minjiefu/test/main/trained_model.pkl"
response = requests.get(url)
part1_model = joblib.load(BytesIO(response.content))

In [9]:
# Classify each review as positive (1) or negative (0) using a probability threshold of 0.5
y_pred = part1_model.predict_proba(X_test)[:,1]
y_pred = np.where(y_pred >= 0.5, 1, 0)
# Note: the 'prob' column stores the thresholded 0/1 label, not the raw probability
output = pd.DataFrame(data={'id': test_data["id"], 'prob': y_pred})

In [10]:
# Randomly select 5 positive reviews
np.random.seed(42)
selected_ids = output[output['prob'] == 1].sample(n=5, axis=0)['id']
selected_positive_reviews = test_data[test_data['id'].isin(selected_ids)]["review"]
selected_positive_reviews

713      (Review is of the original 1950's version not ...
3515     It may be a remake of the 1937 film by Capra, ...
5259     John Schelesinger's career as a film director ...
8074     Finding the premise intriguing, and reading th...
23636    Some people don't appreciate the magical eleme...
Name: review, dtype: object

In [11]:
# Randomly select 5 negative reviews
np.random.seed(42)
selected_ids = output[output['prob'] == 0].sample(n=5, axis=0)['id']
selected_negative_reviews = test_data[test_data['id'].isin(selected_ids)]["review"]
selected_negative_reviews

483      I don't understand why this movie was released...
799      Watching this Movie? l thought to myself, what...
2888     For die-hard Judy Garland fans only. There are...
7366     Good attempt at tackling the unconventional to...
18457    Thunderbirds (2004) <br /><br />Director: Jona...
Name: review, dtype: object

In [12]:
# For positive reviews
html_str = ""
for i in range(5):
    # select 1 positive review
    review = selected_positive_reviews.iloc[i]
    # Divide the review into sentences
    sentences = sent_tokenize(review)
    # Get the BERT embeddings of these sentences
    sentences_embeddings = get_bert_embeddings(sentences)
    # Align the BERT embeddings of these sentences to OpenAI embeddings
    aligned_embeddings = reg.predict(sentences_embeddings)
    # Use the Sentiment Classification Model trained in section 1 to predict sentence-level sentiment probabilities
    sentence_y_pred = part1_model.predict_proba(aligned_embeddings)[:, 1]
    # Highlight the sentence if the sentiment probability is greater than 0.99
    html_str += f"<h2>Positive Review {i+1}</h2>"
    for sentence, prob in zip(sentences, sentence_y_pred):
        if prob > 0.99:
            html_str += f"<p style='background-color: yellow;'>{sentence}</p>"
        else:
            html_str += f"<p>{sentence}</p>"

HTML(html_str)

In [13]:
# For negative reviews
html_str = ""
for i in range(5):
    review = selected_negative_reviews.iloc[i]
    sentences = sent_tokenize(review)
    sentences_embeddings = get_bert_embeddings(sentences)
    aligned_embeddings = reg.predict(sentences_embeddings)
    sentence_y_pred = part1_model.predict_proba(aligned_embeddings)[:, 1]
    # Highlight the sentence if the sentiment probability is less than 0.01
    html_str += f"<h2>Negative Review {i+1}</h2>"
    for sentence, prob in zip(sentences, sentence_y_pred):
        if prob < 0.01:
            html_str += f"<p style='background-color: yellow;'>{sentence}</p>"
        else:
            html_str += f"<p>{sentence}</p>"

HTML(html_str)