# Section 1

#### Discuss the technical details of the sentiment classification model, including data preprocessing and other key aspects of your model implementation. Your explanation should be detailed enough for your PSL classmates to replicate your results.

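We selected hyperparameters with GridSearchCV. The model is a LogisticRegression with penalty set to elasticnet and solver set to saga, the scikit-learn solver that supports the elastic-net penalty. We raised the maximum iteration count to 2000 so that the solver converges on the large dataset.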
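The setup described above can be sketched roughly as follows. This is a minimal illustration, not our exact script: the synthetic data from `make_classification`, the grid values for `C` and `l1_ratio`, the `roc_auc` scoring choice, and the 3-fold setting are all assumptions made for the example. Only the penalty, the solver, and the iteration cap come from the description above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real feature matrix and labels (assumption).
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

# Elastic-net logistic regression; saga is the solver that supports this penalty.
clf = LogisticRegression(penalty="elasticnet", solver="saga", max_iter=2000)

# Hypothetical search grid: these values are placeholders, not the tuned ones.
param_grid = {"C": [0.1, 1.0], "l1_ratio": [0.0, 0.5]}

# Cross-validated search scored by AUC (scoring and cv are assumptions).
search = GridSearchCV(clf, param_grid, scoring="roc_auc", cv=3)
search.fit(X, y)

best_params = search.best_params_
best_auc = search.best_score_
print("Best parameters:", best_params)
print("Mean cross-validated AUC:", round(best_auc, 4))
```

With the elastic-net penalty, `l1_ratio` controls the mix between L1 and L2 regularization, so it is a natural parameter to search alongside `C`.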

#### Report the AUC of your predictions on each of the 5 test datasets (refer to the evaluation metric described above), the execution time of your code, and the specifications of the computer system used (e.g., Macbook Pro, 2.53 GHz, 4GB memory, or AWS t2.large) for each of the 5 splits.

Split 1: 
- AUC Score for LogisticRegression: 0.9870942
- Execution time: 30.1842 seconds
  
Split 2:
- AUC Score for LogisticRegression: 0.9867907
- Execution time: 30.1352 seconds
  
Split 3:
- AUC Score for LogisticRegression: 0.9864187
- Execution time: 30.8156 seconds
  
Split 4:
- AUC Score for LogisticRegression: 0.9869783
- Execution time: 31.2468 seconds
  
Split 5:
- AUC Score for LogisticRegression: 0.9862662
- Execution time: 31.1201 seconds

Computer system specs: MacBook Pro, 3.2 GHz, 32 GB memory
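For readers who want to reproduce the two reported quantities, the sketch below shows one way to compute an AUC and a wall-clock time with `roc_auc_score` and `time.perf_counter`. The helper name `timed_auc` and the toy labels and scores are hypothetical. In a real run the timer would wrap data loading, training, and prediction for the split, not just the scoring call.

```python
import time

from sklearn.metrics import roc_auc_score


def timed_auc(y_true, y_score):
    """Return (AUC, elapsed seconds) for one set of predictions."""
    start = time.perf_counter()
    auc = roc_auc_score(y_true, y_score)
    elapsed = time.perf_counter() - start
    return auc, elapsed


# Toy labels and predicted probabilities, made up for illustration.
auc, elapsed = timed_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(f"AUC Score: {auc:.7f}")
print(f"Execution time: {elapsed:.4f} seconds")
```

In the toy data, three of the four positive/negative pairs are ranked correctly, which gives an AUC of 0.75.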

# Section 2

#### Provide a detailed explanation of your interpretability approach.

Obtain BERT embeddings

In [26]:
import pandas as pd
import random
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LinearRegression

In [28]:
test_file_path = './F24_Proj3_data/split_1/test.csv'
test_y_file_path = './F24_Proj3_data/split_1/test_y.csv'

test = pd.read_csv(test_file_path)
test_y = pd.read_csv(test_y_file_path)

In [29]:
random.seed(42)

Grab 5 random positive reviews and 5 random negative reviews and combine them into one DataFrame.

In [30]:
positive_reviews = test[test_y['sentiment'] == 1].sample(5, random_state=42)
negative_reviews = test[test_y['sentiment'] == 0].sample(5, random_state=42)
selected_reviews = pd.concat([positive_reviews, negative_reviews])

Convert the review column of the DataFrame into a list of strings and encode it with SentenceTransformer.

In [31]:
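# Raw review texts as a plain list of strings
reviews = selected_reviews['review'].tolist()

# Named embedder so it is not confused with the regression model fitted later
embedder = SentenceTransformer("all-MiniLM-L6-v2")
X_embeddings = embedder.encode(reviews)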

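Now find a linear transformation W (plus an intercept) such that Y roughly equals X*W, where X holds the SentenceTransformer embeddings and Y holds the columns of the test file from the third column onward.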

In [33]:
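# Target matrix Y: every column after the first two columns of the test file
Y_embeddings = selected_reviews.iloc[:, 2:]

model = LinearRegression()
model.fit(X_embeddings, Y_embeddings)
# coef_ has shape (n_targets, n_features), so Y is approximated by
# X_embeddings @ W_optimal.T + intercept
W_optimal = model.coef_

intercept = model.intercept_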

In [34]:
print("Optimal weights (W):", W_optimal)
print("Intercept (bias):", intercept)

Optimal weights (W): [[-0.00786921 -0.00081155  0.00093443 ... -0.00090624  0.00169501
   0.00380206]
 [ 0.00710443  0.00240365 -0.00580404 ... -0.00151746  0.00207799
  -0.00403878]
 [-0.01501093 -0.00217633 -0.00672629 ...  0.00034385  0.00499082
  -0.00024948]
 ...
 [ 0.0015455  -0.00022838 -0.00616118 ... -0.00295009 -0.00050465
   0.00158576]
 [ 0.00295547 -0.00340234  0.0056772  ...  0.00038215  0.00539475
   0.00391316]
 [ 0.01103682 -0.00359466  0.00985823 ...  0.00404666  0.00261448
  -0.00288311]]
Intercept (bias): [-0.00032262  0.05894218 -0.05273826 ...  0.01001744  0.00936972
 -0.02275034]
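To make the orientation of the fitted weights concrete, here is a small self-contained sketch on random data. The random matrices and the target width of 16 are assumptions for illustration. The 10 rows and the 384 input columns mirror the 10 sampled reviews and the output size of all-MiniLM-L6-v2.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Stand-ins for the real matrices (assumption): 10 reviews, 384-dim inputs,
# and a hypothetical 16-dim target space.
X = rng.normal(size=(10, 384))
Y = rng.normal(size=(10, 16))

reg = LinearRegression().fit(X, Y)

# scikit-learn stores coef_ as (n_targets, n_features); transpose it to get
# a W that can be right-multiplied onto X.
W = reg.coef_.T
Y_hat = X @ W + reg.intercept_

max_err = np.abs(Y_hat - Y).max()
print("coef_ shape:", reg.coef_.shape)
print("W shape:", W.shape)
print("Max absolute reconstruction error:", max_err)
```

Because there are far fewer rows (10) than input features (384), the least-squares problem is underdetermined and the fit reproduces Y on these rows almost exactly. A small training error here therefore says little about how well W would generalize to other reviews.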
