# Run Phase 2: Upgrading the Embedding Model to MPNet

Our previous experiments showed:
1. Similarity features are the correct strategy (Score: 0.705).
2. Adding more engineered features from those similarities did not help (Score: 0.696).

**New Hypothesis:** The four similarity features are the right inputs, but their quality is limited by the embedding model that produces them. We will upgrade from `all-MiniLM-L6-v2` (384-dimensional embeddings) to the larger `all-mpnet-base-v2` (768-dimensional embeddings) and check whether the CV score improves.

### 1. Setup and Library Imports

In [1]:
import pandas as pd
import numpy as np
import os
import lightgbm as lgb
from sentence_transformers import SentenceTransformer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics.pairwise import cosine_similarity


In [2]:
df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

### 2. Generate Embeddings with the Upgraded MPNet Model
**Note:** This step will be slower than with MiniLM as the model is larger.

In [3]:
# IMPORTANT: Add your new Kaggle dataset and update this path!
model_path = './all-mpnet-base-v2-local/'

print(f"Loading SentenceTransformer model from: {model_path}")
embed_model = SentenceTransformer(model_path)
print("Model loaded successfully.")

text_cols = ['body', 'positive_example_1', 'positive_example_2', 'negative_example_1', 'negative_example_2']

for col in text_cols:
    print(f"Generating embeddings for: {col}")
    df[f'{col}_vec'] = embed_model.encode(df[col].astype(str).tolist(), show_progress_bar=True).tolist()
    test_df[f'{col}_vec'] = embed_model.encode(test_df[col].astype(str).tolist(), show_progress_bar=True).tolist()

Loading SentenceTransformer model from: ./all-mpnet-base-v2-local/
Model loaded successfully.
Generating embeddings for: body


Batches: 100%|██████████| 64/64 [01:32<00:00,  1.44s/it]
Batches: 100%|██████████| 1/1 [00:00<00:00,  1.11it/s]


Generating embeddings for: positive_example_1


Batches: 100%|██████████| 64/64 [01:22<00:00,  1.29s/it]
Batches: 100%|██████████| 1/1 [00:00<00:00,  1.16it/s]


Generating embeddings for: positive_example_2


Batches: 100%|██████████| 64/64 [01:19<00:00,  1.24s/it]
Batches: 100%|██████████| 1/1 [00:00<00:00,  1.77it/s]


Generating embeddings for: negative_example_1


Batches: 100%|██████████| 64/64 [01:09<00:00,  1.09s/it]
Batches: 100%|██████████| 1/1 [00:00<00:00,  1.82it/s]


Generating embeddings for: negative_example_2


Batches: 100%|██████████| 64/64 [01:07<00:00,  1.06s/it]
Batches: 100%|██████████| 1/1 [00:01<00:00,  1.09s/it]
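The progress bars above show each training column taking over a minute to encode, so re-running the notebook repeats several minutes of work. One way to avoid that is to cache each embedding matrix on disk. The following is a minimal sketch, not part of the original notebook: `cached_embeddings`, `fake_encode`, and the file name are hypothetical, and the stand-in encoder only exists so the example runs without downloading a model.

```python
import os
import tempfile

import numpy as np


def cached_embeddings(cache_path, texts, encode_fn):
    """Return embeddings for `texts`, reusing a saved .npy file if one exists.

    Assumption: the cache is keyed only by file path, so it does NOT notice
    when the texts or the embedding model change. Delete the file to refresh.
    """
    if os.path.exists(cache_path):
        return np.load(cache_path)
    vectors = np.asarray(encode_fn(texts))
    np.save(cache_path, vectors)
    return vectors


# Hypothetical stand-in for embed_model.encode, counting how often it is called.
calls = {"n": 0}


def fake_encode(texts):
    calls["n"] += 1
    return np.array([[len(t), t.count(" ")] for t in texts], dtype=np.float32)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "body_vec.npy")
    first = cached_embeddings(path, ["a b", "hello"], fake_encode)   # computes and saves
    second = cached_embeddings(path, ["a b", "hello"], fake_encode)  # loads from disk
```

In the notebook, `encode_fn` would be `embed_model.encode` and `texts` would be `df[col].astype(str).tolist()`, with one cache file per column and per dataset.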


### 3. Create Similarity Features

In [4]:
def calculate_similarity(df_row, vec_col_1, vec_col_2):
    vec1 = np.array(df_row[vec_col_1]).reshape(1, -1)
    vec2 = np.array(df_row[vec_col_2]).reshape(1, -1)
    return cosine_similarity(vec1, vec2)[0][0]

print("Calculating similarity features...")
for df_ in [df, test_df]:
    df_['sim_pos_1'] = df_.apply(lambda row: calculate_similarity(row, 'body_vec', 'positive_example_1_vec'), axis=1)
    df_['sim_pos_2'] = df_.apply(lambda row: calculate_similarity(row, 'body_vec', 'positive_example_2_vec'), axis=1)
    df_['sim_neg_1'] = df_.apply(lambda row: calculate_similarity(row, 'body_vec', 'negative_example_1_vec'), axis=1)
    df_['sim_neg_2'] = df_.apply(lambda row: calculate_similarity(row, 'body_vec', 'negative_example_2_vec'), axis=1)
print("Similarity features created.")

Calculating similarity features...
Similarity features created.
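The cell above calls `cosine_similarity` once per row through `DataFrame.apply`, which is fine at this data size but scales poorly. The same row-wise cosine similarity can be computed for all rows at once with NumPy. This is a sketch under stated assumptions: `rowwise_cosine` is a hypothetical helper, and the small arrays stand in for the real embedding matrices.

```python
import numpy as np


def rowwise_cosine(a, b, eps=1e-12):
    """Cosine similarity between matching rows of two (n_rows, dim) arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Dot product of row i of `a` with row i of `b`, for every i.
    dots = np.einsum("ij,ij->i", a, b)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    # eps guards against division by zero for all-zero vectors.
    return dots / np.maximum(norms, eps)


# Toy stand-ins for the body and example embeddings (3 rows, 2 dimensions).
body = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
example = np.array([[2.0, 0.0], [1.0, -1.0], [0.0, -3.0]])

sims = rowwise_cosine(body, example)  # parallel, orthogonal, opposite
```

With the notebook's list-valued columns, the inputs could be built with something like `np.array(df_['body_vec'].tolist())`, and the result assigned directly to `df_['sim_pos_1']`.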


### 4. Train Model on the Higher-Quality Features

In [5]:
# Revert to the original 4 similarity features: they scored higher in the
# previous runs (0.705) than the set with extra engineered features (0.696)
features = ['sim_pos_1', 'sim_pos_2', 'sim_neg_1', 'sim_neg_2']
X = df[features]
y = df['rule_violation']
X_test = test_df[features]

NFOLDS = 5
skf = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=42)

oof_preds = np.zeros((len(df),))
test_preds = np.zeros((len(test_df),))

for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    print(f"===== FOLD {fold+1} =====")
    X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
    X_val, y_val = X.iloc[val_idx], y.iloc[val_idx]
    
    model = lgb.LGBMClassifier(objective='binary', random_state=42, n_estimators=500)
    model.fit(X_train, y_train, 
              eval_set=[(X_val, y_val)], 
              eval_metric='auc', 
              callbacks=[lgb.early_stopping(100, verbose=False)])
    
    val_fold_preds = model.predict_proba(X_val)[:, 1]
    test_fold_preds = model.predict_proba(X_test)[:, 1]
    
    oof_preds[val_idx] = val_fold_preds
    test_preds += test_fold_preds / NFOLDS

overall_cv_score = roc_auc_score(y, oof_preds)
print(f"\nOverall CV AUC Score with MPNet: {overall_cv_score:.4f}")

===== FOLD 1 =====
[LightGBM] [Info] Number of positive: 825, number of negative: 798
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001522 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1020
[LightGBM] [Info] Number of data points in the train set: 1623, number of used features: 4
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508318 -> initscore=0.033275
[LightGBM] [Info] Start training from score 0.033275
===== FOLD 2 =====
[LightGBM] [Info] Number of positive: 825, number of negative: 798
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000503 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1020
[LightGBM] [Info] Number of data points in the train set: 1623, number of used features: 4
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508318 -
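
The CV score computed in this cell is a ROC AUC over the out-of-fold predictions. AUC can be read as the probability that a randomly chosen rule-violating row gets a higher predicted score than a randomly chosen non-violating row. The sketch below illustrates that definition with a rank-based (Mann-Whitney) computation in NumPy; `auc_from_ranks` is a hypothetical helper for illustration, and the notebook itself uses `roc_auc_score`.

```python
import numpy as np


def auc_from_ranks(y_true, scores):
    """ROC AUC via the Mann-Whitney U statistic, using average ranks for ties."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)

    # Group equal scores; np.unique returns the groups in ascending order.
    _, inverse, counts = np.unique(scores, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)            # 1-based rank of the last item in each tie group
    first_rank = last_rank - counts + 1      # 1-based rank of the first item in each tie group
    ranks = ((first_rank + last_rank) / 2.0)[inverse]

    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    # U counts (positive, negative) pairs where the positive is ranked higher,
    # with tied pairs counted as one half.
    u = ranks[y_true].sum() - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


# Toy labels and scores: 3 of the 4 positive/negative pairs are ordered correctly.
toy_auc = auc_from_ranks([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Because only the ordering of scores matters, averaging test predictions across folds (as `test_preds` does) is compatible with this metric even if the folds' probability scales differ slightly.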

### 5. Create Final Submission

In [6]:
submission_df = pd.DataFrame({
    'row_id': test_df['row_id'],
    'rule_violation': test_preds
})
submission_df.to_csv('submission_mpnet.csv', index=False)

print("SUCCESS: New submission_mpnet.csv has been generated.")
print(submission_df.head())

SUCCESS: New submission_mpnet.csv has been generated.
   row_id  rule_violation
0    2029        0.499974
1    2030        0.425424
2    2031        0.518751
3    2032        0.452337
4    2033        0.626429
