# Run Phase: Similarity Features (Offline Version)

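This notebook implements our similarity-based strategy in an offline Kaggle environment: it embeds each comment body and its positive and negative examples, computes cosine similarities between them, and trains a LightGBM classifier on those similarities.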

**Setup Steps:**
1.  **Download Model:** On a local machine with internet access, run `SentenceTransformer('all-mpnet-base-v2').save('all-mpnet-base-v2-local')`. The model name must match the folder that `model_path` points to below.
2.  **Upload to Kaggle:** Create a new Kaggle Dataset and upload the model folder.
3.  **Add to Notebook:** Use the "Add data" button to attach your Kaggle Dataset to this notebook.
4.  **Update Path:** Ensure the `model_path` variable below points to the correct directory.
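
Steps 3 and 4 are where a typo usually goes unnoticed until the model fails to load. The sketch below is a hypothetical helper, `resolve_model_path`, that is not part of the original notebook: it builds the attached-dataset path and fails early if the folder is missing. The dataset and folder names are placeholders, and the default root assumes Kaggle's usual `/kaggle/input` mount.

```python
from pathlib import Path


def resolve_model_path(dataset_name, folder_name, root="/kaggle/input"):
    """Return the model folder path, or raise if it does not exist.

    dataset_name and folder_name are placeholders (assumptions): use the
    names you chose when creating and uploading the Kaggle Dataset.
    """
    path = Path(root) / dataset_name / folder_name
    if not path.is_dir():
        raise FileNotFoundError(
            f"Model folder not found: {path}. "
            "Check that the dataset is attached and that the names match."
        )
    return str(path)


# Example call (raises FileNotFoundError unless the dataset is attached):
# model_path = resolve_model_path("YOUR-DATASET-NAME", "SAVED-FOLDER-NAME")
```

Failing here, before any embeddings are computed, gives a clearer error than the one raised when the model loader is handed a path that does not exist.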

### 1. Setup and Library Imports

In [1]:
import pandas as pd
import numpy as np
import lightgbm as lgb
from sentence_transformers import SentenceTransformer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics.pairwise import cosine_similarity



In [2]:
# Load the datasets (on Kaggle, point these paths at the competition's input folder)
df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

### 2. Generate Sentence Embeddings (Offline)
We now load the model from the Kaggle dataset we added, not from the internet.

In [3]:
# IMPORTANT: On Kaggle, update this path to match your attached dataset.
# It has the form /kaggle/input/YOUR-DATASET-NAME/SAVED-FOLDER-NAME/
model_path = './all-mpnet-base-v2-local'  # Path used for the local run shown below

# Load the pre-trained model from the local path
print(f"Loading SentenceTransformer model from: {model_path}")
embed_model = SentenceTransformer(model_path)
print("Model loaded successfully.")

# List of columns we need to convert to vectors
text_cols = ['body', 'positive_example_1', 'positive_example_2', 'negative_example_1', 'negative_example_2']

# Generate embeddings for both train and test data
for col in text_cols:
    print(f"Generating embeddings for: {col}")
    df[f'{col}_vec'] = embed_model.encode(df[col].astype(str).tolist(), show_progress_bar=True).tolist()
    test_df[f'{col}_vec'] = embed_model.encode(test_df[col].astype(str).tolist(), show_progress_bar=True).tolist()

Loading SentenceTransformer model from: ./all-mpnet-base-v2-local
Model loaded successfully.
Generating embeddings for: body


Batches: 100%|██████████| 64/64 [00:45<00:00,  1.40it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00,  2.42it/s]


Generating embeddings for: positive_example_1


Batches: 100%|██████████| 64/64 [00:41<00:00,  1.54it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00,  2.22it/s]


Generating embeddings for: positive_example_2


Batches: 100%|██████████| 64/64 [00:41<00:00,  1.53it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00,  4.23it/s]


Generating embeddings for: negative_example_1


Batches: 100%|██████████| 64/64 [00:37<00:00,  1.68it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00,  4.17it/s]


Generating embeddings for: negative_example_2


Batches: 100%|██████████| 64/64 [00:37<00:00,  1.72it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00,  1.81it/s]


### 3. Create Similarity Features

In [4]:
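def calculate_similarity(df_row, vec_col_1, vec_col_2):
    """Return the cosine similarity between two embedding columns of one row."""
    # cosine_similarity expects 2D inputs, so reshape each vector to (1, dim)
    vec1 = np.array(df_row[vec_col_1]).reshape(1, -1)
    vec2 = np.array(df_row[vec_col_2]).reshape(1, -1)
    return cosine_similarity(vec1, vec2)[0][0]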

print("Calculating similarity features...")
for df_ in [df, test_df]:
    df_['sim_pos_1'] = df_.apply(lambda row: calculate_similarity(row, 'body_vec', 'positive_example_1_vec'), axis=1)
    df_['sim_pos_2'] = df_.apply(lambda row: calculate_similarity(row, 'body_vec', 'positive_example_2_vec'), axis=1)
    df_['sim_neg_1'] = df_.apply(lambda row: calculate_similarity(row, 'body_vec', 'negative_example_1_vec'), axis=1)
    df_['sim_neg_2'] = df_.apply(lambda row: calculate_similarity(row, 'body_vec', 'negative_example_2_vec'), axis=1)

print("Similarity features created.")

Calculating similarity features...
Similarity features created.
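
The row-by-row `apply` in the cell above calls `cosine_similarity` once per row, which gets slow on larger datasets. As an alternative sketch (not the notebook's method), the same row-wise similarity can be computed in one vectorized NumPy pass. `rowwise_cosine` and its `eps` guard are names introduced here for illustration.

```python
import numpy as np


def rowwise_cosine(a, b, eps=1e-12):
    """Cosine similarity between matching rows of two (n, d) arrays.

    eps is an illustrative guard against division by zero; an all-zero
    row yields a similarity of 0.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    dots = np.einsum("ij,ij->i", a, b)  # per-row dot products
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dots / np.maximum(norms, eps)


# Possible use with the notebook's columns (each cell holds a list of floats):
# body = np.array(df["body_vec"].tolist())
# pos1 = np.array(df["positive_example_1_vec"].tolist())
# df["sim_pos_1"] = rowwise_cosine(body, pos1)
```

Each embedding column is stacked into a single matrix first, so the per-row Python overhead disappears.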


In [5]:
for df_ in [df, test_df]:
    df_['sim_pos_avg'] = (df_['sim_pos_1'] + df_['sim_pos_2']) / 2
    df_['sim_neg_avg'] = (df_['sim_neg_1'] + df_['sim_neg_2']) / 2
    df_['sim_diff'] = df_['sim_pos_avg'] - df_['sim_neg_avg']
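
To make the aggregate features concrete, here is a small worked example for a single row. The similarity values are made up for illustration and are not taken from the dataset; `aggregate_similarities` is a hypothetical helper that mirrors the arithmetic in the cell above.

```python
def aggregate_similarities(sim_pos_1, sim_pos_2, sim_neg_1, sim_neg_2):
    """Mirror the notebook's aggregate features for one row."""
    sim_pos_avg = (sim_pos_1 + sim_pos_2) / 2
    sim_neg_avg = (sim_neg_1 + sim_neg_2) / 2
    return {
        "sim_pos_avg": sim_pos_avg,
        "sim_neg_avg": sim_neg_avg,
        "sim_diff": sim_pos_avg - sim_neg_avg,
    }


# Illustrative values only
example = aggregate_similarities(0.75, 0.25, 0.5, 0.25)
print(example)
# {'sim_pos_avg': 0.5, 'sim_neg_avg': 0.375, 'sim_diff': 0.125}
```

A positive `sim_diff` means the body is, on average, closer to the positive examples than to the negative ones.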

### 4. Train Model on New Features

In [None]:
# Similarity features used for training
features = ['sim_pos_1', 'sim_pos_2', 'sim_neg_1', 'sim_neg_2', 
            'sim_pos_avg', 'sim_neg_avg', 'sim_diff']
X = df[features]
y = df['rule_violation']
X_test = test_df[features]

NFOLDS = 5
skf = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=42)

oof_preds = np.zeros((len(df),))
test_preds = np.zeros((len(test_df),))

for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    print(f"===== FOLD {fold+1} =====")
    X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
    X_val, y_val = X.iloc[val_idx], y.iloc[val_idx]
    
    model = lgb.LGBMClassifier(objective='binary', random_state=42, n_estimators=500)
    model.fit(X_train, y_train, 
              eval_set=[(X_val, y_val)], 
              eval_metric='auc', 
              callbacks=[lgb.early_stopping(100, verbose=False)])
    
    val_fold_preds = model.predict_proba(X_val)[:, 1]
    test_fold_preds = model.predict_proba(X_test)[:, 1]
    
    oof_preds[val_idx] = val_fold_preds
    test_preds += test_fold_preds / NFOLDS

overall_cv_score = roc_auc_score(y, oof_preds)
print(f"\nOverall CV AUC Score on Similarity Features: {overall_cv_score:.4f}")

===== FOLD 1 =====
[LightGBM] [Info] Number of positive: 1010, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000372 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1785
[LightGBM] [Info] Number of data points in the train set: 1988, number of used features: 7
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508048 -> initscore=0.032196
[LightGBM] [Info] Start training from score 0.032196
===== FOLD 2 =====
[LightGBM] [Info] Number of positive: 1010, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000162 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1785
[LightGBM] [Info] Number of data points in the train set: 1988, number of used features: 7
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508048 -> initscore=0.032196
[LightGBM] [Info] Start training from sco
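
The training loop accumulates test predictions with `test_preds += test_fold_preds / NFOLDS`, which amounts to taking the mean of the per-fold predictions. The sketch below isolates that step with made-up numbers. `average_fold_predictions` is a hypothetical helper, not something the notebook defines.

```python
import numpy as np


def average_fold_predictions(fold_preds):
    """Blend per-fold test predictions the same way the training loop does."""
    fold_preds = [np.asarray(p, dtype=float) for p in fold_preds]
    n_folds = len(fold_preds)
    blended = np.zeros_like(fold_preds[0])
    for preds in fold_preds:
        blended += preds / n_folds  # same update as in the notebook's loop
    return blended


# Illustrative predictions from three folds on two test rows
folds = [[0.25, 0.5], [0.5, 0.75], [0.75, 1.0]]
blended = average_fold_predictions(folds)
```

Out-of-fold predictions work differently: each training row is scored once, by the one model that did not see it, so `oof_preds` is filled by index instead of averaged.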

### 5. Create Final Submission

In [7]:
submission_df = pd.DataFrame({
    'row_id': test_df['row_id'],
    'rule_violation': test_preds
})
submission_df.to_csv('submission.csv', index=False)

print("SUCCESS: New submission.csv has been generated.")
print(submission_df.head())

SUCCESS: New submission.csv has been generated.
   row_id  rule_violation
0    2029        0.480263
1    2030        0.454711
2    2031        0.522667
3    2032        0.521842
4    2033        0.596411
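
Before uploading, it can help to sanity-check the file. The sketch below is a hypothetical `check_submission` helper. It assumes the submission needs exactly the two columns written above, one row per test row, and probabilities in [0, 1]. Those assumptions come from this notebook's own output, not from the competition rules, so adjust them if the rules differ.

```python
import pandas as pd


def check_submission(submission, expected_rows,
                     id_col="row_id", target_col="rule_violation"):
    """Return a list of problems found in a submission frame (empty if none)."""
    problems = []
    if list(submission.columns) != [id_col, target_col]:
        problems.append(f"unexpected columns: {list(submission.columns)}")
        return problems  # remaining checks need both columns
    if len(submission) != expected_rows:
        problems.append(f"expected {expected_rows} rows, got {len(submission)}")
    if submission[id_col].duplicated().any():
        problems.append("duplicate ids")
    preds = submission[target_col]
    if preds.isna().any():
        problems.append("missing predictions")
    elif not preds.between(0.0, 1.0).all():
        problems.append("predictions outside [0, 1]")
    return problems


# Possible use with the notebook's objects:
# print(check_submission(pd.read_csv("submission.csv"), len(test_df)))
```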
