# Run Phase: Tackling Overfitting with Cross-Validation and Regularization

Our previous models performed well locally but failed to generalize to the hidden test set, which is a classic sign of overfitting. This notebook introduces a more robust training and validation strategy to address it.

**Key Upgrades:**
1.  **Cross-Validation (`StratifiedKFold`):** Instead of a single train/validation split, we use 50-fold stratified cross-validation (`NFOLDS = 50`). Every training row is scored exactly once by a model that never saw it, which gives a more reliable estimate of performance on unseen data than a single split.
2.  **Regularization:** We set parameters on both our `TfidfVectorizer` and `LGBMClassifier` that deliberately constrain them (a capped vocabulary, fewer leaves, L1/L2 penalties), which pushes them toward more general patterns.
3.  **Ensembling:** Our final submission is the average of the test predictions from the 50 models trained during cross-validation. Averaging reduces the influence of any single model's quirks.
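
To make the stratification in item 1 concrete, here is a minimal sketch on synthetic labels. The names `y_toy`, `X_toy`, and `skf_toy` are hypothetical, the 60/40 class split is invented, and 5 splits are used only to keep the toy example small.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels (assumption): 60 negatives and 40 positives
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.zeros((len(y_toy), 1))  # features do not influence the split

skf_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_pos_rates = []
for train_idx, val_idx in skf_toy.split(X_toy, y_toy):
    # Share of positives in this validation fold
    fold_pos_rates.append(float(y_toy[val_idx].mean()))

# Every validation fold keeps the overall 40% positive rate
print([round(rate, 2) for rate in fold_pos_rates])
```

A plain `KFold` gives no such guarantee, so on a small dataset the class balance could drift from fold to fold.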

### 1. Setup and Data Loading

In [1]:
import pandas as pd
import numpy as np
import re
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score

In [2]:
# Load the datasets
df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

### 2. Text Cleaning and Feature Engineering

In [3]:
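# Same cleaning function as before
def clean_text(text):
    """Lowercase the text, strip URLs, and keep only letters and whitespace."""
    text = str(text).lower()
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # drop URLs
    text = re.sub(r'[^a-z\s]', '', text)               # drop digits and punctuation
    return text

# Apply cleaning to the concatenated rule and comment body.
# Note: lowercasing and punctuation stripping reduce the "[SEP]" marker to the plain token "sep".
df['cleaned_text'] = (df['rule'] + " [SEP] " + df['body']).apply(clean_text)
test_df['cleaned_text'] = (test_df['rule'] + " [SEP] " + test_df['body']).apply(clean_text)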
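
As a quick check of what the cleaning step does to one row, the sketch below runs the same function on a made-up rule/body pair. The `sample` string is hypothetical and is not taken from the dataset.

```python
import re

def clean_text(text):
    # Same steps as the notebook's cleaning function
    text = str(text).lower()
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'[^a-z\s]', '', text)
    return text

# Hypothetical rule and body, joined the same way as in the notebook
sample = "No spam." + " [SEP] " + "Visit https://example.com NOW!!! 50% off"
cleaned = clean_text(sample)

# Split on whitespace to see the tokens the vectorizer will receive
print(cleaned.split())
# → ['no', 'spam', 'sep', 'visit', 'now', 'off']
```

The URL, the digits, and the brackets around the separator are gone, so the separator survives only as the ordinary word `sep`.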

### 3. Cross-Validation and Model Training

In [4]:
# Define features (X) and target (y)
X = df['cleaned_text']
y = df['rule_violation']
X_test = test_df['cleaned_text']

# --- Cross-Validation Setup ---
NFOLDS = 50
skf = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=42)

# --- Model Training Loop ---
oof_preds = np.zeros((len(df),))
test_preds = np.zeros((len(test_df),))

for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    print(f"===== FOLD {fold+1} =====")
    
    # Split data for this fold
    X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
    X_val, y_val = X.iloc[val_idx], y.iloc[val_idx]
    
    # --- Vectorizer with Regularization ---
    vectorizer = TfidfVectorizer(
        ngram_range=(1, 2),
        max_features=8000,  # Reduced features to regularize
        stop_words='english'
    )
    
    X_train_vec = vectorizer.fit_transform(X_train)
    X_val_vec = vectorizer.transform(X_val)
    X_test_vec = vectorizer.transform(X_test)
    
    # --- LightGBM with Regularization ---
    lgbm = lgb.LGBMClassifier(
        objective='binary',
        random_state=42,
        n_estimators=500,       # Upper bound on trees; early stopping may stop sooner
        learning_rate=0.05,
        num_leaves=20,          # Reduced complexity
        reg_alpha=0.1,          # L1 Regularization
        reg_lambda=0.1          # L2 Regularization
    )
    
    lgbm.fit(X_train_vec, y_train,
             eval_set=[(X_val_vec, y_val)],
             eval_metric='auc',
             callbacks=[lgb.early_stopping(100, verbose=False)])
    
    # --- Make Predictions ---
    val_fold_preds = lgbm.predict_proba(X_val_vec)[:, 1]
    test_fold_preds = lgbm.predict_proba(X_test_vec)[:, 1]
    
    # Store predictions
    oof_preds[val_idx] = val_fold_preds
    test_preds += test_fold_preds / NFOLDS # Average test predictions across folds

# Calculate the overall Out-of-Fold (OOF) CV score
overall_cv_score = roc_auc_score(y, oof_preds)
print(f"\nOverall CV AUC Score: {overall_cv_score:.4f}")

===== FOLD 1 =====
[LightGBM] [Info] Number of positive: 1010, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001897 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9637
[LightGBM] [Info] Number of data points in the train set: 1988, number of used features: 241
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508048 -> initscore=0.032196
[LightGBM] [Info] Start training from score 0.032196
===== FOLD 2 =====
[LightGBM] [Info] Number of positive: 1010, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002190 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9641
[LightGBM] [Info] Number of data points in the train set: 1988, number of used features: 241
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508048 -> initscore=0.032196
[LightGBM] [Info] Start training from score 0.032196
===== FOLD 3 =====
[LightGBM] [Info] Number of positive: 1010, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001824 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9650
[LightGBM] [Info] Number of data points in the train set: 1988, number of used features: 241
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508048 -> initscore=0.032196
[LightGBM] [Info] Start training from score 0.032196
===== FOLD 4 =====
[LightGBM] [Info] Number of positive: 1010, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001961 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9620
[LightGBM] [Info] Number of data points in the train set: 1988, number of used features: 237
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508048 -> initscore=0.032196
[LightGBM] [Info] Start training from score 0.032196
===== FOLD 5 =====
[LightGBM] [Info] Number of positive: 1010, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002316 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9658
[LightGBM] [Info] Number of data points in the train set: 1988, number of used features: 244
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508048 -> initscore=0.032196
[LightGBM] [Info] Start training from score 0.032196
===== FOLD 6 =====
[LightGBM] [Info] Number of positive: 1010, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001733 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9666
[LightGBM] [Info] Number of data points in the train set: 1988, number of used features: 243
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508048 -> initscore=0.032196
[LightGBM] [Info] Start training from score 0.032196
===== FOLD 7 =====
[LightGBM] [Info] Number of positive: 1010, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001539 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9660
[LightGBM] [Info] Number of data points in the train set: 1988, number of used features: 242
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508048 -> initscore=0.032196
[LightGBM] [Info] Start training from score 0.032196
===== FOLD 8 =====
[LightGBM] [Info] Number of positive: 1010, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001582 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9647
[LightGBM] [Info] Number of data points in the train set: 1988, number of used features: 239
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508048 -> initscore=0.032196
[LightGBM] [Info] Start training from score 0.032196
===== FOLD 9 =====
[LightGBM] [Info] Number of positive: 1010, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001840 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9628
[LightGBM] [Info] Number of data points in the train set: 1988, number of used features: 239
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508048 -> initscore=0.032196
[LightGBM] [Info] Start training from
===== FOLD 10 =====
[LightGBM] [Info] Number of positive: 1010, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001709 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9638
[LightGBM] [Info] Number of data points in the train set: 1988, number of used features: 241
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508048 -> initscore=0.032196
[LightGBM] [Info] Start training from score 0.032196
===== FOLD 11 =====
[LightGBM] [Info] Number of positive: 1010, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001876 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9670
[LightGBM] [Info] Number of data points in the train set: 1988, number of used features: 243
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508048 -> initscore=0.032196
[LightGBM] [Info] Start training from score 0.032196
===== FOLD 12 =====
[LightGBM] [Info] Number of positive: 1010, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001660 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9645
[LightGBM] [Info] Number of data points in the train set: 1988, number of used features: 239
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508048 -> initscore=0.032196
[LightGBM] [Info] Start training from score 0.032196
===== FOLD 13 =====
[LightGBM] [Info] Number of positive: 1010, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001900 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9645
[LightGBM] [Info] Number of data points in the train set: 1988, number of used features: 241
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508048 -> initscore=0.032196
[LightGBM] [Info] Start training from score 0.032196
===== FOLD 14 =====
[LightGBM] [Info] Number of positive: 1010, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001777 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9660
[LightGBM] [Info] Number of data points in the train set: 1988, number of used features: 243
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508048 -> initscore=0.032196
[LightGBM] [Info] Start training from score 0.032196
===== FOLD 15 =====
[LightGBM] [Info] Number of positive: 1010, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001897 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9656
[LightGBM] [Info] Number of data points in the train set: 1988, number of used features: 243
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508048 -> initscore=0.032196
[LightGBM] [Info] Start training fr
===== FOLD 16 =====
[LightGBM] [Info] Number of positive: 1010, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001571 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9620
[LightGBM] [Info] Number of data points in the train set: 1988, number of used features: 240
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508048 -> initscore=0.032196
[LightGBM] [Info] Start training from score 0.032196
===== FOLD 17 =====
[LightGBM] [Info] Number of positive: 1010, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001663 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9646
[LightGBM] [Info] Number of data points in the train set: 1988, number of used features: 242
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508048 -> initscore=0.032196
[LightGBM] [Info] Start training from score 0.032196
===== FOLD 18 =====
[LightGBM] [Info] Number of positive: 1010, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002834 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9670
[LightGBM] [Info] Number of data points in the train set: 1988, number of used features: 243
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508048 -> initscore=0.032196
[LightGBM] [Info] Start training fr
===== FOLD 19 =====
[LightGBM] [Info] Number of positive: 1010, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003079 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9639
[LightGBM] [Info] Number of data points in the train set: 1988, number of used features: 242
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508048 -> initscore=0.032196
[LightGBM] [Info] Start training from score 0.032196
===== FOLD 20 =====
[LightGBM] [Info] Number of positive: 1010, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002334 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9654
[LightGBM] [Info] Number of data points in the train set: 1988, number of used features: 242
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508048 -> initscore=0.032196
[LightGBM] [Info] Start training from score 0.032196
===== FOLD 21 =====
[LightGBM] [Info] Number of positive: 1010, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001688 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9665
[LightGBM] [Info] Number of data points in the train set: 1988, number of used features: 243
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508048 -> initscore=0.032196
[LightGBM] [Info] Start training from score 0.032196
===== FOLD 22 =====
[LightGBM] [Info] Number of positive: 1010, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001845 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9671
[LightGBM] [Info] Number of data points in the train set: 1988, number of used features: 245
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508048 -> initscore=0.032196
[LightGBM] [Info] Start training from score 0.032196
===== FOLD 23 =====
[LightGBM] [Info] Number of positive: 1010, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001604 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9655
[LightGBM] [Info] Number of data points in the train set: 1988, number of used features: 242
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508048 -> initscore=0.032196
[LightGBM] [Info] Start training from score 0.032196
===== FOLD 24 =====
[LightGBM] [Info] Number of positive: 1010, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002023 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9631
[LightGBM] [Info] Number of data points in the train set: 1988, number of used features: 240
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508048 -> initscore=0.032196
[LightGBM] [Info] Start training from score 0.032196
===== FOLD 25 =====
[LightGBM] [Info] Number of positive: 1010, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001573 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9664
[LightGBM] [Info] Number of data points in the train set: 1988, number of used features: 242
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508048 -> initscore=0.032196
[LightGBM] [Info] Start training from score 0.032196
===== FOLD 26 =====
[LightGBM] [Info] Number of positive: 1010, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001632 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9658
[LightGBM] [Info] Number of data points in the train set: 1988, number of used features: 242
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508048 -> initscore=0.032196
[LightGBM] [Info] Start training from score 0.032196
===== FOLD 27 =====
[LightGBM] [Info] Number of positive: 1010, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002541 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9645
[LightGBM] [Info] Number of data points in the train set: 1988, number of used features: 241
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508048 -> initscore=0.032196
[LightGBM] [Info] Start training from score 0.032196
===== FOLD 28 =====
[LightGBM] [Info] Number of positive: 1010, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001902 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9626
[LightGBM] [Info] Number of data points in the train set: 1988, number of used features: 238
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508048 -> initscore=0.032196
[LightGBM] [Info] Start training from score 0.032196
===== FOLD 29 =====
[LightGBM] [Info] Number of positive: 1010, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001602 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9644
[LightGBM] [Info] Number of data points in the train set: 1988, number of used features: 241
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508048 -> initscore=0.032196
[LightGBM] [Info] Start training from score 0.032196
===== FOLD 30 =====
[LightGBM] [Info] Number of positive: 1011, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001836 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9653
[LightGBM] [Info] Number of data points in the train set: 1989, number of used features: 242
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508296 -> initscore=0.033186
[LightGBM] [Info] Start training from score 0.033186
===== FOLD 31 =====
[LightGBM] [Info] Number of positive: 1011, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001651 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9616
[LightGBM] [Info] Number of data points in the train set: 1989, number of used features: 238
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508296 -> initscore=0.033186
[LightGBM] [Info] Start training from score 0.033186
===== FOLD 32 =====
[LightGBM] [Info] Number of positive: 1011, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001637 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9645
[LightGBM] [Info] Number of data points in the train set: 1989, number of used features: 242
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508296 -> initscore=0.033186
[LightGBM] [Info] Start training from score 0.033186
===== FOLD 33 =====
[LightGBM] [Info] Number of positive: 1011, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002078 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9649
[LightGBM] [Info] Number of data points in the train set: 1989, number of used features: 241
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508296 -> initscore=0.033186
[LightGBM] [Info] Start training from score 0.033186
===== FOLD 34 =====
[LightGBM] [Info] Number of positive: 1011, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001753 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9666
[LightGBM] [Info] Number of data points in the train set: 1989, number of used features: 243
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508296 -> initscore=0.033186
[LightGBM] [Info] Start training from score 0.033186
===== FOLD 35 =====
[LightGBM] [Info] Number of positive: 1011, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001881 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9688
[LightGBM] [Info] Number of data points in the train set: 1989, number of used features: 245
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508296 -> initscore=0.033186
[LightGBM] [Info] Start training from score 0.033186
===== FOLD 36 =====
[LightGBM] [Info] Number of positive: 1011, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001644 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9673
[LightGBM] [Info] Number of data points in the train set: 1989, number of used features: 245
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508296 -> initscore=0.033186
[LightGBM] [Info] Start training from score 0.033186
===== FOLD 37 =====
[LightGBM] [Info] Number of positive: 1011, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001630 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9655
[LightGBM] [Info] Number of data points in the train set: 1989, number of used features: 243
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508296 -> initscore=0.033186
[LightGBM] [Info] Start training from score 0.033186
===== FOLD 38 =====
[LightGBM] [Info] Number of positive: 1011, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002949 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9641
[LightGBM] [Info] Number of data points in the train set: 1989, number of used features: 242
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508296 -> initscore=0.033186
[LightGBM] [Info] Start training from score 0.033186
===== FOLD 39 =====
[LightGBM] [Info] Number of positive: 1011, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002022 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9655
[LightGBM] [Info] Number of data points in the train set: 1989, number of used features: 242
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508296 -> initscore=0.033186
[LightGBM] [Info] Start training from score 0.033186
===== FOLD 40 =====
[LightGBM] [Info] Number of positive: 1011, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001890 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9658
[LightGBM] [Info] Number of data points in the train set: 1989, number of used features: 241
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508296 -> initscore=0.033186
[LightGBM] [Info] Start training from score 0.033186
===== FOLD 41 =====
[LightGBM] [Info] Number of positive: 1011, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001777 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9655
[LightGBM] [Info] Number of data points in the train set: 1989, number of used features: 241
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508296 -> initscore=0.033186
[LightGBM] [Info] Start training from score 0.033186
===== FOLD 42 =====
[LightGBM] [Info] Number of positive: 1011, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001523 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9633
[LightGBM] [Info] Number of data points in the train set: 1989, number of used features: 239
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508296 -> initscore=0.033186
[LightGBM] [Info] Start training from score 0.033186
===== FOLD 43 =====
[LightGBM] [Info] Number of positive: 1011, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001811 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9658
[LightGBM] [Info] Number of data points in the train set: 1989, number of used features: 242
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508296 -> initscore=0.033186
[LightGBM] [Info] Start training from score 0.033186
===== FOLD 44 =====
[LightGBM] [Info] Number of positive: 1011, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002107 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9642
[LightGBM] [Info] Number of data points in the train set: 1989, number of used features: 239
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508296 -> initscore=0.033186
[LightGBM] [Info] Start training from score 0.033186
===== FOLD 45 =====
[LightGBM] [Info] Number of positive: 1011, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001605 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9653
[LightGBM] [Info] Number of data points in the train set: 1989, number of used features: 243
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508296 -> initscore=0.033186
[LightGBM] [Info] Start training from score 0.033186
===== FOLD 46 =====
[LightGBM] [Info] Number of positive: 1011, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002541 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9615
[LightGBM] [Info] Number of data points in the train set: 1989, number of used features: 239
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508296 -> initscore=0.033186
[LightGBM] [Info] Start training from score 0.033186
===== FOLD 47 =====
[LightGBM] [Info] Number of positive: 1011, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002498 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9657
[LightGBM] [Info] Number of data points in the train set: 1989, number of used features: 242
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508296 -> initscore=0.033186
[LightGBM] [Info] Start training fr
===== FOLD 48 =====
[LightGBM] [Info] Number of positive: 1011, number of negative: 978
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001664 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9648
[LightGBM] [Info] Number of data points in the train set: 1989, number of used features: 240
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508296 -> initscore=0.033186
[LightGBM] [Info] Start training from score 0.033186
===== FOLD 49 =====
[LightGBM] [Info] Number of positive: 1010, number of negative: 979
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001537 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9668
[LightGBM] [Info] Number of data points in the train set: 1989, number of used features: 245
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.507793 -> initscore=0.031174
[LightGBM] [Info] Start training from score 0.031174
===== FOLD 50 =====
[LightGBM] [Info] Number of positive: 1010, number of negative: 979
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001644 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9665
[LightGBM] [Info] Number of data points in the train set: 1989, number of used features: 243
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.507793 -> initscore=0.031174
[LightGBM] [Info] Start training fr
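
The log shows 1,988 or 1,989 training rows per fold, which leaves only about 40 rows in each validation fold. An AUC computed on so few rows is noisy, so it can help to look at the spread of per-fold scores next to the pooled OOF score. The following is a minimal sketch on synthetic data: `per_fold_auc`, `y_demo`, and `oof_demo` are hypothetical names, the scores are randomly generated, and 5 folds are used only to keep the example small.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def per_fold_auc(y_true, preds, val_indices):
    """Return one AUC per validation fold (hypothetical helper)."""
    return [roc_auc_score(y_true[idx], preds[idx]) for idx in val_indices]

rng = np.random.default_rng(0)
y_demo = np.array([0, 1] * 100)  # synthetic labels, 200 rows
# Synthetic "predictions": positives score higher on average than negatives
oof_demo = np.clip(0.5 * y_demo + rng.normal(0.25, 0.25, size=200), 0.0, 1.0)

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
val_indices = [val_idx for _, val_idx in skf_demo.split(np.zeros((200, 1)), y_demo)]

fold_scores = per_fold_auc(y_demo, oof_demo, val_indices)
pooled_auc = roc_auc_score(y_demo, oof_demo)

print(f"Pooled OOF AUC: {pooled_auc:.4f}")
print(f"Per-fold AUC: mean={np.mean(fold_scores):.4f}, std={np.std(fold_scores):.4f}")
```

In the notebook's loop, the same breakdown could be obtained by appending each `val_idx` to a list and calling the helper with `y.to_numpy()` and `oof_preds`. A large standard deviation would suggest that the pooled score is the more trustworthy number.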



### 4. Create Final Submission

In [5]:
# Create submission file from the averaged test predictions
submission_df = pd.DataFrame({
    'row_id': test_df['row_id'],
    'rule_violation': test_preds
})
submission_df.to_csv('submission_cv_lgbm.csv', index=False)

print("SUCCESS: New submission_cv_lgbm.csv has been generated.")
print(submission_df.head())

SUCCESS: New submission_cv_lgbm.csv has been generated.
   row_id  rule_violation
0    2029        0.427059
1    2030        0.616545
2    2031        0.639463
3    2032        0.439648
4    2033        0.669288
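
Before uploading, a few sanity checks on the submission frame can catch silent mistakes such as a dropped row or a prediction outside the probability range. This is a sketch only: `check_submission` is a hypothetical helper, the expected column names are taken from the cell above rather than from any official specification, and the demo frame reuses the five rows printed in the output.

```python
import numpy as np
import pandas as pd

def check_submission(sub, expected_ids):
    """Return a list of problems found in a submission frame (hypothetical helper)."""
    if list(sub.columns) != ["row_id", "rule_violation"]:
        return ["unexpected columns: " + ", ".join(map(str, sub.columns))]
    problems = []
    ids_match = len(sub) == len(expected_ids) and bool(
        (sub["row_id"].to_numpy() == np.asarray(expected_ids)).all()
    )
    if not ids_match:
        problems.append("row_id values do not match the test set")
    preds = sub["rule_violation"]
    if preds.isna().any():
        problems.append("missing predictions")
    if not preds.dropna().between(0.0, 1.0).all():
        problems.append("predictions outside [0, 1]")
    return problems

# The five rows shown by submission_df.head() above
demo_sub = pd.DataFrame({
    "row_id": [2029, 2030, 2031, 2032, 2033],
    "rule_violation": [0.427059, 0.616545, 0.639463, 0.439648, 0.669288],
})
demo_problems = check_submission(demo_sub, [2029, 2030, 2031, 2032, 2033])
print(demo_problems)  # an empty list means no problems were found

# A deliberately broken copy, to show what a failure looks like
bad_sub = demo_sub.assign(rule_violation=[0.1, 1.2, 0.3, 0.4, 0.5])
bad_problems = check_submission(bad_sub, [2029, 2030, 2031, 2032, 2033])
print(bad_problems)
```

In the notebook, the call would be `check_submission(submission_df, test_df['row_id'])`.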
