# Walk Phase: A More Robust Baseline with LightGBM

Our simple baseline showed that the hidden test set differs substantially from the training data, so we need a pipeline that generalizes better.

**This notebook improves on the baseline in three key ways:**
1.  **Text Cleaning:** We add a function that lowercases the text and strips noise such as URLs, punctuation, and digits, so the model can focus on meaningful words.
2.  **Better Features (N-grams):** We configure our `TfidfVectorizer` to include two-word phrases as well as single words (`ngram_range=(1, 2)`), which captures more context than single words alone.
3.  **A More Powerful Model (LightGBM):** We replace `LogisticRegression` with `LightGBM`, a gradient boosting library widely used in Kaggle competitions for its ability to learn non-linear patterns.
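
To make the second point concrete, here is a minimal pure-Python sketch of what unigrams plus bigrams look like for one short phrase. The `word_ngrams` helper is hypothetical and uses plain whitespace splitting; `TfidfVectorizer` applies its own tokenizer and stop-word filtering, so its real vocabulary will differ.

```python
def word_ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams of length n_min..n_max, shortest first."""
    tokens = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens across the sentence
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

example = "no legal advice"
print(word_ngrams(example))
# ['no', 'legal', 'advice', 'no legal', 'legal advice']
```

The bigram `legal advice` carries meaning that neither `legal` nor `advice` conveys alone, which is the extra context the list above refers to.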

### 1. Setup and Data Loading

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import re
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score

In [2]:
# Load the datasets
df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

### 2. Feature Engineering and Text Cleaning

In [3]:
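# Define the text cleaning function
def clean_text(text):
    """Lowercase the text, then strip URLs and every character that is not a-z or whitespace."""
    text = str(text).lower()  # Convert to lowercase; str() guards against non-string input
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # Remove URLs
    text = re.sub(r'[^a-z\s]', '', text)  # Drop digits, punctuation, and brackets
    return text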
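
A quick check of `clean_text` on a made-up string (illustrative only, not taken from the dataset) shows two side effects worth knowing: digits are dropped along with punctuation, and the brackets around a `[SEP]` marker are stripped, leaving the ordinary token `sep`.

```python
import re

def clean_text(text):
    # Same logic as the notebook's cleaning function
    text = str(text).lower()
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'[^a-z\s]', '', text)
    return text

sample = "No spam. [SEP] Check http://example.com NOW!!! 50% off"
cleaned = clean_text(sample)

# Split on whitespace so leftover double spaces do not clutter the output
print(cleaned.split())
# ['no', 'spam', 'sep', 'check', 'now', 'off']
```

Because the separator survives only as a regular word, bigrams can span the boundary between the rule and the body. Whether that helps or hurts the score is not tested here.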

In [4]:
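# Create the initial combined_text feature.
# fillna('') is a defensive step: a missing rule or body would otherwise turn the whole concatenation into NaN.
df['combined_text'] = df['rule'].fillna('') + " [SEP] " + df['body'].fillna('')
test_df['combined_text'] = test_df['rule'].fillna('') + " [SEP] " + test_df['body'].fillna('')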

# Apply the cleaning function
print("Cleaning text data...")
df['cleaned_text'] = df['combined_text'].apply(clean_text)
test_df['cleaned_text'] = test_df['combined_text'].apply(clean_text)
print("Cleaning complete.")

Cleaning text data...
Cleaning complete.


### 3. Model Training and Validation
First, we hold out 20% of the training data to get a local validation AUC for this stronger pipeline.

In [5]:
# Define features (X) and target (y)
X = df['cleaned_text']
y = df['rule_violation']

# Create a training and a validation set
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

In [6]:
# Initialize the improved TfidfVectorizer
vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),      # Use 1-word and 2-word phrases
    max_features=15000,      # Limit vocabulary size to the top 15k features
    stop_words='english'     # Remove common English stop words
)

# Fit and transform the training data, then transform the validation data
X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)
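
The vectorizer is fitted on the training split only, so any term that first appears in the validation or test text is ignored at transform time. The sketch below uses hypothetical helpers (`build_vocab`, `oov_rate`) and toy sentences rather than the competition data to show one rough way to estimate how much of a new text's vocabulary falls outside the training vocabulary.

```python
def build_vocab(texts):
    """Collect the set of whitespace-separated tokens seen in `texts`."""
    vocab = set()
    for text in texts:
        vocab.update(text.split())
    return vocab

def oov_rate(train_texts, new_texts):
    """Fraction of distinct tokens in `new_texts` never seen in `train_texts`."""
    train_vocab = build_vocab(train_texts)
    new_vocab = build_vocab(new_texts)
    if not new_vocab:
        return 0.0
    unseen = new_vocab - train_vocab
    return len(unseen) / len(new_vocab)

train_texts = ["no spam allowed", "no legal advice"]
new_texts = ["no medical advice", "spam links"]

# 'medical' and 'links' are unseen, out of 5 distinct new tokens
print(oov_rate(train_texts, new_texts))  # 0.4
```

Applying the same idea to `df['cleaned_text']` and `test_df['cleaned_text']` would give a rough indication of how far the test vocabulary drifts from training.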

In [7]:
# Initialize and train the LightGBM model
print("Training LightGBM model for validation...")
lgbm = lgb.LGBMClassifier(objective='binary', random_state=42)
lgbm.fit(X_train_vec, y_train)

# Make predictions and evaluate
val_preds = lgbm.predict_proba(X_val_vec)[:, 1]
auc_score = roc_auc_score(y_val, val_preds)

print(f"New Validation AUC Score with LightGBM: {auc_score:.4f}")

Training LightGBM model for validation...
[LightGBM] [Info] Number of positive: 825, number of negative: 798
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001571 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 8906
[LightGBM] [Info] Number of data points in the train set: 1623, number of used features: 184
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508318 -> initscore=0.033275
[LightGBM] [Info] Start training from score 0.033275
New Validation AUC Score with LightGBM: 0.7763
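
For intuition about the metric: ROC AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by brute force over all pairs, using made-up labels and scores. `pairwise_auc` is a hypothetical helper; `roc_auc_score` remains the function this notebook actually uses.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 (positive, negative) pairs are ordered correctly
print(pairwise_auc(y_true, scores))  # 0.75
```

Read this way, a validation AUC of 0.7763 means the model ranks a violating comment above a non-violating one roughly 78% of the time on the held-out split.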


### 4. Final Submission Pipeline
Now we'll use this improved pipeline to train on all the data and generate a new submission file.

In [8]:
# Step 1: Define final data
X_full = df['cleaned_text']
y_full = df['rule_violation']
X_test = test_df['cleaned_text']

In [9]:
# Step 2: Initialize a NEW vectorizer and model for the final submission
final_vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),
    max_features=15000,
    stop_words='english'
)
final_model = lgb.LGBMClassifier(objective='binary', random_state=42)

In [10]:
# Step 3: Fit vectorizer and train model on ALL training data
print("Training final LightGBM model on the full dataset...")
X_full_vec = final_vectorizer.fit_transform(X_full)
final_model.fit(X_full_vec, y_full)
print("Training complete.")

Training final LightGBM model on the full dataset...
[LightGBM] [Info] Number of positive: 1031, number of negative: 998
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001705 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9727
[LightGBM] [Info] Number of data points in the train set: 2029, number of used features: 246
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.508132 -> initscore=0.032531
[LightGBM] [Info] Start training from score 0.032531
Training complete.
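
The LightGBM log lines can be checked by hand. `pavg` is the share of positive labels, and for the binary objective the reported `initscore` matches the log-odds of that share. The arithmetic below uses only the class counts printed in the two training logs in this notebook; `init_score` is a hypothetical helper name.

```python
import math

def init_score(n_pos, n_neg):
    """Return (positive share, log-odds of the positive share)."""
    pavg = n_pos / (n_pos + n_neg)
    return pavg, math.log(pavg / (1.0 - pavg))

# Class counts copied from the two training logs
runs = [("train split", 825, 798), ("full data", 1031, 998)]
for name, n_pos, n_neg in runs:
    pavg, score = init_score(n_pos, n_neg)
    print(f"{name}: pavg={pavg:.6f} initscore={score:.6f}")
```

The nearly identical positive share in both runs is consistent with the `stratify=y` split preserving the class balance. The same logs report only 184 and 246 used features despite the 15,000-term vocabulary cap; a likely cause, not verified here, is that LightGBM discards columns too sparse to satisfy its minimum-samples-per-leaf constraint.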


In [11]:
# Step 4: Make predictions on the test set
X_test_vec = final_vectorizer.transform(X_test)
test_predictions = final_model.predict_proba(X_test_vec)[:, 1]



In [12]:
# Step 5: Create and save the submission file
submission_df = pd.DataFrame({
    'row_id': test_df['row_id'],
    'rule_violation': test_predictions
})
submission_df.to_csv('submission_lgbm.csv', index=False)

print("\nSUCCESS: New submission_lgbm.csv has been generated.")
print("Here are the first 5 predictions:")
print(submission_df.head())


SUCCESS: New submission_lgbm.csv has been generated.
Here are the first 5 predictions:
   row_id  rule_violation
0    2029        0.341977
1    2030        0.669740
2    2031        0.851339
3    2032        0.520972
4    2033        0.931842
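
Before uploading, it can help to sanity-check the file. The sketch below uses only the standard library and an in-memory copy of the five rows printed above. `check_submission` is a hypothetical helper, and the requirements it enforces (column names, unique ids, values in [0, 1]) are assumptions based on this notebook's output rather than on the official competition rules.

```python
import csv
import io

def check_submission(handle):
    """Validate header, unique ids, and probability range; return the row count."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != ["row_id", "rule_violation"]:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    seen_ids = set()
    for row in reader:
        row_id = int(row["row_id"])
        prob = float(row["rule_violation"])
        if row_id in seen_ids:
            raise ValueError(f"duplicate row_id: {row_id}")
        if not 0.0 <= prob <= 1.0:
            raise ValueError(f"probability out of range for row_id {row_id}: {prob}")
        seen_ids.add(row_id)
    return len(seen_ids)

# The first five predictions shown in the output above
sample_csv = io.StringIO(
    "row_id,rule_violation\n"
    "2029,0.341977\n"
    "2030,0.669740\n"
    "2031,0.851339\n"
    "2032,0.520972\n"
    "2033,0.931842\n"
)
print(check_submission(sample_csv))  # 5
```

For the real file, pass `open('submission_lgbm.csv', newline='')` in place of the `StringIO` object.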
