# Part 1: Setup, EDA, and Baseline Validation
This section explores the data and validates the modeling approach before anything is submitted. We will:
1. Load the libraries and data.
2. Perform Exploratory Data Analysis (EDA) to understand the dataset.
3. Build a baseline model on a *split* of the training data to get a reliable local validation score.

### 1.1 - Setup and Data Loading

In [2]:
# Import standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Set a style for all our plots
sns.set_style("whitegrid")

In [3]:
# Load the datasets
df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
sample_submission = pd.read_csv("sample_submission.csv")

### 1.2 - Exploratory Data Analysis (EDA)

In [4]:
# Create the 'combined_text' feature for analysis and modeling
# This gives the model context about both the rule and the comment body
df['combined_text'] = df['rule'] + " [SEP] " + df['body']
test_df['combined_text'] = test_df['rule'] + " [SEP] " + test_df['body']

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2029 entries, 0 to 2028
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   row_id              2029 non-null   int64 
 1   body                2029 non-null   object
 2   rule                2029 non-null   object
 3   subreddit           2029 non-null   object
 4   positive_example_1  2029 non-null   object
 5   positive_example_2  2029 non-null   object
 6   negative_example_1  2029 non-null   object
 7   negative_example_2  2029 non-null   object
 8   rule_violation      2029 non-null   int64 
 9   combined_text       2029 non-null   object
dtypes: int64(2), object(8)
memory usage: 158.6+ KB


In [5]:
# Check target variable distribution (it's well-balanced)
print("Target Distribution:\n", df['rule_violation'].value_counts(normalize=True))

Target Distribution:
 rule_violation
1    0.508132
0    0.491868
Name: proportion, dtype: float64
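
The overall balance can hide differences between rules. As a minimal sketch, assuming the `rule`, `body`, and `rule_violation` columns shown by `df.info()`, the hypothetical helper `summarize_by_rule` below reports the row count, violation rate, and mean comment length per rule. It runs on a small made-up frame here; passing `df` instead would give the real breakdown.

```python
import pandas as pd


def summarize_by_rule(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-rule row count, violation rate, and mean comment length in characters."""
    return (
        frame.assign(body_chars=frame["body"].str.len())
        .groupby("rule")
        .agg(
            n_rows=("rule_violation", "size"),
            violation_rate=("rule_violation", "mean"),
            mean_body_chars=("body_chars", "mean"),
        )
        .sort_values("n_rows", ascending=False)
    )


# Toy data for illustration only; these rows are not from train.csv.
toy = pd.DataFrame({
    "rule": ["No spam", "No spam", "No legal advice", "No legal advice"],
    "body": ["buy now", "nice post", "sue them", "see a lawyer"],
    "rule_violation": [1, 0, 1, 1],
})

summary = summarize_by_rule(toy)
print(summary)
```

A rule whose violation rate sits far from 0.5 would suggest that the `rule` text alone carries signal, which is worth knowing before interpreting the baseline score.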


### 1.3 - Baseline Model Validation
Here, we check that a simple TF-IDF + Logistic Regression approach gives a reasonable baseline by scoring it on a held-out validation set.

In [6]:
# Define features (X) and target (y)
X = df['combined_text']
y = df['rule_violation']

# Create a training and a validation set
X_train, X_val, y_train, y_val = train_test_split(
    X, y,
    test_size=0.2,       # Use 20% for validation
    random_state=42,     # For reproducibility
    stratify=y           # Keep target balance in both sets
)

In [7]:
# Initialize the vectorizer
vectorizer = TfidfVectorizer()

# Fit the vectorizer on the TRAINING DATA ONLY and transform it
X_train_vec = vectorizer.fit_transform(X_train)

# Use the already-fitted vectorizer to transform the validation data
X_val_vec = vectorizer.transform(X_val)

In [8]:
# Initialize and train the validation model
model = LogisticRegression(max_iter=1000)
model.fit(X_train_vec, y_train)

# Make predictions on the validation set
val_preds = model.predict_proba(X_val_vec)[:, 1]

# Evaluate the performance
auc_score = roc_auc_score(y_val, val_preds)
print(f"Validation AUC Score: {auc_score:.4f}")

Validation AUC Score: 0.8293
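
A single 80/20 split of about 2,000 rows leaves roughly 400 validation rows, so the score above can shift with the random seed. One way to gauge that spread is stratified k-fold cross-validation. The sketch below is an assumption about how this could be done, not part of the original notebook: `cross_validated_auc` is a hypothetical helper, and wrapping the vectorizer and model in a pipeline means the TF-IDF vocabulary is refit inside each training fold. It runs on made-up texts; `df['combined_text']` and `df['rule_violation']` could be passed in their place.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline


def cross_validated_auc(texts, labels, n_splits=5, seed=42):
    """Return one ROC AUC score per stratified fold."""
    # The pipeline refits TF-IDF on each training fold, so no fold leaks vocabulary.
    pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return cross_val_score(pipeline, texts, labels, cv=folds, scoring="roc_auc")


# Toy data for illustration only.
toy_texts = (
    [f"No spam [SEP] buy cheap pills now offer {i}" for i in range(10)]
    + [f"No spam [SEP] thanks for the helpful answer {i}" for i in range(10)]
)
toy_labels = [1] * 10 + [0] * 10

scores = cross_validated_auc(toy_texts, toy_labels)
print(f"AUC per fold: {scores}")
print(f"Mean AUC: {scores.mean():.4f} (std {scores.std():.4f})")
```

Reporting the mean together with the standard deviation makes it easier to tell whether a later change to the features is a real improvement or just split noise.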


# Part 2: Final Submission Pipeline
--- 
**IMPORTANT:** This section creates the final `submission.csv` file. It trains a new model on ALL the training data. For a correct result, always run these cells sequentially after a kernel restart.

**How to Run:**
1. Click **"Kernel"** in the menu bar.
2. Click **"Restart & Run All"**.
3. Wait for all cells to finish executing.
4. Your new `submission.csv` file will be ready.

In [9]:
# Step 1: Define features (X) and target (y) using the FULL training dataset
X_full = df['combined_text']
y_full = df['rule_violation']

# Also define our test data text
X_test = test_df['combined_text']

In [10]:
# Step 2: Initialize a NEW vectorizer and model for the final submission
final_vectorizer = TfidfVectorizer()
final_model = LogisticRegression(max_iter=1000)

In [11]:
# Step 3: Fit the vectorizer and train the model on ALL the training data
print("Training final model on the full dataset...")

# CORRECT: Use .fit_transform() on the full training data to learn the vocabulary
X_full_vec = final_vectorizer.fit_transform(X_full)
final_model.fit(X_full_vec, y_full)

print("Training complete.")

Training final model on the full dataset...
Training complete.
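
Steps 2 and 3 keep the vectorizer and the model as separate objects, which works but relies on calling the right method on the right data. An alternative, sketched here as an assumption rather than as the notebook's method, is scikit-learn's `Pipeline`, which fits both steps with one `fit` call and only transforms at prediction time. The helper name `build_text_pipeline` and the toy texts are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def build_text_pipeline() -> Pipeline:
    """TF-IDF followed by logistic regression, with the same settings as above."""
    return Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])


# Toy data for illustration only.
toy_train_texts = [
    "No spam [SEP] buy cheap pills today",
    "No spam [SEP] thanks for the help",
    "No spam [SEP] cheap pills and free offers",
    "No spam [SEP] great answer, thanks",
]
toy_train_labels = [1, 0, 1, 0]
toy_test_texts = [
    "No spam [SEP] cheap pills here",
    "No spam [SEP] thanks for sharing",
]

pipeline = build_text_pipeline()
pipeline.fit(toy_train_texts, toy_train_labels)       # fits TF-IDF, then the model
toy_probs = pipeline.predict_proba(toy_test_texts)[:, 1]  # transform only, then predict
print(toy_probs)
```

With the real data this would read `pipeline.fit(X_full, y_full)` followed by `pipeline.predict_proba(X_test)[:, 1]`.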


In [12]:
# Step 4: Prepare the test data using the trained vectorizer

# CORRECT: Use .transform() ONLY. This applies the vocabulary learned from the training data.
# Refitting the vectorizer on the test data (e.g. with .fit_transform()) was the source of the original bug.
X_test_vec = final_vectorizer.transform(X_test)
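
To see why `.transform()` matters in Step 4, the sketch below compares the two calls on a few made-up sentences. A vectorizer fitted on the training texts maps test texts into the same columns the model was trained on, while a vectorizer refitted on the test texts builds a different vocabulary, so its columns no longer line up with the model's coefficients.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy texts for illustration only.
train_texts = ["no spam allowed", "spam links get removed", "be civil"]
test_texts = ["spam spam spam", "totally new words"]

# Correct: learn the vocabulary on the training texts, then reuse it.
vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_texts)
test_matrix = vectorizer.transform(test_texts)

# Incorrect: refitting on the test texts learns a separate vocabulary.
refit_vectorizer = TfidfVectorizer()
refit_matrix = refit_vectorizer.fit_transform(test_texts)

print("train columns:       ", train_matrix.shape[1])
print("test columns (reuse):", test_matrix.shape[1])
print("test columns (refit):", refit_matrix.shape[1])

# Words never seen in training are dropped, so this row has no nonzero entries.
print("nonzeros in unseen-word row:", test_matrix[1].nnz)
```

Even when the column counts happen to match, a refitted vocabulary assigns words to different column indices, so the predictions would be meaningless without raising an error.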

In [13]:
# Step 5: Make predictions on the test data
test_predictions = final_model.predict_proba(X_test_vec)[:, 1]

In [15]:
# Step 6: Create and save the submission file
submission_df = pd.DataFrame({
    'row_id': test_df['row_id'],
    'rule_violation': test_predictions
})

submission_df.to_csv('submission.csv', index=False)

print("submission.csv created successfully!")
print(submission_df.head())

submission.csv created successfully!
   row_id  rule_violation
0    2029        0.282724
1    2030        0.480705
2    2031        0.626757
3    2032        0.601275
4    2033        0.711869
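
The notebook loads `sample_submission.csv` but never compares against it. A cheap safeguard before uploading is to check the generated frame against that sample. The sketch below assumes the sample file has the same `row_id` and `rule_violation` columns as `submission_df`; `check_submission` is a hypothetical helper, and the frames are made-up stand-ins.

```python
import pandas as pd


def check_submission(submission: pd.DataFrame, sample: pd.DataFrame) -> list:
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = []
    if list(submission.columns) != list(sample.columns):
        problems.append("column names or order differ from the sample")
    if len(submission) != len(sample):
        problems.append("row count differs from the sample")
    elif "row_id" in submission.columns and "row_id" in sample.columns:
        if (submission["row_id"].to_numpy() != sample["row_id"].to_numpy()).any():
            problems.append("row_id values differ from the sample")
    if "rule_violation" in submission.columns:
        # between() is False for NaN, so missing predictions are caught here too.
        if not submission["rule_violation"].between(0, 1).all():
            problems.append("rule_violation has values outside [0, 1] or missing")
    return problems


# Toy frames for illustration only.
sample = pd.DataFrame({"row_id": [2029, 2030, 2031], "rule_violation": [0.5, 0.5, 0.5]})
good = pd.DataFrame({"row_id": [2029, 2030, 2031], "rule_violation": [0.28, 0.48, 0.63]})
bad = pd.DataFrame({"row_id": [2029, 2030, 2031], "rule_violation": [0.28, 1.40, 0.63]})

print(check_submission(good, sample))
print(check_submission(bad, sample))
```

In the notebook this would be called as `check_submission(submission_df, sample_submission)` right after the file is written.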
