# Machine Learning Pipeline Example

This notebook demonstrates how to build a robust machine learning pipeline using `scikit-learn`. The key benefits of this approach are:

1.  **Preventing Data Leakage**: By splitting the data first and using a pipeline, we ensure that information from the validation/test set doesn't leak into the training process (e.g., when imputing or scaling).
2.  **Consistency**: The same preprocessing steps are guaranteed to be applied to the training, validation, and test data.
3.  **Simplicity & Readability**: It bundles multiple steps into a single object, making the workflow cleaner and easier to manage.
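
The leakage described in point 1 can be illustrated with a small sketch. It uses synthetic data and a plain `StandardScaler` (both are assumptions made for illustration, not part of this notebook's dataset or pipeline) to show that a scaler fit before splitting has seen the held-out rows, while one fit after splitting has not.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic single-feature data, purely illustrative
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 1))

X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=0)

# Leaky: statistics are computed on every row, including the held-out ones
leaky_scaler = StandardScaler().fit(X_demo)

# Correct: statistics are computed on the training rows only
clean_scaler = StandardScaler().fit(X_tr)

print("rows seen by leaky scaler:", leaky_scaler.n_samples_seen_)
print("rows seen by clean scaler:", clean_scaler.n_samples_seen_)
print("mean used by leaky scaler:", leaky_scaler.mean_[0])
print("mean used by clean scaler:", clean_scaler.mean_[0])
```

Placing the scaler inside a `Pipeline` and calling `.fit()` on the training split gives the second behavior without any manual bookkeeping.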

## 1. Imports

First, we import all the necessary libraries. We'll need `pandas` for data manipulation and various modules from `scikit-learn` and `xgboost` to build our pipeline and model.

In [24]:
import pandas as pd
from sklearn.compose import ColumnTransformer
# IterativeImputer is experimental: this enabling import must run before importing it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler  # used by the numeric transformer in section 5
from xgboost import XGBClassifier

## 2. Load Data

Next, we load the raw training data from `train.csv`. It's important to start with the original, unprocessed data.

In [25]:
try:
    # This is the original data, before any preprocessing
    df = pd.read_csv('train.csv')
    print('Data loaded successfully.')
except FileNotFoundError:
    print("Make sure 'train.csv' is in the same directory as this notebook.")

Data loaded successfully.


## 2.1 Feature Engineering

In [None]:
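def feature_engineer(data):
    """Return a copy of `data` with three derived features added."""
    data = data.copy()  # avoid mutating the caller's DataFrame
    epsilon = 1e-6  # keeps the division defined when account_age_days is 0
    data['reports_per_day'] = data['reports_received'] / (data['account_age_days'] + epsilon)
    data['kdr_x_hs'] = data['kill_death_ratio'] * data['headshot_percentage']
    data['cheating_skill_metric'] = data['accuracy_score'] * data['headshot_percentage'] * data['spray_control_score']
    return data

df = feature_engineer(df)
# test.csv is only loaded in section 9, so the test set must be passed
# through feature_engineer there, not here.
# Rows without a label cannot be used for training or validation
df = df.dropna(subset=['is_cheater'])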
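
The `epsilon` term keeps the `reports_per_day` division defined when `account_age_days` is 0. A minimal sketch on made-up rows (the values are illustrative and not taken from `train.csv`) shows the effect, including the very large ratio produced for a zero-age account that has received reports:

```python
import pandas as pd

# Illustrative rows only; not real data from train.csv
toy = pd.DataFrame({
    "reports_received": [3, 0, 5],
    "account_age_days": [30, 0, 0],  # two accounts with age 0
})

epsilon = 1e-6
toy["reports_per_day"] = toy["reports_received"] / (toy["account_age_days"] + epsilon)

# No NaN or inf appears, but the last row is in the millions
print(toy)
```

Values this large behave like outliers, so it may be worth checking how they interact with the scaling step chosen later in the pipeline.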

## 3. Define Features and Target

We separate our dataset into features (the input variables, `X`) and the target (the variable we want to predict, `y`). We exclude `player_id` and `id` because they are identifiers, not predictive features.

In [27]:
# All columns except the target 'is_cheater' and the identifiers 'player_id' and 'id' are features.
features = [col for col in df.columns if col not in ['is_cheater', 'player_id', 'id']]
X = df[features]
y = df['is_cheater']

print(f'{len(features)} features selected.')

31 features selected.


## 4. Split Data into Training and Validation Sets

This is a critical step. We split the data **before** applying any preprocessing. This prevents data leakage, ensuring that our model's performance on the validation set is a true reflection of its ability to generalize to new, unseen data. We use `stratify=y` to maintain the same proportion of cheaters and non-cheaters in both the training and validation sets.

In [28]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f'Training set shape: {X_train.shape}')
print(f'Validation set shape: {X_val.shape}')

Training set shape: (78198, 31)
Validation set shape: (19550, 31)
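
The effect of `stratify=y` can be checked on a small synthetic target. The sketch below uses made-up labels with a 70/30 class balance (an assumption for illustration, not the balance of `train.csv`) and compares the positive rate in each split:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic labels: 70 negatives and 30 positives
y_demo = pd.Series([0] * 70 + [1] * 30)
X_demo = pd.DataFrame({"feature": range(100)})

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification, both splits keep the original class proportions
print("train positive rate:", y_tr.mean())
print("val positive rate:  ", y_va.mean())
```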


## 5. Define the Preprocessing Pipeline

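Here, we define the steps to clean and prepare the data. We create a `Pipeline` for our numeric features that first imputes missing values with `IterativeImputer` (which estimates each missing value from the other features) and then scales the data with `RobustScaler` (which centers each feature on its median and scales it by its interquartile range, so outliers have less influence on the scaling).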

We use a `ColumnTransformer` to apply this pipeline to all our numeric feature columns.

In [None]:
# Define preprocessing for numeric features
numeric_transformer = Pipeline(steps=[
    ('imputer', IterativeImputer(random_state=42)),
    ('scaler', RobustScaler())
    ])

# Create a preprocessor object using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, features)
    ],
    remainder='passthrough' # Keep other columns (if any)
    )
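
To see what the numeric transformer does in isolation, the sketch below runs the same two steps on a tiny hand-made array (the array `X_toy` is an assumption for illustration). It contains one missing value and one outlier row.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Hand-made data: one missing value and one outlier row
X_toy = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [4.0, 8.0],
    [100.0, 200.0],
])

toy_transformer = Pipeline(steps=[
    ("imputer", IterativeImputer(random_state=42)),
    ("scaler", RobustScaler()),
])

# fit_transform learns the imputation model and the median/IQR, then applies them
X_out = toy_transformer.fit_transform(X_toy)

print("any NaN left:", np.isnan(X_out).any())
print("column medians after scaling:", np.median(X_out, axis=0))
print(X_out)
```

After the transform no missing values remain and each column has a median of zero, since `RobustScaler` centers on the median.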

## 6. Create the Full Model Pipeline

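Now we chain the preprocessor and the classifier (`XGBClassifier`) into a single, final pipeline. When we call `.fit()` on this pipeline, the data will flow through the preprocessing steps and then be used to train the model. Because cheaters are the minority class, we also set `scale_pos_weight` to the ratio of negative to positive training examples so that the classifier gives more weight to the positive class.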

In [30]:
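# Ratio of negative to positive training examples, used to up-weight the minority (cheater) class
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    # use_label_encoder is omitted: the recorded training output reports it as an unused parameter
    ('classifier', XGBClassifier(random_state=42, eval_metric='logloss', scale_pos_weight=scale_pos_weight))
    ])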
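
The class-weight calculation can be checked on a small example. The sketch below uses a made-up label series with 65 negatives and 35 positives (an assumption for illustration) and computes the negative-to-positive ratio, which the XGBoost documentation suggests as a typical starting value for `scale_pos_weight`.

```python
import pandas as pd

# Made-up float labels, in the same 0.0 / 1.0 form shown in the classification report
y_toy = pd.Series([0.0] * 65 + [1.0] * 35)

n_negative = (y_toy == 0).sum()
n_positive = (y_toy == 1).sum()

# Values above 1 mean errors on the positive class count for more during training
toy_scale_pos_weight = n_negative / n_positive
print(n_negative, n_positive, toy_scale_pos_weight)
```

Counting with boolean masks avoids relying on how `value_counts()` orders or indexes its result.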

## 7. Train the Model

We train the entire pipeline on our training data. The pipeline handles applying the transformations correctly: the imputer and scaler are `fit` only on the training data, and then used to `transform` the training data before passing it to the classifier.

In [31]:
print("Training the model pipeline...")
model_pipeline.fit(X_train, y_train)
print("Training complete.")

Training the model pipeline...


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


Training complete.
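
What the pipeline does during `.fit()` and `.predict()` can be spelled out by hand. The sketch below uses synthetic data, a `StandardScaler`, and a `LogisticRegression` as stand-ins (all assumptions, chosen to keep the example small; this is not the notebook's imputer, scaler, or XGBoost model) and checks that the manual steps give the same predictions as the pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data and unseen rows
rng = np.random.default_rng(42)
X_tr = rng.normal(size=(200, 3))
y_tr = (X_tr[:, 0] + 0.5 * X_tr[:, 1] > 0).astype(int)
X_new = rng.normal(size=(50, 3))

# Pipeline version: one fit call, one predict call
pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])
pipe.fit(X_tr, y_tr)
pipe_pred = pipe.predict(X_new)

# Manual version: fit the scaler on training rows only, then reuse it on new rows
scaler = StandardScaler().fit(X_tr)
clf = LogisticRegression().fit(scaler.transform(X_tr), y_tr)
manual_pred = clf.predict(scaler.transform(X_new))

print("identical predictions:", np.array_equal(pipe_pred, manual_pred))
```

The fitted steps remain accessible afterwards, for example through `pipe.named_steps["scaler"]`.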


## 8. Evaluate the Model

Now we evaluate our trained pipeline on the validation set. When we call `.predict()`, the pipeline automatically applies the transformations that were learned from the training data (the same fitted imputer, the same median and interquartile range for scaling) before making a prediction. Nothing is refit on the validation data, so the resulting metrics are a fair estimate of performance on unseen data. Because the classes are imbalanced, we report per-class precision and recall alongside accuracy.

In [32]:
print("Evaluating the model on the validation set...")
y_pred = model_pipeline.predict(X_val)

print("Accuracy:", accuracy_score(y_val, y_pred))
print("Classification Report:\n", classification_report(y_val, y_pred))

Evaluating the model on the validation set...
Accuracy: 0.7767774936061381
Classification Report:
               precision    recall  f1-score   support

         0.0       0.88      0.76      0.82     12724
         1.0       0.64      0.81      0.72      6826

    accuracy                           0.78     19550
   macro avg       0.76      0.78      0.77     19550
weighted avg       0.80      0.78      0.78     19550
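
In the report above, class `1.0` has precision 0.64 and recall 0.81: the model catches about 81% of the cheaters in the validation set, while about 36% of the players it flags are not cheaters. The sketch below shows how these two quantities follow from a confusion matrix, using a handful of made-up labels (an assumption for illustration, not the notebook's predictions).

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels and predictions
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # share of flagged players who are truly positive
recall = tp / (tp + fn)     # share of true positives that were flagged

print("tn, fp, fn, tp:", tn, fp, fn, tp)
print("precision:", precision, "recall:", recall)
```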



## 9. Make Predictions on the Test Set

Finally, we use the trained pipeline to make predictions on the official test data and generate a submission file. The process is exactly the same as for the validation set, demonstrating the power and simplicity of the pipeline.

In [33]:
print("Making predictions on the test set...")
try:
    test_df = pd.read_csv('test.csv')
    test_df = feature_engineer(test_df)  # add the same derived features used in training
    test_features = test_df[features]  # raises KeyError if any feature column is missing

    # The pipeline automatically applies all the same preprocessing steps
    test_predictions = model_pipeline.predict(test_features)

    # Create submission file
    submission_df = pd.DataFrame({'id': test_df['id'], 'is_cheater': test_predictions})
    submission_df.to_csv('submission.csv', index=False)
    print("Submission file 'submission.csv' created successfully.")

except FileNotFoundError:
    print("Could not find 'test.csv'. Skipping prediction part.")

print("Pipeline example complete.")

Making predictions on the test set...
Submission file 'submission.csv' created successfully.
Pipeline example complete.
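
Selecting `test_df[features]` fails with a `KeyError` if the test set lacks any training column, which happens when the derived features have not been added. A small check before predicting can make that failure easier to read. The helper name `missing_columns` and the toy frame below are hypothetical and not part of the notebook.

```python
import pandas as pd

def missing_columns(frame, required):
    """Return the required column names that are absent from frame, in order."""
    return [col for col in required if col not in frame.columns]

# Hypothetical feature list and raw test rows, for illustration only
toy_features = ["kill_death_ratio", "headshot_percentage", "kdr_x_hs"]
raw_test = pd.DataFrame({
    "id": [1, 2],
    "kill_death_ratio": [1.2, 3.4],
    "headshot_percentage": [0.3, 0.9],
})

# Before feature engineering, the derived column is reported as missing
before = missing_columns(raw_test, toy_features)
print("missing before:", before)

# After adding the derived feature, nothing is missing
raw_test["kdr_x_hs"] = raw_test["kill_death_ratio"] * raw_test["headshot_percentage"]
after = missing_columns(raw_test, toy_features)
print("missing after:", after)
```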
