<a href="https://www.kaggle.com/code/llkh0a/ensemble-training?scriptVersionId=244414570" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Ensemble Modeling for BFRB Sensor Data Classification

This notebook demonstrates an ensemble approach for classifying body-focused repetitive behaviors (BFRBs) using sensor data from the CMI competition. The following models are used in the ensemble:

- **LightGBM**
- **XGBoost**
- **CatBoost**

## Feature Construction

Features are constructed as follows:
- For each sequence, statistical features (mean, std, min, max) are extracted from all numeric sensor columns (IMU, thermopile, ToF, etc.).
- Demographic features from the `train_demographics.csv` file (such as age, sex, handedness, height, etc.) are merged and included as input features for the models.
- The final feature set for each sequence includes all aggregated sensor statistics and all demographic columns except for the subject identifier.
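
The per-sequence aggregation described above can be sketched with a single `groupby(...).agg(...)` call. This is a minimal illustration on a toy frame, not the notebook's own implementation: the columns `acc_x` and `thm_1` stand in for the real sensor columns, and the helper name `aggregate_sequences` is hypothetical.

```python
import pandas as pd

def aggregate_sequences(df, value_cols, key="sequence_id"):
    """Return one row per sequence with mean/std/min/max of each value column."""
    stats = df.groupby(key)[value_cols].agg(["mean", "std", "min", "max"])
    # Flatten the (column, statistic) MultiIndex into names like "acc_x_mean"
    stats.columns = [f"{col}_{stat}" for col, stat in stats.columns]
    return stats.reset_index()

# Toy data: two sequences with two readings each (values are made up)
toy = pd.DataFrame({
    "sequence_id": ["SEQ_A", "SEQ_A", "SEQ_B", "SEQ_B"],
    "acc_x": [1.0, 3.0, 10.0, 20.0],
    "thm_1": [24.0, 26.0, 30.0, 30.0],
})
toy_features = aggregate_sequences(toy, ["acc_x", "thm_1"])
print(toy_features)
```

Each input column yields four output columns, so the feature count grows as four times the number of sensor columns before demographics are added.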

For model submission, check out the inference notebook: https://www.kaggle.com/code/llkh0a/ensemble-inference/

# Data Exploration
Explore the data: check the shapes, the column names, and the first few rows.

In [None]:
import numpy as np
import pandas as pd

# Read data
train = pd.read_csv('/kaggle/input/cmi-detect-behavior-with-sensor-data/train.csv')
train_demo = pd.read_csv('/kaggle/input/cmi-detect-behavior-with-sensor-data/train_demographics.csv')

print('Train shape:', train.shape)
print('Train demographics shape:', train_demo.shape)
print('Train columns:', train.columns.tolist())
print(train.head())

# Data Cleaning
Check for missing values in the sensor data and convert the `-1` placeholder in the ToF columns to NaN so it does not distort the statistics.
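
To see why a placeholder value needs handling before aggregation, the sketch below compares a mean computed with and without converting `-1` to NaN. It assumes, as the cleaning step in this notebook does, that `-1` in a ToF column marks a missing reading; the column name `tof_1_v0` and the values are made up.

```python
import numpy as np
import pandas as pd

# Toy ToF column where -1 is assumed to mean "no reading"
toy = pd.DataFrame({"tof_1_v0": [-1, 100, 200, -1]})

# Mean with the placeholder left in: the -1 values drag the average down
raw_mean = toy["tof_1_v0"].mean()

# Mean after marking the placeholder as missing: pandas skips NaN by default
clean = toy.replace(-1, np.nan)
clean_mean = clean["tof_1_v0"].mean()

# Share of readings that were placeholders
missing_share = clean["tof_1_v0"].isna().mean()

print(raw_mean, clean_mean, missing_share)
```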

In [None]:
# Check missing values
missing = train.isnull().sum()
print('Missing values per column:')
print(missing[missing > 0])

# Replace -1 in ToF columns with NaN for easier statistics
tof_cols = [col for col in train.columns if col.startswith('tof_')]
train[tof_cols] = train[tof_cols].replace(-1, np.nan)

# Feature Engineering
Extract statistical features for each sequence and merge with demographics.
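
The merge step can be sketched on toy data as follows. Passing `validate="many_to_one"` asks pandas to raise an error if a subject appears more than once in the demographics table, which would otherwise silently duplicate sequence rows. The subject IDs, column names, and values here are invented for illustration.

```python
import pandas as pd

# One row per sequence, several sequences may share a subject
seq_feats = pd.DataFrame({
    "sequence_id": ["SEQ_A", "SEQ_B", "SEQ_C"],
    "subject": ["SUBJ_1", "SUBJ_1", "SUBJ_2"],
    "acc_x_mean": [0.1, 0.2, 0.3],
})
# One row per subject
demo = pd.DataFrame({
    "subject": ["SUBJ_1", "SUBJ_2"],
    "age": [12, 34],
})

# Left merge keeps every sequence row and its original order
merged = seq_feats.merge(demo, on="subject", how="left", validate="many_to_one")
print(merged)
```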

In [None]:
def extract_features(df):
    feats = []
    # Only use numeric columns for aggregation
    numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    # Remove columns that should not be aggregated
    exclude_cols = ['row_id', 'sequence_id', 'sequence_counter', 'subject']
    numeric_cols = [c for c in numeric_cols if c not in exclude_cols]
    for seq_id, group in df.groupby('sequence_id'):
        feat = {'sequence_id': seq_id}
        for col in numeric_cols:
            feat[col + '_mean'] = group[col].mean()
            feat[col + '_std'] = group[col].std()
            feat[col + '_min'] = group[col].min()
            feat[col + '_max'] = group[col].max()
        feat['subject'] = group['subject'].iloc[0]
        feats.append(feat)
    return pd.DataFrame(feats)

X = extract_features(train)
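# Demographic column names (excluding the subject key), kept for reference
demographic_features = [col for col in train_demo.columns if col != 'subject']
# Merge demographics onto the per-sequence features via the subject key
X = X.merge(train_demo, on='subject', how='left')
# Fill any remaining missing values in features with column mean (or 0 as fallback)
X = X.fillna(X.mean(numeric_only=True)).fillna(0)
# Model inputs: all aggregated sensor statistics and demographic columns, identifiers excluded
feature_cols = [col for col in X.columns if col not in ['sequence_id', 'subject']]
# Align labels to the rows of X by sequence_id instead of relying on groupby order
y = train.groupby('sequence_id')['gesture'].first().loc[X['sequence_id']].values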


In [None]:
import joblib
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_enc = le.fit_transform(y)
joblib.dump(le, 'label_encoder.joblib')

In [None]:
# Print label map
label_map = {i: label for i, label in enumerate(le.classes_)}
print('Label map:', label_map)

# Training & Ensemble
Train LightGBM, XGBoost, CatBoost models and ensemble their predictions.
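
Soft voting averages the class probabilities predicted by each model and picks the class with the highest average. The sketch below shows that arithmetic with NumPy on made-up probabilities for three models. It illustrates the idea behind `VotingClassifier(voting='soft')` with equal weights rather than reproducing its implementation.

```python
import numpy as np

# Rows are samples, columns are classes. Probabilities are made up.
p_lgbm = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
p_xgb = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
p_cat = np.array([[0.4, 0.5, 0.1], [0.2, 0.3, 0.5]])

# Average the three probability matrices element-wise
avg_proba = np.mean([p_lgbm, p_xgb, p_cat], axis=0)

# Predicted class is the column with the highest averaged probability
soft_vote = avg_proba.argmax(axis=1)

print(avg_proba)
print(soft_vote)
```

For the second sample, one model favors class 1 while the other two favor class 2, and the averaged probabilities settle on class 2.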

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.ensemble import VotingClassifier
import joblib

X_train, X_val, y_train, y_val = train_test_split(
    X[feature_cols], y_enc, test_size=0.2, random_state=42, stratify=y_enc)

lgbm = LGBMClassifier()
# use_label_encoder is deprecated in recent XGBoost releases; labels are already integer-encoded
xgb = XGBClassifier(eval_metric='mlogloss')
cat = CatBoostClassifier(verbose=0)

ensemble = VotingClassifier(estimators=[
    ('lgbm', lgbm),
    ('xgb', xgb),
    ('cat', cat)
], voting='soft')

ensemble.fit(X_train, y_train)
y_pred = ensemble.predict(X_val)
print('Validation accuracy:', accuracy_score(y_val, y_pred))

# Save the ensemble model to a file
joblib.dump(ensemble, 'ensemble_model.joblib')
print('Model saved as ensemble_model.joblib')
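
The saved artifacts can later be reloaded to turn model output back into gesture names. The sketch below is self-contained: it trains a stand-in `LogisticRegression` on toy data in place of the real ensemble and writes to a temporary directory. The file names match the ones used above, but the model, the data, and the labels `gesture_a` and `gesture_b` are illustrative assumptions.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for the real features and gesture labels
X_toy = np.array([[0.0], [0.1], [0.9], [1.0]])
y_toy = np.array(["gesture_a", "gesture_a", "gesture_b", "gesture_b"])

# Encode string labels as integers, then fit the stand-in model
le_toy = LabelEncoder()
model_toy = LogisticRegression().fit(X_toy, le_toy.fit_transform(y_toy))

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "ensemble_model.joblib")
    encoder_path = os.path.join(tmp, "label_encoder.joblib")
    joblib.dump(model_toy, model_path)
    joblib.dump(le_toy, encoder_path)

    # Reload both artifacts, as an inference script would
    loaded_model = joblib.load(model_path)
    loaded_le = joblib.load(encoder_path)

# Map integer predictions back to the original label strings
pred_labels = loaded_le.inverse_transform(loaded_model.predict(X_toy))
print(pred_labels)
```

Saving the label encoder alongside the model matters because the integer class indices only have meaning relative to the encoder that produced them.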