# Predicting Household Occupancy using Motion Sensor Data

Process of building a model to predict whether a household is single or multiple occupancy using motion sensor data.
The steps include data loading, feature engineering, handling class imbalance, model training, and evaluation.


# 1. Importing Necessary Libraries

Importing the required libraries for data processing, machine learning, and evaluation.


In [1]:
import sqlite3
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from imblearn.over_sampling import SMOTE

# 2. Loading Data and Data Preprocessing

Connecting to the SQLite database and load the data from the `homes` and `motion` tables.
Converting the `datetime` column in the motion data to pandas datetime objects for easier manipulation.


In [2]:
# Connect to the database
conn = sqlite3.connect(r"C:\Users\Prashanth\Downloads\data.db")

# Load the homes data
homes_df = pd.read_sql_query("SELECT * FROM homes", conn)

# Load the motion data
motion_df = pd.read_sql_query("SELECT * FROM motion", conn)

# Convert datetime to pandas datetime
motion_df['datetime'] = pd.to_datetime(motion_df['datetime'])

# 3. Feature Engineering

Creatng new features from the motion data to capture different aspects of household activity.


In [3]:
# Enhanced feature engineering
def create_features(group):
    morning = group['datetime'].dt.hour.between(6, 12).sum()
    afternoon = group['datetime'].dt.hour.between(12, 18).sum()
    evening = group['datetime'].dt.hour.between(18, 24).sum()
    night = group['datetime'].dt.hour.between(0, 6).sum()
    
    return pd.Series({
        'motion_count': group['id'].count(),
        'time_range_hours': (group['datetime'].max() - group['datetime'].min()).total_seconds() / 3600,
        'unique_locations': group['location'].nunique(),
        'activity_hours': group['datetime'].dt.hour.nunique(),
        'events_per_hour': group['id'].count() / ((group['datetime'].max() - group['datetime'].min()).total_seconds() / 3600),
        'morning_activity': morning,
        'afternoon_activity': afternoon,
        'evening_activity': evening,
        'night_activity': night
    })


In [4]:
# Apply the feature engineering function 
motion_features = motion_df.groupby('home_id').apply(create_features).reset_index()

# 4. Merging DataFrames and Handling Missing Values

Merging the homes data with the newly created motion features and handle any missing values.


In [5]:
# Merging homes and motion features
merged_df = homes_df.merge(motion_features, left_on='id', right_on='home_id', how='left')

# Filling NaN values (homes with no motion data) with 0
merged_df = merged_df.fillna(0)

# Prepared features and target
X = merged_df[['motion_count', 'time_range_hours', 'unique_locations', 'activity_hours',
               'events_per_hour', 'morning_activity', 'afternoon_activity', 'evening_activity', 'night_activity']]
y = merged_df['multiple_occupancy']


# 5. Handling Class Imbalance

Using SMOTE (Synthetic Minority Over-sampling Technique) to handle class imbalance in the target variable.
Train-Test Split method to split the resampled data into training and testing sets.



In [6]:
#SMOTE for oversampling
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
# Spliting the resampled data
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)

# Scaling features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


# 6. Hyperparameter Tuning and Model Training

Performing hyperparameter tuning using Randomized Search CV to find the best parameters for our Random Forest model.


In [7]:
# Defining the parameter grid for RandomizedSearchCV
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

In [8]:
# Performing RandomizedSearchCV
rf = RandomForestClassifier(random_state=42)
random_search = RandomizedSearchCV(rf, param_distributions=param_grid, n_iter=20, cv=5, random_state=42)
random_search.fit(X_train_scaled, y_train)


RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(random_state=42),
                   n_iter=20,
                   param_distributions={'max_depth': [None, 10, 20, 30],
                                        'min_samples_leaf': [1, 2, 4],
                                        'min_samples_split': [2, 5, 10],
                                        'n_estimators': [100, 200, 300]},
                   random_state=42)

# 7. Model Evaluation

Evaluating the performance of the best model on the test set and displaying the accuracy and classification report.


In [9]:
# Using the best model
best_rf = random_search.best_estimator_
# Making predictions
y_pred = best_rf.predict(X_test_scaled)
# Evaluating the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Accuracy: 0.72

Classification Report:
              precision    recall  f1-score   support

           0       0.65      1.00      0.79        13
           1       1.00      0.42      0.59        12

    accuracy                           0.72        25
   macro avg       0.82      0.71      0.69        25
weighted avg       0.82      0.72      0.69        25



# 8. Feature Importance

Displaying the importance of each feature in the trained Random Forest model.


In [10]:
# Feature importance
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': best_rf.feature_importances_
}).sort_values('importance', ascending=False)
print("\nFeature Importance:")
print(feature_importance)

# Printing best parameters
print("\nBest Parameters:", random_search.best_params_)


Feature Importance:
              feature  importance
0        motion_count    0.167973
7    evening_activity    0.161169
4     events_per_hour    0.153963
6  afternoon_activity    0.127808
5    morning_activity    0.112540
8      night_activity    0.089866
1    time_range_hours    0.087646
2    unique_locations    0.079559
3      activity_hours    0.019476

Best Parameters: {'n_estimators': 100, 'min_samples_split': 10, 'min_samples_leaf': 2, 'max_depth': 20}
