# Model to Predict House Occupancy #

Taking a supervised approach, my aim is to reshape the motion data into the same structure as the homes data. The reshaped motion data can then be used to train a prediction model (e.g. a logistic regression model), and the model can be validated against the homes data.

## Load Data

Using sqlite3, load the data into pandas dataframes and perform a small amount of data exploration and preparation.

In [44]:
import sqlite3
import pandas as pd

In [46]:
# creating the file path from my local machine 
dbfile = 'C:/Users/Nadia/Documents/Imperial/data.db'
# Create a SQL connection to SQLite database
con = sqlite3.connect(dbfile)

cur = con.cursor()

table_list = [a for a in cur.execute("SELECT name FROM sqlite_master WHERE type = 'table'")]
print(table_list)

[('homes',), ('motion',)]


In [48]:
homes = pd.read_sql_query('SELECT * FROM homes', con)
motion = pd.read_sql_query('SELECT * FROM motion', con)

In [49]:
motion.head()

Unnamed: 0,id,home_id,datetime,location
0,e41218b439d933a1cd9ad158f78e9198,205c42ec747e2db13cb92087a99433f1,2024-01-01 00:00:10+00,lounge
1,92d48d869ae50b0764cfb8d70494f618,7d2f2e0a9e059b4fb8106bb0ad4b8a39,2024-01-01 00:00:17+00,lounge
2,65c18ba64884442dd47c2fd4cf3630e4,44a880cc6fc3a7db3464092f650ae7f1,2024-01-01 00:00:18+00,lounge
3,90d6336d189c929aa50fa08e5aee5f41,49b83fce41b676266b98cd1e095f1c11,2024-01-01 00:00:43+00,lounge
4,6e3d73bed24b95ffdfe5ec017787f039,14328a0b7574e912c2e23d62c9476a07,2024-01-01 00:00:57+00,lounge


In [50]:
motion.location.unique()

array(['lounge', 'bedroom1', 'hallway', 'kitchen', 'bathroom1', 'WC1',
       'living room', 'dining room', 'conservatory', 'study'],
      dtype=object)

In [51]:
motion.dtypes

id          object
home_id     object
datetime    object
location    object
dtype: object

In [52]:
# convert the datetime column from string (object) to datetime in order to calculate time differences
motion['datetime'] = pd.to_datetime(motion['datetime'])

## Feature Engineering

To model the motion data similarly to the homes data, I aim to create a table with the same columns: the home_id, and a new column generated from the home_id, datetime and location.

To keep the model extremely simple, I make an assumption based on room size and average walking speed: X is the minimum time it would take one person to leave one location and enter another.

For two consecutive events i and j where home_id[i] = home_id[j] and location[i] != location[j],
if datetime[j] - datetime[i] < X, then there is more than one person in the house.

From this, I plan to generate a new dataframe with home_id and a multiple occupancy flag: 1 if datetime[j] - datetime[i] < X, otherwise 0.
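
To make the rule concrete, here is a minimal sketch on a hand-made toy dataframe. The home ids, timestamps and locations are invented for illustration and are not taken from the database. It uses a per-home `groupby(...).shift(1)`, which is one possible way to compare each event only with the previous event in the same home.

```python
import pandas as pd

# Assumed threshold X: 10 seconds to walk between rooms
x = pd.Timedelta(seconds=10)

# Invented toy events (not real data): two homes, a few motion events each
toy = pd.DataFrame({
    "home_id": ["a", "a", "a", "b", "b"],
    "datetime": pd.to_datetime([
        "2024-01-01 08:00:00",
        "2024-01-01 08:00:04",
        "2024-01-01 08:05:00",
        "2024-01-01 08:05:03",
        "2024-01-01 08:05:06",
    ], utc=True),
    "location": ["lounge", "kitchen", "kitchen", "hallway", "hallway"],
})

toy = toy.sort_values(["home_id", "datetime"])

# Previous event within the same home; the first event of each home gets NaT/NaN
grouped = toy.groupby("home_id")
prev_datetime = grouped["datetime"].shift(1)
prev_location = grouped["location"].shift(1)

time_diff = toy["datetime"] - prev_datetime

# Flag an event when it follows an event in a different room of the same home within X
toy["multiple_occupancy"] = (
    (time_diff < x)
    & prev_location.notna()
    & (toy["location"] != prev_location)
).astype(int)

print(toy)
```

Only the second event of home `a` is flagged: it occurs 4 seconds after an event in a different room. The first event of home `b` is not flagged, even though it comes 3 seconds after the last event of home `a`, because it has no previous event in its own home.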

In [55]:
# making the assumption that it takes 10 seconds to walk between rooms
x = pd.Timedelta(seconds=10)

# ordering data by home_id and datetime
motion = motion.sort_values(by=['home_id', 'datetime'])
# calculating time difference between consecutive events in the same home
motion['prev_home_id'] = motion['home_id'].shift(1)
motion['prev_datetime'] = motion['datetime'].shift(1)
motion['prev_location'] = motion['location'].shift(1)

motion['time_diff'] = motion['datetime'] - motion['prev_datetime']
motion['different_location'] = motion['location'] != motion['prev_location']
motion['same_home'] = motion['home_id'] == motion['prev_home_id']

motion['multiple_occupancy'] = (motion['time_diff'] < x) & motion['different_location'] & motion['same_home']

motion

Unnamed: 0,id,home_id,datetime,location,prev_home_id,prev_datetime,prev_location,time_diff,different_location,same_home,multiple_occupancy
2336,32d084228887d5d9b9d622a6a5bde799,0904961f621c9bd03542b43b992ec431,2024-01-01 08:27:15+00:00,hallway,,NaT,,NaT,True,False,False
2348,e466bb0bd5ceefa24d266d74e190d0b6,0904961f621c9bd03542b43b992ec431,2024-01-01 08:28:19+00:00,hallway,0904961f621c9bd03542b43b992ec431,2024-01-01 08:27:15+00:00,hallway,0 days 00:01:04,False,True,False
2452,3dbd2f7b6e0ff2086d2982aaec5f2f6d,0904961f621c9bd03542b43b992ec431,2024-01-01 08:35:04+00:00,hallway,0904961f621c9bd03542b43b992ec431,2024-01-01 08:28:19+00:00,hallway,0 days 00:06:45,False,True,False
2473,e4e5f042872c8d2e43c7c44ce54d469b,0904961f621c9bd03542b43b992ec431,2024-01-01 08:36:20+00:00,hallway,0904961f621c9bd03542b43b992ec431,2024-01-01 08:35:04+00:00,hallway,0 days 00:01:16,False,True,False
2478,37bf0bd90502b17555550cd5649b5e71,0904961f621c9bd03542b43b992ec431,2024-01-01 08:36:42+00:00,kitchen,0904961f621c9bd03542b43b992ec431,2024-01-01 08:36:20+00:00,hallway,0 days 00:00:22,True,True,False
...,...,...,...,...,...,...,...,...,...,...,...
579525,bd95f0c14ff1504a40e503287dac80fe,f5a2b27c9d8bb7f59b7b0684d3555e52,2024-01-31 22:21:42+00:00,bedroom1,f5a2b27c9d8bb7f59b7b0684d3555e52,2024-01-31 22:20:28+00:00,bedroom1,0 days 00:01:14,False,True,False
580005,e949397e2b19f5f98c85738dbdca0035,f5a2b27c9d8bb7f59b7b0684d3555e52,2024-01-31 23:01:27+00:00,bedroom1,f5a2b27c9d8bb7f59b7b0684d3555e52,2024-01-31 22:21:42+00:00,bedroom1,0 days 00:39:45,False,True,False
580067,1d65e35000bfa3c7a419ca254550174a,f5a2b27c9d8bb7f59b7b0684d3555e52,2024-01-31 23:06:36+00:00,bedroom1,f5a2b27c9d8bb7f59b7b0684d3555e52,2024-01-31 23:01:27+00:00,bedroom1,0 days 00:05:09,False,True,False
580230,fca590a53887a0a206befa3332b11ae3,f5a2b27c9d8bb7f59b7b0684d3555e52,2024-01-31 23:33:28+00:00,bedroom1,f5a2b27c9d8bb7f59b7b0684d3555e52,2024-01-31 23:06:36+00:00,bedroom1,0 days 00:26:52,False,True,False


In [68]:
# aggregating per home: number of flagged events, total events and distinct locations
multiple_occupancy_count = motion.groupby('home_id')['multiple_occupancy'].sum().reset_index()

total_events = motion.groupby('home_id').size().reset_index(name='total_events')
unique_locations = motion.groupby('home_id')['location'].nunique().reset_index(name='unique_locations')

motion_predict = multiple_occupancy_count.merge(total_events, on='home_id').merge(unique_locations, on='home_id')

# Generating binary labels: 1 if there's evidence of multiple people, 0 otherwise
motion_predict['multiple_occup'] = (motion_predict['multiple_occupancy'] > 0).astype(int)

motion_predict

Unnamed: 0,home_id,multiple_occupancy,total_events,unique_locations,multiple_occup
0,0904961f621c9bd03542b43b992ec431,135,3442,3,1
1,0f44ff9edd221e417195f4398d2f3853,3232,15303,6,1
2,14328a0b7574e912c2e23d62c9476a07,2937,15977,8,1
3,15663392d490688cd4b0e5aa3d5b6ef3,284,6296,5,1
4,16d71b9c46d9abd765bf395818efe527,384,6780,4,1
5,205c42ec747e2db13cb92087a99433f1,1298,16191,7,1
6,20a3ebd4470c712d6f6d99908d931e09,124,7166,3,1
7,2739e3f7409068a94cf6e3eac643c2e7,4019,20501,4,1
8,2a035e0f88dd05d3c5e61ebee0531a4c,816,7377,4,1
9,2b5ce37a65e82735416d69b987d99fe8,24,4510,6,1


## Prediction Model

The table generated in the previous section could be used to train a prediction model (possibly logistic regression https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) by shuffling the data and splitting it into a training set and a test set, commonly with a 75:25 ratio. Hyperparameter tuning can then be used to improve the model.
The homes data can then be used to validate the model.
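
As a rough sketch of what that training step might look like, the example below fits a logistic regression on a synthetic stand-in for the feature table. The data is randomly generated, and the `label` column is a hypothetical placeholder for whatever occupancy label the homes data provides; none of it comes from the database. The scaling step and the small grid over `C` are illustrative choices, not a tuned setup.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the per-home feature table (invented data)
rng = np.random.default_rng(0)
n_homes = 80
label = rng.integers(0, 2, size=n_homes)  # hypothetical ground-truth label
toy = pd.DataFrame({
    "multiple_occupancy": rng.poisson(200 + 1500 * label),
    "total_events": rng.poisson(5000 + 6000 * label),
    "unique_locations": rng.integers(3, 9, size=n_homes),
    "label": label,
})

features = ["multiple_occupancy", "total_events", "unique_locations"]

# Shuffle and split 75:25 into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    toy[features], toy["label"],
    test_size=0.25, random_state=0, stratify=toy["label"],
)

# Scale the features, then fit logistic regression
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Hyperparameter tuning: cross-validated search over the regularisation strength C
search = GridSearchCV(pipeline, {"logisticregression__C": [0.1, 1.0, 10.0]}, cv=3)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```

With the real data, `toy` would be replaced by the per-home feature table, and the labels would need to come from the homes data rather than from a rule applied to the same motion features.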