<a href="https://colab.research.google.com/github/hrootscraft/sensor-data-analysis/blob/main/BehaviourDataAnalysisSensorData.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What can you tell about this person?

Running the first cell, which uses the `gdown` command, downloads a file named `sample_data_for_user_663960.csv` into your local directory, ready to load.

This is the data collected by an IoT network of PIR motion sensors deployed in the home of an older adult (aged 80+) living alone. Each sensor triggers and sends data to our cloud when it detects motion in its vicinity, every minute or so.

These are the fields which are relevant for our EDA:
1. `sTyp`: Type of Zigbee sensor
2. `sloc`: Name of the location covered by the motion sensor
3. `gwTz`: Timezone locale of the user
4. `rcvdTm`: Timestamp at which the motion sensor was triggered
5. `motion`: If the value is "Motion", the row is a motion event; otherwise it is not
6. `isAppl`: Whether the event was created by an appliance
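
As a minimal sketch of how these six fields could be selected at load time, the snippet below reads a tiny in-memory stand-in for the CSV with `usecols`. The two rows and the extra `sStat` value are made up for illustration; only the column names come from the real file.

```python
import io
import pandas as pd

# Tiny stand-in for the real CSV: made-up rows, real column names plus one extra column
sample_csv = io.StringIO(
    "sTyp,sloc,gwTz,rcvdTm,motion,isAppl,sStat\n"
    "Motion Sensor,Bedroom,Europe/London,{'$date': '2021-06-01T00:02:49Z'},Motion,False,1\n"
    "Motion Sensor,Kitchen,Europe/London,{'$date': '2021-06-01T07:15:00Z'},Motion,False,1\n"
)

relevant_cols = ["sTyp", "sloc", "gwTz", "rcvdTm", "motion", "isAppl"]

# usecols keeps only the listed columns; sStat is never loaded
sample = pd.read_csv(sample_csv, usecols=relevant_cols)
print(sample.columns.tolist())
```

The notebook itself loads every column and drops the unneeded ones afterwards, which gives the same six-column view.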

In [None]:
!gdown "1pv6YiFFuQ8VtJIvZyZKTtwqbF74GxniI"

Downloading...
From: https://drive.google.com/uc?id=1pv6YiFFuQ8VtJIvZyZKTtwqbF74GxniI
To: /content/sample_data_for_user_663960.csv
100% 12.1M/12.1M [00:00<00:00, 75.7MB/s]


# Library imports

In [None]:
import pandas as pd
import numpy as np

import datetime
import json
from tqdm import tqdm

import plotly.express as px
import plotly.graph_objects as go

import warnings
warnings.filterwarnings("ignore")

from sklearn.ensemble import IsolationForest

# Load data

In [None]:
df = pd.read_csv('sample_data_for_user_663960.csv')
# display(df)

# Cursory analysis

In [None]:
# df.info()

In [None]:
df.shape

(38846, 24)

In [None]:
df.isnull().sum()

Unnamed: 0.1     0
Unnamed: 0       0
_id              0
sStat            0
motion           0
rcvdTm           0
ep              37
utc              0
Time             0
dateTag          0
inOut            0
sBtLow          37
sBtWar          37
gwMID            0
sMID            37
sNm             37
sTyp             0
sTech            0
uqID             0
sloc             0
gwTz            37
isAppl           0
isExitD          0
__v              0
dtype: int64

In [None]:
df.duplicated().sum()

0

# Preprocess data

In [None]:
df_copy = df.copy()

In [None]:
cols_to_keep = ['sTyp','sloc','gwTz','rcvdTm','motion','isAppl']
df_copy = df_copy.drop(columns=[col for col in df_copy.columns if col not in cols_to_keep])
display(df_copy)

Unnamed: 0,motion,rcvdTm,sTyp,sloc,gwTz,isAppl
0,Motion,{'$date': '2021-08-31T23:39:33Z'},Motion Sensor,Bedroom,Europe/London,False
1,Motion,{'$date': '2021-08-31T23:37:07Z'},Motion Sensor,Bedroom,Europe/London,False
2,Motion,{'$date': '2021-08-31T22:12:25Z'},Motion Sensor,Bedroom,Europe/London,False
3,Motion,{'$date': '2021-08-31T22:11:32Z'},Motion Sensor,Bedroom,Europe/London,False
4,Motion,{'$date': '2021-08-31T22:10:14Z'},Motion Sensor,Bedroom,Europe/London,False
...,...,...,...,...,...,...
38841,Motion,{'$date': '2021-06-01T00:16:59Z'},Motion Sensor,Bedroom,Europe/London,False
38842,Motion,{'$date': '2021-06-01T00:13:57Z'},Motion Sensor,Bedroom,Europe/London,False
38843,Motion,{'$date': '2021-06-01T00:12:12Z'},Motion Sensor,Bedroom,Europe/London,False
38844,Motion,{'$date': '2021-06-01T00:10:51Z'},Motion Sensor,Bedroom,Europe/London,False


In [None]:
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38846 entries, 0 to 38845
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   motion  38846 non-null  object
 1   rcvdTm  38846 non-null  object
 2   sTyp    38846 non-null  object
 3   sloc    38846 non-null  object
 4   gwTz    38809 non-null  object
 5   isAppl  38846 non-null  bool  
dtypes: bool(1), object(5)
memory usage: 1.5+ MB


In [None]:
df_copy.duplicated().sum()

83

- The full dataset has no duplicate rows, but the six-column view we took from it contains 83. Since we work with this view from here on, we drop them; this also helps later, when we want a unique datetime64[ns, UTC] column that can be set as the index for visualizing the person's daily motion.
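
To illustrate why a column subset can contain duplicates when the full rows do not, here is a small sketch on made-up data. The `_id` values are hypothetical; the point is only that dropping a distinguishing column can make two rows identical.

```python
import pandas as pd

# Made-up rows: _id differs on every row, the other columns do not
toy = pd.DataFrame({
    "_id":    ["a1", "a2", "a3"],
    "rcvdTm": ["06:40:03", "06:40:03", "06:38:22"],
    "sloc":   ["Livingroom", "Livingroom", "Livingroom"],
})

full_dupes = toy.duplicated().sum()      # compares whole rows, including _id
view = toy[["rcvdTm", "sloc"]]           # the reduced view, without _id
view_dupes = view.duplicated().sum()     # the first two rows now look the same
deduped = view.drop_duplicates()         # keeps the first of each identical group

print(full_dupes, view_dupes, len(deduped))
```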

In [None]:
df_copy[df_copy.duplicated()]

Unnamed: 0,motion,rcvdTm,sTyp,sloc,gwTz,isAppl
2731,Motion,{'$date': '2021-08-26T06:40:03Z'},Motion Sensor,Livingroom,Europe/London,False
2732,Motion,{'$date': '2021-08-26T06:40:03Z'},Motion Sensor,Livingroom,Europe/London,False
2734,Motion,{'$date': '2021-08-26T06:38:22Z'},Motion Sensor,Livingroom,Europe/London,False
2735,Motion,{'$date': '2021-08-26T06:38:22Z'},Motion Sensor,Livingroom,Europe/London,False
2737,Motion,{'$date': '2021-08-26T06:36:38Z'},Motion Sensor,Livingroom,Europe/London,False
...,...,...,...,...,...,...
36023,Motion,{'$date': '2021-06-07T23:02:50Z'},Motion Sensor,Bedroom,Europe/London,False
37264,Motion,{'$date': '2021-06-04T19:01:22Z'},Motion Sensor,Livingroom,Europe/London,False
37265,Motion,{'$date': '2021-06-04T19:01:22Z'},Motion Sensor,Livingroom,Europe/London,False
37653,Motion,{'$date': '2021-06-03T17:22:29Z'},Motion Sensor,Stairs,Europe/London,False


In [None]:
df_copy.drop_duplicates(inplace=True)

In [None]:
df_copy['rcvdTm'].duplicated().sum()

707

In [None]:
df_copy[df_copy['rcvdTm'].duplicated(keep=False)]
# let's handle this after converting rcvdTm to timestamps so that we can sort as per time and keep the row that comes second

Unnamed: 0,motion,rcvdTm,sTyp,sloc,gwTz,isAppl
196,Motion,{'$date': '2021-08-31T14:59:30Z'},Motion Sensor,Stairs,Europe/London,False
197,Motion,{'$date': '2021-08-31T14:59:30Z'},Motion Sensor,Hallway,Europe/London,False
271,Motion,{'$date': '2021-08-31T11:38:16Z'},Motion Sensor,Stairs,Europe/London,False
272,Motion,{'$date': '2021-08-31T11:38:16Z'},Motion Sensor,Hallway,Europe/London,False
343,Motion,{'$date': '2021-08-31T10:03:11Z'},Motion Sensor,Stairs,Europe/London,False
...,...,...,...,...,...,...
38767,Motion,{'$date': '2021-06-01T07:24:41Z'},Motion Sensor,Hallway,Europe/London,False
38799,Motion,{'$date': '2021-06-01T06:17:01Z'},Motion Sensor,Hallway,Europe/London,False
38800,Motion,{'$date': '2021-06-01T06:17:01Z'},Motion Sensor,Toilet,Europe/London,False
38801,Motion,{'$date': '2021-06-01T06:14:09Z'},Motion Sensor,Toilet,Europe/London,False


In [None]:
type(df_copy.rcvdTm[0])

str

In [None]:
df_copy['dt'] = pd.to_datetime(df_copy['rcvdTm'].apply(lambda x: json.loads(x.replace("'", '"'))['$date']))
df_copy

Unnamed: 0,motion,rcvdTm,sTyp,sloc,gwTz,isAppl,dt
0,Motion,{'$date': '2021-08-31T23:39:33Z'},Motion Sensor,Bedroom,Europe/London,False,2021-08-31 23:39:33+00:00
1,Motion,{'$date': '2021-08-31T23:37:07Z'},Motion Sensor,Bedroom,Europe/London,False,2021-08-31 23:37:07+00:00
2,Motion,{'$date': '2021-08-31T22:12:25Z'},Motion Sensor,Bedroom,Europe/London,False,2021-08-31 22:12:25+00:00
3,Motion,{'$date': '2021-08-31T22:11:32Z'},Motion Sensor,Bedroom,Europe/London,False,2021-08-31 22:11:32+00:00
4,Motion,{'$date': '2021-08-31T22:10:14Z'},Motion Sensor,Bedroom,Europe/London,False,2021-08-31 22:10:14+00:00
...,...,...,...,...,...,...,...
38841,Motion,{'$date': '2021-06-01T00:16:59Z'},Motion Sensor,Bedroom,Europe/London,False,2021-06-01 00:16:59+00:00
38842,Motion,{'$date': '2021-06-01T00:13:57Z'},Motion Sensor,Bedroom,Europe/London,False,2021-06-01 00:13:57+00:00
38843,Motion,{'$date': '2021-06-01T00:12:12Z'},Motion Sensor,Bedroom,Europe/London,False,2021-06-01 00:12:12+00:00
38844,Motion,{'$date': '2021-06-01T00:10:51Z'},Motion Sensor,Bedroom,Europe/London,False,2021-06-01 00:10:51+00:00


In [None]:
# check the datatype esp of dt
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 38763 entries, 0 to 38845
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype              
---  ------  --------------  -----              
 0   motion  38763 non-null  object             
 1   rcvdTm  38763 non-null  object             
 2   sTyp    38763 non-null  object             
 3   sloc    38763 non-null  object             
 4   gwTz    38726 non-null  object             
 5   isAppl  38763 non-null  bool               
 6   dt      38763 non-null  datetime64[ns, UTC]
dtypes: bool(1), datetime64[ns, UTC](1), object(5)
memory usage: 3.1+ MB


- In this case, +00:00 indicates that the time is in UTC (Coordinated Universal Time) with no timezone offset applied.
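
Since `gwTz` reports `Europe/London`, the UTC timestamps can be shifted to the user's local clock if local hours are needed. The sketch below uses `tz_convert` on two timestamps taken from the tables above; whether the rest of the analysis should use local time rather than UTC is a judgment call, and the notebook as written stays in UTC.

```python
import pandas as pd

# Two timestamps from the dataset, parsed as timezone-aware UTC values
utc_times = pd.to_datetime(
    ["2021-06-01T06:14:09Z", "2021-08-31T23:39:33Z"], utc=True
)

# Convert to the locale reported in gwTz (British Summer Time is UTC+1 in June-August)
local_times = utc_times.tz_convert("Europe/London")
print(local_times)
```

Note that the second event moves to the next calendar day in local time, which matters for any per-day grouping.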

In [None]:
df_copy.drop('rcvdTm',axis=1,inplace=True)

In [None]:
df_copy_1 = df_copy.copy()

In [None]:
df_copy_1

Unnamed: 0,motion,sTyp,sloc,gwTz,isAppl,dt
0,Motion,Motion Sensor,Bedroom,Europe/London,False,2021-08-31 23:39:33+00:00
1,Motion,Motion Sensor,Bedroom,Europe/London,False,2021-08-31 23:37:07+00:00
2,Motion,Motion Sensor,Bedroom,Europe/London,False,2021-08-31 22:12:25+00:00
3,Motion,Motion Sensor,Bedroom,Europe/London,False,2021-08-31 22:11:32+00:00
4,Motion,Motion Sensor,Bedroom,Europe/London,False,2021-08-31 22:10:14+00:00
...,...,...,...,...,...,...
38841,Motion,Motion Sensor,Bedroom,Europe/London,False,2021-06-01 00:16:59+00:00
38842,Motion,Motion Sensor,Bedroom,Europe/London,False,2021-06-01 00:13:57+00:00
38843,Motion,Motion Sensor,Bedroom,Europe/London,False,2021-06-01 00:12:12+00:00
38844,Motion,Motion Sensor,Bedroom,Europe/London,False,2021-06-01 00:10:51+00:00


In [None]:
df_copy_1.dt.duplicated().sum()

707

In [None]:
df_copy_1 = df_copy_1.sort_values('dt')

In [None]:
df_copy_1[df_copy_1.dt.duplicated(keep=False)] # observe the sloc

Unnamed: 0,motion,sTyp,sloc,gwTz,isAppl,dt
38802,Motion,Motion Sensor,Hallway,Europe/London,False,2021-06-01 06:14:09+00:00
38801,Motion,Motion Sensor,Toilet,Europe/London,False,2021-06-01 06:14:09+00:00
38800,Motion,Motion Sensor,Toilet,Europe/London,False,2021-06-01 06:17:01+00:00
38799,Motion,Motion Sensor,Hallway,Europe/London,False,2021-06-01 06:17:01+00:00
38767,Motion,Motion Sensor,Hallway,Europe/London,False,2021-06-01 07:24:41+00:00
...,...,...,...,...,...,...
343,Motion,Motion Sensor,Stairs,Europe/London,False,2021-08-31 10:03:11+00:00
272,Motion,Motion Sensor,Hallway,Europe/London,False,2021-08-31 11:38:16+00:00
271,Motion,Motion Sensor,Stairs,Europe/London,False,2021-08-31 11:38:16+00:00
197,Motion,Motion Sensor,Hallway,Europe/London,False,2021-08-31 14:59:30+00:00


In [None]:
df_copy_1[df_copy_1.dt.duplicated()] # observe the sloc

Unnamed: 0,motion,sTyp,sloc,gwTz,isAppl,dt
38801,Motion,Motion Sensor,Toilet,Europe/London,False,2021-06-01 06:14:09+00:00
38799,Motion,Motion Sensor,Hallway,Europe/London,False,2021-06-01 06:17:01+00:00
38766,Motion,Motion Sensor,Toilet,Europe/London,False,2021-06-01 07:24:41+00:00
38617,Motion,Motion Sensor,Hallway,Europe/London,False,2021-06-01 12:51:36+00:00
38580,Motion,Motion Sensor,Hallway,Europe/London,False,2021-06-01 14:34:04+00:00
...,...,...,...,...,...,...
831,Motion,Motion Sensor,Toilet,Europe/London,False,2021-08-30 07:17:52+00:00
664,Motion,Motion Sensor,Stairs,Europe/London,False,2021-08-30 13:01:30+00:00
343,Motion,Motion Sensor,Stairs,Europe/London,False,2021-08-31 10:03:11+00:00
271,Motion,Motion Sensor,Stairs,Europe/London,False,2021-08-31 11:38:16+00:00


In [None]:
# Drop duplicate timestamps, keeping only the last occurrence:
# the person can be in only one location at any instant, so we keep the row that comes later in the sorted order
df_cleaned_timestamps = df_copy_1[~df_copy_1['dt'].duplicated(keep='last')]
df_cleaned_timestamps

Unnamed: 0,motion,sTyp,sloc,gwTz,isAppl,dt
38845,Motion,Motion Sensor,Bedroom,Europe/London,False,2021-06-01 00:02:49+00:00
38844,Motion,Motion Sensor,Bedroom,Europe/London,False,2021-06-01 00:10:51+00:00
38843,Motion,Motion Sensor,Bedroom,Europe/London,False,2021-06-01 00:12:12+00:00
38842,Motion,Motion Sensor,Bedroom,Europe/London,False,2021-06-01 00:13:57+00:00
38841,Motion,Motion Sensor,Bedroom,Europe/London,False,2021-06-01 00:16:59+00:00
...,...,...,...,...,...,...
4,Motion,Motion Sensor,Bedroom,Europe/London,False,2021-08-31 22:10:14+00:00
3,Motion,Motion Sensor,Bedroom,Europe/London,False,2021-08-31 22:11:32+00:00
2,Motion,Motion Sensor,Bedroom,Europe/London,False,2021-08-31 22:12:25+00:00
1,Motion,Motion Sensor,Bedroom,Europe/London,False,2021-08-31 23:37:07+00:00


In [None]:
df_cleaned_timestamps.dt.duplicated().sum()

0

In [None]:
df_cleaned_timestamps.isnull().sum()

motion     0
sTyp       0
sloc       0
gwTz      37
isAppl     0
dt         0
dtype: int64

In [None]:
mask = df_cleaned_timestamps['gwTz'].isna()  # build the mask from the same frame it is applied to
df_cleaned_timestamps[mask]

Unnamed: 0,motion,sTyp,sloc,gwTz,isAppl,dt
38238,Motion,Motion Sensor,Out of Location,,False,2021-06-02 10:19:21.912000+00:00
37786,Motion,Motion Sensor,Out of Location,,False,2021-06-03 11:59:24.051000+00:00
35358,Motion,Motion Sensor,Out of Location,,False,2021-06-09 16:15:39.493000+00:00
35043,Motion,Motion Sensor,Out of Location,,False,2021-06-10 10:55:39.877000+00:00
32107,Motion,Motion Sensor,Out of Location,,False,2021-06-16 16:58:38.381000+00:00
31819,Motion,Motion Sensor,Out of Location,,False,2021-06-17 11:34:38.767000+00:00
31039,Motion,Motion Sensor,Out of Location,,False,2021-06-19 12:15:03.388000+00:00
30718,Motion,Motion Sensor,Out of Location,,False,2021-06-20 09:03:03.868000+00:00
30352,Motion,Motion Sensor,Out of Location,,False,2021-06-21 09:43:04.470000+00:00
29070,Motion,Motion Sensor,Out of Location,,False,2021-06-24 10:13:06.233000+00:00


In [None]:
# we observe that all the NaN timezone locale values occur on rows where sloc is 'Out of Location'
df_cleaned_timestamps['sloc'].value_counts()['Out of Location']

37

- When the person is 'Out of Location' (i.e. his timezone locale is undetectable), one of the following may be true:
1. There is a spot in the house where the sensors can't capture his motion
2. There is a bug in the sensor capture
3. He has left the house
<br>
- This information could prove useful, so we won't discard the rows with gwTz=NaN for now.
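
One way to narrow down these possibilities is to look at which room is seen immediately before and after each 'Out of Location' event. The sketch below does this with `shift` on a made-up event sequence; the timestamps and room order are invented for illustration and the same pattern would apply to `df_cleaned_timestamps` once it is sorted by time.

```python
import pandas as pd

# Made-up, time-ordered events
events = pd.DataFrame(
    {"sloc": ["Kitchen", "Out of Location", "Kitchen",
              "Hallway", "Out of Location", "Livingroom"]},
    index=pd.to_datetime([
        "2021-06-02T10:18:00Z", "2021-06-02T10:19:21Z", "2021-06-02T10:25:00Z",
        "2021-06-02T11:00:00Z", "2021-06-02T11:02:00Z", "2021-06-02T11:30:00Z",
    ], utc=True),
).sort_index()

# Room of the previous and next event, aligned to each row
events["prev_sloc"] = events["sloc"].shift(1)
events["next_sloc"] = events["sloc"].shift(-1)

out_rows = events[events["sloc"] == "Out of Location"]
prev_counts = out_rows["prev_sloc"].value_counts()
next_counts = out_rows["next_sloc"].value_counts()
print(prev_counts)
print(next_counts)
```

If one room dominates both counts, a sensor problem in that room becomes more plausible; long gaps around the event would instead hint that the person left the house.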

In [None]:
df_cleaned_timestamps['isAppl'].value_counts()[False]

38056

- No event was created by an appliance.

In [None]:
df_cleaned_timestamps.set_index('dt',inplace=True)
df_cleaned_timestamps.index.name = None
df_cleaned_timestamps

Unnamed: 0,motion,sTyp,sloc,gwTz,isAppl
2021-06-01 00:02:49+00:00,Motion,Motion Sensor,Bedroom,Europe/London,False
2021-06-01 00:10:51+00:00,Motion,Motion Sensor,Bedroom,Europe/London,False
2021-06-01 00:12:12+00:00,Motion,Motion Sensor,Bedroom,Europe/London,False
2021-06-01 00:13:57+00:00,Motion,Motion Sensor,Bedroom,Europe/London,False
2021-06-01 00:16:59+00:00,Motion,Motion Sensor,Bedroom,Europe/London,False
...,...,...,...,...,...
2021-08-31 22:10:14+00:00,Motion,Motion Sensor,Bedroom,Europe/London,False
2021-08-31 22:11:32+00:00,Motion,Motion Sensor,Bedroom,Europe/London,False
2021-08-31 22:12:25+00:00,Motion,Motion Sensor,Bedroom,Europe/London,False
2021-08-31 23:37:07+00:00,Motion,Motion Sensor,Bedroom,Europe/London,False


In [None]:
print(df_cleaned_timestamps.columns)

Index(['motion', 'sTyp', 'sloc', 'gwTz', 'isAppl'], dtype='object')


In [None]:
type(df_cleaned_timestamps.index)

pandas.core.indexes.datetimes.DatetimeIndex

In [None]:
df_cleaned_timestamps['d'] = df_cleaned_timestamps.index.day
df_cleaned_timestamps['m'] = df_cleaned_timestamps.index.month
df_cleaned_timestamps['y'] = df_cleaned_timestamps.index.year
df_cleaned_timestamps['m_name'] = df_cleaned_timestamps.index.month_name()
df_cleaned_timestamps['nth_day_of_week'] = df_cleaned_timestamps.index.dayofweek
df_cleaned_timestamps['nth_day_of_week_name'] = df_cleaned_timestamps.index.day_name()
df_cleaned_timestamps['is_weekend'] = np.where(df_cleaned_timestamps['nth_day_of_week_name'].isin(['Saturday', 'Sunday']), 1, 0)
df_cleaned_timestamps['nth_week_of_year'] = df_cleaned_timestamps.index.isocalendar().week
df_cleaned_timestamps['quarter'] = df_cleaned_timestamps.index.quarter
df_cleaned_timestamps['hr'] = df_cleaned_timestamps.index.hour
df_cleaned_timestamps['min'] = df_cleaned_timestamps.index.minute

# df_cleaned_timestamps['sec'] = df_cleaned_timestamps.index.second

display(df_cleaned_timestamps)

Unnamed: 0,motion,sTyp,sloc,gwTz,isAppl,d,m,y,m_name,nth_day_of_week,nth_day_of_week_name,is_weekend,nth_week_of_year,quarter,hr,min
2021-06-01 00:02:49+00:00,Motion,Motion Sensor,Bedroom,Europe/London,False,1,6,2021,June,1,Tuesday,0,22,2,0,2
2021-06-01 00:10:51+00:00,Motion,Motion Sensor,Bedroom,Europe/London,False,1,6,2021,June,1,Tuesday,0,22,2,0,10
2021-06-01 00:12:12+00:00,Motion,Motion Sensor,Bedroom,Europe/London,False,1,6,2021,June,1,Tuesday,0,22,2,0,12
2021-06-01 00:13:57+00:00,Motion,Motion Sensor,Bedroom,Europe/London,False,1,6,2021,June,1,Tuesday,0,22,2,0,13
2021-06-01 00:16:59+00:00,Motion,Motion Sensor,Bedroom,Europe/London,False,1,6,2021,June,1,Tuesday,0,22,2,0,16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2021-08-31 22:10:14+00:00,Motion,Motion Sensor,Bedroom,Europe/London,False,31,8,2021,August,1,Tuesday,0,35,3,22,10
2021-08-31 22:11:32+00:00,Motion,Motion Sensor,Bedroom,Europe/London,False,31,8,2021,August,1,Tuesday,0,35,3,22,11
2021-08-31 22:12:25+00:00,Motion,Motion Sensor,Bedroom,Europe/London,False,31,8,2021,August,1,Tuesday,0,35,3,22,12
2021-08-31 23:37:07+00:00,Motion,Motion Sensor,Bedroom,Europe/London,False,31,8,2021,August,1,Tuesday,0,35,3,23,37


In [None]:
df_cleaned_timestamps.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 38056 entries, 2021-06-01 00:02:49+00:00 to 2021-08-31 23:39:33+00:00
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   motion                38056 non-null  object
 1   sTyp                  38056 non-null  object
 2   sloc                  38056 non-null  object
 3   gwTz                  38019 non-null  object
 4   isAppl                38056 non-null  bool  
 5   d                     38056 non-null  int64 
 6   m                     38056 non-null  int64 
 7   y                     38056 non-null  int64 
 8   m_name                38056 non-null  object
 9   nth_day_of_week       38056 non-null  int64 
 10  nth_day_of_week_name  38056 non-null  object
 11  is_weekend            38056 non-null  int64 
 12  nth_week_of_year      38056 non-null  UInt32
 13  quarter               38056 non-null  int64 
 14  hr                    38056 non-null  i

In [None]:
unique_values = {}  # maps each column name to the array of its unique values

for column in df_cleaned_timestamps.columns:
    unique_values[column] = df_cleaned_timestamps[column].unique()

for column, values in unique_values.items():
    print(f"Unique values of {column}:")
    print(values)
    print(f"Count of unique values in {column}: {len(values)}")
    print()

Unique values of motion:
['Motion']
Count of unique values in motion: 1

Unique values of sTyp:
['Motion Sensor']
Count of unique values in sTyp: 1

Unique values of sloc:
['Bedroom' 'Stairs' 'Hallway' 'Livingroom' 'Kitchen' 'Bathroom' 'Toilet'
 'Conservatory' 'Cloakroom' 'Out of Location']
Count of unique values in sloc: 10

Unique values of gwTz:
['Europe/London' nan]
Count of unique values in gwTz: 2

Unique values of isAppl:
[False]
Count of unique values in isAppl: 1

Unique values of d:
[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31]
Count of unique values in d: 31

Unique values of m:
[6 7 8]
Count of unique values in m: 3

Unique values of y:
[2021]
Count of unique values in y: 1

Unique values of m_name:
['June' 'July' 'August']
Count of unique values in m_name: 3

Unique values of nth_day_of_week:
[1 2 3 4 5 6 0]
Count of unique values in nth_day_of_week: 7

Unique values of nth_day_of_week_name:
['Tuesday' 'Wednesday' 'Thursday

- We have sensor data for June, July and August 2021.
- To understand the person's behavioral patterns, what we mainly need is the sloc and the time at which he was in each location.

# Visualize data

## A. Person's daily order of motion

We need this visual to compare the time he spends in each room relative to the other rooms, and to see the order in which he moves between them.

### i. For every day in the dataset

In [None]:
# Convert the index date to a list and then get unique values
unique_dates = np.unique(df_cleaned_timestamps.index.date).tolist()

In [None]:
# unique_dates = np.unique(df_copy_1.index.date).tolist()
# for date in unique_dates:
#   print(f"{date} type {type(date)}") # 2021-06-01 type <class 'datetime.date'>

In [None]:
# This order is to be plotted on the y-axis on every graph
category_order = ['Bedroom', 'Stairs', 'Hallway', 'Livingroom', 'Kitchen', 'Bathroom','Toilet','Conservatory','Cloakroom','Out of Location']

category_colors = {
    'Bedroom': 'red',
    'Stairs': 'green',
    'Hallway': 'blue',
    'Livingroom': 'orange',
    'Kitchen': 'purple',
    'Bathroom': 'cyan',
    'Toilet': 'yellow',
    'Conservatory': 'magenta',
    'Cloakroom': 'lime',
    'Out of Location': 'pink'
}

In [None]:
# Iterate over each day in the dataset
# for i, day in enumerate(tqdm(unique_dates)):
for i,day in enumerate(tqdm(unique_dates[30:50])):
    # Filter the data for the current day
    df_day = df_cleaned_timestamps[df_cleaned_timestamps.index.date == day]

    # Create the scatter plot for the current day
    fig = px.scatter(df_day, x=df_day.index, y='sloc', color='sloc')

    # Customizing the plot layout
    fig.update_layout(
        title='Categorical Time Series Data - {}'.format(day),
        xaxis_title='Time',
        yaxis_title='Category',
        yaxis={'categoryorder': 'array', 'categoryarray': category_order}
    )
    # Update the color for each category
    for category, color in category_colors.items():
        fig.for_each_trace(lambda t: t.update(marker=dict(color=color)) if t.name == category else ())
    # Add one faint line connecting consecutive points; a single trace avoids
    # one legend entry per segment and no longer reuses the outer loop variable i
    fig.add_trace(go.Scatter(
        x=df_day.index,
        y=df_day['sloc'],
        mode='lines',
        line=dict(color='darkgrey', width=1, dash='dash'),
        showlegend=False
    ))

    # Display the plot
    fig.show()

100%|██████████| 20/20 [00:07<00:00,  2.76it/s]


### ii. For every day in the specified range

In [None]:
start_date = df_cleaned_timestamps.index.min().date()
end_date = df_cleaned_timestamps.index.max().date()

print("Start Date:", start_date)
print("End Date:", end_date)

Start Date: 2021-06-01
End Date: 2021-08-31


In [None]:
# Define the start and end dates for the range
start_date = datetime.date(2021, 6, 1)
end_date = datetime.date(2021, 6, 2)

# Iterate over each day in the date range
current_date = start_date
while current_date <= end_date:
    # Filter the data for the current day
    df_day = df_cleaned_timestamps[df_cleaned_timestamps.index.date == current_date]
    # Create the scatter plot for the current day
    fig = px.scatter(df_day, x=df_day.index, y='sloc', color='sloc')
    # Customizing the plot layout
    fig.update_layout(
        title='Categorical Time Series Data - {}'.format(current_date),
        xaxis_title='Time',
        yaxis_title='Category',
        yaxis={'categoryorder': 'array', 'categoryarray': category_order}
    )
    # Update the color for each category
    for category, color in category_colors.items():
        fig.for_each_trace(lambda t: t.update(marker=dict(color=color)) if t.name == category else ())
    # Add one faint line connecting consecutive points; a single trace avoids
    # one legend entry per segment
    fig.add_trace(go.Scatter(
        x=df_day.index,
        y=df_day['sloc'],
        mode='lines',
        line=dict(color='darkgrey', width=1, dash='dash'),
        showlegend=False
    ))
    fig.show()
    # Move to the next day
    current_date += datetime.timedelta(days=1)

- Eyeballing this person's motion for each day of June, July and August 2021, we can make a few generalizations and observations:
  - Whenever an 'Out of Location' event is recorded, he is generally in the kitchen. This might mean something is wrong with the kitchen sensor.
  - One habitual pattern: <br> The person usually wakes up around 6 am and goes to the toilet via the staircase and the hallway. On the way there or back, he passes through, or stays a few minutes in, the kitchen or the living room. He then goes to the bathroom and returns to the bedroom between 7 and 8 am, staying there for about 30 minutes.
  - He usually goes to the bedroom and sleeps around 9.30 am.

## B. Time he spends in each room

This is needed for preventive measures. For instance, if the person spends an abnormal amount of time in the toilet, he might be experiencing some illness, and the IoT devices can make a note of this.
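
As a rough sketch of what "abnormal" could mean here, the snippet below flags days whose toilet time is far from the median, using a modified z-score based on the median absolute deviation. This is an illustrative choice, not the method used in this notebook; the threshold of 3.5 is a common convention, and the last value in the series is made up to show a flagged day.

```python
import pandas as pd

# Hours per day in the toilet; the first five values are rounded from the
# table computed later in this section, the last one is invented
daily_hours = pd.Series(
    [0.16, 0.45, 0.17, 0.22, 0.25, 1.90],
    index=pd.date_range("2021-06-01", periods=6, freq="D"),
    name="time_in_Toilet",
)

median = daily_hours.median()
mad = (daily_hours - median).abs().median()   # median absolute deviation

# Modified z-score: 0.6745 rescales the MAD to be comparable to a standard deviation
robust_z = 0.6745 * (daily_hours - median) / mad

flagged = daily_hours[robust_z.abs() > 3.5]   # assumed threshold
print(flagged)
```

A median-based score is used so that the unusual days themselves do not inflate the baseline they are compared against.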

In [None]:
df_copy_2 = df_cleaned_timestamps.copy()
df_copy_2

Unnamed: 0,motion,sTyp,sloc,gwTz,isAppl,d,m,y,m_name,nth_day_of_week,nth_day_of_week_name,is_weekend,nth_week_of_year,quarter,hr,min
2021-06-01 00:02:49+00:00,Motion,Motion Sensor,Bedroom,Europe/London,False,1,6,2021,June,1,Tuesday,0,22,2,0,2
2021-06-01 00:10:51+00:00,Motion,Motion Sensor,Bedroom,Europe/London,False,1,6,2021,June,1,Tuesday,0,22,2,0,10
2021-06-01 00:12:12+00:00,Motion,Motion Sensor,Bedroom,Europe/London,False,1,6,2021,June,1,Tuesday,0,22,2,0,12
2021-06-01 00:13:57+00:00,Motion,Motion Sensor,Bedroom,Europe/London,False,1,6,2021,June,1,Tuesday,0,22,2,0,13
2021-06-01 00:16:59+00:00,Motion,Motion Sensor,Bedroom,Europe/London,False,1,6,2021,June,1,Tuesday,0,22,2,0,16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2021-08-31 22:10:14+00:00,Motion,Motion Sensor,Bedroom,Europe/London,False,31,8,2021,August,1,Tuesday,0,35,3,22,10
2021-08-31 22:11:32+00:00,Motion,Motion Sensor,Bedroom,Europe/London,False,31,8,2021,August,1,Tuesday,0,35,3,22,11
2021-08-31 22:12:25+00:00,Motion,Motion Sensor,Bedroom,Europe/London,False,31,8,2021,August,1,Tuesday,0,35,3,22,12
2021-08-31 23:37:07+00:00,Motion,Motion Sensor,Bedroom,Europe/London,False,31,8,2021,August,1,Tuesday,0,35,3,23,37


In [None]:
df_copy_2.index.nunique()

38056

### Preprocess data

In [None]:
unique_dates = np.unique(df_copy_2.index.date).tolist() # each date in unique_dates is of type <class 'datetime.date'>
new_df = pd.DataFrame(index=unique_dates)
new_df # we intend to fill each day in the df with columns: time_spent_in_{sloc_category}

2021-06-01
2021-06-02
2021-06-03
2021-06-04
2021-06-05
...
2021-08-27
2021-08-28
2021-08-29
2021-08-30
2021-08-31


In [None]:
type(new_df.index)

pandas.core.indexes.base.Index

In [None]:
sloc_categories = df_copy_2['sloc'].unique()

for sloc_category in sloc_categories:
    column_name = f"time_in_{sloc_category}"
    new_df[column_name] = 0

new_df

Unnamed: 0,time_in_Bedroom,time_in_Stairs,time_in_Hallway,time_in_Livingroom,time_in_Kitchen,time_in_Bathroom,time_in_Toilet,time_in_Conservatory,time_in_Cloakroom,time_in_Out of Location
2021-06-01,0,0,0,0,0,0,0,0,0,0
2021-06-02,0,0,0,0,0,0,0,0,0,0
2021-06-03,0,0,0,0,0,0,0,0,0,0
2021-06-04,0,0,0,0,0,0,0,0,0,0
2021-06-05,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...
2021-08-27,0,0,0,0,0,0,0,0,0,0
2021-08-28,0,0,0,0,0,0,0,0,0,0
2021-08-29,0,0,0,0,0,0,0,0,0,0
2021-08-30,0,0,0,0,0,0,0,0,0,0


In [None]:
new_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 92 entries, 2021-06-01 to 2021-08-31
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   time_in_Bedroom          92 non-null     int64
 1   time_in_Stairs           92 non-null     int64
 2   time_in_Hallway          92 non-null     int64
 3   time_in_Livingroom       92 non-null     int64
 4   time_in_Kitchen          92 non-null     int64
 5   time_in_Bathroom         92 non-null     int64
 6   time_in_Toilet           92 non-null     int64
 7   time_in_Conservatory     92 non-null     int64
 8   time_in_Cloakroom        92 non-null     int64
 9   time_in_Out of Location  92 non-null     int64
dtypes: int64(10)
memory usage: 7.9+ KB


In [None]:
type(new_df.index)

pandas.core.indexes.base.Index

***Heuristic Measures***

**Logic:**
<br>
Let's say t1,s1 is the initial timestamp,sloc and t2,s2 is the next timestamp,sloc.
- If s1==s2, then t2-t1 is the time spent by the person in s1.
- If s1!=s2, then t2-t1 can be the time spent in either room s1 or s2 or both rooms have some proportion of (t2-t1). For the sake of simplicity we'll assume half the time spent is in s1 and the other half is in s2.
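
The two rules can be checked by hand on a tiny, made-up sequence of four events. The helper name `time_per_room` is hypothetical and exists only for this illustration; it applies the same-room rule and the split-in-half rule to consecutive pairs of events.

```python
import pandas as pd

# Made-up events: two in the bedroom, then two in the kitchen
events = pd.DataFrame(
    {"sloc": ["Bedroom", "Bedroom", "Kitchen", "Kitchen"]},
    index=pd.to_datetime([
        "2021-06-01T06:00:00Z",
        "2021-06-01T06:10:00Z",
        "2021-06-01T06:14:00Z",
        "2021-06-01T06:20:00Z",
    ], utc=True),
)

def time_per_room(day_events):
    """Return seconds per room for time-ordered events of a single day."""
    totals = {}
    times = day_events.index
    rooms = day_events["sloc"].tolist()
    for k in range(len(rooms) - 1):
        gap = (times[k + 1] - times[k]).total_seconds()
        if rooms[k] == rooms[k + 1]:
            # s1 == s2: the whole gap belongs to that room
            totals[rooms[k]] = totals.get(rooms[k], 0.0) + gap
        else:
            # s1 != s2: half of the gap goes to each room
            totals[rooms[k]] = totals.get(rooms[k], 0.0) + gap / 2
            totals[rooms[k + 1]] = totals.get(rooms[k + 1], 0.0) + gap / 2
    return totals

totals = time_per_room(events)
print(totals)  # {'Bedroom': 720.0, 'Kitchen': 480.0}
```

The bedroom gets 600 s from the first pair plus half of the 240 s transition; the kitchen gets the other half plus 360 s from the last pair.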

In [None]:
for i,day in enumerate(tqdm(unique_dates)):
    # Filter the data for the current day
    df_day = df_copy_2[df_copy_2.index.date == day] # df_day -> <class 'pandas.core.frame.DataFrame'>
    # print(df_day) # df_day is a view taken from df_copy_2 for that particular day

    tstamps_list = df_day.index.tolist() # list of all the timestamps each of which is a Timestamp('2021-06-01 00:02:49+0000', tz='UTC')

    # iterate through every timestamp
    for j in range(len(tstamps_list) - 1):
        current_date = tstamps_list[j]
        next_date = tstamps_list[j + 1]
        diff = next_date-current_date
        current_sloc = df_day.at[current_date, 'sloc']
        next_sloc = df_day.at[next_date, 'sloc']
        # print(f'Current date {current_date}\tCurrent sloc {current_sloc}')
        # print(f'Current date {next_date}\tCurrent sloc {next_sloc}\n\n')
        if current_sloc == next_sloc:
            # same room at both ends: credit the whole gap to that room
            name_of_room = f'time_in_{current_sloc}'
            cell_value = new_df.at[day, name_of_room]
            new_df.at[day, name_of_room] = cell_value + diff.total_seconds()  # add the time difference in seconds
        else:
            # rooms differ: split the gap equally between the two rooms
            name_of_room_1 = f'time_in_{current_sloc}'
            name_of_room_2 = f'time_in_{next_sloc}'
            cell_value_1 = new_df.at[day, name_of_room_1]
            cell_value_2 = new_df.at[day, name_of_room_2]
            new_df.at[day, name_of_room_1] = cell_value_1 + diff.total_seconds()/2
            new_df.at[day, name_of_room_2] = cell_value_2 + diff.total_seconds()/2

100%|██████████| 92/92 [00:04<00:00, 20.39it/s]


- On the first run of this loop, the data structure turned out to be INCONSISTENT: for some timestamps, current_sloc and next_sloc held two different locations at the same instant, and the lookup returned a tiny dataframe for that row instead of a single value. In other words, some timestamps were duplicated.
- We went back up and handled this during preprocessing.
- The result is df_cleaned_timestamps, which is what we use now.
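
The inconsistency can be reproduced on a small made-up frame: when an index label appears twice, a label lookup returns several values rather than one. The sketch below uses `.loc` and `index.is_unique` to show the problem and the `keep='last'` fix; the rows are invented and the exact object returned in the original run may have differed.

```python
import pandas as pd

t1 = "2021-08-31T14:59:30Z"
t2 = "2021-08-31T15:05:00Z"

# Made-up frame where the first timestamp occurs twice with different rooms
dup = pd.DataFrame(
    {"sloc": ["Stairs", "Hallway", "Kitchen"]},
    index=pd.to_datetime([t1, t1, t2], utc=True),
)

print(dup.index.is_unique)                 # False: t1 is repeated

# A lookup on the repeated label yields two rooms instead of a single string
lookup = dup.loc[dup.index[0], "sloc"]
print(lookup)

# Keep only the last row for each timestamp, as done in preprocessing
deduped = dup[~dup.index.duplicated(keep="last")]
print(deduped)
```

Checking `index.is_unique` before the loop is a cheap guard against this kind of surprise.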

In [None]:
new_df

Unnamed: 0,time_in_Bedroom,time_in_Stairs,time_in_Hallway,time_in_Livingroom,time_in_Kitchen,time_in_Bathroom,time_in_Toilet,time_in_Conservatory,time_in_Cloakroom,time_in_Out of Location
2021-06-01,30665.0,1415.5,1778.0,41636.0,7856.5,558.5,571.5,1746.0000,0.0,0.0000
2021-06-02,33391.5,2274.5,1316.0,37701.0,7709.5,532.0,1604.0,925.0000,310.5,344.0000
2021-06-03,30051.0,1525.5,1337.0,41293.0,6363.5,1087.5,603.0,696.5000,0.0,328.0000
2021-06-04,29298.5,1032.0,1223.0,44043.0,6715.5,988.0,787.5,131.0000,94.5,0.0000
2021-06-05,31310.0,1352.0,860.0,37731.5,11435.0,970.0,913.0,647.5000,0.0,0.0000
...,...,...,...,...,...,...,...,...,...,...
2021-08-27,24242.0,2323.5,812.5,41205.5,9004.5,638.5,570.5,1168.0000,0.0,0.0000
2021-08-28,25936.5,3284.0,3823.0,21381.5,24622.0,1255.0,604.0,1834.0000,0.0,0.0000
2021-08-29,23633.0,1932.5,2211.0,37422.5,9256.0,1555.5,195.0,435.5000,54.0,0.0000
2021-08-30,22962.5,2684.0,2557.5,37898.0,10174.0,447.5,535.0,1178.5000,0.0,0.0000


In [None]:
new_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 92 entries, 2021-06-01 to 2021-08-31
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   time_in_Bedroom          92 non-null     float64
 1   time_in_Stairs           92 non-null     float64
 2   time_in_Hallway          92 non-null     float64
 3   time_in_Livingroom       92 non-null     float64
 4   time_in_Kitchen          92 non-null     float64
 5   time_in_Bathroom         92 non-null     float64
 6   time_in_Toilet           92 non-null     float64
 7   time_in_Conservatory     92 non-null     float64
 8   time_in_Cloakroom        92 non-null     float64
 9   time_in_Out of Location  92 non-null     float64
dtypes: float64(10)
memory usage: 10.0+ KB


In [None]:
new_df_copy = new_df.copy()

In [None]:
new_df_copy = new_df_copy.div(3600)
new_df_copy

Unnamed: 0,time_in_Bedroom,time_in_Stairs,time_in_Hallway,time_in_Livingroom,time_in_Kitchen,time_in_Bathroom,time_in_Toilet,time_in_Conservatory,time_in_Cloakroom,time_in_Out of Location
2021-06-01,8.518056,0.393194,0.493889,11.565556,2.182361,0.155139,0.158750,0.485000,0.00000,0.000000
2021-06-02,9.275417,0.631806,0.365556,10.472500,2.141528,0.147778,0.445556,0.256944,0.08625,0.095556
2021-06-03,8.347500,0.423750,0.371389,11.470278,1.767639,0.302083,0.167500,0.193472,0.00000,0.091111
2021-06-04,8.138472,0.286667,0.339722,12.234167,1.865417,0.274444,0.218750,0.036389,0.02625,0.000000
2021-06-05,8.697222,0.375556,0.238889,10.480972,3.176389,0.269444,0.253611,0.179861,0.00000,0.000000
...,...,...,...,...,...,...,...,...,...,...
2021-08-27,6.733889,0.645417,0.225694,11.445972,2.501250,0.177361,0.158472,0.324444,0.00000,0.000000
2021-08-28,7.204583,0.912222,1.061944,5.939306,6.839444,0.348611,0.167778,0.509444,0.00000,0.000000
2021-08-29,6.564722,0.536806,0.614167,10.395139,2.571111,0.432083,0.054167,0.120972,0.01500,0.000000
2021-08-30,6.378472,0.745556,0.710417,10.527222,2.826111,0.124306,0.148611,0.327361,0.00000,0.000000


### Visual and Insight

In [None]:
new_df_copy.index = pd.to_datetime(new_df_copy.index)
fig = px.bar(new_df_copy, title='Hours Spent per Room per Day')
fig.update_layout(xaxis_tickangle=-45)
fig.show()

In [None]:
new_df_copy.describe()

Unnamed: 0,time_in_Bedroom,time_in_Stairs,time_in_Hallway,time_in_Livingroom,time_in_Kitchen,time_in_Bathroom,time_in_Toilet,time_in_Conservatory,time_in_Cloakroom,time_in_Out of Location
count,92.0,92.0,92.0,92.0,92.0,92.0,92.0,92.0,92.0,92.0
mean,7.828305,0.565091,0.437551,10.320988,2.905813,0.258132,0.234583,0.274583,0.030817,0.153424
std,0.831379,0.347095,0.390142,1.531482,1.068635,0.089766,0.094041,0.383525,0.249052,0.513427
min,5.779444,0.249869,0.159028,5.771944,1.012692,0.115,0.054167,0.0,0.0,0.0
25%,7.171215,0.394271,0.290989,10.058333,2.28434,0.180694,0.164826,0.087118,0.0,0.0
50%,8.044306,0.497014,0.366944,10.663819,2.731528,0.247153,0.226458,0.189097,0.0,0.0
75%,8.491806,0.584953,0.449861,11.343728,3.213671,0.314618,0.290139,0.328125,0.0,0.097083
max,9.281389,3.18203,3.780425,12.569444,6.839444,0.496806,0.532361,3.260972,2.383056,2.79875


- The person spends about 7.8 hours per day in the bedroom on average, which suggests roughly 8 hours of sleep if most bedroom time is spent sleeping.
- The person spends about 14 minutes per day in the toilet on average (0.234583 × 60 ≈ 14.07).
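
The hours-to-minutes arithmetic above can be repeated for several rooms at once. This is a small sketch using mean values copied by hand from the `describe()` output. The dictionary names are illustrative and not part of the notebook.

```python
# Mean hours per day, copied from the describe() output above.
mean_hours = {
    "Bedroom": 7.828305,
    "Livingroom": 10.320988,
    "Kitchen": 2.905813,
    "Bathroom": 0.258132,
    "Toilet": 0.234583,
}

# Convert hours to minutes so the small values are easier to read.
mean_minutes = {room: round(hours * 60, 1) for room, hours in mean_hours.items()}

for room, minutes in mean_minutes.items():
    print(f"{room}: {minutes} min/day")
```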

# Approach to detect change in behaviour

## 1. Anomaly detection using Isolation Forest Algorithm

In [None]:
new_df_copy_1 = new_df_copy.copy()

In [None]:
X=new_df_copy_1.values

In [None]:
clf=IsolationForest(contamination=0.10)

In [None]:
clf.fit(X)

In [None]:
clf.predict(X)

array([ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,
        1,  1, -1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1, -1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1, -1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1, -1,
        1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1, -1,  1,  1,  1])

In [None]:
predictions = clf.predict(X)
(predictions>0).mean()

0.8913043478260869

About 89% of the days (82 of 92) are classified as normal.

In [None]:
abnormal_index = np.where(predictions<0)

In [None]:
abnormal_index

(array([11, 19, 25, 33, 47, 48, 61, 67, 75, 88]),)

In [None]:
abnormal_df = new_df_copy_1.iloc[abnormal_index[0]]

fig = px.bar(abnormal_df)
fig.update_layout(xaxis_tickangle=-45)
fig.show()

- The days plotted above were flagged as anomalous, i.e. the person spent an unusual amount of time in one or more rooms on those days.
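
One way to see which rooms make a flagged day unusual is to compare that day with the column means and standard deviations. The sketch below does this for 2021-08-28, which is position 88 in the index and one of the flagged days. All values are copied from the tables above. The z-score is a simple heuristic added here for illustration; it is not how Isolation Forest itself decides.

```python
# Hours on 2021-08-28, copied from the hours table above.
day = {"Bedroom": 7.204583, "Livingroom": 5.939306, "Kitchen": 6.839444, "Hallway": 1.061944}

# Column means and standard deviations, copied from the describe() output.
mean = {"Bedroom": 7.828305, "Livingroom": 10.320988, "Kitchen": 2.905813, "Hallway": 0.437551}
std = {"Bedroom": 0.831379, "Livingroom": 1.531482, "Kitchen": 1.068635, "Hallway": 0.390142}

# z-score: how many standard deviations the day lies from the mean.
z = {room: (day[room] - mean[room]) / std[room] for room in day}

# Rank rooms by absolute deviation, largest first.
ranked = sorted(z, key=lambda room: abs(z[room]), reverse=True)

for room in ranked:
    print(f"{room}: z = {z[room]:+.2f}")
```

On this day the kitchen time is well above average and the living room time is well below it, which matches the unusual row in the table.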

In [None]:
dec_fun = clf.decision_function(X)

In [None]:
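# decision_function returns an anomaly score, not a percentage:
# negative values are outliers, positive values are inliers.
new_df_copy_1['anomaly_score'] = dec_fun
px.line(new_df_copy_1['anomaly_score'])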
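
The relationship between `decision_function` and `predict` can be checked on synthetic data. This is a sketch under stated assumptions: `X_demo` is randomly generated and is not the sensor data, and `random_state=42` is added so that repeated runs give the same labels. The notebook's own model does not set `random_state`.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in for the daily room-hours matrix: 92 days x 3 rooms.
X_demo = rng.normal(loc=[8.0, 10.0, 3.0], scale=[0.8, 1.5, 1.0], size=(92, 3))

# random_state is an added assumption for reproducibility.
clf_demo = IsolationForest(contamination=0.10, random_state=42).fit(X_demo)

scores = clf_demo.decision_function(X_demo)  # lower means more anomalous
labels = clf_demo.predict(X_demo)            # -1 for outliers, 1 for inliers

# predict() returns -1 exactly where the decision score is negative.
same_split = np.array_equal(labels == -1, scores < 0)

# Positions of the five lowest-scoring (most anomalous) synthetic days.
most_anomalous = np.argsort(scores)[:5]
print(same_split, most_anomalous)
```

Sorting by score gives a ranking of days from most to least anomalous, which can be more informative than the binary labels alone.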

--WORK IN PROGRESS--