# Detecting Human Activities Through Smartwatch Sensor - Preprocessing

- Dataset source: WISDM Lab, Fordham University, NY
https://archive.ics.uci.edu/dataset/507/wisdm+smartphone+and+smartwatch+activity+and+biometrics+dataset

Data captured using

- Two kinds of devices:
    - Smartphone (Samsung Galaxy S5)
    - Smartwatch (LG G Watch)

    
- Two kinds of embedded kinematic sensors (for each device):
    - Accelerometer - for measurement of linear acceleration (m/sec^2)
    - Gyroscope - for measurement of angular velocity (rad/sec)


In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import glob
from tqdm import tqdm
import filecmp

PBAR_FORMAT='{desc:12}{percentage:3.0f}%|{bar:27}[ {n:4d}/{total:4d}, {elapsed}<{remaining}{postfix} ]'

Each activity is represented by a letter code in the dataset. To make the data easier to interpret, we map each code to its activity name.

In [2]:
activity_codes_mapping = {'A': 'walking',
                          'B': 'jogging',
                          'C': 'stairs',
                          'D': 'sitting',
                          'E': 'standing',
                          'F': 'typing',
                          'G': 'brushing teeth',
                          'H': 'eating soup',
                          'I': 'eating chips',
                          'J': 'eating pasta',
                          'K': 'drinking from cup',
                          'L': 'eating sandwich',
                          'M': 'kicking soccer ball',
                          'O': 'playing catch tennis ball',
                          'P': 'dribbling basket ball',
                          'Q': 'writing',
                          'R': 'clapping',
                          'S': 'folding clothes'}
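
The mapping above has no entry for the letter N, so any code without a label would silently become NaN when mapped. A minimal sketch of a defensive check, assuming a hypothetical helper named `unmapped_codes`, an abbreviated copy of the dictionary, and a hand-made sample frame (none of these are part of the original notebook):

```python
import pandas as pd

# Abbreviated copy of the mapping (illustration only, not the full dictionary).
codes = {'A': 'walking', 'B': 'jogging', 'M': 'kicking soccer ball',
         'O': 'playing catch tennis ball'}

def unmapped_codes(df, mapping, column='activity_code'):
    """Return the sorted activity codes in `df` that have no entry in `mapping`."""
    return sorted(set(df[column].unique()) - set(mapping))

# Hand-made sample; 'N' is deliberately included as an unmapped code.
sample = pd.DataFrame({'activity_code': ['A', 'B', 'N', 'O', 'A']})
missing = unmapped_codes(sample, codes)
unlabeled_rows = int(sample['activity_code'].map(codes).isna().sum())
print(missing, unlabeled_rows)
```

Running such a check on each file before mapping would make an unexpected code visible early instead of surfacing later as a missing label.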

# Dataset understanding

## Watch Accelerometer

- Load the watch accelerometer sensor data for participant 1

In [3]:
df_wa_p01 = pd.read_csv(r'../dataset/raw/watch/accel/data_1601_accel_watch.txt', names = ['participant_id' , 'activity_code' , 'timestamp', 'x', 'y', 'z'], index_col=None, header=None)
print(df_wa_p01.info())
df_wa_p01.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64840 entries, 0 to 64839
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   participant_id  64840 non-null  int64  
 1   activity_code   64840 non-null  object 
 2   timestamp       64840 non-null  int64  
 3   x               64840 non-null  float64
 4   y               64840 non-null  float64
 5   z               64840 non-null  object 
dtypes: float64(2), int64(2), object(2)
memory usage: 3.0+ MB
None


Unnamed: 0,participant_id,activity_code,timestamp,x,y,z
0,1601,A,1896411611733301,-2.969708,-1.949329,10.726623;
1,1601,A,1896411661695801,-3.486855,-2.420987,11.660361;
2,1601,A,1896411711658874,-2.826056,-2.854338,9.792884;
3,1601,A,1896411761623926,-3.30729,-3.076998,9.926959;
4,1601,A,1896411811593717,-3.99682,-2.847155,9.280524;


    - Observation
        - x and y are parsed as float64, but z is read as object because every value ends with a trailing ';'
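
The effect of the trailing `;` can be reproduced without the dataset on disk. The sketch below assumes the raw lines look like the rows printed by `head()` above (comma-separated, ending in `;`) and parses two of them from an in-memory string:

```python
import io
import pandas as pd

# Two rows copied from the head() output, written in the assumed raw layout.
raw = ("1601,A,1896411611733301,-2.969708,-1.949329,10.726623;\n"
       "1601,A,1896411661695801,-3.486855,-2.420987,11.660361;\n")
cols = ['participant_id', 'activity_code', 'timestamp', 'x', 'y', 'z']

df = pd.read_csv(io.StringIO(raw), names=cols, header=None)

# x parses as a float column; z stays a string column because of the ';'.
x_is_float = pd.api.types.is_float_dtype(df['x'])
z_is_numeric = pd.api.types.is_numeric_dtype(df['z'])
print(df.dtypes)
print(x_is_float, z_is_numeric)
```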

- Load the watch accelerometer sensor data for participant 2

In [4]:
df_wa_p02 = pd.read_csv(r'../dataset/raw/watch/accel/data_1602_accel_watch.txt', names = ['participant_id' , 'activity_code' , 'timestamp', 'x', 'y', 'z'], index_col=None, header=None)
print(df_wa_p02.info())
df_wa_p02.head()


    - Observation
        - Same pattern as participant 1: z is read as object

- Let's confirm that the remaining participants' data has the same dtypes

In [5]:
for file_name in tqdm(glob.glob(r'../dataset/raw/watch/accel/data_*_accel_watch.txt'),
                      desc="Checking",
                     bar_format=PBAR_FORMAT):
    df_wa = pd.read_csv(file_name, names = ['participant_id' , 'activity_code' , 'timestamp', 'x', 'y', 'z'], index_col=None, header=None)
    if (df_wa.dtypes != df_wa_p01.dtypes).any():
        print(f"mismatch in df for file {file_name}")

Checking    100%|███████████████████████████[   51/  51, 00:02<00:00 ]


    - Observation:
          - All data files have the same column dtypes as participant 1
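
The same comparison can be wrapped in a function that returns the offending files instead of printing them. This is a sketch only: `files_with_dtype_mismatch` is a hypothetical helper, and the two tiny files written to a temporary directory are made-up stand-ins for the real data.

```python
import glob
import os
import tempfile
import pandas as pd

COLS = ['participant_id', 'activity_code', 'timestamp', 'x', 'y', 'z']

def files_with_dtype_mismatch(pattern, reference_dtypes):
    """Return files matching `pattern` whose column dtypes differ from the reference."""
    mismatched = []
    for file_name in sorted(glob.glob(pattern)):
        df = pd.read_csv(file_name, names=COLS, header=None)
        if (df.dtypes != reference_dtypes).any():
            mismatched.append(file_name)
    return mismatched

with tempfile.TemporaryDirectory() as tmp:
    # Made-up one-line files: the second has no trailing ';', so z parses as float.
    samples = {'data_1601_accel_watch.txt': "1601,A,1,0.1,0.2,0.3;\n",
               'data_1602_accel_watch.txt': "1602,A,1,0.1,0.2,0.3\n"}
    for name, text in samples.items():
        with open(os.path.join(tmp, name), 'w') as fh:
            fh.write(text)
    reference = pd.read_csv(os.path.join(tmp, 'data_1601_accel_watch.txt'),
                            names=COLS, header=None).dtypes
    pattern = os.path.join(tmp, 'data_*_accel_watch.txt')
    bad_files = [os.path.basename(f)
                 for f in files_with_dtype_mismatch(pattern, reference)]

print(bad_files)
```

Returning a list keeps the result available for later cells, which is convenient if the mismatching files need special handling.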

- Check for missing values in csv files

In [6]:
for file_name in tqdm(glob.glob(r'../dataset/raw/watch/accel/data_*_accel_watch.txt'),
                      desc="Checking",
                     bar_format=PBAR_FORMAT):
    df_wa = pd.read_csv(file_name, names = ['participant_id' , 'activity_code' , 'timestamp', 'x', 'y', 'z'], index_col=None, header=None)
    if df_wa.isnull().any().any():  # True if any cell in the file is missing
        print(f"missing value present in {file_name}")

Checking    100%|███████████████████████████[   51/  51, 00:02<00:00 ]


    - Observation:
          - None of the data files contain missing values
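
When checking for missing values, the choice of predicate matters: `isnull().any().any()` is true as soon as one cell is missing, whereas `notnull().any().any() == False` is true only when every cell is missing. A small sketch on a made-up frame shows the difference:

```python
import numpy as np
import pandas as pd

# Made-up frame with a single missing value in column x.
df = pd.DataFrame({'x': [0.1, np.nan, 0.3], 'y': [1.0, 2.0, 3.0]})

# At least one NaN anywhere in the frame.
has_missing = bool(df.isnull().any().any())

# Only true if the frame holds no non-null value at all.
all_missing = not bool(df.notnull().any().any())

# Per-column count of missing cells, useful for locating the problem.
missing_per_column = df.isnull().sum().to_dict()

print(has_missing, all_missing, missing_per_column)
```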

- Convert the z column to float and add an activity column for easier interpretation

In [13]:
df_wa_p01.z = df_wa_p01.z.str.strip(';')
df_wa_p01.z = pd.to_numeric(df_wa_p01.z)

In [14]:
df_wa_p01['activity'] = df_wa_p01['activity_code'].map(activity_codes_mapping)
df_wa_p01 = df_wa_p01[['participant_id', 'activity_code', 'activity', 'timestamp', 'x', 'y', 'z']]

df_wa_p01

Unnamed: 0,participant_id,activity_code,activity,timestamp,x,y,z
0,1601,A,walking,1896411611733301,-2.969708,-1.949329,10.726623
1,1601,A,walking,1896411661695801,-3.486855,-2.420987,11.660361
2,1601,A,walking,1896411711658874,-2.826056,-2.854338,9.792884
3,1601,A,walking,1896411761623926,-3.307290,-3.076998,9.926959
4,1601,A,walking,1896411811593717,-3.996820,-2.847155,9.280524
...,...,...,...,...,...,...,...
64835,1601,S,folding clothes,1890263664188560,7.785346,-2.825756,11.020660
64836,1601,S,folding clothes,1890263714160022,5.726332,-1.128267,12.821099
64837,1601,S,folding clothes,1890263764099780,6.719925,-2.854487,12.031014
64838,1601,S,folding clothes,1890263814039267,8.802881,-2.334945,9.308805
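
The two cleanup steps above can be combined into one reusable function so that the same treatment applies to every file. This is a minimal sketch, assuming a hypothetical helper named `clean_sensor_frame` and an abbreviated label dictionary; the sample rows reuse values from the output above with the trailing `;` restored.

```python
import pandas as pd

# Abbreviated label mapping (illustration only).
ACTIVITY_LABELS = {'A': 'walking', 'S': 'folding clothes'}

def clean_sensor_frame(df, mapping):
    """Return a copy with z converted to float and a readable activity column added."""
    out = df.copy()
    # Drop the trailing ';' and convert the remaining text to a number.
    out['z'] = pd.to_numeric(out['z'].astype(str).str.rstrip(';'))
    out['activity'] = out['activity_code'].map(mapping)
    return out[['participant_id', 'activity_code', 'activity',
                'timestamp', 'x', 'y', 'z']]

raw = pd.DataFrame({
    'participant_id': [1601, 1601],
    'activity_code': ['A', 'S'],
    'timestamp': [1896411611733301, 1890263664188560],
    'x': [-2.969708, 7.785346],
    'y': [-1.949329, -2.825756],
    'z': ['10.726623;', '11.020660;'],
})
clean = clean_sensor_frame(raw, ACTIVITY_LABELS)
print(clean.dtypes)
```

Working on a copy leaves the raw frame untouched, so the cell can be re-run without stripping or converting the column twice.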


## Watch Gyroscope

- Load the watch gyroscope sensor data for participant 1

In [7]:
df_wg_p01 = pd.read_csv(r'../dataset/raw/watch/gyro/data_1601_gyro_watch.txt', names = ['participant_id' , 'activity_code' , 'timestamp', 'x', 'y', 'z'], index_col=None, header=None)
print(df_wg_p01.info())
df_wg_p01.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64829 entries, 0 to 64828
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   participant_id  64829 non-null  int64  
 1   activity_code   64829 non-null  object 
 2   timestamp       64829 non-null  int64  
 3   x               64829 non-null  float64
 4   y               64829 non-null  float64
 5   z               64829 non-null  object 
dtypes: float64(2), int64(2), object(2)
memory usage: 3.0+ MB
None


Unnamed: 0,participant_id,activity_code,timestamp,x,y,z
0,1601,A,1896411611733301,0.70336,-0.436308,-0.9538892;
1,1601,A,1896411661695801,0.254884,-0.42459,-0.99330395;
2,1601,A,1896411711658874,0.301756,-0.22219,-0.9496281;
3,1601,A,1896411761623926,0.287907,-0.14123,-1.0955694;
4,1601,A,1896411811593717,0.402956,0.272093,-1.0060872;


    - Observation
        - x and y are parsed as float64, but z is read as object because every value ends with a trailing ';'

- Load the watch gyro sensor data for participant 2

In [8]:
df_wg_p02 = pd.read_csv(r'../dataset/raw/watch/gyro/data_1602_gyro_watch.txt', names = ['participant_id' , 'activity_code' , 'timestamp', 'x', 'y', 'z'], index_col=None, header=None)
print(df_wg_p02.info())
df_wg_p02.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64950 entries, 0 to 64949
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   participant_id  64950 non-null  int64  
 1   activity_code   64950 non-null  object 
 2   timestamp       64950 non-null  int64  
 3   x               64950 non-null  float64
 4   y               64950 non-null  float64
 5   z               64950 non-null  object 
dtypes: float64(2), int64(2), object(2)
memory usage: 3.0+ MB
None


Unnamed: 0,participant_id,activity_code,timestamp,x,y,z
0,1602,A,522177087120822,0.24977,-0.047898,0.40667614;
1,1602,A,522177136620822,0.194376,0.020279,0.24688648;
2,1602,A,522177186120822,0.116612,-0.047898,0.0785747;
3,1602,A,522177235620822,0.102764,-0.166143,-0.08334549;
4,1602,A,522177285120822,0.262553,-0.080921,-0.21224248;


    - Observation
        - Same pattern as participant 1
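
The per-participant loads differ only in the participant id and the sensor name, so building the path from those two values avoids copy-paste slips. A sketch under stated assumptions: `sensor_file_path` is a hypothetical helper, the path template is inferred from the watch paths used in this notebook, and the `device` parameter assumes the phone folders follow the same layout.

```python
def sensor_file_path(participant_id, sensor, device='watch', root='../dataset/raw'):
    """Build the raw file path for one participant and one sensor ('accel' or 'gyro')."""
    return f"{root}/{device}/{sensor}/data_{participant_id}_{sensor}_{device}.txt"

# Paths for the first two participants' watch gyroscope files.
paths = [sensor_file_path(pid, 'gyro') for pid in (1601, 1602)]
for p in paths:
    print(p)
```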

- Let's confirm that the remaining participants' data has the same dtypes

In [9]:
for file_name in tqdm(glob.glob(r'../dataset/raw/watch/gyro/data_*_gyro_watch.txt'),
                      desc="Checking",
                     bar_format=PBAR_FORMAT):
    df_wg = pd.read_csv(file_name, names = ['participant_id' , 'activity_code' , 'timestamp', 'x', 'y', 'z'], index_col=None, header=None)
    if (df_wg.dtypes != df_wg_p01.dtypes).any():
        print(f"mismatch in df for file {file_name}")

Checking    100%|███████████████████████████[   51/  51, 00:02<00:00 ]


    - Observation:
          - All data files have the same column dtypes as participant 1

- Check for missing values in csv files

In [10]:
for file_name in tqdm(glob.glob(r'../dataset/raw/watch/gyro/data_*_gyro_watch.txt'),
                      desc="Checking",
                     bar_format=PBAR_FORMAT):
    df_wg = pd.read_csv(file_name, names = ['participant_id' , 'activity_code' , 'timestamp', 'x', 'y', 'z'], index_col=None, header=None)
    if df_wg.isnull().any().any():  # True if any cell in the file is missing
        print(f"missing value present in {file_name}")

Checking    100%|███████████████████████████[   51/  51, 00:03<00:00 ]


    - Observation:
          - None of the data files contain missing values

- Convert the z column to float and add an activity column for easier interpretation

In [11]:
df_wg_p01.z = df_wg_p01.z.str.strip(';')
df_wg_p01.z = pd.to_numeric(df_wg_p01.z)

In [12]:
df_wg_p01['activity'] = df_wg_p01['activity_code'].map(activity_codes_mapping)
df_wg_p01 = df_wg_p01[['participant_id', 'activity_code', 'activity', 'timestamp', 'x', 'y', 'z']]

df_wg_p01

Unnamed: 0,participant_id,activity_code,activity,timestamp,x,y,z
0,1601,A,walking,1896411611733301,0.703360,-0.436308,-0.953889
1,1601,A,walking,1896411661695801,0.254884,-0.424590,-0.993304
2,1601,A,walking,1896411711658874,0.301756,-0.222190,-0.949628
3,1601,A,walking,1896411761623926,0.287907,-0.141230,-1.095569
4,1601,A,walking,1896411811593717,0.402956,0.272093,-1.006087
...,...,...,...,...,...,...,...
64824,1601,S,folding clothes,1890263614217010,0.723241,0.378951,-0.126360
64825,1601,S,folding clothes,1890263664188560,0.222569,1.128895,-0.416112
64826,1601,S,folding clothes,1890263714160022,0.332291,1.107589,-0.382024
64827,1601,S,folding clothes,1890263764099780,-0.184362,1.096937,-1.043553
