# Detecting Human Activities Through Smartphone Sensors - Preprocessing

- Dataset source: WISDM Lab of Fordham University, NY
https://archive.ics.uci.edu/dataset/507/wisdm+smartphone+and+smartwatch+activity+and+biometrics+dataset

The data was captured using:

- Two kinds of devices:
    - Smartphone (Samsung Galaxy S5)
    - Smartwatch (LG G Watch)

    
- Two kinds of embedded kinematic sensors (for each device):
    - Accelerometer - for measurement of linear acceleration (m/sec^2)
    - Gyroscope - for measurement of angular velocity (rad/sec)


In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import glob
from tqdm import tqdm
import filecmp

PBAR_FORMAT='{desc:12}{percentage:3.0f}%|{bar:27}[ {n:4d}/{total:4d}, {elapsed}<{remaining}{postfix} ]'

Each activity is represented by a letter code in the dataset. To make the data easier to interpret, we map each code to its activity name.

In [2]:
activity_codes_mapping = {'A': 'walking',
                          'B': 'jogging',
                          'C': 'stairs',
                          'D': 'sitting',
                          'E': 'standing',
                          'F': 'typing',
                          'G': 'brushing teeth',
                          'H': 'eating soup',
                          'I': 'eating chips',
                          'J': 'eating pasta',
                          'K': 'drinking from cup',
                          'L': 'eating sandwich',
                          'M': 'kicking soccer ball',
                          'O': 'playing catch tennis ball',
                          'P': 'dribbling basket ball',
                          'Q': 'writing',
                          'R': 'clapping',
                          'S': 'folding clothes'}
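
`Series.map` returns NaN for any code that is missing from the dictionary, so it can be worth checking that every code in the data has an entry. A minimal sketch, assuming a hypothetical helper `unmapped_codes` and a shortened mapping and made-up codes used only for illustration:

```python
import pandas as pd

# Shortened, illustrative mapping (the full one is defined in the cell above).
sample_mapping = {'A': 'walking', 'B': 'jogging', 'C': 'stairs'}


def unmapped_codes(codes: pd.Series, mapping: dict) -> list:
    """Return the sorted activity codes that have no entry in `mapping`."""
    return sorted(set(codes.dropna().unique()) - set(mapping))


# Made-up codes: 'N' is not a key of the mapping, so it is reported.
sample_codes = pd.Series(['A', 'B', 'N', 'A', 'C'])
print(unmapped_codes(sample_codes, sample_mapping))

# Series.map leaves unmapped codes as NaN instead of raising an error.
print(sample_codes.map(sample_mapping).isna().sum())
```

Running the same check on the `activity_code` column of a loaded file would show whether the mapping covers every code that actually occurs.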

# Dataset understanding

## Phone Accelerometer

- Load the phone accelerometer sensor data for participant 1

In [3]:
df_pa_p01 = pd.read_csv(r'../dataset/raw/phone/accel/data_1601_accel_phone.txt', names = ['participant_id' , 'activity_code' , 'timestamp', 'x', 'y', 'z'], index_col=None, header=None)
print(df_pa_p01.info())
df_pa_p01.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81457 entries, 0 to 81456
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   participant_id  81457 non-null  int64  
 1   activity_code   81457 non-null  object 
 2   timestamp       81457 non-null  int64  
 3   x               81457 non-null  float64
 4   y               81457 non-null  float64
 5   z               81457 non-null  object 
dtypes: float64(2), int64(2), object(2)
memory usage: 3.7+ MB
None


Unnamed: 0,participant_id,activity_code,timestamp,x,y,z
0,1601,A,265073308304101,4.703409,9.127296,0.06404489;
1,1601,A,265073348330612,5.354632,15.635334,-0.6290765;
2,1601,A,265073388368581,6.399701,12.926893,0.45010993;
3,1601,A,265073428111445,10.532093,13.207614,-1.0247183;
4,1601,A,265073468081082,16.129736,2.683301,1.1426327;


    - Observation
        - All sensor columns are numeric except z, which is read as object because each value ends with a trailing `;`
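
One possible way to handle the trailing `;` is to strip it while the file is being parsed, using the `converters` argument of `pd.read_csv`. This is only a sketch: the two sample rows are copied from the `head()` output above and written in the raw layout that output implies, and the names `COLUMNS` and `raw` are introduced here for illustration.

```python
import io
import pandas as pd

COLUMNS = ['participant_id', 'activity_code', 'timestamp', 'x', 'y', 'z']

# Two sample rows in the assumed raw layout (each line ends with ';').
raw = io.StringIO(
    "1601,A,265073308304101,4.703409,9.127296,0.06404489;\n"
    "1601,A,265073348330612,5.354632,15.635334,-0.6290765;\n"
)

# The converter receives each z value as a string and returns a float,
# so the trailing ';' never reaches the resulting column.
df = pd.read_csv(raw, names=COLUMNS, header=None,
                 converters={'z': lambda s: float(s.rstrip(';'))})
print(df.dtypes)
```

Converting after loading, as this notebook does later, is equally valid; the converter simply moves the cleanup into the load step.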

- Load the phone accelerometer sensor data for participant 2

In [4]:
df_pa_p02 = pd.read_csv(r'../dataset/raw/phone/accel/data_1602_accel_phone.txt', names = ['participant_id' , 'activity_code' , 'timestamp', 'x', 'y', 'z'], index_col=None, header=None)
print(df_pa_p02.info())
df_pa_p02.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84890 entries, 0 to 84889
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   participant_id  84890 non-null  int64  
 1   activity_code   84890 non-null  object 
 2   timestamp       84890 non-null  int64  
 3   x               84890 non-null  float64
 4   y               84890 non-null  float64
 5   z               84890 non-null  object 
dtypes: float64(2), int64(2), object(2)
memory usage: 3.9+ MB
None


Unnamed: 0,participant_id,activity_code,timestamp,x,y,z
0,1602,A,99019527581830,-0.923737,-11.386169,8.799728;
1,1602,A,99019577935834,-2.18541,-12.316559,8.982025;
2,1602,A,99019628289837,4.615936,-6.947418,1.9848633;
3,1602,A,99019678643841,6.473099,-11.660782,-2.8489838;
4,1602,A,99019728997845,-2.869598,-3.154541,-8.561188;


    - Observation
        - Same pattern as participant 1: z is read as object, the other sensor columns are numeric

- Let's confirm that the remaining participants' data has the same dtypes

In [5]:
for file_name in tqdm(glob.glob(r'../dataset/raw/phone/accel/data_*_accel_phone.txt'),
                      desc="Checking",
                     bar_format=PBAR_FORMAT):
    df_pa = pd.read_csv(file_name, names = ['participant_id' , 'activity_code' , 'timestamp', 'x', 'y', 'z'], index_col=None, header=None)
    if (df_pa.dtypes != df_pa_p01.dtypes).any():
        print(f"mismatch in df for file {file_name}")

Checking    100%|███████████████████████████[   51/  51, 00:03<00:00 ]


    - Observation:
          - All CSV files have the same data types as participant 1
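
When a mismatch does occur, it helps to know which columns differ rather than only which file. A minimal sketch of such a check, assuming a hypothetical helper `find_dtype_mismatches`; the in-memory sample rows are made up so that the example runs without the dataset:

```python
import io
import pandas as pd

COLUMNS = ['participant_id', 'activity_code', 'timestamp', 'x', 'y', 'z']


def find_dtype_mismatches(sources, reference):
    """Map each source name to the columns whose dtype differs from `reference`.

    `sources` is a dict of name -> path or file-like object.
    """
    mismatches = {}
    for name, src in sources.items():
        df = pd.read_csv(src, names=COLUMNS, header=None)
        differing = df.dtypes[df.dtypes != reference.dtypes]
        if not differing.empty:
            mismatches[name] = differing.astype(str).to_dict()
    return mismatches


# Made-up rows: the 'bad' source has a non-numeric x value.
reference = pd.read_csv(io.StringIO("1601,A,1,0.1,0.2,0.3;\n"),
                        names=COLUMNS, header=None)
sources = {
    'good': io.StringIO("1602,B,2,1.5,2.5,3.5;\n"),
    'bad': io.StringIO("1603,C,3,oops,2.5,3.5;\n"),
}
result = find_dtype_mismatches(sources, reference)
print(result)
```

With the real data, `sources` could be built from the `glob.glob(...)` result, for example `{path: path for path in paths}`.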

- Check for missing values in csv files

In [6]:
for file_name in tqdm(glob.glob(r'../dataset/raw/phone/accel/data_*_accel_phone.txt'),
                      desc="Checking",
                     bar_format=PBAR_FORMAT):
    df_pa = pd.read_csv(file_name, names = ['participant_id' , 'activity_code' , 'timestamp', 'x', 'y', 'z'], index_col=None, header=None)
    # isnull().any().any() is True if at least one value in the file is missing
    if df_pa.isnull().any().any():
        print(f"missing value present in {file_name}")

Checking    100%|███████████████████████████[   51/  51, 00:03<00:00 ]


    - Observation:
          - None of the CSV files has missing values
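
A per-column count of missing values makes this kind of check more informative than a single yes/no flag. A small sketch on a made-up frame (the values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing z value, for illustration only.
df = pd.DataFrame({'x': [0.1, 0.2, 0.3],
                   'y': [1.0, 1.1, 1.2],
                   'z': [9.8, np.nan, 9.7]})

null_counts = df.isna().sum()            # number of missing values per column
has_missing = bool(null_counts.any())    # True if any column has a missing value

print(null_counts[null_counts > 0])      # only the columns with missing values
print(has_missing)
```

Note that a z value such as `0.06404489;` is a non-empty string, so it does not count as missing even though it is not yet numeric.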

- Convert the z column to float and add an activity column for easier interpretation

In [7]:
df_pa_p01.z = df_pa_p01.z.str.strip(';')
df_pa_p01.z = pd.to_numeric(df_pa_p01.z)

In [8]:
df_pa_p01['activity'] = df_pa_p01['activity_code'].map(activity_codes_mapping)
df_pa_p01 = df_pa_p01[['participant_id', 'activity_code', 'activity', 'timestamp', 'x', 'y', 'z']]

df_pa_p01

Unnamed: 0,participant_id,activity_code,activity,timestamp,x,y,z
0,1601,A,walking,265073308304101,4.703409,9.127296,0.064045
1,1601,A,walking,265073348330612,5.354632,15.635334,-0.629077
2,1601,A,walking,265073388368581,6.399701,12.926893,0.450110
3,1601,A,walking,265073428111445,10.532093,13.207614,-1.024718
4,1601,A,walking,265073468081082,16.129736,2.683301,1.142633
...,...,...,...,...,...,...,...
81452,1601,S,folding clothes,258908699056416,2.015319,9.988011,0.746392
81453,1601,S,folding clothes,258908738947822,1.681927,10.074801,1.726219
81454,1601,S,folding clothes,258908778855321,1.148020,9.127296,1.492186
81455,1601,S,folding clothes,258908818435165,1.417966,9.126099,1.077989
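
The load, convert, and map steps above can be gathered into one function so that they are applied identically to every file. This is a sketch under assumptions, not the notebook's own code: `load_sensor_file` is a hypothetical helper, the mapping is shortened, and the sample rows are copied from the `head()` output of participant 1 in the raw layout that output implies.

```python
import io
import pandas as pd

COLUMNS = ['participant_id', 'activity_code', 'timestamp', 'x', 'y', 'z']


def load_sensor_file(src, mapping):
    """Read one raw sensor file, convert z to float and add the activity name."""
    df = pd.read_csv(src, names=COLUMNS, header=None)
    # Remove the trailing ';' before converting z to a numeric dtype.
    df['z'] = pd.to_numeric(df['z'].str.rstrip(';'))
    df['activity'] = df['activity_code'].map(mapping)
    return df[['participant_id', 'activity_code', 'activity',
               'timestamp', 'x', 'y', 'z']]


# Sample rows in the assumed raw layout, with a shortened mapping.
raw = io.StringIO(
    "1601,A,265073308304101,4.703409,9.127296,0.06404489;\n"
    "1601,A,265073348330612,5.354632,15.635334,-0.6290765;\n"
)
df_sample = load_sensor_file(raw, {'A': 'walking'})
print(df_sample)
```

Because `src` can be a path as well as a file-like object, the same function could be called on each accelerometer or gyroscope file.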


## Phone Gyroscope

- Load the phone gyroscope sensor data for participant 1

In [9]:
df_pg_p01 = pd.read_csv(r'../dataset/raw/phone/gyro/data_1601_gyro_phone.txt', names = ['participant_id' , 'activity_code' , 'timestamp', 'x', 'y', 'z'], index_col=None, header=None)
print(df_pg_p01.info())
df_pg_p01.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81193 entries, 0 to 81192
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   participant_id  81193 non-null  int64  
 1   activity_code   81193 non-null  object 
 2   timestamp       81193 non-null  int64  
 3   x               81193 non-null  float64
 4   y               81193 non-null  float64
 5   z               81193 non-null  object 
dtypes: float64(2), int64(2), object(2)
memory usage: 3.7+ MB
None


Unnamed: 0,participant_id,activity_code,timestamp,x,y,z
0,1601,A,265073308304101,-0.02024,-0.004261,-0.023435818;
1,1601,A,265073348330612,-1.213602,0.055394,-0.36964676;
2,1601,A,265073388368581,-2.417352,1.124387,-1.644502;
3,1601,A,265073428111445,-3.075152,1.530252,-1.6729978;
4,1601,A,265073468081082,0.011185,4.576909,-0.24367924;


    - Observation
        - As with the accelerometer data, all sensor columns are numeric except z, which is read as object because each value ends with a trailing `;`

- Load the phone gyroscope sensor data for participant 2

In [10]:
df_pg_p02 = pd.read_csv(r'../dataset/raw/phone/gyro/data_1602_gyro_phone.txt', names = ['participant_id' , 'activity_code' , 'timestamp', 'x', 'y', 'z'], index_col=None, header=None)
print(df_pg_p02.info())
df_pg_p02.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64286 entries, 0 to 64285
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   participant_id  64286 non-null  int64  
 1   activity_code   64286 non-null  object 
 2   timestamp       64286 non-null  int64  
 3   x               64286 non-null  float64
 4   y               64286 non-null  float64
 5   z               64286 non-null  object 
dtypes: float64(2), int64(2), object(2)
memory usage: 2.9+ MB
None


Unnamed: 0,participant_id,activity_code,timestamp,x,y,z
0,1602,A,99019678643841,-0.719742,0.398651,-0.16711426;
1,1602,A,99019728997845,-0.957825,1.615234,-0.111709595;
2,1602,A,99019779351849,-1.881058,1.473206,0.83522034;
3,1602,A,99019829705853,-1.57579,0.241714,0.34750366;
4,1602,A,99019880059857,-1.639481,1.097153,0.34846497;


    - Observation
        - Same pattern as participant 1

- Let's confirm that the remaining participants' data has the same dtypes

In [11]:
for file_name in tqdm(glob.glob(r'../dataset/raw/phone/gyro/data_*_gyro_phone.txt'),
                      desc="Checking",
                     bar_format=PBAR_FORMAT):
    df_pg = pd.read_csv(file_name, names = ['participant_id' , 'activity_code' , 'timestamp', 'x', 'y', 'z'], index_col=None, header=None)
    if (df_pg.dtypes != df_pg_p01.dtypes).any():
        print(f"mismatch in df for file {file_name}")

Checking    100%|███████████████████████████[   51/  51, 00:02<00:00 ]


    - Observation:
          - All CSV files have the same data types as participant 1

- Check for missing values in csv files

In [12]:
# Check the phone gyroscope files (this section covers the phone, not the watch)
for file_name in tqdm(glob.glob(r'../dataset/raw/phone/gyro/data_*_gyro_phone.txt'),
                      desc="Checking",
                     bar_format=PBAR_FORMAT):
    df_pg = pd.read_csv(file_name, names = ['participant_id' , 'activity_code' , 'timestamp', 'x', 'y', 'z'], index_col=None, header=None)
    # isnull().any().any() is True if at least one value in the file is missing
    if df_pg.isnull().any().any():
        print(f"missing value present in {file_name}")

Checking    100%|███████████████████████████[   51/  51, 00:02<00:00 ]


    - Observation:
          - None of the CSV files has missing values

- Convert the z column to float and add an activity column for easier interpretation

In [13]:
df_pg_p01.z = df_pg_p01.z.str.strip(';')
df_pg_p01.z = pd.to_numeric(df_pg_p01.z)

In [14]:
df_pg_p01['activity'] = df_pg_p01['activity_code'].map(activity_codes_mapping)
df_pg_p01 = df_pg_p01[['participant_id', 'activity_code', 'activity', 'timestamp', 'x', 'y', 'z']]

df_pg_p01

Unnamed: 0,participant_id,activity_code,activity,timestamp,x,y,z
0,1601,A,walking,265073308304101,-0.020240,-0.004261,-0.023436
1,1601,A,walking,265073348330612,-1.213602,0.055394,-0.369647
2,1601,A,walking,265073388368581,-2.417352,1.124387,-1.644502
3,1601,A,walking,265073428111445,-3.075152,1.530252,-1.672998
4,1601,A,walking,265073468081082,0.011185,4.576909,-0.243679
...,...,...,...,...,...,...,...
81188,1601,S,folding clothes,258908699056416,0.034621,-0.045806,-0.041812
81189,1601,S,folding clothes,258908738947822,0.146474,0.045274,-0.059655
81190,1601,S,folding clothes,258908778855321,0.019175,0.034887,-0.078563
81191,1601,S,folding clothes,258908818435165,0.226901,-0.001065,-0.113451
