# Human Activity Recognition Model - Part 1
## Inspecting and pre-processing the dataset, ready for exploratory data analysis (EDA) in Part 2.
#### Natasha Qayyum - 2021

Background: <br>
MEx is a multimodal dataset containing data for 7 different physiotherapy exercises performed by 30 subjects, recorded by four sensor modalities. This analysis uses the data recorded by one of these modalities, the Sensing Tex Pressure Mat (sampling frequency 15Hz, frame size 32 x 16). The resulting data is both rich (512 readings per frame) and sparse (many of those readings are zero).

The aim is to select an informative subset of the 512 features with which to train a machine learning (ML) model, demonstrating a beneficial application of Feature Selection in the pre-processing steps. The ML model seeks to predict which physiotherapy exercise a subject is performing on the pressure mat, according to sensor readings.

Benefits of feature selection include:
- Reduced training times
- Reduced computational requirements
- Removal of irrelevant features, for example those pressure points which have never picked up a signal
- Reduced overfitting, by separating the signal from the noise
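
As a minimal sketch of the third benefit, the snippet below drops pressure points that never register a signal. The toy frame and its values are invented for illustration, and this simple filter is only one possible approach, not necessarily the selection method used later in this series.

```python
import pandas as pd

# Toy stand-in for the pressure-mat data; the values are made up for illustration
toy = pd.DataFrame({
    "1_1": [0.0, 5.0, 3.0],
    "1_2": [0.0, 0.0, 0.0],  # this pressure point never registers a signal
    "1_3": [2.0, 0.0, 0.0],
})

# Keep only the columns where at least one reading is non-zero
active = toy.loc[:, (toy != 0).any(axis=0)]

# Record which columns were removed
dropped = [col for col in toy.columns if col not in active.columns]

print(list(active.columns))
print(dropped)
```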

### Importing Dependency Packages

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime
from timeit import default_timer as timer
import seaborn as sns

### Exploring and Preparing the Data for Analysis
- Opening a sample of the dataset to observe the structure and volume. <br>
- Concatenating all the csvs into one dataframe. <br>
- Checking for nulls and data quality. <br>
- Writing some new columns into the dataframe to aid further analysis. <br>


In [2]:
#read one example .csv out of the 210 found in the dataset
example_path = "data/01/01_pm_1.csv"

#skip malformed lines rather than raising an error (on_bad_lines requires pandas >= 1.3)
example_data = pd.read_csv(example_path, header=None, on_bad_lines="skip")

example_data





Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,503,504,505,506,507,508,509,510,511,512
0,2018-11-08 11:34:51.468000,20.0,3.0,2.0,0.0,0.0,0.0,72.0,1493.0,1949.0,...,88.0,94.0,68.0,77.0,55.0,193.0,387.0,331.0,125.0,6.0
1,2018-11-08 11:34:51.535000,20.0,3.0,2.0,0.0,0.0,0.0,72.0,1493.0,1949.0,...,104.0,93.0,58.0,78.0,53.0,192.0,388.0,330.0,123.0,6.0
2,2018-11-08 11:34:51.602000,20.0,3.0,2.0,0.0,0.0,0.0,72.0,1493.0,1949.0,...,106.0,93.0,64.0,78.0,53.0,195.0,390.0,330.0,119.0,7.0
3,2018-11-08 11:34:51.669000,20.0,3.0,2.0,0.0,0.0,0.0,72.0,1493.0,1949.0,...,108.0,94.0,66.0,79.0,55.0,196.0,391.0,324.0,106.0,5.0
4,2018-11-08 11:34:51.737000,20.0,3.0,2.0,0.0,0.0,0.0,72.0,1493.0,1949.0,...,125.0,94.0,64.0,79.0,55.0,194.0,391.0,321.0,114.0,6.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
962,2018-11-08 11:35:56.149000,20.0,3.0,2.0,0.0,0.0,0.0,72.0,1493.0,1949.0,...,50.0,72.0,112.0,64.0,64.0,43.0,30.0,17.0,3.0,0.0
963,2018-11-08 11:35:56.216000,20.0,3.0,2.0,0.0,0.0,0.0,72.0,1493.0,1949.0,...,48.0,71.0,107.0,65.0,60.0,41.0,31.0,16.0,0.0,0.0
964,2018-11-08 11:35:56.282000,20.0,3.0,2.0,0.0,0.0,0.0,72.0,1493.0,1949.0,...,50.0,73.0,105.0,70.0,64.0,47.0,32.0,15.0,2.0,0.0
965,2018-11-08 11:35:56.351000,20.0,3.0,2.0,0.0,0.0,0.0,72.0,1493.0,1949.0,...,53.0,73.0,104.0,67.0,62.0,47.0,34.0,14.0,0.0,0.0


The data has been recorded as a time series with 512 dimensions per timestamp, and up to 1000 readings. Each column represents a point in the pressure mat: 32 by 16 pressure points arranged in a rectangular yoga-style mat. <br><br>
It will be useful to create new columns to show where the input came from i.e. the subject number (folder) and the exercise number (file).
<br><br>
Note that subject 22 has no _2 csv. To keep the data consistent across subjects, only the _1 csv for each exercise is imported, and the _2 csvs are skipped for all individuals.
<br><br>
The next step is to design a single dataframe with additional useful columns that can be engineered using existing data, and populate it using all 210 .csv files (7 exercises x 30 subjects). We therefore expect the pre-processed, concatenated dataframe to hold ~100,000,000 data points before applying feature selection methods.
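
The size estimate above can be reproduced with some quick arithmetic. This is a rough sketch: `rows_per_file` is an assumption taken from the "up to 1000 readings" figure, and the 5 extra columns are the engineered ones defined in the next cell.

```python
n_subjects = 30
n_exercises = 7
n_files = n_subjects * n_exercises       # one _1 csv per subject per exercise

rows_per_file = 1000                     # assumption: approximate readings per file
n_columns = 5 + 32 * 16                  # 5 engineered columns + 512 pressure points

estimate = n_files * rows_per_file * n_columns
print(f"{n_files} files x {rows_per_file} rows x {n_columns} columns = {estimate:,} data points")
```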

In [3]:
#choosing headers for desired additional features
colNames = ["Subject", "Exercise", "a_Time", "r_Time", "Total_p"] #a = actual, r = relative, p = pressure

coords = [] #creating a separate list to populate with coordinates

#using for loop to rename the 512 pressure point columns into X and Y co-ordinates to display their position on the mat
for x in range(32):
    for y in range(16):
        nextCol = str(x+1) + "_" + str(y+1)
        colNames.append(nextCol) #appending the renamed coord headers to the additional headers list
        coords.append(nextCol) #populating the coords list with coordinates

#initialising an empty dataframe to visualise the new desired structure of data
pm_df = pd.DataFrame(columns = colNames)

pm_df

Unnamed: 0,Subject,Exercise,a_Time,r_Time,Total_p,1_1,1_2,1_3,1_4,1_5,...,32_7,32_8,32_9,32_10,32_11,32_12,32_13,32_14,32_15,32_16
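
The naming loop above implies a row-major layout: the column named `x_y` sits at flat position `(x-1)*16 + (y-1)`. The sketch below rebuilds the same names with a list comprehension and reshapes one frame of readings into a 32 x 16 grid. The readings are placeholder values, and whether this grid matches the physical orientation of the mat is an assumption.

```python
import numpy as np

# Same names as the nested loop above, built with a list comprehension
coords = [f"{x + 1}_{y + 1}" for x in range(32) for y in range(16)]

# Placeholder readings for a single timestamp (one value per pressure point)
frame = np.arange(512, dtype=float)

# Row-major reshape: mat[x - 1, y - 1] holds the reading for column "x_y"
mat = frame.reshape(32, 16)

# Look up the reading for column "2_1" both ways
flat_value = frame[coords.index("2_1")]
grid_value = mat[1, 0]
print(flat_value, grid_value)
```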


We now have an empty dataframe with 517 features (columns):<br>
Subject --> The individual performing the exercise. This information was originally found in the folder structure.<br>
Exercise --> The exercise number (1-7) performed by the subject. This information was originally in the .csv file name.<br>
a_Time --> The actual time and date these readings were recorded.<br>
r_Time --> The relative time of each reading, i.e. the time elapsed since the first reading imported from the same file.<br>
Total_p --> The total pressure on the mat at that point in time (sum of 1_1 to 32_16).<br>
1_1 to 32_16 --> The pressure reading at each (X,Y) coordinate on the mat. <br><br>
The next step is to populate this dataframe with data from the 210 .csv files.

In [4]:
debug = False

#defining the root directory of the pressure mat data
pmRoot = "data/"

#timing information - start timer
start = timer()

#for subjects 1 to 30
for i in range(1, 31):
    
    #for exercises 1 to 7
    for j in range(1, 8):
        
        #read in a pressure mat data frame
        
        #formatting the numbers to match file system
        if(i<10): folder = "0" + str(i) + "/"
        else: folder = str(i) + "/"
        file = "0" + str(j) + "_pm_1.csv"
        
        #read in the .csv files
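        if debug: print(pmRoot + folder + file)
        #note: without header=None, pandas uses the first reading in each file as the header row,
        #so that reading is not imported as data
        pm_df_in = pd.read_csv(pmRoot + folder + file)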
        
        #calculate the total pressure for each time increment
        pm_temp_totals = pm_df_in.drop(pm_df_in.columns[0], axis = 1).sum(axis = 1)
        
        #reformatting the time string into an actual datetime object
        for k in range(len(pm_df_in)):
            timeString = pm_df_in.iloc[k,0]
            if len(timeString) < 20:
                timeString = timeString + "."
            while len(timeString) < 26:
                timeString = timeString + "0"
            pm_df_in.iat[k,0] = datetime.strptime(timeString, "%Y-%m-%d %H:%M:%S.%f")
        
        #create a measure of absolute time for each data point
        pm_temp_a_Time = pm_df_in.iloc[:,0]
        
        #create a measure of relative time for each data point
        #computed element-wise in a loop: each reading's time minus the first reading's time
        pm_temp_r_Time = [None] * len(pm_temp_a_Time)
        for time in range(len(pm_temp_a_Time)):
            pm_temp_r_Time[time] = pm_temp_a_Time[time] - pm_temp_a_Time[0]
        
        #create a temporary dataframe to hold the data from this file
        pm_temp_df = pd.DataFrame(columns = colNames)
        
        #add in our data
        pm_temp_df["a_Time"] = pm_temp_a_Time
        pm_temp_df["r_Time"] = pm_temp_r_Time
        pm_temp_df["Total_p"] = pm_temp_totals
        
        pm_temp_df[coords] = pm_df_in.iloc[:,1:len(pm_df_in.columns)]
        
        #label the data with the subject and exercise
        pm_temp_df["Subject"] = i
        pm_temp_df["Exercise"] = j
        
        #append to our dataframe
        #DataFrame.append was removed in pandas 2.0, so use pd.concat
        pm_df = pd.concat([pm_df, pm_temp_df])
        
#         break
#     break

#timing information - end timer
end = timer()
dt = end - start
print("Importing time elapsed: {:.2f}".format(dt))

pm_df

Importing time elapsed: 149.30


Unnamed: 0,Subject,Exercise,a_Time,r_Time,Total_p,1_1,1_2,1_3,1_4,1_5,...,32_7,32_8,32_9,32_10,32_11,32_12,32_13,32_14,32_15,32_16
0,1,1,2018-11-08 11:34:51.535000,0 days 00:00:00,27375.0,20.0,3.0,2.0,0.0,0.0,...,104.0,93.0,58.0,78.0,53.0,192.0,388.0,330.0,123.0,6.0
1,1,1,2018-11-08 11:34:51.602000,0 days 00:00:00.067000,27466.0,20.0,3.0,2.0,0.0,0.0,...,106.0,93.0,64.0,78.0,53.0,195.0,390.0,330.0,119.0,7.0
2,1,1,2018-11-08 11:34:51.669000,0 days 00:00:00.134000,27423.0,20.0,3.0,2.0,0.0,0.0,...,108.0,94.0,66.0,79.0,55.0,196.0,391.0,324.0,106.0,5.0
3,1,1,2018-11-08 11:34:51.737000,0 days 00:00:00.202000,27651.0,20.0,3.0,2.0,0.0,0.0,...,125.0,94.0,64.0,79.0,55.0,194.0,391.0,321.0,114.0,6.0
4,1,1,2018-11-08 11:34:51.804000,0 days 00:00:00.269000,27416.0,20.0,3.0,2.0,0.0,0.0,...,129.0,95.0,64.0,79.0,55.0,194.0,381.0,321.0,114.0,6.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
922,30,7,2019-03-26 16:45:51.454000,0 days 00:01:01.976000,578.0,6.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
923,30,7,2019-03-26 16:45:51.522000,0 days 00:01:02.044000,633.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
924,30,7,2019-03-26 16:45:51.589000,0 days 00:01:02.111000,696.0,3.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
925,30,7,2019-03-26 16:45:51.656000,0 days 00:01:02.178000,704.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
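
The import loop above took about 149 seconds, largely because it parses timestamps row by row and grows the dataframe one file at a time. As a hedged alternative, the sketch below does the same steps with vectorised pandas operations. `load_pressure_file` is a hypothetical helper, the in-memory CSV text stands in for the real files, and the pressure columns are left with their default integer names. Note that it reads with `header=None`, so the first reading of each file is kept as data.

```python
import io
import pandas as pd

def load_pressure_file(source, subject, exercise):
    """Hypothetical helper: load one pressure-mat csv into a labelled dataframe."""
    raw = pd.read_csv(source, header=None)

    # Pad timestamps that lack fractional seconds, mirroring the loop above
    stamps = raw.iloc[:, 0].astype(str)
    stamps = stamps.where(stamps.str.contains(".", regex=False), stamps + ".")
    stamps = stamps.str.ljust(26, "0")
    a_time = pd.to_datetime(stamps, format="%Y-%m-%d %H:%M:%S.%f")

    pressure = raw.iloc[:, 1:]
    labelled = pd.DataFrame({
        "Subject": subject,
        "Exercise": exercise,
        "a_Time": a_time,
        "r_Time": a_time - a_time.iloc[0],  # time elapsed since the first reading
        "Total_p": pressure.sum(axis=1),    # total pressure per timestamp
    })
    return pd.concat([labelled, pressure], axis=1)

# Toy csv text with two pressure points per reading (made-up values)
toy_csv = (
    "2018-11-08 11:34:51,1.0,2.0\n"
    "2018-11-08 11:34:51.067000,3.0,4.0\n"
)

# Collect one frame per file in a list, then concatenate once at the end
frames = [
    load_pressure_file(io.StringIO(toy_csv), subject=1, exercise=1),
    load_pressure_file(io.StringIO(toy_csv), subject=1, exercise=2),
]
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Concatenating once at the end avoids copying the growing dataframe on every iteration, which is usually where most of the time goes.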


### Data Quality Checks
Checking for null values, data types, summary statistics, etc.

In [5]:
#checking for null values in the dataframe
pm_df.isnull().values.any()

False

In [6]:
#double check - count how many null values per feature
pm_df.isnull().sum()

Subject     0
Exercise    0
a_Time      0
r_Time      0
Total_p     0
           ..
32_12       0
32_13       0
32_14       0
32_15       0
32_16       0
Length: 517, dtype: int64

In [7]:
#alternative syntax to get a total number of nulls in the entire dataframe
pm_df.isnull().sum().sum()

0

In [8]:
#isna() is an alias of isnull(), so this repeats the per-feature check above
pm_df.isna().sum()

Subject     0
Exercise    0
a_Time      0
r_Time      0
Total_p     0
           ..
32_12       0
32_13       0
32_14       0
32_15       0
32_16       0
Length: 517, dtype: int64

In [9]:
# print information, shape, and data type for the data frame
pm_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 188116 entries, 0 to 926
Columns: 517 entries, Subject to 32_16
dtypes: float64(513), object(3), timedelta64[ns](1)
memory usage: 743.4+ MB


In [10]:
# determine count of unique values for each column in the dataframe
pm_df.nunique()

Subject         30
Exercise         7
a_Time      188028
r_Time       33786
Total_p      34983
             ...  
32_12         1114
32_13         1088
32_14         1409
32_15          967
32_16          772
Length: 517, dtype: int64

The pressure-point columns shown each take roughly 800 to 1,400 distinct values, and Total_p takes almost 35,000, which illustrates how rich this dataset is.

In [11]:
#calculate summary statistics
pm_df.describe()

Unnamed: 0,r_Time,Total_p,1_1,1_2,1_3,1_4,1_5,1_6,1_7,1_8,...,32_7,32_8,32_9,32_10,32_11,32_12,32_13,32_14,32_15,32_16
count,188116,188116.0,188116.0,188116.0,188116.0,188116.0,188116.0,188116.0,188116.0,188116.0,...,188116.0,188116.0,188116.0,188116.0,188116.0,188116.0,188116.0,188116.0,188116.0,188116.0
mean,0 days 00:00:31.623077074,11842.681579,12.377932,13.159715,61.414829,43.587239,106.736886,22.935402,70.000808,37.098514,...,51.002621,66.692684,54.638691,39.68765,46.822211,56.33886,49.970444,64.546812,47.949877,18.94
std,0 days 00:00:19.584585868,9789.128896,28.873021,91.491629,298.730191,284.8407,393.843321,174.000595,311.089049,250.543081,...,190.557298,198.845428,161.162071,136.571267,168.401224,139.256255,135.477124,160.796258,123.80938,70.452058
min,0 days 00:00:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0 days 00:00:14.994000,3324.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0 days 00:00:30.251000,9644.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0 days 00:00:47.256000,18455.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,37.0,20.0,2.0,5.0,38.0,24.0,40.0,25.0,5.0
max,0 days 00:01:57.602000,41866.0,1041.0,3122.0,3008.0,3025.0,3266.0,2309.0,2998.0,2663.0,...,1754.0,1745.0,1386.0,1266.0,1495.0,1754.0,1875.0,2119.0,1869.0,1022.0


It is interesting to see how sparse this data is: for every pressure point shown, the minimum, lower quartile and median are all zero, even though the mean total pressure is nearly 12,000. <br><br>
The hypothesis to explore is whether intelligently reducing the number of features used in a machine learning model results in equal or better performance. Feature selection could achieve this by reducing overfitting on redundant features; reduced computational cost is a serendipitous by-product, too.
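
One way to put a number on the sparsity is to measure the share of zero readings, per pressure point and overall. The sketch below uses a small invented frame; on the real data the same expressions could be applied to `pm_df[coords]`.

```python
import pandas as pd

# Toy stand-in for a few pressure-point columns; the values are made up
toy = pd.DataFrame({
    "1_1": [0.0, 0.0, 0.0, 8.0],
    "1_2": [0.0, 0.0, 0.0, 0.0],
    "1_3": [0.0, 12.0, 0.0, 40.0],
})

# Share of zero readings for each pressure point
zero_share_per_point = (toy == 0).mean()

# Share of zero readings across the whole frame
overall_zero_share = (toy == 0).to_numpy().mean()

print(zero_share_per_point)
print(overall_zero_share)
```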

### Saving the dataframe locally
Building this dataframe was a hefty process, so I'm going to export it to a new .csv for easy access in the future.

In [12]:
# Save data to csv for later processing
pm_df.to_csv('all_pm_data.csv')
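
When the saved file is read back, the index is stored as the first column and the time columns come back as plain text. The sketch below shows one way to restore them, using a small invented frame and an in-memory buffer in place of `all_pm_data.csv`.

```python
import io
import pandas as pd

# Toy frame with the same kinds of columns as pm_df; the values are made up
toy = pd.DataFrame({
    "Subject": [1, 1],
    "a_Time": pd.to_datetime(["2018-11-08 11:34:51.468", "2018-11-08 11:34:51.535"]),
    "Total_p": [27375.0, 27466.0],
})
toy["r_Time"] = toy["a_Time"] - toy["a_Time"].iloc[0]

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the saved index; parse_dates restores the timestamps
reloaded = pd.read_csv(buffer, index_col=0, parse_dates=["a_Time"])

# Timedeltas are written as text, so convert them back explicitly
reloaded["r_Time"] = pd.to_timedelta(reloaded["r_Time"])
print(reloaded)
```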

### Head over to Part 2 for some EDA.