# This notebook prepares the data for NN models
In this notebook we:
- Remove some of the readings at the beginning and the end of each activity. This eliminates the noise that appears while the participant is changing activity.

- Select the acceleration columns, and reshape them.

- Select the participants' demographic data such as age, weight, etc. Then reshape them to be aligned with acceleration data.

- Extract the labels with the same method.

**Note** that the arrays are built with NumPy, the labels are not one-hot encoded, and the data is not shuffled.

You can change the parameters listed in the variable dictionary to adjust this process.




#### Data frame dictionary:
- *df* = the raw csv dataframe
- *df_clean* = final cleaned and processed dataframe
- *df_level* = df for each activity level
- *df_level_clean* = clean version of df_level
- *df_temp* = a helper dataframe to store temporary data for each participant and each level

#### NumPy array dictionary:
- accel_array, contains the x, y, and z acceleration and has a shape of (number_of_sequences, lenght_of_each_seq, number_of_axis), where number_of_axis is 3
- meta_array, has the demographic data and its shape is (number_of_sequences, len(meta_column_list))
- label_array, contains the label of each sequence and has a shape of (number_of_sequences, 1)

#### variable dictionary:
- n_ignore: number of readings to ignore at the beginning and the end of each activity. If set to 600, this ignores 20 seconds, as the frequency is 30 Hz
- window_size_second: length, in seconds, of the window used for sequencing. Each window is one sequence
- frequency: sampling frequency of the accelerometer, in Hz
- lenght_of_each_seq: number of readings in one sequence, i.e. window_size_second * frequency
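
As a quick check of the arithmetic behind these parameters, the sketch below recomputes the ignored duration and the sequence length from the values listed above. The name `ignored_seconds` is introduced here only for illustration.

```python
# Values taken from the variable dictionary above.
n_ignore = 600            # readings dropped at each end of an activity
window_size_second = 3    # window length in seconds
frequency = 30            # accelerometer sampling rate in Hz

# Seconds of data discarded at each end of an activity
# (ignored_seconds is a hypothetical helper name, not used in the notebook).
ignored_seconds = n_ignore / frequency

# Number of readings in one sequence (identifier spelling follows the notebook).
lenght_of_each_seq = window_size_second * frequency

print(ignored_seconds)      # 20.0
print(lenght_of_each_seq)   # 90
```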


In [1]:
import pandas as pd
import numpy as np


n_ignore = 600 # ignores 20 sec with a frequency of 30 Hz
window_size_second = 3
frequency = 30
lenght_of_each_seq = window_size_second * frequency

In [2]:
input_dir =  'Z:/Research/dfuller/Walkabilly/studies/smarphone_accel/data/Ethica_Jaeger_Merged/pocket/'
input_file_name = 'pocket_with_couns_and_vec_meg_30Hz.csv'


In [3]:
df = pd.read_csv(input_dir + input_file_name)

#### Remove the noise 

Ignore n_ignore readings from the beginning and the end of each activity for each participant.
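
The trimming idea can be sketched on a toy data frame. This is only an illustration under assumptions: the helper `trim_ends` and the toy values are hypothetical, and the notebook's own cells use explicit loops over levels and participants rather than `groupby`.

```python
import pandas as pd


def trim_ends(group, n_ignore):
    """Drop the first and last n_ignore rows of one participant/activity block."""
    return group.iloc[n_ignore:len(group) - n_ignore]


# Toy data: two participants, six readings each, one activity.
toy = pd.DataFrame({
    "participant_id": [1] * 6 + [2] * 6,
    "trimmed_activity": ["Lying"] * 12,
    "x_axis": list(range(12)),
})

# Trim every participant/activity block separately, then glue the blocks back together.
blocks = [
    trim_ends(group, n_ignore=2)
    for _, group in toy.groupby(["participant_id", "trimmed_activity"], sort=False)
]
trimmed = pd.concat(blocks)

print(trimmed["x_axis"].tolist())  # [2, 3, 8, 9]
```

A block with fewer than `2 * n_ignore` rows simply becomes empty, which matches what the `iloc` slice does in the loop version.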

In [4]:
participant_list = list(df.participant_id.unique())

In [11]:
# Exclude participants 112, 121, 122, and 132, as they are not properly classified.
misclass_participants = [112,121,122,132]
# participant_list.remove()
participant_list = [elem for elem in participant_list if elem not in misclass_participants]
participant_list

[108,
 111,
 113,
 114,
 115,
 116,
 117,
 118,
 119,
 120,
 124,
 125,
 126,
 123,
 127,
 129,
 130,
 131,
 128,
 133,
 136,
 137,
 138,
 159,
 142,
 143,
 109,
 144,
 145,
 146,
 147,
 149,
 152,
 150,
 151,
 153,
 139,
 134,
 141,
 155,
 140,
 154,
 156,
 157]

In [12]:
df.head(2)

Unnamed: 0,record_time,x_axis,y_axis,z_axis,participant_id,wear_location,V.O2,VO2.kg,V.CO2,V.E,trimmed_activity,height,weight,age,gender,x,y,z,counts_vec_mag
0,2019-01-07T10:30:01Z,0.266111,-0.069464,0.96688,108,pock,,,,,Lying,164.0,68.0,30,Female,13.0,0.0,73.0,74.1485
1,2019-01-07T10:30:01.033300Z,0.266401,-0.069351,0.963832,108,pock,,,,,Lying,164.0,68.0,30,Female,,,,74.1485


In [13]:
# select the important columns: x, y, z, height, weight, age, and gender. participant_id is kept for cleaning and can be removed later
important_columns = ['x_axis','y_axis','z_axis','participant_id','trimmed_activity','height','weight','age','gender']
df = df[important_columns].copy()

In [14]:
# encode gender as a dummy variable (Female = 0, Male = 1)
# .loc assigns on df itself; the chained form df.gender[mask] = ... triggers a SettingWithCopyWarning
df.loc[df['gender'] == 'Female', 'gender'] = 0
df.loc[df['gender'] == 'Male', 'gender'] = 1


In [15]:
df.head(3)

Unnamed: 0,x_axis,y_axis,z_axis,participant_id,trimmed_activity,height,weight,age,gender
0,0.266111,-0.069464,0.96688,108,Lying,164.0,68.0,30,0
1,0.266401,-0.069351,0.963832,108,Lying,164.0,68.0,30,0
2,0.262441,-0.067394,0.943335,108,Lying,164.0,68.0,30,0


In [16]:
# repeat for all PE levels

# get the levels to loop through
PE_levels = df.trimmed_activity.unique()

# create an empty df
df_clean = pd.DataFrame(columns = important_columns)

for level in PE_levels:
    print("working on {} level".format(level))
    df_level = df[df['trimmed_activity'] == level]
    df_level_clean = pd.DataFrame(columns = important_columns)


    for partip in participant_list:
        df_temp = df_level[df_level['participant_id'] == partip]
        df_temp_nrow = df_temp.shape[0]
        # ignore the noisy data in the beginning and the end
        df_temp = df_temp.iloc[n_ignore:df_temp_nrow-n_ignore,]
        
        # make the number of rows divisible by the sequence length
        number_of_sequences = df_temp.shape[0] // lenght_of_each_seq
        n_row= number_of_sequences * lenght_of_each_seq
        df_temp = df_temp.iloc[:n_row,]
    
        df_level_clean = pd.concat([df_level_clean, df_temp])
        print("working on {} participant".format(partip))


    df_clean = pd.concat([df_clean, df_level_clean])

  
# df_clean = df_clean.drop('participant_id', axis = 1)
    

working on Lying level
working on 108 participant
working on 111 participant
working on 113 participant
working on 114 participant
working on 115 participant
working on 116 participant
working on 117 participant
working on 118 participant
working on 119 participant
working on 120 participant
working on 124 participant
working on 125 participant
working on 126 participant
working on 123 participant
working on 127 participant
working on 129 participant
working on 130 participant
working on 131 participant
working on 128 participant
working on 133 participant
working on 136 participant
working on 137 participant
working on 138 participant
working on 159 participant
working on 142 participant
working on 143 participant
working on 109 participant
working on 144 participant
working on 145 participant
working on 146 participant
working on 147 participant
working on 149 participant
working on 152 participant
working on 150 participant
working on 151 participant
working on 153 participant
worki

### We need to create something like a 1-D image so we can feed it to a CNN.
In image processing, an image has two spatial dimensions and three channels. So for a 264*264 pixel image, the shape is:
264, 264, 3

In our case, a 3-second window holds 90 readings (at 30 Hz), which play the role of the pixels in an image. The three axes, x, y, and z, play the role of the channels, so the input shape is (90, 3).

Now if we have n inputs (n sequences, or n images), the input shape is (n, 90, 3).
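
To make the analogy concrete, the sketch below builds two zero-filled batches and prints their shapes. The batch size `n = 5` is an arbitrary value chosen for illustration; the other numbers come from the text above.

```python
import numpy as np

n = 5                      # hypothetical number of inputs
lenght_of_each_seq = 90    # 3 seconds * 30 Hz
n_axis = 3                 # x, y, z

# A batch of colour images is 4-D: (n, height, width, channels).
image_batch = np.zeros((n, 264, 264, 3))

# A batch of accelerometer windows is 3-D: (n, readings, axes).
accel_batch = np.zeros((n, lenght_of_each_seq, n_axis))

print(image_batch.shape)     # (5, 264, 264, 3)
print(accel_batch.shape)     # (5, 90, 3)
print(accel_batch[0].shape)  # (90, 3)
```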


## Data generator

### Create sequence of acceleration data

For each axis we:

- Get the axis and put it in a NumPy array.
- Reshape it to (n, 1), where n is the total number of acceleration readings being reshaped (in the code, all rows of df_clean).

Then we:

- Stack all the axes horizontally, i.e. bind the columns.
- Reshape to (number_of_sequences, lenght_of_each_seq, number_of_axis), where number_of_axis is 3.

Note that n = number_of_sequences * lenght_of_each_seq.
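
The steps above can be sketched on a small example. The toy sizes (4 readings per sequence, 3 sequences) and the synthetic axis values are assumptions chosen so the result is easy to inspect; the notebook itself uses 90 readings per sequence.

```python
import numpy as np

lenght_of_each_seq = 4     # toy value; the notebook uses 90
number_of_sequences = 3
n = number_of_sequences * lenght_of_each_seq

# Synthetic axes: offsets of 100 and 200 make the columns easy to tell apart.
x = np.arange(n, dtype=float)
y = x + 100
z = x + 200

# Reshape each axis to (n, 1) and bind the columns: result is (n, 3).
stacked = np.hstack([axis.reshape(n, 1) for axis in (x, y, z)])

# Cut the rows into consecutive windows: (number_of_sequences, lenght_of_each_seq, 3).
accel = stacked.reshape(number_of_sequences, lenght_of_each_seq, 3)

print(stacked.shape)  # (12, 3)
print(accel.shape)    # (3, 4, 3)

# The first reading of the second sequence is row 4 of the stacked array.
assert np.array_equal(accel[1, 0], stacked[lenght_of_each_seq])
```

Because NumPy reshapes in row-major order by default, sequence `i` holds rows `i * lenght_of_each_seq` up to `(i + 1) * lenght_of_each_seq - 1` of the stacked array, so the time order inside each window is preserved.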

### Create meta data and labels


How to create meta data:
- Follow the previous cell, but there is no need to stack anything. We process age, gender, labels, etc. separately.
- The shape is **(number_of_sequences, lenght_of_each_seq)**.
- Use NumPy max (or min) over each sequence to reduce the matrix to one value per sequence.
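
A minimal sketch of this reduction, assuming each window belongs to a single participant so that the value is constant inside it. It uses `max(axis=1)` on the reshaped column, which is a vectorized alternative to looping over the sequences; the toy ages are made up for illustration.

```python
import numpy as np

lenght_of_each_seq = 4     # toy value; the notebook uses 90
number_of_sequences = 2

# One demographic column, repeated for every reading (hypothetical ages).
age = np.array([30, 30, 30, 30, 41, 41, 41, 41], dtype=float)

# One row per sequence: (number_of_sequences, lenght_of_each_seq).
windows = age.reshape(number_of_sequences, lenght_of_each_seq)

# The value is constant inside a window, so max and min agree.
assert (windows.max(axis=1) == windows.min(axis=1)).all()

# Reduce each window to one value and keep a column shape: (number_of_sequences, 1).
age_per_sequence = windows.max(axis=1).reshape(number_of_sequences, 1)

print(age_per_sequence.shape)  # (2, 1)
```

If max and min ever disagreed, a window would be spanning two participants, which would signal a problem in the cleaning step.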

In [17]:
df_clean.head()

Unnamed: 0,x_axis,y_axis,z_axis,participant_id,trimmed_activity,height,weight,age,gender
600,0.270312,-0.066408,0.961516,108,Lying,164.0,68.0,30,0
601,0.269653,-0.066482,0.959821,108,Lying,164.0,68.0,30,0
602,0.263954,-0.065834,0.942104,108,Lying,164.0,68.0,30,0
603,0.262995,-0.066327,0.941666,108,Lying,164.0,68.0,30,0
604,0.264132,-0.066887,0.947164,108,Lying,164.0,68.0,30,0


#### Acceleration sequence data generator

In [18]:
# sequence generator
# output size (number_of_sequences, lenght_of_each_seq, number_of_axis i.e. 3)

n_row = df_clean.shape[0]
number_of_sequences = int(n_row / lenght_of_each_seq)
print("We will have ", number_of_sequences ," sequences to available.")
accel_array = np.empty((n_row,0))


# repeat for all axes
axes_list = ['x_axis','y_axis','z_axis']
for axis in axes_list:

    # filter based on axis
    working_array = df_clean[axis]

    working_array = np.array(working_array).reshape(n_row,1)
    accel_array = np.hstack((accel_array, working_array))
    
    print(accel_array.shape)
n_axis = len(axes_list)
accel_array = accel_array.reshape((number_of_sequences, lenght_of_each_seq, n_axis))
print(accel_array.shape)



We will have  64754  sequences to available.
(5827860, 1)
(5827860, 2)
(5827860, 3)
(64754, 90, 3)


#### Meta data generator


In [19]:
# has the same logic as the acceleration sequence generator
# for each column, the output size is (number_of_sequences, 1)
# for all of them, the output in meta_array has size (number_of_sequences, len(meta_column_list))



n_row = df_clean.shape[0]
number_of_sequences = int(n_row / lenght_of_each_seq)
print("We will have ", number_of_sequences ," sequences to available.")
# repeat for all meta data columns 
meta_column_list = ['height','weight','age','gender']


compressed_array = np.empty((number_of_sequences,1))
meta_array = np.empty((number_of_sequences, 0))

for meta in meta_column_list:

    # filter based on meta data column
    working_array = df_clean[meta]

    working_array = np.array(working_array).reshape(number_of_sequences, lenght_of_each_seq, 1)
    for i in range(number_of_sequences):
        compressed_array[i] = working_array[i,].max()
    
    meta_array = np.hstack((meta_array, compressed_array))
    print(compressed_array.shape, "    " , working_array.shape)
print(meta_array.shape)



We will have  64754  sequences to available.
(64754, 1)      (64754, 90, 1)
(64754, 1)      (64754, 90, 1)
(64754, 1)      (64754, 90, 1)
(64754, 1)      (64754, 90, 1)
(64754, 4)


In [20]:
# repeat for trimmed_activity, which holds the labels
# has the same logic as the acceleration sequence generator
# for labels, the output size is (number_of_sequences, 1)

n_row = df_clean.shape[0]
number_of_sequences = int(n_row / lenght_of_each_seq)
print("We will have ", number_of_sequences ," sequences to available.")
label_column_list = ['trimmed_activity']


label_array = np.empty((number_of_sequences,1), dtype=object)  # object dtype holds the label strings

# as we only have one outcome (label), we don't need another array to store all of them
# We could do it without the for loop as well
for label in label_column_list:
    # filter based on the column
    working_array = df_clean[label]
    working_array = np.array(working_array).reshape(number_of_sequences, lenght_of_each_seq, 1)
    print(working_array.shape)
    for i in range(number_of_sequences):
        label_array[i] = working_array[i,0] 
print(label_array.shape)
print(label_array)

We will have  64754  sequences to available.
(64754, 90, 1)
(64754, 1)
[['Lying']
 ['Lying']
 ['Lying']
 ...
 ['Running 7 METs']
 ['Running 7 METs']
 ['Running 7 METs']]


### Check to see if the data is intact

We ignored n_ignore readings at the beginning of each activity. Therefore, the first processed reading is the (n_ignore + 1)-th reading of the raw data frame, i.e. the row with index n_ignore.
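
The same check can be written as assertions on a toy array. This is a sketch under assumptions: the sizes and the synthetic `raw` array are made up, and it covers a single activity block, so it mirrors only the first participant and activity of the real data.

```python
import numpy as np

n_ignore = 2               # toy value; the notebook uses 600
lenght_of_each_seq = 3     # toy value; the notebook uses 90

# Ten synthetic readings with three axes each.
raw = np.arange(30, dtype=float).reshape(10, 3)

# Trim both ends, then keep only whole sequences.
trimmed = raw[n_ignore:len(raw) - n_ignore]
n_seq = len(trimmed) // lenght_of_each_seq
accel = trimmed[: n_seq * lenght_of_each_seq].reshape(n_seq, lenght_of_each_seq, 3)

# The first processed reading is the raw row with index n_ignore,
# and the second one is the raw row with index n_ignore + 1.
assert np.array_equal(accel[0, 0], raw[n_ignore])
assert np.array_equal(accel[0, 1], raw[n_ignore + 1])
print(accel.shape)  # (2, 3, 3)
```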

In [21]:
print("we start for ",n_ignore," when indexing from the raw dataframe")
print(accel_array[0,1])
print(meta_array[0])
print(label_array[0])
print(df.iloc[n_ignore:n_ignore+3,])


we start for  600  when indexing from the raw dataframe
[ 0.26965277 -0.06648238  0.95982104]
[164.  68.  30.   0.]
['Lying']
       x_axis    y_axis    z_axis  participant_id trimmed_activity  height  \
600  0.270312 -0.066408  0.961516             108            Lying   164.0   
601  0.269653 -0.066482  0.959821             108            Lying   164.0   
602  0.263954 -0.065834  0.942104             108            Lying   164.0   

     weight  age gender  
600    68.0   30      0  
601    68.0   30      0  
602    68.0   30      0  


In [22]:
# store the results as numpy objects
location = 'pocket'
output_file_name = input_dir +location + '-NN-data'
np.savez_compressed(output_file_name,
                    acceleration_data=accel_array,
                    metadata=meta_array,
                    labels=label_array
                   )

In [23]:
output_file_name

'Z:/Research/dfuller/Walkabilly/studies/smarphone_accel/data/Ethica_Jaeger_Merged/pocket/pocket-NN-data'
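
Since the labels are stored as strings and the data is not shuffled, a downstream notebook would probably load the archive, one-hot encode the labels, and shuffle. The sketch below shows one possible way to do that with NumPy only. It runs on small stand-in arrays written to a temporary directory, so the file name `toy-NN-data` and the array contents are assumptions, not the real data. Two details are worth knowing: `np.savez_compressed` appends `.npz` to the file name, and because `label_array` has object dtype, `np.load` needs `allow_pickle=True` to read it.

```python
import os
import tempfile

import numpy as np

# Toy stand-ins for the arrays saved above (shapes follow the notebook, values are made up).
accel_array = np.zeros((4, 90, 3))
meta_array = np.zeros((4, 4))
label_array = np.array(
    [["Lying"], ["Lying"], ["Running 7 METs"], ["Running 7 METs"]], dtype=object
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy-NN-data")  # hypothetical file name
    np.savez_compressed(
        path,
        acceleration_data=accel_array,
        metadata=meta_array,
        labels=label_array,
    )
    # The archive is written as path + ".npz"; object arrays require allow_pickle=True.
    with np.load(path + ".npz", allow_pickle=True) as data:
        accel = data["acceleration_data"]
        meta = data["metadata"]
        labels = data["labels"]

# One-hot encode: map each label string to an integer index, then pick rows of an identity matrix.
classes, class_index = np.unique(labels.ravel().astype(str), return_inverse=True)
one_hot = np.eye(len(classes))[class_index]

# Shuffle all arrays with the same permutation so rows stay aligned.
rng = np.random.default_rng(0)
order = rng.permutation(len(accel))
accel, meta, one_hot = accel[order], meta[order], one_hot[order]

print(classes)        # the sorted unique label names
print(one_hot.shape)  # (4, 2)
```

Using a single permutation for all three arrays keeps every acceleration window paired with its own demographic row and label.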