# This notebook prepares the data for NN models.
#### Clean the start and the end of each activity session
#### Select important columns and one-hot encode
#### Break down to shorter time periods
#### Encode output and shuffle the data


Data frame vocabulary:
- *df* = the raw csv dataframe
- *df_clean* = final cleaned and processed dataframe
- *df_level* = df for each activity level
- *df_level_clean* = clean version of df_level
- *df_temp* = a helper dataframe to store temporary data for each participant and each level
    

In [289]:
import pandas as pd
import numpy as np


n_ignore = 600 # ignores 20 sec with a frequency of 30 Hz
window_size_second = 3
frequency = 30
lenght_of_each_seq = window_size_second * frequency

In [2]:
input_dir =  'Z:/Research/dfuller/Walkabilly/studies/smarphone_accel/data/Ethica_Jaeger_Merged/pocket/'
input_file_name = 'pocket_with_couns_and_vec_meg_30Hz.csv'


In [3]:
df = pd.read_csv(input_dir + input_file_name)

Break it down into several CSV files: Lying, Sitting, Walking, Running3, Running5, Running7.
Keep gender, weight, and height.


In [46]:
participant_list = list(df.participant_id.unique())

In [47]:
df.head(2)

Unnamed: 0,x_axis,y_axis,z_axis,participant_id,trimmed_activity,height,weight,age,gender
0,0.266111,-0.069464,0.96688,108,Lying,164.0,68.0,30,0
1,0.266401,-0.069351,0.963832,108,Lying,164.0,68.0,30,0


In [48]:
# select important columns: x, y, z, activity, height, weight, age, gender; keep participant_id for cleaning and remove it later
important_columns = ['x_axis','y_axis','z_axis','participant_id','trimmed_activity','height','weight','age','gender']
df = df[important_columns].copy()

In [49]:
# change gender to dummy; .loc avoids chained assignment and its SettingWithCopyWarning
df.loc[df['gender'] == 'Female', 'gender'] = 0
df.loc[df['gender'] == 'Male', 'gender'] = 1


In [50]:
df.head(3)

Unnamed: 0,x_axis,y_axis,z_axis,participant_id,trimmed_activity,height,weight,age,gender
0,0.266111,-0.069464,0.96688,108,Lying,164.0,68.0,30,0
1,0.266401,-0.069351,0.963832,108,Lying,164.0,68.0,30,0
2,0.262441,-0.067394,0.943335,108,Lying,164.0,68.0,30,0


In [51]:
#df_L = df[df['trimmed_activity'] == 'Lying']

##### Drop the first and last 20 seconds of each session to avoid noise
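
A minimal sketch of this trimming step, assuming the column names shown in the `df.head()` output. The function `trim_sessions` and the toy frame are hypothetical and only illustrate the idea. It is a vectorized alternative to looping over levels and participants, and it keeps the original row order instead of regrouping by level.

```python
import pandas as pd

def trim_sessions(df, n_ignore):
    """Drop the first and last n_ignore rows of every (activity, participant) session."""
    keys = ['trimmed_activity', 'participant_id']
    grouped = df.groupby(keys, sort=False)
    from_start = grouped.cumcount()               # 0, 1, 2, ... within each session
    from_end = grouped.cumcount(ascending=False)  # ..., 2, 1, 0 within each session
    # sessions shorter than 2 * n_ignore rows are dropped entirely
    return df[(from_start >= n_ignore) & (from_end >= n_ignore)]

# toy data: two participants with five readings each (hypothetical values)
toy = pd.DataFrame({
    'x_axis': range(10),
    'participant_id': [1] * 5 + [2] * 5,
    'trimmed_activity': ['Lying'] * 10,
})
trimmed = trim_sessions(toy, n_ignore=1)
print(list(trimmed['x_axis']))  # [1, 2, 3, 6, 7, 8]
```

With the real data, `n_ignore` would be 600 (20 seconds at 30 Hz).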

In [52]:
# n_ignore = 600 # ignores 20 sec with a frequency of 30 Hz

# for partip in participant_list:
#     df_temp = df_L[df_L['participant_id'] == partip]
#     df_temp_nrow = df_temp.shape[0]
#     df_temp = df_temp[n_ignore:df_temp_nrow-n_ignore]
#     df_sub_clean = pd.concat(df_sub_clean, df_temp)

# # the first row of sub is zeros    
# df_L = df_sub_clean[1:,]
# df_L = df_L.drop('participant_id', axis = 1)

    

In [53]:
# list(df.columns)

In [275]:
# repeat for all PE levels

# get levels to loop through
PE_levels = df.trimmed_activity.unique()

# create empty df
df_clean = pd.DataFrame(columns = important_columns)

for level in PE_levels:
    print("working on {} level".format(level))
    df_level = df[df['trimmed_activity'] == level]
    df_level_clean = pd.DataFrame(columns = important_columns)


    

    for partip in participant_list:
        df_temp = df_level[df_level['participant_id'] == partip]
        df_temp_nrow = df_temp.shape[0]
        df_temp = df_temp.iloc[n_ignore:df_temp_nrow-n_ignore,]
        df_level_clean = pd.concat([df_level_clean, df_temp])
        print("working on {} participant".format(partip))


    df_clean = pd.concat([df_clean, df_level_clean])

  
# df_clean = df_clean.drop('participant_id', axis = 1)
    

working on Lying level
working on 108 participant
working on 111 participant
working on 112 participant
working on 113 participant
working on 114 participant
working on 115 participant
working on 116 participant
working on 117 participant
working on 118 participant
working on 119 participant
working on 120 participant
working on 121 participant
working on 122 participant
working on 124 participant
working on 125 participant
working on 126 participant
working on 123 participant
working on 127 participant
working on 129 participant
working on 130 participant
working on 131 participant
working on 132 participant
working on 128 participant
working on 133 participant
working on 136 participant
working on 137 participant
working on 138 participant
working on 159 participant
working on 142 participant
working on 143 participant
working on 109 participant
working on 144 participant
working on 145 participant
working on 146 participant
working on 147 participant
working on 149 participant
worki

### We need to create something like a 1-D image so we can feed it to a CNN.
In image processing, an image has two spatial dimensions and three channels. So for a 264*264 pixel image, the shape is:
264, 264, 3

In our case, a 3-second window holds 90 readings (30 Hz), which play the role of the pixels in an image. The three axes, x, y, and z, play the role of the channels, so the input shape is (90, 3).

With n inputs (n sequences, or n images), the input shape is (n, 90, 3).
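
As a quick illustration of these shapes, here is a small sketch that builds an empty batch. The value `n = 5` is an arbitrary, hypothetical number of windows. The other names come from the setup cell.

```python
import numpy as np

window_size_second = 3
frequency = 30
lenght_of_each_seq = window_size_second * frequency  # 90 readings per window

n = 5  # hypothetical number of windows
batch = np.zeros((n, lenght_of_each_seq, 3))  # (windows, readings, axes)

print(batch.shape)     # (5, 90, 3)
print(batch[0].shape)  # (90, 3), one window: 90 readings of x, y, z
```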


## Data generator

### Create sequence of acceleration data

For each activity and participant:

- Get each axis and put it in a numpy array.
- Reshape it to (n, 1), where n is the total length of acceleration data for that activity and person.
- Stack all the axes horizontally, i.e. bind the columns.
- Reshape to (number_of_sequences, lenght_of_each_seq, number_of_axis), where number_of_axis is 3.

Note that n = number_of_sequences * lenght_of_each_seq, so the data is first truncated to a multiple of lenght_of_each_seq.
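
The steps above can be sketched as one helper function. The name `make_sequences` and the toy arrays are hypothetical. The sketch assumes the three axes have the same length.

```python
import numpy as np

def make_sequences(x, y, z, lenght_of_each_seq):
    """Turn three equal-length 1-D axes into (number_of_sequences, lenght_of_each_seq, 3)."""
    # one (n, 1) column per axis
    columns = [np.asarray(a, dtype=float).reshape(-1, 1) for a in (x, y, z)]
    stacked = np.hstack(columns)  # bind columns: shape (n, 3)
    # truncate so the row count is a multiple of the sequence length
    number_of_sequences = stacked.shape[0] // lenght_of_each_seq
    n_row = number_of_sequences * lenght_of_each_seq
    return stacked[:n_row].reshape(number_of_sequences, lenght_of_each_seq, 3)

# toy data: 9 readings per axis, sequences of 4 readings (the last reading is dropped)
x = np.arange(0, 9)
y = np.arange(100, 109)
z = np.arange(200, 209)
seqs = make_sequences(x, y, z, lenght_of_each_seq=4)
print(seqs.shape)     # (2, 4, 3)
print(seqs[1, :, 0])  # [4. 5. 6. 7.]
```

Because the reshape is row-major, each sequence holds consecutive readings, and the last index selects the axis.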

### Create meta data and labels


How to create meta data:
- follow the previous cell, but there is no need to stack anything. We will process age, gender, labels, etc. separately.
- the dim is **number_of_sequences, lenght_of_each_seq**
- use numpy max (or min) along the sequence axis to reduce the matrix to an array of length number_of_sequences
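
A minimal sketch of this reduction, assuming the column is numeric. String labels such as the activity names would need integer codes before `max` can be applied. The name `make_meta` and the toy values are hypothetical.

```python
import numpy as np

def make_meta(values, lenght_of_each_seq):
    """Reduce a per-reading column to one value per sequence."""
    values = np.asarray(values)
    number_of_sequences = values.shape[0] // lenght_of_each_seq
    n_row = number_of_sequences * lenght_of_each_seq
    # shape (number_of_sequences, lenght_of_each_seq)
    windows = values[:n_row].reshape(number_of_sequences, lenght_of_each_seq)
    # the value is constant inside a window, so max (or min) just picks it
    return windows.max(axis=1)

age = np.full(9, 30)  # toy column: nine readings from a 30-year-old participant
print(make_meta(age, lenght_of_each_seq=4))  # [30 30]
```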

In [276]:
df_clean.head()

Unnamed: 0,x_axis,y_axis,z_axis,participant_id,trimmed_activity,height,weight,age,gender
600,0.270312,-0.066408,0.961516,108,Lying,164.0,68.0,30,0
601,0.269653,-0.066482,0.959821,108,Lying,164.0,68.0,30,0
602,0.263954,-0.065834,0.942104,108,Lying,164.0,68.0,30,0
603,0.262995,-0.066327,0.941666,108,Lying,164.0,68.0,30,0
604,0.264132,-0.066887,0.947164,108,Lying,164.0,68.0,30,0


In [297]:
# sequence generator
# output size (number_of_sequences, lenght_of_each_seq, number_of_axis i.e. 3)
# for level in PE_levels:
#     print("working on {} level".format(level)) 
#     for partip in participant_list:
#         df_temp = df_level[df_clean['participant_id'] == partip]
#         df_temp_nrow = df_temp.shape[0]
#         df_temp = df_temp.iloc[n_ignore:df_temp_nrow-n_ignore,]
#         df_level_clean = pd.concat([df_level_clean, df_temp])
#         print("working on {} participant".format(partip))

# repeat for all axes
axes_list = ['x_axis','y_axis','z_axis']
axis_columns = []  # one (n_row, 1) column per axis

# filter once on activity and participant id
mask = (df_clean.trimmed_activity == 'Lying') & (df_clean.participant_id == 120)

for axis in axes_list:
    # to numpy (df_clean may hold object dtype after the concat, so cast to float)
    working_array = df_clean.loc[mask, axis].to_numpy(dtype=float)
    # prune the size so it is divisible by lenght_of_each_seq
    number_of_sequences = working_array.shape[0] // lenght_of_each_seq
    n_row = number_of_sequences * lenght_of_each_seq
    print(n_row, number_of_sequences, lenght_of_each_seq)
    axis_columns.append(working_array[:n_row].reshape(n_row, 1))

# bind the columns, then split the rows into sequences
seq_array = np.hstack(axis_columns).reshape(number_of_sequences, lenght_of_each_seq, len(axes_list))





28260 314 90
28260 314 90
28260 314 90


In [266]:
# testing reshaping = Worked
x = np.arange(8).reshape(8,1)
# print(x)
y = np.arange(8,16).reshape(8,1)
z = np.arange(16,24).reshape(8,1)
a = np.hstack((x,y,z))
a = a.reshape((2,4,3))
a[0,:,:].shape

(4, 3)

In [268]:

a[:,:,1]

array([[ 8,  9, 10, 11],
       [12, 13, 14, 15]])

In [None]:
def breakDown(working_df, window_size=3):
    # truncate df rows so the row count is a multiple of window_size
    n_row = working_df.shape[0] // window_size * window_size
    working_df = working_df.iloc[0:n_row]
    # helper column that cycles 0..window_size-1, used for grouping
    helper_col_df = pd.DataFrame(data = list(range(0,window_size,1)) * (n_row//window_size), columns=['counter'])
    working_df = working_df.reset_index()
    working_df = pd.concat((working_df,helper_col_df), axis=1)
    return working_df


    

In [191]:
window_size = 3
working_df = df_temp.iloc[0:16,].copy() # replace with the real data inside the loops

# truncate df rows to add the helper column
n_row = working_df.shape[0] // window_size * window_size
working_df= working_df.iloc[0:n_row]
# help grouping
helper_col_df = pd.DataFrame(data = list(range(0,window_size,1)) * (n_row//window_size), columns=['counter'])
print(helper_col_df.shape)
print(working_df.shape)
working_df = working_df.reset_index()
working_df = pd.concat((working_df,helper_col_df), axis=1)
print(working_df.head(7))
 # here we need to groupby the counter

(15, 1)
(15, 9)
     index    x_axis    y_axis    z_axis  participant_id trimmed_activity  \
0  6669936  0.423716  0.109737 -0.028063             157   Running 7 METs   
1  6669937  0.676451 -0.316380 -0.249883             157   Running 7 METs   
2  6669938  0.251704 -0.119815  0.003971             157   Running 7 METs   
3  6669939  0.455297 -0.550348  0.078684             157   Running 7 METs   
4  6669940 -0.173043  0.076750  0.257825             157   Running 7 METs   
5  6669941 -0.441424  0.122000  0.213023             157   Running 7 METs   
6  6669942 -0.479676  2.545022 -0.615723             157   Running 7 METs   

   height  weight  age gender  counter  
0   159.0    55.0   29      0        0  
1   159.0    55.0   29      0        1  
2   159.0    55.0   29      0        2  
3   159.0    55.0   29      0        0  
4   159.0    55.0   29      0        1  
5   159.0    55.0   29      0        2  
6   159.0    55.0   29      0        0  


In [220]:
# working_df = working_df.drop(columns='counter') 
working_df.head()
result = working_df.groupby(by='counter').apply((lambda x: melting(x)))
result = result.to_numpy()
result[1].shape

(5,)

In [208]:
def melting(df):
    x_axis_df = df.melt(value_vars='x_axis')
    return x_axis_df.value.to_numpy().transpose()


In [178]:
x_axis_df = working_df.melt(value_vars='x_axis')
y_axis_df = working_df.melt(value_vars='y_axis')
z_axis_df = working_df.melt(value_vars='z_axis')
# working_df.head(2)
df_temp.head(2)
working_df.head(2)
x_axis_np = x_axis_df.value.to_numpy()
y_axis_np = y_axis_df.value.to_numpy()
z_axis_np = z_axis_df.value.to_numpy()
total_np = np.array([x_axis_np, y_axis_np, z_axis_np])
total_np.shape

(3, 15)

In [None]:
# break into 150(use a variable) readings chunks by adding a helper column
# spread based on the helper column
# bind all the files of different levels
# export as the final output

In [134]:
helper_col_df = pd.DataFrame(data = list(range(0,window_size,1)) * (n_row//window_size), columns=['counter'])
helper_col_df


Unnamed: 0,counter
0,0
1,1
2,2
3,0
4,1
5,2
6,0
7,1
8,2
9,0


In [None]:
# before training
# shuffle
# separate into x and y (y is the levels)
# one-hot encode y
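
The steps listed in this cell could look like the sketch below. It uses plain NumPy and is only one possible approach. The function `shuffle_and_encode`, the `seed` parameter, and the toy inputs are hypothetical.

```python
import numpy as np

def shuffle_and_encode(x, labels, seed=0):
    """Shuffle sequences and labels together, then one-hot encode the labels."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(labels))  # same permutation for x and y
    x, labels = x[order], labels[order]
    # classes is sorted; codes[i] is the index of labels[i] in classes
    classes, codes = np.unique(labels, return_inverse=True)
    y = np.eye(len(classes))[codes]       # one row per sample, one column per class
    return x, y, classes

# toy data: four windows of shape (90, 3) and their activity levels
x = np.zeros((4, 90, 3))
labels = ['Lying', 'Sitting', 'Lying', 'Walking']
x_shuffled, y, classes = shuffle_and_encode(x, labels)
print(y.shape)        # (4, 3)
print(list(classes))  # ['Lying', 'Sitting', 'Walking']
```

Column `j` of `y` corresponds to `classes[j]`, so the mapping back to activity names is preserved.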

In [89]:
pd.period_range(1,10)

ValueError: Given date string not likely a datetime.

In [90]:
pd.interval_range(start=0, periods=4, freq=1)

IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4]],
              closed='right',
              dtype='interval[int64]')

In [93]:
list(range(1,15,1)) * 3

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]