# This notebook prepares the data for NN models.
#### Clean the start and the end of each activity session
#### Select important columns and one hot encoding
#### Break down to shorter time periods
#### Encode output and shuffle the data


Data frame vocabulary:
- *df* = the raw csv dataframe
- *df_clean* = final cleaned and processed dataframe
- *df_level* = df for each activity level
- *df_level_clean* = clean version of df_level
- *df_temp* = a helper dataframe to store temporary data for each participant and each level
    

In [1]:
import pandas as pd

In [2]:
input_dir =  'Z:/Research/dfuller/Walkabilly/studies/smarphone_accel/data/Ethica_Jaeger_Merged/pocket/'
input_file_name = 'pocket_with_couns_and_vec_meg_30Hz.csv'


In [3]:
df = pd.read_csv(input_dir + input_file_name)

Break it down into several CSV files, one per activity level: Lying, Sitting, Walking, Running3, Running5, Running7.
Keep gender, weight and height.
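
The per-level split could be sketched as below. This is a minimal sketch, not the code used in this notebook: the function name `split_by_activity`, the file naming scheme and the toy data are assumptions made for illustration, and only the column name `trimmed_activity` comes from the data.

```python
import os
import tempfile

import pandas as pd


def split_by_activity(df, output_dir, activity_col="trimmed_activity"):
    """Write one CSV per activity level and return {level: path}.

    Hypothetical helper: the file names are derived from the level name.
    """
    paths = {}
    for level, df_level in df.groupby(activity_col, sort=False):
        # spaces in a level name (e.g. "Running 7 METs") become underscores
        file_name = str(level).replace(" ", "_") + ".csv"
        paths[level] = os.path.join(output_dir, file_name)
        df_level.to_csv(paths[level], index=False)
    return paths


# toy data standing in for the real dataframe
toy = pd.DataFrame({
    "x_axis": [0.1, 0.2, 0.3],
    "trimmed_activity": ["Lying", "Lying", "Running 7 METs"],
    "height": [164.0, 164.0, 159.0],
    "weight": [68.0, 68.0, 55.0],
    "gender": [0, 0, 0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    paths = split_by_activity(toy, tmp_dir)
    n_rows = {level: len(pd.read_csv(path)) for level, path in paths.items()}
```

Every column is written to each file, so gender, weight and height stay attached to the accelerometer readings.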


In [46]:
participant_list = list(df.participant_id.unique())

In [47]:
df.head(2)

Unnamed: 0,x_axis,y_axis,z_axis,participant_id,trimmed_activity,height,weight,age,gender
0,0.266111,-0.069464,0.96688,108,Lying,164.0,68.0,30,0
1,0.266401,-0.069351,0.963832,108,Lying,164.0,68.0,30,0


In [48]:
# select important columns: x, y, z, height, weight, age, gender; keep participant_id for cleaning and remove it later
important_columns = ['x_axis','y_axis','z_axis','participant_id','trimmed_activity','height','weight','age','gender']
df = df[important_columns].copy()

In [49]:
# change gender to a dummy variable (Female = 0, Male = 1)
# .loc assigns on df itself; chained indexing (df.gender[mask] = ...) raises SettingWithCopyWarning
df.loc[df['gender'] == 'Female', 'gender'] = 0
df.loc[df['gender'] == 'Male', 'gender'] = 1


In [50]:
df.head(3)

Unnamed: 0,x_axis,y_axis,z_axis,participant_id,trimmed_activity,height,weight,age,gender
0,0.266111,-0.069464,0.96688,108,Lying,164.0,68.0,30,0
1,0.266401,-0.069351,0.963832,108,Lying,164.0,68.0,30,0
2,0.262441,-0.067394,0.943335,108,Lying,164.0,68.0,30,0


In [51]:
#df_L = df[df['trimmed_activity'] == 'Lying']

##### Drop the first and last 20 seconds of each session to avoid noise

In [52]:
# n_ignore = 600 # ignores 20 sec with a frequency of 30 Hz

# for partip in participant_list:
#     df_temp = df_L[df_L['participant_id'] == partip]
#     df_temp_nrow = df_temp.shape[0]
#     df_temp = df_temp[n_ignore:df_temp_nrow-n_ignore]
#     df_sub_clean = pd.concat(df_sub_clean, df_temp)

# # the first row of sub is zeros    
# df_L = df_sub_clean[1:,]
# df_L = df_L.drop('participant_id', axis = 1)

    

In [53]:
# list(df.columns)

In [56]:
# repeat for all PE levels

# get the levels to loop through
PE_levels = df.trimmed_activity.unique()

# see the data frame vocabulary above
# create an empty df
# df_clean = df.iloc[0,].copy()
# df_clean = df_clean.replace(df_clean,0)
df_clean = pd.DataFrame(columns = important_columns)

for level in PE_levels:
    print("working on {} level".format(level))

    df_level = df[df['trimmed_activity'] == level]
    print("indexing finished")
#     df_level_clean = df.iloc[0,]
#     df_level_clean = df_level_clean.replace(df_level_clean,0)
    df_level_clean = pd.DataFrame(columns = important_columns)
    print("goinig to the particpant loop, df_level_clean shape :", df_level_clean.shape)
    print(df_level_clean)


    
    n_ignore = 600 # ignores 20 sec with a frequency of 30 Hz

    for partip in participant_list:
        df_temp = df_level[df_level['participant_id'] == partip]
        df_temp_nrow = df_temp.shape[0]
        df_temp = df_temp.iloc[n_ignore:df_temp_nrow-n_ignore,]
#         print("df_temp is created shape:", df_temp.shape)
        df_level_clean = pd.concat([df_level_clean, df_temp])
        print("working on {} participant".format(partip))


    df_clean = pd.concat([df_clean, df_level_clean])

  
df_clean = df_clean.drop('participant_id', axis = 1)
    

working on Lying level
indexing finished
goinig to the particpant loop, df_level_clean shape : (0, 9)
Empty DataFrame
Columns: [x_axis, y_axis, z_axis, participant_id, trimmed_activity, height, weight, age, gender]
Index: []
working on 108 participant
working on 111 participant
working on 112 participant
working on 113 participant
working on 114 participant
working on 115 participant
working on 116 participant
working on 117 participant
working on 118 participant
working on 119 participant
working on 120 participant
working on 121 participant
working on 122 participant
working on 124 participant
working on 125 participant
working on 126 participant
working on 123 participant
working on 127 participant
working on 129 participant
working on 130 participant
working on 131 participant
working on 132 participant
working on 128 participant
working on 133 participant
working on 136 participant
working on 137 participant
working on 138 participant
working on 159 participant
working on 142 part

working on 125 participant
working on 126 participant
working on 123 participant
working on 127 participant
working on 129 participant
working on 130 participant
working on 131 participant
working on 132 participant
working on 128 participant
working on 133 participant
working on 136 participant
working on 137 participant
working on 138 participant
working on 159 participant
working on 142 participant
working on 143 participant
working on 109 participant
working on 144 participant
working on 145 participant
working on 146 participant
working on 147 participant
working on 149 participant
working on 152 participant
working on 150 participant
working on 151 participant
working on 153 participant
working on 139 participant
working on 134 participant
working on 141 participant
working on 155 participant
working on 140 participant
working on 154 participant
working on 156 participant
working on 157 participant
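
The nested loops above trim every (level, participant) session with `iloc`. The same trimming could be sketched without explicit loops by computing each row's position inside its session. This is a sketch under assumptions, not the code that produced the output above: the function name `trim_sessions` and the toy data are hypothetical, and unlike the loop version the result keeps the original row order instead of being grouped by level.

```python
import pandas as pd


def trim_sessions(df, n_ignore, keys=("trimmed_activity", "participant_id")):
    """Drop the first and last n_ignore rows of every session (hypothetical helper)."""
    grouped = df.groupby(list(keys), sort=False)
    from_start = grouped.cumcount()               # 0, 1, 2, ... within each session
    from_end = grouped.cumcount(ascending=False)  # ..., 2, 1, 0 within each session
    keep = (from_start >= n_ignore) & (from_end >= n_ignore)
    return df[keep]


# toy data: participant 1 has 5 readings, participant 2 has 4
toy = pd.DataFrame({
    "participant_id": [1] * 5 + [2] * 4,
    "trimmed_activity": ["Lying"] * 9,
    "x_axis": range(9),
})

trimmed = trim_sessions(toy, n_ignore=1)
```

With the real data, `n_ignore=600` would correspond to 20 seconds at 30 Hz. Sessions shorter than `2 * n_ignore` rows are dropped entirely, which matches what the `iloc` slice does.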


In [149]:
window_size = 3
working_df = df_temp.iloc[0:16,].copy() # replace with the real data inside the loops

# truncate the rows to a multiple of window_size so the helper column lines up
n_row = working_df.shape[0] // window_size * window_size
working_df = working_df.iloc[0:n_row]
# helper column that repeats 0..window_size-1, used for grouping
helper_col_df = pd.DataFrame(data = list(range(0,window_size,1)) * (n_row//window_size), columns=['counter'])
print(helper_col_df.shape)
print(working_df.shape)
working_df = working_df.reset_index()
working_df = pd.concat((working_df,helper_col_df), axis=1)
print(working_df.head(7))
# here we need to groupby the counter

(15, 1)
(15, 11)
   level_0    index    x_axis    y_axis    z_axis  participant_id  \
0        0  6669936  0.423716  0.109737 -0.028063             157   
1        1  6669937  0.676451 -0.316380 -0.249883             157   
2        2  6669938  0.251704 -0.119815  0.003971             157   
3        3  6669939  0.455297 -0.550348  0.078684             157   
4        4  6669940 -0.173043  0.076750  0.257825             157   
5        5  6669941 -0.441424  0.122000  0.213023             157   
6        6  6669942 -0.479676  2.545022 -0.615723             157   

  trimmed_activity  height  weight  age gender  counter  counter  
0   Running 7 METs   159.0    55.0   29      0        0        0  
1   Running 7 METs   159.0    55.0   29      0        1        1  
2   Running 7 METs   159.0    55.0   29      0        2        2  
3   Running 7 METs   159.0    55.0   29      0        0        0  
4   Running 7 METs   159.0    55.0   29      0        1        1  
5   Running 7 METs   159.0  

In [178]:
import numpy as np

# pull each axis out as a long (variable, value) frame
x_axis_df = working_df.melt(value_vars='x_axis')
y_axis_df = working_df.melt(value_vars='y_axis')
z_axis_df = working_df.melt(value_vars='z_axis')
# convert each axis to a 1-D array and stack them into shape (3, n_row)
x_axis_np = x_axis_df.value.to_numpy()
y_axis_np = y_axis_df.value.to_numpy()
z_axis_np = z_axis_df.value.to_numpy()
total_np = np.array([x_axis_np, y_axis_np, z_axis_np])
total_np.shape

(3, 15)

In [None]:
# break into chunks of 150 readings (use a variable) by adding a helper column
# spread based on the helper column
# bind all the files of different levels
# export as the final output
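
One possible way to carry out the chunking step is to skip the helper column and reshape the axis values directly. The sketch below is an assumption about how this could be done, not the notebook's final method: the function name `make_windows` and the toy session are hypothetical. It should be applied to one session (one participant at one level) at a time so that no window straddles two sessions.

```python
import numpy as np
import pandas as pd


def make_windows(session, window_size, axes=("x_axis", "y_axis", "z_axis")):
    """Cut one session into non-overlapping windows (hypothetical helper).

    Returns an array of shape (n_windows, window_size, len(axes)).
    """
    values = session[list(axes)].to_numpy(dtype=float)
    n_windows = len(values) // window_size
    # drop the trailing readings that do not fill a whole window
    values = values[: n_windows * window_size]
    return values.reshape(n_windows, window_size, len(axes))


# toy session with 16 readings, mirroring the 16-row test slice above
toy = pd.DataFrame({
    "x_axis": np.arange(16),
    "y_axis": np.arange(16) + 100,
    "z_axis": np.arange(16) + 200,
})

windows = make_windows(toy, window_size=3)
```

Here 16 readings with `window_size=3` give 5 windows and the last reading is discarded, the same truncation as `n_row` in the earlier cell. The windows of all sessions could then be combined with `np.concatenate`.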

In [134]:
helper_col_df = pd.DataFrame(data = list(range(0,window_size,1)) * (n_row//window_size), columns=['counter'])
helper_col_df


Unnamed: 0,counter
0,0
1,1
2,2
3,0
4,1
5,2
6,0
7,1
8,2
9,0


In [None]:
# before training
# shuffle
# separate into x and y (y is the levels)
# one-hot encode y
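
The steps listed above could look roughly like the following. This is a minimal sketch assuming the inputs are already windowed: `X` is an array with one row per window and `labels` holds the activity level of each window. The function name `shuffle_and_encode`, the fixed seed and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def shuffle_and_encode(X, labels, seed=0):
    """Shuffle X and labels together and one-hot encode the labels (hypothetical helper)."""
    dummies = pd.get_dummies(pd.Series(labels))
    y = dummies.to_numpy(dtype=float)   # one column per activity level
    classes = dummies.columns.tolist()  # column order of y
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))     # one permutation keeps x and y aligned
    return X[order], y[order], classes


# toy data: the single feature stores the original row number
X = np.arange(6).reshape(6, 1)
labels = ["Lying", "Sitting", "Lying", "Walking", "Sitting", "Walking"]

X_shuf, y_shuf, classes = shuffle_and_encode(X, labels)
```

Applying one permutation to both arrays keeps every window paired with its own label, and `classes` records which column of `y_shuf` belongs to which level.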

In [89]:
pd.period_range(1,10)

ValueError: Given date string not likely a datetime.

In [90]:
pd.interval_range(start=0, periods=4, freq=1)

IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4]],
              closed='right',
              dtype='interval[int64]')

In [93]:
list(range(1,15,1)) * 3

[1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14]