## Loading the HAR data

https://archive.ics.uci.edu/ml/datasets/human+activity+recognition+using+smartphones

In [1]:
# To store data
import pandas as pd
from numpy import mean
from numpy import std
from numpy import dstack
from pandas import read_csv
# To do linear algebra
import numpy as np
from sklearn.model_selection import train_test_split
from keras.utils import to_categorical

Function that reads a file into a NumPy array, from a given start row to a given end row (both inclusive), since we only want to load the data of a single user.

In [2]:
# load a single file as a numpy array
def load_file(filepath, ini_idx, fin_idx):
    # sep=r'\s+' splits on whitespace; delim_whitespace=True is deprecated in recent pandas releases
    dataframe = read_csv(filepath, header=None, sep=r'\s+', skiprows=ini_idx, nrows=(fin_idx - ini_idx + 1))
    return dataframe.values
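
To make the row arithmetic concrete, here is a minimal sketch of the same `skiprows`/`nrows` slicing applied to a small in-memory file. The `sample` text and the index values are made up for illustration and are not part of the HAR dataset.

```python
from io import StringIO

import pandas as pd

# Hypothetical 5-row, whitespace-delimited file (not from the HAR dataset)
sample = "0 0.1\n1 0.2\n2 0.3\n3 0.4\n4 0.5\n"

ini_idx, fin_idx = 1, 3  # inclusive row range to keep

# Skip the first ini_idx rows, then read exactly the rows ini_idx..fin_idx
rows = pd.read_csv(StringIO(sample), header=None, sep=r"\s+",
                   skiprows=ini_idx, nrows=fin_idx - ini_idx + 1).values

print(rows.shape)  # (3, 2)
```

The `+ 1` in `nrows` is what makes the end index inclusive: rows 1, 2 and 3 are returned.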


Function that iterates over a list of files and returns a 3D NumPy array. NumPy's `dstack()` function stacks the 2D arrays loaded from each file (samples × time steps) into a single 3D array in which the variables are separated along the third dimension (features).

In [3]:
# load a list of files and return as a 3d numpy array
def load_group(filenames, ini_idx, last_idx, prefix=''):
    loaded = list()
    for name in filenames:
        data = load_file(prefix + name, ini_idx, last_idx)
        loaded.append(data)
    # stack group so that features are the 3rd dimension
    loaded = dstack(loaded)
    return loaded
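
As a quick check of what `dstack()` does to the shapes, the sketch below stacks three synthetic arrays. The sizes (4 samples, 128 time steps) and the constant fill values are assumptions chosen only to make the result easy to inspect.

```python
import numpy as np

# Three hypothetical signals, each shaped (samples, timesteps)
signals = [np.full((4, 128), i, dtype=float) for i in range(3)]

# Stack along a new third axis: (samples, timesteps, features)
stacked = np.dstack(signals)

print(stacked.shape)  # (4, 128, 3)

# Signal i ends up in channel i of the third axis
channel_2_is_signal_2 = bool((stacked[:, :, 2] == 2).all())
```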

We can use this function to load all of the input signal data for a given group, such as train or test.
The `load_dataset_group()` function below loads the input signal data and the output data for a single group (train or test), restricted to the rows of the given client, using the naming conventions that are consistent between the train and test directories.

In [4]:
# load a dataset group, such as train or test
def load_dataset_group(group, client, prefix=''):
    
    #tuple of init index and last index of client
    (ini_idx, last_idx) = client_index(group, client, prefix) 
    
    filepath = prefix + group + '/Inertial Signals/'
    
    # load all 9 files as a single array
    filenames = list()
    
    # total acceleration
    filenames += ['total_acc_x_'+group+'.txt', 'total_acc_y_'+group+'.txt', 'total_acc_z_'+group+'.txt']
    
    # body acceleration
    filenames += ['body_acc_x_'+group+'.txt', 'body_acc_y_'+group+'.txt', 'body_acc_z_'+group+'.txt']
    
    # body gyroscope
    filenames += ['body_gyro_x_'+group+'.txt', 'body_gyro_y_'+group+'.txt', 'body_gyro_z_'+group+'.txt']
    
    # load input data
    X = load_group(filenames, ini_idx, last_idx, filepath)
    
    # load class output
    y = load_file(prefix + group + '/y_'+group+'.txt', ini_idx, last_idx)
    
    return X, y
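
The nine file names follow a regular `<signal>_<axis>_<group>.txt` pattern, so they can also be generated instead of written out. The sketch below is an optional, equivalent way to build the same list in the same order; the variable names `signals` and `axes` are introduced here for illustration.

```python
group = 'train'  # or 'test'

# Signal types and axes, in the same order used by load_dataset_group()
signals = ['total_acc', 'body_acc', 'body_gyro']
axes = ['x', 'y', 'z']

filenames = [f'{s}_{a}_{group}.txt' for s in signals for a in axes]

print(len(filenames))  # 9
print(filenames[0], filenames[-1])
```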

Finally, we can load the data and build the train and test sets.

The `load_dataset()` function below implements this behavior: it loads the client's data from whichever group (train or test) contains it, splits it 70/30, and returns X and y (for train and test) ready for fitting and evaluating the defined models.


In [5]:
# load the dataset, returns train and test X and y elements
def load_dataset(prefix='', client=None):
    root = prefix + './UCI_HAR_Dataset/'
    try:
        # load the client's data if its subjects are in the train group
        X_client, y_client = load_dataset_group('train', client, root)
    except IndexError:
        # client_index() raises IndexError when the subjects are not in
        # the train group, so fall back to the test group
        X_client, y_client = load_dataset_group('test', client, root)
    print(X_client.shape, y_client.shape)

    # zero-offset class values
    y_client = y_client - 1

    # one hot encode y
    y_client = to_categorical(y_client)

    X_train, X_test, y_train, y_test = train_test_split(X_client, y_client, test_size = 0.3, random_state = 42)
    
    print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
    return (X_train, y_train), (X_test, y_test)
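
The label preparation above (zero-offset, then one-hot encoding) can be illustrated without Keras. The sketch below uses plain NumPy as a stand-in for `to_categorical`, on made-up labels; it assumes the six activity classes of the dataset are coded 1 to 6.

```python
import numpy as np

# Hypothetical labels in the dataset's 1..6 coding, shaped (n, 1) like y
y = np.array([[1], [6], [3]])

# Zero-offset class values: 0..5
y0 = y - 1

# One row per sample, one column per class (NumPy stand-in for to_categorical)
one_hot = np.eye(6)[y0.ravel()]

print(one_hot.shape)  # (3, 6)
```

One difference worth keeping in mind: `np.eye(6)` always produces six columns, whereas `to_categorical` without `num_classes` infers the column count from the largest label it sees.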

Function that finds the row indices of the data belonging to a particular client.

In [17]:
def client_index(group, client, prefix):
    # client must be a (first_subject, last_subject) tuple of subject IDs from the dataset
    df = pd.read_csv(prefix + group + '/subject_' + group + '.txt', header=None)
    # 5 clients were defined, each holding the data of a group of subjects from the dataset
    ini, end = client
    # rows recorded by the first and by the last subject of this client
    df = pd.concat([df.loc[df[0] == ini], df.loc[df[0] == end]])
    # first and last row index; raises IndexError if neither subject is in this group
    idx = (int(df.index[0]), int(df.index[-1]))
    print(idx)
    return idx
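
The index lookup can be tried on a small synthetic subject column. In this sketch the `subjects` values are made up, and it assumes, as the function above does, that rows are grouped by subject so that the first row of the first subject and the last row of the last subject bound a contiguous block.

```python
import pandas as pd

# Hypothetical subject column: one subject ID per row, grouped by subject
subjects = pd.Series([1, 1, 1, 3, 3, 5, 5, 5, 5])

ini, end = 3, 5  # client covering subjects 3 through 5

# First row of the first subject and last row of the last subject
first = int(subjects[subjects == ini].index[0])
last = int(subjects[subjects == end].index[-1])

print((first, last))  # (3, 8)
```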

The data is now ready to be used in a 1D CNN model.

In [20]:
df = pd.DataFrame((load_dataset(client=(1,1))[0][0][0]))
df

(0, 346)
(347, 128, 9) (347, 1)
(242, 128, 9) (242, 6) (105, 128, 9) (105, 6)


| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.426164 | -0.362485 | 0.278914 | 0.423367 | -0.147059 | 0.332361 | -0.563775 | -0.166493 | 0.419944 |
| 1 | 1.496596 | -0.591127 | 0.120137 | 0.493977 | -0.376090 | 0.174196 | 0.142212 | 0.194743 | 0.492339 |
| 2 | 1.305815 | -0.645547 | 0.012587 | 0.303323 | -0.430905 | 0.067112 | 0.562869 | 0.504950 | 0.223340 |
| 3 | 0.973824 | -0.543838 | -0.001186 | -0.028609 | -0.329574 | 0.053662 | 0.469666 | 0.271681 | -0.061946 |
| 4 | 0.691378 | -0.424250 | -0.015278 | -0.311065 | -0.210333 | 0.039756 | 0.361988 | -0.350011 | 0.022662 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 123 | 0.760949 | -0.259650 | -0.028184 | -0.243675 | -0.004049 | 0.007952 | 0.853804 | -0.999119 | -0.126709 |
| 124 | 0.614900 | -0.110923 | -0.077985 | -0.390169 | 0.144785 | -0.042434 | 0.892525 | -1.068560 | 0.159877 |
| 125 | 0.690008 | -0.096675 | -0.096612 | -0.315476 | 0.159162 | -0.061676 | 0.897483 | -1.003608 | 0.339440 |
| 126 | 0.862856 | -0.111694 | -0.117733 | -0.143001 | 0.144283 | -0.083438 | 0.756824 | -0.700187 | 0.291160 |


Here we explore the `subject_train` file, which lists the volunteers who provided the training data.

In [8]:
df_train = pd.read_csv('./UCI_HAR_Dataset/train/subject_train.txt', header=None)
df_train[0].unique()

array([ 1,  3,  5,  6,  7,  8, 11, 14, 15, 16, 17, 19, 21, 22, 23, 25, 26,
       27, 28, 29, 30])

Here we explore the `subject_test` file, which lists the volunteers who provided the test data.

In [9]:
df_test = pd.read_csv('./UCI_HAR_Dataset/test/subject_test.txt', header=None)
df_test[0].unique()

array([ 2,  4,  9, 10, 12, 13, 18, 20, 24])
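
The two outputs above show that each volunteer appears in exactly one group, which is why `load_dataset()` has to try train first and then fall back to test. A minimal sketch that checks this, using the subject IDs copied from the two outputs:

```python
# Subject IDs copied from the outputs of the two cells above
train_subjects = {1, 3, 5, 6, 7, 8, 11, 14, 15, 16, 17, 19,
                  21, 22, 23, 25, 26, 27, 28, 29, 30}
test_subjects = {2, 4, 9, 10, 12, 13, 18, 20, 24}

# No volunteer appears in both groups
overlap = train_subjects & test_subjects
print(overlap)  # set()

# Together the two groups cover subjects 1 through 30
covers_all = (train_subjects | test_subjects) == set(range(1, 31))
print(covers_all)  # True
```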