[View in Colaboratory](https://colab.research.google.com/github/monstrim/tf-accel-recog/blob/master/TF_accel_recog.ipynb)

# TF-accel-recog

### Goal
* To label a time series from 6 channels of motion-capture sensors (XYZ accelerometer + XYZ gyroscope) according to the activity performed (walking, sitting, walking upstairs, etc.). The model we're building here should then be able to learn a similar task involving skate tricks rather than walking/sitting.

### Dataset
* Description: http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
* Download link: http://archive.ics.uci.edu/ml/machine-learning-databases/00240/UCI%20HAR%20Dataset.zip

## Framework setup

In [0]:
import os
from zipfile import ZipFile
from urllib import request

import tensorflow as tf
import tensorflow.contrib.eager as tfe  # hopefully this won't be needed for long

tf.enable_eager_execution()

In [0]:
dataset_url = (
    'http://archive.ics.uci.edu/ml/'
    + 'machine-learning-databases/'
    + '00240/UCI%20HAR%20Dataset.zip'
)
filename = 'HAR_dataset.zip'

if not os.path.exists(filename):
    request.urlretrieve(dataset_url, filename)

with ZipFile(filename) as archive:
    files = [file for file in archive.namelist() 
             if not file.startswith('__MACOS')]
    archive.extractall(path='data', members=files)

## Getting acquainted with the data

Let's begin by checking the files we downloaded.

In [3]:
[os.path.join(dp, f) for dp, dn, fn in os.walk('data') for f in fn]

['data/UCI HAR Dataset/README.txt',
 'data/UCI HAR Dataset/.DS_Store',
 'data/UCI HAR Dataset/features.txt',
 'data/UCI HAR Dataset/features_info.txt',
 'data/UCI HAR Dataset/activity_labels.txt',
 'data/UCI HAR Dataset/train/subject_train.txt',
 'data/UCI HAR Dataset/train/y_train.txt',
 'data/UCI HAR Dataset/train/X_train.txt',
 'data/UCI HAR Dataset/train/Inertial Signals/body_acc_y_train.txt',
 'data/UCI HAR Dataset/train/Inertial Signals/body_gyro_z_train.txt',
 'data/UCI HAR Dataset/train/Inertial Signals/body_acc_z_train.txt',
 'data/UCI HAR Dataset/train/Inertial Signals/body_gyro_x_train.txt',
 'data/UCI HAR Dataset/train/Inertial Signals/total_acc_x_train.txt',
 'data/UCI HAR Dataset/train/Inertial Signals/total_acc_y_train.txt',
 'data/UCI HAR Dataset/train/Inertial Signals/body_gyro_y_train.txt',
 'data/UCI HAR Dataset/train/Inertial Signals/total_acc_z_train.txt',
 'data/UCI HAR Dataset/train/Inertial Signals/body_acc_x_train.txt',
 'data/UCI HAR Dataset/test/X_test.txt',


According to README.txt (which throws an error when we try to open it here, but _trust me, I checked_), the X_train files contain only the preprocessed features (_a whole 500+ columns of them!_), which aren't interesting for our needs. We want the raw data, and that lives in the Inertial Signals folder. The catch is that there is one file per channel and the data is batched/windowed, so we need to look at the files and figure out how to tease out the raw data as a time series: one line per sample, one column per channel.

In [4]:
samplefile = (
    'data/UCI HAR Dataset/train/'
    + 'Inertial Signals/body_gyro_x_train.txt'
)

with open(samplefile) as file:
    for i in range(10):
        line = file.readline().rstrip().split()
        print(i)
        print(line[0:64])
        print(line[64:])

0
['3.0191220e-002', '4.3710710e-002', '3.5687800e-002', '4.0402100e-002', '4.7096540e-002', '5.0184730e-002', '5.0544520e-002', '4.4992070e-002', '4.7686230e-002', '4.6812150e-002', '4.6488270e-002', '4.7304170e-002', '3.8029520e-002', '3.0937360e-002', '2.7908370e-002', '2.6142820e-002', '2.5280270e-002', '2.3425910e-002', '2.4704550e-002', '2.6897750e-002', '2.7873430e-002', '3.1608600e-002', '3.9103540e-002', '4.4853300e-002', '4.2579560e-002', '3.9404840e-002', '4.1656030e-002', '4.0514370e-002', '3.8484050e-002', '4.0457050e-002', '3.9734900e-002', '4.0506400e-002', '4.3393320e-002', '4.6023250e-002', '4.9989890e-002', '4.8593610e-002', '4.6582720e-002', '4.5674500e-002', '4.0787530e-002', '3.8018710e-002', '3.2026150e-002', '2.3275560e-002', '2.1962050e-002', '2.1051090e-002', '1.8896910e-002', '2.2128350e-002', '2.5359760e-002', '2.3395660e-002', '2.1170750e-002', '1.8735640e-002', '1.3434490e-002', '1.2047350e-002', '1.2554900e-002', '1.0037270e-002', '1.1512350e-002', '1.5904

OK. Each line holds a window of 128 samples, recorded at 50Hz, and consecutive windows overlap by 50%, so the first 64 samples on each line are the same as the last 64 on the previous one. To get one long series, we need to select the first 64 samples on each line, then concatenate the lines (checking the subject_train.txt file to see if we're still in the same run?). That gives a time series for a single channel, at 50Hz. We then need to combine the results from the different files to get a single 6-channel time series.

However, we must remember the original preprocessing (and the labeling) was performed on the 128-sample windows (64 if we disregard the overlap), so each line on the train and test files will correspond to 64 lines in our raw time series.
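A quick way to sanity-check that reading of the format, sketched with NumPy on synthetic data rather than the real files. The helper names `overlap_matches` and `unwindow` are made up for this example, and the code assumes consecutive rows come from one uninterrupted recording, which is not verified against subject_train.txt here.

```python
import numpy as np

WINDOW = 128  # samples per line in the Inertial Signals files
HOP = 64      # 50% overlap, so each line advances by 64 samples

def overlap_matches(windows, hop=HOP):
    """For each pair of consecutive rows, report whether the tail of the
    first row equals the head of the second (i.e. they really overlap)."""
    return np.all(np.isclose(windows[:-1, hop:], windows[1:, :hop]), axis=1)

def unwindow(windows, hop=HOP):
    """Keep the first `hop` samples of every row and lay them end to end."""
    return windows[:, :hop].reshape(-1)

# Synthetic stand-in for one channel: a known series cut into
# 128-sample windows that advance 64 samples at a time.
series = np.arange(5 * HOP, dtype=np.float64)
windows = np.stack([series[start:start + WINDOW]
                    for start in range(0, len(series) - WINDOW + 1, HOP)])

matches = overlap_matches(windows)
restored = unwindow(windows)

# One label per window becomes one label per raw sample.
window_labels = np.array([1, 1, 2, 2])
sample_labels = np.repeat(window_labels, HOP)
```

Rows where `overlap_matches` comes back `False` would be candidates for a break between recordings, which might be one way to answer the "same run?" question. Note that this approach drops the last 64 samples of the final window.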

### Dataset pipeline test

In [5]:
raw_train_dir = 'data/UCI HAR Dataset/train/Inertial Signals/'
raw_train_files = [
    os.path.join(raw_train_dir, x)
    for x in [
        'total_acc_x_train.txt',
        'total_acc_y_train.txt',
        'total_acc_z_train.txt',
        'body_gyro_x_train.txt',
        'body_gyro_y_train.txt',
        'body_gyro_z_train.txt'
    ]
]

raw_train_files

['data/UCI HAR Dataset/train/Inertial Signals/total_acc_x_train.txt',
 'data/UCI HAR Dataset/train/Inertial Signals/total_acc_y_train.txt',
 'data/UCI HAR Dataset/train/Inertial Signals/total_acc_z_train.txt',
 'data/UCI HAR Dataset/train/Inertial Signals/body_gyro_x_train.txt',
 'data/UCI HAR Dataset/train/Inertial Signals/body_gyro_y_train.txt',
 'data/UCI HAR Dataset/train/Inertial Signals/body_gyro_z_train.txt']

In [6]:
dataset = (
    tf.data.Dataset
    .from_tensor_slices(raw_train_files)
    .interleave(
        lambda x: (
            #tf.data.Dataset.from_tensors(x).repeat(10)
            tf.data.TextLineDataset(x)
        ), 6
    )
    #.map(lambda x: tf.decode_csv(x, [[0]], field_delim=' '))
    #.batch(6)
    .take(10)
)
for item in tfe.Iterator(dataset):
    print(item)

tf.Tensor(b'  1.0128170e+000  1.0228330e+000  1.0220280e+000  1.0178770e+000  1.0236800e+000  1.0169740e+000  1.0177460e+000  1.0192630e+000  1.0164170e+000  1.0207450e+000  1.0186430e+000  1.0195210e+000  1.0202600e+000  1.0180410e+000  1.0208290e+000  1.0186440e+000  1.0193980e+000  1.0203990e+000  1.0192220e+000  1.0220930e+000  1.0204330e+000  1.0205340e+000  1.0215030e+000  1.0199310e+000  1.0204800e+000  1.0189450e+000  1.0192380e+000  1.0199890e+000  1.0189170e+000  1.0197620e+000  1.0190210e+000  1.0178870e+000  1.0181360e+000  1.0195430e+000  1.0202420e+000  1.0187570e+000  1.0195340e+000  1.0198620e+000  1.0190600e+000  1.0207170e+000  1.0210550e+000  1.0201780e+000  1.0181080e+000  1.0147760e+000  1.0153740e+000  1.0184290e+000  1.0198950e+000  1.0186470e+000  1.0163870e+000  1.0170530e+000  1.0195720e+000  1.0210970e+000  1.0194880e+000  1.0172180e+000  1.0198760e+000  1.0220220e+000  1.0205740e+000  1.0215880e+000  1.0222980e+000  1.0193690e+000  1.0169800e+000  1.0167740e

Why on earth aren't these values separated by commas? Who uses fixed-width columns?? Python's string split function works just fine, but we're supposed to apply only tf functions inside a dataset map, plus split returns a list, not a Tensor. Seriously.

We've tried:
* tf.decode_raw: returns one uint8 per character. We'd have to group them in 16-long batches, convert _back_ to characters, concatenate, then parse as a float. And I don't think there's a function just for parsing text values.
* tf.string_split: it needs a different input shape? I can't figure it out, and the error messages are (surprise?) _not helpful_.
* tf.expand_dims + tf.string_split: Let's try adding the dimension it needs, then? We can't because... I honestly don't know. The only thing the error message tells me is that it's got something to do with being run in eager mode.

I have no idea what to do with this. My best guess is: abandon the TensorFlow dataset pipeline, do a preprocessing step in numpy or similar, save the preprocessed file back to disk, and work on that instead.

Damn it.
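For reference, outside of a `tf.data` pipeline the fixed-width lines are easy to parse, since every value is padded with whitespace. A minimal sketch in plain Python; the helper name `parse_line` is hypothetical, and the sample line is a shortened, partly made-up line in the same format as the files.

```python
def parse_line(line):
    """Split a whitespace-padded line of numbers into a list of floats."""
    # str.split() with no argument collapses runs of spaces and ignores
    # leading/trailing whitespace, so the column width does not matter.
    return [float(token) for token in line.split()]

# Shortened example line in the same format as the Inertial Signals files
sample_line = '  1.0128170e+000  1.0228330e+000 -2.5000000e-002\n'
values = parse_line(sample_line)
print(values)
```

This only works as a step outside the graph (or in a separate preprocessing script), which is why it doesn't solve the problem of parsing inside a dataset map.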

## Data Preprocessing

Made peace with the fact we're gonna have to do this as an actual pre-step and save a processed file, rather than do it live in the Dataset pipeline.

### Goal
To merge the 6 raw files into a single one. Each raw file has 128 samples per line, in fixed-width format. We need to take the first 64 samples of each line, turn that row into a column (transpose it), then concatenate all the lines into one long vector. That will be one column of the final file, and we need 6 of those.
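One possible shape for that step, sketched with NumPy. This is not code from the notebook: `build_series` is a hypothetical helper, the output name `merged.csv` is arbitrary, and the demo runs on two small fake channel files written to a temporary directory instead of the real dataset. It also ignores run boundaries between subjects.

```python
import os
import tempfile

import numpy as np

def build_series(channel_files, hop=64):
    """Merge per-channel window files into one (samples, channels) array."""
    columns = []
    for path in channel_files:
        # loadtxt splits on any run of whitespace by default, which copes
        # with the fixed-width layout; ndmin=2 keeps a one-line file 2-D.
        windows = np.loadtxt(path, ndmin=2)
        # First `hop` samples of each line, laid end to end.
        columns.append(windows[:, :hop].reshape(-1))
    # One column per channel; fails if the files have different line counts.
    return np.stack(columns, axis=1)

with tempfile.TemporaryDirectory() as tmp:
    # Fake data: 2 channels, 3 lines of 128 values each (assumption: the
    # real call would pass the 6 paths in raw_train_files instead).
    fake_files = []
    for channel in range(2):
        fake = np.arange(3 * 128, dtype=np.float64).reshape(3, 128) + 1000 * channel
        path = os.path.join(tmp, 'fake_channel_%d.txt' % channel)
        np.savetxt(path, fake, fmt='%16.7e')
        fake_files.append(path)

    series = build_series(fake_files)

    # Save as a regular comma-separated file and read it back.
    out_path = os.path.join(tmp, 'merged.csv')
    np.savetxt(out_path, series, delimiter=',')
    reloaded = np.loadtxt(out_path, delimiter=',')

print(series.shape)  # (192, 2)
```

With the real files, the result should have one row per raw sample and 6 columns, and a comma-separated output would be much easier to feed back into a Dataset pipeline.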

The coffee machine hasn't been refilled in 4 hours though, so this is NOT getting done right now.