## Initial Wrangle 

The goal is to consolidate all of the CSVs with motion capture data into a single xarray Dataset.

First, I will write a function that creates an xarray DataArray for each subject (there are 28) at each speed (there are 3). 

Each array has 4 dimensions:

1. Time dimension (4500 samples)
1. "Spatial" dimension (X, Y, Z)
    * Note that this dimension also holds the 3 velocity components (TX, TY, TZ), for 6 entries in total
1. Landmark type dimension (16 landmarks)
1. Right/Left dimension

so each array has shape (4500, 6, 16, 2).

Each array is named by the subject id and speed.

Later, I'll concatenate them into a single Dataset of $28 \times 3$ DataArrays and save the result in netCDF format.
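
A minimal sketch of that consolidation step, using small random placeholder arrays rather than the real recordings. The helper `fake_data_array`, the landmark names, the variable names such as `S01_T25`, the speed codes other than the two that appear in this notebook, and the output file name are all assumptions made for illustration; the real arrays come from `make_data_array`.

```python
import numpy as np
import xarray as xr

AXES = ["X", "Y", "Z", "TX", "TY", "TZ"]        # positions, then velocities
LANDMARKS = [f"LM{i:02d}" for i in range(16)]   # placeholder landmark names
SIDES = ["L", "R"]


def fake_data_array(subject, speed, n_time=5):
    """Random stand-in with the same dimensions as one subject/speed array."""
    rng = np.random.default_rng(1000 * subject + speed)
    data = rng.normal(size=(n_time, len(AXES), len(LANDMARKS), len(SIDES)))
    return xr.DataArray(
        data,
        dims=("time", "axis", "landmark", "side"),
        coords={
            "time": np.linspace(0.0, 1.0, n_time),
            "axis": AXES,
            "landmark": LANDMARKS,
            "side": SIDES,
        },
        attrs={"Subject": subject, "Speed": speed},
    )


# Two subjects and three (assumed) speed codes; the real data has 28 subjects.
arrays = {
    f"S{subject:02d}_T{speed}": fake_data_array(subject, speed)
    for subject in (1, 2)
    for speed in (25, 35, 45)
}

# One Dataset with one variable per subject/speed pair, sharing all coordinates.
ds = xr.Dataset(arrays)
print(list(ds.data_vars))

# Writing to disk needs a netCDF backend (e.g. netCDF4 or scipy) installed:
# ds.to_netcdf("markers.nc")
```

The sketch uses plain string names for the variables because netCDF variable names are strings; whether that naming suits the rest of the project is a judgment call.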


In [3]:
import numpy as np
import pandas as pd
import xarray as xr
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

A quick look at one of the CSVs:

In [24]:
# the raw files are tab-separated
df = pd.read_csv('../data/raw_data/RBDS001runT35markers.txt', sep='\t')
df.head(3)

Unnamed: 0,Time,R.ASISX,R.ASISY,R.ASISZ,L.ASISX,L.ASISY,L.ASISZ,R.PSISX,R.PSISY,R.PSISZ,...,R.MT1Z,R.MT5X,R.MT5Y,R.MT5Z,L.MT1X,L.MT1Y,L.MT1Z,L.MT5X,L.MT5Y,L.MT5Z
0,0.0,2428.56,966.182,1175.67,2420.26,992.218,963.925,2262.56,1019.2,1122.97,...,1210.08,1935.31,272.989,1104.49,2478.43,44.5323,1114.68,2424.95,47.9302,1016.78
1,0.007,2427.46,964.619,1175.02,2418.86,990.269,964.057,2260.62,1017.43,1123.75,...,1214.32,1952.63,270.517,1108.36,2454.69,44.1712,1114.18,2400.64,47.838,1015.82
2,0.013,2425.19,963.097,1175.69,2418.05,988.939,964.259,2258.95,1016.32,1124.15,...,1216.86,1970.66,267.118,1111.82,2430.86,43.9324,1113.85,2377.04,47.6751,1015.43


In [28]:
df.shape

(4500, 97)

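That is good: besides the time column there are $3 \times 2 \times 16 = 96$ marker columns (3 axes, 2 sides, 16 landmarks), so $1 + 96 = 97$ columns in total, as the paper says.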

However, it turns out the time steps are not exactly equally spaced, most likely due to rounding (if this project ever goes anywhere meaningful I should check this). This could make the signal processing awkward. I found it easiest to make them equally spaced by fitting a simple linear regression of time on the sample index, without an intercept. The fitted values give the time coordinate for the array.

I did this using data from a single subject, and the resulting time coordinate is reused for every recording.

In [23]:
x = np.arange(df.shape[0])[:, None]
y = df['Time'].values

# no intercept: the first sample is at time 0
model = LinearRegression(fit_intercept=False)
model.fit(x, y)
time_coord = model.predict(x)
print("mean absolute error of approximation", np.abs(time_coord - y).mean())

mean absolute error of approximation 0.00022223455829403023
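
The rounding explanation can be checked on synthetic data with NumPy alone. This is a sketch under an assumption: the 150 Hz sampling rate is not read from the files, it is a guess that reproduces the 0.007 and 0.013 time stamps shown earlier. With millisecond rounding the steps alternate between two values, and a least-squares line through the origin recovers an even step with a mean absolute error of about 0.0002 s.

```python
import numpy as np

n_samples = 4500
sample_rate = 150.0  # Hz; an assumption, not read from the data files

index = np.arange(n_samples, dtype=float)
# Time stamps rounded to milliseconds, as they appear in the raw files.
recorded = np.round(index / sample_rate, 3)

# Distinct step sizes between consecutive rounded time stamps.
steps = np.unique(np.round(np.diff(recorded), 3))

# Least-squares slope of a line through the origin: sum(x*y) / sum(x*x).
slope = (index @ recorded) / (index @ index)
time_coord = slope * index

mean_abs_error = np.abs(time_coord - recorded).mean()
print("distinct step sizes:", steps)
print("fitted step:", slope)
print("mean absolute error:", mean_abs_error)
```

This closed-form slope is the same quantity that `LinearRegression(fit_intercept=False)` estimates for a single feature.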


I'll also have to rename the columns, separating the side (left/right), the landmark, and the axis (x/y/z) parts of each name.
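
As a small sketch of that renaming, the slicing below splits one raw column name into its three parts. The helper name `split_marker_name` is hypothetical; it assumes every marker column has the form side letter, dot, landmark, axis letter, as in `R.ASISX`.

```python
def split_marker_name(column):
    """Split a raw column name such as 'R.ASISX' into (landmark, side, axis)."""
    side = column[0]          # 'R' or 'L'
    axis = column[-1]         # 'X', 'Y' or 'Z'
    landmark = column[2:-1]   # everything between the 'R.' prefix and the axis letter
    return landmark, side, axis


examples = ["R.ASISX", "L.MT5Z", "R.Iliac.CrestY"]
for name in examples:
    print(name, "->", split_marker_name(name))
```

Landmark names that contain a dot, such as `Iliac.Crest`, survive intact because only the first two characters and the last one are stripped.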

Recall that:
* x-axis is posterior to anterior
* y-axis is inferior to superior 
* z-axis is left to right 

I also changed coordinates from the "bottom-of-the-treadmill" coordinates to coordinates whose origin is the midpoint of the right/left ASIS, PSIS, and iliac crest markers. In other words, the origin moves in time, making the coordinates "Lagrangian".
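
The change of origin can be sketched with plain NumPy broadcasting on made-up data. The array sizes and the positions of the three pelvic landmarks along the landmark axis are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_time, n_landmarks = 4, 5
pelvis = [0, 1, 2]  # hypothetical indices of ASIS, PSIS and iliac crest

# positions[time, axis, landmark, side] in treadmill coordinates (mm)
positions = rng.normal(loc=1000.0, scale=50.0, size=(n_time, 3, n_landmarks, 2))

# Mean over the three pelvic landmarks and both sides:
# one origin per time step and axis, shape (n_time, 3).
origin = positions[:, :, pelvis, :].mean(axis=(2, 3))

# Broadcast the moving origin over the landmark and side dimensions.
shifted = positions - origin[:, :, None, None]

# In the new coordinates the six pelvic markers average to zero at every time step.
residual = shifted[:, :, pelvis, :].mean(axis=(2, 3))
print(np.abs(residual).max())
```

Only positions are shifted here; velocities computed before the shift would still be relative to the treadmill frame.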

TODO: Add axis for pelvic floor

In [43]:
def make_data_array(subject, speed):
    """
    Store the positions and velocities of each landmark of a subject at a fixed speed
    as an xarray DataArray with dimensions (time, axis, landmark, side).
    """
    # encode the speed as it appears in the file names, e.g. 2.5 -> '25'
    speed = str(speed)[0] + str(speed)[-1]
    # subject ids are zero-padded to three digits, e.g. 2 -> '002'
    filename = f'RBDS{subject:03d}runT{speed}markers.txt'
    # load dataframe (tab-separated)
    df = pd.read_csv('../data/raw_data/' + filename, sep='\t')
    # replace the time column with the evenly spaced time_coord fitted above
    # (a global, so every recording must have the same number of rows)
    df['Time'] = time_coord
    # set index
    df.set_index('Time', inplace = True)
    # change names 
    new_columns = [(x[2:-1], x[0], x[-1]) for x in df.columns]
    df.columns = pd.MultiIndex.from_tuples(new_columns)
    df.columns.names = ['landmark', 'side', 'axis']
    df.index.names = ['time']
    # convert to xarray
    xdf = xr.DataArray(df)
    # unstack multi-index
    xdf = xdf.unstack('dim_1')
    # compute the lift to the tangent bundle
    der = xdf.differentiate('time')
    der = der.assign_coords({'axis': ['TX', 'TY', 'TZ']})
    # combine the two 
    txdf = xr.concat([xdf, der], dim = 'axis')
    # reorder
    txdf = txdf.transpose('time', 'axis', 'landmark', 'side')
    # shift the origin to the midpoint of the pelvic markers (ASIS, PSIS, iliac crest)
    translation = (txdf.sel(landmark = 'ASIS', axis = ['X', 'Y', 'Z']).sum('side') 
                   + txdf.sel(landmark = 'PSIS', axis = ['X', 'Y', 'Z']).sum('side')
                   + txdf.sel(landmark = 'Iliac.Crest', axis = ['X', 'Y', 'Z']).sum('side'))/6
    txdf.loc[:, ['X', 'Y', 'Z'], :, :] -= translation.values[:, :, None, None]
    # add meta data
    txdf.attrs['Speed'] = speed
    txdf.attrs['Subject'] = subject
    txdf.attrs['time units'] = 'sec'
    txdf.attrs['distance units'] = 'mm'
    txdf.attrs['velocity/tangent bundle units'] = 'mm/s'
    txdf.attrs['origin of landmark positions'] = 'mid-point of the right/left ASIS/PSIS/iliac crest'
    # give the array the obvious name:
    txdf = txdf.rename((subject, speed))
    return txdf
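
The velocity step inside `make_data_array` can be illustrated on its own with NumPy. This is a sketch on a made-up trajectory, assuming that xarray's `differentiate` behaves like `np.gradient` (central differences in the interior, one-sided differences at the ends), which is my understanding of it.

```python
import numpy as np

time = np.linspace(0.0, 1.0, 101)   # evenly spaced time stamps in seconds

# A made-up trajectory with shape (time, axis): X, Y, Z positions in mm.
position = np.stack(
    [
        100.0 * time,                      # constant velocity of 100 mm/s
        50.0 * np.sin(2 * np.pi * time),   # oscillation
        np.full_like(time, 7.0),           # no motion
    ],
    axis=1,
)

# Finite-difference derivative along the time dimension.
velocity = np.gradient(position, time, axis=0)

# Stack positions and velocities along the axis dimension: X, Y, Z, TX, TY, TZ.
state = np.concatenate([position, velocity], axis=1)
print(state.shape)
```

Finite differences amplify noise in the marker positions, so smoothing the positions first may be worth considering before trusting the velocity components.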

Here's an example output:

In [44]:
make_data_array(2, 2.5)