This notebook contains the same code as Create_GAF_images.ipynb, but keeps only the final implementation. In addition, the npy files are saved to a local folder rather than to OneDrive.

In [1]:
import numpy as np
import pandas as pd 
import os

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from pyts.image import GramianAngularField

## 1 Import accelerometry data and subjects' sequence number (ID)

First, we load the following data:
- **matrix_3d.npy**: 3D matrix of accelerometry data. See 6_generate_time_series_matrix.ipynb for details.
- **matrix_SEQN.csv**: Sequence number (SEQN) of subjects in the same order as the rows of the 3D matrix. See 6_generate_time_series_matrix.ipynb for details.
- **train_IDs.csv**: Sequence number of subjects in the training set.
- **test_IDs.csv**: Sequence number of subjects in the test set.

In [2]:
physical_activity = np.load('../matrix_3d.npy')
subjects_seqn = pd.read_csv('../matrix_SEQN.csv')
train_ids = pd.read_csv('../train_IDs.csv', names=['SEQN'])
test_ids = pd.read_csv('../test_IDs.csv', names=['SEQN'])

In [3]:
physical_activity.shape # nof_subjects, 7 days, 1440 minutes

(7537, 7, 1440)

In [4]:
nof_subjects = physical_activity.shape[0] # 7537 subjects
nof_days = physical_activity.shape[1] # 7 days
nof_minutes_per_day = physical_activity.shape[2] # 1440 minutes

In [5]:
subjects_seqn.shape # nof_subjects

(7537, 1)

The SEQN values identify each subject. To locate a subject in the 3D matrix, we map each SEQN value to its row index in the matrix using the following dictionary:

In [6]:
seqn_to_index = {seqn: idx for idx, seqn in enumerate(subjects_seqn['SEQN'].values)} # A dictionary to map SEQN to index

Inspecting the dictionary:

In [7]:
seqn_to_index

{73557.0: 0,
 73558.0: 1,
 73559.0: 2,
 73560.0: 3,
 73561.0: 4,
 73562.0: 5,
 73564.0: 6,
 73566.0: 7,
 73567.0: 8,
 73568.0: 9,
 73570.0: 10,
 73571.0: 11,
 73572.0: 12,
 73573.0: 13,
 73574.0: 14,
 73576.0: 15,
 73579.0: 16,
 73580.0: 17,
 73581.0: 18,
 73583.0: 19,
 73585.0: 20,
 73586.0: 21,
 73587.0: 22,
 73588.0: 23,
 73589.0: 24,
 73591.0: 25,
 73592.0: 26,
 73594.0: 27,
 73595.0: 28,
 73596.0: 29,
 73597.0: 30,
 73598.0: 31,
 73599.0: 32,
 73600.0: 33,
 73602.0: 34,
 73603.0: 35,
 73604.0: 36,
 73605.0: 37,
 73606.0: 38,
 73607.0: 39,
 73608.0: 40,
 73609.0: 41,
 73610.0: 42,
 73612.0: 43,
 73613.0: 44,
 73614.0: 45,
 73615.0: 46,
 73616.0: 47,
 73617.0: 48,
 73618.0: 49,
 73619.0: 50,
 73620.0: 51,
 73621.0: 52,
 73622.0: 53,
 73623.0: 54,
 73624.0: 55,
 73626.0: 56,
 73628.0: 57,
 73629.0: 58,
 73630.0: 59,
 73631.0: 60,
 73632.0: 61,
 73633.0: 62,
 73635.0: 63,
 73638.0: 64,
 73639.0: 65,
 73640.0: 66,
 73642.0: 67,
 73643.0: 68,
 73644.0: 69,
 73645.0: 70,
 73646.0: 71,
 ...

train_ids and test_ids contain the SEQN of the subjects in the training and test sets, respectively.

In [8]:
train_ids['SEQN'].iloc[0]

73729.0

seqn_to_index is a dictionary that maps the SEQN to the index of the subject in the data matrix.

In [9]:
seqn_to_index[train_ids['SEQN'].iloc[0]]

136

In [10]:
seqn_to_index[73729]

136
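
The integer lookup above works even though the dictionary keys are floats: in Python, `73729 == 73729.0` and both values hash identically, so either form finds the same entry. A minimal illustration with a toy dictionary (`toy_seqn_to_index` is a made-up name, with a few entries copied from the outputs above):

```python
# Toy mapping with float keys, mirroring the structure of seqn_to_index
toy_seqn_to_index = {73557.0: 0, 73558.0: 1, 73729.0: 136}

# Equal numbers hash to the same value, so int and float keys are interchangeable
same_hash = hash(73729) == hash(73729.0)

index_from_int = toy_seqn_to_index[73729]
index_from_float = toy_seqn_to_index[73729.0]
print(same_hash, index_from_int, index_from_float)
```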

subjects_seqn is a DataFrame whose SEQN column lists the subjects in the same order as the rows of the data matrix.

In [11]:
subjects_seqn['SEQN'].iloc[136]

73729.0

Would it be better to have physical_activity and subjects_seqn in a single data structure such as a dictionary or a class?
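
One possible answer, as a rough sketch only: a small class that keeps the matrix and the SEQN values together and builds the lookup dictionary once. `ActivityDataset` and its `week` method are hypothetical names, not part of this notebook, and the toy data below stands in for the real arrays.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(eq=False)  # eq=False avoids ambiguous element-wise array comparisons
class ActivityDataset:
    """Hypothetical container pairing the 3D matrix with its SEQN values."""
    data: np.ndarray  # shape (nof_subjects, nof_days, nof_minutes)
    seqn: np.ndarray  # shape (nof_subjects,), same order as the rows of data

    def __post_init__(self):
        if self.data.shape[0] != len(self.seqn):
            raise ValueError("data and seqn must describe the same number of subjects.")
        # Build the SEQN -> row index lookup once
        self._index = {int(s): i for i, s in enumerate(self.seqn)}

    def week(self, seqn):
        """Return the (nof_days, nof_minutes) block for one subject."""
        return self.data[self._index[int(seqn)]]


# Toy data: 2 subjects, 7 days, 4 samples per day
toy_dataset = ActivityDataset(
    data=np.arange(2 * 7 * 4).reshape(2, 7, 4),
    seqn=np.array([73557.0, 73558.0]),
)
print(toy_dataset.week(73558).shape)
```

With such a container, functions would receive a single object instead of relying on a global dictionary, at the cost of one more abstraction to maintain.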

## 2 Functions

In [12]:
def create_gaf_image(one_day_acc):
    """
    Creates a 2D GAF image from the accelerometry data of a single day, for a single subject (input in a 2d numpy array).
    """
    gaf_model = GramianAngularField(method='difference')
    gaf = gaf_model.fit_transform(one_day_acc)
    if gaf.ndim == 3:  # If the output is 3D, reduce it to 2D
        return gaf.squeeze(0)  # Assume the shape is (1, height, width) and we need (height, width)
    return gaf
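
As background for `method='difference'`: the Gramian Angular Difference Field rescales the series to [-1, 1], maps each value to an angle φ = arccos(x), and fills entry (i, j) with sin(φ_i − φ_j). The NumPy-only sketch below illustrates that definition. `gadf_sketch` is a hypothetical helper, not part of pyts; it assumes a non-constant series and is meant to show the idea rather than to reproduce the pyts output exactly.

```python
import numpy as np


def gadf_sketch(series):
    """Illustrative Gramian Angular Difference Field for a 1D series."""
    x = np.asarray(series, dtype=float)
    x_min, x_max = x.min(), x.max()
    if x_max == x_min:
        # Assumption of this sketch: a constant series cannot be rescaled
        raise ValueError("The series must not be constant.")

    # Rescale to [-1, 1] so that arccos is defined for every value
    x_scaled = 2 * (x - x_min) / (x_max - x_min) - 1
    x_scaled = np.clip(x_scaled, -1.0, 1.0)  # guard against rounding errors

    # Polar encoding: each value becomes an angle
    phi = np.arccos(x_scaled)

    # Entry (i, j) is sin(phi_i - phi_j)
    return np.sin(phi[:, None] - phi[None, :])


toy_gadf = gadf_sketch([0.0, 1.0, 3.0, 2.0])
print(toy_gadf.shape)
```

Two properties follow directly from the definition: the diagonal is zero and the image is antisymmetric (entry (i, j) is the negative of entry (j, i)).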

In [13]:
def plot_gaf_image(gaf_image):
    """
    Plots a 2D GAF image.
    """
    fig_w, fig_h = 4, 4  # figure width and height
    plt.figure(figsize=(fig_w, fig_h))
    plt.imshow(gaf_image, cmap='rainbow', origin='lower')
    plt.axis('off')
    plt.show()

In [14]:
def average_over_periods(data, period_length):
    """
    Averages data over specified period lengths.

    Parameters:
    data (numpy array): The input data array to be averaged.
    period_length (int): The length of each period over which to average the data.

    Returns:
    numpy array: The array containing the averaged data.
    """
    # Ensure the data can be evenly divided into periods of the specified length
    if len(data) % period_length != 0:
        raise ValueError("The length of the data is not evenly divisible by the period length.")

    # Reshape the data into chunks and calculate the mean for each chunk
    reshaped_data = data.reshape(-1, period_length)
    averaged_data = reshaped_data.mean(axis=1)

    return averaged_data
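
A small worked example of the reshape-then-mean step used above, on a toy vector of six values averaged in periods of two (`toy_day` and `toy_averaged` are made-up names):

```python
import numpy as np

toy_day = np.array([1.0, 3.0, 2.0, 4.0, 10.0, 20.0])
period_length = 2

# Each row of the reshaped array holds one period: [1, 3], [2, 4], [10, 20]
toy_averaged = toy_day.reshape(-1, period_length).mean(axis=1)
print(toy_averaged.tolist())  # [2.0, 3.0, 15.0]
```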

In [15]:
def resample_acceleration_data(matrix_3d, period_length):
    """
    Downsamples a (subjects, days, minutes) matrix by averaging each day
    over consecutive periods of period_length minutes.
    """
    # Get the shape of the original matrix
    nof_subjects, nof_days, nof_minutes = matrix_3d.shape

    # Ensure the data can be evenly divided into periods of the specified length
    if nof_minutes % period_length != 0:
        raise ValueError(f"The number of minutes per day ({nof_minutes}) is not evenly divisible by period_length ({period_length}).")

    # Calculate the new number of minutes after resampling
    new_nof_minutes = nof_minutes // period_length

    # Initialize a new 3D matrix to store the resampled data
    matrix_3d_resampled = np.zeros((nof_subjects, nof_days, new_nof_minutes))

    # Iterate over each subject and each day
    for subject_idx in range(nof_subjects):
        for day_idx in range(nof_days):
            # Get the acceleration vector for the current subject and day
            acceleration_vector = matrix_3d[subject_idx, day_idx]

            # Apply the average_over_periods function to the acceleration vector
            resampled_vector = average_over_periods(acceleration_vector, period_length)

            # Store the resampled vector in the new 3D matrix
            matrix_3d_resampled[subject_idx, day_idx] = resampled_vector

    return matrix_3d_resampled
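
The double loop above can also be written as a single reshape over the last axis. The sketch below is an alternative under the same divisibility requirement, not the implementation used in this notebook; `resample_vectorized` is a hypothetical name, and the toy matrix is only there to compare against a per-day average.

```python
import numpy as np


def resample_vectorized(matrix_3d, period_length):
    """Average the last axis over consecutive periods, without Python loops."""
    nof_subjects, nof_days, nof_minutes = matrix_3d.shape
    if nof_minutes % period_length != 0:
        raise ValueError("nof_minutes must be divisible by period_length.")
    # Split the minutes axis into (periods, period_length) and average each period
    return matrix_3d.reshape(nof_subjects, nof_days, -1, period_length).mean(axis=3)


# Toy matrix: 2 subjects, 3 days, 6 minutes per day
toy_matrix = np.arange(2 * 3 * 6, dtype=float).reshape(2, 3, 6)
toy_resampled = resample_vectorized(toy_matrix, 2)

# Reference value for one subject and day, computed the loop way
reference_day = toy_matrix[1, 2].reshape(-1, 2).mean(axis=1)
print(toy_resampled.shape)
```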

## 3 Creating the tensors from the time series data

For each subject, a tensor is created and saved as an npy file. The tensor contains 7 GAF images, one for each day of the week. Each image can be made smaller than 1440 × 1440 by choosing how many minutes are averaged into each acceleration sample.
Each subject's tensor therefore has shape
- [n_samples_per_day, n_samples_per_day, days_per_week = 7]
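
To make the layout concrete, the sketch below stacks seven placeholder images along a third axis, assuming 10 minutes per sample (1440 / 10 = 144 samples per day). The zero-filled images and the `toy_` names are stand-ins for real GAF images.

```python
import numpy as np

minutes_per_day = 1440
toy_minutes_per_sample = 10
toy_samples_per_day = minutes_per_day // toy_minutes_per_sample  # 144

# Seven placeholder "images", one per day
toy_daily_images = [np.zeros((toy_samples_per_day, toy_samples_per_day)) for _ in range(7)]

# Stacking along axis 2 puts the day index last
toy_week = np.stack(toy_daily_images, axis=2)
print(toy_week.shape)  # (144, 144, 7)

# The image for a given day is recovered by indexing the last axis
toy_first_day = toy_week[:, :, 0]
```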

In [16]:
# Directory where the npy files will be saved (a local folder, not OneDrive)
directory_path = 'C:\\Users\\Horacio\\Documents\\NPY_FILES'
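
`np.save` does not create missing folders, so the target directory has to exist before any tensor is written. A minimal sketch of that precaution, using a temporary folder in place of the real path (`toy_directory` and the file name are made up for illustration):

```python
import os
import tempfile

import numpy as np

# Stand-in for directory_path; a temporary folder keeps the example self-contained
toy_directory = os.path.join(tempfile.mkdtemp(), 'NPY_FILES')

# Create the folder if needed; exist_ok=True makes repeated runs harmless
os.makedirs(toy_directory, exist_ok=True)

toy_path = os.path.join(toy_directory, 'toy_tensor.npy')
np.save(toy_path, np.zeros((4, 4, 7)))
toy_loaded = np.load(toy_path)
print(toy_loaded.shape)
```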

In [17]:
def create_subject_tensor_resampled(subject_ids, physical_activity, prefix, minutes_per_sample):
    """
    Creates, for each subject in subject_ids, a 3D tensor of GAF images with shape
    (n_samples_per_day, n_samples_per_day, days_per_week).
    Each subject's tensor is saved to its own file, named {prefix}_{seqn}_{n_samples_per_day}.npy.
    Relies on the global variables seqn_to_index and directory_path defined above.
    """

    # Resample the acceleration data to the specified number of minutes per sample
    physical_activity_resampled = resample_acceleration_data(physical_activity, minutes_per_sample)
    n_samples_per_day = physical_activity_resampled.shape[2]

    for seqn in subject_ids['SEQN']:
        subject_idx = seqn_to_index[seqn]
        weekly_gafs = []
        for day_idx in range(physical_activity_resampled.shape[1]):  # Assuming there are 7 days
            physical_activity_day = physical_activity_resampled[subject_idx, day_idx]
            gaf = create_gaf_image(physical_activity_day.reshape(1, n_samples_per_day))
            weekly_gafs.append(gaf)

        # Stack all GAF images along the third axis to form a 3D tensor for the week
        full_week_gaf = np.stack(weekly_gafs, axis=2)

        # Save the 3D tensor to a file, named using the subject's SEQN and the specified prefix
        filename = os.path.join(directory_path, f'{prefix}_{int(seqn)}_{n_samples_per_day}.npy')
        np.save(filename, full_week_gaf)

In [18]:
minutes_per_sample = 10  # 1440 / 10 = 144 samples per day
create_subject_tensor_resampled(test_ids.head(50), physical_activity, "test", minutes_per_sample)  # First 50 subjects only, for debugging
create_subject_tensor_resampled(train_ids.head(50), physical_activity, "train", minutes_per_sample)  # First 50 subjects only, for debugging

In [20]:
test_77845_144 = np.load(os.path.join(directory_path, 'test_77845_144.npy'))

In [21]:
test_77845_144.shape

(144, 144, 7)