## Sprint 1

In [22]:
import numpy as np
import matplotlib.pyplot as plt

In [9]:
DATA_FOLDER = "actions"

action_labels = ['jumping jack', 'squat', 'in_place running', 'side lunge', 'boxing', 'overhead press', 'bicep curl']

## Exploratory Data Analysis


### What is contained in the name of each file?

All file names follow the structure `action-X-group-Y-rec-Z.npy`:
- X defines the type of exercise that was performed
  - 0: Jumping jack
  - 1: Squat
  - 2: In-place running
  - 3: Side lunge
  - 4: Boxing
  - 5: Overhead press
  - 6: Bicep curl
- Y defines the group that performed the exercise
- Z defines the person within that group
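
The naming scheme above can also be parsed into integers, which makes later filtering and labelling less error-prone. The sketch below is a minimal illustration: `parse_recording_name`, `FILENAME_PATTERN`, and the example file name are hypothetical and not part of the original notebook.

```python
import re
from pathlib import Path

# Pattern for the naming scheme described above; multi-digit IDs are allowed.
FILENAME_PATTERN = re.compile(r"action-(\d+)-group-(\d+)-rec-(\d+)\.npy$")


def parse_recording_name(path):
    """Return (action, group, rec) as integers (hypothetical helper)."""
    match = FILENAME_PATTERN.search(Path(path).name)
    if match is None:
        raise ValueError(f"Unexpected file name: {path}")
    action, group, rec = (int(value) for value in match.groups())
    return action, group, rec


# Hypothetical file name, used only for illustration
print(parse_recording_name("actions/action-3-group-2-rec-1.npy"))
```

Using `search` rather than `match` tolerates a prefix before `action-`, which the glob patterns used in this notebook also allow.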

The .npy format is the standard binary file format in NumPy for persisting a single arbitrary NumPy array on disk.
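
As a quick illustration of the format, the sketch below writes a synthetic array to a temporary `.npy` file and reads it back. The array contents and the file name are made up; the shape mirrors the recording inspected later in this notebook.

```python
import os
import tempfile

import numpy as np

# Synthetic stand-in for one recording: 450 samples, 18 keypoints, 3 values each
recording = np.random.default_rng(0).random((450, 18, 3))

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name following the naming scheme above
    path = os.path.join(tmp_dir, "action-0-group-1-rec-1.npy")
    np.save(path, recording)  # stores the data together with shape and dtype
    restored = np.load(path)  # reads the array back unchanged

print(restored.shape, restored.dtype)
```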

  

### How can you filter data by action, or by group?

In [10]:
import glob

#for actions
filtered_actions = glob.glob(f'{DATA_FOLDER}/*action-[1-2]*.npy')

#for groups
filtered_groups = glob.glob(f'{DATA_FOLDER}/*group-[1-7]*.npy')

The cell above filters files using shell-style glob patterns. These look similar to regular expressions but follow simpler wildcard rules.

`*`: this is meant to match any number of characters, including none. This will match any string of characters before the word "action-" or "group-".

`[1-2]`: This part inside square brackets is a character range. It matches a single character that can be either 1 or 2.
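
The two wildcards can be tried out without touching the file system. The sketch below uses `fnmatch` from the standard library, which applies the same wildcard rules to plain strings; the file names are hypothetical.

```python
from fnmatch import fnmatch

# Hypothetical file names, for illustration only
names = [
    "action-0-group-1-rec-1.npy",
    "action-1-group-1-rec-2.npy",
    "action-2-group-3-rec-1.npy",
    "action-5-group-2-rec-1.npy",
]

pattern = "*action-[1-2]*.npy"
matches = [name for name in names if fnmatch(name, pattern)]
print(matches)

# [1-2] matches exactly one character, so a two-digit ID starting with 1 or 2
# also matches: the trailing * absorbs the second digit.
print(fnmatch("action-12-group-1-rec-1.npy", pattern))
```

The last line shows a caveat of character ranges: they work as intended only while all IDs are single digits.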

The following helper function wraps this filtering:

In [11]:
def filter_files_by_action_or_group(directory, groups_start, groups_end, action_start=0, action_end=len(action_labels) - 1):
    # Each [a-b] class matches a single character, so only single-digit IDs (0-9) are supported
    pattern = f"{directory}/*action-[{action_start}-{action_end}]*group-[{groups_start}-{groups_end}]*.npy"
    return glob.glob(pattern)

### Is the data balanced?

In [None]:
actions = range(0, len(action_labels))
data_count = []

for action in actions:
    files = filter_files_by_action_or_group(DATA_FOLDER, 0, 9, action, action)
    data_count.append(len(files))

plt.figure(figsize=(10, 6))
plt.bar(action_labels, data_count, color='blue')
plt.xlabel('Actions')
plt.ylabel('Number of files')
plt.title('Number of Files per Action')
plt.xticks(actions)
plt.show()

### What does the data look like?

In [None]:
import numpy as np

arr=np.load(filtered_actions[0]) # load 1 file
nb_samples, nb_keypoints, nb_values=arr.shape
arr.shape

The array holds 450 samples, each containing the positions of 18 keypoints. Every keypoint has three values: its X and Y position, followed by the confidence that the keypoint is actually at that position.
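
The layout described above can be unpacked by indexing the last axis. The sketch below uses a synthetic array with the same shape as a stand-in for a real recording, so the printed values carry no meaning.

```python
import numpy as np

# Synthetic stand-in with the same layout as one recording:
# (samples, keypoints, values), where values = (x, y, confidence)
rng = np.random.default_rng(0)
arr = rng.random((450, 18, 3))

x = arr[:, :, 0]           # x position of every keypoint in every sample
y = arr[:, :, 1]           # y position
confidence = arr[:, :, 2]  # detection confidence

# All three values of keypoint 0 in the first sample
print(arr[0, 0])
print(x.shape, y.shape, confidence.shape)
```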

In [None]:
arr

### What are the ranges for each feature?
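
One possible way to approach this question is to compute the minimum and maximum of each value type (x, y, confidence) across all samples and keypoints. The sketch below runs on a synthetic array, so it shows only the mechanics and says nothing about the ranges in the actual recordings.

```python
import numpy as np

# Synthetic stand-in for one recording: (samples, keypoints, values)
rng = np.random.default_rng(0)
arr = rng.random((450, 18, 3))

value_names = ["x", "y", "confidence"]
# Reduce over samples and keypoints to get one minimum/maximum per value type
minima = arr.min(axis=(0, 1))
maxima = arr.max(axis=(0, 1))

for name, low, high in zip(value_names, minima, maxima):
    print(f"{name}: min={low:.3f}, max={high:.3f}")
```

Replacing the synthetic `arr` with a loaded recording would give the real ranges.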

### Do all (values for all) features make sense? Are there any outliers?

## Load Train and Test Data Set

In [15]:
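import os

def load_data(path):
    arr = np.load(path)
    nb_samples, nb_keypoints, nb_values = arr.shape

    # Flatten each sample and reserve one extra column for the action identifier
    res = np.empty((nb_samples, nb_keypoints * nb_values + 1))
    res[:, :-1] = arr.reshape((nb_samples, nb_keypoints * nb_values))

    # Extract the action index X from the file name `action-X-group-Y-rec-Z.npy`;
    # use the base name so hyphens in the directory path do not shift the parts
    parts = os.path.basename(path).split('-')
    action_idx = int(parts[1])

    # Store the action index as the label in the last column
    res[:, -1] = action_idx

    return res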

In [21]:
train_files = filter_files_by_action_or_group(DATA_FOLDER, 0, 6)
Xy = np.vstack([load_data(c) for c in train_files])
X = Xy[:, :-1]             # flattened keypoint features
y = Xy[:, -1].astype(int)  # action label stored in the last column

## Preprocessing

### Min-Max Scaler
Min-max scaling linearly rescales each feature to a fixed range, in this case [-1, 1].

In [24]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(-1, 1))
X_scaled = scaler.fit_transform(X)
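
To make the transformation concrete, the sketch below writes out the min-max formula by hand with NumPy only. `min_max_scale` and `X_demo` are hypothetical names introduced for illustration, and the handling of constant columns is an assumption of this sketch.

```python
import numpy as np


def min_max_scale(X, feature_range=(-1.0, 1.0)):
    """Column-wise min-max scaling (hypothetical helper, NumPy only)."""
    low, high = feature_range
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Guard against constant columns to avoid dividing by zero
    col_range = np.where(col_range == 0, 1.0, col_range)
    return (X - col_min) / col_range * (high - low) + low


X_demo = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 40.0]])
print(min_max_scale(X_demo))
```

Each column is mapped so that its smallest value becomes -1 and its largest value becomes 1, independently of the other columns.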

In [None]:
variances_scaled = np.var(X_scaled, axis=0)
plt.bar(range(len(variances_scaled)), variances_scaled)
plt.title('Variance of Features Across Recordings After Scaling')
plt.show()

### Standard Scaler
Standardization (often loosely called normalization) rescales each feature to have a mean of 0 and a standard deviation of 1.

In [26]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_normalized = scaler.fit_transform(X)
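
The same transformation can be written by hand as a z-score per column. The sketch below is a NumPy-only illustration; `standardize` and `X_demo` are hypothetical names, and the treatment of constant columns is an assumption of this sketch.

```python
import numpy as np


def standardize(X):
    """Column-wise z-score scaling (hypothetical helper, NumPy only)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)  # population standard deviation (ddof=0)
    std = np.where(std == 0, 1.0, std)  # constant columns end up as all zeros
    return (X - mean) / std


X_demo = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
Z = standardize(X_demo)
print(Z.mean(axis=0), Z.std(axis=0))
```

Because every non-constant column ends up with unit variance, a variance plot of standardized features is expected to be flat.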

In [None]:
variances_normalized = np.var(X_normalized, axis=0)
plt.bar(range(len(variances_normalized)), variances_normalized)
plt.title('Variance of Features Across Recordings After Normalization')
plt.show()

### Principal Component Analysis (PCA)

Principal Component Analysis (PCA) reduces dimensionality by projecting the data onto a small number of uncorrelated directions that capture most of the variance, which removes redundancy between correlated features.

In [32]:
from sklearn.decomposition import PCA

# Reduce the number of components while retaining 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_normalized)
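
Passing a fraction as `n_components` asks for the smallest number of components whose cumulative explained variance reaches that fraction. The sketch below reproduces this selection idea with a plain SVD on synthetic data. It is an illustration of the principle, not scikit-learn's implementation, and all names in it are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 5 features driven by 2 latent factors plus a little noise
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(200, 5))

X_centered = X_demo - X_demo.mean(axis=0)
# Squared singular values are proportional to the variance of each component
_, singular_values, components = np.linalg.svd(X_centered, full_matrices=False)
explained_ratio = singular_values**2 / np.sum(singular_values**2)
cumulative_ratio = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative ratio reaches 95%
n_components = int(np.searchsorted(cumulative_ratio, 0.95) + 1)
X_projected = X_centered @ components[:n_components].T
print(n_components, X_projected.shape)
```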

In [None]:
variances_reduced = np.var(X_reduced, axis=0)
plt.bar(range(len(variances_reduced)), variances_reduced)
plt.title('Variance of Features Across Recordings After PCA')
plt.show()

In [None]:
import seaborn as sns

plt.figure()
sns.boxplot(data=X_reduced, orient="v")
plt.title('Boxplot of Features')
plt.xlabel('Principal component')
plt.ylabel('Value')
plt.show()

### Pipeline

In [None]:
from sklearn.pipeline import make_pipeline


model = make_pipeline(StandardScaler(), PCA(n_components=0.95))
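
A pipeline like this is typically fitted on the training data and then reused unchanged on held-out data, so the test set does not influence the scaling statistics or the principal components. The sketch below shows that usage on synthetic arrays; the train/test split and all variable names are assumptions, not taken from the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-ins for flattened recordings (54 = 18 keypoints * 3 values)
X_train_demo = rng.normal(size=(300, 54))
X_test_demo = rng.normal(size=(100, 54))

preprocess = make_pipeline(StandardScaler(), PCA(n_components=0.95))
# Fit the scaler and the PCA on the training data only ...
X_train_reduced = preprocess.fit_transform(X_train_demo)
# ... then apply the fitted statistics and components to held-out data
X_test_reduced = preprocess.transform(X_test_demo)

print(X_train_reduced.shape, X_test_reduced.shape)
```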