To process the data faster, we found that the best approach is to call the Kaggle API, download the dataset directly to Google Colab, and unzip it on the Colab local disk.

In [None]:
! pip install -q kaggle

In [None]:
from google.colab import files

files.upload()

In [None]:
! mkdir -p ~/.kaggle

! cp kaggle.json ~/.kaggle/

In [None]:
! chmod 600 ~/.kaggle/kaggle.json

In [None]:
! kaggle competitions download -c 'asl-signs'

Downloading asl-signs.zip to /content
100% 37.4G/37.4G [04:49<00:00, 166MB/s]
100% 37.4G/37.4G [04:49<00:00, 139MB/s]


In [None]:
! mkdir asl-signs

In [None]:
! unzip asl-signs.zip -d asl-signs

Streaming output truncated to the last 5000 lines.
  inflating: asl-signs/train_landmark_files/61333/644240510.parquet  
  inflating: asl-signs/train_landmark_files/61333/644760785.parquet  
  inflating: asl-signs/train_landmark_files/61333/64714634.parquet  
  inflating: asl-signs/train_landmark_files/61333/647782358.parquet  
  inflating: asl-signs/train_landmark_files/61333/647894613.parquet  
  inflating: asl-signs/train_landmark_files/61333/648219958.parquet  
  inflating: asl-signs/train_landmark_files/61333/648759798.parquet  
  inflating: asl-signs/train_landmark_files/61333/648810695.parquet  
  inflating: asl-signs/train_landmark_files/61333/649793223.parquet  
  inflating: asl-signs/train_landmark_files/61333/650186126.parquet  
  inflating: asl-signs/train_landmark_files/61333/650848108.parquet  
  inflating: asl-signs/train_landmark_files/61333/653707084.parquet  
  inflating: asl-signs/train_landmark_files/61333/653782862.parquet  
  inflating: asl-signs/tra

In [None]:
!pip install -q pandas pyarrow
!pip install -q mediapipe


In [None]:
!ls

asl-signs  asl-signs.zip  kaggle.json  sample_data


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Data folder directories
main_dir = '/content/asl-signs/'
googledrive_dir = '/content/drive/MyDrive/Colab Notebooks/Data/asl-signs/'

While studying the data and researching possible model solutions, one transformer approach caught our eye: the one designed by Wijkhuizen (2023) for the Kaggle competition. Our project team decided to follow Wijkhuizen’s approach and build a transformer as one of the models to test in this project. Our goal is to gain a better understanding of the transformer architecture, since Wijkhuizen builds the model from scratch rather than fine-tuning a base model.

The following code is from https://www.kaggle.com/code/markwijkhuizen/gislr-tf-data-processing-transformer-training

In [None]:
# Code From https://www.kaggle.com/code/markwijkhuizen/gislr-tf-data-processing-transformer-training
import os
import pandas as pd
import numpy as np
import tensorflow as tf
from tqdm import tqdm
from sklearn.model_selection import GroupShuffleSplit
import json
# Constants (adjust these according to your data)
INPUT_SIZE = 64
N_COLS = 100  # Adjust based on your landmark indices
N_DIMS = 3    # Typically x, y, z coordinates
N_ROWS = 543

In [None]:
# Code From https://www.kaggle.com/code/markwijkhuizen/gislr-tf-data-processing-transformer-training
# Load metadata
metadata_sub_dir = 'train.csv'
metadata_full_file_path = os.path.join(main_dir, metadata_sub_dir)
df_metadata = pd.read_csv(metadata_full_file_path)

# Load sign to index mapping
signmap_sub_dir = 'sign_to_prediction_index_map.json'
signmap_full_file_path = os.path.join(main_dir, signmap_sub_dir)
with open(signmap_full_file_path, 'r') as file:
    sign_to_index = json.load(file)
df_metadata['sign_index'] = df_metadata['sign'].map(sign_to_index)

In Wijkhuizen’s (2023) transformer approach, all the data is reformatted into a 4D tensor (Number, Frame, KeyPoint, LandMark). “Number” is the index of the data file, each of which is linked to a label y giving the meaning of the sign. “Frame” is the frame of the video recording; recordings longer than 64 frames are downsampled, and recordings shorter than 64 frames are padded. “KeyPoint” is the landmark key point from the MediaPipe tracking result; this approach keeps only 66 key points: 40 for the lips, 21 for the dominant hand, and 5 for the dominant-side pose. “LandMark” holds the 3 coordinate values [x, y, z] from MediaPipe tracking. This significantly reduces the amount of data the model has to process while keeping the features most important to ASL recognition.
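
To make the tensor layout concrete, here is a minimal NumPy sketch of the shape arithmetic and of the zero-padding case described above. It is an illustration only, not Wijkhuizen's preprocessing code: the names `N_KEYPOINTS`, `N_FRAMES_TARGET`, and `pad_frames` are hypothetical, the toy clips are random, and clips longer than 64 frames (which would need downsampling) are deliberately not handled.

```python
import numpy as np

# Hypothetical names for illustration; the counts come from the description above.
N_LIPS, N_HAND, N_POSE = 40, 21, 5
N_KEYPOINTS = N_LIPS + N_HAND + N_POSE  # 66 key points kept out of 543
N_FRAMES_TARGET = 64                    # fixed number of frames per sample
N_COORDS = 3                            # x, y, z

def pad_frames(clip, target=N_FRAMES_TARGET):
    """Zero-pad a (frames, keypoints, coords) clip at the end up to `target` frames."""
    n_missing = target - clip.shape[0]
    if n_missing < 0:
        # Assumption: longer clips are downsampled elsewhere, not in this sketch.
        raise ValueError("clip is longer than target and would need downsampling")
    return np.pad(clip, [(0, n_missing), (0, 0), (0, 0)], constant_values=0.0)

# Two toy clips with different frame counts (random values, not real landmarks)
clips = [
    np.random.rand(10, N_KEYPOINTS, N_COORDS).astype(np.float32),
    np.random.rand(64, N_KEYPOINTS, N_COORDS).astype(np.float32),
]
X = np.stack([pad_frames(c) for c in clips])
print(X.shape)  # (2, 64, 66, 3) -> (Number, Frame, KeyPoint, LandMark)
```

Stacking only works because every clip ends up with the same number of frames, which is why short recordings are padded and long ones are reduced to 64 frames before they are stored.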


In [None]:
# Code From https://www.kaggle.com/code/markwijkhuizen/gislr-tf-data-processing-transformer-training
USE_TYPES = ['left_hand', 'pose', 'right_hand']
START_IDX = 468
LIPS_IDXS0 = np.array([
        61, 185, 40, 39, 37, 0, 267, 269, 270, 409,
        291, 146, 91, 181, 84, 17, 314, 405, 321, 375,
        78, 191, 80, 81, 82, 13, 312, 311, 310, 415,
        95, 88, 178, 87, 14, 317, 402, 318, 324, 308,
    ])
# Landmark indices in original data
LEFT_HAND_IDXS0 = np.arange(468,489)
RIGHT_HAND_IDXS0 = np.arange(522,543)
LEFT_POSE_IDXS0 = np.array([502, 504, 506, 508, 510])
RIGHT_POSE_IDXS0 = np.array([503, 505, 507, 509, 511])
LANDMARK_IDXS_LEFT_DOMINANT0 = np.concatenate((LIPS_IDXS0, LEFT_HAND_IDXS0, LEFT_POSE_IDXS0))
LANDMARK_IDXS_RIGHT_DOMINANT0 = np.concatenate((LIPS_IDXS0, RIGHT_HAND_IDXS0, RIGHT_POSE_IDXS0))
HAND_IDXS0 = np.concatenate((LEFT_HAND_IDXS0, RIGHT_HAND_IDXS0), axis=0)
N_COLS = LANDMARK_IDXS_LEFT_DOMINANT0.size
# Landmark indices in processed data
LIPS_IDXS = np.argwhere(np.isin(LANDMARK_IDXS_LEFT_DOMINANT0, LIPS_IDXS0)).squeeze()
LEFT_HAND_IDXS = np.argwhere(np.isin(LANDMARK_IDXS_LEFT_DOMINANT0, LEFT_HAND_IDXS0)).squeeze()
RIGHT_HAND_IDXS = np.argwhere(np.isin(LANDMARK_IDXS_LEFT_DOMINANT0, RIGHT_HAND_IDXS0)).squeeze()
HAND_IDXS = np.argwhere(np.isin(LANDMARK_IDXS_LEFT_DOMINANT0, HAND_IDXS0)).squeeze()
POSE_IDXS = np.argwhere(np.isin(LANDMARK_IDXS_LEFT_DOMINANT0, LEFT_POSE_IDXS0)).squeeze()

print(f'# HAND_IDXS: {len(HAND_IDXS)}, N_COLS: {N_COLS}')

# HAND_IDXS: 21, N_COLS: 66


In [None]:
# Code From https://www.kaggle.com/code/markwijkhuizen/gislr-tf-data-processing-transformer-training
LIPS_START = 0
LEFT_HAND_START = LIPS_IDXS.size
RIGHT_HAND_START = LEFT_HAND_START + LEFT_HAND_IDXS.size
POSE_START = RIGHT_HAND_START + RIGHT_HAND_IDXS.size

print(f'LIPS_START: {LIPS_START}, LEFT_HAND_START: {LEFT_HAND_START}, RIGHT_HAND_START: {RIGHT_HAND_START}, POSE_START: {POSE_START}')

LIPS_START: 0, LEFT_HAND_START: 40, RIGHT_HAND_START: 61, POSE_START: 61


In [None]:
# Code From https://www.kaggle.com/code/markwijkhuizen/gislr-tf-data-processing-transformer-training
ROWS_PER_FRAME = 543  # number of landmarks per frame
def load_and_preprocess_data(file_path, preprocess_layer):
    # Load data
    data_columns = ['x', 'y', 'z']
    data = pd.read_parquet(file_path, columns=data_columns)
    n_frames = int(len(data) / ROWS_PER_FRAME)
    data = data.values.reshape(n_frames, ROWS_PER_FRAME, len(data_columns))
    # Apply preprocessing using the PreprocessLayer
    processed_data = preprocess_layer(data.astype(np.float32))

    return processed_data

In [None]:
# Code From https://www.kaggle.com/code/markwijkhuizen/gislr-tf-data-processing-transformer-training
# If True, processing data from scratch
# If False, loads preprocessed data
PREPROCESS_DATA = False
TRAIN_MODEL = True
# True: use 10% of participants as validation set
# False: use all data for training -> gives better LB result
USE_VAL = False

N_ROWS = 543
N_DIMS = 3
DIM_NAMES = ['x', 'y', 'z']
SEED = 42
NUM_CLASSES = 250
IS_INTERACTIVE = True
VERBOSE = 1 if IS_INTERACTIVE else 2

INPUT_SIZE = 64

BATCH_ALL_SIGNS_N = 4
BATCH_SIZE = 256
N_EPOCHS = 100
LR_MAX = 1e-3
N_WARMUP_EPOCHS = 0
WD_RATIO = 0.05
MASK_VAL = 4237

In [None]:
# Code From https://www.kaggle.com/code/markwijkhuizen/gislr-tf-data-processing-transformer-training
class PreprocessLayer(tf.keras.layers.Layer):
    def __init__(self):
        super(PreprocessLayer, self).__init__()
        normalisation_correction = tf.constant([
                    # Add 0.50 to left hand (original right hand) and subtract 0.50 from right hand (original left hand)
                    [0] * len(LIPS_IDXS) + [0.50] * len(LEFT_HAND_IDXS) + [0.50] * len(POSE_IDXS),
                    # Y coordinates stay intact
                    [0] * len(LANDMARK_IDXS_LEFT_DOMINANT0),
                    # Z coordinates stay intact
                    [0] * len(LANDMARK_IDXS_LEFT_DOMINANT0),
                ],
                dtype=tf.float32,
            )
        self.normalisation_correction = tf.transpose(normalisation_correction, [1,0])

    def pad_edge(self, t, repeats, side):
        if side == 'LEFT':
            return tf.concat((tf.repeat(t[:1], repeats=repeats, axis=0), t), axis=0)
        elif side == 'RIGHT':
            return tf.concat((t, tf.repeat(t[-1:], repeats=repeats, axis=0)), axis=0)

    @tf.function(
        input_signature=(tf.TensorSpec(shape=[None,N_ROWS,N_DIMS], dtype=tf.float32),),
    )
    def call(self, data0):
        # Number of Frames in Video
        N_FRAMES0 = tf.shape(data0)[0]

        # Find dominant hand by counting the non-NaN coordinate values of each hand
        left_hand_sum = tf.math.reduce_sum(tf.where(tf.math.is_nan(tf.gather(data0, LEFT_HAND_IDXS0, axis=1)), 0, 1))
        right_hand_sum = tf.math.reduce_sum(tf.where(tf.math.is_nan(tf.gather(data0, RIGHT_HAND_IDXS0, axis=1)), 0, 1))
        left_dominant = left_hand_sum >= right_hand_sum

        # Count non NaN Hand values in each frame for the dominant hand
        if left_dominant:
            frames_hands_non_nan_sum = tf.math.reduce_sum(
                    tf.where(tf.math.is_nan(tf.gather(data0, LEFT_HAND_IDXS0, axis=1)), 0, 1),
                    axis=[1, 2],
                )
        else:
            frames_hands_non_nan_sum = tf.math.reduce_sum(
                    tf.where(tf.math.is_nan(tf.gather(data0, RIGHT_HAND_IDXS0, axis=1)), 0, 1),
                    axis=[1, 2],
                )

        # Find frames indices with coordinates of dominant hand
        non_empty_frames_idxs = tf.where(frames_hands_non_nan_sum > 0)
        non_empty_frames_idxs = tf.squeeze(non_empty_frames_idxs, axis=1)
        # Filter frames
        data = tf.gather(data0, non_empty_frames_idxs, axis=0)

        # Cast Indices in float32 to be compatible with Tensorflow Lite
        non_empty_frames_idxs = tf.cast(non_empty_frames_idxs, tf.float32)
        # Normalize to start with 0
        non_empty_frames_idxs -= tf.reduce_min(non_empty_frames_idxs)

        # Number of Frames in Filtered Video
        N_FRAMES = tf.shape(data)[0]

        # Gather Relevant Landmark Columns
        if left_dominant:
            data = tf.gather(data, LANDMARK_IDXS_LEFT_DOMINANT0, axis=1)
        else:
            data = tf.gather(data, LANDMARK_IDXS_RIGHT_DOMINANT0, axis=1)
            data = (
                    self.normalisation_correction + (
                        (data - self.normalisation_correction) * tf.where(self.normalisation_correction != 0, -1.0, 1.0))
                )

        # Video fits in INPUT_SIZE
        if N_FRAMES < INPUT_SIZE:
            # Pad With -1 to indicate padding
            non_empty_frames_idxs = tf.pad(non_empty_frames_idxs, [[0, INPUT_SIZE-N_FRAMES]], constant_values=-1)
            # Pad Data With Zeros
            data = tf.pad(data, [[0, INPUT_SIZE-N_FRAMES], [0,0], [0,0]], constant_values=0)
            # Fill NaN Values With 0
            data = tf.where(tf.math.is_nan(data), 0.0, data)
            return data, non_empty_frames_idxs
        # Video needs to be downsampled to INPUT_SIZE
        else:
            # Repeat
            if N_FRAMES < INPUT_SIZE**2:
                repeats = tf.math.floordiv(INPUT_SIZE * INPUT_SIZE, N_FRAMES0)
                data = tf.repeat(data, repeats=repeats, axis=0)
                non_empty_frames_idxs = tf.repeat(non_empty_frames_idxs, repeats=repeats, axis=0)

            # Pad To Multiple Of Input Size
            pool_size = tf.math.floordiv(len(data), INPUT_SIZE)
            if tf.math.mod(len(data), INPUT_SIZE) > 0:
                pool_size += 1

            if pool_size == 1:
                pad_size = (pool_size * INPUT_SIZE) - len(data)
            else:
                pad_size = (pool_size * INPUT_SIZE) % len(data)

            # Pad Start/End with Start/End value
            pad_left = tf.math.floordiv(pad_size, 2) + tf.math.floordiv(INPUT_SIZE, 2)
            pad_right = tf.math.floordiv(pad_size, 2) + tf.math.floordiv(INPUT_SIZE, 2)
            if tf.math.mod(pad_size, 2) > 0:
                pad_right += 1

            # Pad By Concatenating Left/Right Edge Values
            data = self.pad_edge(data, pad_left, 'LEFT')
            data = self.pad_edge(data, pad_right, 'RIGHT')

            # Pad Non Empty Frame Indices
            non_empty_frames_idxs = self.pad_edge(non_empty_frames_idxs, pad_left, 'LEFT')
            non_empty_frames_idxs = self.pad_edge(non_empty_frames_idxs, pad_right, 'RIGHT')

            # Reshape to Mean Pool
            data = tf.reshape(data, [INPUT_SIZE, -1, N_COLS, N_DIMS])
            non_empty_frames_idxs = tf.reshape(non_empty_frames_idxs, [INPUT_SIZE, -1])

            # Mean Pool
            data = tf.experimental.numpy.nanmean(data, axis=1)
            non_empty_frames_idxs = tf.experimental.numpy.nanmean(non_empty_frames_idxs, axis=1)

            # Fill NaN Values With 0
            data = tf.where(tf.math.is_nan(data), 0.0, data)

            return data, non_empty_frames_idxs

preprocess_layer = PreprocessLayer()

In [None]:
# Code From https://www.kaggle.com/code/markwijkhuizen/gislr-tf-data-processing-transformer-training
# Function to preprocess the entire dataset
def preprocess_data(df, main_dir, preprocess_layer):
    X = np.zeros([len(df), INPUT_SIZE, N_COLS, N_DIMS], dtype=np.float32)
    y = np.zeros([len(df)], dtype=np.int32)
    NON_EMPTY_FRAME_IDXS = np.full([len(df), INPUT_SIZE], -1, dtype=np.float32)

    for row_idx, row in tqdm(df.iterrows(), total=len(df)):
        file_path = os.path.join(main_dir, row['path'])
        processed_data, non_empty_frame_idxs = load_and_preprocess_data(file_path, preprocess_layer)
        X[row_idx] = processed_data
        y[row_idx] = row['sign_index']
        NON_EMPTY_FRAME_IDXS[row_idx] = non_empty_frame_idxs
    # Save processed data (create the output folder first if it does not exist)
    os.makedirs(googledrive_dir, exist_ok=True)
    np.save(os.path.join(googledrive_dir, 'X.npy'), X)
    np.save(os.path.join(googledrive_dir, 'y.npy'), y)
    np.save(os.path.join(googledrive_dir, 'NON_EMPTY_FRAME_IDXS.npy'), NON_EMPTY_FRAME_IDXS)

    # Optional: Train-Validation Split
    splitter = GroupShuffleSplit(test_size=0.10, n_splits=2, random_state=42)
    train_idxs, val_idxs = next(splitter.split(X, y, groups=df['participant_id']))

    # Save Train and Validation Sets
    X_train = X[train_idxs]
    NON_EMPTY_FRAME_IDXS_TRAIN = NON_EMPTY_FRAME_IDXS[train_idxs]
    y_train = y[train_idxs]
    np.save(os.path.join(googledrive_dir, 'X_train.npy'), X[train_idxs])
    np.save(os.path.join(googledrive_dir, 'y_train.npy'), y[train_idxs])
    np.save(os.path.join(googledrive_dir, 'NON_EMPTY_FRAME_IDXS_TRAIN.npy'), NON_EMPTY_FRAME_IDXS_TRAIN)
    X_val = X[val_idxs]
    NON_EMPTY_FRAME_IDXS_VAL = NON_EMPTY_FRAME_IDXS[val_idxs]
    y_val = y[val_idxs]
    np.save(os.path.join(googledrive_dir, 'X_val.npy'), X[val_idxs])
    np.save(os.path.join(googledrive_dir, 'y_val.npy'), y[val_idxs])
    np.save(os.path.join(googledrive_dir, 'NON_EMPTY_FRAME_IDXS_VAL.npy'), NON_EMPTY_FRAME_IDXS_VAL)

# Run the preprocessing
preprocess_data(df_metadata, main_dir, preprocess_layer)

100%|██████████| 94477/94477 [21:41<00:00, 72.61it/s]
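
The train/validation split in `preprocess_data` groups samples by `participant_id`, so that no signer appears in both sets. The sketch below checks that property on toy data. It assumes scikit-learn is installed; the participant ids and the arrays `X_toy` and `y_toy` are made up for illustration and are not the competition data.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 20 samples from 10 hypothetical participants (2 samples each)
groups = np.repeat(np.arange(10), 2)
X_toy = np.zeros((len(groups), 1), dtype=np.float32)
y_toy = np.zeros(len(groups), dtype=np.int32)

# Same test_size and random_state as in preprocess_data; one split is enough here
splitter = GroupShuffleSplit(test_size=0.10, n_splits=1, random_state=42)
train_idxs, val_idxs = next(splitter.split(X_toy, y_toy, groups=groups))

# Participants on each side of the split
train_groups = set(groups[train_idxs])
val_groups = set(groups[val_idxs])
print(len(train_groups & val_groups))  # 0: no participant is in both sets
```

Splitting by participant rather than by sample gives a validation score that reflects how the model handles signers it has never seen.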


Reference:

Wijkhuizen, M. (2023, April 04). GISLR TF Data Processing & Transformer Training. Kaggle. https://www.kaggle.com/code/markwijkhuizen/gislr-tf-data-processing-transformer-training